The patch below does not apply to the 5.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 662d66466637862ef955f7f6e78a286d8cf0ebef Mon Sep 17 00:00:00 2001
From: Kaike Wan <kaike.wan(a)intel.com>
Date: Mon, 18 Mar 2019 09:55:19 -0700
Subject: [PATCH] IB/hfi1: Failed to drain send queue when QP is put into error
state
When a QP is put into error state, all pending requests in the send work
queue should be drained. The following sequence of events could lead to a
failure, causing a request to hang:
(1) The QP builds a packet and tries to send through SDMA engine.
However, PIO engine is still busy. Consequently, this packet is put on
the QP's tx list and the QP is put on the PIO waiting list. The field
qp->s_flags is set with HFI1_S_WAIT_PIO_DRAIN;
(2) The QP is put into error state by the user application and
notify_error_qp() is called, which removes the QP from the PIO waiting
list and the packet from the QP's tx list. In addition, qp->s_flags is
cleared of RVT_S_ANY_WAIT_IO bits, which does not include
HFI1_S_WAIT_PIO_DRAIN bit;
(3) The hfi1_schdule_send() function is called to drain the QP's send
queue. Subsequently, hfi1_do_send() is called. Since the flag bit
HFI1_S_WAIT_PIO_DRAIN is set in qp->s_flags, hfi1_send_ok() fails. As
a result, hfi1_do_send() bails out without draining any request from
the send queue;
(4) The PIO engine completes the sending and tries to wake up any QP on
its waiting list. But the QP has been removed from the PIO waiting
list and therefore is kept in sleep forever.
The fix is to clear qp->s_flags of HFI1_S_ANY_WAIT_IO bits in step (2).
HFI1_S_ANY_WAIT_IO includes RVT_S_ANY_WAIT_IO and HFI1_S_WAIT_PIO_DRAIN.
Fixes: 2e2ba09e48b7 ("IB/rdmavt, IB/hfi1: Create device dependent s_flags")
Cc: <stable(a)vger.kernel.org> # 4.19.x+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn(a)intel.com>
Reviewed-by: Alex Estrin <alex.estrin(a)intel.com>
Signed-off-by: Kaike Wan <kaike.wan(a)intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro(a)intel.com>
Signed-off-by: Jason Gunthorpe <jgg(a)mellanox.com>
diff --git a/drivers/infiniband/hw/hfi1/qp.c b/drivers/infiniband/hw/hfi1/qp.c
index 9b643c2409cf..64889eda1735 100644
--- a/drivers/infiniband/hw/hfi1/qp.c
+++ b/drivers/infiniband/hw/hfi1/qp.c
@@ -898,7 +898,7 @@ void notify_error_qp(struct rvt_qp *qp)
if (!list_empty(&priv->s_iowait.list) &&
!(qp->s_flags & RVT_S_BUSY) &&
!(priv->s_flags & RVT_S_BUSY)) {
- qp->s_flags &= ~RVT_S_ANY_WAIT_IO;
+ qp->s_flags &= ~HFI1_S_ANY_WAIT_IO;
list_del_init(&priv->s_iowait.list);
priv->s_iowait.lock = NULL;
rvt_put_qp(qp);
The patch below does not apply to the 5.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8efd6365417a044db03009724ecc1a9521524913 Mon Sep 17 00:00:00 2001
From: Dinh Nguyen <dinguyen(a)kernel.org>
Date: Wed, 13 Mar 2019 17:28:37 -0500
Subject: [PATCH] arm64: dts: stratix10: add the sysmgr-syscon property from
the gmac's
The gmac ethernet driver uses the "altr,sysmgr-syscon" property to
configure phy settings for the gmac controller.
Add the "altr,sysmgr-syscon" property to all gmac nodes.
This patch fixes:
[ 0.917530] socfpga-dwmac ff800000.ethernet: No sysmgr-syscon node found
[ 0.924209] socfpga-dwmac ff800000.ethernet: Unable to parse OF data
Cc: stable(a)vger.kernel.org
Reported-by: Ley Foon Tan <ley.foon.tan(a)intel.com>
Signed-off-by: Dinh Nguyen <dinguyen(a)kernel.org>
diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
index 7c649f6b14cb..cd7c76e58b09 100644
--- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
+++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
@@ -162,6 +162,7 @@
rx-fifo-depth = <16384>;
snps,multicast-filter-bins = <256>;
iommus = <&smmu 1>;
+ altr,sysmgr-syscon = <&sysmgr 0x44 0>;
status = "disabled";
};
@@ -179,6 +180,7 @@
rx-fifo-depth = <16384>;
snps,multicast-filter-bins = <256>;
iommus = <&smmu 2>;
+ altr,sysmgr-syscon = <&sysmgr 0x48 0>;
status = "disabled";
};
@@ -196,6 +198,7 @@
rx-fifo-depth = <16384>;
snps,multicast-filter-bins = <256>;
iommus = <&smmu 3>;
+ altr,sysmgr-syscon = <&sysmgr 0x4c 0>;
status = "disabled";
};
Hello,
We ran automated tests on a patchset that was proposed for merging into this
kernel tree. The patches were applied to:
Kernel repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Commit: 8b298d3a0bd5 - Linux 5.0.7
The results of these automated tests are provided below.
Overall result: PASSED
Merge: OK
Compile: OK
Tests: OK
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Merge testing
-------------
We cloned this repository and checked out a ref:
Repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Ref: 8b298d3a0bd5 - Linux 5.0.7
We then merged the patchset with `git am`:
drm-i915-gvt-do-not-let-pin-count-of-shadow-mm-go-ne.patch
kbuild-pkg-use-f-srctree-makefile-to-recurse-to-top-.patch
netfilter-nft_compat-use-.release_ops-and-remove-lis.patch
netfilter-nf_tables-use-after-free-in-dynamic-operat.patch
netfilter-nf_tables-add-missing-release_ops-in-error.patch
hv_netvsc-fix-unwanted-wakeup-after-tx_disable.patch
ibmvnic-fix-completion-structure-initialization.patch
ip6_tunnel-match-to-arphrd_tunnel6-for-dev-type.patch
ipv6-fix-dangling-pointer-when-ipv6-fragment.patch
ipv6-sit-reset-ip-header-pointer-in-ipip6_rcv.patch
kcm-switch-order-of-device-registration-to-fix-a-cra.patch
net-ethtool-not-call-vzalloc-for-zero-sized-memory-r.patch
net-gro-fix-gro-flush-when-receiving-a-gso-packet.patch
net-mlx5-decrease-default-mr-cache-size.patch
netns-provide-pure-entropy-for-net_hash_mix.patch
net-rds-force-to-destroy-connection-if-t_sock-is-nul.patch
net-sched-act_sample-fix-divide-by-zero-in-the-traff.patch
net-sched-fix-get-helper-of-the-matchall-cls.patch
openvswitch-fix-flow-actions-reallocation.patch
qmi_wwan-add-olicard-600.patch
r8169-disable-aspm-again.patch
sctp-initialize-_pad-of-sockaddr_in-before-copying-t.patch
tcp-ensure-dctcp-reacts-to-losses.patch
tcp-fix-a-potential-null-pointer-dereference-in-tcp_.patch
vrf-check-accept_source_route-on-the-original-netdev.patch
net-mlx5e-fix-error-handling-when-refreshing-tirs.patch
net-mlx5e-add-a-lock-on-tir-list.patch
nfp-validate-the-return-code-from-dev_queue_xmit.patch
nfp-disable-netpoll-on-representors.patch
bnxt_en-improve-rx-consumer-index-validity-check.patch
bnxt_en-reset-device-on-rx-buffer-errors.patch
net-ip_gre-fix-possible-use-after-free-in-erspan_rcv.patch
net-ip6_gre-fix-possible-use-after-free-in-ip6erspan.patch
net-bridge-always-clear-mcast-matching-struct-on-rep.patch
net-thunderx-fix-null-pointer-dereference-in-nicvf_o.patch
net-vrf-fix-ping-failed-when-vrf-mtu-is-set-to-0.patch
net-core-netif_receive_skb_list-unlist-skb-before-pa.patch
r8169-disable-default-rx-interrupt-coalescing-on-rtl.patch
net-mlx5-add-a-missing-check-on-idr_find-free-buf.patch
net-mlx5e-update-xoff-formula.patch
net-mlx5e-update-xon-formula.patch
kbuild-clang-choose-gcc_toolchain_dir-not-on-ld.patch
lib-string.c-implement-a-basic-bcmp.patch
revert-clk-meson-clean-up-clock-registration.patch
tty-mark-siemens-r3964-line-discipline-as-broken.patch
tty-ldisc-add-sysctl-to-prevent-autoloading-of-ldiscs.patch
hwmon-w83773g-select-regmap_i2c-to-fix-build-error.patch
hwmon-occ-fix-power-sensor-indexing.patch
smb3-allow-persistent-handle-timeout-to-be-configurable-on-mount.patch
hid-logitech-handle-0-scroll-events-for-the-m560.patch
acpica-clear-status-of-gpes-before-enabling-them.patch
acpica-namespace-remove-address-node-from-global-list-after-method-termination.patch
alsa-seq-fix-oob-reads-from-strlcpy.patch
alsa-hda-realtek-enable-headset-mic-of-acer-travelmate-b114-21-with-alc233.patch
alsa-hda-realtek-add-quirk-for-tuxedo-xc-1509.patch
alsa-xen-front-do-not-use-stream-buffer-size-before-it-is-set.patch
alsa-hda-add-two-more-machines-to-the-power_save_blacklist.patch
mm-huge_memory.c-fix-modifying-of-page-protection-by-insert_pfn_pmd.patch
arm64-dts-rockchip-fix-rk3328-sdmmc0-write-errors.patch
mmc-alcor-don-t-write-data-before-command-has-completed.patch
mmc-sdhci-omap-don-t-finish_mrq-on-a-command-error-during-tuning.patch
parisc-detect-qemu-earlier-in-boot-process.patch
parisc-regs_return_value-should-return-gpr28.patch
parisc-also-set-iaoq_b-in-instruction_pointer_set.patch
alarmtimer-return-correct-remaining-time.patch
drm-i915-gvt-do-not-deliver-a-workload-if-its-creation-fails.patch
drm-sun4i-dw-hdmi-lower-max.-supported-rate-for-h6.patch
drm-udl-add-a-release-method-and-delay-modeset-teardown.patch
kvm-svm-fix-potential-get_num_contig_pages-overflow.patch
include-linux-bitrev.h-fix-constant-bitrev.patch
Compile testing
---------------
We compiled the kernel for 3 architectures:
aarch64:
make options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/aarch64/kernel-stable_queue-aarch6…
kernel build: https://artifacts.cki-project.org/builds/aarch64/kernel-stable_queue-aarch6…
ppc64le:
make options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/ppc64le/kernel-stable_queue-ppc64l…
kernel build: https://artifacts.cki-project.org/builds/ppc64le/kernel-stable_queue-ppc64l…
x86_64:
make options: -j20 INSTALL_MOD_STRIP=1 targz-pkg
configuration: https://artifacts.cki-project.org/builds/x86_64/kernel-stable_queue-x86_64-…
kernel build: https://artifacts.cki-project.org/builds/x86_64/kernel-stable_queue-x86_64-…
Hardware testing
----------------
We booted each kernel and ran the following tests:
aarch64:
✅ Boot test [0]
✅ LTP lite [1]
✅ Loopdev Sanity [2]
✅ AMTU (Abstract Machine Test Utility) [3]
🚧 ✅ audit: audit testsuite test [4]
✅ httpd: mod_ssl smoke sanity [5]
✅ httpd: php sanity [6]
✅ iotop: sanity [7]
✅ /CoreOS/net-snmp/Regression/bz251332-tcp-transport
✅ tuned: tune-processes-through-perf [8]
✅ Usex - version 1.9-29 [9]
🚧 ✅ stress: stress-ng [10]
ppc64le:
✅ Boot test [0]
✅ LTP lite [1]
✅ Loopdev Sanity [2]
✅ AMTU (Abstract Machine Test Utility) [3]
🚧 ✅ audit: audit testsuite test [4]
✅ httpd: mod_ssl smoke sanity [5]
✅ httpd: php sanity [6]
✅ iotop: sanity [7]
✅ /CoreOS/net-snmp/Regression/bz251332-tcp-transport
🚧 ✅ selinux-policy: serge-testsuite [11]
✅ tuned: tune-processes-through-perf [8]
✅ Usex - version 1.9-29 [9]
🚧 ✅ stress: stress-ng [10]
x86_64:
✅ Boot test [0]
✅ LTP lite [1]
✅ Loopdev Sanity [2]
✅ AMTU (Abstract Machine Test Utility) [3]
🚧 ✅ audit: audit testsuite test [4]
✅ httpd: mod_ssl smoke sanity [5]
✅ httpd: php sanity [6]
✅ iotop: sanity [7]
✅ /CoreOS/net-snmp/Regression/bz251332-tcp-transport
🚧 ✅ selinux-policy: serge-testsuite [11]
✅ tuned: tune-processes-through-perf [8]
✅ Usex - version 1.9-29 [9]
🚧 ✅ stress: stress-ng [10]
Test source:
[0]: https://github.com/CKI-project/tests-beaker/archive/master.zip#distribution…
[1]: https://github.com/CKI-project/tests-beaker/archive/master.zip#distribution…
[2]: https://github.com/CKI-project/tests-beaker/archive/master.zip#filesystems/…
[3]: https://github.com/CKI-project/tests-beaker/archive/master.zip#misc/amtu
[4]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/aud…
[5]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/htt…
[6]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/htt…
[7]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/iot…
[8]: https://github.com/CKI-project/tests-beaker/archive/master.zip#packages/tun…
[9]: https://github.com/CKI-project/tests-beaker/archive/master.zip#standards/us…
[10]: https://github.com/CKI-project/tests-beaker/archive/master.zip#stress/stres…
[11]: https://github.com/CKI-project/tests-beaker/archive/master.zip#/packages/se…
Waived tests (marked with 🚧)
-----------------------------
This test run included waived tests. Such tests are executed but their results
are not taken into account. Tests are waived when their results are not
reliable enough, e.g. when they're just introduced or are being fixed.
The patch below does not apply to the 5.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 21635d7311734d2d1b177f8a95e2f9386174b76d Mon Sep 17 00:00:00 2001
From: Jani Nikula <jani.nikula(a)intel.com>
Date: Fri, 5 Apr 2019 10:52:20 +0300
Subject: [PATCH] drm/i915/dp: revert back to max link rate and lane count on
eDP
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Commit 7769db588384 ("drm/i915/dp: optimize eDP 1.4+ link config fast
and narrow") started to optize the eDP 1.4+ link config, both per spec
and as preparation for display stream compression support.
Sadly, we again face panels that flat out fail with parameters they
claim to support. Revert, and go back to the drawing board.
v2: Actually revert to max params instead of just wide-and-slow.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109959
Fixes: 7769db588384 ("drm/i915/dp: optimize eDP 1.4+ link config fast and narrow")
Cc: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Cc: Manasi Navare <manasi.d.navare(a)intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Matt Atwood <matthew.s.atwood(a)intel.com>
Cc: "Lee, Shawn C" <shawn.c.lee(a)intel.com>
Cc: Dave Airlie <airlied(a)gmail.com>
Cc: intel-gfx(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v5.0+
Reviewed-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Reviewed-by: Manasi Navare <manasi.d.navare(a)intel.com>
Tested-by: Albert Astals Cid <aacid(a)kde.org> # v5.0 backport
Tested-by: Emanuele Panigati <ilpanich(a)gmail.com> # v5.0 backport
Tested-by: Matteo Iervasi <matteoiervasi(a)gmail.com> # v5.0 backport
Signed-off-by: Jani Nikula <jani.nikula(a)intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190405075220.9815-1-jani.ni…
(cherry picked from commit f11cb1c19ad0563b3c1ea5eb16a6bac0e401f428)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
index cf709835fb9a..8891f29a8c7f 100644
--- a/drivers/gpu/drm/i915/intel_dp.c
+++ b/drivers/gpu/drm/i915/intel_dp.c
@@ -1859,42 +1859,6 @@ intel_dp_compute_link_config_wide(struct intel_dp *intel_dp,
return -EINVAL;
}
-/* Optimize link config in order: max bpp, min lanes, min clock */
-static int
-intel_dp_compute_link_config_fast(struct intel_dp *intel_dp,
- struct intel_crtc_state *pipe_config,
- const struct link_config_limits *limits)
-{
- struct drm_display_mode *adjusted_mode = &pipe_config->base.adjusted_mode;
- int bpp, clock, lane_count;
- int mode_rate, link_clock, link_avail;
-
- for (bpp = limits->max_bpp; bpp >= limits->min_bpp; bpp -= 2 * 3) {
- mode_rate = intel_dp_link_required(adjusted_mode->crtc_clock,
- bpp);
-
- for (lane_count = limits->min_lane_count;
- lane_count <= limits->max_lane_count;
- lane_count <<= 1) {
- for (clock = limits->min_clock; clock <= limits->max_clock; clock++) {
- link_clock = intel_dp->common_rates[clock];
- link_avail = intel_dp_max_data_rate(link_clock,
- lane_count);
-
- if (mode_rate <= link_avail) {
- pipe_config->lane_count = lane_count;
- pipe_config->pipe_bpp = bpp;
- pipe_config->port_clock = link_clock;
-
- return 0;
- }
- }
- }
- }
-
- return -EINVAL;
-}
-
static int intel_dp_dsc_compute_bpp(struct intel_dp *intel_dp, u8 dsc_max_bpc)
{
int i, num_bpc;
@@ -2031,15 +1995,13 @@ intel_dp_compute_link_config(struct intel_encoder *encoder,
limits.min_bpp = 6 * 3;
limits.max_bpp = intel_dp_compute_bpp(intel_dp, pipe_config);
- if (intel_dp_is_edp(intel_dp) && intel_dp->edp_dpcd[0] < DP_EDP_14) {
+ if (intel_dp_is_edp(intel_dp)) {
/*
* Use the maximum clock and number of lanes the eDP panel
- * advertizes being capable of. The eDP 1.3 and earlier panels
- * are generally designed to support only a single clock and
- * lane configuration, and typically these values correspond to
- * the native resolution of the panel. With eDP 1.4 rate select
- * and DSC, this is decreasingly the case, and we need to be
- * able to select less than maximum link config.
+ * advertizes being capable of. The panels are generally
+ * designed to support only a single clock and lane
+ * configuration, and typically these values correspond to the
+ * native resolution of the panel.
*/
limits.min_lane_count = limits.max_lane_count;
limits.min_clock = limits.max_clock;
@@ -2053,22 +2015,11 @@ intel_dp_compute_link_config(struct intel_encoder *encoder,
intel_dp->common_rates[limits.max_clock],
limits.max_bpp, adjusted_mode->crtc_clock);
- if (intel_dp_is_edp(intel_dp))
- /*
- * Optimize for fast and narrow. eDP 1.3 section 3.3 and eDP 1.4
- * section A.1: "It is recommended that the minimum number of
- * lanes be used, using the minimum link rate allowed for that
- * lane configuration."
- *
- * Note that we use the max clock and lane count for eDP 1.3 and
- * earlier, and fast vs. wide is irrelevant.
- */
- ret = intel_dp_compute_link_config_fast(intel_dp, pipe_config,
- &limits);
- else
- /* Optimize for slow and wide. */
- ret = intel_dp_compute_link_config_wide(intel_dp, pipe_config,
- &limits);
+ /*
+ * Optimize for slow and wide. This is the place to add alternative
+ * optimization policy.
+ */
+ ret = intel_dp_compute_link_config_wide(intel_dp, pipe_config, &limits);
/* enable compression if the mode doesn't fit available BW */
DRM_DEBUG_KMS("Force DSC en = %d\n", intel_dp->force_dsc_en);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0b3d6e6f2dd0a7b697b1aa8c167265908940624b Mon Sep 17 00:00:00 2001
From: Greg Thelen <gthelen(a)google.com>
Date: Fri, 5 Apr 2019 18:39:18 -0700
Subject: [PATCH] mm: writeback: use exact memcg dirty counts
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic. Stat readers only check the atomic. Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.
Assuming 100 cpus:
4k x86 page_size: 13 MiB error per memcg
64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions the
errors double.
This inaccuracy can lead to undeserved oom kills. One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages. If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.
It could be argued that tiny containers are not supported, but it's more
subtle. It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.
The following test reliably ooms without this patch. This patch avoids
oom kills.
$ cat test
mount -t cgroup2 none /dev/cgroup
cd /dev/cgroup
echo +io +memory > cgroup.subtree_control
mkdir test
cd test
echo 10M > memory.max
(echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
(echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
$ cat memcg-writeback-stress.c
/*
* Dirty pages from all but one cpu.
* Clean pages from the non dirtying cpu.
* This is to stress per cpu counter imbalance.
* On a 100 cpu machine:
* - per memcg per cpu dirty count is 32 pages for each of 99 cpus
* - per memcg atomic is -99*32 pages
* - thus the complete dirty limit: sum of all counters 0
* - balance_dirty_pages() only sees atomic count -99*32 pages, which
* it max()s to 0.
* - So a workload can dirty -99*32 pages before balance_dirty_pages()
* cares.
*/
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysinfo.h>
#include <sys/types.h>
#include <unistd.h>
static char *buf;
static int bufSize;
static void set_affinity(int cpu)
{
cpu_set_t affinity;
CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
}
static void dirty_on(int output_fd, int cpu)
{
int i, wrote;
set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
}
int main(int argc, char **argv)
{
int cpu, flush_cpu = 1, output_fd;
const char *output;
if (argc != 2)
errx(1, "usage: output_file");
output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");
output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);
for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}
set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
}
Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters. This avoids the aforementioned oom
kills.
This does not affect the overhead of memory.stat, which still reads the
single atomic counter.
Why not use percpu_counter? memcg already handles cpus going offline, so
no need for that overhead from percpu_counter. And the percpu_counter
spinlocks are more heavyweight than is required.
It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports. But that is saved for later.
Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen(a)google.com>
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [4.16+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f3d880b7ca1..dbb6118370c1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -566,7 +566,10 @@ struct mem_cgroup *lock_page_memcg(struct page *page);
void __unlock_page_memcg(struct mem_cgroup *memcg);
void unlock_page_memcg(struct page *page);
-/* idx can be of type enum memcg_stat_item or node_stat_item */
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page_state().
+ */
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
int idx)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 532e0e2a4817..81a0d3914ec9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3882,6 +3882,22 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
return &memcg->cgwb_domain;
}
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page().
+ */
+static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
+{
+ long x = atomic_long_read(&memcg->stat[idx]);
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
+ if (x < 0)
+ x = 0;
+ return x;
+}
+
/**
* mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
* @wb: bdi_writeback in question
@@ -3907,10 +3923,10 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
- *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
+ *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
/* this should eventually include NR_UNSTABLE_NFS */
- *pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
+ *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
(1 << LRU_ACTIVE_FILE));
*pheadroom = PAGE_COUNTER_MAX;