I'm announcing the release of the 6.1.67 kernel.
All users of the 6.1 kernel series must upgrade.
The updated 6.1.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.1.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
net/wireless/core.h | 1
net/wireless/nl80211.c | 50 ++++++++++++++++++-------------------------------
3 files changed, 20 insertions(+), 33 deletions(-)
Greg Kroah-Hartman (2):
Revert "wifi: cfg80211: fix CQM for non-range use"
Linux 6.1.67
I'm announcing the release of the 6.6.6 kernel.
All users of the 6.6 kernel series must upgrade.
The updated 6.6.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.6.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
net/wireless/core.h | 1
net/wireless/nl80211.c | 50 ++++++++++++++++++-------------------------------
3 files changed, 20 insertions(+), 33 deletions(-)
Greg Kroah-Hartman (2):
Revert "wifi: cfg80211: fix CQM for non-range use"
Linux 6.6.6
On Sat, Dec 09, 2023 at 11:47:49AM +0100, Arnd Bergmann wrote:
> On Sat, Dec 9, 2023, at 03:46, Sasha Levin wrote:
>> This is a note to let you know that I've just added the patch titled
>>
>> Kbuild: move to -std=gnu11
>>
>> to the 5.15-stable tree which can be found at:
>>
>> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>>
>> The filename of the patch is:
>> kbuild-move-to-std-gnu11.patch
>> and it can be found in the queue-5.15 subdirectory.
>>
>> If you, or anyone else, feels it should not be added to the stable tree,
>> please let <stable(a)vger.kernel.org> know about it.
>
> I think the patch initially caused a few regressions, so
> I'm not sure if backporting this is the best idea. Is this
> needed for some other backport?
Hey Arnd,
I spent some time over the weekend trying to figure it out. Initially it
looked like there was something wrong with my local toolchain, but what I
found out is the following:
There is now kernel code relying on constructs that are invalid without
C99/gnu11/etc. The example in this case is the WMI code, which even in
upstream [1] does:
list_for_each_entry(wblock, &wmi_block_list, list) {
        /* skip warning and register if we know the driver will use struct wmi_driver */
        for (int i = 0; allow_duplicates[i] != NULL; i++) {
             ^^^^^^^^^^^
                if (guid_parse_and_compare(allow_duplicates[i], guid))
                        return false;
The declaration of a variable there doesn't work unless you're using a
newer standard, which is why the dependency bot ended up pulling this
commit in.
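For reference, here is a minimal standalone example of the same construct
(illustrative only, not the kernel code; the GUID string is made up):

        #include <string.h>

        static const char * const allow_duplicates[] = {
                "05901221-D566-11D1-B2F0-00A0C9062910",  /* example GUID */
                NULL,
        };

        int is_allowed(const char *guid)
        {
                /* declaring 'i' inside the for statement needs C99 or
                 * later; with -std=gnu89 gcc rejects it with "'for' loop
                 * initial declarations are only allowed in C99 or C11
                 * mode" */
                for (int i = 0; allow_duplicates[i] != NULL; i++) {
                        if (strcmp(allow_duplicates[i], guid) == 0)
                                return 1;
                }
                return 0;
        }

Compiling this with "gcc -std=gnu89 -c" fails, while "gcc -std=gnu11 -c"
succeeds, which is exactly the breakage seen on 5.15 without the Kbuild
change.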
At this point, not taking this change means that we can't take some
commits without doing custom changes to backport them, which in turn
means that we'll keep diverging from upstream.
I agree with you that this isn't a trivial change, but it seems that we
need to take it so that even not-that-old trees (<=5.15) can accept some
fixes.
With the commit in question applied I see no new errors or warnings, so
I'll keep it in 5.15, and we can see how it survives the -rc tests.
I haven't seen any fixes pointing to that commit besides a documentation
fix, so if I've missed anything please let me know.
[1]: https://elixir.bootlin.com/linux/v6.7-rc5/source/drivers/platform/x86/wmi.c…
--
Thanks,
Sasha
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x d839a656d0f3caca9f96e9bf912fd394ac6a11bc
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120316-seduce-vehicular-9e78@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
d839a656d0f3 ("kprobes: consistent rcu api usage for kretprobe holder")
4bbd93455659 ("kprobes: kretprobe scalability improvement")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d839a656d0f3caca9f96e9bf912fd394ac6a11bc Mon Sep 17 00:00:00 2001
From: JP Kobryn <inwardvessel(a)gmail.com>
Date: Fri, 1 Dec 2023 14:53:55 +0900
Subject: [PATCH] kprobes: consistent rcu api usage for kretprobe holder
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
The thought behind this patch is to make use of the RCU API where possible
when accessing this pointer so that the needed barriers are always in place
and to self-document the code.
The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
done to the "rp" pointer are changed to make use of the RCU macro for
assignment. For the single read, the implementation of get_kretprobe()
is simplified by making use of an RCU macro which accomplishes the same,
but note that the log warning text will be more generic.
I did find that there is a difference in assembly generated between the
usage of the RCU macros vs without. For example, on arm64, when using
rcu_assign_pointer(), the corresponding store instruction is a
store-release (STLR) which has an implicit barrier. When normal assignment
is done, a regular store (STR) is found. In the macro case, this seems to
be a result of rcu_assign_pointer() using smp_store_release() when the
value to write is not NULL.
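In sketch form, the publish/read pairing this moves to looks as follows
(a minimal illustration using the kernel RCU APIs, not the exact kprobes
code):

        #include <linux/rcupdate.h>

        struct kretprobe;

        struct holder_sketch {
                struct kretprobe __rcu *rp;
        };

        /* writer: rcu_assign_pointer() uses smp_store_release() for
         * non-NULL values, so arm64 emits STLR rather than a plain STR */
        static void holder_publish(struct holder_sketch *h, struct kretprobe *rp)
        {
                rcu_assign_pointer(h->rp, rp);
        }

        /* reader: dependency-ordered load plus a lockdep assertion that
         * some RCU read-side critical section is held */
        static struct kretprobe *holder_read(struct holder_sketch *h)
        {
                return rcu_dereference_check(h->rp, rcu_read_lock_any_held());
        }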
Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/
Fixes: d741bf41d7c7 ("kprobes: Remove kretprobe hash")
Cc: stable(a)vger.kernel.org
Signed-off-by: JP Kobryn <inwardvessel(a)gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index ab1da3142b06..64672bace560 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -139,7 +139,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
*
*/
struct kretprobe_holder {
- struct kretprobe *rp;
+ struct kretprobe __rcu *rp;
struct objpool_head pool;
};
@@ -245,10 +245,7 @@ unsigned long kretprobe_trampoline_handler(struct pt_regs *regs,
static nokprobe_inline struct kretprobe *get_kretprobe(struct kretprobe_instance *ri)
{
- RCU_LOCKDEP_WARN(!rcu_read_lock_any_held(),
- "Kretprobe is accessed from instance under preemptive context");
-
- return READ_ONCE(ri->rph->rp);
+ return rcu_dereference_check(ri->rph->rp, rcu_read_lock_any_held());
}
static nokprobe_inline unsigned long get_kretprobe_retaddr(struct kretprobe_instance *ri)
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 075a632e6c7c..d5a0ee40bf66 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2252,7 +2252,7 @@ int register_kretprobe(struct kretprobe *rp)
rp->rph = NULL;
return -ENOMEM;
}
- rp->rph->rp = rp;
+ rcu_assign_pointer(rp->rph->rp, rp);
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
@@ -2300,7 +2300,7 @@ void unregister_kretprobes(struct kretprobe **rps, int num)
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
rethook_free(rps[i]->rh);
#else
- rps[i]->rph->rp = NULL;
+ rcu_assign_pointer(rps[i]->rph->rp, NULL);
#endif
}
mutex_unlock(&kprobe_mutex);
The patch titled
Subject: Revert "selftests: error out if kernel header files are not yet built"
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
revert-selftests-error-out-if-kernel-header-files-are-not-yet-built.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: John Hubbard <jhubbard(a)nvidia.com>
Subject: Revert "selftests: error out if kernel header files are not yet built"
Date: Fri, 8 Dec 2023 18:01:44 -0800
This reverts commit 9fc96c7c19df ("selftests: error out if kernel header
files are not yet built").
It turns out that requiring the kernel headers to be built as a
prerequisite to building selftests does not work in many cases. For
example, Peter Zijlstra writes:
"My biggest beef with the whole thing is that I simply do not want to use
'make headers', it doesn't work for me.
I have a ton of output directories and I don't care to build tools into
the output dirs, in fact some of them flat out refuse to work that way
(bpf comes to mind)." [1]
Therefore, stop erroring out on the selftests build. Additional patches
will be required in order to change over to not requiring the kernel
headers.
[1] https://lore.kernel.org/20231208221007.GO28727@noisy.programming.kicks-ass.…
Link: https://lkml.kernel.org/r/20231209020144.244759-1-jhubbard@nvidia.com
Fixes: 9fc96c7c19df ("selftests: error out if kernel header files are not yet built")
Signed-off-by: John Hubbard <jhubbard(a)nvidia.com>
Cc: Anders Roxell <anders.roxell(a)linaro.org>
Cc: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/Makefile | 21 ---------------
tools/testing/selftests/lib.mk | 40 ++---------------------------
2 files changed, 4 insertions(+), 57 deletions(-)
--- a/tools/testing/selftests/lib.mk~revert-selftests-error-out-if-kernel-header-files-are-not-yet-built
+++ a/tools/testing/selftests/lib.mk
@@ -44,26 +44,10 @@ endif
selfdir = $(realpath $(dir $(filter %/lib.mk,$(MAKEFILE_LIST))))
top_srcdir = $(selfdir)/../../..
-ifeq ("$(origin O)", "command line")
- KBUILD_OUTPUT := $(O)
+ifeq ($(KHDR_INCLUDES),)
+KHDR_INCLUDES := -isystem $(top_srcdir)/usr/include
endif
-ifneq ($(KBUILD_OUTPUT),)
- # Make's built-in functions such as $(abspath ...), $(realpath ...) cannot
- # expand a shell special character '~'. We use a somewhat tedious way here.
- abs_objtree := $(shell cd $(top_srcdir) && mkdir -p $(KBUILD_OUTPUT) && cd $(KBUILD_OUTPUT) && pwd)
- $(if $(abs_objtree),, \
- $(error failed to create output directory "$(KBUILD_OUTPUT)"))
- # $(realpath ...) resolves symlinks
- abs_objtree := $(realpath $(abs_objtree))
- KHDR_DIR := ${abs_objtree}/usr/include
-else
- abs_srctree := $(shell cd $(top_srcdir) && pwd)
- KHDR_DIR := ${abs_srctree}/usr/include
-endif
-
-KHDR_INCLUDES := -isystem $(KHDR_DIR)
-
# The following are built by lib.mk common compile rules.
# TEST_CUSTOM_PROGS should be used by tests that require
# custom build rule and prevent common build rule use.
@@ -74,25 +58,7 @@ TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)
TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
-all: kernel_header_files $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) \
- $(TEST_GEN_FILES)
-
-kernel_header_files:
- @ls $(KHDR_DIR)/linux/*.h >/dev/null 2>/dev/null; \
- if [ $$? -ne 0 ]; then \
- RED='\033[1;31m'; \
- NOCOLOR='\033[0m'; \
- echo; \
- echo -e "$${RED}error$${NOCOLOR}: missing kernel header files."; \
- echo "Please run this and try again:"; \
- echo; \
- echo " cd $(top_srcdir)"; \
- echo " make headers"; \
- echo; \
- exit 1; \
- fi
-
-.PHONY: kernel_header_files
+all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
define RUN_TESTS
BASE_DIR="$(selfdir)"; \
--- a/tools/testing/selftests/Makefile~revert-selftests-error-out-if-kernel-header-files-are-not-yet-built
+++ a/tools/testing/selftests/Makefile
@@ -155,12 +155,10 @@ ifneq ($(KBUILD_OUTPUT),)
abs_objtree := $(realpath $(abs_objtree))
BUILD := $(abs_objtree)/kselftest
KHDR_INCLUDES := -isystem ${abs_objtree}/usr/include
- KHDR_DIR := ${abs_objtree}/usr/include
else
BUILD := $(CURDIR)
abs_srctree := $(shell cd $(top_srcdir) && pwd)
KHDR_INCLUDES := -isystem ${abs_srctree}/usr/include
- KHDR_DIR := ${abs_srctree}/usr/include
DEFAULT_INSTALL_HDR_PATH := 1
endif
@@ -174,7 +172,7 @@ export KHDR_INCLUDES
# all isn't the first target in the file.
.DEFAULT_GOAL := all
-all: kernel_header_files
+all:
@ret=1; \
for TARGET in $(TARGETS); do \
BUILD_TARGET=$$BUILD/$$TARGET; \
@@ -185,23 +183,6 @@ all: kernel_header_files
ret=$$((ret * $$?)); \
done; exit $$ret;
-kernel_header_files:
- @ls $(KHDR_DIR)/linux/*.h >/dev/null 2>/dev/null; \
- if [ $$? -ne 0 ]; then \
- RED='\033[1;31m'; \
- NOCOLOR='\033[0m'; \
- echo; \
- echo -e "$${RED}error$${NOCOLOR}: missing kernel header files."; \
- echo "Please run this and try again:"; \
- echo; \
- echo " cd $(top_srcdir)"; \
- echo " make headers"; \
- echo; \
- exit 1; \
- fi
-
-.PHONY: kernel_header_files
-
run_tests: all
@for TARGET in $(TARGETS); do \
BUILD_TARGET=$$BUILD/$$TARGET; \
_
Patches currently in -mm which might be from jhubbard(a)nvidia.com are
revert-selftests-error-out-if-kernel-header-files-are-not-yet-built.patch
From: Ronald Wahl <ronald.wahl(a)raritan.com>
There is a bug in the ks8851 Ethernet driver where more data may be
written to the hardware TX buffer than there is space available. This is
caused by wrong accounting of the free TX buffer space.
The driver maintains a tx_space variable that represents the TX buffer
space that is deemed to be free. The ks8851_start_xmit_spi() function
adds an SKB to a queue if tx_space is large enough and reduces tx_space
by the amount of buffer space it will later need in the TX buffer and
then schedules a work item. If there is not enough space then the TX
queue is stopped.
The worker function ks8851_tx_work() dequeues all the SKBs and writes
the data into the hardware TX buffer. The last packet will trigger an
interrupt after it has been sent. Here it is assumed that all data fits
into the TX buffer.
In the interrupt routine (which runs asynchronously because it is a
threaded interrupt) tx_space is updated with the current value from the
hardware. Also the TX queue is woken up again.
Now it could happen that after data was sent to the hardware, and before
the TX interrupt is handled, new data is queued in ks8851_start_xmit_spi()
while the TX buffer still had some space left. When the interrupt is
actually handled, tx_space is updated from the hardware, but by then we
already have new SKBs queued that have not been written to the hardware
TX buffer yet. Since tx_space has been overwritten by the value from the
hardware, that queued data is no longer accounted for.
Now we have more data queued than buffer space available in the hardware,
and ks8851_tx_work() will potentially overrun the hardware TX buffer. In
many cases it will still work, because often the buffer is written out
fast enough that no overrun occurs, but if, for example, the peer
throttles us via flow control, then an overrun may happen.
This can be fixed in different ways. The simplest way would be to set
tx_space to 0 before writing data to the hardware TX buffer, preventing
the queuing of more SKBs until the TX interrupt has been handled. I have
chosen a slightly more efficient (and still rather simple) way and track
the amount of data that is already queued but not yet written to the
hardware. When new SKBs are to be queued, the already-queued amount of
data is honoured when checking the free TX buffer space.
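In sketch form, the accounting this introduces looks like this (field and
variable names follow the patch below):

        /* xmit path, under ks->statelock: also account for SKBs that are
         * queued but not yet written to the hardware TX buffer */
        if (ks->queued_len + needed > ks->tx_space) {
                netif_stop_queue(dev);
                ret = NETDEV_TX_BUSY;
        } else {
                ks->queued_len += needed;       /* reserve the space */
                skb_queue_tail(&ks->txq, skb);
        }

        /* tx worker, after flushing the queue to the TX FIFO: release the
         * reservation and refresh tx_space from the hardware */
        spin_lock(&ks->statelock);
        ks->queued_len -= dequeued_len;
        ks->tx_space = tx_space;
        spin_unlock(&ks->statelock);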
I tested this with a setup of two linked KS8851s running iperf3 between
the two in bidirectional mode. Before the fix I got a stall after some
minutes. With the fix I saw no issues anymore after hours.
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: netdev(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Ronald Wahl <ronald.wahl(a)raritan.com>
---
drivers/net/ethernet/micrel/ks8851.h | 1 +
drivers/net/ethernet/micrel/ks8851_common.c | 21 +++++------
drivers/net/ethernet/micrel/ks8851_spi.c | 41 +++++++++++++--------
3 files changed, 37 insertions(+), 26 deletions(-)
diff --git a/drivers/net/ethernet/micrel/ks8851.h b/drivers/net/ethernet/micrel/ks8851.h
index fecd43754cea..ce7e524f2542 100644
--- a/drivers/net/ethernet/micrel/ks8851.h
+++ b/drivers/net/ethernet/micrel/ks8851.h
@@ -399,6 +399,7 @@ struct ks8851_net {
struct work_struct rxctrl_work;
struct sk_buff_head txq;
+ unsigned int queued_len;
struct eeprom_93cx6 eeprom;
struct regulator *vdd_reg;
diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c
index cfbc900d4aeb..daab9358124b 100644
--- a/drivers/net/ethernet/micrel/ks8851_common.c
+++ b/drivers/net/ethernet/micrel/ks8851_common.c
@@ -362,16 +362,17 @@ static irqreturn_t ks8851_irq(int irq, void *_ks)
handled |= IRQ_RXPSI;
if (status & IRQ_TXI) {
- handled |= IRQ_TXI;
-
- /* no lock here, tx queue should have been stopped */
+ unsigned short tx_space = ks8851_rdreg16(ks, KS_TXMIR);
+ netif_dbg(ks, intr, ks->netdev,
+ "%s: txspace %d\n", __func__, tx_space);
- /* update our idea of how much tx space is available to the
- * system */
- ks->tx_space = ks8851_rdreg16(ks, KS_TXMIR);
+ spin_lock(&ks->statelock);
+ ks->tx_space = tx_space;
+ if (netif_queue_stopped(ks->netdev))
+ netif_wake_queue(ks->netdev);
+ spin_unlock(&ks->statelock);
- netif_dbg(ks, intr, ks->netdev,
- "%s: txspace %d\n", __func__, ks->tx_space);
+ handled |= IRQ_TXI;
}
if (status & IRQ_RXI)
@@ -414,9 +415,6 @@ static irqreturn_t ks8851_irq(int irq, void *_ks)
if (status & IRQ_LCI)
mii_check_link(&ks->mii);
- if (status & IRQ_TXI)
- netif_wake_queue(ks->netdev);
-
return IRQ_HANDLED;
}
@@ -500,6 +498,7 @@ static int ks8851_net_open(struct net_device *dev)
ks8851_wrreg16(ks, KS_ISR, ks->rc_ier);
ks8851_wrreg16(ks, KS_IER, ks->rc_ier);
+ ks->queued_len = 0;
netif_start_queue(ks->netdev);
netif_dbg(ks, ifup, ks->netdev, "network device up\n");
diff --git a/drivers/net/ethernet/micrel/ks8851_spi.c b/drivers/net/ethernet/micrel/ks8851_spi.c
index 70bc7253454f..eb089b3120bc 100644
--- a/drivers/net/ethernet/micrel/ks8851_spi.c
+++ b/drivers/net/ethernet/micrel/ks8851_spi.c
@@ -286,6 +286,18 @@ static void ks8851_wrfifo_spi(struct ks8851_net *ks, struct sk_buff *txp,
netdev_err(ks->netdev, "%s: spi_sync() failed\n", __func__);
}
+/**
+ * calc_txlen - calculate size of message to send packet
+ * @len: Length of data
+ *
+ * Returns the size of the TXFIFO message needed to send
+ * this packet.
+ */
+static unsigned int calc_txlen(unsigned int len)
+{
+ return ALIGN(len + 4, 4);
+}
+
/**
* ks8851_rx_skb_spi - receive skbuff
* @ks: The device state
@@ -310,6 +322,8 @@ static void ks8851_tx_work(struct work_struct *work)
unsigned long flags;
struct sk_buff *txb;
bool last;
+ unsigned short tx_space;
+ unsigned int dequeued_len = 0;
kss = container_of(work, struct ks8851_net_spi, tx_work);
ks = &kss->ks8851;
@@ -320,6 +334,7 @@ static void ks8851_tx_work(struct work_struct *work)
while (!last) {
txb = skb_dequeue(&ks->txq);
last = skb_queue_empty(&ks->txq);
+ dequeued_len += calc_txlen(txb->len);
if (txb) {
ks8851_wrreg16_spi(ks, KS_RXQCR,
@@ -332,6 +347,13 @@ static void ks8851_tx_work(struct work_struct *work)
}
}
+ tx_space = ks8851_rdreg16_spi(ks, KS_TXMIR);
+
+ spin_lock(&ks->statelock);
+ ks->queued_len -= dequeued_len;
+ ks->tx_space = tx_space;
+ spin_unlock(&ks->statelock);
+
ks8851_unlock_spi(ks, &flags);
}
@@ -346,18 +368,6 @@ static void ks8851_flush_tx_work_spi(struct ks8851_net *ks)
flush_work(&kss->tx_work);
}
-/**
- * calc_txlen - calculate size of message to send packet
- * @len: Length of data
- *
- * Returns the size of the TXFIFO message needed to send
- * this packet.
- */
-static unsigned int calc_txlen(unsigned int len)
-{
- return ALIGN(len + 4, 4);
-}
-
/**
* ks8851_start_xmit_spi - transmit packet using SPI
* @skb: The buffer to transmit
@@ -386,16 +396,17 @@ static netdev_tx_t ks8851_start_xmit_spi(struct sk_buff *skb,
spin_lock(&ks->statelock);
- if (needed > ks->tx_space) {
+ if (ks->queued_len + needed > ks->tx_space) {
netif_stop_queue(dev);
ret = NETDEV_TX_BUSY;
} else {
- ks->tx_space -= needed;
+ ks->queued_len += needed;
skb_queue_tail(&ks->txq, skb);
}
spin_unlock(&ks->statelock);
- schedule_work(&kss->tx_work);
+ if (ret == NETDEV_TX_OK)
+ schedule_work(&kss->tx_work);
return ret;
}
--
2.43.0
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 9aa1345d66b8132745ffb99b348b1492088da9e2
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120945-citizen-library-9f46@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
9aa1345d66b8 ("mm: fix oops when filemap_map_pmd() without prealloc_pte")
03c4f20454e0 ("mm: introduce pmd_install() helper")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9aa1345d66b8132745ffb99b348b1492088da9e2 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hughd(a)google.com>
Date: Fri, 17 Nov 2023 00:49:18 -0800
Subject: [PATCH] mm: fix oops when filemap_map_pmd() without prealloc_pte
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
syzbot reports oops in lockdep's __lock_acquire(), called from
__pte_offset_map_lock() called from filemap_map_pages(); or when I run the
repro, the oops comes in pmd_install(), called from filemap_map_pmd()
called from filemap_map_pages(), just before the __pte_offset_map_lock().
The problem is that filemap_map_pmd() has been assuming that when it finds
pmd_none(), a page table has already been prepared in prealloc_pte; and
indeed do_fault_around() has been careful to preallocate one there, when
it finds pmd_none(): but what if *pmd became none in between?
My 6.6 mods in mm/khugepaged.c, avoiding mmap_lock for write, have made it
easy for *pmd to be cleared while servicing a page fault; but even before
those, a huge *pmd might be zapped while a fault is serviced.
The difference in symptomatic stack traces comes from the "memory model"
in use: pmd_install() uses pmd_populate() uses page_to_pfn(): in some
models that is strict, and will oops on the NULL prealloc_pte; in other
models, it will construct a bogus value to be populated into *pmd, then
__pte_offset_map_lock() oops when trying to access split ptlock pointer
(or some other symptom in normal case of ptlock embedded not pointer).
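In sketch form, the window being closed (illustrative, per the description
above):

        /*
         *   do_fault_around()
         *     *vmf->pmd is huge (not none)  -> no prealloc_pte allocated
         *   ...                                *pmd zapped concurrently
         *                                      (e.g. by khugepaged)
         *   filemap_map_pmd()
         *     pmd_none(*vmf->pmd) is now true
         *     pmd_install(mm, vmf->pmd, &vmf->prealloc_pte)
         *                                      ^ prealloc_pte is NULL: oops
         */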
Link: https://lore.kernel.org/linux-mm/20231115065506.19780-1-jose.pekkarinen@fox…
Link: https://lkml.kernel.org/r/6ed0c50c-78ef-0719-b3c5-60c0c010431c@google.com
Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reported-and-tested-by: syzbot+89edd67979b52675ddec(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/0000000000005e44550608a0806c@google.com/
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: José Pekkarinen <jose.pekkarinen(a)foxhound.fi>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/filemap.c b/mm/filemap.c
index 32eedf3afd45..f1c8c278310f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3371,7 +3371,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
}
}
- if (pmd_none(*vmf->pmd))
+ if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
return false;
Forwarding to the stable mailing list.
Ryusuke Konishi
On Sun, Dec 10, 2023 at 11:37 AM Ryusuke Konishi wrote:
>
> Hi Greg,
>
> Please drop this patch from the 5.4-stable, 4.19-stable, and 4.14-stable queues.
>
> This patch uses nilfs_error() instead of nilfs_err(), which does not
> yet exist in these versions, but these are different routines and
> nilfs_error() should not be used as an alternative for init_nilfs(),
> as it can cause deadlock.
>
> I'll try to post a separate patch to replace it for these versions.
>
> Thanks,
> Ryusuke Konishi
>
> On Sat, Dec 9, 2023 at 9:32 PM wrote:
> >
> >
> > This is a note to let you know that I've just added the patch titled
> >
> > nilfs2: fix missing error check for sb_set_blocksize call
> >
> > to the 5.4-stable tree which can be found at:
> > http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
> >
> > The filename of the patch is:
> > nilfs2-fix-missing-error-check-for-sb_set_blocksize-call.patch
> > and it can be found in the queue-5.4 subdirectory.
> >
> > If you, or anyone else, feels it should not be added to the stable tree,
> > please let <stable(a)vger.kernel.org> know about it.
> >
> >
> > From d61d0ab573649789bf9eb909c89a1a193b2e3d10 Mon Sep 17 00:00:00 2001
> > From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
> > Date: Wed, 29 Nov 2023 23:15:47 +0900
> > Subject: nilfs2: fix missing error check for sb_set_blocksize call
> >
> > From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
> >
> > commit d61d0ab573649789bf9eb909c89a1a193b2e3d10 upstream.
> >
> > When mounting a filesystem image with a block size larger than the page
> > size, nilfs2 repeatedly outputs long error messages with stack traces to
> > the kernel log, such as the following:
> >
> > getblk(): invalid block size 8192 requested
> > logical block size: 512
> > ...
> > Call Trace:
> > dump_stack_lvl+0x92/0xd4
> > dump_stack+0xd/0x10
> > bdev_getblk+0x33a/0x354
> > __breadahead+0x11/0x80
> > nilfs_search_super_root+0xe2/0x704 [nilfs2]
> > load_nilfs+0x72/0x504 [nilfs2]
> > nilfs_mount+0x30f/0x518 [nilfs2]
> > legacy_get_tree+0x1b/0x40
> > vfs_get_tree+0x18/0xc4
> > path_mount+0x786/0xa88
> > __ia32_sys_mount+0x147/0x1a8
> > __do_fast_syscall_32+0x56/0xc8
> > do_fast_syscall_32+0x29/0x58
> > do_SYSENTER_32+0x15/0x18
> > entry_SYSENTER_32+0x98/0xf1
> > ...
> >
> > This overloads the system logger. And to make matters worse, it sometimes
> > crashes the kernel with a memory access violation.
> >
> > This is because the return value of the sb_set_blocksize() call, which
> > should be checked for errors, is not checked.
> >
> > The latter issue is due to out-of-buffer memory being accessed based on a
> > large block size that caused sb_set_blocksize() to fail for buffers read
> > with the initial minimum block size that remained unupdated in the
> > super_block structure.
> >
> > Since the nilfs2 mkfs tool does not accept block sizes larger than the system
> > page size, this has been overlooked. However, it is possible to create
> > this situation by intentionally modifying the tool or by passing a
> > filesystem image created on a system with a large page size to a system
> > with a smaller page size and mounting it.
> >
> > Fix this issue by inserting the expected error handling for the call to
> > sb_set_blocksize().
> >
> > Link: https://lkml.kernel.org/r/20231129141547.4726-1-konishi.ryusuke@gmail.com
> > Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
> > Tested-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
> > Cc: <stable(a)vger.kernel.org>
> > Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
> > Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> > ---
> > fs/nilfs2/the_nilfs.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > --- a/fs/nilfs2/the_nilfs.c
> > +++ b/fs/nilfs2/the_nilfs.c
> > @@ -688,7 +688,11 @@ int init_nilfs(struct the_nilfs *nilfs,
> > goto failed_sbh;
> > }
> > nilfs_release_super_block(nilfs);
> > - sb_set_blocksize(sb, blocksize);
> > + if (!sb_set_blocksize(sb, blocksize)) {
> > + nilfs_error(sb, "bad blocksize %d", blocksize);
> > + err = -EINVAL;
> > + goto out;
> > + }
> >
> > err = nilfs_load_super_block(nilfs, sb, blocksize, &sbp);
> > if (err)
> >
> >
> > Patches currently in stable-queue which might be from konishi.ryusuke(a)gmail.com are
> >
> > queue-5.4/nilfs2-prevent-warning-in-nilfs_sufile_set_segment_usage.patch
> > queue-5.4/nilfs2-fix-missing-error-check-for-sb_set_blocksize-call.patch
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 705318a99a138c29a512a72c3e0043b3cd7f55f4
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120911-ecosystem-diary-c2b7@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
705318a99a13 ("io_uring/af_unix: disable sending io_uring over sockets")
38eddb2c75fb ("io_uring: remove FFS_SCM")
735729844819 ("io_uring: move rsrc related data, core, and commands")
3b77495a9723 ("io_uring: split provided buffers handling into its own file")
7aaff708a768 ("io_uring: move cancelation into its own file")
329061d3e2f9 ("io_uring: move poll handling into its own file")
cfd22e6b3319 ("io_uring: add opcode name to io_op_defs")
92ac8beaea1f ("io_uring: include and forward-declaration sanitation")
c9f06aa7de15 ("io_uring: move io_uring_task (tctx) helpers into its own file")
a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
e5550a1447bf ("io_uring: use io_is_uring_fops() consistently")
17437f311490 ("io_uring: move SQPOLL related handling into its own file")
59915143e89f ("io_uring: move timeout opcodes and handling into its own file")
e418bbc97bff ("io_uring: move our reference counting into a header")
36404b09aa60 ("io_uring: move msg_ring into its own file")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
99f15d8d6136 ("io_uring: move uring_cmd handling to its own file")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 705318a99a138c29a512a72c3e0043b3cd7f55f4 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Wed, 6 Dec 2023 13:26:47 +0000
Subject: [PATCH] io_uring/af_unix: disable sending io_uring over sockets
File reference cycles have caused lots of problems for io_uring
in the past, and it still doesn't work exactly right and races with
unix_stream_read_generic(). The safest fix would be to completely
disallow sending io_uring files over sockets via SCM_RIGHTS, so there
are no possible cycles involving registered files, thus rendering
SCM accounting on the io_uring side unnecessary.
Cc: <stable(a)vger.kernel.org>
Fixes: 0091bfc81741b ("io_uring/af_unix: defer registered files gc to io_uring release")
Reported-and-suggested-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/c716c88321939156909cfa1bd8b0faaf1c804103.17018687…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 8625181fb87a..08ac0d8e07ef 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -77,17 +77,10 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
int __io_scm_file_account(struct io_ring_ctx *ctx, struct file *file);
-#if defined(CONFIG_UNIX)
-static inline bool io_file_need_scm(struct file *filp)
-{
- return !!unix_get_socket(filp);
-}
-#else
static inline bool io_file_need_scm(struct file *filp)
{
return false;
}
-#endif
static inline int io_scm_file_account(struct io_ring_ctx *ctx,
struct file *file)
diff --git a/net/core/scm.c b/net/core/scm.c
index 880027ecf516..7dc47c17d863 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -26,6 +26,7 @@
#include <linux/nsproxy.h>
#include <linux/slab.h>
#include <linux/errqueue.h>
+#include <linux/io_uring.h>
#include <linux/uaccess.h>
@@ -103,6 +104,11 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
if (fd < 0 || !(file = fget_raw(fd)))
return -EBADF;
+ /* don't allow io_uring files */
+ if (io_uring_get_socket(file)) {
+ fput(file);
+ return -EINVAL;
+ }
*fpp++ = file;
fpl->count++;
}
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x d6a57588666301acd9d42d3b00d74240964f07f6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120953-unaligned-chowtime-dd0c@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
d6a575886663 ("drm/amdgpu: disable MCBP by default")
4e8303cf2c4d ("drm/amdgpu: Use function for IP version check")
6b7d211740da ("drm/amdgpu: Fix refclk reporting for SMU v13.0.6")
1b8e56b99459 ("drm/amdgpu: Restrict bootloader wait to SMUv13.0.6")
983ac45a06ae ("drm/amdgpu: update SET_HW_RESOURCES definition for UMSCH")
822f7808291f ("drm/amdgpu/discovery: enable UMSCH 4.0 in IP discovery")
3488c79beafa ("drm/amdgpu: add initial support for UMSCH")
2da1b04a2096 ("drm/amdgpu: add UMSCH 4.0 api definition")
3ee8fb7005ef ("drm/amdgpu: enable VPE for VPE 6.1.0")
9d4346bdbc64 ("drm/amdgpu: add VPE 6.1.0 support")
e370f8f38976 ("drm/amdgpu: Add bootloader wait for PSP v13")
aba2be41470a ("drm/amdgpu: add mmhub 3.3.0 support")
15e7cbd91de6 ("drm/amdgpu/gfx11: initialize gfx11.5.0")
f56c1941ebb7 ("drm/amdgpu: use 6.1.0 register offset for HDP CLK_CNTL")
15c5c5f57514 ("drm/amdgpu: Add bootloader status check")
3cce0bfcd0f9 ("drm/amd/display: Enable Replay for static screen use cases")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d6a57588666301acd9d42d3b00d74240964f07f6 Mon Sep 17 00:00:00 2001
From: Jiadong Zhu <Jiadong.Zhu(a)amd.com>
Date: Fri, 1 Dec 2023 08:38:15 +0800
Subject: [PATCH] drm/amdgpu: disable MCBP by default
Disable MCBP (mid command buffer preemption) by default, as old Mesa
hangs with it. We should not enable a feature that breaks old usermode
drivers.
Fixes: 50a7c8765ca6 ("drm/amdgpu: enable mcbp by default on gfx9")
Signed-off-by: Jiadong Zhu <Jiadong.Zhu(a)amd.com>
Acked-by: Alex Deucher <alexander.deucher(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5c0817cbc7c2..1cc1ae2743c1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3791,10 +3791,6 @@ static void amdgpu_device_set_mcbp(struct amdgpu_device *adev)
adev->gfx.mcbp = true;
else if (amdgpu_mcbp == 0)
adev->gfx.mcbp = false;
- else if ((amdgpu_ip_version(adev, GC_HWIP, 0) >= IP_VERSION(9, 0, 0)) &&
- (amdgpu_ip_version(adev, GC_HWIP, 0) < IP_VERSION(10, 0, 0)) &&
- adev->gfx.num_gfx_rings)
- adev->gfx.mcbp = true;
if (amdgpu_sriov_vf(adev))
adev->gfx.mcbp = true;
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 6fce23a4d8c5f93bf80b7f122449fbb97f1e40dd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120932-suffocate-neuter-c5d7@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
6fce23a4d8c5 ("drm/amdgpu: Restrict extended wait to PSP v13.0.6")
d8c1925ba8cd ("drm/amdgpu: update retry times for psp BL wait")
fc5988907156 ("drm/amdgpu: update retry times for psp vmbx wait")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6fce23a4d8c5f93bf80b7f122449fbb97f1e40dd Mon Sep 17 00:00:00 2001
From: Lijo Lazar <lijo.lazar(a)amd.com>
Date: Wed, 29 Nov 2023 18:06:55 +0530
Subject: [PATCH] drm/amdgpu: Restrict extended wait to PSP v13.0.6
Only PSP v13.0.6 SOCs take a longer time to reach steady state. Other
PSP v13 based SOCs don't need the extended wait. Also, reduce the
PSP v13.0.6 wait time.
Cc: stable(a)vger.kernel.org
Fixes: fc5988907156 ("drm/amdgpu: update retry times for psp vmbx wait")
Fixes: d8c1925ba8cd ("drm/amdgpu: update retry times for psp BL wait")
Link: https://lore.kernel.org/amd-gfx/34dd4c66-f7bf-44aa-af8f-c82889dd652c@amd.co…
Signed-off-by: Lijo Lazar <lijo.lazar(a)amd.com>
Reviewed-by: Asad Kamal <asad.kamal(a)amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
index 5f46877f78cf..df1844d0800f 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
@@ -60,7 +60,7 @@ MODULE_FIRMWARE("amdgpu/psp_14_0_0_ta.bin");
#define GFX_CMD_USB_PD_USE_LFB 0x480
/* Retry times for vmbx ready wait */
-#define PSP_VMBX_POLLING_LIMIT 20000
+#define PSP_VMBX_POLLING_LIMIT 3000
/* VBIOS gfl defines */
#define MBOX_READY_MASK 0x80000000
@@ -161,14 +161,18 @@ static int psp_v13_0_wait_for_vmbx_ready(struct psp_context *psp)
static int psp_v13_0_wait_for_bootloader(struct psp_context *psp)
{
struct amdgpu_device *adev = psp->adev;
- int retry_loop, ret;
+ int retry_loop, retry_cnt, ret;
+ retry_cnt =
+ (amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 6)) ?
+ PSP_VMBX_POLLING_LIMIT :
+ 10;
/* Wait for bootloader to signify that it is ready having bit 31 of
* C2PMSG_35 set to 1. All other bits are expected to be cleared.
* If there is an error in processing command, bits[7:0] will be set.
* This is applicable for PSP v13.0.6 and newer.
*/
- for (retry_loop = 0; retry_loop < PSP_VMBX_POLLING_LIMIT; retry_loop++) {
+ for (retry_loop = 0; retry_loop < retry_cnt; retry_loop++) {
ret = psp_wait_for(
psp, SOC15_REG_OFFSET(MP0, 0, regMP0_SMN_C2PMSG_35),
0x80000000, 0xffffffff, false);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x f42ce5f087eb69e47294ababd2e7e6f88a82d308
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120918-qualify-refutable-6c2a@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
f42ce5f087eb ("mm/memory_hotplug: fix error handling in add_memory_resource()")
1a8c64e11043 ("mm/memory_hotplug: embed vmem_altmap details in memory block")
2d1f649c7c08 ("mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks")
85a2b4b08f20 ("mm/memory_hotplug: allow architecture to override memmap on memory support check")
e3c2bfdd33a3 ("mm/memory_hotplug: allow memmap on memory hotplug request to fallback")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f42ce5f087eb69e47294ababd2e7e6f88a82d308 Mon Sep 17 00:00:00 2001
From: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Date: Mon, 20 Nov 2023 15:53:53 +0100
Subject: [PATCH] mm/memory_hotplug: fix error handling in
add_memory_resource()
In add_memory_resource(), creation of memory block devices occurs after
successful call to arch_add_memory(). However, creation of memory block
devices could fail. In that case, arch_remove_memory() is called to
perform necessary cleanup.
Currently with or without altmap support, arch_remove_memory() is always
passed with altmap set to NULL during error handling. This leads to
freeing of struct pages using free_pages(), even though the allocation
might have been performed with altmap support via
altmap_alloc_block_buf().
Fix the error handling by passing altmap in arch_remove_memory(). This
ensures the following:
* When altmap is disabled, deallocation of the struct pages array occurs
via free_pages().
* When altmap is enabled, deallocation occurs via vmem_altmap_free().
Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com
Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: kernel test robot <lkp(a)intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index bb908289679e..7a5fc89a8652 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1458,7 +1458,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/* create memory block devices after memory was added */
ret = create_memory_block_devices(start, size, params.altmap, group);
if (ret) {
- arch_remove_memory(start, size, NULL);
+ arch_remove_memory(start, size, params.altmap);
goto error_free;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x f42ce5f087eb69e47294ababd2e7e6f88a82d308
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120907-trophy-subdivide-10be@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
f42ce5f087eb ("mm/memory_hotplug: fix error handling in add_memory_resource()")
1a8c64e11043 ("mm/memory_hotplug: embed vmem_altmap details in memory block")
2d1f649c7c08 ("mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks")
85a2b4b08f20 ("mm/memory_hotplug: allow architecture to override memmap on memory support check")
e3c2bfdd33a3 ("mm/memory_hotplug: allow memmap on memory hotplug request to fallback")
66361095129b ("mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory")
78f39084b41d ("mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl")
9c54c522bb76 ("mm: hugetlb_vmemmap: use kstrtobool for hugetlb_vmemmap param parsing")
6e02c46b4d97 ("mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=on")
0effdf461c57 ("mm: hugetlb_vmemmap: disable hugetlb_optimize_vmemmap when struct page crosses page boundaries")
47010c040dec ("mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*")
f10f1442c309 ("mm: hugetlb_vmemmap: cleanup hugetlb_free_vmemmap_enabled*")
5981611d0a00 ("mm: hugetlb_vmemmap: cleanup hugetlb_vmemmap related functions")
1e63ac088f20 ("arm64: mm: hugetlb: enable HUGETLB_PAGE_FREE_VMEMMAP for arm64")
2e4ec02bbcc0 ("mm: hugetlb_vmemmap: introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP")
2aa065f7afb2 ("drivers/base/memory: clarify adding and removing of memory blocks")
395f6081bad4 ("drivers/base/memory: determine and store zone for single-zone memory blocks")
7ea0d2d79da0 ("drivers/base/memory: add memory block to memory group after registration succeeded")
e54084173487 ("mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP")
a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f42ce5f087eb69e47294ababd2e7e6f88a82d308 Mon Sep 17 00:00:00 2001
From: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Date: Mon, 20 Nov 2023 15:53:53 +0100
Subject: [PATCH] mm/memory_hotplug: fix error handling in
add_memory_resource()
In add_memory_resource(), creation of memory block devices occurs after
successful call to arch_add_memory(). However, creation of memory block
devices could fail. In that case, arch_remove_memory() is called to
perform necessary cleanup.
Currently with or without altmap support, arch_remove_memory() is always
passed with altmap set to NULL during error handling. This leads to
freeing of struct pages using free_pages(), even though the allocation
might have been performed with altmap support via
altmap_alloc_block_buf().
Fix the error handling by passing altmap in arch_remove_memory(). This
ensures the following:
* When altmap is disabled, deallocation of the struct pages array occurs
via free_pages().
* When altmap is enabled, deallocation occurs via vmem_altmap_free().
Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com
Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: kernel test robot <lkp(a)intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index bb908289679e..7a5fc89a8652 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1458,7 +1458,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/* create memory block devices after memory was added */
ret = create_memory_block_devices(start, size, params.altmap, group);
if (ret) {
- arch_remove_memory(start, size, NULL);
+ arch_remove_memory(start, size, params.altmap);
goto error_free;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x f42ce5f087eb69e47294ababd2e7e6f88a82d308
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120904-duffel-joining-636f@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
f42ce5f087eb ("mm/memory_hotplug: fix error handling in add_memory_resource()")
1a8c64e11043 ("mm/memory_hotplug: embed vmem_altmap details in memory block")
2d1f649c7c08 ("mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks")
85a2b4b08f20 ("mm/memory_hotplug: allow architecture to override memmap on memory support check")
e3c2bfdd33a3 ("mm/memory_hotplug: allow memmap on memory hotplug request to fallback")
66361095129b ("mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory")
78f39084b41d ("mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl")
9c54c522bb76 ("mm: hugetlb_vmemmap: use kstrtobool for hugetlb_vmemmap param parsing")
6e02c46b4d97 ("mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=on")
0effdf461c57 ("mm: hugetlb_vmemmap: disable hugetlb_optimize_vmemmap when struct page crosses page boundaries")
47010c040dec ("mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*")
f10f1442c309 ("mm: hugetlb_vmemmap: cleanup hugetlb_free_vmemmap_enabled*")
5981611d0a00 ("mm: hugetlb_vmemmap: cleanup hugetlb_vmemmap related functions")
1e63ac088f20 ("arm64: mm: hugetlb: enable HUGETLB_PAGE_FREE_VMEMMAP for arm64")
2e4ec02bbcc0 ("mm: hugetlb_vmemmap: introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP")
2aa065f7afb2 ("drivers/base/memory: clarify adding and removing of memory blocks")
395f6081bad4 ("drivers/base/memory: determine and store zone for single-zone memory blocks")
7ea0d2d79da0 ("drivers/base/memory: add memory block to memory group after registration succeeded")
e54084173487 ("mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP")
a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f42ce5f087eb69e47294ababd2e7e6f88a82d308 Mon Sep 17 00:00:00 2001
From: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Date: Mon, 20 Nov 2023 15:53:53 +0100
Subject: [PATCH] mm/memory_hotplug: fix error handling in
add_memory_resource()
In add_memory_resource(), creation of memory block devices occurs after
successful call to arch_add_memory(). However, creation of memory block
devices could fail. In that case, arch_remove_memory() is called to
perform necessary cleanup.
Currently with or without altmap support, arch_remove_memory() is always
passed with altmap set to NULL during error handling. This leads to
freeing of struct pages using free_pages(), even though the allocation
might have been performed with altmap support via
altmap_alloc_block_buf().
Fix the error handling by passing altmap in arch_remove_memory(). This
ensures the following:
* When altmap is disabled, deallocation of the struct pages array occurs
via free_pages().
* When altmap is enabled, deallocation occurs via vmem_altmap_free().
Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com
Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: kernel test robot <lkp(a)intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index bb908289679e..7a5fc89a8652 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1458,7 +1458,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/* create memory block devices after memory was added */
ret = create_memory_block_devices(start, size, params.altmap, group);
if (ret) {
- arch_remove_memory(start, size, NULL);
+ arch_remove_memory(start, size, params.altmap);
goto error_free;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 001002e73712cdf6b8d9a103648cda3040ad7647
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120927-immobile-half-41d8@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
001002e73712 ("mm/memory_hotplug: add missing mem_hotplug_lock")
1a8c64e11043 ("mm/memory_hotplug: embed vmem_altmap details in memory block")
2d1f649c7c08 ("mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks")
85a2b4b08f20 ("mm/memory_hotplug: allow architecture to override memmap on memory support check")
e3c2bfdd33a3 ("mm/memory_hotplug: allow memmap on memory hotplug request to fallback")
5033091de814 ("mm/hwpoison: introduce per-memory_block hwpoison counter")
a46c9304b4bb ("mm/hwpoison: pass pfn to num_poisoned_pages_*()")
d027122d8363 ("mm/hwpoison: move definitions of num_poisoned_pages_* to memory-failure.c")
e591ef7d96d6 ("mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage")
0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
eba4d770efc8 ("mm/swap: comment all the ifdef in swapops.h")
21c9e90ab9a4 ("mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages")
f36a5543a748 ("mm, hwpoison: fix page refcnt leaking in try_memory_failure_hugetlb()")
7453bf621cfa ("mm, hwpoison: make __page_handle_poison returns int")
38f6d29397cc ("mm, hwpoison: set PG_hwpoison for busy hugetlb pages")
ac5fcde0a96a ("mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage")
161df60e9e89 ("mm, hwpoison, hugetlb: support saving mechanism of raw error pages")
6213834c10de ("mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability")
998a2997885f ("mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c")
dff033818a06 ("mm: hugetlb_vmemmap: introduce the name HVO")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 001002e73712cdf6b8d9a103648cda3040ad7647 Mon Sep 17 00:00:00 2001
From: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Date: Mon, 20 Nov 2023 15:53:52 +0100
Subject: [PATCH] mm/memory_hotplug: add missing mem_hotplug_lock
From Documentation/core-api/memory-hotplug.rst:
When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock
in write mode to serialise memory hotplug (e.g. access to global/zone
variables).
mhp_(de)init_memmap_on_memory() functions can change zone stats and
struct page content, but they are currently called w/o the
mem_hotplug_lock.
When memory block is being offlined and when kmemleak goes through each
populated zone, the following theoretical race conditions could occur:
CPU 0:                                       | CPU 1:
memory_offline()                             |
-> offline_pages()                           |
   -> mem_hotplug_begin()                    |
      ...                                    |
   -> mem_hotplug_done()                     |
                                             | kmemleak_scan()
                                             | -> get_online_mems()
                                             |    ...
-> mhp_deinit_memmap_on_memory()             |
   [not protected by mem_hotplug_begin/done()]|
   Marks memory section as offline,          | Retrieves zone_start_pfn
   poisons vmemmap struct pages and updates  | and struct page members.
   the zone related data                     |
                                             |    ...
                                             | -> put_online_mems()
Fix this by ensuring mem_hotplug_lock is taken before performing
mhp_init_memmap_on_memory(). Also ensure that
mhp_deinit_memmap_on_memory() holds the lock.
online/offline_pages() are currently only called from
memory_block_online/offline(), so it is safe to move the locking there.
Link: https://lkml.kernel.org/r/20231120145354.308999-2-sumanthk@linux.ibm.com
Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: kernel test robot <lkp(a)intel.com>
Cc: <stable(a)vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f3b9a4d0fa3b..8a13babd826c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -180,6 +180,9 @@ static inline unsigned long memblk_nr_poison(struct memory_block *mem)
}
#endif
+/*
+ * Must acquire mem_hotplug_lock in write mode.
+ */
static int memory_block_online(struct memory_block *mem)
{
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
@@ -204,10 +207,11 @@ static int memory_block_online(struct memory_block *mem)
if (mem->altmap)
nr_vmemmap_pages = mem->altmap->free;
+ mem_hotplug_begin();
if (nr_vmemmap_pages) {
ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
if (ret)
- return ret;
+ goto out;
}
ret = online_pages(start_pfn + nr_vmemmap_pages,
@@ -215,7 +219,7 @@ static int memory_block_online(struct memory_block *mem)
if (ret) {
if (nr_vmemmap_pages)
mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
- return ret;
+ goto out;
}
/*
@@ -227,9 +231,14 @@ static int memory_block_online(struct memory_block *mem)
nr_vmemmap_pages);
mem->zone = zone;
+out:
+ mem_hotplug_done();
return ret;
}
+/*
+ * Must acquire mem_hotplug_lock in write mode.
+ */
static int memory_block_offline(struct memory_block *mem)
{
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
@@ -247,6 +256,7 @@ static int memory_block_offline(struct memory_block *mem)
if (mem->altmap)
nr_vmemmap_pages = mem->altmap->free;
+ mem_hotplug_begin();
if (nr_vmemmap_pages)
adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
-nr_vmemmap_pages);
@@ -258,13 +268,15 @@ static int memory_block_offline(struct memory_block *mem)
if (nr_vmemmap_pages)
adjust_present_page_count(pfn_to_page(start_pfn),
mem->group, nr_vmemmap_pages);
- return ret;
+ goto out;
}
if (nr_vmemmap_pages)
mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
mem->zone = NULL;
+out:
+ mem_hotplug_done();
return ret;
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ab41a511e20a..bb908289679e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1129,6 +1129,9 @@ void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages)
kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
}
+/*
+ * Must be called with mem_hotplug_lock in write mode.
+ */
int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group)
{
@@ -1149,7 +1152,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
!IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
- mem_hotplug_begin();
/* associate pfn range with the zone */
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
@@ -1208,7 +1210,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
writeback_set_ratelimit();
memory_notify(MEM_ONLINE, &arg);
- mem_hotplug_done();
return 0;
failed_addition:
@@ -1217,7 +1218,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
(((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
memory_notify(MEM_CANCEL_ONLINE, &arg);
remove_pfn_range_from_zone(zone, pfn, nr_pages);
- mem_hotplug_done();
return ret;
}
@@ -1863,6 +1863,9 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
return 0;
}
+/*
+ * Must be called with mem_hotplug_lock in write mode.
+ */
int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group)
{
@@ -1885,8 +1888,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
- mem_hotplug_begin();
-
/*
* Don't allow to offline memory blocks that contain holes.
* Consequently, memory blocks with holes can never get onlined
@@ -2031,7 +2032,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
memory_notify(MEM_OFFLINE, &arg);
remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
- mem_hotplug_done();
return 0;
failed_removal_isolated:
@@ -2046,7 +2046,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
(unsigned long long) start_pfn << PAGE_SHIFT,
((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
reason);
- mem_hotplug_done();
return ret;
}
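As a reading aid, the locking pattern the patch establishes can be modeled
in a few lines of user-space C. This is a hedged sketch, not kernel code:
the real mem_hotplug_lock is a percpu rwsem (modeled here with a plain
pthread rwlock), and every *_model name below is invented for illustration.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative stand-in for the kernel's mem_hotplug_lock. */
    static pthread_rwlock_t hotplug_lock_model = PTHREAD_RWLOCK_INITIALIZER;

    static void mem_hotplug_begin_model(void) { pthread_rwlock_wrlock(&hotplug_lock_model); }
    static void mem_hotplug_done_model(void)  { pthread_rwlock_unlock(&hotplug_lock_model); }

    /* Stand-ins for the two steps that previously ran partly unlocked. */
    static int mhp_init_memmap_model(void) { return 0; }
    static int online_pages_model(void)    { return 0; }

    /*
     * After the fix, the memory_block_online() caller holds the lock in
     * write mode across *both* steps, so a reader doing the equivalent
     * of get_online_mems() (e.g. kmemleak_scan()) can never observe the
     * half-initialized state in between.
     */
    static int memory_block_online_model(void)
    {
            int ret;

            mem_hotplug_begin_model();
            ret = mhp_init_memmap_model();
            if (ret)
                    goto out;
            ret = online_pages_model();  /* no longer takes the lock itself */
    out:
            mem_hotplug_done_model();
            return ret;
    }

    int main(void)
    {
            printf("online: %d\n", memory_block_online_model());
            return 0;
    }

The point is simply that mhp_init_memmap_on_memory() and online_pages()
now execute inside one write-side critical section, matching the diff above.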
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 001002e73712cdf6b8d9a103648cda3040ad7647
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120925-hazard-mustard-8df5@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
001002e73712 ("mm/memory_hotplug: add missing mem_hotplug_lock")
1a8c64e11043 ("mm/memory_hotplug: embed vmem_altmap details in memory block")
2d1f649c7c08 ("mm/memory_hotplug: support memmap_on_memory when memmap is not aligned to pageblocks")
85a2b4b08f20 ("mm/memory_hotplug: allow architecture to override memmap on memory support check")
e3c2bfdd33a3 ("mm/memory_hotplug: allow memmap on memory hotplug request to fallback")
5033091de814 ("mm/hwpoison: introduce per-memory_block hwpoison counter")
a46c9304b4bb ("mm/hwpoison: pass pfn to num_poisoned_pages_*()")
d027122d8363 ("mm/hwpoison: move definitions of num_poisoned_pages_* to memory-failure.c")
e591ef7d96d6 ("mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 001002e73712cdf6b8d9a103648cda3040ad7647 Mon Sep 17 00:00:00 2001
From: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Date: Mon, 20 Nov 2023 15:53:52 +0100
Subject: [PATCH] mm/memory_hotplug: add missing mem_hotplug_lock
From Documentation/core-api/memory-hotplug.rst:
When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock
in write mode to serialise memory hotplug (e.g. access to global/zone
variables).
The mhp_(de)init_memmap_on_memory() functions can change zone stats and
struct page content, but they are currently called without the
mem_hotplug_lock.
When a memory block is being offlined while kmemleak goes through each
populated zone, the following theoretical race condition could occur:
CPU 0: | CPU 1:
memory_offline() |
-> offline_pages() |
-> mem_hotplug_begin() |
... |
-> mem_hotplug_done() |
| kmemleak_scan()
| -> get_online_mems()
| ...
-> mhp_deinit_memmap_on_memory() |
[not protected by mem_hotplug_begin/done()]|
Marks memory section as offline, | Retrieves zone_start_pfn
poisons vmemmap struct pages and updates | and struct page members.
the zone related data |
| ...
| -> put_online_mems()
Fix this by ensuring mem_hotplug_lock is taken before performing
mhp_init_memmap_on_memory(). Also ensure that
mhp_deinit_memmap_on_memory() holds the lock.
online/offline_pages() are currently only called from
memory_block_online/offline(), so it is safe to move the locking there.
Link: https://lkml.kernel.org/r/20231120145354.308999-2-sumanthk@linux.ibm.com
Fixes: a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk(a)linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: kernel test robot <lkp(a)intel.com>
Cc: <stable(a)vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f3b9a4d0fa3b..8a13babd826c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -180,6 +180,9 @@ static inline unsigned long memblk_nr_poison(struct memory_block *mem)
}
#endif
+/*
+ * Must acquire mem_hotplug_lock in write mode.
+ */
static int memory_block_online(struct memory_block *mem)
{
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
@@ -204,10 +207,11 @@ static int memory_block_online(struct memory_block *mem)
if (mem->altmap)
nr_vmemmap_pages = mem->altmap->free;
+ mem_hotplug_begin();
if (nr_vmemmap_pages) {
ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
if (ret)
- return ret;
+ goto out;
}
ret = online_pages(start_pfn + nr_vmemmap_pages,
@@ -215,7 +219,7 @@ static int memory_block_online(struct memory_block *mem)
if (ret) {
if (nr_vmemmap_pages)
mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
- return ret;
+ goto out;
}
/*
@@ -227,9 +231,14 @@ static int memory_block_online(struct memory_block *mem)
nr_vmemmap_pages);
mem->zone = zone;
+out:
+ mem_hotplug_done();
return ret;
}
+/*
+ * Must acquire mem_hotplug_lock in write mode.
+ */
static int memory_block_offline(struct memory_block *mem)
{
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
@@ -247,6 +256,7 @@ static int memory_block_offline(struct memory_block *mem)
if (mem->altmap)
nr_vmemmap_pages = mem->altmap->free;
+ mem_hotplug_begin();
if (nr_vmemmap_pages)
adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
-nr_vmemmap_pages);
@@ -258,13 +268,15 @@ static int memory_block_offline(struct memory_block *mem)
if (nr_vmemmap_pages)
adjust_present_page_count(pfn_to_page(start_pfn),
mem->group, nr_vmemmap_pages);
- return ret;
+ goto out;
}
if (nr_vmemmap_pages)
mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
mem->zone = NULL;
+out:
+ mem_hotplug_done();
return ret;
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ab41a511e20a..bb908289679e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1129,6 +1129,9 @@ void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages)
kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
}
+/*
+ * Must be called with mem_hotplug_lock in write mode.
+ */
int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group)
{
@@ -1149,7 +1152,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
!IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
- mem_hotplug_begin();
/* associate pfn range with the zone */
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
@@ -1208,7 +1210,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
writeback_set_ratelimit();
memory_notify(MEM_ONLINE, &arg);
- mem_hotplug_done();
return 0;
failed_addition:
@@ -1217,7 +1218,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
(((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
memory_notify(MEM_CANCEL_ONLINE, &arg);
remove_pfn_range_from_zone(zone, pfn, nr_pages);
- mem_hotplug_done();
return ret;
}
@@ -1863,6 +1863,9 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
return 0;
}
+/*
+ * Must be called with mem_hotplug_lock in write mode.
+ */
int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group)
{
@@ -1885,8 +1888,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
- mem_hotplug_begin();
-
/*
* Don't allow to offline memory blocks that contain holes.
* Consequently, memory blocks with holes can never get onlined
@@ -2031,7 +2032,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
memory_notify(MEM_OFFLINE, &arg);
remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
- mem_hotplug_done();
return 0;
failed_removal_isolated:
@@ -2046,7 +2046,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
(unsigned long long) start_pfn << PAGE_SHIFT,
((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
reason);
- mem_hotplug_done();
return ret;
}
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 487635756198cad563feb47539c6a37ea57f1dae
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120949-antiquity-primer-60c1@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
487635756198 ("parisc: Fix asm operand number out of range build error in bug table")
43266838515d ("parisc: Reduce size of the bug_table on 64-bit kernel by half")
fe76a1349f23 ("parisc: Use natural CPU alignment for bug_table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 487635756198cad563feb47539c6a37ea57f1dae Mon Sep 17 00:00:00 2001
From: Helge Deller <deller(a)gmx.de>
Date: Mon, 27 Nov 2023 10:39:26 +0100
Subject: [PATCH] parisc: Fix asm operand number out of range build error in
bug table
The build is broken if CONFIG_DEBUG_BUGVERBOSE=n.
Fix it by using the correct asm operand number.
Signed-off-by: Helge Deller <deller(a)gmx.de>
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
Fixes: fe76a1349f23 ("parisc: Use natural CPU alignment for bug_table")
Cc: stable(a)vger.kernel.org # v6.0+
diff --git a/arch/parisc/include/asm/bug.h b/arch/parisc/include/asm/bug.h
index 1641ff9a8b83..833555f74ffa 100644
--- a/arch/parisc/include/asm/bug.h
+++ b/arch/parisc/include/asm/bug.h
@@ -71,7 +71,7 @@
asm volatile("\n" \
"1:\t" PARISC_BUG_BREAK_ASM "\n" \
"\t.pushsection __bug_table,\"a\"\n" \
- "\t.align %2\n" \
+ "\t.align 4\n" \
"2:\t" __BUG_REL(1b) "\n" \
"\t.short %0\n" \
"\t.blockz %1-4-2\n" \
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 4b0768b6556af56ee9b7cf4e68452a2b6289ae45
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120919-consensus-dimmer-f852@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
4b0768b6556a ("r8169: fix rtl8125b PAUSE frames blasting when suspended")
efc37109c780 ("r8169: remove support for chip version 60")
ebe598985711 ("r8169: remove support for chip versions 45 and 47")
68650b4e6c13 ("r8169: support L1.2 control on RTL8168h")
c217ab7a3961 ("r8169: enable ASPM L1.2 if system vendor flags it as safe")
4b5f82f6aaef ("r8169: enable ASPM L1/L1.1 from RTL8168h")
18a9eae240cb ("r8169: enable ASPM L0s state")
e6d6ca6e1204 ("r8169: Add support for another RTL8168FP")
e0d38b588075 ("r8169: improve DASH support")
c6cff9dfebb3 ("r8169: move ERI access functions to avoid forward declaration")
ae0d0bb29b31 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4b0768b6556af56ee9b7cf4e68452a2b6289ae45 Mon Sep 17 00:00:00 2001
From: ChunHao Lin <hau(a)realtek.com>
Date: Wed, 29 Nov 2023 23:53:50 +0800
Subject: [PATCH] r8169: fix rtl8125b PAUSE frames blasting when suspended
When the FIFO reaches a near-full state, the device will issue a pause
frame. If the pause slot is enabled (set to 1), the device will issue the
pause frame only once. But if the pause slot is disabled (set to 0), the
device will keep sending pause frames until the FIFO reaches a near-empty
state.
When the pause slot is disabled and there is no one to handle received
packets, the device FIFO will reach a near-full state and keep sending
pause frames. That will impact the entire local area network.
This issue can be reproduced on a Chromebox (not a Chromebook) in
developer mode running a test image (and a v5.10 kernel):
1) ping -f $CHROMEBOX (from a workstation on the same local network)
2) run "powerd_dbus_suspend" from the command line on the $CHROMEBOX
3) ping $ROUTER (wait until the ping fails from the workstation)
It takes about 20-30 seconds after step 2 for the local network to
stop working.
Fix this issue by enabling the pause slot so that only one pause frame is
sent when the FIFO reaches a near-full state.
Fixes: f1bce4ad2f1c ("r8169: add support for RTL8125")
Reported-by: Grant Grundler <grundler(a)chromium.org>
Tested-by: Grant Grundler <grundler(a)chromium.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: ChunHao Lin <hau(a)realtek.com>
Reviewed-by: Jacob Keller <jacob.e.keller(a)intel.com>
Reviewed-by: Heiner Kallweit <hkallweit1(a)gmail.com>
Link: https://lore.kernel.org/r/20231129155350.5843-1-hau@realtek.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 62cabeeb842a..bb787a52bc75 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -196,6 +196,7 @@ enum rtl_registers {
/* No threshold before first PCI xfer */
#define RX_FIFO_THRESH (7 << RXCFG_FIFO_SHIFT)
#define RX_EARLY_OFF (1 << 11)
+#define RX_PAUSE_SLOT_ON (1 << 11) /* 8125b and later */
#define RXCFG_DMA_SHIFT 8
/* Unlimited maximum PCI burst. */
#define RX_DMA_BURST (7 << RXCFG_DMA_SHIFT)
@@ -2306,9 +2307,13 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_53:
RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST | RX_EARLY_OFF);
break;
- case RTL_GIGA_MAC_VER_61 ... RTL_GIGA_MAC_VER_63:
+ case RTL_GIGA_MAC_VER_61:
RTL_W32(tp, RxConfig, RX_FETCH_DFLT_8125 | RX_DMA_BURST);
break;
+ case RTL_GIGA_MAC_VER_63:
+ RTL_W32(tp, RxConfig, RX_FETCH_DFLT_8125 | RX_DMA_BURST |
+ RX_PAUSE_SLOT_ON);
+ break;
default:
RTL_W32(tp, RxConfig, RX128_INT_EN | RX_DMA_BURST);
break;
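One detail worth noting in the hunk above: RX_EARLY_OFF and
RX_PAUSE_SLOT_ON are the same bit (1 << 11) of RxConfig with a
generation-dependent meaning, which is why the fix splits
RTL_GIGA_MAC_VER_61 and RTL_GIGA_MAC_VER_63 into separate cases instead
of keeping the shared range. A tiny illustrative sketch (plain C, not
driver code; the register write is faked with a local variable):

    #include <stdio.h>
    #include <stdint.h>

    #define RX_EARLY_OFF     (1u << 11)  /* meaning of bit 11 on older chips */
    #define RX_PAUSE_SLOT_ON (1u << 11)  /* same bit, 8125b and later */

    int main(void)
    {
            uint32_t rxcfg = 0;

            /* 8125b (VER_63): enable the pause slot so the NIC emits a
             * single pause frame on FIFO near-full instead of a flood. */
            rxcfg |= RX_PAUSE_SLOT_ON;

            printf("RxConfig = %#x (bit 11 set)\n", rxcfg);
            return 0;
    }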
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 4b0768b6556af56ee9b7cf4e68452a2b6289ae45
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120918-coma-plentiful-2159@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
4b0768b6556a ("r8169: fix rtl8125b PAUSE frames blasting when suspended")
efc37109c780 ("r8169: remove support for chip version 60")
ebe598985711 ("r8169: remove support for chip versions 45 and 47")
68650b4e6c13 ("r8169: support L1.2 control on RTL8168h")
c217ab7a3961 ("r8169: enable ASPM L1.2 if system vendor flags it as safe")
4b5f82f6aaef ("r8169: enable ASPM L1/L1.1 from RTL8168h")
18a9eae240cb ("r8169: enable ASPM L0s state")
e6d6ca6e1204 ("r8169: Add support for another RTL8168FP")
e0d38b588075 ("r8169: improve DASH support")
c6cff9dfebb3 ("r8169: move ERI access functions to avoid forward declaration")
ae0d0bb29b31 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4b0768b6556af56ee9b7cf4e68452a2b6289ae45 Mon Sep 17 00:00:00 2001
From: ChunHao Lin <hau(a)realtek.com>
Date: Wed, 29 Nov 2023 23:53:50 +0800
Subject: [PATCH] r8169: fix rtl8125b PAUSE frames blasting when suspended
When the FIFO reaches a near-full state, the device will issue a pause
frame. If the pause slot is enabled (set to 1), the device will issue the
pause frame only once. But if the pause slot is disabled (set to 0), the
device will keep sending pause frames until the FIFO reaches a near-empty
state.
When the pause slot is disabled and there is no one to handle received
packets, the device FIFO will reach a near-full state and keep sending
pause frames. That will impact the entire local area network.
This issue can be reproduced on a Chromebox (not a Chromebook) in
developer mode running a test image (and a v5.10 kernel):
1) ping -f $CHROMEBOX (from a workstation on the same local network)
2) run "powerd_dbus_suspend" from the command line on the $CHROMEBOX
3) ping $ROUTER (wait until the ping fails from the workstation)
It takes about 20-30 seconds after step 2 for the local network to
stop working.
Fix this issue by enabling the pause slot so that only one pause frame is
sent when the FIFO reaches a near-full state.
Fixes: f1bce4ad2f1c ("r8169: add support for RTL8125")
Reported-by: Grant Grundler <grundler(a)chromium.org>
Tested-by: Grant Grundler <grundler(a)chromium.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: ChunHao Lin <hau(a)realtek.com>
Reviewed-by: Jacob Keller <jacob.e.keller(a)intel.com>
Reviewed-by: Heiner Kallweit <hkallweit1(a)gmail.com>
Link: https://lore.kernel.org/r/20231129155350.5843-1-hau@realtek.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 62cabeeb842a..bb787a52bc75 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -196,6 +196,7 @@ enum rtl_registers {
/* No threshold before first PCI xfer */
#define RX_FIFO_THRESH (7 << RXCFG_FIFO_SHIFT)
#define RX_EARLY_OFF (1 << 11)
+#define RX_PAUSE_SLOT_ON (1 << 11) /* 8125b and later */
#define RXCFG_DMA_SHIFT 8
/* Unlimited maximum PCI burst. */
#define RX_DMA_BURST (7 << RXCFG_DMA_SHIFT)
@@ -2306,9 +2307,13 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_53:
RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST | RX_EARLY_OFF);
break;
- case RTL_GIGA_MAC_VER_61 ... RTL_GIGA_MAC_VER_63:
+ case RTL_GIGA_MAC_VER_61:
RTL_W32(tp, RxConfig, RX_FETCH_DFLT_8125 | RX_DMA_BURST);
break;
+ case RTL_GIGA_MAC_VER_63:
+ RTL_W32(tp, RxConfig, RX_FETCH_DFLT_8125 | RX_DMA_BURST |
+ RX_PAUSE_SLOT_ON);
+ break;
default:
RTL_W32(tp, RxConfig, RX128_INT_EN | RX_DMA_BURST);
break;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 4b0768b6556af56ee9b7cf4e68452a2b6289ae45
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120916-frigidly-dispatch-b4f7@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
4b0768b6556a ("r8169: fix rtl8125b PAUSE frames blasting when suspended")
efc37109c780 ("r8169: remove support for chip version 60")
ebe598985711 ("r8169: remove support for chip versions 45 and 47")
68650b4e6c13 ("r8169: support L1.2 control on RTL8168h")
c217ab7a3961 ("r8169: enable ASPM L1.2 if system vendor flags it as safe")
4b5f82f6aaef ("r8169: enable ASPM L1/L1.1 from RTL8168h")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4b0768b6556af56ee9b7cf4e68452a2b6289ae45 Mon Sep 17 00:00:00 2001
From: ChunHao Lin <hau(a)realtek.com>
Date: Wed, 29 Nov 2023 23:53:50 +0800
Subject: [PATCH] r8169: fix rtl8125b PAUSE frames blasting when suspended
When the FIFO reaches a near-full state, the device will issue a pause
frame. If the pause slot is enabled (set to 1), the device will issue the
pause frame only once. But if the pause slot is disabled (set to 0), the
device will keep sending pause frames until the FIFO reaches a near-empty
state.
When the pause slot is disabled and there is no one to handle received
packets, the device FIFO will reach a near-full state and keep sending
pause frames. That will impact the entire local area network.
This issue can be reproduced on a Chromebox (not a Chromebook) in
developer mode running a test image (and a v5.10 kernel):
1) ping -f $CHROMEBOX (from a workstation on the same local network)
2) run "powerd_dbus_suspend" from the command line on the $CHROMEBOX
3) ping $ROUTER (wait until the ping fails from the workstation)
It takes about 20-30 seconds after step 2 for the local network to
stop working.
Fix this issue by enabling the pause slot so that only one pause frame is
sent when the FIFO reaches a near-full state.
Fixes: f1bce4ad2f1c ("r8169: add support for RTL8125")
Reported-by: Grant Grundler <grundler(a)chromium.org>
Tested-by: Grant Grundler <grundler(a)chromium.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: ChunHao Lin <hau(a)realtek.com>
Reviewed-by: Jacob Keller <jacob.e.keller(a)intel.com>
Reviewed-by: Heiner Kallweit <hkallweit1(a)gmail.com>
Link: https://lore.kernel.org/r/20231129155350.5843-1-hau@realtek.com
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 62cabeeb842a..bb787a52bc75 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -196,6 +196,7 @@ enum rtl_registers {
/* No threshold before first PCI xfer */
#define RX_FIFO_THRESH (7 << RXCFG_FIFO_SHIFT)
#define RX_EARLY_OFF (1 << 11)
+#define RX_PAUSE_SLOT_ON (1 << 11) /* 8125b and later */
#define RXCFG_DMA_SHIFT 8
/* Unlimited maximum PCI burst. */
#define RX_DMA_BURST (7 << RXCFG_DMA_SHIFT)
@@ -2306,9 +2307,13 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_53:
RTL_W32(tp, RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST | RX_EARLY_OFF);
break;
- case RTL_GIGA_MAC_VER_61 ... RTL_GIGA_MAC_VER_63:
+ case RTL_GIGA_MAC_VER_61:
RTL_W32(tp, RxConfig, RX_FETCH_DFLT_8125 | RX_DMA_BURST);
break;
+ case RTL_GIGA_MAC_VER_63:
+ RTL_W32(tp, RxConfig, RX_FETCH_DFLT_8125 | RX_DMA_BURST |
+ RX_PAUSE_SLOT_ON);
+ break;
default:
RTL_W32(tp, RxConfig, RX128_INT_EN | RX_DMA_BURST);
break;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x b538bf7d0ec11ca49f536dfda742a5f6db90a798
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120910-reputable-mummy-5f86@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
b538bf7d0ec1 ("tracing: Disable snapshot buffer when stopping instance tracers")
13292494379f ("tracing: Make struct ring_buffer less ambiguous")
1c5eb4481e01 ("tracing: Rename trace_buffer to array_buffer")
2d6425af6116 ("tracing: Declare newly exported APIs in include/linux/trace.h")
a47b53e95acc ("tracing: Rename tracing_reset() to tracing_reset_cpu()")
46cc0b44428d ("tracing/snapshot: Resize spare buffer if size changed")
d2d8b146043a ("Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b538bf7d0ec11ca49f536dfda742a5f6db90a798 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Tue, 5 Dec 2023 16:52:11 -0500
Subject: [PATCH] tracing: Disable snapshot buffer when stopping instance
tracers
It used to be that only the top-level instance had a snapshot buffer (for
latency tracers like wakeup and irqsoff), so stopping a tracer in an
instance would not disable the snapshot buffer. This could have some
unintended consequences if the irqsoff tracer is enabled.
Consolidate tracing_start/stop() with tracing_start/stop_tr() so that
all instances behave the same. The tracing_start/stop() functions will
just call their respective tracing_start/stop_tr() with the global trace
array passed in.
Link: https://lkml.kernel.org/r/20231205220011.041220035@goodmis.org
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Fixes: 6d9b3fa5e7f6 ("tracing: Move tracing_max_latency into trace_array")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e978868b1a22..2492c6c76850 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2360,49 +2360,6 @@ int is_tracing_stopped(void)
return global_trace.stop_count;
}
-/**
- * tracing_start - quick start of the tracer
- *
- * If tracing is enabled but was stopped by tracing_stop,
- * this will start the tracer back up.
- */
-void tracing_start(void)
-{
- struct trace_buffer *buffer;
- unsigned long flags;
-
- if (tracing_disabled)
- return;
-
- raw_spin_lock_irqsave(&global_trace.start_lock, flags);
- if (--global_trace.stop_count) {
- if (global_trace.stop_count < 0) {
- /* Someone screwed up their debugging */
- WARN_ON_ONCE(1);
- global_trace.stop_count = 0;
- }
- goto out;
- }
-
- /* Prevent the buffers from switching */
- arch_spin_lock(&global_trace.max_lock);
-
- buffer = global_trace.array_buffer.buffer;
- if (buffer)
- ring_buffer_record_enable(buffer);
-
-#ifdef CONFIG_TRACER_MAX_TRACE
- buffer = global_trace.max_buffer.buffer;
- if (buffer)
- ring_buffer_record_enable(buffer);
-#endif
-
- arch_spin_unlock(&global_trace.max_lock);
-
- out:
- raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
-}
-
static void tracing_start_tr(struct trace_array *tr)
{
struct trace_buffer *buffer;
@@ -2411,25 +2368,70 @@ static void tracing_start_tr(struct trace_array *tr)
if (tracing_disabled)
return;
- /* If global, we need to also start the max tracer */
- if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
- return tracing_start();
-
raw_spin_lock_irqsave(&tr->start_lock, flags);
-
if (--tr->stop_count) {
- if (tr->stop_count < 0) {
+ if (WARN_ON_ONCE(tr->stop_count < 0)) {
/* Someone screwed up their debugging */
- WARN_ON_ONCE(1);
tr->stop_count = 0;
}
goto out;
}
+ /* Prevent the buffers from switching */
+ arch_spin_lock(&tr->max_lock);
+
buffer = tr->array_buffer.buffer;
if (buffer)
ring_buffer_record_enable(buffer);
+#ifdef CONFIG_TRACER_MAX_TRACE
+ buffer = tr->max_buffer.buffer;
+ if (buffer)
+ ring_buffer_record_enable(buffer);
+#endif
+
+ arch_spin_unlock(&tr->max_lock);
+
+ out:
+ raw_spin_unlock_irqrestore(&tr->start_lock, flags);
+}
+
+/**
+ * tracing_start - quick start of the tracer
+ *
+ * If tracing is enabled but was stopped by tracing_stop,
+ * this will start the tracer back up.
+ */
+void tracing_start(void)
+
+{
+ return tracing_start_tr(&global_trace);
+}
+
+static void tracing_stop_tr(struct trace_array *tr)
+{
+ struct trace_buffer *buffer;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&tr->start_lock, flags);
+ if (tr->stop_count++)
+ goto out;
+
+ /* Prevent the buffers from switching */
+ arch_spin_lock(&tr->max_lock);
+
+ buffer = tr->array_buffer.buffer;
+ if (buffer)
+ ring_buffer_record_disable(buffer);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+ buffer = tr->max_buffer.buffer;
+ if (buffer)
+ ring_buffer_record_disable(buffer);
+#endif
+
+ arch_spin_unlock(&tr->max_lock);
+
out:
raw_spin_unlock_irqrestore(&tr->start_lock, flags);
}
@@ -2442,51 +2444,7 @@ static void tracing_start_tr(struct trace_array *tr)
*/
void tracing_stop(void)
{
- struct trace_buffer *buffer;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&global_trace.start_lock, flags);
- if (global_trace.stop_count++)
- goto out;
-
- /* Prevent the buffers from switching */
- arch_spin_lock(&global_trace.max_lock);
-
- buffer = global_trace.array_buffer.buffer;
- if (buffer)
- ring_buffer_record_disable(buffer);
-
-#ifdef CONFIG_TRACER_MAX_TRACE
- buffer = global_trace.max_buffer.buffer;
- if (buffer)
- ring_buffer_record_disable(buffer);
-#endif
-
- arch_spin_unlock(&global_trace.max_lock);
-
- out:
- raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
-}
-
-static void tracing_stop_tr(struct trace_array *tr)
-{
- struct trace_buffer *buffer;
- unsigned long flags;
-
- /* If global, we need to also stop the max tracer */
- if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
- return tracing_stop();
-
- raw_spin_lock_irqsave(&tr->start_lock, flags);
- if (tr->stop_count++)
- goto out;
-
- buffer = tr->array_buffer.buffer;
- if (buffer)
- ring_buffer_record_disable(buffer);
-
- out:
- raw_spin_unlock_irqrestore(&tr->start_lock, flags);
+ return tracing_stop_tr(&global_trace);
}
static int trace_save_cmdline(struct task_struct *tsk)
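The whole refactor above reduces to a single delegation pattern. Here is
a minimal sketch in plain C, assuming nothing beyond the commit text:
all *_model names are invented, and the buffer handling is elided to
comments.

    struct trace_array_model {
            int stop_count;
    };

    static struct trace_array_model global_trace_model;

    /* One per-instance implementation that also covers the snapshot
     * (max) buffer, so instance tracers behave like the top level. */
    static void tracing_stop_tr_model(struct trace_array_model *tr)
    {
            if (tr->stop_count++)
                    return;  /* nested stop: just bump the count */
            /* ... disable the main buffer and, when a max/snapshot
             * buffer exists, disable that one too ... */
    }

    /* The public entry point keeps its signature but delegates,
     * eliminating the duplicated global-only copy of the logic. */
    void tracing_stop_model(void)
    {
            tracing_stop_tr_model(&global_trace_model);
    }

    int main(void)
    {
            tracing_stop_model();
            return 0;
    }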
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x b538bf7d0ec11ca49f536dfda742a5f6db90a798
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120909-tutu-fabulous-be5d@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
b538bf7d0ec1 ("tracing: Disable snapshot buffer when stopping instance tracers")
13292494379f ("tracing: Make struct ring_buffer less ambiguous")
1c5eb4481e01 ("tracing: Rename trace_buffer to array_buffer")
2d6425af6116 ("tracing: Declare newly exported APIs in include/linux/trace.h")
a47b53e95acc ("tracing: Rename tracing_reset() to tracing_reset_cpu()")
46cc0b44428d ("tracing/snapshot: Resize spare buffer if size changed")
d2d8b146043a ("Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b538bf7d0ec11ca49f536dfda742a5f6db90a798 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Tue, 5 Dec 2023 16:52:11 -0500
Subject: [PATCH] tracing: Disable snapshot buffer when stopping instance
tracers
It used to be that only the top-level instance had a snapshot buffer (for
latency tracers like wakeup and irqsoff), so stopping a tracer in an
instance would not disable the snapshot buffer. This could have some
unintended consequences if the irqsoff tracer is enabled.
Consolidate tracing_start/stop() with tracing_start/stop_tr() so that
all instances behave the same. The tracing_start/stop() functions will
just call their respective tracing_start/stop_tr() with the global trace
array passed in.
Link: https://lkml.kernel.org/r/20231205220011.041220035@goodmis.org
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Fixes: 6d9b3fa5e7f6 ("tracing: Move tracing_max_latency into trace_array")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e978868b1a22..2492c6c76850 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2360,49 +2360,6 @@ int is_tracing_stopped(void)
return global_trace.stop_count;
}
-/**
- * tracing_start - quick start of the tracer
- *
- * If tracing is enabled but was stopped by tracing_stop,
- * this will start the tracer back up.
- */
-void tracing_start(void)
-{
- struct trace_buffer *buffer;
- unsigned long flags;
-
- if (tracing_disabled)
- return;
-
- raw_spin_lock_irqsave(&global_trace.start_lock, flags);
- if (--global_trace.stop_count) {
- if (global_trace.stop_count < 0) {
- /* Someone screwed up their debugging */
- WARN_ON_ONCE(1);
- global_trace.stop_count = 0;
- }
- goto out;
- }
-
- /* Prevent the buffers from switching */
- arch_spin_lock(&global_trace.max_lock);
-
- buffer = global_trace.array_buffer.buffer;
- if (buffer)
- ring_buffer_record_enable(buffer);
-
-#ifdef CONFIG_TRACER_MAX_TRACE
- buffer = global_trace.max_buffer.buffer;
- if (buffer)
- ring_buffer_record_enable(buffer);
-#endif
-
- arch_spin_unlock(&global_trace.max_lock);
-
- out:
- raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
-}
-
static void tracing_start_tr(struct trace_array *tr)
{
struct trace_buffer *buffer;
@@ -2411,25 +2368,70 @@ static void tracing_start_tr(struct trace_array *tr)
if (tracing_disabled)
return;
- /* If global, we need to also start the max tracer */
- if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
- return tracing_start();
-
raw_spin_lock_irqsave(&tr->start_lock, flags);
-
if (--tr->stop_count) {
- if (tr->stop_count < 0) {
+ if (WARN_ON_ONCE(tr->stop_count < 0)) {
/* Someone screwed up their debugging */
- WARN_ON_ONCE(1);
tr->stop_count = 0;
}
goto out;
}
+ /* Prevent the buffers from switching */
+ arch_spin_lock(&tr->max_lock);
+
buffer = tr->array_buffer.buffer;
if (buffer)
ring_buffer_record_enable(buffer);
+#ifdef CONFIG_TRACER_MAX_TRACE
+ buffer = tr->max_buffer.buffer;
+ if (buffer)
+ ring_buffer_record_enable(buffer);
+#endif
+
+ arch_spin_unlock(&tr->max_lock);
+
+ out:
+ raw_spin_unlock_irqrestore(&tr->start_lock, flags);
+}
+
+/**
+ * tracing_start - quick start of the tracer
+ *
+ * If tracing is enabled but was stopped by tracing_stop,
+ * this will start the tracer back up.
+ */
+void tracing_start(void)
+
+{
+ return tracing_start_tr(&global_trace);
+}
+
+static void tracing_stop_tr(struct trace_array *tr)
+{
+ struct trace_buffer *buffer;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&tr->start_lock, flags);
+ if (tr->stop_count++)
+ goto out;
+
+ /* Prevent the buffers from switching */
+ arch_spin_lock(&tr->max_lock);
+
+ buffer = tr->array_buffer.buffer;
+ if (buffer)
+ ring_buffer_record_disable(buffer);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+ buffer = tr->max_buffer.buffer;
+ if (buffer)
+ ring_buffer_record_disable(buffer);
+#endif
+
+ arch_spin_unlock(&tr->max_lock);
+
out:
raw_spin_unlock_irqrestore(&tr->start_lock, flags);
}
@@ -2442,51 +2444,7 @@ static void tracing_start_tr(struct trace_array *tr)
*/
void tracing_stop(void)
{
- struct trace_buffer *buffer;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&global_trace.start_lock, flags);
- if (global_trace.stop_count++)
- goto out;
-
- /* Prevent the buffers from switching */
- arch_spin_lock(&global_trace.max_lock);
-
- buffer = global_trace.array_buffer.buffer;
- if (buffer)
- ring_buffer_record_disable(buffer);
-
-#ifdef CONFIG_TRACER_MAX_TRACE
- buffer = global_trace.max_buffer.buffer;
- if (buffer)
- ring_buffer_record_disable(buffer);
-#endif
-
- arch_spin_unlock(&global_trace.max_lock);
-
- out:
- raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
-}
-
-static void tracing_stop_tr(struct trace_array *tr)
-{
- struct trace_buffer *buffer;
- unsigned long flags;
-
- /* If global, we need to also stop the max tracer */
- if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
- return tracing_stop();
-
- raw_spin_lock_irqsave(&tr->start_lock, flags);
- if (tr->stop_count++)
- goto out;
-
- buffer = tr->array_buffer.buffer;
- if (buffer)
- ring_buffer_record_disable(buffer);
-
- out:
- raw_spin_unlock_irqrestore(&tr->start_lock, flags);
+ return tracing_stop_tr(&global_trace);
}
static int trace_save_cmdline(struct task_struct *tsk)
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x d78ab792705c7be1b91243b2544d1a79406a2ad7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120955-jolly-spool-3d4c@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
d78ab792705c ("tracing: Stop current tracer when resizing buffer")
6d98a0f2ac3c ("tracing: Set actual size after ring buffer resize")
1c5eb4481e01 ("tracing: Rename trace_buffer to array_buffer")
a47b53e95acc ("tracing: Rename tracing_reset() to tracing_reset_cpu()")
46cc0b44428d ("tracing/snapshot: Resize spare buffer if size changed")
d2d8b146043a ("Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d78ab792705c7be1b91243b2544d1a79406a2ad7 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Tue, 5 Dec 2023 16:52:10 -0500
Subject: [PATCH] tracing: Stop current tracer when resizing buffer
When the ring buffer is being resized, it can cause side effects for the
running tracer. For instance, there's a race with the irqsoff tracer,
which swaps individual per-CPU buffers between the main buffer and the
snapshot buffer. The resize operation modifies the main buffer and then
the snapshot buffer; if a swap happens in between those two operations,
it will break the tracer.
Simply stop the running tracer before resizing the buffers and enable it
again when finished.
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Fixes: 3928a8a2d9808 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 231c173ec04f..e978868b1a22 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6387,9 +6387,12 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->array_buffer.buffer)
return 0;
+ /* Do not allow tracing while resizing ring buffer */
+ tracing_stop_tr(tr);
+
ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
if (ret < 0)
- return ret;
+ goto out_start;
#ifdef CONFIG_TRACER_MAX_TRACE
if (!tr->current_trace->use_max_tr)
@@ -6417,7 +6420,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
WARN_ON(1);
tracing_disabled = 1;
}
- return ret;
+ goto out_start;
}
update_buffer_entries(&tr->max_buffer, cpu);
@@ -6426,7 +6429,8 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
#endif /* CONFIG_TRACER_MAX_TRACE */
update_buffer_entries(&tr->array_buffer, cpu);
-
+ out_start:
+ tracing_start_tr(tr);
return ret;
}
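Reduced to its skeleton, the control flow of this fix looks as follows.
This is a hedged user-space sketch with invented *_model names, not the
actual kernel function:

    static int tracer_stopped_model;

    static void tracing_stop_tr_model(void)  { tracer_stopped_model = 1; }
    static void tracing_start_tr_model(void) { tracer_stopped_model = 0; }

    static int ring_buffer_resize_model(long size) { return size < 0 ? -1 : 0; }

    /*
     * Every return path now funnels through out_start, so the tracer is
     * always restarted no matter where the resize fails; while it is
     * stopped, the irqsoff buffer swap cannot run mid-resize.
     */
    static int tracing_resize_model(long size)
    {
            int ret;

            tracing_stop_tr_model();        /* quiesce first */

            ret = ring_buffer_resize_model(size);
            if (ret < 0)
                    goto out_start;

            /* ... resize the max/snapshot buffer here as well ... */

    out_start:
            tracing_start_tr_model();       /* always restart */
            return ret;
    }

    int main(void) { return tracing_resize_model(4096); }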
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x d78ab792705c7be1b91243b2544d1a79406a2ad7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120954-satisfy-sample-730b@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
d78ab792705c ("tracing: Stop current tracer when resizing buffer")
6d98a0f2ac3c ("tracing: Set actual size after ring buffer resize")
1c5eb4481e01 ("tracing: Rename trace_buffer to array_buffer")
a47b53e95acc ("tracing: Rename tracing_reset() to tracing_reset_cpu()")
46cc0b44428d ("tracing/snapshot: Resize spare buffer if size changed")
d2d8b146043a ("Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d78ab792705c7be1b91243b2544d1a79406a2ad7 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Tue, 5 Dec 2023 16:52:10 -0500
Subject: [PATCH] tracing: Stop current tracer when resizing buffer
When the ring buffer is being resized, it can cause side effects for the
running tracer. For instance, there's a race with the irqsoff tracer,
which swaps individual per-CPU buffers between the main buffer and the
snapshot buffer. The resize operation modifies the main buffer and then
the snapshot buffer; if a swap happens in between those two operations,
it will break the tracer.
Simply stop the running tracer before resizing the buffers and enable it
again when finished.
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Fixes: 3928a8a2d9808 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 231c173ec04f..e978868b1a22 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6387,9 +6387,12 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->array_buffer.buffer)
return 0;
+ /* Do not allow tracing while resizing ring buffer */
+ tracing_stop_tr(tr);
+
ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
if (ret < 0)
- return ret;
+ goto out_start;
#ifdef CONFIG_TRACER_MAX_TRACE
if (!tr->current_trace->use_max_tr)
@@ -6417,7 +6420,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
WARN_ON(1);
tracing_disabled = 1;
}
- return ret;
+ goto out_start;
}
update_buffer_entries(&tr->max_buffer, cpu);
@@ -6426,7 +6429,8 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
#endif /* CONFIG_TRACER_MAX_TRACE */
update_buffer_entries(&tr->array_buffer, cpu);
-
+ out_start:
+ tracing_start_tr(tr);
return ret;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x d78ab792705c7be1b91243b2544d1a79406a2ad7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120953-gloomily-roping-6f9f@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
d78ab792705c ("tracing: Stop current tracer when resizing buffer")
6d98a0f2ac3c ("tracing: Set actual size after ring buffer resize")
1c5eb4481e01 ("tracing: Rename trace_buffer to array_buffer")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d78ab792705c7be1b91243b2544d1a79406a2ad7 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Tue, 5 Dec 2023 16:52:10 -0500
Subject: [PATCH] tracing: Stop current tracer when resizing buffer
When the ring buffer is being resized, it can cause side effects for the
running tracer. For instance, there's a race with the irqsoff tracer,
which swaps individual per-CPU buffers between the main buffer and the
snapshot buffer. The resize operation modifies the main buffer and then
the snapshot buffer; if a swap happens in between those two operations,
it will break the tracer.
Simply stop the running tracer before resizing the buffers and enable it
again when finished.
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Fixes: 3928a8a2d9808 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 231c173ec04f..e978868b1a22 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6387,9 +6387,12 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->array_buffer.buffer)
return 0;
+ /* Do not allow tracing while resizing ring buffer */
+ tracing_stop_tr(tr);
+
ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
if (ret < 0)
- return ret;
+ goto out_start;
#ifdef CONFIG_TRACER_MAX_TRACE
if (!tr->current_trace->use_max_tr)
@@ -6417,7 +6420,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
WARN_ON(1);
tracing_disabled = 1;
}
- return ret;
+ goto out_start;
}
update_buffer_entries(&tr->max_buffer, cpu);
@@ -6426,7 +6429,8 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
#endif /* CONFIG_TRACER_MAX_TRACE */
update_buffer_entries(&tr->array_buffer, cpu);
-
+ out_start:
+ tracing_start_tr(tr);
return ret;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x d78ab792705c7be1b91243b2544d1a79406a2ad7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120951-growl-jacket-27c1@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
d78ab792705c ("tracing: Stop current tracer when resizing buffer")
6d98a0f2ac3c ("tracing: Set actual size after ring buffer resize")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d78ab792705c7be1b91243b2544d1a79406a2ad7 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Tue, 5 Dec 2023 16:52:10 -0500
Subject: [PATCH] tracing: Stop current tracer when resizing buffer
When the ring buffer is being resized, it can cause side effects for the
running tracer. For instance, there's a race with the irqsoff tracer,
which swaps individual per-CPU buffers between the main buffer and the
snapshot buffer. The resize operation modifies the main buffer and then
the snapshot buffer; if a swap happens in between those two operations,
it will break the tracer.
Simply stop the running tracer before resizing the buffers and enable it
again when finished.
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Fixes: 3928a8a2d9808 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 231c173ec04f..e978868b1a22 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6387,9 +6387,12 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->array_buffer.buffer)
return 0;
+ /* Do not allow tracing while resizing ring buffer */
+ tracing_stop_tr(tr);
+
ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
if (ret < 0)
- return ret;
+ goto out_start;
#ifdef CONFIG_TRACER_MAX_TRACE
if (!tr->current_trace->use_max_tr)
@@ -6417,7 +6420,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
WARN_ON(1);
tracing_disabled = 1;
}
- return ret;
+ goto out_start;
}
update_buffer_entries(&tr->max_buffer, cpu);
@@ -6426,7 +6429,8 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
#endif /* CONFIG_TRACER_MAX_TRACE */
update_buffer_entries(&tr->array_buffer, cpu);
-
+ out_start:
+ tracing_start_tr(tr);
return ret;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x d78ab792705c7be1b91243b2544d1a79406a2ad7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120950-limeade-greedily-88e8@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
d78ab792705c ("tracing: Stop current tracer when resizing buffer")
6d98a0f2ac3c ("tracing: Set actual size after ring buffer resize")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d78ab792705c7be1b91243b2544d1a79406a2ad7 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Tue, 5 Dec 2023 16:52:10 -0500
Subject: [PATCH] tracing: Stop current tracer when resizing buffer
When the ring buffer is being resized, it can cause side effects for the
running tracer. For instance, there's a race with the irqsoff tracer,
which swaps individual per-CPU buffers between the main buffer and the
snapshot buffer. The resize operation modifies the main buffer and then
the snapshot buffer; if a swap happens in between those two operations,
it will break the tracer.
Simply stop the running tracer before resizing the buffers and enable it
again when finished.
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Fixes: 3928a8a2d9808 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 231c173ec04f..e978868b1a22 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6387,9 +6387,12 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->array_buffer.buffer)
return 0;
+ /* Do not allow tracing while resizing ring buffer */
+ tracing_stop_tr(tr);
+
ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
if (ret < 0)
- return ret;
+ goto out_start;
#ifdef CONFIG_TRACER_MAX_TRACE
if (!tr->current_trace->use_max_tr)
@@ -6417,7 +6420,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
WARN_ON(1);
tracing_disabled = 1;
}
- return ret;
+ goto out_start;
}
update_buffer_entries(&tr->max_buffer, cpu);
@@ -6426,7 +6429,8 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
#endif /* CONFIG_TRACER_MAX_TRACE */
update_buffer_entries(&tr->array_buffer, cpu);
-
+ out_start:
+ tracing_start_tr(tr);
return ret;
}
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x d78ab792705c7be1b91243b2544d1a79406a2ad7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120949-giddily-cogwheel-28e8@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
d78ab792705c ("tracing: Stop current tracer when resizing buffer")
6d98a0f2ac3c ("tracing: Set actual size after ring buffer resize")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d78ab792705c7be1b91243b2544d1a79406a2ad7 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Tue, 5 Dec 2023 16:52:10 -0500
Subject: [PATCH] tracing: Stop current tracer when resizing buffer
When the ring buffer is being resized, it can cause side effects to the
running tracer. For instance, there's a race with irqsoff tracer that
swaps individual per cpu buffers between the main buffer and the snapshot
buffer. The resize operation modifies the main buffer and then the
snapshot buffer. If a swap happens in between those two operations it will
break the tracer.
Simply stop the running tracer before resizing the buffers and enable it
again when finished.
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Fixes: 3928a8a2d9808 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 231c173ec04f..e978868b1a22 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6387,9 +6387,12 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->array_buffer.buffer)
return 0;
+ /* Do not allow tracing while resizing ring buffer */
+ tracing_stop_tr(tr);
+
ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
if (ret < 0)
- return ret;
+ goto out_start;
#ifdef CONFIG_TRACER_MAX_TRACE
if (!tr->current_trace->use_max_tr)
@@ -6417,7 +6420,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
WARN_ON(1);
tracing_disabled = 1;
}
- return ret;
+ goto out_start;
}
update_buffer_entries(&tr->max_buffer, cpu);
@@ -6426,7 +6429,8 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
#endif /* CONFIG_TRACER_MAX_TRACE */
update_buffer_entries(&tr->array_buffer, cpu);
-
+ out_start:
+ tracing_start_tr(tr);
return ret;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x b2dd797543cfa6580eac8408dd67fa02164d9e56
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120940-capitol-parade-09df@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of event")
6f6be606e763 ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b2dd797543cfa6580eac8408dd67fa02164d9e56 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Wed, 6 Dec 2023 10:02:44 -0500
Subject: [PATCH] ring-buffer: Force absolute timestamp on discard of event
There's a race where if an event is discarded from the ring buffer and an
interrupt were to happen at that time and insert an event, the time stamp
is still used from the discarded event as an offset. This can screw up the
timings.
If the event is going to be discarded, set the "before_stamp" to zero.
When a new event comes in, it compares the "before_stamp" with the
"write_stamp" and if they are not equal, it will insert an absolute
timestamp. This will prevent the timings from getting out of sync due to
the discarded event.
Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.…
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Fixes: 6f6be606e763f ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 43cc47d7faaf..a6da2d765c78 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3030,22 +3030,19 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
local_read(&bpage->write) & ~RB_WRITE_MASK;
unsigned long event_length = rb_event_length(event);
+ /*
+ * Make the before_stamp different than the write_stamp so that
+ * the next event adds an absolute value and does not rely on
+ * the saved write stamp, which is now going to be bogus.
+ */
+ rb_time_set(&cpu_buffer->before_stamp, 0);
+
/* Something came in, can't discard */
if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
write_stamp, write_stamp - delta))
return false;
- /*
- * It's possible that the event time delta is zero
- * (has the same time stamp as the previous event)
- * in which case write_stamp and before_stamp could
- * be the same. In such a case, force before_stamp
- * to be different than write_stamp. It doesn't
- * matter what it is, as long as its different.
- */
- if (!delta)
- rb_time_set(&cpu_buffer->before_stamp, 0);
-
/*
* If an event were to come in now, it would see that the
* write_stamp and the before_stamp are different, and assume
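The crucial detail in the hunk above is ordering: before_stamp is zeroed before the write_stamp cmpxchg, so an event racing in at any point after that sees before_stamp != write_stamp and emits an absolute timestamp instead of reusing the soon-to-be-bogus write_stamp. Condensed (a simplified restatement of the patched code, not a verbatim excerpt):
/* inside rb_try_to_discard(), simplified */
rb_time_set(&cpu_buffer->before_stamp, 0);
/* Something came in meanwhile? Then the event can't be discarded. */
if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
                     write_stamp, write_stamp - delta))
        return false;
/* Discard succeeded; any later writer will add an absolute timestamp. */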
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x b2dd797543cfa6580eac8408dd67fa02164d9e56
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120939-gladly-chef-ae61@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of event")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b2dd797543cfa6580eac8408dd67fa02164d9e56 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Wed, 6 Dec 2023 10:02:44 -0500
Subject: [PATCH] ring-buffer: Force absolute timestamp on discard of event
There's a race where if an event is discarded from the ring buffer and an
interrupt were to happen at that time and insert an event, the time stamp
is still used from the discarded event as an offset. This can screw up the
timings.
If the event is going to be discarded, set the "before_stamp" to zero.
When a new event comes in, it compares the "before_stamp" with the
"write_stamp" and if they are not equal, it will insert an absolute
timestamp. This will prevent the timings from getting out of sync due to
the discarded event.
Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.…
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Fixes: 6f6be606e763f ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 43cc47d7faaf..a6da2d765c78 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3030,22 +3030,19 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
local_read(&bpage->write) & ~RB_WRITE_MASK;
unsigned long event_length = rb_event_length(event);
+ /*
+ * Make the before_stamp different than the write_stamp so that
+ * the next event adds an absolute value and does not rely on
+ * the saved write stamp, which is now going to be bogus.
+ */
+ rb_time_set(&cpu_buffer->before_stamp, 0);
+
/* Something came in, can't discard */
if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
write_stamp, write_stamp - delta))
return false;
- /*
- * It's possible that the event time delta is zero
- * (has the same time stamp as the previous event)
- * in which case write_stamp and before_stamp could
- * be the same. In such a case, force before_stamp
- * to be different than write_stamp. It doesn't
- * matter what it is, as long as its different.
- */
- if (!delta)
- rb_time_set(&cpu_buffer->before_stamp, 0);
-
/*
* If an event were to come in now, it would see that the
* write_stamp and the before_stamp are different, and assume
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x b2dd797543cfa6580eac8408dd67fa02164d9e56
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120938-unclamped-fleshy-688e@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of event")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b2dd797543cfa6580eac8408dd67fa02164d9e56 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
Date: Wed, 6 Dec 2023 10:02:44 -0500
Subject: [PATCH] ring-buffer: Force absolute timestamp on discard of event
There's a race where if an event is discarded from the ring buffer and an
interrupt were to happen at that time and insert an event, the time stamp
is still used from the discarded event as an offset. This can screw up the
timings.
If the event is going to be discarded, set the "before_stamp" to zero.
When a new event comes in, it compares the "before_stamp" with the
"write_stamp" and if they are not equal, it will insert an absolute
timestamp. This will prevent the timings from getting out of sync due to
the discarded event.
Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.…
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Fixes: 6f6be606e763f ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 43cc47d7faaf..a6da2d765c78 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3030,22 +3030,19 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
local_read(&bpage->write) & ~RB_WRITE_MASK;
unsigned long event_length = rb_event_length(event);
+ /*
+ * Make the before_stamp different than the write_stamp so that
+ * the next event adds an absolute value and does not rely on
+ * the saved write stamp, which is now going to be bogus.
+ */
+ rb_time_set(&cpu_buffer->before_stamp, 0);
+
/* Something came in, can't discard */
if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
write_stamp, write_stamp - delta))
return false;
- /*
- * It's possible that the event time delta is zero
- * (has the same time stamp as the previous event)
- * in which case write_stamp and before_stamp could
- * be the same. In such a case, force before_stamp
- * to be different than write_stamp. It doesn't
- * matter what it is, as long as its different.
- */
- if (!delta)
- rb_time_set(&cpu_buffer->before_stamp, 0);
-
/*
* If an event were to come in now, it would see that the
* write_stamp and the before_stamp are different, and assume
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 705318a99a138c29a512a72c3e0043b3cd7f55f4
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120912-cabana-cartridge-5e80@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
705318a99a13 ("io_uring/af_unix: disable sending io_uring over sockets")
38eddb2c75fb ("io_uring: remove FFS_SCM")
735729844819 ("io_uring: move rsrc related data, core, and commands")
3b77495a9723 ("io_uring: split provided buffers handling into its own file")
7aaff708a768 ("io_uring: move cancelation into its own file")
329061d3e2f9 ("io_uring: move poll handling into its own file")
cfd22e6b3319 ("io_uring: add opcode name to io_op_defs")
92ac8beaea1f ("io_uring: include and forward-declaration sanitation")
c9f06aa7de15 ("io_uring: move io_uring_task (tctx) helpers into its own file")
a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
e5550a1447bf ("io_uring: use io_is_uring_fops() consistently")
17437f311490 ("io_uring: move SQPOLL related handling into its own file")
59915143e89f ("io_uring: move timeout opcodes and handling into its own file")
e418bbc97bff ("io_uring: move our reference counting into a header")
36404b09aa60 ("io_uring: move msg_ring into its own file")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
99f15d8d6136 ("io_uring: move uring_cmd handling to its own file")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 705318a99a138c29a512a72c3e0043b3cd7f55f4 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Wed, 6 Dec 2023 13:26:47 +0000
Subject: [PATCH] io_uring/af_unix: disable sending io_uring over sockets
File reference cycles have caused lots of problems for io_uring
in the past, and it still doesn't work exactly right and races with
unix_stream_read_generic(). The safest fix would be to completely
disallow sending io_uring files via sockets using SCM_RIGHTS, so there
are no possible cycles involving registered files and thus rendering
SCM accounting on the io_uring side unnecessary.
Cc: <stable(a)vger.kernel.org>
Fixes: 0091bfc81741b ("io_uring/af_unix: defer registered files gc to io_uring release")
Reported-and-suggested-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/c716c88321939156909cfa1bd8b0faaf1c804103.17018687…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 8625181fb87a..08ac0d8e07ef 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -77,17 +77,10 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
int __io_scm_file_account(struct io_ring_ctx *ctx, struct file *file);
-#if defined(CONFIG_UNIX)
-static inline bool io_file_need_scm(struct file *filp)
-{
- return !!unix_get_socket(filp);
-}
-#else
static inline bool io_file_need_scm(struct file *filp)
{
return false;
}
-#endif
static inline int io_scm_file_account(struct io_ring_ctx *ctx,
struct file *file)
diff --git a/net/core/scm.c b/net/core/scm.c
index 880027ecf516..7dc47c17d863 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -26,6 +26,7 @@
#include <linux/nsproxy.h>
#include <linux/slab.h>
#include <linux/errqueue.h>
+#include <linux/io_uring.h>
#include <linux/uaccess.h>
@@ -103,6 +104,11 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
if (fd < 0 || !(file = fget_raw(fd)))
return -EBADF;
+ /* don't allow io_uring files */
+ if (io_uring_get_socket(file)) {
+ fput(file);
+ return -EINVAL;
+ }
*fpp++ = file;
fpl->count++;
}
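With this change, passing an io_uring file descriptor over an AF_UNIX socket is rejected at sendmsg() time. A small userspace check of the new behavior (a test sketch written for this note, not part of the patch; it assumes io_uring_setup() is available and expects EINVAL on a fixed kernel):
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
int main(void)
{
        struct io_uring_params p;
        struct msghdr msg;
        struct iovec iov;
        struct cmsghdr *cmsg;
        char cbuf[CMSG_SPACE(sizeof(int))];
        char dummy = 'x';
        int sv[2], ring_fd;
        memset(&p, 0, sizeof(p));
        ring_fd = syscall(__NR_io_uring_setup, 8, &p);
        if (ring_fd < 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
                perror("setup");
                return 1;
        }
        /* build an SCM_RIGHTS control message carrying the io_uring fd */
        memset(&msg, 0, sizeof(msg));
        memset(cbuf, 0, sizeof(cbuf));
        iov.iov_base = &dummy;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &ring_fd, sizeof(int));
        /* expected to fail with EINVAL on kernels with this fix */
        if (sendmsg(sv[0], &msg, 0) < 0)
                printf("sendmsg failed as expected: %s\n", strerror(errno));
        else
                printf("sendmsg succeeded (kernel without the fix?)\n");
        return 0;
}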
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x d839a656d0f3caca9f96e9bf912fd394ac6a11bc
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023120319-failing-aviator-303a@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
d839a656d0f3 ("kprobes: consistent rcu api usage for kretprobe holder")
4bbd93455659 ("kprobes: kretprobe scalability improvement")
8865aea0471c ("kernel: kprobes: Use struct_size()")
195b9cb5b288 ("fprobe: Ensure running fprobe_exit_handler() finished before calling rethook_free()")
5f81018753df ("fprobe: Release rethook after the ftrace_ops is unregistered")
76d0de5729c0 ("fprobe: Pass entry_data to handlers")
0a1ebe35cb3b ("rethook: fix a potential memleak in rethook_alloc()")
d05ea35e7eea ("fprobe: Check rethook_alloc() return in rethook initialization")
c09eb2e578eb ("bpf: Adjust kprobe_multi entry_ip for CONFIG_X86_KERNEL_IBT")
c0f3bb4054ef ("rethook: Reject getting a rethook if RCU is not watching")
439940491807 ("kprobes: Fix build errors with CONFIG_KRETPROBES=n")
73f9b911faa7 ("kprobes: Use rethook for kretprobe if possible")
9052e4e83762 ("fprobe: Fix smatch type mismatch warning")
f70986902c86 ("bpf: Fix kprobe_multi return probe backtrace")
f705ec764b34 ("Revert "bpf: Add support to inline bpf_get_func_ip helper on x86"")
2c6401c966ae ("selftests/bpf: Add kprobe_multi bpf_cookie test")
f7a11eeccb11 ("selftests/bpf: Add kprobe_multi attach test")
ca74823c6e16 ("bpf: Add cookie support to programs attached with kprobe multi link")
97ee4d20ee67 ("bpf: Add support to inline bpf_get_func_ip helper on x86")
42a5712094e8 ("bpf: Add bpf_get_func_ip kprobe helper for multi kprobe link")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d839a656d0f3caca9f96e9bf912fd394ac6a11bc Mon Sep 17 00:00:00 2001
From: JP Kobryn <inwardvessel(a)gmail.com>
Date: Fri, 1 Dec 2023 14:53:55 +0900
Subject: [PATCH] kprobes: consistent rcu api usage for kretprobe holder
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
The thought behind this patch is to make use of the RCU API where possible
when accessing this pointer so that the needed barriers are always in place
and to self-document the code.
The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
done to the "rp" pointer are changed to make use of the RCU macro for
assignment. For the single read, the implementation of get_kretprobe()
is simplified by making use of an RCU macro which accomplishes the same,
but note that the log warning text will be more generic.
I did find that there is a difference in assembly generated between the
usage of the RCU macros vs without. For example, on arm64, when using
rcu_assign_pointer(), the corresponding store instruction is a
store-release (STLR) which has an implicit barrier. When normal assignment
is done, a regular store (STR) is found. In the macro case, this seems to
be a result of rcu_assign_pointer() using smp_store_release() when the
value to write is not NULL.
Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/
Fixes: d741bf41d7c7 ("kprobes: Remove kretprobe hash")
Cc: stable(a)vger.kernel.org
Signed-off-by: JP Kobryn <inwardvessel(a)gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index ab1da3142b06..64672bace560 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -139,7 +139,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
*
*/
struct kretprobe_holder {
- struct kretprobe *rp;
+ struct kretprobe __rcu *rp;
struct objpool_head pool;
};
@@ -245,10 +245,7 @@ unsigned long kretprobe_trampoline_handler(struct pt_regs *regs,
static nokprobe_inline struct kretprobe *get_kretprobe(struct kretprobe_instance *ri)
{
- RCU_LOCKDEP_WARN(!rcu_read_lock_any_held(),
- "Kretprobe is accessed from instance under preemptive context");
-
- return READ_ONCE(ri->rph->rp);
+ return rcu_dereference_check(ri->rph->rp, rcu_read_lock_any_held());
}
static nokprobe_inline unsigned long get_kretprobe_retaddr(struct kretprobe_instance *ri)
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 075a632e6c7c..d5a0ee40bf66 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2252,7 +2252,7 @@ int register_kretprobe(struct kretprobe *rp)
rp->rph = NULL;
return -ENOMEM;
}
- rp->rph->rp = rp;
+ rcu_assign_pointer(rp->rph->rp, rp);
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
@@ -2300,7 +2300,7 @@ void unregister_kretprobes(struct kretprobe **rps, int num)
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
rethook_free(rps[i]->rh);
#else
- rps[i]->rph->rp = NULL;
+ rcu_assign_pointer(rps[i]->rph->rp, NULL);
#endif
}
mutex_unlock(&kprobe_mutex);
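The pattern this patch converges on is the standard RCU publish/read pairing. A generic kernel-style illustration of that pairing (schematic only, not the kprobes code itself):
struct foo;
struct holder {
        struct foo __rcu *ptr;  /* __rcu enables sparse checking */
};
/*
 * Publisher: rcu_assign_pointer() orders the initialization of *f
 * before the pointer becomes visible to readers; on arm64 it emits
 * a store-release (STLR) when the value is non-NULL.
 */
static void publish(struct holder *h, struct foo *f)
{
        rcu_assign_pointer(h->ptr, f);
}
/*
 * Reader: rcu_dereference_check() pairs with the publisher and lets
 * lockdep warn unless an RCU read-side critical section (or another
 * stated guarantee) is held.
 */
static struct foo *reader(struct holder *h)
{
        return rcu_dereference_check(h->ptr, rcu_read_lock_any_held());
}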
commit 1aa3aaf8953c84bad398adf6c3cabc9d6685bf7d upstream
A transaction complete work is allocated and queued for each
transaction. Under certain conditions the work->type might be marked as
BINDER_WORK_TRANSACTION_ONEWAY_SPAM_SUSPECT to notify userspace about
potential spamming threads or as BINDER_WORK_TRANSACTION_PENDING when
the target is currently frozen.
However, these work types are not being handled in binder_release_work()
so they will leak during a cleanup. This was reported by syzkaller with
the following kmemleak dump:
BUG: memory leak
unreferenced object 0xffff88810e2d6de0 (size 32):
comm "syz-executor338", pid 5046, jiffies 4294968230 (age 13.590s)
hex dump (first 32 bytes):
e0 6d 2d 0e 81 88 ff ff e0 6d 2d 0e 81 88 ff ff .m-......m-.....
04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff81573b75>] kmalloc_trace+0x25/0x90 mm/slab_common.c:1114
[<ffffffff83d41873>] kmalloc include/linux/slab.h:599 [inline]
[<ffffffff83d41873>] kzalloc include/linux/slab.h:720 [inline]
[<ffffffff83d41873>] binder_transaction+0x573/0x4050 drivers/android/binder.c:3152
[<ffffffff83d45a05>] binder_thread_write+0x6b5/0x1860 drivers/android/binder.c:4010
[<ffffffff83d486dc>] binder_ioctl_write_read drivers/android/binder.c:5066 [inline]
[<ffffffff83d486dc>] binder_ioctl+0x1b2c/0x3cf0 drivers/android/binder.c:5352
[<ffffffff816b25f2>] vfs_ioctl fs/ioctl.c:51 [inline]
[<ffffffff816b25f2>] __do_sys_ioctl fs/ioctl.c:871 [inline]
[<ffffffff816b25f2>] __se_sys_ioctl fs/ioctl.c:857 [inline]
[<ffffffff816b25f2>] __x64_sys_ioctl+0xf2/0x140 fs/ioctl.c:857
[<ffffffff84b30008>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff84b30008>] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
[<ffffffff84c0008b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fix the leaks by kfreeing these work types in binder_release_work() and
handle them as a BINDER_WORK_TRANSACTION_COMPLETE cleanup.
Cc: stable(a)vger.kernel.org
Fixes: a7dc1e6f99df ("binder: tell userspace to dump current backtrace when detected oneway spamming")
Reported-by: syzbot+7f10c1653e35933c0f1e(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7f10c1653e35933c0f1e
Suggested-by: Alice Ryhl <aliceryhl(a)google.com>
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
Reviewed-by: Alice Ryhl <aliceryhl(a)google.com>
Acked-by: Todd Kjos <tkjos(a)google.com>
Link: https://lore.kernel.org/r/20230922175138.230331-1-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
[cmllamas: backport to v6.1 by dropping BINDER_WORK_TRANSACTION_PENDING
as commit 0567461a7a6e is not present. Remove fixes tag accordingly.]
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
---
drivers/android/binder.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index e4a6da81cd4b..9cc3a2b1b4fc 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -4788,6 +4788,7 @@ static void binder_release_work(struct binder_proc *proc,
"undelivered TRANSACTION_ERROR: %u\n",
e->cmd);
} break;
+ case BINDER_WORK_TRANSACTION_ONEWAY_SPAM_SUSPECT:
case BINDER_WORK_TRANSACTION_COMPLETE: {
binder_debug(BINDER_DEBUG_DEAD_TRANSACTION,
"undelivered TRANSACTION_COMPLETE\n");
base-commit: c6114c845984144944f1abc07c61de219367a4da
--
2.43.0.472.g3155946c3a-goog
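The fix itself is mechanically simple: the new work type is stacked onto the existing TRANSACTION_COMPLETE case so both share one kfree() cleanup path. Schematically (a trimmed illustration, not the full binder_release_work()):
switch (w->type) {
case BINDER_WORK_TRANSACTION_ONEWAY_SPAM_SUSPECT:
case BINDER_WORK_TRANSACTION_COMPLETE: {
        /* both work types wrap the same allocation; free it once here */
        binder_debug(BINDER_DEBUG_DEAD_TRANSACTION,
                     "undelivered TRANSACTION_COMPLETE\n");
        kfree(w);
} break;
/* ... other work types ... */
}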
Hello,
On Fri, Dec 08, 2023 at 05:08:32AM -0500, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> spi: imx: add a device specific prepare_message callback
>
> to the 4.19-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> spi-imx-add-a-device-specific-prepare_message-callba.patch
> and it can be found in the queue-4.19 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit b19a3770ce84da3c16acc7142e754cd8ff80ad3d
> Author: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
> Date: Fri Nov 30 07:47:05 2018 +0100
>
> spi: imx: add a device specific prepare_message callback
>
> [ Upstream commit e697271c4e2987b333148e16a2eb8b5b924fd40a ]
>
> This is just preparatory work which allows to move some initialisation
> that currently is done in the per transfer hook .config to an earlier
> point in time in the next few patches. There is no change in behaviour
> introduced by this patch.
>
> Signed-off-by: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
> Signed-off-by: Mark Brown <broonie(a)kernel.org>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
The patch alone shouldn't be needed for stable and there is no
indication that it's a dependency for another patch. Is this an
oversight?
Other than that: IMHO the subject for this type of report could be improved. Currently it's:
Subject: Patch "spi: imx: add a device specific prepare_message callback" has been added to the 4.19-stable tree
The most important part of it is "4.19-stable", but that only appears
after character column 90 and so my MUA doesn't show it unless the
window is wider than my default setting. Maybe make this:
Subject: for-stable-4.19: "spi: imx: add a device specific prepare_message callback"
?
Another thing I wonder about is: The mail contains
If you, or anyone else, feels it should not be added to the
stable tree, please let <stable(a)vger.kernel.org> know about it.
but it wasn't sent to stable(a)vger.kernel.org.
Best regards
Uwe
--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | https://www.pengutronix.de/ |
Hi Greg, Sasha,
Please pick up the following upstream commits for 6.6.x stable:
commit 04cef5f58395 ("drm/amd/amdgpu/amdgpu_doorbell_mgr: Correct
misdocumented param 'doorbell_index'")
commit 367a0af43373 ("drm/amdkfd: get doorbell's absolute offset based
on the db_size")
They fix the following reported issue:
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3007
Fixes: c31866651086 ("drm/amdgpu: use doorbell mgr for kfd kernel doorbells")
Thanks,
Alex
This is a partial revert of commit 8b3517f88ff2 ("PCI:
loongson: Prevent LS7A MRRS increases") for MIPS-based Loongson.
There are many MIPS-based Loongson systems in the wild that
shipped with firmware which does not set the maximum MRRS properly.
Limit MRRS to 256 for all of them, since MIPS Loongson hardware
with higher MRRS support is considered rare.
This must be done at the device enablement stage because the MRRS
setting may get lost if the parent bridge loses PCI_COMMAND_MASTER,
and we are only sure the parent bridge is enabled at this point.
Cc: stable(a)vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217680
Fixes: 8b3517f88ff2 ("PCI: loongson: Prevent LS7A MRRS increases")
Signed-off-by: Jiaxun Yang <jiaxun.yang(a)flygoat.com>
---
v4: Improve commit message
v5:
- Improve commit message and comments.
- Style fix from Huacai's off-list input.
v6: Fix a typo
---
drivers/pci/controller/pci-loongson.c | 47 ++++++++++++++++++++++++---
1 file changed, 42 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/controller/pci-loongson.c b/drivers/pci/controller/pci-loongson.c
index d45e7b8dc530..e181d99decf1 100644
--- a/drivers/pci/controller/pci-loongson.c
+++ b/drivers/pci/controller/pci-loongson.c
@@ -80,13 +80,50 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_LOONGSON,
DEV_LS7A_LPC, system_bus_quirk);
+/*
+ * Some Loongson PCIe ports have h/w limitations of maximum read
+ * request size. They can't handle anything larger than this.
+ * Sane firmware will set proper MRRS at boot, so we only need
+ * no_inc_mrrs for bridges. However, some MIPS Loongson firmware
+ * won't set MRRS properly, and we have to enforce maximum safe
+ * MRRS, which is 256 bytes.
+ */
+#ifdef CONFIG_MIPS
+static void loongson_set_min_mrrs_quirk(struct pci_dev *pdev)
+{
+ struct pci_bus *bus = pdev->bus;
+ struct pci_dev *bridge;
+ static const struct pci_device_id bridge_devids[] = {
+ { PCI_VDEVICE(LOONGSON, DEV_LS2K_PCIE_PORT0) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT0) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT1) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT2) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT3) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT4) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT5) },
+ { PCI_VDEVICE(LOONGSON, DEV_LS7A_PCIE_PORT6) },
+ { 0, },
+ };
+
+ /* look for the matching bridge */
+ while (!pci_is_root_bus(bus)) {
+ bridge = bus->self;
+ bus = bus->parent;
+
+ if (pci_match_id(bridge_devids, bridge)) {
+ if (pcie_get_readrq(pdev) > 256) {
+ pci_info(pdev, "limiting MRRS to 256\n");
+ pcie_set_readrq(pdev, 256);
+ }
+ break;
+ }
+ }
+}
+DECLARE_PCI_FIXUP_ENABLE(PCI_ANY_ID, PCI_ANY_ID, loongson_set_min_mrrs_quirk);
+#endif
+
static void loongson_mrrs_quirk(struct pci_dev *pdev)
{
- /*
- * Some Loongson PCIe ports have h/w limitations of maximum read
- * request size. They can't handle anything larger than this. So
- * force this limit on any devices attached under these ports.
- */
struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
bridge->no_inc_mrrs = 1;
--
2.34.1
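For context, pcie_get_readrq()/pcie_set_readrq() are the generic PCI helpers for querying and capping a device's MRRS, and a fixup declared with DECLARE_PCI_FIXUP_ENABLE() runs at device enable time, which is exactly when the commit message says the clamp must happen. A minimal skeleton of such a quirk (a schematic reduction of the patch above, matching every device rather than walking up to a Loongson bridge):
static void cap_mrrs_quirk(struct pci_dev *pdev)
{
        /* clamp the Max Read Request Size if firmware left it too high */
        if (pcie_get_readrq(pdev) > 256) {
                pci_info(pdev, "limiting MRRS to 256\n");
                pcie_set_readrq(pdev, 256);
        }
}
DECLARE_PCI_FIXUP_ENABLE(PCI_ANY_ID, PCI_ANY_ID, cap_mrrs_quirk);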
Please merge commit 85c2ceaafbd3 ("mm/damon/sysfs: eliminate potential
uninitialized variable warning") to >=5.19 stable kernels.
On 2023-10-31, I sent[1] a fix for v5.19. After a week, Dan found an issue in
that fix and sent a fix of his own[2]. At that time, the commit that Dan was
fixing was merged in the mm tree but not in the mainline. Hence, Dan didn't Cc
stable@. However, the broken fix[1] is now merged in the mainline as commit
973233600676 ("mm/damon/sysfs: update monitoring target regions for online
input commit"), and in all >=5.19 stable trees. Hence Dan's fix should also be
applied to those trees. Please apply it.
Note that the bug was only a potential one[3] due to an unchecked return
value. However, the unchecked return value was not intentional behavior but a
bug itself. Hence we further made the return value be checked[4]. The
return-value check fix is also merged in the relevant stable trees, so the fix
is now needed for a real bug.
[1] https://lore.kernel.org/all/20231031170131.46972-1-sj@kernel.org/
[2] https://lore.kernel.org/all/739e6aaf-a634-4e33-98a8-16546379ec9f@moroto.mou…
[3] https://lore.kernel.org/all/20231106165205.48264-1-sj@kernel.org/
[4] https://lore.kernel.org/all/20231106233408.51159-1-sj@kernel.org/
Thanks,
SJ
The patch titled
Subject: mm/mglru: reclaim offlined memcgs harder
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mglru-reclaim-offlined-memcgs-harder.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: reclaim offlined memcgs harder
Date: Thu, 7 Dec 2023 23:14:07 -0700
In the effort to reduce zombie memcgs [1], it was discovered that the
memcg LRU doesn't apply enough pressure on offlined memcgs. Specifically,
instead of rotating them to the tail of the current generation
(MEMCG_LRU_TAIL) for a second attempt, it moves them to the next
generation (MEMCG_LRU_YOUNG) after the first attempt.
Not applying enough pressure on offlined memcgs can cause them to build
up, and this can be particularly harmful to memory-constrained systems.
On Pixel 8 Pro, launching apps for 50 cycles:
Before After Change
Zombie memcgs 45 35 -22%
[1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA…
Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: T.J. Mercier <tjmercier(a)google.com>
Tested-by: T.J. Mercier <tjmercier(a)google.com>
Cc: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: Kalesh Singh <kaleshsingh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mmzone.h | 8 ++++----
mm/vmscan.c | 24 ++++++++++++++++--------
2 files changed, 20 insertions(+), 12 deletions(-)
--- a/include/linux/mmzone.h~mm-mglru-reclaim-offlined-memcgs-harder
+++ a/include/linux/mmzone.h
@@ -519,10 +519,10 @@ void lru_gen_look_around(struct page_vma
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
* 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
- * which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
- * which triggers MEMCG_LRU_YOUNG;
+ * 3. The first attempt to reclaim a memcg offlined or below reclaimable size
+ * threshold, which triggers MEMCG_LRU_TAIL;
+ * 4. The second attempt to reclaim a memcg offlined or below reclaimable size
+ * threshold, which triggers MEMCG_LRU_YOUNG;
* 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
* 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
--- a/mm/vmscan.c~mm-mglru-reclaim-offlined-memcgs-harder
+++ a/mm/vmscan.c
@@ -4598,7 +4598,12 @@ static bool should_run_aging(struct lruv
}
/* try to scrape all its memory if this memcg was deleted */
- *nr_to_scan = mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
+ if (!mem_cgroup_online(memcg)) {
+ *nr_to_scan = total;
+ return false;
+ }
+
+ *nr_to_scan = total >> sc->priority;
/*
* The aging tries to be lazy to reduce the overhead, while the eviction
@@ -4719,14 +4724,9 @@ static int shrink_one(struct lruvec *lru
bool success;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
- int seg = lru_gen_memcg_seg(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
- /* see the comment on MEMCG_NR_GENS */
- if (!lruvec_is_sizable(lruvec, sc))
- return seg != MEMCG_LRU_TAIL ? MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
-
mem_cgroup_calculate_protection(NULL, memcg);
if (mem_cgroup_below_min(NULL, memcg))
@@ -4734,7 +4734,7 @@ static int shrink_one(struct lruvec *lru
if (mem_cgroup_below_low(NULL, memcg)) {
/* see the comment on MEMCG_NR_GENS */
- if (seg != MEMCG_LRU_TAIL)
+ if (lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL)
return MEMCG_LRU_TAIL;
memcg_memory_event(memcg, MEMCG_LOW);
@@ -4750,7 +4750,15 @@ static int shrink_one(struct lruvec *lru
flush_reclaim_state(sc);
- return success ? MEMCG_LRU_YOUNG : 0;
+ if (success && mem_cgroup_online(memcg))
+ return MEMCG_LRU_YOUNG;
+
+ if (!success && lruvec_is_sizable(lruvec, sc))
+ return 0;
+
+ /* one retry if offlined or too small */
+ return lru_gen_memcg_seg(lruvec) != MEMCG_LRU_TAIL ?
+ MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
}
#ifdef CONFIG_MEMCG
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-mglru-fix-underprotected-page-cache.patch
mm-mglru-try-to-stop-at-high-watermarks.patch
mm-mglru-respect-min_ttl_ms-with-memcgs.patch
mm-mglru-reclaim-offlined-memcgs-harder.patch
The patch titled
Subject: mm/mglru: respect min_ttl_ms with memcgs
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mglru-respect-min_ttl_ms-with-memcgs.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: respect min_ttl_ms with memcgs
Date: Thu, 7 Dec 2023 23:14:06 -0700
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru:
try to stop at high watermarks"), it was discovered that the memcg LRU can
breach the thrashing protection imposed by min_ttl_ms.
Before the memcg LRU:
kswapd()
shrink_node_memcgs()
mem_cgroup_iter()
inc_max_seq() // always hit a different memcg
lru_gen_age_node()
mem_cgroup_iter()
check the timestamp of the oldest generation
After the memcg LRU:
kswapd()
shrink_many()
restart:
iterate the memcg LRU:
inc_max_seq() // occasionally hit the same memcg
if raced with lru_gen_rotate_memcg():
goto restart
lru_gen_age_node()
mem_cgroup_iter()
check the timestamp of the oldest generation
Specifically, when the restart happens in shrink_many(), it needs to stick
with the (memcg LRU) generation it began with. In other words, it should
neither re-read memcg_lru->seq nor age an lruvec of a different
generation. Otherwise it can hit the same memcg multiple times without
giving lru_gen_age_node() a chance to check the timestamp of that memcg's
oldest generation (against min_ttl_ms).
[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg…
Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Tested-by: T.J. Mercier <tjmercier(a)google.com>
Cc: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: Kalesh Singh <kaleshsingh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mmzone.h | 30 +++++++++++++++++-------------
mm/vmscan.c | 30 ++++++++++++++++--------------
2 files changed, 33 insertions(+), 27 deletions(-)
--- a/include/linux/mmzone.h~mm-mglru-respect-min_ttl_ms-with-memcgs
+++ a/include/linux/mmzone.h
@@ -505,33 +505,37 @@ void lru_gen_look_around(struct page_vma
* the old generation, is incremented when all its bins become empty.
*
* There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in its
+ * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
* current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in its
+ * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
* current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in the old
+ * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
* generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin in the
+ * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
* young generation, updates its "gen" to "young" and resets its "seg" to
* "default".
*
* The events that trigger the above operations are:
* 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim an memcg below low, which triggers
+ * 2. The first attempt to reclaim a memcg below low, which triggers
* MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim an memcg below reclaimable size threshold,
+ * 3. The first attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim an memcg below reclaimable size threshold,
+ * 4. The second attempt to reclaim a memcg below reclaimable size threshold,
* which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim an memcg below min, which triggers MEMCG_LRU_YOUNG;
+ * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
* 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
+ * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
*
- * Note that memcg LRU only applies to global reclaim, and the round-robin
- * incrementing of their max_seq counters ensures the eventual fairness to all
- * eligible memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * Notes:
+ * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
+ * of their max_seq counters ensures the eventual fairness to all eligible
+ * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
+ * 2. There are only two valid generations: old (seq) and young (seq+1).
+ * MEMCG_NR_GENS is set to three so that when reading the generation counter
+ * locklessly, a stale value (seq-1) does not wraparound to young.
*/
-#define MEMCG_NR_GENS 2
+#define MEMCG_NR_GENS 3
#define MEMCG_NR_BINS 8
struct lru_gen_memcg {
--- a/mm/vmscan.c~mm-mglru-respect-min_ttl_ms-with-memcgs
+++ a/mm/vmscan.c
@@ -4089,6 +4089,9 @@ static void lru_gen_rotate_memcg(struct
else
VM_WARN_ON_ONCE(true);
+ WRITE_ONCE(lruvec->lrugen.seg, seg);
+ WRITE_ONCE(lruvec->lrugen.gen, new);
+
hlist_nulls_del_rcu(&lruvec->lrugen.list);
if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
@@ -4099,9 +4102,6 @@ static void lru_gen_rotate_memcg(struct
pgdat->memcg_lru.nr_memcgs[old]--;
pgdat->memcg_lru.nr_memcgs[new]++;
- lruvec->lrugen.gen = new;
- WRITE_ONCE(lruvec->lrugen.seg, seg);
-
if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
@@ -4124,11 +4124,11 @@ void lru_gen_online_memcg(struct mem_cgr
gen = get_memcg_gen(pgdat->memcg_lru.seq);
+ lruvec->lrugen.gen = gen;
+
hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
pgdat->memcg_lru.nr_memcgs[gen]++;
- lruvec->lrugen.gen = gen;
-
spin_unlock_irq(&pgdat->memcg_lru.lock);
}
}
@@ -4635,7 +4635,7 @@ static long get_nr_to_scan(struct lruvec
DEFINE_MAX_SEQ(lruvec);
if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
- return 0;
+ return -1;
if (!should_run_aging(lruvec, max_seq, sc, can_swap, &nr_to_scan))
return nr_to_scan;
@@ -4710,7 +4710,7 @@ static bool try_to_shrink_lruvec(struct
cond_resched();
}
- /* whether try_to_inc_max_seq() was successful */
+ /* whether this lruvec should be rotated */
return nr_to_scan < 0;
}
@@ -4764,13 +4764,13 @@ static void shrink_many(struct pglist_da
struct lruvec *lruvec;
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
- const struct hlist_nulls_node *pos;
+ struct hlist_nulls_node *pos;
+ gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
op = 0;
memcg = NULL;
- gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
rcu_read_lock();
@@ -4781,6 +4781,10 @@ restart:
}
mem_cgroup_put(memcg);
+ memcg = NULL;
+
+ if (gen != READ_ONCE(lrugen->gen))
+ continue;
lruvec = container_of(lrugen, struct lruvec, lrugen);
memcg = lruvec_memcg(lruvec);
@@ -4865,16 +4869,14 @@ static void set_initial_priority(struct
if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH)
return;
/*
- * Determine the initial priority based on ((total / MEMCG_NR_GENS) >>
- * priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, where the
- * estimated reclaimed_to_scanned_ratio = inactive / total.
+ * Determine the initial priority based on
+ * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim,
+ * where reclaimed_to_scanned_ratio = inactive / total.
*/
reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE);
if (get_swappiness(lruvec, sc))
reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);
- reclaimable /= MEMCG_NR_GENS;
-
/* round down reclaimable and round up sc->nr_to_reclaim */
priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-mglru-fix-underprotected-page-cache.patch
mm-mglru-try-to-stop-at-high-watermarks.patch
mm-mglru-respect-min_ttl_ms-with-memcgs.patch
mm-mglru-reclaim-offlined-memcgs-harder.patch
The patch titled
Subject: mm/mglru: try to stop at high watermarks
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mglru-try-to-stop-at-high-watermarks.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: try to stop at high watermarks
Date: Thu, 7 Dec 2023 23:14:05 -0700
The initial MGLRU patchset didn't include the memcg LRU support, and it
relied on should_abort_scan(), added by commit f76c83378851 ("mm:
multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid
overshooting their aggregate reclaim target by too much".
Later on when the memcg LRU was added, should_abort_scan() was deemed
unnecessary, and the test results [1] showed no side effects after it was
removed by commit a579086c99ed ("mm: multi-gen LRU: remove eviction
fairness safeguard").
However, that test used memory.reclaim, which sets nr_to_reclaim to
SWAP_CLUSTER_MAX (32 pages). So it can overshoot only by SWAP_CLUSTER_MAX-1
pages, i.e., from nr_reclaimed=nr_to_reclaim-1 to
nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch
size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore
that test isn't able to reproduce the worst case scenario, i.e., kswapd
overshooting GBs on large systems and "consuming 100% CPU" (see the Closes
tag).
Bring back a simplified version of should_abort_scan() on top of the memcg
LRU, so that kswapd stops when all eligible zones are above their
respective high watermarks plus a small delta to lower the chance of
KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies to order-0
reclaim, meaning compaction-induced reclaim can still run wild (which is a
different problem).
On Android, launching 55 apps sequentially:
Before After Change
pgpgin 838377172 802955040 -4%
pgpgout 38037080 34336300 -10%
[1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@google.com
Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg…
Tested-by: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Tested-by: Kalesh Singh <kaleshsingh(a)google.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: T.J. Mercier <tjmercier(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
--- a/mm/vmscan.c~mm-mglru-try-to-stop-at-high-watermarks
+++ a/mm/vmscan.c
@@ -4648,20 +4648,41 @@ static long get_nr_to_scan(struct lruvec
return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
}
-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
{
+ int i;
+ enum zone_watermarks mark;
+
/* don't abort memcg reclaim to ensure fairness */
if (!root_reclaim(sc))
- return -1;
+ return false;
+
+ if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
+ return true;
+
+ /* check the order to exclude compaction-induced reclaim */
+ if (!current_is_kswapd() || sc->order)
+ return false;
+
+ mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+ WMARK_PROMO : WMARK_HIGH;
+
+ for (i = 0; i <= sc->reclaim_idx; i++) {
+ struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+ unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
+
+ if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
+ return false;
+ }
- return max(sc->nr_to_reclaim, compact_gap(sc->order));
+ /* kswapd should abort if all eligible zones are safe */
+ return true;
}
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
long nr_to_scan;
unsigned long scanned = 0;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
int swappiness = get_swappiness(lruvec, sc);
/* clean file folios are more likely to exist */
@@ -4683,7 +4704,7 @@ static bool try_to_shrink_lruvec(struct
if (scanned >= nr_to_scan)
break;
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (should_abort_scan(lruvec, sc))
break;
cond_resched();
@@ -4744,7 +4765,6 @@ static void shrink_many(struct pglist_da
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
const struct hlist_nulls_node *pos;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
@@ -4777,7 +4797,7 @@ restart:
rcu_read_lock();
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (should_abort_scan(lruvec, sc))
break;
}
@@ -4788,7 +4808,7 @@ restart:
mem_cgroup_put(memcg);
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (!is_a_nulls(pos))
return;
/* restart if raced with lru_gen_rotate_memcg() */
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-mglru-fix-underprotected-page-cache.patch
mm-mglru-try-to-stop-at-high-watermarks.patch
mm-mglru-respect-min_ttl_ms-with-memcgs.patch
mm-mglru-reclaim-offlined-memcgs-harder.patch
The patch titled
Subject: mm/mglru: fix underprotected page cache
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-mglru-fix-underprotected-page-cache.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: fix underprotected page cache
Date: Thu, 7 Dec 2023 23:14:04 -0700
Unmapped folios accessed through file descriptors can be underprotected.
Those folios are added to the oldest generation based on:
1. The fact that they are less costly to reclaim (no need to walk the
rmap and flush the TLB) and have less impact on performance (don't
cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
client use cases like Android, its apps parse configuration files
and store the data in heap (anon); for server use cases like MySQL,
it reads from InnoDB files and holds the cached data for tables in
buffer pools (anon).
However, the oldest generation can be very short lived, and if so, it
doesn't provide the PID controller with enough time to respond to a surge
of refaults. (Note that the PID controller uses weighted refaults and
those from evicted generations only take a half of the whole weight.) In
other words, for a short lived generation, the moving average smooths out
the spike quickly.
To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
tracking range of tiers, i.e., five accesses through file
descriptors, move them to the second oldest generation to give them
more time to age. (Note that tiers are used by the PID controller
to statistically determine whether folios accessed multiple times
through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
that they are not too close to the tail. The effect of this is
similar to the above.
On Android, launching 55 apps sequentially:
Before After Change
workingset_refault_anon 25641024 25598972 0%
workingset_refault_file 115016834 106178438 -8%
Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Tested-by: Kalesh Singh <kaleshsingh(a)google.com>
Cc: T.J. Mercier <tjmercier(a)google.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: Hillf Danton <hdanton(a)sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart(a)gooddata.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm_inline.h | 23 ++++++++++++++---------
mm/vmscan.c | 2 +-
mm/workingset.c | 6 +++---
3 files changed, 18 insertions(+), 13 deletions(-)
--- a/include/linux/mm_inline.h~mm-mglru-fix-underprotected-page-cache
+++ a/include/linux/mm_inline.h
@@ -232,22 +232,27 @@ static inline bool lru_gen_add_folio(str
if (folio_test_unevictable(folio) || !lrugen->enabled)
return false;
/*
- * There are three common cases for this page:
- * 1. If it's hot, e.g., freshly faulted in or previously hot and
- * migrated, add it to the youngest generation.
- * 2. If it's cold but can't be evicted immediately, i.e., an anon page
- * not in swapcache or a dirty page pending writeback, add it to the
- * second oldest generation.
- * 3. Everything else (clean, cold) is added to the oldest generation.
+ * There are four common cases for this page:
+ * 1. If it's hot, i.e., freshly faulted in, add it to the youngest
+ * generation, and it's protected over the rest below.
+ * 2. If it can't be evicted immediately, i.e., a dirty page pending
+ * writeback, add it to the second youngest generation.
+ * 3. If it should be evicted first, e.g., cold and clean from
+ * folio_rotate_reclaimable(), add it to the oldest generation.
+ * 4. Everything else falls between 2 & 3 above and is added to the
+ * second oldest generation if it's considered inactive, or the
+ * oldest generation otherwise. See lru_gen_is_active().
*/
if (folio_test_active(folio))
seq = lrugen->max_seq;
else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
(folio_test_reclaim(folio) &&
(folio_test_dirty(folio) || folio_test_writeback(folio))))
- seq = lrugen->min_seq[type] + 1;
- else
+ seq = lrugen->max_seq - 1;
+ else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
seq = lrugen->min_seq[type];
+ else
+ seq = lrugen->min_seq[type] + 1;
gen = lru_gen_from_seq(seq);
flags = (gen + 1UL) << LRU_GEN_PGOFF;
--- a/mm/vmscan.c~mm-mglru-fix-underprotected-page-cache
+++ a/mm/vmscan.c
@@ -4232,7 +4232,7 @@ static bool sort_folio(struct lruvec *lr
}
/* protected */
- if (tier > tier_idx) {
+ if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
int hist = lru_hist_from_seq(lrugen->min_seq[type]);
gen = folio_inc_gen(lruvec, folio, false);
--- a/mm/workingset.c~mm-mglru-fix-underprotected-page-cache
+++ a/mm/workingset.c
@@ -313,10 +313,10 @@ static void lru_gen_refault(struct folio
* 1. For pages accessed through page tables, hotter pages pushed out
* hot pages which refaulted immediately.
* 2. For pages accessed multiple times through file descriptors,
- * numbers of accesses might have been out of the range.
+ * they would have been protected by sort_folio().
*/
- if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
- folio_set_workingset(folio);
+ if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
+ set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
}
unlock:
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-mglru-fix-underprotected-page-cache.patch
mm-mglru-try-to-stop-at-high-watermarks.patch
mm-mglru-respect-min_ttl_ms-with-memcgs.patch
mm-mglru-reclaim-offlined-memcgs-harder.patch
The patch titled
Subject: mm/damon/core: make damon_start() waits until kdamond_fn() starts
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-damon-core-make-damon_start-waits-until-kdamond_fn-starts.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: SeongJae Park <sj(a)kernel.org>
Subject: mm/damon/core: make damon_start() waits until kdamond_fn() starts
Date: Fri, 8 Dec 2023 17:50:18 +0000
The cleanup tasks of kdamond threads, including the reset of the
corresponding DAMON context's ->kdamond field and the decrease of the
global nr_running_ctxs counter, are supposed to be executed by
kdamond_fn(). However, commit 0f91d13366a4 ("mm/damon: simplify stop
mechanism") made neither damon_start() nor damon_stop() ensure that the
corresponding kdamond has started the execution of kdamond_fn().
As a result, the cleanup can be skipped if damon_stop() is called quickly
enough after the previous damon_start(). In particular, the skipped reset
of ->kdamond could cause a use-after-free.
Fix it by making damon_start() wait until kdamond_fn() has started
executing.
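The synchronization this adds follows the standard completion pattern. A
minimal, self-contained sketch of the pattern (worker_fn and start_worker
are illustrative names, not the DAMON symbols):

	#include <linux/completion.h>
	#include <linux/kthread.h>

	/* init_completion(&worker_started) must run once before first use */
	static struct completion worker_started;

	static int worker_fn(void *data)
	{
		complete(&worker_started);  /* signal: the thread is running */
		/* ... main loop; cleanup runs when this thread exits ... */
		return 0;
	}

	static int start_worker(void)
	{
		struct task_struct *t;

		reinit_completion(&worker_started);
		t = kthread_run(worker_fn, NULL, "worker");
		if (IS_ERR(t))
			return PTR_ERR(t);
		/*
		 * Do not return until worker_fn() has started, so a racing
		 * stop cannot skip the thread's cleanup.
		 */
		wait_for_completion(&worker_started);
		return 0;
	}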
Link: https://lkml.kernel.org/r/20231208175018.63880-1-sj@kernel.org
Fixes: 0f91d13366a4 ("mm/damon: simplify stop mechanism")
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Reported-by: Jakub Acs <acsjakub(a)amazon.de>
Cc: Changbin Du <changbin.du(a)intel.com>
Cc: Jakub Acs <acsjakub(a)amazon.de>
Cc: <stable(a)vger.kernel.org> # 5.15.x
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/damon.h | 2 ++
mm/damon/core.c | 6 ++++++
2 files changed, 8 insertions(+)
--- a/include/linux/damon.h~mm-damon-core-make-damon_start-waits-until-kdamond_fn-starts
+++ a/include/linux/damon.h
@@ -559,6 +559,8 @@ struct damon_ctx {
* update
*/
unsigned long next_ops_update_sis;
+ /* for waiting until the execution of the kdamond_fn is started */
+ struct completion kdamond_started;
/* public: */
struct task_struct *kdamond;
--- a/mm/damon/core.c~mm-damon-core-make-damon_start-waits-until-kdamond_fn-starts
+++ a/mm/damon/core.c
@@ -445,6 +445,8 @@ struct damon_ctx *damon_new_ctx(void)
if (!ctx)
return NULL;
+ init_completion(&ctx->kdamond_started);
+
ctx->attrs.sample_interval = 5 * 1000;
ctx->attrs.aggr_interval = 100 * 1000;
ctx->attrs.ops_update_interval = 60 * 1000 * 1000;
@@ -668,11 +670,14 @@ static int __damon_start(struct damon_ct
mutex_lock(&ctx->kdamond_lock);
if (!ctx->kdamond) {
err = 0;
+ reinit_completion(&ctx->kdamond_started);
ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d",
nr_running_ctxs);
if (IS_ERR(ctx->kdamond)) {
err = PTR_ERR(ctx->kdamond);
ctx->kdamond = NULL;
+ } else {
+ wait_for_completion(&ctx->kdamond_started);
}
}
mutex_unlock(&ctx->kdamond_lock);
@@ -1433,6 +1438,7 @@ static int kdamond_fn(void *data)
pr_debug("kdamond (%d) starts\n", current->pid);
+ complete(&ctx->kdamond_started);
kdamond_init_intervals_sis(ctx);
if (ctx->ops.init)
_
Patches currently in -mm which might be from sj(a)kernel.org are
mm-damon-core-make-damon_start-waits-until-kdamond_fn-starts.patch
mm-damon-core-test-test-damon_split_region_ats-access-rate-copying.patch
mm-damon-core-implement-goal-oriented-feedback-driven-quota-auto-tuning.patch
mm-damon-core-implement-goal-oriented-feedback-driven-quota-auto-tuning-fix.patch
mm-damon-sysfs-schemes-implement-files-for-scheme-quota-goals-setup.patch
mm-damon-sysfs-schemes-commit-damos-quota-goals-user-input-to-damos.patch
mm-damon-sysfs-schemes-implement-a-command-for-scheme-quota-goals-only-commit.patch
mm-damon-core-test-add-a-unit-test-for-the-feedback-loop-algorithm.patch
selftests-damon-test-quota-goals-directory.patch
docs-mm-damon-design-document-damos-quota-auto-tuning.patch
docs-abi-damon-document-damos-quota-goals.patch
docs-admin-guide-mm-damon-usage-document-for-quota-goals.patch
The cleanup tasks of kdamond threads, including the reset of the
corresponding DAMON context's ->kdamond field and the decrease of the
global nr_running_ctxs counter, are supposed to be executed by
kdamond_fn(). However, commit 0f91d13366a4 ("mm/damon: simplify stop
mechanism") made neither damon_start() nor damon_stop() ensure that the
corresponding kdamond has started the execution of kdamond_fn().
As a result, the cleanup can be skipped if damon_stop() is called quickly
enough after the previous damon_start(). In particular, the skipped reset
of ->kdamond could cause a use-after-free.
Fix it by making damon_start() wait until kdamond_fn() has started
executing.
Fixes: 0f91d13366a4 ("mm/damon: simplify stop mechanism")
Reported-by: Jakub Acs <acsjakub(a)amazon.de>
Cc: <stable(a)vger.kernel.org> # 5.15.x
Signed-off-by: SeongJae Park <sj(a)kernel.org>
---
Note that the report has not been made public, so this patch doesn't have
a Closes: tag.
include/linux/damon.h | 2 ++
mm/damon/core.c | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index aa34ab433bc5..12510d8c51c6 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -579,6 +579,8 @@ struct damon_ctx {
* update
*/
unsigned long next_ops_update_sis;
+ /* for waiting until the execution of the kdamond_fn is started */
+ struct completion kdamond_started;
/* public: */
struct task_struct *kdamond;
diff --git a/mm/damon/core.c b/mm/damon/core.c
index f91715a58dc7..2c0cc65d041e 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -445,6 +445,8 @@ struct damon_ctx *damon_new_ctx(void)
if (!ctx)
return NULL;
+ init_completion(&ctx->kdamond_started);
+
ctx->attrs.sample_interval = 5 * 1000;
ctx->attrs.aggr_interval = 100 * 1000;
ctx->attrs.ops_update_interval = 60 * 1000 * 1000;
@@ -668,11 +670,14 @@ static int __damon_start(struct damon_ctx *ctx)
mutex_lock(&ctx->kdamond_lock);
if (!ctx->kdamond) {
err = 0;
+ reinit_completion(&ctx->kdamond_started);
ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d",
nr_running_ctxs);
if (IS_ERR(ctx->kdamond)) {
err = PTR_ERR(ctx->kdamond);
ctx->kdamond = NULL;
+ } else {
+ wait_for_completion(&ctx->kdamond_started);
}
}
mutex_unlock(&ctx->kdamond_lock);
@@ -1483,6 +1488,7 @@ static int kdamond_fn(void *data)
pr_debug("kdamond (%d) starts\n", current->pid);
+ complete(&ctx->kdamond_started);
kdamond_init_intervals_sis(ctx);
if (ctx->ops.init)
--
2.34.1
When a queue is unbound from the vfio_ap device driver, it is reset to
ensure its crypto data is not leaked when it is bound to another device
driver. If the queue is unbound because the adapter or domain was removed
from the host's AP configuration, attempting to reset it will fail and
the reset command will return response code 01 (APID not valid). Let's
ensure that the queue is assigned to the host's configuration before
resetting it.
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: eeb386aeb5b7 ("s390/vfio-ap: handle config changed and scan complete notification")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 5db11d50b4b0..b6928fc3b395 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2214,6 +2214,8 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
q = dev_get_drvdata(&apdev->device);
get_update_locks_for_queue(q);
matrix_mdev = q->matrix_mdev;
+ apid = AP_QID_CARD(q->apqn);
+ apqi = AP_QID_QUEUE(q->apqn);
if (matrix_mdev) {
/* If the queue is assigned to the guest's AP configuration */
@@ -2231,8 +2233,16 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
}
}
- vfio_ap_mdev_reset_queue(q);
- flush_work(&q->reset_work);
+ /*
+ * If the queue is not in the host's AP configuration, then resetting
+ * it will fail with response code 01, (APQN not valid); so, let's make
+ * sure it is in the host's config.
+ */
+ if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm) &&
+ test_bit_inv(apqi, (unsigned long *)matrix_dev->info.aqm)) {
+ vfio_ap_mdev_reset_queue(q);
+ flush_work(&q->reset_work);
+ }
done:
if (matrix_mdev)
--
2.43.0
When a queue is unbound from the vfio_ap device driver and that queue is
assigned to a guest's AP configuration, its associated adapter is
unplugged from the guest: queues are defined to a guest via a matrix of
adapters and domains, so it is not possible to unplug a single queue.
If an adapter is removed from the guest's AP configuration, all associated
queues must be reset to prevent leaking crypto data should any of them be
assigned to a different guest or device driver. The one caveat is that if
the queue is being removed because the adapter or domain has been removed
from the host's AP configuration, an attempt to reset the queue will fail
with response code 01 (AP-queue number not valid); so resetting these
queues should be skipped.
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: 09d31ff78793 ("s390/vfio-ap: hot plug/unplug of AP devices when probed/removed")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 39 ++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 9 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index f08321385058..5db11d50b4b0 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2187,6 +2187,23 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
return ret;
}
+static void reset_queues_for_apid(struct ap_matrix_mdev *matrix_mdev,
+ unsigned long apid)
+{
+ DECLARE_BITMAP(apm_reset, AP_DEVICES);
+
+ /*
+ * If the adapter is not in the host's AP configuration, then resetting
+ * any queue for that adapter will fail with response code 01, (APQN not
+ * valid).
+ */
+ if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm)) {
+ bitmap_clear(apm_reset, 0, AP_DEVICES);
+ set_bit_inv(apid, apm_reset);
+ reset_queues_for_apids(matrix_mdev, apm_reset);
+ }
+}
+
void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
{
unsigned long apid, apqi;
@@ -2199,24 +2216,28 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
matrix_mdev = q->matrix_mdev;
if (matrix_mdev) {
- vfio_ap_unlink_queue_fr_mdev(q);
-
- apid = AP_QID_CARD(q->apqn);
- apqi = AP_QID_QUEUE(q->apqn);
-
- /*
- * If the queue is assigned to the guest's APCB, then remove
- * the adapter's APID from the APCB and hot it into the guest.
- */
+ /* If the queue is assigned to the guest's AP configuration */
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
+ /*
+ * Since the queues are defined via a matrix of adapters
+ * and domains, it is not possible to hot unplug a
+ * single queue; so, let's unplug the adapter.
+ */
clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apid(matrix_mdev, apid);
+ goto done;
}
}
vfio_ap_mdev_reset_queue(q);
flush_work(&q->reset_work);
+
+done:
+ if (matrix_mdev)
+ vfio_ap_unlink_queue_fr_mdev(q);
+
dev_set_drvdata(&apdev->device, NULL);
kfree(q);
release_update_locks_for_mdev(matrix_mdev);
--
2.43.0
When filtering the adapters from the configuration profile for a guest to
create or update a guest's AP configuration, if the APID of an adapter and
the APQI of a domain identify a queue device that is not bound to the
vfio_ap device driver, the APID of the adapter will be filtered; an
individual APQN cannot be filtered because APQNs are assigned to an AP
configuration as a matrix of APIDs and APQIs. Consequently, a guest will
not have access to all of the queues associated with the filtered adapter.
If the queues are subsequently made available again to the guest, they
should re-appear in a reset state; so, let's make sure all queues
associated with an adapter unplugged from the guest are reset.
In order to identify the set of queues that need to be reset, let's allow
a vfio_ap_queue object to be simultaneously stored in both a hashtable and
a list: a hashtable storing all of the queues assigned to a matrix mdev,
and a list storing the subset of those queues that need to be reset. For
example, when an adapter is hot unplugged from a guest, all guest queues
associated with that adapter must be reset. Since that may be a subset of
those assigned to the matrix mdev, they can be stored in a list that can
be passed to the vfio_ap_mdev_reset_queues function.
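As a generic illustration of this dual membership (item, key and
collect_for_reset are hypothetical names, not the driver's symbols), an
object can live in a lookup hashtable permanently and be threaded onto a
transient work list on demand:

	#include <linux/hashtable.h>
	#include <linux/list.h>

	struct item {
		int key;
		struct hlist_node hnode;  /* permanent: lookup hashtable */
		struct list_head rnode;   /* transient: reset work list */
	};

	static DEFINE_HASHTABLE(table, 8);

	/* Caller initializes *work with INIT_LIST_HEAD() before use. */
	static void collect_for_reset(struct list_head *work, int key)
	{
		struct item *it;

		hash_for_each_possible(table, it, hnode, key)
			if (it->key == key)
				list_add_tail(&it->rnode, work);
	}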
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 157 +++++++++++++++++++-------
drivers/s390/crypto/vfio_ap_private.h | 11 +-
2 files changed, 126 insertions(+), 42 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 26bd4aca497a..f08321385058 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -33,6 +33,7 @@
#define AP_RESET_INTERVAL 20 /* Reset sleep interval (20ms) */
static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable);
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist);
static struct vfio_ap_queue *vfio_ap_find_queue(int apqn);
static const struct vfio_device_ops vfio_ap_matrix_dev_ops;
static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q);
@@ -661,16 +662,23 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
* device driver.
*
* @matrix_mdev: the matrix mdev whose matrix is to be filtered.
+ * @apm_filtered: a 256-bit bitmap for storing the APIDs filtered from the
+ * guest's AP configuration that are still in the host's AP
+ * configuration.
*
* Note: If an APQN referencing a queue device that is not bound to the vfio_ap
* driver, its APID will be filtered from the guest's APCB. The matrix
* structure precludes filtering an individual APQN, so its APID will be
- * filtered.
+ * filtered. Consequently, all queues associated with the adapter that
+ * are in the host's AP configuration must be reset. If queues are
+ * subsequently made available again to the guest, they should re-appear
+ * in a reset state
*
* Return: a boolean value indicating whether the KVM guest's APCB was changed
* by the filtering or not.
*/
-static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev,
+ unsigned long *apm_filtered)
{
unsigned long apid, apqi, apqn;
DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -680,6 +688,7 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
bitmap_copy(prev_shadow_apm, matrix_mdev->shadow_apcb.apm, AP_DEVICES);
bitmap_copy(prev_shadow_aqm, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS);
vfio_ap_matrix_init(&matrix_dev->info, &matrix_mdev->shadow_apcb);
+ bitmap_clear(apm_filtered, 0, AP_DEVICES);
/*
* Copy the adapters, domains and control domains to the shadow_apcb
@@ -705,8 +714,16 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
apqn = AP_MKQID(apid, apqi);
q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
if (!q || q->reset_status.response_code) {
- clear_bit_inv(apid,
- matrix_mdev->shadow_apcb.apm);
+ clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
+
+ /*
+ * If the adapter was previously plugged into
+ * the guest, let's let the caller know that
+ * the APID was filtered.
+ */
+ if (test_bit_inv(apid, prev_shadow_apm))
+ set_bit_inv(apid, apm_filtered);
+
break;
}
}
@@ -918,6 +935,47 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
AP_MKQID(apid, apqi));
}
+static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
+ unsigned long *apm_reset)
+{
+ struct vfio_ap_queue *q, *tmpq;
+ struct list_head qlist;
+ unsigned long apid, apqi;
+ int apqn, ret = 0;
+
+ if (bitmap_empty(apm_reset, AP_DEVICES))
+ return 0;
+
+ INIT_LIST_HEAD(&qlist);
+
+ for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
+ for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+ AP_DOMAINS) {
+ /*
+ * If the domain is not in the host's AP configuration,
+ * then resetting it will fail with response code 01
+ * (APQN not valid).
+ */
+ if (!test_bit_inv(apqi,
+ (unsigned long *)matrix_dev->info.aqm))
+ continue;
+
+ apqn = AP_MKQID(apid, apqi);
+ q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
+
+ if (q)
+ list_add_tail(&q->reset_qnode, &qlist);
+ }
+ }
+
+ ret = vfio_ap_mdev_reset_qlist(&qlist);
+
+ list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
+ list_del(&q->reset_qnode);
+
+ return ret;
+}
+
/**
* assign_adapter_store - parses the APID from @buf and sets the
* corresponding bit in the mediated matrix device's APM
@@ -958,6 +1016,7 @@ static ssize_t assign_adapter_store(struct device *dev,
{
int ret;
unsigned long apid;
+ DECLARE_BITMAP(apm_filtered, AP_DEVICES);
struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
mutex_lock(&ap_perms_mutex);
@@ -987,8 +1046,10 @@ static ssize_t assign_adapter_store(struct device *dev,
vfio_ap_mdev_link_adapter(matrix_mdev, apid);
- if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apids(matrix_mdev, apm_filtered);
+ }
ret = count;
done:
@@ -1019,11 +1080,12 @@ static struct vfio_ap_queue
* adapter was assigned.
* @matrix_mdev: the matrix mediated device to which the adapter was assigned.
* @apid: the APID of the unassigned adapter.
- * @qtable: table for storing queues associated with unassigned adapter.
+ * @qlist: list for storing queues associated with unassigned adapter that
+ * need to be reset.
*/
static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
unsigned long apid,
- struct ap_queue_table *qtable)
+ struct list_head *qlist)
{
unsigned long apqi;
struct vfio_ap_queue *q;
@@ -1031,11 +1093,10 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
- if (q && qtable) {
+ if (q && qlist) {
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
- hash_add(qtable->queues, &q->mdev_qnode,
- q->apqn);
+ list_add_tail(&q->reset_qnode, qlist);
}
}
}
@@ -1043,26 +1104,23 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
static void vfio_ap_mdev_hot_unplug_adapter(struct ap_matrix_mdev *matrix_mdev,
unsigned long apid)
{
- int loop_cursor;
- struct vfio_ap_queue *q;
- struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+ struct vfio_ap_queue *q, *tmpq;
+ struct list_head qlist;
- hash_init(qtable->queues);
- vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, qtable);
+ INIT_LIST_HEAD(&qlist);
+ vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, &qlist);
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm)) {
clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
}
- vfio_ap_mdev_reset_queues(qtable);
+ vfio_ap_mdev_reset_qlist(&qlist);
- hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+ list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
vfio_ap_unlink_mdev_fr_queue(q);
- hash_del(&q->mdev_qnode);
+ list_del(&q->reset_qnode);
}
-
- kfree(qtable);
}
/**
@@ -1163,6 +1221,7 @@ static ssize_t assign_domain_store(struct device *dev,
{
int ret;
unsigned long apqi;
+ DECLARE_BITMAP(apm_filtered, AP_DEVICES);
struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
mutex_lock(&ap_perms_mutex);
@@ -1192,8 +1251,10 @@ static ssize_t assign_domain_store(struct device *dev,
vfio_ap_mdev_link_domain(matrix_mdev, apqi);
- if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apids(matrix_mdev, apm_filtered);
+ }
ret = count;
done:
@@ -1206,7 +1267,7 @@ static DEVICE_ATTR_WO(assign_domain);
static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
unsigned long apqi,
- struct ap_queue_table *qtable)
+ struct list_head *qlist)
{
unsigned long apid;
struct vfio_ap_queue *q;
@@ -1214,11 +1275,10 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
- if (q && qtable) {
+ if (q && qlist) {
if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
- hash_add(qtable->queues, &q->mdev_qnode,
- q->apqn);
+ list_add_tail(&q->reset_qnode, qlist);
}
}
}
@@ -1226,26 +1286,23 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
static void vfio_ap_mdev_hot_unplug_domain(struct ap_matrix_mdev *matrix_mdev,
unsigned long apqi)
{
- int loop_cursor;
- struct vfio_ap_queue *q;
- struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+ struct vfio_ap_queue *q, *tmpq;
+ struct list_head qlist;
- hash_init(qtable->queues);
- vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, qtable);
+ INIT_LIST_HEAD(&qlist);
+ vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, &qlist);
if (test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
clear_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm);
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
}
- vfio_ap_mdev_reset_queues(qtable);
+ vfio_ap_mdev_reset_qlist(&qlist);
- hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+ list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
vfio_ap_unlink_mdev_fr_queue(q);
- hash_del(&q->mdev_qnode);
+ list_del(&q->reset_qnode);
}
-
- kfree(qtable);
}
/**
@@ -1754,6 +1811,24 @@ static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable)
return ret;
}
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist)
+{
+ int ret = 0;
+ struct vfio_ap_queue *q;
+
+ list_for_each_entry(q, qlist, reset_qnode)
+ vfio_ap_mdev_reset_queue(q);
+
+ list_for_each_entry(q, qlist, reset_qnode) {
+ flush_work(&q->reset_work);
+
+ if (q->reset_status.response_code)
+ ret = -EIO;
+ }
+
+ return ret;
+}
+
static int vfio_ap_mdev_open_device(struct vfio_device *vdev)
{
struct ap_matrix_mdev *matrix_mdev =
@@ -2062,6 +2137,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
{
int ret;
struct vfio_ap_queue *q;
+ DECLARE_BITMAP(apm_filtered, AP_DEVICES);
struct ap_matrix_mdev *matrix_mdev;
ret = sysfs_create_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2094,15 +2170,17 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
!bitmap_empty(matrix_mdev->aqm_add, AP_DOMAINS))
goto done;
- if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apids(matrix_mdev, apm_filtered);
+ }
}
done:
dev_set_drvdata(&apdev->device, q);
release_update_locks_for_mdev(matrix_mdev);
- return 0;
+ return ret;
err_remove_group:
sysfs_remove_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2446,6 +2524,7 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
{
+ DECLARE_BITMAP(apm_filtered, AP_DEVICES);
bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
mutex_lock(&matrix_mdev->kvm->lock);
@@ -2459,7 +2538,7 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
matrix_mdev->adm_add, AP_DOMAINS);
if (filter_adapters || filter_domains)
- do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
+ do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered);
if (filter_cdoms)
do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
@@ -2467,6 +2546,8 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
if (do_hotplug)
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+ reset_queues_for_apids(matrix_mdev, apm_filtered);
+
mutex_unlock(&matrix_dev->mdevs_lock);
mutex_unlock(&matrix_mdev->kvm->lock);
}
diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
index 88aff8b81f2f..20eac8b0f0b9 100644
--- a/drivers/s390/crypto/vfio_ap_private.h
+++ b/drivers/s390/crypto/vfio_ap_private.h
@@ -83,10 +83,10 @@ struct ap_matrix {
};
/**
- * struct ap_queue_table - a table of queue objects.
- *
- * @queues: a hashtable of queues (struct vfio_ap_queue).
- */
+ * struct ap_queue_table - a table of queue objects.
+ *
+ * @queues: a hashtable of queues (struct vfio_ap_queue).
+ */
struct ap_queue_table {
DECLARE_HASHTABLE(queues, 8);
};
@@ -133,6 +133,8 @@ struct ap_matrix_mdev {
* @apqn: the APQN of the AP queue device
* @saved_isc: the guest ISC registered with the GIB interface
* @mdev_qnode: allows the vfio_ap_queue struct to be added to a hashtable
+ * @reset_qnode: allows the vfio_ap_queue struct to be added to a list of queues
+ * that need to be reset
* @reset_status: the status from the last reset of the queue
* @reset_work: work to wait for queue reset to complete
*/
@@ -143,6 +145,7 @@ struct vfio_ap_queue {
#define VFIO_AP_ISC_INVALID 0xff
unsigned char saved_isc;
struct hlist_node mdev_qnode;
+ struct list_head reset_qnode;
struct ap_queue_status reset_status;
struct work_struct reset_work;
};
--
2.43.0
While filtering the mdev matrix, it doesn't make sense - and will have
unexpected results - to filter an APID from the matrix if the APID or one
of the associated APQIs is not in the host's AP configuration. There are
two reasons for this:
1. An adapter or domain that is not in the host's AP configuration can be
assigned to the matrix; this is known as over-provisioning. Queue
devices, however, are only created for adapters and domains in the
host's AP configuration, so there will be no queues associated with an
over-provisioned adapter or domain to filter.
2. The adapter or domain may have been externally removed from the host's
configuration via an SE or HMC attached to a DPM-enabled LPAR. In this
case, the vfio_ap device driver would have been notified by the AP bus
via the on_config_changed callback and the adapter or domain would
have already been filtered.
Since the matrix_mdev->shadow_apcb.apm and matrix_mdev->shadow_apcb.aqm are
copied from the mdev matrix sans the APIDs and APQIs not in the host's AP
configuration, let's loop over those bitmaps instead of those assigned to
the matrix.
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 9382b32e5bd1..47232e19a50e 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -691,8 +691,9 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
(unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
- for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
- for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
+ for_each_set_bit_inv(apid, matrix_mdev->shadow_apcb.apm, AP_DEVICES) {
+ for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+ AP_DOMAINS) {
/*
* If the APQN is not bound to the vfio_ap device
* driver, then we can't assign it to the guest's
--
2.43.0
The vfio_ap_mdev_filter_matrix function is called whenever a new adapter
or domain is assigned to the mdev. The purpose of the function is to
update the guest's AP configuration by filtering the matrix of adapters
and domains assigned to the mdev. When an adapter or domain is assigned,
only the APQNs associated with the APID of the new adapter or the APQI of
the new domain are inspected. If an APQN does not reference a queue device
bound to the vfio_ap device driver, then its APID will be filtered from
the mdev's matrix when updating the guest's AP configuration.
Inspecting only the APID of the new adapter or APQI of the new domain will
result in passing AP queues through to a guest that are not bound to the
vfio_ap device driver under certain circumstances. Consider the following:
guest's AP configuration (all also assigned to the mdev's matrix):
14.0004
14.0005
14.0006
16.0004
16.0005
16.0006
unassign domain 4
unbind queue 16.0005
assign domain 4
When domain 4 is re-assigned, since only domain 4 will be inspected, the
APQNs that will be examined will be:
14.0004
16.0004
Since both of those APQNs reference queue devices that are bound to the
vfio_ap device driver, nothing will get filtered from the mdev's matrix
when updating the guest's AP configuration. Consequently, queue 16.0005
will get passed through despite not being bound to the driver. This
violates the Linux device model requirement that a guest shall only be
given access to devices bound to the device driver facilitating their
pass-through.
To resolve this problem, every adapter and domain assigned to the mdev will
be inspected when filtering the mdev's matrix.
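Sketched out, the corrected filtering pass walks the full assigned matrix
rather than just the delta (condensed from the diff below; the
reset-status check and shadow-bitmap setup are omitted):

	unsigned long apid, apqi;
	int apqn;
	struct vfio_ap_queue *q;

	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm,
				     AP_DOMAINS) {
			apqn = AP_MKQID(apid, apqi);
			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
			if (!q) {
				/* unbound queue: hide the whole adapter */
				clear_bit_inv(apid,
					      matrix_mdev->shadow_apcb.apm);
				break;
			}
		}
	}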
Signed-off-by: Tony Krowiak <akrowiak(a)linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable(a)vger.kernel.org>
---
drivers/s390/crypto/vfio_ap_ops.c | 57 +++++++++----------------------
1 file changed, 17 insertions(+), 40 deletions(-)
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 4db538a55192..9382b32e5bd1 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -670,8 +670,7 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
* Return: a boolean value indicating whether the KVM guest's APCB was changed
* by the filtering or not.
*/
-static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
- struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
{
unsigned long apid, apqi, apqn;
DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -692,8 +691,8 @@ static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
(unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
- for_each_set_bit_inv(apid, apm, AP_DEVICES) {
- for_each_set_bit_inv(apqi, aqm, AP_DOMAINS) {
+ for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
+ for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
/*
* If the APQN is not bound to the vfio_ap device
* driver, then we can't assign it to the guest's
@@ -958,7 +957,6 @@ static ssize_t assign_adapter_store(struct device *dev,
{
int ret;
unsigned long apid;
- DECLARE_BITMAP(apm_delta, AP_DEVICES);
struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
mutex_lock(&ap_perms_mutex);
@@ -987,11 +985,8 @@ static ssize_t assign_adapter_store(struct device *dev,
}
vfio_ap_mdev_link_adapter(matrix_mdev, apid);
- memset(apm_delta, 0, sizeof(apm_delta));
- set_bit_inv(apid, apm_delta);
- if (vfio_ap_mdev_filter_matrix(apm_delta,
- matrix_mdev->matrix.aqm, matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev))
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
ret = count;
@@ -1167,7 +1162,6 @@ static ssize_t assign_domain_store(struct device *dev,
{
int ret;
unsigned long apqi;
- DECLARE_BITMAP(aqm_delta, AP_DOMAINS);
struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
mutex_lock(&ap_perms_mutex);
@@ -1196,11 +1190,8 @@ static ssize_t assign_domain_store(struct device *dev,
}
vfio_ap_mdev_link_domain(matrix_mdev, apqi);
- memset(aqm_delta, 0, sizeof(aqm_delta));
- set_bit_inv(apqi, aqm_delta);
- if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm, aqm_delta,
- matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev))
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
ret = count;
@@ -2091,9 +2082,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
if (matrix_mdev) {
vfio_ap_mdev_link_queue(matrix_mdev, q);
- if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm,
- matrix_mdev->matrix.aqm,
- matrix_mdev))
+ if (vfio_ap_mdev_filter_matrix(matrix_mdev))
vfio_ap_mdev_update_guest_apcb(matrix_mdev);
}
dev_set_drvdata(&apdev->device, q);
@@ -2443,34 +2432,22 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
{
- bool do_hotplug = false;
- int filter_domains = 0;
- int filter_adapters = 0;
- DECLARE_BITMAP(apm, AP_DEVICES);
- DECLARE_BITMAP(aqm, AP_DOMAINS);
+ bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
mutex_lock(&matrix_mdev->kvm->lock);
mutex_lock(&matrix_dev->mdevs_lock);
- filter_adapters = bitmap_and(apm, matrix_mdev->matrix.apm,
- matrix_mdev->apm_add, AP_DEVICES);
- filter_domains = bitmap_and(aqm, matrix_mdev->matrix.aqm,
- matrix_mdev->aqm_add, AP_DOMAINS);
-
- if (filter_adapters && filter_domains)
- do_hotplug |= vfio_ap_mdev_filter_matrix(apm, aqm, matrix_mdev);
- else if (filter_adapters)
- do_hotplug |=
- vfio_ap_mdev_filter_matrix(apm,
- matrix_mdev->shadow_apcb.aqm,
- matrix_mdev);
- else
- do_hotplug |=
- vfio_ap_mdev_filter_matrix(matrix_mdev->shadow_apcb.apm,
- aqm, matrix_mdev);
+ filter_adapters = bitmap_intersects(matrix_mdev->matrix.apm,
+ matrix_mdev->apm_add, AP_DEVICES);
+ filter_domains = bitmap_intersects(matrix_mdev->matrix.aqm,
+ matrix_mdev->aqm_add, AP_DOMAINS);
+ filter_cdoms = bitmap_intersects(matrix_mdev->matrix.adm,
+ matrix_mdev->adm_add, AP_DOMAINS);
+
+ if (filter_adapters || filter_domains)
+ do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
- if (bitmap_intersects(matrix_mdev->matrix.adm, matrix_mdev->adm_add,
- AP_DOMAINS))
+ if (filter_cdoms)
do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
if (do_hotplug)
--
2.43.0
Check for additional CPUID bits to identify TDX guests running with Trust
Domain (TD) partitioning enabled. TD partitioning is like nested
virtualization inside the Trust Domain, so there is an L1 TD VM(M) and
there can be L2 TD VM(s). In this arrangement we are not guaranteed that
TDX_CPUID_LEAF_ID is visible to Linux running as an L2 TD VM, because a
majority of TDX facilities are controlled by the L1 VMM and the L2 TDX
guest needs to use TD-partitioning-aware mechanisms for what's left. So
currently such guests do not have X86_FEATURE_TDX_GUEST set.
We want the kernel to have X86_FEATURE_TDX_GUEST set for all TDX guests,
so we need to check these additional CPUID bits; however, we skip further
initialization in the function because we aren't guaranteed access to TDX
module calls.
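In outline, the detection boils down to the following (a simplified
sketch of the diff below; the constants come from asm/hyperv-tlfs.h and
asm/tdx.h as used in the patch):

	u32 eax, ebx, ecx, edx, sig[3];

	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
	if (memcmp(TDX_IDENT, sig, sizeof(sig))) {
		/* No TDX leaf: check for a paravisor-based TDX partition. */
		cpuid(HYPERV_CPUID_ISOLATION_CONFIG, &eax, &ebx, &ecx, &edx);
		if (!(eax & HV_PARAVISOR_PRESENT) ||
		    (ebx & HV_ISOLATION_TYPE) != HV_ISOLATION_TYPE_TDX)
			return;		/* not a TDX guest at all */
		/* L2 TD: still set X86_FEATURE_TDX_GUEST, skip module setup */
	}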
Cc: <stable(a)vger.kernel.org> # v6.5+
Signed-off-by: Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
---
arch/x86/coco/tdx/tdx.c | 29 ++++++++++++++++++++++++++---
arch/x86/include/asm/tdx.h | 3 +++
2 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1d6b863c42b0..c7bbbaaf654d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
#include <linux/export.h>
#include <linux/io.h>
#include <asm/coco.h>
+#include <asm/hyperv-tlfs.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
#include <asm/insn.h>
@@ -37,6 +38,8 @@
#define TDREPORT_SUBTYPE_0 0
+bool tdx_partitioning_active;
+
/* Called from __tdx_hypercall() for unrecoverable failure */
noinstr void __tdx_hypercall_failed(void)
{
@@ -757,19 +760,38 @@ static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
return true;
}
+
+static bool early_is_hv_tdx_partitioning(void)
+{
+ u32 eax, ebx, ecx, edx;
+ cpuid(HYPERV_CPUID_ISOLATION_CONFIG, &eax, &ebx, &ecx, &edx);
+ return eax & HV_PARAVISOR_PRESENT &&
+ (ebx & HV_ISOLATION_TYPE) == HV_ISOLATION_TYPE_TDX;
+}
+
void __init tdx_early_init(void)
{
u64 cc_mask;
u32 eax, sig[3];
cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
-
- if (memcmp(TDX_IDENT, sig, sizeof(sig)))
- return;
+ if (memcmp(TDX_IDENT, sig, sizeof(sig))) {
+ tdx_partitioning_active = early_is_hv_tdx_partitioning();
+ if (!tdx_partitioning_active)
+ return;
+ }
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
cc_vendor = CC_VENDOR_INTEL;
+
+ /*
+ * Need to defer cc_mask and page visibility callback initializations
+ * to a TD-partitioning aware implementation.
+ */
+ if (tdx_partitioning_active)
+ goto exit;
+
tdx_parse_tdinfo(&cc_mask);
cc_set_mask(cc_mask);
@@ -820,5 +842,6 @@ void __init tdx_early_init(void)
*/
x86_cpuinit.parallel_bringup = false;
+exit:
pr_info("Guest detected\n");
}
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 603e6d1e9d4a..fe22f8675859 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -52,6 +52,7 @@ bool tdx_early_handle_ve(struct pt_regs *regs);
int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);
+extern bool tdx_partitioning_active;
#else
static inline void tdx_early_init(void) { };
@@ -71,6 +72,8 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
{
return -ENODEV;
}
+
+#define tdx_partitioning_active false
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */
--
2.39.2