From: Catalin Marinas <catalin.marinas(a)arm.com>
Since the actual slab freeing is deferred when calling kvfree_rcu(), so
is the kmemleak_free() callback informing kmemleak of the object
deletion. From the perspective of the kvfree_rcu() caller, the object is
freed and it may remove any references to it. Since kmemleak does not
scan the RCU internal data storing the pointer, it will report such
objects as leaks during the grace period.
Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note
that the tiny RCU implementation does not have this issue since the
objects can be tracked from the rcu_ctrlblk structure.
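For illustration only, a minimal sketch of the pattern that triggers the
false positive (the structure and names below are made up, not taken from
the report):

  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct foo {
          struct rcu_head rcu;
          int data;
  };

  static struct foo *cached_foo;

  static void drop_foo(void)
  {
          struct foo *f = cached_foo;

          cached_foo = NULL;      /* last reachable reference removed */
          kvfree_rcu(f, rcu);     /* actual kfree() deferred past the grace period */
          /*
           * Until the grace period ends, kmemleak still tracks the object
           * but cannot find a pointer to it while scanning, so it reports
           * a (false) leak unless told to ignore the object.
           */
  }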
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reported-by: Christoph Paasch <cpaasch(a)apple.com>
Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/
Cc: <stable(a)vger.kernel.org>
Tested-by: Christoph Paasch <cpaasch(a)apple.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
kernel/rcu/tree.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cb1caefa8bd0..24423877962c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -31,6 +31,7 @@
#include <linux/bitops.h>
#include <linux/export.h>
#include <linux/completion.h>
+#include <linux/kmemleak.h>
#include <linux/moduleparam.h>
#include <linux/panic.h>
#include <linux/panic_notifier.h>
@@ -3388,6 +3389,14 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
success = true;
}
+ /*
+ * The kvfree_rcu() caller considers the pointer freed at this point
+ * and likely removes any references to it. Since the actual slab
+ * freeing (and kmemleak_free()) is deferred, tell kmemleak to ignore
+ * this object (no scanning or false positives reporting).
+ */
+ kmemleak_ignore(ptr);
+
// Set timer to drain after KFREE_DRAIN_JIFFIES.
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
schedule_delayed_monitor_work(krcp);
--
2.42.0.582.g8ccd20d70d-goog
Hi Thinh,
I can confirm your patch fixed the issue on RK3399 running Linux 6.1.54.
I'm not subscribed to the mailing list, so I'm sorry if this email causes
any issues; I'm not sure how to properly reply to a thread from a list I
am not on.
Best,
Da
Like the Lenovo 82TL, 82V2, 82QF and 82UG, the 82YM (Yoga 7 14ARP8)
requires an entry in the quirk list to enable the internal microphone.
The latter two received similar fixes in commit 1263cc0f414d
("ASoC: amd: yc: Fix non-functional mic on Lenovo 82QF and 82UG").
Fixes: c008323fe361 ("ASoC: amd: yc: Fix a non-functional mic on Lenovo 82SJ")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sven Frotscher <sven.frotscher(a)gmail.com>
---
v3->v4 changes:
* re-add blank line between commit message title and body
---
v2->v3 changes:
* add message title of referenced commit to commit message
* make whitespace consistent with surrounding code
* use a patch-friendly e-mail client
---
v1->v2 changes:
* add Fixes and Cc tags to commit message
* remove redundant LKML link from commit message
* fix mangled diff
---
sound/soc/amd/yc/acp6x-mach.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/sound/soc/amd/yc/acp6x-mach.c b/sound/soc/amd/yc/acp6x-mach.c
index 94e9eb8e73f2..15a864dcd7bd 100644
--- a/sound/soc/amd/yc/acp6x-mach.c
+++ b/sound/soc/amd/yc/acp6x-mach.c
@@ -241,6 +241,13 @@ static const struct dmi_system_id yc_acp_quirk_table[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "82V2"),
}
},
+ {
+ .driver_data = &acp6x_card,
+ .matches = {
+ DMI_MATCH(DMI_BOARD_VENDOR, "LENOVO"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "82YM"),
+ }
+ },
{
.driver_data = &acp6x_card,
.matches = {
--
2.42.0
Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
a soft reset on Renesas SoCs") fixes an issue with reception on Renesas'
own SJA1000 CAN controller: the Rx buffer is only 5 messages long, so when
the bus is loaded (e.g. a message every 50us), overruns may easily happen.
Upon an overrun, due to possible internal crosstalk, the controller enters
a frozen state which can only be unlocked with a soft reset (found
experimentally). The solution was to offload a call to sja1000_start() to
a threaded handler, as this operation needs to sleep and must therefore
run in process context. sja1000_start() basically enters "reset mode",
performs a proper software reset and returns to "normal mode".
Since this fix was introduced, we no longer observe any stalls in
reception. However, it was sporadically observed that the transmit path
would now freeze. Further investigation pointed to the fix mentioned
above, and especially to the reset operation. Reproducing the reset in a
loop helped identify what could go wrong. The sja1000 is a single Tx
queue device which leverages the netdev helpers to process one Tx message
at a time. The logic is: the queue is stopped, the message is sent to the
transceiver, and once it has been properly transmitted the controller sets
a status bit which triggers an interrupt; in the interrupt handler the
transmission status is checked and the queue is woken up. Unfortunately,
if an overrun happens, we might perform the soft reset precisely between
the transmission of the buffer to the transceiver and the arrival of the
transmission status bit. We would then stop the transmission operation
without re-enabling the queue, causing all further transmissions to be
ignored.
The reset interrupt can only happen while the device is "open", and
after a reset we want to resume normal operations anyway, no matter
whether a packet to transmit got dropped in the process, so we shall wake
up the queue. Restarting the device and waking up the queue is exactly
what sja1000_set_mode(CAN_MODE_START) does. In order to be consistent
about the queue state, we must acquire a lock in both the reset handler
and the transmit path to ensure the two operations are serialized. As the
reset handler might still be called after a frame has been handed to the
transceiver but before it actually gets transmitted, we must ensure we
don't leak the skb, so we free it (the behavior is consistent, whether or
not an skb was on the stack).
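For reference, a minimal sketch of the Tx flow described above (function
names and bodies are simplified for illustration and are not the actual
driver code); the race window lies between returning from the xmit path
and the queue wake-up in the Tx-done interrupt path:

  #include <linux/interrupt.h>
  #include <linux/netdevice.h>

  static netdev_tx_t xmit_sketch(struct sk_buff *skb, struct net_device *dev)
  {
          netif_stop_queue(dev);  /* single Tx buffer: stop the queue */
          /* ... write the frame to the controller and request transmission ... */
          return NETDEV_TX_OK;    /* a soft reset here leaves the queue stopped */
  }

  static irqreturn_t tx_done_irq_sketch(int irq, void *dev_id)
  {
          struct net_device *dev = dev_id;

          /* ... transmission status bit checked here ... */
          netif_wake_queue(dev);  /* never reached if the reset ran in between */
          return IRQ_HANDLED;
  }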
Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
---
Changes in v2:
* As Marc suggested, use netif_tx_{,un}lock() instead of our own
spin_lock.
drivers/net/can/sja1000/sja1000.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index ae47fc72aa96..91e3fb3eed20 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -297,6 +297,7 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
if (can_dropped_invalid_skb(dev, skb))
return NETDEV_TX_OK;
+ netif_tx_lock(dev);
netif_stop_queue(dev);
fi = dlc = cf->can_dlc;
@@ -335,6 +336,8 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
sja1000_write_cmdreg(priv, cmd_reg_val);
+ netif_tx_unlock(dev);
+
return NETDEV_TX_OK;
}
@@ -396,7 +399,13 @@ static irqreturn_t sja1000_reset_interrupt(int irq, void *dev_id)
struct net_device *dev = (struct net_device *)dev_id;
netdev_dbg(dev, "performing a soft reset upon overrun\n");
- sja1000_start(dev);
+
+ netif_tx_lock(dev);
+
+ can_free_echo_skb(dev, 0);
+ sja1000_set_mode(dev, CAN_MODE_START);
+
+ netif_tx_unlock(dev);
return IRQ_HANDLED;
}
--
2.34.1
Fix four issues with the resctrl selftests.
The signal handling fix became necessary after the mount/umount fixes,
and the uninitialized member bug was discovered during review.
The other two came up when I ran the resctrl selftests across the server
fleet in our lab to validate the upcoming CAT test rewrite (the rewrite
is not part of this series).
These are developed on top of, and should apply cleanly to, at least the
benchmark cleanup series (they might also apply cleanly without the
benchmark series, but I didn't test that).
v3:
- Add fix for uninitialized sa_flags
- Handle ksft_exit_fail_msg() in per-test functions
- Make signal handler registration failures also exit
- Improve changelogs
v2:
- Include patch to move _GNU_SOURCE to Makefile to allow normal #include
placement
- Rework the signal register/unregister into patch to use helpers
- Fixed incorrect function parameter description
- Use return !!res to avoid confusing implicit boolean conversion
- Improve MBA/MBM success bound patch's changelog
- Tweak Cc: stable dependencies (make it a chain).
Ilpo Järvinen (7):
selftests/resctrl: Fix uninitialized .sa_flags
selftests/resctrl: Extend signal handler coverage to unmount on
receiving signal
selftests/resctrl: Remove duplicate feature check from CMT test
selftests/resctrl: Move _GNU_SOURCE define into Makefile
selftests/resctrl: Refactor feature check to use resource and feature
name
selftests/resctrl: Fix feature checks
selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cat_test.c | 8 --
tools/testing/selftests/resctrl/cmt_test.c | 3 -
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +-
.../testing/selftests/resctrl/resctrl_tests.c | 82 ++++++++++++-------
tools/testing/selftests/resctrl/resctrl_val.c | 24 +++---
tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++----------
9 files changed, 96 insertions(+), 103 deletions(-)
--
2.30.2
During the error path, the vma iterator may not be correctly positioned
or set to the correct range. Undo the vma_prev() call by resetting to
the passed in address. Re-walking to the same range will fix the range
to the area previously passed in.
Users would notice increased cycles as vma_merge() would be called an
extra time with vma == prev, and thus would fail to merge and return.
Link: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejt…
Closes: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejt…
Fixes: 18b098af2890 ("vma_merge: set vma iterator to correct position.")
Cc: stable(a)vger.kernel.org
Cc: Jann Horn <jannh(a)google.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
---
mm/mmap.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index b56a7f0c9f85..acb7dea49e23 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -975,7 +975,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
/* Error in anon_vma clone. */
if (err)
- return NULL;
+ goto anon_vma_fail;
if (vma_start < vma->vm_start || vma_end > vma->vm_end)
vma_expanded = true;
@@ -988,7 +988,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
}
if (vma_iter_prealloc(vmi, vma))
- return NULL;
+ goto prealloc_fail;
init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
@@ -1016,6 +1016,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
vma_complete(&vp, vmi, mm);
khugepaged_enter_vma(res, vm_flags);
return res;
+
+prealloc_fail:
+anon_vma_fail:
+ vma_iter_set(vmi, addr);
+ vma_iter_load(vmi);
+ return NULL;
}
/*
--
2.40.1
commit 0bdf399342c5 ("net: Avoid address overwrite in kernel_connect")
ensured that kernel_connect() will not overwrite the address parameter
in cases where BPF connect hooks perform an address rewrite. This change
replaces direct calls to sock->ops->connect() in net with kernel_connect()
to make these calls safe.
Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/
Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect")
Cc: stable(a)vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: Jordan Rife <jrife(a)google.com>
---
v4->v5: Remove non-net changes.
v3->v4: Remove precondition check for addrlen.
v2->v3: Add "Fixes" tag. Check for positivity in addrlen sanity check.
v1->v2: Split up original patch into patch series. Insulate calls with
kernel_connect() instead of pushing address copy deeper into
sock->ops->connect().
net/netfilter/ipvs/ip_vs_sync.c | 4 ++--
net/rds/tcp_connect.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index da5af28ff57b5..6e4ed1e11a3b7 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1505,8 +1505,8 @@ static int make_send_sock(struct netns_ipvs *ipvs, int id,
}
get_mcast_sockaddr(&mcast_addr, &salen, &ipvs->mcfg, id);
- result = sock->ops->connect(sock, (struct sockaddr *) &mcast_addr,
- salen, 0);
+ result = kernel_connect(sock, (struct sockaddr *)&mcast_addr,
+ salen, 0);
if (result < 0) {
pr_err("Error connecting to the multicast addr\n");
goto error;
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index f0c477c5d1db4..d788c6d28986f 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -173,7 +173,7 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
* own the socket
*/
rds_tcp_set_callbacks(sock, cp);
- ret = sock->ops->connect(sock, addr, addrlen, O_NONBLOCK);
+ ret = kernel_connect(sock, addr, addrlen, O_NONBLOCK);
rdsdebug("connect to address %pI6c returned %d\n", &conn->c_faddr, ret);
if (ret == -EINPROGRESS)
--
2.42.0.515.g380fc7ccd1-goog
Hello guys,
Over the last few months, I have noticed a performance regression when
using read() and write() on my high-speed NVMe SSD (about 7GB/s).
To get more precise information about it, I quickly developed a benchmark
tool that basically runs read() or write() in a loop to simulate a
sequential file read or write. The tool also measures the real time
consumed by the loop. Finally, the tool can call open() with or without
O_DIRECT.
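For reference, a minimal sketch of such a sequential-read loop (my own
illustration, not the actual tool; the file path argument, buffer size and
"direct" switch are placeholders):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          size_t bufsz = 200UL << 20;     /* 200MB read buffer */
          int flags = O_RDONLY;
          struct timespec t0, t1;
          long long total = 0;
          ssize_t n;
          void *buf;
          int fd;

          if (argc > 2 && !strcmp(argv[2], "direct"))
                  flags |= O_DIRECT;      /* bypass the page cache */

          if (posix_memalign(&buf, 4096, bufsz))  /* O_DIRECT needs alignment */
                  return 1;

          fd = open(argv[1], flags);
          if (fd < 0)
                  return 1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          while ((n = read(fd, buf, bufsz)) > 0)  /* sequential read loop */
                  total += n;
          clock_gettime(CLOCK_MONOTONIC, &t1);

          printf("%lld bytes in %.1fs\n", total,
                 (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
          close(fd);
          return 0;
  }

The write() case would be the same loop with write() and O_WRONLY (plus
dropping caches beforehand for the non-direct runs).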
I ran the tests on EXT4 and Exfat with the following settings (buffer
values were chosen for best results):
- Write settings: buffer 400mb * 100
- Read settings: buffer 200mb
- Drop caches before non-direct read/write test
With this hardware:
- CPU AMD Ryzen 7600X
- RAM DDR5 5200 32GB
- SSD Kingston Fury Renegade 4TB with 4K LBA
Here are some results I got with the latest upstream kernels (default
config):
+------------------+----------+------------------+------------------+------------------+------------------+------------------+
| ~42GB            | O_DIRECT | Linux 6.2.0      | Linux 6.3.0      | Linux 6.4.0      | Linux 6.5.0      | Linux 6.5.5      |
+------------------+----------+------------------+------------------+------------------+------------------+------------------+
| Ext4 (sector 4k) |          |                  |                  |                  |                  |                  |
| Read             | no       | 7.2s (5800MB/s)  | 7.1s (5890MB/s)  | 8.3s (5050MB/s)  | 13.2s (3180MB/s) | 13.2s (3180MB/s) |
| Write            | no       | 12.0s (3500MB/s) | 12.6s (3340MB/s) | 12.2s (3440MB/s) | 28.9s (1450MB/s) | 28.9s (1450MB/s) |
| Read             | yes      | 6.0s (7000MB/s)  | 6.0s (7020MB/s)  | 5.9s (7170MB/s)  | 5.9s (7100MB/s)  | 5.9s (7100MB/s)  |
| Write            | yes      | 6.7s (6220MB/s)  | 6.7s (6290MB/s)  | 6.9s (6080MB/s)  | 6.9s (6080MB/s)  | 6.9s (6970MB/s)  |
| Exfat (sector ?) |          |                  |                  |                  |                  |                  |
| Read             | no       | 7.3s (5770MB/s)  | 7.2s (5830MB/s)  | 9s (4620MB/s)    | 13.3s (3150MB/s) | 13.2s (3180MB/s) |
| Write            | no       | 8.3s (5040MB/s)  | 8.9s (4750MB/s)  | 8.3s (5040MB/s)  | 18.3s (2290MB/s) | 18.5s (2260MB/s) |
| Read             | yes      | 6.2s (6760MB/s)  | 6.1s (6870MB/s)  | 6.0s (6980MB/s)  | 6.5s (6440MB/s)  | 6.6s (6320MB/s)  |
| Write            | yes      | 16.1s (2610MB/s) | 16.0s (2620MB/s) | 18.7s (2240MB/s) | 34.1s (1230MB/s) | 34.5s (1220MB/s) |
+------------------+----------+------------------+------------------+------------------+------------------+------------------+
Please note that I rounded some values for readability. Small variations
can be considered margin of error.
Ext4 results: cached read/write times have increased by almost 100% from
6.2.0 to 6.5.0, with a first increase in 6.4.0. Direct access times have
stayed similar though.
Exfat results: this time performance decreases both with and without
direct access.
I realize there are thousands of commits in between, and the issue could
come from multiple kernel parts such as the page cache, the filesystem
implementation (especially for Exfat), the IO engine, a driver, etc. The
results also show that more than one specific version is affected.
Anyway, in the end performance has decreased significantly.
If you want to review my benchmark tool's source code, please ask.
PS: sending again, as only a plain-text body is accepted
Regards
Florent DELAHAYE
[ Replying to show this patch that got dropped from the mailing list ]
On Sat, 30 Sep 2023 16:32:16 -0400
Steven Rostedt <rostedt(a)goodmis.org> wrote:
> From: Beau Belgrave <beaub(a)linux.microsoft.com>
>
> All architectures should use a long aligned address passed to set_bit().
> User processes can pass either a 32-bit or 64-bit sized value to be
> updated when tracing is enabled when on a 64-bit kernel. Both cases are
> ensured to be naturally aligned, however, that is not enough. The
> address must be long aligned without affecting checks on the value
> within the user process which require different adjustments for the bit
> for little and big endian CPUs.
>
> Add a compat flag to user_event_enabler that indicates when a 32-bit
> value is being used on a 64-bit kernel. Long align addresses and correct
> the bit to be used by set_bit() to account for this alignment. Ensure
> compat flags are copied during forks and used during deletion clears.
>
> Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux…
> Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@r…
>
> Cc: stable(a)vger.kernel.org
> Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
> Reported-by: Clément Léger <cleger(a)rivosinc.com>
> Suggested-by: Clément Léger <cleger(a)rivosinc.com>
> Signed-off-by: Beau Belgrave <beaub(a)linux.microsoft.com>
> Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
> ---
> kernel/trace/trace_events_user.c | 58 ++++++++++++++++++++++++++++----
> 1 file changed, 51 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index 6f046650e527..b87f41187c6a 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -127,8 +127,13 @@ struct user_event_enabler {
> /* Bit 7 is for freeing status of enablement */
> #define ENABLE_VAL_FREEING_BIT 7
>
> -/* Only duplicate the bit value */
> -#define ENABLE_VAL_DUP_MASK ENABLE_VAL_BIT_MASK
> +/* Bit 8 is for marking 32-bit on 64-bit */
> +#define ENABLE_VAL_32_ON_64_BIT 8
> +
> +#define ENABLE_VAL_COMPAT_MASK (1 << ENABLE_VAL_32_ON_64_BIT)
> +
> +/* Only duplicate the bit and compat values */
> +#define ENABLE_VAL_DUP_MASK (ENABLE_VAL_BIT_MASK | ENABLE_VAL_COMPAT_MASK)
>
> #define ENABLE_BITOPS(e) (&(e)->values)
>
> @@ -174,6 +179,30 @@ struct user_event_validator {
> int flags;
> };
>
> +static inline void align_addr_bit(unsigned long *addr, int *bit,
> + unsigned long *flags)
> +{
> + if (IS_ALIGNED(*addr, sizeof(long))) {
> +#ifdef __BIG_ENDIAN
> + /* 32 bit on BE 64 bit requires a 32 bit offset when aligned. */
> + if (test_bit(ENABLE_VAL_32_ON_64_BIT, flags))
> + *bit += 32;
> +#endif
> + return;
> + }
> +
> + *addr = ALIGN_DOWN(*addr, sizeof(long));
> +
> + /*
> + * We only support 32 and 64 bit values. The only time we need
> + * to align is a 32 bit value on a 64 bit kernel, which on LE
> + * is always 32 bits, and on BE requires no change when unaligned.
> + */
> +#ifdef __LITTLE_ENDIAN
> + *bit += 32;
> +#endif
> +}
> +
> typedef void (*user_event_func_t) (struct user_event *user, struct iov_iter *i,
> void *tpdata, bool *faulted);
>
> @@ -482,6 +511,7 @@ static int user_event_enabler_write(struct user_event_mm *mm,
> unsigned long *ptr;
> struct page *page;
> void *kaddr;
> + int bit = ENABLE_BIT(enabler);
> int ret;
>
> lockdep_assert_held(&event_mutex);
> @@ -497,6 +527,8 @@ static int user_event_enabler_write(struct user_event_mm *mm,
> test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler))))
> return -EBUSY;
>
> + align_addr_bit(&uaddr, &bit, ENABLE_BITOPS(enabler));
> +
> ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
> &page, NULL);
>
> @@ -515,9 +547,9 @@ static int user_event_enabler_write(struct user_event_mm *mm,
>
> /* Update bit atomically, user tracers must be atomic as well */
> if (enabler->event && enabler->event->status)
> - set_bit(ENABLE_BIT(enabler), ptr);
> + set_bit(bit, ptr);
> else
> - clear_bit(ENABLE_BIT(enabler), ptr);
> + clear_bit(bit, ptr);
>
> kunmap_local(kaddr);
> unpin_user_pages_dirty_lock(&page, 1, true);
> @@ -849,6 +881,12 @@ static struct user_event_enabler
> enabler->event = user;
> enabler->addr = uaddr;
> enabler->values = reg->enable_bit;
> +
> +#if BITS_PER_LONG >= 64
> + if (reg->enable_size == 4)
> + set_bit(ENABLE_VAL_32_ON_64_BIT, ENABLE_BITOPS(enabler));
> +#endif
> +
> retry:
> /* Prevents state changes from racing with new enablers */
> mutex_lock(&event_mutex);
> @@ -2377,7 +2415,8 @@ static long user_unreg_get(struct user_unreg __user *ureg,
> }
>
> static int user_event_mm_clear_bit(struct user_event_mm *user_mm,
> - unsigned long uaddr, unsigned char bit)
> + unsigned long uaddr, unsigned char bit,
> + unsigned long flags)
> {
> struct user_event_enabler enabler;
> int result;
> @@ -2385,7 +2424,7 @@ static int user_event_mm_clear_bit(struct user_event_mm *user_mm,
>
> memset(&enabler, 0, sizeof(enabler));
> enabler.addr = uaddr;
> - enabler.values = bit;
> + enabler.values = bit | flags;
> retry:
> /* Prevents state changes from racing with new enablers */
> mutex_lock(&event_mutex);
> @@ -2415,6 +2454,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
> struct user_event_mm *mm = current->user_event_mm;
> struct user_event_enabler *enabler, *next;
> struct user_unreg reg;
> + unsigned long flags;
> long ret;
>
> ret = user_unreg_get(ureg, &reg);
> @@ -2425,6 +2465,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
> if (!mm)
> return -ENOENT;
>
> + flags = 0;
> ret = -ENOENT;
>
> /*
> @@ -2441,6 +2482,9 @@ static long user_events_ioctl_unreg(unsigned long uarg)
> ENABLE_BIT(enabler) == reg.disable_bit) {
> set_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler));
>
> + /* We must keep compat flags for the clear */
> + flags |= enabler->values & ENABLE_VAL_COMPAT_MASK;
> +
> if (!test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler)))
> user_event_enabler_destroy(enabler, true);
>
> @@ -2454,7 +2498,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
> /* Ensure bit is now cleared for user, regardless of event status */
> if (!ret)
> ret = user_event_mm_clear_bit(mm, reg.disable_addr,
> - reg.disable_bit);
> + reg.disable_bit, flags);
>
> return ret;
> }
From: "Steven Rostedt (Google)" <rostedt(a)goodmis.org>
It was discovered that the ring buffer polling was incorrectly stating
that read would not block, but that's because polling did not take into
account that reads will block if the "buffer-percent" was set. Instead,
the ring buffer polling would say reads would not block if there was any
data in the ring buffer. This was incorrect behavior from a user space
point of view. This was fixed by commit 42fb0a1e84ff by having the polling
code check whether the ring buffer holds more data than the user-specified
"buffer percent" requires.
The problem now is that the polling code did not register itself to the
writer that it wanted to wait for a specific "full" value of the ring
buffer. The result was that the writer would wake the polling waiter
whenever there was a new event. The polling waiter would then wake up, see
that there's not enough data in the ring buffer to notify user space and
then go back to sleep. The next event would wake it up again.
Before the polling fix was added, the code would wake up around 100 times
for a hackbench 30 benchmark. After the "fix", due to the constant waking
by the writer, it would wake up over 11,000 times! It would never leave
the kernel, so the user space behavior was still "correct", but this
is definitely not the desired effect.
To fix this, have the polling code add what it's waiting for to the
"shortest_full" variable, to tell the writer not to wake it up if the
buffer is not as full as it expects to be.
Note, after this fix, it appears that the waiter is now woken up around 2x
the times it was before (~200). This is a tremendous improvement from the
11,000 times, but I will need to spend some time to see why polling is
more aggressive in its wakeups than the read blocking code.
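As an illustration only (simplified, and not the exact kernel code), the
writer-side decision that "shortest_full" feeds into looks roughly like:

  /* Wake "full" waiters only once the smallest requested fill level is met. */
  static bool wake_full_waiters_sketch(int percent_filled, int shortest_full)
  {
          if (!shortest_full)
                  return true;    /* no waiter registered a threshold */

          return percent_filled >= shortest_full;
  }

Before this patch, the polling path never updated that threshold, so the
writer treated any new event as a reason to wake the poller.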
Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschac…
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark")
Reported-by: Julia Lawall <julia.lawall(a)inria.fr>
Tested-by: Julia Lawall <julia.lawall(a)inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/ring_buffer.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 28daf0ce95c5..515cafdb18d9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1137,6 +1137,9 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
if (full) {
poll_wait(filp, &work->full_waiters, poll_table);
work->full_waiters_pending = true;
+ if (!cpu_buffer->shortest_full ||
+ cpu_buffer->shortest_full > full)
+ cpu_buffer->shortest_full = full;
} else {
poll_wait(filp, &work->waiters, poll_table);
work->waiters_pending = true;
--
2.40.1