The following commit has been merged into the smp/urgent branch of tip:
Commit-ID: 2b8272ff4a70b866106ae13c36be7ecbef5d5da2
Gitweb: https://git.kernel.org/tip/2b8272ff4a70b866106ae13c36be7ecbef5d5da2
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Wed, 23 Aug 2023 10:47:02 +02:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Wed, 30 Aug 2023 12:24:22 +02:00
cpu/hotplug: Prevent self deadlock on CPU hot-unplug
Xiongfeng reported and debugged a self deadlock of the task which initiates
and controls a CPU hot-unplug operation vs. the CFS bandwidth timer.
CPU1 CPU2
T1 sets cfs_quota
starts hrtimer cfs_bandwidth 'period_timer'
T1 is migrated to CPU2
T1 initiates offlining of CPU1
Hotplug operation starts
...
'period_timer' expires and is re-enqueued on CPU1
...
take_cpu_down()
CPU1 shuts down and does not handle timers
anymore. They have to be migrated in the
post dead hotplug steps by the control task.
T1 runs the post dead offline operation
T1 is scheduled out
T1 waits for 'period_timer' to expire
T1 waits there forever if it is scheduled out before it can execute the hrtimer
offline callback hrtimers_dead_cpu().
Cure this by delegating the hotplug control operation to a worker thread on
an online CPU. This takes the initiating user space task, which might be
affected by the bandwidth timer, completely out of the picture.
Reported-by: Xiongfeng Wang <wangxiongfeng2(a)huawei.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Yu Liao <liaoyu15(a)huawei.com>
Acked-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com
Link: https://lore.kernel.org/r/87h6oqdq0i.ffs@tglx
---
kernel/cpu.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f6811c8..6de7c6b 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1487,8 +1487,22 @@ out:
return ret;
}
+struct cpu_down_work {
+ unsigned int cpu;
+ enum cpuhp_state target;
+};
+
+static long __cpu_down_maps_locked(void *arg)
+{
+ struct cpu_down_work *work = arg;
+
+ return _cpu_down(work->cpu, 0, work->target);
+}
+
static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
{
+ struct cpu_down_work work = { .cpu = cpu, .target = target, };
+
/*
* If the platform does not support hotplug, report it explicitly to
* differentiate it from a transient offlining failure.
@@ -1497,7 +1511,15 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
return -EOPNOTSUPP;
if (cpu_hotplug_disabled)
return -EBUSY;
- return _cpu_down(cpu, 0, target);
+
+ /*
+ * Ensure that the control task does not run on the to be offlined
+ * CPU to prevent a deadlock against cfs_b->period_timer.
+ */
+ cpu = cpumask_any_but(cpu_online_mask, cpu);
+ if (cpu >= nr_cpu_ids)
+ return -EBUSY;
+ return work_on_cpu(cpu, __cpu_down_maps_locked, &work);
}
static int cpu_down(unsigned int cpu, enum cpuhp_state target)
This is the start of the stable review cycle for the 4.14.324 release.
There are 57 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 30 Aug 2023 10:11:30 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.324-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.14.324-rc1
Rob Clark <robdclark(a)chromium.org>
dma-buf/sw_sync: Avoid recursive lock during fence signal
Zhu Wang <wangzhu9(a)huawei.com>
scsi: core: raid_class: Remove raid_component_add()
Zhu Wang <wangzhu9(a)huawei.com>
scsi: snic: Fix double free in snic_tgt_create()
Ido Schimmel <idosch(a)nvidia.com>
rtnetlink: Reject negative ifindexes in RTM_NEWLINK
Feng Tang <feng.tang(a)intel.com>
x86/fpu: Set X86_FEATURE_OSXSAVE feature after enabling OSXSAVE in CR4
Wei Chen <harperchen1110(a)gmail.com>
media: vcodec: Fix potential array out-of-bounds in encoder queue_setup
Helge Deller <deller(a)gmx.de>
lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernels
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Fix batadv_v_ogm_aggr_send memory leak
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Fix TT global entry leak when client roamed back
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not get eth header before batadv_check_management_packet
Sven Eckelmann <sven(a)narfation.org>
batman-adv: Trigger events for auto adjusted MTU
Michael Ellerman <mpe(a)ellerman.id.au>
ibmveth: Use dcbf rather than dcbfl
Sishuai Gong <sishuai.system(a)gmail.com>
ipvs: fix racy memcpy in proc_do_sync_threshold
Junwei Hu <hujunwei4(a)huawei.com>
ipvs: Improve robustness to the ipvs sysctl
Alessio Igor Bogani <alessio.bogani(a)elettra.eu>
igb: Avoid starting unnecessary workqueues
Eric Dumazet <edumazet(a)google.com>
sock: annotate data-races around prot->memory_pressure
Zheng Yejian <zhengyejian1(a)huawei.com>
tracing: Fix memleak due to race between current_tracer and trace
Justin Chen <justin.chen(a)broadcom.com>
net: phy: broadcom: stub c45 read/write for 54810
Lin Ma <linma(a)zju.edu.cn>
net: xfrm: Amend XFRMA_SEC_CTX nla_policy structure
Jason Xing <kernelxing(a)tencent.com>
net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled
Kuniyuki Iwashima <kuniyu(a)amazon.com>
af_unix: Fix null-ptr-deref in unix_stream_sendpage().
Zhang Shurong <zhang_shurong(a)foxmail.com>
ASoC: rt5665: add missed regulator_bulk_disable
Xin Long <lucien.xin(a)gmail.com>
netfilter: set default timeout to 3 secs for sctp shutdown send and recv state
Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
test_firmware: prevent race conditions by a correct implementation of locking
Qi Zheng <zhengqi.arch(a)bytedance.com>
binder: fix memory leak in binder_init()
Tony Lindgren <tony(a)atomide.com>
serial: 8250: Fix oops for port->pm on uart_change_pm()
Yang Yingliang <yangyingliang(a)huawei.com>
mmc: wbsd: fix double mmc_free_host() in wbsd_init()
Russell Harmon via samba-technical <samba-technical(a)lists.samba.org>
cifs: Release folio lock on fscache read hit.
dengxiang <dengxiang(a)nfschina.com>
ALSA: usb-audio: Add support for Mythware XA001AU capture and playback interfaces.
Eric Dumazet <edumazet(a)google.com>
net: do not allow gso_size to be set to GSO_BY_FRAGS
Abel Wu <wuyun.abel(a)bytedance.com>
sock: Fix misuse of sk_under_memory_pressure()
Andrii Staikov <andrii.staikov(a)intel.com>
i40e: fix misleading debug logs
Ziyang Xuan <william.xuanziyang(a)huawei.com>
team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nft_dynset: disallow object maps
Lin Ma <linma(a)zju.edu.cn>
xfrm: add NULL check in xfrm_update_ae_params
Zhengchao Shao <shaozhengchao(a)huawei.com>
ip_vti: fix potential slab-use-after-free in decode_session6
Zhengchao Shao <shaozhengchao(a)huawei.com>
ip6_vti: fix slab-use-after-free in decode_session6
Lin Ma <linma(a)zju.edu.cn>
net: af_key: fix sadb_x_filter validation
Lin Ma <linma(a)zju.edu.cn>
net: xfrm: Fix xfrm_address_filter OOB read
Nathan Lynch <nathanl(a)linux.ibm.com>
powerpc/rtas_flash: allow user copy to flash block cache objects
Yuanjun Gong <ruc_gongyuanjun(a)163.com>
fbdev: mmp: fix value check in mmphw_probe()
shanzhulig <shanzhulig(a)gmail.com>
drm/amdgpu: Fix potential fence use-after-free v2
Zhengping Jiang <jiangzp(a)google.com>
Bluetooth: L2CAP: Fix use-after-free
Armin Wolf <W_Armin(a)gmx.de>
pcmcia: rsrc_nonstatic: Fix memory leak in nonstatic_release_resource_db()
Tuo Li <islituo(a)gmail.com>
gfs2: Fix possible data races in gfs2_show_options()
Hans Verkuil <hverkuil-cisco(a)xs4all.nl>
media: platform: mediatek: vpu: fix NULL ptr dereference
Yunfei Dong <yunfei.dong(a)mediatek.com>
media: v4l2-mem2mem: add lock to protect parameter num_rdy
Immad Mir <mirimmad17(a)gmail.com>
FS: JFS: Check for read-only mounted filesystem in txBegin
Immad Mir <mirimmad17(a)gmail.com>
FS: JFS: Fix null-ptr-deref Read in txBegin
Gustavo A. R. Silva <gustavoars(a)kernel.org>
MIPS: dec: prom: Address -Warray-bounds warning
Yogesh <yogi.kernel(a)gmail.com>
fs: jfs: Fix UBSAN: array-index-out-of-bounds in dbAllocDmapLev
Jan Kara <jack(a)suse.cz>
udf: Fix uninitialized array access for some pathnames
Ye Bin <yebin10(a)huawei.com>
quota: fix warning in dqgrab()
Jan Kara <jack(a)suse.cz>
quota: Properly disable quotas when add_dquot_ref() fails
Oswald Buddenhagen <oswald.buddenhagen(a)gmx.de>
ALSA: emu10k1: roll up loops in DSP setup code for Audigy
hackyzh002 <hackyzh002(a)gmail.com>
drm/radeon: Fix integer overflow in radeon_cs_parser_init
Nathan Chancellor <natechancellor(a)gmail.com>
lib/mpi: Eliminate unused umul_ppmm definitions for MIPS
-------------
Diffstat:
Makefile | 4 +-
arch/mips/include/asm/dec/prom.h | 2 +-
arch/powerpc/kernel/rtas_flash.c | 6 +-
arch/x86/kernel/fpu/xstate.c | 8 ++
drivers/android/binder.c | 1 +
drivers/android/binder_alloc.c | 6 ++
drivers/android/binder_alloc.h | 1 +
drivers/dma-buf/sw_sync.c | 18 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 3 +
drivers/gpu/drm/radeon/radeon_cs.c | 3 +-
drivers/media/platform/mtk-vcodec/mtk_vcodec_enc.c | 2 +
drivers/media/platform/mtk-vpu/mtk_vpu.c | 6 +-
drivers/mmc/host/wbsd.c | 2 -
drivers/net/ethernet/ibm/ibmveth.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_nvm.c | 16 +--
drivers/net/ethernet/intel/igb/igb_ptp.c | 24 ++---
drivers/net/phy/broadcom.c | 13 +++
drivers/net/team/team.c | 4 +-
drivers/pcmcia/rsrc_nonstatic.c | 2 +
drivers/scsi/raid_class.c | 48 ---------
drivers/scsi/snic/snic_disc.c | 3 +-
drivers/tty/serial/8250/8250_port.c | 1 +
drivers/video/fbdev/mmp/hw/mmp_ctrl.c | 4 +-
fs/cifs/file.c | 2 +-
fs/gfs2/super.c | 26 +++--
fs/jfs/jfs_dmap.c | 3 +
fs/jfs/jfs_txnmgr.c | 5 +
fs/jfs/namei.c | 5 +
fs/quota/dquot.c | 5 +-
fs/udf/unicode.c | 2 +-
include/linux/raid_class.h | 4 -
include/linux/virtio_net.h | 4 +
include/media/v4l2-mem2mem.h | 18 +++-
include/net/sock.h | 11 +-
kernel/trace/trace.c | 9 +-
kernel/trace/trace_irqsoff.c | 3 +-
kernel/trace/trace_sched_wakeup.c | 2 +
lib/clz_ctz.c | 32 ++----
lib/mpi/longlong.h | 36 +------
lib/test_firmware.c | 39 +++++--
net/batman-adv/bat_v_elp.c | 3 +-
net/batman-adv/bat_v_ogm.c | 7 +-
net/batman-adv/hard-interface.c | 2 +-
net/batman-adv/translation-table.c | 1 -
net/bluetooth/l2cap_core.c | 5 +
net/core/rtnetlink.c | 5 +-
net/core/sock.c | 2 +-
net/ipv4/ip_vti.c | 4 +-
net/ipv4/tcp_timer.c | 4 +-
net/ipv6/ip6_vti.c | 4 +-
net/key/af_key.c | 4 +-
net/netfilter/ipvs/ip_vs_ctl.c | 74 +++++++-------
net/netfilter/nf_conntrack_proto_sctp.c | 6 +-
net/netfilter/nft_dynset.c | 3 +
net/sctp/socket.c | 2 +-
net/unix/af_unix.c | 9 +-
net/xfrm/xfrm_user.c | 13 ++-
sound/pci/emu10k1/emufx.c | 112 ++-------------------
sound/soc/codecs/rt5665.c | 2 +
sound/usb/quirks-table.h | 29 ++++++
60 files changed, 325 insertions(+), 351 deletions(-)
From: Linus Torvalds <torvalds(a)linux-foundation.org>
[ upstream commit 5ef64cc8987a9211d3f3667331ba3411a94ddc79 ]
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@…
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@…
Reported-and-tested-by: Michael Larabel <Michael(a)michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Chris Mason <clm(a)fb.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Amir Goldstein <amir73il(a)gmail.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
CC: <stable(a)vger.kernel.org> # 5.4
[ mheyne: fixed contextual conflict in mm/filemap.c due to missing
commit c7510ab2cf5c ("mm: abstract out wake_page_match() from
wake_page_function()"). Added WQ_FLAG_CUSTOM due to missing commit
7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem") ]
Signed-off-by: Maximilian Heyne <mheyne(a)amazon.de>
---
include/linux/mm.h | 2 +
include/linux/wait.h | 2 +
kernel/sysctl.c | 8 +++
mm/filemap.c | 160 ++++++++++++++++++++++++++++++++++---------
4 files changed, 141 insertions(+), 31 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d35c29d322d8..d14aba548ff4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -37,6 +37,8 @@ struct user_struct;
struct writeback_control;
struct bdi_writeback;
+extern int sysctl_page_lock_unfairness;
+
void init_mm_internals(void);
#ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 7d04c1b588c7..03bff85e365f 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -20,6 +20,8 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
#define WQ_FLAG_EXCLUSIVE 0x01
#define WQ_FLAG_WOKEN 0x02
#define WQ_FLAG_BOOKMARK 0x04
+#define WQ_FLAG_CUSTOM 0x08
+#define WQ_FLAG_DONE 0x10
/*
* A single wait-queue entry structure:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index decabf5714c0..4f85f7ed42fc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1563,6 +1563,14 @@ static struct ctl_table vm_table[] = {
.proc_handler = percpu_pagelist_fraction_sysctl_handler,
.extra1 = SYSCTL_ZERO,
},
+ {
+ .procname = "page_lock_unfairness",
+ .data = &sysctl_page_lock_unfairness,
+ .maxlen = sizeof(sysctl_page_lock_unfairness),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
#ifdef CONFIG_MMU
{
.procname = "max_map_count",
diff --git a/mm/filemap.c b/mm/filemap.c
index adc27af737c6..f1ed0400c37c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1044,9 +1044,43 @@ struct wait_page_queue {
wait_queue_entry_t wait;
};
+/*
+ * The page wait code treats the "wait->flags" somewhat unusually, because
+ * we have multiple different kinds of waits, not just he usual "exclusive"
+ * one.
+ *
+ * We have:
+ *
+ * (a) no special bits set:
+ *
+ * We're just waiting for the bit to be released, and when a waker
+ * calls the wakeup function, we set WQ_FLAG_WOKEN and wake it up,
+ * and remove it from the wait queue.
+ *
+ * Simple and straightforward.
+ *
+ * (b) WQ_FLAG_EXCLUSIVE:
+ *
+ * The waiter is waiting to get the lock, and only one waiter should
+ * be woken up to avoid any thundering herd behavior. We'll set the
+ * WQ_FLAG_WOKEN bit, wake it up, and remove it from the wait queue.
+ *
+ * This is the traditional exclusive wait.
+ *
+ * (b) WQ_FLAG_EXCLUSIVE | WQ_FLAG_CUSTOM:
+ *
+ * The waiter is waiting to get the bit, and additionally wants the
+ * lock to be transferred to it for fair lock behavior. If the lock
+ * cannot be taken, we stop walking the wait queue without waking
+ * the waiter.
+ *
+ * This is the "fair lock handoff" case, and in addition to setting
+ * WQ_FLAG_WOKEN, we set WQ_FLAG_DONE to let the waiter easily see
+ * that it now has the lock.
+ */
static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
{
- int ret;
+ unsigned int flags;
struct wait_page_key *key = arg;
struct wait_page_queue *wait_page
= container_of(wait, struct wait_page_queue, wait);
@@ -1059,35 +1093,44 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
return 0;
/*
- * If it's an exclusive wait, we get the bit for it, and
- * stop walking if we can't.
- *
- * If it's a non-exclusive wait, then the fact that this
- * wake function was called means that the bit already
- * was cleared, and we don't care if somebody then
- * re-took it.
+ * If it's a lock handoff wait, we get the bit for it, and
+ * stop walking (and do not wake it up) if we can't.
*/
- ret = 0;
- if (wait->flags & WQ_FLAG_EXCLUSIVE) {
- if (test_and_set_bit(key->bit_nr, &key->page->flags))
+ flags = wait->flags;
+ if (flags & WQ_FLAG_EXCLUSIVE) {
+ if (test_bit(key->bit_nr, &key->page->flags))
return -1;
- ret = 1;
+ if (flags & WQ_FLAG_CUSTOM) {
+ if (test_and_set_bit(key->bit_nr, &key->page->flags))
+ return -1;
+ flags |= WQ_FLAG_DONE;
+ }
}
- wait->flags |= WQ_FLAG_WOKEN;
+ /*
+ * We are holding the wait-queue lock, but the waiter that
+ * is waiting for this will be checking the flags without
+ * any locking.
+ *
+ * So update the flags atomically, and wake up the waiter
+ * afterwards to avoid any races. This store-release pairs
+ * with the load-acquire in wait_on_page_bit_common().
+ */
+ smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
wake_up_state(wait->private, mode);
/*
* Ok, we have successfully done what we're waiting for,
* and we can unconditionally remove the wait entry.
*
- * Note that this has to be the absolute last thing we do,
- * since after list_del_init(&wait->entry) the wait entry
+ * Note that this pairs with the "finish_wait()" in the
+ * waiter, and has to be the absolute last thing we do.
+ * After this list_del_init(&wait->entry) the wait entry
* might be de-allocated and the process might even have
* exited.
*/
list_del_init_careful(&wait->entry);
- return ret;
+ return (flags & WQ_FLAG_EXCLUSIVE) != 0;
}
static void wake_up_page_bit(struct page *page, int bit_nr)
@@ -1167,8 +1210,8 @@ enum behavior {
};
/*
- * Attempt to check (or get) the page bit, and mark the
- * waiter woken if successful.
+ * Attempt to check (or get) the page bit, and mark us done
+ * if successful.
*/
static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
struct wait_queue_entry *wait)
@@ -1179,13 +1222,17 @@ static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
} else if (test_bit(bit_nr, &page->flags))
return false;
- wait->flags |= WQ_FLAG_WOKEN;
+ wait->flags |= WQ_FLAG_WOKEN | WQ_FLAG_DONE;
return true;
}
+/* How many times do we accept lock stealing from under a waiter? */
+int sysctl_page_lock_unfairness = 5;
+
static inline int wait_on_page_bit_common(wait_queue_head_t *q,
struct page *page, int bit_nr, int state, enum behavior behavior)
{
+ int unfairness = sysctl_page_lock_unfairness;
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
bool thrashing = false;
@@ -1203,11 +1250,18 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
}
init_wait(wait);
- wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0;
wait->func = wake_page_function;
wait_page.page = page;
wait_page.bit_nr = bit_nr;
+repeat:
+ wait->flags = 0;
+ if (behavior == EXCLUSIVE) {
+ wait->flags = WQ_FLAG_EXCLUSIVE;
+ if (--unfairness < 0)
+ wait->flags |= WQ_FLAG_CUSTOM;
+ }
+
/*
* Do one last check whether we can get the
* page bit synchronously.
@@ -1230,27 +1284,63 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
/*
* From now on, all the logic will be based on
- * the WQ_FLAG_WOKEN flag, and the and the page
- * bit testing (and setting) will be - or has
- * already been - done by the wake function.
+ * the WQ_FLAG_WOKEN and WQ_FLAG_DONE flag, to
+ * see whether the page bit testing has already
+ * been done by the wake function.
*
* We can drop our reference to the page.
*/
if (behavior == DROP)
put_page(page);
+ /*
+ * Note that until the "finish_wait()", or until
+ * we see the WQ_FLAG_WOKEN flag, we need to
+ * be very careful with the 'wait->flags', because
+ * we may race with a waker that sets them.
+ */
for (;;) {
+ unsigned int flags;
+
set_current_state(state);
- if (signal_pending_state(state, current))
+ /* Loop until we've been woken or interrupted */
+ flags = smp_load_acquire(&wait->flags);
+ if (!(flags & WQ_FLAG_WOKEN)) {
+ if (signal_pending_state(state, current))
+ break;
+
+ io_schedule();
+ continue;
+ }
+
+ /* If we were non-exclusive, we're done */
+ if (behavior != EXCLUSIVE)
break;
- if (wait->flags & WQ_FLAG_WOKEN)
+ /* If the waker got the lock for us, we're done */
+ if (flags & WQ_FLAG_DONE)
break;
- io_schedule();
+ /*
+ * Otherwise, if we're getting the lock, we need to
+ * try to get it ourselves.
+ *
+ * And if that fails, we'll have to retry this all.
+ */
+ if (unlikely(test_and_set_bit(bit_nr, &page->flags)))
+ goto repeat;
+
+ wait->flags |= WQ_FLAG_DONE;
+ break;
}
+ /*
+ * If a signal happened, this 'finish_wait()' may remove the last
+ * waiter from the wait-queues, but the PageWaiters bit will remain
+ * set. That's ok. The next wakeup will take care of it, and trying
+ * to do it here would be difficult and prone to races.
+ */
finish_wait(q, wait);
if (thrashing) {
@@ -1260,12 +1350,20 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
}
/*
- * A signal could leave PageWaiters set. Clearing it here if
- * !waitqueue_active would be possible (by open-coding finish_wait),
- * but still fail to catch it in the case of wait hash collision. We
- * already can fail to clear wait hash collision cases, so don't
- * bother with signals either.
+ * NOTE! The wait->flags weren't stable until we've done the
+ * 'finish_wait()', and we could have exited the loop above due
+ * to a signal, and had a wakeup event happen after the signal
+ * test but before the 'finish_wait()'.
+ *
+ * So only after the finish_wait() can we reliably determine
+ * if we got woken up or not, so we can now figure out the final
+ * return value based on that state without races.
+ *
+ * Also note that WQ_FLAG_WOKEN is sufficient for a non-exclusive
+ * waiter, but an exclusive one requires WQ_FLAG_DONE.
*/
+ if (behavior == EXCLUSIVE)
+ return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR;
return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
}
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
From: Song Shuai <suagrfillet(a)gmail.com>
The pt_level uses CONFIG_PGTABLE_LEVELS to display page table names.
But if page mode is downgraded from kernel cmdline or restricted by
the hardware in 64BIT, it will give a wrong name.
Like, using no4lvl for sv39, ptdump named the 1G-mapping as "PUD"
that should be "PGD":
0xffffffd840000000-0xffffffd900000000 0x00000000c0000000 3G PUD D A G . . W R V
So select "P4D/PUD" or "PGD" via pgtable_l5/4_enabled to correct it.
Fixes: e8a62cc26ddf ("riscv: Implement sv48 support")
Reviewed-by: Alexandre Ghiti <alexghiti(a)rivosinc.com>
Signed-off-by: Song Shuai <suagrfillet(a)gmail.com>
Link: https://lore.kernel.org/r/20230712115740.943324-1-suagrfillet@gmail.com
Cc: stable(a)vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
---
arch/riscv/mm/ptdump.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 20a9f991a6d7..e9090b38f811 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -384,6 +384,9 @@ static int __init ptdump_init(void)
kernel_ptd_info.base_addr = KERN_VIRT_START;
+ pg_level[1].name = pgtable_l5_enabled ? "P4D" : "PGD";
+ pg_level[2].name = pgtable_l4_enabled ? "PUD" : "PGD";
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
--
2.41.0
There are instances where rcu_cpu_stall_reset() is called when jiffies
did not get a chance to update for a long time. Before jiffies is
updated, the CPU stall detector can go off triggering false-positives
where a just-started grace period appears to be ages old. In the past,
we disabled stall detection in rcu_cpu_stall_reset() however this got
changed [1]. This is resulting in false-positives in KGDB usecase [2].
Fix this by deferring the update of jiffies to the third run of the FQS
loop. This is more robust, as, even if rcu_cpu_stall_reset() is called
just before jiffies is read, we would end up pushing out the jiffies
read by 3 more FQS loops. Meanwhile the CPU stall detection will be
delayed and we will not get any false positives.
[1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.or…
[2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
Tested with rcutorture.cpu_stall option as well to verify stall behavior
with/without patch.
Tested-by: Huacai Chen <chenhuacai(a)loongson.cn>
Reported-by: Huacai Chen <chenhuacai(a)loongson.cn>
Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/
Suggested-by: Paul McKenney <paulmck(a)kernel.org>
Cc: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Fixes: a80be428fbc1 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()")
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
kernel/rcu/tree.c | 12 ++++++++++++
kernel/rcu/tree.h | 4 ++++
kernel/rcu/tree_stall.h | 20 ++++++++++++++++++--
3 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1449cb69a0e0..b695c0eb515a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1552,10 +1552,22 @@ static bool rcu_gp_fqs_check_wake(int *gfp)
*/
static void rcu_gp_fqs(bool first_time)
{
+ int nr_fqs = READ_ONCE(rcu_state.nr_fqs_jiffies_stall);
struct rcu_node *rnp = rcu_get_root();
WRITE_ONCE(rcu_state.gp_activity, jiffies);
WRITE_ONCE(rcu_state.n_force_qs, rcu_state.n_force_qs + 1);
+
+ WARN_ON_ONCE(nr_fqs > 3);
+ /* Only countdown nr_fqs for stall purposes if jiffies moves. */
+ if (nr_fqs) {
+ if (nr_fqs == 1) {
+ WRITE_ONCE(rcu_state.jiffies_stall,
+ jiffies + rcu_jiffies_till_stall_check());
+ }
+ WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, --nr_fqs);
+ }
+
if (first_time) {
/* Collect dyntick-idle snapshots. */
force_qs_rnp(dyntick_save_progress_counter);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 192536916f9a..e9821a8422db 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -386,6 +386,10 @@ struct rcu_state {
/* in jiffies. */
unsigned long jiffies_stall; /* Time at which to check */
/* for CPU stalls. */
+ int nr_fqs_jiffies_stall; /* Number of fqs loops after
+ * which read jiffies and set
+ * jiffies_stall. Stall
+ * warnings disabled if !0. */
unsigned long jiffies_resched; /* Time at which to resched */
/* a reluctant CPU. */
unsigned long n_force_qs_gpstart; /* Snapshot of n_force_qs at */
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b10b8349bb2a..a2fa6b22e248 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -149,12 +149,17 @@ static void panic_on_rcu_stall(void)
/**
* rcu_cpu_stall_reset - restart stall-warning timeout for current grace period
*
+ * To perform the reset request from the caller, disable stall detection until
+ * 3 fqs loops have passed. This is required to ensure a fresh jiffies is
+ * loaded. It should be safe to do from the fqs loop as enough timer
+ * interrupts and context switches should have passed.
+ *
* The caller must disable hard irqs.
*/
void rcu_cpu_stall_reset(void)
{
- WRITE_ONCE(rcu_state.jiffies_stall,
- jiffies + rcu_jiffies_till_stall_check());
+ WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, 3);
+ WRITE_ONCE(rcu_state.jiffies_stall, ULONG_MAX);
}
//////////////////////////////////////////////////////////////////////////////
@@ -170,6 +175,7 @@ static void record_gp_stall_check_time(void)
WRITE_ONCE(rcu_state.gp_start, j);
j1 = rcu_jiffies_till_stall_check();
smp_mb(); // ->gp_start before ->jiffies_stall and caller's ->gp_seq.
+ WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, 0);
WRITE_ONCE(rcu_state.jiffies_stall, j + j1);
rcu_state.jiffies_resched = j + j1 / 2;
rcu_state.n_force_qs_gpstart = READ_ONCE(rcu_state.n_force_qs);
@@ -725,6 +731,16 @@ static void check_cpu_stall(struct rcu_data *rdp)
!rcu_gp_in_progress())
return;
rcu_stall_kick_kthreads();
+
+ /*
+ * Check if it was requested (via rcu_cpu_stall_reset()) that the FQS
+ * loop has to set jiffies to ensure a non-stale jiffies value. This
+ * is required to have good jiffies value after coming out of long
+ * breaks of jiffies updates. Not doing so can cause false positives.
+ */
+ if (READ_ONCE(rcu_state.nr_fqs_jiffies_stall) > 0)
+ return;
+
j = jiffies;
/*
--
2.42.0.rc2.253.gd59a3bf2b4-goog
Dear friend
my name is Mohamed Abdul please send me your what-sap number for easy
communication i have millions of dollars to invest
thanks
Mohamed Abdul
In the line 910, the value of m1 may be zero, so there is a possibility
of dividing by zero, we fixed it by checking the values before dividing
(found with SVACE). In the same way, after checking and reading the
function, we found that lines 906, 908, 912 have the same situation, so
we fixed them as well.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Rand Deeb <deeb.rand(a)confident.ru>
---
drivers/ssb/main.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/ssb/main.c b/drivers/ssb/main.c
index 0a26984acb2c..e0776a16d04d 100644
--- a/drivers/ssb/main.c
+++ b/drivers/ssb/main.c
@@ -903,13 +903,21 @@ u32 ssb_calc_clock_rate(u32 plltype, u32 n, u32 m)
case SSB_CHIPCO_CLK_MC_BYPASS:
return clock;
case SSB_CHIPCO_CLK_MC_M1:
- return (clock / m1);
+ if (m1 != 0)
+ return (clock / m1);
+ break;
case SSB_CHIPCO_CLK_MC_M1M2:
- return (clock / (m1 * m2));
+ if ((m1 * m2) != 0)
+ return (clock / (m1 * m2));
+ break;
case SSB_CHIPCO_CLK_MC_M1M2M3:
- return (clock / (m1 * m2 * m3));
+ if ((m1 * m2 * m3) != 0)
+ return (clock / (m1 * m2 * m3));
+ break;
case SSB_CHIPCO_CLK_MC_M1M3:
- return (clock / (m1 * m3));
+ if ((m1 * m3) != 0)
+ return (clock / (m1 * m3));
+ break;
}
return 0;
case SSB_PLLTYPE_2:
--
2.34.1
v1 -> v2:
- Address the comment to reduce size of queue pointer from queue size
- Consider the data size during memcpy to avoid OOB write
- Use hweight_long() to count the setbits representing the supported codecs
v1: https://lore.kernel.org/all/1690432469-14803-1-git-send-email-quic_vgarodia…
This series primarily adds check at relevant places in venus driver where there are possible OOB
accesses due to unexpected payload from venus firmware. The patches describes the specific OOB
possibility.
Please review and share your feedback.
Vikash Garodia (4):
venus: hfi: add checks to perform sanity on queue pointers
venus: hfi: fix the check to handle session buffer requirement
venus: hfi: add checks to handle capabilities from firmware
venus: hfi_parser: Add check to keep the number of codecs within range
drivers/media/platform/qcom/venus/hfi_msgs.c | 2 +-
drivers/media/platform/qcom/venus/hfi_parser.c | 15 +++++++++++++++
drivers/media/platform/qcom/venus/hfi_venus.c | 10 ++++++++++
3 files changed, 26 insertions(+), 1 deletion(-)
--
2.7.4
From: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
[ Upstream commit 8183bb7e291b7818f49ea39687c2fafa01a46e27 ]
GPIO_ACTIVE_x flags are not correct in the context of interrupt flags.
These are simple defines so they could be used in DTS but they will not
have the same meaning: GPIO_ACTIVE_HIGH = 0 = IRQ_TYPE_NONE.
Correct the interrupt flags, assuming the author of the code wanted same
logical behavior behind the name "ACTIVE_xxx", this is:
ACTIVE_HIGH => IRQ_TYPE_LEVEL_HIGH
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
Link: https://lore.kernel.org/r/20230707063335.13317-1-krzysztof.kozlowski@linaro…
Signed-off-by: Heiko Stuebner <heiko(a)sntech.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/arm64/boot/dts/rockchip/rk3399-eaidk-610.dts | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/boot/dts/rockchip/rk3399-eaidk-610.dts b/arch/arm64/boot/dts/rockchip/rk3399-eaidk-610.dts
index d1f343345f674..6464ef4d113dd 100644
--- a/arch/arm64/boot/dts/rockchip/rk3399-eaidk-610.dts
+++ b/arch/arm64/boot/dts/rockchip/rk3399-eaidk-610.dts
@@ -773,7 +773,7 @@ brcmf: wifi@1 {
compatible = "brcm,bcm4329-fmac";
reg = <1>;
interrupt-parent = <&gpio0>;
- interrupts = <RK_PA3 GPIO_ACTIVE_HIGH>;
+ interrupts = <RK_PA3 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "host-wake";
pinctrl-names = "default";
pinctrl-0 = <&wifi_host_wake_l>;
--
2.40.1
cros_ec_sensors_push_data() reads some `indio_dev` states (e.g.
iio_buffer_enabled() and `indio_dev->active_scan_mask`) without holding
the `mlock`.
An use-after-free on `indio_dev->active_scan_mask` was observed. The
call trace:
[...]
_find_next_bit
cros_ec_sensors_push_data
cros_ec_sensorhub_event
blocking_notifier_call_chain
cros_ec_irq_thread
It was caused by a race condition: one thread just freed
`active_scan_mask` at [1]; while another thread tried to access the
memory at [2].
Fix it by acquiring the `mlock` before accessing the `indio_dev` states.
[1]: https://elixir.bootlin.com/linux/v6.5/source/drivers/iio/industrialio-buffe…
[2]: https://elixir.bootlin.com/linux/v6.5/source/drivers/iio/common/cros_ec_sen…
Cc: stable(a)vger.kernel.org
Fixes: aa984f1ba4a4 ("iio: cros_ec: Register to cros_ec_sensorhub when EC supports FIFO")
Signed-off-by: Tzung-Bi Shih <tzungbi(a)kernel.org>
---
drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c b/drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c
index b72d39fc2434..a514d0dbafc7 100644
--- a/drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c
+++ b/drivers/iio/common/cros_ec_sensors/cros_ec_sensors_core.c
@@ -182,17 +182,20 @@ int cros_ec_sensors_push_data(struct iio_dev *indio_dev,
s16 *data,
s64 timestamp)
{
+ struct iio_dev_opaque *iio_dev_opaque = to_iio_dev_opaque(indio_dev);
struct cros_ec_sensors_core_state *st = iio_priv(indio_dev);
s16 *out;
s64 delta;
unsigned int i;
+ mutex_lock(&iio_dev_opaque->mlock);
+
/*
* Ignore samples if the buffer is not set: it is needed if the ODR is
* set but the buffer is not enabled yet.
*/
if (!iio_buffer_enabled(indio_dev))
- return 0;
+ goto exit;
out = (s16 *)st->samples;
for_each_set_bit(i,
@@ -210,6 +213,8 @@ int cros_ec_sensors_push_data(struct iio_dev *indio_dev,
iio_push_to_buffers_with_timestamp(indio_dev, st->samples,
timestamp + delta);
+exit:
+ mutex_unlock(&iio_dev_opaque->mlock);
return 0;
}
EXPORT_SYMBOL_GPL(cros_ec_sensors_push_data);
--
2.42.0.rc1.204.g551eb34607-goog
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit 147f04b14adde831eb4a0a1e378667429732f9e8 ]
If an RCU expedited grace period starts just when a CPU is in the process
of going offline, so that the outgoing CPU has completed its pass through
stop-machine but has not yet completed its final dive into the idle loop,
RCU will attempt to enable that CPU's scheduling-clock tick via a call
to tick_dep_set_cpu(). For this to happen, that CPU has to have been
online when the expedited grace period completed its CPU-selection phase.
This is pointless: The outgoing CPU has interrupts disabled, so it cannot
take a scheduling-clock tick anyway. In addition, the tick_dep_set_cpu()
function's eventual call to irq_work_queue_on() will splat as follows:
smpboot: CPU 1 is now offline
WARNING: CPU: 6 PID: 124 at kernel/irq_work.c:95
+irq_work_queue_on+0x57/0x60
Modules linked in:
CPU: 6 PID: 124 Comm: kworker/6:2 Not tainted 5.15.0-rc1+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
+rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
Workqueue: rcu_gp wait_rcu_exp_gp
RIP: 0010:irq_work_queue_on+0x57/0x60
Code: 8b 05 1d c7 ea 62 a9 00 00 f0 00 75 21 4c 89 ce 44 89 c7 e8
+9b 37 fa ff ba 01 00 00 00 89 d0 c3 4c 89 cf e8 3b ff ff ff eb ee <0f> 0b eb b7
+0f 0b eb db 90 48 c7 c0 98 2a 02 00 65 48 03 05 91
6f
RSP: 0000:ffffb12cc038fe48 EFLAGS: 00010282
RAX: 0000000000000001 RBX: 0000000000005208 RCX: 0000000000000020
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9ad01f45a680
RBP: 000000000004c990 R08: 0000000000000001 R09: ffff9ad01f45a680
R10: ffffb12cc0317db0 R11: 0000000000000001 R12: 00000000fffecee8
R13: 0000000000000001 R14: 0000000000026980 R15: ffffffff9e53ae00
FS: 0000000000000000(0000) GS:ffff9ad01f580000(0000)
+knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000de0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
tick_nohz_dep_set_cpu+0x59/0x70
rcu_exp_wait_wake+0x54e/0x870
? sync_rcu_exp_select_cpus+0x1fc/0x390
process_one_work+0x1ef/0x3c0
? process_one_work+0x3c0/0x3c0
worker_thread+0x28/0x3c0
? process_one_work+0x3c0/0x3c0
kthread+0x115/0x140
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
---[ end trace c5bf75eb6aa80bc6 ]---
This commit therefore avoids invoking tick_dep_set_cpu() on offlined
CPUs to limit both futility and false-positive splats.
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
kernel/rcu/tree_exp.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index f46c0c1a5eb3..407941a2903b 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -507,7 +507,10 @@ static void synchronize_rcu_expedited_wait(void)
if (rdp->rcu_forced_tick_exp)
continue;
rdp->rcu_forced_tick_exp = true;
- tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
+ preempt_disable();
+ if (cpu_online(cpu))
+ tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
+ preempt_enable();
}
}
j = READ_ONCE(jiffies_till_first_fqs);
--
2.42.0.rc2.253.gd59a3bf2b4-goog
commit 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend")
changed pci_bridge_d3_possible() so that any vendor's PCIe ports
from modern machines (>=2015) are allowed to be put into D3.
Iain reports that USB devices can't be used to wake a Lenovo Z13
from suspend. This is because the PCIe root port has been put
into D3 and AMD's platform can't handle USB devices waking in this
case.
This behavior is only reported on Linux. Comparing the behavior
on Windows and Linux, Windows doesn't put the root ports into D3.
To fix the issue without regressing existing Intel systems,
limit the >=2015 check to only apply to Intel PCIe ports.
Cc: stable(a)vger.kernel.org
Fixes: 9d26d3a8f1b0 ("PCI: Put PCIe ports into D3 during suspend")
Reported-by: Iain Lane <iain(a)orangesquash.org.uk>
Closes: https://forums.lenovo.com/t5/Ubuntu/Z13-can-t-resume-from-suspend-with-exte…
Reviewed-by:Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy(a)linux.intel.com>
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
In v14 this series has been split into 3 parts.
part A: Immediate fix for AMD issue.
part B: LPS0 export improvements
part C: Long term solution for all vendors
v13->v14:
* Reword the comment
* add tag
v12->v13:
* New patch
---
drivers/pci/pci.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 60230da957e0c..bfdad2eb36d13 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3037,10 +3037,15 @@ bool pci_bridge_d3_possible(struct pci_dev *bridge)
return false;
/*
- * It should be safe to put PCIe ports from 2015 or newer
- * to D3.
+ * Allow Intel PCIe ports from 2015 onward to go into D3 to
+ * achieve additional energy conservation on some platforms.
+ *
+ * This is only set for Intel PCIe ports as it causes problems
+ * on both AMD Rembrandt and Phoenix platforms where USB keyboards
+ * can not be used to wake the system from suspend.
*/
- if (dmi_get_bios_year() >= 2015)
+ if (bridge->vendor == PCI_VENDOR_ID_INTEL &&
+ dmi_get_bios_year() >= 2015)
return true;
break;
}
--
2.34.1
Thanks for your response,
Kindly reconfirm your interest to further discuss the investment Thanks for your response,
partnership within your country as I explained earlier so we can have a further discussion to facilitate the process for mutual interest.
Looking forward to your response.
Hou Qijun
Vice President- CNPC
China National Petroleum Corporation
No. 9 Dongzhimen North Street Dongcheng District Beijing.
[My two cents]
stable-rc linux-6.1.y and linux-6.4.y x86 clang-nightly builds fail with
following warnings / errors.
Build errors:
--------------
drivers/net/ethernet/qlogic/qed/qed_main.c:1227:3: error: 'snprintf'
will always be truncated; specified size is 16, but format string
expands to at least 18 [-Werror,-Wfortify-source]
1227 | snprintf(name, NAME_SIZE, "slowpath-%02x:%02x.%02x",
| ^
1 error generated.
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
--
Linaro LKFT
https://lkft.linaro.org
check_clock doesn't account for vfe_lite which means that vfe_lite will
never get validated by this routine. Add the clock name to the expected set
to remediate.
Fixes: 7319cdf189bb ("media: camss: Add support for VFE hardware version Titan 170")
Cc: stable(a)vger.kernel.org
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
---
drivers/media/platform/qcom/camss/camss-vfe.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/media/platform/qcom/camss/camss-vfe.c b/drivers/media/platform/qcom/camss/camss-vfe.c
index 938f373bcd1fd..b021f81cef123 100644
--- a/drivers/media/platform/qcom/camss/camss-vfe.c
+++ b/drivers/media/platform/qcom/camss/camss-vfe.c
@@ -535,7 +535,8 @@ static int vfe_check_clock_rates(struct vfe_device *vfe)
struct camss_clock *clock = &vfe->clock[i];
if (!strcmp(clock->name, "vfe0") ||
- !strcmp(clock->name, "vfe1")) {
+ !strcmp(clock->name, "vfe1") ||
+ !strcmp(clock->name, "vfe_lite")) {
u64 min_rate = 0;
unsigned long rate;
--
2.41.0
There are two problems with the current vfe_disable_output() routine.
Firstly we rightly use a spinlock to protect output->gen2.active_num
everywhere except for in the IDLE timeout path of vfe_disable_output().
Even if that is not racy "in practice" somehow it is by happenstance not
by design.
Secondly we do not get consistent behaviour from this routine. On
sc8280xp 50% of the time I get "VFE idle timeout - resetting". In this
case the subsequent capture will succeed. The other 50% of the time, we
don't hit the idle timeout, never do the VFE reset and subsequent
captures stall indefinitely.
Rewrite the vfe_disable_output() routine to
- Quiesce write masters with vfe_wm_stop()
- Set active_num = 0
remembering to hold the spinlock when we do so followed by
- Reset the VFE
Testing on sc8280xp and sdm845 shows this to be a valid fix.
Fixes: 7319cdf189bb ("media: camss: Add support for VFE hardware version Titan 170")
Cc: stable(a)vger.kernel.org
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
---
.../media/platform/qcom/camss/camss-vfe-170.c | 19 +++----------------
1 file changed, 3 insertions(+), 16 deletions(-)
diff --git a/drivers/media/platform/qcom/camss/camss-vfe-170.c b/drivers/media/platform/qcom/camss/camss-vfe-170.c
index 02494c89da91c..ae9137633c301 100644
--- a/drivers/media/platform/qcom/camss/camss-vfe-170.c
+++ b/drivers/media/platform/qcom/camss/camss-vfe-170.c
@@ -500,28 +500,15 @@ static int vfe_disable_output(struct vfe_line *line)
struct vfe_output *output = &line->output;
unsigned long flags;
unsigned int i;
- bool done;
- int timeout = 0;
-
- do {
- spin_lock_irqsave(&vfe->output_lock, flags);
- done = !output->gen2.active_num;
- spin_unlock_irqrestore(&vfe->output_lock, flags);
- usleep_range(10000, 20000);
-
- if (timeout++ == 100) {
- dev_err(vfe->camss->dev, "VFE idle timeout - resetting\n");
- vfe_reset(vfe);
- output->gen2.active_num = 0;
- return 0;
- }
- } while (!done);
spin_lock_irqsave(&vfe->output_lock, flags);
for (i = 0; i < output->wm_num; i++)
vfe_wm_stop(vfe, output->wm_idx[i]);
+ output->gen2.active_num = 0;
spin_unlock_irqrestore(&vfe->output_lock, flags);
+ vfe_reset(vfe);
+
return 0;
}
--
2.41.0
We need to make sure camss_configure_pd() happens before
camss_register_entities() as the vfe_get() path relies on the pointer
provided by camss_configure_pd().
Fix the ordering sequence in probe to ensure the pointers vfe_get() demands
are present by the time camss_register_entities() runs.
In order to facilitate backporting to stable kernels I've moved the
configure_pd() call pretty early on the probe() function so that
irrespective of the existence of the old error handling jump labels this
patch should still apply to -next circa Aug 2023 to v5.13 inclusive.
Fixes: 2f6f8af67203 ("media: camss: Refactor VFE power domain toggling")
Cc: stable(a)vger.kernel.org
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
---
drivers/media/platform/qcom/camss/camss.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/media/platform/qcom/camss/camss.c b/drivers/media/platform/qcom/camss/camss.c
index f11dc59135a5a..75991d849b571 100644
--- a/drivers/media/platform/qcom/camss/camss.c
+++ b/drivers/media/platform/qcom/camss/camss.c
@@ -1619,6 +1619,12 @@ static int camss_probe(struct platform_device *pdev)
if (ret < 0)
goto err_cleanup;
+ ret = camss_configure_pd(camss);
+ if (ret < 0) {
+ dev_err(dev, "Failed to configure power domains: %d\n", ret);
+ goto err_cleanup;
+ }
+
ret = camss_init_subdevices(camss);
if (ret < 0)
goto err_cleanup;
@@ -1678,12 +1684,6 @@ static int camss_probe(struct platform_device *pdev)
}
}
- ret = camss_configure_pd(camss);
- if (ret < 0) {
- dev_err(dev, "Failed to configure power domains: %d\n", ret);
- return ret;
- }
-
pm_runtime_enable(dev);
return 0;
--
2.41.0
Hi,
Looks like we missed a few backports related to MSG_RING, most likely
because they required a bit of hand massaging. Can you queue these up?
Thanks!
--
Jens Axboe
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x f0362a253606e2031f8d61c74195d4d6556e12a4
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082642-catfight-gallantly-8b84@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
f0362a253606 ("mm,ima,kexec,of: use memblock_free_late from ima_free_kexec_buffer")
3ecc68349bba ("memblock: rename memblock_free to memblock_phys_free")
fa27717110ae ("memblock: drop memblock_free_early_nid() and memblock_free_early()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f0362a253606e2031f8d61c74195d4d6556e12a4 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel(a)surriel.com>
Date: Thu, 17 Aug 2023 13:57:59 -0400
Subject: [PATCH] mm,ima,kexec,of: use memblock_free_late from
ima_free_kexec_buffer
The code calling ima_free_kexec_buffer runs long after the memblock
allocator has already been torn down, potentially resulting in a use
after free in memblock_isolate_range.
With KASAN or KFENCE, this use after free will result in a BUG
from the idle task, and a subsequent kernel panic.
Switch ima_free_kexec_buffer over to memblock_free_late to avoid
that issue.
Fixes: fee3ff99bc67 ("powerpc: Move arch independent ima kexec functions to drivers/of/kexec.c")
Cc: stable(a)kernel.org
Signed-off-by: Rik van Riel <riel(a)surriel.com>
Suggested-by: Mike Rappoport <rppt(a)kernel.org>
Link: https://lore.kernel.org/r/20230817135759.0888e5ef@imladris.surriel.com
Signed-off-by: Rob Herring <robh(a)kernel.org>
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index f26d2ba8a371..68278340cecf 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -184,7 +184,8 @@ int __init ima_free_kexec_buffer(void)
if (ret)
return ret;
- return memblock_phys_free(addr, size);
+ memblock_free_late(addr, size);
+ return 0;
}
#endif
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit 10f84c2cfb5045e37d78cb5d4c8e8321e06ae18f ]
Currently, the various torture tests sometimes react to an early-boot
bug by rebooting. This is almost always counterproductive, needlessly
consuming CPU time and bloating the console log. This commit therefore
adds the "-no-reboot" argument to qemu so that reboot requests will
cause qemu to exit.
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index f4c8055dbf7a..c57be9563214 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -9,9 +9,10 @@
#
# Usage: kvm-test-1-run.sh config resdir seconds qemu-args boot_args_in
#
-# qemu-args defaults to "-enable-kvm -nographic", along with arguments
-# specifying the number of CPUs and other options
-# generated from the underlying CPU architecture.
+# qemu-args defaults to "-enable-kvm -nographic -no-reboot", along with
+# arguments specifying the number of CPUs and
+# other options generated from the underlying
+# CPU architecture.
# boot_args_in defaults to value returned by the per_version_boot_params
# shell function.
#
@@ -141,7 +142,7 @@ then
fi
# Generate -smp qemu argument.
-qemu_args="-enable-kvm -nographic $qemu_args"
+qemu_args="-enable-kvm -nographic -no-reboot $qemu_args"
cpu_count=`configNR_CPUS.sh $resdir/ConfigFragment`
cpu_count=`configfrag_boot_cpus "$boot_args_in" "$config_template" "$cpu_count"`
if test "$cpu_count" -gt "$TORTURE_ALLOTED_CPUS"
--
2.42.0.rc1.204.g551eb34607-goog
--
My name is Dr.Law Kok Chung, I have very important information that I
would like to pass to you but I was wondering if you are still using
this email ID or not. Please if I reach you on this email as I am
hopeful, Please endeavor to confirm to me so I can pass the detailed
information to you.
I will be waiting for your feedback.
Thanks
Dr.Law Kok Chung
Research Assistant
CV Industrial laboratory Ltd
Dzień dobry,
jakiś czas temu zgłosiła się do nas firma, której strona internetowa nie pozycjonowała się wysoko w wyszukiwarce Google.
Na podstawie wykonanego przez nas audytu SEO zoptymalizowaliśmy treści na stronie pod kątem wcześniej opracowanych słów kluczowych. Nasz wewnętrzny system codziennie analizuje prawidłowe działanie witryny. Dzięki indywidualnej strategii, firma zdobywa coraz więcej Klientów.
Czy chcieliby Państwo zwiększyć liczbę osób odwiedzających stronę internetową firmy? Mógłbym przedstawić ofertę?
Pozdrawiam serdecznie,
Dominik Perkowski
This is the start of the stable review cycle for the 6.1.49 release.
There are 4 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Mon, 28 Aug 2023 15:46:14 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.1.49-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.1.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.1.49-rc1
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "f2fs: fix to do sanity check on direct node in truncate_dnode()"
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "f2fs: fix to set flush_merge opt and show noflush_merge"
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "f2fs: don't reset unchangable mount option in f2fs_remount()"
Peter Zijlstra <peterz(a)infradead.org>
objtool/x86: Fix SRSO mess
-------------
Diffstat:
Makefile | 4 ++--
fs/f2fs/f2fs.h | 1 +
fs/f2fs/file.c | 5 +++++
fs/f2fs/node.c | 14 ++----------
fs/f2fs/super.c | 43 ++++++++++++------------------------
include/linux/f2fs_fs.h | 1 -
tools/objtool/arch/x86/decode.c | 11 +++++----
tools/objtool/check.c | 22 +++++++++++++++++-
tools/objtool/include/objtool/arch.h | 1 +
tools/objtool/include/objtool/elf.h | 1 +
10 files changed, 54 insertions(+), 49 deletions(-)
Hi Greg,
please revert the following two patches as 5.10.192 fails to build with
them:
asoc-intel-sof_sdw-add-quirk-for-lnl-rvp.patch
asoc-intel-sof_sdw-add-quirk-for-mtl-rvp.patch
Error message: error: ‘RT711_JD2_100K’ undeclared here (not in a function)
2023-08-26T17:46:51.3733116Z sound/soc/intel/boards/sof_sdw.c:208:41:
error: ‘RT711_JD2_100K’ undeclared here (not in a function)
2023-08-26T17:46:51.3744338Z 208 | .driver_data =
(void *)(RT711_JD2_100K),
2023-08-26T17:46:51.3745547Z |
^~~~~~~~~~~~~~
2023-08-26T17:46:51.4620173Z make[4]: *** [scripts/Makefile.build:286:
sound/soc/intel/boards/sof_sdw.o] Error 1
2023-08-26T17:46:51.4625055Z make[3]: *** [scripts/Makefile.build:503:
sound/soc/intel/boards] Error 2
2023-08-26T17:46:51.4626370Z make[2]: *** [scripts/Makefile.build:503:
sound/soc/intel] Error 2
This happened before already:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/com…
--
Best, Philip
This is a backport of the series that fixes the way deadline bandwidth
restoration is done which is causing noticeable delay on resume path. It also
converts the cpuset lock back into a mutex which some users on Android too.
I lack the details but AFAIU the read/write semaphore was slower on high
contention.
Compile tested against some randconfig for different archs. Only boot tested on
x86 qemu.
Based on v6.4.11
Original series:
https://lore.kernel.org/lkml/20230508075854.17215-1-juri.lelli@redhat.com/
Thanks!
--
Qais Yousef
Dietmar Eggemann (2):
sched/deadline: Create DL BW alloc, free & check overflow interface
cgroup/cpuset: Free DL BW in case can_attach() fails
Juri Lelli (4):
cgroup/cpuset: Rename functions dealing with DEADLINE accounting
sched/cpuset: Bring back cpuset_mutex
sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets
cgroup/cpuset: Iterate only if DEADLINE tasks are present
include/linux/cpuset.h | 12 +-
include/linux/sched.h | 4 +-
kernel/cgroup/cgroup.c | 4 +
kernel/cgroup/cpuset.c | 244 ++++++++++++++++++++++++++--------------
kernel/sched/core.c | 41 +++----
kernel/sched/deadline.c | 67 ++++++++---
kernel/sched/sched.h | 2 +-
7 files changed, 246 insertions(+), 128 deletions(-)
--
2.34.1
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit 10f84c2cfb5045e37d78cb5d4c8e8321e06ae18f ]
Currently, the various torture tests sometimes react to an early-boot
bug by rebooting. This is almost always counterproductive, needlessly
consuming CPU time and bloating the console log. This commit therefore
adds the "-no-reboot" argument to qemu so that reboot requests will
cause qemu to exit.
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index 6dc2b49b85ea..bdd747dc61f2 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -9,7 +9,7 @@
#
# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args
#
-# qemu-args defaults to "-enable-kvm -nographic", along with arguments
+# qemu-args defaults to "-enable-kvm -nographic -no-reboot", along with arguments
# specifying the number of CPUs and other options
# generated from the underlying CPU architecture.
# boot_args defaults to value returned by the per_version_boot_params
@@ -132,7 +132,7 @@ then
fi
# Generate -smp qemu argument.
-qemu_args="-enable-kvm -nographic $qemu_args"
+qemu_args="-enable-kvm -nographic -no-reboot $qemu_args"
cpu_count=`configNR_CPUS.sh $resdir/ConfigFragment`
cpu_count=`configfrag_boot_cpus "$boot_args" "$config_template" "$cpu_count"`
if test "$cpu_count" -gt "$TORTURE_ALLOTED_CPUS"
--
2.42.0.rc1.204.g551eb34607-goog
I'm announcing the release of the 6.1.49 kernel.
This upgrade is only for all users of the 6.1 series that use the x86
platform OR the F2FS file system. If that's not you, feel free to
ignore this release.
The updated 6.1.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.1.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
fs/f2fs/f2fs.h | 1
fs/f2fs/file.c | 5 ++++
fs/f2fs/node.c | 14 +----------
fs/f2fs/super.c | 43 +++++++++++------------------------
include/linux/f2fs_fs.h | 1
tools/objtool/arch/x86/decode.c | 11 +++++---
tools/objtool/check.c | 22 +++++++++++++++++
tools/objtool/include/objtool/arch.h | 1
tools/objtool/include/objtool/elf.h | 1
10 files changed, 53 insertions(+), 48 deletions(-)
Greg Kroah-Hartman (4):
Revert "f2fs: don't reset unchangable mount option in f2fs_remount()"
Revert "f2fs: fix to set flush_merge opt and show noflush_merge"
Revert "f2fs: fix to do sanity check on direct node in truncate_dnode()"
Linux 6.1.49
Peter Zijlstra (1):
objtool/x86: Fix SRSO mess
Cc: stable(a)vger.kernel.org # v5.11+
Fixes: e7e0545299d8 ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308221542.11UpkVfp-lkp@intel.com/
Signed-off-by: Jarkko Sakkinen <jarkko(a)kernel.org>
---
arch/x86/kernel/cpu/sgx/main.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..388350b8f5e3 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -732,6 +732,10 @@ int arch_memory_failure(unsigned long pfn, int flags)
}
/**
+ * sgx_calc_section_metric() - Calculate an EPC section metric
+ * @low: low 32-bit word from CPUID:0x12:{2, ...}
+ * @high: high 32-bit word from CPUID:0x12:{2, ...}
+ *
* A section metric is concatenated in a way that @low bits 12-31 define the
* bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the
* metric.
--
2.39.2
Patch series "don't use mapcount() to check large folio sharing", v2.
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared. But it's
not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
This patchset will fix the cases:
User space application call madvise() with MADV_FREE, MADV_COLD and
MADV_PAGEOUT for specific address range. There are THP mapped to the
range. Without the patchset, the THP is skipped. With the patch, the
THP will be split and handled accordingly.
David reported the cow self test skip some cases because of MADV_PAGEOUT
skip THP:
https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel…
and I confirmed this patchset make it work again.
This patch (of 3):
Commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range()
to use folios") replaced the page_mapcount() with folio_mapcount() to
check whether the folio is shared by other mapping.
It's not correct for large folio. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.
Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.
Link: https://lkml.kernel.org/r/20230808020917.2230692-1-fengwei.yin@intel.com
Link: https://lkml.kernel.org/r/20230808020917.2230692-2-fengwei.yin@intel.com
Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
Signed-off-by: Yin Fengwei <fengwei.yin(a)intel.com>
Reviewed-by: Yu Zhao <yuzhao(a)google.com>
Reviewed-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
(cherry picked from commit 2f406263e3e954aa24c1248edcfa9be0c1bb30fa)
---
mm/madvise.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index b5ffbaf616f5..6adee363a9fa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -375,7 +375,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio = pfn_folio(pmd_pfn(orig_pmd));
/* Do not interfere with other mappings of this folio */
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
goto huge_unlock;
if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -447,7 +447,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
* are sure it's worth. Split it if we are only owner.
*/
if (folio_test_large(folio)) {
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
--
2.39.2
The patch below does not apply to the 6.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.4.y
git checkout FETCH_HEAD
git cherry-pick -x 0e0e9bd5f7b9d40fd03b70092367247d52da1db0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082614-choice-mongoose-0731@gregkh' --subject-prefix 'PATCH 6.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0e0e9bd5f7b9d40fd03b70092367247d52da1db0 Mon Sep 17 00:00:00 2001
From: Yin Fengwei <fengwei.yin(a)intel.com>
Date: Tue, 8 Aug 2023 10:09:17 +0800
Subject: [PATCH] madvise:madvise_free_pte_range(): don't use mapcount()
against large folio for sharing check
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a
folio") replaced the page_mapcount() with folio_mapcount() to check
whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.
Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.
Link: https://lkml.kernel.org/r/20230808020917.2230692-4-fengwei.yin@intel.com
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
Signed-off-by: Yin Fengwei <fengwei.yin(a)intel.com>
Reviewed-by: Yu Zhao <yuzhao(a)google.com>
Reviewed-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/madvise.c b/mm/madvise.c
index 46802b4cf65a..ec30f48f8f2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -680,7 +680,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (folio_test_large(folio)) {
int err;
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (!folio_trylock(folio))
break;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 0e0e9bd5f7b9d40fd03b70092367247d52da1db0
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082616-velocity-mocha-97c0@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
0e0e9bd5f7b9 ("madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check")
f3cd4ab0aabf ("mm/madvise: clean up pte_offset_map_lock() scans")
07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
fd3b1bc3c86e ("mm/madvise: fix madvise_pageout for private file mappings")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0e0e9bd5f7b9d40fd03b70092367247d52da1db0 Mon Sep 17 00:00:00 2001
From: Yin Fengwei <fengwei.yin(a)intel.com>
Date: Tue, 8 Aug 2023 10:09:17 +0800
Subject: [PATCH] madvise:madvise_free_pte_range(): don't use mapcount()
against large folio for sharing check
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a
folio") replaced the page_mapcount() with folio_mapcount() to check
whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.
Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.
Link: https://lkml.kernel.org/r/20230808020917.2230692-4-fengwei.yin@intel.com
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
Signed-off-by: Yin Fengwei <fengwei.yin(a)intel.com>
Reviewed-by: Yu Zhao <yuzhao(a)google.com>
Reviewed-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/madvise.c b/mm/madvise.c
index 46802b4cf65a..ec30f48f8f2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -680,7 +680,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (folio_test_large(folio)) {
int err;
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (!folio_trylock(folio))
break;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 56b930dcd88c2adc261410501c402c790980bdb5
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023081222-chummy-aqueduct-85c2@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
56b930dcd88c ("hwmon: (aquacomputer_d5next) Add selective 200ms delay after sending ctrl report")
19692f17cd13 ("hwmon: (aquacomputer_d5next) Add support for Aquacomputer Aquastream XT")
866e630a3b8b ("hwmon: (aquacomputer_d5next) Add temperature offset control for Aquaero")
6c83ccb10c49 ("hwmon: (aquacomputer_d5next) Add infrastructure for Aquaero control reports")
b29090bac935 ("hwmon: (aquacomputer_d5next) Device dependent control report settings")
7505dab78f58 ("hwmon: (aquacomputer_d5next) Add support for Aquacomputer Aquastream Ultimate")
e0f6c370f0ad ("hwmon: (aquacomputer_d5next) Add support for Aquacomputer Poweradjust 3")
3d2e9f582a8e ("hwmon: (aquacomputer_d5next) Add support for reading calculated Aquaero sensors")
2c55211104b4 ("hwmon: (aquacomputer_d5next) Support sensors for Aquacomputer Aquaero")
ad2f0811fbeb ("hwmon: (aquacomputer_d5next) Device dependent serial number and firmware offsets")
249c752110a5 ("hwmon: (aquacomputer_d5next) Add structure for fan layout")
8bcb02bdc638 ("hwmon: (aquacomputer_d5next) Rename AQC_TEMP_SENSOR_SIZE to AQC_SENSOR_SIZE")
6ff838f2877d ("hwmon: (aquacomputer_d5next) Add support for Quadro flow sensor pulses")
d5d896b83822 ("hwmon: (aquacomputer_d5next) Clear up macros and comments")
662d20b3a5af ("hwmon: (aquacomputer_d5next) Add support for temperature sensor offsets")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 56b930dcd88c2adc261410501c402c790980bdb5 Mon Sep 17 00:00:00 2001
From: Aleksa Savic <savicaleksa83(a)gmail.com>
Date: Mon, 7 Aug 2023 19:20:03 +0200
Subject: [PATCH] hwmon: (aquacomputer_d5next) Add selective 200ms delay after
sending ctrl report
Add a 200ms delay after sending a ctrl report to Quadro,
Octo, D5 Next and Aquaero to give them enough time to
process the request and save the data to memory. Otherwise,
under heavier userspace loads where multiple sysfs entries
are usually set in quick succession, a new ctrl report could
be requested from the device while it's still processing the
previous one and fail with -EPIPE. The delay is only applied
if two ctrl report operations are near each other in time.
Reported by a user on Github [1] and tested by both of us.
[1] https://github.com/aleksamagicka/aquacomputer_d5next-hwmon/issues/82
Fixes: 752b927951ea ("hwmon: (aquacomputer_d5next) Add support for Aquacomputer Octo")
Signed-off-by: Aleksa Savic <savicaleksa83(a)gmail.com>
Link: https://lore.kernel.org/r/20230807172004.456968-1-savicaleksa83@gmail.com
Signed-off-by: Guenter Roeck <linux(a)roeck-us.net>
diff --git a/drivers/hwmon/aquacomputer_d5next.c b/drivers/hwmon/aquacomputer_d5next.c
index a997dbcb563f..023807859be7 100644
--- a/drivers/hwmon/aquacomputer_d5next.c
+++ b/drivers/hwmon/aquacomputer_d5next.c
@@ -13,9 +13,11 @@
#include <linux/crc16.h>
#include <linux/debugfs.h>
+#include <linux/delay.h>
#include <linux/hid.h>
#include <linux/hwmon.h>
#include <linux/jiffies.h>
+#include <linux/ktime.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/seq_file.h>
@@ -63,6 +65,8 @@ static const char *const aqc_device_names[] = {
#define CTRL_REPORT_ID 0x03
#define AQUAERO_CTRL_REPORT_ID 0x0b
+#define CTRL_REPORT_DELAY 200 /* ms */
+
/* The HID report that the official software always sends
* after writing values, currently same for all devices
*/
@@ -527,6 +531,9 @@ struct aqc_data {
int secondary_ctrl_report_size;
u8 *secondary_ctrl_report;
+ ktime_t last_ctrl_report_op;
+ int ctrl_report_delay; /* Delay between two ctrl report operations, in ms */
+
int buffer_size;
u8 *buffer;
int checksum_start;
@@ -611,17 +618,35 @@ static int aqc_aquastreamxt_convert_fan_rpm(u16 val)
return 0;
}
+static void aqc_delay_ctrl_report(struct aqc_data *priv)
+{
+ /*
+ * If previous read or write is too close to this one, delay the current operation
+ * to give the device enough time to process the previous one.
+ */
+ if (priv->ctrl_report_delay) {
+ s64 delta = ktime_ms_delta(ktime_get(), priv->last_ctrl_report_op);
+
+ if (delta < priv->ctrl_report_delay)
+ msleep(priv->ctrl_report_delay - delta);
+ }
+}
+
/* Expects the mutex to be locked */
static int aqc_get_ctrl_data(struct aqc_data *priv)
{
int ret;
+ aqc_delay_ctrl_report(priv);
+
memset(priv->buffer, 0x00, priv->buffer_size);
ret = hid_hw_raw_request(priv->hdev, priv->ctrl_report_id, priv->buffer, priv->buffer_size,
HID_FEATURE_REPORT, HID_REQ_GET_REPORT);
if (ret < 0)
ret = -ENODATA;
+ priv->last_ctrl_report_op = ktime_get();
+
return ret;
}
@@ -631,6 +656,8 @@ static int aqc_send_ctrl_data(struct aqc_data *priv)
int ret;
u16 checksum;
+ aqc_delay_ctrl_report(priv);
+
/* Checksum is not needed for Aquaero */
if (priv->kind != aquaero) {
/* Init and xorout value for CRC-16/USB is 0xffff */
@@ -646,12 +673,16 @@ static int aqc_send_ctrl_data(struct aqc_data *priv)
ret = hid_hw_raw_request(priv->hdev, priv->ctrl_report_id, priv->buffer, priv->buffer_size,
HID_FEATURE_REPORT, HID_REQ_SET_REPORT);
if (ret < 0)
- return ret;
+ goto record_access_and_ret;
/* The official software sends this report after every change, so do it here as well */
ret = hid_hw_raw_request(priv->hdev, priv->secondary_ctrl_report_id,
priv->secondary_ctrl_report, priv->secondary_ctrl_report_size,
HID_FEATURE_REPORT, HID_REQ_SET_REPORT);
+
+record_access_and_ret:
+ priv->last_ctrl_report_op = ktime_get();
+
return ret;
}
@@ -1524,6 +1555,7 @@ static int aqc_probe(struct hid_device *hdev, const struct hid_device_id *id)
priv->buffer_size = AQUAERO_CTRL_REPORT_SIZE;
priv->temp_ctrl_offset = AQUAERO_TEMP_CTRL_OFFSET;
+ priv->ctrl_report_delay = CTRL_REPORT_DELAY;
priv->temp_label = label_temp_sensors;
priv->virtual_temp_label = label_virtual_temp_sensors;
@@ -1547,6 +1579,7 @@ static int aqc_probe(struct hid_device *hdev, const struct hid_device_id *id)
priv->temp_ctrl_offset = D5NEXT_TEMP_CTRL_OFFSET;
priv->buffer_size = D5NEXT_CTRL_REPORT_SIZE;
+ priv->ctrl_report_delay = CTRL_REPORT_DELAY;
priv->power_cycle_count_offset = D5NEXT_POWER_CYCLES;
@@ -1597,6 +1630,7 @@ static int aqc_probe(struct hid_device *hdev, const struct hid_device_id *id)
priv->temp_ctrl_offset = OCTO_TEMP_CTRL_OFFSET;
priv->buffer_size = OCTO_CTRL_REPORT_SIZE;
+ priv->ctrl_report_delay = CTRL_REPORT_DELAY;
priv->power_cycle_count_offset = OCTO_POWER_CYCLES;
@@ -1624,6 +1658,7 @@ static int aqc_probe(struct hid_device *hdev, const struct hid_device_id *id)
priv->temp_ctrl_offset = QUADRO_TEMP_CTRL_OFFSET;
priv->buffer_size = QUADRO_CTRL_REPORT_SIZE;
+ priv->ctrl_report_delay = CTRL_REPORT_DELAY;
priv->flow_pulses_ctrl_offset = QUADRO_FLOW_PULSES_CTRL_OFFSET;
priv->power_cycle_count_offset = QUADRO_POWER_CYCLES;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x c275a176e4b69868576e543409927ae75e3a3288
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082737-cavity-bloating-1779@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
c275a176e4b6 ("can: raw: add missing refcount for memory leak fix")
ee8b94c8510c ("can: raw: fix receiver memory leak")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c275a176e4b69868576e543409927ae75e3a3288 Mon Sep 17 00:00:00 2001
From: Oliver Hartkopp <socketcan(a)hartkopp.net>
Date: Mon, 21 Aug 2023 16:45:47 +0200
Subject: [PATCH] can: raw: add missing refcount for memory leak fix
Commit ee8b94c8510c ("can: raw: fix receiver memory leak") introduced
a new reference to the CAN netdevice that has assigned CAN filters.
But this new ro->dev reference did not maintain its own refcount which
lead to another KASAN use-after-free splat found by Eric Dumazet.
This patch ensures a proper refcount for the CAN nedevice.
Fixes: ee8b94c8510c ("can: raw: fix receiver memory leak")
Reported-by: Eric Dumazet <edumazet(a)google.com>
Cc: Ziyang Xuan <william.xuanziyang(a)huawei.com>
Signed-off-by: Oliver Hartkopp <socketcan(a)hartkopp.net>
Link: https://lore.kernel.org/r/20230821144547.6658-3-socketcan@hartkopp.net
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/can/raw.c b/net/can/raw.c
index e10f59375659..d50c3f3d892f 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -85,6 +85,7 @@ struct raw_sock {
int bound;
int ifindex;
struct net_device *dev;
+ netdevice_tracker dev_tracker;
struct list_head notifier;
int loopback;
int recv_own_msgs;
@@ -285,8 +286,10 @@ static void raw_notify(struct raw_sock *ro, unsigned long msg,
case NETDEV_UNREGISTER:
lock_sock(sk);
/* remove current filters & unregister */
- if (ro->bound)
+ if (ro->bound) {
raw_disable_allfilters(dev_net(dev), dev, sk);
+ netdev_put(dev, &ro->dev_tracker);
+ }
if (ro->count > 1)
kfree(ro->filter);
@@ -391,10 +394,12 @@ static int raw_release(struct socket *sock)
/* remove current filters & unregister */
if (ro->bound) {
- if (ro->dev)
+ if (ro->dev) {
raw_disable_allfilters(dev_net(ro->dev), ro->dev, sk);
- else
+ netdev_put(ro->dev, &ro->dev_tracker);
+ } else {
raw_disable_allfilters(sock_net(sk), NULL, sk);
+ }
}
if (ro->count > 1)
@@ -445,10 +450,10 @@ static int raw_bind(struct socket *sock, struct sockaddr *uaddr, int len)
goto out;
}
if (dev->type != ARPHRD_CAN) {
- dev_put(dev);
err = -ENODEV;
- goto out;
+ goto out_put_dev;
}
+
if (!(dev->flags & IFF_UP))
notify_enetdown = 1;
@@ -456,7 +461,9 @@ static int raw_bind(struct socket *sock, struct sockaddr *uaddr, int len)
/* filters set by default/setsockopt */
err = raw_enable_allfilters(sock_net(sk), dev, sk);
- dev_put(dev);
+ if (err)
+ goto out_put_dev;
+
} else {
ifindex = 0;
@@ -467,18 +474,28 @@ static int raw_bind(struct socket *sock, struct sockaddr *uaddr, int len)
if (!err) {
if (ro->bound) {
/* unregister old filters */
- if (ro->dev)
+ if (ro->dev) {
raw_disable_allfilters(dev_net(ro->dev),
ro->dev, sk);
- else
+ /* drop reference to old ro->dev */
+ netdev_put(ro->dev, &ro->dev_tracker);
+ } else {
raw_disable_allfilters(sock_net(sk), NULL, sk);
+ }
}
ro->ifindex = ifindex;
ro->bound = 1;
+ /* bind() ok -> hold a reference for new ro->dev */
ro->dev = dev;
+ if (ro->dev)
+ netdev_hold(ro->dev, &ro->dev_tracker, GFP_KERNEL);
}
- out:
+out_put_dev:
+ /* remove potential reference from dev_get_by_index() */
+ if (dev)
+ dev_put(dev);
+out:
release_sock(sk);
rtnl_unlock();
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x dbf46008775516f7f25c95b7760041c286299783
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082724-deflate-drinkable-54a1@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
dbf460087755 ("objtool/x86: Fixup frame-pointer vs rethunk")
c6f5dc28fb3d ("objtool: Union instruction::{call_dest,jump_table}")
0932dbe1f568 ("objtool: Remove instruction::reloc")
8b2de412158e ("objtool: Shrink instruction::{type,visited}")
d54066546121 ("objtool: Make instruction::alts a single-linked list")
3ee88df1b063 ("objtool: Make instruction::stack_ops a single-linked list")
20a554638dd2 ("objtool: Change arch_decode_instruction() signature")
5f6e430f931d ("Merge tag 'powerpc-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dbf46008775516f7f25c95b7760041c286299783 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 16 Aug 2023 13:59:21 +0200
Subject: [PATCH] objtool/x86: Fixup frame-pointer vs rethunk
For stack-validation of a frame-pointer build, objtool validates that
every CALL instruction is preceded by a frame-setup. The new SRSO
return thunks violate this with their RSB stuffing trickery.
Extend the __fentry__ exception to also cover the embedded_insn case
used for this. This cures:
vmlinux.o: warning: objtool: srso_untrain_ret+0xd: call without frame pointer save/setup
Fixes: 4ae68b26c3ab ("objtool/x86: Fix SRSO mess")
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe(a)kernel.org>
Link: https://lore.kernel.org/r/20230816115921.GH980931@hirez.programming.kicks-a…
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 7a9aaf400873..1384090530db 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -2650,12 +2650,17 @@ static int decode_sections(struct objtool_file *file)
return 0;
}
-static bool is_fentry_call(struct instruction *insn)
+static bool is_special_call(struct instruction *insn)
{
- if (insn->type == INSN_CALL &&
- insn_call_dest(insn) &&
- insn_call_dest(insn)->fentry)
- return true;
+ if (insn->type == INSN_CALL) {
+ struct symbol *dest = insn_call_dest(insn);
+
+ if (!dest)
+ return false;
+
+ if (dest->fentry || dest->embedded_insn)
+ return true;
+ }
return false;
}
@@ -3656,7 +3661,7 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
if (ret)
return ret;
- if (opts.stackval && func && !is_fentry_call(insn) &&
+ if (opts.stackval && func && !is_special_call(insn) &&
!has_valid_stack_frame(&state)) {
WARN_INSN(insn, "call without frame pointer save/setup");
return 1;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x dbf46008775516f7f25c95b7760041c286299783
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082722-rocking-unbaked-1baf@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
dbf460087755 ("objtool/x86: Fixup frame-pointer vs rethunk")
c6f5dc28fb3d ("objtool: Union instruction::{call_dest,jump_table}")
0932dbe1f568 ("objtool: Remove instruction::reloc")
8b2de412158e ("objtool: Shrink instruction::{type,visited}")
d54066546121 ("objtool: Make instruction::alts a single-linked list")
3ee88df1b063 ("objtool: Make instruction::stack_ops a single-linked list")
20a554638dd2 ("objtool: Change arch_decode_instruction() signature")
5f6e430f931d ("Merge tag 'powerpc-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dbf46008775516f7f25c95b7760041c286299783 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 16 Aug 2023 13:59:21 +0200
Subject: [PATCH] objtool/x86: Fixup frame-pointer vs rethunk
For stack-validation of a frame-pointer build, objtool validates that
every CALL instruction is preceded by a frame-setup. The new SRSO
return thunks violate this with their RSB stuffing trickery.
Extend the __fentry__ exception to also cover the embedded_insn case
used for this. This cures:
vmlinux.o: warning: objtool: srso_untrain_ret+0xd: call without frame pointer save/setup
Fixes: 4ae68b26c3ab ("objtool/x86: Fix SRSO mess")
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe(a)kernel.org>
Link: https://lore.kernel.org/r/20230816115921.GH980931@hirez.programming.kicks-a…
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 7a9aaf400873..1384090530db 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -2650,12 +2650,17 @@ static int decode_sections(struct objtool_file *file)
return 0;
}
-static bool is_fentry_call(struct instruction *insn)
+static bool is_special_call(struct instruction *insn)
{
- if (insn->type == INSN_CALL &&
- insn_call_dest(insn) &&
- insn_call_dest(insn)->fentry)
- return true;
+ if (insn->type == INSN_CALL) {
+ struct symbol *dest = insn_call_dest(insn);
+
+ if (!dest)
+ return false;
+
+ if (dest->fentry || dest->embedded_insn)
+ return true;
+ }
return false;
}
@@ -3656,7 +3661,7 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
if (ret)
return ret;
- if (opts.stackval && func && !is_fentry_call(insn) &&
+ if (opts.stackval && func && !is_special_call(insn) &&
!has_valid_stack_frame(&state)) {
WARN_INSN(insn, "call without frame pointer save/setup");
return 1;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x dbf46008775516f7f25c95b7760041c286299783
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082723-bribe-sporty-3c8c@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
dbf460087755 ("objtool/x86: Fixup frame-pointer vs rethunk")
c6f5dc28fb3d ("objtool: Union instruction::{call_dest,jump_table}")
0932dbe1f568 ("objtool: Remove instruction::reloc")
8b2de412158e ("objtool: Shrink instruction::{type,visited}")
d54066546121 ("objtool: Make instruction::alts a single-linked list")
3ee88df1b063 ("objtool: Make instruction::stack_ops a single-linked list")
20a554638dd2 ("objtool: Change arch_decode_instruction() signature")
5f6e430f931d ("Merge tag 'powerpc-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dbf46008775516f7f25c95b7760041c286299783 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz(a)infradead.org>
Date: Wed, 16 Aug 2023 13:59:21 +0200
Subject: [PATCH] objtool/x86: Fixup frame-pointer vs rethunk
For stack-validation of a frame-pointer build, objtool validates that
every CALL instruction is preceded by a frame-setup. The new SRSO
return thunks violate this with their RSB stuffing trickery.
Extend the __fentry__ exception to also cover the embedded_insn case
used for this. This cures:
vmlinux.o: warning: objtool: srso_untrain_ret+0xd: call without frame pointer save/setup
Fixes: 4ae68b26c3ab ("objtool/x86: Fix SRSO mess")
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe(a)kernel.org>
Link: https://lore.kernel.org/r/20230816115921.GH980931@hirez.programming.kicks-a…
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 7a9aaf400873..1384090530db 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -2650,12 +2650,17 @@ static int decode_sections(struct objtool_file *file)
return 0;
}
-static bool is_fentry_call(struct instruction *insn)
+static bool is_special_call(struct instruction *insn)
{
- if (insn->type == INSN_CALL &&
- insn_call_dest(insn) &&
- insn_call_dest(insn)->fentry)
- return true;
+ if (insn->type == INSN_CALL) {
+ struct symbol *dest = insn_call_dest(insn);
+
+ if (!dest)
+ return false;
+
+ if (dest->fentry || dest->embedded_insn)
+ return true;
+ }
return false;
}
@@ -3656,7 +3661,7 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
if (ret)
return ret;
- if (opts.stackval && func && !is_fentry_call(insn) &&
+ if (opts.stackval && func && !is_special_call(insn) &&
!has_valid_stack_frame(&state)) {
WARN_INSN(insn, "call without frame pointer save/setup");
return 1;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x c4d6b5438116c184027b2e911c0f2c7c406fb47c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082711-crown-acuteness-453a@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
c4d6b5438116 ("tracing/synthetic: Allocate one additional element for size")
fc1a9dc10129 ("tracing/histogram: Don't use strlen to find length of stacktrace variables")
288709c9f3b0 ("tracing: Allow stacktraces to be saved as histogram variables")
5f2e094ed259 ("tracing: Allow multiple hitcount values in histograms")
0934ae9977c2 ("tracing: Fix reading strings from synthetic events")
86087383ec0a ("tracing/hist: Call hist functions directly via a switch statement")
05770dd0ad11 ("tracing: Support __rel_loc relative dynamic data location attribute")
938aa33f1465 ("tracing: Add length protection to histogram string copies")
63f84ae6b82b ("tracing/histogram: Do not copy the fixed-size char array field over the field size")
8b5d46fd7a38 ("tracing/histogram: Optimize division by constants")
f47716b7a955 ("tracing/histogram: Covert expr to const if both operands are constants")
c5eac6ee8bc5 ("tracing/histogram: Simplify handling of .sym-offset in expressions")
9710b2f341a0 ("tracing: Fix operator precedence for hist triggers expression")
bcef04415032 ("tracing: Add division and multiplication support for hist triggers")
52cfb373536a ("tracing: Add support for creating hist trigger variables from literal")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c4d6b5438116c184027b2e911c0f2c7c406fb47c Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)linux.ibm.com>
Date: Wed, 16 Aug 2023 17:49:28 +0200
Subject: [PATCH] tracing/synthetic: Allocate one additional element for size
While debugging another issue I noticed that the stack trace contains one
invalid entry at the end:
<idle>-0 [008] d..4. 26.484201: wake_lat: pid=0 delta=2629976084 000000009cc24024 stack=STACK:
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x150/0x2c0
=> kcompactd+0x9ca/0xc20
=> kthread+0x2f6/0x3d8
=> __ret_from_fork+0x8a/0xe8
=> 0x6b6b6b6b6b6b6b6b
This is because the code failed to add the one element containing the
number of entries to field_size.
Link: https://lkml.kernel.org/r/20230816154928.4171614-4-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Fixes: 00cf3d672a9d ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens(a)linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index 80a2a832f857..9897d0bfcab7 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -528,7 +528,8 @@ static notrace void trace_event_raw_event_synth(void *__data,
str_val = (char *)(long)var_ref_vals[val_idx];
if (event->dynamic_fields[i]->is_stack) {
- len = *((unsigned long *)str_val);
+ /* reserve one extra element for size */
+ len = *((unsigned long *)str_val) + 1;
len *= sizeof(unsigned long);
} else {
len = fetch_store_strlen((unsigned long)str_val);
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x c4d6b5438116c184027b2e911c0f2c7c406fb47c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082710-twisty-automatic-966e@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
c4d6b5438116 ("tracing/synthetic: Allocate one additional element for size")
fc1a9dc10129 ("tracing/histogram: Don't use strlen to find length of stacktrace variables")
288709c9f3b0 ("tracing: Allow stacktraces to be saved as histogram variables")
5f2e094ed259 ("tracing: Allow multiple hitcount values in histograms")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c4d6b5438116c184027b2e911c0f2c7c406fb47c Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)linux.ibm.com>
Date: Wed, 16 Aug 2023 17:49:28 +0200
Subject: [PATCH] tracing/synthetic: Allocate one additional element for size
While debugging another issue I noticed that the stack trace contains one
invalid entry at the end:
<idle>-0 [008] d..4. 26.484201: wake_lat: pid=0 delta=2629976084 000000009cc24024 stack=STACK:
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x150/0x2c0
=> kcompactd+0x9ca/0xc20
=> kthread+0x2f6/0x3d8
=> __ret_from_fork+0x8a/0xe8
=> 0x6b6b6b6b6b6b6b6b
This is because the code failed to add the one element containing the
number of entries to field_size.
Link: https://lkml.kernel.org/r/20230816154928.4171614-4-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Fixes: 00cf3d672a9d ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens(a)linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index 80a2a832f857..9897d0bfcab7 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -528,7 +528,8 @@ static notrace void trace_event_raw_event_synth(void *__data,
str_val = (char *)(long)var_ref_vals[val_idx];
if (event->dynamic_fields[i]->is_stack) {
- len = *((unsigned long *)str_val);
+ /* reserve one extra element for size */
+ len = *((unsigned long *)str_val) + 1;
len *= sizeof(unsigned long);
} else {
len = fetch_store_strlen((unsigned long)str_val);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 887f92e09ef34a949745ad26ce82be69e2dabcf6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082700-clamp-coerce-0153@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
887f92e09ef3 ("tracing/synthetic: Skip first entry for stack traces")
ddeea494a16f ("tracing/synthetic: Use union instead of casts")
116b41162f8b ("Merge tag 'probes-v6.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 887f92e09ef34a949745ad26ce82be69e2dabcf6 Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)linux.ibm.com>
Date: Wed, 16 Aug 2023 17:49:27 +0200
Subject: [PATCH] tracing/synthetic: Skip first entry for stack traces
While debugging another issue I noticed that the stack trace output
contains the number of entries on top:
<idle>-0 [000] d..4. 203.322502: wake_lat: pid=0 delta=2268270616 stack=STACK:
=> 0x10
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x242/0x2c0
=> __wait_for_common+0x434/0x680
=> __wait_rcu_gp+0x198/0x3e0
=> synchronize_rcu+0x112/0x138
=> ring_buffer_reset_online_cpus+0x140/0x2e0
=> tracing_reset_online_cpus+0x15c/0x1d0
=> tracing_set_clock+0x180/0x1d8
=> hist_register_trigger+0x486/0x670
=> event_hist_trigger_parse+0x494/0x1318
=> trigger_process_regex+0x1d4/0x258
=> event_trigger_write+0xb4/0x170
=> vfs_write+0x210/0xad0
=> ksys_write+0x122/0x208
Fix this by skipping the first element. Also replace the pointer
logic with an index variable which is easier to read.
Link: https://lkml.kernel.org/r/20230816154928.4171614-3-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Fixes: 00cf3d672a9d ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens(a)linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index 7fff8235075f..80a2a832f857 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -350,7 +350,7 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
struct trace_seq *s = &iter->seq;
struct synth_trace_event *entry;
struct synth_event *se;
- unsigned int i, n_u64;
+ unsigned int i, j, n_u64;
char print_fmt[32];
const char *fmt;
@@ -389,18 +389,13 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
n_u64 += STR_VAR_LEN_MAX / sizeof(u64);
}
} else if (se->fields[i]->is_stack) {
- unsigned long *p, *end;
union trace_synth_field *data = &entry->fields[n_u64];
-
- p = (void *)entry + data->as_dynamic.offset;
- end = (void *)p + data->as_dynamic.len - (sizeof(long) - 1);
+ unsigned long *p = (void *)entry + data->as_dynamic.offset;
trace_seq_printf(s, "%s=STACK:\n", se->fields[i]->name);
-
- for (; *p && p < end; p++)
- trace_seq_printf(s, "=> %pS\n", (void *)*p);
+ for (j = 1; j < data->as_dynamic.len / sizeof(long); j++)
+ trace_seq_printf(s, "=> %pS\n", (void *)p[j]);
n_u64++;
-
} else {
struct trace_print_flags __flags[] = {
__def_gfpflag_names, {-1, NULL} };
@@ -490,10 +485,6 @@ static unsigned int trace_stack(struct synth_trace_event *entry,
break;
}
- /* Include the zero'd element if it fits */
- if (len < HIST_STACKTRACE_DEPTH)
- len++;
-
len *= sizeof(long);
/* Find the dynamic section to copy the stack into. */
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 887f92e09ef34a949745ad26ce82be69e2dabcf6
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082759-esquire-online-0814@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
887f92e09ef3 ("tracing/synthetic: Skip first entry for stack traces")
ddeea494a16f ("tracing/synthetic: Use union instead of casts")
116b41162f8b ("Merge tag 'probes-v6.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 887f92e09ef34a949745ad26ce82be69e2dabcf6 Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)linux.ibm.com>
Date: Wed, 16 Aug 2023 17:49:27 +0200
Subject: [PATCH] tracing/synthetic: Skip first entry for stack traces
While debugging another issue I noticed that the stack trace output
contains the number of entries on top:
<idle>-0 [000] d..4. 203.322502: wake_lat: pid=0 delta=2268270616 stack=STACK:
=> 0x10
=> __schedule+0xac6/0x1a98
=> schedule+0x126/0x2c0
=> schedule_timeout+0x242/0x2c0
=> __wait_for_common+0x434/0x680
=> __wait_rcu_gp+0x198/0x3e0
=> synchronize_rcu+0x112/0x138
=> ring_buffer_reset_online_cpus+0x140/0x2e0
=> tracing_reset_online_cpus+0x15c/0x1d0
=> tracing_set_clock+0x180/0x1d8
=> hist_register_trigger+0x486/0x670
=> event_hist_trigger_parse+0x494/0x1318
=> trigger_process_regex+0x1d4/0x258
=> event_trigger_write+0xb4/0x170
=> vfs_write+0x210/0xad0
=> ksys_write+0x122/0x208
Fix this by skipping the first element. Also replace the pointer
logic with an index variable which is easier to read.
Link: https://lkml.kernel.org/r/20230816154928.4171614-3-svens@linux.ibm.com
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Fixes: 00cf3d672a9d ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Sven Schnelle <svens(a)linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index 7fff8235075f..80a2a832f857 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -350,7 +350,7 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
struct trace_seq *s = &iter->seq;
struct synth_trace_event *entry;
struct synth_event *se;
- unsigned int i, n_u64;
+ unsigned int i, j, n_u64;
char print_fmt[32];
const char *fmt;
@@ -389,18 +389,13 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
n_u64 += STR_VAR_LEN_MAX / sizeof(u64);
}
} else if (se->fields[i]->is_stack) {
- unsigned long *p, *end;
union trace_synth_field *data = &entry->fields[n_u64];
-
- p = (void *)entry + data->as_dynamic.offset;
- end = (void *)p + data->as_dynamic.len - (sizeof(long) - 1);
+ unsigned long *p = (void *)entry + data->as_dynamic.offset;
trace_seq_printf(s, "%s=STACK:\n", se->fields[i]->name);
-
- for (; *p && p < end; p++)
- trace_seq_printf(s, "=> %pS\n", (void *)*p);
+ for (j = 1; j < data->as_dynamic.len / sizeof(long); j++)
+ trace_seq_printf(s, "=> %pS\n", (void *)p[j]);
n_u64++;
-
} else {
struct trace_print_flags __flags[] = {
__def_gfpflag_names, {-1, NULL} };
@@ -490,10 +485,6 @@ static unsigned int trace_stack(struct synth_trace_event *entry,
break;
}
- /* Include the zero'd element if it fits */
- if (len < HIST_STACKTRACE_DEPTH)
- len++;
-
len *= sizeof(long);
/* Find the dynamic section to copy the stack into. */
From: Pietro Borrello <borrello(a)diag.uniroma1.it>
commit 7c4a5b89a0b5a57a64b601775b296abf77a9fe97 upstream.
Commit 326587b84078 ("sched: fix goto retry in pick_next_task_rt()")
removed any path which could make pick_next_rt_entity() return NULL.
However, BUG_ON(!rt_se) in _pick_next_task_rt() (the only caller of
pick_next_rt_entity()) still checks the error condition, which can
never happen, since list_entry() never returns NULL.
Remove the BUG_ON check, and instead emit a warning in the only
possible error condition here: the queue being empty which should
never happen.
Fixes: 326587b84078 ("sched: fix goto retry in pick_next_task_rt()")
Signed-off-by: Pietro Borrello <borrello(a)diag.uniroma1.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Phil Auld <pauld(a)redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Link: https://lore.kernel.org/r/20230128-list-entry-null-check-sched-v3-1-b1a71bd…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
[Srish: Fixes CVE-2023-1077: sched/rt: pick_next_rt_entity(): check list_entry
An insufficient list empty checking in pick_next_rt_entity().
The _pick_next_task_rt() checks pick_next_rt_entity() returns
NULL or not but pick_next_rt_entity() never returns NULL.
So, even if the list is empty, _pick_next_task_rt() continues
its process.]
Signed-off-by: Srish Srinivasan <ssrish(a)vmware.com>
---
kernel/sched/rt.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9c6c3572b..394c66442 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1522,6 +1522,8 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
BUG_ON(idx >= MAX_RT_PRIO);
queue = array->queue + idx;
+ if (SCHED_WARN_ON(list_empty(queue)))
+ return NULL;
next = list_entry(queue->next, struct sched_rt_entity, run_list);
return next;
@@ -1535,7 +1537,8 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
do {
rt_se = pick_next_rt_entity(rq, rt_rq);
- BUG_ON(!rt_se);
+ if (unlikely(!rt_se))
+ return NULL;
rt_rq = group_rt_rq(rt_se);
} while (rt_rq);
--
2.35.6
commit 9c7c4bc986932218fd0df9d2a100509772028fb1 upstream
sizeof(struct ublksrv_io_cmd) is 16bytes, which can be held in 64byte SQE,
so not necessary to check IO_URING_F_SQE128.
With this change, we get chance to save half SQ ring memory.
Fixed: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
Link: https://lore.kernel.org/r/20230220041413.1524335-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
---
drivers/block/ublk_drv.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index f48d213fb65e..09d29fa53939 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1271,9 +1271,6 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
__func__, cmd->cmd_op, ub_cmd->q_id, tag,
ub_cmd->result);
- if (!(issue_flags & IO_URING_F_SQE128))
- goto out;
-
if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues)
goto out;
--
2.40.1
From: Sanjay R Mehta <sanju.mehta(a)amd.com>
Previously, on unplug events, the TMU mode was disabled first
followed by the Time Synchronization Handshake, irrespective of
whether the tb_switch_tmu_rate_write() API was successful or not.
However, this caused a problem with Thunderbolt 3 (TBT3)
devices, as the TSPacketInterval bits were always enabled by default,
leading the host router to assume that the device router's TMU was
already enabled and preventing it from initiating the Time
Synchronization Handshake. As a result, TBT3 monitors experienced
display flickering from the second hot plug onwards.
To address this issue, we have modified the code to only disable the
Time Synchronization Handshake during TMU disable if the
tb_switch_tmu_rate_write() function is successful. This ensures that
the TBT3 devices function correctly and eliminates the display
flickering issue.
Co-developed-by: Sanath S <Sanath.S(a)amd.com>
Signed-off-by: Sanath S <Sanath.S(a)amd.com>
Signed-off-by: Sanjay R Mehta <sanju.mehta(a)amd.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Mika Westerberg <mika.westerberg(a)linux.intel.com>
(cherry picked from commit 583893a66d731f5da010a3fa38a0460e05f0149b)
USB4v2 introduced support for uni-directional TMU mode as part of
d49b4f043d63 ("thunderbolt: Add support for enhanced uni-directional TMU mode")
This is not a stable candidate commit, so adjust the code for backport.
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
drivers/thunderbolt/tmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/thunderbolt/tmu.c b/drivers/thunderbolt/tmu.c
index 626aca3124b1..d9544600b386 100644
--- a/drivers/thunderbolt/tmu.c
+++ b/drivers/thunderbolt/tmu.c
@@ -415,7 +415,8 @@ int tb_switch_tmu_disable(struct tb_switch *sw)
* uni-directional mode and we don't want to change it's TMU
* mode.
*/
- tb_switch_tmu_rate_write(sw, TB_SWITCH_TMU_RATE_OFF);
+ ret = tb_switch_tmu_rate_write(sw, TB_SWITCH_TMU_RATE_OFF);
+ return ret;
tb_port_tmu_time_sync_disable(up);
ret = tb_port_tmu_time_sync_disable(down);
--
2.34.1
The patch below does not apply to the 6.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.4.y
git checkout FETCH_HEAD
git cherry-pick -x 9d3de7ee192a6a253f475197fe4d2e2af10a731f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023072413-glamorous-unjustly-bb12@gregkh' --subject-prefix 'PATCH 6.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9d3de7ee192a6a253f475197fe4d2e2af10a731f Mon Sep 17 00:00:00 2001
From: Ojaswin Mujoo <ojaswin(a)linux.ibm.com>
Date: Sat, 22 Jul 2023 22:45:24 +0530
Subject: [PATCH] ext4: fix rbtree traversal bug in ext4_mb_use_preallocated
During allocations, while looking for preallocations(PA) in the per
inode rbtree, we can't do a direct traversal of the tree because
ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted
and that can cause direct traversal to skip some entries. This was
leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy
our request and ultimately tried to create a new PA that would overlap
with the missed one.
To makes sure we handle that case while still keeping the performance of
the rbtree, we make use of the fact that the only pa that could possibly
overlap the original goal start is the one that satisfies the below
conditions:
1. It must have it's logical start immediately to the left of
(ie less than) original logical start.
2. It must not be deleted
To find this pa we use the following traversal method:
1. Descend into the rbtree normally to find the immediate neighboring
PA. Here we keep descending irrespective of if the PA is deleted or if
it overlaps with our request etc. The goal is to find an immediately
adjacent PA.
2. If the found PA is on right of original goal, use rb_prev() to find
the left adjacent PA.
3. Check if this PA is deleted and keep moving left with rb_prev() until
a non deleted PA is found.
4. This is the PA we are looking for. Now we can check if it can satisfy
the original request and proceed accordingly.
This approach also takes care of having deleted PAs in the tree.
(While we are at it, also fix a possible overflow bug in calculating the
end of a PA)
[1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+t…
Cc: stable(a)kernel.org # 6.4
Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list")
Signed-off-by: Ojaswin Mujoo <ojaswin(a)linux.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju(a)linaro.org>
Reviewed-by: Ritesh Harjani (IBM) ritesh.list(a)gmail.com
Tested-by: Ritesh Harjani (IBM) ritesh.list(a)gmail.com
Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.16900459…
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 456150ef6111..21b903fe546e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4765,8 +4765,8 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
int order, i;
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
struct ext4_locality_group *lg;
- struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
- ext4_lblk_t tmp_pa_start, tmp_pa_end;
+ struct ext4_prealloc_space *tmp_pa = NULL, *cpa = NULL;
+ loff_t tmp_pa_end;
struct rb_node *iter;
ext4_fsblk_t goal_block;
@@ -4774,47 +4774,151 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
return false;
- /* first, try per-file preallocation */
+ /*
+ * first, try per-file preallocation by searching the inode pa rbtree.
+ *
+ * Here, we can't do a direct traversal of the tree because
+ * ext4_mb_discard_group_preallocation() can paralelly mark the pa
+ * deleted and that can cause direct traversal to skip some entries.
+ */
read_lock(&ei->i_prealloc_lock);
+
+ if (RB_EMPTY_ROOT(&ei->i_prealloc_node)) {
+ goto try_group_pa;
+ }
+
+ /*
+ * Step 1: Find a pa with logical start immediately adjacent to the
+ * original logical start. This could be on the left or right.
+ *
+ * (tmp_pa->pa_lstart never changes so we can skip locking for it).
+ */
for (iter = ei->i_prealloc_node.rb_node; iter;
iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
- tmp_pa_start, iter)) {
+ tmp_pa->pa_lstart, iter)) {
tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
pa_node.inode_node);
+ }
- /* all fields in this condition don't change,
- * so we can skip locking for them */
- tmp_pa_start = tmp_pa->pa_lstart;
- tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+ /*
+ * Step 2: The adjacent pa might be to the right of logical start, find
+ * the left adjacent pa. After this step we'd have a valid tmp_pa whose
+ * logical start is towards the left of original request's logical start
+ */
+ if (tmp_pa->pa_lstart > ac->ac_o_ex.fe_logical) {
+ struct rb_node *tmp;
+ tmp = rb_prev(&tmp_pa->pa_node.inode_node);
- /* original request start doesn't lie in this PA */
- if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
- ac->ac_o_ex.fe_logical >= tmp_pa_end)
- continue;
-
- /* non-extent files can't have physical blocks past 2^32 */
- if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
- (tmp_pa->pa_pstart + EXT4_C2B(sbi, tmp_pa->pa_len) >
- EXT4_MAX_BLOCK_FILE_PHYS)) {
+ if (tmp) {
+ tmp_pa = rb_entry(tmp, struct ext4_prealloc_space,
+ pa_node.inode_node);
+ } else {
/*
- * Since PAs don't overlap, we won't find any
- * other PA to satisfy this.
+ * If there is no adjacent pa to the left then finding
+ * an overlapping pa is not possible hence stop searching
+ * inode pa tree
+ */
+ goto try_group_pa;
+ }
+ }
+
+ BUG_ON(!(tmp_pa && tmp_pa->pa_lstart <= ac->ac_o_ex.fe_logical));
+
+ /*
+ * Step 3: If the left adjacent pa is deleted, keep moving left to find
+ * the first non deleted adjacent pa. After this step we should have a
+ * valid tmp_pa which is guaranteed to be non deleted.
+ */
+ for (iter = &tmp_pa->pa_node.inode_node;; iter = rb_prev(iter)) {
+ if (!iter) {
+ /*
+ * no non deleted left adjacent pa, so stop searching
+ * inode pa tree
+ */
+ goto try_group_pa;
+ }
+ tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+ pa_node.inode_node);
+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted == 0) {
+ /*
+ * We will keep holding the pa_lock from
+ * this point on because we don't want group discard
+ * to delete this pa underneath us. Since group
+ * discard is anyways an ENOSPC operation it
+ * should be okay for it to wait a few more cycles.
*/
break;
- }
-
- /* found preallocated blocks, use them */
- spin_lock(&tmp_pa->pa_lock);
- if (tmp_pa->pa_deleted == 0 && tmp_pa->pa_free &&
- likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
- atomic_inc(&tmp_pa->pa_count);
- ext4_mb_use_inode_pa(ac, tmp_pa);
+ } else {
spin_unlock(&tmp_pa->pa_lock);
- read_unlock(&ei->i_prealloc_lock);
- return true;
}
- spin_unlock(&tmp_pa->pa_lock);
}
+
+ BUG_ON(!(tmp_pa && tmp_pa->pa_lstart <= ac->ac_o_ex.fe_logical));
+ BUG_ON(tmp_pa->pa_deleted == 1);
+
+ /*
+ * Step 4: We now have the non deleted left adjacent pa. Only this
+ * pa can possibly satisfy the request hence check if it overlaps
+ * original logical start and stop searching if it doesn't.
+ */
+ tmp_pa_end = (loff_t)tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+
+ if (ac->ac_o_ex.fe_logical >= tmp_pa_end) {
+ spin_unlock(&tmp_pa->pa_lock);
+ goto try_group_pa;
+ }
+
+ /* non-extent files can't have physical blocks past 2^32 */
+ if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
+ (tmp_pa->pa_pstart + EXT4_C2B(sbi, tmp_pa->pa_len) >
+ EXT4_MAX_BLOCK_FILE_PHYS)) {
+ /*
+ * Since PAs don't overlap, we won't find any other PA to
+ * satisfy this.
+ */
+ spin_unlock(&tmp_pa->pa_lock);
+ goto try_group_pa;
+ }
+
+ if (tmp_pa->pa_free && likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
+ atomic_inc(&tmp_pa->pa_count);
+ ext4_mb_use_inode_pa(ac, tmp_pa);
+ spin_unlock(&tmp_pa->pa_lock);
+ read_unlock(&ei->i_prealloc_lock);
+ return true;
+ } else {
+ /*
+ * We found a valid overlapping pa but couldn't use it because
+ * it had no free blocks. This should ideally never happen
+ * because:
+ *
+ * 1. When a new inode pa is added to rbtree it must have
+ * pa_free > 0 since otherwise we won't actually need
+ * preallocation.
+ *
+ * 2. An inode pa that is in the rbtree can only have it's
+ * pa_free become zero when another thread calls:
+ * ext4_mb_new_blocks
+ * ext4_mb_use_preallocated
+ * ext4_mb_use_inode_pa
+ *
+ * 3. Further, after the above calls make pa_free == 0, we will
+ * immediately remove it from the rbtree in:
+ * ext4_mb_new_blocks
+ * ext4_mb_release_context
+ * ext4_mb_put_pa
+ *
+ * 4. Since the pa_free becoming 0 and pa_free getting removed
+ * from tree both happen in ext4_mb_new_blocks, which is always
+ * called with i_data_sem held for data allocations, we can be
+ * sure that another process will never see a pa in rbtree with
+ * pa_free == 0.
+ */
+ WARN_ON_ONCE(tmp_pa->pa_free == 0);
+ }
+ spin_unlock(&tmp_pa->pa_lock);
+try_group_pa:
read_unlock(&ei->i_prealloc_lock);
/* can we use group allocation? */
The patch below does not apply to the 6.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.4.y
git checkout FETCH_HEAD
git cherry-pick -x 5d5460fa7932bed3a9082a6a8852cfbdb46acbe8
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023072456-starting-gauging-768c@gregkh' --subject-prefix 'PATCH 6.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5d5460fa7932bed3a9082a6a8852cfbdb46acbe8 Mon Sep 17 00:00:00 2001
From: Ojaswin Mujoo <ojaswin(a)linux.ibm.com>
Date: Fri, 9 Jun 2023 16:04:03 +0530
Subject: [PATCH] ext4: fix off by one issue in
ext4_mb_choose_next_group_best_avail()
In ext4_mb_choose_next_group_best_avail(), we want the start order to be
1 less than goal length and the min_order to be, at max, 1 more than the
original length. This commit fixes an off by one issue that arose due to
the fact that 1 << fls(n) > (n).
After all the processing:
order = 1 order below goal len
min_order = maximum of the three:-
- order - trim_order
- 1 order below B2C(s_stripe)
- 1 order above original len
Cc: stable(a)kernel.org
Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)")
Signed-off-by: Ojaswin Mujoo <ojaswin(a)linux.ibm.com>
Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a2475b8c9fb5..456150ef6111 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1006,14 +1006,11 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context
* fls() instead since we need to know the actual length while modifying
* goal length.
*/
- order = fls(ac->ac_g_ex.fe_len);
+ order = fls(ac->ac_g_ex.fe_len) - 1;
min_order = order - sbi->s_mb_best_avail_max_trim_order;
if (min_order < 0)
min_order = 0;
- if (1 << min_order < ac->ac_o_ex.fe_len)
- min_order = fls(ac->ac_o_ex.fe_len) + 1;
-
if (sbi->s_stripe > 0) {
/*
* We are assuming that stripe size is always a multiple of
@@ -1021,9 +1018,16 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context
*/
num_stripe_clusters = EXT4_NUM_B2C(sbi, sbi->s_stripe);
if (1 << min_order < num_stripe_clusters)
- min_order = fls(num_stripe_clusters);
+ /*
+ * We consider 1 order less because later we round
+ * up the goal len to num_stripe_clusters
+ */
+ min_order = fls(num_stripe_clusters) - 1;
}
+ if (1 << min_order < ac->ac_o_ex.fe_len)
+ min_order = fls(ac->ac_o_ex.fe_len);
+
for (i = order; i >= min_order; i--) {
int frag_order;
/*
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 4b430d4ac99750ee2ae2f893f1055c7af1ec3dc5
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082118-donation-clench-604d@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
4b430d4ac997 ("mmc: block: Fix in_flight[issue_type] value error")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4b430d4ac99750ee2ae2f893f1055c7af1ec3dc5 Mon Sep 17 00:00:00 2001
From: Yibin Ding <yibin.ding(a)unisoc.com>
Date: Wed, 2 Aug 2023 10:30:23 +0800
Subject: [PATCH] mmc: block: Fix in_flight[issue_type] value error
For a completed request, after the mmc_blk_mq_complete_rq(mq, req)
function is executed, the bitmap_tags corresponding to the
request will be cleared, that is, the request will be regarded as
idle. If the request is acquired by a different type of process at
this time, the issue_type of the request may change. It further
caused the value of mq->in_flight[issue_type] to be abnormal,
and a large number of requests could not be sent.
p1: p2:
mmc_blk_mq_complete_rq
blk_mq_free_request
blk_mq_get_request
blk_mq_rq_ctx_init
mmc_blk_mq_dec_in_flight
mmc_issue_type(mq, req)
This strategy can ensure the consistency of issue_type
before and after executing mmc_blk_mq_complete_rq.
Fixes: 81196976ed94 ("mmc: block: Add blk-mq support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Yibin Ding <yibin.ding(a)unisoc.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Link: https://lore.kernel.org/r/20230802023023.1318134-1-yunlong.xing@unisoc.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index f701efb1fa78..b6f4be25b31b 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2097,14 +2097,14 @@ static void mmc_blk_mq_poll_completion(struct mmc_queue *mq,
mmc_blk_urgent_bkops(mq, mqrq);
}
-static void mmc_blk_mq_dec_in_flight(struct mmc_queue *mq, struct request *req)
+static void mmc_blk_mq_dec_in_flight(struct mmc_queue *mq, enum mmc_issue_type issue_type)
{
unsigned long flags;
bool put_card;
spin_lock_irqsave(&mq->lock, flags);
- mq->in_flight[mmc_issue_type(mq, req)] -= 1;
+ mq->in_flight[issue_type] -= 1;
put_card = (mmc_tot_in_flight(mq) == 0);
@@ -2117,6 +2117,7 @@ static void mmc_blk_mq_dec_in_flight(struct mmc_queue *mq, struct request *req)
static void mmc_blk_mq_post_req(struct mmc_queue *mq, struct request *req,
bool can_sleep)
{
+ enum mmc_issue_type issue_type = mmc_issue_type(mq, req);
struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
struct mmc_request *mrq = &mqrq->brq.mrq;
struct mmc_host *host = mq->card->host;
@@ -2136,7 +2137,7 @@ static void mmc_blk_mq_post_req(struct mmc_queue *mq, struct request *req,
blk_mq_complete_request(req);
}
- mmc_blk_mq_dec_in_flight(mq, req);
+ mmc_blk_mq_dec_in_flight(mq, issue_type);
}
void mmc_blk_mq_recovery(struct mmc_queue *mq)
We observed a 35% of regression running phoronix pts/ramspeed and also 16%
with unixbench. Regression is caused by the following commit:
dd0f194cfeb5 | mm: rewrite wait_on_page_bit_common() logic
Backporting this fixes the regression (this is already in 5.9+):
- 5ef64cc8987a mm: allow a controlled amount of unfairness in the page lock
Linus Torvalds (1):
mm: allow a controlled amount of unfairness in the page lock
include/linux/mm.h | 2 +
include/linux/wait.h | 2 +
kernel/sysctl.c | 8 +++
mm/filemap.c | 160 ++++++++++++++++++++++++++++++++++---------
4 files changed, 141 insertions(+), 31 deletions(-)
--
2.41.0
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x a337b64f0d5717248a0c894e2618e658e6a9de9f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023080742-ion-implement-ceb1@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
a337b64f0d57 ("drm/i915: Fix premature release of request's reusable memory")
506006055769 ("drm/i915/active: Fix misuse of non-idle barriers as fence trackers")
ad5c99e02047 ("drm/i915: Remove unused bits of i915_vma/active api")
f6c466b84cfa ("drm/i915: Add support for moving fence waiting")
544460c33821 ("drm/i915: Multi-BB execbuf")
5851387a422c ("drm/i915/guc: Implement no mid batch preemption for multi-lrc")
e5e32171a2cf ("drm/i915/guc: Connect UAPI to GuC multi-lrc interface")
d38a9294491d ("drm/i915/guc: Update debugfs for GuC multi-lrc")
bc955204919e ("drm/i915/guc: Insert submit fences between requests in parent-child relationship")
6b540bf6f143 ("drm/i915/guc: Implement multi-lrc submission")
99b47aaddfa9 ("drm/i915/guc: Implement parallel context pin / unpin functions")
c2aa552ff09d ("drm/i915/guc: Add multi-lrc context registration")
3897df4c0187 ("drm/i915/guc: Introduce context parent-child relationship")
4f3059dc2dbb ("drm/i915: Add logical engine mapping")
1a52faed3131 ("drm/i915/guc: Take GT PM ref when deregistering context")
0ea92ace8b95 ("drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct")
0d8ee5ba8db4 ("drm/i915: Don't back up pinned LMEM context images and rings during suspend")
c56ce9565374 ("drm/i915 Implement LMEM backup and restore for suspend / resume")
0d9388635a22 ("drm/i915/ttm: Implement a function to copy the contents of two TTM-based objects")
68c03c0e985e ("drm/i915/debugfs: Do not report currently active engine when describing objects")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a337b64f0d5717248a0c894e2618e658e6a9de9f Mon Sep 17 00:00:00 2001
From: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
Date: Thu, 20 Jul 2023 11:35:44 +0200
Subject: [PATCH] drm/i915: Fix premature release of request's reusable memory
Infinite waits for completion of GPU activity have been observed in CI,
mostly inside __i915_active_wait(), triggered by igt@gem_barrier_race or
igt@perf@stress-open-close. Root cause analysis, based of ftrace dumps
generated with a lot of extra trace_printk() calls added to the code,
revealed loops of request dependencies being accidentally built,
preventing the requests from being processed, each waiting for completion
of another one's activity.
After we substitute a new request for a last active one tracked on a
timeline, we set up a dependency of our new request to wait on completion
of current activity of that previous one. While doing that, we must take
care of keeping the old request still in memory until we use its
attributes for setting up that await dependency, or we can happen to set
up the await dependency on an unrelated request that already reuses the
memory previously allocated to the old one, already released. Combined
with perf adding consecutive kernel context remote requests to different
user context timelines, unresolvable loops of await dependencies can be
built, leading do infinite waits.
We obtain a pointer to the previous request to wait upon when we
substitute it with a pointer to our new request in an active tracker,
e.g. in intel_timeline.last_request. In some processing paths we protect
that old request from being freed before we use it by getting a reference
to it under RCU protection, but in others, e.g. __i915_request_commit()
-> __i915_request_add_to_timeline() -> __i915_request_ensure_ordering(),
we don't. But anyway, since the requests' memory is SLAB_FAILSAFE_BY_RCU,
that RCU protection is not sufficient against reuse of memory.
We could protect i915_request's memory from being prematurely reused by
calling its release function via call_rcu() and using rcu_read_lock()
consequently, as proposed in v1. However, that approach leads to
significant (up to 10 times) increase of SLAB utilization by i915_request
SLAB cache. Another potential approach is to take a reference to the
previous active fence.
When updating an active fence tracker, we first lock the new fence,
substitute a pointer of the current active fence with the new one, then we
lock the substituted fence. With this approach, there is a time window
after the substitution and before the lock when the request can be
concurrently released by an interrupt handler and its memory reused, then
we may happen to lock and return a new, unrelated request.
Always get a reference to the current active fence first, before
replacing it with a new one. Having it protected from premature release
and reuse, lock it and then replace with the new one but only if not
yet signalled via a potential concurrent interrupt nor replaced with
another one by a potential concurrent thread, otherwise retry, starting
from getting a reference to the new current one. Adjust users to not
get a reference to the previous active fence themselves and always put the
reference got by __i915_active_fence_set() when no longer needed.
v3: Fix lockdep splat reports and other issues caused by incorrect use of
try_cmpxchg() (use (cmpxchg() != prev) instead)
v2: Protect request's memory by getting a reference to it in favor of
delegating its release to call_rcu() (Chris)
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8211
Fixes: df9f85d8582e ("drm/i915: Serialise i915_active_fence_set() with itself")
Suggested-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> # v5.6+
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230720093543.832147-2-janus…
(cherry picked from commit 946e047a3d88d46d15b5c5af0414098e12b243f7)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 8ef93889061a..5ec293011d99 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -449,8 +449,11 @@ int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
}
} while (unlikely(is_barrier(active)));
- if (!__i915_active_fence_set(active, fence))
+ fence = __i915_active_fence_set(active, fence);
+ if (!fence)
__i915_active_acquire(ref);
+ else
+ dma_fence_put(fence);
out:
i915_active_release(ref);
@@ -469,13 +472,9 @@ __i915_active_set_fence(struct i915_active *ref,
return NULL;
}
- rcu_read_lock();
prev = __i915_active_fence_set(active, fence);
- if (prev)
- prev = dma_fence_get_rcu(prev);
- else
+ if (!prev)
__i915_active_acquire(ref);
- rcu_read_unlock();
return prev;
}
@@ -1019,10 +1018,11 @@ void i915_request_add_active_barriers(struct i915_request *rq)
*
* Records the new @fence as the last active fence along its timeline in
* this active tracker, moving the tracking callbacks from the previous
- * fence onto this one. Returns the previous fence (if not already completed),
- * which the caller must ensure is executed before the new fence. To ensure
- * that the order of fences within the timeline of the i915_active_fence is
- * understood, it should be locked by the caller.
+ * fence onto this one. Gets and returns a reference to the previous fence
+ * (if not already completed), which the caller must put after making sure
+ * that it is executed before the new fence. To ensure that the order of
+ * fences within the timeline of the i915_active_fence is understood, it
+ * should be locked by the caller.
*/
struct dma_fence *
__i915_active_fence_set(struct i915_active_fence *active,
@@ -1031,7 +1031,23 @@ __i915_active_fence_set(struct i915_active_fence *active,
struct dma_fence *prev;
unsigned long flags;
- if (fence == rcu_access_pointer(active->fence))
+ /*
+ * In case of fences embedded in i915_requests, their memory is
+ * SLAB_FAILSAFE_BY_RCU, then it can be reused right after release
+ * by new requests. Then, there is a risk of passing back a pointer
+ * to a new, completely unrelated fence that reuses the same memory
+ * while tracked under a different active tracker. Combined with i915
+ * perf open/close operations that build await dependencies between
+ * engine kernel context requests and user requests from different
+ * timelines, this can lead to dependency loops and infinite waits.
+ *
+ * As a countermeasure, we try to get a reference to the active->fence
+ * first, so if we succeed and pass it back to our user then it is not
+ * released and potentially reused by an unrelated request before the
+ * user has a chance to set up an await dependency on it.
+ */
+ prev = i915_active_fence_get(active);
+ if (fence == prev)
return fence;
GEM_BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));
@@ -1040,27 +1056,56 @@ __i915_active_fence_set(struct i915_active_fence *active,
* Consider that we have two threads arriving (A and B), with
* C already resident as the active->fence.
*
- * A does the xchg first, and so it sees C or NULL depending
- * on the timing of the interrupt handler. If it is NULL, the
- * previous fence must have been signaled and we know that
- * we are first on the timeline. If it is still present,
- * we acquire the lock on that fence and serialise with the interrupt
- * handler, in the process removing it from any future interrupt
- * callback. A will then wait on C before executing (if present).
- *
- * As B is second, it sees A as the previous fence and so waits for
- * it to complete its transition and takes over the occupancy for
- * itself -- remembering that it needs to wait on A before executing.
+ * Both A and B have got a reference to C or NULL, depending on the
+ * timing of the interrupt handler. Let's assume that if A has got C
+ * then it has locked C first (before B).
*
* Note the strong ordering of the timeline also provides consistent
* nesting rules for the fence->lock; the inner lock is always the
* older lock.
*/
spin_lock_irqsave(fence->lock, flags);
- prev = xchg(__active_fence_slot(active), fence);
- if (prev) {
- GEM_BUG_ON(prev == fence);
+ if (prev)
spin_lock_nested(prev->lock, SINGLE_DEPTH_NESTING);
+
+ /*
+ * A does the cmpxchg first, and so it sees C or NULL, as before, or
+ * something else, depending on the timing of other threads and/or
+ * interrupt handler. If not the same as before then A unlocks C if
+ * applicable and retries, starting from an attempt to get a new
+ * active->fence. Meanwhile, B follows the same path as A.
+ * Once A succeeds with cmpxch, B fails again, retires, gets A from
+ * active->fence, locks it as soon as A completes, and possibly
+ * succeeds with cmpxchg.
+ */
+ while (cmpxchg(__active_fence_slot(active), prev, fence) != prev) {
+ if (prev) {
+ spin_unlock(prev->lock);
+ dma_fence_put(prev);
+ }
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ prev = i915_active_fence_get(active);
+ GEM_BUG_ON(prev == fence);
+
+ spin_lock_irqsave(fence->lock, flags);
+ if (prev)
+ spin_lock_nested(prev->lock, SINGLE_DEPTH_NESTING);
+ }
+
+ /*
+ * If prev is NULL then the previous fence must have been signaled
+ * and we know that we are first on the timeline. If it is still
+ * present then, having the lock on that fence already acquired, we
+ * serialise with the interrupt handler, in the process of removing it
+ * from any future interrupt callback. A will then wait on C before
+ * executing (if present).
+ *
+ * As B is second, it sees A as the previous fence and so waits for
+ * it to complete its transition and takes over the occupancy for
+ * itself -- remembering that it needs to wait on A before executing.
+ */
+ if (prev) {
__list_del_entry(&active->cb.node);
spin_unlock(prev->lock); /* serialise with prev->cb_list */
}
@@ -1077,11 +1122,7 @@ int i915_active_fence_set(struct i915_active_fence *active,
int err = 0;
/* Must maintain timeline ordering wrt previous active requests */
- rcu_read_lock();
fence = __i915_active_fence_set(active, &rq->fence);
- if (fence) /* but the previous fence may not belong to that timeline! */
- fence = dma_fence_get_rcu(fence);
- rcu_read_unlock();
if (fence) {
err = i915_request_await_dma_fence(rq, fence);
dma_fence_put(fence);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 894068bb37b6..833b73edefdb 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1661,6 +1661,11 @@ __i915_request_ensure_parallel_ordering(struct i915_request *rq,
request_to_parent(rq)->parallel.last_rq = i915_request_get(rq);
+ /*
+ * Users have to put a reference potentially got by
+ * __i915_active_fence_set() to the returned request
+ * when no longer needed
+ */
return to_request(__i915_active_fence_set(&timeline->last_request,
&rq->fence));
}
@@ -1707,6 +1712,10 @@ __i915_request_ensure_ordering(struct i915_request *rq,
0);
}
+ /*
+ * Users have to put the reference to prev potentially got
+ * by __i915_active_fence_set() when no longer needed
+ */
return prev;
}
@@ -1760,6 +1769,8 @@ __i915_request_add_to_timeline(struct i915_request *rq)
prev = __i915_request_ensure_ordering(rq, timeline);
else
prev = __i915_request_ensure_parallel_ordering(rq, timeline);
+ if (prev)
+ i915_request_put(prev);
/*
* Make sure that no request gazumped us - if it was allocated after
The patch below does not apply to the 6.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.4.y
git checkout FETCH_HEAD
git cherry-pick -x 656f9aec07dba7c61d469727494a5d1b18d0bef4
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082705-predator-enjoyable-15fb@gregkh' --subject-prefix 'PATCH 6.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 656f9aec07dba7c61d469727494a5d1b18d0bef4 Mon Sep 17 00:00:00 2001
From: Huacai Chen <chenhuacai(a)kernel.org>
Date: Sat, 26 Aug 2023 22:21:57 +0800
Subject: [PATCH] LoongArch: Ensure FP/SIMD registers in the core dump file is
up to date
This is a port of commit 379eb01c21795edb4c ("riscv: Ensure the value
of FP registers in the core dump file is up to date").
The values of FP/SIMD registers in the core dump file come from the
thread.fpu. However, kernel saves the FP/SIMD registers only before
scheduling out the process. If no process switch happens during the
exception handling, kernel will not have a chance to save the latest
values of FP/SIMD registers. So it may cause their values in the core
dump file incorrect. To solve this problem, force fpr_get()/simd_get()
to save the FP/SIMD registers into the thread.fpu if the target task
equals the current task.
Cc: stable(a)vger.kernel.org
Signed-off-by: Huacai Chen <chenhuacai(a)loongson.cn>
diff --git a/arch/loongarch/include/asm/fpu.h b/arch/loongarch/include/asm/fpu.h
index b541f6248837..c2d8962fda00 100644
--- a/arch/loongarch/include/asm/fpu.h
+++ b/arch/loongarch/include/asm/fpu.h
@@ -173,16 +173,30 @@ static inline void restore_fp(struct task_struct *tsk)
_restore_fp(&tsk->thread.fpu);
}
-static inline union fpureg *get_fpu_regs(struct task_struct *tsk)
+static inline void save_fpu_regs(struct task_struct *tsk)
{
+ unsigned int euen;
+
if (tsk == current) {
preempt_disable();
- if (is_fpu_owner())
+
+ euen = csr_read32(LOONGARCH_CSR_EUEN);
+
+#ifdef CONFIG_CPU_HAS_LASX
+ if (euen & CSR_EUEN_LASXEN)
+ _save_lasx(¤t->thread.fpu);
+ else
+#endif
+#ifdef CONFIG_CPU_HAS_LSX
+ if (euen & CSR_EUEN_LSXEN)
+ _save_lsx(¤t->thread.fpu);
+ else
+#endif
+ if (euen & CSR_EUEN_FPEN)
_save_fp(¤t->thread.fpu);
+
preempt_enable();
}
-
- return tsk->thread.fpu.fpr;
}
static inline int is_simd_owner(void)
diff --git a/arch/loongarch/kernel/ptrace.c b/arch/loongarch/kernel/ptrace.c
index a0767c3a0f0a..f72adbf530c6 100644
--- a/arch/loongarch/kernel/ptrace.c
+++ b/arch/loongarch/kernel/ptrace.c
@@ -147,6 +147,8 @@ static int fpr_get(struct task_struct *target,
{
int r;
+ save_fpu_regs(target);
+
if (sizeof(target->thread.fpu.fpr[0]) == sizeof(elf_fpreg_t))
r = gfpr_get(target, &to);
else
@@ -278,6 +280,8 @@ static int simd_get(struct task_struct *target,
{
const unsigned int wr_size = NUM_FPU_REGS * regset->size;
+ save_fpu_regs(target);
+
if (!tsk_used_math(target)) {
/* The task hasn't used FP or LSX, fill with 0xff */
copy_pad_fprs(target, regset, &to, 0);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x a337b64f0d5717248a0c894e2618e658e6a9de9f
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023080739-trinity-overfed-f523@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
a337b64f0d57 ("drm/i915: Fix premature release of request's reusable memory")
506006055769 ("drm/i915/active: Fix misuse of non-idle barriers as fence trackers")
ad5c99e02047 ("drm/i915: Remove unused bits of i915_vma/active api")
f6c466b84cfa ("drm/i915: Add support for moving fence waiting")
544460c33821 ("drm/i915: Multi-BB execbuf")
5851387a422c ("drm/i915/guc: Implement no mid batch preemption for multi-lrc")
e5e32171a2cf ("drm/i915/guc: Connect UAPI to GuC multi-lrc interface")
d38a9294491d ("drm/i915/guc: Update debugfs for GuC multi-lrc")
bc955204919e ("drm/i915/guc: Insert submit fences between requests in parent-child relationship")
6b540bf6f143 ("drm/i915/guc: Implement multi-lrc submission")
99b47aaddfa9 ("drm/i915/guc: Implement parallel context pin / unpin functions")
c2aa552ff09d ("drm/i915/guc: Add multi-lrc context registration")
3897df4c0187 ("drm/i915/guc: Introduce context parent-child relationship")
4f3059dc2dbb ("drm/i915: Add logical engine mapping")
1a52faed3131 ("drm/i915/guc: Take GT PM ref when deregistering context")
0ea92ace8b95 ("drm/i915/guc: Move GuC guc_id allocation under submission state sub-struct")
0d8ee5ba8db4 ("drm/i915: Don't back up pinned LMEM context images and rings during suspend")
c56ce9565374 ("drm/i915 Implement LMEM backup and restore for suspend / resume")
0d9388635a22 ("drm/i915/ttm: Implement a function to copy the contents of two TTM-based objects")
68c03c0e985e ("drm/i915/debugfs: Do not report currently active engine when describing objects")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a337b64f0d5717248a0c894e2618e658e6a9de9f Mon Sep 17 00:00:00 2001
From: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
Date: Thu, 20 Jul 2023 11:35:44 +0200
Subject: [PATCH] drm/i915: Fix premature release of request's reusable memory
Infinite waits for completion of GPU activity have been observed in CI,
mostly inside __i915_active_wait(), triggered by igt@gem_barrier_race or
igt@perf@stress-open-close. Root cause analysis, based of ftrace dumps
generated with a lot of extra trace_printk() calls added to the code,
revealed loops of request dependencies being accidentally built,
preventing the requests from being processed, each waiting for completion
of another one's activity.
After we substitute a new request for a last active one tracked on a
timeline, we set up a dependency of our new request to wait on completion
of current activity of that previous one. While doing that, we must take
care of keeping the old request still in memory until we use its
attributes for setting up that await dependency, or we can happen to set
up the await dependency on an unrelated request that already reuses the
memory previously allocated to the old one, already released. Combined
with perf adding consecutive kernel context remote requests to different
user context timelines, unresolvable loops of await dependencies can be
built, leading do infinite waits.
We obtain a pointer to the previous request to wait upon when we
substitute it with a pointer to our new request in an active tracker,
e.g. in intel_timeline.last_request. In some processing paths we protect
that old request from being freed before we use it by getting a reference
to it under RCU protection, but in others, e.g. __i915_request_commit()
-> __i915_request_add_to_timeline() -> __i915_request_ensure_ordering(),
we don't. But anyway, since the requests' memory is SLAB_FAILSAFE_BY_RCU,
that RCU protection is not sufficient against reuse of memory.
We could protect i915_request's memory from being prematurely reused by
calling its release function via call_rcu() and using rcu_read_lock()
consequently, as proposed in v1. However, that approach leads to
significant (up to 10 times) increase of SLAB utilization by i915_request
SLAB cache. Another potential approach is to take a reference to the
previous active fence.
When updating an active fence tracker, we first lock the new fence,
substitute a pointer of the current active fence with the new one, then we
lock the substituted fence. With this approach, there is a time window
after the substitution and before the lock when the request can be
concurrently released by an interrupt handler and its memory reused, then
we may happen to lock and return a new, unrelated request.
Always get a reference to the current active fence first, before
replacing it with a new one. Having it protected from premature release
and reuse, lock it and then replace with the new one but only if not
yet signalled via a potential concurrent interrupt nor replaced with
another one by a potential concurrent thread, otherwise retry, starting
from getting a reference to the new current one. Adjust users to not
get a reference to the previous active fence themselves and always put the
reference got by __i915_active_fence_set() when no longer needed.
v3: Fix lockdep splat reports and other issues caused by incorrect use of
try_cmpxchg() (use (cmpxchg() != prev) instead)
v2: Protect request's memory by getting a reference to it in favor of
delegating its release to call_rcu() (Chris)
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8211
Fixes: df9f85d8582e ("drm/i915: Serialise i915_active_fence_set() with itself")
Suggested-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> # v5.6+
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230720093543.832147-2-janus…
(cherry picked from commit 946e047a3d88d46d15b5c5af0414098e12b243f7)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 8ef93889061a..5ec293011d99 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -449,8 +449,11 @@ int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
}
} while (unlikely(is_barrier(active)));
- if (!__i915_active_fence_set(active, fence))
+ fence = __i915_active_fence_set(active, fence);
+ if (!fence)
__i915_active_acquire(ref);
+ else
+ dma_fence_put(fence);
out:
i915_active_release(ref);
@@ -469,13 +472,9 @@ __i915_active_set_fence(struct i915_active *ref,
return NULL;
}
- rcu_read_lock();
prev = __i915_active_fence_set(active, fence);
- if (prev)
- prev = dma_fence_get_rcu(prev);
- else
+ if (!prev)
__i915_active_acquire(ref);
- rcu_read_unlock();
return prev;
}
@@ -1019,10 +1018,11 @@ void i915_request_add_active_barriers(struct i915_request *rq)
*
* Records the new @fence as the last active fence along its timeline in
* this active tracker, moving the tracking callbacks from the previous
- * fence onto this one. Returns the previous fence (if not already completed),
- * which the caller must ensure is executed before the new fence. To ensure
- * that the order of fences within the timeline of the i915_active_fence is
- * understood, it should be locked by the caller.
+ * fence onto this one. Gets and returns a reference to the previous fence
+ * (if not already completed), which the caller must put after making sure
+ * that it is executed before the new fence. To ensure that the order of
+ * fences within the timeline of the i915_active_fence is understood, it
+ * should be locked by the caller.
*/
struct dma_fence *
__i915_active_fence_set(struct i915_active_fence *active,
@@ -1031,7 +1031,23 @@ __i915_active_fence_set(struct i915_active_fence *active,
struct dma_fence *prev;
unsigned long flags;
- if (fence == rcu_access_pointer(active->fence))
+ /*
+ * In case of fences embedded in i915_requests, their memory is
+ * SLAB_FAILSAFE_BY_RCU, then it can be reused right after release
+ * by new requests. Then, there is a risk of passing back a pointer
+ * to a new, completely unrelated fence that reuses the same memory
+ * while tracked under a different active tracker. Combined with i915
+ * perf open/close operations that build await dependencies between
+ * engine kernel context requests and user requests from different
+ * timelines, this can lead to dependency loops and infinite waits.
+ *
+ * As a countermeasure, we try to get a reference to the active->fence
+ * first, so if we succeed and pass it back to our user then it is not
+ * released and potentially reused by an unrelated request before the
+ * user has a chance to set up an await dependency on it.
+ */
+ prev = i915_active_fence_get(active);
+ if (fence == prev)
return fence;
GEM_BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));
@@ -1040,27 +1056,56 @@ __i915_active_fence_set(struct i915_active_fence *active,
* Consider that we have two threads arriving (A and B), with
* C already resident as the active->fence.
*
- * A does the xchg first, and so it sees C or NULL depending
- * on the timing of the interrupt handler. If it is NULL, the
- * previous fence must have been signaled and we know that
- * we are first on the timeline. If it is still present,
- * we acquire the lock on that fence and serialise with the interrupt
- * handler, in the process removing it from any future interrupt
- * callback. A will then wait on C before executing (if present).
- *
- * As B is second, it sees A as the previous fence and so waits for
- * it to complete its transition and takes over the occupancy for
- * itself -- remembering that it needs to wait on A before executing.
+ * Both A and B have got a reference to C or NULL, depending on the
+ * timing of the interrupt handler. Let's assume that if A has got C
+ * then it has locked C first (before B).
*
* Note the strong ordering of the timeline also provides consistent
* nesting rules for the fence->lock; the inner lock is always the
* older lock.
*/
spin_lock_irqsave(fence->lock, flags);
- prev = xchg(__active_fence_slot(active), fence);
- if (prev) {
- GEM_BUG_ON(prev == fence);
+ if (prev)
spin_lock_nested(prev->lock, SINGLE_DEPTH_NESTING);
+
+ /*
+ * A does the cmpxchg first, and so it sees C or NULL, as before, or
+ * something else, depending on the timing of other threads and/or
+ * interrupt handler. If not the same as before then A unlocks C if
+ * applicable and retries, starting from an attempt to get a new
+ * active->fence. Meanwhile, B follows the same path as A.
+ * Once A succeeds with cmpxch, B fails again, retires, gets A from
+ * active->fence, locks it as soon as A completes, and possibly
+ * succeeds with cmpxchg.
+ */
+ while (cmpxchg(__active_fence_slot(active), prev, fence) != prev) {
+ if (prev) {
+ spin_unlock(prev->lock);
+ dma_fence_put(prev);
+ }
+ spin_unlock_irqrestore(fence->lock, flags);
+
+ prev = i915_active_fence_get(active);
+ GEM_BUG_ON(prev == fence);
+
+ spin_lock_irqsave(fence->lock, flags);
+ if (prev)
+ spin_lock_nested(prev->lock, SINGLE_DEPTH_NESTING);
+ }
+
+ /*
+ * If prev is NULL then the previous fence must have been signaled
+ * and we know that we are first on the timeline. If it is still
+ * present then, having the lock on that fence already acquired, we
+ * serialise with the interrupt handler, in the process of removing it
+ * from any future interrupt callback. A will then wait on C before
+ * executing (if present).
+ *
+ * As B is second, it sees A as the previous fence and so waits for
+ * it to complete its transition and takes over the occupancy for
+ * itself -- remembering that it needs to wait on A before executing.
+ */
+ if (prev) {
__list_del_entry(&active->cb.node);
spin_unlock(prev->lock); /* serialise with prev->cb_list */
}
@@ -1077,11 +1122,7 @@ int i915_active_fence_set(struct i915_active_fence *active,
int err = 0;
/* Must maintain timeline ordering wrt previous active requests */
- rcu_read_lock();
fence = __i915_active_fence_set(active, &rq->fence);
- if (fence) /* but the previous fence may not belong to that timeline! */
- fence = dma_fence_get_rcu(fence);
- rcu_read_unlock();
if (fence) {
err = i915_request_await_dma_fence(rq, fence);
dma_fence_put(fence);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 894068bb37b6..833b73edefdb 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1661,6 +1661,11 @@ __i915_request_ensure_parallel_ordering(struct i915_request *rq,
request_to_parent(rq)->parallel.last_rq = i915_request_get(rq);
+ /*
+ * Users have to put a reference potentially got by
+ * __i915_active_fence_set() to the returned request
+ * when no longer needed
+ */
return to_request(__i915_active_fence_set(&timeline->last_request,
&rq->fence));
}
@@ -1707,6 +1712,10 @@ __i915_request_ensure_ordering(struct i915_request *rq,
0);
}
+ /*
+ * Users have to put the reference to prev potentially got
+ * by __i915_active_fence_set() when no longer needed
+ */
return prev;
}
@@ -1760,6 +1769,8 @@ __i915_request_add_to_timeline(struct i915_request *rq)
prev = __i915_request_ensure_ordering(rq, timeline);
else
prev = __i915_request_ensure_parallel_ordering(rq, timeline);
+ if (prev)
+ i915_request_put(prev);
/*
* Make sure that no request gazumped us - if it was allocated after
From: Joel Fernandes <joel(a)joelfernandes.org>
[ Upstream commit d52d3a2bf408ff86f3a79560b5cce80efb340239 ]
During shutdown of rcutorture, the shutdown thread in
rcu_torture_cleanup() calls torture_cleanup_begin() which sets fullstop
to FULLSTOP_RMMOD. This is enough to cause the rcutorture threads for
readers and fakewriters to breakout of their main while loop and start
shutting down.
Once out of their main loop, they then call torture_kthread_stopping()
which in turn waits for kthread_stop() to be called, however
rcu_torture_cleanup() has not even called kthread_stop() on those
threads yet, it does that a bit later. However, before it gets a chance
to do so, torture_kthread_stopping() calls
schedule_timeout_interruptible(1) in a tight loop. Tracing confirmed
this makes the timer softirq constantly execute timer callbacks, while
never returning back to the softirq exit path and is essentially "locked
up" because of that. If the softirq preempts the shutdown thread,
kthread_stop() may never be called.
This commit improves the situation dramatically, by increasing timeout
passed to schedule_timeout_interruptible() 1/20th of a second. This
causes the timer softirq to not lock up a CPU and everything works fine.
Testing has shown 100 runs of TREE07 passing reliably, which was not the
case before because of RCU stalls.
Cc: Paul McKenney <paulmck(a)kernel.org>
Cc: Frederic Weisbecker <fweisbec(a)gmail.com>
Cc: Zhouyi Zhou <zhouzhouyi(a)gmail.com>
Cc: <stable(a)vger.kernel.org> # 6.0.x
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Reviewed-by: Davidlohr Bueso <dave(a)stgolabs.net>
Tested-by: Zhouyi Zhou <zhouzhouyi(a)gmail.com>
---
kernel/torture.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/torture.c b/kernel/torture.c
index 1061492f14bd..477d9b601438 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -788,7 +788,7 @@ void torture_kthread_stopping(char *title)
VERBOSE_TOROUT_STRING(buf);
while (!kthread_should_stop()) {
torture_shutdown_absorb(title);
- schedule_timeout_uninterruptible(1);
+ schedule_timeout_uninterruptible(HZ/20);
}
}
EXPORT_SYMBOL_GPL(torture_kthread_stopping);
--
2.41.0.640.ga95def55d0-goog
These two are backports for 6.1.y. Conflict resolution in done in
both patches.
I have tested LTP-nfs fchown02 and chown02 on 6.1.y with below patches
applied. The tests passed.
I would like to have a review as I am not familiar with this code.
Thanks to Vegard for helping me with this.
Thanks,
Harshit
Christian Brauner (2):
nfs: use vfs setgid helper
nfsd: use vfs setgid helper
fs/attr.c | 1 +
fs/internal.h | 2 --
fs/nfs/inode.c | 4 +---
fs/nfsd/vfs.c | 4 +++-
include/linux/fs.h | 2 ++
5 files changed, 7 insertions(+), 6 deletions(-)
--
2.34.1
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x f9e96bf1905479f18e83a3a4c314a8dfa56ede2c
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082736-canal-swimsuit-6b7b@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
f9e96bf19054 ("drm/vmwgfx: Fix possible invalid drm gem put calls")
9ef8d83e8e25 ("drm/vmwgfx: Do not drop the reference to the handle too soon")
668b206601c5 ("drm/vmwgfx: Stop using raw ttm_buffer_object's")
39985eea5a6d ("drm/vmwgfx: Abstract placement selection")
e0029da927fa ("drm/vmwgfx: Rename dummy to is_iomem")
cb8097a45da1 ("drm/vmwgfx: Cleanup the vmw bo usage in the cursor paths")
6703e28f976d ("drm/vmwgfx: Simplify fb pinning")
09881d2940bb ("drm/vmwgfx: Rename vmw_buffer_object to vmw_bo")
6b2e8aa45126 ("drm/vmwgfx: Remove the duplicate bo_free function")
f87c1f0b7b79 ("drm/ttm: prevent moving of pinned BOs")
aebd8f0c6f82 ("Merge v6.2-rc6 into drm-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f9e96bf1905479f18e83a3a4c314a8dfa56ede2c Mon Sep 17 00:00:00 2001
From: Zack Rusin <zackr(a)vmware.com>
Date: Fri, 18 Aug 2023 00:13:01 -0400
Subject: [PATCH] drm/vmwgfx: Fix possible invalid drm gem put calls
vmw_bo_unreference sets the input buffer to null on exit, resulting in
null ptr deref's on the subsequent drm gem put calls.
This went unnoticed because only very old userspace would be exercising
those paths but it wouldn't be hard to hit on old distros with brand
new kernels.
Introduce a new function that abstracts unrefing of user bo's to make
the code cleaner and more explicit.
Signed-off-by: Zack Rusin <zackr(a)vmware.com>
Reported-by: Ian Forbes <iforbes(a)vmware.com>
Fixes: 9ef8d83e8e25 ("drm/vmwgfx: Do not drop the reference to the handle too soon")
Cc: <stable(a)vger.kernel.org> # v6.4+
Reviewed-by: Maaz Mombasawala<mombasawalam(a)vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230818041301.407636-1-zack@…
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
index 82094c137855..c43853597776 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.c
@@ -497,10 +497,9 @@ static int vmw_user_bo_synccpu_release(struct drm_file *filp,
if (!(flags & drm_vmw_synccpu_allow_cs)) {
atomic_dec(&vmw_bo->cpu_writers);
}
- ttm_bo_put(&vmw_bo->tbo);
+ vmw_user_bo_unref(vmw_bo);
}
- drm_gem_object_put(&vmw_bo->tbo.base);
return ret;
}
@@ -540,8 +539,7 @@ int vmw_user_bo_synccpu_ioctl(struct drm_device *dev, void *data,
return ret;
ret = vmw_user_bo_synccpu_grab(vbo, arg->flags);
- vmw_bo_unreference(&vbo);
- drm_gem_object_put(&vbo->tbo.base);
+ vmw_user_bo_unref(vbo);
if (unlikely(ret != 0)) {
if (ret == -ERESTARTSYS || ret == -EBUSY)
return -EBUSY;
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.h b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.h
index 50a836e70994..1d433fceed3d 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_bo.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_bo.h
@@ -195,6 +195,14 @@ static inline struct vmw_bo *vmw_bo_reference(struct vmw_bo *buf)
return buf;
}
+static inline void vmw_user_bo_unref(struct vmw_bo *vbo)
+{
+ if (vbo) {
+ ttm_bo_put(&vbo->tbo);
+ drm_gem_object_put(&vbo->tbo.base);
+ }
+}
+
static inline struct vmw_bo *to_vmw_bo(struct drm_gem_object *gobj)
{
return container_of((gobj), struct vmw_bo, tbo.base);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
index d30c0e3d3ab7..98e0723ca6f5 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
@@ -1164,8 +1164,7 @@ static int vmw_translate_mob_ptr(struct vmw_private *dev_priv,
}
vmw_bo_placement_set(vmw_bo, VMW_BO_DOMAIN_MOB, VMW_BO_DOMAIN_MOB);
ret = vmw_validation_add_bo(sw_context->ctx, vmw_bo);
- ttm_bo_put(&vmw_bo->tbo);
- drm_gem_object_put(&vmw_bo->tbo.base);
+ vmw_user_bo_unref(vmw_bo);
if (unlikely(ret != 0))
return ret;
@@ -1221,8 +1220,7 @@ static int vmw_translate_guest_ptr(struct vmw_private *dev_priv,
vmw_bo_placement_set(vmw_bo, VMW_BO_DOMAIN_GMR | VMW_BO_DOMAIN_VRAM,
VMW_BO_DOMAIN_GMR | VMW_BO_DOMAIN_VRAM);
ret = vmw_validation_add_bo(sw_context->ctx, vmw_bo);
- ttm_bo_put(&vmw_bo->tbo);
- drm_gem_object_put(&vmw_bo->tbo.base);
+ vmw_user_bo_unref(vmw_bo);
if (unlikely(ret != 0))
return ret;
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_kms.c b/drivers/gpu/drm/vmwgfx/vmwgfx_kms.c
index b62207be3363..1489ad73c103 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_kms.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_kms.c
@@ -1665,10 +1665,8 @@ static struct drm_framebuffer *vmw_kms_fb_create(struct drm_device *dev,
err_out:
/* vmw_user_lookup_handle takes one ref so does new_fb */
- if (bo) {
- vmw_bo_unreference(&bo);
- drm_gem_object_put(&bo->tbo.base);
- }
+ if (bo)
+ vmw_user_bo_unref(bo);
if (surface)
vmw_surface_unreference(&surface);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_overlay.c b/drivers/gpu/drm/vmwgfx/vmwgfx_overlay.c
index 7e112319a23c..fb85f244c3d0 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_overlay.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_overlay.c
@@ -451,8 +451,7 @@ int vmw_overlay_ioctl(struct drm_device *dev, void *data,
ret = vmw_overlay_update_stream(dev_priv, buf, arg, true);
- vmw_bo_unreference(&buf);
- drm_gem_object_put(&buf->tbo.base);
+ vmw_user_bo_unref(buf);
out_unlock:
mutex_unlock(&overlay->mutex);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_shader.c b/drivers/gpu/drm/vmwgfx/vmwgfx_shader.c
index e7226db8b242..1e81ff2422cf 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_shader.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_shader.c
@@ -809,8 +809,7 @@ static int vmw_shader_define(struct drm_device *dev, struct drm_file *file_priv,
shader_type, num_input_sig,
num_output_sig, tfile, shader_handle);
out_bad_arg:
- vmw_bo_unreference(&buffer);
- drm_gem_object_put(&buffer->tbo.base);
+ vmw_user_bo_unref(buffer);
return ret;
}
With commit 44b1fbc0f5f3 ("m68k/q40: Replace q40ide driver
with pata_falcon and falconide"), the Q40 IDE driver was
replaced by pata_falcon.c.
Both IO and memory resources were defined for the Q40 IDE
platform device, but definition of the IDE register addresses
was modeled after the Falcon case, both in use of the memory
resources and in including register shift and byte vs. word
offset in the address.
This was correct for the Falcon case, which does not apply
any address translation to the register addresses. In the
Q40 case, all of device base address, byte access offset
and register shift is included in the platform specific
ISA access translation (in asm/mm_io.h).
As a consequence, such address translation gets applied
twice, and register addresses are mangled.
Use the device base address from the platform IO resource
for Q40 (the IO address translation will then add the correct
ISA window base address and byte access offset), with register
shift 1. Use MMIO base address and register shift 2 as before
for Falcon.
Encode PIO_OFFSET into IO port addresses for all registers
for Q40 except the data transfer register. Encode the MMIO
offset there (pata_falcon_data_xfer() directly uses raw IO
with no address translation).
Reported-by: William R Sowerbutts <will(a)sowerbutts.com>
Closes: https://lore.kernel.org/r/CAMuHMdUU62jjunJh9cqSqHT87B0H0A4udOOPs=WN7WZKpcag…
Link: https://lore.kernel.org/r/CAMuHMdUU62jjunJh9cqSqHT87B0H0A4udOOPs=WN7WZKpcag…
Fixes: 44b1fbc0f5f3 ("m68k/q40: Replace q40ide driver with pata_falcon and falconide")
Cc: stable(a)vger.kernel.org
Cc: Finn Thain <fthain(a)linux-m68k.org>
Cc: Geert Uytterhoeven <geert(a)linux-m68k.org>
Tested-by: William R Sowerbutts <will(a)sowerbutts.com>
Signed-off-by: Michael Schmitz <schmitzmic(a)gmail.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov(a)omp.ru>
Reviewed-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
Changes from v4:
Geert Uytterhoeven:
- use %px for ap->ioaddr.data_addr
Changes from v3:
Sergey Shtylyov:
- change use of reg_scale to reg_shift
Geert Uytterhoeven:
- factor out ata_port_desc() from platform specific code
Changes from v2:
Finn Thain:
- add back stable Cc:
Changes from v1:
Damien Le Moal:
- change patch title
- drop stable backport tag
Changes from RFC v3:
- split off byte swap option into separate patch
Geert Uytterhoeven:
- review comments
Changes from RFC v2:
- add driver parameter 'data_swap' as bit mask for drives to swap
Changes from RFC v1:
Finn Thain:
- take care to supply IO address suitable for ioread8/iowrite8
- use MMIO address for data transfer
---
drivers/ata/pata_falcon.c | 50 +++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 21 deletions(-)
diff --git a/drivers/ata/pata_falcon.c b/drivers/ata/pata_falcon.c
index 996516e64f13..616064b02de6 100644
--- a/drivers/ata/pata_falcon.c
+++ b/drivers/ata/pata_falcon.c
@@ -123,8 +123,8 @@ static int __init pata_falcon_init_one(struct platform_device *pdev)
struct resource *base_res, *ctl_res, *irq_res;
struct ata_host *host;
struct ata_port *ap;
- void __iomem *base;
- int irq = 0;
+ void __iomem *base, *ctl_base;
+ int irq = 0, io_offset = 1, reg_shift = 2; /* Falcon defaults */
dev_info(&pdev->dev, "Atari Falcon and Q40/Q60 PATA controller\n");
@@ -165,26 +165,34 @@ static int __init pata_falcon_init_one(struct platform_device *pdev)
ap->pio_mask = ATA_PIO4;
ap->flags |= ATA_FLAG_SLAVE_POSS | ATA_FLAG_NO_IORDY;
- base = (void __iomem *)base_mem_res->start;
/* N.B. this assumes data_addr will be used for word-sized I/O only */
- ap->ioaddr.data_addr = base + 0 + 0 * 4;
- ap->ioaddr.error_addr = base + 1 + 1 * 4;
- ap->ioaddr.feature_addr = base + 1 + 1 * 4;
- ap->ioaddr.nsect_addr = base + 1 + 2 * 4;
- ap->ioaddr.lbal_addr = base + 1 + 3 * 4;
- ap->ioaddr.lbam_addr = base + 1 + 4 * 4;
- ap->ioaddr.lbah_addr = base + 1 + 5 * 4;
- ap->ioaddr.device_addr = base + 1 + 6 * 4;
- ap->ioaddr.status_addr = base + 1 + 7 * 4;
- ap->ioaddr.command_addr = base + 1 + 7 * 4;
-
- base = (void __iomem *)ctl_mem_res->start;
- ap->ioaddr.altstatus_addr = base + 1;
- ap->ioaddr.ctl_addr = base + 1;
-
- ata_port_desc(ap, "cmd 0x%lx ctl 0x%lx",
- (unsigned long)base_mem_res->start,
- (unsigned long)ctl_mem_res->start);
+ ap->ioaddr.data_addr = (void __iomem *)base_mem_res->start;
+
+ if (base_res) { /* only Q40 has IO resources */
+ io_offset = 0x10000;
+ reg_shift = 0;
+ base = (void __iomem *)base_res->start;
+ ctl_base = (void __iomem *)ctl_res->start;
+ } else {
+ base = (void __iomem *)base_mem_res->start;
+ ctl_base = (void __iomem *)ctl_mem_res->start;
+ }
+
+ ap->ioaddr.error_addr = base + io_offset + (1 << reg_shift);
+ ap->ioaddr.feature_addr = base + io_offset + (1 << reg_shift);
+ ap->ioaddr.nsect_addr = base + io_offset + (2 << reg_shift);
+ ap->ioaddr.lbal_addr = base + io_offset + (3 << reg_shift);
+ ap->ioaddr.lbam_addr = base + io_offset + (4 << reg_shift);
+ ap->ioaddr.lbah_addr = base + io_offset + (5 << reg_shift);
+ ap->ioaddr.device_addr = base + io_offset + (6 << reg_shift);
+ ap->ioaddr.status_addr = base + io_offset + (7 << reg_shift);
+ ap->ioaddr.command_addr = base + io_offset + (7 << reg_shift);
+
+ ap->ioaddr.altstatus_addr = ctl_base + io_offset;
+ ap->ioaddr.ctl_addr = ctl_base + io_offset;
+
+ ata_port_desc(ap, "cmd %px ctl %px data %px",
+ base, ctl_base, ap->ioaddr.data_addr);
irq_res = platform_get_resource(pdev, IORESOURCE_IRQ, 0);
if (irq_res && irq_res->start > 0) {
--
2.17.1
since the commit
a855724dc08b pinctrl: amd: Fix mistake in handling clearing pins at startup
after boot, pressing power button is detected just once and ignored
afterwards.
product: IdeaPad 5 14ALC05
cpu: AMD Ryzen 5 5500U with Radeon Graphics
bios version: G5CN16WW(V1.04)
distro: Arch Linux
desktop environment: KDE Plasma 5.27.7
steps to reproduce:
boot the computer
log in
run sudo evtest
select event2
(on my computer the power button is always represented by
/dev/input/event2,
i don't know if it's the same on others)
press the power button multiple times
(might have to close the log out dialog depending on the DE)
expected behavior:
all the power button presses are recorded
observed behavior:
only the first power button press is recorded
i also have a desktop computer with a ryzen 5 2600x processor, but that
isn't affected
#regzbot introduced: a855724dc08b
The patch below does not apply to the 6.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.4.y
git checkout FETCH_HEAD
git cherry-pick -x 2f406263e3e954aa24c1248edcfa9be0c1bb30fa
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082620-overact-feisty-c309@gregkh' --subject-prefix 'PATCH 6.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2f406263e3e954aa24c1248edcfa9be0c1bb30fa Mon Sep 17 00:00:00 2001
From: Yin Fengwei <fengwei.yin(a)intel.com>
Date: Tue, 8 Aug 2023 10:09:15 +0800
Subject: [PATCH] madvise:madvise_cold_or_pageout_pte_range(): don't use
mapcount() against large folio for sharing check
Patch series "don't use mapcount() to check large folio sharing", v2.
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared. But it's
not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
This patchset will fix the cases:
User space application call madvise() with MADV_FREE, MADV_COLD and
MADV_PAGEOUT for specific address range. There are THP mapped to the
range. Without the patchset, the THP is skipped. With the patch, the
THP will be split and handled accordingly.
David reported the cow self test skip some cases because of MADV_PAGEOUT
skip THP:
https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel…
and I confirmed this patchset make it work again.
This patch (of 3):
Commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range()
to use folios") replaced the page_mapcount() with folio_mapcount() to
check whether the folio is shared by other mapping.
It's not correct for large folio. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.
Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.
Link: https://lkml.kernel.org/r/20230808020917.2230692-1-fengwei.yin@intel.com
Link: https://lkml.kernel.org/r/20230808020917.2230692-2-fengwei.yin@intel.com
Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
Signed-off-by: Yin Fengwei <fengwei.yin(a)intel.com>
Reviewed-by: Yu Zhao <yuzhao(a)google.com>
Reviewed-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/madvise.c b/mm/madvise.c
index bfe0e06427bd..46802b4cf65a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -384,7 +384,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio = pfn_folio(pmd_pfn(orig_pmd));
/* Do not interfere with other mappings of this folio */
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
goto huge_unlock;
if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -458,7 +458,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_large(folio)) {
int err;
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082644-dimmed-purse-07c2@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
cfeb6ae8bcb9 ("maple_tree: disable mas_wr_append() when other readers are possible")
2e1da329b424 ("maple_tree: add comments and some minor cleanups to mas_wr_append()")
c6fc9e4a5c50 ("maple_tree: add mas_wr_new_end() to calculate new_end accurately")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Date: Fri, 18 Aug 2023 20:43:55 -0400
Subject: [PATCH] maple_tree: disable mas_wr_append() when other readers are
possible
The current implementation of append may cause duplicate data and/or
incorrect ranges to be returned to a reader during an update. Although
this has not been reported or seen, disable the append write operation
while the tree is in rcu mode out of an abundance of caution.
During the analysis of the mas_next_slot() the following was
artificially created by separating the writer and reader code:
Writer: reader:
mas_wr_append
set end pivot
updates end metata
Detects write to last slot
last slot write is to start of slot
store current contents in slot
overwrite old end pivot
mas_next_slot():
read end metadata
read old end pivot
return with incorrect range
store new value
Alternatively:
Writer: reader:
mas_wr_append
set end pivot
updates end metata
Detects write to last slot
last lost write to end of slot
store value
mas_next_slot():
read end metadata
read old end pivot
read new end pivot
return with incorrect range
set old end pivot
There may be other accesses that are not safe since we are now updating
both metadata and pointers, so disabling append if there could be rcu
readers is the safest action.
Link: https://lkml.kernel.org/r/20230819004356.1454718-2-Liam.Howlett@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 4dd73cf936a6..f723024e1426 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4265,6 +4265,10 @@ static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas)
* mas_wr_append: Attempt to append
* @wr_mas: the maple write state
*
+ * This is currently unsafe in rcu mode since the end of the node may be cached
+ * by readers while the node contents may be updated which could result in
+ * inaccurate information.
+ *
* Return: True if appended, false otherwise
*/
static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
@@ -4274,6 +4278,9 @@ static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
struct ma_state *mas = wr_mas->mas;
unsigned char node_pivots = mt_pivots[wr_mas->type];
+ if (mt_in_rcu(mas->tree))
+ return false;
+
if (mas->offset != wr_mas->node_end)
return false;
The patch below does not apply to the 6.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.4.y
git checkout FETCH_HEAD
git cherry-pick -x cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082642-energetic-playpen-0bea@gregkh' --subject-prefix 'PATCH 6.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cfeb6ae8bcb96ccf674724f223661bbcef7b0d0b Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Date: Fri, 18 Aug 2023 20:43:55 -0400
Subject: [PATCH] maple_tree: disable mas_wr_append() when other readers are
possible
The current implementation of append may cause duplicate data and/or
incorrect ranges to be returned to a reader during an update. Although
this has not been reported or seen, disable the append write operation
while the tree is in rcu mode out of an abundance of caution.
During the analysis of the mas_next_slot() the following was
artificially created by separating the writer and reader code:
Writer: reader:
mas_wr_append
set end pivot
updates end metata
Detects write to last slot
last slot write is to start of slot
store current contents in slot
overwrite old end pivot
mas_next_slot():
read end metadata
read old end pivot
return with incorrect range
store new value
Alternatively:
Writer: reader:
mas_wr_append
set end pivot
updates end metata
Detects write to last slot
last lost write to end of slot
store value
mas_next_slot():
read end metadata
read old end pivot
read new end pivot
return with incorrect range
set old end pivot
There may be other accesses that are not safe since we are now updating
both metadata and pointers, so disabling append if there could be rcu
readers is the safest action.
Link: https://lkml.kernel.org/r/20230819004356.1454718-2-Liam.Howlett@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 4dd73cf936a6..f723024e1426 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4265,6 +4265,10 @@ static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas)
* mas_wr_append: Attempt to append
* @wr_mas: the maple write state
*
+ * This is currently unsafe in rcu mode since the end of the node may be cached
+ * by readers while the node contents may be updated which could result in
+ * inaccurate information.
+ *
* Return: True if appended, false otherwise
*/
static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
@@ -4274,6 +4278,9 @@ static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
struct ma_state *mas = wr_mas->mas;
unsigned char node_pivots = mt_pivots[wr_mas->type];
+ if (mt_in_rcu(mas->tree))
+ return false;
+
if (mas->offset != wr_mas->node_end)
return false;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x 987aae75fc1041072941ffb622b45ce2359a99b9
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082645-unicycle-diaphragm-272c@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
987aae75fc10 ("batman-adv: Hold rtnl lock during MTU update via netlink")
3e15b06eb7e4 ("batman-adv: Add fragmentation mesh genl configuration")
a1c8de803296 ("batman-adv: Add distributed_arp_table mesh genl configuration")
43ff6105a527 ("batman-adv: Add bridge_loop_avoidance mesh genl configuration")
d7e52506b680 ("batman-adv: Add bonding mesh genl configuration")
e43d16b87dc2 ("batman-adv: Add ap_isolation mesh/vlan genl configuration")
9ab4cee5ced9 ("batman-adv: Add aggregated_ogms mesh genl configuration")
49e7e37cd981 ("batman-adv: Prepare framework for vlan genl config")
5c55a40fa801 ("batman-adv: Prepare framework for hardif genl config")
600405135360 ("batman-adv: Prepare framework for mesh genl config")
c4a7a8d9bb8f ("batman-adv: Move common genl doit code pre/post hooks")
fb69be697916 ("batman-adv: Add inconsistent hardif netlink dump detection")
53dd9a68ba68 ("batman-adv: add multicast flags netlink support")
41aeefcc38a2 ("batman-adv: add DAT cache netlink support")
fec149f5d323 ("batman-adv: Convert packet.h to uapi header")
7e9a8c2ce7c5 ("batman-adv: Use parentheses in function kernel-doc")
7db7d9f369a4 ("batman-adv: Add SPDX license identifier above copyright header")
40b16b9be577 ("batman-adv: use inline kernel-doc for uapi constants")
706cc9f51d9a ("batman-adv: Add argument names for function ptr definitions")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 987aae75fc1041072941ffb622b45ce2359a99b9 Mon Sep 17 00:00:00 2001
From: Sven Eckelmann <sven(a)narfation.org>
Date: Mon, 21 Aug 2023 21:48:48 +0200
Subject: [PATCH] batman-adv: Hold rtnl lock during MTU update via netlink
The automatic recalculation of the maximum allowed MTU is usually triggered
by code sections which are already rtnl lock protected by callers outside
of batman-adv. But when the fragmentation setting is changed via
batman-adv's own batadv genl family, then the rtnl lock is not yet taken.
But dev_set_mtu requires that the caller holds the rtnl lock because it
uses netdevice notifiers. And this code will then fail the check for this
lock:
RTNL: assertion failed at net/core/dev.c (1953)
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+f8812454d9b3ac00d282(a)syzkaller.appspotmail.com
Fixes: c6a953cce8d0 ("batman-adv: Trigger events for auto adjusted MTU")
Signed-off-by: Sven Eckelmann <sven(a)narfation.org>
Reviewed-by: Simon Horman <horms(a)kernel.org>
Link: https://lore.kernel.org/r/20230821-batadv-missing-mtu-rtnl-lock-v1-1-1c5a7b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/batman-adv/netlink.c b/net/batman-adv/netlink.c
index ad5714f737be..6efbc9275aec 100644
--- a/net/batman-adv/netlink.c
+++ b/net/batman-adv/netlink.c
@@ -495,7 +495,10 @@ static int batadv_netlink_set_mesh(struct sk_buff *skb, struct genl_info *info)
attr = info->attrs[BATADV_ATTR_FRAGMENTATION_ENABLED];
atomic_set(&bat_priv->fragmentation, !!nla_get_u8(attr));
+
+ rtnl_lock();
batadv_update_min_mtu(bat_priv->soft_iface);
+ rtnl_unlock();
}
if (info->attrs[BATADV_ATTR_GW_BANDWIDTH_DOWN]) {
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 987aae75fc1041072941ffb622b45ce2359a99b9
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082644-placard-bullfight-dbc3@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
987aae75fc10 ("batman-adv: Hold rtnl lock during MTU update via netlink")
3e15b06eb7e4 ("batman-adv: Add fragmentation mesh genl configuration")
a1c8de803296 ("batman-adv: Add distributed_arp_table mesh genl configuration")
43ff6105a527 ("batman-adv: Add bridge_loop_avoidance mesh genl configuration")
d7e52506b680 ("batman-adv: Add bonding mesh genl configuration")
e43d16b87dc2 ("batman-adv: Add ap_isolation mesh/vlan genl configuration")
9ab4cee5ced9 ("batman-adv: Add aggregated_ogms mesh genl configuration")
49e7e37cd981 ("batman-adv: Prepare framework for vlan genl config")
5c55a40fa801 ("batman-adv: Prepare framework for hardif genl config")
600405135360 ("batman-adv: Prepare framework for mesh genl config")
c4a7a8d9bb8f ("batman-adv: Move common genl doit code pre/post hooks")
fb69be697916 ("batman-adv: Add inconsistent hardif netlink dump detection")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 987aae75fc1041072941ffb622b45ce2359a99b9 Mon Sep 17 00:00:00 2001
From: Sven Eckelmann <sven(a)narfation.org>
Date: Mon, 21 Aug 2023 21:48:48 +0200
Subject: [PATCH] batman-adv: Hold rtnl lock during MTU update via netlink
The automatic recalculation of the maximum allowed MTU is usually triggered
by code sections which are already rtnl lock protected by callers outside
of batman-adv. But when the fragmentation setting is changed via
batman-adv's own batadv genl family, then the rtnl lock is not yet taken.
But dev_set_mtu requires that the caller holds the rtnl lock because it
uses netdevice notifiers. And this code will then fail the check for this
lock:
RTNL: assertion failed at net/core/dev.c (1953)
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+f8812454d9b3ac00d282(a)syzkaller.appspotmail.com
Fixes: c6a953cce8d0 ("batman-adv: Trigger events for auto adjusted MTU")
Signed-off-by: Sven Eckelmann <sven(a)narfation.org>
Reviewed-by: Simon Horman <horms(a)kernel.org>
Link: https://lore.kernel.org/r/20230821-batadv-missing-mtu-rtnl-lock-v1-1-1c5a7b…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/net/batman-adv/netlink.c b/net/batman-adv/netlink.c
index ad5714f737be..6efbc9275aec 100644
--- a/net/batman-adv/netlink.c
+++ b/net/batman-adv/netlink.c
@@ -495,7 +495,10 @@ static int batadv_netlink_set_mesh(struct sk_buff *skb, struct genl_info *info)
attr = info->attrs[BATADV_ATTR_FRAGMENTATION_ENABLED];
atomic_set(&bat_priv->fragmentation, !!nla_get_u8(attr));
+
+ rtnl_lock();
batadv_update_min_mtu(bat_priv->soft_iface);
+ rtnl_unlock();
}
if (info->attrs[BATADV_ATTR_GW_BANDWIDTH_DOWN]) {
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.14.y
git checkout FETCH_HEAD
git cherry-pick -x d8e42a2b0addf238be8b3b37dcd9795a5c1be459
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082656-sadly-harness-6601@gregkh' --subject-prefix 'PATCH 4.14.y' HEAD^..
Possible dependencies:
d8e42a2b0add ("batman-adv: Don't increase MTU when set by user")
c6a953cce8d0 ("batman-adv: Trigger events for auto adjusted MTU")
8b84cc4fb556 ("batman-adv: Use inline kernel-doc for enum/struct")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d8e42a2b0addf238be8b3b37dcd9795a5c1be459 Mon Sep 17 00:00:00 2001
From: Sven Eckelmann <sven(a)narfation.org>
Date: Wed, 19 Jul 2023 10:01:15 +0200
Subject: [PATCH] batman-adv: Don't increase MTU when set by user
If the user set an MTU value, it usually means that there are special
requirements for the MTU. But if an interface gots activated, the MTU was
always recalculated and then the user set value was overwritten.
The only reason why this user set value has to be overwritten, is when the
MTU has to be decreased because batman-adv is not able to transfer packets
with the user specified size.
Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sven Eckelmann <sven(a)narfation.org>
Signed-off-by: Simon Wunderlich <sw(a)simonwunderlich.de>
diff --git a/net/batman-adv/hard-interface.c b/net/batman-adv/hard-interface.c
index ae5762af0146..24c9c0c3f316 100644
--- a/net/batman-adv/hard-interface.c
+++ b/net/batman-adv/hard-interface.c
@@ -630,7 +630,19 @@ int batadv_hardif_min_mtu(struct net_device *soft_iface)
*/
void batadv_update_min_mtu(struct net_device *soft_iface)
{
- dev_set_mtu(soft_iface, batadv_hardif_min_mtu(soft_iface));
+ struct batadv_priv *bat_priv = netdev_priv(soft_iface);
+ int limit_mtu;
+ int mtu;
+
+ mtu = batadv_hardif_min_mtu(soft_iface);
+
+ if (bat_priv->mtu_set_by_user)
+ limit_mtu = bat_priv->mtu_set_by_user;
+ else
+ limit_mtu = ETH_DATA_LEN;
+
+ mtu = min(mtu, limit_mtu);
+ dev_set_mtu(soft_iface, mtu);
/* Check if the local translate table should be cleaned up to match a
* new (and smaller) MTU.
diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index d3fdf82282af..85d00dc9ce32 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -153,11 +153,14 @@ static int batadv_interface_set_mac_addr(struct net_device *dev, void *p)
static int batadv_interface_change_mtu(struct net_device *dev, int new_mtu)
{
+ struct batadv_priv *bat_priv = netdev_priv(dev);
+
/* check ranges */
if (new_mtu < 68 || new_mtu > batadv_hardif_min_mtu(dev))
return -EINVAL;
dev->mtu = new_mtu;
+ bat_priv->mtu_set_by_user = new_mtu;
return 0;
}
diff --git a/net/batman-adv/types.h b/net/batman-adv/types.h
index ca9449ec9836..cf1a0eafe3ab 100644
--- a/net/batman-adv/types.h
+++ b/net/batman-adv/types.h
@@ -1546,6 +1546,12 @@ struct batadv_priv {
/** @soft_iface: net device which holds this struct as private data */
struct net_device *soft_iface;
+ /**
+ * @mtu_set_by_user: MTU was set once by user
+ * protected by rtnl_lock
+ */
+ int mtu_set_by_user;
+
/**
* @bat_counters: mesh internal traffic statistic counters (see
* batadv_counters)
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x e2c1ab070fdc81010ec44634838d24fce9ff9e53
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082631-preacher-spousal-e0e5@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
e2c1ab070fdc ("mm: memory-failure: fix unexpected return value in soft_offline_page()")
7adb45887c8a ("mm: memory-failure: kill soft_offline_free_page()")
2a57d83c78f8 ("mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e2c1ab070fdc81010ec44634838d24fce9ff9e53 Mon Sep 17 00:00:00 2001
From: Miaohe Lin <linmiaohe(a)huawei.com>
Date: Tue, 27 Jun 2023 19:28:08 +0800
Subject: [PATCH] mm: memory-failure: fix unexpected return value in
soft_offline_page()
When page_handle_poison() fails to handle the hugepage or free page in
retry path, soft_offline_page() will return 0 while -EBUSY is expected in
this case.
Consequently the user will think soft_offline_page succeeds while it in
fact failed. So the user will not try again later in this case.
Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com
Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages")
Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 139b31fdb678..fe121fdb05f7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2741,10 +2741,13 @@ int soft_offline_page(unsigned long pfn, int flags)
if (ret > 0) {
ret = soft_offline_in_use_page(page);
} else if (ret == 0) {
- if (!page_handle_poison(page, true, false) && try_again) {
- try_again = false;
- flags &= ~MF_COUNT_INCREASED;
- goto retry;
+ if (!page_handle_poison(page, true, false)) {
+ if (try_again) {
+ try_again = false;
+ flags &= ~MF_COUNT_INCREASED;
+ goto retry;
+ }
+ ret = -EBUSY;
}
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x e2c1ab070fdc81010ec44634838d24fce9ff9e53
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082619-rocky-strobe-5ad9@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
e2c1ab070fdc ("mm: memory-failure: fix unexpected return value in soft_offline_page()")
7adb45887c8a ("mm: memory-failure: kill soft_offline_free_page()")
2a57d83c78f8 ("mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()")
dad4e5b39086 ("mm: fix page reference leak in soft_offline_page()")
8295d535e2aa ("mm,hwpoison: refactor get_any_page")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e2c1ab070fdc81010ec44634838d24fce9ff9e53 Mon Sep 17 00:00:00 2001
From: Miaohe Lin <linmiaohe(a)huawei.com>
Date: Tue, 27 Jun 2023 19:28:08 +0800
Subject: [PATCH] mm: memory-failure: fix unexpected return value in
soft_offline_page()
When page_handle_poison() fails to handle the hugepage or free page in
retry path, soft_offline_page() will return 0 while -EBUSY is expected in
this case.
Consequently the user will think soft_offline_page succeeds while it in
fact failed. So the user will not try again later in this case.
Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com
Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages")
Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 139b31fdb678..fe121fdb05f7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2741,10 +2741,13 @@ int soft_offline_page(unsigned long pfn, int flags)
if (ret > 0) {
ret = soft_offline_in_use_page(page);
} else if (ret == 0) {
- if (!page_handle_poison(page, true, false) && try_again) {
- try_again = false;
- flags &= ~MF_COUNT_INCREASED;
- goto retry;
+ if (!page_handle_poison(page, true, false)) {
+ if (try_again) {
+ try_again = false;
+ flags &= ~MF_COUNT_INCREASED;
+ goto retry;
+ }
+ ret = -EBUSY;
}
}
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x d74943a2f3cdade34e471b36f55f7979be656867
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082639-crushed-impart-a42d@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
d74943a2f3cd ("mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT")
2c2241081f7d ("mm/gup: move private gup FOLL_ flags to internal.h")
63b605128655 ("mm/gup: move gup_must_unshare() to mm/internal.h")
f04740f54594 ("mm/gup: add FOLL_UNLOCKABLE")
d64e2dbc33a1 ("mm/gup: simplify the external interface functions and consolidate invariants")
afa3c33e2684 ("mm/gup: don't call __gup_longterm_locked() if FOLL_LONGTERM cannot be set")
b2a72dff85fa ("mm/gup: have internal functions get the mmap_read_lock()")
b5054174ac7c ("mm: move FOLL_* defs to mm_types.h")
8fa590bf3448 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d74943a2f3cdade34e471b36f55f7979be656867 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david(a)redhat.com>
Date: Thu, 3 Aug 2023 16:32:02 +0200
Subject: [PATCH] mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
gup_can_follow_protnone()") missed that follow_page() and
follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really
don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting
or due to inaccessible (PROT_NONE) VMAs.
As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast"): "Other follow_page callers like KSM
should not use FOLL_NUMA, or they would fail to get the pages if they use
follow_page instead of get_user_pages."
liubo reported [1] that smaps_rollup results are imprecise, because they
miss accounting of pages that are mapped PROT_NONE. Further, it's easy to
reproduce that KSM no longer works on inaccessible VMAs on x86-64, because
pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs,
and follow_page() refuses to return such pages right now.
As KVM really depends on these NUMA hinting faults, removing the
pte_protnone()/pmd_protnone() handling in GUP code completely is not
really an option.
To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
to restore the original behavior for now and add better comments.
Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in
is_valid_gup_args(), to add that flag for all external GUP users.
Note that there are three GUP-internal __get_user_pages() users that don't
end up calling is_valid_gup_args() and consequently won't get
FOLL_HONOR_NUMA_FAULT set.
1) get_dump_page(): we really don't want to handle NUMA hinting
faults. It specifies FOLL_FORCE and wouldn't have honored NUMA
hinting faults already.
2) populate_vma_page_range(): we really don't want to handle NUMA hinting
faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have
honored NUMA hinting faults already.
3) faultin_vma_page_range(): we similarly don't want to handle NUMA
hinting faults.
To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in
inaccessible VMAs properly, we have to perform VMA accessibility checks in
gup_can_follow_protnone().
As GUP-fast should reject such pages either way in
pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and
arm64 that both implement pte_protnone() -- let's just always fallback to
ordinary GUP when stumbling over pte_protnone()/pmd_protnone().
As Linus notes [2], honoring NUMA faults might only make sense for
selected GUP users.
So we should really see if we can instead let relevant GUP callers specify
it manually, and not trigger NUMA hinting faults from GUP as default.
Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and
adding appropriate documenation.
While at it, remove a stale comment from follow_trans_huge_pmd(): That
comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa:
serialise parallel get_user_page against THP migration"), which noted:
THP does not unmap pages due to a lack of support for migration
entries at a PMD level. This allows races with get_user_pages
Nowadays, we do have PMD migration entries, so the comment no longer
applies. Let's drop it.
[1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
[2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs…
Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reported-by: liubo <liubo254(a)huawei.com>
Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
Reported-by: Peter Xu <peterx(a)redhat.com>
Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
Acked-by: Mel Gorman <mgorman(a)techsingularity.net>
Acked-by: Peter Xu <peterx(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 406ab9ea818f..34f9dba17c1a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3421,15 +3421,24 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
* Indicates whether GUP can follow a PROT_NONE mapped page, or whether
* a (NUMA hinting) fault is required.
*/
-static inline bool gup_can_follow_protnone(unsigned int flags)
+static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
+ unsigned int flags)
{
/*
- * FOLL_FORCE has to be able to make progress even if the VMA is
- * inaccessible. Further, FOLL_FORCE access usually does not represent
- * application behaviour and we should avoid triggering NUMA hinting
- * faults.
+ * If callers don't want to honor NUMA hinting faults, no need to
+ * determine if we would actually have to trigger a NUMA hinting fault.
*/
- return flags & FOLL_FORCE;
+ if (!(flags & FOLL_HONOR_NUMA_FAULT))
+ return true;
+
+ /*
+ * NUMA hinting faults don't apply in inaccessible (PROT_NONE) VMAs.
+ *
+ * Requiring a fault here even for inaccessible VMAs would mean that
+ * FOLL_FORCE cannot make any progress, because handle_mm_fault()
+ * refuses to process NUMA hinting faults in inaccessible VMAs.
+ */
+ return !vma_is_accessible(vma);
}
typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e74ce4a28cd..7d30dc4ff0ff 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1286,6 +1286,15 @@ enum {
FOLL_PCI_P2PDMA = 1 << 10,
/* allow interrupts from generic signals */
FOLL_INTERRUPTIBLE = 1 << 11,
+ /*
+ * Always honor (trigger) NUMA hinting faults.
+ *
+ * FOLL_WRITE implicitly honors NUMA hinting faults because a
+ * PROT_NONE-mapped page is not writable (exceptions with FOLL_FORCE
+ * apply). get_user_pages_fast_only() always implicitly honors NUMA
+ * hinting faults.
+ */
+ FOLL_HONOR_NUMA_FAULT = 1 << 12,
/* See also internal only FOLL flags in mm/internal.h */
};
diff --git a/mm/gup.c b/mm/gup.c
index 76d222ccc3ff..6e2f9e9d6537 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -597,7 +597,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
pte = ptep_get(ptep);
if (!pte_present(pte))
goto no_page;
- if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
+ if (pte_protnone(pte) && !gup_can_follow_protnone(vma, flags))
goto no_page;
page = vm_normal_page(vma, address, pte);
@@ -714,7 +714,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if (likely(!pmd_trans_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
- if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags))
+ if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
return no_page_table(vma, flags);
ptl = pmd_lock(mm, pmd);
@@ -851,6 +851,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
return NULL;
+ /*
+ * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect
+ * to fail on PROT_NONE-mapped pages.
+ */
page = follow_page_mask(vma, address, foll_flags, &ctx);
if (ctx.pgmap)
put_dev_pagemap(ctx.pgmap);
@@ -2227,6 +2231,13 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
gup_flags |= FOLL_UNLOCKABLE;
}
+ /*
+ * For now, always trigger NUMA hinting faults. Some GUP users like
+ * KVM require the hint to be as the calling context of GUP is
+ * functionally similar to a memory reference from task context.
+ */
+ gup_flags |= FOLL_HONOR_NUMA_FAULT;
+
/* FOLL_GET and FOLL_PIN are mutually exclusive. */
if (WARN_ON_ONCE((gup_flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
@@ -2551,7 +2562,14 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
struct page *page;
struct folio *folio;
- if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
+ /*
+ * Always fallback to ordinary GUP on PROT_NONE-mapped pages:
+ * pte_access_permitted() better should reject these pages
+ * either way: otherwise, GUP-fast might succeed in
+ * cases where ordinary GUP would fail due to VMA access
+ * permissions.
+ */
+ if (pte_protnone(pte))
goto pte_unmap;
if (!pte_access_permitted(pte, flags & FOLL_WRITE))
@@ -2970,8 +2988,8 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
pmd_devmap(pmd))) {
- if (pmd_protnone(pmd) &&
- !gup_can_follow_protnone(flags))
+ /* See gup_pte_range() */
+ if (pmd_protnone(pmd))
return 0;
if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
@@ -3151,7 +3169,7 @@ static int internal_get_user_pages_fast(unsigned long start,
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
FOLL_FAST_ONLY | FOLL_NOFAULT |
- FOLL_PCI_P2PDMA)))
+ FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT)))
return -EINVAL;
if (gup_flags & FOLL_PIN)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..f15d557e5708 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1467,8 +1467,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
return ERR_PTR(-EFAULT);
- /* Full NUMA hinting faults to serialise migration in fault paths */
- if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
+ if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
return NULL;
if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 1cbc11aaa01f80577b67ae02c73ee781112125fd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082613-lilac-overpower-64a2@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
1cbc11aaa01f ("NFSv4: Fix dropped lock for racing OPEN and delegation return")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1cbc11aaa01f80577b67ae02c73ee781112125fd Mon Sep 17 00:00:00 2001
From: Benjamin Coddington <bcodding(a)redhat.com>
Date: Fri, 30 Jun 2023 09:18:13 -0400
Subject: [PATCH] NFSv4: Fix dropped lock for racing OPEN and delegation return
Commmit f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during delegation
return") attempted to solve this problem by using nfs4's generic async error
handling, but introduced a regression where v4.0 lock recovery would hang.
The additional complexity introduced by overloading that error handling is
not necessary for this case. This patch expects that commit to be
reverted.
The problem as originally explained in the above commit is:
There's a small window where a LOCK sent during a delegation return can
race with another OPEN on client, but the open stateid has not yet been
updated. In this case, the client doesn't handle the OLD_STATEID error
from the server and will lose this lock, emitting:
"NFS: nfs4_handle_delegation_recall_error: unhandled error -10024".
Fix this by using the old_stateid refresh helpers if the server replies
with OLD_STATEID.
Suggested-by: Trond Myklebust <trondmy(a)hammerspace.com>
Signed-off-by: Benjamin Coddington <bcodding(a)redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index e1a886b58354..4604e9f3d1b0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7181,8 +7181,15 @@ static void nfs4_lock_done(struct rpc_task *task, void *calldata)
} else if (!nfs4_update_lock_stateid(lsp, &data->res.stateid))
goto out_restart;
break;
- case -NFS4ERR_BAD_STATEID:
case -NFS4ERR_OLD_STATEID:
+ if (data->arg.new_lock_owner != 0 &&
+ nfs4_refresh_open_old_stateid(&data->arg.open_stateid,
+ lsp->ls_state))
+ goto out_restart;
+ if (nfs4_refresh_lock_old_stateid(&data->arg.lock_stateid, lsp))
+ goto out_restart;
+ fallthrough;
+ case -NFS4ERR_BAD_STATEID:
case -NFS4ERR_STALE_STATEID:
case -NFS4ERR_EXPIRED:
if (data->arg.new_lock_owner != 0) {
Hi Sasha
I just saw that you queued up mm-disable-config_per_vma_lock-until-its-fixed.patch for 6.4.
The problems that this patch tried to prevent were fixed before it actually made it into a
release, and Linus un-did the commit in his merge (at the bottom):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/m…
Since the fixes for PER_VMA_LOCK have been in 6.4 releases for a while, this patch
should not go in.
thanks
Holger
Take the vCPU's mmu_seq snapshot as an "unsigned long" instead of an "int"
when checking to see if a page fault is stale, as the sequence count is
stored as an "unsigned long" everywhere else in KVM. This fixes a bug
where KVM will effectively hang vCPUs due to always thinking page faults
are stale, which results in KVM refusing to "fix" faults.
mmu_invalidate_seq (née mmu_notifier_seq) is a sequence counter used when
KVM is handling page faults to detect if userspace mappings relevant to
the guest were invalidated between snapshotting the counter and acquiring
mmu_lock, i.e. to ensure that the userspace mapping KVM is using to
resolve the page fault is fresh. If KVM sees that the counter has
changed, KVM simply resumes the guest without fixing the fault.
What _should_ happen is that the source of the mmu_notifier invalidations
eventually goes away, mmu_invalidate_seq becomes stable, and KVM can once
again fix guest page fault(s).
But for a long-lived VM and/or a VM that the host just doesn't particularly
like, it's possible for a VM to be on the receiving end of 2 billion (with
a B) mmu_notifier invalidations. When that happens, bit 31 will be set in
mmu_invalidate_seq. This causes the value to be turned into a 32-bit
negative value when implicitly cast to an "int" by is_page_fault_stale(),
and then sign-extended into a 64-bit unsigned when the signed "int" is
implicitly cast back to an "unsigned long" on the call to
mmu_invalidate_retry_hva().
As a result of the casting and sign-extension, given a sequence counter of
e.g. 0x8002dc25, mmu_invalidate_retry_hva() ends up doing
if (0x8002dc25 != 0xffffffff8002dc25)
and signals that the page fault is stale and needs to be retried even
though the sequence counter is stable, and KVM effectively hangs any vCPU
that takes a page fault (EPT violation or #NPF when TDP is enabled).
Note, upstream commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq
in kvm_faultin_pfn()") unknowingly fixed the bug in v6.3 when refactoring
how KVM tracks the sequence counter snapshot.
Reported-by: Brian Rak <brak(a)vultr.com>
Reported-by: Amaan Cheval <amaan.cheval(a)gmail.com>
Reported-by: Eric Wheeler <kvm(a)lists.ewheeler.net>
Closes: https://lore.kernel.org/all/f023d927-52aa-7e08-2ee5-59a2fbc65953@gameserver…
Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
---
arch/x86/kvm/mmu/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 230108a90cf3..beca03556379 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4212,7 +4212,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* root was invalidated by a memslot update or a relevant mmu_notifier fired.
*/
static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault, int mmu_seq)
+ struct kvm_page_fault *fault,
+ unsigned long mmu_seq)
{
struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
base-commit: 802aacbbffe2512dce9f8f33ad99d01cfec435de
--
2.42.0.rc2.253.gd59a3bf2b4-goog
Upstream commit edbdb43fc96b11b3bfa531be306a1993d9fe89ec.
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
[6.1 backport notes: conflicts with
09732d2b4dc5 ("KVM: x86/mmu: Move TDP MMU VM init/uninit behind tdp_mmu_enabled")
1f98f2bd8ec4 ("KVM: x86/mmu: Change tdp_mmu to a read-only parameter")
de0322f575be ("KVM: x86/mmu: Replace open coded usage of tdp_mmu_page with is_tdp_mmu_page()")
prevented a clean cherry-pick. First two resolved by keeping 6.1's check
on kvm->arch.tdp_mmu_enabled, last one resolved by taking the upstream
change, i.e. by opportunistically switching to is_tdp_mmu_page()]
Reported-by: Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.micr…
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
---
arch/x86/kvm/mmu/tdp_mmu.c | 121 +++++++++++++++++--------------------
1 file changed, 56 insertions(+), 65 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 672f0432d777..70945f00ec41 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -51,7 +51,17 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
if (!kvm->arch.tdp_mmu_enabled)
return;
- /* Also waits for any queued work items. */
+ /*
+ * Invalidate all roots, which besides the obvious, schedules all roots
+ * for zapping and thus puts the TDP MMU's reference to each root, i.e.
+ * ultimately frees all roots.
+ */
+ kvm_tdp_mmu_invalidate_all_roots(kvm);
+
+ /*
+ * Destroying a workqueue also first flushes the workqueue, i.e. no
+ * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
+ */
destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
@@ -127,16 +137,6 @@ static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root
queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
}
-static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
-{
- union kvm_mmu_page_role role = page->role;
- role.invalid = true;
-
- /* No need to use cmpxchg, only the invalid bit can change. */
- role.word = xchg(&page->role.word, role.word);
- return role.invalid;
-}
-
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
@@ -145,45 +145,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;
- WARN_ON(!root->tdp_mmu_page);
-
/*
- * The root now has refcount=0. It is valid, but readers already
- * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
- * rejects it. This remains true for the rest of the execution
- * of this function, because readers visit valid roots only
- * (except for tdp_mmu_zap_root_work(), which however
- * does not acquire any reference itself).
- *
- * Even though there are flows that need to visit all roots for
- * correctness, they all take mmu_lock for write, so they cannot yet
- * run concurrently. The same is true after kvm_tdp_root_mark_invalid,
- * since the root still has refcount=0.
- *
- * However, tdp_mmu_zap_root can yield, and writers do not expect to
- * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
- * So the root temporarily gets an extra reference, going to refcount=1
- * while staying invalid. Readers still cannot acquire any reference;
- * but writers are now allowed to run if tdp_mmu_zap_root yields and
- * they might take an extra reference if they themselves yield.
- * Therefore, when the reference is given back by the worker,
- * there is no guarantee that the refcount is still 1. If not, whoever
- * puts the last reference will free the page, but they will not have to
- * zap the root because a root cannot go from invalid to valid.
+ * The TDP MMU itself holds a reference to each root until the root is
+ * explicitly invalidated, i.e. the final reference should be never be
+ * put for a valid root.
*/
- if (!kvm_tdp_root_mark_invalid(root)) {
- refcount_set(&root->tdp_mmu_root_count, 1);
-
- /*
- * Zapping the root in a worker is not just "nice to have";
- * it is required because kvm_tdp_mmu_invalidate_all_roots()
- * skips already-invalid roots. If kvm_tdp_mmu_put_root() did
- * not add the root to the workqueue, kvm_tdp_mmu_zap_all_fast()
- * might return with some roots not zapped yet.
- */
- tdp_mmu_schedule_zap_root(kvm, root);
- return;
- }
+ KVM_BUG_ON(!is_tdp_mmu_page(root) || !root->role.invalid, kvm);
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_del_rcu(&root->link);
@@ -329,7 +296,14 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
root = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_sp(root, NULL, 0, role);
- refcount_set(&root->tdp_mmu_root_count, 1);
+ /*
+ * TDP MMU roots are kept until they are explicitly invalidated, either
+ * by a memslot update or by the destruction of the VM. Initialize the
+ * refcount to two; one reference for the vCPU, and one reference for
+ * the TDP MMU itself, which is held until the root is invalidated and
+ * is ultimately put by tdp_mmu_zap_root_work().
+ */
+ refcount_set(&root->tdp_mmu_root_count, 2);
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
@@ -1027,32 +1001,49 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
/*
* Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
* is about to be zapped, e.g. in response to a memslots update. The actual
- * zapping is performed asynchronously, so a reference is taken on all roots.
- * Using a separate workqueue makes it easy to ensure that the destruction is
- * performed before the "fast zap" completes, without keeping a separate list
- * of invalidated roots; the list is effectively the list of work items in
- * the workqueue.
+ * zapping is performed asynchronously. Using a separate workqueue makes it
+ * easy to ensure that the destruction is performed before the "fast zap"
+ * completes, without keeping a separate list of invalidated roots; the list is
+ * effectively the list of work items in the workqueue.
*
- * Get a reference even if the root is already invalid, the asynchronous worker
- * assumes it was gifted a reference to the root it processes. Because mmu_lock
- * is held for write, it should be impossible to observe a root with zero refcount,
- * i.e. the list of roots cannot be stale.
- *
- * This has essentially the same effect for the TDP MMU
- * as updating mmu_valid_gen does for the shadow MMU.
+ * Note, the asynchronous worker is gifted the TDP MMU's reference.
+ * See kvm_tdp_mmu_get_vcpu_root_hpa().
*/
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;
- lockdep_assert_held_write(&kvm->mmu_lock);
- list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
- if (!root->role.invalid &&
- !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
+ /*
+ * mmu_lock must be held for write to ensure that a root doesn't become
+ * invalid while there are active readers (invalidating a root while
+ * there are active readers may or may not be problematic in practice,
+ * but it's uncharted territory and not supported).
+ *
+ * Waive the assertion if there are no users of @kvm, i.e. the VM is
+ * being destroyed after all references have been put, or if no vCPUs
+ * have been created (which means there are no roots), i.e. the VM is
+ * being destroyed in an error path of KVM_CREATE_VM.
+ */
+ if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+ refcount_read(&kvm->users_count) && kvm->created_vcpus)
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ /*
+ * As above, mmu_lock isn't held when destroying the VM! There can't
+ * be other references to @kvm, i.e. nothing else can invalidate roots
+ * or be consuming roots, but walking the list of roots does need to be
+ * guarded against roots being deleted by the asynchronous zap worker.
+ */
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!root->role.invalid) {
root->role.invalid = true;
tdp_mmu_schedule_zap_root(kvm, root);
}
}
+
+ rcu_read_unlock();
}
/*
base-commit: 802aacbbffe2512dce9f8f33ad99d01cfec435de
--
2.42.0.rc2.253.gd59a3bf2b4-goog
Disable the TDP MMU by default in v5.15 kernels to "fix" several severe
performance bugs that have since been found and fixed in the TDP MMU, but
are unsuitable for backporting to v5.15.
The problematic bugs are fixed by upstream commit edbdb43fc96b ("KVM:
x86: Preserve TDP MMU roots until they are explicitly invalidated") and
commit 01b31714bd90 ("KVM: x86: Do not unload MMU roots when only toggling
CR0.WP with TDP enabled"). Both commits fix scenarios where KVM will
rebuild all TDP MMU page tables in paths that are frequently hit by
certain guest workloads. While not exactly common, the guest workloads
are far from rare. The fallout of rebuilding TDP MMU page tables can be
so severe in some cases that it induces soft lockups in the guest.
Commit edbdb43fc96b would require _significant_ effort and churn to
backport due it depending on a major rework that was done in v5.18.
Commit 01b31714bd90 has far fewer direct conflicts, but has several subtle
_known_ dependencies, and it's unclear whether or not there are more
unknown dependencies that have been missed.
Lastly, disabling the TDP MMU in v5.15 kernels also fixes a lurking train
wreck started by upstream commit a955cad84cda ("KVM: x86/mmu: Retry page
fault if root is invalidated by memslot update"). That commit was tagged
for stable to fix a memory leak, but didn't cherry-pick cleanly and was
never backported to v5.15. Which is extremely fortunate, as it introduced
not one but two bugs, one of which was fixed by upstream commit
18c841e1f411 ("KVM: x86: Retry page fault if MMU reload is pending and
root has no sp"), while the other was unknowingly fixed by upstream
commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in
kvm_faultin_pfn()") in v6.3 (a one-off fix will be made for v6.1 kernels,
which did receive a backport for a955cad84cda). Disabling the TDP MMU
by default reduces the probability of breaking v5.15 kernels by
backporting only a subset of the fixes.
As far as what is lost by disabling the TDP MMU, the main selling point of
the TDP MMU is its ability to service page fault VM-Exits in parallel,
i.e. the main benefactors of the TDP MMU are deployments of large VMs
(hundreds of vCPUs), and in particular delployments that live-migrate such
VMs and thus need to fault-in huge amounts of memory on many vCPUs after
restarting the VM after migration.
Smaller VMs can see performance improvements, but nowhere enough to make
up for the TDP MMU (in v5.15) absolutely cratering performance for some
workloads. And practically speaking, anyone that is deploying and
migrating VMs with hundreds of vCPUs is likely rolling their own kernel,
not using a stock v5.15 series kernel.
This reverts commit 71ba3f3189c78f756a659568fb473600fd78f207.
Link: https://lore.kernel.org/all/ZDmEGM+CgYpvDLh6@google.com
Link: https://lore.kernel.org/all/f023d927-52aa-7e08-2ee5-59a2fbc65953@gameserver…
Cc: Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
Cc: Mathias Krause <minipli(a)grsecurity.net>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
---
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c2bb60ccd88..7a64fb238044 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -10,7 +10,7 @@
#include <asm/cmpxchg.h>
#include <trace/events/kvm.h>
-static bool __read_mostly tdp_mmu_enabled = true;
+static bool __read_mostly tdp_mmu_enabled = false;
module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
/* Initializes the TDP MMU for the VM, if enabled. */
base-commit: f6f7927ac664ba23447f8dd3c3dfe2f4ee39272f
--
2.42.0.rc2.253.gd59a3bf2b4-goog
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 5310760af1d4fbea1452bfc77db5f9a680f7ae47
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082114-remix-cable-0852@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
5310760af1d4 ("ipvs: fix racy memcpy in proc_do_sync_threshold")
1b90af292e71 ("ipvs: Improve robustness to the ipvs sysctl")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5310760af1d4fbea1452bfc77db5f9a680f7ae47 Mon Sep 17 00:00:00 2001
From: Sishuai Gong <sishuai.system(a)gmail.com>
Date: Thu, 10 Aug 2023 15:12:42 -0400
Subject: [PATCH] ipvs: fix racy memcpy in proc_do_sync_threshold
When two threads run proc_do_sync_threshold() in parallel,
data races could happen between the two memcpy():
Thread-1 Thread-2
memcpy(val, valp, sizeof(val));
memcpy(valp, val, sizeof(val));
This race might mess up the (struct ctl_table *) table->data,
so we add a mutex lock to serialize them.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/B6988E90-0A1E-4B85-BF26-2DAF6D482433@gmail.c…
Signed-off-by: Sishuai Gong <sishuai.system(a)gmail.com>
Acked-by: Simon Horman <horms(a)kernel.org>
Acked-by: Julian Anastasov <ja(a)ssi.bg>
Signed-off-by: Florian Westphal <fw(a)strlen.de>
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 62606fb44d02..4bb0d90eca1c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1876,6 +1876,7 @@ static int
proc_do_sync_threshold(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
+ struct netns_ipvs *ipvs = table->extra2;
int *valp = table->data;
int val[2];
int rc;
@@ -1885,6 +1886,7 @@ proc_do_sync_threshold(struct ctl_table *table, int write,
.mode = table->mode,
};
+ mutex_lock(&ipvs->sync_mutex);
memcpy(val, valp, sizeof(val));
rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
if (write) {
@@ -1894,6 +1896,7 @@ proc_do_sync_threshold(struct ctl_table *table, int write,
else
memcpy(valp, val, sizeof(val));
}
+ mutex_unlock(&ipvs->sync_mutex);
return rc;
}
@@ -4321,6 +4324,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
ipvs->sysctl_sync_threshold[0] = DEFAULT_SYNC_THRESHOLD;
ipvs->sysctl_sync_threshold[1] = DEFAULT_SYNC_PERIOD;
tbl[idx].data = &ipvs->sysctl_sync_threshold;
+ tbl[idx].extra2 = ipvs;
tbl[idx++].maxlen = sizeof(ipvs->sysctl_sync_threshold);
ipvs->sysctl_sync_refresh_period = DEFAULT_SYNC_REFRESH_PERIOD;
tbl[idx++].data = &ipvs->sysctl_sync_refresh_period;
This is a port of commit 379eb01c21795edb4c ("riscv: Ensure the value
of FP registers in the core dump file is up to date").
The values of FP/SIMD registers in the core dump file come from the
thread.fpu. However, kernel saves the FP/SIMD registers only before
scheduling out the process. If no process switch happens during the
exception handling, kernel will not have a chance to save the latest
values of FP/SIMD registers. So it may cause their values in the core
dump file incorrect. To solve this problem, force fpr_get()/simd_get()
to save the FP/SIMD registers into the thread.fpu if the target task
equals the current task.
Cc: stable(a)vger.kernel.org
Signed-off-by: Huacai Chen <chenhuacai(a)loongson.cn>
---
V2: Rename get_fpu_regs() to save_fpu_regs().
arch/loongarch/include/asm/fpu.h | 22 ++++++++++++++++++----
arch/loongarch/kernel/ptrace.c | 4 ++++
2 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/loongarch/include/asm/fpu.h b/arch/loongarch/include/asm/fpu.h
index b541f6248837..08a45e9fd15c 100644
--- a/arch/loongarch/include/asm/fpu.h
+++ b/arch/loongarch/include/asm/fpu.h
@@ -173,16 +173,30 @@ static inline void restore_fp(struct task_struct *tsk)
_restore_fp(&tsk->thread.fpu);
}
-static inline union fpureg *get_fpu_regs(struct task_struct *tsk)
+static inline void save_fpu_regs(struct task_struct *tsk)
{
+ unsigned int euen;
+
if (tsk == current) {
preempt_disable();
- if (is_fpu_owner())
+
+ euen = csr_read32(LOONGARCH_CSR_EUEN);
+
+#ifdef CONFIG_CPU_HAS_LASX
+ if (euen & CSR_EUEN_LASXEN)
+ _save_lasx(¤t->thread.fpu);
+ else
+#endif
+#ifdef CONFIG_CPU_HAS_LSX
+ if (euen & CSR_EUEN_LSXEN)
+ _save_lsx(¤t->thread.fpu);
+ else
+#endif
+ if (euen & CSR_EUEN_FPEN)
_save_fp(¤t->thread.fpu);
+
preempt_enable();
}
-
- return tsk->thread.fpu.fpr;
}
static inline int is_simd_owner(void)
diff --git a/arch/loongarch/kernel/ptrace.c b/arch/loongarch/kernel/ptrace.c
index a0767c3a0f0a..9a75dc43eb29 100644
--- a/arch/loongarch/kernel/ptrace.c
+++ b/arch/loongarch/kernel/ptrace.c
@@ -147,6 +147,8 @@ static int fpr_get(struct task_struct *target,
{
int r;
+ save_fpu_regs(target);
+
if (sizeof(target->thread.fpu.fpr[0]) == sizeof(elf_fpreg_t))
r = gfpr_get(target, &to);
else
@@ -278,6 +280,8 @@ static int simd_get(struct task_struct *target,
{
const unsigned int wr_size = NUM_FPU_REGS * regset->size;
+ save_fpu_regs(target);
+
if (!tsk_used_math(target)) {
/* The task hasn't used FP or LSX, fill with 0xff */
copy_pad_fprs(target, regset, &to, 0);
--
2.39.3
I'm announcing the release of the 6.1.48 kernel.
All users of the 6.1 kernel series must upgrade.
The updated 6.1.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.1.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/admin-guide/hw-vuln/srso.rst | 4
Makefile | 2
arch/x86/include/asm/entry-common.h | 1
arch/x86/include/asm/nospec-branch.h | 28 +++---
arch/x86/kernel/cpu/amd.c | 1
arch/x86/kernel/cpu/bugs.c | 28 ++++--
arch/x86/kernel/static_call.c | 13 ++
arch/x86/kernel/traps.c | 2
arch/x86/kernel/vmlinux.lds.S | 20 ++--
arch/x86/kvm/svm/svm.c | 2
arch/x86/lib/retpoline.S | 135 ++++++++++++++++++++---------
tools/objtool/arch/x86/decode.c | 2
tools/objtool/check.c | 21 ++--
13 files changed, 178 insertions(+), 81 deletions(-)
Borislav Petkov (AMD) (4):
x86/srso: Explain the untraining sequences a bit more
x86/CPU/AMD: Fix the DIV(0) initial fix attempt
x86/srso: Disable the mitigation on unaffected configurations
x86/srso: Correct the mitigation status when SMT is disabled
Greg Kroah-Hartman (1):
Linux 6.1.48
Peter Zijlstra (9):
x86/cpu: Fix __x86_return_thunk symbol type
x86/cpu: Fix up srso_safe_ret() and __x86_return_thunk()
x86/alternative: Make custom return thunk unconditional
x86/cpu: Clean up SRSO return thunk mess
x86/cpu: Rename original retbleed methods
x86/cpu: Rename srso_(.*)_alias to srso_alias_\1
x86/cpu: Cleanup the untrain mess
x86/static_call: Fix __static_call_fixup()
objtool/x86: Fixup frame-pointer vs rethunk
Petr Pavlu (1):
x86/retpoline,kprobes: Fix position of thunk sections with CONFIG_LTO_CLANG
Sean Christopherson (1):
x86/retpoline: Don't clobber RFLAGS during srso_safe_ret()
This is the start of the stable review cycle for the 6.1.48 release.
There are 15 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 26 Aug 2023 14:14:28 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.1.48-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.1.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.1.48-rc1
Borislav Petkov (AMD) <bp(a)alien8.de>
x86/srso: Correct the mitigation status when SMT is disabled
Peter Zijlstra <peterz(a)infradead.org>
objtool/x86: Fixup frame-pointer vs rethunk
Petr Pavlu <petr.pavlu(a)suse.com>
x86/retpoline,kprobes: Fix position of thunk sections with CONFIG_LTO_CLANG
Borislav Petkov (AMD) <bp(a)alien8.de>
x86/srso: Disable the mitigation on unaffected configurations
Borislav Petkov (AMD) <bp(a)alien8.de>
x86/CPU/AMD: Fix the DIV(0) initial fix attempt
Sean Christopherson <seanjc(a)google.com>
x86/retpoline: Don't clobber RFLAGS during srso_safe_ret()
Peter Zijlstra <peterz(a)infradead.org>
x86/static_call: Fix __static_call_fixup()
Borislav Petkov (AMD) <bp(a)alien8.de>
x86/srso: Explain the untraining sequences a bit more
Peter Zijlstra <peterz(a)infradead.org>
x86/cpu: Cleanup the untrain mess
Peter Zijlstra <peterz(a)infradead.org>
x86/cpu: Rename srso_(.*)_alias to srso_alias_\1
Peter Zijlstra <peterz(a)infradead.org>
x86/cpu: Rename original retbleed methods
Peter Zijlstra <peterz(a)infradead.org>
x86/cpu: Clean up SRSO return thunk mess
Peter Zijlstra <peterz(a)infradead.org>
x86/alternative: Make custom return thunk unconditional
Peter Zijlstra <peterz(a)infradead.org>
x86/cpu: Fix up srso_safe_ret() and __x86_return_thunk()
Peter Zijlstra <peterz(a)infradead.org>
x86/cpu: Fix __x86_return_thunk symbol type
-------------
Diffstat:
Documentation/admin-guide/hw-vuln/srso.rst | 4 +-
Makefile | 4 +-
arch/x86/include/asm/entry-common.h | 1 +
arch/x86/include/asm/nospec-branch.h | 28 +++---
arch/x86/kernel/cpu/amd.c | 1 +
arch/x86/kernel/cpu/bugs.c | 28 +++++-
arch/x86/kernel/static_call.c | 13 +++
arch/x86/kernel/traps.c | 2 -
arch/x86/kernel/vmlinux.lds.S | 20 ++--
arch/x86/kvm/svm/svm.c | 2 +
arch/x86/lib/retpoline.S | 141 ++++++++++++++++++++---------
tools/objtool/arch/x86/decode.c | 2 +-
tools/objtool/check.c | 21 +++--
13 files changed, 182 insertions(+), 85 deletions(-)
From: Helge Deller <deller(a)gmx.de>
Older PA-RISC machines have LEDs which show the disk- and LAN-activity.
The computation is done in software and takes quite some time, e.g. on a
J6500 this may take up to 60% time of one CPU if the machine is loaded
via network traffic.
Since most people don't care about the LEDs, start with LEDs disabled and
just show a CPU heartbeat LED. The disk and LAN LEDs can be turned on
manually via /proc/pdc/led.
Signed-off-by: Helge Deller <deller(a)gmx.de>
Cc: <stable(a)vger.kernel.org>
---
drivers/parisc/led.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index 8bdc5e043831..765f19608f60 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -56,8 +56,8 @@
static int led_type __read_mostly = -1;
static unsigned char lastleds; /* LED state from most recent update */
static unsigned int led_heartbeat __read_mostly = 1;
-static unsigned int led_diskio __read_mostly = 1;
-static unsigned int led_lanrxtx __read_mostly = 1;
+static unsigned int led_diskio __read_mostly;
+static unsigned int led_lanrxtx __read_mostly;
static char lcd_text[32] __read_mostly;
static char lcd_text_default[32] __read_mostly;
static int lcd_no_led_support __read_mostly = 0; /* KittyHawk doesn't support LED on its LCD */
@@ -589,6 +589,9 @@ int __init register_led_driver(int model, unsigned long cmd_reg, unsigned long d
return 1;
}
+ pr_info("LED: Enable disk and LAN activity LEDs "
+ "via /proc/pdc/led\n");
+
/* mark the LCD/LED driver now as initialized and
* register to the reboot notifier chain */
initialized++;
--
2.41.0
As of now, bpf counters (bperf) don't support event groups. But the
default perf stat includes topdown metrics if supported (on recent Intel
machines) which require groups. That makes perf stat exiting.
$ sudo perf stat --bpf-counter true
bpf managed perf events do not yet support groups.
Actually the test explicitly uses cycles event only, but it missed to
pass the option when it checks the availability of the command.
Fixes: 2c0cb9f56020d ("perf test: Add a shell test for 'perf stat --bpf-counters' new option")
Cc: stable(a)vger.kernel.org
Cc: Song Liu <song(a)kernel.org>
Signed-off-by: Namhyung Kim <namhyung(a)kernel.org>
---
tools/perf/tests/shell/stat_bpf_counters.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/tests/shell/stat_bpf_counters.sh b/tools/perf/tests/shell/stat_bpf_counters.sh
index 513cd1e58e0e..a87bb2814b4c 100755
--- a/tools/perf/tests/shell/stat_bpf_counters.sh
+++ b/tools/perf/tests/shell/stat_bpf_counters.sh
@@ -22,10 +22,10 @@ compare_number()
}
# skip if --bpf-counters is not supported
-if ! perf stat --bpf-counters true > /dev/null 2>&1; then
+if ! perf stat -e cycles --bpf-counters true > /dev/null 2>&1; then
if [ "$1" = "-v" ]; then
echo "Skipping: --bpf-counters not supported"
- perf --no-pager stat --bpf-counters true || true
+ perf --no-pager stat -e cycles --bpf-counters true || true
fi
exit 2
fi
--
2.42.0.rc1.204.g551eb34607-goog
These two are backports for 5.15.y. Conflict resolution in done in
both patches.
I have tested LTP-nfs fchown02 and chown02 on 5.15.y with below patches
applied. The tests passed.
I would like to have a review as I am not familiar with this code.
Thanks to Vegard for helping me with this.
Regards,
Harshit
Christian Brauner (2):
nfs: use vfs setgid helper
nfsd: use vfs setgid helper
fs/attr.c | 1 +
fs/internal.h | 2 --
fs/nfs/inode.c | 4 +---
fs/nfsd/vfs.c | 4 +++-
include/linux/fs.h | 2 ++
5 files changed, 7 insertions(+), 6 deletions(-)
--
2.34.1
The PERF_RECORD_ATTR is used for a pipe mode to describe an event with
attribute and IDs. The ID table comes after the attr and it calculate
size of the table using the total record size and the attr size.
n_ids = (total_record_size - end_of_the_attr_field) / sizeof(u64)
This is fine for most use cases, but sometimes it saves the pipe output
in a file and then process it later. And it becomes a problem if there
is a change in attr size between the record and report.
$ perf record -o- > perf-pipe.data # old version
$ perf report -i- < perf-pipe.data # new version
For example, if the attr size is 128 and it has 4 IDs, then it would
save them in 168 byte like below:
8 byte: perf event header { .type = PERF_RECORD_ATTR, .size = 168 },
128 byte: perf event attr { .size = 128, ... },
32 byte: event IDs [] = { 1234, 1235, 1236, 1237 },
But when report later, it thinks the attr size is 136 then it only read
the last 3 entries as ID.
8 byte: perf event header { .type = PERF_RECORD_ATTR, .size = 168 },
136 byte: perf event attr { .size = 136, ... },
24 byte: event IDs [] = { 1235, 1236, 1237 }, // 1234 is missing
So it should use the recorded version of the attr. The attr has the
size field already then it should honor the size when reading data.
Fixes: 2c46dbb517a10 ("perf: Convert perf header attrs into attr events")
Cc: stable(a)vger.kernel.org
Cc: Tom Zanussi <zanussi(a)kernel.org>
Signed-off-by: Namhyung Kim <namhyung(a)kernel.org>
---
Keep this version before the libperf change so that it can go through
the stable versions.
tools/perf/util/header.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 52fbf526fe74..f89321cbfdee 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -4381,7 +4381,8 @@ int perf_event__process_attr(struct perf_tool *tool __maybe_unused,
union perf_event *event,
struct evlist **pevlist)
{
- u32 i, ids, n_ids;
+ u32 i, n_ids;
+ u64 *ids;
struct evsel *evsel;
struct evlist *evlist = *pevlist;
@@ -4397,9 +4398,8 @@ int perf_event__process_attr(struct perf_tool *tool __maybe_unused,
evlist__add(evlist, evsel);
- ids = event->header.size;
- ids -= (void *)&event->attr.id - (void *)event;
- n_ids = ids / sizeof(u64);
+ n_ids = event->header.size - sizeof(event->header) - event->attr.attr.size;
+ n_ids = n_ids / sizeof(u64);
/*
* We don't have the cpu and thread maps on the header, so
* for allocating the perf_sample_id table we fake 1 cpu and
@@ -4408,8 +4408,9 @@ int perf_event__process_attr(struct perf_tool *tool __maybe_unused,
if (perf_evsel__alloc_id(&evsel->core, 1, n_ids))
return -ENOMEM;
+ ids = (void *)&event->attr.attr + event->attr.attr.size;
for (i = 0; i < n_ids; i++) {
- perf_evlist__id_add(&evlist->core, &evsel->core, 0, i, event->attr.id[i]);
+ perf_evlist__id_add(&evlist->core, &evsel->core, 0, i, ids[i]);
}
return 0;
--
2.42.0.rc1.204.g551eb34607-goog
This patch fixes an issues when concurrent fcntl() syscalls are
executing on two different gfs2 filesystems. Each gfs2 filesystem
creates an DLM lockspace, it seems that VFS only allows fcntl() syscalls
at one time on a per filesystem basis. However if there are two
filesystems and we executing fcntl() syscalls our lookup mechanism on the
global plock op list does not work anymore.
It can be reproduced with two mounted gfs2 filesystems using DLM
locking. Then call stress-ng --fcntl 32 on each mount point. The kernel
log will show several:
WARNING: CPU: 4 PID: 943 at fs/dlm/plock.c:574 dev_write+0x15c/0x590
because we have a sanity check if it's was really the meant original
plock op when dev_write() does a lookup. This patch adds just a
additional check for fsid to find the right plock op which is an
indicator that the recv_list should be on a per lockspace basis and not
globally defined. After this patch the sanity check never warned again
that the wrong plock op was being looked up.
Cc: stable(a)vger.kernel.org
Reported-by: Barry Marson <bmarson(a)redhat.com>
Fixes: 57e2c2f2d94c ("fs: dlm: fix mismatch of plock results from userspace")
Signed-off-by: Alexander Aring <aahringo(a)redhat.com>
---
fs/dlm/plock.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
index 00e1d802a81c..e6b4c1a21446 100644
--- a/fs/dlm/plock.c
+++ b/fs/dlm/plock.c
@@ -556,7 +556,8 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
op = plock_lookup_waiter(&info);
} else {
list_for_each_entry(iter, &recv_list, list) {
- if (!iter->info.wait) {
+ if (!iter->info.wait &&
+ iter->info.fsid == info.fsid) {
op = iter;
break;
}
@@ -568,8 +569,7 @@ static ssize_t dev_write(struct file *file, const char __user *u, size_t count,
if (info.wait)
WARN_ON(op->info.optype != DLM_PLOCK_OP_LOCK);
else
- WARN_ON(op->info.fsid != info.fsid ||
- op->info.number != info.number ||
+ WARN_ON(op->info.number != info.number ||
op->info.owner != info.owner ||
op->info.optype != info.optype);
--
2.31.1
From: Benjamin Tissoires <benjamin.tissoires(a)redhat.com>
Extract the internal code inside a helper function, fix the
initialization of the parameters used in the helper function
(`hidpp->answer_available` was not reset and `*response` wasn't either),
and use a `do {...} while();` loop.
Fixes: 586e8fede795 ("HID: logitech-hidpp: Retry commands when device is busy")
Cc: stable(a)vger.kernel.org
Reviewed-by: Bastien Nocera <hadess(a)hadess.net>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires(a)redhat.com>
---
as requested by https://lore.kernel.org/all/CAHk-=wiMbF38KCNhPFiargenpSBoecSXTLQACKS2UMyo_V…
This is a rewrite of that particular piece of code.
---
Changes in v2:
- added __must_hold() for KASAN
- Reworked the comment describing the functions and their return values
- Link to v1: https://lore.kernel.org/r/20230621-logitech-fixes-v1-1-32e70933c0b0@redhat.…
---
drivers/hid/hid-logitech-hidpp.c | 115 +++++++++++++++++++++++++--------------
1 file changed, 75 insertions(+), 40 deletions(-)
diff --git a/drivers/hid/hid-logitech-hidpp.c b/drivers/hid/hid-logitech-hidpp.c
index 129b01be488d..09ba2086c95c 100644
--- a/drivers/hid/hid-logitech-hidpp.c
+++ b/drivers/hid/hid-logitech-hidpp.c
@@ -275,21 +275,22 @@ static int __hidpp_send_report(struct hid_device *hdev,
}
/*
- * hidpp_send_message_sync() returns 0 in case of success, and something else
- * in case of a failure.
- * - If ' something else' is positive, that means that an error has been raised
- * by the protocol itself.
- * - If ' something else' is negative, that means that we had a classic error
- * (-ENOMEM, -EPIPE, etc...)
+ * Effectively send the message to the device, waiting for its answer.
+ *
+ * Must be called with hidpp->send_mutex locked
+ *
+ * Same return protocol than hidpp_send_message_sync():
+ * - success on 0
+ * - negative error means transport error
+ * - positive value means protocol error
*/
-static int hidpp_send_message_sync(struct hidpp_device *hidpp,
+static int __do_hidpp_send_message_sync(struct hidpp_device *hidpp,
struct hidpp_report *message,
struct hidpp_report *response)
{
- int ret = -1;
- int max_retries = 3;
+ int ret;
- mutex_lock(&hidpp->send_mutex);
+ __must_hold(&hidpp->send_mutex);
hidpp->send_receive_buf = response;
hidpp->answer_available = false;
@@ -300,47 +301,74 @@ static int hidpp_send_message_sync(struct hidpp_device *hidpp,
*/
*response = *message;
- for (; max_retries != 0 && ret; max_retries--) {
- ret = __hidpp_send_report(hidpp->hid_dev, message);
+ ret = __hidpp_send_report(hidpp->hid_dev, message);
+ if (ret) {
+ dbg_hid("__hidpp_send_report returned err: %d\n", ret);
+ memset(response, 0, sizeof(struct hidpp_report));
+ return ret;
+ }
- if (ret) {
- dbg_hid("__hidpp_send_report returned err: %d\n", ret);
- memset(response, 0, sizeof(struct hidpp_report));
- break;
- }
+ if (!wait_event_timeout(hidpp->wait, hidpp->answer_available,
+ 5*HZ)) {
+ dbg_hid("%s:timeout waiting for response\n", __func__);
+ memset(response, 0, sizeof(struct hidpp_report));
+ return -ETIMEDOUT;
+ }
- if (!wait_event_timeout(hidpp->wait, hidpp->answer_available,
- 5*HZ)) {
- dbg_hid("%s:timeout waiting for response\n", __func__);
- memset(response, 0, sizeof(struct hidpp_report));
- ret = -ETIMEDOUT;
- break;
- }
+ if (response->report_id == REPORT_ID_HIDPP_SHORT &&
+ response->rap.sub_id == HIDPP_ERROR) {
+ ret = response->rap.params[1];
+ dbg_hid("%s:got hidpp error %02X\n", __func__, ret);
+ return ret;
+ }
- if (response->report_id == REPORT_ID_HIDPP_SHORT &&
- response->rap.sub_id == HIDPP_ERROR) {
- ret = response->rap.params[1];
- dbg_hid("%s:got hidpp error %02X\n", __func__, ret);
+ if ((response->report_id == REPORT_ID_HIDPP_LONG ||
+ response->report_id == REPORT_ID_HIDPP_VERY_LONG) &&
+ response->fap.feature_index == HIDPP20_ERROR) {
+ ret = response->fap.params[1];
+ dbg_hid("%s:got hidpp 2.0 error %02X\n", __func__, ret);
+ return ret;
+ }
+
+ return 0;
+}
+
+/*
+ * hidpp_send_message_sync() returns 0 in case of success, and something else
+ * in case of a failure.
+ *
+ * See __do_hidpp_send_message_sync() for a detailed explanation of the returned
+ * value.
+ */
+static int hidpp_send_message_sync(struct hidpp_device *hidpp,
+ struct hidpp_report *message,
+ struct hidpp_report *response)
+{
+ int ret;
+ int max_retries = 3;
+
+ mutex_lock(&hidpp->send_mutex);
+
+ do {
+ ret = __do_hidpp_send_message_sync(hidpp, message, response);
+ if (ret != HIDPP20_ERROR_BUSY)
break;
- }
- if ((response->report_id == REPORT_ID_HIDPP_LONG ||
- response->report_id == REPORT_ID_HIDPP_VERY_LONG) &&
- response->fap.feature_index == HIDPP20_ERROR) {
- ret = response->fap.params[1];
- if (ret != HIDPP20_ERROR_BUSY) {
- dbg_hid("%s:got hidpp 2.0 error %02X\n", __func__, ret);
- break;
- }
- dbg_hid("%s:got busy hidpp 2.0 error %02X, retrying\n", __func__, ret);
- }
- }
+ dbg_hid("%s:got busy hidpp 2.0 error %02X, retrying\n", __func__, ret);
+ } while (--max_retries);
mutex_unlock(&hidpp->send_mutex);
return ret;
}
+/*
+ * hidpp_send_fap_command_sync() returns 0 in case of success, and something else
+ * in case of a failure.
+ *
+ * See __do_hidpp_send_message_sync() for a detailed explanation of the returned
+ * value.
+ */
static int hidpp_send_fap_command_sync(struct hidpp_device *hidpp,
u8 feat_index, u8 funcindex_clientid, u8 *params, int param_count,
struct hidpp_report *response)
@@ -373,6 +401,13 @@ static int hidpp_send_fap_command_sync(struct hidpp_device *hidpp,
return ret;
}
+/*
+ * hidpp_send_rap_command_sync() returns 0 in case of success, and something else
+ * in case of a failure.
+ *
+ * See __do_hidpp_send_message_sync() for a detailed explanation of the returned
+ * value.
+ */
static int hidpp_send_rap_command_sync(struct hidpp_device *hidpp_dev,
u8 report_id, u8 sub_id, u8 reg_address, u8 *params, int param_count,
struct hidpp_report *response)
---
base-commit: 87854366176403438d01f368b09de3ec2234e0f5
change-id: 20230621-logitech-fixes-a4c0e66ea2ad
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
The goal is to support a bpf_redirect() from an ethernet device (ingress)
to a ppp device (egress).
The l2 header is added automatically by the ppp driver, thus the ethernet
header should be removed.
CC: stable(a)vger.kernel.org
Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
Tested-by: Siwar Zitouni <siwar.zitouni(a)6wind.com>
---
v2 -> v3:
- add a comment in the code
- rework the commit log
v1 -> v2:
- I forgot the 'Tested-by' tag in the v1 :/
include/linux/if_arp.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/if_arp.h b/include/linux/if_arp.h
index 1ed52441972f..10a1e81434cb 100644
--- a/include/linux/if_arp.h
+++ b/include/linux/if_arp.h
@@ -53,6 +53,10 @@ static inline bool dev_is_mac_header_xmit(const struct net_device *dev)
case ARPHRD_NONE:
case ARPHRD_RAWIP:
case ARPHRD_PIMREG:
+ /* PPP adds its l2 header automatically in ppp_start_xmit().
+ * This makes it look like an l3 device to __bpf_redirect() and tcf_mirred_init().
+ */
+ case ARPHRD_PPP:
return false;
default:
return true;
--
2.39.2
With commit 44b1fbc0f5f3 ("m68k/q40: Replace q40ide driver
with pata_falcon and falconide"), the Q40 IDE driver was
replaced by pata_falcon.c.
Both IO and memory resources were defined for the Q40 IDE
platform device, but definition of the IDE register addresses
was modeled after the Falcon case, both in use of the memory
resources and in including register shift and byte vs. word
offset in the address.
This was correct for the Falcon case, which does not apply
any address translation to the register addresses. In the
Q40 case, all of device base address, byte access offset
and register shift is included in the platform specific
ISA access translation (in asm/mm_io.h).
As a consequence, such address translation gets applied
twice, and register addresses are mangled.
Use the device base address from the platform IO resource
for Q40 (the IO address translation will then add the correct
ISA window base address and byte access offset), with register
shift 1. Use MMIO base address and register shift 2 as before
for Falcon.
Encode PIO_OFFSET into IO port addresses for all registers
for Q40 except the data transfer register. Encode the MMIO
offset there (pata_falcon_data_xfer() directly uses raw IO
with no address translation).
Reported-by: William R Sowerbutts <will(a)sowerbutts.com>
Closes: https://lore.kernel.org/r/CAMuHMdUU62jjunJh9cqSqHT87B0H0A4udOOPs=WN7WZKpcag…
Link: https://lore.kernel.org/r/CAMuHMdUU62jjunJh9cqSqHT87B0H0A4udOOPs=WN7WZKpcag…
Fixes: 44b1fbc0f5f3 ("m68k/q40: Replace q40ide driver with pata_falcon and falconide")
Cc: stable(a)vger.kernel.org
Cc: Finn Thain <fthain(a)linux-m68k.org>
Cc: Geert Uytterhoeven <geert(a)linux-m68k.org>
Tested-by: William R Sowerbutts <will(a)sowerbutts.com>
Signed-off-by: Michael Schmitz <schmitzmic(a)gmail.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov(a)omp.ru>
Reviewed-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
Changes from v4:
Geert Uytterhoeven:
- use %px for ap->ioaddr.data_addr
Changes from v3:
Sergey Shtylyov:
- change use of reg_scale to reg_shift
Geert Uytterhoeven:
- factor out ata_port_desc() from platform specific code
Changes from v2:
Finn Thain:
- add back stable Cc:
Changes from v1:
Damien Le Moal:
- change patch title
- drop stable backport tag
Changes from RFC v3:
- split off byte swap option into separate patch
Geert Uytterhoeven:
- review comments
Changes from RFC v2:
- add driver parameter 'data_swap' as bit mask for drives to swap
Changes from RFC v1:
Finn Thain:
- take care to supply IO address suitable for ioread8/iowrite8
- use MMIO address for data transfer
---
drivers/ata/pata_falcon.c | 50 +++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 21 deletions(-)
diff --git a/drivers/ata/pata_falcon.c b/drivers/ata/pata_falcon.c
index 996516e64f13..616064b02de6 100644
--- a/drivers/ata/pata_falcon.c
+++ b/drivers/ata/pata_falcon.c
@@ -123,8 +123,8 @@ static int __init pata_falcon_init_one(struct platform_device *pdev)
struct resource *base_res, *ctl_res, *irq_res;
struct ata_host *host;
struct ata_port *ap;
- void __iomem *base;
- int irq = 0;
+ void __iomem *base, *ctl_base;
+ int irq = 0, io_offset = 1, reg_shift = 2; /* Falcon defaults */
dev_info(&pdev->dev, "Atari Falcon and Q40/Q60 PATA controller\n");
@@ -165,26 +165,34 @@ static int __init pata_falcon_init_one(struct platform_device *pdev)
ap->pio_mask = ATA_PIO4;
ap->flags |= ATA_FLAG_SLAVE_POSS | ATA_FLAG_NO_IORDY;
- base = (void __iomem *)base_mem_res->start;
/* N.B. this assumes data_addr will be used for word-sized I/O only */
- ap->ioaddr.data_addr = base + 0 + 0 * 4;
- ap->ioaddr.error_addr = base + 1 + 1 * 4;
- ap->ioaddr.feature_addr = base + 1 + 1 * 4;
- ap->ioaddr.nsect_addr = base + 1 + 2 * 4;
- ap->ioaddr.lbal_addr = base + 1 + 3 * 4;
- ap->ioaddr.lbam_addr = base + 1 + 4 * 4;
- ap->ioaddr.lbah_addr = base + 1 + 5 * 4;
- ap->ioaddr.device_addr = base + 1 + 6 * 4;
- ap->ioaddr.status_addr = base + 1 + 7 * 4;
- ap->ioaddr.command_addr = base + 1 + 7 * 4;
-
- base = (void __iomem *)ctl_mem_res->start;
- ap->ioaddr.altstatus_addr = base + 1;
- ap->ioaddr.ctl_addr = base + 1;
-
- ata_port_desc(ap, "cmd 0x%lx ctl 0x%lx",
- (unsigned long)base_mem_res->start,
- (unsigned long)ctl_mem_res->start);
+ ap->ioaddr.data_addr = (void __iomem *)base_mem_res->start;
+
+ if (base_res) { /* only Q40 has IO resources */
+ io_offset = 0x10000;
+ reg_shift = 0;
+ base = (void __iomem *)base_res->start;
+ ctl_base = (void __iomem *)ctl_res->start;
+ } else {
+ base = (void __iomem *)base_mem_res->start;
+ ctl_base = (void __iomem *)ctl_mem_res->start;
+ }
+
+ ap->ioaddr.error_addr = base + io_offset + (1 << reg_shift);
+ ap->ioaddr.feature_addr = base + io_offset + (1 << reg_shift);
+ ap->ioaddr.nsect_addr = base + io_offset + (2 << reg_shift);
+ ap->ioaddr.lbal_addr = base + io_offset + (3 << reg_shift);
+ ap->ioaddr.lbam_addr = base + io_offset + (4 << reg_shift);
+ ap->ioaddr.lbah_addr = base + io_offset + (5 << reg_shift);
+ ap->ioaddr.device_addr = base + io_offset + (6 << reg_shift);
+ ap->ioaddr.status_addr = base + io_offset + (7 << reg_shift);
+ ap->ioaddr.command_addr = base + io_offset + (7 << reg_shift);
+
+ ap->ioaddr.altstatus_addr = ctl_base + io_offset;
+ ap->ioaddr.ctl_addr = ctl_base + io_offset;
+
+ ata_port_desc(ap, "cmd %px ctl %px data %px",
+ base, ctl_base, ap->ioaddr.data_addr);
irq_res = platform_get_resource(pdev, IORESOURCE_IRQ, 0);
if (irq_res && irq_res->start > 0) {
--
2.17.1
Gadget ACM while unloading module try to dequeue not queued usb
request which causes the kernel to crash.
Patch adds extra condition to check whether usb request is processed
by CDNSP driver.
cc: <stable(a)vger.kernel.org>
Fixes: 3d82904559f4 ("usb: cdnsp: cdns3 Add main part of Cadence USBSSP DRD Driver")
Signed-off-by: Pawel Laszczak <pawell(a)cadence.com>
---
drivers/usb/cdns3/cdnsp-gadget.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/usb/cdns3/cdnsp-gadget.c b/drivers/usb/cdns3/cdnsp-gadget.c
index fff9ec9c391f..3a30c2af0c00 100644
--- a/drivers/usb/cdns3/cdnsp-gadget.c
+++ b/drivers/usb/cdns3/cdnsp-gadget.c
@@ -1125,6 +1125,9 @@ static int cdnsp_gadget_ep_dequeue(struct usb_ep *ep,
unsigned long flags;
int ret;
+ if (request->status != -EINPROGRESS)
+ return 0;
+
if (!pep->endpoint.desc) {
dev_err(pdev->dev,
"%s: can't dequeue to disabled endpoint\n",
--
2.37.2
With commit 44b1fbc0f5f3 ("m68k/q40: Replace q40ide driver
with pata_falcon and falconide"), the Q40 IDE driver was
replaced by pata_falcon.c.
Both IO and memory resources were defined for the Q40 IDE
platform device, but definition of the IDE register addresses
was modeled after the Falcon case, both in use of the memory
resources and in including register shift and byte vs. word
offset in the address.
This was correct for the Falcon case, which does not apply
any address translation to the register addresses. In the
Q40 case, all of device base address, byte access offset
and register shift is included in the platform specific
ISA access translation (in asm/mm_io.h).
As a consequence, such address translation gets applied
twice, and register addresses are mangled.
Use the device base address from the platform IO resource
for Q40 (the IO address translation will then add the correct
ISA window base address and byte access offset), with register
shift 1. Use MMIO base address and register shift 2 as before
for Falcon.
Encode PIO_OFFSET into IO port addresses for all registers
for Q40 except the data transfer register. Encode the MMIO
offset there (pata_falcon_data_xfer() directly uses raw IO
with no address translation).
Reported-by: William R Sowerbutts <will(a)sowerbutts.com>
Closes: https://lore.kernel.org/r/CAMuHMdUU62jjunJh9cqSqHT87B0H0A4udOOPs=WN7WZKpcag…
Link: https://lore.kernel.org/r/CAMuHMdUU62jjunJh9cqSqHT87B0H0A4udOOPs=WN7WZKpcag…
Fixes: 44b1fbc0f5f3 ("m68k/q40: Replace q40ide driver with pata_falcon and falconide")
Cc: stable(a)vger.kernel.org
Cc: Finn Thain <fthain(a)linux-m68k.org>
Cc: Geert Uytterhoeven <geert(a)linux-m68k.org>
Tested-by: William R Sowerbutts <will(a)sowerbutts.com>
Signed-off-by: Michael Schmitz <schmitzmic(a)gmail.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov(a)omp.ru>
---
Changes from v3:
Sergey Shtylyov:
- change use of reg_scale to reg_shift
Geert Uytterhoeven:
- factor out ata_port_desc() from platform specific code
Changes from v2:
Finn Thain:
- add back stable Cc:
Changes from v1:
Damien Le Moal:
- change patch title
- drop stable backport tag
Changes from RFC v3:
- split off byte swap option into separate patch
Geert Uytterhoeven:
- review comments
Changes from RFC v2:
- add driver parameter 'data_swap' as bit mask for drives to swap
Changes from RFC v1:
Finn Thain:
- take care to supply IO address suitable for ioread8/iowrite8
- use MMIO address for data transfer
---
drivers/ata/pata_falcon.c | 50 +++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 21 deletions(-)
diff --git a/drivers/ata/pata_falcon.c b/drivers/ata/pata_falcon.c
index 996516e64f13..3841ea200bcb 100644
--- a/drivers/ata/pata_falcon.c
+++ b/drivers/ata/pata_falcon.c
@@ -123,8 +123,8 @@ static int __init pata_falcon_init_one(struct platform_device *pdev)
struct resource *base_res, *ctl_res, *irq_res;
struct ata_host *host;
struct ata_port *ap;
- void __iomem *base;
- int irq = 0;
+ void __iomem *base, *ctl_base;
+ int irq = 0, io_offset = 1, reg_shift = 2; /* Falcon defaults */
dev_info(&pdev->dev, "Atari Falcon and Q40/Q60 PATA controller\n");
@@ -165,26 +165,34 @@ static int __init pata_falcon_init_one(struct platform_device *pdev)
ap->pio_mask = ATA_PIO4;
ap->flags |= ATA_FLAG_SLAVE_POSS | ATA_FLAG_NO_IORDY;
- base = (void __iomem *)base_mem_res->start;
/* N.B. this assumes data_addr will be used for word-sized I/O only */
- ap->ioaddr.data_addr = base + 0 + 0 * 4;
- ap->ioaddr.error_addr = base + 1 + 1 * 4;
- ap->ioaddr.feature_addr = base + 1 + 1 * 4;
- ap->ioaddr.nsect_addr = base + 1 + 2 * 4;
- ap->ioaddr.lbal_addr = base + 1 + 3 * 4;
- ap->ioaddr.lbam_addr = base + 1 + 4 * 4;
- ap->ioaddr.lbah_addr = base + 1 + 5 * 4;
- ap->ioaddr.device_addr = base + 1 + 6 * 4;
- ap->ioaddr.status_addr = base + 1 + 7 * 4;
- ap->ioaddr.command_addr = base + 1 + 7 * 4;
-
- base = (void __iomem *)ctl_mem_res->start;
- ap->ioaddr.altstatus_addr = base + 1;
- ap->ioaddr.ctl_addr = base + 1;
-
- ata_port_desc(ap, "cmd 0x%lx ctl 0x%lx",
- (unsigned long)base_mem_res->start,
- (unsigned long)ctl_mem_res->start);
+ ap->ioaddr.data_addr = (void __iomem *)base_mem_res->start;
+
+ if (base_res) { /* only Q40 has IO resources */
+ io_offset = 0x10000;
+ reg_shift = 0;
+ base = (void __iomem *)base_res->start;
+ ctl_base = (void __iomem *)ctl_res->start;
+ } else {
+ base = (void __iomem *)base_mem_res->start;
+ ctl_base = (void __iomem *)ctl_mem_res->start;
+ }
+
+ ap->ioaddr.error_addr = base + io_offset + (1 << reg_shift);
+ ap->ioaddr.feature_addr = base + io_offset + (1 << reg_shift);
+ ap->ioaddr.nsect_addr = base + io_offset + (2 << reg_shift);
+ ap->ioaddr.lbal_addr = base + io_offset + (3 << reg_shift);
+ ap->ioaddr.lbam_addr = base + io_offset + (4 << reg_shift);
+ ap->ioaddr.lbah_addr = base + io_offset + (5 << reg_shift);
+ ap->ioaddr.device_addr = base + io_offset + (6 << reg_shift);
+ ap->ioaddr.status_addr = base + io_offset + (7 << reg_shift);
+ ap->ioaddr.command_addr = base + io_offset + (7 << reg_shift);
+
+ ap->ioaddr.altstatus_addr = ctl_base + io_offset;
+ ap->ioaddr.ctl_addr = ctl_base + io_offset;
+
+ ata_port_desc(ap, "cmd %px ctl %px data %pa",
+ base, ctl_base, &ap->ioaddr.data_addr);
irq_res = platform_get_resource(pdev, IORESOURCE_IRQ, 0);
if (irq_res && irq_res->start > 0) {
--
2.17.1
The quilt patch titled
Subject: shmem: fix smaps BUG sleeping while atomic
has been removed from the -mm tree. Its filename was
shmem-fix-smaps-bug-sleeping-while-atomic.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: shmem: fix smaps BUG sleeping while atomic
Date: Tue, 22 Aug 2023 22:14:47 -0700 (PDT)
smaps_pte_hole_lookup() is calling shmem_partial_swap_usage() with page
table lock held: but shmem_partial_swap_usage() does cond_resched_rcu() if
need_resched(): "BUG: sleeping function called from invalid context".
Since shmem_partial_swap_usage() is designed to count across a range, but
smaps_pte_hole_lookup() only calls it for a single page slot, just break
out of the loop on the last or only page, before checking need_resched().
Link: https://lkml.kernel.org/r/6fe3b3ec-abdf-332f-5c23-6a3b3a3b11a9@google.com
Fixes: 230100321518 ("mm/smaps: simplify shmem handling of pte holes")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [5.16+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/mm/shmem.c~shmem-fix-smaps-bug-sleeping-while-atomic
+++ a/mm/shmem.c
@@ -806,14 +806,16 @@ unsigned long shmem_partial_swap_usage(s
XA_STATE(xas, &mapping->i_pages, start);
struct page *page;
unsigned long swapped = 0;
+ unsigned long max = end - 1;
rcu_read_lock();
- xas_for_each(&xas, page, end - 1) {
+ xas_for_each(&xas, page, max) {
if (xas_retry(&xas, page))
continue;
if (xa_is_value(page))
swapped++;
-
+ if (xas.xa_index == max)
+ break;
if (need_resched()) {
xas_pause(&xas);
cond_resched_rcu();
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-khugepaged-fix-collapse_pte_mapped_thp-versus-uffd.patch
The quilt patch titled
Subject: maple_tree: disable mas_wr_append() when other readers are possible
has been removed from the -mm tree. Its filename was
maple_tree-disable-mas_wr_append-when-other-readers-are-possible.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Subject: maple_tree: disable mas_wr_append() when other readers are possible
Date: Fri, 18 Aug 2023 20:43:55 -0400
The current implementation of append may cause duplicate data and/or
incorrect ranges to be returned to a reader during an update. Although
this has not been reported or seen, disable the append write operation
while the tree is in rcu mode out of an abundance of caution.
During the analysis of the mas_next_slot() the following was
artificially created by separating the writer and reader code:
Writer: reader:
mas_wr_append
set end pivot
updates end metata
Detects write to last slot
last slot write is to start of slot
store current contents in slot
overwrite old end pivot
mas_next_slot():
read end metadata
read old end pivot
return with incorrect range
store new value
Alternatively:
Writer: reader:
mas_wr_append
set end pivot
updates end metata
Detects write to last slot
last lost write to end of slot
store value
mas_next_slot():
read end metadata
read old end pivot
read new end pivot
return with incorrect range
set old end pivot
There may be other accesses that are not safe since we are now updating
both metadata and pointers, so disabling append if there could be rcu
readers is the safest action.
Link: https://lkml.kernel.org/r/20230819004356.1454718-2-Liam.Howlett@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/maple_tree.c | 7 +++++++
1 file changed, 7 insertions(+)
--- a/lib/maple_tree.c~maple_tree-disable-mas_wr_append-when-other-readers-are-possible
+++ a/lib/maple_tree.c
@@ -4265,6 +4265,10 @@ static inline unsigned char mas_wr_new_e
* mas_wr_append: Attempt to append
* @wr_mas: the maple write state
*
+ * This is currently unsafe in rcu mode since the end of the node may be cached
+ * by readers while the node contents may be updated which could result in
+ * inaccurate information.
+ *
* Return: True if appended, false otherwise
*/
static inline bool mas_wr_append(struct ma_wr_state *wr_mas)
@@ -4274,6 +4278,9 @@ static inline bool mas_wr_append(struct
struct ma_state *mas = wr_mas->mas;
unsigned char node_pivots = mt_pivots[wr_mas->type];
+ if (mt_in_rcu(mas->tree))
+ return false;
+
if (mas->offset != wr_mas->node_end)
return false;
_
Patches currently in -mm which might be from Liam.Howlett(a)oracle.com are
maple_tree-clean-up-mas_wr_append.patch
The quilt patch titled
Subject: madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
has been removed from the -mm tree. Its filename was
madvise-madvise_free_pte_range-dont-use-mapcount-against-large-folio-for-sharing-check.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yin Fengwei <fengwei.yin(a)intel.com>
Subject: madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
Date: Tue, 8 Aug 2023 10:09:17 +0800
Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a
folio") replaced the page_mapcount() with folio_mapcount() to check
whether the folio is shared by other mapping.
It's not correct for large folios. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.
Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.
Link: https://lkml.kernel.org/r/20230808020917.2230692-4-fengwei.yin@intel.com
Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio")
Signed-off-by: Yin Fengwei <fengwei.yin(a)intel.com>
Reviewed-by: Yu Zhao <yuzhao(a)google.com>
Reviewed-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/madvise.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/madvise.c~madvise-madvise_free_pte_range-dont-use-mapcount-against-large-folio-for-sharing-check
+++ a/mm/madvise.c
@@ -680,7 +680,7 @@ static int madvise_free_pte_range(pmd_t
if (folio_test_large(folio)) {
int err;
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (!folio_trylock(folio))
break;
_
Patches currently in -mm which might be from fengwei.yin(a)intel.com are
filemap-add-filemap_map_folio_range.patch
rmap-add-folio_add_file_rmap_range.patch
mm-convert-do_set_pte-to-set_pte_range.patch
filemap-batch-pte-mappings.patch
The quilt patch titled
Subject: madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check
has been removed from the -mm tree. Its filename was
madvise-madvise_cold_or_pageout_pte_range-dont-use-mapcount-against-large-folio-for-sharing-check.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yin Fengwei <fengwei.yin(a)intel.com>
Subject: madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check
Date: Tue, 8 Aug 2023 10:09:15 +0800
Patch series "don't use mapcount() to check large folio sharing", v2.
In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared. But it's
not correct as folio_mapcount() returns total mapcount of large folio.
Use folio_estimated_sharers() here as the estimated number is enough.
This patchset will fix the cases:
User space application call madvise() with MADV_FREE, MADV_COLD and
MADV_PAGEOUT for specific address range. There are THP mapped to the
range. Without the patchset, the THP is skipped. With the patch, the
THP will be split and handled accordingly.
David reported the cow self test skip some cases because of MADV_PAGEOUT
skip THP:
https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel…
and I confirmed this patchset make it work again.
This patch (of 3):
Commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range()
to use folios") replaced the page_mapcount() with folio_mapcount() to
check whether the folio is shared by other mapping.
It's not correct for large folio. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.
Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.
User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.
NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.
Link: https://lkml.kernel.org/r/20230808020917.2230692-1-fengwei.yin@intel.com
Link: https://lkml.kernel.org/r/20230808020917.2230692-2-fengwei.yin@intel.com
Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
Signed-off-by: Yin Fengwei <fengwei.yin(a)intel.com>
Reviewed-by: Yu Zhao <yuzhao(a)google.com>
Reviewed-by: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/madvise.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/madvise.c~madvise-madvise_cold_or_pageout_pte_range-dont-use-mapcount-against-large-folio-for-sharing-check
+++ a/mm/madvise.c
@@ -384,7 +384,7 @@ static int madvise_cold_or_pageout_pte_r
folio = pfn_folio(pmd_pfn(orig_pmd));
/* Do not interfere with other mappings of this folio */
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
goto huge_unlock;
if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -458,7 +458,7 @@ regular_folio:
if (folio_test_large(folio)) {
int err;
- if (folio_mapcount(folio) != 1)
+ if (folio_estimated_sharers(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
_
Patches currently in -mm which might be from fengwei.yin(a)intel.com are
filemap-add-filemap_map_folio_range.patch
rmap-add-folio_add_file_rmap_range.patch
mm-convert-do_set_pte-to-set_pte_range.patch
filemap-batch-pte-mappings.patch
When building the kernel with binutils 2.37 and GCC-11.1.0/GCC-11.2.0,
the following error occurs:
Assembler messages:
Error: cannot find default versions of the ISA extension `zicsr'
Error: cannot find default versions of the ISA extension `zifencei'
The above error originated from this commit of binutils[0], which has been
resolved and backported by GCC-12.1.0[1] and GCC-11.3.0[2].
So fix this by change the GCC version in
CONFIG_TOOLCHAIN_NEEDS_OLD_ISA_SPEC to GCC-11.3.0.
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=f0bae2552db1dd4f1… [0]
Link: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=ca2bbb88f999f4d3cc40e89bc1aba… [1]
Link: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d29f5d6ab513c52fd872f532c492e… [2]
Fixes: ca09f772ccca ("riscv: Handle zicsr/zifencei issue between gcc and binutils")
Reported-by: Conor Dooley <conor.dooley(a)microchip.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Mingzheng Xing <xingmingzheng(a)iscas.ac.cn>
---
Then below are my test results after this fix:
gcc binutils
10.5.0 2.35 ok
10.5.0 2.36 ok
10.5.0 2.37 ok
10.5.0 2.38 ok
11.1.0 2.35 ok
11.1.0 2.36 ok
11.1.0 2.37 ok
11.1.0 2.38 ok
11.2.0 2.35 ok
11.2.0 2.36 ok
11.2.0 2.37 ok
11.2.0 2.38 ok
11.3.0 2.35 ok
11.3.0 2.36 ok
11.3.0 2.37 ok
11.3.0 2.38 ok
11.4.0 2.35 ok
11.4.0 2.36 ok
11.4.0 2.37 ok
11.4.0 2.38 ok
12.1.0 2.35 ok
12.1.0 2.36 ok
12.1.0 2.37 ok
12.1.0 2.38 ok
12.2.0 2.35 ok
12.2.0 2.36 ok
12.2.0 2.37 ok
12.2.0 2.38 ok
arch/riscv/Kconfig | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 10e7a7ad175a..bea7b73e895d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -580,15 +580,15 @@ config TOOLCHAIN_NEEDS_EXPLICIT_ZICSR_ZIFENCEI
and Zifencei are supported in binutils from version 2.36 onwards.
To make life easier, and avoid forcing toolchains that default to a
newer ISA spec to version 2.2, relax the check to binutils >= 2.36.
- For clang < 17 or GCC < 11.1.0, for which this is not possible, this is
- dealt with in CONFIG_TOOLCHAIN_NEEDS_OLD_ISA_SPEC.
+ For clang < 17 or GCC < 11.3.0, for which this is not possible or need
+ special treatment, this is dealt with in TOOLCHAIN_NEEDS_OLD_ISA_SPEC.
config TOOLCHAIN_NEEDS_OLD_ISA_SPEC
def_bool y
depends on TOOLCHAIN_NEEDS_EXPLICIT_ZICSR_ZIFENCEI
# https://github.com/llvm/llvm-project/commit/22e199e6afb1263c943c0c0d4498694…
- # https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b03be74bad08c382da47e048007a7…
- depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC && GCC_VERSION < 110100)
+ # https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d29f5d6ab513c52fd872f532c492e…
+ depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC && GCC_VERSION < 110300)
help
Certain versions of clang and GCC do not support zicsr and zifencei via
-march. This option causes an older ISA spec compatible with these older
--
2.34.1
On Thu, 24 Aug 2023 11:05:13 PDT (-0700), Conor Dooley wrote:
> On Fri, Aug 25, 2023 at 01:46:59AM +0800, Mingzheng Xing wrote:
>
>> Just a question, I see that the previous patch has been merged
>> into 6.5-rc7, and now a new fix patch should be sent out based
>> on that, right?
>
> yes
Ideally ASAP, it's very late in the cycle. I have something for
tomorrow morning, but this will need time to test...
The patch titled
Subject: memcontrol: ensure memcg acquired by id is properly set up
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: memcontrol: ensure memcg acquired by id is properly set up
Date: Wed, 23 Aug 2023 15:54:30 -0700
In the eviction recency check, we attempt to retrieve the memcg to which
the folio belonged when it was evicted, by the memcg id stored in the
shadow entry. However, there is a chance that the retrieved memcg is not
the original memcg that has been killed, but a new one which happens to
have the same id.
This is a somewhat unfortunate, but acceptable and rare inaccuracy in the
heuristics. However, if we retrieve this new memcg between its allocation
and when it is properly attached to the memcg hierarchy, we could run into
the following NULL pointer exception during the memcg hierarchy traversal
done in mem_cgroup_get_nr_swap_pages():
[ 155757.793456] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[ 155757.807568] #PF: supervisor read access in kernel mode
[ 155757.818024] #PF: error_code(0x0000) - not-present page
[ 155757.828482] PGD 401f77067 P4D 401f77067 PUD 401f76067 PMD 0
[ 155757.839985] Oops: 0000 [#1] SMP
[ 155757.887870] RIP: 0010:mem_cgroup_get_nr_swap_pages+0x3d/0xb0
[ 155757.899377] Code: 29 19 4a 02 48 39 f9 74 63 48 8b 97 c0 00 00 00 48 8b b7 58 02 00 00 48 2b b7 c0 01 00 00 48 39 f0 48 0f 4d c6 48 39 d1 74 42 <48> 8b b2 c0 00 00 00 48 8b ba 58 02 00 00 48 2b ba c0 01 00 00 48
[ 155757.937125] RSP: 0018:ffffc9002ecdfbc8 EFLAGS: 00010286
[ 155757.947755] RAX: 00000000003a3b1c RBX: 000007ffffffffff RCX: ffff888280183000
[ 155757.962202] RDX: 0000000000000000 RSI: 0007ffffffffffff RDI: ffff888bbc2d1000
[ 155757.976648] RBP: 0000000000000001 R08: 000000000000000b R09: ffff888ad9cedba0
[ 155757.991094] R10: ffffea0039c07900 R11: 0000000000000010 R12: ffff888b23a7b000
[ 155758.005540] R13: 0000000000000000 R14: ffff888bbc2d1000 R15: 000007ffffc71354
[ 155758.019991] FS: 00007f6234c68640(0000) GS:ffff88903f9c0000(0000) knlGS:0000000000000000
[ 155758.036356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 155758.048023] CR2: 00000000000000c0 CR3: 0000000a83eb8004 CR4: 00000000007706e0
[ 155758.062473] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 155758.076924] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 155758.091376] PKRU: 55555554
[ 155758.096957] Call Trace:
[ 155758.102016] <TASK>
[ 155758.106502] ? __die+0x78/0xc0
[ 155758.112793] ? page_fault_oops+0x286/0x380
[ 155758.121175] ? exc_page_fault+0x5d/0x110
[ 155758.129209] ? asm_exc_page_fault+0x22/0x30
[ 155758.137763] ? mem_cgroup_get_nr_swap_pages+0x3d/0xb0
[ 155758.148060] workingset_test_recent+0xda/0x1b0
[ 155758.157133] workingset_refault+0xca/0x1e0
[ 155758.165508] filemap_add_folio+0x4d/0x70
[ 155758.173538] page_cache_ra_unbounded+0xed/0x190
[ 155758.182919] page_cache_sync_ra+0xd6/0x1e0
[ 155758.191738] filemap_read+0x68d/0xdf0
[ 155758.199495] ? mlx5e_napi_poll+0x123/0x940
[ 155758.207981] ? __napi_schedule+0x55/0x90
[ 155758.216095] __x64_sys_pread64+0x1d6/0x2c0
[ 155758.224601] do_syscall_64+0x3d/0x80
[ 155758.232058] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 155758.242473] RIP: 0033:0x7f62c29153b5
[ 155758.249938] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 b4 e6 f7 ff 41 89 c0 4c 8b 55 e0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 45 f8 e8 e7 e6 f7 ff 48 8b
[ 155758.288005] RSP: 002b:00007f6234c5ffd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[ 155758.303474] RAX: ffffffffffffffda RBX: 00007f628c4e70c0 RCX: 00007f62c29153b5
[ 155758.318075] RDX: 000000000003c041 RSI: 00007f61d2986000 RDI: 0000000000000076
[ 155758.332678] RBP: 00007f6234c5fff0 R08: 0000000000000000 R09: 0000000064d5230c
[ 155758.347452] R10: 000000000027d450 R11: 0000000000000293 R12: 000000000003c041
[ 155758.362044] R13: 00007f61d2986000 R14: 00007f629e11b060 R15: 000000000027d450
[ 155758.376661] </TASK>
This patch fixes the issue by moving the memcg's id publication from the
alloc stage to online stage, ensuring that any memcg acquired via id must
be connected to the memcg tree.
Link: https://lkml.kernel.org/r/20230823225430.166925-1-nphamcs@gmail.com
Fixes: f78dfc7b77d5 ("workingset: fix confusion around eviction vs refault container")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Co-developed-by: Nhat Pham <nphamcs(a)gmail.com>
Signed-off-by: Nhat Pham <nphamcs(a)gmail.com>
Cc: Yosry Ahmed <yosryahmed(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
--- a/mm/memcontrol.c~memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up
+++ a/mm/memcontrol.c
@@ -5339,7 +5339,6 @@ static struct mem_cgroup *mem_cgroup_all
INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
memcg->deferred_split_queue.split_queue_len = 0;
#endif
- idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
lru_gen_init_memcg(memcg);
return memcg;
fail:
@@ -5411,14 +5410,27 @@ static int mem_cgroup_css_online(struct
if (alloc_shrinker_info(memcg))
goto offline_kmem;
- /* Online state pins memcg ID, memcg ID pins CSS */
- refcount_set(&memcg->id.ref, 1);
- css_get(css);
-
if (unlikely(mem_cgroup_is_root(memcg)))
queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
FLUSH_TIME);
lru_gen_online_memcg(memcg);
+
+ /* Online state pins memcg ID, memcg ID pins CSS */
+ refcount_set(&memcg->id.ref, 1);
+ css_get(css);
+
+ /*
+ * Ensure mem_cgroup_from_id() works once we're fully online.
+ *
+ * We could do this earlier and require callers to filter with
+ * css_tryget_online(). But right now there are no users that
+ * need earlier access, and the workingset code relies on the
+ * cgroup tree linkage (mem_cgroup_get_nr_swap_pages()). So
+ * publish it here at the end of onlining. This matches the
+ * regular ID destruction during offlining.
+ */
+ idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
+
return 0;
offline_kmem:
memcg_offline_kmem(memcg);
_
Patches currently in -mm which might be from hannes(a)cmpxchg.org are
memcontrol-ensure-memcg-acquired-by-id-is-properly-set-up.patch
mm-page_alloc-remove-stale-cma-guard-code.patch
On 8/25/23 01:33, Conor Dooley wrote:
> On Fri, Aug 25, 2023 at 01:30:36AM +0800, Mingzheng Xing wrote:
>> On 8/24/23 19:32, Mingzheng Xing wrote:
>>> On 8/23/23 21:31, Conor Dooley wrote:
>>>> On Wed, Aug 23, 2023 at 12:51:13PM +0100, Conor Dooley wrote:
>>>>> On Thu, Aug 17, 2023 at 03:20:24PM +0000,
>>>>> patchwork-bot+linux-riscv(a)kernel.org wrote:
>>>>>> Hello:
>>>>>>
>>>>>> This patch was applied to riscv/linux.git (fixes)
>>>>>> by Palmer Dabbelt <palmer(a)rivosinc.com>:
>>>>>>
>>>>>> On Thu, 10 Aug 2023 00:56:48 +0800 you wrote:
>>>>>>> Binutils-2.38 and GCC-12.1.0 bumped[0][1] the default
>>>>>>> ISA spec to the newer
>>>>>>> 20191213 version which moves some instructions from the
>>>>>>> I extension to the
>>>>>>> Zicsr and Zifencei extensions. So if one of the binutils
>>>>>>> and GCC exceeds
>>>>>>> that version, we should explicitly specifying Zicsr and
>>>>>>> Zifencei via -march
>>>>>>> to cope with the new changes. but this only occurs when
>>>>>>> binutils >= 2.36
>>>>>>> and GCC >= 11.1.0. It's a different story when binutils < 2.36.
>>>>>>>
>>>>>>> [...]
>>>>>> Here is the summary with links:
>>>>>> - [v5] riscv: Handle zicsr/zifencei issue between gcc and binutils
>>>>>> https://git.kernel.org/riscv/c/ca09f772ccca
>>>>> *sigh* so this breaks the build for gcc-11 & binutils 2.37 w/
>>>>> Assembler messages:
>>>>> Error: cannot find default versions of the ISA extension `zicsr'
>>>>> Error: cannot find default versions of the ISA extension `zifencei'
>>>>>
>>>>> I'll have a poke later.
>>>> So uh, are we sure that this should not be:
>>>> - depends on (CC_IS_CLANG && CLANG_VERSION < 170000) ||
>>>> (CC_IS_GCC && GCC_VERSION < 110100)
>>>> + depends on (CC_IS_CLANG && CLANG_VERSION < 170000) ||
>>>> (CC_IS_GCC && GCC_VERSION <= 110100)
>>>>
>>>> My gcc-11.1 + binutils 2.37 toolchain built from riscv-gnu-toolchain
>>>> doesn't have the default versions & the above diff fixes the build.
>>> I reproduced the error, the combination of gcc-11.1 and
>>> binutils 2.37 does cause errors. What a surprise, since binutils
>>> 2.36 and 2.38 are fine.
>>>
>>> I used git bisect to locate this commit[1] for binutils.
>>> I'll test this diff in more detail later. Thanks!
>>>
>>> [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=f0bae2552db1dd4f1…
>>>
>> Hi, Conor.
>> The above error does originate from link[1] mentioned above, which was
>> resolved in gcc-12.1.0[2], and gcc-11.3.0 made the backport[3].
>> So gcc-11.2.0 combined with binutils 2.37 produces the same error.
>> I think we should do the following diff to fix it:
>> - depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC &&
>> GCC_VERSION < 110100)
>> + depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC &&
>> GCC_VERSION < 110300)
> That sounds good to me. Can you make that a real patch please?
Okey.
Just a question, I see that the previous patch has been merged
into 6.5-rc7, and now a new fix patch should be sent out based
on that, right?
Best Regards,
Mingzheng.
>
> Thanks for working on it :)
On 8/23/23 21:31, Conor Dooley wrote:
> On Wed, Aug 23, 2023 at 12:51:13PM +0100, Conor Dooley wrote:
>> On Thu, Aug 17, 2023 at 03:20:24PM +0000, patchwork-bot+linux-riscv(a)kernel.org wrote:
>>> Hello:
>>>
>>> This patch was applied to riscv/linux.git (fixes)
>>> by Palmer Dabbelt <palmer(a)rivosinc.com>:
>>>
>>> On Thu, 10 Aug 2023 00:56:48 +0800 you wrote:
>>>> Binutils-2.38 and GCC-12.1.0 bumped[0][1] the default ISA spec to the newer
>>>> 20191213 version which moves some instructions from the I extension to the
>>>> Zicsr and Zifencei extensions. So if one of the binutils and GCC exceeds
>>>> that version, we should explicitly specifying Zicsr and Zifencei via -march
>>>> to cope with the new changes. but this only occurs when binutils >= 2.36
>>>> and GCC >= 11.1.0. It's a different story when binutils < 2.36.
>>>>
>>>> [...]
>>> Here is the summary with links:
>>> - [v5] riscv: Handle zicsr/zifencei issue between gcc and binutils
>>> https://git.kernel.org/riscv/c/ca09f772ccca
>> *sigh* so this breaks the build for gcc-11 & binutils 2.37 w/
>> Assembler messages:
>> Error: cannot find default versions of the ISA extension `zicsr'
>> Error: cannot find default versions of the ISA extension `zifencei'
>>
>> I'll have a poke later.
> So uh, are we sure that this should not be:
> - depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC && GCC_VERSION < 110100)
> + depends on (CC_IS_CLANG && CLANG_VERSION < 170000) || (CC_IS_GCC && GCC_VERSION <= 110100)
>
> My gcc-11.1 + binutils 2.37 toolchain built from riscv-gnu-toolchain
> doesn't have the default versions & the above diff fixes the build.
I reproduced the error, the combination of gcc-11.1 and
binutils 2.37 does cause errors. What a surprise, since binutils
2.36 and 2.38 are fine.
I used git bisect to locate this commit[1] for binutils.
I'll test this diff in more detail later. Thanks!
[1]
https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=f0bae2552db1dd4f1…
Best Regards,
Mingzheng.
>
> Thanks,
> Conor.
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 1f69383b203e28cf8a4ca9570e572da1699f76cd
Gitweb: https://git.kernel.org/tip/1f69383b203e28cf8a4ca9570e572da1699f76cd
Author: Rick Edgecombe <rick.p.edgecombe(a)intel.com>
AuthorDate: Fri, 18 Aug 2023 10:03:05 -07:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Thu, 24 Aug 2023 11:01:45 +02:00
x86/fpu: Invalidate FPU state correctly on exec()
The thread flag TIF_NEED_FPU_LOAD indicates that the FPU saved state is
valid and should be reloaded when returning to userspace. However, the
kernel will skip doing this if the FPU registers are already valid as
determined by fpregs_state_valid(). The logic embedded there considers
the state valid if two cases are both true:
1: fpu_fpregs_owner_ctx points to the current tasks FPU state
2: the last CPU the registers were live in was the current CPU.
This is usually correct logic. A CPU’s fpu_fpregs_owner_ctx is set to
the current FPU during the fpregs_restore_userregs() operation, so it
indicates that the registers have been restored on this CPU. But this
alone doesn’t preclude that the task hasn’t been rescheduled to a
different CPU, where the registers were modified, and then back to the
current CPU. To verify that this was not the case the logic relies on the
second condition. So the assumption is that if the registers have been
restored, AND they haven’t had the chance to be modified (by being
loaded on another CPU), then they MUST be valid on the current CPU.
Besides the lazy FPU optimizations, the other cases where the FPU
registers might not be valid are when the kernel modifies the FPU register
state or the FPU saved buffer. In this case the operation modifying the
FPU state needs to let the kernel know the correspondence has been
broken. The comment in “arch/x86/kernel/fpu/context.h” has:
/*
...
* If the FPU register state is valid, the kernel can skip restoring the
* FPU state from memory.
*
* Any code that clobbers the FPU registers or updates the in-memory
* FPU state for a task MUST let the rest of the kernel know that the
* FPU registers are no longer valid for this task.
*
* Either one of these invalidation functions is enough. Invalidate
* a resource you control: CPU if using the CPU for something else
* (with preemption disabled), FPU for the current task, or a task that
* is prevented from running by the current task.
*/
However, this is not completely true. When the kernel modifies the
registers or saved FPU state, it can only rely on
__fpu_invalidate_fpregs_state(), which wipes the FPU’s last_cpu
tracking. The exec path instead relies on fpregs_deactivate(), which sets
the CPU’s FPU context to NULL. This was observed to fail to restore the
reset FPU state to the registers when returning to userspace in the
following scenario:
1. A task is executing in userspace on CPU0
- CPU0’s FPU context points to tasks
- fpu->last_cpu=CPU0
2. The task exec()’s
3. While in the kernel the task is preempted
- CPU0 gets a thread executing in the kernel (such that no other
FPU context is activated)
- Scheduler sets task’s fpu->last_cpu=CPU0 when scheduling out
4. Task is migrated to CPU1
5. Continuing the exec(), the task gets to
fpu_flush_thread()->fpu_reset_fpregs()
- Sets CPU1’s fpu context to NULL
- Copies the init state to the task’s FPU buffer
- Sets TIF_NEED_FPU_LOAD on the task
6. The task reschedules back to CPU0 before completing the exec() and
returning to userspace
- During the reschedule, scheduler finds TIF_NEED_FPU_LOAD is set
- Skips saving the registers and updating task’s fpu→last_cpu,
because TIF_NEED_FPU_LOAD is the canonical source.
7. Now CPU0’s FPU context is still pointing to the task’s, and
fpu->last_cpu is still CPU0. So fpregs_state_valid() returns true even
though the reset FPU state has not been restored.
So the root cause is that exec() is doing the wrong kind of invalidate. It
should reset fpu->last_cpu via __fpu_invalidate_fpregs_state(). Further,
fpu__drop() doesn't really seem appropriate as the task (and FPU) are not
going away, they are just getting reset as part of an exec. So switch to
__fpu_invalidate_fpregs_state().
Also, delete the misleading comment that says that either kind of
invalidate will be enough, because it’s not always the case.
Fixes: 33344368cb08 ("x86/fpu: Clean up the fpu__clear() variants")
Reported-by: Lei Wang <lei4.wang(a)intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe(a)intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Lijun Pan <lijun.pan(a)intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta(a)intel.com>
Acked-by: Lijun Pan <lijun.pan(a)intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20230818170305.502891-1-rick.p.edgecombe@intel.com
---
arch/x86/kernel/fpu/context.h | 3 +--
arch/x86/kernel/fpu/core.c | 2 +-
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
index af5cbdd..f6d856b 100644
--- a/arch/x86/kernel/fpu/context.h
+++ b/arch/x86/kernel/fpu/context.h
@@ -19,8 +19,7 @@
* FPU state for a task MUST let the rest of the kernel know that the
* FPU registers are no longer valid for this task.
*
- * Either one of these invalidation functions is enough. Invalidate
- * a resource you control: CPU if using the CPU for something else
+ * Invalidate a resource you control: CPU if using the CPU for something else
* (with preemption disabled), FPU for the current task, or a task that
* is prevented from running by the current task.
*/
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1015af1..98e507c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -679,7 +679,7 @@ static void fpu_reset_fpregs(void)
struct fpu *fpu = ¤t->thread.fpu;
fpregs_lock();
- fpu__drop(fpu);
+ __fpu_invalidate_fpregs_state(fpu);
/*
* This does not change the actual hardware registers. It just
* resets the memory image and sets TIF_NEED_FPU_LOAD so a
From: Baoquan He <bhe(a)redhat.com>
[ Upstream commit b1e213a9e31c20206f111ec664afcf31cbfe0dbb ]
On s390 systems (aka mainframes), it has classic channel devices for
networking and permanent storage that are currently even more common
than PCI devices. Hence it could have a fully functional s390 kernel
with CONFIG_PCI=n, then the relevant iomem mapping functions
[including ioremap(), devm_ioremap(), etc.] are not available.
Here let FSL_EDMA and INTEL_IDMA64 depend on HAS_IOMEM so that it
won't be built to cause below compiling error if PCI is unset.
--------
ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/fsl-edma.ko] undefined!
ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/idma64.ko] undefined!
--------
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306211329.ticOJCSv-lkp@intel.com/
Signed-off-by: Baoquan He <bhe(a)redhat.com>
Cc: Vinod Koul <vkoul(a)kernel.org>
Cc: dmaengine(a)vger.kernel.org
Link: https://lore.kernel.org/r/20230707135852.24292-2-bhe@redhat.com
Signed-off-by: Vinod Koul <vkoul(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/dma/Kconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index f5f422f9b8507..b6221b4432fd3 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -211,6 +211,7 @@ config FSL_DMA
config FSL_EDMA
tristate "Freescale eDMA engine support"
depends on OF
+ depends on HAS_IOMEM
select DMA_ENGINE
select DMA_VIRTUAL_CHANNELS
help
@@ -280,6 +281,7 @@ config IMX_SDMA
config INTEL_IDMA64
tristate "Intel integrated DMA 64-bit support"
+ depends on HAS_IOMEM
select DMA_ENGINE
select DMA_VIRTUAL_CHANNELS
help
--
2.40.1
Hi there,
Reach your target audience and maximize your sales/revenue with our clean and reliable contact database. We have compiled a list of 45158 contacts who are attending Hy Fcell International Expo And Conference 2023.
If interested, I can share the pricing and guarantees.
Looking for your response.
Kind Regards,
Sarah Perez - B2B Marketing & Tradeshow Specialist
If you do not wish to receive these messages, reply back with remove and we will make sure you don't receive any more emails from our end.
iommu_device_register always returns 0 in 4.11-5.12, so
we need to remove redundant comparison with 0
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: b16c0170b53c ("iommu/mediatek: Make use of iommu_device_register interface")
Signed-off-by: Alexandra Diupina <adiupina(a)astralinux.ru>
---
drivers/iommu/mtk_iommu.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 051815c9d2bb..208c47218b75 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -748,9 +748,7 @@ static int mtk_iommu_probe(struct platform_device *pdev)
iommu_device_set_ops(&data->iommu, &mtk_iommu_ops);
iommu_device_set_fwnode(&data->iommu, &pdev->dev.of_node->fwnode);
- ret = iommu_device_register(&data->iommu);
- if (ret)
- return ret;
+ iommu_device_register(&data->iommu);
spin_lock_init(&data->tlb_lock);
list_add_tail(&data->list, &m4ulist);
--
2.30.2
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 2c66ca3949dc701da7f4c9407f2140ae425683a5
Gitweb: https://git.kernel.org/tip/2c66ca3949dc701da7f4c9407f2140ae425683a5
Author: Feng Tang <feng.tang(a)intel.com>
AuthorDate: Wed, 23 Aug 2023 14:57:47 +08:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Thu, 24 Aug 2023 11:01:45 +02:00
x86/fpu: Set X86_FEATURE_OSXSAVE feature after enabling OSXSAVE in CR4
0-Day found a 34.6% regression in stress-ng's 'af-alg' test case, and
bisected it to commit b81fac906a8f ("x86/fpu: Move FPU initialization into
arch_cpu_finalize_init()"), which optimizes the FPU init order, and moves
the CR4_OSXSAVE enabling into a later place:
arch_cpu_finalize_init
identify_boot_cpu
identify_cpu
generic_identify
get_cpu_cap --> setup cpu capability
...
fpu__init_cpu
fpu__init_cpu_xstate
cr4_set_bits(X86_CR4_OSXSAVE);
As the FPU is not yet initialized the CPU capability setup fails to set
X86_FEATURE_OSXSAVE. Many security module like 'camellia_aesni_avx_x86_64'
depend on this feature and therefore fail to load, causing the regression.
Cure this by setting X86_FEATURE_OSXSAVE feature right after OSXSAVE
enabling.
[ tglx: Moved it into the actual BSP FPU initialization code and added a comment ]
Fixes: b81fac906a8f ("x86/fpu: Move FPU initialization into arch_cpu_finalize_init()")
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Signed-off-by: Feng Tang <feng.tang(a)intel.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/lkml/202307192135.203ac24e-oliver.sang@intel.com
Link: https://lore.kernel.org/lkml/20230823065747.92257-1-feng.tang@intel.com
---
arch/x86/kernel/fpu/xstate.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 0bab497..1afbc48 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -882,6 +882,13 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
goto out_disable;
}
+ /*
+ * CPU capabilities initialization runs before FPU init. So
+ * X86_FEATURE_OSXSAVE is not set. Now that XSAVE is completely
+ * functional, set the feature bit so depending code works.
+ */
+ setup_force_cpu_cap(X86_FEATURE_OSXSAVE);
+
print_xstate_offset_size();
pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
fpu_kernel_cfg.max_features,
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x c164c7bc9775be7bcc68754bb3431fce5823822e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082019-brink-buddhist-4d1b@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
c164c7bc9775 ("blk-cgroup: hold queue_lock when removing blkg->q_node")
a06377c5d01e ("Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"")
9a9c261e6b55 ("Revert "blk-cgroup: pass a gendisk to blkg_lookup"")
1231039db31c ("Revert "blk-cgroup: move the cgroup information to struct gendisk"")
dcb522014351 ("Revert "blk-cgroup: simplify blkg freeing from initialization failure paths"")
3f13ab7c80fd ("blk-cgroup: move the cgroup information to struct gendisk")
479664cee14d ("blk-cgroup: pass a gendisk to blkg_lookup")
ba91c849fa50 ("blk-rq-qos: store a gendisk instead of request_queue in struct rq_qos")
3963d84df797 ("blk-rq-qos: constify rq_qos_ops")
ce57b558604e ("blk-rq-qos: make rq_qos_add and rq_qos_del more useful")
b494f9c566ba ("blk-rq-qos: move rq_qos_add and rq_qos_del out of line")
f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct")
84d7d462b16d ("blk-cgroup: pin the gendisk in struct blkcg_gq")
180b04d450a7 ("blk-cgroup: remove the !bdi->dev check in blkg_dev_name")
27b642b07a4a ("blk-cgroup: simplify blkg freeing from initialization failure paths")
0b6f93bdf07e ("blk-cgroup: improve error unwinding in blkg_alloc")
f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
dfd6200a0954 ("blk-cgroup: support to track if policy is online")
c7241babf085 ("blk-cgroup: dropping parent refcount after pd_free_fn() is done")
e3ff8887e7db ("blk-cgroup: fix missing pd_online_fn() while activating policy")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c164c7bc9775be7bcc68754bb3431fce5823822e Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei(a)redhat.com>
Date: Thu, 17 Aug 2023 22:17:51 +0800
Subject: [PATCH] blk-cgroup: hold queue_lock when removing blkg->q_node
When blkg is removed from q->blkg_list from blkg_free_workfn(), queue_lock
has to be held, otherwise, all kinds of bugs(list corruption, hard lockup,
..) can be triggered from blkg_destroy_all().
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Cc: Yu Kuai <yukuai3(a)huawei.com>
Cc: xiaoli feng <xifeng(a)redhat.com>
Cc: Chunyu Hu <chuhu(a)redhat.com>
Cc: Mike Snitzer <snitzer(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
Acked-by: Tejun Heo <tj(a)kernel.org>
Link: https://lore.kernel.org/r/20230817141751.1128970-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fc49be622e05..9faafcd10e17 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -136,7 +136,9 @@ static void blkg_free_workfn(struct work_struct *work)
blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
if (blkg->parent)
blkg_put(blkg->parent);
+ spin_lock_irq(&q->queue_lock);
list_del_init(&blkg->q_node);
+ spin_unlock_irq(&q->queue_lock);
mutex_unlock(&q->blkcg_mutex);
blk_put_queue(q);
Re; Interest,
I am interested in discussing the Investment proposal as I explained
in my previous mail. May you let me know your interest and the
possibility of a cooperation aimed for mutual interest.
Looking forward to your mail for further discussion.
Regards
------
Chen Yun - Chairman of CREC
China Railway Engineering Corporation - CRECG
China Railway Plaza, No.69 Fuxing Road, Haidian District, Beijing, P.R.
China
fbcon requires that we implement &drm_framebuffer_funcs.dirty.
Otherwise, the framebuffer might take a while to flush (which would
manifest as noticeable lag). However, we can't enable this callback for
non-fbcon cases since it may cause too many atomic commits to be made at
once. So, implement amdgpu_dirtyfb() and only enable it for fbcon
framebuffers (we can use the "struct drm_file file" parameter in the
callback to check for this since it is only NULL when called by fbcon,
at least in the mainline kernel) on devices that support atomic KMS.
Cc: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: stable(a)vger.kernel.org # 6.1+
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2519
Signed-off-by: Hamza Mahfooz <hamza.mahfooz(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
--
v3: check if file is NULL instead of doing a strcmp() and make note of
it in the commit message
---
drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 26 ++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
index d20dd3f852fc..363e6a2cad8c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
@@ -38,6 +38,8 @@
#include <linux/pci.h>
#include <linux/pm_runtime.h>
#include <drm/drm_crtc_helper.h>
+#include <drm/drm_damage_helper.h>
+#include <drm/drm_drv.h>
#include <drm/drm_edid.h>
#include <drm/drm_fb_helper.h>
#include <drm/drm_gem_framebuffer_helper.h>
@@ -532,11 +534,29 @@ bool amdgpu_display_ddc_probe(struct amdgpu_connector *amdgpu_connector,
return true;
}
+static int amdgpu_dirtyfb(struct drm_framebuffer *fb, struct drm_file *file,
+ unsigned int flags, unsigned int color,
+ struct drm_clip_rect *clips, unsigned int num_clips)
+{
+
+ if (file)
+ return -ENOSYS;
+
+ return drm_atomic_helper_dirtyfb(fb, file, flags, color, clips,
+ num_clips);
+}
+
static const struct drm_framebuffer_funcs amdgpu_fb_funcs = {
.destroy = drm_gem_fb_destroy,
.create_handle = drm_gem_fb_create_handle,
};
+static const struct drm_framebuffer_funcs amdgpu_fb_funcs_atomic = {
+ .destroy = drm_gem_fb_destroy,
+ .create_handle = drm_gem_fb_create_handle,
+ .dirty = amdgpu_dirtyfb
+};
+
uint32_t amdgpu_display_supported_domains(struct amdgpu_device *adev,
uint64_t bo_flags)
{
@@ -1139,7 +1159,11 @@ static int amdgpu_display_gem_fb_verify_and_init(struct drm_device *dev,
if (ret)
goto err;
- ret = drm_framebuffer_init(dev, &rfb->base, &amdgpu_fb_funcs);
+ if (drm_drv_uses_atomic_modeset(dev))
+ ret = drm_framebuffer_init(dev, &rfb->base,
+ &amdgpu_fb_funcs_atomic);
+ else
+ ret = drm_framebuffer_init(dev, &rfb->base, &amdgpu_fb_funcs);
if (ret)
goto err;
--
2.41.0
The patch titled
Subject: shmem: fix smaps BUG sleeping while atomic
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
shmem-fix-smaps-bug-sleeping-while-atomic.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: shmem: fix smaps BUG sleeping while atomic
Date: Tue, 22 Aug 2023 22:14:47 -0700 (PDT)
smaps_pte_hole_lookup() is calling shmem_partial_swap_usage() with page
table lock held: but shmem_partial_swap_usage() does cond_resched_rcu() if
need_resched(): "BUG: sleeping function called from invalid context".
Since shmem_partial_swap_usage() is designed to count across a range, but
smaps_pte_hole_lookup() only calls it for a single page slot, just break
out of the loop on the last or only page, before checking need_resched().
Link: https://lkml.kernel.org/r/6fe3b3ec-abdf-332f-5c23-6a3b3a3b11a9@google.com
Fixes: 230100321518 ("mm/smaps: simplify shmem handling of pte holes")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [5.16+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/mm/shmem.c~shmem-fix-smaps-bug-sleeping-while-atomic
+++ a/mm/shmem.c
@@ -806,14 +806,16 @@ unsigned long shmem_partial_swap_usage(s
XA_STATE(xas, &mapping->i_pages, start);
struct page *page;
unsigned long swapped = 0;
+ unsigned long max = end - 1;
rcu_read_lock();
- xas_for_each(&xas, page, end - 1) {
+ xas_for_each(&xas, page, max) {
if (xas_retry(&xas, page))
continue;
if (xa_is_value(page))
swapped++;
-
+ if (xas.xa_index == max)
+ break;
if (need_resched()) {
xas_pause(&xas);
cond_resched_rcu();
_
Patches currently in -mm which might be from hughd(a)google.com are
shmem-fix-smaps-bug-sleeping-while-atomic.patch