This is the start of the stable review cycle for the 4.9.250 release.
There are 32 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 09 Jan 2021 14:08:13 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.250-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.250-rc1
Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
iio:magnetometer:mag3110: Fix alignment and data leak issues.
Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
iio:imu:bmi160: Fix alignment and data leak issues
Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
iio:imu:bmi160: Fix too large a buffer.
sayli karnik <karniksayli1995(a)gmail.com>
iio: bmi160_core: Fix sparse warning due to incorrect type in assignment
SeongJae Park <sjpark(a)amazon.de>
xenbus/xenbus_backend: Disallow pending watch messages
SeongJae Park <sjpark(a)amazon.de>
xen/xenbus: Count pending messages for each watch
SeongJae Park <sjpark(a)amazon.de>
xen/xenbus/xen_bus_type: Support will_handle watch callback
SeongJae Park <sjpark(a)amazon.de>
xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()
SeongJae Park <sjpark(a)amazon.de>
xen/xenbus: Allow watches discard events before queueing
Josh Poimboeuf <jpoimboe(a)redhat.com>
kdev_t: always inline major/minor helper functions
Jessica Yu <jeyu(a)kernel.org>
module: delay kobject uevent until after module init call
Qinglang Miao <miaoqinglang(a)huawei.com>
powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
Jan Kara <jack(a)suse.cz>
quota: Don't overflow quota file offsets
Miroslav Benes <mbenes(a)suse.cz>
module: set MODULE_STATE_GOING state when a module fails to load
Takashi Iwai <tiwai(a)suse.de>
ALSA: seq: Use bool for snd_seq_queue internal flags
Mauro Carvalho Chehab <mchehab+huawei(a)kernel.org>
media: gp8psk: initialize stats at power control logic
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
Rustam Kovhaev <rkovhaev(a)gmail.com>
reiserfs: add check for an invalid ih_entry_count
Johan Hovold <johan(a)kernel.org>
of: fix linker-section match-table corruption
Petr Vorel <petr.vorel(a)gmail.com>
uapi: move constants from <linux/kernel.h> to <linux/const.h>
Paolo Abeni <pabeni(a)redhat.com>
l2tp: fix races with ipv4-mapped ipv6 addresses
Paolo Abeni <pabeni(a)redhat.com>
net: ipv6: keep sk status consistent after datagram connect failure
Johan Hovold <johan(a)kernel.org>
USB: serial: digi_acceleport: fix write-wakeup deadlocks
Stefan Haberland <sth(a)linux.ibm.com>
s390/dasd: fix hanging device offline processing
Eric Auger <eric.auger(a)redhat.com>
vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
Kailang Yang <kailang(a)realtek.com>
ALSA: hda/realtek - Dell headphone has noise on unmute for ALC236
Hui Wang <hui.wang(a)canonical.com>
ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines
Kailang Yang <kailang(a)realtek.com>
ALSA: hda/realtek - Support Dell headset mode for ALC3271
Johan Hovold <johan(a)kernel.org>
ALSA: usb-audio: fix sync-ep altsetting sanity check
Alberto Aguirre <albaguirre(a)gmail.com>
ALSA: usb-audio: simplify set_sync_ep_implicit_fb_quirk
Takashi Iwai <tiwai(a)suse.de>
ALSA: hda/ca0132 - Fix work handling in delayed HP detection
Jan Beulich <JBeulich(a)suse.com>
x86/entry/64: Add instruction suffix
-------------
Diffstat:
Makefile | 4 +--
arch/powerpc/sysdev/mpic_msgr.c | 2 +-
arch/x86/entry/entry_64.S | 2 +-
drivers/block/xen-blkback/xenbus.c | 3 +-
drivers/iio/imu/bmi160/bmi160_core.c | 12 +++++--
drivers/iio/magnetometer/mag3110.c | 13 +++++---
drivers/media/usb/dvb-usb/gp8psk.c | 2 +-
drivers/misc/vmw_vmci/vmci_context.c | 2 +-
drivers/net/xen-netback/xenbus.c | 4 ++-
drivers/s390/block/dasd_alias.c | 10 +++++-
drivers/usb/serial/digi_acceleport.c | 45 ++++++++------------------
drivers/vfio/pci/vfio_pci.c | 4 +--
drivers/xen/xen-pciback/xenbus.c | 2 +-
drivers/xen/xenbus/xenbus_client.c | 8 ++++-
drivers/xen/xenbus/xenbus_probe.c | 1 +
drivers/xen/xenbus/xenbus_probe.h | 2 ++
drivers/xen/xenbus/xenbus_probe_backend.c | 7 +++++
drivers/xen/xenbus/xenbus_xs.c | 38 ++++++++++++++--------
fs/quota/quota_tree.c | 8 ++---
fs/reiserfs/stree.c | 6 ++++
include/linux/kdev_t.h | 22 ++++++-------
include/linux/of.h | 1 +
include/uapi/linux/const.h | 5 +++
include/uapi/linux/ethtool.h | 2 +-
include/uapi/linux/kernel.h | 9 +-----
include/uapi/linux/lightnvm.h | 2 +-
include/uapi/linux/mroute6.h | 2 +-
include/uapi/linux/netfilter/x_tables.h | 2 +-
include/uapi/linux/netlink.h | 2 +-
include/uapi/linux/sysctl.h | 2 +-
include/xen/xenbus.h | 15 ++++++++-
kernel/module.c | 6 ++--
net/ipv6/datagram.c | 21 ++++++++++---
net/l2tp/l2tp_core.c | 38 +++++++++++-----------
net/l2tp/l2tp_core.h | 3 --
sound/core/seq/seq_queue.h | 8 ++---
sound/pci/hda/patch_ca0132.c | 16 ++++++++--
sound/pci/hda/patch_realtek.c | 25 +++++++++++++--
sound/usb/pcm.c | 52 ++++++++++++-------------------
39 files changed, 242 insertions(+), 166 deletions(-)
This is a note to let you know that I've just added the patch titled
usb: gadget: enable super speed plus
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From e2459108b5a0604c4b472cae2b3cb8d3444c77fb Mon Sep 17 00:00:00 2001
From: "taehyun.cho" <taehyun.cho(a)samsung.com>
Date: Thu, 7 Jan 2021 00:46:25 +0900
Subject: usb: gadget: enable super speed plus
Enable Super speed plus in configfs to support USB3.1 Gen2.
This ensures that when a USB gadget is plugged in, it is
enumerated as Gen 2 and connected at 10 Gbps if the host and
cable are capable of it.
Many in-tree gadget functions (fs, midi, acm, ncm, mass_storage,
etc.) already have SuperSpeed Plus support.
Tested: plugged gadget into Linux host and saw:
[284907.385986] usb 8-2: new SuperSpeedPlus Gen 2 USB device number 3 using xhci_hcd
Tested-by: Lorenzo Colitti <lorenzo(a)google.com>
Acked-by: Felipe Balbi <balbi(a)kernel.org>
Signed-off-by: taehyun.cho <taehyun.cho(a)samsung.com>
Signed-off-by: Lorenzo Colitti <lorenzo(a)google.com>
Link: https://lore.kernel.org/r/20210106154625.2801030-1-lorenzo@google.com
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/gadget/configfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/gadget/configfs.c b/drivers/usb/gadget/configfs.c
index 408a5332a975..36ffb43f9c1a 100644
--- a/drivers/usb/gadget/configfs.c
+++ b/drivers/usb/gadget/configfs.c
@@ -1543,7 +1543,7 @@ static const struct usb_gadget_driver configfs_driver_template = {
.suspend = configfs_composite_suspend,
.resume = configfs_composite_resume,
- .max_speed = USB_SPEED_SUPER,
+ .max_speed = USB_SPEED_SUPER_PLUS,
.driver = {
.owner = THIS_MODULE,
.name = "configfs-gadget",
@@ -1583,7 +1583,7 @@ static struct config_group *gadgets_make(
gi->composite.unbind = configfs_do_nothing;
gi->composite.suspend = NULL;
gi->composite.resume = NULL;
- gi->composite.max_speed = USB_SPEED_SUPER;
+ gi->composite.max_speed = USB_SPEED_SUPER_PLUS;
spin_lock_init(&gi->spinlock);
mutex_init(&gi->lock);
--
2.30.0
Building s390:defconfig ... failed
Building s390:allmodconfig ... failed
Building s390:performance_defconfig ... failed
--------------
Error log:
arch/s390/kernel/smp.c: In function 'cpu_configure_store':
arch/s390/kernel/smp.c:1009:8: error: implicit declaration of function 'smp_get_base_cpu'
The problem affects v4.4.y and v4.9.y stable queues.
Guenter
This series hunts the problems discovered after manual enabling of
ARCH_WANT_LD_ORPHAN_WARN. Notably:
- adds the missing PAGE_ALIGNED_DATA() section affecting VDSO
placement (marked for stable);
- properly stops .eh_frame section generation.
Compile and runtime tested on MIPS32R2 CPS board with no issues
using two different toolkits:
- Binutils 2.35.1, GCC 10.2.0;
- LLVM stack 11.0.0.
Since v2 [1]:
- stop discarding .eh_frame and just prevent it from generating
(Kees);
- drop redundant sections assertions (Fangrui);
- place GOT table in .text instead of asserting as it's not empty
when building with LLVM (Nathan);
- catch compound literals in generic definitions when building with
LD_DEAD_CODE_DATA_ELIMINATION (Kees);
- collect two Reviewed-bys (Kees).
Since v1 [0]:
- catch .got entries too as LLD may produce it (Nathan);
- check for unwanted sections to be zero-sized instead of
discarding (Fangrui).
[0] https://lore.kernel.org/linux-mips/20210104121729.46981-1-alobakin@pm.me
[1] https://lore.kernel.org/linux-mips/20210106200713.31840-1-alobakin@pm.me
Alexander Lobakin (7):
MIPS: vmlinux.lds.S: add missing PAGE_ALIGNED_DATA() section
MIPS: vmlinux.lds.S: add ".gnu.attributes" to DISCARDS
MIPS: properly stop .eh_frame generation
MIPS: vmlinux.lds.S: catch bad .rel.dyn at link time
MIPS: vmlinux.lds.S: explicitly declare .got table
vmlinux.lds.h: catch compound literals into data and BSS
MIPS: select ARCH_WANT_LD_ORPHAN_WARN
arch/mips/Kconfig | 1 +
arch/mips/include/asm/asm.h | 18 ++++++++++++++++++
arch/mips/kernel/vmlinux.lds.S | 15 ++++++++++++++-
include/asm-generic/vmlinux.lds.h | 6 +++---
4 files changed, 36 insertions(+), 4 deletions(-)
--
2.30.0
This series hunts the problems discovered after manual enabling of
ARCH_WANT_LD_ORPHAN_WARN, notably the missing PAGE_ALIGNED_DATA()
section affecting VDSO placement (marked for stable).
Compile and runtime tested on MIPS32R2 CPS board with no issues.
Since v1 [0]:
- catch .got entries too as LLD may produce it (Nathan);
- check for unwanted sections to be zero-sized instead of
discarding (Fangrui).
[0] https://lore.kernel.org/linux-mips/20210104121729.46981-1-alobakin@pm.me
Alexander Lobakin (4):
MIPS: vmlinux.lds.S: add missing PAGE_ALIGNED_DATA() section
MIPS: vmlinux.lds.S: add ".gnu.attributes" to DISCARDS
MIPS: vmlinux.lds.S: catch bad .got, .plt and .rel.dyn at link time
MIPS: select ARCH_WANT_LD_ORPHAN_WARN
arch/mips/Kconfig | 1 +
arch/mips/kernel/vmlinux.lds.S | 39 +++++++++++++++++++++++++++++++++-
2 files changed, 39 insertions(+), 1 deletion(-)
--
2.30.0
HDA initialization is failing occasionally on Tegra210 and following
print is observed in the boot log. Because of this probe() fails and
no sound card is registered.
[16.800802] tegra-hda 70030000.hda: no codecs found!
Codecs request a state change and enumeration by the controller. In
failure cases this does not seem to happen as STATETS register reads 0.
The problem seems to be related to the HDA codec dependency on SOR
power domain. If it is gated during HDA probe then the failure is
observed. Building Tegra HDA driver into kernel image avoids this
failure but does not completely address the dependency part. Fix this
problem by adding 'power-domains' DT property for Tegra210 HDA. Note
that Tegra186 and Tegra194 HDA do this already.
Fixes: 742af7e7a0a1 ("arm64: tegra: Add Tegra210 support")
Depends-on: 96d1f078ff0 ("arm64: tegra: Add SOR power-domain for Tegra210")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Sameer Pujar <spujar(a)nvidia.com>
---
arch/arm64/boot/dts/nvidia/tegra210.dtsi | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/boot/dts/nvidia/tegra210.dtsi b/arch/arm64/boot/dts/nvidia/tegra210.dtsi
index 4fbf8c1..fd33b4d 100644
--- a/arch/arm64/boot/dts/nvidia/tegra210.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra210.dtsi
@@ -997,6 +997,7 @@
<&tegra_car 128>, /* hda2hdmi */
<&tegra_car 111>; /* hda2codec_2x */
reset-names = "hda", "hda2hdmi", "hda2codec_2x";
+ power-domains = <&pd_sor>;
status = "disabled";
};
--
2.7.4
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]
Rework the flushing of proc to use a list of directory inodes that
need to be flushed.
The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.
This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.
This flushing remains an optimization. The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died. When unused the shrinker will
eventually be able to remove these dentries.
There is a case in de_thread where proc_flush_pid can be
called early for a given pid. Which winds up being
safe (if suboptimal) as this is just an optiimization.
Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.
So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid. Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place. This removes a small race
where a dentry could recreated.
As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.
The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.
v2: Initialize pid->inodes. I somehow failed to get that
initialization into the initial version of the patch. A boot
failure was reported by "kernel test robot <lkp(a)intel.com>", and
failure to initialize that pid->inodes matches all of the reported
symptoms.
Signed-off-by: Eric W. Biederman <ebiederm(a)xmission.com>
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid
namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush
entries from multiple proc trees")
Cc: <stable(a)vger.kernel.org> # 4.19.x: b3e583825266: clone: add
CLONE_PIDFD
Cc: <stable(a)vger.kernel.org> # 4.19.x: b53b0b9d9a61: pidfd: add polling
support
Cc: <stable(a)vger.kernel.org> # 4.19.x: 0afa5ca82212: proc: Rename in
proc_inode rename sysctl_inodes sibling_inodes
Cc: <stable(a)vger.kernel.org> # 4.19.x: 26dbc60f385f: proc: Generalize
proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc: <stable(a)vger.kernel.org> # 4.19.x: 71448011ea2a: proc: Clear the
pieces of
proc_inode that proc_evict_inode cares about
Cc: <stable(a)vger.kernel.org> # 4.19.x: f90f3cafe8d5: Use d_invalidate in
proc_prune_siblings_dcache
Cc: <stable(a)vger.kernel.org> # 4.19.x
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/base.c | 111 ++++++++++++++++--------------------------------
fs/proc/inode.c | 2 +-
fs/proc/internal.h | 1 +
include/linux/pid.h | 1 +
include/linux/proc_fs.h | 4 +-
kernel/exit.c | 4 +-
kernel/pid.c | 1 +
7 files changed, 45 insertions(+), 79 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5e705fa..ea74c7c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1741,11 +1741,25 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
*rgid = gid;
}
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+ struct pid *pid = ei->pid;
+
+ if (S_ISDIR(ei->vfs_inode.i_mode)) {
+ spin_lock(&pid->wait_pidfd.lock);
+ hlist_del_init_rcu(&ei->sibling_inodes);
+ spin_unlock(&pid->wait_pidfd.lock);
+ }
+
+ put_pid(pid);
+}
+
struct inode *proc_pid_make_inode(struct super_block * sb,
struct task_struct *task, umode_t mode)
{
struct inode * inode;
struct proc_inode *ei;
+ struct pid *pid;
/* We need a new inode */
@@ -1763,10 +1777,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
/*
* grab the reference to task.
*/
- ei->pid = get_task_pid(task, PIDTYPE_PID);
- if (!ei->pid)
+ pid = get_task_pid(task, PIDTYPE_PID);
+ if (!pid)
goto out_unlock;
+ /* Let the pid remember us for quick removal */
+ ei->pid = pid;
+ if (S_ISDIR(mode)) {
+ spin_lock(&pid->wait_pidfd.lock);
+ hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+ spin_unlock(&pid->wait_pidfd.lock);
+ }
+
task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
security_task_to_inode(task, inode);
@@ -3067,90 +3089,29 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
.permission = proc_pid_permission,
};
-static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
-{
- struct dentry *dentry, *leader, *dir;
- char buf[10 + 1];
- struct qstr name;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", pid);
- /* no ->d_hash() rejects on procfs */
- dentry = d_hash_and_lookup(mnt->mnt_root, &name);
- if (dentry) {
- d_invalidate(dentry);
- dput(dentry);
- }
-
- if (pid == tgid)
- return;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", tgid);
- leader = d_hash_and_lookup(mnt->mnt_root, &name);
- if (!leader)
- goto out;
-
- name.name = "task";
- name.len = strlen(name.name);
- dir = d_hash_and_lookup(leader, &name);
- if (!dir)
- goto out_put_leader;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", pid);
- dentry = d_hash_and_lookup(dir, &name);
- if (dentry) {
- d_invalidate(dentry);
- dput(dentry);
- }
-
- dput(dir);
-out_put_leader:
- dput(leader);
-out:
- return;
-}
-
/**
- * proc_flush_task - Remove dcache entries for @task from the /proc dcache.
- * @task: task that should be flushed.
+ * proc_flush_pid - Remove dcache entries for @pid from the /proc dcache.
+ * @pid: pid that should be flushed.
*
- * When flushing dentries from proc, one needs to flush them from global
- * proc (proc_mnt) and from all the namespaces' procs this task was seen
- * in. This call is supposed to do all of this job.
- *
- * Looks in the dcache for
- * /proc/@pid
- * /proc/@tgid/task/@pid
- * if either directory is present flushes it and all of it'ts children
- * from the dcache.
+ * This function walks a list of inodes (that belong to any proc
+ * filesystem) that are attached to the pid and flushes them from
+ * the dentry cache.
*
* It is safe and reasonable to cache /proc entries for a task until
* that task exits. After that they just clog up the dcache with
* useless entries, possibly causing useful dcache entries to be
- * flushed instead. This routine is proved to flush those useless
- * dcache entries at process exit time.
+ * flushed instead. This routine is provided to flush those useless
+ * dcache entries when a process is reaped.
*
* NOTE: This routine is just an optimization so it does not guarantee
- * that no dcache entries will exist at process exit time it
- * just makes it very unlikely that any will persist.
+ * that no dcache entries will exist after a process is reaped
+ * it just makes it very unlikely that any will persist.
*/
-void proc_flush_task(struct task_struct *task)
+void proc_flush_pid(struct pid *pid)
{
- int i;
- struct pid *pid, *tgid;
- struct upid *upid;
-
- pid = task_pid(task);
- tgid = task_tgid(task);
-
- for (i = 0; i <= pid->level; i++) {
- upid = &pid->numbers[i];
- proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
- tgid->numbers[i].nr);
- }
+ proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock);
+ put_pid(pid);
}
static struct dentry *proc_pid_instantiate(struct dentry * dentry,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index fad579e..83c640a 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -41,7 +41,7 @@ static void proc_evict_inode(struct inode *inode)
/* Stop tracking associated processes */
if (ei->pid) {
- put_pid(ei->pid);
+ proc_pid_evict_inode(ei);
ei->pid = NULL;
}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 1db693b..be55ee2 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -158,6 +158,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
extern const struct dentry_operations pid_dentry_operations;
extern int pid_getattr(const struct path *, struct kstat *, u32, unsigned int);
extern int proc_setattr(struct dentry *, struct iattr *);
+extern void proc_pid_evict_inode(struct proc_inode *);
extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
extern void pid_update_inode(struct task_struct *, struct inode *);
extern int pid_delete_dentry(const struct dentry *);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index a82d2f7..08efbe1 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -61,6 +61,7 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ struct hlist_head inodes;
/* wait queue for pidfd notifications */
wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f15..94bf352 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -17,7 +17,7 @@
typedef int (*proc_write_t)(struct file *, char *, size_t);
extern void proc_root_init(void);
-extern void proc_flush_task(struct task_struct *);
+extern void proc_flush_pid(struct pid *);
extern struct proc_dir_entry *proc_symlink(const char *,
struct proc_dir_entry *, const char *);
@@ -80,7 +80,7 @@ static inline void proc_root_init(void)
{
}
-static inline void proc_flush_task(struct task_struct *task)
+static inline void proc_flush_pid(struct pid *pid)
{
}
diff --git a/kernel/exit.c b/kernel/exit.c
index 65133eb..55996ba 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -185,6 +185,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
void release_task(struct task_struct *p)
{
struct task_struct *leader;
+ struct pid *thread_pid;
int zap_leader;
repeat:
/* don't need to get the RCU readlock here - the process is dead and
@@ -193,11 +194,11 @@ void release_task(struct task_struct *p)
atomic_dec(&__task_cred(p)->user->processes);
rcu_read_unlock();
- proc_flush_task(p);
cgroup_release(p);
write_lock_irq(&tasklist_lock);
ptrace_release_task(p);
+ thread_pid = get_pid(p->thread_pid);
__exit_signal(p);
/*
@@ -220,6 +221,7 @@ void release_task(struct task_struct *p)
}
write_unlock_irq(&tasklist_lock);
+ proc_flush_pid(thread_pid);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);
diff --git a/kernel/pid.c b/kernel/pid.c
index 3ba6fcb..65df035 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -215,6 +215,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
INIT_HLIST_HEAD(&pid->tasks[type]);
init_waitqueue_head(&pid->wait_pidfd);
+ INIT_HLIST_HEAD(&pid->inodes);
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
--
1.8.3.1
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit f90f3cafe8d56d593fc509a4185da1d5800efea4 ]
The function d_prune_aliases has the problem that it will only prune
aliases thare are completely unused. It will not remove aliases for
the dcache or even think of removing mounts from the dcache. For that
behavior d_invalidate is needed.
To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.
For completeness the directory and the non-directory cases are
separated because in theory (although not in currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.
Plus the differences between d_find_any_alias and d_find_alias makes
it clear why the directory and non-directory code and not share code.
To make it clear these routines now invalidate dentries rename
proc_prune_siblings_dache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache proc_sys_invalidate_dcache.
V2: Split the directory and non-directory cases. To make this
code robust to future changes in proc.
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: <stable(a)vger.kernel.org> # 4.19.x
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/inode.c | 16 ++++++++++++++--
fs/proc/internal.h | 2 +-
fs/proc/proc_sysctl.c | 8 ++++----
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 45b4344..fad579e 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -118,7 +118,7 @@ void __init proc_init_kmemcache(void)
BUILD_BUG_ON(sizeof(struct proc_dir_entry) >= SIZEOF_PDE);
}
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
{
struct inode *inode;
struct proc_inode *ei;
@@ -147,7 +147,19 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
continue;
}
- d_prune_aliases(inode);
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *dir = d_find_any_alias(inode);
+ if (dir) {
+ d_invalidate(dir);
+ dput(dir);
+ }
+ } else {
+ struct dentry *dentry;
+ while ((dentry = d_find_alias(inode))) {
+ d_invalidate(dentry);
+ dput(dentry);
+ }
+ }
iput(inode);
deactivate_super(sb);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 6cae472..1db693b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -210,7 +210,7 @@ struct pde_opener {
extern const struct inode_operations proc_pid_link_inode_operations;
void proc_init_kmemcache(void);
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
void set_proc_pid_nlink(void);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
extern int proc_fill_super(struct super_block *, void *data, int flags);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 57b16bf..f8f1f8a 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -262,9 +262,9 @@ static void unuse_table(struct ctl_table_header *p)
complete(p->unregistering);
}
-static void proc_sys_prune_dcache(struct ctl_table_header *head)
+static void proc_sys_invalidate_dcache(struct ctl_table_header *head)
{
- proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
+ proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock);
}
/* called under sysctl_lock, will reacquire if has to wait */
@@ -286,10 +286,10 @@ static void start_unregistering(struct ctl_table_header *p)
spin_unlock(&sysctl_lock);
}
/*
- * Prune dentries for unregistered sysctls: namespaced sysctls
+ * Invalidate dentries for unregistered sysctls: namespaced sysctls
* can have duplicate names and contaminate dcache very badly.
*/
- proc_sys_prune_dcache(p);
+ proc_sys_invalidate_dcache(p);
/*
* do not remove from the list until nobody holds it; walking the
* list in do_sysctl() relies on that.
--
1.8.3.1
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit 71448011ea2a1cd36d8f5cbdab0ed716c454d565 ]
This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
indoes when a process dies that caused me to tidy this up.
The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: <stable(a)vger.kernel.org> # 4.19.x
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/inode.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index fffc7e4..45b4344 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -34,21 +34,27 @@ static void proc_evict_inode(struct inode *inode)
{
struct proc_dir_entry *de;
struct ctl_table_header *head;
+ struct proc_inode *ei = PROC_I(inode);
truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
/* Stop tracking associated processes */
- put_pid(PROC_I(inode)->pid);
+ if (ei->pid) {
+ put_pid(ei->pid);
+ ei->pid = NULL;
+ }
/* Let go of any associated proc directory entry */
- de = PDE(inode);
- if (de)
+ de = ei->pde;
+ if (de) {
pde_put(de);
+ ei->pde = NULL;
+ }
- head = PROC_I(inode)->sysctl;
+ head = ei->sysctl;
if (head) {
- RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
+ RCU_INIT_POINTER(ei->sysctl, NULL);
proc_sys_evict_inode(inode, head);
}
}
--
1.8.3.1
From: "Joel Fernandes (Google)" <joel(a)joelfernandes.org>
[ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ]
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.
For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be option in this case because the task can
be reaped and destroyed before the poll returns.
2. By including the struct pid for the waitqueue means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Daniel Colascione <dancol(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Tim Murray <timmurray(a)google.com>
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: kernel-team(a)android.com
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: <stable(a)vger.kernel.org> # 4.19.x
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
include/linux/pid.h | 3 +++
kernel/fork.c | 26 ++++++++++++++++++++++++++
kernel/pid.c | 2 ++
kernel/signal.c | 11 +++++++++++
4 files changed, 42 insertions(+)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 29c0a99..a82d2f7 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
#define _LINUX_PID_H
#include <linux/rculist.h>
+#include <linux/wait.h>
enum pid_type
{
@@ -60,6 +61,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ /* wait queue for pidfd notifications */
+ wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
diff --git a/kernel/fork.c b/kernel/fork.c
index e419891..33dc746 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,8 +1688,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+ struct task_struct *task;
+ struct pid *pid = file->private_data;
+ int poll_flags = 0;
+
+ poll_wait(file, &pid->wait_pidfd, pts);
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ /*
+ * Inform pollers only when the whole thread group exits.
+ * If the thread group leader exits before all other threads in the
+ * group, then poll(2) should block, similar to the wait(2) family.
+ */
+ if (!task || (task->exit_state && thread_group_empty(task)))
+ poll_flags = POLLIN | POLLRDNORM;
+ rcu_read_unlock();
+
+ return poll_flags;
+}
+
const struct file_operations pidfd_fops = {
.release = pidfd_release,
+ .poll = pidfd_poll,
#ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
#endif
diff --git a/kernel/pid.c b/kernel/pid.c
index b88fe5e..3ba6fcb 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->pid_allocated & PIDNS_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index a02a25a..22a04795 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1810,6 +1810,14 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
return ret;
}
+static void do_notify_pidfd(struct task_struct *task)
+{
+ struct pid *pid;
+
+ pid = task_pid(task);
+ wake_up_all(&pid->wait_pidfd);
+}
+
/*
* Let a parent know about the death of a child.
* For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1833,6 +1841,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
+ /* Wake up all pidfd waiters */
+ do_notify_pidfd(tsk);
+
if (sig != SIGCHLD) {
/*
* This is only possible if parent == real_parent.
--
1.8.3.1
From: Christian Brauner <christian(a)brauner.io>
[ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call. Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call. As
spotted by Linus, there is exactly one bit for clone() left.
CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending). Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege. This does not imply it
cannot ever be that way but for now this is not the case.
A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone. This has the advantage that we can
give back the associated pid and the pidfd at the same time.
To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Co-developed-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: David Howells <dhowells(a)redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages(a)gmail.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: <stable(a)vger.kernel.org> # 4.19.x
(clone: fix up cherry-pick conflicts for b3e583825266)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
include/linux/pid.h | 2 +
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 107 +++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 106 insertions(+), 4 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 14a9a39..29c0a99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
extern struct pid init_struct_pid;
+extern const struct file_operations pidfd_fops;
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f8..ed4ee17 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD 0x00001000 /* set if a pidfd should be placed in parent */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index f2c92c1..e419891 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/
+#include <linux/anon_inodes.h>
#include <linux/slab.h>
#include <linux/sched/autogroup.h>
#include <linux/sched/mm.h>
@@ -21,6 +22,7 @@
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/sched/cputime.h>
+#include <linux/seq_file.h>
#include <linux/rtmutex.h>
#include <linux/init.h>
#include <linux/unistd.h>
@@ -1666,6 +1668,58 @@ static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk)
mutex_unlock(&oom_adj_mutex);
}
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+
+ file->private_data = NULL;
+ put_pid(pid);
+ return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+ struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+ struct pid *pid = f->private_data;
+
+ seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+ seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+ .release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid: struct pid that the pidfd will reference
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * been unshared to avoid leaking the pidfd to the new process.
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid)
+{
+ int fd;
+
+ fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid),
+ O_RDWR | O_CLOEXEC);
+ if (fd < 0)
+ put_pid(pid);
+
+ return fd;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -1678,13 +1732,14 @@ static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
+ int __user *parent_tidptr,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
- int retval;
+ int pidfd = -1, retval;
struct task_struct *p;
struct multiprocess_signals delayed;
@@ -1734,6 +1789,31 @@ static __latent_entropy struct task_struct *copy_process(
return ERR_PTR(-EINVAL);
}
+ if (clone_flags & CLONE_PIDFD) {
+ int reserved;
+
+ /*
+ * - CLONE_PARENT_SETTID is useless for pidfds and also
+ * parent_tidptr is used to return pidfds.
+ * - CLONE_DETACHED is blocked so that we can potentially
+ * reuse it later for CLONE_PIDFD.
+ * - CLONE_THREAD is blocked until someone really needs it.
+ */
+ if (clone_flags &
+ (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Verify that parent_tidptr is sane so we can potentially
+ * reuse it later.
+ */
+ if (get_user(reserved, parent_tidptr))
+ return ERR_PTR(-EFAULT);
+
+ if (reserved != 0)
+ return ERR_PTR(-EINVAL);
+ }
+
/*
* Force any signals received before this point to be delivered
* before the fork happens. Collect up signals sent to multiple
@@ -1934,6 +2014,22 @@ static __latent_entropy struct task_struct *copy_process(
}
}
+ /*
+ * This has to happen after we've potentially unshared the file
+ * descriptor table (so that the pidfd doesn't leak into the child
+ * if the fd table isn't shared).
+ */
+ if (clone_flags & CLONE_PIDFD) {
+ retval = pidfd_create(pid);
+ if (retval < 0)
+ goto bad_fork_free_pid;
+
+ pidfd = retval;
+ retval = put_user(pidfd, parent_tidptr);
+ if (retval)
+ goto bad_fork_put_pidfd;
+ }
+
#ifdef CONFIG_BLOCK
p->plug = NULL;
#endif
@@ -1989,7 +2085,7 @@ static __latent_entropy struct task_struct *copy_process(
*/
retval = cgroup_can_fork(p);
if (retval)
- goto bad_fork_free_pid;
+ goto bad_fork_put_pidfd;
/*
* From this point on we must avoid any synchronous user-space
@@ -2111,6 +2207,9 @@ static __latent_entropy struct task_struct *copy_process(
spin_unlock(¤t->sighand->siglock);
write_unlock_irq(&tasklist_lock);
cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+ if (clone_flags & CLONE_PIDFD)
+ ksys_close(pidfd);
bad_fork_free_pid:
cgroup_threadgroup_change_end(current);
if (pid != &init_struct_pid)
@@ -2178,7 +2277,7 @@ static inline void init_idle_pids(struct task_struct *idle)
struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
- task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
+ task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
cpu_to_node(cpu));
if (!IS_ERR(task)) {
init_idle_pids(task);
@@ -2225,7 +2324,7 @@ long _do_fork(unsigned long clone_flags,
trace = 0;
}
- p = copy_process(clone_flags, stack_start, stack_size,
+ p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
add_latent_entropy();
--
1.8.3.1
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]
Rework the flushing of proc to use a list of directory inodes that
need to be flushed.
The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.
This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.
This flushing remains an optimization. The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died. When unused the shrinker will
eventually be able to remove these dentries.
There is a case in de_thread where proc_flush_pid can be
called early for a given pid. Which winds up being
safe (if suboptimal) as this is just an optiimization.
Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.
So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid. Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place. This removes a small race
where a dentry could recreated.
As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.
The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.
v2: Initialize pid->inodes. I somehow failed to get that
initialization into the initial version of the patch. A boot
failure was reported by "kernel test robot <lkp(a)intel.com>", and
failure to initialize that pid->inodes matches all of the reported
symptoms.
Signed-off-by: Eric W. Biederman <ebiederm(a)xmission.com>
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid
namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush
entries from multiple proc trees")
Cc: <stable(a)vger.kernel.org> # 4.9.x: b3e583825266: clone: add CLONE_PIDFD
Cc: <stable(a)vger.kernel.org> # 4.9.x: b53b0b9d9a61: pidfd: add polling
support
Cc: <stable(a)vger.kernel.org> # 4.9.x: db978da8fa1d: proc: Pass file mode to
proc_pid_make_inode
Cc: <stable(a)vger.kernel.org> # 4.9.x: 68eb94f16227: proc: Better ownership of
files for non-dumpable tasks in user namespaces
Cc: <stable(a)vger.kernel.org> # 4.9.x: e3912ac37e07: proc: use %u for pid
printing and slightly less stack
Cc: <stable(a)vger.kernel.org> # 4.9.x: 0afa5ca82212: proc: Rename in
proc_inode rename sysctl_inodes sibling_inodes
Cc: <stable(a)vger.kernel.org> # 4.9.x: 26dbc60f385f: proc: Generalize
proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc: <stable(a)vger.kernel.org> # 4.9.x: 71448011ea2a: proc: Clear the pieces of
proc_inode that proc_evict_inode cares about
Cc: <stable(a)vger.kernel.org> # 4.9.x: f90f3cafe8d5: Use d_invalidate in
proc_prune_siblings_dcache
Cc: <stable(a)vger.kernel.org> # 4.9.x
(proc: fix up cherry-pick conflicts for 7bc3e6e55acf)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/base.c | 111 ++++++++++++++++--------------------------------
fs/proc/inode.c | 2 +-
fs/proc/internal.h | 1 +
include/linux/pid.h | 1 +
include/linux/proc_fs.h | 4 +-
kernel/exit.c | 5 ++-
kernel/pid.c | 1 +
7 files changed, 45 insertions(+), 80 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3502a40..11caf35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t mode,
*rgid = gid;
}
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+ struct pid *pid = ei->pid;
+
+ if (S_ISDIR(ei->vfs_inode.i_mode)) {
+ spin_lock(&pid->wait_pidfd.lock);
+ hlist_del_init_rcu(&ei->sibling_inodes);
+ spin_unlock(&pid->wait_pidfd.lock);
+ }
+
+ put_pid(pid);
+}
+
struct inode *proc_pid_make_inode(struct super_block * sb,
struct task_struct *task, umode_t mode)
{
struct inode * inode;
struct proc_inode *ei;
+ struct pid *pid;
/* We need a new inode */
@@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
/*
* grab the reference to task.
*/
- ei->pid = get_task_pid(task, PIDTYPE_PID);
- if (!ei->pid)
+ pid = get_task_pid(task, PIDTYPE_PID);
+ if (!pid)
goto out_unlock;
+ /* Let the pid remember us for quick removal */
+ ei->pid = pid;
+ if (S_ISDIR(mode)) {
+ spin_lock(&pid->wait_pidfd.lock);
+ hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+ spin_unlock(&pid->wait_pidfd.lock);
+ }
+
task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
security_task_to_inode(task, inode);
@@ -3015,90 +3037,29 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
.permission = proc_pid_permission,
};
-static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
-{
- struct dentry *dentry, *leader, *dir;
- char buf[10 + 1];
- struct qstr name;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", pid);
- /* no ->d_hash() rejects on procfs */
- dentry = d_hash_and_lookup(mnt->mnt_root, &name);
- if (dentry) {
- d_invalidate(dentry);
- dput(dentry);
- }
-
- if (pid == tgid)
- return;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", tgid);
- leader = d_hash_and_lookup(mnt->mnt_root, &name);
- if (!leader)
- goto out;
-
- name.name = "task";
- name.len = strlen(name.name);
- dir = d_hash_and_lookup(leader, &name);
- if (!dir)
- goto out_put_leader;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", pid);
- dentry = d_hash_and_lookup(dir, &name);
- if (dentry) {
- d_invalidate(dentry);
- dput(dentry);
- }
-
- dput(dir);
-out_put_leader:
- dput(leader);
-out:
- return;
-}
-
/**
- * proc_flush_task - Remove dcache entries for @task from the /proc dcache.
- * @task: task that should be flushed.
+ * proc_flush_pid - Remove dcache entries for @pid from the /proc dcache.
+ * @pid: pid that should be flushed.
*
- * When flushing dentries from proc, one needs to flush them from global
- * proc (proc_mnt) and from all the namespaces' procs this task was seen
- * in. This call is supposed to do all of this job.
- *
- * Looks in the dcache for
- * /proc/@pid
- * /proc/@tgid/task/@pid
- * if either directory is present flushes it and all of it'ts children
- * from the dcache.
+ * This function walks a list of inodes (that belong to any proc
+ * filesystem) that are attached to the pid and flushes them from
+ * the dentry cache.
*
* It is safe and reasonable to cache /proc entries for a task until
* that task exits. After that they just clog up the dcache with
* useless entries, possibly causing useful dcache entries to be
- * flushed instead. This routine is proved to flush those useless
- * dcache entries at process exit time.
+ * flushed instead. This routine is provided to flush those useless
+ * dcache entries when a process is reaped.
*
* NOTE: This routine is just an optimization so it does not guarantee
- * that no dcache entries will exist at process exit time it
- * just makes it very unlikely that any will persist.
+ * that no dcache entries will exist after a process is reaped
+ * it just makes it very unlikely that any will persist.
*/
-void proc_flush_task(struct task_struct *task)
+void proc_flush_pid(struct pid *pid)
{
- int i;
- struct pid *pid, *tgid;
- struct upid *upid;
-
- pid = task_pid(task);
- tgid = task_tgid(task);
-
- for (i = 0; i <= pid->level; i++) {
- upid = &pid->numbers[i];
- proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
- tgid->numbers[i].nr);
- }
+ proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock);
+ put_pid(pid);
}
static int proc_pid_instantiate(struct inode *dir,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2af9f4f..8503444 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -39,7 +39,7 @@ static void proc_evict_inode(struct inode *inode)
/* Stop tracking associated processes */
if (ei->pid) {
- put_pid(ei->pid);
+ proc_pid_evict_inode(ei);
ei->pid = NULL;
}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 6a1d679..0c6ca639 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -151,6 +151,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
extern const struct dentry_operations pid_dentry_operations;
extern int pid_getattr(struct vfsmount *, struct dentry *, struct kstat *);
extern int proc_setattr(struct dentry *, struct iattr *);
+extern void proc_pid_evict_inode(struct proc_inode *);
extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
extern int pid_revalidate(struct dentry *, unsigned int);
extern int pid_delete_dentry(const struct dentry *);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index f5552ba..04b4aaa 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -63,6 +63,7 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ struct hlist_head inodes;
/* wait queue for pidfd notifications */
wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2e..d3580f5 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -12,7 +12,7 @@
#ifdef CONFIG_PROC_FS
extern void proc_root_init(void);
-extern void proc_flush_task(struct task_struct *);
+extern void proc_flush_pid(struct pid *);
extern struct proc_dir_entry *proc_symlink(const char *,
struct proc_dir_entry *, const char *);
@@ -48,7 +48,7 @@ static inline void proc_root_init(void)
{
}
-static inline void proc_flush_task(struct task_struct *task)
+static inline void proc_flush_pid(struct pid *pid)
{
}
diff --git a/kernel/exit.c b/kernel/exit.c
index f9943ef..5e66030 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -168,6 +168,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
void release_task(struct task_struct *p)
{
struct task_struct *leader;
+ struct pid *thread_pid;
int zap_leader;
repeat:
/* don't need to get the RCU readlock here - the process is dead and
@@ -176,10 +177,9 @@ void release_task(struct task_struct *p)
atomic_dec(&__task_cred(p)->user->processes);
rcu_read_unlock();
- proc_flush_task(p);
-
write_lock_irq(&tasklist_lock);
ptrace_release_task(p);
+ thread_pid = get_pid(p->pids[PIDTYPE_PID].pid);
__exit_signal(p);
/*
@@ -202,6 +202,7 @@ void release_task(struct task_struct *p)
}
write_unlock_irq(&tasklist_lock);
+ proc_flush_pid(thread_pid);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);
diff --git a/kernel/pid.c b/kernel/pid.c
index e605398..fb32a81 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -334,6 +334,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
INIT_HLIST_HEAD(&pid->tasks[type]);
init_waitqueue_head(&pid->wait_pidfd);
+ INIT_HLIST_HEAD(&pid->inodes);
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
--
1.8.3.1
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit f90f3cafe8d56d593fc509a4185da1d5800efea4 ]
The function d_prune_aliases has the problem that it will only prune
aliases thare are completely unused. It will not remove aliases for
the dcache or even think of removing mounts from the dcache. For that
behavior d_invalidate is needed.
To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.
For completeness the directory and the non-directory cases are
separated because in theory (although not in currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.
Plus the differences between d_find_any_alias and d_find_alias makes
it clear why the directory and non-directory code and not share code.
To make it clear these routines now invalidate dentries rename
proc_prune_siblings_dache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache proc_sys_invalidate_dcache.
V2: Split the directory and non-directory cases. To make this
code robust to future changes in proc.
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: <stable(a)vger.kernel.org> # 4.9.x
(proc: fix up cherry-pick conflicts for f90f3cafe8d5)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/inode.c | 16 ++++++++++++++--
fs/proc/internal.h | 2 +-
fs/proc/proc_sysctl.c | 8 ++++----
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 739fb9c..2af9f4f 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -107,7 +107,7 @@ void __init proc_init_inodecache(void)
init_once);
}
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
{
struct inode *inode;
struct proc_inode *ei;
@@ -136,7 +136,19 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
continue;
}
- d_prune_aliases(inode);
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *dir = d_find_any_alias(inode);
+ if (dir) {
+ d_invalidate(dir);
+ dput(dir);
+ }
+ } else {
+ struct dentry *dentry;
+ while ((dentry = d_find_alias(inode))) {
+ d_invalidate(dentry);
+ dput(dentry);
+ }
+ }
iput(inode);
deactivate_super(sb);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9bc44a1..6a1d679 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,7 +200,7 @@ struct pde_opener {
extern const struct inode_operations proc_pid_link_inode_operations;
extern void proc_init_inodecache(void);
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
extern int proc_fill_super(struct super_block *, void *data, int flags);
extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index f19063b..b6668a5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -260,9 +260,9 @@ static void unuse_table(struct ctl_table_header *p)
complete(p->unregistering);
}
-static void proc_sys_prune_dcache(struct ctl_table_header *head)
+static void proc_sys_invalidate_dcache(struct ctl_table_header *head)
{
- proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
+ proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock);
}
/* called under sysctl_lock, will reacquire if has to wait */
@@ -284,10 +284,10 @@ static void start_unregistering(struct ctl_table_header *p)
spin_unlock(&sysctl_lock);
}
/*
- * Prune dentries for unregistered sysctls: namespaced sysctls
+ * Invalidate dentries for unregistered sysctls: namespaced sysctls
* can have duplicate names and contaminate dcache very badly.
*/
- proc_sys_prune_dcache(p);
+ proc_sys_invalidate_dcache(p);
/*
* do not remove from the list until nobody holds it; walking the
* list in do_sysctl() relies on that.
--
1.8.3.1
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit 71448011ea2a1cd36d8f5cbdab0ed716c454d565 ]
This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
indoes when a process dies that caused me to tidy this up.
The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: <stable(a)vger.kernel.org> # 4.9.x
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/inode.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 920c761..739fb9c 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -32,21 +32,27 @@ static void proc_evict_inode(struct inode *inode)
{
struct proc_dir_entry *de;
struct ctl_table_header *head;
+ struct proc_inode *ei = PROC_I(inode);
truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
/* Stop tracking associated processes */
- put_pid(PROC_I(inode)->pid);
+ if (ei->pid) {
+ put_pid(ei->pid);
+ ei->pid = NULL;
+ }
/* Let go of any associated proc directory entry */
- de = PDE(inode);
- if (de)
+ de = ei->pde;
+ if (de) {
pde_put(de);
+ ei->pde = NULL;
+ }
- head = PROC_I(inode)->sysctl;
+ head = ei->sysctl;
if (head) {
- RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
+ RCU_INIT_POINTER(ei->sysctl, NULL);
proc_sys_evict_inode(inode, head);
}
}
--
1.8.3.1
From: "Joel Fernandes (Google)" <joel(a)joelfernandes.org>
[ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ]
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.
For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be option in this case because the task can
be reaped and destroyed before the poll returns.
2. By including the struct pid for the waitqueue means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Daniel Colascione <dancol(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Tim Murray <timmurray(a)google.com>
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: kernel-team(a)android.com
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: <stable(a)vger.kernel.org> # 4.9.x
(pidfd: fix up cherry-pick conflicts for b53b0b9d9a61)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
include/linux/pid.h | 3 +++
kernel/fork.c | 26 ++++++++++++++++++++++++++
kernel/pid.c | 2 ++
kernel/signal.c | 11 +++++++++++
4 files changed, 42 insertions(+)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7599a78..f5552ba 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -2,6 +2,7 @@
#define _LINUX_PID_H
#include <linux/rcupdate.h>
+#include <linux/wait.h>
enum pid_type
{
@@ -62,6 +63,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ /* wait queue for pidfd notifications */
+ wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
diff --git a/kernel/fork.c b/kernel/fork.c
index 4249f60..e3a4a14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1481,8 +1481,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+ struct task_struct *task;
+ struct pid *pid = file->private_data;
+ int poll_flags = 0;
+
+ poll_wait(file, &pid->wait_pidfd, pts);
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ /*
+ * Inform pollers only when the whole thread group exits.
+ * If the thread group leader exits before all other threads in the
+ * group, then poll(2) should block, similar to the wait(2) family.
+ */
+ if (!task || (task->exit_state && thread_group_empty(task)))
+ poll_flags = POLLIN | POLLRDNORM;
+ rcu_read_unlock();
+
+ return poll_flags;
+}
+
const struct file_operations pidfd_fops = {
.release = pidfd_release,
+ .poll = pidfd_poll,
#ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
#endif
diff --git a/kernel/pid.c b/kernel/pid.c
index fa704f8..e605398 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -333,6 +333,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index bedca16..053de87a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1632,6 +1632,14 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
return ret;
}
+static void do_notify_pidfd(struct task_struct *task)
+{
+ struct pid *pid;
+
+ pid = task_pid(task);
+ wake_up_all(&pid->wait_pidfd);
+}
+
/*
* Let a parent know about the death of a child.
* For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1655,6 +1663,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
+ /* Wake up all pidfd waiters */
+ do_notify_pidfd(tsk);
+
if (sig != SIGCHLD) {
/*
* This is only possible if parent == real_parent.
--
1.8.3.1
From: Christian Brauner <christian(a)brauner.io>
[ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call. Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call. As
spotted by Linus, there is exactly one bit for clone() left.
CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending). Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege. This does not imply it
cannot ever be that way but for now this is not the case.
A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone. This has the advantage that we can
give back the associated pid and the pidfd at the same time.
To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Co-developed-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: David Howells <dhowells(a)redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages(a)gmail.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: <stable(a)vger.kernel.org> # 4.9.x
(clone: fix up cherry-pick conflicts for b3e583825266)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
include/linux/pid.h | 1 +
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 105 +++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 103 insertions(+), 4 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 97b745d..7599a78 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,7 @@ struct pid_link
struct hlist_node node;
struct pid *pid;
};
+extern const struct file_operations pidfd_fops;
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..ed6e31d 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -9,6 +9,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD 0x00001000 /* set if a pidfd should be placed in parent */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index b64efec..4249f60 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/
+#include <linux/anon_inodes.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/unistd.h>
@@ -1460,6 +1461,58 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
task->pids[type].pid = pid;
}
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+
+ file->private_data = NULL;
+ put_pid(pid);
+ return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+ struct pid_namespace *ns = file_inode(m->file)->i_sb->s_fs_info;
+ struct pid *pid = f->private_data;
+
+ seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+ seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+ .release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid: struct pid that the pidfd will reference
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * been unshared to avoid leaking the pidfd to the new process.
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid)
+{
+ int fd;
+
+ fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid),
+ O_RDWR | O_CLOEXEC);
+ if (fd < 0)
+ put_pid(pid);
+
+ return fd;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -1472,13 +1525,14 @@ static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
+ int __user *parent_tidptr,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
- int retval;
+ int pidfd = -1, retval;
struct task_struct *p;
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
@@ -1526,6 +1580,30 @@ static __latent_entropy struct task_struct *copy_process(
retval = security_task_create(clone_flags);
if (retval)
goto fork_out;
+ if (clone_flags & CLONE_PIDFD) {
+ int reserved;
+
+ /*
+ * - CLONE_PARENT_SETTID is useless for pidfds and also
+ * parent_tidptr is used to return pidfds.
+ * - CLONE_DETACHED is blocked so that we can potentially
+ * reuse it later for CLONE_PIDFD.
+ * - CLONE_THREAD is blocked until someone really needs it.
+ */
+ if (clone_flags &
+ (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Verify that parent_tidptr is sane so we can potentially
+ * reuse it later.
+ */
+ if (get_user(reserved, parent_tidptr))
+ return ERR_PTR(-EFAULT);
+
+ if (reserved != 0)
+ return ERR_PTR(-EINVAL);
+ }
retval = -ENOMEM;
p = dup_task_struct(current, node);
@@ -1703,6 +1781,22 @@ static __latent_entropy struct task_struct *copy_process(
}
}
+ /*
+ * This has to happen after we've potentially unshared the file
+ * descriptor table (so that the pidfd doesn't leak into the child
+ * if the fd table isn't shared).
+ */
+ if (clone_flags & CLONE_PIDFD) {
+ retval = pidfd_create(pid);
+ if (retval < 0)
+ goto bad_fork_free_pid;
+
+ pidfd = retval;
+ retval = put_user(pidfd, parent_tidptr);
+ if (retval)
+ goto bad_fork_put_pidfd;
+ }
+
#ifdef CONFIG_BLOCK
p->plug = NULL;
#endif
@@ -1758,7 +1852,7 @@ static __latent_entropy struct task_struct *copy_process(
*/
retval = cgroup_can_fork(p);
if (retval)
- goto bad_fork_free_pid;
+ goto bad_fork_put_pidfd;
/*
* From this point on we must avoid any synchronous user-space
@@ -1869,6 +1963,9 @@ static __latent_entropy struct task_struct *copy_process(
spin_unlock(¤t->sighand->siglock);
write_unlock_irq(&tasklist_lock);
cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+ if (clone_flags & CLONE_PIDFD)
+ __close_fd(current->files, pidfd);
bad_fork_free_pid:
threadgroup_change_end(current);
if (pid != &init_struct_pid)
@@ -1928,7 +2025,7 @@ static inline void init_idle_pids(struct pid_link *links)
struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
- task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
+ task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
cpu_to_node(cpu));
if (!IS_ERR(task)) {
init_idle_pids(task->pids);
@@ -1973,7 +2070,7 @@ long _do_fork(unsigned long clone_flags,
trace = 0;
}
- p = copy_process(clone_flags, stack_start, stack_size,
+ p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
add_latent_entropy();
/*
--
1.8.3.1
This is the start of the stable review cycle for the 5.10.3 release.
There are 40 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri, 25 Dec 2020 15:05:02 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.3-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.10.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.10.3-rc1
Dae R. Jeong <dae.r.jeong(a)kaist.ac.kr>
md: fix a warning caused by a race between concurrent md_ioctl()s
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
nl80211: validate key indexes for cfg80211_registered_device
Eric Biggers <ebiggers(a)google.com>
crypto: af_alg - avoid undefined behavior accessing salg_name
Antti Palosaari <crope(a)iki.fi>
media: msi2500: assign SPI bus number dynamically
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
fs: quota: fix array-index-out-of-bounds bug by passing correct argument to vfs_cleanup_quota_inode()
Jan Kara <jack(a)suse.cz>
quota: Sanity-check quota file headers on load
Peilin Ye <yepeilin.cs(a)gmail.com>
Bluetooth: Fix slab-out-of-bounds read in hci_le_direct_adv_report_evt()
Eric Biggers <ebiggers(a)google.com>
f2fs: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
ext4: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
ubifs: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
fscrypt: add fscrypt_is_nokey_name()
Eric Biggers <ebiggers(a)google.com>
fscrypt: remove kernel-internal constants from UAPI header
Alexey Kardashevskiy <aik(a)ozlabs.ru>
serial_core: Check for port state when tty is in error state
Julian Sax <jsbc(a)gmx.de>
HID: i2c-hid: add Vero K147 to descriptor override
Arnd Bergmann <arnd(a)arndb.de>
scsi: megaraid_sas: Check user-provided offsets
Jack Qiu <jack.qiu(a)huawei.com>
f2fs: init dirty_secmap incorrectly
Chao Yu <chao(a)kernel.org>
f2fs: fix to seek incorrect data offset in inline data file
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: etm4x: Handle TRCVIPCSSCTLR accesses
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: etm4x: Fix accesses to TRCPROCSELR
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: etm4x: Fix accesses to TRCCIDCTLR1
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: etm4x: Fix accesses to TRCVMIDCTLR1
Sai Prakash Ranjan <saiprakash.ranjan(a)codeaurora.org>
coresight: etm4x: Skip setting LPOVERRIDE bit for qcom, skip-power-up
Sai Prakash Ranjan <saiprakash.ranjan(a)codeaurora.org>
coresight: etb10: Fix possible NULL ptr dereference in etb_enable_perf()
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: tmc-etr: Fix barrier packet insertion for perf buffer
Mao Jinlong <jinlmao(a)codeaurora.org>
coresight: tmc-etr: Check if page is valid before dma_map_page()
Sai Prakash Ranjan <saiprakash.ranjan(a)codeaurora.org>
coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
Krzysztof Kozlowski <krzk(a)kernel.org>
ARM: dts: exynos: fix USB 3.0 pins supply being turned off on Odroid XU
Krzysztof Kozlowski <krzk(a)kernel.org>
ARM: dts: exynos: fix USB 3.0 VBUS control and over-current pins on Exynos5410
Krzysztof Kozlowski <krzk(a)kernel.org>
ARM: dts: exynos: fix roles of USB 3.0 ports on Odroid XU
Fabio Estevam <festevam(a)gmail.com>
usb: chipidea: ci_hdrc_imx: Pass DISABLE_DEVICE_STREAMING flag to imx6ul
Will McVicker <willmcvicker(a)google.com>
USB: gadget: f_rndis: fix bitrate for SuperSpeed and above
Jack Pham <jackp(a)codeaurora.org>
usb: gadget: f_fs: Re-use SS descriptors for SuperSpeedPlus
Will McVicker <willmcvicker(a)google.com>
USB: gadget: f_midi: setup SuperSpeed Plus descriptors
taehyun.cho <taehyun.cho(a)samsung.com>
USB: gadget: f_acm: add support for SuperSpeed Plus
Johan Hovold <johan(a)kernel.org>
USB: serial: option: add interface-number sanity check to flag handling
Dan Carpenter <dan.carpenter(a)oracle.com>
usb: mtu3: fix memory corruption in mtu3_debugfs_regset()
Nicolin Chen <nicoleotsuka(a)gmail.com>
soc/tegra: fuse: Fix index bug in get_process_id
Artem Labazov <123321artyom(a)gmail.com>
exfat: Avoid allocating upcase table using kcalloc()
Andi Kleen <ak(a)linux.intel.com>
x86/split-lock: Avoid returning with interrupts enabled
Thierry Reding <treding(a)nvidia.com>
net: ipconfig: Avoid spurious blank lines in boot log
-------------
Diffstat:
Makefile | 4 +-
arch/arm/boot/dts/exynos5410-odroidxu.dts | 6 ++-
arch/arm/boot/dts/exynos5410-pinctrl.dtsi | 28 ++++++++++++
arch/arm/boot/dts/exynos5410.dtsi | 4 ++
arch/x86/kernel/traps.c | 3 +-
crypto/af_alg.c | 10 +++--
drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c | 8 ++++
drivers/hwtracing/coresight/coresight-etb10.c | 4 +-
drivers/hwtracing/coresight/coresight-etm4x-core.c | 41 ++++++++++-------
drivers/hwtracing/coresight/coresight-priv.h | 2 +
drivers/hwtracing/coresight/coresight-tmc-etf.c | 4 +-
drivers/hwtracing/coresight/coresight-tmc-etr.c | 4 +-
drivers/md/md.c | 7 ++-
drivers/media/usb/msi2500/msi2500.c | 2 +-
drivers/scsi/megaraid/megaraid_sas_base.c | 16 ++++---
drivers/soc/tegra/fuse/speedo-tegra210.c | 2 +-
drivers/tty/serial/serial_core.c | 4 ++
drivers/usb/chipidea/ci_hdrc_imx.c | 3 +-
drivers/usb/gadget/function/f_acm.c | 2 +-
drivers/usb/gadget/function/f_fs.c | 5 ++-
drivers/usb/gadget/function/f_midi.c | 6 +++
drivers/usb/gadget/function/f_rndis.c | 4 +-
drivers/usb/mtu3/mtu3_debugfs.c | 2 +-
drivers/usb/serial/option.c | 23 +++++++++-
fs/crypto/fscrypt_private.h | 9 ++--
fs/crypto/hooks.c | 5 ++-
fs/crypto/keyring.c | 2 +-
fs/crypto/keysetup.c | 4 +-
fs/crypto/policy.c | 5 ++-
fs/exfat/nls.c | 6 +--
fs/ext4/namei.c | 3 ++
fs/f2fs/f2fs.h | 2 +
fs/f2fs/file.c | 11 +++--
fs/f2fs/segment.c | 2 +-
fs/quota/dquot.c | 2 +-
fs/quota/quota_v2.c | 19 ++++++++
fs/ubifs/dir.c | 17 ++++++--
include/linux/fscrypt.h | 34 +++++++++++++++
include/uapi/linux/fscrypt.h | 5 +--
include/uapi/linux/if_alg.h | 16 +++++++
net/bluetooth/hci_event.c | 12 +++--
net/ipv4/ipconfig.c | 14 +++---
net/wireless/core.h | 2 +
net/wireless/nl80211.c | 7 +--
net/wireless/util.c | 51 ++++++++++++++++++----
45 files changed, 334 insertions(+), 88 deletions(-)
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 89deb1334252ea4a8491d47654811e28b0790364 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:37 +0100
Subject: [PATCH] iio:magnetometer:mag3110: Fix alignment and data leak issues.
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp() assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable structure in the iio_priv() data.
This data is allocated with kzalloc() so no data can leak apart from
previous readings.
The explicit alignment of ts is not necessary in this case but
does make the code slightly less fragile so I have included it.
Fixes: 39631b5f9584 ("iio: Add Freescale mag3110 magnetometer driver")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-4-jic23@kernel.org
diff --git a/drivers/iio/magnetometer/mag3110.c b/drivers/iio/magnetometer/mag3110.c
index 838b13c8bb3d..c96415a1aead 100644
--- a/drivers/iio/magnetometer/mag3110.c
+++ b/drivers/iio/magnetometer/mag3110.c
@@ -56,6 +56,12 @@ struct mag3110_data {
int sleep_val;
struct regulator *vdd_reg;
struct regulator *vddio_reg;
+ /* Ensure natural alignment of timestamp */
+ struct {
+ __be16 channels[3];
+ u8 temperature;
+ s64 ts __aligned(8);
+ } scan;
};
static int mag3110_request(struct mag3110_data *data)
@@ -387,10 +393,9 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct mag3110_data *data = iio_priv(indio_dev);
- u8 buffer[16]; /* 3 16-bit channels + 1 byte temp + padding + ts */
int ret;
- ret = mag3110_read(data, (__be16 *) buffer);
+ ret = mag3110_read(data, data->scan.channels);
if (ret < 0)
goto done;
@@ -399,10 +404,10 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
MAG3110_DIE_TEMP);
if (ret < 0)
goto done;
- buffer[6] = ret;
+ data->scan.temperature = ret;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buffer,
+ iio_push_to_buffers_with_timestamp(indio_dev, &data->scan,
iio_get_time_ns(indio_dev));
done:
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 7b6b51234df6cd8b04fe736b0b89c25612d896b8 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:39 +0100
Subject: [PATCH] iio:imu:bmi160: Fix alignment and data leak issues
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a9b ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: Daniel Baluta <daniel.baluta(a)gmail.com>
Cc: Daniel Baluta <daniel.baluta(a)oss.nxp.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
diff --git a/drivers/iio/imu/bmi160/bmi160.h b/drivers/iio/imu/bmi160/bmi160.h
index a82e040bd109..32c2ea2d7112 100644
--- a/drivers/iio/imu/bmi160/bmi160.h
+++ b/drivers/iio/imu/bmi160/bmi160.h
@@ -10,6 +10,13 @@ struct bmi160_data {
struct iio_trigger *trig;
struct regulator_bulk_data supplies[2];
struct iio_mount_matrix orientation;
+ /*
+ * Ensure natural alignment for timestamp if present.
+ * Max length needed: 2 * 3 channels + 4 bytes padding + 8 byte ts.
+ * If fewer channels are enabled, less space may be needed, as
+ * long as the timestamp is still aligned to 8 bytes.
+ */
+ __le16 buf[12] __aligned(8);
};
extern const struct regmap_config bmi160_regmap_config;
diff --git a/drivers/iio/imu/bmi160/bmi160_core.c b/drivers/iio/imu/bmi160/bmi160_core.c
index c8e131c29043..290b5ef83f77 100644
--- a/drivers/iio/imu/bmi160/bmi160_core.c
+++ b/drivers/iio/imu/bmi160/bmi160_core.c
@@ -427,8 +427,6 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct bmi160_data *data = iio_priv(indio_dev);
- __le16 buf[12];
- /* 2 sens x 3 axis x __le16 + 2 x __le16 pad + 4 x __le16 tstamp */
int i, ret, j = 0, base = BMI160_REG_DATA_MAGN_XOUT_L;
__le16 sample;
@@ -438,10 +436,10 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
&sample, sizeof(sample));
if (ret)
goto done;
- buf[j++] = sample;
+ data->buf[j++] = sample;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buf, pf->timestamp);
+ iio_push_to_buffers_with_timestamp(indio_dev, data->buf, pf->timestamp);
done:
iio_trigger_notify_done(indio_dev->trig);
return IRQ_HANDLED;
This is a note to let you know that I've just added the patch titled
USB: serial: iuu_phoenix: fix DMA from stack
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 54d0a3ab80f49f19ee916def62fe067596833403 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan(a)kernel.org>
Date: Mon, 4 Jan 2021 15:50:07 +0100
Subject: USB: serial: iuu_phoenix: fix DMA from stack
Stack-allocated buffers cannot be used for DMA (on all architectures) so
allocate the flush command buffer using kmalloc().
Fixes: 60a8fc017103 ("USB: add iuu_phoenix driver")
Cc: stable <stable(a)vger.kernel.org> # 2.6.25
Reviewed-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/usb/serial/iuu_phoenix.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/usb/serial/iuu_phoenix.c b/drivers/usb/serial/iuu_phoenix.c
index f1201d4de297..e8f06b41a503 100644
--- a/drivers/usb/serial/iuu_phoenix.c
+++ b/drivers/usb/serial/iuu_phoenix.c
@@ -532,23 +532,29 @@ static int iuu_uart_flush(struct usb_serial_port *port)
struct device *dev = &port->dev;
int i;
int status;
- u8 rxcmd = IUU_UART_RX;
+ u8 *rxcmd;
struct iuu_private *priv = usb_get_serial_port_data(port);
if (iuu_led(port, 0xF000, 0, 0, 0xFF) < 0)
return -EIO;
+ rxcmd = kmalloc(1, GFP_KERNEL);
+ if (!rxcmd)
+ return -ENOMEM;
+
+ rxcmd[0] = IUU_UART_RX;
+
for (i = 0; i < 2; i++) {
- status = bulk_immediate(port, &rxcmd, 1);
+ status = bulk_immediate(port, rxcmd, 1);
if (status != IUU_OPERATION_OK) {
dev_dbg(dev, "%s - uart_flush_write error\n", __func__);
- return status;
+ goto out_free;
}
status = read_immediate(port, &priv->len, 1);
if (status != IUU_OPERATION_OK) {
dev_dbg(dev, "%s - uart_flush_read error\n", __func__);
- return status;
+ goto out_free;
}
if (priv->len > 0) {
@@ -556,12 +562,16 @@ static int iuu_uart_flush(struct usb_serial_port *port)
status = read_immediate(port, priv->buf, priv->len);
if (status != IUU_OPERATION_OK) {
dev_dbg(dev, "%s - uart_flush_read error\n", __func__);
- return status;
+ goto out_free;
}
}
}
dev_dbg(dev, "%s - uart_flush_read OK!\n", __func__);
iuu_led(port, 0, 0xF000, 0, 0xFF);
+
+out_free:
+ kfree(rxcmd);
+
return status;
}
--
2.30.0
Hi Greg
Please consider adding
commit 71ac13457d9d1007effde65b54818106b2c2b525
Author: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
Date: Fri Dec 18 11:10:54 2020 +0100
rtc: pcf2127: only use watchdog when explicitly available
to the 5.10 stable queue. You will need the preparatory refactoring patch
commit 5d78533a0c53af9659227c803df944ba27cd56e0
Author: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
Date: Thu Sep 24 12:52:55 2020 +0200
rtc: pcf2127: move watchdog initialisation to a separate function
And if documentation is supposed to be kept up-to-date in the -stable
trees, you can pick
commit 320d159e2d63a97a40f24cd6dfda5a57eec65b91
Author: Rasmus Villemoes <rasmus.villemoes(a)prevas.dk>
Date: Fri Dec 18 11:10:53 2020 +0100
dt-bindings: rtc: add reset-source property
Thanks,
Rasmus
This is the start of the stable review cycle for the 4.19.165 release.
There are 29 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 07 Jan 2021 09:08:03 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.165-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.165-rc2
Hyeongseok Kim <hyeongseok(a)gmail.com>
dm verity: skip verity work if I/O error when system is shutting down
Takashi Iwai <tiwai(a)suse.de>
ALSA: pcm: Clear the full allocated memory at hw_params
Jessica Yu <jeyu(a)kernel.org>
module: delay kobject uevent until after module init call
Trond Myklebust <trond.myklebust(a)hammerspace.com>
NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode
Qinglang Miao <miaoqinglang(a)huawei.com>
powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
Jan Kara <jack(a)suse.cz>
quota: Don't overflow quota file offsets
Miroslav Benes <mbenes(a)suse.cz>
module: set MODULE_STATE_GOING state when a module fails to load
Dinghao Liu <dinghao.liu(a)zju.edu.cn>
rtc: sun6i: Fix memleak in sun6i_rtc_clk_init
Boqun Feng <boqun.feng(a)gmail.com>
fcntl: Fix potential deadlock in send_sig{io, urg}()
Takashi Iwai <tiwai(a)suse.de>
ALSA: rawmidi: Access runtime->avail always in spinlock
Takashi Iwai <tiwai(a)suse.de>
ALSA: seq: Use bool for snd_seq_queue internal flags
Mauro Carvalho Chehab <mchehab+huawei(a)kernel.org>
media: gp8psk: initialize stats at power control logic
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
Rustam Kovhaev <rkovhaev(a)gmail.com>
reiserfs: add check for an invalid ih_entry_count
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
Bluetooth: hci_h5: close serdev device and free hu in h5_close
Johan Hovold <johan(a)kernel.org>
of: fix linker-section match-table corruption
Damien Le Moal <damien.lemoal(a)wdc.com>
null_blk: Fix zone size initialization
Souptick Joarder <jrdr.linux(a)gmail.com>
xen/gntdev.c: Mark pages as dirty
Christophe Leroy <christophe.leroy(a)csgroup.eu>
powerpc/bitops: Fix possible undefined behaviour with fls() and fls64()
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: SVM: relax conditions for allowing MSR_IA32_SPEC_CTRL accesses
Petr Vorel <petr.vorel(a)gmail.com>
uapi: move constants from <linux/kernel.h> to <linux/const.h>
Jan Kara <jack(a)suse.cz>
ext4: don't remount read-only with errors=continue on reboot
Eric Auger <eric.auger(a)redhat.com>
vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
Eric Biggers <ebiggers(a)google.com>
ubifs: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
f2fs: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
ext4: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
fscrypt: add fscrypt_is_nokey_name()
Kevin Vigor <kvigor(a)gmail.com>
md/raid10: initialize r10_bio->read_slot before use.
-------------
Diffstat:
Makefile | 4 +--
arch/powerpc/include/asm/bitops.h | 23 ++++++++++++++--
arch/powerpc/sysdev/mpic_msgr.c | 2 +-
arch/x86/kvm/cpuid.h | 14 ++++++++++
arch/x86/kvm/svm.c | 9 ++----
arch/x86/kvm/vmx.c | 6 ++--
drivers/block/null_blk_zoned.c | 20 +++++++++-----
drivers/bluetooth/hci_h5.c | 8 ++++--
drivers/md/dm-verity-target.c | 12 +++++++-
drivers/md/raid10.c | 3 +-
drivers/media/usb/dvb-usb/gp8psk.c | 2 +-
drivers/misc/vmw_vmci/vmci_context.c | 2 +-
drivers/rtc/rtc-sun6i.c | 8 ++++--
drivers/vfio/pci/vfio_pci.c | 3 +-
drivers/xen/gntdev.c | 17 ++++++++----
fs/crypto/hooks.c | 10 +++----
fs/ext4/namei.c | 3 ++
fs/ext4/super.c | 14 ++++------
fs/f2fs/f2fs.h | 2 ++
fs/fcntl.c | 10 ++++---
fs/nfs/nfs4super.c | 2 +-
fs/nfs/pnfs.c | 33 ++++++++++++++++++++--
fs/nfs/pnfs.h | 5 ++++
fs/quota/quota_tree.c | 8 +++---
fs/reiserfs/stree.c | 6 ++++
fs/ubifs/dir.c | 17 +++++++++---
include/linux/fscrypt_notsupp.h | 5 ++++
include/linux/fscrypt_supp.h | 29 +++++++++++++++++++
include/linux/of.h | 1 +
include/uapi/linux/const.h | 5 ++++
include/uapi/linux/ethtool.h | 2 +-
include/uapi/linux/kernel.h | 9 +-----
include/uapi/linux/lightnvm.h | 2 +-
include/uapi/linux/mroute6.h | 2 +-
include/uapi/linux/netfilter/x_tables.h | 2 +-
include/uapi/linux/netlink.h | 2 +-
include/uapi/linux/sysctl.h | 2 +-
kernel/module.c | 6 ++--
sound/core/pcm_native.c | 9 ++++--
sound/core/rawmidi.c | 49 +++++++++++++++++++++++----------
sound/core/seq/seq_queue.h | 8 +++---
41 files changed, 275 insertions(+), 101 deletions(-)
Stack-allocated buffers cannot be used for DMA (on all architectures).
Replace the HP-channel macro with a helper function that allocates a
dedicated transfer buffer so that it can continue to be used with
arguments from the stack.
Note that the buffer is cleared on allocation as usblp_ctrl_msg()
returns success also on short transfers (the buffer is only used for
debugging).
Cc: stable(a)vger.kernel.org
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/usb/class/usblp.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/class/usblp.c b/drivers/usb/class/usblp.c
index 67cbd42421be..134dc2005ce9 100644
--- a/drivers/usb/class/usblp.c
+++ b/drivers/usb/class/usblp.c
@@ -274,8 +274,25 @@ static int usblp_ctrl_msg(struct usblp *usblp, int request, int type, int dir, i
#define usblp_reset(usblp)\
usblp_ctrl_msg(usblp, USBLP_REQ_RESET, USB_TYPE_CLASS, USB_DIR_OUT, USB_RECIP_OTHER, 0, NULL, 0)
-#define usblp_hp_channel_change_request(usblp, channel, buffer) \
- usblp_ctrl_msg(usblp, USBLP_REQ_HP_CHANNEL_CHANGE_REQUEST, USB_TYPE_VENDOR, USB_DIR_IN, USB_RECIP_INTERFACE, channel, buffer, 1)
+static int usblp_hp_channel_change_request(struct usblp *usblp, int channel, u8 *new_channel)
+{
+ u8 *buf;
+ int ret;
+
+ buf = kzalloc(1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ ret = usblp_ctrl_msg(usblp, USBLP_REQ_HP_CHANNEL_CHANGE_REQUEST,
+ USB_TYPE_VENDOR, USB_DIR_IN, USB_RECIP_INTERFACE,
+ channel, buf, 1);
+ if (ret == 0)
+ *new_channel = buf[0];
+
+ kfree(buf);
+
+ return ret;
+}
/*
* See the description for usblp_select_alts() below for the usage
--
2.26.2
It is possible to exit the nested guest mode, entered by
svm_set_nested_state prior to first vm entry to it (e.g due to pending event)
if the nested run was not pending during the migration.
In this case we must not switch to the nested msr permission bitmap.
Also add a warning to catch similar cases in the future.
CC: stable(a)vger.kernel.org
Fixes: a7d5c7ce41ac1 ("KVM: nSVM: delay MSR permission processing to first nested VM run")
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
---
arch/x86/kvm/svm/nested.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 18b71e73a9935..6208d3a5a3fdb 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -199,6 +199,10 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (WARN_ON_ONCE(!is_guest_mode(&svm->vcpu)))
+ return false;
+
if (!nested_svm_vmrun_msrpm(svm)) {
vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
vcpu->run->internal.suberror =
@@ -598,6 +602,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
svm->nested.vmcb12_gpa = 0;
WARN_ON_ONCE(svm->nested.nested_run_pending);
+ kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, &svm->vcpu);
+
/* in case we halted in L2 */
svm->vcpu.arch.mp_state = KVM_MP_STATE_RUNNABLE;
--
2.26.2
The old sync_core_before_usermode() comments said that a non-icache-syncing
return-to-usermode instruction is x86-specific and that all other
architectures automatically notice cross-modified code on return to
userspace. Based on my general understanding of how CPUs work and based on
my atttempt to read the ARM manual, this is not true at all. In fact, x86
seems to be a bit of an anomaly in the other direction: x86's IRET is
unusually heavyweight for a return-to-usermode instruction.
So let's drop any pretense that we can have a generic way implementation
behind membarrier's SYNC_CORE flush and require all architectures that opt
in to supply their own. This means x86, arm64, and powerpc for now. Let's
also rename the function from sync_core_before_usermode() to
membarrier_sync_core_before_usermode() because the precise flushing details
may very well be specific to membarrier, and even the concept of
"sync_core" in the kernel is mostly an x86-ism.
I admit that I'm rather surprised that the code worked at all on arm64,
and I'm suspicious that it has never been very well tested. My apologies
for not reviewing this more carefully in the first place.
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: linuxppc-dev(a)lists.ozlabs.org
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: x86(a)kernel.org
Cc: stable(a)vger.kernel.org
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
Hi arm64 and powerpc people-
This is part of a series here:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/f…
Before I send out the whole series, I'm hoping that some arm64 and powerpc
people can help me verify that I did this patch right. Once I get
some feedback on this patch, I'll send out the whole pile. And once
*that's* done, I'll start giving the mm lazy stuff some serious thought.
The x86 part is already fixed in Linus' tree.
Thanks,
Andy
arch/arm64/include/asm/sync_core.h | 21 +++++++++++++++++++++
arch/powerpc/include/asm/sync_core.h | 20 ++++++++++++++++++++
arch/x86/Kconfig | 1 -
arch/x86/include/asm/sync_core.h | 7 +++----
include/linux/sched/mm.h | 1 -
include/linux/sync_core.h | 21 ---------------------
init/Kconfig | 3 ---
kernel/sched/membarrier.c | 15 +++++++++++----
8 files changed, 55 insertions(+), 34 deletions(-)
create mode 100644 arch/arm64/include/asm/sync_core.h
create mode 100644 arch/powerpc/include/asm/sync_core.h
delete mode 100644 include/linux/sync_core.h
diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
new file mode 100644
index 000000000000..5be4531caabd
--- /dev/null
+++ b/arch/arm64/include/asm/sync_core.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_SYNC_CORE_H
+#define _ASM_ARM64_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+ /*
+ * XXX: is this enough or do we need a DMB first to make sure that
+ * writes from other CPUs become visible to this CPU? We have an
+ * smp_mb() already, but that's not quite the same thing.
+ */
+ isb();
+}
+
+#endif /* _ASM_ARM64_SYNC_CORE_H */
diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
new file mode 100644
index 000000000000..71dfbe7794e5
--- /dev/null
+++ b/arch/powerpc/include/asm/sync_core.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SYNC_CORE_H
+#define _ASM_POWERPC_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+ /*
+ * XXX: I know basically nothing about powerpc cache management.
+ * Is this correct?
+ */
+ isync();
+}
+
+#endif /* _ASM_POWERPC_SYNC_CORE_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b5137cc5b7b4..895f70fd4a61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,7 +81,6 @@ config X86
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
- select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..c665b453969a 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -89,11 +89,10 @@ static inline void sync_core(void)
}
/*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
*/
-static inline void sync_core_before_usermode(void)
+static inline void membarrier_sync_core_before_usermode(void)
{
/* With PTI, we unconditionally serialize before running user code. */
if (static_cpu_has(X86_FEATURE_PTI))
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 48640db6ca86..81ba47910a73 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/gfp.h>
-#include <linux/sync_core.h>
/*
* Routines for handling mm_structs
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/init/Kconfig b/init/Kconfig
index c9446911cf41..eb9772078cd4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2334,9 +2334,6 @@ source "kernel/Kconfig.locks"
config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
bool
-config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
- bool
-
# It may be useful for an architecture to override the definitions of the
# SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
# and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b3a82d7635da..db4945e1ec94 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -5,6 +5,9 @@
* membarrier system call
*/
#include "sched.h"
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#include <asm/sync_core.h>
+#endif
/*
* The basic principle behind the regular memory barrier mode of membarrier()
@@ -221,6 +224,7 @@ static void ipi_mb(void *info)
smp_mb(); /* IPIs should be serializing but paranoid. */
}
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static void ipi_sync_core(void *info)
{
/*
@@ -230,13 +234,14 @@ static void ipi_sync_core(void *info)
* the big comment at the top of this file.
*
* A sync_core() would provide this guarantee, but
- * sync_core_before_usermode() might end up being deferred until
- * after membarrier()'s smp_mb().
+ * membarrier_sync_core_before_usermode() might end up being deferred
+ * until after membarrier()'s smp_mb().
*/
smp_mb(); /* IPIs should be serializing but paranoid. */
- sync_core_before_usermode();
+ membarrier_sync_core_before_usermode();
}
+#endif
static void ipi_rseq(void *info)
{
@@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int cpu_id)
smp_call_func_t ipi_func = ipi_mb;
if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
- if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
return -EINVAL;
+#else
if (!(atomic_read(&mm->membarrier_state) &
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
return -EPERM;
ipi_func = ipi_sync_core;
+#endif
} else if (flags == MEMBARRIER_FLAG_RSEQ) {
if (!IS_ENABLED(CONFIG_RSEQ))
return -EINVAL;
--
2.29.2
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 89deb1334252ea4a8491d47654811e28b0790364 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:37 +0100
Subject: [PATCH] iio:magnetometer:mag3110: Fix alignment and data leak issues.
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp() assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable structure in the iio_priv() data.
This data is allocated with kzalloc() so no data can leak apart from
previous readings.
The explicit alignment of ts is not necessary in this case but
does make the code slightly less fragile so I have included it.
Fixes: 39631b5f9584 ("iio: Add Freescale mag3110 magnetometer driver")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-4-jic23@kernel.org
diff --git a/drivers/iio/magnetometer/mag3110.c b/drivers/iio/magnetometer/mag3110.c
index 838b13c8bb3d..c96415a1aead 100644
--- a/drivers/iio/magnetometer/mag3110.c
+++ b/drivers/iio/magnetometer/mag3110.c
@@ -56,6 +56,12 @@ struct mag3110_data {
int sleep_val;
struct regulator *vdd_reg;
struct regulator *vddio_reg;
+ /* Ensure natural alignment of timestamp */
+ struct {
+ __be16 channels[3];
+ u8 temperature;
+ s64 ts __aligned(8);
+ } scan;
};
static int mag3110_request(struct mag3110_data *data)
@@ -387,10 +393,9 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct mag3110_data *data = iio_priv(indio_dev);
- u8 buffer[16]; /* 3 16-bit channels + 1 byte temp + padding + ts */
int ret;
- ret = mag3110_read(data, (__be16 *) buffer);
+ ret = mag3110_read(data, data->scan.channels);
if (ret < 0)
goto done;
@@ -399,10 +404,10 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
MAG3110_DIE_TEMP);
if (ret < 0)
goto done;
- buffer[6] = ret;
+ data->scan.temperature = ret;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buffer,
+ iio_push_to_buffers_with_timestamp(indio_dev, &data->scan,
iio_get_time_ns(indio_dev));
done:
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 89deb1334252ea4a8491d47654811e28b0790364 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:37 +0100
Subject: [PATCH] iio:magnetometer:mag3110: Fix alignment and data leak issues.
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp() assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable structure in the iio_priv() data.
This data is allocated with kzalloc() so no data can leak apart from
previous readings.
The explicit alignment of ts is not necessary in this case but
does make the code slightly less fragile so I have included it.
Fixes: 39631b5f9584 ("iio: Add Freescale mag3110 magnetometer driver")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-4-jic23@kernel.org
diff --git a/drivers/iio/magnetometer/mag3110.c b/drivers/iio/magnetometer/mag3110.c
index 838b13c8bb3d..c96415a1aead 100644
--- a/drivers/iio/magnetometer/mag3110.c
+++ b/drivers/iio/magnetometer/mag3110.c
@@ -56,6 +56,12 @@ struct mag3110_data {
int sleep_val;
struct regulator *vdd_reg;
struct regulator *vddio_reg;
+ /* Ensure natural alignment of timestamp */
+ struct {
+ __be16 channels[3];
+ u8 temperature;
+ s64 ts __aligned(8);
+ } scan;
};
static int mag3110_request(struct mag3110_data *data)
@@ -387,10 +393,9 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct mag3110_data *data = iio_priv(indio_dev);
- u8 buffer[16]; /* 3 16-bit channels + 1 byte temp + padding + ts */
int ret;
- ret = mag3110_read(data, (__be16 *) buffer);
+ ret = mag3110_read(data, data->scan.channels);
if (ret < 0)
goto done;
@@ -399,10 +404,10 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
MAG3110_DIE_TEMP);
if (ret < 0)
goto done;
- buffer[6] = ret;
+ data->scan.temperature = ret;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buffer,
+ iio_push_to_buffers_with_timestamp(indio_dev, &data->scan,
iio_get_time_ns(indio_dev));
done:
With Paul, we've been thinking that the idle loop wasn't twisted enough
yet to deserve 2020.
rcutorture, after some recent parameter changes, has been complaining
about a hung task.
It appears that rcu_idle_enter() may wake up a NOCB kthread but this
happens after the last generic need_resched() check. Some cpuidle drivers
fix it by chance but many others don't.
Here is a proposed bunch of fixes. I will need to also fix the
rcu_user_enter() case, likely using irq_work, since nohz_full requires
irq work to support self IPI.
Also more generally, this raise the question of local task wake_up()
under disabled interrupts. When a wake up occurs in a preempt disabled
section, it gets handled by the outer preempt_enable() call. There is no
similar mechanism when a wake up occurs with interrupts disabled. I guess
it is assumed to be handled, at worst, in the next tick. But a local irq
work would provide instant preemption once IRQs are re-enabled. Of course
this would only make sense in CONFIG_PREEMPTION, and when the tick is
disabled...
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
sched/idle
HEAD: f2fa6e4a070c1535b9edc9ee097167fd2b15d235
Thanks,
Frederic
---
Frederic Weisbecker (4):
sched/idle: Fix missing need_resched() check after rcu_idle_enter()
cpuidle: Fix missing need_resched() check after rcu_idle_enter()
ARM: imx6q: Fix missing need_resched() check after rcu_idle_enter()
ACPI: processor: Fix missing need_resched() check after rcu_idle_enter()
arch/arm/mach-imx/cpuidle-imx6q.c | 7 ++++++-
drivers/acpi/processor_idle.c | 10 ++++++++--
drivers/cpuidle/cpuidle.c | 33 +++++++++++++++++++++++++--------
kernel/sched/idle.c | 18 ++++++++++++------
4 files changed, 51 insertions(+), 17 deletions(-)
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 7b6b51234df6cd8b04fe736b0b89c25612d896b8 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:39 +0100
Subject: [PATCH] iio:imu:bmi160: Fix alignment and data leak issues
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a9b ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: Daniel Baluta <daniel.baluta(a)gmail.com>
Cc: Daniel Baluta <daniel.baluta(a)oss.nxp.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
diff --git a/drivers/iio/imu/bmi160/bmi160.h b/drivers/iio/imu/bmi160/bmi160.h
index a82e040bd109..32c2ea2d7112 100644
--- a/drivers/iio/imu/bmi160/bmi160.h
+++ b/drivers/iio/imu/bmi160/bmi160.h
@@ -10,6 +10,13 @@ struct bmi160_data {
struct iio_trigger *trig;
struct regulator_bulk_data supplies[2];
struct iio_mount_matrix orientation;
+ /*
+ * Ensure natural alignment for timestamp if present.
+ * Max length needed: 2 * 3 channels + 4 bytes padding + 8 byte ts.
+ * If fewer channels are enabled, less space may be needed, as
+ * long as the timestamp is still aligned to 8 bytes.
+ */
+ __le16 buf[12] __aligned(8);
};
extern const struct regmap_config bmi160_regmap_config;
diff --git a/drivers/iio/imu/bmi160/bmi160_core.c b/drivers/iio/imu/bmi160/bmi160_core.c
index c8e131c29043..290b5ef83f77 100644
--- a/drivers/iio/imu/bmi160/bmi160_core.c
+++ b/drivers/iio/imu/bmi160/bmi160_core.c
@@ -427,8 +427,6 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct bmi160_data *data = iio_priv(indio_dev);
- __le16 buf[12];
- /* 2 sens x 3 axis x __le16 + 2 x __le16 pad + 4 x __le16 tstamp */
int i, ret, j = 0, base = BMI160_REG_DATA_MAGN_XOUT_L;
__le16 sample;
@@ -438,10 +436,10 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
&sample, sizeof(sample));
if (ret)
goto done;
- buf[j++] = sample;
+ data->buf[j++] = sample;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buf, pf->timestamp);
+ iio_push_to_buffers_with_timestamp(indio_dev, data->buf, pf->timestamp);
done:
iio_trigger_notify_done(indio_dev->trig);
return IRQ_HANDLED;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 7b6b51234df6cd8b04fe736b0b89c25612d896b8 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:39 +0100
Subject: [PATCH] iio:imu:bmi160: Fix alignment and data leak issues
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a9b ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: Daniel Baluta <daniel.baluta(a)gmail.com>
Cc: Daniel Baluta <daniel.baluta(a)oss.nxp.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
diff --git a/drivers/iio/imu/bmi160/bmi160.h b/drivers/iio/imu/bmi160/bmi160.h
index a82e040bd109..32c2ea2d7112 100644
--- a/drivers/iio/imu/bmi160/bmi160.h
+++ b/drivers/iio/imu/bmi160/bmi160.h
@@ -10,6 +10,13 @@ struct bmi160_data {
struct iio_trigger *trig;
struct regulator_bulk_data supplies[2];
struct iio_mount_matrix orientation;
+ /*
+ * Ensure natural alignment for timestamp if present.
+ * Max length needed: 2 * 3 channels + 4 bytes padding + 8 byte ts.
+ * If fewer channels are enabled, less space may be needed, as
+ * long as the timestamp is still aligned to 8 bytes.
+ */
+ __le16 buf[12] __aligned(8);
};
extern const struct regmap_config bmi160_regmap_config;
diff --git a/drivers/iio/imu/bmi160/bmi160_core.c b/drivers/iio/imu/bmi160/bmi160_core.c
index c8e131c29043..290b5ef83f77 100644
--- a/drivers/iio/imu/bmi160/bmi160_core.c
+++ b/drivers/iio/imu/bmi160/bmi160_core.c
@@ -427,8 +427,6 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct bmi160_data *data = iio_priv(indio_dev);
- __le16 buf[12];
- /* 2 sens x 3 axis x __le16 + 2 x __le16 pad + 4 x __le16 tstamp */
int i, ret, j = 0, base = BMI160_REG_DATA_MAGN_XOUT_L;
__le16 sample;
@@ -438,10 +436,10 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
&sample, sizeof(sample));
if (ret)
goto done;
- buf[j++] = sample;
+ data->buf[j++] = sample;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buf, pf->timestamp);
+ iio_push_to_buffers_with_timestamp(indio_dev, data->buf, pf->timestamp);
done:
iio_trigger_notify_done(indio_dev->trig);
return IRQ_HANDLED;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 7b6b51234df6cd8b04fe736b0b89c25612d896b8 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:39 +0100
Subject: [PATCH] iio:imu:bmi160: Fix alignment and data leak issues
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a9b ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: Daniel Baluta <daniel.baluta(a)gmail.com>
Cc: Daniel Baluta <daniel.baluta(a)oss.nxp.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
diff --git a/drivers/iio/imu/bmi160/bmi160.h b/drivers/iio/imu/bmi160/bmi160.h
index a82e040bd109..32c2ea2d7112 100644
--- a/drivers/iio/imu/bmi160/bmi160.h
+++ b/drivers/iio/imu/bmi160/bmi160.h
@@ -10,6 +10,13 @@ struct bmi160_data {
struct iio_trigger *trig;
struct regulator_bulk_data supplies[2];
struct iio_mount_matrix orientation;
+ /*
+ * Ensure natural alignment for timestamp if present.
+ * Max length needed: 2 * 3 channels + 4 bytes padding + 8 byte ts.
+ * If fewer channels are enabled, less space may be needed, as
+ * long as the timestamp is still aligned to 8 bytes.
+ */
+ __le16 buf[12] __aligned(8);
};
extern const struct regmap_config bmi160_regmap_config;
diff --git a/drivers/iio/imu/bmi160/bmi160_core.c b/drivers/iio/imu/bmi160/bmi160_core.c
index c8e131c29043..290b5ef83f77 100644
--- a/drivers/iio/imu/bmi160/bmi160_core.c
+++ b/drivers/iio/imu/bmi160/bmi160_core.c
@@ -427,8 +427,6 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct bmi160_data *data = iio_priv(indio_dev);
- __le16 buf[12];
- /* 2 sens x 3 axis x __le16 + 2 x __le16 pad + 4 x __le16 tstamp */
int i, ret, j = 0, base = BMI160_REG_DATA_MAGN_XOUT_L;
__le16 sample;
@@ -438,10 +436,10 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
&sample, sizeof(sample));
if (ret)
goto done;
- buf[j++] = sample;
+ data->buf[j++] = sample;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buf, pf->timestamp);
+ iio_push_to_buffers_with_timestamp(indio_dev, data->buf, pf->timestamp);
done:
iio_trigger_notify_done(indio_dev->trig);
return IRQ_HANDLED;
This reverts commit a135a1b4c4db1f3b8cbed9676a40ede39feb3362.
This leads to blank screens on some boards after replugging a
display. Revert until we understand the root cause and can
fix both the leak and the blank screen after replug.
Cc: Stylon Wang <stylon.wang(a)amd.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Cc: Andre Tomt <andre(a)tomt.net>
Cc: Oleksandr Natalenko <oleksandr(a)natalenko.name>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 0d2e334be87a..318eb12f8de7 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2385,8 +2385,7 @@ void amdgpu_dm_update_connector_after_detect(
drm_connector_update_edid_property(connector,
aconnector->edid);
- aconnector->num_modes = drm_add_edid_modes(connector, aconnector->edid);
- drm_connector_list_update(connector);
+ drm_add_edid_modes(connector, aconnector->edid);
if (aconnector->dc_link->aux_mode)
drm_dp_cec_set_edid(&aconnector->dm_dp_aux.aux,
--
2.29.2
We received a slab-out-of-bounds report in ip6_xmit() for KASAN build on 4.9
kernel. The patches that fix this issue have been backported to to stable 4.14
and one of them even to 3.16 but 4.9 stable branch does not include them.
Backport to linux-4.9.y required trivial merge conflict resolution. They
cleanly apply to linux-stable linux-4.9.y branch tagged v4.9.249.
Paolo Abeni (2):
net: ipv6: keep sk status consistent after datagram connect failure
l2tp: fix races with ipv4-mapped ipv6 addresses
net/ipv6/datagram.c | 21 +++++++++++++++++----
net/l2tp/l2tp_core.c | 38 ++++++++++++++++++--------------------
net/l2tp/l2tp_core.h | 3 ---
3 files changed, 35 insertions(+), 27 deletions(-)
--
2.29.2.729.g45daf8777d-goog
Commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
introduced a new location where a pmd was released, but neglected to run
the pmd page destructor. In fact, this happened previously for a
different pmd release path and was fixed by commit:
c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").
This issue was hidden until recently because the failure mode is silent,
but commit:
b2b29d6d0119 ("mm: account PMD tables like PTE tables")
...turns the failure mode into this signature:
BUG: Bad page state in process lt-pmem-ns pfn:15943d
page:000000007262ed7b refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x15943d
flags: 0xaffff800000000()
raw: 00affff800000000 dead000000000100 0000000000000000 0000000000000000
raw: 0000000000000000 ffff913a029bcc08 00000000fffffbff 0000000000000000
page dumped because: nonzero mapcount
[..]
dump_stack+0x8b/0xb0
bad_page.cold+0x63/0x94
free_pcp_prepare+0x224/0x270
free_unref_page+0x18/0xd0
pud_free_pmd_page+0x146/0x160
ioremap_pud_range+0xe3/0x350
ioremap_page_range+0x108/0x160
__ioremap_caller.constprop.0+0x174/0x2b0
? memremap+0x7a/0x110
memremap+0x7a/0x110
devm_memremap+0x53/0xa0
pmem_attach_disk+0x4ed/0x530 [nd_pmem]
? __devm_release_region+0x52/0x80
nvdimm_bus_probe+0x85/0x210 [libnvdimm]
Given this is a repeat occurrence it seemed prudent to look for other
places where this destructor might be missing and whether a better
helper is needed. try_to_free_pmd_page() looks like a candidate, but
testing with setting up and tearing down pmd mappings via the dax unit
tests is thus far not triggering the failure. As for a better helper
pmd_free() is close, but it is a messy fit due to requiring an @mm arg.
Also, ___pmd_free_tlb() wants to call paravirt_tlb_remove_table()
instead of free_page(), so open-coded pgtable_pmd_page_dtor() seems the
best way forward for now.
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Cc: <stable(a)vger.kernel.org>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: x86(a)kernel.org
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Co-debugged-by: Matthew Wilcox <willy(a)infradead.org>
Tested-by: Yi Zhang <yi.zhang(a)redhat.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
arch/x86/mm/pgtable.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..f6a9e2e36642 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -829,6 +829,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
}
free_page((unsigned long)pmd_sv);
+
+ pgtable_pmd_page_dtor(virt_to_page(pmd));
free_page((unsigned long)pmd);
return 1;
It is observed 'use-after-free' on the dmabuf's file->f_inode with the
race between closing the dmabuf file and reading the dmabuf's debug
info.
Consider the below scenario where P1 is closing the dma_buf file
and P2 is reading the dma_buf's debug info in the system:
P1 P2
dma_buf_debug_show()
dma_buf_put()
__fput()
file->f_op->release()
dput()
....
dentry_unlink_inode()
iput(dentry->d_inode)
(where the inode is freed)
mutex_lock(&db_list.lock)
read 'dma_buf->file->f_inode'
(the same inode is freed by P1)
mutex_unlock(&db_list.lock)
dentry->d_op->d_release()-->
dma_buf_release()
.....
mutex_lock(&db_list.lock)
removes the dmabuf from the list
mutex_unlock(&db_list.lock)
In the above scenario, when dma_buf_put() is called on a dma_buf, it
first frees the dma_buf's file->f_inode(=dentry->d_inode) and then
removes this dma_buf from the system db_list. In between P2 traversing
the db_list tries to access this dma_buf's file->f_inode that was freed
by P1 which is a use-after-free case.
Since, __fput() calls f_op->release first and then later calls the
d_op->d_release, move the dma_buf's db_list removal from d_release() to
f_op->release(). This ensures that dma_buf's file->f_inode is not
accessed after it is released.
Cc: <stable(a)vger.kernel.org> # 5.4+
Fixes: 4ab59c3c638c ("dma-buf: Move dma_buf_release() from fops to dentry_ops")
Acked-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Charan Teja Reddy <charante(a)codeaurora.org>
---
V2: Resending with stable tags and Acks
V1: https://lore.kernel.org/patchwork/patch/1360118/
drivers/dma-buf/dma-buf.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0eb80c1..a14dcbb 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -76,10 +76,6 @@ static void dma_buf_release(struct dentry *dentry)
dmabuf->ops->release(dmabuf);
- mutex_lock(&db_list.lock);
- list_del(&dmabuf->list_node);
- mutex_unlock(&db_list.lock);
-
if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
dma_resv_fini(dmabuf->resv);
@@ -88,6 +84,22 @@ static void dma_buf_release(struct dentry *dentry)
kfree(dmabuf);
}
+static int dma_buf_file_release(struct inode *inode, struct file *file)
+{
+ struct dma_buf *dmabuf;
+
+ if (!is_dma_buf_file(file))
+ return -EINVAL;
+
+ dmabuf = file->private_data;
+
+ mutex_lock(&db_list.lock);
+ list_del(&dmabuf->list_node);
+ mutex_unlock(&db_list.lock);
+
+ return 0;
+}
+
static const struct dentry_operations dma_buf_dentry_ops = {
.d_dname = dmabuffs_dname,
.d_release = dma_buf_release,
@@ -413,6 +425,7 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
}
static const struct file_operations dma_buf_fops = {
+ .release = dma_buf_file_release,
.mmap = dma_buf_mmap_internal,
.llseek = dma_buf_llseek,
.poll = dma_buf_poll,
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation
If an active transfer is dequeued, then the endpoint is freed to start a
new transfer. Make sure to clear the endpoint's transfer wait flag for
this case.
Cc: stable(a)vger.kernel.org
Fixes: e0d19563eb6c ("usb: dwc3: gadget: Wait for transfer completion")
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
---
Changes in v2:
- Only clear the wait flag if the selected request is of an active transfer.
Otherwise, any dequeue will change the endpoint's state even if it's for
some random request.
drivers/usb/dwc3/gadget.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 78cb4db8a6e4..9a00dcaca010 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -1763,6 +1763,8 @@ static int dwc3_gadget_ep_dequeue(struct usb_ep *ep,
list_for_each_entry_safe(r, t, &dep->started_list, list)
dwc3_gadget_move_cancelled_request(r);
+ dep->flags &= ~DWC3_EP_WAIT_TRANSFER_COMPLETE;
+
goto out;
}
}
base-commit: 2edc7af892d0913bf06f5b35e49ec463f03d5ed8
--
2.28.0
Here's another variant PNY Pro Elite USB 3.1 Gen 2 portable SSD that
hangs and doesn't respond to ATA_1x pass-through commands. If it doesn't
support these commands, it should respond properly to the host. Add it
to the unusual uas list to be able to move forward with other
operations.
Cc: stable(a)vger.kernel.org
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
---
drivers/usb/storage/unusual_uas.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/usb/storage/unusual_uas.h b/drivers/usb/storage/unusual_uas.h
index 870e9cf3d5dc..f9677a5ec31b 100644
--- a/drivers/usb/storage/unusual_uas.h
+++ b/drivers/usb/storage/unusual_uas.h
@@ -90,6 +90,13 @@ UNUSUAL_DEV(0x152d, 0x0578, 0x0000, 0x9999,
USB_SC_DEVICE, USB_PR_DEVICE, NULL,
US_FL_BROKEN_FUA),
+/* Reported-by: Thinh Nguyen <thinhn(a)synopsys.com> */
+UNUSUAL_DEV(0x154b, 0xf00b, 0x0000, 0x9999,
+ "PNY",
+ "Pro Elite SSD",
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_NO_ATA_1X),
+
/* Reported-by: Thinh Nguyen <thinhn(a)synopsys.com> */
UNUSUAL_DEV(0x154b, 0xf00d, 0x0000, 0x9999,
"PNY",
base-commit: 5c8fe583cce542aa0b84adc939ce85293de36e5e
--
2.28.0
With CONFIG_EXPERT=y, CONFIG_KASAN=y, CONFIG_RANDOMIZE_BASE=n,
CONFIG_RELOCATABLE=n, we observe the following failure when trying to
link the kernel image with LD=ld.lld:
error: section: .exit.data is not contiguous with other relro sections
ld.lld defaults to -z relro while ld.bfd defaults to -z norelro. This
was previously fixed, but only for CONFIG_RELOCATABLE=y.
Cc: stable(a)vger.kernel.org
Fixes: commit 3bbd3db86470 ("arm64: relocatable: fix inconsistencies in linker script and options")
Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com>
---
While upgrading our toolchains for Android, we started seeing the above
failure for a particular config that enabled KASAN but disabled KASLR.
This was on a 5.4 stable branch. It looks like
commit dd4bc6076587 ("arm64: warn on incorrect placement of the kernel by the bootloader")
made RELOCATABLE=y the default and depend on EXPERT=y. With those two
enabled, we can then reproduce the same failure on mainline.
arch/arm64/Makefile | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index f4717facf31e..674241df91ab 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -10,13 +10,13 @@
#
# Copyright (C) 1995-2001 by Russell King
-LDFLAGS_vmlinux :=--no-undefined -X
+LDFLAGS_vmlinux :=--no-undefined -X -z norelro
ifeq ($(CONFIG_RELOCATABLE), y)
# Pass --no-apply-dynamic-relocs to restore pre-binutils-2.27 behaviour
# for relative relocs, since this leads to better Image compression
# with the relocation offsets always being zero.
-LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro \
+LDFLAGS_vmlinux += -shared -Bsymbolic -z notext \
$(call ld-option, --no-apply-dynamic-relocs)
endif
--
2.29.0.rc1.297.gfa9743e501-goog
This reverts stable commit baad618d078c857f99cc286ea249e9629159901f.
This commit is adding lines to spinand_write_to_cache_op, wheras the upstream
commit 868cbe2a6dcee451bd8f87cbbb2a73cf463b57e5 that this was supposed to
backport was touching spinand_read_from_cache_op.
It causes a crash on writing OOB data by attempting to write to read-only
kernel memory.
Cc: Miquel Raynal <miquel.raynal(a)bootlin.com>
Signed-off-by: Felix Fietkau <nbd(a)nbd.name>
---
drivers/mtd/nand/spi/core.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/mtd/nand/spi/core.c b/drivers/mtd/nand/spi/core.c
index 7900571fc85b..c35221794645 100644
--- a/drivers/mtd/nand/spi/core.c
+++ b/drivers/mtd/nand/spi/core.c
@@ -318,10 +318,6 @@ static int spinand_write_to_cache_op(struct spinand_device *spinand,
buf += ret;
}
- if (req->ooblen)
- memcpy(req->oobbuf.in, spinand->oobbuf + req->ooboffs,
- req->ooblen);
-
return 0;
}
--
2.28.0
Correctly handle the MVPG instruction when issued by a VSIE guest.
Fixes: a3508fbe9dc6d ("KVM: s390: vsie: initial support for nested virtualization")
Cc: stable(a)vger.kernel.org
Signed-off-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
---
arch/s390/kvm/vsie.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c
index ada49583e530..6c3069868acd 100644
--- a/arch/s390/kvm/vsie.c
+++ b/arch/s390/kvm/vsie.c
@@ -977,6 +977,75 @@ static int handle_stfle(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
return 0;
}
+static u64 vsie_get_register(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page, u8 reg)
+{
+ reg &= 0xf;
+ switch (reg) {
+ case 15:
+ return vsie_page->scb_s.gg15;
+ case 14:
+ return vsie_page->scb_s.gg14;
+ default:
+ return vcpu->run->s.regs.gprs[reg];
+ }
+}
+
+static int vsie_handle_mvpg(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
+{
+ struct kvm_s390_sie_block *scb_s = &vsie_page->scb_s;
+ unsigned long r1, r2, mask = PAGE_MASK;
+ int rc;
+
+ if (psw_bits(scb_s->gpsw).eaba == PSW_BITS_AMODE_24BIT)
+ mask = 0xfff000;
+ else if (psw_bits(scb_s->gpsw).eaba == PSW_BITS_AMODE_31BIT)
+ mask = 0x7ffff000;
+
+ r1 = vsie_get_register(vcpu, vsie_page, scb_s->ipb >> 20) & mask;
+ r2 = vsie_get_register(vcpu, vsie_page, scb_s->ipb >> 16) & mask;
+ rc = kvm_s390_vsie_mvpg_check(vcpu, r1, r2, &vsie_page->scb_o->mcic);
+
+ /*
+ * Guest translation was not successful. The host needs to forward
+ * the intercept to the guest and let the guest fix its page tables.
+ * The guest needs then to retry the instruction.
+ */
+ if (rc == -ENOENT)
+ return 1;
+
+ retry_vsie_icpt(vsie_page);
+
+ /*
+ * Guest translation was not successful. The page tables of the guest
+ * are broken. Try again and let the hardware deliver the fault.
+ */
+ if (rc == -EFAULT)
+ return 0;
+
+ /*
+ * Guest translation was successful. The host needs to fix up its
+ * page tables and retry the instruction in the nested guest.
+ * In case of failure, the instruction will intercept again, and
+ * a different path will be taken.
+ */
+ if (!rc) {
+ kvm_s390_shadow_fault(vcpu, vsie_page->gmap, r2);
+ kvm_s390_shadow_fault(vcpu, vsie_page->gmap, r1);
+ return 0;
+ }
+
+ /*
+ * An exception happened during guest translation, it needs to be
+ * delivered to the guest. This can happen if the host has EDAT1
+ * enabled and the guest has not, or for other causes. The guest
+ * needs to process the exception and return to the nested guest.
+ */
+ if (rc > 0)
+ return kvm_s390_inject_prog_cond(vcpu, rc);
+
+ return 1;
+}
+
/*
* Run the vsie on a shadow scb and a shadow gmap, without any further
* sanity checks, handling SIE faults.
@@ -1063,6 +1132,10 @@ static int do_vsie_run(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
if ((scb_s->ipa & 0xf000) != 0xf000)
scb_s->ipa += 0x1000;
break;
+ case ICPT_PARTEXEC:
+ if (scb_s->ipa == 0xb254)
+ rc = vsie_handle_mvpg(vcpu, vsie_page);
+ break;
}
return rc;
}
--
2.26.2
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.
The series is a backport of the fix sent by Peter [1].
The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.
This is only for v4.19 stable.
[1] https://patchwork.kernel.org/cover/11284843/
--
Changelog:
v2: Send the patches with the correct format (commit sha1 upstream) for stable
v3: Fix compilation issue on ppc40x_defconfig and ppc44x_defconfig
--
Aneesh Kumar K.V (1):
powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
Peter Zijlstra (4):
asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
asm-generic/tlb: avoid potential double flush
Will Deacon (1):
asm-generic/tlb: Track which levels of the page tables have been
cleared
arch/Kconfig | 3 -
arch/powerpc/Kconfig | 2 +-
arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 --
arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 -
arch/powerpc/include/asm/nohash/32/pgalloc.h | 8 --
arch/powerpc/include/asm/nohash/64/pgalloc.h | 9 +-
arch/powerpc/include/asm/tlb.h | 11 ++
arch/powerpc/mm/pgtable-book3s64.c | 7 --
arch/sparc/include/asm/tlb_64.h | 9 ++
arch/x86/Kconfig | 1 -
include/asm-generic/tlb.h | 103 ++++++++++++++++---
mm/memory.c | 20 ++--
12 files changed, 123 insertions(+), 60 deletions(-)
--
2.24.1
ksys_umount was refactored to into split into another function
(path_umount) to enable sharing code. This changed the order that flags and
permissions are validated in, and made it so that user_path_at was called
before validating flags.
Unfortunately, libfuse2[1] and libmount[2] rely on the old flag validation
behaviour to determine whether or not the kernel supports UMOUNT_NOFOLLOW.
The other path that this validation is being checked on is
init_umount->path_umount->can_umount. That's all internal to the kernel. We
can safely move flag checking to ksys_umount, and let other users of
path_umount know they need to perform their own validation.
[1]: https://github.com/libfuse/libfuse/blob/9bfbeb576c5901b62a171d35510f0d1a922…
[2]: https://github.com/karelzak/util-linux/blob/7ed579523b556b1270f28dbdb7ee07d…
Signed-off-by: Sargun Dhillon <sargun(a)sargun.me>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: stable(a)vger.kernel.org
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-fsdevel(a)vger.kernel.org
Fixes: 41525f56e256 ("fs: refactor ksys_umount")
---
fs/namespace.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index cebaa3e81794..752f82121dd4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1710,8 +1710,6 @@ static int can_umount(const struct path *path, int flags)
{
struct mount *mnt = real_mount(path->mnt);
- if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
- return -EINVAL;
if (!may_mount())
return -EPERM;
if (path->dentry != path->mnt->mnt_root)
@@ -1725,6 +1723,13 @@ static int can_umount(const struct path *path, int flags)
return 0;
}
+
+/*
+ * path_umount - unmount by path
+ *
+ * path_umount does not check the validity of flags. It is up to the caller
+ * to ensure that it only contains valid umount options.
+ */
int path_umount(struct path *path, int flags)
{
struct mount *mnt = real_mount(path->mnt);
@@ -1746,6 +1751,10 @@ static int ksys_umount(char __user *name, int flags)
struct path path;
int ret;
+ /* Check flag validity first to allow probing of supported flags */
+ if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+ return -EINVAL;
+
if (!(flags & UMOUNT_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
ret = user_path_at(AT_FDCWD, name, lookup_flags, &path);
--
2.25.1
If a transfer is dequeued, then the endpoint is freed to start a new
transfer. Make sure to clear the endpoint's transfer wait flag if
dequeued.
Cc: stable(a)vger.kernel.org
Fixes: e0d19563eb6c ("usb: dwc3: gadget: Wait for transfer completion")
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
---
drivers/usb/dwc3/gadget.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 78cb4db8a6e4..9eaeb3c5d06c 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -1771,6 +1771,7 @@ static int dwc3_gadget_ep_dequeue(struct usb_ep *ep,
request, ep->name);
ret = -EINVAL;
out:
+ dep->flags &= ~DWC3_EP_WAIT_TRANSFER_COMPLETE;
spin_unlock_irqrestore(&dwc->lock, flags);
return ret;
base-commit: 5c8fe583cce542aa0b84adc939ce85293de36e5e
--
2.28.0
When inode has no listxattr op of its own (e.g. squashfs) vfs_listxattr
calls the LSM inode_listsecurity hooks to list the xattrs that LSMs will
intercept in inode_getxattr hooks.
When selinux LSM is installed but not initialized, it will list the
security.selinux xattr in inode_listsecurity, but will not intercept it
in inode_getxattr. This results in -ENODATA for a getxattr call for an
xattr returned by listxattr.
This situation was manifested as overlayfs failure to copy up lower
files from squashfs when selinux is built-in but not initialized,
because ovl_copy_xattr() iterates the lower inode xattrs by
vfs_listxattr() and vfs_getxattr().
Match the logic of inode_listsecurity to that of inode_getxattr and
do not list the security.selinux xattr if selinux is not initialized.
Reported-by: Michael Labriola <michael.d.labriola(a)gmail.com>
Tested-by: Michael Labriola <michael.d.labriola(a)gmail.com>
Link: https://lore.kernel.org/linux-unionfs/2nv9d47zt7.fsf@aldarion.sourceruckus.…
Fixes: c8e222616c7e ("selinux: allow reading labels before policy is loaded")
Cc: stable(a)vger.kernel.org#v5.9+
Signed-off-by: Amir Goldstein <amir73il(a)gmail.com>
---
security/selinux/hooks.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 6b1826fc3658..e132e082a5af 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3406,6 +3406,10 @@ static int selinux_inode_setsecurity(struct inode *inode, const char *name,
static int selinux_inode_listsecurity(struct inode *inode, char *buffer, size_t buffer_size)
{
const int len = sizeof(XATTR_NAME_SELINUX);
+
+ if (!selinux_initialized(&selinux_state))
+ return 0;
+
if (buffer && len <= buffer_size)
memcpy(buffer, XATTR_NAME_SELINUX, len);
return len;
--
2.25.1
From: Yunfeng Ye <yeyunfeng(a)huawei.com>
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ]
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng(a)huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai(a)gmail.com>
Signed-off-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/workqueue.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3fb2d45c0b42f..6b293804cd734 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3361,17 +3361,24 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
* is updated and visible.
*/
if (!freezable || !workqueue_freezing) {
+ bool kick = false;
+
pwq->max_active = wq->saved_max_active;
while (!list_empty(&pwq->delayed_works) &&
- pwq->nr_active < pwq->max_active)
+ pwq->nr_active < pwq->max_active) {
pwq_activate_first_delayed(pwq);
+ kick = true;
+ }
/*
* Need to kick a worker after thawed or an unbound wq's
- * max_active is bumped. It's a slow path. Do it always.
+ * max_active is bumped. In realtime scenarios, always kicking a
+ * worker will cause interference on the isolated cpu cores, so
+ * let's kick iff work items were activated.
*/
- wake_up_worker(pwq->pool);
+ if (kick)
+ wake_up_worker(pwq->pool);
} else {
pwq->max_active = 0;
}
--
2.27.0
From: Yunfeng Ye <yeyunfeng(a)huawei.com>
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ]
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng(a)huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai(a)gmail.com>
Signed-off-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/workqueue.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 00c295d3104bb..205c3131f8b05 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3448,17 +3448,24 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
* is updated and visible.
*/
if (!freezable || !workqueue_freezing) {
+ bool kick = false;
+
pwq->max_active = wq->saved_max_active;
while (!list_empty(&pwq->delayed_works) &&
- pwq->nr_active < pwq->max_active)
+ pwq->nr_active < pwq->max_active) {
pwq_activate_first_delayed(pwq);
+ kick = true;
+ }
/*
* Need to kick a worker after thawed or an unbound wq's
- * max_active is bumped. It's a slow path. Do it always.
+ * max_active is bumped. In realtime scenarios, always kicking a
+ * worker will cause interference on the isolated cpu cores, so
+ * let's kick iff work items were activated.
*/
- wake_up_worker(pwq->pool);
+ if (kick)
+ wake_up_worker(pwq->pool);
} else {
pwq->max_active = 0;
}
--
2.27.0
From: Yunfeng Ye <yeyunfeng(a)huawei.com>
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ]
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng(a)huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai(a)gmail.com>
Signed-off-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/workqueue.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 18fae55713b0a..79fcec674485f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3494,17 +3494,24 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
* is updated and visible.
*/
if (!freezable || !workqueue_freezing) {
+ bool kick = false;
+
pwq->max_active = wq->saved_max_active;
while (!list_empty(&pwq->delayed_works) &&
- pwq->nr_active < pwq->max_active)
+ pwq->nr_active < pwq->max_active) {
pwq_activate_first_delayed(pwq);
+ kick = true;
+ }
/*
* Need to kick a worker after thawed or an unbound wq's
- * max_active is bumped. It's a slow path. Do it always.
+ * max_active is bumped. In realtime scenarios, always kicking a
+ * worker will cause interference on the isolated cpu cores, so
+ * let's kick iff work items were activated.
*/
- wake_up_worker(pwq->pool);
+ if (kick)
+ wake_up_worker(pwq->pool);
} else {
pwq->max_active = 0;
}
--
2.27.0
From: Yunfeng Ye <yeyunfeng(a)huawei.com>
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ]
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng(a)huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai(a)gmail.com>
Signed-off-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/workqueue.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index eef77c82d2e19..cd98ef48345e1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3554,17 +3554,24 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
* is updated and visible.
*/
if (!freezable || !workqueue_freezing) {
+ bool kick = false;
+
pwq->max_active = wq->saved_max_active;
while (!list_empty(&pwq->delayed_works) &&
- pwq->nr_active < pwq->max_active)
+ pwq->nr_active < pwq->max_active) {
pwq_activate_first_delayed(pwq);
+ kick = true;
+ }
/*
* Need to kick a worker after thawed or an unbound wq's
- * max_active is bumped. It's a slow path. Do it always.
+ * max_active is bumped. In realtime scenarios, always kicking a
+ * worker will cause interference on the isolated cpu cores, so
+ * let's kick iff work items were activated.
*/
- wake_up_worker(pwq->pool);
+ if (kick)
+ wake_up_worker(pwq->pool);
} else {
pwq->max_active = 0;
}
--
2.27.0
From: Yunfeng Ye <yeyunfeng(a)huawei.com>
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ]
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng(a)huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai(a)gmail.com>
Signed-off-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/workqueue.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4aa268582a225..28e52657e0930 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3718,17 +3718,24 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
* is updated and visible.
*/
if (!freezable || !workqueue_freezing) {
+ bool kick = false;
+
pwq->max_active = wq->saved_max_active;
while (!list_empty(&pwq->delayed_works) &&
- pwq->nr_active < pwq->max_active)
+ pwq->nr_active < pwq->max_active) {
pwq_activate_first_delayed(pwq);
+ kick = true;
+ }
/*
* Need to kick a worker after thawed or an unbound wq's
- * max_active is bumped. It's a slow path. Do it always.
+ * max_active is bumped. In realtime scenarios, always kicking a
+ * worker will cause interference on the isolated cpu cores, so
+ * let's kick iff work items were activated.
*/
- wake_up_worker(pwq->pool);
+ if (kick)
+ wake_up_worker(pwq->pool);
} else {
pwq->max_active = 0;
}
--
2.27.0
From: Yunfeng Ye <yeyunfeng(a)huawei.com>
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ]
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng(a)huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai(a)gmail.com>
Signed-off-by: Tejun Heo <tj(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/workqueue.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 437935e7a1991..0695c7895c892 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3728,17 +3728,24 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
* is updated and visible.
*/
if (!freezable || !workqueue_freezing) {
+ bool kick = false;
+
pwq->max_active = wq->saved_max_active;
while (!list_empty(&pwq->delayed_works) &&
- pwq->nr_active < pwq->max_active)
+ pwq->nr_active < pwq->max_active) {
pwq_activate_first_delayed(pwq);
+ kick = true;
+ }
/*
* Need to kick a worker after thawed or an unbound wq's
- * max_active is bumped. It's a slow path. Do it always.
+ * max_active is bumped. In realtime scenarios, always kicking a
+ * worker will cause interference on the isolated cpu cores, so
+ * let's kick iff work items were activated.
*/
- wake_up_worker(pwq->pool);
+ if (kick)
+ wake_up_worker(pwq->pool);
} else {
pwq->max_active = 0;
}
--
2.27.0
It sounds unwise to let user space pass an unchecked 32-bit
offset into a kernel structure in an ioctl. This is an unsigned
variable, so checking the upper bound for the size of the structure
it points into is sufficient to avoid data corruption, but as
the pointer might also be unaligned, it has to be written carefully
as well.
While I stumbled over this problem by reading the code, I did not
continue checking the function for further problems like it.
Cc: stable(a)vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
---
drivers/scsi/megaraid/megaraid_sas_base.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 861f7140f52e..c3de69f3bee8 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -8095,7 +8095,7 @@ megasas_mgmt_fw_ioctl(struct megasas_instance *instance,
int error = 0, i;
void *sense = NULL;
dma_addr_t sense_handle;
- unsigned long *sense_ptr;
+ void *sense_ptr;
u32 opcode = 0;
int ret = DCMD_SUCCESS;
@@ -8218,6 +8218,12 @@ megasas_mgmt_fw_ioctl(struct megasas_instance *instance,
}
if (ioc->sense_len) {
+ /* make sure the pointer is part of the frame */
+ if (ioc->sense_off > (sizeof(union megasas_frame) - sizeof(__le64))) {
+ error = -EINVAL;
+ goto out;
+ }
+
sense = dma_alloc_coherent(&instance->pdev->dev, ioc->sense_len,
&sense_handle, GFP_KERNEL);
if (!sense) {
@@ -8225,12 +8231,11 @@ megasas_mgmt_fw_ioctl(struct megasas_instance *instance,
goto out;
}
- sense_ptr =
- (unsigned long *) ((unsigned long)cmd->frame + ioc->sense_off);
+ sense_ptr = (void *)cmd->frame + ioc->sense_off;
if (instance->consistent_mask_64bit)
- *sense_ptr = cpu_to_le64(sense_handle);
+ put_unaligned_le64(sense_handle, sense_ptr);
else
- *sense_ptr = cpu_to_le32(sense_handle);
+ put_unaligned_le32(sense_handle, sense_ptr);
}
/*
--
2.27.0
On 2020-12-29 9:54 a.m., Deucher, Alexander wrote:
> [AMD Public Use]
>
>
> I don't know if these fixes related to modifiers make sense in the
> pre-modifier code base. Bas, Nick?
>
> Alex
Mesa should be the only userspace trying to make use of DCC and it
doesn't do it for video formats. From the kernel side of things we've
also never supported this and you'd get corruption on the screen if you
tried.
It's a "fix" for both pre-modifiers and post-modifiers code.
Regards,
Nicholas Kazlauskas
> ------------------------------------------------------------------------
> *From:* amd-gfx <amd-gfx-bounces(a)lists.freedesktop.org> on behalf of
> Sasha Levin <sashal(a)kernel.org>
> *Sent:* Tuesday, December 22, 2020 9:16 PM
> *To:* linux-kernel(a)vger.kernel.org <linux-kernel(a)vger.kernel.org>;
> stable(a)vger.kernel.org <stable(a)vger.kernel.org>
> *Cc:* Sasha Levin <sashal(a)kernel.org>; dri-devel(a)lists.freedesktop.org
> <dri-devel(a)lists.freedesktop.org>; amd-gfx(a)lists.freedesktop.org
> <amd-gfx(a)lists.freedesktop.org>; Bas Nieuwenhuizen
> <bas(a)basnieuwenhuizen.nl>; Deucher, Alexander
> <Alexander.Deucher(a)amd.com>; Kazlauskas, Nicholas
> <Nicholas.Kazlauskas(a)amd.com>
> *Subject:* [PATCH AUTOSEL 5.4 006/130] drm/amd/display: Do not silently
> accept DCC for multiplane formats.
> From: Bas Nieuwenhuizen <bas(a)basnieuwenhuizen.nl>
>
> [ Upstream commit b35ce7b364ec80b54f48a8fdf9fb74667774d2da ]
>
> Silently accepting it could result in corruption.
>
> Signed-off-by: Bas Nieuwenhuizen <bas(a)basnieuwenhuizen.nl>
> Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com>
> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
> ---
> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index d2dd387c95d86..ce70c42a2c3ec 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -2734,7 +2734,7 @@ fill_plane_dcc_attributes(struct amdgpu_device *adev,
> return 0;
>
> if (format >= SURFACE_PIXEL_FORMAT_VIDEO_BEGIN)
> - return 0;
> + return -EINVAL;
>
> if (!dc->cap_funcs.get_dcc_compression_cap)
> return -EINVAL;
> --
> 2.27.0
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx(a)lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.fre…
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.fre…>
Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot. However this introduced a pretty serious
performance regression. The root cause is because before we would
generally use the normal writeback path to reclaim delalloc space, and
for this we would provide it with the number of pages we wanted to
flush. The referenced commit changed this to flush that many inodes,
which drastically increased the amount of space we were flushing in
certain cases, which severely affected performance.
We cannot revert this patch unfortunately, because Filipe has another
fix that requires the ability to skip flushing inodes that are being
cloned in certain scenarios, which means we need to keep using our
flushing infrastructure or risk re-introducing the deadlock.
Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us. This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe. The
user reported reproducer was a simple untarring of a large tarball of
the source code for Firefox. The results are as follows
5.9 0m54.258s
5.10 1m26.212s
Patch 0m38.800s
We are significantly faster because of the work I did around improving
ENOSPC flushing in 5.10 and 5.11, so reverting to the previous write out
behavior gave us a pretty big boost.
CC: stable(a)vger.kernel.org # 5.10
Reported-by: René Rebe <rene(a)exactcode.de>
Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
---
fs/btrfs/inode.c | 60 +++++++++++++++++++++++++++++++------------
fs/btrfs/space-info.c | 4 ++-
2 files changed, 46 insertions(+), 18 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 070716650df8..a8e0a6b038d3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9390,7 +9390,8 @@ static struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode
* some fairly slow code that needs optimization. This walks the list
* of all the inodes with pending delalloc and forces them to disk.
*/
-static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot,
+static int start_delalloc_inodes(struct btrfs_root *root,
+ struct writeback_control *wbc, bool snapshot,
bool in_reclaim_context)
{
struct btrfs_inode *binode;
@@ -9399,6 +9400,7 @@ static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot
struct list_head works;
struct list_head splice;
int ret = 0;
+ bool full_flush = wbc->nr_to_write == LONG_MAX;
INIT_LIST_HEAD(&works);
INIT_LIST_HEAD(&splice);
@@ -9427,18 +9429,24 @@ static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot
if (snapshot)
set_bit(BTRFS_INODE_SNAPSHOT_FLUSH,
&binode->runtime_flags);
- work = btrfs_alloc_delalloc_work(inode);
- if (!work) {
- iput(inode);
- ret = -ENOMEM;
- goto out;
- }
- list_add_tail(&work->list, &works);
- btrfs_queue_work(root->fs_info->flush_workers,
- &work->work);
- if (*nr != U64_MAX) {
- (*nr)--;
- if (*nr == 0)
+ if (full_flush) {
+ work = btrfs_alloc_delalloc_work(inode);
+ if (!work) {
+ iput(inode);
+ ret = -ENOMEM;
+ goto out;
+ }
+ list_add_tail(&work->list, &works);
+ btrfs_queue_work(root->fs_info->flush_workers,
+ &work->work);
+ } else {
+ ret = sync_inode(inode, wbc);
+ if (!ret &&
+ test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+ &BTRFS_I(inode)->runtime_flags))
+ ret = sync_inode(inode, wbc);
+ btrfs_add_delayed_iput(inode);
+ if (ret || wbc->nr_to_write <= 0)
goto out;
}
cond_resched();
@@ -9464,18 +9472,29 @@ static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot
int btrfs_start_delalloc_snapshot(struct btrfs_root *root)
{
+ struct writeback_control wbc = {
+ .nr_to_write = LONG_MAX,
+ .sync_mode = WB_SYNC_NONE,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ };
struct btrfs_fs_info *fs_info = root->fs_info;
- u64 nr = U64_MAX;
if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state))
return -EROFS;
- return start_delalloc_inodes(root, &nr, true, false);
+ return start_delalloc_inodes(root, &wbc, true, false);
}
int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr,
bool in_reclaim_context)
{
+ struct writeback_control wbc = {
+ .nr_to_write = (nr == U64_MAX) ? LONG_MAX : (unsigned long)nr,
+ .sync_mode = WB_SYNC_NONE,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ };
struct btrfs_root *root;
struct list_head splice;
int ret;
@@ -9489,6 +9508,13 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr,
spin_lock(&fs_info->delalloc_root_lock);
list_splice_init(&fs_info->delalloc_roots, &splice);
while (!list_empty(&splice) && nr) {
+ /*
+ * Reset nr_to_write here so we know that we're doing a full
+ * flush.
+ */
+ if (nr == U64_MAX)
+ wbc.nr_to_write = LONG_MAX;
+
root = list_first_entry(&splice, struct btrfs_root,
delalloc_root);
root = btrfs_grab_root(root);
@@ -9497,9 +9523,9 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, u64 nr,
&fs_info->delalloc_roots);
spin_unlock(&fs_info->delalloc_root_lock);
- ret = start_delalloc_inodes(root, &nr, false, in_reclaim_context);
+ ret = start_delalloc_inodes(root, &wbc, false, in_reclaim_context);
btrfs_put_root(root);
- if (ret < 0)
+ if (ret < 0 || wbc.nr_to_write <= 0)
goto out;
spin_lock(&fs_info->delalloc_root_lock);
}
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 67e55c5479b8..e8347461c8dd 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -532,7 +532,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
loops = 0;
while ((delalloc_bytes || dio_bytes) && loops < 3) {
- btrfs_start_delalloc_roots(fs_info, items, true);
+ u64 nr_pages = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
+
+ btrfs_start_delalloc_roots(fs_info, nr_pages, true);
loops++;
if (wait_ordered && !trans) {
--
2.26.2
There is one GSWIP_MII_CFG register for each switch-port except the CPU
port. The register offset for the first port is 0x0, 0x02 for the
second, 0x04 for the third and so on.
Update the driver to not only restrict the GSWIP_MII_CFG registers to
ports 0, 1 and 5. Handle ports 0..5 instead but skip the CPU port. This
means we are not overwriting the configuration for the third port (port
two since we start counting from zero) with the settings for the sixth
port (with number five) anymore.
The GSWIP_MII_PCDU(p) registers are not updated because there's really
only three (one for each of the following ports: 0, 1, 5).
Fixes: 14fceff4771e51 ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Cc: stable(a)vger.kernel.org
Signed-off-by: Martin Blumenstingl <martin.blumenstingl(a)googlemail.com>
---
drivers/net/dsa/lantiq_gswip.c | 23 ++++++-----------------
1 file changed, 6 insertions(+), 17 deletions(-)
diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c
index 5d378c8026f0..4b36d89bec06 100644
--- a/drivers/net/dsa/lantiq_gswip.c
+++ b/drivers/net/dsa/lantiq_gswip.c
@@ -92,9 +92,7 @@
GSWIP_MDIO_PHY_FDUP_MASK)
/* GSWIP MII Registers */
-#define GSWIP_MII_CFG0 0x00
-#define GSWIP_MII_CFG1 0x02
-#define GSWIP_MII_CFG5 0x04
+#define GSWIP_MII_CFGp(p) (0x2 * (p))
#define GSWIP_MII_CFG_EN BIT(14)
#define GSWIP_MII_CFG_LDCLKDIS BIT(12)
#define GSWIP_MII_CFG_MODE_MIIP 0x0
@@ -392,17 +390,9 @@ static void gswip_mii_mask(struct gswip_priv *priv, u32 clear, u32 set,
static void gswip_mii_mask_cfg(struct gswip_priv *priv, u32 clear, u32 set,
int port)
{
- switch (port) {
- case 0:
- gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG0);
- break;
- case 1:
- gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG1);
- break;
- case 5:
- gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG5);
- break;
- }
+ /* There's no MII_CFG register for the CPU port */
+ if (!dsa_is_cpu_port(priv->ds, port))
+ gswip_mii_mask(priv, clear, set, GSWIP_MII_CFGp(port));
}
static void gswip_mii_mask_pcdu(struct gswip_priv *priv, u32 clear, u32 set,
@@ -822,9 +812,8 @@ static int gswip_setup(struct dsa_switch *ds)
gswip_mdio_mask(priv, 0xff, 0x09, GSWIP_MDIO_MDC_CFG1);
/* Disable the xMII link */
- gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_EN, 0, 0);
- gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_EN, 0, 1);
- gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_EN, 0, 5);
+ for (i = 0; i < priv->hw_info->max_ports; i++)
+ gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_EN, 0, i);
/* enable special tag insertion on cpu port */
gswip_switch_mask(priv, 0, GSWIP_FDMA_PCTRL_STEN,
--
2.30.0
This is a revert for 38d715f494f2 ("btrfs: use
btrfs_start_delalloc_roots in shrink_delalloc"). A user reported a
problem where performance was significantly worse with this patch
applied. The problem needs to be fixed with proper pre-flushing, and
changes to how we deal with the work queues for the inodes. However
that work is much more complicated than is acceptable for stable, and
simply reverting this patch fixes the problem. The original patch was
a cleanup of the code, so it's fine to revert it. My numbers for the
original reported test, which was untarring a copy of the firefox
sources, are as follows
5.9 0m54.258s
5.10 1m26.212s
Fix 0m35.038s
cc: stable(a)vger.kernel.org # 5.10
Reported-by: René Rebe <rene(a)exactcode.de>
Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
---
Dave, this is ontop of linus's branch, because we've changed the arguments for
btrfs_start_delalloc_roots in misc-next, and this needs to go back to 5.10 ASAP.
I can send a misc-next version if you want to have it there as well while we're
waiting for it to go into linus's tree, just let me know.
fs/btrfs/space-info.c | 54 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 64099565ab8f..a2b322275b8d 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -465,6 +465,28 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
up_read(&info->groups_sem);
}
+static void btrfs_writeback_inodes_sb_nr(struct btrfs_fs_info *fs_info,
+ unsigned long nr_pages, u64 nr_items)
+{
+ struct super_block *sb = fs_info->sb;
+
+ if (down_read_trylock(&sb->s_umount)) {
+ writeback_inodes_sb_nr(sb, nr_pages, WB_REASON_FS_FREE_SPACE);
+ up_read(&sb->s_umount);
+ } else {
+ /*
+ * We needn't worry the filesystem going from r/w to r/o though
+ * we don't acquire ->s_umount mutex, because the filesystem
+ * should guarantee the delalloc inodes list be empty after
+ * the filesystem is readonly(all dirty pages are written to
+ * the disk).
+ */
+ btrfs_start_delalloc_roots(fs_info, nr_items);
+ if (!current->journal_info)
+ btrfs_wait_ordered_roots(fs_info, nr_items, 0, (u64)-1);
+ }
+}
+
static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
u64 to_reclaim)
{
@@ -490,8 +512,10 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans;
u64 delalloc_bytes;
u64 dio_bytes;
+ u64 async_pages;
u64 items;
long time_left;
+ unsigned long nr_pages;
int loops;
/* Calc the number of the pages we need flush for space reservation */
@@ -532,8 +556,36 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
loops = 0;
while ((delalloc_bytes || dio_bytes) && loops < 3) {
- btrfs_start_delalloc_roots(fs_info, items);
+ nr_pages = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
+
+ /*
+ * Triggers inode writeback for up to nr_pages. This will invoke
+ * ->writepages callback and trigger delalloc filling
+ * (btrfs_run_delalloc_range()).
+ */
+ btrfs_writeback_inodes_sb_nr(fs_info, nr_pages, items);
+ /*
+ * We need to wait for the compressed pages to start before
+ * we continue.
+ */
+ async_pages = atomic_read(&fs_info->async_delalloc_pages);
+ if (!async_pages)
+ goto skip_async;
+
+ /*
+ * Calculate how many compressed pages we want to be written
+ * before we continue. I.e if there are more async pages than we
+ * require wait_event will wait until nr_pages are written.
+ */
+ if (async_pages <= nr_pages)
+ async_pages = 0;
+ else
+ async_pages -= nr_pages;
+ wait_event(fs_info->async_submit_wait,
+ atomic_read(&fs_info->async_delalloc_pages) <=
+ (int)async_pages);
+skip_async:
loops++;
if (wait_ordered && !trans) {
btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);
--
2.26.2
This is a note to let you know that I've just added the patch titled
usb: gadget: configfs: Preserve function ordering after bind failure
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 6cd0fe91387917be48e91385a572a69dfac2f3f7 Mon Sep 17 00:00:00 2001
From: Chandana Kishori Chiluveru <cchiluve(a)codeaurora.org>
Date: Tue, 29 Dec 2020 14:44:43 -0800
Subject: usb: gadget: configfs: Preserve function ordering after bind failure
When binding the ConfigFS gadget to a UDC, the functions in each
configuration are added in list order. However, if usb_add_function()
fails, the failed function is put back on its configuration's
func_list and purge_configs_funcs() is called to further clean up.
purge_configs_funcs() iterates over the configurations and functions
in forward order, calling unbind() on each of the previously added
functions. But after doing so, each function gets moved to the
tail of the configuration's func_list. This results in reshuffling
the original order of the functions within a configuration such
that the failed function now appears first even though it may have
originally appeared in the middle or even end of the list. At this
point if the ConfigFS gadget is attempted to re-bind to the UDC,
the functions will be added in a different order than intended,
with the only recourse being to remove and relink the functions all
over again.
An example of this as follows:
ln -s functions/mass_storage.0 configs/c.1
ln -s functions/ncm.0 configs/c.1
ln -s functions/ffs.adb configs/c.1 # oops, forgot to start adbd
echo "<udc device>" > UDC # fails
start adbd
echo "<udc device>" > UDC # now succeeds, but...
# bind order is
# "ADB", mass_storage, ncm
[30133.118289] configfs-gadget gadget: adding 'Mass Storage Function'/ffffff810af87200 to config 'c'/ffffff817d6a2520
[30133.119875] configfs-gadget gadget: adding 'cdc_network'/ffffff80f48d1a00 to config 'c'/ffffff817d6a2520
[30133.119974] using random self ethernet address
[30133.120002] using random host ethernet address
[30133.139604] usb0: HOST MAC 3e:27:46:ba:3e:26
[30133.140015] usb0: MAC 6e:28:7e:42:66:6a
[30133.140062] configfs-gadget gadget: adding 'Function FS Gadget'/ffffff80f3868438 to config 'c'/ffffff817d6a2520
[30133.140081] configfs-gadget gadget: adding 'Function FS Gadget'/ffffff80f3868438 --> -19
[30133.140098] configfs-gadget gadget: unbind function 'Mass Storage Function'/ffffff810af87200
[30133.140119] configfs-gadget gadget: unbind function 'cdc_network'/ffffff80f48d1a00
[30133.173201] configfs-gadget a600000.dwc3: failed to start g1: -19
[30136.661933] init: starting service 'adbd'...
[30136.700126] read descriptors
[30136.700413] read strings
[30138.574484] configfs-gadget gadget: adding 'Function FS Gadget'/ffffff80f3868438 to config 'c'/ffffff817d6a2520
[30138.575497] configfs-gadget gadget: adding 'Mass Storage Function'/ffffff810af87200 to config 'c'/ffffff817d6a2520
[30138.575554] configfs-gadget gadget: adding 'cdc_network'/ffffff80f48d1a00 to config 'c'/ffffff817d6a2520
[30138.575631] using random self ethernet address
[30138.575660] using random host ethernet address
[30138.595338] usb0: HOST MAC 2e:cf:43:cd:ca:c8
[30138.597160] usb0: MAC 6a:f0:9f:ee:82:a0
[30138.791490] configfs-gadget gadget: super-speed config #1: c
Fix this by reversing the iteration order of the functions in
purge_config_funcs() when unbinding them, and adding them back to
the config's func_list at the head instead of the tail. This
ensures that we unbind and unwind back to the original list order.
Fixes: 88af8bbe4ef7 ("usb: gadget: the start of the configfs interface")
Signed-off-by: Chandana Kishori Chiluveru <cchiluve(a)codeaurora.org>
Signed-off-by: Jack Pham <jackp(a)codeaurora.org>
Reviewed-by: Peter Chen <peter.chen(a)nxp.com>
Link: https://lore.kernel.org/r/20201229224443.31623-1-jackp@codeaurora.org
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/gadget/configfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/gadget/configfs.c b/drivers/usb/gadget/configfs.c
index d9743f4b56c3..408a5332a975 100644
--- a/drivers/usb/gadget/configfs.c
+++ b/drivers/usb/gadget/configfs.c
@@ -1255,9 +1255,9 @@ static void purge_configs_funcs(struct gadget_info *gi)
cfg = container_of(c, struct config_usb_cfg, c);
- list_for_each_entry_safe(f, tmp, &c->functions, list) {
+ list_for_each_entry_safe_reverse(f, tmp, &c->functions, list) {
- list_move_tail(&f->list, &cfg->func_list);
+ list_move(&f->list, &cfg->func_list);
if (f->unbind) {
dev_dbg(&gi->cdev.gadget->dev,
"unbind function '%s'/%p\n",
--
2.30.0
This is a note to let you know that I've just added the patch titled
usb: gadget: select CONFIG_CRC32
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From d7889c2020e08caab0d7e36e947f642d91015bd0 Mon Sep 17 00:00:00 2001
From: Arnd Bergmann <arnd(a)arndb.de>
Date: Sun, 3 Jan 2021 22:42:17 +0100
Subject: usb: gadget: select CONFIG_CRC32
Without crc32 support, this driver fails to link:
arm-linux-gnueabi-ld: drivers/usb/gadget/function/f_eem.o: in function `eem_unwrap':
f_eem.c:(.text+0x11cc): undefined reference to `crc32_le'
arm-linux-gnueabi-ld: drivers/usb/gadget/function/f_ncm.o:f_ncm.c:(.text+0x1e40):
more undefined references to `crc32_le' follow
Fixes: 6d3865f9d41f ("usb: gadget: NCM: Add transmit multi-frame.")
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20210103214224.1996535-1-arnd@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/gadget/Kconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
index 7e47e6223089..2d152571a7de 100644
--- a/drivers/usb/gadget/Kconfig
+++ b/drivers/usb/gadget/Kconfig
@@ -265,6 +265,7 @@ config USB_CONFIGFS_NCM
depends on NET
select USB_U_ETHER
select USB_F_NCM
+ select CRC32
help
NCM is an advanced protocol for Ethernet encapsulation, allows
grouping of several ethernet frames into one USB transfer and
@@ -314,6 +315,7 @@ config USB_CONFIGFS_EEM
depends on NET
select USB_U_ETHER
select USB_F_EEM
+ select CRC32
help
CDC EEM is a newer USB standard that is somewhat simpler than CDC ECM
and therefore can be supported by more hardware. Technically ECM and
--
2.30.0
This is a note to let you know that I've just added the patch titled
usb: dwc3: gadget: Restart DWC3 gadget when enabling pullup
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From a1383b3537a7bea1c213baa7878ccc4ecf4413b5 Mon Sep 17 00:00:00 2001
From: Wesley Cheng <wcheng(a)codeaurora.org>
Date: Tue, 29 Dec 2020 15:00:37 -0800
Subject: usb: dwc3: gadget: Restart DWC3 gadget when enabling pullup
usb_gadget_deactivate/usb_gadget_activate does not execute the UDC start
operation, which may leave EP0 disabled and event IRQs disabled when
re-activating the function. Move the enabling/disabling of USB EP0 and
device event IRQs to be performed in the pullup routine.
Fixes: ae7e86108b12 ("usb: dwc3: Stop active transfers before halting the controller")
Tested-by: Michael Tretter <m.tretter(a)pengutronix.de>
Cc: stable <stable(a)vger.kernel.org>
Reported-by: Michael Tretter <m.tretter(a)pengutronix.de>
Signed-off-by: Wesley Cheng <wcheng(a)codeaurora.org>
Link: https://lore.kernel.org/r/1609282837-21666-1-git-send-email-wcheng@codeauro…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/dwc3/gadget.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 78cb4db8a6e4..25f654b79e48 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -2083,6 +2083,7 @@ static int dwc3_gadget_run_stop(struct dwc3 *dwc, int is_on, int suspend)
static void dwc3_gadget_disable_irq(struct dwc3 *dwc);
static void __dwc3_gadget_stop(struct dwc3 *dwc);
+static int __dwc3_gadget_start(struct dwc3 *dwc);
static int dwc3_gadget_pullup(struct usb_gadget *g, int is_on)
{
@@ -2145,6 +2146,8 @@ static int dwc3_gadget_pullup(struct usb_gadget *g, int is_on)
dwc->ev_buf->lpos = (dwc->ev_buf->lpos + count) %
dwc->ev_buf->length;
}
+ } else {
+ __dwc3_gadget_start(dwc);
}
ret = dwc3_gadget_run_stop(dwc, is_on, false);
@@ -2319,10 +2322,6 @@ static int dwc3_gadget_start(struct usb_gadget *g,
}
dwc->gadget_driver = driver;
-
- if (pm_runtime_active(dwc->dev))
- __dwc3_gadget_start(dwc);
-
spin_unlock_irqrestore(&dwc->lock, flags);
return 0;
@@ -2348,13 +2347,6 @@ static int dwc3_gadget_stop(struct usb_gadget *g)
unsigned long flags;
spin_lock_irqsave(&dwc->lock, flags);
-
- if (pm_runtime_suspended(dwc->dev))
- goto out;
-
- __dwc3_gadget_stop(dwc);
-
-out:
dwc->gadget_driver = NULL;
spin_unlock_irqrestore(&dwc->lock, flags);
--
2.30.0
This is a note to let you know that I've just added the patch titled
usb: usbip: vhci_hcd: protect shift size
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 718bf42b119de652ebcc93655a1f33a9c0d04b3c Mon Sep 17 00:00:00 2001
From: Randy Dunlap <rdunlap(a)infradead.org>
Date: Mon, 28 Dec 2020 23:13:09 -0800
Subject: usb: usbip: vhci_hcd: protect shift size
Fix shift out-of-bounds in vhci_hcd.c:
UBSAN: shift-out-of-bounds in ../drivers/usb/usbip/vhci_hcd.c:399:41
shift exponent 768 is too large for 32-bit type 'int'
Fixes: 03cd00d538a6 ("usbip: vhci-hcd: Set the vhci structure up to work")
Signed-off-by: Randy Dunlap <rdunlap(a)infradead.org>
Reported-by: syzbot+297d20e437b79283bf6d(a)syzkaller.appspotmail.com
Cc: Yuyang Du <yuyang.du(a)intel.com>
Cc: Shuah Khan <shuahkh(a)osg.samsung.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: linux-usb(a)vger.kernel.org
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20201229071309.18418-1-rdunlap@infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/usbip/vhci_hcd.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/usb/usbip/vhci_hcd.c b/drivers/usb/usbip/vhci_hcd.c
index 66cde5e5f796..3209b5ddd30c 100644
--- a/drivers/usb/usbip/vhci_hcd.c
+++ b/drivers/usb/usbip/vhci_hcd.c
@@ -396,6 +396,8 @@ static int vhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue,
default:
usbip_dbg_vhci_rh(" ClearPortFeature: default %x\n",
wValue);
+ if (wValue >= 32)
+ goto error;
vhci_hcd->port_status[rhport] &= ~(1 << wValue);
break;
}
--
2.30.0
From: Linus Walleij <linus.walleij(a)linaro.org>
[ Upstream commit d6d51a96c7d63b7450860a3037f2d62388286a52 ]
Functions like memset()/memmove()/memcpy() do a lot of memory
accesses.
If a bad pointer is passed to one of these functions it is important
to catch this. Compiler instrumentation cannot do this since these
functions are written in assembly.
KASan replaces these memory functions with instrumented variants.
The original functions are declared as weak symbols so that
the strong definitions in mm/kasan/kasan.c can replace them.
The original functions have aliases with a '__' prefix in their
name, so we can call the non-instrumented variant if needed.
We must use __memcpy()/__memset() in place of memcpy()/memset()
when we copy .data to RAM and when we clear .bss, because
kasan_early_init cannot be called before the initialization of
.data and .bss.
For the kernel compression and EFI libstub's custom string
libraries we need a special quirk: even if these are built
without KASan enabled, they rely on the global headers for their
custom string libraries, which means that e.g. memcpy()
will be defined to __memcpy() and we get link failures.
Since these implementations are written i C rather than
assembly we use e.g. __alias(memcpy) to redirected any
users back to the local implementation.
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: kasan-dev(a)googlegroups.com
Reviewed-by: Ard Biesheuvel <ardb(a)kernel.org>
Tested-by: Ard Biesheuvel <ardb(a)kernel.org> # QEMU/KVM/mach-virt/LPAE/8G
Tested-by: Florian Fainelli <f.fainelli(a)gmail.com> # Brahma SoCs
Tested-by: Ahmad Fatoum <a.fatoum(a)pengutronix.de> # i.MX6Q
Reported-by: Russell King - ARM Linux <rmk+kernel(a)armlinux.org.uk>
Signed-off-by: Ahmad Fatoum <a.fatoum(a)pengutronix.de>
Signed-off-by: Abbott Liu <liuwenliang(a)huawei.com>
Signed-off-by: Florian Fainelli <f.fainelli(a)gmail.com>
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
Signed-off-by: Russell King <rmk+kernel(a)armlinux.org.uk>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/arm/boot/compressed/string.c | 19 +++++++++++++++++++
arch/arm/include/asm/string.h | 26 ++++++++++++++++++++++++++
arch/arm/kernel/head-common.S | 4 ++--
arch/arm/lib/memcpy.S | 3 +++
arch/arm/lib/memmove.S | 5 ++++-
arch/arm/lib/memset.S | 3 +++
6 files changed, 57 insertions(+), 3 deletions(-)
diff --git a/arch/arm/boot/compressed/string.c b/arch/arm/boot/compressed/string.c
index ade5079bebbf9..8c0fa276d9946 100644
--- a/arch/arm/boot/compressed/string.c
+++ b/arch/arm/boot/compressed/string.c
@@ -7,6 +7,25 @@
#include <linux/string.h>
+/*
+ * The decompressor is built without KASan but uses the same redirects as the
+ * rest of the kernel when CONFIG_KASAN is enabled, defining e.g. memcpy()
+ * to __memcpy() but since we are not linking with the main kernel string
+ * library in the decompressor, that will lead to link failures.
+ *
+ * Undefine KASan's versions, define the wrapped functions and alias them to
+ * the right names so that when e.g. __memcpy() appear in the code, it will
+ * still be linked to this local version of memcpy().
+ */
+#ifdef CONFIG_KASAN
+#undef memcpy
+#undef memmove
+#undef memset
+void *__memcpy(void *__dest, __const void *__src, size_t __n) __alias(memcpy);
+void *__memmove(void *__dest, __const void *__src, size_t count) __alias(memmove);
+void *__memset(void *s, int c, size_t count) __alias(memset);
+#endif
+
void *memcpy(void *__dest, __const void *__src, size_t __n)
{
int i = 0;
diff --git a/arch/arm/include/asm/string.h b/arch/arm/include/asm/string.h
index 111a1d8a41ddf..6c607c68f3ad7 100644
--- a/arch/arm/include/asm/string.h
+++ b/arch/arm/include/asm/string.h
@@ -5,6 +5,9 @@
/*
* We don't do inline string functions, since the
* optimised inline asm versions are not small.
+ *
+ * The __underscore versions of some functions are for KASan to be able
+ * to replace them with instrumented versions.
*/
#define __HAVE_ARCH_STRRCHR
@@ -15,15 +18,18 @@ extern char * strchr(const char * s, int c);
#define __HAVE_ARCH_MEMCPY
extern void * memcpy(void *, const void *, __kernel_size_t);
+extern void *__memcpy(void *dest, const void *src, __kernel_size_t n);
#define __HAVE_ARCH_MEMMOVE
extern void * memmove(void *, const void *, __kernel_size_t);
+extern void *__memmove(void *dest, const void *src, __kernel_size_t n);
#define __HAVE_ARCH_MEMCHR
extern void * memchr(const void *, int, __kernel_size_t);
#define __HAVE_ARCH_MEMSET
extern void * memset(void *, int, __kernel_size_t);
+extern void *__memset(void *s, int c, __kernel_size_t n);
#define __HAVE_ARCH_MEMSET32
extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
@@ -39,4 +45,24 @@ static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
return __memset64(p, v, n * 8, v >> 32);
}
+/*
+ * For files that are not instrumented (e.g. mm/slub.c) we
+ * must use non-instrumented versions of the mem*
+ * functions named __memcpy() etc. All such kernel code has
+ * been tagged with KASAN_SANITIZE_file.o = n, which means
+ * that the address sanitization argument isn't passed to the
+ * compiler, and __SANITIZE_ADDRESS__ is not set. As a result
+ * these defines kick in.
+ */
+#if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
+#define memcpy(dst, src, len) __memcpy(dst, src, len)
+#define memmove(dst, src, len) __memmove(dst, src, len)
+#define memset(s, c, n) __memset(s, c, n)
+
+#ifndef __NO_FORTIFY
+#define __NO_FORTIFY /* FORTIFY_SOURCE uses __builtin_memcpy, etc. */
+#endif
+
+#endif
+
#endif
diff --git a/arch/arm/kernel/head-common.S b/arch/arm/kernel/head-common.S
index 4a3982812a401..6840c7c60a858 100644
--- a/arch/arm/kernel/head-common.S
+++ b/arch/arm/kernel/head-common.S
@@ -95,7 +95,7 @@ __mmap_switched:
THUMB( ldmia r4!, {r0, r1, r2, r3} )
THUMB( mov sp, r3 )
sub r2, r2, r1
- bl memcpy @ copy .data to RAM
+ bl __memcpy @ copy .data to RAM
#endif
ARM( ldmia r4!, {r0, r1, sp} )
@@ -103,7 +103,7 @@ __mmap_switched:
THUMB( mov sp, r3 )
sub r2, r1, r0
mov r1, #0
- bl memset @ clear .bss
+ bl __memset @ clear .bss
ldmia r4, {r0, r1, r2, r3}
str r9, [r0] @ Save processor ID
diff --git a/arch/arm/lib/memcpy.S b/arch/arm/lib/memcpy.S
index 09a333153dc66..ad4625d16e117 100644
--- a/arch/arm/lib/memcpy.S
+++ b/arch/arm/lib/memcpy.S
@@ -58,6 +58,8 @@
/* Prototype: void *memcpy(void *dest, const void *src, size_t n); */
+.weak memcpy
+ENTRY(__memcpy)
ENTRY(mmiocpy)
ENTRY(memcpy)
@@ -65,3 +67,4 @@ ENTRY(memcpy)
ENDPROC(memcpy)
ENDPROC(mmiocpy)
+ENDPROC(__memcpy)
diff --git a/arch/arm/lib/memmove.S b/arch/arm/lib/memmove.S
index b50e5770fb44d..fd123ea5a5a4a 100644
--- a/arch/arm/lib/memmove.S
+++ b/arch/arm/lib/memmove.S
@@ -24,12 +24,14 @@
* occurring in the opposite direction.
*/
+.weak memmove
+ENTRY(__memmove)
ENTRY(memmove)
UNWIND( .fnstart )
subs ip, r0, r1
cmphi r2, ip
- bls memcpy
+ bls __memcpy
stmfd sp!, {r0, r4, lr}
UNWIND( .fnend )
@@ -222,3 +224,4 @@ ENTRY(memmove)
18: backward_copy_shift push=24 pull=8
ENDPROC(memmove)
+ENDPROC(__memmove)
diff --git a/arch/arm/lib/memset.S b/arch/arm/lib/memset.S
index 6ca4535c47fb6..0e7ff0423f50b 100644
--- a/arch/arm/lib/memset.S
+++ b/arch/arm/lib/memset.S
@@ -13,6 +13,8 @@
.text
.align 5
+.weak memset
+ENTRY(__memset)
ENTRY(mmioset)
ENTRY(memset)
UNWIND( .fnstart )
@@ -132,6 +134,7 @@ UNWIND( .fnstart )
UNWIND( .fnend )
ENDPROC(memset)
ENDPROC(mmioset)
+ENDPROC(__memset)
ENTRY(__memset32)
UNWIND( .fnstart )
--
2.27.0
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 78af4dc949daaa37b3fcd5f348f373085b4e858f Mon Sep 17 00:00:00 2001
From: "peterz(a)infradead.org" <peterz(a)infradead.org>
Date: Fri, 28 Aug 2020 14:37:20 +0200
Subject: [PATCH] perf: Break deadlock involving exec_update_mutex
Syzbot reported a lock inversion involving perf. The sore point being
perf holding exec_update_mutex() for a very long time, specifically
across a whole bunch of filesystem ops in pmu::event_init() (uprobes)
and anon_inode_getfile().
This then inverts against procfs code trying to take
exec_update_mutex.
Move the permission checks later, such that we need to hold the mutex
over less code.
Reported-by: syzbot+db9cdf3dd1f64252c6ef(a)syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a21b0be2f22c..19ae6c931c52 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11832,24 +11832,6 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_task;
}
- if (task) {
- err = mutex_lock_interruptible(&task->signal->exec_update_mutex);
- if (err)
- goto err_task;
-
- /*
- * Preserve ptrace permission check for backwards compatibility.
- *
- * We must hold exec_update_mutex across this and any potential
- * perf_install_in_context() call for this new event to
- * serialize against exec() altering our credentials (and the
- * perf_event_exit_task() that could imply).
- */
- err = -EACCES;
- if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
- goto err_cred;
- }
-
if (flags & PERF_FLAG_PID_CGROUP)
cgroup_fd = pid;
@@ -11857,7 +11839,7 @@ SYSCALL_DEFINE5(perf_event_open,
NULL, NULL, cgroup_fd);
if (IS_ERR(event)) {
err = PTR_ERR(event);
- goto err_cred;
+ goto err_task;
}
if (is_sampling_event(event)) {
@@ -11976,6 +11958,24 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_context;
}
+ if (task) {
+ err = mutex_lock_interruptible(&task->signal->exec_update_mutex);
+ if (err)
+ goto err_file;
+
+ /*
+ * Preserve ptrace permission check for backwards compatibility.
+ *
+ * We must hold exec_update_mutex across this and any potential
+ * perf_install_in_context() call for this new event to
+ * serialize against exec() altering our credentials (and the
+ * perf_event_exit_task() that could imply).
+ */
+ err = -EACCES;
+ if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
+ goto err_cred;
+ }
+
if (move_group) {
gctx = __perf_event_ctx_lock_double(group_leader, ctx);
@@ -12151,7 +12151,10 @@ SYSCALL_DEFINE5(perf_event_open,
if (move_group)
perf_event_ctx_unlock(group_leader, gctx);
mutex_unlock(&ctx->mutex);
-/* err_file: */
+err_cred:
+ if (task)
+ mutex_unlock(&task->signal->exec_update_mutex);
+err_file:
fput(event_file);
err_context:
perf_unpin_context(ctx);
@@ -12163,9 +12166,6 @@ SYSCALL_DEFINE5(perf_event_open,
*/
if (!event_file)
free_event(event);
-err_cred:
- if (task)
- mutex_unlock(&task->signal->exec_update_mutex);
err_task:
if (task)
put_task_struct(task);
There are several reports about the tps6598x causing
interrupt flood on boards with the INT3515 ACPI node, which
then causes instability. There appears to be several
problems with the interrupt. One problem is that the
I2CSerialBus resources do not always map to the Interrupt
resource with the same index, but that is not the only
problem. We have not been able to come up with a solution
for all the issues, and because of that disabling the device
for now.
The PD controller on these platforms is autonomous, and the
purpose for the driver is primarily to supply status to the
userspace, so this will not affect any functionality.
Reported-by: Moody Salem <moody(a)uniswap.org>
Fixes: a3dd034a1707 ("ACPI / scan: Create platform device for INT3515 ACPI nodes")
Cc: stable(a)vger.kernel.org
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1883511
Signed-off-by: Heikki Krogerus <heikki.krogerus(a)linux.intel.com>
---
drivers/platform/x86/i2c-multi-instantiate.c | 31 +++++++++++++++-----
1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/drivers/platform/x86/i2c-multi-instantiate.c b/drivers/platform/x86/i2c-multi-instantiate.c
index 6acc8457866e1..e1df665d3ad31 100644
--- a/drivers/platform/x86/i2c-multi-instantiate.c
+++ b/drivers/platform/x86/i2c-multi-instantiate.c
@@ -166,13 +166,29 @@ static const struct i2c_inst_data bsg2150_data[] = {
{}
};
-static const struct i2c_inst_data int3515_data[] = {
- { "tps6598x", IRQ_RESOURCE_APIC, 0 },
- { "tps6598x", IRQ_RESOURCE_APIC, 1 },
- { "tps6598x", IRQ_RESOURCE_APIC, 2 },
- { "tps6598x", IRQ_RESOURCE_APIC, 3 },
- {}
-};
+/*
+ * Device with _HID INT3515 (TI PD controllers) has some unresolved interrupt
+ * issues. The most common problem seen is interrupt flood.
+ *
+ * There are at least two known causes. Firstly, on some boards, the
+ * I2CSerialBus resource index does not match the Interrupt resource, i.e. they
+ * are not one-to-one mapped like in the array below. Secondly, on some boards
+ * the irq line from the PD controller is not actually connected at all. But the
+ * interrupt flood is also seen on some boards where those are not a problem, so
+ * there are some other problems as well.
+ *
+ * Because of the issues with the interrupt, the device is disabled for now. If
+ * you wish to debug the issues, uncomment the below, and add an entry for the
+ * INT3515 device to the i2c_multi_instance__ids table.
+ *
+ * static const struct i2c_inst_data int3515_data[] = {
+ * { "tps6598x", IRQ_RESOURCE_APIC, 0 },
+ * { "tps6598x", IRQ_RESOURCE_APIC, 1 },
+ * { "tps6598x", IRQ_RESOURCE_APIC, 2 },
+ * { "tps6598x", IRQ_RESOURCE_APIC, 3 },
+ * { }
+ * };
+ */
/*
* Note new device-ids must also be added to i2c_multi_instantiate_ids in
@@ -181,7 +197,6 @@ static const struct i2c_inst_data int3515_data[] = {
static const struct acpi_device_id i2c_multi_inst_acpi_ids[] = {
{ "BSG1160", (unsigned long)bsg1160_data },
{ "BSG2150", (unsigned long)bsg2150_data },
- { "INT3515", (unsigned long)int3515_data },
{ }
};
MODULE_DEVICE_TABLE(acpi, i2c_multi_inst_acpi_ids);
--
2.29.2
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a9d2e9ae953f0ddd0327479c81a085adaa76d903 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg(a)nvidia.com>
Date: Fri, 6 Nov 2020 10:00:49 -0400
Subject: [PATCH] RDMA/siw,rxe: Make emulated devices virtual in the device
tree
This moves siw and rxe to be virtual devices in the device tree:
lrwxrwxrwx 1 root root 0 Nov 6 13:55 /sys/class/infiniband/rxe0 -> ../../devices/virtual/infiniband/rxe0/
Previously they were trying to parent themselves to the physical device of
their attached netdev, which doesn't make alot of sense.
My hope is this will solve some weird syzkaller hits related to sysfs as
it could be possible that the parent of a netdev is another netdev, eg
under bonding or some other syzkaller found netdev configuration.
Nesting a ib_device under anything but a physical device is going to cause
inconsistencies in sysfs during destructions.
Link: https://lore.kernel.org/r/0-v1-dcbfc68c4b4a+d6-virtual_dev_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index ee3cf3af7cab..c4b06ced30a7 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -19,15 +19,6 @@
static struct rxe_recv_sockets recv_sockets;
-struct device *rxe_dma_device(struct rxe_dev *rxe)
-{
- struct net_device *ndev;
-
- ndev = rxe->ndev;
-
- return ndev->dev.parent;
-}
-
int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
{
int err;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index a2bd91aaa5de..2fbea2b2d72a 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -1134,7 +1134,6 @@ int rxe_register_device(struct rxe_dev *rxe, const char *ibdev_name)
dev->node_type = RDMA_NODE_IB_CA;
dev->phys_port_cnt = 1;
dev->num_comp_vectors = num_possible_cpus();
- dev->dev.parent = rxe_dma_device(rxe);
dev->local_dma_lkey = 0;
addrconf_addr_eui48((unsigned char *)&dev->node_guid,
rxe->ndev->dev_addr);
diff --git a/drivers/infiniband/sw/siw/siw_main.c b/drivers/infiniband/sw/siw/siw_main.c
index d9de062852c4..ee95cf29179d 100644
--- a/drivers/infiniband/sw/siw/siw_main.c
+++ b/drivers/infiniband/sw/siw/siw_main.c
@@ -305,24 +305,8 @@ static struct siw_device *siw_device_create(struct net_device *netdev)
{
struct siw_device *sdev = NULL;
struct ib_device *base_dev;
- struct device *parent = netdev->dev.parent;
int rv;
- if (!parent) {
- /*
- * The loopback device has no parent device,
- * so it appears as a top-level device. To support
- * loopback device connectivity, take this device
- * as the parent device. Skip all other devices
- * w/o parent device.
- */
- if (netdev->type != ARPHRD_LOOPBACK) {
- pr_warn("siw: device %s error: no parent device\n",
- netdev->name);
- return NULL;
- }
- parent = &netdev->dev;
- }
sdev = ib_alloc_device(siw_device, base_dev);
if (!sdev)
return NULL;
@@ -359,7 +343,6 @@ static struct siw_device *siw_device_create(struct net_device *netdev)
* per physical port.
*/
base_dev->phys_port_cnt = 1;
- base_dev->dev.parent = parent;
base_dev->num_comp_vectors = num_possible_cpus();
xa_init_flags(&sdev->qp_xa, XA_FLAGS_ALLOC1);
@@ -401,7 +384,7 @@ static struct siw_device *siw_device_create(struct net_device *netdev)
atomic_set(&sdev->num_mr, 0);
atomic_set(&sdev->num_pd, 0);
- sdev->numa_node = dev_to_node(parent);
+ sdev->numa_node = dev_to_node(&netdev->dev);
spin_lock_init(&sdev->lock);
return sdev;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From bd14bf0e4a084514aa62d24d2109e0f09a93822f Mon Sep 17 00:00:00 2001
From: Stanley Chu <stanley.chu(a)mediatek.com>
Date: Tue, 8 Dec 2020 21:56:34 +0800
Subject: [PATCH] scsi: ufs: Re-enable WriteBooster after device reset
UFS 3.1 specification mentions that the WriteBooster flags listed below
will be set to their default values, i.e. disabled, after power cycle or
any type of reset event. Thus we need to reset the flag variables kept in
struct hba to align with the device status and ensure that
WriteBooster-related functions are configured properly after device reset.
Without this fix, WriteBooster will not be enabled successfully after by
ufshcd_wb_ctrl() after device reset because hba->wb_enabled remains true.
Flags required to be reset to default values:
- fWriteBoosterEn: hba->wb_enabled
- fWriteBoosterBufferFlushEn: hba->wb_buf_flush_enabled
- fWriteBoosterBufferFlushDuringHibernate: No variable mapped
Link: https://lore.kernel.org/r/20201208135635.15326-2-stanley.chu@mediatek.com
Fixes: 3d17b9b5ab11 ("scsi: ufs: Add write booster feature support")
Reviewed-by: Bean Huo <beanhuo(a)micron.com>
Signed-off-by: Stanley Chu <stanley.chu(a)mediatek.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h
index 08c8a591e6b0..36d367eb8139 100644
--- a/drivers/scsi/ufs/ufshcd.h
+++ b/drivers/scsi/ufs/ufshcd.h
@@ -1221,8 +1221,13 @@ static inline void ufshcd_vops_device_reset(struct ufs_hba *hba)
if (hba->vops && hba->vops->device_reset) {
int err = hba->vops->device_reset(hba);
- if (!err)
+ if (!err) {
ufshcd_set_ufs_dev_active(hba);
+ if (ufshcd_is_wb_allowed(hba)) {
+ hba->wb_enabled = false;
+ hba->wb_buf_flush_enabled = false;
+ }
+ }
if (err != -EOPNOTSUPP)
ufshcd_update_evt_hist(hba, UFS_EVT_DEV_RESET, err);
}
The dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag, they
should be deleted when the process exits.
Suppose the following race appears:
release_task dput
-> proc_flush_task
-> dentry->d_op->d_delete(dentry)
-> __exit_signal
-> dentry->d_lockref.count-- and return.
In the proc_flush_task(), if another process is using this dentry, it will
not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
before __exit_signal(pid has not been hashed), d_delete returns false, so
this dentry still cannot be deleted.
This dentry will always be cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
these dentries can only be deleted when drop_caches is manually triggered.
This will result in wasted memory. What's more troublesome is that these
dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may result in the inability to create a new pid_ns.
This problem occurred in our cluster environment (Linux 4.9 LTS).
We could reproduce it by manually constructing a test program + adding some
debugging switches in the kernel:
* A test program to open the directory (/proc/<pid>/ns) [1]
* Adding some debugging switches to the kernel, adding a delay between
proc_flush_task and __exit_signal in release_task() [2]
The test process is as follows:
A, terminal #1
Turn on the debug switch:
echo 1> /proc/sys/vm/dentry_debug_trace
Execute the following unshare command:
sudo unshare --pid --fork --mount-proc bash
B, terminal #2
Find the pid of the unshare process:
# pstree -p | grep unshare
| `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)
Find the corresponding dentry:
# dmesg | grep pid=818
[70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/ffff8802c7b670e8
C, terminal #3
Execute the opendir program, it will always open the /proc/818/ns/ directory:
# ./a.out /proc/818/ns/
pid: 876
.
..
net
uts
ipc
pid
user
mnt
cgroup
D, go back to terminal #2
Turn on the debugging switches to construct the race:
# echo 818> /proc/sys/vm/dentry_debug_pid
# echo 1> /proc/sys/vm/dentry_debug_delay
Kill the unshare process (pid 818). Since the debugging switches have been
turned on, it will get stuck in release_task():
# kill -9 818
Then kill the process that opened the /proc/818/ns/ directory:
# kill -9 876
Then turn off these debugging switches to allow the 818 process to exit:
# echo 0> /proc/sys/vm/dentry_debug_delay
# echo 0> /proc/sys/vm/dentry_debug_pid
Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0,
and the flag is 2800cc (#define DCACHE_OP_DELETE 0x00000008), but it is still
cached:
# dmesg | grep ffff8802a3999548
…
[565.559156] XXX dput:853 dentry=ns/ffff8802bea7b528, flag=2800cc, cnt=0, inode=ffff8802b38c2010, pdentry=818/ffff8802c7b670e8, pflag=20008c, pcnt=1, pinode=ffff8802c7812010, keywords: be cached
It could also be verified via the crash tool:
crash> dentry.d_flags,d_iname,d_inode,d_lockref -x ffff8802bea7b528
d_flags = 0x2800cc
d_iname = "ns\000kkkkkkkkkkkkkkkkkkkkkkkkkkkk"
d_inode = 0xffff8802b38c2010
d_lockref = {
{
lock_count = 0x0,
{
lock = {
{
rlock = {
raw_lock = {
{
val = {
counter = 0x0
},
{
locked = 0x0,
pending = 0x0
},
{
locked_pending = 0x0,
tail = 0x0
}
}
}
}
}
},
count = 0x0
}
}
}
crash> kmem ffff8802bea7b528
CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME
ffff8802dd5f5900 192 23663 26130 871 16k dentry
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffea000afa9e00 ffff8802bea78000 0 30 25 5
FREE / [ALLOCATED]
[ffff8802bea7b520]
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea000afa9ec0 2bea7b000 dead000000000400 0 0 2fffff80000000
crash>
This series of patches is to fix this issue.
Regards,
Wen
Alexey Dobriyan (1):
proc: use %u for pid printing and slightly less stack
Andreas Gruenbacher (1):
proc: Pass file mode to proc_pid_make_inode
Christian Brauner (1):
clone: add CLONE_PIDFD
Eric W. Biederman (6):
proc: Better ownership of files for non-dumpable tasks in user
namespaces
proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
proc: Clear the pieces of proc_inode that proc_evict_inode cares about
proc: Use d_invalidate in proc_prune_siblings_dcache
proc: Use a list of inodes to flush from proc
Joel Fernandes (Google) (1):
pidfd: add polling support
fs/proc/base.c | 242 ++++++++++++++++++++-------------------------
fs/proc/fd.c | 20 +---
fs/proc/inode.c | 67 ++++++++++++-
fs/proc/internal.h | 22 ++---
fs/proc/namespaces.c | 3 +-
fs/proc/proc_sysctl.c | 45 ++-------
fs/proc/self.c | 6 +-
fs/proc/thread_self.c | 5 +-
include/linux/pid.h | 5 +
include/linux/proc_fs.h | 4 +-
include/uapi/linux/sched.h | 1 +
kernel/exit.c | 5 +-
kernel/fork.c | 145 ++++++++++++++++++++++++++-
kernel/pid.c | 3 +
kernel/signal.c | 11 +++
security/selinux/hooks.c | 1 +
16 files changed, 357 insertions(+), 228 deletions(-)
[1] A test program to open the directory (/proc/<pid>/ns)
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <errno.h>
int main(int argc, char *argv[])
{
DIR *dip;
struct dirent *dit;
if (argc < 2) {
printf("Usage :%s <directory>\n", argv[0]);
return -1;
}
if ((dip = opendir(argv[1])) == NULL) {
perror("opendir");
return -1;
}
printf("pid: %d\n", getpid());
while((dit = readdir (dip)) != NULL) {
printf("%s\n", dit->d_name);
}
while (1)
sleep (1);
return 0;
}
[2] Adding some debugging switches to the kernel, also adding a delay between
proc_flush_task and __exit_signal in release_task():
diff --git a/fs/dcache.c b/fs/dcache.c
index 05bad55..fafad37 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -84,6 +84,9 @@
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
+int sysctl_dentry_debug_trace __read_mostly = 0;
+EXPORT_SYMBOL_GPL(sysctl_dentry_debug_trace);
+
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
EXPORT_SYMBOL(rename_lock);
@@ -758,6 +761,26 @@ static inline bool fast_dput(struct dentry *dentry)
return 0;
}
+#define DENTRY_DEBUG_TRACE(dentry, keywords) \
+do { \
+ if (sysctl_dentry_debug_trace) \
+ printk("XXX %s:%d " \
+ "dentry=%s/%p, flag=%x, cnt=%d, inode=%p, " \
+ "pdentry=%s/%p, pflag=%x, pcnt=%d, pinode=%p, " \
+ "keywords: %s\n", \
+ __func__, __LINE__, \
+ dentry->d_name.name, \
+ dentry, \
+ dentry->d_flags, \
+ dentry->d_lockref.count, \
+ dentry->d_inode, \
+ dentry->d_parent->d_name.name, \
+ dentry->d_parent, \
+ dentry->d_parent->d_flags, \
+ dentry->d_parent->d_lockref.count, \
+ dentry->d_parent->d_inode, \
+ keywords); \
+} while (0)
/*
* This is dput
@@ -804,6 +827,8 @@ void dput(struct dentry *dentry)
WARN_ON(d_in_lookup(dentry));
+ DENTRY_DEBUG_TRACE(dentry, "be checked");
+
/* Unreachable? Get rid of it */
if (unlikely(d_unhashed(dentry)))
goto kill_it;
@@ -812,8 +837,10 @@ void dput(struct dentry *dentry)
goto kill_it;
if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
- if (dentry->d_op->d_delete(dentry))
+ if (dentry->d_op->d_delete(dentry)) {
+ DENTRY_DEBUG_TRACE(dentry, "be killed");
goto kill_it;
+ }
}
if (!(dentry->d_flags & DCACHE_REFERENCED))
@@ -822,6 +849,9 @@ void dput(struct dentry *dentry)
dentry->d_lockref.count--;
spin_unlock(&dentry->d_lock);
+
+ DENTRY_DEBUG_TRACE(dentry, "be cached");
+
return;
kill_it:
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b9e4183..419a409 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3090,6 +3090,8 @@ void proc_flush_task(struct task_struct *task)
}
}
+extern int sysctl_dentry_debug_trace;
+
static int proc_pid_instantiate(struct inode *dir,
struct dentry * dentry,
struct task_struct *task, const void *ptr)
@@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir,
d_set_d_op(dentry, &pid_dentry_operations);
d_add(dentry, inode);
+
+ if (sysctl_dentry_debug_trace)
+ printk("XXX %s:%d pid=%d tid=%d entry=%s/%p\n",
+ __func__, __LINE__, task->pid, task->tgid,
+ dentry->d_name.name, dentry);
+
/* Close the race of the process dying before we return the dentry */
if (pid_revalidate(dentry, 0))
return 0;
diff --git a/kernel/exit.c b/kernel/exit.c
index 27f4168..2b3e1b6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -55,6 +55,8 @@
#include <linux/shm.h>
#include <linux/kcov.h>
+#include <linux/delay.h>
+
#include <asm/uaccess.h>
#include <asm/unistd.h>
#include <asm/pgtable.h>
@@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
put_task_struct(tsk);
}
+int sysctl_dentry_debug_delay __read_mostly = 0;
+int sysctl_dentry_debug_pid __read_mostly = 0;
void release_task(struct task_struct *p)
{
@@ -178,6 +182,11 @@ void release_task(struct task_struct *p)
proc_flush_task(p);
+ if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) {
+ while (sysctl_dentry_debug_delay)
+ mdelay(1);
+ }
+
write_lock_irq(&tasklist_lock);
ptrace_release_task(p);
__exit_signal(p);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 513e6da..27f1395 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
static int max_extfrag_threshold = 1000;
#endif
+extern int sysctl_dentry_debug_trace;
+extern int sysctl_dentry_debug_delay;
+extern int sysctl_dentry_debug_pid;
+
static struct ctl_table kern_table[] = {
{
.procname = "sched_child_runs_first",
@@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
.proc_handler = proc_dointvec,
.extra1 = &zero,
},
+ {
+ .procname = "dentry_debug_trace",
+ .data = &sysctl_dentry_debug_trace,
+ .maxlen = sizeof(sysctl_dentry_debug_trace),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ },
+ {
+ .procname = "dentry_debug_delay",
+ .data = &sysctl_dentry_debug_delay,
+ .maxlen = sizeof(sysctl_dentry_debug_delay),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ },
+ {
+ .procname = "dentry_debug_pid",
+ .data = &sysctl_dentry_debug_pid,
+ .maxlen = sizeof(sysctl_dentry_debug_pid),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ },
#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
{
.procname = "legacy_va_layout",
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
Cc: Pavel Emelyanov <xemul(a)openvz.org>
Cc: Oleg Nesterov <oleg(a)tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev(a)us.ibm.com>
Cc: Paul Menage <menage(a)google.com>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
--
1.8.3.1
From: Alexander Duyck <alexander.h.duyck(a)linux.intel.com>
From: Alexander Duyck <alexander.h.duyck(a)linux.intel.com>
commit 56ec43d8b02719402c9fcf984feb52ec2300f8a5 upstream.
As best as I can tell the meminit_pfn_in_nid call is completely redundant.
The deferred memory initialization is already making use of
for_each_free_mem_range which in turn will call into __next_mem_range
which will only return a memory range if it matches the node ID provided
assuming it is not NUMA_NO_NODE.
I am operating on the assumption that there are no zones or pgdata_t
structures that have a NUMA node of NUMA_NO_NODE associated with them. If
that is the case then __next_mem_range will never return a memory range
that doesn't match the zone's node ID and as such the check is redundant.
So one piece I would like to verify on this is if this works for ia64.
Technically it was using a different approach to get the node ID, but it
seems to have the node ID also encoded into the memblock. So I am
assuming this is okay, but would like to get confirmation on that.
On my x86_64 test system with 384GB of memory per node I saw a reduction
in initialization time from 2.80s to 1.85s as a result of this patch.
Link: http://lkml.kernel.org/r/20190405221219.12227.93957.stgit@localhost.localdo…
Signed-off-by: Alexander Duyck <alexander.h.duyck(a)linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin(a)microsoft.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mike Rapoport <rppt(a)linux.ibm.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Khalid Aziz <khalid.aziz(a)oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Laurent Dufour <ldufour(a)linux.vnet.ibm.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <yi.z.zhang(a)linux.intel.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)soleen.com>
---
mm/page_alloc.c | 51 ++++++++++++++-----------------------------------
1 file changed, 14 insertions(+), 37 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8c3051387d1..c86a117acb5b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1321,36 +1321,22 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
#endif
#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-static inline bool __meminit __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state)
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
{
int nid;
- nid = __early_pfn_to_nid(pfn, state);
+ nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
if (nid >= 0 && nid != node)
return false;
return true;
}
-/* Only safe to use early in boot when initialisation is single-threaded */
-static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
- return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
-}
-
#else
-
static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
{
return true;
}
-static inline bool __meminit __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state)
-{
- return true;
-}
#endif
@@ -1480,21 +1466,13 @@ static inline void __init pgdat_init_report_one_done(void)
*
* Then, we check if a current large page is valid by only checking the validity
* of the head pfn.
- *
- * Finally, meminit_pfn_in_nid is checked on systems where pfns can interleave
- * within a node: a pfn is between start and end of a node, but does not belong
- * to this memory node.
*/
-static inline bool __init
-deferred_pfn_valid(int nid, unsigned long pfn,
- struct mminit_pfnnid_cache *nid_init_state)
+static inline bool __init deferred_pfn_valid(unsigned long pfn)
{
if (!pfn_valid_within(pfn))
return false;
if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
return false;
- if (!meminit_pfn_in_nid(pfn, nid, nid_init_state))
- return false;
return true;
}
@@ -1502,15 +1480,14 @@ deferred_pfn_valid(int nid, unsigned long pfn,
* Free pages to buddy allocator. Try to free aligned pages in
* pageblock_nr_pages sizes.
*/
-static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
+static void __init deferred_free_pages(unsigned long pfn,
unsigned long end_pfn)
{
- struct mminit_pfnnid_cache nid_init_state = { };
unsigned long nr_pgmask = pageblock_nr_pages - 1;
unsigned long nr_free = 0;
for (; pfn < end_pfn; pfn++) {
- if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+ if (!deferred_pfn_valid(pfn)) {
deferred_free_range(pfn - nr_free, nr_free);
nr_free = 0;
} else if (!(pfn & nr_pgmask)) {
@@ -1530,17 +1507,18 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
* by performing it only once every pageblock_nr_pages.
* Return number of pages initialized.
*/
-static unsigned long __init deferred_init_pages(int nid, int zid,
+static unsigned long __init deferred_init_pages(struct zone *zone,
unsigned long pfn,
unsigned long end_pfn)
{
- struct mminit_pfnnid_cache nid_init_state = { };
unsigned long nr_pgmask = pageblock_nr_pages - 1;
+ int nid = zone_to_nid(zone);
unsigned long nr_pages = 0;
+ int zid = zone_idx(zone);
struct page *page = NULL;
for (; pfn < end_pfn; pfn++) {
- if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+ if (!deferred_pfn_valid(pfn)) {
page = NULL;
continue;
} else if (!page || !(pfn & nr_pgmask)) {
@@ -1603,12 +1581,12 @@ static int __init deferred_init_memmap(void *data)
for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
- nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
+ nr_pages += deferred_init_pages(zone, spfn, epfn);
}
for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
- deferred_free_pages(nid, zid, spfn, epfn);
+ deferred_free_pages(spfn, epfn);
}
pgdat_resize_unlock(pgdat, &flags);
@@ -1640,7 +1618,6 @@ static int __init deferred_init_memmap(void *data)
static noinline bool __init
deferred_grow_zone(struct zone *zone, unsigned int order)
{
- int zid = zone_idx(zone);
int nid = zone_to_nid(zone);
pg_data_t *pgdat = NODE_DATA(nid);
unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
@@ -1690,7 +1667,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
while (spfn < epfn && nr_pages < nr_pages_needed) {
t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
first_deferred_pfn = min(t, epfn);
- nr_pages += deferred_init_pages(nid, zid, spfn,
+ nr_pages += deferred_init_pages(zone, spfn,
first_deferred_pfn);
spfn = first_deferred_pfn;
}
@@ -1702,7 +1679,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
- deferred_free_pages(nid, zid, spfn, epfn);
+ deferred_free_pages(spfn, epfn);
if (first_deferred_pfn == epfn)
break;
--
2.25.1
This is the start of the stable review cycle for the 5.2.6 release.
There are 20 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun 04 Aug 2019 09:19:34 AM UTC.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.2.6-rc1.…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.2.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.2.6-rc1
Yan, Zheng <zyan(a)redhat.com>
ceph: hold i_ceph_lock when removing caps for freeing inode
Yoshinori Sato <ysato(a)users.sourceforge.jp>
Fix allyesconfig output.
Miroslav Lichvar <mlichvar(a)redhat.com>
drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
Linus Torvalds <torvalds(a)linux-foundation.org>
/proc/<pid>/cmdline: add back the setproctitle() special case
Linus Torvalds <torvalds(a)linux-foundation.org>
/proc/<pid>/cmdline: remove all the special cases
Jann Horn <jannh(a)google.com>
sched/fair: Use RCU accessors consistently for ->numa_group
Jann Horn <jannh(a)google.com>
sched/fair: Don't free p->numa_faults with concurrent readers
Vladis Dronov <vdronov(a)redhat.com>
Bluetooth: hci_uart: check for missing tty operations
Marta Rybczynska <mrybczyn(a)kalray.eu>
nvme: fix multipath crash when ANA is deactivated
Florian Westphal <fw(a)strlen.de>
xfrm: policy: fix bydst hlist corruption on hash rebuild
Luke Nowakowski-Krijger <lnowakow(a)eng.ucsd.edu>
media: radio-raremono: change devm_k*alloc to k*alloc
Benjamin Coddington <bcodding(a)redhat.com>
NFS: Cleanup if nfs_match_client is interrupted
Andrey Konovalov <andreyknvl(a)google.com>
media: pvrusb2: use a different format for warnings
Oliver Neukum <oneukum(a)suse.com>
media: cpia2_usb: first wake up, then free in disconnect
Fabio Estevam <festevam(a)gmail.com>
ath10k: Change the warning message string
Sean Young <sean(a)mess.org>
media: au0828: fix null dereference in error path
Stanislav Fomichev <sdf(a)google.com>
bpf: fix NULL deref in btf_type_is_resolve_source_only
Takashi Iwai <tiwai(a)suse.de>
ALSA: usb-audio: Sanity checks for each pipe and EP types
Phong Tran <tranmanphong(a)gmail.com>
ISDN: hfcsusb: checking idx of ep configuration
Sunil Muthuswamy <sunilmut(a)microsoft.com>
vsock: correct removal of socket from the list
-------------
Diffstat:
Makefile | 4 +-
arch/sh/boards/Kconfig | 14 +--
drivers/bluetooth/hci_ath.c | 3 +
drivers/bluetooth/hci_bcm.c | 3 +
drivers/bluetooth/hci_intel.c | 3 +
drivers/bluetooth/hci_ldisc.c | 13 +++
drivers/bluetooth/hci_mrvl.c | 3 +
drivers/bluetooth/hci_qca.c | 3 +
drivers/bluetooth/hci_uart.h | 1 +
drivers/isdn/hardware/mISDN/hfcsusb.c | 3 +
drivers/media/radio/radio-raremono.c | 30 ++++--
drivers/media/usb/au0828/au0828-core.c | 12 +--
drivers/media/usb/cpia2/cpia2_usb.c | 3 +-
drivers/media/usb/pvrusb2/pvrusb2-hdw.c | 4 +-
drivers/media/usb/pvrusb2/pvrusb2-i2c-core.c | 6 +-
drivers/media/usb/pvrusb2/pvrusb2-std.c | 2 +-
drivers/net/wireless/ath/ath10k/usb.c | 2 +-
drivers/nvme/host/multipath.c | 8 +-
drivers/nvme/host/nvme.h | 6 +-
drivers/pps/pps.c | 8 ++
fs/ceph/caps.c | 10 +-
fs/ceph/inode.c | 2 +-
fs/ceph/super.h | 2 +-
fs/exec.c | 2 +-
fs/nfs/client.c | 4 +-
fs/proc/base.c | 132 +++++++++++++-----------
include/linux/sched.h | 10 +-
include/linux/sched/numa_balancing.h | 4 +-
kernel/bpf/btf.c | 12 +--
kernel/fork.c | 2 +-
kernel/sched/fair.c | 144 +++++++++++++++++++--------
net/vmw_vsock/af_vsock.c | 38 ++-----
net/xfrm/xfrm_policy.c | 12 ++-
sound/usb/helper.c | 17 ++++
sound/usb/helper.h | 1 +
sound/usb/quirks.c | 18 +++-
tools/testing/selftests/net/xfrm_policy.sh | 27 ++++-
37 files changed, 368 insertions(+), 200 deletions(-)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5812b32e01c6d86ba7a84110702b46d8a8531fe9 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan(a)kernel.org>
Date: Mon, 23 Nov 2020 11:23:12 +0100
Subject: [PATCH] of: fix linker-section match-table corruption
Specify type alignment when declaring linker-section match-table entries
to prevent gcc from increasing alignment and corrupting the various
tables with padding (e.g. timers, irqchips, clocks, reserved memory).
This is specifically needed on x86 where gcc (typically) aligns larger
objects like struct of_device_id with static extent on 32-byte
boundaries which at best prevents matching on anything but the first
entry. Specifying alignment when declaring variables suppresses this
optimisation.
Here's a 64-bit example where all entries are corrupt as 16 bytes of
padding has been inserted before the first entry:
ffffffff8266b4b0 D __clk_of_table
ffffffff8266b4c0 d __of_table_fixed_factor_clk
ffffffff8266b5a0 d __of_table_fixed_clk
ffffffff8266b680 d __clk_of_table_sentinel
And here's a 32-bit example where the 8-byte-aligned table happens to be
placed on a 32-byte boundary so that all but the first entry are corrupt
due to the 28 bytes of padding inserted between entries:
812b3ec0 D __irqchip_of_table
812b3ec0 d __of_table_irqchip1
812b3fa0 d __of_table_irqchip2
812b4080 d __of_table_irqchip3
812b4160 d irqchip_of_match_end
Verified on x86 using gcc-9.3 and gcc-4.9 (which uses 64-byte
alignment), and on arm using gcc-7.2.
Note that there are no in-tree users of these tables on x86 currently
(even if they are included in the image).
Fixes: 54196ccbe0ba ("of: consolidate linker section OF match table declarations")
Fixes: f6e916b82022 ("irqchip: add basic infrastructure")
Cc: stable <stable(a)vger.kernel.org> # 3.9
Signed-off-by: Johan Hovold <johan(a)kernel.org>
Link: https://lore.kernel.org/r/20201123102319.8090-2-johan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/include/linux/of.h b/include/linux/of.h
index 5d51891cbf1a..af655d264f10 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -1300,6 +1300,7 @@ static inline int of_get_available_child_count(const struct device_node *np)
#define _OF_DECLARE(table, name, compat, fn, fn_type) \
static const struct of_device_id __of_table_##name \
__used __section("__" #table "_of_table") \
+ __aligned(__alignof__(struct of_device_id)) \
= { .compatible = compat, \
.data = (fn == (fn_type)NULL) ? fn : fn }
#else
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0ebcdd702f49aeb0ad2e2d894f8c124a0acc6e23 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <damien.lemoal(a)wdc.com>
Date: Fri, 20 Nov 2020 10:55:11 +0900
Subject: [PATCH] null_blk: Fix zone size initialization
For a null_blk device with zoned mode enabled is currently initialized
with a number of zones equal to the device capacity divided by the zone
size, without considering if the device capacity is a multiple of the
zone size. If the zone size is not a divisor of the capacity, the zones
end up not covering the entire capacity, potentially resulting is out
of bounds accesses to the zone array.
Fix this by adding one last smaller zone with a size equal to the
remainder of the disk capacity divided by the zone size if the capacity
is not a multiple of the zone size. For such smaller last zone, the zone
capacity is also checked so that it does not exceed the smaller zone
size.
Reported-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Fixes: ca4b2a011948 ("null_blk: add zone support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal(a)wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index beb34b4f76b0..1d0370d91fe7 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -6,8 +6,7 @@
#define CREATE_TRACE_POINTS
#include "null_blk_trace.h"
-/* zone_size in MBs to sectors. */
-#define ZONE_SIZE_SHIFT 11
+#define MB_TO_SECTS(mb) (((sector_t)mb * SZ_1M) >> SECTOR_SHIFT)
static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
{
@@ -16,7 +15,7 @@ static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
{
- sector_t dev_size = (sector_t)dev->size * 1024 * 1024;
+ sector_t dev_capacity_sects, zone_capacity_sects;
sector_t sector = 0;
unsigned int i;
@@ -38,9 +37,13 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
return -EINVAL;
}
- dev->zone_size_sects = dev->zone_size << ZONE_SIZE_SHIFT;
- dev->nr_zones = dev_size >>
- (SECTOR_SHIFT + ilog2(dev->zone_size_sects));
+ zone_capacity_sects = MB_TO_SECTS(dev->zone_capacity);
+ dev_capacity_sects = MB_TO_SECTS(dev->size);
+ dev->zone_size_sects = MB_TO_SECTS(dev->zone_size);
+ dev->nr_zones = dev_capacity_sects >> ilog2(dev->zone_size_sects);
+ if (dev_capacity_sects & (dev->zone_size_sects - 1))
+ dev->nr_zones++;
+
dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct blk_zone),
GFP_KERNEL | __GFP_ZERO);
if (!dev->zones)
@@ -101,8 +104,12 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
struct blk_zone *zone = &dev->zones[i];
zone->start = zone->wp = sector;
- zone->len = dev->zone_size_sects;
- zone->capacity = dev->zone_capacity << ZONE_SIZE_SHIFT;
+ if (zone->start + dev->zone_size_sects > dev_capacity_sects)
+ zone->len = dev_capacity_sects - zone->start;
+ else
+ zone->len = dev->zone_size_sects;
+ zone->capacity =
+ min_t(sector_t, zone->len, zone_capacity_sects);
zone->type = BLK_ZONE_TYPE_SEQWRITE_REQ;
zone->cond = BLK_ZONE_COND_EMPTY;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0ebcdd702f49aeb0ad2e2d894f8c124a0acc6e23 Mon Sep 17 00:00:00 2001
From: Damien Le Moal <damien.lemoal(a)wdc.com>
Date: Fri, 20 Nov 2020 10:55:11 +0900
Subject: [PATCH] null_blk: Fix zone size initialization
For a null_blk device with zoned mode enabled is currently initialized
with a number of zones equal to the device capacity divided by the zone
size, without considering if the device capacity is a multiple of the
zone size. If the zone size is not a divisor of the capacity, the zones
end up not covering the entire capacity, potentially resulting is out
of bounds accesses to the zone array.
Fix this by adding one last smaller zone with a size equal to the
remainder of the disk capacity divided by the zone size if the capacity
is not a multiple of the zone size. For such smaller last zone, the zone
capacity is also checked so that it does not exceed the smaller zone
size.
Reported-by: Naohiro Aota <naohiro.aota(a)wdc.com>
Fixes: ca4b2a011948 ("null_blk: add zone support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal(a)wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index beb34b4f76b0..1d0370d91fe7 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -6,8 +6,7 @@
#define CREATE_TRACE_POINTS
#include "null_blk_trace.h"
-/* zone_size in MBs to sectors. */
-#define ZONE_SIZE_SHIFT 11
+#define MB_TO_SECTS(mb) (((sector_t)mb * SZ_1M) >> SECTOR_SHIFT)
static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
{
@@ -16,7 +15,7 @@ static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
{
- sector_t dev_size = (sector_t)dev->size * 1024 * 1024;
+ sector_t dev_capacity_sects, zone_capacity_sects;
sector_t sector = 0;
unsigned int i;
@@ -38,9 +37,13 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
return -EINVAL;
}
- dev->zone_size_sects = dev->zone_size << ZONE_SIZE_SHIFT;
- dev->nr_zones = dev_size >>
- (SECTOR_SHIFT + ilog2(dev->zone_size_sects));
+ zone_capacity_sects = MB_TO_SECTS(dev->zone_capacity);
+ dev_capacity_sects = MB_TO_SECTS(dev->size);
+ dev->zone_size_sects = MB_TO_SECTS(dev->zone_size);
+ dev->nr_zones = dev_capacity_sects >> ilog2(dev->zone_size_sects);
+ if (dev_capacity_sects & (dev->zone_size_sects - 1))
+ dev->nr_zones++;
+
dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct blk_zone),
GFP_KERNEL | __GFP_ZERO);
if (!dev->zones)
@@ -101,8 +104,12 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
struct blk_zone *zone = &dev->zones[i];
zone->start = zone->wp = sector;
- zone->len = dev->zone_size_sects;
- zone->capacity = dev->zone_capacity << ZONE_SIZE_SHIFT;
+ if (zone->start + dev->zone_size_sects > dev_capacity_sects)
+ zone->len = dev_capacity_sects - zone->start;
+ else
+ zone->len = dev->zone_size_sects;
+ zone->capacity =
+ min_t(sector_t, zone->len, zone_capacity_sects);
zone->type = BLK_ZONE_TYPE_SEQWRITE_REQ;
zone->cond = BLK_ZONE_COND_EMPTY;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b08070eca9e247f60ab39d79b2c25d274750441f Mon Sep 17 00:00:00 2001
From: Jan Kara <jack(a)suse.cz>
Date: Fri, 27 Nov 2020 12:33:54 +0100
Subject: [PATCH] ext4: don't remount read-only with errors=continue on reboot
ext4_handle_error() with errors=continue mount option can accidentally
remount the filesystem read-only when the system is rebooting. Fix that.
Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot")
Signed-off-by: Jan Kara <jack(a)suse.cz>
Reviewed-by: Andreas Dilger <adilger(a)dilger.ca>
Cc: stable(a)kernel.org
Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 872d45a131ca..3ef84e8ab1ae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -666,19 +666,17 @@ static bool system_going_down(void)
static void ext4_handle_error(struct super_block *sb)
{
+ journal_t *journal = EXT4_SB(sb)->s_journal;
+
if (test_opt(sb, WARN_ON_ERROR))
WARN_ON_ONCE(1);
- if (sb_rdonly(sb))
+ if (sb_rdonly(sb) || test_opt(sb, ERRORS_CONT))
return;
- if (!test_opt(sb, ERRORS_CONT)) {
- journal_t *journal = EXT4_SB(sb)->s_journal;
-
- ext4_set_mount_flag(sb, EXT4_MF_FS_ABORTED);
- if (journal)
- jbd2_journal_abort(journal, -EIO);
- }
+ ext4_set_mount_flag(sb, EXT4_MF_FS_ABORTED);
+ if (journal)
+ jbd2_journal_abort(journal, -EIO);
/*
* We force ERRORS_RO behavior when system is rebooting. Otherwise we
* could panic during 'reboot -f' as the underlying device got already
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b08070eca9e247f60ab39d79b2c25d274750441f Mon Sep 17 00:00:00 2001
From: Jan Kara <jack(a)suse.cz>
Date: Fri, 27 Nov 2020 12:33:54 +0100
Subject: [PATCH] ext4: don't remount read-only with errors=continue on reboot
ext4_handle_error() with errors=continue mount option can accidentally
remount the filesystem read-only when the system is rebooting. Fix that.
Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot")
Signed-off-by: Jan Kara <jack(a)suse.cz>
Reviewed-by: Andreas Dilger <adilger(a)dilger.ca>
Cc: stable(a)kernel.org
Link: https://lore.kernel.org/r/20201127113405.26867-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 872d45a131ca..3ef84e8ab1ae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -666,19 +666,17 @@ static bool system_going_down(void)
static void ext4_handle_error(struct super_block *sb)
{
+ journal_t *journal = EXT4_SB(sb)->s_journal;
+
if (test_opt(sb, WARN_ON_ERROR))
WARN_ON_ONCE(1);
- if (sb_rdonly(sb))
+ if (sb_rdonly(sb) || test_opt(sb, ERRORS_CONT))
return;
- if (!test_opt(sb, ERRORS_CONT)) {
- journal_t *journal = EXT4_SB(sb)->s_journal;
-
- ext4_set_mount_flag(sb, EXT4_MF_FS_ABORTED);
- if (journal)
- jbd2_journal_abort(journal, -EIO);
- }
+ ext4_set_mount_flag(sb, EXT4_MF_FS_ABORTED);
+ if (journal)
+ jbd2_journal_abort(journal, -EIO);
/*
* We force ERRORS_RO behavior when system is rebooting. Otherwise we
* could panic during 'reboot -f' as the underlying device got already
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 779055842da5b2e508f3ccf9a8153cb1f704f566 Mon Sep 17 00:00:00 2001
From: Souptick Joarder <jrdr.linux(a)gmail.com>
Date: Sun, 6 Sep 2020 12:21:53 +0530
Subject: [PATCH] xen/gntdev.c: Mark pages as dirty
There seems to be a bug in the original code when gntdev_get_page()
is called with writeable=true then the page needs to be marked dirty
before being put.
To address this, a bool writeable is added in gnt_dev_copy_batch, set
it in gntdev_grant_copy_seg() (and drop `writeable` argument to
gntdev_get_page()) and then, based on batch->writeable, use
set_page_dirty_lock().
Fixes: a4cdb556cae0 (xen/gntdev: add ioctl for grant copy)
Suggested-by: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Signed-off-by: Souptick Joarder <jrdr.linux(a)gmail.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: David Vrabel <david.vrabel(a)citrix.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/1599375114-32360-1-git-send-email-jrdr.linux@gmai…
Reviewed-by: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 64a9025a87be..1f32db7b72b2 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -720,17 +720,18 @@ struct gntdev_copy_batch {
s16 __user *status[GNTDEV_COPY_BATCH];
unsigned int nr_ops;
unsigned int nr_pages;
+ bool writeable;
};
static int gntdev_get_page(struct gntdev_copy_batch *batch, void __user *virt,
- bool writeable, unsigned long *gfn)
+ unsigned long *gfn)
{
unsigned long addr = (unsigned long)virt;
struct page *page;
unsigned long xen_pfn;
int ret;
- ret = get_user_pages_fast(addr, 1, writeable ? FOLL_WRITE : 0, &page);
+ ret = get_user_pages_fast(addr, 1, batch->writeable ? FOLL_WRITE : 0, &page);
if (ret < 0)
return ret;
@@ -746,9 +747,13 @@ static void gntdev_put_pages(struct gntdev_copy_batch *batch)
{
unsigned int i;
- for (i = 0; i < batch->nr_pages; i++)
+ for (i = 0; i < batch->nr_pages; i++) {
+ if (batch->writeable && !PageDirty(batch->pages[i]))
+ set_page_dirty_lock(batch->pages[i]);
put_page(batch->pages[i]);
+ }
batch->nr_pages = 0;
+ batch->writeable = false;
}
static int gntdev_copy(struct gntdev_copy_batch *batch)
@@ -837,8 +842,9 @@ static int gntdev_grant_copy_seg(struct gntdev_copy_batch *batch,
virt = seg->source.virt + copied;
off = (unsigned long)virt & ~XEN_PAGE_MASK;
len = min(len, (size_t)XEN_PAGE_SIZE - off);
+ batch->writeable = false;
- ret = gntdev_get_page(batch, virt, false, &gfn);
+ ret = gntdev_get_page(batch, virt, &gfn);
if (ret < 0)
return ret;
@@ -856,8 +862,9 @@ static int gntdev_grant_copy_seg(struct gntdev_copy_batch *batch,
virt = seg->dest.virt + copied;
off = (unsigned long)virt & ~XEN_PAGE_MASK;
len = min(len, (size_t)XEN_PAGE_SIZE - off);
+ batch->writeable = true;
- ret = gntdev_get_page(batch, virt, true, &gfn);
+ ret = gntdev_get_page(batch, virt, &gfn);
if (ret < 0)
return ret;
This is an automatic generated email to let you know that the following patch were queued:
Subject: media: rc: ensure that uevent can be read directly after rc device register
Author: Sean Young <sean(a)mess.org>
Date: Sun Dec 20 13:29:54 2020 +0100
There is a race condition where if the /sys/class/rc0/uevent file is read
before rc_dev->registered is set to true, -ENODEV will be returned.
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1901089
Cc: stable(a)vger.kernel.org
Fixes: a2e2d73fa281 ("media: rc: do not access device via sysfs after rc_unregister_device()")
Signed-off-by: Sean Young <sean(a)mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei(a)kernel.org>
drivers/media/rc/rc-main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
---
diff --git a/drivers/media/rc/rc-main.c b/drivers/media/rc/rc-main.c
index 1d811e5ffb55..29d4d01896ff 100644
--- a/drivers/media/rc/rc-main.c
+++ b/drivers/media/rc/rc-main.c
@@ -1928,6 +1928,8 @@ int rc_register_device(struct rc_dev *dev)
goto out_raw;
}
+ dev->registered = true;
+
rc = device_add(&dev->dev);
if (rc)
goto out_rx_free;
@@ -1937,8 +1939,6 @@ int rc_register_device(struct rc_dev *dev)
dev->device_name ?: "Unspecified device", path ?: "N/A");
kfree(path);
- dev->registered = true;
-
/*
* once the the input device is registered in rc_setup_rx_device,
* userspace can open the input device and rc_open() will be called
ksys_umount was refactored to into split into another function
(path_umount) to enable sharing code. This changed the order that flags and
permissions are validated in, and made it so that user_path_at was called
before validating flags and capabilities.
Unfortunately, libfuse2[1] and libmount[2] rely on the old flag validation
behaviour to determine whether or not the kernel supports UMOUNT_NOFOLLOW.
The other path that this validation is being checked on is
init_umount->path_umount->can_umount. That's all internal to the kernel.
[1]: https://github.com/libfuse/libfuse/blob/9bfbeb576c5901b62a171d35510f0d1a922…
[2]: https://github.com/karelzak/util-linux/blob/7ed579523b556b1270f28dbdb7ee07d…
Signed-off-by: Sargun Dhillon <sargun(a)sargun.me>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: stable(a)vger.kernel.org
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-fsdevel(a)vger.kernel.org
Fixes: 41525f56e256 ("fs: refactor ksys_umount")
---
fs/namespace.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index cebaa3e81794..dc76f1cb89f4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1710,10 +1710,6 @@ static int can_umount(const struct path *path, int flags)
{
struct mount *mnt = real_mount(path->mnt);
- if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
- return -EINVAL;
- if (!may_mount())
- return -EPERM;
if (path->dentry != path->mnt->mnt_root)
return -EINVAL;
if (!check_mnt(mnt))
@@ -1746,6 +1742,12 @@ static int ksys_umount(char __user *name, int flags)
struct path path;
int ret;
+ if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+ return -EINVAL;
+
+ if (!may_mount())
+ return -EPERM;
+
if (!(flags & UMOUNT_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
ret = user_path_at(AT_FDCWD, name, lookup_flags, &path);
--
2.25.1
The function ovl_dir_real_file() currently uses the semaphore of the
inode to synchronize write to the upperfile cache field.
However, this function will get called by ovl_ioctl_set_flags(), which
utilizes the inode semaphore too. In this case ovl_dir_real_file() will
try to claim a lock that is owned by a function in its call stack, which
won't get released before ovl_dir_real_file() returns.
Define a dedicated semaphore for the upperfile cache, so that the
deadlock won't happen.
Fixes: 61536bed2149 ("ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories")
Cc: stable(a)vger.kernel.org # v5.10
Signed-off-by: Icenowy Zheng <icenowy(a)aosc.io>
---
fs/overlayfs/readdir.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..f10701aabb71 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -56,6 +56,7 @@ struct ovl_dir_file {
struct list_head *cursor;
struct file *realfile;
struct file *upperfile;
+ struct semaphore upperfile_sem;
};
static struct ovl_cache_entry *ovl_cache_entry_from_node(struct rb_node *n)
@@ -883,7 +884,7 @@ struct file *ovl_dir_real_file(const struct file *file, bool want_upper)
ovl_path_upper(dentry, &upperpath);
realfile = ovl_dir_open_realfile(file, &upperpath);
- inode_lock(inode);
+ down(&od->upperfile_sem);
if (!od->upperfile) {
if (IS_ERR(realfile)) {
inode_unlock(inode);
@@ -896,7 +897,7 @@ struct file *ovl_dir_real_file(const struct file *file, bool want_upper)
fput(realfile);
realfile = od->upperfile;
}
- inode_unlock(inode);
+ up(&od->upperfile_sem);
}
}
@@ -959,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct file *file)
od->realfile = realfile;
od->is_real = ovl_dir_is_real(file->f_path.dentry);
od->is_upper = OVL_TYPE_UPPER(type);
+ sema_init(&od->upperfile_sem, 1);
file->private_data = od;
return 0;
--
2.28.0
The function ovl_dir_real_file() currently uses the semaphore of the
inode to synchronize write to the upperfile cache field.
However, this function will get called by ovl_ioctl_set_flags(), which
utilizes the inode semaphore too. In this case ovl_dir_real_file() will
try to claim a lock that is owned by a function in its call stack, which
won't get released before ovl_dir_real_file() returns.
Define a dedicated semaphore for the upperfile cache, so that the
deadlock won't happen.
Fixes: 61536bed2149 ("ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories")
Cc: stable(a)vger.kernel.org # v5.10
Signed-off-by: Icenowy Zheng <icenowy(a)aosc.io>
---
fs/overlayfs/readdir.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..fa1844ff8db4 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -56,6 +56,7 @@ struct ovl_dir_file {
struct list_head *cursor;
struct file *realfile;
struct file *upperfile;
+ struct semaphore upperfile_sem;
};
static struct ovl_cache_entry *ovl_cache_entry_from_node(struct rb_node *n)
@@ -874,8 +875,6 @@ struct file *ovl_dir_real_file(const struct file *file, bool want_upper)
* Need to check if we started out being a lower dir, but got copied up
*/
if (!od->is_upper) {
- struct inode *inode = file_inode(file);
-
realfile = READ_ONCE(od->upperfile);
if (!realfile) {
struct path upperpath;
@@ -883,10 +882,10 @@ struct file *ovl_dir_real_file(const struct file *file, bool want_upper)
ovl_path_upper(dentry, &upperpath);
realfile = ovl_dir_open_realfile(file, &upperpath);
- inode_lock(inode);
+ down(&od->upperfile_sem);
if (!od->upperfile) {
if (IS_ERR(realfile)) {
- inode_unlock(inode);
+ up(&od->upperfile_sem);
return realfile;
}
smp_store_release(&od->upperfile, realfile);
@@ -896,7 +895,7 @@ struct file *ovl_dir_real_file(const struct file *file, bool want_upper)
fput(realfile);
realfile = od->upperfile;
}
- inode_unlock(inode);
+ up(&od->upperfile_sem);
}
}
@@ -959,6 +958,7 @@ static int ovl_dir_open(struct inode *inode, struct file *file)
od->realfile = realfile;
od->is_real = ovl_dir_is_real(file->f_path.dentry);
od->is_upper = OVL_TYPE_UPPER(type);
+ sema_init(&od->upperfile_sem, 1);
file->private_data = od;
return 0;
--
2.28.0
When large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET
was introduced into the incompat feature set. It used bucket_size_hi
(which was added at the tail of struct cache_sb_disk) to extend current
16bit bucket size to 32bit with existing bucket_size in struct
cache_sb_disk.
This is not a good idea, there are two obvious problems,
- Bucket size is always value power of 2, if store log2(bucket size) in
existing bucket_size of struct cache_sb_disk, it is unnecessary to add
bucket_size_hi.
- Macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member in
struct cache_sb_disk, bucket_size_hi was added after d[] which makes
csum_set calculate an unexpected super block checksum.
To fix the above problems, this patch introduces a new incompat feature
bit BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE, when this bit is set, it
means bucket_size in struct cache_sb_disk stores the order of power-of-2
bucket size value. When user specifies a bucket size larger than 32768
sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE will be set to
incompat feature set, and bucket_size stores log2(bucket size) more
than store the real bucket size value.
The obsoleted BCH_FEATURE_INCOMPAT_LARGE_BUCKET won't be used anymore,
it is renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and still only
recognized by kernel driver for legacy compatible purpose. The previous
bucket_size_hi is renmaed to obso_bucket_size_hi in struct cache_sb_disk
and not used in bcache-tools anymore.
For cache device created with BCH_FEATURE_INCOMPAT_LARGE_BUCKET feature,
bcache-tools and kernel driver still recognize the feature string and
display it as "obso_large_bucket".
With this change, the unnecessary extra space extend of bcache on-disk
super block can be avoided, and csum_set() may generate expected check
sum as well.
Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org # 5.9+
---
drivers/md/bcache/features.c | 2 +-
drivers/md/bcache/features.h | 11 ++++++++---
drivers/md/bcache/super.c | 22 +++++++++++++++++++---
include/uapi/linux/bcache.h | 2 +-
4 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c
index 6469223f0b77..d636b7b2d070 100644
--- a/drivers/md/bcache/features.c
+++ b/drivers/md/bcache/features.c
@@ -17,7 +17,7 @@ struct feature {
};
static struct feature feature_list[] = {
- {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LARGE_BUCKET,
+ {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE,
"large_bucket"},
{0, 0, 0 },
};
diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
index e73724c2b49b..84fc2c0f0101 100644
--- a/drivers/md/bcache/features.h
+++ b/drivers/md/bcache/features.h
@@ -13,11 +13,15 @@
/* Feature set definition */
/* Incompat feature set */
-#define BCH_FEATURE_INCOMPAT_LARGE_BUCKET 0x0001 /* 32bit bucket size */
+/* 32bit bucket size, obsoleted */
+#define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET 0x0001
+/* real bucket size is (1 << bucket_size) */
+#define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE 0x0002
#define BCH_FEATURE_COMPAT_SUPP 0
#define BCH_FEATURE_RO_COMPAT_SUPP 0
-#define BCH_FEATURE_INCOMPAT_SUPP BCH_FEATURE_INCOMPAT_LARGE_BUCKET
+#define BCH_FEATURE_INCOMPAT_SUPP (BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
+ BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE)
#define BCH_HAS_COMPAT_FEATURE(sb, mask) \
((sb)->feature_compat & (mask))
@@ -77,7 +81,8 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
~BCH##_FEATURE_INCOMPAT_##flagname; \
}
-BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
+BCH_FEATURE_INCOMPAT_FUNCS(obso_large_bucket, OBSO_LARGE_BUCKET);
+BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LOG_LARGE_BUCKET_SIZE);
static inline bool bch_has_unknown_compat_features(struct cache_sb *sb)
{
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f4674a3298af..3999641f1775 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -64,9 +64,25 @@ static unsigned int get_bucket_size(struct cache_sb *sb, struct cache_sb_disk *s
{
unsigned int bucket_size = le16_to_cpu(s->bucket_size);
- if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES &&
- bch_has_feature_large_bucket(sb))
- bucket_size |= le16_to_cpu(s->bucket_size_hi) << 16;
+ if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES) {
+ if (bch_has_feature_large_bucket(sb)) {
+ unsigned int max, order;
+
+ max = sizeof(unsigned int) * BITS_PER_BYTE - 1;
+ order = le16_to_cpu(s->bucket_size);
+ /*
+ * bcache tool will make sure the overflow won't
+ * happen, an error message here is enough.
+ */
+ if (order > max)
+ pr_err("Bucket size (1 << %u) overflows\n",
+ order);
+ bucket_size = 1 << order;
+ } else if (bch_has_feature_obso_large_bucket(sb)) {
+ bucket_size +=
+ le16_to_cpu(s->obso_bucket_size_hi) << 16;
+ }
+ }
return bucket_size;
}
diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index 52e8bcb33981..cf7399f03b71 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -213,7 +213,7 @@ struct cache_sb_disk {
__le16 keys;
};
__le64 d[SB_JOURNAL_BUCKETS]; /* journal buckets */
- __le16 bucket_size_hi;
+ __le16 obso_bucket_size_hi; /* obsoleted */
};
/*
--
2.26.2
If regmap_read() fails, the product_id local variable will contain
random value from the stack. Do not try to parse such value and fail
the ASV driver probe.
Fixes: 5ea428595cc5 ("soc: samsung: Add Exynos Adaptive Supply Voltage driver")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzk(a)kernel.org>
---
drivers/soc/samsung/exynos-asv.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/soc/samsung/exynos-asv.c b/drivers/soc/samsung/exynos-asv.c
index f653e3533f0f..5daeadc36382 100644
--- a/drivers/soc/samsung/exynos-asv.c
+++ b/drivers/soc/samsung/exynos-asv.c
@@ -129,7 +129,13 @@ static int exynos_asv_probe(struct platform_device *pdev)
return PTR_ERR(asv->chipid_regmap);
}
- regmap_read(asv->chipid_regmap, EXYNOS_CHIPID_REG_PRO_ID, &product_id);
+ ret = regmap_read(asv->chipid_regmap, EXYNOS_CHIPID_REG_PRO_ID,
+ &product_id);
+ if (ret < 0) {
+ dev_err(&pdev->dev, "Cannot read revision from ChipID: %d\n",
+ ret);
+ return -ENODEV;
+ }
switch (product_id & EXYNOS_MASK) {
case 0xE5422000:
--
2.25.1
For cancelling io_uring requests it needs either to be able to run
currently enqueued task_wokrs or having it shut down by that moment.
Otherwise io_uring_cancel_files() may be waiting for requests that won't
ever complete.
Go with the first way and do cancellations before setting PF_EXITING and
so before putting the task_work infrastructure into a transition state
where task_work_run() would better not be called.
Cc: stable(a)vger.kernel.org # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
---
fs/file.c | 2 --
kernel/exit.c | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index c0b60961c672..dab120b71e44 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -21,7 +21,6 @@
#include <linux/rcupdate.h>
#include <linux/close_range.h>
#include <net/sock.h>
-#include <linux/io_uring.h>
unsigned int sysctl_nr_open __read_mostly = 1024*1024;
unsigned int sysctl_nr_open_min = BITS_PER_LONG;
@@ -428,7 +427,6 @@ void exit_files(struct task_struct *tsk)
struct files_struct * files = tsk->files;
if (files) {
- io_uring_files_cancel(files);
task_lock(tsk);
tsk->files = NULL;
task_unlock(tsk);
diff --git a/kernel/exit.c b/kernel/exit.c
index 3594291a8542..04029e35e69a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -63,6 +63,7 @@
#include <linux/random.h>
#include <linux/rcuwait.h>
#include <linux/compat.h>
+#include <linux/io_uring.h>
#include <linux/uaccess.h>
#include <asm/unistd.h>
@@ -776,6 +777,7 @@ void __noreturn do_exit(long code)
schedule();
}
+ io_uring_files_cancel(tsk->files);
exit_signals(tsk); /* sets PF_EXITING */
/* sync mm's RSS info before statistics gathering */
--
2.24.0
When large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET
was introduced into the incompat feature set. It used bucket_size_hi
(which was added at the tail of struct cache_sb_disk) to extend current
16bit bucket size to 32bit with existing bucket_size in struct
cache_sb_disk.
This is not a good idea, there are two obvious problems,
- Bucket size is always value power of 2, if store log2(bucket size) in
existing bucket_size of struct cache_sb_disk, it is unnecessary to add
bucket_size_hi.
- Macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member in
struct cache_sb_disk, bucket_size_hi was added after d[] which makes
csum_set calculate an unexpected super block checksum.
To fix the above problems, this patch introduces a new incompat feature
bit BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE, when this bit is set, it
means bucket_size in struct cache_sb_disk stores the order of power-of-2
bucket size value. When user specifies a bucket size larger than 32768
sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE will be set to
incompat feature set, and bucket_size stores log2(bucket size) more
than store the real bucket size value.
The obsoleted BCH_FEATURE_INCOMPAT_LARGE_BUCKET won't be used anymore,
it is renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and still only
recognized by kernel driver for legacy compatible purpose. The previous
bucket_size_hi is renmaed to obso_bucket_size_hi in struct cache_sb_disk
and not used in bcache-tools anymore.
For cache device created with BCH_FEATURE_INCOMPAT_LARGE_BUCKET feature,
bcache-tools and kernel driver still recognize the feature string and
display it as "obso_large_bucket".
With this change, the unnecessary extra space extend of bcache on-disk
super block can be avoided, and csum_set() may generate expected check
sum as well.
Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket")
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org # 5.9+
---
drivers/md/bcache/features.c | 2 +-
drivers/md/bcache/features.h | 11 ++++++++---
drivers/md/bcache/super.c | 22 +++++++++++++++++++---
include/uapi/linux/bcache.h | 2 +-
4 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c
index 6469223f0b77..d636b7b2d070 100644
--- a/drivers/md/bcache/features.c
+++ b/drivers/md/bcache/features.c
@@ -17,7 +17,7 @@ struct feature {
};
static struct feature feature_list[] = {
- {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LARGE_BUCKET,
+ {BCH_FEATURE_INCOMPAT, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE,
"large_bucket"},
{0, 0, 0 },
};
diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
index e73724c2b49b..84fc2c0f0101 100644
--- a/drivers/md/bcache/features.h
+++ b/drivers/md/bcache/features.h
@@ -13,11 +13,15 @@
/* Feature set definition */
/* Incompat feature set */
-#define BCH_FEATURE_INCOMPAT_LARGE_BUCKET 0x0001 /* 32bit bucket size */
+/* 32bit bucket size, obsoleted */
+#define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET 0x0001
+/* real bucket size is (1 << bucket_size) */
+#define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE 0x0002
#define BCH_FEATURE_COMPAT_SUPP 0
#define BCH_FEATURE_RO_COMPAT_SUPP 0
-#define BCH_FEATURE_INCOMPAT_SUPP BCH_FEATURE_INCOMPAT_LARGE_BUCKET
+#define BCH_FEATURE_INCOMPAT_SUPP (BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
+ BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE)
#define BCH_HAS_COMPAT_FEATURE(sb, mask) \
((sb)->feature_compat & (mask))
@@ -77,7 +81,8 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
~BCH##_FEATURE_INCOMPAT_##flagname; \
}
-BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LARGE_BUCKET);
+BCH_FEATURE_INCOMPAT_FUNCS(obso_large_bucket, OBSO_LARGE_BUCKET);
+BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LOG_LARGE_BUCKET_SIZE);
static inline bool bch_has_unknown_compat_features(struct cache_sb *sb)
{
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1804df5c866b..d742eb5dc2e5 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -64,9 +64,25 @@ static unsigned int get_bucket_size(struct cache_sb *sb, struct cache_sb_disk *s
{
unsigned int bucket_size = le16_to_cpu(s->bucket_size);
- if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES &&
- bch_has_feature_large_bucket(sb))
- bucket_size |= le16_to_cpu(s->bucket_size_hi) << 16;
+ if (sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES) {
+ if (bch_has_feature_large_bucket(sb)) {
+ unsigned int max, order;
+
+ max = sizeof(unsigned int) * BITS_PER_BYTE - 1;
+ order = le16_to_cpu(s->bucket_size);
+ /*
+ * bcache tool will make sure the overflow won't
+ * happen, an error message here is enough.
+ */
+ if (order > max)
+ pr_err("Bucket size (1 << %u) overflows\n",
+ order);
+ bucket_size = 1 << order;
+ } else if (bch_has_feature_obso_large_bucket(sb)) {
+ bucket_size +=
+ le16_to_cpu(s->obso_bucket_size_hi) << 16;
+ }
+ }
return bucket_size;
}
diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index 52e8bcb33981..cf7399f03b71 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -213,7 +213,7 @@ struct cache_sb_disk {
__le16 keys;
};
__le64 d[SB_JOURNAL_BUCKETS]; /* journal buckets */
- __le16 bucket_size_hi;
+ __le16 obso_bucket_size_hi; /* obsoleted */
};
/*
--
2.26.2
在 2021/1/2 上午1:09, Barnabás Pőcze 写道:
> Hi
>
>
> 2021. január 1., péntek 17:08 keltezéssel, Jiaxun Yang írta:
>
>> [...]
>>>> @@ -1006,6 +1018,10 @@ static int ideapad_acpi_add(struct platform_device *pdev)
>>>> if (!priv->has_hw_rfkill_switch)
>>>> write_ec_cmd(priv->adev->handle, VPCCMD_W_RF, 1);
>>>>
>>>> + /* The same for Touchpad */
>>>> + if (!priv->has_touchpad_switch)
>>>> + write_ec_cmd(priv->adev->handle, VPCCMD_W_TOUCHPAD, 1);
>>>> +
>>> Shouldn't it be the other way around: `if (priv->has_touchpad_switch)`?
>> It is to prevent accidentally disable touchpad on machines that do have EC switch,
>> so it's intentional.
>> [...]
> Sorry, but the explanation not fully clear to me. The commit message seems to
> indicate that some models "do not use EC to switch touchpad", and I take that
> means that reading from VPCCMD_R_TOUCHPAD will not reflect the actual state of the
> touchpad and writing to VPCCMD_W_TOUCHPAD will not change the state of the touchpad.
I'm just trying to prevent removing functionality on machines that
touchpad can be controlled
by EC but also equipped I2C HID touchpad. At least users will have a
functional touchpad
after that.
>
> But then why do you still write to VPCCMD_W_TOUCHPAD on devices where supposedly
> this does not have any effect (at least not the desired one)? And the part of the
> code I made my comment about only runs on machines on which the touchpad supposedly
> cannot be controlled by the EC. What am I missing?
>
> And there is the other problem: on some machines, this patch removes working
> functionality.
Yeah that's a problem. I just don't want to repeat the story of rfkill
whitelist, it ends up with
countless machine to be added.
Maybe I should specify HID of touchpad as well. Two machines that known
to be problematic
all have ELAN0634 touchpad.
Thanks.
- Jiaxun
>
>
> Regards,
> Barnabás Pőcze