usbip_sockfd_store() is invoked when the user requests attach (import)
or detach (unimport) of a usb gadget device from the usbip host. vhci_hcd
sends the import request and usbip_sockfd_store() exports the device if
it is free for export.
Export and unexport are governed by local state and shared state
- Shared state (usbip device status, sockfd) - sockfd and device
status are used to determine if the stub should be brought up or shut
down. Device status is shared between host and client.
- Local state (tcp_socket, rx and tx thread task_struct ptrs)
A valid tcp_socket controls rx and tx thread operations while the
device is in exported state.
- While the device is exported, device status is marked used and socket,
sockfd, and thread pointers are valid.
Export sequence (stub-up) includes validating the socket and creating
receive (rx) and transmit (tx) threads to talk to the client to provide
access to the exported device. The rx and tx threads depend on local and
shared state being correct and in sync.
The unexport (stub-down) sequence shuts the socket down and stops the rx
and tx threads; it relies on local and shared states being in sync.
There are races in updating the local and shared state in the current
stub-up sequence, resulting in crashes. These stem from starting the rx
and tx threads before the local and shared state has been updated and
brought in sync.
1. kthread_create() errors aren't handled, so an invalid (ERR_PTR)
pointer is saved in the local state that drives the rx and tx threads
(see the macro sketch below).
2. tcp_socket and sockfd are updated and the rx and tx threads are
started before the usbip_device status is updated to SDEV_ST_USED. This
opens a race window between the threads and the usbip_sockfd_store()
stub-up and stub-down handling.
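For context, kthread_get_run() in drivers/usb/usbip/usbip_common.h is
approximately the following macro; on kthread_create() failure it returns
the ERR_PTR, which the old code stored unchecked in ud->tcp_rx/tcp_tx:

	#define kthread_get_run(threadfn, data, namefmt, ...)		   \
	({								   \
		struct task_struct *__k					   \
			= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
		if (!IS_ERR(__k)) {					   \
			get_task_struct(__k);				   \
			wake_up_process(__k);				   \
		}							   \
		__k;							   \
	})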
Fix the above problems:
- Stop using kthread_get_run() macro to create/start threads.
- Create threads and get task struct reference.
- Add kthread_create() failure handling and bail out.
- Hold usbip_device lock to update local and shared states after
creating rx and tx threads.
- Update usbip_device status to SDEV_ST_USED.
- Update usbip_device tcp_socket, sockfd, tcp_rx, and tcp_tx.
- Start threads only after the usbip_device state (tcp_socket, sockfd,
tcp_rx, tcp_tx, and status) update is complete (see the sketch below).
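A condensed sketch of the reordered stub-up sequence (a sketch only:
error labels, lock flavors, and thread/field names are simplified
relative to the actual diffs below):

	tcp_rx = kthread_create(rx_loop, ud, "rx");  /* created, not running */
	if (IS_ERR(tcp_rx))
		goto err;		/* bail out, nothing published */
	tcp_tx = kthread_create(tx_loop, ud, "tx");
	if (IS_ERR(tcp_tx)) {
		kthread_stop(tcp_rx);
		goto err;
	}
	get_task_struct(tcp_rx);	/* hold refs for the later put */
	get_task_struct(tcp_tx);

	spin_lock_irq(&ud->lock);	/* publish all state atomically */
	ud->tcp_socket = socket;
	ud->sockfd = sockfd;
	ud->tcp_rx = tcp_rx;
	ud->tcp_tx = tcp_tx;
	ud->status = SDEV_ST_USED;
	spin_unlock_irq(&ud->lock);

	wake_up_process(ud->tcp_rx);	/* threads only start running   */
	wake_up_process(ud->tcp_tx);	/* now, seeing consistent state */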
Credit goes to syzbot and Tetsuo Handa for finding and root-causing the
kthread_get_run() improper error handling problem and others. This is a
hard problem to find and debug since the races aren't seen in a normal
case. Fuzzing shrinks the race window enough to trigger both the
kthread_get_run() error path bug and the bug of starting threads before
the local and shared state is updated in the stub-up sequence.
Reported-by: syzbot <syzbot+a93fba6d384346a761e3(a)syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+bf1a360e305ee719e364(a)syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+95ce4b142579611ef0a9(a)syzkaller.appspotmail.com>
Reported-by: Tetsuo Handa <penguin-kernel(a)I-love.SAKURA.ne.jp>
Fixes: 9720b4bc76a83807 ("staging/usbip: convert to kthread")
Cc: stable(a)vger.kernel.org
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
drivers/usb/usbip/vudc_sysfs.c | 42 +++++++++++++++++++++++++++-------
1 file changed, 34 insertions(+), 8 deletions(-)
diff --git a/drivers/usb/usbip/vudc_sysfs.c b/drivers/usb/usbip/vudc_sysfs.c
index 83a0c59a3de8..a3ec39fc6177 100644
--- a/drivers/usb/usbip/vudc_sysfs.c
+++ b/drivers/usb/usbip/vudc_sysfs.c
@@ -90,8 +90,9 @@ static ssize_t dev_desc_read(struct file *file, struct kobject *kobj,
}
static BIN_ATTR_RO(dev_desc, sizeof(struct usb_device_descriptor));
-static ssize_t usbip_sockfd_store(struct device *dev, struct device_attribute *attr,
- const char *in, size_t count)
+static ssize_t usbip_sockfd_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *in, size_t count)
{
struct vudc *udc = (struct vudc *) dev_get_drvdata(dev);
int rv;
@@ -100,6 +101,8 @@ static ssize_t usbip_sockfd_store(struct device *dev, struct device_attribute *a
struct socket *socket;
unsigned long flags;
int ret;
+ struct task_struct *tcp_rx = NULL;
+ struct task_struct *tcp_tx = NULL;
rv = kstrtoint(in, 0, &sockfd);
if (rv != 0)
@@ -145,24 +148,47 @@ static ssize_t usbip_sockfd_store(struct device *dev, struct device_attribute *a
goto sock_err;
}
- udc->ud.tcp_socket = socket;
-
+ /* unlock and create threads and get tasks */
spin_unlock_irq(&udc->ud.lock);
spin_unlock_irqrestore(&udc->lock, flags);
- udc->ud.tcp_rx = kthread_get_run(&v_rx_loop,
- &udc->ud, "vudc_rx");
- udc->ud.tcp_tx = kthread_get_run(&v_tx_loop,
- &udc->ud, "vudc_tx");
+ tcp_rx = kthread_create(&v_rx_loop, &udc->ud, "vudc_rx");
+ if (IS_ERR(tcp_rx)) {
+ sockfd_put(socket);
+ return -EINVAL;
+ }
+ tcp_tx = kthread_create(&v_tx_loop, &udc->ud, "vudc_tx");
+ if (IS_ERR(tcp_tx)) {
+ kthread_stop(tcp_rx);
+ sockfd_put(socket);
+ return -EINVAL;
+ }
+
+ /* get task structs now */
+ get_task_struct(tcp_rx);
+ get_task_struct(tcp_tx);
+ /* lock and update udc->ud state */
spin_lock_irqsave(&udc->lock, flags);
spin_lock_irq(&udc->ud.lock);
+
+ udc->ud.tcp_socket = socket;
+ udc->ud.tcp_rx = tcp_rx;
+ udc->ud.tcp_tx = tcp_tx;
udc->ud.status = SDEV_ST_USED;
+
spin_unlock_irq(&udc->ud.lock);
ktime_get_ts64(&udc->start_time);
v_start_timer(udc);
udc->connected = 1;
+
+ spin_unlock_irqrestore(&udc->lock, flags);
+
+ wake_up_process(udc->ud.tcp_rx);
+ wake_up_process(udc->ud.tcp_tx);
+ return count;
+
} else {
if (!udc->connected) {
dev_err(dev, "Device not connected");
--
2.27.0
attach_store() is invoked when user requests import (attach) a device
from usbip host.
Attach and detach are governed by local state and shared state
- Shared state (usbip device status) - Device status is used to manage
the attach and detach operations on import-able devices.
- Local state (tcp_socket, rx and tx thread task_struct ptrs)
A valid tcp_socket controls rx and tx thread operations while the
device is in exported state.
- Device has to be in the right state to be attached and detached.
Attach sequence includes validating the socket and creating receive (rx)
and transmit (tx) threads to talk to the host to get access to the
imported device. The rx and tx threads depend on local and shared state
being correct and in sync.
The detach sequence shuts the socket down and stops the rx and tx
threads; it relies on local and shared states being in sync.
There are races in updating the local and shared state in the current
attach sequence, resulting in crashes. These stem from starting the rx
and tx threads before the local and shared state has been updated and
brought in sync.
1. kthread_create() errors aren't handled, so an invalid (ERR_PTR)
pointer is saved in the local state that drives the rx and tx threads.
2. tcp_socket and sockfd are updated and the rx and tx threads are
started before the usbip_device status is updated to VDEV_ST_NOTASSIGNED.
This opens a race window between the threads, port connect, and detach
handling.
Fix the above problems:
- Stop using kthread_get_run() macro to create/start threads.
- Create threads and get task struct reference.
- Add kthread_create() failure handling and bail out.
- Hold vhci and usbip_device locks to update local and shared states after
creating rx and tx threads.
- Update usbip_device status to VDEV_ST_NOTASSIGNED.
- Update usbip_device tcp_socket, sockfd, tcp_rx, and tcp_tx.
- Start threads only after the usbip_device state (tcp_socket, sockfd,
tcp_rx, tcp_tx, and status) update is complete.
Credit goes to syzbot and Tetsuo Handa for finding and root-causing the
kthread_get_run() improper error handling problem and others. This is a
hard problem to find and debug since the races aren't seen in a normal
case. Fuzzing shrinks the race window enough to trigger both the
kthread_get_run() error path bug and the bug of starting threads before
the local and shared state is updated in the attach sequence.
Tested with syzbot reproducer:
- https://syzkaller.appspot.com/text?tag=ReproC&x=14801034d00000
Reported-by: syzbot <syzbot+a93fba6d384346a761e3(a)syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+bf1a360e305ee719e364(a)syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+95ce4b142579611ef0a9(a)syzkaller.appspotmail.com>
Reported-by: Tetsuo Handa <penguin-kernel(a)I-love.SAKURA.ne.jp>
Fixes: 9720b4bc76a83807 ("staging/usbip: convert to kthread")
Cc: stable(a)vger.kernel.org
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
drivers/usb/usbip/vhci_sysfs.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/drivers/usb/usbip/vhci_sysfs.c b/drivers/usb/usbip/vhci_sysfs.c
index 1e1ae9bd06ab..c4b4256e5dad 100644
--- a/drivers/usb/usbip/vhci_sysfs.c
+++ b/drivers/usb/usbip/vhci_sysfs.c
@@ -312,6 +312,8 @@ static ssize_t attach_store(struct device *dev, struct device_attribute *attr,
struct vhci *vhci;
int err;
unsigned long flags;
+ struct task_struct *tcp_rx = NULL;
+ struct task_struct *tcp_tx = NULL;
/*
* @rhport: port number of vhci_hcd
@@ -360,9 +362,24 @@ static ssize_t attach_store(struct device *dev, struct device_attribute *attr,
return -EINVAL;
}
- /* now need lock until setting vdev status as used */
+ /* create threads before locking */
+ tcp_rx = kthread_create(vhci_rx_loop, &vdev->ud, "vhci_rx");
+ if (IS_ERR(tcp_rx)) {
+ sockfd_put(socket);
+ return -EINVAL;
+ }
+ tcp_tx = kthread_create(vhci_tx_loop, &vdev->ud, "vhci_tx");
+ if (IS_ERR(tcp_tx)) {
+ kthread_stop(tcp_rx);
+ sockfd_put(socket);
+ return -EINVAL;
+ }
+
+ /* get task structs now */
+ get_task_struct(tcp_rx);
+ get_task_struct(tcp_tx);
- /* begin a lock */
+ /* now hold the locks until vdev status is set */
spin_lock_irqsave(&vhci->lock, flags);
spin_lock(&vdev->ud.lock);
@@ -372,6 +389,8 @@ static ssize_t attach_store(struct device *dev, struct device_attribute *attr,
spin_unlock_irqrestore(&vhci->lock, flags);
sockfd_put(socket);
+ kthread_stop_put(tcp_rx);
+ kthread_stop_put(tcp_tx);
dev_err(dev, "port %d already used\n", rhport);
/*
@@ -390,6 +409,8 @@ static ssize_t attach_store(struct device *dev, struct device_attribute *attr,
vdev->speed = speed;
vdev->ud.sockfd = sockfd;
vdev->ud.tcp_socket = socket;
+ vdev->ud.tcp_rx = tcp_rx;
+ vdev->ud.tcp_tx = tcp_tx;
vdev->ud.status = VDEV_ST_NOTASSIGNED;
usbip_kcov_handle_init(&vdev->ud);
@@ -397,8 +418,8 @@ static ssize_t attach_store(struct device *dev, struct device_attribute *attr,
spin_unlock_irqrestore(&vhci->lock, flags);
/* end the lock */
- vdev->ud.tcp_rx = kthread_get_run(vhci_rx_loop, &vdev->ud, "vhci_rx");
- vdev->ud.tcp_tx = kthread_get_run(vhci_tx_loop, &vdev->ud, "vhci_tx");
+ wake_up_process(vdev->ud.tcp_rx);
+ wake_up_process(vdev->ud.tcp_tx);
rh_port_connect(vdev, speed);
--
2.27.0
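For reference, kthread_stop_put(), used in the busy-port error path of
the patch above, is approximately the following usbip helper: it stops
the thread and drops the reference taken with get_task_struct():

	#define kthread_stop_put(k)		\
		do {				\
			kthread_stop(k);	\
			put_task_struct(k);	\
		} while (0)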
usbip_sockfd_store() is invoked when the user requests attach (import)
or detach (unimport) of a usb device from the usbip host. vhci_hcd sends
the import request and usbip_sockfd_store() exports the device if it is
free for export.
Export and unexport are governed by local state and shared state
- Shared state (usbip device status, sockfd) - sockfd and device
status are used to determine if the stub should be brought up or shut
down.
- Local state (tcp_socket, rx and tx thread task_struct ptrs)
A valid tcp_socket controls rx and tx thread operations while the
device is in exported state.
- While the device is exported, device status is marked used and socket,
sockfd, and thread pointers are valid.
Export sequence (stub-up) includes validating the socket and creating
receive (rx) and transmit (tx) threads to talk to the client to provide
access to the exported device. The rx and tx threads depend on local and
shared state being correct and in sync.
The unexport (stub-down) sequence shuts the socket down and stops the rx
and tx threads; it relies on local and shared states being in sync.
There are races in updating the local and shared state in the current
stub-up sequence, resulting in crashes. These stem from starting the rx
and tx threads before the local and shared state has been updated and
brought in sync.
1. kthread_create() errors aren't handled, so an invalid (ERR_PTR)
pointer is saved in the local state that drives the rx and tx threads.
2. tcp_socket and sockfd are updated and the stub_rx and stub_tx threads
are started before the usbip_device status is updated to SDEV_ST_USED.
This opens a race window between the threads and the
usbip_sockfd_store() stub-up and stub-down handling.
Fix the above problems:
- Stop using kthread_get_run() macro to create/start threads.
- Create threads and get task struct reference.
- Add kthread_create() failure handling and bail out.
- Hold usbip_device lock to update local and shared states after
creating rx and tx threads.
- Update usbip_device status to SDEV_ST_USED.
- Update usbip_device tcp_socket, sockfd, tcp_rx, and tcp_tx.
- Start threads only after the usbip_device state (tcp_socket, sockfd,
tcp_rx, tcp_tx, and status) update is complete.
Credit goes to syzbot and Tetsuo Handa for finding and root-causing the
kthread_get_run() improper error handling problem and others. This is a
hard problem to find and debug since the races aren't seen in a normal
case. Fuzzing shrinks the race window enough to trigger both the
kthread_get_run() error path bug and the bug of starting threads before
the local and shared state is updated in the stub-up sequence.
Tested with syzbot reproducer:
- https://syzkaller.appspot.com/text?tag=ReproC&x=14801034d00000
Reported-by: syzbot <syzbot+a93fba6d384346a761e3(a)syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+bf1a360e305ee719e364(a)syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+95ce4b142579611ef0a9(a)syzkaller.appspotmail.com>
Reported-by: Tetsuo Handa <penguin-kernel(a)I-love.SAKURA.ne.jp>
Fixes: 9720b4bc76a83807 ("staging/usbip: convert to kthread")
Cc: stable(a)vger.kernel.org
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
drivers/usb/usbip/stub_dev.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)
diff --git a/drivers/usb/usbip/stub_dev.c b/drivers/usb/usbip/stub_dev.c
index 90c105469a07..8f1de1fbbeed 100644
--- a/drivers/usb/usbip/stub_dev.c
+++ b/drivers/usb/usbip/stub_dev.c
@@ -46,6 +46,8 @@ static ssize_t usbip_sockfd_store(struct device *dev, struct device_attribute *a
int sockfd = 0;
struct socket *socket;
int rv;
+ struct task_struct *tcp_rx = NULL;
+ struct task_struct *tcp_tx = NULL;
if (!sdev) {
dev_err(dev, "sdev is null\n");
@@ -80,20 +82,36 @@ static ssize_t usbip_sockfd_store(struct device *dev, struct device_attribute *a
goto sock_err;
}
- sdev->ud.tcp_socket = socket;
- sdev->ud.sockfd = sockfd;
-
+ /* unlock and create threads and get tasks */
spin_unlock_irq(&sdev->ud.lock);
+ tcp_rx = kthread_create(stub_rx_loop, &sdev->ud, "stub_rx");
+ if (IS_ERR(tcp_rx)) {
+ sockfd_put(socket);
+ return -EINVAL;
+ }
+ tcp_tx = kthread_create(stub_tx_loop, &sdev->ud, "stub_tx");
+ if (IS_ERR(tcp_tx)) {
+ kthread_stop(tcp_rx);
+ sockfd_put(socket);
+ return -EINVAL;
+ }
- sdev->ud.tcp_rx = kthread_get_run(stub_rx_loop, &sdev->ud,
- "stub_rx");
- sdev->ud.tcp_tx = kthread_get_run(stub_tx_loop, &sdev->ud,
- "stub_tx");
+ /* get task structs now */
+ get_task_struct(tcp_rx);
+ get_task_struct(tcp_tx);
+ /* lock and update sdev->ud state */
spin_lock_irq(&sdev->ud.lock);
+ sdev->ud.tcp_socket = socket;
+ sdev->ud.sockfd = sockfd;
+ sdev->ud.tcp_rx = tcp_rx;
+ sdev->ud.tcp_tx = tcp_tx;
sdev->ud.status = SDEV_ST_USED;
spin_unlock_irq(&sdev->ud.lock);
+ wake_up_process(sdev->ud.tcp_rx);
+ wake_up_process(sdev->ud.tcp_tx);
+
} else {
dev_info(dev, "stub down\n");
--
2.27.0
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a4c8dd9c2d0987cf542a2a0c42684c9c6d78a04e Mon Sep 17 00:00:00 2001
From: Jeffle Xu <jefflexu(a)linux.alibaba.com>
Date: Tue, 2 Feb 2021 11:35:28 +0800
Subject: [PATCH] dm table: fix iterate_devices based device capability checks
According to the definition of dm_iterate_devices_fn:
* This function must iterate through each section of device used by the
* target until it encounters a non-zero return code, which it then returns.
* Returns zero if no callout returned non-zero.
For some target types (e.g. dm-stripe), one call of iterate_devices()
may iterate over multiple underlying devices internally, in which case a
non-zero return code from iterate_devices_callout_fn will stop the
iteration early. No iterate_devices_callout_fn should return non-zero
unless device iteration should stop.
Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and
elevate it for reuse to stop iterating (and return non-zero) on the
first device that causes iterate_devices_callout_fn to return non-zero.
Use dm_table_any_dev_attr() to properly iterate through devices.
Rename device_is_nonrot() to device_is_rotational() and invert its logic
accordingly, so the "all devices are non-rotational" check is correctly
expressed as "no device is rotational" through dm_table_any_dev_attr().
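As a purely hypothetical illustration of the any/all duality (the
blk_queue_x() flag test and both function names below are made up for
the example), an "all devices support X" check can be expressed through
dm_table_any_dev_attr() with the negated predicate:

	/* anti-predicate: non-zero for a device that does NOT support X */
	static int device_not_x_capable(struct dm_target *ti, struct dm_dev *dev,
					sector_t start, sector_t len, void *data)
	{
		struct request_queue *q = bdev_get_queue(dev->bdev);

		return q && !blk_queue_x(q);	/* hypothetical queue flag test */
	}

	/* "all devices support X" == "no device lacks X" */
	static bool dm_table_supports_x(struct dm_table *t)
	{
		return !dm_table_any_dev_attr(t, device_not_x_capable);
	}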
Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it set")
Fixes: 4693c9668fdc ("dm table: propagate non rotational flag")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu(a)linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4acf2342f7ad..3e211e64f348 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1295,6 +1295,46 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
return &t->targets[(KEYS_PER_NODE * n) + k];
}
+/*
+ * type->iterate_devices() should be called when the sanity check needs to
+ * iterate and check all underlying data devices. iterate_devices() will
+ * iterate all underlying data devices until it encounters a non-zero return
+ * code, returned either by the input iterate_devices_callout_fn or by
+ * iterate_devices() itself internally.
+ *
+ * For some target type (e.g. dm-stripe), one call of iterate_devices() may
+ * iterate multiple underlying devices internally, in which case a non-zero
+ * return code returned by iterate_devices_callout_fn will stop the iteration
+ * in advance.
+ *
+ * Cases requiring _any_ underlying device supporting some kind of attribute,
+ * should use the iteration structure like dm_table_any_dev_attr(), or call
+ * it directly. @func should handle semantics of positive examples, e.g.
+ * capable of something.
+ *
+ * Cases requiring _all_ underlying devices supporting some kind of attribute,
+ * should use the iteration structure like dm_table_supports_nowait() or
+ * dm_table_supports_discards(). Or introduce dm_table_all_devs_attr() that
+ * uses an @anti_func that handles semantics of counter examples, e.g. not
+ * capable of something. So: return !dm_table_any_dev_attr(t, anti_func);
+ */
+static bool dm_table_any_dev_attr(struct dm_table *t,
+ iterate_devices_callout_fn func)
+{
+ struct dm_target *ti;
+ unsigned int i;
+
+ for (i = 0; i < dm_table_get_num_targets(t); i++) {
+ ti = dm_table_get_target(t, i);
+
+ if (ti->type->iterate_devices &&
+ ti->type->iterate_devices(ti, func, NULL))
+ return true;
+ }
+
+ return false;
+}
+
static int count_device(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1595,12 +1635,12 @@ static int dm_table_supports_dax_write_cache(struct dm_table *t)
return false;
}
-static int device_is_nonrot(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+static int device_is_rotational(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
{
struct request_queue *q = bdev_get_queue(dev->bdev);
- return q && blk_queue_nonrot(q);
+ return q && !blk_queue_nonrot(q);
}
static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
@@ -1611,23 +1651,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
return q && !blk_queue_add_random(q);
}
-static bool dm_table_all_devices_attribute(struct dm_table *t,
- iterate_devices_callout_fn func)
-{
- struct dm_target *ti;
- unsigned i;
-
- for (i = 0; i < dm_table_get_num_targets(t); i++) {
- ti = dm_table_get_target(t, i);
-
- if (!ti->type->iterate_devices ||
- !ti->type->iterate_devices(ti, func, NULL))
- return false;
- }
-
- return true;
-}
-
static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1779,27 +1802,6 @@ static int device_requires_stable_pages(struct dm_target *ti,
return q && blk_queue_stable_writes(q);
}
-/*
- * If any underlying device requires stable pages, a table must require
- * them as well. Only targets that support iterate_devices are considered:
- * don't want error, zero, etc to require stable pages.
- */
-static bool dm_table_requires_stable_pages(struct dm_table *t)
-{
- struct dm_target *ti;
- unsigned i;
-
- for (i = 0; i < dm_table_get_num_targets(t); i++) {
- ti = dm_table_get_target(t, i);
-
- if (ti->type->iterate_devices &&
- ti->type->iterate_devices(ti, device_requires_stable_pages, NULL))
- return true;
- }
-
- return false;
-}
-
void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *limits)
{
@@ -1849,10 +1851,10 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
dax_write_cache(t->md->dax_dev, true);
/* Ensure that all underlying devices are non-rotational. */
- if (dm_table_all_devices_attribute(t, device_is_nonrot))
- blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
- else
+ if (dm_table_any_dev_attr(t, device_is_rotational))
blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
+ else
+ blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
if (!dm_table_supports_write_same(t))
q->limits.max_write_same_sectors = 0;
@@ -1864,8 +1866,11 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
/*
* Some devices don't use blk_integrity but still want stable pages
* because they do their own checksumming.
+ * If any underlying device requires stable pages, a table must require
+ * them as well. Only targets that support iterate_devices are considered:
+ * don't want error, zero, etc to require stable pages.
*/
- if (dm_table_requires_stable_pages(t))
+ if (dm_table_any_dev_attr(t, device_requires_stable_pages))
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
else
blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
@@ -1876,7 +1881,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
* Clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not
* have it set.
*/
- if (blk_queue_add_random(q) && dm_table_all_devices_attribute(t, device_is_not_random))
+ if (blk_queue_add_random(q) && dm_table_any_dev_attr(t, device_is_not_random))
blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
hello
getting stable kernels with this script:
https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/tre…
fails since the last 2 (?) stable releases with (last lines):
...
+ /usr/bin/curl -L -o
/home/ron/Downloads/linux-tarball-verify.1GiZid5WT.untrusted/linux-5.11.4.tar.xz
https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.11.4.tar.xz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  112M  100  112M    0     0  5757k      0  0:00:19  0:00:19 --:--:-- 5938k
pushd ${TMPDIR} >/dev/null
+ pushd /home/ron/Downloads/linux-tarball-verify.1GiZid5WT.untrusted
echo "Verifying checksum on linux-${VER}.tar.xz"
+ echo 'Verifying checksum on linux-5.11.4.tar.xz'
Verifying checksum on linux-5.11.4.tar.xz
if ! ${SHA256SUMBIN} -c ${SHACHECK}; then
echo "FAILED to verify the downloaded tarball checksum"
popd >/dev/null
rm -rf ${TMPDIR}
exit 1
fi
+ /usr/bin/sha256sum -c
/home/ron/Downloads/linux-tarball-verify.1GiZid5WT.untrusted/sha256sums.txt
/usr/bin/sha256sum:
/home/ron/Downloads/linux-tarball-verify.1GiZid5WT.untrusted/sha256sums.txt:
no properly formatted SHA256 checksum lines found
+ echo 'FAILED to verify the downloaded tarball checksum'
FAILED to verify the downloaded tarball checksum
+ popd
+ rm -rf /home/ron/Downloads/linux-tarball-verify.1GiZid5WT.untrusted
+ exit 1
Checksumming the downloaded kernel manually gives an "Okay" though.
Is this just me (on Fedora 33)?
--
regards
Ronald
This is the start of the stable review cycle for the 4.19.179 release.
There are 52 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 07 Mar 2021 12:08:39 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.179-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.179-rc1
Takashi Iwai <tiwai(a)suse.de>
ALSA: hda/realtek: Apply dual codec quirks for MSI Godlike X570 board
Eckhart Mohr <e.mohr(a)tuxedocomputers.com>
ALSA: hda/realtek: Add quirk for Clevo NH55RZQ
Sakari Ailus <sakari.ailus(a)linux.intel.com>
media: v4l: ioctl: Fix memory leak in video_usercopy
Jens Axboe <axboe(a)kernel.dk>
swap: fix swapfile read/write offset
Rokudo Yan <wu-yan(a)tcl.com>
zsmalloc: account the number of compacted pages correctly
Jan Beulich <jbeulich(a)suse.com>
xen-netback: respect gnttab_map_refs()'s return value
Jan Beulich <jbeulich(a)suse.com>
Xen/gnttab: handle p2m update errors on a per-slot basis
Chris Leech <cleech(a)redhat.com>
scsi: iscsi: Verify lengths on passthrough PDUs
Chris Leech <cleech(a)redhat.com>
scsi: iscsi: Ensure sysfs attributes are limited to PAGE_SIZE
Joe Perches <joe(a)perches.com>
sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
Lee Duncan <lduncan(a)suse.com>
scsi: iscsi: Restrict sessions and handles to admin capabilities
Hans de Goede <hdegoede(a)redhat.com>
ASoC: Intel: bytcr_rt5640: Add quirk for the Acer One S1002 tablet
Hans de Goede <hdegoede(a)redhat.com>
ASoC: Intel: bytcr_rt5640: Add quirk for the Voyo Winpad A15 tablet
Hans de Goede <hdegoede(a)redhat.com>
ASoC: Intel: bytcr_rt5640: Add quirk for the Estar Beauty HD MID 7316R tablet
John David Anglin <dave.anglin(a)bell.net>
parisc: Bump 64-bit IRQ stack size to 64 KB
Josef Bacik <josef(a)toxicpanda.com>
btrfs: fix error handling in commit_fs_roots
Chao Yu <yuchao0(a)huawei.com>
f2fs: fix to set/clear I_LINKABLE under i_lock
Jaegeuk Kim <jaegeuk(a)kernel.org>
f2fs: handle unallocated section and zone on pinned/atgc
Ricardo Ribalda <ribalda(a)chromium.org>
media: uvcvideo: Allow entities with no pads
Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
drm/amd/display: Guard against NULL pointer deref when get_i2c_info fails
Nirmoy Das <nirmoy.das(a)amd.com>
PCI: Add a REBAR size quirk for Sapphire RX 5600 XT Pulse
Ard Biesheuvel <ardb(a)kernel.org>
crypto: tcrypt - avoid signed overflow in byte count
Christian Gromm <christian.gromm(a)microchip.com>
staging: most: sound: add sanity check for function argument
Gopal Tiwari <gtiwari(a)redhat.com>
Bluetooth: Fix null pointer dereference in amp_read_loc_assoc_final_data
Fangrui Song <maskray(a)google.com>
x86/build: Treat R_386_PLT32 relocation as R_386_PC32
Miaoqing Pan <miaoqing(a)codeaurora.org>
ath10k: fix wmi mgmt tx queue full due to race condition
Di Zhu <zhudi21(a)huawei.com>
pktgen: fix misuse of BUG_ON() in pktgen_thread_worker()
Claire Chang <tientzu(a)chromium.org>
Bluetooth: hci_h5: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for btrtl
Tony Lindgren <tony(a)atomide.com>
wlcore: Fix command execute failure 19 for wl12xx
Jiri Slaby <jslaby(a)suse.cz>
vt/consolemap: do font sum unsigned
Heiner Kallweit <hkallweit1(a)gmail.com>
x86/reboot: Add Zotac ZBOX CI327 nano PCI reboot quirk
Dinghao Liu <dinghao.liu(a)zju.edu.cn>
staging: fwserial: Fix error handling in fwserial_create
Marek Vasut <marex(a)denx.de>
rsi: Move card interrupt handling to RX thread
Marek Vasut <marex(a)denx.de>
rsi: Fix TX EAPOL packet handling against iwlwifi AP
Geert Uytterhoeven <geert+renesas(a)glider.be>
dt-bindings: net: btusb: DT fix s/interrupt-name/interrupt-names/
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: bridge: use switchdev for port flags set through sysfs too
Li Xinhai <lixinhai.lxh(a)gmail.com>
mm/hugetlb.c: fix unnecessary address expansion of pmd sharing
Marco Elver <elver(a)google.com>
net: fix up truesize of cloned skb in skb_prepare_for_shift()
Sabyrzhan Tasbolatov <snovitoll(a)gmail.com>
smackfs: restrict bytes count in smackfs write functions
Yumei Huang <yuhuang(a)redhat.com>
xfs: Fix assert failure in xfs_setattr_size()
Sean Young <sean(a)mess.org>
media: mceusb: sanity check for prescaler value
Zqiang <qiang.zhang(a)windriver.com>
udlfb: Fix memory leak in dlfb_usb_probe
Randy Dunlap <rdunlap(a)infradead.org>
JFS: more checks for invalid superblock
Nathan Chancellor <natechancellor(a)gmail.com>
MIPS: VDSO: Use CLANG_FLAGS instead of filtering out '--target='
Andrew Murray <andrew.murray(a)arm.com>
arm64: Use correct ll/sc atomic constraints
Will Deacon <will.deacon(a)arm.com>
arm64: cmpxchg: Use "K" instead of "L" for ll/sc immediate constraint
Will Deacon <will.deacon(a)arm.com>
arm64: Avoid redundant type conversions in xchg() and cmpxchg()
Shaoying Xu <shaoyi(a)amazon.com>
arm64 module: set plt* section addresses to 0x0
Cornelia Huck <cohuck(a)redhat.com>
virtio/s390: implement virtio-ccw revision 2 correctly
Sergey Senozhatsky <senozhatsky(a)chromium.org>
drm/virtio: use kvmalloc for large allocations
Mike Kravetz <mike.kravetz(a)oracle.com>
hugetlb: fix update_and_free_page contig page struct assumption
Lech Perczak <lech.perczak(a)gmail.com>
net: usb: qmi_wwan: support ZTE P685M modem
-------------
Diffstat:
Documentation/devicetree/bindings/net/btusb.txt | 2 +-
Documentation/filesystems/sysfs.txt | 8 +-
Makefile | 4 +-
arch/arm/xen/p2m.c | 35 +++++-
arch/arm64/include/asm/atomic_ll_sc.h | 108 +++++++++--------
arch/arm64/include/asm/atomic_lse.h | 46 ++++----
arch/arm64/include/asm/cmpxchg.h | 116 +++++++++----------
arch/arm64/kernel/module.lds | 6 +-
arch/mips/vdso/Makefile | 5 +-
arch/parisc/kernel/irq.c | 4 +
arch/x86/kernel/module.c | 1 +
arch/x86/kernel/reboot.c | 9 ++
arch/x86/tools/relocs.c | 12 +-
arch/x86/xen/p2m.c | 44 ++++++-
crypto/tcrypt.c | 20 ++--
drivers/block/zram/zram_drv.c | 2 +-
drivers/bluetooth/hci_h5.c | 5 +
drivers/gpu/drm/amd/display/dc/core/dc_link.c | 5 +
drivers/gpu/drm/virtio/virtgpu_vq.c | 6 +-
drivers/media/rc/mceusb.c | 9 +-
drivers/media/usb/uvc/uvc_driver.c | 7 +-
drivers/media/v4l2-core/v4l2-ioctl.c | 19 ++-
drivers/net/usb/qmi_wwan.c | 1 +
drivers/net/wireless/ath/ath10k/mac.c | 15 +--
drivers/net/wireless/rsi/rsi_91x_hal.c | 3 +-
drivers/net/wireless/rsi/rsi_91x_sdio.c | 6 +-
drivers/net/wireless/rsi/rsi_91x_sdio_ops.c | 52 +++------
drivers/net/wireless/rsi/rsi_sdio.h | 8 +-
drivers/net/wireless/ti/wl12xx/main.c | 3 -
drivers/net/wireless/ti/wlcore/main.c | 15 +--
drivers/net/wireless/ti/wlcore/wlcore.h | 3 -
drivers/net/xen-netback/netback.c | 12 +-
drivers/pci/pci.c | 9 +-
drivers/s390/virtio/virtio_ccw.c | 4 +-
drivers/scsi/libiscsi.c | 148 ++++++++++++------------
drivers/scsi/scsi_transport_iscsi.c | 38 ++++--
drivers/staging/fwserial/fwserial.c | 2 +
drivers/staging/most/sound/sound.c | 2 +
drivers/tty/vt/consolemap.c | 2 +-
drivers/video/fbdev/udlfb.c | 1 +
fs/btrfs/transaction.c | 11 +-
fs/f2fs/namei.c | 8 ++
fs/f2fs/segment.h | 4 +-
fs/jfs/jfs_filsys.h | 1 +
fs/jfs/jfs_mount.c | 10 ++
fs/sysfs/file.c | 55 +++++++++
fs/xfs/xfs_iops.c | 2 +-
include/linux/sysfs.h | 16 +++
include/linux/zsmalloc.h | 2 +-
mm/hugetlb.c | 28 +++--
mm/page_io.c | 11 +-
mm/swapfile.c | 2 +-
mm/zsmalloc.c | 17 ++-
net/bluetooth/amp.c | 3 +
net/bridge/br_sysfs_if.c | 9 +-
net/core/pktgen.c | 2 +-
net/core/skbuff.c | 14 ++-
security/smack/smackfs.c | 21 +++-
sound/pci/hda/patch_realtek.c | 2 +
sound/soc/intel/boards/bytcr_rt5640.c | 37 ++++++
60 files changed, 652 insertions(+), 400 deletions(-)
This is the start of the stable review cycle for the 4.4.260 release.
There are 30 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 07 Mar 2021 12:08:39 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.260-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.4.260-rc1
Sakari Ailus <sakari.ailus(a)linux.intel.com>
media: v4l: ioctl: Fix memory leak in video_usercopy
Jens Axboe <axboe(a)kernel.dk>
swap: fix swapfile read/write offset
Rokudo Yan <wu-yan(a)tcl.com>
zsmalloc: account the number of compacted pages correctly
Jan Beulich <jbeulich(a)suse.com>
xen-netback: respect gnttab_map_refs()'s return value
Jan Beulich <jbeulich(a)suse.com>
Xen/gnttab: handle p2m update errors on a per-slot basis
Chris Leech <cleech(a)redhat.com>
scsi: iscsi: Verify lengths on passthrough PDUs
Chris Leech <cleech(a)redhat.com>
scsi: iscsi: Ensure sysfs attributes are limited to PAGE_SIZE
Joe Perches <joe(a)perches.com>
sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
Lee Duncan <lduncan(a)suse.com>
scsi: iscsi: Restrict sessions and handles to admin capabilities
Ricardo Ribalda <ribalda(a)chromium.org>
media: uvcvideo: Allow entities with no pads
Christian Gromm <christian.gromm(a)microchip.com>
staging: most: sound: add sanity check for function argument
Gopal Tiwari <gtiwari(a)redhat.com>
Bluetooth: Fix null pointer dereference in amp_read_loc_assoc_final_data
Fangrui Song <maskray(a)google.com>
x86/build: Treat R_386_PLT32 relocation as R_386_PC32
Miaoqing Pan <miaoqing(a)codeaurora.org>
ath10k: fix wmi mgmt tx queue full due to race condition
Di Zhu <zhudi21(a)huawei.com>
pktgen: fix misuse of BUG_ON() in pktgen_thread_worker()
Tony Lindgren <tony(a)atomide.com>
wlcore: Fix command execute failure 19 for wl12xx
Jiri Slaby <jslaby(a)suse.cz>
vt/consolemap: do font sum unsigned
Heiner Kallweit <hkallweit1(a)gmail.com>
x86/reboot: Add Zotac ZBOX CI327 nano PCI reboot quirk
Dinghao Liu <dinghao.liu(a)zju.edu.cn>
staging: fwserial: Fix error handling in fwserial_create
Li Xinhai <lixinhai.lxh(a)gmail.com>
mm/hugetlb.c: fix unnecessary address expansion of pmd sharing
Marco Elver <elver(a)google.com>
net: fix up truesize of cloned skb in skb_prepare_for_shift()
Yumei Huang <yuhuang(a)redhat.com>
xfs: Fix assert failure in xfs_setattr_size()
Randy Dunlap <rdunlap(a)infradead.org>
JFS: more checks for invalid superblock
Mike Kravetz <mike.kravetz(a)oracle.com>
hugetlb: fix update_and_free_page contig page struct assumption
Rolf Eike Beer <eb(a)emlix.com>
scripts: set proper OpenSSL include dir also for sign-file
Rolf Eike Beer <eb(a)emlix.com>
scripts: use pkg-config to locate libcrypto
Frank Li <Frank.Li(a)nxp.com>
mmc: sdhci-esdhc-imx: fix kernel panic when remove module
Nobuhiro Iwamatsu <nobuhiro1.iwamatsu(a)toshiba.co.jp>
iwlwifi: pcie: fix to correct null check
Lech Perczak <lech.perczak(a)gmail.com>
net: usb: qmi_wwan: support ZTE P685M modem
Thomas Gleixner <tglx(a)linutronix.de>
futex: Ensure the correct return value from futex_lock_pi()
-------------
Diffstat:
Documentation/filesystems/sysfs.txt | 8 +-
Makefile | 4 +-
arch/arm/xen/p2m.c | 35 +++++++-
arch/x86/kernel/module.c | 1 +
arch/x86/kernel/reboot.c | 9 ++
arch/x86/tools/relocs.c | 12 ++-
arch/x86/xen/p2m.c | 44 +++++++++-
drivers/block/zram/zram_drv.c | 2 +-
drivers/media/usb/uvc/uvc_driver.c | 7 +-
drivers/media/v4l2-core/v4l2-ioctl.c | 19 ++--
drivers/mmc/host/sdhci-esdhc-imx.c | 3 +-
drivers/net/usb/qmi_wwan.c | 1 +
drivers/net/wireless/ath/ath10k/mac.c | 15 +---
drivers/net/wireless/iwlwifi/pcie/tx.c | 4 +-
drivers/net/wireless/ti/wl12xx/main.c | 3 -
drivers/net/wireless/ti/wlcore/main.c | 15 +---
drivers/net/wireless/ti/wlcore/wlcore.h | 3 -
drivers/net/xen-netback/netback.c | 12 ++-
drivers/scsi/libiscsi.c | 148 ++++++++++++++++----------------
drivers/scsi/scsi_transport_iscsi.c | 38 ++++++--
drivers/staging/fwserial/fwserial.c | 2 +
drivers/staging/most/aim-sound/sound.c | 2 +
drivers/tty/vt/consolemap.c | 2 +-
fs/jfs/jfs_filsys.h | 1 +
fs/jfs/jfs_mount.c | 10 +++
fs/sysfs/file.c | 55 ++++++++++++
fs/xfs/xfs_iops.c | 2 +-
include/linux/sysfs.h | 16 ++++
include/linux/zsmalloc.h | 2 +-
kernel/futex.c | 24 +++---
mm/hugetlb.c | 28 +++---
mm/page_io.c | 11 +--
mm/swapfile.c | 2 +-
mm/zsmalloc.c | 17 ++--
net/bluetooth/amp.c | 3 +
net/core/pktgen.c | 2 +-
net/core/skbuff.c | 14 ++-
scripts/Makefile | 9 +-
38 files changed, 390 insertions(+), 195 deletions(-)
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a4c8dd9c2d0987cf542a2a0c42684c9c6d78a04e Mon Sep 17 00:00:00 2001
From: Jeffle Xu <jefflexu(a)linux.alibaba.com>
Date: Tue, 2 Feb 2021 11:35:28 +0800
Subject: [PATCH] dm table: fix iterate_devices based device capability checks
(commit message and diff are identical to the copy quoted above for the
5.4-stable tree)
Please consider applying the following upstream patch to stable trees
v5.4 and up
commit 660d2062190db131d2feaf19914e90f868fe285c
Author: Ard Biesheuvel <ardb(a)kernel.org>
Date: Wed Jan 13 10:11:35 2021 +0100
crypto - shash: reduce minimum alignment of shash_desc structure
On architectures such as arm64, it reduces the worst case memory
footprint of a SHASH_DESC_ON_STACK() allocation (which reserves space
for two struct shash_desc instances) by ~350 bytes.
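For context, a rough sketch of the macro after this change (the exact
definition lives in include/crypto/hash.h and may differ in detail;
HASH_MAX_DESCSIZE already accounts for a nested struct shash_desc, which
is why the buffer effectively reserves space for two of them):

	#define SHASH_DESC_ON_STACK(shash, ctx)				      \
		char __##shash##_desc[sizeof(struct shash_desc) +	      \
			HASH_MAX_DESCSIZE]				      \
			__aligned(__alignof__(struct shash_desc));	      \
		struct shash_desc *shash = (struct shash_desc *)__##shash##_desc

The saving comes from dropping the previous CRYPTO_MINALIGN_ATTR
(128-byte alignment on arm64) in favour of the natural alignment of
struct shash_desc, removing the worst-case stack padding.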
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From df84fe94708985cdfb78a83148322bcd0a699472 Mon Sep 17 00:00:00 2001
From: Timothy E Baldwin <T.E.Baldwin99(a)members.leeds.ac.uk>
Date: Sat, 16 Jan 2021 15:18:54 +0000
Subject: [PATCH] arm64: ptrace: Fix seccomp of traced syscall -1 (NO_SYSCALL)
Since commit f086f67485c5 ("arm64: ptrace: add support for syscall
emulation"), if system call number -1 is called and the process is being
traced with PTRACE_SYSCALL, for example by strace, the seccomp check is
skipped and -ENOSYS is returned unconditionally (unless altered by the
tracer) rather than carrying out the action specified in the seccomp
filter.
The consequence of this is that it is not possible to reliably strace
a seccomp based implementation of a foreign system call interface in
which r7/x8 is permitted to be -1 on entry to a system call.
Also trace_sys_enter and audit_syscall_entry are skipped if a system
call is skipped.
Fix by removing the in_syscall(regs) check, restoring the previous
behaviour, which matches AArch32, x86 (which uses generic code) and
everything else.
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Fixes: f086f67485c5 ("arm64: ptrace: add support for syscall emulation")
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Reviewed-by: Sudeep Holla <sudeep.holla(a)arm.com>
Tested-by: Sudeep Holla <sudeep.holla(a)arm.com>
Signed-off-by: Timothy E Baldwin <T.E.Baldwin99(a)members.leeds.ac.uk>
Link: https://lore.kernel.org/r/90edd33b-6353-1228-791f-0336d94d5f8c@majoroak.me.…
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 3d5c8afca75b..170f42fd6101 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1797,7 +1797,7 @@ int syscall_trace_enter(struct pt_regs *regs)
if (flags & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
- if (!in_syscall(regs) || (flags & _TIF_SYSCALL_EMU))
+ if (flags & _TIF_SYSCALL_EMU)
return NO_SYSCALL;
}
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4d14c5cde5c268a2bc26addecf09489cb953ef64 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Mon, 22 Feb 2021 18:40:44 +0200
Subject: [PATCH] btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is deadlock
prone. In the past multiple commits:
* ae5e070eaca9 ("btrfs: qgroup: don't try to wait flushing if we're
already holding a transaction")
* 6f23277a49e6 ("btrfs: qgroup: don't commit transaction when we already
hold the handle")
Tried to solve various aspects of this but this was always a
whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
can call btrfs_dirty_inode as a result of reading a file and modifying
its atime:
PID: 6963 TASK: ffff8c7f3f94c000 CPU: 2 COMMAND: "test"
#0 __schedule at ffffffffa529e07d
#1 schedule at ffffffffa529e4ff
#2 schedule_timeout at ffffffffa52a1bdd
#3 wait_for_completion at ffffffffa529eeea <-- sleeps with delayed node mutex held
#4 start_delalloc_inodes at ffffffffc0380db5
#5 btrfs_start_delalloc_snapshot at ffffffffc0393836
#6 try_flush_qgroup at ffffffffc03f04b2
#7 __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6 <-- tries to reserve space and starts delalloc inodes.
#8 btrfs_delayed_update_inode at ffffffffc03e31aa <-- acquires delayed node mutex
#9 btrfs_update_inode at ffffffffc0385ba8
#10 btrfs_dirty_inode at ffffffffc038627b <-- TRANSACTION OPENED
#11 touch_atime at ffffffffa4cf0000
#12 generic_file_read_iter at ffffffffa4c1f123
#13 new_sync_read at ffffffffa4ccdc8a
#14 vfs_read at ffffffffa4cd0849
#15 ksys_read at ffffffffa4cd0bd1
#16 do_syscall_64 at ffffffffa4a052eb
#17 entry_SYSCALL_64_after_hwframe at ffffffffa540008c
This causes asynchronous work to be scheduled to flush the delalloc
inodes, which can try to acquire the same delayed_node mutex:
PID: 455 TASK: ffff8c8085fa4000 CPU: 5 COMMAND: "kworker/u16:30"
#0 __schedule at ffffffffa529e07d
#1 schedule at ffffffffa529e4ff
#2 schedule_preempt_disabled at ffffffffa529e80a
#3 __mutex_lock at ffffffffa529fdcb <-- goes to sleep, never wakes up.
#4 btrfs_delayed_update_inode at ffffffffc03e3143 <-- tries to acquire the mutex
#5 btrfs_update_inode at ffffffffc0385ba8 <-- this is the same inode that pid 6963 is holding
#6 cow_file_range_inline.constprop.78 at ffffffffc0386be7
#7 cow_file_range at ffffffffc03879c1
#8 btrfs_run_delalloc_range at ffffffffc038894c
#9 writepage_delalloc at ffffffffc03a3c8f
#10 __extent_writepage at ffffffffc03a4c01
#11 extent_write_cache_pages at ffffffffc03a500b
#12 extent_writepages at ffffffffc03a6de2
#13 do_writepages at ffffffffa4c277eb
#14 __filemap_fdatawrite_range at ffffffffa4c1e5bb
#15 btrfs_run_delalloc_work at ffffffffc0380987 <-- starts running delayed nodes
#16 normal_work_helper at ffffffffc03b706c
#17 process_one_work at ffffffffa4aba4e4
#18 worker_thread at ffffffffa4aba6fd
#19 kthread at ffffffffa4ac0a3d
#20 ret_from_fork at ffffffffa54001ff
To fully address those cases the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves it by calling qgroup_reserve_meta directly which will
either succeed without flushing or will fail and return -EDQUOT. In the
latter case that return value is going to be propagated to
btrfs_dirty_inode, which will fall back to starting a new transaction.
That's fine, as the majority of the time we expect the inode to have the
BTRFS_DELAYED_NODE_INODE_DIRTY flag set, which results in directly
copying the in-memory state.
Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable(a)vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index ac9966e76a2f..bf25401c9768 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -627,7 +627,8 @@ static int btrfs_delayed_inode_reserve_metadata(
*/
if (!src_rsv || (!trans->bytes_reserved &&
src_rsv->type != BTRFS_BLOCK_RSV_DELALLOC)) {
- ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true);
+ ret = btrfs_qgroup_reserve_meta(root, num_bytes,
+ BTRFS_QGROUP_RSV_META_PREALLOC, true);
if (ret < 0)
return ret;
ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f2f1e932751..c35b724a5611 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6081,7 +6081,7 @@ static int btrfs_dirty_inode(struct inode *inode)
return PTR_ERR(trans);
ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
- if (ret && ret == -ENOSPC) {
+ if (ret && (ret == -ENOSPC || ret == -EDQUOT)) {
/* whoops, lets try again with the full transaction */
btrfs_end_transaction(trans);
trans = btrfs_start_transaction(root, 1);
From: Dmitry Osipenko <digetx(a)gmail.com>
[ Upstream commit 7db688e99c0f770ae73e0f1f3fb67f9b64266445 ]
There is quite a huge "uncorrectable error in header" flood in KMSG
on a clean system boot, since there is no pstore buffer saved in RAM.
Let's silence the redundant noisy messages by rate-limiting the printk
message. Now the repeated message is printed at most 10 times instead
of 35+.
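For reference, a hand-expanded sketch of what pr_info_ratelimited()
does per call site (see include/linux/printk.h and
include/linux/ratelimit.h): allow a burst of DEFAULT_RATELIMIT_BURST
(10) messages per DEFAULT_RATELIMIT_INTERVAL (5 * HZ) and drop the
rest, which is where the "at most 10 messages" above comes from:

	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	if (__ratelimit(&rs))
		pr_info("uncorrectable error in header\n");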
Signed-off-by: Dmitry Osipenko <digetx(a)gmail.com>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Link: https://lore.kernel.org/r/20210302095850.30894-1-digetx@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/pstore/ram_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index 679d75a864d0..40bdf7d85a05 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -236,7 +236,7 @@ static int persistent_ram_init_ecc(struct persistent_ram_zone *prz,
pr_info("error in header, %d\n", numerr);
prz->corrected_bytes += numerr;
} else if (numerr < 0) {
- pr_info("uncorrectable error in header\n");
+ pr_info_ratelimited("uncorrectable error in header\n");
prz->bad_blocks++;
}
--
2.30.1
From: Dmitry Osipenko <digetx(a)gmail.com>
[ Upstream commit 7db688e99c0f770ae73e0f1f3fb67f9b64266445 ]
There is quite a huge "uncorrectable error in header" flood in KMSG
on a clean system boot, since there is no pstore buffer saved in RAM.
Let's silence the redundant noisy messages by rate-limiting the printk
message. Now the repeated message is printed at most 10 times instead
of 35+.
Signed-off-by: Dmitry Osipenko <digetx(a)gmail.com>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Link: https://lore.kernel.org/r/20210302095850.30894-1-digetx@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/pstore/ram_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index 11e558efd61e..a953d5f04ea0 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -237,7 +237,7 @@ static int persistent_ram_init_ecc(struct persistent_ram_zone *prz,
pr_info("error in header, %d\n", numerr);
prz->corrected_bytes += numerr;
} else if (numerr < 0) {
- pr_info("uncorrectable error in header\n");
+ pr_info_ratelimited("uncorrectable error in header\n");
prz->bad_blocks++;
}
--
2.30.1
From: Dmitry Osipenko <digetx(a)gmail.com>
[ Upstream commit 7db688e99c0f770ae73e0f1f3fb67f9b64266445 ]
There is quite a huge "uncorrectable error in header" flood in KMSG
on a clean system boot, since there is no pstore buffer saved in RAM.
Let's silence the redundant noisy messages by rate-limiting the printk
message. Now the repeated message is printed at most 10 times instead
of 35+.
Signed-off-by: Dmitry Osipenko <digetx(a)gmail.com>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Link: https://lore.kernel.org/r/20210302095850.30894-1-digetx@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/pstore/ram_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index 3c777ec80d47..2090a6f3610b 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -245,7 +245,7 @@ static int persistent_ram_init_ecc(struct persistent_ram_zone *prz,
pr_info("error in header, %d\n", numerr);
prz->corrected_bytes += numerr;
} else if (numerr < 0) {
- pr_info("uncorrectable error in header\n");
+ pr_info_ratelimited("uncorrectable error in header\n");
prz->bad_blocks++;
}
--
2.30.1
From: Dmitry Osipenko <digetx(a)gmail.com>
[ Upstream commit 7db688e99c0f770ae73e0f1f3fb67f9b64266445 ]
There is quite a large flood of "uncorrectable error in header"
messages in KMSG on a clean system boot, since there is no pstore
buffer saved in RAM yet. Let's silence the redundant noisy messages by
rate-limiting the printk. Now at most 10 repeated messages are printed
instead of 35+.
Signed-off-by: Dmitry Osipenko <digetx(a)gmail.com>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Link: https://lore.kernel.org/r/20210302095850.30894-1-digetx@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/pstore/ram_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index 1f4d8c06f9be..4cb0478a17f8 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -246,7 +246,7 @@ static int persistent_ram_init_ecc(struct persistent_ram_zone *prz,
pr_info("error in header, %d\n", numerr);
prz->corrected_bytes += numerr;
} else if (numerr < 0) {
- pr_info("uncorrectable error in header\n");
+ pr_info_ratelimited("uncorrectable error in header\n");
prz->bad_blocks++;
}
--
2.30.1
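As a rough illustration of why at most 10 messages now get through:
pr_info_ratelimited() applies the default printk ratelimit, an interval
of 5 seconds with a burst of 10. A minimal userspace sketch of that
scheme, with illustrative names only (not the kernel implementation):
#include <stdio.h>
#include <time.h>
/* Userspace model of the printk ratelimit: allow at most
 * BURST messages per INTERVAL seconds, count the rest. */
#define INTERVAL 5
#define BURST    10
static time_t window_start;
static int printed, suppressed;
static int ratelimit_ok(void)
{
	time_t now = time(NULL);
	if (now - window_start >= INTERVAL) {
		window_start = now;
		printed = 0;
	}
	if (printed < BURST) {
		printed++;
		return 1;
	}
	suppressed++;
	return 0;
}
int main(void)
{
	for (int i = 0; i < 35; i++)
		if (ratelimit_ok())
			printf("uncorrectable error in header\n");
	printf("%d callbacks suppressed\n", suppressed);
	return 0;
}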
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5011c5a663b9c6d6aff3d394f11049b371199627 Mon Sep 17 00:00:00 2001
From: Dan Carpenter <dancarpenter(a)oracle.com>
Date: Wed, 17 Feb 2021 09:04:34 +0300
Subject: [PATCH] btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
The problem is that we're copying "inherit" from user space but we
don't necessarily know that we're copying enough data for the 64-byte
struct. The next problem is that 'inherit' has a variable-size array
at the end, and we have to verify that the array is the size we
expected.
Fixes: 6f72c7e20dba ("Btrfs: add qgroup inheritance")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a8c60d46d19c..1b837c08ca90 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1935,7 +1935,10 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file,
if (vol_args->flags & BTRFS_SUBVOL_RDONLY)
readonly = true;
if (vol_args->flags & BTRFS_SUBVOL_QGROUP_INHERIT) {
- if (vol_args->size > PAGE_SIZE) {
+ u64 nums;
+
+ if (vol_args->size < sizeof(*inherit) ||
+ vol_args->size > PAGE_SIZE) {
ret = -EINVAL;
goto free_args;
}
@@ -1944,6 +1947,20 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file,
ret = PTR_ERR(inherit);
goto free_args;
}
+
+ if (inherit->num_qgroups > PAGE_SIZE ||
+ inherit->num_ref_copies > PAGE_SIZE ||
+ inherit->num_excl_copies > PAGE_SIZE) {
+ ret = -EINVAL;
+ goto free_inherit;
+ }
+
+ nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
+ 2 * inherit->num_excl_copies;
+ if (vol_args->size != struct_size(inherit, qgroups, nums)) {
+ ret = -EINVAL;
+ goto free_inherit;
+ }
}
ret = __btrfs_ioctl_snap_create(file, vol_args->name, vol_args->fd,
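For clarity, the exact-size check pairs the fixed header with the
flexible qgroups array: struct_size(inherit, qgroups, nums) expands to
the header size plus nums array elements. A minimal standalone sketch
of the same validation, using a simplified stand-in struct (the real
btrfs_qgroup_inherit has more fields; FAKE_PAGE_SIZE stands in for
PAGE_SIZE):
#include <stdint.h>
#include <stdio.h>
/* Simplified stand-in for struct btrfs_qgroup_inherit: a fixed
 * header followed by a flexible u64 array of 'nums' entries. */
struct inherit_hdr {
	uint64_t num_qgroups;
	uint64_t num_ref_copies;
	uint64_t num_excl_copies;
	uint64_t qgroups[];
};
#define FAKE_PAGE_SIZE 4096	/* stands in for PAGE_SIZE */
/* Same shape as the patch: bound the counts first, then require
 * the user-supplied size to equal struct_size(h, qgroups, nums). */
static int size_is_valid(const struct inherit_hdr *h, uint64_t user_size)
{
	uint64_t nums;
	if (h->num_qgroups > FAKE_PAGE_SIZE ||
	    h->num_ref_copies > FAKE_PAGE_SIZE ||
	    h->num_excl_copies > FAKE_PAGE_SIZE)
		return 0;
	nums = h->num_qgroups + 2 * h->num_ref_copies +
	       2 * h->num_excl_copies;
	return user_size == sizeof(*h) + nums * sizeof(uint64_t);
}
int main(void)
{
	struct inherit_hdr h = { .num_qgroups = 2 };
	printf("%d\n", size_is_valid(&h, sizeof(h) + 2 * sizeof(uint64_t)));
	return 0;
}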
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5011c5a663b9c6d6aff3d394f11049b371199627 Mon Sep 17 00:00:00 2001
From: Dan Carpenter <dancarpenter(a)oracle.com>
Date: Wed, 17 Feb 2021 09:04:34 +0300
Subject: [PATCH] btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
The problem is that we're copying "inherit" from user space but we
don't necessarily know that we're copying enough data for the 64-byte
struct. The next problem is that 'inherit' has a variable-size array
at the end, and we have to verify that the array is the size we
expected.
Fixes: 6f72c7e20dba ("Btrfs: add qgroup inheritance")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a8c60d46d19c..1b837c08ca90 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1935,7 +1935,10 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file,
if (vol_args->flags & BTRFS_SUBVOL_RDONLY)
readonly = true;
if (vol_args->flags & BTRFS_SUBVOL_QGROUP_INHERIT) {
- if (vol_args->size > PAGE_SIZE) {
+ u64 nums;
+
+ if (vol_args->size < sizeof(*inherit) ||
+ vol_args->size > PAGE_SIZE) {
ret = -EINVAL;
goto free_args;
}
@@ -1944,6 +1947,20 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file,
ret = PTR_ERR(inherit);
goto free_args;
}
+
+ if (inherit->num_qgroups > PAGE_SIZE ||
+ inherit->num_ref_copies > PAGE_SIZE ||
+ inherit->num_excl_copies > PAGE_SIZE) {
+ ret = -EINVAL;
+ goto free_inherit;
+ }
+
+ nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
+ 2 * inherit->num_excl_copies;
+ if (vol_args->size != struct_size(inherit, qgroups, nums)) {
+ ret = -EINVAL;
+ goto free_inherit;
+ }
}
ret = __btrfs_ioctl_snap_create(file, vol_args->name, vol_args->fd,
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From df7b59ba9245c4a3115ebaa905e3e5719a3810da Mon Sep 17 00:00:00 2001
From: Milan Broz <gmazyland(a)gmail.com>
Date: Tue, 23 Feb 2021 21:21:21 +0100
Subject: [PATCH] dm verity: fix FEC for RS roots unaligned to block size
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Optional Forward Error Correction (FEC) code in dm-verity uses
Reed-Solomon code and should support roots from 2 to 24.
The error correction parity bytes (of roots lengths per RS block) are
stored on a separate device in sequence without any padding.
Currently, to access the FEC device, the dm-verity-fec code uses a
dm-bufio client with the block size set to the verity data block size
(usually 4096 or 512 bytes).
Because this block size is not divisible by some (most!) of the
supported roots lengths, data repair cannot work for partially stored
parity bytes.
This fix changes the FEC device dm-bufio block size to "roots << SECTOR_SHIFT"
where we can be sure that the full parity data is always available.
(There cannot be partial FEC blocks because parity must cover whole
sectors.)
Because the optional FEC starting offset could be unaligned to this
new block size, we have to use dm_bufio_set_sector_offset() to
configure it.
The problem is easily reproduced using veritysetup, e.g. for roots=13:
# create verity device with RS FEC
dd if=/dev/urandom of=data.img bs=4096 count=8 status=none
veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash
# create an erasure that should be always repairable with this roots setting
dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none
# try to read it through dm-verity
veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash)
dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer
# wait for possible recursive recovery in kernel
udevadm settle
veritysetup close test
With this fix, errors are properly repaired.
device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors
...
Without it, the FEC code usually ends with an unrecoverable failure in the RS decoder:
device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74
...
This problem is present in all kernels since the FEC code's
introduction (kernel 4.5).
It is thought that this problem is not visible in the Android ecosystem
because it always uses the default RS roots=2.
Depends-on: a14e5ec66a7a ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size")
Signed-off-by: Milan Broz <gmazyland(a)gmail.com>
Tested-by: Jérôme Carretero <cJ-ko(a)zougloub.eu>
Reviewed-by: Sami Tolvanen <samitolvanen(a)google.com>
Cc: stable(a)vger.kernel.org # 4.5+
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
diff --git a/drivers/md/dm-verity-fec.c b/drivers/md/dm-verity-fec.c
index fb41b4f23c48..66f4c6398f67 100644
--- a/drivers/md/dm-verity-fec.c
+++ b/drivers/md/dm-verity-fec.c
@@ -61,19 +61,18 @@ static int fec_decode_rs8(struct dm_verity *v, struct dm_verity_fec_io *fio,
static u8 *fec_read_parity(struct dm_verity *v, u64 rsb, int index,
unsigned *offset, struct dm_buffer **buf)
{
- u64 position, block;
+ u64 position, block, rem;
u8 *res;
position = (index + rsb) * v->fec->roots;
- block = position >> v->data_dev_block_bits;
- *offset = (unsigned)(position - (block << v->data_dev_block_bits));
+ block = div64_u64_rem(position, v->fec->roots << SECTOR_SHIFT, &rem);
+ *offset = (unsigned)rem;
- res = dm_bufio_read(v->fec->bufio, v->fec->start + block, buf);
+ res = dm_bufio_read(v->fec->bufio, block, buf);
if (IS_ERR(res)) {
DMERR("%s: FEC %llu: parity read failed (block %llu): %ld",
v->data_dev->name, (unsigned long long)rsb,
- (unsigned long long)(v->fec->start + block),
- PTR_ERR(res));
+ (unsigned long long)block, PTR_ERR(res));
*buf = NULL;
}
@@ -155,7 +154,7 @@ static int fec_decode_bufs(struct dm_verity *v, struct dm_verity_fec_io *fio,
/* read the next block when we run out of parity bytes */
offset += v->fec->roots;
- if (offset >= 1 << v->data_dev_block_bits) {
+ if (offset >= v->fec->roots << SECTOR_SHIFT) {
dm_bufio_release(buf);
par = fec_read_parity(v, rsb, block_offset, &offset, &buf);
@@ -674,7 +673,7 @@ int verity_fec_ctr(struct dm_verity *v)
{
struct dm_verity_fec *f = v->fec;
struct dm_target *ti = v->ti;
- u64 hash_blocks;
+ u64 hash_blocks, fec_blocks;
int ret;
if (!verity_fec_is_enabled(v)) {
@@ -744,15 +743,17 @@ int verity_fec_ctr(struct dm_verity *v)
}
f->bufio = dm_bufio_client_create(f->dev->bdev,
- 1 << v->data_dev_block_bits,
+ f->roots << SECTOR_SHIFT,
1, 0, NULL, NULL);
if (IS_ERR(f->bufio)) {
ti->error = "Cannot initialize FEC bufio client";
return PTR_ERR(f->bufio);
}
- if (dm_bufio_get_device_size(f->bufio) <
- ((f->start + f->rounds * f->roots) >> v->data_dev_block_bits)) {
+ dm_bufio_set_sector_offset(f->bufio, f->start << (v->data_dev_block_bits - SECTOR_SHIFT));
+
+ fec_blocks = div64_u64(f->rounds * f->roots, v->fec->roots << SECTOR_SHIFT);
+ if (dm_bufio_get_device_size(f->bufio) < fec_blocks) {
ti->error = "FEC device is too small";
return -E2BIG;
}
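To see why the new block size works: parity for RS block (index + rsb)
starts at byte position (index + rsb) * roots, and with a bufio block
size of roots * 512 bytes every block holds a whole number of parity
groups. A standalone sketch of the mapping (illustrative only; the
kernel uses div64_u64_rem()):
#include <stdint.h>
#include <stdio.h>
#define SECTOR_SIZE 512
/* Map a parity byte position onto the FEC device using the
 * patched bufio block size of roots * SECTOR_SIZE bytes. The
 * block size is a multiple of roots, so an RS parity group of
 * roots bytes never straddles a block boundary. */
static void parity_location(uint64_t index, uint64_t rsb, unsigned roots,
			    uint64_t *block, unsigned *offset)
{
	uint64_t position = (index + rsb) * roots;
	uint64_t blksz = (uint64_t)roots * SECTOR_SIZE;
	*block = position / blksz;	/* div64_u64_rem() in the kernel */
	*offset = (unsigned)(position % blksz);
}
int main(void)
{
	uint64_t block;
	unsigned offset;
	parity_location(3, 100, 13, &block, &offset);
	printf("block %llu, offset %u\n", (unsigned long long)block, offset);
	return 0;
}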
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From df7b59ba9245c4a3115ebaa905e3e5719a3810da Mon Sep 17 00:00:00 2001
From: Milan Broz <gmazyland(a)gmail.com>
Date: Tue, 23 Feb 2021 21:21:21 +0100
Subject: [PATCH] dm verity: fix FEC for RS roots unaligned to block size
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Optional Forward Error Correction (FEC) code in dm-verity uses
Reed-Solomon code and should support roots from 2 to 24.
The error correction parity bytes (of roots lengths per RS block) are
stored on a separate device in sequence without any padding.
Currently, to access the FEC device, the dm-verity-fec code uses a
dm-bufio client with the block size set to the verity data block size
(usually 4096 or 512 bytes).
Because this block size is not divisible by some (most!) of the
supported roots lengths, data repair cannot work for partially stored
parity bytes.
This fix changes the FEC device dm-bufio block size to "roots << SECTOR_SHIFT"
where we can be sure that the full parity data is always available.
(There cannot be partial FEC blocks because parity must cover whole
sectors.)
Because the optional FEC starting offset could be unaligned to this
new block size, we have to use dm_bufio_set_sector_offset() to
configure it.
The problem is easily reproduced using veritysetup, e.g. for roots=13:
# create verity device with RS FEC
dd if=/dev/urandom of=data.img bs=4096 count=8 status=none
veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash
# create an erasure that should be always repairable with this roots setting
dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none
# try to read it through dm-verity
veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash)
dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer
# wait for possible recursive recovery in kernel
udevadm settle
veritysetup close test
With this fix, errors are properly repaired.
device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors
...
Without it, the FEC code usually ends with an unrecoverable failure in the RS decoder:
device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74
...
This problem is present in all kernels since the FEC code's
introduction (kernel 4.5).
It is thought that this problem is not visible in the Android ecosystem
because it always uses the default RS roots=2.
Depends-on: a14e5ec66a7a ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size")
Signed-off-by: Milan Broz <gmazyland(a)gmail.com>
Tested-by: Jérôme Carretero <cJ-ko(a)zougloub.eu>
Reviewed-by: Sami Tolvanen <samitolvanen(a)google.com>
Cc: stable(a)vger.kernel.org # 4.5+
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
diff --git a/drivers/md/dm-verity-fec.c b/drivers/md/dm-verity-fec.c
index fb41b4f23c48..66f4c6398f67 100644
--- a/drivers/md/dm-verity-fec.c
+++ b/drivers/md/dm-verity-fec.c
@@ -61,19 +61,18 @@ static int fec_decode_rs8(struct dm_verity *v, struct dm_verity_fec_io *fio,
static u8 *fec_read_parity(struct dm_verity *v, u64 rsb, int index,
unsigned *offset, struct dm_buffer **buf)
{
- u64 position, block;
+ u64 position, block, rem;
u8 *res;
position = (index + rsb) * v->fec->roots;
- block = position >> v->data_dev_block_bits;
- *offset = (unsigned)(position - (block << v->data_dev_block_bits));
+ block = div64_u64_rem(position, v->fec->roots << SECTOR_SHIFT, &rem);
+ *offset = (unsigned)rem;
- res = dm_bufio_read(v->fec->bufio, v->fec->start + block, buf);
+ res = dm_bufio_read(v->fec->bufio, block, buf);
if (IS_ERR(res)) {
DMERR("%s: FEC %llu: parity read failed (block %llu): %ld",
v->data_dev->name, (unsigned long long)rsb,
- (unsigned long long)(v->fec->start + block),
- PTR_ERR(res));
+ (unsigned long long)block, PTR_ERR(res));
*buf = NULL;
}
@@ -155,7 +154,7 @@ static int fec_decode_bufs(struct dm_verity *v, struct dm_verity_fec_io *fio,
/* read the next block when we run out of parity bytes */
offset += v->fec->roots;
- if (offset >= 1 << v->data_dev_block_bits) {
+ if (offset >= v->fec->roots << SECTOR_SHIFT) {
dm_bufio_release(buf);
par = fec_read_parity(v, rsb, block_offset, &offset, &buf);
@@ -674,7 +673,7 @@ int verity_fec_ctr(struct dm_verity *v)
{
struct dm_verity_fec *f = v->fec;
struct dm_target *ti = v->ti;
- u64 hash_blocks;
+ u64 hash_blocks, fec_blocks;
int ret;
if (!verity_fec_is_enabled(v)) {
@@ -744,15 +743,17 @@ int verity_fec_ctr(struct dm_verity *v)
}
f->bufio = dm_bufio_client_create(f->dev->bdev,
- 1 << v->data_dev_block_bits,
+ f->roots << SECTOR_SHIFT,
1, 0, NULL, NULL);
if (IS_ERR(f->bufio)) {
ti->error = "Cannot initialize FEC bufio client";
return PTR_ERR(f->bufio);
}
- if (dm_bufio_get_device_size(f->bufio) <
- ((f->start + f->rounds * f->roots) >> v->data_dev_block_bits)) {
+ dm_bufio_set_sector_offset(f->bufio, f->start << (v->data_dev_block_bits - SECTOR_SHIFT));
+
+ fec_blocks = div64_u64(f->rounds * f->roots, v->fec->roots << SECTOR_SHIFT);
+ if (dm_bufio_get_device_size(f->bufio) < fec_blocks) {
ti->error = "FEC device is too small";
return -E2BIG;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a14e5ec66a7a66e57b24e2469f9212a78460207e Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Tue, 23 Feb 2021 21:21:20 +0100
Subject: [PATCH] dm bufio: subtract the number of initial sectors in
dm_bufio_get_device_size
dm_bufio_get_device_size returns the device size in blocks. Before
returning the value, we must subtract the number of starting
sectors. The number of starting sectors may not be divisible by the
block size.
Note that currently, no target is using dm_bufio_set_sector_offset and
dm_bufio_get_device_size simultaneously, so this change has no effect.
However, an upcoming dm-verity-fec fix needs this change.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Reviewed-by: Milan Broz <gmazyland(a)gmail.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index fce4cbf9529d..50f3e673729c 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1526,6 +1526,10 @@ EXPORT_SYMBOL_GPL(dm_bufio_get_block_size);
sector_t dm_bufio_get_device_size(struct dm_bufio_client *c)
{
sector_t s = i_size_read(c->bdev->bd_inode) >> SECTOR_SHIFT;
+ if (s >= c->start)
+ s -= c->start;
+ else
+ s = 0;
if (likely(c->sectors_per_block_bits >= 0))
s >>= c->sectors_per_block_bits;
else
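The dependency is the small clamp-and-subtract above; a standalone
sketch of the corrected computation (division standing in for the
kernel's shift fast path, names illustrative):
#include <stdint.h>
#include <stdio.h>
/* Corrected size computation: drop the initial (possibly
 * block-unaligned) sectors first, clamping at zero, then
 * convert the remainder to whole blocks. */
static uint64_t device_size_blocks(uint64_t dev_sectors,
				   uint64_t start_sectors,
				   unsigned sectors_per_block)
{
	uint64_t s = dev_sectors;
	if (s >= start_sectors)
		s -= start_sectors;
	else
		s = 0;
	return s / sectors_per_block;
}
int main(void)
{
	/* 1000-sector device, 5 leading sectors, 8 sectors per block */
	printf("%llu\n",
	       (unsigned long long)device_size_blocks(1000, 5, 8));
	return 0;
}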
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a14e5ec66a7a66e57b24e2469f9212a78460207e Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Tue, 23 Feb 2021 21:21:20 +0100
Subject: [PATCH] dm bufio: subtract the number of initial sectors in
dm_bufio_get_device_size
dm_bufio_get_device_size returns the device size in blocks. Before
returning the value, we must subtract the number of starting
sectors. The number of starting sectors may not be divisible by the
block size.
Note that currently, no target is using dm_bufio_set_sector_offset and
dm_bufio_get_device_size simultaneously, so this change has no effect.
However, an upcoming dm-verity-fec fix needs this change.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Reviewed-by: Milan Broz <gmazyland(a)gmail.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index fce4cbf9529d..50f3e673729c 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1526,6 +1526,10 @@ EXPORT_SYMBOL_GPL(dm_bufio_get_block_size);
sector_t dm_bufio_get_device_size(struct dm_bufio_client *c)
{
sector_t s = i_size_read(c->bdev->bd_inode) >> SECTOR_SHIFT;
+ if (s >= c->start)
+ s -= c->start;
+ else
+ s = 0;
if (likely(c->sectors_per_block_bits >= 0))
s >>= c->sectors_per_block_bits;
else
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b5b0ecb736f1ce1e68eb50613c0cfecff10198eb Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 4 Mar 2021 21:02:58 -0700
Subject: [PATCH] io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED return
The callback can only be armed if we get -EIOCBQUEUED returned. It's
important that we clear the WAITQ bit for the other cases; otherwise we
can queue for async retry, and filemap will assume that we're armed and
return -EAGAIN instead of just blocking for the IO.
Cc: stable(a)vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 044170165402..5762750c666c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3286,6 +3286,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
if (ret == -EIOCBQUEUED)
return 0;
/* we got some bytes, but not all. retry. */
+ kiocb->ki_flags &= ~IOCB_WAITQ;
} while (ret > 0 && ret < io_size);
done:
kiocb_done(kiocb, ret, issue_flags);
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b5b0ecb736f1ce1e68eb50613c0cfecff10198eb Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 4 Mar 2021 21:02:58 -0700
Subject: [PATCH] io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED return
The callback can only be armed if we get -EIOCBQUEUED returned. It's
important that we clear the WAITQ bit for the other cases; otherwise we
can queue for async retry, and filemap will assume that we're armed and
return -EAGAIN instead of just blocking for the IO.
Cc: stable(a)vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 044170165402..5762750c666c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3286,6 +3286,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
if (ret == -EIOCBQUEUED)
return 0;
/* we got some bytes, but not all. retry. */
+ kiocb->ki_flags &= ~IOCB_WAITQ;
} while (ret > 0 && ret < io_size);
done:
kiocb_done(kiocb, ret, issue_flags);
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ca0a26511c679a797f86589894a4523db36d833e Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 4 Mar 2021 17:15:48 -0700
Subject: [PATCH] io_uring: don't keep looping for more events if we can't
flush overflow
It doesn't make sense to wait for more events to come in if we can't
even flush the overflow we already have to the ring. Return -EBUSY for
that condition, just like we do for attempts to submit with overflow
pending.
Cc: stable(a)vger.kernel.org # 5.11
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 76e243c056b5..044170165402 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1451,18 +1451,22 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force,
return all_flushed;
}
-static void io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force,
+static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force,
struct task_struct *tsk,
struct files_struct *files)
{
+ bool ret = true;
+
if (test_bit(0, &ctx->cq_check_overflow)) {
/* iopoll syncs against uring_lock, not completion_lock */
if (ctx->flags & IORING_SETUP_IOPOLL)
mutex_lock(&ctx->uring_lock);
- __io_cqring_overflow_flush(ctx, force, tsk, files);
+ ret = __io_cqring_overflow_flush(ctx, force, tsk, files);
if (ctx->flags & IORING_SETUP_IOPOLL)
mutex_unlock(&ctx->uring_lock);
}
+
+ return ret;
}
static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
@@ -6883,11 +6887,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
trace_io_uring_cqring_wait(ctx, min_events);
do {
- io_cqring_overflow_flush(ctx, false, NULL, NULL);
+ /* if we can't even flush overflow, don't wait for more */
+ if (!io_cqring_overflow_flush(ctx, false, NULL, NULL)) {
+ ret = -EBUSY;
+ break;
+ }
prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
TASK_INTERRUPTIBLE);
ret = io_cqring_wait_schedule(ctx, &iowq, &timeout);
finish_wait(&ctx->wait, &iowq.wq);
+ cond_resched();
} while (ret > 0);
restore_saved_sigmask_unless(ret == -EINTR);
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3e6a0d3c7571ce3ed0d25c5c32543a54a7ebcd75 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Mon, 1 Mar 2021 13:56:00 -0700
Subject: [PATCH] io_uring: fix -EAGAIN retry with IOPOLL
We no longer revert the iovec on -EIOCBQUEUED (see commit ab2125df921d),
and this started causing issues for IOPOLL on devices that run out of
request slots. It turns out that outside of needing a revert for those,
we also had a bug where we didn't properly set up retry inside the
submission path. That could cause re-import of the iovec, if any, and
that could lead to spurious results if the application had it allocated
on the stack.
Catch -EAGAIN retry and make the iovec stable for IOPOLL, just like we do
for !IOPOLL retries.
Cc: <stable(a)vger.kernel.org> # 5.9+
Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Reported-by: Xiaoguang Wang <xiaoguang.wang(a)linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f060dcc1cc86..361befaae28f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2423,23 +2423,32 @@ static bool io_resubmit_prep(struct io_kiocb *req)
return false;
return !io_setup_async_rw(req, iovec, inline_vecs, &iter, false);
}
-#endif
-static bool io_rw_reissue(struct io_kiocb *req)
+static bool io_rw_should_reissue(struct io_kiocb *req)
{
-#ifdef CONFIG_BLOCK
umode_t mode = file_inode(req->file)->i_mode;
+ struct io_ring_ctx *ctx = req->ctx;
if (!S_ISBLK(mode) && !S_ISREG(mode))
return false;
- if ((req->flags & REQ_F_NOWAIT) || io_wq_current_is_worker())
+ if ((req->flags & REQ_F_NOWAIT) || (io_wq_current_is_worker() &&
+ !(ctx->flags & IORING_SETUP_IOPOLL)))
return false;
/*
* If ref is dying, we might be running poll reap from the exit work.
* Don't attempt to reissue from that path, just let it fail with
* -EAGAIN.
*/
- if (percpu_ref_is_dying(&req->ctx->refs))
+ if (percpu_ref_is_dying(&ctx->refs))
+ return false;
+ return true;
+}
+#endif
+
+static bool io_rw_reissue(struct io_kiocb *req)
+{
+#ifdef CONFIG_BLOCK
+ if (!io_rw_should_reissue(req))
return false;
lockdep_assert_held(&req->ctx->uring_lock);
@@ -2482,6 +2491,19 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
{
struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
+#ifdef CONFIG_BLOCK
+ /* Rewind iter, if we have one. iopoll path resubmits as usual */
+ if (res == -EAGAIN && io_rw_should_reissue(req)) {
+ struct io_async_rw *rw = req->async_data;
+
+ if (rw)
+ iov_iter_revert(&rw->iter,
+ req->result - iov_iter_count(&rw->iter));
+ else if (!io_resubmit_prep(req))
+ res = -EIO;
+ }
+#endif
+
if (kiocb->ki_flags & IOCB_WRITE)
kiocb_end_write(req);
@@ -3230,6 +3252,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
ret = io_iter_do_read(req, iter);
if (ret == -EIOCBQUEUED) {
+ if (req->async_data)
+ iov_iter_revert(iter, io_size - iov_iter_count(iter));
goto out_free;
} else if (ret == -EAGAIN) {
/* IOPOLL retry should happen for io-wq threads */
@@ -3361,6 +3385,8 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
/* no retry on NONBLOCK nor RWF_NOWAIT */
if (ret2 == -EAGAIN && (req->flags & REQ_F_NOWAIT))
goto done;
+ if (ret2 == -EIOCBQUEUED && req->async_data)
+ iov_iter_revert(iter, io_size - iov_iter_count(iter));
if (!force_nonblock || ret2 != -EAGAIN) {
/* IOPOLL retry should happen for io-wq threads */
if ((req->ctx->flags & IORING_SETUP_IOPOLL) && ret2 == -EAGAIN)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3e6a0d3c7571ce3ed0d25c5c32543a54a7ebcd75 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Mon, 1 Mar 2021 13:56:00 -0700
Subject: [PATCH] io_uring: fix -EAGAIN retry with IOPOLL
We no longer revert the iovec on -EIOCBQUEUED (see commit ab2125df921d),
and this started causing issues for IOPOLL on devices that run out of
request slots. It turns out that outside of needing a revert for those,
we also had a bug where we didn't properly set up retry inside the
submission path. That could cause re-import of the iovec, if any, and
that could lead to spurious results if the application had it allocated
on the stack.
Catch -EAGAIN retry and make the iovec stable for IOPOLL, just like we do
for !IOPOLL retries.
Cc: <stable(a)vger.kernel.org> # 5.9+
Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Reported-by: Xiaoguang Wang <xiaoguang.wang(a)linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f060dcc1cc86..361befaae28f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2423,23 +2423,32 @@ static bool io_resubmit_prep(struct io_kiocb *req)
return false;
return !io_setup_async_rw(req, iovec, inline_vecs, &iter, false);
}
-#endif
-static bool io_rw_reissue(struct io_kiocb *req)
+static bool io_rw_should_reissue(struct io_kiocb *req)
{
-#ifdef CONFIG_BLOCK
umode_t mode = file_inode(req->file)->i_mode;
+ struct io_ring_ctx *ctx = req->ctx;
if (!S_ISBLK(mode) && !S_ISREG(mode))
return false;
- if ((req->flags & REQ_F_NOWAIT) || io_wq_current_is_worker())
+ if ((req->flags & REQ_F_NOWAIT) || (io_wq_current_is_worker() &&
+ !(ctx->flags & IORING_SETUP_IOPOLL)))
return false;
/*
* If ref is dying, we might be running poll reap from the exit work.
* Don't attempt to reissue from that path, just let it fail with
* -EAGAIN.
*/
- if (percpu_ref_is_dying(&req->ctx->refs))
+ if (percpu_ref_is_dying(&ctx->refs))
+ return false;
+ return true;
+}
+#endif
+
+static bool io_rw_reissue(struct io_kiocb *req)
+{
+#ifdef CONFIG_BLOCK
+ if (!io_rw_should_reissue(req))
return false;
lockdep_assert_held(&req->ctx->uring_lock);
@@ -2482,6 +2491,19 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
{
struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
+#ifdef CONFIG_BLOCK
+ /* Rewind iter, if we have one. iopoll path resubmits as usual */
+ if (res == -EAGAIN && io_rw_should_reissue(req)) {
+ struct io_async_rw *rw = req->async_data;
+
+ if (rw)
+ iov_iter_revert(&rw->iter,
+ req->result - iov_iter_count(&rw->iter));
+ else if (!io_resubmit_prep(req))
+ res = -EIO;
+ }
+#endif
+
if (kiocb->ki_flags & IOCB_WRITE)
kiocb_end_write(req);
@@ -3230,6 +3252,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
ret = io_iter_do_read(req, iter);
if (ret == -EIOCBQUEUED) {
+ if (req->async_data)
+ iov_iter_revert(iter, io_size - iov_iter_count(iter));
goto out_free;
} else if (ret == -EAGAIN) {
/* IOPOLL retry should happen for io-wq threads */
@@ -3361,6 +3385,8 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
/* no retry on NONBLOCK nor RWF_NOWAIT */
if (ret2 == -EAGAIN && (req->flags & REQ_F_NOWAIT))
goto done;
+ if (ret2 == -EIOCBQUEUED && req->async_data)
+ iov_iter_revert(iter, io_size - iov_iter_count(iter));
if (!force_nonblock || ret2 != -EAGAIN) {
/* IOPOLL retry should happen for io-wq threads */
if ((req->ctx->flags & IORING_SETUP_IOPOLL) && ret2 == -EAGAIN)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4d14c5cde5c268a2bc26addecf09489cb953ef64 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Mon, 22 Feb 2021 18:40:44 +0200
Subject: [PATCH] btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is
deadlock-prone. In the past, multiple commits:
* ae5e070eaca9 ("btrfs: qgroup: don't try to wait flushing if we're
already holding a transaction")
* 6f23277a49e6 ("btrfs: qgroup: don't commit transaction when we already
hold the handle")
tried to solve various aspects of this, but it was always a
whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
can call btrfs_dirty_inode as a result of reading a file and modifying
its atime:
PID: 6963 TASK: ffff8c7f3f94c000 CPU: 2 COMMAND: "test"
#0 __schedule at ffffffffa529e07d
#1 schedule at ffffffffa529e4ff
#2 schedule_timeout at ffffffffa52a1bdd
#3 wait_for_completion at ffffffffa529eeea <-- sleeps with delayed node mutex held
#4 start_delalloc_inodes at ffffffffc0380db5
#5 btrfs_start_delalloc_snapshot at ffffffffc0393836
#6 try_flush_qgroup at ffffffffc03f04b2
#7 __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6 <-- tries to reserve space and starts delalloc inodes.
#8 btrfs_delayed_update_inode at ffffffffc03e31aa <-- acquires delayed node mutex
#9 btrfs_update_inode at ffffffffc0385ba8
#10 btrfs_dirty_inode at ffffffffc038627b <-- TRANSACTION OPENED
#11 touch_atime at ffffffffa4cf0000
#12 generic_file_read_iter at ffffffffa4c1f123
#13 new_sync_read at ffffffffa4ccdc8a
#14 vfs_read at ffffffffa4cd0849
#15 ksys_read at ffffffffa4cd0bd1
#16 do_syscall_64 at ffffffffa4a052eb
#17 entry_SYSCALL_64_after_hwframe at ffffffffa540008c
This will cause an asynchronous work to flush the delalloc inodes to
happen which can try to acquire the same delayed_node mutex:
PID: 455 TASK: ffff8c8085fa4000 CPU: 5 COMMAND: "kworker/u16:30"
#0 __schedule at ffffffffa529e07d
#1 schedule at ffffffffa529e4ff
#2 schedule_preempt_disabled at ffffffffa529e80a
#3 __mutex_lock at ffffffffa529fdcb <-- goes to sleep, never wakes up.
#4 btrfs_delayed_update_inode at ffffffffc03e3143 <-- tries to acquire the mutex
#5 btrfs_update_inode at ffffffffc0385ba8 <-- this is the same inode that pid 6963 is holding
#6 cow_file_range_inline.constprop.78 at ffffffffc0386be7
#7 cow_file_range at ffffffffc03879c1
#8 btrfs_run_delalloc_range at ffffffffc038894c
#9 writepage_delalloc at ffffffffc03a3c8f
#10 __extent_writepage at ffffffffc03a4c01
#11 extent_write_cache_pages at ffffffffc03a500b
#12 extent_writepages at ffffffffc03a6de2
#13 do_writepages at ffffffffa4c277eb
#14 __filemap_fdatawrite_range at ffffffffa4c1e5bb
#15 btrfs_run_delalloc_work at ffffffffc0380987 <-- starts running delayed nodes
#16 normal_work_helper at ffffffffc03b706c
#17 process_one_work at ffffffffa4aba4e4
#18 worker_thread at ffffffffa4aba6fd
#19 kthread at ffffffffa4ac0a3d
#20 ret_from_fork at ffffffffa54001ff
To fully address those cases, the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves that by calling qgroup_reserve_meta directly, which will
either succeed without flushing or fail and return -EDQUOT. In the
latter case that return value is propagated to btrfs_dirty_inode, which
will fall back to starting a new transaction. That's fine, as the
majority of the time we expect the inode to have the
BTRFS_DELAYED_NODE_INODE_DIRTY flag set, which results in directly
copying the in-memory state.
Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable(a)vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index ac9966e76a2f..bf25401c9768 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -627,7 +627,8 @@ static int btrfs_delayed_inode_reserve_metadata(
*/
if (!src_rsv || (!trans->bytes_reserved &&
src_rsv->type != BTRFS_BLOCK_RSV_DELALLOC)) {
- ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true);
+ ret = btrfs_qgroup_reserve_meta(root, num_bytes,
+ BTRFS_QGROUP_RSV_META_PREALLOC, true);
if (ret < 0)
return ret;
ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f2f1e932751..c35b724a5611 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6081,7 +6081,7 @@ static int btrfs_dirty_inode(struct inode *inode)
return PTR_ERR(trans);
ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
- if (ret && ret == -ENOSPC) {
+ if (ret && (ret == -ENOSPC || ret == -EDQUOT)) {
/* whoops, lets try again with the full transaction */
btrfs_end_transaction(trans);
trans = btrfs_start_transaction(root, 1);
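A standalone sketch of the resulting pattern, with stand-in helpers
(not the btrfs internals): reserve without flushing while locks are
held, and let -EDQUOT reach the caller, which retries with a full
transaction where flushing is safe.
#include <errno.h>
#include <stdio.h>
/* Stand-in for qgroup_reserve_meta(..., PREALLOC, true): never
 * flushes, just fails with -EDQUOT when the quota is exhausted. */
static int reserve_no_flush(long *quota, long bytes)
{
	if (*quota < bytes)
		return -EDQUOT;
	*quota -= bytes;
	return 0;
}
/* Shape of the fixed btrfs_dirty_inode() fallback: the fast path
 * now also retries on -EDQUOT with a full transaction, where
 * flushing is safe because no delayed node lock is held. */
static int dirty_inode(long *quota, long bytes)
{
	int ret = reserve_no_flush(quota, bytes);
	if (ret == -ENOSPC || ret == -EDQUOT)
		return 0;	/* full-transaction retry succeeds here */
	return ret;
}
int main(void)
{
	long quota = 0;
	printf("%d\n", dirty_inode(&quota, 4096));	/* 0: fell back */
	return 0;
}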
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5011c5a663b9c6d6aff3d394f11049b371199627 Mon Sep 17 00:00:00 2001
From: Dan Carpenter <dancarpenter(a)oracle.com>
Date: Wed, 17 Feb 2021 09:04:34 +0300
Subject: [PATCH] btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
The problem is that we're copying "inherit" from user space but we
don't necessarily know that we're copying enough data for the 64-byte
struct. The next problem is that 'inherit' has a variable-size array
at the end, and we have to verify that the array is the size we
expected.
Fixes: 6f72c7e20dba ("Btrfs: add qgroup inheritance")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a8c60d46d19c..1b837c08ca90 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1935,7 +1935,10 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file,
if (vol_args->flags & BTRFS_SUBVOL_RDONLY)
readonly = true;
if (vol_args->flags & BTRFS_SUBVOL_QGROUP_INHERIT) {
- if (vol_args->size > PAGE_SIZE) {
+ u64 nums;
+
+ if (vol_args->size < sizeof(*inherit) ||
+ vol_args->size > PAGE_SIZE) {
ret = -EINVAL;
goto free_args;
}
@@ -1944,6 +1947,20 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file,
ret = PTR_ERR(inherit);
goto free_args;
}
+
+ if (inherit->num_qgroups > PAGE_SIZE ||
+ inherit->num_ref_copies > PAGE_SIZE ||
+ inherit->num_excl_copies > PAGE_SIZE) {
+ ret = -EINVAL;
+ goto free_inherit;
+ }
+
+ nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
+ 2 * inherit->num_excl_copies;
+ if (vol_args->size != struct_size(inherit, qgroups, nums)) {
+ ret = -EINVAL;
+ goto free_inherit;
+ }
}
ret = __btrfs_ioctl_snap_create(file, vol_args->name, vol_args->fd,
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3c17916510428dbccdf657de050c34e208347089 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Mon, 8 Feb 2021 10:26:54 +0200
Subject: [PATCH] btrfs: fix race between extent freeing/allocation when using
bitmaps
During allocation the allocator will try to allocate an extent using
the cluster policy. Once the current cluster is exhausted it will
remove the entry under btrfs_free_cluster::lock and subsequently
acquire btrfs_free_space_ctl::tree_lock to dispose of the
already-deleted entry and adjust btrfs_free_space_ctl::total_bitmaps.
This poses a problem because there exists a race condition between
removing the entry under one lock and doing the necessary accounting
while holding a different lock, since extent freeing only uses the 2nd
lock. This can result in the following situation:
T1:                                  T2:
btrfs_alloc_from_cluster             insert_into_bitmap <holds tree_lock>
 if (entry->bytes == 0)               if (block_group && !list_empty(&block_group->cluster_list)) {
    rb_erase(entry)
 spin_unlock(&cluster->lock);
   (total_bitmaps is still 4)         spin_lock(&cluster->lock);
                                      <doesn't find entry in cluster->root>
 spin_lock(&ctl->tree_lock);          <goes to new_bitmap label, adds
 <blocked since T2 holds tree_lock>   <a new entry and calls add_new_bitmap>
                                       recalculate_thresholds <crashes,
                                       due to total_bitmaps becoming 5
                                       and triggering an ASSERT>
To fix this, ensure that once depleted, the cluster entry is deleted
while both the cluster lock and the tree lock are held in the allocator
(T1). This ensures that even if there is a race with a concurrent
insert_into_bitmap call, it will correctly find the entry in the
cluster and add the new space to it.
CC: <stable(a)vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5400294bd271..abcf951e6b44 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3125,8 +3125,6 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
entry->bytes -= bytes;
}
- if (entry->bytes == 0)
- rb_erase(&entry->offset_index, &cluster->root);
break;
}
out:
@@ -3143,7 +3141,10 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
ctl->free_space -= bytes;
if (!entry->bitmap && !btrfs_free_space_trimmed(entry))
ctl->discardable_bytes[BTRFS_STAT_CURR] -= bytes;
+
+ spin_lock(&cluster->lock);
if (entry->bytes == 0) {
+ rb_erase(&entry->offset_index, &cluster->root);
ctl->free_extents--;
if (entry->bitmap) {
kmem_cache_free(btrfs_free_space_bitmap_cachep,
@@ -3156,6 +3157,7 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
kmem_cache_free(btrfs_free_space_cachep, entry);
}
+ spin_unlock(&cluster->lock);
spin_unlock(&ctl->tree_lock);
return ret;
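A standalone sketch of the corrected locking shape (pthread mutexes
standing in for the kernel spinlocks, simplified to the two-lock
section; compile with -pthread): the depleted entry is erased and the
accounting adjusted only while both locks are held.
#include <pthread.h>
#include <stdio.h>
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cluster_lock = PTHREAD_MUTEX_INITIALIZER;
static int entry_bytes = 64;
static int total_bitmaps = 4;
/* Shape of the fixed allocator path: the depleted entry is erased
 * and the accounting adjusted only while BOTH locks are held, so a
 * concurrent insert_into_bitmap() (which takes tree_lock) either
 * still finds the entry or sees consistent accounting. */
static void alloc_from_cluster(int bytes)
{
	pthread_mutex_lock(&tree_lock);
	pthread_mutex_lock(&cluster_lock);
	entry_bytes -= bytes;
	if (entry_bytes == 0) {
		/* rb_erase(&entry->offset_index, &cluster->root); */
		total_bitmaps--;
	}
	pthread_mutex_unlock(&cluster_lock);
	pthread_mutex_unlock(&tree_lock);
}
int main(void)
{
	alloc_from_cluster(64);
	printf("total_bitmaps now %d\n", total_bitmaps);
	return 0;
}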
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3c17916510428dbccdf657de050c34e208347089 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Mon, 8 Feb 2021 10:26:54 +0200
Subject: [PATCH] btrfs: fix race between extent freeing/allocation when using
bitmaps
During allocation the allocator will try to allocate an extent using
the cluster policy. Once the current cluster is exhausted it will
remove the entry under btrfs_free_cluster::lock and subsequently
acquire btrfs_free_space_ctl::tree_lock to dispose of the
already-deleted entry and adjust btrfs_free_space_ctl::total_bitmaps.
This poses a problem because there exists a race condition between
removing the entry under one lock and doing the necessary accounting
while holding a different lock, since extent freeing only uses the 2nd
lock. This can result in the following situation:
T1:                                  T2:
btrfs_alloc_from_cluster             insert_into_bitmap <holds tree_lock>
 if (entry->bytes == 0)               if (block_group && !list_empty(&block_group->cluster_list)) {
    rb_erase(entry)
 spin_unlock(&cluster->lock);
   (total_bitmaps is still 4)         spin_lock(&cluster->lock);
                                      <doesn't find entry in cluster->root>
 spin_lock(&ctl->tree_lock);          <goes to new_bitmap label, adds
 <blocked since T2 holds tree_lock>   <a new entry and calls add_new_bitmap>
                                       recalculate_thresholds <crashes,
                                       due to total_bitmaps becoming 5
                                       and triggering an ASSERT>
To fix this, ensure that once depleted, the cluster entry is deleted
while both the cluster lock and the tree lock are held in the allocator
(T1). This ensures that even if there is a race with a concurrent
insert_into_bitmap call, it will correctly find the entry in the
cluster and add the new space to it.
CC: <stable(a)vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5400294bd271..abcf951e6b44 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3125,8 +3125,6 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
entry->bytes -= bytes;
}
- if (entry->bytes == 0)
- rb_erase(&entry->offset_index, &cluster->root);
break;
}
out:
@@ -3143,7 +3141,10 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
ctl->free_space -= bytes;
if (!entry->bitmap && !btrfs_free_space_trimmed(entry))
ctl->discardable_bytes[BTRFS_STAT_CURR] -= bytes;
+
+ spin_lock(&cluster->lock);
if (entry->bytes == 0) {
+ rb_erase(&entry->offset_index, &cluster->root);
ctl->free_extents--;
if (entry->bitmap) {
kmem_cache_free(btrfs_free_space_bitmap_cachep,
@@ -3156,6 +3157,7 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
kmem_cache_free(btrfs_free_space_cachep, entry);
}
+ spin_unlock(&cluster->lock);
spin_unlock(&ctl->tree_lock);
return ret;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3c17916510428dbccdf657de050c34e208347089 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Mon, 8 Feb 2021 10:26:54 +0200
Subject: [PATCH] btrfs: fix race between extent freeing/allocation when using
bitmaps
During allocation the allocator will try to allocate an extent using
the cluster policy. Once the current cluster is exhausted it will
remove the entry under btrfs_free_cluster::lock and subsequently
acquire btrfs_free_space_ctl::tree_lock to dispose of the
already-deleted entry and adjust btrfs_free_space_ctl::total_bitmaps.
This poses a problem because there exists a race condition between
removing the entry under one lock and doing the necessary accounting
while holding a different lock, since extent freeing only uses the 2nd
lock. This can result in the following situation:
T1:                                  T2:
btrfs_alloc_from_cluster             insert_into_bitmap <holds tree_lock>
 if (entry->bytes == 0)               if (block_group && !list_empty(&block_group->cluster_list)) {
    rb_erase(entry)
 spin_unlock(&cluster->lock);
   (total_bitmaps is still 4)         spin_lock(&cluster->lock);
                                      <doesn't find entry in cluster->root>
 spin_lock(&ctl->tree_lock);          <goes to new_bitmap label, adds
 <blocked since T2 holds tree_lock>   <a new entry and calls add_new_bitmap>
                                       recalculate_thresholds <crashes,
                                       due to total_bitmaps becoming 5
                                       and triggering an ASSERT>
To fix this, ensure that once depleted, the cluster entry is deleted
while both the cluster lock and the tree lock are held in the allocator
(T1). This ensures that even if there is a race with a concurrent
insert_into_bitmap call, it will correctly find the entry in the
cluster and add the new space to it.
CC: <stable(a)vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5400294bd271..abcf951e6b44 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3125,8 +3125,6 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
entry->bytes -= bytes;
}
- if (entry->bytes == 0)
- rb_erase(&entry->offset_index, &cluster->root);
break;
}
out:
@@ -3143,7 +3141,10 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
ctl->free_space -= bytes;
if (!entry->bitmap && !btrfs_free_space_trimmed(entry))
ctl->discardable_bytes[BTRFS_STAT_CURR] -= bytes;
+
+ spin_lock(&cluster->lock);
if (entry->bytes == 0) {
+ rb_erase(&entry->offset_index, &cluster->root);
ctl->free_extents--;
if (entry->bitmap) {
kmem_cache_free(btrfs_free_space_bitmap_cachep,
@@ -3156,6 +3157,7 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
kmem_cache_free(btrfs_free_space_cachep, entry);
}
+ spin_unlock(&cluster->lock);
spin_unlock(&ctl->tree_lock);
return ret;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3c17916510428dbccdf657de050c34e208347089 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Mon, 8 Feb 2021 10:26:54 +0200
Subject: [PATCH] btrfs: fix race between extent freeing/allocation when using
bitmaps
During allocation the allocator will try to allocate an extent using
the cluster policy. Once the current cluster is exhausted it will
remove the entry under btrfs_free_cluster::lock and subsequently
acquire btrfs_free_space_ctl::tree_lock to dispose of the
already-deleted entry and adjust btrfs_free_space_ctl::total_bitmaps.
This poses a problem because there exists a race condition between
removing the entry under one lock and doing the necessary accounting
while holding a different lock, since extent freeing only uses the 2nd
lock. This can result in the following situation:
T1:                                  T2:
btrfs_alloc_from_cluster             insert_into_bitmap <holds tree_lock>
 if (entry->bytes == 0)               if (block_group && !list_empty(&block_group->cluster_list)) {
    rb_erase(entry)
 spin_unlock(&cluster->lock);
   (total_bitmaps is still 4)         spin_lock(&cluster->lock);
                                      <doesn't find entry in cluster->root>
 spin_lock(&ctl->tree_lock);          <goes to new_bitmap label, adds
 <blocked since T2 holds tree_lock>   <a new entry and calls add_new_bitmap>
                                       recalculate_thresholds <crashes,
                                       due to total_bitmaps becoming 5
                                       and triggering an ASSERT>
To fix this, ensure that once depleted, the cluster entry is deleted
while both the cluster lock and the tree lock are held in the allocator
(T1). This ensures that even if there is a race with a concurrent
insert_into_bitmap call, it will correctly find the entry in the
cluster and add the new space to it.
CC: <stable(a)vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5400294bd271..abcf951e6b44 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3125,8 +3125,6 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
entry->bytes -= bytes;
}
- if (entry->bytes == 0)
- rb_erase(&entry->offset_index, &cluster->root);
break;
}
out:
@@ -3143,7 +3141,10 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
ctl->free_space -= bytes;
if (!entry->bitmap && !btrfs_free_space_trimmed(entry))
ctl->discardable_bytes[BTRFS_STAT_CURR] -= bytes;
+
+ spin_lock(&cluster->lock);
if (entry->bytes == 0) {
+ rb_erase(&entry->offset_index, &cluster->root);
ctl->free_extents--;
if (entry->bitmap) {
kmem_cache_free(btrfs_free_space_bitmap_cachep,
@@ -3156,6 +3157,7 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
kmem_cache_free(btrfs_free_space_cachep, entry);
}
+ spin_unlock(&cluster->lock);
spin_unlock(&ctl->tree_lock);
return ret;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3c17916510428dbccdf657de050c34e208347089 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Mon, 8 Feb 2021 10:26:54 +0200
Subject: [PATCH] btrfs: fix race between extent freeing/allocation when using
bitmaps
During allocation the allocator will try to allocate an extent using
the cluster policy. Once the current cluster is exhausted it will
remove the entry under btrfs_free_cluster::lock and subsequently
acquire btrfs_free_space_ctl::tree_lock to dispose of the
already-deleted entry and adjust btrfs_free_space_ctl::total_bitmaps.
This poses a problem because there exists a race condition between
removing the entry under one lock and doing the necessary accounting
while holding a different lock, since extent freeing only uses the 2nd
lock. This can result in the following situation:
T1:                                  T2:
btrfs_alloc_from_cluster             insert_into_bitmap <holds tree_lock>
 if (entry->bytes == 0)               if (block_group && !list_empty(&block_group->cluster_list)) {
    rb_erase(entry)
 spin_unlock(&cluster->lock);
   (total_bitmaps is still 4)         spin_lock(&cluster->lock);
                                      <doesn't find entry in cluster->root>
 spin_lock(&ctl->tree_lock);          <goes to new_bitmap label, adds
 <blocked since T2 holds tree_lock>   <a new entry and calls add_new_bitmap>
                                       recalculate_thresholds <crashes,
                                       due to total_bitmaps becoming 5
                                       and triggering an ASSERT>
To fix this, ensure that once depleted, the cluster entry is deleted
while both the cluster lock and the tree lock are held in the allocator
(T1). This ensures that even if there is a race with a concurrent
insert_into_bitmap call, it will correctly find the entry in the
cluster and add the new space to it.
CC: <stable(a)vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 5400294bd271..abcf951e6b44 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3125,8 +3125,6 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
entry->bytes -= bytes;
}
- if (entry->bytes == 0)
- rb_erase(&entry->offset_index, &cluster->root);
break;
}
out:
@@ -3143,7 +3141,10 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
ctl->free_space -= bytes;
if (!entry->bitmap && !btrfs_free_space_trimmed(entry))
ctl->discardable_bytes[BTRFS_STAT_CURR] -= bytes;
+
+ spin_lock(&cluster->lock);
if (entry->bytes == 0) {
+ rb_erase(&entry->offset_index, &cluster->root);
ctl->free_extents--;
if (entry->bitmap) {
kmem_cache_free(btrfs_free_space_bitmap_cachep,
@@ -3156,6 +3157,7 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
kmem_cache_free(btrfs_free_space_cachep, entry);
}
+ spin_unlock(&cluster->lock);
spin_unlock(&ctl->tree_lock);
return ret;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 1119a72e223f3073a604f8fccb3a470ccd8a4416 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Tue, 16 Feb 2021 15:43:22 -0500
Subject: [PATCH] btrfs: tree-checker: do not error out if extent ref hash
doesn't match
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The tree checker checks the extent ref hash at read and write time to
make sure we do not corrupt the file system. Generally extent
references go inline, but if we have enough of them we need to make an
item, which looks like
key.objectid = <bytenr>
key.type = <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
key.offset = hash(tree, owner, offset)
However, if key.offset collides with an unrelated extent reference,
we'll simply key.offset++ until we get something that doesn't collide.
Obviously this doesn't match at tree-checker time, and thus we error
out while writing out the transaction. This is relatively easy to
reproduce; simply do something like the following:
xfs_io -f -c "pwrite 0 1M" file
offset=2
for i in {0..10000}
do
xfs_io -c "reflink file 0 ${offset}M 1M" file
offset=$(( offset + 2 ))
done
xfs_io -c "reflink file 0 17999258914816 1M" file
xfs_io -c "reflink file 0 35998517829632 1M" file
xfs_io -c "reflink file 0 53752752058368 1M" file
btrfs filesystem sync
And the sync will error out because we'll abort the transaction. The
magic values above are used because they generate hash collisions with
the first file in the main subvol.
The fix for this is to remove the hash value check from the tree checker,
as we have no idea which offset ours should belong to.
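For reference, the insert-time collision handling that invalidates the
checker's assumption can be sketched as follows (illustrative only,
modeled on the probing done when inserting extent data refs; helper
names mirror the real ones but details are simplified):

	key.objectid = bytenr;
	key.type = BTRFS_EXTENT_DATA_REF_KEY;
	key.offset = hash_extent_data_ref(root_objectid, owner, offset);

	ret = btrfs_insert_empty_item(trans, root, path, &key, size);
	while (ret == -EEXIST) {
		/*
		 * An unrelated ref already owns this key.offset, so probe
		 * the next offset.  After a collision the stored
		 * key.offset no longer equals the hash, which is exactly
		 * what the removed tree-checker test used to reject.
		 */
		key.offset++;
		btrfs_release_path(path);
		ret = btrfs_insert_empty_item(trans, root, path, &key, size);
	}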
Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi(a)gmail.com>
Fixes: 0785a9aacf9d ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ add comment]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 582061c7b547..f4ade821307d 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -1453,22 +1453,14 @@ static int check_extent_data_ref(struct extent_buffer *leaf,
return -EUCLEAN;
}
for (; ptr < end; ptr += sizeof(*dref)) {
- u64 root_objectid;
- u64 owner;
u64 offset;
- u64 hash;
+ /*
+ * We cannot check the extent_data_ref hash due to possible
+ * overflow from the leaf due to hash collisions.
+ */
dref = (struct btrfs_extent_data_ref *)ptr;
- root_objectid = btrfs_extent_data_ref_root(leaf, dref);
- owner = btrfs_extent_data_ref_objectid(leaf, dref);
offset = btrfs_extent_data_ref_offset(leaf, dref);
- hash = hash_extent_data_ref(root_objectid, owner, offset);
- if (unlikely(hash != key->offset)) {
- extent_err(leaf, slot,
- "invalid extent data ref hash, item has 0x%016llx key has 0x%016llx",
- hash, key->offset);
- return -EUCLEAN;
- }
if (unlikely(!IS_ALIGNED(offset, leaf->fs_info->sectorsize))) {
extent_err(leaf, slot,
"invalid extent data backref offset, have %llu expect aligned to %u",
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1119a72e223f3073a604f8fccb3a470ccd8a4416 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Tue, 16 Feb 2021 15:43:22 -0500
Subject: [PATCH] btrfs: tree-checker: do not error out if extent ref hash
doesn't match
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The tree checker checks the extent ref hash at read and write time to
make sure we do not corrupt the file system. Generally extent
references go inline, but if we have enough of them we need to make an
item, which looks like
key.objectid = <bytenr>
key.type = <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
key.offset = hash(tree, owner, offset)
However, if key.offset collides with an unrelated extent reference, we
simply do key.offset++ until we get something that doesn't collide.
Obviously the hash then no longer matches at tree-checker time, and thus
we error out while writing out the transaction. This is relatively easy
to reproduce; simply do something like the following:
xfs_io -f -c "pwrite 0 1M" file
offset=2
for i in {0..10000}
do
xfs_io -c "reflink file 0 ${offset}M 1M" file
offset=$(( offset + 2 ))
done
xfs_io -c "reflink file 0 17999258914816 1M" file
xfs_io -c "reflink file 0 35998517829632 1M" file
xfs_io -c "reflink file 0 53752752058368 1M" file
btrfs filesystem sync
And the sync will error out because we'll abort the transaction. The
magic values above are used because they generate hash collisions with
the first file in the main subvol.
The fix for this is to remove the hash value check from the tree checker,
as we have no idea which offset ours should belong to.
Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi(a)gmail.com>
Fixes: 0785a9aacf9d ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ add comment]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 582061c7b547..f4ade821307d 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -1453,22 +1453,14 @@ static int check_extent_data_ref(struct extent_buffer *leaf,
return -EUCLEAN;
}
for (; ptr < end; ptr += sizeof(*dref)) {
- u64 root_objectid;
- u64 owner;
u64 offset;
- u64 hash;
+ /*
+ * We cannot check the extent_data_ref hash due to possible
+ * overflow from the leaf due to hash collisions.
+ */
dref = (struct btrfs_extent_data_ref *)ptr;
- root_objectid = btrfs_extent_data_ref_root(leaf, dref);
- owner = btrfs_extent_data_ref_objectid(leaf, dref);
offset = btrfs_extent_data_ref_offset(leaf, dref);
- hash = hash_extent_data_ref(root_objectid, owner, offset);
- if (unlikely(hash != key->offset)) {
- extent_err(leaf, slot,
- "invalid extent data ref hash, item has 0x%016llx key has 0x%016llx",
- hash, key->offset);
- return -EUCLEAN;
- }
if (unlikely(!IS_ALIGNED(offset, leaf->fs_info->sectorsize))) {
extent_err(leaf, slot,
"invalid extent data backref offset, have %llu expect aligned to %u",
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3660d0bcdb82807d434da9d2e57d88b37331182d Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 16 Feb 2021 11:09:25 +0000
Subject: [PATCH] btrfs: fix stale data exposure after cloning a hole with
NO_HOLES enabled
When using the NO_HOLES feature, if we clone a file range that spans only
a hole into a range that is at or beyond the current i_size of the
destination file, we end up not setting the full sync runtime flag on the
inode. As a result, if we then fsync the destination file and have a power
failure, after log replay we can end up exposing stale data instead of
having a hole for that range.
The conditions for this to happen are the following:
1) We have a file with a size of, for example, 1280K;
2) There is a written (non-prealloc) extent for the file range from 1024K
to 1280K with a length of 256K;
3) This particular file extent layout is durably persisted, so that the
existing superblock persisted on disk points to a subvolume root where
the file has that exact file extent layout and state;
4) The file is truncated to a smaller size, to an offset lower than the
start offset of its last extent, for example to 800K. The truncate sets
the full sync runtime flag on the inode;
5) Fsync the file to log it and clear the full sync runtime flag;
6) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
into the file with a destination offset that starts at or beyond the
256K file extent item we had - for example to offset 1024K;
7) Since the clone operation does not find extents in the source range,
we end up in the if branch at the bottom of btrfs_clone() where we
punch a hole for the file range starting at offset 1024K by calling
btrfs_replace_file_extents(). There we end up not setting the full
sync flag on the inode, because we don't know we are being called in
a clone context (and not fallocate's punch hole operation), and
neither do we create an extent map to represent a hole because the
requested range is beyond eof;
8) A further fsync to the file will be a fast fsync, since the clone
operation did not set the full sync flag, and therefore it relies on
modified extent maps to correctly log the file layout. But since
it does not find any extent map marking the range from 1024K (the
previous eof) to the new eof, it does not log a file extent item
for that range representing the hole;
9) After a power failure no hole for the range starting at 1024K is
punched and we end up exposing stale data from the old 256K extent.
Turning this into exact steps:
$ mkfs.btrfs -f -O no-holes /dev/sdi
$ mount /dev/sdi /mnt
# Create our test file with 3 extents of 256K and a 256K hole at offset
# 256K. The file has a size of 1280K.
$ xfs_io -f -s \
-c "pwrite -S 0xab -b 256K 0 256K" \
-c "pwrite -S 0xcd -b 256K 512K 256K" \
-c "pwrite -S 0xef -b 256K 768K 256K" \
-c "pwrite -S 0x73 -b 256K 1024K 256K" \
/mnt/foobar
# Make sure it's durably persisted. We want the last committed super
# block to point to this particular file extent layout.
$ sync
# Now truncate our file to a smaller size, falling within a position of
# the second extent. This sets the full sync runtime flag on the inode.
# Then fsync the file to log it and clear the full sync flag from the
# inode. The third extent is no longer part of the file and therefore
# it is not logged.
$ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
# Now do a clone operation that only clones the hole and sets back the
# file size to match the size it had before the truncate operation
# (1280K).
$ xfs_io \
-c "reflink /mnt/foobar 256K 1024K 256K" \
-c "fsync" \
/mnt/foobar
# File data before power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1310720
<power fail>
# Mount the fs again to replay the log tree.
$ mount /dev/sdi /mnt
# File data after power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
*
1310720
The range from 1024K to 1280K should correspond to a hole but instead it
points to stale data, to the 256K extent that should not exist after the
truncate operation.
The issue does not exist when not using NO_HOLES, because in that case
we use file extent items to represent holes; these are found and copied
during the loop that iterates over extents at btrfs_clone(), and that
causes btrfs_replace_file_extents() to be called with a non-NULL
extent_info argument, which in turn sets the full sync runtime flag on
the inode.
So fix this by making the code that deals with a trailing hole during
cloning, at btrfs_clone(), set the full sync flag on the inode if the
range starts at or beyond the current i_size.
A test case for fstests will follow soon.
Backporting notes: for kernel 5.4 the change goes to ioctl.c into
btrfs_clone before the last call to btrfs_punch_hole_range.
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index b24396cf2f99..5413578d2c32 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -553,6 +553,24 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
*/
btrfs_release_path(path);
+ /*
+ * When using NO_HOLES and we are cloning a range that covers
+ * only a hole (no extents) into a range beyond the current
+ * i_size, punching a hole in the target range will not create
+ * an extent map defining a hole, because the range starts at or
+ * beyond current i_size. If the file previously had an i_size
+ * greater than the new i_size set by this clone operation, we
+ * need to make sure the next fsync is a full fsync, so that it
+ * detects and logs a hole covering a range from the current
+ * i_size to the new i_size. If the clone range covers extents,
+ * besides a hole, then we know the full sync flag was already
+ * set by previous calls to btrfs_replace_file_extents() that
+ * replaced file extent items.
+ */
+ if (last_dest_end >= i_size_read(inode))
+ set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
+ &BTRFS_I(inode)->runtime_flags);
+
ret = btrfs_replace_file_extents(inode, path, last_dest_end,
destoff + len - 1, NULL, &trans);
if (ret)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From dd0734f2a866f9d619d4abf97c3d71bcdee40ea9 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 5 Feb 2021 12:55:38 +0000
Subject: [PATCH] btrfs: fix race between swap file activation and snapshot
creation
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.
However this is racy: while a task is activating a swap file, another
task might already have started snapshot creation and seen the counter
for the number of swap files as zero. This means that after the swap
file is activated we may end up with a successfully created snapshot of
the same root, and therefore when the first write to the swap file
happens it has to fall back into COW mode, which should never happen for
active swap files.
Basically what can happen is:
1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
There it sees that root->nr_swapfiles has a value of 0 so it continues;
2) Task B enters btrfs_swap_activate(). It is not aware that another task
started snapshot creation but it did not finish yet. It increments
root->nr_swapfiles from 0 to 1;
3) Task B checks that the file meets all requirements to be an active
swap file - it has NOCOW set, there are no snapshots for the inode's
root at the moment, no file holes, no reflinked extents, etc;
4) Task B returns success and now the file is an active swap file;
5) Task A commits the transaction to create the snapshot and finishes.
The swap file's extents are now shared between the original root and
the snapshot;
6) A write into an extent of the swap file is attempted - there is a
snapshot of the file's root, so we fall back to COW mode and therefore
the physical location of the extent changes on disk.
So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.
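Schematically, the resulting ordering in btrfs_swap_activate() is (a
condensed sketch of the diff below; extent walking omitted):

	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_SWAP_ACTIVATE))
		return -EBUSY;

	/* Block snapshot creation before advertising the swap file. */
	if (!btrfs_drew_try_write_lock(&root->snapshot_lock)) {
		btrfs_exclop_finish(fs_info);
		return -EINVAL;
	}

	/* Only now bump the counter and validate/pin the extents. */
	atomic_inc(&root->nr_swapfiles);

	/* walk extents, lock the extent range, pin block groups */

	btrfs_drew_write_unlock(&root->snapshot_lock);
	btrfs_exclop_finish(fs_info);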
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 30358a2e2bc0..4f2f1e932751 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10298,7 +10298,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
sector_t *span)
{
struct inode *inode = file_inode(file);
- struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct btrfs_fs_info *fs_info = root->fs_info;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct extent_state *cached_state = NULL;
struct extent_map *em = NULL;
@@ -10349,13 +10350,27 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
"cannot activate swapfile while exclusive operation is running");
return -EBUSY;
}
+
+ /*
+ * Prevent snapshot creation while we are activating the swap file.
+ * We do not want to race with snapshot creation. If snapshot creation
+ * already started before we bumped nr_swapfiles from 0 to 1 and
+ * completes before the first write into the swap file after it is
+ * activated, then that write would fall back to COW.
+ */
+ if (!btrfs_drew_try_write_lock(&root->snapshot_lock)) {
+ btrfs_exclop_finish(fs_info);
+ btrfs_warn(fs_info,
+ "cannot activate swapfile because snapshot creation is in progress");
+ return -EINVAL;
+ }
/*
* Snapshots can create extents which require COW even if NODATACOW is
* set. We use this counter to prevent snapshots. We must increment it
* before walking the extents because we don't want a concurrent
* snapshot to run after we've already checked the extents.
*/
- atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
+ atomic_inc(&root->nr_swapfiles);
isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
@@ -10501,6 +10516,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
if (ret)
btrfs_swap_deactivate(file);
+ btrfs_drew_write_unlock(&root->snapshot_lock);
+
btrfs_exclop_finish(fs_info);
if (ret)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 195a49eaf655eb914896c92cecd96bc863c9feb3 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 5 Feb 2021 12:55:37 +0000
Subject: [PATCH] btrfs: fix race between writes to swap files and scrub
When we activate a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents from being changed by operations such as balance and device
replace/resize/remove. There we also call can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO; if it is, we cannot use the extent, since a write into it
would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents being
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO mode before processing
it and back to RW mode afterwards. That means an attempt to write into a
swap file extent while scrub is processing the respective block group
will result in COWing the extent, changing its physical location on
disk.
Fix this by making sure that block groups with extents used by active
swap files cannot be turned into RO mode, therefore making it impossible
for a scrub to turn them into RO mode. When a scrub finds a block group
that cannot be turned RO due to the existence of extents used by swap
files, it proceeds to the next block group and logs a warning message
mentioning that the block group was skipped due to active swap files -
this is the same approach we currently use for balance.
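The core invariant is small enough to quote (condensed from the diff
below): a block group may be RO or referenced by an active swap file,
but never both, and whichever side takes the block group's lock first
wins.

	/* swapon side: fails if the block group is already RO */
	spin_lock(&bg->lock);
	if (bg->ro)
		ret = false;
	else
		bg->swap_extents++;	/* inc_block_group_ro() will now fail */
	spin_unlock(&bg->lock);

	/* scrub side, in inc_block_group_ro(): fails if swap extents exist */
	spin_lock(&cache->lock);
	if (cache->swap_extents) {
		ret = -ETXTBSY;		/* scrub skips this block group */
		goto out;
	}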
Fixes: ed46ff3d42378 ("Btrfs: support swap files")
CC: stable(a)vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain(a)oracle.com>
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5064be59dac5..744b99ddc28c 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1162,6 +1162,11 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
spin_lock(&sinfo->lock);
spin_lock(&cache->lock);
+ if (cache->swap_extents) {
+ ret = -ETXTBSY;
+ goto out;
+ }
+
if (cache->ro) {
cache->ro++;
ret = 0;
@@ -2307,7 +2312,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
}
ret = inc_block_group_ro(cache, 0);
- if (!do_chunk_alloc)
+ if (!do_chunk_alloc || ret == -ETXTBSY)
goto unlock_out;
if (!ret)
goto out;
@@ -2316,6 +2321,8 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
if (ret < 0)
goto out;
ret = inc_block_group_ro(cache, 0);
+ if (ret == -ETXTBSY)
+ goto unlock_out;
out:
if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
@@ -3406,6 +3413,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
ASSERT(list_empty(&block_group->io_list));
ASSERT(list_empty(&block_group->bg_list));
ASSERT(refcount_read(&block_group->refs) == 1);
+ ASSERT(block_group->swap_extents == 0);
btrfs_put_block_group(block_group);
spin_lock(&info->block_group_cache_lock);
@@ -3472,3 +3480,26 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group)
__btrfs_remove_free_space_cache(block_group->free_space_ctl);
}
}
+
+bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
+{
+ bool ret = true;
+
+ spin_lock(&bg->lock);
+ if (bg->ro)
+ ret = false;
+ else
+ bg->swap_extents++;
+ spin_unlock(&bg->lock);
+
+ return ret;
+}
+
+void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount)
+{
+ spin_lock(&bg->lock);
+ ASSERT(!bg->ro);
+ ASSERT(bg->swap_extents >= amount);
+ bg->swap_extents -= amount;
+ spin_unlock(&bg->lock);
+}
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 29678426247d..3ecc3372a5ce 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -186,6 +186,12 @@ struct btrfs_block_group {
/* Flag indicating this block group is placed on a sequential zone */
bool seq_zone;
+ /*
+ * Number of extents in this block group used for swap files.
+ * All accesses protected by the spinlock 'lock'.
+ */
+ int swap_extents;
+
/* Record locked full stripes for RAID5/6 block group */
struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
@@ -312,4 +318,7 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
void btrfs_freeze_block_group(struct btrfs_block_group *cache);
void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
+bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg);
+void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount);
+
#endif /* BTRFS_BLOCK_GROUP_H */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3bc00aed13b2..40ec3393d2a1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -524,6 +524,11 @@ struct btrfs_swapfile_pin {
* points to a struct btrfs_device.
*/
bool is_block_group;
+ /*
+ * Only used when 'is_block_group' is true and it is the number of
+ * extents used by a swapfile for this block group ('ptr' field).
+ */
+ int bg_extent_count;
};
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d6f0e1ad3711..30358a2e2bc0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10192,6 +10192,7 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
sp->ptr = ptr;
sp->inode = inode;
sp->is_block_group = is_block_group;
+ sp->bg_extent_count = 1;
spin_lock(&fs_info->swapfile_pins_lock);
p = &fs_info->swapfile_pins.rb_node;
@@ -10205,6 +10206,8 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
(sp->ptr == entry->ptr && sp->inode > entry->inode)) {
p = &(*p)->rb_right;
} else {
+ if (is_block_group)
+ entry->bg_extent_count++;
spin_unlock(&fs_info->swapfile_pins_lock);
kfree(sp);
return 1;
@@ -10230,8 +10233,11 @@ static void btrfs_free_swapfile_pins(struct inode *inode)
sp = rb_entry(node, struct btrfs_swapfile_pin, node);
if (sp->inode == inode) {
rb_erase(&sp->node, &fs_info->swapfile_pins);
- if (sp->is_block_group)
+ if (sp->is_block_group) {
+ btrfs_dec_block_group_swap_extents(sp->ptr,
+ sp->bg_extent_count);
btrfs_put_block_group(sp->ptr);
+ }
kfree(sp);
}
node = next;
@@ -10446,6 +10452,17 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
goto out;
}
+ if (!btrfs_inc_block_group_swap_extents(bg)) {
+ btrfs_warn(fs_info,
+ "block group for swapfile at %llu is read-only%s",
+ bg->start,
+ atomic_read(&fs_info->scrubs_running) ?
+ " (scrub running)" : "");
+ btrfs_put_block_group(bg);
+ ret = -EINVAL;
+ goto out;
+ }
+
ret = btrfs_add_swapfile_pin(inode, bg, true);
if (ret) {
btrfs_put_block_group(bg);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 310fce00fcda..70a90ca2c8da 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3767,6 +3767,13 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
* commit_transactions.
*/
ro_set = 0;
+ } else if (ret == -ETXTBSY) {
+ btrfs_warn(fs_info,
+ "skipping scrub of block group %llu due to active swapfile",
+ cache->start);
+ scrub_pause_off(fs_info);
+ ret = 0;
+ goto skip_unfreeze;
} else {
btrfs_warn(fs_info,
"failed setting block group ro: %d", ret);
@@ -3862,7 +3869,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
} else {
spin_unlock(&cache->lock);
}
-
+skip_unfreeze:
btrfs_unfreeze_block_group(cache);
btrfs_put_block_group(cache);
if (ret)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 95c85fba1f64c3249c67f0078a29f8a125078189 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Mon, 25 Jan 2021 16:42:35 -0500
Subject: [PATCH] btrfs: avoid double put of block group when emptying cluster
It is wrong to call btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different from the block group the cluster represents, as in that case
the cluster does not hold a reference to the passed block group. This
results in a double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes: fa9c0d795f7b ("Btrfs: rework allocation clustering")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index abcf951e6b44..711a6a751ae9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2801,8 +2801,10 @@ static void __btrfs_return_cluster_to_free_space(
struct rb_node *node;
spin_lock(&cluster->lock);
- if (cluster->block_group != block_group)
- goto out;
+ if (cluster->block_group != block_group) {
+ spin_unlock(&cluster->lock);
+ return;
+ }
cluster->block_group = NULL;
cluster->window_start = 0;
@@ -2840,8 +2842,6 @@ static void __btrfs_return_cluster_to_free_space(
entry->offset, &entry->offset_index, bitmap);
}
cluster->root = RB_ROOT;
-
-out:
spin_unlock(&cluster->lock);
btrfs_put_block_group(block_group);
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 95c85fba1f64c3249c67f0078a29f8a125078189 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Mon, 25 Jan 2021 16:42:35 -0500
Subject: [PATCH] btrfs: avoid double put of block group when emptying cluster
It is wrong to call btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different from the block group the cluster represents, as in that case
the cluster does not hold a reference to the passed block group. This
results in a double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes: fa9c0d795f7b ("Btrfs: rework allocation clustering")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index abcf951e6b44..711a6a751ae9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2801,8 +2801,10 @@ static void __btrfs_return_cluster_to_free_space(
struct rb_node *node;
spin_lock(&cluster->lock);
- if (cluster->block_group != block_group)
- goto out;
+ if (cluster->block_group != block_group) {
+ spin_unlock(&cluster->lock);
+ return;
+ }
cluster->block_group = NULL;
cluster->window_start = 0;
@@ -2840,8 +2842,6 @@ static void __btrfs_return_cluster_to_free_space(
entry->offset, &entry->offset_index, bitmap);
}
cluster->root = RB_ROOT;
-
-out:
spin_unlock(&cluster->lock);
btrfs_put_block_group(block_group);
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 95c85fba1f64c3249c67f0078a29f8a125078189 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Mon, 25 Jan 2021 16:42:35 -0500
Subject: [PATCH] btrfs: avoid double put of block group when emptying cluster
It is wrong to call btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different from the block group the cluster represents, as in that case
the cluster does not hold a reference to the passed block group. This
results in a double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes: fa9c0d795f7b ("Btrfs: rework allocation clustering")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index abcf951e6b44..711a6a751ae9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2801,8 +2801,10 @@ static void __btrfs_return_cluster_to_free_space(
struct rb_node *node;
spin_lock(&cluster->lock);
- if (cluster->block_group != block_group)
- goto out;
+ if (cluster->block_group != block_group) {
+ spin_unlock(&cluster->lock);
+ return;
+ }
cluster->block_group = NULL;
cluster->window_start = 0;
@@ -2840,8 +2842,6 @@ static void __btrfs_return_cluster_to_free_space(
entry->offset, &entry->offset_index, bitmap);
}
cluster->root = RB_ROOT;
-
-out:
spin_unlock(&cluster->lock);
btrfs_put_block_group(block_group);
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 95c85fba1f64c3249c67f0078a29f8a125078189 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Mon, 25 Jan 2021 16:42:35 -0500
Subject: [PATCH] btrfs: avoid double put of block group when emptying cluster
It is wrong to call btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different from the block group the cluster represents, as in that case
the cluster does not hold a reference to the passed block group. This
results in a double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes: fa9c0d795f7b ("Btrfs: rework allocation clustering")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index abcf951e6b44..711a6a751ae9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2801,8 +2801,10 @@ static void __btrfs_return_cluster_to_free_space(
struct rb_node *node;
spin_lock(&cluster->lock);
- if (cluster->block_group != block_group)
- goto out;
+ if (cluster->block_group != block_group) {
+ spin_unlock(&cluster->lock);
+ return;
+ }
cluster->block_group = NULL;
cluster->window_start = 0;
@@ -2840,8 +2842,6 @@ static void __btrfs_return_cluster_to_free_space(
entry->offset, &entry->offset_index, bitmap);
}
cluster->root = RB_ROOT;
-
-out:
spin_unlock(&cluster->lock);
btrfs_put_block_group(block_group);
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 95c85fba1f64c3249c67f0078a29f8a125078189 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Mon, 25 Jan 2021 16:42:35 -0500
Subject: [PATCH] btrfs: avoid double put of block group when emptying cluster
It is wrong to call btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different from the block group the cluster represents, as in that case
the cluster does not hold a reference to the passed block group. This
results in a double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes: fa9c0d795f7b ("Btrfs: rework allocation clustering")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index abcf951e6b44..711a6a751ae9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2801,8 +2801,10 @@ static void __btrfs_return_cluster_to_free_space(
struct rb_node *node;
spin_lock(&cluster->lock);
- if (cluster->block_group != block_group)
- goto out;
+ if (cluster->block_group != block_group) {
+ spin_unlock(&cluster->lock);
+ return;
+ }
cluster->block_group = NULL;
cluster->window_start = 0;
@@ -2840,8 +2842,6 @@ static void __btrfs_return_cluster_to_free_space(
entry->offset, &entry->offset_index, bitmap);
}
cluster->root = RB_ROOT;
-
-out:
spin_unlock(&cluster->lock);
btrfs_put_block_group(block_group);
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d70cef0d46729808dc53f145372c02b145c92604 Mon Sep 17 00:00:00 2001
From: Ira Weiny <ira.weiny(a)intel.com>
Date: Wed, 27 Jan 2021 22:15:03 -0800
Subject: [PATCH] btrfs: fix raid6 qstripe kmap
When a qstripe is required an extra page is allocated and mapped. There
were 3 problems:
1) There is no corresponding call of kunmap() for the qstripe page.
2) There is no reason to map the qstripe page more than once if the
number of bits set in rbio->dbitmap is greater than one.
3) There is no reason to map the parity page and unmap it each time
through the loop.
The page memory can be reused by raid6_call.gen_syndrome() on each
iteration with a single mapping, without remapping. So map the page once
for the duration of the loop.
Similarly, improve the algorithm by mapping the parity page just once.
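The resulting kmap/kunmap pairing is (condensed from the diff below):

	/* Map P, and Q if present, exactly once for the whole loop. */
	pointers[nr_data] = kmap(p_page);
	if (has_qstripe)
		pointers[rbio->real_stripes - 1] = kmap(q_page);

	for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
		/* only the per-iteration data pages are mapped/unmapped here */
	}

	kunmap(p_page);
	if (q_page)
		kunmap(q_page);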
Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56")
CC: stable(a)vger.kernel.org # 4.4.x: c17af96554a8: btrfs: raid56: simplify tracking of Q stripe presence
CC: stable(a)vger.kernel.org # 4.4.x
Signed-off-by: Ira Weiny <ira.weiny(a)intel.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 5394641541f7..abf17fae0912 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2362,16 +2362,21 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
SetPageUptodate(p_page);
if (has_qstripe) {
+ /* RAID6, allocate and map temp space for the Q stripe */
q_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
if (!q_page) {
__free_page(p_page);
goto cleanup;
}
SetPageUptodate(q_page);
+ pointers[rbio->real_stripes - 1] = kmap(q_page);
}
atomic_set(&rbio->error, 0);
+ /* Map the parity stripe just once */
+ pointers[nr_data] = kmap(p_page);
+
for_each_set_bit(pagenr, rbio->dbitmap, rbio->stripe_npages) {
struct page *p;
void *parity;
@@ -2381,16 +2386,8 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
pointers[stripe] = kmap(p);
}
- /* then add the parity stripe */
- pointers[stripe++] = kmap(p_page);
-
if (has_qstripe) {
- /*
- * raid6, add the qstripe and call the
- * library function to fill in our p/q
- */
- pointers[stripe++] = kmap(q_page);
-
+ /* RAID6, call the library function to fill in our P/Q */
raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,
pointers);
} else {
@@ -2411,12 +2408,14 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
for (stripe = 0; stripe < nr_data; stripe++)
kunmap(page_in_rbio(rbio, stripe, pagenr, 0));
- kunmap(p_page);
}
+ kunmap(p_page);
__free_page(p_page);
- if (q_page)
+ if (q_page) {
+ kunmap(q_page);
__free_page(q_page);
+ }
writeback:
/*
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From c17af96554a8a8777cbb0fd53b8497250e548b43 Mon Sep 17 00:00:00 2001
From: David Sterba <dsterba(a)suse.com>
Date: Wed, 19 Feb 2020 15:17:20 +0100
Subject: [PATCH] btrfs: raid56: simplify tracking of Q stripe presence
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There are temporary variables tracking the index of P and Q stripes, but
none of them is really used as such, merely for determining if the Q
stripe is present. This leads to compiler warnings with
-Wunused-but-set-variable and has been reported several times.
fs/btrfs/raid56.c: In function ‘finish_rmw’:
fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
1199 | int p_stripe = -1;
| ^~~~~~~~
fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
2356 | int p_stripe = -1;
| ^~~~~~~~
Replace the two variables with one that has a clear meaning and also get
rid of the warnings. The logic that verifies that there are only 2
valid cases is unchanged.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index a8e53c8e7b01..406b1efd3ba5 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1196,22 +1196,19 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
int nr_data = rbio->nr_data;
int stripe;
int pagenr;
- int p_stripe = -1;
- int q_stripe = -1;
+ bool has_qstripe;
struct bio_list bio_list;
struct bio *bio;
int ret;
bio_list_init(&bio_list);
- if (rbio->real_stripes - rbio->nr_data == 1) {
- p_stripe = rbio->real_stripes - 1;
- } else if (rbio->real_stripes - rbio->nr_data == 2) {
- p_stripe = rbio->real_stripes - 2;
- q_stripe = rbio->real_stripes - 1;
- } else {
+ if (rbio->real_stripes - rbio->nr_data == 1)
+ has_qstripe = false;
+ else if (rbio->real_stripes - rbio->nr_data == 2)
+ has_qstripe = true;
+ else
BUG();
- }
/* at this point we either have a full stripe,
* or we've read the full stripe from the drive.
@@ -1255,7 +1252,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
SetPageUptodate(p);
pointers[stripe++] = kmap(p);
- if (q_stripe != -1) {
+ if (has_qstripe) {
/*
* raid6, add the qstripe and call the
@@ -2353,8 +2350,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
int nr_data = rbio->nr_data;
int stripe;
int pagenr;
- int p_stripe = -1;
- int q_stripe = -1;
+ bool has_qstripe;
struct page *p_page = NULL;
struct page *q_page = NULL;
struct bio_list bio_list;
@@ -2364,14 +2360,12 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
bio_list_init(&bio_list);
- if (rbio->real_stripes - rbio->nr_data == 1) {
- p_stripe = rbio->real_stripes - 1;
- } else if (rbio->real_stripes - rbio->nr_data == 2) {
- p_stripe = rbio->real_stripes - 2;
- q_stripe = rbio->real_stripes - 1;
- } else {
+ if (rbio->real_stripes - rbio->nr_data == 1)
+ has_qstripe = false;
+ else if (rbio->real_stripes - rbio->nr_data == 2)
+ has_qstripe = true;
+ else
BUG();
- }
if (bbio->num_tgtdevs && bbio->tgtdev_map[rbio->scrubp]) {
is_replace = 1;
@@ -2393,7 +2387,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
goto cleanup;
SetPageUptodate(p_page);
- if (q_stripe != -1) {
+ if (has_qstripe) {
q_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
if (!q_page) {
__free_page(p_page);
@@ -2416,8 +2410,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
/* then add the parity stripe */
pointers[stripe++] = kmap(p_page);
- if (q_stripe != -1) {
-
+ if (has_qstripe) {
/*
* raid6, add the qstripe and call the
* library function to fill in our p/q
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 44ac5958a6c1fd91ac8810fbb37194e377d78db5 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Thu, 25 Feb 2021 12:47:26 -0800
Subject: [PATCH] KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML
is enabled
Check that PML is actually enabled before setting the mask to force a
SPTE to be write-protected. The bits used for the !AD_ENABLED case are
in the upper half of the SPTE. With 64-bit paging and EPT, these bits
are ignored, but with 32-bit PAE paging they are reserved. Setting them
for L2 SPTEs without checking PML breaks NPT on 32-bit KVM.
Fixes: 1f4e5fc83a42 ("KVM: x86: fix nested guest live migration with PML")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20210225204749.1512652-2-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72b0928f2b2d..ec4fc28b325a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -81,15 +81,15 @@ static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
{
/*
- * When using the EPT page-modification log, the GPAs in the log
- * would come from L2 rather than L1. Therefore, we need to rely
- * on write protection to record dirty pages. This also bypasses
- * PML, since writes now result in a vmexit. Note, this helper will
- * tag SPTEs as needing write-protection even if PML is disabled or
- * unsupported, but that's ok because the tag is consumed if and only
- * if PML is enabled. Omit the PML check to save a few uops.
+ * When using the EPT page-modification log, the GPAs in the CPU dirty
+ * log would come from L2 rather than L1. Therefore, we need to rely
+ * on write protection to record dirty pages, which bypasses PML, since
+ * writes now result in a vmexit. Note, the check on CPU dirty logging
+ * being enabled is mandatory as the bits used to denote WP-only SPTEs
+ * are reserved for NPT w/ PAE (32-bit KVM).
*/
- return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
+ return vcpu->arch.mmu == &vcpu->arch.guest_mmu &&
+ kvm_x86_ops.cpu_dirty_log_size;
}
bool is_nx_huge_page_enabled(void);
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 44ac5958a6c1fd91ac8810fbb37194e377d78db5 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Thu, 25 Feb 2021 12:47:26 -0800
Subject: [PATCH] KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML
is enabled
Check that PML is actually enabled before setting the mask to force a
SPTE to be write-protected. The bits used for the !AD_ENABLED case are
in the upper half of the SPTE. With 64-bit paging and EPT, these bits
are ignored, but with 32-bit PAE paging they are reserved. Setting them
for L2 SPTEs without checking PML breaks NPT on 32-bit KVM.
Fixes: 1f4e5fc83a42 ("KVM: x86: fix nested guest live migration with PML")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20210225204749.1512652-2-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72b0928f2b2d..ec4fc28b325a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -81,15 +81,15 @@ static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
{
/*
- * When using the EPT page-modification log, the GPAs in the log
- * would come from L2 rather than L1. Therefore, we need to rely
- * on write protection to record dirty pages. This also bypasses
- * PML, since writes now result in a vmexit. Note, this helper will
- * tag SPTEs as needing write-protection even if PML is disabled or
- * unsupported, but that's ok because the tag is consumed if and only
- * if PML is enabled. Omit the PML check to save a few uops.
+ * When using the EPT page-modification log, the GPAs in the CPU dirty
+ * log would come from L2 rather than L1. Therefore, we need to rely
+ * on write protection to record dirty pages, which bypasses PML, since
+ * writes now result in a vmexit. Note, the check on CPU dirty logging
+ * being enabled is mandatory as the bits used to denote WP-only SPTEs
+ * are reserved for NPT w/ PAE (32-bit KVM).
*/
- return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
+ return vcpu->arch.mmu == &vcpu->arch.guest_mmu &&
+ kvm_x86_ops.cpu_dirty_log_size;
}
bool is_nx_huge_page_enabled(void);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 44ac5958a6c1fd91ac8810fbb37194e377d78db5 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Thu, 25 Feb 2021 12:47:26 -0800
Subject: [PATCH] KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML
is enabled
Check that PML is actually enabled before setting the mask to force a
SPTE to be write-protected. The bits used for the !AD_ENABLED case are
in the upper half of the SPTE. With 64-bit paging and EPT, these bits
are ignored, but with 32-bit PAE paging they are reserved. Setting them
for L2 SPTEs without checking PML breaks NPT on 32-bit KVM.
Fixes: 1f4e5fc83a42 ("KVM: x86: fix nested guest live migration with PML")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Message-Id: <20210225204749.1512652-2-seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72b0928f2b2d..ec4fc28b325a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -81,15 +81,15 @@ static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
{
/*
- * When using the EPT page-modification log, the GPAs in the log
- * would come from L2 rather than L1. Therefore, we need to rely
- * on write protection to record dirty pages. This also bypasses
- * PML, since writes now result in a vmexit. Note, this helper will
- * tag SPTEs as needing write-protection even if PML is disabled or
- * unsupported, but that's ok because the tag is consumed if and only
- * if PML is enabled. Omit the PML check to save a few uops.
+ * When using the EPT page-modification log, the GPAs in the CPU dirty
+ * log would come from L2 rather than L1. Therefore, we need to rely
+ * on write protection to record dirty pages, which bypasses PML, since
+ * writes now result in a vmexit. Note, the check on CPU dirty logging
+ * being enabled is mandatory as the bits used to denote WP-only SPTEs
+ * are reserved for NPT w/ PAE (32-bit KVM).
*/
- return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
+ return vcpu->arch.mmu == &vcpu->arch.guest_mmu &&
+ kvm_x86_ops.cpu_dirty_log_size;
}
bool is_nx_huge_page_enabled(void);
From: Thomas Schoebel-Theuer <tst(a)1und1.de>
This patch and problem analysis is specific for 4.4 LTS, due to incomplete
backporting of other fixes. Later LTS series have different backports.
Since v4.4.257, when CONFIG_PROVE_LOCKING=y, the following triggers
right after reboot of our pre-life systems, which mirror our production
setup:
Mar 03 11:27:33 icpu-test-bap10 kernel: =================================
Mar 03 11:27:33 icpu-test-bap10 kernel: [ INFO: inconsistent lock state ]
Mar 03 11:27:33 icpu-test-bap10 kernel: 4.4.259-rc1-grsec+ #730 Not tainted
Mar 03 11:27:33 icpu-test-bap10 kernel: ---------------------------------
Mar 03 11:27:33 icpu-test-bap10 kernel: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
Mar 03 11:27:33 icpu-test-bap10 kernel: apache2-ssl/9310 [HC0[0]:SC0[0]:HE1:SE1] takes:
Mar 03 11:27:33 icpu-test-bap10 kernel: (&p->pi_lock){?.-.-.}, at: [<ffffffff810abb68>] pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: {IN-HARDIRQ-W} state was registered at:
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088c4a>] __lock_acquire+0x3a7/0xe4a
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] lock_acquire+0x18d/0x1bc
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8170151c>] _raw_spin_lock_irqsave+0x3e/0x50
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810719a5>] try_to_wake_up+0x2c/0x210
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81071bf3>] default_wake_function+0xd/0xf
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083588>] autoremove_wake_function+0x11/0x35
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810830b2>] __wake_up_common+0x48/0x7c
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8108311a>] __wake_up+0x34/0x46
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814c2a23>] megasas_complete_int_cmd+0x31/0x33
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814c60a0>] megasas_complete_cmd+0x570/0x57b
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814d05bc>] complete_cmd_fusion+0x23e/0x33d
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814d0768>] megasas_isr_fusion+0x67/0x74
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81091ae5>] handle_irq_event_percpu+0x134/0x311
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81091cf5>] handle_irq_event+0x33/0x51
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810948b9>] handle_edge_irq+0xa3/0xc2
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81005f7b>] handle_irq+0xf9/0x101
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81005700>] do_IRQ+0x80/0xf5
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81702228>] ret_from_intr+0x0/0x20
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8100cab0>] arch_cpu_idle+0xa/0xc
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083a5a>] default_idle_call+0x1e/0x20
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083b9d>] cpu_startup_entry+0x141/0x22f
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff816fb853>] rest_init+0x135/0x13b
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5ce99>] start_kernel+0x3fa/0x40a
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5c2af>] x86_64_start_reservations+0x2a/0x2c
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5c3d0>] x86_64_start_kernel+0x11f/0x12c
Mar 03 11:27:33 icpu-test-bap10 kernel: irq event stamp: 1457
Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last enabled at (1457): [<ffffffff81042a69>] get_user_pages_fast+0xeb/0x14f
Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last disabled at (1456): [<ffffffff810429dd>] get_user_pages_fast+0x5f/0x14f
Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last enabled at (1446): [<ffffffff815e127d>] release_sock+0x142/0x14d
Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last disabled at (1444): [<ffffffff815e116f>] release_sock+0x34/0x14d
Mar 03 11:27:33 icpu-test-bap10 kernel:
other info that might help us debug this:
Mar 03 11:27:33 icpu-test-bap10 kernel: Possible unsafe locking scenario:
Mar 03 11:27:33 icpu-test-bap10 kernel: CPU0
Mar 03 11:27:33 icpu-test-bap10 kernel: ----
Mar 03 11:27:33 icpu-test-bap10 kernel: lock(&p->pi_lock);
Mar 03 11:27:33 icpu-test-bap10 kernel: <Interrupt>
Mar 03 11:27:33 icpu-test-bap10 kernel: lock(&p->pi_lock);
Mar 03 11:27:33 icpu-test-bap10 kernel:
*** DEADLOCK ***
Mar 03 11:27:33 icpu-test-bap10 kernel: 2 locks held by apache2-ssl/9310:
Mar 03 11:27:33 icpu-test-bap10 kernel: #0: (&(&(__futex_data.queues)[i].lock)->rlock){+.+...}, at: [<ffffffff810ae4e6>] do
Mar 03 11:27:33 icpu-test-bap10 kernel: #1: (&lock->wait_lock){+.+...}, at: [<ffffffff810ae53a>] do_futex+0x639/0x809
Mar 03 11:27:33 icpu-test-bap10 kernel:
stack backtrace:
Mar 03 11:27:33 icpu-test-bap10 kernel: CPU: 13 PID: 9310 UID: 99 Comm: apache2-ssl Not tainted 4.4.259-rc1-grsec+ #730
Mar 03 11:27:33 icpu-test-bap10 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.11.0 11/02/2019
Mar 03 11:27:33 icpu-test-bap10 kernel: 0000000000000000 ffff883fb79bfc00 ffffffff816f8fc2 ffff883ffa66d300
Mar 03 11:27:33 icpu-test-bap10 kernel: ffffffff8eaa71f0 ffff883fb79bfc50 ffffffff81088484 0000000000000000
Mar 03 11:27:33 icpu-test-bap10 kernel: 0000000000000001 0000000000000001 0000000000000002 ffff883ffa66db58
Mar 03 11:27:33 icpu-test-bap10 kernel: Call Trace:
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff816f8fc2>] dump_stack+0x94/0xca
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088484>] print_usage_bug+0x1bc/0x1d1
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81087d76>] ? check_usage_forwards+0x98/0x98
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810885a5>] mark_lock+0x10c/0x203
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088cb9>] __lock_acquire+0x416/0xe4a
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] lock_acquire+0x18d/0x1bc
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] ? lock_acquire+0x18d/0x1bc
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81700d12>] _raw_spin_lock+0x2a/0x39
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810ae5af>] do_futex+0x6ae/0x809
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810ae83d>] SyS_futex+0x133/0x143
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8100158a>] ? syscall_trace_enter_phase2+0x1a2/0x1bb
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81701848>] tracesys_phase2+0x90/0x95
Bisecting points to 47e452fcf2f in the above specific scenario using
apache-ssl, but apparently the missing *_irq() was introduced in
34c8e1c2c02.
However, just reverting the _irq() variants to a status similar to the
one before 34c8e1c2c02, or using _irqsave() / _irqrestore() as some
other backports do in various places, would not really help.
The fundamental problem is the following violation of the assertion
lockdep_assert_held(&pi_state->pi_mutex.wait_lock) in pi_state_update_owner():
Mar 03 12:50:03 icpu-test-bap10 kernel: ------------[ cut here ]------------
Mar 03 12:50:03 icpu-test-bap10 kernel: WARNING: CPU: 37 PID: 8488 at kernel/futex.c:844 pi_state_update_owner+0x3d/0xd7()
Mar 03 12:50:03 icpu-test-bap10 kernel: Modules linked in: xt_time xt_connlimit xt_connmark xt_NFLOG xt_limit xt_hashlimit veth ip_set_bitmap_port xt_DSCP xt_multiport ip_set_hash_ip xt_owner xt_set ip_set_hash_net xt_state xt_conntrack nf_conntrack_ftp mars lz4_decompress lz4_compress ipmi_devintf x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul hed ipmi_si ipmi_msghandler processor crc32c_intel ehci_pci ehci_hcd usbcore i40e usb_common
Mar 03 12:50:03 icpu-test-bap10 kernel: CPU: 37 PID: 8488 UID: 99 Comm: apache2-ssl Not tainted 4.4.259-rc1-grsec+ #737
Mar 03 12:50:03 icpu-test-bap10 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.11.0 11/02/2019
Mar 03 12:50:03 icpu-test-bap10 kernel: 0000000000000000 ffff883f863f7c70 ffffffff816f9002 0000000000000000
Mar 03 12:50:03 icpu-test-bap10 kernel: 0000000000000009 ffff883f863f7ca8 ffffffff8104cda2 ffffffff810abac7
Mar 03 12:50:03 icpu-test-bap10 kernel: ffff883ffbfe5e80 0000000000000000 ffff883f82ed4bc0 00007fc01c9bf000
Mar 03 12:50:03 icpu-test-bap10 kernel: Call Trace:
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff816f9002>] dump_stack+0x94/0xca
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8104cda2>] warn_slowpath_common+0x94/0xad
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abac7>] ? pi_state_update_owner+0x3d/0xd7
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8104ce5f>] warn_slowpath_null+0x15/0x17
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abac7>] pi_state_update_owner+0x3d/0xd7
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abea8>] free_pi_state+0x2d/0x73
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abf0b>] unqueue_me_pi+0x1d/0x31
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ad735>] futex_lock_pi+0x27a/0x2e8
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff81088bca>] ? __lock_acquire+0x327/0xe4a
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ae6a9>] do_futex+0x784/0x809
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810cfa9a>] ? seccomp_phase1+0xde/0x1e7
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810a4503>] ? current_kernel_time64+0xb/0x31
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810d23c3>] ? current_kernel_time+0xb/0xf
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ae861>] SyS_futex+0x133/0x143
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8100158a>] ? syscall_trace_enter_phase2+0x1a2/0x1bb
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff81701888>] tracesys_phase2+0x90/0x95
Mar 03 12:50:03 icpu-test-bap10 kernel: ---[ end trace 968f95a458dea951 ]---
In order to both (1) prevent the self-deadlock and (2) satisfy the
assertion in pi_state_update_owner(), some locking with irqs disabled
is needed, at least in this specific call stack.
Interestingly, similar locking existed just before
f08a4af5ccb.
This is just a quick hotfix, resurrecting some previous locks at the
old places, but now using ->wait_lock in place of the previous
->pi_lock (which was in place before f08a4af5ccb). The ->pi_lock is
now also taken, by the new code that was introduced in 34c8e1c2c02.
With this patch applied, neither of the above splats triggers on my
prelife (pre-production) machines. Without it, I cannot ensure stable
production at 1&1 Ionos.
Hint for further work: I have not yet tested other call paths, since
I am under time pressure for security reasons.
Hint for further hardening of 4.4.y and probably some more LTS series:
some more systematic testing with CONFIG_PROVE_LOCKING (and probably
some more options) should be invested in order to make the 4.4 LTS
series really "stable" again.
Signed-off-by: Thomas Schoebel-Theuer <tst(a)1und1.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Lee Jones <lee.jones(a)linaro.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Fixes: f08a4af5ccb
Fixes: 34c8e1c2c02
---
kernel/futex.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/futex.c b/kernel/futex.c
index 4a707bc7cceb..0c42d2313660 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -874,7 +874,9 @@ static void free_pi_state(struct futex_pi_state *pi_state)
* and has cleaned up the pi_state already
*/
if (pi_state->owner) {
+ raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
pi_state_update_owner(pi_state, NULL);
+ raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
rt_mutex_proxy_unlock(&pi_state->pi_mutex);
}
--
2.26.2
From: Thomas Schoebel-Theuer <tst(a)1und1.de>
This patch and problem analysis are specific to 4.4 LTS, due to
incomplete backporting of other fixes. Later LTS series have different
backports.
The following is obviously incorrect:
static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
struct futex_hash_bucket *hb)
{
[...]
raw_spin_lock(&pi_state->pi_mutex.wait_lock);
[...]
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
[...]
}
The 4.4-specific fix should probably go in the direction of
b4abf91047c,
making everything irq-safe.
Probably, backporting of b4abf91047c
to 4.4 LTS could thus be another good idea.
However, this might involve some more 4.4-specific work and
require thorough testing:
> git log --oneline v4.4..b4abf91047c -- kernel/futex.c kernel/locking/rtmutex.c | wc -l
10
So this patch is just an obvious quickfix for now.
Hint: the lock order is documented in 4.9.y and later. Similar
documentation is missing in 4.4.y. Could somebody please either
backport it as well, or write a new description if there are
differences I cannot easily see at the moment. Without reliable docs,
inspecting the locking for correctness may become a pain.
Signed-off-by: Thomas Schoebel-Theuer <tst(a)1und1.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Lee Jones <lee.jones(a)linaro.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Fixes: 394fc498142
Fixes: 6510e4a2d04
---
kernel/futex.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 70ad21bbb1d5..4a707bc7cceb 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1406,7 +1406,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
if (pi_state->owner != current)
return -EINVAL;
- raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+ raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
/*
--
2.26.2
A recent patch to prevent calling __nvme_fc_abort_outstanding_ios in
interrupt context results in a possible race condition. A controller
reset results in errored io completions, which schedule error
work. The change of the error handling to a work element allows it to
fire after the ctrl state transitions to NVME_CTRL_CONNECTING, causing
any outstanding io (used to initialize the controller) to fail and
cause problems for connect_work.
Add a state check to only schedule error work if not in the RESETTING
state.
Fixes: 19fce0470f05 ("nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context")
Cc: <stable(a)vger.kernel.org> # v5.10+
Signed-off-by: Nigel Kirkland <nkirkland2304(a)gmail.com>
Signed-off-by: James Smart <jsmart2021(a)gmail.com>
---
drivers/nvme/host/fc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 20dadd86e981..0f92bd12123e 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2055,7 +2055,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
nvme_fc_complete_rq(rq);
check_error:
- if (terminate_assoc)
+ if (terminate_assoc && ctrl->ctrl.state != NVME_CTRL_RESETTING)
queue_work(nvme_reset_wq, &ctrl->ioerr_work);
}
--
2.26.2
This is the start of the stable review cycle for the 4.14.224 release.
There are 39 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 07 Mar 2021 12:08:39 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.224-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.14.224-rc1
Sakari Ailus <sakari.ailus(a)linux.intel.com>
media: v4l: ioctl: Fix memory leak in video_usercopy
Jens Axboe <axboe(a)kernel.dk>
swap: fix swapfile read/write offset
Rokudo Yan <wu-yan(a)tcl.com>
zsmalloc: account the number of compacted pages correctly
Jan Beulich <jbeulich(a)suse.com>
xen-netback: respect gnttab_map_refs()'s return value
Jan Beulich <jbeulich(a)suse.com>
Xen/gnttab: handle p2m update errors on a per-slot basis
Chris Leech <cleech(a)redhat.com>
scsi: iscsi: Verify lengths on passthrough PDUs
Chris Leech <cleech(a)redhat.com>
scsi: iscsi: Ensure sysfs attributes are limited to PAGE_SIZE
Joe Perches <joe(a)perches.com>
sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
Lee Duncan <lduncan(a)suse.com>
scsi: iscsi: Restrict sessions and handles to admin capabilities
John David Anglin <dave.anglin(a)bell.net>
parisc: Bump 64-bit IRQ stack size to 64 KB
Jaegeuk Kim <jaegeuk(a)kernel.org>
f2fs: handle unallocated section and zone on pinned/atgc
Ricardo Ribalda <ribalda(a)chromium.org>
media: uvcvideo: Allow entities with no pads
Christian Gromm <christian.gromm(a)microchip.com>
staging: most: sound: add sanity check for function argument
Gopal Tiwari <gtiwari(a)redhat.com>
Bluetooth: Fix null pointer dereference in amp_read_loc_assoc_final_data
Fangrui Song <maskray(a)google.com>
x86/build: Treat R_386_PLT32 relocation as R_386_PC32
Miaoqing Pan <miaoqing(a)codeaurora.org>
ath10k: fix wmi mgmt tx queue full due to race condition
Di Zhu <zhudi21(a)huawei.com>
pktgen: fix misuse of BUG_ON() in pktgen_thread_worker()
Tony Lindgren <tony(a)atomide.com>
wlcore: Fix command execute failure 19 for wl12xx
Jiri Slaby <jslaby(a)suse.cz>
vt/consolemap: do font sum unsigned
Heiner Kallweit <hkallweit1(a)gmail.com>
x86/reboot: Add Zotac ZBOX CI327 nano PCI reboot quirk
Dinghao Liu <dinghao.liu(a)zju.edu.cn>
staging: fwserial: Fix error handling in fwserial_create
Geert Uytterhoeven <geert+renesas(a)glider.be>
dt-bindings: net: btusb: DT fix s/interrupt-name/interrupt-names/
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: bridge: use switchdev for port flags set through sysfs too
Li Xinhai <lixinhai.lxh(a)gmail.com>
mm/hugetlb.c: fix unnecessary address expansion of pmd sharing
Marco Elver <elver(a)google.com>
net: fix up truesize of cloned skb in skb_prepare_for_shift()
Sabyrzhan Tasbolatov <snovitoll(a)gmail.com>
smackfs: restrict bytes count in smackfs write functions
Yumei Huang <yuhuang(a)redhat.com>
xfs: Fix assert failure in xfs_setattr_size()
Sean Young <sean(a)mess.org>
media: mceusb: sanity check for prescaler value
Randy Dunlap <rdunlap(a)infradead.org>
JFS: more checks for invalid superblock
Andrew Murray <andrew.murray(a)arm.com>
arm64: Use correct ll/sc atomic constraints
Will Deacon <will.deacon(a)arm.com>
arm64: cmpxchg: Use "K" instead of "L" for ll/sc immediate constraint
Will Deacon <will.deacon(a)arm.com>
arm64: Avoid redundant type conversions in xchg() and cmpxchg()
Shaoying Xu <shaoyi(a)amazon.com>
arm64 module: set plt* section addresses to 0x0
Cornelia Huck <cohuck(a)redhat.com>
virtio/s390: implement virtio-ccw revision 2 correctly
Sergey Senozhatsky <senozhatsky(a)chromium.org>
drm/virtio: use kvmalloc for large allocations
Mike Kravetz <mike.kravetz(a)oracle.com>
hugetlb: fix update_and_free_page contig page struct assumption
Rolf Eike Beer <eb(a)emlix.com>
scripts: set proper OpenSSL include dir also for sign-file
Rolf Eike Beer <eb(a)emlix.com>
scripts: use pkg-config to locate libcrypto
Lech Perczak <lech.perczak(a)gmail.com>
net: usb: qmi_wwan: support ZTE P685M modem
-------------
Diffstat:
Documentation/devicetree/bindings/net/btusb.txt | 2 +-
Documentation/filesystems/sysfs.txt | 8 +-
Makefile | 4 +-
arch/arm/xen/p2m.c | 35 +++++-
arch/arm64/include/asm/atomic_ll_sc.h | 108 +++++++++--------
arch/arm64/include/asm/atomic_lse.h | 46 ++++----
arch/arm64/include/asm/cmpxchg.h | 116 +++++++++----------
arch/arm64/kernel/module.lds | 6 +-
arch/parisc/kernel/irq.c | 4 +
arch/x86/kernel/module.c | 1 +
arch/x86/kernel/reboot.c | 9 ++
arch/x86/tools/relocs.c | 12 +-
arch/x86/xen/p2m.c | 44 ++++++-
drivers/block/zram/zram_drv.c | 2 +-
drivers/gpu/drm/virtio/virtgpu_vq.c | 6 +-
drivers/media/rc/mceusb.c | 9 +-
drivers/media/usb/uvc/uvc_driver.c | 7 +-
drivers/media/v4l2-core/v4l2-ioctl.c | 19 ++-
drivers/net/usb/qmi_wwan.c | 1 +
drivers/net/wireless/ath/ath10k/mac.c | 15 +--
drivers/net/wireless/ti/wl12xx/main.c | 3 -
drivers/net/wireless/ti/wlcore/main.c | 15 +--
drivers/net/wireless/ti/wlcore/wlcore.h | 3 -
drivers/net/xen-netback/netback.c | 12 +-
drivers/s390/virtio/virtio_ccw.c | 4 +-
drivers/scsi/libiscsi.c | 148 ++++++++++++------------
drivers/scsi/scsi_transport_iscsi.c | 38 ++++--
drivers/staging/fwserial/fwserial.c | 2 +
drivers/staging/most/aim-sound/sound.c | 2 +
drivers/tty/vt/consolemap.c | 2 +-
fs/f2fs/segment.h | 4 +-
fs/jfs/jfs_filsys.h | 1 +
fs/jfs/jfs_mount.c | 10 ++
fs/sysfs/file.c | 55 +++++++++
fs/xfs/xfs_iops.c | 2 +-
include/linux/sysfs.h | 16 +++
include/linux/zsmalloc.h | 2 +-
mm/hugetlb.c | 28 +++--
mm/page_io.c | 11 +-
mm/swapfile.c | 2 +-
mm/zsmalloc.c | 17 ++-
net/bluetooth/amp.c | 3 +
net/bridge/br_sysfs_if.c | 9 +-
net/core/pktgen.c | 2 +-
net/core/skbuff.c | 14 ++-
scripts/Makefile | 9 +-
security/smack/smackfs.c | 21 +++-
47 files changed, 559 insertions(+), 330 deletions(-)
The first four patches are fixes for XSA-332. They avoid WARN splats
and a performance issue with interdomain events.
Patches 5 and 6 are some additions to event handling in order to add
some per pv-device statistics to sysfs and the ability to have a per
backend device spurious event delay control.
Patches 7 and 8 are minor fixes I had lying around.
Juergen Gross (8):
xen/events: reset affinity of 2-level event when tearing it down
xen/events: don't unmask an event channel when an eoi is pending
xen/events: avoid handling the same event on two cpus at the same time
xen/netback: fix spurious event detection for common event case
xen/events: link interdomain events to associated xenbus device
xen/events: add per-xenbus device event statistics and settings
xen/evtchn: use smp barriers for user event ring
xen/evtchn: use READ/WRITE_ONCE() for accessing ring indices
.../ABI/testing/sysfs-devices-xenbus | 41 ++++
drivers/block/xen-blkback/xenbus.c | 2 +-
drivers/net/xen-netback/interface.c | 24 ++-
drivers/xen/events/events_2l.c | 22 +-
drivers/xen/events/events_base.c | 199 +++++++++++++-----
drivers/xen/events/events_fifo.c | 7 -
drivers/xen/events/events_internal.h | 14 +-
drivers/xen/evtchn.c | 29 ++-
drivers/xen/pvcalls-back.c | 4 +-
drivers/xen/xen-pciback/xenbus.c | 2 +-
drivers/xen/xen-scsiback.c | 2 +-
drivers/xen/xenbus/xenbus_probe.c | 66 ++++++
include/xen/events.h | 7 +-
include/xen/xenbus.h | 7 +
14 files changed, 327 insertions(+), 99 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-devices-xenbus
--
2.26.2
Changes in v3
- drop the patch that introduces the new function tpm_chip_free()
- rework the commit messages for the patches (style, typos, etc.)
- add fixes tag to patch 2
- add James Bottomley to cc list
- add stable mailing list to cc list
Changes in v2:
- drop the patch that erroneously cleaned up after failed installation of
an action handler in tpmm_chip_alloc() (pointed out by Jarkko Sakkinen)
- make the commit message for patch 1 more detailed
- add fixes tags and kernel logs
Lino Sanfilippo (2):
tpm: fix reference counting for struct tpm_chip
tpm: in tpm2_del_space check if ops pointer is still valid
drivers/char/tpm/tpm-chip.c | 18 +++++++++++++++---
drivers/char/tpm/tpm2-space.c | 15 ++++++++++-----
drivers/char/tpm/tpm_ftpm_tee.c | 2 ++
drivers/char/tpm/tpm_vtpm_proxy.c | 1 +
4 files changed, 28 insertions(+), 8 deletions(-)
--
2.7.4
On a 32-bit fast syscall that fails to read its arguments from user
memory, the kernel currently does syscall exit work but not
syscall entry work. This confuses audit and ptrace: for example,
strace sees an exit event (ptrace_syscall_info.op == 2, i.e.
PTRACE_SYSCALL_INFO_EXIT) while the tracee is entering the syscall:
$ ./tools/testing/selftests/x86/syscall_arg_fault_32
...
strace: pid 264258: entering, ptrace_syscall_info.op == 2
...
This is a minimal fix intended for ease of backporting. A more
complete cleanup is coming.
Cc: stable(a)vger.kernel.org
Fixes: 0b085e68f407 ("x86/entry: Consolidate 32/64 bit syscall entry")
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
arch/x86/entry/common.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 0904f5676e4d..8fdb4cb27efe 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -128,7 +128,8 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
regs->ax = -EFAULT;
instrumentation_end();
- syscall_exit_to_user_mode(regs);
+ local_irq_disable();
+ irqentry_exit_to_user_mode(regs);
return false;
}
--
2.29.2
The following commit has been merged into the sched/core branch of tip:
Commit-ID: ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825
Gitweb: https://git.kernel.org/tip/ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825
Author: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
AuthorDate: Wed, 17 Feb 2021 11:56:51 -05:00
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:21 +01:00
sched/membarrier: fix missing local execution of ipi_sync_rq_state()
The function sync_runqueues_membarrier_state() should copy the
membarrier state from the @mm received as parameter to each runqueue
currently running tasks using that mm.
However, the use of smp_call_function_many() skips the current runqueue,
which is unintended. Replace by a call to on_each_cpu_mask().
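For context, a minimal sketch (illustrative only, not part of this
commit) of the semantic difference between the two APIs, both declared
in <linux/smp.h>:

#include <linux/smp.h>

static void ipi_sync_rq_state(void *info)
{
	/* body as in kernel/sched/membarrier.c */
}

static void sketch_sync(const struct cpumask *tmpmask, void *mm)
{
	/*
	 * Old code: runs the callback on every CPU in tmpmask EXCEPT
	 * the calling CPU, even when the calling CPU is in the mask.
	 */
	preempt_disable();
	smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);
	preempt_enable();

	/*
	 * Replacement: runs the callback on every CPU in tmpmask, and
	 * additionally invokes it locally (with preemption disabled)
	 * when the calling CPU is in the mask.
	 */
	on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true);
}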
Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
Reported-by: Nadav Amit <nadav.amit(a)gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: stable(a)vger.kernel.org # 5.4.x+
Link: https://lore.kernel.org/r/74F1E842-4A84-47BF-B6C2-5407DFDD4A4A@gmail.com
---
kernel/sched/membarrier.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index acdae62..b5add64 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -471,9 +471,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
}
rcu_read_unlock();
- preempt_disable();
- smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);
- preempt_enable();
+ on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true);
free_cpumask_var(tmpmask);
cpus_read_unlock();
Creating a series of detached mounts, attaching them to the filesystem,
and unmounting them can be used to trigger an integer overflow in
ns->mounts causing the kernel to block any new mounts in count_mounts()
and returning ENOSPC because it falsely assumes that the maximum number
of mounts in the mount namespace has been reached, i.e. it thinks it
can't fit the new mounts into the mount namespace anymore.
Depending on the number of mounts in your system, this can be reproduced
on any kernel that supports open_tree() and move_mount() by compiling
and running the following program:
/* SPDX-License-Identifier: LGPL-2.1+ */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
/* open_tree() */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef __NR_open_tree
#if defined __alpha__
#define __NR_open_tree 538
#elif defined _MIPS_SIM
#if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
#define __NR_open_tree 4428
#endif
#if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
#define __NR_open_tree 6428
#endif
#if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
#define __NR_open_tree 5428
#endif
#elif defined __ia64__
#define __NR_open_tree (428 + 1024)
#else
#define __NR_open_tree 428
#endif
#endif
/* move_mount() */
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
#endif
#ifndef __NR_move_mount
#if defined __alpha__
#define __NR_move_mount 539
#elif defined _MIPS_SIM
#if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
#define __NR_move_mount 4429
#endif
#if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
#define __NR_move_mount 6429
#endif
#if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
#define __NR_move_mount 5429
#endif
#elif defined __ia64__
#define __NR_move_mount (428 + 1024)
#else
#define __NR_move_mount 429
#endif
#endif
static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
{
return syscall(__NR_open_tree, dfd, filename, flags);
}
static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd,
const char *to_pathname, unsigned int flags)
{
return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags);
}
static bool is_shared_mountpoint(const char *path)
{
bool shared = false;
FILE *f = NULL;
char *line = NULL;
int i;
size_t len = 0;
f = fopen("/proc/self/mountinfo", "re");
if (!f)
return 0;
while (getline(&line, &len, f) > 0) {
char *slider1, *slider2;
for (slider1 = line, i = 0; slider1 && i < 4; i++)
slider1 = strchr(slider1 + 1, ' ');
if (!slider1)
continue;
slider2 = strchr(slider1 + 1, ' ');
if (!slider2)
continue;
*slider2 = '\0';
if (strcmp(slider1 + 1, path) == 0) {
/* This is the path. Is it shared? */
slider1 = strchr(slider2 + 1, ' ');
if (slider1 && strstr(slider1, "shared:")) {
shared = true;
break;
}
}
}
fclose(f);
free(line);
return shared;
}
static void usage(void)
{
const char *text = "mount-new <base-dir>\n";
fprintf(stderr, "%s", text);
_exit(EXIT_SUCCESS);
}
#define exit_usage(format, ...) \
({ \
fprintf(stderr, format "\n", ##__VA_ARGS__); \
usage(); \
})
#define exit_log(format, ...) \
({ \
fprintf(stderr, format "\n", ##__VA_ARGS__); \
exit(EXIT_FAILURE); \
})
static const struct option longopts[] = {
{"help", no_argument, 0, 'a'},
{ NULL, no_argument, 0, 0 },
};
int main(int argc, char *argv[])
{
int exit_code = EXIT_SUCCESS, index = 0;
int dfd, fd_tree, new_argc, ret;
char *base_dir;
char *const *new_argv;
char target[PATH_MAX];
while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) {
switch (ret) {
case 'a':
/* fallthrough */
default:
usage();
}
}
new_argv = &argv[optind];
new_argc = argc - optind;
if (new_argc < 1)
exit_usage("Missing base directory\n");
base_dir = new_argv[0];
if (*base_dir != '/')
exit_log("Please specify an absolute path");
/* Ensure that target is a shared mountpoint. */
if (!is_shared_mountpoint(base_dir))
exit_log("Please ensure that \"%s\" is a shared mountpoint", base_dir);
dfd = open(base_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
if (dfd < 0)
exit_log("%m - Failed to open base directory \"%s\"", base_dir);
ret = mkdirat(dfd, "detached-move-mount", 0755);
if (ret < 0)
exit_log("%m - Failed to create required temporary directories");
ret = snprintf(target, sizeof(target), "%s/detached-move-mount", base_dir);
if (ret < 0 || (size_t)ret >= sizeof(target))
exit_log("%m - Failed to assemble target path");
/*
* Having a mount table with 10000 mounts is already quite excessive
* and should account even for weird test systems.
*/
for (size_t i = 0; i < 10000; i++) {
fd_tree = sys_open_tree(dfd, "detached-move-mount",
OPEN_TREE_CLONE |
OPEN_TREE_CLOEXEC |
AT_EMPTY_PATH);
if (fd_tree < 0) {
fprintf(stderr, "%m - Failed to open %d(detached-move-mount)", dfd);
exit_code = EXIT_FAILURE;
break;
}
ret = sys_move_mount(fd_tree, "", dfd, "detached-move-mount", MOVE_MOUNT_F_EMPTY_PATH);
if (ret < 0) {
if (errno == ENOSPC)
fprintf(stderr, "%m - Buggy mount counting");
else
fprintf(stderr, "%m - Failed to attach mount to %d(detached-move-mount)", dfd);
exit_code = EXIT_FAILURE;
break;
}
close(fd_tree);
ret = umount2(target, MNT_DETACH);
if (ret < 0) {
fprintf(stderr, "%m - Failed to unmount %s", target);
exit_code = EXIT_FAILURE;
break;
}
}
(void)unlinkat(dfd, "detached-move-mount", AT_REMOVEDIR);
close(dfd);
exit(exit_code);
}
and wait for the kernel to refuse any new mounts by returning ENOSPC.
How many iterations are needed depends on the number of mounts in your
system. Assuming you have something like 50 mounts on a standard system
it should be almost instantaneous.
The root cause of this is that detached mounts aren't handled correctly
when source and target mount are identical and reside on a shared
mount, causing a broken mount tree where the detached source itself is
propagated, something that propagation prevents for regular bind-mounts
and new mounts. This ultimately leads to a miscalculation of the number
of mounts in the mount namespace.
Detached mounts created via
open_tree(fd, path, OPEN_TREE_CLONE)
are essentially like an unattached new mount, or an unattached
bind-mount. They can then later on be attached to the filesystem via
move_mount() which calls into attach_recursive_mount(). Part of
attaching it to the filesystem is making sure that mounts get correctly
propagated in case the destination mountpoint is MS_SHARED, i.e. is a
shared mountpoint. This is done by calling into propagate_mnt() which
walks the list of peers calling propagate_one() on each mount in this
list making sure it receives the propagation event.
The propagate_one() function thereby skips both new mounts and bind
mounts so as not to propagate them "into themselves". Both are
identified by checking whether the mount is already attached to any
mount namespace via mnt->mnt_ns. This is what the IS_MNT_NEW() helper
is responsible for.
However, detached mounts have an anonymous mount namespace stashed in
mnt->mnt_ns, which means that IS_MNT_NEW() doesn't realize they need
to be skipped. This causes the mount to propagate "into itself",
breaking the mount table and causing a disconnect between the number
of mounts recorded as being beneath or reachable from the target
mountpoint and the number of mounts actually recorded/counted in
ns->mounts, ultimately causing an overflow which in turn prevents any
new mounts via the ENOSPC issue.
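For context, a sketch of the helper the fix below relies on (as
defined in fs/mount.h at the time of writing; shown here for
illustration only, it is not part of this diff):

/*
 * Anonymous mount namespaces - as attached to detached mounts created
 * via OPEN_TREE_CLONE - are allocated with seq == 0, which is exactly
 * what is_anon_ns() tests:
 */
static inline bool is_anon_ns(struct mnt_namespace *ns)
{
	return ns->seq == 0;
}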
So teach propagation to handle detached mounts by making it aware of
them. I've been tracking this issue down for the last couple of days
and then verified that the fix is correct by unmounting everything in
my current mount table, leaving only /proc and /sys mounted, and
running the reproducer above overnight while verifying the number of
mounts counted in ns->mounts. With this fix the counts are correct and
the ENOSPC issue can't be reproduced.
This change will only have an effect on mounts created with the new
mount API, since detached mounts cannot be created with the old mount
API, so regressions are extremely unlikely.
Fixes: 2db154b3ea8e ("vfs: syscall: Add move_mount(2) to move mounts around")
Cc: David Howells <dhowells(a)redhat.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-fsdevel(a)vger.kernel.org
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Christian Brauner <christian.brauner(a)ubuntu.com>
---
fs/pnode.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/pnode.h b/fs/pnode.h
index 26f74e092bd9..988f1aa9b02a 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -12,7 +12,7 @@
#define IS_MNT_SHARED(m) ((m)->mnt.mnt_flags & MNT_SHARED)
#define IS_MNT_SLAVE(m) ((m)->mnt_master)
-#define IS_MNT_NEW(m) (!(m)->mnt_ns)
+#define IS_MNT_NEW(m) (!(m)->mnt_ns || is_anon_ns((m)->mnt_ns))
#define CLEAR_MNT_SHARED(m) ((m)->mnt.mnt_flags &= ~MNT_SHARED)
#define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE)
#define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
base-commit: f69d02e37a85645aa90d18cacfff36dba370f797
--
2.27.0
Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called
after debug_pagealloc_unmap_pages(). This causes a crash when
debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an
unmapped page.
This patch puts kasan_free_nondeferred_pages() before
debug_pagealloc_unmap_pages() and arch_free_page(), which can also make
the page unavailable.
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
---
Changes v1->v2:
- Move kasan_free_nondeferred_pages() before arch_free_page().
---
mm/page_alloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 969e7012fce0..0efb07b5907c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1304,6 +1304,12 @@ static __always_inline bool free_pages_prepare(struct page *page,
kernel_poison_pages(page, 1 << order);
+ /*
+ * With hardware tag-based KASAN, memory tags must be set before the
+ * page becomes unavailable via debug_pagealloc or arch_free_page.
+ */
+ kasan_free_nondeferred_pages(page, order, fpi_flags);
+
/*
* arch_free_page() can make the page's contents inaccessible. s390
* does this. So nothing which can access the page's contents should
@@ -1313,8 +1319,6 @@ static __always_inline bool free_pages_prepare(struct page *page,
debug_pagealloc_unmap_pages(page, 1 << order);
- kasan_free_nondeferred_pages(page, order, fpi_flags);
-
return true;
}
--
2.30.1.766.gb4fecdf3b7-goog
Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called
after debug_pagealloc_unmap_pages(). This causes a crash when
debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an
unmapped page.
This patch puts kasan_free_nondeferred_pages() before
debug_pagealloc_unmap_pages() and arch_free_page(), which can also make
the page unavailable.
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
---
Changes v2->v3:
- Rebase onto mainline.
---
mm/page_alloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..c89ee1ba7034 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1281,6 +1281,12 @@ static __always_inline bool free_pages_prepare(struct page *page,
kernel_poison_pages(page, 1 << order);
+ /*
+ * With hardware tag-based KASAN, memory tags must be set before the
+ * page becomes unavailable via debug_pagealloc or arch_free_page.
+ */
+ kasan_free_nondeferred_pages(page, order);
+
/*
* arch_free_page() can make the page's contents inaccessible. s390
* does this. So nothing which can access the page's contents should
@@ -1290,8 +1296,6 @@ static __always_inline bool free_pages_prepare(struct page *page,
debug_pagealloc_unmap_pages(page, 1 << order);
- kasan_free_nondeferred_pages(page, order);
-
return true;
}
--
2.30.1.766.gb4fecdf3b7-goog
The patch titled
Subject: kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
has been added to the -mm tree. Its filename is
kasan-mm-fix-crash-with-hw_tags-and-debug_pagealloc.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/kasan-mm-fix-crash-with-hw_tags-a…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/kasan-mm-fix-crash-with-hw_tags-a…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Andrey Konovalov <andreyknvl(a)google.com>
Subject: kasan, mm: fix crash with HW_TAGS and DEBUG_PAGEALLOC
Currently, kasan_free_nondeferred_pages()->kasan_free_pages() is called
after debug_pagealloc_unmap_pages(). This causes a crash when
debug_pagealloc is enabled, as HW_TAGS KASAN can't set tags on an
unmapped page.
This patch puts kasan_free_nondeferred_pages() before
debug_pagealloc_unmap_pages() and arch_free_page(), which can also make
the page unavailable.
Link: https://lkml.kernel.org/r/24cd7db274090f0e5bc3adcdc7399243668e3171.16149873…
Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Peter Collingbourne <pcc(a)google.com>
Cc: Evgenii Stepanov <eugenis(a)google.com>
Cc: Branislav Rankov <Branislav.Rankov(a)arm.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/mm/page_alloc.c~kasan-mm-fix-crash-with-hw_tags-and-debug_pagealloc
+++ a/mm/page_alloc.c
@@ -1282,6 +1282,12 @@ static __always_inline bool free_pages_p
kernel_poison_pages(page, 1 << order);
/*
+ * With hardware tag-based KASAN, memory tags must be set before the
+ * page becomes unavailable via debug_pagealloc or arch_free_page.
+ */
+ kasan_free_nondeferred_pages(page, order);
+
+ /*
* arch_free_page() can make the page's contents inaccessible. s390
* does this. So nothing which can access the page's contents should
* happen after this.
@@ -1290,8 +1296,6 @@ static __always_inline bool free_pages_p
debug_pagealloc_unmap_pages(page, 1 << order);
- kasan_free_nondeferred_pages(page, order);
-
return true;
}
_
Patches currently in -mm which might be from andreyknvl(a)google.com are
kasan-mm-fix-crash-with-hw_tags-and-debug_pagealloc.patch
While Kepler does technically support 256x256 cursors, it turns out that
in order for us to use these correctly we need to make sure that the cursor
plane uses a ctxdma that is set to use small (4K)/large (128K) pages -
whichever is applicable to the given cursor surface.
Right now, however, we share the main kmsVramCtxDma, which defaults
to small pages and is used for all planes but ovly - resulting in
artifacts when we use 256x256 cursor surfaces. So until we teach
nouveau to use a separate ctxdma for the cursor, let's just stop
advertising 256x256 cursors by default - which should fix the issues
that users were seeing.
Coincidentally, this is also why small ovlys don't work on Kepler: the
ctxdma we use for ovlys is set to large pages.
Changes since v2:
* Fix comments and patch description
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Fixes: d3b2f0f7921c ("drm/nouveau/kms/nv50-: Report max cursor size to userspace")
Cc: <stable(a)vger.kernel.org> # v5.11+
---
drivers/gpu/drm/nouveau/dispnv50/disp.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c b/drivers/gpu/drm/nouveau/dispnv50/disp.c
index 196612addfd6..d92cf9e17ac3 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
@@ -2693,9 +2693,19 @@ nv50_display_create(struct drm_device *dev)
else
nouveau_display(dev)->format_modifiers = disp50xx_modifiers;
- if (disp->disp->object.oclass >= GK104_DISP) {
+ /* FIXME: 256x256 cursors are supported on Kepler, however unlike Maxwell and later
+ * generations Kepler requires that we specify the page type, small (4K) or large (128K),
+ * correctly for the ctxdma being used on curs/ovly. We currently share a ctxdma across all
+ * display planes (except ovly) that defaults to small pages, which results in artifacting
+ * on 256x256 cursors. Until we teach nouveau to create an appropriate ctxdma for the cursor
+ * fb in use, simply avoid advertising support for 256x256 cursors.
+ */
+ if (disp->disp->object.oclass >= GM107_DISP) {
dev->mode_config.cursor_width = 256;
dev->mode_config.cursor_height = 256;
+ } else if (disp->disp->object.oclass >= GK104_DISP) {
+ dev->mode_config.cursor_width = 128;
+ dev->mode_config.cursor_height = 128;
} else {
dev->mode_config.cursor_width = 64;
dev->mode_config.cursor_height = 64;
--
2.29.2
process_madvise currently requires ptrace attach capability.
PTRACE_MODE_ATTACH gives one process complete control over another
process. It effectively removes the security boundary between the
two processes (in one direction). Granting ptrace attach capability
even to a system process is considered dangerous since it creates an
attack surface. This severely limits the usage of this API.
The operations process_madvise can perform do not affect the correctness
of the operation of the target process; they only affect where the data
is physically located (and therefore, how fast it can be accessed).
What we want is the ability for one process to influence another process
in order to optimize performance across the entire system while leaving
the security boundary intact.
Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
and CAP_SYS_NICE: PTRACE_MODE_READ prevents leaking ASLR metadata and
CAP_SYS_NICE gates influencing process performance.
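For illustration, a minimal userspace sketch of a caller that, with
this patch, needs only PTRACE_MODE_READ on the target plus
CAP_SYS_NICE (syscall numbers and the MADV_COLD value are guarded by
fallback defines, in the same style as the reproducers elsewhere in
this series; error handling is trimmed):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

int main(int argc, char *argv[])
{
	int pidfd;
	ssize_t ret;
	struct iovec iov;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <pid> <addr> <len>\n", argv[0]);
		return 1;
	}

	pidfd = syscall(__NR_pidfd_open, (pid_t)atoi(argv[1]), 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	iov.iov_base = (void *)strtoul(argv[2], NULL, 0);
	iov.iov_len = strtoul(argv[3], NULL, 0);

	/* Mark the target range as cold; a non-destructive hint. */
	ret = syscall(__NR_process_madvise, pidfd, &iov, 1, MADV_COLD, 0);
	if (ret < 0)
		perror("process_madvise");

	return ret < 0;
}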
Cc: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Acked-by: David Rientjes <rientjes(a)google.com>
---
changes in v3
- Added Reviewed-by: Kees Cook <keescook(a)chromium.org>
- Created man page for process_madvise per Andrew's request: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144…
- cc'ed stable(a)vger.kernel.org # 5.10+ per Andrew's request
- cc'ed linux-security-module(a)vger.kernel.org per James Morris's request
mm/madvise.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index df692d2e35d4..01fef79ac761 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
goto release_task;
}
- mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+ /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
+ mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
if (IS_ERR_OR_NULL(mm)) {
ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
goto release_task;
}
+ /*
+ * Require CAP_SYS_NICE for influencing process performance. Note that
+ * only non-destructive hints are currently supported.
+ */
+ if (!capable(CAP_SYS_NICE)) {
+ ret = -EPERM;
+ goto release_mm;
+ }
+
total_len = iov_iter_count(&iter);
while (iov_iter_count(&iter)) {
@@ -1218,6 +1228,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
if (ret == 0)
ret = total_len - iov_iter_count(&iter);
+release_mm:
mmput(mm);
release_task:
put_task_struct(task);
--
2.30.1.766.gb4fecdf3b7-goog
From: Jia He <justin.he(a)arm.com>
When walking the page tables at a given level, if the start address
for the range isn't aligned for that level, we propagate the
misalignment on each iteration at that level.
This results in the walker ignoring a number of entries (depending
on the original misalignment) on each subsequent iteration.
Properly aligning the address before the next iteration addresses
this issue.
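A worked example of the problem (illustrative only, assuming a
hypothetical 2MiB granule at the affected level):

#include <stdio.h>

#define GRANULE 0x200000UL /* e.g. a 2MiB block at this level */
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

int main(void)
{
	unsigned long addr = 0x40012000UL; /* misaligned start address */

	/*
	 * Before the fix: the 0x12000 offset is carried into every
	 * iteration, so the walker keeps starting inside the next
	 * block and ignores part of each entry's range.
	 */
	printf("buggy next: %#lx\n", addr + GRANULE);	/* 0x40212000 */

	/* After the fix: align down first, then step one granule. */
	printf("fixed next: %#lx\n",
	       ALIGN_DOWN(addr, GRANULE) + GRANULE);	/* 0x40200000 */

	return 0;
}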
Cc: stable(a)vger.kernel.org
Reported-by: Howard Zhang <Howard.Zhang(a)arm.com>
Acked-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Jia He <justin.he(a)arm.com>
Fixes: b1e57de62cfb ("KVM: arm64: Add stand-alone page-table walker infrastructure")
[maz: rewrite commit message]
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20210303024225.2591-1-justin.he@arm.com
---
arch/arm64/kvm/hyp/pgtable.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4d177ce1d536..926fc07074f5 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -223,6 +223,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
goto out;
if (!table) {
+ data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level));
data->addr += kvm_granule_size(level);
goto out;
}
--
2.29.2
From: Will Deacon <will(a)kernel.org>
Commit 7db21530479f ("KVM: arm64: Restore hyp when panicking in guest
context") tracks the currently running vCPU, clearing the pointer to
NULL on exit from a guest.
Unfortunately, the use of 'set_loaded_vcpu' clobbers x1 to point at the
kvm_hyp_ctxt instead of the vCPU context, causing the subsequent RAS
code to go off into the weeds when it saves the DISR assuming that the
CPU context is embedded in a struct vCPU.
Leave x1 alone and use x3 as a temporary register instead when clearing
the vCPU on the guest exit path.
Cc: Marc Zyngier <maz(a)kernel.org>
Cc: Andrew Scull <ascull(a)google.com>
Cc: <stable(a)vger.kernel.org>
Fixes: 7db21530479f ("KVM: arm64: Restore hyp when panicking in guest context")
Suggested-by: Quentin Perret <qperret(a)google.com>
Signed-off-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20210226181211.14542-1-will@kernel.org
---
arch/arm64/kvm/hyp/entry.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
index b0afad7a99c6..0c66a1d408fd 100644
--- a/arch/arm64/kvm/hyp/entry.S
+++ b/arch/arm64/kvm/hyp/entry.S
@@ -146,7 +146,7 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
// Now restore the hyp regs
restore_callee_saved_regs x2
- set_loaded_vcpu xzr, x1, x2
+ set_loaded_vcpu xzr, x2, x3
alternative_if ARM64_HAS_RAS_EXTN
// If we have the RAS extensions we can consume a pending error
--
2.29.2
From: Suzuki K Poulose <suzuki.poulose(a)arm.com>
The nVHE KVM hyp drains and disables the SPE buffer before entering
the guest, as the EL1&0 translation regime is going to be loaded with
that of the guest.
But this operation is performed way too late, because:
- The owning translation regime of the SPE buffer is transferred to
EL2 (MDCR_EL2_E2PB == 0).
- The guest Stage1 is loaded.
Thus the flush could be issued with the host EL1 virtual address but
use the EL2 translations instead of host EL1's for writing out any
cached data.
Fix this by moving the SPE buffer handling early enough.
The restore path is doing the right thing.
Fixes: 014c4c77aad7 ("KVM: arm64: Improve debug register save/restore flow")
Cc: stable(a)vger.kernel.org
Cc: Christoffer Dall <christoffer.dall(a)arm.com>
Cc: Marc Zyngier <maz(a)kernel.org>
Cc: Will Deacon <will(a)kernel.org>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Alexandru Elisei <alexandru.elisei(a)arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei(a)arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20210302120345.3102874-1-suzuki.poulose@arm.com
---
arch/arm64/include/asm/kvm_hyp.h | 5 +++++
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 12 ++++++++++--
arch/arm64/kvm/hyp/nvhe/switch.c | 11 ++++++++++-
3 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index c0450828378b..385bd7dd3d39 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -83,6 +83,11 @@ void sysreg_restore_guest_state_vhe(struct kvm_cpu_context *ctxt);
void __debug_switch_to_guest(struct kvm_vcpu *vcpu);
void __debug_switch_to_host(struct kvm_vcpu *vcpu);
+#ifdef __KVM_NVHE_HYPERVISOR__
+void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu);
+void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu);
+#endif
+
void __fpsimd_save_state(struct user_fpsimd_state *fp_regs);
void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 91a711aa8382..f401724f12ef 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -58,16 +58,24 @@ static void __debug_restore_spe(u64 pmscr_el1)
write_sysreg_s(pmscr_el1, SYS_PMSCR_EL1);
}
-void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
+void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu)
{
/* Disable and flush SPE data generation */
__debug_save_spe(&vcpu->arch.host_debug_state.pmscr_el1);
+}
+
+void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
+{
__debug_switch_to_guest_common(vcpu);
}
-void __debug_switch_to_host(struct kvm_vcpu *vcpu)
+void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu)
{
__debug_restore_spe(vcpu->arch.host_debug_state.pmscr_el1);
+}
+
+void __debug_switch_to_host(struct kvm_vcpu *vcpu)
+{
__debug_switch_to_host_common(vcpu);
}
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index f3d0e9eca56c..59aa1045fdaf 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -192,6 +192,14 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
pmu_switch_needed = __pmu_switch_to_guest(host_ctxt);
__sysreg_save_state_nvhe(host_ctxt);
+ /*
+ * We must flush and disable the SPE buffer for nVHE, as
+ * the translation regime (EL1&0) is going to be loaded with
+ * that of the guest. And we must do this before we change the
+ * translation regime to EL2 (via MDCR_EL2_E2PB == 0) and
+ * before we load guest Stage1.
+ */
+ __debug_save_host_buffers_nvhe(vcpu);
__adjust_pc(vcpu);
@@ -234,11 +242,12 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
if (vcpu->arch.flags & KVM_ARM64_FP_ENABLED)
__fpsimd_save_fpexc32(vcpu);
+ __debug_switch_to_host(vcpu);
/*
* This must come after restoring the host sysregs, since a non-VHE
* system may enable SPE here and make use of the TTBRs.
*/
- __debug_switch_to_host(vcpu);
+ __debug_restore_host_buffers_nvhe(vcpu);
if (pmu_switch_needed)
__pmu_switch_to_host(host_ctxt);
--
2.29.2
Commit 384d87ef2c95 ("block: Do not discard buffers under a mounted
filesystem") made paths issuing discard or zeroout requests to the
underlying device try to grab block device in exclusive mode. If that
failed we returned EBUSY to userspace. This however caused unexpected
fallout in userspace where e.g. FUSE filesystems issue discard requests
from userspace daemons although the device is open exclusively by the
kernel. Also shrinking of logical volume by LVM issues discard requests
to a device which may be claimed exclusively because there's another LV
on the same PV. So to avoid these userspace regressions, fall back to
invalidate_inode_pages2_range() instead of returning EBUSY to userspace
and return EBUSY only if that call fails as well (meaning that there's
indeed someone using the particular device range we are trying to
discard).
Link: https://bugzilla.kernel.org/show_bug.cgi?id=211167
Fixes: 384d87ef2c95 ("block: Do not discard buffers under a mounted filesystem")
CC: stable(a)vger.kernel.org
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
fs/block_dev.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 235b5042672e..c33151020bcd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -118,13 +118,22 @@ int truncate_bdev_range(struct block_device *bdev, fmode_t mode,
if (!(mode & FMODE_EXCL)) {
int err = bd_prepare_to_claim(bdev, truncate_bdev_range);
if (err)
- return err;
+ goto invalidate;
}
truncate_inode_pages_range(bdev->bd_inode->i_mapping, lstart, lend);
if (!(mode & FMODE_EXCL))
bd_abort_claiming(bdev, truncate_bdev_range);
return 0;
+
+invalidate:
+ /*
+ * Someone else has the handle exclusively open. Try invalidating instead.
+ * The 'end' argument is inclusive so the rounding is safe.
+ */
+ return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
+ lstart >> PAGE_SHIFT,
+ lend >> PAGE_SHIFT);
}
EXPORT_SYMBOL(truncate_bdev_range);
--
2.26.2
Extend kvm_s390_shadow_fault to return the pointer to the valid leaf
DAT table entry, or to the invalid entry.
Also return some flags in the lower bits of the address:
PEI_DAT_PROT: indicates that DAT protection applies because of the
protection bit in the segment (or, if EDAT, region) tables.
PEI_NOT_PTE: indicates that the address of the DAT table entry returned
does not refer to a PTE, but to a segment or region table.
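As an illustration of the encoding (the masks come from the gaccess.h
hunk below; the helpers themselves are hypothetical and not part of
this patch):

/*
 * DAT table entries are 8-byte aligned, so the low bits of the value
 * returned in *datptr are free to carry the PEI flags and can simply
 * be masked off again to recover the entry address.
 */
#define PEI_DAT_PROT 2
#define PEI_NOT_PTE  4

static inline unsigned long pei_entry_addr(unsigned long datptr)
{
	return datptr & ~(unsigned long)(PEI_DAT_PROT | PEI_NOT_PTE);
}

static inline int pei_dat_protection(unsigned long datptr)
{
	return !!(datptr & PEI_DAT_PROT);
}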
Signed-off-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Janosch Frank <frankja(a)de.ibm.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
---
arch/s390/kvm/gaccess.c | 30 +++++++++++++++++++++++++-----
arch/s390/kvm/gaccess.h | 6 +++++-
arch/s390/kvm/vsie.c | 8 ++++----
3 files changed, 34 insertions(+), 10 deletions(-)
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 6d6b57059493..33a03e518640 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -976,7 +976,9 @@ int kvm_s390_check_low_addr_prot_real(struct kvm_vcpu *vcpu, unsigned long gra)
* kvm_s390_shadow_tables - walk the guest page table and create shadow tables
* @sg: pointer to the shadow guest address space structure
* @saddr: faulting address in the shadow gmap
- * @pgt: pointer to the page table address result
+ * @pgt: pointer to the beginning of the page table for the given address if
+ * successful (return value 0), or to the first invalid DAT entry in
+ * case of exceptions (return value > 0)
* @fake: pgt references contiguous guest memory block, not a pgtable
*/
static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
@@ -1034,6 +1036,7 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
rfte.val = ptr;
goto shadow_r2t;
}
+ *pgt = ptr + vaddr.rfx * 8;
rc = gmap_read_table(parent, ptr + vaddr.rfx * 8, &rfte.val);
if (rc)
return rc;
@@ -1060,6 +1063,7 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
rste.val = ptr;
goto shadow_r3t;
}
+ *pgt = ptr + vaddr.rsx * 8;
rc = gmap_read_table(parent, ptr + vaddr.rsx * 8, &rste.val);
if (rc)
return rc;
@@ -1087,6 +1091,7 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
rtte.val = ptr;
goto shadow_sgt;
}
+ *pgt = ptr + vaddr.rtx * 8;
rc = gmap_read_table(parent, ptr + vaddr.rtx * 8, &rtte.val);
if (rc)
return rc;
@@ -1123,6 +1128,7 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
ste.val = ptr;
goto shadow_pgt;
}
+ *pgt = ptr + vaddr.sx * 8;
rc = gmap_read_table(parent, ptr + vaddr.sx * 8, &ste.val);
if (rc)
return rc;
@@ -1157,6 +1163,8 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
* @vcpu: virtual cpu
* @sg: pointer to the shadow guest address space structure
* @saddr: faulting address in the shadow gmap
+ * @datptr: will contain the address of the faulting DAT table entry, or of
+ * the valid leaf, plus some flags
*
* Returns: - 0 if the shadow fault was successfully resolved
* - > 0 (pgm exception code) on exceptions while faulting
@@ -1165,11 +1173,11 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
* - -ENOMEM if out of memory
*/
int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
- unsigned long saddr)
+ unsigned long saddr, unsigned long *datptr)
{
union vaddress vaddr;
union page_table_entry pte;
- unsigned long pgt;
+ unsigned long pgt = 0;
int dat_protection, fake;
int rc;
@@ -1191,8 +1199,20 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
pte.val = pgt + vaddr.px * PAGE_SIZE;
goto shadow_page;
}
- if (!rc)
- rc = gmap_read_table(sg->parent, pgt + vaddr.px * 8, &pte.val);
+
+ switch (rc) {
+ case PGM_SEGMENT_TRANSLATION:
+ case PGM_REGION_THIRD_TRANS:
+ case PGM_REGION_SECOND_TRANS:
+ case PGM_REGION_FIRST_TRANS:
+ pgt |= PEI_NOT_PTE;
+ break;
+ case 0:
+ pgt += vaddr.px * 8;
+ rc = gmap_read_table(sg->parent, pgt, &pte.val);
+ }
+ if (datptr)
+ *datptr = pgt | dat_protection * PEI_DAT_PROT;
if (!rc && pte.i)
rc = PGM_PAGE_TRANSLATION;
if (!rc && pte.z)
diff --git a/arch/s390/kvm/gaccess.h b/arch/s390/kvm/gaccess.h
index 107fdfd2eadd..df4014f09e26 100644
--- a/arch/s390/kvm/gaccess.h
+++ b/arch/s390/kvm/gaccess.h
@@ -378,7 +378,11 @@ void ipte_unlock(struct kvm_vcpu *vcpu);
int ipte_lock_held(struct kvm_vcpu *vcpu);
int kvm_s390_check_low_addr_prot_real(struct kvm_vcpu *vcpu, unsigned long gra);
+/* MVPG PEI indication bits */
+#define PEI_DAT_PROT 2
+#define PEI_NOT_PTE 4
+
int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *shadow,
- unsigned long saddr);
+ unsigned long saddr, unsigned long *datptr);
#endif /* __KVM_S390_GACCESS_H */
diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c
index bd803e091918..78b604326016 100644
--- a/arch/s390/kvm/vsie.c
+++ b/arch/s390/kvm/vsie.c
@@ -620,10 +620,10 @@ static int map_prefix(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
/* with mso/msl, the prefix lies at offset *mso* */
prefix += scb_s->mso;
- rc = kvm_s390_shadow_fault(vcpu, vsie_page->gmap, prefix);
+ rc = kvm_s390_shadow_fault(vcpu, vsie_page->gmap, prefix, NULL);
if (!rc && (scb_s->ecb & ECB_TE))
rc = kvm_s390_shadow_fault(vcpu, vsie_page->gmap,
- prefix + PAGE_SIZE);
+ prefix + PAGE_SIZE, NULL);
/*
* We don't have to mprotect, we will be called for all unshadows.
* SIE will detect if protection applies and trigger a validity.
@@ -914,7 +914,7 @@ static int handle_fault(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
current->thread.gmap_addr, 1);
rc = kvm_s390_shadow_fault(vcpu, vsie_page->gmap,
- current->thread.gmap_addr);
+ current->thread.gmap_addr, NULL);
if (rc > 0) {
rc = inject_fault(vcpu, rc,
current->thread.gmap_addr,
@@ -936,7 +936,7 @@ static void handle_last_fault(struct kvm_vcpu *vcpu,
{
if (vsie_page->fault_addr)
kvm_s390_shadow_fault(vcpu, vsie_page->gmap,
- vsie_page->fault_addr);
+ vsie_page->fault_addr, NULL);
vsie_page->fault_addr = 0;
}
--
2.26.2
Creating a series of detached mounts, attaching them to the filesystem,
and unmounting them can be used to trigger an integer overflow in
ns->mounts causing the kernel to block any new mounts in count_mounts()
and returning ENOSPC because it falsely assumes that the maximum number
of mounts in the mount namespace has been reached, i.e. it thinks it
can't fit the new mounts into the mount namespace anymore.
Depending on the number of mounts in your system, this can be reproduced
on any kernel that supports open_tree() and move_mount() with the
following instructions:
1. Compile the following program "repro.c" via "make repro"
> cat repro.c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
/* open_tree() */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef __NR_open_tree
#if defined __alpha__
#define __NR_open_tree 538
#elif defined _MIPS_SIM
#if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
#define __NR_open_tree 4428
#endif
#if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
#define __NR_open_tree 6428
#endif
#if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
#define __NR_open_tree 5428
#endif
#elif defined __ia64__
#define __NR_open_tree (428 + 1024)
#else
#define __NR_open_tree 428
#endif
#endif
/* move_mount() */
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
#endif
#ifndef __NR_move_mount
#if defined __alpha__
#define __NR_move_mount 539
#elif defined _MIPS_SIM
#if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
#define __NR_move_mount 4429
#endif
#if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
#define __NR_move_mount 6429
#endif
#if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
#define __NR_move_mount 5429
#endif
#elif defined __ia64__
#define __NR_move_mount (428 + 1024)
#else
#define __NR_move_mount 429
#endif
#endif
static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
{
return syscall(__NR_open_tree, dfd, filename, flags);
}
static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd,
const char *to_pathname, unsigned int flags)
{
return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags);
}
static void usage(void)
{
const char *text = "mount-new [--recursive] <source> <target>\n";
fprintf(stderr, "%s", text);
_exit(EXIT_SUCCESS);
}
#define exit_usage(format, ...) \
({ \
fprintf(stderr, format, ##__VA_ARGS__); \
usage(); \
})
#define exit_log(format, ...) \
({ \
fprintf(stderr, format, ##__VA_ARGS__); \
exit(EXIT_FAILURE); \
})
static const struct option longopts[] = {
{"recursive", no_argument, 0, 'a'},
{"help", no_argument, 0, 'b'},
{ NULL, no_argument, 0, 0 },
};
int main(int argc, char *argv[])
{
int index = 0;
bool recursive = false;
int fd_tree, new_argc, ret;
char *source, *target;
char *const *new_argv;
while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) {
switch (ret) {
case 'a':
recursive = true;
break;
case 'b':
/* fallthrough */
default:
usage();
}
}
new_argv = &argv[optind];
new_argc = argc - optind;
if (new_argc < 2)
exit_usage("Missing source or target mountpoint\n\n");
source = new_argv[0];
target = new_argv[1];
fd_tree = sys_open_tree(-EBADF, source,
OPEN_TREE_CLONE |
OPEN_TREE_CLOEXEC |
AT_EMPTY_PATH |
(recursive ? AT_RECURSIVE : 0));
if (fd_tree < 0) {
exit_log("%m - Failed to open %s\n", source);
exit(EXIT_FAILURE);
}
ret = sys_move_mount(fd_tree, "", -EBADF, target, MOVE_MOUNT_F_EMPTY_PATH);
if (ret < 0)
exit_log("%m - Failed to attach mount to %s\n", target);
close(fd_tree);
exit(EXIT_SUCCESS);
}
2. Run a loop like:
while true; do sudo ./repro; done
and wait for the kernel to refuse any new mounts by returning ENOSPC.
Depending on the number of mounts you have on the system this might take
shorter or longer.
The root cause of this is that detached mounts aren't handled correctly
when the destination mount point is MS_SHARED, causing a borked mount
tree and ultimately a miscalculation of the number of mounts in the
mount namespace.
Detached mounts created via
open_tree(fd, path, OPEN_TREE_CLONE)
are essentially like an unattached new mount, or an unattached
bind-mount. They can then later on be attached to the filesystem via
move_mount() which calls into attach_recursive_mount(). Part of
attaching it to the filesystem is making sure that mounts get correctly
propagated in case the destination mountpoint is MS_SHARED, i.e. is a
shared mountpoint. This is done by calling into propagate_mnt() which
walks the list of peers calling propagate_one() on each mount in this
list making sure it receives the propagation event.
For new mounts and bind-mounts propagate_one() knows to skip them by
realizing that there's no mount namespace attached to them.
However, detached mounts have an anonymous mount namespace attached to
them, so they aren't skipped, causing the mount table to get wonky and
ultimately preventing any new mounts via the ENOSPC issue.
So teach propagation to handle detached mounts by making it aware of
them. I've been tracking this issue down for the last couple of days by
unmounting everything in my current mount table, leaving only /proc and
/sys mounted, and running the reproducer above while verifying the
counting of the mounts. With this fix the counts are correct and the
ENOSPC issue can't be reproduced.
Fixes: 2db154b3ea8e ("vfs: syscall: Add move_mount(2) to move mounts around")
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-fsdevel(a)vger.kernel.org
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Christian Brauner <christian.brauner(a)ubuntu.com>
---
fs/pnode.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/pnode.h b/fs/pnode.h
index 26f74e092bd9..988f1aa9b02a 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -12,7 +12,7 @@
#define IS_MNT_SHARED(m) ((m)->mnt.mnt_flags & MNT_SHARED)
#define IS_MNT_SLAVE(m) ((m)->mnt_master)
-#define IS_MNT_NEW(m) (!(m)->mnt_ns)
+#define IS_MNT_NEW(m) (!(m)->mnt_ns || is_anon_ns((m)->mnt_ns))
#define CLEAR_MNT_SHARED(m) ((m)->mnt.mnt_flags &= ~MNT_SHARED)
#define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE)
#define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
base-commit: f69d02e37a85645aa90d18cacfff36dba370f797
--
2.27.0