From: Steven Rostedt <rostedt(a)goodmis.org>
The fix to use a per CPU buffer to read user space tested only the writes
to trace_marker. But it appears that the selftests are missing tests for
the trace_marker_raw file. The trace_marker_raw file is used by
applications that write data structures and not strings into the file,
and the tools read the raw ring buffer to process the structures they
write.
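For illustration, a minimal userspace sketch of such a writer (the struct
layout is hypothetical; only the trace_marker_raw tracefs interface is
assumed):

  #include <fcntl.h>
  #include <stdint.h>
  #include <unistd.h>

  /* Hypothetical application-defined record; the kernel treats it as
   * opaque bytes. */
  struct my_event {
      uint32_t id;
      uint64_t value;
  };

  int main(void)
  {
      struct my_event ev = { .id = 1, .value = 42 };
      int fd = open("/sys/kernel/tracing/trace_marker_raw", O_WRONLY);

      if (fd < 0)
          return 1;
      /* One write() per record; the bytes land in the ring buffer as-is. */
      write(fd, &ev, sizeof(ev));
      close(fd);
      return 0;
  }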
The fix that reads into the per CPU buffers passes the new per CPU buffer
to the trace_marker file writes, but the update to the trace_marker_raw
write path read the data from user space into the per CPU buffer and then
still passed the user space address to the function that records the data.
Pass in the per CPU buffer and not the user space address.
TODO: Add a test to better test trace_marker_raw.
Cc: stable(a)vger.kernel.org
Fixes: 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0fd582651293..bbb89206a891 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7497,12 +7497,12 @@ tracing_mark_raw_write(struct file *filp, const char __user *ubuf,
if (tr == &global_trace) {
guard(rcu)();
list_for_each_entry_rcu(tr, &marker_copies, marker_list) {
- written = write_raw_marker_to_buffer(tr, ubuf, cnt);
+ written = write_raw_marker_to_buffer(tr, buf, cnt);
if (written < 0)
break;
}
} else {
- written = write_raw_marker_to_buffer(tr, ubuf, cnt);
+ written = write_raw_marker_to_buffer(tr, buf, cnt);
}
return written;
--
2.51.0
From: Jani Nurminen <jani.nurminen(a)windriver.com>
When PCIe has been set up by the bootloader, the ecam_size field in the
E_ECAM_CONTROL register already contains a value.
The driver previously programmed it to 0xc (for 16 buses; 16 MB), but
this was bumped to 0x10 (for 256 buses; 256 MB) by commit 2fccd11518f1
("PCI: xilinx-nwl: Modify ECAM size to enable support for 256 buses").
Regardless of what the bootloader has programmed, the driver ORs in the
new maximum value without doing a proper read-modify-write (RMW)
sequence. This can lead to problems.
For example, if the bootloader programs in 0xc and the driver uses 0x10,
the ORed result is 0x1c, which is beyond the ecam_max_size limit of 0x10
(from E_ECAM_CAPABILITIES).
Avoid the problems by doing a proper RMW.
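As a sketch of the difference (field mask, shift, and helper names here
are illustrative, not the driver's actual definitions):

  #include <stdint.h>

  #define ECAM_SIZE_SHIFT 16
  #define ECAM_SIZE_MASK  (0x1fu << ECAM_SIZE_SHIFT)  /* illustrative */

  /* Buggy: OR-only update; stale bits programmed by the bootloader
   * survive, e.g. field value 0xc | 0x10 == 0x1c. */
  static uint32_t set_size_or_only(uint32_t reg, uint32_t size)
  {
      return reg | (size << ECAM_SIZE_SHIFT);
  }

  /* Fixed: proper RMW; clear the field first, then insert the new value. */
  static uint32_t set_size_rmw(uint32_t reg, uint32_t size)
  {
      reg &= ~ECAM_SIZE_MASK;
      reg |= size << ECAM_SIZE_SHIFT;
      return reg;
  }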
Fixes: 2fccd11518f1 ("PCI: xilinx-nwl: Modify ECAM size to enable support for 256 buses")
Signed-off-by: Jani Nurminen <jani.nurminen(a)windriver.com>
[mani: added stable tag]
Signed-off-by: Manivannan Sadhasivam <mani(a)kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas(a)google.com>
Cc: stable(a)vger.kernel.org
Link: https://patch.msgid.link/e83a2af2-af0b-4670-bcf5-ad408571c2b0@windriver.com
---
CR: CR-1250694
Branch: master-next-test
---
drivers/pci/controller/pcie-xilinx-nwl.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/controller/pcie-xilinx-nwl.c b/drivers/pci/controller/pcie-xilinx-nwl.c
index a91eed8812c8..63494b67e42b 100644
--- a/drivers/pci/controller/pcie-xilinx-nwl.c
+++ b/drivers/pci/controller/pcie-xilinx-nwl.c
@@ -665,9 +665,10 @@ static int nwl_pcie_bridge_init(struct nwl_pcie *pcie)
nwl_bridge_writel(pcie, nwl_bridge_readl(pcie, E_ECAM_CONTROL) |
E_ECAM_CR_ENABLE, E_ECAM_CONTROL);
- nwl_bridge_writel(pcie, nwl_bridge_readl(pcie, E_ECAM_CONTROL) |
- (NWL_ECAM_MAX_SIZE << E_ECAM_SIZE_SHIFT),
- E_ECAM_CONTROL);
+ ecam_val = nwl_bridge_readl(pcie, E_ECAM_CONTROL);
+ ecam_val &= ~E_ECAM_SIZE_LOC;
+ ecam_val |= NWL_ECAM_MAX_SIZE << E_ECAM_SIZE_SHIFT;
+ nwl_bridge_writel(pcie, ecam_val, E_ECAM_CONTROL);
nwl_bridge_writel(pcie, lower_32_bits(pcie->phys_ecam_base),
E_ECAM_BASE_LO);
--
2.44.1
Hi Geoffrey,
On 2025/10/9 7:22, Geoffrey Thorpe wrote:
> Any trivial usage of hostfs seems to be broken since commit cd140ce9
> ("hostfs: convert hostfs to use the new mount API") - I bisected it down
> to this commit to make sure.
>
Sorry to trouble you, but could you provide your mount version and
kernel version (use mount -v and uname -ar)?
Thanks,
Hongbo
> Steps to reproduce;
>
> The following assumes that the ARCH=um kernel has already been compiled
> (and the 'vmlinux' executable is in the local directory, as is the case
> when building from the top directory of a source tree). I built mine
> from a fresh clone using 'defconfig'. The uml_run.sh script creates a
> bootable root FS image (from debian, via docker) and then boots it with
> a hostfs mount to demonstrate the regression. This should be observable
> with any other bootable image though, simply pass "hostfs=<hostpath>" to
> the ./vmlinux kernel and then try to mount it from within the booted VM
> ("mount -t hostfs none <guestpath>").
>
> The following 3 text files are used, and as they're small enough for
> copy-n-paste I figured (hoped) it was best to inline them rather than
> post attachments.
>
> uml_run.sh:
> #!/bin/bash
> set -ex
> cat Dockerfile | docker build -t foobar:foobar -
> docker export -o foobar.tar \
> `docker run -d foobar:foobar /bin/true`
> dd if=/dev/zero of=rootfs.img \
> bs=$(expr 2048 \* 1024 \* 1024 / 512) count=512
> mkfs.ext4 rootfs.img
> sudo ./uml_root.sh
> cp rootfs.img temp.img
> dd if=/dev/zero of=swapfile bs=1M count=1024
> chmod 600 swapfile
> mkswap swapfile
> ./vmlinux mem=4G ubd0=temp.img rw ubd1=swapfile \
> hostfs=$(pwd)
>
> uml_root.sh:
> #!/bin/bash
> set -ex
> losetup -D
> LOOPDEVICE=$(losetup -f)
> losetup ${LOOPDEVICE} rootfs.img
> mkdir -p tmpmnt
> mount -t auto ${LOOPDEVICE} tmpmnt/
> (cd tmpmnt && tar xf ../foobar.tar)
> umount tmpmnt
> losetup -D
>
> Dockerfile:
> FROM debian:trixie
> RUN echo 'debconf debconf/frontend select Noninteractive' | \
> debconf-set-selections
> RUN apt-get update
> RUN apt-get install -y apt-utils
> RUN apt-get -y full-upgrade
> RUN echo "US/Eastern" > /etc/timezone
> RUN chmod 644 /etc/timezone
> RUN cd /etc && rm -f localtime && \
> ln -s /usr/share/zoneinfo/$$MYTZ localtime
> RUN apt-get install -y systemd-sysv kmod
> RUN echo "root:root" | chpasswd
> RUN echo "/dev/ubdb swap swap defaults 0 0" >> /etc/fstab
> RUN mkdir /hosthack
> RUN echo "none /hosthack hostfs defaults 0 0" >> /etc/fstab
> RUN systemctl set-default multi-user.target
>
> Execute ./uml_run.sh to build the rootfs image and boot the VM. This
> requires a system with docker, and will also require a sudo password
> when creating the rootfs. The boot log indicates whether the hostfs
> mount succeeds or not - the boot should degrade to emergency mode if the
> mount fails, otherwise a login prompt should indicate success. (Login is
> root:root, e.g. if you prefer to go in and shutdown the VM gracefully.)
>
> Please let me know if I can/should provide anything else.
>
> Cheers,
> Geoff
>
Commit e26ee4efbc79 ("fuse: allocate ff->release_args only if release is
needed") skips allocating ff->release_args if the server does not
implement open. However, in doing so, fuse_prepare_release() now skips
grabbing the reference on the inode, which makes it possible for an
inode to be evicted from the dcache while there are inflight readahead
requests. This causes a deadlock if the server triggers reclaim while
servicing the readahead request and reclaim attempts to evict the inode
of the file being read ahead. Since the folio is locked during
readahead, when reclaim evicts the fuse inode and fuse_evict_inode()
attempts to remove all folios associated with the inode from the page
cache (truncate_inode_pages_range()), reclaim blocks forever waiting on
the folio lock, since readahead cannot relinquish it while itself
blocked in reclaim:
>>> stack_trace(1504735)
folio_wait_bit_common (mm/filemap.c:1308:4)
folio_lock (./include/linux/pagemap.h:1052:3)
truncate_inode_pages_range (mm/truncate.c:336:10)
fuse_evict_inode (fs/fuse/inode.c:161:2)
evict (fs/inode.c:704:3)
dentry_unlink_inode (fs/dcache.c:412:3)
__dentry_kill (fs/dcache.c:615:3)
shrink_kill (fs/dcache.c:1060:12)
shrink_dentry_list (fs/dcache.c:1087:3)
prune_dcache_sb (fs/dcache.c:1168:2)
super_cache_scan (fs/super.c:221:10)
do_shrink_slab (mm/shrinker.c:435:9)
shrink_slab (mm/shrinker.c:626:10)
shrink_node (mm/vmscan.c:5951:2)
shrink_zones (mm/vmscan.c:6195:3)
do_try_to_free_pages (mm/vmscan.c:6257:3)
do_swap_page (mm/memory.c:4136:11)
handle_pte_fault (mm/memory.c:5562:10)
handle_mm_fault (mm/memory.c:5870:9)
do_user_addr_fault (arch/x86/mm/fault.c:1338:10)
handle_page_fault (arch/x86/mm/fault.c:1481:3)
exc_page_fault (arch/x86/mm/fault.c:1539:2)
asm_exc_page_fault+0x22/0x27
Fix this deadlock by allocating ff->release_args and grabbing the
reference on the inode when preparing the file for release even if the
server does not implement open. The inode reference will be dropped when
the last reference on the fuse file is dropped (see fuse_file_put() ->
fuse_release_end()).
Fixes: e26ee4efbc79 ("fuse: allocate ff->release_args only if release is needed")
Cc: stable(a)vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong(a)gmail.com>
Reported-by: Omar Sandoval <osandov(a)fb.com>
---
fs/fuse/file.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f1ef77a0be05..654e21ee93fb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -100,7 +100,7 @@ static void fuse_release_end(struct fuse_mount *fm, struct fuse_args *args,
kfree(ra);
}
-static void fuse_file_put(struct fuse_file *ff, bool sync)
+static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
{
if (refcount_dec_and_test(&ff->count)) {
struct fuse_release_args *ra = &ff->args->release_args;
@@ -110,7 +110,9 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
fuse_file_io_release(ff, ra->inode);
if (!args) {
- /* Do nothing when server does not implement 'open' */
+ /* Do nothing when server does not implement 'opendir' */
+ } else if (!isdir && ff->fm->fc->no_open) {
+ fuse_release_end(ff->fm, args, 0);
} else if (sync) {
fuse_simple_request(ff->fm, args);
fuse_release_end(ff->fm, args, 0);
@@ -131,8 +133,17 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
struct fuse_file *ff;
int opcode = isdir ? FUSE_OPENDIR : FUSE_OPEN;
bool open = isdir ? !fc->no_opendir : !fc->no_open;
+ bool release = !isdir || open;
- ff = fuse_file_alloc(fm, open);
+ /*
+ * ff->args->release_args still needs to be allocated (so we can hold an
+ * inode reference while there are pending inflight file operations when
+ * ->release() is called, see fuse_prepare_release()) even if
+ * fc->no_open is set else it becomes possible for reclaim to deadlock
+ * if while servicing the readahead request the server triggers reclaim
+ * and reclaim evicts the inode of the file being read ahead.
+ */
+ ff = fuse_file_alloc(fm, release);
if (!ff)
return ERR_PTR(-ENOMEM);
@@ -152,13 +163,14 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
fuse_file_free(ff);
return ERR_PTR(err);
} else {
- /* No release needed */
- kfree(ff->args);
- ff->args = NULL;
- if (isdir)
+ if (isdir) {
+ /* No release needed */
+ kfree(ff->args);
+ ff->args = NULL;
fc->no_opendir = 1;
- else
+ } else {
fc->no_open = 1;
+ }
}
}
@@ -363,7 +375,7 @@ void fuse_file_release(struct inode *inode, struct fuse_file *ff,
* own ref to the file, the IO completion has to drop the ref, which is
* how the fuse server can end up closing its clients' files.
*/
- fuse_file_put(ff, false);
+ fuse_file_put(ff, false, isdir);
}
void fuse_release_common(struct file *file, bool isdir)
@@ -394,7 +406,7 @@ void fuse_sync_release(struct fuse_inode *fi, struct fuse_file *ff,
{
WARN_ON(refcount_read(&ff->count) > 1);
fuse_prepare_release(fi, ff, flags, FUSE_RELEASE, true);
- fuse_file_put(ff, true);
+ fuse_file_put(ff, true, false);
}
EXPORT_SYMBOL_GPL(fuse_sync_release);
@@ -891,7 +903,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
folio_put(ap->folios[i]);
}
if (ia->ff)
- fuse_file_put(ia->ff, false);
+ fuse_file_put(ia->ff, false, false);
fuse_io_free(ia);
}
@@ -1815,7 +1827,7 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
if (wpa->bucket)
fuse_sync_bucket_dec(wpa->bucket);
- fuse_file_put(wpa->ia.ff, false);
+ fuse_file_put(wpa->ia.ff, false, false);
kfree(ap->folios);
kfree(wpa);
@@ -1968,7 +1980,7 @@ int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
ff = __fuse_write_file_get(fi);
err = fuse_flush_times(inode, ff);
if (ff)
- fuse_file_put(ff, false);
+ fuse_file_put(ff, false, false);
return err;
}
@@ -2186,7 +2198,7 @@ static int fuse_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
}
if (data->ff)
- fuse_file_put(data->ff, false);
+ fuse_file_put(data->ff, false, false);
return error;
}
--
2.47.3
The atomic variable vm_fault_info_updated is used to synchronize access to
adev->gmc.vm_fault_info between the interrupt handler and
get_vm_fault_info().
The default atomic functions like atomic_set() and atomic_read() do not
provide memory barriers. This allows for CPU instruction reordering,
meaning the memory accesses to vm_fault_info and the vm_fault_info_updated
flag are not guaranteed to occur in the intended order. This creates a
race condition that can lead to inconsistent or stale data being used.
The previous implementation, which used an explicit mb(), was incomplete
and inefficient. It failed to account for all potential CPU reorderings,
such as the access of vm_fault_info being reordered before the atomic_read
of the flag. This approach is also more verbose and less performant than
using the proper atomic functions with acquire/release semantics.
Fix this by switching to atomic_set_release() and atomic_read_acquire().
These functions provide the necessary acquire and release semantics,
which act as memory barriers to ensure the correct order of operations.
They are also more efficient and idiomatic than explicit full memory
barriers.
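As a minimal userspace illustration of the pairing (C11 atomics instead
of the kernel API, but the same acquire/release semantics):

  #include <stdatomic.h>

  static int fault_info;        /* stands in for *vm_fault_info */
  static atomic_int updated;    /* stands in for vm_fault_info_updated */

  /* Interrupt-handler role: publish the data, then release the flag. */
  void producer(void)
  {
      fault_info = 42;    /* plain store, ordered before the release */
      atomic_store_explicit(&updated, 1, memory_order_release);
  }

  /* get_vm_fault_info() role: acquire the flag, then read the data. */
  int consumer(void)
  {
      if (atomic_load_explicit(&updated, memory_order_acquire) == 1) {
          int v = fault_info;    /* guaranteed to observe the store above */
          atomic_store_explicit(&updated, 0, memory_order_release);
          return v;
      }
      return -1;
  }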
Fixes: b97dfa27ef3a ("drm/amdgpu: save vm fault information for amdkfd")
Cc: stable(a)vger.kernel.org
Signed-off-by: Gui-Dong Han <hanguidong02(a)gmail.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 5 ++---
drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 7 +++----
drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 7 +++----
3 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index b16cce7c22c3..ac09bbe51634 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2325,10 +2325,9 @@ void amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct kgd_mem *mem)
int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct amdgpu_device *adev,
struct kfd_vm_fault_info *mem)
{
- if (atomic_read(&adev->gmc.vm_fault_info_updated) == 1) {
+ if (atomic_read_acquire(&adev->gmc.vm_fault_info_updated) == 1) {
*mem = *adev->gmc.vm_fault_info;
- mb(); /* make sure read happened */
- atomic_set(&adev->gmc.vm_fault_info_updated, 0);
+ atomic_set_release(&adev->gmc.vm_fault_info_updated, 0);
}
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
index a8d5795084fc..cf30d3332050 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
@@ -1066,7 +1066,7 @@ static int gmc_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
GFP_KERNEL);
if (!adev->gmc.vm_fault_info)
return -ENOMEM;
- atomic_set(&adev->gmc.vm_fault_info_updated, 0);
+ atomic_set_release(&adev->gmc.vm_fault_info_updated, 0);
return 0;
}
@@ -1288,7 +1288,7 @@ static int gmc_v7_0_process_interrupt(struct amdgpu_device *adev,
vmid = REG_GET_FIELD(status, VM_CONTEXT1_PROTECTION_FAULT_STATUS,
VMID);
if (amdgpu_amdkfd_is_kfd_vmid(adev, vmid)
- && !atomic_read(&adev->gmc.vm_fault_info_updated)) {
+ && !atomic_read_acquire(&adev->gmc.vm_fault_info_updated)) {
struct kfd_vm_fault_info *info = adev->gmc.vm_fault_info;
u32 protections = REG_GET_FIELD(status,
VM_CONTEXT1_PROTECTION_FAULT_STATUS,
@@ -1304,8 +1304,7 @@ static int gmc_v7_0_process_interrupt(struct amdgpu_device *adev,
info->prot_read = protections & 0x8 ? true : false;
info->prot_write = protections & 0x10 ? true : false;
info->prot_exec = protections & 0x20 ? true : false;
- mb();
- atomic_set(&adev->gmc.vm_fault_info_updated, 1);
+ atomic_set_release(&adev->gmc.vm_fault_info_updated, 1);
}
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index b45fa0cea9d2..0d4c93ff6f74 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -1179,7 +1179,7 @@ static int gmc_v8_0_sw_init(struct amdgpu_ip_block *ip_block)
GFP_KERNEL);
if (!adev->gmc.vm_fault_info)
return -ENOMEM;
- atomic_set(&adev->gmc.vm_fault_info_updated, 0);
+ atomic_set_release(&adev->gmc.vm_fault_info_updated, 0);
return 0;
}
@@ -1474,7 +1474,7 @@ static int gmc_v8_0_process_interrupt(struct amdgpu_device *adev,
vmid = REG_GET_FIELD(status, VM_CONTEXT1_PROTECTION_FAULT_STATUS,
VMID);
if (amdgpu_amdkfd_is_kfd_vmid(adev, vmid)
- && !atomic_read(&adev->gmc.vm_fault_info_updated)) {
+ && !atomic_read_acquire(&adev->gmc.vm_fault_info_updated)) {
struct kfd_vm_fault_info *info = adev->gmc.vm_fault_info;
u32 protections = REG_GET_FIELD(status,
VM_CONTEXT1_PROTECTION_FAULT_STATUS,
@@ -1490,8 +1490,7 @@ static int gmc_v8_0_process_interrupt(struct amdgpu_device *adev,
info->prot_read = protections & 0x8 ? true : false;
info->prot_write = protections & 0x10 ? true : false;
info->prot_exec = protections & 0x20 ? true : false;
- mb();
- atomic_set(&adev->gmc.vm_fault_info_updated, 1);
+ atomic_set_release(&adev->gmc.vm_fault_info_updated, 1);
}
return 0;
--
2.25.1
It turned out that certain clearly invalid values passed in &xdp_desc
from userspace can pass xp_{,un}aligned_validate_desc() and then lead to
undefined behavior or to invalid frames being queued for xmit.
A desc->len close to ``U32_MAX`` combined with a non-zero
pool->tx_metadata_len can cause a positive integer overflow and
wraparound; in the same way, a low enough desc->addr with a non-zero
pool->tx_metadata_len can cause an unsigned underflow. Both scenarios
can then pass the validation successfully.
This doesn't happen with valid XSk applications, but can be used
to perform attacks.
Always promote desc->len to ``u64`` first to exclude positive
overflows of it. Use explicit check_{add,sub}_overflow() when
validating desc->addr (which is ``u64`` already).
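A minimal sketch of that pattern, using the compiler builtins the
kernel's check_{add,sub}_overflow() helpers wrap (values are
hypothetical):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint64_t addr, end;
      uint64_t desc_addr = 8;    /* lower than tx_metadata_len */
      uint64_t meta_len = 16;
      uint64_t len = 64;

      /* desc_addr - meta_len would wrap around to a huge value... */
      if (__builtin_sub_overflow(desc_addr, meta_len, &addr)) {
          puts("rejected: addr underflow");
          return 1;
      }
      /* ...and addr + len could wrap past zero near U64_MAX. */
      if (__builtin_add_overflow(addr, len, &end)) {
          puts("rejected: end overflow");
          return 1;
      }
      printf("valid range: [%llu, %llu)\n",
             (unsigned long long)addr, (unsigned long long)end);
      return 0;
  }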
bloat-o-meter reports a little growth of the code size:
add/remove: 0/0 grow/shrink: 2/1 up/down: 60/-16 (44)
Function                           old     new   delta
xskq_cons_peek_desc                299     330     +31
xsk_tx_peek_release_desc_batch     973    1002     +29
xsk_generic_xmit                  3148    3132     -16
but hopefully this doesn't hurt the performance much.
Fixes: 341ac980eab9 ("xsk: Support tx_metadata_len")
Cc: stable(a)vger.kernel.org # 6.8+
Signed-off-by: Alexander Lobakin <aleksander.lobakin(a)intel.com>
---
net/xdp/xsk_queue.h | 45 +++++++++++++++++++++++++++++++++++----------
1 file changed, 35 insertions(+), 10 deletions(-)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index f16f390370dc..1eb8d9f8b104 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -143,14 +143,24 @@ static inline bool xp_unused_options_set(u32 options)
static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
- u64 addr = desc->addr - pool->tx_metadata_len;
- u64 len = desc->len + pool->tx_metadata_len;
- u64 offset = addr & (pool->chunk_size - 1);
+ u64 len = desc->len;
+ u64 addr, offset;
- if (!desc->len)
+ if (!len)
return false;
- if (offset + len > pool->chunk_size)
+ /* Can overflow if desc->addr < pool->tx_metadata_len */
+ if (check_sub_overflow(desc->addr, pool->tx_metadata_len, &addr))
+ return false;
+
+ offset = addr & (pool->chunk_size - 1);
+
+ /*
+ * Can't overflow: @offset is guaranteed to be < ``U32_MAX``
+ * (pool->chunk_size is ``u32``), @len is guaranteed
+ * to be <= ``U32_MAX``.
+ */
+ if (offset + len + pool->tx_metadata_len > pool->chunk_size)
return false;
if (addr >= pool->addrs_cnt)
@@ -158,27 +168,42 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
if (xp_unused_options_set(desc->options))
return false;
+
return true;
}
static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
- u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
- u64 len = desc->len + pool->tx_metadata_len;
+ u64 len = desc->len;
+ u64 addr, end;
- if (!desc->len)
+ if (!len)
return false;
+ /* Can't overflow: @len is guaranteed to be <= ``U32_MAX`` */
+ len += pool->tx_metadata_len;
if (len > pool->chunk_size)
return false;
- if (addr >= pool->addrs_cnt || addr + len > pool->addrs_cnt ||
- xp_desc_crosses_non_contig_pg(pool, addr, len))
+ /* Can overflow if desc->addr is close to 0 */
+ if (check_sub_overflow(xp_unaligned_add_offset_to_addr(desc->addr),
+ pool->tx_metadata_len, &addr))
+ return false;
+
+ if (addr >= pool->addrs_cnt)
+ return false;
+
+ /* Can overflow if pool->addrs_cnt is high enough */
+ if (check_add_overflow(addr, len, &end) || end > pool->addrs_cnt)
+ return false;
+
+ if (xp_desc_crosses_non_contig_pg(pool, addr, len))
return false;
if (xp_unused_options_set(desc->options))
return false;
+
return true;
}
--
2.51.0
Assuming the disk layout below,
disk0: 0 --- 0x00035abfff
disk1: 0x00035ac000 --- 0x00037abfff
disk2: 0x00037ac000 --- 0x00037ebfff
and we want to read data at offset=13568 with len=128 across the block
devices, we can illustrate the block addresses as below.
0 .. 0x00037ac000 ------------------- 0x00037ebfff, 0x00037ec000 -------
| ^ ^ ^
| fofs 0 13568 13568+128
| ------------------------------------------------------
| LBA 0x37e8aa9 0x37ebfa9 0x37ec029
--- map 0x3caa9 0x3ffa9
In this example, we should give the relative map of the target block device
ranging from 0x3caa9 to 0x3ffa9 where the length should be calculated by
0x37ebfff + 1 - 0x37ebfa9.
In the equation below, however, map->m_pblk needs to be the original
(absolute) address, not the one already rebased onto the target block
device:
- map->m_len = min(map->m_len, dev->end_blk + 1 - map->m_pblk);
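A minimal arithmetic sketch of why the ordering matters, plugging in the
numbers above (types simplified; only the values come from the example):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint64_t start_blk = 0x00037ac000, end_blk = 0x00037ebfff; /* disk2 */
      uint64_t m_pblk = 0x37ebfa9;    /* absolute LBA from the example */
      uint64_t m_len = 0x80;          /* 128 blocks requested */

      /* Buggy order: m_pblk made device-relative before the clamp. */
      uint64_t rel = m_pblk - start_blk;            /* 0x3ffa9 */
      uint64_t bad_cap = end_blk + 1 - rel;         /* 0x37ac057: bogus cap */
      uint64_t bad_len = m_len < bad_cap ? m_len : bad_cap; /* 0x80: not clamped */

      /* Fixed order: clamp against the absolute address first. */
      uint64_t cap = end_blk + 1 - m_pblk;          /* 0x57 blocks left on disk2 */
      uint64_t good_len = m_len < cap ? m_len : cap;    /* 0x57 */

      printf("bad_len=0x%llx good_len=0x%llx\n",
             (unsigned long long)bad_len, (unsigned long long)good_len);
      return 0;
  }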
Cc: stable(a)vger.kernel.org
Fixes: 71f2c8206202 ("f2fs: multidevice: support direct IO")
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
---
fs/f2fs/data.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index ef38e62cda8f..775aa4f63aa3 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1497,8 +1497,8 @@ static bool f2fs_map_blocks_cached(struct inode *inode,
struct f2fs_dev_info *dev = &sbi->devs[bidx];
map->m_bdev = dev->bdev;
- map->m_pblk -= dev->start_blk;
map->m_len = min(map->m_len, dev->end_blk + 1 - map->m_pblk);
+ map->m_pblk -= dev->start_blk;
} else {
map->m_bdev = inode->i_sb->s_bdev;
}
--
2.51.0.710.ga91ca5db03-goog
From: Paul Aurich <paul(a)darkrain42.org>
commit a9685b409a03b73d2980bbfa53eb47555802d0a9 upstream.
If open_cached_dir() encounters an error parsing the lease from the
server, the error handling may race with receiving a lease break,
resulting in open_cached_dir() freeing the cfid while the queued work is
pending.
Update open_cached_dir() to drop refs rather than directly freeing the
cfid.
Have cached_dir_lease_break(), cfids_laundromat_worker(), and
invalidate_all_cached_dirs() clear has_lease immediately while still
holding cfids->cfid_list_lock, and then use this to also simplify the
reference counting in cfids_laundromat_worker() and
invalidate_all_cached_dirs().
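The underlying discipline is the usual kref pattern: an object that
queued work may still reference must never be freed directly; each path
drops only its own reference and the final put runs the release. A
minimal single-threaded sketch of that pattern (illustrative, not the
actual cifs structures):

  #include <stdio.h>
  #include <stdlib.h>

  struct obj {
      int refcount;    /* stands in for cfid->refcount (a kref) */
  };

  static void release(struct obj *o)    /* smb2_close_cached_fid role */
  {
      printf("released\n");
      free(o);
  }

  static void put(struct obj *o)
  {
      if (--o->refcount == 0)
          release(o);    /* only the last holder frees */
  }

  int main(void)
  {
      struct obj *o = malloc(sizeof(*o));

      if (!o)
          return 1;
      o->refcount = 2;    /* caller + queued lease-break work */
      put(o);             /* error path drops its ref; no free yet */
      put(o);             /* the queued work's put actually frees */
      return 0;
  }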
Fixes this KASAN splat (which manually injects an error and lease break
in open_cached_dir()):
==================================================================
BUG: KASAN: slab-use-after-free in smb2_cached_lease_break+0x27/0xb0
Read of size 8 at addr ffff88811cc24c10 by task kworker/3:1/65
CPU: 3 UID: 0 PID: 65 Comm: kworker/3:1 Not tainted 6.12.0-rc6-g255cf264e6e5-dirty #87
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Workqueue: cifsiod smb2_cached_lease_break
Call Trace:
<TASK>
dump_stack_lvl+0x77/0xb0
print_report+0xce/0x660
kasan_report+0xd3/0x110
smb2_cached_lease_break+0x27/0xb0
process_one_work+0x50a/0xc50
worker_thread+0x2ba/0x530
kthread+0x17c/0x1c0
ret_from_fork+0x34/0x60
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 2464:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
open_cached_dir+0xa7d/0x1fb0
smb2_query_path_info+0x43c/0x6e0
cifs_get_fattr+0x346/0xf10
cifs_get_inode_info+0x157/0x210
cifs_revalidate_dentry_attr+0x2d1/0x460
cifs_getattr+0x173/0x470
vfs_statx_path+0x10f/0x160
vfs_statx+0xe9/0x150
vfs_fstatat+0x5e/0xc0
__do_sys_newfstatat+0x91/0xf0
do_syscall_64+0x95/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 2464:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x51/0x70
kfree+0x174/0x520
open_cached_dir+0x97f/0x1fb0
smb2_query_path_info+0x43c/0x6e0
cifs_get_fattr+0x346/0xf10
cifs_get_inode_info+0x157/0x210
cifs_revalidate_dentry_attr+0x2d1/0x460
cifs_getattr+0x173/0x470
vfs_statx_path+0x10f/0x160
vfs_statx+0xe9/0x150
vfs_fstatat+0x5e/0xc0
__do_sys_newfstatat+0x91/0xf0
do_syscall_64+0x95/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Last potentially related work creation:
kasan_save_stack+0x33/0x60
__kasan_record_aux_stack+0xad/0xc0
insert_work+0x32/0x100
__queue_work+0x5c9/0x870
queue_work_on+0x82/0x90
open_cached_dir+0x1369/0x1fb0
smb2_query_path_info+0x43c/0x6e0
cifs_get_fattr+0x346/0xf10
cifs_get_inode_info+0x157/0x210
cifs_revalidate_dentry_attr+0x2d1/0x460
cifs_getattr+0x173/0x470
vfs_statx_path+0x10f/0x160
vfs_statx+0xe9/0x150
vfs_fstatat+0x5e/0xc0
__do_sys_newfstatat+0x91/0xf0
do_syscall_64+0x95/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The buggy address belongs to the object at ffff88811cc24c00
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 16 bytes inside of
freed 1024-byte region [ffff88811cc24c00, ffff88811cc25000)
Cc: stable(a)vger.kernel.org
Signed-off-by: Paul Aurich <paul(a)darkrain42.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
[ Do not apply the change for cfids_laundromat_worker() since that
  function and the related feature do not exist on 6.1.y. Update
  open_cached_dir() following the method of the upstream patch. ]
Signed-off-by: Cliff Liu <donghua.liu(a)windriver.com>
Signed-off-by: He Zhe <Zhe.He(a)windriver.com>
[Shivani: Modified to apply on 6.1.y]
Signed-off-by: Shivani Agarwal <shivani.agarwal(a)broadcom.com>
---
fs/smb/client/cached_dir.c | 39 ++++++++++++++++----------------------
1 file changed, 16 insertions(+), 23 deletions(-)
diff --git a/fs/smb/client/cached_dir.c b/fs/smb/client/cached_dir.c
index 3d028b6a2..23a57a0c8 100644
--- a/fs/smb/client/cached_dir.c
+++ b/fs/smb/client/cached_dir.c
@@ -320,17 +320,13 @@ int open_cached_dir(unsigned int xid, struct cifs_tcon *tcon,
/*
* We are guaranteed to have two references at this point.
* One for the caller and one for a potential lease.
- * Release the Lease-ref so that the directory will be closed
- * when the caller closes the cached handle.
+ * Release one here, and the second below.
*/
kref_put(&cfid->refcount, smb2_close_cached_fid);
}
if (rc) {
- if (cfid->is_open)
- SMB2_close(0, cfid->tcon, cfid->fid.persistent_fid,
- cfid->fid.volatile_fid);
- free_cached_dir(cfid);
- cfid = NULL;
+ cfid->has_lease = false;
+ kref_put(&cfid->refcount, smb2_close_cached_fid);
}
if (rc == 0) {
@@ -462,25 +458,24 @@ void invalidate_all_cached_dirs(struct cifs_tcon *tcon)
cfids->num_entries--;
cfid->is_open = false;
cfid->on_list = false;
- /* To prevent race with smb2_cached_lease_break() */
- kref_get(&cfid->refcount);
+ if (cfid->has_lease) {
+ /*
+ * The lease was never cancelled from the server,
+ * so steal that reference.
+ */
+ cfid->has_lease = false;
+ } else
+ kref_get(&cfid->refcount);
}
spin_unlock(&cfids->cfid_list_lock);
list_for_each_entry_safe(cfid, q, &entry, entry) {
list_del(&cfid->entry);
cancel_work_sync(&cfid->lease_break);
- if (cfid->has_lease) {
- /*
- * We lease was never cancelled from the server so we
- * need to drop the reference.
- */
- spin_lock(&cfids->cfid_list_lock);
- cfid->has_lease = false;
- spin_unlock(&cfids->cfid_list_lock);
- kref_put(&cfid->refcount, smb2_close_cached_fid);
- }
- /* Drop the extra reference opened above*/
+ /*
+ * Drop the ref-count from above, either the lease-ref (if there
+ * was one) or the extra one acquired.
+ */
kref_put(&cfid->refcount, smb2_close_cached_fid);
}
}
@@ -491,9 +486,6 @@ smb2_cached_lease_break(struct work_struct *work)
struct cached_fid *cfid = container_of(work,
struct cached_fid, lease_break);
- spin_lock(&cfid->cfids->cfid_list_lock);
- cfid->has_lease = false;
- spin_unlock(&cfid->cfids->cfid_list_lock);
kref_put(&cfid->refcount, smb2_close_cached_fid);
}
@@ -511,6 +503,7 @@ int cached_dir_lease_break(struct cifs_tcon *tcon, __u8 lease_key[16])
!memcmp(lease_key,
cfid->fid.lease_key,
SMB2_LEASE_KEY_SIZE)) {
+ cfid->has_lease = false;
cfid->time = 0;
/*
* We found a lease remove it from the list
--
2.40.4