On the following path, flush_tlb_range() can be used for zapping normal
PMD entries (PMD entries that point to page tables) together with the PTE
entries in the pointed-to page table:
collapse_pte_mapped_thp
pmdp_collapse_flush
flush_tlb_range
The arm64 version of flush_tlb_range() has a comment describing that it can
be used for page table removal, and does not use any last-level
invalidation optimizations. Fix the X86 version by making it behave the
same way.
Currently, X86 only uses this information for the following two purposes,
which I think means the issue doesn't have much impact:
- In native_flush_tlb_multi() for checking if lazy TLB CPUs need to be
IPI'd to avoid issues with speculative page table walks.
- In Hyper-V TLB paravirtualization, again for lazy TLB stuff.
The patch "x86/mm: only invalidate final translations with INVLPGB" which
is currently under review (see
<https://lore.kernel.org/all/20241230175550.4046587-13-riel@surriel.com/>)
would probably be making the impact of this a lot worse.
Cc: stable(a)vger.kernel.org
Fixes: 016c4d92cd16 ("x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range")
Signed-off-by: Jann Horn <jannh(a)google.com>
---
arch/x86/include/asm/tlbflush.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 02fc2aa06e9e0ecdba3fe948cafe5892b72e86c0..3da645139748538daac70166618d8ad95116eb74 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -242,7 +242,7 @@ void flush_tlb_multi(const struct cpumask *cpumask,
flush_tlb_mm_range((vma)->vm_mm, start, end, \
((vma)->vm_flags & VM_HUGETLB) \
? huge_page_shift(hstate_vma(vma)) \
- : PAGE_SHIFT, false)
+ : PAGE_SHIFT, true)
extern void flush_tlb_all(void);
extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
---
base-commit: aa135d1d0902c49ed45bec98c61c1b4022652b7e
change-id: 20250103-x86-collapse-flush-fix-fa87ac4d5834
--
Jann Horn <jannh(a)google.com>
This is the start of the stable review cycle for the 6.13.1 release.
There are 25 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 01 Feb 2025 13:34:42 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.13.1-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.13.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.13.1-rc1
Jack Greiner <jack(a)emoss.org>
Input: xpad - add support for wooting two he (arm)
Matheos Mattsson <matheos.mattsson(a)gmail.com>
Input: xpad - add support for Nacon Evol-X Xbox One Controller
Leonardo Brondani Schenkel <leonardo(a)schenkel.net>
Input: xpad - improve name of 8BitDo controller 2dc8:3106
Pierre-Loup A. Griffais <pgriffais(a)valvesoftware.com>
Input: xpad - add QH Electronics VID/PID
Nilton Perim Neto <niltonperimneto(a)gmail.com>
Input: xpad - add unofficial Xbox 360 wireless receiver clone
Mark Pearson <mpearson-lenovo(a)squebb.ca>
Input: atkbd - map F23 key to support default copilot shortcut
Nicolas Nobelis <nicolas(a)nobelis.eu>
Input: xpad - add support for Nacon Pro Compact
Jann Horn <jannh(a)google.com>
io_uring/rsrc: require cloned buffers to share accounting contexts
Jason Gerecke <jason.gerecke(a)wacom.com>
HID: wacom: Initialize brightness of LED trigger
Hans de Goede <hdegoede(a)redhat.com>
wifi: rtl8xxxu: add more missing rtl8192cu USB IDs
Lianqin Hu <hulianqin(a)vivo.com>
ALSA: usb-audio: Add delay quirk for USB Audio Device
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "usb: gadget: u_serial: Disable ep before setting port to null to fix the crash caused by port being null"
Qasim Ijaz <qasdev00(a)gmail.com>
USB: serial: quatech2: fix null-ptr-deref in qt2_process_read_urb()
Easwar Hariharan <eahariha(a)linux.microsoft.com>
scsi: storvsc: Ratelimit warning logs to prevent VM denial of service
Alex Williamson <alex.williamson(a)redhat.com>
vfio/platform: check the bounds of read/write syscalls
Linus Torvalds <torvalds(a)linux-foundation.org>
cachestat: fix page cache statistics permission checking
Jiri Kosina <jikos(a)kernel.org>
Revert "HID: multitouch: Add support for lenovo Y9000P Touchpad"
Jamal Hadi Salim <jhs(a)mojatatu.com>
net: sched: fix ets qdisc OOB Indexing
Paulo Alcantara <pc(a)manguebit.com>
smb: client: handle lack of EA support in smb2_query_path_info()
Chuck Lever <chuck.lever(a)oracle.com>
libfs: Use d_children list to iterate simple_offset directories
Chuck Lever <chuck.lever(a)oracle.com>
libfs: Replace simple_offset end-of-directory detection
Chuck Lever <chuck.lever(a)oracle.com>
Revert "libfs: fix infinite directory reads for offset dir"
Chuck Lever <chuck.lever(a)oracle.com>
Revert "libfs: Add simple_offset_empty()"
Chuck Lever <chuck.lever(a)oracle.com>
libfs: Return ENOSPC when the directory offset range is exhausted
Andreas Gruenbacher <agruenba(a)redhat.com>
gfs2: Truncate address space when flipping GFS2_DIF_JDATA flag
-------------
Diffstat:
Makefile | 4 +-
drivers/hid/hid-ids.h | 1 -
drivers/hid/hid-multitouch.c | 8 +-
drivers/hid/wacom_sys.c | 24 ++--
drivers/input/joystick/xpad.c | 9 +-
drivers/input/keyboard/atkbd.c | 2 +-
drivers/net/wireless/realtek/rtl8xxxu/core.c | 20 ++++
drivers/scsi/storvsc_drv.c | 8 +-
drivers/usb/gadget/function/u_serial.c | 8 +-
drivers/usb/serial/quatech2.c | 2 +-
drivers/vfio/platform/vfio_platform_common.c | 10 ++
fs/gfs2/file.c | 1 +
fs/libfs.c | 162 +++++++++++++--------------
fs/smb/client/smb2inode.c | 104 ++++++++++++-----
include/linux/fs.h | 1 -
io_uring/rsrc.c | 7 ++
mm/filemap.c | 17 +++
mm/shmem.c | 4 +-
net/sched/sch_ets.c | 2 +
sound/usb/quirks.c | 2 +
20 files changed, 251 insertions(+), 145 deletions(-)
There are two variables that indicate the interrupt type to be used
in the next test execution, "irq_type" as global and test->irq_type.
The global is referenced from pci_endpoint_test_get_irq() to preserve
the current type for ioctl(PCITEST_GET_IRQTYPE).
The type set in this function isn't reflected in the global "irq_type",
so ioctl(PCITEST_GET_IRQTYPE) returns the previous type.
As a result, the wrong type will be displayed in "pcitest" as follows:
# pcitest -i 0
SET IRQ TYPE TO LEGACY: OKAY
# pcitest -I
GET IRQ TYPE: MSI
Fix this issue by propagating the current type to the global "irq_type".
Cc: stable(a)vger.kernel.org
Fixes: b2ba9225e031 ("misc: pci_endpoint_test: Avoid using module parameter to determine irqtype")
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko(a)socionext.com>
---
drivers/misc/pci_endpoint_test.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
index a342587fc78a..33058630cd50 100644
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -742,6 +742,7 @@ static bool pci_endpoint_test_set_irq(struct pci_endpoint_test *test,
if (!pci_endpoint_test_request_irq(test))
goto err;
+ irq_type = test->irq_type;
return true;
err:
--
2.25.1
Add a sanity check to madvise_dontneed_free() to address a corner case
in madvise where a race condition causes the current vma being processed
to be backed by a different page size.
During a madvise(MADV_DONTNEED) call on a memory region registered with
a userfaultfd, there's a period of time where the process mm lock is
temporarily released in order to send a UFFD_EVENT_REMOVE and let
userspace handle the event. During this time, the vma covering the
current address range may change due to an explicit mmap done
concurrently by another thread.
If, after that change, the memory region, which was originally backed by
4KB pages, is now backed by hugepages, the end address is rounded down
to a hugepage boundary to avoid data loss (see "Fixes" below). This
rounding may cause the end address to be truncated to the same address
as the start.
Make this corner case follow the same semantics as in other similar
cases where the requested region has zero length (ie. return 0).
This will make madvise_walk_vmas() continue to the next vma in the
range (this time holding the process mm lock) which, due to the prev
pointer becoming stale because of the vma change, will be the same
hugepage-backed vma that was just checked before. The next time
madvise_dontneed_free() runs for this vma, if the start address isn't
aligned to a hugepage boundary, it'll return -EINVAL, which is also in
line with the madvise api.
From userspace perspective, madvise() will return EINVAL because the
start address isn't aligned according to the new vma alignment
requirements (hugepage), even though it was correctly page-aligned when
the call was issued.
Fixes: 8ebe0a5eaaeb ("mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ricardo Cañuelo Navarro <rcn(a)igalia.com>
---
mm/madvise.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 49f3a75046f6..c8e28d51978a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -933,7 +933,9 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
*/
end = vma->vm_end;
}
- VM_WARN_ON(start >= end);
+ if (start == end)
+ return 0;
+ VM_WARN_ON(start > end);
}
if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
--
2.48.1
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Commit 11c60f23ed13 ("integrity: Remove unused macro
IMA_ACTION_RULE_FLAGS") removed the IMA_ACTION_RULE_FLAGS mask, due to it
not being used after commit 0d73a55208e9 ("ima: re-introduce own integrity
cache lock").
However, it seems that the latter commit mistakenly used the wrong mask
when moving the code from ima_inode_post_setattr() to
process_measurement(). There is no mention in the commit message about this
change and it looks quite important, since changing from IMA_ACTIONS_FLAGS
(later renamed to IMA_NONACTION_FLAGS) to IMA_ACTION_RULE_FLAGS was done by
commit 42a4c603198f0 ("ima: fix ima_inode_post_setattr").
Restore the original change of resetting only the policy-specific flags and
not the new file status, but with new mask 0xfb000000 since the
policy-specific flags changed meanwhile. Also rename IMA_ACTION_RULE_FLAGS
to IMA_NONACTION_RULE_FLAGS, to be consistent with IMA_NONACTION_FLAGS.
Cc: stable(a)vger.kernel.org # v4.16.x
Fixes: 11c60f23ed13 ("integrity: Remove unused macro IMA_ACTION_RULE_FLAGS")
Reviewed-by: Mimi Zohar <zohar(a)linux.ibm.com>
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
---
security/integrity/ima/ima.h | 1 +
security/integrity/ima/ima_main.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index e1a3d1239bee..615900d4150d 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -141,6 +141,7 @@ struct ima_kexec_hdr {
/* IMA iint policy rule cache flags */
#define IMA_NONACTION_FLAGS 0xff000000
+#define IMA_NONACTION_RULE_FLAGS 0xfb000000
#define IMA_DIGSIG_REQUIRED 0x01000000
#define IMA_PERMIT_DIRECTIO 0x02000000
#define IMA_NEW_FILE 0x04000000
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 46adfd524dd8..7173dca20c23 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -275,7 +275,7 @@ static int process_measurement(struct file *file, const struct cred *cred,
/* reset appraisal flags if ima_inode_post_setattr was called */
iint->flags &= ~(IMA_APPRAISE | IMA_APPRAISED |
IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
- IMA_NONACTION_FLAGS);
+ IMA_NONACTION_RULE_FLAGS);
/*
* Re-evaulate the file if either the xattr has changed or the
--
2.34.1
While by default max_autoclose equals to INT_MAX / HZ, one may set
net.sctp.max_autoclose to UINT_MAX. There is code in
sctp_association_init() that can consequently trigger overflow.
Cc: stable(a)vger.kernel.org
Fixes: 9f70f46bd4c7 ("sctp: properly latch and use autoclose value from sock to association")
Signed-off-by: Nikolay Kuratov <kniv(a)yandex-team.ru>
---
net/sctp/associola.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index c45c192b7878..0b0794f164cf 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -137,7 +137,8 @@ static struct sctp_association *sctp_association_init(
= 5 * asoc->rto_max;
asoc->timeouts[SCTP_EVENT_TIMEOUT_SACK] = asoc->sackdelay;
- asoc->timeouts[SCTP_EVENT_TIMEOUT_AUTOCLOSE] = sp->autoclose * HZ;
+ asoc->timeouts[SCTP_EVENT_TIMEOUT_AUTOCLOSE] =
+ (unsigned long)sp->autoclose * HZ;
/* Initializes the timers */
for (i = SCTP_EVENT_TIMEOUT_NONE; i < SCTP_NUM_TIMEOUT_TYPES; ++i)
--
2.34.1
When attaching uretprobes to processes running inside docker, the attached
process is segfaulted when encountering the retprobe.
The reason is that now that uretprobe is a system call the default seccomp
filters in docker block it as they only allow a specific set of known
syscalls. This is true for other userspace applications which use seccomp
to control their syscall surface.
Since uretprobe is a "kernel implementation detail" system call which is
not used by userspace application code directly, it is impractical and
there's very little point in forcing all userspace applications to
explicitly allow it in order to avoid crashing tracked processes.
Pass this systemcall through seccomp without depending on configuration.
Note: uretprobe isn't supported in i386 and __NR_ia32_rt_tgsigqueueinfo
uses the same number as __NR_uretprobe so the syscall isn't forced in the
compat bitmap.
Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Rafael Buchbinder <rafi(a)rbk.io>
Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2A…
Link: https://lore.kernel.org/lkml/20250121182939.33d05470@gandalf.local.home/T/#…
Cc: stable(a)vger.kernel.org
Signed-off-by: Eyal Birger <eyal.birger(a)gmail.com>
---
v2: use action_cache bitmap and mode1 array to check the syscall
The following reproduction script synthetically demonstrates the problem
for seccomp filters:
cat > /tmp/x.c << EOF
char *syscalls[] = {
"write",
"exit_group",
"fstat",
};
__attribute__((noinline)) int probed(void)
{
printf("Probed\n");
return 1;
}
void apply_seccomp_filter(char **syscalls, int num_syscalls)
{
scmp_filter_ctx ctx;
ctx = seccomp_init(SCMP_ACT_KILL);
for (int i = 0; i < num_syscalls; i++) {
seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
seccomp_syscall_resolve_name(syscalls[i]), 0);
}
seccomp_load(ctx);
seccomp_release(ctx);
}
int main(int argc, char *argv[])
{
int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
apply_seccomp_filter(syscalls, num_syscalls);
probed();
return 0;
}
EOF
cat > /tmp/trace.bt << EOF
uretprobe:/tmp/x:probed
{
printf("ret=%d\n", retval);
}
EOF
gcc -o /tmp/x /tmp/x.c -lseccomp
/usr/bin/bpftrace /tmp/trace.bt &
sleep 5 # wait for uretprobe attach
/tmp/x
pkill bpftrace
rm /tmp/x /tmp/x.c /tmp/trace.bt
---
kernel/seccomp.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 385d48293a5f..23b594a68bc0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -734,13 +734,13 @@ seccomp_prepare_user_filter(const char __user *user_filter)
#ifdef SECCOMP_ARCH_NATIVE
/**
- * seccomp_is_const_allow - check if filter is constant allow with given data
+ * seccomp_is_filter_const_allow - check if filter is constant allow with given data
* @fprog: The BPF programs
* @sd: The seccomp data to check against, only syscall number and arch
* number are considered constant.
*/
-static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
- struct seccomp_data *sd)
+static bool seccomp_is_filter_const_allow(struct sock_fprog_kern *fprog,
+ struct seccomp_data *sd)
{
unsigned int reg_value = 0;
unsigned int pc;
@@ -812,6 +812,21 @@ static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
return false;
}
+static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
+ struct seccomp_data *sd)
+{
+#ifdef __NR_uretprobe
+ if (sd->nr == __NR_uretprobe
+#ifdef SECCOMP_ARCH_COMPAT
+ && sd->arch != SECCOMP_ARCH_COMPAT
+#endif
+ )
+ return true;
+#endif
+
+ return seccomp_is_filter_const_allow(fprog, sd);
+}
+
static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
void *bitmap, const void *bitmap_prev,
size_t bitmap_size, int arch)
@@ -1023,6 +1038,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
*/
static const int mode1_syscalls[] = {
__NR_seccomp_read, __NR_seccomp_write, __NR_seccomp_exit, __NR_seccomp_sigreturn,
+#ifdef __NR_uretprobe
+ __NR_uretprobe,
+#endif
-1, /* negative terminated */
};
--
2.43.0