From: Vivek Goyal <vgoyal(a)redhat.com>
[ Upstream commit 3f9b9efd82a84f27e95d0414f852caf1fa839e83 ]
Right now "mount -t virtiofs -o dax myfs /mnt/virtiofs" succeeds even
if filesystem deivce does not have a cache window and hence DAX can't
be supported.
This gives a false sense to user that they are using DAX with virtiofs
but fact of the matter is that they are not.
Fix this by returning error if dax can't be supported and user has asked
for it.
Signed-off-by: Vivek Goyal <vgoyal(a)redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha(a)redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
fs/fuse/virtio_fs.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8868ac31a3c0..4ee6f734ba83 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1324,8 +1324,15 @@ static int virtio_fs_fill_super(struct super_block *sb, struct fs_context *fsc)
/* virtiofs allocates and installs its own fuse devices */
ctx->fudptr = NULL;
- if (ctx->dax)
+ if (ctx->dax) {
+ if (!fs->dax_dev) {
+ err = -EINVAL;
+ pr_err("virtio-fs: dax can't be enabled as filesystem"
+ " device does not support it.\n");
+ goto err_free_fuse_devs;
+ }
ctx->dax_dev = fs->dax_dev;
+ }
err = fuse_fill_super_common(sb, ctx);
if (err < 0)
goto err_free_fuse_devs;
--
2.30.1
commit ec2a29593c83ed71a7f16e3243941ebfcf75fdf6 upstream.
5fdc7db644 ("module: setup load info before module_sig_check()")
moved the ELF setup, so that it was done before the signature
check. This made the module name available to signature error
messages.
However, the checks for ELF correctness in setup_load_info
are not sufficient to prevent bad memory references due to
corrupted offset fields, indices, etc.
So, there's a regression in behavior here: a corrupt and unsigned
(or badly signed) module, which might previously have been rejected
immediately, can now cause an oops/crash.
Harden ELF handling for module loading by doing the following:
- Move the signature check back up so that it comes before ELF
initialization. It's best to do the signature check to see
if we can trust the module, before using the ELF structures
inside it. This also makes checks against info->len
more accurate again, as this field will be reduced by the
length of the signature in mod_check_sig().
The module name is now once again not available for error
messages during the signature check, but that seems like
a fair tradeoff.
- Check if sections have offset / size fields that at least don't
exceed the length of the module.
- Check if sections have section name offsets that don't fall
outside the section name table.
- Add a few other sanity checks against invalid section indices,
etc.
This is not an exhaustive consistency check, but the idea is to
at least get through the signature and blacklist checks without
crashing because of corrupted ELF info, and to error out gracefully
for most issues that would have caused problems later on.
Fixes: 5fdc7db6448a ("module: setup load info before module_sig_check()")
Signed-off-by: Frank van der Linden <fllinden(a)amazon.com>
Signed-off-by: Jessica Yu <jeyu(a)kernel.org>
[fllinden(a)amazon.com: backport to 5.4]
---
kernel/module.c | 141 +++++++++++++++++++++++++++++++++-----
kernel/module_signature.c | 2 +-
kernel/module_signing.c | 2 +-
3 files changed, 126 insertions(+), 19 deletions(-)
diff --git a/kernel/module.c b/kernel/module.c
index ab1f97cfe18d..b8c9989dc0e3 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2938,9 +2938,33 @@ static int module_sig_check(struct load_info *info, int flags)
}
#endif /* !CONFIG_MODULE_SIG */
-/* Sanity checks against invalid binaries, wrong arch, weird elf version. */
-static int elf_header_check(struct load_info *info)
+static int validate_section_offset(struct load_info *info, Elf_Shdr *shdr)
{
+ unsigned long secend;
+
+ /*
+ * Check for both overflow and offset/size being
+ * too large.
+ */
+ secend = shdr->sh_offset + shdr->sh_size;
+ if (secend < shdr->sh_offset || secend > info->len)
+ return -ENOEXEC;
+
+ return 0;
+}
+
+/*
+ * Sanity checks against invalid binaries, wrong arch, weird elf version.
+ *
+ * Also do basic validity checks against section offsets and sizes, the
+ * section name string table, and the indices used for it (sh_name).
+ */
+static int elf_validity_check(struct load_info *info)
+{
+ unsigned int i;
+ Elf_Shdr *shdr, *strhdr;
+ int err;
+
if (info->len < sizeof(*(info->hdr)))
return -ENOEXEC;
@@ -2950,11 +2974,78 @@ static int elf_header_check(struct load_info *info)
|| info->hdr->e_shentsize != sizeof(Elf_Shdr))
return -ENOEXEC;
+ /*
+ * e_shnum is 16 bits, and sizeof(Elf_Shdr) is
+ * known and small. So e_shnum * sizeof(Elf_Shdr)
+ * will not overflow unsigned long on any platform.
+ */
if (info->hdr->e_shoff >= info->len
|| (info->hdr->e_shnum * sizeof(Elf_Shdr) >
info->len - info->hdr->e_shoff))
return -ENOEXEC;
+ info->sechdrs = (void *)info->hdr + info->hdr->e_shoff;
+
+ /*
+ * Verify if the section name table index is valid.
+ */
+ if (info->hdr->e_shstrndx == SHN_UNDEF
+ || info->hdr->e_shstrndx >= info->hdr->e_shnum)
+ return -ENOEXEC;
+
+ strhdr = &info->sechdrs[info->hdr->e_shstrndx];
+ err = validate_section_offset(info, strhdr);
+ if (err < 0)
+ return err;
+
+ /*
+ * The section name table must be NUL-terminated, as required
+ * by the spec. This makes strcmp and pr_* calls that access
+ * strings in the section safe.
+ */
+ info->secstrings = (void *)info->hdr + strhdr->sh_offset;
+ if (info->secstrings[strhdr->sh_size - 1] != '\0')
+ return -ENOEXEC;
+
+ /*
+ * The code assumes that section 0 has a length of zero and
+ * an addr of zero, so check for it.
+ */
+ if (info->sechdrs[0].sh_type != SHT_NULL
+ || info->sechdrs[0].sh_size != 0
+ || info->sechdrs[0].sh_addr != 0)
+ return -ENOEXEC;
+
+ for (i = 1; i < info->hdr->e_shnum; i++) {
+ shdr = &info->sechdrs[i];
+ switch (shdr->sh_type) {
+ case SHT_NULL:
+ case SHT_NOBITS:
+ continue;
+ case SHT_SYMTAB:
+ if (shdr->sh_link == SHN_UNDEF
+ || shdr->sh_link >= info->hdr->e_shnum)
+ return -ENOEXEC;
+ fallthrough;
+ default:
+ err = validate_section_offset(info, shdr);
+ if (err < 0) {
+ pr_err("Invalid ELF section in module (section %u type %u)\n",
+ i, shdr->sh_type);
+ return err;
+ }
+
+ if (shdr->sh_flags & SHF_ALLOC) {
+ if (shdr->sh_name >= strhdr->sh_size) {
+ pr_err("Invalid ELF section name in module (section %u type %u)\n",
+ i, shdr->sh_type);
+ return -ENOEXEC;
+ }
+ }
+ break;
+ }
+ }
+
return 0;
}
@@ -3051,11 +3142,6 @@ static int rewrite_section_headers(struct load_info *info, int flags)
for (i = 1; i < info->hdr->e_shnum; i++) {
Elf_Shdr *shdr = &info->sechdrs[i];
- if (shdr->sh_type != SHT_NOBITS
- && info->len < shdr->sh_offset + shdr->sh_size) {
- pr_err("Module len %lu truncated\n", info->len);
- return -ENOEXEC;
- }
/* Mark all sections sh_addr with their address in the
temporary image. */
@@ -3087,11 +3173,6 @@ static int setup_load_info(struct load_info *info, int flags)
{
unsigned int i;
- /* Set up the convenience variables */
- info->sechdrs = (void *)info->hdr + info->hdr->e_shoff;
- info->secstrings = (void *)info->hdr
- + info->sechdrs[info->hdr->e_shstrndx].sh_offset;
-
/* Try to find a name early so we can log errors with a module name */
info->index.info = find_sec(info, ".modinfo");
if (info->index.info)
@@ -3819,23 +3900,49 @@ static int load_module(struct load_info *info, const char __user *uargs,
long err = 0;
char *after_dashes;
- err = elf_header_check(info);
+ /*
+ * Do the signature check (if any) first. All that
+ * the signature check needs is info->len, it does
+ * not need any of the section info. That can be
+ * set up later. This will minimize the chances
+ * of a corrupt module causing problems before
+ * we even get to the signature check.
+ *
+ * The check will also adjust info->len by stripping
+ * off the sig length at the end of the module, making
+ * checks against info->len more correct.
+ */
+ err = module_sig_check(info, flags);
if (err)
goto free_copy;
+ /*
+ * Do basic sanity checks against the ELF header and
+ * sections.
+ */
+ err = elf_validity_check(info);
+ if (err) {
+ pr_err("Module has invalid ELF structures\n");
+ goto free_copy;
+ }
+
+ /*
+ * Everything checks out, so set up the section info
+ * in the info structure.
+ */
err = setup_load_info(info, flags);
if (err)
goto free_copy;
+ /*
+ * Now that we know we have the correct module name, check
+ * if it's blacklisted.
+ */
if (blacklisted(info->name)) {
err = -EPERM;
goto free_copy;
}
- err = module_sig_check(info, flags);
- if (err)
- goto free_copy;
-
err = rewrite_section_headers(info, flags);
if (err)
goto free_copy;
diff --git a/kernel/module_signature.c b/kernel/module_signature.c
index 4224a1086b7d..00132d12487c 100644
--- a/kernel/module_signature.c
+++ b/kernel/module_signature.c
@@ -25,7 +25,7 @@ int mod_check_sig(const struct module_signature *ms, size_t file_len,
return -EBADMSG;
if (ms->id_type != PKEY_ID_PKCS7) {
- pr_err("%s: Module is not signed with expected PKCS#7 message\n",
+ pr_err("%s: not signed with expected PKCS#7 message\n",
name);
return -ENOPKG;
}
diff --git a/kernel/module_signing.c b/kernel/module_signing.c
index 9d9fc678c91d..8723ae70ea1f 100644
--- a/kernel/module_signing.c
+++ b/kernel/module_signing.c
@@ -30,7 +30,7 @@ int mod_verify_sig(const void *mod, struct load_info *info)
memcpy(&ms, mod + (modlen - sizeof(ms)), sizeof(ms));
- ret = mod_check_sig(&ms, modlen, info->name);
+ ret = mod_check_sig(&ms, modlen, "module");
if (ret)
return ret;
--
2.23.3
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From eb85890b29e4d7ae1accdcfba35ed8b16ba9fb97 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 25 Feb 2021 10:13:29 -0700
Subject: [PATCH] io_uring: ensure SQPOLL startup is triggered before error
shutdown
syzbot reports the following hang:
INFO: task syz-executor.0:12538 can't die for more than 143 seconds.
task:syz-executor.0 state:D stack:28352 pid:12538 ppid: 8423 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4324 [inline]
__schedule+0x90c/0x21a0 kernel/sched/core.c:5075
schedule+0xcf/0x270 kernel/sched/core.c:5154
schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common kernel/sched/completion.c:106 [inline]
wait_for_common kernel/sched/completion.c:117 [inline]
wait_for_completion+0x168/0x270 kernel/sched/completion.c:138
io_sq_thread_finish+0x96/0x580 fs/io_uring.c:7152
io_sq_offload_create fs/io_uring.c:7929 [inline]
io_uring_create fs/io_uring.c:9465 [inline]
io_uring_setup+0x1fb2/0x2c20 fs/io_uring.c:9550
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
which is due to exiting after the SQPOLL thread has been created, but
hasn't been started yet. Ensure that we always complete the startup
side when waiting for it to exit.
Reported-by: syzbot+c927c937cba8ef66dd4a(a)syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index fbc85afa9a87..ef743594d34a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7141,6 +7141,7 @@ static void io_sq_thread_finish(struct io_ring_ctx *ctx)
struct io_sq_data *sqd = ctx->sq_data;
if (sqd) {
+ complete(&sqd->startup);
if (sqd->thread) {
wait_for_completion(&ctx->sq_thread_comp);
io_sq_thread_park(sqd);
@@ -7927,7 +7928,7 @@ static void io_sq_offload_start(struct io_ring_ctx *ctx)
{
struct io_sq_data *sqd = ctx->sq_data;
- if ((ctx->flags & IORING_SETUP_SQPOLL) && sqd->thread)
+ if (ctx->flags & IORING_SETUP_SQPOLL)
complete(&sqd->startup);
}
commit ee7febce051945be28ad86d16a15886f878204de upstream.
Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
linear map range is not checked correctly.
The start physical address that linear map covers can be actually at the
end of the range because of randomization. Check that and if so reduce it
to 0.
This can be verified on QEMU with setting kaslr-seed to ~0ul:
memstart_offset_seed = 0xffff
START: __pa(_PAGE_OFFSET(vabits_actual)) = ffff9000c0000000
END: __pa(PAGE_END - 1) = 1000bfffffff
Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating linear mapping")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Tested-by: Tyler Hicks <tyhicks(a)linux.microsoft.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
---
arch/arm64/mm/mmu.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6f0648777d34..ee01f421e1e4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1445,14 +1445,30 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
static bool inside_linear_region(u64 start, u64 size)
{
+ u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
+ u64 end_linear_pa = __pa(PAGE_END - 1);
+
+ if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
+ /*
+ * Check for a wrap, it is possible because of randomized linear
+ * mapping the start physical address is actually bigger than
+ * the end physical address. In this case set start to zero
+ * because [0, end_linear_pa] range must still be able to cover
+ * all addressable physical addresses.
+ */
+ if (start_linear_pa > end_linear_pa)
+ start_linear_pa = 0;
+ }
+
+ WARN_ON(start_linear_pa > end_linear_pa);
+
/*
* Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
* accommodating both its ends but excluding PAGE_END. Max physical
* range which can be mapped inside this linear mapping range, must
* also be derived from its end points.
*/
- return start >= __pa(_PAGE_OFFSET(vabits_actual)) &&
- (start + size - 1) <= __pa(PAGE_END - 1);
+ return start >= start_linear_pa && (start + size - 1) <= end_linear_pa;
}
int arch_add_memory(int nid, u64 start, u64 size,
--
2.25.1
[Backport of commit 1f935e8e72ec28dddb2dc0650b3b6626a293d94b to all
stable branches from 4.4 to 5.4, inclusive]
For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.
Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.
Cc: stable(a)vger.kernel.org
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/vmw_vsock/af_vsock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5d323574d04f..c82e7b52ab1f 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -620,6 +620,7 @@ struct sock *__vsock_create(struct net *net,
vsk->trusted = psk->trusted;
vsk->owner = get_cred(psk->owner);
vsk->connect_timeout = psk->connect_timeout;
+ security_sk_clone(parent, sk);
} else {
vsk->trusted = ns_capable_noaudit(&init_user_ns, CAP_NET_ADMIN);
vsk->owner = get_current_cred();
--
2.31.0.291.g576ba9dcdaf-goog
If the initial value of `num_entires` (calculated at line 1654) is not
an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
value greater than the initial value will be assigned to it. That causes
`start > last + 1` after line 1708. Then in the next iteration an
underflow happens at line 1654. It causes message
*ERROR* Couldn't update BO_VA (-12)
printed in kernel log, and GPU hanging.
Fortify the criteria of the loop to fix this issue.
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
Reported-by: Xi Ruoyao <xry111(a)mengyan1223.wang>
Reported-by: Dan Horák <dan(a)danny.cz>
Cc: stable(a)vger.kernel.org
Signed-off-by: Xi Ruoyao <xry111(a)mengyan1223.wang>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index ad91c0c3c423..cee0cc9c8085 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
}
start = tmp;
- } while (unlikely(start != last + 1));
+ } while (unlikely(start < last + 1));
r = vm->update_funcs->commit(¶ms, fence);
base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
--
2.31.1
The MIPS FPU may have 3 mode:
FR=0: MIPS I style, all of the FPR are single.
FR=1: all 32 FPR can be double.
FRE: redirecting the rw of odd-FPR to the upper 32bit of even-double FPR.
The binary may have 3 mode:
FP32: can only work with FR=0 and FRE mode
FPXX: can work with all of FR=0/FR=1/FRE mode.
FP64: can only work with FR=1 mode
Some binary, for example the output of golang, may be mark as FPXX,
while in fact they are FP32. It is caused by the bug of design and linker:
Object produced by pure Go has no FP annotation while in fact they are FP32;
if we link them with the C module which marked as FPXX,
the result will be marked as FPXX. If these fake-FPXX binaries is executed
in FR=1 mode, some problem will happen.
In Golang, now we add the FP32 annotation, so the future golang programs
won't have this problem. While for the existing binaries, we need a
kernel workaround.
Currently, FR=1 mode is used for all FPXX binary if O32_FP64 supported is enabled,
it makes some wrong behivour of the binaries.
Since FPXX binary can work with both FR=1 and FR=0, we force it to use FR=0.
Reference:
https://web.archive.org/web/20180828210612/https://dmz-portal.mips.com/wiki…https://go-review.googlesource.com/c/go/+/239217https://go-review.googlesource.com/c/go/+/237058
Signed-off-by: YunQiang Su <yunqiang.su(a)cipunited.com>
Cc: stable(a)vger.kernel.org # 4.19+
---
v7->v8->v9:
Rollback to use FR=1 for FPXX on R6 CPU.
v6->v7:
Use FRE mode for pre-R6 binaries on R6 CPU.
v5->v6:
Rollback to V3, aka remove config option.
v4->v5:
Fix CONFIG_MIPS_O32_FPXX_USE_FR0 usage: if -> ifdef
v3->v4:
introduce a config option: CONFIG_MIPS_O32_FPXX_USE_FR0
v2->v3:
commit message: add Signed-off-by and Cc to stable.
v1->v2:
Fix bad commit message: in fact, we are switching to FR=0
arch/mips/kernel/elf.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/arch/mips/kernel/elf.c b/arch/mips/kernel/elf.c
index 7b045d2a0b51..554561d5c1f8 100644
--- a/arch/mips/kernel/elf.c
+++ b/arch/mips/kernel/elf.c
@@ -232,11 +232,16 @@ int arch_check_elf(void *_ehdr, bool has_interpreter, void *_interp_ehdr,
* that inherently require the hybrid FP mode.
* - If FR1 and FRDEFAULT is true, that means we hit the any-abi or
* fpxx case. This is because, in any-ABI (or no-ABI) we have no FPU
- * instructions so we don't care about the mode. We will simply use
- * the one preferred by the hardware. In fpxx case, that ABI can
- * handle both FR=1 and FR=0, so, again, we simply choose the one
- * preferred by the hardware. Next, if we only use single-precision
- * FPU instructions, and the default ABI FPU mode is not good
+ * instructions so we don't care about the mode.
+ * In fpxx case, that ABI can handle all of FR=1/FR=0/FRE mode.
+ * Here, we need to use FR=0 mode instead of FR=1, because some binaries
+ * may be mark as FPXX by mistake due to bugs of design and linker:
+ * The object produced by pure Go has no FP annotation,
+ * then is treated as any-ABI by linker, although in fact they are FP32;
+ * if any-ABI object is linked with FPXX object, the result will be mark as FPXX.
+ * Then the problem happens: run FP32 binaries in FR=1 mode.
+ * - If we only use single-precision FPU instructions,
+ * and the default ABI FPU mode is not good
* (ie single + any ABI combination), we set again the FPU mode to the
* one is preferred by the hardware. Next, if we know that the code
* will only use single-precision instructions, shown by single being
@@ -248,8 +253,9 @@ int arch_check_elf(void *_ehdr, bool has_interpreter, void *_interp_ehdr,
*/
if (prog_req.fre && !prog_req.frdefault && !prog_req.fr1)
state->overall_fp_mode = FP_FRE;
- else if ((prog_req.fr1 && prog_req.frdefault) ||
- (prog_req.single && !prog_req.frdefault))
+ else if (prog_req.fr1 && prog_req.frdefault)
+ state->overall_fp_mode = cpu_has_mips_r6 ? FP_FR1 : FP_FR0;
+ else if (prog_req.single && !prog_req.frdefault)
/* Make sure 64-bit MIPS III/IV/64R1 will not pick FR1 */
state->overall_fp_mode = ((raw_current_cpu_data.fpu_id & MIPS_FPIR_F64) &&
cpu_has_mips_r2_r6) ?
--
2.20.1
On !ARCH_SUPPORTS_DEBUG_PAGEALLOC (like ia64) debug_pagealloc=1
implies page_poison=on:
if (page_poisoning_enabled() ||
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()))
static_branch_enable(&_page_poisoning_enabled);
page_poison=on needs to override init_on_free=1.
Before the change it did not work as expected for the following case:
- have PAGE_POISONING=y
- have page_poison unset
- have !ARCH_SUPPORTS_DEBUG_PAGEALLOC arch (like ia64)
- have init_on_free=1
- have debug_pagealloc=1
That way we get both keys enabled:
- static_branch_enable(&init_on_free);
- static_branch_enable(&_page_poisoning_enabled);
which leads to poisoned pages returned for __GFP_ZERO pages.
After the change we execute only:
- static_branch_enable(&_page_poisoning_enabled);
and ignore init_on_free=1.
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Fixes: 8db26a3d4735 ("mm, page_poison: use static key more efficiently")
Cc: <stable(a)vger.kernel.org>
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: linux-mm(a)kvack.org
CC: David Hildenbrand <david(a)redhat.com>
CC: Andrey Konovalov <andreyknvl(a)gmail.com>
Link: https://lkml.org/lkml/2021/3/26/443
Signed-off-by: Sergei Trofimovich <slyfox(a)gentoo.org>
---
Change since v2:
- Added 'Fixes:' and 'CC: stable@' suggested by Vlastimil and David
- Renamed local variable to 'page_poisoning_requested' for
consistency suggested by David
- Simplified initialization of page_poisoning_requested suggested
by David
- Added 'Acked-by: Vlastimil'
mm/page_alloc.c | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cfc72873961d..4bb3cdfc47f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -764,32 +764,36 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
*/
void init_mem_debugging_and_hardening(void)
{
+ bool page_poisoning_requested = false;
+
+#ifdef CONFIG_PAGE_POISONING
+ /*
+ * Page poisoning is debug page alloc for some arches. If
+ * either of those options are enabled, enable poisoning.
+ */
+ if (page_poisoning_enabled() ||
+ (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
+ debug_pagealloc_enabled())) {
+ static_branch_enable(&_page_poisoning_enabled);
+ page_poisoning_requested = true;
+ }
+#endif
+
if (_init_on_alloc_enabled_early) {
- if (page_poisoning_enabled())
+ if (page_poisoning_requested)
pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, "
"will take precedence over init_on_alloc\n");
else
static_branch_enable(&init_on_alloc);
}
if (_init_on_free_enabled_early) {
- if (page_poisoning_enabled())
+ if (page_poisoning_requested)
pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, "
"will take precedence over init_on_free\n");
else
static_branch_enable(&init_on_free);
}
-#ifdef CONFIG_PAGE_POISONING
- /*
- * Page poisoning is debug page alloc for some arches. If
- * either of those options are enabled, enable poisoning.
- */
- if (page_poisoning_enabled() ||
- (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
- debug_pagealloc_enabled()))
- static_branch_enable(&_page_poisoning_enabled);
-#endif
-
#ifdef CONFIG_DEBUG_PAGEALLOC
if (!debug_pagealloc_enabled())
return;
--
2.31.1
The ingenic-drm driver has two mutually exclusive primary planes
already; so the fact that a CRTC must have one and only one primary
plane is an invalid assumption.
Fixes: 96962e3de725 ("drm: require each CRTC to have a unique primary plane")
Cc: <stable(a)vger.kernel.org> # 5.11
Signed-off-by: Paul Cercueil <paul(a)crapouillou.net>
---
drivers/gpu/drm/drm_mode_config.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c
index 37b4b9f0e468..d86621c41047 100644
--- a/drivers/gpu/drm/drm_mode_config.c
+++ b/drivers/gpu/drm/drm_mode_config.c
@@ -626,9 +626,7 @@ void drm_mode_config_validate(struct drm_device *dev)
{
struct drm_encoder *encoder;
struct drm_crtc *crtc;
- struct drm_plane *plane;
u32 primary_with_crtc = 0, cursor_with_crtc = 0;
- unsigned int num_primary = 0;
if (!drm_core_check_feature(dev, DRIVER_MODESET))
return;
@@ -676,13 +674,4 @@ void drm_mode_config_validate(struct drm_device *dev)
cursor_with_crtc |= drm_plane_mask(crtc->cursor);
}
}
-
- drm_for_each_plane(plane, dev) {
- if (plane->type == DRM_PLANE_TYPE_PRIMARY)
- num_primary++;
- }
-
- WARN(num_primary != dev->mode_config.num_crtc,
- "Must have as many primary planes as there are CRTCs, but have %u primary planes and %u CRTCs",
- num_primary, dev->mode_config.num_crtc);
}
--
2.30.2
There are code paths that rely on zero_pfn to be fully initialized
before core_initcall. For example, wq_sysfs_init() is a core_initcall
function that eventually results in a call to kernel_execve, which
causes a page fault with a subsequent mmput. If zero_pfn is not
initialized by then it may not get cleaned up properly and result in an
error:
BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
Here is an analysis of the race as seen on a MIPS device. On this
particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
initialized, at which point it becomes PFN 5120:
1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
[<80340dc8>] kobject_uevent_env+0x7e4/0x7ec
[<8033f8b8>] kset_register+0x68/0x88
[<803cf824>] bus_register+0xdc/0x34c
[<803cfac8>] subsys_virtual_register+0x34/0x78
[<8086afb0>] wq_sysfs_init+0x1c/0x4c
[<80001648>] do_one_initcall+0x50/0x1a8
[<8086503c>] kernel_init_freeable+0x230/0x2c8
[<8066bca0>] kernel_init+0x10/0x100
[<80003038>] ret_from_kernel_thread+0x14/0x1c
2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
kernel_execve asynchronously.
3. Memory allocations in kernel_execve cause a page fault, bumping the
MM reference counter:
[<8015adb4>] add_mm_counter_fast+0xb4/0xc0
[<80160d58>] handle_mm_fault+0x6e4/0xea0
[<80158aa4>] __get_user_pages.part.78+0x190/0x37c
[<8015992c>] __get_user_pages_remote+0x128/0x360
[<801a6d9c>] get_arg_page+0x34/0xa0
[<801a7394>] copy_string_kernel+0x194/0x2a4
[<801a880c>] kernel_execve+0x11c/0x298
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194
4. In case zero_pfn has not been initialized yet, zap_pte_range does
not decrement the MM_ANONPAGES RSS counter and the BUG message is
triggered shortly afterwards when __mmdrop checks the ref counters:
[<800285e8>] __mmdrop+0x98/0x1d0
[<801a6de8>] free_bprm+0x44/0x118
[<801a86a8>] kernel_execve+0x160/0x1d8
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194
[<80003198>] ret_from_kernel_thread+0x14/0x1c
To avoid races such as described above, initialize init_zero_pfn at
early_initcall level. Depending on the architecture, ZERO_PAGE is either
constant or gets initialized even earlier, at paging_init, so there is
no issue with initializing zero_pfn earlier.
ML discussion: https://lore.kernel.org/lkml/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpq…
Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy(a)gmail.com>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: stable(a)vger.kernel.org
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 46ef306375bd..a8bbc4fc121f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,7 +166,7 @@ static int __init init_zero_pfn(void)
zero_pfn = page_to_pfn(ZERO_PAGE(0));
return 0;
}
-core_initcall(init_zero_pfn);
+early_initcall(init_zero_pfn);
void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
{
--
2.31.0
The MIPS FPU may have 3 mode:
FR=0: MIPS I style, all of the FPR are single.
FR=1: all 32 FPR can be double.
FRE: redirecting the rw of odd-FPR to the upper 32bit of even-double FPR.
The binary may have 3 mode:
FP32: can only work with FR=0 and FRE mode
FPXX: can work with all of FR=0/FR=1/FRE mode.
FP64: can only work with FR=1 mode
Some binary, for example the output of golang, may be mark as FPXX,
while in fact they are FP32. It is caused by the bug of design and linker:
Object produced by pure Go has no FP annotation while in fact they are FP32;
if we link them with the C module which marked as FPXX,
the result will be marked as FPXX. If these fake-FPXX binaries is executed
in FR=1 mode, some problem will happen.
In Golang, now we add the FP32 annotation, so the future golang programs
won't have this problem. While for the existing binaries, we need a
kernel workaround.
Currently, FR=1 mode is used for all FPXX binary, it makes some wrong
behivour of the binaries. Since FPXX binary can work with both FR=1 and FR=0,
we force it to use FR=0 or FRE (for R6 CPU).
Reference:
https://web.archive.org/web/20180828210612/https://dmz-portal.mips.com/wiki…https://go-review.googlesource.com/c/go/+/239217https://go-review.googlesource.com/c/go/+/237058
Signed-off-by: YunQiang Su <yunqiang.su(a)cipunited.com>
Cc: stable(a)vger.kernel.org # 4.19+
---
v6->v7:
Use FRE mode for pre-R6 binaries on R6 CPU.
v5->v6:
Rollback to V3, aka remove config option.
v4->v5:
Fix CONFIG_MIPS_O32_FPXX_USE_FR0 usage: if -> ifdef
v3->v4:
introduce a config option: CONFIG_MIPS_O32_FPXX_USE_FR0
v2->v3:
commit message: add Signed-off-by and Cc to stable.
v1->v2:
Fix bad commit message: in fact, we are switching to FR=0
arch/mips/kernel/elf.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/arch/mips/kernel/elf.c b/arch/mips/kernel/elf.c
index 7b045d2a0b51..4d4db619544b 100644
--- a/arch/mips/kernel/elf.c
+++ b/arch/mips/kernel/elf.c
@@ -232,11 +232,16 @@ int arch_check_elf(void *_ehdr, bool has_interpreter, void *_interp_ehdr,
* that inherently require the hybrid FP mode.
* - If FR1 and FRDEFAULT is true, that means we hit the any-abi or
* fpxx case. This is because, in any-ABI (or no-ABI) we have no FPU
- * instructions so we don't care about the mode. We will simply use
- * the one preferred by the hardware. In fpxx case, that ABI can
- * handle both FR=1 and FR=0, so, again, we simply choose the one
- * preferred by the hardware. Next, if we only use single-precision
- * FPU instructions, and the default ABI FPU mode is not good
+ * instructions so we don't care about the mode.
+ * In fpxx case, that ABI can handle all of FR=1/FR=0/FRE mode.
+ * Here, we need to use FR=0/FRE mode instead of FR=1, because some binaries
+ * may be mark as FPXX by mistake due to bugs of design and linker:
+ * The object produced by pure Go has no FP annotation,
+ * then is treated as any-ABI by linker, although in fact they are FP32;
+ * if any-ABI object is linked with FPXX object, the result will be mark as FPXX.
+ * Then the problem happens: run FP32 binaries in FR=1 mode.
+ * - If we only use single-precision FPU instructions,
+ * and the default ABI FPU mode is not good
* (ie single + any ABI combination), we set again the FPU mode to the
* one is preferred by the hardware. Next, if we know that the code
* will only use single-precision instructions, shown by single being
@@ -248,8 +253,9 @@ int arch_check_elf(void *_ehdr, bool has_interpreter, void *_interp_ehdr,
*/
if (prog_req.fre && !prog_req.frdefault && !prog_req.fr1)
state->overall_fp_mode = FP_FRE;
- else if ((prog_req.fr1 && prog_req.frdefault) ||
- (prog_req.single && !prog_req.frdefault))
+ else if (prog_req.fr1 && prog_req.frdefault)
+ state->overall_fp_mode = cpu_has_mips_r6 ? FP_FRE : FP_FR0;
+ else if (prog_req.single && !prog_req.frdefault)
/* Make sure 64-bit MIPS III/IV/64R1 will not pick FR1 */
state->overall_fp_mode = ((raw_current_cpu_data.fpu_id & MIPS_FPIR_F64) &&
cpu_has_mips_r2_r6) ?
--
2.20.1
Hello,
We ran automated tests on a recent commit from this kernel tree:
Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Commit: 8f9be00f0e3f - selftest/bpf: Add a test to check trampoline freeing logic.
The results of these automated tests are provided below.
Overall result: PASSED
Merge: OK
Compile: OK
Tests: OK
All kernel binaries, config files, and logs are available for download here:
https://arr-cki-prod-datawarehouse-public.s3.amazonaws.com/index.html?prefi…
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Compile testing
---------------
We compiled the kernel for 4 architectures:
aarch64:
make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
ppc64le:
make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
s390x:
make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
x86_64:
make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
Hardware testing
----------------
We booted each kernel and ran the following tests:
aarch64:
Host 1:
✅ Boot test
✅ ACPI table test
✅ ACPI enabled test
✅ LTP
✅ CIFS Connectathon
✅ POSIX pjd-fstest suites
✅ Loopdev Sanity
✅ jvm - jcstress tests
✅ Memory: fork_mem
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ Networking: igmp conformance test
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking cki netfilter test
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - transport
✅ Networking ipsec: basic netns - tunnel
✅ Libkcapi AF_ALG test
✅ pciutils: update pci ids test
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
✅ storage: SCSI VPD
✅ trace: ftrace/tracer
🚧 ✅ i2c: i2cdetect sanity
🚧 ✅ Firmware test suite
🚧 ✅ Memory function: kaslr
🚧 ✅ audit: audit testsuite test
Host 2:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ xfstests - ext4
✅ xfstests - xfs
✅ selinux-policy: serge-testsuite
✅ storage: software RAID testing
✅ Storage: swraid mdadm raid_module test
🚧 ✅ xfstests - btrfs
🚧 ✅ IPMI driver test
🚧 ✅ IPMItool loop stress test
🚧 ✅ Storage blktests
🚧 ✅ Storage block - filesystem fio test
🚧 ✅ Storage block - queue scheduler test
🚧 ✅ Storage nvme - tcp
🚧 ⚡⚡⚡ Storage: lvm device-mapper test
🚧 ⚡⚡⚡ stress: stress-ng
ppc64le:
Host 1:
✅ Boot test
✅ xfstests - ext4
✅ xfstests - xfs
✅ selinux-policy: serge-testsuite
✅ storage: software RAID testing
✅ Storage: swraid mdadm raid_module test
🚧 ✅ xfstests - btrfs
🚧 ✅ IPMI driver test
🚧 ✅ IPMItool loop stress test
🚧 ✅ Storage blktests
🚧 ✅ Storage block - filesystem fio test
🚧 ✅ Storage block - queue scheduler test
🚧 ✅ Storage nvme - tcp
🚧 ✅ Storage: lvm device-mapper test
Host 2:
✅ Boot test
✅ LTP
✅ CIFS Connectathon
✅ POSIX pjd-fstest suites
✅ Loopdev Sanity
✅ jvm - jcstress tests
✅ Memory: fork_mem
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking cki netfilter test
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - tunnel
✅ Libkcapi AF_ALG test
✅ pciutils: update pci ids test
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
✅ trace: ftrace/tracer
🚧 ✅ Memory function: kaslr
🚧 ✅ audit: audit testsuite test
s390x:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ LTP
⚡⚡⚡ CIFS Connectathon
⚡⚡⚡ POSIX pjd-fstest suites
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ jvm - jcstress tests
⚡⚡⚡ Memory: fork_mem
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking cki netfilter test
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ Libkcapi AF_ALG test
⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ audit: audit testsuite test
Host 2:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ selinux-policy: serge-testsuite
⚡⚡⚡ Storage: swraid mdadm raid_module test
🚧 ⚡⚡⚡ Storage blktests
🚧 ⚡⚡⚡ Storage nvme - tcp
🚧 ⚡⚡⚡ stress: stress-ng
Host 3:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ LTP
⚡⚡⚡ CIFS Connectathon
⚡⚡⚡ POSIX pjd-fstest suites
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ jvm - jcstress tests
⚡⚡⚡ Memory: fork_mem
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking cki netfilter test
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ Libkcapi AF_ALG test
⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ audit: audit testsuite test
Host 4:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ selinux-policy: serge-testsuite
⚡⚡⚡ Storage: swraid mdadm raid_module test
🚧 ⚡⚡⚡ Storage blktests
🚧 ⚡⚡⚡ Storage nvme - tcp
🚧 ⚡⚡⚡ stress: stress-ng
x86_64:
Host 1:
✅ Boot test
✅ xfstests - ext4
✅ xfstests - xfs
✅ xfstests - nfsv4.2
✅ selinux-policy: serge-testsuite
✅ storage: software RAID testing
✅ Storage: swraid mdadm raid_module test
🚧 ✅ CPU: Idle Test
🚧 ❌ xfstests - btrfs
🚧 ✅ xfstests - cifsv3.11
🚧 ✅ IPMI driver test
🚧 ✅ IPMItool loop stress test
🚧 ✅ Storage blktests
🚧 ✅ Storage block - filesystem fio test
🚧 ✅ Storage block - queue scheduler test
🚧 ✅ Storage nvme - tcp
🚧 ✅ Storage: lvm device-mapper test
🚧 ✅ stress: stress-ng
Host 2:
✅ Boot test
✅ ACPI table test
✅ LTP
✅ CIFS Connectathon
✅ POSIX pjd-fstest suites
✅ Loopdev Sanity
✅ jvm - jcstress tests
✅ Memory: fork_mem
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking socket: fuzz
✅ Networking: igmp conformance test
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking cki netfilter test
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - transport
✅ Networking ipsec: basic netns - tunnel
✅ Libkcapi AF_ALG test
✅ pciutils: sanity smoke test
✅ pciutils: update pci ids test
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
✅ storage: SCSI VPD
✅ trace: ftrace/tracer
🚧 ✅ i2c: i2cdetect sanity
🚧 ✅ Firmware test suite
🚧 ✅ Memory function: kaslr
🚧 ✅ audit: audit testsuite test
Test sources: https://gitlab.com/cki-project/kernel-tests
💚 Pull requests are welcome for new tests or improvements to existing tests!
Aborted tests
-------------
Tests that didn't complete running successfully are marked with ⚡⚡⚡.
If this was caused by an infrastructure issue, we try to mark that
explicitly in the report.
Waived tests
------------
If the test run included waived tests, they are marked with 🚧. Such tests are
executed but their results are not taken into account. Tests are waived when
their results are not reliable enough, e.g. when they're just introduced or are
being fixed.
Testing timeout
---------------
We aim to provide a report within reasonable timeframe. Tests that haven't
finished running yet are marked with ⏱.
From: Pavel Andrianov <andrianov(a)ispras.ru>
[ Upstream commit 0571a753cb07982cc82f4a5115e0b321da89e1f3 ]
pxa168_eth_remove() firstly calls unregister_netdev(),
then cancels a timeout work. unregister_netdev() shuts down a device
interface and removes it from the kernel tables. If the timeout occurs
in parallel, the timeout work (pxa168_eth_tx_timeout_task) performs stop
and open of the device. It may lead to an inconsistent state and memory
leaks.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Pavel Andrianov <andrianov(a)ispras.ru>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/net/ethernet/marvell/pxa168_eth.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c b/drivers/net/ethernet/marvell/pxa168_eth.c
index 7ace07dad6a3..9986f88618bd 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -1577,8 +1577,8 @@ static int pxa168_eth_remove(struct platform_device *pdev)
mdiobus_unregister(pep->smi_bus);
mdiobus_free(pep->smi_bus);
- unregister_netdev(dev);
cancel_work_sync(&pep->tx_timeout_task);
+ unregister_netdev(dev);
free_netdev(dev);
return 0;
}
--
2.30.1
From: Pavel Andrianov <andrianov(a)ispras.ru>
[ Upstream commit 0571a753cb07982cc82f4a5115e0b321da89e1f3 ]
pxa168_eth_remove() firstly calls unregister_netdev(),
then cancels a timeout work. unregister_netdev() shuts down a device
interface and removes it from the kernel tables. If the timeout occurs
in parallel, the timeout work (pxa168_eth_tx_timeout_task) performs stop
and open of the device. It may lead to an inconsistent state and memory
leaks.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Pavel Andrianov <andrianov(a)ispras.ru>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/net/ethernet/marvell/pxa168_eth.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c b/drivers/net/ethernet/marvell/pxa168_eth.c
index 5d5000c8edf1..09cb0ac701e1 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -1571,8 +1571,8 @@ static int pxa168_eth_remove(struct platform_device *pdev)
mdiobus_unregister(pep->smi_bus);
mdiobus_free(pep->smi_bus);
- unregister_netdev(dev);
cancel_work_sync(&pep->tx_timeout_task);
+ unregister_netdev(dev);
free_netdev(dev);
return 0;
}
--
2.30.1
From: Mans Rullgard <mans(a)mansr.com>
[ Upstream commit 9bbce32a20d6a72c767a7f85fd6127babd1410ac ]
Without DT aliases, the numbering of mmc interfaces is unpredictable.
Adding them makes it possible to refer to devices consistently. The
popular suggestion to use UUIDs obviously doesn't work with a blank
device fresh from the factory.
See commit fa2d0aa96941 ("mmc: core: Allow setting slot index via
device tree alias") for more discussion.
Signed-off-by: Mans Rullgard <mans(a)mansr.com>
Signed-off-by: Tony Lindgren <tony(a)atomide.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/arm/boot/dts/am33xx.dtsi | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi
index e58fab8aec5d..8923273a2f73 100644
--- a/arch/arm/boot/dts/am33xx.dtsi
+++ b/arch/arm/boot/dts/am33xx.dtsi
@@ -38,6 +38,9 @@ aliases {
ethernet1 = &cpsw_emac1;
spi0 = &spi0;
spi1 = &spi1;
+ mmc0 = &mmc1;
+ mmc1 = &mmc2;
+ mmc2 = &mmc3;
};
cpus {
--
2.30.1
From: Mans Rullgard <mans(a)mansr.com>
[ Upstream commit 9bbce32a20d6a72c767a7f85fd6127babd1410ac ]
Without DT aliases, the numbering of mmc interfaces is unpredictable.
Adding them makes it possible to refer to devices consistently. The
popular suggestion to use UUIDs obviously doesn't work with a blank
device fresh from the factory.
See commit fa2d0aa96941 ("mmc: core: Allow setting slot index via
device tree alias") for more discussion.
Signed-off-by: Mans Rullgard <mans(a)mansr.com>
Signed-off-by: Tony Lindgren <tony(a)atomide.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/arm/boot/dts/am33xx.dtsi | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi
index d3dd6a16e70a..e321acaf35d6 100644
--- a/arch/arm/boot/dts/am33xx.dtsi
+++ b/arch/arm/boot/dts/am33xx.dtsi
@@ -39,6 +39,9 @@ aliases {
ethernet1 = &cpsw_emac1;
spi0 = &spi0;
spi1 = &spi1;
+ mmc0 = &mmc1;
+ mmc1 = &mmc2;
+ mmc2 = &mmc3;
};
cpus {
--
2.30.1
From: Mans Rullgard <mans(a)mansr.com>
[ Upstream commit 9bbce32a20d6a72c767a7f85fd6127babd1410ac ]
Without DT aliases, the numbering of mmc interfaces is unpredictable.
Adding them makes it possible to refer to devices consistently. The
popular suggestion to use UUIDs obviously doesn't work with a blank
device fresh from the factory.
See commit fa2d0aa96941 ("mmc: core: Allow setting slot index via
device tree alias") for more discussion.
Signed-off-by: Mans Rullgard <mans(a)mansr.com>
Signed-off-by: Tony Lindgren <tony(a)atomide.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/arm/boot/dts/am33xx.dtsi | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi
index fb6b8aa12cc5..77fa7c0f2104 100644
--- a/arch/arm/boot/dts/am33xx.dtsi
+++ b/arch/arm/boot/dts/am33xx.dtsi
@@ -40,6 +40,9 @@ aliases {
ethernet1 = &cpsw_emac1;
spi0 = &spi0;
spi1 = &spi1;
+ mmc0 = &mmc1;
+ mmc1 = &mmc2;
+ mmc2 = &mmc3;
};
cpus {
--
2.30.1
A test was added to the probe function to ensure the device was
actually connected and working before successfully completing a
probe. If the device was actually there, but the I2C bus was not
ready yet for whatever reason, the probe fails permanently.
Change the probe so that we defer the probe on a regmap read
failure so that we try the probe again when the dependent drivers
are potentially loaded. This should not affect the case where the
device truly isn't present because the probe will never successfully
complete.
Fixes: 2aa916e67db3 ("sc16is7xx: Read the LSR register for basic device presence check")
Cc: stable(a)vger.kernel.org
Signed-off-by: Annaliese McDermond <nh6z(a)nh6z.net>
---
drivers/tty/serial/sc16is7xx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/tty/serial/sc16is7xx.c b/drivers/tty/serial/sc16is7xx.c
index f86ec2d2635b..9adb8362578c 100644
--- a/drivers/tty/serial/sc16is7xx.c
+++ b/drivers/tty/serial/sc16is7xx.c
@@ -1196,7 +1196,7 @@ static int sc16is7xx_probe(struct device *dev,
ret = regmap_read(regmap,
SC16IS7XX_LSR_REG << SC16IS7XX_REG_SHIFT, &val);
if (ret < 0)
- return ret;
+ return -EPROBE_DEFER;
/* Alloc port structure */
s = devm_kzalloc(dev, struct_size(s, p, devtype->nr_uart), GFP_KERNEL);
--
2.27.0
A test was added to the probe function to ensure the device was
actually connected and working before successfully completing a
probe. If the device was actually there, but the I2C bus was not
ready yet for whatever reason, the probe fails permanently.
Change the probe so that we defer the probe on a regmap read
failure so that we try the probe again when the dependent drivers
are potentially loaded. This should not affect the case where the
device truly isn't present because the probe will never successfully
complete.
Fixes: 2aa916e ("sc16is7xx: Read the LSR register for basic device presence check")
Cc: stable(a)vger.kernel.org
Signed-off-by: Annaliese McDermond <nh6z(a)nh6z.net>
---
drivers/tty/serial/sc16is7xx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/tty/serial/sc16is7xx.c b/drivers/tty/serial/sc16is7xx.c
index f86ec2d2635b..9adb8362578c 100644
--- a/drivers/tty/serial/sc16is7xx.c
+++ b/drivers/tty/serial/sc16is7xx.c
@@ -1196,7 +1196,7 @@ static int sc16is7xx_probe(struct device *dev,
ret = regmap_read(regmap,
SC16IS7XX_LSR_REG << SC16IS7XX_REG_SHIFT, &val);
if (ret < 0)
- return ret;
+ return -EPROBE_DEFER;
/* Alloc port structure */
s = devm_kzalloc(dev, struct_size(s, p, devtype->nr_uart), GFP_KERNEL);
--
2.27.0
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From dcc32f4f183ab8479041b23a1525d48233df1d43 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski <kuba(a)kernel.org>
Date: Wed, 17 Mar 2021 09:55:15 -0700
Subject: [PATCH] ipv6: weaken the v4mapped source check
This reverts commit 6af1799aaf3f1bc8defedddfa00df3192445bbf3.
Commit 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out
https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
lists potential issues.
Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.
Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.
Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.
Fixes: 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao(a)fb.com>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/dccp/ipv6.c | 5 +++++
net/ipv6/ip6_input.c | 10 ----------
net/ipv6/tcp_ipv6.c | 5 +++++
net/mptcp/subflow.c | 5 +++++
4 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 1f73603913f5..2be5c69824f9 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -319,6 +319,11 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
return 0; /* discard, don't send a reset here */
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
if (dccp_bad_service_code(sk, service)) {
dcb->dccpd_reset_code = DCCP_RESET_CODE_BAD_SERVICE_CODE;
goto drop;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index e9d2a4a409aa..80256717868e 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -245,16 +245,6 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
if (ipv6_addr_is_multicast(&hdr->saddr))
goto err;
- /* While RFC4291 is not explicit about v4mapped addresses
- * in IPv6 headers, it seems clear linux dual-stack
- * model can not deal properly with these.
- * Security models could be fooled by ::ffff:127.0.0.1 for example.
- *
- * https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
- */
- if (ipv6_addr_v4mapped(&hdr->saddr))
- goto err;
-
skb->transport_header = skb->network_header + sizeof(*hdr);
IP6CB(skb)->nhoff = offsetof(struct ipv6hdr, nexthdr);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bd44ded7e50c..d0f007741e8e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1175,6 +1175,11 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&tcp6_request_sock_ops,
&tcp_request_sock_ipv6_ops, sk, skb);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 3d47d670e665..d17d39ccdf34 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -477,6 +477,11 @@ static int subflow_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&mptcp_subflow_request_sock_ops,
&subflow_request_sock_ipv6_ops, sk, skb);
--
2.30.1
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 1f935e8e72ec28dddb2dc0650b3b6626a293d94b Mon Sep 17 00:00:00 2001
From: David Brazdil <dbrazdil(a)google.com>
Date: Fri, 19 Mar 2021 13:05:41 +0000
Subject: [PATCH] selinux: vsock: Set SID for socket returned by accept()
For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.
Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/vmw_vsock/af_vsock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5546710d8ac1..bc7fb9bf3351 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -755,6 +755,7 @@ static struct sock *__vsock_create(struct net *net,
vsk->buffer_size = psk->buffer_size;
vsk->buffer_min_size = psk->buffer_min_size;
vsk->buffer_max_size = psk->buffer_max_size;
+ security_sk_clone(parent, sk);
} else {
vsk->trusted = ns_capable_noaudit(&init_user_ns, CAP_NET_ADMIN);
vsk->owner = get_current_cred();
--
2.30.1
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From dcc32f4f183ab8479041b23a1525d48233df1d43 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski <kuba(a)kernel.org>
Date: Wed, 17 Mar 2021 09:55:15 -0700
Subject: [PATCH] ipv6: weaken the v4mapped source check
This reverts commit 6af1799aaf3f1bc8defedddfa00df3192445bbf3.
Commit 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out
https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
lists potential issues.
Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.
Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.
Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.
Fixes: 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao(a)fb.com>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/dccp/ipv6.c | 5 +++++
net/ipv6/ip6_input.c | 10 ----------
net/ipv6/tcp_ipv6.c | 5 +++++
net/mptcp/subflow.c | 5 +++++
4 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 1f73603913f5..2be5c69824f9 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -319,6 +319,11 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
return 0; /* discard, don't send a reset here */
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
if (dccp_bad_service_code(sk, service)) {
dcb->dccpd_reset_code = DCCP_RESET_CODE_BAD_SERVICE_CODE;
goto drop;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index e9d2a4a409aa..80256717868e 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -245,16 +245,6 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
if (ipv6_addr_is_multicast(&hdr->saddr))
goto err;
- /* While RFC4291 is not explicit about v4mapped addresses
- * in IPv6 headers, it seems clear linux dual-stack
- * model can not deal properly with these.
- * Security models could be fooled by ::ffff:127.0.0.1 for example.
- *
- * https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
- */
- if (ipv6_addr_v4mapped(&hdr->saddr))
- goto err;
-
skb->transport_header = skb->network_header + sizeof(*hdr);
IP6CB(skb)->nhoff = offsetof(struct ipv6hdr, nexthdr);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bd44ded7e50c..d0f007741e8e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1175,6 +1175,11 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&tcp6_request_sock_ops,
&tcp_request_sock_ipv6_ops, sk, skb);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 3d47d670e665..d17d39ccdf34 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -477,6 +477,11 @@ static int subflow_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&mptcp_subflow_request_sock_ops,
&subflow_request_sock_ipv6_ops, sk, skb);
--
2.30.1
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 175e476b8cdf2a4de7432583b49c871345e4f8a1 Mon Sep 17 00:00:00 2001
From: Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
Date: Mon, 8 Mar 2021 14:24:13 +1300
Subject: [PATCH] netfilter: x_tables: Use correct memory barriers.
When a new table value was assigned, it was followed by a write memory
barrier. This ensured that all writes before this point would complete
before any writes after this point. However, to determine whether the
rules are unused, the sequence counter is read. To ensure that all
writes have been done before these reads, a full memory barrier is
needed, not just a write memory barrier. The same argument applies when
incrementing the counter, before the rules are read.
Changing to using smp_mb() instead of smp_wmb() fixes the kernel panic
reported in cc00bcaa5899 (which is still present), while still
maintaining the same speed of replacing tables.
The smb_mb() barriers potentially slow the packet path, however testing
has shown no measurable change in performance on a 4-core MIPS64
platform.
Fixes: 7f5c6d4f665b ("netfilter: get rid of atomic ops in fast path")
Signed-off-by: Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
---
include/linux/netfilter/x_tables.h | 2 +-
net/netfilter/x_tables.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5deb099d156d..8ec48466410a 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -376,7 +376,7 @@ static inline unsigned int xt_write_recseq_begin(void)
* since addend is most likely 1
*/
__this_cpu_add(xt_recseq.sequence, addend);
- smp_wmb();
+ smp_mb();
return addend;
}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 7df3aef39c5c..6bd31a7a27fc 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1389,7 +1389,7 @@ xt_replace_table(struct xt_table *table,
table->private = newinfo;
/* make sure all cpus see new ->private value */
- smp_wmb();
+ smp_mb();
/*
* Even though table entries have now been swapped, other CPU's
--
2.30.1
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From a188bb5638d41aa99090ebf2f85d3505ab13fba5 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel(a)iogearbox.net>
Date: Wed, 10 Mar 2021 01:38:10 +0100
Subject: [PATCH] net, bpf: Fix ip6ip6 crash with collect_md populated skbs
I ran into a crash where setting up a ip6ip6 tunnel device which was /not/
set to collect_md mode was receiving collect_md populated skbs for xmit.
The BPF prog was populating the skb via bpf_skb_set_tunnel_key() which is
assigning special metadata dst entry and then redirecting the skb to the
device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit()
and in the latter it performs a neigh lookup based on skb_dst(skb) where
we trigger a NULL pointer dereference on dst->ops->neigh_lookup() since
the md_dst_ops do not populate neigh_lookup callback with a fake handler.
Transform the md_dst_ops into generic dst_blackhole_ops that can also be
reused elsewhere when needed, and use them for the metadata dst entries as
callback ops.
Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}()
from dst_init() which free the skb the same way modulo the splat. Given we
will be able to recover just fine from there, avoid any potential splats
iff this gets ever triggered in future (or worse, panic on warns when set).
Fixes: f38a9eb1f77b ("dst: Metadata destinations")
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/core/dst.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)
diff --git a/net/core/dst.c b/net/core/dst.c
index 5f6315601776..fb3bcba87744 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -275,37 +275,24 @@ unsigned int dst_blackhole_mtu(const struct dst_entry *dst)
}
EXPORT_SYMBOL_GPL(dst_blackhole_mtu);
-static struct dst_ops md_dst_ops = {
- .family = AF_UNSPEC,
+static struct dst_ops dst_blackhole_ops = {
+ .family = AF_UNSPEC,
+ .neigh_lookup = dst_blackhole_neigh_lookup,
+ .check = dst_blackhole_check,
+ .cow_metrics = dst_blackhole_cow_metrics,
+ .update_pmtu = dst_blackhole_update_pmtu,
+ .redirect = dst_blackhole_redirect,
+ .mtu = dst_blackhole_mtu,
};
-static int dst_md_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call output on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
-static int dst_md_discard(struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call input on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
static void __metadata_dst_init(struct metadata_dst *md_dst,
enum metadata_type type, u8 optslen)
-
{
struct dst_entry *dst;
dst = &md_dst->dst;
- dst_init(dst, &md_dst_ops, NULL, 1, DST_OBSOLETE_NONE,
+ dst_init(dst, &dst_blackhole_ops, NULL, 1, DST_OBSOLETE_NONE,
DST_METADATA | DST_NOCOUNT);
-
- dst->input = dst_md_discard;
- dst->output = dst_md_discard_out;
-
memset(dst + 1, 0, sizeof(*md_dst) + optslen - sizeof(*dst));
md_dst->type = type;
}
--
2.30.1
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 1f935e8e72ec28dddb2dc0650b3b6626a293d94b Mon Sep 17 00:00:00 2001
From: David Brazdil <dbrazdil(a)google.com>
Date: Fri, 19 Mar 2021 13:05:41 +0000
Subject: [PATCH] selinux: vsock: Set SID for socket returned by accept()
For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.
Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/vmw_vsock/af_vsock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5546710d8ac1..bc7fb9bf3351 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -755,6 +755,7 @@ static struct sock *__vsock_create(struct net *net,
vsk->buffer_size = psk->buffer_size;
vsk->buffer_min_size = psk->buffer_min_size;
vsk->buffer_max_size = psk->buffer_max_size;
+ security_sk_clone(parent, sk);
} else {
vsk->trusted = ns_capable_noaudit(&init_user_ns, CAP_NET_ADMIN);
vsk->owner = get_current_cred();
--
2.30.1
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From dcc32f4f183ab8479041b23a1525d48233df1d43 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski <kuba(a)kernel.org>
Date: Wed, 17 Mar 2021 09:55:15 -0700
Subject: [PATCH] ipv6: weaken the v4mapped source check
This reverts commit 6af1799aaf3f1bc8defedddfa00df3192445bbf3.
Commit 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out
https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
lists potential issues.
Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.
Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.
Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.
Fixes: 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao(a)fb.com>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/dccp/ipv6.c | 5 +++++
net/ipv6/ip6_input.c | 10 ----------
net/ipv6/tcp_ipv6.c | 5 +++++
net/mptcp/subflow.c | 5 +++++
4 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 1f73603913f5..2be5c69824f9 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -319,6 +319,11 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
return 0; /* discard, don't send a reset here */
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
if (dccp_bad_service_code(sk, service)) {
dcb->dccpd_reset_code = DCCP_RESET_CODE_BAD_SERVICE_CODE;
goto drop;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index e9d2a4a409aa..80256717868e 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -245,16 +245,6 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
if (ipv6_addr_is_multicast(&hdr->saddr))
goto err;
- /* While RFC4291 is not explicit about v4mapped addresses
- * in IPv6 headers, it seems clear linux dual-stack
- * model can not deal properly with these.
- * Security models could be fooled by ::ffff:127.0.0.1 for example.
- *
- * https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
- */
- if (ipv6_addr_v4mapped(&hdr->saddr))
- goto err;
-
skb->transport_header = skb->network_header + sizeof(*hdr);
IP6CB(skb)->nhoff = offsetof(struct ipv6hdr, nexthdr);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bd44ded7e50c..d0f007741e8e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1175,6 +1175,11 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&tcp6_request_sock_ops,
&tcp_request_sock_ipv6_ops, sk, skb);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 3d47d670e665..d17d39ccdf34 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -477,6 +477,11 @@ static int subflow_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&mptcp_subflow_request_sock_ops,
&subflow_request_sock_ipv6_ops, sk, skb);
--
2.30.1
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 175e476b8cdf2a4de7432583b49c871345e4f8a1 Mon Sep 17 00:00:00 2001
From: Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
Date: Mon, 8 Mar 2021 14:24:13 +1300
Subject: [PATCH] netfilter: x_tables: Use correct memory barriers.
When a new table value was assigned, it was followed by a write memory
barrier. This ensured that all writes before this point would complete
before any writes after this point. However, to determine whether the
rules are unused, the sequence counter is read. To ensure that all
writes have been done before these reads, a full memory barrier is
needed, not just a write memory barrier. The same argument applies when
incrementing the counter, before the rules are read.
Changing to using smp_mb() instead of smp_wmb() fixes the kernel panic
reported in cc00bcaa5899 (which is still present), while still
maintaining the same speed of replacing tables.
The smb_mb() barriers potentially slow the packet path, however testing
has shown no measurable change in performance on a 4-core MIPS64
platform.
Fixes: 7f5c6d4f665b ("netfilter: get rid of atomic ops in fast path")
Signed-off-by: Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
---
include/linux/netfilter/x_tables.h | 2 +-
net/netfilter/x_tables.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5deb099d156d..8ec48466410a 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -376,7 +376,7 @@ static inline unsigned int xt_write_recseq_begin(void)
* since addend is most likely 1
*/
__this_cpu_add(xt_recseq.sequence, addend);
- smp_wmb();
+ smp_mb();
return addend;
}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 7df3aef39c5c..6bd31a7a27fc 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1389,7 +1389,7 @@ xt_replace_table(struct xt_table *table,
table->private = newinfo;
/* make sure all cpus see new ->private value */
- smp_wmb();
+ smp_mb();
/*
* Even though table entries have now been swapped, other CPU's
--
2.30.1
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From a188bb5638d41aa99090ebf2f85d3505ab13fba5 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel(a)iogearbox.net>
Date: Wed, 10 Mar 2021 01:38:10 +0100
Subject: [PATCH] net, bpf: Fix ip6ip6 crash with collect_md populated skbs
I ran into a crash where setting up a ip6ip6 tunnel device which was /not/
set to collect_md mode was receiving collect_md populated skbs for xmit.
The BPF prog was populating the skb via bpf_skb_set_tunnel_key() which is
assigning special metadata dst entry and then redirecting the skb to the
device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit()
and in the latter it performs a neigh lookup based on skb_dst(skb) where
we trigger a NULL pointer dereference on dst->ops->neigh_lookup() since
the md_dst_ops do not populate neigh_lookup callback with a fake handler.
Transform the md_dst_ops into generic dst_blackhole_ops that can also be
reused elsewhere when needed, and use them for the metadata dst entries as
callback ops.
Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}()
from dst_init() which free the skb the same way modulo the splat. Given we
will be able to recover just fine from there, avoid any potential splats
iff this gets ever triggered in future (or worse, panic on warns when set).
Fixes: f38a9eb1f77b ("dst: Metadata destinations")
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/core/dst.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)
diff --git a/net/core/dst.c b/net/core/dst.c
index 5f6315601776..fb3bcba87744 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -275,37 +275,24 @@ unsigned int dst_blackhole_mtu(const struct dst_entry *dst)
}
EXPORT_SYMBOL_GPL(dst_blackhole_mtu);
-static struct dst_ops md_dst_ops = {
- .family = AF_UNSPEC,
+static struct dst_ops dst_blackhole_ops = {
+ .family = AF_UNSPEC,
+ .neigh_lookup = dst_blackhole_neigh_lookup,
+ .check = dst_blackhole_check,
+ .cow_metrics = dst_blackhole_cow_metrics,
+ .update_pmtu = dst_blackhole_update_pmtu,
+ .redirect = dst_blackhole_redirect,
+ .mtu = dst_blackhole_mtu,
};
-static int dst_md_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call output on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
-static int dst_md_discard(struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call input on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
static void __metadata_dst_init(struct metadata_dst *md_dst,
enum metadata_type type, u8 optslen)
-
{
struct dst_entry *dst;
dst = &md_dst->dst;
- dst_init(dst, &md_dst_ops, NULL, 1, DST_OBSOLETE_NONE,
+ dst_init(dst, &dst_blackhole_ops, NULL, 1, DST_OBSOLETE_NONE,
DST_METADATA | DST_NOCOUNT);
-
- dst->input = dst_md_discard;
- dst->output = dst_md_discard_out;
-
memset(dst + 1, 0, sizeof(*md_dst) + optslen - sizeof(*dst));
md_dst->type = type;
}
--
2.30.1
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 6ab4c3117aec4e08007d9e971fa4133e1de1082d Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean(a)nxp.com>
Date: Mon, 22 Mar 2021 20:21:08 +0200
Subject: [PATCH] net: bridge: don't notify switchdev for local FDB addresses
As explained in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/
the switchdev notifiers for FDB entries managed to have a zero-day bug.
The bridge would not say that this entry is local:
ip link add br0 type bridge
ip link set swp0 master br0
bridge fdb add dev swp0 00:01:02:03:04:05 master local
and the switchdev driver would be more than happy to offload it as a
normal static FDB entry. This is despite the fact that 'local' and
non-'local' entries have completely opposite directions: a local entry
is locally terminated and not forwarded, whereas a static entry is
forwarded and not locally terminated. So, for example, DSA would install
this entry on swp0 instead of installing it on the CPU port as it should.
There is an even sadder part, which is that the 'local' flag is implicit
if 'static' is not specified, meaning that this command produces the
same result of adding a 'local' entry:
bridge fdb add dev swp0 00:01:02:03:04:05 master
I've updated the man pages for 'bridge', and after reading it now, it
should be pretty clear to any user that the commands above were broken
and should have never resulted in the 00:01:02:03:04:05 address being
forwarded (this behavior is coherent with non-switchdev interfaces):
https://patchwork.kernel.org/project/netdevbpf/cover/20210211104502.2081443…
If you're a user reading this and this is what you want, just use:
bridge fdb add dev swp0 00:01:02:03:04:05 master static
Because switchdev should have given drivers the means from day one to
classify FDB entries as local/non-local, but didn't, it means that all
drivers are currently broken. So we can just as well omit the switchdev
notifications for local FDB entries, which is exactly what this patch
does to close the bug in stable trees. For further development work
where drivers might want to trap the local FDB entries to the host, we
can add a 'bool is_local' to br_switchdev_fdb_call_notifiers(), and
selectively make drivers act upon that bit, while all the others ignore
those entries if the 'is_local' bit is set.
Fixes: 6b26b51b1d13 ("net: bridge: Add support for notifying devices about FDB add/del")
Signed-off-by: Vladimir Oltean <vladimir.oltean(a)nxp.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/bridge/br_switchdev.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index b89503832fcc..1e24d9a2c9a7 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -128,6 +128,8 @@ br_switchdev_fdb_notify(const struct net_bridge_fdb_entry *fdb, int type)
{
if (!fdb->dst)
return;
+ if (test_bit(BR_FDB_LOCAL, &fdb->flags))
+ return;
switch (type) {
case RTM_DELNEIGH:
--
2.30.1
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 1f935e8e72ec28dddb2dc0650b3b6626a293d94b Mon Sep 17 00:00:00 2001
From: David Brazdil <dbrazdil(a)google.com>
Date: Fri, 19 Mar 2021 13:05:41 +0000
Subject: [PATCH] selinux: vsock: Set SID for socket returned by accept()
For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.
Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/vmw_vsock/af_vsock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5546710d8ac1..bc7fb9bf3351 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -755,6 +755,7 @@ static struct sock *__vsock_create(struct net *net,
vsk->buffer_size = psk->buffer_size;
vsk->buffer_min_size = psk->buffer_min_size;
vsk->buffer_max_size = psk->buffer_max_size;
+ security_sk_clone(parent, sk);
} else {
vsk->trusted = ns_capable_noaudit(&init_user_ns, CAP_NET_ADMIN);
vsk->owner = get_current_cred();
--
2.30.1
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From dcc32f4f183ab8479041b23a1525d48233df1d43 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski <kuba(a)kernel.org>
Date: Wed, 17 Mar 2021 09:55:15 -0700
Subject: [PATCH] ipv6: weaken the v4mapped source check
This reverts commit 6af1799aaf3f1bc8defedddfa00df3192445bbf3.
Commit 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out
https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
lists potential issues.
Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.
Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.
Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.
Fixes: 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao(a)fb.com>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/dccp/ipv6.c | 5 +++++
net/ipv6/ip6_input.c | 10 ----------
net/ipv6/tcp_ipv6.c | 5 +++++
net/mptcp/subflow.c | 5 +++++
4 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 1f73603913f5..2be5c69824f9 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -319,6 +319,11 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
return 0; /* discard, don't send a reset here */
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
if (dccp_bad_service_code(sk, service)) {
dcb->dccpd_reset_code = DCCP_RESET_CODE_BAD_SERVICE_CODE;
goto drop;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index e9d2a4a409aa..80256717868e 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -245,16 +245,6 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
if (ipv6_addr_is_multicast(&hdr->saddr))
goto err;
- /* While RFC4291 is not explicit about v4mapped addresses
- * in IPv6 headers, it seems clear linux dual-stack
- * model can not deal properly with these.
- * Security models could be fooled by ::ffff:127.0.0.1 for example.
- *
- * https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
- */
- if (ipv6_addr_v4mapped(&hdr->saddr))
- goto err;
-
skb->transport_header = skb->network_header + sizeof(*hdr);
IP6CB(skb)->nhoff = offsetof(struct ipv6hdr, nexthdr);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bd44ded7e50c..d0f007741e8e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1175,6 +1175,11 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&tcp6_request_sock_ops,
&tcp_request_sock_ipv6_ops, sk, skb);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 3d47d670e665..d17d39ccdf34 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -477,6 +477,11 @@ static int subflow_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&mptcp_subflow_request_sock_ops,
&subflow_request_sock_ipv6_ops, sk, skb);
--
2.30.1
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 175e476b8cdf2a4de7432583b49c871345e4f8a1 Mon Sep 17 00:00:00 2001
From: Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
Date: Mon, 8 Mar 2021 14:24:13 +1300
Subject: [PATCH] netfilter: x_tables: Use correct memory barriers.
When a new table value was assigned, it was followed by a write memory
barrier. This ensured that all writes before this point would complete
before any writes after this point. However, to determine whether the
rules are unused, the sequence counter is read. To ensure that all
writes have been done before these reads, a full memory barrier is
needed, not just a write memory barrier. The same argument applies when
incrementing the counter, before the rules are read.
Changing to using smp_mb() instead of smp_wmb() fixes the kernel panic
reported in cc00bcaa5899 (which is still present), while still
maintaining the same speed of replacing tables.
The smb_mb() barriers potentially slow the packet path, however testing
has shown no measurable change in performance on a 4-core MIPS64
platform.
Fixes: 7f5c6d4f665b ("netfilter: get rid of atomic ops in fast path")
Signed-off-by: Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
---
include/linux/netfilter/x_tables.h | 2 +-
net/netfilter/x_tables.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5deb099d156d..8ec48466410a 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -376,7 +376,7 @@ static inline unsigned int xt_write_recseq_begin(void)
* since addend is most likely 1
*/
__this_cpu_add(xt_recseq.sequence, addend);
- smp_wmb();
+ smp_mb();
return addend;
}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 7df3aef39c5c..6bd31a7a27fc 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1389,7 +1389,7 @@ xt_replace_table(struct xt_table *table,
table->private = newinfo;
/* make sure all cpus see new ->private value */
- smp_wmb();
+ smp_mb();
/*
* Even though table entries have now been swapped, other CPU's
--
2.30.1
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From a188bb5638d41aa99090ebf2f85d3505ab13fba5 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel(a)iogearbox.net>
Date: Wed, 10 Mar 2021 01:38:10 +0100
Subject: [PATCH] net, bpf: Fix ip6ip6 crash with collect_md populated skbs
I ran into a crash where setting up a ip6ip6 tunnel device which was /not/
set to collect_md mode was receiving collect_md populated skbs for xmit.
The BPF prog was populating the skb via bpf_skb_set_tunnel_key() which is
assigning special metadata dst entry and then redirecting the skb to the
device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit()
and in the latter it performs a neigh lookup based on skb_dst(skb) where
we trigger a NULL pointer dereference on dst->ops->neigh_lookup() since
the md_dst_ops do not populate neigh_lookup callback with a fake handler.
Transform the md_dst_ops into generic dst_blackhole_ops that can also be
reused elsewhere when needed, and use them for the metadata dst entries as
callback ops.
Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}()
from dst_init() which free the skb the same way modulo the splat. Given we
will be able to recover just fine from there, avoid any potential splats
iff this gets ever triggered in future (or worse, panic on warns when set).
Fixes: f38a9eb1f77b ("dst: Metadata destinations")
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/core/dst.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)
diff --git a/net/core/dst.c b/net/core/dst.c
index 5f6315601776..fb3bcba87744 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -275,37 +275,24 @@ unsigned int dst_blackhole_mtu(const struct dst_entry *dst)
}
EXPORT_SYMBOL_GPL(dst_blackhole_mtu);
-static struct dst_ops md_dst_ops = {
- .family = AF_UNSPEC,
+static struct dst_ops dst_blackhole_ops = {
+ .family = AF_UNSPEC,
+ .neigh_lookup = dst_blackhole_neigh_lookup,
+ .check = dst_blackhole_check,
+ .cow_metrics = dst_blackhole_cow_metrics,
+ .update_pmtu = dst_blackhole_update_pmtu,
+ .redirect = dst_blackhole_redirect,
+ .mtu = dst_blackhole_mtu,
};
-static int dst_md_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call output on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
-static int dst_md_discard(struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call input on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
static void __metadata_dst_init(struct metadata_dst *md_dst,
enum metadata_type type, u8 optslen)
-
{
struct dst_entry *dst;
dst = &md_dst->dst;
- dst_init(dst, &md_dst_ops, NULL, 1, DST_OBSOLETE_NONE,
+ dst_init(dst, &dst_blackhole_ops, NULL, 1, DST_OBSOLETE_NONE,
DST_METADATA | DST_NOCOUNT);
-
- dst->input = dst_md_discard;
- dst->output = dst_md_discard_out;
-
memset(dst + 1, 0, sizeof(*md_dst) + optslen - sizeof(*dst));
md_dst->type = type;
}
--
2.30.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 6ab4c3117aec4e08007d9e971fa4133e1de1082d Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean(a)nxp.com>
Date: Mon, 22 Mar 2021 20:21:08 +0200
Subject: [PATCH] net: bridge: don't notify switchdev for local FDB addresses
As explained in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/
the switchdev notifiers for FDB entries managed to have a zero-day bug.
The bridge would not say that this entry is local:
ip link add br0 type bridge
ip link set swp0 master br0
bridge fdb add dev swp0 00:01:02:03:04:05 master local
and the switchdev driver would be more than happy to offload it as a
normal static FDB entry. This is despite the fact that 'local' and
non-'local' entries have completely opposite directions: a local entry
is locally terminated and not forwarded, whereas a static entry is
forwarded and not locally terminated. So, for example, DSA would install
this entry on swp0 instead of installing it on the CPU port as it should.
There is an even sadder part, which is that the 'local' flag is implicit
if 'static' is not specified, meaning that this command produces the
same result of adding a 'local' entry:
bridge fdb add dev swp0 00:01:02:03:04:05 master
I've updated the man pages for 'bridge', and after reading it now, it
should be pretty clear to any user that the commands above were broken
and should have never resulted in the 00:01:02:03:04:05 address being
forwarded (this behavior is coherent with non-switchdev interfaces):
https://patchwork.kernel.org/project/netdevbpf/cover/20210211104502.2081443…
If you're a user reading this and this is what you want, just use:
bridge fdb add dev swp0 00:01:02:03:04:05 master static
Because switchdev should have given drivers the means from day one to
classify FDB entries as local/non-local, but didn't, it means that all
drivers are currently broken. So we can just as well omit the switchdev
notifications for local FDB entries, which is exactly what this patch
does to close the bug in stable trees. For further development work
where drivers might want to trap the local FDB entries to the host, we
can add a 'bool is_local' to br_switchdev_fdb_call_notifiers(), and
selectively make drivers act upon that bit, while all the others ignore
those entries if the 'is_local' bit is set.
Fixes: 6b26b51b1d13 ("net: bridge: Add support for notifying devices about FDB add/del")
Signed-off-by: Vladimir Oltean <vladimir.oltean(a)nxp.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/bridge/br_switchdev.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index b89503832fcc..1e24d9a2c9a7 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -128,6 +128,8 @@ br_switchdev_fdb_notify(const struct net_bridge_fdb_entry *fdb, int type)
{
if (!fdb->dst)
return;
+ if (test_bit(BR_FDB_LOCAL, &fdb->flags))
+ return;
switch (type) {
case RTM_DELNEIGH:
--
2.30.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 1f935e8e72ec28dddb2dc0650b3b6626a293d94b Mon Sep 17 00:00:00 2001
From: David Brazdil <dbrazdil(a)google.com>
Date: Fri, 19 Mar 2021 13:05:41 +0000
Subject: [PATCH] selinux: vsock: Set SID for socket returned by accept()
For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.
Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/vmw_vsock/af_vsock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5546710d8ac1..bc7fb9bf3351 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -755,6 +755,7 @@ static struct sock *__vsock_create(struct net *net,
vsk->buffer_size = psk->buffer_size;
vsk->buffer_min_size = psk->buffer_min_size;
vsk->buffer_max_size = psk->buffer_max_size;
+ security_sk_clone(parent, sk);
} else {
vsk->trusted = ns_capable_noaudit(&init_user_ns, CAP_NET_ADMIN);
vsk->owner = get_current_cred();
--
2.30.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From dcc32f4f183ab8479041b23a1525d48233df1d43 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski <kuba(a)kernel.org>
Date: Wed, 17 Mar 2021 09:55:15 -0700
Subject: [PATCH] ipv6: weaken the v4mapped source check
This reverts commit 6af1799aaf3f1bc8defedddfa00df3192445bbf3.
Commit 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out
https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
lists potential issues.
Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.
Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.
Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.
Fixes: 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao(a)fb.com>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/dccp/ipv6.c | 5 +++++
net/ipv6/ip6_input.c | 10 ----------
net/ipv6/tcp_ipv6.c | 5 +++++
net/mptcp/subflow.c | 5 +++++
4 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 1f73603913f5..2be5c69824f9 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -319,6 +319,11 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
return 0; /* discard, don't send a reset here */
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
if (dccp_bad_service_code(sk, service)) {
dcb->dccpd_reset_code = DCCP_RESET_CODE_BAD_SERVICE_CODE;
goto drop;
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index e9d2a4a409aa..80256717868e 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -245,16 +245,6 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
if (ipv6_addr_is_multicast(&hdr->saddr))
goto err;
- /* While RFC4291 is not explicit about v4mapped addresses
- * in IPv6 headers, it seems clear linux dual-stack
- * model can not deal properly with these.
- * Security models could be fooled by ::ffff:127.0.0.1 for example.
- *
- * https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
- */
- if (ipv6_addr_v4mapped(&hdr->saddr))
- goto err;
-
skb->transport_header = skb->network_header + sizeof(*hdr);
IP6CB(skb)->nhoff = offsetof(struct ipv6hdr, nexthdr);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bd44ded7e50c..d0f007741e8e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1175,6 +1175,11 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&tcp6_request_sock_ops,
&tcp_request_sock_ipv6_ops, sk, skb);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 3d47d670e665..d17d39ccdf34 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -477,6 +477,11 @@ static int subflow_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
+ if (ipv6_addr_v4mapped(&ipv6_hdr(skb)->saddr)) {
+ __IP6_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_INHDRERRORS);
+ return 0;
+ }
+
return tcp_conn_request(&mptcp_subflow_request_sock_ops,
&subflow_request_sock_ipv6_ops, sk, skb);
--
2.30.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 7233da86697efef41288f8b713c10c2499cffe85 Mon Sep 17 00:00:00 2001
From: Alexander Ovechkin <ovov(a)yandex-team.ru>
Date: Mon, 15 Mar 2021 14:05:45 +0300
Subject: [PATCH] tcp: relookup sock for RST+ACK packets handled by obsolete
req sock
Currently tcp_check_req can be called with obsolete req socket for which big
socket have been already created (because of CPU race or early demux
assigning req socket to multiple packets in gro batch).
Commit e0f9759f530bf789e984 ("tcp: try to keep packet if SYN_RCV race
is lost") added retry in case when tcp_check_req is called for PSH|ACK packet.
But if client sends RST+ACK immediatly after connection being
established (it is performing healthcheck, for example) retry does not
occur. In that case tcp_check_req tries to close req socket,
leaving big socket active.
Fixes: e0f9759f530 ("tcp: try to keep packet if SYN_RCV race is lost")
Signed-off-by: Alexander Ovechkin <ovov(a)yandex-team.ru>
Reported-by: Oleg Senin <olegsenin(a)yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
include/net/inet_connection_sock.h | 2 +-
net/ipv4/inet_connection_sock.c | 7 +++++--
net/ipv4/tcp_minisocks.c | 7 +++++--
3 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 10a625760de9..3c8c59471bc1 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -282,7 +282,7 @@ static inline int inet_csk_reqsk_queue_is_full(const struct sock *sk)
return inet_csk_reqsk_queue_len(sk) >= sk->sk_max_ack_backlog;
}
-void inet_csk_reqsk_queue_drop(struct sock *sk, struct request_sock *req);
+bool inet_csk_reqsk_queue_drop(struct sock *sk, struct request_sock *req);
void inet_csk_reqsk_queue_drop_and_put(struct sock *sk, struct request_sock *req);
static inline void inet_csk_prepare_for_destroy_sock(struct sock *sk)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 6bd7ca09af03..fd472eae4f5c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -705,12 +705,15 @@ static bool reqsk_queue_unlink(struct request_sock *req)
return found;
}
-void inet_csk_reqsk_queue_drop(struct sock *sk, struct request_sock *req)
+bool inet_csk_reqsk_queue_drop(struct sock *sk, struct request_sock *req)
{
- if (reqsk_queue_unlink(req)) {
+ bool unlinked = reqsk_queue_unlink(req);
+
+ if (unlinked) {
reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
reqsk_put(req);
}
+ return unlinked;
}
EXPORT_SYMBOL(inet_csk_reqsk_queue_drop);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 0055ae0a3bf8..7513ba45553d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -804,8 +804,11 @@ embryonic_reset:
tcp_reset(sk, skb);
}
if (!fastopen) {
- inet_csk_reqsk_queue_drop(sk, req);
- __NET_INC_STATS(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+ bool unlinked = inet_csk_reqsk_queue_drop(sk, req);
+
+ if (unlinked)
+ __NET_INC_STATS(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+ *req_stolen = !unlinked;
}
return NULL;
}
--
2.30.1
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From a188bb5638d41aa99090ebf2f85d3505ab13fba5 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel(a)iogearbox.net>
Date: Wed, 10 Mar 2021 01:38:10 +0100
Subject: [PATCH] net, bpf: Fix ip6ip6 crash with collect_md populated skbs
I ran into a crash where setting up a ip6ip6 tunnel device which was /not/
set to collect_md mode was receiving collect_md populated skbs for xmit.
The BPF prog was populating the skb via bpf_skb_set_tunnel_key() which is
assigning special metadata dst entry and then redirecting the skb to the
device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit()
and in the latter it performs a neigh lookup based on skb_dst(skb) where
we trigger a NULL pointer dereference on dst->ops->neigh_lookup() since
the md_dst_ops do not populate neigh_lookup callback with a fake handler.
Transform the md_dst_ops into generic dst_blackhole_ops that can also be
reused elsewhere when needed, and use them for the metadata dst entries as
callback ops.
Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}()
from dst_init() which free the skb the same way modulo the splat. Given we
will be able to recover just fine from there, avoid any potential splats
iff this gets ever triggered in future (or worse, panic on warns when set).
Fixes: f38a9eb1f77b ("dst: Metadata destinations")
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/core/dst.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)
diff --git a/net/core/dst.c b/net/core/dst.c
index 5f6315601776..fb3bcba87744 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -275,37 +275,24 @@ unsigned int dst_blackhole_mtu(const struct dst_entry *dst)
}
EXPORT_SYMBOL_GPL(dst_blackhole_mtu);
-static struct dst_ops md_dst_ops = {
- .family = AF_UNSPEC,
+static struct dst_ops dst_blackhole_ops = {
+ .family = AF_UNSPEC,
+ .neigh_lookup = dst_blackhole_neigh_lookup,
+ .check = dst_blackhole_check,
+ .cow_metrics = dst_blackhole_cow_metrics,
+ .update_pmtu = dst_blackhole_update_pmtu,
+ .redirect = dst_blackhole_redirect,
+ .mtu = dst_blackhole_mtu,
};
-static int dst_md_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call output on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
-static int dst_md_discard(struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call input on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
static void __metadata_dst_init(struct metadata_dst *md_dst,
enum metadata_type type, u8 optslen)
-
{
struct dst_entry *dst;
dst = &md_dst->dst;
- dst_init(dst, &md_dst_ops, NULL, 1, DST_OBSOLETE_NONE,
+ dst_init(dst, &dst_blackhole_ops, NULL, 1, DST_OBSOLETE_NONE,
DST_METADATA | DST_NOCOUNT);
-
- dst->input = dst_md_discard;
- dst->output = dst_md_discard_out;
-
memset(dst + 1, 0, sizeof(*md_dst) + optslen - sizeof(*dst));
md_dst->type = type;
}
--
2.30.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 6ab4c3117aec4e08007d9e971fa4133e1de1082d Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean(a)nxp.com>
Date: Mon, 22 Mar 2021 20:21:08 +0200
Subject: [PATCH] net: bridge: don't notify switchdev for local FDB addresses
As explained in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/
the switchdev notifiers for FDB entries managed to have a zero-day bug.
The bridge would not say that this entry is local:
ip link add br0 type bridge
ip link set swp0 master br0
bridge fdb add dev swp0 00:01:02:03:04:05 master local
and the switchdev driver would be more than happy to offload it as a
normal static FDB entry. This is despite the fact that 'local' and
non-'local' entries have completely opposite directions: a local entry
is locally terminated and not forwarded, whereas a static entry is
forwarded and not locally terminated. So, for example, DSA would install
this entry on swp0 instead of installing it on the CPU port as it should.
There is an even sadder part, which is that the 'local' flag is implicit
if 'static' is not specified, meaning that this command produces the
same result of adding a 'local' entry:
bridge fdb add dev swp0 00:01:02:03:04:05 master
I've updated the man pages for 'bridge', and after reading it now, it
should be pretty clear to any user that the commands above were broken
and should have never resulted in the 00:01:02:03:04:05 address being
forwarded (this behavior is coherent with non-switchdev interfaces):
https://patchwork.kernel.org/project/netdevbpf/cover/20210211104502.2081443…
If you're a user reading this and this is what you want, just use:
bridge fdb add dev swp0 00:01:02:03:04:05 master static
Because switchdev should have given drivers the means from day one to
classify FDB entries as local/non-local, but didn't, it means that all
drivers are currently broken. So we can just as well omit the switchdev
notifications for local FDB entries, which is exactly what this patch
does to close the bug in stable trees. For further development work
where drivers might want to trap the local FDB entries to the host, we
can add a 'bool is_local' to br_switchdev_fdb_call_notifiers(), and
selectively make drivers act upon that bit, while all the others ignore
those entries if the 'is_local' bit is set.
Fixes: 6b26b51b1d13 ("net: bridge: Add support for notifying devices about FDB add/del")
Signed-off-by: Vladimir Oltean <vladimir.oltean(a)nxp.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/bridge/br_switchdev.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index b89503832fcc..1e24d9a2c9a7 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -128,6 +128,8 @@ br_switchdev_fdb_notify(const struct net_bridge_fdb_entry *fdb, int type)
{
if (!fdb->dst)
return;
+ if (test_bit(BR_FDB_LOCAL, &fdb->flags))
+ return;
switch (type) {
case RTM_DELNEIGH:
--
2.30.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 1f935e8e72ec28dddb2dc0650b3b6626a293d94b Mon Sep 17 00:00:00 2001
From: David Brazdil <dbrazdil(a)google.com>
Date: Fri, 19 Mar 2021 13:05:41 +0000
Subject: [PATCH] selinux: vsock: Set SID for socket returned by accept()
For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.
Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/vmw_vsock/af_vsock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5546710d8ac1..bc7fb9bf3351 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -755,6 +755,7 @@ static struct sock *__vsock_create(struct net *net,
vsk->buffer_size = psk->buffer_size;
vsk->buffer_min_size = psk->buffer_min_size;
vsk->buffer_max_size = psk->buffer_max_size;
+ security_sk_clone(parent, sk);
} else {
vsk->trusted = ns_capable_noaudit(&init_user_ns, CAP_NET_ADMIN);
vsk->owner = get_current_cred();
--
2.30.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From 9398e9c0b1d44eeb700e9e766c02bcc765c82570 Mon Sep 17 00:00:00 2001
From: Ido Schimmel <idosch(a)nvidia.com>
Date: Wed, 10 Mar 2021 12:28:01 +0200
Subject: [PATCH] drop_monitor: Perform cleanup upon probe registration failure
In the rare case that drop_monitor fails to register its probe on the
'napi_poll' tracepoint, it will not deactivate its hysteresis timer as
part of the error path. If the hysteresis timer was armed by the shortly
lived 'kfree_skb' probe and user space retries to initiate tracing, a
warning will be emitted for trying to initialize an active object [1].
Fix this by properly undoing all the operations that were done prior to
probe registration, in both software and hardware code paths.
Note that syzkaller managed to fail probe registration by injecting a
slab allocation failure [2].
[1]
ODEBUG: init active (active state 0) object type: timer_list hint: sched_send_work+0x0/0x60 include/linux/list.h:135
WARNING: CPU: 1 PID: 8649 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Modules linked in:
CPU: 1 PID: 8649 Comm: syz-executor.0 Not tainted 5.11.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505
[...]
Call Trace:
__debug_object_init+0x524/0xd10 lib/debugobjects.c:588
debug_timer_init kernel/time/timer.c:722 [inline]
debug_init kernel/time/timer.c:770 [inline]
init_timer_key+0x2d/0x340 kernel/time/timer.c:814
net_dm_trace_on_set net/core/drop_monitor.c:1111 [inline]
set_all_monitor_traces net/core/drop_monitor.c:1188 [inline]
net_dm_monitor_start net/core/drop_monitor.c:1295 [inline]
net_dm_cmd_trace+0x720/0x1220 net/core/drop_monitor.c:1339
genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:739
genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:800
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:672
____sys_sendmsg+0x6e8/0x810 net/socket.c:2348
___sys_sendmsg+0xf3/0x170 net/socket.c:2402
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2435
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
[2]
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 1
CPU: 1 PID: 8645 Comm: syz-executor.0 Not tainted 5.11.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
dump_stack+0xfa/0x151
should_fail.cold+0x5/0xa
should_failslab+0x5/0x10
__kmalloc+0x72/0x3f0
tracepoint_add_func+0x378/0x990
tracepoint_probe_register+0x9c/0xe0
net_dm_cmd_trace+0x7fc/0x1220
genl_family_rcv_msg_doit+0x228/0x320
genl_rcv_msg+0x328/0x580
netlink_rcv_skb+0x153/0x420
genl_rcv+0x24/0x40
netlink_unicast+0x533/0x7d0
netlink_sendmsg+0x856/0xd90
sock_sendmsg+0xcf/0x120
____sys_sendmsg+0x6e8/0x810
___sys_sendmsg+0xf3/0x170
__sys_sendmsg+0xe5/0x1b0
do_syscall_64+0x2d/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 70c69274f354 ("drop_monitor: Initialize timer and work item upon tracing enable")
Fixes: 8ee2267ad33e ("drop_monitor: Convert to using devlink tracepoint")
Reported-by: syzbot+779559d6503f3a56213d(a)syzkaller.appspotmail.com
Signed-off-by: Ido Schimmel <idosch(a)nvidia.com>
Reviewed-by: Jiri Pirko <jiri(a)nvidia.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/core/drop_monitor.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 571f191c06d9..db65ce62b625 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -1053,6 +1053,20 @@ static int net_dm_hw_monitor_start(struct netlink_ext_ack *extack)
return 0;
err_module_put:
+ for_each_possible_cpu(cpu) {
+ struct per_cpu_dm_data *hw_data = &per_cpu(dm_hw_cpu_data, cpu);
+ struct sk_buff *skb;
+
+ del_timer_sync(&hw_data->send_timer);
+ cancel_work_sync(&hw_data->dm_alert_work);
+ while ((skb = __skb_dequeue(&hw_data->drop_queue))) {
+ struct devlink_trap_metadata *hw_metadata;
+
+ hw_metadata = NET_DM_SKB_CB(skb)->hw_metadata;
+ net_dm_hw_metadata_free(hw_metadata);
+ consume_skb(skb);
+ }
+ }
module_put(THIS_MODULE);
return rc;
}
@@ -1134,6 +1148,15 @@ static int net_dm_trace_on_set(struct netlink_ext_ack *extack)
err_unregister_trace:
unregister_trace_kfree_skb(ops->kfree_skb_probe, NULL);
err_module_put:
+ for_each_possible_cpu(cpu) {
+ struct per_cpu_dm_data *data = &per_cpu(dm_cpu_data, cpu);
+ struct sk_buff *skb;
+
+ del_timer_sync(&data->send_timer);
+ cancel_work_sync(&data->dm_alert_work);
+ while ((skb = __skb_dequeue(&data->drop_queue)))
+ consume_skb(skb);
+ }
module_put(THIS_MODULE);
return rc;
}
--
2.30.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Thanks,
Sasha
------------------ original commit in Linus's tree ------------------
>From a188bb5638d41aa99090ebf2f85d3505ab13fba5 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel(a)iogearbox.net>
Date: Wed, 10 Mar 2021 01:38:10 +0100
Subject: [PATCH] net, bpf: Fix ip6ip6 crash with collect_md populated skbs
I ran into a crash where setting up a ip6ip6 tunnel device which was /not/
set to collect_md mode was receiving collect_md populated skbs for xmit.
The BPF prog was populating the skb via bpf_skb_set_tunnel_key() which is
assigning special metadata dst entry and then redirecting the skb to the
device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit()
and in the latter it performs a neigh lookup based on skb_dst(skb) where
we trigger a NULL pointer dereference on dst->ops->neigh_lookup() since
the md_dst_ops do not populate neigh_lookup callback with a fake handler.
Transform the md_dst_ops into generic dst_blackhole_ops that can also be
reused elsewhere when needed, and use them for the metadata dst entries as
callback ops.
Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}()
from dst_init() which free the skb the same way modulo the splat. Given we
will be able to recover just fine from there, avoid any potential splats
iff this gets ever triggered in future (or worse, panic on warns when set).
Fixes: f38a9eb1f77b ("dst: Metadata destinations")
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
---
net/core/dst.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)
diff --git a/net/core/dst.c b/net/core/dst.c
index 5f6315601776..fb3bcba87744 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -275,37 +275,24 @@ unsigned int dst_blackhole_mtu(const struct dst_entry *dst)
}
EXPORT_SYMBOL_GPL(dst_blackhole_mtu);
-static struct dst_ops md_dst_ops = {
- .family = AF_UNSPEC,
+static struct dst_ops dst_blackhole_ops = {
+ .family = AF_UNSPEC,
+ .neigh_lookup = dst_blackhole_neigh_lookup,
+ .check = dst_blackhole_check,
+ .cow_metrics = dst_blackhole_cow_metrics,
+ .update_pmtu = dst_blackhole_update_pmtu,
+ .redirect = dst_blackhole_redirect,
+ .mtu = dst_blackhole_mtu,
};
-static int dst_md_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call output on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
-static int dst_md_discard(struct sk_buff *skb)
-{
- WARN_ONCE(1, "Attempting to call input on metadata dst\n");
- kfree_skb(skb);
- return 0;
-}
-
static void __metadata_dst_init(struct metadata_dst *md_dst,
enum metadata_type type, u8 optslen)
-
{
struct dst_entry *dst;
dst = &md_dst->dst;
- dst_init(dst, &md_dst_ops, NULL, 1, DST_OBSOLETE_NONE,
+ dst_init(dst, &dst_blackhole_ops, NULL, 1, DST_OBSOLETE_NONE,
DST_METADATA | DST_NOCOUNT);
-
- dst->input = dst_md_discard;
- dst->output = dst_md_discard_out;
-
memset(dst + 1, 0, sizeof(*md_dst) + optslen - sizeof(*dst));
md_dst->type = type;
}
--
2.30.1
hello
5.11.11-rc2 compiles, boots and run fine here (on an i7-6700 with Fedora
34 Beta)
Thanks.
but I see some error's/warning's with the new gcc-11.0.1 compared to the
elder gcc-10.2.1-11
In function ‘poly1305_simd_init’,
inlined from ‘crypto_poly1305_setdctxkey’ at
arch/x86/crypto/poly1305_glue.c:150:4,
inlined from ‘crypto_poly1305_setdctxkey’ at
arch/x86/crypto/poly1305_glue.c:144:21,
inlined from ‘poly1305_update_arch’ at
arch/x86/crypto/poly1305_glue.c:181:8:
arch/x86/crypto/poly1305_glue.c:86:9: warning: ‘poly1305_init_x86_64’
reading 32 bytes from a region of size 16 [-Wstringop-overread]
86 | poly1305_init_x86_64(ctx, key);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/crypto/poly1305_glue.c: In function ‘poly1305_update_arch’:
arch/x86/crypto/poly1305_glue.c:86:9: note: referencing argument 2 of
type ‘const u8 *’ {aka ‘const unsigned char *’}
arch/x86/crypto/poly1305_glue.c:18:17: note: in a call to function
‘poly1305_init_x86_64’
18 | asmlinkage void poly1305_init_x86_64(void *ctx,
| ^~~~~~~~~~~~~~~~~~~~
In file included from ./include/linux/bitmap.h:9,
from ./include/linux/cpumask.h:12,
from ./arch/x86/include/asm/cpumask.h:5,
from ./arch/x86/include/asm/msr.h:11,
from ./arch/x86/include/asm/processor.h:22,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:59,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from ./include/linux/rcupdate.h:27,
from ./include/linux/rbtree.h:22,
from ./include/linux/iova.h:14,
from ./include/linux/intel-iommu.h:14,
from arch/x86/kernel/tboot.c:9:
In function ‘memcmp’,
inlined from ‘tboot_probe’ at arch/x86/kernel/tboot.c:70:6:
./include/linux/string.h:283:33: warning: ‘__builtin_memcmp_eq’
specified bound 16 exceeds source size 0 [-Wstringop-overread]
283 | #define __underlying_memcmp __builtin_memcmp
| ^
./include/linux/string.h:488:16: note: in expansion of macro
‘__underlying_memcmp’
488 | return __underlying_memcmp(p, q, size);
| ^~~~~~~~~~~~~~~~~~~
lib/crypto/poly1305-donna64.c:15:67: warning: argument 2 of type ‘const
u8[16]’ {aka ‘const unsigned char[16]’} with mismatched bound
[-Warray-parameter=]
15 | void poly1305_core_setkey(struct poly1305_core_key *key, const
u8 raw_key[16])
| ~~~~~~~~~^~~~~~~~~~~
In file included from lib/crypto/poly1305-donna64.c:11:
./include/crypto/internal/poly1305.h:21:68: note: previously declared as
‘const u8 *’ {aka ‘const unsigned char *’}
21 | void poly1305_core_setkey(struct poly1305_core_key *key, const
u8 *raw_key);
| ~~~~~~~~~~^~~~~~~
arch/x86/lib/msr-smp.c:255:51: warning: argument 2 of type ‘u32 *’ {aka
‘unsigned int *’} declared as a pointer [-Warray-parameter=]
255 | int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 *regs)
| ~~~~~^~~~
In file included from ./arch/x86/include/asm/processor.h:22,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:59,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from arch/x86/lib/msr-smp.c:3:
./arch/x86/include/asm/msr.h:347:50: note: previously declared as an
array ‘u32[8]’ {aka ‘unsigned int[8]’}
347 | int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]);
| ~~~~^~~~~~~
arch/x86/lib/msr-smp.c:268:51: warning: argument 2 of type ‘u32 *’ {aka
‘unsigned int *’} declared as a pointer [-Warray-parameter=]
268 | int wrmsr_safe_regs_on_cpu(unsigned int cpu, u32 *regs)
| ~~~~~^~~~
In file included from ./arch/x86/include/asm/processor.h:22,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:59,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from arch/x86/lib/msr-smp.c:3:
./arch/x86/include/asm/msr.h:348:50: note: previously declared as an
array ‘u32[8]’ {aka ‘unsigned int[8]’}
348 | int wrmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8]);
| ~~~~^~~~~~~
In function ‘snb_wm_latency_quirk’,
inlined from ‘ilk_setup_wm_latency’ at
drivers/gpu/drm/i915/intel_pm.c:3108:3,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3057:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3057 | intel_print_wm_latency(dev_priv, "Primary",
dev_priv->wm.pri_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3057:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘snb_wm_latency_quirk’,
inlined from ‘ilk_setup_wm_latency’ at
drivers/gpu/drm/i915/intel_pm.c:3108:3,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3058:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3058 | intel_print_wm_latency(dev_priv, "Sprite",
dev_priv->wm.spr_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3058:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘snb_wm_latency_quirk’,
inlined from ‘ilk_setup_wm_latency’ at
drivers/gpu/drm/i915/intel_pm.c:3108:3,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3059:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3059 | intel_print_wm_latency(dev_priv, "Cursor",
dev_priv->wm.cur_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3059:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘snb_wm_lp3_irq_quirk’,
inlined from ‘ilk_setup_wm_latency’ at
drivers/gpu/drm/i915/intel_pm.c:3109:3,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3086:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3086 | intel_print_wm_latency(dev_priv, "Primary",
dev_priv->wm.pri_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3086:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘snb_wm_lp3_irq_quirk’,
inlined from ‘ilk_setup_wm_latency’ at
drivers/gpu/drm/i915/intel_pm.c:3109:3,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3087:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3087 | intel_print_wm_latency(dev_priv, "Sprite",
dev_priv->wm.spr_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3087:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘snb_wm_lp3_irq_quirk’,
inlined from ‘ilk_setup_wm_latency’ at
drivers/gpu/drm/i915/intel_pm.c:3109:3,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3088:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3088 | intel_print_wm_latency(dev_priv, "Cursor",
dev_priv->wm.cur_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3088:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘ilk_setup_wm_latency’,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3103:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3103 | intel_print_wm_latency(dev_priv, "Primary",
dev_priv->wm.pri_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3103:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘ilk_setup_wm_latency’,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3104:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3104 | intel_print_wm_latency(dev_priv, "Sprite",
dev_priv->wm.spr_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3104:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘ilk_setup_wm_latency’,
inlined from ‘intel_init_pm’ at drivers/gpu/drm/i915/intel_pm.c:7650:3:
drivers/gpu/drm/i915/intel_pm.c:3105:9: warning:
‘intel_print_wm_latency’ reading 16 bytes from a region of size 10
[-Wstringop-overread]
3105 | intel_print_wm_latency(dev_priv, "Cursor",
dev_priv->wm.cur_latency);
|
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/intel_pm.c: In function ‘intel_init_pm’:
drivers/gpu/drm/i915/intel_pm.c:3105:9: note: referencing argument 3 of
type ‘const u16 *’ {aka ‘const short unsigned int *’}
drivers/gpu/drm/i915/intel_pm.c:2994:13: note: in a call to function
‘intel_print_wm_latency’
2994 | static void intel_print_wm_latency(struct drm_i915_private
*dev_priv,
| ^~~~~~~~~~~~~~~~~~~~~~
In function ‘intel_dp_check_mst_status’,
inlined from ‘intel_dp_hpd_pulse’ at
drivers/gpu/drm/i915/display/intel_dp.c:7122:8:
drivers/gpu/drm/i915/display/intel_dp.c:5797:22: warning:
‘drm_dp_channel_eq_ok’ reading 6 bytes from a region of size 4
[-Wstringop-overread]
5797 | !drm_dp_channel_eq_ok(&esi[10],
intel_dp->lane_count)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/display/intel_dp.c: In function ‘intel_dp_hpd_pulse’:
drivers/gpu/drm/i915/display/intel_dp.c:5797:22: note: referencing
argument 1 of type ‘const u8 *’ {aka ‘const unsigned char *’}
In file included from drivers/gpu/drm/i915/display/intel_dp.c:38:
./include/drm/drm_dp_helper.h:1267:6: note: in a call to function
‘drm_dp_channel_eq_ok’
1267 | bool drm_dp_channel_eq_ok(const u8
link_status[DP_LINK_STATUS_SIZE],
| ^~~~~~~~~~~~~~~~~~~~
--
regards
Ronald
Clock registration must be performed before the component is
registered. aic32x4_component_probe attempts to get all the
clocks right off the bat. If the component is registered before
the clocks there is a race condition where the clocks may not
be registered by the time aic32x4_componet_probe actually runs.
Fixes: d1c859d ("ASoC: codec: tlv3204: Increased maximum supported channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Annaliese McDermond <nh6z(a)nh6z.net>
---
sound/soc/codecs/tlv320aic32x4.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/sound/soc/codecs/tlv320aic32x4.c b/sound/soc/codecs/tlv320aic32x4.c
index 1ac3b3b12dc6..b689f26fc4be 100644
--- a/sound/soc/codecs/tlv320aic32x4.c
+++ b/sound/soc/codecs/tlv320aic32x4.c
@@ -1243,6 +1243,10 @@ int aic32x4_probe(struct device *dev, struct regmap *regmap)
if (ret)
goto err_disable_regulators;
+ ret = aic32x4_register_clocks(dev, aic32x4->mclk_name);
+ if (ret)
+ goto err_disable_regulators;
+
ret = devm_snd_soc_register_component(dev,
&soc_component_dev_aic32x4, &aic32x4_dai, 1);
if (ret) {
@@ -1250,10 +1254,6 @@ int aic32x4_probe(struct device *dev, struct regmap *regmap)
goto err_disable_regulators;
}
- ret = aic32x4_register_clocks(dev, aic32x4->mclk_name);
- if (ret)
- goto err_disable_regulators;
-
return 0;
err_disable_regulators:
--
2.27.0
Attention Sir,
I represent an investor seeking to invest in any lucrative investment in your country, If you have a solid background and the idea of making good profit in real estate or in any business, Please write to me for possible business cooperation. My email is j.t.elyseerene(a)gmail.com
Regards,
Jacquet Thierry Elysee Rene.
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ee7febce051945be28ad86d16a15886f878204de Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Date: Tue, 16 Feb 2021 10:03:51 -0500
Subject: [PATCH] arm64: mm: correct the inside linear map range during hotplug
check
Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
linear map range is not checked correctly.
The start physical address that linear map covers can be actually at the
end of the range because of randomization. Check that and if so reduce it
to 0.
This can be verified on QEMU with setting kaslr-seed to ~0ul:
memstart_offset_seed = 0xffff
START: __pa(_PAGE_OFFSET(vabits_actual)) = ffff9000c0000000
END: __pa(PAGE_END - 1) = 1000bfffffff
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating linear mapping")
Tested-by: Tyler Hicks <tyhicks(a)linux.microsoft.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Link: https://lore.kernel.org/r/20210216150351.129018-2-pasha.tatashin@soleen.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7484ea4f6ba0..5d9550fdb9cf 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1448,6 +1448,22 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
struct range arch_get_mappable_range(void)
{
struct range mhp_range;
+ u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
+ u64 end_linear_pa = __pa(PAGE_END - 1);
+
+ if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
+ /*
+ * Check for a wrap, it is possible because of randomized linear
+ * mapping the start physical address is actually bigger than
+ * the end physical address. In this case set start to zero
+ * because [0, end_linear_pa] range must still be able to cover
+ * all addressable physical addresses.
+ */
+ if (start_linear_pa > end_linear_pa)
+ start_linear_pa = 0;
+ }
+
+ WARN_ON(start_linear_pa > end_linear_pa);
/*
* Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
@@ -1455,8 +1471,9 @@ struct range arch_get_mappable_range(void)
* range which can be mapped inside this linear mapping range, must
* also be derived from its end points.
*/
- mhp_range.start = __pa(_PAGE_OFFSET(vabits_actual));
- mhp_range.end = __pa(PAGE_END - 1);
+ mhp_range.start = start_linear_pa;
+ mhp_range.end = end_linear_pa;
+
return mhp_range;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ee7febce051945be28ad86d16a15886f878204de Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Date: Tue, 16 Feb 2021 10:03:51 -0500
Subject: [PATCH] arm64: mm: correct the inside linear map range during hotplug
check
Memory hotplug may fail on systems with CONFIG_RANDOMIZE_BASE because the
linear map range is not checked correctly.
The start physical address that linear map covers can be actually at the
end of the range because of randomization. Check that and if so reduce it
to 0.
This can be verified on QEMU with setting kaslr-seed to ~0ul:
memstart_offset_seed = 0xffff
START: __pa(_PAGE_OFFSET(vabits_actual)) = ffff9000c0000000
END: __pa(PAGE_END - 1) = 1000bfffffff
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Fixes: 58284a901b42 ("arm64/mm: Validate hotplug range before creating linear mapping")
Tested-by: Tyler Hicks <tyhicks(a)linux.microsoft.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Link: https://lore.kernel.org/r/20210216150351.129018-2-pasha.tatashin@soleen.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7484ea4f6ba0..5d9550fdb9cf 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1448,6 +1448,22 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
struct range arch_get_mappable_range(void)
{
struct range mhp_range;
+ u64 start_linear_pa = __pa(_PAGE_OFFSET(vabits_actual));
+ u64 end_linear_pa = __pa(PAGE_END - 1);
+
+ if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
+ /*
+ * Check for a wrap, it is possible because of randomized linear
+ * mapping the start physical address is actually bigger than
+ * the end physical address. In this case set start to zero
+ * because [0, end_linear_pa] range must still be able to cover
+ * all addressable physical addresses.
+ */
+ if (start_linear_pa > end_linear_pa)
+ start_linear_pa = 0;
+ }
+
+ WARN_ON(start_linear_pa > end_linear_pa);
/*
* Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
@@ -1455,8 +1471,9 @@ struct range arch_get_mappable_range(void)
* range which can be mapped inside this linear mapping range, must
* also be derived from its end points.
*/
- mhp_range.start = __pa(_PAGE_OFFSET(vabits_actual));
- mhp_range.end = __pa(PAGE_END - 1);
+ mhp_range.start = start_linear_pa;
+ mhp_range.end = end_linear_pa;
+
return mhp_range;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e21aa341785c679dd409c8cb71f864c00fe6c463 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast(a)kernel.org>
Date: Tue, 16 Mar 2021 14:00:07 -0700
Subject: [PATCH] bpf: Fix fexit trampoline.
The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
The synchronize_rcu_tasks() will not wait for such tasks to complete.
In such case the trampoline image will be freed and when the task
wakes up the return IP will point to freed memory causing the crash.
Solve this by adding percpu_ref_get/put for the duration of trampoline
and separate trampoline vs its image life times.
The "half page" optimization has to be removed, since
first_half->second_half->first_half transition cannot be guaranteed to
complete in deterministic time. Every trampoline update becomes a new image.
The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
call_rcu_tasks. Together they will wait for the original function and
trampoline asm to complete. The trampoline is patched from nop to jmp to skip
fexit progs. They are freed independently from the trampoline. The image with
fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
will wait for both sleepable and non-sleepable progs to complete.
Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
Reported-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Paul E. McKenney <paulmck(a)kernel.org> # for RCU
Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail…
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 747bba0a584a..72b5a57e9e31 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1936,7 +1936,7 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
* add rsp, 8 // skip eth_type_trans's frame
* ret // return to its caller
*/
-int arch_prepare_bpf_trampoline(void *image, void *image_end,
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end,
const struct btf_func_model *m, u32 flags,
struct bpf_tramp_progs *tprogs,
void *orig_call)
@@ -1975,6 +1975,15 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
save_regs(m, &prog, nr_args, stack_size);
+ if (flags & BPF_TRAMP_F_CALL_ORIG) {
+ /* arg1: mov rdi, im */
+ emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
+ if (emit_call(&prog, __bpf_tramp_enter, prog)) {
+ ret = -EINVAL;
+ goto cleanup;
+ }
+ }
+
if (fentry->nr_progs)
if (invoke_bpf(m, &prog, fentry, stack_size))
return -EINVAL;
@@ -1993,8 +2002,7 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
}
if (flags & BPF_TRAMP_F_CALL_ORIG) {
- if (fentry->nr_progs || fmod_ret->nr_progs)
- restore_regs(m, &prog, nr_args, stack_size);
+ restore_regs(m, &prog, nr_args, stack_size);
/* call original function */
if (emit_call(&prog, orig_call, prog)) {
@@ -2003,6 +2011,8 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
}
/* remember return value in a stack for bpf prog to access */
emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
+ im->ip_after_call = prog;
+ emit_nops(&prog, 5);
}
if (fmod_ret->nr_progs) {
@@ -2033,9 +2043,17 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
* the return value is only updated on the stack and still needs to be
* restored to R0.
*/
- if (flags & BPF_TRAMP_F_CALL_ORIG)
+ if (flags & BPF_TRAMP_F_CALL_ORIG) {
+ im->ip_epilogue = prog;
+ /* arg1: mov rdi, im */
+ emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
+ if (emit_call(&prog, __bpf_tramp_exit, prog)) {
+ ret = -EINVAL;
+ goto cleanup;
+ }
/* restore original return value back into RAX */
emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
+ }
EMIT1(0x5B); /* pop rbx */
EMIT1(0xC9); /* leave */
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d7e0f479a5b0..3625f019767d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -21,6 +21,7 @@
#include <linux/capability.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>
+#include <linux/percpu-refcount.h>
struct bpf_verifier_env;
struct bpf_verifier_log;
@@ -556,7 +557,8 @@ struct bpf_tramp_progs {
* fentry = a set of program to run before calling original function
* fexit = a set of program to run after original function
*/
-int arch_prepare_bpf_trampoline(void *image, void *image_end,
+struct bpf_tramp_image;
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
const struct btf_func_model *m, u32 flags,
struct bpf_tramp_progs *tprogs,
void *orig_call);
@@ -565,6 +567,8 @@ u64 notrace __bpf_prog_enter(struct bpf_prog *prog);
void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start);
u64 notrace __bpf_prog_enter_sleepable(struct bpf_prog *prog);
void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start);
+void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr);
+void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr);
struct bpf_ksym {
unsigned long start;
@@ -583,6 +587,18 @@ enum bpf_tramp_prog_type {
BPF_TRAMP_REPLACE, /* more than MAX */
};
+struct bpf_tramp_image {
+ void *image;
+ struct bpf_ksym ksym;
+ struct percpu_ref pcref;
+ void *ip_after_call;
+ void *ip_epilogue;
+ union {
+ struct rcu_head rcu;
+ struct work_struct work;
+ };
+};
+
struct bpf_trampoline {
/* hlist for trampoline_table */
struct hlist_node hlist;
@@ -605,9 +621,8 @@ struct bpf_trampoline {
/* Number of attached programs. A counter per kind. */
int progs_cnt[BPF_TRAMP_MAX];
/* Executable image of trampoline */
- void *image;
+ struct bpf_tramp_image *cur_image;
u64 selector;
- struct bpf_ksym ksym;
};
struct bpf_attach_target_info {
@@ -691,6 +706,8 @@ void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym);
void bpf_image_ksym_del(struct bpf_ksym *ksym);
void bpf_ksym_add(struct bpf_ksym *ksym);
void bpf_ksym_del(struct bpf_ksym *ksym);
+int bpf_jit_charge_modmem(u32 pages);
+void bpf_jit_uncharge_modmem(u32 pages);
#else
static inline int bpf_trampoline_link_prog(struct bpf_prog *prog,
struct bpf_trampoline *tr)
@@ -787,7 +804,6 @@ struct bpf_prog_aux {
bool func_proto_unreliable;
bool sleepable;
bool tail_call_reachable;
- enum bpf_tramp_prog_type trampoline_prog_type;
struct hlist_node tramp_hlist;
/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
const struct btf_type *attach_func_proto;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 1a666a975416..70f6fd4fa305 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -430,7 +430,7 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
tprogs[BPF_TRAMP_FENTRY].progs[0] = prog;
tprogs[BPF_TRAMP_FENTRY].nr_progs = 1;
- err = arch_prepare_bpf_trampoline(image,
+ err = arch_prepare_bpf_trampoline(NULL, image,
st_map->image + PAGE_SIZE,
&st_ops->func_models[i], 0,
tprogs, NULL);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3a283bf97f2f..75244ecb2389 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -827,7 +827,7 @@ static int __init bpf_jit_charge_init(void)
}
pure_initcall(bpf_jit_charge_init);
-static int bpf_jit_charge_modmem(u32 pages)
+int bpf_jit_charge_modmem(u32 pages)
{
if (atomic_long_add_return(pages, &bpf_jit_current) >
(bpf_jit_limit >> PAGE_SHIFT)) {
@@ -840,7 +840,7 @@ static int bpf_jit_charge_modmem(u32 pages)
return 0;
}
-static void bpf_jit_uncharge_modmem(u32 pages)
+void bpf_jit_uncharge_modmem(u32 pages)
{
atomic_long_sub(pages, &bpf_jit_current);
}
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 7bc3b3209224..1f3a4be4b175 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -57,19 +57,10 @@ void bpf_image_ksym_del(struct bpf_ksym *ksym)
PAGE_SIZE, true, ksym->name);
}
-static void bpf_trampoline_ksym_add(struct bpf_trampoline *tr)
-{
- struct bpf_ksym *ksym = &tr->ksym;
-
- snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%llu", tr->key);
- bpf_image_ksym_add(tr->image, ksym);
-}
-
static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
{
struct bpf_trampoline *tr;
struct hlist_head *head;
- void *image;
int i;
mutex_lock(&trampoline_mutex);
@@ -84,14 +75,6 @@ static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
if (!tr)
goto out;
- /* is_root was checked earlier. No need for bpf_jit_charge_modmem() */
- image = bpf_jit_alloc_exec_page();
- if (!image) {
- kfree(tr);
- tr = NULL;
- goto out;
- }
-
tr->key = key;
INIT_HLIST_NODE(&tr->hlist);
hlist_add_head(&tr->hlist, head);
@@ -99,9 +82,6 @@ static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
mutex_init(&tr->mutex);
for (i = 0; i < BPF_TRAMP_MAX; i++)
INIT_HLIST_HEAD(&tr->progs_hlist[i]);
- tr->image = image;
- INIT_LIST_HEAD_RCU(&tr->ksym.lnode);
- bpf_trampoline_ksym_add(tr);
out:
mutex_unlock(&trampoline_mutex);
return tr;
@@ -185,10 +165,142 @@ bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total)
return tprogs;
}
+static void __bpf_tramp_image_put_deferred(struct work_struct *work)
+{
+ struct bpf_tramp_image *im;
+
+ im = container_of(work, struct bpf_tramp_image, work);
+ bpf_image_ksym_del(&im->ksym);
+ bpf_jit_free_exec(im->image);
+ bpf_jit_uncharge_modmem(1);
+ percpu_ref_exit(&im->pcref);
+ kfree_rcu(im, rcu);
+}
+
+/* callback, fexit step 3 or fentry step 2 */
+static void __bpf_tramp_image_put_rcu(struct rcu_head *rcu)
+{
+ struct bpf_tramp_image *im;
+
+ im = container_of(rcu, struct bpf_tramp_image, rcu);
+ INIT_WORK(&im->work, __bpf_tramp_image_put_deferred);
+ schedule_work(&im->work);
+}
+
+/* callback, fexit step 2. Called after percpu_ref_kill confirms. */
+static void __bpf_tramp_image_release(struct percpu_ref *pcref)
+{
+ struct bpf_tramp_image *im;
+
+ im = container_of(pcref, struct bpf_tramp_image, pcref);
+ call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu);
+}
+
+/* callback, fexit or fentry step 1 */
+static void __bpf_tramp_image_put_rcu_tasks(struct rcu_head *rcu)
+{
+ struct bpf_tramp_image *im;
+
+ im = container_of(rcu, struct bpf_tramp_image, rcu);
+ if (im->ip_after_call)
+ /* the case of fmod_ret/fexit trampoline and CONFIG_PREEMPTION=y */
+ percpu_ref_kill(&im->pcref);
+ else
+ /* the case of fentry trampoline */
+ call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu);
+}
+
+static void bpf_tramp_image_put(struct bpf_tramp_image *im)
+{
+ /* The trampoline image that calls original function is using:
+ * rcu_read_lock_trace to protect sleepable bpf progs
+ * rcu_read_lock to protect normal bpf progs
+ * percpu_ref to protect trampoline itself
+ * rcu tasks to protect trampoline asm not covered by percpu_ref
+ * (which are few asm insns before __bpf_tramp_enter and
+ * after __bpf_tramp_exit)
+ *
+ * The trampoline is unreachable before bpf_tramp_image_put().
+ *
+ * First, patch the trampoline to avoid calling into fexit progs.
+ * The progs will be freed even if the original function is still
+ * executing or sleeping.
+ * In case of CONFIG_PREEMPT=y use call_rcu_tasks() to wait on
+ * first few asm instructions to execute and call into
+ * __bpf_tramp_enter->percpu_ref_get.
+ * Then use percpu_ref_kill to wait for the trampoline and the original
+ * function to finish.
+ * Then use call_rcu_tasks() to make sure few asm insns in
+ * the trampoline epilogue are done as well.
+ *
+ * In !PREEMPT case the task that got interrupted in the first asm
+ * insns won't go through an RCU quiescent state which the
+ * percpu_ref_kill will be waiting for. Hence the first
+ * call_rcu_tasks() is not necessary.
+ */
+ if (im->ip_after_call) {
+ int err = bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP,
+ NULL, im->ip_epilogue);
+ WARN_ON(err);
+ if (IS_ENABLED(CONFIG_PREEMPTION))
+ call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
+ else
+ percpu_ref_kill(&im->pcref);
+ return;
+ }
+
+ /* The trampoline without fexit and fmod_ret progs doesn't call original
+ * function and doesn't use percpu_ref.
+ * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
+ * Then use call_rcu_tasks() to wait for the rest of trampoline asm
+ * and normal progs.
+ */
+ call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
+}
+
+static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key, u32 idx)
+{
+ struct bpf_tramp_image *im;
+ struct bpf_ksym *ksym;
+ void *image;
+ int err = -ENOMEM;
+
+ im = kzalloc(sizeof(*im), GFP_KERNEL);
+ if (!im)
+ goto out;
+
+ err = bpf_jit_charge_modmem(1);
+ if (err)
+ goto out_free_im;
+
+ err = -ENOMEM;
+ im->image = image = bpf_jit_alloc_exec_page();
+ if (!image)
+ goto out_uncharge;
+
+ err = percpu_ref_init(&im->pcref, __bpf_tramp_image_release, 0, GFP_KERNEL);
+ if (err)
+ goto out_free_image;
+
+ ksym = &im->ksym;
+ INIT_LIST_HEAD_RCU(&ksym->lnode);
+ snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%llu_%u", key, idx);
+ bpf_image_ksym_add(image, ksym);
+ return im;
+
+out_free_image:
+ bpf_jit_free_exec(im->image);
+out_uncharge:
+ bpf_jit_uncharge_modmem(1);
+out_free_im:
+ kfree(im);
+out:
+ return ERR_PTR(err);
+}
+
static int bpf_trampoline_update(struct bpf_trampoline *tr)
{
- void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE/2;
- void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE/2;
+ struct bpf_tramp_image *im;
struct bpf_tramp_progs *tprogs;
u32 flags = BPF_TRAMP_F_RESTORE_REGS;
int err, total;
@@ -198,41 +310,42 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
return PTR_ERR(tprogs);
if (total == 0) {
- err = unregister_fentry(tr, old_image);
+ err = unregister_fentry(tr, tr->cur_image->image);
+ bpf_tramp_image_put(tr->cur_image);
+ tr->cur_image = NULL;
tr->selector = 0;
goto out;
}
+ im = bpf_tramp_image_alloc(tr->key, tr->selector);
+ if (IS_ERR(im)) {
+ err = PTR_ERR(im);
+ goto out;
+ }
+
if (tprogs[BPF_TRAMP_FEXIT].nr_progs ||
tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs)
flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
- /* Though the second half of trampoline page is unused a task could be
- * preempted in the middle of the first half of trampoline and two
- * updates to trampoline would change the code from underneath the
- * preempted task. Hence wait for tasks to voluntarily schedule or go
- * to userspace.
- * The same trampoline can hold both sleepable and non-sleepable progs.
- * synchronize_rcu_tasks_trace() is needed to make sure all sleepable
- * programs finish executing.
- * Wait for these two grace periods together.
- */
- synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
-
- err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2,
+ err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
&tr->func.model, flags, tprogs,
tr->func.addr);
if (err < 0)
goto out;
- if (tr->selector)
+ WARN_ON(tr->cur_image && tr->selector == 0);
+ WARN_ON(!tr->cur_image && tr->selector);
+ if (tr->cur_image)
/* progs already running at this address */
- err = modify_fentry(tr, old_image, new_image);
+ err = modify_fentry(tr, tr->cur_image->image, im->image);
else
/* first time registering */
- err = register_fentry(tr, new_image);
+ err = register_fentry(tr, im->image);
if (err)
goto out;
+ if (tr->cur_image)
+ bpf_tramp_image_put(tr->cur_image);
+ tr->cur_image = im;
tr->selector++;
out:
kfree(tprogs);
@@ -364,17 +477,12 @@ void bpf_trampoline_put(struct bpf_trampoline *tr)
goto out;
if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT])))
goto out;
- bpf_image_ksym_del(&tr->ksym);
- /* This code will be executed when all bpf progs (both sleepable and
- * non-sleepable) went through
- * bpf_prog_put()->call_rcu[_tasks_trace]()->bpf_prog_free_deferred().
- * Hence no need for another synchronize_rcu_tasks_trace() here,
- * but synchronize_rcu_tasks() is still needed, since trampoline
- * may not have had any sleepable programs and we need to wait
- * for tasks to get out of trampoline code before freeing it.
+ /* This code will be executed even when the last bpf_tramp_image
+ * is alive. All progs are detached from the trampoline and the
+ * trampoline image is patched with jmp into epilogue to skip
+ * fexit progs. The fentry-only trampoline will be freed via
+ * multiple rcu callbacks.
*/
- synchronize_rcu_tasks();
- bpf_jit_free_exec(tr->image);
hlist_del(&tr->hlist);
kfree(tr);
out:
@@ -478,8 +586,18 @@ void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start)
rcu_read_unlock_trace();
}
+void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr)
+{
+ percpu_ref_get(&tr->pcref);
+}
+
+void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr)
+{
+ percpu_ref_put(&tr->pcref);
+}
+
int __weak
-arch_prepare_bpf_trampoline(void *image, void *image_end,
+arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
const struct btf_func_model *m, u32 flags,
struct bpf_tramp_progs *tprogs,
void *orig_call)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e21aa341785c679dd409c8cb71f864c00fe6c463 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast(a)kernel.org>
Date: Tue, 16 Mar 2021 14:00:07 -0700
Subject: [PATCH] bpf: Fix fexit trampoline.
The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
The synchronize_rcu_tasks() will not wait for such tasks to complete.
In such case the trampoline image will be freed and when the task
wakes up the return IP will point to freed memory causing the crash.
Solve this by adding percpu_ref_get/put for the duration of trampoline
and separate trampoline vs its image life times.
The "half page" optimization has to be removed, since
first_half->second_half->first_half transition cannot be guaranteed to
complete in deterministic time. Every trampoline update becomes a new image.
The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
call_rcu_tasks. Together they will wait for the original function and
trampoline asm to complete. The trampoline is patched from nop to jmp to skip
fexit progs. They are freed independently from the trampoline. The image with
fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
will wait for both sleepable and non-sleepable progs to complete.
Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
Reported-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Paul E. McKenney <paulmck(a)kernel.org> # for RCU
Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail…
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 747bba0a584a..72b5a57e9e31 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1936,7 +1936,7 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
* add rsp, 8 // skip eth_type_trans's frame
* ret // return to its caller
*/
-int arch_prepare_bpf_trampoline(void *image, void *image_end,
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end,
const struct btf_func_model *m, u32 flags,
struct bpf_tramp_progs *tprogs,
void *orig_call)
@@ -1975,6 +1975,15 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
save_regs(m, &prog, nr_args, stack_size);
+ if (flags & BPF_TRAMP_F_CALL_ORIG) {
+ /* arg1: mov rdi, im */
+ emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
+ if (emit_call(&prog, __bpf_tramp_enter, prog)) {
+ ret = -EINVAL;
+ goto cleanup;
+ }
+ }
+
if (fentry->nr_progs)
if (invoke_bpf(m, &prog, fentry, stack_size))
return -EINVAL;
@@ -1993,8 +2002,7 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
}
if (flags & BPF_TRAMP_F_CALL_ORIG) {
- if (fentry->nr_progs || fmod_ret->nr_progs)
- restore_regs(m, &prog, nr_args, stack_size);
+ restore_regs(m, &prog, nr_args, stack_size);
/* call original function */
if (emit_call(&prog, orig_call, prog)) {
@@ -2003,6 +2011,8 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
}
/* remember return value in a stack for bpf prog to access */
emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
+ im->ip_after_call = prog;
+ emit_nops(&prog, 5);
}
if (fmod_ret->nr_progs) {
@@ -2033,9 +2043,17 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end,
* the return value is only updated on the stack and still needs to be
* restored to R0.
*/
- if (flags & BPF_TRAMP_F_CALL_ORIG)
+ if (flags & BPF_TRAMP_F_CALL_ORIG) {
+ im->ip_epilogue = prog;
+ /* arg1: mov rdi, im */
+ emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
+ if (emit_call(&prog, __bpf_tramp_exit, prog)) {
+ ret = -EINVAL;
+ goto cleanup;
+ }
/* restore original return value back into RAX */
emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
+ }
EMIT1(0x5B); /* pop rbx */
EMIT1(0xC9); /* leave */
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d7e0f479a5b0..3625f019767d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -21,6 +21,7 @@
#include <linux/capability.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>
+#include <linux/percpu-refcount.h>
struct bpf_verifier_env;
struct bpf_verifier_log;
@@ -556,7 +557,8 @@ struct bpf_tramp_progs {
* fentry = a set of program to run before calling original function
* fexit = a set of program to run after original function
*/
-int arch_prepare_bpf_trampoline(void *image, void *image_end,
+struct bpf_tramp_image;
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
const struct btf_func_model *m, u32 flags,
struct bpf_tramp_progs *tprogs,
void *orig_call);
@@ -565,6 +567,8 @@ u64 notrace __bpf_prog_enter(struct bpf_prog *prog);
void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start);
u64 notrace __bpf_prog_enter_sleepable(struct bpf_prog *prog);
void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start);
+void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr);
+void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr);
struct bpf_ksym {
unsigned long start;
@@ -583,6 +587,18 @@ enum bpf_tramp_prog_type {
BPF_TRAMP_REPLACE, /* more than MAX */
};
+struct bpf_tramp_image {
+ void *image;
+ struct bpf_ksym ksym;
+ struct percpu_ref pcref;
+ void *ip_after_call;
+ void *ip_epilogue;
+ union {
+ struct rcu_head rcu;
+ struct work_struct work;
+ };
+};
+
struct bpf_trampoline {
/* hlist for trampoline_table */
struct hlist_node hlist;
@@ -605,9 +621,8 @@ struct bpf_trampoline {
/* Number of attached programs. A counter per kind. */
int progs_cnt[BPF_TRAMP_MAX];
/* Executable image of trampoline */
- void *image;
+ struct bpf_tramp_image *cur_image;
u64 selector;
- struct bpf_ksym ksym;
};
struct bpf_attach_target_info {
@@ -691,6 +706,8 @@ void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym);
void bpf_image_ksym_del(struct bpf_ksym *ksym);
void bpf_ksym_add(struct bpf_ksym *ksym);
void bpf_ksym_del(struct bpf_ksym *ksym);
+int bpf_jit_charge_modmem(u32 pages);
+void bpf_jit_uncharge_modmem(u32 pages);
#else
static inline int bpf_trampoline_link_prog(struct bpf_prog *prog,
struct bpf_trampoline *tr)
@@ -787,7 +804,6 @@ struct bpf_prog_aux {
bool func_proto_unreliable;
bool sleepable;
bool tail_call_reachable;
- enum bpf_tramp_prog_type trampoline_prog_type;
struct hlist_node tramp_hlist;
/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
const struct btf_type *attach_func_proto;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 1a666a975416..70f6fd4fa305 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -430,7 +430,7 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
tprogs[BPF_TRAMP_FENTRY].progs[0] = prog;
tprogs[BPF_TRAMP_FENTRY].nr_progs = 1;
- err = arch_prepare_bpf_trampoline(image,
+ err = arch_prepare_bpf_trampoline(NULL, image,
st_map->image + PAGE_SIZE,
&st_ops->func_models[i], 0,
tprogs, NULL);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3a283bf97f2f..75244ecb2389 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -827,7 +827,7 @@ static int __init bpf_jit_charge_init(void)
}
pure_initcall(bpf_jit_charge_init);
-static int bpf_jit_charge_modmem(u32 pages)
+int bpf_jit_charge_modmem(u32 pages)
{
if (atomic_long_add_return(pages, &bpf_jit_current) >
(bpf_jit_limit >> PAGE_SHIFT)) {
@@ -840,7 +840,7 @@ static int bpf_jit_charge_modmem(u32 pages)
return 0;
}
-static void bpf_jit_uncharge_modmem(u32 pages)
+void bpf_jit_uncharge_modmem(u32 pages)
{
atomic_long_sub(pages, &bpf_jit_current);
}
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 7bc3b3209224..1f3a4be4b175 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -57,19 +57,10 @@ void bpf_image_ksym_del(struct bpf_ksym *ksym)
PAGE_SIZE, true, ksym->name);
}
-static void bpf_trampoline_ksym_add(struct bpf_trampoline *tr)
-{
- struct bpf_ksym *ksym = &tr->ksym;
-
- snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%llu", tr->key);
- bpf_image_ksym_add(tr->image, ksym);
-}
-
static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
{
struct bpf_trampoline *tr;
struct hlist_head *head;
- void *image;
int i;
mutex_lock(&trampoline_mutex);
@@ -84,14 +75,6 @@ static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
if (!tr)
goto out;
- /* is_root was checked earlier. No need for bpf_jit_charge_modmem() */
- image = bpf_jit_alloc_exec_page();
- if (!image) {
- kfree(tr);
- tr = NULL;
- goto out;
- }
-
tr->key = key;
INIT_HLIST_NODE(&tr->hlist);
hlist_add_head(&tr->hlist, head);
@@ -99,9 +82,6 @@ static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
mutex_init(&tr->mutex);
for (i = 0; i < BPF_TRAMP_MAX; i++)
INIT_HLIST_HEAD(&tr->progs_hlist[i]);
- tr->image = image;
- INIT_LIST_HEAD_RCU(&tr->ksym.lnode);
- bpf_trampoline_ksym_add(tr);
out:
mutex_unlock(&trampoline_mutex);
return tr;
@@ -185,10 +165,142 @@ bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total)
return tprogs;
}
+static void __bpf_tramp_image_put_deferred(struct work_struct *work)
+{
+ struct bpf_tramp_image *im;
+
+ im = container_of(work, struct bpf_tramp_image, work);
+ bpf_image_ksym_del(&im->ksym);
+ bpf_jit_free_exec(im->image);
+ bpf_jit_uncharge_modmem(1);
+ percpu_ref_exit(&im->pcref);
+ kfree_rcu(im, rcu);
+}
+
+/* callback, fexit step 3 or fentry step 2 */
+static void __bpf_tramp_image_put_rcu(struct rcu_head *rcu)
+{
+ struct bpf_tramp_image *im;
+
+ im = container_of(rcu, struct bpf_tramp_image, rcu);
+ INIT_WORK(&im->work, __bpf_tramp_image_put_deferred);
+ schedule_work(&im->work);
+}
+
+/* callback, fexit step 2. Called after percpu_ref_kill confirms. */
+static void __bpf_tramp_image_release(struct percpu_ref *pcref)
+{
+ struct bpf_tramp_image *im;
+
+ im = container_of(pcref, struct bpf_tramp_image, pcref);
+ call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu);
+}
+
+/* callback, fexit or fentry step 1 */
+static void __bpf_tramp_image_put_rcu_tasks(struct rcu_head *rcu)
+{
+ struct bpf_tramp_image *im;
+
+ im = container_of(rcu, struct bpf_tramp_image, rcu);
+ if (im->ip_after_call)
+ /* the case of fmod_ret/fexit trampoline and CONFIG_PREEMPTION=y */
+ percpu_ref_kill(&im->pcref);
+ else
+ /* the case of fentry trampoline */
+ call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu);
+}
+
+static void bpf_tramp_image_put(struct bpf_tramp_image *im)
+{
+ /* The trampoline image that calls original function is using:
+ * rcu_read_lock_trace to protect sleepable bpf progs
+ * rcu_read_lock to protect normal bpf progs
+ * percpu_ref to protect trampoline itself
+ * rcu tasks to protect trampoline asm not covered by percpu_ref
+ * (which are few asm insns before __bpf_tramp_enter and
+ * after __bpf_tramp_exit)
+ *
+ * The trampoline is unreachable before bpf_tramp_image_put().
+ *
+ * First, patch the trampoline to avoid calling into fexit progs.
+ * The progs will be freed even if the original function is still
+ * executing or sleeping.
+ * In case of CONFIG_PREEMPT=y use call_rcu_tasks() to wait on
+ * first few asm instructions to execute and call into
+ * __bpf_tramp_enter->percpu_ref_get.
+ * Then use percpu_ref_kill to wait for the trampoline and the original
+ * function to finish.
+ * Then use call_rcu_tasks() to make sure few asm insns in
+ * the trampoline epilogue are done as well.
+ *
+ * In !PREEMPT case the task that got interrupted in the first asm
+ * insns won't go through an RCU quiescent state which the
+ * percpu_ref_kill will be waiting for. Hence the first
+ * call_rcu_tasks() is not necessary.
+ */
+ if (im->ip_after_call) {
+ int err = bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP,
+ NULL, im->ip_epilogue);
+ WARN_ON(err);
+ if (IS_ENABLED(CONFIG_PREEMPTION))
+ call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
+ else
+ percpu_ref_kill(&im->pcref);
+ return;
+ }
+
+ /* The trampoline without fexit and fmod_ret progs doesn't call original
+ * function and doesn't use percpu_ref.
+ * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
+ * Then use call_rcu_tasks() to wait for the rest of trampoline asm
+ * and normal progs.
+ */
+ call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
+}
+
+static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key, u32 idx)
+{
+ struct bpf_tramp_image *im;
+ struct bpf_ksym *ksym;
+ void *image;
+ int err = -ENOMEM;
+
+ im = kzalloc(sizeof(*im), GFP_KERNEL);
+ if (!im)
+ goto out;
+
+ err = bpf_jit_charge_modmem(1);
+ if (err)
+ goto out_free_im;
+
+ err = -ENOMEM;
+ im->image = image = bpf_jit_alloc_exec_page();
+ if (!image)
+ goto out_uncharge;
+
+ err = percpu_ref_init(&im->pcref, __bpf_tramp_image_release, 0, GFP_KERNEL);
+ if (err)
+ goto out_free_image;
+
+ ksym = &im->ksym;
+ INIT_LIST_HEAD_RCU(&ksym->lnode);
+ snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%llu_%u", key, idx);
+ bpf_image_ksym_add(image, ksym);
+ return im;
+
+out_free_image:
+ bpf_jit_free_exec(im->image);
+out_uncharge:
+ bpf_jit_uncharge_modmem(1);
+out_free_im:
+ kfree(im);
+out:
+ return ERR_PTR(err);
+}
+
static int bpf_trampoline_update(struct bpf_trampoline *tr)
{
- void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE/2;
- void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE/2;
+ struct bpf_tramp_image *im;
struct bpf_tramp_progs *tprogs;
u32 flags = BPF_TRAMP_F_RESTORE_REGS;
int err, total;
@@ -198,41 +310,42 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
return PTR_ERR(tprogs);
if (total == 0) {
- err = unregister_fentry(tr, old_image);
+ err = unregister_fentry(tr, tr->cur_image->image);
+ bpf_tramp_image_put(tr->cur_image);
+ tr->cur_image = NULL;
tr->selector = 0;
goto out;
}
+ im = bpf_tramp_image_alloc(tr->key, tr->selector);
+ if (IS_ERR(im)) {
+ err = PTR_ERR(im);
+ goto out;
+ }
+
if (tprogs[BPF_TRAMP_FEXIT].nr_progs ||
tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs)
flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
- /* Though the second half of trampoline page is unused a task could be
- * preempted in the middle of the first half of trampoline and two
- * updates to trampoline would change the code from underneath the
- * preempted task. Hence wait for tasks to voluntarily schedule or go
- * to userspace.
- * The same trampoline can hold both sleepable and non-sleepable progs.
- * synchronize_rcu_tasks_trace() is needed to make sure all sleepable
- * programs finish executing.
- * Wait for these two grace periods together.
- */
- synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
-
- err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2,
+ err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
&tr->func.model, flags, tprogs,
tr->func.addr);
if (err < 0)
goto out;
- if (tr->selector)
+ WARN_ON(tr->cur_image && tr->selector == 0);
+ WARN_ON(!tr->cur_image && tr->selector);
+ if (tr->cur_image)
/* progs already running at this address */
- err = modify_fentry(tr, old_image, new_image);
+ err = modify_fentry(tr, tr->cur_image->image, im->image);
else
/* first time registering */
- err = register_fentry(tr, new_image);
+ err = register_fentry(tr, im->image);
if (err)
goto out;
+ if (tr->cur_image)
+ bpf_tramp_image_put(tr->cur_image);
+ tr->cur_image = im;
tr->selector++;
out:
kfree(tprogs);
@@ -364,17 +477,12 @@ void bpf_trampoline_put(struct bpf_trampoline *tr)
goto out;
if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT])))
goto out;
- bpf_image_ksym_del(&tr->ksym);
- /* This code will be executed when all bpf progs (both sleepable and
- * non-sleepable) went through
- * bpf_prog_put()->call_rcu[_tasks_trace]()->bpf_prog_free_deferred().
- * Hence no need for another synchronize_rcu_tasks_trace() here,
- * but synchronize_rcu_tasks() is still needed, since trampoline
- * may not have had any sleepable programs and we need to wait
- * for tasks to get out of trampoline code before freeing it.
+ /* This code will be executed even when the last bpf_tramp_image
+ * is alive. All progs are detached from the trampoline and the
+ * trampoline image is patched with jmp into epilogue to skip
+ * fexit progs. The fentry-only trampoline will be freed via
+ * multiple rcu callbacks.
*/
- synchronize_rcu_tasks();
- bpf_jit_free_exec(tr->image);
hlist_del(&tr->hlist);
kfree(tr);
out:
@@ -478,8 +586,18 @@ void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start)
rcu_read_unlock_trace();
}
+void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr)
+{
+ percpu_ref_get(&tr->pcref);
+}
+
+void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr)
+{
+ percpu_ref_put(&tr->pcref);
+}
+
int __weak
-arch_prepare_bpf_trampoline(void *image, void *image_end,
+arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
const struct btf_func_model *m, u32 flags,
struct bpf_tramp_progs *tprogs,
void *orig_call)
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From eb85890b29e4d7ae1accdcfba35ed8b16ba9fb97 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 25 Feb 2021 10:13:29 -0700
Subject: [PATCH] io_uring: ensure SQPOLL startup is triggered before error
shutdown
syzbot reports the following hang:
INFO: task syz-executor.0:12538 can't die for more than 143 seconds.
task:syz-executor.0 state:D stack:28352 pid:12538 ppid: 8423 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4324 [inline]
__schedule+0x90c/0x21a0 kernel/sched/core.c:5075
schedule+0xcf/0x270 kernel/sched/core.c:5154
schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common kernel/sched/completion.c:106 [inline]
wait_for_common kernel/sched/completion.c:117 [inline]
wait_for_completion+0x168/0x270 kernel/sched/completion.c:138
io_sq_thread_finish+0x96/0x580 fs/io_uring.c:7152
io_sq_offload_create fs/io_uring.c:7929 [inline]
io_uring_create fs/io_uring.c:9465 [inline]
io_uring_setup+0x1fb2/0x2c20 fs/io_uring.c:9550
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
which is due to exiting after the SQPOLL thread has been created, but
hasn't been started yet. Ensure that we always complete the startup
side when waiting for it to exit.
Reported-by: syzbot+c927c937cba8ef66dd4a(a)syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index fbc85afa9a87..ef743594d34a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7141,6 +7141,7 @@ static void io_sq_thread_finish(struct io_ring_ctx *ctx)
struct io_sq_data *sqd = ctx->sq_data;
if (sqd) {
+ complete(&sqd->startup);
if (sqd->thread) {
wait_for_completion(&ctx->sq_thread_comp);
io_sq_thread_park(sqd);
@@ -7927,7 +7928,7 @@ static void io_sq_offload_start(struct io_ring_ctx *ctx)
{
struct io_sq_data *sqd = ctx->sq_data;
- if ((ctx->flags & IORING_SETUP_SQPOLL) && sqd->thread)
+ if (ctx->flags & IORING_SETUP_SQPOLL)
complete(&sqd->startup);
}
From: Peter Zijlstra <peterz(a)infradead.org>
commit 1b367ece0d7e696cab1c8501bab282cc6a538b3f upstream.
Since the futex_q can dissapear the instruction after assigning NULL,
this really should be a RELEASE barrier. That stops loads from hitting
dead memory too.
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: juri.lelli(a)arm.com
Cc: bigeasy(a)linutronix.de
Cc: xlpang(a)redhat.com
Cc: rostedt(a)goodmis.org
Cc: mathieu.desnoyers(a)efficios.com
Cc: jdesfossez(a)efficios.com
Cc: dvhart(a)infradead.org
Cc: bristot(a)redhat.com
Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Ben Hutchings <ben(a)decadent.org.uk>
---
kernel/futex.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 796b1c860839..e112a9d4c84f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1565,8 +1565,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
* memory barrier is required here to prevent the following
* store to lock_ptr from getting ahead of the plist_del.
*/
- smp_wmb();
- q->lock_ptr = NULL;
+ smp_store_release(&q->lock_ptr, NULL);
}
/*
The straight backport of 5.12's e1baddf8475b ("mm/memcg: set memcg when
splitting page") works fine in 5.11, but turned out to be wrong for 5.10:
because that relies on a separate flag, which must also be set for the
memcg to be recognized and uncharged and cleared when freeing. Fix that.
Signed-off-by: Hugh Dickins <hughd(a)google.com>
---
mm/memcontrol.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3274,13 +3274,17 @@ void obj_cgroup_uncharge(struct obj_cgro
void split_page_memcg(struct page *head, unsigned int nr)
{
struct mem_cgroup *memcg = head->mem_cgroup;
+ int kmemcg = PageKmemcg(head);
int i;
if (mem_cgroup_disabled() || !memcg)
return;
- for (i = 1; i < nr; i++)
+ for (i = 1; i < nr; i++) {
head[i].mem_cgroup = memcg;
+ if (kmemcg)
+ __SetPageKmemcg(head + i);
+ }
css_get_many(&memcg->css, nr - 1);
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From cfc63fc8126a93cbf95379bc4cad79a7b15b6ece Mon Sep 17 00:00:00 2001
From: Steve French <stfrench(a)microsoft.com>
Date: Fri, 26 Mar 2021 18:41:55 -0500
Subject: [PATCH] smb3: fix cached file size problems in duplicate extents
(reflink)
There were two problems (one of which could cause data corruption)
that were noticed with duplicate extents (ie reflink)
when debugging why various xfstests were being incorrectly skipped
(e.g. generic/138, generic/140, generic/142). First, we were not
updating the file size locally in the cache when extending a
file due to reflink (it would refresh after actimeo expires)
but xfstest was checking the size immediately which was still
0 so caused the test to be skipped. Second, we were setting
the target file size (which could shrink the file) in all cases
to the end of the reflinked range rather than only setting the
target file size when reflink would extend the file.
CC: <stable(a)vger.kernel.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 3e94f83897e9..f703204fb185 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -2038,6 +2038,7 @@ smb2_duplicate_extents(const unsigned int xid,
{
int rc;
unsigned int ret_data_len;
+ struct inode *inode;
struct duplicate_extents_to_file dup_ext_buf;
struct cifs_tcon *tcon = tlink_tcon(trgtfile->tlink);
@@ -2054,10 +2055,21 @@ smb2_duplicate_extents(const unsigned int xid,
cifs_dbg(FYI, "Duplicate extents: src off %lld dst off %lld len %lld\n",
src_off, dest_off, len);
- rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
- if (rc)
- goto duplicate_extents_out;
+ inode = d_inode(trgtfile->dentry);
+ if (inode->i_size < dest_off + len) {
+ rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
+ if (rc)
+ goto duplicate_extents_out;
+ /*
+ * Although also could set plausible allocation size (i_blocks)
+ * here in addition to setting the file size, in reflink
+ * it is likely that the target file is sparse. Its allocation
+ * size will be queried on next revalidate, but it is important
+ * to make sure that file's cached size is updated immediately
+ */
+ cifs_setsize(inode, dest_off + len);
+ }
rc = SMB2_ioctl(xid, tcon, trgtfile->fid.persistent_fid,
trgtfile->fid.volatile_fid,
FSCTL_DUPLICATE_EXTENTS_TO_FILE,
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From cfc63fc8126a93cbf95379bc4cad79a7b15b6ece Mon Sep 17 00:00:00 2001
From: Steve French <stfrench(a)microsoft.com>
Date: Fri, 26 Mar 2021 18:41:55 -0500
Subject: [PATCH] smb3: fix cached file size problems in duplicate extents
(reflink)
There were two problems (one of which could cause data corruption)
that were noticed with duplicate extents (ie reflink)
when debugging why various xfstests were being incorrectly skipped
(e.g. generic/138, generic/140, generic/142). First, we were not
updating the file size locally in the cache when extending a
file due to reflink (it would refresh after actimeo expires)
but xfstest was checking the size immediately which was still
0 so caused the test to be skipped. Second, we were setting
the target file size (which could shrink the file) in all cases
to the end of the reflinked range rather than only setting the
target file size when reflink would extend the file.
CC: <stable(a)vger.kernel.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 3e94f83897e9..f703204fb185 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -2038,6 +2038,7 @@ smb2_duplicate_extents(const unsigned int xid,
{
int rc;
unsigned int ret_data_len;
+ struct inode *inode;
struct duplicate_extents_to_file dup_ext_buf;
struct cifs_tcon *tcon = tlink_tcon(trgtfile->tlink);
@@ -2054,10 +2055,21 @@ smb2_duplicate_extents(const unsigned int xid,
cifs_dbg(FYI, "Duplicate extents: src off %lld dst off %lld len %lld\n",
src_off, dest_off, len);
- rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
- if (rc)
- goto duplicate_extents_out;
+ inode = d_inode(trgtfile->dentry);
+ if (inode->i_size < dest_off + len) {
+ rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
+ if (rc)
+ goto duplicate_extents_out;
+ /*
+ * Although also could set plausible allocation size (i_blocks)
+ * here in addition to setting the file size, in reflink
+ * it is likely that the target file is sparse. Its allocation
+ * size will be queried on next revalidate, but it is important
+ * to make sure that file's cached size is updated immediately
+ */
+ cifs_setsize(inode, dest_off + len);
+ }
rc = SMB2_ioctl(xid, tcon, trgtfile->fid.persistent_fid,
trgtfile->fid.volatile_fid,
FSCTL_DUPLICATE_EXTENTS_TO_FILE,
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From cfc63fc8126a93cbf95379bc4cad79a7b15b6ece Mon Sep 17 00:00:00 2001
From: Steve French <stfrench(a)microsoft.com>
Date: Fri, 26 Mar 2021 18:41:55 -0500
Subject: [PATCH] smb3: fix cached file size problems in duplicate extents
(reflink)
There were two problems (one of which could cause data corruption)
that were noticed with duplicate extents (ie reflink)
when debugging why various xfstests were being incorrectly skipped
(e.g. generic/138, generic/140, generic/142). First, we were not
updating the file size locally in the cache when extending a
file due to reflink (it would refresh after actimeo expires)
but xfstest was checking the size immediately which was still
0 so caused the test to be skipped. Second, we were setting
the target file size (which could shrink the file) in all cases
to the end of the reflinked range rather than only setting the
target file size when reflink would extend the file.
CC: <stable(a)vger.kernel.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 3e94f83897e9..f703204fb185 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -2038,6 +2038,7 @@ smb2_duplicate_extents(const unsigned int xid,
{
int rc;
unsigned int ret_data_len;
+ struct inode *inode;
struct duplicate_extents_to_file dup_ext_buf;
struct cifs_tcon *tcon = tlink_tcon(trgtfile->tlink);
@@ -2054,10 +2055,21 @@ smb2_duplicate_extents(const unsigned int xid,
cifs_dbg(FYI, "Duplicate extents: src off %lld dst off %lld len %lld\n",
src_off, dest_off, len);
- rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
- if (rc)
- goto duplicate_extents_out;
+ inode = d_inode(trgtfile->dentry);
+ if (inode->i_size < dest_off + len) {
+ rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
+ if (rc)
+ goto duplicate_extents_out;
+ /*
+ * Although also could set plausible allocation size (i_blocks)
+ * here in addition to setting the file size, in reflink
+ * it is likely that the target file is sparse. Its allocation
+ * size will be queried on next revalidate, but it is important
+ * to make sure that file's cached size is updated immediately
+ */
+ cifs_setsize(inode, dest_off + len);
+ }
rc = SMB2_ioctl(xid, tcon, trgtfile->fid.persistent_fid,
trgtfile->fid.volatile_fid,
FSCTL_DUPLICATE_EXTENTS_TO_FILE,
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From cfc63fc8126a93cbf95379bc4cad79a7b15b6ece Mon Sep 17 00:00:00 2001
From: Steve French <stfrench(a)microsoft.com>
Date: Fri, 26 Mar 2021 18:41:55 -0500
Subject: [PATCH] smb3: fix cached file size problems in duplicate extents
(reflink)
There were two problems (one of which could cause data corruption)
that were noticed with duplicate extents (ie reflink)
when debugging why various xfstests were being incorrectly skipped
(e.g. generic/138, generic/140, generic/142). First, we were not
updating the file size locally in the cache when extending a
file due to reflink (it would refresh after actimeo expires)
but xfstest was checking the size immediately which was still
0 so caused the test to be skipped. Second, we were setting
the target file size (which could shrink the file) in all cases
to the end of the reflinked range rather than only setting the
target file size when reflink would extend the file.
CC: <stable(a)vger.kernel.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 3e94f83897e9..f703204fb185 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -2038,6 +2038,7 @@ smb2_duplicate_extents(const unsigned int xid,
{
int rc;
unsigned int ret_data_len;
+ struct inode *inode;
struct duplicate_extents_to_file dup_ext_buf;
struct cifs_tcon *tcon = tlink_tcon(trgtfile->tlink);
@@ -2054,10 +2055,21 @@ smb2_duplicate_extents(const unsigned int xid,
cifs_dbg(FYI, "Duplicate extents: src off %lld dst off %lld len %lld\n",
src_off, dest_off, len);
- rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
- if (rc)
- goto duplicate_extents_out;
+ inode = d_inode(trgtfile->dentry);
+ if (inode->i_size < dest_off + len) {
+ rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
+ if (rc)
+ goto duplicate_extents_out;
+ /*
+ * Although also could set plausible allocation size (i_blocks)
+ * here in addition to setting the file size, in reflink
+ * it is likely that the target file is sparse. Its allocation
+ * size will be queried on next revalidate, but it is important
+ * to make sure that file's cached size is updated immediately
+ */
+ cifs_setsize(inode, dest_off + len);
+ }
rc = SMB2_ioctl(xid, tcon, trgtfile->fid.persistent_fid,
trgtfile->fid.volatile_fid,
FSCTL_DUPLICATE_EXTENTS_TO_FILE,
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From cfc63fc8126a93cbf95379bc4cad79a7b15b6ece Mon Sep 17 00:00:00 2001
From: Steve French <stfrench(a)microsoft.com>
Date: Fri, 26 Mar 2021 18:41:55 -0500
Subject: [PATCH] smb3: fix cached file size problems in duplicate extents
(reflink)
There were two problems (one of which could cause data corruption)
that were noticed with duplicate extents (ie reflink)
when debugging why various xfstests were being incorrectly skipped
(e.g. generic/138, generic/140, generic/142). First, we were not
updating the file size locally in the cache when extending a
file due to reflink (it would refresh after actimeo expires)
but xfstest was checking the size immediately which was still
0 so caused the test to be skipped. Second, we were setting
the target file size (which could shrink the file) in all cases
to the end of the reflinked range rather than only setting the
target file size when reflink would extend the file.
CC: <stable(a)vger.kernel.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 3e94f83897e9..f703204fb185 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -2038,6 +2038,7 @@ smb2_duplicate_extents(const unsigned int xid,
{
int rc;
unsigned int ret_data_len;
+ struct inode *inode;
struct duplicate_extents_to_file dup_ext_buf;
struct cifs_tcon *tcon = tlink_tcon(trgtfile->tlink);
@@ -2054,10 +2055,21 @@ smb2_duplicate_extents(const unsigned int xid,
cifs_dbg(FYI, "Duplicate extents: src off %lld dst off %lld len %lld\n",
src_off, dest_off, len);
- rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
- if (rc)
- goto duplicate_extents_out;
+ inode = d_inode(trgtfile->dentry);
+ if (inode->i_size < dest_off + len) {
+ rc = smb2_set_file_size(xid, tcon, trgtfile, dest_off + len, false);
+ if (rc)
+ goto duplicate_extents_out;
+ /*
+ * Although also could set plausible allocation size (i_blocks)
+ * here in addition to setting the file size, in reflink
+ * it is likely that the target file is sparse. Its allocation
+ * size will be queried on next revalidate, but it is important
+ * to make sure that file's cached size is updated immediately
+ */
+ cifs_setsize(inode, dest_off + len);
+ }
rc = SMB2_ioctl(xid, tcon, trgtfile->fid.persistent_fid,
trgtfile->fid.volatile_fid,
FSCTL_DUPLICATE_EXTENTS_TO_FILE,
This is a note to let you know that I've just added the patch titled
tty: fix memory leak in vc_deallocate
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
>From 211b4d42b70f1c1660feaa968dac0efc2a96ac4d Mon Sep 17 00:00:00 2001
From: Pavel Skripkin <paskripkin(a)gmail.com>
Date: Sun, 28 Mar 2021 00:44:43 +0300
Subject: tty: fix memory leak in vc_deallocate
syzbot reported memory leak in tty/vt.
The problem was in VT_DISALLOCATE ioctl cmd.
After allocating unimap with PIO_UNIMAP it wasn't
freed via VT_DISALLOCATE, but vc_cons[currcons].d was
zeroed.
Reported-by: syzbot+bcc922b19ccc64240b42(a)syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin(a)gmail.com>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20210327214443.21548-1-paskripkin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/vt/vt.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index d9366da51e06..01645e87b3d5 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -1381,6 +1381,7 @@ struct vc_data *vc_deallocate(unsigned int currcons)
atomic_notifier_call_chain(&vt_notifier_list, VT_DEALLOCATE, ¶m);
vcs_remove_sysfs(currcons);
visual_deinit(vc);
+ con_free_unimap(vc);
put_pid(vc->vt_pid);
vc_uniscr_set(vc, NULL);
kfree(vc->vc_screenbuf);
--
2.31.1
From: Gao Xiang <hsiangkao(a)redhat.com>
If any unknown i_format fields are set (may be of some new incompat
inode features), mark such inode as unsupported.
Just in case of any new incompat i_format fields added in the future.
Fixes: 431339ba9042 ("staging: erofs: add inode operations")
Cc: <stable(a)vger.kernel.org> # 4.19+
Signed-off-by: Gao Xiang <hsiangkao(a)redhat.com>
---
new potential inode features can also be covered by COMPR_CFGS & BIG_PCLUSTER
in the next cycle at least, and possibly need its new sb incompat feature as
well so it's not quite vital.
fs/erofs/erofs_fs.h | 3 +++
fs/erofs/inode.c | 7 +++++++
2 files changed, 10 insertions(+)
diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
index 9ad1615f4474..e8d04d808fa6 100644
--- a/fs/erofs/erofs_fs.h
+++ b/fs/erofs/erofs_fs.h
@@ -75,6 +75,9 @@ static inline bool erofs_inode_is_data_compressed(unsigned int datamode)
#define EROFS_I_VERSION_BIT 0
#define EROFS_I_DATALAYOUT_BIT 1
+#define EROFS_I_ALL \
+ ((1 << (EROFS_I_DATALAYOUT_BIT + EROFS_I_DATALAYOUT_BITS)) - 1)
+
/* 32-byte reduced form of an ondisk inode */
struct erofs_inode_compact {
__le16 i_format; /* inode format hints */
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 119fdce1b520..7ed2d7391692 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -44,6 +44,13 @@ static struct page *erofs_read_inode(struct inode *inode,
dic = page_address(page) + *ofs;
ifmt = le16_to_cpu(dic->i_format);
+ if (ifmt & ~EROFS_I_ALL) {
+ erofs_err(inode->i_sb, "unsupported i_format %u of nid %llu",
+ ifmt, vi->nid);
+ err = -EOPNOTSUPP;
+ goto err_out;
+ }
+
vi->datalayout = erofs_inode_datalayout(ifmt);
if (vi->datalayout >= EROFS_INODE_DATALAYOUT_MAX) {
erofs_err(inode->i_sb, "unsupported datalayout %u of nid %llu",
--
2.20.1
The patch titled
Subject: fs: direct-io: fix missing sdio->boundary
has been added to the -mm tree. Its filename is
fs-direct-io-fix-missing-sdio-boundary.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/fs-direct-io-fix-missing-sdio-bou…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/fs-direct-io-fix-missing-sdio-bou…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Jack Qiu <jack.qiu(a)huawei.com>
Subject: fs: direct-io: fix missing sdio->boundary
dio_send_cur_page() may clear sdio->boundary, so save it to avoid missing
a boundary.
Link: https://lkml.kernel.org/r/20210322042253.38312-1-jack.qiu@huawei.com
Fixes: b1058b981272 ("direct-io: submit bio after boundary buffer is
added to it")
Signed-off-by: Jack Qiu <jack.qiu(a)huawei.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/direct-io.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/fs/direct-io.c~fs-direct-io-fix-missing-sdio-boundary
+++ a/fs/direct-io.c
@@ -812,6 +812,7 @@ submit_page_section(struct dio *dio, str
struct buffer_head *map_bh)
{
int ret = 0;
+ int boundary = sdio->boundary; /* dio_send_cur_page may clear it */
if (dio->op == REQ_OP_WRITE) {
/*
@@ -850,10 +851,10 @@ submit_page_section(struct dio *dio, str
sdio->cur_page_fs_offset = sdio->block_in_file << sdio->blkbits;
out:
/*
- * If sdio->boundary then we want to schedule the IO now to
+ * If boundary then we want to schedule the IO now to
* avoid metadata seeks.
*/
- if (sdio->boundary) {
+ if (boundary) {
ret = dio_send_cur_page(dio, sdio, map_bh);
if (sdio->bio)
dio_bio_submit(dio, sdio);
_
Patches currently in -mm which might be from jack.qiu(a)huawei.com are
fs-direct-io-fix-missing-sdio-boundary.patch
The patch titled
Subject: mm: page_alloc: fix allocation imbalances from speculative cache lookup
has been removed from the -mm tree. Its filename was
mm-page_alloc-fix-memcg-accounting-leak-in-speculative-cache-lookup.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: mm: page_alloc: fix allocation imbalances from speculative cache lookup
When the freeing of a higher-order page block (non-compound) races with a
speculative page cache lookup, __free_pages() needs to leave the first
order-0 page in the chunk to the lookup but free the buddy pages that the
lookup doesn't know about separately.
There are currently two problems with it:
1. It checks PageHead() to see whether we're dealing with a compound
page after put_page_testzero(). But the speculative lookup could have
freed the page after our put and cleared PageHead, in which case we
would double free the tail pages.
To fix this, test PageHead before the put and cache the result for
afterwards.
2. If such a higher-order page is charged to a memcg (e.g. !vmap
kernel stack)), only the first page of the block has page->memcg set.
That means we'll uncharge only one order-0 page from the entire block,
and leak the remainder.
To fix this, add a split_page_memcg() before it starts freeing tail
pages, to ensure they all have page->memcg set up.
While at it, also update the comments a bit to clarify what exactly is
happening to the page during that race.
Link: https://lkml.kernel.org/r/20210319071547.60973-1-hannes@cmpxchg.org
Fixes: e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reported-by: Hugh Dickins <hughd(a)google.com>
Reported-by: Matthew Wilcox <willy(a)infradead.org>
Acked-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Acked-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Zhou Guanghui <zhouguanghui1(a)huawei.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: <stable(a)vger.kernel.org> # 5.10+
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 33 +++++++++++++++++++++++++++------
1 file changed, 27 insertions(+), 6 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-fix-memcg-accounting-leak-in-speculative-cache-lookup
+++ a/mm/page_alloc.c
@@ -5072,10 +5072,9 @@ static inline void free_the_page(struct
* the allocation, so it is easy to leak memory. Freeing more memory
* than was allocated will probably emit a warning.
*
- * If the last reference to this page is speculative, it will be released
- * by put_page() which only frees the first page of a non-compound
- * allocation. To prevent the remaining pages from being leaked, we free
- * the subsequent pages here. If you want to use the page's reference
+ * This function isn't a put_page(). Don't let the put_page_testzero()
+ * fool you, it's only to deal with speculative cache references. It
+ * WILL free pages directly. If you want to use the page's reference
* count to decide when to free the allocation, you should allocate a
* compound page, and use put_page() instead of __free_pages().
*
@@ -5084,11 +5083,33 @@ static inline void free_the_page(struct
*/
void __free_pages(struct page *page, unsigned int order)
{
- if (put_page_testzero(page))
+ /*
+ * Drop the base reference from __alloc_pages and free. In
+ * case there is an outstanding speculative reference, from
+ * e.g. the page cache, it will put and free the page later.
+ */
+ if (likely(put_page_testzero(page))) {
free_the_page(page, order);
- else if (!PageHead(page))
+ return;
+ }
+
+ /*
+ * The speculative reference will put and free the page.
+ *
+ * However, if the speculation was into a higher-order page
+ * chunk that isn't marked compound, the other side will know
+ * nothing about our buddy pages and only free the order-0
+ * page at the start of our chunk! We must split off and free
+ * the buddy pages here.
+ *
+ * The buddy pages aren't individually refcounted, so they
+ * can't have any pending speculative references themselves.
+ */
+ if (!PageHead(page) && order > 0) {
+ split_page_memcg(page, 1 << order);
while (order-- > 0)
free_the_page(page + (1 << order), order);
+ }
}
EXPORT_SYMBOL(__free_pages);
_
Patches currently in -mm which might be from hannes(a)cmpxchg.org are
mm-page_alloc-fix-allocation-imbalances-from-speculative-cache-lookup.patch
mm-page-writeback-simplify-memcg-handling-in-test_clear_page_writeback.patch
mm-memcontrol-fix-cpuhotplug-statistics-flushing.patch
mm-memcontrol-kill-mem_cgroup_nodeinfo.patch
mm-memcontrol-privatize-memcg_page_state-query-functions.patch
cgroup-rstat-support-cgroup1.patch
cgroup-rstat-punt-root-level-optimization-to-individual-controllers.patch
mm-memcontrol-switch-to-rstat.patch
mm-memcontrol-switch-to-rstat-fix-2.patch
mm-memcontrol-consolidate-lruvec-stat-flushing.patch
kselftests-cgroup-update-kmem-test-for-new-vmstat-implementation.patch
The patch titled
Subject: mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
has been removed from the -mm tree. Its filename was
mm-highmem-fix-config_debug_kmap_local_force_map.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Ira Weiny <ira.weiny(a)intel.com>
Subject: mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
The kernel test robot found that __kmap_local_sched_out() was not
correctly skipping the guard pages when CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
was set.[1] This was due to CONFIG_DEBUG_HIGHMEM check being used.
Change the configuration check to be correct.
[1] https://lore.kernel.org/lkml/20210304083825.GB17830@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20210318230657.1497881-1-ira.weiny@intel.com
Fixes: 0e91a0c6984c ("mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP")
Signed-off-by: Ira Weiny <ira.weiny(a)intel.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Reviewed-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Oliver Sang <oliver.sang(a)intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni(a)wdc.com>
Cc: David Sterba <dsterba(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/highmem.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/highmem.c~mm-highmem-fix-config_debug_kmap_local_force_map
+++ a/mm/highmem.c
@@ -618,7 +618,7 @@ void __kmap_local_sched_out(void)
int idx;
/* With debug all even slots are unmapped and act as guard */
- if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !(i & 0x01)) {
+ if (IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL) && !(i & 0x01)) {
WARN_ON_ONCE(!pte_none(pteval));
continue;
}
@@ -654,7 +654,7 @@ void __kmap_local_sched_in(void)
int idx;
/* With debug all even slots are unmapped and act as guard */
- if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !(i & 0x01)) {
+ if (IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL) && !(i & 0x01)) {
WARN_ON_ONCE(!pte_none(pteval));
continue;
}
_
Patches currently in -mm which might be from ira.weiny(a)intel.com are
iov_iter-lift-memzero_page-to-highmemh.patch
btrfs-use-memzero_page-instead-of-open-coded-kmap-pattern.patch
mm-highmem-remove-deprecated-kmap_atomic.patch
The patch titled
Subject: squashfs: fix inode lookup sanity checks
has been removed from the -mm tree. Its filename was
squashfs-fix-inode-lookup-sanity-checks.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Sean Nyekjaer <sean(a)geanix.com>
Subject: squashfs: fix inode lookup sanity checks
When mouting a squashfs image created without inode compression it fails
with: "unable to read inode lookup table"
It turns out that the BLOCK_OFFSET is missing when checking the
SQUASHFS_METADATA_SIZE agaist the actual size.
Link: https://lkml.kernel.org/r/20210226092903.1473545-1-sean@geanix.com
Fixes: eabac19e40c0 ("squashfs: add more sanity checks in inode lookup")
Signed-off-by: Sean Nyekjaer <sean(a)geanix.com>
Acked-by: Phillip Lougher <phillip(a)squashfs.org.uk>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/squashfs/export.c | 8 ++++++--
fs/squashfs/squashfs_fs.h | 1 +
2 files changed, 7 insertions(+), 2 deletions(-)
--- a/fs/squashfs/export.c~squashfs-fix-inode-lookup-sanity-checks
+++ a/fs/squashfs/export.c
@@ -152,14 +152,18 @@ __le64 *squashfs_read_inode_lookup_table
start = le64_to_cpu(table[n]);
end = le64_to_cpu(table[n + 1]);
- if (start >= end || (end - start) > SQUASHFS_METADATA_SIZE) {
+ if (start >= end
+ || (end - start) >
+ (SQUASHFS_METADATA_SIZE + SQUASHFS_BLOCK_OFFSET)) {
kfree(table);
return ERR_PTR(-EINVAL);
}
}
start = le64_to_cpu(table[indexes - 1]);
- if (start >= lookup_table_start || (lookup_table_start - start) > SQUASHFS_METADATA_SIZE) {
+ if (start >= lookup_table_start ||
+ (lookup_table_start - start) >
+ (SQUASHFS_METADATA_SIZE + SQUASHFS_BLOCK_OFFSET)) {
kfree(table);
return ERR_PTR(-EINVAL);
}
--- a/fs/squashfs/squashfs_fs.h~squashfs-fix-inode-lookup-sanity-checks
+++ a/fs/squashfs/squashfs_fs.h
@@ -17,6 +17,7 @@
/* size of metadata (inode and directory) blocks */
#define SQUASHFS_METADATA_SIZE 8192
+#define SQUASHFS_BLOCK_OFFSET 2
/* default size of block device I/O */
#ifdef CONFIG_SQUASHFS_4K_DEVBLK_SIZE
_
Patches currently in -mm which might be from sean(a)geanix.com are
The patch titled
Subject: z3fold: prevent reclaim/free race for headless pages
has been removed from the -mm tree. Its filename was
z3fold-prevent-reclaim-free-race-for-headless-pages.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Thomas Hebb <tommyhebb(a)gmail.com>
Subject: z3fold: prevent reclaim/free race for headless pages
commit ca0246bb97c2 ("z3fold: fix possible reclaim races") introduced the
PAGE_CLAIMED flag "to avoid racing on a z3fold 'headless' page release."
By atomically testing and setting the bit in each of z3fold_free() and
z3fold_reclaim_page(), a double-free was avoided.
However, commit dcf5aedb24f8 ("z3fold: stricter locking and more careful
reclaim") appears to have unintentionally broken this behavior by moving
the PAGE_CLAIMED check in z3fold_reclaim_page() to after the page lock
gets taken, which only happens for non-headless pages. For headless
pages, the check is now skipped entirely and races can occur again.
I have observed such a race on my system:
page:00000000ffbd76b7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x165316
flags: 0x2ffff0000000000()
raw: 02ffff0000000000 ffffea0004535f48 ffff8881d553a170 0000000000000000
raw: 0000000000000000 0000000000000011 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:707!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 2 PID: 291928 Comm: kworker/2:0 Tainted: G B 5.10.7-arch1-1-kasan #1
Hardware name: Gigabyte Technology Co., Ltd. H97N-WIFI/H97N-WIFI, BIOS F9b 03/03/2016
Workqueue: zswap-shrink shrink_worker
RIP: 0010:__free_pages+0x10a/0x130
Code: c1 e7 06 48 01 ef 45 85 e4 74 d1 44 89 e6 31 d2 41 83 ec 01 e8 e7 b0 ff ff eb da 48 c7 c6 e0 32 91 88 48 89 ef e8 a6 89 f8 ff <0f> 0b 4c 89 e7 e8 fc 79 07 00 e9 33 ff ff ff 48 89 ef e8 ff 79 07
RSP: 0000:ffff88819a2ffb98 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffea000594c5a8 RCX: 0000000000000000
RDX: 1ffffd4000b298b7 RSI: 0000000000000000 RDI: ffffea000594c5b8
RBP: ffffea000594c580 R08: 000000000000003e R09: ffff8881d5520bbb
R10: ffffed103aaa4177 R11: 0000000000000001 R12: ffffea000594c5b4
R13: 0000000000000000 R14: ffff888165316000 R15: ffffea000594c588
FS: 0000000000000000(0000) GS:ffff8881d5500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7c8c3654d8 CR3: 0000000103f42004 CR4: 00000000001706e0
Call Trace:
z3fold_zpool_shrink+0x9b6/0x1240
? sugov_update_single+0x357/0x990
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x180
? z3fold_zpool_map+0x490/0x490
? _raw_spin_lock_irq+0x88/0xe0
shrink_worker+0x35/0x90
process_one_work+0x70c/0x1210
? pwq_dec_nr_in_flight+0x15b/0x2a0
worker_thread+0x539/0x1200
? __kthread_parkme+0x73/0x120
? rescuer_thread+0x1000/0x1000
kthread+0x330/0x400
? __kthread_bind_mask+0x90/0x90
ret_from_fork+0x22/0x30
Modules linked in: rfcomm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ccm algif_aead des_generic libdes ecb algif_skcipher cmac bnep md4 algif_hash af_alg vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iwlmvm hid_logitech_hidpp kvm at24 mac80211 snd_hda_codec_realtek iTCO_wdt snd_hda_codec_generic intel_pmc_bxt snd_hda_codec_hdmi ledtrig_audio iTCO_vendor_support mei_wdt mei_hdcp snd_hda_intel snd_intel_dspcfg libarc4 soundwire_intel irqbypass iwlwifi soundwire_generic_allocation rapl soundwire_cadence intel_cstate snd_hda_codec intel_uncore btusb joydev mousedev snd_usb_audio pcspkr btrtl uvcvideo nouveau btbcm i2c_i801 btintel snd_hda_core videobuf2_vmalloc i2c_smbus snd_usbmidi_lib videobuf2_memops bluetooth snd_hwdep soundwire_bus snd_soc_rt5640 videobuf2_v4l2 cfg80211 snd_soc_rl6231 videobuf2_common snd_rawmidi lpc_ich alx videodev mdio snd_seq_device snd_soc_core mc ecdh_generic mxm_wmi mei_me
hid_logitech_dj wmi snd_compress e1000e ac97_bus mei ttm rfkill snd_pcm_dmaengine ecc snd_pcm snd_timer snd soundcore mac_hid acpi_pad pkcs8_key_parser it87 hwmon_vid crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted tpm rng_core usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci xhci_pci_renesas i915 video intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm agpgart
---[ end trace 126d646fc3dc0ad8 ]---
To fix the issue, re-add the earlier test and set in the case where we
have a headless page.
Link: https://lkml.kernel.org/r/c8106dbe6d8390b290cd1d7f873a2942e805349e.16154520…
Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim")
Signed-off-by: Thomas Hebb <tommyhebb(a)gmail.com>
Reviewed-by: Vitaly Wool <vitaly.wool(a)konsulko.com>
Cc: Jongseok Kim <ks77sj(a)gmail.com>
Cc: Snild Dolkow <snild(a)sony.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/z3fold.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
--- a/mm/z3fold.c~z3fold-prevent-reclaim-free-race-for-headless-pages
+++ a/mm/z3fold.c
@@ -1346,8 +1346,22 @@ static int z3fold_reclaim_page(struct z3
page = list_entry(pos, struct page, lru);
zhdr = page_address(page);
- if (test_bit(PAGE_HEADLESS, &page->private))
+ if (test_bit(PAGE_HEADLESS, &page->private)) {
+ /*
+ * For non-headless pages, we wait to do this
+ * until we have the page lock to avoid racing
+ * with __z3fold_alloc(). Headless pages don't
+ * have a lock (and __z3fold_alloc() will never
+ * see them), but we still need to test and set
+ * PAGE_CLAIMED to avoid racing with
+ * z3fold_free(), so just do it now before
+ * leaving the loop.
+ */
+ if (test_and_set_bit(PAGE_CLAIMED, &page->private))
+ continue;
+
break;
+ }
if (kref_get_unless_zero(&zhdr->refcount) == 0) {
zhdr = NULL;
_
Patches currently in -mm which might be from tommyhebb(a)gmail.com are
The patch titled
Subject: kasan: fix per-page tags for non-page_alloc pages
has been removed from the -mm tree. Its filename was
kasan-fix-per-page-tags-for-non-page_alloc-pages.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Andrey Konovalov <andreyknvl(a)google.com>
Subject: kasan: fix per-page tags for non-page_alloc pages
To allow performing tag checks on page_alloc addresses obtained via
page_address(), tag-based KASAN modes store tags for page_alloc
allocations in page->flags.
Currently, the default tag value stored in page->flags is 0x00.
Therefore, page_address() returns a 0x00ffff... address for pages that
were not allocated via page_alloc.
This might cause problems. A particular case we encountered is a conflict
with KFENCE. If a KFENCE-allocated slab object is being freed via
kfree(page_address(page) + offset), the address passed to kfree() will get
tagged with 0x00 (as slab pages keep the default per-page tags). This
leads to is_kfence_address() check failing, and a KFENCE object ending up
in normal slab freelist, which causes memory corruptions.
This patch changes the way KASAN stores tag in page-flags: they are now
stored xor'ed with 0xff. This way, KASAN doesn't need to initialize
per-page flags for every created page, which might be slow.
With this change, page_address() returns natively-tagged (with 0xff)
pointers for pages that didn't have tags set explicitly.
This patch fixes the encountered conflict with KFENCE and prevents more
similar issues that can occur in the future.
Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.16154754…
Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Reviewed-by: Marco Elver <elver(a)google.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Peter Collingbourne <pcc(a)google.com>
Cc: Evgenii Stepanov <eugenis(a)google.com>
Cc: Branislav Rankov <Branislav.Rankov(a)arm.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
--- a/include/linux/mm.h~kasan-fix-per-page-tags-for-non-page_alloc-pages
+++ a/include/linux/mm.h
@@ -1461,16 +1461,28 @@ static inline bool cpupid_match_pid(stru
#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
+/*
+ * KASAN per-page tags are stored xor'ed with 0xff. This allows to avoid
+ * setting tags for all pages to native kernel tag value 0xff, as the default
+ * value 0x00 maps to 0xff.
+ */
+
static inline u8 page_kasan_tag(const struct page *page)
{
- if (kasan_enabled())
- return (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK;
- return 0xff;
+ u8 tag = 0xff;
+
+ if (kasan_enabled()) {
+ tag = (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK;
+ tag ^= 0xff;
+ }
+
+ return tag;
}
static inline void page_kasan_tag_set(struct page *page, u8 tag)
{
if (kasan_enabled()) {
+ tag ^= 0xff;
page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT);
page->flags |= (tag & KASAN_TAG_MASK) << KASAN_TAG_PGSHIFT;
}
_
Patches currently in -mm which might be from andreyknvl(a)google.com are
kasan-initialize-shadow-to-tag_invalid-for-sw_tags.patch
mm-kasan-dont-poison-boot-memory-with-tag-based-modes.patch
arm64-kasan-allow-to-init-memory-when-setting-tags.patch
kasan-init-memory-in-kasan_unpoison-for-hw_tags.patch
kasan-mm-integrate-page_alloc-init-with-hw_tags.patch
kasan-mm-integrate-slab-init_on_alloc-with-hw_tags.patch
kasan-mm-integrate-slab-init_on_free-with-hw_tags.patch
kasan-docs-clean-up-sections.patch
kasan-docs-update-overview-section.patch
kasan-docs-update-usage-section.patch
kasan-docs-update-error-reports-section.patch
kasan-docs-update-boot-parameters-section.patch
kasan-docs-update-generic-implementation-details-section.patch
kasan-docs-update-sw_tags-implementation-details-section.patch
kasan-docs-update-hw_tags-implementation-details-section.patch
kasan-docs-update-shadow-memory-section.patch
kasan-docs-update-ignoring-accesses-section.patch
kasan-docs-update-tests-section.patch
The patch titled
Subject: hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
has been removed from the -mm tree. Its filename was
hugetlb_cgroup-fix-imbalanced-css_get-and-css_put-pair-for-shared-mappings.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Miaohe Lin <linmiaohe(a)huawei.com>
Subject: hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
The current implementation of hugetlb_cgroup for shared mappings could
have different behavior. Consider the following two scenarios:
1.Assume initial css reference count of hugetlb_cgroup is 1:
1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
count is 2 associated with 1 file_region.
1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
count is 3 associated with 2 file_region.
1.3 coalesce_file_region will coalesce these two file_regions into one.
So css reference count is 3 associated with 1 file_region now.
2.Assume initial css reference count of hugetlb_cgroup is 1 again:
2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
count is 2 associated with 1 file_region.
Therefore, we might have one file_region while holding one or more css
reference counts. This inconsistency could lead to imbalanced css_get()
and css_put() pair. If we do css_put one by one (i.g. hole punch case),
scenario 2 would put one more css reference. If we do css_put all
together (i.g. truncate case), scenario 1 will leak one css reference.
The imbalanced css_get() and css_put() pair would result in a non-zero
reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
directory is removed __but__ associated resource is not freed. This might
result in OOM or can not create a new hugetlb cgroup in a busy workload
ultimately.
In order to fix this, we have to make sure that one file_region must hold
exactly one css reference. So in coalesce_file_region case, we should
release one css reference before coalescence. Also only put css reference
when the entire file_region is removed.
The last thing to note is that the caller of region_add() will only hold
one reference to h_cg->css for the whole contiguous reservation region.
But this area might be scattered when there are already some file_regions
reside in it. As a result, many file_regions may share only one h_cg->css
reference. In order to ensure that one file_region must hold exactly one
css reference, we should do css_get() for each file_region and release the
reference held by caller when they are done.
[linmiaohe(a)huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
Reported-by: kernel test robot <lkp(a)intel.com> (auto build test ERROR)
Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Wanpeng Li <liwp.linux(a)gmail.com>
Cc: Mina Almasry <almasrymina(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb_cgroup.h | 15 +++++++++--
mm/hugetlb.c | 41 +++++++++++++++++++++++++++----
mm/hugetlb_cgroup.c | 10 ++++++-
3 files changed, 58 insertions(+), 8 deletions(-)
--- a/include/linux/hugetlb_cgroup.h~hugetlb_cgroup-fix-imbalanced-css_get-and-css_put-pair-for-shared-mappings
+++ a/include/linux/hugetlb_cgroup.h
@@ -113,6 +113,11 @@ static inline bool hugetlb_cgroup_disabl
return !cgroup_subsys_enabled(hugetlb_cgrp_subsys);
}
+static inline void hugetlb_cgroup_put_rsvd_cgroup(struct hugetlb_cgroup *h_cg)
+{
+ css_put(&h_cg->css);
+}
+
extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr);
extern int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages,
@@ -138,7 +143,8 @@ extern void hugetlb_cgroup_uncharge_coun
extern void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
struct file_region *rg,
- unsigned long nr_pages);
+ unsigned long nr_pages,
+ bool region_del);
extern void hugetlb_cgroup_file_init(void) __init;
extern void hugetlb_cgroup_migrate(struct page *oldhpage,
@@ -147,7 +153,8 @@ extern void hugetlb_cgroup_migrate(struc
#else
static inline void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
struct file_region *rg,
- unsigned long nr_pages)
+ unsigned long nr_pages,
+ bool region_del)
{
}
@@ -185,6 +192,10 @@ static inline bool hugetlb_cgroup_disabl
return true;
}
+static inline void hugetlb_cgroup_put_rsvd_cgroup(struct hugetlb_cgroup *h_cg)
+{
+}
+
static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr)
{
--- a/mm/hugetlb.c~hugetlb_cgroup-fix-imbalanced-css_get-and-css_put-pair-for-shared-mappings
+++ a/mm/hugetlb.c
@@ -280,6 +280,17 @@ static void record_hugetlb_cgroup_unchar
nrg->reservation_counter =
&h_cg->rsvd_hugepage[hstate_index(h)];
nrg->css = &h_cg->css;
+ /*
+ * The caller will hold exactly one h_cg->css reference for the
+ * whole contiguous reservation region. But this area might be
+ * scattered when there are already some file_regions reside in
+ * it. As a result, many file_regions may share only one css
+ * reference. In order to ensure that one file_region must hold
+ * exactly one h_cg->css reference, we should do css_get for
+ * each file_region and leave the reference held by caller
+ * untouched.
+ */
+ css_get(&h_cg->css);
if (!resv->pages_per_hpage)
resv->pages_per_hpage = pages_per_huge_page(h);
/* pages_per_hpage should be the same for all entries in
@@ -293,6 +304,14 @@ static void record_hugetlb_cgroup_unchar
#endif
}
+static void put_uncharge_info(struct file_region *rg)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+ if (rg->css)
+ css_put(rg->css);
+#endif
+}
+
static bool has_same_uncharge_info(struct file_region *rg,
struct file_region *org)
{
@@ -316,6 +335,7 @@ static void coalesce_file_region(struct
prg->to = rg->to;
list_del(&rg->link);
+ put_uncharge_info(rg);
kfree(rg);
rg = prg;
@@ -327,6 +347,7 @@ static void coalesce_file_region(struct
nrg->from = rg->from;
list_del(&rg->link);
+ put_uncharge_info(rg);
kfree(rg);
}
}
@@ -662,7 +683,7 @@ retry:
del += t - f;
hugetlb_cgroup_uncharge_file_region(
- resv, rg, t - f);
+ resv, rg, t - f, false);
/* New entry for end of split region */
nrg->from = t;
@@ -683,7 +704,7 @@ retry:
if (f <= rg->from && t >= rg->to) { /* Remove entire region */
del += rg->to - rg->from;
hugetlb_cgroup_uncharge_file_region(resv, rg,
- rg->to - rg->from);
+ rg->to - rg->from, true);
list_del(&rg->link);
kfree(rg);
continue;
@@ -691,13 +712,13 @@ retry:
if (f <= rg->from) { /* Trim beginning of region */
hugetlb_cgroup_uncharge_file_region(resv, rg,
- t - rg->from);
+ t - rg->from, false);
del += t - rg->from;
rg->from = t;
} else { /* Trim end of region */
hugetlb_cgroup_uncharge_file_region(resv, rg,
- rg->to - f);
+ rg->to - f, false);
del += rg->to - f;
rg->to = f;
@@ -5187,6 +5208,10 @@ bool hugetlb_reserve_pages(struct inode
*/
long rsv_adjust;
+ /*
+ * hugetlb_cgroup_uncharge_cgroup_rsvd() will put the
+ * reference to h_cg->css. See comment below for detail.
+ */
hugetlb_cgroup_uncharge_cgroup_rsvd(
hstate_index(h),
(chg - add) * pages_per_huge_page(h), h_cg);
@@ -5194,6 +5219,14 @@ bool hugetlb_reserve_pages(struct inode
rsv_adjust = hugepage_subpool_put_pages(spool,
chg - add);
hugetlb_acct_memory(h, -rsv_adjust);
+ } else if (h_cg) {
+ /*
+ * The file_regions will hold their own reference to
+ * h_cg->css. So we should release the reference held
+ * via hugetlb_cgroup_charge_cgroup_rsvd() when we are
+ * done.
+ */
+ hugetlb_cgroup_put_rsvd_cgroup(h_cg);
}
}
return true;
--- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-fix-imbalanced-css_get-and-css_put-pair-for-shared-mappings
+++ a/mm/hugetlb_cgroup.c
@@ -391,7 +391,8 @@ void hugetlb_cgroup_uncharge_counter(str
void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
struct file_region *rg,
- unsigned long nr_pages)
+ unsigned long nr_pages,
+ bool region_del)
{
if (hugetlb_cgroup_disabled() || !resv || !rg || !nr_pages)
return;
@@ -400,7 +401,12 @@ void hugetlb_cgroup_uncharge_file_region
!resv->reservation_counter) {
page_counter_uncharge(rg->reservation_counter,
nr_pages * resv->pages_per_hpage);
- css_put(rg->css);
+ /*
+ * Only do css_put(rg->css) when we delete the entire region
+ * because one file_region must hold exactly one css reference.
+ */
+ if (region_del)
+ css_put(rg->css);
}
}
_
Patches currently in -mm which might be from linmiaohe(a)huawei.com are
mm-hugetlb-remove-redundant-reservation-check-condition-in-alloc_huge_page.patch
mm-hugetlb-use-some-helper-functions-to-cleanup-code.patch
mm-hugetlb-optimize-the-surplus-state-transfer-code-in-move_hugetlb_state.patch
hugetlb_cgroup-remove-unnecessary-vm_bug_on_page-in-hugetlb_cgroup_migrate.patch
mm-hugetlb-simplify-the-code-when-alloc_huge_page-failed-in-hugetlb_no_page.patch
mm-hugetlb-avoid-calculating-fault_mutex_hash-in-truncate_op-case.patch
khugepaged-remove-unneeded-return-value-of-khugepaged_collapse_pte_mapped_thps.patch
khugepaged-reuse-the-smp_wmb-inside-__setpageuptodate.patch
khugepaged-use-helper-khugepaged_test_exit-in-__khugepaged_enter.patch
khugepaged-fix-wrong-result-value-for-trace_mm_collapse_huge_page_isolate.patch
mm-huge_memoryc-remove-unnecessary-local-variable-ret2.patch
mm-huge_memoryc-rework-the-function-vma_adjust_trans_huge.patch
mm-huge_memoryc-make-get_huge_zero_page-return-bool.patch
mm-huge_memoryc-rework-the-function-do_huge_pmd_numa_page-slightly.patch
mm-huge_memoryc-remove-redundant-pagecompound-check.patch
mm-huge_memoryc-remove-unused-macro-transparent_hugepage_debug_cow_flag.patch
mm-huge_memoryc-use-helper-function-migration_entry_to_page.patch
This is a note to let you know that I've just added the patch titled
tty: fix memory leak in vc_deallocate
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will be merged to the tty-next branch sometime soon,
after it passes testing, and the merge window is open.
If you have any questions about this process, please let me know.
>From 211b4d42b70f1c1660feaa968dac0efc2a96ac4d Mon Sep 17 00:00:00 2001
From: Pavel Skripkin <paskripkin(a)gmail.com>
Date: Sun, 28 Mar 2021 00:44:43 +0300
Subject: tty: fix memory leak in vc_deallocate
syzbot reported memory leak in tty/vt.
The problem was in VT_DISALLOCATE ioctl cmd.
After allocating unimap with PIO_UNIMAP it wasn't
freed via VT_DISALLOCATE, but vc_cons[currcons].d was
zeroed.
Reported-by: syzbot+bcc922b19ccc64240b42(a)syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin(a)gmail.com>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20210327214443.21548-1-paskripkin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/vt/vt.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index d9366da51e06..01645e87b3d5 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -1381,6 +1381,7 @@ struct vc_data *vc_deallocate(unsigned int currcons)
atomic_notifier_call_chain(&vt_notifier_list, VT_DEALLOCATE, ¶m);
vcs_remove_sysfs(currcons);
visual_deinit(vc);
+ con_free_unimap(vc);
put_pid(vc->vt_pid);
vc_uniscr_set(vc, NULL);
kfree(vc->vc_screenbuf);
--
2.31.1