From: Qingfang Deng <qingfang.deng(a)siflower.com.cn>
[ Upstream commit ed779fe4c9b5a20b4ab4fd6f3e19807445bb78c7 ]
After the blamed commit, the member key is longer 4-byte aligned. On
platforms that do not support unaligned access, e.g., MIPS32R2 with
unaligned_action set to 1, this will trigger a crash when accessing
an IPv6 pneigh_entry, as the key is cast to an in6_addr pointer.
Change the type of the key to u32 to make it aligned.
Fixes: 62dd93181aaa ("[IPV6] NDISC: Set per-entry is_router flag in Proxy NA.")
Signed-off-by: Qingfang Deng <qingfang.deng(a)siflower.com.cn>
---
include/net/neighbour.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index e58ef9e338de..4c53e51f0799 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -172,7 +172,7 @@ struct pneigh_entry {
possible_net_t net;
struct net_device *dev;
u8 flags;
- u8 key[0];
+ u32 key[0];
};
/*
--
2.34.1
It has been brought to my attention that what had been fixed 1 year ago
here for kernels 5.18 and later:
https://lore.kernel.org/netdev/20230626155112.3155993-1-vladimir.oltean@nxp…
is still broken on linux-5.15.y. Short summary: PTP boundary clock is
broken for ports under a VLAN-aware bridge.
The reason is that the Fixes: tags in those patches were wrong. The
issue originated from earlier, but the changes from 5.18 (blamed there),
aka DSA FDB isolation, masked that.
A straightforward cherry-pick was not possible, due to the conflict with
the aforementioned DSA FDB isolation work from 5.18. So I redid patch
2/2 and marked what I had to adapt.
Tested on the NXP LS1021A-TSN board.
Vladimir Oltean (2):
net: dsa: sja1105: always enable the INCL_SRCPT option
net: dsa: tag_sja1105: always prefer source port information from
INCL_SRCPT
drivers/net/dsa/sja1105/sja1105_main.c | 9 ++-----
net/dsa/tag_sja1105.c | 34 ++++++++++++++++++++------
2 files changed, 28 insertions(+), 15 deletions(-)
---
I'm sorry for the people who will want to backport DSA FDB isolation to
linux-5.15.y :(
--
2.34.1
From: Linus Torvalds <torvalds(a)linux-foundation.org>
commit 02b670c1f88e78f42a6c5aee155c7b26960ca054 upstream.
The syzbot-reported stack trace from hell in this discussion thread
actually has three nested page faults:
https://lore.kernel.org/r/000000000000d5f4fc0616e816d4@google.com
... and I think that's actually the important thing here:
- the first page fault is from user space, and triggers the vsyscall
emulation.
- the second page fault is from __do_sys_gettimeofday(), and that should
just have caused the exception that then sets the return value to
-EFAULT
- the third nested page fault is due to _raw_spin_unlock_irqrestore() ->
preempt_schedule() -> trace_sched_switch(), which then causes a BPF
trace program to run, which does that bpf_probe_read_compat(), which
causes that page fault under pagefault_disable().
It's quite the nasty backtrace, and there's a lot going on.
The problem is literally the vsyscall emulation, which sets
current->thread.sig_on_uaccess_err = 1;
and that causes the fixup_exception() code to send the signal *despite* the
exception being caught.
And I think that is in fact completely bogus. It's completely bogus
exactly because it sends that signal even when it *shouldn't* be sent -
like for the BPF user mode trace gathering.
In other words, I think the whole "sig_on_uaccess_err" thing is entirely
broken, because it makes any nested page-faults do all the wrong things.
Now, arguably, I don't think anybody should enable vsyscall emulation any
more, but this test case clearly does.
I think we should just make the "send SIGSEGV" be something that the
vsyscall emulation does on its own, not this broken per-thread state for
something that isn't actually per thread.
The x86 page fault code actually tried to deal with the "incorrect nesting"
by having that:
if (in_interrupt())
return;
which ignores the sig_on_uaccess_err case when it happens in interrupts,
but as shown by this example, these nested page faults do not need to be
about interrupts at all.
IOW, I think the only right thing is to remove that horrendously broken
code.
The attached patch looks like the ObviouslyCorrect(tm) thing to do.
NOTE! This broken code goes back to this commit in 2011:
4fc3490114bb ("x86-64: Set siginfo and context on vsyscall emulation faults")
... and back then the reason was to get all the siginfo details right.
Honestly, I do not for a moment believe that it's worth getting the siginfo
details right here, but part of the commit says:
This fixes issues with UML when vsyscall=emulate.
... and so my patch to remove this garbage will probably break UML in this
situation.
I do not believe that anybody should be running with vsyscall=emulate in
2024 in the first place, much less if you are doing things like UML. But
let's see if somebody screams.
Reported-and-tested-by: syzbot+83e7f982ca045ab4405c(a)syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Tested-by: Jiri Olsa <jolsa(a)kernel.org>
Acked-by: Andy Lutomirski <luto(a)kernel.org>
Link: https://lore.kernel.org/r/CAHk-=wh9D6f7HUkDgZHKmDCHUQmp+Co89GP+b8+z+G56BKey…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
[gpiccoli: Backport the patch due to differences in the trees. The main change
between 5.10.y and 5.15.y is due to renaming the fixup function, by
commit 6456a2a69ee1 ("x86/fault: Rename no_context() to kernelmode_fixup_or_oops()").
Following 2 commits cause divergence in the diffs too (in the removed lines):
cd072dab453a ("x86/fault: Add a helper function to sanitize error code")
d4ffd5df9d18 ("x86/fault: Fix wrong signal when vsyscall fails with pkey")
Finally, there is context adjustment in the processor.h file.]
Signed-off-by: Guilherme G. Piccoli <gpiccoli(a)igalia.com>
---
Hi folks, this was backported by AUTOSEL up to 5.15.y; I'm manually submitting
the backport to 5.4.y and 5.10.y. I've detailed a bit the changes necessary
due to other nonrelated missing patches, but these are really simple and
non-intrusive. Nevertheless, I've explicitely CCed x86 ML to be sure the
maintainers are aware of the backport, and if anybody thinks we shouldn't
do it for these (very) old releases, please respond here.
Cheers,
Guilherme
arch/x86/entry/vsyscall/vsyscall_64.c | 28 ++-------------------------
arch/x86/include/asm/processor.h | 1 -
arch/x86/mm/fault.c | 27 +-------------------------
3 files changed, 3 insertions(+), 53 deletions(-)
diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index 44c33103a955..f0b817eb6e8b 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -98,11 +98,6 @@ static int addr_to_vsyscall_nr(unsigned long addr)
static bool write_ok_or_segv(unsigned long ptr, size_t size)
{
- /*
- * XXX: if access_ok, get_user, and put_user handled
- * sig_on_uaccess_err, this could go away.
- */
-
if (!access_ok((void __user *)ptr, size)) {
struct thread_struct *thread = ¤t->thread;
@@ -120,10 +115,8 @@ static bool write_ok_or_segv(unsigned long ptr, size_t size)
bool emulate_vsyscall(unsigned long error_code,
struct pt_regs *regs, unsigned long address)
{
- struct task_struct *tsk;
unsigned long caller;
int vsyscall_nr, syscall_nr, tmp;
- int prev_sig_on_uaccess_err;
long ret;
unsigned long orig_dx;
@@ -172,8 +165,6 @@ bool emulate_vsyscall(unsigned long error_code,
goto sigsegv;
}
- tsk = current;
-
/*
* Check for access_ok violations and find the syscall nr.
*
@@ -233,12 +224,8 @@ bool emulate_vsyscall(unsigned long error_code,
goto do_ret; /* skip requested */
/*
- * With a real vsyscall, page faults cause SIGSEGV. We want to
- * preserve that behavior to make writing exploits harder.
+ * With a real vsyscall, page faults cause SIGSEGV.
*/
- prev_sig_on_uaccess_err = current->thread.sig_on_uaccess_err;
- current->thread.sig_on_uaccess_err = 1;
-
ret = -EFAULT;
switch (vsyscall_nr) {
case 0:
@@ -261,23 +248,12 @@ bool emulate_vsyscall(unsigned long error_code,
break;
}
- current->thread.sig_on_uaccess_err = prev_sig_on_uaccess_err;
-
check_fault:
if (ret == -EFAULT) {
/* Bad news -- userspace fed a bad pointer to a vsyscall. */
warn_bad_vsyscall(KERN_INFO, regs,
"vsyscall fault (exploit attempt?)");
-
- /*
- * If we failed to generate a signal for any reason,
- * generate one here. (This should be impossible.)
- */
- if (WARN_ON_ONCE(!sigismember(&tsk->pending.signal, SIGBUS) &&
- !sigismember(&tsk->pending.signal, SIGSEGV)))
- goto sigsegv;
-
- return true; /* Don't emulate the ret. */
+ goto sigsegv;
}
regs->ax = ret;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 6dc3c5f0be07..c682a14299e0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -528,7 +528,6 @@ struct thread_struct {
unsigned long iopl_emul;
unsigned int iopl_warn:1;
- unsigned int sig_on_uaccess_err:1;
/* Floating point and extended processor state */
struct fpu fpu;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index cdb337cf92ba..98a5924d98b7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -649,33 +649,8 @@ no_context(struct pt_regs *regs, unsigned long error_code,
}
/* Are we prepared to handle this kernel fault? */
- if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
- /*
- * Any interrupt that takes a fault gets the fixup. This makes
- * the below recursive fault logic only apply to a faults from
- * task context.
- */
- if (in_interrupt())
- return;
-
- /*
- * Per the above we're !in_interrupt(), aka. task context.
- *
- * In this case we need to make sure we're not recursively
- * faulting through the emulate_vsyscall() logic.
- */
- if (current->thread.sig_on_uaccess_err && signal) {
- set_signal_archinfo(address, error_code);
-
- /* XXX: hwpoison faults will set the wrong code. */
- force_sig_fault(signal, si_code, (void __user *)address);
- }
-
- /*
- * Barring that, we can do the fixup and be happy.
- */
+ if (fixup_exception(regs, X86_TRAP_PF, error_code, address))
return;
- }
#ifdef CONFIG_VMAP_STACK
/*
--
2.45.1
Since upstream commit 4acffb66 "selftests: net: explicitly wait for
listener ready", the net_helper.sh from commit 3bdd9fd2 "selftests/net:
synchronize udpgro tests' tx and rx connection" will be needed.
Otherwise selftests/net/udpgro_fwd.sh will complain about:
$ sudo ./udpgro_fwd.sh
./udpgro_fwd.sh: line 4: net_helper.sh: No such file or directory
IPv4
No GRO ./udpgro_fwd.sh: line 134: wait_local_port_listen: command not found
Patch "selftests/net: synchronize udpgro tests' tx and rx connection" adds
the missing net_helper.sh. Context adjustment is needed for applying this
patch, as the BPF_FILE is different in 6.6.y
Patch "selftests: net: Remove executable bits from library scripts" fixes
the script permission.
Patch "selftests: net: included needed helper in the install targets" and
"selftests: net: List helper scripts in TEST_FILES Makefile variable" will
add this helper to the Makefile and fix the installation, lib.sh needs to
be ignored for them.
Benjamin Poirier (2):
selftests: net: Remove executable bits from library scripts
selftests: net: List helper scripts in TEST_FILES Makefile variable
Lucas Karpinski (1):
selftests/net: synchronize udpgro tests' tx and rx connection
Paolo Abeni (1):
selftests: net: included needed helper in the install targets
tools/testing/selftests/net/Makefile | 4 ++--
tools/testing/selftests/net/net_helper.sh | 22 ++++++++++++++++++++++
tools/testing/selftests/net/setup_loopback.sh | 0
tools/testing/selftests/net/udpgro.sh | 13 ++++++-------
tools/testing/selftests/net/udpgro_bench.sh | 5 +++--
tools/testing/selftests/net/udpgro_frglist.sh | 5 +++--
6 files changed, 36 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/net/net_helper.sh
mode change 100755 => 100644 tools/testing/selftests/net/setup_loopback.sh
--
2.7.4
[ Upstream commit 1cd4bc987abb2823836cbb8f887026011ccddc8a ]
Commit f58f45c1e5b9 ("vxlan: drop packets from invalid src-address")
has recently been added to vxlan mainly in the context of source
address snooping/learning so that when it is enabled, an entry in the
FDB is not being created for an invalid address for the corresponding
tunnel endpoint.
Before commit f58f45c1e5b9 vxlan was similarly behaving as geneve in
that it passed through whichever macs were set in the L2 header. It
turns out that this change in behavior breaks setups, for example,
Cilium with netkit in L3 mode for Pods as well as tunnel mode has been
passing before the change in f58f45c1e5b9 for both vxlan and geneve.
After mentioned change it is only passing for geneve as in case of
vxlan packets are dropped due to vxlan_set_mac() returning false as
source and destination macs are zero which for E/W traffic via tunnel
is totally fine.
Fix it by only opting into the is_valid_ether_addr() check in
vxlan_set_mac() when in fact source address snooping/learning is
actually enabled in vxlan. This is done by moving the check into
vxlan_snoop(). With this change, the Cilium connectivity test suite
passes again for both tunnel flavors.
Fixes: f58f45c1e5b9 ("vxlan: drop packets from invalid src-address")
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Cc: David Bauer <mail(a)david-bauer.net>
Cc: Ido Schimmel <idosch(a)nvidia.com>
Cc: Nikolay Aleksandrov <razor(a)blackwall.org>
Cc: Martin KaFai Lau <martin.lau(a)kernel.org>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor(a)blackwall.org>
Reviewed-by: David Bauer <mail(a)david-bauer.net>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
[ Backport note: vxlan snooping/learning not supported in 6.8 or older,
so commit is simply a revert. ]
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
---
drivers/net/vxlan/vxlan_core.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 0a0b4a9717ce..1d0688610189 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -1615,10 +1615,6 @@ static bool vxlan_set_mac(struct vxlan_dev *vxlan,
if (ether_addr_equal(eth_hdr(skb)->h_source, vxlan->dev->dev_addr))
return false;
- /* Ignore packets from invalid src-address */
- if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
- return false;
-
/* Get address from the outer IP header */
if (vxlan_get_sk_family(vs) == AF_INET) {
saddr.sin.sin_addr.s_addr = ip_hdr(skb)->saddr;
--
2.34.1
The patch below does not apply to the 6.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.9.y
git checkout FETCH_HEAD
git cherry-pick -x 34bf6bae3286a58762711cfbce2cf74ecd42e1b5
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024060624-platinum-ladies-9214@gregkh' --subject-prefix 'PATCH 6.9.y' HEAD^..
Possible dependencies:
34bf6bae3286 ("x86/topology/amd: Evaluate SMT in CPUID leaf 0x8000001e only on family 0x17 and greater")
21f546a43a91 ("Merge branch 'x86/urgent' into x86/cpu, to resolve conflict")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 34bf6bae3286a58762711cfbce2cf74ecd42e1b5 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Tue, 28 May 2024 22:21:31 +0200
Subject: [PATCH] x86/topology/amd: Evaluate SMT in CPUID leaf 0x8000001e only
on family 0x17 and greater
The new AMD/HYGON topology parser evaluates the SMT information in CPUID leaf
0x8000001e unconditionally while the original code restricted it to CPUs with
family 0x17 and greater.
This breaks family 0x15 CPUs which advertise that leaf and have a non-zero
value in the SMT section. The machine boots, but the scheduler complains loudly
about the mismatch of the core IDs:
WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:6482 sched_cpu_starting+0x183/0x250
WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2408 build_sched_domains+0x76b/0x12b0
Add the condition back to cure it.
[ bp: Make it actually build because grandpa is not concerned with
trivial stuff. :-P ]
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser")
Closes: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/issues/56
Reported-by: Tim Teichmann <teichmanntim(a)outlook.de>
Reported-by: Christian Heusel <christian(a)heusel.eu>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Tested-by: Tim Teichmann <teichmanntim(a)outlook.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/7skhx6mwe4hxiul64v6azhlxnokheorksqsdbp7qw6g2jduf6…
diff --git a/arch/x86/kernel/cpu/topology_amd.c b/arch/x86/kernel/cpu/topology_amd.c
index d419deed6a48..7d476fa697ca 100644
--- a/arch/x86/kernel/cpu/topology_amd.c
+++ b/arch/x86/kernel/cpu/topology_amd.c
@@ -84,9 +84,9 @@ static bool parse_8000_001e(struct topo_scan *tscan, bool has_topoext)
/*
* If leaf 0xb is available, then the domain shifts are set
- * already and nothing to do here.
+ * already and nothing to do here. Only valid for family >= 0x17.
*/
- if (!has_topoext) {
+ if (!has_topoext && tscan->c->x86 >= 0x17) {
/*
* Leaf 0x80000008 set the CORE domain shift already.
* Update the SMT domain, but do not propagate it.
From: Shakeel Butt <shakeelb(a)google.com>
commit d4a5b369ad6d8aae552752ff438dddde653a72ec upstream.
One of our workloads (Postgres 14 + sysbench OLTP) regressed on newer
upstream kernel and on further investigation, it seems like the cause is
the always synchronous rstat flush in the count_shadow_nodes() added by
the commit f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical
stats"). On further inspection it seems like we don't really need
accurate stats in this function as it was already approximating the amount
of appropriate shadow entries to keep for maintaining the refault
information. Since there is already 2 sec periodic rstat flush, we don't
need exact stats here. Let's ratelimit the rstat flush in this code path.
Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com
Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Yosry Ahmed <yosryahmed(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Jesper Dangaard Brouer <hawk(a)kernel.org>
---
On production with kernel v6.6 we are observing issues with excessive
cgroup rstat flushing due to the extra call to mem_cgroup_flush_stats()
in count_shadow_nodes() introduced in commit f82e6bf9bb9b ("mm: memcg:
use rstat for non-hierarchical stats") that commit is part of v6.6.
We request backport of commit d4a5b369ad6d ("mm: ratelimit stat flush
from workingset shrinker") as it have a fixes tag for this commit.
IMHO it is worth explaining call path that makes count_shadow_nodes()
cause excessive cgroup rstat flushing calls. Function shrink_node()
calls mem_cgroup_flush_stats() on its own first, and then invokes
shrink_node_memcgs(). Function shrink_node_memcgs() iterates over
cgroups via mem_cgroup_iter() for each calling shrink_slab(). The
shrink_slab() calls do_shrink_slab() that via shrinker->count_objects()
invoke count_shadow_nodes(), and count_shadow_nodes() does
a mem_cgroup_flush_stats() call, that seems unnecessary.
Backport differs slightly due to v6.6.32 doesn't contain commit
7d7ef0a4686a ("mm: memcg: restore subtree stats flushing") from v6.8.
---
mm/workingset.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 2559a1f2fc1c..9110957bec5b 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
struct lruvec *lruvec;
int i;
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_ratelimited();
lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
pages += lruvec_page_state_local(lruvec,