Hi,
in 5.10.94 these two xfrm changes cause userspace programs like Cilium to
suddenly fail (https://github.com/cilium/cilium/pull/18789):
- xfrm: interface with if_id 0 should return error
8dce43919566f06e865f7e8949f5c10d8c2493f5
- xfrm: state and policy should fail if XFRMA_IF_ID 0
68ac0f3810e76a853b5f7b90601a05c3048b8b54
I see that these changes are a reaction to
- xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
9f8550e4bd9d
but even if the "wrong" usage caused weird behavior I still wonder if it
was the right decision to do the changes as part of a bugfix update for an
LTS kernel.
What do you think about reverting the changes at least for 5.10?
Regards,
Kai
This is the start of the stable review cycle for the 4.14.269 release.
There are 31 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 02 Mar 2022 17:20:16 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.269-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.14.269-rc1
Linus Torvalds <torvalds(a)linux-foundation.org>
fget: clarify and improve __fget_files() implementation
Miaohe Lin <linmiaohe(a)huawei.com>
memblock: use kfree() to release kmalloced memblock regions
Karol Herbst <kherbst(a)redhat.com>
Revert "drm/nouveau/pmu/gm200-: avoid touching PMU outside of DEVINIT/PREOS/ACR"
daniel.starke(a)siemens.com <daniel.starke(a)siemens.com>
tty: n_gsm: fix proper link termination after failed open
daniel.starke(a)siemens.com <daniel.starke(a)siemens.com>
tty: n_gsm: fix encoding of control signal octet bit DV
Hongyu Xie <xiehongyu1(a)kylinos.cn>
xhci: Prevent futile URB re-submissions due to incorrect return value.
Puma Hsu <pumahsu(a)google.com>
xhci: re-initialize the HC during resume if HCE was set
Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
usb: dwc3: gadget: Let the interrupt handler disable bottom halves.
Daniele Palmas <dnlplm(a)gmail.com>
USB: serial: option: add Telit LE910R1 compositions
Slark Xiao <slark_xiao(a)163.com>
USB: serial: option: add support for DW5829e
Steven Rostedt (Google) <rostedt(a)goodmis.org>
tracefs: Set the group ownership in apply_options() not parse_options()
Szymon Heidrich <szymon.heidrich(a)gmail.com>
USB: gadget: validate endpoint index for xilinx udc
Daehwan Jung <dh10.jung(a)samsung.com>
usb: gadget: rndis: add spinlock for rndis response list
Dmytro Bagrii <dimich.dmb(a)gmail.com>
Revert "USB: serial: ch341: add new Product ID for CH341A"
Sergey Shtylyov <s.shtylyov(a)omp.ru>
ata: pata_hpt37x: disable primary channel on HPT371
Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
iio: adc: men_z188_adc: Fix a resource leak in an error handling path
Bart Van Assche <bvanassche(a)acm.org>
RDMA/ib_srp: Fix a deadlock
ChenXiaoSong <chenxiaosong2(a)huawei.com>
configfs: fix a race in configfs_{,un}register_subsystem()
Gal Pressman <gal(a)nvidia.com>
net/mlx5e: Fix wrong return value on ioctl EEPROM query failure
Maxime Ripard <maxime(a)cerno.tech>
drm/edid: Always set RGB444
Paul Blakey <paulb(a)nvidia.com>
openvswitch: Fix setting ipv6 fields causing hw csum failure
Tao Liu <thomas.liu(a)ucloud.cn>
gso: do not skip outer ip header in case of ipip and net_failover
Eric Dumazet <edumazet(a)google.com>
net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends
Xin Long <lucien.xin(a)gmail.com>
ping: remove pr_err from ping_lookup
Robert Hancock <robert.hancock(a)calian.com>
serial: 8250: of: Fix mapped region size when using reg-offset property
Oliver Neukum <oneukum(a)suse.com>
USB: zaurus: support another broken Zaurus
Oliver Neukum <oneukum(a)suse.com>
sr9700: sanity check for packet length
Helge Deller <deller(a)gmx.de>
parisc/unaligned: Fix ldw() and stw() unalignment handlers
Helge Deller <deller(a)gmx.de>
parisc/unaligned: Fix fldd and fstd unaligned handlers on 32-bit kernel
Stefano Garzarella <sgarzare(a)redhat.com>
vhost/vsock: don't check owner in vhost_vsock_stop() while releasing
Zhang Qiao <zhangqiao22(a)huawei.com>
cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
-------------
Diffstat:
Makefile | 4 +-
arch/parisc/kernel/unaligned.c | 14 ++---
drivers/ata/pata_hpt37x.c | 14 +++++
drivers/gpu/drm/drm_edid.c | 2 +-
drivers/gpu/drm/nouveau/nvkm/subdev/pmu/base.c | 37 +++++------
drivers/iio/adc/men_z188_adc.c | 9 ++-
drivers/infiniband/ulp/srp/ib_srp.c | 6 +-
.../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 2 +-
drivers/net/usb/cdc_ether.c | 12 ++++
drivers/net/usb/sr9700.c | 2 +-
drivers/net/usb/zaurus.c | 12 ++++
drivers/tty/n_gsm.c | 4 +-
drivers/tty/serial/8250/8250_of.c | 11 +++-
drivers/usb/dwc3/gadget.c | 2 +
drivers/usb/gadget/function/rndis.c | 8 +++
drivers/usb/gadget/function/rndis.h | 1 +
drivers/usb/gadget/udc/udc-xilinx.c | 6 ++
drivers/usb/host/xhci.c | 28 ++++++---
drivers/usb/serial/ch341.c | 1 -
drivers/usb/serial/option.c | 12 ++++
drivers/vhost/vsock.c | 21 ++++---
fs/configfs/dir.c | 14 +++++
fs/file.c | 73 +++++++++++++++++-----
fs/tracefs/inode.c | 5 +-
include/net/checksum.h | 5 ++
kernel/cgroup/cpuset.c | 2 +
mm/memblock.c | 10 ++-
net/core/skbuff.c | 4 +-
net/ipv4/af_inet.c | 5 +-
net/ipv4/ping.c | 1 -
net/ipv6/ip6_offload.c | 2 +
net/openvswitch/actions.c | 46 +++++++++++---
32 files changed, 287 insertions(+), 88 deletions(-)
Daniel Dao has reported [1] a regression on workloads that may trigger
a lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flush are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems like there are workloads which
genuinely do stat updates larger than batch value within short amount of
time. Since the rstat flush can happen in the performance critical
codepaths like page faults, such workload can suffer greatly.
The easiest fix for now is for performance critical codepaths trigger
the rstat flush asynchronously. This patch converts the refault codepath
to use async rstat flush. In addition, this patch has premptively
converted mem_cgroup_wb_stats and shrink_node to also use the async
rstat flush as they may also similar performance regressions.
Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndg… [1]
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Reported-by: Daniel Dao <dqminh(a)cloudflare.com>
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Cc: <stable(a)vger.kernel.org>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 10 +++++++++-
mm/vmscan.c | 2 +-
mm/workingset.c | 2 +-
4 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ef4b445392a9..bfdd48be60ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -998,6 +998,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
}
void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_async(void);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c695608c521c..4338e8d779b2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -690,6 +690,14 @@ void mem_cgroup_flush_stats(void)
__mem_cgroup_flush_stats();
}
+void mem_cgroup_flush_stats_async(void)
+{
+ if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
+ atomic_set(&stats_flush_threshold, 0);
+ mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
+ }
+}
+
static void flush_memcg_stats_dwork(struct work_struct *w)
{
__mem_cgroup_flush_stats();
@@ -4522,7 +4530,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_async();
*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6f77e3e6d59..b6c6b165c1ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3188,7 +3188,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* Flush the memory cgroup stats, so that we read accurate per-memcg
* lruvec stats for heuristics.
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_async();
memset(&sc->nr, 0, sizeof(sc->nr));
diff --git a/mm/workingset.c b/mm/workingset.c
index b717eae4e0dd..a4f2b1aa5bcc 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -355,7 +355,7 @@ void workingset_refault(struct folio *folio, void *shadow)
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_async();
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
--
2.35.1.574.g5d30c73bfb-goog
From: Maxim Levitsky <mlevitsk(a)redhat.com>
[ Upstream commit 755c2bf878607dbddb1423df9abf16b82205896f ]
kvm_apic_update_apicv is called when AVIC is still active, thus IRR bits
can be set by the CPU after it is called, and don't cause the irr_pending
to be set to true.
Also logic in avic_kick_target_vcpu doesn't expect a race with this
function so to make it simple, just keep irr_pending set to true and
let the next interrupt injection to the guest clear it.
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20220207155447.840194-9-mlevitsk(a)redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/x86/kvm/lapic.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 91c2dc9f198df..5f935e7a09566 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2306,7 +2306,12 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
apic->irr_pending = true;
apic->isr_count = 1;
} else {
- apic->irr_pending = (apic_search_irr(apic) != -1);
+ /*
+ * Don't clear irr_pending, searching the IRR can race with
+ * updates from the CPU as APICv is still active from hardware's
+ * perspective. The flag will be cleared as appropriate when
+ * KVM injects the interrupt.
+ */
apic->isr_count = count_vectors(apic->regs + APIC_ISR);
}
}
--
2.34.1
From: Maxim Levitsky <mlevitsk(a)redhat.com>
[ Upstream commit 755c2bf878607dbddb1423df9abf16b82205896f ]
kvm_apic_update_apicv is called when AVIC is still active, thus IRR bits
can be set by the CPU after it is called, and don't cause the irr_pending
to be set to true.
Also logic in avic_kick_target_vcpu doesn't expect a race with this
function so to make it simple, just keep irr_pending set to true and
let the next interrupt injection to the guest clear it.
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
Message-Id: <20220207155447.840194-9-mlevitsk(a)redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/x86/kvm/lapic.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e8e383fbe8868..bfac6d0933c39 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2306,7 +2306,12 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
apic->irr_pending = true;
apic->isr_count = 1;
} else {
- apic->irr_pending = (apic_search_irr(apic) != -1);
+ /*
+ * Don't clear irr_pending, searching the IRR can race with
+ * updates from the CPU as APICv is still active from hardware's
+ * perspective. The flag will be cleared as appropriate when
+ * KVM injects the interrupt.
+ */
apic->isr_count = count_vectors(apic->regs + APIC_ISR);
}
}
--
2.34.1