From: James Morse <james.morse(a)arm.com>
commit 6685f5d572c22e1003e7c0d089afe1c64340ab1f upstream.
Commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in
ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests,
but didn't add trap handling. A previous patch supplied the missing trap
handling.
Existing VMs that have the MPAM field of ID_AA64PFR0_EL1 set need to
be migratable, but there is little point enabling the MPAM CPU
interface on new VMs until there is something a guest can do with it.
Clear the MPAM field from the guest's ID_AA64PFR0_EL1 and, on hardware
that supports MPAM, politely ignore the VMM's attempts to set this bit.
Guests exposed to this bug have the sanitised value of the MPAM field,
so only the correct value needs to be ignored. This means the field
can continue to be used to block migration to incompatible hardware
(between MPAM=1 and MPAM=5), and the VMM can't rely on the field
being ignored.
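For illustration, a minimal VMM-side sketch of the save/restore path this
preserves (hypothetical code, not part of the patch; assumes a vcpu fd
obtained through the usual KVM_CREATE_VCPU flow):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* ID_AA64PFR0_EL1 is op0=3, op1=0, CRn=0, CRm=4, op2=0 */
#define PFR0_REG_ID ARM64_SYS_REG(3, 0, 0, 4, 0)

static int restore_id_aa64pfr0(int vcpu_fd, uint64_t saved_val)
{
    struct kvm_one_reg reg = {
        .id   = PFR0_REG_ID,
        .addr = (uint64_t)(uintptr_t)&saved_val,
    };

    /*
     * With this patch, the write succeeds even if saved_val carries the
     * sanitised MPAM value (bits [43:40]); KVM quietly ignores the
     * field. Any other non-zero MPAM value is still rejected.
     */
    return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}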
Signed-off-by: James Morse <james.morse(a)arm.com>
Co-developed-by: Joey Gouly <joey.gouly(a)arm.com>
Signed-off-by: Joey Gouly <joey.gouly(a)arm.com>
Reviewed-by: Gavin Shan <gshan(a)redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi(a)huawei.com>
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-7-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
[maz: adapted to lack of ID_FILTERED()]
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
arch/arm64/kvm/sys_regs.c | 55 ++++++++++++++++++++++++++++++++++++---
1 file changed, 52 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index ff8c4e1b847ed..fbed433283c9b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1535,6 +1535,7 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu,
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTEX);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_DF2);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_PFAR);
+ val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MPAM_frac);
break;
case SYS_ID_AA64PFR2_EL1:
/* We only expose FPMR */
@@ -1724,6 +1725,13 @@ static u64 read_sanitised_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
val &= ~ID_AA64PFR0_EL1_AMU_MASK;
+ /*
+ * MPAM is disabled by default as KVM also needs a set of PARTID to
+ * program the MPAMVPMx_EL2 PARTID remapping registers with. But some
+ * older kernels let the guest see the ID bit.
+ */
+ val &= ~ID_AA64PFR0_EL1_MPAM_MASK;
+
return val;
}
@@ -1834,6 +1842,42 @@ static int set_id_dfr0_el1(struct kvm_vcpu *vcpu,
return set_id_reg(vcpu, rd, val);
}
+static int set_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *rd, u64 user_val)
+{
+ u64 hw_val = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
+ u64 mpam_mask = ID_AA64PFR0_EL1_MPAM_MASK;
+
+ /*
+ * Commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits
+ * in ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to
+ * guests, but didn't add trap handling. KVM doesn't support MPAM and
+ * always returns an UNDEF for these registers. The guest must see 0
+ * for this field.
+ *
+ * But KVM must also accept values from user-space that were provided
+ * by KVM. On CPUs that support MPAM, permit user-space to write
+ * the sanitised value to ID_AA64PFR0_EL1.MPAM, but ignore this field.
+ */
+ if ((hw_val & mpam_mask) == (user_val & mpam_mask))
+ user_val &= ~ID_AA64PFR0_EL1_MPAM_MASK;
+
+ return set_id_reg(vcpu, rd, user_val);
+}
+
+static int set_id_aa64pfr1_el1(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *rd, u64 user_val)
+{
+ u64 hw_val = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
+ u64 mpam_mask = ID_AA64PFR1_EL1_MPAM_frac_MASK;
+
+ /* See set_id_aa64pfr0_el1 for comment about MPAM */
+ if ((hw_val & mpam_mask) == (user_val & mpam_mask))
+ user_val &= ~ID_AA64PFR1_EL1_MPAM_frac_MASK;
+
+ return set_id_reg(vcpu, rd, user_val);
+}
+
/*
* cpufeature ID register user accessors
*
@@ -2377,7 +2421,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_ID_AA64PFR0_EL1),
.access = access_id_reg,
.get_user = get_id_reg,
- .set_user = set_id_reg,
+ .set_user = set_id_aa64pfr0_el1,
.reset = read_sanitised_id_aa64pfr0_el1,
.val = ~(ID_AA64PFR0_EL1_AMU |
ID_AA64PFR0_EL1_MPAM |
@@ -2385,7 +2429,12 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_AA64PFR0_EL1_RAS |
ID_AA64PFR0_EL1_AdvSIMD |
ID_AA64PFR0_EL1_FP), },
- ID_WRITABLE(ID_AA64PFR1_EL1, ~(ID_AA64PFR1_EL1_PFAR |
+ { SYS_DESC(SYS_ID_AA64PFR1_EL1),
+ .access = access_id_reg,
+ .get_user = get_id_reg,
+ .set_user = set_id_aa64pfr1_el1,
+ .reset = kvm_read_sanitised_id_reg,
+ .val = ~(ID_AA64PFR1_EL1_PFAR |
ID_AA64PFR1_EL1_DF2 |
ID_AA64PFR1_EL1_MTEX |
ID_AA64PFR1_EL1_THE |
@@ -2397,7 +2446,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_AA64PFR1_EL1_RES0 |
ID_AA64PFR1_EL1_MPAM_frac |
ID_AA64PFR1_EL1_RAS_frac |
- ID_AA64PFR1_EL1_MTE)),
+ ID_AA64PFR1_EL1_MTE), },
ID_WRITABLE(ID_AA64PFR2_EL1, ID_AA64PFR2_EL1_FPMR),
ID_UNALLOCATED(4,3),
ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0),
--
2.39.2
Commit b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
avoids checking number_of_same_symbols() for module symbols in
__trace_kprobe_create(), but create_local_trace_kprobe() should avoid this
check too. Doing this check leads to ENOENT for module_name:symbol_name
specifications passed via perf_event_open().
There is no bug in newer kernels, as it was fixed more generally by
commit 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads")
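A hypothetical reproducer sketch (the module and symbol names below are
placeholders); before this fix, opening the module_name:symbol_name form
through perf_event_open() failed with ENOENT:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    const char *func = "nf_conntrack:nf_ct_iterate_destroy"; /* placeholder */
    FILE *f;
    int type, fd;

    /* the dynamic PMU type for kprobe events */
    f = fopen("/sys/bus/event_source/devices/kprobe/type", "r");
    if (!f || fscanf(f, "%d", &type) != 1)
        return 1;
    fclose(f);

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config1 = (uint64_t)(uintptr_t)func; /* attr.kprobe_func */
    attr.config2 = 0;                         /* probe offset */

    fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        perror("perf_event_open"); /* ENOENT before this fix */
    return 0;
}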
Link: https://lore.kernel.org/linux-trace-kernel/20240705161030.b3ddb33a8167013b9…
Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
Signed-off-by: Nikolay Kuratov <kniv(a)yandex-team.ru>
---
v1 -> v2:
* Reword commit title and message
* Send for stable instead of mainline
kernel/trace/trace_kprobe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 12d997bb3e78..94cb09d44115 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1814,7 +1814,7 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
int ret;
char *event;
- if (func) {
+ if (func && !strchr(func, ':')) {
unsigned int count;
count = number_of_same_symbols(func);
--
2.34.1
If the clock tu->ick was not enabled in tahvo_usb_probe(), it may still
hold a non-error pointer, potentially causing the clock to be incorrectly
disabled later in the function.
Use the devm_clk_get_enabled() helper function to ensure a proper call
balance for tu->ick.
Found by Linux Verification Center (linuxtesting.org) with Klever.
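For reference, a minimal sketch of the pattern the patch switches to (not
the driver code itself): devm_clk_get_enabled() combines the clock lookup
with clk_prepare_enable(), and the matching disable/unprepare and put
happen automatically on probe failure and on driver detach:

#include <linux/clk.h>
#include <linux/platform_device.h>

static int example_probe(struct platform_device *pdev)
{
    struct clk *ick;

    ick = devm_clk_get_enabled(&pdev->dev, "usb_l4_ick");
    if (IS_ERR(ick))
        return dev_err_probe(&pdev->dev, PTR_ERR(ick),
                             "failed to get and enable clock\n");

    /* ... rest of probe; no clk_disable() needed in any error path ... */
    return 0;
}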
Fixes: 9ba96ae5074c ("usb: omap1: Tahvo USB transceiver driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Vitalii Mordan <mordan(a)ispras.ru>
---
v2: Corrected a typo in the error handling of the devm_clk_get_enabled
call. This issue was reported by Dan Carpenter <dan.carpenter(a)linaro.org>.
drivers/usb/phy/phy-tahvo.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/drivers/usb/phy/phy-tahvo.c b/drivers/usb/phy/phy-tahvo.c
index ae7bf3ff89ee..4182e86dc450 100644
--- a/drivers/usb/phy/phy-tahvo.c
+++ b/drivers/usb/phy/phy-tahvo.c
@@ -341,9 +341,11 @@ static int tahvo_usb_probe(struct platform_device *pdev)
mutex_init(&tu->serialize);
- tu->ick = devm_clk_get(&pdev->dev, "usb_l4_ick");
- if (!IS_ERR(tu->ick))
- clk_enable(tu->ick);
+ tu->ick = devm_clk_get_enabled(&pdev->dev, "usb_l4_ick");
+ if (IS_ERR(tu->ick)) {
+ dev_err(&pdev->dev, "failed to get and enable clock\n");
+ return PTR_ERR(tu->ick);
+ }
/*
* Set initial state, so that we generate kevents only on state changes.
@@ -353,15 +355,14 @@ static int tahvo_usb_probe(struct platform_device *pdev)
tu->extcon = devm_extcon_dev_allocate(&pdev->dev, tahvo_cable);
if (IS_ERR(tu->extcon)) {
dev_err(&pdev->dev, "failed to allocate memory for extcon\n");
- ret = PTR_ERR(tu->extcon);
- goto err_disable_clk;
+ return PTR_ERR(tu->extcon);
}
ret = devm_extcon_dev_register(&pdev->dev, tu->extcon);
if (ret) {
dev_err(&pdev->dev, "could not register extcon device: %d\n",
ret);
- goto err_disable_clk;
+ return ret;
}
/* Set the initial cable state. */
@@ -384,7 +385,7 @@ static int tahvo_usb_probe(struct platform_device *pdev)
if (ret < 0) {
dev_err(&pdev->dev, "cannot register USB transceiver: %d\n",
ret);
- goto err_disable_clk;
+ return ret;
}
dev_set_drvdata(&pdev->dev, tu);
@@ -405,9 +406,6 @@ static int tahvo_usb_probe(struct platform_device *pdev)
err_remove_phy:
usb_remove_phy(&tu->phy);
-err_disable_clk:
- if (!IS_ERR(tu->ick))
- clk_disable(tu->ick);
return ret;
}
@@ -418,8 +416,6 @@ static void tahvo_usb_remove(struct platform_device *pdev)
free_irq(tu->irq, tu);
usb_remove_phy(&tu->phy);
- if (!IS_ERR(tu->ick))
- clk_disable(tu->ick);
}
static struct platform_driver tahvo_usb_driver = {
--
2.25.1
From: Chuck Lever <chuck.lever(a)oracle.com>
Testing shows that the EBUSY error return from mtree_alloc_cyclic()
leaks into user space. The ERRORS section of "man creat(2)" says:
> EBUSY O_EXCL was specified in flags and pathname refers
> to a block device that is in use by the system
> (e.g., it is mounted).
ENOSPC is closer to what applications expect in this situation.
Note that the normal range of simple directory offset values is
2..2^63, so hitting this error is going to be rare to impossible.
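For illustration, what a caller would see (a sketch; with the 2..2^63
range the condition is practically unreachable): open(2) with
O_CREAT|O_EXCL documents ENOSPC for a directory that cannot be extended,
while EBUSY is reserved for the mounted-block-device case quoted above:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    int fd = open("/tmp/newfile", O_CREAT | O_EXCL | O_WRONLY, 0644);

    if (fd < 0 && errno == ENOSPC)
        fprintf(stderr, "directory cannot hold a new entry\n");
    else if (fd < 0)
        perror("open");
    return 0;
}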
Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets")
Cc: <stable(a)vger.kernel.org> # v6.9+
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Reviewed-by: Yang Erkun <yangerkun(a)huawei.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
---
fs/libfs.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index 748ac5923154..f6d04c69f195 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -292,7 +292,9 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
LONG_MAX, &octx->next_offset, GFP_KERNEL);
- if (ret < 0)
+ if (unlikely(ret == -EBUSY))
+ return -ENOSPC;
+ if (unlikely(ret < 0))
return ret;
offset_set(dentry, offset);
--
2.47.0
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 978c4486cca5c7b9253d3ab98a88c8e769cb9bbd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121506-pancreas-mosaic-0ae0@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 978c4486cca5c7b9253d3ab98a88c8e769cb9bbd Mon Sep 17 00:00:00 2001
From: Jiri Olsa <jolsa(a)kernel.org>
Date: Sun, 8 Dec 2024 15:25:07 +0100
Subject: [PATCH] bpf,perf: Fix invalid prog_array access in
perf_event_detach_bpf_prog
Syzbot reported [1] crash that happens for following tracing scenario:
- create tracepoint perf event with attr.inherit=1, attach it to the
process and set bpf program to it
- attached process forks -> child creates inherited event
the new child event shares the parent's bpf program and tp_event
(hence prog_array) which is global for tracepoint
- exit both process and its child -> release both events
- first perf_event_detach_bpf_prog call will release tp_event->prog_array
and second perf_event_detach_bpf_prog will crash, because
tp_event->prog_array is NULL
The fix makes sure perf_event_detach_bpf_prog() checks that prog_array
is valid before it tries to remove the bpf program from it.
[1] https://lore.kernel.org/bpf/Z1MR6dCIKajNS6nU@krava/T/#m91dbf0688221ec7a7fc9…
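A condensed reproducer sketch of that scenario (assumptions: prog_fd is a
BPF program already loaded, e.g. via libbpf, and tp_id is a tracepoint id
read from /sys/kernel/tracing/events/.../id):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_inherited_tp_event(int tp_id, int prog_fd)
{
    struct perf_event_attr attr;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.config = tp_id;
    attr.inherit = 1;          /* children inherit the event */

    fd = syscall(SYS_perf_event_open, &attr, 0 /* this pid */, -1, -1, 0);
    if (fd < 0)
        return -1;
    /* child events share this bpf program and tp_event->prog_array */
    if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0)
        return -1;

    if (fork() == 0)
        _exit(0);              /* child exits -> inherited event released */
    return fd;                 /* parent exit releases the second event */
}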
Fixes: 0ee288e69d03 ("bpf,perf: Fix perf_event_detach_bpf_prog error handling")
Reported-by: syzbot+2e0d2840414ce817aaac(a)syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa(a)kernel.org>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Link: https://lore.kernel.org/bpf/20241208142507.1207698-1-jolsa@kernel.org
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a403b05a7091..1b8db5aee9d3 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2250,6 +2250,9 @@ void perf_event_detach_bpf_prog(struct perf_event *event)
goto unlock;
old_array = bpf_event_rcu_dereference(event->tp_event->prog_array);
+ if (!old_array)
+ goto put;
+
ret = bpf_prog_array_copy(old_array, event->prog, NULL, 0, &new_array);
if (ret < 0) {
bpf_prog_array_delete_safe(old_array, event->prog);
@@ -2258,6 +2261,7 @@ void perf_event_detach_bpf_prog(struct perf_event *event)
bpf_prog_array_free_sleepable(old_array);
}
+put:
/*
* It could be that the bpf_prog is not sleepable (and will be freed
* via normal RCU), but is called from a point that supports sleepable
From: yangge <yangge1116(a)126.com>
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
in __compaction_suitable()") allowed compaction to proceed when the
free pages required for compaction reside in CMA pageblocks, it's
possible that __compaction_suitable() always returns true, and in
some cases that's not acceptable.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.
During the start-up of the virtual machine, it calls
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long-term GUP cannot allocate memory from the CMA area, so at most
16GB of non-CMA memory on a NUMA node can be used as virtual
machine memory. Since there is 16GB of free CMA memory on the NUMA
node, the order-0 watermark for compaction is always met, so
__compaction_suitable() always returns true, even though the node is
unable to allocate non-CMA memory for the virtual machine.
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine start-up times.
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here
In order to quickly fall back to a remote node, we should remove
ALLOC_CMA from both __compaction_suitable() and __isolate_free_page()
in the long-term GUP flow. After this fix, starting a 32GB virtual
machine with device passthrough takes only a few seconds.
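A standalone sketch of that watermark arithmetic (illustrative numbers,
not kernel code): with 16GB of free CMA and no free non-CMA pages,
counting CMA pages makes the node look suitable for compaction even
though a FOLL_LONGTERM pin cannot use them:

#include <stdbool.h>
#include <stdio.h>

static bool watermark_ok(unsigned long free_pages, unsigned long free_cma,
                         unsigned long watermark, bool alloc_cma)
{
    if (!alloc_cma)
        free_pages -= free_cma;   /* pinned allocations can't use CMA */
    return free_pages > watermark;
}

int main(void)
{
    unsigned long free = 16UL << (30 - 12); /* 16GB free, all CMA, 4K pages */
    unsigned long cma = free;
    unsigned long wmark = 1024;             /* illustrative watermark */

    printf("with ALLOC_CMA:    %d\n", watermark_ok(free, cma, wmark, true));
    printf("without ALLOC_CMA: %d\n", watermark_ok(free, cma, wmark, false));
    return 0;
}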
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
---
V4:
- rich the commit log description
V3:
- fix build errors
- add ALLOC_CMA both in should_continue_reclaim() and compaction_ready()
V2:
- using the 'cc->alloc_flags' to determin if 'ALLOC_CMA' is needed
- rich the commit log description
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 18 +++++++++++-------
mm/page_alloc.c | 4 +++-
mm/vmscan.c | 4 ++--
4 files changed, 20 insertions(+), 12 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index e947764..b4c3ac3 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ int highest_zoneidx,
+ unsigned int alloc_flags);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -108,7 +109,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ int highest_zoneidx,
+ unsigned int alloc_flags)
{
return false;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..585f5ab 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2381,9 +2381,11 @@ static enum compact_result compact_finished(struct compact_control *cc)
static bool __compaction_suitable(struct zone *zone, int order,
int highest_zoneidx,
+ unsigned int alloc_flags,
unsigned long wmark_target)
{
unsigned long watermark;
+ bool use_cma;
/*
* Watermarks for order-0 must be met for compaction to be able to
* isolate free pages for migration targets. This means that the
@@ -2395,25 +2397,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
+ use_cma = !!(alloc_flags & ALLOC_CMA);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ use_cma ? ALLOC_CMA : 0, wmark_target);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+ unsigned int alloc_flags)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2474,7 +2478,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ alloc_flags, available))
return true;
}
@@ -2499,7 +2503,7 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
alloc_flags))
return COMPACT_SUCCESS;
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dde19db..9a5dfda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2813,6 +2813,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
+ bool pin;
if (!is_migrate_isolate(mt)) {
unsigned long watermark;
@@ -2823,7 +2824,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ pin = !!(current->flags & PF_MEMALLOC_PIN);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, pin ? 0 : ALLOC_CMA))
return 0;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e03a61..33f5b46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5815,7 +5815,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
}
@@ -6043,7 +6043,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
return true;
/* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
/*
--
2.7.4