The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x f559b2e9c5c5308850544ab59396b7d53cfc67bd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024102849-giggle-five-0241@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..
Possible dependencies:
f559b2e9c5c5 ("KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory")
2732be902353 ("KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs")
883b0a91f41a ("KVM: SVM: Move Nested SVM Implementation to nested.c")
46a010dd6896 ("kVM SVM: Move SVM related files to own sub-directory")
d55c9d4009c7 ("KVM: nSVM: check for EFER.SVME=1 before entering guest")
0b66465344a7 ("KVM: nSVM: Remove an obsolete comment.")
78f2145c4d93 ("KVM: nSVM: avoid loss of pending IRQ/NMI before entering L2")
b518ba9fa691 ("KVM: nSVM: implement check_nested_events for interrupts")
64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
b5ec2e020b70 ("KVM: nSVM: do not change host intercepts while nested VM is running")
689f3bf21628 ("KVM: x86: unify callbacks to load paging root")
257038745cae ("KVM: x86: Move nSVM CPUID 0x8000000A handling into common x86 code")
a50718cc3f43 ("KVM: nSVM: Expose SVM features to L1 iff nested is enabled")
703c335d0693 ("KVM: x86/mmu: Configure max page level during hardware setup")
bde772355958 ("KVM: x86/mmu: Merge kvm_{enable,disable}_tdp() into a common function")
213e0e1f500b ("KVM: SVM: Refactor logging of NPT enabled/disabled")
a1bead2abaa1 ("KVM: VMX: Directly query Intel PT mode when refreshing PMUs")
139085101f85 ("KVM: x86: Use KVM cpu caps to detect MSR_TSC_AUX virt support")
93c380e7b528 ("KVM: x86: Set emulated/transmuted feature bits via kvm_cpu_caps")
bd7919999047 ("KVM: x86: Override host CPUID results with kvm_cpu_caps")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f559b2e9c5c5308850544ab59396b7d53cfc67bd Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 9 Oct 2024 07:08:38 -0700
Subject: [PATCH] KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits
4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't
enforce 32-byte alignment of nCR3.
In the absolute worst case scenario, failure to ignore bits 4:0 can result
in an out-of-bounds read, e.g. if the target page is at the end of a
memslot, and the VMM isn't using guard pages.
Per the APM:
The CR3 register points to the base address of the page-directory-pointer
table. The page-directory-pointer table is aligned on a 32-byte boundary,
with the low 5 address bits 4:0 assumed to be 0.
And the SDM's much more explicit:
4:0 Ignored
Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow
that is broken.
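As a standalone illustration of the offset math (a hypothetical userspace
program, not kernel code; GENMASK is redefined locally), masking with
GENMASK(11, 5) keeps the PDPTE read inside the page, while
offset_in_page()-style masking lets nCR3[4:0] push it past the page
boundary:
#include <stdint.h>
#include <stdio.h>
#define GENMASK(h, l) ((~0ULL << (l)) & (~0ULL >> (63 - (h))))
int main(void)
{
	uint64_t ncr3 = 0x12345ff8;	/* PAE nCR3 with bits 4:0 != 0 */
	int index = 3;			/* PDPTEs are 8 bytes each */
	/* offset_in_page(cr3) keeps bits 4:0 ... */
	uint64_t bad = (ncr3 & 0xfffULL) + index * 8;
	/* ... while the CPU (and the fix) drops them. */
	uint64_t good = (ncr3 & GENMASK(11, 5)) + index * 8;
	/* bad = 0x1010 reads past the 4 KiB page; good = 0xff8 stays inside. */
	printf("bad=%#llx good=%#llx\n",
	       (unsigned long long)bad, (unsigned long long)good);
	return 0;
}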
Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
Reported-by: Kirk Swidowski <swidowski@google.com>
Cc: Andy Nguyen <theflow@google.com>
Cc: 3pvd <3pvd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241009140838.1036226-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d5314cb7dff4..cf84103ce38b 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -63,8 +63,12 @@ static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
u64 pdpte;
int ret;
+ /*
+ * Note, nCR3 is "assumed" to be 32-byte aligned, i.e. the CPU ignores
+ * nCR3[4:0] when loading PDPTEs from memory.
+ */
ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &pdpte,
- offset_in_page(cr3) + index * 8, 8);
+ (cr3 & GENMASK(11, 5)) + index * 8, 8);
if (ret)
return 0;
return pdpte;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x f559b2e9c5c5308850544ab59396b7d53cfc67bd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024102847-level-stoic-fc77@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
f559b2e9c5c5 ("KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory")
2732be902353 ("KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs")
883b0a91f41a ("KVM: SVM: Move Nested SVM Implementation to nested.c")
46a010dd6896 ("kVM SVM: Move SVM related files to own sub-directory")
d55c9d4009c7 ("KVM: nSVM: check for EFER.SVME=1 before entering guest")
0b66465344a7 ("KVM: nSVM: Remove an obsolete comment.")
78f2145c4d93 ("KVM: nSVM: avoid loss of pending IRQ/NMI before entering L2")
b518ba9fa691 ("KVM: nSVM: implement check_nested_events for interrupts")
64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
b5ec2e020b70 ("KVM: nSVM: do not change host intercepts while nested VM is running")
689f3bf21628 ("KVM: x86: unify callbacks to load paging root")
257038745cae ("KVM: x86: Move nSVM CPUID 0x8000000A handling into common x86 code")
a50718cc3f43 ("KVM: nSVM: Expose SVM features to L1 iff nested is enabled")
703c335d0693 ("KVM: x86/mmu: Configure max page level during hardware setup")
bde772355958 ("KVM: x86/mmu: Merge kvm_{enable,disable}_tdp() into a common function")
213e0e1f500b ("KVM: SVM: Refactor logging of NPT enabled/disabled")
a1bead2abaa1 ("KVM: VMX: Directly query Intel PT mode when refreshing PMUs")
139085101f85 ("KVM: x86: Use KVM cpu caps to detect MSR_TSC_AUX virt support")
93c380e7b528 ("KVM: x86: Set emulated/transmuted feature bits via kvm_cpu_caps")
bd7919999047 ("KVM: x86: Override host CPUID results with kvm_cpu_caps")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f559b2e9c5c5308850544ab59396b7d53cfc67bd Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 9 Oct 2024 07:08:38 -0700
Subject: [PATCH] KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits
4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't
enforce 32-byte alignment of nCR3.
In the absolute worst case scenario, failure to ignore bits 4:0 can result
in an out-of-bounds read, e.g. if the target page is at the end of a
memslot, and the VMM isn't using guard pages.
Per the APM:
The CR3 register points to the base address of the page-directory-pointer
table. The page-directory-pointer table is aligned on a 32-byte boundary,
with the low 5 address bits 4:0 assumed to be 0.
And the SDM's much more explicit:
4:0 Ignored
Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow
that is broken.
Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory")
Reported-by: Kirk Swidowski <swidowski@google.com>
Cc: Andy Nguyen <theflow@google.com>
Cc: 3pvd <3pvd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241009140838.1036226-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d5314cb7dff4..cf84103ce38b 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -63,8 +63,12 @@ static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
u64 pdpte;
int ret;
+ /*
+ * Note, nCR3 is "assumed" to be 32-byte aligned, i.e. the CPU ignores
+ * nCR3[4:0] when loading PDPTEs from memory.
+ */
ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &pdpte,
- offset_in_page(cr3) + index * 8, 8);
+ (cr3 & GENMASK(11, 5)) + index * 8, 8);
if (ret)
return 0;
return pdpte;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 088984c8d54c0053fc4ae606981291d741c5924b
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024102850-postal-pogo-94ed@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 088984c8d54c0053fc4ae606981291d741c5924b Mon Sep 17 00:00:00 2001
From: Koba Ko <kobak@nvidia.com>
Date: Sun, 13 Oct 2024 04:50:10 +0800
Subject: [PATCH] ACPI: PRM: Find EFI_MEMORY_RUNTIME block for PRM handler and
context
PRMT needs to find the correct type of block to translate the PA-VA
mapping for EFI runtime services.
The issue arises because the PRMT is finding a block of type
EFI_CONVENTIONAL_MEMORY, which is not appropriate for runtime services
as described in Section 2.2.2 (Runtime Services) of the UEFI
Specification [1]. Since the PRM handler is a type of runtime service,
this causes an exception when the PRM handler is called.
[Firmware Bug]: Unable to handle paging request in EFI runtime service
WARNING: CPU: 22 PID: 4330 at drivers/firmware/efi/runtime-wrappers.c:341
__efi_queue_work+0x11c/0x170
Call trace:
Let PRMT find a block with EFI_MEMORY_RUNTIME for PRM handler and PRM
context.
If no suitable block is found, a warning message will be printed, but
the procedure continues to manage the next PRM handler.
However, if the PRM handler is actually called without proper allocation,
it would result in a failure during error handling.
By using the correct memory types for runtime services, ensure that the
PRM handler and the context are properly mapped in the virtual address
space during runtime, preventing the paging request error.
The issue is really that only memory that has been remapped for runtime
by the firmware can be used by the PRM handler, and so the region needs
to have the EFI_MEMORY_RUNTIME attribute.
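As a self-contained sketch of the lookup the patch implements (a
simplified model with hypothetical type names, not the driver code
itself): only descriptors carrying EFI_MEMORY_RUNTIME have a usable
virt_addr after SetVirtualAddressMap(), so the PA-to-VA walk must filter
on that attribute:
#include <stdint.h>
#include <stddef.h>
#define EFI_MEMORY_RUNTIME (1ULL << 63)	/* attribute bit per the UEFI spec */
#define PAGE_SIZE 4096ULL
#define PAGE_MASK (~(PAGE_SIZE - 1))
struct md { uint64_t attribute, phys_addr, virt_addr, num_pages; };
static uint64_t pa_to_va(const struct md *map, size_t n, uint64_t pa)
{
	uint64_t pa_offset = pa & ~PAGE_MASK;
	uint64_t page = pa & PAGE_MASK;
	for (size_t i = 0; i < n; i++) {
		const struct md *m = &map[i];
		/* Skip conventional memory: no runtime VA mapping exists. */
		if ((m->attribute & EFI_MEMORY_RUNTIME) &&
		    m->phys_addr < pa &&
		    pa < m->phys_addr + PAGE_SIZE * m->num_pages)
			return pa_offset + m->virt_addr + page - m->phys_addr;
	}
	return 0;	/* caller treats 0 as "unmapped", as the patch does */
}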
Link: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_10_Aug29.pdf # [1]
Fixes: cefc7ca46235 ("ACPI: PRM: implement OperationRegion handler for the PlatformRtMechanism subtype")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Koba Ko <kobak@nvidia.com>
Reviewed-by: Matthew R. Ochs <mochs@nvidia.com>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://patch.msgid.link/20241012205010.4165798-1-kobak@nvidia.com
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
diff --git a/drivers/acpi/prmt.c b/drivers/acpi/prmt.c
index 1cfaa5957ac4..d59307a76ca3 100644
--- a/drivers/acpi/prmt.c
+++ b/drivers/acpi/prmt.c
@@ -72,17 +72,21 @@ struct prm_module_info {
struct prm_handler_info handlers[] __counted_by(handler_count);
};
-static u64 efi_pa_va_lookup(u64 pa)
+static u64 efi_pa_va_lookup(efi_guid_t *guid, u64 pa)
{
efi_memory_desc_t *md;
u64 pa_offset = pa & ~PAGE_MASK;
u64 page = pa & PAGE_MASK;
for_each_efi_memory_desc(md) {
- if (md->phys_addr < pa && pa < md->phys_addr + PAGE_SIZE * md->num_pages)
+ if ((md->attribute & EFI_MEMORY_RUNTIME) &&
+ (md->phys_addr < pa && pa < md->phys_addr + PAGE_SIZE * md->num_pages)) {
return pa_offset + md->virt_addr + page - md->phys_addr;
+ }
}
+ pr_warn("Failed to find VA for GUID: %pUL, PA: 0x%llx", guid, pa);
+
return 0;
}
@@ -148,9 +152,15 @@ acpi_parse_prmt(union acpi_subtable_headers *header, const unsigned long end)
th = &tm->handlers[cur_handler];
guid_copy(&th->guid, (guid_t *)handler_info->handler_guid);
- th->handler_addr = (void *)efi_pa_va_lookup(handler_info->handler_address);
- th->static_data_buffer_addr = efi_pa_va_lookup(handler_info->static_data_buffer_address);
- th->acpi_param_buffer_addr = efi_pa_va_lookup(handler_info->acpi_param_buffer_address);
+ th->handler_addr =
+ (void *)efi_pa_va_lookup(&th->guid, handler_info->handler_address);
+
+ th->static_data_buffer_addr =
+ efi_pa_va_lookup(&th->guid, handler_info->static_data_buffer_address);
+
+ th->acpi_param_buffer_addr =
+ efi_pa_va_lookup(&th->guid, handler_info->acpi_param_buffer_address);
+
} while (++cur_handler < tm->handler_count && (handler_info = get_next_handler(handler_info)));
return 0;
@@ -277,6 +287,13 @@ static acpi_status acpi_platformrt_space_handler(u32 function,
if (!handler || !module)
goto invalid_guid;
+ if (!handler->handler_addr ||
+ !handler->static_data_buffer_addr ||
+ !handler->acpi_param_buffer_addr) {
+ buffer->prm_status = PRM_HANDLER_ERROR;
+ return AE_OK;
+ }
+
ACPI_COPY_NAMESEG(context.signature, "PRMC");
context.revision = 0x0;
context.reserved = 0x0;
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x f10f59f91a6278e9637327d1206140d28e2d5004
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024102859-around-strangely-5c99@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f10f59f91a6278e9637327d1206140d28e2d5004 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Wed, 9 Oct 2024 09:37:03 +1030
Subject: [PATCH] btrfs: fix the delalloc range locking if sector size < page
size
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:
- Set the writer locks without checking if the folio is still valid
We call btrfs_folio_start_writer_lock() just like it's folio_lock().
But since the folio may not even be the folio of the current mapping,
we can easily screw up the folio->private.
- The range is not clamped inside the page
This means we can overwrite other bitmaps if the start/len is not
properly handled, and trigger the btrfs_subpage_assert().
- @processed_end is always rounded up to page end
If the delalloc range is not page aligned, and we need to retry
(returning -EAGAIN), then we will unlock to the page end.
Thankfully this is not a huge problem, as now
btrfs_folio_end_writer_lock() can handle a range larger than the locked
range, and only unlocks what is already locked.
Fix all these problems by:
- Lock and check the folio first, then call
btrfs_folio_set_writer_lock()
So that if we got a folio not belonging to the inode, we won't
touch folio->private.
- Properly truncate the range inside the page
- Update @processed_end to the locked range end
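As a standalone sketch of the clamping logic (hypothetical helper names;
folio_pos/folio_size are modeled as plain parameters), the locked range
is the intersection of the delalloc range with the folio:
#include <stdint.h>
struct span { uint64_t start; uint32_t len; };
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static struct span clamp_to_folio(uint64_t folio_pos, uint64_t folio_size,
				  uint64_t start, uint64_t end)
{
	/* Mirrors the max_t()/min_t() pair in the diff below. */
	uint64_t range_start = max_u64(folio_pos, start);
	uint32_t range_len = (uint32_t)(min_u64(folio_pos + folio_size,
						end + 1) - range_start);
	return (struct span){ range_start, range_len };
}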
Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 309a8ae48434..872cca54cc6c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -262,22 +262,23 @@ static noinline int lock_delalloc_folios(struct inode *inode,
for (i = 0; i < found_folios; i++) {
struct folio *folio = fbatch.folios[i];
- u32 len = end + 1 - start;
+ u64 range_start;
+ u32 range_len;
if (folio == locked_folio)
continue;
- if (btrfs_folio_start_writer_lock(fs_info, folio, start,
- len))
- goto out;
-
+ folio_lock(folio);
if (!folio_test_dirty(folio) || folio->mapping != mapping) {
- btrfs_folio_end_writer_lock(fs_info, folio, start,
- len);
+ folio_unlock(folio);
goto out;
}
+ range_start = max_t(u64, folio_pos(folio), start);
+ range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
+ end + 1) - range_start;
+ btrfs_folio_set_writer_lock(fs_info, folio, range_start, range_len);
- processed_end = folio_pos(folio) + folio_size(folio) - 1;
+ processed_end = range_start + range_len - 1;
}
folio_batch_release(&fbatch);
cond_resched();
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x f10f59f91a6278e9637327d1206140d28e2d5004
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024102857-unvaried-grip-2f5d@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f10f59f91a6278e9637327d1206140d28e2d5004 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Wed, 9 Oct 2024 09:37:03 +1030
Subject: [PATCH] btrfs: fix the delalloc range locking if sector size < page
size
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:
- Set the writer locks without checking if the folio is still valid
We call btrfs_folio_start_writer_lock() just like it's folio_lock().
But since the folio may not even be the folio of the current mapping,
we can easily screw up the folio->private.
- The range is not clamped inside the page
This means we can overwrite other bitmaps if the start/len is not
properly handled, and trigger the btrfs_subpage_assert().
- @processed_end is always rounded up to page end
If the delalloc range is not page aligned, and we need to retry
(returning -EAGAIN), then we will unlock to the page end.
Thankfully this is not a huge problem, as now
btrfs_folio_end_writer_lock() can handle a range larger than the locked
range, and only unlocks what is already locked.
Fix all these problems by:
- Lock and check the folio first, then call
btrfs_folio_set_writer_lock()
So that if we got a folio not belonging to the inode, we won't
touch folio->private.
- Properly truncate the range inside the page
- Update @processed_end to the locked range end
Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 309a8ae48434..872cca54cc6c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -262,22 +262,23 @@ static noinline int lock_delalloc_folios(struct inode *inode,
for (i = 0; i < found_folios; i++) {
struct folio *folio = fbatch.folios[i];
- u32 len = end + 1 - start;
+ u64 range_start;
+ u32 range_len;
if (folio == locked_folio)
continue;
- if (btrfs_folio_start_writer_lock(fs_info, folio, start,
- len))
- goto out;
-
+ folio_lock(folio);
if (!folio_test_dirty(folio) || folio->mapping != mapping) {
- btrfs_folio_end_writer_lock(fs_info, folio, start,
- len);
+ folio_unlock(folio);
goto out;
}
+ range_start = max_t(u64, folio_pos(folio), start);
+ range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
+ end + 1) - range_start;
+ btrfs_folio_set_writer_lock(fs_info, folio, range_start, range_len);
- processed_end = folio_pos(folio) + folio_size(folio) - 1;
+ processed_end = range_start + range_len - 1;
}
folio_batch_release(&fbatch);
cond_resched();
The patch below does not apply to the 6.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.11.y
git checkout FETCH_HEAD
git cherry-pick -x f10f59f91a6278e9637327d1206140d28e2d5004
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024102855-swab-anytime-3da1@gregkh' --subject-prefix 'PATCH 6.11.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From f10f59f91a6278e9637327d1206140d28e2d5004 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Wed, 9 Oct 2024 09:37:03 +1030
Subject: [PATCH] btrfs: fix the delalloc range locking if sector size < page
size
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:
- Set the writer locks without checking if the folio is still valid
We call btrfs_folio_start_writer_lock() just like it's folio_lock().
But since the folio may not even be the folio of the current mapping,
we can easily screw up the folio->private.
- The range is not clamped inside the page
This means we can overwrite other bitmaps if the start/len is not
properly handled, and trigger the btrfs_subpage_assert().
- @processed_end is always rounded up to page end
If the delalloc range is not page aligned, and we need to retry
(returning -EAGAIN), then we will unlock to the page end.
Thankfully this is not a huge problem, as now
btrfs_folio_end_writer_lock() can handle a range larger than the locked
range, and only unlocks what is already locked.
Fix all these problems by:
- Lock and check the folio first, then call
btrfs_folio_set_writer_lock()
So that if we got a folio not belonging to the inode, we won't
touch folio->private.
- Properly truncate the range inside the page
- Update @processed_end to the locked range end
Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 309a8ae48434..872cca54cc6c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -262,22 +262,23 @@ static noinline int lock_delalloc_folios(struct inode *inode,
for (i = 0; i < found_folios; i++) {
struct folio *folio = fbatch.folios[i];
- u32 len = end + 1 - start;
+ u64 range_start;
+ u32 range_len;
if (folio == locked_folio)
continue;
- if (btrfs_folio_start_writer_lock(fs_info, folio, start,
- len))
- goto out;
-
+ folio_lock(folio);
if (!folio_test_dirty(folio) || folio->mapping != mapping) {
- btrfs_folio_end_writer_lock(fs_info, folio, start,
- len);
+ folio_unlock(folio);
goto out;
}
+ range_start = max_t(u64, folio_pos(folio), start);
+ range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
+ end + 1) - range_start;
+ btrfs_folio_set_writer_lock(fs_info, folio, range_start, range_len);
- processed_end = folio_pos(folio) + folio_size(folio) - 1;
+ processed_end = range_start + range_len - 1;
}
folio_batch_release(&fbatch);
cond_resched();
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 5f9062a48db260fd6b53d86ecfb4d5dc59266316
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024102858-yield-gestate-e6bc@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5f9062a48db260fd6b53d86ecfb4d5dc59266316 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Tue, 10 Sep 2024 15:21:04 +0930
Subject: [PATCH] btrfs: qgroup: set a more sane default value for subtree drop
threshold
Since commit 011b46c30476 ("btrfs: skip subtree scan if it's too high to
avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
automatically skip large subtree scan at the cost of marking qgroup
inconsistent.
It's designed to address the final performance problem of snapshot drop
with qgroup enabled, but to be safe the default value is
BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
to make it work.
It's not a good idea to rely on a user space tool to set this default
value, especially when some operations (snapshot dropping) can be
triggered immediately after mount, leaving a very small window in which
to adjust that sysfs interface.
So instead of disabling this new feature by default, enable it with a
low threshold (3), so that large subvolume tree drop at mount time won't
cause huge qgroup workload.
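As a simplified model of the behavior being tuned (names and placement
approximate, not the exact kernel code), the threshold gates whether a
subtree is scanned for qgroup accounting or skipped with the numbers
merely marked inconsistent:
#define BTRFS_MAX_LEVEL	8	/* matches the kernel constant */
#define BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT	3
static int qgroup_skip_subtree(int root_level, int thres)
{
	/* With the old default of BTRFS_MAX_LEVEL this can never fire,
	 * since a b-tree root level is at most BTRFS_MAX_LEVEL - 1;
	 * with the new default of 3, dropping a tall snapshot at mount
	 * time skips the scan instead of stalling the transaction. */
	return root_level >= thres;
}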
CC: stable@vger.kernel.org # 6.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1238a38c59b2..5afb68c0304b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1959,7 +1959,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
fs_info->qgroup_seq = 1;
fs_info->qgroup_ulist = NULL;
fs_info->qgroup_rescan_running = false;
- fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
+ fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT;
mutex_init(&fs_info->qgroup_rescan_lock);
}
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1332ec59c539..a0e8deca87a7 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1407,7 +1407,7 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
fs_info->quota_root = NULL;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE;
- fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
+ fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT;
spin_unlock(&fs_info->qgroup_lock);
btrfs_free_qgroup_config(fs_info);
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 98adf4ec7b01..c229256d6fd5 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -121,6 +121,8 @@ struct btrfs_inode;
#define BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN (1ULL << 63)
#define BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING (1ULL << 62)
+#define BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT (3)
+
/*
* Record a dirty extent, and info qgroup to update quota on it
*/
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 5f9062a48db260fd6b53d86ecfb4d5dc59266316
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024102856-chance-seventh-bedd@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5f9062a48db260fd6b53d86ecfb4d5dc59266316 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Tue, 10 Sep 2024 15:21:04 +0930
Subject: [PATCH] btrfs: qgroup: set a more sane default value for subtree drop
threshold
Since commit 011b46c30476 ("btrfs: skip subtree scan if it's too high to
avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
automatically skip large subtree scan at the cost of marking qgroup
inconsistent.
It's designed to address the final performance problem of snapshot drop
with qgroup enabled, but to be safe the default value is
BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
to make it work.
It's not a good idea to rely on a user space tool to set this default
value, especially when some operations (snapshot dropping) can be
triggered immediately after mount, leaving a very small window in which
to adjust that sysfs interface.
So instead of disabling this new feature by default, enable it with a
low threshold (3), so that large subvolume tree drop at mount time won't
cause huge qgroup workload.
CC: stable@vger.kernel.org # 6.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1238a38c59b2..5afb68c0304b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1959,7 +1959,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
fs_info->qgroup_seq = 1;
fs_info->qgroup_ulist = NULL;
fs_info->qgroup_rescan_running = false;
- fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
+ fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT;
mutex_init(&fs_info->qgroup_rescan_lock);
}
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1332ec59c539..a0e8deca87a7 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1407,7 +1407,7 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
fs_info->quota_root = NULL;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE;
- fs_info->qgroup_drop_subtree_thres = BTRFS_MAX_LEVEL;
+ fs_info->qgroup_drop_subtree_thres = BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT;
spin_unlock(&fs_info->qgroup_lock);
btrfs_free_qgroup_config(fs_info);
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 98adf4ec7b01..c229256d6fd5 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -121,6 +121,8 @@ struct btrfs_inode;
#define BTRFS_QGROUP_RUNTIME_FLAG_CANCEL_RESCAN (1ULL << 63)
#define BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING (1ULL << 62)
+#define BTRFS_QGROUP_DROP_SUBTREE_THRES_DEFAULT (3)
+
/*
* Record a dirty extent, and info qgroup to update quota on it
*/
The Synopsys DesignWare mmc controller on the JH7110 SoC
(dw_mmc-starfive.c driver) uses a 32-bit IDMAC address bus width,
and thus requires the use of SWIOTLB.
Commit 8396c793ffdf ("mmc: dw_mmc: Fix IDMAC operation with pages
bigger than 4K") increased max_seg_size, even for 4K pages, causing
"swiotlb buffer is full" errors, because SWIOTLB can only handle
mappings up to 256kB.
Fix the issue by making sure the dw_mmc driver doesn't use segments
bigger than what SWIOTLB can handle.
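For reference, the 256kB ceiling comes from the SWIOTLB slot geometry;
a standalone sketch of the arithmetic (constants mirror the kernel's
IO_TLB_SHIFT and IO_TLB_SEGSIZE):
#include <stdio.h>
#define IO_TLB_SHIFT	11			/* 2 KiB bounce slots */
#define IO_TLB_SIZE	(1 << IO_TLB_SHIFT)
#define IO_TLB_SEGSIZE	128			/* max contiguous slots */
int main(void)
{
	/* dma_max_mapping_size() reports roughly this when SWIOTLB is
	 * in use, which is what the min_t() in the patch clamps to. */
	printf("max SWIOTLB mapping: %d bytes\n",
	       IO_TLB_SEGSIZE * IO_TLB_SIZE);	/* 262144 = 256 kB */
	return 0;
}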
Reported-by: Ron Economos <re@w6rz.net>
Reported-by: Jing Luo <jing@jing.rocks>
Fixes: 8396c793ffdf ("mmc: dw_mmc: Fix IDMAC operation with pages bigger than 4K")
Cc: stable@vger.kernel.org
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
drivers/mmc/host/dw_mmc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 41e451235f637..dc0d6201f7b73 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -2958,7 +2958,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
mmc->max_segs = host->ring_size;
mmc->max_blk_size = 65535;
mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
- mmc->max_seg_size = mmc->max_req_size;
+ mmc->max_seg_size =
+ min_t(size_t, mmc->max_req_size, dma_max_mapping_size(host->dev));
mmc->max_blk_count = mmc->max_req_size / 512;
} else if (host->use_dma == TRANS_MODE_EDMAC) {
mmc->max_segs = 64;
--
2.45.2
Fixes file corruption issues when reading contents via the ceph client.
Call netfs_reset_subreq_iter() to align subreq->io_iter before
calling netfs_clear_unread() to clear the tail, as the subreq->io_iter
count and subreq->transferred might not be aligned after an incomplete
I/O that has the subreq's NETFS_SREQ_CLEAR_TAIL flag set.
Based on ee4cdf7b ("netfs: Speed up buffered reading"), which
introduces a fix for the issue in mainline.
Fixes: 92b6cc5d ("netfs: Add iov_iters to (sub)requests to describe various buffers")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219237
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
Sending this patch in an attempt to backport the fix introduced by
commit ee4cdf7b ("netfs: Speed up buffered reading"), which cannot
be cherry-picked for older kernels, as that patch is not independent
of other commits and touches a lot of code unrelated to the fix.
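As a simplified model of what the realignment guards against
(hypothetical structure, not the real netfs types): after an incomplete
read, the iov_iter may hold more or fewer bytes than len - transferred,
so zeroing the "unread tail" without resetting it would clear the wrong
region:
#include <stddef.h>
struct subreq_model {
	size_t len;		/* total bytes covered by the subrequest */
	size_t transferred;	/* bytes actually returned by the server */
	size_t iter_count;	/* bytes remaining in io_iter */
};
static void reset_subreq_iter(struct subreq_model *s)
{
	size_t remaining = s->len - s->transferred;
	/* The real netfs_reset_subreq_iter() advances or reverts the
	 * iov_iter; modeled here as snapping the count so a following
	 * clear-to-zero touches exactly the unread tail. */
	if (s->iter_count != remaining)
		s->iter_count = remaining;
}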
fs/netfs/io.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/netfs/io.c b/fs/netfs/io.c
index d6ada4eba744..500119285346 100644
--- a/fs/netfs/io.c
+++ b/fs/netfs/io.c
@@ -528,6 +528,7 @@ void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
incomplete:
if (test_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags)) {
+ netfs_reset_subreq_iter(rreq, subreq);
netfs_clear_unread(subreq);
subreq->transferred = subreq->len;
goto complete;
--
2.39.5