This series fixes a memory corruption bug in KHO that occurs when KFENCE is enabled.
The root cause is that KHO metadata allocations, made via kzalloc(), can be randomly intercepted and serviced by kfence_alloc(). When a kernel boots via KHO, the early memblock allocator is restricted to a "scratch area". This forces the KFENCE pool to be allocated within this scratch area, creating a conflict. If KHO metadata is subsequently placed in this pool, it gets corrupted during the next kexec operation.
The series is structured in two parts: Patch 1/2 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG) that adds checks to detect and fail any operation that attempts to place KHO metadata or preserved memory within the scratch area. This serves as a validation and diagnostic tool to confirm the problem without affecting production builds.
Patch 2/2 provides the fix by modifying KHO to allocate its metadata directly from the buddy allocator instead of SLUB. This bypasses the KFENCE interception entirely.
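For illustration, a minimal sketch of the kind of change patch 2/2 makes (the actual diff is in patch 2/2 below; page-sized metadata is taken straight from the buddy allocator, which KFENCE never intercepts):

	/* before: slab allocation, may be intercepted by kfence_alloc() */
	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);

	/* after: a zeroed page straight from the buddy allocator */
	chunk = (void *)get_zeroed_page(GFP_KERNEL);
	...
	free_pages((unsigned long)chunk, 0);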
Pasha Tatashin (2):
  liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  liveupdate: kho: allocate metadata directly from the buddy allocator
 kernel/liveupdate/Kconfig                   | 15 ++++
 kernel/liveupdate/kexec_handover.c          | 51 ++++++++++++++++-----
 kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++
 kernel/liveupdate/kexec_handover_internal.h |  9 ++++
 4 files changed, 81 insertions(+), 12 deletions(-)
base-commit: 0b2f041c47acb45db82b4e847af6e17eb66cd32d
It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption.
Adds checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/liveupdate/Kconfig                   | 15 ++++++++++
 kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
 kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
 kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs.
+config KEXEC_HANDOVER_DEBUG + bool "Enable Kexec Handover debug checks" + depends on KEXEC_HANDOVER_DEBUGFS + help + This option enables extra sanity checks for the Kexec Handover + subsystem. + + These checks verify that neither preserved memory regions nor KHO's + internal metadata are allocated from within a KHO scratch area. + An overlap can lead to memory corruption during a subsequent kexec + operation. + + If an overlap is detected, the kernel will print a warning and the + offending operation will fail. This should only be enabled for + debugging purposes due to runtime overhead. endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5da21f1510cc..ef1e6f7a234b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) if (!elm) return ERR_PTR(-ENOMEM);
+ if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) { + kfree(elm); + return ERR_PTR(-EINVAL); + } + res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); if (xa_is_err(res)) res = ERR_PTR(xa_err(res)); @@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!chunk) - return NULL; + return ERR_PTR(-ENOMEM); + + if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) { + kfree(chunk); + return ERR_PTR(-EINVAL); + } + chunk->hdr.order = order; if (cur_chunk) KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk); @@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out) struct khoser_mem_chunk *chunk = NULL; struct kho_mem_phys *physxa; unsigned long order; + int ret = -ENOMEM;
xa_for_each(&kho_out->track.orders, order, physxa) { struct kho_mem_phys_bits *bits; unsigned long phys;
chunk = new_chunk(chunk, order); - if (!chunk) + if (IS_ERR(chunk)) { + ret = PTR_ERR(chunk); goto err_free; + }
if (!first_chunk) first_chunk = chunk; @@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) { chunk = new_chunk(chunk, order); - if (!chunk) + if (IS_ERR(chunk)) { + ret = PTR_ERR(chunk); goto err_free; + } }
elm = &chunk->bitmaps[chunk->hdr.num_elms]; @@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
err_free: kho_mem_ser_free(first_chunk); - return -ENOMEM; + return ret; }
static void __init deserialize_bitmap(unsigned int order, @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio) const unsigned int order = folio_order(folio); struct kho_mem_track *track = &kho_out.track;
+ if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order))) + return -EINVAL; + return __kho_preserve_order(track, pfn, order); } EXPORT_SYMBOL_GPL(kho_preserve_folio); @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages) unsigned long failed_pfn = 0; int err = 0;
+ if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT, + nr_pages << PAGE_SHIFT))) { + return -EINVAL; + } + while (pfn < end_pfn) { const unsigned int order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); diff --git a/kernel/liveupdate/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c index eb47f000887d..294d1d290142 100644 --- a/kernel/liveupdate/kexec_handover_debug.c +++ b/kernel/liveupdate/kexec_handover_debug.c @@ -214,3 +214,21 @@ __init int kho_debugfs_init(void) return -ENOENT; return 0; } + +#ifdef CONFIG_KEXEC_HANDOVER_DEBUG +bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{ + phys_addr_t scratch_start, scratch_end; + unsigned int i; + + for (i = 0; i < kho_scratch_cnt; i++) { + scratch_start = kho_scratch[i].addr; + scratch_end = kho_scratch[i].addr + kho_scratch[i].size - 1; + + if (phys <= scratch_end && (phys + size) > scratch_start) + return true; + } + + return false; +} +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */ diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h index b3fc1957affa..92798346fa5a 100644 --- a/kernel/liveupdate/kexec_handover_internal.h +++ b/kernel/liveupdate/kexec_handover_internal.h @@ -44,4 +44,13 @@ static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt) { } #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG +bool kho_scratch_overlap(phys_addr_t phys, size_t size); +#else +static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{ + return false; +} +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */ + #endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
Hi Pasha,
On Wed, Oct 15, 2025 at 01:31:20AM -0400, Pasha Tatashin wrote:
Subject: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
Hmm, can't say I like the "liveupdate: kho:" prefix. "KHO:" on its own is shorter and reflects that this is a KHO change rather than a liveupdate one.
It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption.
Adds checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/Kconfig | 15 ++++++++++
Feels like kernel/liveupdate/Makefile change is missing
kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++--- kernel/liveupdate/kexec_handover_debug.c | 18 ++++++++++++ kernel/liveupdate/kexec_handover_internal.h | 9 ++++++ 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs. +config KEXEC_HANDOVER_DEBUG
- bool "Enable Kexec Handover debug checks"
- depends on KEXEC_HANDOVER_DEBUGFS
- help
This option enables extra sanity checks for the Kexec Handover
subsystem.
These checks verify that neither preserved memory regions nor KHO's
internal metadata are allocated from within a KHO scratch area.
An overlap can lead to memory corruption during a subsequent kexec
operation.
If an overlap is detected, the kernel will print a warning and the
offending operation will fail. This should only be enabled for
debugging purposes due to runtime overhead.
endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5da21f1510cc..ef1e6f7a234b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) if (!elm) return ERR_PTR(-ENOMEM);
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
I think __free() cleanup would be better than this.
return ERR_PTR(-EINVAL);
- }
- res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); if (xa_is_err(res)) res = ERR_PTR(xa_err(res));
@@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk, chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!chunk)
return NULL;
return ERR_PTR(-ENOMEM);
I don't think it's important to return -errno here, it's not that it's called from a syscall and we need to set errno for the userspace. BTW, the same applies to xa_load_or_alloc() IMO.
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
kfree(chunk);
return ERR_PTR(-EINVAL);
- }
- chunk->hdr.order = order; if (cur_chunk) KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
@@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out) struct khoser_mem_chunk *chunk = NULL; struct kho_mem_phys *physxa; unsigned long order;
- int ret = -ENOMEM;
xa_for_each(&kho_out->track.orders, order, physxa) { struct kho_mem_phys_bits *bits; unsigned long phys; chunk = new_chunk(chunk, order);
if (!chunk)
if (IS_ERR(chunk)) {
ret = PTR_ERR(chunk);
... and indeed, -errno from new_chunk() just makes things more complex :(
goto err_free;
}
if (!first_chunk) first_chunk = chunk; @@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out) if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) { chunk = new_chunk(chunk, order);
if (!chunk)
if (IS_ERR(chunk)) {
ret = PTR_ERR(chunk); goto err_free;
} }
elm = &chunk->bitmaps[chunk->hdr.num_elms]; @@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out) err_free: kho_mem_ser_free(first_chunk);
- return -ENOMEM;
- return ret;
} static void __init deserialize_bitmap(unsigned int order, @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio) const unsigned int order = folio_order(folio); struct kho_mem_track *track = &kho_out.track;
- if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
return -EINVAL;
- return __kho_preserve_order(track, pfn, order);
} EXPORT_SYMBOL_GPL(kho_preserve_folio); @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages) unsigned long failed_pfn = 0; int err = 0;
- if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
nr_pages << PAGE_SHIFT))) {
return -EINVAL;
- }
Can't we check this in __kho_preserve_order() and not duplicate the code?
- while (pfn < end_pfn) { const unsigned int order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/Kconfig | 15 ++++++++++
Feels like kernel/liveupdate/Makefile change is missing
It's not, we already have KEXEC_HANDOVER_DEBUGFS that pulls in kexec_handover_debug.c
That debug file contains KHO debugfs and debug code. The debug code adds KEXEC_HANDOVER_DEBUGFS as a dependency, which I think is appropriate for a debug build.
However, I do not like ugly ifdefs in .c, so perhaps we should have two files: kexec_handover_debugfs.c for debugfs and kexec_handover_debug.c? What do you think?
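As a rough sketch (hypothetical, assuming the split above), the Makefile side could then simply be:

	obj-$(CONFIG_KEXEC_HANDOVER)         += kexec_handover.o
	obj-$(CONFIG_KEXEC_HANDOVER_DEBUGFS) += kexec_handover_debugfs.o
	obj-$(CONFIG_KEXEC_HANDOVER_DEBUG)   += kexec_handover_debug.o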
kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++--- kernel/liveupdate/kexec_handover_debug.c | 18 ++++++++++++ kernel/liveupdate/kexec_handover_internal.h | 9 ++++++ 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs.
+config KEXEC_HANDOVER_DEBUG
bool "Enable Kexec Handover debug checks"
depends on KEXEC_HANDOVER_DEBUGFS
help
This option enables extra sanity checks for the Kexec Handover
subsystem.
These checks verify that neither preserved memory regions nor KHO's
internal metadata are allocated from within a KHO scratch area.
An overlap can lead to memory corruption during a subsequent kexec
operation.
If an overlap is detected, the kernel will print a warning and the
offending operation will fail. This should only be enabled for
debugging purposes due to runtime overhead.
endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5da21f1510cc..ef1e6f7a234b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) if (!elm) return ERR_PTR(-ENOMEM);
if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
I think __free() cleanup would be better than this.
Sorry, not sure what you mean. kfree() is already in this function in case of failure.
return ERR_PTR(-EINVAL);
}
res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); if (xa_is_err(res)) res = ERR_PTR(xa_err(res));
@@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!chunk)
return NULL;
return ERR_PTR(-ENOMEM);
I don't think it's important to return -errno here, it's not that it's called from a syscall and we need to set errno for the userspace. BTW, the same applies to xa_load_or_alloc() IMO.
Hm, but they are very different errors: with -ENOMEM the KHO user can try again after more memory is available, but the new -EINVAL return from this function tells the caller that something is broken in the system, and using KHO is futile until this bug is fixed.
if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
kfree(chunk);
return ERR_PTR(-EINVAL);
}
chunk->hdr.order = order; if (cur_chunk) KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
@@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out) struct khoser_mem_chunk *chunk = NULL; struct kho_mem_phys *physxa; unsigned long order;
int ret = -ENOMEM; xa_for_each(&kho_out->track.orders, order, physxa) { struct kho_mem_phys_bits *bits; unsigned long phys; chunk = new_chunk(chunk, order);
if (!chunk)
if (IS_ERR(chunk)) {
ret = PTR_ERR(chunk);
... and indeed, -errno from new_chunk() just makes things more complex :(
goto err_free;
} if (!first_chunk) first_chunk = chunk;
@@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) { chunk = new_chunk(chunk, order);
if (!chunk)
if (IS_ERR(chunk)) {
ret = PTR_ERR(chunk); goto err_free;
} } elm = &chunk->bitmaps[chunk->hdr.num_elms];
@@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
err_free: kho_mem_ser_free(first_chunk);
return -ENOMEM;
return ret;
}
static void __init deserialize_bitmap(unsigned int order, @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio) const unsigned int order = folio_order(folio); struct kho_mem_track *track = &kho_out.track;
if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
return -EINVAL;
return __kho_preserve_order(track, pfn, order);
} EXPORT_SYMBOL_GPL(kho_preserve_folio); @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages) unsigned long failed_pfn = 0; int err = 0;
if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
nr_pages << PAGE_SHIFT))) {
return -EINVAL;
}
Can't we check this in __kho_preserve_order() and not duplicate the code?
Yes, that is possible, I will move it in the next version.
while (pfn < end_pfn) { const unsigned int order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
-- Sincerely yours, Mike.
On Wed, Oct 15, 2025 at 08:36:25AM -0400, Pasha Tatashin wrote:
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/Kconfig | 15 ++++++++++
Feels like kernel/liveupdate/Makefile change is missing
It's not, we already have KEXEC_HANDOVER_DEBUGFS that pulls in kexec_handover_debug.c
That debug file contains KHO debugfs and debug code. The debug code adds KEXEC_HANDOVER_DEBUGFS as a dependency, which I think is appropriate for a debug build.
However, I do not like ugly ifdefs in .c, so perhaps, we should have two files: kexec_handover_debugfs.c for debugfs and kexec_handover_debug.c ? What do you think?
kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++--- kernel/liveupdate/kexec_handover_debug.c | 18 ++++++++++++ kernel/liveupdate/kexec_handover_internal.h | 9 ++++++ 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs.
+config KEXEC_HANDOVER_DEBUG
bool "Enable Kexec Handover debug checks"
depends on KEXEC_HANDOVER_DEBUGFS
help
This option enables extra sanity checks for the Kexec Handover
subsystem.
These checks verify that neither preserved memory regions nor KHO's
internal metadata are allocated from within a KHO scratch area.
An overlap can lead to memory corruption during a subsequent kexec
operation.
If an overlap is detected, the kernel will print a warning and the
offending operation will fail. This should only be enabled for
debugging purposes due to runtime overhead.
endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5da21f1510cc..ef1e6f7a234b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) if (!elm) return ERR_PTR(-ENOMEM);
if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
I think __free() cleanup would be better than this.
Sorry, not sure what you mean. kfree() is already in this function in case of failure.
There's a __free(kfree) cleanup function defined in include/linux/cleanup.h that ensures that resources are not leaked on return from a function. With kfree we could do something like:
void *elm __free(kfree) = NULL;
if (error) return ERR_PTR(errno);
return no_free_ptr(elm);
There's no __free() definition for free_page() though :(
The second best IMHO is to use goto for error handling rather than free() inside if (error).
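To make that concrete, a rough sketch of how xa_load_or_alloc() could look with the cleanup attribute (sketch only, relying on the existing kfree cleanup definition):

	static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
	{
		void *res = xa_load(xa, index);
		void *elm __free(kfree) = NULL;

		if (res)
			return res;

		elm = kzalloc(sz, GFP_KERNEL);
		if (!elm)
			return ERR_PTR(-ENOMEM);

		if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz)))
			return ERR_PTR(-EINVAL);

		res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
		if (xa_is_err(res))
			return ERR_PTR(xa_err(res));
		if (res)
			return res;

		return no_free_ptr(elm);
	}

elm is freed automatically on every early return above; only the final no_free_ptr() hands ownership to the xarray.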
return ERR_PTR(-EINVAL);
}
res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); if (xa_is_err(res)) res = ERR_PTR(xa_err(res));
@@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!chunk)
return NULL;
return ERR_PTR(-ENOMEM);
I don't think it's important to return -errno here, it's not that it's called from a syscall and we need to set errno for the userspace. BTW, the same applies to xa_load_or_alloc() IMO.
HM, but they are very different errors: ENOMEM, the KHO user can try again after more memory is available, but the new -EINVAL return from this function tells the caller that there is something broken in the system, and using KHO is futile until this bug is fixed.
Do you really see the callers handling this differently? And we already have WARN_ON() because something is broken in the system.
if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
kfree(chunk);
return ERR_PTR(-EINVAL);
}
On Thu, Oct 16, 2025 at 1:23 PM Mike Rapoport rppt@kernel.org wrote:
On Wed, Oct 15, 2025 at 08:36:25AM -0400, Pasha Tatashin wrote:
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/Kconfig | 15 ++++++++++
Feels like kernel/liveupdate/Makefile change is missing
It's not, we already have KEXEC_HANDOVER_DEBUGFS that pulls in kexec_handover_debug.c
That debug file contains KHO debugfs and debug code. The debug code adds KEXEC_HANDOVER_DEBUGFS as a dependency, which I think is appropriate for a debug build.
However, I do not like ugly ifdefs in .c, so perhaps, we should have two files: kexec_handover_debugfs.c for debugfs and kexec_handover_debug.c ? What do you think?
kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++--- kernel/liveupdate/kexec_handover_debug.c | 18 ++++++++++++ kernel/liveupdate/kexec_handover_internal.h | 9 ++++++ 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs.
+config KEXEC_HANDOVER_DEBUG
bool "Enable Kexec Handover debug checks"
depends on KEXEC_HANDOVER_DEBUGFS
help
This option enables extra sanity checks for the Kexec Handover
subsystem.
These checks verify that neither preserved memory regions nor KHO's
internal metadata are allocated from within a KHO scratch area.
An overlap can lead to memory corruption during a subsequent kexec
operation.
If an overlap is detected, the kernel will print a warning and the
offending operation will fail. This should only be enabled for
debugging purposes due to runtime overhead.
endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5da21f1510cc..ef1e6f7a234b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) if (!elm) return ERR_PTR(-ENOMEM);
if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
I think __free() cleanup would be better than this.
Sorry, not sure what you mean. kfree() is already in this function in case of failure.
There's __free(kfree) cleanup function defined in include/linux/cleanup.h that ensures that on return from a function resources are not leaked. With kfree we could do something like
void *elm __free(kfree) = NULL;
if (error) return ERR_PTR(errno);
return no_free_ptr(elm);
There's no __free() definition for free_page() though :(
The second best IMHO is to use goto for error handling rather than free() inside if (error).
Makes sense, let me try to get it done. I actually like what cleanup.h provides.
return ERR_PTR(-EINVAL);
}
res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); if (xa_is_err(res)) res = ERR_PTR(xa_err(res));
@@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!chunk)
return NULL;
return ERR_PTR(-ENOMEM);
I don't think it's important to return -errno here, it's not that it's called from a syscall and we need to set errno for the userspace. BTW, the same applies to xa_load_or_alloc() IMO.
HM, but they are very different errors: ENOMEM, the KHO user can try again after more memory is available, but the new -EINVAL return from this function tells the caller that there is something broken in the system, and using KHO is futile until this bug is fixed.
Do you really see the callers handling this differently? And we already have WARN_ON() because something is broken in the system.
Maybe, also maybe for some self-tests. I think it is more consistent to return ERR_PTR() based on other code in this file (i.e. xa_load_or_alloc() already uses it). Let's keep it.
if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
kfree(chunk);
return ERR_PTR(-EINVAL);
}
-- Sincerely yours, Mike.
Can't we check this in __kho_preserve_order() and not duplicate the code?
Yes, that is possible, I will move it in the next version.
Actually, I decided against this. The check might be expensive depending on how sparse the scratch areas are. Why make it even more expensive for the kho_preserve_pages() case, which might call __kho_preserve_order() multiple times?
Pasha
On Wed, Oct 15 2025, Pasha Tatashin wrote:
It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption.
Adds checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/Kconfig | 15 ++++++++++ kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++--- kernel/liveupdate/kexec_handover_debug.c | 18 ++++++++++++ kernel/liveupdate/kexec_handover_internal.h | 9 ++++++ 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs. +config KEXEC_HANDOVER_DEBUG
- bool "Enable Kexec Handover debug checks"
- depends on KEXEC_HANDOVER_DEBUGFS
Why the dependency on debugfs? Why can't the debug checks be enabled independently?
- help
This option enables extra sanity checks for the Kexec Handover
subsystem.
These checks verify that neither preserved memory regions nor KHO's
internal metadata are allocated from within a KHO scratch area.
An overlap can lead to memory corruption during a subsequent kexec
operation.
I don't think the checks that are done should be listed here since as soon as another check is added this list will become out of date.
If an overlap is detected, the kernel will print a warning and the
offending operation will fail. This should only be enabled for
This also describes the behaviour of the checks, which might change later. Maybe for some checks the operation won't fail? I suppose just leave it at "the kernel will print a warning"?
debugging purposes due to runtime overhead.
endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5da21f1510cc..ef1e6f7a234b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) if (!elm) return ERR_PTR(-ENOMEM);
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
return ERR_PTR(-EINVAL);
- }
- res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); if (xa_is_err(res)) res = ERR_PTR(xa_err(res));
@@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk, chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!chunk)
return NULL;
return ERR_PTR(-ENOMEM);
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
kfree(chunk);
return ERR_PTR(-EINVAL);
- }
- chunk->hdr.order = order; if (cur_chunk) KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
@@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out) struct khoser_mem_chunk *chunk = NULL; struct kho_mem_phys *physxa; unsigned long order;
- int ret = -ENOMEM;
xa_for_each(&kho_out->track.orders, order, physxa) { struct kho_mem_phys_bits *bits; unsigned long phys; chunk = new_chunk(chunk, order);
if (!chunk)
if (IS_ERR(chunk)) {
ret = PTR_ERR(chunk); goto err_free;
}
if (!first_chunk) first_chunk = chunk; @@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out) if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) { chunk = new_chunk(chunk, order);
if (!chunk)
if (IS_ERR(chunk)) {
ret = PTR_ERR(chunk); goto err_free;
} }
elm = &chunk->bitmaps[chunk->hdr.num_elms]; @@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out) err_free: kho_mem_ser_free(first_chunk);
- return -ENOMEM;
- return ret;
} static void __init deserialize_bitmap(unsigned int order, @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio) const unsigned int order = folio_order(folio); struct kho_mem_track *track = &kho_out.track;
- if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
return -EINVAL;
- return __kho_preserve_order(track, pfn, order);
} EXPORT_SYMBOL_GPL(kho_preserve_folio); @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages) unsigned long failed_pfn = 0; int err = 0;
- if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
nr_pages << PAGE_SHIFT))) {
return -EINVAL;
- }
- while (pfn < end_pfn) { const unsigned int order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
diff --git a/kernel/liveupdate/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c index eb47f000887d..294d1d290142 100644 --- a/kernel/liveupdate/kexec_handover_debug.c +++ b/kernel/liveupdate/kexec_handover_debug.c @@ -214,3 +214,21 @@ __init int kho_debugfs_init(void) return -ENOENT; return 0; }
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG +bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{
- phys_addr_t scratch_start, scratch_end;
- unsigned int i;
- for (i = 0; i < kho_scratch_cnt; i++) {
scratch_start = kho_scratch[i].addr;
scratch_end = kho_scratch[i].addr + kho_scratch[i].size - 1;
Nit: wouldn't it be a tad bit simpler to do
scratch_end = kho_scratch[i].addr + kho_scratch[i].size;
if (phys <= scratch_end && (phys + size) > scratch_start)
and here
if (phys < scratch_end && (phys + size) > scratch_start)
At least I find it slightly easier to understand, though I don't think it makes too much of a difference so either way is fine.
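i.e., roughly (sketch only, same fields as in the patch):

	bool kho_scratch_overlap(phys_addr_t phys, size_t size)
	{
		phys_addr_t end = phys + size;
		unsigned int i;

		for (i = 0; i < kho_scratch_cnt; i++) {
			phys_addr_t scratch_start = kho_scratch[i].addr;
			phys_addr_t scratch_end = kho_scratch[i].addr + kho_scratch[i].size;

			/* half-open intervals [start, end) overlap iff both strict checks hold */
			if (phys < scratch_end && end > scratch_start)
				return true;
		}

		return false;
	}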
return true;
- }
- return false;
+} +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */ diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h index b3fc1957affa..92798346fa5a 100644 --- a/kernel/liveupdate/kexec_handover_internal.h +++ b/kernel/liveupdate/kexec_handover_internal.h @@ -44,4 +44,13 @@ static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt) { } #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */ +#ifdef CONFIG_KEXEC_HANDOVER_DEBUG +bool kho_scratch_overlap(phys_addr_t phys, size_t size); +#else +static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{
- return false;
+} +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
#endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
On Wed, Oct 15, 2025 at 8:10 AM Pratyush Yadav pratyush@kernel.org wrote:
On Wed, Oct 15 2025, Pasha Tatashin wrote:
It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption.
Adds checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/Kconfig | 15 ++++++++++ kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++--- kernel/liveupdate/kexec_handover_debug.c | 18 ++++++++++++ kernel/liveupdate/kexec_handover_internal.h | 9 ++++++ 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs.
+config KEXEC_HANDOVER_DEBUG
bool "Enable Kexec Handover debug checks"
depends on KEXEC_HANDOVER_DEBUGFS
Why the dependency on debugfs? Why can't the debug checks be enabled independently?
Because there is one kexec_handover_debug.c file, which I thought would make sense to use for both. But thinking about it now, perhaps we should split the code between KEXEC_HANDOVER_DEBUGFS and KEXEC_HANDOVER_DEBUG and add two files, kexec_handover_debugfs.c and kexec_handover_debug.c; this would avoid ifdefs in .c.
help
This option enables extra sanity checks for the Kexec Handover
subsystem.
These checks verify that neither preserved memory regions nor KHO's
internal metadata are allocated from within a KHO scratch area.
An overlap can lead to memory corruption during a subsequent kexec
operation.
I don't think the checks that are done should be listed here since as soon as another check is added this list will become out of date.
I thought it could be expanded when new features are added, but I can remove this description.
If an overlap is detected, the kernel will print a warning and the
offending operation will fail. This should only be enabled for
This also describes the behaviour of the checks, which might change later. Maybe for some checks the operation won't fail? I suppose just leave it at "the kernel will print a warning"?
If it changes, the Kconfig should be updated as well.
debugging purposes due to runtime overhead.
endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5da21f1510cc..ef1e6f7a234b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) if (!elm) return ERR_PTR(-ENOMEM);
if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
return ERR_PTR(-EINVAL);
}
res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL); if (xa_is_err(res)) res = ERR_PTR(xa_err(res));
@@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!chunk)
return NULL;
return ERR_PTR(-ENOMEM);
if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
kfree(chunk);
return ERR_PTR(-EINVAL);
}
chunk->hdr.order = order; if (cur_chunk) KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
@@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out) struct khoser_mem_chunk *chunk = NULL; struct kho_mem_phys *physxa; unsigned long order;
int ret = -ENOMEM; xa_for_each(&kho_out->track.orders, order, physxa) { struct kho_mem_phys_bits *bits; unsigned long phys; chunk = new_chunk(chunk, order);
if (!chunk)
if (IS_ERR(chunk)) {
ret = PTR_ERR(chunk); goto err_free;
} if (!first_chunk) first_chunk = chunk;
@@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) { chunk = new_chunk(chunk, order);
if (!chunk)
if (IS_ERR(chunk)) {
ret = PTR_ERR(chunk); goto err_free;
} } elm = &chunk->bitmaps[chunk->hdr.num_elms];
@@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
err_free: kho_mem_ser_free(first_chunk);
return -ENOMEM;
return ret;
}
static void __init deserialize_bitmap(unsigned int order, @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio) const unsigned int order = folio_order(folio); struct kho_mem_track *track = &kho_out.track;
if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
return -EINVAL;
return __kho_preserve_order(track, pfn, order);
} EXPORT_SYMBOL_GPL(kho_preserve_folio); @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages) unsigned long failed_pfn = 0; int err = 0;
if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
nr_pages << PAGE_SHIFT))) {
return -EINVAL;
}
while (pfn < end_pfn) { const unsigned int order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
diff --git a/kernel/liveupdate/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c index eb47f000887d..294d1d290142 100644 --- a/kernel/liveupdate/kexec_handover_debug.c +++ b/kernel/liveupdate/kexec_handover_debug.c @@ -214,3 +214,21 @@ __init int kho_debugfs_init(void) return -ENOENT; return 0; }
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG +bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{
phys_addr_t scratch_start, scratch_end;
unsigned int i;
for (i = 0; i < kho_scratch_cnt; i++) {
scratch_start = kho_scratch[i].addr;
scratch_end = kho_scratch[i].addr + kho_scratch[i].size - 1;
Nit: wouldn't it be a tad bit simpler to do
scratch_end = kho_scratch[i].addr + kho_scratch[i].size;
if (phys <= scratch_end && (phys + size) > scratch_start)
and here
if (phys < scratch_end && (phys + size) > scratch_start)
At least I find it slightly easier to understand, though I don't think it makes too much of a difference so either way is fine.
return true;
}
return false;
+} +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */ diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h index b3fc1957affa..92798346fa5a 100644 --- a/kernel/liveupdate/kexec_handover_internal.h +++ b/kernel/liveupdate/kexec_handover_internal.h @@ -44,4 +44,13 @@ static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt) { } #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG +bool kho_scratch_overlap(phys_addr_t phys, size_t size); +#else +static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{
return false;
+} +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
#endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
-- Regards, Pratyush Yadav
On Wed, Oct 15 2025, Pasha Tatashin wrote:
On Wed, Oct 15, 2025 at 8:10 AM Pratyush Yadav pratyush@kernel.org wrote:
On Wed, Oct 15 2025, Pasha Tatashin wrote:
It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption.
Adds checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/Kconfig | 15 ++++++++++ kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++--- kernel/liveupdate/kexec_handover_debug.c | 18 ++++++++++++ kernel/liveupdate/kexec_handover_internal.h | 9 ++++++ 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs.
+config KEXEC_HANDOVER_DEBUG
bool "Enable Kexec Handover debug checks"
depends on KEXEC_HANDOVER_DEBUGFS
Why the dependency on debugfs? Why can't the debug checks be enabled independently?
Because there is one kexec_handover_debug.c file, that I thought would make sense to use for both, but now thinking about this, perhaps we should split the code: KEXEC_HANDOVER_DEBUGFS and KEXEC_HANDOVER_DEBUG, and add two files: kexec_handover_debugfs.c and kexec_handover_debug.c, this would avoid ifdefs in .c.
Sounds good.
help
This option enables extra sanity checks for the Kexec Handover
subsystem.
These checks verify that neither preserved memory regions nor KHO's
internal metadata are allocated from within a KHO scratch area.
An overlap can lead to memory corruption during a subsequent kexec
operation.
I don't think the checks that are done should be listed here since as soon as another check is added this list will become out of date.
I thought it could be expanded when new features are added, but I can remove this description.
Yes, but it is easy to forget to do so.
If an overlap is detected, the kernel will print a warning and the
offending operation will fail. This should only be enabled for
This also describes the behaviour of the checks, which might change later. Maybe for some checks the operation won't fail? I suppose just leave it at "the kernel will print a warning"?
If it changes, the Kconfig should be updated as well.
debugging purposes due to runtime overhead.
endmenu
[...]
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwriteable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded.
To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of SLUB.
As part of this change, the metadata bitmap size is increased from 512 bytes to PAGE_SIZE to align with the page-based allocations from the buddy system.
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/liveupdate/kexec_handover.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index ef1e6f7a234b..519de6d68b27 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable); * Keep track of memory that is to be preserved across KHO. * * The serializing side uses two levels of xarrays to manage chunks of per-order - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most - * 512K of bitmap memory will be needed for order 0. + * PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order + * of a 8TB system would fit inside a single 4096 byte bitmap. For order 0 + * allocations each bitmap will cover 128M of address space. Thus, for 16G of + * memory at most 512K of bitmap memory will be needed for order 0. * * This approach is fully incremental, as the serialization progresses folios * can continue be aggregated to the tracker. The final step, immediately prior @@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable); * successor kernel to parse. */
-#define PRESERVE_BITS (512 * 8) +#define PRESERVE_BITS (PAGE_SIZE * 8)
struct kho_mem_phys_bits { DECLARE_BITMAP(preserve, PRESERVE_BITS); @@ -131,18 +131,21 @@ static struct kho_out kho_out = {
static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) { + unsigned int order; void *elm, *res;
elm = xa_load(xa, index); if (elm) return elm;
- elm = kzalloc(sz, GFP_KERNEL); + order = get_order(sz); + elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order); if (!elm) return ERR_PTR(-ENOMEM);
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) { - kfree(elm); + if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), + PAGE_SIZE << order))) { + free_pages((unsigned long)elm, order); return ERR_PTR(-EINVAL); }
@@ -151,7 +154,7 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) res = ERR_PTR(xa_err(res));
if (res) { - kfree(elm); + free_pages((unsigned long)elm, order); return res; }
@@ -357,7 +360,7 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk, { struct khoser_mem_chunk *chunk;
- chunk = kzalloc(PAGE_SIZE, GFP_KERNEL); + chunk = (void *)get_zeroed_page(GFP_KERNEL); if (!chunk) return ERR_PTR(-ENOMEM);
On Wed, Oct 15, 2025 at 01:31:21AM -0400, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwriteable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded.
To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of SLUB.
As part of this change, the metadata bitmap size is increased from 512 bytes to PAGE_SIZE to align with the page-based allocations from the buddy system.
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/kexec_handover.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index ef1e6f7a234b..519de6d68b27 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable);
- Keep track of memory that is to be preserved across KHO.
- The serializing side uses two levels of xarrays to manage chunks of per-order
- 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
- 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
- each bitmap will cover 16M of address space. Thus, for 16G of memory at most
- 512K of bitmap memory will be needed for order 0.
- PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order
- of a 8TB system would fit inside a single 4096 byte bitmap. For order 0
- allocations each bitmap will cover 128M of address space. Thus, for 16G of
- memory at most 512K of bitmap memory will be needed for order 0.
- This approach is fully incremental, as the serialization progresses folios
- can continue be aggregated to the tracker. The final step, immediately prior
@@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable);
- successor kernel to parse.
*/ -#define PRESERVE_BITS (512 * 8) +#define PRESERVE_BITS (PAGE_SIZE * 8) struct kho_mem_phys_bits { DECLARE_BITMAP(preserve, PRESERVE_BITS); @@ -131,18 +131,21 @@ static struct kho_out kho_out = { static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
The name 'xa_load_or_alloc' is confusing now that we only use this function to allocate bitmaps. I think it should be renamed to reflect that, and its return type should be 'struct kho_mem_phys_bits'. Then it wouldn't need the sz parameter and the size calculations below become redundant.
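Something along these lines, perhaps (the name is purely illustrative, and this assumes the PAGE_SIZE bitmaps introduced by this patch):

	static struct kho_mem_phys_bits *kho_get_bits(struct xarray *xa, unsigned long index)
	{
		struct kho_mem_phys_bits *bits, *res;

		bits = xa_load(xa, index);
		if (bits)
			return bits;

		/* sizeof(*bits) == PAGE_SIZE now, so a single zeroed page is enough */
		bits = (void *)get_zeroed_page(GFP_KERNEL);
		if (!bits)
			return ERR_PTR(-ENOMEM);

		if (WARN_ON(kho_scratch_overlap(virt_to_phys(bits), PAGE_SIZE))) {
			free_pages((unsigned long)bits, 0);
			return ERR_PTR(-EINVAL);
		}

		res = xa_cmpxchg(xa, index, NULL, bits, GFP_KERNEL);
		if (xa_is_err(res)) {
			free_pages((unsigned long)bits, 0);
			return ERR_PTR(xa_err(res));
		}
		if (res) {
			free_pages((unsigned long)bits, 0);
			return res;
		}

		return bits;
	}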
{
- unsigned int order; void *elm, *res;
elm = xa_load(xa, index); if (elm) return elm;
- elm = kzalloc(sz, GFP_KERNEL);
- order = get_order(sz);
- elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order); if (!elm) return ERR_PTR(-ENOMEM);
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm),
PAGE_SIZE << order))) {
free_pages((unsigned long)elm, order); return ERR_PTR(-EINVAL); }
On Wed, Oct 15, 2025 at 4:37 AM Mike Rapoport rppt@kernel.org wrote:
On Wed, Oct 15, 2025 at 01:31:21AM -0400, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwriteable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded.
To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of SLUB.
As part of this change, the metadata bitmap size is increased from 512 bytes to PAGE_SIZE to align with the page-based allocations from the buddy system.
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/kexec_handover.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index ef1e6f7a234b..519de6d68b27 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable);
- Keep track of memory that is to be preserved across KHO.
- The serializing side uses two levels of xarrays to manage chunks of per-order
- 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
- 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
- each bitmap will cover 16M of address space. Thus, for 16G of memory at most
- 512K of bitmap memory will be needed for order 0.
- PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order
- of a 8TB system would fit inside a single 4096 byte bitmap. For order 0
- allocations each bitmap will cover 128M of address space. Thus, for 16G of
- memory at most 512K of bitmap memory will be needed for order 0.
- This approach is fully incremental, as the serialization progresses folios
- can continue be aggregated to the tracker. The final step, immediately prior
@@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable);
- successor kernel to parse.
*/
-#define PRESERVE_BITS (512 * 8) +#define PRESERVE_BITS (PAGE_SIZE * 8)
struct kho_mem_phys_bits { DECLARE_BITMAP(preserve, PRESERVE_BITS); @@ -131,18 +131,21 @@ static struct kho_out kho_out = {
static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
The name 'xa_load_or_alloc' is confusing now that we only use this function
Indeed, but this is not something that is done by this patch
to allocate bitmaps. I think it should be renamed to reflect that, and its return type should be 'struct kho_mem_phys_bits'. Then it wouldn't need the sz parameter and the size calculations below become redundant.
I am thinking of splitting the increase of the bitmap size to PAGE_SIZE out of this patch, then renaming this function and removing the size_t argument in another patch, and finally having the fix patch that replaces SLUB with buddy.
{
unsigned int order; void *elm, *res; elm = xa_load(xa, index); if (elm) return elm;
elm = kzalloc(sz, GFP_KERNEL);
order = get_order(sz);
elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order); if (!elm) return ERR_PTR(-ENOMEM);
if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm),
PAGE_SIZE << order))) {
free_pages((unsigned long)elm, order); return ERR_PTR(-EINVAL); }
-- Sincerely yours, Mike.
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
	kfence_alloc_pool_and_metadata();
	report_meminit();
	kmsan_init_shadow();
	stack_depot_early_init();
	[...]
	kho_memory_init();
memblock_free_all();
Can't kfence allocate its pool right after memblock_free_all()? IIUC at that point, there shouldn't be much fragmentation and allocations should still be possible.
Another idea could be to disable scratch-only a bit earlier and add an option in memblock_alloc() to avoid scratch areas?
Anyway, not something we need to solve right now with this series. Something to figure out eventually.
When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwriteable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded.
To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of SLUB.
As part of this change, the metadata bitmap size is increased from 512 bytes to PAGE_SIZE to align with the page-based allocations from the buddy system.
The implication of this change is that preservation metadata becomes less memory-efficient when preserved pages are sparse, mainly because if only one bit is set in a bitmap, 4 KiB of memory is now used instead of 512 bytes.
It is hard to say what difference this makes in practice without sampling real workloads, but perhaps still worth mentioning in the commit message?
Other than this,
Reviewed-by: Pratyush Yadav pratyush@kernel.org
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin pasha.tatashin@soleen.com
kernel/liveupdate/kexec_handover.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index ef1e6f7a234b..519de6d68b27 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable);
- Keep track of memory that is to be preserved across KHO.
- The serializing side uses two levels of xarrays to manage chunks of per-order
- 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
- 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
- each bitmap will cover 16M of address space. Thus, for 16G of memory at most
- 512K of bitmap memory will be needed for order 0.
- PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order
- of a 8TB system would fit inside a single 4096 byte bitmap. For order 0
- allocations each bitmap will cover 128M of address space. Thus, for 16G of
- memory at most 512K of bitmap memory will be needed for order 0.
- This approach is fully incremental, as the serialization progresses folios
- can continue be aggregated to the tracker. The final step, immediately prior
@@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable);
- successor kernel to parse.
*/ -#define PRESERVE_BITS (512 * 8) +#define PRESERVE_BITS (PAGE_SIZE * 8) struct kho_mem_phys_bits { DECLARE_BITMAP(preserve, PRESERVE_BITS); @@ -131,18 +131,21 @@ static struct kho_out kho_out = { static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) {
- unsigned int order; void *elm, *res;
elm = xa_load(xa, index); if (elm) return elm;
- elm = kzalloc(sz, GFP_KERNEL);
- order = get_order(sz);
- elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order); if (!elm) return ERR_PTR(-ENOMEM);
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
kfree(elm);
- if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm),
PAGE_SIZE << order))) {
free_pages((unsigned long)elm, order); return ERR_PTR(-EINVAL); }
@@ -151,7 +154,7 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) res = ERR_PTR(xa_err(res)); if (res) {
kfree(elm);
free_pages((unsigned long)elm, order); return res; }
@@ -357,7 +360,7 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk, { struct khoser_mem_chunk *chunk;
- chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
- chunk = (void *)get_zeroed_page(GFP_KERNEL); if (!chunk) return ERR_PTR(-ENOMEM);
On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav pratyush@kernel.org wrote:
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
If we do that, most likely we will add a GFP flag that goes with it, so that slab can use a special pool of pages that are preservable; otherwise, we would be leaking memory from the old kernel in the unpreserved parts of the pages. With such a flag in place, kfence can simply ignore allocations with that new preservable GFP flag.
Pasha
On Wed, Oct 15, 2025 at 4:19 PM Pasha Tatashin pasha.tatashin@soleen.com wrote:
On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav pratyush@kernel.org wrote:
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
If we do that, most likely we will add a GFP flag that goes with it, so the slab can use a special pool of pages that are preservable. Otherwise, we are going to be leaking memory from the old kernel in the unpreserved parts of the pages. If we do that, kfence can ignore allocations with that new preservable GFP flag.
I think this is the best way forward. Changing KFENCE to allocate the pool from buddy will make the allocation size less flexible (goodbye, CONFIG_KFENCE_NUM_OBJECTS).
As part of this change, the metadata bitmap size is increased from 512 bytes to PAGE_SIZE to align with the page-based allocations from the buddy system.
The implication of this change is that preservation metadata becomes less memory-efficient when preserved pages are sparse. Mainly because if only one bit is set in the bitmap, now 4k bytes of memory is used instead of 512 bytes.
It is hard to say what difference this makes in practice without sampling real workloads, but perhaps still worth mentioning in the commit message?
Forgot to reply to the other part:
I agree; however, I suspect the implication is going to be minimal: it is strange to preserve fragmented state and expect a fast reboot. Most likely, we are going to be optimizing the preservation pools, such as using 1G order pages for guest memory.
Also, we are moving toward preserving 4K bitmaps as part of the Stateless KHO patch series, so I think we will make this change anyway, as part of this fix or as part of transitioning to radix-tree stateless KHO.
Reviewed-by: Pratyush Yadav pratyush@kernel.org
Thank you. Pasha