This series fixes a memory corruption bug in KHO that occurs when KFENCE is enabled.
The root cause is that KHO metadata, allocated via kzalloc(), can be randomly serviced by kfence_alloc(). When a kernel boots via KHO, the early memblock allocator is restricted to a "scratch area". This forces the KFENCE pool to be allocated within this scratch area, creating a conflict. If KHO metadata is subsequently placed in this pool, it gets corrupted during the next kexec operation.
The series is structured in two parts: Patch 1/2 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG) that adds checks to detect and fail any operation that attempts to place KHO metadata or preserved memory within the scratch area. This serves as a validation and diagnostic tool to confirm the problem without affecting production builds.
Patch 2/2 provides the fix by modifying KHO to allocate its metadata directly from the buddy allocator instead of SLUB. This bypasses the KFENCE interception entirely.
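At its core, the fix swaps the metadata allocation calls, roughly like this (simplified sketch; the exact hunks are in patch 2/2):

        /* Before: SLUB allocation, which KFENCE may intercept. */
        chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);

        /* After: take a zeroed page straight from the buddy allocator,
         * which KFENCE never services, so it cannot land in its pool.
         */
        chunk = (void *)get_zeroed_page(GFP_KERNEL);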
Pasha Tatashin (2):
  liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  liveupdate: kho: allocate metadata directly from the buddy allocator

 kernel/liveupdate/Kconfig                   | 15 ++++++
 kernel/liveupdate/kexec_handover.c          | 51 ++++++++++++++++-----
 kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++
 kernel/liveupdate/kexec_handover_internal.h |  9 ++++
 4 files changed, 81 insertions(+), 12 deletions(-)
base-commit: 0b2f041c47acb45db82b4e847af6e17eb66cd32d
It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and is used early in boot by the next kernel. This can lead to memory corruption.

Add checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation fails and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/liveupdate/Kconfig                   | 15 ++++++++++
 kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
 kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
 kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
 4 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index 522b9f74d605..d119f4f3f4b1 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
 	  Also, enables inspecting the KHO fdt trees with the debugfs binary
 	  blobs.
 
+config KEXEC_HANDOVER_DEBUG
+	bool "Enable Kexec Handover debug checks"
+	depends on KEXEC_HANDOVER_DEBUGFS
+	help
+	  This option enables extra sanity checks for the Kexec Handover
+	  subsystem.
+
+	  These checks verify that neither preserved memory regions nor KHO's
+	  internal metadata are allocated from within a KHO scratch area.
+	  An overlap can lead to memory corruption during a subsequent kexec
+	  operation.
+
+	  If an overlap is detected, the kernel will print a warning and the
+	  offending operation will fail. This should only be enabled for
+	  debugging purposes due to runtime overhead.
 endmenu
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 5da21f1510cc..ef1e6f7a234b 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 	if (!elm)
 		return ERR_PTR(-ENOMEM);
 
+	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
+		kfree(elm);
+		return ERR_PTR(-EINVAL);
+	}
+
 	res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
 	if (xa_is_err(res))
 		res = ERR_PTR(xa_err(res));
@@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
 	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!chunk)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
+
+	if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
+		kfree(chunk);
+		return ERR_PTR(-EINVAL);
+	}
+
 	chunk->hdr.order = order;
 	if (cur_chunk)
 		KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
@@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out)
 	struct khoser_mem_chunk *chunk = NULL;
 	struct kho_mem_phys *physxa;
 	unsigned long order;
+	int ret = -ENOMEM;
 
 	xa_for_each(&kho_out->track.orders, order, physxa) {
 		struct kho_mem_phys_bits *bits;
 		unsigned long phys;
 
 		chunk = new_chunk(chunk, order);
-		if (!chunk)
+		if (IS_ERR(chunk)) {
+			ret = PTR_ERR(chunk);
 			goto err_free;
+		}
 
 		if (!first_chunk)
 			first_chunk = chunk;
@@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
 			if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) {
 				chunk = new_chunk(chunk, order);
-				if (!chunk)
+				if (IS_ERR(chunk)) {
+					ret = PTR_ERR(chunk);
 					goto err_free;
+				}
 			}
 
 			elm = &chunk->bitmaps[chunk->hdr.num_elms];
@@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
 err_free:
 	kho_mem_ser_free(first_chunk);
-	return -ENOMEM;
+	return ret;
 }
 
 static void __init deserialize_bitmap(unsigned int order,
@@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio)
 	const unsigned int order = folio_order(folio);
 	struct kho_mem_track *track = &kho_out.track;
 
+	if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
+		return -EINVAL;
+
 	return __kho_preserve_order(track, pfn, order);
 }
 EXPORT_SYMBOL_GPL(kho_preserve_folio);
@@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages)
 	unsigned long failed_pfn = 0;
 	int err = 0;
 
+	if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
+					nr_pages << PAGE_SHIFT))) {
+		return -EINVAL;
+	}
+
 	while (pfn < end_pfn) {
 		const unsigned int order =
 			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
diff --git a/kernel/liveupdate/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c
index eb47f000887d..294d1d290142 100644
--- a/kernel/liveupdate/kexec_handover_debug.c
+++ b/kernel/liveupdate/kexec_handover_debug.c
@@ -214,3 +214,21 @@ __init int kho_debugfs_init(void)
 		return -ENOENT;
 	return 0;
 }
+
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+bool kho_scratch_overlap(phys_addr_t phys, size_t size)
+{
+	phys_addr_t scratch_start, scratch_end;
+	unsigned int i;
+
+	for (i = 0; i < kho_scratch_cnt; i++) {
+		scratch_start = kho_scratch[i].addr;
+		scratch_end = kho_scratch[i].addr + kho_scratch[i].size - 1;
+
+		if (phys <= scratch_end && (phys + size) > scratch_start)
+			return true;
+	}
+
+	return false;
+}
+#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h
index b3fc1957affa..92798346fa5a 100644
--- a/kernel/liveupdate/kexec_handover_internal.h
+++ b/kernel/liveupdate/kexec_handover_internal.h
@@ -44,4 +44,13 @@ static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt) { }
 #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */
 
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+bool kho_scratch_overlap(phys_addr_t phys, size_t size);
+#else
+static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size)
+{
+	return false;
+}
+#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
+
 #endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
Hi Pasha,
On Wed, Oct 15, 2025 at 01:31:20AM -0400, Pasha Tatashin wrote:
> Subject: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
Hmm, can't say I like the "liveupdate: kho:" prefix. "KHO:" on its own is shorter and reflects that this is a KHO change rather than a liveupdate one.
> [...]
>
>  kernel/liveupdate/Kconfig                   | 15 ++++++++++
Feels like kernel/liveupdate/Makefile change is missing
> [...]
>
> +	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> +		kfree(elm);
I think __free() cleanup would be better than this.
> +		return ERR_PTR(-EINVAL);
> +	}
> [...]
> 	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
> 	if (!chunk)
> -		return NULL;
> +		return ERR_PTR(-ENOMEM);
I don't think it's important to return -errno here, it's not that it's called from a syscall and we need to set errno for the userspace. BTW, the same applies to xa_load_or_alloc() IMO.
> [...]
>
> 		chunk = new_chunk(chunk, order);
> -		if (!chunk)
> +		if (IS_ERR(chunk)) {
> +			ret = PTR_ERR(chunk);
... and indeed, -errno from new_chunk() just makes things more complex :(
> 			goto err_free;
> +		}
> [...]
> +	if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
> +					nr_pages << PAGE_SHIFT))) {
> +		return -EINVAL;
> +	}
Can't we check this in __kho_preserve_order() and not duplicate the code?
> [...]
> >  kernel/liveupdate/Kconfig                   | 15 ++++++++++
>
> Feels like kernel/liveupdate/Makefile change is missing
It's not, we already have KEXEC_HANDOVER_DEBUGFS that pulls in kexec_handover_debug.c
That debug file contains KHO debugfs and debug code. The debug code adds KEXEC_HANDOVER_DEBUGFS as a dependency, which I think is appropriate for a debug build.
However, I do not like ugly ifdefs in .c, so perhaps we should have two files: kexec_handover_debugfs.c for debugfs and kexec_handover_debug.c? What do you think?
kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++--- kernel/liveupdate/kexec_handover_debug.c | 18 ++++++++++++ kernel/liveupdate/kexec_handover_internal.h | 9 ++++++ 4 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index 522b9f74d605..d119f4f3f4b1 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs.
+config KEXEC_HANDOVER_DEBUG
bool "Enable Kexec Handover debug checks"depends on KEXEC_HANDOVER_DEBUGFShelpThis option enables extra sanity checks for the Kexec Handoversubsystem.These checks verify that neither preserved memory regions nor KHO'sinternal metadata are allocated from within a KHO scratch area.An overlap can lead to memory corruption during a subsequent kexecoperation.If an overlap is detected, the kernel will print a warning and theoffending operation will fail. This should only be enabled fordebugging purposes due to runtime overhead.endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5da21f1510cc..ef1e6f7a234b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz) if (!elm) return ERR_PTR(-ENOMEM);
if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {kfree(elm);I think __free() cleanup would be better than this.
Sorry, not sure what you mean. kfree() is already in this function in case of failure.
> > -		return NULL;
> > +		return ERR_PTR(-ENOMEM);
>
> I don't think it's important to return -errno here, it's not that it's
> called from a syscall and we need to set errno for the userspace. BTW,
> the same applies to xa_load_or_alloc() IMO.
Hm, but they are very different errors: with -ENOMEM the KHO user can try again after more memory is available, but the new -EINVAL return from this function tells the caller that something is broken in the system, and using KHO is futile until this bug is fixed.
> > -		if (!chunk)
> > +		if (IS_ERR(chunk)) {
> > +			ret = PTR_ERR(chunk);
>
> ... and indeed, -errno from new_chunk() just makes things more complex :(
> [...]
> > +	if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
> > +					nr_pages << PAGE_SHIFT))) {
> > +		return -EINVAL;
> > +	}
>
> Can't we check this in __kho_preserve_order() and not duplicate the code?
Yes, that is possible, I will move it in the next version.
> [...]
>
> --
> Sincerely yours,
> Mike.
On Wed, Oct 15, 2025 at 08:36:25AM -0400, Pasha Tatashin wrote:
> [...]
> > I think __free() cleanup would be better than this.
>
> Sorry, not sure what you mean. kfree() is already in this function in
> case of failure.
There's a __free(kfree) cleanup helper defined in include/linux/cleanup.h that ensures that on return from a function resources are not leaked. With kfree we could do something like:
	void *elm __free(kfree) = NULL;

	if (error)
		return ERR_PTR(errno);

	return no_free_ptr(elm);
There's no __free() definition for free_page() though :(
The second best IMHO is to use goto for error handling rather than free() inside if (error).
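For illustration, a rough sketch of how xa_load_or_alloc() could look with the cleanup.h helper (this paraphrases the existing function; it is not code from the series):

	static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
	{
		void *elm __free(kfree) = NULL;
		void *res;

		res = xa_load(xa, index);
		if (res)
			return res;

		elm = kzalloc(sz, GFP_KERNEL);
		if (!elm)
			return ERR_PTR(-ENOMEM);

		/* elm is kfree()d automatically on every early return below. */
		if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz)))
			return ERR_PTR(-EINVAL);

		res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
		if (xa_is_err(res))
			return ERR_PTR(xa_err(res));
		if (res)
			return res;	/* lost the race; use the existing entry */

		/* Ownership moved into the xarray; suppress the cleanup. */
		return no_free_ptr(elm);
	}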
> [...]
>
> Hm, but they are very different errors: with -ENOMEM the KHO user can
> try again after more memory is available, but the new -EINVAL return
> from this function tells the caller that something is broken in the
> system, and using KHO is futile until this bug is fixed.
Do you really see the callers handling this differently? And we already have WARN_ON() because something is broken in the system.
> [...]
On Thu, Oct 16, 2025 at 1:23 PM Mike Rapoport rppt@kernel.org wrote:
> [...]
>
> There's a __free(kfree) cleanup helper defined in include/linux/cleanup.h
> that ensures that on return from a function resources are not leaked.
> [...]
> The second best IMHO is to use goto for error handling rather than
> free() inside if (error).
Makes sense, let me try to get it done. I actually like what cleanup.h provides.
> [...]
>
> Do you really see the callers handling this differently? And we already
> have WARN_ON() because something is broken in the system.
Maybe, also maybe for some self-tests. I think it is more consistent to return PTR_ERR() based on other code in this file (i.e. xa_load_or_alloc() already uses it). Let's keep it.
> [...]
>
> --
> Sincerely yours,
> Mike.
> > Can't we check this in __kho_preserve_order() and not duplicate the code?
>
> Yes, that is possible, I will move it in the next version.
Actually, I decided against this. The check might be expensive depending on how sparse the scratch area is. Why make it even more expensive for the kho_preserve_pages() case, which might call __kho_preserve_order() multiple times?
Pasha
On Wed, Oct 15 2025, Pasha Tatashin wrote:
> [...]
>
> +config KEXEC_HANDOVER_DEBUG
> +	bool "Enable Kexec Handover debug checks"
> +	depends on KEXEC_HANDOVER_DEBUGFS
Why the dependency on debugfs? Why can't the debug checks be enabled independently?
> +	help
> +	  This option enables extra sanity checks for the Kexec Handover
> +	  subsystem.
> +
> +	  These checks verify that neither preserved memory regions nor KHO's
> +	  internal metadata are allocated from within a KHO scratch area.
> +	  An overlap can lead to memory corruption during a subsequent kexec
> +	  operation.
I don't think the checks that are done should be listed here since as soon as another check is added this list will become out of date.
> +	  If an overlap is detected, the kernel will print a warning and the
> +	  offending operation will fail. This should only be enabled for
This also describes the behaviour of the checks, which might change later. Maybe for some checks the operation won't fail? I suppose just leave it at "the kernel will print a warning"?
> +	  debugging purposes due to runtime overhead.
> [...]
> +	for (i = 0; i < kho_scratch_cnt; i++) {
> +		scratch_start = kho_scratch[i].addr;
> +		scratch_end = kho_scratch[i].addr + kho_scratch[i].size - 1;
Nit: wouldn't it be a tad bit simpler to do
	scratch_end = kho_scratch[i].addr + kho_scratch[i].size;
> +		if (phys <= scratch_end && (phys + size) > scratch_start)
and here
	if (phys < scratch_end && (phys + size) > scratch_start)
At least I find it slightly easier to understand, though I don't think it makes too much of a difference so either way is fine.
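Spelled out, the suggestion treats each scratch region as a half-open interval [start, end); an illustrative sketch of the whole helper with that convention (not the posted code):

	bool kho_scratch_overlap(phys_addr_t phys, size_t size)
	{
		unsigned int i;

		for (i = 0; i < kho_scratch_cnt; i++) {
			phys_addr_t scratch_start = kho_scratch[i].addr;
			phys_addr_t scratch_end = kho_scratch[i].addr +
						  kho_scratch[i].size;

			/* Two half-open ranges overlap iff each one starts
			 * before the other one ends.
			 */
			if (phys < scratch_end && phys + size > scratch_start)
				return true;
		}

		return false;
	}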
> [...]
On Wed, Oct 15, 2025 at 8:10 AM Pratyush Yadav pratyush@kernel.org wrote:
> On Wed, Oct 15 2025, Pasha Tatashin wrote:
> [...]
> > +	depends on KEXEC_HANDOVER_DEBUGFS
>
> Why the dependency on debugfs? Why can't the debug checks be enabled
> independently?
Because there is one kexec_handover_debug.c file, which I thought would make sense to use for both, but now thinking about this, perhaps we should split the code into KEXEC_HANDOVER_DEBUGFS and KEXEC_HANDOVER_DEBUG and add two files, kexec_handover_debugfs.c and kexec_handover_debug.c; this would avoid ifdefs in .c.
> [...]
>
> I don't think the checks that are done should be listed here since as
> soon as another check is added this list will become out of date.
I thought it could be expanded when new features are added, but I can remove this description.
> [...]
>
> This also describes the behaviour of the checks, which might change
> later. Maybe for some checks the operation won't fail? I suppose just
> leave it at "the kernel will print a warning"?
If it changes, the Kconfig should be updated as well.
> [...]
>
> --
> Regards,
> Pratyush Yadav
On Wed, Oct 15 2025, Pasha Tatashin wrote:
> On Wed, Oct 15, 2025 at 8:10 AM Pratyush Yadav pratyush@kernel.org wrote:
> [...]
> > Why the dependency on debugfs? Why can't the debug checks be enabled
> > independently?
>
> Because there is one kexec_handover_debug.c file, which I thought would
> make sense to use for both, but now thinking about this, perhaps we
> should split the code into KEXEC_HANDOVER_DEBUGFS and KEXEC_HANDOVER_DEBUG
> and add two files, kexec_handover_debugfs.c and kexec_handover_debug.c;
> this would avoid ifdefs in .c.
Sounds good.
> > I don't think the checks that are done should be listed here since as
> > soon as another check is added this list will become out of date.
>
> I thought it could be expanded when new features are added, but I can
> remove this description.
Yes, but it is easy to forget to do so.
> > This also describes the behaviour of the checks, which might change
> > later. Maybe for some checks the operation won't fail? I suppose just
> > leave it at "the kernel will print a warning"?
>
> If it changes, the Kconfig should be updated as well.
[...]
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwritable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded.
To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of SLUB.
As part of this change, the metadata bitmap size is increased from 512 bytes to PAGE_SIZE to align with the page-based allocations from the buddy system.
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/liveupdate/kexec_handover.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index ef1e6f7a234b..519de6d68b27 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable);
  * Keep track of memory that is to be preserved across KHO.
  *
  * The serializing side uses two levels of xarrays to manage chunks of per-order
- * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
- * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
- * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
- * 512K of bitmap memory will be needed for order 0.
+ * PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order
+ * of a 8TB system would fit inside a single 4096 byte bitmap. For order 0
+ * allocations each bitmap will cover 128M of address space. Thus, for 16G of
+ * memory at most 512K of bitmap memory will be needed for order 0.
  *
  * This approach is fully incremental, as the serialization progresses folios
  * can continue be aggregated to the tracker. The final step, immediately prior
@@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable);
  * successor kernel to parse.
  */
 
-#define PRESERVE_BITS (512 * 8)
+#define PRESERVE_BITS (PAGE_SIZE * 8)
 
 struct kho_mem_phys_bits {
 	DECLARE_BITMAP(preserve, PRESERVE_BITS);
@@ -131,18 +131,21 @@ static struct kho_out kho_out = {
 
 static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 {
+	unsigned int order;
 	void *elm, *res;
 
 	elm = xa_load(xa, index);
 	if (elm)
 		return elm;
 
-	elm = kzalloc(sz, GFP_KERNEL);
+	order = get_order(sz);
+	elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!elm)
 		return ERR_PTR(-ENOMEM);
 
-	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
-		kfree(elm);
+	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm),
+					PAGE_SIZE << order))) {
+		free_pages((unsigned long)elm, order);
 		return ERR_PTR(-EINVAL);
 	}
 
@@ -151,7 +154,7 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 		res = ERR_PTR(xa_err(res));
 
 	if (res) {
-		kfree(elm);
+		free_pages((unsigned long)elm, order);
 		return res;
 	}
 
@@ -357,7 +360,7 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
 {
 	struct khoser_mem_chunk *chunk;
 
-	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	chunk = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!chunk)
 		return ERR_PTR(-ENOMEM);
On Wed, Oct 15, 2025 at 01:31:21AM -0400, Pasha Tatashin wrote:
> [...]
>
>  static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
The name 'xa_load_or_alloc' is confusing now that we only use this function to allocate bitmaps. I think it should be renamed to reflect that, and its return type should be 'struct kho_mem_phys_bits'. Then it wouldn't need the sz parameter and the size calculations below become redundant.
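For illustration, the kind of helper this suggests might look roughly like the sketch below (the name and exact shape are illustrative, not from the posted series):

	static struct kho_mem_phys_bits *kho_alloc_bitmap(struct xarray *xa,
							  unsigned long index)
	{
		struct kho_mem_phys_bits *bits, *res;

		bits = xa_load(xa, index);
		if (bits)
			return bits;

		/* One bitmap is exactly one page once PRESERVE_BITS is
		 * PAGE_SIZE * 8, so no size argument is needed.
		 */
		bits = (void *)get_zeroed_page(GFP_KERNEL);
		if (!bits)
			return ERR_PTR(-ENOMEM);

		if (WARN_ON(kho_scratch_overlap(virt_to_phys(bits), PAGE_SIZE))) {
			free_page((unsigned long)bits);
			return ERR_PTR(-EINVAL);
		}

		res = xa_cmpxchg(xa, index, NULL, bits, GFP_KERNEL);
		if (xa_is_err(res)) {
			free_page((unsigned long)bits);
			return ERR_PTR(xa_err(res));
		}
		if (res) {
			/* Lost the race; use the bitmap that is already there. */
			free_page((unsigned long)bits);
			return res;
		}

		return bits;
	}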
> [...]
On Wed, Oct 15, 2025 at 4:37 AM Mike Rapoport rppt@kernel.org wrote:
> [...]
>
> The name 'xa_load_or_alloc' is confusing now that we only use this function
Indeed, but this is not something that is done by this patch.
> to allocate bitmaps. I think it should be renamed to reflect that, and
> its return type should be 'struct kho_mem_phys_bits'. Then it wouldn't
> need the sz parameter and the size calculations below become redundant.
I am thinking of splitting the bitmap-size increase to PAGE_SIZE out of this patch, then renaming this function and removing the size_t argument in another patch, and finally having the fix patch that replaces SLUB with buddy.
> [...]
>
> --
> Sincerely yours,
> Mike.
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
> [...]
>
> A problem arises when KFENCE is enabled. kzalloc() calls can be randomly
> intercepted by kfence_alloc(), which services the allocation from a
> dedicated KFENCE memory pool. This pool is allocated early in boot via
> memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
	kfence_alloc_pool_and_metadata();
	report_meminit();
	kmsan_init_shadow();
	stack_depot_early_init();
	[...]
	kho_memory_init();

	memblock_free_all();
Can't kfence allocate its pool right after memblock_free_all()? IIUC at that point, there shouldn't be much fragmentation and allocations should still be possible.
Another idea could be to disable scratch-only a bit earlier and add an option in memblock_alloc() to avoid scratch areas?
Anyway, not something we need to solve right now with this series. Something to figure out eventually.
> [...]
>
> As part of this change, the metadata bitmap size is increased from 512
> bytes to PAGE_SIZE to align with the page-based allocations from the
> buddy system.
The implication of this change is that preservation metadata becomes less memory-efficient when preserved pages are sparse. Mainly because if only one bit is set in the bitmap, now 4k bytes of memory is used instead of 512 bytes.
It is hard to say what difference this makes in practice without sampling real workloads, but perhaps still worth mentioning in the commit message?
Other than this,
Reviewed-by: Pratyush Yadav pratyush@kernel.org
> [...]
On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav pratyush@kernel.org wrote:
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
If we do that, most likely we will add a GFP flag that goes with it, so the slab can use a special pool of pages that are preservable. Otherwise, we are going to be leaking memory from the old kernel in the unpreserved parts of the pages. If we do that, kfence can ignore allocations with that new preservable GFP flag.
Pasha
On Wed, Oct 15, 2025 at 4:19 PM Pasha Tatashin pasha.tatashin@soleen.com wrote:
On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav pratyush@kernel.org wrote:
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
If we do that, most likely we will add a GFP flag that goes with it, so the slab can use a special pool of pages that are preservable. Otherwise, we are going to be leaking memory from the old kernel in the unpreserved parts of the pages. If we do that, kfence can ignore allocations with that new preservable GFP flag.
I think this is the best way forward. Changing KFENCE to allocate the pool from buddy will make the allocation size less flexible (goodbye, CONFIG_KFENCE_NUM_OBJECTS).
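A rough sizing sketch of why that flexibility matters (based on the KFENCE pool layout as I understand it, where every object page is paired with a guard page plus two extra guard pages; the EXAMPLE_* names below are made up, not kernel macros):

/* Assumes 4 KiB pages; EXAMPLE_* are illustrative names only. */
#define EXAMPLE_NUM_OBJECTS     255     /* the default CONFIG_KFENCE_NUM_OBJECTS */
#define EXAMPLE_POOL_SIZE       ((EXAMPLE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE)     /* 2 MiB */

/*
 * memblock can hand out exactly this size for any object count. A buddy
 * allocation would be rounded up to a power-of-two number of pages, so
 * e.g. 300 objects (~2.35 MiB of pool) would consume a 4 MiB block.
 */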
On Wed, Oct 15, 2025 at 10:19:08AM -0400, Pasha Tatashin wrote:
On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav pratyush@kernel.org wrote:
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
If we do that, most likely we will add a GFP flag that goes with it, so the slab can use a special pool of pages that are preservable. Otherwise, we are going to be leaking memory from the old kernel in the unpreserved parts of the pages.
That isn't an issue. If we make slab preservable then we'd have to preserve the page and then somehow record what order is stored in that page and a bit map of which parts are allocated to restore the slab state on recovery.
So long as the non-preserved memory comes back as freed on the successor kernel it doesn't matter what was in it in the preceding kernel. The new kernel will eventually zero it. So it isn't a 'leak'.
Jason
On Fri, Oct 24, 2025 at 9:25 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Oct 15, 2025 at 10:19:08AM -0400, Pasha Tatashin wrote:
On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav pratyush@kernel.org wrote:
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
If we do that, most likely we will add a GFP flag that goes with it, so the slab can use a special pool of pages that are preservable. Otherwise, we are going to be leaking memory from the old kernel in the unpreserved parts of the pages.
That isn't an issue. If we make slab preservable then we'd have to preserve the page and then somehow record what order is stored in that page and a bit map of which parts are allocated to restore the slab state on recovery.
So long as the non-preserved memory comes back as freed on the successor kernel it doesn't matter what was in it in the preceding kernel. The new kernel will eventually zero it. So it isn't a 'leak'.
Hi Jason,
I agree, it's not a "leak" in the traditional sense, as we trust the successor kernel to manage its own memory.
However, my concern is that without a dedicated GFP flag, this partial-page preservation model becomes too fragile, inefficient, and creates a data exposure risk.
You're right the new kernel will eventually zero memory, but KHO preserves at page granularity. If we preserve a single slab object, the entire page is handed off. When the new kernel maps that page (e.g., to userspace) to access the preserved object, it also exposes the unpreserved portions of that same page. Those portions contain stale data from the old kernel and won't have been zeroed yet, creating an easy-to-miss data leak vector. It makes the API very error-prone.
There's also the inefficiency. The unpreserved parts of that page are unusable by the new kernel until the preserved object is freed. Depending on the use case, that object might live for the entire kernel lifetime, effectively wasting that memory. This waste could then accumulate with each subsequent live update.
Trying to create a special KHO slab cache isn't a solution either, since slab caches are often merged.
As I see it, the only robust solution is to use a special GFP flag. This would force these allocations to come from a dedicated pool of pages that are fully preserved, with no partial/mixed-use pages and also retrieved as slabs.
That said, I'm not sure preserving individual slab objects is a high priority right now. It might be simpler to avoid it altogether.
Pasha
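For illustration, a purely hypothetical sketch of that idea (none of these names exist today; GFP_PRESERVED is only the flag proposed in this thread, and SLAB_NO_MERGE is used on the assumption that it keeps the cache's pages free of unrelated objects):

static struct kmem_cache *kho_obj_cache;        /* made-up name */

static int __init kho_obj_cache_init(void)
{
        /* Unmergeable cache, so its pages contain only preservable objects. */
        kho_obj_cache = kmem_cache_create("kho_obj_cache", 64, 0,
                                          SLAB_NO_MERGE, NULL);
        return kho_obj_cache ? 0 : -ENOMEM;
}

static void *kho_obj_alloc(void)
{
        /* The proposed flag would steer backing pages to a preserved pool. */
        return kmem_cache_zalloc(kho_obj_cache, GFP_KERNEL /* | GFP_PRESERVED */);
}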
On Fri, Oct 24, 2025 at 09:57:24AM -0400, Pasha Tatashin wrote:
You're right the new kernel will eventually zero memory, but KHO preserves at page granularity. If we preserve a single slab object, the entire page is handed off. When the new kernel maps that page (e.g., to userspace) to access the preserved object, it also exposes the unpreserved portions of that same page. Those portions contain stale data from the old kernel and won't have been zeroed yet, creating an easy-to-miss data leak vector.
Do we zero any of the memory on KHO? Honestly, I wouldn't worry about the point it zeros; slab guarantees it will be zero when it should be zero.
There's also the inefficiency. The unpreserved parts of that page are unusable by the new kernel until the preserved object is freed.
That's not how I see slab preservation working. When the slab page is unpreserved all the free space in that page should be immediately available to the successor kernel.
As I see it, the only robust solution is to use a special GFP flag. This would force these allocations to come from a dedicated pool of pages that are fully preserved, with no partial/mixed-use pages and also retrieved as slabs.
It is certainly more efficient to preserve fewer slab pages in total and pooling would help get there.
That said, I'm not sure preserving individual slab objects is a high priority right now. It might be simpler to avoid it altogether.
I think we will need something, a lot of the structs I'm seeing in other patches are small and allocating a whole page is pretty wasteful too.
Jason
On Fri, Oct 24, 2025 at 10:20 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 24, 2025 at 09:57:24AM -0400, Pasha Tatashin wrote:
You're right the new kernel will eventually zero memory, but KHO preserves at page granularity. If we preserve a single slab object, the entire page is handed off. When the new kernel maps that page (e.g., to userspace) to access the preserved object, it also exposes the unpreserved portions of that same page. Those portions contain stale data from the old kernel and won't have been zeroed yet, creating an easy-to-miss data leak vector.
Do we zero any of the memory on KHO? Honestly, I wouldn't worry about the point it zeros; slab guarantees it will be zero when it should be zero.
We do not zero memory on kexec/KHO/LU; instead, the next kernel zeroes memory on demand during allocation. My point is that the KHO interface retrieves a full page in the next kernel, not an individual slab. Consequently, a caller might retrieve data that was preserved as a slab in the previous kernel, expose that data to the user, and unintentionally leak the remaining part of the page as well.
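To illustrate the concern, a hypothetical sketch (it assumes the kho_restore_folio() interface; struct my_state and the surrounding code are made up):

static struct my_state *restore_state(phys_addr_t obj_phys)
{
        /* The caller gets the whole page back, not just its object. */
        struct folio *folio = kho_restore_folio(ALIGN_DOWN(obj_phys, PAGE_SIZE));

        if (!folio)
                return NULL;

        /*
         * The remainder of this page still contains whatever other slab
         * objects the previous kernel kept there; anything that maps or
         * copies the full page exposes that stale data as well.
         */
        return folio_address(folio) + offset_in_page(obj_phys);
}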
There's also the inefficiency. The unpreserved parts of that page are unusable by the new kernel until the preserved object is freed.
That's not how I see slab preservation working. When the slab page is unpreserved all the free space in that page should be immediately available to the successor kernel.
This ties into the same problem. The scenario I'm worried about is:
1. A caller preserves one small slab object.
2. In the new kernel, the caller retrieves the entire page that contains this object.
3. The caller uses the data from that slab object without freeing it.
In this case, the rest of the page, all the other slab slots, is effectively wasted. The page can't be fully returned to the system or used by the slab allocator until that one preserved object is freed, which might be never. The free space isn't "immediately available" because the page is being held by the caller, even though the caller is using only a slab in that page.
As I see it, the only robust solution is to use a special GFP flag. This would force these allocations to come from a dedicated pool of pages that are fully preserved, with no partial/mixed-use pages and also retrieved as slabs.
It is certainly more efficient to preserve fewer slab pages in total and pooling would help get there.
That said, I'm not sure preserving individual slab objects is a high priority right now. It might be simpler to avoid it altogether.
I think we will need something, a lot of the structs I'm seeing in other patches are small and allocating a whole page is pretty wasteful too.
If we're going to support this, it would have to be specifically engineered as full slab support for KHO preservation, where the interface retrieves slab objects directly, not the pages they're on, and I think would require using a special GFP_PRESERVED flag.
Jason
On Fri, Oct 24, 2025 at 10:36:45AM -0400, Pasha Tatashin wrote:
We do not zero memory on kexec/KHO/LU; instead, the next kernel zeroes memory on demand during allocation. My point is that the KHO interface retrieves a full page in the next kernel, not an individual slab. Consequently, a caller might retrieve data that was preserved as a slab in the previous kernel, expose that data to the user, and unintentionally leak the remaining part of the page as well.
I don't think preventing that is part of the kho threat model..
There's also the inefficiency. The unpreserved parts of that page are unusable by the new kernel until the preserved object is freed.
That's not how I see slab preservation working. When the slab page is unpreserved all the free space in that page should be immediately available to the successor kernel.
This ties into the same problem. The scenario I'm worried about is:
1. A caller preserves one small slab object.
2. In the new kernel, the caller retrieves the entire page that contains this object.
3. The caller uses the data from that slab object without freeing it.
4. When slab restores the page it immediately makes all the free slots available on its free list.
other patches are small and allocating a whole page is pretty wasteful too.
If we're going to support this, it would have to be specifically engineered as full slab support for KHO preservation, where the interface retrieves slab objects directly, not the pages they're on,
Yes
and I think would require using a special GFP_PRESERVED flag.
Maybe so, I was hoping not..
Jason
On Fri, Oct 24, 2025 at 10:55 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 24, 2025 at 10:36:45AM -0400, Pasha Tatashin wrote:
We do not zero memory on kexec/KHO/LU; instead, the next kernel zeroes memory on demand during allocation. My point is that the KHO interface retrieves a full page in the next kernel, not an individual slab. Consequently, a caller might retrieve data that was preserved as a slab in the previous kernel, expose that data to the user, and unintentionally leak the remaining part of the page as well.
I don't think preventing that is part of the kho threat model..
There's also the inefficiency. The unpreserved parts of that page are unusable by the new kernel until the preserved object is freed.
That's not how I see slab preservation working. When the slab page is unpreserved all the free space in that page should be immediately available to the successor kernel.
This ties into the same problem. The scenario I'm worried about is:
1. A caller preserves one small slab object.
2. In the new kernel, the caller retrieves the entire page that contains this object.
3. The caller uses the data from that slab object without freeing it.
4. When slab restores the page it immediately makes all the free slots available on its free list.
Right, we do not have this functionality.
other patches are small and allocating a whole page is pretty wasteful too.
If we're going to support this, it would have to be specifically engineered as full slab support for KHO preservation, where the interface retrieves slab objects directly, not the pages they're on,
Yes
and I think would require using a special GFP_PRESERVED flag.
Maybe so, I was hoping not..
Jason
As part of this change, the metadata bitmap size is increased from 512 bytes to PAGE_SIZE to align with the page-based allocations from the buddy system.
The implication of this change is that preservation metadata becomes less memory-efficient when preserved pages are sparse. Mainly because if only one bit is set in the bitmap, now 4k bytes of memory is used instead of 512 bytes.
It is hard to say what difference this makes in practice without sampling real workloads, but perhaps still worth mentioning in the commit message?
Forgot to reply to the other part:
I agree. However, I suspect the implication is going to be minimal; it is strange to preserve fragmented state and expect a fast reboot. Most likely, we are going to be optimizing the preservation pools, such as using 1G order pages for guest memory.
Also, we are moving toward preserving 4K bitmaps as part of the Stateless KHO patch series, so I think we will make this change anyway, as part of this fix or as part of transitioning to radix-tree stateless KHO.
Reviewed-by: Pratyush Yadav pratyush@kernel.org
Thank you. Pasha
On Wed, Oct 15, 2025 at 01:31:21AM -0400, Pasha Tatashin wrote:
-	elm = kzalloc(sz, GFP_KERNEL);
+	order = get_order(sz);
+	elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!elm)
 		return ERR_PTR(-ENOMEM);
Please add a function to allocate preservable non-scratch pages then :\ I don't think we should be sprinkling __get_free_pages() all over the place.
You can then add a cleanup.h for the new function and so on.
It should allocate frozen pages, accept a size_t and return a void *.
If this issue is only related to kfence then the new function could have an if IS_ENABLED(KFENCE) and use slab when it is safe.
Jason
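A minimal sketch of the helper being described (the names are made up, and the frozen-pages detail mentioned above is left out): one allocation entry point for KHO metadata that only falls back to slab when KFENCE cannot redirect the allocation into the scratch area.

/*
 * Sketch only: kho_alloc_preservable()/kho_free_preservable() are made-up
 * names. The idea is to keep the allocator choice in one place instead of
 * open-coding __get_free_pages() at every KHO metadata call site.
 */
static void *kho_alloc_preservable(size_t sz)
{
        /* Without KFENCE, slab memory can never land in the KFENCE pool. */
        if (!IS_ENABLED(CONFIG_KFENCE))
                return kzalloc(sz, GFP_KERNEL);

        return (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, get_order(sz));
}

static void kho_free_preservable(void *ptr, size_t sz)
{
        if (!IS_ENABLED(CONFIG_KFENCE))
                kfree(ptr);
        else
                free_pages((unsigned long)ptr, get_order(sz));
}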