Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator

24 Oct 2025


On Fri, Oct 24, 2025 at 9:25 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
...
On Wed, Oct 15, 2025 at 10:19:08AM -0400, Pasha Tatashin wrote:
...
On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav <pratyush@kernel.org> wrote:
...
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
...
KHO allocates metadata for its preserved memory map using the SLUB
allocator via kzalloc(). This metadata is temporary and is used by the
next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be
randomly intercepted by kfence_alloc(), which services the allocation
from a dedicated KFENCE memory pool. This pool is allocated early in
boot via memblock.
At some point, we'd probably want to add support for preserving slab
objects using KHO. That wouldn't work if the objects can land in scratch
memory. Right now, the kfence pools are allocated right before KHO goes
out of scratch-only and memblock frees pages to buddy.
If we do that, most likely we will add a GFP flag that goes with it,
so the slab can use a special pool of pages that are preservable.
Otherwise, we are going to be leaking memory from the old kernel in
the unpreserved parts of the pages.
That isn't an issue. If we make slab preservable then we'd have to
preserve the page and then somehow record what order is stored in that
page and a bit map of which parts are allocated to restore the slab
state on recovery.
So long as the non-preserved memory comes back as freed on the
successor kernel it doesn't matter what was in it in the preceding
kernel. The new kernel will eventually zero it. So it isn't a 'leak'.
Hi Jason,
I agree, it's not a "leak" in the traditional sense, as we trust the
successor kernel to manage its own memory.
However, my concern is that without a dedicated GFP flag, this
partial-page preservation model becomes fragile and inefficient, and
creates a data exposure risk.
You're right that the new kernel will eventually zero memory, but KHO
preserves at page granularity. If we preserve a single slab object,
the entire page is handed off. When the new kernel maps that page
(e.g., to userspace) to access the preserved object, it also exposes
the unpreserved portions of that same page. Those portions contain
stale data from the old kernel and won't have been zeroed yet,
creating an easy-to-miss data leak vector. It makes the API very
error-prone.
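To make the exposure concrete, here is a minimal userspace sketch, not
kernel code. It assumes a 4096-byte page and a 64-byte object; the names
demo_page, old_kernel_fill() and stale_bytes_exposed() are hypothetical
and exist only for this illustration. It models a page that is handed
over whole with one preserved object in it, then counts the bytes
outside that object that still hold the old kernel's data.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u   /* assumed page size for this sketch */
#define OBJ_SIZE  64u     /* hypothetical slab object size */

/* One slab page as the old kernel left it. */
struct demo_page {
    unsigned char bytes[PAGE_SIZE];
};

/*
 * "Old kernel": every slot holds unrelated data (0xAA), except slot
 * `keep`, which holds the single object we intend to preserve (0x11).
 */
static void old_kernel_fill(struct demo_page *pg, size_t keep)
{
    memset(pg->bytes, 0xAA, PAGE_SIZE);
    memset(pg->bytes + keep * OBJ_SIZE, 0x11, OBJ_SIZE);
}

/*
 * "New kernel": the page arrives whole and un-zeroed. Count the bytes
 * that lie outside the preserved object and are still non-zero, i.e.
 * stale data that becomes visible if the whole page is mapped.
 */
static size_t stale_bytes_exposed(const struct demo_page *pg, size_t keep)
{
    size_t stale = 0;

    for (size_t i = 0; i < PAGE_SIZE; i++) {
        int in_obj = i >= keep * OBJ_SIZE && i < (keep + 1) * OBJ_SIZE;

        if (!in_obj && pg->bytes[i] != 0)
            stale++;
    }
    return stale;
}
```

With these assumed sizes, preserving one object leaves PAGE_SIZE -
OBJ_SIZE bytes of old-kernel data in the handed-over page. The same
bytes are the ones the new kernel cannot reuse while the object lives.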
There's also the inefficiency. The unpreserved parts of that page are
unusable by the new kernel until the preserved object is freed.
Depending on the use case, that object might live for the entire
kernel lifetime, effectively wasting that memory. This waste could
then accumulate with each subsequent live update.
Trying to create a special KHO slab cache isn't a solution either,
since slab caches are often merged.
As I see it, the only robust solution is to use a special GFP flag.
This would force these allocations to come from a dedicated pool of
pages that are fully preserved, with no partial/mixed-use pages, and
that are also retrieved as slabs in the new kernel.
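As a rough illustration of the dedicated-pool idea, the userspace sketch
below routes allocations by flag. Everything in it is an assumption made
for the example: DEMO_GFP_PRESERVE is a made-up flag and not an existing
kernel GFP flag, and demo_alloc() and demo_in_pool() are hypothetical
helpers. A trivial bump allocator stands in for slab. The point is only
that flagged allocations land in pages containing nothing but
preservable objects, while everything else goes to the general
allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE    4096u
#define DEMO_POOL_PAGES   4u
#define DEMO_GFP_PRESERVE 0x1u  /* hypothetical flag for this sketch only */

/* Pages set aside for preservable objects; they are preserved whole. */
static unsigned char demo_pool[DEMO_POOL_PAGES * DEMO_PAGE_SIZE];
static size_t demo_pool_used;

/* True if p points into the preservable pool. */
static int demo_in_pool(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t base = (uintptr_t)demo_pool;

    return a >= base && a < base + sizeof(demo_pool);
}

/*
 * Flagged allocations are carved out of the dedicated pool (simple bump
 * allocation, 16-byte granularity). Unflagged allocations use malloc()
 * and therefore never share a page with a preserved object.
 */
static void *demo_alloc(size_t size, unsigned int flags)
{
    void *p;

    if (!(flags & DEMO_GFP_PRESERVE))
        return malloc(size);

    if (size > sizeof(demo_pool))
        return NULL;
    size = (size + 15u) & ~(size_t)15u;
    if (size > sizeof(demo_pool) - demo_pool_used)
        return NULL;    /* pool exhausted */

    p = demo_pool + demo_pool_used;
    demo_pool_used += size;
    return p;
}
```

A caller would request a preservable object with demo_alloc(size,
DEMO_GFP_PRESERVE) and an ordinary one with demo_alloc(size, 0). How a
real pool would be sized, refilled, and restored as slabs after the
handover is not addressed here.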
That said, I'm not sure preserving individual slab objects is a high
priority right now. It might be simpler to avoid it altogether.
Pasha
