On Fri, Oct 24, 2025 at 9:25 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
On Wed, Oct 15, 2025 at 10:19:08AM -0400, Pasha Tatashin wrote:
On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav <pratyush@kernel.org> wrote:
+Cc Marco, Alexander
On Wed, Oct 15 2025, Pasha Tatashin wrote:
KHO allocates metadata for its preserved memory map using the SLUB allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory.
A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock.
At some point, we'd probably want to add support for preserving slab objects using KHO. That wouldn't work if the objects can land in scratch memory. Right now, the kfence pools are allocated right before KHO goes out of scratch-only and memblock frees pages to buddy.
If we do that, most likely we will add a GFP flag that goes with it, so the slab can use a special pool of pages that are preservable. Otherwise, we are going to be leaking memory from the old kernel in the unpreserved parts of the pages.
That isn't an issue. If we make slab preservable, then we'd have to preserve the page and somehow record what order is stored in that page, plus a bitmap of which parts are allocated, so the slab state can be restored on recovery.
So long as the non-preserved memory comes back as freed on the successor kernel, it doesn't matter what was in it in the preceding kernel. The new kernel will eventually zero it. So it isn't a 'leak'.
Hi Jason,
I agree, it's not a "leak" in the traditional sense, as we trust the successor kernel to manage its own memory.
However, my concern is that without a dedicated GFP flag, this partial-page preservation model becomes fragile and inefficient, and creates a data exposure risk.
You're right that the new kernel will eventually zero memory, but KHO preserves at page granularity. If we preserve a single slab object, the entire page is handed off. When the new kernel maps that page (e.g., to userspace) to access the preserved object, it also exposes the unpreserved portions of that same page. Those portions contain stale data from the old kernel and won't have been zeroed yet, creating an easy-to-miss data leak vector. It makes the API very error-prone.
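To make the concern concrete, here is a minimal userspace sketch of what every consumer would have to remember to do before exposing such a page: walk a per-page bitmap and zero every slot that was not preserved. This is an illustration only, not kernel code. The structure, the `sketch_` names, the 4096-byte page size, and the bitmap layout are all assumptions made up for this example; nothing like this exists in KHO today.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical per-page record; NOT an existing KHO or slab structure. */
struct sketch_slab_page {
	uint8_t  data[SKETCH_PAGE_SIZE]; /* page contents handed over by the old kernel */
	size_t   obj_size;               /* size of each object slot, 8..4096 bytes */
	uint64_t preserved[8];           /* bit i set => slot i was preserved (up to 512 slots) */
};

/* Return 1 if the given slot is marked preserved in the bitmap. */
static int sketch_slot_preserved(const struct sketch_slab_page *pg, size_t slot)
{
	return (int)((pg->preserved[slot / 64] >> (slot % 64)) & 1u);
}

/*
 * Zero every byte of the page that is not covered by a preserved slot,
 * including any tail left over after the last full slot.
 * Returns the number of bytes scrubbed, or 0 if obj_size is out of range.
 */
static size_t sketch_scrub_unpreserved(struct sketch_slab_page *pg)
{
	size_t nslots, slot, tail, scrubbed = 0;

	if (pg->obj_size < 8 || pg->obj_size > SKETCH_PAGE_SIZE)
		return 0;

	nslots = SKETCH_PAGE_SIZE / pg->obj_size;
	for (slot = 0; slot < nslots; slot++) {
		if (sketch_slot_preserved(pg, slot))
			continue;
		memset(pg->data + slot * pg->obj_size, 0, pg->obj_size);
		scrubbed += pg->obj_size;
	}

	tail = SKETCH_PAGE_SIZE - nslots * pg->obj_size;
	if (tail) {
		memset(pg->data + nslots * pg->obj_size, 0, tail);
		scrubbed += tail;
	}
	return scrubbed;
}
```

With 64-byte objects and a single preserved slot, the scrub has to clear the other 4032 bytes of the page. Any path that maps the page without a step like this exposes that stale data.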
There's also the inefficiency. The unpreserved parts of that page are unusable by the new kernel until the preserved object is freed. Depending on the use case, that object might live for the entire kernel lifetime, effectively wasting that memory. This waste could then accumulate with each subsequent live update.
Trying to create a special KHO slab cache isn't a solution either, since slab caches are often merged, so the KHO objects could still end up sharing pages with unrelated allocations.
As I see it, the only robust solution is to use a special GFP flag. This would force these allocations to come from a dedicated pool of pages that are fully preserved, with no partial or mixed-use pages, and that are retrieved as slabs in the new kernel.
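As a rough illustration of the routing I have in mind, here is a userspace sketch where a flag steers an allocation into a dedicated region, so preserved objects never share memory with ordinary ones. Everything here is hypothetical: `SKETCH_GFP_PRESERVE` is not a real GFP flag, the `sketch_` functions are not kernel APIs, and the bump allocator (which never frees) stands in for whatever slab-backed pool a real design would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_GFP_PRESERVE 0x1u /* hypothetical flag, not a real GFP bit */
#define SKETCH_PAGE_SIZE    4096u
#define SKETCH_POOL_PAGES   4u

/* Dedicated pool: only flagged allocations ever land here. */
static uint8_t sketch_pool[SKETCH_POOL_PAGES * SKETCH_PAGE_SIZE];
static size_t sketch_pool_used;

/*
 * Allocate size bytes. Flagged requests are carved out of the dedicated
 * pool with a simple bump pointer; everything else takes the ordinary
 * path (malloc stands in for the normal allocator).
 */
static void *sketch_alloc(size_t size, unsigned int flags)
{
	void *p;

	if (!(flags & SKETCH_GFP_PRESERVE))
		return malloc(size);

	if (size == 0 || size > sizeof(sketch_pool))
		return NULL;

	size = (size + 7) & ~(size_t)7; /* keep objects 8-byte aligned */
	if (size > sizeof(sketch_pool) - sketch_pool_used)
		return NULL; /* pool exhausted */

	p = sketch_pool + sketch_pool_used;
	sketch_pool_used += size;
	return p;
}

/* Return 1 if p points into the dedicated pool. */
static int sketch_in_preserved_pool(const void *p)
{
	uintptr_t addr = (uintptr_t)p;
	uintptr_t base = (uintptr_t)sketch_pool;

	return addr >= base && addr < base + sizeof(sketch_pool);
}
```

The point of the sketch is the invariant, not the allocator: a pointer is either inside the preserved pool or it is not, so the whole pool can be handed over without dragging along unrelated data.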
That said, I'm not sure preserving individual slab objects is a high priority right now. It might be simpler to avoid it altogether.
Pasha