Hi,
On 25. Sep 2024, at 16:32, Chuck Lever III <chuck.lever@oracle.com> wrote:
I'm not entirely certain what you mean by "cold restart" versus "warm restart" but for the moment I will assume that "cold restart" means you reboot the NFS server host, and "warm restart" means you simply cycle the NFS service (eg systemctl restart nfs-server).
The NFS server is a VM. A "warm reboot" keeps the hypervisor process running and only restarts the guest inside the VM. A "cold reboot" performs a shutdown/poweroff: the hypervisor process exits, and then a new hypervisor process is started for the VM.
STALE means the file handle no longer exists on the server. This can mean the file system was unexported and thus is no longer accessible.
In your case, I'm guessing that what is happening on a cold restart is the exported file system is replaced; for example a tmpfs. Or, maybe reboot removes exported files.
And while riding my bike home and getting some fresh air I came to the same conclusion (after previously bashing my head against this for hours).
We have a step where VMs that are booted fresh on the hypervisor get a randomized UUID on their root filesystem and, because of $reasons, we do that every time, not just during first boot. Looks like we need to stop doing that.
My problem goes away once I fix the fsid in the exports, but I don’t think I want to dig a deeper hole.
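For reference, pinning the fsid looks roughly like the entry below in /etc/exports. This is only a sketch: the export path, the client network, and the fsid value are illustrative placeholders, not the actual configuration discussed here.

```
# /etc/exports (illustrative values only)
# fsid=<small integer or UUID> gives the export a fixed filesystem identifier,
# so file handles no longer depend on the UUID of the underlying filesystem.
/srv/export  192.0.2.0/24(rw,sync,no_subtree_check,fsid=1)
```

After editing the file, `exportfs -ra` re-reads the exports. Each export on a server needs its own distinct fsid value.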
Sorry for the noise and thanks for the hint (which even seems to have arrived telepathically).
Cheers, Christian