On Mon, Oct 04, 2021 at 12:44:36PM -0300, Jason Gunthorpe wrote:
On Mon, Oct 04, 2021 at 03:12:20PM +0100, Mark Brown wrote:
On Mon, Oct 04, 2021 at 10:17:56AM -0300, Jason Gunthorpe wrote:
When something like kexec happens we need the machine to be in a state where random DMAs are not corrupting memory.
That's all well and good, but there's no point in implementing something half-baked that opens up a whole bunch of opportunities to crash the system if more work comes in after it has half broken the device setup.
Well, that is up to the driver implementing this. It looks like device shutdown is called before userspace is fully torn down, so yes, concurrency with userspace is a possible concern here.
It's not just userspace that can initiate things - interrupts are also an issue, someone could press a button or whatever. Frankly for SPI the quiescing part doesn't seem like logic that should be implemented in drivers, it's a subsystem level thing since there's nothing driver specific about it.
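To illustrate the point that quiescing is generic rather than driver-specific, here is a minimal userspace sketch of a subsystem-level quiesce, assuming a controller whose transfers all pass through one entry point. Everything here is hypothetical: `struct ctlr`, `ctlr_begin_transfer()`, `ctlr_end_transfer()` and `ctlr_quiesce()` are illustrative names, not the SPI core API, and pthreads stand in for kernel locking. New work (from userspace, an interrupt handler, or anywhere else) is rejected once shutdown starts, and the quiesce waits for in-flight transfers to drain.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical controller state; a model, not a kernel structure. */
struct ctlr {
	pthread_mutex_t lock;
	pthread_cond_t idle;	/* signalled when in_flight drops to 0 */
	bool shutting_down;	/* set once by ctlr_quiesce(), never cleared */
	int in_flight;		/* transfers currently running */
};

static void ctlr_init(struct ctlr *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->idle, NULL);
	c->shutting_down = false;
	c->in_flight = 0;
}

/* Every transfer, whatever its origin, must pass through here first.
 * Returns 0 if the transfer may proceed, -ESHUTDOWN once quiesced. */
static int ctlr_begin_transfer(struct ctlr *c)
{
	int ret = 0;

	pthread_mutex_lock(&c->lock);
	if (c->shutting_down)
		ret = -ESHUTDOWN;
	else
		c->in_flight++;
	pthread_mutex_unlock(&c->lock);
	return ret;
}

/* Called when a transfer started by ctlr_begin_transfer() finishes. */
static void ctlr_end_transfer(struct ctlr *c)
{
	pthread_mutex_lock(&c->lock);
	assert(c->in_flight > 0);
	if (--c->in_flight == 0)
		pthread_cond_broadcast(&c->idle);
	pthread_mutex_unlock(&c->lock);
}

/* Subsystem-level quiesce: stop accepting new work, then wait until
 * all in-flight transfers have completed. No driver code involved. */
static void ctlr_quiesce(struct ctlr *c)
{
	pthread_mutex_lock(&c->lock);
	c->shutting_down = true;
	while (c->in_flight > 0)
		pthread_cond_wait(&c->idle, &c->lock);
	pthread_mutex_unlock(&c->lock);
}
```

Because the check and the counter live behind the same lock, a transfer can never slip in between "stop accepting work" and "wait for idle", which is the race a per-driver shutdown callback would otherwise have to solve on its own.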
Because of its emergency nature, it is not appropriate to do complicated, locking-heavy things like struct device unregistration here.
That's just not what's actually implemented in a bunch of places, nor something one would infer from the documentation ("Called at shut-down to quiesce the device", no mention of emergency cases which I'd guess would just be kdump) -
Drivers misunderstanding stuff is not new.
Not just drivers, entire subsystems. And like I say given the documentation I'd be hard pressed to say that it's a misunderstanding.
that's a different thing and definitely abusing the API. I would guess that a good proportion of people implementing it are more worried about clean system shutdown than they are about kdump.
The other important case is to get the device cleaned up enough to pass back to firmware for platforms that use a firmware shutdown/reboot path.
Right, so the other cases I'm aware of are doing pretty much that - bringing things down to a state where the system can reboot cleanly. That can definitely include things like blocking for some hardware, and you're going to need some concurrency handling which means a combination of locking and infrequently tested lockless code paths.
In the case of this specific driver I'm still not clear that the best thing isn't just to delete the shutdown callback and let any ongoing transfers complete, though I guess there'd be issues in kexec cases with long enough transfers.
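One way to bound the kexec concern with long transfers is to let ongoing work drain, but only up to a deadline. The sketch below is a hypothetical userspace model using `pthread_cond_timedwait()`; `struct ctlr`, `ctlr_end_transfer()` and `ctlr_quiesce_timeout()` are illustrative names, not an existing kernel interface, and the choice of timeout and what to do when it expires (for example, forcibly resetting the hardware) is left as an assumption for the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical controller state; a model, not a kernel structure. */
struct ctlr {
	pthread_mutex_t lock;
	pthread_cond_t idle;	/* signalled when in_flight drops to 0 */
	bool shutting_down;
	int in_flight;
};

static void ctlr_init(struct ctlr *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->idle, NULL);
	c->shutting_down = false;
	c->in_flight = 0;
}

/* Called when an in-flight transfer finishes. */
static void ctlr_end_transfer(struct ctlr *c)
{
	pthread_mutex_lock(&c->lock);
	assert(c->in_flight > 0);
	if (--c->in_flight == 0)
		pthread_cond_broadcast(&c->idle);
	pthread_mutex_unlock(&c->lock);
}

/* Stop accepting new work, then wait at most timeout_ms for in-flight
 * transfers to drain. Returns 0 if idle, -ETIMEDOUT if work remains. */
static int ctlr_quiesce_timeout(struct ctlr *c, long timeout_ms)
{
	struct timespec deadline;
	int ret = 0;

	assert(timeout_ms >= 0);
	/* pthread condvars default to CLOCK_REALTIME for the deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	c->shutting_down = true;
	/* Loop to absorb spurious wakeups; stop on timeout or error. */
	while (c->in_flight > 0 && ret == 0)
		ret = pthread_cond_timedwait(&c->idle, &c->lock, &deadline);
	ret = c->in_flight > 0 ? -ETIMEDOUT : 0;
	pthread_mutex_unlock(&c->lock);
	return ret;
}
```

In kernel code the analogous bounded wait would more likely be built on a completion and `wait_for_completion_timeout()`, but the shape is the same: reject new work first, then wait for a bounded time.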