So I've gotten back to playing with prime for a day, and found some old intel/radeon tests I had failing,
Tracked it down to a lifetime issue with the current code and can think of two fixes,
The problem scenario is
1. i915: create gem object 2. i915: export gem object to prime 3. radeon: import gem object 4. close prime fd 5. radeon: unref object 6. i915: unref object
So we end up at this point, with a imported buffer record for the dma_buf on the i915 file private.
Now if a subsequent test (without closing the drm fd) reallocates the dma_buf with the same address, we can end up seeing that.
So why doesn't that reference get cleaned up?
So the reference gets added above at step 2, and when radeon unrefs the object, i915 gets the dma-buf release callback, however at that stage we don't actually have the file priv to remove the pointer from, so it dangles there.
Possible fixes:
a) take a reference on the dma_buf attached to the gem handle when we export it, keep it until the gem handle goes away. I'm unsure if this could create zombie objects, since the dma buf has a reference on the gem object, but since the gem handle is separate to the object it might work.
b) don't keep track of dma_buf, keep track of gem objects, when we get a lookup, check inside the gem object, since currently we NULL out the export_dma_buf when we hit the release path, apart from the fact I'm sure the locking is foobar,
c) scan all the file privs for all the devices, no.
Anyone else any better plans?
Dave.
b) don't keep track of dma_buf, keep track of gem objects, when we get a lookup, check inside the gem object, since currently we NULL out the export_dma_buf when we hit the release path, apart from the fact I'm sure the locking is foobar,
Okay I've sent and implementation of this option, however it does mean that prime relies on GEM, which at this point is a tradeoff I'm quite happy to start making.
Dave.
Hi Dave!
On Tue, Dec 18, 2012 at 5:38 AM, Dave Airlie airlied@gmail.com wrote:
So I've gotten back to playing with prime for a day, and found some old intel/radeon tests I had failing,
Tracked it down to a lifetime issue with the current code and can think of two fixes,
The problem scenario is
- i915: create gem object
- i915: export gem object to prime
- radeon: import gem object
- close prime fd
- radeon: unref object
- i915: unref object
So we end up at this point, with a imported buffer record for the dma_buf on the i915 file private.
Now if a subsequent test (without closing the drm fd) reallocates the dma_buf with the same address, we can end up seeing that.
So why doesn't that reference get cleaned up?
So the reference gets added above at step 2, and when radeon unrefs the object, i915 gets the dma-buf release callback, however at that stage we don't actually have the file priv to remove the pointer from, so it dangles there.
I haven't checked yet, but we seem to have similar leaks by simply self-importing a dma-buf on a different open fd of the same driver already (i.e. the wayland/dri3/rendernode use-case).
Possible fixes:
a) take a reference on the dma_buf attached to the gem handle when we export it, keep it until the gem handle goes away. I'm unsure if this could create zombie objects, since the dma buf has a reference on the gem object, but since the gem handle is separate to the object it might work.
I'm leaning towards this option, since it'll give us cleaner lifetime rules: - As long as we still have gem handle around, we keep an explicit reference to the dma_buf. We need to keep a full reference since otherwise re-exporting by doing reusing the cached dma_buf will be a royal pain and fraught with races. From a drm core/prime perspective I think we could even unify the 2 dma_buf poiners we have currently - drivers could keep track of whether a gem bo is native or imported in their own state with just one bit ... - If gem->handle_refcount drops to zero we clean up the dma-buf pointer&cache entry and drop that reference. We need to be careful here with races against other users resurrect the object with a new userspace gem handle. But essentially it's the same dance as with flink, with the only complication that we now have two ways to create a new gem handle while we're doing the teardown when the last gem handle is closed.
This way dma-buf mirrors the lifetime rules of flink, where we hold a reference on the lookup object (flink name/dma-buf) for as long a gem handle exists. flink names are driver private there's no need for a real refcount, but the current rules amount to an implicit refcount with only 0,1 as values. The lookup-object themselves also hold a reference onto the real backing storage.
The second reason why I think we should give the dma-buf (re-)import cache a real reference is my framebuffer lifetime rework. I've played around with tons of different schemes which all kinda looked somewhat silly, until settling on holding a reference while the object is in the lookup idr. This reference gets dropped when the real user disappears: Either at rmfb ioctl time for userspace fbs, or at driver destruction for the fbdev fb (or whenever the driver drops an internally-created fb).
Ime that pattern of keeping an explicit reference just leads to less locking headache.
b) don't keep track of dma_buf, keep track of gem objects, when we get a lookup, check inside the gem object, since currently we NULL out the export_dma_buf when we hit the release path, apart from the fact I'm sure the locking is foobar,
Scrolled through your patch quickly, but didn't do a full review. Good thing is that I've resurrect my ivb/nvc0 test machine and beaten the igt/prime tests a bit into shape, so I should be able to test things (not like last time around, where my attempt to fix up lifetime rules a bit resulted in some unduly oopses).
I'll look at this in-depth next year, pls yell in case I'll forget by then ;-)
Cheers, Daniel
linaro-mm-sig@lists.linaro.org