Am 18.04.24 um 03:33 schrieb zhiguojiang:
在 2024/4/15 19:57, Christian König 写道:
[Some people who received this message don't often get email from christian.koenig@amd.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
Am 15.04.24 um 12:35 schrieb zhiguojiang:
在 2024/4/12 14:39, Christian König 写道:
[Some people who received this message don't often get email from christian.koenig@amd.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
Am 12.04.24 um 08:19 schrieb zhiguojiang:
[SNIP] -> Here task 2220 do epoll again where internally it will get/put then start to free twice and lead to final crash.
Here is the basic flow:
- Thread A install the dma_buf_fd via dma_buf_export, the fd
refcount is 1
Thread A add the fd to epoll list via epoll_ctl(EPOLL_CTL_ADD)
After use the dma buf, Thread A close the fd, then here fd
refcount is 0, and it will run __fput finally to release the file
Stop, that isn't correct.
The fs layer which calls dma_buf_poll() should make sure that the file can't go away until the function returns.
Then inside dma_buf_poll() we add another reference to the file while installing the callback:
/* Paired with fput in dma_buf_poll_cb */ get_file(dmabuf->file);
Hi,
The problem may just occurred here.
Is it possible file reference count already decreased to 0 and fput already being in progressing just before calling get_file(dmabuf->file) in dma_buf_poll()?
No, exactly that isn't possible.
If a function gets a dma_buf pointer or even more general any reference counted pointer which has already decreased to 0 then that is a major bug in the caller of that function.
BTW: It's completely illegal to work around such issues by using file_count() or RCU functions. So when you suggest stuff like that it will immediately face rejection.
Regards, Christian.
Hi,
Thanks for your kind comment.
'If a function gets a dma_buf pointer or even more general any reference
counted pointer which has already decreased to 0 then that is a major
bug in the caller of that function.'
According to the issue log, we can see the dma buf file close and epoll operation running in parallel.
Apparently at the moment caller calls epoll, although another task caller already called close on the same fd, but this fd was still in progress to close, so fd is still valid, thus no EBADF returned to caller.
No, exactly that can't happen either.
As far as I can see the EPOLL holds a reference to the files it contains. So it is perfectly valid to add the file descriptor to EPOLL and then close it.
In this case the file won't be closed, but be kept alive by it's reference in the EPOLL file descriptor.
Then the two task contexts operate on same dma buf fd(one is close, another is epoll) in kernel space.
Please kindly help to comment whether above scenario is possible.
If there is some bug in the caller logic, Can u help to point it out? :)
Well what could be is that the EPOLL implementation is broken somehow, but I have really doubts on that.
As I said before the most likely explanation is that you have a broken device driver which messes up the DMA-buf file reference count somehow.
Regards, Christian.
Regards, Zhiguo
This reference is only dropped after the callback is completed in dma_buf_poll_cb():
/* Paired with get_file in dma_buf_poll */ fput(dmabuf->file);
So your explanation for the issue just seems to be incorrect.
- Here Thread A not do epoll_ctl(EPOLL_CTL_DEL) manunally, so it
still resides in one epoll_list. Although __fput will call eventpoll_release to remove the file from binded epoll list, but it has small time window where Thread B jumps in.
Well if that is really the case then that would then be a bug in epoll_ctl().
- During the small window, Thread B do the poll action for
dma_buf_fd, where it will fget/fput for the file, this means the fd refcount will be 0 -> 1 -> 0, and it will call __fput again. This will lead to __fput twice for the same file.
- So the potenial fix is use get_file_rcu which to check if file
refcount already zero which means under free. If so, we just return and no need to do the dma_buf_poll.
Well to say it bluntly as far as I can see this suggestion is completely nonsense.
When the reference to the file goes away while dma_buf_poll() is executed then that's a massive bug in the caller of that function.
Regards, Christian.
Here is the race condition:
Thread A Thread B dma_buf_export fd_refcount is 1 epoll_ctl(EPOLL_ADD) add dma_buf_fd to epoll list close(dma_buf_fd) fd_refcount is 0 __fput dma_buf_poll fget fput fd_refcount is zero again
Thanks
Thanks Christian/TJ for the inputs!!
On 4/18/2024 12:16 PM, Christian König wrote:
As far as I can see the EPOLL holds a reference to the files it contains. So it is perfectly valid to add the file descriptor to EPOLL and then close it.
In this case the file won't be closed, but be kept alive by it's reference in the EPOLL file descriptor.
I am not seeing that adding a 'fd' into the epoll monitoring list will increase its refcount. Confirmed by writing a testcase that just do open + EPOLL_CTL_ADD and see the /proc/../fdinfo of epoll fd(Added file_count() info to the output) # cat /proc/136/fdinfo/3 pos: 0 flags: 02 mnt_id: 14 ino: 1041 tfd: 4 events: 19 data: 4 pos:0 ino:81 sdev:5 refcount: 1<-- The fd added to the epoll monitoring is still 1(same as open file refcount)
From the code too, I don't see a file added in the epoll monitoring list will have an extra refcount but momentarily (where it increases the refcount of target file, add it to the rbtree of the epoll context and then decreasing the file refcount): do_epoll_ctl(): f = fdget(epfd); tf = fdget(fd); epoll_mutex_lock(&ep->mtx) EPOLL_CTL_ADD: ep_insert(ep, epds, tf.file, fd, full_check); // Added to the epoll monitoring rb tree list as epitem. mutex_unlock(&ep->mtx); fdput(tf); fdput(f);
Not sure If i am completely mistaken what you're saying here.
The fs layer which calls dma_buf_poll() should make sure that the file can't go away until the function returns.
Then inside dma_buf_poll() we add another reference to the file while installing the callback:
/* Paired with fput in dma_buf_poll_cb */ get_file(dmabuf->file); No, exactly that can't happen either.
I am not quite comfortable with epoll internals but I think below race is possible where "The 'file' passed to dma_buf_poll() is proper but ->f_count == 0, which is indicating that a parallel freeing is happening". So, we should check the file->f_count value before taking the refcount.
(A 'fd' registered for the epoll monitoring list is maintained as 'epitem(epi)' in the rbtree of 'eventpoll(ep, called as epoll context).
epoll_wait() __fput()(as file->f_count=0) ------------------------------------------------------------------------ a) ep_poll_callback(): Is the waitqueue function called when signaled on the wait_queue_head_t of the registered poll() function.
1) It links the 'epi' to the ready list of 'ep': if (!ep_is_linked(epi)) list_add_tail_lockless(&epi->rdllink, &ep->rdllist)
2) wake_up(&ep->wq); Wake up the process waiting on epoll_wait() that endup in ep_poll.
b) ep_poll(): 1) while (1) { eavail = ep_events_available(ep); (checks ep->rdlist) ep_send_events(ep); (notify the events to user) } (epoll_wait() calling process gets woken up from a.2 and process the events raised added to rdlist in a.1)
2) ep_send_events(): mutex_lock(&ep->mtx); (** The sync lock is taken **) list_for_each_entry_safe(epi, tmp, &txlist, rdllink) { list_del_init(&epi->rdllink); revents = ep_item_poll(epi, &pt, 1) (vfs_poll()-->...f_op->poll(=dma_buf_poll) } mutex_unlock(&ep->mtx) (**release the lock.**)
(As part of procession of events, each epitem is removed from rdlist and call the f_op->poll() of a file which will endup in dma_buf_poll())
3) dma_buf_poll(): get_file(dmabuf->file); (** f_count changed from 0->1 **) dma_buf_poll_add_cb(resv, true, dcb): (registers dma_buf_poll_cb() against fence)
c) eventpoll_release_file(): mutex_lock(&ep->mtx); __ep_remove(ep, epi, true): mutex_unlock(&ep->mtx); (__ep_remove() will remove the 'epi' from rbtree and if present from rdlist as well)
d) file_free(file), free the 'file'.
e) dma_buf_poll_cb: /* Paired with get_file in dma_buf_poll */ fput(dmabuf->file); (f_count changed from 1->0, where we try to free the 'file' again which is UAF/double free).
In the above race, If c) gets called first, then the 'epi' is removed from both rbtree and 'rdlink' under ep->mtx lock thus b.2 don't end up in calling the ->poll() as it don't see this event in the rdlist.
Race only exist If b.2 executes first, where it will call dma_buf_poll with __valid 'struct file' under ep->mtx but its refcount is already could have been zero__. Later When e) is executed, it turns into double free of the 'file' structure.
If you're convinced with the above race, should the fix here will be this simple check: diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 8fe5aa67b167..e469dd9288cc --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -240,6 +240,10 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll) struct dma_resv *resv; __poll_t events;
+ /* Check parallel freeing of file */ + if (!file_count(file)) + return 0; +
Thanks, Charan
On Fri, May 3, 2024 at 6:40 AM Charan Teja Kalla quic_charante@quicinc.com wrote:
Thanks Christian/TJ for the inputs!!
On 4/18/2024 12:16 PM, Christian König wrote:
As far as I can see the EPOLL holds a reference to the files it contains. So it is perfectly valid to add the file descriptor to EPOLL and then close it.
In this case the file won't be closed, but be kept alive by it's reference in the EPOLL file descriptor.
I am not seeing that adding a 'fd' into the epoll monitoring list will increase its refcount. Confirmed by writing a testcase that just do open + EPOLL_CTL_ADD and see the /proc/../fdinfo of epoll fd(Added file_count() info to the output) # cat /proc/136/fdinfo/3 pos: 0 flags: 02 mnt_id: 14 ino: 1041 tfd: 4 events: 19 data: 4 pos:0 ino:81 sdev:5 refcount: 1<-- The fd added to the epoll monitoring is still 1(same as open file refcount)
From the code too, I don't see a file added in the epoll monitoring list will have an extra refcount but momentarily (where it increases the refcount of target file, add it to the rbtree of the epoll context and then decreasing the file refcount): do_epoll_ctl(): f = fdget(epfd); tf = fdget(fd); epoll_mutex_lock(&ep->mtx) EPOLL_CTL_ADD: ep_insert(ep, epds, tf.file, fd, full_check); // Added to the epoll monitoring rb tree list as epitem. mutex_unlock(&ep->mtx); fdput(tf); fdput(f);
Not sure If i am completely mistaken what you're saying here.
The fs layer which calls dma_buf_poll() should make sure that the file can't go away until the function returns.
Then inside dma_buf_poll() we add another reference to the file while installing the callback:
/* Paired with fput in dma_buf_poll_cb */ get_file(dmabuf->file); No, exactly that can't
happen either.
I am not quite comfortable with epoll internals but I think below race is possible where "The 'file' passed to dma_buf_poll() is proper but ->f_count == 0, which is indicating that a parallel freeing is happening". So, we should check the file->f_count value before taking the refcount.
(A 'fd' registered for the epoll monitoring list is maintained as 'epitem(epi)' in the rbtree of 'eventpoll(ep, called as epoll context).
epoll_wait() __fput()(as file->f_count=0)
a) ep_poll_callback(): Is the waitqueue function called when signaled on the wait_queue_head_t of the registered poll() function.
It links the 'epi' to the ready list of 'ep': if (!ep_is_linked(epi)) list_add_tail_lockless(&epi->rdllink, &ep->rdllist)
wake_up(&ep->wq); Wake up the process waiting on epoll_wait() that endup in ep_poll.
b) ep_poll(): 1) while (1) { eavail = ep_events_available(ep); (checks ep->rdlist) ep_send_events(ep); (notify the events to user) } (epoll_wait() calling process gets woken up from a.2 and process the events raised added to rdlist in a.1)
ep_send_events(): mutex_lock(&ep->mtx); (** The sync lock is taken **) list_for_each_entry_safe(epi, tmp, &txlist, rdllink) { list_del_init(&epi->rdllink); revents = ep_item_poll(epi, &pt, 1) (vfs_poll()-->...f_op->poll(=dma_buf_poll) } mutex_unlock(&ep->mtx) (**release the lock.**)
(As part of procession of events, each epitem is removed from rdlist and call the f_op->poll() of a file which will endup in dma_buf_poll())
dma_buf_poll(): get_file(dmabuf->file); (** f_count changed from 0->1 **) dma_buf_poll_add_cb(resv, true, dcb): (registers dma_buf_poll_cb() against fence)
c) eventpoll_release_file(): mutex_lock(&ep->mtx); __ep_remove(ep, epi, true): mutex_unlock(&ep->mtx); (__ep_remove() will remove the 'epi' from rbtree and if present from rdlist as well) d) file_free(file), free the 'file'.
e) dma_buf_poll_cb: /* Paired with get_file in dma_buf_poll */ fput(dmabuf->file); (f_count changed from 1->0, where we try to free the 'file' again which is UAF/double free).
In the above race, If c) gets called first, then the 'epi' is removed from both rbtree and 'rdlink' under ep->mtx lock thus b.2 don't end up in calling the ->poll() as it don't see this event in the rdlist.
Race only exist If b.2 executes first, where it will call dma_buf_poll with __valid 'struct file' under ep->mtx but its refcount is already could have been zero__. Later When e) is executed, it turns into double free of the 'file' structure.
If you're convinced with the above race, should the fix here will be this simple check: diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 8fe5aa67b167..e469dd9288cc --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -240,6 +240,10 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll) struct dma_resv *resv; __poll_t events;
/* Check parallel freeing of file */
if (!file_count(file))
return 0;
Thanks, Charan
Hi Charan,
It looks like a similar conclusion about epoll was reached at: https://lore.kernel.org/all/a87d7ef8-2c59-4dc5-ba0a-b821d1effc72@amd.com/
I agree with Christian that it should not be possible for the file to be freed while inside dma_buf_poll. Aside from causing problems in dma_buf_poll, ep_item_poll itself would have issues dereferencing the freed file pointer.
Christian's patch there makes me wonder about what the epoll man page says about closing files. "Will closing a file descriptor cause it to be removed from all epoll interest lists?" Answer: Yes https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man/man7/ep...
It looks like eventpoll_release_file is responsible for that, but if the epitem is changed to hold a reference then that can't be true anymore because __fput will never call eventpoll_release_file (until EPOLL_CTL_DEL). But how do you call EPOLL_CTL_DEL if you've closed all your references to the file in userspace? So I think epoll needs a slightly more complicated fix.
Thanks T.J for the reply!!
On 5/4/2024 4:43 AM, T.J. Mercier wrote:
It looks like a similar conclusion about epoll was reached at: https://lore.kernel.org/all/a87d7ef8-2c59-4dc5-ba0a-b821d1effc72@amd.com/
I am unaware of this discussion. Thanks...
I agree with Christian that it should not be possible for the file to be freed while inside dma_buf_poll. Aside from causing problems in dma_buf_poll, ep_item_poll itself would have issues dereferencing the freed file pointer.
Not sure about my understanding: ep_item_poll() always call the ->poll() interface with a stable 'struct file' because of ep->mtx. This lock ensures that: a) If eventpoll_release_file() get the ep->mtx first, ->poll() corresponds to the epitem(target file) will never be called, because it is removed from the rdlist.
b) If ep_send_events() get the ep->mtx() first, ->poll() will get called with a stable 'struct file', __but the refcount(->f_count) of a file can be zero__. I am saying that this is stable because the 'struct file' contents are still valid till we are in ->poll().
Can you/Christian help me with what I am missing here to say that ->poll() is receiving stale 'struct file*', please?
And, If you are convinced with above, I think, It should have been the responsibility of ->poll() implementation to have taken refcount on a file that is going to be still valid even after ->poll() exits. Incase of dma_buf_poll() implementation, it took the refcount on a file that is not going to be valid once the dma_buf_poll() exits(because of mentioned race with the freeing of the 'struct file*').
So, in dma_buf_poll(), Should we be using atomic_long_inc_not_zero() based implementation to take the refcount on a file?
Thanks, Charan
Hi TJ,
Seems I have got answers from [1], where it is agreed upon epoll() is the source of issue.
Thanks a lot for the discussion.
[1] https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Thanks Charan
On 5/5/2024 9:50 PM, Charan Teja Kalla wrote:
Thanks T.J for the reply!!
On 5/4/2024 4:43 AM, T.J. Mercier wrote:
It looks like a similar conclusion about epoll was reached at: https://lore.kernel.org/all/a87d7ef8-2c59-4dc5-ba0a-b821d1effc72@amd.com/
I am unaware of this discussion. Thanks...
I agree with Christian that it should not be possible for the file to be freed while inside dma_buf_poll. Aside from causing problems in dma_buf_poll, ep_item_poll itself would have issues dereferencing the freed file pointer.
Not sure about my understanding: ep_item_poll() always call the ->poll() interface with a stable 'struct file' because of ep->mtx. This lock ensures that: a) If eventpoll_release_file() get the ep->mtx first, ->poll() corresponds to the epitem(target file) will never be called, because it is removed from the rdlist.
b) If ep_send_events() get the ep->mtx() first, ->poll() will get called with a stable 'struct file', __but the refcount(->f_count) of a file can be zero__. I am saying that this is stable because the 'struct file' contents are still valid till we are in ->poll().
Can you/Christian help me with what I am missing here to say that ->poll() is receiving stale 'struct file*', please?
And, If you are convinced with above, I think, It should have been the responsibility of ->poll() implementation to have taken refcount on a file that is going to be still valid even after ->poll() exits. Incase of dma_buf_poll() implementation, it took the refcount on a file that is not going to be valid once the dma_buf_poll() exits(because of mentioned race with the freeing of the 'struct file*').
So, in dma_buf_poll(), Should we be using atomic_long_inc_not_zero() based implementation to take the refcount on a file?
Thanks, Charan
On Mon, May 6, 2024 at 2:30 AM Charan Teja Kalla quic_charante@quicinc.com wrote:
Hi TJ,
Seems I have got answers from [1], where it is agreed upon epoll() is the source of issue.
Thanks a lot for the discussion.
[1] https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Thanks Charan
Oh man, quite a set of threads on this over the weekend. Thanks for the link.
Am 06.05.24 um 21:04 schrieb T.J. Mercier:
On Mon, May 6, 2024 at 2:30 AM Charan Teja Kalla quic_charante@quicinc.com wrote:
Hi TJ,
Seems I have got answers from [1], where it is agreed upon epoll() is the source of issue.
Thanks a lot for the discussion.
[1] https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Thanks Charan
Oh man, quite a set of threads on this over the weekend. Thanks for the link.
Yeah and it also has some interesting side conclusion: We should probably tell people to stop using DMA-buf with epoll.
The background is that the mutex approach epoll uses to make files disappear from the interest list on close results in the fact that each file can only be part of a single epoll at a time.
Now since DMA-buf is build around the idea that we share the buffer representation as file between processes it means that only one process at a time can use epoll with each DMA-buf.
So for example if a window manager uses epoll everything is fine. If a client is using epoll everything is fine as well. But if *both* use epoll at the same time it won't work.
This can lead to rather funny and hard to debug combinations of failures and I think we need to document this limitation and explicitly point it out.
Regards, Christian.
On Tue, May 07, 2024 at 12:10:07PM +0200, Christian König wrote:
Am 06.05.24 um 21:04 schrieb T.J. Mercier:
On Mon, May 6, 2024 at 2:30 AM Charan Teja Kalla quic_charante@quicinc.com wrote:
Hi TJ,
Seems I have got answers from [1], where it is agreed upon epoll() is the source of issue.
Thanks a lot for the discussion.
[1] https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Thanks Charan
Oh man, quite a set of threads on this over the weekend. Thanks for the link.
Yeah and it also has some interesting side conclusion: We should probably tell people to stop using DMA-buf with epoll.
The background is that the mutex approach epoll uses to make files disappear from the interest list on close results in the fact that each file can only be part of a single epoll at a time.
Now since DMA-buf is build around the idea that we share the buffer representation as file between processes it means that only one process at a time can use epoll with each DMA-buf.
So for example if a window manager uses epoll everything is fine. If a client is using epoll everything is fine as well. But if *both* use epoll at the same time it won't work.
This can lead to rather funny and hard to debug combinations of failures and I think we need to document this limitation and explicitly point it out.
Ok, I tested this with a small C program, and you're mixing things up. Here's what I got
- You cannot add a file twice to the same epoll file/fd. So that part is correct, and also my understanding from reading the kernel code.
- You can add the same file to two different epoll file instaces. Which means it's totally fine to use epoll on a dma_buf in different processes like both in the compositor and in clients.
- Substantially more entertaining, you can nest epoll instances, and e.g. add a 2nd epoll file as an event to the first one. That way you can add the same file to both epoll fds, and so end up with the same file essentially being added twice to the top-level epoll file. So even within one application there's no real issue when e.g. different userspace drivers all want to use epoll on the same fd, because you can just throw in another level of epoll and it's fine again and you won't get an EEXISTS on EPOLL_CTL_ADD.
But I also don't think we have this issue right now anywhere, since it's kinda a general epoll issue that happens with any duplicated file.
So I don't think there's any reasons to recommend against using epoll on dma-buf fd (or sync_file or drm_syncobj or any of the sharing primitives we have really).
Cheers, Sima
Am 07.05.24 um 15:39 schrieb Daniel Vetter:
On Tue, May 07, 2024 at 12:10:07PM +0200, Christian König wrote:
Am 06.05.24 um 21:04 schrieb T.J. Mercier:
On Mon, May 6, 2024 at 2:30 AM Charan Teja Kalla quic_charante@quicinc.com wrote:
Hi TJ,
Seems I have got answers from [1], where it is agreed upon epoll() is the source of issue.
Thanks a lot for the discussion.
[1] https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Thanks Charan
Oh man, quite a set of threads on this over the weekend. Thanks for the link.
Yeah and it also has some interesting side conclusion: We should probably tell people to stop using DMA-buf with epoll.
The background is that the mutex approach epoll uses to make files disappear from the interest list on close results in the fact that each file can only be part of a single epoll at a time.
Now since DMA-buf is build around the idea that we share the buffer representation as file between processes it means that only one process at a time can use epoll with each DMA-buf.
So for example if a window manager uses epoll everything is fine. If a client is using epoll everything is fine as well. But if *both* use epoll at the same time it won't work.
This can lead to rather funny and hard to debug combinations of failures and I think we need to document this limitation and explicitly point it out.
Ok, I tested this with a small C program, and you're mixing things up. Here's what I got
You cannot add a file twice to the same epoll file/fd. So that part is correct, and also my understanding from reading the kernel code.
You can add the same file to two different epoll file instaces. Which means it's totally fine to use epoll on a dma_buf in different processes like both in the compositor and in clients.
Ah! Than I misunderstood that comment in the discussion. Thanks for clarifying that.
Substantially more entertaining, you can nest epoll instances, and e.g. add a 2nd epoll file as an event to the first one. That way you can add the same file to both epoll fds, and so end up with the same file essentially being added twice to the top-level epoll file. So even within one application there's no real issue when e.g. different userspace drivers all want to use epoll on the same fd, because you can just throw in another level of epoll and it's fine again and you won't get an EEXISTS on EPOLL_CTL_ADD.
But I also don't think we have this issue right now anywhere, since it's kinda a general epoll issue that happens with any duplicated file.
I actually have been telling people to (ab)use the epoll behavior to check if two file descriptors point to the same underlying file when KCMP isn't available.
Some environments (Android?) disable KCMP because they see it as security problem.
So I don't think there's any reasons to recommend against using epoll on dma-buf fd (or sync_file or drm_syncobj or any of the sharing primitives we have really).
No, that indeed seems to be fine then.
Thanks, Christian.
Cheers, Sima
On Tue, May 7, 2024 at 7:04 AM Christian König christian.koenig@amd.com wrote:
Am 07.05.24 um 15:39 schrieb Daniel Vetter:
On Tue, May 07, 2024 at 12:10:07PM +0200, Christian König wrote:
Am 06.05.24 um 21:04 schrieb T.J. Mercier:
On Mon, May 6, 2024 at 2:30 AM Charan Teja Kalla quic_charante@quicinc.com wrote:
Hi TJ,
Seems I have got answers from [1], where it is agreed upon epoll() is the source of issue.
Thanks a lot for the discussion.
[1] https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Thanks Charan
Oh man, quite a set of threads on this over the weekend. Thanks for the link.
Yeah and it also has some interesting side conclusion: We should probably tell people to stop using DMA-buf with epoll.
The background is that the mutex approach epoll uses to make files disappear from the interest list on close results in the fact that each file can only be part of a single epoll at a time.
Now since DMA-buf is build around the idea that we share the buffer representation as file between processes it means that only one process at a time can use epoll with each DMA-buf.
So for example if a window manager uses epoll everything is fine. If a client is using epoll everything is fine as well. But if *both* use epoll at the same time it won't work.
This can lead to rather funny and hard to debug combinations of failures and I think we need to document this limitation and explicitly point it out.
Ok, I tested this with a small C program, and you're mixing things up. Here's what I got
You cannot add a file twice to the same epoll file/fd. So that part is correct, and also my understanding from reading the kernel code.
You can add the same file to two different epoll file instaces. Which means it's totally fine to use epoll on a dma_buf in different processes like both in the compositor and in clients.
Ah! Than I misunderstood that comment in the discussion. Thanks for clarifying that.
Substantially more entertaining, you can nest epoll instances, and e.g. add a 2nd epoll file as an event to the first one. That way you can add the same file to both epoll fds, and so end up with the same file essentially being added twice to the top-level epoll file. So even within one application there's no real issue when e.g. different userspace drivers all want to use epoll on the same fd, because you can just throw in another level of epoll and it's fine again and you won't get an EEXISTS on EPOLL_CTL_ADD.
But I also don't think we have this issue right now anywhere, since it's kinda a general epoll issue that happens with any duplicated file.
I actually have been telling people to (ab)use the epoll behavior to check if two file descriptors point to the same underlying file when KCMP isn't available.
Some environments (Android?) disable KCMP because they see it as security problem.
Didn't know about this so I checked. Our kernel has CONFIG_KCMP, but seccomp does look like it's blocking kcmp for apps. https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc...
On Tue, May 7, 2024 at 11:17 AM T.J. Mercier tjmercier@google.com wrote:
On Tue, May 7, 2024 at 7:04 AM Christian König christian.koenig@amd.com wrote:
Am 07.05.24 um 15:39 schrieb Daniel Vetter:
On Tue, May 07, 2024 at 12:10:07PM +0200, Christian König wrote:
Am 06.05.24 um 21:04 schrieb T.J. Mercier:
On Mon, May 6, 2024 at 2:30 AM Charan Teja Kalla quic_charante@quicinc.com wrote:
Hi TJ,
Seems I have got answers from [1], where it is agreed upon epoll() is the source of issue.
Thanks a lot for the discussion.
[1] https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Thanks Charan
Oh man, quite a set of threads on this over the weekend. Thanks for the link.
Yeah and it also has some interesting side conclusion: We should probably tell people to stop using DMA-buf with epoll.
The background is that the mutex approach epoll uses to make files disappear from the interest list on close results in the fact that each file can only be part of a single epoll at a time.
Now since DMA-buf is build around the idea that we share the buffer representation as file between processes it means that only one process at a time can use epoll with each DMA-buf.
So for example if a window manager uses epoll everything is fine. If a client is using epoll everything is fine as well. But if *both* use epoll at the same time it won't work.
This can lead to rather funny and hard to debug combinations of failures and I think we need to document this limitation and explicitly point it out.
Ok, I tested this with a small C program, and you're mixing things up. Here's what I got
You cannot add a file twice to the same epoll file/fd. So that part is correct, and also my understanding from reading the kernel code.
You can add the same file to two different epoll file instaces. Which means it's totally fine to use epoll on a dma_buf in different processes like both in the compositor and in clients.
Ah! Than I misunderstood that comment in the discussion. Thanks for clarifying that.
Substantially more entertaining, you can nest epoll instances, and e.g. add a 2nd epoll file as an event to the first one. That way you can add the same file to both epoll fds, and so end up with the same file essentially being added twice to the top-level epoll file. So even within one application there's no real issue when e.g. different userspace drivers all want to use epoll on the same fd, because you can just throw in another level of epoll and it's fine again and you won't get an EEXISTS on EPOLL_CTL_ADD.
But I also don't think we have this issue right now anywhere, since it's kinda a general epoll issue that happens with any duplicated file.
I actually have been telling people to (ab)use the epoll behavior to check if two file descriptors point to the same underlying file when KCMP isn't available.
Some environments (Android?) disable KCMP because they see it as security problem.
Didn't know about this so I checked. Our kernel has CONFIG_KCMP, but seccomp does look like it's blocking kcmp for apps. https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc...
Getting a bit off the original topic, but fwiw this is what CrOS did for CONFIG_KCMP:
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/3...
Ie. allow the kcmp syscall, but block type == KCMP_SYSVSEM, which was the more specific thing that android wanted to block.
BR, -R
From: Christian König
Sent: 07 May 2024 15:05
...
I actually have been telling people to (ab)use the epoll behavior to check if two file descriptors point to the same underlying file when KCMP isn't available.
In what way? You can add both fd to the same epoll fd. Relying on the implicit EPOLL_CTL_DEL not happening until both fd are closed is a recipe for disaster. (And I can't see an obvious way of testing it.)
Q6/A6 on epoll(7) should always have had a caveat that it is an 'implementation detail' and shouldn't be relied on. (it is written as a 'beware of' ...)
The other point is that there are two ways to get multiple fd that reference the same underlying file. dup() fork() etc share the file offset, but open("/dev/fd/n") adds a reference count later and has a separate file offset.
I don't know which structure epoll is using, but I suspect it is the former. So it may not tell you what you want to know.
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Am 08.05.24 um 13:51 schrieb David Laight:
From: Christian König
Sent: 07 May 2024 15:05
...
I actually have been telling people to (ab)use the epoll behavior to check if two file descriptors point to the same underlying file when KCMP isn't available.
In what way?
Something like this:
fd_e = epoll_create1(EPOLL_CLOEXEC);
tmp = dup(fd_A) epoll_ctl(fd_e, EPOLL_CTL_ADD, tmp, ....); dup2(fd_B, tmp);
/* If this return -EEXISTS then the fd_A and fd_B are pointing to the same struct file */ epoll_ctl(fd_e, EPOLL_CTL_ADD, tmp, ....);
close (tmp); close (fd_e
You can add both fd to the same epoll fd. Relying on the implicit EPOLL_CTL_DEL not happening until both fd are closed is a recipe for disaster. (And I can't see an obvious way of testing it.)
Q6/A6 on epoll(7) should always have had a caveat that it is an 'implementation detail' and shouldn't be relied on. (it is written as a 'beware of' ...)
The other point is that there are two ways to get multiple fd that reference the same underlying file. dup() fork() etc share the file offset, but open("/dev/fd/n") adds a reference count later and has a separate file offset.
No it doesn't.
Accessing /dev/fd/n or /proc/*/fd/n ideally accesses the same inode, but gives you a new struct file.
dup(), fork() etc.. make you actually reference the same struct file inside the kernel.
That turned out to be a rather important distinction when working with device drivers and DMA-buf.
Regards, Christian.
I don't know which structure epoll is using, but I suspect it is the former. So it may not tell you what you want to know.
David
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
linaro-mm-sig@lists.linaro.org