The commit <1b6b26ae7053>("pipe: fix and clarify pipe write wakeup logic") changed pipe write logic to wakeup readers only if the pipe was empty at the time of write. However, there are libraries that relied upon the older behavior for notification scheme similar to what's described in [1]
One such library 'realm-core'[2] is used by numerous Android applications. The library uses a similar notification mechanism as GNU Make but it never drains the pipe until it is full. When Android moved to v5.10 kernel, all applications using this library stopped working. The C program at the end of this email mimics the library code.
The program works with 5.4 kernel. It fails with v5.10 and I am fairly certain it will fail wiht v5.5 as well. The single patch in this series restores the old behavior. With the patch, the test and all affected Android applications start working with v5.10
After reading through epoll(7), I think the pipe should be drained after each epoll_wait() comes back. Also, that a non-empty pipe is considered to be "ready" for readers. The problem is that prior to the commit above, any new data written to non-empty pipes would wakeup threads waiting in epoll(EPOLLIN|EPILLET) and thats how this library worked.
I do think the program below is using EPOLLET wrong. However, it used to work before and now it doesn't. So, I thought it is worth asking if this counts as userspace break.
There was also a symmetrical change made to pipe_read in commit <f467a6a66419> ("pipe: fix and clarify pipe read wakeup logic") that I am not sure needs changing.
The library has since been fixed[3] but it will be a while before all applications incorporate the updated library.
- ssp
1. https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0b... 2. https://github.com/realm/realm-core 3. https://github.com/realm/realm-core/issues/4666
==== #include <stdio.h> #include <error.h> #include <errno.h> #include <fcntl.h> #include <pthread.h> #include <stdlib.h> #include <unistd.h> #include <time.h> #include <sys/epoll.h> #include <sys/stat.h> #include <sys/types.h>
#define FIFO_NAME "epoll-test-fifo"
pthread_t tid; int max_delay_ms = 20; int fifo_fd; unsigned char written; unsigned char received; int epoll_fd;
void *wait_on_fifo(void *unused) { while (1) { struct epoll_event ev; int ret; unsigned char c;
ret = epoll_wait(epoll_fd, &ev, 1, 5000); if (ret == -1) { /* If interrupted syscall, continue .. */ if (errno == EINTR) continue; /* epoll_wait failed, bail.. */ error(99, errno, "epoll_wait failed \n"); }
/* timeout */ if (ret == 0) break;
if (ev.data.fd == fifo_fd) { /* Assume this is notification where the thread is catching up. * pipe is emptied by the writer when it detects it is full. */ received = written; } }
return NULL; }
int write_fifo(int fd, unsigned char c) { while (1) { int actual; char buf[1024];
ssize_t ret = write(fd, &c, 1); if (ret == 1) break; /* * If the pipe's buffer is full, we need to read some of the old data in * it to make space. We dont read in the code waiting for * notifications so that we can notify multiple waiters with a single * write. */ if (ret != 0) { if (errno != EAGAIN) return -EIO; } actual = read(fd, buf, 1024); if (actual == 0) return -errno; }
return 0; }
int create_and_setup_fifo() { int ret; char fifo_path[4096]; struct epoll_event ev;
char *tmpdir = getenv("TMPDIR"); if (tmpdir == NULL) tmpdir = ".";
ret = sprintf(fifo_path, "%s/%s", tmpdir, FIFO_NAME); if (access(fifo_path, F_OK) == 0) unlink(fifo_path);
ret = mkfifo(fifo_path, 0600); if (ret < 0) error(1, errno, "Failed to create fifo");
fifo_fd = open(fifo_path, O_RDWR | O_NONBLOCK); if (fifo_fd < 0) error(2, errno, "Failed to open Fifo");
ev.events = EPOLLIN | EPOLLET; ev.data.fd = fifo_fd;
ret = epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fifo_fd, &ev); if (ret < 0) error(4, errno, "Failed to add fifo to epoll instance");
return 0; }
int main(int argc, char *argv[]) { int ret, random; unsigned char c = 1;
epoll_fd = epoll_create(1); if (epoll_fd == -1) error(3, errno, "Failed to create epoll instance");
ret = create_and_setup_fifo(); if (ret != 0) error(45, EINVAL, "Failed to setup fifo");
ret = pthread_create(&tid, NULL, wait_on_fifo, NULL); if (ret != 0) error(2, errno, "Failed to create a thread");
srand(time(NULL));
/* Write 256 bytes to fifo one byte at a time with random delays upto 20ms */ do { written = c; ret = write_fifo(fifo_fd, c); if (ret != 0) error(55, errno, "Failed to notify fifo, write #%u", (unsigned int)c); c++;
random = rand(); usleep((random % max_delay_ms) * 1000); } while (written <= c); /* stop after c = 255 */
pthread_join(tid, NULL);
printf("Test: %s", written == received ? "PASS\n" : "FAIL"); if (written != received) printf(": Written (%d) Received (%d)\n", written, received);
close(fifo_fd); close(epoll_fd);
return 0; } ====
Sandeep Patil (1): fs: pipe: wakeup readers everytime new data written is to pipe
fs/pipe.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
commit '1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic")' changed pipe_write() to wakeup readers only if the pipe was empty. Prior to this change, threads waiting in epoll_wait(EPOLLET | EPOLLIN) on non-empty pipes would get woken up on new data.
It meant an applications that, 1. used pipe + epoll for notifications between threads / processes 2. Didn't drain the pipe on each epoll wakeup unless the pipe was full started to experience hang / timeouts in threads stuck in epoll_wait()
So restore the old behavior to wakeup all readers if any new data is written to the pipe.
Fixes: 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic") Signed-off-by: Sandeep Patil sspatil@android.com --- fs/pipe.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c index bfd946a9ad01..dda22a316bb3 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -406,7 +406,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) ssize_t ret = 0; size_t total_len = iov_iter_count(from); ssize_t chars; - bool was_empty = false; + bool do_wakeup = false; bool wake_next_writer = false;
/* Null write succeeds. */ @@ -429,10 +429,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) #endif
/* - * Only wake up if the pipe started out empty, since - * otherwise there should be no readers waiting. + * Wake up readers if the pipe was written to. Regardless + * of whether it was empty or not. Otherwise, threads + * waiting with EPOLLET will hang until the pipe is emptied. * - * If it wasn't empty we try to merge new data into + * If pipe wasn't empty we try to merge new data into * the last buffer. * * That naturally merges small writes, but it also @@ -440,9 +441,8 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * spanning multiple pages. */ head = pipe->head; - was_empty = pipe_empty(head, pipe->tail); chars = total_len & (PAGE_SIZE-1); - if (chars && !was_empty) { + if (chars && !pipe_empty(head, pipe->tail)) { unsigned int mask = pipe->ring_size - 1; struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask]; int offset = buf->offset + buf->len; @@ -460,6 +460,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) }
buf->len += ret; + do_wakeup = true; if (!iov_iter_count(from)) goto out; } @@ -526,6 +527,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) ret += copied; buf->offset = 0; buf->len = copied; + do_wakeup = true;
if (!iov_iter_count(from)) break; @@ -553,13 +555,12 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * become empty while we dropped the lock. */ __pipe_unlock(pipe); - if (was_empty) { + if (do_wakeup) { wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); __pipe_lock(pipe); - was_empty = pipe_empty(pipe->head, pipe->tail); wake_next_writer = true; } out: @@ -576,7 +577,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * how (for example) the GNU make jobserver uses small writes to * wake up pending jobs */ - if (was_empty) { + if (do_wakeup) { wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); }
On Thu, Jul 29, 2021 at 3:27 PM Sandeep Patil sspatil@android.com wrote:
So restore the old behavior to wakeup all readers if any new data is written to the pipe.
Ah-hahh.
I've had this slightly smaller patch waiting for the better part of a year:
https://lore.kernel.org/lkml/CAHk-=wgjR7Nd4CyDoi3SH9kPJp_Td9S-hhFJZMqvp6GS1W...
waiting to see if some broken program actually depends on the bogus epollet semantics.
Can you verify that that patch fixes the realm-core brokenness too?
Linus
On 7/29/21 11:01 PM, Linus Torvalds wrote:
On Thu, Jul 29, 2021 at 3:27 PM Sandeep Patil sspatil@android.com wrote:
So restore the old behavior to wakeup all readers if any new data is written to the pipe.
Ah-hahh.
I've had this slightly smaller patch waiting for the better part of a year:
https://lore.kernel.org/lkml/CAHk-=wgjR7Nd4CyDoi3SH9kPJp_Td9S-hhFJZMqvp6GS1W...
waiting to see if some broken program actually depends on the bogus epollet semantics.
Can you verify that that patch fixes the realm-core brokenness too?
Yes, your patch fixes all apps on Android I can test that include this library.
fwiw, the library seems to have been fixed. However, I am not sure how long it will be for all apps to take that update :(.
- ssp
On Fri, Jul 30, 2021 at 12:11 PM Sandeep Patil sspatil@android.com wrote:
Yes, your patch fixes all apps on Android I can test that include this library.
Ok, thanks for checking.
fwiw, the library seems to have been fixed. However, I am not sure how long it will be for all apps to take that update :(.
I wonder if I could make the wakeup logic do this only for the epollet case.
I'll have to think about it, but maybe I'll just apply that simple patch. I dislike the pointless wakeups, and as long as the only case I knew of was only a test of broken behavior, it was fine. But now that you've reported actual application breakage, this is in the "real regression" category, and so I'll fix it one way or the other.
And on the other hand I also have a slight preference towards your patch simply because you did the work of finding this out, so you should get the credit.
I'll mull it over a bit more, but whatever I'll do I'll do before rc4 and mark it for stable.
Thanks for testing,
Linus
On 7/30/21 7:23 PM, Linus Torvalds wrote:
On Fri, Jul 30, 2021 at 12:11 PM Sandeep Patil sspatil@android.com wrote:
Yes, your patch fixes all apps on Android I can test that include this library.
Ok, thanks for checking.
fwiw, the library seems to have been fixed. However, I am not sure how long it will be for all apps to take that update :(.
I wonder if I could make the wakeup logic do this only for the epollet case.
aren't we supposed to wakeup on each write in level-triggered (default) case though?
I'll have to think about it, but maybe I'll just apply that simple patch. I dislike the pointless wakeups, and as long as the only case I knew of was only a test of broken behavior, it was fine. But now that you've reported actual application breakage, this is in the "real regression" category, and so I'll fix it one way or the other.
And on the other hand I also have a slight preference towards your patch simply because you did the work of finding this out, so you should get the credit.
Ha, I can't really claim credit here. This was also reported to us in Android that triggered the search. Plus, now that I see your thread with Michael Kerrisk, he was way ahead of us in finding this out.
I'll mull it over a bit more, but whatever I'll do I'll do before rc4 and mark it for stable.
Thanks, I was actually going to suggest taking your patch cause it also makes changes in pipe_read(). I am not sure if there are apps that do EPOLLET | EPOLLOUT (can't think of a reason why).
- ssp
Thanks for testing,
Linus
On Fri, Jul 30, 2021 at 12:47 PM Sandeep Patil sspatil@android.com wrote:
aren't we supposed to wakeup on each write in level-triggered (default) case though?
No.
The thing about level triggered is that if the condition was already true, it would not need a wakeup in the first place.
Put another way: select() and poll() are both fundamentally level-triggered. If the condition was already true, they will return success immediately, and don't need any extraneous wakeups.
This is literally an epoll() confusion about what an "edge" is.
An edge is not "somebody wrote more data". An edge is "there was no data, now there is data".
And a level triggered event is *also* not "somebody wrote more data". A level-triggered signal is simply "there is data".
Notice how neither edge nor level are about "more data". One is about the edge of "no data" -> "some data", and the other is just a "data is available".
Sadly, it seems that our old "we'll wake things up whether needed or not" implementation ended up being something that people thought was edge-triggered semantics.
But we have the policy that regressions aren't about documentation or even sane behavior.
Regressions are about whether a user application broke in a noticeable way.
Linus
On Fri, Jul 30, 2021 at 12:23 PM Linus Torvalds torvalds@linux-foundation.org wrote:
I'll mull it over a bit more, but whatever I'll do I'll do before rc4 and mark it for stable.
Ok, I ended up committing the minimal possible change (and fixing up the comment above it).
It's very much *not* the original behavior either, but that original behavior was truly insane ("wake up for each hunk written"), and I'm trying to at least keep the kernel code from doing actively stupid things.
Since that old patch of mine worked for your test-case, then clearly that realm-core library didn't rely on _that_ kind of insane internal kernel implementation details exposed as semantics. So The minimal patch basically says "each write() system call wil do at least one wake-up, whether really necessary or not".
I also intentionally kept the read side untouched, in that there apparently still isn't a case that would need the confused semantics for read events.
End result: the commit message is a lot bigger than the patch, with most of it being trying to explain the background.
I've pushed it out as commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers"). Holler if you notice anything odd remaining.
Linus
On Fri, Jul 30, 2021 at 3:53 PM Linus Torvalds torvalds@linux-foundation.org wrote:
I've pushed it out as commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers"). Holler if you notice anything odd remaining.
.. and as I wrote that, I realized that I had forgotten to mark it for stable even though that was the intent.
So stable team, please consider this the "that commit should probably be added to your queue" notification,
Linus
On Fri, Jul 30, 2021 at 03:55:17PM -0700, Linus Torvalds wrote:
On Fri, Jul 30, 2021 at 3:53 PM Linus Torvalds torvalds@linux-foundation.org wrote:
I've pushed it out as commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers"). Holler if you notice anything odd remaining.
.. and as I wrote that, I realized that I had forgotten to mark it for stable even though that was the intent.
So stable team, please consider this the "that commit should probably be added to your queue" notification,
Will do so, thanks!
greg k-h
On 7/30/21 10:53 PM, Linus Torvalds wrote:
On Fri, Jul 30, 2021 at 12:23 PM Linus Torvalds torvalds@linux-foundation.org wrote:
I'll mull it over a bit more, but whatever I'll do I'll do before rc4 and mark it for stable.
Ok, I ended up committing the minimal possible change (and fixing up the comment above it).
It's very much *not* the original behavior either, but that original behavior was truly insane ("wake up for each hunk written"), and I'm trying to at least keep the kernel code from doing actively stupid things.
Since that old patch of mine worked for your test-case, then clearly that realm-core library didn't rely on _that_ kind of insane internal kernel implementation details exposed as semantics. So The minimal patch basically says "each write() system call wil do at least one wake-up, whether really necessary or not".
I also intentionally kept the read side untouched, in that there apparently still isn't a case that would need the confused semantics for read events.
End result: the commit message is a lot bigger than the patch, with most of it being trying to explain the background.
I've pushed it out as commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers"). Holler if you notice anything odd remaining.
Since what you merged isn't different than what I tested, I don't expect any surprises but I will test it regardless. I will come back if I see anything unexpected.
Thanks for the explanation about the default behavior earlier in the thread.
- ssp
linux-stable-mirror@lists.linaro.org