On Fri, Mar 30, 2018 at 08:37:45PM +0300, Michael S. Tsirkin wrote:
get_user_pages_fast is supposed to be a faster drop-in equivalent of get_user_pages. As such, callers expect it to return a negative return code when passed an invalid address, and never expect it to return 0 when passed a positive number of pages, since its documentation says:
- Returns number of pages pinned. This may be fewer than the number
- requested. If nr_pages is 0 or negative, returns 0. If no pages
- were pinned, returns -errno.
Unfortunately this is not what the implementation does: it returns 0 if passed a kernel address, confusing callers: for example, the following is pretty common but does not appear to do the right thing with a kernel address:
ret = get_user_pages_fast(addr, 1, writeable, &page); if (ret < 0) return ret;
Change get_user_pages_fast to return -EFAULT when supplied a kernel address to make it match expectations.
__get_user_pages_fast does not seem to be used like this, but let's change __get_user_pages_fast as well for consistency and to match documentation.
Lightly tested.
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Huang Ying ying.huang@intel.com Cc: Jonathan Corbet corbet@lwn.net Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Thorsten Leemhuis regressions@leemhuis.info Cc: stable@vger.kernel.org Fixes: 5b65c4677a57 ("mm, x86/mm: Fix performance regression in get_user_pages_fast()") Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com Signed-off-by: Michael S. Tsirkin mst@redhat.com
Any feedback on this? As this fixes a bug in vhost, I'll merge through the vhost tree unless someone objects.
mm/gup.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c index 6afae32..5642521 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1749,6 +1749,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, unsigned long flags; int nr = 0;
- if (nr_pages <= 0)
return 0;
- start &= PAGE_MASK; addr = start; len = (unsigned long) nr_pages << PAGE_SHIFT;
@@ -1756,7 +1759,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ, (void __user *)start, len)))
return 0;
return -EFAULT;
/* * Disable interrupts. We use the nested form as we can already have @@ -1806,9 +1809,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write, len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len;
- if (nr_pages <= 0)
return 0;
- if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ, (void __user *)start, len)))
return 0;
return -EFAULT;
if (gup_fast_permitted(start, nr_pages, write)) { local_irq_disable(); -- MST
On Wed, Apr 4, 2018 at 6:53 PM, Michael S. Tsirkin mst@redhat.com wrote:
Any feedback on this? As this fixes a bug in vhost, I'll merge through the vhost tree unless someone objects.
NAK.
__get_user_pages_fast() returns the number of pages it gets.
It has never returned an error code, and all the other versions of it (architecture-specific) don't either.
If you ask for one page, and get zero pages, then that's an -EFAULT. Note that that's an EFAULT regardless of whether that zero page happened due to kernel addresses or just lack of mapping in user space.
The documentation is simply wrong if it says anything else. Fix the docs, and fix the users.
The correct use has always been to check the number of pages returned.
Just looking around, returning an error number looks like it could seriously confuse some things. You have things like the kvm code that does the *right* thing:
unsigned long ... npinned ...
npinned = get_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages); if (npinned != npages) { ...
err: if (npinned > 0) release_pages(pages, npinned);
and the above code clearly depends on the actual behavior, not on the documentation.
Any changes in this area would need some *extreme* care, exactly because of code like the above that clearly depends on the existing semantics.
In fact, the documentation really seems to be just buggy. The actual get_user_pages() function itself is expressly being careful *not* to return an error code, it even has a comment to the effect ("Have to be a bit careful with return values").
So the "If no pages were pinned, returns -errno" comment is just bogus.
Linus
On Wed, Apr 04, 2018 at 07:40:36PM -0700, Linus Torvalds wrote:
On Wed, Apr 4, 2018 at 6:53 PM, Michael S. Tsirkin mst@redhat.com wrote:
Any feedback on this? As this fixes a bug in vhost, I'll merge through the vhost tree unless someone objects.
NAK.
__get_user_pages_fast() returns the number of pages it gets.
It has never returned an error code, and all the other versions of it (architecture-specific) don't either.
Thanks Linus. I can change the docs and all the callers.
I wonder however whether all the following should be changed then:
static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
...
if (!vma || check_vma_flags(vma, gup_flags)) return i ? : -EFAULT;
is this a bug in __get_user_pages?
Another example:
ret = get_gate_page(mm, start & PAGE_MASK, gup_flags, &vma, pages ? &pages[i] : NULL); if (ret) return i ? : ret;
and ret is -EFAULT on error.
Another example: switch (ret) { case 0: goto retry; case -EFAULT: case -ENOMEM: case -EHWPOISON: return i ? i : ret; case -EBUSY: return i; case -ENOENT: goto next_page; }
it looks like this will return -EFAULT/-ENOMEM/-EHWPOISON if i is 0.
If you ask for one page, and get zero pages, then that's an -EFAULT. Note that that's an EFAULT regardless of whether that zero page happened due to kernel addresses or just lack of mapping in user space.
The documentation is simply wrong if it says anything else. Fix the docs, and fix the users.
The correct use has always been to check the number of pages returned.
Just looking around, returning an error number looks like it could seriously confuse some things.
You have things like the kvm code that does the *right* thing:
unsigned long ... npinned ... npinned = get_user_pages_fast(uaddr, npages, write ?
FOLL_WRITE : 0, pages); if (npinned != npages) { ...
err: if (npinned > 0) release_pages(pages, npinned);
and the above code clearly depends on the actual behavior, not on the documentation.
This seems to work fine with my patch since it only changes the case where npinned == 0.
Any changes in this area would need some *extreme* care, exactly because of code like the above that clearly depends on the existing semantics.
In fact, the documentation really seems to be just buggy. The actual get_user_pages() function itself is expressly being careful *not* to return an error code, it even has a comment to the effect ("Have to be a bit careful with return values").
So the "If no pages were pinned, returns -errno" comment is just bogus.
Linus
I'd like to change the doc then, but it seems that I'll have to change the implementation in that case too.
On Thu, Apr 5, 2018 at 7:17 AM, Michael S. Tsirkin mst@redhat.com wrote:
I wonder however whether all the following should be changed then:
static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
...
if (!vma || check_vma_flags(vma, gup_flags)) return i ? : -EFAULT;
is this a bug in __get_user_pages?
Note the difference between "get_user_pages()", and "get_user_pages_fast()".
It's the *fast* versions that just return the number of pages pinned.
The non-fast ones will return an error code for various cases.
Why?
The non-fast cases actually *have* various error cases. They can block and get interrupted etc.
The fast cases are basically "just get me the pages, dammit, and if you can't get some page, stop".
At least that's one excuse for the difference in behavior.
The real excuse is probably just "that's how it worked" - the fast case just walked the page tables and that was it.
Linus
On Thu, Apr 05, 2018 at 08:40:05AM -0700, Linus Torvalds wrote:
On Thu, Apr 5, 2018 at 7:17 AM, Michael S. Tsirkin mst@redhat.com wrote:
I wonder however whether all the following should be changed then:
static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
...
if (!vma || check_vma_flags(vma, gup_flags)) return i ? : -EFAULT;
is this a bug in __get_user_pages?
Note the difference between "get_user_pages()", and "get_user_pages_fast()".
It's the *fast* versions that just return the number of pages pinned.
The non-fast ones will return an error code for various cases.
Why?
The non-fast cases actually *have* various error cases. They can block and get interrupted etc.
The fast cases are basically "just get me the pages, dammit, and if you can't get some page, stop".
At least that's one excuse for the difference in behavior.
The real excuse is probably just "that's how it worked" - the fast case just walked the page tables and that was it.
Linus
I see, thanks for the clarification Linus.
to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't pin any pages and that is by design. Returning 0 on error isn't usual I think so I guess this behaviour should we well documented.
That part of my patch was wrong and should be replaced with a doc update.
What about get_user_pages_fast though? That's the other part of the patch. Right now get_user_pages_fast does:
ret = get_user_pages_unlocked(start, nr_pages - nr, pages, write ? FOLL_WRITE : 0);
/* Have to be a bit careful with return values */ if (nr > 0) { if (ret < 0) ret = nr; else ret += nr; }
so an error on the 1st page gets propagated to the caller, and that get_user_pages_unlocked eventually calls __get_user_pages so it does return an error sometimes.
Would it be correct to apply the second part of the patch then (pasted below for reference) or should get_user_pages_fast and all its callers be changed to return 0 on error instead?
@@ -1806,9 +1809,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write, len = (unsigned long) nr_pages << PAGE_SHIFT; end = start + len;
+ if (nr_pages <= 0) + return 0; + if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ, (void __user *)start, len))) - return 0; + return -EFAULT;
if (gup_fast_permitted(start, nr_pages, write)) { local_irq_disable();
On Thu, Apr 5, 2018 at 11:28 AM, Michael S. Tsirkin mst@redhat.com wrote:
to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't pin any pages and that is by design. Returning 0 on error isn't usual I think so I guess this behaviour should we well documented.
Arguably it happens elsewhere too, and not just in the kernel. "read()" at past the end of a file is not an error, you'll just get 0 for EOF.
So it's not really "returning 0 on error".
It really is simply returning the number of pages it got. End of story. That number of pages can be smaller than the requested number of pages, and _that_ is due to some error, but note how it can return "5" on error too - you asked for 10 pages, but the error happened in the middle!
So the right way to check for error is to bverify that you get the number of pages that you asked for. If you don't, something bad happened.
Of course, many users don't actually care about "I didn't get everything". They only care about "did I get _something_". Then that 0 ends up being the error case, but note how it depends on the caller.
What about get_user_pages_fast though?
We do seem to special-case the first page there. I'm not sure it's a good idea. But like the __get_user_pages_fast(), we seem to have users that know about the particular semantics and depend on it.
It's all ugly, I agree.
End result: we can't just change semantics of either of them.
At least not without going through every single user and checking that they are ok with it.
Which I guess I could be ok with. Maybe changing the semantics of __get_user_pages_fast() is acceptable, if you just change it *everywhere* (which includes not just he users, but also the couple of architecture-specific versions of that same function that we have.
Linus
On Thu, Apr 05, 2018 at 11:43:27AM -0700, Linus Torvalds wrote:
On Thu, Apr 5, 2018 at 11:28 AM, Michael S. Tsirkin mst@redhat.com wrote:
to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't pin any pages and that is by design. Returning 0 on error isn't usual I think so I guess this behaviour should we well documented.
Arguably it happens elsewhere too, and not just in the kernel. "read()" at past the end of a file is not an error, you'll just get 0 for EOF.
So it's not really "returning 0 on error".
It really is simply returning the number of pages it got. End of story. That number of pages can be smaller than the requested number of pages, and _that_ is due to some error, but note how it can return "5" on error too - you asked for 10 pages, but the error happened in the middle!
So the right way to check for error is to bverify that you get the number of pages that you asked for. If you don't, something bad happened.
Of course, many users don't actually care about "I didn't get everything". They only care about "did I get _something_". Then that 0 ends up being the error case, but note how it depends on the caller.
What about get_user_pages_fast though?
We do seem to special-case the first page there. I'm not sure it's a good idea. But like the __get_user_pages_fast(), we seem to have users that know about the particular semantics and depend on it.
It's all ugly, I agree.
End result: we can't just change semantics of either of them.
At least not without going through every single user and checking that they are ok with it.
Which I guess I could be ok with. Maybe changing the semantics of __get_user_pages_fast() is acceptable, if you just change it *everywhere* (which includes not just he users, but also the couple of architecture-specific versions of that same function that we have.
Linus
OK I hope I understood what you are saying here.
At least drivers/gpu/drm/i915/i915_gem_userptr.c seems to get it wrong:
pinned = __get_user_pages_fast(obj->userptr.ptr,
if (pinned < 0) { pages = ERR_PTR(pinned); pinned = 0; } else if (pinned < num_pages) { pages = __i915_gem_userptr_get_pages_schedule(obj); active = pages == ERR_PTR(-EAGAIN); } else { pages = __i915_gem_userptr_alloc_pages(obj, pvec, num_pages); active = !IS_ERR(pages); }
The <0 path is never taken.
Cc maintainers - should that driver be changed to use get_user_pages_fast?
Quoting Michael S. Tsirkin (2018-04-05 20:34:08)
On Thu, Apr 05, 2018 at 11:43:27AM -0700, Linus Torvalds wrote:
On Thu, Apr 5, 2018 at 11:28 AM, Michael S. Tsirkin mst@redhat.com wrote:
to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't pin any pages and that is by design. Returning 0 on error isn't usual I think so I guess this behaviour should we well documented.
Arguably it happens elsewhere too, and not just in the kernel. "read()" at past the end of a file is not an error, you'll just get 0 for EOF.
So it's not really "returning 0 on error".
It really is simply returning the number of pages it got. End of story. That number of pages can be smaller than the requested number of pages, and _that_ is due to some error, but note how it can return "5" on error too - you asked for 10 pages, but the error happened in the middle!
So the right way to check for error is to bverify that you get the number of pages that you asked for. If you don't, something bad happened.
Of course, many users don't actually care about "I didn't get everything". They only care about "did I get _something_". Then that 0 ends up being the error case, but note how it depends on the caller.
What about get_user_pages_fast though?
We do seem to special-case the first page there. I'm not sure it's a good idea. But like the __get_user_pages_fast(), we seem to have users that know about the particular semantics and depend on it.
It's all ugly, I agree.
End result: we can't just change semantics of either of them.
At least not without going through every single user and checking that they are ok with it.
Which I guess I could be ok with. Maybe changing the semantics of __get_user_pages_fast() is acceptable, if you just change it *everywhere* (which includes not just he users, but also the couple of architecture-specific versions of that same function that we have.
Linus
OK I hope I understood what you are saying here.
At least drivers/gpu/drm/i915/i915_gem_userptr.c seems to get it wrong:
pinned = __get_user_pages_fast(obj->userptr.ptr, if (pinned < 0) { pages = ERR_PTR(pinned); pinned = 0; } else if (pinned < num_pages) { pages = __i915_gem_userptr_get_pages_schedule(obj); active = pages == ERR_PTR(-EAGAIN); } else { pages = __i915_gem_userptr_alloc_pages(obj, pvec, num_pages); active = !IS_ERR(pages); }
The <0 path is never taken.
Please note that it only recently lost other paths that set an error beforehand. Not exactly wrong since the short return is expected and handled.
Cc maintainers - should that driver be changed to use get_user_pages_fast?
It's not allowed to fault. __gup_fast has that comment, gup_fast does not. -Chris
On Thu, Apr 05, 2018 at 11:43:27AM -0700, Linus Torvalds wrote:
On Thu, Apr 5, 2018 at 11:28 AM, Michael S. Tsirkin mst@redhat.com wrote:
to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't pin any pages and that is by design. Returning 0 on error isn't usual I think so I guess this behaviour should we well documented.
Arguably it happens elsewhere too, and not just in the kernel. "read()" at past the end of a file is not an error, you'll just get 0 for EOF.
So it's not really "returning 0 on error".
It really is simply returning the number of pages it got. End of story. That number of pages can be smaller than the requested number of pages, and _that_ is due to some error, but note how it can return "5" on error too - you asked for 10 pages, but the error happened in the middle!
So the right way to check for error is to bverify that you get the number of pages that you asked for. If you don't, something bad happened.
Of course, many users don't actually care about "I didn't get everything". They only care about "did I get _something_". Then that 0 ends up being the error case, but note how it depends on the caller.
What about get_user_pages_fast though?
We do seem to special-case the first page there. I'm not sure it's a good idea. But like the __get_user_pages_fast(), we seem to have users that know about the particular semantics and depend on it.
It's all ugly, I agree.
End result: we can't just change semantics of either of them.
At least not without going through every single user and checking that they are ok with it.
Which I guess I could be ok with. Maybe changing the semantics of __get_user_pages_fast() is acceptable, if you just change it *everywhere* (which includes not just he users, but also the couple of architecture-specific versions of that same function that we have.
Linus
For now I sent a patchset 1. documenting current behaviour for __get_user_pages_fast. 2. fixing get_user_pages_fast for consistency.
so an error on the 1st page gets propagated to the caller, and that get_user_pages_unlocked eventually calls __get_user_pages so it does return an error sometimes.
Would it be correct to apply the second part of the patch then (pasted below for reference) or should get_user_pages_fast and all its callers be changed to return 0 on error instead?
0 isn't an error. As SuS sees it (ie from the userspace end of the pile)
returning the number you asked for means it worked
returning a smaller number means it worked partially and that much was consumed (or in some cases more and the rest if so was lost - depends what you are reading/writing)
returning 0 means you read nothing as you were at the end of file
returning an error code means it broke, or you should try again (EAGAIN/EWOULDBLOCK)
The ugly bit there is the try-again semantics needs to exactly match the attached poll() behaviour or you get busy loops.
Alan
linux-stable-mirror@lists.linaro.org