When running on an AMD vIOMMU, we observed multiple invalidations (of
decreasing power of 2 aligned sizes) when unmapping a single page.
Domain flush takes gather bounds (end-start) as size param. However,
gather->end is defined as the last inclusive address (start + size - 1).
This leads to an off by 1 error.
With this patch, verified that 1 invalidation occurs when unmapping a
single page.
Fixes: a270be1b3fdf ("iommu/amd: Use only natural aligned flushes in a VM")
Cc: <stable(a)vger.kernel.org> # 5.15.x
Suggested-by: Gary Zibrat <gzibrat(a)google.com>
Tested-by: Sudheer Dantuluri <dantuluris(a)google.com>
Acked-by: Nadav Amit <namit(a)vmware.com>
Reviewed-by: Vasant Hegde <vasant.hegde(a)amd.com>
Signed-off-by: Jon Pan-Doh <pandoh(a)google.com>
---
drivers/iommu/amd/iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 5a505ba5467e..da45b1ab042d 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2378,7 +2378,7 @@ static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
unsigned long flags;
spin_lock_irqsave(&dom->lock, flags);
- domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+ domain_flush_pages(dom, gather->start, gather->end - gather->start + 1, 1);
amd_iommu_domain_flush_complete(dom);
spin_unlock_irqrestore(&dom->lock, flags);
}
--
2.40.0.634.g4ca3ef3211-goog
Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
the call to swap_free() before the call to set_pte_at(), which meant that
the MTE tags could end up being freed before set_pte_at() had a chance
to restore them. Fix it by adding a call to the arch_swap_restore() hook
before the call to swap_free().
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c51…
Cc: <stable(a)vger.kernel.org> # 6.1
Fixes: c145e0b47c77 ("mm: streamline COW logic in do_swap_page()")
Reported-by: Qun-wei Lin (林群崴) <Qun-wei.Lin(a)mediatek.com>
Closes: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@…
---
v2:
- Call arch_swap_restore() directly instead of via arch_do_swap_page()
mm/memory.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..fc25764016b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3932,6 +3932,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
}
+ /*
+ * Some architectures may have to restore extra metadata to the page
+ * when reading from swap. This metadata may be indexed by swap entry
+ * so this must be called before swap_free().
+ */
+ arch_swap_restore(entry, folio);
+
/*
* Remove the swap entry and conditionally try to free up the swapcache.
* We're already holding a reference on the page but haven't mapped it
--
2.40.1.606.ga4b1b128d6-goog
Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
the call to swap_free() before the call to set_pte_at(), which meant that
the MTE tags could end up being freed before set_pte_at() had a chance
to restore them. One other possibility was to hook arch_do_swap_page(),
but this had a number of problems:
- The call to the hook was also after swap_free().
- The call to the hook was after the call to set_pte_at(), so there was a
racy window where uninitialized metadata may be exposed to userspace.
This likely also affects SPARC ADI, which implements this hook to
restore tags.
- As a result of commit 1eba86c096e3 ("mm: change page type prior to
adding page table entry"), we were also passing the new PTE as the
oldpte argument, preventing the hook from knowing the swap index.
Fix all of these problems by moving the arch_do_swap_page() call before
the call to free_page(), and ensuring that we do not set orig_pte until
after the call.
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Suggested-by: Catalin Marinas <catalin.marinas(a)arm.com>
Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c51…
Cc: <stable(a)vger.kernel.org> # 6.1
Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
Fixes: 1eba86c096e3 ("mm: change page type prior to adding page table entry")
---
mm/memory.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 01a23ad48a04..83268d287ff1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
}
- /*
- * Remove the swap entry and conditionally try to free up the swapcache.
- * We're already holding a reference on the page but haven't mapped it
- * yet.
- */
- swap_free(entry);
- if (should_try_to_free_swap(folio, vma, vmf->flags))
- folio_free_swap(folio);
-
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
-
/*
* Same logic as in do_wp_page(); however, optimize for pages that are
* certainly not shared either because we just allocated them without
@@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
pte = pte_mkuffd_wp(pte);
+ arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
vmf->orig_pte = pte;
+ /*
+ * Remove the swap entry and conditionally try to free up the swapcache.
+ * We're already holding a reference on the page but haven't mapped it
+ * yet.
+ */
+ swap_free(entry);
+ if (should_try_to_free_swap(folio, vma, vmf->flags))
+ folio_free_swap(folio);
+
+ inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+ dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+
/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address);
@@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
folio_unlock(folio);
if (folio != swapcache && swapcache) {
--
2.40.1.606.ga4b1b128d6-goog
After the listener svc_sock be freed, and before invoking svc_tcp_accept()
for the established child sock, there is a window that the newsock
retaining a freed listener svc_sock in sk_user_data which cloning from
parent. In the race windows if data is received on the newsock, we will
observe use-after-free report in svc_tcp_listen_data_ready().
Reproduce by two tasks:
1. while :; do rpc.nfsd 0 ; rpc.nfsd; done
2. while :; do echo "" | ncat -4 127.0.0.1 2049 ; done
KASAN report:
==================================================================
BUG: KASAN: slab-use-after-free in svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc]
Read of size 8 at addr ffff888139d96228 by task nc/102553
CPU: 7 PID: 102553 Comm: nc Not tainted 6.3.0+ #18
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Call Trace:
<IRQ>
dump_stack_lvl+0x33/0x50
print_address_description.constprop.0+0x27/0x310
print_report+0x3e/0x70
kasan_report+0xae/0xe0
svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc]
tcp_data_queue+0x9f4/0x20e0
tcp_rcv_established+0x666/0x1f60
tcp_v4_do_rcv+0x51c/0x850
tcp_v4_rcv+0x23fc/0x2e80
ip_protocol_deliver_rcu+0x62/0x300
ip_local_deliver_finish+0x267/0x350
ip_local_deliver+0x18b/0x2d0
ip_rcv+0x2fb/0x370
__netif_receive_skb_one_core+0x166/0x1b0
process_backlog+0x24c/0x5e0
__napi_poll+0xa2/0x500
net_rx_action+0x854/0xc90
__do_softirq+0x1bb/0x5de
do_softirq+0xcb/0x100
</IRQ>
<TASK>
...
</TASK>
Allocated by task 102371:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x7b/0x90
svc_setup_socket+0x52/0x4f0 [sunrpc]
svc_addsock+0x20d/0x400 [sunrpc]
__write_ports_addfd+0x209/0x390 [nfsd]
write_ports+0x239/0x2c0 [nfsd]
nfsctl_transaction_write+0xac/0x110 [nfsd]
vfs_write+0x1c3/0xae0
ksys_write+0xed/0x1c0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Freed by task 102551:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x50
__kasan_slab_free+0x106/0x190
__kmem_cache_free+0x133/0x270
svc_xprt_free+0x1e2/0x350 [sunrpc]
svc_xprt_destroy_all+0x25a/0x440 [sunrpc]
nfsd_put+0x125/0x240 [nfsd]
nfsd_svc+0x2cb/0x3c0 [nfsd]
write_threads+0x1ac/0x2a0 [nfsd]
nfsctl_transaction_write+0xac/0x110 [nfsd]
vfs_write+0x1c3/0xae0
ksys_write+0xed/0x1c0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Fix the UAF by simply doing nothing in svc_tcp_listen_data_ready()
if state != TCP_LISTEN, that will avoid dereferencing svsk for all
child socket.
Link: https://lore.kernel.org/lkml/20230507091131.23540-1-dinghui@sangfor.com.cn/
Fixes: fa9251afc33c ("SUNRPC: Call the default socket callbacks instead of open coding")
Signed-off-by: Ding Hui <dinghui(a)sangfor.com.cn>
Cc: <stable(a)vger.kernel.org>
---
net/sunrpc/svcsock.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a51c9b989d58..9aca6e1e78e4 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -825,12 +825,6 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
trace_sk_data_ready(sk);
- if (svsk) {
- /* Refer to svc_setup_socket() for details. */
- rmb();
- svsk->sk_odata(sk);
- }
-
/*
* This callback may called twice when a new connection
* is established as a child socket inherits everything
@@ -839,13 +833,18 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
* when one of child sockets become ESTABLISHED.
* 2) data_ready method of the child socket may be called
* when it receives data before the socket is accepted.
- * In case of 2, we should ignore it silently.
+ * In case of 2, we should ignore it silently and DO NOT
+ * dereference svsk.
*/
- if (sk->sk_state == TCP_LISTEN) {
- if (svsk) {
- set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
- svc_xprt_enqueue(&svsk->sk_xprt);
- }
+ if (sk->sk_state != TCP_LISTEN)
+ return;
+
+ if (svsk) {
+ /* Refer to svc_setup_socket() for details. */
+ rmb();
+ svsk->sk_odata(sk);
+ set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
+ svc_xprt_enqueue(&svsk->sk_xprt);
}
}
--
2.17.1