Hi Greg, Sasha,
This is a batch of -stable backport fixes for 4.19. This batch includes
dependency patches which are not currently in the 4.19 branch.
The following list shows the backported patches; I am using the original
commit IDs for reference:
1) 4f16d25c68ec ("netfilter: nftables: add nft_parse_register_load() and use it")
2) 345023b0db31 ("netfilter: nftables: add nft_parse_register_store() and use it")
3) 08a01c11a5bb ("netfilter: nftables: statify nft_parse_register()")
4) 6e1acfa387b9 ("netfilter: nf_tables: validate registers coming from userspace.")
5) 20a1452c3542 ("netfilter: nf_tables: add nft_setelem_parse_key()")
6) fdb9c405e35b ("netfilter: nf_tables: allow up to 64 bytes in the set element data area")
7) 7e6bc1f6cabc ("netfilter: nf_tables: stricter validation of element data")
8) 5a2f3dc31811 ("netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag")
9) 36d5b2913219 ("netfilter: nf_tables: do not allow RULE_ID to refer to another chain")
Patches #1, #2, #3, #5, #6 are dependencies.
Please apply,
Thanks,
Pablo Neira Ayuso (9):
netfilter: nftables: add nft_parse_register_load() and use it
netfilter: nftables: add nft_parse_register_store() and use it
netfilter: nftables: statify nft_parse_register()
netfilter: nf_tables: validate registers coming from userspace.
netfilter: nf_tables: add nft_setelem_parse_key()
netfilter: nf_tables: allow up to 64 bytes in the set element data area
netfilter: nf_tables: stricter validation of element data
netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag
netfilter: nf_tables: do not allow RULE_ID to refer to another chain
include/net/netfilter/nf_tables.h | 15 +-
include/net/netfilter/nf_tables_core.h | 9 +-
include/net/netfilter/nft_fib.h | 2 +-
include/net/netfilter/nft_masq.h | 4 +-
include/net/netfilter/nft_redir.h | 4 +-
net/ipv4/netfilter/nft_dup_ipv4.c | 18 +-
net/ipv6/netfilter/nft_dup_ipv6.c | 18 +-
net/netfilter/nf_tables_api.c | 228 ++++++++++++++++---------
net/netfilter/nft_bitwise.c | 14 +-
net/netfilter/nft_byteorder.c | 14 +-
net/netfilter/nft_cmp.c | 8 +-
net/netfilter/nft_ct.c | 12 +-
net/netfilter/nft_dup_netdev.c | 6 +-
net/netfilter/nft_dynset.c | 12 +-
net/netfilter/nft_exthdr.c | 14 +-
net/netfilter/nft_fib.c | 5 +-
net/netfilter/nft_fwd_netdev.c | 18 +-
net/netfilter/nft_hash.c | 25 ++-
net/netfilter/nft_immediate.c | 6 +-
net/netfilter/nft_lookup.c | 14 +-
net/netfilter/nft_masq.c | 14 +-
net/netfilter/nft_meta.c | 12 +-
net/netfilter/nft_nat.c | 35 ++--
net/netfilter/nft_numgen.c | 15 +-
net/netfilter/nft_objref.c | 6 +-
net/netfilter/nft_osf.c | 8 +-
net/netfilter/nft_payload.c | 10 +-
net/netfilter/nft_queue.c | 12 +-
net/netfilter/nft_range.c | 6 +-
net/netfilter/nft_redir.c | 14 +-
net/netfilter/nft_rt.c | 7 +-
net/netfilter/nft_socket.c | 7 +-
net/netfilter/nft_tproxy.c | 14 +-
net/netfilter/nft_tunnel.c | 8 +-
34 files changed, 328 insertions(+), 286 deletions(-)
--
2.30.2
The current uses of PageAnon in page table check functions can lead to
type confusion bugs between struct page and slab [1], if slab pages are
accidentally mapped into the user space. This is because slab reuses the
bits in struct page to store its internal states, which renders PageAnon
ineffective on slab pages.
Since slab pages are not expected to be mapped into the user space, this
patch adds BUG_ON(PageSlab(page)) checks to make sure that slab pages
are not inadvertently mapped. Otherwise, there must be some bugs in the
kernel.
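
The type confusion described above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: struct
model_page, model_page_anon() and model_checked_page_anon() are
hypothetical names that do not exist in the kernel, the real struct page
layout is far more involved, and the model returns an error where the
patch uses BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PAGE_MAPPING_ANON (bit 0 of page->mapping). */
#define MODEL_MAPPING_ANON 0x1UL

/* Simplified model: slab-private data overlaps the 'mapping' word. */
struct model_page {
	bool is_slab;			/* stands in for PageSlab() */
	union {
		uintptr_t mapping;	/* meaningful for non-slab pages only */
		uintptr_t slab_state;	/* slab data stored in the same bits */
	};
};

/* Unchecked test: blindly interprets the shared word as 'mapping'. */
static bool model_page_anon(const struct model_page *page)
{
	assert(page);
	return (page->mapping & MODEL_MAPPING_ANON) != 0;
}

/*
 * Checked test: refuses slab pages before looking at the shared word.
 * Returns 0 and stores the result in *anon, or -1 for a slab page.
 */
static int model_checked_page_anon(const struct model_page *page, bool *anon)
{
	assert(page && anon);
	if (page->is_slab)
		return -1;
	*anon = model_page_anon(page);
	return 0;
}
```

In this model, a slab page whose private data happens to have bit 0 set
is reported as anonymous by the unchecked helper, while the checked
helper rejects it before the shared bits are interpreted.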
Reported-by: syzbot+fcf1a817ceb50935ce99(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000258e5e05fae79fc1@google.com/ [1]
Fixes: df4e817b7108 ("mm: page table check")
Cc: <stable(a)vger.kernel.org> # 5.17
Signed-off-by: Ruihan Li <lrh2000(a)pku.edu.cn>
---
include/linux/page-flags.h | 6 ++++++
mm/page_table_check.c | 6 ++++++
2 files changed, 12 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c68d67b8..92a2063a0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -617,6 +617,12 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
* Please note that, confusingly, "page_mapping" refers to the inode
* address_space which maps the page from disk; whereas "page_mapped"
* refers to user virtual address space into which the page is mapped.
+ *
+ * For slab pages, since slab reuses the bits in struct page to store its
+ * internal states, the page->mapping does not exist as such, nor do these
+ * flags below. So in order to avoid testing non-existent bits, please
+ * make sure that PageSlab(page) actually evaluates to false before calling
+ * the following functions (e.g., PageAnon). See mm/slab.h.
*/
#define PAGE_MAPPING_ANON 0x1
#define PAGE_MAPPING_MOVABLE 0x2
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 25d8610c0..f2baf97d5 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -71,6 +71,8 @@ static void page_table_check_clear(struct mm_struct *mm, unsigned long addr,
page = pfn_to_page(pfn);
page_ext = page_ext_get(page);
+
+ BUG_ON(PageSlab(page));
anon = PageAnon(page);
for (i = 0; i < pgcnt; i++) {
@@ -107,6 +109,8 @@ static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
page = pfn_to_page(pfn);
page_ext = page_ext_get(page);
+
+ BUG_ON(PageSlab(page));
anon = PageAnon(page);
for (i = 0; i < pgcnt; i++) {
@@ -133,6 +137,8 @@ void __page_table_check_zero(struct page *page, unsigned int order)
struct page_ext *page_ext;
unsigned long i;
+ BUG_ON(PageSlab(page));
+
page_ext = page_ext_get(page);
BUG_ON(!page_ext);
for (i = 0; i < (1ul << order); i++) {
--
2.40.1
vmbus_wait_for_unload() may be called in the panic path after other
CPUs are stopped. vmbus_wait_for_unload() currently loops through
online CPUs looking for the UNLOAD response message. But the values of
CONFIG_KEXEC_CORE and crash_kexec_post_notifiers affect the path used
to stop the other CPUs, and in one of the paths the stopped CPUs
are removed from cpu_online_mask. This removal happens in both
x86/x64 and arm64 architectures. In such a case, vmbus_wait_for_unload()
only checks the panic'ing CPU, and misses the UNLOAD response message
except when the panic'ing CPU is CPU 0. vmbus_wait_for_unload()
eventually times out, but only after waiting 100 seconds.
Fix this by looping through *present* CPUs in vmbus_wait_for_unload().
The cpu_present_mask is not modified by stopping the other CPUs in the
panic path, nor should it be. Furthermore, the synic_message_page
being checked in vmbus_wait_for_unload() is allocated in
hv_synic_alloc() for all present CPUs. So looping through the
present CPUs is more consistent.
For additional safety, also add a check for the message_page being
NULL before looking for the UNLOAD response message.
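
The difference between the two loops can be sketched with a userspace
model. All names here (struct model_cpu, model_find_unload(), the
MODEL_* constants) are hypothetical and exist only for illustration; the
real code uses cpumasks and the per-CPU hv_context rather than a plain
array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

enum { MODEL_MSG_NONE = 0, MODEL_MSG_UNLOAD_RESPONSE = 1 };
enum model_mask { MODEL_ONLINE, MODEL_PRESENT };

/* Hypothetical per-CPU state; message_page is NULL if never allocated. */
struct model_cpu {
	bool present;
	bool online;
	const int *message_page;
};

/*
 * Scan the CPUs selected by 'mask' for the UNLOAD response.
 * Returns the CPU number holding it, or -1 if it was not seen.
 */
static int model_find_unload(const struct model_cpu *cpus,
			     enum model_mask mask)
{
	assert(cpus);
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		bool selected = (mask == MODEL_ONLINE) ? cpus[cpu].online
						       : cpus[cpu].present;

		if (!selected)
			continue;
		/* Mirrors the added NULL check on the message page. */
		if (!cpus[cpu].message_page)
			continue;
		if (*cpus[cpu].message_page == MODEL_MSG_UNLOAD_RESPONSE)
			return cpu;
	}
	return -1;
}
```

If the response sits on CPU 0 while only the panic'ing CPU 2 is still
marked online, the online scan returns -1 and the present scan returns 0
in this model.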
Reported-by: John Starks <jostarks(a)microsoft.com>
Fixes: cd95aad55793 ("Drivers: hv: vmbus: handle various crash scenarios")
Signed-off-by: Michael Kelley <mikelley(a)microsoft.com>
---
drivers/hv/channel_mgmt.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 007f26d..df2ba20 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -829,11 +829,14 @@ static void vmbus_wait_for_unload(void)
if (completion_done(&vmbus_connection.unload_event))
goto completed;
- for_each_online_cpu(cpu) {
+ for_each_present_cpu(cpu) {
struct hv_per_cpu_context *hv_cpu
= per_cpu_ptr(hv_context.cpu_context, cpu);
page_addr = hv_cpu->synic_message_page;
+ if (!page_addr)
+ continue;
+
msg = (struct hv_message *)page_addr
+ VMBUS_MESSAGE_SINT;
@@ -867,11 +870,14 @@ static void vmbus_wait_for_unload(void)
* maybe-pending messages on all CPUs to be able to receive new
* messages after we reconnect.
*/
- for_each_online_cpu(cpu) {
+ for_each_present_cpu(cpu) {
struct hv_per_cpu_context *hv_cpu
= per_cpu_ptr(hv_context.cpu_context, cpu);
page_addr = hv_cpu->synic_message_page;
+ if (!page_addr)
+ continue;
+
msg = (struct hv_message *)page_addr + VMBUS_MESSAGE_SINT;
msg->header.message_type = HVMSG_NONE;
}
--
1.8.3.1
Without EXCLUSIVE_SYSTEM_RAM, users are allowed to map arbitrary
physical memory regions into the userspace via /dev/mem. At the same
time, pages may change their properties (e.g., from anonymous pages to
named pages) while they are still being mapped in the userspace, leading
to "corruption" detected by the page table check.
To avoid these false positives, this patch makes PAGE_TABLE_CHECK
depend on EXCLUSIVE_SYSTEM_RAM. This dependency is understandable
because PAGE_TABLE_CHECK is a hardening technique but /dev/mem without
STRICT_DEVMEM (i.e., !EXCLUSIVE_SYSTEM_RAM) is itself a security
problem.
Even with EXCLUSIVE_SYSTEM_RAM, I/O pages may still be allowed to be
mapped via /dev/mem. However, these pages are always considered named
pages, so they won't break the logic used in the page table check.
Cc: <stable(a)vger.kernel.org> # 5.17
Signed-off-by: Ruihan Li <lrh2000(a)pku.edu.cn>
---
Documentation/mm/page_table_check.rst | 19 +++++++++++++++++++
mm/Kconfig.debug | 1 +
2 files changed, 20 insertions(+)
diff --git a/Documentation/mm/page_table_check.rst b/Documentation/mm/page_table_check.rst
index cfd8f4117..c12838ce6 100644
--- a/Documentation/mm/page_table_check.rst
+++ b/Documentation/mm/page_table_check.rst
@@ -52,3 +52,22 @@ Build kernel with:
Optionally, build kernel with PAGE_TABLE_CHECK_ENFORCED in order to have page
table support without extra kernel parameter.
+
+Implementation notes
+====================
+
+We specifically decided not to use VMA information in order to avoid relying on
+MM states (except for limited "struct page" info). The page table check is a
+separate from Linux-MM state machine that verifies that the user accessible
+pages are not falsely shared.
+
+PAGE_TABLE_CHECK depends on EXCLUSIVE_SYSTEM_RAM. The reason is that without
+EXCLUSIVE_SYSTEM_RAM, users are allowed to map arbitrary physical memory
+regions into the userspace via /dev/mem. At the same time, pages may change
+their properties (e.g., from anonymous pages to named pages) while they are
+still being mapped in the userspace, leading to "corruption" detected by the
+page table check.
+
+Even with EXCLUSIVE_SYSTEM_RAM, I/O pages may be still allowed to be mapped via
+/dev/mem. However, these pages are always considered as named pages, so they
+won't break the logic used in the page table check.
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index a925415b4..018a5bd2f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -98,6 +98,7 @@ config PAGE_OWNER
config PAGE_TABLE_CHECK
bool "Check for invalid mappings in user page tables"
depends on ARCH_SUPPORTS_PAGE_TABLE_CHECK
+ depends on EXCLUSIVE_SYSTEM_RAM
select PAGE_EXTENSION
help
Check that anonymous page is not being mapped twice with read write
--
2.40.1
When hcd->localmem_pool is non-null, localmem_pool is used to allocate
DMA memory. In this case, the dma address will be properly returned (in
dma_handle), and dma_mmap_coherent should be used to map this memory
into the user space. However, the current implementation uses
remap_pfn_range, which is supposed to map normal pages.
Instead of repeating the logic in the memory allocation function, this
patch introduces a more robust solution. Here, the type of allocated
memory is checked by testing whether dma_handle is properly set. If
dma_handle is properly returned, it means some DMA pages are allocated
and dma_mmap_coherent should be used to map them. Otherwise, normal
pages are allocated and remap_pfn_range should be called. This ensures
that the correct mmap functions are used consistently, independently
of the logic details that determine which type of memory gets allocated.
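
The sentinel-based dispatch can be sketched in userspace as follows.
This is a simplified model, not the usbfs code: model_alloc(),
model_choose_mmap() and the MODEL_* names are hypothetical, and the
bus address written by the allocator is an arbitrary placeholder. The
sentinel mirrors the kernel definition of DMA_MAPPING_ERROR as an
all-ones dma_addr_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t model_dma_addr_t;

/* Models DMA_MAPPING_ERROR, i.e. an all-ones DMA address. */
#define MODEL_DMA_MAPPING_ERROR (~(model_dma_addr_t)0)

enum model_map_kind { MODEL_MAP_PFN, MODEL_MAP_DMA_COHERENT };

/*
 * Hypothetical allocator: when DMA memory is handed out, *dma receives
 * a (placeholder) bus address; otherwise it receives the sentinel.
 */
static void *model_alloc(bool use_dma, size_t size, model_dma_addr_t *dma)
{
	void *mem;

	assert(dma);
	mem = calloc(1, size);
	if (!mem)
		return NULL;
	*dma = use_dma ? (model_dma_addr_t)0x1000 : MODEL_DMA_MAPPING_ERROR;
	return mem;
}

/* The mapping routine is chosen from the handle alone. */
static enum model_map_kind model_choose_mmap(model_dma_addr_t dma)
{
	return dma == MODEL_DMA_MAPPING_ERROR ? MODEL_MAP_PFN
					      : MODEL_MAP_DMA_COHERENT;
}
```

The point of the design is that the caller no longer needs to know why
the allocator picked one kind of memory over the other; it only inspects
the handle the allocator returned.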
Fixes: a0e710a7def4 ("USB: usbfs: fix mmap dma mismatch")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ruihan Li <lrh2000(a)pku.edu.cn>
---
drivers/usb/core/devio.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index 3936ca7f7..fcf68818e 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -235,7 +235,7 @@ static int usbdev_mmap(struct file *file, struct vm_area_struct *vma)
size_t size = vma->vm_end - vma->vm_start;
void *mem;
unsigned long flags;
- dma_addr_t dma_handle;
+ dma_addr_t dma_handle = DMA_MAPPING_ERROR;
int ret;
ret = usbfs_increase_memory_usage(size + sizeof(struct usb_memory));
@@ -265,7 +265,14 @@ static int usbdev_mmap(struct file *file, struct vm_area_struct *vma)
usbm->vma_use_count = 1;
INIT_LIST_HEAD(&usbm->memlist);
- if (hcd->localmem_pool || !hcd_uses_dma(hcd)) {
+ /*
+ * In DMA-unavailable cases, hcd_buffer_alloc_pages allocates
+ * normal pages and assigns DMA_MAPPING_ERROR to dma_handle. Check
+ * whether we are in such cases, and then use remap_pfn_range (or
+ * dma_mmap_coherent) to map normal (or DMA) pages into the user
+ * space, respectively.
+ */
+ if (dma_handle == DMA_MAPPING_ERROR) {
if (remap_pfn_range(vma, vma->vm_start,
virt_to_phys(usbm->mem) >> PAGE_SHIFT,
size, vma->vm_page_prot) < 0) {
--
2.40.1