From: Roberto Sassu <roberto.sassu(a)huawei.com>
Commit 6e71b04a82248 ("bpf: Add file mode configuration into bpf maps")
added the BPF_F_RDONLY and BPF_F_WRONLY flags, to let user space specify
whether it will just read or modify a map.
Map access control is done in two steps. First, when user space wants to
obtain a map fd, it provides to the kernel the eBPF-defined flags, which
are converted into open flags and passed to the security_bpf_map() security
hook for evaluation by LSMs.
Second, if user space successfully obtained an fd, it passes that fd to the
kernel when it requests a map operation (e.g. lookup or update). The kernel
first checks if the fd has the modes required to perform the requested
operation and, if yes, continues the execution and returns the result to
user space.
While the fd modes check was added for map_*_elem() functions, it is
currently missing for map iterators, added more recently with commit
a5cbe05a6673 ("bpf: Implement bpf iterator for map elements"). A map
iterator executes a chosen eBPF program for each key/value pair of a map
and allows that program to read and/or modify them.
Whether a map iterator allows only read or also write depends on whether
the MEM_RDONLY flag in the ctx_arg_info member of the bpf_iter_reg
structure is set. Also, write needs to be supported at verifier level (for
example, it is currently not supported for sock maps).
Since map iterators obtain a map from a user space fd with
bpf_map_get_with_uref(), add the new req_modes parameter to that function,
so that map iterators can provide the required fd modes to access a map. If
the user space fd doesn't include the required modes,
bpf_map_get_with_uref() returns with an error, and the map iterator will
not be created.
If a map iterator marks both the key and value as read-only, it calls
bpf_map_get_with_uref() with FMODE_CAN_READ as value for req_modes. If it
also allows write access to either the key or the value, it calls that
function with FMODE_CAN_READ | FMODE_CAN_WRITE as value for req_modes,
regardless of whether or not the write is supported by the verifier (the
write is intentionally allowed).
bpf_fd_probe_obj() does not require any fd mode, as the fd is only used for
the purpose of finding the eBPF object type, for pinning the object to the
bpffs filesystem.
Finally, it is worth to mention that the fd modes check was not added for
the cgroup iterator, although it registers an attach_target method like the
other iterators. The reason is that the fd is not the only way for user
space to reference a cgroup object (also by ID and by path). For the
protection to be effective, all reference methods need to be evaluated
consistently. This work is deferred to a separate patch.
Cc: stable(a)vger.kernel.org # 5.10.x
Fixes: a5cbe05a6673 ("bpf: Implement bpf iterator for map elements")
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
---
include/linux/bpf.h | 2 +-
kernel/bpf/inode.c | 2 +-
kernel/bpf/map_iter.c | 3 ++-
kernel/bpf/syscall.c | 8 +++++++-
net/core/bpf_sk_storage.c | 3 ++-
net/core/sock_map.c | 3 ++-
6 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9c1674973e03..6cd2ca910553 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1628,7 +1628,7 @@ bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_ma
void bpf_map_free_kptrs(struct bpf_map *map, void *map_value);
struct bpf_map *bpf_map_get(u32 ufd);
-struct bpf_map *bpf_map_get_with_uref(u32 ufd);
+struct bpf_map *bpf_map_get_with_uref(u32 ufd, fmode_t req_modes);
struct bpf_map *__bpf_map_get(struct fd f);
void bpf_map_inc(struct bpf_map *map);
void bpf_map_inc_with_uref(struct bpf_map *map);
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 4f841e16779e..862e1caa8b0f 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -71,7 +71,7 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
{
void *raw;
- raw = bpf_map_get_with_uref(ufd);
+ raw = bpf_map_get_with_uref(ufd, 0);
if (!IS_ERR(raw)) {
*type = BPF_TYPE_MAP;
return raw;
diff --git a/kernel/bpf/map_iter.c b/kernel/bpf/map_iter.c
index b0fa190b0979..1143f8960135 100644
--- a/kernel/bpf/map_iter.c
+++ b/kernel/bpf/map_iter.c
@@ -110,7 +110,8 @@ static int bpf_iter_attach_map(struct bpf_prog *prog,
if (!linfo->map.map_fd)
return -EBADF;
- map = bpf_map_get_with_uref(linfo->map.map_fd);
+ map = bpf_map_get_with_uref(linfo->map.map_fd,
+ FMODE_CAN_READ | FMODE_CAN_WRITE);
if (IS_ERR(map))
return PTR_ERR(map);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e9d4622aef7..4a2063d8e99c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1232,7 +1232,7 @@ struct bpf_map *bpf_map_get(u32 ufd)
}
EXPORT_SYMBOL(bpf_map_get);
-struct bpf_map *bpf_map_get_with_uref(u32 ufd)
+struct bpf_map *bpf_map_get_with_uref(u32 ufd, fmode_t req_modes)
{
struct fd f = fdget(ufd);
struct bpf_map *map;
@@ -1241,7 +1241,13 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd)
if (IS_ERR(map))
return map;
+ if ((map_get_sys_perms(map, f) & req_modes) != req_modes) {
+ map = ERR_PTR(-EPERM);
+ goto out;
+ }
+
bpf_map_inc_with_uref(map);
+out:
fdput(f);
return map;
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 1b7f385643b4..bf9c6afed8ac 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -897,7 +897,8 @@ static int bpf_iter_attach_map(struct bpf_prog *prog,
if (!linfo->map.map_fd)
return -EBADF;
- map = bpf_map_get_with_uref(linfo->map.map_fd);
+ map = bpf_map_get_with_uref(linfo->map.map_fd,
+ FMODE_CAN_READ | FMODE_CAN_WRITE);
if (IS_ERR(map))
return PTR_ERR(map);
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index a660baedd9e7..7f7375dc39b2 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1636,7 +1636,8 @@ static int sock_map_iter_attach_target(struct bpf_prog *prog,
if (!linfo->map.map_fd)
return -EBADF;
- map = bpf_map_get_with_uref(linfo->map.map_fd);
+ map = bpf_map_get_with_uref(linfo->map.map_fd,
+ FMODE_CAN_READ | FMODE_CAN_WRITE);
if (IS_ERR(map))
return PTR_ERR(map);
--
2.25.1
MADV_PAGEOUT tries to isolate non-LRU pages and get the warning
from isolate_lru_page below.
Fix it with checking PageLRU in advance.
------------[ cut here ]------------
trying to isolate tail page
WARNING: CPU: 0 PID: 6175 at mm/folio-compat.c:158 isolate_lru_page+0x130/0x140
Modules linked in:
CPU: 0 PID: 6175 Comm: syz-executor.0 Not tainted 5.18.12 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:isolate_lru_page+0x130/0x140
Link: https://lore.kernel.org/linux-mm/485f8c33.2471b.182d5726afb.Coremail.hantia…
Reported-by: 韩天硕 <hantianshuo(a)iie.ac.cn>
Suggested-by: Yang Shi <shy828301(a)gmail.com>
Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
Cc: stable(a)vger.kernel.org
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Acked-by: Yang Shi <shy828301(a)gmail.com>
---
mm/madvise.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 682e1d161aef..a3fc4cd32ed3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -452,8 +452,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
continue;
}
- /* Do not interfere with other mappings of this page */
- if (page_mapcount(page) != 1)
+ /*
+ * Do not interfere with other mappings of this page and
+ * non-LRU page.
+ */
+ if (!PageLRU(page) || page_mapcount(page) != 1)
continue;
VM_BUG_ON_PAGE(PageTransCompound(page), page);
--
2.37.2.672.g94769d06f0-goog
Prior to 5.6, when /dev/random was opened with O_NONBLOCK, it would
return -EAGAIN if there was no entropy. When the pools were unified in
5.6, this was lost. The post 5.6 behavior of blocking until the pool is
initialized, and ignoring O_NONBLOCK in the process, went unnoticed,
with no reports about the regression received for two and a half years.
However, eventually this indeed did break somebody's userspace.
So we restore the old behavior, by returning -EAGAIN if the pool is not
initialized. Unlike the old /dev/random, this can only occur during
early boot, after which it never blocks again.
In order to make this O_NONBLOCK behavior consistent with other
expectations, also respect users reading with preadv2(RWF_NOWAIT) and
similar.
Fixes: 30c08efec888 ("random: make /dev/random be almost like /dev/urandom")
Reported-by: Guozihua <guozihua(a)huawei.com>
Reported-by: Zhongguohua <zhongguohua1(a)huawei.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso(a)mit.edu>
Cc: Andrew Lutomirski <luto(a)kernel.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
---
drivers/char/mem.c | 4 ++--
drivers/char/random.c | 5 +++++
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 84ca98ed1dad..c2b37009b11e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -706,8 +706,8 @@ static const struct memdev {
#endif
[5] = { "zero", 0666, &zero_fops, FMODE_NOWAIT },
[7] = { "full", 0666, &full_fops, 0 },
- [8] = { "random", 0666, &random_fops, 0 },
- [9] = { "urandom", 0666, &urandom_fops, 0 },
+ [8] = { "random", 0666, &random_fops, FMODE_NOWAIT },
+ [9] = { "urandom", 0666, &urandom_fops, FMODE_NOWAIT },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, &kmsg_fops, 0 },
#endif
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 79d7d4e4e582..c8cc23515568 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1347,6 +1347,11 @@ static ssize_t random_read_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
int ret;
+ if (!crng_ready() &&
+ ((kiocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) ||
+ (kiocb->ki_filp->f_flags & O_NONBLOCK)))
+ return -EAGAIN;
+
ret = wait_for_random_bytes();
if (ret != 0)
return ret;
--
2.37.3