After switching to memcg-based bpf memory accounting, bpf memory is charged to the loader's memcg by default, which causes unexpected issues for us. For instance, the container of the loader (which loads the bpf programs and pins them on bpffs) may restart after pinning the progs and maps. After the restart, the pinned progs and maps no longer belong to the new container; they actually belong to an offline memcg left behind by the previous generation. That inconsistent behavior makes memory resource management for this container troublesome.
The reason why these progs and maps have to persist across multiple generations is that they are also used by other processes which are not in this container. IOW, they can't be removed when this container is restarted. Take a specific example: a bpf program for a clsact qdisc is loaded by an agent running in a container; the agent not only loads the bpf program but also processes the data generated by it and does some other maintenance work.
In order to keep the charging behavior consistent, we originally considered recharging these pinned maps and progs after the container restarts, but after the discussion [1] with Roman we decided to go in another direction: don't charge them to the container in the first place. TL;DR of that discussion: recharging is not a generic solution and it carries too much risk.
This patchset implements that no-charge approach. Two flags are introduced in union bpf_attr, one for bpf maps and another for bpf progs. A user who doesn't want the memory charged to the current memcg can set these flags. Both flags require CAP_SYS_ADMIN, as the memory will then be accounted to the root memcg only.
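For illustration, a minimal userspace sketch of the intended usage, using the raw bpf(2) syscall as the selftests in patches #13/#14 do (BPF_F_NO_CHARGE is the new flag added by this series, so the updated uapi header is assumed):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int create_system_wide_map(void)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type    = BPF_MAP_TYPE_HASH;
	attr.key_size    = 4;
	attr.value_size  = 4;
	attr.max_entries = 10;
	/* Don't charge the map memory to the creator's memcg;
	 * account it to the root memcg instead.
	 */
	attr.map_flags   = BPF_F_NO_CHARGE;

	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}

Without CAP_SYS_ADMIN the same call is expected to fail with EPERM.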
Patches #1~#8 are for bpf maps. Patches #9~#12 are for bpf progs. Patches #13 and #14 add selftests, which also serve as examples of how to use the new flags.
[1]. https://lwn.net/Articles/887180/
Yafang Shao (14):
  bpf: Introduce no charge flag for bpf map
  bpf: Only sys admin can set no charge flag
  bpf: Enable no charge in map _CREATE_FLAG_MASK
  bpf: Introduce new parameter bpf_attr in bpf_map_area_alloc
  bpf: Allow no charge in bpf_map_area_alloc
  bpf: Allow no charge for allocation not at map creation time
  bpf: Allow no charge in map specific allocation
  bpf: Aggregate flags for BPF_PROG_LOAD command
  bpf: Add no charge flag for bpf prog
  bpf: Only sys admin can set no charge flag for bpf prog
  bpf: Set __GFP_ACCOUNT at the callsite of bpf_prog_alloc
  bpf: Allow no charge for bpf prog
  bpf: selftests: Add test case for BPF_F_NO_CHARGE
  bpf: selftests: Add test case for BPF_F_PROG_NO_CHARGE
 include/linux/bpf.h | 27 ++++++-
 include/uapi/linux/bpf.h | 21 +++--
 kernel/bpf/arraymap.c | 9 +--
 kernel/bpf/bloom_filter.c | 7 +-
 kernel/bpf/bpf_local_storage.c | 8 +-
 kernel/bpf/bpf_struct_ops.c | 13 +--
 kernel/bpf/core.c | 20 +++--
 kernel/bpf/cpumap.c | 10 ++-
 kernel/bpf/devmap.c | 14 ++--
 kernel/bpf/hashtab.c | 14 ++--
 kernel/bpf/local_storage.c | 4 +-
 kernel/bpf/lpm_trie.c | 4 +-
 kernel/bpf/queue_stack_maps.c | 5 +-
 kernel/bpf/reuseport_array.c | 3 +-
 kernel/bpf/ringbuf.c | 19 ++---
 kernel/bpf/stackmap.c | 13 +--
 kernel/bpf/syscall.c | 40 +++++++---
 kernel/bpf/verifier.c | 2 +-
 net/core/filter.c | 6 +-
 net/core/sock_map.c | 8 +-
 net/xdp/xskmap.c | 9 ++-
 tools/include/uapi/linux/bpf.h | 21 +++--
 .../selftests/bpf/map_tests/no_charg.c | 79 +++++++++++++++++++
 .../selftests/bpf/prog_tests/no_charge.c | 49 ++++++++++++
 24 files changed, 297 insertions(+), 108 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/map_tests/no_charg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/no_charge.c
A new map flag, BPF_F_NO_CHARGE, is introduced in bpf_attr. With it we can choose not to charge map memory to the creator's memcg and instead account it to the root memcg only. At map creation time the flag can be read directly from union bpf_attr, while on other paths it can be read from struct bpf_map.
The use case for this flag is that we may create bpf maps from a process running in a container (with a memcg) while the maps serve the whole system, so we don't want to charge their memory to that container. That helps memory resource management for the container, as these shared bpf maps are always pinned and should belong to the system rather than the container. It also keeps the charging behavior consistent: if we charged the pinned memory to the container, that memory would no longer belong to it after the container restarts.
Two helpers are introduced for use in the follow-up patches.
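As a preview of how the helpers are used later in the series, a sketch against a hypothetical foo_map type (the real conversions are in patches #5~#7):

/* hypothetical map type, only to illustrate the intended call pattern */
struct foo_map {
	struct bpf_map map;
};

static struct bpf_map *foo_map_alloc(union bpf_attr *attr)
{
	struct foo_map *fmap;

	/* map_flags_no_charge() ORs in __GFP_ACCOUNT unless the creator
	 * passed BPF_F_NO_CHARGE in attr->map_flags
	 */
	fmap = kzalloc(sizeof(*fmap), map_flags_no_charge(GFP_USER, attr));
	if (!fmap)
		return ERR_PTR(-ENOMEM);

	bpf_map_init_from_attr(&fmap->map, attr);
	return &fmap->map;
}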
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h | 15 ++++++++++++++-
 include/uapi/linux/bpf.h | 3 +++
 kernel/bpf/syscall.c | 1 +
 tools/include/uapi/linux/bpf.h | 3 +++
 4 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 88449fbbe063..07c6603a6c81 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -184,7 +184,8 @@ struct bpf_map { char name[BPF_OBJ_NAME_LEN]; bool bypass_spec_v1; bool frozen; /* write-once; write-protected by freeze_mutex */ - /* 14 bytes hole */ + bool no_charge; /* Don't charge to memcg */ + /* 13 bytes hole */
/* The 3rd and 4th cacheline with misc members to avoid false sharing * particularly with refcounting. @@ -207,6 +208,18 @@ struct bpf_map { } owner; };
+static inline gfp_t +map_flags_no_charge(gfp_t flags, union bpf_attr *attr) +{ + return flags |= (attr->map_flags & BPF_F_NO_CHARGE) ? 0 : __GFP_ACCOUNT; +} + +static inline gfp_t +bpf_flags_no_charge(gfp_t flags, bool no_charge) +{ + return flags |= no_charge ? 0 : __GFP_ACCOUNT; +} + static inline bool map_value_has_spin_lock(const struct bpf_map *map) { return map->spin_lock_off >= 0; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 7604e7d5438f..e2dba6cdd88d 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1225,6 +1225,9 @@ enum {
/* Create a map that is suitable to be an inner map with dynamic max entries */ BPF_F_INNER_MAP = (1U << 12), + +/* Don't charge memory to memcg */ + BPF_F_NO_CHARGE = (1U << 13), };
/* Flags for BPF_PROG_QUERY. */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index cdaa1152436a..029f04588b1a 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -368,6 +368,7 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr) map->map_flags = bpf_map_flags_retain_permanent(attr->map_flags); map->numa_node = bpf_map_attr_numa_node(attr); map->map_extra = attr->map_extra; + map->no_charge = attr->map_flags & BPF_F_NO_CHARGE; }
static int bpf_map_alloc_id(struct bpf_map *map) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 7604e7d5438f..e2dba6cdd88d 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1225,6 +1225,9 @@ enum {
/* Create a map that is suitable to be an inner map with dynamic max entries */ BPF_F_INNER_MAP = (1U << 12), + +/* Don't charge memory to memcg */ + BPF_F_NO_CHARGE = (1U << 13), };
/* Flags for BPF_PROG_QUERY. */
Only a sys admin (CAP_SYS_ADMIN) has the privilege to skip the memcg charge and account the bpf map memory to the root memcg only.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 029f04588b1a..0cca3d7d0d84 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -852,6 +852,9 @@ static int map_create(union bpf_attr *attr) attr->map_extra != 0) return -EINVAL;
+ if (attr->map_flags & BPF_F_NO_CHARGE && !capable(CAP_SYS_ADMIN)) + return -EPERM; + f_flags = bpf_get_file_flag(attr->map_flags); if (f_flags < 0) return f_flags;
Many map types have a create-flag mask used to reject invalid map flags. Add BPF_F_NO_CHARGE to all of these masks so the new flag is permitted, following the pattern sketched below.
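The pattern is the same for every map type; sketched here for a hypothetical FOO map (the actual masks and their flag sets are in the diff below):

#define FOO_CREATE_FLAG_MASK \
	(BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)

static struct bpf_map *foo_map_alloc(union bpf_attr *attr)
{
	/* any flag outside the per-map-type mask is rejected */
	if (attr->map_flags & ~FOO_CREATE_FLAG_MASK)
		return ERR_PTR(-EINVAL);

	/* ... the rest of the allocation is unchanged ... */
}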
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/arraymap.c | 2 +-
 kernel/bpf/bloom_filter.c | 4 ++--
 kernel/bpf/bpf_local_storage.c | 3 ++-
 kernel/bpf/bpf_struct_ops.c | 5 ++++-
 kernel/bpf/cpumap.c | 5 ++++-
 kernel/bpf/devmap.c | 2 +-
 kernel/bpf/hashtab.c | 2 +-
 kernel/bpf/local_storage.c | 2 +-
 kernel/bpf/lpm_trie.c | 2 +-
 kernel/bpf/queue_stack_maps.c | 2 +-
 kernel/bpf/ringbuf.c | 2 +-
 kernel/bpf/stackmap.c | 2 +-
 net/core/sock_map.c | 2 +-
 net/xdp/xskmap.c | 5 ++++-
 14 files changed, 25 insertions(+), 15 deletions(-)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 7f145aefbff8..ac123747303c 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -16,7 +16,7 @@
#define ARRAY_CREATE_FLAG_MASK \ (BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK | \ - BPF_F_PRESERVE_ELEMS | BPF_F_INNER_MAP) + BPF_F_PRESERVE_ELEMS | BPF_F_INNER_MAP | BPF_F_NO_CHARGE)
static void bpf_array_free_percpu(struct bpf_array *array) { diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c index b141a1346f72..f8ebfb4831e5 100644 --- a/kernel/bpf/bloom_filter.c +++ b/kernel/bpf/bloom_filter.c @@ -8,8 +8,8 @@ #include <linux/jhash.h> #include <linux/random.h>
-#define BLOOM_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK) +#define BLOOM_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | \ + BPF_F_ZERO_SEED | BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)
struct bpf_bloom_filter { struct bpf_map map; diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c index 092a1ac772d7..a92d3032fcde 100644 --- a/kernel/bpf/bpf_local_storage.c +++ b/kernel/bpf/bpf_local_storage.c @@ -15,7 +15,8 @@ #include <linux/rcupdate_trace.h> #include <linux/rcupdate_wait.h>
-#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE) +#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK \ + (BPF_F_NO_PREALLOC | BPF_F_CLONE | BPF_F_NO_CHARGE)
static struct bpf_local_storage_map_bucket * select_bucket(struct bpf_local_storage_map *smap, diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 21069dbe9138..09eb848e6d12 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -21,6 +21,8 @@ enum bpf_struct_ops_state { refcount_t refcnt; \ enum bpf_struct_ops_state state
+#define STRUCT_OPS_CREATE_FLAG_MASK (BPF_F_NO_CHARGE) + struct bpf_struct_ops_value { BPF_STRUCT_OPS_COMMON_VALUE; char data[] ____cacheline_aligned_in_smp; @@ -556,7 +558,8 @@ static void bpf_struct_ops_map_free(struct bpf_map *map) static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr) { if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 || - attr->map_flags || !attr->btf_vmlinux_value_type_id) + attr->map_flags & ~STRUCT_OPS_CREATE_FLAG_MASK || + !attr->btf_vmlinux_value_type_id) return -EINVAL; return 0; } diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 650e5d21f90d..201226fc652b 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -39,6 +39,9 @@ */
#define CPU_MAP_BULK_SIZE 8 /* 8 == one cacheline on 64-bit archs */ + +#define CPU_MAP_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_NO_CHARGE) + struct bpf_cpu_map_entry; struct bpf_cpu_map;
@@ -93,7 +96,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) if (attr->max_entries == 0 || attr->key_size != 4 || (value_size != offsetofend(struct bpf_cpumap_val, qsize) && value_size != offsetofend(struct bpf_cpumap_val, bpf_prog.fd)) || - attr->map_flags & ~BPF_F_NUMA_NODE) + attr->map_flags & ~CPU_MAP_CREATE_FLAG_MASK) return ERR_PTR(-EINVAL);
cmap = kzalloc(sizeof(*cmap), GFP_USER | __GFP_ACCOUNT); diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index 038f6d7a83e4..39bf8b521f27 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -50,7 +50,7 @@ #include <trace/events/xdp.h>
#define DEV_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) + (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_NO_CHARGE)
struct xdp_dev_bulk_queue { struct xdp_frame *q[DEV_MAP_BULK_SIZE]; diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 65877967f414..5c6ec8780b09 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -16,7 +16,7 @@
#define HTAB_CREATE_FLAG_MASK \ (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \ - BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED) + BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED | BPF_F_NO_CHARGE)
#define BATCH_OPS(_name) \ .map_lookup_batch = \ diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index 497916060ac7..865766c240d6 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -15,7 +15,7 @@ #include "../cgroup/cgroup-internal.h"
#define LOCAL_STORAGE_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK) + (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)
struct bpf_cgroup_storage_map { struct bpf_map map; diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 5763cc7ac4f1..f42edf613624 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -537,7 +537,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key) #define LPM_KEY_SIZE_MIN LPM_KEY_SIZE(LPM_DATA_SIZE_MIN)
#define LPM_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | \ - BPF_F_ACCESS_MASK) + BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)
static struct bpf_map *trie_alloc(union bpf_attr *attr) { diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c index f9c734aaa990..c78eed4659ce 100644 --- a/kernel/bpf/queue_stack_maps.c +++ b/kernel/bpf/queue_stack_maps.c @@ -11,7 +11,7 @@ #include "percpu_freelist.h"
#define QUEUE_STACK_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK) + (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK | BPF_F_NO_CHARGE)
struct bpf_queue_stack { struct bpf_map map; diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 710ba9de12ce..88779f688679 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -11,7 +11,7 @@ #include <linux/kmemleak.h> #include <uapi/linux/btf.h>
-#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE) +#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_NO_CHARGE)
/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */ #define RINGBUF_PGOFF \ diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index 38bdfcd06f55..b2e7dc1d9f5a 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -14,7 +14,7 @@
#define STACK_CREATE_FLAG_MASK \ (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \ - BPF_F_STACK_BUILD_ID) + BPF_F_STACK_BUILD_ID | BPF_F_NO_CHARGE)
struct stack_map_bucket { struct pcpu_freelist_node fnode; diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 2d213c4011db..7b0215bea413 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -22,7 +22,7 @@ struct bpf_stab { };
#define SOCK_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) + (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_NO_CHARGE)
static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog, struct bpf_prog *old, u32 which); diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c index 65b53fb3de13..10a5ae727bd5 100644 --- a/net/xdp/xskmap.c +++ b/net/xdp/xskmap.c @@ -12,6 +12,9 @@
#include "xsk.h"
+#define XSK_MAP_CREATE_FLAG_MASK \ + (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_NO_CHARGE) + static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map, struct xdp_sock __rcu **map_entry) { @@ -68,7 +71,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
if (attr->max_entries == 0 || attr->key_size != 4 || attr->value_size != 4 || - attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)) + attr->map_flags & ~XSK_MAP_CREATE_FLAG_MASK) return ERR_PTR(-EINVAL);
numa_node = bpf_map_attr_numa_node(attr);
Add a new parameter, bpf_attr, to bpf_map_area_alloc(), so that the no-charge flag can be read from it. Of the current two parameters, numa_node is itself derived from bpf_attr, so it can be dropped after this change.
No functional change.
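Taking the queue/stack map caller from the diff below as an example, the change at the call sites is roughly:

	/* before this patch: each caller derives the numa node itself */
	qs = bpf_map_area_alloc(queue_size, bpf_map_attr_numa_node(attr));

	/* after this patch: pass attr and let __bpf_map_area_alloc() derive
	 * the numa node (and, in the next patches, the no-charge flag)
	 */
	qs = bpf_map_area_alloc(queue_size, attr);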
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h | 4 ++--
 kernel/bpf/arraymap.c | 5 ++---
 kernel/bpf/bloom_filter.c | 3 +--
 kernel/bpf/bpf_struct_ops.c | 8 ++++----
 kernel/bpf/cpumap.c | 3 +--
 kernel/bpf/devmap.c | 10 ++++------
 kernel/bpf/hashtab.c | 10 ++++------
 kernel/bpf/queue_stack_maps.c | 3 +--
 kernel/bpf/reuseport_array.c | 3 +--
 kernel/bpf/ringbuf.c | 11 ++++++-----
 kernel/bpf/stackmap.c | 11 ++++++-----
 kernel/bpf/syscall.c | 13 +++++++------
 net/core/sock_map.c | 6 ++----
 net/xdp/xskmap.c | 4 +---
 14 files changed, 42 insertions(+), 52 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 07c6603a6c81..90a542d5a411 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1518,8 +1518,8 @@ void bpf_map_inc_with_uref(struct bpf_map *map); struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map); void bpf_map_put_with_uref(struct bpf_map *map); void bpf_map_put(struct bpf_map *map); -void *bpf_map_area_alloc(u64 size, int numa_node); -void *bpf_map_area_mmapable_alloc(u64 size, int numa_node); +void *bpf_map_area_alloc(u64 size, union bpf_attr *attr); +void *bpf_map_area_mmapable_alloc(u64 size, union bpf_attr *attr); void bpf_map_area_free(void *base); bool bpf_map_write_active(const struct bpf_map *map); void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr); diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index ac123747303c..e26aef906392 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -81,7 +81,6 @@ int array_map_alloc_check(union bpf_attr *attr) static struct bpf_map *array_map_alloc(union bpf_attr *attr) { bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; - int numa_node = bpf_map_attr_numa_node(attr); u32 elem_size, index_mask, max_entries; bool bypass_spec_v1 = bpf_bypass_spec_v1(); u64 array_size, mask64; @@ -130,13 +129,13 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) void *data;
/* kmalloc'ed memory can't be mmap'ed, use explicit vmalloc */ - data = bpf_map_area_mmapable_alloc(array_size, numa_node); + data = bpf_map_area_mmapable_alloc(array_size, attr); if (!data) return ERR_PTR(-ENOMEM); array = data + PAGE_ALIGN(sizeof(struct bpf_array)) - offsetof(struct bpf_array, value); } else { - array = bpf_map_area_alloc(array_size, numa_node); + array = bpf_map_area_alloc(array_size, attr); } if (!array) return ERR_PTR(-ENOMEM); diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c index f8ebfb4831e5..a35c664b4a02 100644 --- a/kernel/bpf/bloom_filter.c +++ b/kernel/bpf/bloom_filter.c @@ -90,7 +90,6 @@ static int bloom_map_get_next_key(struct bpf_map *map, void *key, void *next_key static struct bpf_map *bloom_map_alloc(union bpf_attr *attr) { u32 bitset_bytes, bitset_mask, nr_hash_funcs, nr_bits; - int numa_node = bpf_map_attr_numa_node(attr); struct bpf_bloom_filter *bloom;
if (!bpf_capable()) @@ -141,7 +140,7 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr) }
bitset_bytes = roundup(bitset_bytes, sizeof(unsigned long)); - bloom = bpf_map_area_alloc(sizeof(*bloom) + bitset_bytes, numa_node); + bloom = bpf_map_area_alloc(sizeof(*bloom) + bitset_bytes, attr);
if (!bloom) return ERR_PTR(-ENOMEM); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 09eb848e6d12..1ca1407ae5e6 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -591,17 +591,17 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) */ (vt->size - sizeof(struct bpf_struct_ops_value));
- st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); + attr->map_flags &= ~BPF_F_NUMA_NODE; + st_map = bpf_map_area_alloc(st_map_size, attr); if (!st_map) return ERR_PTR(-ENOMEM);
st_map->st_ops = st_ops; map = &st_map->map;
- st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); + st_map->uvalue = bpf_map_area_alloc(vt->size, attr); st_map->progs = - bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *), - NUMA_NO_NODE); + bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *), attr); st_map->image = bpf_jit_alloc_exec(PAGE_SIZE); if (!st_map->uvalue || !st_map->progs || !st_map->image) { bpf_struct_ops_map_free(map); diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 201226fc652b..5a5b40e986ff 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -113,8 +113,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
/* Alloc array for possible remote "destination" CPUs */ cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries * - sizeof(struct bpf_cpu_map_entry *), - cmap->map.numa_node); + sizeof(struct bpf_cpu_map_entry *), attr); if (!cmap->cpu_map) goto free_cmap;
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index 39bf8b521f27..2857176c82bb 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -88,12 +88,12 @@ static DEFINE_SPINLOCK(dev_map_lock); static LIST_HEAD(dev_map_list);
static struct hlist_head *dev_map_create_hash(unsigned int entries, - int numa_node) + union bpf_attr *attr) { int i; struct hlist_head *hash;
- hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node); + hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), attr); if (hash != NULL) for (i = 0; i < entries; i++) INIT_HLIST_HEAD(&hash[i]); @@ -137,16 +137,14 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr) }
if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) { - dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets, - dtab->map.numa_node); + dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets, attr); if (!dtab->dev_index_head) return -ENOMEM;
spin_lock_init(&dtab->index_lock); } else { dtab->netdev_map = bpf_map_area_alloc((u64) dtab->map.max_entries * - sizeof(struct bpf_dtab_netdev *), - dtab->map.numa_node); + sizeof(struct bpf_dtab_netdev *), attr); if (!dtab->netdev_map) return -ENOMEM; } diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 5c6ec8780b09..2c84045ff8e1 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -303,7 +303,7 @@ static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key, return NULL; }
-static int prealloc_init(struct bpf_htab *htab) +static int prealloc_init(union bpf_attr *attr, struct bpf_htab *htab) { u32 num_entries = htab->map.max_entries; int err = -ENOMEM, i; @@ -311,8 +311,7 @@ static int prealloc_init(struct bpf_htab *htab) if (htab_has_extra_elems(htab)) num_entries += num_possible_cpus();
- htab->elems = bpf_map_area_alloc((u64)htab->elem_size * num_entries, - htab->map.numa_node); + htab->elems = bpf_map_area_alloc((u64)htab->elem_size * num_entries, attr); if (!htab->elems) return -ENOMEM;
@@ -513,8 +512,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
err = -ENOMEM; htab->buckets = bpf_map_area_alloc(htab->n_buckets * - sizeof(struct bucket), - htab->map.numa_node); + sizeof(struct bucket), attr); if (!htab->buckets) goto free_htab;
@@ -535,7 +533,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) htab_init_buckets(htab);
if (prealloc) { - err = prealloc_init(htab); + err = prealloc_init(attr, htab); if (err) goto free_map_locked;
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c index c78eed4659ce..0ff93c5bc184 100644 --- a/kernel/bpf/queue_stack_maps.c +++ b/kernel/bpf/queue_stack_maps.c @@ -66,14 +66,13 @@ static int queue_stack_map_alloc_check(union bpf_attr *attr)
static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr) { - int numa_node = bpf_map_attr_numa_node(attr); struct bpf_queue_stack *qs; u64 size, queue_size;
size = (u64) attr->max_entries + 1; queue_size = sizeof(*qs) + size * attr->value_size;
- qs = bpf_map_area_alloc(queue_size, numa_node); + qs = bpf_map_area_alloc(queue_size, attr); if (!qs) return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c index 8251243022a2..b19fb70118a4 100644 --- a/kernel/bpf/reuseport_array.c +++ b/kernel/bpf/reuseport_array.c @@ -150,14 +150,13 @@ static void reuseport_array_free(struct bpf_map *map)
static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr) { - int numa_node = bpf_map_attr_numa_node(attr); struct reuseport_array *array;
if (!bpf_capable()) return ERR_PTR(-EPERM);
/* allocate all map elements and zero-initialize them */ - array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), numa_node); + array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), attr); if (!array) return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 88779f688679..a3b4d2a0a2c7 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -58,13 +58,14 @@ struct bpf_ringbuf_hdr { u32 pg_off; };
-static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node) +static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, union bpf_attr *attr) { const gfp_t flags = GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_ZERO; int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES; int nr_data_pages = data_sz >> PAGE_SHIFT; int nr_pages = nr_meta_pages + nr_data_pages; + int numa_node = bpf_map_attr_numa_node(attr); struct page **pages, *page; struct bpf_ringbuf *rb; size_t array_size; @@ -88,7 +89,7 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node) * user-space implementations significantly. */ array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages); - pages = bpf_map_area_alloc(array_size, numa_node); + pages = bpf_map_area_alloc(array_size, attr); if (!pages) return NULL;
@@ -126,11 +127,11 @@ static void bpf_ringbuf_notify(struct irq_work *work) wake_up_all(&rb->waitq); }
-static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) +static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, union bpf_attr *attr) { struct bpf_ringbuf *rb;
- rb = bpf_ringbuf_area_alloc(data_sz, numa_node); + rb = bpf_ringbuf_area_alloc(data_sz, attr); if (!rb) return NULL;
@@ -169,7 +170,7 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
bpf_map_init_from_attr(&rb_map->map, attr);
- rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node); + rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, attr); if (!rb_map->rb) { kfree(rb_map); return ERR_PTR(-ENOMEM); diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index b2e7dc1d9f5a..ed6bebef0132 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -42,14 +42,15 @@ static inline int stack_map_data_size(struct bpf_map *map) sizeof(struct bpf_stack_build_id) : sizeof(u64); }
-static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) +static int prealloc_elems_and_freelist(union bpf_attr *attr, + struct bpf_stack_map *smap) { u64 elem_size = sizeof(struct stack_map_bucket) + (u64)smap->map.value_size; int err;
- smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries, - smap->map.numa_node); + smap->elems = bpf_map_area_alloc(elem_size * + smap->map.max_entries, attr); if (!smap->elems) return -ENOMEM;
@@ -101,7 +102,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap); cost += n_buckets * (value_size + sizeof(struct stack_map_bucket)); - smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr)); + smap = bpf_map_area_alloc(cost, attr); if (!smap) return ERR_PTR(-ENOMEM);
@@ -113,7 +114,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr) if (err) goto free_smap;
- err = prealloc_elems_and_freelist(smap); + err = prealloc_elems_and_freelist(attr, smap); if (err) goto put_buffers;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 0cca3d7d0d84..f70a7067ef4a 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -295,7 +295,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, * (e.g. in map update path) without taking care of setting the active * memory cgroup (see at bpf_map_kmalloc_node() for example). */ -static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable) +static void *__bpf_map_area_alloc(u64 size, union bpf_attr *attr, bool mmapable) { /* We really just want to fail instead of triggering OOM killer * under memory pressure, therefore we set __GFP_NORETRY to kmalloc, @@ -308,8 +308,9 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable) */
const gfp_t gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT; - unsigned int flags = 0; + int numa_node = bpf_map_attr_numa_node(attr); unsigned long align = 1; + unsigned int flags = 0; void *area;
if (size >= SIZE_MAX) @@ -332,14 +333,14 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable) flags, numa_node, __builtin_return_address(0)); }
-void *bpf_map_area_alloc(u64 size, int numa_node) +void *bpf_map_area_alloc(u64 size, union bpf_attr *attr) { - return __bpf_map_area_alloc(size, numa_node, false); + return __bpf_map_area_alloc(size, attr, false); }
-void *bpf_map_area_mmapable_alloc(u64 size, int numa_node) +void *bpf_map_area_mmapable_alloc(u64 size, union bpf_attr *attr) { - return __bpf_map_area_alloc(size, numa_node, true); + return __bpf_map_area_alloc(size, attr, true); }
void bpf_map_area_free(void *area) diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 7b0215bea413..26b89d37944d 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -49,8 +49,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr) raw_spin_lock_init(&stab->lock);
stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries * - sizeof(struct sock *), - stab->map.numa_node); + sizeof(struct sock *), attr); if (!stab->sks) { kfree(stab); return ERR_PTR(-ENOMEM); @@ -1093,8 +1092,7 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr) }
htab->buckets = bpf_map_area_alloc(htab->buckets_num * - sizeof(struct bpf_shtab_bucket), - htab->map.numa_node); + sizeof(struct bpf_shtab_bucket), attr); if (!htab->buckets) { err = -ENOMEM; goto free_htab; diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c index 10a5ae727bd5..50795c0c9b81 100644 --- a/net/xdp/xskmap.c +++ b/net/xdp/xskmap.c @@ -63,7 +63,6 @@ static void xsk_map_sock_delete(struct xdp_sock *xs, static struct bpf_map *xsk_map_alloc(union bpf_attr *attr) { struct xsk_map *m; - int numa_node; u64 size;
if (!capable(CAP_NET_ADMIN)) @@ -74,10 +73,9 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr) attr->map_flags & ~XSK_MAP_CREATE_FLAG_MASK) return ERR_PTR(-EINVAL);
- numa_node = bpf_map_attr_numa_node(attr); size = struct_size(m, xsk_map, attr->max_entries);
- m = bpf_map_area_alloc(size, numa_node); + m = bpf_map_area_alloc(size, attr); if (!m) return ERR_PTR(-ENOMEM);
Use the helper introduced earlier to decide whether or not to set __GFP_ACCOUNT.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index f70a7067ef4a..add3b4045b4d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -307,7 +307,7 @@ static void *__bpf_map_area_alloc(u64 size, union bpf_attr *attr, bool mmapable) * __GFP_RETRY_MAYFAIL to avoid such situations. */
- const gfp_t gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT; + const gfp_t gfp = map_flags_no_charge(__GFP_NOWARN | __GFP_ZERO, attr); int numa_node = bpf_map_attr_numa_node(attr); unsigned long align = 1; unsigned int flags = 0;
The three functions below are used for memory allocations that do not happen at map creation time:
- bpf_map_kmalloc_node()
- bpf_map_kzalloc()
- bpf_map_alloc_percpu()
For these paths, the no-charge flag can be read from the struct bpf_map member set earlier.
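As a sketch of why this matters, an element allocation in a hash map's update path (based on the existing htab code; the exact gfp flags shown are illustrative) goes through bpf_map_kmalloc_node() and therefore picks up the flag without further changes:

	/* the allocation automatically honours map->no_charge recorded
	 * at creation time, because bpf_map_kmalloc_node() applies
	 * bpf_flags_no_charge() internally
	 */
	l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
				     GFP_ATOMIC | __GFP_NOWARN,
				     htab->map.numa_node);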
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index add3b4045b4d..e84aeefa05f4 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -434,7 +434,8 @@ void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags, void *ptr;
old_memcg = set_active_memcg(map->memcg); - ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node); + ptr = kmalloc_node(size, bpf_flags_no_charge(flags, map->no_charge), + node); set_active_memcg(old_memcg);
return ptr; @@ -446,7 +447,7 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags) void *ptr;
old_memcg = set_active_memcg(map->memcg); - ptr = kzalloc(size, flags | __GFP_ACCOUNT); + ptr = kzalloc(size, bpf_flags_no_charge(flags, map->no_charge)); set_active_memcg(old_memcg);
return ptr; @@ -459,7 +460,8 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, void __percpu *ptr;
old_memcg = set_active_memcg(map->memcg); - ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT); + ptr = __alloc_percpu_gfp(size, align, + bpf_flags_no_charge(flags, map->no_charge)); set_active_memcg(old_memcg);
return ptr;
Some maps do additional allocations in their own ->map_alloc callbacks, where the no-charge flag should also be honored.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/arraymap.c | 2 +-
 kernel/bpf/bpf_local_storage.c | 5 +++--
 kernel/bpf/cpumap.c | 2 +-
 kernel/bpf/devmap.c | 2 +-
 kernel/bpf/hashtab.c | 2 +-
 kernel/bpf/local_storage.c | 2 +-
 kernel/bpf/lpm_trie.c | 2 +-
 kernel/bpf/ringbuf.c | 6 +++---
 8 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index e26aef906392..9df425ad769c 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -1062,7 +1062,7 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr) struct bpf_array_aux *aux; struct bpf_map *map;
- aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT); + aux = kzalloc(sizeof(*aux), map_flags_no_charge(GFP_KERNEL, attr)); if (!aux) return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c index a92d3032fcde..b626546d384d 100644 --- a/kernel/bpf/bpf_local_storage.c +++ b/kernel/bpf/bpf_local_storage.c @@ -582,11 +582,12 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr) { + gfp_t gfp_flags = map_flags_no_charge(GFP_USER | __GFP_NOWARN, attr); struct bpf_local_storage_map *smap; unsigned int i; u32 nbuckets;
- smap = kzalloc(sizeof(*smap), GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT); + smap = kzalloc(sizeof(*smap), gfp_flags); if (!smap) return ERR_PTR(-ENOMEM); bpf_map_init_from_attr(&smap->map, attr); @@ -597,7 +598,7 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr) smap->bucket_log = ilog2(nbuckets);
smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets, - GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT); + map_flags_no_charge(GFP_USER | __GFP_NOWARN, attr)); if (!smap->buckets) { kfree(smap); return ERR_PTR(-ENOMEM); diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 5a5b40e986ff..fd3b3f05e76a 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -99,7 +99,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) attr->map_flags & ~CPU_MAP_CREATE_FLAG_MASK) return ERR_PTR(-EINVAL);
- cmap = kzalloc(sizeof(*cmap), GFP_USER | __GFP_ACCOUNT); + cmap = kzalloc(sizeof(*cmap), map_flags_no_charge(GFP_USER, attr)); if (!cmap) return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index 2857176c82bb..6aaa2e3ce795 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -160,7 +160,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr) if (!capable(CAP_NET_ADMIN)) return ERR_PTR(-EPERM);
- dtab = kzalloc(sizeof(*dtab), GFP_USER | __GFP_ACCOUNT); + dtab = kzalloc(sizeof(*dtab), map_flags_no_charge(GFP_USER, attr)); if (!dtab) return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 2c84045ff8e1..74696d8196a5 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -474,7 +474,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) struct bpf_htab *htab; int err, i;
- htab = kzalloc(sizeof(*htab), GFP_USER | __GFP_ACCOUNT); + htab = kzalloc(sizeof(*htab), map_flags_no_charge(GFP_USER, attr)); if (!htab) return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index 865766c240d6..741ab7cf3626 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -313,7 +313,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) return ERR_PTR(-EINVAL);
map = kmalloc_node(sizeof(struct bpf_cgroup_storage_map), - __GFP_ZERO | GFP_USER | __GFP_ACCOUNT, numa_node); + map_flags_no_charge(__GFP_ZERO | GFP_USER, attr), numa_node); if (!map) return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index f42edf613624..9673f8e98c58 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -557,7 +557,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr) attr->value_size > LPM_VAL_SIZE_MAX) return ERR_PTR(-EINVAL);
- trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT); + trie = kzalloc(sizeof(*trie), map_flags_no_charge(GFP_USER | __GFP_NOWARN, attr)); if (!trie) return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index a3b4d2a0a2c7..3db07cd0ab60 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -60,8 +60,8 @@ struct bpf_ringbuf_hdr {
static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, union bpf_attr *attr) { - const gfp_t flags = GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL | - __GFP_NOWARN | __GFP_ZERO; + const gfp_t flags = map_flags_no_charge(__GFP_RETRY_MAYFAIL | + __GFP_NOWARN | __GFP_ZERO, attr); int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES; int nr_data_pages = data_sz >> PAGE_SHIFT; int nr_pages = nr_meta_pages + nr_data_pages; @@ -164,7 +164,7 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr) return ERR_PTR(-E2BIG); #endif
- rb_map = kzalloc(sizeof(*rb_map), GFP_USER | __GFP_ACCOUNT); + rb_map = kzalloc(sizeof(*rb_map), map_flags_no_charge(GFP_USER, attr)); if (!rb_map) return ERR_PTR(-ENOMEM);
The flags for the BPF_PROG_LOAD command will be easier to read if we aggregate them into an anonymous enum.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/uapi/linux/bpf.h | 15 +++++++++------
 tools/include/uapi/linux/bpf.h | 15 +++++++++------
 2 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index e2dba6cdd88d..93ee04fb8c62 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1065,12 +1065,14 @@ enum bpf_link_type { #define BPF_F_ALLOW_MULTI (1U << 1) #define BPF_F_REPLACE (1U << 2)
+/* flags for BPF_PROG_LOAD command */ +enum { /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the * verifier will perform strict alignment checking as if the kernel * has been built with CONFIG_EFFICIENT_UNALIGNED_ACCESS not set, * and NET_IP_ALIGN defined to 2. */ -#define BPF_F_STRICT_ALIGNMENT (1U << 0) + BPF_F_STRICT_ALIGNMENT = (1U << 0),
/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the * verifier will allow any alignment whatsoever. On platforms @@ -1084,7 +1086,7 @@ enum bpf_link_type { * of an unaligned access the alignment check would trigger before * the one we are interested in. */ -#define BPF_F_ANY_ALIGNMENT (1U << 1) + BPF_F_ANY_ALIGNMENT = (1U << 1),
/* BPF_F_TEST_RND_HI32 is used in BPF_PROG_LOAD command for testing purpose. * Verifier does sub-register def/use analysis and identifies instructions whose @@ -1102,10 +1104,10 @@ enum bpf_link_type { * Then, if verifier is not doing correct analysis, such randomization will * regress tests to expose bugs. */ -#define BPF_F_TEST_RND_HI32 (1U << 2) + BPF_F_TEST_RND_HI32 = (1U << 2),
/* The verifier internal test flag. Behavior is undefined */ -#define BPF_F_TEST_STATE_FREQ (1U << 3) + BPF_F_TEST_STATE_FREQ = (1U << 3),
/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will * restrict map and helper usage for such programs. Sleepable BPF programs can @@ -1113,12 +1115,13 @@ enum bpf_link_type { * Such programs are allowed to use helpers that may sleep like * bpf_copy_from_user(). */ -#define BPF_F_SLEEPABLE (1U << 4) + BPF_F_SLEEPABLE = (1U << 4),
/* If BPF_F_XDP_HAS_FRAGS is used in BPF_PROG_LOAD command, the loaded program * fully support xdp frags. */ -#define BPF_F_XDP_HAS_FRAGS (1U << 5) + BPF_F_XDP_HAS_FRAGS = (1U << 5), +};
/* link_create.kprobe_multi.flags used in LINK_CREATE command for * BPF_TRACE_KPROBE_MULTI attach type to create return probe. diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index e2dba6cdd88d..71a4d8fdc880 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1065,12 +1065,14 @@ enum bpf_link_type { #define BPF_F_ALLOW_MULTI (1U << 1) #define BPF_F_REPLACE (1U << 2)
+/* flags for BPF_PROG_LOAD */ +enum { /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the * verifier will perform strict alignment checking as if the kernel * has been built with CONFIG_EFFICIENT_UNALIGNED_ACCESS not set, * and NET_IP_ALIGN defined to 2. */ -#define BPF_F_STRICT_ALIGNMENT (1U << 0) + BPF_F_STRICT_ALIGNMENT = (1U << 0),
/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the * verifier will allow any alignment whatsoever. On platforms @@ -1084,7 +1086,7 @@ enum bpf_link_type { * of an unaligned access the alignment check would trigger before * the one we are interested in. */ -#define BPF_F_ANY_ALIGNMENT (1U << 1) + BPF_F_ANY_ALIGNMENT = (1U << 1),
/* BPF_F_TEST_RND_HI32 is used in BPF_PROG_LOAD command for testing purpose. * Verifier does sub-register def/use analysis and identifies instructions whose @@ -1102,10 +1104,10 @@ enum bpf_link_type { * Then, if verifier is not doing correct analysis, such randomization will * regress tests to expose bugs. */ -#define BPF_F_TEST_RND_HI32 (1U << 2) + BPF_F_TEST_RND_HI32 = (1U << 2),
/* The verifier internal test flag. Behavior is undefined */ -#define BPF_F_TEST_STATE_FREQ (1U << 3) + BPF_F_TEST_STATE_FREQ = (1U << 3),
/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will * restrict map and helper usage for such programs. Sleepable BPF programs can @@ -1113,12 +1115,13 @@ enum bpf_link_type { * Such programs are allowed to use helpers that may sleep like * bpf_copy_from_user(). */ -#define BPF_F_SLEEPABLE (1U << 4) + BPF_F_SLEEPABLE = (1U << 4),
/* If BPF_F_XDP_HAS_FRAGS is used in BPF_PROG_LOAD command, the loaded program * fully support xdp frags. */ -#define BPF_F_XDP_HAS_FRAGS (1U << 5) + BPF_F_XDP_HAS_FRAGS = (1U << 5), +};
/* link_create.kprobe_multi.flags used in LINK_CREATE command for * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
A no-charge flag is also introduced for bpf progs, similar to the one introduced for bpf maps, and the use case is the same.
It is added to bpf_attr for the BPF_PROG_LOAD command and then recorded in bpf_prog_aux for memory allocations that happen outside the load path. There is a 4-byte hole after the member max_rdwr_access in struct bpf_prog_aux, so the new member can be placed there.
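Once patch #12 is applied, the load path roughly looks like this (sketch combining the relevant hunks of this patch and patch #12):

	/* at load time the flag comes straight from the syscall attributes */
	prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt),
			      prog_flags_no_charge(GFP_USER, attr));

	/* and is remembered for allocations made after the load path */
	prog->aux->no_charge = attr->prog_flags & BPF_F_PROG_NO_CHARGE;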
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h | 8 ++++++++
 include/uapi/linux/bpf.h | 3 +++
 kernel/bpf/syscall.c | 4 +++-
 tools/include/uapi/linux/bpf.h | 3 +++
 4 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 90a542d5a411..69ff3e35b8f2 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -214,6 +214,13 @@ map_flags_no_charge(gfp_t flags, union bpf_attr *attr) return flags |= (attr->map_flags & BPF_F_NO_CHARGE) ? 0 : __GFP_ACCOUNT; }
+static inline gfp_t +prog_flags_no_charge(gfp_t flags, union bpf_attr *attr) +{ + return flags |= (attr->prog_flags & BPF_F_PROG_NO_CHARGE) ? + 0 : __GFP_ACCOUNT; +} + static inline gfp_t bpf_flags_no_charge(gfp_t flags, bool no_charge) { @@ -958,6 +965,7 @@ struct bpf_prog_aux { u32 ctx_arg_info_size; u32 max_rdonly_access; u32 max_rdwr_access; + bool no_charge; /* dont' charge memory to memcg */ struct btf *attach_btf; const struct bpf_ctx_arg_aux *ctx_arg_info; struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 93ee04fb8c62..3c98b1b77db6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1121,6 +1121,9 @@ enum { * fully support xdp frags. */ BPF_F_XDP_HAS_FRAGS = (1U << 5), + +/* Don't charge memory to memcg */ + BPF_F_PROG_NO_CHARGE = (1U << 6), };
/* link_create.kprobe_multi.flags used in LINK_CREATE command for diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index e84aeefa05f4..346f3df9fa1d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2230,7 +2230,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr) BPF_F_TEST_STATE_FREQ | BPF_F_SLEEPABLE | BPF_F_TEST_RND_HI32 | - BPF_F_XDP_HAS_FRAGS)) + BPF_F_XDP_HAS_FRAGS | + BPF_F_PROG_NO_CHARGE)) return -EINVAL;
if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && @@ -2317,6 +2318,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr) prog->aux->offload_requested = !!attr->prog_ifindex; prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE; prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS; + prog->aux->no_charge = attr->prog_flags & BPF_F_PROG_NO_CHARGE;
err = security_bpf_prog_alloc(prog->aux); if (err) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 71a4d8fdc880..89752e5c11c0 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1121,6 +1121,9 @@ enum { * fully support xdp frags. */ BPF_F_XDP_HAS_FRAGS = (1U << 5), + +/* Don't charge memory to memcg */ + BPF_F_PROG_NO_CHARGE = (1U << 6), };
/* link_create.kprobe_multi.flags used in LINK_CREATE command for
When a bpf prog is loaded by a process running in a container (with a memcg), only a sys admin has the privilege to skip charging the bpf prog memory to this container and account it to the root memcg only.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 346f3df9fa1d..ecc5de216f50 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2234,6 +2234,9 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr) BPF_F_PROG_NO_CHARGE)) return -EINVAL;
+ if (attr->prog_flags & BPF_F_PROG_NO_CHARGE && !capable(CAP_SYS_ADMIN)) + return -EPERM; + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && (attr->prog_flags & BPF_F_ANY_ALIGNMENT) && !bpf_capable())
Instead of setting __GFP_ACCOUNT inside bpf_prog_alloc(), let's set it at the callsite. No functional change.
It is a preparation for a follow-up patch.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/core.c | 8 ++++----
 kernel/bpf/syscall.c | 2 +-
 kernel/bpf/verifier.c | 2 +-
 net/core/filter.c | 6 +++---
 4 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 1324f9523e7c..0f68b8203c18 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -80,7 +80,7 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns
struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags) { - gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_ZERO | gfp_extra_flags; + gfp_t gfp_flags = __GFP_ZERO | gfp_extra_flags; struct bpf_prog_aux *aux; struct bpf_prog *fp;
@@ -89,12 +89,12 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag if (fp == NULL) return NULL;
- aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT | gfp_extra_flags); + aux = kzalloc(sizeof(*aux), gfp_extra_flags); if (aux == NULL) { vfree(fp); return NULL; } - fp->active = alloc_percpu_gfp(int, GFP_KERNEL_ACCOUNT | gfp_extra_flags); + fp->active = alloc_percpu_gfp(int, gfp_flags); if (!fp->active) { vfree(fp); kfree(aux); @@ -116,7 +116,7 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) { - gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_ZERO | gfp_extra_flags; + gfp_t gfp_flags = __GFP_ZERO | gfp_extra_flags; struct bpf_prog *prog; int cpu;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index ecc5de216f50..fdfbb4d0d5e0 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2305,7 +2305,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr) }
/* plain bpf_prog allocation */ - prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER); + prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER | __GFP_ACCOUNT); if (!prog) { if (dst_prog) bpf_prog_put(dst_prog); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 0287176bfe9a..fe989cc08391 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -12991,7 +12991,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) * subprogs don't have IDs and not reachable via prog_get_next_id * func[i]->stats will never be accessed and stays NULL */ - func[i] = bpf_prog_alloc_no_stats(bpf_prog_size(len), GFP_USER); + func[i] = bpf_prog_alloc_no_stats(bpf_prog_size(len), GFP_USER | __GFP_ACCOUNT); if (!func[i]) goto out_free; memcpy(func[i]->insnsi, &prog->insnsi[subprog_start], diff --git a/net/core/filter.c b/net/core/filter.c index 03655f2074ae..6466a1e0ed4d 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1363,7 +1363,7 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog) if (!bpf_check_basics_ok(fprog->filter, fprog->len)) return -EINVAL;
- fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0); + fp = bpf_prog_alloc(bpf_prog_size(fprog->len), __GFP_ACCOUNT); if (!fp) return -ENOMEM;
@@ -1410,7 +1410,7 @@ int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog, if (!bpf_check_basics_ok(fprog->filter, fprog->len)) return -EINVAL;
- fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0); + fp = bpf_prog_alloc(bpf_prog_size(fprog->len), __GFP_ACCOUNT); if (!fp) return -ENOMEM;
@@ -1488,7 +1488,7 @@ struct bpf_prog *__get_filter(struct sock_fprog *fprog, struct sock *sk) if (!bpf_check_basics_ok(fprog->filter, fprog->len)) return ERR_PTR(-EINVAL);
- prog = bpf_prog_alloc(bpf_prog_size(fprog->len), 0); + prog = bpf_prog_alloc(bpf_prog_size(fprog->len), __GFP_ACCOUNT); if (!prog) return ERR_PTR(-ENOMEM);
Allow not charging the memory used by bpf progs. This covers the memory used by the progs themselves, the auxiliary data, the statistics and the bpf line info.
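For the allocations that happen after load (jited_linfo, prog realloc), where union bpf_attr is no longer available, the pattern in this patch is roughly:

	gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;

	/* only charge to a memcg if the prog was not loaded with
	 * BPF_F_PROG_NO_CHARGE
	 */
	if (!prog->aux->no_charge)
		gfp_flags |= __GFP_ACCOUNT;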
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/core.c | 12 ++++++++++--
 kernel/bpf/syscall.c | 6 ++++--
 2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 0f68b8203c18..7aa750e8bd7d 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -144,12 +144,17 @@ EXPORT_SYMBOL_GPL(bpf_prog_alloc);
int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog) { + gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN; + if (!prog->aux->nr_linfo || !prog->jit_requested) return 0;
+ if (!prog->aux->no_charge) + gfp_flags |= __GFP_ACCOUNT; + prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo, sizeof(*prog->aux->jited_linfo), - GFP_KERNEL_ACCOUNT | __GFP_NOWARN); + gfp_flags); if (!prog->aux->jited_linfo) return -ENOMEM;
@@ -224,7 +229,7 @@ void bpf_prog_fill_jited_linfo(struct bpf_prog *prog, struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, gfp_t gfp_extra_flags) { - gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_ZERO | gfp_extra_flags; + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; struct bpf_prog *fp; u32 pages;
@@ -233,6 +238,9 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, if (pages <= fp_old->pages) return fp_old;
+ if (!fp_old->aux->no_charge) + gfp_flags |= __GFP_ACCOUNT; + fp = __vmalloc(size, gfp_flags); if (fp) { memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index fdfbb4d0d5e0..e386b549fafc 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2218,9 +2218,10 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr) enum bpf_prog_type type = attr->prog_type; struct bpf_prog *prog, *dst_prog = NULL; struct btf *attach_btf = NULL; - int err; + gfp_t gfp_flags = GFP_USER; char license[128]; bool is_gpl; + int err;
if (CHECK_ATTR(BPF_PROG_LOAD)) return -EINVAL; @@ -2305,7 +2306,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr) }
/* plain bpf_prog allocation */ - prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER | __GFP_ACCOUNT); + prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), + prog_flags_no_charge(gfp_flags, attr)); if (!prog) { if (dst_prog) bpf_prog_put(dst_prog);
Add a BPF_F_NO_CHARGE test case for various map types. Below is the result:

$ ./test_maps
...
test_no_charge:PASS
...
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/map_tests/no_charg.c | 79 +++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/map_tests/no_charg.c
diff --git a/tools/testing/selftests/bpf/map_tests/no_charg.c b/tools/testing/selftests/bpf/map_tests/no_charg.c new file mode 100644 index 000000000000..db18685a53f7 --- /dev/null +++ b/tools/testing/selftests/bpf/map_tests/no_charg.c @@ -0,0 +1,79 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <sys/syscall.h> +#include <linux/bpf.h> +#include <string.h> +#include <unistd.h> +#include <stdio.h> +#include <errno.h> + +#include <test_maps.h> + +struct map_attr { + __u32 map_type; + char *map_name; + __u32 key_size; + __u32 value_size; + __u32 max_entries; + __u32 map_flags; +}; + +static struct map_attr attrs[] = { + {BPF_MAP_TYPE_HASH, "BPF_MAP_TYPE_HASH", 4, 4, 10}, + {BPF_MAP_TYPE_ARRAY, "BPF_MAP_TYPE_ARRAY", 4, 4, 10}, + {BPF_MAP_TYPE_PROG_ARRAY, "BPF_MAP_TYPE_PROG_ARRAY", 4, 4, 10}, + {BPF_MAP_TYPE_PERF_EVENT_ARRAY, "BPF_MAP_TYPE_PERF_EVENT_ARRAY", 4, 4, 10}, + {BPF_MAP_TYPE_PERCPU_HASH, "BPF_MAP_TYPE_PERCPU_HASH", 4, 4, 10}, + {BPF_MAP_TYPE_PERCPU_ARRAY, "BPF_MAP_TYPE_PERCPU_ARRAY", 4, 4, 10}, + {BPF_MAP_TYPE_STACK_TRACE, "BPF_MAP_TYPE_STACK_TRACE", 4, 8, 10}, + {BPF_MAP_TYPE_CGROUP_ARRAY, "BPF_MAP_TYPE_CGROUP_ARRAY", 4, 4, 10}, + {BPF_MAP_TYPE_LRU_HASH, "BPF_MAP_TYPE_LRU_HASH", 4, 4, 10}, + {BPF_MAP_TYPE_LRU_PERCPU_HASH, "BPF_MAP_TYPE_LRU_PERCPU_HASH", 4, 4, 10}, + {BPF_MAP_TYPE_LPM_TRIE, "BPF_MAP_TYPE_LPM_TRIE", 32, 4, 10, BPF_F_NO_PREALLOC}, + {BPF_MAP_TYPE_DEVMAP, "BPF_MAP_TYPE_DEVMAP", 4, 4, 10}, + {BPF_MAP_TYPE_SOCKMAP, "BPF_MAP_TYPE_SOCKMAP", 4, 4, 10}, + {BPF_MAP_TYPE_CPUMAP, "BPF_MAP_TYPE_CPUMAP", 4, 4, 10}, + {BPF_MAP_TYPE_XSKMAP, "BPF_MAP_TYPE_XSKMAP", 4, 4, 10}, + {BPF_MAP_TYPE_SOCKHASH, "BPF_MAP_TYPE_SOCKHASH", 4, 4, 10}, + {BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, "BPF_MAP_TYPE_REUSEPORT_SOCKARRAY", 4, 4, 10}, + {BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, "BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE", 8, 4, 0}, + {BPF_MAP_TYPE_QUEUE, "BPF_MAP_TYPE_QUEUE", 0, 4, 10}, + {BPF_MAP_TYPE_DEVMAP_HASH, "BPF_MAP_TYPE_DEVMAP_HASH", 4, 4, 10}, + {BPF_MAP_TYPE_RINGBUF, "BPF_MAP_TYPE_RINGBUF", 0, 0, 4096}, + {BPF_MAP_TYPE_BLOOM_FILTER, "BPF_MAP_TYPE_BLOOM_FILTER", 0, 4, 10}, +}; + +static __u32 flags[] = { + BPF_F_NO_CHARGE, +}; + +void test_map_flags(union bpf_attr *attr, char *name) +{ + int mfd; + + mfd = syscall(SYS_bpf, BPF_MAP_CREATE, attr, sizeof(*attr)); + CHECK(mfd <= 0 && mfd != -EPERM, "no_charge", "%s error: %s\n", + name, strerror(errno)); + + if (mfd > 0) + close(mfd); +} + +void test_no_charge(void) +{ + union bpf_attr attr; + int i, j; + + memset(&attr, 0, sizeof(attr)); + for (i = 0; i < sizeof(flags) / sizeof(__u32); i++) { + for (j = 0; j < sizeof(attrs) / sizeof(struct map_attr); j++) { + attr.map_type = attrs[j].map_type; + attr.key_size = attrs[j].key_size; + attr.value_size = attrs[j].value_size; + attr.max_entries = attrs[j].max_entries; + attr.map_flags = attrs[j].map_flags | flags[i]; + test_map_flags(&attr, attrs[j].map_name); + } + } + + printf("%s:PASS\n", __func__); +}
Add a test case to check that BPF_F_PROG_NO_CHARGE is accepted. The result is as follows:

$ ./test_progs
...
#103 no_charge:OK
...
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/no_charge.c | 49 +++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/no_charge.c
diff --git a/tools/testing/selftests/bpf/prog_tests/no_charge.c b/tools/testing/selftests/bpf/prog_tests/no_charge.c new file mode 100644 index 000000000000..85fbd2ccbc71 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/no_charge.c @@ -0,0 +1,49 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <unistd.h> +#include <errno.h> +#include <string.h> +#include <linux/bpf.h> +#include <sys/syscall.h> + +#include "test_progs.h" + +#define BPF_ALU64_IMM(OP, DST, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = 0, \ + .imm = IMM }) + +#define BPF_EXIT_INSN() \ + ((struct bpf_insn) { \ + .code = BPF_JMP | BPF_EXIT, \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = 0, \ + .imm = 0 }) + +void test_no_charge(void) +{ + struct bpf_insn prog[] = { + BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0), + BPF_EXIT_INSN(), + }; + union bpf_attr attr; + int duration = 0; + int fd; + + bzero(&attr, sizeof(attr)); + attr.prog_type = BPF_PROG_TYPE_SCHED_CLS; + attr.insn_cnt = 2; + attr.insns = (__u64)prog; + attr.license = (__u64)("GPL"); + attr.prog_flags |= BPF_F_PROG_NO_CHARGE; + + fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)); + CHECK(fd < 0 && fd != -EPERM, "no_charge", "error: %s\n", + strerror(errno)); + + if (fd > 0) + close(fd); +}
Hello, Yafang!
Thank you for continuing to work on this!
On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
After switching to memcg-based bpf memory accounting, the bpf memory is charged to the loader's memcg by defaut, that causes unexpected issues for us. For instance, the container of the loader-which loads the bpf programs and pins them on bpffs-may restart after pinning the progs and maps. After the restart, the pinned progs and maps won't belong to the new container any more, while they actually belong to an offline memcg left by the previous generation. That inconsistent behavior will make trouble for the memory resource management for this container.
I'm looking at this text and increasingly feeling that it's not a bpf-specific problem and it shouldn't be solved as one.
Is there any significant reason why the loader can't temporarily enter the root cgroup before creating bpf maps/progs? I know it will require some changes in the userspace code, but so do new bpf flags.
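A minimal sketch of this alternative, assuming cgroup v2 mounted at /sys/fs/cgroup and sufficient privileges to write to the root cgroup's cgroup.procs; the helper name and the error handling below are illustrative only, not part of any patch:

#include <stdio.h>
#include <unistd.h>

/*
 * Move the calling process into the cgroup at @cgroup_dir by writing its
 * PID to cgroup.procs. The loader would call this with "/sys/fs/cgroup"
 * (the root cgroup) before bpf(BPF_MAP_CREATE)/bpf(BPF_PROG_LOAD), and
 * then call it again with its saved original cgroup path to move back.
 */
static int move_self_to_cgroup(const char *cgroup_dir)
{
        char path[4096];
        FILE *f;

        snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", getpid());
        return fclose(f) ? -1 : 0;
}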
The reason why these progs and maps have to be persistent across multiple generations is that these progs and maps are also used by other processes which are not in this container. IOW, they can't be removed when this container is restarted. Take a specific example, bpf program for clsact qdisc is loaded by a agent running in a container, which not only loads bpf program but also processes the data generated by this program and do some other maintainace things.
In order to keep the charging behavior consistent, we used to consider a way to recharge these pinned maps and progs again after the container is restarted, but after the discussion[1] with Roman, we decided to go another direction that don't charge them to the container in the first place. TL;DR about the mentioned disccussion: recharging is not a generic solution and it may take too much risk.
This patchset is the solution of no charge. Two flags are introduced in union bpf_attr, one for bpf map and another for bpf prog. The user who doesn't want to charge to current memcg can use these two flags. These two flags are only permitted for sys admin as these memory will be accounted to the root memcg only.
If we're going with bpf-specific flags (which I'd prefer not to), let's define them as the way to create system-wide bpf objects which are expected to outlive the original cgroup, with the expectation that they will be treated as belonging to the root cgroup by any sort of existing or future resource accounting (e.g. if we start accounting the CPU used by bpf programs).
But then again: why not just create them in the root cgroup?
Thanks!
On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
Hello, Yafang!
Thank you for continuing to work on this!
On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
After switching to memcg-based bpf memory accounting, the bpf memory is charged to the loader's memcg by defaut, that causes unexpected issues for us. For instance, the container of the loader-which loads the bpf programs and pins them on bpffs-may restart after pinning the progs and maps. After the restart, the pinned progs and maps won't belong to the new container any more, while they actually belong to an offline memcg left by the previous generation. That inconsistent behavior will make trouble for the memory resource management for this container.
I'm looking at this text and increasingly feeling that it's not a bpf-specific problem and it shouldn't be solved as one.
I'm not sure whether it is a common issue or not, but I'm sure bpf has a special attribute that we should handle specifically. Let me show you an example of why bpf is a special case.
A pinned bpf object is similar to a kernel module, right? But this issue can't happen with a kernel module; it can only happen with bpf. The reason is that a kernel module can choose whether or not to account the memory it allocates, e.g.
- Accounted:     kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
- Not accounted: kmalloc(size, GFP_KERNEL);
bpf, on the other hand, has no such choice, because GFP_KERNEL is a kernel API that is not exposed to user space.
The issue was exposed when memcg-based accounting was forcibly enabled for all bpf programs. That is a behavior change, and unfortunately we don't give the user a chance to keep the old behavior unless they don't use memcg at all...
That is not to say memcg-based accounting is bad; it is really useful, but it needs to be improved. We can't expose __GFP_ACCOUNT to the user, but we can expose a wrapper around __GFP_ACCOUNT, which is what this patchset does, as we have always done in bpf.
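A minimal kernel-side sketch of what such a wrapper could look like, assuming the BPF_F_NO_CHARGE map flag from this patchset; the helper below is hypothetical and only illustrates the gfp mapping (the actual patches pass the flag down through bpf_map_area_alloc() and the map-specific allocators):

/*
 * Hypothetical helper: translate the user-visible flag into the gfp mask
 * used for map allocations. Without BPF_F_NO_CHARGE the memory is charged
 * to the loader's memcg via __GFP_ACCOUNT; with the flag (sys admin only)
 * it is left uncharged, i.e. accounted to the root memcg only.
 */
static inline gfp_t bpf_map_alloc_gfp(__u32 map_flags)
{
        gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;

        if (!(map_flags & BPF_F_NO_CHARGE))
                gfp |= __GFP_ACCOUNT;

        return gfp;
}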
Is there any significant reason why the loader can't temporarily enter the root cgroup before creating bpf maps/progs? I know it will require some changes in the userspace code, but so do new bpf flags.
In our k8s environment, the user agents have to be deployed as a DaemonSet[1]. Temporarily entering the root cgroup before creating bpf maps/progs would cause more trouble; for example, we would have to change the way we deploy the user agents, which would be a big project.
[1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
The reason why these progs and maps have to be persistent across multiple generations is that these progs and maps are also used by other processes which are not in this container. IOW, they can't be removed when this container is restarted. Take a specific example, bpf program for clsact qdisc is loaded by a agent running in a container, which not only loads bpf program but also processes the data generated by this program and do some other maintainace things.
In order to keep the charging behavior consistent, we used to consider a way to recharge these pinned maps and progs again after the container is restarted, but after the discussion[1] with Roman, we decided to go another direction that don't charge them to the container in the first place. TL;DR about the mentioned disccussion: recharging is not a generic solution and it may take too much risk.
This patchset is the solution of no charge. Two flags are introduced in union bpf_attr, one for bpf map and another for bpf prog. The user who doesn't want to charge to current memcg can use these two flags. These two flags are only permitted for sys admin as these memory will be accounted to the root memcg only.
If we're going with bpf-specific flags (which I'd prefer not to), let's define them as the way to create system-wide bpf objects which are expected to outlive the original cgroup, with the expectation that they will be treated as belonging to the root cgroup by any sort of existing or future resource accounting (e.g. if we start accounting the CPU used by bpf programs).
Now that we're talking about the CPU resource, I have some more complaints: the cpu cgroup really does a better job than the memory cgroup. Below is a detailed look at why the cpu cgroup does well:
- CPU:    task cgroup, code.  CPU time is accounted to the one who is executING this code.
- Memory: memory cgroup, data. Memory usage is accounted to the one who allocatED this data.
Can you see the difference? The difference is that CPU time is accounted to the one who is using it (which is reasonable), while memory usage is accounted to the one who allocated it (which is unreasonable). If we split the data/code into private and shared, we can see why it is unreasonable.
Memory cgroup:
  Private data: private and thus accounted to one single memcg, good.
  Shared data:  shared but accounted to one single memcg, bad.
Task cgroup:
  Private code: private and thus accounted to one single task group, good.
  Shared code:  shared and thus accounted to all the task groups, good.
Pages are accounted when they are allocated rather than when they are used; that is why so many ridiculous things happen. But we have the common sense that we can't dynamically charge a page to the process that is accessing it, right? So we have to handle the issues caused by shared pages case by case.
But then again: why not just create them in the root cgroup?
As I explained above, that is limited by the deployment.
On Wed, Mar 23, 2022 at 12:10:34AM +0800, Yafang Shao wrote:
On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
Hello, Yafang!
Thank you for continuing to work on this!
On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
After switching to memcg-based bpf memory accounting, the bpf memory is charged to the loader's memcg by defaut, that causes unexpected issues for us. For instance, the container of the loader-which loads the bpf programs and pins them on bpffs-may restart after pinning the progs and maps. After the restart, the pinned progs and maps won't belong to the new container any more, while they actually belong to an offline memcg left by the previous generation. That inconsistent behavior will make trouble for the memory resource management for this container.
I'm looking at this text and increasingly feeling that it's not a bpf-specific problem and it shouldn't be solved as one.
I'm not sure whether it is a common issue or not, but I'm sure bpf has a special attribute that we should handle specifically. Let me show you an example of why bpf is a special case.
A pinned bpf object is similar to a kernel module, right? But this issue can't happen with a kernel module; it can only happen with bpf. The reason is that a kernel module can choose whether or not to account the memory it allocates, e.g.
- Accounted:     kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
- Not accounted: kmalloc(size, GFP_KERNEL);
bpf, on the other hand, has no such choice, because GFP_KERNEL is a kernel API that is not exposed to user space.
But if your process opens a file, creates a pipe, etc., there are also kernel allocations happening, and the process has no control over whether these allocations are accounted or not. The same applies to anonymous memory and pagecache as well, so it's not even kmem-specific.
The issue was exposed when memcg-based accounting was forcibly enabled for all bpf programs. That is a behavior change, and unfortunately we don't give the user a chance to keep the old behavior unless they don't use memcg at all...
That is not to say memcg-based accounting is bad; it is really useful, but it needs to be improved. We can't expose __GFP_ACCOUNT to the user, but we can expose a wrapper around __GFP_ACCOUNT, which is what this patchset does, as we have always done in bpf.
Is there any significant reason why the loader can't temporarily enter the root cgroup before creating bpf maps/progs? I know it will require some changes in the userspace code, but so do new bpf flags.
In our k8s environment, the user agents have to be deployed as a DaemonSet[1]. Temporarily entering the root cgroup before creating bpf maps/progs would cause more trouble; for example, we would have to change the way we deploy the user agents, which would be a big project.
I understand; however, introducing new kernel interfaces to overcome such things has its own downside: every introduced interface will stay pretty much forever and we'll _have_ to support it. Kernel interfaces have a very long life cycle; we have to admit it.
The problem you're describing - inconsistencies in the accounting of shared regions of memory - is a generic cgroup problem, which has a configuration solution: the resource accounting and control should be performed at a stable level, and actual workloads can be (re)started in sub-cgroups with optionally disabled physical controllers. E.g.:

/
  workload2.slice
  workload1.slice          <- accounting should be performed here
    workload_gen1.scope
    workload_gen2.scope
    ...
[1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
The reason why these progs and maps have to be persistent across multiple generations is that these progs and maps are also used by other processes which are not in this container. IOW, they can't be removed when this container is restarted. Take a specific example, bpf program for clsact qdisc is loaded by a agent running in a container, which not only loads bpf program but also processes the data generated by this program and do some other maintainace things.
In order to keep the charging behavior consistent, we used to consider a way to recharge these pinned maps and progs again after the container is restarted, but after the discussion[1] with Roman, we decided to go another direction that don't charge them to the container in the first place. TL;DR about the mentioned disccussion: recharging is not a generic solution and it may take too much risk.
This patchset is the solution of no charge. Two flags are introduced in union bpf_attr, one for bpf map and another for bpf prog. The user who doesn't want to charge to current memcg can use these two flags. These two flags are only permitted for sys admin as these memory will be accounted to the root memcg only.
If we're going with bpf-specific flags (which I'd prefer not to), let's define them as the way to create system-wide bpf objects which are expected to outlive the original cgroup, with the expectation that they will be treated as belonging to the root cgroup by any sort of existing or future resource accounting (e.g. if we start accounting the CPU used by bpf programs).
Now that we're talking about the CPU resource, I have some more complaints: the cpu cgroup really does a better job than the memory cgroup. Below is a detailed look at why the cpu cgroup does well:
- CPU:    task cgroup, code.  CPU time is accounted to the one who is executING this code.
- Memory: memory cgroup, data. Memory usage is accounted to the one who allocatED this data.
Can you see the difference?
Well, RAM is a vastly different thing than CPU :) They have different physical properties and corresponding accounting limitations.
The difference is that CPU time is accounted to the one who is using it (which is reasonable), while memory usage is accounted to the one who allocated it (which is unreasonable). If we split the data/code into private and shared, we can see why it is unreasonable.
Memory cgroup:
  Private data: private and thus accounted to one single memcg, good.
  Shared data:  shared but accounted to one single memcg, bad.
Task cgroup:
  Private code: private and thus accounted to one single task group, good.
  Shared code:  shared and thus accounted to all the task groups, good.
Pages are accounted when they are allocated rather than when they are used; that is why so many ridiculous things happen. But we have the common sense that we can't dynamically charge a page to the process that is accessing it, right? So we have to handle the issues caused by shared pages case by case.
The accounting of shared regions of memory is complex because of two physical limitations:
1) The amount of (meta)data which can be used to track ownership. We expect the memory overhead to be small in comparison to the accounted data. If a page is used by many cgroups, even saving a single pointer to each cgroup can take a lot of space, and it's even worse for slab objects. At some point it stops making sense: if the accounting takes more memory than the accounted memory, it's better not to account at all.
2) CPU overhead: tracking memory usage beyond the initial allocation adds overhead to some very hot paths. Imagine two processes mapping the same file: the first process faults in the whole file and the second just uses the pagecache. Currently this is very efficient. Making the second process change the ownership information on each minor page fault would lead to a performance regression. Think of the libc binary as this file.
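A minimal user-space sketch of that scenario, assuming an arbitrary shared read-only file; running it from two different cgroups and comparing their memory.current (cgroup v2) would show that only the first process to touch the pages gets the pagecache charged:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* arbitrary shared file, e.g. a preloaded data blob */
        int fd = open("/var/tmp/shared-data", O_RDONLY);
        struct stat st;
        volatile char sum = 0;
        char *p;
        off_t i;

        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;

        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /*
         * Touch every page: the process that faults a page in first gets
         * it charged to its memcg; later mappers reuse the pagecache
         * without any ownership change.
         */
        for (i = 0; i < st.st_size; i += 4096)
                sum += p[i];

        munmap(p, st.st_size);
        close(fd);
        return 0;
}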
That said, I'm not saying it can't be done better than it is now. But it's a complex problem.
Thanks!
On Wed, Mar 23, 2022 at 3:10 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
On Wed, Mar 23, 2022 at 12:10:34AM +0800, Yafang Shao wrote:
On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
Hello, Yafang!
Thank you for continuing to work on this!
On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
After switching to memcg-based bpf memory accounting, the bpf memory is charged to the loader's memcg by defaut, that causes unexpected issues for us. For instance, the container of the loader-which loads the bpf programs and pins them on bpffs-may restart after pinning the progs and maps. After the restart, the pinned progs and maps won't belong to the new container any more, while they actually belong to an offline memcg left by the previous generation. That inconsistent behavior will make trouble for the memory resource management for this container.
I'm looking at this text and increasingly feeling that it's not a bpf-specific problem and it shouldn't be solved as one.
I'm not sure whether it is a common issue or not, but I'm sure bpf has a special attribute that we should handle specifically. Let me show you an example of why bpf is a special case.
A pinned bpf object is similar to a kernel module, right? But this issue can't happen with a kernel module; it can only happen with bpf. The reason is that a kernel module can choose whether or not to account the memory it allocates, e.g.
- Accounted:     kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
- Not accounted: kmalloc(size, GFP_KERNEL);
bpf, on the other hand, has no such choice, because GFP_KERNEL is a kernel API that is not exposed to user space.
But if your process opens a file, creates a pipe, etc., there are also kernel allocations happening, and the process has no control over whether these allocations are accounted or not. The same applies to anonymous memory and pagecache as well, so it's not even kmem-specific.
So what is the real problem in practice? Does anyone complain about it?
[At least, there's no behavior change in these areas.]
The issue was exposed when memcg-based accounting was forcibly enabled for all bpf programs. That is a behavior change, and unfortunately we don't give the user a chance to keep the old behavior unless they don't use memcg at all...
That is not to say memcg-based accounting is bad; it is really useful, but it needs to be improved. We can't expose __GFP_ACCOUNT to the user, but we can expose a wrapper around __GFP_ACCOUNT, which is what this patchset does, as we have always done in bpf.
Is there any significant reason why the loader can't temporarily enter the root cgroup before creating bpf maps/progs? I know it will require some changes in the userspace code, but so do new bpf flags.
In our k8s environment, the user agents have to be deployed as a DaemonSet[1]. Temporarily entering the root cgroup before creating bpf maps/progs would cause more trouble; for example, we would have to change the way we deploy the user agents, which would be a big project.
I understand; however, introducing new kernel interfaces to overcome such things has its own downside: every introduced interface will stay pretty much forever and we'll _have_ to support it. Kernel interfaces have a very long life cycle; we have to admit it.
The problem you're describing - inconsistencies in the accounting of shared regions of memory - is a generic cgroup problem, which has a configuration solution: the resource accounting and control should be performed at a stable level, and actual workloads can be (re)started in sub-cgroups with optionally disabled physical controllers. E.g.:

/
  workload2.slice
  workload1.slice          <- accounting should be performed here
    workload_gen1.scope
    workload_gen2.scope
    ...
I think we talked about it several days earlier.
[1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
The reason why these progs and maps have to be persistent across multiple generations is that these progs and maps are also used by other processes which are not in this container. IOW, they can't be removed when this container is restarted. Take a specific example, bpf program for clsact qdisc is loaded by a agent running in a container, which not only loads bpf program but also processes the data generated by this program and do some other maintainace things.
In order to keep the charging behavior consistent, we used to consider a way to recharge these pinned maps and progs again after the container is restarted, but after the discussion[1] with Roman, we decided to go another direction that don't charge them to the container in the first place. TL;DR about the mentioned disccussion: recharging is not a generic solution and it may take too much risk.
This patchset is the solution of no charge. Two flags are introduced in union bpf_attr, one for bpf map and another for bpf prog. The user who doesn't want to charge to current memcg can use these two flags. These two flags are only permitted for sys admin as these memory will be accounted to the root memcg only.
If we're going with bpf-specific flags (which I'd prefer not to), let's define them as the way to create system-wide bpf objects which are expected to outlive the original cgroup, with the expectation that they will be treated as belonging to the root cgroup by any sort of existing or future resource accounting (e.g. if we start accounting the CPU used by bpf programs).
Now that we're talking about the CPU resource, I have some more complaints: the cpu cgroup really does a better job than the memory cgroup. Below is a detailed look at why the cpu cgroup does well:
- CPU:    task cgroup, code.  CPU time is accounted to the one who is executING this code.
- Memory: memory cgroup, data. Memory usage is accounted to the one who allocatED this data.
Can you see the difference?
Well, RAM is a vastly different thing than CPU :) They have different physical properties and corresponding accounting limitations.
The difference is that CPU time is accounted to the one who is using it (which is reasonable), while memory usage is accounted to the one who allocated it (which is unreasonable). If we split the data/code into private and shared, we can see why it is unreasonable.
Memory cgroup:
  Private data: private and thus accounted to one single memcg, good.
  Shared data:  shared but accounted to one single memcg, bad.
Task cgroup:
  Private code: private and thus accounted to one single task group, good.
  Shared code:  shared and thus accounted to all the task groups, good.
Pages are accounted when they are allocated rather than when they are used; that is why so many ridiculous things happen. But we have the common sense that we can't dynamically charge a page to the process that is accessing it, right? So we have to handle the issues caused by shared pages case by case.
The accounting of shared regions of memory is complex because of two physical limitations:
1) The amount of (meta)data which can be used to track ownership. We expect the memory overhead to be small in comparison to the accounted data. If a page is used by many cgroups, even saving a single pointer to each cgroup can take a lot of space, and it's even worse for slab objects. At some point it stops making sense: if the accounting takes more memory than the accounted memory, it's better not to account at all.
2) CPU overhead: tracking memory usage beyond the initial allocation adds overhead to some very hot paths. Imagine two processes mapping the same file: the first process faults in the whole file and the second just uses the pagecache. Currently this is very efficient. Making the second process change the ownership information on each minor page fault would lead to a performance regression. Think of the libc binary as this file.
That said, I'm not saying it can't be done better than it is now. But it's a complex problem.
Memcg-based accounting is also complex, but it introduces a user-visible behavior change now :(