An unintended LRU eviction issue was observed while developing the selftest for "[PATCH bpf-next v10 0/8] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps" [1].
When updating an existing element in lru_hash or lru_percpu_hash maps, the current implementation calls prealloc_lru_pop() to get a new node before checking if the key already exists. If the map is full, this triggers LRU eviction and removes an existing element, even though the update operation only needs to modify the value in-place.
In the selftest, this had to be worked around by reserving an extra entry so that __htab_lru_percpu_map_update_elem() would not trigger eviction. However, the underlying issue remains problematic because:
1. Users may unexpectedly lose entries when updating existing keys in a
   full map.
2. The eviction overhead is unnecessary for existing key updates.
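To make the failure mode concrete, below is a minimal userspace reproducer sketch. It is illustrative only and not part of the series; it uses standard libbpf calls, and the map name, size, and values are arbitrary:

	#include <bpf/bpf.h>
	#include <stdio.h>

	int main(void)
	{
		int fd, key, max_entries = 128;
		__u64 val = 1;

		fd = bpf_map_create(BPF_MAP_TYPE_LRU_HASH, "lru_repro",
				    sizeof(int), sizeof(__u64), max_entries, NULL);
		if (fd < 0)
			return 1;

		/* fill the map completely */
		for (key = 0; key < max_entries; key++)
			bpf_map_update_elem(fd, &key, &val, 0);

		/* update an existing key in the full map ... */
		key = 0;
		val = 2;
		bpf_map_update_elem(fd, &key, &val, 0);

		/* ... then check whether an unrelated key got evicted;
		 * before the fix, a lookup below may fail with -ENOENT
		 */
		for (key = 1; key < max_entries; key++)
			if (bpf_map_lookup_elem(fd, &key, &val))
				printf("key %d was evicted\n", key);

		return 0;
	}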
This patchset fixes the issue by first checking if the key exists before allocating a new node. If the key is found, update the value in-place, refresh the LRU reference, and return immediately without triggering any eviction. Only proceed with node allocation if the key does not exist.
Links:
[1] https://lore.kernel.org/bpf/20251117162033.6296-1-leon.hwang@linux.dev/
Leon Hwang (3):
  bpf: Avoid unintended eviction when updating lru_hash maps
  bpf: Avoid unintended eviction when updating lru_percpu_hash maps
  selftests/bpf: Add tests to verify no unintended eviction when
    updating lru hash maps
 kernel/bpf/hashtab.c                               | 43 +++++++++++
 .../selftests/bpf/prog_tests/htab_update.c         | 73 +++++++++++++++++++
 2 files changed, 116 insertions(+)
--
2.52.0
When updating an existing element in lru_hash maps, the current implementation always calls prealloc_lru_pop() to get a new node before checking if the key already exists. If the map is full, this triggers LRU eviction and removes an existing element, even though the update operation only needs to modify the value of an existing key in-place.
This is problematic because:
1. Users may unexpectedly lose entries when doing simple value updates
2. The eviction overhead is unnecessary for existing key updates
Fix this by first checking if the key exists before allocating a new node. If the key is found, update the value in-place, refresh the LRU reference, and return immediately without triggering any eviction.
Fixes: 29ba732acbee ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 kernel/bpf/hashtab.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c8a9b27f8663..fb624aa76573 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1207,6 +1207,27 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
+	ret = htab_lock_bucket(b, &flags);
+	if (ret)
+		goto err_lock_bucket;
+
+	l_old = lookup_elem_raw(head, hash, key, key_size);
+
+	ret = check_flags(htab, l_old, map_flags);
+	if (ret)
+		goto err;
+
+	if (l_old) {
+		bpf_lru_node_set_ref(&l_old->lru_node);
+		copy_map_value(&htab->map, htab_elem_value(l_old, map->key_size), value);
+		check_and_free_fields(htab, l_old);
+	}
+
+	htab_unlock_bucket(b, flags);
+
+	if (l_old)
+		return 0;
+
 	/* For LRU, we need to alloc before taking bucket's
 	 * spinlock because getting free nodes from LRU may need
 	 * to remove older elements from htab and this removal
Similar to the previous fix for lru_hash maps, the lru_percpu_hash map implementation also suffers from unnecessary eviction when updating existing elements.
When updating a key that already exists in a full lru_percpu_hash map, the current code path calls prealloc_lru_pop() before checking for the existing key (unless map_flags is BPF_EXIST). This can evict an unrelated element even though the update is just modifying the per-CPU value of an existing entry.
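In other words, with the current code the preallocation (and hence the eviction) is skipped only when the caller explicitly passes BPF_EXIST. A caller who already knows the key is present can therefore sidestep the problem today; illustrative sketch, with fd/key/values standing in for a real caller's state:

	/* BPF_EXIST skips prealloc_lru_pop(), so no eviction occurs */
	err = bpf_map_update_elem(fd, &key, values, BPF_EXIST);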
Fix this by looking up the key first. If found, update the per-CPU value in-place using pcpu_copy_value(), refresh the LRU reference, and return early. Only proceed with node allocation if the key does not exist.
Fixes: 8f8449384ec3 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 kernel/bpf/hashtab.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index fb624aa76573..af54fc3a9ba9 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1358,6 +1358,28 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
+	ret = htab_lock_bucket(b, &flags);
+	if (ret)
+		goto err_lock_bucket;
+
+	l_old = lookup_elem_raw(head, hash, key, key_size);
+
+	ret = check_flags(htab, l_old, map_flags);
+	if (ret)
+		goto err;
+
+	if (l_old) {
+		bpf_lru_node_set_ref(&l_old->lru_node);
+		/* per-cpu hash map can update value in-place */
+		pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
+				value, onallcpus);
+	}
+
+	htab_unlock_bucket(b, flags);
+
+	if (l_old)
+		return 0;
+
 	/* For LRU, we need to alloc before taking bucket's
 	 * spinlock because LRU's elem alloc may need
 	 * to remove older elem from htab and this removal
Add two tests to verify that updating an existing element in LRU hash maps does not cause unintended eviction of other elements.
The test creates lru_hash/lru_percpu_hash maps with max_entries slots
and populates all of them. It then updates an existing key and verifies
that:
1. The update succeeds without error
2. The updated key has the new value
3. All other keys still exist with their original values
This validates the fix that prevents unnecessary LRU eviction when updating existing elements in full LRU hash maps.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 .../selftests/bpf/prog_tests/htab_update.c | 73 +++++++++++++++++++
 1 file changed, 73 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/htab_update.c b/tools/testing/selftests/bpf/prog_tests/htab_update.c
index d0b405eb2966..bd29a915bb05 100644
--- a/tools/testing/selftests/bpf/prog_tests/htab_update.c
+++ b/tools/testing/selftests/bpf/prog_tests/htab_update.c
@@ -143,3 +143,76 @@ void test_htab_update(void)
 	if (test__start_subtest("concurrent_update"))
 		test_concurrent_update();
 }
+
+static void test_lru_hash_map_update_elem(enum bpf_map_type map_type)
+{
+	int err, map_fd, i, key, nr_cpus, max_entries = 128;
+	u64 *values, value = 0xDEADC0DE;
+
+	nr_cpus = libbpf_num_possible_cpus();
+	if (!ASSERT_GT(nr_cpus, 0, "libbpf_num_possible_cpus"))
+		return;
+
+	values = calloc(nr_cpus, sizeof(u64));
+	if (!ASSERT_OK_PTR(values, "calloc values"))
+		return;
+	for (i = 0; i < nr_cpus; i++)
+		values[i] = value;
+
+	map_fd = bpf_map_create(map_type, "test_lru", sizeof(int), sizeof(u64), max_entries, NULL);
+	if (!ASSERT_GE(map_fd, 0, "bpf_map_create")) {
+		free(values);
+		return;
+	}
+
+	/* populate all slots */
+	for (key = 0; key < max_entries; key++) {
+		err = bpf_map_update_elem(map_fd, &key, values, 0);
+		if (!ASSERT_OK(err, "bpf_map_update_elem"))
+			goto out;
+	}
+
+	/* LRU eviction should not happen */
+
+	key = 0;
+	memset(values, 0, nr_cpus * sizeof(u64));
+	err = bpf_map_update_elem(map_fd, &key, values, 0);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto out;
+
+	err = bpf_map_lookup_elem(map_fd, &key, values);
+	if (!ASSERT_OK(err, "bpf_map_lookup_elem"))
+		goto out;
+	if (!ASSERT_EQ(*values, 0, "bpf_map_lookup_elem value"))
+		goto out;
+
+	for (key = 1; key < max_entries; key++) {
+		err = bpf_map_lookup_elem(map_fd, &key, values);
+		if (!ASSERT_OK(err, "bpf_map_lookup_elem"))
+			goto out;
+		if (!ASSERT_EQ(*values, value, "bpf_map_lookup_elem value"))
+			goto out;
+	}
+
+out:
+	close(map_fd);
+	free(values);
+}
+
+static void test_update_lru_hash_map(void)
+{
+	test_lru_hash_map_update_elem(BPF_MAP_TYPE_LRU_HASH);
+}
+
+static void test_update_lru_percpu_hash_map(void)
+{
+	test_lru_hash_map_update_elem(BPF_MAP_TYPE_LRU_PERCPU_HASH);
+}
+
+void test_update_lru_hash_maps(void)
+{
+	if (test__start_subtest("lru_hash"))
+		test_update_lru_hash_map();
+	if (test__start_subtest("lru_percpu_hash"))
+		test_update_lru_percpu_hash_map();
+}
> diff --git a/tools/testing/selftests/bpf/prog_tests/htab_update.c b/tools/testing/selftests/bpf/prog_tests/htab_update.c
> index d0b405eb2..bd29a915b 100644
> --- a/tools/testing/selftests/bpf/prog_tests/htab_update.c
> +++ b/tools/testing/selftests/bpf/prog_tests/htab_update.c

[ ... ]

> +static void test_lru_hash_map_update_elem(enum bpf_map_type map_type)
> +{
> +	int err, map_fd, i, key, nr_cpus, max_entries = 128;
> +	u64 *values, value = 0xDEADC0DE;
> +
> +	nr_cpus = libbpf_num_possible_cpus();
> +	if (!ASSERT_GT(nr_cpus, 0, "libbpf_num_possible_cpus"))
> +		return;
> +
> +	values = calloc(nr_cpus, sizeof(u64));

[ ... ]

> +	err = bpf_map_lookup_elem(map_fd, &key, values);
> +	if (!ASSERT_OK(err, "bpf_map_lookup_elem"))
> +		goto out;
> +	if (!ASSERT_EQ(*values, 0, "bpf_map_lookup_elem value"))
> +		goto out;
> +
> +	for (key = 1; key < max_entries; key++) {
> +		err = bpf_map_lookup_elem(map_fd, &key, values);
> +		if (!ASSERT_OK(err, "bpf_map_lookup_elem"))
> +			goto out;
> +		if (!ASSERT_EQ(*values, value, "bpf_map_lookup_elem value"))
> +			goto out;
> +	}
When test_lru_hash_map_update_elem() is called with BPF_MAP_TYPE_LRU_PERCPU_HASH via test_update_lru_percpu_hash_map(), should all CPU values be checked rather than just *values?
For percpu maps, bpf_map_lookup_elem() returns an array with one value per CPU. The test allocates space for nr_cpus values but only verifies the first element. This means values for CPUs 1 through nr_cpus-1 are not validated.
Looking at tools/testing/selftests/bpf/prog_tests/lookup_and_delete.c, the pattern for percpu maps is to loop through all CPUs:
	for (i = 0; i < nr_cpus; i++) {
		val = value[i];
		/* check val */
	}
For BPF_MAP_TYPE_LRU_HASH checking only *values is correct, but for BPF_MAP_TYPE_LRU_PERCPU_HASH the test may pass even if values on CPUs other than the first are incorrect.
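Applied to this test, the check could look roughly like the sketch below (untested; it reuses the patch's values/value/nr_cpus and assumes the percpu lookup fills one value per possible CPU):

	int n = map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ? nr_cpus : 1;

	for (i = 0; i < n; i++)
		if (!ASSERT_EQ(values[i], value, "bpf_map_lookup_elem value"))
			goto out;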
> +out:
> +	close(map_fd);
> +	free(values);
> +}
> +
> +static void test_update_lru_hash_map(void)
> +{
> +	test_lru_hash_map_update_elem(BPF_MAP_TYPE_LRU_HASH);
> +}
> +
> +static void test_update_lru_percpu_hash_map(void)
> +{
> +	test_lru_hash_map_update_elem(BPF_MAP_TYPE_LRU_PERCPU_HASH);
> +}
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19864460617
On Tue, Dec 2, 2025 at 7:31 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> When updating an existing element in lru_hash maps, the current
> implementation always calls prealloc_lru_pop() to get a new node
> before checking if the key already exists. If the map is full, this
> triggers LRU eviction and removes an existing element, even though the
> update operation only needs to modify the value of an existing key
> in-place.
>
> This is problematic because:
> 1. Users may unexpectedly lose entries when doing simple value updates
> 2. The eviction overhead is unnecessary for existing key updates
>
> Fix this by first checking if the key exists before allocating a new
> node. If the key is found, update the value in-place, refresh the LRU
> reference, and return immediately without triggering any eviction.
>
> Fixes: 29ba732acbee ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
>  kernel/bpf/hashtab.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index c8a9b27f8663..fb624aa76573 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1207,6 +1207,27 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
>  	b = __select_bucket(htab, hash);
>  	head = &b->head;
>
> +	ret = htab_lock_bucket(b, &flags);
> +	if (ret)
> +		goto err_lock_bucket;
> +
> +	l_old = lookup_elem_raw(head, hash, key, key_size);
> +
> +	ret = check_flags(htab, l_old, map_flags);
> +	if (ret)
> +		goto err;
> +
> +	if (l_old) {
> +		bpf_lru_node_set_ref(&l_old->lru_node);
> +		copy_map_value(&htab->map, htab_elem_value(l_old, map->key_size), value);
> +		check_and_free_fields(htab, l_old);
> +	}
We cannot do this. It breaks the atomicity of the update. We added htab_map_update_elem_in_place() for a very specific case. See https://lore.kernel.org/all/20250401062250.543403-1-houtao@huaweicloud.com/ and the discussion in v1 and v2.
We cannot do in-place updates for other map types. It will break user expectations.
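(If I read the atomicity concern right, the issue is roughly the following interleaving, sketched here for illustration only:

	CPU 0 (in-place update)            CPU 1 (concurrent lookup)
	copy_map_value() starts
	  writes first bytes of value      reads a mix of old and new bytes
	  writes remaining bytes

The existing path instead allocates a new element and swaps it into the bucket list, so a reader observes either the complete old value or the complete new one.)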
pw-bot: cr