Hi All, We have been noticing protection key test failures on our systems here. Shggy(Dave Kleikamp) from oracle reported to this problem couple of weeks ago. Here the is the failure.
# ./protection_keys_64 has pkeys: 1 startup pkey_reg: 0000000055555550 test 0 PASSED (iteration 1) test 1 PASSED (iteration 1) test 2 PASSED (iteration 1) test 3 PASSED (iteration 1) test 4 PASSED (iteration 1) test 5 PASSED (iteration 1) test 6 PASSED (iteration 1) test 7 PASSED (iteration 1) test 8 PASSED (iteration 1) test 9 PASSED (iteration 1) test 10 PASSED (iteration 1) test 11 PASSED (iteration 1) test 12 PASSED (iteration 1) test 13 PASSED (iteration 1) protection_keys_64: pkey-helpers.h:127: _read_pkey_reg: Assertion `pkey_reg == shadow_pkey_reg' failed. Aborted (core dumped)
The test that is failing is "test_ptrace_of_child".
Sometimes it fails even earlier also. Sometimes(very rarely) it passes when I insert few printfs. Most probably it fails with test_ptrace_of_child.
In the test "test_ptrace_of_child", the parent thread attaches to the child and modifies the data structure associated with protection key. Verifies the access results based on the key permissions. While running the test, the tool finds the key permission changes out of nowhere. The test fails with assert "assert(pkey_reg == shadow_pkey_reg);"
Investigation so far. 1. The test fails on AMD(Milan) systems. Test appears to pass on Intel systems. Tested on old Intel system available here.
2. I was trying to see if the hardware(or firmware) is corrupting the pkru register value. At this point, it does not appear that way. I was able to trace all the MSR writes to the application or kernel. At this point, it does not appear to me as an hardware issue. I see that kernel appears to do save/restore properly during the context switch. This value stays default(value 0x55555554) in most cases unless the application changes the default permissions. Only application that changes here is "protection_keys".
3. I played with the test tool little bit. The behavior changes drastically when I make minor changes. For example, in the following code.
void setup_handlers(void) { signal(SIGCHLD, &sig_chld); setup_sigsegv_handler(); }
Just commenting the first line "signal(SIGCHLD, &sig_chld);" changes the behavior drastically. I see some tests PASS after this change. The first line appear to not do anything except some printing.
I still have not figured out completely what is going on with setup_sigsegv_handler();. It seems very odd modifying some structures in the signal handler. I am not sure about some of the offsets it is trying to modify. I feel it may be messing up something there. I am not sure though. Will have to investigate.
4. I took the traces(x86_fpu) while running test. It appears to show some of the feature headers are not copied during the test. But I could not figure out why it was happening. It appears to show not all the feature headers are copied when the new child is created. When I printed the feature bits, they all appear to be intact. Here is the trace.
protection_keys-17350 [035] 59275.833511: x86_fpu_regs_activated: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834197: x86_fpu_copy_src: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834197: x86_fpu_copy_dst: x86/fpu: 0xffff93d722877800 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207 protection_keys-17351 [040] 59275.834274: x86_fpu_regs_activated: x86/fpu: 0xffff93d722877800 load: 1 xfeatures: 2 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834283: x86_fpu_regs_deactivated: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207 protection_keys-17351 [040] 59275.834289: x86_fpu_regs_deactivated: x86/fpu: 0xffff93d722877800 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834296: x86_fpu_regs_activated: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834298: x86_fpu_regs_activated: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834406: x86_fpu_before_save: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834406: x86_fpu_after_save: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834408: x86_fpu_before_save: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834408: x86_fpu_after_save: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834654: x86_fpu_regs_deactivated: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834654: x86_fpu_dropped: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207 auditd-1834 [036] 59275.834655: x86_fpu_regs_activated: x86/fpu: 0xffff93d713fef800 load: 1 xfeatures: 202 xcomp_bv: 8000000000000207 protection_keys-17350 [035] 59275.834665: x86_fpu_regs_deactivated: x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207 <...>-17285 [041] 59275.834679: x86_fpu_regs_activated: x86/fpu: 0xffff93d716d0df40 load: 1 xfeatures: 202 xcomp_bv: 8000000000000207 5. I instrumented parent child interactions using a separate standalone application(without the signal handler), it appears to work just fine. I see PRKU register staying intact swiching from parent to child.
My suspicion at this point is towards the selftest tool protection_keys.c. I will keep looking. Any feedback would be much appreciated to debug further.
Shaggy, Feel free to add if I missed something.
Thanks Babu