In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave some unsanitized pages, which does not matter, because SGX will be disabled for the whole power cycle.
This triggers WARN_ON() because sgx_dirty_page_list ends up being non-empty, and dumps the call stack:
[ 0.268103] sgx: EPC section 0x40200000-0x45f7ffff [ 0.268591] ------------[ cut here ]------------ [ 0.268592] WARNING: CPU: 6 PID: 83 at arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 [ 0.268598] Modules linked in: [ 0.268600] CPU: 6 PID: 83 Comm: ksgxd Not tainted 6.0.0-rc2 #382 [ 0.268603] Hardware name: Dell Inc. XPS 13 9370/0RMYH9, BIOS 1.21.0 07/06/2022 [ 0.268604] RIP: 0010:ksgxd+0x1b7/0x1d0 [ 0.268607] Code: ff e9 f2 fe ff ff 48 89 df e8 75 07 0e 00 84 c0 0f 84 c3 fe ff ff 31 ff e8 e6 07 0e 00 84 c0 0f 85 94 fe ff ff e9 af fe ff ff <0f> 0b e9 7f fe ff ff e8 dd 9c 95 00 66 66 2e 0f 1f 84 00 00 00 00 [ 0.268608] RSP: 0000:ffffb6c7404f3ed8 EFLAGS: 00010287 [ 0.268610] RAX: ffffb6c740431a10 RBX: ffff8dcd8117b400 RCX: 0000000000000000 [ 0.268612] RDX: 0000000080000000 RSI: ffffb6c7404319d0 RDI: 00000000ffffffff [ 0.268613] RBP: ffff8dcd820a4d80 R08: ffff8dcd820a4180 R09: ffff8dcd820a4180 [ 0.268614] R10: 0000000000000000 R11: 0000000000000006 R12: ffffb6c74006bce0 [ 0.268615] R13: ffff8dcd80e63880 R14: ffffffffa8a60f10 R15: 0000000000000000 [ 0.268616] FS: 0000000000000000(0000) GS:ffff8dcf25580000(0000) knlGS:0000000000000000 [ 0.268617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.268619] CR2: 0000000000000000 CR3: 0000000213410001 CR4: 00000000003706e0 [ 0.268620] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.268621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.268622] Call Trace: [ 0.268624] <TASK> [ 0.268627] ? _raw_spin_lock_irqsave+0x24/0x60 [ 0.268632] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 0.268634] ? __kthread_parkme+0x36/0x90 [ 0.268637] kthread+0xe5/0x110 [ 0.268639] ? kthread_complete_and_exit+0x20/0x20 [ 0.268642] ret_from_fork+0x1f/0x30 [ 0.268647] </TASK> [ 0.268648] ---[ end trace 0000000000000000 ]---
Ultimately this can crash the kernel, if the following is set:
/proc/sys/kernel/panic_on_warn
In premature stop, print nothing, as the number is by practical means a random number. Otherwise, it is an indicator of a bug in the driver, and therefore print the number of unsanitized pages with pr_err().
Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/... Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") Cc: stable@vger.kernel.org # v5.13+ Reported-by: Paul Menzel pmenzel@molgen.mpg.de Signed-off-by: Jarkko Sakkinen jarkko@kernel.org --- v6: - Address Reinette's feedback: https://lore.kernel.org/linux-sgx/Yw6%2FiTzSdSw%2FY%2FVO@kernel.org/
v5: - Add the klog dump and sysctl option to the commit message.
v4: - Explain expectations for dirty_page_list in the function header, instead of an inline comment. - Improve commit message to explain the conditions better. - Return the number of pages left dirty to ksgxd() and print warning after the 2nd call, if there are any.
v3: - Remove WARN_ON(). - Tuned comments and the commit message a bit.
v2: - Replaced WARN_ON() with optional pr_info() inside __sgx_sanitize_pages(). - Rewrote the commit message. - Added the fixes tag. --- arch/x86/kernel/cpu/sgx/main.c | 42 ++++++++++++++++++++++++++++------ 1 file changed, 35 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 515e2a5f25bb..bcd6b64961bd 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -49,17 +49,20 @@ static LIST_HEAD(sgx_dirty_page_list); * Reset post-kexec EPC pages to the uninitialized state. The pages are removed * from the input list, and made available for the page allocator. SECS pages * prepending their children in the input list are left intact. + * + * Contents of the @dirty_page_list must be thread-local, i.e. + * not shared by multiple threads. */ -static void __sgx_sanitize_pages(struct list_head *dirty_page_list) +static long __sgx_sanitize_pages(struct list_head *dirty_page_list) { struct sgx_epc_page *page; + long left_dirty = 0; LIST_HEAD(dirty); int ret;
- /* dirty_page_list is thread-local, no need for a lock: */ while (!list_empty(dirty_page_list)) { if (kthread_should_stop()) - return; + return -ECANCELED;
page = list_first_entry(dirty_page_list, struct sgx_epc_page, list);
@@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) } else { /* The page is not yet clean - move to the dirty list. */ list_move_tail(&page->list, &dirty); + left_dirty++; }
cond_resched(); }
list_splice(&dirty, dirty_page_list); + return left_dirty; }
static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page) @@ -388,17 +393,40 @@ void sgx_reclaim_direct(void)
static int ksgxd(void *p) { + long ret; + set_freezable();
/* * Sanitize pages in order to recover from kexec(). The 2nd pass is * required for SECS pages, whose child pages blocked EREMOVE. */ - __sgx_sanitize_pages(&sgx_dirty_page_list); - __sgx_sanitize_pages(&sgx_dirty_page_list); + ret = __sgx_sanitize_pages(&sgx_dirty_page_list); + if (ret == -ECANCELED) + /* kthread stopped */ + return 0;
- /* sanity check: */ - WARN_ON(!list_empty(&sgx_dirty_page_list)); + ret = __sgx_sanitize_pages(&sgx_dirty_page_list); + switch (ret) { + case 0: + /* success, no unsanitized pages */ + break; + + case -ECANCELED: + /* kthread stopped */ + return 0; + + default: + /* + * Never expected to happen in a working driver. If it happens + * the bug is expected to be in the sanitization process, but + * successfully sanitized pages are still valid and driver can + * be used and most importantly debugged without issues. To put + * short, the global state of kernel is not corrupted so no + * reason to do any more complicated rollback. + */ + pr_err("%ld unsanitized pages\n", ret); + }
while (!kthread_should_stop()) { if (try_to_freeze())
Hi Jarkko,
On 8/31/2022 10:38 AM, Jarkko Sakkinen wrote:
In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave some unsanitized pages, which does not matter, because SGX will be disabled for the whole power cycle.
This triggers WARN_ON() because sgx_dirty_page_list ends up being non-empty, and dumps the call stack:
[ 0.268103] sgx: EPC section 0x40200000-0x45f7ffff [ 0.268591] ------------[ cut here ]------------ [ 0.268592] WARNING: CPU: 6 PID: 83 at arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 [ 0.268598] Modules linked in: [ 0.268600] CPU: 6 PID: 83 Comm: ksgxd Not tainted 6.0.0-rc2 #382 [ 0.268603] Hardware name: Dell Inc. XPS 13 9370/0RMYH9, BIOS 1.21.0 07/06/2022 [ 0.268604] RIP: 0010:ksgxd+0x1b7/0x1d0 [ 0.268607] Code: ff e9 f2 fe ff ff 48 89 df e8 75 07 0e 00 84 c0 0f 84 c3 fe ff ff 31 ff e8 e6 07 0e 00 84 c0 0f 85 94 fe ff ff e9 af fe ff ff <0f> 0b e9 7f fe ff ff e8 dd 9c 95 00 66 66 2e 0f 1f 84 00 00 00 00 [ 0.268608] RSP: 0000:ffffb6c7404f3ed8 EFLAGS: 00010287 [ 0.268610] RAX: ffffb6c740431a10 RBX: ffff8dcd8117b400 RCX: 0000000000000000 [ 0.268612] RDX: 0000000080000000 RSI: ffffb6c7404319d0 RDI: 00000000ffffffff [ 0.268613] RBP: ffff8dcd820a4d80 R08: ffff8dcd820a4180 R09: ffff8dcd820a4180 [ 0.268614] R10: 0000000000000000 R11: 0000000000000006 R12: ffffb6c74006bce0 [ 0.268615] R13: ffff8dcd80e63880 R14: ffffffffa8a60f10 R15: 0000000000000000 [ 0.268616] FS: 0000000000000000(0000) GS:ffff8dcf25580000(0000) knlGS:0000000000000000 [ 0.268617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.268619] CR2: 0000000000000000 CR3: 0000000213410001 CR4: 00000000003706e0 [ 0.268620] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.268621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.268622] Call Trace: [ 0.268624] <TASK> [ 0.268627] ? _raw_spin_lock_irqsave+0x24/0x60 [ 0.268632] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 0.268634] ? __kthread_parkme+0x36/0x90 [ 0.268637] kthread+0xe5/0x110 [ 0.268639] ? kthread_complete_and_exit+0x20/0x20 [ 0.268642] ret_from_fork+0x1f/0x30 [ 0.268647] </TASK> [ 0.268648] ---[ end trace 0000000000000000 ]---
Are you still planning to trim this?
Ultimately this can crash the kernel, if the following is set:
/proc/sys/kernel/panic_on_warn
In premature stop, print nothing, as the number is by practical means a random number. Otherwise, it is an indicator of a bug in the driver, and therefore print the number of unsanitized pages with pr_err().
I think that "print the number of unsanitized pages with pr_err()" contradicts the patch subject of "Do not consider unsanitized pages an error".
...
@@ -388,17 +393,40 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- long ret;
- set_freezable();
/* * Sanitize pages in order to recover from kexec(). The 2nd pass is * required for SECS pages, whose child pages blocked EREMOVE. */
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (ret == -ECANCELED)
/* kthread stopped */
return 0;
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- switch (ret) {
- case 0:
/* success, no unsanitized pages */
break;
- case -ECANCELED:
/* kthread stopped */
return 0;
- default:
/*
* Never expected to happen in a working driver. If it happens
* the bug is expected to be in the sanitization process, but
* successfully sanitized pages are still valid and driver can
* be used and most importantly debugged without issues. To put
* short, the global state of kernel is not corrupted so no
* reason to do any more complicated rollback.
*/
pr_err("%ld unsanitized pages\n", ret);
- }
while (!kthread_should_stop()) { if (try_to_freeze())
I think I am missing something here. A lot of logic is added here but I do not see why it is necessary. ksgxd() knows via kthread_should_stop() if the reclaimer was canceled. I am thus wondering, could the above not be simplified to something similar to V1:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void)
static int ksgxd(void *p) { + unsigned long left_dirty; + set_freezable();
/* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list); - __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */ - WARN_ON(!list_empty(&sgx_dirty_page_list)); + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); + if (left_dirty && !kthread_should_stop()) + pr_err("%lu unsanitized pages\n", left_dirty);
while (!kthread_should_stop()) { if (try_to_freeze())
Reinette
On Wed, 2022-08-31 at 13:39 -0700, Reinette Chatre wrote:
static int ksgxd(void *p) {
- long ret;
set_freezable(); /* * Sanitize pages in order to recover from kexec(). The 2nd pass is * required for SECS pages, whose child pages blocked EREMOVE. */
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (ret == -ECANCELED)
/* kthread stopped */
return 0;
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- switch (ret) {
- case 0:
/* success, no unsanitized pages */
break;
- case -ECANCELED:
/* kthread stopped */
return 0;
- default:
/*
* Never expected to happen in a working driver. If it
happens
* the bug is expected to be in the sanitization process,
but
* successfully sanitized pages are still valid and driver
can
* be used and most importantly debugged without issues. To
put
* short, the global state of kernel is not corrupted so no
* reason to do any more complicated rollback.
*/
pr_err("%ld unsanitized pages\n", ret);
- }
while (!kthread_should_stop()) { if (try_to_freeze())
I think I am missing something here. A lot of logic is added here but I do not see why it is necessary. ksgxd() knows via kthread_should_stop() if the reclaimer was canceled. I am thus wondering, could the above not be simplified to something similar to V1:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- unsigned long left_dirty;
set_freezable(); /* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (left_dirty && !kthread_should_stop())
pr_err("%lu unsanitized pages\n", left_dirty);
This basically means driver bug if I understand correctly. To be consistent with the behaviour of existing code, how about just WARN()? ... left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); WARN_ON(left_dirty && !kthread_should_stop());
It seems there's little value to print out the unsanitized pages here. The existing code doesn't print it anyway.
On Thu, Sep 01, 2022 at 10:50:07AM +0000, Huang, Kai wrote:
On Wed, 2022-08-31 at 13:39 -0700, Reinette Chatre wrote:
static int ksgxd(void *p) {
- long ret;
set_freezable(); /* * Sanitize pages in order to recover from kexec(). The 2nd pass is * required for SECS pages, whose child pages blocked EREMOVE. */
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (ret == -ECANCELED)
/* kthread stopped */
return 0;
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- switch (ret) {
- case 0:
/* success, no unsanitized pages */
break;
- case -ECANCELED:
/* kthread stopped */
return 0;
- default:
/*
* Never expected to happen in a working driver. If it
happens
* the bug is expected to be in the sanitization process,
but
* successfully sanitized pages are still valid and driver
can
* be used and most importantly debugged without issues. To
put
* short, the global state of kernel is not corrupted so no
* reason to do any more complicated rollback.
*/
pr_err("%ld unsanitized pages\n", ret);
- }
while (!kthread_should_stop()) { if (try_to_freeze())
I think I am missing something here. A lot of logic is added here but I do not see why it is necessary. ksgxd() knows via kthread_should_stop() if the reclaimer was canceled. I am thus wondering, could the above not be simplified to something similar to V1:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- unsigned long left_dirty;
set_freezable(); /* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (left_dirty && !kthread_should_stop())
pr_err("%lu unsanitized pages\n", left_dirty);
This basically means driver bug if I understand correctly. To be consistent with the behaviour of existing code, how about just WARN()? ... left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); WARN_ON(left_dirty && !kthread_should_stop());
It seems there's little value to print out the unsanitized pages here. The existing code doesn't print it anyway.
Using WARN IMHO here is too strong measure, given that it tear down the whole kernel, if panic_on_warn is enabled.
For debugging, any information is useful information, so would not make sense not print the number of pages, if that is available. That could very well point out the issue why all pages are not sanitized if there was a bug.
BR, Jarkko
On Wed, Aug 31, 2022 at 01:39:53PM -0700, Reinette Chatre wrote:
Hi Jarkko,
On 8/31/2022 10:38 AM, Jarkko Sakkinen wrote:
In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave some unsanitized pages, which does not matter, because SGX will be disabled for the whole power cycle.
This triggers WARN_ON() because sgx_dirty_page_list ends up being non-empty, and dumps the call stack:
[ 0.268103] sgx: EPC section 0x40200000-0x45f7ffff [ 0.268591] ------------[ cut here ]------------ [ 0.268592] WARNING: CPU: 6 PID: 83 at arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 [ 0.268598] Modules linked in: [ 0.268600] CPU: 6 PID: 83 Comm: ksgxd Not tainted 6.0.0-rc2 #382 [ 0.268603] Hardware name: Dell Inc. XPS 13 9370/0RMYH9, BIOS 1.21.0 07/06/2022 [ 0.268604] RIP: 0010:ksgxd+0x1b7/0x1d0 [ 0.268607] Code: ff e9 f2 fe ff ff 48 89 df e8 75 07 0e 00 84 c0 0f 84 c3 fe ff ff 31 ff e8 e6 07 0e 00 84 c0 0f 85 94 fe ff ff e9 af fe ff ff <0f> 0b e9 7f fe ff ff e8 dd 9c 95 00 66 66 2e 0f 1f 84 00 00 00 00 [ 0.268608] RSP: 0000:ffffb6c7404f3ed8 EFLAGS: 00010287 [ 0.268610] RAX: ffffb6c740431a10 RBX: ffff8dcd8117b400 RCX: 0000000000000000 [ 0.268612] RDX: 0000000080000000 RSI: ffffb6c7404319d0 RDI: 00000000ffffffff [ 0.268613] RBP: ffff8dcd820a4d80 R08: ffff8dcd820a4180 R09: ffff8dcd820a4180 [ 0.268614] R10: 0000000000000000 R11: 0000000000000006 R12: ffffb6c74006bce0 [ 0.268615] R13: ffff8dcd80e63880 R14: ffffffffa8a60f10 R15: 0000000000000000 [ 0.268616] FS: 0000000000000000(0000) GS:ffff8dcf25580000(0000) knlGS:0000000000000000 [ 0.268617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.268619] CR2: 0000000000000000 CR3: 0000000213410001 CR4: 00000000003706e0 [ 0.268620] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.268621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.268622] Call Trace: [ 0.268624] <TASK> [ 0.268627] ? _raw_spin_lock_irqsave+0x24/0x60 [ 0.268632] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 0.268634] ? __kthread_parkme+0x36/0x90 [ 0.268637] kthread+0xe5/0x110 [ 0.268639] ? kthread_complete_and_exit+0x20/0x20 [ 0.268642] ret_from_fork+0x1f/0x30 [ 0.268647] </TASK> [ 0.268648] ---[ end trace 0000000000000000 ]---
Are you still planning to trim this?
Ultimately this can crash the kernel, if the following is set:
/proc/sys/kernel/panic_on_warn
In premature stop, print nothing, as the number is by practical means a random number. Otherwise, it is an indicator of a bug in the driver, and therefore print the number of unsanitized pages with pr_err().
I think that "print the number of unsanitized pages with pr_err()" contradicts the patch subject of "Do not consider unsanitized pages an error".
...
@@ -388,17 +393,40 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- long ret;
- set_freezable();
/* * Sanitize pages in order to recover from kexec(). The 2nd pass is * required for SECS pages, whose child pages blocked EREMOVE. */
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (ret == -ECANCELED)
/* kthread stopped */
return 0;
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- switch (ret) {
- case 0:
/* success, no unsanitized pages */
break;
- case -ECANCELED:
/* kthread stopped */
return 0;
- default:
/*
* Never expected to happen in a working driver. If it happens
* the bug is expected to be in the sanitization process, but
* successfully sanitized pages are still valid and driver can
* be used and most importantly debugged without issues. To put
* short, the global state of kernel is not corrupted so no
* reason to do any more complicated rollback.
*/
pr_err("%ld unsanitized pages\n", ret);
- }
while (!kthread_should_stop()) { if (try_to_freeze())
I think I am missing something here. A lot of logic is added here but I do not see why it is necessary. ksgxd() knows via kthread_should_stop() if the reclaimer was canceled. I am thus wondering, could the above not be simplified to something similar to V1:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- unsigned long left_dirty;
- set_freezable();
/* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list);
IMHO, would make sense also to have here:
if (!kthread_should_stop()) return 0;
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (left_dirty && !kthread_should_stop())
pr_err("%lu unsanitized pages\n", left_dirty);
That would be incorrect, if the function returned because of kthread stopped.
If you do the check here you already have a window where kthread could have been stopped anyhow.
So even this would be less correct:
if (kthreas_should_stop()) { return 0; } else if (left_dirty) { pr_err("%lu unsanitized pages\n", left_dirty); }
So in the end you end as complicated and less correct fix. This all is explained in the commit message.
If you unconditionally print error, you don't have a meaning for the number of unsanitized pags.
BR, Jarkko
On Fri, Sep 02, 2022 at 12:53:55AM +0300, Jarkko Sakkinen wrote:
On Wed, Aug 31, 2022 at 01:39:53PM -0700, Reinette Chatre wrote:
Hi Jarkko,
On 8/31/2022 10:38 AM, Jarkko Sakkinen wrote:
In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave some unsanitized pages, which does not matter, because SGX will be disabled for the whole power cycle.
This triggers WARN_ON() because sgx_dirty_page_list ends up being non-empty, and dumps the call stack:
[ 0.268103] sgx: EPC section 0x40200000-0x45f7ffff [ 0.268591] ------------[ cut here ]------------ [ 0.268592] WARNING: CPU: 6 PID: 83 at arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 [ 0.268598] Modules linked in: [ 0.268600] CPU: 6 PID: 83 Comm: ksgxd Not tainted 6.0.0-rc2 #382 [ 0.268603] Hardware name: Dell Inc. XPS 13 9370/0RMYH9, BIOS 1.21.0 07/06/2022 [ 0.268604] RIP: 0010:ksgxd+0x1b7/0x1d0 [ 0.268607] Code: ff e9 f2 fe ff ff 48 89 df e8 75 07 0e 00 84 c0 0f 84 c3 fe ff ff 31 ff e8 e6 07 0e 00 84 c0 0f 85 94 fe ff ff e9 af fe ff ff <0f> 0b e9 7f fe ff ff e8 dd 9c 95 00 66 66 2e 0f 1f 84 00 00 00 00 [ 0.268608] RSP: 0000:ffffb6c7404f3ed8 EFLAGS: 00010287 [ 0.268610] RAX: ffffb6c740431a10 RBX: ffff8dcd8117b400 RCX: 0000000000000000 [ 0.268612] RDX: 0000000080000000 RSI: ffffb6c7404319d0 RDI: 00000000ffffffff [ 0.268613] RBP: ffff8dcd820a4d80 R08: ffff8dcd820a4180 R09: ffff8dcd820a4180 [ 0.268614] R10: 0000000000000000 R11: 0000000000000006 R12: ffffb6c74006bce0 [ 0.268615] R13: ffff8dcd80e63880 R14: ffffffffa8a60f10 R15: 0000000000000000 [ 0.268616] FS: 0000000000000000(0000) GS:ffff8dcf25580000(0000) knlGS:0000000000000000 [ 0.268617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.268619] CR2: 0000000000000000 CR3: 0000000213410001 CR4: 00000000003706e0 [ 0.268620] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.268621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.268622] Call Trace: [ 0.268624] <TASK> [ 0.268627] ? _raw_spin_lock_irqsave+0x24/0x60 [ 0.268632] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 0.268634] ? __kthread_parkme+0x36/0x90 [ 0.268637] kthread+0xe5/0x110 [ 0.268639] ? kthread_complete_and_exit+0x20/0x20 [ 0.268642] ret_from_fork+0x1f/0x30 [ 0.268647] </TASK> [ 0.268648] ---[ end trace 0000000000000000 ]---
Are you still planning to trim this?
Ultimately this can crash the kernel, if the following is set:
/proc/sys/kernel/panic_on_warn
In premature stop, print nothing, as the number is by practical means a random number. Otherwise, it is an indicator of a bug in the driver, and therefore print the number of unsanitized pages with pr_err().
I think that "print the number of unsanitized pages with pr_err()" contradicts the patch subject of "Do not consider unsanitized pages an error".
...
@@ -388,17 +393,40 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- long ret;
- set_freezable();
/* * Sanitize pages in order to recover from kexec(). The 2nd pass is * required for SECS pages, whose child pages blocked EREMOVE. */
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (ret == -ECANCELED)
/* kthread stopped */
return 0;
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- switch (ret) {
- case 0:
/* success, no unsanitized pages */
break;
- case -ECANCELED:
/* kthread stopped */
return 0;
- default:
/*
* Never expected to happen in a working driver. If it happens
* the bug is expected to be in the sanitization process, but
* successfully sanitized pages are still valid and driver can
* be used and most importantly debugged without issues. To put
* short, the global state of kernel is not corrupted so no
* reason to do any more complicated rollback.
*/
pr_err("%ld unsanitized pages\n", ret);
- }
while (!kthread_should_stop()) { if (try_to_freeze())
I think I am missing something here. A lot of logic is added here but I do not see why it is necessary. ksgxd() knows via kthread_should_stop() if the reclaimer was canceled. I am thus wondering, could the above not be simplified to something similar to V1:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- unsigned long left_dirty;
- set_freezable();
/* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list);
IMHO, would make sense also to have here:
if (!kthread_should_stop()) return 0;
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (left_dirty && !kthread_should_stop())
pr_err("%lu unsanitized pages\n", left_dirty);
That would be incorrect, if the function returned because of kthread stopped.
If you do the check here you already have a window where kthread could have been stopped anyhow.
So even this would be less correct:
if (kthreas_should_stop()) { return 0; } else if (left_dirty) { pr_err("%lu unsanitized pages\n", left_dirty); }
So in the end you end as complicated and less correct fix. This all is explained in the commit message.
If you unconditionally print error, you don't have a meaning for the number of unsanitized pags.
If you add my long comment explaining the error case, the SLOC size is almost the same. That takes most space in my patch.
BR, Jarkko
On Fri, Sep 02, 2022 at 12:56:34AM +0300, Jarkko Sakkinen wrote:
On Fri, Sep 02, 2022 at 12:53:55AM +0300, Jarkko Sakkinen wrote:
On Wed, Aug 31, 2022 at 01:39:53PM -0700, Reinette Chatre wrote:
Hi Jarkko,
On 8/31/2022 10:38 AM, Jarkko Sakkinen wrote:
In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave some unsanitized pages, which does not matter, because SGX will be disabled for the whole power cycle.
This triggers WARN_ON() because sgx_dirty_page_list ends up being non-empty, and dumps the call stack:
[ 0.268103] sgx: EPC section 0x40200000-0x45f7ffff [ 0.268591] ------------[ cut here ]------------ [ 0.268592] WARNING: CPU: 6 PID: 83 at arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0 [ 0.268598] Modules linked in: [ 0.268600] CPU: 6 PID: 83 Comm: ksgxd Not tainted 6.0.0-rc2 #382 [ 0.268603] Hardware name: Dell Inc. XPS 13 9370/0RMYH9, BIOS 1.21.0 07/06/2022 [ 0.268604] RIP: 0010:ksgxd+0x1b7/0x1d0 [ 0.268607] Code: ff e9 f2 fe ff ff 48 89 df e8 75 07 0e 00 84 c0 0f 84 c3 fe ff ff 31 ff e8 e6 07 0e 00 84 c0 0f 85 94 fe ff ff e9 af fe ff ff <0f> 0b e9 7f fe ff ff e8 dd 9c 95 00 66 66 2e 0f 1f 84 00 00 00 00 [ 0.268608] RSP: 0000:ffffb6c7404f3ed8 EFLAGS: 00010287 [ 0.268610] RAX: ffffb6c740431a10 RBX: ffff8dcd8117b400 RCX: 0000000000000000 [ 0.268612] RDX: 0000000080000000 RSI: ffffb6c7404319d0 RDI: 00000000ffffffff [ 0.268613] RBP: ffff8dcd820a4d80 R08: ffff8dcd820a4180 R09: ffff8dcd820a4180 [ 0.268614] R10: 0000000000000000 R11: 0000000000000006 R12: ffffb6c74006bce0 [ 0.268615] R13: ffff8dcd80e63880 R14: ffffffffa8a60f10 R15: 0000000000000000 [ 0.268616] FS: 0000000000000000(0000) GS:ffff8dcf25580000(0000) knlGS:0000000000000000 [ 0.268617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.268619] CR2: 0000000000000000 CR3: 0000000213410001 CR4: 00000000003706e0 [ 0.268620] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.268621] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.268622] Call Trace: [ 0.268624] <TASK> [ 0.268627] ? _raw_spin_lock_irqsave+0x24/0x60 [ 0.268632] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 0.268634] ? __kthread_parkme+0x36/0x90 [ 0.268637] kthread+0xe5/0x110 [ 0.268639] ? kthread_complete_and_exit+0x20/0x20 [ 0.268642] ret_from_fork+0x1f/0x30 [ 0.268647] </TASK> [ 0.268648] ---[ end trace 0000000000000000 ]---
Are you still planning to trim this?
Ultimately this can crash the kernel, if the following is set:
/proc/sys/kernel/panic_on_warn
In premature stop, print nothing, as the number is by practical means a random number. Otherwise, it is an indicator of a bug in the driver, and therefore print the number of unsanitized pages with pr_err().
I think that "print the number of unsanitized pages with pr_err()" contradicts the patch subject of "Do not consider unsanitized pages an error".
...
@@ -388,17 +393,40 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- long ret;
- set_freezable();
/* * Sanitize pages in order to recover from kexec(). The 2nd pass is * required for SECS pages, whose child pages blocked EREMOVE. */
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (ret == -ECANCELED)
/* kthread stopped */
return 0;
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- ret = __sgx_sanitize_pages(&sgx_dirty_page_list);
- switch (ret) {
- case 0:
/* success, no unsanitized pages */
break;
- case -ECANCELED:
/* kthread stopped */
return 0;
- default:
/*
* Never expected to happen in a working driver. If it happens
* the bug is expected to be in the sanitization process, but
* successfully sanitized pages are still valid and driver can
* be used and most importantly debugged without issues. To put
* short, the global state of kernel is not corrupted so no
* reason to do any more complicated rollback.
*/
pr_err("%ld unsanitized pages\n", ret);
- }
while (!kthread_should_stop()) { if (try_to_freeze())
I think I am missing something here. A lot of logic is added here but I do not see why it is necessary. ksgxd() knows via kthread_should_stop() if the reclaimer was canceled. I am thus wondering, could the above not be simplified to something similar to V1:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- unsigned long left_dirty;
- set_freezable();
/* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list);
IMHO, would make sense also to have here:
if (!kthread_should_stop()) return 0;
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (left_dirty && !kthread_should_stop())
pr_err("%lu unsanitized pages\n", left_dirty);
That would be incorrect, if the function returned because of kthread stopped.
If you do the check here you already have a window where kthread could have been stopped anyhow.
So even this would be less correct:
if (kthreas_should_stop()) { return 0; } else if (left_dirty) { pr_err("%lu unsanitized pages\n", left_dirty); }
So in the end you end as complicated and less correct fix. This all is explained in the commit message.
If you unconditionally print error, you don't have a meaning for the number of unsanitized pags.
If you add my long comment explaining the error case, the SLOC size is almost the same. That takes most space in my patch.
Printing pages if the thread was stopped is just making the bug look different and have less output, it does not fix anything, except prevent kernel panic.
BR, Jarkko
Hi Jarkko,
On 9/1/2022 2:53 PM, Jarkko Sakkinen wrote:
On Wed, Aug 31, 2022 at 01:39:53PM -0700, Reinette Chatre wrote:
On 8/31/2022 10:38 AM, Jarkko Sakkinen wrote:
I think I am missing something here. A lot of logic is added here but I do not see why it is necessary. ksgxd() knows via kthread_should_stop() if the reclaimer was canceled. I am thus wondering, could the above not be simplified to something similar to V1:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- unsigned long left_dirty;
- set_freezable();
/* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list);
IMHO, would make sense also to have here:
if (!kthread_should_stop()) return 0;
Would this not prematurely stop the thread when it should not be?
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (left_dirty && !kthread_should_stop())
pr_err("%lu unsanitized pages\n", left_dirty);
That would be incorrect, if the function returned because of kthread stopped.
I should have highlighted this but in my example I changed left_dirty to be "unsigned long" with the intention that the "return -ECANCELED" is replaced with "return 0".
__sgx_sanitize_pages() returns 0 when it exits because of kthread stopped.
To elaborate I was thinking about:
+static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list) { + unsigned long left_dirty = 0; struct sgx_epc_page *page; LIST_HEAD(dirty); int ret;
- /* dirty_page_list is thread-local, no need for a lock: */ while (!list_empty(dirty_page_list)) { if (kthread_should_stop()) - return; + return 0;
page = list_first_entry(dirty_page_list, struct sgx_epc_page, list);
@@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) } else { /* The page is not yet clean - move to the dirty list. */ list_move_tail(&page->list, &dirty); + left_dirty++; }
cond_resched(); }
list_splice(&dirty, dirty_page_list); + return left_dirty; }
and then with what I had in previous email the checks should work:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void)
static int ksgxd(void *p) { + unsigned long left_dirty; + set_freezable();
/* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list); - __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */ - WARN_ON(!list_empty(&sgx_dirty_page_list)); + left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list); + if (left_dirty && !kthread_should_stop()) + pr_err("%lu unsanitized pages\n", left_dirty);
while (!kthread_should_stop()) { if (try_to_freeze())
If you do the check here you already have a window where kthread could have been stopped anyhow.
So even this would be less correct:
if (kthreas_should_stop()) { return 0; } else if (left_dirty) { pr_err("%lu unsanitized pages\n", left_dirty); }
So in the end you end as complicated and less correct fix. This all is explained in the commit message.
If you unconditionally print error, you don't have a meaning for the number of unsanitized pags.
Understood that the goal is to only print the number of unsanitized pages if ksgxd has not been stopped prematurely.
Reinette
On Thu, Sep 01, 2022 at 03:34:10PM -0700, Reinette Chatre wrote:
Hi Jarkko,
On 9/1/2022 2:53 PM, Jarkko Sakkinen wrote:
On Wed, Aug 31, 2022 at 01:39:53PM -0700, Reinette Chatre wrote:
On 8/31/2022 10:38 AM, Jarkko Sakkinen wrote:
I think I am missing something here. A lot of logic is added here but I do not see why it is necessary. ksgxd() knows via kthread_should_stop() if the reclaimer was canceled. I am thus wondering, could the above not be simplified to something similar to V1:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- unsigned long left_dirty;
- set_freezable();
/* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list);
IMHO, would make sense also to have here:
if (!kthread_should_stop()) return 0;
Would this not prematurely stop the thread when it should not be?
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (left_dirty && !kthread_should_stop())
pr_err("%lu unsanitized pages\n", left_dirty);
That would be incorrect, if the function returned because of kthread stopped.
I should have highlighted this but in my example I changed left_dirty to be "unsigned long" with the intention that the "return -ECANCELED" is replaced with "return 0".
It wasn't supposed to be, it's an error. Thanks for spotting that.
__sgx_sanitize_pages() returns 0 when it exits because of kthread stopped.
To elaborate I was thinking about:
+static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list) {
- unsigned long left_dirty = 0; struct sgx_epc_page *page; LIST_HEAD(dirty); int ret;
- /* dirty_page_list is thread-local, no need for a lock: */ while (!list_empty(dirty_page_list)) { if (kthread_should_stop())
return;
return 0;
page = list_first_entry(dirty_page_list, struct sgx_epc_page, list); @@ -92,12 +95,14 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list) } else { /* The page is not yet clean - move to the dirty list. */ list_move_tail(&page->list, &dirty);
}left_dirty++;
cond_resched(); } list_splice(&dirty, dirty_page_list);
- return left_dirty;
}
and then with what I had in previous email the checks should work:
@@ -388,6 +393,8 @@ void sgx_reclaim_direct(void) static int ksgxd(void *p) {
- unsigned long left_dirty;
- set_freezable();
/* @@ -395,10 +402,10 @@ static int ksgxd(void *p) * required for SECS pages, whose child pages blocked EREMOVE. */ __sgx_sanitize_pages(&sgx_dirty_page_list);
- __sgx_sanitize_pages(&sgx_dirty_page_list);
- /* sanity check: */
- WARN_ON(!list_empty(&sgx_dirty_page_list));
- left_dirty = __sgx_sanitize_pages(&sgx_dirty_page_list);
- if (left_dirty && !kthread_should_stop())
pr_err("%lu unsanitized pages\n", left_dirty);
while (!kthread_should_stop()) { if (try_to_freeze())
If you do the check here you already have a window where kthread could have been stopped anyhow.
So even this would be less correct:
if (kthreas_should_stop()) { return 0; } else if (left_dirty) { pr_err("%lu unsanitized pages\n", left_dirty); }
So in the end you end as complicated and less correct fix. This all is explained in the commit message.
If you unconditionally print error, you don't have a meaning for the number of unsanitized pags.
Understood that the goal is to only print the number of unsanitized pages if ksgxd has not been stopped prematurely.
Yeah, and thus give as useful information for sysadmin/developer as we can.
BR, Jarkko
On Fri, Sep 02, 2022 at 02:57:01AM +0300, Jarkko Sakkinen wrote:
Understood that the goal is to only print the number of unsanitized pages if ksgxd has not been stopped prematurely.
Yeah, and thus give as useful information for sysadmin/developer as we can.
I figured out how to get best of both worlds. See the attached patch. I'll send formal series later on.
Keeps the accurancy, just postpones the exit.
BR, Jarkko
Hi Jarkko,
On 9/2/2022 8:53 AM, Jarkko Sakkinen wrote:
On Fri, Sep 02, 2022 at 04:26:51PM +0300, Jarkko Sakkinen wrote:
- if (ret)
pr_err("%ld unsanitized pages\n", left_dirty);
Yeah, I know, should be 'left_dirty'. I just quickly drafted the patch for the email.
No problem - you did mention that it was an informal patch.
(btw ... also watch out for the long local parameter returned as an unsigned long and the signed vs unsigned printing format string.) I also continue to recommend that you trim that backtrace ... this patch is heading to x86 area where this is required.
Reinette
On Fri, Sep 02, 2022 at 09:08:23AM -0700, Reinette Chatre wrote:
Hi Jarkko,
On 9/2/2022 8:53 AM, Jarkko Sakkinen wrote:
On Fri, Sep 02, 2022 at 04:26:51PM +0300, Jarkko Sakkinen wrote:
- if (ret)
pr_err("%ld unsanitized pages\n", left_dirty);
Yeah, I know, should be 'left_dirty'. I just quickly drafted the patch for the email.
No problem - you did mention that it was an informal patch.
(btw ... also watch out for the long local parameter returned as an unsigned long and the signed vs unsigned printing format string.) I also continue to recommend that you trim
Point taken.
that backtrace ... this patch is heading to x86 area where this is required.
Should I just cut the whole stack trace, and leave the part before it?
BR, Jarkko
Hi Jarkko,
On 9/2/2022 9:30 AM, Jarkko Sakkinen wrote:
On Fri, Sep 02, 2022 at 09:08:23AM -0700, Reinette Chatre wrote:
Hi Jarkko,
On 9/2/2022 8:53 AM, Jarkko Sakkinen wrote:
On Fri, Sep 02, 2022 at 04:26:51PM +0300, Jarkko Sakkinen wrote:
- if (ret)
pr_err("%ld unsanitized pages\n", left_dirty);
Yeah, I know, should be 'left_dirty'. I just quickly drafted the patch for the email.
No problem - you did mention that it was an informal patch.
(btw ... also watch out for the long local parameter returned as an unsigned long and the signed vs unsigned printing format string.) I also continue to recommend that you trim
Point taken.
that backtrace ... this patch is heading to x86 area where this is required.
Should I just cut the whole stack trace, and leave the part before it?
The trace is printed because of a WARN_ON() in the code. I do not think there is anything very helpful in that trace. I think the only helpful parts are the WARN itself that includes the line number and then information on which kernel it was encountered on.
How about something like (please note the FIXME within):
" Paul reported the following WARN while running kernel vFIXME: WARNING: CPU: 6 PID: 83 at arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0
This is the WARN_ON() within ksgxd() that is triggered when sgx_dirty_page_list (the list containing unsanitized pages) is non-empty.
In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave some unsanitized pages, which does not matter, because SGX will be disabled for the whole power cycle.
Ultimately this can crash the kernel, if the following is set:
/proc/sys/kernel/panic_on_warn
In premature stop, print nothing, as the number is by practical means a random number. Otherwise, it is an indicator of a bug in the driver, and therefore print the number of unsanitized pages with pr_err(). "
Reinette
On Fri, Sep 02, 2022 at 10:38:34AM -0700, Reinette Chatre wrote:
Hi Jarkko,
On 9/2/2022 9:30 AM, Jarkko Sakkinen wrote:
On Fri, Sep 02, 2022 at 09:08:23AM -0700, Reinette Chatre wrote:
Hi Jarkko,
On 9/2/2022 8:53 AM, Jarkko Sakkinen wrote:
On Fri, Sep 02, 2022 at 04:26:51PM +0300, Jarkko Sakkinen wrote:
- if (ret)
pr_err("%ld unsanitized pages\n", left_dirty);
Yeah, I know, should be 'left_dirty'. I just quickly drafted the patch for the email.
No problem - you did mention that it was an informal patch.
(btw ... also watch out for the long local parameter returned as an unsigned long and the signed vs unsigned printing format string.) I also continue to recommend that you trim
Point taken.
that backtrace ... this patch is heading to x86 area where this is required.
Should I just cut the whole stack trace, and leave the part before it?
The trace is printed because of a WARN_ON() in the code. I do not think there is anything very helpful in that trace. I think the only helpful parts are the WARN itself that includes the line number and then information on which kernel it was encountered on.
How about something like (please note the FIXME within):
" Paul reported the following WARN while running kernel vFIXME: WARNING: CPU: 6 PID: 83 at arch/x86/kernel/cpu/sgx/main.c:401 ksgxd+0x1b7/0x1d0
Yeah, this is a great idea, the use of WARN() is the whole point. Thank you.
BR, Jarkko
linux-stable-mirror@lists.linaro.org