[PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

List overview All Threads
Download

newer

older

[PATCH 4.19] tcp: relookup sock...

stable/linux-4.14.y build: 201...

Xℹ Ruoyao

29 Mar 2021 29 Mar '21

5:53 p.m.

If the initial value of `num_entires` (calculated at line 1654) is not an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a value greater than the initial value will be assigned to it. That causes `start > last + 1` after line 1708. Then in the next iteration an underflow happens at line 1654. It causes message

*ERROR* Couldn't update BO_VA (-12)

printed in kernel log, and GPU hanging.

Fortify the criteria of the loop to fix this issue.

BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao xry111@mengyan1223.wang Reported-by: Dan Horák dan@danny.cz Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp;

- } while (unlikely(start != last + 1)); + } while (unlikely(start < last + 1));

r = vm->update_funcs->commit(&params, fence);

base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

-- 2.31.1

Show replies by date

Christian König

29 Mar 29 Mar

6:04 p.m.

Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:

...

If the initial value of `num_entires` (calculated at line 1654) is not an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a value greater than the initial value will be assigned to it. That causes `start > last + 1` after line 1708. Then in the next iteration an underflow happens at line 1654. It causes message
 *ERROR* Couldn't update BO_VA (-12)
printed in kernel log, and GPU hanging.

Fortify the criteria of the loop to fix this issue.

NAK the value of num_entries must always be a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.

How do you trigger that?

Christian.

...

BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao xry111@mengyan1223.wang Reported-by: Dan Horák dan@danny.cz Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang

drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp;

} while (unlikely(start != last + 1));

} while (unlikely(start < last + 1));

r = vm->update_funcs->commit(&params, fence);

base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

Xi Ruoyao

6:08 p.m.

On 2021-03-29 20:04 +0200, Christian König wrote:

...

Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:

...
If the initial value of `num_entires` (calculated at line 1654) is not an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a value greater than the initial value will be assigned to it. That causes `start > last + 1` after line 1708. Then in the next iteration an underflow happens at line 1654. It causes message

*ERROR* Couldn't update BO_VA (-12)

printed in kernel log, and GPU hanging.

Fortify the criteria of the loop to fix this issue.

NAK the value of num_entries must always be a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.

How do you trigger that?

Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL) under Xorg, on MIPS64. See the BugLink.

...

Christian.

...
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao xry111@mengyan1223.wang Reported-by: Dan Horák dan@danny.cz Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang

drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp; - } while (unlikely(start != last + 1)); + } while (unlikely(start < last + 1)); r = vm->update_funcs->commit(&params, fence);

base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Christian König

6:10 p.m.

Am 29.03.21 um 20:08 schrieb Xi Ruoyao:

...

On 2021-03-29 20:04 +0200, Christian König wrote:

...
Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:

...
If the initial value of `num_entires` (calculated at line 1654) is not an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a value greater than the initial value will be assigned to it. That causes `start > last + 1` after line 1708. Then in the next iteration an underflow happens at line 1654. It causes message

*ERROR* Couldn't update BO_VA (-12)

printed in kernel log, and GPU hanging.

Fortify the criteria of the loop to fix this issue.

NAK the value of num_entries must always be a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.

How do you trigger that?

Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL) under Xorg, on MIPS64. See the BugLink.

You need to identify the root cause of this, most likely start or last are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.

Christian.

...

...
Christian.

...
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao xry111@mengyan1223.wang Reported-by: Dan Horák dan@danny.cz Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang

drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp; - } while (unlikely(start != last + 1)); + } while (unlikely(start < last + 1)); r = vm->update_funcs->commit(&params, fence);

base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

Xi Ruoyao

6:21 p.m.

On 2021-03-29 20:10 +0200, Christian König wrote:

...

You need to identify the root cause of this, most likely start or last are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.

I printk'ed the value of start & last, they are all a multiple of 4 (AMDGPU_GPU_PAGES_IN_CPU_PAGE).

However... `num_entries = last - start + 1` so it became some irrational thing... Either this line is wrong, or someone called amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which is unexpected.

...

...
...
...
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao xry111@mengyan1223.wang Reported-by: Dan Horák dan@danny.cz Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang

drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp; - } while (unlikely(start != last + 1)); + } while (unlikely(start < last + 1)); r = vm->update_funcs->commit(&params, fence);

base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Xi Ruoyao

6:30 p.m.

On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:

...

On 2021-03-29 20:10 +0200, Christian König wrote:

...
You need to identify the root cause of this, most likely start or last are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.

I printk'ed the value of start & last, they are all a multiple of 4 (AMDGPU_GPU_PAGES_IN_CPU_PAGE).

However... `num_entries = last - start + 1` so it became some irrational thing... Either this line is wrong, or someone called amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which is unexpected.

I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:

...

Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>] amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>] amdgpu_vm_bo_update+0x270/0x4c0 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>] amdgpu_gem_va_ioctl+0x40c/0x430 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>] drm_ioctl_kernel+0xcc/0x120 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>] drm_ioctl+0x220/0x408 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>] amdgpu_drm_ioctl+0x58/0x98 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>] syscall_common+0x34/0x58

...

...
...
...
...
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao xry111@mengyan1223.wang Reported-by: Dan Horák dan@danny.cz Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang

drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp; - } while (unlikely(start != last + 1)); + } while (unlikely(start < last + 1)); r = vm->update_funcs->commit(&params, fence);

base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Xi Ruoyao

7:27 p.m.

Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:

/* validate the parameters */ if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK || size == 0 || size & AMDGPU_GPU_PAGE_MASK) return -EINVAL;

/* snip */

saddr /= AMDGPU_GPU_PAGE_SIZE; eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

mapping->start = saddr; mapping->last = eaddr;

If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".

I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.

On 2021-03-30 02:30 +0800, Xi Ruoyao wrote:

...

On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:

...
On 2021-03-29 20:10 +0200, Christian König wrote:

...
You need to identify the root cause of this, most likely start or last are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.

I printk'ed the value of start & last, they are all a multiple of 4 (AMDGPU_GPU_PAGES_IN_CPU_PAGE).

However... `num_entries = last - start + 1` so it became some irrational thing... Either this line is wrong, or someone called amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which is unexpected.

I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:

...
Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>] amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>] amdgpu_vm_bo_update+0x270/0x4c0 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>] amdgpu_gem_va_ioctl+0x40c/0x430 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>] drm_ioctl_kernel+0xcc/0x120 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>] drm_ioctl+0x220/0x408 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>] amdgpu_drm_ioctl+0x58/0x98 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>] syscall_common+0x34/0x58

...
...
...
...
...
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao xry111@mengyan1223.wang Reported-by: Dan Horák dan@danny.cz Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang

drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp; - } while (unlikely(start != last + 1)); + } while (unlikely(start < last + 1)); r = vm->update_funcs->commit(&params, fence);

base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Christian König

7:36 p.m.

Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

...

Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
     /* validate the parameters */
     if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||
         size == 0 || size & AMDGPU_GPU_PAGE_MASK)
             return -EINVAL;
/* snip */
     saddr /= AMDGPU_GPU_PAGE_SIZE;
     eaddr /= AMDGPU_GPU_PAGE_SIZE;
/* snip */
     mapping->start = saddr;
     mapping->last = eaddr;
If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".

Yeah, good point.

...

I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.

Well in theory it is possible that we always fill the GPUVM on a 4k basis while the native page size of the CPU is larger. Let me double check the code.

BTW: What code base are you based on? The code your post here is quite outdated.

Christian.

...

On 2021-03-30 02:30 +0800, Xi Ruoyao wrote:

...
On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:

...
On 2021-03-29 20:10 +0200, Christian König wrote:

...
You need to identify the root cause of this, most likely start or last are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.

I printk'ed the value of start & last, they are all a multiple of 4 (AMDGPU_GPU_PAGES_IN_CPU_PAGE).

However... `num_entries = last - start + 1` so it became some irrational thing... Either this line is wrong, or someone called amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which is unexpected.

I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:

...
Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>] amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>] amdgpu_vm_bo_update+0x270/0x4c0 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>] amdgpu_gem_va_ioctl+0x40c/0x430 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>] drm_ioctl_kernel+0xcc/0x120 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>] drm_ioctl+0x220/0x408 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>] amdgpu_drm_ioctl+0x58/0x98 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8 Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>] syscall_common+0x34/0x58

...
...
...
> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") > Reported-by: Xi Ruoyao xry111@mengyan1223.wang > Reported-by: Dan Horák dan@danny.cz > Cc: stable@vger.kernel.org > Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index ad91c0c3c423..cee0cc9c8085 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct > amdgpu_device *adev, > } > start = tmp; > > - } while (unlikely(start != last + 1)); > + } while (unlikely(start < last + 1)); > > r = vm->update_funcs->commit(&params, fence); > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

Xi Ruoyao

7:40 p.m.

On 2021-03-29 21:36 +0200, Christian König wrote:

...

Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

...
Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:

/* validate the parameters */          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||              size == 0 || size & AMDGPU_GPU_PAGE_MASK)                  return -EINVAL;

/* snip */

saddr /= AMDGPU_GPU_PAGE_SIZE;          eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

mapping->start = saddr;          mapping->last = eaddr;

If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".

Yeah, good point.

...
I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.

Well in theory it is possible that we always fill the GPUVM on a 4k basis while the native page size of the CPU is larger. Let me double check the code.

BTW: What code base are you based on? The code your post here is quite outdated.

Linus' tree.

I'll go to sleep now (it's 03:39 here :( ), when I wake up I can try to fetch drm-next or something.

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Xi Ruoyao

30 Mar 30 Mar

12:04 p.m.

On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:

...

On 2021-03-29 21:36 +0200, Christian König wrote:

...
Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

...
Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:

/* validate the parameters */          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK

...
...
size == 0 || size & AMDGPU_GPU_PAGE_MASK)                  return -EINVAL;

/* snip */

saddr /= AMDGPU_GPU_PAGE_SIZE;          eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

mapping->start = saddr;          mapping->last = eaddr;

If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".

It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of "AMDGPU_GPU_PAGE_MASK" :(.

...

...
Yeah, good point.

...
I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.

Well in theory it is possible that we always fill the GPUVM on a 4k basis while the native page size of the CPU is larger. Let me double check the code.

On my platform, there are two issues both causing the VM corruption. One is fixed by agd5f/linux@fe001e7. Another is in Mesa from userspace: it uses `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver expects it to use `dev_info->virtual_address_alignment`.

If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue and make virtual_address_alignment = 4K. Otherwise, we should fortify the parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the userspace will just get an EINVAL, instead of a slient VM corruption. And someone should tell Mesa developers to fix the code in this case. -- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Christian König

12:55 p.m.

Am 30.03.21 um 14:04 schrieb Xi Ruoyao:

...

On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:

...
On 2021-03-29 21:36 +0200, Christian König wrote:

...
Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

...
Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:

/* validate the parameters */          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK              size == 0 || size & AMDGPU_GPU_PAGE_MASK)                  return -EINVAL;

/* snip */

saddr /= AMDGPU_GPU_PAGE_SIZE;          eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

mapping->start = saddr;          mapping->last = eaddr;

If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".

It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of "AMDGPU_GPU_PAGE_MASK" :(.

...
...
Yeah, good point.

...
I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.

Well in theory it is possible that we always fill the GPUVM on a 4k basis while the native page size of the CPU is larger. Let me double check the code.

On my platform, there are two issues both causing the VM corruption. One is fixed by agd5f/linux@fe001e7.

What is fe001e7? A commit id? If yes then that is to short and I can't find it.

...

Another is in Mesa from userspace: it uses `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver expects it to use `dev_info->virtual_address_alignment`.

Mhm, looking at the kernel code I would rather say Mesa is correct and the kernel driver is broken.

The gart_page_size is limited by the CPU page size, but the virtual_address_alignment isn't.

...

If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue and make virtual_address_alignment = 4K. Otherwise, we should fortify the parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the userspace will just get an EINVAL, instead of a slient VM corruption. And someone should tell Mesa developers to fix the code in this case.

I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev->gfx.cu_info.number; dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size;

Thanks, Christian.

...

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Dan Horák

1 p.m.

On Tue, 30 Mar 2021 14:55:01 +0200 Christian König christian.koenig@amd.com wrote:

...

Am 30.03.21 um 14:04 schrieb Xi Ruoyao:

...
On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:

...
On 2021-03-29 21:36 +0200, Christian König wrote:

...
Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

...
Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:

/* validate the parameters */          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK              size == 0 || size & AMDGPU_GPU_PAGE_MASK)                  return -EINVAL;

/* snip */

saddr /= AMDGPU_GPU_PAGE_SIZE;          eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

mapping->start = saddr;          mapping->last = eaddr;

If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".

It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of "AMDGPU_GPU_PAGE_MASK" :(.

...
...
Yeah, good point.

...
I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.

Well in theory it is possible that we always fill the GPUVM on a 4k basis while the native page size of the CPU is larger. Let me double check the code.

On my platform, there are two issues both causing the VM corruption. One is fixed by agd5f/linux@fe001e7.

What is fe001e7? A commit id? If yes then that is to short and I can't find it.

it's a gitlab shortcut for https://gitlab.freedesktop.org/agd5f/linux/-/commit/fe001e70a55d0378328612be...

Dan

...

...
Another is in Mesa from userspace: it uses `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver expects it to use `dev_info->virtual_address_alignment`.

Mhm, looking at the kernel code I would rather say Mesa is correct and the kernel driver is broken.

The gart_page_size is limited by the CPU page size, but the virtual_address_alignment isn't.

...
If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue and make virtual_address_alignment = 4K. Otherwise, we should fortify the parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the userspace will just get an EINVAL, instead of a slient VM corruption. And someone should tell Mesa developers to fix the code in this case.

I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev->gfx.cu_info.number; dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size;

Thanks, Christian.

...
-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Christian König

1:02 p.m.

Am 30.03.21 um 15:00 schrieb Dan Horák:

...

On Tue, 30 Mar 2021 14:55:01 +0200 Christian König christian.koenig@amd.com wrote:

...
Am 30.03.21 um 14:04 schrieb Xi Ruoyao:

...
On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:

...
On 2021-03-29 21:36 +0200, Christian König wrote:

...
Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

...
Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:

/* validate the parameters */          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK              size == 0 || size & AMDGPU_GPU_PAGE_MASK)                  return -EINVAL;

/* snip */

saddr /= AMDGPU_GPU_PAGE_SIZE;          eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

mapping->start = saddr;          mapping->last = eaddr;

If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".

It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of "AMDGPU_GPU_PAGE_MASK" :(.

...
...
Yeah, good point.

...
I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.

Well in theory it is possible that we always fill the GPUVM on a 4k basis while the native page size of the CPU is larger. Let me double check the code.

On my platform, there are two issues both causing the VM corruption. One is fixed by agd5f/linux@fe001e7.

What is fe001e7? A commit id? If yes then that is to short and I can't find it.

it's a gitlab shortcut for https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.fre...

Ah! Yes I would expect that this patch is fixing something in this use case.

Thanks, Christian.

...

Dan
...
...
Another is in Mesa from userspace:  it uses
`dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver expects it to use `dev_info->virtual_address_alignment`.
Mhm, looking at the kernel code I would rather say Mesa is correct and the kernel driver is broken.

The gart_page_size is limited by the CPU page size, but the virtual_address_alignment isn't.

...
If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue and make virtual_address_alignment = 4K. Otherwise, we should fortify the parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the userspace will just get an EINVAL, instead of a slient VM corruption. And someone should tell Mesa developers to fix the code in this case.

I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev->gfx.cu_info.number; dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size;

Thanks, Christian.

...
-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Xi Ruoyao

1:02 p.m.

On 2021-03-30 14:55 +0200, Christian König wrote:

...

Am 30.03.21 um 14:04 schrieb Xi Ruoyao:

...
On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:

...
On 2021-03-29 21:36 +0200, Christian König wrote:

...
Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

...
Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:

/* validate the parameters */           if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK               size == 0 || size & AMDGPU_GPU_PAGE_MASK)                   return -EINVAL;

/* snip */

saddr /= AMDGPU_GPU_PAGE_SIZE;           eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

mapping->start = saddr;           mapping->last = eaddr;

If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".

It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of "AMDGPU_GPU_PAGE_MASK" :(.

...
...
Yeah, good point.

...
I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.

Well in theory it is possible that we always fill the GPUVM on a 4k basis while the native page size of the CPU is larger. Let me double check the code.

On my platform, there are two issues both causing the VM corruption. One is fixed by agd5f/linux@fe001e7.

What is fe001e7? A commit id? If yes then that is to short and I can't find it.

...
Another is in Mesa from userspace: it uses `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver expects it to use `dev_info->virtual_address_alignment`.

Mhm, looking at the kernel code I would rather say Mesa is correct and the kernel driver is broken.

The gart_page_size is limited by the CPU page size, but the virtual_address_alignment isn't.

...
If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue and make virtual_address_alignment = 4K. Otherwise, we should fortify the parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the userspace will just get an EINVAL, instead of a slient VM corruption. And someone should tell Mesa developers to fix the code in this case.

I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)                  }                  dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);                  dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; +               dev_info->gart_page_size = dev_info->virtual_address_alignment;                  dev_info->cu_active_number = adev->gfx.cu_info.number;                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;

It works. I've seen it at https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae1... before (with a common sub-expression, though :).

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Xi Ruoyao

1:09 p.m.

On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:

...

On 2021-03-30 14:55 +0200, Christian König wrote:

...
I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev->gfx.cu_info.number; dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size;

It works. I've seen it at https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae1... before (with a common sub-expression, though :).

Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs without this commit. But on the system I built from source, I didn't see any issue before Linux 5.11. So I misbelieved that it was something already fixed.

Dan: you can try it on your PPC 64 with non-4K page as well.

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

Dan Horák

1:23 p.m.

On Tue, 30 Mar 2021 21:09:12 +0800 Xi Ruoyao xry111@mengyan1223.wang wrote:

...

On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:

...
On 2021-03-30 14:55 +0200, Christian König wrote:

...
I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev->gfx.cu_info.number; dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size;

It works. I've seen it at https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae1... before (with a common sub-expression, though :).

Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs without this commit. But on the system I built from source, I didn't see any issue before Linux 5.11. So I misbelieved that it was something already fixed.

Dan: you can try it on your PPC 64 with non-4K page as well.

yup, looks good here as well, ppc64le (Power9) system with 64KB pages

Dan

Christian König

1:24 p.m.

Am 30.03.21 um 15:23 schrieb Dan Horák:

...

On Tue, 30 Mar 2021 21:09:12 +0800 Xi Ruoyao xry111@mengyan1223.wang wrote:

...
On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:

...
On 2021-03-30 14:55 +0200, Christian König wrote:

...
I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev->gfx.cu_info.number; dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size;

It works. I've seen it at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com... before (with a common sub-expression, though :).

Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs without this commit. But on the system I built from source, I didn't see any issue before Linux 5.11. So I misbelieved that it was something already fixed.

Dan: you can try it on your PPC 64 with non-4K page as well.

yup, looks good here as well, ppc64le (Power9) system with 64KB pages

Mhm, as far as I can say this patch never made it to us.

Can you please send it to the mailing list with me on CC?

Thanks, Christian.

...

Dan

Xi Ruoyao

3:46 p.m.

On 2021-03-30 15:24 +0200, Christian König wrote:

...

Am 30.03.21 um 15:23 schrieb Dan Horák:

...
On Tue, 30 Mar 2021 21:09:12 +0800 Xi Ruoyao xry111@mengyan1223.wang wrote:

...
On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:

...
On 2021-03-30 14:55 +0200, Christian König wrote:

...
I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev-

...
gfx.cu_info.number;

dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size;

It works. I've seen it at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com... before (with a common sub-expression, though :).

Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs without this commit. But on the system I built from source, I didn't see any issue before Linux 5.11. So I misbelieved that it was something already fixed.

Dan: you can try it on your PPC 64 with non-4K page as well.

yup, looks good here as well, ppc64le (Power9) system with 64KB pages

Mhm, as far as I can say this patch never made it to us.

I think maybe its author considers it a "workaround" like me, as there is already a "virtual_address_alignment" which seems correct.

...

Can you please send it to the mailing list with me on CC?

I've sent it, together with my patch for using ~PAGE_MASK in parameter validation.

-- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University

1734

days inactive

1735

days old

linux-stable-mirror@lists.linaro.org

17 comments

participants

tags (0)

participants (5)

Christian König
Christian König
Dan Horák
Xi Ruoyao
Xℹ Ruoyao