Hi,
starting with 5.4.188 we see a massive performance regression on our nfs-server. It is serving requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), so that it is unusable as a fileserver.
The culprits are the following commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
5.10 and 5.15 both always performed very badly on our nfs-server in a similar way so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried to revert them in 5.15 yet.
Regards,
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
Hi,
starting with 5.4.188 we see a massive performance regression on our nfs-server. It is serving requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), so that it is unusable as a fileserver.
The culprits are the following commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
5.10 and 5.15 both always performed very badly on our nfs-server in a similar way so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried to revert them in 5.15 yet.
Odds are 5.18-rc6 is also a problem?
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
thanks,
greg k-h
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
Hi,
starting with 5.4.188 we see a massive performance regression on our nfs-server. It is serving requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), so that it is unusable as a fileserver.
The culprits are the following commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
5.10 and 5.15 both always performed very badly on our nfs-server in a similar way so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried to revert them in 5.15 yet.
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
-- Chuck Lever
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
Hi,
starting with 5.4.188 we see a massive performance regression on our nfs-server. It is serving requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), so that it is unusable as a fileserver.
The culprits are the following commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
5.10 and 5.15 both always performed very badly on our nfs-server in a similar way so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried to revert them in 5.15 yet.
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
thanks,
greg k-h
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
Hi,
starting with 5.4.188 we see a massive performance regression on our nfs-server. It is serving requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), so that it is unusable as a fileserver.
The culprits are the following commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
5.10 and 5.15 both always performed very badly on our nfs-server in a similar way so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried to revert them in 5.15 yet.
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads.
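For readers following along, the general shape of a "sleep while spinlock is held" bug is sketched below. This is illustrative only (made-up names, not the nfsd code path in question); CONFIG_DEBUG_ATOMIC_SLEEP is what turns it into the reported splat.

#include <linux/spinlock.h>
#include <linux/slab.h>

static DEFINE_SPINLOCK(example_lock);

static void *example_buggy_alloc(size_t len)
{
	void *p;

	spin_lock(&example_lock);
	/*
	 * BUG: a GFP_KERNEL allocation may sleep to reclaim memory, but
	 * sleeping is forbidden while a spinlock is held; with
	 * CONFIG_DEBUG_ATOMIC_SLEEP this produces a "sleeping function
	 * called from invalid context" warning.  The usual fix is to move
	 * the sleepable work outside the lock or to use GFP_ATOMIC.
	 */
	p = kmalloc(len, GFP_KERNEL);
	spin_unlock(&example_lock);
	return p;
}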
-- Chuck Lever
Am 2022-05-11 16:36, schrieb Chuck Lever III:
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
Hi,
starting with 5.4.188 we see a massive performance regression on our nfs-server. It is serving requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), so that it is unusable as a fileserver.
The culprits are the following commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
5.10 and 5.15 both always performed very badly on our nfs-server in a similar way so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried to revert them in 5.15 yet.
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads.
-- Chuck Lever
This may be unrelated: when I used 5.4.188 for the first time, I got this:
[Mon Apr 11 09:20:23 2022] ------------[ cut here ]------------ [Mon Apr 11 09:20:23 2022] refcount_t: underflow; use-after-free. [Mon Apr 11 09:20:23 2022] WARNING: CPU: 18 PID: 7443 at lib/refcount.c:190 refcount_sub_and_test_checked+0x48/0x50 [Mon Apr 11 09:20:23 2022] Modules linked in: rpcsec_gss_krb5(E) msr(E) 8021q(E) garp(E) stp(E) mrp(E) llc(E) dm_cache_smq(E) binfmt_misc(E) dm_cache(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) ipmi_ssif(E) intel_rapl_ms r(E) intel_rapl_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ast(E) snd_pcm(E) drm_vram_helper(E) ttm(E) ghash_clmulni_intel(E) s nd_timer(E) rapl(E) snd(E) soundcore(E) drm_kms_helper(E) intel_cstate(E) pcspkr(E) iTCO_wdt(E) intel_uncore(E) drm(E) iTCO_vendor_support(E) mei_me(E) joydev(E) i2c_algo_bit(E) watchdog(E) mei(E) ioatdma(E) evdev(E) sg(E) ipmi_si (E) ipmi_devintf(E) ipmi_msghandler(E) acpi_power_meter(E) acpi_pad(E) button(E) poly1305_x86_64(E) poly1305_generic(E) sha256_ssse3(E) sha1_ssse3(E) chacha_generic(E) chacha20poly1305(E) aesni_intel(E) crypto_simd(E) glue_helper( E) cryptd(E) drbd(E) lru_cache(E) nfsd(E) nfs_acl(E) lockd(E) grace(E) auth_rpcgss(E) sunrpc(E) [Mon Apr 11 09:20:23 2022] ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) dm_mod(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc 32c(E) crc32c_generic(E) raid0(E) multipath(E) linear(E) ses(E) enclosure(E) hid_generic(E) usbhid(E) hid(E) raid1(E) md_mod(E) sd_mod(E) crc32c_intel(E) ixgbe(E) ahci(E) xhci_pci(E) libahci(E) ehci_pci(E) xfrm_algo(E) mpt3sas(E) xhci_hcd(E) ehci_hcd(E) dca(E) raid_class(E) libata(E) libphy(E) usbcore(E) scsi_transport_sas(E) lpc_ich(E) ptp(E) i2c_i801(E) mfd_core(E) usb_common(E) pps_core(E) scsi_mod(E) mdio(E) wmi(E) [Mon Apr 11 09:20:23 2022] CPU: 18 PID: 7443 Comm: kworker/u50:1 Tainted: G W E 5.4.188-debian64.all+1.2 #1 [Mon Apr 11 09:20:23 2022] Hardware name: Supermicro X10DRi/X10DRI-T, BIOS 1.1a 10/16/2015 [Mon Apr 11 09:20:23 2022] Workqueue: rpciod rpc_async_schedule [sunrpc] [Mon Apr 11 09:20:23 2022] RIP: 0010:refcount_sub_and_test_checked+0x48/0x50 [Mon Apr 11 09:20:23 2022] Code: 31 e4 44 89 e0 41 5c c3 eb f8 44 0f b6 25 7c ee ce 00 45 84 e4 75 e8 48 c7 c7 80 97 8b b1 c6 05 69 ee ce 00 01 e8 78 3f 3a 00 <0f> 0b eb d4 0f 1f 40 00 48 89 fe bf 01 00 00 00 eb a6 66 0f 1f 44 [Mon Apr 11 09:20:23 2022] RSP: 0018:ffffad8a8d01fe48 EFLAGS: 00010286 [Mon Apr 11 09:20:23 2022] RAX: 0000000000000000 RBX: 0000000000002a81 RCX: 0000000000000000 [Mon Apr 11 09:20:23 2022] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 00000000ffffffff [Mon Apr 11 09:20:23 2022] RBP: ffff9fb4a2394400 R08: 0000000000000001 R09: 000000000000085b [Mon Apr 11 09:20:23 2022] R10: 0000000000030f5c R11: 0000000000000004 R12: 0000000000000000 [Mon Apr 11 09:20:23 2022] R13: ffff9feaf358d400 R14: 0000000000000000 R15: ffff9fb7e1b25148 [Mon Apr 11 09:20:23 2022] FS: 0000000000000000(0000) GS:ffff9feaff980000(0000) knlGS:0000000000000000 [Mon Apr 11 09:20:23 2022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Mon Apr 11 09:20:23 2022] CR2: 00007ffe62ca3080 CR3: 000000350880a002 CR4: 00000000001606e0 [Mon Apr 11 09:20:23 2022] Call Trace: [Mon Apr 11 09:20:23 2022] nfsd4_cb_offload_release+0x16/0x30 [nfsd] [Mon Apr 11 09:20:23 2022] rpc_free_task+0x39/0x70 [sunrpc] [Mon Apr 11 09:20:23 2022] rpc_async_schedule+0x29/0x40 [sunrpc] [Mon Apr 11 
09:20:23 2022] process_one_work+0x1ab/0x390 [Mon Apr 11 09:20:23 2022] worker_thread+0x50/0x3c0 [Mon Apr 11 09:20:23 2022] ? rescuer_thread+0x370/0x370 [Mon Apr 11 09:20:23 2022] kthread+0x13c/0x160 [Mon Apr 11 09:20:23 2022] ? __kthread_bind_mask+0x60/0x60 [Mon Apr 11 09:20:23 2022] ret_from_fork+0x1f/0x40 [Mon Apr 11 09:20:23 2022] ---[ end trace c58c1aaca9be5d21 ]--- [Mon Apr 11 09:56:32 2022] perf: interrupt took too long (6293 > 6228), lowering kernel.perf_event_max_sample_rate to 31750 [Mon Apr 11 09:59:17 2022] ------------[ cut here ]------------
[Mon Apr 11 09:59:17 2022] nfsd4_process_open2 failed to open newly-created file! status=10008 [Mon Apr 11 09:59:17 2022] WARNING: CPU: 2 PID: 3061 at fs/nfsd/nfs4proc.c:456 nfsd4_open+0x39a/0x710 [nfsd] [Mon Apr 11 09:59:17 2022] Modules linked in: rpcsec_gss_krb5(E) msr(E) 8021q(E) garp(E) stp(E) mrp(E) llc(E) dm_cache_smq(E) binfmt_misc(E) dm_cache(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) ipmi_ssif(E) intel_rapl_msr(E) intel_rapl_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ast(E) snd_pcm(E) drm_vram_helper(E) ttm(E) ghash_clmulni_intel(E) snd_timer(E) rapl(E) snd(E) soundcore(E) drm_kms_helper(E) intel_cstate(E) pcspkr(E) iTCO_wdt(E) intel_uncore(E) drm(E) iTCO_vendor_support(E) mei_me(E) joydev(E) i2c_algo_bit(E) watchdog(E) mei(E) ioatdma(E) evdev(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_power_meter(E) acpi_pad(E) button(E) poly1305_x86_64(E) poly1305_generic(E) sha256_ssse3(E) sha1_ssse3(E) chacha_generic(E) chacha20poly1305(E) aesni_intel(E) crypto_simd(E) glue_helper(E) cryptd(E) drbd(E) lru_cache(E) nfsd(E) nfs_acl(E) lockd(E) grace(E) auth_rpcgss(E) sunrpc(E) [Mon Apr 11 09:59:17 2022] ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) dm_mod(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid0(E) multipath(E) linear(E) ses(E) enclosure(E) hid_generic(E) usbhid(E) hid(E) raid1(E) md_mod(E) sd_mod(E) crc32c_intel(E) ixgbe(E) ahci(E) xhci_pci(E) libahci(E) ehci_pci(E) xfrm_algo(E) mpt3sas(E) xhci_hcd(E) ehci_hcd(E) dca(E) raid_class(E) libata(E) libphy(E) usbcore(E) scsi_transport_sas(E) lpc_ich(E) ptp(E) i2c_i801(E) mfd_core(E) usb_common(E) pps_core(E) scsi_mod(E) mdio(E) wmi(E) [Mon Apr 11 09:59:17 2022] CPU: 2 PID: 3061 Comm: nfsd Tainted: G W E 5.4.188-debian64.all+1.2 #1 [Mon Apr 11 09:59:17 2022] Hardware name: Supermicro X10DRi/X10DRI-T, BIOS 1.1a 10/16/2015 [Mon Apr 11 09:59:17 2022] RIP: 0010:nfsd4_open+0x39a/0x710 [nfsd] [Mon Apr 11 09:59:17 2022] Code: e8 db 0f 01 00 41 89 c5 85 c0 0f 84 5e 01 00 00 80 bd 15 01 00 00 00 74 13 44 89 ee 48 c7 c7 a8 08 6f c0 0f ce e8 b6 f8 ae f0 <0f> 0b 4d 85 ff 0f 84 58 fd ff ff 4d 39 fc 0f 84 4f fd ff ff 4c 89 [Mon Apr 11 09:59:17 2022] RSP: 0018:ffffad8ab09c7db8 EFLAGS: 00010286 [Mon Apr 11 09:59:17 2022] RAX: 0000000000000000 RBX: ffff9feaaf7cb000 RCX: 0000000000000000 [Mon Apr 11 09:59:17 2022] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 00000000ffffffff [Mon Apr 11 09:59:17 2022] RBP: ffff9feaae54f460 R08: 0000000000000001 R09: 000000000000087a [Mon Apr 11 09:59:17 2022] R10: 0000000000031e00 R11: 0000000000000004 R12: ffff9feaae550070 [Mon Apr 11 09:59:17 2022] R13: 0000000018270000 R14: ffff9feaae5ae000 R15: ffff9fc07ccfd400 [Mon Apr 11 09:59:17 2022] FS: 0000000000000000(0000) GS:ffff9fcb1f880000(0000) knlGS:0000000000000000 [Mon Apr 11 09:59:17 2022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Mon Apr 11 09:59:17 2022] CR2: 00007fdae60db000 CR3: 00000037e87fe005 CR4: 00000000001606e0 [Mon Apr 11 09:59:17 2022] Call Trace: [Mon Apr 11 09:59:17 2022] nfsd4_proc_compound+0x45d/0x730 [nfsd] [Mon Apr 11 09:59:17 2022] nfsd_dispatch+0xc1/0x200 [nfsd] [Mon Apr 11 09:59:17 2022] svc_process_common+0x399/0x6e0 [sunrpc] [Mon Apr 11 09:59:17 2022] ? svc_recv+0x312/0x9f0 [sunrpc] [Mon Apr 11 09:59:17 2022] ? nfsd_svc+0x2f0/0x2f0 [nfsd] [Mon Apr 11 09:59:17 2022] ? 
nfsd_destroy+0x60/0x60 [nfsd] [Mon Apr 11 09:59:17 2022] svc_process+0xd4/0x110 [sunrpc] [Mon Apr 11 09:59:17 2022] nfsd+0xed/0x150 [nfsd] [Mon Apr 11 09:59:17 2022] kthread+0x13c/0x160 [Mon Apr 11 09:59:17 2022] ? __kthread_bind_mask+0x60/0x60 [Mon Apr 11 09:59:17 2022] ret_from_fork+0x1f/0x40 [Mon Apr 11 09:59:17 2022] ---[ end trace c58c1aaca9be5d22 ]---
I never saw this with an earlier kernel (and we have been using 5.4 for a long time).
I did not see this with the unchanged 5.4.192, though, which I tested before running 5.4.192 with the above-mentioned patches reverted.
Regards,
On May 11, 2022, at 11:30 AM, Wolfgang Walter linux@stwm.de wrote:
Am 2022-05-11 16:36, schrieb Chuck Lever III:
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote: On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote: On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
[...]
Odds are 5.18-rc6 is also a problem?
We believe that 6b8a94332ee4 ("nfsd: Fix a write performance regression") addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads. -- Chuck Lever
This may be unrelated: when I used 5.4.188 for the first time, I got this:
[Mon Apr 11 09:20:23 2022] ------------[ cut here ]------------ [Mon Apr 11 09:20:23 2022] refcount_t: underflow; use-after-free.
I don't believe this is the same issue as the performance regression. Hard to say more without knowing what the server's workload was when this occurred.
The only two NFSD-related patches that appear between v5.4.187 and v5.4.188 are Trond's filecache patches. It's implausible that a callback reference count issue would be related.
[Mon Apr 11 09:20:23 2022] WARNING: CPU: 18 PID: 7443 at lib/refcount.c:190 refcount_sub_and_test_checked+0x48/0x50
[...]
[Mon Apr 11 09:20:23 2022] Call Trace: [Mon Apr 11 09:20:23 2022] nfsd4_cb_offload_release+0x16/0x30 [nfsd] [Mon Apr 11 09:20:23 2022] rpc_free_task+0x39/0x70 [sunrpc] [Mon Apr 11 09:20:23 2022] rpc_async_schedule+0x29/0x40 [sunrpc]
[...]
[Mon Apr 11 09:59:17 2022] nfsd4_process_open2 failed to open newly-created file! status=10008
This is definitely not the same problem. This is a failure to open a freshly created file. Maybe due to a memory shortage? But this area has been a problem for a long time.
I have some patches going into 5.19 that might help in this area by making NFSv4 OPEN(CREATE) atomic.
[Mon Apr 11 09:59:17 2022] WARNING: CPU: 2 PID: 3061 at fs/nfsd/nfs4proc.c:456 nfsd4_open+0x39a/0x710 [nfsd]
[...]
I never saw this with an earlier kernel (and we have been using 5.4 for a long time).
I did not see this with the unchanged 5.4.192, though, which I tested before running 5.4.192 with the above-mentioned patches reverted.
Regards,
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
-- Chuck Lever
Am 2022-05-11 17:50, schrieb Chuck Lever III:
On May 11, 2022, at 11:30 AM, Wolfgang Walter linux@stwm.de wrote:
Am 2022-05-11 16:36, schrieb Chuck Lever III:
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote: On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote: On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
[...]
Odds are 5.18-rc6 is also a problem?
We believe that 6b8a94332ee4 ("nfsd: Fix a write performance regression") addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads. -- Chuck Lever
This maybe unrelated: when I used 5.4.188 for the first time I got this:
[Mon Apr 11 09:20:23 2022] ------------[ cut here ]------------ [Mon Apr 11 09:20:23 2022] refcount_t: underflow; use-after-free.
I don't believe this is the same issue as the performance regression. Hard to say more without knowing what the server's workload was when this occurred.
The only two NFSD-related patches that appear between v5.4.187 and v5.4.188 are Trond's filecache. It's implausible that a callback reference count issue would be related.
Maybe it is because of the high CPU load with these two patches. Usually there are only 10 nfsd threads in R state, even under high load and only for short times; with these two patches there are periods where there are 50 or more.
Our monitoring says that the load average (1m) is normally < 1; with these two patches it was 18. So there is more parallelism.
[Mon Apr 11 09:20:23 2022] WARNING: CPU: 18 PID: 7443 at lib/refcount.c:190 refcount_sub_and_test_checked+0x48/0x50
[...]
[Mon Apr 11 09:20:23 2022] Call Trace: [Mon Apr 11 09:20:23 2022] nfsd4_cb_offload_release+0x16/0x30 [nfsd] [Mon Apr 11 09:20:23 2022] rpc_free_task+0x39/0x70 [sunrpc] [Mon Apr 11 09:20:23 2022] rpc_async_schedule+0x29/0x40 [sunrpc]
[...]
[Mon Apr 11 09:59:17 2022] nfsd4_process_open2 failed to open newly-created file! status=10008
This is definitely not the same problem. This is a failure to open a freshly created file. Maybe due to a memory shortage?
No, I don't think so, there was about 100G available.
But this area has been a problem for a long time.
I have some patches going into 5.19 that might help in this area by making NFSv4 OPEN(CREATE) atomic.
[Mon Apr 11 09:59:17 2022] WARNING: CPU: 2 PID: 3061 at fs/nfsd/nfs4proc.c:456 nfsd4_open+0x39a/0x710 [nfsd]
[...]
I never saw this with an earlier kernel (and we have been using 5.4 for a long time).
I did not see this with the unchanged 5.4.192, though, which I tested before running 5.4.192 with the above-mentioned patches reverted.
Regards,
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
-- Chuck Lever
On May 11, 2022, at 10:36 AM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
Hi,
starting with 5.4.188 we see a massive performance regression on our nfs-server. It is serving requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), so that it is unusable as a fileserver.
The culprits are the following commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
5.10 and 5.15 both always performed very badly on our nfs-server in a similar way so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried to revert them in 5.15 yet.
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads.
We believe the following, which should be part of the first NFSD pull request for 5.19, will properly address the splat.
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=for-...
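As background for the "putlist" mentioned below: the usual shape of this kind of fix is to collect the objects to release on a private list while the spinlock is held and to do the sleepable puts only after unlocking. The sketch below uses made-up names and is not the actual for-5.19 commit.

#include <linux/spinlock.h>
#include <linux/list.h>

struct file_ref {
	struct list_head put_entry;
};

static void example_release_refs(spinlock_t *lock, struct list_head *hashed)
{
	struct file_ref *fr, *tmp;
	LIST_HEAD(putlist);

	spin_lock(lock);
	/* move everything we want to drop onto a private list */
	list_for_each_entry_safe(fr, tmp, hashed, put_entry)
		list_move(&fr->put_entry, &putlist);
	spin_unlock(lock);

	/* now it is safe to call functions that may sleep */
	list_for_each_entry_safe(fr, tmp, &putlist, put_entry) {
		list_del(&fr->put_entry);
		/* the real code would drop the reference here (a put that may sleep) */
	}
}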
-- Chuck Lever
On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
On May 11, 2022, at 10:36 AM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
Hi,
starting with 5.4.188 we see a massive performance regression on our nfs-server. It is serving requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), so that it is unusable as a fileserver.
The culprits are the following commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
5.10 and 5.15 both always performed very badly on our nfs-server in a similar way, so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried to revert them in 5.15 yet.
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads.
We believe the following, which should be part of the first NFSD pull request for 5.19, will properly address the splat.
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=for-...
Uh... What happens if you have 2 simultaneous calls to nfsd4_release_lockowner() for the same file? i.e. 2 separate processes owned by the same user, both locking the same file.
Can't that cause the 'putlist' to get corrupted when both callers add the same nf->nf_putfile to two separate lists?
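For readers less familiar with the kernel's intrusive lists, the corruption being described follows directly from list_head mechanics: a node can only be a member of one list at a time, and a second list_add() silently relinks it. A minimal sketch with generic types (not the nfsd structures):

#include <linux/list.h>

struct item {
	struct list_head entry;		/* analogous to the nfsd_file linkage used for the putlist */
};

static void example_double_add(struct item *it)
{
	LIST_HEAD(putlist_a);		/* caller 1's private putlist */
	LIST_HEAD(putlist_b);		/* caller 2's private putlist */

	list_add(&it->entry, &putlist_a);
	/*
	 * Without serialization a second caller can do the same thing:
	 * it->entry now points into putlist_b while putlist_a still
	 * points at it->entry, so walking either list follows stale
	 * pointers and both lists are corrupted.
	 */
	list_add(&it->entry, &putlist_b);
}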
-- Chuck Lever
On 5/20/22 9:40 AM, Trond Myklebust wrote:
On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
On May 11, 2022, at 10:36 AM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
[...]
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads.
We believe the following, which should be part of the first NFSD pull request for 5.19, will properly address the splat.
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=for-...
Uh... What happens if you have 2 simultaneous calls to nfsd4_release_lockowner() for the same file? i.e. 2 separate processes owned by the same user, both locking the same file.
Can't that cause the 'putlist' to get corrupted when both callers add the same nf->nf_putfile to two separate lists?
Thanks Trond for catching this, I'll submit the v2 patch.
-Dai
-- Chuck Lever
On May 20, 2022, at 12:40 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
On May 11, 2022, at 10:36 AM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
[...]
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads.
We believe the following, which should be part of the first NFSD pull request for 5.19, will properly address the splat.
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=for-...
Uh... What happens if you have 2 simultaneous calls to nfsd4_release_lockowner() for the same file? i.e. 2 separate processes owned by the same user, both locking the same file.
Can't that cause the 'putlist' to get corrupted when both callers add the same nf->nf_putfile to two separate lists?
IIUC, cl_lock serializes the two RELEASE_LOCKOWNER calls.
The first call finds the lockowner in cl_ownerstr_hashtbl and unhashes it before releasing cl_lock.
Then the second cannot find that lockowner, thus it can't requeue it for bulk_put.
Am I missing something?
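The serialization argument, reduced to a generic sketch (names and structures are illustrative, not a verbatim nfs4state.c excerpt): because the lookup and the unhash happen under the same lock, only one of the two RELEASE_LOCKOWNER calls can find the lockowner.

#include <linux/spinlock.h>
#include <linux/list.h>

struct owner {
	struct list_head hash_entry;
};

static DEFINE_SPINLOCK(example_cl_lock);
static LIST_HEAD(example_ownerstr_hash);

static struct owner *example_find_and_unhash(struct owner *target)
{
	struct owner *o, *found = NULL;

	spin_lock(&example_cl_lock);
	list_for_each_entry(o, &example_ownerstr_hash, hash_entry) {
		if (o == target) {
			found = o;
			/* unhash while still holding the lock */
			list_del_init(&o->hash_entry);
			break;
		}
	}
	spin_unlock(&example_cl_lock);

	/* a concurrent caller that lost the race sees NULL here */
	return found;
}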
-- Chuck Lever
On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
On May 20, 2022, at 12:40 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
On May 11, 2022, at 10:36 AM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 11, 2022, at 10:23 AM, Greg KH gregkh@linuxfoundation.org wrote:
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH gregkh@linuxfoundation.org wrote: On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
[...]
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads.
We believe the following, which should be part of the first NFSD pull request for 5.19, will properly address the splat.
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=for-...
Uh... What happens if you have 2 simultaneous calls to nfsd4_release_lockowner() for the same file? i.e. 2 separate processes owned by the same user, both locking the same file.
Can't that cause the 'putlist' to get corrupted when both callers add the same nf->nf_putfile to two separate lists?
IIUC, cl_lock serializes the two RELEASE_LOCKOWNER calls.
The first call finds the lockowner in cl_ownerstr_hashtbl and unhashes it before releasing cl_lock.
Then the second cannot find that lockowner, thus it can't requeue it for bulk_put.
Am I missing something?
In the example I quoted, there are 2 separate processes running on the client. Those processes could share the same open owner + open stateid, and hence the same struct nfs4_file, since that depends only on the process credentials matching. However they will not normally share a lock owner, since POSIX does not expect different processes to share locks.
IOW: The point is that one can relatively easily create 2 different lock owners with different lock stateids that share the same underlying struct nfs4_file.
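As an aside, a minimal user-space sketch of that scenario (illustrative only, not from the thread; the path and byte ranges are arbitrary): two processes with the same credentials each take a POSIX lock on non-overlapping ranges of the same NFS file, so the server ends up with two lock owners sharing a single struct nfs4_file.

/* Illustrative sketch only: two processes, same credentials, each takes
 * its own POSIX lock on the same file.  Against an NFSv4 server this
 * yields two lock owners that share one struct nfs4_file. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void hold_lock(const char *path, off_t start, off_t len)
{
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = start,
		.l_len    = len,
	};
	int fd = open(path, O_RDWR);

	if (fd < 0 || fcntl(fd, F_SETLKW, &fl) < 0) {
		perror("lock");
		exit(1);
	}
	sleep(5);	/* hold the lock briefly, then drop it on close */
	close(fd);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/nfs/testfile";

	if (fork() == 0) {
		hold_lock(path, 0, 100);	/* child: its own lock owner */
		return 0;
	}
	hold_lock(path, 100, 100);		/* parent: a second lock owner */
	wait(NULL);
	return 0;
}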
On May 20, 2022, at 6:24 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
On May 20, 2022, at 12:40 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
[earlier quoted messages trimmed]
In the example I quoted, there are 2 separate processes running on the client. Those processes could share the same open owner + open stateid, and hence the same struct nfs4_file, since that depends only on the process credentials matching. However they will not normally share a lock owner, since POSIX does not expect different processes to share locks.
IOW: The point is that one can relatively easily create 2 different lock owners with different lock stateids that share the same underlying struct nfs4_file.
Is there a similar exposure if two different clients are locking the same file? If so, then we can't use a per-nfs4_client semaphore to serialize access to the nf_putfile field.
-- Chuck Lever
On May 20, 2022, at 7:43 PM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 20, 2022, at 6:24 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
On May 20, 2022, at 12:40 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
On May 11, 2022, at 10:36 AM, Chuck Lever III chuck.lever@oracle.com wrote:
[earlier quoted messages trimmed]
Is there a similar exposure if two different clients are locking the same file? If so, then we can't use a per-nfs4_client semaphore to serialize access to the nf_putfile field.
I had a thought about an alternate approach.
Create a second nfsd_file_put API that is not allowed to sleep. Let's call it "nfsd_file_put_async()". Teach check_for_locks() to use that instead of nfsd_file_put().
Here's where I'm a little fuzzy: nfsd_file_put_async() could do something like:
void nfsd_file_put_async(struct nfsd_file *nf)
{
	if (refcount_dec_and_test(&nf->nf_ref))
		nfsd_file_close_inode(nf->nf_inode);
}
-- Chuck Lever
On Sat, 2022-05-21 at 17:22 +0000, Chuck Lever III wrote:
On May 20, 2022, at 7:43 PM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 20, 2022, at 6:24 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
On May 20, 2022, at 12:40 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
[earlier quoted messages trimmed]
I had a thought about an alternate approach.
Create a second nfsd_file_put API that is not allowed to sleep. Let's call it "nfsd_file_put_async()". Teach check_for_locks() to use that instead of nfsd_file_put().
Here's where I'm a little fuzzy: nfsd_file_put_async() could do something like:
void nfsd_file_put_async(struct nfsd_file *nf)
{
	if (refcount_dec_and_test(&nf->nf_ref))
		nfsd_file_close_inode(nf->nf_inode);
}
That approach moves the sync to the garbage collector, which was exactly what we're trying to avoid in the first place.
Why not just do this "check_for_locks()" thing differently?
It really shouldn't be too hard to add something to nfsd4_lm_get_owner()/nfsd4_lm_put_owner() that bumps a counter in the lockowner in order to tell you whether or not locks are still held instead of doing this bone headed walk through the list of locks.
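For reference, a rough sketch of that counter idea (the field and helper names below are hypothetical, and this is not an actual patch): take a per-lockowner count in lm_get_owner, drop it in lm_put_owner, and let check_for_locks() test the count instead of walking the inode's lock list.

/* Sketch only -- lo_lock_count and these helpers are hypothetical. */
struct nfs4_lockowner {
	/* ... existing fields ... */
	atomic_t lo_lock_count;	/* lock ranges currently referencing this owner */
};

static fl_owner_t nfsd4_lm_get_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	atomic_inc(&lo->lo_lock_count);
	return owner;
}

static void nfsd4_lm_put_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;

	atomic_dec(&lo->lo_lock_count);
}

/* check_for_locks() could then reduce to a counter test: */
static bool lockowner_has_locks(struct nfs4_lockowner *lo)
{
	return atomic_read(&lo->lo_lock_count) != 0;
}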
On May 21, 2022, at 2:10 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Sat, 2022-05-21 at 17:22 +0000, Chuck Lever III wrote:
On May 20, 2022, at 7:43 PM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 20, 2022, at 6:24 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
On May 20, 2022, at 12:40 PM, Trond Myklebust trondmy@hammerspace.com wrote:
[earlier quoted messages trimmed]
That approach moves the sync to the garbage collector, which was exactly what we're trying to avoid in the first place.
Totally understood.
My thought was that "put" for RELEASE_LOCKOWNER/FREE_STATEID would be unlikely to have any data to sync -- callers that actually have data to flush are elsewhere, and those would continue to use the synchronous nfsd_file_put() API.
Do you have a workload where we can test this assumption?
Why not just do this "check_for_locks()" thing differently?
It really shouldn't be too hard to add something to nfsd4_lm_get_owner()/nfsd4_lm_put_owner() that bumps a counter in the lockowner in order to tell you whether or not locks are still held instead of doing this bone headed walk through the list of locks.
I thought of that a couple weeks ago. That doesn't work because you can lock or unlock by range. That means the symmetry of LOCK and LOCKU is not guaranteed, and I don't believe these calls are used that way anyway. So I abandoned the idea of using get_owner / put_owner.
But maybe we can provide some other mechanism to record whether a lockowner is associated with file locks.
-- Chuck Lever
On Sat, 2022-05-21 at 19:11 +0000, Chuck Lever III wrote:
On May 21, 2022, at 2:10 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Sat, 2022-05-21 at 17:22 +0000, Chuck Lever III wrote:
On May 20, 2022, at 7:43 PM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 20, 2022, at 6:24 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
[earlier quoted messages trimmed]
Why not just do this "check_for_locks()" thing differently?
It really shouldn't be too hard to add something to nfsd4_lm_get_owner()/nfsd4_lm_put_owner() that bumps a counter in the lockowner in order to tell you whether or not locks are still held instead of doing this bone headed walk through the list of locks.
I thought of that a couple weeks ago. That doesn't work because you can lock or unlock by range. That means the symmetry of LOCK and LOCKU is not guaranteed, and I don't believe these calls are used that way anyway. So I abandoned the idea of using get_owner / put_owner.
Then you're misunderstanding how it works. lm_get_owner() is called when a lock is initialised from another one. The whole point is to ensure that each and every object representing a range lock on the inode's list maintains its own private reference to the knfsd lockowner (i.e. the fl->fl_owner).
For instance when a LOCK call calls posix_lock_inode(), then that function uses locks_copy_conflock() (which calls lm_get_owner) to initialise the range lock object that is being put on the inode list. If the new lock causes multiple existing locks to be replaced, they all call lm_put_owner to release their references to fl->fl_owner as part of the process of being freed.
Conversely, when LOCKU causes a range to get split, the two locks that replace the old one are both initialised using locks_copy_conflock(), so they both call lm_get_owner. The lock that represents the range being replaced is then made to call lm_put_owner() when it is freed.
etc, etc...
But maybe we can provide some other mechanism to record whether a lockowner is associated with file locks.
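To see the range-split behaviour Trond describes from user space (illustrative only; the path is arbitrary): a single unlock in the middle of a held range leaves two lock objects on the inode, which is why counting LOCK against LOCKU operations never balances, while the per-lock-object owner references taken via lm_get_owner/lm_put_owner remain consistent.

/* Sketch: one lock plus one unlock leaves two lock ranges behind. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int setlk(int fd, short type, off_t start, off_t len)
{
	struct flock fl = {
		.l_type   = type,
		.l_whence = SEEK_SET,
		.l_start  = start,
		.l_len    = len,
	};

	return fcntl(fd, F_SETLK, &fl);
}

int main(int argc, char **argv)
{
	int fd = open(argc > 1 ? argv[1] : "/mnt/nfs/testfile", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	setlk(fd, F_WRLCK, 0, 100);	/* LOCK: bytes 0-99 */
	setlk(fd, F_UNLCK, 40, 20);	/* LOCKU: splits it into 0-39 and 60-99 */
	pause();			/* the lockowner still holds two ranges */
	return 0;
}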
On May 21, 2022, at 3:49 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Sat, 2022-05-21 at 19:11 +0000, Chuck Lever III wrote:
On May 21, 2022, at 2:10 PM, Trond Myklebust trondmy@hammerspace.com wrote:
On Sat, 2022-05-21 at 17:22 +0000, Chuck Lever III wrote:
On May 20, 2022, at 7:43 PM, Chuck Lever III chuck.lever@oracle.com wrote:
On May 20, 2022, at 6:24 PM, Trond Myklebust trondmy@hammerspace.com wrote:
[earlier quoted messages trimmed]
Then you're misunderstanding how it works. lm_get_owner() is called when a lock is initialised from another one. The whole point is to ensure that each and every object representing a range lock on the inode's list maintains its own private reference to the knfsd lockowner (i.e. the fl->fl_owner).
For instance when a LOCK call calls posix_lock_inode(), then that function uses locks_copy_conflock() (which calls lm_get_owner) to initialise the range lock object that is being put on the inode list. If the new lock causes multiple existing locks to be replaced, they all call lm_put_owner to release their references to fl->fl_owner as part of the process of being freed.
Conversely, when LOCKU causes a range to get split, the two locks that replace the old one are both initialised using locks_copy_conflock(), so they both call lm_get_owner. The lock that represents the range being replaced is then made to call lm_put_owner() when it is freed.
etc, etc...
That is definitely not what it looked like when I traced it. The reference count managed by get_owner / put_owner did not seem to be usable for tracking whether a lockowner had locks or not. The reference count was put pretty quickly after the lm_get_owner call.
But I'm not interested in an argument. I'll go back and look at get_owner / put_owner again, because I agree that not having to traverse the inode's flc_posix list during check_for_locks() would be awesome sauce.
-- Chuck Lever