On May 11, 2022, at 10:23 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck Lever III wrote:
On May 11, 2022, at 8:38 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
On Wed, May 11, 2022 at 12:03:13PM +0200, Wolfgang Walter wrote:
Hi,
Starting with 5.4.188, we see a massive performance regression on our nfs-server. It serves requests very slowly, with CPU utilization at 100% (with 5.4.187 and earlier it is 10%), which makes it unusable as a fileserver.
The culprits are these commits (or one of them):
c32f1041382a88b17da5736886da4a492353a1bb "nfsd: cleanup nfsd_file_lru_dispose()"
628adfa21815f74c04724abc85847f24b5dd1645 "nfsd: Containerise filecache laundrette"
(upstream 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and 9542e6a643fc69d528dfb3303f145719c61d3050)
If I revert them in v5.4.192 the kernel works as before and performance is ok again.
I did not try to revert them one by one as any disruption of our nfs-server is a severe problem for us and I'm not sure if they are related.
Both 5.10 and 5.15 have always performed very badly on our nfs-server in a similar way, so we were stuck with 5.4.
I now think this is because of 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and/or 9542e6a643fc69d528dfb3303f145719c61d3050, though I haven't tried reverting them in 5.15 yet.
Odds are 5.18-rc6 is also a problem?
We believe that
6b8a94332ee4 ("nfsd: Fix a write performance regression")
addresses the performance regression. It was merged into 5.18-rc.
And into 5.17.4 if someone wants to try that release.
I don't have a lot of time to backport this one myself, so I welcome anyone who wants to apply that commit to their favorite LTS kernel and test it for us.
If so, I'll just wait for the fix to get into Linus's tree as this does not seem to be a stable-tree-only issue.
Unfortunately I've received a recent report that the fix introduces a "sleep while spinlock is held" for NFSv4.0 in rare cases.
Ick, not good, any potential fixes for that?
Not yet. I was at LSF last week, so I've just started digging into this one. I've confirmed that the report is a real bug, but we still don't know how hard it is to hit it with real workloads.
-- Chuck Lever