Hi,
some of our customers (Proxmox VE) are seeing issues with file corruption when accessing contents located on CephFS via the in-kernel Ceph client [0,1]. We managed to reproduce this regression on kernels up to the latest 6.11-rc6. Accessing the same content on the CephFS using the FUSE client or the in-kernel Ceph client with older kernels (Ubuntu kernel based on v6.5) does not show file corruption. Unfortunately, the corruption is hard to reproduce; seemingly only a small subset of files is affected. However, once a file is affected, the issue is persistent and can easily be reproduced.
Bisection with the reproducer points to this commit:
"92b6cc5d: netfs: Add iov_iters to (sub)requests to describe various buffers"
Description of the issue:
A file was copied from the local filesystem to CephFS via:
```
cp /tmp/proxmox-backup-server_3.2-1.iso /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
```
- sha256sum on the local filesystem: `1d19698e8f7e769cf0a0dcc7ba0018ef5416c5ec495d5e61313f9c84a4237607 /tmp/proxmox-backup-server_3.2-1.iso`
- sha256sum on CephFS with a kernel up to the above commit: `1d19698e8f7e769cf0a0dcc7ba0018ef5416c5ec495d5e61313f9c84a4237607 /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso`
- sha256sum on CephFS with a kernel after the above commit: `89ad3620bf7b1e0913b534516cfbe48580efbaec944b79951e2c14e5e551f736 /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso`
- Removing and/or recopying the file does not change the issue; the corrupt checksum remains the same.
- Accessing the same file from different clients results in the same outcome: clients with the above patch applied show the incorrect checksum, clients without the patch show the correct checksum.
- The issue persists even across reboots of the Ceph cluster and/or the clients.
- The file is indeed corrupt after reading, as verified by `cmp -b`. Interestingly, the first 4M contain the correct data, while the following 4M are read as all zeros, which differs from the original data.
- The issue is related to the readahead size: mounting the CephFS with `rasize=0` makes the issue disappear, and the same is true for sizes up to 128k (please note that the ranges as initially reported on the mailing list [3] are not correct; for rasize in [0..128k] the file is not corrupted). A rough reproduction sketch follows below.
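To keep the steps in one place, here is a minimal reproduction sketch based on the description above (the monitor address and mount credentials are placeholders, not taken from this report; only the rasize option is known to matter):
```
# Assumed CephFS mount (authentication options omitted); default readahead first.
mount -t ceph 192.0.2.1:/ /mnt/pve/cephfs -o name=admin

# Copy the file and compare checksums of the local copy and the CephFS read-back.
cp /tmp/proxmox-backup-server_3.2-1.iso /mnt/pve/cephfs/
echo 3 > /proc/sys/vm/drop_caches      # make sure the read actually goes back to the OSDs
sha256sum /tmp/proxmox-backup-server_3.2-1.iso \
          /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso

# Remounting with readahead disabled reportedly makes the corruption disappear.
umount /mnt/pve/cephfs
mount -t ceph 192.0.2.1:/ /mnt/pve/cephfs -o name=admin,rasize=0
sha256sum /mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso
```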
In the bugtracker issue [4] I attached an ftrace with "*ceph*" as filter, captured on the latest kernel 6.11-rc6 while performing:
```
dd if=/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso of=/tmp/test.out bs=8M count=1
```
The relevant part is shown by task `dd-26192`.
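For completeness, such a trace can be captured roughly as sketched below, assuming the standard tracefs mount point (the output file name is made up):
```
cd /sys/kernel/tracing
echo 0 > tracing_on
echo '*ceph*' > set_ftrace_filter      # restrict tracing to ceph-related kernel functions
echo function > current_tracer
echo 1 > tracing_on
dd if=/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso of=/tmp/test.out bs=8M count=1
echo 0 > tracing_on
cp trace /tmp/ceph-ftrace.txt          # the relevant entries belong to the dd-<pid> task
```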
Please let me know if I can provide further information or debug outputs in order to narrow down the issue.
[0] https://forum.proxmox.com/threads/78340/post-676129
[1] https://forum.proxmox.com/threads/149249/
[2] https://forum.proxmox.com/threads/151291/
[3] https://lore.kernel.org/lkml/db686d0c-2f27-47c8-8c14-26969433b13b@proxmox.co...
[4] https://bugzilla.kernel.org/show_bug.cgi?id=219237
#regzbot introduced: 92b6cc5d
Regards, Christian Ebner
Hi Christian,
Thanks for reporting this.
Let me have a look and see how to fix this.
- Xiubo
On 06.09.2024 03:01 CEST Xiubo Li xiubli@redhat.com wrote:
Hi Christian,
Thanks for reporting this.
Let me have a look and see how to fix this.
- Xiubo
Thanks for looking into it; please do not hesitate to contact me if I can be of help with debugging or testing a possible fix.
Regards, Christian Ebner
On 9/6/24 15:15, Christian Ebner wrote:
On 06.09.2024 03:01 CEST Xiubo Li xiubli@redhat.com wrote:
Hi Christian,
Thanks for reporting this.
Let me have a look and see how to fix this.
- Xiubo
Thanks for looking into it; please do not hesitate to contact me if I can be of help with debugging or testing a possible fix.
Sure, thanks in advance.
I will work on this next week or later, as I am currently occupied with some other things.
- Xiubo
On 06.09.24 13:09, Xiubo Li wrote:
On 9/6/24 15:15, Christian Ebner wrote:
On 06.09.2024 03:01 CEST Xiubo Li xiubli@redhat.com wrote:
Thanks for reporting this.
Let me have a look and see how to fix this.
Thanks for looking into it; please do not hesitate to contact me if I can be of help with debugging or testing a possible fix.
Sure, thanks in advance.
I will work on this next week or later, as I am currently occupied with some other things.
Thx. FWIW, there were some other corruption bugs related to netfs, one of them [1] was recently solved by c26096ee0278c5 ("mm: Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()"); this is a v6.11-rc6-post commit. Makes me wonder if retesting with latest mainline might be wise; OTOH I assume David might have mentioned this if he suspects it might be related, so maybe I'm just confusing everything with this message. Sending it nevertheless, but if that turns out to be the case: sorry!
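(For anyone retesting: one quick way to check whether a checked-out kernel tree already contains that fix is something like the sketch below; it assumes a mainline clone with the commit present in history.)
```
# Exit code 0 means c26096ee0278c5 is an ancestor of the currently checked-out commit.
git merge-base --is-ancestor c26096ee0278c5 HEAD \
    && echo "fix already included" \
    || echo "fix not included"
```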
Ciao, Thorsten
[1] like this https://lore.kernel.org/all/pv2lcjhveti4sfua95o0u6r4i73r39srra@sonic.net/
"Linux regression tracking (Thorsten Leemhuis)" wrote:
Thx. FWIW, there were some other corruption bugs related to netfs, one of them [1] was recently solved by c26096ee0278c5 ("mm: Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()"); this is a v6.11-rc6-post commit.
filemap_invalidate_inode() is not used directly by ceph yet, and ceph doesn't yet use the netfs DIO that would also use that.
David
On 9/6/24 13:09, Xiubo Li wrote:
Sure, thanks in advance.
I will work on this next week or later, as I am currently occupied with some other things.
There is some further information I can provide regarding this: further testing with the reproducer on the current mainline kernel shows that the issue might be fixed.
Bisection of the possible fix points to ee4cdf7b ("netfs: Speed up buffered reading").
Could this additional information help to narrow down the part that fixes the CephFS issue, so that the fix can be backported to current stable?
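(For reference, bisecting for the commit that fixes an issue can be done by swapping the bisect terms; a sketch, with the range endpoints assumed from the versions mentioned in this thread:)
```
# Custom terms: "broken" marks kernels that still corrupt, "fixed" marks kernels that do not.
git bisect start --term-old=broken --term-new=fixed
git bisect broken v6.11-rc6          # still shows the corruption
git bisect fixed master              # current mainline reads the file correctly
# Build, boot, run the reproducer, then mark each step:
git bisect fixed                     # or: git bisect broken
# Repeat until git names the first fixed commit (here: ee4cdf7b).
```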
Regards, Christian Ebner
I can report some progress on this issue: I attached a possible fix for current stable as a patch to the issue at https://bugzilla.kernel.org/show_bug.cgi?id=219237
Is this the correct approach for a fix?
In the requested netfs traces I noticed the tail clear flag being set for the subrequest that leads to the corrupt all-zero contents being read.
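(In case it helps others look at the same traces: the netfs tracepoints can be enabled roughly as sketched below, assuming the kernel was built with netfs trace events and tracefs is mounted at the usual place.)
```
cd /sys/kernel/tracing
echo 1 > events/netfs/enable          # enable the netfs tracepoints, including the subrequest ones
echo 1 > tracing_on
dd if=/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso of=/dev/null bs=8M count=1
echo 0 > tracing_on
grep netfs_sreq trace                 # the subrequest entries carry the per-subreq flags
```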
Best regards, Christian Ebner
Christian Ebner c.ebner@proxmox.com wrote:
some of our customers (Proxmox VE) are seeing issues with file corruption when accessing contents located on CephFS via the in-kernel Ceph client [0,1]. We managed to reproduce this regression on kernels up to the latest 6.11-rc6. Accessing the same content on the CephFS using the FUSE client or the in-kernel Ceph client with older kernels (Ubuntu kernel based on v6.5) does not show file corruption.
Are they using local caching with cachefiles?
David
On 9/6/24 18:22, David Howells wrote:
Are they using local caching with cachefiles?
David
Hi David,
if you are referring to [0], then no, there is no such caching layer active.
Output of:
```
$ cat /proc/fs/fscache/{caches,cookies,requests,stats,volumes}
CACHE    REF   VOLS  OBJS  ACCES S NAME
======== ===== ===== ===== ===== = ===============
COOKIE   VOLUME   REF ACT ACC S FL DEF
======== ======== === === === = == ================
REQUEST  OR REF FL ERR  OPS COVERAGE
======== == === == ==== === =========
Netfs  : DR=0 RA=140 RF=0 WB=0 WBZ=0
Netfs  : BW=0 WT=0 DW=0 WP=0
Netfs  : ZR=0 sh=0 sk=0
Netfs  : DL=548 ds=548 df=0 di=0
Netfs  : RD=0 rs=0 rf=0
Netfs  : UL=0 us=0 uf=0
Netfs  : WR=0 ws=0 wf=0
Netfs  : rr=0 sr=0 wsc=0
-- FS-Cache statistics --
Cookies: n=0 v=0 vcol=0 voom=0
Acquire: n=0 ok=0 oom=0
LRU    : n=0 exp=0 rmv=0 drp=0 at=0
Invals : n=0
Updates: n=0 rsz=0 rsn=0
Relinqs: n=0 rtr=0 drop=0
NoSpace: nwr=0 ncr=0 cull=0
IO     : rd=0 wr=0 mis=0
VOLUME   REF   nCOOK ACC FL CACHE           KEY
======== ===== ===== === == =============== ================
```
Also, disabling caching by setting `client_cache_size` to 0 and `client_oc` to false as found in [1] did not change the corrupted read behavior.
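(For reference, those settings were applied roughly as in the sketch below; the exact commands are an assumption based on the option names from [1], and the client was remounted afterwards.)
```
# Disable the Ceph client object cacher and its cache size for all clients.
ceph config set client client_oc false
ceph config set client client_cache_size 0
```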
[0] https://www.kernel.org/doc/html/latest/filesystems/caching/fscache.html
[1] https://docs.ceph.com/en/latest/cephfs/client-config-ref/#client-config-refe...
Regards, Chris