From: Xiubo Li xiubli@redhat.com
When the MClientSnap request's op is not CEPH_SNAP_OP_SPLIT, the request may still contain a list of 'split_realms', and we need to skip it anyway. Otherwise it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org
Cc: Frank Schilder frans@dtu.dk
Reported-by: Frank Schilder frans@dtu.dk
URL: https://tracker.ceph.com/issues/61200
Signed-off-by: Xiubo Li xiubli@redhat.com
---
 fs/ceph/snap.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 0e59e95a96d9..d95dfe16b624 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
 				continue;
 			adjust_snap_realm_parent(mdsc, child, realm->ino);
 		}
+	} else {
+		p += sizeof(u64) * num_split_inos;
+		p += sizeof(u64) * num_split_realms;
 	}
 
 	/*
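For readers who don't have the file handy, here is a rough sketch of the decode flow in ceph_handle_snap() with the new else branch applied. It is simplified from fs/ceph/snap.c (error handling, locking and the realm bookkeeping are elided), so treat it as an illustration of where the skip lands rather than the exact upstream code:

	/* simplified decode flow of ceph_handle_snap(); not the exact code */
	void *p = msg->front.iov_base;
	void *e = p + msg->front.iov_len;
	struct ceph_mds_snap_head *h = p;
	__le64 *split_inos, *split_realms;
	int op = le32_to_cpu(h->op);
	int num_split_inos = le32_to_cpu(h->num_split_inos);
	int num_split_realms = le32_to_cpu(h->num_split_realms);

	p += sizeof(*h);

	if (op == CEPH_SNAP_OP_SPLIT) {
		/* the split path consumes both arrays while handling them */
		split_inos = p;
		p += sizeof(u64) * num_split_inos;
		split_realms = p;
		p += sizeof(u64) * num_split_realms;
		/* ... move inodes and child realms over to the new realm ... */
	} else {
		/* the arrays may still be on the wire (e.g. for _UPDATE), skip them */
		p += sizeof(u64) * num_split_inos;
		p += sizeof(u64) * num_split_realms;
	}

	/* either way p now points at the snap trace blob */
	ceph_update_snap_trace(mdsc, p, e, op == CEPH_SNAP_OP_DESTROY, NULL);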
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
Thanks,
Ilya
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remember we hit this before in one cu BZ last year, but I couldn't remember exactly which one. And I am not sure whether @Jeff saw this before I joined the ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While on the MDS side, the _UPDATE op didn't send the 'split_realm' list before this commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4
Author: Yan, Zheng <zyan@redhat.com>
Date:   Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery
sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms()
Signed-off-by: "Yan, Zheng" zyan@redhat.com
Before this commit it would only send the 'split_realm' list for the _SPLIT op.
The following is the snaptrace:
[Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 00 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 00 00 00 00 00 01 00 00  ................
[Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 93                                   ...`.
[Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................  <<== The op is 0, which is 'CEPH_SNAP_OP_UPDATE'
[Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 d1 c0 71 38 00 01 00 00  ..........q8....  <<== The '0c' is the split_realm number
[Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 d7 c7 71 38 00 01 00 00  ".q8......q8....  <<== All the 'q8' are the ino#
[Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 d4 c7 71 38 00 01 00 00  ..q8......q8....
[Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 d4 c0 71 38 00 01 00 00  ..q8......q8....
[Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 1d c8 71 38 00 01 00 00  .q8......q8....
[Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 d6 c0 71 38 00 01 00 00  ..q8......q8....
[Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 6a 11 2d 1a 00 01 00 00  ..q8....j.-.....
[Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 02 09 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 01 09 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 fd 08 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 f9 08 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 00 00 00 00 42 06 63 61  .9..........B.ca
[Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d 05                                   {K]-.
And if, by coincidence, the split_realm number equals sizeof(ceph_mds_snap_realm) plus the extra snap buffer size, the above 'corrupted' snaptrace will be parsed by the kclient too and the kclient won't give any warning, but it will corrupt the snaprealm and capsnap info in the kclient.
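To make that concrete, here is my reading of the dumped front against the struct ceph_mds_snap_head layout from the upstream headers (offsets are hex into 'front'; please double-check this against the actual structs):

	/*
	 * struct ceph_mds_snap_head is 24 (0x18) bytes:
	 *   0x00  op               = 0x00000000  -> CEPH_SNAP_OP_UPDATE
	 *   0x04  split            = 0
	 *   0x0c  num_split_inos   = 0
	 *   0x10  num_split_realms = 0x0000000c  -> 12 entries
	 *   0x14  trace_len        = 0x00000088  -> 136 bytes of snap trace
	 *   0x18  split_realms[12]               -> 12 * 8 = 96 bytes of ino numbers
	 *   0x78  the real snap trace            -> 24 + 96 + 136 = 256 bytes of front
	 *
	 * Without the skip, ceph_update_snap_trace() starts decoding at 0x18
	 * and treats the 96 bytes of ino numbers as a ceph_mds_snap_realm,
	 * which is normally rejected as a corrupt snaptrace -- but if the
	 * sizes happen to line up as described above it is accepted silently
	 * and corrupts the snaprealm/capsnap state.
	 */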
Thanks
- Xiubo
Thanks,
Ilya
On 5/17/23 19:04, Xiubo Li wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); } + } else { + p += sizeof(u64) * num_split_inos; + p += sizeof(u64) * num_split_realms; }
/*
2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
@Venky,
Do you remember which one? As I remember, this is why we previously fixed the snaptrace issue by blocking all the IOs and at the same time blocklisting the kclient.
Before that fix the kclient wouldn't dump the corrupted msg, so we didn't know what was wrong with it; we also added code to dump the msg as part of that fix.
Thanks
- Xiubo
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery
sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms()
Signed-off-by: "Yan, Zheng" zyan@redhat.com
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
The following the snaptrace:
[Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 00 00 00 00 00 01 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 93 ...`. [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is 'CEPH_SNAP_OP_UPDATE' [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is the split_realm number [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 d7 c7 71 38 00 01 00 00 ".q8......q8.... <<== All the 'q8' are the ino# [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 d4 c7 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 d4 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 1d c8 71 38 00 01 00 00 .q8......q8.... [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 d6 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 02 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 01 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 fd 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 f9 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 00 00 00 00 42 06 63 61 .9..........B.ca [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d 05 {K]-.
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
Thanks
- Xiubo
Thanks,
Ilya
Hey Xiubo,
On Wed, May 17, 2023 at 4:45 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 19:04, Xiubo Li wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
@Venky,
Do you remember which one ? As I remembered this is why we fixed the snaptrace issue by blocking all the IOs and at the same time blocklisting the kclient before.
Before the kcleint won't dump the corrupted msg and we don't know what was wrong with the msg and also we added code to dump the msg in the above fix.
The "corrupted" snaptrace issue happened just after the mds asserted hitting the metadata corruption (dentry first corrupted) and it _seemed_ that this corruption somehow triggered a corrupted snaptrace to be sent to the client.
Thanks
- Xiubo
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
The following the snaptrace:
[Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 00 00 00 00 00 01 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 93 ...`. [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is 'CEPH_SNAP_OP_UPDATE' [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is the split_realm number [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 d7 c7 71 38 00 01 00 00 ".q8......q8.... <<== All the 'q8' are the ino# [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 d4 c7 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 d4 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 1d c8 71 38 00 01 00 00 .q8......q8.... [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 d6 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 02 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 01 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 fd 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 f9 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 00 00 00 00 42 06 63 61 .9..........B.ca [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d 05 {K]-.
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
Thanks
- Xiubo
Thanks,
Ilya
On Wed, May 17, 2023 at 6:18 PM Venky Shankar vshankar@redhat.com wrote:
Hey Xiubo,
On Wed, May 17, 2023 at 4:45 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 19:04, Xiubo Li wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
@Venky,
Do you remember which one ? As I remembered this is why we fixed the snaptrace issue by blocking all the IOs and at the same time blocklisting the kclient before.
Before the kcleint won't dump the corrupted msg and we don't know what was wrong with the msg and also we added code to dump the msg in the above fix.
The "corrupted" snaptrace issue happened just after the mds asserted hitting the metadata corruption (dentry first corrupted) and it _seemed_ that this corruption somehow triggered a corrupted snaptrace to be sent to the client.
[sent message a bit early - contd...]
But I think you found the issue - the message dump did help, and it's not related to the dentry first corruption.
Thanks
- Xiubo
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
The following the snaptrace:
[Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 00 00 00 00 00 01 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 93 ...`. [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is 'CEPH_SNAP_OP_UPDATE' [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is the split_realm number [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 d7 c7 71 38 00 01 00 00 ".q8......q8.... <<== All the 'q8' are the ino# [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 d4 c7 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 d4 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 1d c8 71 38 00 01 00 00 .q8......q8.... [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 d6 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 02 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 01 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 fd 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 f9 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 00 00 00 00 42 06 63 61 .9..........B.ca [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d 05 {K]-.
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
Thanks
- Xiubo
Thanks,
Ilya
-- Cheers, Venky
On 5/17/23 20:53, Venky Shankar wrote:
On Wed, May 17, 2023 at 6:18 PM Venky Shankar vshankar@redhat.com wrote:
Hey Xiubo,
On Wed, May 17, 2023 at 4:45 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 19:04, Xiubo Li wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
@Venky,
Do you remember which one ? As I remembered this is why we fixed the snaptrace issue by blocking all the IOs and at the same time blocklisting the kclient before.
Before the kcleint won't dump the corrupted msg and we don't know what was wrong with the msg and also we added code to dump the msg in the above fix.
The "corrupted" snaptrace issue happened just after the mds asserted hitting the metadata corruption (dentry first corrupted) and it _seemed_ that this corruption somehow triggered a corrupted snaptrace to be sent to the client.
[sent message a bit early - cotd...]
But I think you found the issue - the message dump did help and its not related to the dentry first corruption.
Yeah, this one is not related to dentry first corruption.
@Ilya
I have created a ceph tracker and PR to fix it on the MDS side: https://tracker.ceph.com/issues/61217 and https://github.com/ceph/ceph/pull/51536.
Let's keep this kclient patch to stay compatible with the old ceph releases. At least then users could fix this issue by updating only the kclient node.
Thanks
- Xiubo
Thanks
- Xiubo
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
The following the snaptrace:
[Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 00 00 00 00 00 01 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 93 ...`. [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is 'CEPH_SNAP_OP_UPDATE' [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is the split_realm number [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 d7 c7 71 38 00 01 00 00 ".q8......q8.... <<== All the 'q8' are the ino# [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 d4 c7 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 d4 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 1d c8 71 38 00 01 00 00 .q8......q8.... [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 d6 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 02 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 01 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 fd 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 f9 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 00 00 00 00 42 06 63 61 .9..........B.ca [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d 05 {K]-.
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
Thanks
- Xiubo
Thanks,
Ilya
-- Cheers, Venky
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
Thanks,
Ilya
On 5/17/23 19:29, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
But we still need this patch to make it work with the old ceph releases.
Thanks
- Xiubo
Thanks,
Ilya
On Wed, May 17, 2023 at 2:46 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 19:29, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
But we still need this patch to make it to work with the old ceph releases.
This is debatable:
- given that no one noticed this for so long, the likelihood of MDS sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and split_realms arrays is very low
- MDS side fix would be backported to supported Ceph releases
- people who are running unsupported Ceph releases (i.e. aren't updating) are unlikely to be diligently updating their kernel clients
Thanks,
Ilya
On 5/17/23 21:11, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 2:46 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 19:29, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
But we still need this patch to make it to work with the old ceph releases.
This is debatable:
given that no one noticed this for so long, the likelihood of MDS sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and split_realms arrays is very low
MDS side fix would be backported to supported Ceph releases
people who are running unsupported Ceph releases (i.e. aren't updating) are unlikely to be diligently updating their kernel clients
Yeah. Though IMO upgrading the kernel is usually safer and easier than upgrading the whole ceph cluster, and upgrading the ceph cluster may cause the fs metadata corruption issue, which we have hit many times from CUs and the ceph-users mailing list.
I will leave this one to you. If it still doesn't make sense I will drop this patch.
Thanks
- Xiubo
Thanks,
Ilya
On 5/17/23 21:11, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 2:46 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 19:29, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
But we still need this patch to make it to work with the old ceph releases.
This is debatable:
given that no one noticed this for so long, the likelihood of MDS sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and split_realms arrays is very low
MDS side fix would be backported to supported Ceph releases
people who are running unsupported Ceph releases (i.e. aren't updating) are unlikely to be diligently updating their kernel clients
I just searched the ceph tracker and found another 3 trackers with the same issue:
https://tracker.ceph.com/issues/57817 https://tracker.ceph.com/issues/57703 https://tracker.ceph.com/issues/57686
So counting this time and the previous CU case:
https://www.spinics.net/lists/ceph-users/msg77106.html
I have seen this at least 5 times.
All of these were reproduced when doing MDS failover, and this is the root cause on the MDS side.
Thanks
- Xiubo
Thanks,
Ilya
On Wed, May 17, 2023 at 3:33 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 21:11, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 2:46 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 19:29, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote: > From: Xiubo Li xiubli@redhat.com > > When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > request may still contain a list of 'split_realms', and we need > to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > Cc: stable@vger.kernel.org > Cc: Frank Schilder frans@dtu.dk > Reported-by: Frank Schilder frans@dtu.dk > URL: https://tracker.ceph.com/issues/61200 > Signed-off-by: Xiubo Li xiubli@redhat.com > --- > fs/ceph/snap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > index 0e59e95a96d9..d95dfe16b624 100644 > --- a/fs/ceph/snap.c > +++ b/fs/ceph/snap.c > @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > continue; > adjust_snap_realm_parent(mdsc, child, realm->ino); > } > + } else { > + p += sizeof(u64) * num_split_inos; > + p += sizeof(u64) * num_split_realms; > } > > /* > -- > 2.40.1 > Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
But we still need this patch to make it to work with the old ceph releases.
This is debatable:
given that no one noticed this for so long, the likelihood of MDS sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and split_realms arrays is very low
MDS side fix would be backported to supported Ceph releases
people who are running unsupported Ceph releases (i.e. aren't updating) are unlikely to be diligently updating their kernel clients
Just searched the ceph tracker and I found another 3 trackers have the same issue:
https://tracker.ceph.com/issues/57817 https://tracker.ceph.com/issues/57703 https://tracker.ceph.com/issues/57686
So plusing this time and the previous CU case:
https://www.spinics.net/lists/ceph-users/msg77106.html
I have seen at least 5 times.
All this are reproduced when doing MDS failover, and this is the root cause in MDS side.
OK, given that the fixup in the kernel client is small, it seems justified. But, please, add a comment in the new else branch saying that it's there only to work around a bug in the MDS.
Thanks,
Ilya
On 5/17/23 21:56, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 3:33 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 21:11, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 2:46 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 19:29, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote: > On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote: >> From: Xiubo Li xiubli@redhat.com >> >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >> request may still contain a list of 'split_realms', and we need >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >> >> Cc: stable@vger.kernel.org >> Cc: Frank Schilder frans@dtu.dk >> Reported-by: Frank Schilder frans@dtu.dk >> URL: https://tracker.ceph.com/issues/61200 >> Signed-off-by: Xiubo Li xiubli@redhat.com >> --- >> fs/ceph/snap.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >> index 0e59e95a96d9..d95dfe16b624 100644 >> --- a/fs/ceph/snap.c >> +++ b/fs/ceph/snap.c >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >> continue; >> adjust_snap_realm_parent(mdsc, child, realm->ino); >> } >> + } else { >> + p += sizeof(u64) * num_split_inos; >> + p += sizeof(u64) * num_split_realms; >> } >> >> /* >> -- >> 2.40.1 >> > Hi Xiubo, > > This code appears to be very old -- it goes back to the initial commit > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > explanation for why this popped up only now? As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
> Has MDS always been including split_inos and split_realms arrays in > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > change, I'd argue that this needs to be addressed on the MDS side. While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
But we still need this patch to make it to work with the old ceph releases.
This is debatable:
given that no one noticed this for so long, the likelihood of MDS sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and split_realms arrays is very low
MDS side fix would be backported to supported Ceph releases
people who are running unsupported Ceph releases (i.e. aren't updating) are unlikely to be diligently updating their kernel clients
Just searched the ceph tracker and I found another 3 trackers have the same issue:
https://tracker.ceph.com/issues/57817 https://tracker.ceph.com/issues/57703 https://tracker.ceph.com/issues/57686
So plusing this time and the previous CU case:
https://www.spinics.net/lists/ceph-users/msg77106.html
I have seen at least 5 times.
All this are reproduced when doing MDS failover, and this is the root cause in MDS side.
OK, given that the fixup in the kernel client is small, it seems justified. But, please, add a comment in the new else branch saying that it's there only to work around a bug in the MDS.
Sure. Will update it in next version.
Thanks
- Xiubo
Thanks,
Ilya
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides?
On Wed, May 17, 2023 at 4:07 AM Xiubo Li xiubli@redhat.com wrote:
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these "unexpected" SPLIT ops show up. (Which leads me to assume Zheng checked that the referenced server patch would work, and he was mostly right.)
So what logic makes it so "split_realm == sizeof(ceph_mds_snap_realm) + <extra snap buffer size>" is the condition which leads to breakage, and allows everything else to work? -Greg
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides?
Well, if I didn't speak up chances are we wouldn't have identified the regression in the MDS at all. People seem to have this perception that the client is somehow "easier" to fix, assume that the server is always doing the right thing and default to patching the client. I'm just trying to push back on that.
In this particular case, after understanding the scope of the issue _and_ getting a committal for the MDS side fix, I approved taking the kernel client patch in an earlier reply.
On Wed, May 17, 2023 at 4:07 AM Xiubo Li xiubli@redhat.com wrote:
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these
No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" branch.
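With the patch applied, the control flow in ceph_handle_snap() becomes roughly (heavily trimmed):

        if (op == CEPH_SNAP_OP_SPLIT) {
                /* existing code: walk split_inos / split_realms and
                 * move inodes and child realms into the new realm */
        } else {
                /* new: skip past both arrays for every other op */
                p += sizeof(u64) * num_split_inos;
                p += sizeof(u64) * num_split_realms;
        }

        /* either way, "p" now points at the snap trace */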
Thanks,
Ilya
On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides?
Well, if I didn't speak up chances are we wouldn't have identified the regression in the MDS at all. People seem to have this perception that the client is somehow "easier" to fix, assume that the server is always doing the right thing and default to patching the client. I'm just trying to push back on that.
In this particular case, after understanding the scope of the issue _and_ getting a committal for the MDS side fix, I approved taking the kernel client patch in an earlier reply.
On Wed, May 17, 2023 at 4:07 AM Xiubo Li xiubli@redhat.com wrote:
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these
No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" branch.
Oh I mis-parsed the braces/spacing there.
I'm still not getting how the precise size is causing the problem — obviously this isn't an unheard-of category of issue, but the fact that it works until the count matches a magic number is odd. Is that ceph_decode_need macro being called from ceph_update_snap_trace just skipping over the split data somehow? *puzzled* -Greg
Thanks,
Ilya
Just to be clear, I'd like the details here so we can see if there are ways to prevent similar issues in the future, which I haven't heard anybody talk about. :)
On Wed, May 17, 2023 at 8:03 AM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote: > From: Xiubo Li xiubli@redhat.com > > When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > request may still contain a list of 'split_realms', and we need > to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > Cc: stable@vger.kernel.org > Cc: Frank Schilder frans@dtu.dk > Reported-by: Frank Schilder frans@dtu.dk > URL: https://tracker.ceph.com/issues/61200 > Signed-off-by: Xiubo Li xiubli@redhat.com > --- > fs/ceph/snap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > index 0e59e95a96d9..d95dfe16b624 100644 > --- a/fs/ceph/snap.c > +++ b/fs/ceph/snap.c > @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > continue; > adjust_snap_realm_parent(mdsc, child, realm->ino); > } > + } else { > + p += sizeof(u64) * num_split_inos; > + p += sizeof(u64) * num_split_realms; > } > > /* > -- > 2.40.1 > Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides?
Well, if I didn't speak up chances are we wouldn't have identified the regression in the MDS at all. People seem to have this perception that the client is somehow "easier" to fix, assume that the server is always doing the right thing and default to patching the client. I'm just trying to push back on that.
In this particular case, after understanding the scope of the issue _and_ getting a committal for the MDS side fix, I approved taking the kernel client patch in an earlier reply.
On Wed, May 17, 2023 at 4:07 AM Xiubo Li xiubli@redhat.com wrote:
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these
No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" branch.
Oh I mis-parsed the braces/spacing there.
I'm still not getting how the precise size is causing the problem — obviously this isn't an unheard-of category of issue, but the fact that it works until the count matches a magic number is odd. Is that ceph_decode_need macro being called from ceph_update_snap_trace just skipping over the split data somehow? *puzzled* -Greg
Thanks,
Ilya
Hi all,
joining in here as the one who is hit pretty badly and also unable to upgrade Ceph any time soon to a version receiving patches.
For these two reasons alone I strongly favour fixing both sides.
Extra reading, feel free to skip.
Additional reasons for fixing both sides are (1) to have more error tolerant code - if one side breaks/regresses the other side still knows what to do and can report back while moving on without a fatal crash and (2) to help users of old clusters who are affected without noticing yet. Every now and then one should afford to be nice.
I personally think that (1) is generally good practice: explicitly handling seemingly unexpected cases increases overall robustness (it's a bit like raiding up code to catch code rot) and will highlight otherwise unnoticed issues early in testing. It is not the first time our installation was hit by an unnecessarily missing catch-all clause that triggered an assert or follow-up crash for no real reason.
The main reason we actually discovered this is that under certain rare circumstances it makes a server with a kclient mount freeze. There is some kind of follow-up condition that is triggered only under heavy load and almost certainly only at a time when a snapshot is taken. Hence, it is entirely possible that many if not all users have these invalid snaptrace messages on their systems, but nothing else happens so they don't report anything.
The hallmark in our case is a hanging client caps recall that eventually leads to a spontaneous restart of the affected MDS and then we end up with either a frozen server or a stale file handle at the ceph mount point. Others might not encounter these conditions simultaneously on their system as often as we do.
Apart from that, it's not even certain that this is the core issue causing all the trouble on our system. Having the kclient fixed would allow us to verify that we don't have yet another problem that should be looked at before considering a Ceph upgrade - extra reason no. 3.
I hope this was a useful point of view from someone suffering from the condition.
Best regards and thanks for your efforts addressing this!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Thanks Frank for your feedback.
On 5/17/23 23:39, Frank Schilder wrote:
Hi all,
joining in here as the one who is hit pretty badly and also not being able to upgrade ceph any time soon to a version receiving patches.
For these two reasons alone I strongly favour fixing both sides.
Extra reading, feel free to skip.
Additional reasons for fixing both sides are (1) to have more error tolerant code - if one side breaks/regresses the other side still knows what to do and can report back while moving on without a fatal crash and (2) to help users of old clusters who are affected without noticing yet. Every now and then one should afford to be nice.
I personally think that (1) is generally good practice, explicitly handling seemingly unexpected cases increases overall robustness (its a bit like raiding up code to catch code rot) and will highlight otherwise unnoticed issues early in testing. It is not the first time our installation was hit by an unnecessarily missing catch-all clause that triggered an assert or follow-up crash for no real reason.
The main reason we actually discovered this is that under certain rare circumstances it makes a server with a kclient mount freeze. There is some kind of follow-up condition that is triggered only under heavy load and almost certainly only at a time when a snapshot is taken. Hence, it is very well possible that many if not all users have these invalid snaptrace message on their system, but nothing else happens so they don't report anything.
The hallmark in our case is a hanging client caps recall that eventually leads to a spontaneous restart of the affected MDS and then we end up with either a frozen server or a stale file handle at the ceph mount point. Others might not encounter these conditions simultaneously on their system as often as we do.
Apart from that, its not even sure that this is the core issue causing all the trouble on our system. Having the kclient fixed would allow us to verify that we don't have yet another problem that should be looked at before considering a ceph upgrade - extra reason no. 3.
I hope this was a useful point of view from someone suffering from the condition.
Best regards and thanks for your efforts addressing this!
Frank Schilder AIT Risø Campus Bygning 109, rum S14
On 5/17/23 23:04, Gregory Farnum wrote:
Just to be clear, I'd like the details here so we can see if there are ways to prevent similar issues in the future, which I haven't heard anybody talk about. :)
We have hit a similar issue with Venky in https://github.com/ceph/ceph/pull/48382. That PR just added two extra members, and the old kclient couldn't recognize them and incorrectly parsed them as a new snaptrace.
Venky fixed it by checking the peer client's feature bit before sending the messages.
Thanks
- Xiubo
On Wed, May 17, 2023 at 8:03 AM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote: > On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote: >> From: Xiubo Li xiubli@redhat.com >> >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >> request may still contain a list of 'split_realms', and we need >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >> >> Cc: stable@vger.kernel.org >> Cc: Frank Schilder frans@dtu.dk >> Reported-by: Frank Schilder frans@dtu.dk >> URL: https://tracker.ceph.com/issues/61200 >> Signed-off-by: Xiubo Li xiubli@redhat.com >> --- >> fs/ceph/snap.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >> index 0e59e95a96d9..d95dfe16b624 100644 >> --- a/fs/ceph/snap.c >> +++ b/fs/ceph/snap.c >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >> continue; >> adjust_snap_realm_parent(mdsc, child, realm->ino); >> } >> + } else { >> + p += sizeof(u64) * num_split_inos; >> + p += sizeof(u64) * num_split_realms; >> } >> >> /* >> -- >> 2.40.1 >> > Hi Xiubo, > > This code appears to be very old -- it goes back to the initial commit > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > explanation for why this popped up only now? As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
> Has MDS always been including split_inos and split_realms arrays in > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > change, I'd argue that this needs to be addressed on the MDS side. While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides?
Well, if I didn't speak up chances are we wouldn't have identified the regression in the MDS at all. People seem to have this perception that the client is somehow "easier" to fix, assume that the server is always doing the right thing and default to patching the client. I'm just trying to push back on that.
In this particular case, after understanding the scope of the issue _and_ getting a committal for the MDS side fix, I approved taking the kernel client patch in an earlier reply.
On Wed, May 17, 2023 at 4:07 AM Xiubo Li xiubli@redhat.com wrote:
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these
No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" branch.
Oh I mis-parsed the braces/spacing there.
I'm still not getting how the precise size is causing the problem — obviously this isn't an unheard-of category of issue, but the fact that it works until the count matches a magic number is odd. Is that ceph_decode_need macro being called from ceph_update_snap_trace just skipping over the split data somehow? *puzzled* -Greg
Thanks,
Ilya
On Wed, May 17, 2023 at 5:03 PM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote: > From: Xiubo Li xiubli@redhat.com > > When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > request may still contain a list of 'split_realms', and we need > to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > Cc: stable@vger.kernel.org > Cc: Frank Schilder frans@dtu.dk > Reported-by: Frank Schilder frans@dtu.dk > URL: https://tracker.ceph.com/issues/61200 > Signed-off-by: Xiubo Li xiubli@redhat.com > --- > fs/ceph/snap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > index 0e59e95a96d9..d95dfe16b624 100644 > --- a/fs/ceph/snap.c > +++ b/fs/ceph/snap.c > @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > continue; > adjust_snap_realm_parent(mdsc, child, realm->ino); > } > + } else { > + p += sizeof(u64) * num_split_inos; > + p += sizeof(u64) * num_split_realms; > } > > /* > -- > 2.40.1 > Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides?
Well, if I didn't speak up chances are we wouldn't have identified the regression in the MDS at all. People seem to have this perception that the client is somehow "easier" to fix, assume that the server is always doing the right thing and default to patching the client. I'm just trying to push back on that.
In this particular case, after understanding the scope of the issue _and_ getting a committal for the MDS side fix, I approved taking the kernel client patch in an earlier reply.
On Wed, May 17, 2023 at 4:07 AM Xiubo Li xiubli@redhat.com wrote:
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these
No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" branch.
Oh I mis-parsed the braces/spacing there.
I'm still not getting how the precise size is causing the problem — obviously this isn't an unheard-of category of issue, but the fact that it works until the count matches a magic number is odd. Is that ceph_decode_need macro being called from ceph_update_snap_trace just skipping over the split data somehow? *puzzled*
I'm not sure what count or magic number you are referring to.
AFAIU the issue is that ceph_update_snap_trace() expects to be called with "p" pointing at ceph_mds_snap_realm encoding. In the case that is being fixed it gets called with "p" pointing at "split_inos" array (and "split_realms" array following that). The exact check that throws it off (whether ceph_decode_need() or something else) is irrelevant -- we are already toasted by then. In the worst case no check could fire and ceph_update_snap_trace() could interpret the bogus array data as a valid ceph_mds_snap_realm with who knows what consequences...
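For reference, the payload layout as the kernel client sees it (paraphrased from struct ceph_mds_snap_head in include/linux/ceph/ceph_fs.h):

        struct ceph_mds_snap_head {
                __le32 op;                /* CEPH_SNAP_OP_* */
                __le64 split;             /* ino to split off, if any */
                __le32 num_split_inos;
                __le32 num_split_realms;
                __le32 trace_len;         /* size of snap trace blob */
        } __attribute__ ((packed));
        /*
         * ...followed by:
         *   u64 split_inos[num_split_inos];
         *   u64 split_realms[num_split_realms];
         *   snap trace blob (a sequence of ceph_mds_snap_realm records,
         *   each followed by its snapid arrays)
         *
         * Without the skip, for a non-SPLIT op "p" ends up at
         * split_inos[0] instead of the first ceph_mds_snap_realm.
         */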
Thanks,
Ilya
On 5/17/23 23:03, Gregory Farnum wrote:
On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote: > From: Xiubo Li xiubli@redhat.com > > When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > request may still contain a list of 'split_realms', and we need > to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > Cc: stable@vger.kernel.org > Cc: Frank Schilder frans@dtu.dk > Reported-by: Frank Schilder frans@dtu.dk > URL: https://tracker.ceph.com/issues/61200 > Signed-off-by: Xiubo Li xiubli@redhat.com > --- > fs/ceph/snap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > index 0e59e95a96d9..d95dfe16b624 100644 > --- a/fs/ceph/snap.c > +++ b/fs/ceph/snap.c > @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > continue; > adjust_snap_realm_parent(mdsc, child, realm->ino); > } > + } else { > + p += sizeof(u64) * num_split_inos; > + p += sizeof(u64) * num_split_realms; > } > > /* > -- > 2.40.1 > Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides?
Well, if I didn't speak up chances are we wouldn't have identified the regression in the MDS at all. People seem to have this perception that the client is somehow "easier" to fix, assume that the server is always doing the right thing and default to patching the client. I'm just trying to push back on that.
In this particular case, after understanding the scope of the issue _and_ getting a committal for the MDS side fix, I approved taking the kernel client patch in an earlier reply.
On Wed, May 17, 2023 at 4:07 AM Xiubo Li xiubli@redhat.com wrote:
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these
No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" branch.
Oh I mis-parsed the braces/spacing there.
I'm still not getting how the precise size is causing the problem — obviously this isn't an unheard-of category of issue, but the fact that it works until the count matches a magic number is odd. Is that ceph_decode_need macro being called from ceph_update_snap_trace just skipping over the split data somehow? *puzzled*
Yeah, it's called, and this is where the corrupted snaptrace is reported.
ceph_update_snap_trace() tries to parse the snap traces in a loop, and if there is still extra data left it tries to parse the remaining bytes as a new snap trace; but the leftover data is smaller than a valid snap trace record, so it gets reported as a bad message.
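Simplified (not the exact kernel code), the walk looks something like:

        while (p < e) {
                struct ceph_mds_snap_realm *ri;

                /* this is the check that trips the "corrupt snaptrace"
                 * error when fewer bytes than a realm record remain */
                ceph_decode_need(&p, e, sizeof(*ri), bad);
                ri = p;
                p += sizeof(*ri);

                /* each realm record is followed by two snapid arrays */
                ceph_decode_need(&p, e, sizeof(u64) *
                                 (le32_to_cpu(ri->num_snaps) +
                                  le32_to_cpu(ri->num_prior_parent_snaps)),
                                 bad);
                p += sizeof(u64) * le32_to_cpu(ri->num_snaps);
                p += sizeof(u64) * le32_to_cpu(ri->num_prior_parent_snaps);

                /* ... look up or create the realm and update it ... */
        }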
Thanks
- Xiubo
-Greg
Thanks,
Ilya
On 5/17/23 22:27, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum gfarnum@redhat.com wrote:
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov idryomov@gmail.com wrote:
On Wed, May 17, 2023 at 1:04 PM Xiubo Li xiubli@redhat.com wrote:
On 5/17/23 18:31, Ilya Dryomov wrote:
On Wed, May 17, 2023 at 7:24 AM xiubli@redhat.com wrote:
From: Xiubo Li xiubli@redhat.com
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org Cc: Frank Schilder frans@dtu.dk Reported-by: Frank Schilder frans@dtu.dk URL: https://tracker.ceph.com/issues/61200 Signed-off-by: Xiubo Li xiubli@redhat.com
fs/ceph/snap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 0e59e95a96d9..d95dfe16b624 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, continue; adjust_snap_realm_parent(mdsc, child, realm->ino); }
} else {
p += sizeof(u64) * num_split_inos;
p += sizeof(u64) * num_split_realms; } /*
-- 2.40.1
Hi Xiubo,
This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now?
As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team.
Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side.
While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017:
commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng zyan@redhat.com Date: Fri Jul 21 21:40:46 2017 +0800
mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Before this commit it will only send the 'split_realm' list for the _SPLIT op.
It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire.
I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides?
Well, if I didn't speak up chances are we wouldn't have identified the regression in the MDS at all. People seem to have this perception that the client is somehow "easier" to fix, assume that the server is always doing the right thing and default to patching the client. I'm just trying to push back on that.
In this particular case, after understanding the scope of the issue _and_ getting a committal for the MDS side fix, I approved taking the kernel client patch in an earlier reply.
On Wed, May 17, 2023 at 4:07 AM Xiubo Li xiubli@redhat.com wrote:
And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient.
I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these
No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" branch.
Yeah, correct.
The problem is that not all of the UPDATE cases in the MDS encode and send the 'split_realms' list to clients via MClientSnap requests. From reading Zheng's commit, I think he just copied that code into one UPDATE case from the old SPLIT case and didn't notice that it would introduce this issue in the kclient.
Sending the 'split_realms' list is unnecessary and useless on the client side for the UPDATE case; we should fix it on the MDS side too.
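Conceptually the MDS-side fix is just to make the two arrays conditional on the op again. The real MDS code is C++ with its own encoders, so the helper names below are made up purely to illustrate the idea:

        /* illustration only -- append_to_msg() and the variables here
         * are hypothetical, not the real (C++) MDS implementation */
        head.op = op;
        head.split = split_ino;
        head.num_split_inos = (op == CEPH_SNAP_OP_SPLIT) ? num_split_inos : 0;
        head.num_split_realms = (op == CEPH_SNAP_OP_SPLIT) ? num_split_realms : 0;
        head.trace_len = trace_len;
        append_to_msg(msg, &head, sizeof(head));

        if (op == CEPH_SNAP_OP_SPLIT) {
                append_to_msg(msg, split_inos, sizeof(u64) * num_split_inos);
                append_to_msg(msg, split_realms, sizeof(u64) * num_split_realms);
        }
        append_to_msg(msg, trace, trace_len);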
Thanks
- Xiubo
Thanks,
Ilya