Commit 60e3318e3e900 ("cifs: use fs_context for automounts") was released in v6.1.54 and broke failover when one of the servers inside DFS becomes unavailable. We reproduced the problem on EC2 instances of different types. Reverting the aforementioned commit on top of the latest stable version, v6.1.94, resolves the problem.
The earliest working version is v6.2-rc1. There were two big merges of CIFS fixes: [1] and [2]. We would like to ask for help investigating this problem and determining whether some of those patches need to be backported. Also, is it safe to just revert the problematic commit until proper fixes/backports are available?
We will help with testing and confirm whether a fix works. Here are the steps we used to reproduce the problem, in case they help to identify it:

1. Create an Active Directory domain, e.g. 'corp.fsxtest.local', in AWS Directory Service with:
   - three AWS FSX file systems filesystem1..filesystem3
   - three Windows servers; they have DFS installed as per https://learn.microsoft.com/en-us/windows-server/storage/dfs-namespaces/dfs-...:
     - dfs-srv1: EC2AMAZ-2EGTM59
     - dfs-srv2: EC2AMAZ-1N36PRD
     - dfs-srv3: EC2AMAZ-0PAUH2U
2. Create a DFS namespace, e.g. 'dfs-namespace', in Windows Server 2008 mode and three folder targets in it:
   - referral-a mapped to filesystem1.corp.local
   - referral-b mapped to filesystem2.corp.local
   - referral-c mapped to filesystem3.corp.local
   - local folders dfs-srv1..dfs-srv3 in C:\DFSRoots\dfs-namespace of every Windows server. This helps to quickly identify the underlying server when DFS is mounted.
3. Enable cifs debug logs:
```
echo 'module cifs +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control
echo 7 > /proc/fs/cifs/cifsFYI
```
4. Mount the DFS namespace on an Amazon Linux 2023 instance running any vanilla kernel v6.1.54+:
```
dmesg -c &>/dev/null
cd /mnt
mount -t cifs -o cred=/mnt/creds,echo_interval=5 \
    //corp.fsxtest.local/dfs-namespace \
    ./dfs-namespace
```
5. List the DFS root (this is also required to avoid the recursive mounts that happen during a regular 'ls' run):
```
sh -c 'ls dfs-namespace'
dfs-srv2 referral-a referral-b
```
The DFS server is EC2AMAZ-1N36PRD; it is also listed in the mount output:
```
[root@ip-172-31-2-82 mnt]# mount | grep dfs
//corp.fsxtest.local/dfs-namespace on /mnt/dfs-namespace type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.11.26,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
//EC2AMAZ-1N36PRD.corp.fsxtest.local/dfs-namespace/referral-a on /mnt/dfs-namespace/referral-a type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.12.80,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
```
List the files in the first folder:
```
sh -c 'ls dfs-namespace/referral-a'
filea.txt.txt
```
6. Shut down DFS server-2. List the DFS root again; the server changed from dfs-srv2 to dfs-srv1 (EC2AMAZ-2EGTM59):
```
sh -c 'ls dfs-namespace'
dfs-srv1 referral-a referral-b
```
7. Try to list the files in another folder; this causes ls to fail with an error:
```
sh -c 'ls dfs-namespace/referral-b'
ls: cannot access 'dfs-namespace/referral-b': No route to host
```
Sometimes the error is 'Operation now in progress' instead.
mount shows the same output:
```
//corp.fsxtest.local/dfs-namespace on /mnt/dfs-namespace type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.11.26,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
//EC2AMAZ-1N36PRD.corp.fsxtest.local/dfs-namespace/referral-a on /mnt/dfs-namespace/referral-a type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.12.80,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
```
I also attached kernel debug logs from this test.
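The manual checks in the steps above (reading `addr=` from the mount output and listing each referral folder) could be scripted roughly as follows. This is only a sketch: `mount_addr` and `check_referrals` are hypothetical helper names introduced here, and the mount point is assumed to be the one used in the steps above.

```shell
#!/bin/sh
# Hypothetical helpers for the reproduction steps; names are illustrative only.

# Print the value of the addr= option from one line of `mount` output.
mount_addr() {
    printf '%s\n' "$1" | sed -n 's/.*[(,]addr=\([^,)]*\).*/\1/p'
}

# List each folder under the given root and report which ones fail.
# Prints "OK <dir>" or "FAIL <dir>" per folder; returns non-zero on any failure.
check_referrals() {
    root=$1
    shift
    status=0
    for dir in "$@"; do
        if ls "$root/$dir" >/dev/null 2>&1; then
            echo "OK $dir"
        else
            echo "FAIL $dir"
            status=1
        fi
    done
    return "$status"
}
```

For example, running `check_referrals /mnt/dfs-namespace referral-a referral-b` after the DFS server has been shut down would be expected to report a failure for referral-b on an affected kernel.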
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

Reported-by: Andrei Paniakin <apanyaki@amazon.com>
Bisected-by: Simba Bonga <simbarb@amazon.com>
---
#regzbot introduced: v6.1.54..v6.2-rc1
On 19/06/2024, Andrew Paniakin wrote:
[snip]
I also attached kernel debug logs from this test.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=851f657a86421
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a924817d2ed9

Reported-by: Andrei Paniakin <apanyaki@amazon.com>
Bisected-by: Simba Bonga <simbarb@amazon.com>
---
#regzbot introduced: v6.1.54..v6.2-rc1
Friendly reminder: has anyone had a chance to look into this report?
On 24/06/24 10:59AM, Andrew Paniakin wrote:
On 19/06/2024, Andrew Paniakin wrote:
[snip]
Friendly reminder, did anyone had a chance to look into this report?
It seems that so far nobody has had a chance to look into this report 🤔
If I understand the report correctly, the regression is specific to the current 6.1.y stable series, so there is not much the CIFS devs themselves can do. Maybe the stable team missed the report among the plethora of mail that they get. I'll change the subject to make this more prominent for them.
I think a good next step would be to bisect to the commit that fixed the relevant issue somewhere between v6.1.54..v6.2-rc1, so the stable team knows what needs backporting. You can do that somewhat like so [0]:
    $ git bisect start --term-new=fixed --term-old=unfixed
    $ git bisect fixed v6.2-rc1
    $ git bisect unfixed v6.1
Then you just need to carry around the commit that broke the behaviour for you (which could be quite some work). Maybe others also have better ideas on how to approach that.
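If the bisection is driven by `git bisect run`, note that git treats exit code 0 as the old term ('unfixed' here), any code from 1 to 127 except 125 as the new term ('fixed'), and 125 as "skip this commit". A script hunting for a fix therefore has to invert the usual test result. A minimal sketch of that mapping follows; `bisect_exit_code` is a hypothetical helper name, and the build, boot and reproduction steps that would produce its input are assumed to exist elsewhere.

```shell
#!/bin/sh
# Map a reproduction result onto the exit codes `git bisect run` expects
# when bisecting with --term-old=unfixed --term-new=fixed.
#   $1: "pass" (failover works), "fail" (bug reproduced),
#       anything else means the commit could not be tested.
bisect_exit_code() {
    case "$1" in
        pass) echo 1 ;;    # fix present -> new term ("fixed")
        fail) echo 0 ;;    # bug present -> old term ("unfixed")
        *)    echo 125 ;;  # build or boot problem -> skip this commit
    esac
}
```

A bisect script would end with something like `exit "$(bisect_exit_code "$result")"`. Since each step here needs a kernel build and a reboot of the test instance, full automation may not be practical; the mapping is the same when marking commits by hand.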
A revert may be a bit more complicated, as the breaking commit seems to be a dependency for a commit that fixes something:
    efc0b0bcffcba ("smb: propagate error code of extract_sharename()")
    Fixes: 70431bfd825d ("cifs: Support fscache indexing rewrite")
Cheers, chris
[0]: https://stackoverflow.com/a/17153598
#regzbot introduced: 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
#regzbot summary: cifs: broken failover for server inside DFS
On 25/06/2024, Christian Heusel wrote:
On 24/06/24 10:59AM, Andrew Paniakin wrote:
On 19/06/2024, Andrew Paniakin wrote:
[snip]
Bisection showed that 7ad54b98fc1f ("cifs: use origin fullpath for automounts") is the first good commit. Applying it on top of 6.1.94 fixed the reported problem. It also passed the Amazon Linux kernel regression tests when applied on top of our latest 6.1 kernel. Since the code in 6.1.92 is a bit different, I updated the original patch:
From: Paulo Alcantara <pc@cjr.nz>
Date: Sun, 18 Dec 2022 14:37:32 -0300
Subject: [PATCH] cifs: use origin fullpath for automounts

commit 7ad54b98fc1f141cfb70cfe2a3d6def5a85169ff upstream.

Use TCP_Server_Info::origin_fullpath instead of cifs_tcon::tree_name
when building source paths for automounts as it will be useful for
domain-based DFS referrals where the connections and referrals would
get either re-used from the cache or re-created when chasing the dfs
link.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Andrew Paniakin <apanyaki@amazon.com>
---
 fs/smb/client/cifs_dfs_ref.c | 34 ++++++++++++++++++++++++++++++++--
 fs/smb/client/cifsproto.h    | 18 ++++++++++++++++++
 fs/smb/client/dir.c          | 21 +++++++++++++++------
 3 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/fs/smb/client/cifs_dfs_ref.c b/fs/smb/client/cifs_dfs_ref.c
index 020e71fe1454e..876f9a43a99db 100644
--- a/fs/smb/client/cifs_dfs_ref.c
+++ b/fs/smb/client/cifs_dfs_ref.c
@@ -258,6 +258,31 @@ char *cifs_compose_mount_options(const char *sb_mountdata,
 	goto compose_mount_options_out;
 }
 
+static int set_dest_addr(struct smb3_fs_context *ctx, const char *full_path)
+{
+	struct sockaddr *addr = (struct sockaddr *)&ctx->dstaddr;
+	char *str_addr = NULL;
+	int rc;
+
+	rc = dns_resolve_server_name_to_ip(full_path, &str_addr, NULL);
+	if (rc < 0)
+		goto out;
+
+	rc = cifs_convert_address(addr, str_addr, strlen(str_addr));
+	if (!rc) {
+		cifs_dbg(FYI, "%s: failed to convert ip address\n", __func__);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	cifs_set_port(addr, ctx->port);
+	rc = 0;
+
+out:
+	kfree(str_addr);
+	return rc;
+}
+
 /*
  * Create a vfsmount that we can automount
  */
@@ -295,8 +320,7 @@ static struct vfsmount *cifs_dfs_do_automount(struct path *path)
 	ctx = smb3_fc2context(fc);
 
 	page = alloc_dentry_path();
-	/* always use tree name prefix */
-	full_path = build_path_from_dentry_optional_prefix(mntpt, page, true);
+	full_path = dfs_get_automount_devname(mntpt, page);
 	if (IS_ERR(full_path)) {
 		mnt = ERR_CAST(full_path);
 		goto out;
@@ -315,6 +339,12 @@ static struct vfsmount *cifs_dfs_do_automount(struct path *path)
 		goto out;
 	}
 
+	rc = set_dest_addr(ctx, full_path);
+	if (rc) {
+		mnt = ERR_PTR(rc);
+		goto out;
+	}
+
 	rc = smb3_parse_devname(full_path, ctx);
 	if (!rc)
 		mnt = fc_mount(fc);
diff --git a/fs/smb/client/cifsproto.h b/fs/smb/client/cifsproto.h
index f37e4da0fe405..6dbc9afd67281 100644
--- a/fs/smb/client/cifsproto.h
+++ b/fs/smb/client/cifsproto.h
@@ -57,8 +57,26 @@ extern void exit_cifs_idmap(void);
 extern int init_cifs_spnego(void);
 extern void exit_cifs_spnego(void);
 extern const char *build_path_from_dentry(struct dentry *, void *);
+char *__build_path_from_dentry_optional_prefix(struct dentry *direntry, void *page,
+					       const char *tree, int tree_len,
+					       bool prefix);
 extern char *build_path_from_dentry_optional_prefix(struct dentry *direntry,
 						    void *page, bool prefix);
+static inline char *dfs_get_automount_devname(struct dentry *dentry, void *page)
+{
+	struct cifs_sb_info *cifs_sb = CIFS_SB(dentry->d_sb);
+	struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
+	struct TCP_Server_Info *server = tcon->ses->server;
+
+	if (unlikely(!server->origin_fullpath))
+		return ERR_PTR(-EREMOTE);
+
+	return __build_path_from_dentry_optional_prefix(dentry, page,
+							server->origin_fullpath,
+							strlen(server->origin_fullpath),
+							true);
+}
+
 static inline void *alloc_dentry_path(void)
 {
 	return __getname();
diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index 863c7bc3db86f..477302157ab3d 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -78,14 +78,13 @@ build_path_from_dentry(struct dentry *direntry, void *page)
 						      prefix);
 }
 
-char *
-build_path_from_dentry_optional_prefix(struct dentry *direntry, void *page,
-				       bool prefix)
+char *__build_path_from_dentry_optional_prefix(struct dentry *direntry, void *page,
+					       const char *tree, int tree_len,
+					       bool prefix)
 {
 	int dfsplen;
 	int pplen = 0;
 	struct cifs_sb_info *cifs_sb = CIFS_SB(direntry->d_sb);
-	struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
 	char dirsep = CIFS_DIR_SEP(cifs_sb);
 	char *s;
 
@@ -93,7 +92,7 @@ build_path_from_dentry_optional_prefix(struct dentry *direntry, void *page,
 		return ERR_PTR(-ENOMEM);
 
 	if (prefix)
-		dfsplen = strnlen(tcon->tree_name, MAX_TREE_SIZE + 1);
+		dfsplen = strnlen(tree, tree_len + 1);
 	else
 		dfsplen = 0;
 
@@ -123,7 +122,7 @@ build_path_from_dentry_optional_prefix(struct dentry *direntry, void *page,
 	}
 	if (dfsplen) {
 		s -= dfsplen;
-		memcpy(s, tcon->tree_name, dfsplen);
+		memcpy(s, tree, dfsplen);
 		if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_POSIX_PATHS) {
 			int i;
 			for (i = 0; i < dfsplen; i++) {
@@ -135,6 +134,16 @@ build_path_from_dentry_optional_prefix(struct dentry *direntry, void *page,
 	return s;
 }
 
+char *build_path_from_dentry_optional_prefix(struct dentry *direntry, void *page,
+					     bool prefix)
+{
+	struct cifs_sb_info *cifs_sb = CIFS_SB(direntry->d_sb);
+	struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
+
+	return __build_path_from_dentry_optional_prefix(direntry, page, tcon->tree_name,
+							MAX_TREE_SIZE, prefix);
+}
+
 /*
  * Don't allow path components longer than the server max.
  * Don't allow the separator character in a path component.
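
To see why the choice of prefix matters, here is a simplified illustration of the string composition only; it is not the kernel code. It assumes, based on the mount output in the report, that the tree name refers to the DFS root server picked at mount time, and that the origin full path is the domain-based path originally passed to mount. `automount_devname` is a hypothetical helper and both prefix values are illustrative.

```shell
#!/bin/sh
# Simplified illustration: an automount source is "<prefix>/<path below the root>".
# automount_devname is a hypothetical helper, not a kernel or cifs-utils function.
automount_devname() {
    prefix=$1   # tree name (before the patch) or origin full path (after it)
    subpath=$2  # path of the DFS link below the mounted root
    printf '%s/%s\n' "${prefix%/}" "${subpath#/}"
}

# Before the patch (assumed): prefix taken from the tree name of one DFS root server.
automount_devname //EC2AMAZ-1N36PRD.corp.fsxtest.local/dfs-namespace referral-b
# After the patch (assumed): prefix taken from the domain-based origin full path.
automount_devname //corp.fsxtest.local/dfs-namespace referral-b
```

The first form matches the shape of the referral-a source shown by mount in the report, which names a single server; the second form names the domain, which is what the commit message refers to as domain-based DFS referrals.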
On 24/06/26 03:09PM, Andrew Paniakin wrote:
On 25/06/2024, Christian Heusel wrote:
On 24/06/24 10:59AM, Andrew Paniakin wrote:
On 19/06/2024, Andrew Paniakin wrote:
Commit 60e3318e3e900 ("cifs: use fs_context for automounts") was released in v6.1.54 and broke the failover when one of the servers inside DFS becomes unavailable.
Friendly reminder, did anyone had a chance to look into this report?
If I understand the report correctly the regression is specific for the current 6.1.y stable series, so also not much the CIFS devs themselves can do. Maybe the stable team missed the report with the plethora of mail that they get.. I'll change the subject to make this more prominent for them.
I think a good next step would be to bisect to the commit that fixed the relevant issue somewhere between v6.1.54..v6.2-rc1 so the stable team knows what needs backporting .. You can do that somewhat like so[0]:
Bisection showed that 7ad54b98fc1f ("cifs: use origin fullpath for automounts") is a first good commit. Applying it on top of 6.1.94 fixed the reported problem. It also passed Amazon Linux kernel regression tests when applied on top of our latest kernel 6.1. Since the code in 6.1.92 is a bit different I updated the original patch:
Hey Andrew,
good job on the bisection!
I think it might make sense to send the backported version of the patch for inclusion to the stable tree directly (see "Option 3" [here][0]).
Cheers, Chris
[0]: https://www.kernel.org/doc/html/next/process/stable-kernel-rules.html#option...
On 27.06.24 22:16, Christian Heusel wrote:
On 24/06/26 03:09PM, Andrew Paniakin wrote:
On 25/06/2024, Christian Heusel wrote:
On 24/06/24 10:59AM, Andrew Paniakin wrote:
On 19/06/2024, Andrew Paniakin wrote:
Commit 60e3318e3e900 ("cifs: use fs_context for automounts") was released in v6.1.54 and broke the failover when one of the servers inside DFS becomes unavailable.
Friendly reminder, did anyone had a chance to look into this report?
If I understand the report correctly the regression is specific for the current 6.1.y stable series, so also not much the CIFS devs themselves can do. Maybe the stable team missed the report with the plethora of mail that they get.. I'll change the subject to make this more prominent for them.
I think a good next step would be to bisect to the commit that fixed the relevant issue somewhere between v6.1.54..v6.2-rc1 so the stable team knows what needs backporting .. You can do that somewhat like so[0]:
Bisection showed that 7ad54b98fc1f ("cifs: use origin fullpath for automounts") is a first good commit. Applying it on top of 6.1.94 fixed the reported problem. It also passed Amazon Linux kernel regression tests when applied on top of our latest kernel 6.1. Since the code in 6.1.92 is a bit different I updated the original patch:
I think it might make sense to send the backported version of the patch for inclusion to the stable tree directly (see "Option 3" [here][0]).
Hmmm, unless I'm missing something it seems nobody did so. Andrew, could you take care of that to get this properly fixed to prevent others from running into the same problem?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
#regzbot poke
On 11/07/2024, Linux regression tracking (Thorsten Leemhuis) wrote:
On 27.06.24 22:16, Christian Heusel wrote:
On 24/06/26 03:09PM, Andrew Paniakin wrote:
On 25/06/2024, Christian Heusel wrote:
On 24/06/24 10:59AM, Andrew Paniakin wrote:
On 19/06/2024, Andrew Paniakin wrote:
Commit 60e3318e3e900 ("cifs: use fs_context for automounts") was
[snip]
Hmmm, unless I'm missing something it seems nobody did so. Andrew, could you take care of that to get this properly fixed to prevent others from running into the same problem?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page.
#regzbot poke
We got confirmation from the requesters that the kernel with this patch works properly, and our regression tests also passed, so I submitted a backport request: https://lore.kernel.org/stable/20240713031147.20332-1-apanyaki@amazon.com/
On 12/07/2024, Andrew Paniakin wrote:
On 11/07/2024, Linux regression tracking (Thorsten Leemhuis) wrote:
On 27.06.24 22:16, Christian Heusel wrote:
On 24/06/26 03:09PM, Andrew Paniakin wrote:
On 25/06/2024, Christian Heusel wrote:
On 24/06/24 10:59AM, Andrew Paniakin wrote:
On 19/06/2024, Andrew Paniakin wrote:
Commit 60e3318e3e900 ("cifs: use fs_context for automounts") was
[snip]
Hmmm, unless I'm missing something it seems nobody did so. Andrew, could you take care of that to get this properly fixed to prevent others from running into the same problem?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page.
#regzbot poke
We got the confirmation from requesters that the kernel with this patch works properly, our regression tests also passed, so I submitted backport request: https://lore.kernel.org/stable/20240713031147.20332-1-apanyaki@amazon.com/
There was an issue with backporting the follow-up fix for this patch: https://lore.kernel.org/all/20240716152749.667492414@linuxfoundation.org/ I'll work on fixing this issue and send new patches again for the next cycle.
On 23.07.24 02:51, Andrew Paniakin wrote:
[snip]
Andrew, was there any progress? From here it looks like this fell through the cracks, but I might be missing something.
Ciao, Thorsten
On 27/09/2024, Linux regression tracking (Thorsten Leemhuis) wrote:
[snip]
Hi Thorsten, sorry for the delay in replying. I had to take a step back and update my development setup to keep the rebase process from breaking: I created a script that uses crosstool [1] to test my future backports on all platforms, and I now make sure to search for follow-up fixes to the patch I'm porting; I found kernel.dance [2] for that. Now I'm trying to reproduce the issue mentioned in the follow-up fix [3] to have clear red/green test results. I think I should be able to send tested fixes in the next 2 weeks.
[1] https://cdn.kernel.org/pub/tools/crosstool/
[2] https://kernel.dance/
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
On 16/10/2024, Andrew Paniakin wrote:
[snip]
Hi Thorsten, over the last few weeks I had to work on a few urgent internal issues, so this work got delayed. I got confirmation from my manager to make this task my priority until it's done. To progress faster I set up systemtap and was able to find the reason why my reproducer didn't work.
On 10/11/2024, Andrew Paniakin wrote:
[snip]
Hi Thorsten,
I completed the investigation of this issue and have data showing that we should not backport the follow-up fix d5a863a153e9 ("cifs: avoid dup prefix path in dfs_get_automount_devname()") to linux-6.1. We planned to do that last time [1], but it would break mounts of share subdirectories.
Summary:

1. The main purpose of 7ad54b98fc1f1 ("cifs: use origin fullpath for automounts") is to use the better-cached TCP_Server_Info::origin_fullpath URL. But what fixes the failover issue that we reported is the set_dest_addr() call at the start of automount. This fix would work even without the switch to origin fullpath.

2. It's not mentioned in the commit message of the fix d5a863a153e9 ("cifs: avoid dup prefix path in dfs_get_automount_devname()"), but that fix only works when both 7ad54b98fc1f1 ("cifs: use origin fullpath for automounts") and a1c0d00572fc ("cifs: share dfs connections and supers") are applied. The second patch changed the origin_fullpath contents from the namespace root to a full mount path, e.g. from '//corp.fsxtest.local/namespace/' to '//corp.fsxtest.local/namespace/folderA/fs1-folder/'. Since the prefix path '/folderA/' is also stored in the cifs superblock info, the follow-up fix d5a863a153e9 is needed to avoid adding it twice.

   But the change a1c0d00572fc wasn't ported to linux-6.1, and probably shouldn't be, because it's part of the big cifs driver rework made in linux-6.2, not just a bug fix. So if we backport d5a863a153e9 ("cifs: avoid dup prefix path in dfs_get_automount_devname()") to linux-6.1, the path construction routine __build_path_from_dentry_optional_prefix will not add the prefix path from the superblock info, because it assumes origin_fullpath already has it.
My next step is to resend 7ad54b98fc1f1 ("cifs: use origin fullpath for automounts") with the required comments and send an update to this thread once it is merged.
Please find detailed explanation and test results below.
=== Root cause analysis of the DFS failover issue ===

Steps to trigger the failover issue:

1. Create the test environment:
   * Active Directory domain //corp.fsxtest.local
   * DFS namespace '//corp.fsxtest.local/namespace', root server at 172.31.25.164
   * Two namespace servers (the IP addresses make the logs easier to read):
     * EC2AMAZ-JF3R0PQ at 172.31.48.144
     * EC2AMAZ-T0UUIJ3 at 172.31.60.51
   * Network file system fs1.corp.fsxtest.local
   * DFS link //corp.fsxtest.local/namespace/fs1-folder refers to //fs1.corp.fsxtest.local/folder1

2. Mount the DFS root:

       mount -t cifs -o cred=/mnt/creds,noserverino,echo_interval=5 \
           //corp.fsxtest.local/namespace \
           /mnt/dfs-namespace

3. Identify the selected root target from the logs:

       dmesg | grep connect_dfs_target
       [635023.023630] CIFS: fs/smb/client/connect.c: connect_dfs_target: full_path=\corp.fsxtest.local\namespace ref_path=\corp.fsxtest.local\namespace target=\EC2AMAZ-JF3R0PQ.corp.fsxtest.local\namespace

4. Stop the target server and wait for the failover to finish:

       aws ec2 stop-instances --instance-ids $JF3R0PQ && sleep 60

5. Try to access the DFS folder link; this fails:

       [root@ip-172-31-55-195 ~]# sh -c 'ls /mnt/dfs-namespace/fs1-folder'
       ls: cannot access '/mnt/dfs-namespace/fs1-folder': No route to host
Verbose logs show that cifs client uses stale IP after failover.
Initial logs from cifs_smb3_do_mount [2], mount_get_conns [3] and cifs_get_tcp_session [4] show that the client resolved the DFS root server address corp.fsxtest.local to 172.31.25.164 and connected to it, as expected:
```
[635022.848477] CIFS: fs/smb/client/cifsfs.c: Devname: \corp.fsxtest.local\namespace flags: 0
[635022.850258] CIFS: fs/smb/client/connect.c: VFS: in mount_get_conns as Xid: 0 with uid: 0
[635022.850926] CIFS: fs/smb/client/connect.c: UNC: \corp.fsxtest.local\namespace
[635022.851531] CIFS: fs/smb/client/connect.c: generic_ip_connect: connecting to 172.31.25.164:445
```
Then it asked the root server for referrals and connected to the first one, EC2AMAZ-JF3R0PQ (172.31.48.144):
```
[635023.023630] CIFS: fs/smb/client/connect.c: connect_dfs_target: full_path=\corp.fsxtest.local\namespace ref_path=\corp.fsxtest.local\namespace target=\EC2AMAZ-JF3R0PQ.corp.fsxtest.local\namespace
[635023.025029] CIFS: fs/smb/client/dfs_cache.c: dfs_cache_get_tgt_referral: path: \corp.fsxtest.local\namespace
[635023.025850] CIFS: fs/smb/client/dfs_cache.c: dfs_cache_get_tgt_referral: target name: \EC2AMAZ-JF3R0PQ.corp.fsxtest.local\namespace
[635023.026805] CIFS: fs/smb/client/dfs_cache.c: setup_referral: set up new ref
[635023.030202] CIFS: fs/smb/client/dns_resolve.c: dns_resolve_server_name_to_ip: resolved: EC2AMAZ-JF3R0PQ.corp.fsxtest.local to 172.31.48.144 expiry 0
```
The mount completed and I stopped EC2AMAZ-JF3R0PQ. Logs from __reconnect_target_unlocked [5] show that the client resolved and connected to the next option, EC2AMAZ-T0UUIJ3 (172.31.60.51):
```
[635064.392246] CIFS: fs/smb/client/dns_resolve.c: dns_resolve_server_name_to_ip: resolved: EC2AMAZ-T0UUIJ3.corp.fsxtest.local to 172.31.60.51 expiry 1741642209
[635064.393449] CIFS: fs/smb/client/connect.c: reconn_set_ipaddr_from_hostname: next dns resolution scheduled for 121 seconds in the future
[635064.394497] CIFS: fs/smb/client/connect.c: __reconnect_target_unlocked: reconn_set_ipaddr_from_hostname: rc=0
[635064.395352] CIFS: fs/smb/client/connect.c: generic_ip_connect: connecting to 172.31.60.51:445
```
Then I accessed the DFS link, which triggers automount. The cifs_dfs_do_automount [6] logs show that the cifs client uses the new address EC2AMAZ-T0UUIJ3 in the path, but connects to the old IP 172.31.48.144 of EC2AMAZ-JF3R0PQ:
```
[635117.289268] CIFS: fs/smb/client/cifs_dfs_ref.c: cifs_dfs_d_automount: fs1-folder
[635117.289913] CIFS: fs/smb/client/cifs_dfs_ref.c: cifs_dfs_do_automount: full_path: //EC2AMAZ-T0UUIJ3.corp.fsxtest.local/namespace/fs1-folder
[635117.294269] CIFS: fs/smb/client/connect.c: generic_ip_connect: connecting to 172.31.48.144:445
[635120.386908] CIFS: fs/smb/client/connect.c: Error -113 connecting to server
```
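For anyone repeating this test, the stale-IP pattern above can be spotted mechanically. The following is a minimal sketch, not part of the original reproducer: it assumes one dmesg entry per line and a single cifs mount in the log, and the helper name stale_connects is made up for illustration. It flags every generic_ip_connect to an address that differs from the most recent dns_resolve_server_name_to_ip result.

```python
import re

# Patterns match the cifsFYI debug messages quoted in this thread.
RESOLVED_RE = re.compile(
    r"dns_resolve_server_name_to_ip: resolved: (\S+) to (\d+\.\d+\.\d+\.\d+)")
CONNECT_RE = re.compile(
    r"generic_ip_connect: connecting to (\d+\.\d+\.\d+\.\d+):\d+")


def stale_connects(log_text):
    """Return (connected_ip, last_resolved_ip) pairs that disagree.

    Heuristic only: connects seen before any resolution are ignored,
    because there is nothing to compare them against.
    """
    last_resolved = None
    stale = []
    for line in log_text.splitlines():
        match = RESOLVED_RE.search(line)
        if match:
            last_resolved = match.group(2)
            continue
        match = CONNECT_RE.search(line)
        if match and last_resolved is not None and match.group(1) != last_resolved:
            stale.append((match.group(1), last_resolved))
    return stale


if __name__ == "__main__":
    import sys
    # Example: dmesg | python3 stale_connects.py
    for connected, resolved in stale_connects(sys.stdin.read()):
        print(f"connect to {connected}, but last resolved address was {resolved}")
```

An empty result on the failover run is the green state; a non-empty one corresponds to the Error -113 case shown above.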
This is because all steps of the reconnect flow update only the TCP_Server_Info object, not the smb3_fs_context object:
- cifs_demultiplex_thread
- cifs_read_from_socket
- cifs_readv_from_socket
- cifs_reconnect
- reconnect_dfs_server
- __reconnect_target_unlocked
smb3_fs_context is the cifs-internal part of the generic VFS fs_context object. It stores the root target UNC and the IP address from which the TCP_Server_Info object is created later. During automount this context is smb3_fs_context_dup()ed from the parent super block private data. The target UNC is refreshed from the new referral, but the IP address in smb3_fs_context is never refreshed.
Patch 7ad54b98fc1f1 ("cifs: use origin fullpath for automounts") adds a set_dest_addr() call at the automount start. This helper resolves root target IP address and updates it in smb3_fs_context.
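To make the mechanism easier to follow, here is a toy model of it in Python. It is only a sketch under stated assumptions: the classes ServerInfo and FsContext merely stand in for TCP_Server_Info and smb3_fs_context, the functions reconnect() and automount() and the DNS table are invented for the illustration, and none of this mirrors the real kernel structures or call signatures.

```python
from dataclasses import dataclass, replace

# Hypothetical DNS table using the hosts and addresses from the test setup.
DNS = {
    "corp.fsxtest.local": "172.31.25.164",
    "EC2AMAZ-JF3R0PQ.corp.fsxtest.local": "172.31.48.144",
    "EC2AMAZ-T0UUIJ3.corp.fsxtest.local": "172.31.60.51",
}


@dataclass
class ServerInfo:
    """Stand-in for TCP_Server_Info: the live connection state."""
    hostname: str
    ip: str


@dataclass
class FsContext:
    """Stand-in for smb3_fs_context: mount parameters kept in the superblock."""
    unc_host: str
    dst_ip: str


def reconnect(server, new_host, dns):
    # Failover touches only the server object; the context is left alone.
    server.hostname = new_host
    server.ip = dns[new_host]


def automount(parent_ctx, unc_host, dns, refresh_addr):
    ctx = replace(parent_ctx)      # plays the role of smb3_fs_context_dup()
    ctx.unc_host = unc_host        # the UNC is refreshed for the automount
    if refresh_addr:
        # Plays the role of the set_dest_addr() call added by the fix.
        ctx.dst_ip = dns[ctx.unc_host]
    return ctx


first = "EC2AMAZ-JF3R0PQ.corp.fsxtest.local"
second = "EC2AMAZ-T0UUIJ3.corp.fsxtest.local"
server = ServerInfo(first, DNS[first])
parent = FsContext(first, DNS[first])

reconnect(server, second, DNS)     # first target stopped, client fails over

broken = automount(parent, second, DNS, refresh_addr=False)
fixed = automount(parent, "corp.fsxtest.local", DNS, refresh_addr=True)
print(server.ip, broken.dst_ip, fixed.dst_ip)
```

In this model the server object follows the failover while the duplicated context still carries the address of the stopped server, unless the address is re-resolved at automount time.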
I built linux-6.1 with this fix and tested failover again. This time the first target was EC2AMAZ-T0UUIJ3 (172.31.60.51):
```
[ 264.651876] CIFS: fs/smb/client/connect.c: connect_dfs_target: full_path=\corp.fsxtest.local\namespace ref_path=\corp.fsxtest.local\namespace target=\EC2AMAZ-T0UUIJ3.corp.fsxtest.local\namespace
[ 264.653327] CIFS: fs/smb/client/dfs_cache.c: dfs_cache_get_tgt_referral: path: \corp.fsxtest.local\namespace
[ 264.654157] CIFS: fs/smb/client/dfs_cache.c: dfs_cache_get_tgt_referral: target name: \EC2AMAZ-T0UUIJ3.corp.fsxtest.local\namespace
[ 264.655154] CIFS: fs/smb/client/dfs_cache.c: setup_referral: set up new ref
[ 264.665896] CIFS: fs/smb/client/dns_resolve.c: dns_resolve_server_name_to_ip: resolved: EC2AMAZ-T0UUIJ3.corp.fsxtest.local to 172.31.60.51 expiry 0
```
When I stopped it, the client reconnected to EC2AMAZ-JF3R0PQ (172.31.48.144):
```
[ 306.444808] CIFS: fs/smb/client/dns_resolve.c: dns_resolve_server_name_to_ip: resolved: EC2AMAZ-JF3R0PQ.corp.fsxtest.local to 172.31.48.144 expiry 1742025040
[ 306.446041] CIFS: fs/smb/client/connect.c: reconn_set_ipaddr_from_hostname: next dns resolution scheduled for 121 seconds in the future
[ 306.447067] CIFS: fs/smb/client/connect.c: __reconnect_target_unlocked: reconn_set_ipaddr_from_hostname: rc=0
[ 306.447916] CIFS: fs/smb/client/connect.c: generic_ip_connect: connecting to 172.31.48.144:445
```
Then I accessed the DFS link folder, which triggered automount, and the cifs client used the root server address corp.fsxtest.local and IP 172.31.25.164, just as needed:
```
[ 346.722178] CIFS: fs/smb/client/cifs_dfs_ref.c: cifs_dfs_d_automount: fs1-folder
[ 346.722821] CIFS: fs/smb/client/cifs_dfs_ref.c: cifs_dfs_do_automount: full_path: //corp.fsxtest.local/namespace/fs1-folder
[ 346.726676] CIFS: fs/smb/client/dns_resolve.c: dns_resolve_server_name_to_ip: resolved: corp.fsxtest.local to 172.31.25.164 expiry 0
[ 346.727694] CIFS: fs/smb/client/cifsfs.c: Devname: \corp.fsxtest.local\namespace flags: 0
[ 346.728400] CIFS: fs/smb/client/connect.c: Username: Admin
[ 346.728881] CIFS: fs/smb/client/connect.c: file mode: 0755 dir mode: 0755
[ 346.729476] CIFS: fs/smb/client/connect.c: VFS: in mount_get_conns as Xid: 15 with uid: 0
[ 346.730185] CIFS: fs/smb/client/connect.c: UNC: \corp.fsxtest.local\namespace
[ 346.730811] CIFS: fs/smb/client/connect.c: generic_ip_connect: connecting to 172.31.25.164:445
[ 346.731578] CIFS: fs/smb/client/connect.c: Socket created
```
=== RCA of the duplicated prefix path issue ===

To make sure the backport is correct, I tried to reproduce the described problem first. I tried to trigger the issue by putting DFS links at different places in the path and by doing tricks with links to another namespace with its own links, but with no luck. Then I spent a lot of time reading the cifs code and concluded that linux-6.1 can't have the prefix duplication issue in a path, so I tried a build of d5a863a153e9^ and was able to reproduce the issue immediately.
Steps:

1. Move the DFS link fs1-folder inside folderA (access to a basic folder will not trigger automount):

       //corp.fsxtest.local/namespace/folderA/fs1-folder

2. Mount this subfolder:

       [root@ip-172-31-55-195 ~]# mount -t cifs -o cred=/mnt/creds,noserverino,echo_interval=5 \
           //corp.fsxtest.local/namespace/folderA /mnt/dfs-namespace

3. Try to access the link fs1-folder; this fails:

       [root@ip-172-31-55-195 ~]# sh -c 'ls /mnt/dfs-namespace/fs1-folder'
       ls: cannot access '/mnt/dfs-namespace/fs1-folder': No such file or directory
The cifs_dfs_do_automount logs below show the reason:
- cifs_sb prepath: this is added to the path by dfs_get_automount_devname after the DFS tree.
- full_path: constructed by dfs_get_automount_devname [7] using the DFS tree UNC, the cifs_sb prepath and origin_fullpath.

As we can see, folderA appears there twice:
```
[ 2522.544482] CIFS: fs/cifs/cifs_dfs_ref.c: cifs_dfs_d_automount: fs1-folder
[ 2522.545072] CIFS: fs/cifs/dir.c: using cifs_sb prepath <folderA>
[ 2522.545593] CIFS: fs/cifs/cifs_dfs_ref.c: cifs_dfs_do_automount: full_path: //corp.fsxtest.local/namespace/folderA/folderA/fs1-folder
[SNIP]
[ 2522.686368] CIFS: fs/cifs/cifs_dfs_ref.c: leaving cifs_dfs_d_automount [automount failed]
```
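A red/green test for this symptom could scan the logged full_path for a path component that repeats back to back. The helper below is a hypothetical sketch (the name repeated_component is not from any existing tool), and it is only a heuristic: a share could legitimately contain a directory nested inside a directory of the same name.

```python
def repeated_component(unc_path):
    """Return the first component that immediately repeats, or None."""
    # Accept both '/' and '\' separators, since cifs logs use either form.
    parts = [p for p in unc_path.replace("\\", "/").split("/") if p]
    for previous, current in zip(parts, parts[1:]):
        if previous == current:
            return current
    return None


print(repeated_component(
    "//corp.fsxtest.local/namespace/folderA/folderA/fs1-folder"))
```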
The same build of linux-6.1 with my backport of 7ad54b98fc1f1 ("cifs: use origin fullpath for automounts") handles the prefix path mount correctly, no follow-up needed:
```
[ 152.901711] CIFS: fs/smb/client/cifs_dfs_ref.c: cifs_dfs_d_automount: fs1-folder
[ 152.902356] CIFS: fs/smb/client/dir.c: using cifs_sb prepath <folderA>
[ 152.902921] CIFS: fs/smb/client/cifs_dfs_ref.c: cifs_dfs_do_automount: full_path: //corp.fsxtest.local/namespace/folderA/fs1-folder
[SNIP]
[ 153.224169] CIFS: fs/smb/client/cifs_dfs_ref.c: leaving cifs_dfs_d_automount [ok]
```
=== Verify that backport d5a863a153e9 ("cifs: avoid dup prefix path in dfs_get_automount_devname()") breaks prepath automount ===

It's pretty clear from the code, but to double check I tested a build of v6.1.129 + 7ad54b98fc1f ("cifs: use origin fullpath for automounts") + d5a863a153e9 ("cifs: avoid dup prefix path in dfs_get_automount_devname()"). As expected, I got '//corp.fsxtest.local/namespace/fs1-folder' instead of '//corp.fsxtest.local/namespace/folderA/fs1-folder':
```
[ 630.368406] CIFS: fs/smb/client/cifs_dfs_ref.c: cifs_dfs_d_automount: fs1-folder
[ 630.369031] CIFS: fs/smb/client/cifs_dfs_ref.c: cifs_dfs_do_automount: full_path: //corp.fsxtest.local/namespace/fs1-folder
```
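The three observed results can be summarised with a small model of the path construction. This is a simplified sketch and not the kernel code: automount_devname() and its add_prepath flag are hypothetical stand-ins for dfs_get_automount_devname() with and without d5a863a153e9, and it assumes that with a1c0d00572fc applied origin_fullpath already ends in the mounted prefix (folderA), while on linux-6.1 it holds only the namespace root.

```python
def automount_devname(origin_fullpath, prepath, link, add_prepath):
    """Join origin_fullpath, the optional cifs_sb prepath and the DFS link name."""
    parts = [origin_fullpath.rstrip("/")]
    if add_prepath and prepath:
        parts.append(prepath.strip("/"))
    parts.append(link.strip("/"))
    return "/".join(parts)


ROOT = "//corp.fsxtest.local/namespace"    # origin_fullpath on linux-6.1
FULL = ROOT + "/folderA"                   # assumed contents with a1c0d00572fc

# linux-6.1 + 7ad54b98fc1f1: the prefix is added exactly once.
backport_only = automount_devname(ROOT, "folderA", "fs1-folder", add_prepath=True)
# d5a863a153e9^ upstream: origin_fullpath has the prefix and it is added again.
upstream_before_fix = automount_devname(FULL, "folderA", "fs1-folder", add_prepath=True)
# linux-6.1 + 7ad54b98fc1f1 + d5a863a153e9: the prefix is never added.
backport_plus_followup = automount_devname(ROOT, "folderA", "fs1-folder", add_prepath=False)

for name, path in [("backport only", backport_only),
                   ("upstream before fix", upstream_before_fix),
                   ("backport + follow-up", backport_plus_followup)]:
    print(f"{name}: {path}")
```

Under these assumptions the follow-up fix is only correct when origin_fullpath already carries the prefix, which matches the logs above.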
[1] https://lore.kernel.org/all/20240716152749.667492414@linuxfoundation.org/
[2] https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs...
[3] https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs...
[4] https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs...
[5] https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs...
[6] https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs...
[7] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/...
linux-stable-mirror@lists.linaro.org