Hi, Sumit, John, Amit, All
I am investigating the VtsKernelNetTest failure on the HiKey 4.4 kernel. The problem is that the socket.connect call returns -11. By adding printk lines to the SYSCALL_DEFINE3 of connect in socket.c, I found that the error is returned by the line "err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen,sock->file->f_flags);" https://android.googlesource.com/kernel/hikey-linaro/+/android-hikey-linaro-...
At that point I did not know which connect function was actually being called, so I searched for the .connect assignments in kernel/linaro/hisilicon-4.4/net/ipv4/
and, by adding more printk lines, found that the implementation is tcp_v4_connect in net/ipv4/tcp_ipv4.c here: https://android.googlesource.com/kernel/hikey-linaro/+/android-hikey-linaro-...
There, again by adding printk lines, I found that the -11 is returned by the call to ip_route_connect here: https://android.googlesource.com/kernel/hikey-linaro/+/android-hikey-linaro-...
Next I would need to go to the definition of ip_route_connect, add printks there, and so on, until I find the place where -11 is really returned and can check the reason.
But this way of working seems stupid and time consuming; I think there must be smarter methods that I need to learn.
Could you please give me some suggestions on what to do in such cases?
Thanks in advance!
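For readers unfamiliar with why the callee of sock->ops->connect is not obvious from the call site, here is a minimal user-space sketch of the same function-pointer dispatch. Every name ending in _sketch is hypothetical and heavily simplified; none of this is the kernel's real code. As far as I know, in mainline the TCP path goes through inet_stream_ops and then sk->sk_prot->connect before it reaches tcp_v4_connect, so there are two such indirections to follow.

```c
#include <errno.h>

struct sock_sketch;

/* Hypothetical, simplified stand-in for the kernel's struct proto_ops. */
struct proto_ops_sketch {
    const char *name;
    int (*connect)(struct sock_sketch *sock, int flags);
};

/* Hypothetical stand-in for struct socket: it only carries the ops table. */
struct sock_sketch {
    const struct proto_ops_sketch *ops;
};

/* One possible implementation; here it always fails with -EAGAIN. */
static int stream_connect_sketch(struct sock_sketch *sock, int flags)
{
    (void)sock;
    (void)flags;
    return -EAGAIN;
}

/* Another implementation behind the same pointer; it always succeeds. */
static int dgram_connect_sketch(struct sock_sketch *sock, int flags)
{
    (void)sock;
    (void)flags;
    return 0;
}

/* The ".connect =" assignments are what a search of the source tree finds. */
static const struct proto_ops_sketch stream_ops_sketch = {
    .name = "stream",
    .connect = stream_connect_sketch,
};

static const struct proto_ops_sketch dgram_ops_sketch = {
    .name = "dgram",
    .connect = dgram_connect_sketch,
};

/* The caller cannot tell from this line alone which function will run. */
static int sys_connect_sketch(struct sock_sketch *sock, int flags)
{
    return sock->ops->connect(sock, flags);
}
```

Which function runs depends only on the table the socket was given when it was created, which is why searching for the .connect assignments is a reasonable first step.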
Hi, -11 is -EAGAIN -- AFAIK connect() returning EAGAIN usually means the listen() backlog is full. Unless the test's purpose is to check connections are accepted quickly under load, it may make sense to make the test handle EAGAIN instead of failing on it, e.g. change
xyz = connect(...);
to something more like
int retries = 10;
do {
    xyz = connect(...);
    if (xyz >= 0 || errno != EAGAIN)
        break;
    sleep(1);
} while (--retries);

On Fri, 27 Jul 2018 at 16:41, Yongqin Liu yongqin.liu@linaro.org wrote:
-- Best Regards, Yongqin Liu
#mailing list linaro-android@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-android _______________________________________________ linaro-android mailing list linaro-android@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-android
Hmm, I don't think that is the cause here: the test passes on the 4.14 and 4.9 kernels, and it passes for other parameter combinations as well.
Apart from this particular case, what I most want to learn is how to debug kernel-side functions in general. Adding printk lines to find the function where the error really originates does not seem smart to me. Maybe others have good methods for such cases.
Thanks, Yongqin Liu
I use ftrace all the time. It has all the advantages of printk without the downsides (atomic/interrupt context, modification of kernel behavior). The information is accumulated in ring buffers that can be retrieved and analysed once the test case has been executed. If you go that way you'll want to use 'trace-cmd' to control what gets traced. For trace analysis use kernelshark - it is really good and you see exactly what is happening on the entire system.
This tutorial here [1] gives you a lot of information on what I just wrote above - I especially recommend the LWN articles by Steve Rostedt.
Last but not least I recommend patience - the road you're embarking on is long and arduous.
Hope this helps, Mathieu
[1]. https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/
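As a rough illustration of the ftrace workflow described above, the sketch below drives the raw tracefs control files from C instead of using trace-cmd, so that a test program could switch tracing on just before the failing connect() and off right after it. The helper names (tracefs_write, trace_begin, trace_end) are hypothetical, and the tracefs directory is passed in because its mount point varies (commonly /sys/kernel/debug/tracing, an assumption here). It also assumes a kernel built with the function_graph tracer and a process running as root.

```c
#include <stdio.h>

/* Write a short string to <dir>/<file>. Returns 0 on success, -1 on failure. */
static int tracefs_write(const char *dir, const char *file, const char *val)
{
    char path[512];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof(path), "%s/%s", dir, file);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fputs(val, f) >= 0;
    if (fclose(f) != 0)     /* the write may only fail when flushed */
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Select the function_graph tracer, restrict it to one entry function
 * and its callees, then switch tracing on.
 */
static int trace_begin(const char *dir, const char *func)
{
    if (tracefs_write(dir, "tracing_on", "0") ||
        tracefs_write(dir, "current_tracer", "function_graph") ||
        tracefs_write(dir, "set_graph_function", func))
        return -1;
    return tracefs_write(dir, "tracing_on", "1");
}

/* Stop recording so the ring buffer keeps only the region of interest. */
static int trace_end(const char *dir)
{
    return tracefs_write(dir, "tracing_on", "0");
}
```

A test would call trace_begin(dir, "tcp_v4_connect") before connect() and trace_end(dir) after it, then read the trace file in the same directory. To my knowledge the graph output on a 4.4 kernel lists which callees ran and how long they took, not their return values, so it narrows down the code path rather than pointing directly at the function that returned -11.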
I usually work only inside the kernel, but recently I had to debug why wpa_cli wouldn't connect to wpa_supplicant on my setup.
I used strace to find the entry point that errs (it also shows which error is returned), then a generous sprinkling of trace_printk("%s:%d\n", __func__, __LINE__) around the areas of interest. If you know the end point that returns the error, you could call dump_stack() there to find which functions are involved, and then add the trace_printk calls to those functions.
cheers.
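A small sketch of what that sprinkling could look like in practice. The TRACE_HERE macro and the function route_lookup_sketch are hypothetical stand-ins for an area of interest, not real kernel code. The user-space branch exists only so the fragment builds outside the kernel; inside the kernel only the __KERNEL__ branch applies, and the output lands in the ftrace ring buffer rather than on stderr.

```c
#ifdef __KERNEL__
#include <linux/kernel.h>   /* trace_printk(), dump_stack() */
#include <linux/errno.h>
#else
/* User-space stand-ins so the sketch compiles outside the kernel. */
#include <errno.h>
#include <stdio.h>
#define trace_printk(...) fprintf(stderr, __VA_ARGS__)
#define dump_stack()      fprintf(stderr, "(dump_stack would print here)\n")
#endif

/* One-line marker: records which function and line were reached. */
#define TRACE_HERE() trace_printk("%s:%d\n", __func__, __LINE__)

/* Hypothetical function standing in for code on the failing path. */
static int route_lookup_sketch(int have_route)
{
    TRACE_HERE();               /* entered the function */

    if (!have_route) {
        dump_stack();           /* who called us on the error path? */
        return -EAGAIN;
    }

    TRACE_HERE();               /* reached the success path */
    return 0;
}
```

The markers show how far execution got before the error, and the dump_stack() at the known error return lists the callers that deserve markers next.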
Honestly, for situations like this where there is some failure in userspace in a subsystem I'm not familiar with, I usually do exactly what you've done. I know that's not what you probably want to hear though.
I strace initially to figure out what syscall is failing to userspace, then I add printk bits to the syscall handler in the kernel to figure out what component is failing, then dig down the stack.
ftrace can be useful in some cases (and critical in high perf cases where you can't slow the system down with printks), but the enabling/dumping steps are usually extra overhead in my process. Kernel debuggers can also be useful, but getting those enabled properly on a random arm board has always been a bigger barrier than I'm willing to jump when I have a small problem.
The only other useful tricks I have are:
* git bisection
* intuition-driven git diff comparisons (again, when you have known good and bad commits)
* git log/tig on directories to isolate changes
So yea, printk debugging is a bit "stupid", but it's pretty quick and reliable for chasing down these sorts of problems. The biggest issue is usually unfamiliarity with the code. So while going function by function, reading the code and placing debug messages to trace how the logic runs, might not feel smart, it can have the useful side effect of letting you learn more about the subsystem, which will help next time you're in the area.
Mostly when doing this, it's important to be able to quickly automate the building/flashing/booting/testing process so you can iterate quickly.
thanks -john