Hi there,
You detected a failure in gfortran.dg/class_transformational_2.f90:
PASS: gfortran.dg/class_transformational_2.f90 -O0 (test for excess errors)
PASS: gfortran.dg/class_transformational_2.f90 -O0 execution test
PASS: gfortran.dg/class_transformational_2.f90 -O1 (test for excess errors)
FAIL: gfortran.dg/class_transformational_2.f90 -O1 execution test
PASS: gfortran.dg/class_transformational_2.f90 -O2 (test for excess errors)
PASS: gfortran.dg/class_transformational_2.f90 -O2 execution test
PASS: gfortran.dg/class_transformational_2.f90 -O3 -fomit-frame-pointer ...snip...
PASS: gfortran.dg/class_transformational_2.f90 -O3 -fomit-frame-pointer ...snip...
PASS: gfortran.dg/class_transformational_2.f90 -O3 -g (test for excess errors)
PASS: gfortran.dg/class_transformational_2.f90 -O3 -g execution test
PASS: gfortran.dg/class_transformational_2.f90 -Os (test for excess errors)
PASS: gfortran.dg/class_transformational_2.f90 -Os execution test
The stop message in the full log indicates a numeric error in the first test, but I am unable to reproduce it. Adding deallocation of all the allocated variables (which I should have done in the first place) and running valgrind with -s shows no errors and no memory leaks.
I find it odd that it should fail at -O1 but not at -O2 and higher. Can you provide me with any insights, e.g. by rerunning the testcase outside of the DejaGnu framework?
Thank you for doing this testing, by the way, even if the failure is a bit obscure at the moment.
Best regards
Paul
Hi There,
I have been withholding the commit of this patch until I hear from you.
Regards
Paul
On Tue, 2 Jul 2024 at 08:48, Paul Richard Thomas <paul.richard.thomas@gmail.com> wrote:
Hello Paul,
Paul Richard Thomas <paul.richard.thomas@gmail.com> writes:
Hi There,
I have been withholding the commit of this patch until I hear from you.
Sorry for the late response. I don't know much about Fortran or gfortran, but I tried to have a look at the failure. More details below, but unfortunately I didn't find anything concrete. Hopefully the Valgrind reports can help.
Please let me know if there are any other tests or investigations I can run.
On Tue, 2 Jul 2024 at 08:48, Paul Richard Thomas <paul.richard.thomas@gmail.com> wrote:
The stop message in the full log indicates a numeric error in the first test, but I am unable to reproduce it. Adding deallocation of all the allocated variables (which I should have done in the first place) and running valgrind with -s shows no errors and no memory leaks.
I find it odd that it should fail at -O1 but not at -O2 and higher. Can you provide me with any insights, e.g. by rerunning the testcase outside of the DejaGnu framework?
I can see the problem reliably when running the testcase binary for -O1 on an armv8l-linux-gnueabihf machine. Here's a GDB session showing where it abruptly exits:
$ gdb -q class_transformational_2.exe
Reading symbols from class_transformational_2.exe...
(gdb) break check_spread
Breakpoint 1 at 0x10c72: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 54.
(gdb) r
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Breakpoint 1, check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:54
54        stop_flag = 10
(gdb) n
55        a = [(s(j,10*j), j = 1,2)]
(gdb)
56        b = spread (a, dim = 2, ncopies = 2)
(gdb)
57        c = spread (b, dim = 1, ncopies = 4)
(gdb)
58        a = reshape (c, [size (c)])
(gdb) p c
$1 = ( i = (0, 1072693248), x = 1, d = 1 )
(gdb) n
STOP 12
[Inferior 1 (process 3684330) exited with code 014]
(gdb)
If I step into reshape, things seem to work fine, all the way to _gfortrani_reshape_packed. If I then type "next" after the last statement in that function, the process ends:
_gfortrani_reshape_packed (ret=0x252e0 "", rsize=128, source=0x25258 "\001\001\001\001", ssize=128, pad=0x0, psize=8) at /home/thiago.bauermann/src/gcc/libgfortran/intrinsics/reshape_packed.c:38
38        size = (rsize > ssize) ? ssize : rsize;
(gdb) n
39        memcpy (ret, source, size);
(gdb) n
42        while (rsize > 0)
(gdb) n
STOP 12
[Inferior 1 (process 3739928) exited with code 014]
(gdb)
If instead of typing "next", I type "step", then GDB enters realloc, and some "MAIN__::__copy_MAIN___S" thing before moving to the next line. Then it actually leaves the line with the reshape call and proceeds further! It ends up exiting within check_result, line 48:
⋮
48        if (any (a%i .ne. ii)) stop stop_flag + 2
(gdb)
STOP 12
[Inferior 1 (process 3739974) exited with code 014]
(gdb)
So this seems to be a heisenbug, where the program behaves differently in the presence of a debugger...
Just some baseless speculation: maybe the realloc call is failing? And for some unknown reason, when doing the single-stepping in GDB it succeeds? I can't think of anything else at least so far.
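A note on reading the sessions above: GDB prints the inferior's exit status in octal, so "exited with code 014" is decimal 12. That agrees with the "STOP 12" message, i.e. stop_flag (10) + 2 from line 48 of the testcase. A quick check in the shell, relying on printf treating a leading 0 as an octal prefix (the variable names are only illustrative):

```shell
# GDB reported "exited with code 014"; the leading 0 makes printf parse it as octal.
gdb_code=014
exit_status=$(printf '%d' "$gdb_code")
echo "$exit_status"
```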
For comparison, here are sessions on a binary built with -O0:
$ gdb -q class_transformational_2-O0.exe
Reading symbols from class_transformational_2-O0.exe...
(gdb) break check_spread
Breakpoint 1 at 0x136e2: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 54.
(gdb) r
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2-O0.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Breakpoint 1, MAIN__::check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:54
54        stop_flag = 10
(gdb) n
55        a = [(s(j,10*j), j = 1,2)]
(gdb)
56        b = spread (a, dim = 2, ncopies = 2)
(gdb)
57        c = spread (b, dim = 1, ncopies = 4)
(gdb)
58        a = reshape (c, [size (c)])
(gdb) p c
$1 = ( _data = (((( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 )) (( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 ))) ((( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 )) (( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 )))), _vptr = 0x26174 <__vtab_MAIN___S.22> )
(gdb) n
59        ishape = [4,2,2]
(gdb) p a
$2 = ( _data = (( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 1 ), ( i = 2 ), ( i = 2 ), ( i = 2 ), ( i = 2 )), _vptr = 0x26174 <__vtab_MAIN___S.22> )
(gdb)
Note that 'c' is very different than in the -O1 case. Is that expected?
Now with a binary built with -O2:
$ gdb -q class_transformational_2-O2.exe
Reading symbols from class_transformational_2-O2.exe...
(gdb) start
Temporary breakpoint 1 at 0x10704: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 15.
Starting program: /home/thiago.bauermann/.cache/builds/gcc-native-aarch32/gcc/testsuite/gfortran/class_transformational_2-O2.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Temporary breakpoint 1, MAIN__ () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:15
15        class(t), allocatable :: scalar, a(:), aa(:), b(:,:), c(:,:,:), field(:,:,:)
(gdb) break 58
Breakpoint 2 at 0x10b34: file /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90, line 58.
(gdb) c
Continuing.

Breakpoint 2, check_spread () at /home/thiago.bauermann/src/gcc/gcc/testsuite/gfortran.dg/class_transformational_2.f90:58
58        a = reshape (c, [size (c)])
(gdb) p c
$1 = ( i = (0, 1072693248), x = 1, d = 1 )
(gdb) n
59        ishape = [4,2,2]
(gdb) p a
$2 = ( i = (0, 1045149306), x = 1.2904777690891933e-08, d = 1.2904777690891933e-08 )
(gdb)
Here, 'c' is the same as in the -O1 case.
If I run the "continue" GDB command, then the program completes successfully. I wasn't able to break on check_spread this time because that function isn't present in the optimized binary.
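For anyone wanting to repeat this comparison outside DejaGnu, something along the following lines could build and run the testcase at several optimization levels. This is only a sketch: the helper name run_levels, the FC override and the output file naming are assumptions made for illustration, and the flags DejaGnu actually passes may differ.

```shell
# Hypothetical helper: build one Fortran source at each given -O level and
# report the exit status of the resulting binary.
# FC defaults to gfortran; set FC to use a different compiler driver.
run_levels() {
    src=$1; shift
    for opt in "$@"; do
        exe="./$(basename "$src" .f90)$opt.exe"
        if ! "${FC:-gfortran}" "$opt" -g "$src" -o "$exe"; then
            printf '%s: build failed\n' "$opt"
            continue
        fi
        # Discard program output; only the exit status is of interest here.
        "$exe" >/dev/null 2>&1
        printf '%s: exit status %s\n' "$opt" "$?"
    done
}
```

A call such as `run_levels class_transformational_2.f90 -O0 -O1 -O2` would then print one line per level, with the status in decimal.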
Another thing I noticed is that the test occasionally fails on aarch64-linux, about 1 in 50 times when I run it repeatedly in a loop. This happens with the "-O1", "-O2", "-O3" and "-O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions" variations, but not with the "-O0" variation.
Because the failure is intermittent, I haven't yet been able to catch it under a debugger. I'll try again next week with some scripting.
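Running the binary repeatedly and counting failures can be sketched as follows, assuming a POSIX shell. The helper name count_failures and the binary path in the usage note are hypothetical, not taken from the actual setup.

```shell
# Hypothetical helper: run a command N times and print how many runs
# exited with a non-zero status.
count_failures() {
    runs=$1; shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard program output; only the exit status matters.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_failures 200 ./class_transformational_2.exe` would print the number of failing runs out of 200, which gives a rough estimate of the failure rate for one optimization level.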
I tried reproducing on x86_64-linux, but couldn't.
I'm attaching the valgrind reports for arm and aarch64.
Hello,
One more detail:
Thiago Jung Bauermann <thiago.bauermann@linaro.org> writes:
I can see the problem reliably when running the testcase binary for -O1 on an armv8l-linux-gnueabihf machine.
I ran your patch through a different CI loop that we have, where instead of using the distro's toolchain (binutils, gcc, glibc) to build and test the patch, it builds every component from scratch and from their respective tips of trunk.
This time it didn't detect any problem. All gfortran.dg/class_transformational_2.f90 tests passed:
https://ci.linaro.org/job/tcwg_gnu_native_check_gcc--master-arm-precommit/2/...
I think this means that with Ubuntu 22.04 glibc we see the problem, but when using the latest upstream glibc we don't.