SGE job crash with Segmentation fault and no help

It turned out that the SIGSEGV error (section below) was due to the stack being used up. The proposed C/C++ signal handler (segv.c) runs on its own stack so that it can try to display the top of the stack trace. NB the gdb backtrace command (bt) appears to rely on the default stack, and so gdb bt sometimes produces nothing.

It may be that the error is so severe that the stack trace has disappeared by the time the signal handler is called, but the handler is often still useful (i.e. better than nothing) in that it reports that the stack is empty.
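To illustrate the idea of a handler that runs on its own stack, here is a minimal sketch. It is not the segv.c mentioned above. It assumes Linux with glibc (for backtrace and backtrace_symbols_fd), and the names install_segv_handler, segv_handler, ALT_STACK_BYTES and MAX_FRAMES are hypothetical, chosen for this example only.

```c
/* Sketch only, not the author's segv.c. Assumes Linux + glibc.
 * All names here are hypothetical, chosen for illustration. */
#define _GNU_SOURCE
#include <execinfo.h>   /* backtrace, backtrace_symbols_fd (glibc) */
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define ALT_STACK_BYTES (64 * 1024)  /* assumed size for the handler's stack */
#define MAX_FRAMES 64                /* print at most this many frames */

/* Runs on the alternate stack, so it can still execute when the
 * normal stack has been used up. */
static void segv_handler(int sig, siginfo_t *info, void *context)
{
    static const char msg[] = "SIGSEGV caught, stack trace follows\n";
    void *frames[MAX_FRAMES];
    ssize_t ignored;
    int n;

    (void)info;
    (void)context;
    ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    n = backtrace(frames, MAX_FRAMES);
    /* backtrace_symbols_fd writes straight to the descriptor, no malloc */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + sig);   /* 139 for SIGSEGV, the status a shell would report */
}

/* Returns 0 on success, -1 on failure. */
int install_segv_handler(void)
{
    static char alt_stack[ALT_STACK_BYTES];
    void *warmup[1];
    stack_t ss;
    struct sigaction sa;

    /* Call backtrace once now so that any lazy library loading
     * happens here rather than inside the signal handler. */
    backtrace(warmup, 1);

    /* The alternate stack applies to the calling thread only. */
    ss.ss_sp = alt_stack;
    ss.ss_size = sizeof alt_stack;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  /* SA_ONSTACK: use alt stack */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Calling install_segv_handler() once near the start of main() should be enough for a single-threaded program. In a multi-threaded program each thread would need to register its own alternate stack with sigaltstack. Compiling with -g and linking with -rdynamic helps glibc print function names rather than bare addresses.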

SGE s_stack and -pe smp poor documentation

As a result of experiments it appears:

Solutions to GNU GDB Debugger Problems

Problem: C++ program eventually crashes with Segmentation fault, status 139

The error occurs more than a day after the SGE batch job started.
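Status 139 follows the shell convention of 128 plus the signal number: SIGSEGV is signal 11 on Linux, and 128 + 11 = 139. The sketch below decodes such a status. The helper names signal_from_status and describe_status are hypothetical, and the code assumes a POSIX system that provides strsignal.

```c
/* Sketch only: decode a shell-style exit status such as 139.
 * Helper names are hypothetical. Note that a program could also
 * call exit(139) itself, so the decoding is a convention, not proof. */
#include <signal.h>
#include <string.h>

/* Signal number encoded in the status, or 0 for an ordinary exit code. */
int signal_from_status(int status)
{
    return (status > 128 && status < 256) ? status - 128 : 0;
}

/* Human-readable description of the status. */
const char *describe_status(int status)
{
    int sig = signal_from_status(status);
    return sig ? strsignal(sig) : "ordinary exit code";
}
```

For example, describe_status(139) returns the system's description of SIGSEGV, whereas a status such as 1 is treated as an ordinary exit code.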

Valgrind (valgrind-3.15.0) does not support Intel AVX512 vector instructions.
Valgrind with a non-AVX version of the code finds a few uninitialised variables but nothing that explains the late crash.

The long run time makes interactive GDB hopeless.

A better approach, if you have the source code, may be to add a signal handler for SIGSEGV; see the example segv.c.

GDB batch gives error Suspended (tty output)

./script.bat >& script.out &

fails to detach from the interactive terminal and run independently in the background. Instead it produces Suspended (tty output).

script.bat contains tcsh commands, including a GDB invocation that runs in batch mode (-batch), runs the program and generates a backtrace stack dump (bt) when the program fails. Fragment:

#WBL  8 Nov 2020

gdb -batch -return-child-result \
    --eval-command=run --eval-command=bt --args \
./program	\
  program's arguments

# with -return-child-result, GDB's exit status is that of ./program
setenv save $status

if($save) then
  echo "gdb ... ./program failed status $save"
  exit $save
endif

echo "$0 done status $save" `date`
exit $save

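It appears that, despite the -batch argument, GDB is still trying to use the terminal, e.g. to read user commands. (Suspended (tty output) is what tcsh prints when a background job is stopped by signal SIGTTOU.)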

Work around

Direct the script (and thus GDB) to read from /dev/null:
./script.bat < /dev/null >& script.out &

Although the program is multi-threaded, it appears that the GDB overhead is high and the multiple CPUs are not used effectively.


W.B.Langdon 8 November 2020 (last update 19 Nov 2020)