fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
_whitelogger has joined #systemtap
<agentzh>
fche: yes, it was only in the dyninst case.
orivej has joined #systemtap
slowfranklin has joined #systemtap
pwithnall has joined #systemtap
orivej has quit [Ping timeout: 264 seconds]
<fche>
agentzh, hm, what do you think about adding an alarm() call into stapdyn mutator.cxx code, line 640ish, to impose a timeout on dyninst shutdown?
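[editor's note] A minimal sketch of the alarm()-based timeout idea mentioned above, in plain C rather than the actual stapdyn mutator.cxx code. The names `run_with_timeout`, `quick_step`, and `stuck_step` are hypothetical and only illustrate the pattern. It assumes a single-threaded caller and a step that can tolerate being abandoned halfway, which may not hold for dyninst teardown.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Jump target used to abandon a step that overruns its deadline. */
static sigjmp_buf timeout_env;

static void on_alarm(int signo)
{
    (void)signo;
    /* Unwind back to run_with_timeout(); the step is left unfinished. */
    siglongjmp(timeout_env, 1);
}

/* Hypothetical helper: run step() with a deadline in seconds.
 * Returns 0 if step() finished, -1 if it timed out,
 * -2 if the SIGALRM handler could not be installed. */
int run_with_timeout(void (*step)(void), unsigned seconds)
{
    struct sigaction sa, old;
    int rc;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, &old) != 0)
        return -2;

    /* savemask=1 so SIGALRM is unblocked again after the jump. */
    if (sigsetjmp(timeout_env, 1) == 0) {
        alarm(seconds);   /* arm the deadline */
        step();
        alarm(0);         /* finished in time: disarm */
        rc = 0;
    } else {
        rc = -1;          /* reached via siglongjmp from on_alarm() */
    }

    sigaction(SIGALRM, &old, NULL);  /* restore the previous handler */
    return rc;
}

/* Hypothetical stand-ins for a shutdown step. */
void quick_step(void) { /* returns immediately */ }
void stuck_step(void) { for (;;) pause(); /* never returns by itself */ }
```

For example, `run_with_timeout(stuck_step, 1)` returns -1 after about one second. A timeout only stops the waiting; whatever the abandoned step was cleaning up stays in whatever state it reached.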
slowfranklin has quit [Quit: slowfranklin]
orivej has joined #systemtap
orivej has quit [Ping timeout: 272 seconds]
brolley has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 244 seconds]
<agentzh>
fche: that sounds a bit unsafe to me, since if it indeed fails to remove the instrumentation from the target process, the target process may be left in a bad state and vulnerable to segfaults etc.
<agentzh>
we observed such things before though for bugs in gdb.
<agentzh>
i hit a gnu make 4.2.1 bug last night when running stap's official test suite. the jobserver impl in gmake just hangs upon exiting while waiting for a broken pipe forever (or maybe it looks like a kernel/glibc bug, not sure).
<agentzh>
the make process just hangs there forever.
<agentzh>
seems like properly shutting down is indeed tricky for certain software :)
<agentzh>
i had to kill gmake myself.
* fche
has long ago reported a gnu-make bug that I managed to trigger with stap testsuite interrupts
<agentzh>
fche: for a related thing, i also observed a leftover stap-serverd process after running the full stap test suite.
<fche>
it was (is!) a signal unsafe function call in a signal handler
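[editor's note] For context, a common way to avoid calling non-async-signal-safe functions from a handler is to have the handler only set a flag that the main loop polls. This is a generic sketch, not the GNU make code under discussion; `install_stop_handler` and `stop_was_requested` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* The only state the handler touches: a flag of a type that is
 * safe to write from a signal handler. */
static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int signo)
{
    (void)signo;
    stop_requested = 1;   /* no printf, malloc, or locks in here */
}

/* Install the flag-setting handler for signo; returns 0 on success. */
int install_stop_handler(int signo)
{
    struct sigaction sa;

    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* Polled from normal (non-handler) context, where any cleanup
 * function may be called safely. */
int stop_was_requested(void)
{
    return stop_requested != 0;
}
```

The main loop would check `stop_was_requested()` between units of work and do the real cleanup there.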
<agentzh>
fche: oh, in dyninst or stapdyn?
<fche>
gnumake
<agentzh>
ah, okay...you saw that too?
<agentzh>
okay, saw your earlier message :)
<agentzh>
fche: are your testbots for stap seeing random test successes and failures for certain test cases?
<agentzh>
i'm seeing inconsistent failures/passes across different runs of the test suite on the same machine against the same stap installation.
<agentzh>
it would be great if you can share some sample test reports of your testbots for the current master.
zodbot has joined #systemtap
<agentzh>
preferably from fedora testbots (if any).
<agentzh>
i'm using -j4 for the test run.
<agentzh>
-j5 would lead to machine lockup sometimes and -j6 would always lock.
<agentzh>
fche: btw, i'm seeing quite some syscall test failures while running the stap test suite on kernel 4.16.16, which do not appear in your 4.16.13 testbots' reports: https://pastebin.com/3h9jFqr2
<agentzh>
does it ring a bell?
<agentzh>
or is it similar to the 4.17 syscall breakage?
<fche>
hm do you have a make-install'd copy of stap?
<fche>
x86-64 ?
<agentzh>
yes, it was an installed copy.
<agentzh>
under /opt/stap/
<agentzh>
it's x86_64
<agentzh>
uname says 4.16.16-200.fc27.x86_64
<agentzh>
and it is the only kernel installed.
<agentzh>
i removed all the other kernel versions.
<fche>
the __NR_compat_read should show up in the installed runtime/linux/compat_unistd.h file
<agentzh>
fche: seems like the generated .c file does not include that header.
<fche>
I bet you're running /usr/bin/stap rather than your freshly-built /opt copy of stap
<fche>
hm maybe not
<agentzh>
fche: if i manually patch the generated .c file to include that header (by adding #include "linux/compat_unistd.h"), then i would get a lot of ""__NR_execveat" redefined [-Werror]" errors.
<agentzh>
i always use the absolute paths to invoke stap.
<agentzh>
and i don't have the system stap package installed.
<agentzh>
and there's no /usr/bin/stap or /bin/stap.
<fche>
yeah
<fche>
looking into the problem now, there is something inappropriate going on
<agentzh>
fche: thanks!
<agentzh>
after patching that #include into the generated .c, the full error listing is here: https://pastebin.com/2E0CT7jx
<agentzh>
not sure if it's helpful.
<fche>
thanks
<agentzh>
sure
<agentzh>
seems like it includes ./arch/x86/include/asm/unistd.h instead.
wcohen has joined #systemtap
<agentzh>
fche: the compilation error thing seems like a different problem from those rd_syscall and tp_syscall test failures. not seeing the compilation error in those tests' systemtap.log files at all.
<fche>
yeah
<fche>
they should each work though
<agentzh>
*nod*
<agentzh>
rd_syscall tests also have some random failures
<agentzh>
like "^alarm: nanosleep \(\[1.000000000\], [x0-9a-fA-F]+\) = 0"
<agentzh>
what does "rd" stand for, btw?
<fche>
maybe was supposed to be 'nd' non-dwarf
<agentzh>
i see. thanks. i'll add it to my own notes :)
<agentzh>
i'm seeing the "ERROR: probe overhead exceeded threshold" error in some of the syscall tests, like "systemtap.examples/profiling/syscalllatency run" and "systemtap.examples/profiling/functioncallcount run" in testsuite/systemtap.examples/check.exp.
<agentzh>
not sure if it's a separate issue.
<fche>
yes, separate
<fche>
your make backtrace from 90 mins ago btw is probably not the most helpful one. Was there more than one /usr/bin/make alive at the time?
brolley has left #systemtap [#systemtap]
wcohen has quit [Ping timeout: 264 seconds]
orivej has joined #systemtap
<agentzh>
fche: nope, only one make process was hanging.
<agentzh>
fche: are there any changes needed for my quit() patch sent earlier?
<fche>
I'd like to hear other folks' opinion on that one. one little nit - tool-shedding - a variable with as global impact as this one should have a more aristocratic name, and probably sit beside the overall script state one we use to track RUNNING / STOPPING / etc. state
<agentzh>
fche: i was just copying what the existing _stp_exit_flag global variable does in the existing stap runtime.
<agentzh>
if you have any concrete naming suggestions, please let me know.
<fche>
how about __stp_abort_flag, and s/quit/abort/ to further indicate its impact ?
<agentzh>
fche: your call :)
<agentzh>
it indeed looks more alarming.
<agentzh>
you want me to send another patch for it?