fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
_whitelogger has joined #systemtap
<agentzh> fche: yes, it was only in the dyninst case.
orivej has joined #systemtap
slowfranklin has joined #systemtap
pwithnall has joined #systemtap
orivej has quit [Ping timeout: 264 seconds]
<fche> agentzh, hm, what do you think about adding an alarm() call into stapdyn mutator.cxx code, line 640ish, to impose a timeout on dyninst shutdown?
slowfranklin has quit [Quit: slowfranklin]
orivej has joined #systemtap
orivej has quit [Ping timeout: 272 seconds]
brolley has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 244 seconds]
<agentzh> fche: that sounds a bit unsafe to me: if it indeed fails to remove the instrumentation from the target process, the target process may be left in a bad state, vulnerable to segfaults and so on.
<agentzh> we did observe such things before, though that was with bugs in gdb.
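A rough illustration of the alarm()-style watchdog fche floats above, together with the risk agentzh raises. stapdyn's mutator is C++; this is only a Python sketch of the general pattern, using the standard `signal.alarm` call. The names `run_with_deadline`, `ShutdownTimeout` and `slow_shutdown` are hypothetical and do not come from the SystemTap tree. It assumes a Unix system and must run in the main thread.

```python
import signal
import time


class ShutdownTimeout(Exception):
    """Raised by the SIGALRM handler when the guarded step runs too long."""


def run_with_deadline(step, seconds):
    """Run step(); return True if it finished, False if the alarm fired first."""

    def on_alarm(signum, frame):
        # Raising here interrupts whatever blocking call step() is stuck in.
        raise ShutdownTimeout("step exceeded %d second(s)" % seconds)

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)  # arm the watchdog
    try:
        step()
        return True
    except ShutdownTimeout:
        return False
    finally:
        signal.alarm(0)  # disarm, whether or not the step finished
        signal.signal(signal.SIGALRM, previous)


def slow_shutdown():
    # Hypothetical stand-in for a shutdown step that hangs.
    time.sleep(3)


if __name__ == "__main__":
    print("finished in time:", run_with_deadline(slow_shutdown, 1))
```

On a timeout the caller only learns that the step did not finish. Anything the step had not yet undone (here, instrumentation still in the target) stays as it was, which is the bad state agentzh is worried about.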
zodbot has quit [Disconnected by services]
<agentzh> i hit a gnu make 4.2.1 bug last night when running stap's official test suite. the jobserver impl in gmake just hangs upon exiting while waiting for a broken pipe forever (or maybe it looks like a kernel/glibc bug, not sure).
<agentzh> the make process just hangs there forever.
<agentzh> seems like properly shutting down is indeed tricky for certain software :)
<agentzh> i had to kill gmake myself.
* fche has long ago reported a gnu-make bug that I managed to trigger with stap testsuite interrupts
<agentzh> fche: for a related thing, i also observed a leftover stap-serverd process after running the full stap test suite.
<fche> it was (is!) a signal unsafe function call in a signal handler
<agentzh> fche: oh, in dyninst or stapdyn?
<fche> gnumake
<agentzh> ah, okay...you saw that too?
<agentzh> okay, saw your earlier message :)
<agentzh> fche: are your testbots for stap seeing random test successes and failures for certain test cases?
<agentzh> i'm seeing inconsistent failures/passes across different runs of the test suite on the same machine against the same stap installation.
<agentzh> it would be great if you can share some sample test reports of your testbots for the current master.
zodbot has joined #systemtap
<agentzh> preferably from fedora testbots (if any).
<agentzh> i'm using -j4 for the test run.
<agentzh> -j5 would lead to machine lockup sometimes and -j6 would always lock.
<agentzh> no matter how much RAM I add to the VM.
<fche> (don't y'all hit it at once!) :)
<agentzh> and no matter how frequently I force memory compaction in the background while running the tests.
<agentzh> oh, that link is slow :)
<agentzh> finally opened.
<fche> the part that could be useful to you would be to compare adjacent runs on the same platform ("cmp")
<agentzh> appreciate it.
<fche> but immediately you see that the numbers fluctuate from run to run
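The same kind of adjacent-run comparison can be done locally. This is a minimal sketch, assuming the two runs are available as DejaGnu-style .sum text with one "STATUS: test name" line per result. The helper names and the sample statuses are made up for illustration, and this is not how the testbot's own "cmp" view is implemented.

```python
import re

# DejaGnu result lines look like "PASS: some test name".
RESULT_LINE = re.compile(
    r"^(PASS|FAIL|XPASS|XFAIL|KPASS|KFAIL|UNTESTED|UNSUPPORTED|UNRESOLVED): (.*)$"
)


def parse_sum(text):
    """Map test name -> status; if a name repeats, the last line wins."""
    results = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            status, name = match.groups()
            results[name] = status
    return results


def flapping(run_a, run_b):
    """Return sorted (name, status_a, status_b) for tests whose status differs."""
    a, b = parse_sum(run_a), parse_sum(run_b)
    common = a.keys() & b.keys()
    return sorted((name, a[name], b[name]) for name in common if a[name] != b[name])


if __name__ == "__main__":
    # Made-up sample results; only the test names are taken from this log.
    first = (
        "PASS: systemtap.examples/profiling/syscalllatency run\n"
        "FAIL: systemtap.examples/profiling/functioncallcount run\n"
    )
    second = (
        "FAIL: systemtap.examples/profiling/syscalllatency run\n"
        "FAIL: systemtap.examples/profiling/functioncallcount run\n"
    )
    for name, before, after in flapping(first, second):
        print("%s: %s -> %s" % (name, before, after))
```

Tests that appear in only one of the two runs are ignored here. Listing them separately would be a natural extension.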
<agentzh> where can i see the git commit SHA?
<agentzh> the "rg" column?
<fche> 'versions' column ... commit release-XX-YY-gHEXCODE
<agentzh> oh, i see. thanks.
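The release-XX-YY-gHEXCODE shape looks like `git describe` output: a tag, a count of commits since that tag, and an abbreviated commit hash after the "g". A small sketch for pulling the hash out, under that assumption. The sample string is invented and `parse_describe` is a hypothetical helper name.

```python
import re

# <tag>-<commits since tag>-g<abbreviated hash>, as printed by `git describe`.
DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(version):
    """Return (tag, commits_since_tag, short_sha), or None if the shape differs."""
    match = DESCRIBE.match(version.strip())
    if match is None:
        return None
    return match.group("tag"), int(match.group("count")), match.group("sha")


if __name__ == "__main__":
    # Invented example value, not taken from any testbot report.
    print(parse_describe("release-3.3-42-g0123abc"))
```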
<fche> maybe should reopen
<fche> hm, four year old bug
<agentzh> fche: heh
<agentzh> thanks for the pointer.
<agentzh> fche: btw, i'm seeing quite some syscall test failures while running the stap test suite on kernel 4.16.16, which do not appear in your 4.16.13 testbots' reports: https://pastebin.com/3h9jFqr2
<agentzh> does it ring a bell?
<agentzh> or is it similar to the 4.17 syscall breakage?
<agentzh> the stap version reported in this PR is not the current master, but i saw exactly the same failures with the current master.
<agentzh> btw, the gmake backtrace i saw was different from yours in that ticket. mine was like this: https://pastebin.com/FJTuu2Fb
<agentzh> can i ask what "tp" stands for in "tp_syscall"?
<fche> tracepoint
<agentzh> i see. thanks.
<agentzh> are any special sysctl/kernel configs required to enable tracepoints in fed27?
<agentzh> seems like tp_syscall is failing completely on my system.
<agentzh> or just absent.
<fche> stap -L 'kernel.trace("*sys*")' <-- says what?
<fche> kernel.trace("raw_syscalls:sys_enter") $regs:struct pt_regs* $id:long int
<fche> kernel.trace("raw_syscalls:sys_exit") $regs:struct pt_regs* $ret:long int
<fche> are the bits the tp_syscall machinery uses
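A quick way to check a saved `stap -L` listing for those two tracepoints. This is a sketch that assumes each line has the `kernel.trace("name") $var:type ...` shape shown above. `parse_trace_listing`, `missing_tracepoints` and `REQUIRED` are hypothetical names.

```python
import re

# One listing line: kernel.trace("subsys:event") $var:type $var:type ...
TRACE_LINE = re.compile(r'^kernel\.trace\("(?P<name>[^"]+)"\)\s*(?P<vars>.*)$')

# The two tracepoints fche names as the ones tp_syscall relies on.
REQUIRED = {"raw_syscalls:sys_enter", "raw_syscalls:sys_exit"}


def parse_trace_listing(text):
    """Map tracepoint name -> list of context variable names (without the '$')."""
    found = {}
    for line in text.splitlines():
        match = TRACE_LINE.match(line.strip())
        if match:
            found[match.group("name")] = re.findall(r"\$(\w+):", match.group("vars"))
    return found


def missing_tracepoints(text):
    """Return the required tracepoints that do not appear in the listing."""
    return REQUIRED - parse_trace_listing(text).keys()


if __name__ == "__main__":
    # The two lines quoted in the log above.
    listing = (
        'kernel.trace("raw_syscalls:sys_enter") $regs:struct pt_regs* $id:long int\n'
        'kernel.trace("raw_syscalls:sys_exit") $regs:struct pt_regs* $ret:long int\n'
    )
    print(parse_trace_listing(listing))
    print("missing:", missing_tracepoints(listing))
```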
<agentzh> fche: it says this: https://pastebin.com/bpjHZdyC
<agentzh> a few lines, but not many.
<agentzh> seems like my box does have raw_syscalls:sys_enter and raw_syscalls:sys_exit.
<fche> ok, that's good
<fche> would be surprised if that were not so
<agentzh> then any hints on debugging all those tp_syscall test failures?
<fche> try a stap -e 'probe tp_syscall.read { print(argstr) }'
<fche> and figure out the errors (if any) with more -v's
pwithnall has quit [Ping timeout: 252 seconds]
<agentzh> on it.
<agentzh> fche: got a weird gcc compilation error when running that stap command: https://pastebin.com/he0psEbp
<agentzh> it says ‘__NR_compat_read’ undeclared
<fche> hm do you have a make-install'd copy of stap?
<fche> x86-64 ?
<agentzh> yes, it was an installed copy.
<agentzh> under /opt/stap/
<agentzh> it's x86_64
<agentzh> uname says 4.16.16-200.fc27.x86_64
<agentzh> and it is the only kernel installed.
<agentzh> i removed all the other kernel versions.
<fche> the __NR_compat_read should show up in the installed runtime/linux/compat_unistd.h file
<agentzh> fche: seems like the generated .c file does not include that header.
<fche> I bet you're running /usr/bin/stap rather than your freshly-built /opt copy of stap
<fche> hm maybe not
<agentzh> fche: if i manually patch the generated .c file to include that header (by adding #include "linux/compat_unistd.h"), then i would get a lot of ""__NR_execveat" redefined [-Werror]" errors.
<agentzh> i always use the absolute paths to invoke stap.
<agentzh> and i don't have the system stap package installed.
<agentzh> and there's no /usr/bin/stap or /bin/stap.
<fche> yeah
<fche> looking into the problem now, there is something inappropriate going on
<agentzh> fche: thanks!
<agentzh> after patching that #include into the generated .c, the full error listing is here: https://pastebin.com/2E0CT7jx
<agentzh> not sure if it's helpful.
<fche> thanks
<agentzh> sure
<agentzh> seems like it includes ./arch/x86/include/asm/unistd.h instead.
wcohen has joined #systemtap
<agentzh> fche: the compilation error thing seems like a different problem from those rd_syscall and tp_syscall test failures. not seeing the compilation error in those tests' systemtap.log files at all.
<fche> yeah
<fche> they should each work though
<agentzh> *nod*
<agentzh> rd_syscall tests also have some random failures
<agentzh> like "^alarm: nanosleep \(\[1.000000000\], [x0-9a-fA-F]+\) = 0"
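To see what that expected-output pattern accepts, it can be tried against candidate lines outside the test suite. The test suite itself does its matching in Tcl/expect; this sketch uses Python's `re`, which reads this particular pattern the same way. The sample lines are invented, not taken from any systemtap.log.

```python
import re

# Pattern copied verbatim from the expectation quoted above. The dot in
# "1.000000000" is unescaped, so it matches any single character there.
EXPECTED = re.compile(r"^alarm: nanosleep \(\[1.000000000\], [x0-9a-fA-F]+\) = 0")


def matches_expected(line):
    """True if the line satisfies the quoted expectation."""
    return EXPECTED.search(line) is not None


if __name__ == "__main__":
    # Invented sample lines for illustration only.
    samples = [
        "alarm: nanosleep ([1.000000000], 0x7ffd5e3a1c20) = 0",
        "alarm: nanosleep ([1.000000000], NULL) = 0",
        "alarm: nanosleep ([1.000000000], 0x0) = -1",
    ]
    for line in samples:
        print(matches_expected(line), line)
```

The second argument has to consist only of hex digits and "x", and the return value has to start with 0, so a line that prints the pointer differently or reports an error return would count as a failure.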
<agentzh> what does "rd" stand for, btw?
<fche> maybe was supposed to be 'nd' non-dwarf
<agentzh> i see. thanks. i'll add it to my own notes :)
<agentzh> i'm seeing the "ERROR: probe overhead exceeded threshold" error in some of the syscall tests, like "systemtap.examples/profiling/syscalllatency run" and "systemtap.examples/profiling/functioncallcount run" in testsuite/systemtap.examples/check.exp.
<agentzh> not sure if it's a separate issue.
<fche> yes, separate
<fche> your make backtrace from 90 mins ago btw is probably not the most helpful one. Was there more than one /usr/bin/make alive at the time?
brolley has left #systemtap [#systemtap]
wcohen has quit [Ping timeout: 264 seconds]
orivej has joined #systemtap
<agentzh> fche: nope, only one make process was hanging.
<agentzh> fche: are there any changes needed for my quit() patch sent earlier?
<fche> I'd like to hear other folks' opinion on that one. one little nit - tool-shedding - a variable with as global impact as this one should have a more aristocratic name, and probably sit beside the overall script state one we use to track RUNNING / STOPPING / etc. state
<agentzh> fche: i was just copying what the existing _stp_exit_flag global variable does in the existing stap runtime.
<agentzh> if you have any concrete naming suggestions, please let me know.
<fche> how about __stp_abort_flag, and s/quit/abort/ to further indicate its impact ?
<agentzh> fche: your call :)
<agentzh> it indeed looks more alarming.
<agentzh> you want me to send another patch for it?
<agentzh> v2 patch i mean :)
<agentzh> fche: btw, are you okay with this one-line change fixing the precedence of the ternary '?:' operator? https://sourceware.org/ml/systemtap/2018-q3/msg00129.html