#systemtap on 2018-08-27 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

01:16 _whitelogger has joined #systemtap

02:28 _whitelogger has joined #systemtap

03:19 _whitelogger has joined #systemtap

03:25 _whitelogger has joined #systemtap

04:13 _whitelogger has joined #systemtap

06:49 <agentzh> fche: yes, it was only in the dyninst case.

06:53 orivej has joined #systemtap

08:02 slowfranklin has joined #systemtap

08:09 pwithnall has joined #systemtap

11:05 orivej has quit [Ping timeout: 264 seconds]

12:26 <fche> agentzh, hm, what do you think about adding an alarm() call into stapdyn mutator.cxx code, line 640ish, to impose a timeout on dyninst shutdown?

12:52 slowfranklin has quit [Quit: slowfranklin]

13:02 orivej has joined #systemtap

13:43 orivej has quit [Ping timeout: 272 seconds]

14:02 brolley has joined #systemtap

14:56 orivej has joined #systemtap

16:08 orivej has quit [Ping timeout: 244 seconds]

17:18 <agentzh> fche: that sounds a bit unsafe to me, since if it indeed fails to remove the instrumentation from the target process, the target process may be left in a bad state and vulnerable to segfaults, etc.

17:18 <agentzh> we observed such things before, though, with bugs in gdb.

17:19 zodbot has quit [Disconnected by services]

17:21 <agentzh> fche: see https://sourceware.org/bugzilla/show_bug.cgi?id=23175 for example.

17:24 <agentzh> i hit a gnu make 4.2.1 bug last night when running stap's official test suite. the jobserver impl in gmake just hangs upon exiting while waiting for a broken pipe forever (or maybe it looks like a kernel/glibc bug, not sure).

17:24 <agentzh> the make process just hangs there forever.

17:24 <agentzh> seems like properly shutting down is indeed tricky for certain software :)

17:25 <agentzh> i had to kill gmake myself.

17:26 * fche has long ago reported a gnu-make bug that I managed to trigger with stap testsuite interrupts

17:26 <agentzh> fche: for a related thing, i also observed a leftover stap-serverd process after running the full stap test suite.

17:26 <fche> it was (is!) a signal unsafe function call in a signal handler

17:27 <agentzh> fche: oh, in dyninst or stapdyn?

17:27 <fche> gnumake

17:27 <agentzh> ah, okay...you saw that too?

17:28 <agentzh> okay, saw your earlier message :)

17:29 <agentzh> fche: are your testbots for stap seeing random test successes and failures for certain test cases?

17:30 <agentzh> i'm seeing inconsistent failures/passes across different runs of the test suite on the same machine against the same stap installation.

17:30 <agentzh> it would be great if you can share some sample test reports of your testbots for the current master.

17:31 zodbot has joined #systemtap

17:31 <agentzh> preferably from fedora testbots (if any).

17:31 <agentzh> i'm using -j4 for the test run.

17:31 <agentzh> -j5 would lead to machine lockup sometimes and -j6 would always lock.

17:31 <fche> https://web.elastic.org/~dejazilla/viewsummary.php?_sort=1A has an ingested form of data

17:32 <agentzh> no matter how much RAM I add to the VM.

17:32 <fche> (don't y'all hit it at once!) :)

17:32 <agentzh> and no matter how frequently I enforce memory compaction in the background while running the tests.

17:33 <agentzh> oh, that link is slow :)

17:34 <agentzh> finally opened.

17:34 <fche> the part that could be useful to you would be to compare adjacent runs on the same platform ("cmp")

17:36 <agentzh> appreciate it.

17:36 <fche> but immediately you see that the numbers fluctuate from run to run

17:36 <agentzh> where can i see the git commit SHA?

17:36 <agentzh> the "rg" column?

17:36 <fche> 'versions' column ... commit release-XX-YY-gHEXCODE

17:36 <agentzh> oh, i see. thanks.

17:37 <fche> https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1099164 <-- btw

17:37 <fche> maybe should reopen

17:38 <fche> hm, four year old bug

17:55 <agentzh> fche: heh

17:55 <agentzh> thanks for the pointer.

17:56 <agentzh> fche: btw, i'm seeing quite a few syscall test failures while running the stap test suite on kernel 4.16.16, which do not appear in your 4.16.13 testbots' reports: https://pastebin.com/3h9jFqr2

17:56 <agentzh> does it ring a bell?

17:56 <agentzh> or is it similar to the 4.17 syscall breakage?

18:07 <agentzh> fche: a full PR: https://sourceware.org/bugzilla/show_bug.cgi?id=23577

18:08 <agentzh> the stap version reported in this PR is not the current master, but i saw exactly the same failures with the current master.

18:09 <agentzh> btw, the gmake backtrace i saw was different from yours in that ticket. mine was like this: https://pastebin.com/FJTuu2Fb

18:15 <agentzh> can i ask what "tp" stands for in "tp_syscall"?

18:17 <fche> tracepoint

18:17 <agentzh> i see. thanks.

18:17 <agentzh> is there any special sysctl/kernel configs require to enable tracepoints in fed27?

18:17 <agentzh> *required

18:19 <agentzh> seems like tp_syscall is failing completely on my system.

18:19 <agentzh> or just absent.

18:26 <fche> stap -L 'kernel.trace("*sys*")' <-- says what?

18:26 <fche> kernel.trace("raw_syscalls:sys_enter") $regs:struct pt_regs* $id:long int

18:26 <fche> kernel.trace("raw_syscalls:sys_exit") $regs:struct pt_regs* $ret:long int

18:26 <fche> are the bits the tp_syscall machinery uses

18:29 <agentzh> fche: it says this: https://pastebin.com/bpjHZdyC

18:29 <agentzh> a few lines, but not many.

18:30 <agentzh> seems like my box does have raw_syscalls:sys_enter and raw_syscalls:sys_exit.

18:30 <fche> ok, that's good

18:30 <fche> would be surprised if that were not so

18:31 <agentzh> then any hints on debugging all those tp_syscall test failures?

18:31 <fche> try a stap -e 'probe tp_syscall.read { print(argstr) }'

18:31 <fche> and figure out the errors (if any) with more -v's

18:32 pwithnall has quit [Ping timeout: 252 seconds]

18:33 <agentzh> on it.

18:36 <agentzh> fche: got a weird gcc compilation error when running that stap command: https://pastebin.com/he0psEbp

18:36 <agentzh> it says ‘__NR_compat_read’ undeclared

18:36 <fche> hm do you have a make-install'd copy of stap?

18:36 <fche> x86-64 ?

18:37 <agentzh> yes, it was an installed copy.

18:37 <agentzh> under /opt/stap/

18:37 <agentzh> it's x86_64

18:37 <agentzh> uname says 4.16.16-200.fc27.x86_64

18:38 <agentzh> and it is the only kernel installed.

18:38 <agentzh> i removed all the other kernel versions.

18:40 <fche> the __NR_compat_read should show up in the installed runtime/linux/compat_unistd.h file

18:54 <agentzh> fche: seems like the generated .c file does not include that header.

18:55 <fche> I bet you're running /usr/bin/stap rather than your freshly-built /opt copy of stap

18:55 <fche> hm maybe not

18:55 <agentzh> fche: if i manually patch the generated .c file to include that header (by adding #include "linux/compat_unistd.h"), then i would get a lot of ""__NR_execveat" redefined [-Werror]" errors.

18:56 <agentzh> i always use the absolute paths to invoke stap.

18:56 <agentzh> and i don't have the system stap package installed.

18:56 <agentzh> and there's no /usr/bin/stap or /bin/stap.

19:01 <fche> yeah

19:01 <fche> looking into the problem now, there is something inappropriate going on

19:05 <agentzh> fche: thanks!

19:05 <agentzh> after patching that #include into the generated .c, the full error listing is here: https://pastebin.com/2E0CT7jx

19:05 <agentzh> not sure if it's helpful.

19:05 <fche> thanks

19:06 <agentzh> sure

19:06 <agentzh> seems like it includes ./arch/x86/include/asm/unistd.h instead.

19:12 wcohen has joined #systemtap

19:14 <agentzh> fche: the compilation error thing seems like a different problem from those rd_syscall and tp_syscall test failures. not seeing the compilation error in those tests' systemtap.log files at all.

19:15 <fche> yeah

19:15 <fche> they should each work though

19:15 <agentzh> *nod*

19:16 <agentzh> rd_syscall tests also have some random failures

19:16 <agentzh> like "^alarm: nanosleep $\[1.000000000\], [x0-9a-fA-F]+$ = 0"

19:16 <agentzh> what does "rd" stand for, btw?

19:17 <fche> maybe was supposed to be 'nd' non-dwarf

19:20 <agentzh> i see. thanks. i'll add it to my own notes :)

19:26 <agentzh> i'm seeing the "ERROR: probe overhead exceeded threshold" error in some of the syscall tests, like "systemtap.examples/profiling/syscalllatency run" and "systemtap.examples/profiling/functioncallcount run" in testsuite/systemtap.examples/check.exp.

19:26 <agentzh> not sure if it's a separate issue.

19:28 <fche> yes, separate

19:30 <fche> your make backtrace from 90 mins ago btw is probably not the most helpful one. Was there more than one /usr/bin/make alive at the time?

21:47 brolley has left #systemtap [#systemtap]

22:05 wcohen has quit [Ping timeout: 264 seconds]

22:09 orivej has joined #systemtap

22:22 <agentzh> fche: nope, only one make process was hanging.

22:23 <agentzh> fche: are there any changes needed for my quit() patch sent earlier?

22:37 <fche> I'd like to hear other folks' opinion on that one. one little nit - tool-shedding - a variable with as global impact as this one should have a more aristocratic name, and probably sit beside the overall script state one we use to track RUNNING / STOPPING / etc. state

22:47 <agentzh> fche: i was just copying what the existing _stp_exit_flag global variable does in the existing stap runtime.

22:48 <agentzh> if you have any concrete naming suggestions, plese let me know.

22:48 <agentzh> *please

22:52 <fche> how about __stp_abort_flag, and s/quit/abort/ to further indicate its impact ?

23:00 <agentzh> fche: your call :)

23:00 <agentzh> it indeed looks more alarming.

23:00 <agentzh> you want me to send another patch for it?

23:01 <agentzh> v2 patch i mean :)

23:06 <agentzh> fche: btw, are you okay with this one-line change regarding the ternary '?:' operator precedence fix? https://sourceware.org/ml/systemtap/2018-q3/msg00129.html