fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
khaled has quit [Quit: Konversation terminated!]
hpt has joined #systemtap
sscox has joined #systemtap
hpt has quit [Ping timeout: 246 seconds]
hpt has joined #systemtap
orivej has joined #systemtap
hpt has quit [Ping timeout: 248 seconds]
hpt has joined #systemtap
gromero has quit [Ping timeout: 248 seconds]
hpt has quit [Ping timeout: 245 seconds]
hpt has joined #systemtap
sapatel has quit [Ping timeout: 276 seconds]
yog_ has joined #systemtap
hpt has quit [Ping timeout: 276 seconds]
hpt has joined #systemtap
khaled has joined #systemtap
orivej has quit [Ping timeout: 276 seconds]
higgins` has joined #systemtap
khaled_ has joined #systemtap
hpt has quit [Ping timeout: 245 seconds]
khaled has quit [*.net *.split]
fche has quit [*.net *.split]
wcohen has quit [*.net *.split]
higgins has quit [*.net *.split]
fche has joined #systemtap
wcohen has joined #systemtap
yog_ has quit [Ping timeout: 245 seconds]
mjw has joined #systemtap
sscox has quit [Ping timeout: 248 seconds]
orivej has joined #systemtap
sscox has joined #systemtap
dmalcolm_ has quit [Quit: Leaving]
sapatel has joined #systemtap
orivej has quit [Ping timeout: 265 seconds]
tromey has joined #systemtap
orivej has joined #systemtap
amerey has quit [Quit: Leaving]
amerey has joined #systemtap
sapatel_ has joined #systemtap
sapatel_ has quit [Client Quit]
sapatel has quit [Ping timeout: 248 seconds]
sapatel has joined #systemtap
agentzh has joined #systemtap
<agentzh> hi guys, i noticed a strange issue: sometimes a uprobe is never triggered and stap just hangs forever. how do i debug such things?
tromey has quit [Quit: ERC (IRC client for Emacs 26.2)]
<agentzh> -DDEBUG_UPROBES?
<agentzh> seems like -DDEBUG_UPROBES_RIP might be useful too.
<agentzh> hmm, not very useful. i just got a single message "_stp_handle_start:178: cannot map pid 0 to host namespace pid" regardless of whether it hangs or not.
<fche> hmm
<fche> sysrq-t ?
<fche> if a uprobe can hang something, the kernel's been a bad bad boy
<agentzh> not hanging the target, just staprun itself.
<fche> so an interrupt doesn't stop it?
<agentzh> fche: will look into sysrq-t. thanks
<agentzh> fche: ctrl-c does stop it. just no probes fired.
<fche> sometimes the uprobe removal must wait until some userspace threads pass some particular section, IIRC
<fche> ok
<fche> so not a hang
<fche> just no probes being fired
<agentzh> the target process is calling usleep(1) in a tight loop.
<fche> ok that's a totally different situation :)
<agentzh> and the stap script just probes on usleep function entry.
<fche> and how are you trying to probe that?
<agentzh> so it should always fire very quickly.
<agentzh> probe process("/path/to/libc-xxx.so").function("usleep") {...}
<agentzh> this is my probe.
<agentzh> specifying -x PID on the staprun command line makes it much harder to reproduce.
<fche> stap .... -c CMD ?
<agentzh> removing -x PID makes it very easy to reproduce.
<agentzh> the target process is always running (in an infinite loop).
<agentzh> it's on centos 7.
<agentzh> tried the latest kernel shipped by centos 7, same thing.
<agentzh> also some other versions of the kernel.
<fche> try -DDEBUG_TASK_FINDER_VMA -DDEBUG_SYMS
<agentzh> trying. thanks
<fche> it may be that usleep at userspace is not implemented with that libc function at all
<fche> a macro may map it to something else
<fche> tried gdb'ing it? break usleep and see what function it really is ?
<agentzh> but this is just random.
<agentzh> most of the time it fires correctly.
<agentzh> and with -x PID, it always fires.
<fche> so it may not fire if stap's started without -x & without -c
<fche> yeah I'd try those -DDEBUG thing
<agentzh> right, there's roughly a 1-in-5 chance that it doesn't fire at all.
<agentzh> i ran staprun xxx.ko in a shell loop to test this.
<agentzh> stap just hangs there.
<agentzh> *staprun
<agentzh> the begin probe does fire.
<agentzh> printing out "Start tracing..." (my own message) and then nothing.
<agentzh> trying -DDEBUG -DDEBUG_TASK_FINDER_VMA -DDEBUG_SYMS and gdb
<agentzh> i also saw it hang with -x PID sometimes, without hitting any probes.
<agentzh> just rarer.
<agentzh> just tried gdb and set a breakpoint on usleep.
<agentzh> it keeps hitting that breakpoint in the target process as expected.
<agentzh> "Breakpoint 1, 0x00007f547ddc00b0 in usleep () from /lib64/libc.so.6"
<agentzh> gdb reports.
<agentzh> trying the stap -DXXX options now
<agentzh> ah, that's a lot of output. i'll paste it somewhere. a sec...
<agentzh> for a bad run
<agentzh> this is just a single bad run.
<agentzh> pid 6118 is my target process which keeps calling usleep(1) in an infinite loop.
<fche> so the interesting thing there is to see if the libc.so mmaps are properly identified
<fche> that's a lot of processes
sscox has quit [Ping timeout: 245 seconds]
<agentzh> fche: in __stp_call_mmap_callbacks:... lines?
<fche> yup
<fche> s/-DDEBUG_SYMS/-DDEBUG_SYMBOLS/
<fche> (for runtime/syms.c)
<fche> runtime/sym.c
<agentzh> i compared the __stp_call_mmap_callbacks lines for libc-2.xxx and the 6118 pid for both a good run and a bad run. they are exactly the same.
<agentzh> they both look like this: https://pastebin.com/LJc30HpF
<agentzh> 4 lines
<agentzh> i'll try the DEBUG_SYMBOLS too.
<agentzh> a sec
<agentzh> with -DDEBUG_SYMBOLS: http://openresty.org/download/stap-debug2.txt
<agentzh> for this run, my target process polling usleep(1) has the pid 6382 instead.
<fche> should be seeing reports from _stp_umodule_relocate()
<fche> as the uprobe is being contemplated for registration
<fche> 78 dbug_sym(1, "[%d] %s, %lx\n", tsk->pid, path, offset);
<agentzh> yeah, got "_stp_do_relocation:74: found kernel _stext load address: 0xffffffffa9e00000"
<agentzh> oh, this is different
<agentzh> umodule_relocate
<agentzh> no such line it seems
<fche> yup, wonder if it's a buildid problem
<agentzh> but if it's a build id problem, shouldn't it never work at all, rather than failing randomly?
<agentzh> just wondering
<fche> yes
<fche> nothing's being upgraded under the covers I assume
orivej has quit [Ping timeout: 248 seconds]
<fche> I'd probably add some dbug() type instrumentation to linux/uprobes-inode.c and/or sym.c into those paths to see what's going on
<agentzh> I don't see any output lines matching "_stp_umodule_relocate" in a good run
<agentzh> either
<fche> yeah in general this part needs better diagnostics
<fche> back before inode-uprobes (linux 3.5 era?), we had a pretty systematic probe registration/attempt/unregistration tracing with -DDEBUG_PROBES IIRC
<agentzh> okay, i'll try peeking into the uprobes-inode.c and sym.c files.
<agentzh> thanks for the suggestion.
<agentzh> oh, the utrace era?
<fche> yeah :)
<agentzh> :)
<fche> but yeah, patches to improve these diagnostics would be super welcome
orivej has joined #systemtap
<agentzh> got it
mjw has quit [Ping timeout: 252 seconds]