fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
<agentzh> fche: i've made stap_symbols.c actually compile: https://pastebin.com/gEWEjfGJ
<agentzh> this extra diff at least keep the old shape i think.
<agentzh> *keeps
<agentzh> does it look better?
<fche> yup, thanks, ship it
<agentzh> cool, thanks
<irker705> systemtap: yichun systemtap.git:refs/heads/master * release-4.2-13-gb243981 / buildrun.cxx main.cxx runtime/dyninst/linux_defs.h runtime/sym.h session.h translate.cxx translator-output.cxx translator-output.h: translate.cxx: Make stap-symbols.h a separate CU. http://tinyurl.com/wsuep3z
<agentzh> fche: got any hints for https://sourceware.org/bugzilla/show_bug.cgi?id=25290 ?
<agentzh> the process(EXE).begin thing.
<fche> hm, must find the part of the runtime that does the initial scan of the task list
<fche> in order to populate that process().begin callback
<fche> runtime/linux/task_finder2.c - drsmith's baby from way back - is closely related
<agentzh> cool
<fche> stap_start_task_finder()
<agentzh> k
<fche> we could use some diagnostic magic over in that function
<agentzh> okay
<agentzh> dbug_xxx huh?
<fche> worth a shot
<agentzh> will do.
<fche> there is a dbug_task_vma already in use around there
<fche> just not enough - for the initial traversal
<agentzh> got it
<agentzh> fche: btw, we do have a patch to add a --include FILE option to stap which loads the specified user tapset files ONLY. would you be interested?
<agentzh> the idea is to load tapset lib and macro files only on demand in this mode.
<agentzh> can save up to 700ms in Pass 1 and Pass 2.
<agentzh> kinda similar to gcc's -include FILE option in some way.
<agentzh> currently stap always loads all the tapset files which is quite wasteful.
<fche> it does a fair bit less work now than it used to (4.1 and 4.2 methinks)
<fche> particularly in not pass-2-processing functions/probes that are not actually referenced
<fche> another way to accomplish such fine-grained control is to have a whole separate env SYSTEMTAP_TAPSET=/path/ directory, which has only the subset you know you need
<agentzh> yeah, i noted that change but it's still quite slow. the SYSTEMTAP_TAPSET approach works but is not very flexible; still need to copy files around.
<fche> or symlink
<agentzh> yeah, symlinks are better.
<fche> finding the right --include=foo --include=bar flags would prereq about as much work as actually creating an alternate tapset directory
hpt has joined #systemtap
<agentzh> fche: we've already done that part ourselves and pre-built a database for it with all the dep relationships. but yeah, it may be beyond the scope of the stap project itself.
<agentzh> similar to --use-user-stapconf FILE
<agentzh> fche: we're sponsoring Linaro's developers to work on some patches for stap.
<agentzh> we'll have more patches soon :)
orivej has quit [Ping timeout: 258 seconds]
sscox has joined #systemtap
orivej has joined #systemtap
irker705 has quit [Quit: transmission timeout]
_whitelogger has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
khaled has joined #systemtap
hpt has quit [Ping timeout: 265 seconds]
mjw has joined #systemtap
sscox has quit [Ping timeout: 248 seconds]
orivej has joined #systemtap
wcohen has quit [Ping timeout: 252 seconds]
agentzh has quit [Ping timeout: 255 seconds]
agentzh has joined #systemtap
agentzh has quit [Changing host]
agentzh has joined #systemtap
sscox has joined #systemtap
orivej has quit [Ping timeout: 258 seconds]
wcohen has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
amerey has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 265 seconds]
tromey has joined #systemtap
orivej has joined #systemtap
amerey has quit [Ping timeout: 250 seconds]
<agentzh> patch for adding more debugging logs to task_finder2.
<agentzh> unfortunately it does not reveal anything unusual. the output for a good run and a bad run is the same.
<agentzh> exactly the same.
<agentzh> it's just that the bad runs never trigger the process(EXE).begin probe handler.
<fche> methinks it's just missing that one magic tracing spot
<agentzh> the relevant output logs for the target process 11816 is here: https://pastebin.com/uy3hWHCL
irker265 has joined #systemtap
<agentzh> yeah, maybe. is there any other spots i should check?
<irker265> systemtap: sapatel systemtap.git:refs/heads/master * release-4.2-14-gfbf9a32 / bpf-opt.cxx: PR25298: stapbpf unused blocks may cause segfault http://tinyurl.com/s42kokk
<agentzh> also, where can i find docs and code for those functions like utrace_set_events, utrace_control, and utrace_attach_task? i thought utrace is already replaced by uprobes in modern kernels.
<agentzh> or they were emulated somehow by ptrace or uprobes?
<agentzh> *they are
<agentzh> they look like magic to me.
<agentzh> more aligned with gdb's "attach" way it seems.
<fche> utrace is not replaced by uprobes
<fche> utrace was a proposed kernel-side api for building uprobes and other things
<fche> kernel later grew uprobes equivalent apis without utrace
<fche> so stap implements a baby utrace based on tracepoints etc., for those 'other things' that we needed still
<fche> yes very much more of a gdb debugging level
serhei has quit [*.net *.split]
serhei has joined #systemtap
amerey has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
<agentzh> fche: ah, that's interesting.
<agentzh> thanks for the info.
<agentzh> now it seems that the utrace apis do register the probe handler callbacks successfully, but for some reason that i don't know, the callback is never fired in some runs.
<agentzh> am i mssing anything here?
<agentzh> *missing
<agentzh> any suggestions for further debugging would be much appreciated :)
<fche> would put tracing into __stp_call_callbacks()
<agentzh> k, will do
yogananth_ has joined #systemtap
<agentzh> thanks
<fche> were you able to test it on newer kernels btw?
<agentzh> yes sure
<agentzh> like 5.0
yog_ has quit [Ping timeout: 248 seconds]
<fche> aha and same behaviour?
yogananth has quit [Ping timeout: 265 seconds]
yog_ has joined #systemtap
<agentzh> same behavior
<agentzh> just less frequently.
<agentzh> also tried 4.15 kernel from ubuntu. same thing.
<agentzh> kernel 5.0 is from fedora 28
<agentzh> this issue has been bothering me for quite a while. alas.
<agentzh> it can be reproduced by a much simpler stap script (a stap oneliner): https://sourceware.org/bugzilla/show_bug.cgi?id=25290#c1
<agentzh> added a new comment to that PR.
<agentzh> everyone should be able to reproduce it easily.
<fche> I suspect the misbehavior is centered on the way that this part of the systemtap runtime must wait for the kernel to quiesce the target process
<agentzh> i have that feeling as well.
<agentzh> tracing the __stp_call_callbacks thing atm
tromey has quit [Quit: ERC (IRC client for Emacs 26.2)]
orivej has joined #systemtap
<agentzh> fche: hmm, it seems like the callback is invoked successfully even in bad runs.
<agentzh> i put my stap patch (for new dbug_task calls) and the bad run logs here: https://gist.github.com/agentzh/ad8b9705a3c10efb889b5c4ccda7a2d5
<agentzh> these new dbug_task calls make the problem much easier to reproduce.
sscox has quit [Ping timeout: 268 seconds]
<agentzh> fche: okay, i further narrowed it down to the stap_utrace_probe_handler() function.
<agentzh> it skipped calling '(*p->probe->ph) (c);' somehow.
<agentzh> the stap_utrace_probe_handler() function is indeed entered.
<agentzh> fche: okay, i nailed it down. it is because the condition in 'if (atomic_read (session_state()) != STAP_SESSION_RUNNING)' is true, thus skipping the probe handler altogether.
wcohen has quit [Ping timeout: 268 seconds]
irker265 has quit [Quit: transmission timeout]
orivej has quit [Ping timeout: 258 seconds]
<fche> hm that sounds like a race condition
<fche> we may set that flag a little too late
<fche> maybe the condition needs to accept STAP_SESSION_STARTING too
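[editor's sketch] The gate discussed above can be modelled in a few lines of user-space C. This is only an illustration of the idea, not the SystemTap runtime code: the MODEL_* states, model_state, strict_gate(), relaxed_gate() and count_hit() are hypothetical names invented here, and C11 atomics stand in for the kernel's atomic_read(). It shows why a handler that arrives while the session is still "starting" is dropped by a RUNNING-only check, and how also accepting the starting state would let it through.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the session states named in the log
   (STAP_SESSION_STARTING, STAP_SESSION_RUNNING); the third state and
   all numeric values are assumptions made for this model. */
enum model_session_state {
    MODEL_SESSION_STARTING,
    MODEL_SESSION_RUNNING,
    MODEL_SESSION_STOPPING
};

/* Shared session state, read by every probe dispatch. */
atomic_int model_state = MODEL_SESSION_STARTING;

/* A probe handler in this model just bumps a counter. */
typedef void (*probe_handler_fn)(int *hits);

void count_hit(int *hits) { ++*hits; }

/* Strict gate: same shape as the check quoted in the log.
   Anything other than RUNNING skips the handler. */
bool strict_gate(probe_handler_fn ph, int *hits)
{
    assert(ph != NULL);
    if (atomic_load(&model_state) != MODEL_SESSION_RUNNING)
        return false;            /* handler silently skipped */
    ph(hits);
    return true;
}

/* Relaxed gate: the variant floated in the log, which also lets
   the handler run while the session is still STARTING. */
bool relaxed_gate(probe_handler_fn ph, int *hits)
{
    assert(ph != NULL);
    int s = atomic_load(&model_state);
    if (s != MODEL_SESSION_RUNNING && s != MODEL_SESSION_STARTING)
        return false;
    ph(hits);
    return true;
}
```

In this model, calling strict_gate(count_hit, &hits) while model_state is still MODEL_SESSION_STARTING returns false and leaves hits untouched, which matches the "bad run" symptom; relaxed_gate() fires the handler in the same situation. Whether relaxing the real check is safe depends on what else the runtime assumes during startup, which this sketch does not model.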
wcohen has joined #systemtap