fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
sscox has quit [Ping timeout: 276 seconds]
sscox has joined #systemtap
orivej has quit [Ping timeout: 258 seconds]
sscox has quit [Ping timeout: 276 seconds]
sscox has joined #systemtap
hpt has joined #systemtap
yogananth has joined #systemtap
_whitelogger has joined #systemtap
yogananth has quit [Ping timeout: 246 seconds]
yogananth has joined #systemtap
yogananth_ has joined #systemtap
yogananth has quit [Read error: Connection reset by peer]
yogananth_ has quit [Ping timeout: 258 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 245 seconds]
orivej has joined #systemtap
_whitelogger has joined #systemtap
orivej has quit [Ping timeout: 245 seconds]
gromero has quit [Ping timeout: 276 seconds]
hpt has quit [Ping timeout: 244 seconds]
yogananth has joined #systemtap
tromey has joined #systemtap
gromero has joined #systemtap
orivej has joined #systemtap
amerey has joined #systemtap
orivej has quit [Ping timeout: 258 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 245 seconds]
orivej has joined #systemtap
orivej has quit [Read error: Connection reset by peer]
yogananth has quit [Remote host closed the connection]
orivej has quit [Ping timeout: 246 seconds]
yakiza has quit [Quit: WeeChat 2.4]
orivej has joined #systemtap
<agentzh>
hi folks, i'm having a strange issue with calling __stp_get_mm_path() via a tapset function in the context of timer.profile: https://pastebin.com/zWBbSysT
<agentzh>
when running the stap script repeatedly, the CPU lockup would happen.
<agentzh>
do i need any kind of locking here? like task_lock()/task_unlock()?
<agentzh>
adding task_lock/task_unlock around that snippet seems to make it deadlock much more quickly.
<agentzh>
any hints on this please? many thanks!
<fche>
from a timer.profile, you won't be able to take locks legally
<fche>
so no wonder __stp_get_mm_path() causes you problems
<fche>
it is generally best to assume it is *unsafe* to call any kernel or stap runtime C function unless you argue/prove it safe first
<fche>
consider gathering that info from other places, like an execve probe or such
<agentzh>
ah, interesting. but it seems it is safe to call execname() in timer.profile? because it is simple enough?
<agentzh>
thanks for the info!
<agentzh>
i've been fighting against this for hours. alas.
<agentzh>
fche: by execve probes, you mean something like probe kprocess.exec?
<agentzh>
maybe one way of doing this is to record the pids for the task mm paths, and then simply checking pids in timer.profile?
<agentzh>
*record the pids in probe kprocess.exec
<fche>
yes, something like that
<fche>
tapset functions are designed to be safe from all contexts; those that we know aren't are marked with /* guru */ or something like that
<fche>
so yes absolutely, go use execname() anywhere you like
<fche>
(we rely on kernel guarantees to make this safe)
<fche>
but: don't improvise with custom kernel or stap runtime calls!
<fche>
or else you'll be fighting against this for hours, alas. :-) :-)
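[editor's note] A minimal sketch of the approach discussed above: record the executable path at exec time, keyed by pid, and do only a map lookup inside timer.profile. It assumes the `filename` variable that the kprocess.exec tapset probe documents (its exact form, e.g. quoting, may vary by SystemTap version); the global names `exe_path` and `hits` and the cleanup policy are hypothetical, and the script has not been run against any particular kernel.

```systemtap
global exe_path   # pid -> executable path, recorded at exec time (hypothetical name)
global hits       # executable path -> number of profile samples (hypothetical name)

probe kprocess.exec {
    # Process context, so only tapset-provided values are used here.
    # This fires before the exec completes, so a failed exec can leave
    # a stale path behind; acceptable for a sketch.
    exe_path[pid()] = filename
}

probe kprocess.exit {
    # Assumed cleanup policy: forget the pid when its main thread exits.
    if (pid() == tid())
        delete exe_path[pid()]
}

probe timer.profile {
    # No embedded C and no kernel or runtime calls: just a map lookup.
    p = pid()
    if (p in exe_path)
        hits[exe_path[p]] <<< 1
}

probe end {
    foreach (path in hits)
        printf("%8d %s\n", @count(hits[path]), path)
}
```

Processes that were already running before the script started never pass through kprocess.exec, so they are simply skipped by the timer.profile handler in this sketch.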
<agentzh>
heh, indeed.
<agentzh>
i wonder if you guys have wonderful ways to debug such deadlocks or cpu lockups. they are scary :)
<agentzh>
or tools?
<fche>
closest thing is to run with a lockdep-enabled kernel like fedora rawhide
<fche>
but ideally avoid all this by not using custom embedded-C and/or not calling unknown-safety kernel or runtime code from there
<agentzh>
i see. thanks!
<agentzh>
fche: oh btw, i've just noticed that @vma() always returns 0 on kernel 5.1 or 5.0. haven't investigated myself yet. is that a known problem?
<agentzh>
sorry, not always returning zero, just seems like returning a wrong address on PIE.
<agentzh>
i'll try create a proper PR.
<agentzh>
*creating
wcohen has joined #systemtap
lindi- has quit [Ping timeout: 250 seconds]
tromey has quit [Quit: ERC (IRC client for Emacs 26.1)]
<agentzh>
not sure what the real cause is. help needed :)
<fche>
will need to wait till next week, I'm afraid
<fche>
I'd compare a successful run's trace on f28 to this one on f29
<fche>
(possibly kernel version dependent, so f29 older kernel may work fine too)
<fche>
so yeah, diff the traces is where I'd start
<fche>
sounds a little bit familiar from a vaguely similar problem we had a few years (?) ago, whereby changes in ld.so loading policy or precise executable segment flags & layout caused a change in the way the kernel loaded the parts of the program
<fche>
and the stap runtime wouldn't recognize the later-than-usual loaded areas as part
<agentzh>
i tried earlier kernel on fedora 29, same thing. so it might not be a kernel issue. more likely a toolchain issue like ld.so.
<agentzh>
*older kernels
<agentzh>
no worries. i can surely wait :)
<agentzh>
also tried the latest master of elfutils with stap, still the same thing.
<agentzh>
i compared the traces between fed29 and centos7. task->user makes the map lookup fail.
<agentzh>
pid matches though.
<agentzh>
more details are in that PR already :)
<fche>
yeah I'd focus on the order & way in which the a.out binary is mapped piecemeal into the address space