fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
sscox has quit [Ping timeout: 276 seconds]
sscox has joined #systemtap
orivej has quit [Ping timeout: 258 seconds]
sscox has quit [Ping timeout: 276 seconds]
sscox has joined #systemtap
hpt has joined #systemtap
yogananth has joined #systemtap
_whitelogger has joined #systemtap
yogananth has quit [Ping timeout: 246 seconds]
yogananth has joined #systemtap
yogananth_ has joined #systemtap
yogananth has quit [Read error: Connection reset by peer]
yogananth_ has quit [Ping timeout: 258 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 245 seconds]
orivej has joined #systemtap
_whitelogger has joined #systemtap
orivej has quit [Ping timeout: 245 seconds]
gromero has quit [Ping timeout: 276 seconds]
hpt has quit [Ping timeout: 244 seconds]
yogananth has joined #systemtap
tromey has joined #systemtap
gromero has joined #systemtap
orivej has joined #systemtap
amerey has joined #systemtap
orivej has quit [Ping timeout: 258 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 245 seconds]
orivej has joined #systemtap
orivej has quit [Read error: Connection reset by peer]
orivej has joined #systemtap
orivej has quit [Ping timeout: 245 seconds]
yakiza has joined #systemtap
<yakiza> Hello people, i am trying to build systemtap from source and i am getting the following error: https://defuse.ca/b/9lBICWA6MvpdMp4ttSLXd5
agentzh has joined #systemtap
khaled has joined #systemtap
<fche> yakiza,
<fche> I don't see an error in there
<fche> maybe your paste is incomplete?
gromero has quit [Ping timeout: 250 seconds]
orivej has joined #systemtap
yogananth has quit [Remote host closed the connection]
orivej has quit [Ping timeout: 246 seconds]
yakiza has quit [Quit: WeeChat 2.4]
orivej has joined #systemtap
<agentzh> hi folks, i'm having a strange issue with calling __stp_get_mm_path() via a tapset function in the context of timer.profile: https://pastebin.com/zWBbSysT
<agentzh> when running the stap script repeatedly, the CPU lockup would happen.
<agentzh> do i need any kind of locking here? like task_lock()/task_unlock()?
<agentzh> adding task_lock/task_unlock around that snippet seems to make it deadlock much more quickly.
<agentzh> any hints on this please? many thanks!
<fche> from a timer.profile, you won't be able to take locks legally
<fche> so no wonder __stp_get_mm_path() causes you problems
<fche> it is generally best to assume it is *unsafe* to call any kernel or stap runtime C function unless you argue/prove it safe first
<fche> consider gathering that info from other places, like an execve probe or such
<agentzh> ah, interesting. but it seems it is safe to call execname() in timer.profile? because it is simple enough?
<agentzh> thanks for the info!
<agentzh> i've been fighting against this for hours. alas.
<agentzh> fche: by execve probes, you mean something like probe kprocess.exec?
<agentzh> maybe one way of doing this is to record the pids for the task mm paths in probe kprocess.exec, and then simply check the pids in timer.profile?
<fche> yes, something like that
<fche> tapset functions are designed to be safe from all contexts; those that we know aren't are marked with /* guru */ or something like that
<fche> so yes absolutely, go use execname() anywhere you like
<fche> (we rely on kernel guarantees to make this safe)
<fche> but: don't improvise with custom kernel or stap runtime calls!
<fche> or else you'll be fighting against this for hours, alas. :-) :-)
<agentzh> heh, indeed.
<agentzh> i wonder if you guys have wonderful ways to debug such deadlocks or cpu lockups. they are scary :)
<agentzh> or tools?
<fche> closest thing is to run with a lockdep-enabled kernel like fedora rawhide
<fche> but ideally avoid all this by not using custom embedded-C and/or not calling unknown-safety kernel or runtime code from there
<agentzh> i see. thanks!
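[editor's note] The approach discussed above (record the executable path per pid at exec time, then only do a pid lookup inside timer.profile) could be sketched roughly as follows. This is a minimal sketch, not the script from the pastebin: the global names exe_path and samples are hypothetical, it assumes the `filename` variable provided by the kprocess.exec tapset alias (which may include surrounding quotes depending on the SystemTap version), and it only sees processes that exec after the script starts. It uses no embedded C, so nothing in the profile handler needs to take a lock.

```systemtap
global exe_path   # pid -> executable path, recorded at exec time (hypothetical name)
global samples    # executable path -> number of profile ticks (hypothetical name)

probe kprocess.exec {
    # Runs in ordinary process context, before the exec completes.
    # 'filename' comes from the kprocess.exec tapset alias.
    exe_path[pid()] = filename
}

probe kprocess.exit {
    # Drop the entry when the main thread exits, so the map does not
    # grow without bound and recycled pids do not get stale paths.
    if (tid() == pid())
        delete exe_path[pid()]
}

probe timer.profile {
    # Only plain tapset calls and a map lookup here; no custom
    # kernel or runtime C functions.
    p = pid()
    if (p in exe_path)
        samples[exe_path[p]]++
}

probe end {
    # Print the ten most frequently sampled executables.
    foreach (path in samples- limit 10)
        printf("%8d %s\n", samples[path], path)
}
```

A failed exec leaves a stale path for that pid in this sketch. Recording in kprocess.exec_complete on success instead would avoid that, at the cost of having to obtain the path some other way.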
<agentzh> fche: oh btw, i've just noticed that @vma() always returns 0 on kernel 5.1 or 5.0. haven't investigated myself yet. is that a known problem?
<agentzh> sorry, not always returning zero, just seems like returning a wrong address on PIE.
<agentzh> i'll try creating a proper PR.
wcohen has joined #systemtap
lindi- has quit [Ping timeout: 250 seconds]
tromey has quit [Quit: ERC (IRC client for Emacs 26.1)]
lindi- has joined #systemtap
sscox has quit [Ping timeout: 276 seconds]
wakatana has joined #systemtap
wakatana has quit [Quit: leaving]
sscox has joined #systemtap
dmalcolm has quit [Quit: Leaving]
khaled_ has joined #systemtap
wakatana has joined #systemtap
khaled has quit [Ping timeout: 245 seconds]
wakatana has quit [Quit: leaving]
amerey has quit [Quit: Leaving]
pfallenop has quit [Ping timeout: 245 seconds]
pfallenop has joined #systemtap
<agentzh> fche: OK, the vma tracker is indeed broken on fedora 29: https://sourceware.org/bugzilla/show_bug.cgi?id=24875
<agentzh> always getting address 0x0.
<agentzh> not sure what the real cause is. help needed :)
<fche> will need to wait till next week, I'm afraid
<fche> I'd compare a successful run's trace on f28 to this one on f29
<fche> (possibly kernel version dependent, so an older kernel on f29 may work fine too)
<fche> so yeah, diffing the traces is where I'd start
<fche> sounds a little bit familiar from a vaguely similar problem we had a few years (?) ago, whereby changes in ld.so loading policy or precise executable segment flags & layout caused a change in the way the kernel loaded the parts of the program
<fche> and the stap runtime wouldn't recognize the later-than-usual loaded areas as part of the program
<agentzh> i tried older kernels on fedora 29, same thing. so it might not be a kernel issue. more likely a toolchain issue like ld.so.
<agentzh> no worries. i can surely wait :)
<agentzh> also tried the latest master of elfutils with stap, still the same thing.
<agentzh> i compared the traces between fed29 and centos7. task->user makes the map lookup fail.
<agentzh> pid matches though.
<agentzh> more details are in that PR already :)
<fche> yeah I'd focus on the order & way in which the a.out binary is mapped piecemeal into the address space
<agentzh> interesting, i'll try playing with it.
<agentzh> thanks for the hint.