fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
gila has joined #systemtap
_whitelogger has joined #systemtap
_whitelogger has joined #systemtap
slowfranklin has joined #systemtap
slowfranklin has quit [Client Quit]
introom has quit [Ping timeout: 260 seconds]
introom has joined #systemtap
naveen2 has joined #systemtap
naveen has quit [Ping timeout: 240 seconds]
orivej has joined #systemtap
naveen2 has quit [Quit: WeeChat 1.9.1]
naveen has joined #systemtap
naveen2 has joined #systemtap
naveen has quit [Ping timeout: 240 seconds]
slowfranklin has joined #systemtap
naveen2 has quit [Quit: WeeChat 1.9.1]
pwithnall has joined #systemtap
orivej has quit [Ping timeout: 246 seconds]
orivej has joined #systemtap
mjw has joined #systemtap
pwithnall has quit [Quit: pwithnall]
pwithnall has joined #systemtap
naveen has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 245 seconds]
orivej has joined #systemtap
brolley has joined #systemtap
tromey has joined #systemtap
gregwork has joined #systemtap
<gregwork> are there any useful taps for troubleshooting hadoop performance ?
<fche> hi
<fche> I am not aware of any scripts targeted to hadoop per se, but stap has helped with various system- and sometimes jvm-level problems
<gregwork> can you run multiple stap scripts at the same time ?
<fche> certainly
<fche> so what kinds of problems are you seeing, and what brings you to stap ?
<gregwork> oh im involved with a task force here to investigate hadoop perf issues in prod. We have got a static workload we can run over and over. With the data science admins looking at the app, I was going to look at the view from the OS perspective: observe where the kernel thinks the app is spending its time, what syscalls/io etc
<gregwork> im hoping to compare notes as we run through the workload
slowfranklin has left #systemtap [#systemtap]
<fche> aha. yeah, syscalls analysis is a pretty basic use of the tool. also good if one needs to go deeper - device drivers, file systems, etc. internals
<gregwork> i know that strace can be run with -c
<gregwork> to count the syscalls used by a process
<gregwork> is there a stap/deeper equiv
<fche> sure, stap works easily systemwide rather than per-process, and uses a much lower-interference interface to gather the syscall stuff
<fche> there is a strace workalike script in there that you could try just to get the feel
<gregwork> which one is that
<fche> (but note, your kernel version needs to be <= 4.16 for now)
<fche> % stap --example strace.stp should find it
<gregwork> this is rhel7
<gregwork> so its going to be much older than 4.16
<fche> right, things work fine there
<fche> but --example doesn't exist yet, till 7.6
<gregwork> i believe this is 7.5
<fche> so check out /usr/share/systemtap/examples
<fche> or % stap -i
<fche> stap> sample syscall
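[editor's note] For readers who want something closer to `strace -c` than the bundled strace.stp example, a minimal sketch of a system-wide syscall counter follows. It is not one of the shipped examples: the file name syscall_count.stp, the 10-second window, and the top-20 limit are all assumptions chosen for illustration. It relies on the `name` variable that the syscall tapset provides in `syscall.*` probes.

```shell
# Write a small SystemTap script that counts syscalls system-wide.
# File name, duration and output limit are illustrative assumptions.
cat > syscall_count.stp <<'EOF'
global counts

# Count every syscall entry, keyed by syscall name.
probe syscall.* { counts[name]++ }

# Stop after a fixed sampling window (10 seconds, arbitrary).
probe timer.s(10) { exit() }

# Print the 20 most frequent syscalls, highest count first.
probe end {
  foreach (n in counts- limit 20)
    printf("%-24s %d\n", n, counts[n])
}
EOF

# To run it (needs root or stapdev/stapusr group membership):
#   stap syscall_count.stp
```

Start it just before kicking off the static workload so the counts cover the run; several such scripts can run at the same time, as noted above.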
<gregwork> what are your thoughts on oprofile
<gregwork> in comparison to systemtap
<fche> very different feel; well suited to its type of problem
<fche> more similar to perf
pwithnall has quit [Ping timeout: 244 seconds]
<gregwork> is there any integration of systemtap with prometheus or equiv
<fche> get out of my head :)
<fche> git systemtap just gained an exporter glue process that lets stap scripts export data to prometheus (or compatible consumers such as pcp)
fche has quit [Remote host closed the connection]
fche has joined #systemtap
<gregwork> is that capability new new or would the rhel7 version have it
<fche> very new
<fche> maybe future rhel7 will
<fche> but really there isn't that much to it - a small python script that's invoked by systemd
<fche> does not represent any change to core systemtap binaries at all
gila has quit [Quit: Textual IRC Client: www.textualapp.com]
<fche> if you're a customer, opening a bugzilla.redhat.com RFE to backport that function into future rhel7 would be very helpful
<gregwork> i am, but in the mean time i am curious as to what would be the most reasonable way to get this feeding into prom
<gregwork> redhat rfe process is not quick
<fche> for experiment, build your own stap out of git
mjw has quit [Quit: Leaving]
orivej has quit [Ping timeout: 240 seconds]
slowfranklin has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
orivej has joined #systemtap
slowfranklin has joined #systemtap
slowfranklin has quit [Client Quit]
gila has joined #systemtap
<fche> agentzh, some more verbosity and a dyninst version would help with PR23605
<agentzh> fche: replied there :)
tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]
<fche> hm, do I understand correctly that this is intermittent?
<agentzh> right
<fche> is it 'just' an OOM situation perhaps?
<fche> (don't see how a pure userspace independent dyninst job should be affected by anything else, except in terms of resource exhaustion)
<agentzh> i just got a different error with the same test case.
<agentzh> updated the PR ticket.
<agentzh> this time it is an assertion failure.
<agentzh> the VM has plenty of RAM.
<agentzh> 12G in total.
<agentzh> and the only thing running is 8 stap processes.
<fche> right, but depending on what those 8 stap processes are doing, you might run out of ram in the vm
<agentzh> all very trivial scripts like that one.
<agentzh> no real payload at all.
<agentzh> just unit tests.
<agentzh> no big ones.
<agentzh> unlike stap's official test suite.
<agentzh> (which has load tests)
<agentzh> fche: is -vvv enough for debugging that dyninst problem?
<agentzh> or should i do -vvvv?
<fche> the more the merrier generally, as per [man error::reporting]
<fche> w.r.t. dyninst, if the problem is reasonably reproducible, an strace of the process would be helpful
<agentzh> fche: strace on the target process or on the stapdyn process?
<fche> stapdyn mainly
<fche> if you strace the target, dyninst (via ptrace) will fail anyway
<agentzh> seems like i should strace the full stap process tree then.
<fche> the stapdyn one alone would be fine
<fche> though it's not easy, if it's intermittent, so you can't just run stapdyn FOO.so etc.
<agentzh> right.
<agentzh> okay, just got one -vvvv output sample for the assertion failure in dyninst.
<agentzh> will paste it to the PR.
<fche> ok
<agentzh> added the attachment to the PR.
<agentzh> i'll add the strace thing
<agentzh> strace -f stap ... should be enough?
<fche> ideally stap --runtime=dyninst -p4 SOMETHING
<fche> then strace -f stapdyn FOO.so in a loop until it crashes
<agentzh> hmm, seems like i should use mozilla rr to record it instead.
<agentzh> then we could have the whole history.
<agentzh> about everything.
<fche> if you like, you're welcome to try
<fche> still, a live strace (or similar) of the dyninst target executable will cause stapdyn to fail, so must not do that
<agentzh> yeah, i'll try once i have a chance. currently in the middle of something else.
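[editor's note] The "run it in a loop until it crashes" suggestion above can be sketched as a small shell helper. This is only an illustration: `run_until_fail` and the trace file name stapdyn.trace are hypothetical names, FOO.so is the placeholder used in the conversation, and the helper loops forever if the command never fails.

```shell
# Re-run a command until it exits non-zero, then report the attempt number.
# Note: this never returns if the command always succeeds.
run_until_fail() {
  attempt=0
  while "$@"; do
    attempt=$((attempt + 1))
  done
  attempt=$((attempt + 1))
  echo "command failed on attempt $attempt"
}

# Intended use, following the discussion (names are placeholders):
#   stap --runtime=dyninst -p4 SOMETHING    # stop after pass 4, keep the module
#   run_until_fail strace -f -o stapdyn.trace stapdyn FOO.so
```

Because `strace -o` overwrites its output file on each run, stapdyn.trace holds the trace of the failing run when the loop stops. Only stapdyn is traced here, not the target process, which matches the caveat about ptrace above.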
pwithnall has joined #systemtap
brolley has left #systemtap [#systemtap]
<agentzh> fche: i added a new test module in this patch: https://sourceware.org/ml/systemtap/2018-q3/msg00156.html
<agentzh> do you like it?
<agentzh> the good news is that my test generator itself is also simpler :)