fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
amerey has quit [Remote host closed the connection]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 265 seconds]
derek0883 has joined #systemtap
<linus2> int fun1() {fun2(); fun3(); fun4();}
<linus2> function("fun1").callees(1) means the probe triggers on the calls fun1->fun2, fun1->fun3 and fun1->fun4 every time fun1 is executed, but does not trigger on fun8->fun2, fun8->fun3 or fun8->fun4, right?
<fche> linus2, correct
<linus2> fche: cool
<linus2> I have a test
hpt has joined #systemtap
<linus2> fche: why does this give an error? probe process(@1).function("*@*config*").callees(1).return
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 240 seconds]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 244 seconds]
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 265 seconds]
orivej has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
orivej has quit [Ping timeout: 264 seconds]
khaled has joined #systemtap
mjw has joined #systemtap
hpt has quit [Ping timeout: 240 seconds]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 244 seconds]
gromero has quit [Ping timeout: 272 seconds]
orivej has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 240 seconds]
<linus2> fche: ?
<linus2> why does this give an error? probe process(@1).function("*@*config*").callees(1).return
amerey has joined #systemtap
sscox has quit [Ping timeout: 258 seconds]
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
sscox has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
orivej has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
derek0883 has quit [Remote host closed the connection]
amerey has quit [Remote host closed the connection]
derek0883 has joined #systemtap
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
derek0883 has quit [Ping timeout: 240 seconds]
amerey has quit [Quit: Leaving]
orivej has quit [Ping timeout: 258 seconds]
<agentzh> fche: howdy!
<fche> YO
<agentzh> we ran into stap errors like "ERROR: utrace_control returned error -3 on pid 20427".
<agentzh> 3 is ESRCH
<agentzh> i wonder if it's just that the target process has already quit and been reaped?
<agentzh> should we mute this error for this error code?
<agentzh> we only see this on machines with a lot of cpu cores
<agentzh> 32c/64t
<agentzh> it's in the task finder.
<agentzh> the processes are not really those we are interested in.
<fche> it's not a unique error code; it's generated in a couple of places in stp_utrace.c, apparently related to process lifecycles faster than the code can respond to, something like that
<fche> nothing really a user can do anything about
<fche> I believe the effect is mainly that some tasks will go un-probed even though expected
<fche> perhaps that should be a warning only
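For reference, errno 3 on Linux is ESRCH ("No such process"), and kernel-side code conventionally reports it as the negative value -3. The following is a minimal userspace C sketch of that mapping and of how a vanished process shows up as ESRCH. It is not SystemTap runtime code, and the helper names is_no_such_process and pid_is_gone are hypothetical, chosen only for illustration.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: true if a kernel-style return code (negative errno)
 * means "no such process", e.g. the -3 seen in the utrace_control message. */
int is_no_such_process(int rc)
{
    return rc == -ESRCH;
}

/* Hypothetical helper, intended for pid > 0.
 * Returns 1 if no process with this pid exists (kill() fails with ESRCH),
 * 0 if the pid exists (including when we lack permission to signal it),
 * and -1 on any other error. Signal 0 only performs the existence and
 * permission checks; no signal is delivered. */
int pid_is_gone(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 0;
    if (errno == ESRCH)
        return 1;
    if (errno == EPERM)
        return 0;
    return -1;
}
```

Calling pid_is_gone() on a pid that has exited and been reaped returns 1, which is the userspace analogue of the race described above: the process went away before the code that wanted to act on it got there.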
derek0883 has joined #systemtap
<agentzh> yeah i see a lot of places can produce that error code.
<agentzh> inside the stap runtime.
<agentzh> along the utrace_control() code path.
<agentzh> so it's not a bug in stap or kernel we should be concerned about?
<agentzh> we're also debugging a kernel freeze when running stap on such boxes with many cpu cores.
<agentzh> hopefully this one is not related.
<fche> could see a kernel freeze (some cpu spinning madly) -causing- this sort of downstream utrace error
<agentzh> interestingly the kernel freeze never happens on smaller boxes with many fewer cpu cores even when using exactly the same kernel binary and stap scripts.
<fche> what causes that takes work and/or luck to figure out :)
<agentzh> and the lockdep/debug kernel variants of the same version also found nothing.
<fche> interesting
<agentzh> we also noted the work queue thing may lead to use-after-free panics.
<agentzh> the work queue is still used in the task finder when atomic/interrupt contexts are hit.
<agentzh> but that may be another separate issue.
<fche> AIUI, the very purpose of workqueues is to defer work from atomic/interrupt contexts
derek0883 has quit [Ping timeout: 244 seconds]
derek0883 has joined #systemtap
<agentzh> fche: yeah, but seems like the delayed work uses some data structures which may already have been freed. we will dig deeper.
<fche> thanks
<agentzh> sure
<agentzh> fortunately we can now reliably reproduce the kernel freeze locally in a kvm vm.
<agentzh> we can analyze the kdump out of it with the crash utility.
<agentzh> it was first observed in a customer's big box which is bare metal.
<agentzh> fche: oh btw, sometimes the kernel from certain linux distros may be compiled with a non-default gcc. shall we introduce a --cc=PATH option for stap?
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<fche> -B CC=.....
orivej has joined #systemtap
sscox has quit [Ping timeout: 240 seconds]