fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
amerey has quit [Remote host closed the connection]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 265 seconds]
derek0883 has joined #systemtap
<linus2>
int fun1() {fun2(); fun3(); fun4();}
<linus2>
function("fun1").callees(1) means the probe triggers whenever fun1 is executed and calls fun2, fun3, or fun4 (fun1->fun2, fun1->fun3, fun1->fun4), but not for fun8->fun2, fun8->fun3, fun8->fun4, right?
<fche>
linus2, correct
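[Note: a minimal C sketch of the call graph being discussed, so the .callees(1) behavior can be tried out. Only fun1, fun2, fun3, fun4 and fun8 come from the conversation; the counters, the demo() driver and the noinline attributes are assumptions added for illustration.]

```c
#include <assert.h>
#include <stdio.h>

/* Per-callee call counters, added only so the demo has observable state. */
static int calls_fun2, calls_fun3, calls_fun4;

/* noinline (a GCC/Clang attribute) keeps every callee a real call site. */
__attribute__((noinline)) void fun2(void) { calls_fun2++; }
__attribute__((noinline)) void fun3(void) { calls_fun3++; }
__attribute__((noinline)) void fun4(void) { calls_fun4++; }

/* The caller from the chat: its direct callees are fun2, fun3 and fun4. */
__attribute__((noinline)) int fun1(void) { fun2(); fun3(); fun4(); return 0; }

/* A second, unrelated caller of the same three functions. */
__attribute__((noinline)) int fun8(void) { fun2(); fun3(); fun4(); return 0; }

/* Hypothetical driver: runs both callers once and reports the totals. */
int demo(void)
{
    fun1();
    fun8();
    /* Each callee has now run twice: once under fun1, once under fun8. */
    assert(calls_fun2 == calls_fun3 && calls_fun3 == calls_fun4);
    printf("fun2=%d fun3=%d fun4=%d\n", calls_fun2, calls_fun3, calls_fun4);
    return 0;
}
```

[Given an `int main(void) { return demo(); }` and a build with debuginfo, a probe of the form process(PATH).function("fun1").callees(1) should, per fche's answer, fire only for the three calls made from fun1, even though each callee runs twice in total.]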
<linus2>
fche: cool
<linus2>
I have a test
hpt has joined #systemtap
<linus2>
fche: why does this give an error? probe process(@1).function("*@*config*").callees(1).return
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 240 seconds]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 244 seconds]
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 265 seconds]
orivej has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
orivej has quit [Ping timeout: 264 seconds]
khaled has joined #systemtap
mjw has joined #systemtap
hpt has quit [Ping timeout: 240 seconds]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 244 seconds]
gromero has quit [Ping timeout: 272 seconds]
orivej has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 240 seconds]
<linus2>
fche: ?
<linus2>
why does this give an error? probe process(@1).function("*@*config*").callees(1).return
amerey has joined #systemtap
sscox has quit [Ping timeout: 258 seconds]
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
sscox has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
orivej has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
derek0883 has quit [Remote host closed the connection]
amerey has quit [Remote host closed the connection]
derek0883 has joined #systemtap
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
derek0883 has quit [Ping timeout: 240 seconds]
amerey has quit [Quit: Leaving]
orivej has quit [Ping timeout: 258 seconds]
<agentzh>
fche: howdy!
<fche>
YO
<agentzh>
we ran into stap errors like "ERROR: utrace_control returned error -3 on pid 20427".
<agentzh>
3 is ESRCH
<agentzh>
i wonder if it's just that the target process has already quit and been reaped?
<agentzh>
should we mute this error for this error code?
<agentzh>
we only see this on machines with a lot of cpu cores
<agentzh>
32c/64t
<agentzh>
it's in the task finder.
<agentzh>
the processes are not really those we are interested in.
<fche>
it's not a unique error code; it's generated in a couple of places in stp_utrace.c, apparently related to process lifecycles faster than the code can respond to, something like that
<fche>
nothing a user can really do anything about
<fche>
I believe the effect is mainly that some tasks will go un-probed even though expected
<fche>
perhaps that should be a warning only
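[Note: a minimal sketch of the "warning only" idea, assuming a kernel-style return value (0 on success, negative errno on failure). The enum and both function names are hypothetical and are not taken from stp_utrace.c. As fche points out, ESRCH is not a unique signal, so treating it as "the task is already gone" is itself an assumption.]

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical severity levels; not part of the SystemTap runtime. */
enum severity { SEV_OK, SEV_WARNING, SEV_ERROR };

/* Map a kernel-style return code to a severity.
 * Assumption: -ESRCH means the target task exited before we could act. */
enum severity classify_control_result(int rc)
{
    if (rc == 0)
        return SEV_OK;
    if (rc == -ESRCH)
        return SEV_WARNING;
    return SEV_ERROR;
}

/* Format a report line into buf (left empty on success); returns the severity. */
enum severity report_control_result(int rc, int pid, char *buf, size_t len)
{
    enum severity sev = classify_control_result(rc);

    if (len == 0)
        return sev;
    if (sev == SEV_OK) {
        buf[0] = '\0';
        return sev;
    }
    snprintf(buf, len, "%s: utrace_control returned error %d on pid %d",
             sev == SEV_WARNING ? "WARNING" : "ERROR", rc, pid);
    return sev;
}
```

[With this mapping, the failure agentzh quoted would be reported with a WARNING prefix instead of ERROR, while any other nonzero code would still be reported as an error.]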
derek0883 has joined #systemtap
<agentzh>
yeah i see a lot of places can produce that error code.
<agentzh>
inside the stap runtime.
<agentzh>
along the utrace_control() code path.
<agentzh>
so it's not a bug in stap or kernel we should be concerned about?
<agentzh>
we're also debugging a kernel freeze when running stap on such boxes with many cpu cores.
<agentzh>
hopefully this one is not related.
<fche>
could see a kernel freeze (some cpu spinning madly) -causing- this sort of downstream utrace error
<agentzh>
interestingly the kernel freeze never happens on smaller boxes with many fewer cpu cores even when using exactly the same kernel binary and stap scripts.
<fche>
what causes that takes work and/or luck to figure out :)
<agentzh>
and lockdep/debug kernel variants of the same version also found nothing.
<fche>
interesting
<agentzh>
we also noted the work queue thing may lead to use-after-free panics.
<agentzh>
the work queue is still used in the task finder when atomic/interrupt contexts are hit.
<agentzh>
but that may be another separate issue.
<fche>
AIUI, the very purpose of workqueues is to defer work from atomic/interrupt contexts
derek0883 has quit [Ping timeout: 244 seconds]
derek0883 has joined #systemtap
<agentzh>
fche: yeah, but it seems like the delayed work uses some data structures which may already have been freed. we will dig deeper.
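[Note: a single-threaded userspace sketch of the general hazard described here, where deferred work outlives the data it points to, and of one common remedy: the queued work holds its own reference. This is not the task finder's code and does not use the kernel workqueue API; every name below is hypothetical.]

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object that deferred work will touch later. */
struct target {
    int refcount;
    int pid;
    int visits;
};

/* Counts frees so the demo can observe when the object really goes away. */
static int targets_freed;

struct target *target_new(int pid)
{
    struct target *t = calloc(1, sizeof *t);
    if (t != NULL) {
        t->refcount = 1;   /* the creator's reference */
        t->pid = pid;
    }
    return t;
}

struct target *target_get(struct target *t)
{
    t->refcount++;
    return t;
}

void target_put(struct target *t)
{
    if (--t->refcount == 0) {
        free(t);
        targets_freed++;
    }
}

/* Stand-in for a queued work item. */
struct deferred_work {
    struct target *tgt;
};

/* Queueing takes a reference, so the target cannot be freed underneath us. */
void defer_work(struct deferred_work *w, struct target *t)
{
    w->tgt = target_get(t);
}

/* Running the work uses the target, then drops the work's reference. */
void run_work(struct deferred_work *w)
{
    w->tgt->visits++;
    target_put(w->tgt);
    w->tgt = NULL;
}
```

[If the owner drops its reference between defer_work() and run_work(), the object survives until the work has run. Without the target_get() in defer_work(), the same sequence would touch freed memory.]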
<fche>
thanks
<agentzh>
sure
<agentzh>
fortunately we can now reliably reproduce the kernel freeze locally in a kvm vm.
<agentzh>
we can analyze the kdump out of it with the crash utility.
<agentzh>
it was first observed in a customer's big box which is bare metal.
<agentzh>
fche: oh btw, sometimes the kernel from certain linux distros is compiled with a non-default gcc. shall we introduce a --cc=PATH option for stap?
derek0883 has quit [Remote host closed the connection]