fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
lickingball has quit [Remote host closed the connection]
<kerneltoast>
fche, agentzh, i believe i spotted the cause of the probe lock lockup
<kerneltoast>
it looks like an out of order lock scenario
<kerneltoast>
where the probe lock and zone->lock inside mm are acquired and released out of order
<kerneltoast>
so they deadlock
<fche>
mmmm don't think we should be -taking- any mm locks during probe handler execution
khaled has quit [Quit: Konversation terminated!]
<fche>
seems like the syscall.exit tracepoint probe handler is being run on cpu0
<kerneltoast>
that's the thing
<kerneltoast>
it happens in an irq
<fche>
hm and on cpu5 it's some other tracepoint being hit
<fche>
don't see the role for mm lock
<kerneltoast>
fche, CPU0 is stuck inside free_pcppages_bulk, trying to acquire zone->lock
<kerneltoast>
CPU5 is further along in free_pcppages_bulk, stuck inside a tracepoint, while having zone->lock held
<fche>
hmmmmm!
<kerneltoast>
CPU0 has the probe lock held, and is waiting to get zone->lock
<kerneltoast>
CPU5 has zone->lock held, and is waiting to get the probe lock
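For illustration, a minimal C sketch of the lock inversion just described; the lock variables are placeholders standing in for the stap probe lock and mm's zone->lock, not the actual symbols:

    /* Hypothetical sketch of the ABBA inversion described above. */
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(probe_lock);  /* stands in for the stap probe lock */
    static DEFINE_SPINLOCK(zone_lock);   /* stands in for mm's zone->lock */

    static void cpu0_path(void)          /* probe handler side */
    {
            spin_lock(&probe_lock);      /* holds the probe lock ... */
            spin_lock(&zone_lock);       /* ... and blocks here on zone->lock */
            spin_unlock(&zone_lock);
            spin_unlock(&probe_lock);
    }

    static void cpu5_path(void)          /* free_pcppages_bulk hitting a tracepoint */
    {
            spin_lock(&zone_lock);       /* holds zone->lock ... */
            spin_lock(&probe_lock);      /* ... and blocks here on the probe lock */
            spin_unlock(&probe_lock);
            spin_unlock(&zone_lock);
    }
    /* Each CPU waits for the lock the other already holds, so neither
     * makes progress: a textbook ABBA deadlock. */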
<fche>
now the stap side locks (stp_probe_lock) are all designed to have timeouts
<kerneltoast>
yes unless you have the timeouts turned off
<fche>
I think this may be a case of "don't turn timeouts off"
<kerneltoast>
that's not a very elegant solution either
<kerneltoast>
we'll just trylock for a while when in reality we can't hold the lock
<fche>
it'd prevent the (apparent) deadlock at least
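For context, a sketch of the bounded-trylock shape being referred to; the loop bound and delay are made-up values, not the runtime's actual timeout:

    /* Hypothetical bounded trylock; constants are illustrative only. */
    #include <linux/spinlock.h>
    #include <linux/delay.h>

    static DEFINE_SPINLOCK(probe_lock);  /* placeholder */

    static int probe_lock_with_timeout(void)
    {
            unsigned int tries = 1000;   /* hypothetical bound */

            while (tries--) {
                    if (spin_trylock(&probe_lock))
                            return 1;    /* lock acquired */
                    udelay(1);
            }
            return 0;                    /* give up rather than spin forever */
    }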
<kerneltoast>
disabling irqs while holding the lock would do it too
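A minimal sketch of that idea, assuming a plain spinlock standing in for the probe lock (not the actual runtime code): with interrupts disabled for the duration of the hold, the timer interrupt can't nest a tracepoint inside the critical section on the same CPU.

    /* Hypothetical: take the probe lock with local interrupts disabled. */
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(probe_lock);  /* placeholder */

    static void locked_section(void)
    {
            unsigned long flags;

            spin_lock_irqsave(&probe_lock, flags);
            /* ... touch probe-shared state ... */
            spin_unlock_irqrestore(&probe_lock, flags);
            /* While irqs are off, the apic_timer_interrupt ->
             * free_pcppages_bulk -> tracepoint path can't run here. */
    }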
<fche>
I'm curious how we get into that apic_timer_interrupt on cpu0 .... we normally block interrupts during probe handlers
<kerneltoast>
i made a patch for this before but you said it was redundant because of the irq disable in the probe handlers, plus context reentrancy protection
<kerneltoast>
yeah i'm curious as well
<kerneltoast>
STP_INTERRUPT lets you toggle the interrupt blocking iirc
<fche>
wonder if that apic_timer_interrupt thing is like the "nmi: cpu stuck in spinlock"
<kerneltoast>
oh i guess the context reentrancy protection didn't help because this is happening on different CPUs
<fche>
we don't use/document that
<fche>
nothing suspicious seems to happen on cpu5: it's just a stap probe handler that happens to run associated with an interrupt handler
<fche>
anything interesting going on at the other cpus?
<kerneltoast>
not really, just some stuckage on the probe lock
<fche>
cpu11
<kerneltoast>
cpu11 is spinning on trying to acquire the lock
<fche>
and 14
<kerneltoast>
yeah same deal
<fche>
and 15
<kerneltoast>
yeeep
<fche>
yeah what gives all those ought to time out etc
<kerneltoast>
unless the test in question has timeouts disabled
<fche>
I don't see that in the tests
<kerneltoast>
it's also possible the backtrace just happened to show them while they were spinning
<kerneltoast>
the real deadlock is that zone->lock is a normal spin_lock
<kerneltoast>
so two cpus are deadbeefed
<fche>
can see how 0 is waiting for 1
<fche>
waiting for 5
<fche>
not seeing how 5 is waiting for anyone
<kerneltoast>
oh you mean if it had the timeout then 5 shoulda given up eventually
<fche>
maybe the machine is not really hung just spinning very very busily
<fche>
yes
<kerneltoast>
and then it spins long enough that everything explodes i suppose
<kerneltoast>
spinning for a while in irqs is bad m'kay
<fche>
not sure I see explosions here, but rather maybe super very crazy slow progress
<kerneltoast>
i couldn't ssh to the machine
<fche>
well ya
<kerneltoast>
re: bulkmode, still getting transport failures. only possible cause is that __stp_relay_subbuf_start_callback() returns 0
<kerneltoast>
seems like we need moar subbuffers
<kerneltoast>
and subwoofers
<fche>
or make them bigger
<kerneltoast>
yeah either way
<kerneltoast>
i don't understand why this subbuffer thingy exists
<kerneltoast>
maybe each subbuffer is kmalloced or something
<kerneltoast>
and making it too big is bad
<kerneltoast>
oh the subbuffers are vmapped
<kerneltoast>
i guess you need multiple subbuffers for flushing purposes
<kerneltoast>
if it takes too long for the reader to flush out a filled subbuffer, you're screwed
<fche>
at least one for the probe to write into
<fche>
and others for userspace to read from
<fche>
screwed = missing some output, at worst, not the worst thing
<kerneltoast>
yeah but i think increasing the subbuffer count is the better option
<kerneltoast>
oh i think i see what's wrong
<kerneltoast>
the subbuffers are way too big, and we only have 8 of them
<kerneltoast>
if stp_print_flush needs to flush out a buffer that isn't full, access to that big unfilled buffer is lost until userspace reads out the data
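For reference, the channel geometry is fixed when the relay channel is opened; a hedged sketch of the count-versus-size tradeoff, with made-up numbers (the stap transport's actual callback and sizes differ):

    /* Hypothetical relay channel setup; sizes are illustrative only. */
    #include <linux/relay.h>

    #define SUBBUF_SIZE     (256 * 1024)    /* made-up size */
    #define N_SUBBUFS       8               /* made-up count */

    /* subbuf_start returns 0 to refuse switching to the next subbuffer,
     * i.e. the write is dropped -- the "returns 0" failure seen above. */
    static int example_subbuf_start(struct rchan_buf *buf, void *subbuf,
                                    void *prev_subbuf, size_t prev_padding)
    {
            return !relay_buf_full(buf);    /* 1 = switch ok, 0 = drop */
    }

    static struct rchan_callbacks example_cb = {
            .subbuf_start = example_subbuf_start,
    };

    static struct rchan *example_open(struct dentry *dir)
    {
            /* per-cpu buffer = SUBBUF_SIZE * N_SUBBUFS bytes; a fixed
             * budget trades fewer/larger against more/smaller subbuffers */
            return relay_open("trace", dir, SUBBUF_SIZE, N_SUBBUFS,
                              &example_cb, NULL);
    }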
<fche>
due to bad code generation (boo me), there's an infinite loop - the stp_lock_probe that's called just before evaluating dependent probe conditional expressions
<fche>
it does a 'goto out;' in case of a locking error
<fche>
unfortunately
<fche>
there is an out: label just above
<fche>
so ... infinite loop
<fche>
the intent was to jump FORWARD to a subsequent out: label
<fche>
and that explains why it's this particular test that triggers it
<fche>
anyway will fix it
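A hypothetical, simplified C illustration of the label-placement bug just described; the function and helper names are placeholders, not the real generated code:

    /* The only 'out:' label was emitted just ABOVE the lock attempt, so
     * the locking-error path jumps backward and loops forever. */
    #include <stdbool.h>

    static bool try_take_probe_lock(void)
    {
            return false;            /* pretend the timed trylock keeps failing */
    }

    static void evaluate_probe_conditions(void)
    {
    out:
            if (!try_take_probe_lock())
                    goto out;        /* meant to jump FORWARD, but binds to the
                                      * label above instead -> infinite loop */

            /* ... evaluate dependent probe conditional expressions ... */

            /* intended target: an out: label emitted down here */
    }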
<agentzh>
ah, xmas for me...
<agentzh>
we lost 3 customers due to this!!!
<agentzh>
:D
irker760 has quit [Quit: transmission timeout]
derek0883 has quit [Remote host closed the connection]
<kerneltoast>
ahahahaha w o w
<kerneltoast>
i wonder why i never noticed that
<kerneltoast>
I looked at that generated stap code a lot when hunting this bug
<kerneltoast>
agentzh, time to convince our customers to come back?
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<agentzh>
kerneltoast: re "time to convince our customers to come back?" that's exactly what i'm gonna do! begging them to come back :D
<agentzh>
kerneltoast: re 3k less tests in parallel mode, wow, that's new to me.
<agentzh>
kerneltoast: so fche's probe lock infinite loop bug is not in our private branch?
<agentzh>
it seems?
<agentzh>
if that's the case, then the probe lock deadlock is not the reason for our 3 customers.
<agentzh>
but something else.
<agentzh>
like earlier panics/freezes you fixed.
demon000_ has joined #systemtap
tonyj has quit [Remote host closed the connection]
irker960 has quit [Quit: transmission timeout]
<kerneltoast>
agentzh, yeah i tried to diff a parallel run to a serial run and the parallel run had 5000 tests completed, while serial had 8000
<kerneltoast>
i think our private branch was too old to have the probe lock lockup. I only ran into it running the full testsuite on stap master. I think I've run into mutex_trylock issues on our private branch though
<kerneltoast>
our 3 former customers must've faced one of the other issues i fixed (the in_atomic() panic was most common on our private branch for me), or one of the print bugs I've fixed
<kerneltoast>
we're lucky fche caught the probe lock lockup right now, or we might've been screwed after merging stap master into our own branch :P
<agentzh>
yes, indeed!
<agentzh>
so fche had a point in forcing you to run the fat test suite ;)
<agentzh>
just to catch others' mistakes.
<agentzh>
in this case, fche's own.
tux3_ has joined #systemtap
tux3 has quit [Read error: Connection reset by peer]
tux3_ has joined #systemtap
tux3_ has quit [Changing host]
demon000_ has quit [Ping timeout: 272 seconds]
khaled has joined #systemtap
derek0883 has quit [Remote host closed the connection]
<kerneltoast>
yes, you'd look good with a brown paper bag
<fche>
try it, could be a new fashion trend
<kerneltoast>
budget PPE
irker820 has joined #systemtap
<irker820>
systemtap: sultan systemtap.git:master * release-4.4-28-g8819e2a04 / runtime/print_flush.c runtime/transport/relay_v2.c runtime/transport/transport.c runtime/transport/transport.h staprun/relay.c: always use per-cpu bulkmode relayfs files to communicate with userspace
<irker820>
systemtap: sultan systemtap.git:master * release-4.4-29-gd86b64029 / tapset-timers.cxx: Revert "REVERTME: tapset-timers: work around on-the-fly deadlocks caused by mutex_trylock"
<kerneltoast>
🚢🚢🚢
derek0883 has quit [Remote host closed the connection]
sscox has joined #systemtap
derek0883 has joined #systemtap
* serhei
considers potential proliferation of %( runtime != bpf conditionals in the main tapset code
<serhei>
since I'll be making many more tapset functions, I think what I'll do instead of creating a ton of functions like
<serhei>
instead have the toplevel tapsets contain
<serhei>
%( runtime != bpf %? function foo() {} function bar() {} ... %)
<serhei>
and put the bpf implementation into tapset/bpf/
<serhei>
which previously contained only functions mirroring the ones under tapset/linux/
<serhei>
giving ample notice to allow people (primarily fche) to bikeshed
<fche>
do we need the toplevel ones at all/
<serhei>
hmm
<serhei>
the way this started was that toplevel existed
<serhei>
then jistone moved the tapset functions with backend-specific implementations into linux/ and dyninst/
<serhei>
and kept all other ones in place
<fche>
the rascal
<serhei>
then when bpf backend was written, the functions that were in linux/ grew counterparts in bpf/ and the functions that stayed at the top level grew little %( runtime != bpf %? branches
<serhei>
the categories of tapset functions that exist now are
<serhei>
- same implementation everywhere
<serhei>
- same implementation on linux and dyninst, but not bpf
<serhei>
- different implementation per each backend
<fche>
ok
<fche>
it seems as though category 2 examples could devolve into category 3 (with symlinks or some other trickery to make cross-references)
<fche>
or if few in number, do the %( runtime %) trick
<fche>
do you have a sample function name for each category?
<serhei>
ah. symlinks would be some new feature whereby in e.g. dyninst or bpf uconversions.stp you would have
<serhei>
function foo() %same_as_linux
<serhei>
?
<serhei>
fche: category1 is user_string; category2 is user_string_n; category3 is kernel_string_n
<fche>
symlinks could be physical symlinks
<serhei>
ahh
<serhei>
git supports symlinks
<fche>
yeah ln -s ../linux/foo.stp bpf/foo.stp
* serhei
just assumed function level rather than file level
<serhei>
that also works
<fche>
yeah depends on the granularity
* serhei
looks forward to people editing a symlinked file by mistake
<fche>
not too interested in a new parser level construct like that %same_as_linux
<serhei>
and then realizing the mistake before they commit of course :)
<serhei>
fche, me neither
<fche>
ok.
<fche>
so anyway we have a range of tools (conditionals, symlinks, wrapper functions ...) -- it's just a matter of tastefully choosing among them
<fche>
go for it
<serhei>
will do
orivej has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<agentzh>
serhei: always glad to see you're working on the bpf runtime :)
<agentzh>
kerneltoast fche: so it's already xmas?
<agentzh>
no known deadlocks and panics right now?
<kerneltoast>
yep
<agentzh>
hooray!
<agentzh>
it took us so long to get here.
<agentzh>
but finally.
<kerneltoast>
only 3 months
<agentzh>
so happy.
<agentzh>
lol
<fche>
well ... "known" is contingent
<fche>
but yeah real nice progress!
<kerneltoast>
we do know that the unknown panics and deadlocks are rare
<kerneltoast>
because we haven't seen them yet :P
<agentzh>
so we have to run those tracepoint_onthefly.exp tests separately after running the whole thing in parallel?
<agentzh>
that sounds like a bug in the test scaffold? fche?
<kerneltoast>
agreed, parallel should be running all the tests that serial does
<agentzh>
aye
<agentzh>
kerneltoast: so next we should merge the latest master into our private branch and then i'll go begging our lost customers.
<kerneltoast>
agentzh, sounds like a plan
<agentzh>
great
<kerneltoast>
time to bust out the exotic drinks and celebrate
<fche>
agentzh, they should be the same set of tests; haven't looked into why it might have been skipped
<agentzh>
kerneltoast: yeah, indeed. sadly i can't buy them for you personally ;)
derek0883 has quit [Remote host closed the connection]
<agentzh>
fche: looking forward to your buildbots running tests in parallel, so maybe you can reproduce it.
<fche>
we'll look into it
<agentzh>
thanks
<fche>
though our buildbots are smallish VMs rather than beefy tens-of-cores hardware