fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
<kerneltoast>
that might be what happened to my ryzen laptop
<kerneltoast>
fche, updated the gist with the full dmesg
derek0883 has quit [Remote host closed the connection]
<fche>
interesting
derek0883 has joined #systemtap
<kerneltoast>
the implicated code looks horrid...
<fche>
stp_lock_probe can indeed block a little while (limited by macros, should be a very low maximum elapsed time, << 20 seconds)
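A rough user-space sketch of the "bounded wait" idea fche describes: try the lock a fixed number of times, then give up so the caller can skip its work rather than block indefinitely. This is not the stap runtime code. `demo_spin`, `demo_lock_bounded`, `demo_unlock`, and the `DEMO_MAX_TRIES` cap are all hypothetical names and values chosen for illustration, and C11 atomics stand in for kernel locking primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_TRIES 1000u  /* hypothetical retry cap, not stap's actual value */

/* Hypothetical stand-in for a lock: a single test-and-set flag. */
struct demo_spin { atomic_flag busy; };

/* Attempt the lock at most DEMO_MAX_TRIES times.
 * Returns true if the lock was taken, false if every attempt failed. */
static bool demo_lock_bounded(struct demo_spin *s)
{
    unsigned tries;

    assert(s != NULL);
    for (tries = 0; tries < DEMO_MAX_TRIES; tries++) {
        /* test_and_set returns the previous value: false means it was free */
        if (!atomic_flag_test_and_set_explicit(&s->busy, memory_order_acquire))
            return true;
        /* a real implementation would pause briefly between attempts */
    }
    return false;  /* gave up: total wait is bounded by the retry cap */
}

static void demo_unlock(struct demo_spin *s)
{
    assert(s != NULL);
    atomic_flag_clear_explicit(&s->busy, memory_order_release);
}
```

With a cap like this, the worst-case time spent waiting is roughly the number of tries times the per-attempt delay, which is why a multi-second stall inside the lock path would be unexpected.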
<fche>
ok, and does the unpatched copy of stap work better for you? (sdt.stp etc. is pretty early in the testsuite)
<kerneltoast>
unpatched as in without my task work patch?
<fche>
yes
<kerneltoast>
this lockup is sporadic. i ran the testsuite 3 times, and it occurred in 2 of them
<kerneltoast>
i'll try it without my task work patch, but i really don't think that's the issue...
<fche>
yeah. suggest rebooting before running the suite again with upstream code
<kerneltoast>
i have no choice but to reboot. the vm is locked up
<agentzh>
kerneltoast: it was not a lockdep/debug kernel, as per your gist.
<agentzh>
so it was not exactly the same as your laptop runs.
<kerneltoast>
my laptop also had a 5.8 kernel so indeed a lot was different
<kerneltoast>
but if there is the potential for this lockup to occur on 3.10 without lockdep, i don't see what's stopping it from happening on a totally different system
<kerneltoast>
running the testsuite again now at upstream HEAD...
<kerneltoast>
fche, do the buildbots only run the testsuite single-threaded?
<fche>
not sure
<kerneltoast>
fche, it died again
<kerneltoast>
without my patch
<kerneltoast>
getting the vmcore now...
<agentzh>
the last time i tried running the test suite in parallel (-j16), it also froze. a long time ago.
<agentzh>
*the full test suite
<kerneltoast>
ahaha, it died due to the bug my task work patch fixes
<fche>
ok so really it should never take long to lock, even on a suppress-time-limits type of stap example
<kerneltoast>
this lock probe stuff is really nuts and i'm not quite sure how to audit its usage...
<kerneltoast>
i suspect a lock is never released or something
<fche>
well, you might just not understand what that part is about
<fche>
the lock machinery in question here is what protects stap script-level global variables from concurrent modification by probe handlers
<kerneltoast>
fche, the centos 7 vm locked up without my patch (no panic)
<fche>
PERFECT
<fche>
er
<kerneltoast>
so uh
<kerneltoast>
how am i going to get your testsuite results for that patch :)
<kerneltoast>
i just checked the vmcore and it is indeed the same lockup from earlier
<kerneltoast>
i'm innocent!
<kerneltoast>
fche, err maybe i'm misreading something, but stp_unlock_probe() never releases lock 0?
<fche>
I think it's okay, though a weird way to write it
<kerneltoast>
oof yeah you're right, but it hurts my brain
<kerneltoast>
i guess it's a clever way to keep i unsigned
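For readers puzzling over the same thing, here is a minimal sketch of the kind of countdown loop that keeps the index unsigned while still reaching element 0. It is an illustration of the general C idiom only, assuming the loop in question has this shape. `demo_lock` and `demo_unlock_all` are hypothetical names, not the stap source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock: just a held / not-held flag. */
struct demo_lock { int held; };

/* Release locks[n-1] down to locks[0], the reverse of acquisition order.
 * Returns how many locks were released. */
static unsigned demo_unlock_all(struct demo_lock *locks, unsigned n)
{
    unsigned i, released = 0;

    assert(locks != NULL || n == 0);
    /* `i-- > 0` compares the old value of i, then decrements it.
     * The body therefore runs with i = n-1, ..., 1, 0, and the loop
     * ends when the old value was 0. A plain `i >= 0` test would
     * always be true for an unsigned i. */
    for (i = n; i-- > 0;) {
        locks[i].held = 0;
        released++;
    }
    return released;
}
```

The final decrement wraps i around to the largest unsigned value, which is well defined in C, and i is not used after the loop.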
<kerneltoast>
fche, well if you have any ideas for that lockup, i can test patches you send
<fche>
hmmmm I'd try to find out which .exp / .stp file the first erroneous one was