fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
derek0883 has joined #systemtap
<kerneltoast> this doesn't fix the mem leaks you noticed
<fche> it looks plausible, can you put it through the testsuite?
<kerneltoast> yeah, i'll just need to find another machine to use because i got rid of the crummy ryzen laptop
<kerneltoast> or i could convince you it doesn't need the test suite
<kerneltoast> this basically makes those two code paths behave the same as when CONFIG_PREEMPT=y
<kerneltoast> if stap works fine on CONFIG_PREEMPT=y, then this change will work fine as well
<kerneltoast> the big delta is just from indentation
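(Purely illustrative sketch of the kind of change described above: a CONFIG_PREEMPT-conditional branch collapsed so every build takes the formerly PREEMPT-only path, which is why the resulting diff is mostly indentation. The function and behaviour below are made up and are not the actual task-work patch.)
    /* Toy before/after; compile with: gcc -Wall preempt_unify.c [-DCONFIG_PREEMPT] */
    #include <stdio.h>

    /* Before: two build-time code paths. */
    static void handler_before(void)
    {
    #ifdef CONFIG_PREEMPT
        printf("PREEMPT path: defer the work\n");
    #else
        printf("non-PREEMPT path: do the work inline\n");
    #endif
    }

    /* After: the non-PREEMPT branch is dropped, so every build behaves as if
     * CONFIG_PREEMPT=y and the diff is mostly re-indentation. */
    static void handler_after(void)
    {
        printf("single path: defer the work\n");
    }

    int main(void)
    {
        handler_before();
        handler_after();
        return 0;
    }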
<fche> trust no one
<kerneltoast> our fuzzer would typically hit a kernel panic within a few minutes
<kerneltoast> after this change, i've had the fuzzer running for 110 minutes without any crashes
<kerneltoast> it passes our lean test suite as well
<kerneltoast> it also complimented my haircut
<fche> that's certainly encouraging
<kerneltoast> the fuzzer is running on centos 7 btw
<kerneltoast> fche, were you convinced?
<agentzh> heh
* agentzh knows kerneltoast doesn't want to run the full test suite.
* kerneltoast sweats
khaled has quit [Quit: Konversation terminated!]
hpt has joined #systemtap
_whitelogger has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
_whitelogger has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
orivej has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
khaled has joined #systemtap
hpt has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
mjw has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
khaled has quit [Remote host closed the connection]
khaled has joined #systemtap
tromey has joined #systemtap
amerey has joined #systemtap
mjw has quit [Ping timeout: 264 seconds]
mjw has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
<kerneltoast> fche, hi
<fche> duuuude
<fche> so how was that Full 100% Thorough Testsuite Run ? :)
<kerneltoast> uhhhhhh
* kerneltoast sweats
<fche> no it's easy
<fche> just type
<fche> sudo make installcheck
<fche> easy peasy
<kerneltoast> yeah but
<fche> glad to help
<fche> have a NICE day!
<kerneltoast> :(
<kerneltoast> i sent back my crummy ryzen laptop
<kerneltoast> that's what i was using for testsuite runs
<kerneltoast> i take it you weren't convinced?
<fche> I'd be more comfy with that data.
<kerneltoast> time to rebuild my laptop's kernel i guess
<fche> no access to a decent vm/server ?
<fche> didn't realize it'd be such a hardship
<kerneltoast> i use a couple of agentzh's boxes for VMs but that would just be on centos 7 if i tested it there
<fche> it'd be useful
<kerneltoast> fastest machine in my household is my laptop
<kerneltoast> ok it's running on centos 7 now
derek0883 has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> fche, lol the centos 7 vm running the testsuite died: https://gist.github.com/kerneltoast/19b04eaebb0b53b0e1da80c1557086fc
<kerneltoast> that might be what happened to my ryzen laptop
<kerneltoast> fche, updated the gist with the full dmesg
derek0883 has quit [Remote host closed the connection]
<fche> interesting
derek0883 has joined #systemtap
<kerneltoast> the implicated code looks horrid...
<fche> stp_lock_probe can indeed block a little while (limited by macros, should be a very low maximum elapsed time, << 20 seconds)
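(A minimal userspace sketch of the bounded-trylock idiom being described here; MAXTRYLOCK and TRYLOCKDELAY below are stand-ins for the runtime's limit macros and values, not the actual SystemTap code.)
    /* Compile with: gcc -Wall -pthread trylock_sketch.c -o trylock_sketch */
    #include <pthread.h>
    #include <stdbool.h>
    #include <unistd.h>

    #define MAXTRYLOCK   1000   /* max attempts before giving up (stand-in value) */
    #define TRYLOCKDELAY 10     /* microseconds between attempts (stand-in value) */

    /* Try to take the write lock for a bounded amount of time, roughly
     * MAXTRYLOCK * TRYLOCKDELAY microseconds. Returns false on failure, in
     * which case the caller skips the probe instead of blocking forever. */
    static bool bounded_write_lock(pthread_rwlock_t *lock)
    {
        unsigned attempts;

        for (attempts = 0; attempts < MAXTRYLOCK; attempts++) {
            if (pthread_rwlock_trywrlock(lock) == 0)
                return true;
            usleep(TRYLOCKDELAY);
        }
        return false;
    }

    int main(void)
    {
        pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

        if (bounded_write_lock(&lock))
            pthread_rwlock_unlock(&lock);
        return 0;
    }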
<fche> ok, and does the unpatched copy of stap work better for you? (sdt.stp etc. is pretty early in the testsuite)
<kerneltoast> unpatched as in without my task work patch?
<fche> yes
<kerneltoast> this lockup is sporadic. i ran the testsuite 3 times, and it occurred 2/3 times
<kerneltoast> i'll try it without my task work patch, but i really don't think that's the issue...
<fche> yeah. suggest rebooting before running the suite again with upstream code
<kerneltoast> i have no choice but to reboot. the vm is locked up
<agentzh> kerneltoast: it was not a lockdep/debug kernel, as per your gist.
<agentzh> so it was not exactly the same as your laptop runs.
<kerneltoast> my laptop also had a 5.8 kernel so indeed a lot was different
<kerneltoast> but if there is the potential for this lockup to occur on 3.10 without lockdep, i don't see what's stopping it from happening on a totally different system
<kerneltoast> running the testsuite again now at upstream HEAD...
<kerneltoast> fche, do the buildbots only run the testsuite single-threaded?
<fche> not sure
<kerneltoast> fche, it died again
<kerneltoast> without my patch
<kerneltoast> getting the vmcore now...
<agentzh> the last time i tried running the test suite in parallel (-j16), it also froze. a long time ago.
<agentzh> *the full test suite
<kerneltoast> ahaha, it died due to the bug my task work patch fixes
<kerneltoast> that's funny
<kerneltoast> i'll run it again
<agentzh> cool
<agentzh> the std test suite can also reproduce that bug.
<kerneltoast> yeah
<kerneltoast> what's even more interesting though is that kdump caught this
<kerneltoast> but
<kerneltoast> kdump did not catch the lockup
<kerneltoast> that sounds exactly like what happened on my laptop
<kerneltoast> it died again due to the same panic
<kerneltoast> i'll try one more time i guess
<kerneltoast> fche, the macro limit for stp_lock_probe is only used if --suppress-time-limits is passed
<kerneltoast> err
<fche> um surely we didn't flip the polarity by mistake
<fche> surely
<kerneltoast> it's ignored if --suppress-time-limits is passed
<kerneltoast> maybe the testsuite uses that at some point?
<kerneltoast> fche, it does
<kerneltoast> grep suppress-time-limits -r testsuite/
<fche> yes, that'd be on purpose
<fche> ok
<fche> ok so really it should never take long to lock, even on a suppress-time-limits type of stap example
<kerneltoast> this lock probe stuff is really nuts and i'm not quite sure how to audit its usage...
<kerneltoast> i suspect a lock is never released or something
<fche> well, you might just not understand what that part is about
<fche> the lock machinery in question here are the ones used for protecting stap script-level global variables from concurrent probe handlers' modifications
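(A userspace sketch of that idea: one rwlock per script-level global, where each probe handler takes a write lock for every global it modifies and a read lock for every global it only reads, always in the same array order so handlers cannot deadlock against each other. The names and layout are illustrative, not the runtime's actual data structures.)
    /* Compile with: gcc -Wall -pthread global_locks_sketch.c -o global_locks_sketch */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_GLOBALS 2

    /* One lock per script global, always acquired in ascending index order. */
    static pthread_rwlock_t global_locks[NUM_GLOBALS] = {
        PTHREAD_RWLOCK_INITIALIZER,
        PTHREAD_RWLOCK_INITIALIZER,
    };
    static long hits;       /* stands in for a global the handler writes     */
    static long threshold;  /* stands in for a global the handler only reads */

    static void probe_handler(void)
    {
        /* global 0 is written, global 1 is only read */
        pthread_rwlock_wrlock(&global_locks[0]);
        pthread_rwlock_rdlock(&global_locks[1]);

        if (hits < threshold)
            hits++;

        /* release in reverse order */
        pthread_rwlock_unlock(&global_locks[1]);
        pthread_rwlock_unlock(&global_locks[0]);
    }

    int main(void)
    {
        threshold = 10;
        probe_handler();
        printf("hits = %ld\n", hits);
        return 0;
    }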
<kerneltoast> fche, the centos 7 vm locked up without my patch (no panic)
<fche> PERFECT
<fche> er
<kerneltoast> so uh
<kerneltoast> how am i going to get your testsuite results for that patch :)
<kerneltoast> i just checked the vmcore and it is indeed the same lockup from earlier
<kerneltoast> i'm innocent!
<kerneltoast> fche, err maybe i'm misreading something, but stp_unlock_probe() never releases lock 0?
<fche> I think it's okay, though a weird way to write it
<kerneltoast> oof yeah you're right, but it hurts my brain
<kerneltoast> i guess it's a clever way to keep i unsigned
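(For reference, a standalone sketch of the countdown idiom in question: the index stays unsigned and the body unlocks locks[i - 1], so lock 0 is released even though the loop condition reads "i > 0". The lock array and count are illustrative only.)
    /* Compile with: gcc -Wall countdown_sketch.c -o countdown_sketch */
    #include <stdio.h>

    #define NUM_LOCKS 4

    static void unlock_one(unsigned idx)
    {
        printf("unlocking lock %u\n", idx);
    }

    static void unlock_all(void)
    {
        unsigned i;

        /* Counts i = 4, 3, 2, 1 but unlocks indexes 3, 2, 1, 0. Writing it as
         * "for (i = NUM_LOCKS - 1; i >= 0; i--)" would loop forever, because
         * an unsigned i is always >= 0. */
        for (i = NUM_LOCKS; i > 0; i--)
            unlock_one(i - 1);
    }

    int main(void)
    {
        unlock_all();
        return 0;
    }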
<kerneltoast> fche, well if you have any ideas for that lockup, i can test patches you send
<fche> hmmmm I'd try to find out which .exp / .stp file the first erroneous one was
<fche> [ 582.984523] stap_5f3145612d5982e492dbe75419d7e19_14681 (sdt.stp): systemtap: 4.4/0.177, base: ffffffffc0638000, memory: 192data/68text/19ctx/2063net/133alloc kb, probes: 12
<fche> probably
<fche> rerun that by hand
<kerneltoast> let me send you the second dmesg
<fche> how exactly did you run the test? make -j
<fche> ?
<kerneltoast> make -j17 installcheck-parallel
<fche> ok try it without the -j
<kerneltoast> and keep the -parallel?
<fche> or not
<fche> just make installcheck
<kerneltoast> no wait
<kerneltoast> that's the same
<kerneltoast> urgh
<fche> that's the same url over and over ?
<fche> this is
<kerneltoast> yeah but i uploaded different files to the same gist
<fche> the twilight zone
<kerneltoast> in both logs, the backtraces for the lockup are coming from an stap module that's just labelled as (<input>)
<kerneltoast> so it's gotta be a suppress-time-limits example that doesn't use an .stp file
<kerneltoast> does that sound right?
<agentzh> the dmesg looks like a deadlock? the enter_tracepoint_probe_xxx appears twice in the same bt.
<fche> reentrancy would be bad m'kay
<fche> tracepointed nested inside kretprobe
<fche> um, we have mechanisms to prevent that, to stop the nested probe from being entered
* agentzh finds the std stap test suite a good fuzzer when running in parallel on big machines.
<fche> heh, stap was/is a good fuzzer for certain types of kernel bugs too, alas
<agentzh> stp_lock_probe is nested in the backtrace.
<agentzh> a deadlock confirmed?
<agentzh> it doesn't look like it has anything to do with our recent patches?
<agentzh> yeah, fuzzer for everything.
<agentzh> including the hardware.
<fche> yeah I wouldn't think so, but something is very strange if this can happen
<fche> again we have anti-reentrancy measures in the code
<fche> and we'd be hitting this bug all over, including on our rhel7 buildbot
<agentzh> could you point to the lines for the anti-reentrancy measures?
<agentzh> and can you confirm whether the test suite runs serially or in parallel on your buildbot?
<fche> _stp_runtime_entryfn_get_context()
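(A userspace sketch of the anti-reentrancy idea behind _stp_runtime_entryfn_get_context(): each CPU, modeled here as a thread, owns a single context, and a probe that fires while that context is already in use bails out instead of nesting. The types and names below are illustrative stand-ins, not the runtime's code.)
    /* Compile with: gcc -Wall -pthread reentrancy_sketch.c -o reentrancy_sketch */
    #include <stdbool.h>
    #include <stdio.h>

    struct context {
        bool busy;  /* set while a probe handler is running on this "CPU" */
    };

    static __thread struct context percpu_context;  /* one context per thread */

    /* Returns the context if it is free, or NULL if a handler is already
     * running here; a nested probe then simply skips its work. */
    static struct context *get_context(void)
    {
        if (percpu_context.busy)
            return NULL;
        percpu_context.busy = true;
        return &percpu_context;
    }

    static void put_context(struct context *ctx)
    {
        ctx->busy = false;
    }

    static void probe_handler(const char *name)
    {
        struct context *ctx = get_context();

        if (!ctx) {
            printf("%s: reentered, skipping\n", name);
            return;
        }
        printf("%s: running\n", name);
        /* a tracepoint firing inside this handler reenters here and bails out */
        probe_handler("nested probe");
        put_context(ctx);
    }

    int main(void)
    {
        probe_handler("outer probe");
        return 0;
    }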
<agentzh> i think kerneltoast is using the official stap master.
<fche> we normally run serially.
<agentzh> ah, okay, then kerneltoast should try serial runs too to avoid stress :)
<agentzh> it'll take way longer though.
<agentzh> considering that box has an 8c16t CPU.
<kerneltoast> I'd have to do two serial runs to get test results for the task work patch
<kerneltoast> 24+ hours?
<agentzh> it'll take hours at least.
<agentzh> right now it's ~40min with -j17?
<kerneltoast> that was on my 4800H laptop
<kerneltoast> which i no longer have
<kerneltoast> 4800H is faster than i9-9900K iirc
<agentzh> my box should be a little bit faster.
<agentzh> really?
<kerneltoast> your box has better single threaded performance though
<agentzh> on geekbench, i9 wins on both single-core and multi-core.
<kerneltoast> i was also running the testsuite on metal. I guess it'll go a bit slower in a vm
<agentzh> hopefully not on my metal ;)
<agentzh> kvm should be fast when not hitting any bugs.
<agentzh> seems like you'll need 20+ hours to run twice in serial.
<agentzh> it's crazy.
<kerneltoast> ouch
<kerneltoast> well, it's going
<kerneltoast> i'll check back on it after dinner
derek088_ has joined #systemtap
derek0883 has quit [Ping timeout: 264 seconds]
mjw has quit [Quit: Leaving]
derek088_ has quit [Remote host closed the connection]
derek0883 has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
tromey has quit [Quit: ERC (IRC client for Emacs 27.1.50)]
ChanServ has quit [shutting down]
ChanServ has joined #systemtap
derek0883 has quit [Ping timeout: 264 seconds]
amerey has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
amerey has quit [Quit: Leaving]
khaled has quit [Quit: Konversation terminated!]