fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
derek0883 has joined #systemtap
<kerneltoast> this doesn't fix the mem leaks you noticed
<fche> it looks plausible, can you put it through the testsuite?
<kerneltoast> yeah, i'll just need to find another machine to use because i got rid of the crummy ryzen laptop
<kerneltoast> or i could convince you it doesn't need the test suite
<kerneltoast> this basically makes those two code paths behave the same as when CONFIG_PREEMPT=y
<kerneltoast> if stap works fine on CONFIG_PREEMPT=y, then this change will work fine as well
<kerneltoast> the big delta is just from indentation
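(Purely illustrative sketch of the kind of change described above: a CONFIG_PREEMPT-conditional branch collapsed so every build takes the formerly PREEMPT-only path, which is why the resulting diff is mostly indentation. The function and behaviour below are made up and are not the actual task-work patch.)
    /* Toy before/after; compile with: gcc -Wall preempt_unify.c [-DCONFIG_PREEMPT] */
    #include <stdio.h>

    /* Before: two build-time code paths. */
    static void handler_before(void)
    {
    #ifdef CONFIG_PREEMPT
        printf("PREEMPT path: defer the work\n");
    #else
        printf("non-PREEMPT path: do the work inline\n");
    #endif
    }

    /* After: the non-PREEMPT branch is dropped, so every build behaves as if
     * CONFIG_PREEMPT=y and the diff is mostly re-indentation. */
    static void handler_after(void)
    {
        printf("single path: defer the work\n");
    }

    int main(void)
    {
        handler_before();
        handler_after();
        return 0;
    }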
<fche> trust no one
<kerneltoast> our fuzzer would typically hit a kernel panic within a few minutes
<kerneltoast> after this change, i've had the fuzzer running for 110 minutes without any crashes
<kerneltoast> it passes our lean test suite as well
<kerneltoast> it also complimented my haircut
<fche> that's certainly encouraging
<kerneltoast> the fuzzer is running on centos 7 btw
<kerneltoast> fche, were you convinced?
<agentzh> heh
* agentzh knows kerneltoast doesn't want to run the full test suite.
* kerneltoast sweats
khaled has quit [Quit: Konversation terminated!]
hpt has joined #systemtap
_whitelogger has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
_whitelogger has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
orivej has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
khaled has joined #systemtap
hpt has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
mjw has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
khaled has quit [Remote host closed the connection]
khaled has joined #systemtap
tromey has joined #systemtap
amerey has joined #systemtap
mjw has quit [Ping timeout: 264 seconds]
mjw has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
<kerneltoast> fche, hi
<fche> duuuude
<fche> so how was that Full 100% Thorough Testsuite Run ? :)
<kerneltoast> uhhhhhh
* kerneltoast sweats
<fche> no it's easy
<fche> just type
<fche> sudo make installcheck
<fche> easy peasy
<kerneltoast> yeah but
<fche> glad to help
<fche> have a NICE day!
<kerneltoast> :(
<kerneltoast> i sent back my crummy ryzen laptop
<kerneltoast> that's what i was using for testsuite runs
<kerneltoast> i take it you weren't convinced?
<fche> I'd be more comfy with that data.
<kerneltoast> time to rebuild my laptop's kernel i guess
<fche> no access to a decent vm/server ?
<fche> didn't realize it'd be such a hardship
<kerneltoast> i use a couple of agentzh's boxes for VMs but that would just be on centos 7 if i tested it there
<fche> it'd be useful
<kerneltoast> fastest machine in my household is my laptop
<kerneltoast> ok it's running on centos 7 now
derek0883 has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> fche, lol the centos 7 vm running the testsuite died: https://gist.github.com/kerneltoast/19b04eaebb0b53b0e1da80c1557086fc
<kerneltoast> that might be what happened to my ryzen laptop
<kerneltoast> fche, updated the gist with the full dmesg
derek0883 has quit [Remote host closed the connection]
<fche> interesting
derek0883 has joined #systemtap
<kerneltoast> the implicated code looks horrid...
<fche> stp_lock_probe can indeed block a little while (limited by macros, should be a very low maximum elapsed time, << 20 seconds)
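(A minimal userspace sketch of the bounded-trylock idiom being described here; MAXTRYLOCK and TRYLOCKDELAY below are stand-ins for the runtime's limit macros and values, not the actual SystemTap code.)
    /* Compile with: gcc -Wall -pthread trylock_sketch.c -o trylock_sketch */
    #include <pthread.h>
    #include <stdbool.h>
    #include <unistd.h>

    #define MAXTRYLOCK   1000   /* max attempts before giving up (stand-in value) */
    #define TRYLOCKDELAY 10     /* microseconds between attempts (stand-in value) */

    /* Try to take the write lock for a bounded amount of time, roughly
     * MAXTRYLOCK * TRYLOCKDELAY microseconds. Returns false on failure, in
     * which case the caller skips the probe instead of blocking forever. */
    static bool bounded_write_lock(pthread_rwlock_t *lock)
    {
        unsigned attempts;

        for (attempts = 0; attempts < MAXTRYLOCK; attempts++) {
            if (pthread_rwlock_trywrlock(lock) == 0)
                return true;
            usleep(TRYLOCKDELAY);
        }
        return false;
    }

    int main(void)
    {
        pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

        if (bounded_write_lock(&lock))
            pthread_rwlock_unlock(&lock);
        return 0;
    }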
<fche> ok, and does the unpatched copy of stap work better for you? (sdt.stp etc. is pretty early in the testsuite)
<kerneltoast> unpatched as in without my task work patch?
<fche> yes
<kerneltoast> this lockup is sporadic. i ran the testsuite 3 times, and it occurred 2/3 times
<kerneltoast> i'll try it without my task work patch, but i really don't think that's the issue...
<fche> yeah. suggest rebooting before running the suite again with upstream code
<kerneltoast> i have no choice but to reboot. the vm is locked up
<agentzh> kerneltoast: it was not a lockdep/debug kernel, as per your gist.
<agentzh> so it was not exactly the same as your laptop runs.
<kerneltoast> my laptop also had a 5.8 kernel so indeed a lot was different
<kerneltoast> but if there is the potential for this lockup to occur on 3.10 without lockdep, i don't see what's stopping it from happening on a totally different system
<kerneltoast> running the testsuite again now at upstream HEAD...
<kerneltoast> fche, do the buildbots only run the testsuite single-threaded?
<fche> not sure
<kerneltoast> fche, it died again
<kerneltoast> without my patch
<kerneltoast> getting the vmcore now...
<agentzh> the last time i tried running the test suite in parallel (-j16), it also froze. a long time ago.
<agentzh> *the full test suite
<kerneltoast> ahaha, it died due to the bug my task work patch fixes
<kerneltoast> that's funny
<kerneltoast> i'll run it again
<agentzh> cool
<agentzh> the std test suite can also reproduce that bug.
<kerneltoast> yeah
<kerneltoast> what's even more interesting though is that kdump caught this
<kerneltoast> but
<kerneltoast> kdump did not catch the lockup
<kerneltoast> that sounds exactly like what happened on my laptop
<kerneltoast> it died again due to the same panic
<kerneltoast> i'll try one more time i guess
<kerneltoast> fche, the macro limit for stp_lock_probe is only used if --suppress-time-limits is passed
<kerneltoast> err
<fche> um surely we didn't flip the polarity by mistake
<fche> surely
<kerneltoast> it's ignored if --suppress-time-limits is passed
<kerneltoast> maybe the testsuite uses that at some point?
<kerneltoast> fche, it does
<kerneltoast> grep suppress-time-limits -r testsuite/
<fche> yes, that'd be on purpose
<fche> ok
<fche> ok so really it should never take long to lock, even on a suppress-time-limits type of stap example
<kerneltoast> this lock probe stuff is really nuts and i'm not quite sure how to audit its usage...
<kerneltoast> i suspect a lock is never released or something
<fche> well, you might just not understand what that part is about
<fche> the lock machinery in question here are the ones used for protecting stap script-level global variables from concurrent probe handlers' modifications
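(A userspace sketch of that idea: one rwlock per script-level global, where each probe handler takes a write lock for every global it modifies and a read lock for every global it only reads, always in the same array order so handlers cannot deadlock against each other. The names and layout are illustrative, not the runtime's actual data structures.)
    /* Compile with: gcc -Wall -pthread global_locks_sketch.c -o global_locks_sketch */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_GLOBALS 2

    /* One lock per script global, always acquired in ascending index order. */
    static pthread_rwlock_t global_locks[NUM_GLOBALS] = {
        PTHREAD_RWLOCK_INITIALIZER,
        PTHREAD_RWLOCK_INITIALIZER,
    };
    static long hits;       /* stands in for a global the handler writes     */
    static long threshold;  /* stands in for a global the handler only reads */

    static void probe_handler(void)
    {
        /* global 0 is written, global 1 is only read */
        pthread_rwlock_wrlock(&global_locks[0]);
        pthread_rwlock_rdlock(&global_locks[1]);

        if (hits < threshold)
            hits++;

        /* release in reverse order */
        pthread_rwlock_unlock(&global_locks[1]);
        pthread_rwlock_unlock(&global_locks[0]);
    }

    int main(void)
    {
        threshold = 10;
        probe_handler();
        printf("hits = %ld\n", hits);
        return 0;
    }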
<kerneltoast> fche, the centos 7 vm locked up without my patch (no panic)
<fche> PERFECT
<fche> er
<kerneltoast> so uh
<kerneltoast> how am i going to get your testsuite results for that patch :)
<kerneltoast> i just checked the vmcore and it is indeed the same lockup from earlier
<kerneltoast> i'm innocent!
<kerneltoast> fche, err maybe i'm misreading something, but stp_unlock_probe() never releases lock 0?
<fche> I think it's okay, though a weird way to write it
<kerneltoast> oof yeah you're right, but it hurts my brain
<kerneltoast> i guess it's a clever way to keep i unsigned
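(For reference, a standalone sketch of the countdown idiom in question: the index stays unsigned and the body unlocks locks[i - 1], so lock 0 is released even though the loop condition reads "i > 0". The lock array and count are illustrative only.)
    /* Compile with: gcc -Wall countdown_sketch.c -o countdown_sketch */
    #include <stdio.h>

    #define NUM_LOCKS 4

    static void unlock_one(unsigned idx)
    {
        printf("unlocking lock %u\n", idx);
    }

    static void unlock_all(void)
    {
        unsigned i;

        /* Counts i = 4, 3, 2, 1 but unlocks indexes 3, 2, 1, 0. Writing it as
         * "for (i = NUM_LOCKS - 1; i >= 0; i--)" would loop forever, because
         * an unsigned i is always >= 0. */
        for (i = NUM_LOCKS; i > 0; i--)
            unlock_one(i - 1);
    }

    int main(void)
    {
        unlock_all();
        return 0;
    }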
<kerneltoast> fche, well if you have any ideas for that lockup, i can test patches you send
<fche> hmmmm I'd try to find out which .exp / .stp file the first erroneous one was
<fche> [ 582.984523] stap_5f3145612d5982e492dbe75419d7e19_14681 (sdt.stp): systemtap: 4.4/0.177, base: ffffffffc0638000, memory: 192data/68text/19ctx/2063net/133alloc kb, probes: 12
<fche> probably
<fche> rerun that by hand
<kerneltoast> let me send you the second dmesg
<fche> how exactly did you run the test? make -j
<fche> ?
<kerneltoast> make -j17 installcheck-parallel
<fche> ok try it without the -j
<kerneltoast> and keep the -parallel?
<fche> or not
<fche> just make installcheck
<kerneltoast> no wait
<kerneltoast> that's the same
<kerneltoast> urgh
<fche> that's the same url over and over ?
<fche> this is
<kerneltoast> yeah but i uploaded different files to the same gist
<fche> the twilight zone
<kerneltoast> in both logs, the backtraces for the lockup are coming from an stap module that's just labelled as (<input>)
<kerneltoast> so it's gotta be a suppress-time-limits example that doesn't use an .stp file
<kerneltoast> does that sound right?
<agentzh> the dmesg looks like a deadlock? the enter_tracepoint_probe_xxx appears twice in the same bt.
<fche> reentrancy would be bad m'kay
<fche> tracepointed nested inside kretprobe
<fche> um, we have mechanisms to prevent that, to stop the nested probe from being entered
* agentzh finds the std stap test suite a good fuzzer when running in parallel on big machines.
<fche> heh, stap was/is a good fuzzer for certain types of kernel bugs too, alas
<agentzh> stp_lock_probe is nested in the backtrace.
<agentzh> a deadlock confirmed?
<agentzh> it doesn't look like it has anything to do with our recent patches?
<agentzh> yeah, fuzzer for everything.
<agentzh> including the hardware.
<fche> yeah I wouldn't think so, but something is very strange if this can happen
<fche> again we have anti-reentrancy measures in the code
<fche> and we'd be hitting this bug all over, including on our rhel7 buildbot
<agentzh> could you point to the lines for the anti-reentrancy measures?
<agentzh> and can you confirm whether the test suite runs serially or in parallel on your buildbot?
<fche> _stp_runtime_entryfn_get_context()
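(A userspace sketch of the anti-reentrancy idea behind _stp_runtime_entryfn_get_context(): each CPU, modeled here as a thread, owns a single context, and a probe that fires while that context is already in use bails out instead of nesting. The types and names below are illustrative stand-ins, not the runtime's code.)
    /* Compile with: gcc -Wall -pthread reentrancy_sketch.c -o reentrancy_sketch */
    #include <stdbool.h>
    #include <stdio.h>

    struct context {
        bool busy;  /* set while a probe handler is running on this "CPU" */
    };

    static __thread struct context percpu_context;  /* one context per thread */

    /* Returns the context if it is free, or NULL if a handler is already
     * running here; a nested probe then simply skips its work. */
    static struct context *get_context(void)
    {
        if (percpu_context.busy)
            return NULL;
        percpu_context.busy = true;
        return &percpu_context;
    }

    static void put_context(struct context *ctx)
    {
        ctx->busy = false;
    }

    static void probe_handler(const char *name)
    {
        struct context *ctx = get_context();

        if (!ctx) {
            printf("%s: reentered, skipping\n", name);
            return;
        }
        printf("%s: running\n", name);
        /* a tracepoint firing inside this handler reenters here and bails out */
        probe_handler("nested probe");
        put_context(ctx);
    }

    int main(void)
    {
        probe_handler("outer probe");
        return 0;
    }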
<agentzh> i think kerneltoast is using the official stap master.
<fche> we normally run serially.
<agentzh> ah, okay, then kerneltoast should try serial runs too to avoid stress :)
<agentzh> it'll take way longer though.
<agentzh> considering that box has an 8c16t CPU.
<kerneltoast> I'd have to do two serial runs to get test results for the task work patch
<kerneltoast> 24+ hours?
<agentzh> it'll take hours at least.
<agentzh> right now it's ~40min with -j17?
<kerneltoast> that was on my 4800H laptop
<kerneltoast> which i no longer have
<kerneltoast> 4800H is faster than i9-9900K iirc
<agentzh> my box should be a little bit faster.
<agentzh> really?
<kerneltoast> your box has better single threaded performance though
<agentzh> on geekbench, i9 wins on both single-core and multi-core.
<kerneltoast> i was also running the testsuite on metal. I guess it'll go a bit slower in a vm
<agentzh> hopefully not on my metal ;)
<agentzh> kvm should be fast when not hitting any bugs.
<agentzh> seems like you'll need 20+ hours to run twice in serial.
<agentzh> it's crazy.
<kerneltoast> ouch
<kerneltoast> well, it's going
<kerneltoast> i'll check back on it after dinner
derek088_ has joined #systemtap
derek0883 has quit [Ping timeout: 264 seconds]
mjw has quit [Quit: Leaving]
derek088_ has quit [Remote host closed the connection]
derek0883 has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
tromey has quit [Quit: ERC (IRC client for Emacs 27.1.50)]
ChanServ has quit [shutting down]
ChanServ has joined #systemtap
derek0883 has quit [Ping timeout: 264 seconds]
amerey has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
amerey has quit [Quit: Leaving]
khaled has quit [Quit: Konversation terminated!]