fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
lijunlong has quit [Read error: Connection reset by peer]
lijunlong has joined #systemtap
<kerneltoast> fche, i have results for the rcu cleanup patch
<kerneltoast> looks like that's good to go
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
hpt has joined #systemtap
<fche> kerneltoast, concur, thanks!
khaled has quit [Quit: Konversation terminated!]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
lijunlong has quit [Ping timeout: 256 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
lijunlong has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<irker573> systemtap: sultan systemtap.git:master * release-4.4-11-g3c4f82ca0 / runtime/linux/runtime_context.h: runtime_context: factor out RCU usage using a rw lock
irker573 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
khaled has joined #systemtap
lijunlong has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
lijunlong has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
irker573 has quit [Quit: transmission timeout]
derek0883 has quit [Ping timeout: 272 seconds]
lijunlong has quit [Remote host closed the connection]
orivej has quit [Ping timeout: 246 seconds]
zamba has quit [Quit: WeeChat 2.4]
hpt has quit [Ping timeout: 256 seconds]
mjw has joined #systemtap
orivej has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
amerey has joined #systemtap
<fche> kerneltoast, some of our buildbots are not happy after this patch, showing apparent hangs/crashes during testing
<fche> hm, one of the on-the-fly tests is common to the crashes (whether on f33, f32, or f34 rawhide x86-64)
<kerneltoast> fche, could i get a log?
<fche> haven't seen one, I think my kernels are hanging and a watchdog is rebooting them within 30ish seconds
<fche> make installcheck RUNTESTFLAGS=uprobes_onthefly.exp
<kerneltoast> fche, can _stp_runtime_context_wait() be called from an NMI?
<kerneltoast> I'm guessing no
<kerneltoast> Because it has an msleep
<fche> no
<fche> that's a shutdown-time process from a clean user context
<kerneltoast> that was the only hazard i saw in my patch when i made it
<kerneltoast> i wonder what's exploding
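The constraint here: msleep() can schedule, so any function that calls it is limited to process context and can never run from an NMI or other atomic context. A rough sketch of the shape of such a shutdown-time wait loop, with hypothetical names (illustrative only, not the actual runtime source):

    #include <linux/atomic.h>
    #include <linux/cpumask.h>
    #include <linux/delay.h>
    #include <linux/percpu.h>

    /* Hypothetical per-cpu busy flag standing in for the real context state. */
    static DEFINE_PER_CPU(atomic_t, ctx_busy_sketch);

    static void context_wait_sketch(void)
    {
            int busy;

            do {
                    int cpu;

                    busy = 0;
                    for_each_possible_cpu(cpu) {
                            if (atomic_read(&per_cpu(ctx_busy_sketch, cpu)))
                                    busy = 1;
                    }
                    if (busy)
                            msleep(100);    /* may sleep: process context only */
            } while (busy);
    }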
<fche> that uprobes_onthefly test has been a good stresser of a bunch of runtime subsystems
<kerneltoast> what's the oldest kernel you're seeing the issue on?
<fche> 5.9 something
<kerneltoast> oldest is 5.9??
<kerneltoast> o_O
<fche> something like that. the buildbots passed on stuff older than fedora32's kernels
<kerneltoast> so only 5.9 explodes? ouch
<fche> can't say 'only', only reporting what I've seen so far
<kerneltoast> i have a theory
<kerneltoast> watchdog could fire if the write_lock_irqsave(&_stp_context_lock, flags); takes too long
<kerneltoast> fche, could you test a patch for me?
<fche> candu like the reactor
<kerneltoast> we don't need to disable IRQs there because of our good friend read_trylock
<fche> hmmmmm why would that help?
<kerneltoast> maybe if the watchdog impl depends on sending IRQs
<kerneltoast> anyway i've got nothing else atm, may as well try it :)
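The proposed change: drop the irqsave on the writer side of _stp_context_lock. A minimal sketch of why that can be safe, assuming (as the discussion does) that every reader which may run from IRQ context uses read_trylock() and bails out instead of spinning; the function bodies are illustrative, not the actual runtime code:

    #include <linux/spinlock.h>
    #include <linux/types.h>

    static DEFINE_RWLOCK(_stp_context_lock);

    /* Probe-side reader: may run in IRQ context, so it must never spin.
     * If a writer holds the lock, the probe hit is simply skipped. */
    static bool context_get_sketch(void)
    {
            if (!read_trylock(&_stp_context_lock))
                    return false;
            /* ... use the per-cpu context ... */
            read_unlock(&_stp_context_lock);
            return true;
    }

    /* Shutdown-side writer: since readers never spin, an interrupt taken
     * while the write lock is held cannot deadlock on it, so a plain
     * write_lock() with IRQs left enabled suffices, and it avoids the
     * long IRQs-off window that might trip an IRQ-driven watchdog. */
    static void context_shutdown_sketch(void)
    {
            write_lock(&_stp_context_lock);
            /* ... tear down per-cpu contexts ... */
            write_unlock(&_stp_context_lock);
    }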
<fche> am rerunning the test on a lockdep kernel, sans watchdog
<kerneltoast> this is why i prefer cats
<kerneltoast> cats help me code too
<fche> on a somewhat nippy day, such a furry laptop would certainly be nice
<fche> ok got a nicer lockdep trace, moohaha
<fche> meh
<fche> not so good
<fche> the trace relates to netconsole sending dmesg content
<fche> rather than the actual problem, whatever it is
<fche> but yeah got the vm hanging on a rawhide 5.10.rc3ish -debug kernel
<fche> argh, can't get sysrq-t going now that it's hung
<fche> will see what else I can do
<fche> it's a nice solid hang
<fche> (this is without your patch)
<kerneltoast> ah ok
derek0883 has joined #systemtap
<kerneltoast> fche, i ran uprobes on the fly for 30 min in a loop and got the probe lock deadlock
<kerneltoast> what if you're just hitting the probe lock deadlock?
<kerneltoast> you can check if that happened if you got a vmcore
<kerneltoast> sometimes it doesn't leave anything in dmesg
<kerneltoast> (such as right now for me)
<kerneltoast> i got backtraces from all the interesting processes at the time of death
<kerneltoast> the swapper backtraces are interesting too
<kerneltoast> it's all very interesting
khaled has quit [Ping timeout: 246 seconds]
khaled has joined #systemtap
<kerneltoast> fche, i think the probe lock deadlocks are caused by the mutex trylock used in IRQ context
<kerneltoast> according to the stuff i dumped into that gist at least
<kerneltoast> mutex_trylock tries to acquire a spin lock without disabling IRQs
<kerneltoast> in the backtraces i've posted, there's a mutex_trylock in process context and then one in IRQ context
<kerneltoast> and these both occur inside the probe locks
<kerneltoast> oh but the IRQs don't occur on the same cpu that attempts the mutex trylock
<kerneltoast> at least in the backtraces i provided
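The hazard being described is the standard reason mutex_trylock() is documented as unusable from interrupt context. A deliberately simplified sketch of the pattern (not the kernel's actual mutex implementation): a trylock built on an internal spinlock taken without disabling IRQs can self-deadlock as soon as it is also called from an interrupt handler.

    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct trylock_sketch {
            spinlock_t wait_lock;   /* taken with plain spin_lock(): IRQs stay enabled */
            bool       owned;
    };

    static bool trylock_sketch_try(struct trylock_sketch *m)
    {
            bool got = false;

            spin_lock(&m->wait_lock);
            if (!m->owned) {
                    m->owned = true;
                    got = true;
            }
            spin_unlock(&m->wait_lock);
            return got;
    }

    /*
     * Process context on CPU 0:            Interrupt on CPU 0:
     *   trylock_sketch_try()
     *     spin_lock(&m->wait_lock)
     *       <IRQ fires> ----------------->   trylock_sketch_try()
     *                                          spin_lock(&m->wait_lock)  <- spins forever
     */

In the backtraces above the two trylocks were on different CPUs, so the exact interleaving there may differ, but the unprotected internal spinlock is the same suspect.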
<fche> mutex_trylock
<fche> ewwwww
<fche> yeah I think it's time to deal with that bad boy
<fche> this was the one we were thinking of nuking by switching STP_BULKMODE on
<kerneltoast> yeah
<kerneltoast> i just didn't get to it yet because i was in the thick of some print insanity :P
<fche> insanity comes in threes
<fche> or three hundreds
<kerneltoast> but yeah i suspect you are hitting the probe lock deadlock
<kerneltoast> were you able to get a vmcore?
<fche> not yet
<kerneltoast> the tests are running on bare metal?
<fche> kvm
<kerneltoast> huh should be easy to grab then
<kerneltoast> i used virsh dump
<fche> trying
<fche> hmm I don't think I ever analyzed a virsh dump before
<lindi-> is the crash tool still the easiest for that?
<kerneltoast> fche, here are the commands i use:
<kerneltoast> virsh dump NAMEOFVM path/to/output-vmcore --memory-only --format kdump-zlib
<kerneltoast> crash path/to/vmlinux path/to/output-vmcore
<fche> ah netconsole logs did make it through to another machine
<fche> but that looks a bit odd
<fche> hmmmm
<fche> what are the chances the act of locking the context triggers a tracepoint which causes an attempt at locking the context
<fche> i.e., simple infinite recursion
<kerneltoast> ...did 5.9 add a lock tracepoint
<fche> wouldn't be surprised if it's been there awhile
<kerneltoast> the unwinder was really not confident about that backtrace
<fche> ok my guess is no way around that except blocking attaching to those tracepoints
<fche> yeah
<kerneltoast> could you grab a vmcore?
<kerneltoast> i think a tracepoint in a lock is unlikely
<kerneltoast> i'm not seeing any such thing in 5.10
<fche> have a 2GB one
<fche> will try to /bin/crash it, just need to fish out the right vmlinux for it
<kerneltoast> i hope you didn't have kaslr enabled
<kerneltoast> fche, i deleted the mutex trylocks and still got the probe lock deadlock, but now the backtraces are different
<kerneltoast> so i think the mutex trylock is one source of probe lock deadlocking
<fche> wdyt about focusing on bulkifying the transport next?
<kerneltoast> so much context switching
<kerneltoast> that partially depends on the wip print patch
<kerneltoast> because irqs need to be disabled when modifying the print buffer and flushing
<kerneltoast> i think my print patch will fix any mutex trylock deadlock potential though
<fche> would love to see a new patch that starts with STP_BULKMODE'ing the runtime
<fche> and backing off of the inode mutex
<fche> and then doing as little as possible to any of the other lock goo
<kerneltoast> yeah but bulkifying needs IRQs disabled when playing with the log buffer
<kerneltoast> at all callsites
<kerneltoast> my fat print patch does that
<kerneltoast> we could ship the print patch and resolve the NMI stuff later
derek0883 has quit [Remote host closed the connection]
<kerneltoast> since NMIs can already cause log buffer panics right now
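The bulkmode prerequisite being described: every write to the per-cpu log buffer has to run with local IRQs disabled, so an interrupt-context probe on the same CPU cannot land in the middle of a partial write or flush. A minimal sketch of that pattern with hypothetical names (not the actual patch from the gist linked below):

    #include <linux/irqflags.h>
    #include <linux/percpu.h>
    #include <linux/string.h>

    struct stp_log_sketch {
            size_t used;
            char   buf[8192];
    };

    static DEFINE_PER_CPU(struct stp_log_sketch, stp_log_sketch_buf);

    static void log_write_sketch(const char *data, size_t len)
    {
            struct stp_log_sketch *log;
            unsigned long flags;

            /* Disabling IRQs keeps same-CPU interrupt probes out of the
             * buffer; it does nothing for NMIs, which is the remaining
             * problem mentioned above. */
            local_irq_save(flags);
            log = this_cpu_ptr(&stp_log_sketch_buf);
            if (len > sizeof(log->buf))
                    goto out;
            if (log->used + len > sizeof(log->buf)) {
                    /* ... flush log->buf to the transport ... */
                    log->used = 0;
            }
            memcpy(log->buf + log->used, data, len);
            log->used += len;
    out:
            local_irq_restore(flags);
    }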
derek088_ has joined #systemtap
<kerneltoast> fche, this is the patch i'm talking about: https://gist.github.com/kerneltoast/93d1f216c8fe1f5f21d8740422afc631
<kerneltoast> it's an improvement over the current situation but doesn't fix the NMI problem
<kerneltoast> and it makes it possible to bulkify
<kerneltoast> that patch did pretty well in the testsuite, if you scroll down and look
derek088_ has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> fche, btw I'm on vacation thursday and friday
<fche> aha
<fche> well, speculatively, we could try pushing it into the tree and see if that helps
<kerneltoast> how long can master be kept aflame?
<fche> we're post-release, we should be ok for a bit
<kerneltoast> cool
<kerneltoast> i'll try getting something done for bulkmode today
<kerneltoast> manage to fish out the vmlinux for your vmcore?
<fche> oh was taking a break
<kerneltoast> no breaks!
<kerneltoast> WORK HARDER
* kerneltoast takes a break
<fche> hahaha
<kerneltoast> is that what working in an office is like
<fche> YES - at least it was in 1989
khaled has quit [Quit: Konversation terminated!]
<kerneltoast> I should probably watch office space
<fche> 1999
<fche> there was much progress
khaled has joined #systemtap
<kerneltoast> back before all this smp crap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
amerey has quit [Quit: Leaving]
fche has quit [Read error: Connection reset by peer]
fche has joined #systemtap
derek0883 has quit [Remote host closed the connection]