fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
lijunlong has quit [Read error: Connection reset by peer]
lijunlong has joined #systemtap
<kerneltoast> fche, i have results for the rcu cleanup patch
<kerneltoast> looks like that's good to go
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
hpt has joined #systemtap
<fche> kerneltoast, concur, thanks!
khaled has quit [Quit: Konversation terminated!]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
lijunlong has quit [Ping timeout: 256 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
lijunlong has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<irker573> systemtap: sultan systemtap.git:master * release-4.4-11-g3c4f82ca0 / runtime/linux/runtime_context.h: runtime_context: factor out RCU usage using a rw lock
irker573 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
khaled has joined #systemtap
lijunlong has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
lijunlong has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
irker573 has quit [Quit: transmission timeout]
derek0883 has quit [Ping timeout: 272 seconds]
lijunlong has quit [Remote host closed the connection]
orivej has quit [Ping timeout: 246 seconds]
zamba has quit [Quit: WeeChat 2.4]
hpt has quit [Ping timeout: 256 seconds]
mjw has joined #systemtap
orivej has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
amerey has joined #systemtap
<fche> kerneltoast, some of our buildbots are not happy after this patch, showing apparent hangs/crashes during testing
<fche> hm, one of the on-the-fly tests is common to the crashes (whether on f33, f32, or f34 rawhide x86-64)
<kerneltoast> fche, could i get a log?
<fche> haven't seen one, I think my kernels are hanging and a watchdog is rebooting them within 30ish seconds
<fche> make installcheck RUNTESTFLAGS=uprobes_onthefly.exp
<kerneltoast> fche, can _stp_runtime_context_wait() be called from an NMI?
<kerneltoast> I'm guessing no
<kerneltoast> Because it has an msleep
<fche> no
<fche> that's a shutdown-time process from a clean user context
<kerneltoast> that was the only hazard i saw in my patch when i made it
<kerneltoast> i wonder what's exploding
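The constraint here: msleep() can schedule, so any function that calls it is limited to process context and can never run from an NMI or other atomic context. A rough sketch of the shape of such a shutdown-time wait loop, with hypothetical names (illustrative only, not the actual runtime source):

    #include <linux/atomic.h>
    #include <linux/cpumask.h>
    #include <linux/delay.h>
    #include <linux/percpu.h>

    /* Hypothetical per-cpu busy flag standing in for the real context state. */
    static DEFINE_PER_CPU(atomic_t, ctx_busy_sketch);

    static void context_wait_sketch(void)
    {
            int busy;

            do {
                    int cpu;

                    busy = 0;
                    for_each_possible_cpu(cpu) {
                            if (atomic_read(&per_cpu(ctx_busy_sketch, cpu)))
                                    busy = 1;
                    }
                    if (busy)
                            msleep(100);    /* may sleep: process context only */
            } while (busy);
    }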
<fche> that uprobes_onthefly test has been a good stresser of a bunch of runtime subsystems
<kerneltoast> what's the oldest kernel you're seeing the issue on?
<fche> 5.9 something
<kerneltoast> oldest is 5.9??
<kerneltoast> o_O
<fche> something like that. the buildbots passed on stuff older than fedora32's kernels
<kerneltoast> so only 5.9 explodes? ouch
<fche> can't say 'only', only reporting what I've seen so far
<kerneltoast> i have a theory
<kerneltoast> watchdog could fire if the write_lock_irqsave(&_stp_context_lock, flags); takes too long
<kerneltoast> fche, could you test a patch for me?
<fche> candu like the reactor
<kerneltoast> we don't need to disable IRQs there because of our good friend read_trylock
<fche> hmmmmm why would that help?
<kerneltoast> maybe if the watchdog impl depends on sending IRQs
<kerneltoast> anyway i've got nothing else atm, may as well try it :)
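The proposed change: drop the irqsave on the writer side of _stp_context_lock. A minimal sketch of why that can be safe, assuming (as the discussion does) that every reader which may run from IRQ context uses read_trylock() and bails out instead of spinning; the function bodies are illustrative, not the actual runtime code:

    #include <linux/spinlock.h>
    #include <linux/types.h>

    static DEFINE_RWLOCK(_stp_context_lock);

    /* Probe-side reader: may run in IRQ context, so it must never spin.
     * If a writer holds the lock, the probe hit is simply skipped. */
    static bool context_get_sketch(void)
    {
            if (!read_trylock(&_stp_context_lock))
                    return false;
            /* ... use the per-cpu context ... */
            read_unlock(&_stp_context_lock);
            return true;
    }

    /* Shutdown-side writer: since readers never spin, an interrupt taken
     * while the write lock is held cannot deadlock on it, so a plain
     * write_lock() with IRQs left enabled suffices, and it avoids the
     * long IRQs-off window that might trip an IRQ-driven watchdog. */
    static void context_shutdown_sketch(void)
    {
            write_lock(&_stp_context_lock);
            /* ... tear down per-cpu contexts ... */
            write_unlock(&_stp_context_lock);
    }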
<fche> am rerunning the test on a lockdep kernel, sans watchdog
<kerneltoast> this is why i prefer cats
<kerneltoast> cats help me code too
<fche> on a somewhat nippy day, such a furry laptop would certainly be nice
<fche> ok got a nicer lockdep trace, moohaha
<fche> meh
<fche> not so good
<fche> the trace relates to netconsole sending dmesg content
<fche> rather than the actual problem, whatever it is
<fche> but yeah got the vm hanging on a rawhide 5.10.rc3ish -debug kernel
<fche> argh, can't get sysrq-t going now that it's hung
<fche> will see what else I can do
<fche> it's a nice solid hang
<fche> (this is without your patch)
<kerneltoast> ah ok
derek0883 has joined #systemtap
<kerneltoast> fche, i ran uprobes on the fly for 30 min in a loop and got the probe lock deadlock
<kerneltoast> what if you're just hitting the probe lock deadlock?
<kerneltoast> you can check if that happened if you got a vmcore
<kerneltoast> sometimes it doesn't leave anything in dmesg
<kerneltoast> (such as right now for me)
<kerneltoast> i got backtraces from all the interesting processes at the time of death
<kerneltoast> the swapper backtraces are interesting too
<kerneltoast> it's all very interesting
khaled has quit [Ping timeout: 246 seconds]
khaled has joined #systemtap
<kerneltoast> fche, i think the probe lock deadlocks are caused by the mutex trylock used in IRQ context
<kerneltoast> according to the stuff i dumped into that gist at least
<kerneltoast> mutex_trylock tries to acquire a spin lock without disabling IRQs
<kerneltoast> in the backtraces i've posted, there's a mutex_trylock in process context and then one in IRQ context
<kerneltoast> and these both occur inside the probe locks
<kerneltoast> oh but the IRQs don't occur on the same cpu that attempts the mutex trylock
<kerneltoast> at least in the backtraces i provided
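The hazard being described is the standard reason mutex_trylock() is documented as unusable from interrupt context. A deliberately simplified sketch of the pattern (not the kernel's actual mutex implementation): a trylock built on an internal spinlock taken without disabling IRQs can self-deadlock as soon as it is also called from an interrupt handler.

    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct trylock_sketch {
            spinlock_t wait_lock;   /* taken with plain spin_lock(): IRQs stay enabled */
            bool       owned;
    };

    static bool trylock_sketch_try(struct trylock_sketch *m)
    {
            bool got = false;

            spin_lock(&m->wait_lock);
            if (!m->owned) {
                    m->owned = true;
                    got = true;
            }
            spin_unlock(&m->wait_lock);
            return got;
    }

    /*
     * Process context on CPU 0:            Interrupt on CPU 0:
     *   trylock_sketch_try()
     *     spin_lock(&m->wait_lock)
     *       <IRQ fires> ----------------->   trylock_sketch_try()
     *                                          spin_lock(&m->wait_lock)  <- spins forever
     */

In the backtraces above the two trylocks were on different CPUs, so the exact interleaving there may differ, but the unprotected internal spinlock is the same suspect.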
<fche> mutex_trylock
<fche> ewwwww
<fche> yeah I think it's time to deal with that bad boy
<fche> this was the one we were thinking of nuking by switching STP_BULKMODE on
<kerneltoast> yeah
<kerneltoast> i just didn't get to it yet because i was in the thick of some print insanity :P
<fche> insanity comes in threes
<fche> or three hundreds
<kerneltoast> but yeah i suspect you are hitting the probe lock deadlock
<kerneltoast> were you able to get a vmcore?
<fche> not yet
<kerneltoast> the tests are running on bare metal?
<fche> kvm
<kerneltoast> huh should be easy to grab then
<kerneltoast> i used virsh dump
<fche> trying
<fche> hmm I don't think I ever analyzed a virsh dump before
<lindi-> is the crash tool still the easiest for that?
<kerneltoast> fche, here are the commands i use:
<kerneltoast> virsh dump NAMEOFVM path/to/output-vmcore --memory-only --format kdump-zlib
<kerneltoast> crash path/to/vmlinux path/to/output-vmcore
<fche> ah netconsole logs did make it through to another machine
<fche> but that looks a bit odd
<fche> hmmmm
<fche> what are the chances the act of locking the context triggers a tracepoint which causes an attempt at locking the context
<fche> i.e., simple infinite recursion
<kerneltoast> ...did 5.9 add a lock tracepoint
<fche> wouldn't be surprised if it's been there awhile
<kerneltoast> the unwinder was really not confident about that backtrace
<fche> ok my guess is no way around that except blocking attaching to those tracepoints
<fche> yeah
<kerneltoast> could you grab a vmcore?
<kerneltoast> i think a tracepoint in a lock is unlikely
<kerneltoast> i'm not seeing any such thing in 5.10
<fche> have a 2GB one
<fche> will try to /bin/crash it, just need to fish out the right vmlinux for it
<kerneltoast> i hope you didn't have kaslr enabled
<kerneltoast> fche, i deleted the mutex trylocks and still got the probe lock deadlock, but now the backtraces are different
<kerneltoast> so i think the mutex trylock is one source of probe lock deadlocking
<fche> wdyt about focusing on bulkifying the transport next?
<kerneltoast> so much context switching
<kerneltoast> that partially depends on the wip print patch
<kerneltoast> because irqs need to be disabled when modifying the print buffer and flushing
<kerneltoast> i think my print patch will fix any mutex trylock deadlock potential though
<fche> would love to see a new patch that starts with STP_BULKMODE'ing the runtime
<fche> and backing off of the inode mutex
<fche> and then doing as little as possible to any of the other lock goo
<kerneltoast> yeah but bulkifying needs IRQs disabled when playing with the log buffer
<kerneltoast> at all callsites
<kerneltoast> my fat print patch does that
<kerneltoast> we could ship the print patch and resolve the NMI stuff later
derek0883 has quit [Remote host closed the connection]
<kerneltoast> since NMIs can already cause log buffer panics right now
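The bulkmode prerequisite being described: every write to the per-cpu log buffer has to run with local IRQs disabled, so an interrupt-context probe on the same CPU cannot land in the middle of a partial write or flush. A minimal sketch of that pattern with hypothetical names (not the actual patch from the gist linked below):

    #include <linux/irqflags.h>
    #include <linux/percpu.h>
    #include <linux/string.h>

    struct stp_log_sketch {
            size_t used;
            char   buf[8192];
    };

    static DEFINE_PER_CPU(struct stp_log_sketch, stp_log_sketch_buf);

    static void log_write_sketch(const char *data, size_t len)
    {
            struct stp_log_sketch *log;
            unsigned long flags;

            /* Disabling IRQs keeps same-CPU interrupt probes out of the
             * buffer; it does nothing for NMIs, which is the remaining
             * problem mentioned above. */
            local_irq_save(flags);
            log = this_cpu_ptr(&stp_log_sketch_buf);
            if (len > sizeof(log->buf))
                    goto out;
            if (log->used + len > sizeof(log->buf)) {
                    /* ... flush log->buf to the transport ... */
                    log->used = 0;
            }
            memcpy(log->buf + log->used, data, len);
            log->used += len;
    out:
            local_irq_restore(flags);
    }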
derek088_ has joined #systemtap
<kerneltoast> fche, this is the patch i'm talking about: https://gist.github.com/kerneltoast/93d1f216c8fe1f5f21d8740422afc631
<kerneltoast> it's an improvement over the current situation but doesn't fix the NMI problem
<kerneltoast> and it makes it possible to bulkify
<kerneltoast> that patch did pretty well in the testsuite, if you scroll down and look
derek088_ has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> fche, btw I'm on vacation thursday and friday
<fche> aha
<fche> well, speculatively, we could try pushing it into the tree and see if that helps
<kerneltoast> how long can master be kept aflame?
<fche> we're post-release, we should be ok for a bit
<kerneltoast> cool
<kerneltoast> i'll try getting something done for bulkmode today
<kerneltoast> manage to fish out the vmlinux for your vmcore?
<fche> oh was taking a break
<kerneltoast> no breaks!
<kerneltoast> WORK HARDER
* kerneltoast takes a break
<fche> hahaha
<kerneltoast> is that what working in an office is like
<fche> YES - at least it was in 1989
khaled has quit [Quit: Konversation terminated!]
<kerneltoast> I should probably watch office space
<fche> 1999
<fche> there was much progress
khaled has joined #systemtap
<kerneltoast> back before all this smp crap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
amerey has quit [Quit: Leaving]
fche has quit [Read error: Connection reset by peer]
fche has joined #systemtap
derek0883 has quit [Remote host closed the connection]