fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
khaled has quit [Client Quit]
lijunlong has quit [Read error: Connection reset by peer]
lijunlong has joined #systemtap
hpt has joined #systemtap
orivej_ has quit [Ping timeout: 260 seconds]
<kerneltoast> fche, i'm pumping this through the testsuite now: https://gist.github.com/kerneltoast/37a38b68aad8a8c669074465e4bafc81
<kerneltoast> it doesn't fail at the test that broke the previous patch
<kerneltoast> honestly, who needs print messages
<kerneltoast> #define _stp_print(...)
<kerneltoast> fche, if you want to try convincing lkml to fix this tracepoint dilemma feel free
<kerneltoast> slap your fancy @redhat.com mail on the table and demand fixes :P
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> oh boy, the fight isn't over yet
<kerneltoast> probes can be called from inside an NMI
<kerneltoast> so my locks can deadlock even though they have irqsave
hpt has quit [Ping timeout: 240 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej_ has joined #systemtap
orivej has joined #systemtap
orivej_ has quit [Ping timeout: 260 seconds]
orivej has quit [Read error: Connection reset by peer]
orivej has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
hpt has joined #systemtap
orivej_ has joined #systemtap
orivej has quit [Ping timeout: 264 seconds]
orivej_ has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
tonyj has quit [Remote host closed the connection]
<kerneltoast> fche, okay now it's NMI-safe and running the testsuite: https://gist.github.com/kerneltoast/74d1094a7c5b6c8d9035b44b4a5dfd7a
orivej_ has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
<agentzh> kerneltoast: will it waste CPU cycles?
<agentzh> hopefully it's not hot polling :)
<kerneltoast> it's polling every jiffy
<kerneltoast> the poll worker itself is atomic and doesn't do much work
<agentzh> every jiffy sounds very frequent :)
<kerneltoast> it's not, it's just a worker
<kerneltoast> and it's not polling every cpu
<agentzh> okay, we can do more benchmark.
<agentzh> with real use cases.
<agentzh> will it lose more data than before?
<agentzh> or it's similar?
<agentzh> in case of big data output.
derek0883 has quit [Remote host closed the connection]
<kerneltoast> there is a higher risk than before of losing data with large output
<kerneltoast> the reason why this wasn't a problem before was because a single buffer was used without synchronization
<kerneltoast> and that caused one of the panics i posted here
<agentzh> okay
<kerneltoast> there is no chance of data getting garbled anymore
<kerneltoast> but there is a chance of it being truncated
<agentzh> i wonder if we can use a safer approach for contexts like probe end?
<agentzh> it's known to be a safe context?
<agentzh> it's common for stap scripts to emit output in that probe.
<kerneltoast> wouldn't that delay all the printing until the stap module unloads?
<agentzh> that is an optimization instead of enforcement.
hpt has quit [Ping timeout: 264 seconds]
<agentzh> just for the probe end code.
<agentzh> we can still use your polling approach for printing in other probe contexts.
<kerneltoast> is probe end executed inside a tracepoint?
<agentzh> nope.
<agentzh> but you can check.
<agentzh> i'm not 100% sure.
<kerneltoast> well we still won't be able to get rid of the polling
<agentzh> it's in the syscall context of the staprun i believe.
<kerneltoast> there is already an optimization to avoid relying on the poll worker
hpt has joined #systemtap
<kerneltoast> the poll worker is only used when probe flush is called with IRQs disabled
<kerneltoast> *print flush
derek0883 has joined #systemtap
<kerneltoast> when IRQs are enabled, the print flush is done directly in the current context
<kerneltoast> yeah that will defer print messages until probe exit
<agentzh> yes
<agentzh> it's the most common case for at least our stap scripts.
<kerneltoast> i don't think the overhead from polling will be too bad
<agentzh> we also have some printing in timer.s(N) probes, which is in hrtimer_interrupt(), unfortunately.'
<agentzh> it's not the overhead i'm worried about right now, but data truncation :)
<kerneltoast> the polling is done to avoid truncation
<kerneltoast> if we skip flushing when the context doesn't allow it, we'll lose data
<agentzh> yeah sure.
<agentzh> i'm worried that it is worse than before.
<kerneltoast> the polling won't truncate messages in-flight
<agentzh> or is it?
<kerneltoast> all the synchronization in place now should make it better
<kerneltoast> it's big and ugly but i've been very careful
derek0883 has quit [Remote host closed the connection]
<agentzh> okay, cool
khaled has joined #systemtap
hpt has quit [Ping timeout: 272 seconds]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
khaled has quit [Remote host closed the connection]
khaled has joined #systemtap
_whitelogger has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
tonyj has joined #systemtap
tromey has joined #systemtap
amerey has joined #systemtap
wcohen has quit [Remote host closed the connection]
wcohen has joined #systemtap
<kerneltoast> fche, yo
<fche> hello
<kerneltoast> the print buffers need to be bigger
<fche> coudl be
<kerneltoast> that was the only problem exposed by the testsuite
<fche> hm just looking over your gist patch
<fche> how much of the recent batch of problems comes from having to muck with the inode mutex?
<fche> I know I've asked this before, but .....
<kerneltoast> i didn't go further than the lockdep warning
<fche> is there some spinlocky way of protecting the inode content in question other than via agentzh's mutex_trylock stuff?
<kerneltoast> possible if you roll your own mutex code
<kerneltoast> and copy the definition of struct mutex into stap
<fche> I mean short of that
<kerneltoast> nope
<kerneltoast> the inode only has that mutex
<fche> is it solely the inode size field that we're trying to protect ?
<kerneltoast> i don't think so
<kerneltoast> agentzh said that data was getting garbled
<kerneltoast> not truncated
<fche> depends on what form of garbling there was
derek0883 has joined #systemtap
<fche> in your cpumask variant of the patch, is it proper for cpumask_copy & cpumask_clear to come in that sequence?
<fche> as opposed to clear first?
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> yeah, if we clear it first then we'll just see nothing
<fche> oh wait
<fche> you're clearing the input not the output, got it
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> fche, so I'm guessing after reading the bug report you don't think we can dodge the mutex?
<agentzh> kerneltoast: it'll be great if you can redo the stress tests using the sample .stp script in that bugzilla ticket with your latest patch to make sure it is still working fine.
<kerneltoast> hmm but the inode mutex hasn't been removed
<kerneltoast> but i'll try that .stp anyway
<agentzh> i know. just to make sure it has no other side effects in such extreme cases.
<agentzh> like a specific stress test case.
<agentzh> thanks
<kerneltoast> agentzh, do you know where stap does the relayfs read in userspace?
<agentzh> in staprun
<kerneltoast> k
<agentzh> iirc, staprun/mainloop.c
<fche> kerneltoast, yeah
<agentzh> func stp_main_loop()
<agentzh> kerneltoast: and also in staprun/relay.c
<agentzh> in reader_thread
<agentzh> func
<agentzh> the latter is more related i think.
<kerneltoast> lol we don't need the inode mutex
<kerneltoast> we just need to disable irqs on the local cpu when flushing
<kerneltoast> i am going to cry :)
<kerneltoast> i should've looked at relayfs earlier...
<agentzh> wow
<kerneltoast> staprun properly pins a reader thread to each cpu
<kerneltoast> and only has that thread read that cpu's buffer
<agentzh> there are two different modes?
<kerneltoast> relayfs stores a buffer per cpu
<agentzh> one pin'd one is not?
<agentzh> just my vague impression.
<kerneltoast> there are two modes in stap for some reason actually
<kerneltoast> STP_BULKMODE is pinned mode
<agentzh> yeah
<agentzh> BULKMODE is not the default.
<agentzh> and we don't use it for simplicity.
<agentzh> because it requires an extra stap-merge step to collect the per-cpu output files.
<kerneltoast> so BULKMODE would need to be the default and then the print flush function should have its own local irq save
<kerneltoast> that is the alternative.
<agentzh> it'll be a tough call for fche :)
<kerneltoast> non-bulkmode is a clear abuse of relayfs
<agentzh> since it seems to break backward compatibility.
<agentzh> or you have ways to do it otherwise?
<kerneltoast> which backward compatibility are you thinking of?
<kerneltoast> bulkmode doesn't work on old kernels?
<agentzh> bulkmode requires stap-merge to post-process the output.
<agentzh> iirc
<agentzh> it changes the way how users would normally use stap.
<kerneltoast> the ALTERNATIVE alternative would be to have a single buffer for all CPUs protected by a single lock. not so great
<agentzh> sounds tricky.
<agentzh> the non-bulk mode indeed may overload cpu 0.
<agentzh> i also complained about it in the past.
<kerneltoast> we're technically doing the output merging already
<kerneltoast> with this dance that my patch does
<agentzh> the current default way is not pretty indeed.
<agentzh> hopefully we can have something better.
<fche> hmmmm
<agentzh> and without forcing the user to always use stap-merge.
<fche> now that I think about it (again), yeah it's weird that the kernel->user isn't the normal 1-buffer-per-cpu thing
<fche> and then let userspace merge or not merge (depending on -b)
<kerneltoast> i guess if the user wants non-bulkmode, we'd have a single buffer for all cpus
<fche> at some point
<fche> but that point does not have to be at the relayfs level
<fche> it can be at the staprun/stapio STDOUT level
<kerneltoast> it does need to be at the relayfs level. relayfs is designed to use per-cpu buffers
<kerneltoast> non-bulkmode abuses relayfs by reading per-cpu data from different cpus
<kerneltoast> which is why we end up needing the inode mutex
<fche> yes I understand
<fche> and I agree it doesn't smell right
<fche> my point is we can emulate non-bulk mode by making stapio/staprun still receive all the per-cpu buffers
<fche> but merge them (well, pipe them to stdout as fast as possible, probably on a per-line buffered basis, probably)
<kerneltoast> how would we do that though
<fche> well we already have N threads in stap* reading the buffers
<fche> the trace$N files
<fche> instead of writing to a separate file, they can each write to stdout in non "-b" mode
<kerneltoast> the N threads are only pinned to each cpu in bulkmode though
<kerneltoast> err are there really N threads in non-bulkmode?
<kerneltoast> because there's this check inside relay.c:
<kerneltoast> return -1;
<kerneltoast> _err("This is inconsistent! Please file a bug report. Exiting now.\n");
<kerneltoast> if (ncpus > 1 && bulkmode == 0) {
<kerneltoast> _err("ncpus=%d, bulkmode = %d\n", ncpus, bulkmode);
<kerneltoast> }
<fche> well we can change that easily enough.
<fche> We Control the Vertical. We Control the Horizontal.
<kerneltoast> yeah but i dunno how much of "bulkmode" needs to be bulk
<fche> meaning?
<kerneltoast> what exactly is user avoiding by not using bulkmode?
<kerneltoast> the per-cpu userspace threads?
<kerneltoast> the multiple files?
<fche> not bulk mode: trades convenience for performance
<fche> and vice versa
<kerneltoast> and we're nuking some of that performance by spawning a bunch of threads, no?
<fche> the performance loss is probably that of synchronizing/merging things early
<fche> having N threads do N non-conflicting things is fine
<kerneltoast> does the bulkmode merging try to order prints correctly?
<fche> yes, I believe via timestamps
<fche> so that computation would be deferred in the -b case
<kerneltoast> and we'd just ditch that with non-bulk
<fche> or do the computation live within staprun/stapio
<kerneltoast> hmm if we do it live then is there a point to the bulkmode toggle?
<fche> the point of the toggle is to let a user choose whether to do it live or not.
<fche> doing it live incurs some runtime cost
<kerneltoast> ah ok
<fche> not doing it live incurs the cost LATER, after the stap session is done.
<kerneltoast> so the default would be geared towards convenience via live merging
<fche> yes.
<kerneltoast> gotcha
<kerneltoast> now it just needs to be coded!
<fche> and the way that's implemented NOW is that the kernel side does the merging (via locks)
<fche> yes, just a SMOP, but maybe not serious
<fche> because the kernel side just needs to -DSTAP_BULKMODE=1
<fche> and userspace needs to act as if bulkmode=1 in a few more places, almost
<fche> how does that sound?
<kerneltoast> sounds good
<kerneltoast> if only i could go back in time a week
<kerneltoast> and tell myself not to try fixing this in the runtime
<fche> geez get on that mr. time traveler guy
<kerneltoast> b r a i n i s m e l t i n g
jistone has quit [Ping timeout: 260 seconds]
jistone has joined #systemtap
jistone has quit [Ping timeout: 260 seconds]
<kerneltoast> fche, willing to take my big print patch if i rework it to just always flush directly as is done now?
<kerneltoast> there's other nice stuff in that patch, like guarding against print usage while the print driver is dead
<fche> kerneltoast, would take a look
<kerneltoast> cool, i'll do that after i get some grub
jistone has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
jistone has quit [Ping timeout: 260 seconds]
jistone_ has joined #systemtap
jistone_ is now known as jistone
jistone has quit [Changing host]
jistone has joined #systemtap
sscox has quit [Quit: sscox]
<agentzh> live merging will be cool for this.
<agentzh> it'll also solve the cpu 0 issue i mentioned above.
<agentzh> glad we have something simpler which also solves another old issue :)
<fche> yes
<fche> although
<fche> with our luck
<fche> WATCH SOMETHING ELSE CRAWL OUT FROM UNDER THE ROCK
<agentzh> heh, hopefully not a snake :)
<agentzh> i'll make sure we do extensive testing with the new solution.
sscox has joined #systemtap
<agentzh> this kernel thing is hard.
<fche> darn kernel thing
<agentzh> :)
tromey has quit [Quit: ERC (IRC client for Emacs 27.1.50)]
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
amerey has quit [Quit: Leaving]
mjw has quit [Quit: Leaving]
derek0883 has joined #systemtap
derek088_ has joined #systemtap
derek0883 has quit [Ping timeout: 272 seconds]