#systemtap on 2020-11-19 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

00:22 khaled has quit [Quit: Konversation terminated!]

00:24 khaled has joined #systemtap

00:25 khaled has quit [Client Quit]

00:49 lijunlong has quit [Read error: Connection reset by peer]

00:53 lijunlong has joined #systemtap

01:31 hpt has joined #systemtap

01:33 orivej_ has quit [Ping timeout: 260 seconds]

01:53 <kerneltoast> fche, i'm pumping this through the testsuite now: https://gist.github.com/kerneltoast/37a38b68aad8a8c669074465e4bafc81

01:53 <kerneltoast> it doesn't fail at the test that broke the previous patch

01:54 <kerneltoast> honestly, who needs print messages

01:54 <kerneltoast> #define _stp_print(...)

01:57 <kerneltoast> fche, if you want to try convincing lkml to fix this tracepoint dilemma feel free

01:57 <kerneltoast> slap your fancy @redhat.com mail on the table and demand fixes :P

03:46 derek0883 has quit [Remote host closed the connection]

03:49 derek0883 has joined #systemtap

04:40 <kerneltoast> oh boy, the fight isn't over yet

04:40 <kerneltoast> probes can be called from inside an NMI

04:41 <kerneltoast> so my locks can deadlock even though they have irqsave
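
The deadlock described here arises because an NMI can interrupt a CPU even while IRQs are masked. If the interrupted code holds the lock and the NMI handler then spins on it, the holder never gets to run again. A minimal userspace sketch of one common way around this is to try the lock instead of spinning, and to drop the message on failure. It assumes C11 atomics as a stand-in for a kernel lock. The names `print_lock_t`, `print_trylock`, and `probe_print` are hypothetical, and this is not necessarily what the patch in the gist does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a per-CPU print-buffer lock. */
typedef struct {
    atomic_flag held;
} print_lock_t;

#define PRINT_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns true if the lock was taken, false if it is already held.
 * It never spins, so it cannot deadlock against an interrupted holder. */
static bool print_trylock(print_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void print_unlock(print_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Simulates a probe that wants to print.
 * Returns true if the message was logged, false if it was dropped. */
static bool probe_print(print_lock_t *l, int *logged)
{
    if (!print_trylock(l))
        return false;   /* lock holder was interrupted: drop, do not spin */
    (*logged)++;
    print_unlock(l);
    return true;
}
```

The trade-off is the one discussed later in the log: a failed trylock means a lost or truncated message rather than a hung machine.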

04:55 hpt has quit [Ping timeout: 240 seconds]

04:59 derek0883 has quit [Remote host closed the connection]

05:06 derek0883 has joined #systemtap

05:29 orivej has joined #systemtap

05:48 orivej has quit [Ping timeout: 260 seconds]

05:48 orivej_ has joined #systemtap

05:52 orivej has joined #systemtap

05:53 orivej_ has quit [Ping timeout: 260 seconds]

05:57 orivej has quit [Read error: Connection reset by peer]

05:59 orivej has joined #systemtap

06:04 orivej has quit [Ping timeout: 260 seconds]

06:04 orivej has joined #systemtap

06:09 derek0883 has quit [Remote host closed the connection]

06:17 derek0883 has joined #systemtap

06:19 orivej has quit [Ping timeout: 260 seconds]

06:19 orivej has joined #systemtap

06:24 hpt has joined #systemtap

06:28 orivej_ has joined #systemtap

06:28 orivej has quit [Ping timeout: 264 seconds]

06:34 orivej_ has quit [Ping timeout: 256 seconds]

06:34 orivej has joined #systemtap

06:37 tonyj has quit [Remote host closed the connection]

06:37 <kerneltoast> fche, okay now it's NMI-safe and running the testsuite: https://gist.github.com/kerneltoast/74d1094a7c5b6c8d9035b44b4a5dfd7a

06:40 orivej_ has joined #systemtap

06:40 orivej has quit [Ping timeout: 256 seconds]

06:47 <agentzh> kerneltoast: will it waste CPU cycles?

06:48 <agentzh> hopefully it's not hot polling :)

06:48 <kerneltoast> it's polling every jiffy

06:48 <kerneltoast> the poll worker itself is atomic and doesn't do much work

06:49 <agentzh> every jiffy sounds very frequent :)

06:49 <kerneltoast> it's not, it's just a worker

06:49 <kerneltoast> and it's not polling every cpu

06:49 <agentzh> okay, we can do more benchmark.

06:49 <agentzh> with real use cases.

06:50 <agentzh> will it lose more data than before?

06:50 <agentzh> or it's similar?

06:50 <agentzh> in case of big data output.

06:50 derek0883 has quit [Remote host closed the connection]

06:50 <kerneltoast> there is a higher risk than before of losing data with large output

06:50 <kerneltoast> the reason why this wasn't a problem before was because a single buffer was used without synchronization

06:51 <kerneltoast> and that caused one of the panics i posted here

06:51 <agentzh> okay

06:51 <kerneltoast> there is no chance of data getting garbled anymore

06:51 <kerneltoast> but there is a chance of it being truncated

06:52 <agentzh> i wonder if we can use a safer approach for contexts like probe end?

06:52 <agentzh> it's known to be a safe context?

06:52 <agentzh> it's common for stap scripts to emit output in that probe.

06:52 <kerneltoast> wouldn't that delay all the printing until the stap module unloads?

06:53 <agentzh> that is an optimization instead of enforcement.

06:53 hpt has quit [Ping timeout: 264 seconds]

06:53 <agentzh> just for the probe end code.

06:53 <agentzh> we can still use your polling approach for printing in other probe contexts.

06:54 <kerneltoast> is probe end executed inside a tracepoint?

06:54 <agentzh> nope.

06:54 <agentzh> but you can check.

06:54 <agentzh> i'm not 100% sure.

06:54 <kerneltoast> well we still won't be able to get rid of the polling

06:54 <agentzh> it's in the syscall context of the staprun i believe.

06:54 <kerneltoast> there is already an optimization to avoid relying on the poll worker

06:55 hpt has joined #systemtap

06:55 <kerneltoast> the poll worker is only used when probe flush is called with IRQs disabled

06:55 <kerneltoast> *print flush

06:55 derek0883 has joined #systemtap

06:55 <kerneltoast> when IRQs are enabled, the print flush is done directly in the current context
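
The two flush paths described above can be sketched roughly as follows. This is a userspace model under stated assumptions, not the runtime's code: `struct print_buf`, `emit`, `print_flush`, and `poll_worker` are hypothetical names, and the `irqs_disabled` argument stands in for whatever context check the real patch performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU print buffer; not SystemTap's real structure. */
struct print_buf {
    size_t len;      /* bytes waiting to be flushed */
    bool   deferred; /* set when the flush was left to the poll worker */
};

/* Stand-in for pushing the buffer out to the transport. */
static size_t emit(struct print_buf *b)
{
    size_t n = b->len;
    b->len = 0;
    b->deferred = false;
    return n;
}

/* Flush now if the context allows it, otherwise mark the buffer for later.
 * Returns the number of bytes emitted immediately. */
static size_t print_flush(struct print_buf *b, bool irqs_disabled)
{
    if (irqs_disabled) {
        b->deferred = true;
        return 0;
    }
    return emit(b);
}

/* One pass of the poll worker: drain only the buffers that were deferred. */
static size_t poll_worker(struct print_buf *bufs, size_t nbufs)
{
    size_t total = 0;
    for (size_t i = 0; i < nbufs; i++)
        if (bufs[i].deferred)
            total += emit(&bufs[i]);
    return total;
}
```

In this model the worker has nothing to do when every flush happened with IRQs enabled, which matches the claim that the polling is only a fallback.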

06:55 <agentzh> this is the probe end context: https://gist.github.com/agentzh/ec8e334f66b5c413e9cdf3d7028a55a6

06:56 <kerneltoast> yeah that will defer print messages until probe exit

06:56 <agentzh> yes

06:56 <agentzh> it's the most common case for at least our stap scripts.

06:57 <kerneltoast> i don't think the overhead from polling will be too bad

06:57 <agentzh> we also have some printing in timer.s(N) probes, which is in hrtimer_interrupt(), unfortunately.

06:58 <agentzh> it's not the overhead i'm worried about right now, but data truncation :)

06:58 <kerneltoast> the polling is done to avoid truncation

06:59 <kerneltoast> if we skip flushing when the context doesn't allow it, we'll lose data

06:59 <agentzh> yeah sure.

06:59 <agentzh> i'm worried that it is worse than before.

06:59 <kerneltoast> the polling won't truncate messages in-flight

06:59 <agentzh> or is it?

07:00 <kerneltoast> all the synchronization in place now should make it better

07:01 <kerneltoast> it's big and ugly but i've been very careful

07:04 derek0883 has quit [Remote host closed the connection]

07:07 <agentzh> okay, cool

08:02 khaled has joined #systemtap

08:27 hpt has quit [Ping timeout: 272 seconds]

09:30 derek0883 has joined #systemtap

09:31 derek0883 has quit [Remote host closed the connection]

10:34 khaled has quit [Quit: Konversation terminated!]

10:37 khaled has joined #systemtap

11:50 khaled has quit [Remote host closed the connection]

11:52 khaled has joined #systemtap

12:10 _whitelogger has joined #systemtap

12:45 derek0883 has joined #systemtap

13:50 derek0883 has quit [Ping timeout: 260 seconds]

14:57 tonyj has joined #systemtap

14:59 tromey has joined #systemtap

15:03 amerey has joined #systemtap

16:20 wcohen has quit [Remote host closed the connection]

16:23 wcohen has joined #systemtap

16:42 <kerneltoast> fche, yo

16:42 <fche> hello

16:43 <kerneltoast> the print buffers need to be bigger

16:43 <fche> could be

16:44 <kerneltoast> that was the only problem exposed by the testsuite

16:44 <fche> hm just looking over your gist patch

16:44 <fche> how much of the recent batch of problems comes from having to muck with the inode mutex?

16:44 <fche> I know I've asked this before, but .....

16:44 <kerneltoast> i didn't go further than the lockdep warning

16:44 <fche> is there some spinlocky way of protecting the inode content in question other than via agentzh's mutex_trylock stuff?

16:45 <kerneltoast> possible if you roll your own mutex code

16:45 <kerneltoast> and copy the definition of struct mutex into stap

16:45 <fche> I mean short of that

16:45 <kerneltoast> nope

16:46 <kerneltoast> the inode only has that mutex

16:46 <fche> is it solely the inode size field that we're trying to protect ?

16:47 <kerneltoast> i don't think so

16:47 <kerneltoast> agentzh said that data was getting garbled

16:47 <kerneltoast> not truncated

16:48 <fche> depends on what form of garbling there was

16:56 <kerneltoast> fche, https://sourceware.org/bugzilla/show_bug.cgi?id=26131

17:18 derek0883 has joined #systemtap

17:29 <fche> in your cpumask variant of the patch, is it proper for cpumask_copy & cpumask_clear to come in that sequence?

17:29 <fche> as opposed to clear first?

17:31 derek0883 has quit [Remote host closed the connection]

17:32 derek0883 has joined #systemtap

17:36 <kerneltoast> yeah, if we clear it first then we'll just see nothing

17:40 <fche> oh wait

17:40 <fche> you're clearing the input not the output, got it
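
The ordering question can be illustrated with a plain bitmask in place of a kernel cpumask. This is a simplified, non-atomic sketch. `take_pending` and `take_pending_wrong` are hypothetical helpers, and a `uint64_t` stands in for `struct cpumask`.

```c
#include <assert.h>
#include <stdint.h>

/* Copy first, then clear the input: the caller receives the bits that
 * were set, and the input is left empty for the next round. */
static uint64_t take_pending(uint64_t *pending)
{
    uint64_t snapshot = *pending;
    *pending = 0;
    return snapshot;
}

/* Clear first, then copy: the snapshot is always empty.
 * Shown only to illustrate why this ordering would "see nothing". */
static uint64_t take_pending_wrong(uint64_t *pending)
{
    *pending = 0;
    return *pending;
}
```

With bits 0 and 2 set (0x5), `take_pending` returns 0x5 and leaves the input at 0, while `take_pending_wrong` returns 0.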

17:43 derek0883 has quit [Remote host closed the connection]

17:45 derek0883 has joined #systemtap

19:38 <kerneltoast> fche, so I'm guessing after reading the bug report you don't think we can dodge the mutex?

19:47 <agentzh> kerneltoast: it'll be great if you can redo the stress tests using the sample .stp script in that bugzilla ticket with your latest patch to make sure it is still working fine.

19:48 <kerneltoast> hmm but the inode mutex hasn't been removed

19:48 <kerneltoast> but i'll try that .stp anyway

19:49 <agentzh> i know. just to make sure it has no other side effects in such extreme cases.

19:49 <agentzh> like a specific stress test case.

19:49 <agentzh> thanks

19:49 <kerneltoast> agentzh, do you know where stap does the relayfs read in userspace?

19:50 <agentzh> in staprun

19:51 <kerneltoast> k

19:51 <agentzh> iirc, staprun/mainloop.c

19:51 <fche> kerneltoast, yeah

19:51 <agentzh> func stp_main_loop()

19:53 <agentzh> kerneltoast: and also in staprun/relay.c

19:53 <agentzh> in reader_thread

19:53 <agentzh> func

19:53 <agentzh> the latter is more related i think.

19:56 <kerneltoast> lol we don't need the inode mutex

19:56 <kerneltoast> we just need to disable irqs on the local cpu when flushing

19:56 <kerneltoast> i am going to cry :)

19:57 <kerneltoast> i should've looked at relayfs earlier...

19:57 <agentzh> wow

19:57 <kerneltoast> staprun properly pins a reader thread to each cpu

19:57 <kerneltoast> and only has that thread read that cpu's buffer

19:58 <agentzh> there are two different modes?

19:58 <kerneltoast> relayfs stores a buffer per cpu

19:58 <agentzh> one pin'd one is not?

19:58 <agentzh> just my vague impression.

19:58 <kerneltoast> there are two modes in stap for some reason actually

19:58 <kerneltoast> STP_BULKMODE is pinned mode

19:58 <agentzh> yeah

19:58 <agentzh> BULKMODE is not the default.

19:58 <agentzh> and we don't use it for simplicity.

19:59 <agentzh> because it requires an extra stap-merge step to collect the per-cpu output files.

19:59 <kerneltoast> so BULKMODE would need to be the default and then the print flush function should have its own local irq save

19:59 <kerneltoast> that is the alternative.

19:59 <agentzh> it'll be a tough call for fche :)

20:00 <kerneltoast> non-bulkmode is a clear abuse of relayfs

20:00 <agentzh> since it seems to break backward compatibility.

20:00 <agentzh> or you have ways to do it otherwise?

20:00 <kerneltoast> which backward compatibility are you thinking of?

20:00 <kerneltoast> bulkmode doesn't work on old kernels?

20:00 <agentzh> bulkmode requires stap-merge to post-process the output.

20:01 <agentzh> iirc

20:01 <agentzh> it changes the way how users would normally use stap.

20:02 <kerneltoast> the ALTERNATIVE alternative would be to have a single buffer for all CPUs protected by a single lock. not so great

20:02 <agentzh> sounds tricky.

20:02 <agentzh> the non-bulk mode indeed may overload cpu 0.

20:02 <agentzh> i also complained about it in the past.

20:02 <kerneltoast> we're technically doing the output merging already

20:03 <kerneltoast> with this dance that my patch does

20:03 <agentzh> the current default way is not pretty indeed.

20:03 <agentzh> hopefully we can have something better.

20:03 <fche> hmmmm

20:03 <agentzh> and without forcing the user to always use stap-merge.

20:03 <fche> now that I think about it (again), yeah it's weird that the kernel->user isn't the normal 1-buffer-per-cpu thing

20:04 <fche> and then let userspace merge or not merge (depending on -b)

20:04 <kerneltoast> i guess if the user wants non-bulkmode, we'd have a single buffer for all cpus

20:04 <fche> at some point

20:04 <fche> but that point does not have to be at the relayfs level

20:04 <fche> it can be at the staprun/stapio STDOUT level

20:05 <kerneltoast> it does need to be at the relayfs level. relayfs is designed to use per-cpu buffers

20:06 <kerneltoast> non-bulkmode abuses relayfs by reading per-cpu data from different cpus

20:06 <kerneltoast> which is why we end up needing the inode mutex

20:06 <fche> yes I understand

20:06 <fche> and I agree it doesn't smell right

20:06 <fche> my point is we can emulate non-bulk mode by making stapio/staprun still receive all the per-cpu buffers

20:07 <fche> but merge them (well, pipe them to stdout as fast as possible, probably on a per-line buffered basis, probably)

20:07 <kerneltoast> how would we do that though

20:07 <fche> well we already have N threads in stap* reading the buffers

20:07 <fche> the trace$N files

20:07 <fche> instead of writing to a separate file, they can each write to stdout in non "-b" mode

20:09 <kerneltoast> the N threads are only pinned to each cpu in bulkmode though

20:10 <kerneltoast> err are there really N threads in non-bulkmode?

20:10 <kerneltoast> because there's this check inside relay.c:

20:10 <kerneltoast> if (ncpus > 1 && bulkmode == 0) {

20:10 <kerneltoast> _err("ncpus=%d, bulkmode = %d\n", ncpus, bulkmode);

20:10 <kerneltoast> _err("This is inconsistent! Please file a bug report. Exiting now.\n");

20:10 <kerneltoast> return -1;

20:10 <kerneltoast> }

20:11 <fche> well we can change that easily enough.

20:11 <fche> We Control the Vertical. We Control the Horizontal.

20:11 <kerneltoast> yeah but i dunno how much of "bulkmode" needs to be bulk

20:11 <fche> meaning?

20:11 <kerneltoast> what exactly is user avoiding by not using bulkmode?

20:11 <kerneltoast> the per-cpu userspace threads?

20:12 <kerneltoast> the multiple files?

20:12 <fche> not bulk mode: trades convenience for performance

20:12 <fche> and vice versa

20:13 <kerneltoast> and we're nuking some of that performance by spawning a bunch of threads, no?

20:13 <fche> the performance loss is probably that of synchronizing/merging things early

20:14 <fche> having N threads do N non-conflicting things is fine

20:16 <kerneltoast> does the bulkmode merging try to order prints correctly?

20:17 <fche> yes, I believe via timestamps

20:17 <fche> so that computation would be deferred in the -b case

20:17 <kerneltoast> and we'd just ditch that with non-bulk

20:17 <fche> or do the computation live within staprun/stapio

20:17 <kerneltoast> hmm if we do it live then is there a point to the bulkmode toggle?

20:18 <fche> the point of the toggle is to let a user choose whether to do it live or not.

20:18 <fche> doing it live incurs some runtime cost

20:18 <kerneltoast> ah ok

20:18 <fche> not doing it live incurs the cost LATER, after the stap session is done.

20:19 <kerneltoast> so the default would be geared towards convenience via live merging

20:19 <fche> yes.
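
A rough sketch of what merging per-CPU output by timestamp could look like in userspace is given below. It assumes each per-CPU stream is already ordered by timestamp and that records carry a timestamp at all. `struct record`, `struct stream`, and `merge_next` are hypothetical, and this is not how stap-merge or staprun is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a timestamp plus the CPU that produced it. */
struct record {
    uint64_t ts;
    int      cpu;
};

/* One per-CPU stream, assumed to be sorted by timestamp already. */
struct stream {
    const struct record *recs;
    size_t len;
    size_t pos; /* index of the next unread record */
};

/* Pops the earliest unread record across all streams into *out.
 * Returns 1 on success, 0 once every stream is drained. */
static int merge_next(struct stream *s, size_t nstreams, struct record *out)
{
    size_t best = nstreams; /* sentinel: no candidate found yet */

    for (size_t i = 0; i < nstreams; i++) {
        if (s[i].pos >= s[i].len)
            continue;
        if (best == nstreams ||
            s[i].recs[s[i].pos].ts < s[best].recs[s[best].pos].ts)
            best = i;
    }
    if (best == nstreams)
        return 0;

    *out = s[best].recs[s[best].pos++];
    return 1;
}
```

A live merger has one extra difficulty this sketch ignores: it cannot know whether a slower CPU will still deliver an earlier record, so it would need to buffer briefly or accept some reordering.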

20:19 <kerneltoast> gotcha

20:19 <kerneltoast> now it just needs to be coded!

20:19 <fche> and the way that's implemented NOW is that the kernel side does the merging (via locks)

20:19 <fche> yes, just a SMOP, but maybe not serious

20:19 <fche> because the kernel side just needs to -DSTAP_BULKMODE=1

20:19 <fche> and userspace needs to act as if bulkmode=1 in a few more places, almost

20:20 <fche> how does that sound?

20:20 <kerneltoast> sounds good

20:23 <kerneltoast> if only i could go back in time a week

20:23 <kerneltoast> and tell myself not to try fixing this in the runtime

20:23 <fche> geez get on that mr. time traveler guy

20:23 <kerneltoast> b r a i n i s m e l t i n g

20:26 jistone has quit [Ping timeout: 260 seconds]

20:37 jistone has joined #systemtap

20:42 jistone has quit [Ping timeout: 260 seconds]

20:42 <kerneltoast> fche, willing to take my big print patch if i rework it to just always flush directly as is done now?

20:43 <kerneltoast> there's other nice stuff in that patch, like guarding against print usage while the print driver is dead

20:43 <fche> kerneltoast, would take a look

20:43 <kerneltoast> cool, i'll do that after i get some grub

20:43 jistone has joined #systemtap

20:45 derek0883 has quit [Remote host closed the connection]

20:53 derek0883 has joined #systemtap

21:00 jistone has quit [Ping timeout: 260 seconds]

21:00 jistone_ has joined #systemtap

21:00 jistone_ is now known as jistone

21:00 jistone has quit [Changing host]

21:00 jistone has joined #systemtap

21:03 sscox has quit [Quit: sscox]

21:07 <agentzh> live merging will be cool for this.

21:07 <agentzh> it'll also solve the cpu 0 issue i mentioned above.

21:07 <agentzh> glad we have something simpler which also solves another old issue :)

21:08 <fche> yes

21:08 <fche> although

21:08 <fche> with our luck

21:08 <fche> WATCH SOMETHING ELSE CRAWL OUT FROM UNDER THE ROCK

21:08 <agentzh> heh, hopefully not a snake :)

21:09 <agentzh> i'll make sure we do extensive testing with the new solution.

21:09 sscox has joined #systemtap

21:09 <agentzh> this kernel thing is hard.

21:09 <fche> darn kernel thing

21:09 <agentzh> :)

21:27 tromey has quit [Quit: ERC (IRC client for Emacs 27.1.50)]

22:35 amerey has quit [Remote host closed the connection]

22:35 amerey has joined #systemtap

22:39 derek0883 has quit [Remote host closed the connection]

22:45 derek0883 has joined #systemtap

22:56 derek0883 has quit [Remote host closed the connection]

22:57 amerey has quit [Quit: Leaving]

23:10 mjw has quit [Quit: Leaving]

23:14 derek0883 has joined #systemtap

23:29 derek088_ has joined #systemtap

23:31 derek0883 has quit [Ping timeout: 272 seconds]