#systemtap on 2020-06-18 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

00:30 derek0883 has quit [Remote host closed the connection]

00:30 derek0883 has joined #systemtap

00:35 derek0883 has quit [Ping timeout: 265 seconds]

01:19 derek0883 has joined #systemtap

01:21 derek0883 has quit [Remote host closed the connection]

01:23 derek0883 has joined #systemtap

01:27 derek0883 has quit [Ping timeout: 246 seconds]

01:41 hpt has joined #systemtap

01:56 derek0883 has joined #systemtap

01:59 derek0883 has quit [Remote host closed the connection]

02:00 derek0883 has joined #systemtap

02:05 derek0883 has quit [Ping timeout: 260 seconds]

02:20 derek0883 has joined #systemtap

02:22 derek0883 has quit [Remote host closed the connection]

02:23 derek0883 has joined #systemtap

02:28 derek0883 has quit [Ping timeout: 272 seconds]

02:49 derek0883 has joined #systemtap

02:50 derek0883 has quit [Remote host closed the connection]

03:23 derek0883 has joined #systemtap

03:23 derek0883 has quit [Remote host closed the connection]

03:27 <agentzh> fche: created a PR to record the relay issue: https://sourceware.org/bugzilla/show_bug.cgi?id=26131

03:55 derek0883 has joined #systemtap

04:01 <agentzh> fche: first patch to remove the relay_flush call: https://gist.github.com/agentzh/2fb39b7a78d877a01708285beac1c90b

05:03 derek0883 has quit [Remote host closed the connection]

05:10 _whitelogger has joined #systemtap

05:17 derek0883 has joined #systemtap

05:19 derek0883 has quit [Remote host closed the connection]

06:05 derek0883 has joined #systemtap

06:06 derek0883 has quit [Remote host closed the connection]

06:06 derek0883 has joined #systemtap

06:07 derek0883 has quit [Read error: Connection reset by peer]

06:07 derek0883 has joined #systemtap

06:07 derek0883 has quit [Remote host closed the connection]

06:11 orivej has joined #systemtap

06:19 orivej has quit [Quit: No Ping reply in 180 seconds.]

06:20 orivej has joined #systemtap

06:32 khaled has joined #systemtap

06:34 orivej has quit [Ping timeout: 256 seconds]

07:24 hassan64 has joined #systemtap

07:25 <hassan64> Is it possible to run a systemtap script (*.stp) on the target machine with the command "stap <file.stp>", rather than cross-compiling it to *.ko on the host machine, then importing the *.ko module to the target and running it via staprun?

07:31 <ema> hassan64: yes, you can just run "stap file.stp" as long as all the required dependencies are installed on the target machine

07:32 <agentzh> fche: i just realised that we could reuse the inode spinlock which is used by the relay reader.

07:32 <agentzh> initial testing looks very promising.

07:35 <hassan64> ema: Thanks. I also need to know: is there any way to translate *.stp to *.ko on the target machine itself, to avoid the hassle of cross-compiling on the host machine?

07:35 <agentzh> hassan64: you need the C compiler.

07:35 <agentzh> on the target machine.

07:35 <agentzh> the same C compiler used to build your target kernel.

07:36 <agentzh> cross-compilation is never fun, i agree :)

07:41 <hassan64> yeah, but I do not have a C compiler on the target machine.

07:41 <hassan64> Actually, what I am trying to do is

07:42 <hassan64> I have an application that needs to run on the target machine. I have "file.sh", which says how to run the application. Then I need to probe/trace what that application is doing at the system level, i.e. syscalls etc.

07:43 <agentzh> fche: ah, my bad, seems like inode_lock() uses rw_semaphore which may sleep...

07:44 <hassan64> I know I can run "stap <file.stp>" on the target. But I also need to run "file.sh" through the stap command.

07:45 <hassan64> I know this is done through "staprun -c <file.sh> <file.ko>". But here I do not have a *.ko, since I am avoiding cross-compilation. Is there any command like "stap -c <file.sh> <file.stp>"?

07:46 <agentzh> ko needs a C compiler to generate. i don't see how you can bypass ko without a C compiler.

07:47 <agentzh> stap does support the ebpf runtime to some extent. i'm not sure if your kernel supports ebpf. if yes, then stap can directly generate ebpf programs on the target machine without using any C compiler.

07:47 <agentzh> i'm not sure if stap's ebpf backend is mature enough for your purposes. but you can always try and report it back.

07:48 <hassan64> I think my kernel may have support for ebpf.

07:49 <hassan64> If I enable this, how can I achieve this i.e tracing/probing my application through stap ?

07:49 <agentzh> you can check out the documentation of stap online or through stap manpages.

07:49 <agentzh> google is always helpful.
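
Editor's sketch of the two stap options touched on in this exchange: `-c CMD`, which runs a command under the probes (what hassan64 asks for), and `--runtime=bpf`, which selects the eBPF backend agentzh mentions. Both options are documented in stap(1). The helper name `build_stap_argv` and the file names are hypothetical, and whether the bpf backend is mature enough for a given script is left open here, as it is in the chat.

```python
import shlex


def build_stap_argv(script, command=None, runtime=None):
    """Build an argv list for invoking stap directly on the target machine.

    script  -- path to the .stp file (hypothetical name in the example below)
    command -- optional command to run under the probes (stap's -c option)
    runtime -- optional backend passed via --runtime=, for example "bpf"
    """
    argv = ["stap"]
    if runtime is not None:
        argv.append(f"--runtime={runtime}")
    if command is not None:
        argv.extend(["-c", command])
    argv.append(script)
    return argv


if __name__ == "__main__":
    # Print the command line instead of running it, since stap may not be installed.
    print(shlex.join(build_stap_argv("file.stp", command="./file.sh", runtime="bpf")))
```

Dropping the `runtime` argument gives the default kernel-module path, which still needs a C compiler on the target.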

08:35 orivej has joined #systemtap

08:38 <agentzh> fche: i got a new patch to address this race issue. will load-test it for a whole night. if everything goes right, i'll send the patch for you to review.

08:39 <agentzh> in the new patch, i use down_write_trylock() to do a best-effort lock acquisition for the relay inode in interrupt contexts. in other contexts, i use inode_lock() directly.

08:39 <agentzh> so it's not really a kernel bug but rather a lack of locking on the stap runtime side between writers and readers.

08:40 <agentzh> the stap runtime currently only has a lock to protect multiple writers. but it fails to protect races between readers and writers.

08:41 <agentzh> it's nice to finally get to the bottom of this long standing subtle bug.

08:42 <agentzh> i'll also look at the old relayfs case after relay_v2 is sorted out.
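
Editor's sketch: a toy user-space model of the reader/writer race described above, not the stap runtime's code. The class `ToySubbuf` and its methods are hypothetical. It only shows why a reader that runs between the space reservation and the copy can observe `\0` bytes, while a reader kept out until the copy finishes cannot.

```python
class ToySubbuf:
    """Toy stand-in for a relay sub-buffer: zero-filled storage plus an offset."""

    def __init__(self, size):
        self.data = bytearray(size)  # starts out as all \0 bytes
        self.offset = 0              # number of bytes the reader treats as readable

    def reserve(self, length):
        """Claim `length` bytes; the reader-visible offset advances immediately."""
        start = self.offset
        if start + length > len(self.data):
            raise ValueError("toy sub-buffer is full")
        self.offset += length
        return start

    def fill(self, start, payload):
        """Copy the payload into previously reserved space (the memcpy step)."""
        self.data[start:start + len(payload)] = payload

    def read(self):
        """Return everything up to the current offset, as the reader would."""
        return bytes(self.data[:self.offset])


def racy_sequence():
    """The reader runs between reserve() and fill() and sees unwritten bytes."""
    buf = ToySubbuf(16)
    start = buf.reserve(5)
    seen = buf.read()  # the reader sneaks in here
    buf.fill(start, b"hello")
    return seen


def serialized_sequence():
    """The reader is kept out until the copy is done and sees the payload."""
    buf = ToySubbuf(16)
    start = buf.reserve(5)
    buf.fill(start, b"hello")
    return buf.read()
```

A lock shared by the writer and the reader is one way to force the second ordering, which is the direction the patches discussed later in the log take.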

08:51 <agentzh> fche: my stress test has already run for more than 1500 sec without any \0 or output file size differences. will keep it running for a whole night and check it out tomorrow. it's looking good now.

08:51 <agentzh> end of day for me. night &

09:01 orivej has quit [Ping timeout: 246 seconds]

09:01 orivej has joined #systemtap

09:06 hassan64 has quit [Remote host closed the connection]

09:09 orivej has quit [Ping timeout: 260 seconds]

09:09 orivej_ has joined #systemtap

09:25 orivej_ has quit [Ping timeout: 256 seconds]

09:25 orivej has joined #systemtap

09:34 orivej has quit [Quit: No Ping reply in 180 seconds.]

09:35 orivej has joined #systemtap

09:46 orivej has quit [Ping timeout: 256 seconds]

09:46 orivej has joined #systemtap

09:55 orivej has quit [Ping timeout: 258 seconds]

09:55 orivej has joined #systemtap

10:06 orivej has quit [Quit: No Ping reply in 180 seconds.]

10:07 orivej has joined #systemtap

10:19 orivej has quit [Quit: No Ping reply in 180 seconds.]

10:20 orivej has joined #systemtap

10:22 hpt has quit [Ping timeout: 264 seconds]

10:48 orivej has quit [Ping timeout: 246 seconds]

10:49 orivej has joined #systemtap

11:19 wcohen has quit [Remote host closed the connection]

11:32 orivej has quit [Ping timeout: 258 seconds]

11:32 orivej has joined #systemtap

11:50 orivej has quit [Ping timeout: 240 seconds]

11:51 orivej has joined #systemtap

11:57 wcohen has joined #systemtap

12:13 orivej has quit [Ping timeout: 256 seconds]

12:13 orivej has joined #systemtap

12:22 tromey has joined #systemtap

12:31 orivej has quit [Ping timeout: 256 seconds]

12:31 orivej has joined #systemtap

12:45 orivej has quit [Ping timeout: 265 seconds]

12:45 orivej has joined #systemtap

13:12 orivej has quit [Quit: No Ping reply in 180 seconds.]

13:13 mjw has joined #systemtap

13:13 orivej has joined #systemtap

15:10 zodbot has quit [Remote host closed the connection]

15:24 zodbot has joined #systemtap

15:56 zodbot has quit [Ping timeout: 240 seconds]

16:05 sapatel has joined #systemtap

16:43 zodbot has joined #systemtap

16:52 <agentzh> fche: is this patch good for committing? https://gist.github.com/agentzh/2fb39b7a78d877a01708285beac1c90b

16:53 <agentzh> and also this one? https://gist.github.com/agentzh/84e937cd50238c5b96ba75d46a3f85a2

16:59 <fche> how much contention is on that rchan inode->i_rwsem ? is a loop needed?

17:02 <agentzh> for the second patch, i wonder if we can directly call inode_lock() for contexts which allow sleeping.

17:02 <agentzh> yes, a loop is needed. and the default 100 iterations still seem not good enough.

17:02 <fche> nah, we're inside a spinlock already

17:02 <fche> so in that case the race is with staprun userspace?

17:03 <agentzh> it is with relay_file_read().

17:03 <agentzh> the kernel space, but yes, triggered by staprun's reader_thread().

17:03 <agentzh> relay_file_read() competes for the inode lock.

17:03 <fche> is inode_lock a spinny or a sleepy operation?

17:04 <agentzh> it's an rw-semaphore, so it's sleepy.

17:04 <agentzh> at least by default.

17:04 <fche> rwlocks are spinny by default AIUI

17:04 <agentzh> that's why i used the *_trylock() variant in my 2nd patch.

17:04 <fche> except on -rt

17:04 <agentzh> ah, i didn't know.

17:05 <agentzh> so how to protect against the sleepy cases?

17:06 <fche> but rwlocks aren't rwsems so that doesn't apply

17:06 <agentzh> okay

17:06 <agentzh> i tried to use it directly and it leads to deadlocks.

17:06 <agentzh> on my stock fedora kernel.

17:07 <fche> ok so I see why you'd do a down_write_trylock etc. inside the _stp_print_lock.

17:07 <agentzh> so it must sleep.

17:07 <agentzh> should i do it outside the _stp_print_lock?

17:07 <agentzh> and use in_interrupt() to determine if the context allows sleeping?

17:07 <fche> nah, not sure that'd really help, we have other already-atomic callers into this code methinks.

17:08 <agentzh> okay

17:08 <agentzh> the _stp_print_lock only protects multiple writers, so it does not help with the reader.

17:10 <agentzh> i just found that even 200 iterations are not good enough...

17:10 derek0883 has joined #systemtap

17:12 <fche> the code should plan to fail

17:12 <agentzh> simply do atomic_inc(&_stp_transport_failures) ?

17:12 <agentzh> but we could do better in contexts allowing sleeping?

17:13 <fche> yeah, increment the failures
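
Editor's sketch of the "plan to fail" approach fche describes: try the lock a bounded number of times without blocking, and if it never succeeds, drop the payload and count the failure. This is a user-space model, not the stap patch. `ToyTransport` and its members are hypothetical names; `failures` stands in for `_stp_transport_failures`, and the default of 100 tries mirrors the iteration count mentioned in the chat.

```python
import threading


class ToyTransport:
    """Bounded trylock with a failure counter, modelled in user space."""

    def __init__(self, max_tries=100):
        self.lock = threading.Lock()  # stands in for the relay inode lock
        self.max_tries = max_tries    # retry bound; 100 is the figure from the chat
        self.failures = 0             # stands in for _stp_transport_failures
        self.sent = []                # payloads that were written under the lock

    def try_send(self, payload):
        """Write under the lock if it can be taken without blocking; else drop."""
        for _ in range(self.max_tries):
            if self.lock.acquire(blocking=False):  # never sleeps
                try:
                    self.sent.append(payload)
                finally:
                    self.lock.release()
                return True
        # Lock never became available: drop the data and record the loss.
        self.failures += 1
        return False
```

The design choice is that the writer never blocks: known, counted data loss is preferred over sleeping in a context that must not sleep.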

17:15 <fche> hm, this trylock gadget should probably be moved over into transport/relay_v2.c

17:15 <fche> the actual function that deals with resources contended with userspace

17:16 <fche> _stp_data_write_reserve probably

17:16 <fche> the runtime/print_flush.c cannot really deal with rchan_buf type things, those are configuration specific

17:18 <agentzh> not really, i tried.

17:19 <agentzh> the reader would read uninitialized reserved space in the subbufs.

17:19 <agentzh> so there's still \0 in the output under load test.

17:19 <agentzh> simply failing would be worse than before. it guarantees data loss for scripts generating a lot of output.

17:20 <agentzh> that's why i put the lock around code doing the memcpy().

17:20 <agentzh> the lock contention is serious here since the staprun keeps reading like crazy.

17:29 <fche> staprun shouldn't poll so frequently as a matter of fact

17:32 <fche> but yeah seeing that relay_v2's _stp_data_write_commit() is empty, it's clear there is a race that covers the subbuf reservation and the memcpy vs. the staprun read

17:32 derek0883 has quit [Ping timeout: 246 seconds]

17:33 derek0883 has joined #systemtap

17:34 irker644 has joined #systemtap

17:34 <irker644> systemtap: wcohen systemtap.git:master * release-4.3-16-g3d922919d / testsuite/systemtap.examples/profiling/periodic.stp: Use explicit @cast() operators for periodic.stp

18:03 orivej has quit [Ping timeout: 256 seconds]

18:03 orivej has joined #systemtap

18:29 <agentzh> fche: how about this v2 patch? https://gist.github.com/agentzh/f83bb342cb7c6de45b4a913b3e29493c

18:29 <agentzh> i moved the inode lock outside the print lock.

18:30 <agentzh> so it may sleep in contexts which allow it.

18:30 <fche> at least also protect it with something that ties it to relay_v2.c

18:31 <agentzh> what does that mean?

18:32 <fche> well, just concerned that this transport-independent file is learning about rchan_buf and inode and such inside only a #ifdef (__KERNEL__) guard

18:34 <fche> maybe also assert STP_TRANSPORT_VERSION == 2

18:34 <agentzh> okay, gotcha. yeah, i was trying to be lazy. will address it.

18:35 <agentzh> for transport v1, is it centos 6's 2.6 kernel?

18:35 <agentzh> just wondering how to test that case.

18:35 <agentzh> i'll do more thorough load testing for different configs.

18:36 <agentzh> load testing is helpful.

18:36 <fche> yeah probably

18:36 <fche> though rhel6 has practically fallen off of our radar ... not quite but close

18:37 <agentzh> any ETA for that?

18:37 <fche> nope

18:37 <agentzh> rhel6 is indeed reaching its EOL.

18:37 <fche> there is eol and there is eol-eol :)

18:37 <agentzh> lol

18:37 <agentzh> indeed

19:02 <irker644> systemtap: wcohen systemtap.git:master * release-4.3-17-gb2d18cb3a / testsuite/systemtap.examples/process/semop-watch.stp: Use explicit @cast() operators for semop-watch.stp example.

19:19 <agentzh> fche: is the first patch for removing relay_flush() good to merge?

19:45 <fche> agentzh, sure

19:47 derek0883 has quit [Ping timeout: 240 seconds]

19:59 orivej has quit [Ping timeout: 240 seconds]

19:59 orivej_ has joined #systemtap

20:14 derek0883 has joined #systemtap

20:25 mjw has quit [Quit: Leaving]

20:41 <agentzh> fche: thanks.

20:42 <agentzh> for the 2nd patch, seems like in_interrupt() || in_atomic() is still not enough.

20:42 <agentzh> still seeing deadlocks.

20:42 <fche> yeah suggest not trying anything beyond a trylock

20:43 <agentzh> k, will do this first.

20:43 <agentzh> do you think it should be inside the print lock or outside?

20:43 <fche> some slight data loss is better. if you can just drop the data that we can't reliably send out, and increment the transport-failures counter, that should be good enough

20:43 <agentzh> okay, corrupted data is indeed dangerous.

20:44 <agentzh> could lead to all kinds of evil.

21:00 <agentzh> fche: just tried discarding data directly and it seems like a bit too much data loss, since it discards 8192+ bytes at a time...

21:01 <agentzh> the STP_BUFFER_SIZE has a default value of 8KB.

21:01 <agentzh> maybe we still let it proceed?

21:01 orivej_ has quit [Ping timeout: 260 seconds]

21:01 orivej has joined #systemtap

21:01 <agentzh> it seems 99% of the time it is good and even for the 1% of the time, the bad data is quite limited.

21:02 <agentzh> just under 1kb in my own load test.

21:19 <fche> IMO let's let it discard

21:20 <fche> and ideally figure out whether staprun can be made to back off so the contention is not hit so much of the time

21:20 <fche> ISTR we poll every 50ms or some such thing, which is too short even for interactive purposes

21:23 <irker644> systemtap: wcohen systemtap.git:master * release-4.3-18-g36430614d / tapset/linux/kprocess.stp: Use kernel.trace("sched:sched_process_fork") for kprocess.create when possible

21:29 tromey has quit [Quit: ERC (IRC client for Emacs 28.0.50)]

21:30 <agentzh> fche: staprun uses ppoll with a default timeout of 10sec.

21:32 <fche> strace stapio

21:32 <fche> there is a 200ms ppoll also

21:33 <fche> hm 200ms is not that bad

21:33 <fche> that shouldn't be contending, but maybe it's different during probe-end timeframe

21:48 <agentzh> because ko emits a lot of data and the staprun reader also needs to read a lot of data.

21:48 <agentzh> if the staprun reader does not read fast enough, there would also be data loss on the sender side.

21:49 <agentzh> losing so much data so frequently would be a regression for our use cases.

21:49 <agentzh> not an improvement...

21:57 <agentzh> fche: i just tried process.begin probes, same thing. almost immediately lose about 8KB of output under load.

21:57 orivej has quit [Ping timeout: 258 seconds]

21:57 orivej has joined #systemtap

22:07 <fche> known-lost data is an improvement over corrupt data tho

22:14 <agentzh> well, more data and more info are much more important to us since we can always skip bad data...

22:14 <agentzh> assuming no other consequences.

22:15 <fche> stap -s NNNN ?

22:15 <fche> bump that up for you? or try stap -b

22:17 <agentzh> yeah, increasing the subbuf_size would definitely help.

22:17 <agentzh> but i still don't want to use full buffering to avoid any data loss...

22:18 <agentzh> since our stap scripts can generate hundreds of megabytes of data...

22:18 <agentzh> -b may not be so helpful for our use cases, since it's usually only using one CPU...

22:19 <agentzh> and for my load test, it's only generating one cpu file.

22:22 <agentzh> fche: btw, the lock contention is bad when the machine is under significant load.

22:22 <agentzh> if the machine is idle, then it's fine...

22:23 <fche> ok would be good to know whence the contention exactly

22:23 <fche> but anyway

22:23 <agentzh> to reduce the data loss rate, maybe the user should decrease the STP_BUFFER_SIZE value?

22:23 <fche> IMHO still dropping data rather than corrupting it is a good default

22:24 <agentzh> i agree.

22:24 <agentzh> maybe make it controllable by a macro?

22:24 <agentzh> like _STP_UNSAFE_RELAY_DATA?

22:24 <fche> and for your purposes you could add another #ifdef STP_TRANSPORT_RISKY or something to tweak behaviour

22:24 <agentzh> sounds good

22:24 orivej has quit [Ping timeout: 260 seconds]

22:24 <agentzh> i'll adjust the patch accordingly.

22:30 <agentzh> fche: will you have a quick look at this v4 patch? https://gist.github.com/agentzh/d821ea47d23f3f75a5d8246264b79005

22:38 sapatel has quit [Remote host closed the connection]

22:43 derek0883 has quit [Remote host closed the connection]

22:44 derek0883 has joined #systemtap

22:50 <agentzh> fche: okay, i was wrong. sorry. stap -b helps even though only one cpu file is used. strange...

22:50 <agentzh> i just tried -b and there's no data loss even under load.

22:53 <agentzh> ah, i see. when -b is not specified, the relay trace always binds to cpu 0...

22:54 <agentzh> that's why the contention is so severe in this case.

22:54 <agentzh> is it a bug?

22:54 <agentzh> shouldn't it choose other cpus?

22:54 <agentzh> cpu cores, i mean.

22:57 <agentzh> -b helps because it randomly chooses cpu for output.

22:58 <agentzh> i have a 8c/16t CPU, so it's much better.

22:59 <agentzh> -b allocates 16x memory though...and only one set of bufs can be used.

22:59 <agentzh> quite wasteful.

22:59 <agentzh> (as a work-around)

23:08 <agentzh> okay, seems like the kernel's relay impl indeed hardcodes cpu 0 to the "global" bufs.

23:08 <agentzh> seems like a misdesign to me...

23:13 <agentzh> wondering if there's a way to choose a custom cpu for the global buf case...
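
Editor's sketch of why the non-bulk case concentrates contention, under a stated assumption: in bulk mode (`stap -b`) a write is taken to land in the buffer of the CPU it runs on, while in non-bulk mode every write lands in the single buffer tied to cpu 0, as agentzh reports above. The function names are hypothetical and this is not the kernel's relay code.

```python
from collections import Counter


def buffer_for_write(bulk_mode, current_cpu):
    """Index of the buffer a write lands in (assumed model, see the note above)."""
    return current_cpu if bulk_mode else 0


def writes_per_buffer(bulk_mode, writer_cpus):
    """Count how many writes hit each buffer for a sequence of writer CPUs."""
    return Counter(buffer_for_write(bulk_mode, cpu) for cpu in writer_cpus)


if __name__ == "__main__":
    cpus = [0, 3, 3, 7]  # CPUs on which four hypothetical writes happen
    print("non-bulk:", dict(writes_per_buffer(False, cpus)))
    print("bulk:    ", dict(writes_per_buffer(True, cpus)))
```

In the non-bulk case all four writes share one buffer and therefore one reader-side lock, which matches the severe contention reported under load.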

23:16 sapatel has joined #systemtap

23:30 khaled has quit [Quit: Konversation terminated!]

23:43 <fche> hey

23:44 <fche> so yeah I suspect lkml won't have much interest in rearchitecting relayfs :) re. cpu0

23:44 <fche> re. patch v4

23:44 <fche> hm not sure the entryfn_context logic is right in the first hunk

23:45 <fche> that shouldn't be conditional on inode_locked in the exit path

23:45 <fche> and ditto second hunk

23:45 <fche> the third hunk in stp_print_flush it looks correct

23:46 <fche> hm but no even there the put_context stuff must match the get_context

23:46 <fche> whether or not RISKY and whether or not inode_locked

23:46 <fche> I like the new lock/unlock function suite, that's good abstraction.

23:48 <agentzh> ah, right, sorry, my bad.

23:49 <agentzh> i added the RISKY part too fast and didn't notice the put thing.

23:49 <agentzh> fixing

23:49 <agentzh> thanks for the catch.

23:56 <agentzh> done. v5 is here: https://gist.github.com/agentzh/26deab4b091b43a8bcab81c3c6fc89e1

23:57 <fche> I think the entryfn_put_context needs to be outside the if (inode_locked) conditional

23:57 <agentzh> also fixed the _stp_get_rchan_subbuf() call to always use 0 cpu id if not in the bulk mode.

23:57 <agentzh> oh oh, right...sorry again...
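
Editor's sketch of the review point fche raises: the put-context step must always pair with the get-context step, and only the unlock depends on whether the trylock succeeded. This is a user-space model with hypothetical names (`ToyRuntime`, `context_depth`, `risky`), not the v4 or v5 patch. The `risky` flag models the `STP_TRANSPORT_RISKY` idea from the chat: write even without the lock instead of dropping.

```python
import threading


class ToyRuntime:
    """Models balanced get/put of a context around a conditional lock."""

    def __init__(self, risky=False):
        self.inode_lock = threading.Lock()  # stands in for the relay inode lock
        self.context_depth = 0              # must be back at 0 after every flush
        self.failures = 0                   # counted drops
        self.out = []                       # payloads that were written
        self.risky = risky                  # models the STP_TRANSPORT_RISKY switch

    def print_flush(self, payload):
        self.context_depth += 1             # models the get-context call
        try:
            locked = self.inode_lock.acquire(blocking=False)
            try:
                if locked or self.risky:
                    self.out.append(payload)  # without the lock this may race
                else:
                    self.failures += 1        # default: drop and count
            finally:
                if locked:
                    self.inode_lock.release()  # only the unlock is conditional
        finally:
            self.context_depth -= 1          # models the put-context call, unconditional
```

Placing the decrement in the outer `finally` is the Python analogue of keeping the put call outside the `if (inode_locked)` block.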

23:58 <fche> THERE WILL BE BEATINGS AND COMFY CHAIRS

23:58 <agentzh> my head is fried after days of tracing this race...

23:58 <fche> heh

23:58 <fche> you've done amazing work.

23:58 <agentzh> thanks!