fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 265 seconds]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 246 seconds]
hpt has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 272 seconds]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
<agentzh> fche: created a PR to record the relay issue: https://sourceware.org/bugzilla/show_bug.cgi?id=26131
derek0883 has joined #systemtap
<agentzh> fche: first patch to remove the relay_flush call: https://gist.github.com/agentzh/2fb39b7a78d877a01708285beac1c90b
derek0883 has quit [Remote host closed the connection]
_whitelogger has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Read error: Connection reset by peer]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
orivej has joined #systemtap
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #systemtap
khaled has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
hassan64 has joined #systemtap
<hassan64> Is it possible to run a systemtap script (*.stp) on the target machine with the command "stap <file.stp>", rather than cross-compiling it to *.ko on the host machine and then importing the *.ko module to the target and running it via staprun?
<ema> hassan64: yes, you can just ran "stap file.stp" as long as all the required dependencies are installed on the target machine
<ema> s/ran/run/
<agentzh> fche: i just realised that we could reuse the inode spinlock which is used by the relay reader.
<agentzh> initial testing looks very promising.
<hassan64> ema: Thanks, I also need to know, is there any possibility to translate *.stp to *.ko, inside a target machine, in order to avoid hassle of cross-compiling at host machine?
<agentzh> hassan64: you need the C compiler.
<agentzh> on the target machine.
<agentzh> the same C compiler used to build your target kernel.
<agentzh> cross-compilation is never fun, i agree :)
<hassan64> yeah, but I do not have C Compiler on target machine.
<hassan64> Actually, what I am trying to do is
<hassan64> I have an application that needs to run on the target machine. I have "file.sh", in which I tell how to run the application. Then I need to probe/trace what that application is doing at the system level, i.e. syscalls etc.
<agentzh> fche: ah, my bad, seems like inode_lock() uses rw_semaphore which may sleep...
<hassan64> I know I can run "stap <file.stp>" on the target. But I also need to run "file.sh" through the stap command.
<hassan64> I know this is done through "staprun -c <file.sh> <file.ko>". But here I do not have a *.ko, since I am avoiding cross-compilation. Is there any command like "stap -c <file.sh> <file.stp>"?
<agentzh> ko needs a C compiler to generate. i don't see how you can bypass ko without a C compiler.
<agentzh> stap does support ebpf runtime to some extent. i'm not sure if your kernel supports ebpf. if yes, then stap can directly generate ebpf programs on the target machine without using any C compiler.
<agentzh> i'm not sure if stap's ebpf backend is mature enough for your purposes. but you can always try and report it back.
<hassan64> I think my kernel may have support for ebpf.
<hassan64> If I enable this, how can I achieve this i.e tracing/probing my application through stap ?
<agentzh> you can check out the documentation of stap online or through stap manpages.
<agentzh> google is always helpful.
orivej has joined #systemtap
<agentzh> fche: i got a new patch to address this race issue. will load test it for a whole night. if everything goes right, i'll send the patch for you to review.
<agentzh> in the new patch, i use down_write_trylock() to do a best effort lock acquisition for the relay inode in interrupt contexts. in other contexts, i use inode_lock() directly.
<agentzh> so it's not really a kernel bug but rather a lack of locking on the stap runtime side between writers and readers.
<agentzh> the stap runtime currently only has a lock to protect multiple writers. but it fails to protect races between readers and writers.
<agentzh> it's nice to finally get to the bottom of this long standing subtle bug.
<agentzh> i'll also look at the old relayfs case after relay_v2 is sorted out.
<agentzh> fche: my stress test already runs more than 1500 sec without any \0 or output file size differences. will keep it running for a whole night and check it out tomorrow. it's looking good now.
<agentzh> end of day for me. night &
orivej has quit [Ping timeout: 246 seconds]
orivej has joined #systemtap
hassan64 has quit [Remote host closed the connection]
orivej has quit [Ping timeout: 260 seconds]
orivej_ has joined #systemtap
orivej_ has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 258 seconds]
orivej has joined #systemtap
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #systemtap
orivej has quit [Quit: No Ping reply in 180 seconds.]
orivej has joined #systemtap
hpt has quit [Ping timeout: 264 seconds]
orivej has quit [Ping timeout: 246 seconds]
orivej has joined #systemtap
wcohen has quit [Remote host closed the connection]
orivej has quit [Ping timeout: 258 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
orivej has joined #systemtap
wcohen has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
tromey has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 265 seconds]
orivej has joined #systemtap
orivej has quit [Quit: No Ping reply in 180 seconds.]
mjw has joined #systemtap
orivej has joined #systemtap
zodbot has quit [Remote host closed the connection]
zodbot has joined #systemtap
zodbot has quit [Ping timeout: 240 seconds]
sapatel has joined #systemtap
zodbot has joined #systemtap
<agentzh> fche: is this patch good for committing? https://gist.github.com/agentzh/2fb39b7a78d877a01708285beac1c90b
<fche> how much contention is on that rchan inode->i_rwsem ? is a loop needed?
<agentzh> for the second patch, i wonder if we can directly call inode_lock() for contexts which allow sleeping.
<agentzh> yes, a loop is needed. and the default 100 iterations still seem not good enough.
<fche> nah, we're inside a spinlock already
<fche> so in that case the race is with staprun userspace?
<agentzh> it is with relay_file_read().
<agentzh> the kernel space, but yes, triggered by staprun's reader_thread().
<agentzh> relay_file_read() competes with the inode lock.
<fche> is inode_lock a spinny or a sleepy operation?
<agentzh> it's an rw-semaphore, so it's sleepy.
<agentzh> at least by default.
<fche> rwlocks are spinny by default AIUI
<agentzh> that's why i used the *_trylock() variant in my 2nd patch.
<fche> except on -rt
<agentzh> ah, i didn't know.
<agentzh> so how to protect against the sleepy cases?
<fche> but rwlocks aren't rwsems so that doesn't apply
<agentzh> okay
<agentzh> i tried to use it directly and it leads to deadlocks.
<agentzh> on my stock fedora kernel.
<fche> ok so I see why you'd do a down_write_trylock etc. inside the _stp_print_lock.
<agentzh> so it must sleep.
<agentzh> should i do it outside the _stp_print_lock?
<agentzh> and use in_interrupt() to determine if the context allows sleeping?
<fche> nah, not sure that'd really help, we have other already-atomic callers into this code methinks.
<agentzh> okay
<agentzh> the _stp_print_lock only protects multiple writers, so it does not help with the reader.
<agentzh> i just found that even 200 iterations are not good enough...
derek0883 has joined #systemtap
<fche> the code should plan to fail
<agentzh> simply do atomic_inc(&_stp_transport_failures) ?
<agentzh> but we could do better in contexts allowing sleeping?
<fche> yeah, increment the failures
<fche> hm, this trylock gadget should probably be moved over into transport/relay_v2.c
<fche> the actual function that deals with resources contended with userspace
<fche> _stp_data_write_reserve probably
<fche> the runtime/print_flush.c cannot really deal with rchan_buf type things, those are configuration specific
<agentzh> not really, i tried.
<agentzh> the reader would read uninitialized reserved space in the subbufs.
<agentzh> so there's still \0 in the output under load test.
<agentzh> simply failing would be worse than before. it warrants data loss for scripts generating a lot of output.
<agentzh> that's why i put the lock around code doing the memcpy().
<agentzh> the lock contention is serious here since the staprun keeps reading like crazy.
<fche> staprun shouldn't poll so frequently as a matter of fact
<fche> but yeah seeing that relay_v2's _stp_data_write_commit() is empty, it's clear there is a race that covers the subbuf reservation and the memcpy vs. the staprun read
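[editor's note] The approach discussed above (take the reader-shared lock with a bounded trylock, copy under it, and otherwise drop the data and bump a failure counter) can be sketched in userspace C. This is only an illustration, not the stap runtime code: every `demo_` name is hypothetical, an `atomic_flag` stands in for the relay inode lock, and `demo_transport_failures` stands in for `_stp_transport_failures`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 64    /* hypothetical size, not STP_BUFFER_SIZE */
#define DEMO_MAX_TRIES 100  /* bounded retries, like the 100 iterations in the chat */

static atomic_flag demo_reader_lock = ATOMIC_FLAG_INIT; /* stand-in for the inode lock */
static atomic_int demo_transport_failures;              /* stand-in failure counter */
static char demo_subbuf[DEMO_BUF_SIZE];
static size_t demo_used;

/* Try to take the lock a bounded number of times; never sleeps. */
bool demo_trylock(void)
{
    for (int i = 0; i < DEMO_MAX_TRIES; i++) {
        if (!atomic_flag_test_and_set_explicit(&demo_reader_lock,
                                               memory_order_acquire))
            return true;
    }
    return false;
}

void demo_unlock(void)
{
    atomic_flag_clear_explicit(&demo_reader_lock, memory_order_release);
}

/* Returns the number of bytes written; 0 means the data was dropped
 * and the failure counter was incremented. */
size_t demo_write(const char *data, size_t len)
{
    if (!demo_trylock()) {
        atomic_fetch_add(&demo_transport_failures, 1);
        return 0;
    }
    if (len > DEMO_BUF_SIZE - demo_used) {
        demo_unlock();
        atomic_fetch_add(&demo_transport_failures, 1);
        return 0;
    }
    /* Reservation and copy both happen while the reader is locked out. */
    memcpy(demo_subbuf + demo_used, data, len);
    demo_used += len;
    demo_unlock();
    return len;
}
```

The point of the design is that the reader can never observe reserved but not yet copied space, and a contended writer fails fast instead of sleeping in an atomic context.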
derek0883 has quit [Ping timeout: 246 seconds]
derek0883 has joined #systemtap
irker644 has joined #systemtap
<irker644> systemtap: wcohen systemtap.git:master * release-4.3-16-g3d922919d / testsuite/systemtap.examples/profiling/periodic.stp: Use explicit @cast() operators for periodic.stp
orivej has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
<agentzh> i moved the inode lock outside the print lock.
<agentzh> so it may sleep in contexts which allow.
<fche> at least also protect it with something that ties it to relay_v2.c
<agentzh> what does that mean?
<fche> well, just concerned that this transport-independent file is learning about rchan_buf and inode and such inside only a #ifdef (__KERNEL__) guard
<fche> maybe also assert STP_TRANSPORT_VERSION == 2
<agentzh> okay, gotcha. yeah, i was trying to be lazy. will address it.
<agentzh> for transport v1, is it centos 6's 2.6 kernel?
<agentzh> just wondering how to test that case.
<agentzh> i'll do more thorough load testing for different configs.
<agentzh> load testing is helpful.
<fche> yeah probably
<fche> though rhel6 has practically fallen off of our radar ... not quite but close
<agentzh> any ETA for that?
<fche> nope
<agentzh> rhel6 is indeed reaching its EOL.
<fche> there is eol and there is eol-eol :)
<agentzh> lol
<agentzh> indeed
<irker644> systemtap: wcohen systemtap.git:master * release-4.3-17-gb2d18cb3a / testsuite/systemtap.examples/process/semop-watch.stp: Use explicit @cast() operators for semop-watch.stp example.
<agentzh> fche: is the first patch for removing relay_flush() good to merge?
<fche> agentzh, sure
derek0883 has quit [Ping timeout: 240 seconds]
orivej has quit [Ping timeout: 240 seconds]
orivej_ has joined #systemtap
derek0883 has joined #systemtap
mjw has quit [Quit: Leaving]
<agentzh> fche: thanks.
<agentzh> for the 2nd patch, seems like in_interrupt() || in_atomic() is still not enough.
<agentzh> still seeing deadlocks.
<fche> yeah suggest not trying anything beyond a trylock
<agentzh> k, will do this first.
<agentzh> do you think it should be inside the print lock or outside?
<fche> some slight data loss is better. if you can just drop the data that we can't reliably send out, and increment the transport-failures counter, that should be good enough
<agentzh> okay, corrupted data is indeed dangerous.
<agentzh> could lead to all kinds of evil.
<agentzh> fche: just tried discard data directly and it seems like a bit too much data loss due to the fact it discards 8192+ bytes a time...
<agentzh> the STP_BUFFER_SIZE has a default value of 8KB.
<agentzh> maybe we still let it proceed?
orivej_ has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
<agentzh> it seems 99% of the time it is good and even for the 1% of the time, the bad data is quite limited.
<agentzh> just under 1kb in my own load test.
<fche> IMO let's let it discard
<fche> and ideally figure out whether staprun can be made to back off so the contention is not hit so much of the time
<fche> ISTR we poll every 50ms or some such thing, which is too short even for interactive purposes
<irker644> systemtap: wcohen systemtap.git:master * release-4.3-18-g36430614d / tapset/linux/kprocess.stp: Use kernel.trace("sched:sched_process_fork") for kprocess.create when possible
tromey has quit [Quit: ERC (IRC client for Emacs 28.0.50)]
<agentzh> fche: staprun uses ppoll with a default timeout of 10sec.
<fche> strace stapio
<fche> there is a 200ms ppoll also
<fche> hm 200ms is not that bad
<fche> that shouldn't be contending, but maybe it's different during probe-end timeframe
<agentzh> because ko emits a lot of data and the staprun reader also needs to read a lot of data.
<agentzh> if the staprun reader does not read fast enough, there would also be data loss on the sender side.
<agentzh> losing so much data so frequently would be a regression for our use cases.
<agentzh> not an improvement...
<agentzh> fche: i just tried process.begin probes, same thing. almost immediately lose about 8KB of output under load.
orivej has quit [Ping timeout: 258 seconds]
orivej has joined #systemtap
<fche> known-lost data is an improvement over corrupt data tho
<agentzh> well, more data and more info are much more important to us since we can always skip bad data...
<agentzh> assuming no other consequences.
<fche> stap -s NNNN ?
<fche> bump that up for you? or try stap -b
<agentzh> yeah, increasing the subbuf_size would definitely help.
<agentzh> but i still don't want to use full buffering to avoid any data loss...
<agentzh> since our stap scripts can generate hundreds of mega bytes of data...
<agentzh> -b may not be so helpful for our use cases, since it's usually only using one CPU...
<agentzh> and for my load test, it's only generating one cpu file.
<agentzh> fche: btw, the lock contention is bad when the machine is under significant load.
<agentzh> if the machine is idle, then it's fine...
<fche> ok would be good to know whence the contention exactly
<fche> but anyway
<agentzh> to reduce the data loss rate, maybe the user should decrease the STP_BUFFER_SIZE value?
<fche> IMHO still dropping data rather than corrupting it is a good default
<agentzh> i agree.
<agentzh> maybe make it controllable by a macro?
<agentzh> like _STP_UNSAFE_RELAY_DATA?
<fche> and for your purposes you could add another #ifdef STP_TRANSPORT_RISKY or something to tweak behaviour
<agentzh> sounds good
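[editor's note] A compile-time switch of the kind agreed on above might look like the following sketch. `STP_TRANSPORT_RISKY` is the placeholder name floated in the chat; the `demo_` names, the result enum, and the buffer are hypothetical and only illustrate the default (drop and count) versus the opt-in (write anyway) behaviour when the lock could not be taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Build with -DSTP_TRANSPORT_RISKY to opt into writing without the lock. */
#ifdef STP_TRANSPORT_RISKY
#define DEMO_WRITE_WITHOUT_LOCK true
#else
#define DEMO_WRITE_WITHOUT_LOCK false
#endif

enum demo_result { DEMO_WRITTEN, DEMO_WRITTEN_UNLOCKED, DEMO_DROPPED };

unsigned demo_failures; /* stand-in for the transport-failures counter */
char demo_out[32];      /* stand-in for the output subbuffer */

enum demo_result demo_flush(const char *data, size_t len, bool lock_acquired)
{
    if (len > sizeof demo_out) {
        demo_failures++;
        return DEMO_DROPPED;
    }
    if (!lock_acquired && !DEMO_WRITE_WITHOUT_LOCK) {
        /* Default: known-lost data rather than possibly corrupt data. */
        demo_failures++;
        return DEMO_DROPPED;
    }
    memcpy(demo_out, data, len);
    return lock_acquired ? DEMO_WRITTEN : DEMO_WRITTEN_UNLOCKED;
}
```

In a default build, a call with `lock_acquired == false` is dropped and counted; with the macro defined, the same call proceeds and accepts the risk of a racing reader.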
orivej has quit [Ping timeout: 260 seconds]
<agentzh> i'll adjust the patch accordingly.
<agentzh> fche: will you have a quick look at this v4 patch? https://gist.github.com/agentzh/d821ea47d23f3f75a5d8246264b79005
sapatel has quit [Remote host closed the connection]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<agentzh> fche: okay, i was wrong. sorry. stap -b helps even though only one cpu file is used. strange...
<agentzh> i just tried -b and there's no data loss even under load.
<agentzh> ah, i see. when -b is not specified, the relay trace always binds to cpu 0...
<agentzh> that's why the contention is so severe in this case.
<agentzh> is it a bug?
<agentzh> shouldn't it choose other cpus?
<agentzh> cpu cores, i mean.
<agentzh> -b helps because it randomly chooses cpu for output.
<agentzh> i have a 8c/16t CPU, so it's much better.
<agentzh> -b allocates 16x memory though...and only one set of bufs can be used.
<agentzh> quite wasteful.
<agentzh> (as a work-around)
<agentzh> okay, seems like the kernel's relay impl indeed hardcodes cpu 0 to the "global" bufs.
<agentzh> seems like a misdesign to me...
<agentzh> wondering if there's a way to choose a custom cpu for the global buf case...
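[editor's note] The buffer selection described above, as agentzh reports it (one "global" buffer tied to cpu 0 without -b, one buffer per CPU with -b), can be summarised in a small sketch. The function names are hypothetical and this is not the kernel relay code; it only restates the behaviour reported in the chat.

```c
#include <assert.h>
#include <stdbool.h>

/* Which per-CPU relay buffer a writer would use. */
unsigned demo_relay_buf_index(bool bulk_mode, unsigned current_cpu)
{
    return bulk_mode ? current_cpu : 0;
}

/* How many buffer sets get allocated for a machine with ncpus CPUs. */
unsigned demo_relay_buf_count(bool bulk_mode, unsigned ncpus)
{
    return bulk_mode ? ncpus : 1;
}
```

On the 16-thread machine mentioned above, bulk mode gives 16 buffer sets, which matches the "16x memory" remark.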
sapatel has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
<fche> hey
<fche> so yeah I suspect lkml won't have much interest in rearchitecting relayfs :) re. cpu0
<fche> re. patch v4
<fche> hm not sure the entryfn_context logic is right in the first hunk
<fche> that shouldn't be conditional on inode_locked in the exit path
<fche> and ditto second hunk
<fche> the third hunk in stp_print_flush it looks correct
<fche> hm but no even there the put_context stuff must match the get_context
<fche> whether or not RISKY and whether or not inode_locked
<fche> I like the new lock/unlock function suite, that's good abstraction.
<agentzh> ah, right, sorry, my bad.
<agentzh> i added the RISKY part too fast and didn't notice the put thing.
<agentzh> fixing
<agentzh> thanks for the catch.
<fche> I think the entryfn_put_context needs to be outside the if (inode_locked) conditional
<agentzh> also fixed the _stp_get_rchan_subbuf() call to always use 0 cpu id if not in the bulk mode.
<agentzh> oh oh, right...sorry again...
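[editor's note] The review point above is that the context release must always pair with the context acquire, regardless of whether the inode lock was obtained. A minimal sketch, with hypothetical `demo_` functions standing in for the get-context/put-context pair (this is not the v4 patch):

```c
#include <assert.h>
#include <stdbool.h>

int demo_context_depth; /* outstanding get-context calls */
int demo_flushed;       /* number of flushes that actually copied data */

bool demo_get_context(void) { demo_context_depth++; return true; }
void demo_put_context(void) { demo_context_depth--; }

void demo_print_flush(bool inode_locked)
{
    if (!demo_get_context())
        return;

    if (inode_locked) {
        /* Only the copy itself depends on holding the lock. */
        demo_flushed++;
    }

    /* Outside the conditional: must match the get above, locked or not. */
    demo_put_context();
}
```

If the put were placed inside the `if (inode_locked)` branch, every failed lock attempt would leak one context reference.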
<fche> THERE WILL BE BEATINGS AND COMFY CHAIRS
<agentzh> my head is fried after days of tracing this race...
<fche> heh
<fche> you've done amazing work.
<agentzh> thanks!