fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
orivej has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
<promach> fche: I believe the timing overhead of those two sequential wake_up() calls is trivial, since computers nowadays are multicore with HT
<promach> right ?
<promach> I mean both recv and send threads could be running on different cores
<fche> promach, sorry, not very familiar with this area
<fche> multicore does not imply timing overhead being minimal - if anything, numa / faraway memory can mean more cache pingponging etc.
<promach> in this case, how do I measure the timing overhead of those two sequential wake_up() calls ?
<fche> could try putting statement probes before /after the wake_up's
<promach> I believe interrupt handler cannot be interrupted by some normal threads
<fche> or if putting one after is not possible (being at the end of a function), a .return probe for the second case
<fche> stap (kprobes) is not an interrupt / thread, it's an exception
<fche> so yeah I believe interrupt handlers can be probed
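A minimal sketch of the entry/.return timing fche describes, assuming the two wake_up() calls sit inside one driver function. The function name "riffa_handler_placeholder" is hypothetical and has to be replaced with the real function from riffa.ko. The reported time includes the cost of the probes themselves.

```systemtap
# Sketch only: elapsed time from entry to return of one driver function.
# ASSUMPTION: "riffa_handler_placeholder" is a made-up name; substitute the
# riffa.ko function that contains the two wake_up() calls.
probe module("riffa").function("riffa_handler_placeholder").return {
  # @entry() evaluates its expression at function entry and saves the value
  # until the function returns.
  elapsed = gettimeofday_ns() - @entry(gettimeofday_ns())
  printf("%s took %d ns (probe overhead included)\n", ppfunc(), elapsed)
}
```

To bracket only the two calls rather than the whole function, the same arithmetic could be done with a pair of module("riffa").statement("...") probes, provided debuginfo for the driver is available.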
<promach> fche: how is stap(kprobes) different from https://paste.ubuntu.com/p/dhTb7pxV82/ ?
<fche> well, it's not compiled in
<fche> so it's inserted and removed on the fly
<fche> with commensurate overheads however
<fche> if you can compile changes like that in, that's about the fastest way to go
<fche> though do_gettimeofday() may not be the best function to call, ktime* or get_cycles* or something like that worth a look
<promach> so printing out timestamps will give me the same timing info as if I am using stap(kprobes) , right ?
<fche> no, because dynamically inserted probes incur dispatching costs
<fche> so stap-observed times will be longer
<promach> ok, I see. By the way, what I worry about is that the printed timestamps in the interrupt handler might not be accurate, because wake_up() does not actually take effect until the interrupt handler finally exits, right ?
<fche> That's a whole other consideration. wake_up just marks tasks as runnable. the time to them actually running could be indefinitely large
<promach> the time to them actually running could be indefinitely large <-- this is my actual concern right now
<promach> but the wait_queue argument inside wake_up() is actually a thread, so printing a timestamp within that thread function does not seem feasible
<promach> I cannot be sure where exactly to put the timestamp printout within that thread function
<fche> if you know which thread/task is ultimately getting woken up (if the queue has only a single waiter, e.g.), then you can probe kernel.trace("sched_switch") and check out if ($next == ...)
<fche> for noting when the target task finally is about to start
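A sketch of the sched_switch check fche suggests, assuming the thread ID of the woken task is already known and is passed as the first script argument (for example `stap switch.stp 1234`, where both the file name and the number are placeholders).

```systemtap
# Sketch only: report when a known task is about to be switched in.
# ASSUMPTION: $1 is the thread ID of the task being woken, given on the
# command line. $next is the task_struct pointer supplied by the tracepoint.
probe kernel.trace("sched_switch") {
  if ($next->pid == $1)
    printf("%d ns: tid %d about to run on cpu %d\n",
           gettimeofday_ns(), $next->pid, cpu())
}
```

Note that task_struct's pid field is the per-thread ID, so for a multithreaded program the argument has to be the ID of the specific waiting thread.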
<promach> ok, cool. this would probably address my concern
<promach> wait, ftrace also has sched_switch option
<fche> sure; it's a defined tracepoint in the kernel that several tools can use
<promach> sudo stap --example sched_switch.stp 'module("/lib/modules/4.18.6-041806-generic/kernel/drivers/riffa/riffa.ko").function("wake_up")'
<promach> fche: Wrong number of arguments, use none, 'pid nr' or 'name proc'
<fche> that script is designed to be invoked with "pid NUMBER" or "proc NAME" as parameters
<promach> fche: but I am probing a linux driver
<fche> yes, so this particular script is not a great fit for you
<promach> hmm...
<promach> thanks
<promach> fche: but is it possible to probe only this riffa.ko linux driver without involving other pids using this sched-latency.stp script ?
<fche> you'd need to modify it
<fche> the easiest way could be to drop the sched_wakeup probe, and replace it with that module().statement() or .function() probe in your driver, which is about to do the wakeups
<fche> that way only riffa-related tasks get woken up
<fche> but for that to work, this riffa.ko-side probe must figure out which pid it will wake up (eventually), to save it in that ts0[] array
<promach> hmm...
<promach> and http://www.brendangregg.com/blog/2017-03-16/perf-sched.html does not seem to work with threads scheduling/wakeup within a linux driver
orivej has joined #systemtap
jmux has joined #systemtap
<jmux> Hi. I've written a systemtap probe script I want to convert into a simpler kprobe.
<jmux> The systemtap script is in http://paste.debian.net/1053371/ , the converted kprobe in http://paste.debian.net/1053372/
<jmux> The script works as expected, but my simpler kprobe quickly kills the system, because I get a page fault because I use __getname and __putname, as far as I could deduce.
slowfranklin has joined #systemtap
<jmux> So I've tried to find information on how to handle the page fault, but I couldn't find any examples. systemtap doesn't seem to use that handler and has a lot of code for its own string handling.
<jmux> Does anybody know how I can handle the page fault correctly in the kprobe fault handler?
sscox has quit [Ping timeout: 268 seconds]
mjw has joined #systemtap
<fche> jmux, stap's runtime includes page-fault-suppression, segmentation configuration, and a few other tricks when accessing kernel- or user-space values like that
<fche> precisely because a page fault (or other exception) during probe handling could be fatal
<fche> and from a kprobe, you can't let the kernel do its normal thing for e.g. userspace pagefaults
<fche> so you would need to copy over some of the wild inline-assembly stuff in runtime/string* or someplace like that
<fche> this is one of the types of value systemtap brings to the table
<jmux> fche: I don't need any userspace communication. I just want to evaluate the path in kernel space.
<jmux> And for whatever reason the __getname / __putname calls generate a page fault AFAIK (trapnr is 14).
<fche> could it be the earlier magic = path->.... chain?
<jmux> I looked at other users of these calls and couldn't see any locking whatsoever. If I replace the "pathbuf = __getname()" with char pathbuf[PATH_MAX] it works / seems to work.
<jmux> Normally I can reproduce the page fault error and system degradation within 1-2 minutes of opening and closing Dolphin windows in KDE.
<jmux> If I comment out the whole __getname to __putname block I can't reproduce it. If I just leave the "if __getname then __putname", it breaks
<jmux> But putting 4k PATH_MAX on the stack is probably not the best idea.
<fche> er yes, that too
<fche> (systemtap strings don't go on the stack)
<fche> again, for such reasons
<jmux> fche: so everything I could deduce from my tested variants points to the allocation, but I don't know how I can find the real position of the page fault.
<fche> it could be the kernel stack overflow
<fche> put the name on the heap, worth a try
<fche> hm, actually I don't see this part in your code
<jmux> What I didn't try yet is to use kmalloc + kfree directly, as a lot of other places in the linux code use __getname / __putname without any locking I could see
<jmux> fche: the code just gets the memory from a cache for PATH_MAX objects
<fche> another cute problem in this space is the inability to do some types of memory allocation from within (atomic) kprobe context
<jmux> But I tested by replacing the "pathbuf = __getname()" with char pathbuf[PATH_MAX].
slowfranklin has left #systemtap [#systemtap]
<jmux> fche: that is definitely the problem, I think. For the first 1-2 minutes opening windows is fine. But probably the probe exhausts the cache at some point and that then results in a page fault
<jmux> opening and closing Dolphin windows in KDE that is. Couldn't reliably break the kernel via SSH, but with GUI it just takes 1-2 minutes
<jmux> fche: when does systemtap allocate its strings? I originally thought to use this module as a base for my kprobe, but after seeing this generated code, I thought it's easier to start from scratch
<promach> fche: this riffa.ko-side probe must figure out which pid it will wake up (eventually), to save it in that ts0[] array <-- I cannot possibly have any idea about this pid before the riffa c program is run
<fche> jmux, stap creates a synthetic, explicit stack frame-like context structure to track local / temporary variables, especially strings
<fche> created on the heap as a big array at startup
<fche> its generated code is not intended to be easily understood by humans, let alone used as a basis for non-stap work
<fche> but 'course you're very welcome to try!
<fche> promach, you don't have to do it before the program is run, but need to do it at some point during the run
<fche> if you don't know how to get a pid out of a waitq structure, another way could be to have another pair of probes, one for riffa entry and one for exit, which sets a global flag to say 'the sched wakeup about to happen is INTERESTING'
<fche> like ...
<fche> probe module("riffa").function("whatever").call { interesting[pid()] = 1} probe module("riffa").function("whatever").return{ delete interesting[pid()] }
<fche> then keep that probe kernel.trace("sched_wakeup") and add a conditional ... if (!(pid() in interesting)) next; at the top
<jmux> fche: yup - I just read my old systemtap kprobe code and realized it's basically all static variables with fixed char arrays.
<fche> yup. stap does a little better than that, at the cost of having to track it all etc., and making the code harder to read
<jmux> Now I'm wondering how this works, with "As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of the same handler) may run concurrently on different CPUs."
<jmux> I saw malloc uses spinlocks - guess to avoid a sleep. Then I found STP_ALLOC_FLAGS, so now I try kmalloc with these ("Default, and should be "safe" from anywhere.").
<jmux> I guess using the cached PATH_MAX slabs is simply not safe in this context. It probably might sleep. It's not a big problem if some kmallocs fail, it just means some stray fanotify events.
sscox has joined #systemtap
<promach> fche: and add a conditional ... if (!(pid() in interesting)) next; at the top <-- I do not get this
<promach> what do you mean by "a conditional" ?
<fche> an if statement is a type of conditional
<fche> probe kernel.trace("sched_wakeup") { if (!(pid() in interesting)) next; ts0[$p->pid] = now() }
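Putting fche's pieces together, a wakeup-to-run latency script might look like the sketch below. It rests on several assumptions: "whatever" is fche's placeholder for the real riffa.ko function that issues the wakeups, gettimeofday_ns() stands in for the now() helper, and the sched_switch probe at the end is one possible way to close the measurement. The membership test is parenthesized as !(pid() in interesting) because `!` binds more tightly than `in` in the stap language.

```systemtap
# Sketch only: latency from a riffa-issued wakeup until the woken task runs.
# ASSUMPTION: "whatever" is a placeholder for the real driver function.
global interesting, ts0

# Mark the current context while the driver function is executing.
probe module("riffa").function("whatever").call   { interesting[pid()] = 1 }
probe module("riffa").function("whatever").return { delete interesting[pid()] }

# Record a start timestamp for each task woken while the flag is set.
probe kernel.trace("sched_wakeup") {
  if (!(pid() in interesting)) next
  ts0[$p->pid] = gettimeofday_ns()
}

# When a recorded task is about to be switched in, report the latency.
probe kernel.trace("sched_switch") {
  woken = $next->pid
  if (woken in ts0) {
    printf("tid %d: wakeup-to-run latency %d ns\n",
           woken, gettimeofday_ns() - ts0[woken])
    delete ts0[woken]
  }
}
```

One caveat: the sched_wakeup tracepoint is not guaranteed to fire in the waker's context on every kernel. If the filter never matches, the sched_waking tracepoint (where the kernel provides it) may be a better fit for the pid() check.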
<jmux> fche: so I replaced the __getname with a kmalloc(PATH_MAX, (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN) & ~__GFP_RECLAIM) => no more reproducible kprobe faults
<jmux> whatever happens in this slab cache seems to be the origin. Even without the cache, this should be a hot memory path, so there shouldn't be much of an impact
<jmux> fche: thanks for investing your time on my problem
<fche> no problem
<fche> glad you made some progress!
tromey has joined #systemtap
brolley has joined #systemtap
jmux has quit [Quit: Konversation terminated!]
brolley has left #systemtap [#systemtap]
wcohen has joined #systemtap
slowfranklin has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
orivej has quit [Ping timeout: 250 seconds]
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]
sscox has quit [Ping timeout: 268 seconds]
wcohen has quit [Ping timeout: 268 seconds]
mjw has quit [Quit: Leaving]