fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
orivej has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
<promach>
fche: I believe the timing overhead of those two sequential wake_up() calls is trivial, since computers nowadays are multicore with hyperthreading
<promach>
right ?
<promach>
I mean both recv and send threads could be running on different cores
<fche>
promach, sorry, not very familiar with this area
<fche>
multicore does not imply timing overhead being minimal - if anything, numa / faraway memory can mean more cache pingponging etc.
<promach>
in this case, how do I measure the timing overhead of those two sequential wake_up() calls ?
<fche>
could try putting statement probes before /after the wake_up's
<promach>
I believe interrupt handler cannot be interrupted by some normal threads
<fche>
or if putting one after is not possible (being at the end of a function), a .return probe for the second case
<fche>
stap (kprobes) is not an interrupt / thread, it's an exception
<fche>
so yeah I believe interrupt handlers can be probed
<fche>
if you can compile changes like that in, that's about the fastest way to go
<fche>
though do_gettimeofday() may not be the best function to call; ktime* or get_cycles* or something like that is worth a look
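A minimal userspace sketch of the "timestamp before and after the call" idea, assuming a monotonic clock is what you want. In the driver itself the analogous call would be something like ktime_get_ns(); here clock_gettime(CLOCK_MONOTONIC) stands in for it. The names elapsed_ns, time_call_ns and sample_work are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds from *start to *end (both taken from the same clock). */
int64_t elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (end->tv_nsec - start->tv_nsec);
}

/* Time one call of fn() with a monotonic clock, so that wall-clock
 * adjustments cannot make the result negative. */
int64_t time_call_ns(void (*fn)(void))
{
    struct timespec t0, t1;
    int rc0 = clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    int rc1 = clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(rc0 == 0 && rc1 == 0);
    return elapsed_ns(&t0, &t1);
}

/* Hypothetical stand-in for the code under test (the two wake_up() calls). */
void sample_work(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000; i++)
        sink += i;
}
```

Calling time_call_ns(sample_work) returns the elapsed nanoseconds for one invocation. The measurement includes the cost of the two clock reads, so very short intervals are overestimated.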
<promach>
so printing out timestamps will give me the same timing info as if I am using stap(kprobes) , right ?
<fche>
no, because dynamically inserted probes incur dispatching costs
<fche>
so stap-observed times will be longer
<promach>
ok, I see. By the way, what I worry about is that the timestamps printed in the interrupt handler might not be accurate, because wake_up() does not actually take effect until the interrupt handler finally exits, right ?
<fche>
That's a whole other consideration. wake_up just marks tasks as runnable. the time to them actually running could be indefinitely large
<promach>
the time to them actually running could be indefinitely large <-- this is my actual concern right now
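A rough userspace analog of that distinction, assuming a pipe write plays the role of wake_up() and a child blocked in read() plays the role of the sleeping thread. The writer's timestamp is taken just before the wake-up; the child takes its own timestamp once it actually runs again. The function name wakeup_latency_ns is hypothetical, and this is only a sketch, not how the riffa driver does it.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Returns the nanoseconds between "waker sends the wake-up" and
 * "woken task is running again", or -1 on error. */
int64_t wakeup_latency_ns(void)
{
    int wake[2], report[2];
    if (pipe(wake) != 0)
        return -1;
    if (pipe(report) != 0) {
        close(wake[0]);
        close(wake[1]);
        return -1;
    }

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        int64_t sent, latency;
        close(wake[1]);
        close(report[0]);
        /* Sleeps here until the parent writes: the "woken" side. */
        if (read(wake[0], &sent, sizeof sent) != (ssize_t)sizeof sent)
            _exit(1);
        latency = now_ns() - sent;
        if (write(report[1], &latency, sizeof latency) != (ssize_t)sizeof latency)
            _exit(1);
        _exit(0);
    }

    close(wake[0]);
    close(report[1]);

    /* Give the child time to block in read() before waking it. */
    struct timespec pause = {0, 10 * 1000 * 1000};
    nanosleep(&pause, NULL);

    int64_t latency = -1;
    int64_t sent = now_ns();
    if (write(wake[1], &sent, sizeof sent) == (ssize_t)sizeof sent) {
        if (read(report[0], &latency, sizeof latency) != (ssize_t)sizeof latency)
            latency = -1;
    }

    close(wake[1]);
    close(report[0]);
    waitpid(child, NULL, 0);
    return latency;
}
```

The returned value varies from run to run and grows under load, which is the point: the write() itself returns quickly, while the time until the reader runs depends on the scheduler.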
<promach>
but the wait_queue argument inside wake_up() is actually a thread, so printing a timestamp within that thread function does not seem really feasible
<promach>
I cannot be sure where exactly to put the timestamp printout within that thread function
<fche>
if you know which thread/task is ultimately getting woken up (if the queue has only a single waiter, e.g.), then you can probe kernel.trace("sched_switch") and check out if ($next == ...)
<fche>
for noting when the target task finally is about to start
<promach>
ok, cool. this would probably address my concern
<promach>
wait, ftrace also has sched_switch option
<fche>
sure; it's a defined tracepoint in the kernel that several tools can use
<promach>
fche: but is it possible to probe only this riffa.ko linux driver without involving other pids using this sched-latency.stp script ?
<fche>
you'd need to modify it
<fche>
the easiest way could be to drop the sched_wakeup probe, and replace it with that module().statement() or .function() probe in your driver, which is about to do the wakeups
<fche>
that way only riffa-related tasks get woken up
<fche>
but for that to work, this riffa.ko-side probe must figure out which pid it will wake up (eventually), to save it in that ts0[] array
<jmux>
The script works as expected, but my simpler kprobe quickly kills the system: as far as I could deduce, I get a page fault because I use __getname and __putname.
slowfranklin has joined #systemtap
<jmux>
So I've tried to find information on how to handle the page fault, but I couldn't find any examples. systemtap doesn't seem to use that handler and has a lot of code for its own string handling.
<jmux>
Does anybody know how I can handle the page fault correctly in the kprobe fault handler?
sscox has quit [Ping timeout: 268 seconds]
mjw has joined #systemtap
<fche>
jmux, stap's runtime includes page-fault-suppression, segmentation configuration, and a few other tricks when accessing kernel- or user-space values like that
<fche>
precisely because a page fault (or other exception) during probe handling could be fatal
<fche>
and from a kprobe, you can't let the kernel do its normal thing for e.g. userspace pagefaults
<fche>
so you would need to copy over some of the wild inline-assembly stuff in runtime/string* or someplace like that
<fche>
this is one of the types of value systemtap brings to the table
<jmux>
fche: I don't need any userspace communication. I just want to evaluate the path in kernel space.
<jmux>
And for whatever reason the __getname / __putname calls generate a page fault AFAIK (trapnr is 14).
<fche>
could it be the earlier magic = path->.... chain?
<jmux>
I looked at other users of these calls and couldn't see any locking whatsoever. If I replace the "pathbuf = __getname()" with char pathbuf[PATH_MAX] it works / seems to work.
<jmux>
Normally I can reproduce the page fault error and system degradation within 1-2 minutes of opening and closing Dolphin windows in KDE.
<jmux>
If I comment out the whole __getname to __putname block I can't reproduce it. If I just leave the "if __getname then __putname", it breaks
<jmux>
But putting 4k PATH_MAX on the stack is probably not the best idea.
<fche>
er yes, that too
<fche>
(systemtap strings don't go on the stack)
<fche>
again, for such reasons
<jmux>
fche: so everything I could deduce from my tested variants points to the allocation, but I don't know how I can find the real position of the page fault.
<fche>
it could be the kernel stack overflow
<fche>
put the name on the heap, worth a try
<fche>
hm, actually I don't see this part in your code
<jmux>
What I didn't try yet is to use kmalloc + kfree directly, as a lot of other places in the linux code use __getname / __putname without any locking I could see
<jmux>
fche: the code just gets the memory from a cache for PATH_MAX objects
<fche>
another cute problem in this space is the inability to do some types of memory allocation from within (atomic) kprobe context
<jmux>
But I tested by replacing the "pathbuf = __getname()" with char pathbuf[PATH_MAX].
slowfranklin has left #systemtap [#systemtap]
<jmux>
fche: that's definitely the problem, I think. For the first 1-2 minutes opening windows is fine. But probably the probe exhausts the cache at some point, and that then results in a page fault
<jmux>
opening and closing Dolphin windows in KDE, that is. I couldn't reliably break the kernel via SSH, but with the GUI it just takes 1-2 minutes
<jmux>
fche: when does systemtap allocate its strings? I originally thought to use this module as a base for my kprobe, but after seeing this generated code, I thought it's easier to start from scratch
<promach>
fche: this riffa.ko-side probe must figure out which pid it will wake up (eventually), to save it in that ts0[] array <-- I cannot possibly have any idea about this pid before the riffa c program is run
<fche>
jmux, stap creates a synthetic, explicit stack frame-like context structure to track local / temporary variables, especially strings
<fche>
created on the heap as a big array at startup
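A small userspace sketch of that "allocate everything once at startup" pattern, assuming a fixed number of PATH_MAX-sized buffers is enough. A handler grabs a free slot with an atomic test-and-set and simply gives up if none is free, so it never calls an allocator and never sleeps. The struct and function names (path_pool, path_pool_get, ...) and POOL_SLOTS are hypothetical; this is not systemtap's actual runtime code.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif
#define POOL_SLOTS 4

struct path_pool {
    char *storage;                 /* POOL_SLOTS * PATH_MAX bytes, on the heap */
    atomic_flag busy[POOL_SLOTS];  /* set while the slot is handed out */
};

/* Called once at startup, where sleeping allocations are still allowed. */
int path_pool_init(struct path_pool *p)
{
    p->storage = malloc((size_t)POOL_SLOTS * PATH_MAX);
    if (p->storage == NULL)
        return -1;
    for (int i = 0; i < POOL_SLOTS; i++)
        atomic_flag_clear(&p->busy[i]);
    return 0;
}

/* Called from the handler: no allocation, no blocking.
 * Returns NULL when every slot is in use; the caller drops the event. */
char *path_pool_get(struct path_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!atomic_flag_test_and_set(&p->busy[i]))
            return p->storage + (size_t)i * PATH_MAX;
    }
    return NULL;
}

/* Hand a buffer obtained from path_pool_get() back to the pool. */
void path_pool_put(struct path_pool *p, char *buf)
{
    size_t i = (size_t)(buf - p->storage) / PATH_MAX;
    assert(i < POOL_SLOTS);
    atomic_flag_clear(&p->busy[i]);
}

void path_pool_destroy(struct path_pool *p)
{
    free(p->storage);
    p->storage = NULL;
}
```

The trade-off is a fixed memory cost up front and occasional dropped events when all slots are busy, in exchange for keeping the large buffer off the stack and the allocator out of the handler.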
<fche>
its generated code is not intended to be easily understood by humans, let alone used as a basis for non-stap work
<fche>
but 'course you're very welcome to try!
<fche>
promach, you don't have to do it before the program is run, but you need to do it at some point during the run
<fche>
if you don't know how to get a pid out of a waitq structure, another way could be to have another pair of probes, one for riffa entry and one for exit, which sets a global flag to say 'the sched wakeup about to happen is INTERESTING'
<fche>
then keep that probe kernel.trace("sched_wakeup") and add a conditional ... if (!(pid() in interesting)) next; at the top
<jmux>
fche: yup - I just read my old systemtap kprobe code and realized it's basically all static variables with fixed char arrays.
<fche>
yup. stap does a little better than that, at the cost of having to track it all etc., and making the code harder to read
<jmux>
Now I'm wondering how this works, with "As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of the same handler) may run concurrently on different CPUs."
<jmux>
I saw malloc uses spinlocks - guess to avoid a sleep. Then I found STP_ALLOC_FLAGS, so now I try kmalloc with these ("Default, and should be "safe" from anywhere.").
<jmux>
I guess using the cached PATH_MAX slabs is simply not safe in this context. It probably might sleep. It's not a big problem if some kmallocs fail and I end up with some stray fanotify events.
sscox has joined #systemtap
<promach>
fche: and add a conditional ... if (!(pid() in interesting)) next; at the top <-- I do not get this
<promach>
what do you mean by "a conditional" ?
<fche>
an if statement is a type of conditional
<fche>
probe kernel.trace("sched_wakeup") { if (!(pid() in interesting)) next; ts0[$p->pid] = now() }
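The bookkeeping behind that probe can be modelled in plain C, assuming one flagged waker pid and a small fixed table in place of the script's ts0[] array. on_wakeup() mirrors the sched_wakeup probe (skip unless the waker is flagged, then store a timestamp keyed by the woken pid), and on_switch() mirrors a sched_switch probe that looks the pid up again. All names here are hypothetical and the model ignores concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_TRACKED 64

struct wake_entry {
    bool used;
    int pid;               /* pid of the task that was woken */
    int64_t t_wakeup_ns;   /* when the wake-up was issued */
};

struct latency_tracker {
    int interesting_waker;               /* pid flagged as interesting, -1 if none */
    struct wake_entry ts0[MAX_TRACKED];  /* stands in for the script's ts0[] */
};

void tracker_init(struct latency_tracker *t)
{
    t->interesting_waker = -1;
    for (int i = 0; i < MAX_TRACKED; i++)
        t->ts0[i].used = false;
}

/* sched_wakeup: waker_pid plays the role of pid(), woken_pid of $p->pid.
 * Returns true if a timestamp was recorded. */
bool on_wakeup(struct latency_tracker *t, int waker_pid, int woken_pid,
               int64_t now_ns)
{
    int free_slot = -1;

    if (waker_pid != t->interesting_waker)
        return false;                       /* the "next;" branch */

    for (int i = 0; i < MAX_TRACKED; i++) {
        if (t->ts0[i].used && t->ts0[i].pid == woken_pid) {
            t->ts0[i].t_wakeup_ns = now_ns; /* refresh an existing entry */
            return true;
        }
        if (!t->ts0[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                       /* table full: drop the sample */

    t->ts0[free_slot].used = true;
    t->ts0[free_slot].pid = woken_pid;
    t->ts0[free_slot].t_wakeup_ns = now_ns;
    return true;
}

/* sched_switch: next_pid plays the role of $next->pid.
 * Returns the wake-up-to-run latency in ns, or -1 if the pid is untracked. */
int64_t on_switch(struct latency_tracker *t, int next_pid, int64_t now_ns)
{
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (t->ts0[i].used && t->ts0[i].pid == next_pid) {
            t->ts0[i].used = false;
            return now_ns - t->ts0[i].t_wakeup_ns;
        }
    }
    return -1;
}
```

With interesting_waker set to 100, a wake-up of pid 42 at t=1000 followed by a switch to pid 42 at t=3500 yields a latency of 2500, while wake-ups issued by any other pid are ignored.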
<jmux>
fche: so I replaced the __getname with a kmalloc(PATH_MAX, (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN) & ~__GFP_RECLAIM) => no more reproducible kprobe faults
<jmux>
whatever happens in this cache slab seems to be the origin. Even without the cache, this should be a hot memory path, so there shouldn't be much of an impact
<jmux>
fche: thanks for investing your time on my problem
<fche>
no problem
<fche>
glad you made some progress!
tromey has joined #systemtap
brolley has joined #systemtap
jmux has quit [Quit: Konversation terminated!]
brolley has left #systemtap [#systemtap]
wcohen has joined #systemtap
tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]