#systemtap on 2018-11-27 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

01:34 orivej has joined #systemtap

02:22 orivej has quit [Ping timeout: 268 seconds]

03:01 <promach> fche: I believe the timing overhead of those two sequential wake_up() calls is trivial since computers nowadays are multicore and hyperthreaded

03:01 <promach> right ?

03:02 <promach> I mean both recv and send threads could be running on different cores

03:19 <fche> promach, sorry, not very familiar with this area

03:19 <fche> multicore does not imply timing overhead being minimal - if anything, numa / faraway memory can mean more cache pingponging etc.

03:25 <promach> in this case, how do I measure the timing overhead of those two sequential wake_up() calls?

03:25 <fche> could try putting statement probes before /after the wake_up's

03:26 <promach> I believe interrupt handler cannot be interrupted by some normal threads

03:26 <fche> or if putting one after is not possible (being at the end of a function), a .return probe for the second case

03:26 <fche> stap (kprobes) is not an interrupt / thread, it's an exception

03:26 <fche> so yeah I believe interrupt handlers can be probed
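
[Editor's sketch] The approach fche describes above (a statement probe before the first wake_up() and a .return probe after the second) might look roughly like the following. The function name hypothetical_isr, the file name riffa_driver.c, and the line number 123 are placeholders and are not taken from the riffa source; they would need to be replaced with the real location of the first wake_up() call. As fche points out further down, probe dispatch cost is included in the interval, so the figure is an upper bound rather than the true cost.

```systemtap
# Sketch only: hypothetical_isr, riffa_driver.c and line 123 are placeholders.
global t_before

# Fires just before the first wake_up() (the line number is an assumption).
probe module("riffa").statement("hypothetical_isr@riffa_driver.c:123")
{
  t_before[cpu()] = gettimeofday_ns()
}

# Fires when the function returns, i.e. after the second wake_up().
probe module("riffa").function("hypothetical_isr").return
{
  if (cpu() in t_before) {
    printf("wake_up pair: %d ns (probe overhead included)\n",
           gettimeofday_ns() - t_before[cpu()])
    delete t_before[cpu()]
  }
}
```

Keying the timestamp by cpu() rather than pid() is a design choice for this sketch: in an interrupt handler the current task is whatever happened to be interrupted, while the CPU stays the same between the two probes.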

03:27 <promach> fche: how is stap(kprobes) different from https://paste.ubuntu.com/p/dhTb7pxV82/ ?

03:28 <fche> well, it's not compiled in

03:28 <fche> so it's inserted and removed on the fly

03:28 <fche> with commensurate overheads however

03:29 <fche> if you can compile changes like that in, that's about the fastest way to go

03:29 <fche> though do_gettimeofday() may not be the best function to call, ktime* or get_cycles* or something like that worth a look

03:30 <promach> so printing out timestamps will give me the same timing info as if I am using stap(kprobes) , right ?

03:30 <fche> no, because dynamically inserted probes incur dispatching costs

03:31 <fche> so stap-observed times will be longer

03:32 <promach> ok, I see. By the way, what I worry is that the printed timestamps in the interrupt handler might not be accurate because wake_up() does not really take the actual effect until the interrupt handler finally exits, right ?

03:34 <fche> That's a whole other consideration. wake_up just marks tasks as runnable. the time to them actually running could be indefinitely large

03:35 <promach> the time to them actually running could be indefinitely large <-- this is my actual concern right now

03:36 <promach> but the wait_queue argument inside wake_up() is actually thread, so printing timestamp within that thread function seems not really feasible

03:37 <promach> I cannot be sure where exactly to put the timestamp printout within that thread function

03:39 <fche> if you know which thread/task is ultimately getting woken up (if the queue has only a single waiter, e.g.), then you can probe kernel.trace("sched_switch") and check out if ($next == ...)

03:39 <fche> for noting when the target task finally is about to start
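
[Editor's sketch] A minimal version of the sched_switch check fche describes, assuming the thread id of the woken task is already known and is passed as the first script argument ($1). The script file name in the usage comment is hypothetical.

```systemtap
# Usage (script name hypothetical): stap watch_switch.stp TID
probe kernel.trace("sched_switch")
{
  # $next is the task about to run; its pid field is the kernel thread id.
  if ($next->pid == $1)
    printf("%d ns: tid %d switched in on cpu %d\n",
           gettimeofday_ns(), $next->pid, cpu())
}
```

Comparing this timestamp with one taken at the wake_up() call gives the wakeup-to-run latency, again inflated by probe overhead.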

03:41 <promach> ok, cool. this would probably address my concern

03:43 <promach> wait, ftrace also has sched_switch option

03:44 <fche> sure; it's a defined tracepoint in the kernel that several tools can use

03:45 <promach> https://sourceware.org/systemtap/examples/profiling/sched_switch.stp

03:52 <promach> sudo stap --example sched_switch.stp 'module("/lib/modules/4.18.6-041806-generic/kernel/drivers/riffa/riffa.ko").function("wake_up")'

03:52 <promach> fche: Wrong number of arguments, use none, 'pid nr' or 'name proc'

03:55 <fche> that script is designed to be invoked with "pid NUMBER" or "proc NAME" as parameters

03:55 <promach> fche: but I am probing a linux driver

03:55 <fche> yes, so this particular script is not a great fit for you

03:56 <promach> hmm...

03:56 <fche> https://sourceware.org/systemtap/examples/process/sched-latency.stp probably a more relevant choice

03:58 <promach> thanks

04:00 <promach> fche: but is it possible to probe only this riffa.ko linux driver without involving other pids using this sched-latency.stp script ?

04:00 <fche> you'd need to modify it

04:01 <fche> the easiest way could be to drop the sched_wakeup probe, and replace it with that module().statement() or .function() probe in your driver, which is about to do the wakeups

04:01 <fche> that way only riffa-related tasks get woken up

04:01 <fche> but for that to work, this riffa.ko-side probe must figure out which pid it will wake up (eventually), to save it in that ts0[] array

04:02 <promach> hmm...

04:20 <promach> and http://www.brendangregg.com/blog/2017-03-16/perf-sched.html does not seem to work with thread scheduling/wakeup within a linux driver

04:22 <promach> so, I still need to modify https://sourceware.org/systemtap/examples/process/sched-latency.stp

05:16 orivej has joined #systemtap

08:31 jmux has joined #systemtap

09:28 <jmux> Hi. I've written a systemtap probe script I want to convert into a simpler kprobe.

09:29 <jmux> The systemtap script is in http://paste.debian.net/1053371/ , the converted kprobe in http://paste.debian.net/1053372/

09:31 <jmux> The script works as expected, but my simpler kprobe quickly kills the system, because I get a page fault because I use __getname and __putname, as far as I could deduce.

09:31 slowfranklin has joined #systemtap

09:34 <jmux> So I've tried to find information on how to handle the page fault, but I couldn't find any examples. systemtap doesn't seem to use that handler and has a lot of code for its own string handling.

09:35 <jmux> Does anybody know how I can handle the page fault correctly in the kprobe fault handler?

10:56 sscox has quit [Ping timeout: 268 seconds]

11:48 mjw has joined #systemtap

12:14 <fche> jmux, stap's runtime includes page-fault-suppression, segmentation configuration, and a few other tricks when accessing kernel- or user-space values like that

12:14 <fche> precisely because a page fault (or other exception) during probe handling could be fatal

12:15 <fche> and from a kprobe, you can't let the kernel do its normal thing for e.g. userspace pagefaults

12:15 <fche> so you would need to copy over some of the wild inline-assembly stuff in runtime/string* or someplace like that

12:16 <fche> this is one of the types of value systemtap brings to the table

12:17 <jmux> fche: I don't need any userspace communication. I just want to evaluate the path in kernel space.

12:17 <jmux> And for whatever reason the __getname / __putname calls generate a page fault AFAIK (trapnr is 14).

12:19 <fche> could it be the earlier magic = path->.... chain?

12:19 <jmux> I looked at other users of these calls and couldn't see any locking whatsoever. If I replace the "pathbuf = __getname()" with char pathbuf[PATH_MAX] it works / seems to work.

12:20 <jmux> Normally I can reproduce the page fault error and system degradation within 1-2 minutes opening and closing Dolphin windows in KDE.

12:22 <jmux> If I comment out the whole __getname to __putname block I can't reproduce. If I just leave the "if __getname then __putname", it breaks

12:22 <jmux> But putting 4k PATH_MAX on the stack is probably not the best idea.

12:23 <fche> er yes, that too

12:23 <fche> (systemtap strings don't go on the stack)

12:23 <fche> again, for such reasons

12:25 <jmux> fche: so everything I could deduce from my tested variants points to the allocation, but I don't know how I can find the real position of the page fault.

12:26 <fche> it could be the kernel stack overflow

12:26 <fche> put the name on the heap, worth a try

12:27 <fche> hm, actually I don't see this part in your code

12:27 <jmux> What I didn't try yet is to use kmalloc + kfree directly, as a lot of other places in the linux code use __getname / __putname without any locking I could see

12:28 <jmux> fche: the code just gets the memory from a cache for PATH_MAX objects

12:29 <fche> another cute problem in this space is the inability to do some types of memory allocation from within (atomic) kprobe context

12:30 <jmux> But I tested by replacing the "pathbuf = __getname()" with char pathbuf[PATH_MAX].

12:30 slowfranklin has left #systemtap [#systemtap]

12:31 <jmux> fche: that is definitely the problem, I think. For the first 1-2 minutes opening windows is fine. But probably the probe exhausts the cache at some point and then results in a page fault

12:32 <jmux> opening and closing Dolphin windows in KDE that is. Couldn't reliably break the kernel via SSH, but with GUI it just takes 1-2 minutes

12:35 <jmux> fche: when does systemtap allocate its strings? I originally thought to use this module as a base for my kprobe, but after seeing this generated code, I thought it's easier to start from scratch

13:01 <promach> fche: this riffa.ko-side probe must figure out which pid it will wake up (eventually), to save it in that ts0[] array <-- I cannot possibly have any idea about this pid before the riffa c program is run

13:06 <fche> jmux, stap creates a synthetic, explicit stack frame-like context structure to track local / temporary variables, especially strings

13:06 <fche> created on the heap as a big array at startup

13:07 <fche> its generated code is not intended to be easily understood by humans, let alone used as a basis for non-stap work

13:07 <fche> but 'course you're very welcome to try!

13:07 <fche> promach, you don't have to do it before the program is run, but need to do it at some point during the run

13:08 <fche> if you don't know how to get a pid out of a waitq structure, another way could be to have another pair of probes, one for riffa entry and one for exit, which sets a global flag to say 'the sched wakeup about to happen is INTERESTING'

13:08 <fche> like ...

13:09 <fche> probe module("riffa").function("whatever").call { interesting[pid()] = 1} probe module("riffa").function("whatever").return{ delete interesting[pid()] }

13:10 <fche> then keep that probe kernel.trace("sched_wakeup") and add a conditional ... if (!(pid() in interesting)) next; at the top

13:13 <jmux> fche: yup - I just read my old systemtap kprobe code and realized it's basically all static variables with fixed char arrays.

13:14 <fche> yup. stap does a little better than that, at the cost of having to track it all etc., and making the code harder to read

13:14 <jmux> Now I'm wondering how this works, with "As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of the same handler) may run concurrently on different CPUs."

13:16 <jmux> I saw malloc uses spinlocks - guess to avoid a sleep. Then I found STP_ALLOC_FLAGS, so now I try kmalloc with these ("Default, and should be "safe" from anywhere.").

13:17 <jmux> I guess using the cached PATH_MAX slabs is simply not safe in this context. Probably might sleep. It's not a big problem if I fail some kmallocs, so have some stray fanotify events.

13:37 sscox has joined #systemtap

13:37 <promach> fche: and add a conditional ... if (!(pid() in interesting)) next; at the top <-- I do not get this

13:38 <promach> what do you mean by "a conditional" ?

13:44 <fche> an if statement is a type of conditional

13:45 <fche> probe kernel.trace("sched_wakeup") { if (!(pid() in interesting)) next; ts0[$p->pid] = now() }
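
[Editor's sketch] Putting fche's pieces together, a cut-down variant of the sched-latency idea could look like the following. It is an assumption-laden outline, not the actual sched-latency.stp: "whatever" is fche's own placeholder for the riffa function that issues the wakeups, gettimeofday_ns() stands in for the now() helper mentioned above, and the sched_switch half is added here only to show where ts0[] would be consumed. The test is written as !(pid() in interesting) because unary ! binds more tightly than the in operator.

```systemtap
# Sketch only: "whatever" is a placeholder for the real riffa function.
global interesting, ts0

# Mark the window during which riffa code is issuing wakeups.
probe module("riffa").function("whatever").call   { interesting[pid()] = 1 }
probe module("riffa").function("whatever").return { delete interesting[pid()] }

# Record a timestamp only for wakeups issued inside that window.
probe kernel.trace("sched_wakeup")
{
  if (!(pid() in interesting)) next
  ts0[$p->pid] = gettimeofday_ns()
}

# Report the delay when the woken task is finally switched in.
probe kernel.trace("sched_switch")
{
  t = $next->pid
  if (t in ts0) {
    printf("tid %d ran %d ns after wakeup\n", t, gettimeofday_ns() - ts0[t])
    delete ts0[t]
  }
}
```

This avoids having to extract a pid from the wait queue: the flag is keyed on whichever task is current while the riffa function runs, and the sched_wakeup tracepoint fires in that same context.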

13:52 <jmux> fche: so I replaced the __getname with a kmalloc(PATH_MAX, (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN) & ~__GFP_RECLAIM) => no more reproducible kprobe faults

13:54 <jmux> whatever happens in this cache slab seems to be the origin. Even without the cache, this should be a hot memory path, so there shouldn't be much of an impact

13:54 <jmux> fche: thanks for investing your time on my problem

14:03 <fche> no problem

14:03 <fche> glad you made some progress!

14:23 tromey has joined #systemtap

15:17 brolley has joined #systemtap

17:21 jmux has quit [Quit: Konversation terminated!]

17:59 brolley has left #systemtap [#systemtap]

18:34 wcohen has joined #systemtap

18:43 slowfranklin has joined #systemtap

19:00 slowfranklin has quit [Quit: slowfranklin]

19:15 slowfranklin has joined #systemtap

20:39 orivej has quit [Ping timeout: 250 seconds]

20:50 slowfranklin has quit [Quit: slowfranklin]

21:00 slowfranklin has joined #systemtap

21:19 slowfranklin has quit [Quit: slowfranklin]

21:25 slowfranklin has joined #systemtap

21:56 slowfranklin has quit [Quit: slowfranklin]

22:49 tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]

22:57 sscox has quit [Ping timeout: 268 seconds]

23:02 wcohen has quit [Ping timeout: 268 seconds]

23:36 mjw has quit [Quit: Leaving]