#systemtap on 2020-08-25 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

00:01 amerey has quit [Remote host closed the connection]

00:01 amerey has joined #systemtap

00:03 khaled has quit [Quit: Konversation terminated!]

00:19 amerey has quit [Remote host closed the connection]

00:31 derek0883 has quit [Remote host closed the connection]

00:31 derek0883 has joined #systemtap

00:37 derek0883 has quit [Ping timeout: 265 seconds]

00:56 derek0883 has joined #systemtap

01:08 <linus2> int fun1() {fun2(); fun3(); fun4();}

01:10 <linus2> function("fun1").callees(1) means the probe triggers whenever fun1 is executing and calls fun2, fun3 or fun4 (fun1->fun2, fun1->fun3, fun1->fun4), but not when fun8 calls them (fun8->fun2, fun8->fun3, fun8->fun4), right?

01:13 <fche> linus2, correct

01:14 <linus2> fche: cool

01:14 <linus2> I have a test

01:17 hpt has joined #systemtap

01:46 <linus2> fche: why does this give an error? probe process(@1).function("*@*config*").callees(1).return

02:29 derek0883 has quit [Remote host closed the connection]

02:30 derek0883 has joined #systemtap

02:35 derek0883 has quit [Ping timeout: 240 seconds]

02:43 derek0883 has joined #systemtap

03:22 derek0883 has quit [Ping timeout: 244 seconds]

03:36 derek0883 has joined #systemtap

03:55 orivej has quit [Ping timeout: 265 seconds]

04:05 orivej has joined #systemtap

04:08 derek0883 has quit [Remote host closed the connection]

04:09 derek0883 has joined #systemtap

04:26 derek0883 has quit [Remote host closed the connection]

04:31 derek0883 has joined #systemtap

05:39 derek0883 has quit [Remote host closed the connection]

05:43 derek0883 has joined #systemtap

05:43 derek0883 has quit [Remote host closed the connection]

06:08 derek0883 has joined #systemtap

06:10 derek0883 has quit [Remote host closed the connection]

06:48 orivej has quit [Ping timeout: 264 seconds]

07:01 khaled has joined #systemtap

09:33 mjw has joined #systemtap

10:24 hpt has quit [Ping timeout: 240 seconds]

10:55 derek0883 has joined #systemtap

11:03 derek0883 has quit [Ping timeout: 244 seconds]

11:40 gromero has quit [Ping timeout: 272 seconds]

12:30 orivej has joined #systemtap

12:41 derek0883 has joined #systemtap

12:42 derek0883 has quit [Remote host closed the connection]

12:43 derek0883 has joined #systemtap

12:48 derek0883 has quit [Ping timeout: 240 seconds]

13:25 <linus2> fche: ?

13:25 <linus2> why does this give an error? probe process(@1).function("*@*config*").callees(1).return

14:01 amerey has joined #systemtap

15:28 sscox has quit [Ping timeout: 258 seconds]

16:16 derek0883 has joined #systemtap

16:23 orivej has quit [Ping timeout: 256 seconds]

16:24 derek0883 has quit [Remote host closed the connection]

16:25 derek0883 has joined #systemtap

16:29 orivej has joined #systemtap

16:36 orivej has quit [Ping timeout: 240 seconds]

16:40 derek0883 has quit [Remote host closed the connection]

16:40 derek0883 has joined #systemtap

17:36 sscox has joined #systemtap

17:44 derek0883 has quit [Remote host closed the connection]

17:44 derek0883 has joined #systemtap

17:59 orivej has joined #systemtap

18:11 amerey has quit [Remote host closed the connection]

18:12 amerey has joined #systemtap

18:30 amerey has quit [Remote host closed the connection]

18:30 amerey has joined #systemtap

18:37 amerey has quit [Remote host closed the connection]

18:38 amerey has joined #systemtap

18:38 derek0883 has quit [Remote host closed the connection]

18:38 amerey has quit [Remote host closed the connection]

18:39 derek0883 has joined #systemtap

18:39 amerey has joined #systemtap

18:40 amerey has quit [Remote host closed the connection]

18:40 amerey has joined #systemtap

18:43 derek0883 has quit [Ping timeout: 240 seconds]

18:45 amerey has quit [Quit: Leaving]

20:27 orivej has quit [Ping timeout: 258 seconds]

20:34 <agentzh> fche: howdy!

20:34 <fche> YO

20:35 <agentzh> we ran into stap errors like "ERROR: utrace_control returned error -3 on pid 20427".

20:35 <agentzh> 3 is ESRCH

20:35 <agentzh> i wonder if it's just that the target process has already quit and been reaped?

20:35 <agentzh> should we mute this error for this error code?

20:35 <agentzh> we only see this on machines with a lot of cpu cores

20:35 <agentzh> 32c/64t

20:36 <agentzh> it's in the task finder.

20:36 <agentzh> the processes are not really those we are interested in.

20:37 <fche> it's not a unique error code; it's generated in a couple of places in stp_utrace.c, apparently related to process lifecycles faster than the code can respond to, something like that

20:37 <fche> nothing really a user can do anything about

20:37 <fche> I believe the effect is mainly that some tasks will go un-probed even though expected

20:37 <fche> perhaps that should be a warning only

20:38 derek0883 has joined #systemtap

20:40 <agentzh> yeah i see a lot of places can produce that error code.

20:40 <agentzh> inside the stap runtime.

20:40 <agentzh> along the utrace_control() code path.

20:41 <agentzh> so it's not a bug in stap or kernel we should be concerned about?

20:42 <agentzh> we're also debugging a kernel freeze when running stap on such boxes with many cpu cores.

20:42 <agentzh> hopefully this one is not related.

20:42 <fche> could see a kernel freeze (some cpu spinning madly) -causing- this sort of downstream utrace error

20:43 <agentzh> interestingly the kernel freeze never happens on smaller boxes with many fewer cpu cores even when using exactly the same kernel binary and stap scripts.

20:43 <fche> what causes that takes work and/or luck to figure out :)

20:43 <agentzh> and the lockdep/debug kernel variants of the same version also found nothing.

20:43 <fche> interesting

20:43 <agentzh> we also noted the work queue thing may lead to use-after-free panics.

20:44 <agentzh> the work queue is still used in the task finder when atomic/interrupt contexts are hit.

20:44 <agentzh> but that may be another separate issue.

20:44 <fche> AIUI, the very purpose of workqueues is to defer work from atomic/interrupt contexts

21:15 derek0883 has quit [Ping timeout: 244 seconds]

21:18 derek0883 has joined #systemtap

21:25 <agentzh> fche: yeah, but seems like the delayed work uses some data structures which may already have been freed. we will dig deeper.

21:26 <fche> thanks

21:52 <agentzh> sure

21:52 <agentzh> fortunately we can now reliably reproduce the kernel freeze locally in a kvm vm.

21:52 <agentzh> we can analyze the kdump out of it with the crash utility.

21:53 <agentzh> it was first observed in a customer's big box which is bare metal.

21:55 <agentzh> fche: oh btw, sometimes the kernel from certain linux distros may be compiled with a non-default gcc. shall we introduce a --cc=PATH option for stap?

22:12 derek0883 has quit [Remote host closed the connection]

22:14 derek0883 has joined #systemtap

22:33 <fche> -B CC=.....

23:27 orivej has joined #systemtap

23:43 sscox has quit [Ping timeout: 240 seconds]