#systemtap on 2018-11-21 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

03:08 agentzh has quit [Remote host closed the connection]

04:48 sscox has quit [Ping timeout: 246 seconds]

05:01 orivej has quit [Ping timeout: 268 seconds]

08:50 slowfranklin has joined #systemtap

09:15 orivej has joined #systemtap

10:08 mjw has joined #systemtap

13:52 sscox has joined #systemtap

14:14 wcohen has joined #systemtap

14:57 brolley has joined #systemtap

15:18 tromey has joined #systemtap

16:47 orivej has quit [Ping timeout: 252 seconds]

18:18 slowfranklin has quit [Quit: slowfranklin]

18:48 orivej has joined #systemtap

20:35 slowfranklin has joined #systemtap

20:54 slowfranklin has quit [Quit: slowfranklin]

21:04 slowfranklin has joined #systemtap

21:06 slowfranklin_ has joined #systemtap

21:06 slowfranklin has quit [Read error: No route to host]

21:06 slowfranklin_ is now known as slowfranklin

21:17 mjw has quit [Quit: Leaving]

21:32 slowfranklin has quit [Quit: slowfranklin]

21:32 wcohen has quit [Ping timeout: 246 seconds]

21:45 tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]

21:58 sscox has quit [Ping timeout: 245 seconds]

22:07 agentzh has joined #systemtap

22:08 <agentzh> fche: i want to hear your opinion on how to break the hanging in stapio's _stp_main_loop() while writing warning messages to stderr.

22:09 <fche> hi agentzh

22:09 <agentzh> the signal handler only sets a C global var for the _stp_main_loop() loop to check proactively. but once the loop itself is blocking on write(2, ...).

22:09 <agentzh> it won't have a chance to check the global var flag.

22:09 <agentzh> hey, fche

22:09 <agentzh> i'm still scratching my head on how to fix this blocking bug.

22:10 <fche> perchance we just shouldn't deal with sigpipe specially at all - let it kill our process

22:10 <agentzh> yeah, maybe.

22:11 <agentzh> but who will unload the stap kernel module then?

22:11 <agentzh> stapio is special in that it needs to do cleanup.

22:11 <agentzh> unlike stap.

22:11 <fche> can deal with one process at a time

22:11 <fche> for stap, sigpipe probably not a big deal

22:12 <agentzh> another case is stderr's write buffer is full, and in this case the stap/stapio process won't respond to SIGTERM at all.

22:12 <agentzh> since both of them might be blocking on write(2, ...)

22:12 <agentzh> agreed, for stap it's fine.

22:12 <fche> for stapio, it could treat a sigpipe as a sigterm etc. and unload

22:12 <fche> and shut up about it :)

22:13 <agentzh> oh, so we could simply ignore the stp_main_loop thread?

22:13 <fche> worth a shot.

22:19 <agentzh> hmm, will try :)

22:20 <agentzh> thanks for the suggestion.

22:20 <fche> another possibility is to use fcntl (O_NONBLOCK) on those file descriptors once we are entering signal processing

22:20 <agentzh> tried. those fds won't allow it.

22:20 <agentzh> stderr is special it seems.

22:20 <fche> TRY HARDER

22:21 <fche> fcntl a hundred times in a loop

22:21 <fche> break into the kernel, crash it

22:22 <fche> ANYTHING IT TAKES :-)

22:22 <agentzh> err, okay...

22:23 <agentzh> will try...

22:23 <agentzh> fortunately i have a script to reproduce it easily locally.

23:00 brolley has left #systemtap [#systemtap]

23:08 <agentzh> fche: setting O_NONBLOCK seems to work for me!

23:08 <agentzh> i put it inside a 500 loop.

23:08 <agentzh> my script no longer hangs.
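
The O_NONBLOCK approach could look something like the sketch below. It is not the patch itself: `set_nonblocking` and `fill_until_full` are hypothetical helper names, and the second function exists only to show the effect on a full pipe.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to fd's file status flags.
 * Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;               /* already nonblocking */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Write zero-filled chunks until the fd would block.
 * Returns bytes written so far, or -1 on any other error. */
static long fill_until_full(int fd)
{
    char chunk[4096] = {0};
    long total = 0;

    for (;;) {
        ssize_t n = write(fd, chunk, sizeof chunk);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;       /* buffer full: returns instead of hanging */
        return -1;
    }
}
```

Once the flag is set, a write to a full pipe fails with `EAGAIN` rather than blocking, which gives the caller a chance to check its exit flag. One caveat: O_NONBLOCK is a file status flag on the open file description, so it is also seen by any other process sharing that same stderr or stdout.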

23:09 <agentzh> i'll prepare a formal patch for the mailing list submission.

23:18 <fche> go for it

23:19 <agentzh> i think for staprun, we could similary marking stderr/stdout as nonblocking in the signal handler?

23:20 <agentzh> sorry, i mean stapio

23:20 <agentzh> *similarly

23:21 <agentzh> i noted that the close_relayfs() call inside cleanup_and_exit() could also block on write(1, ...) in stapio.

23:22 <agentzh> (through pthread_join in the relay reader thread).

23:38 <fche> agentzh, not sure what the minimum set of fcntls would be

23:38 <fche> does the reproduction process consist of piping stap into a process or shell that may just sit there and block with a full pipe?

23:44 <agentzh> yes.

23:44 <agentzh> i cannot easily reproduce the case where the other side of the pipe dies and the stap side's write() is still blocking. but i did see such cases in the wild.

23:48 <agentzh> in my local testing, seems like the first fcntl already works. but trying harder might be more robust.

23:48 <agentzh> my original test was messed up, maybe.

23:48 * fche was kidding about looping

23:49 <fche> it's a cold day here in the Great White Up, so a little joviality was needed

23:52 <agentzh> heh, i'll remove the loop then :)

23:55 <agentzh> not sure if setting O_NONBLOCK will affect existing pending write() syscalls though.

23:55 <agentzh> fche: do you know the behavior?

23:59 <fche> I wouldn't expect it to affect syscalls in flight