#systemtap on 2018-11-19 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

08:46 mjw has joined #systemtap

09:27 slowfranklin has joined #systemtap

10:45 orivej has quit [Ping timeout: 268 seconds]

10:51 sscox has quit [Ping timeout: 252 seconds]

11:45 orivej has joined #systemtap

13:26 wcohen has quit [Ping timeout: 245 seconds]

13:45 orivej has quit [Ping timeout: 245 seconds]

13:53 orivej has joined #systemtap

14:05 sscox has joined #systemtap

14:20 wcohen has joined #systemtap

14:36 orivej has quit [Ping timeout: 240 seconds]

15:01 orivej has joined #systemtap

15:04 drsmith has joined #systemtap

15:17 tromey has joined #systemtap

15:30 orivej has quit [Ping timeout: 260 seconds]

15:43 orivej has joined #systemtap

18:18 slowfranklin has quit [Quit: slowfranklin]

18:37 slowfranklin has joined #systemtap

18:38 slowfranklin has quit [Client Quit]

18:39 slowfranklin has joined #systemtap

18:47 slowfranklin has quit [Quit: slowfranklin]

18:48 slowfranklin has joined #systemtap

18:49 slowfranklin has quit [Client Quit]

19:01 <agentzh> fche: it seems like stap does not handle SIGPIPE properly: it still does a blocking write to a (broken) pipe, even in the signal handler.

19:02 <agentzh> thus leading to an infinite hang, and it never responds to the TERM signal.

19:22 <fche> agentzh,

19:24 orivej has quit [Ping timeout: 252 seconds]

19:37 <fche> is this stapio or stap per se?

19:38 <agentzh> i think it's stap per se.

19:39 <agentzh> stap is controlled by a script. and the script closes the stderr stream (and stdout) after sending a SIGTERM to the stap process.

19:41 <agentzh> seems like stap registers SA_RESTART on SIGPIPE, which makes it impossible to abort a write() syscall on a broken stderr pipe?

19:41 <agentzh> and that write() is also blocking.

19:41 <agentzh> which looks quite fragile.

19:50 <agentzh> the guilty line is in handle_interrupt() at main.c:280: int rc = write (2, msg, sizeof(msg)-1);
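The hang hinges on a blocking write() to fd 2 inside a signal handler. One way to keep such a diagnostic from ever blocking, assuming the handler may simply drop the message when stderr cannot accept data, is to set O_NONBLOCK with fcntl() (async-signal-safe per POSIX), try one write(), and restore the flags. This is a minimal sketch, not stap's code; `best_effort_write` and `full_pipe_demo` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not stap's code: attempt a single write() that can
 * never block. fcntl() and write() are async-signal-safe per POSIX. */
static ssize_t best_effort_write(int fd, const void *buf, size_t len)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;

    /* O_NONBLOCK belongs to the open file description, so while it is set
     * every process sharing this fd (e.g. one with the same stderr) sees it. */
    int was_blocking = !(flags & O_NONBLOCK);
    if (was_blocking && fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    ssize_t rc = write(fd, buf, len);   /* -1 with EAGAIN if the pipe is full */
    int saved_errno = errno;

    if (was_blocking)
        (void) fcntl(fd, F_SETFL, flags);  /* restore blocking mode */
    errno = saved_errno;
    return rc;
}

/* Usage check: fill a pipe whose reader never reads, then show that
 * best_effort_write() returns at once instead of hanging.
 * Returns the errno of the failed write (expected: EAGAIN). */
static int full_pipe_demo(void)
{
    int p[2];
    char c = 'x';
    if (pipe(p) != 0)
        return -1;

    /* Fill the pipe with non-blocking 1-byte writes until it is full. */
    fcntl(p[1], F_SETFL, O_NONBLOCK);
    while (write(p[1], &c, 1) == 1)
        ;
    assert(errno == EAGAIN);
    fcntl(p[1], F_SETFL, 0);               /* back to blocking mode */

    ssize_t rc = best_effort_write(p[1], &c, 1);
    int err = (rc == -1) ? errno : 0;
    assert((fcntl(p[1], F_GETFL) & O_NONBLOCK) == 0);  /* flags restored */

    close(p[0]);
    close(p[1]);
    return err;
}
```

This only avoids blocking on a full pipe; a write to a pipe with no reader left still raises SIGPIPE, so the SIGPIPE disposition has to be handled separately.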

19:53 <fche> hehe, blocked in an error message print!

19:59 <agentzh> right

20:00 <agentzh> maybe we should remove SA_RESTART for sigpipe and handle it differently in that signal handler?

20:00 <agentzh> like skipping that write() syscall and exiting right away?

20:01 <agentzh> but stapio is also running.

20:01 <agentzh> seems like the SIGTERM sent by stap down to stapio does not trigger its exit either.

20:02 <fche> we definitely want to pass the signal down

20:02 <agentzh> i know.

20:02 <agentzh> stapio's stp_main_loop() thread is also blocking on write().

20:02 <agentzh> according to the backtrace of the stapio process in that PR.

20:18 <agentzh> fche: seems like stapio is also blocking on a write to stderr (fd 2), at staprun/mainloop.c:810:

20:19 <agentzh> warn("%.*s", strlen(dupstr)-9, dupstr+9);

20:19 <agentzh> i think stapio shares the same stderr stream as stap, right?

20:19 <agentzh> stderr is also gone for stapio, i think.

20:20 <fche> yes

20:20 <agentzh> and stapio explicitly registers a SIG_IGN handler for SIGPIPE, which does not look right to me.

20:20 <agentzh> so stapio is also blocking forever.

20:20 <agentzh> before it has a chance to handle SIGTERM

20:21 <fche> hm, I'll have to think about that ... I've seen multithreaded programs goof that up - one thread block-writes to a fd, which another one closes

20:21 <fche> that one could justifiably hang

20:21 <fche> but doesn't explain the stap-per-se case, interesting

20:22 <fche> if an fd op causes a sigpipe, I'd expect further fd ops to error-out instead of block

20:26 <agentzh> fche: seems like the kernel only relies on the SIGPIPE signal to notify the user programs.

20:26 <agentzh> the syscall just hangs there.

20:27 <agentzh> forever.
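Whether such a write() should fail or block depends on whether any reader remains. Per POSIX, writing to a pipe that no process has open for reading raises SIGPIPE, and when SIGPIPE is ignored or caught the call fails with EPIPE instead of blocking; a write only waits indefinitely while some process still holds the read end open without draining it. A minimal sketch of the no-reader case, assuming POSIX/Linux pipe semantics and the SIG_IGN disposition the log attributes to stapio; `broken_pipe_write_errno` is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical demo: write to a pipe whose only read end has been closed,
 * with SIGPIPE set to SIG_IGN. Returns the errno of the failed write
 * (expected: EPIPE). */
static int broken_pipe_write_errno(void)
{
    int p[2];
    char c = 'x';

    if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
        return -1;
    if (pipe(p) != 0)
        return -1;

    close(p[0]);                          /* no reader is left */
    ssize_t rc = write(p[1], &c, 1);      /* fails at once, never blocks */
    int err = (rc == -1) ? errno : 0;

    close(p[1]);
    return err;
}
```

If a write like the one at main.c:280 hangs instead, that suggests some process still holds the read end of stderr open, so the pipe is full rather than broken.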

21:59 tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]

22:31 orivej has joined #systemtap

22:33 wcohen has quit [Ping timeout: 245 seconds]

22:35 sscox has quit [Ping timeout: 245 seconds]

23:33 wcohen has joined #systemtap