fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
mjw has joined #systemtap
slowfranklin has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
sscox has quit [Ping timeout: 252 seconds]
orivej has joined #systemtap
wcohen has quit [Ping timeout: 245 seconds]
orivej has quit [Ping timeout: 245 seconds]
orivej has joined #systemtap
sscox has joined #systemtap
wcohen has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
orivej has joined #systemtap
drsmith has joined #systemtap
tromey has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
slowfranklin has quit [Client Quit]
slowfranklin has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
slowfranklin has quit [Client Quit]
<agentzh> fche: it seems like stap does not handle SIGPIPE properly: it still attempts a blocking write to a (broken) pipe, even inside the signal handler.
<agentzh> this leads to an infinite hang, and stap never responds to the TERM signal.
orivej has quit [Ping timeout: 252 seconds]
<fche> agentzh, is this stapio or stap per se?
<agentzh> i think it's stap per se.
<agentzh> stap is controlled by a script. and the script closes the stderr stream (and stdout) after sending a SIGTERM to the stap process.
<agentzh> seems like stap registers SA_RESTART on SIGPIPE, which makes it impossible to abort a write() syscall on a broken stderr pipe?
<agentzh> and that write() is also blocking.
<agentzh> which looks quite fragile.
<agentzh> the guilty line is in handle_interrupt() at main.c:280: int rc = write (2, msg, sizeof(msg)-1);
<fche> hehe, blocked in an error message print!
<agentzh> right
<agentzh> maybe we should remove SA_RESTART for sigpipe and handle it differently in that signal handler?
<agentzh> like skipping that write() syscall and exit right away?
<agentzh> but stapio is also running.
<agentzh> seems like the SIGTERM sent by stap down to stapio does not trigger its exit either.
<fche> we definitely want to pass the signal down
<agentzh> i know.
<agentzh> stapio's stp_main_loop() thread is also blocking on write().
<agentzh> according to the backtrace of the stapio process in that PR.
<agentzh> fche: seems like stapio is also blocking on writing to stderr (fd 2). on line staprun/mainloop.c:810
<agentzh> warn("%.*s", strlen(dupstr)-9, dupstr+9);
<agentzh> i think stapio shares the same stderr stream as stap, right?
<agentzh> stderr is also gone for stapio, i think.
<fche> yes
<agentzh> and stapio explicitly registers a SIG_IGN handler for SIGPIPE, which does not look right to me.
<agentzh> so stapio is also blocking forever.
<agentzh> before it has a chance to handle SIGTERM
<fche> hm, I'll have to think about that ... I've seen multithreaded programs goof that up - one thread block-writes to a fd, which another one closes
<fche> that one could justifiably hang
<fche> but doesn't explain the stap-per-se case, interesting
<fche> if an fd op causes a sigpipe, I'd expect further fd ops to error-out instead of block
<agentzh> fche: seems like the kernel only relies on the SIGPIPE signal to notify user programs.
<agentzh> the syscall just hangs there.
<agentzh> forever.
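[editor's sketch] fche's expectation that fd operations error out after a SIGPIPE can be tested with a small standalone program. This is a hedged sketch, not systemtap code: broken_pipe_write_errno is a hypothetical helper, and it assumes a POSIX system. With SIGPIPE set to SIG_IGN (as stapio is said to do above), a write() to a pipe whose read end is closed should fail immediately with EPIPE rather than block.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Write to a pipe whose read end has been closed, with SIGPIPE ignored.
 * Returns the errno of the write, 0 if the write unexpectedly
 * succeeded, or -1 if the pipe could not be created.
 */
int broken_pipe_write_errno(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    /* Ignore SIGPIPE so the failure is reported through errno only. */
    void (*old)(int) = signal(SIGPIPE, SIG_IGN);

    close(fds[0]);                       /* no reader is left */
    ssize_t n = write(fds[1], "x", 1);   /* should fail at once */
    int err = (n < 0) ? errno : 0;

    signal(SIGPIPE, old);                /* restore previous disposition */
    close(fds[1]);
    return err;
}
```

If this holds on the affected system, a write() that blocks indefinitely would suggest that some process still holds the read end open while the pipe is full, rather than that the pipe is closed. That is an assumption about the reported scenario, not something the log confirms.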
tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]
orivej has joined #systemtap
wcohen has quit [Ping timeout: 245 seconds]
sscox has quit [Ping timeout: 245 seconds]
wcohen has joined #systemtap