fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
mjw has joined #systemtap
slowfranklin has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
sscox has quit [Ping timeout: 252 seconds]
orivej has joined #systemtap
wcohen has quit [Ping timeout: 245 seconds]
orivej has quit [Ping timeout: 245 seconds]
orivej has joined #systemtap
sscox has joined #systemtap
wcohen has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
orivej has joined #systemtap
drsmith has joined #systemtap
tromey has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
slowfranklin has quit [Client Quit]
slowfranklin has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
slowfranklin has quit [Client Quit]
<agentzh>
fche: it seems like stap does not handle SIGPIPE properly: it still tries a blocking write to a (broken) pipe, even in the signal handler.
<agentzh>
thus leading to an infinite hang, and it never responds to the TERM signal.
<fche>
agentzh,
orivej has quit [Ping timeout: 252 seconds]
<fche>
is this stapio or stap per se?
<agentzh>
i think it's stap per se.
<agentzh>
stap is controlled by a script. and the script closes the stderr stream (and stdout) after sending a SIGTERM to the stap process.
<agentzh>
seems like stap registers SA_RESTART on SIGPIPE, which makes it impossible to abort a write() syscall on a broken stderr pipe?
<agentzh>
and that write() is also blocking.
<agentzh>
which looks quite fragile.
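[editor's note] The SA_RESTART concern above can be illustrated with a minimal C sketch. It is not SystemTap code: fill_pipe(), on_alarm() and blocked_write_errno() are hypothetical names, and SIGALRM stands in for whichever signal arrives. A write() blocked on a full pipe that nobody drains returns -1 with EINTR when a handler installed without SA_RESTART runs. With SA_RESTART set, the kernel would normally restart the write() and it would stay blocked.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocked syscall. */
static void on_alarm(int sig) { (void)sig; }

/* Fill the pipe so that the next blocking write() has to wait. */
static void fill_pipe(int wfd)
{
    char chunk[4096];
    int flags = fcntl(wfd, F_GETFL);

    memset(chunk, 'x', sizeof chunk);
    fcntl(wfd, F_SETFL, flags | O_NONBLOCK);
    while (write(wfd, chunk, sizeof chunk) > 0)
        ;                         /* large chunks first */
    while (write(wfd, chunk, 1) > 0)
        ;                         /* then top up byte by byte */
    fcntl(wfd, F_SETFL, flags);   /* back to blocking mode */
}

/* Returns the errno of a blocked write() that a signal interrupted,
 * 0 if the write unexpectedly succeeded, or -1 on setup failure. */
static int blocked_write_errno(void)
{
    int fds[2];
    struct sigaction sa;
    struct itimerval timer;
    ssize_t rc;
    int err;

    if (pipe(fds) != 0)
        return -1;
    fill_pipe(fds[1]);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;              /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_usec = 50000;   /* fire once after 50 ms */
    setitimer(ITIMER_REAL, &timer, NULL);

    rc = write(fds[1], "x", 1);   /* blocks until the signal arrives */
    err = (rc < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Calling blocked_write_errno() is expected to yield EINTR. Changing sa_flags to SA_RESTART in this sketch should make the call hang, which is the behaviour described above.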
<agentzh>
the guilty line is in handle_interrupt() at main.c:280: int rc = write (2, msg, sizeof(msg)-1);
<fche>
hehe, blocked in an error message print!
<agentzh>
right
<agentzh>
maybe we should remove SA_RESTART for sigpipe and handle it differently in that signal handler?
<agentzh>
like skipping that write() syscall and exit right away?
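[editor's note] One way the "skip the write" idea could look, as a hedged C sketch rather than the actual fix: write_if_ready(), on_term(), install_term_handler() and pending_interrupts are hypothetical names, not SystemTap identifiers. The handler polls the descriptor with a zero timeout and only writes when the kernel reports it writable, so a full or broken stderr pipe cannot hold up the shutdown path. It assumes the message is short enough to fit once the descriptor is reported writable.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; a main loop would check it and shut down. */
static volatile sig_atomic_t pending_interrupts = 0;

/* Write msg only if fd is writable right now.
 * Returns bytes written, 0 if the write was skipped because it
 * would block, or -1 if the descriptor is broken or invalid. */
static ssize_t write_if_ready(int fd, const char *msg, size_t len)
{
    struct pollfd pfd;
    int saved_errno = errno;      /* do not clobber errno in a handler */
    ssize_t rc = 0;

    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    if (poll(&pfd, 1, 0) < 0)
        rc = -1;
    else if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
        rc = -1;                  /* reader is gone: skip the write */
    else if (pfd.revents & POLLOUT)
        rc = write(fd, msg, len);

    errno = saved_errno;
    return rc;
}

/* Handler: best-effort message, then record the request to exit. */
static void on_term(int sig)
{
    static const char msg[] = "interrupted, shutting down\n";

    (void)sig;
    (void)write_if_ready(STDERR_FILENO, msg, sizeof(msg) - 1);
    pending_interrupts = 1;
}

/* Install the handler without SA_RESTART so that blocked syscalls
 * elsewhere in the program return EINTR instead of restarting. */
static int install_term_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTERM, &sa, NULL);
}
```

A program would call install_term_handler() once at startup and test pending_interrupts in its main loop. There is still a small window between poll() and write() in which the reader can close the pipe, so SIGPIPE would need to be ignored or handled separately.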
<agentzh>
but stapio is also running.
<agentzh>
seems like the SIGTERM sent by stap down to stapio does not trigger its exit either.
<fche>
we definitely want to pass the signal down
<agentzh>
i know.
<agentzh>
stapio's stp_main_loop() thread is also blocking on write().
<agentzh>
according to the backtrace of the stapio process in that PR.
<agentzh>
fche: seems like stapio is also blocking on writing to stderr (fd 2). on line staprun/mainloop.c:810