fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
agentzh has quit [Remote host closed the connection]
sscox has quit [Ping timeout: 246 seconds]
orivej has quit [Ping timeout: 268 seconds]
slowfranklin has joined #systemtap
orivej has joined #systemtap
mjw has joined #systemtap
sscox has joined #systemtap
wcohen has joined #systemtap
brolley has joined #systemtap
tromey has joined #systemtap
orivej has quit [Ping timeout: 252 seconds]
slowfranklin has quit [Quit: slowfranklin]
orivej has joined #systemtap
slowfranklin has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
slowfranklin_ has joined #systemtap
slowfranklin has quit [Read error: No route to host]
slowfranklin_ is now known as slowfranklin
mjw has quit [Quit: Leaving]
slowfranklin has quit [Quit: slowfranklin]
wcohen has quit [Ping timeout: 246 seconds]
tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]
sscox has quit [Ping timeout: 245 seconds]
agentzh has joined #systemtap
<agentzh> fche: i want to hear your opinion on how to break the hanging in stapio's _stp_main_loop() while writing warning messages to stderr.
<fche> hi agentzh
<agentzh> the signal handler only sets a C global var for the _stp_main_loop() loop to check proactively. but once the loop itself is blocking on write(2, ...),
<agentzh> it won't have a chance to check the global var flag.
<agentzh> hey, fche
<agentzh> i'm still scratching my head on how to fix this blocking bug.
<fche> perchance we just shouldn't deal with sigpipe specially at all - let it kill our process
<agentzh> yeah, maybe.
<agentzh> but who will unload the stap kernel module then?
<agentzh> stapio is special in that it needs to do cleanup.
<agentzh> unlike stap.
<fche> can deal with one process at a time
<fche> for stap, sigpipe probably not a big deal
<agentzh> another case is stderr's write buffer is full, and in this case the stap/stapio process won't respond to SIGTERM at all.
<agentzh> since both of them might be blocking on write(2, ...)
<agentzh> agreed, for stap it's fine.
<fche> for stapio, it could treat a sigpipe as a sigterm etc. and unload
<fche> and shut up about it :)
<agentzh> oh, so we could simply ignore the stp_main_loop thread?
<fche> worth a shot.
<agentzh> hmm, will try :)
<agentzh> thanks for the suggestion.
<fche> another possibility is to use fcntl (O_NONBLOCK) on those file descriptors once we are entering signal processing
<agentzh> tried. those fds won't allow it.
<agentzh> stderr is special it seems.
<fche> TRY HARDER
<fche> fcntl a hundred times in a loop
<fche> break into the kernel, crash it
<fche> ANYTHING IT TAKES :-)
<agentzh> err, okay...
<agentzh> will try...
<agentzh> fortunately i have a script to reproduce it easily locally.
brolley has left #systemtap [#systemtap]
<agentzh> fche: setting O_NONBLOCK seems to work for me!
<agentzh> i put it inside a loop of 500 iterations.
<agentzh> my script no longer hangs.
<agentzh> i'll prepare a formal patch for the mailing list submission.
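A rough sketch of what setting O_NONBLOCK on the output descriptors could look like. This is not the patch agentzh mentions; the helper names `set_fd_nonblocking` and `mark_output_nonblocking` are made up for illustration. Only `fcntl()` with `F_GETFL`/`F_SETFL` is used, and POSIX lists `fcntl()` among the async-signal-safe functions, so calling it from a signal handler is permitted.

```c
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to the file status flags of fd.
 * Returns 0 on success (or if already set), -1 on failure. */
int set_fd_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: mark stdout and stderr nonblocking, for example
 * when signal processing begins. Returns 0 only if both calls succeed. */
int mark_output_nonblocking(void)
{
    int rc_out = set_fd_nonblocking(STDOUT_FILENO);
    int rc_err = set_fd_nonblocking(STDERR_FILENO);

    return (rc_out == 0 && rc_err == 0) ? 0 : -1;
}
```

After this, a write() to a full pipe fails with EAGAIN instead of blocking. One caveat to keep in mind: O_NONBLOCK is a file status flag on the open file description, so it is shared with any other process that inherited the same descriptor.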
<fche> go for it
<agentzh> i think for stapio, we could similarly mark stderr/stdout as nonblocking in the signal handler?
<agentzh> i noted that the close_relayfs() call inside cleanup_and_exit() could also block on write(1, ...) in stapio.
<agentzh> (through pthread_join in the relay reader thread).
<fche> agentzh, not sure what the minimum set of fcntl calls would be
<fche> does the reproduction process consist of piping stap into a process or shell that may just sit there and block with a full pipe?
<agentzh> yes.
<agentzh> i cannot easily reproduce the case where the other side of the pipe dies and the stap side's write() is still blocking, but i did see such cases in the wild.
<agentzh> in my local testing, seems like the first fcntl already works. but trying harder might be more robust.
<agentzh> my original test was messed up, maybe.
* fche was kidding about looping
<fche> it's a cold day here in the Great White Up, so a little joviality was needed
<agentzh> heh, i'll remove the loop then :)
<agentzh> not sure if setting O_NONBLOCK will affect existing pending write() syscalls though.
<agentzh> fche: do you know the behavior?
<fche> I wouldn't expect it to affect syscalls in flight
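Since a write() already in flight may not notice the new flag, one option is to make every later write give up once an exit has been requested. The following is only a sketch under that assumption; `exit_requested` and `write_unless_exiting` are hypothetical names, and nothing here is taken from the stap or stapio sources.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical flag, assumed to be set by a signal handler. */
volatile sig_atomic_t exit_requested = 0;

/* Write len bytes to fd, retrying after short writes.
 * If the write is interrupted (EINTR) or would block (EAGAIN) and an
 * exit has been requested, stop and return the bytes written so far.
 * Returns -1 on any other error. */
ssize_t write_unless_exiting(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);

        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n == 0)
            break; /* nothing accepted; avoid spinning */
        if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) {
            if (exit_requested)
                break;  /* drop the remaining output and let cleanup run */
            continue;   /* simple retry; real code would likely poll() */
        }
        return -1;
    }
    return (ssize_t)done;
}
```

With the descriptor set to O_NONBLOCK and the flag set from the signal handler, a caller of this wrapper returns to its loop instead of staying stuck on a full pipe, at the cost of losing whatever output had not been written yet.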