fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
agentzh has quit [Remote host closed the connection]
sscox has quit [Ping timeout: 246 seconds]
orivej has quit [Ping timeout: 268 seconds]
slowfranklin has joined #systemtap
orivej has joined #systemtap
mjw has joined #systemtap
sscox has joined #systemtap
wcohen has joined #systemtap
brolley has joined #systemtap
tromey has joined #systemtap
orivej has quit [Ping timeout: 252 seconds]
slowfranklin has quit [Quit: slowfranklin]
orivej has joined #systemtap
slowfranklin has joined #systemtap
slowfranklin has quit [Quit: slowfranklin]
slowfranklin has joined #systemtap
slowfranklin_ has joined #systemtap
slowfranklin has quit [Read error: No route to host]
slowfranklin_ is now known as slowfranklin
mjw has quit [Quit: Leaving]
slowfranklin has quit [Quit: slowfranklin]
wcohen has quit [Ping timeout: 246 seconds]
tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]
sscox has quit [Ping timeout: 245 seconds]
agentzh has joined #systemtap
<agentzh>
fche: i want to hear your opinion on how to break the hanging in stapio's _stp_main_loop() while writing warning messages to stderr.
<fche>
hi agentzh
<agentzh>
the signal handler only sets a C global var for the _stp_main_loop() loop to check proactively. but once the loop itself is blocking on write(2, ...).
<agentzh>
it won't have a chance to check the global var flag.
<agentzh>
hey, fche
<agentzh>
i'm still scratching my head on how to fix this blocking bug.
<fche>
perchance we just shouldn't deal with sigpipe specially at all - let it kill our process
<agentzh>
yeah, maybe.
<agentzh>
but who will unload the stap kernel module then?
<agentzh>
stapio is special in that it needs to do cleanup.
<agentzh>
unlike stap.
<fche>
can deal with one process at a time
<fche>
for stap, sigpipe probably not a big deal
<agentzh>
another case is stderr's write buffer is full, and in this case the stap/stapio process won't respond to SIGTERM at all.
<agentzh>
since both of them might be blocking on write(2, ...)
<agentzh>
agreed, for stap it's fine.
<fche>
for stapio, it could treat a sigpipe as a sigterm etc. and unload
<fche>
and shut up about it :)
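
[Editor's sketch] fche's suggestion of treating SIGPIPE like SIGTERM in stapio could look roughly like the C below. This is a minimal illustration, not stapio's actual code: the names `pending_interrupt`, `on_terminate`, and `install_exit_handlers` are hypothetical, and the set of signals listed is an assumption. The handler only sets a flag, matching the design agentzh describes, and the main loop would then run its normal unload path when it sees the flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag; the main loop would poll it between iterations. */
static volatile sig_atomic_t pending_interrupt = 0;

/* Async-signal-safe handler: it only records that a signal arrived. */
static void on_terminate(int signum)
{
    (void)signum;
    pending_interrupt = 1;
}

/*
 * Route SIGPIPE through the same handler as the usual termination
 * signals, so a broken pipe leads to the same cleanup as SIGTERM.
 * Returns 0 on success, -1 if any sigaction() call fails.
 */
static int install_exit_handlers(void)
{
    /* Assumed signal set, chosen for illustration only. */
    static const int signals[] = { SIGTERM, SIGINT, SIGHUP, SIGPIPE };
    struct sigaction sa;
    size_t i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    for (i = 0; i < sizeof(signals) / sizeof(signals[0]); i++) {
        if (sigaction(signals[i], &sa, NULL) != 0)
            return -1;
    }
    return 0;
}
```

With a handler installed, SIGPIPE no longer kills the process by default; a write() to the closed pipe fails with EPIPE instead, so the caller still has to stop writing and proceed to cleanup.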
<agentzh>
oh, so we could simply ignore the stp_main_loop thread?
<fche>
worth a shot.
<agentzh>
hmm, will try :)
<agentzh>
thanks for the suggestion.
<fche>
another possibility is to use fcntl (O_NONBLOCK) on those file descriptors once we are entering signal processing
<agentzh>
tried. those fds won't allow it.
<agentzh>
stderr is special it seems.
<fche>
TRY HARDER
<fche>
fcntl a hundred times in a loop
<fche>
break into the kernel, crash it
<fche>
ANYTHING IT TAKES :-)
<agentzh>
err, okay...
<agentzh>
will try...
<agentzh>
fortunately i have a script to reproduce it easily locally.
brolley has left #systemtap [#systemtap]
<agentzh>
fche: setting O_NONBLOCK seems to work for me!
<agentzh>
i put it inside a 500 loop.
<agentzh>
my script no longer hangs.
<agentzh>
i'll prepare a formal patch for the mailing list submission.
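
[Editor's sketch] The O_NONBLOCK approach discussed above can be illustrated as follows. This is a sketch under assumptions, not the patch agentzh submitted: `set_nonblocking` and `fill_pipe_until_error` are hypothetical helper names. The first function adds O_NONBLOCK to a descriptor's status flags with fcntl(); the second demonstrates the effect on a pipe that nobody reads, where write() fails with EAGAIN once the pipe is full instead of blocking.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to fd's status flags. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;               /* already non-blocking */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Demo: fill a pipe that has no reader draining it and return the errno
 * of the first write() that fails. With O_NONBLOCK set on the write end,
 * the call returns an error once the pipe is full rather than hanging.
 */
static int fill_pipe_until_error(void)
{
    int fds[2];
    char chunk[4096] = {0};
    int err = 0;

    if (pipe(fds) != 0)
        return errno;

    if (set_nonblocking(fds[1]) != 0) {
        err = errno;
    } else {
        for (;;) {
            ssize_t n = write(fds[1], chunk, sizeof(chunk));
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted by a signal; retry */
                err = errno;
                break;
            }
        }
    }

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

In the scenario from the log the call would be something like `set_nonblocking(STDERR_FILENO)` once shutdown begins. One caveat worth keeping in mind: O_NONBLOCK is stored on the open file description, so it also affects any other process that shares the same description (for example one that inherited the same stderr).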
<fche>
go for it
<agentzh>
i think for stapio, we could similarly mark stderr/stdout as nonblocking in the signal handler?
<agentzh>
i noted that the close_relayfs() call inside cleanup_and_exit() could also block on write(1, ...) in stapio.
<agentzh>
(through pthread_join in the relay reader thread).
<fche>
agentzh, not sure what the minimum set of fcntls would be
<fche>
does the reproduction process consist of piping stap into a process or shell that may just sit there and block with a full pipe?
<agentzh>
yes.
<agentzh>
i cannot easily reproduce the case where the other side of the pipe dies and the stap side's write() is still blocking, but i did see such cases in the wild.
<agentzh>
in my local testing, seems like the first fcntl already works. but trying harder might be more robust.
<agentzh>
my original test was messed up, maybe.
* fche
was kidding about looping
<fche>
it's a cold day here in the Great White Up, so a little joviality was needed
<agentzh>
heh, i'll remove the loop then :)
<agentzh>
not sure if setting O_NONBLOCK will affect existing pending write() syscalls though.
<agentzh>
fche: do you know the behavior?
<fche>
I wouldn't expect it to affect syscalls in flight