ur5us_ has joined #jruby
lopex has quit [Quit: Connection closed for inactivity]
ur5us_ has quit [Ping timeout: 246 seconds]
ur5us_ has joined #jruby
sagax has joined #jruby
ur5us_ has quit [Ping timeout: 264 seconds]
ur5us_ has joined #jruby
peacand has quit [Remote host closed the connection]
ur5us_ has quit [Ping timeout: 264 seconds]
peacand has joined #jruby
ruurd has joined #jruby
ur5us_ has joined #jruby
ur5us_ has quit [Ping timeout: 264 seconds]
ChrisSeatonGitte has quit [Quit: Bridge terminating on SIGTERM]
ahorek[m] has quit [Quit: Bridge terminating on SIGTERM]
KarolBucekGitter has quit [Quit: Bridge terminating on SIGTERM]
enebo[m] has quit [Quit: Bridge terminating on SIGTERM]
lopex[m] has quit [Quit: Bridge terminating on SIGTERM]
UweKuboschGitter has quit [Quit: Bridge terminating on SIGTERM]
BlaneDabneyGitte has quit [Quit: Bridge terminating on SIGTERM]
boc_tothefuture[ has quit [Quit: Bridge terminating on SIGTERM]
rdubya[m] has quit [Quit: Bridge terminating on SIGTERM]
TimGitter[m]1 has quit [Quit: Bridge terminating on SIGTERM]
liamwhiteGitter[ has quit [Quit: Bridge terminating on SIGTERM]
daveg_lookout[m] has quit [Quit: Bridge terminating on SIGTERM]
TimGitter[m] has quit [Quit: Bridge terminating on SIGTERM]
MattPattersonGit has quit [Quit: Bridge terminating on SIGTERM]
nhh[m] has quit [Quit: Bridge terminating on SIGTERM]
XavierNoriaGitte has quit [Quit: Bridge terminating on SIGTERM]
CharlesOliverNut has quit [Quit: Bridge terminating on SIGTERM]
RomainManni-Buca has quit [Quit: Bridge terminating on SIGTERM]
dentarg[m] has quit [Quit: Bridge terminating on SIGTERM]
GGibson[m] has quit [Quit: Bridge terminating on SIGTERM]
headius[m] has quit [Quit: Bridge terminating on SIGTERM]
byteit101[m] has quit [Quit: Bridge terminating on SIGTERM]
OlleJonssonGitte has quit [Quit: Bridge terminating on SIGTERM]
kares[m] has quit [Quit: Bridge terminating on SIGTERM]
chrisseaton[m] has quit [Quit: Bridge terminating on SIGTERM]
ravicious[m] has quit [Quit: Bridge terminating on SIGTERM]
JesseChavezGitte has quit [Quit: Bridge terminating on SIGTERM]
slonopotamus[m] has quit [Quit: Bridge terminating on SIGTERM]
kai[m] has quit [Quit: Bridge terminating on SIGTERM]
FlorianDoubletGi has quit [Quit: Bridge terminating on SIGTERM]
hopewise[m] has quit [Quit: Bridge terminating on SIGTERM]
MarcinMielyskiGi has quit [Quit: Bridge terminating on SIGTERM]
JulesIvanicGitte has quit [Quit: Bridge terminating on SIGTERM]
lopex has joined #jruby
daveg_lookout[m] has joined #jruby
<daveg_lookout[m]> headius: we continue to see frequent instance deaths using 9.2.14 + Monitor monkey-patch. Instances are dying so fast we've had problems getting dumps (we have an aggressive policy of killing instances and replacing them). Just got a dump that looks very reminiscent of https://github.com/jruby/jruby/issues/6309. I'll attach dump to that issue, we can open a new Issue if you think it's different
kai[m]1 has joined #jruby
enebo[m] has joined #jruby
lopex[m] has joined #jruby
ChrisSeatonGitte has joined #jruby
boc_tothefuture[ has joined #jruby
ravicious[m] has joined #jruby
headius[m] has joined #jruby
KarolBucekGitter has joined #jruby
XavierNoriaGitte has joined #jruby
nhh[m] has joined #jruby
OlleJonssonGitte has joined #jruby
chrisseaton[m] has joined #jruby
RomainManni-Buca has joined #jruby
UweKuboschGitter has joined #jruby
CharlesOliverNut has joined #jruby
FlorianDoubletGi has joined #jruby
rdubya[m] has joined #jruby
slonopotamus[m] has joined #jruby
ahorek[m] has joined #jruby
byteit101[m] has joined #jruby
MattPattersonGit has joined #jruby
JesseChavezGitte has joined #jruby
dentarg[m] has joined #jruby
GGibson[m] has joined #jruby
hopewise[m] has joined #jruby
JulesIvanicGitte has joined #jruby
kares[m] has joined #jruby
TimGitter[m] has joined #jruby
liamwhiteGitter[ has joined #jruby
MarcinMielyskiGi has joined #jruby
BlaneDabneyGitte has joined #jruby
TimGitter[m]1 has joined #jruby
<headius[m]> daveg_lookout: ok I will have a look
<headius[m]> it does look like the same issue
<daveg_lookout[m]> I just added a comment to the issue, it's slightly different in that the enumerators are created with #each, instead of #to_enum in the original
<headius[m]> ok, not a big difference but good to know
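For context, a minimal illustrative sketch (not from the log) of the two ways an external enumerator can be created; on JRuby, #peek and #next on either kind pump the block through a fiber, which is the code path under suspicion here:

    arr = [1, 2, 3]
    enum_from_each    = arr.each           # calling #each with no block returns an Enumerator
    enum_from_to_enum = arr.to_enum(:each)
    enum_from_each.peek                    # => 1, driven by a fiber under the hood on JRuby
    enum_from_to_enum.next                 # => 1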
<daveg_lookout[m]> I have a heap dump (or at least most of one, not sure it wasn't interrupted before completing) but it's 650MB. i can run some analysis over it if that would be useful
ruurd has quit [Quit: bye folks]
<headius[m]> I think the interesting bit in a heap dump would be to examine the state of those enumerators and figure out why they are blocking
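One possible way to do that, assuming the dump loads in VisualVM: an OQL query against the JRuby enumerator class (class name assumed here) to list live instances, then inspect each one's fiber and thread fields:

    select e from org.jruby.RubyEnumerator e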
<headius[m]> I am looking into other fiber impl code and tests to see if there's anything we aren't doing that might point toward a solution
<headius[m]> interesting, I see that one of the peeking threads is in an exception handler block
<headius[m]> could be an exception raised from the underlying AR thingy?
<daveg_lookout[m]> definitely possible. there are 2 threads that are in the process of raising interrupts
<headius[m]> yeah I see the same
<headius[m]> there are not a lot of test cases for exceptions across this fiber edge
<daveg_lookout[m]> running now, will take 20-30 minutes, i expect
<headius[m]> it would be helpful to know which fiber those peekers are waiting for so we can determine what state they are in and why they are not returning results
<daveg_lookout[m]> VisualVM is still trying to load the heap dump, I'm not expecting too much on that front. Second stack dump just completed, looking now
<headius[m]> ok
<daveg_lookout[m]> Threads 2103, 2104, 9863, 9864 are still in same place. Thread 8043 is slightly different -- now doing java.lang.Throwable.fillInStackTrace within raise exception. Thread 1400 is now in Thread.interrupt from ThreadFiber.handleExceptionDuringExchange. I'll add the new trace to the issue.
<headius[m]> ok
<daveg_lookout[m]> added
<headius[m]> Is that the right file? You say 8043 is now doing fillInStackTrace but I see it at interrupt0 still
<headius[m]> filename is same as previous upload
<headius[m]> daveg_lookout: the new upload seems to have both of the peek threads still at interrupt0
<headius[m]> when this hangs do you see runaway CPU use or is it silent?
<daveg_lookout[m]> it remained normal until we removed it from the load balancer, then dropped
<daveg_lookout[m]> so it was still managing to do a lot of normal work
<headius[m]> I am trying to determine whether there might be a race when interrupting a thread waiting on a fiber
<headius[m]> The simple behavior seems to match but I will come up with a torture test
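A rough sketch of what such a torture test might look like (assumption: plain Enumerator#peek/#next plus Thread#raise is enough to hit the race; this is not a confirmed reproduction):

    # Repeatedly interrupt threads that are blocked on an enumerator's fiber.
    100.times do
      threads = 10.times.map do
        Thread.new do
          enum = Enumerator.new { |y| loop { y << 1 } }
          loop { enum.peek; enum.next }
        end
      end
      sleep 0.1
      threads.each { |t| t.raise(RuntimeError, 'interrupt') }
      threads.each { |t| t.join rescue nil }
    end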
<daveg_lookout[m]> sounds good. let me know if i can help
<headius[m]> Did you see my question above about thread 8043?
<headius[m]> I am not seeing what you are seeing
<headius[m]> This is the first reference to hanging in that interrupt method that I have found
<headius[m]> In this case it is waiting on a kernel level heap lock
<headius[m]> I would assume you're on a fairly recent Linux kernel
<headius[m]> If we can get a native thread trace that might tell us a bit more
<daveg_lookout[m]> this isn't super new -- last day on Ubuntu 16 before upgrading to Ubuntu 18. Kernel 4.4.0-1119-aws
<headius[m]> Hmm well always a chance the newer kernel will help something
<headius[m]> Some instructions there on getting a thread dump from a running process using GDB or pstack
<headius[m]> It would definitely be helpful to know what's happening below interrupt0
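A hedged example of grabbing native stacks with standard gdb options (adjust the pid; this is one way to see what sits below interrupt0):

    gdb -p <pid> -batch -ex 'thread apply all bt'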
<daveg_lookout[m]> pstack isn't giving anything useful. no symbols found, only 2 frames on the root pid and 5 frames on pid 8043
<daveg_lookout[m]> trying gdb now
subbu is now known as subbu|lunch
<headius[m]> That is the peek thread?
<daveg_lookout[m]> no -- i had a typo, hold on
<daveg_lookout[m]> updated gist, more useful now
<daveg_lookout[m]> for reference, http://support.sas.com/kb/58/075.html is much more useful than that IBM page
<daveg_lookout[m]> i `continue`d a few times then dumped the threads again and attached that to the gist as well
<headius[m]> Ok cool
subbu|lunch is now known as subbu
NightMonkey has quit [Read error: Connection reset by peer]
<headius[m]> So it seems like it may not be hung but is stuck cycling
NightMonkey has joined #jruby
<daveg_lookout[m]> agreed. and the threads that are trying to peek definitely seem to be stuck
<headius[m]> Ok I will poke around more. This helps
ur5us_ has joined #jruby
<daveg_lookout[m]> Let me know if you want me to do any last things on this instance. it's going to get killed soon by deploy of next release
kroth[m] has joined #jruby
<headius[m]> I think it is as simple as this loop in ThreadFiber never getting exited... it keeps trying to rethrow some exception in the target fiber and never making progress
drbobbeaty has quit [Ping timeout: 240 seconds]
<headius[m]> unfortunately no other Ruby impl has ever tried to tackle propagating exceptions like this so we will have to suss out what the right sequence is here
<daveg_lookout[m]> heh. finding all kinds of interesting things. i may have a 3rd issue, still trying to make sure it's not our bug. Redis commands (from client perspective) get slower the longer an instance is alive. redis server time is stable. but too early to report
<headius[m]> well one thing about this continually raising exceptions: generating stack traces can be rather expensive
drbobbeaty has joined #jruby
<headius[m]> so if it gets into this state and has one or more threads just spinning on exceptions that could slow other threads down
<headius[m]> it would also be burning GC cycles pretty fast
<headius[m]> you might be able to attach visualvm and monitor the GC to see if it's running a lot
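For example, standard JDK tooling (not specific to this bug) can show GC activity once per second without a GUI:

    jstat -gcutil <pid> 1000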
<headius[m]> this logic in exchangeWithFiber seems right but apparently there is some state where it might get stuck in this loop
<daveg_lookout[m]> yeah, we definitely see old generation gc counts and times go up when this starts going bad
<daveg_lookout[m]> I've gotta run, will be back on later tonight. thanks again for all the help!
<headius[m]> yeah I will keep trying to find an edge case that triggers this
kroth[m] is now known as kroth_lookout[m]
<kroth_lookout[m]> fwiw: it’s not just old gen gc counts, although those are easiest to pick out. we also see young gen gc counts and heap usage climb pretty drastically in some cases
<headius[m]> kroth_lookout: ah you work with daveg_lookout ?
ur5us_ has quit [Ping timeout: 264 seconds]
<kroth_lookout[m]> yup
<headius[m]> there's a great deal of allocation and JVM safepoint overhead involved in generating a stack trace so that would seem to fit
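A small illustrative benchmark (not from the log) of that stack-trace cost; raising from a deeper stack makes fillInStackTrace do proportionally more work:

    require 'benchmark'

    def deep(n, &blk)
      n.zero? ? blk.call : deep(n - 1, &blk)
    end

    Benchmark.bm(10) do |bm|
      bm.report('shallow') { 100_000.times { raise 'x' rescue nil } }
      bm.report('deep')    { deep(200) { 100_000.times { raise 'x' rescue nil } } }
    end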
<headius[m]> hmm well this contrived case never exits but doesn't get stuck the same way
<headius[m]> ruby -e "t = Thread.new { f = Fiber.new { loop { Fiber.yield } }; loop { f.resume } }; sleep 1; t.raise('foo'); t.join"
<headius[m]> the raise seems to get lost and never terminates the fiber and thread
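That one-liner, expanded for readability (same behavior, comments added):

    t = Thread.new do
      f = Fiber.new { loop { Fiber.yield } }  # fiber that yields forever
      loop { f.resume }                       # thread resumes it forever
    end
    sleep 1
    t.raise('foo')   # interrupt the thread while it is parked in the fiber exchange
    t.join           # the raise appears to get lost, so this never returns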
<headius[m]> hmmm seems I am on to something
ur5us_ has joined #jruby
ur5us has joined #jruby
ur5us_ has quit [Ping timeout: 260 seconds]