ur5us_ has joined #jruby
lopex has quit [Quit: Connection closed for inactivity]
ur5us_ has quit [Ping timeout: 246 seconds]
ur5us_ has joined #jruby
sagax has joined #jruby
ur5us_ has quit [Ping timeout: 264 seconds]
ur5us_ has joined #jruby
peacand has quit [Remote host closed the connection]
ur5us_ has quit [Ping timeout: 264 seconds]
peacand has joined #jruby
ruurd has joined #jruby
ur5us_ has joined #jruby
ur5us_ has quit [Ping timeout: 264 seconds]
ChrisSeatonGitte has quit [Quit: Bridge terminating on SIGTERM]
ahorek[m] has quit [Quit: Bridge terminating on SIGTERM]
KarolBucekGitter has quit [Quit: Bridge terminating on SIGTERM]
enebo[m] has quit [Quit: Bridge terminating on SIGTERM]
lopex[m] has quit [Quit: Bridge terminating on SIGTERM]
UweKuboschGitter has quit [Quit: Bridge terminating on SIGTERM]
BlaneDabneyGitte has quit [Quit: Bridge terminating on SIGTERM]
boc_tothefuture[ has quit [Quit: Bridge terminating on SIGTERM]
rdubya[m] has quit [Quit: Bridge terminating on SIGTERM]
TimGitter[m]1 has quit [Quit: Bridge terminating on SIGTERM]
liamwhiteGitter[ has quit [Quit: Bridge terminating on SIGTERM]
daveg_lookout[m] has quit [Quit: Bridge terminating on SIGTERM]
TimGitter[m] has quit [Quit: Bridge terminating on SIGTERM]
MattPattersonGit has quit [Quit: Bridge terminating on SIGTERM]
nhh[m] has quit [Quit: Bridge terminating on SIGTERM]
XavierNoriaGitte has quit [Quit: Bridge terminating on SIGTERM]
CharlesOliverNut has quit [Quit: Bridge terminating on SIGTERM]
RomainManni-Buca has quit [Quit: Bridge terminating on SIGTERM]
dentarg[m] has quit [Quit: Bridge terminating on SIGTERM]
GGibson[m] has quit [Quit: Bridge terminating on SIGTERM]
headius[m] has quit [Quit: Bridge terminating on SIGTERM]
byteit101[m] has quit [Quit: Bridge terminating on SIGTERM]
OlleJonssonGitte has quit [Quit: Bridge terminating on SIGTERM]
kares[m] has quit [Quit: Bridge terminating on SIGTERM]
chrisseaton[m] has quit [Quit: Bridge terminating on SIGTERM]
ravicious[m] has quit [Quit: Bridge terminating on SIGTERM]
JesseChavezGitte has quit [Quit: Bridge terminating on SIGTERM]
slonopotamus[m] has quit [Quit: Bridge terminating on SIGTERM]
kai[m] has quit [Quit: Bridge terminating on SIGTERM]
FlorianDoubletGi has quit [Quit: Bridge terminating on SIGTERM]
hopewise[m] has quit [Quit: Bridge terminating on SIGTERM]
MarcinMielyskiGi has quit [Quit: Bridge terminating on SIGTERM]
JulesIvanicGitte has quit [Quit: Bridge terminating on SIGTERM]
lopex has joined #jruby
daveg_lookout[m] has joined #jruby
<daveg_lookout[m]> headius: we continue to see frequent instance deaths using 9.2.14 + Monitor monkey-patch. Instances are dying so fast we've had problems getting dumps (we have an aggressive policy of killing instances and replacing them). Just got a dump that looks very reminiscent of https://github.com/jruby/jruby/issues/6309. I'll attach dump to that issue, we can open a new Issue if you think it's different
kai[m]1 has joined #jruby
enebo[m] has joined #jruby
lopex[m] has joined #jruby
ChrisSeatonGitte has joined #jruby
boc_tothefuture[ has joined #jruby
ravicious[m] has joined #jruby
headius[m] has joined #jruby
KarolBucekGitter has joined #jruby
XavierNoriaGitte has joined #jruby
nhh[m] has joined #jruby
OlleJonssonGitte has joined #jruby
chrisseaton[m] has joined #jruby
RomainManni-Buca has joined #jruby
UweKuboschGitter has joined #jruby
CharlesOliverNut has joined #jruby
FlorianDoubletGi has joined #jruby
rdubya[m] has joined #jruby
slonopotamus[m] has joined #jruby
ahorek[m] has joined #jruby
byteit101[m] has joined #jruby
MattPattersonGit has joined #jruby
JesseChavezGitte has joined #jruby
dentarg[m] has joined #jruby
GGibson[m] has joined #jruby
hopewise[m] has joined #jruby
JulesIvanicGitte has joined #jruby
kares[m] has joined #jruby
TimGitter[m] has joined #jruby
liamwhiteGitter[ has joined #jruby
MarcinMielyskiGi has joined #jruby
BlaneDabneyGitte has joined #jruby
TimGitter[m]1 has joined #jruby
<headius[m]> daveg_lookout: ok I will have a look
<headius[m]> it does look like the same issue
<daveg_lookout[m]> I just added a comment to the issue, it's slightly different in that the enumerators are created with #each, instead of #to_enum in the original
<headius[m]> ok, not a big difference but good to know
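For context, a minimal illustrative sketch (not from the log) of the two ways an external enumerator can be created; on JRuby, #peek and #next on either kind pump the block through a fiber, which is the code path under suspicion here:

    arr = [1, 2, 3]
    enum_from_each    = arr.each           # calling #each with no block returns an Enumerator
    enum_from_to_enum = arr.to_enum(:each)
    enum_from_each.peek                    # => 1, driven by a fiber under the hood on JRuby
    enum_from_to_enum.next                 # => 1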
<daveg_lookout[m]> I have a heap dump (or at least most of one, not sure it wasn't interrupted before completing) but it's 650MB. i can run some analysis over it if that would be useful
ruurd has quit [Quit: bye folks]
<headius[m]> I think the interesting bit in a heap dump would be to examine the state of those enumerators and figure out why they are blocking
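One possible way to do that, assuming the dump loads in VisualVM: an OQL query against the JRuby enumerator class (class name assumed here) to list live instances, then inspect each one's fiber and thread fields:

    select e from org.jruby.RubyEnumerator e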
<headius[m]> I am looking into other fiber impl code and tests to see if there's anything we aren't doing that might point toward a solution
<headius[m]> interesting, I see that one of the peeking threads is in an exception handler block
<headius[m]> could be an exception raised from the underlying AR thingy?
<daveg_lookout[m]> definitely possible. there are 2 threads that are in the process of raising interrupts
<headius[m]> yeah I see the same
<headius[m]> there are not a lot of test cases for exceptions across this fiber edge
<daveg_lookout[m]> running now, will take 20-30 minutes, i expect
<headius[m]> it would be helpful to know which fiber those peekers are waiting for so we can determine what state they are in and why they are not returning results
<daveg_lookout[m]> VisualVM is still trying to load the heap dump, I'm not expecting too much on that front. Second stack dump just completed, looking now
<headius[m]> ok
<daveg_lookout[m]> Threads 2103, 2104, 9863, 9864 are still in same place. Thread 8043 is slightly different -- now doing java.lang.Throwable.fillInStackTrace within raise exception. Thread 1400 is now in Thread.interrupt from ThreadFiber.handleExceptionDuringExchange. I'll add the new trace to the issue.
<headius[m]> ok
<daveg_lookout[m]> added
<headius[m]> Is that the right file? You say 8043 is now doing fillInStackTrace but I see it at interrupt0 still
<headius[m]> filename is same as previous upload
<headius[m]> daveg_lookout: the new upload seems to have both of the peek threads still at interrupt0
<headius[m]> when this hangs do you see runaway CPU use or is it silent?
<daveg_lookout[m]> it remained normal until we removed it from the load balancer, then dropped
<daveg_lookout[m]> so it was still managing to do a lot of normal work
<headius[m]> I am trying to determine whether there might be a race when interrupting a thread waiting on a fiber
<headius[m]> The simple behavior seems to match but I will come up with a torture test
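A rough sketch of what such a torture test might look like (assumption: plain Enumerator#peek/#next plus Thread#raise is enough to hit the race; this is not a confirmed reproduction):

    # Repeatedly interrupt threads that are blocked on an enumerator's fiber.
    100.times do
      threads = 10.times.map do
        Thread.new do
          enum = Enumerator.new { |y| loop { y << 1 } }
          loop { enum.peek; enum.next }
        end
      end
      sleep 0.1
      threads.each { |t| t.raise(RuntimeError, 'interrupt') }
      threads.each { |t| t.join rescue nil }
    end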
<daveg_lookout[m]> sounds good. let me know if i can help
<headius[m]> Did you see my question above about thread 8043?
<headius[m]> I am not seeing what you are seeing
<headius[m]> This is the first reference to hanging in that interrupt method that I have found
<headius[m]> In this case it is waiting on a kernel level heap lock
<headius[m]> I would assume you're on a fairly recent Linux kernel
<headius[m]> If we can get a native thread trace that might tell us a bit more
<daveg_lookout[m]> this isn't super new -- last day on Ubuntu 16 before upgrading to Ubuntu 18. Kernel 4.4.0-1119-aws
<headius[m]> Hmm well always a chance the newer kernel will help something
<headius[m]> Some instructions there on getting a thread dump from a running process using GDB or pstack
<headius[m]> It would definitely be helpful to know what's happening below interrupt0
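A hedged example of grabbing native stacks with standard gdb options (adjust the pid; this is one way to see what sits below interrupt0):

    gdb -p <pid> -batch -ex 'thread apply all bt'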
<daveg_lookout[m]> pstack isn't giving anything useful. no symbols found, only 2 frames on the root pid and 5 frames on pid 8043
<daveg_lookout[m]> trying gdb now
subbu is now known as subbu|lunch
<headius[m]> That is the peek thread?
<daveg_lookout[m]> no -- i had a typo, hold on
<daveg_lookout[m]> updated gist, more useful now
<daveg_lookout[m]> for reference, http://support.sas.com/kb/58/075.html is much more useful than that IBM page
<daveg_lookout[m]> i `continue`d a few times then dumped the threads again and attached that to the gist as well
<headius[m]> Ok cool
subbu|lunch is now known as subbu
NightMonkey has quit [Read error: Connection reset by peer]
<headius[m]> So it seems like it may not be hung but is stuck cycling
NightMonkey has joined #jruby
<daveg_lookout[m]> agreed. and the threads that are trying to peek definitely seem to be stuck
<headius[m]> Ok I will poke around more. This helps
ur5us_ has joined #jruby
<daveg_lookout[m]> Let me know if you want me to do any last things on this instance. it's going to get killed soon by deploy of next release
kroth[m] has joined #jruby
<headius[m]> I think it is as simple as this loop in ThreadFiber never getting exited... it keeps trying to rethrow some exception in the target fiber and never making progress
drbobbeaty has quit [Ping timeout: 240 seconds]
<headius[m]> unfortunately no other Ruby impl has ever tried to tackle propagating exceptions like this so we will have to suss out what the right sequence is here
<daveg_lookout[m]> heh. finding all kinds of interesting things. i may have a 3rd issue, still trying to make sure it's not our bug. Redis commands (from client perspective) get slower the longer an instance is alive. redis server time is stable. but too early to report
<headius[m]> well one thing about this continually raising exceptions: generating stack traces can be rather expensive
drbobbeaty has joined #jruby
<headius[m]> so if it gets into this state and has one or more threads just spinning on exceptions that could slow other threads down
<headius[m]> it would also be burning GC cycles pretty fast
<headius[m]> you might be able to attach visualvm and monitor the GC to see if it's running a lot
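For example, standard JDK tooling (not specific to this bug) can show GC activity once per second without a GUI:

    jstat -gcutil <pid> 1000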
<headius[m]> this logic in exchangeWithFiber seems right but apparently there is some state where it might get stuck in this loop
<daveg_lookout[m]> yeah, we definitely see old generation gc counts and times go up when this starts going bad
<daveg_lookout[m]> I've gotta run, will be back on later tonight. thanks again for all the help!
<headius[m]> yeah I will keep trying to find an edge case that triggers this
kroth[m] is now known as kroth_lookout[m]
<kroth_lookout[m]> fwiw: it’s not just old gen gc counts, although those are easiest to pick out. we also see young gen gc counts and heap usage climb pretty drastically in some cases
<headius[m]> kroth_lookout: ah you work with daveg_lookout ?
ur5us_ has quit [Ping timeout: 264 seconds]
<kroth_lookout[m]> yup
<headius[m]> there's a great deal of allocation and JVM safepoint overhead involved in generating a stack trace so that would seem to fit
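A small illustrative benchmark (not from the log) of that stack-trace cost; raising from a deeper stack makes fillInStackTrace do proportionally more work:

    require 'benchmark'

    def deep(n, &blk)
      n.zero? ? blk.call : deep(n - 1, &blk)
    end

    Benchmark.bm(10) do |bm|
      bm.report('shallow') { 100_000.times { raise 'x' rescue nil } }
      bm.report('deep')    { deep(200) { 100_000.times { raise 'x' rescue nil } } }
    end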
<headius[m]> hmm well this contrived case never exits but doesn't get stuck the same way
<headius[m]> ruby -e "t = Thread.new { f = Fiber.new { loop { Fiber.yield } }; loop { f.resume } }; sleep 1; t.raise('foo'); t.join"
<headius[m]> the raise seems to get lost and never terminates the fiber and thread
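That one-liner, expanded for readability (same behavior, comments added):

    t = Thread.new do
      f = Fiber.new { loop { Fiber.yield } }  # fiber that yields forever
      loop { f.resume }                       # thread resumes it forever
    end
    sleep 1
    t.raise('foo')   # interrupt the thread while it is parked in the fiber exchange
    t.join           # the raise appears to get lost, so this never returns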
<headius[m]> hmmm seems I am on to something
ur5us_ has joined #jruby
ur5us has joined #jruby
ur5us_ has quit [Ping timeout: 260 seconds]