#jruby on 2021-01-12 — irc logs at freenode.irclog.whitequark.org

2020-12-10 18:57 ChanServ changed the topic of #jruby to: Get 9.2.14.0! http://jruby.org/ | http://wiki.jruby.org | http://logs.jruby.org/jruby/ | http://bugs.jruby.org | Paste at http://gist.github.com

00:00 ur5us has quit [Remote host closed the connection]

00:00 ur5us has joined #jruby

00:03 ur5us has quit [Remote host closed the connection]

00:03 ur5us has joined #jruby

00:16 ur5us_ has joined #jruby

00:18 ur5us has quit [Ping timeout: 264 seconds]

02:07 ur5us has joined #jruby

02:08 ur5us_ has quit [Ping timeout: 264 seconds]

03:34 sagax has joined #jruby

04:11 sagax has quit [Quit: Konversation terminated!]

04:26 sagax has joined #jruby

04:44 ur5us has quit [Ping timeout: 256 seconds]

07:56 Antiarc_ has quit [Ping timeout: 240 seconds]

08:27 Antiarc has joined #jruby

08:43 Antiarc has quit [Ping timeout: 264 seconds]

13:00 ruurd has quit [Read error: Connection reset by peer]

13:01 ruurd has joined #jruby

16:15 Antiarc has joined #jruby

17:02 <daveg_lookout[m]> headius: We finally deployed 9.2.14.0 to try and verify https://github.com/jruby/jruby/issues/6326 -- unfortunately, it, or a bug very much resembling it reoccurred. The instance was happy for a few days with lower load, but as we ramped up requests to it, almost all the threads locked up. I've attached new thread dump to that issue. Instance is still live. Anything I can do to help debug?

17:23 <headius[m]> daveg_lookout: good morning!

17:23 <headius[m]> keep that instance handy and I will have a look at your latest

17:25 <daveg_lookout[m]> Good morning! Will do

17:27 <headius[m]> if you are able to pull a heap dump that may be helpful

17:27 <headius[m]> jmap command line or jconsole or visualvm should be able to help there

17:58 <daveg_lookout[m]> Ok, dumping now. And stackdump with -l succeeded after 25 minutes. Almost 8 MB...

17:59 <daveg_lookout[m]> I can't upload it here, I'll attach it to the issue

18:00 <headius[m]> I just added a comment... going off the log you were able to attach this may be a new or different issue, since all the thread blocked waiting for a lock appear to be trying to get a redis connection

18:00 <headius[m]> I have not found any threads stuck on AR connection pool this time

18:01 <headius[m]> maybe you can check my work on that but based on this new dump it may be our issue or it may be something specific to the connection_pool or redis gems

18:02 <daveg_lookout[m]> Oh, interesting, i hadn't seen the redis pieces. I had noticed the lack of AR, but hadn't gone through the rest of it as much.

18:03 <daveg_lookout[m]> ok, let me spend some time digging around this, thanks and sorry, I should have spotted that earlier

18:03 <headius[m]> that's fine, I agree it does look similar to the other problem and may still indicate a JRuby bug, but the different stack makes me wonder

18:04 <headius[m]> if you can confirm there's no AR involved I think we should open a new issue

18:06 <headius[m]> I will take a quick look at these lines in redis and see if I can eyeball a problem

18:36 <daveg_lookout[m]> There are several places where we're holding an AR Connection from the pool, then separately getting a redis connection from a separate pool and hanging on that. So agreed that AR is not directly to blame

18:39 <headius[m]> ok

18:55 <headius[m]> I am looking into Thread.handle_interrupt to see if there might be an issue

18:55 <headius[m]> I have some behavior that might indicate a bug

18:55 <daveg_lookout[m]> Hold off on digging into this too much. I'm beginning to suspect that we're seeing a server in a post-traumatic situation -- something bad happened, it permanently lost connection to DB, and now the redis stuff is it pretending things are fine

18:56 <headius[m]> I will keep exploring this issue I am seeing... but it may or may not be related

18:56 <daveg_lookout[m]> or in any case, subsequent dumps show different locations, suggesting that the redis locks were released and things moved on

18:56 <headius[m]> it would require that the thread attempting to acquire this lock gets interrupted after it enters the `ensure` but before it releases the lock with `mon_exit`

18:57 <daveg_lookout[m]> uggh, that would be bad

18:57 <headius[m]> so similar to 6405 but a different patten

18:57 <headius[m]> pattern

18:58 <headius[m]> if you are able to pull a full heap dump off that instance we would be able to look at the actual lock these threads are trying to acquire and confirm if the thread that locked it has gone away or otherwise left it locked

18:58 <headius[m]> I will let you explore on your end to see if you can find out more

18:59 <headius[m]> this redis case is much narrower than the AR issue in that it is a single, boring Monitor lock that the redis gem is just using as a general-purpose lock

18:59 <headius[m]> so in theory any similar use should have a problem too

19:00 <headius[m]> your theory about a post-traumatic situation could be a clue... if a thread attempting to release that lock were suddenly interrupted it might remain locked

19:00 <daveg_lookout[m]> yep, heap dump is still running, 20 minutes later

19:00 <headius[m]> enebo: this might be motivation to get monitor.rb out of Ruby code just to tighten the lock/unlock logic and avoid Ruby interrupt hassles

19:01 <headius[m]> and also make sure handle_interrupt is behaving right

19:01 <rdubya[m]> We've previously seen issues with that redis lock as well, we were never able to track down a cause and have since changed our server architecture so haven't seen it recently

19:02 <headius[m]> rdubya: ok that is interesting

19:02 <headius[m]> daveg_lookout: might as well throw this new case and info into a new issue

19:02 <headius[m]> if it turns out to be nothing then no harm

19:03 <daveg_lookout[m]> ok, will do, thank you!

19:07 <headius[m]> hmm this may not be just a JRuby bug... this may be a Ruby issue in general with this library

19:41 <enebo[m]> headius: seems reasonable to me. The broader issues that people write to MRI happens to work will always be a potential problem but we can at least feel condident about what we are doing.

19:42 <enebo[m]> I think the potential doomscrolling is killing me this week. I might switch over to comms free windows box and work on the launcher the rest of the afternoon

19:43 <enebo[m]> I can still access news but I never have any personal social media on my windows machine...

19:52 subbu is now known as subbu|lunch

19:59 <headius[m]> ok I may have found a bug here that could explain it

20:00 <headius[m]> it is really a bug in monitor.rb but it is only exposed (or just much easier to expose) on parallel-threaded implementations

20:00 <headius[m]> both JRuby and TruffleRuby show it

20:00 <headius[m]> CRuby could potentially have it but it is hard to trigger thread events at the right boundary

20:01 <headius[m]> daveg_lookout: enebo: https://gist.github.com/headius/eb0583999cad5b3f53a14dd6a17a5e51

20:02 <headius[m]> this script should run forever printing "ok" but on JRuby it quickly errors out because the monitor is left locked

20:03 <headius[m]> I believe it is a bug in the way monitor.rb tries to avoid interrupting the ensure here:

20:03 <headius[m]> https://github.com/ruby/monitor/blob/master/lib/monitor.rb#L227-L236

20:04 <headius[m]> they are not actually preventing interrupts between entering ensure and calling handle_interrupt

20:05 <headius[m]> updated my gist with a potential patch I believe is more correct

20:06 <headius[m]> so it could be a bug in CRuby if it is possible to interrupt the thread between entering ensure and the point at which handle_interrupt disables interrupts

20:07 <headius[m]> this patch could be applied manually to your system daveg_lookout or if this is confirmed as a good fix we may be able to get a monitor gem release you can activate early on

20:07 <headius[m]> or another option would be monkey patching

20:07 <daveg_lookout[m]> ok. monkey patching is probably fastest path to learning more

20:08 <headius[m]> to be clear I believe I have found a clear bug in monitor.rb but it may not impact CRuby due to the lack of parallel execution

20:09 <daveg_lookout[m]> yep, quite plausible. that link isn't working for me, strangely

20:10 <headius[m]> the gist link?

20:10 <daveg_lookout[m]> no, the link to monitor.rb

20:10 <headius[m]> huh weird

20:11 <headius[m]> in any case look at mon_synchronize in any monitor.rb and you will see the old code I believe is broken

20:13 <daveg_lookout[m]> yep, i see that.

20:13 <headius[m]> enebo: strange CI failure here... I would expect it to fail every time but it only failed this one job

20:13 <headius[m]> https://travis-ci.com/github/jruby/jruby/jobs/470978029

20:14 <headius[m]> from this PR, which is a trivial patch:

20:14 <headius[m]> https://github.com/jruby/jruby/pull/6525

20:15 <headius[m]> this is not a new spec either... strange

20:16 <headius[m]> I will merge that PR and see if it happens again on master

20:17 <headius[m]> daveg_lookout: I will file a bug with monitor.rb

20:17 <daveg_lookout[m]> sounds great. i'm opening a new JRuby bug with the stack trace and we can link them

20:17 <headius[m]> ok

20:29 ur5us has joined #jruby

20:37 <travis-ci> jruby/jruby (master:4282d7a by Charles Oliver Nutter): The build was broken. https://travis-ci.com/jruby/jruby/builds/212896345 [208 min 13 sec]

20:37 travis-ci has left #jruby [#jruby]

20:37 travis-ci has joined #jruby

20:40 <daveg_lookout[m]> Created https://github.com/jruby/jruby/issues/6526, sorry about delay

20:43 ur5us_ has joined #jruby

20:45 ur5us has quit [Ping timeout: 264 seconds]

20:45 subbu|lunch is now known as subbu

20:45 <headius[m]> ok thanks

21:10 <headius[m]> daveg_lookout: https://github.com/ruby/monitor/issues/2

21:10 <headius[m]> also enebo

21:25 <enebo[m]> headius: your github link is 404 for me

21:27 <headius[m]> the monitor issue?

21:28 <headius[m]> enebo: I just clicked that link and it is ok for me

21:29 <daveg_lookout[m]> I think Monitor is a private repo

21:30 <headius[m]> oh yeah that would do it

21:32 <headius[m]> daveg_lookout: enebo: filed a bug to make the repo public

21:33 <headius[m]> there is a gem release and CRuby appears to be using these sources now so it should be public

21:33 <daveg_lookout[m]> still private for me

21:34 <headius[m]> I don't have admin privileges so I can't make it public, but hopefully someone will handle that soon

21:34 <headius[m]> it is definitely private right now but it should be public

21:41 <headius[m]> daveg_lookout: I will do a PR to patch our copy of monitor for 9.2.15.0 and 9.3 which will resolve the issue for us... once they get ruby/monitor patched and released it will be fixed for future versions that use the gem contents

21:50 <daveg_lookout[m]> Sounds great, thanks!

21:53 subbu is now known as subbu|away

22:02 <headius[m]> https://github.com/jruby/jruby/pull/6527

22:44 ur5us_ has quit [Remote host closed the connection]

22:44 ur5us_ has joined #jruby

22:59 subbu|away is now known as subbu