<daveg_lookout[m]>
headius: We finally deployed 9.2.14.0 to try to verify https://github.com/jruby/jruby/issues/6326 -- unfortunately, it, or a bug very much resembling it, recurred. The instance was happy for a few days under lower load, but as we ramped up requests to it, almost all the threads locked up. I've attached a new thread dump to that issue. The instance is still live. Anything I can do to help debug?
<headius[m]>
daveg_lookout: good morning!
<headius[m]>
keep that instance handy and I will have a look at your latest
<daveg_lookout[m]>
Good morning! Will do
<headius[m]>
if you are able to pull a heap dump that may be helpful
<headius[m]>
jmap command line or jconsole or visualvm should be able to help there
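(For reference, a heap dump can usually be captured from the command line by pointing jmap at the running JVM's pid; the file path and <pid> below are placeholders, not values from this conversation.)

```
jmap -dump:live,format=b,file=/tmp/jruby-heap.hprof <pid>
```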
<daveg_lookout[m]>
Ok, dumping now. And the stack dump with -l succeeded after 25 minutes. Almost 8 MB...
<daveg_lookout[m]>
I can't upload it here, I'll attach it to the issue
<headius[m]>
I just added a comment... going off the log you were able to attach, this may be a new or different issue, since all the threads blocked waiting for a lock appear to be trying to get a redis connection
<headius[m]>
I have not found any threads stuck on AR connection pool this time
<headius[m]>
maybe you can check my work on that but based on this new dump it may be our issue or it may be something specific to the connection_pool or redis gems
<daveg_lookout[m]>
Oh, interesting, I hadn't seen the redis pieces. I had noticed the lack of AR, but hadn't gone through the rest of it as much.
<daveg_lookout[m]>
ok, let me spend some time digging around this. Thanks, and sorry, I should have spotted that earlier
<headius[m]>
that's fine, I agree it does look similar to the other problem and may still indicate a JRuby bug, but the different stack makes me wonder
<headius[m]>
if you can confirm there's no AR involved I think we should open a new issue
<headius[m]>
I will take a quick look at these lines in redis and see if I can eyeball a problem
<daveg_lookout[m]>
There are several places where we're holding an AR Connection from the pool, then separately getting a redis connection from a separate pool and hanging on that. So agreed that AR is not directly to blame
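(The nesting described above is roughly the shape sketched below; the pool name, key, and calls are illustrative assumptions, not the actual application code. The observed hang is on the redis side while an AR connection is already checked out.)

```ruby
# Illustrative only: an AR connection held while a separate redis pool is used.
ActiveRecord::Base.connection_pool.with_connection do |conn|
  # AR connection is checked out here...
  REDIS_POOL.with do |redis|     # ...and the thread hangs getting/using a redis connection
    redis.get("some-key")
  end
end
```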
<headius[m]>
ok
<headius[m]>
I am looking into Thread.handle_interrupt to see if there might be an issue
<headius[m]>
I have some behavior that might indicate a bug
<daveg_lookout[m]>
Hold off on digging into this too much. I'm beginning to suspect that we're seeing a server in a post-traumatic situation -- something bad happened, it permanently lost its connection to the DB, and the redis activity we're seeing now is just it pretending things are fine
<headius[m]>
I will keep exploring this issue I am seeing... but it may or may not be related
<daveg_lookout[m]>
or in any case, subsequent dumps show different locations, suggesting that the redis locks were released and things moved on
<headius[m]>
it would require that the thread attempting to acquire this lock gets interrupted after it enters the `ensure` but before it releases the lock with `mon_exit`
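(In its simplest form, the pattern being described has the shape below; this is a sketch of the mechanism, not a verbatim copy of monitor.rb. The dangerous window is the instant after the `ensure` body is entered but before `mon_exit` actually runs.)

```ruby
# Simplified shape of a MonitorMixin-style synchronize (sketch, not gem source):
def mon_synchronize
  mon_enter
  begin
    yield
  ensure
    # An asynchronous interrupt (Thread#raise, timeout, kill) delivered right
    # here means mon_exit below never runs and the monitor stays locked.
    mon_exit
  end
end
```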
<daveg_lookout[m]>
uggh, that would be bad
<headius[m]>
so similar to 6405 but a different pattern
<headius[m]>
if you are able to pull a full heap dump off that instance, we would be able to look at the actual lock these threads are trying to acquire and confirm whether the thread that locked it has gone away or otherwise left it locked
<headius[m]>
I will let you explore on your end to see if you can find out more
<headius[m]>
this redis case is much narrower than the AR issue in that it is a single, boring Monitor lock that the redis gem is just using as a general-purpose lock
<headius[m]>
so in theory any similar use should have a problem too
<headius[m]>
your theory about a post-traumatic situation could be a clue... if a thread attempting to release that lock were suddenly interrupted it might remain locked
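(A hypothetical stress probe along these lines could test that theory; it is not from the discussion, may or may not reproduce the problem on a given runtime, and all names in it are made up for illustration.)

```ruby
require 'monitor'

# Repeatedly interrupt a thread while it is inside Monitor#synchronize, then
# verify the lock can still be acquired. A lost mon_exit would make the
# checking thread time out.
lock = Monitor.new

500.times do |i|
  holder = Thread.new { lock.synchronize { sleep 0.05 } }
  sleep 0.01
  holder.raise(RuntimeError, "async interrupt") if holder.alive?
  begin; holder.join; rescue RuntimeError; end

  checker = Thread.new { lock.synchronize { } }
  unless checker.join(2)   # nil here means the checker never got the lock
    puts "iteration #{i}: monitor appears to be stuck locked"
    break
  end
end
```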
<daveg_lookout[m]>
yep, heap dump is still running, 20 minutes later
<headius[m]>
enebo: this might be motivation to get monitor.rb out of Ruby code just to tighten the lock/unlock logic and avoid Ruby interrupt hassles
<headius[m]>
and also make sure handle_interrupt is behaving right
<rdubya[m]>
We've previously seen issues with that redis lock as well. We were never able to track down a cause, and we've since changed our server architecture, so we haven't seen it recently
<headius[m]>
rdubya: ok that is interesting
<headius[m]>
daveg_lookout: might as well throw this new case and info into a new issue
<headius[m]>
if it turns out to be nothing then no harm
<daveg_lookout[m]>
ok, will do, thank you!
<headius[m]>
hmm this may not be just a JRuby bug... this may be a Ruby issue in general with this library
<enebo[m]>
headius: seems reasonable to me. The broader issue, that people write code which happens to work on MRI, will always be a potential problem, but we can at least feel confident about what we are doing.
<enebo[m]>
I think the potential doomscrolling is killing me this week. I might switch over to my comms-free Windows box and work on the launcher for the rest of the afternoon
<enebo[m]>
I can still access news, but I never have any personal social media on my Windows machine...
<headius[m]>
ok I may have found a bug here that could explain it
<headius[m]>
it is really a bug in monitor.rb but it is only exposed (or just much easier to expose) on parallel-threaded implementations
<headius[m]>
both JRuby and TruffleRuby show it
<headius[m]>
CRuby could potentially have it but it is hard to trigger thread events at the right boundary
<headius[m]>
they are not actually preventing interrupts between entering ensure and calling handle_interrupt
<headius[m]>
updated my gist with a potential patch I believe is more correct
<headius[m]>
so it could be a bug in CRuby if it is possible to interrupt the thread between entering ensure and the point at which handle_interrupt disables interrupts
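(Paraphrasing the shape being described: each step gets its own handle_interrupt block, so nothing covers the instant between entering `ensure` and the point where the :never region begins. This is a simplified paraphrase, not a verbatim copy of the monitor.rb source.)

```ruby
def mon_synchronize
  Thread.handle_interrupt(Exception => :never) { mon_enter }
  begin
    Thread.handle_interrupt(Exception => :immediate) { yield }
  ensure
    # An interrupt delivered right here is not deferred yet, so the
    # handle_interrupt/mon_exit line below may never execute.
    Thread.handle_interrupt(Exception => :never) { mon_exit }
  end
end
```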
<headius[m]>
this patch could be applied manually to your system, daveg_lookout, or if this is confirmed as a good fix we may be able to get a monitor gem release you can activate early on
<headius[m]>
or another option would be monkey patching
<daveg_lookout[m]>
ok. monkey patching is probably fastest path to learning more
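(For experimentation, a monkey patch along these lines would match the shape of the fix headius describes: defer interrupts across the whole acquire/release sequence and only allow them while the caller's block runs. This is a sketch modeled on that description, not the contents of the gist or the eventual gem release.)

```ruby
require 'monitor'

module MonitorMixin
  def mon_synchronize
    # Interrupts are deferred for the entire acquire/release sequence, so there
    # is no window where the lock is held but mon_exit can be skipped.
    Thread.handle_interrupt(Exception => :never) do
      mon_enter
      begin
        # Re-enable interrupts only while the caller's block is running.
        Thread.handle_interrupt(Exception => :immediate) { yield }
      ensure
        mon_exit # still under :never, so it always runs
      end
    end
  end
  alias synchronize mon_synchronize
end
```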
<headius[m]>
to be clear, I believe I have found a genuine bug in monitor.rb, but it may not impact CRuby due to the lack of parallel execution
<daveg_lookout[m]>
yep, quite plausible. that link isn't working for me, strangely
<headius[m]>
the gist link?
<daveg_lookout[m]>
no, the link to monitor.rb
<headius[m]>
huh weird
<headius[m]>
in any case look at mon_synchronize in any monitor.rb and you will see the old code I believe is broken
<daveg_lookout[m]>
yep, i see that.
<headius[m]>
enebo: strange CI failure here... I would expect it to fail every time but it only failed this one job
<enebo[m]>
headius: your github link is 404 for me
<headius[m]>
the monitor issue?
<headius[m]>
enebo: I just clicked that link and it is ok for me
<daveg_lookout[m]>
I think Monitor is a private repo
<headius[m]>
oh yeah that would do it
<headius[m]>
daveg_lookout: enebo: filed a bug to make the repo public
<headius[m]>
there is a gem release and CRuby appears to be using these sources now so it should be public
<daveg_lookout[m]>
still private for me
<headius[m]>
I don't have admin privileges so I can't make it public, but hopefully someone will handle that soon
<headius[m]>
it is definitely private right now but it should be public
<headius[m]>
daveg_lookout: I will do a PR to patch our copy of monitor for 9.2.15.0 and 9.3, which will resolve the issue for us... once they get ruby/monitor patched and released it will be fixed for future versions that use the gem contents