ur5us has quit [Ping timeout: 258 seconds]
tests has joined #jruby
tests has quit [Quit: Connection closed]
Antiarc has quit [Ping timeout: 240 seconds]
Antiarc has joined #jruby
ur5us has joined #jruby
ur5us has quit [Ping timeout: 276 seconds]
drbobbeaty has joined #jruby
Antiarc has quit [Ping timeout: 260 seconds]
<headius[m]> enebo: do you know why even simple blocks still push a frame?
<headius[m]> that is a lot of bytecode for a { nil } block
<headius[m]> I think the protocol logic is conflating needing to have a frame created for it (for proc evals and such) and the body actually needing the frame pushed
Antiarc has joined #jruby
<headius[m]> hmmm it is getting marked as needing the frame block
<headius[m]> that is wrong
<headius[m]> enebo: ok I think I see what is happening... we always load the block passed into a body into a local, so that the yield instruction can always just go to that variable to get it
<headius[m]> but in this case eagerly loading it for the block also makes it look like it needs the frame since that is where yield would get its block
<headius[m]> so by trying to load the block eagerly when it is never needed we are actually deoptimizing the block
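To make the shape concrete, a minimal sketch (call_block is a hypothetical name):

    # The { nil } body contains no yield and reads no frame state, so it
    # should not need a frame; eagerly loading the frame block for it is
    # what marks it as frame-needing and deoptimizes it.
    def call_block
      yield
    end

    call_block { nil }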
<enebo[m]> you mean to a temp var?
<headius[m]> yeah
<enebo[m]> I don't see why that would ask for a frame by itself
<headius[m]> because it has to use LoadFrameClosureInstr in a block
<enebo[m]> but I have not spent time looking and I am not in much of a position to look at this atm
<headius[m]> which "REQUIRES_BLOCK" from the frame, so it forces a frame push
<headius[m]> but we do it unconditionally whether the block will be yielded or not
<enebo[m]> yeah I see
<enebo[m]> ok so wait...I am looking a little bit now :)
<headius[m]> hah
<headius[m]> join ussss
<enebo[m]> Do we do this because a higher level frame may have gotten this and we just bubble block through frames
<enebo[m]> It feels like the scope we are executing should receive the block as a param
<enebo[m]> Or if we don't I am wondering if we can
<enebo[m]> I guess I could be in a closure and see a yield
<headius[m]> blocks can see two potential blocks... the frame's block, which would be the one passed to the surrounding method, and any block passed as an argument to the block call
<enebo[m]> yeah
<headius[m]> we pass the latter as an argument always, but the frame block we acquire from stack
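A sketch of the two sources (m and inner are hypothetical names):

    def m
      # Inside this block, yield targets m's block -- the frame block --
      # while b is a block passed as an argument to the block's own call.
      inner = proc { |&b| yield; b.call }
      inner.call { puts "argument block" }
    end

    m { puts "frame block" }   # prints "frame block" then "argument block"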
<headius[m]> a looping test of proc.call'ing a { nil } block is about 2x faster if I make this frame closure load only happen right before yields
<enebo[m]> ah I see
<headius[m]> it cuts out a lot of frame juggling that is irreducible
<enebo[m]> if no yield is present we are still emitting it
<headius[m]> yeah exactly
<headius[m]> I don't know the best pattern in IR to only add instrs if the code is a certain shape, but we could scan for yield and only do this eager block get in that case
<headius[m]> or it could be lazy like this and optimized in a later pass
<enebo[m]> ok I want to check one thing on IRFlags but I think this should be simple if yield is actually our only reason for retrieving it
<headius[m]> I have a patch but it is maybe a bit hacky
<enebo[m]> REQUIRES_BLOCK appears to be for the case of seeing LoadFrameClosure, and maybe Proc.new is happening
<headius[m]> the other one we always add is UpdateBlockExecutionStateInstr
<headius[m]> right that definitely triggers needing the frame, but we do it eagerly for all blocks
<enebo[m]> So if we add a boolean to IRScope hasYield (which will need to cope with child closures) then we can omit it
<headius[m]> so it forces all blocks to push frame right now
<headius[m]> hmmm
<enebo[m]> If the closure itself and no child closures have a yield it does not need one and actually that may not even be true
<enebo[m]> Only the child which has one maybe since it can still get it out of the frame
<enebo[m]> but it is pinned on the notion that the presence of yield is the crux
<headius[m]> if there is a child closure it will already force frame push anyway
<headius[m]> so we don't have to see across that boundary right now
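A sketch of the two shapes the hasYield idea distinguishes (m is a hypothetical name):

    def m
      a = proc { 42 }      # no yield anywhere: the eager frame-closure load is wasted
      b = proc { yield }   # yield reaches m's block through the frame: load needed
      [a.call, b.call]
    end

    p m { :from_frame }    # => [42, :from_frame]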
<enebo[m]> ok well fine. this can also be more conservative
<enebo[m]> we can take it further if the children are also trivial closures
<headius[m]> (it doesn't need to push a frame if we could construct the child closure directly from the parent closure's frame, but that is a future optz)
<enebo[m]> but this is not really important first step I guess anyways
<headius[m]> different instr for child closure could do it... create nested closure with parent's binding as an arg
<headius[m]> anyway that is outside this
<headius[m]> what I have now works but it will add overhead to yields from blocks because it goes back to frame every time
<headius[m]> I will gist what I have
<enebo[m]> ok
<enebo[m]> So I think the solution I would entertain is when building iter I would look for yield and no child closures and then instead of emitting prepareImplicitState() at front of build I would add it at end and that method will prepend those instrs
<headius[m]> if we can clean this up to only load the frame closure (once) if it will potentially be used this would speed up leaf closures a lot
<enebo[m]> That does not even require saving any state to IRScope
<headius[m]> oh that is a good thought
<enebo[m]> well your diff is worth talking over
<headius[m]> so once it is done, if it saw a frame closure-needing instr it injects the right code at the beginning
<headius[m]> sure
<headius[m]> so my diff could be reoptimized later by protocol or whatever
<enebo[m]> Yours does not even track this or change anything other than the location of the yield itself
<enebo[m]> I can think of one interesting wrinkle with it
<headius[m]> or we could add a YieldFrameClosure instr that would do this lazily but be compiled to efficient get-once + normal yield in later passes
<enebo[m]> you can have many yields
<enebo[m]> I think it will still work though
<headius[m]> yes
<headius[m]> not common but yes
<headius[m]> so clearly if this yield was in a loop going to the frame every iter is inefficient
<enebo[m]> I don't know if there are assumptions we will only load the closure once
<headius[m]> if we yield once it is no worse than what we had
<enebo[m]> the weird aspect of how this is built is it assumes a specific temp var for the load
<headius[m]> yes
<enebo[m]> if we put it in front of lets say 3 yields it would just keep asking the frame for the same value and then reset it to the same value
<headius[m]> yes
<enebo[m]> This single variable is important for inliner but I am not sure if this would break anything
<enebo[m]> Once an inline was finished it should remove this
<enebo[m]> so inliner is currently also broken so it is all just making sure we do not deviate too much
<enebo[m]> Worst case here would be that we need to write something which removes all occurrences of that instr vs finding one and break'ing
<enebo[m]> we tend to do that when we think we only have one so that is very likely how it is written
<enebo[m]> So let's just consider the other path
<enebo[m]> boolean hasYield and a boolean method we call at the end of build which checks that field and whether it has children
<enebo[m]> then we call prepareImplicitState(boolean) and prepend those instrs at the front and the boolean will say whether we should emit the loadframeclosureinstr
<enebo[m]> so we pretty easily just add those at the end instead of the beginning
<enebo[m]> we have public void addInstrAtBeginning(Instr instr) {
<headius[m]> You know it occurs to me this is still conflating needing a frame with being a closure in the first place
<headius[m]> At least for now, all closures will have a frame in hand from the parent method because we don't eliminate that yet
<enebo[m]> yeah we have to get it from somewhere and that is how right now
<headius[m]> So the frame closure is always available whether the block body itself actually needs other frame fields
<headius[m]> Flagging it as requiring a full frame for this instruction is overkill
<enebo[m]> It would actually be nice to add a second Block field :)
<headius[m]> It is more like load block binding closure knowing that the binding is always there
<enebo[m]> Ah the joy of passing another param through our whole call stack
<headius[m]> That's not a trivial change but it's doable
<enebo[m]> It does decouple it and will not require the field on the frame
<headius[m]> It would mean we are always reading that frame field whether it's used or not but it would not force a frame push
<headius[m]> No it still needs the frame field because we may be many frames away from the original method, and the only place to get its block from is the frame
<enebo[m]> well I just mean all calls will subcall and pass it down
<enebo[m]> part of me is thinking about eval now :)
<enebo[m]> but I think even a call to eval itself would just pass current secondBlock param
<headius[m]> I don't think that works
<headius[m]> def foo; yield; end; def bar; foo { yield }; end
<headius[m]> the yield there needs to get bar's block but there is a foo call in between
<headius[m]> bar => foo => yield needs to access a block from two stack frames up
<headius[m]> only place for that is bar's frame carried in the binding to the block
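Runnable form of the example above:

    def foo; yield; end
    def bar; foo { yield }; end   # this yield targets bar's block, not foo's

    bar { puts "bar's block" }    # prints "bar's block"; foo's own frame can't
                                  # supply it, so it rides in the block's binding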
<enebo[m]> foo would receive { yield } as block and whatever block was passed to bar as secondBlock
<headius[m]> it might work for a simple case
<enebo[m]> if that was 3 calls up it would just pass as secondBlock the whole time
<headius[m]> def foo; baz { yield }; end
<headius[m]> we would have to keep stacking up blocks endlessly
<headius[m]> so now baz would have to receive three blocks...
<headius[m]> foo's, bar's, and the original block
<headius[m]> bar => foo => baz => foo's block => bar's block
<enebo[m]> I think though in this case bar would receive its own { yield } as block and continue passing the same secondBlock
<headius[m]> but that would be the wrong secondBlock
<headius[m]> that would be passing bar's block through instead of foo's
<headius[m]> the innermost block needs to be able to yield to whatever block came into bar, and the block inside foo needs to yield to the block passed to foo, so every block gets carried along into baz for this to work
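Extending the example one level (baz is a hypothetical name) to show why a single secondBlock slot can't carry everything:

    def baz; yield; end
    def foo; baz { yield }; end   # this yield needs foo's block
    def bar; foo { yield }; end   # this yield needs bar's block

    bar { puts "original" }       # prints "original"; while baz is executing,
                                  # blocks needing foo's AND bar's frame blocks
                                  # both run, so both must be carried along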
<enebo[m]> well I am having a hard time visualizing
<headius[m]> I thought you meant just passing the frame block into the block every time, which is feasible
<headius[m]> so it doesn't have to trigger a frame push to get it
<enebo[m]> yeah I do mean that
<headius[m]> I mean really it comes back to this instr not really needing a pushed frame
<headius[m]> it should be LoadClosureImplicitBlock
<headius[m]> and just go to the already-available binding for it
<headius[m]> I guess that doesn't fix this entirely though.. we should not load it eagerly if we don't need it, period
<headius[m]> that would eliminate the frame push but still be doing useless work
<headius[m]> I am willing to entertain the idea of passing more state on the call stack but for now let's assume this has to be in the frame
<enebo[m]> yeah I think that conversation needs to be much larger in scope
<enebo[m]> no sense only addressing it for blocks since it will involve changing a lot of code
<enebo[m]> Without that idea, for now, we can eliminate the frame dependency for simple blocks
<headius[m]> I could see it as a good optimization for the simple case of passing a block that yields just one level deep
<headius[m]> so if we happen to have the secondBlock it is done and needs no frame
<enebo[m]> tons of stuff does
<headius[m]> for sure
<enebo[m]> I like my outline for the idea just because it will not change how inliner works
<enebo[m]> it will still have this instr only once if it is there at all
<headius[m]> I can look at prepending the load instr at the end of the build
<headius[m]> I think that would be the cleanest
<headius[m]> and I might just go ahead with a LoadClosureImplicitBlock since we know we can get it off JVM call stack rather than frame stack
<headius[m]> then it would be there if needed but still not trigger frame push
<enebo[m]> prepareImplicitState is used for all scopes so you may just want a second version of that just for iter/lambda
<enebo[m]> otherwise all inner methods would need to move that to the end of the build
<headius[m]> hmm
<headius[m]> maybe I should do that
<headius[m]> I mean there is no value in any of these methods copying this value around if it won't be used
<enebo[m]> It is only two instrs and an if for closure/non-closure
<enebo[m]> I am fascinated by this method now
<headius[m]> make prepareImplicitState prepend only if the closure is needed
<headius[m]> that is all it does
<enebo[m]> it will loadframeclosure for anything not method or metaclassbody
<enebo[m]> Seems like more than we would need
<headius[m]> well and receive self but we can do that cheaply
<headius[m]> yes
<headius[m]> I can just handle all other cases at the end of build and push frame closure instr
<enebo[m]> ok
<enebo[m]> yeah I am fairly excited to see how this turns out
<headius[m]> we can consider eliminating this for methods later but it is cheap
<headius[m]> the frame access is the problem to fix
<enebo[m]> yep
<enebo[m]> it is one of those things where our design is layered up on a bunch of decisions we would not have picked today
<enebo[m]> but this looks like it is a pretty simple change for win scenario
<headius[m]> yeah could be a nice big win for small blocks
<headius[m]> and it makes my proc.call inlining work really nicely 😀
<enebo[m]> This is so common in Rails
<enebo[m]> Although I guess it is nesting
<enebo[m]> plenty of single log() { method } too
<enebo[m]> The example above also captures scope
<enebo[m]> but plenty doesn't
<enebo[m]> log("rollback transaction", "TRANSACTION") { @connection.rollback }
<headius[m]> mutex.synchronize bro
<enebo[m]> yeah
<headius[m]> unfortunate that we have to have a frame at all for that but this will eliminate the pushing and popping in the block
<headius[m]> synchronize would be a good candidate for specialization
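The synchronize shape, for reference -- a trivial block with no yield that would skip the frame push after this change:

    m = Mutex.new
    counter = 0
    m.synchronize { counter += 1 }  # block body reads no frame state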
<headius[m]> this seems to be working
<headius[m]> this is on graal
<headius[m]> the patch is pretty small
<enebo[m]> ah
<enebo[m]> I wondered when I saw the cache
<headius[m]> oops
<headius[m]> added diff to the gist
<headius[m]> buildLambda is basically buildClosure, yeah?
<headius[m]> it is not just ->{}
<enebo[m]> I would rename that to hasYield
<enebo[m]> buildLambda is only for -> I think
<enebo[m]> buildIter is for basic closures
<headius[m]> ok so will need similar logic for that.. I will make a common method
<enebo[m]> afk a bit
<headius[m]> enebo: I went ahead and made this other instr that does not trigger frame and now both cases are much faster
<headius[m]> but no yield is like 30x better than before
<headius[m]> lopex: I put those joni items on 2.1.42 milestone
ur5us has joined #jruby
<headius[m]> surge_g: confirmed your issue is the same threading + $~ issue we suspected
<fidothe> headius[m]: good news: the 9.3 snapshot worked drop-in with my production system. Bad news: the bug-that-smells-like-#6160 is still there, possibly worse than in 9.2.16.0
<headius[m]> fidothe: do you have a way for me to reproduce it?
<fidothe> I would see it maybe once in 20 instances under 9.2, but saw it on several of the 20 instances under the March 29 9.3 snapshot
<headius[m]> I think you should open a new bug
<fidothe> I haven't managed to get a working reproduction yet. Will try in AWS using the same instance type I'm seeing the problem in
<fidothe> Will open a new bug once I have something concrete
<enebo[m]> headius: looks good
<enebo[m]> headius: is it 30x on java 8?
<headius[m]> hard to tell with the overhead of the loop
<enebo[m]> so it is faster but you cannot tell if it is 30x?
<headius[m]> I will do a quick comparison
<enebo[m]> ok I was just not sure how to evaluate the improvement because I only saw graalvm without fixnum cache
<enebo[m]> not that it really matters. I was just curious
<enebo[m]> There is little doubt it is doing less
<headius[m]> 0.14s down to 0.09s where 0.08 of that is the loop
<headius[m]> rough estimate
<headius[m]> so like 6x
<headius[m]> for foo { nil } yielded 10M times in a flat loop
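The harness was presumably something along these lines (the actual script isn't shown in the log):

    def foo; yield; end

    i = 0
    t = Time.now
    while i < 10_000_000   # flat loop so the block under test dominates
      foo { nil }
      i += 1
    end
    puts Time.now - t      # ~0.14s before, ~0.09s after, ~0.08s of it loop overhead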
<enebo[m]> yeah sweet. land it
<enebo[m]> unless it is not done running on CI
<enebo[m]> hmm a few fails
<enebo[m]> I wonder
<headius[m]> hmm yeah
<headius[m]> I pivoted to this $~ issue and it seems like split never writes anything to backref except nil
<enebo[m]> File does not exist: D:/a/jruby/jruby/test/jruby/test_globals
<headius[m]> it uses backref as an out variable during the logic but then clears it before returning
<enebo[m]> heh interesting and split will not use it internally as that is in joni itself
<headius[m]> yeah
<headius[m]> so I have a patch that passes in an IRubyObject[] holder allocated lazily in a threadlocal
<headius[m]> should be no overhead except for getting that local and assigning values into it along the way
<headius[m]> and no thread steppage
<headius[m]> well other than clearing backref at the end which it still needs to do to be the same behavior
<headius[m]> which could step on some other function reading backref
<headius[m]> I wonder how many of these backref-needing functions actually don't
<enebo[m]> well worst case perhaps we reduce some races :)
<headius[m]> yeah at least a few others probably use backref simply for out values from regexp functions
<headius[m]> will make that into a test case of some kind
<enebo[m]> headius: you still set $~ to nil because it behaved that way back when it actually used $~, but isn't not setting it valid, since it never touches it and so should not be responsible for setting it to nil at the end?
<headius[m]> yes?
<enebo[m]> HAH
<headius[m]> it depends if the clearing is expected behavior or not
<headius[m]> since it doesn't read it it seems appropriate that it would not write it
<headius[m]> but I did not want to take that leap without more investigation
<enebo[m]> That's how I feel
<enebo[m]> I keep thinking we don't touch it and the only thing which would observe this is something else on another thread trying to use it
<enebo[m]> I feel like two outcomes are possible
<enebo[m]> 1. somehow something else muddles $~ with its own data but everything works out because split just happens to clear it
<headius[m]> well someone would have to be depending on split clearing this, which would seem weird
<headius[m]> yeah
<enebo[m]> 2. we don't clear it but it is whatever the other thread wants because we did not wipe it
<headius[m]> like if there is some call after split that does an implicit read of $~ and would have gotten nil before but not after
<enebo[m]> yeah
<headius[m]> but that would be a risky expectation anyway
<enebo[m]> I think it feels like, if that is true, it is another bug where we are not clearing
<headius[m]> someone replaces split(regex) with split(string) and it stops clearing
<headius[m]> (it only does this if you pass regex)
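A repro sketch of the clearing being debated, per the behavior described above (regexp separator only):

    "x" =~ /x/
    "a,b".split(",")   # string separator: $~ untouched
    p $~               # => #<MatchData "x">
    "a,b".split(/,/)   # regexp separator: the impl cleared $~ on the way out
    p $~               # => nil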
<enebo[m]> ok so in that case I think it makes the clearing even less consistent even though it has been inconsistent like that forever for us
<enebo[m]> I guess for 9.2 one can argue this is closer to what we would expect but we were already doing this
<enebo[m]> So long as this does not trigger anyone's reproductions it is ok there
<enebo[m]> for 9.3 I think we should remove it
<headius[m]> pushed a spec, back to block frame optz
<enebo[m]> but maybe even 9.2 if the repro does not work
<headius[m]> I could push a 9.3 PR that removes the clearing altogether
<enebo[m]> you reduced the race by a lot but it is still possible
<headius[m]> well there is no internal race now at least
<headius[m]> other threads reading backref will still get effed up by this
<enebo[m]> yeah I meant that race
<headius[m]> so split is aces after this PR regardless
<enebo[m]> split itself when mixed is fine
<enebo[m]> so I see how that is totally fixed but I was looking at the notion that many methods use $~
<headius[m]> yeah
<enebo[m]> It is really goofy that $~ is used by MRI for split too
<headius[m]> but perhaps fewer than we think!
<headius[m]> honestly they could do the same thing I have been doing and provide versions of these regexp functions that use a holder array
<enebo[m]> yeah for sure
<headius[m]> most of the core functions they have now reading and writing backref could avoid doing so altogether with a stack-allocated array
<headius[m]> maybe I can get tenderlove to do it
<enebo[m]> but there is also the problem with many others that $~ is set for users and not for the impl
<headius[m]> gods I just thought of something.. if we could make this change for aref it would avoid frames for methods calling x[1]
<headius[m]> depends what aref ends up setting into backref
<headius[m]> like who is doing str[/blah/] and then expecting $~ to have something
<headius[m]> stop that
<headius[m]> that one backref use is responsible for a VERY high percentage of our frames
<headius[m]> because we have to treat all [] calls as potentially needing it
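The aref behavior in question -- String#[] with a Regexp writes $~, so every [] call site has to be treated as potentially needing a frame:

    s = "hello world"
    s[/w\w+/]    # => "world"
    p $~[0]      # => "world"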
<headius[m]> most failures on block thing are a botched commit... looks like I renamed a method inconsistently
<enebo[m]> heh I wonder how much would break with no $~ set on aref
<headius[m]> yeah
<enebo[m]> so we need to hack aref site to raise when args0 is regexp and then switch to a slow path with frame setup
<headius[m]> there are specs for it
<enebo[m]> which will not work on aliases of aref :)
<headius[m]> well that is a general purpose optimization I have been wanting to do
<enebo[m]> generic raise
<enebo[m]> yeah
<headius[m]> have the call site set up a fake frame when it might be needed
<enebo[m]> all sites would have to handle raises for each frame field
<headius[m]> like we can see it only needs backref, just push and pop a blank frame for that one field
<enebo[m]> well I was thinking more along the lines that we don't know what we will need, and if it hits the case the method impl will raise and tell us, and it will switchpoint to something else
<headius[m]> yeah sure
<enebo[m]> I suppose though a callsite has a cache and could know
<enebo[m]> the throw would not have to check on the way in at the site; it would still check in the impl itself, but it is already doing that
<enebo[m]> cost/code of catching maybe is very expensive compared to just checking on the way in perhaps
<enebo[m]> heh make some huge ass callsite logic and then just explode from the weight but opt really well on microbenchmarks
<headius[m]> missed part of diff in block PR, should see actual failures now
<enebo[m]> I was confused to see those Windows runs fail from not finding a file
<headius[m]> I see that one locally too
<enebo[m]> is that not why it failed on windows?
<headius[m]> it is something unrelated to Windows
<headius[m]> ahah
<headius[m]> singleton classes can see the surrounding block implicit closure
<headius[m]> I missed that one
<headius[m]> oh this is way better... I moved the set of needsYieldBlock into getYieldClosureVariable
<headius[m]> now it will work for any scope that tries to use that var
<headius[m]> ok I pushed another fix that should get much further... break time
<headius[m]> ok remaining failures seem to be a problem with getting the implicit block from the passed in binding... it is not the same as getting it from pushed frame for some case
travis-ci has joined #jruby
<travis-ci> jruby/jruby (jruby-9.2:d68e624 by Charles Oliver Nutter): The build was broken. https://travis-ci.com/jruby/jruby/builds/221847945 [192 min 20 sec]
travis-ci has left #jruby [#jruby]
<headius[m]> deploy broken again
<headius[m]> enebo: block PR looks solid now
<headius[m]> I did not have to regress any optimizations... I just needed to also get the block from the binding in interpreter (I incorrectly thought there would always be a frame in interpreter, but there isn't if the rest of the block doesn't need it)
<headius[m]> right now this only helps proc.call call sites that only see a single proc object ever, so pretty much just BIPS benchmarks 😀
<headius[m]> if another proc comes along it just calls slow path
<headius[m]> the smart way to bind this would be to have the method handles unpack the target block and let the JVM sort out optimizing multiple, but it is a decent POC as it is
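The limitation described, sketched (run is a hypothetical name):

    def run(pr)
      pr.call   # this call site binds the first proc object it ever sees
    end

    fast = proc { nil }
    1_000_000.times { run(fast) }   # monomorphic: optimized path applies
    run(proc { :other })            # a different proc at the same site: slow path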
<enebo[m]> looks good to me
travis-ci has joined #jruby
<travis-ci> jruby/jruby (jruby-9.2:6a0b3ba by Charles Oliver Nutter): The build was broken. https://travis-ci.com/jruby/jruby/builds/221851335 [170 min 56 sec]
travis-ci has left #jruby [#jruby]
<headius[m]> there is the experiment for having split not touch backref at all
<headius[m]> there are about a dozen other methods in String that claim to read or write backref, so the next step would be auditing them to see how many actually need it
<headius[m]> block frame PR is green
ur5us has quit [Ping timeout: 258 seconds]
<headius[m]> If this split PR passes that will be one less method to trigger deoptimization