lucasb has quit [Quit: Connection closed for inactivity]
drbobbeaty has joined #jruby
jeremyevans has quit [Remote host closed the connection]
jeremyevans has joined #jruby
<rtyler1> I have been summoned
<fzakaria[m]> hi
Antiarc has quit [Quit: ZNC 1.7.4+deb7 - https://znc.in]
Antiarc has joined #jruby
<headius[m]> enebo: ok took a while but I have rails new running to completion with -X+C
<headius[m]> pushing last fix in a moment...mostly bugs in the JIT that never came up
<headius[m]> I'd expect rails app to be ok as well...this loads a ton of code
<headius[m]> will try that tomorrow unless you get to it before me
<headius[m]> you may want to use this patch for testing, since I think using JRubyClassLoader interferes with most of the class caching features
<headius[m]> ttfn
nirvdrum has joined #jruby
travis-ci has joined #jruby
<travis-ci> jruby/jruby (jit_irscope_removal:4227eb9 by Charles Oliver Nutter): The build was fixed. https://travis-ci.org/jruby/jruby/builds/642237399 [171 min 1 sec]
travis-ci has left #jruby [#jruby]
<headius[m]> Woot
nirvdrum has quit [Ping timeout: 265 seconds]
rusk has joined #jruby
rusk has quit [Write error: Connection reset by peer]
rusk has joined #jruby
rusk has quit [Read error: Connection reset by peer]
rusk has joined #jruby
drbobbeaty has quit [Ping timeout: 260 seconds]
nirvdrum has joined #jruby
shellac has joined #jruby
nirvdrum has quit [Ping timeout: 268 seconds]
_whitelogger has joined #jruby
nirvdrum has joined #jruby
nirvdrum has quit [Ping timeout: 268 seconds]
drbobbeaty has joined #jruby
lucasb has joined #jruby
shellac has quit [Quit: Computer has gone to sleep.]
shellac has joined #jruby
<headius[m]> Good morning!
<headius[m]> enebo: I just realized that compileAndLoadFile was still parsing
<enebo[m]> ah ok I was hoping something was happening
<headius[m]> I'm trying to refine things a bit before getting back to slides
<headius[m]> also looking at adding the cache logic to non-compile loadFile
<headius[m]> -X+C has other side effects like forcing early JIT
<enebo[m]> yeah personally I think we want --dev to work with caching
<enebo[m]> so I think caching should be somewhat orthogonal
<headius[m]> right
<headius[m]> this would allow that
nirvdrum has joined #jruby
sagax has quit [Remote host closed the connection]
sagax has joined #jruby
nirvdrum has quit [Ping timeout: 268 seconds]
xardion has quit [Remote host closed the connection]
xardion has joined #jruby
<headius[m]> playing with some other tooling to dig deeper...
<headius[m]> -XX:+TraceBytecodes in JDK debug builds prints out every bytecode executed
<headius[m]> jruby --dev -e 1 executes over 33M bytecodes
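A minimal way to reproduce that count, as a sketch; -XX:+TraceBytecodes only exists in debug/fastdebug JDK builds, and the JDK path below is hypothetical:

```sh
# Count bytecodes executed during JRuby startup; requires a debug JDK build.
export JAVA_HOME=$HOME/jdks/jdk13-fastdebug   # hypothetical path to a fastdebug build
jruby -J-XX:+TraceBytecodes --dev -e 1 > /tmp/bytecodes.log
wc -l /tmp/bytecodes.log   # roughly one line per bytecode executed
```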
<enebo[m]> so did decoupling improve things a bit?
<headius[m]> flame graph puts most of the boot time in the preludes, which largely points at RubyGems
<enebo[m]> -X+C will force all IR executed to do a full build
<headius[m]> a bit, I'll try to get some details in a moment
<headius[m]> driving today's exploration from a slide perspective
<headius[m]> enebo: I will push remaining local changes
<enebo[m]> ok
<headius[m]> heh with cached classes we execute 35M
nirvdrum has joined #jruby
<enebo[m]> hmm
<enebo[m]> so colder since it is all new code
<headius[m]> cold vs hot doesn't matter for this measure
<enebo[m]> I believe though one big issue is that this will still file system search all files
<headius[m]> that's certainly true...the classes are based on the full path of a found file
<enebo[m]> and it has to resolve via LOAD_PATH
<enebo[m]> Not really sure I want to go there but if we said that up to point X we just load this script of stuff, then all that file system stuff would disappear
<enebo[m]> I definitely feel that has issues we don't want
<enebo[m]> I am pretty curious about the cost of continuous resolution though
<enebo[m]> anyways I am going on a tangent since file activity really ends up being like 1 more File create
<enebo[m]> both before/after is doing load_path resolution and persist is just checking one more file per resolution
<enebo[m]> I think more broadly we could do a resolutionless core boot but it would be a future
<enebo[m]> So as a future it would be nice to see how expensive resolution is
<headius[m]> resolving files is certainly a nontrivial cost
<headius[m]> I'm not sure how it compares to actually parsing+compiling+interpreting those files though
<enebo[m]> well at one point someone did look at that a bit (nirvdrum?) and it was significant cost
<enebo[m]> I don't recall how significant (I thought more than 10% but less than 20%)
<enebo[m]> nirvdrum: do you recall...or were you even who I am thinking of
<enebo[m]> This obviously was a long time ago as well
<headius[m]> emitting cached classes with indy increases the count to 48M
<headius[m]> enebo: my changes are pushed
<headius[m]> -X+C -Xcompile.class.cache to dump, then just -Xcompile.class.cache to use it (works with --dev)
<headius[m]> oh not just that
<headius[m]> also CLASSPATH
<headius[m]> for making use of them that is
<enebo[m]> so CLASSPATH=~/.ir
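A rough cheat sheet of that dump-then-use cycle; note the property name is spelled both compile.class.cache and compile.cache.classes in this log, so treat the exact spelling as tentative (the timed runs later use compile.cache.classes):

```sh
# 1) dump: force a full AOT build and write .class files under ~/.ir
#    (any command that loads the code you want cached will do)
JRUBY_OPTS="-X+C -Xcompile.cache.classes" jruby -S rails runner 'puts :ok'

# 2) use: put ~/.ir on the JVM classpath; this also works with --dev
CLASSPATH=$HOME/.ir JRUBY_OPTS="-Xcompile.cache.classes" jruby --dev -S rails c
```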
<enebo[m]> qualitatively, class load of AOT replaced parse+build+interp with extra file load + reflective construct + bytecode interp
<headius[m]> I pushed the commit that uses system classloader also
<headius[m]> we'll want to revisit that perhaps
<headius[m]> enebo: yes that's about the size of it
<enebo[m]> when using appcds we still have to file check ~/.ir right?
<enebo[m]> even though we will not load that
<enebo[m]> err meaning we will check status of that but it will already be in VM
<headius[m]> appcds only works with jars so you have to jar up everything in .ir first
<headius[m]> but yes still need that jar on classpath with appcds
<enebo[m]> yeah I get that but does that mean you are checking in the jar then or ~/.ir to see if up to date?
<headius[m]> FWIW using a jar instead of dir in CLASSPATH also helps
<headius[m]> filesystem access for classloading appears to be a good bit slower
<headius[m]> the two flags you want for CDS are -XX:ArchiveClassesAtExit=blah.jsa and then -XX:SharedArchiveFile=blah.jsa to use it
<enebo[m]> Could you maybe make a cheat sheet gist
<headius[m]> I think that archive flag might have been introduced in 13...prior to that you had to do a three-step dance to generate a list of classes to cache
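Roughly, as a sketch; archive and jar names are placeholders, and the two-flag flow is the JDK 13+ dynamic CDS path:

```sh
# AppCDS wants a jar, so jar up the dumped classes first
jar cf ir-cache.jar -C ~/.ir .

# JDK 13+: record an archive on exit, then reuse it
CLASSPATH=ir-cache.jar jruby -J-XX:ArchiveClassesAtExit=jruby.jsa --dev -e 1
CLASSPATH=ir-cache.jar jruby -J-XX:SharedArchiveFile=jruby.jsa --dev -e 1

# pre-13 three-step dance:
#   1. run with -XX:DumpLoadedClassList=jruby.classlist to record classes
#   2. dump with -Xshare:dump -XX:SharedClassListFile=jruby.classlist \
#      -XX:SharedArchiveFile=jruby.jsa
#   3. run with -XX:SharedArchiveFile=jruby.jsa
```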
<headius[m]> I'm testing 13 and 14 mostly
<enebo[m]> ah also I am not trying 9+ yet but I should
<headius[m]> yeah I'll do that for slides anyway but prioritize it today
<headius[m]> ok this is a bit of good news
<headius[m]> Using just --dev on JDK13, -e 1 takes 1.367s
<headius[m]> With cached classes and CDS, that reduces to 1.267s
<headius[m]> unfortunately with no cached classes (interp only) and CDS it drops to 1.081s
<headius[m]> CDS is a win but cold bytecode hurts
<enebo[m]> more good news
<enebo[m]> cache.classes with --dev of 'echo exit | rails c' on one controller app is 14.6-15s but with just --dev and no caching it is 17.2-17.6s
<enebo[m]> yeah definite win...on Java 8
<enebo[m]> whoa Java 13 is 10.8s
<enebo[m]> without even enabling parallel...let's see what it is with just --dev
<headius[m]> well that's promising
<enebo[m]> 13.7 with --dev alone
<headius[m]> -e may be a poor test case for this, which is interesting in itself
<enebo[m]> gem list is a poor one too apparently
<headius[m]> are you passing those flags as env var?
<enebo[m]> yeah since it is double invoke
<headius[m]> pretty sure rails c spawns a second jruby
<headius[m]> ok
<enebo[m]> time echo exit | CLASSPATH=/home/enebo/.ir JRUBY_OPTS="-Xcompile.cache.classes" jruby --dev -S rails c >/dev/null 2>&1
<enebo[m]> I realize I messed up here
<enebo[m]> only --dev for first invoke
rusk has quit [Read error: Connection reset by peer]
<enebo[m]> so maybe not as good as hoped...
<enebo[m]> ok 11.3s for --dev alone now
<enebo[m]> 8.3s for cache
<enebo[m]> which is pretty huge
<enebo[m]> this is without appcds
<headius[m]> oh I moved the logging output to another property for now
<enebo[m]> 7.8s with parGC
<headius[m]> I'll push that change
<enebo[m]> good...that should eliminate the /dev/null need
<enebo[m]> I do find it weird how much faster this mangy 13 is than 8
<headius[m]> yeah that is a bit surprising to me
<enebo[m]> could we add jlink to the mix here and kill some crap on JVM load itself
<headius[m]> when running on 9+ we have to load JRuby into system classloader, which means bytecode verification is back on
<headius[m]> probably?
<enebo[m]> This Java 13 is from my last machine so I will update that and probably update JVMs
<headius[m]> that's a tool I have not learned yet
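For the record, a minimal jlink sketch; the module list below is a guess, and jdeps can compute the real one:

```sh
# Build a trimmed runtime image; use `jdeps --print-module-deps jruby.jar`
# to compute the actual module set instead of this guessed list.
jlink --add-modules java.base,java.logging,java.management,jdk.unsupported \
      --strip-debug --no-header-files --no-man-pages --compress=2 \
      --output jruby-jre
JAVA_HOME=$PWD/jruby-jre jruby -e 1   # JRuby's launcher honors JAVA_HOME
```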
<headius[m]> apparently JDK14 has some hidden enhancements to save some parts of heap as well
<headius[m]> one thing does occur to me about -e 1 versus rails c
<headius[m]> in the former nearly everything executes exactly once
<enebo[m]> so why would 8 be so much slower than 13 if that is the case (verification)
<enebo[m]> almost feels like the opposite
<headius[m]> in the latter I would expect a lot more code is actually being used at runtime
<headius[m]> yes that I don't know
<headius[m]> 13 is slower for -e 1
<headius[m]> I wish we had a single-process example of the size of rails c
<headius[m]> I would like to get a bytecode trace of something large and see what the flame graph looks like
<enebo[m]> 12.6 and 10.8 for Java 8 which is considerably slower for me but this 8 is u172 which is outdated
<headius[m]> the flame graph for -e 1 is dominated by loading preludes as I mentioned above
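One way to get such a startup flame graph, assuming async-profiler is installed at the hypothetical path below:

```sh
# Load async-profiler as an agent so startup itself is captured;
# the output format follows the file extension.
jruby -J-agentpath:/opt/async-profiler/libasyncProfiler.so=start,event=cpu,file=/tmp/jruby-flame.svg \
      --dev -e 1
```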
<headius[m]> I doubt 8 will have improved much on this over the past year
<enebo[m]> So I bet we win on rails because some of the JIT loaded code is hot
<enebo[m]> -e 1 literally does nothing
<headius[m]> yeah that's what I'm thinking too
<enebo[m]> I mean it is fine to focus on it to understand things but it seems reasonable having all that stuff as JITed code leads to other wins
<headius[m]> -e boots a bunch of classes and does a little bit of work in RG but that's it
<headius[m]> and comparing the classes that load versus hot interpreter appears to be a loss
<enebo[m]> This is a pretty big difference
<enebo[m]> 11.4s -> 8.3s without even doing appcds
<headius[m]> oh yeah there's clearly other wins though I'm glad you're seeing startup improvement too
<headius[m]> one big win will be warmup time, which we may or may not have time to try
<enebo[m]> heap should also get better
<headius[m]> but going straight to jitted code for almost everything should clearly reduce the warmup curve
<enebo[m]> I think this will be a winner but removing reflection could end up being another tier for warming up faster
<headius[m]> hmm I'm still having problems getting a full script to properly calculate flags
<headius[m]> if you see some of my commits last night I'm now trying to force interpreter context to get created on parent + children and stuff like that
<headius[m]> so much of this is lazy and it's not clear how to force everything to happen
<enebo[m]> could you verify my single controller echo exit to rails console timings
<headius[m]> I just ran into another file that apparently didn't see that it evals
<headius[m]> rails 6?
<headius[m]> can you push your app somewhere perhaps?
<enebo[m]> good question
<enebo[m]> hahah rails 6 rc2
<enebo[m]> I should update that :)
<enebo[m]> but honestly we should not see much difference
<enebo[m]> ok so that is another thing I will do today is update some software
<enebo[m]> latest 8,13,14 and rerun and make a current Rails 6 app
<nirvdrum> enebo[m]: The PR and numbers I was looking at way back when: https://github.com/jruby/jruby/pull/1604
shellac has quit [Quit: ["Textual IRC Client: www.textualapp.com"]]
<enebo[m]> The other fun part for this talk could be seeing how redmine changes, assuming we do not hit a bug
<enebo[m]> Wouldn't it be grand to see a win like quicker warmup
<headius[m]> enebo: I can confirm it once you have an app I can run...just want to make sure I'm doing the exact same thing you are
<nirvdrum> enebo[m]: It was shaving seconds off for me at the time. That was pre 9k though.
<headius[m]> I am feeling better about this stuff now and we haven't even started to look at heap and warmup
<headius[m]> 9k did remove .class searching among other things
<enebo[m]> nirvdrum: yeah I do remember it being substantial
<headius[m]> that cuts the number of file searches in half
<nirvdrum> enebo[m]: A sticking point at the time was the file watcher APIs weren't available in whatever the latest JDK that Ruboto supported was.
<enebo[m]> HAHAHA LOL
<headius[m]> I'd love to get that cache updated and back in
<headius[m]> and I need to revisit my load branch with all my copious free time
<nirvdrum> Bootsnap does a lot of this now, too, I believe.
<enebo[m]> ok I guess I have a lunch date...so I have to go but we have something interesting and should get a first swipe of talk today
<headius[m]> I have started throwing some slides together
<headius[m]> I will need to include some comparison with other Ruby impls so might have to re-run a few things from my startup time post
<nirvdrum> The basic idea is `require` needs to be responsive to changes to the filesystem to be compliant, but the filesystem likely doesn't change that much during execution. And it would never impact already loaded files. But, because of that feature, the `require` can't cache information about the fs contents and so has to do a full search every time. I added a file watcher so you could cache the contents and let the file watcher invalidate the cache.
<nirvdrum> All of this gets compounded by RubyGems appending stuff to the $LOAD_PATH. So, if you require a gem and that then requires files within the gem, that could result in a full walk of $LOAD_PATH even though the file you want to load is almost certainly at the end of the $LOAD_PATH.
<headius[m]> mmm good point
<headius[m]> and I'm not sure there's any tool to condense an entire app into a single dir
<nirvdrum> But, it's possible a bare `require` from a gem picks up a file from something earlier in the `$LOAD_PATH` and that needs to be honored. Fun stuff.
<headius[m]> bundler can relocate gems into a vendor dir but I believe they still get their own entries
<nirvdrum> IIRC, this was all exacerbated running in a type 2 hypervisor because syscalls are relatively expensive there.
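A quick way to see the syscall side of that on Linux, as a sketch using strace's summary mode:

```sh
# Count file-related syscalls during a bare JRuby startup (Linux only);
# -f follows the forked java process, -c prints a per-syscall summary.
strace -f -c -e trace=open,openat,stat,lstat jruby --dev -e 1 2>&1 | tail -n 20
```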
<headius[m]> enebo: the other half of this we can talk about is that being able to run mostly from bytecode feeds into native+AOT options across the board
rusk has joined #jruby
<headius[m]> so graalvm native is an extreme case but we can use it to compile an entire Ruby app to native code for almost no warmup and much faster startup
<headius[m]> and openj9 JIT caching can get a big boost by saving C1-like native code
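A sketch of that on OpenJ9, assuming JRuby is launched on an OpenJ9 JVM; cache name and directory are arbitrary:

```sh
# First run populates the shared class/JIT cache, second run reuses it.
jruby -J-Xshareclasses:name=jruby,cacheDir=/tmp/j9cache -J-Xquickstart -e 1
jruby -J-Xshareclasses:name=jruby,cacheDir=/tmp/j9cache -J-Xquickstart -e 1
```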
<headius[m]> I am actually surprised I have not run into more scripts that are too big to compile
<headius[m]> almost everything for rails compiles to bytecode ok
<fzakaria[m]> The bundler maintainer broke down the classloading bug in JRuby really nicely:
<fzakaria[m]> Ok, so this seems like due to a weird inconsistency in jruby that should probably be fixed, where Pathname#expand_path with an argument always returns uri:classloader:// (double slash) as the canonical version of "uri:classloader" paths, while Pathname#expand_path without an argument always returns uri:classloader/ (single slash) as the canonical version of "uri:classloader" paths.
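An illustrative repro based on that description; this is an untested sketch, with the expected outputs taken from the report above:

```sh
# reportedly double slash with an argument, single slash without
jruby -e 'require "pathname"
          p Pathname.new("uri:classloader:/gems").expand_path("uri:classloader:/")
          p Pathname.new("uri:classloader:/gems").expand_path'
```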
rusk has quit [Read error: Connection reset by peer]
<headius[m]> enebo: so far I can't reproduce your result
rusk has joined #jruby
<headius[m]> nirvdrum: not to mention filesystem access
rusk has quit [Remote host closed the connection]
<headius[m]> filesystem access does indeed continue to show up in profiles
<headius[m]> hard to know what percentage of actual wall clock time is wasted on filesystem access
<nirvdrum> Yeah. If you're not on an SSD, you'll definitely feel the seek time. But even if you are on SSD, you might have problems in virtual/cloud environment. At least that was what I was seeing.
_whitelogger has joined #jruby
travis-ci has joined #jruby
travis-ci has left #jruby [#jruby]
<travis-ci> jruby/jruby (jit_irscope_removal:d5897d3 by Charles Oliver Nutter): The build was broken. https://travis-ci.org/jruby/jruby/builds/642508705 [167 min 2 sec]
nirvdrum has quit [Ping timeout: 265 seconds]
<headius[m]> enebo: oh one of the issues I ran into is the laziness in def method
<headius[m]> since we don't eagerly compile the body of the method, flags aren't calculated and there's no InterpreterContext
<headius[m]> so that's one thing I had to force in the AOT compilation logic
sagax has quit [Ping timeout: 248 seconds]
<enebo[m]> headius: oh this is from the notion of compiling initially and not detecting things like eval?
<headius[m]> yeah
<enebo[m]> headius: in the case of building the AOT classes we definitely need to force all methods to get instructions and calculate flags
<headius[m]> we have nothing to ensure flags are computed so I was getting bad results until I started forcing it to build
<enebo[m]> we could probably make isEval a field on IRScope and do the logic of that in AST
<enebo[m]> but really all this would help for now is forcing the load of everything when we "compile"
<enebo[m]> in a transparent mode we would want to reduce that cost but for now I think we accept a manual compile run
<headius[m]> yeah it's ok as it is for now, and I'm only forcing it for the scripts we attempt to AOT
<enebo[m]> so you are not seeing a faster run in a simple rails app?
<enebo[m]> I did "compile" all the way through running rails c
<headius[m]> I am not
<headius[m]> I'm not sure I'm running it like you are though
<enebo[m]> time echo exit | CLASSPATH=/home/enebo/.ir JRUBY_OPTS="-Xcompile.cache.classes=false --dev -J-XX:+UseParallelGC" jruby -S rails c >/dev/null 2>&1
<enebo[m]> =false to =true to enable but I did an -X+C originally to "compile" once
<headius[m]> ok
<headius[m]> I'm getting some bytecode traces for -e 1 and gem list with and without caches to show a breakdown of where most bytecodes are executing
<enebo[m]> I am consistently amused/angry about needing to use parallel :)
<headius[m]> excluding preludes, bytecode count reduces during init
<headius[m]> preludes increases though
<enebo[m]> I was going to make a pipeline slide(s) and maybe some historical attempts
<headius[m]> yeah definitely do that
<enebo[m]> probably also something on why it is important
<headius[m]> yeah that too
<headius[m]> bits and pieces from startup time sections of recent JRuby talks
<headius[m]> I think the biggest takeaway from this is that cold bytecode is still the main killer
<headius[m]> we can whittle off 0.1s here and there but when the rubber hits the road waiting for JIT is the real problem
<enebo[m]> other bigger bench thoughts are heap vs process size changes and warmup time and potentially overall throughput once warmed up
<enebo[m]> I am somewhat happy to see a bigger app get a bigger benefit
<enebo[m]> I did not really expect that
<enebo[m]> I am bummed a tiny app is not seeing much positive
<headius[m]> the actual execution of gem list runs fewer bytecodes than booting RubyGems
<headius[m]> enebo: so you are not seeing a gain anymore?
<enebo[m]> I am for rails
<enebo[m]> I don't recall what gem list was or wasnt
<headius[m]> I worry something's broken and it's not loading as much code in cache mode
<headius[m]> after I get this next bytecode trace I'll try your command line here
<enebo[m]> it is possible for sure so I guess we need to figure out if there is a problem or not
<headius[m]> it is counter-intuitive to me that the cached mode is running more bytecodes than pure interp
<headius[m]> I'm trying your command line on a bare rails 6 app now
<enebo[m]> my actual numbers will not match yours as I have no yarn installed currently
<enebo[m]> but both modes hits same point so that should not differ
<enebo[m]> correction: I don't have yarn or nodejs right now
<enebo[m]> ok so I totally nuked ~/.ir and did a full runner script to generate an app
<enebo[m]> Now the difference between cache with and without --dev is 0.3s (14.3 vs 14.0)
<enebo[m]> so cache is still faster but weirdly not dramatically so
<headius[m]> that almost sounds like noise
<enebo[m]> this is also with 6.0.2.1 of rails
<enebo[m]> except it seems consistent
<enebo[m]> so I am going to nuke ~/.ir and do a single rails c with -X+C and see if it is faster again
<headius[m]> I'm getting 12.1s for --dev and 14.6s for --dev with cache
<enebo[m]> I compiled soooo much code running all those rails commands
sagax has joined #jruby
<headius[m]> ah hell
<headius[m]> I thought this trace thing turned off jit but it doesn't
<headius[m]> so that's interesting but I have to rerun numbers
<headius[m]> it would make sense that cached would run more bytecodes because less stuff gets hot
<headius[m]> in the interpreted case the interpreter itself gets hot really fast
<headius[m]> at least I have an explanation now
<headius[m]> I didn't notice this line in the readme:
<headius[m]> To get the full picture of what's going on, especially on a longer running program, it might be interesting to run your program with -Xint, since as things start to get compiled they will disappear from view...
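So the rerun would look something like this, again assuming a debug JDK build:

```sh
# -Xint disables the JVM JIT so compiled methods don't vanish from the trace.
jruby -J-Xint -J-XX:+TraceBytecodes --dev -e 1 > /tmp/bytecodes-int.log
wc -l /tmp/bytecodes-int.log
```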
<enebo[m]> ok so this is interesting
<enebo[m]> I nuked ~/.ir and did rails c once with -X+C
<enebo[m]> then timed the echo exit rails c thing
<enebo[m]> boom the gain is back
<enebo[m]> 14.3s vs 12.4s
<enebo[m]> So seemingly only compiling that much code AOT was helpful but if I compile a lot more code by calling 5-6 rails commands a lot of that gain disappears
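For anyone replaying that experiment, the sequence was roughly this, with flag spellings as used in the timed runs above:

```sh
# Reset the cache, AOT-compile exactly one `rails c` run, then time reuse.
rm -rf ~/.ir
echo exit | CLASSPATH=$HOME/.ir JRUBY_OPTS="-X+C -Xcompile.cache.classes" \
  jruby -S rails c >/dev/null 2>&1
time echo exit | CLASSPATH=$HOME/.ir JRUBY_OPTS="-Xcompile.cache.classes" \
  jruby --dev -S rails c >/dev/null 2>&1
```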
<headius[m]> hmmm
<headius[m]> when did you generate it last?
<headius[m]> one thing we need to be careful of is that this does not do any sanity checking of the cached classes right now
<headius[m]> so it could be that they're stale and it loads them up and blows up and then goes on to interpret anyway
<enebo[m]> I nuked both times and started from scratch
<headius[m]> my commits today did have bytecode changes
<enebo[m]> before runner and then again for the rails console change
<headius[m]> hmm ok
<headius[m]> I have been wiping out too but not seeing gains
<enebo[m]> but I did not have HEAD I think I was lacking last two commits
<enebo[m]> I will retest again though
<headius[m]> so the trace numbers I have now will still be interesting
<headius[m]> since this does not reflect code that jits, it shows how much cold bytecode is executing, which is more in the cached case
<headius[m]> the total bytecode execution count may take a while to generate but it should show the gain in overall execution of JIT compiled script versus interpreter
<headius[m]> executing 33M bytecodes cold is pretty ugly and shows the problem pretty well
nirvdrum has joined #jruby
ur5us has joined #jruby
<headius[m]> these new numbers better look great for cached classes or I'll be really confused
<headius[m]> there's no way precompiled scripts should execute more bytecodes than the IR interpreter
ur5us has quit [Ping timeout: 245 seconds]
ur5us has joined #jruby
nirvdrum has quit [Ping timeout: 268 seconds]
nirvdrum has joined #jruby
ARTSIOM has joined #jruby
drbobbeaty has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
ARTSIOM has quit [Client Quit]
ur5us has quit [Ping timeout: 246 seconds]
drbobbeaty has joined #jruby
lucasb has quit [Quit: Connection closed for inactivity]