cpuguy83 has quit [Remote host closed the connection]
cpuguy83 has joined #jruby
hosiawak has quit [Ping timeout: 268 seconds]
hosiawak has joined #jruby
hosiawak has quit [Remote host closed the connection]
hosiawak has joined #jruby
hosiawak has quit [Ping timeout: 268 seconds]
hosiawak has joined #jruby
hosiawak has quit [Ping timeout: 240 seconds]
hosiawak has joined #jruby
hosiawak has quit [Ping timeout: 268 seconds]
hosiawak has joined #jruby
cpuguy83 has quit [Remote host closed the connection]
cpuguy83 has joined #jruby
cpuguy83 has quit [Remote host closed the connection]
cpuguy83 has joined #jruby
cpuguy83 has quit [Ping timeout: 240 seconds]
<headius[m]> I cracked the code
<headius[m]> for some reason JRuby sends the headers early as a separate packet, and the client doesn't ack that until after 0.04s or so
cpuguy83 has joined #jruby
cpuguy83 has quit [Ping timeout: 240 seconds]
cpuguy83 has joined #jruby
cpuguy83 has quit [Remote host closed the connection]
Antiarc has quit [Quit: ZNC 1.7.4+deb7 - https://znc.in]
Antiarc has joined #jruby
hosiawak has quit [Ping timeout: 245 seconds]
hosiawak has joined #jruby
hosiawak has quit [Ping timeout: 250 seconds]
hosiawak has joined #jruby
hosiawak has quit [Ping timeout: 276 seconds]
_whitelogger has joined #jruby
cpuguy83 has joined #jruby
hosiawak has joined #jruby
KeyJoo has joined #jruby
cpuguy83 has quit [Ping timeout: 240 seconds]
hosiawak has quit [Remote host closed the connection]
rusk has joined #jruby
drbobbeaty has joined #jruby
drbobbeaty has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
KeyJoo has quit [Quit: KeyJoo]
shellac has joined #jruby
drbobbeaty has joined #jruby
lucasb has joined #jruby
shellac has quit [Quit: Computer has gone to sleep.]
shellac has joined #jruby
subbu is now known as subbu|away
cpuguy83 has joined #jruby
cpuguy83 has quit [Ping timeout: 240 seconds]
bzb has joined #jruby
subbu|away is now known as subbu
enebo has quit [Ping timeout: 246 seconds]
xardion has quit [Remote host closed the connection]
xardion has joined #jruby
enebo has joined #jruby
<headius[m]> Good morning!
bzb has quit [Quit: Leaving]
enebo has quit [Ping timeout: 240 seconds]
enebo has joined #jruby
shellac has quit [Ping timeout: 245 seconds]
<headius[m]> aha, I finally cracked this ACK nut
<headius[m]> crACKed it
<rdubya[m]> 🥳
<headius[m]> from 200 req/s to 20k
<rdubya[m]> nice
<headius[m]> unfortunately it's not good news
<rdubya[m]> not so nice lol
<headius[m]> the problem lies in our socket impl
<headius[m]> I've just hacked around part of it in the benchclient
<headius[m]> well "problem" may be a stretch
<headius[m]> we work properly, but because we don't obey all the socket flags puma sets, we send response packets differently than MRI
<headius[m]> it just happens to hit the delayed ACK problem
<headius[m]> actually I have to update that because I did just get QUICKACK to work
<headius[m]> there
<headius[m]> ok so the "bug" is basically that JRuby's response is broken into packets differently than MRI's
<lopex> ip packets ?
<headius[m]> I believe it's TCP_CORK doing it, but MRI sends headers plus the first part of the response all at once while acking the request
<lopex> er, tcp
<enebo[m]> yeah so for hello world size payloads which can complete processing a request in much less than 0.04s people will notice us as slower
<headius[m]> we send headers as one packet and then wait for ack before sending the body
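For reference, a minimal Linux-only sketch of the corking behavior being described here, assuming Ruby's Socket::TCP_CORK constant is available; the port and payload are placeholders for illustration, not Puma's actual code:

    require "socket"

    server = TCPServer.new(9292)                    # placeholder port
    client = server.accept
    # cork: ask the kernel to hold partial frames until uncorked
    client.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_CORK, 1)
    client.write("HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n")  # headers buffered, not sent yet
    client.write("hello")                           # body joins the same buffer
    # uncork: headers + body flush onto the wire together
    client.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_CORK, 0)
    client.close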
<headius[m]> enebo: yeah if the request required more than 40ms we probably would not notice this, and also possibly if it were a larger packet
<enebo[m]> so my main takeaway is people doing minimal benchmarking may end up drawing a poor conclusion
<lopex> mss ?
<headius[m]> MRI does not have the 40ms problem even though the body gets broken into two packets
<headius[m]> may be because it closes connection after the second part of response
<enebo[m]> if QUICKACK is a client-side setting then realistically I doubt we can influence this mistake in evaluations of Rubies either
<headius[m]> there's some questions remaining but the bottom line is that we packetize the response differently and it just so happens to trigger this delay problem
<headius[m]> right, it's a client-side socket option that appears to get reset to default very easily
<enebo[m]> I guess in our talk we can lightly cover the dangers of using too tiny of a bench as well as point out this issue
<headius[m]> my first patch set it right after connection establishment and that didn't work
<enebo[m]> With a larger bench showing us perform reasonably it should be easier to make the point
<headius[m]> which sent me down this rat's nest of tcpdump
<headius[m]> I moved the option to immediately before request write and that did it
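A minimal sketch of that client-side workaround, assuming a Linux Ruby where Socket::TCP_QUICKACK is defined; the host, port, and request are placeholders. The point is that the option has to be re-armed before each write, since the kernel quietly resets it to its default:

    require "socket"

    sock = TCPSocket.new("127.0.0.1", 9292)   # placeholder host/port
    5.times do
      # re-enable QUICKACK immediately before the write; setting it once
      # right after connect was not enough, as described above
      sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_QUICKACK, 1)
      sock.write("GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
      sock.readpartial(65_536)                # read (part of) the response
    end
    sock.close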
<enebo[m]> it is troubling that people make decisions running really silly benchmarks. I do get it too, but it is still troubling
<enebo[m]> of course unless someone wants to make a scalable date service :)
<headius[m]> from my research it appears this is a problem on Linux and on Windows, though on Windows it appears there's a way to disable the delay globally
<headius[m]> Linux does not normally have a way unless you use a kernel patch, which is apparently associated with realtime...sorta makes sense; if you want realtime behavior you don't want randomly delayed ACKs
<headius[m]> well this would certainly affect, say, an NTP server
<enebo[m]> headius: half serious question...what issues would prevent us going full native for sockets? ssl?
<headius[m]> primarily it's the many socket structs
<enebo[m]> I guess there is some variety there
<headius[m]> we could really just adopt the FFI-based socket lib that was written for rbx and which TR has been improving, but in both of their cases they have a build time step on each system to generate the struct layouts
<headius[m]> in theory that library is at least as good as what we have and probably a lot better in most ways
<enebo[m]> in most cases it feels like a difference in behavior more than different struct layouts, but "it is a big undertaking" I guess is the main answer
<enebo[m]> meh on ffi sockets
<headius[m]> There's a possible out for us too, though... netty ships a pure-native socket library for Windows and Linux that has all the options and flags and such
<enebo[m]> aha yeah I wondered about netty since it is the god of servers
<headius[m]> of course that ties us to a pre-built binary
<headius[m]> yeah they basically figured all this out but never blogged it
<headius[m]> as far as I could find
<enebo[m]> and will limit what platforms we are on without contributing to that library
<headius[m]> correct
<enebo[m]> At this point I would say education on client benching may be best short term bet. Perhaps we can have some benchmarking page on our wiki
<headius[m]> of course we can maintain a dual impl that's JDK sockets on unsupported platforms, but yeah
<enebo[m]> because 0.04s is not a massive penalty for a real app and we are unsure whether MRI or JRuby will really pay this in ordinary responses
<headius[m]> it's also possible that we could pregenerate socket structs for all the platforms we need and it would be fine
<headius[m]> I mean, it's a finite set
<enebo[m]> the lazy are never going to follow the full logic tree of that
<enebo[m]> I am wrestling with the effort vs the actual problem
<headius[m]> yeah it's a bit frustrating
<enebo[m]> If it was low effort I feel this conversation would be worth more
<headius[m]> we aren't really broken here
<headius[m]> we're just not obeying some low-level, Linux-specific packetization flags
<headius[m]> really it's JDK that's broken if these flags are really recommended or required, because we're just calling JDK's sockets
<enebo[m]> as it stands it really hurts bare-metal benching, which is misused by some to evaluate JRuby, but realistically this is probably not a real problem for nearly all conventional uses
<headius[m]> perhaps...I don't have any idea how big a problem this might be in reality
<enebo[m]> you also bring up a reasonable point. JDK {n} may end up fixing this at some point too
<headius[m]> it's not like it adds 40ms across the board, or even reliably
<enebo[m]> well we need to probably look at tcpdump and see how common that delay is
<enebo[m]> you actually know enough now to be able to observe whether bigger stuff even has the issue at least
<headius[m]> at worst it's 40ms minus whatever time is consumed between the request and the final ack of the response
<headius[m]> so request handling eats part of that 40
<enebo[m]> perhaps we need to do more analysis before we evaluate solutions
<enebo[m]> yeah
<headius[m]> I think the problem here is that this intermediate packet doesn't look like it needs an immediate ack, so client doesn't send one
<headius[m]> I put both our packets and MRI's packets (minus the final part of the body) here: https://gist.github.com/headius/66abd0cb5142240026dc6562919b039a
<headius[m]> actually nevermind...I see now the MRI response packet is 25068, which is 25k for the body and 68 for the headers
<headius[m]> so MRI manages to respond in exactly one packet
<headius[m]> the delayed ack doesn't matter because it's done
<enebo[m]> whoa how big are packets these days?
<headius[m]> yeah seems big
<enebo[m]> has it always been 65k?
<headius[m]> let me grab MRI's final ack
<headius[m]> I suspect it's sent immediately because wrk sends the next request with it
<headius[m]> so the pipeline flows
<enebo[m]> aha ok coming back
<enebo[m]> a TCP packet can be 65k but the Ethernet frame MTU is like 1500
<lopex> isnt this tcpi_snd_mss and tcpi_rcv_mss ?
<enebo[m]> so lower physical layers are breaking that up and TCP does a bunch of hijinx to assemble that into a "packet"
<lopex> mtu is on eth only right ?
<headius[m]> yup that's it
<enebo[m]> including retransmission or reordering if one gets sent out of order
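One way to see the negotiated segment size from Ruby, as a sketch (host and port are placeholders); on loopback the MSS is far larger than the ~1460 bytes implied by a 1500-byte Ethernet MTU, which is how a 25k response can go out as "one packet" locally:

    require "socket"

    sock = TCPSocket.new("127.0.0.1", 9292)   # placeholder host/port
    mss = sock.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_MAXSEG).int
    puts "negotiated MSS: #{mss} bytes"       # ~64k on loopback, ~1460 on ethernet
    sock.close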
<headius[m]> so here's the tail end packet info from MRI response followed by the ack from client
* headius[m] sent a long message: < >
<headius[m]> it sends the next request with the ack
<headius[m]> in the JRuby case there's no data to send with the ack so it delays it
<headius[m]> and server sits waiting for it
<enebo[m]> we need to send "
<enebo[m]> :)
<headius[m]> yeah I don't remember what you can send from client in the middle of the response
<headius[m]> in any case it's not "broken", it's just unfortunate
<enebo[m]> This is interesting too that you are using localhost
<headius[m]> puma could be modified to do this better, possibly...it does do separate writes for headers and body
<enebo[m]> I mean I get what is happening but on a real network multiple packets have their own latency as well
<headius[m]> presumably that's why TCP_CORK is used, so those get sent as a single packet
<enebo[m]> so likely this will not be as big a deal in a real app
<headius[m]> hey you know what, I'll try a 70k response on MRI
<enebo[m]> I guess MRI doing it in one packet is the big win here
<headius[m]> that should break response into two packets and have the same problem if theories line up
<enebo[m]> yeah makes sense
<headius[m]> interesting
<headius[m]> no delay problem...but MRI does not wait to send the second part of the response
<headius[m]> so the ack comes with the next request anyway, after two packets from server
<headius[m]> that may be a clue...it's possible TCP_NODELAY is making this work for them
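A sketch of the server-side flag being suspected here, assuming Ruby's Socket::TCP_NODELAY (one of the socket flags Puma sets); the port and oversized body are placeholders. Disabling Nagle's algorithm lets the kernel push writes immediately instead of waiting on the outstanding ACK:

    require "socket"

    server = TCPServer.new(9292)              # placeholder port
    client = server.accept
    client.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, true)
    body = "x" * 70_000                       # large enough to span multiple packets
    client.write("HTTP/1.1 200 OK\r\nContent-Length: #{body.bytesize}\r\n\r\n")
    client.write(body)                        # pushed immediately, no Nagle delay
    client.close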
cpuguy83 has joined #jruby
<headius[m]> ahah but we are ok too now
<enebo[m]> yay
<headius[m]> I'll revert my socket patches to JRuby and try it
<headius[m]> yeah it's working
<headius[m]> heh, now I want to make sure it's still broken
Antiarc has quit [Remote host closed the connection]
<headius[m]> yeah back down to 25k and it breaks
<headius[m]> and restore QUICKACK and it's fixed
<headius[m]> so I'm not entirely clear why the larger response means server doesn't wait for ack
Antiarc has joined #jruby
<enebo[m]> probably just hopes for the best with possible retransmission later
<headius[m]> we do still break the headers off as a separate initial packet
<headius[m]> but unlike the 25k case, the 75k case starts right in on the body
<enebo[m]> I guess though if in both of these cases the server wants to send two packets, why would it wait in one case
<headius[m]> "we"
<headius[m]> I mean the kernel
<headius[m]> could be that the kernel sees the next packet is a big sucker and just goes for it
<enebo[m]> syscall is likely a big difference for MRI
<headius[m]> yeah I guess that's what you said
<headius[m]> so like it knows there's more packets to come and has to dump the buffer
<enebo[m]> I can only ponder why a larger amount of data would just decide to push forward
<headius[m]> right
<headius[m]> that gets into deeper magic of how it's deciding when to send packets
<enebo[m]> but TCP does have to potentially retransmit lost packets too so it is not too clear to me why this is the case
<headius[m]> in any case it's not something I can see in tcpdump and I suspect I wouldn't see puma actually waiting
<enebo[m]> yeah
<headius[m]> it's merrily tossing stuff onto the wire and it's just that stars align when the body fits entirely in a packet
<headius[m]> kernel holds it for ack
enebo has quit [Ping timeout: 245 seconds]
subbu is now known as subbu|lunch
hosiawak has joined #jruby
<hosiawak> headius[m]: I managed to figure out deployment on Puma, this real life Rails app does 2x better in terms of reqs/s on JRuby than on MRI 2.6 (invokedynamic + Graal VM), it also uses 1.2 GB as opposed to 2.5 GB on MRI and runs all the bg processes (delayed job, sidekiq, mails etc.) in the same process using Quartz. So overall I'm very pleased with JRuby. Thanks for your great work :)
cpuguy83 has quit [Remote host closed the connection]
<headius[m]> Oh, you know I was going to ask if you really needed to use a war file...we definitely recommend deploying on Puma if you can because the Java server model is kind of a dying practice
<headius[m]> And those numbers look excellent! Maybe we can get you to do a guest blog post at some point
subbu|lunch is now known as subbu
<lopex[m]> headius: did you see the message above?
<headius[m]> MTU?
cpuguy83 has joined #jruby
<lopex[m]> the one about jruby/puma/graal
enebo has joined #jruby
<headius[m]> I did, that's what I was responding to
<headius[m]> hosiawak fwiw we have not seen GraalVM perform better than openjdk on any non-trivial jruby application
<headius[m]> If you haven't tried it already, I would recommend testing openjdk. If your app is actually faster on Graal VM it would be a first
<enebo[m]> hosiawak: Also if you are using Java 9+ be sure to specify parallel GC: -J-XX:+UseParallelGC
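For reference, a hypothetical invocation passing that flag through to the JVM (the Rack config path is a placeholder):

    jruby -J-XX:+UseParallelGC -S puma config.ru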
<enebo[m]> lopex: MTU is ethernet only I believe
<lopex> enebo[m]: yeah MSS is the tcp one right ?
<enebo[m]> lopex: you are dredging info I used to enjoy 30 years ago :P
<enebo[m]> lopex: possibly...I wonder if I still have that Stevens network book
<enebo[m]> I believe I have some ancient TCP/IP multi-volume thing too
<enebo[m]> yeah MSS is the TCP one
<lopex> yeah, I was just asking
<enebo[m]> Maximum Segment Size
<enebo[m]> Maximum Transmission Unit
<lopex> I know
<enebo[m]> Synonyms Suck Folks (SSF)
<lopex> was just wondering about that packet size and mss
<lopex> since it's on the tcp socket right ?
<enebo[m]> yeah mss will be the limiting factor I guess?
<enebo[m]> Part of me is weirded out this design has held up for so long
<enebo[m]> I mean I know things have changed here and there but overall the main bits have held up
<enebo[m]> And this was all designed around potentially adding other protocols
<enebo[m]> Is it brilliant or just too incumbent to ever change?
<lopex> like that bgp thingy ?
<lopex> I read it's even worse
<enebo> BGP is on top of transport though right?
<lopex> you tell me
<lopex> it would make sense
<enebo> NO YOU TELL ME :)
<headius[m]> I still have my Stevens networking book
<headius[m]> no idea if it's still useful though
drbobbeaty has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
hosiawak has quit [Ping timeout: 240 seconds]
cpuguy83 has quit [Remote host closed the connection]
rtyler has joined #jruby
<rtyler> headius[m]: this may be relevant to your interests https://twitter.com/damageboy/status/1194751035136450560
<headius[m]> oh fun
cpuguy83 has joined #jruby