NightMonkey has joined #jruby
NightMonkey has quit [Ping timeout: 240 seconds]
slyphon has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
claudiuinberlin has joined #jruby
NightMonkey has joined #jruby
shellac has quit [Quit: Computer has gone to sleep.]
shellac has joined #jruby
shellac has quit [Client Quit]
shellac has joined #jruby
drbobbeaty has quit [Ping timeout: 265 seconds]
<GitHub78> [jruby] ninkibah opened issue #5238: Jruby 9.1.16 still slow on Java8 without -Xcompile.invokedynamics=false https://git.io/fbpNn
rdubya has joined #jruby
rdubya1 has joined #jruby
rdubya has quit [Read error: Connection reset by peer]
drbobbeaty has joined #jruby
jrafanie has joined #jruby
jrafanie has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
rdubya1 has quit [Quit: Leaving.]
jrafanie has joined #jruby
jrafanie has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
slyphon has joined #jruby
<enebo> ys
<nirvdrum> lopex: Do you know when rb_enc_fast_mbclen, rb_enc_mbclen, and rb_enc_precise_mbclen should be called?
_whitelogger has joined #jruby
<lopex> eregon: so we were about to reinvent the wheel
<lopex> nirvdrum: let me recall
<enebo> eregon: do you recall tuning it this way to reduce memory?
<lopex> nirvdrum: there's also onigenc_mbclen_approximate
<nirvdrum> lopex: Yeah. I'm looking at the usages and I'm having a hard time really working out why there's both rb_enc_mbclen and rb_enc_precise_mbclen.
<enebo> I have wondered why one is chosen over the other
<nirvdrum> And then there are macros like MBCLEN_CHARFOUND_LEN that do nothing.
<enebo> non-precise when you don't have to actually know length vs it being more than 1 sort of thing
<nirvdrum> enebo: I assume it's a CR_BROKEN vs non-broken thing. And maybe to ensure that your head pointer isn't in the middle of a multi-byte char sequence.
<lopex> nirvdrum: I think we should talk about their semantics since those names are hopeless
<lopex> and then assign those to the names
<enebo> nirvdrum: ah so precise may just explode in bad strings
<lopex> nirvdrum: easy to get lost
<lopex> nirvdrum: I see at least three: non-validating; validating, returning the number of missing bytes; validating with a fallback to 1 as the length
<nirvdrum> I think it just returns a negative number if it needs more bytes.
<lopex> nirvdrum: oniguruma uses the last one for parsing so it doesn't enter infinite loops
<nirvdrum> Okay. It's probably used for CR_BROKEN then. I've noticed on invalid byte sequences, the length is treated as "1".
<lopex> nirvdrum: the bigger issue is where they have extra guards in client code
<lopex> nirvdrum: or unknown
<nirvdrum> Where do you see the difference in validating vs non-validating?
<lopex> nirvdrum: validating, like where the client code always checks for negative ?
<nirvdrum> As far as I can tell, in jcodings it always validates and it's up to the caller to determine whether to ignore negative values.
<lopex> yeah
<nirvdrum> Okay.
<lopex> nirvdrum: but some client code might pass the result unchecked
<nirvdrum> I meant "validating" as in "validating the byte sequence is valid for the given encoding".
<nirvdrum> But even that's weird because the return value only considers the first character, correct?
<lopex> yeah, for truncated / broken
<enebo> but that is not valid?
<lopex> nirvdrum: no
<enebo> sorry I will not interject :)
<lopex> nirvdrum: utf8 goes through a number of tables for each position
<lopex> nirvdrum: so it walks the char
<nirvdrum> But if you give it a byte sequence with multiple characters, you only get the byte length of the first one, no?
<lopex> yes
<nirvdrum> This further confuses the semantics of these length functions for me.
<nirvdrum> I think where I'm driving at is if we know a String is not CR_BROKEN, as most strings are, I don't think we need to be doing all this extra validation.
<nirvdrum> You'd need to be careful that you're on a proper character boundary, of course.
<lopex> nirvdrum: but there
<nirvdrum> But UTF-8 has a special bit sequence to indicate continuation bytes.
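The continuation-byte property nirvdrum mentions can be sketched as follows; the class and method names here are invented for illustration, not the jcodings API. Every UTF-8 continuation byte has the pattern 10xxxxxx, so one mask distinguishes it from a leading byte and lets a caller re-synchronize on a character boundary.

```java
// Hypothetical helper illustrating UTF-8's continuation-byte pattern.
public class Utf8Continuation {
    // Every continuation byte is 10xxxxxx, so a single mask identifies it.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Walk backwards from an arbitrary offset to the start of the character,
    // relying on continuation bytes being self-identifying.
    static int charStart(byte[] bytes, int p) {
        while (p > 0 && isContinuation(bytes[p] & 0xFF)) p--;
        return p;
    }
}
```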
<lopex> nirvdrum: there's two options like failing altogether or falling back to 1 just to safely advance right ?
<lopex> nirvdrum: that's why the default returns missing bytes
<nirvdrum> I'm ignoring the CR_BROKEN case for the moment.
<lopex> in jcodings now
<lopex> and unknown right ?
<nirvdrum> No. I'm particularly interested in CR_VALID.
<nirvdrum> CR_7BIT is trivial. Always return 1.
<lopex> and for valid the same just the length of a char
<nirvdrum> CR_UNKNOWN sounds like it always needs rb_enc_precise_mbclen.
<lopex> yes
<lopex> nirvdrum: or approximate
<eregon> enebo: mostly for access and building speed + escape analysis I think
<nirvdrum> So, I'm confused by the use of rb_enc_mbclen on CR_VALID strings.
<lopex> nirvdrum: yeah, it's a wasted effort
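The split nirvdrum sketches here (CR_7BIT trivially 1, CR_VALID a plain leading-byte lookup) might look like the following; the CR_* constants, table, and names are illustrative assumptions, not the actual StringSupport/jcodings code, and CR_UNKNOWN/CR_BROKEN strings would still need the validating path.

```java
// Illustrative length lookup for non-broken code ranges, per the discussion.
public class CodeRangeLength {
    static final int CR_7BIT = 0, CR_VALID = 1;

    // Byte length of a UTF-8 character by its leading byte (0 = invalid lead).
    static final int[] UTF8_LEN = new int[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) UTF8_LEN[b] = 1;
        for (int b = 0xC2; b <= 0xDF; b++) UTF8_LEN[b] = 2;
        for (int b = 0xE0; b <= 0xEF; b++) UTF8_LEN[b] = 3;
        for (int b = 0xF0; b <= 0xF4; b++) UTF8_LEN[b] = 4;
    }

    // For CR_7BIT strings every char is one byte; for CR_VALID UTF-8 the
    // leading byte alone gives the length, with no validation needed.
    static int charLength(byte[] bytes, int p, int codeRange) {
        return codeRange == CR_7BIT ? 1 : UTF8_LEN[bytes[p] & 0xFF];
    }
}
```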
<eregon> enebo: and avoid too many allocations so less memory used too
<enebo> eregon: ok yeah graal seems to do better with unwrapping boxed values (like hash)
<enebo> hashcode as Integer
<lopex> nirvdrum: we're on the same page I guess
<nirvdrum> lopex: Why is Encoding#length(byte) deprecated then?
<nirvdrum> Being able to use the UTF8EncLen table looks like it'd be much faster.
<lopex> nirvdrum: that was a remnant of an old oniguruma api
<nirvdrum> Okay.
<enebo> hotspot will not escape-analyze those, so then we have to entertain part of the triple maybe being 2 arrays
<lopex> nirvdrum: it was totally non-validating
<eregon> enebo: yeah, most operations on small Hash can basically fold at compilation if the Hash is escape analyzed
<lopex> nirvdrum: we should add two length methods on Encoding I think
<enebo> sorry I meant cache hashcode value as boxed int in that array not Ruby Hashes
<lopex> nirvdrum: precise and approx
<eregon> enebo: yep, we didn't care about it so far, but maybe we should
<nirvdrum> lopex: So backing up. Because I want to really make sure we're talking about the same thing.
<enebo> eregon: seems Graal is pretty good at figuring that case out so you maybe don't need to
<eregon> enebo: if the Hash doesn't escape then no need to box, and no allocation ever. But if it does escape we'll box those 1 to 3 hashes to Integer
<nirvdrum> lopex: Actually, what's the difference in rb_enc_fast_mbclen and rbc_enc_mbclen then? They both validate the byte sequence for the given encoding, right?
<enebo> eregon: yeah
<enebo> eregon: That was what I was trying to say
<eregon> :)
<enebo> eregon: corollary for us is most people still use hotspot so we are tailoring more for that
<enebo> eregon: since your impl probably will never care about hotspot primarily I would just do the simpler thing if graal figures it out
claudiuinberlin has quit [Quit: Textual IRC Client: www.textualapp.com]
<enebo> nirvdrum: lopex: can one of you add some comments explaining semantics to StringSuport or wherever those methods are once this conversation is over?
<enebo> I admit I tend to just use precise and I can see that probably was a big mistake :P
<enebo> I would almost be in favor of fixing the names and just adding an rb: comment
<nirvdrum> +1
<lopex> nirvdrum: they both use ONIGENC_PRECISE_MBC_ENC_LEN
<lopex> nirvdrum: but they have a precise_mbc_enc_len function pointer on the encoding, so we're compatible on this in jcodings
<lopex> so everything goes through enc->precise_mbc_enc_len
<lopex> nirvdrum: without a table with those names we won't be able to talk though
<lopex> which does what in mri code
<nirvdrum> So, rb_enc_precise_mbclen returns either the MBC byte length or a negative number indicating more bytes are needed. rb_enc_mbclen calls rb_enc_precise_mbclen and if it sees a negative number, it returns 1 if the encoding's min. char length is less than or equal to the remaining bytes. But what does rb_enc_fast_mbclen do? Does it do any byte sequence validation at all?
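nirvdrum's summary can be sketched as a pair of functions. This is a hedged, UTF-8-only illustration of the semantics as described in the conversation (negative result = bytes missing or sequence invalid; forgiving wrapper falls back toward the minimum char length), not MRI's actual implementation; the table and method names are assumptions.

```java
// Sketch of the precise/forgiving pair described above.
public class MbcLen {
    static final int[] UTF8_LEN = new int[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) UTF8_LEN[b] = 1;
        for (int b = 0xC2; b <= 0xDF; b++) UTF8_LEN[b] = 2;
        for (int b = 0xE0; b <= 0xEF; b++) UTF8_LEN[b] = 3;
        for (int b = 0xF0; b <= 0xF4; b++) UTF8_LEN[b] = 4;
    }

    // "Precise": the full char length, or a negative count of missing bytes
    // for a truncated sequence, or -1 for an outright invalid one.
    static int preciseMbcLen(byte[] bytes, int p, int end) {
        int n = UTF8_LEN[bytes[p] & 0xFF];
        if (n == 0) return -1;                   // invalid leading byte
        if (p + n > end) return -(p + n - end);  // that many bytes missing
        for (int i = 1; i < n; i++) {
            if ((bytes[p + i] & 0xC0) != 0x80) return -1; // bad continuation
        }
        return n;
    }

    // Forgiving wrapper: on failure, fall back to the encoding's minimum
    // char length (1 for UTF-8), clamped to the bytes actually remaining.
    static int mbcLen(byte[] bytes, int p, int end, int minLength) {
        int n = preciseMbcLen(bytes, p, end);
        return n > 0 ? n : Math.min(minLength, end - p);
    }
}
```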
<lopex> nirvdrum: I'll do a write-up
<nirvdrum> Thanks.
<lopex> rb_enc_fast_mbclen -> ONIGENC_MBC_ENC_LEN -> onigenc_mbclen_approximate
<lopex> nirvdrum: ^^
<nirvdrum> I think I get a bit confused, too, because jcodings only has Encoding#length. The rest is implemented in StringSupport.
<lopex> nirvdrum: but I'll go through all those with the writeup
<enebo> lopex: I want an english description of what it actually does vs just the relationship alone
<lopex> enebo: I want too :P
<enebo> hahah ok
<lopex> the problem is that it's an evolving api on MRI's side
<enebo> nirvdrum: I would like those moved into jcodings if possible so we can make an optimized length on valid and then share that logic
<lopex> and yes, they do a LOT of unnecessary work
<nirvdrum> So onigenc_mbclen_approximate calls ONIGENC_PRECISE_MBC_ENC_LEN as well.
<lopex> nirvdrum: hence we want additional non validating length on encoding
<nirvdrum> This is nutty.
<lopex> nirvdrum: last question is where to fail and where to advance by 1 on an invalid char
<nirvdrum> Right.
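The "advance by 1 on invalid char" policy lopex raises is what keeps a scanner from spinning forever on broken input. A hypothetical sketch (the table and class are invented for illustration, not jcodings API):

```java
// Fallback-to-1 scanning: on an invalid lead or truncated tail, advance one
// byte, so the loop is guaranteed to make progress and terminate.
public class SafeScan {
    static final int[] UTF8_LEN = new int[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) UTF8_LEN[b] = 1;
        for (int b = 0xC2; b <= 0xDF; b++) UTF8_LEN[b] = 2;
        for (int b = 0xE0; b <= 0xEF; b++) UTF8_LEN[b] = 3;
        for (int b = 0xF0; b <= 0xF4; b++) UTF8_LEN[b] = 4;
    }

    static int countChars(byte[] bytes) {
        int p = 0, chars = 0;
        while (p < bytes.length) {
            int n = UTF8_LEN[bytes[p] & 0xFF];
            // Fall back to 1 when the char is invalid or runs past the end.
            p += (n > 0 && p + n <= bytes.length) ? n : 1;
            chars++;
        }
        return chars;
    }
}
```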
<nirvdrum> And I'd love to know what "fast" means in rb_enc_fast_mbclen, because it looks like it does much the same work as the others.
<nirvdrum> I think for the purposes of the API, I'm willing to accept as an invariant that the caller only sets the head pointer to a byte position corresponding to the first byte in a character. By accepting that, I can assume the CR of the full string can be respected. In particular, if the full string is not CR_BROKEN, then any individual character in it cannot be broken.
<lopex> nirvdrum: and for valid utf-8 you could use that fast walking thing
<lopex> without array lookups
<lopex> nirvdrum: actually, would it work as a c ext ?
<nirvdrum> I think the MRI rules are if you modify a String in an extension, you're responsible for updating its code range as well.
<lopex> nirvdrum: I mean the code which treats char* as unsigned int*
<nirvdrum> I guess it would, but you'd have to cross a native boundary so I'm not sure how much you'd gain.
<nirvdrum> But with valid UTF-8, that's not really an issue. You only need to look at the leading byte.
<nirvdrum> On the whole, I think we (MRI, JRuby, TruffleRuby, etc.) should be okay specializing for UTF-8.
<lopex> it's branchless
<nirvdrum> Support the other encodings, of course, but I don't think everything should be as slow as the lowest common denominator.
<nirvdrum> Interesting.
<lopex> and only for valid utf8 of course
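For valid UTF-8, the branchless, table-free idea lopex mentions can be approximated by counting the leading one bits of the first byte; this particular formulation is my assumption, not necessarily the code lopex had in mind.

```java
// Table-free leading-byte length for *valid* UTF-8: the count of leading
// 1-bits in the first byte is the sequence length (ASCII has zero, so clamp
// to 1). Only correct when the input is known to be valid UTF-8.
public class BranchlessUtf8 {
    static int charLength(int leadingByte) {
        // Shifting into the top byte and inverting turns the leading 1-bits
        // into leading 0-bits, which numberOfLeadingZeros then counts.
        return Math.max(1, Integer.numberOfLeadingZeros(~(leadingByte << 24)));
    }
}
```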
shellac has quit [Quit: Computer has gone to sleep.]
<lopex> nirvdrum: once I made Unsafe version of it, but there were some problems I forgot
<lopex> but I saw 3x speedups even with unsafe overhead on long strings
<nirvdrum> I've seen implementations that vectorize as well.
shellac has joined #jruby
shellac has quit [Quit: Computer has gone to sleep.]
<lopex> enebo: wrt that hash headius proposed we could keep that in the bucket array in a nonscan version
<lopex> enebo: also, with rare enough collisions we could get rid of that
claudiuinberlin has joined #jruby
ahorek has joined #jruby
ahorek has quit [Client Quit]
<GitHub162> [jruby] kares opened pull request #5239: review String -> RubyString UTF (8) encoding (master...string-encode) https://git.io/fbh9q
Antiarc has quit [Quit: ZNC 1.6.6+deb1 - http://znc.in]
ahorek has joined #jruby
Antiarc has joined #jruby
shellac has joined #jruby
<ChrisBr> headius: there?
cschneid_ has quit [Ping timeout: 260 seconds]
cschneid has joined #jruby
claudiuinberlin has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<headius> ChrisBr: I'm here now
<ChrisBr> ah great :)
<ChrisBr> how do we want to continue regarding the hash PR?
<ChrisBr> headius: ^^
<ChrisBr> I probably will have some time over the we and next week to finish things off
claudiuinberlin has joined #jruby
claudiuinberlin has quit [Quit: Textual IRC Client: www.textualapp.com]
drbobbeaty has quit [Ping timeout: 260 seconds]
Antiarc has quit [Quit: ZNC 1.6.6+deb1 - http://znc.in]
Antiarc has joined #jruby
drbobbeaty has joined #jruby
drbobbeaty has quit [Client Quit]
<headius> ChrisBr: hey sorry notifications not great
<headius> ChrisBr: well if it's in a mergeable state now I'd say we should go for it and start letting it bake a bit
<headius> I would like to see how the new hash structure interacts with Graal optimizations too
ianz has joined #jruby
<ianz> it appears to not be on the s3 bucket.
<headius> kares: good call on using jmh...we should add more jmh benchmarks
<headius> ianz: hmm
<headius> ianz: I'm not sure why it's not there, I thought enebo restored all the versions he had
<headius> but there's a better place to get it from now anyway
<ianz> 1.7.26 is there BTW
<ianz> better place sounds good...?
<headius> ugh there's still bad links on site
<headius> I thought enebo fixed all of them
<headius> all our releases are pushed to Maven central
<headius> so jruby-complete under there should have everything you need
<headius> weird
<headius> only the 1.7.26 page was updated for new links
<ianz> i see it there. Thanks!
<GitHub97> [jruby.github.io] enebo pushed 1 new commit to master: https://git.io/fbjsr
<GitHub97> jruby.github.io/master b3901b6 Thomas E. Enebo: Script missed 1.7.27
<enebo> ianz: complicated update script and I spaced out .27; .26 was the first one to use the proper -dist release.
<enebo> ianz: should be done and complete is accessible again
<ianz> thank you!
ianz has quit [Quit: http://www.kiwiirc.com/ - A hand crafted IRC client]
ianz has joined #jruby
ianz has quit [Client Quit]
ianz has joined #jruby
ianz has quit [Client Quit]
ianz has joined #jruby
ianz has quit [Client Quit]
drbobbeaty has joined #jruby