<enebo>
kares: you can land what you have if it passes tests
<enebo>
kares: I know you are done for now but I keep looking at us using a RubyHash and realize if it was all native it would be a simple value type and likely just dumb fields
<enebo>
kares: so I guess we can get more later if we want to keep pushing that direction
<havenwood>
I'm updating ruby-versions metadata so ruby-install can use the new Maven location to fetch binaries. Most of the bins are the exact same checksums, but I noticed a few anomalies with the checksums compared to the same versions on AWS.
<havenwood>
jruby-dist-1.7.19-bin has a different checksum for both the zip and the tar.gz, and jruby-dist-9.1.17.0-bin does as well, but just for the zip.
<havenwood>
The rest of the binaries are the same checksums compared to the old versions.
<havenwood>
enebo: Should I just defer to the new checksums ^ for the few that changed?
<havenwood>
I'm just updating ruby-versions for the dist-bin, version 1.7.5 and later.
<havenwood>
If earlier .tar.gzs are added, or more src versions, I'd be happy to update ruby-versions with those as well.
<enebo>
lopex: I still have never followed through on my assertion that caching length on string would pay for itself
<lopex>
no profile data
<enebo>
lopex: There would be a tiny amount of cost for sb case
<lopex>
so just guessing
<enebo>
lopex: but it is so expensive in mbc case
<enebo>
lopex: but yeah no evidence and it would not be faster for sure in sb case
<lopex>
enebo: c deals with it from very beginning :P
<lopex>
and most code ranges are sb
<enebo>
I was told by MRI dev(s) that the reason length was never considered is because of lack of space in their struct
<lopex>
or are they ?
<lopex>
heh
<lopex>
so why doesn't jruby do that?
<enebo>
lopex: well likely they are but killing perf for mbc at the cost of sb being a tiny bit slower might not be a good tradeoff
<lopex>
because mri didn't have space for that
<enebo>
we just have never tried
<enebo>
and we ported their logic to some degree
<enebo>
remember how long m17n took
<lopex>
it took me a year for string alone
<enebo>
I think my view of this was that we were not going to deviate from MRI until we were confident we were correct
<lopex>
yes
<lopex>
that was my attitude as well
<enebo>
so adding length could be done as an appendage/add-on but I also thought about using length field for CR as well
<lopex>
and yet you have to be bug-for-bug compatible
<lopex>
length for cr ?
<enebo>
so negative values could indicate unknown/valid
<lopex>
ah, I recall now
<enebo>
arr.length with 7bit env is just length
<lopex>
but it would have to go through a centralized api
<enebo>
err arr.length and length is 7bit
<lopex>
otherwise you'd be lost
<enebo>
well, it's is7bit(), sure
<enebo>
the methods would be super small and inline
<enebo>
well I would lay money on that anyways
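A rough sketch of the idea being kicked around, assuming a hypothetical cached-length field (none of these names are actual JRuby code): non-negative values memoize the character length, and negative sentinels reuse the same slot for code-range state.

    // Hypothetical sketch: cache char length on the string and reuse the same
    // field for code-range state via negative sentinel values.
    final class CachedLengthString {
        static final int CR_UNKNOWN = -1; // length not computed yet
        static final int CR_BROKEN  = -2; // possible second sentinel for invalid bytes

        private final byte[] bytes;
        private int charLength = CR_UNKNOWN;

        CachedLengthString(byte[] bytes) { this.bytes = bytes; }

        int length() {
            if (charLength >= 0) return charLength; // cached: one compare for the sb case
            charLength = computeCharLength(bytes);  // the expensive mbc walk, done once
            return charLength;
        }

        boolean isSingleByte() {
            // once cached, char length == byte length means every char is one byte
            return charLength >= 0 && charLength == bytes.length;
        }

        private static int computeCharLength(byte[] bytes) {
            return bytes.length; // stand-in for a real encoding-aware scan
        }
    }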
<enebo>
lopex: joni is present!
<enebo>
on maven
<nirvdrum>
Whoa. That's a lot of backreading. Is it worth me catching up?
<enebo>
nirvdrum: not really. we are just removing regions so you can implement match_p without setting them up
<enebo>
nirvdrum: most of that was not understanding why we had regions and two ints
<enebo>
nirvdrum: one thing of interest is that rb_str_subpos exists for pos repositioning for match_p (and a couple of other things). Why they did not just use their normal char walking code is a mystery
<enebo>
could just be some microopt we don't really understand
<nirvdrum>
So how are you tracking match positions if you remove regions?
<lopex>
enebo: ok
<enebo>
nirvdrum: no but match? doesn't
<lopex>
nirvdrum: via two int fields in matcher
<enebo>
well we do but we don't actually need to for match?
<lopex>
"match?"
<enebo>
nirvdrum: actually, what started part of that discussion was that when regions are disabled we still calc beg/end for the match even though match_p doesn't need that
<enebo>
It is a tiny amount of logic though so not likely very important
<nirvdrum>
Ahh.
<enebo>
lopex: you saw that joni is on maven repos
<nirvdrum>
I think the only thing we do differently for `match?` right now is avoid setting `$~`.
<lopex>
enebo: I believe you
<lopex>
enebo: I was testing with local copy
<enebo>
lopex: just making sure you know :)
<lopex>
nirvdrum: now you can force the region to be null
<lopex>
using reg.matcherNoRegion
<nirvdrum>
Nifty.
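A minimal usage sketch of the two entry points being discussed; matcherNoRegion is the new call lopex mentions, the rest is ordinary joni API, though exact signatures may differ across joni versions.

    import java.nio.charset.StandardCharsets;

    import org.jcodings.specific.UTF8Encoding;
    import org.joni.Matcher;
    import org.joni.Option;
    import org.joni.Regex;

    public class MatcherNoRegionSketch {
        public static void main(String[] args) {
            byte[] pat = "ab(c)d".getBytes(StandardCharsets.UTF_8);
            byte[] str = "xxabcd".getBytes(StandardCharsets.UTF_8);
            Regex regex = new Regex(pat, 0, pat.length, Option.NONE, UTF8Encoding.INSTANCE);

            // Regular matcher: fills a Region so capture groups can be read back later.
            Matcher m = regex.matcher(str, 0, str.length);
            int at = m.search(0, str.length, Option.NONE);
            System.out.println(at + " " + m.getBegin() + ".." + m.getEnd());

            // matcherNoRegion: a boolean-style match? never reads group data, so the
            // region stays null and that bookkeeping is skipped entirely.
            Matcher mp = regex.matcherNoRegion(str, 0, str.length);
            System.out.println(mp.search(0, str.length, Option.NONE) >= 0);
        }
    }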
<enebo>
nirvdrum: yesterday we talked about the idea that we could make a more specialized interp for match_p to shave some of this region-ish logic out, but it likely would not be a big gain
<enebo>
nirvdrum: but since we have the method we can change that impl any time if we decide to play with it
<nirvdrum>
Cool.
<lopex>
nirvdrum: also there's a possibility to use a separate interpreter that omits some group logic in joni
<nirvdrum>
I suppose my next big mountain to climb is really figuring out what joni is doing.
<nirvdrum>
We ported code from JRuby and slapped a boundary around the whole thing. It mostly works, but isn't ideal.
<lopex>
like the group is not referred to by \1 for example
<nirvdrum>
But I never really know where to start.
<lopex>
but groups largely group so it's inevitable for the most part
<lopex>
enebo: I wonder how much the semantics change when you just use (?:..)
<nirvdrum>
Not even remotely related to what you guys are talking about, but I'd really, really, really like to get basic regexp patterns without capture to be as fast as a substring search.
<lopex>
enebo: we could have an external array which says which groups are capturing
<enebo>
lopex: oh, so you mean we create regions for (?:...) along with capturing regions as the same data?
<enebo>
lopex: if so then won't match? be broken
<lopex>
nirvdrum: but even then there is a question of what fast-skip algo you use
<lopex>
enebo: something like changing capturing to not capturing
<lopex>
enebo: if not referred
<enebo>
I am not quite following. how would we know if it was referred or not
<enebo>
(?: ...) is never referred
<nirvdrum>
We end up down paths with code like Regexp.new(Regexp.quote(pattern)) and pattern ends up being ',' or "\n".
<lopex>
nirvdrum: like for example with /foo/ you can already build a Boyer-Moore map
<enebo>
but () may be referred to, but unless you mean \1 then I don't get it
<lopex>
nirvdrum: but regexps are mostly known at parse time
<lopex>
so it's all tradeoffs
<nirvdrum>
So the regexp is a single ASCII char, no modifiers, no bounds, no captures.
<nirvdrum>
It ideally would work the same as indexOf.
<lopex>
enebo: (?:...) is just not capturing
<enebo>
AST generation does validate the regexp so we could allow joni to know if it is a simple string
<enebo>
lopex: isn't it?
<enebo>
I never remember the syntax
<nirvdrum>
enebo: In this case it'd be a runtime thing.
<enebo>
I thought (?: was non-capturing group
<lopex>
enebo: not capturing group
<lopex>
enebo: I said otherwise ?
<enebo>
lopex: I don't understand what you are asking or why now
<nirvdrum>
Anyway. I didn't mean to derail your conversation.
<lopex>
enebo: well we are in agreement, I'm confused
<enebo>
nirvdrum: well, the AST node can be marked as a regexp but we could also mark it as having no special chars
<enebo>
lopex: you brought up non-capturing and then said something about tracking them separately from regions
<enebo>
lopex: so I did not bring this up at all. I think I just did/do not understand what you meant before
<enebo>
nirvdrum: at IR build time we could implement it as a simple string search
<enebo>
nirvdrum: or we could make joni have an optimized implementation for just that
<lopex>
enebo: something like the capture can be disabled later on
<enebo>
lopex: oh! like we can tell after we have run it that the code using it never requires backref so we remove the regions?
<lopex>
enebo: yes, but just changing the interpreter loop
<enebo>
It is unfortunate that $~ lives past current stack
<lopex>
and some array of numbers
<nirvdrum>
enebo: For match? you could just rewrite it to String#index(pattern) != 0. But I'd like to have it optimized for `match` as well. In that case you would need to know the match boundaries.
<lopex>
nirvdrum: wrt tradeoffs, something like "looongstringbefore abcd" =~ /abcd/
<nirvdrum>
I just think joni ends up going down a more complicated path.
<lopex>
nirvdrum: joni will build a Boyer-Moore map for abcd
<lopex>
nirvdrum: and then fast skip to the interesting point before it even enters interpreter loop
<nirvdrum>
lopex: This is where I shamefully admit I don't know what that is :-P
<lopex>
nirvdrum: not all indexOfs have this
<lopex>
nirvdrum: Boyer-Moore?
<nirvdrum>
Yeah. I'll read up on it.
<lopex>
nirvdrum: nowadays MRI uses Sunday search, which is a modification of Boyer-Moore
<enebo>
lopex: I think the notion though is that knowing at parse time it can be something much simpler means not feeding it into the joni engine
<lopex>
nirvdrum: but the gist is you just build a skip map given the string you search for
<nirvdrum>
Really, my understanding of regexp engines is limited to foundational automata. The pumping lemma and such.
<lopex>
it's just string searching algos
<enebo>
even if joni is super fast all the code around getting to that fast execution is not free
<lopex>
nirvdrum: and you advance faster given that map
<nirvdrum>
Gotcha.
<lopex>
but you have to build it first, so indexOf could do that above some length threshold
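A minimal sketch of the skip-map idea: a Horspool-style bad-character table over bytes. This is the textbook algorithm, not joni's or MRI's actual implementation.

    import java.nio.charset.StandardCharsets;

    // Boyer-Moore-Horspool over bytes: precompute, per byte value, how far the search
    // window can shift when its last byte mismatches, then skip ahead by that amount.
    public final class HorspoolSearch {
        public static int indexOf(byte[] haystack, byte[] needle) {
            int n = haystack.length, m = needle.length;
            if (m == 0) return 0;

            int[] shift = new int[256];
            java.util.Arrays.fill(shift, m);          // default: jump the whole needle
            for (int i = 0; i < m - 1; i++) {
                shift[needle[i] & 0xff] = m - 1 - i;  // bytes in the needle allow a smaller shift
            }

            int pos = 0;
            while (pos <= n - m) {
                int j = m - 1;
                while (j >= 0 && haystack[pos + j] == needle[j]) j--;
                if (j < 0) return pos;                           // full match
                pos += shift[haystack[pos + m - 1] & 0xff];      // the fast skip
            }
            return -1;
        }

        public static void main(String[] args) {
            byte[] h = "looongstringbefore abcd".getBytes(StandardCharsets.US_ASCII);
            byte[] n = "abcd".getBytes(StandardCharsets.US_ASCII);
            System.out.println(indexOf(h, n)); // 19
        }
    }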
<enebo>
I think I am on a different wavelength on optimizing that case now
<enebo>
I don't think it should have anything to do with joni other than joni pointing out it is this simple case
<nirvdrum>
lopex, enebo: Perhaps I'm advocating for making some of these operations encoding and code range aware.
<nirvdrum>
But I say that naively not having looked at the internals. If it's just a byte machine that may not even matter.
<lopex>
wtf is Horspool
<enebo>
even if joni is faster at finding the match on a simple string all the shit we plow through before we hit that fast code is substantial
<nirvdrum>
indeed.
<nirvdrum>
And Graal isn't going to help us inline through it.
<enebo>
for truffle you would no doubt just make a very simple specialized path for it
<nirvdrum>
TRegex may do that. I haven't played with it yet.
<enebo>
for IR we could do it a couple of ways
<enebo>
a =~ /a/ would still need to say it is executing match in the stack trace, so some sleight of hand is needed
<nirvdrum>
I could provide a specialization for these simple cases, but then I need to maintain my own equivalent to regions and such so the `MatchData` instances can be constructed properly. It's doable, but I certainly don't want my own ad hoc limited regexp engine.
<enebo>
nirvdrum: but literally only for cases like /\n/ where match would be pretty simple
<enebo>
start/end is trivial in simple substring match
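A tiny illustration of that last point (a hypothetical helper, not JRuby code): with a literal, special-char-free pattern, the match boundaries fall straight out of the substring hit.

    // For a pure literal pattern, begin/end of the match is just the indexOf hit
    // plus the literal's length; no regexp engine or regions involved.
    public final class LiteralMatch {
        // returns {begin, end} in chars, or null when there is no match
        static int[] match(String str, String literal) {
            int begin = str.indexOf(literal);
            return begin < 0 ? null : new int[] { begin, begin + literal.length() };
        }

        public static void main(String[] args) {
            int[] m = match("a,b,c", ",");
            System.out.println(m[0] + ".." + m[1]); // 1..2
        }
    }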
<nirvdrum>
lopex: I haven't. But I'll check that out. I believe Chris has worked with Edd in the past.
<lopex>
though I always saw degradations before steady state was reached on hotspot
<nirvdrum>
enebo: I ended up down this chain of thought when encountering this snippet from csv.rb: parse.sub!(@parsers[:line_end], "")
<enebo>
oh yeah and in fact this would be more complicated for us since sub! is a call
<nirvdrum>
Which basically is the same thing as String#chomp, but looks up a regexp from a map and uses that as an argument to String#sub!
<lopex>
beauty
<enebo>
so we would need to pass in a type which had match/whatever but was not specifically a RubyRegexp
<nirvdrum>
I think it was written this way so you could use something other than "\n" to demarcate different rows. But, I doubt anyone ever really does that.
<enebo>
yeah I doubt that as well but \r\n may be possible perhaps