<lopex>
"Previously, GitLab used Ruby's Regexp for pattern matching. This change switches the regular expression engine to RE2 while maintaining the previous syntax. GitLab internally converts /pattern/flags into valid RE2 patterns"
<lopex>
now what ?
<headius[m]>
That's interesting
<headius[m]>
I wonder what they gain from using RE2
<lopex>
pathologicals regexps I guess
<lopex>
there's no explosion from backtracing
<lopex>
but that mir binding is quite old in GH repo it seems
<lopex>
unless they forked it
<enebo[m]>
Maybe this is super naive but I feel they could error on the features which are problematic through static analysis of the regex. Does RE2 not have pathological behavior?
<enebo[m]>
If they used JRuby they should enable our ability to break out via exception feature
<enebo[m]>
Obviously someone can make an RE2 expr as some pathological vector unless there is some timeout capablity?
<enebo[m]>
lopex: do you know?
<lopex>
enebo[m]: not of my head
<enebo[m]>
lopex: I mean the syntax looks like it would have nearly all the same pathological behavior except look behinds
<lopex>
which they could detect beforehand ?
<lopex>
using a regexp ? :P
<lopex>
enebo[m]: and it depends if those regexp are created from user input right ?
<lopex>
which is a bug for all my applications
<lopex>
er, in my opinion
<lopex>
enebo[m]: ah, so some of that is the CI
<lopex>
you just made me read through the actual thing
<lopex>
so no injections
<enebo[m]>
lopex: they basically allow regexps as arbitrary use submitted values so any bad regexp could hose over a process
<lopex>
yeah
<enebo[m]>
which could be made into some attack vector if you submit 1000 orgs to submit bad regexp for 1000 jobs at same time or something
<lopex>
er, yeah, right
<enebo[m]>
It probably is less about an attack and more the annoyance of looking at CPU hogs and finding some stupid unintended regexp
<lopex>
yeah, but that's the dos
<headius[m]>
MRI could make the same enhancement we have
<enebo[m]>
Still the possibility is there. I would a) submit PR to add timeout to regexp evaluation to MRI vs add whole new dep b) use regexp to disallow common mistake ones
<lopex>
enebo[m]: so, it doesnt look the dont use onigmo for routes etc
<headius[m]>
Just need to check their threading bits periodically
<lopex>
well, so much of stdlib depends on onigmo
<headius[m]>
So this is for a very specific area of the app
<lopex>
yeah
<enebo[m]>
Maintaining a set of bindings to a google regexp library in C++ is probably same work as a) but it would not need to be maintained if it got accepted...
<enebo[m]>
Also a) would not be superior in controlling execution time overall. RE2 can definitely still take forever if someone submits a malicious regexp
<lopex>
enebo[m]: but their hidden CI variables feature is also questionable
<lopex>
so we could support an option for that almost freely
<lopex>
on that interruptible thing
<enebo[m]>
lopex: yeah using joni + interrupt and we have ultimate solution
<lopex>
headius[m]: enebo[m] right ?
<lopex>
ok
<enebo[m]>
I am skeptical that re2 solves this issue unless it has some timeout of interrupt capability
<lopex>
enebo[m]: and what about potential callouts in onigmo ?
<enebo[m]>
Just removing one feature which does backtracking is not really a fix as much as limiting the vectors for damage
<lopex>
same problem
<enebo[m]>
lopex: code snippet showing what you are talking about?
<enebo[m]>
lopex: or do you mean onigmo could support timeout via callouts?
<lopex>
enebo[m]: callouts being cancelled via timeouts, rather
<enebo[m]>
lopex: I don't follow onigmo so I don't know what callouts are
<lopex>
perl has it
<enebo[m]>
Does Ruby?
<lopex>
no
<lopex>
just wondering
<lopex>
pcre has it
<lopex>
enebo[m]: I imagine it's a nightmare
<lopex>
maintain regexp state, and call into embedding runtime
<enebo[m]>
lopex: WHAT DOES PERL NOT HAVE?
<lopex>
popularity anymore
<enebo[m]>
lopex: The irony is it won't have ability ot be timed out :P
<enebo[m]>
lopex: BURN
<lopex>
enebo[m]: in per runtime ?
<lopex>
*perl
<lopex>
or is it you ironizing the irony
<enebo[m]>
lopex: it can do anything so I figure why not
<lopex>
right
<enebo[m]>
Their regexp engine is probably turing complete already too
<enebo[m]>
PCRE is PERL
<lopex>
yeah
<enebo[m]>
PCRE will run PIR and use callouts to implement the instr!
<lopex>
but, pcre is just a liprary extracted from perl ?
<lopex>
or just a semistandard ?
<enebo[m]>
I am jumping through the looking glass of possibilities
<lopex>
hah
<enebo[m]>
Like the guy who decided to write a paper on useful code which only contains ordinary ascii cars as raw assembly
<lopex>
or the jumpless code converter for assembly
<enebo[m]>
So fun discovery this afternoon
<enebo[m]>
MRI translates BEGIN into PRELUDE_NODE which just tacks those onto the front of the instr list
<enebo[m]>
I am wonder what that looks like if you use their new experimental AST library
<lopex>
life is getting harder
<enebo[m]>
I guess technically it will print ouf it is a prelude node
<lopex>
on every front in this business
<enebo[m]>
lopex: in this case it may be easier than what we do
<enebo[m]>
lopex: we make older preexenode which is an iternode and it ends up even being made into a special IRScope which extends IRFor (which extends IRClosure)
<enebo[m]>
err it actually is an IRFor
<lopex>
enebo[m]: so you have a room to hack it over ?
<enebo[m]>
So if we change to doing prelude node like their parser does we can eliminate having to do all this funky stuff in IR
<enebo[m]>
It will help simplify our lvar operand construction
<lopex>
enebo[m]: er, so lvar is affected by some global thing node ?
<enebo[m]>
lopex: sorry...I missed this until now
<enebo[m]>
lopex: any scope in the system can have local variables and we make operands for all of them
<lopex>
enebo[m]: you said you have to change prelude node to simplify lvar right ?
<enebo[m]>
lopex: but 'for' and BEGIN do not have their own scopes so they use parents
<lopex>
ah, so it's a mri compt thing ?
<lopex>
compat
<enebo[m]>
'for' and BEGIN have special logic when constructing lvar operands which changes depth coing out of parser
<lopex>
ok
<enebo[m]>
nope. Just me trying to eliminate some logic in IR but I am at a loss for why it is there
<lopex>
yeah, I can undestand those are some artificial scopes
<enebo[m]>
but not having BEGIN as part of the equation makes the problem smaller
<lopex>
like 'for'
<enebo[m]>
lopex: yeah BEING is an IRFor today and is one case we always seem to change depth + 1 for any lvar encountered
<enebo[m]>
BEGIN = BEING
<enebo[m]>
weird typing
<enebo[m]>
lopex: I think it since it is an IRFor any passes we do has to know it's variable is down one IRScope even though it is not down one DynamicScope when it is actually run
<enebo[m]>
If it is not an IRFor but merely instrs at front of script scope then that all disappears
<enebo[m]>
'for' itself still exists which is really unfortunate
<enebo[m]>
It is a highly unpopular construct in the language so it is a bummer we have to have code to maintain its pseudo-blockish without its own variable scope nature
<lopex>
yeah
<lopex>
havent seen that in a while
<lopex>
and it always looked as an obscured feature