<enebo>
lopex: I will give it a quick check. Silly I did not bother. I trust you so much
<nirvdrum>
lopex: Awesome. Thanks.
<lopex>
enebo: this is the most annoying thing to test in jcodings
<enebo>
lopex: well I have a test case for sure
<enebo>
const_set and const_defined? now rely on these to identify whether it is a valid constant name
<enebo>
whereas in the past we used jlString
<enebo>
lopex: speaking of fun!!!! if you wanted to be a rock star you could add flag/enum support for RubySymbols so we can mark what type of identifier they can represent
<enebo>
lopex: MRI added this a while back whereas we O(n) check over and over
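A minimal sketch of the flag/enum idea, assuming a hypothetical `SymbolSketch` class and `IdKind` enum (neither name comes from JRuby or MRI). The name is classified once at creation, so later "is this a valid constant name?" checks become a field comparison instead of a repeated O(n) scan. The classification rules here are deliberately simplified and ASCII-only; the encoding-aware rules discussed in this log are out of scope.

```java
// Hypothetical sketch: class, enum, and method names are illustrative only.
final class SymbolSketch {
    enum IdKind { CONSTANT, LOCAL, INSTANCE, CLASS_VAR, GLOBAL, OTHER }

    private final String name;
    private final IdKind kind; // computed once, when the symbol is created

    SymbolSketch(String name) {
        this.name = name;
        this.kind = classify(name);
    }

    String name() { return name; }

    IdKind kind() { return kind; }

    // O(1): no rescanning of the name on each call.
    boolean isConstantName() { return kind == IdKind.CONSTANT; }

    // Single O(n) pass over the name (simplified, ASCII-only rules).
    private static IdKind classify(String s) {
        if (s.startsWith("@@")) return identFrom(s, 2) ? IdKind.CLASS_VAR : IdKind.OTHER;
        if (s.startsWith("@")) return identFrom(s, 1) ? IdKind.INSTANCE : IdKind.OTHER;
        if (s.startsWith("$")) return identFrom(s, 1) ? IdKind.GLOBAL : IdKind.OTHER;
        if (!identFrom(s, 0)) return IdKind.OTHER;
        char first = s.charAt(0);
        return (first >= 'A' && first <= 'Z') ? IdKind.CONSTANT : IdKind.LOCAL;
    }

    // True if s[from..] is a non-empty run of [A-Za-z0-9_] that does not start with a digit.
    private static boolean identFrom(String s, int from) {
        if (from >= s.length()) return false;
        char first = s.charAt(from);
        if (first >= '0' && first <= '9') return false;
        for (int i = from; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '_';
            if (!ok) return false;
        }
        return true;
    }
}
```

With something like this, callers such as const_set could ask `isConstantName()` without walking the bytes again.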
<enebo>
of course I go to EUC-JP page and it links to unicode entry but it is an ALNUM
<enebo>
I think onigmo is just wrong here
<enebo>
oh hmm
<enebo>
should I use isWord and not isAlnum?
<enebo>
lopex: actually what is the difference?
<enebo>
isWord does fix it
<lopex>
yeah, those are both true for unicode
<enebo>
lopex: so maybe EUC-JP specifically does not think they are ALNUM while for unicode they do? but MRI will basically still think it is a valid identifier character for a constant.
<enebo>
lopex: isWord is basically all characters which do not separate words? Is '$' isWord?
<lopex>
for unicode ?
<enebo>
lopex: I don't know for anything
<enebo>
lopex: what does isWord mean
<lopex>
for unicode it's 0-9a-zA-Z_
<lopex>
from ascii range
<lopex>
god knows what's there
<lopex>
but the problem is in char types and not ranges
<enebo>
So <256 gets isalnum and _ check but then anything else is fine?
<enebo>
so at lexer level I could put some multibyte space char and it makes it past this point
<enebo>
so MRI must validate this later somehow
<enebo>
lopex: or am I mistaken?
<enebo>
./include/ruby/ruby.h:static inline int rb_isascii(int c){ return '\0' <= c && c <= '\x7f'; }
<lopex>
there's #define is_identchar(p,e,enc) (ISALNUM((unsigned char)*(p)) || (*(p)) == '_' || !ISASCII(*(p))) in symbol.c too
<enebo>
hehe so absolutely any character outside of that range will be valid for an identifier in the lexing portion of MRI (JRuby is a bit different since Character.isLetterOrDigit() I think will say yes/no for mbcs?)
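A rough Java transliteration of the `rb_isascii` and `is_identchar` snippets quoted above, to show what the byte-level check accepts. The class and method names (`IdentCharSketch`, `isIdentChar`) are assumptions made for illustration, and the check looks at a single byte only, as the macro does.

```java
// Hypothetical transliteration of the quoted C snippets; names are illustrative.
final class IdentCharSketch {
    // Mirrors rb_isascii: true for 0x00..0x7f.
    static boolean isAscii(int c) {
        return 0 <= c && c <= 0x7f;
    }

    // ASCII-only alphanumeric test, standing in for ISALNUM on an unsigned char.
    static boolean isAsciiAlnum(int c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Mirrors is_identchar for one source byte:
    // ASCII alphanumeric, or '_', or any byte outside the ASCII range.
    static boolean isIdentChar(byte b) {
        int c = b & 0xff; // treat Java's signed byte as unsigned
        return isAsciiAlnum(c) || c == '_' || !isAscii(c);
    }
}
```

Under this check the lead byte of any multibyte character passes, including the lead byte 0xE3 of a UTF-8 ideographic space (U+3000), which matches the observation that a multibyte space char makes it past this point.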
<enebo>
lopex: ok so my problem right now...lambda is ok in constant name in MRI but I have no idea how they approve it. !ISASCII seems mad. That cannot possibly be valid can it?
<nirvdrum>
SVM sees the static TruffleOptions.AOT value and discards the other branch which contains the code doing the dynamic lookup.
<nirvdrum>
enebo: Yeah, that's the idea. I haven't looked at it in a while. I *think* the additional overhead would be minimal. But I'd have to work out the thread-safety of the tables.
<nirvdrum>
Since those are read-only, two threads both loading the tables wouldn't be the end of the world.
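One way the lazy, read-only table idea could look, as a sketch only. `LazyTableSketch` and its `loader` are hypothetical names, not jcodings API. Two threads racing on the first call may both run the loader, but because the data is read-only either result is usable and the duplicate work is the only cost.

```java
import java.util.function.Supplier;

// Hypothetical sketch: not jcodings code. Lazily loads a read-only table
// and tolerates a benign race on first access.
final class LazyTableSketch {
    private final Supplier<byte[]> loader; // assumed to return equivalent data on every call
    private volatile byte[] table;         // volatile so a published table is fully visible

    LazyTableSketch(Supplier<byte[]> loader) {
        this.loader = loader;
    }

    byte[] get() {
        byte[] t = table;
        if (t == null) {
            // Two threads may both reach this point and both load;
            // the tables are read-only, so whichever write lands last is fine.
            t = loader.get();
            table = t;
        }
        return t;
    }
}
```

If loading exactly once mattered, the holder-class idiom or a lock would be the usual alternatives; the sketch above trades that guarantee for simplicity.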
<nirvdrum>
lopex: I'd have to look at that again.
<enebo>
so if I remember this is not just a load time issue but also a memory one
<nirvdrum>
Basically I don't want to head down this path if it's apt to be rejected out of hand. But I'm happy to collaborate on it.
<enebo>
nirvdrum: lopex: how many types are we talking about?
<enebo>
telling me one per encoding is not what I am asking :)
<lopex>
no
<lopex>
dunno, like 30 impls max ?
<nirvdrum>
Memory potentially. But the tables would end up compiled into the process and currently the whole process is loaded into memory anyway. So I'm not sure there's really any savings to be had there.
<lopex>
er more like 50
<enebo>
yeah no one cares about 50 classes
<enebo>
not at this point :)
<nirvdrum>
Ruby has 110 encodings, but a good number of those are aliases.
<enebo>
I am just wondering how much of an issue the data is from memory perspective
<nirvdrum>
Loading the maps lazily would be more of a memory savings for the JVM.
<enebo>
I am guessing it is megs of data not like 1meg of data
<enebo>
yeah
<nirvdrum>
Some of the encoding tables are 1MB+
<enebo>
I am just being devil's advocate about just making it all eager
<nirvdrum>
Loading all of them would be noticeable.
<enebo>
ok yeah that will stack up quick
<nirvdrum>
Let me just go measure.
<enebo>
nirvdrum: well I wondered about loading them as a single piece of data
<nirvdrum>
Maybe not so bad. 3.2 MB of table data.
<enebo>
but we would not want to increase heap by several megs
<nirvdrum>
They're compact binary implementations though, so it'd be more in memory.
<enebo>
so perhaps lazy data makes sense unless 2 of it is utf encodings we always load
<enebo>
ah yeah
<enebo>
ok yeah I doubt we want that hit
<enebo>
so we have compact data and we expand it on loading?
<nirvdrum>
I see 51 encoding files (no idea if multiple classes per file) and 29 transcoding files.
<enebo>
ah fudge
<enebo>
I did a mvn:prepare before updating jcodings
<nirvdrum>
It's loaded into a byte[] or an int[] depending on whether it's a byte-oriented or word-oriented file.
<nirvdrum>
The additional overhead won't be massive.
<nirvdrum>
16 bytes for an array header?
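A back-of-the-envelope check of the header overhead, assuming one array per file (51 encoding files plus 29 transcoding files) and the 16-byte header guessed above. Both figures are assumptions taken from this conversation rather than measurements, and the helper name is hypothetical.

```java
// Hypothetical arithmetic sketch; the inputs are the figures quoted in the chat.
final class HeaderOverheadSketch {
    static long headerOverheadBytes(int arrayCount, int headerBytes) {
        return (long) arrayCount * headerBytes;
    }

    public static void main(String[] args) {
        int files = 51 + 29; // encoding files + transcoding files, one array each (assumed)
        long overhead = headerOverheadBytes(files, 16);
        System.out.println(overhead + " bytes of array headers"); // prints "1280 bytes of array headers"
    }
}
```

Against roughly 3.2 MB of table data, 1280 bytes is on the order of 0.04%, which fits the expectation that the additional overhead won't be massive.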
<enebo>
well that does not sound like it is uncompressed or anything
<nirvdrum>
lopex would know better.
<nirvdrum>
While I'm at it, I'd love nothing more than to address the static index value in Encoding.