antocuni changed the topic of #pypy to: PyPy, the flexible snake (IRC logs: https://botbot.me/freenode/pypy/ ) | use cffi for calling C | "PyPy: the Gradual Reduction of Magic (tm)"
<antocuni>
ok, I *think* the vmprof+eventlet problem is caused by the fact that at each switch, we call vmprof_stop_sampling twice, and vmprof_start_sampling only once
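(A minimal sketch of one way to make the extra stop call harmless — all names are hypothetical stand-ins, not the actual vmprof/eventlet code or fix:)

```python
# Hypothetical sketch: make stop/start sampling idempotent around a
# greenlet switch, so a redundant second stop cannot leave the
# profiler paused.  `stop` / `start` stand in for the real hooks.
_sampling_stopped = False

def stop_sampling(stop):
    global _sampling_stopped
    if not _sampling_stopped:
        stop()                     # actually pause the profiler
        _sampling_stopped = True
    # a second, redundant stop during the same switch is a no-op

def start_sampling(start):
    global _sampling_stopped
    if _sampling_stopped:
        start()                    # resume exactly once per pause
        _sampling_stopped = False
```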
<mattip>
about translation using too much memory, if you rerun make alone (outside translation) it uses less memory
<mattip>
it's something like: the translation process's memory is being carried into the forked subprocess used to run make
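(A tiny sketch of "rerun make alone" — the build directory below is only an assumed example; the real path is printed by the translation driver:)

```python
# Hypothetical sketch: re-run the generated Makefile in a fresh,
# small process instead of letting the large translation process
# fork it.
import subprocess

build_dir = "/tmp/usession-release-0/testing_1"   # assumed location, adjust to yours
subprocess.check_call(["make"], cwd=build_dir)
```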
bbot2 has quit [Quit: buildmaster reconfigured: bot disconnecting]
bbot2 has joined #pypy
<mattip>
let's see what happens tonite
jcea has joined #pypy
antocuni has joined #pypy
jcea has quit [Quit: jcea]
jcea has joined #pypy
<kenaan_>
cfbolz unicode-utf8 82223a975b6b /pypy/module/_pypyjson/interp_decoder.py: fix unicode \-encoding in _pypyjson
<kenaan_>
cfbolz unicode-utf8 a9bb96fbf9d4 /pypy/: fix more tests BUT: a slight pessimization, because object decoding becomes a little bit slower
jcea has quit [Client Quit]
jcea has joined #pypy
<arigato>
cfbolz: why can't decode_key() return a utf8 byte string instead of a unicode string on default?
<cfbolz>
arigato: it can, but it doesn't help anyway
<cfbolz>
because on the branch there is no general UnicodeDictStrategy
<cfbolz>
(on the branch it only works for ascii strings :-( )
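(A rough illustration of the ascii-only strategy split being discussed — hypothetical classes, not PyPy's actual dict strategy code:)

```python
# Hypothetical sketch: ASCII-only keys stay on a fast byte-keyed dict,
# the first non-ASCII key devolves to a generic dict.
class AsciiKeyDict(object):
    def __init__(self):
        self._ascii = {}        # bytes key -> value, fast path
        self._generic = None    # created lazily on the first non-ASCII key

    def setitem(self, key, value):     # key: utf-8 encoded byte string
        if self._generic is None and all(b < 0x80 for b in bytearray(key)):
            self._ascii[key] = value
        else:
            if self._generic is None:
                self._generic = dict(self._ascii)   # devolve once
                self._ascii = None
            self._generic[key] = value

    def getitem(self, key):
        if self._generic is None:
            return self._ascii[key]
        return self._generic[key]
```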
<arigato>
I'm fearing that you're changing pypyjson in this way because it makes sense for now, but then we'll need UnicodeDictStrategy anyway, and we'll forget to revert pypyjson
<cfbolz>
yes, I see that fear. I should at least put a todo
<kenaan_>
cfbolz unicode-utf8 8dac9e38c3d5 /TODO: add todo
<kenaan_>
cfbolz unicode-utf8 6a13aba253bd /rpython/rlib/: use an actual iterator, to make the code nicer (they work well in rpython nowadays)
<kenaan_>
cfbolz unicode-utf8 5b81f483c459 /pypy/module/_pypyjson/interp_encoder.py: fix encoding to operate on utf-8 encoded strings
<cfbolz>
arigato: before I continue a lot, could you take a look at this diff?:
jamesaxl has quit [Read error: Connection reset by peer]
jamesaxl has joined #pypy
mattip has left #pypy ["bye"]
rubdos has quit [Ping timeout: 250 seconds]
<kenaan_>
cfbolz unicode-utf8 f5be33826726 /rpython/rlib/: support for append_utf8
<kenaan_>
cfbolz unicode-utf8 48da1a44d860 /pypy/objspace/std/unicodeobject.py: replace a lot of uses of StringBuilder by Utf8StringBuilder
<kenaan_>
cfbolz unicode-utf8 f5a5189e5314 /pypy/objspace/std/unicodeobject.py: small cleanup of copy-pasted join code
<cfbolz>
arigato: it's all completely annoying. architecture-wise we should have a type in rutf8 that contains most of the logic in unicodeobject.py. then, unwrapping a w_unicode would give that type. but then we would get yet another indirection.
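(A hedged sketch of the kind of rutf8-level type being described — hypothetical, not actual PyPy code; the cost is the extra object, i.e. "yet another indirection", on every unwrap:)

```python
# Hypothetical sketch: the utf-8 bytes and the cached codepoint length
# travel together, and the string logic lives on this type rather than
# on W_UnicodeObject.
class Utf8String(object):
    def __init__(self, utf8, length):
        self._utf8 = utf8          # utf-8 encoded byte string
        self._length = length      # number of codepoints, computed once

    def codepoint_length(self):
        return self._length

    def startswith(self, prefix):
        # a byte-wise prefix check is also a codepoint-wise prefix check
        return self._utf8.startswith(prefix._utf8)

    def find(self, sub):
        # byte-level find; the byte index would still need converting
        # back to a codepoint index for the app-level result
        return self._utf8.find(sub._utf8)
```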
<arigato>
yes
<arigato>
the alternative would be to add a field to the low-level rstr
<arigato>
but it's also annoying
<arigato>
of course, all these tuple-returning functions we have in the branch now are also relatively costly
<antocuni>
uh, apparently we don't have a way to check whether we already installed a `pypyjit.set_compile_hook` :(
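(A small app-level workaround sketch, assuming you control all the places that install hooks — pypyjit itself does not expose a way to query the current hook:)

```python
# Hypothetical wrapper: remember on our side whether a compile hook
# was installed, since there is no getter in the pypyjit module.
try:
    import pypyjit
except ImportError:
    pypyjit = None   # not running on PyPy

_current_hook = None

def install_compile_hook(hook):
    global _current_hook
    if pypyjit is not None:
        pypyjit.set_compile_hook(hook)
    _current_hook = hook

def compile_hook_installed():
    return _current_hook is not None
```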
<arigato>
cfbolz: maybe at some point we should do something about that
<cfbolz>
arigato: or not, to discourage designs where you return a lot of tuples :-P
<cfbolz>
But yes, I see your point
marr has joined #pypy
<fijal>
cfbolz: part of my thinking was "let's not have yet another layer of rpython magic"
<fijal>
we can make tuple-returning functions do what they would do in C, right?
<fijal>
specifically x, y = foo() kinda call
<fijal>
it seems even easy-ish
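(An illustration of the pattern under discussion — hypothetical helper names, simplified to ASCII to keep the sketch short:)

```python
# Today's style: the helper returns a tuple, and the tuple allocation
# is what is considered costly.
def next_codepoint(s, pos):
    ch = ord(s[pos])            # simplified: one byte == one codepoint
    return ch, pos + 1          # (codepoint, new position)

# The "what C would do" variant: write the second result into a
# caller-provided slot instead of allocating a tuple.
def next_codepoint_out(s, pos, out_pos):
    ch = ord(s[pos])
    out_pos[0] = pos + 1        # out-parameter, no tuple allocation
    return ch
```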
<arigato>
it's anything but easy-ish
<arigato>
it's a mess that has implications everywhere, including throughout the JIT
<fijal>
and the gc?
<arigato>
dunno, I can see a way that makes it have no implications in the GC
<arigato>
but everywhere else
<fijal>
right
<fijal>
well, any good ideas how to do it otherwise?
<kenaan_>
rlamy default 2477eb379774 /pypy/module/_io/interp_textio.py: Keep chipping away at readline_w()
<arigato>
no magic idea that will solve all your use cases, no
<fijal>
well, one option would be to return the builder
<fijal>
which is again a bit of a mess for JIT
<fijal>
arigato: is there a good way to measure if returning a tuple is indeed a problem?
<fijal>
arigato: so cfbolz has a good point that we already use utf8 on pypy3
<fijal>
so maybe having an rpython-level utf8 string would solve both the tuple issue and pypy3 issue?
<arigato>
right, it would solve a few deeper issues than the tuple one, like recomputing things currently stored on W_UnicodeObject in some situations
<arigato>
on the other hand, it's a major mess
<arigato>
pypy3 doesn't "have" a utf8 string, it just uses a regular string that happens to contain utf8
<fijal>
why is it a major mess?
<fijal>
I mean, we would use a subclass of str() at the emulated level, with the rpython level being slightly different and carrying extra fields
<fijal>
I think the main problem is that the emulated level will be even slower, but maybe that's ok?
<arigato>
so you're thinking about a rstr.UTF8STR that would look like a rstr.STR with a few extra fields?
<fijal>
yeah
<fijal>
and the emulated layer would be *cough* a subclass of str
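(A minimal sketch of that emulated-level idea, assuming the extra fields are just attributes on a str subclass — not actual RPython code; RPython runs at the Python 2 level, where str is a byte string:)

```python
# Hypothetical sketch: the instance *is* the utf-8 data, and the
# codepoint length rides along as an extra field.  After translation
# this would instead be a different low-level string type.
class Utf8Str(str):
    def __new__(cls, utf8_data, codepoint_length):
        self = str.__new__(cls, utf8_data)
        self.codepoint_length = codepoint_length
        return self

s = Utf8Str("hyv\xc3\xa4", 4)    # "hyvä" as utf-8, 4 codepoints
```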
<arigato>
would it annotate as a different and incompatible SomeUtf8String?
<fijal>
yeah
<arigato>
I'm sure you'll sometimes need to convert between that and a regular str; is making a copy ok?
<fijal>
we can have an operation that does that
<fijal>
makes a copy while emulated and a cast when not emulated
<fijal>
"cast"
<fijal>
in one direction you need to scan the string anyway
<arigato>
well, how do you "cast"?
<fijal>
right
<arigato>
the rstr.UTF8STR cannot be compatible enough, not easily
<arigato>
you'd need a copy, which defeats the point of .encode('utf8') not making a copy
<fijal>
indeed
<fijal>
there are messier options of course
<fijal>
like, have a bit saying which one it is and storing the extra data at the end of the string
<fijal>
(which is, super messy)
<arigato>
as I said earlier we could have an extra pointer inside all rstr.STR
<arigato>
so that we don't need a different rstr.UTF8STR
<fijal>
yes, that's an option too
<fijal>
it kinda shifts the balance in RPython a bit
* fijal
should really make food
<fijal>
arigato: the problem is as follows - what do we do with py3k?
<fijal>
where text_w returns utf8 string (but no flags)
<fijal>
do we rerun check_utf8 when rewrapping it?
<fijal>
maybe?
<fijal>
and we write a super fast check_utf8
<fijal>
or do we do something else?
<fijal>
that sounds like the easiest option for now (and one that's also an improvement on the current setup anyway)
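(A hedged sketch of what a "super fast" check could look like — not the actual rpython.rlib.rutf8 code; the idea is an ASCII fast path with a full validation fallback:)

```python
# Hypothetical sketch: return the codepoint length of a utf-8 byte
# string, raising UnicodeDecodeError if it is not valid utf-8.
def check_utf8_length(data):
    if not any(b >= 0x80 for b in bytearray(data)):
        return len(data)                  # pure ASCII: codepoints == bytes
    return len(data.decode('utf-8'))      # full validation + codepoint count
```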
<arigato>
I guess you're talking about the continuation of the current work
<arigato>
not in the RPython string hack world
<fijal>
I mean - how do we merge utf8 to py3k
<arigato>
because in the RPython string hack world, it's easier
<fijal>
after the merge to default
<arigato>
yes, I understand
jamesaxl has quit [Read error: Connection reset by peer]
<arigato>
I'm saying, we came up with a different idea, so let's explore it a little bit
<fijal>
yes, sure
<arigato>
in this different world, it's easier for py3k
<fijal>
so the question is - do we explore it now or do we first try to merge the current approach to py3k?
<arigato>
who knows
jamesaxl has joined #pypy
<fijal>
note that even if we call check_utf8 at the rewrapping, it's STILL a massive improvement over the current situation
<fijal>
and gives us a clear path for finishing the branch (and the mozilla contract)
<fijal>
maybe we should make it a Leysin sprint topic "improve even further" :-)
<arigato>
I should ask I guess: are you sure that the work that CPython/PyPy5.9 does in .encode('utf8') and .decode('utf8') is really enough to offset the extra overhead in the unicode-utf8 branch of mostly every other operation?
<arigato>
well it's also less memory, so it's not clearly "every other operation"
<fijal>
what is "every other operation"?
<fijal>
getitem, sure
<arigato>
but every operation actually looking inside the string, like most unicode methods, is probably a bit slower
<fijal>
(and yes, I believe so)
<fijal>
I doubt it
<fijal>
eg find scans a lot less of memory
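(A rough illustration of the "scans a lot less memory" point, assuming a UCS-4 baseline for comparison:)

```python
# For a mostly-ASCII payload, the utf-8 representation that find()
# walks over is roughly a quarter the size of a UCS-4 representation.
text = u"mostly ascii text with a little caf\xe9 " * 1000
utf8_bytes = len(text.encode('utf-8'))
ucs4_bytes = len(text) * 4                 # 4 bytes per codepoint in UCS-4
print(utf8_bytes, ucs4_bytes)              # utf-8 is about 4x smaller here
```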
<arigato>
ok
<arigato>
I guess we'll see in benchmark results
<fijal>
startswith for example should be faster
<fijal>
arigato: well, give me an example :)
<arigato>
things like UnicodeDictStrategy being missing are probably costing something too
<fijal>
again, no
<fijal>
because I added the one for ascii
<fijal>
and we don't run a single benchmark with an actual non-ascii unicode payload I think
<fijal>
isupper is probably slower
<fijal>
no, it's the exact same speed on a constant string
<arigato>
ok, then maybe. I'll trust the benchmarks
<fijal>
I think we SHOULD benchmark unicode non-ascii payloads :)
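(A sketch of the kind of micro-benchmark being asked for — payloads and sizes are arbitrary choices, not an existing benchmark:)

```python
# Run the same string operation on an ASCII payload and a non-ASCII
# payload, so the two code paths can be compared on both interpreters.
import timeit

ascii_payload = u"hello world, just plain ascii " * 100
nonascii_payload = u"\u4f60\u597d\u4e16\u754c, mixed payload \xe9\xe8 " * 100

for name, payload in [("ascii", ascii_payload), ("non-ascii", nonascii_payload)]:
    t = timeit.timeit(lambda: payload.find(u"payload"), number=100000)
    print(name, t)
```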
<fijal>
but then we never did, so complaining that the branch might be slower there is a bit problematic
<arigato>
it seems to me that there is more complexity, which will translate into slower interpreted code and more bridges in the JIT
<arigato>
but that's only a guess
<arigato>
"more bridges" is mostly about: you do a small operation on a unicode string, and you get a bridge for ascii/non-ascii-unicode-string
<fijal>
right
<fijal>
let's translate and have a look
<fijal>
we should also carefully look at some logs