cfbolz changed the topic of #pypy to: PyPy, the flexible snake (IRC logs: https://botbot.me/freenode/pypy/ ) | use cffi for calling C | "the modern world where network packets and compiler optimizations are effectively hostile"
<kenaan>
arigo nogil-unsafe-2 55fb6aff1863 /pypy/config/pypyoption.py: With nogil, it doesn't make sense to not have threads (and it fails translation right now, FIXME)
<kenaan>
arigo nogil-unsafe-2 cd60a593d1b4 /rpython/: Fixes. Now the branch seems to "work" again
<kenaan>
arigo nogil-unsafe-2 13c93572cf88 /rpython/translator/c/src/: Make the current stack limits a thread-local, too
rokujyouhitoma has quit [Ping timeout: 240 seconds]
<gh16ito>
Cool, njs
<gh16ito>
OK, I'm off.
<gh16ito>
Thanks for the help.
gh16ito has quit [Quit: Leaving]
<arigato>
ok, now the nogil-unsafe-2 branch works again
<arigato>
it still seems to give slowdowns, even when compared with a single thread using the same interpreter
<arigato>
the magic number "4" is tuned to give the worst case for me
<Remi_M>
arigato: indeed, I do see a 2x slowdown :O
<Remi_M>
not sure why pypy got a 6x slowdown on my machine, but there may simply be several of these slowdowns stacked in there
<arigato>
did you compare an lldebug build and an optimized build?
<Remi_M>
it was an optimised build with debug symbols; didn't compare anything else yet
<arigato>
ok
<Remi_M>
hm.. why do I not see a slowdown if I only increment 'foo', not 'foo2'?
<arigato>
not sure
<arigato>
I think the effect in this case is that you see only about half of the increments in the global variable
<Remi_M>
what do you mean by 'see'? they are definitely in the code, can the CPU ignore some of the writes?
<arigato>
yes, because they are racy
<arigato>
ah, probably each *read* of "foo" is considered to come from the most recent write by the same cpu
<Remi_M>
and by writing to a different location, we are possibly disabling that memory consistency optimisation?
<arigato>
the cache line still ping-pongs between them, but most of the iterations are done without slowdown
<arigato>
no, I think the problem is that if you wait a little bit, then the cpu has time to notice it should reload the value from the non-local cache line
<Remi_M>
that may be the reason why I wasn't able to find that slowdown with false sharing in my earlier experiments...
<arigato>
(I'm not sure though, mostly guessing)
<Remi_M>
it does sound plausible and somewhat familiar now...
<Remi_M>
well, to make this a bit more mysterious: if foo and foo2 are both declared globally, the slowdown is much smaller...
<arigato>
(also, note that my x.py is very minimal: you have to press enter after all results are printed, and measure the "user" time; the "user" time is then expected to be 2x as much if you run two threads)
oberstet has joined #pypy
<Remi_M>
ah no, if I move the two global variables away from each other (into separate cache lines), I (again) get a (tremendous) slowdown of 15x
_main_ has joined #pypy
kenaan has joined #pypy
<kenaan>
arigo nogil-unsafe-2 b03810fcb5c1 /pypy/: Don't decrement the ticker at all in the nogil-unsafe-2 branch
<Remi_M>
this is fun. on 4 threads the slowdown is 40x :)
__main__ has quit [Ping timeout: 255 seconds]
_main_ is now known as __main__
<arigato>
ah, so it might be two global variables...
__main__ has quit [Read error: Connection reset by peer]
__main__ has joined #pypy
<arigato>
obscure: it seems that the next big slowdown is caused by the read of the next byte from the bytecode string
<Remi_M>
I guess it's still something like: any write to a different location/cache line disables whatever optimisation is otherwise valid
<arigato>
ah, and also in FOR_ITER, a read from an array of pointers
exarkun has quit [Ping timeout: 276 seconds]
exarkun has joined #pypy
<arigato>
unless it's badly reported and it occurs in the previous line, which might be along the "jmp"
<Remi_M>
here, most time is spent on pyframe.last_instr = ... (I think)
<arigato>
"ah"
<Remi_M>
maybe the read of pyframe.debugdata is also involved
<arigato>
ok, I get a factor between 2.1x and 5.4x(!) depending on what I replace '10000' with
<arigato>
seems that we really need to take care of false conflicts
<arigato>
I guess the worst case is when sum_a_bit() makes about one frame per minor collection; then as soon as there is this minor collection, there is one pyframe (with all dependent data) per frame,
<arigato>
and the dependent data is of a different size so it goes in some other pages
<arigato>
so in the end *everything* is in the same cache line as the same thing from the other thread
rokujyouhitoma has quit [Ping timeout: 276 seconds]
exarkun has quit [Ping timeout: 248 seconds]
exarkun has joined #pypy
oberstet has quit [Ping timeout: 240 seconds]
Remi_M has joined #pypy
<Remi_M>
arigato: another data point confirming your conclusion: always returning 64 in rffi_platform.memory_alignment() seems to give nice scaling for the original example too
<arigato>
ah ok
rokujyouhitoma has joined #pypy
marky1991 has quit [Ping timeout: 248 seconds]
<arigato>
the current ordering of tracing is a bit too random to keep the threads' objects separate
<arigato>
it starts by scanning all threads' stacks
<arigato>
so it will copy the objects directly referenced
<arigato>
but put further references in a single list
<arigato>
which is processed later
nimaje1 has joined #pypy
nimaje1 is now known as nimaje
nimaje has quit [Killed (verne.freenode.net (Nickname regained by services))]
<arigato>
there are also issues (probably less important, but unknown) with objects that are harder to attribute to a given thread:
<arigato>
for example, if it's stored in an old object
rokujyouhitoma has quit [Ping timeout: 240 seconds]
<Remi_M>
seems like I always end up at the "Hoard memory allocator" when looking for false-sharing avoidance in allocators. of course, that probably can't be implemented in an afternoon...
yuyichao has joined #pypy
oberstet has joined #pypy
<arigato>
Remi_M: I think we should get something roughly reasonable if we reorder tracing to occur thread after thread, and after a thread we simply waste a few allocations to make sure we're past the 64/128 byte limit
<arigato>
so it should help if there is no fragmentation, and if there is enough fragmentation we just rely on randomness
Rhy0lite has joined #pypy
rokujyouhitoma has joined #pypy
<Remi_M>
yes, that may be enough
rokujyouhitoma has quit [Ping timeout: 276 seconds]