cfbolz changed the topic of #pypy to: PyPy, the flexible snake (IRC logs: https://quodlibet.duckdns.org/irc/pypy/latest.log.html#irc-end ) | use cffi for calling C | if a pep adds a mere 25-30 [C-API] functions or so, it's a drop in the ocean (cough) - Armin
<vstinner>
hello. i'm working on adding a "geometric mean" value to the compare_to command of the pyperf project: https://github.com/psf/pyperf/pull/79
<vstinner>
i'm not sure that i'm computing the geometric mean of the right values
<vstinner>
i'm computing the mean of "speeds": ratios (benchmark mean) / (reference benchmark mean)
<vstinner>
a benchmark suite is made of multiple benchmarks, and i would like to get a single number to easily compare two suites
<vstinner>
in my PR, i consider that a geometric mean > 1.0 means "faster" whereas speed.pypy.org says that a geometric mean of 0.24 means "faster"
<fijal>
vstinner: what's the distribution you assume?
<vstinner>
fijal: i don't understand your question, sorry. distribution of what?
<cfbolz>
vstinner: small numbers are better, no?
<cfbolz>
0.5 means you are 2x faster
<pmp-p>
if timing a finite number of instructions i'd go +1 with cfbolz
<pmp-p>
if 0.5 means "half the time", not twice the computational amount in a given time
<fijal>
vstinner: distribution of consecutive runs
<fijal>
if you are doing a geometric mean, you are assuming something about the probability distribution, no?
<vstinner>
cfbolz: maybe i'm doing it backwards :)
<cfbolz>
vstinner: above you wrote " ratios (benchmark mean) / (reference benchmark mean)"
rfgpfeiffer has joined #pypy
oberstet_ has joined #pypy
oberstet has quit [Ping timeout: 265 seconds]
<vstinner>
cfbolz: it will save you time if you consider that i have no idea what i am doing :-D
<cfbolz>
:-)
rfgpfeiffer has quit [Ping timeout: 240 seconds]
<vstinner>
ok ok, i fixed my PR so now geo mean < 1.0 means faster and geo mean > 1.0 means slower, as on speed.pypy.org
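A minimal sketch of that convention, assuming per-benchmark ratios of (benchmark mean) / (reference benchmark mean) so that values below 1.0 mean faster; this is not pyperf's actual code, the sample ratios are invented, and statistics.geometric_mean needs Python 3.8+:

    from statistics import geometric_mean  # Python 3.8+

    # ratio = (benchmark mean) / (reference benchmark mean); < 1.0 means faster
    ratios = [0.95, 1.02, 0.80, 1.10]       # invented example values
    overall = geometric_mean(ratios)
    print(f"geometric mean: {overall:.3f}")
    print("faster overall" if overall < 1.0 else "slower overall")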
<Hodgestar>
Using the geometric mean to aggregate benchmark results was deprecated in 1988, but it is still popular. Possibly because no one could agree on what weights to give the individual components. ;)
<vstinner>
Hodgestar: deprecated ok, but replaced with what?
<vstinner>
Hodgestar: i'm trying to give a single value to summarize N benchmarks of a benchmark suite, when comparing two benchmark suites results
<Hodgestar>
vstinner: The geometric mean is an odd way to combine run times together, right? It multiplies them (X1 * X2 * ...) where a real program would add the run times of the different things it does (X1 + X2 + ...). But even if the individual benchmarks are somehow representative, other programs will do different amounts of those sorts of work, so one would ideally want to add weights (w1 X1 + w2 X2 + ...) but the weights would be
<Hodgestar>
different for each use case.
<Hodgestar>
vstinner: Sorry that isn't a suggestion or a criticism -- I am just thinking about the problem out loud.
<vstinner>
Hodgestar: it's not absolute timings in seconds, but normalized values
<Hodgestar>
vstinner: I'm aware. The normalizing is part of the issue. E.g. if we set PY36 to X1 = 1 and PY37 to X1 = 2, that sweeps under the rug the issue of what fraction of their time programs actually spend doing X1.
<vstinner>
Hodgestar: currently, people throw 60 lines of benchmark results: some are faster, some are slower. honestly, even though i'm used to benchmarking, i have no idea if, overall, it means that the change makes Python faster or slower
<vstinner>
Hodgestar: i expect that the geometric mean will help me to make a decision
<vstinner>
i don't know what the geometric mean is when 10 benchmarks are 1.01x slower but 1 benchmark is 2.0x faster. overall, is it a good thing or not? :)
<mattip>
weights would take into account how common the faster action actually is in real life
<mattip>
but there is no "real life" for python
<mattip>
so just weighting everything equally is as good as any other metric
<mattip>
unless you have some heuristic to say benchmark A is ten times as important as benchmark B
<vstinner>
mattip: i put a weight of 0 on pyperformance microbenchmarks that I consider non-relevant/useless: i simply removed them :-D
<Hodgestar>
vstinner: Lol. Nice. :)
<mattip>
if you look at speed.pypy.org and what you do all day is build sphinx documentation on readthedocs, then PyPy is not your tool
<mattip>
but if you do templating then definitely, PyPy is fantastic
<Hodgestar>
vstinner, mattip: Maybe people could be allowed to specify their own weights, or there could be a few different weightings that are meant to represent common scenarios (but that sounds like a lot of work and complication for uncertain gains).
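A hypothetical sketch of what user-supplied weights could look like as a weighted geometric mean; none of these names or numbers come from pyperf, and a weight of 0 simply drops a benchmark, as described above:

    import math

    # Weighted geometric mean: exp(sum(w_i * ln(x_i)) / sum(w_i)).
    def weighted_geometric_mean(ratios, weights):
        total = sum(weights)
        return math.exp(sum(w * math.log(r) for r, w in zip(ratios, weights)) / total)

    ratios  = [0.95, 1.02, 0.80]   # invented speed ratios
    weights = [1.0, 2.0, 0.0]      # invented importance weights; 0 removes a benchmark
    print(weighted_geometric_mean(ratios, weights))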
<vstinner>
mattip: i don't want to have to bother with weights
<vstinner>
Hodgestar: i wrote pyperf for people who run a benchmark in 5 min and then tweet the result. for people who have no idea what they are doing
<vstinner>
that's why pyperf writes explicitly "faster" and "slower". previously, people (including me) read a benchmark result backwards :)
<vstinner>
ah, about the case of 10 benchmarks being 1.01x slower and 1 benchmark being 2.0x faster, I got my answer: the geometric mean says that overall, it's faster :)
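Checking that case with the same ratio convention as above (2.0x faster is a ratio of 0.5; made-up numbers, not pyperf output):

    from statistics import geometric_mean

    ratios = [1.01] * 10 + [0.5]     # ten benchmarks 1.01x slower, one 2.0x faster
    print(geometric_mean(ratios))    # ~0.95: below 1.0, so faster overall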
<vstinner>
by the way, the std dev is very large! 710 ns for a mean of 349 ns! i asked the author if there is something wrong with Python or the benchmark
<vstinner>
context: https://bugs.python.org/issue41972 bytes.find() is inefficient for a specific pattern, it's about fine-tuning the Bloom filter
<mattip>
so in most cases all you want is some relative measure of "did this make things better or worse"
<mattip>
and if the answer is "both" then
<vstinner>
mattip: lol
<vstinner>
"did this make things better or worse" => "yes" :-D
<mattip>
the change probably has some heuristic that is tuned, so provide a lever for people to tune it
<mattip>
e.g. gcc flags for all kinds of optimizations and projects that explore the optimization space and choose the best ones
jacob22 has quit [Read error: Connection reset by peer]
<mattip>
measuring something in ns sounds fishy to me, the whole benchmark is probably testing things like cpu caches and opcode pipelining
<mattip>
"measureing something high-level like bytes.find()"
<vstinner>
for me the right part is that in the same process, the benchmark produces very different values:
<vstinner>
- value 1: 9.72 us (+138%)
<vstinner>
- value 2: 364 ns (-91%)
<vstinner>
- value 3: 2.16 us (-47%)
<vstinner>
sorry, the *strange* part
jacob22 has joined #pypy
<vstinner>
mattip: i tried, but failed, to convince people to stop bothering about nanoseconds
<vstinner>
mattip: but at least, i tried to make such benchmarks a little bit more reliable :-p
<vstinner>
not everybody on earth is connected to #pypy, most people run nonsense benchmarks :-D
<mattip>
benchmarks and reliable in the same sentence! cfbolz has a paper for you
<mattip>
numpy uses asv and has a ~20 minute benchmark suite.
<mattip>
Every time I try to run it, I get wildly different results
<vstinner>
mattip: haha, i read it
<vstinner>
i hate this paper
<mattip>
the paper or the idea that benchmarking is unreliable?
ctismer_ has joined #pypy
ctismer has quit [Ping timeout: 256 seconds]
ctismer_ is now known as ctismer
<Dejan>
so instead of benchmarking we just say "benchmarking is unreliable" and we give up :)
<Dejan>
i agree with the statement ofc
<mattip>
I see two uses for benchmarking
<mattip>
short-term a/b testing for comparing two algorithms in a systems test
<mattip>
and
<mattip>
long-term stability testing on a set of benchmarks on a fixed machine (like speed.pypy.org) where you can collect statistics over time and try to find regressions/improvements
<vstinner>
mattip: i hate the truth that it's not possible to benchmark anything :-D it's not possible to get reliable and reproducible benchmark results
<simpson>
Worse (and ironically), benchmarking is possible on older hardware designs, but we have long since stopped using that hardware because it's relatively slow.
<vstinner>
simpson: are you thinking of Hyper-Threading, Turbo Boost and things like that? both can be disabled (more or less easily)
lritter has joined #pypy
<simpson>
vstinner: I'm thinking further back than that, to the switch from in-order to out-of-order execution, and the switch from constant-access RAM to caches.
rfgpfeiffer has joined #pypy
<gsnedders>
(though the value of disabling things like SMT and CPU freq scaling is debatable, given that you're then testing in a configuration nobody actually runs in prod)
<simpson>
Right. People shop for software like they shop for plumbing or screws in a home-improvement store; they expect hard quantitative numbers which aren't just internally/relatively correct, but which give them some objective hint as to whether it'll perform well enough for their needs.
<gsnedders>
CPU freq scaling especially is significant, though, given so much performance nowadays is gated behind it, especially in multi-threaded/process situations
<gsnedders>
but yeah, it's useful to give _some_ indication, but it can be totally misleading
<fijal>
simpson: we always struggled with "how many cores are actually running the program"
<fijal>
because depending on that, your settings should likely be quite different
<simpson>
fijal: Yeah! And before that, it was "how much L2 do you have?", etc.