<alyssa>
anarsoul: If you can do so cleanly, go ahead!
<alyssa>
(Cleanly = without regressing anything in terms of performance/functionality)
<alyssa>
Also, I was later informed that mesa/st has its own index cache that doesn't seem to be used much, so it might be easier to try to promote that to core Gallium and/or have a CAP to make it always be used
<alyssa>
Er, not even mesa/st, the mesa GL itself
<alyssa>
src/mesa/vbo/vbo_minmax_index.c:
<anarsoul>
interesting
vstehle has quit [Ping timeout: 256 seconds]
<anarsoul>
alyssa: hm, I don't see much difference with minmax cache :(
<bbrezillon>
apol: did you check the damage rect coordinates origin (the spec says "Coordinates are specified relative to the lower left corner")
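For readers following the coordinate-origin point: a minimal sketch in C of flipping a top-left-origin rectangle into the lower-left-origin x, y, width, height layout that the damage rects expect. The `rect_tl` type and the `rect_to_egl` helper are hypothetical names for illustration and do not come from kwin or Mesa.

```c
/* Hypothetical rectangle with a top-left origin, as compositors commonly use. */
struct rect_tl {
    int x, y, w, h;
};

/*
 * Fill out[4] with x, y, width, height where y is measured from the bottom
 * edge of the surface, which is the layout the damage-region call expects.
 */
static void rect_to_egl(const struct rect_tl *r, int surface_height, int out[4])
{
    out[0] = r->x;
    /* The rect's bottom edge in top-left space becomes its y in bottom-left space. */
    out[1] = surface_height - (r->y + r->h);
    out[2] = r->w;
    out[3] = r->h;
}
```

Only the y coordinate changes; x, width and height carry over unchanged.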
<bbrezillon>
well, if you set damage rect to cover the whole surface, then you have to redraw everything
<apol>
bbrezillon: I added this change since I couldn't figure out what was going wrong and I get similar output as weston https://invent.kde.org/snippets/748
<apol>
bbrezillon: well we are rendering the same AFAIU, like I said, if I don't call eglSetDamageRegionKHR it all works okay
<bbrezillon>
apol: which is expected since we reload the entire FB into the tile buffer in that case
<bbrezillon>
I guess you only redraw a sub-region if eglSetDamageRegion() is set
<bbrezillon>
if the trace you've added to mesa and some extra traces in kwin, you can make sure the damage rects match
<bbrezillon>
s/if/with/
<apol>
let me look at what we are rendering, thanks
* apol
is feeling n00b
<alyssa>
bifrost branching is complicated~
<alyssa>
Anything to save a bit, I guess.
<cwabbott>
^ the mantra of the entire bifrost ISA
<alyssa>
cwabbott: :3
<alyssa>
I thought it was "don't do whatever *that* was again *vigorous waving in midgard's general direction*"
<cwabbott>
there are so many clever hacks to squeeze out those bits, though
<cwabbott>
I vaguely remember figuring out the branch condition stuff... the biggest PITA was that the blob compiler sucked ass and didn't use the floating-point comparisons nearly as much as it could've
<alyssa>
Wee.
<alyssa>
Well, I guess that Bifrost motto (the second one) is my philosophy for IR design here
<alyssa>
"You know all the zillions of rewrites you did for midgard? yeah, don't do that again."
<alyssa>
See: register allocation, scheduling, control flow, !32-bit..
<alyssa>
Meanwhile, the blob here is literally emitting FCMP.D3D.OGT.f32 and then BRANCH.EQ.i32.Z
<alyssa>
(which I guess is what you meant)
<apol>
bbrezillon: so the damage traces match, I see on the logs what I expect to see. It's actually these artifacts I start seeing outside the damaged region that I don't understand
<bbrezillon>
apol: does the damage extent (mesa trace you've added) cover the region showing artifacts?
<apol>
bbrezillon: I don't think so, do you know if there's a good tool to debug this?
<bbrezillon>
not that I know
<bbrezillon>
apol: what GPU are you testing on BTW?
<apol>
that's a pinebook pro
<apol>
OpenGL renderer string: Mali T860 (Panfrost)
<bbrezillon>
should work just fine
<bbrezillon>
alyssa: should I move all panfrost_emit_ functions to a new file (pan_cmdstream.{c,h} ?) or should they stay where they are (pan_context.c, pan_varyings.c, pan_compute.c, pan_attributes.c)?
<bbrezillon>
apol: if the artifacts appear outside the damage extent, maybe it's a bug in the damage region tracking logic (you have to merge damage regions of the N last frames, N being the buffer age)
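The buffer-age merge described above can be sketched roughly as follows. This is an illustration under stated assumptions, not kwin's or Mesa's code: it tracks one bounding box per frame instead of a full region, and `damage_history`, `history_push` and `repaint_box` are hypothetical names.

```c
#include <stdbool.h>

#define HISTORY_LEN 4

/* Half-open box; empty when x1 >= x2 or y1 >= y2. */
struct box {
    int x1, y1, x2, y2;
};

/* frames[0] is the most recent already-presented frame. */
struct damage_history {
    struct box frames[HISTORY_LEN];
    int count;
};

static bool box_empty(const struct box *b)
{
    return b->x1 >= b->x2 || b->y1 >= b->y2;
}

static struct box box_union(struct box a, struct box b)
{
    if (box_empty(&a))
        return b;
    if (box_empty(&b))
        return a;
    struct box r = {
        a.x1 < b.x1 ? a.x1 : b.x1,
        a.y1 < b.y1 ? a.y1 : b.y1,
        a.x2 > b.x2 ? a.x2 : b.x2,
        a.y2 > b.y2 ? a.y2 : b.y2,
    };
    return r;
}

/* Record the damage of a frame once it has been presented. */
static void history_push(struct damage_history *h, struct box damage)
{
    for (int i = HISTORY_LEN - 1; i > 0; i--)
        h->frames[i] = h->frames[i - 1];
    h->frames[0] = damage;
    if (h->count < HISTORY_LEN)
        h->count++;
}

/*
 * Area to repaint into a buffer of the given age: the current frame's damage
 * plus the damage of the (age - 1) frames presented since the buffer was last
 * drawn. Age 0 (undefined contents) or too little history means a full repaint.
 */
static struct box repaint_box(const struct damage_history *h, struct box current,
                              int buffer_age, struct box full)
{
    if (buffer_age <= 0 || buffer_age - 1 > h->count)
        return full;
    struct box r = current;
    for (int i = 0; i < buffer_age - 1; i++)
        r = box_union(r, h->frames[i]);
    return r;
}
```

Artifacts outside the damaged area are a typical symptom when only the current frame's damage is used with a buffer whose age is greater than 1.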
<bbrezillon>
apol: I'd say that you should pass 'region' to regionToRects(), not 'output.damageHistory.constFirst()'
<apol>
bbrezillon: I've already tried this
<apol>
there's something else wrong
<apol>
but I'm sadly not very familiar with our codebase, which doesn't help
yann has quit [Ping timeout: 260 seconds]
<apol>
bbrezillon: yep, I was creating the rects wrong :'(
<apol>
bbrezillon: it's working great now, thanks for the patience :)
apol has quit [Remote host closed the connection]
gcl_ has quit [Ping timeout: 240 seconds]
gcl has joined #panfrost
nerdboy has joined #panfrost
pH5 has quit [Quit: bye]
<alyssa>
bbrezillon: Up to you, whatever you think is easier / less churn :)
<alyssa>
patch was dropped since 1) it doesn't handle other sizes but you could fix that and 2) I never got around to doing good benchmarking and from the asm I'm not convinced it's actually a win
<alyssa>
Theoretically it should definitely be a win since it amortizes the cost of reading user indices so it'll save [size of index buffer] in read traffic
<alyssa>
But in practice I don't know if `for (i < len) out[i] = in[i]` is as efficient as memcpy ... obviously a compiler on -O3 should be able to vectorize it and get close but memcpy I think can have some hand asm tricks on some platforms, and I don't know if arm32/64 do that
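To make the trade-off concrete, here is a rough sketch in C of the two approaches for 16-bit indices. It is not the dropped patch: the function names are hypothetical, and primitive restart and other index sizes are ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fused variant: copy the indices and track min/max in the same pass, so the
 * user index buffer is read only once. For len == 0 the results are left at
 * their initial values (min = UINT16_MAX, max = 0).
 */
static void copy_minmax_u16(uint16_t *out, const uint16_t *in, size_t len,
                            uint16_t *min, uint16_t *max)
{
    uint16_t lo = UINT16_MAX, hi = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t v = in[i];
        out[i] = v;
        if (v < lo)
            lo = v;
        if (v > hi)
            hi = v;
    }
    *min = lo;
    *max = hi;
}

/* Baseline: memcpy for the upload, then a separate scan that reads the data again. */
static void memcpy_then_minmax_u16(uint16_t *out, const uint16_t *in, size_t len,
                                   uint16_t *min, uint16_t *max)
{
    uint16_t lo = UINT16_MAX, hi = 0;
    memcpy(out, in, len * sizeof(*in));
    for (size_t i = 0; i < len; i++) {
        if (in[i] < lo)
            lo = in[i];
        if (in[i] > hi)
            hi = in[i];
    }
    *min = lo;
    *max = hi;
}
```

Whether the fused loop actually beats the baseline depends on how well the compiler vectorizes it and on how tuned the platform's memcpy is, so it would need benchmarking on the target hardware.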
<alyssa>
If it's a bottleneck for you, though, worth a look
<HdkR>
ARM64 has some major memcpy optimizations in the libraries
<anarsoul>
iirc supertuxkart spends quite some time in min/max calculation
<anarsoul>
but not q3a
<alyssa>
anarsoul: yeah, but it's not clear if the bottleneck is actually the min/max ALU side or just the I/O of reading all that data
<alyssa>
HdkR: significantly better than an unrolled NEON vectorized load/store then?