<alyssa>
Lots of cleanup *and* a perf improvement? Count me in ;)
* alyssa
tries to learn `perf annotate`
<alyssa>
Evidently bitfields are slow.
<urjaman>
I'm not exactly surprised
<alyssa>
Actually that wasn't the issue, it was another reading-from-GPU-memory issue. How many of those will we keep catching, idk :c
<alyssa>
Actually, it *is* still the issue. Because bitfields necessarily imply reading from memory.
<HdkR>
x86 cheats with bitfields since it has some instructions for helping accessing them directly from memory
<HdkR>
ARM has to do a loadstore shuffle
<alyssa>
HdkR: The issue here isn't the shuffling around per se, it's that reading from GPU mapped memory is stupidly expensive
<HdkR>
ah, uncached then?
<alyssa>
Yeah, for now at least
<alyssa>
When you manually inspect code like `foo.bar = 5;` it's like, cool, that's just a write, no reads here
<alyssa>
but if you look at the assembly level, that has to load the uncached memory to do the dance... and that becomes slow.
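A minimal C sketch of the point above, assuming a made-up descriptor (`struct desc` and both helpers are hypothetical, not Panfrost code). Assigning one bitfield has to preserve its neighbours, so the compiler loads the containing word, merges, and stores it back; building the value in a local copy and writing it out once avoids reading the destination at all.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor; not an actual Panfrost structure. */
struct desc {
    uint32_t mode  : 4;
    uint32_t count : 12;
    uint32_t flags : 16;
};

/* Looks like a pure write, but the neighbouring fields must survive,
 * so this is a load, mask, merge, and store on the destination word.
 * If `d` points at uncached memory, that load is the expensive part. */
static void set_count_in_place(struct desc *d, unsigned count)
{
    assert(count < (1u << 12));
    d->count = count;
}

/* Alternative: assemble the descriptor in a local (cached) copy, then
 * write the destination once, so the mapped memory is never read. */
static void fill_via_local(struct desc *d, unsigned mode, unsigned count,
                           unsigned flags)
{
    struct desc tmp = { .mode = mode, .count = count, .flags = flags };
    memcpy(d, &tmp, sizeof(tmp));
}
```

Whether the compiler merges several adjacent bitfield stores into one is up to the optimizer, so the second form is only a sketch of how to make the write-only intent explicit.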
<HdkR>
yea, uncached ends up being pretty bad
<HdkR>
usually ends up being worth keeping it cached then doing a dcache flush at the end of whatever you need to do
<alyssa>
oh hey it's register spilling
<alyssa>
isn't that cute
* alyssa
shivers
<alyssa>
(I must say - as far as learning perf goes, this has been extremely educative. Way more productive with this than I've ever been with any other profiler ever and this is day #1. <3)
<alyssa>
On min/max index computation... I see 99% of the time spent loading indices, which absolutely supports the theory that this is a caching issue (so the proposed Gallium-based fix ought to work well)
<alyssa>
Next to the memory access, the actual min'ing and max'ing is effectively free.
<alyssa>
Same thing with the heavy access_tiled_image_generic usage in stk
<HdkR>
vector min/max ends up being three cycles per op, compared to uncached memory accesses it is nothing :P
<alyssa>
true!
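For reference, a minimal sketch of a min/max index scan, assuming 16-bit indices (`index_bounds` is a hypothetical name, not the Mesa helper). Per element there is one load and two compares, which is why the load dominates when the index buffer lives in uncached memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scan the indices once and report their range. */
static void index_bounds(const uint16_t *indices, size_t count,
                         uint16_t *min_out, uint16_t *max_out)
{
    assert(count > 0);

    uint16_t lo = indices[0];
    uint16_t hi = indices[0];

    for (size_t i = 1; i < count; i++) {
        uint16_t v = indices[i]; /* the load: dominant cost if uncached */
        if (v < lo)
            lo = v;
        if (v > hi)
            hi = v;
    }

    *min_out = lo;
    *max_out = hi;
}
```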
<anarsoul>
alyssa: btw do you see heavy access_tiled_image_generic usage in weston?
<anarsoul>
I'm seeing it with lima for some reason :(
<alyssa>
anarsoul: not sure, I can look in a bit
<alyssa>
currently in gnome
<anarsoul>
I assume it'd be the same
<HdkR>
It's a sad day that we don't get gather loads on ARM until SVE
<alyssa>
(Note: this is again the same bottleneck we see for WebGL on firefox. Unfortunately I don't think there's much to be done there.)
<alyssa>
anarsoul: In weston, I'm seeing the top function be panfrost_store_tiled_image_yes.
<alyssa>
It's just a lot of memory access anyway.
<anarsoul>
alyssa: it's not the case with gnome-shell?
<alyssa>
If you do an impl for lima, please do check what the win is. But it might be pretty decent for glamor at least :)
<anarsoul>
do you have any specific benchmark in mind?
<alyssa>
anarsoul: the MR linked, and the issue linked from that, talk about ShmPutImage in x11perf, which seems a decent proxy for glamor perf
<alyssa>
tomeu: dj,hgskkgyrs ack, I didn't realize you were *already* working on this in a branch, aaa I didn't mean to duplicate effort >..<
<alyssa>
Not time wasted - I did need to learn perf - but still feel bad :|
<alyssa>
Actually, it looks like most of it is complementary (so conflicts will be "fun" but not strictly duplicated work)
<daniels>
anarsoul: that's weird, we don't ourselves do any readbacks unless you're taking screenshots, we don't use FBOs, and we only do software uploads when software clients give us changed buffers
<daniels>
so it shouldn't be spending a ton of time doing that
<anarsoul>
daniels: according to perf it's coming from gl-renderer, which (indirectly) calls _mesa_TexSubImage2D
<daniels>
anarsoul: right, we do that to upload client content which has been given to us as a SHM buffer
<daniels>
we only do it clipped to the changed region(s), but that means the TexSubImage2D path might not be tile-aligned
<daniels>
alyssa changed Panfrost so that it would only do a partial fallback for unaligned regions (i.e. use the generic unaligned access routine for the sub-tile regions, use the fast routine for the others), rather than doing all the accesses using the generic helper if the region was unaligned
<daniels>
hmm yeah, 2091d311c9d0 applied that fix to the shared code, so it should've helped Lima as well
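One way to picture the partial fallback described above, as a rough sketch rather than the code from that commit: walk the tiles a region touches, count those fully covered for the fast routine, and leave only the partially covered edge tiles to the generic one. The 16-pixel tile edge and the names `TILE`, `tile_counts`, and `classify_tiles` are assumptions for illustration.

```c
#include <assert.h>

#define TILE 16 /* assumed tile edge in pixels; illustrative only */

struct tile_counts {
    unsigned fast;    /* tiles fully covered by the region */
    unsigned generic; /* tiles only partially covered */
};

/* Classify every tile touched by the rectangle (x, y, w, h). */
static struct tile_counts classify_tiles(unsigned x, unsigned y,
                                         unsigned w, unsigned h)
{
    struct tile_counts c = { 0, 0 };
    assert(w > 0 && h > 0);

    unsigned x1 = x + w;
    unsigned y1 = y + h;

    for (unsigned ty = y / TILE; ty * TILE < y1; ty++) {
        for (unsigned tx = x / TILE; tx * TILE < x1; tx++) {
            unsigned tx0 = tx * TILE;
            unsigned ty0 = ty * TILE;
            int full = tx0 >= x && tx0 + TILE <= x1 &&
                       ty0 >= y && ty0 + TILE <= y1;

            if (full)
                c.fast++;
            else
                c.generic++;
        }
    }

    return c;
}
```

For a damage rectangle that is large compared to a tile, most tiles land in the fast bucket and only the border pays for the generic path.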