ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=lima and https://freenode.irclog.whitequark.org/lima - Contact ARM for binary driver support!
tlwoerner has joined #lima
kaspter has joined #lima
camus has joined #lima
kaspter has quit [Ping timeout: 264 seconds]
camus is now known as kaspter
_whitelogger has joined #lima
kaspter has quit [Ping timeout: 264 seconds]
kaspter has joined #lima
camus has joined #lima
kaspter has quit [Ping timeout: 256 seconds]
camus is now known as kaspter
FLHerne has quit [Quit: Goodbye (ZNC disconnected)]
Danct12 has joined #lima
warpme_ has joined #lima
dev1990 has joined #lima
monstr has joined #lima
FLHerne has joined #lima
mripard has joined #lima
Putti has quit [Ping timeout: 276 seconds]
<linkmauve> enunes, fyi https://gitlab.gnome.org/GNOME/gtk/-/commits/ngl-clip-classification brings the gears demo from ~8.5 fps to ~32 fps. :)
<linkmauve> Disabling clipping altogether gives me ~57 fps, and we might reach that with some very slight changes to the theme.
camus has joined #lima
kaspter has quit [Ping timeout: 264 seconds]
camus is now known as kaspter
Putti has joined #lima
<enunes> linkmauve: a comparison may be very cheap if it can be optimized into a select between two different inputs
<enunes> it can be more expensive if it expands into two large branches... but overall I think comparisons are not the biggest issue, loops are
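A minimal GLSL ES sketch of the two shapes enunes is contrasting (u_mode, color_a and color_b are made-up names, and whether the compiler emits a select or flattens the branch is ultimately up to the backend):

```glsl
#version 100
precision mediump float;

uniform float u_mode;
uniform vec4 color_a;
uniform vec4 color_b;

void main() {
    // Cheap shape: the comparison folds into a select between two ready
    // inputs; step() yields 0.0 or 1.0 and mix() picks accordingly.
    vec4 selected = mix(color_a, color_b, step(0.5, u_mode));

    // Potentially expensive shape: the same comparison guarding two
    // non-trivial branches, which the compiler may have to flatten so
    // that both sides are evaluated for every fragment.
    vec4 branched;
    if (u_mode > 0.5)
        branched = color_a * color_a + vec4(0.1);
    else
        branched = color_b * color_b - vec4(0.1);

    gl_FragColor = selected + branched;
}
```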
<linkmauve> There are no loops in my current benchmark (scrolling in the left pane in gtk4-demo).
<linkmauve> In the end a GTK dev implemented a switch between shaders doing no clipping, rectangular clipping, or rounded rectangular clipping (the old default).
<linkmauve> I’m going to try to replace the second set with glScissor, so that the first one can be used for almost every draw call.
enty is now known as ente
<linkmauve> Do I have to do anything to enable the shader cache?
<linkmauve> Compiling shaders takes 41.74% of the startup time of gtk4-demo.
<linkmauve> Out of 5.8s that’s still almost 2.5s.
<enunes> linkmauve: yeah I noticed that too some time ago, it is in fact register allocation during compilation that takes quite a bit of time, and on these shaders that spill multiple times it takes a while
<linkmauve> I wiped out my ~/.cache/mesa_shader_cache and it didn’t get recreated, even though it was 1.8 MiB before.
<enunes> the shader cache in lima is enabled by default, I think there is another level of that in mesa but I don't know how to use it
<enunes> in lima it's per application, it's not meant to save compilation time across runs but just to avoid internal recompilation (for example if you change a texture swizzle or format, which may trigger an internal shader recompilation)
<linkmauve> Oh…
<enunes> there is a MR by anholt which may help register allocation a bit
<linkmauve> So I guess it wouldn’t make sense for GTK to use glGetProgramBinary() either?
<enunes> but in general I think the better advice there is... make the shaders simple so they don't spill 30 times, then compilation times should go down
<linkmauve> The latest changes on the https://gitlab.gnome.org/GNOME/gtk/-/commits/ngl-clip-classification branch made GTK compile each fragment shader three times. \o/
<enunes> maybe register allocation can be further optimized, but the point is, if a shader is so complicated it takes many seconds just to compile, it's not going to perform very well anyway
<linkmauve> One with tons of spills, one with fewer, and one with even fewer.
<linkmauve> For a simple colour fill shader, that’s 114 instructions, 14 instructions, or one instruction.
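As a rough illustration of why the three variants differ so much (this is a hypothetical colour-fill shader, not GTK's actual source; the NO_CLIP/RECT_CLIP defines and the u_clip_* uniforms are made up for the sketch), the rounded-rectangle coverage math is what inflates the instruction count:

```glsl
#version 100
precision mediump float;

uniform vec4 u_color;
uniform vec4 u_clip_rect;          // x, y, width, height
uniform vec4 u_clip_corner_radii;  // per-corner radii (simplified to one here)

// NO_CLIP variant: a single move, nothing else.
#if defined(NO_CLIP)
void main() {
    gl_FragColor = u_color;
}

// RECT_CLIP variant: a handful of compares against the clip rectangle.
#elif defined(RECT_CLIP)
void main() {
    vec2 p = gl_FragCoord.xy;
    vec2 lo = u_clip_rect.xy;
    vec2 hi = u_clip_rect.xy + u_clip_rect.zw;
    float inside = step(lo.x, p.x) * step(lo.y, p.y) *
                   step(p.x, hi.x) * step(p.y, hi.y);
    gl_FragColor = u_color * inside;
}

// Rounded-clip variant: per-fragment distance math, the costly path.
#else
void main() {
    vec2 p = gl_FragCoord.xy;
    vec2 center = u_clip_rect.xy + 0.5 * u_clip_rect.zw;
    vec2 half_size = 0.5 * u_clip_rect.zw;
    float r = u_clip_corner_radii.x;
    // Signed distance to a rounded rectangle, then a soft coverage edge.
    vec2 q = abs(p - center) - half_size + vec2(r);
    float dist = length(max(q, 0.0)) + min(max(q.x, q.y), 0.0) - r;
    float coverage = clamp(0.5 - dist, 0.0, 1.0);
    gl_FragColor = u_color * coverage;
}
#endif
```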
<enunes> I don't think you need to use glGetProgramBinary at all
<enunes> 3 times you mean gtk calls the compiler 3 times, or lima compiles it 3 times by itself?
<linkmauve> GTK calls it three times, on three different shader inputs.
<enunes> ah, well that should be ok
<linkmauve> It’s what brings it from ~8 fps to ~30 fps.
<enunes> seems like a nice boost
<linkmauve> enunes, what do you mean by “I think there is another level of that in mesa but I dont know how to use it”, is it the one configured with MESA_GLSL_CACHE_DISABLE?
<enunes> yeah I meant that one, I personally never used it
<enunes> does it work?
<linkmauve> I had 1.8 MiB of data in there before, but now it doesn’t get recreated.
<linkmauve> So maybe?
<linkmauve> It’s the one that would help in-between application runs.
<enunes> it's kinda... wrong
<linkmauve> enunes, wouldn’t d4f706389c92e389aa8f75b9e7e8a28289d257de help wrt texture swizzles?
<enunes> linkmauve: yeah it helps a bit, but do you use texture swizzles too? it was just implemented a couple of weeks ago
<linkmauve> I have no idea.
<linkmauve> I’m not much of a GTK dev myself, just trying to make the whole thing better on my phone. :)
<enunes> probably not, otherwise you would be seeing multiple recompilations of your shaders even if you are not triggering them from the application
<linkmauve> But I mean, with this change it should be possible/easier to use the Mesa cache and skip compilation the next time the same shader gets compiled, or is something still missing?
<enunes> I don't know how the mesa shader cache works, I need to catch up with that
<enunes> but it seems difficult to cache mali binaries directly, it would probably need additional information to hash it with, like we do for lima, so I guess it's more likely caching the intermediate representation before backend compilation?
<enunes> in which case it wouldn't help your compile times much
<linkmauve> This cache doesn’t have to be the full final binary, it could be an earlier step which would still let you skip most of the compilation work.
<linkmauve> Although… you say codegen is the most expensive step of the whole compilation process?
<enunes> yes, but in your case it's probably the backend register allocator, so it wouldn't help much
<linkmauve> What makes the final binary impossible to cache? What are the runtime-dependent things you need to regenerate for every compilation/linkage?
<enunes> it would require knowledge of the backend
<enunes> it's not "impossible", but we don't pass this information to upper layers in mesa anywhere
<enunes> nor am I aware of an interface to do so
<enunes> we only recompile shaders in those cases because it's not possible to implement the features directly in hardware, and this varies completely with whatever the target hardware supports
<linkmauve> What is the backend here?
<enunes> lima, as a mesa backend/driver
<linkmauve> Ah, but AIUI the Mesa cache is already meant to invalidate cache based on which driver is being used.
<linkmauve> Not just which driver, but also which version of the driver.
<enunes> but mesa doesn't know we are recompiling it according to those features, so it wouldn't know which binary to use
<linkmauve> So when the user upgrades Mesa, even if Lima didn’t see any change it will recreate the cache over time, instead of reusing the previous binaries.
<linkmauve> The features, as in the GPU’s features?
<linkmauve> In case the user shares their ~/.cache directory between, like, an A64 and an A20?
<enunes> yes, for example some texture formats are not implemented in hardware, so lima modifies your shader to swizzle the result so it "supports" that texture format
<enunes> mesa has no idea of that, it's just that lima checks what texture formats you currently have bound and will do it for you if you have textures in those formats
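A sketch of the effect enunes describes, written as GLSL for readability even though lima applies the fixup to the compiled program rather than to source (u_tex and v_uv are illustrative names):

```glsl
#version 100
precision mediump float;

uniform sampler2D u_tex;
varying vec2 v_uv;

void main() {
    // The application wrote a plain texture fetch...
    vec4 texel = texture2D(u_tex, v_uv);
    // ...but for a format the hardware cannot sample natively, the driver
    // in effect appends a swizzle like this one behind the app's back,
    // e.g. reordering components of data uploaded in an unsupported layout.
    gl_FragColor = texel.bgra;
}
```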
<linkmauve> Can’t you encode this information alongside the binary you store in the cache?
<enunes> we do it in the internal lima shader cache, but this information doesn't go up to the mesa layers
<linkmauve> So that if you get an incompatible binary from the cache, you ignore it and trigger a full compilation anyway?
<enunes> that is pretty much what happens, but with the in-memory shader cache in lima, and that has nothing to do with any mesa shader cache
<enunes> because lima knows all the things that require recompilation to be supported on the mali400 hardware
<enunes> I suppose we "could" save it to disk, sure, but that would be kind of awkward in the backend, I don't know if any drivers do it
<linkmauve> Maybe we should bring that issue to #dri-devel?
<enunes> honestly, I don't see it as the way to go forward... we already apparently cache the intermediate code from mesa, and we cache the binaries in the backend to avoid the internal recompilation
<enunes> if it is a real problem, and in shaders that won't crawl at <1fps, we should probably invest that effort in optimizing the backend
<linkmauve> It is a real problem in startup time.
<enunes> probably further optimize this: https://gitlab.freedesktop.org/enunes/mesa/-/commit/9bf210ba982ba4e0a1cd125285eb65bc2213242f - I already added a note there about it being potentially slow on very large shaders
<linkmauve> I’m rebuilding master with debug symbols to make my flamegraph more useful.
<enunes> maybe further optimize the data structures in that, I already did one pass with that mesa set, but something smarter could give some speedup
<enunes> I think it is only an issue because of shaders that spill so much they are nearly unusable anyway
<linkmauve> I could rebuild GTK with no support for rounded rectangle clipping and see how much it saves in compilation times.
<enunes> does it take long to compile those shaders with shader-db run -j1 ?
<linkmauve> ./run -j1 /tmp/bar 7.27s user 0.14s system 97% cpu 7.604 total
<enunes> yeah 7s seems pretty bad
<enunes> if you were to perf record that... I suppose it would fall somewhere in ppir_regalloc_prog() ?
<linkmauve> I’m finishing rebuilding Mesa with debug symbols first.
kaspter has quit [Ping timeout: 264 seconds]
monstr has quit [Remote host closed the connection]
kaspter has joined #lima
<linkmauve> Should have disabled LTO. -_-
<linkmauve> … and I forgot sun4i-drm. ^^'
<linkmauve> Woah, a GL application doing just one draw call and exiting is now a full 0.1s faster than 20.3.4!
<linkmauve> 0.30~0.38s on current master, and 0.43~0.47s on 20.3.4!
<enunes> I wouldn't expect much difference there
<linkmauve> I was using the binary compiled by ArchLinux for 20.3.4.
<linkmauve> So maybe a build difference?
<enunes> hmm, don't know really, a lot of things change, including things out of lima's control
<linkmauve> dlopen() used to take 41.90% of the flamegraph, now it’s only 8.42%.
<linkmauve> So possibly, because I built only the drivers I’m interested in, the dynamic linker has less work to do?
<linkmauve> dlopen() called by loader_open_driver().
<enunes> don't know really... one thing that might be relevant (or not) is how you are installing the drivers; mesa creates hard links for all the drivers, but if you do some funny stuff (like .zip them) it might end up creating actual file copies
<linkmauve> Using my distribution’s packages, so tar, which does support hard links.
<linkmauve> Oh no, gtk4-demo segfaults on master!
<linkmauve> #0 0x0000ffff7980d038 in ppir_node_to_instr () at /usr/lib/dri/sun4i-drm_dri.so
<enunes> hmm that is new and unexpected, and hasn't happened in a while :) are you sure?
<enunes> haven't seen crashes in compilation in ppir for a long time
<linkmauve> I am sure I’m getting a segfault, why or how, I have no idea yet!
<enunes> if it really crashes, I'm interested in what fragment shader you are feeding it
<linkmauve> One from GTK, not sure which yet (I’m rebuilding it to get real real debug symbols now).
<enunes> linkmauve: it's missing the preamble, I tried to paste it together with the preamble glsl files but then it didn't crash for me
<linkmauve> Hmm. :/
<enunes> unless it's because I'm a couple of days of git pull behind, but more likely I just ended up with a different shader by pasting things
<enunes> do you have a captured .shader_test?
<linkmauve> https://linkmauve.fr/files/gtk-lima.txt is the full backtrace.
<linkmauve> Yes.
<linkmauve> Ah, you may have to:
<linkmauve> #define GSK_GLES 1
<linkmauve> #define NO_CLIP 1
<linkmauve> Although shader-db doesn’t crash here. :/
<anarsoul> blend.glsl will be pretty expensive on mali4x0, too many ifs in main()
<anarsoul> it's better to precompile several shaders rather than selecting what to do based on u_mode
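A schematic of the difference anarsoul is pointing at (u_mode and blend.glsl come from the conversation; the mode numbering, blend formulas and the BLEND_MODE_MULTIPLY define are illustrative, not GTK's actual code):

```glsl
#version 100
precision mediump float;

uniform int u_mode;
uniform sampler2D u_source;
uniform sampler2D u_source2;
varying vec2 v_uv;

#ifdef BLEND_MODE_MULTIPLY
// Precompiled variant: the mode is baked in at compile time, no branching.
void main() {
    gl_FragColor = texture2D(u_source, v_uv) * texture2D(u_source2, v_uv);
}
#else
// Ubershader: every fragment pays for the mode selection, and on mali4x0
// the compiler may flatten several of these paths into straight-line code.
void main() {
    vec4 a = texture2D(u_source, v_uv);
    vec4 b = texture2D(u_source2, v_uv);
    if (u_mode == 0)
        gl_FragColor = a * b;         // multiply
    else if (u_mode == 1)
        gl_FragColor = a + b - a * b; // screen
    else
        gl_FragColor = max(a, b);     // lighten
}
#endif
```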
<linkmauve> That’s what I told the GTK people, but they didn’t seem to agree so far.
<enunes> if it doesn't crash with shader-db, it still *might* crash if it only triggers with runtime texture swizzling modifications
<anarsoul> I don't think it'll be efficient on any GPU
<linkmauve> I’ll try showing them performance increases instead of arguing without profiling. :)
<anarsoul> linkmauve: are there any gpu experts among them? :)
<anarsoul> OK, sounds good
<linkmauve> anarsoul, it could be efficient on drivers which do specialisation.
<linkmauve> I’m quite familiar with this kind of huge shader, Dolphin (another project I’ve been working on) is using them extensively: https://fr.dolphin-emu.org/blog/2017/07/30/ubershaders/
<linkmauve> And it absolutely destroys any GPU I own. ^^
<enunes> I saw that article one time before, this discussion reminded me of it too
<anarsoul> even with clause-based ISA it'll take some time till it gets to luminosity
<anarsoul> uber shaders are not a good idea for most (if not all) mobile GPUs
<linkmauve> enunes, anarsoul, would you like to come to #gtk on irc.gnome.org at some point? You’d have a lot more legitimacy than I do.
<anarsoul> I can, but the question is, do they target gtk4 at mobile GPUs? If they don't, my presence won't help much
<linkmauve> They’ve been helping me help them fix the issues on my PinePhone at least. ^^
<enunes> linkmauve: I could join, but mostly to provide info on how things are in lima, I already went deeper into the gtk code than I would like :)
camus has joined #lima
kaspter has quit [Ping timeout: 245 seconds]
camus is now known as kaspter
kaspter has quit [Ping timeout: 246 seconds]
camus has joined #lima
camus is now known as kaspter
dddddd has quit [Ping timeout: 256 seconds]
dddddd has joined #lima
camus has joined #lima
kaspter has quit [Ping timeout: 245 seconds]
camus is now known as kaspter
dev1990 has quit [Quit: Konversation terminated!]