#lima on 2021-03-08 — irc logs at freenode.irclog.whitequark.org

2019-07-03 10:24 ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=lima and https://freenode.irclog.whitequark.org/lima - Contact ARM for binary driver support!

00:13 tlwoerner has joined #lima

01:54 kaspter has joined #lima

01:57 camus has joined #lima

01:58 kaspter has quit [Ping timeout: 264 seconds]

01:58 camus is now known as kaspter

03:44 _whitelogger has joined #lima

06:22 kaspter has quit [Ping timeout: 264 seconds]

06:23 kaspter has joined #lima

07:35 camus has joined #lima

07:35 kaspter has quit [Ping timeout: 256 seconds]

07:35 camus is now known as kaspter

07:56 FLHerne has quit [Quit: Goodbye (ZNC disconnected)]

08:20 Danct12 has joined #lima

08:21 warpme_ has joined #lima

09:02 dev1990 has joined #lima

09:19 monstr has joined #lima

09:37 FLHerne has joined #lima

10:10 mripard has joined #lima

10:53 Putti has quit [Ping timeout: 276 seconds]

11:01 <linkmauve> enunes, fyi https://gitlab.gnome.org/GNOME/gtk/-/commits/ngl-clip-classification brings the gears demo from ~8.5 fps to ~32 fps. :)

11:02 <linkmauve> Disabling clipping altogether gives me ~57 fps, and we might reach that with some very slight changes to the theme.

11:57 camus has joined #lima

11:58 kaspter has quit [Ping timeout: 264 seconds]

11:58 camus is now known as kaspter

11:59 Putti has joined #lima

13:11 <enunes> linkmauve: a comparison may be very cheap if it can be optimized as a selection of two different inputs to something

13:12 <enunes> it can be more expensive if it goes to two large branches... but overall I think comparisons are not the biggest issue, loops are

13:32 <linkmauve> There are no loops in my current benchmark (scrolling in the left pane in gtk4-demo).

13:33 <linkmauve> In the end a GTK dev implemented a switch between shaders doing no clipping, rectangular clipping, or rounded rectangular clipping (the old default).

13:34 <linkmauve> I’m going to try to replace the second set with glScissors, so that the first one can be used for almost every draw call.

15:04 enty is now known as ente

16:08 <linkmauve> Do I have to do anything to enable the shader cache?

16:09 <linkmauve> Compiling shaders takes 41.74% of the startup time of gtk4-demo.

16:09 <linkmauve> Out of 5.8s that’s still almost 2.5s.

16:10 <enunes> linkmauve: yeah I noticed that too some time ago, it is in fact that register allocation during compilation takes quite a bit, and on these shaders that are spilling multiple times, it takes a while

16:11 <linkmauve> I wiped out my ~/.cache/mesa_shader_cache and it didn’t get recreated, even though it was 1.8 MiB before.

16:11 <enunes> the shader cache in lima is enabled by default, I think there is another level of that in mesa but I dont know how to use it

16:12 <enunes> in lima its per application, its not meant to save compilation time across runs but just to avoid internal recompilation (for example if you change texture swizzle or format which may trigger an internal shader recompilation)

16:13 <linkmauve> Oh…

16:13 <enunes> there is a MR by anholt which may help register allocation a bit

16:13 <linkmauve> So I guess it wouldn’t make sense for GTK to use glGetProgramBinary() either?

16:13 <enunes> but in general I think the better advice there is... make the shaders simple and not spill 30 times, so compilation times should go down

16:14 <linkmauve> The latest changes on the https://gitlab.gnome.org/GNOME/gtk/-/commits/ngl-clip-classification branch made GTK compile each fragment shader three times. \o/

16:14 <enunes> maybe register allocation can be further optimized, but the point is if a shader is so complicated it takes many seconds just to compile, its not going to perform very well anyway

16:15 <linkmauve> One with tons of spills, one with fewer, and one with even fewer.

16:15 <linkmauve> For a simple colour fill shader, that’s 114 instructions, 14 instructions, or one instruction.

16:18 <enunes> I dont think you need to use glGetProgramBinary at all

16:18 <enunes> 3 times you mean gtk calls the compiler 3 times, or lima compiles it 3 times by itself?

16:19 <linkmauve> GTK calls it three times, on three different shader inputs.

16:19 <enunes> ah, well that should be ok

16:19 <linkmauve> It’s what brings it from ~8 fps to ~30 fps.

16:20 <enunes> seems like a nice boost

16:20 <linkmauve> enunes, what do you mean by “I think there is another level of that in mesa but I dont know how to use it”, is it the one configured with MESA_GLSL_CACHE_DISABLE?

16:21 <enunes> yeah I meant that one, I personally never used it

16:21 <enunes> does it work?

16:21 <linkmauve> I had 1.8 MiB of data in there before, but now it doesn’t get recreated.

16:21 <linkmauve> So maybe?

16:23 <linkmauve> It’s the one that would help in-between application runs.

16:24 <enunes> wow I just found this https://www.phoronix.com/scan.php?page=news_item&px=Mesa-Lima-Shader-Cache

16:24 <enunes> it's kinda... wrong

16:24 <linkmauve> enunes, wouldn’t d4f706389c92e389aa8f75b9e7e8a28289d257de help wrt texture swizzles?

16:28 <enunes> linkmauve: yeah it helps a bit, but do you use texture swizzles too? it was just implemented a couple of weeks ago

16:28 <linkmauve> I have no idea.

16:28 <linkmauve> I’m not much of a GTK dev myself, just trying to make the whole thing better on my phone. :)

16:29 <enunes> probably not, otherwise you would be seeing multiple recompilation of your shaders even if you are not triggering them from the application

16:30 <linkmauve> But I mean, with this change it should be possible/easier to use the Mesa cache and skip compilation the next time the same shader is getting compiled, or what is still missing?

16:33 <enunes> I dont know how the mesa shader cache works, need to catch up with that

16:33 <enunes> but it seems difficult to cache mali binaries directly, it would probably need additional information to use to hash it like we do for lima, so I guess it's more likely caching intermediate representation before backend compilation?

16:34 <enunes> in which case wouldn't help your compile times much

16:35 <linkmauve> This cache doesn’t have to be the full final binary, it could be an earlier step which would still let you skip most of the compilation work.

16:38 <linkmauve> Although… you say codegen is the most expensive step of the whole compilation process?

16:38 <enunes> yes, but in your case its probably the backend register allocator so wouldnt help much

16:41 <linkmauve> What makes the final binary impossible to cache? What are the runtime-dependent things you need to regenerate for every compilation/linkage?

16:43 <enunes> it would require knowledge of the backend

16:44 <enunes> it's not "impossible", but we don't pass this information to upper layers in mesa anywhere

16:44 <enunes> not am I aware of some interface to do so

16:44 <enunes> we only recompile shaders in those cases because it's not possible to implement the features directly in hardware, and this varies completely with whatever the target hardware supports

16:45 <linkmauve> What is the backend here?

16:45 <enunes> lima, as a mesa backend/driver

16:45 <linkmauve> Ah, but AIUI the Mesa cache is already meant to invalidate cache based on which driver is being used.

16:45 <linkmauve> Not just which driver, but also which version of the driver.

16:46 <enunes> but mesa doesnt know we are recompiling it according to the features, so it wouldnt know which binary to use

16:46 <linkmauve> So when the user upgrades Mesa, even if Lima didn’t see any change it will recreate the cache over time, instead of reusing the previous binaries.

16:46 <linkmauve> The features, as in the GPU’s features?

16:47 <linkmauve> In case the user shares their ~/.cache directory between like, an A64 and an A20?

16:47 <enunes> yes, for example for some texture formats, it's not implemented in hardware, so lima modifies your shader to swizzle the result so it "supports" that texture format

16:48 <enunes> mesa has no idea of that, its just that lima checks what texture formats you currently have bound and will do it for you if you have textures in those formats

16:48 <linkmauve> Can’t you encode this information alongside the binary you store in the cache?

16:49 <enunes> we do it in the internal lima shader cache, but this information doesnt go up to the mesa layers

16:49 <linkmauve> So that if you get an incompatible binary from the cache, you ignore it and trigger a full compilation anyway?

16:50 <enunes> that is pretty much what happens but with the in memory shader cache in lima, and that has nothing to do with any mesa shader cache

16:51 <enunes> because lima knows all the things that require recompilation to support in the mali400 hardware

16:53 <enunes> I suppose we "could" save it to disk, sure, but that would be kind of awkward in the backend, dont know if any drivers do it

16:55 <linkmauve> Maybe we should bring that issue to #dri-devel?

16:56 <enunes> honestly, I dont see it as a way to go forward... we already apparently cache the intermediate code from mesa, and we cache the binaries in the backend to avoid the internal recompilation

16:57 <enunes> if it is a real problem and in shaders that wont crawl at <1fps, we should probably invest that effort in optimizing the backend

16:59 <linkmauve> It is a real problem in startup time.

16:59 <enunes> probably further optimize this https://gitlab.freedesktop.org/enunes/mesa/-/commit/9bf210ba982ba4e0a1cd125285eb65bc2213242f I had already added a blurb there about it being potentially slow on very large shaders

16:59 <linkmauve> I’m rebuilding master with debug symbols to make my flamegraph more useful.

16:59 <enunes> maybe further optimize the data structures in that, I already did with that mesa set, but something smarter could give some speedup

17:00 <enunes> I think it is only an issue because of shaders that are spilling too much to be nearly unusable

17:02 <linkmauve> I could rebuild GTK with no support for rounded rectangle clipping and see how much it saves in compilation times.

17:03 <enunes> does it take long to compile those shaders with shader-db run -j1 ?

17:11 <linkmauve> ./run -j1 /tmp/bar 7.27s user 0.14s system 97% cpu 7.604 total

17:13 <enunes> yeah 7s seems pretty bad

17:14 <enunes> if you were to perf record that... I suppose it would fall somewhere in ppir_regalloc_prog() ?

17:14 <linkmauve> I’m finishing to rebuild Mesa with debug symbols first.

17:26 kaspter has quit [Ping timeout: 264 seconds]

17:27 monstr has quit [Remote host closed the connection]

17:27 kaspter has joined #lima

17:32 <linkmauve> Should have disabled LTO. -_-

17:36 <linkmauve> … and I forgot sun4i-drm. ^^'

17:54 <linkmauve> Woah, a GL application doing just one draw call and exiting is now a full 0.1s faster than 20.3.4!

17:55 <linkmauve> 0.30~0.38s on current master, and 0.43~0.47s on 20.3.4!

17:56 <enunes> I wouldnt expect much difference there

17:57 <linkmauve> I was using the binary compiled by ArchLinux for 20.3.4.

17:57 <linkmauve> So maybe a build difference?

17:58 <enunes> hmm dont know really, a lot of things change including things out of control from lima

17:58 <linkmauve> dlopen() used to take 41.90% of the flamegraph, now it’s only 8.42%.

17:59 <linkmauve> So possibly just because I built only the drivers I’m interested in, the dynamic linker has less work to do?

17:59 <linkmauve> dlopen() called by loader_open_driver().

18:01 <enunes> dont know really... one thing that might be relevant (or not) is how you are installing the drivers, mesa creates hard links for all the drivers but if you do some funny stuff (like .zip them) it might end up creating actual file copies

18:03 <linkmauve> Using my distribution’s packages, so tar, which does support hard links.

18:13 <linkmauve> Oh no, gtk4-demo segfaults on master!

18:13 <linkmauve> #0 0x0000ffff7980d038 in ppir_node_to_instr () at /usr/lib/dri/sun4i-drm_dri.so

18:21 <enunes> hmm that is new and unexpected, and hasn't happened in a while :) are you sure?

18:22 <enunes> havent seen crashes in compilation in ppir for a long time

18:23 <linkmauve> I am sure I’m getting a segfault, why or how, I have no idea yet!

18:23 <enunes> if it really crashes, I'm interested in what fragment shader you are feeding it

18:25 <linkmauve> One from GTK, not sure which yet (I’m rebuilding it to get real real debug symbols now).

18:40 <linkmauve> enunes, this one: https://gitlab.gnome.org/GNOME/gtk/-/blob/master/gsk/resources/glsl/blend.glsl

18:53 <enunes> linkmauve: its missing the preamble, I tried to paste it together with the preamble glsl files but then it didnt crash to me

18:54 <linkmauve> Hmm. :/

18:55 <enunes> unless its because I'm a couple of days of git pull away, but more likely I just ended with a different shader by pasting things

18:55 <enunes> do you have a captured .shader_test?

18:56 <linkmauve> https://linkmauve.fr/files/gtk-lima.txt is the full backtrace.

18:56 <linkmauve> Yes.

18:57 <linkmauve> Ah, you may have to:

18:57 <linkmauve> #define GSK_GLES 1

18:57 <linkmauve> #define NO_CLIP 1

18:58 <linkmauve> enunes, https://linkmauve.fr/files/3.shader_test

18:58 <linkmauve> Although shader-db doesn’t crash here. :/

18:59 <anarsoul> blend.glsl will be pretty expensive on mali4x0, too many ifs in main()

18:59 <anarsoul> it's better to precompile several shaders rather than selecting what to do based on u_mode

19:00 <linkmauve> That’s what I told the GTK people, but they didn’t seem to agree so far.

19:00 <enunes> if it doesnt crash with shader-db it still *might* be if it triggers only with runtime texture swizzling modifications

19:00 <anarsoul> I don't think it'll be efficient on any GPU

19:00 <linkmauve> I’ll try showing them performance increases instead of arguing without profiling. :)

19:00 <anarsoul> linkmauve: are there any gpu experts among them? :)

19:01 <anarsoul> OK, sounds good

19:01 <linkmauve> anarsoul, it could be efficient on drivers which do specialisation.

19:02 <linkmauve> I’m quite familiar with this kind of huge shader, Dolphin (another project I’ve been working on) is using them extensively: https://fr.dolphin-emu.org/blog/2017/07/30/ubershaders/

19:02 <linkmauve> And it absolutely destroys any GPU I own. ^^

19:03 <enunes> I saw that article one time before, this discussion reminded me of it too

19:05 <anarsoul> even with clause-based ISA it'll take some time till it gets to luminosity

19:06 <anarsoul> uber shaders is not a good idea for most (if not all) mobile GPUs

19:08 <linkmauve> enunes, anarsoul, would you like to come to #gtk on irc.gnome.org at some point? You’d have a lot more legitimacy than I do.

19:09 <anarsoul> I can, but the question is do they target gtk4 for mobile GPUs? If they don't my presence won't help much

19:10 <linkmauve> They’ve been helping me helping them to fix the issues on my PinePhone at least. ^^

19:39 <enunes> linkmauve: I could join, but mostly to provide info on how things are in lima, I already went more into the gtk code than I would like to :)

20:54 camus has joined #lima

20:54 kaspter has quit [Ping timeout: 245 seconds]

20:54 camus is now known as kaspter

21:47 kaspter has quit [Ping timeout: 246 seconds]

21:47 camus has joined #lima

21:49 camus is now known as kaspter

21:58 dddddd has quit [Ping timeout: 256 seconds]

22:00 dddddd has joined #lima

23:33 camus has joined #lima

23:33 kaspter has quit [Ping timeout: 245 seconds]

23:33 camus is now known as kaspter

23:55 dev1990 has quit [Quit: Konversation terminated!]