Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa
<anarsoul> yuq8251: hi
<anarsoul> have you noticed that X11 has large latency when glamor is in use with lima?
<anarsoul> i.e. if I move cursor it moves 2x-3x slower than I move mouse
<yuq8251> hi
<anarsoul> I'm not sure where it comes from since CPU load isn't high
<yuq8251> which x11 application?
<yuq8251> or desktop?
<anarsoul> yuq8251: no applications, just plain X11 with xterm
<anarsoul> just try moving cursor
<yuq8251> I didn't see it on my amlogic s905x
<anarsoul> yuq8251: amlogic s905x may have cursor plane
<anarsoul> you can try 'Option "SWcursor" "true"' in your xorg config to enable sw cursor
<yuq8251> let me use software cursor
<yuq8251> same result
<anarsoul> do you want me to make a video?
<anarsoul> can you try on some board with allwinner soc?
<yuq8251> yeah, a video. I can try on allwinner, but I need a lot of time to set one up
<anarsoul> watch my hand moving mouse and see how cursor lags
<anarsoul> yuq8251: I think it's somehow related to GPU, if I lower resolution to 1024x768 there's no lag
<anarsoul> but it's here in 1920x1080
<anarsoul> I mean to GPU load
<anarsoul> yuq8251: you can also try lowering GPU frequency, IIRC s905x has mali450 that runs at pretty high freq
<anarsoul> yuq8251: or try launching mpv playing some video and glxgears
<anarsoul> and move glxgears window
<anarsoul> or better some other window (xterm)
<anarsoul> you'll see that gears spin slower
<anarsoul> it looks like it couldn't render it in time and tries to catch up :)
<yuq8251> looks like tasks from multiple contexts are not interleaved
<yuq8251> since glxgears stops while you are moving a window
<yuq8251> or the window manager does not update other windows when moving a window
<anarsoul> yuq8251: maybe, but I get similar result with no apps and just a cursor
<anarsoul> however I'm not sure how many contexts are there
<yuq8251> you may add some print or debugfs in kernel driver to monitor
<yuq8251> btw, glamor does not call eglSwapBuffers
<anarsoul> and what are the consequences?
<yuq8251> it uses glFlush to update the render target
<yuq8251> so if it does not call glClear, a tile-buffer GPU like lima has to reload the screen every time glFlush is called
<anarsoul> I see
<anarsoul> and that's expensive
<yuq8251> yes
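
A rough back-of-the-envelope sketch of why the reload cost would scale with resolution. It assumes a 32-bit render target (4 bytes per pixel) and that every glFlush without a preceding glClear reloads the whole screen into the tile buffer. The function name and the pixel size are illustrative assumptions, not values taken from the lima driver.

```python
def reload_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Bytes read back for one full-screen reload (assumed 4 bytes per pixel)."""
    return width * height * bytes_per_pixel


full_hd = reload_bytes(1920, 1080)
xga = reload_bytes(1024, 768)

# Compare the per-flush traffic at the two resolutions mentioned in the log.
print(full_hd, xga, round(full_hd / xga, 2))
```

Under these assumptions 1920x1080 moves roughly 2.6 times more data per flush than 1024x768, which would be consistent with the lag disappearing at the lower resolution.
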
<yuq8251> I reverse-engineered the mali blob before; it handles the glFlush case by continuing to use the GP PLBU buffer
<anarsoul> *sigh* looks like there's a lot to fix before we can use lima with X11
<yuq8251> and it will overflow when there are many glFlush calls
<yuq8251> so I use the reload method
<yuq8251> this is the revert commit
<yuq8251> I think it would be much better for composite window manager
<yuq8251> and wayland desktop
<anarsoul> hm
<anarsoul> I can try xcompmgr
<anarsoul> it doesn't help
<anarsoul> and moving cursor is enough to get glxgears to stutter
<yuq8251> xserver has a GL context, glxgears has one, composite WM has one
<anarsoul> I see
<anarsoul> yuq8251: can you reproduce the issue on your side?
<yuq8251> Oh, I can see now with glxgears running, even without WM
<anarsoul> btw, please review my pp cf branch when you have some time
<yuq8251> ok
<anarsoul> it doesn't regress in piglit, so at least it generates correct code
<yuq8251> that's nice
<yuq8251> have you tested some desktop?
<yuq8251> like xfce
<anarsoul> nope
<anarsoul> due to this latency issue
<anarsoul> anything in X11 isn't really usable for me
<yuq8251> I can see similar lag with weston, but much better
<anarsoul> yuq8251: it gets a lot worse with something GPU-heavy
<anarsoul> I tried starting ioquake3 and lag in menu is tens of seconds
mardestan has joined #lima
yuq8251 has quit [Remote host closed the connection]
cwabbott has joined #lima
jrmuizel has quit [Remote host closed the connection]
dddddd has joined #lima
jrmuizel has joined #lima
<enunes> anarsoul: hey, sure we can rework the register selection for spilling, do you have some ideas?
<enunes> I'm more worried first to fix the infinite loop case you hit, maybe I should pick your branch and remove that attempts implementation
<enunes> to try to reproduce it and propose an improvement in marking registers unspillable
<enunes> and then we can also improve the register selection algorithm
<enunes> with shaderdb that is much easier now
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
<anarsoul> enunes: yeah, I have one idea
<anarsoul> enunes: we can do two passes: 1st pass: calculate maximum register pressure, 2nd pass: choose one register that is in block where max reg pressure is reached
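
A minimal sketch of the two-pass idea above, assuming per-block liveness sets are already available. The function name, the `live_per_block` input shape, and the tie-break (lowest register name first) are hypothetical and chosen only for illustration; this is not ppir code.

```python
from typing import Dict, Optional, Set


def pick_spill_candidate(
    live_per_block: Dict[str, Set[str]],
    unspillable: Set[str],
) -> Optional[str]:
    """Pick a register to spill from a block where pressure peaks."""
    # Pass 1: find the maximum register pressure over all blocks.
    max_pressure = max((len(regs) for regs in live_per_block.values()), default=0)

    # Pass 2: choose a spillable register live in a block at that maximum.
    for block in sorted(live_per_block):
        regs = live_per_block[block]
        if len(regs) == max_pressure:
            candidates = sorted(regs - unspillable)
            if candidates:
                return candidates[0]  # assumed tie-break: lowest name
    return None


live = {
    "b0": {"r0", "r1"},
    "b1": {"r0", "r1", "r2", "r3"},  # pressure peaks here
    "b2": {"r3"},
}
print(pick_spill_candidate(live, unspillable={"r0"}))
```

Returning `None` when every register in the peak blocks is unspillable gives the caller a clear signal to abort instead of retrying.
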
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
<enunes> anarsoul: I have to look at how we can calculate this, but seems better than what we have
<enunes> I noticed that the mesa register allocator has "ra_get_best_spill_node"
jrmuizel has quit [Read error: Connection reset by peer]
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
<anarsoul> enunes: yeah, maybe it's better to use it
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
<enunes> anarsoul: quick attempt to use it seems to have marginal gain and inconclusive results,
<enunes> slightly better in the spills though, maybe it is worth it
<anarsoul> enunes: try glmark2 -b ideas
<enunes> with your branch?
<anarsoul> yes
<enunes> anarsoul: ideas works, resolves spilling with exactly 10 attempts
<enunes> without this change it just aborted as it took >10
<enunes> shadow renders a bit strange in ideas
<enunes> ah wait, no, it also aborted the compilation, but still worked?
<enunes> let me start over
jrmuizel has joined #lima
<anarsoul> enunes: yeah, it's weird, it aborts compilation but continues to work
<anarsoul> however rendering is incorrect
<enunes> anarsoul: well yeah I can reproduce the infinite loop with the current master, with this local change switching to ra_get_best_spill_node it doesn't resolve spilling either, but regalloc runs out of registers and correctly aborts quickly
<anarsoul> enunes: OK, so can you send an MR with your local change to my branch?
<anarsoul> however I'm not sure if it's possible...
<anarsoul> if you just point me to your branch I can just pull this change
<enunes> anarsoul: I will do that, first I will try to see what it is doing to see if we can optimize it in some way
<anarsoul> I think vectorization should help ideas
<enunes> anarsoul: still fails regalloc even with vectorize
<anarsoul> :(
<anarsoul> well, then we'll have to look into it later
<anarsoul> blob compiles it just fine, so it should be possible
<anarsoul> enunes: does vectorize help if you use vector select? (just fake it for now)
<enunes> let's see
<enunes> still regalloc fail
<anarsoul> :(
<enunes> seems that many registers should still be spillable, I'm wondering why it gives up
<anarsoul> enunes: it should also help if we fuse branch condition into branch
<anarsoul> I'll play with it after cf branch merges
<anarsoul> it shouldn't be too difficult
<anarsoul> enunes: what are we missing in ppir besides cf?
<anarsoul> I think all the other sampler types
<anarsoul> and that's probably it?
<enunes> then bugfixes I guess
<anarsoul> and optimizations
<anarsoul> also X11 is not really usable, glamor works but we have some issue with job queue
<anarsoul> glxgears freezes when I move another window and then tries to catch up
<enunes> anarsoul: I saw the discussion... yeah that seems hard to debug
<enunes> is this a build without debugs?
<anarsoul> yes
<enunes> job queue you mean the drm sched one?
<anarsoul> I guess you can reproduce it since you're using pine64
<anarsoul> I'm not sure how it's implemented
<enunes> hmm apparently the mesa register allocator is marking many nodes as "in_stack" and they are not spilling candidates, need to figure out what that means
<anarsoul> enunes: I doubt there's a bug in it
<anarsoul> it's used by vc4, v3d and i965
<enunes> anarsoul: yeah I'm sure it's not a bug in it, I wonder if we should set something different so that it doesn't do that, or just what it means
<anarsoul> enunes: there's an explanation what in_stack means in register_allocate.c
<anarsoul> see comment at the top of file
<enunes> sure I read that, still not clear to me why it stays set after the algorithm executes and why it is a condition to select the best spillable node
<anarsoul> enunes: anyway, don't spend too much time on it, fusing condition into branch will save one reg for each branch
<anarsoul> N regs for nested branches :)
<enunes> anarsoul: yeah there is an explanation for that stuff in the commit logs, I don't think we can do anything about it
<enunes> especially if branching takes registers away, maybe it is indeed unresolvable
<enunes> I suppose I will submit an MR to switch to ra_get_best_spill_node anyway since it solves the infinite loop issue
<enunes> and it seems that this is what everyone else uses
<anarsoul> enunes: just point me to the branch and I'll cherry pick the commit
<enunes> anarsoul: I guess I can submit it anyway and we can possibly merge it before cf gets merged?
<enunes> not sure if you already intend to merge the current cf iteration
<anarsoul> enunes: I do, waiting for some review :)
<anarsoul> it causes no regression in piglit and fixes 41 tests
<anarsoul> also we can actually run X11 now
<enunes> mostly out of curiosity, why do we need ppir_op_dummy ?
<enunes> also I would appreciate some more verbose commit messages for this as it's +616 -248 lines :)
<anarsoul> I'll try to add more to commit message, but there's nothing interesting in implementation
<anarsoul> enunes: ppir_op_dummy is used as a placeholder for a ppir_dest that is a reg
<enunes> it gets removed eventually?
<anarsoul> basically we can get nir where register is read before it's assigned, it's totally fine, but compiler expects non-NULL value in comp->var_nodes
<anarsoul> it's just ignored
<enunes> this is the nir undef value?
<anarsoul> no
<anarsoul> it's not undef
<anarsoul> basically we can have something like: loop { r1 = r2; if (somecond) break; r2 = someothervalue }
<anarsoul> it's a read from uninitialized register
<anarsoul> but it gets initialized on next iteration :)
<enunes> I see, and nir doesn't create that undef assignment for it in this case?
<anarsoul> no
<anarsoul> (and it makes no sense - it's redundant)
<anarsoul> it's not ssa
<anarsoul> it's a reg
<anarsoul> it can be assigned multiple times
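
The loop described above can be sketched as a small simulation. This is only an illustration of a non-SSA register being read before its first write: the `run_loop` name, the use of `None` as a stand-in for the placeholder, and the loop counter as "someothervalue" are all assumptions, not ppir or NIR behavior.

```python
def run_loop(iterations: int):
    """Simulate: loop { r1 = r2; if (somecond) break; r2 = someothervalue }"""
    # None stands in for "no value assigned yet" (the placeholder case).
    regs = {"r1": None, "r2": None}
    uninitialized_reads = 0

    for i in range(iterations):
        if regs["r2"] is None:
            uninitialized_reads += 1  # r2 is read before any write
        regs["r1"] = regs["r2"]
        if i == iterations - 1:  # stand-in for "somecond"
            break
        regs["r2"] = i  # stand-in for "someothervalue"

    return regs, uninitialized_reads


regs, uninitialized_reads = run_loop(3)
print(regs, uninitialized_reads)
```

Only the first iteration reads `r2` before it has been written; from the second iteration on, the register holds the value assigned in the previous pass.
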
<enunes> hmm so that's the difference then, it's not ssa
<anarsoul> enunes: I think we should assign different spill cost for regs with different number of components
<anarsoul> IIRC we're using vec4 temporary regardless of number of used components
<enunes> yes
<anarsoul> so it's beneficial to spill regs with more components
<enunes> ok, I can try that
<anarsoul|c> Even if we stored floats as floats it's more beneficial to spill vec4 regs
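
A hedged sketch of weighting spill cost by component count, assuming that a lower cost makes a register a more attractive spill candidate and that a spilled register always occupies one vec4 temporary. The `spill_cost` formula (uses divided by components) is a hypothetical heuristic for illustration, not what ppir or Mesa's register allocator actually computes.

```python
def spill_cost(num_uses: int, num_components: int) -> float:
    """Hypothetical cost: memory accesses paid per component freed."""
    if not 1 <= num_components <= 4:
        raise ValueError("expected 1 to 4 components")
    return num_uses / num_components


# Registers with the same number of uses but different widths.
candidates = {"scalar": (4, 1), "vec2": (4, 2), "vec4": (4, 4)}
costs = {name: spill_cost(uses, comps) for name, (uses, comps) in candidates.items()}

# The widest register ends up with the lowest cost, so it is preferred.
best = min(costs, key=costs.get)
print(costs, best)
```

With equal use counts, the vec4 register is picked first because one vec4 temporary slot frees four components' worth of register space.
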
<enunes> anarsoul: hah, very nice
<enunes> anything else we might want to favour, some type of instruction maybe?
<enunes> anarsoul: btw, this reminds me: not duplicating the use of uniforms was also something that greatly affected register pressure
<enunes> we might want to do that again, I think I recall even the blob does it
<enunes> right now one uniform used by the entire program basically takes away 1 register which will likely be spilled anyway, so we don't really save memory accesses by not doing that