ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=lima and https://freenode.irclog.whitequark.org/lima - Contact ARM for binary driver support!
camus1 has joined #lima
kaspter has quit [Ping timeout: 255 seconds]
camus1 is now known as kaspter
yuq825 has joined #lima
<anarsoul|2> cwabbott: what does "mul ^110/$0.y/$7.w ^100 $8.z" mean in disassembly?
<anarsoul|2> it can't store to $0 and $7 at the same time :\
<anarsoul|2> yeah, looks like that's the reason it fails.
<anarsoul|2> debugging where it's coming from...
<anarsoul|2> well, it's supposed to write to both regs, but the issue is that $0.y is also used for another purpose
<anarsoul|2> looks like missing dep?
<anarsoul|2> nir:
<anarsoul|2> r10 = fmul r6, ssa_17
<anarsoul|2> r5 = mov r10
<anarsoul|2> but both reg stores end up as root nodes
<anarsoul|2> and first comment in gpir_emit_alu() looks fishy
_whitelogger has joined #lima
chewitt has joined #lima
<anarsoul|2> darn
<anarsoul|2> "We assume here that writes are placed before reads. If this changes, then this needs to be updated."
<anarsoul|2> that's from schedule_try_place_node()
<anarsoul|2> that's obviously a wrong assumption for nir regs if they're used in loops.
dddddd has quit [Remote host closed the connection]
Barada has joined #lima
<anarsoul|2> so yeah, liveness of physregs is computed incorrectly
<anarsoul|2> I'm not sure how to fix it at the moment, I'm still figuring out how gpir compiler works :)
<anarsoul|2> we're lucky that at least some vertex shaders with loops are working :)
<anarsoul|2> hm
<anarsoul|2> after 2nd (or rather nth) thought I think that not emitting movs actually breaks the compiler
<anarsoul|2> has assign_regs: store 65, reg: 19, assigned: 1
<anarsoul|2> it attempts to store reg5
<anarsoul|2> which should appear in live_out of block 1
<anarsoul|2> but it doesn't
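[editor's note: a minimal sketch of classic backward liveness dataflow, to illustrate why a register read at the top of a loop body must appear in that block's live_out when there is a back-edge — plain Python, not the actual gpir liveness code, and all names are made up]

```python
# Textbook backward liveness: live_in = use | (live_out - def),
# live_out = union of live_in over successors, iterated to a fixpoint.
def liveness(blocks, succs):
    """blocks: name -> (use set, def set); succs: name -> successor list."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            use, defs = blocks[b]
            out = set().union(*(live_in[s] for s in succs[b])) if succs[b] else set()
            inn = use | (out - defs)
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# A loop body that reads r5 and then writes it; the self back-edge
# forces r5 into live_out, so its physreg must survive the iteration.
blocks = {"body": ({"r5"}, {"r5"})}
succs = {"body": ["body"]}
li, lo = liveness(blocks, succs)
# "r5" is in lo["body"]: dropping it from live_out is exactly the
# kind of bug being described here.
```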
<anarsoul|2> I'm tempted to switch lima to using draw module (i.e. sw vertex shader) :\
<anarsoul|2> guess I'll submit an MR with improved disassembler for gp
<rellla> anarsoul: this is a 'grep -r -E (\$12|\$13|\$14|\$15) *' : http://imkreisrum.de/deqp/greps/gpir_reg_12_13_14_15
anarsoul|c has joined #lima
monstr has joined #lima
yann has quit [Ping timeout: 256 seconds]
<cwabbott> rellla: no, that assumption is correct... it's only talking about within a single instruction, in the case where you have something like r5 = add r5, r6
<cwabbott> that gets expanded to something like "read r5; read r6; add; write r5"
<cwabbott> it assumes that the write gets placed before the read, which is the case because we schedule bottom-up
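[editor's note: a hypothetical sketch of the expansion cwabbott describes — not actual gpir code — showing that the sources of a same-register ALU op are separate read nodes that must observe the old value, with the write landing last in program order]

```python
# Model "r5 = add r5, r6" as read/op/write micro-nodes.  Because gpir
# schedules bottom-up, the write node is placed (scheduled) before the
# reads, which is the assumption schedule_try_place_node() relies on.
def expand_alu(dest, op, srcs):
    """Expand one ALU op into micro-nodes in program order."""
    return [("read", s) for s in srcs] + [("op", op), ("write", dest)]

def execute(instrs, regs):
    """Reads observe the OLD register value; the write happens last."""
    for dest, op, srcs in instrs:
        vals = [regs[s] for s in srcs]      # all reads before the write
        if op == "add":
            regs[dest] = vals[0] + vals[1]  # write after the reads
    return regs

nodes = expand_alu("r5", "add", ["r5", "r6"])
regs = execute([("r5", "add", ["r5", "r6"])], {"r5": 1, "r6": 2})
# regs["r5"] == 3: the add consumed the old r5 (1), not its own result
```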
<cwabbott> anarsoul|c: and yes, you can have two writes to different things that point to the same ALU slot, and in that case the disassembler will print out something like that
gcl has quit [Ping timeout: 260 seconds]
yann has joined #lima
gcl has joined #lima
anarsoul|c has quit [Quit: Connection closed for inactivity]
cwabbott has left #lima [#lima]
anarsoul|c has joined #lima
dddddd has joined #lima
Barada has quit [Quit: Barada]
Barada has joined #lima
Barada has quit [Quit: Barada]
Barada has joined #lima
<rellla> cwabbott: did you mean anarsoul with your first ping?
anarsoul|c has quit [Quit: Connection closed for inactivity]
gcl_ has joined #lima
gcl has quit [Ping timeout: 256 seconds]
jonkerj has quit [Remote host closed the connection]
jonkerj has joined #lima
yuq825 has quit [Remote host closed the connection]
Barada has quit [Quit: Barada]
dddddd has quit [Remote host closed the connection]
gcl_ has quit [Ping timeout: 268 seconds]
<anarsoul|2> rellla: thanks
cwabbott has joined #lima
yann has quit [Ping timeout: 260 seconds]
<rellla> anarsoul|2: i had an issue in mali_syscall_tracker regarding gp uniform decoding. i'm doing new dumps now including your gpir patch for better readability ...
<anarsoul|2> cwabbott: thanks for the comment on write/read order. What about missing movs? I don't understand how we're tracking dependencies if we don't have a node for mov
<rellla> btw, we have sum3 in the dumps...
<anarsoul|2> rellla: heh. So it's some new op?
<anarsoul|2> that looks like sum3 but not sum3?
<cwabbott> rellla: iirc sum3 is used for dot products
<cwabbott> so if you use dot() it should show up
<cwabbott> (of vec3's obviously)
<anarsoul|2> cwabbott: see scrollback, rellla found that blob uses another undecoded op
<cwabbott> anarsoul|2: not sure exactly what you're asking about
<rellla> cwabbott: so it's not ($1.x + $1.y + $1.z) ?
<cwabbott> anarsoul|2: yeah, no idea about that one
<anarsoul|2> op18.v1
<anarsoul|2> cwabbott: do you by any chance remember why scheduler doesn't expect movs?
<cwabbott> rellla: if you have a good guess about what it is, you can try to create a simple shader that exhibits it and see if you can get offline-shader-compiler to emit it
<rellla> this is what blob does
<rellla> see original shader and disassembled code.
<cwabbott> rellla: my guess is that it's like sum3 but with different precision
<cwabbott> it's possible that sum3 is actually higher-precision than just naively doing (x + y) + z
<cwabbott> since otherwise it's the same
<rellla> having for (int i = 0; i < 4; i++)
<rellla> { res += tmp[i];} uses op18 with an additional add $.w on top. taking just 2 components leads to a simple add
<rellla> seems i need some test to find that out ...
<rellla> imho there must be a difference, because we have sum3 and op18 in the blob dumps
<cwabbott> ok, so my hypothesis is probably right then... when the shader uses dot() they take the liberty to compute the intermediate results in higher precision, but when you manually do it they do exactly what the user says
<cwabbott> and op18 is just a faster (but still low-precision) way of doing exactly what the user says
<rellla> ok, so op18 could be a lower-precision sum3 then, not a dot(), but an add(1,2,3)
<cwabbott> yeah
<rellla> ok, understood
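[editor's note: a small demonstration of the precision hypothesis above — how a three-way sum with rounded intermediates can differ from one accumulated in higher precision and rounded once. Mali GP does not use fp16; half precision is used here only because Python's `struct` supports it and it makes the rounding visible]

```python
import struct

def to_fp16(x):
    """Round a float to IEEE-754 half precision (struct's 'e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def sum3_lowp(x, y, z):
    """Naive (x + y) + z with every intermediate rounded, like op18 might."""
    return to_fp16(to_fp16(to_fp16(x) + to_fp16(y)) + to_fp16(z))

def sum3_highp(x, y, z):
    """Accumulate in full precision, round once, like sum3 might."""
    return to_fp16(x + y + z)

# At 2048 the fp16 spacing is 2, so each +1 is lost to round-to-even
# in the low-precision version but survives in the high-precision one.
a = sum3_lowp(2048.0, 1.0, 1.0)    # 2048.0
b = sum3_highp(2048.0, 1.0, 1.0)   # 2050.0
```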
<cwabbott> anarsoul|2: we don't use mov's because they're never necessary before the scheduler
<cwabbott> there are some situations where they're always necessary (e.g. load directly into store) but the scheduler has to handle that anyways on-the-fly, since that can be generated when spilling
<cwabbott> so inserting mov's beforehand doesn't actually save any complexity, and making the scheduler handle it would just add more complexity
<anarsoul|2> block_4
<anarsoul|2> r10 = fmul r6, ssa_17
<anarsoul|2> r8 = fmul r5, ssa_17
<anarsoul|2> r5 = mov r10
<anarsoul|2> and of course I missed some lines :) anyway I'm not sure how we can get rid of 'r5 = mov r10' here
<cwabbott> anarsoul|2: when translated into gpir that turns into a load that feeds directly into a store
anarsoul|c has joined #lima
<anarsoul|2> that seems to turn into a store from the mul node
<anarsoul|2> but it's not correct
<cwabbott> well, actually, since r10 is written it means that the store to r5 is taken from the mul
<anarsoul|2> yeah, exactly.
<cwabbott> why isn't it correct?
<anarsoul|2> because r5 is used after r10 is written?
<anarsoul|2> as well as r6
<cwabbott> those r5's should get the old value of r5
<anarsoul|2> cwabbott: but we don't add any deps when translating a move :\
<cwabbott> there shouldn't be any deps necessary
<cwabbott> it's a core assumption of the nir->gpir translator that any intra-block dependencies are expressed solely as value registers
<anarsoul|2> read after write deps
<cwabbott> i.e. we should never read a register which is written earlier in the block
<anarsoul|2> yes, but you essentially turn it into r10 = r5 = fmul ...
<anarsoul|2> so r5 gets written before it is read
<anarsoul|2> both writes are root nodes
<cwabbott> i don't understand... we keep around the value register for all nir registers
<cwabbott> so that if we ever write a nir register in a block, all later reads will use the value reg instead of the physreg
<cwabbott> the use of r5 in the fmul there actually comes from the previous iteration of the loop, or the load_input at the beginning
<cwabbott> and we don't need any deps for that since it's across a basic block boundary
<cwabbott> we do need a write-after-read dep, but iirc that gets handled before RA
<cwabbott> err, before reduce_scheduler
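[editor's note: a hypothetical sketch of the rule cwabbott describes for the nir->gpir translator — names and data shapes are invented, not actual mesa code: once a nir register is written inside a block, every later read in that same block uses the value register (the defining node) instead of the physreg, so no intra-block read-after-write deps on physregs are needed]

```python
# Lower a block of (dest, op, srcs) instructions, routing intra-block
# reads through the defining node ("value register") when one exists.
def lower_block(instrs):
    value_reg = {}                       # nir reg -> defining node in block
    lowered = []
    for dest, op, srcs in instrs:
        reads = tuple(value_reg.get(s, ("physreg", s)) for s in srcs)
        node = ("node", op, reads)
        value_reg[dest] = node           # later reads of dest use this node
        lowered.append((dest, node))
    return lowered

out = lower_block([
    ("r10", "fmul", ["r6", "ssa_17"]),   # r6 unwritten here: physreg read
    ("r5",  "mov",  ["r10"]),            # r10 written above: value read
])
# The fmul reads r6 from its physreg (the value from a previous block
# or iteration); the mov's source is the fmul node itself.
```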
<anarsoul|2> cwabbott: I'll try looking into it later today. I'm pretty certain that the counter (which is r9) gets the same physical register as r5
<anarsoul|2> as a result the counter gets overwritten and the shader hangs
<cwabbott> that would indeed be wrong :)
monstr has quit [Remote host closed the connection]
<cwabbott> but it's not the lack of read-after-write dependencies that's wrong... that's by design
<cwabbott> it could be liveness analysis going wrong or something
dddddd has joined #lima
<anarsoul|2> cwabbott: $0.y never appears in live_out of block 4, so yeah, could be
<anarsoul|2> that's what gets assigned to r5 and r10
deesix has quit [Ping timeout: 240 seconds]
dddddd has quit [Ping timeout: 256 seconds]
deesix has joined #lima
dddddd has joined #lima
<anarsoul|2> OK, I think I fixed it
<anarsoul|2> it also fixes dEQP-GLES2.functional.shaders.loops.while_constant_iterations.nested_sequence_vertex but not the other 3 hangs :(
buzzmarshall has joined #lima
<anarsoul|2> darn
<anarsoul|2> more fun with movs
<anarsoul|2> it doesn't hang, but it skips an iteration
<anarsoul|2> "017:acc0: add ^102/$3.y/$2.w $3.y $1.w // $3.y = $2.w = $3.y + $1.w" is wrong
<anarsoul|2> $2.w is i
<anarsoul|2> $3.y is supposed to be i + 1
<anarsoul|2> it should have been: $2.w = $3.y; $3.y = $3.y + $1.w
<anarsoul|2> it's coming from:
<anarsoul|2> block block_4:
<anarsoul|2> r6 = fadd r5, ssa_11
<anarsoul|2> r5 = mov r6
<anarsoul|2> /* preds: block_3 */
<anarsoul|2> r4 = mov r5
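[editor's note: a tiny plain-Python illustration of the miscompile described above — the sequence should copy the OLD loop counter and then increment, but the fused dual-destination write gives both registers the NEW value]

```python
# Correct sequential semantics: "$2.w = $3.y; $3.y = $3.y + $1.w"
i = 3
copy = i            # $2.w gets the old counter value
i = i + 1           # $3.y gets the incremented value
assert (copy, i) == (3, 4)

# What the emitted "$3.y = $2.w = $3.y + $1.w" does instead: one ALU
# result fanned out to both destinations, so the old counter is lost.
i2 = 3
i2 = copy2 = i2 + 1
assert (copy2, i2) == (4, 4)   # the iteration gets skipped
```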
yann has joined #lima
<anarsoul|2> I think I got it...
<anarsoul|2> my fix to calc_def_block() wasn't sufficient