ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=lima and https://freenode.irclog.whitequark.org/lima - Contact ARM for binary driver support!
camus1 has joined #lima
kaspter has quit [Ping timeout: 255 seconds]
camus1 is now known as kaspter
yuq825 has joined #lima
<anarsoul|2> cwabbott: what does "mul ^110/$0.y/$7.w ^100 $8.z" mean in disassembly?
<anarsoul|2> it can't store to $0 and $7 at the same time :\
<anarsoul|2> yeah, looks like that's the reason it fails.
<anarsoul|2> debugging where it's coming from...
<anarsoul|2> well, it's supposed to write to both regs, but the issue is that $0.y is also used for another purpose
<anarsoul|2> looks like missing dep?
<anarsoul|2> nir:
<anarsoul|2> r10 = fmul r6, ssa_17
<anarsoul|2> r5 = mov r10
<anarsoul|2> but both reg stores end up as root nodes
<anarsoul|2> and first comment in gpir_emit_alu() looks fishy
_whitelogger has joined #lima
chewitt has joined #lima
<anarsoul|2> darn
<anarsoul|2> "We assume here that writes are placed before reads. If this changes, then this needs to be updated."
<anarsoul|2> that's from schedule_try_place_node()
<anarsoul|2> that's obviously a wrong assumption for nir regs if they're used in loops.
dddddd has quit [Remote host closed the connection]
Barada has joined #lima
<anarsoul|2> so yeah, liveness of physregs is computed incorrectly
<anarsoul|2> I'm not sure how to fix it at the moment, I'm still figuring out how gpir compiler works :)
<anarsoul|2> we're lucky that at least some vertex shaders with loops are working :)
<anarsoul|2> hm
<anarsoul|2> after 2nd (or rather nth) thought I think that not emitting movs actually breaks the compiler
<anarsoul|2> has assign_regs: store 65, reg: 19, assigned: 1
<anarsoul|2> it attempts to store reg5
<anarsoul|2> which should appear in live_out of block 1
<anarsoul|2> but it doesn't
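[editor's note: a minimal sketch of classic backward liveness dataflow, to illustrate why a register read at the top of a loop body must appear in that block's live_out when there is a back-edge — plain Python, not the actual gpir liveness code, and all names are made up]

```python
# Textbook backward liveness: live_in = use | (live_out - def),
# live_out = union of live_in over successors, iterated to a fixpoint.
def liveness(blocks, succs):
    """blocks: name -> (use set, def set); succs: name -> successor list."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            use, defs = blocks[b]
            out = set().union(*(live_in[s] for s in succs[b])) if succs[b] else set()
            inn = use | (out - defs)
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# A loop body that reads r5 and then writes it; the self back-edge
# forces r5 into live_out, so its physreg must survive the iteration.
blocks = {"body": ({"r5"}, {"r5"})}
succs = {"body": ["body"]}
li, lo = liveness(blocks, succs)
# "r5" is in lo["body"]: dropping it from live_out is exactly the
# kind of bug being described here.
```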
<anarsoul|2> I'm tempted to switch lima to using draw module (i.e. sw vertex shader) :\
<anarsoul|2> guess I'll submit an MR with improved disassembler for gp
<rellla> anarsoul: this is a 'grep -r -E (\$12|\$13|\$14|\$15) *' : http://imkreisrum.de/deqp/greps/gpir_reg_12_13_14_15
anarsoul|c has joined #lima
monstr has joined #lima
yann has quit [Ping timeout: 256 seconds]
<cwabbott> rellla: no, that assumption is correct... it's only talking about within a single instruction, in the case where you have something like r5 = add r5, r6
<cwabbott> that gets expanded to something like "read r5; read r6; add; write r5"
<cwabbott> it assumes that the write gets placed before the read, which is the case because we schedule bottom-up
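[editor's note: a hypothetical sketch of the expansion cwabbott describes — not actual gpir code — showing that the sources of a same-register ALU op are separate read nodes that must observe the old value, with the write landing last in program order]

```python
# Model "r5 = add r5, r6" as read/op/write micro-nodes.  Because gpir
# schedules bottom-up, the write node is placed (scheduled) before the
# reads, which is the assumption schedule_try_place_node() relies on.
def expand_alu(dest, op, srcs):
    """Expand one ALU op into micro-nodes in program order."""
    return [("read", s) for s in srcs] + [("op", op), ("write", dest)]

def execute(instrs, regs):
    """Reads observe the OLD register value; the write happens last."""
    for dest, op, srcs in instrs:
        vals = [regs[s] for s in srcs]      # all reads before the write
        if op == "add":
            regs[dest] = vals[0] + vals[1]  # write after the reads
    return regs

nodes = expand_alu("r5", "add", ["r5", "r6"])
regs = execute([("r5", "add", ["r5", "r6"])], {"r5": 1, "r6": 2})
# regs["r5"] == 3: the add consumed the old r5 (1), not its own result
```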
<cwabbott> anarsoul|c: and yes, you can have two writes to different things that point to the same ALU slot, and in that case the disassembler will print out something like that
gcl has quit [Ping timeout: 260 seconds]
yann has joined #lima
gcl has joined #lima
anarsoul|c has quit [Quit: Connection closed for inactivity]
cwabbott has left #lima [#lima]
anarsoul|c has joined #lima
dddddd has joined #lima
Barada has quit [Quit: Barada]
Barada has joined #lima
Barada has quit [Quit: Barada]
Barada has joined #lima
<rellla> cwabbott: did you mean anarsoul with your first ping?
anarsoul|c has quit [Quit: Connection closed for inactivity]
gcl_ has joined #lima
gcl has quit [Ping timeout: 256 seconds]
jonkerj has quit [Remote host closed the connection]
jonkerj has joined #lima
yuq825 has quit [Remote host closed the connection]
Barada has quit [Quit: Barada]
dddddd has quit [Remote host closed the connection]
gcl_ has quit [Ping timeout: 268 seconds]
<anarsoul|2> rellla: thanks
cwabbott has joined #lima
yann has quit [Ping timeout: 260 seconds]
<rellla> anarsoul|2: i had an issue in mali_syscall_tracker regarding gp uniform decoding. i'm doing new dumps now including your gpir patch for better readability ...
<anarsoul|2> cwabbott: thanks for the comment on write/read order. What about missing movs? I don't understand how we're tracking dependencies if we don't have a node for mov
<rellla> btw, we have sum3 in the dumps...
<anarsoul|2> rellla: heh. So it's some new op?
<anarsoul|2> that looks like sum3 but not sum3?
<cwabbott> rellla: iirc sum3 is used for dot products
<cwabbott> so if you use dot() it should show up
<cwabbott> (of vec3's obviously)
<anarsoul|2> cwabbott: see scrollback, rellla found that blob uses another undecoded op
<cwabbott> anarsoul|2: not sure exactly what you're asking about
<rellla> cwabbott: so it's not ($1.x + $1.y + $1.z) ?
<cwabbott> anarsoul|2: yeah, no idea about that one
<anarsoul|2> op18.v1
<anarsoul|2> cwabbott: do you by any chance remember why scheduler doesn't expect movs?
<cwabbott> rellla: if you have a good guess about what it is, you can try to create a simple shader that exhibits it and see if you can get offline-shader-compiler to emit it
<rellla> this is what blob does
<rellla> see original shader and disassembled code.
<cwabbott> rellla: my guess is that it's like sum3 but with different precision
<cwabbott> it's possible that sum3 is actually higher-precision than just naively doing (x + y) + z
<cwabbott> since otherwise it's the same
<rellla> having for (int i = 0; i < 4; i++)
<rellla> { res += tmp[i];} uses op18 with an additional add $.w on top. taking just 2 components leads to a simple add
<rellla> seems i need some test to find that out ...
<rellla> imho there must be a difference, because we have sum3 and op18 in the blob dumps
<cwabbott> ok, so my hypothesis is probably right then... when the shader uses dot() they take the liberty to compute the intermediate results in higher precision, but when you manually do it they do exactly what the user says
<cwabbott> and op18 is just a faster (but still low-precision) way of doing exactly what the user says
<rellla> ok, so op18 could be a lower-precision sum3 then, not a dot(), but an add(1,2,3)
<cwabbott> yeah
<rellla> ok, understood
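[editor's note: a small demonstration of the precision hypothesis above — how a three-way sum with rounded intermediates can differ from one accumulated in higher precision and rounded once. Mali GP does not use fp16; half precision is used here only because Python's `struct` supports it and it makes the rounding visible]

```python
import struct

def to_fp16(x):
    """Round a float to IEEE-754 half precision (struct's 'e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def sum3_lowp(x, y, z):
    """Naive (x + y) + z with every intermediate rounded, like op18 might."""
    return to_fp16(to_fp16(to_fp16(x) + to_fp16(y)) + to_fp16(z))

def sum3_highp(x, y, z):
    """Accumulate in full precision, round once, like sum3 might."""
    return to_fp16(x + y + z)

# At 2048 the fp16 spacing is 2, so each +1 is lost to round-to-even
# in the low-precision version but survives in the high-precision one.
a = sum3_lowp(2048.0, 1.0, 1.0)    # 2048.0
b = sum3_highp(2048.0, 1.0, 1.0)   # 2050.0
```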
<cwabbott> anarsoul|2: we don't use mov's because they're never necessary before the scheduler
<cwabbott> there are some situations where they're always necessary (e.g. load directly into store) but the scheduler has to handle that anyways on-the-fly, since that can be generated when spilling
<cwabbott> so inserting mov's beforehand doesn't actually save any complexity, and making the scheduler handle it would just add more complexity
<anarsoul|2> block_4
<anarsoul|2> r10 = fmul r6, ssa_17
<anarsoul|2> r8 = fmul r5, ssa_17
<anarsoul|2> r5 = mov r10
<anarsoul|2> and of course I missed some lines :) anyway I'm not sure how we can get rid of 'r5 = mov r10' here
<cwabbott> anarsoul|2: when translated into gpir that turns into a load that feeds directly into a store
anarsoul|c has joined #lima
<anarsoul|2> that seems to turn into a store from the mul node
<anarsoul|2> but it's not correct
<cwabbott> well, actually, since r10 is written it means that the store to r5 is taken from the mul
<anarsoul|2> yeah, exactly.
<cwabbott> why isn't it correct?
<anarsoul|2> because r5 is used after r10 is written?
<anarsoul|2> as well as r6
<cwabbott> those r5's should get the old value of r5
<anarsoul|2> cwabbott: but we don't add any deps when translating a move :\
<cwabbott> there shouldn't be any deps necessary
<cwabbott> it's a core assumption of the nir->gpir translator that any intra-block dependencies are expressed solely as value registers
<anarsoul|2> read after write deps
<cwabbott> i.e. we should never read a register which is written earlier in the block
<anarsoul|2> yes, but you essentially turn it into r10 = r5 = fmul ...
<anarsoul|2> so r5 gets written before it is read
<anarsoul|2> both writes are root nodes
<cwabbott> i don't understand... we keep around the value register for all nir registers
<cwabbott> so that if we ever write a nir register in a block, all later reads will use the value reg instead of the physreg
<cwabbott> the use of r5 in the fmul there actually comes from the previous iteration of the loop, or the load_input at the beginning
<cwabbott> and we don't need any deps for that since it's across a basic block boundary
<cwabbott> we do need a write-after-read dep, but iirc that gets handled before RA
<cwabbott> err, before reduce_scheduler
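[editor's note: a hypothetical sketch of the rule cwabbott describes for the nir->gpir translator — names and data shapes are invented, not actual mesa code: once a nir register is written inside a block, every later read in that same block uses the value register (the defining node) instead of the physreg, so no intra-block read-after-write deps on physregs are needed]

```python
# Lower a block of (dest, op, srcs) instructions, routing intra-block
# reads through the defining node ("value register") when one exists.
def lower_block(instrs):
    value_reg = {}                       # nir reg -> defining node in block
    lowered = []
    for dest, op, srcs in instrs:
        reads = tuple(value_reg.get(s, ("physreg", s)) for s in srcs)
        node = ("node", op, reads)
        value_reg[dest] = node           # later reads of dest use this node
        lowered.append((dest, node))
    return lowered

out = lower_block([
    ("r10", "fmul", ["r6", "ssa_17"]),   # r6 unwritten here: physreg read
    ("r5",  "mov",  ["r10"]),            # r10 written above: value read
])
# The fmul reads r6 from its physreg (the value from a previous block
# or iteration); the mov's source is the fmul node itself.
```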
<anarsoul|2> cwabbott: I'll try looking into it later today. I'm pretty certain that the counter (which is r9) gets the same physical register as r5
<anarsoul|2> as a result the counter gets overwritten and the shader hangs
<cwabbott> that would indeed be wrong :)
monstr has quit [Remote host closed the connection]
<cwabbott> but it's not the lack of read-after-write dependencies that's wrong... that's by design
<cwabbott> it could be liveness analysis going wrong or something
dddddd has joined #lima
<anarsoul|2> cwabbott: $0.y never appears in live_out of block 4, so yeah, could be
<anarsoul|2> that's what gets assigned to r5 and r10
deesix has quit [Ping timeout: 240 seconds]
dddddd has quit [Ping timeout: 256 seconds]
deesix has joined #lima
dddddd has joined #lima
<anarsoul|2> OK, I think I fixed it
<anarsoul|2> it also fixes dEQP-GLES2.functional.shaders.loops.while_constant_iterations.nested_sequence_vertex but not the other 3 hangs :(
buzzmarshall has joined #lima
<anarsoul|2> darn
<anarsoul|2> more fun with movs
<anarsoul|2> it doesn't hang, but it skips an iteration
<anarsoul|2> "017:acc0: add ^102/$3.y/$2.w $3.y $1.w // $3.y = $2.w = $3.y + $1.w" is wrong
<anarsoul|2> $2.w is i
<anarsoul|2> $3.y is supposed to be i + 1
<anarsoul|2> it should have been: $2.w = $3.y; $3.y = $3.y + $1.w
<anarsoul|2> it's coming from:
<anarsoul|2> block block_4:
<anarsoul|2> r6 = fadd r5, ssa_11
<anarsoul|2> r5 = mov r6
<anarsoul|2> /* preds: block_3 */
<anarsoul|2> r4 = mov r5
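[editor's note: a tiny plain-Python illustration of the miscompile described above — the sequence should copy the OLD loop counter and then increment, but the fused dual-destination write gives both registers the NEW value]

```python
# Correct sequential semantics: "$2.w = $3.y; $3.y = $3.y + $1.w"
i = 3
copy = i            # $2.w gets the old counter value
i = i + 1           # $3.y gets the incremented value
assert (copy, i) == (3, 4)

# What the emitted "$3.y = $2.w = $3.y + $1.w" does instead: one ALU
# result fanned out to both destinations, so the old counter is lost.
i2 = 3
i2 = copy2 = i2 + 1
assert (copy2, i2) == (4, 4)   # the iteration gets skipped
```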
yann has joined #lima
<anarsoul|2> I think I got it...
<anarsoul|2> my fix to calc_def_block() wasn't sufficient