ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at and - Contact ARM for binary driver support!
<anarsoul> i.e. it turns: block_1: { r1 = 0 } block_2: { r1 = r1 + 1; branch_if (r1 < 10) block_2; } into block_1: {} block_2: { r1 = 0; r1 = r1 + 1; branch_if (r1 < 10) block_2; }
<anarsoul> because "r1 = 0" doesn't have any successors in block_1
<enunes> yeah, seems fairly complicated, I wonder if there are some design changes to be made in the scheduler first to respect these, before we start changing all the existing lowerings and node creation
<anarsoul> I'm afraid the problem is in the lowerings. We assumed that sources always come from the same block, but that's a wrong assumption
<enunes> isn't it also a problem even without the ppir lowerings? what if the nodes arrived in ppir already with sources from different blocks?
<anarsoul> enunes: That's totally fine. They're already evaluated
<enunes> I don't understand what evaluated means
<anarsoul> enunes: source is either ssa or register
<anarsoul> if it's in the same block we have to schedule it in the way that it's evaluated before it's used
<anarsoul> but we don't have to do that for the previous block
<enunes> so what is bad is changing it from ssa to register or the other way around?
<anarsoul> enunes: basically if we add a mov r1, 0 from block_1 as a dep to first node of block_2, it's not a root node anymore, and block_1 will be empty
<anarsoul> see ppir_create_instr_from_node()
<anarsoul> root node is a node that doesn't have any successors
<anarsoul> I hope it makes sense :)
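The root-node rule anarsoul describes can be sketched with a small model (illustrative Python with hypothetical names, not actual ppir code — the real logic lives in ppir_create_instr_from_node() in C):

```python
# Illustrative model: a node is a "root" in its block when nothing in
# the dependency graph consumes it; instruction creation starts from
# roots, so a block with no roots produces no instructions.

class Node:
    def __init__(self, name, block):
        self.name = name
        self.block = block
        self.succs = []   # nodes that consume this node's result
        self.preds = []   # nodes this node depends on

def add_dep(succ, pred):
    succ.preds.append(pred)
    pred.succs.append(succ)

def roots_in_block(nodes, block):
    """Roots are nodes in this block with no successors."""
    return [n for n in nodes if n.block == block and not n.succs]

# block_1: { r1 = 0 }   block_2: { r1 = r1 + 1; ... }
init = Node("r1 = 0", block=1)
add = Node("r1 = r1 + 1", block=2)
nodes = [init, add]

print([n.name for n in roots_in_block(nodes, 1)])  # init is block_1's root

# Adding the cross-block dep (add consumes init's r1) strips init of its
# root status, so block_1 has no roots left -- and ends up empty.
add_dep(add, init)
print([n.name for n in roots_in_block(nodes, 1)])
```

This is the failure mode described above: a perfectly valid value definition in block_1 vanishes from the emitted code once a successor in block_2 is recorded for it.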
<enunes> it does make sense in parts, but it's hard to follow or make smarter questions without seeing the code in detail and debugging
<anarsoul> enunes: in short, we can't assume that any source is in predecessors list, it can be from previous block
<anarsoul> it's still WIP, I just rebased it and tried to fix ppir_lower_select()
<anarsoul> another possible solution: patch ppir_node_add_dep() to create an extra mov if succ->block != pred->block
<anarsoul> and add a pass to remove redundant movs
<anarsoul> enunes: I think all we need is to have proper dependency graph within a block. Block ordering guarantees that inter-block dependencies are met
<anarsoul> I think the easiest way is to fix the lowerings
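anarsoul's alternative workaround — patching the dep-adding helper to never link nodes across blocks — might look roughly like this (a Python sketch with hypothetical names; the real ppir_node_add_dep() is C, and the exact placement of the extra mov is an assumption here):

```python
# Sketch of the proposed workaround: when a dependency crosses a block
# boundary, insert a mov in the successor's block instead of linking
# across blocks, so every block's dependency graph stays self-contained
# and the predecessor keeps its root status in its own block.

class Node:
    def __init__(self, op, block):
        self.op = op
        self.block = block
        self.preds = []
        self.succs = []

def link(succ, pred):
    succ.preds.append(pred)
    pred.succs.append(succ)

def add_dep_cross_block_safe(succ, pred):
    """Like ppir_node_add_dep(), but never links across blocks."""
    if succ.block != pred.block:
        mov = Node("mov", succ.block)  # extra mov lives in succ's block
        link(succ, mov)
        return mov                     # candidate for a later cleanup pass
    link(succ, pred)
    return None

a = Node("r1 = 0", block=1)
b = Node("r1 = r1 + 1", block=2)
mov = add_dep_cross_block_safe(b, a)
print(a.succs == [], b.preds[0].op)
```

The redundant movs this creates would then be removed by the extra cleanup pass mentioned above.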
<rellla> anarsoul: bad news :)
<rellla> i'm currently checking piglit against the corresponding master so that you can see the regressions...
<rellla> anarsoul: puh, the good news is that pp-control-flow fixes ~250 tests, the bad news is that we seem to have some regression in a previous commit.
<rellla> let me do some bisecting ...
<cwabbott> rellla: btw, for, it seems from a quick test that the abs modifier isn't supported with derivatives, the blob inserts a mov instruction before the derivative if you do dFdx(abs(...))
<cwabbott> anarsoul: sure, sorry I forgot about it!
<cwabbott> I got a little distracted trying to figure out why my fix for exp2/log2 didn't work, and discovered that we have to significantly change how we handle complex operations in general
<cwabbott> I spent some time fiddling with the backend to reverse-engineer what some of the opcodes do, and it seems that some of them (e.g. *_impl and preexp2) produce intermediate values that aren't supposed to be interpreted as floating-point at all
<cwabbott> the problem is that mov operations in the add and mul units (and maybe complex/pass as well?) are still floating-point operations, so they flush some "invalid" values that would never be produced by a floating-point operation
<cwabbott> e.g. they flush denorms -> 0 since denorms aren't supported, and NaN's with a nonstandard payload to a standard NaN
<cwabbott> the problems come when an intermediate result, interpreted as a floating-point number, happens to be one of those things: we insert a move node because we couldn't schedule the op right away, and the move flushes it to the wrong thing
<cwabbott> the tl;dr is that we mostly have to schedule the whole sequence *exactly* as the blob does so that we don't have to insert any moves
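The flushing cwabbott describes can be modeled on raw 32-bit patterns (an illustrative sketch of the assumed semantics, not verified hardware behavior):

```python
# Model of what a floating-point mov on the PP does to a raw bit
# pattern: denormals flush to signed zero, NaN payloads get
# canonicalized. Raw intermediate values from ops like *_impl or
# preexp2 are corrupted if such a mov gets scheduled in between.

QNAN = 0x7fc00000  # canonical quiet-NaN bit pattern

def float_mov_flush(bits):
    """Apply the modeled mov flushing to a 32-bit pattern."""
    exp = (bits >> 23) & 0xff
    mant = bits & 0x7fffff
    if exp == 0 and mant != 0:
        return bits & 0x80000000           # denormal -> signed zero
    if exp == 0xff and mant != 0:
        return (bits & 0x80000000) | QNAN  # NaN payload canonicalized
    return bits                            # normals pass through intact

# An intermediate result whose bits happen to look like a denormal is
# destroyed by the mov -- hence the need to schedule the whole sequence
# with no moves inserted.
print(hex(float_mov_flush(0x00000001)))  # denormal pattern -> 0x0
print(hex(float_mov_flush(0x7f800001)))  # -> 0x7fc00000
```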
<rellla> cwabbott: do you have some disassembly of what the blob does for glsl-derivs-abs-sign somewhere?
<cwabbott> it seems to even insert the move if the source is just negated
<cwabbott> I would've thought you could just swap which source gets the negate, apparently that doesn't work, or the blob isn't clever enough
<rellla> thanks i will have a look. it only affects the fdd/abs which is combined with negation!?
<rellla> as glsl-derivs-abs passes...
<cwabbott> weird indeed
<cwabbott> anarsoul: btw, is it maybe possible to have the scheduler create the predecessor lists itself and have the lowering passes ignore all that stuff? that's how it's usually done in other backends
<rellla> seems i've got some mess here... maybe piglit. rebuilding now.
<anarsoul> rellla: my branch can wedge the gpu, reloading the lima module is enough to unwedge it
<rellla> maybe thats it...
<anarsoul> cwabbott: yeah, it makes sense
<anarsoul> well, node_to_instr can also insert movs
<anarsoul> cwabbott: so I'm thinking about building predecessor list in scheduler, I guess I need to add ppir_node * to ppir_src to achieve that?
<anarsoul> oh, ppir_update_spilled_dest() and ppir_update_spilled_src() also create nodes and update predecessor lists...
<anarsoul> enunes: ideas?
<enunes> anarsoul: yes... back then I discussed with yuq whether we wanted to have a full scheduler re-run after introducing the spill/load nodes..., but he suggested to just add the nodes and instructions directly in regalloc anyway
<enunes> is it a problem for you even at regalloc time?
<enunes> I guess it adds some complexity as you might have to jump to the preceding instruction in case a load is inserted
<enunes> but I wonder if it is a real problem other than that
<enunes> loads and stores should always be in the same block as the instruction dealing with the spilled register
<anarsoul> well, I guess we can deal with it later
<anarsoul> darn
<anarsoul> I think we did too many premature optimizations and now it's really hard to add control flow support :\
<enunes> I would be in favour of sacrificing unnecessary optimizations, I just did one of those in the texture proj series :)
<anarsoul> like getting rid of successors/predecessors lists and just doing one op per instruction in order?
<anarsoul> that's technically a significant rewrite of ppir compiler
<enunes> don't we need successor/predecessor for other things than optimization?
<anarsoul> what things are you referring to?
<enunes> in general I don't think we have too many unnecessary optimizations, but we do have some specific things like trying to see if we can merge a mul and an add and switch to a pipeline register
<anarsoul> enunes: we reorder instructions
<enunes> is the reordering hard to remove?
<anarsoul> enunes: well, the problem is that we depend on pred/succ lists pretty much everywhere
<anarsoul> and we assume that they're complete (and that's true for a single block case)
<anarsoul> but it all breaks when we have more than one block
<anarsoul> grep for ppir_node_foreach_succ_safe() and ppir_node_foreach_pred_safe()
<enunes> I'd be ok with patches gradually removing the dependency on these lists and the reordering as preparation to support control flow, if that is what is blocking it
<enunes> I guess pred/succ makes sense on gpir as there are the dependency on the previous-result registers, but maybe ppir doesn't need to
<enunes> but yeah I suppose ultimately it is basically a redesign...
<anarsoul> enunes: it does need pred/succ lists for scheduler if we want to do any reordering
<anarsoul> btw I suspect that adding cf to gpir would be a pretty hard task
<enunes> can't we limit reordering only to intra-block?
<enunes> or otherwise I'd totally vote to remove reordering completely
<anarsoul> enunes: it's supposed to be intra-block only
<anarsoul> that's why we don't add nodes from other blocks to succ/pred lists
<anarsoul> but it just breaks everywhere else like in select lowering that assumes that all its sources are in the same block
<anarsoul> I believe ldtex is also broken in the same way
<anarsoul> so is inserting extra movs in node_to_instr
<anarsoul> spilling should be fine since we don't care there about succ/preds anymore
<anarsoul> but I'm not 100% sure though
<enunes> why do we have those movs in select anyway?
<enunes> the comment is not very clear, is it just optimization?
<enunes> one other thing I wonder sometimes is if it would be helpful to have some of these lima lowerings in lima-specific nir lowerings, like lima_nir_lower_uniform_to_scalar, so we don't have to deal with the pred/succ stuff manually
<anarsoul> enunes: it's not an optimization
<anarsoul> enunes: select is done by addition unit but it has only 2 inputs. 3rd input comes from mul unit
<anarsoul> so lowering inserts an extra mov for the 1st select argument (the condition) to guarantee that it'll take the mul slot of the instruction
<anarsoul> it's just like ldtex, single nir instruction multiple pp instructions
<enunes> yeah I see now, makes sense
<enunes> and then we rely on the scheduler to actually put these two ops in the same instruction, with no clear indication from the lowering code...
<anarsoul> yeah
<enunes> it should explicitly use a pipeline register instead?
<enunes> I did that in the texture projection patch
<anarsoul> enunes: well, the problem is that there's no pipeline register for it
<anarsoul> scalar addition has only 2 inputs - a and b
<anarsoul> 3rd input is implicit
<enunes> it is the same with varying fetch and texture fetch, varying fetch can just write to discard and the value shows up in the next instruction
<enunes> well, next op...
<enunes> we should be able to represent that and not rely on the scheduler
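The select constraint discussed above — the add unit executes select but only encodes two sources, with the condition coming implicitly from the mul slot of the same instruction — can be sketched like this (hypothetical node model, not real ppir code):

```python
# Sketch of ppir's select lowering: select(cond, a, b) runs in the
# scalar add unit, which encodes only two sources; the condition must
# come from the mul unit in the same instruction, so lowering routes it
# through an extra mov pinned to the mul slot.

class Node:
    def __init__(self, op, srcs=()):
        self.op = op
        self.srcs = list(srcs)

def lower_select(select):
    """Rewrite select so its condition flows through a mul-slot mov."""
    assert select.op == "select" and len(select.srcs) == 3
    cond, a, b = select.srcs
    mov = Node("mov", [cond])     # to be scheduled into the mul slot
    select.srcs = [mov, a, b]     # 3rd input stays implicit in encoding
    return mov

sel = Node("select", [Node("cmp"), Node("x"), Node("y")])
mov = lower_select(sel)
print(mov.op, sel.srcs[0] is mov)
```

The scheduler then has to place the mov and the select in the same instruction — which is exactly the implicit dependency enunes points out is never stated by the lowering code.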
<anarsoul> yeah, probably
<anarsoul> enunes: I guess we need a function like nir_ssa_def_rewrite_uses() for ppir
<enunes> so as a preparatory work for cf, would it make sense to replace the existing scheduler with a dumb one that schedules 1 op per instr, except when there is a pipeline dependency between ops, and add pipeline dependencies in lowering for the required cases?
<anarsoul> enunes: I'm thinking about following:
<anarsoul> 1) add ppir_node pointer into ppir_src
<anarsoul> 2) stop using succ/pred lists anywhere but in scheduler and add a function that builds succ/pred lists before scheduling
<anarsoul> 3) probably fixing a mess with implicit dependencies like mov <-> select
<anarsoul> what do you think?
<anarsoul> probably we still have to deal with extra movs in node_to_instr...
<anarsoul> so pred/succ lists should be built before calling node_to_instr
<anarsoul> but it shouldn't be used anywhere but in node_to_instr and scheduler
<anarsoul> we definitely don't need it in spilling code
<enunes> maybe we can get rid of the movs in node_to_instr, move them to lowering? I tried to do that initially but gave up as it became too complex while trying to add a feature at the same time...
<enunes> maybe by itself it would be doable
<enunes> why do you still need succ/pred in step 2?
<anarsoul> enunes: to avoid making scheduler dumb
<enunes> I would gladly have a dumb one over one that does random optimizations or needs things like "succ->op == ppir_op_select && ppir_node_first_pred(succ) == node"...
<enunes> from a dumb one we could maybe do a global optimization pass rather than handpicked ones
<anarsoul> yeah, probably
<anarsoul> I guess it would be beneficial to do 1 op per instruction first and then attempt to merge them
<enunes> so this would require moving these magic things like the select dependency into explicit dependencies inserted while lowering, and even the dumb scheduler needs to respect those
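A minimal sketch of that dumb scheduler, assuming pipeline dependencies are recorded explicitly during lowering (names and data layout are illustrative, not the real ppir structures):

```python
# One op per instruction, except when an op is tied to a predecessor by
# an explicit pipeline dependency (e.g. select's condition mov, texture
# fetch results) -- then it must share that predecessor's instruction.

def dumb_schedule(ops):
    """ops: list of (name, pipeline_pred_index or None), in block order."""
    instrs = []
    placed = {}                  # op index -> instruction index
    for i, (name, pipe_pred) in enumerate(ops):
        if pipe_pred is not None:
            # Pipeline-tied ops join their predecessor's instruction.
            instrs[placed[pipe_pred]].append(name)
            placed[i] = placed[pipe_pred]
        else:
            instrs.append([name])
            placed[i] = len(instrs) - 1
    return instrs

ops = [("mov_cond", None), ("select", 0), ("add", None)]
print(dumb_schedule(ops))  # [['mov_cond', 'select'], ['add']]
```

A smarter instruction-merging pass could then run on top of this as a separate global optimization, instead of the handpicked cases in the current scheduler.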
<anarsoul> enunes: I'm not sure if it's possible to move 'movs' from node_to_instr to lowering since lowering isn't aware of instruction format and limitations
<enunes> anarsoul: thinking about insert_to_load_tex for example... if the scheduler just respected pipeline register dependencies, it would be possible to move all insert_to_load_tex into ppir_lower_texture
<enunes> I could try to write a patch with what I'm thinking on this, if you think it's helpful in the end
<enunes> and attempt to move the tex and select dependencies out of the scheduler into lowering
<anarsoul> enunes: see insert_to_each_succ_instr() it also creates extra moves
<anarsoul> enunes: well, if you think that it's easy then it'll definitely be helpful. Otherwise it makes sense to start a discussion on ML
<enunes> yeah insert_to_each_succ_instr is not the greatest function ever to follow... 2 of its users have a comment saying "save a reg" so possibly just optimization? the other is const, which indeed can be tricky
<anarsoul> const is not tricky at all
<anarsoul> we can stick up to 2 consts to any instruction
<anarsoul> I believe that it'd be better to treat consts differently from all other sources (ssa or regs)
<enunes> maybe it is possible to just do the const duplicating in lowering too as op nodes and it's not terrible?
<anarsoul> enunes: what about adding 3rd source type - const?
<enunes> so we don't have to "insert to each succ instr"
<anarsoul> we can convert ssas that are consts into const src in lowering
<anarsoul> we don't need a reg for consts
<enunes> doesn't sound bad, I just wonder if we only have ssa and reg in a way to mirror the nir src options
<anarsoul> technically it's costless optimization
<enunes> in many places we have to check for reg or ssa, or even "if !ssa", so having a 3rd might require some work everywhere
<anarsoul> now we compile something like gl_FragColor = vec4(1.0); into mov $1.xyzw, const0(1.0, 1.0, 1.0, 1.0); mov $0, $1
<anarsoul> because in nir it's:
<anarsoul> ssa1 = const vec4(1.0)
<anarsoul> store_output(ssa1)
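The const-as-a-source-type idea could work roughly like this (illustrative sketch; `Src` and `fold_consts` are hypothetical names, and the two-consts-per-instruction limit comes from anarsoul's remark above):

```python
# Sketch of folding constant ssa defs straight into their users'
# sources (up to two consts fit in a PP instruction), instead of
# emitting mov $1, const0(...); mov $0, $1 through a register.

class Src:
    def __init__(self, kind, value=None):
        self.kind = kind      # "ssa", "reg", or the proposed "const"
        self.value = value

def fold_consts(instr_srcs, const_defs, max_consts=2):
    """Rewrite ssa sources that refer to constants into const sources."""
    used = 0
    for src in instr_srcs:
        if src.kind == "ssa" and src.value in const_defs and used < max_consts:
            src.kind, src.value = "const", const_defs[src.value]
            used += 1
    return instr_srcs

# store_output(ssa1) with ssa1 = const vec4(1.0) collapses to a single
# instruction carrying an inline const.
srcs = fold_consts([Src("ssa", "ssa1")], {"ssa1": (1.0, 1.0, 1.0, 1.0)})
print(srcs[0].kind, srcs[0].value)
```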
<anarsoul> enunes: OK, then again - pipeline reg to load_const node that's cloned to each successor
<enunes> hmm what?
<anarsoul> enunes: basically the same what you proposed. Do not introduce another source type. In lower create a clone node for each successor and use pipeline reg
<anarsoul> it's a bit harder though since we'll be lowering not the const itself but all nodes that have a const in their src...
<anarsoul> might be easier to treat consts differently
<enunes> ah yes, that might work too, indeed I think consts are like pipeline regs
<enunes> I guess duplicate the const node to all of its users and make it like an op outputting to a const pipeline reg
<enunes> there might be corner cases, like what if we run out of const slots and need to insert the mov
<anarsoul> enunes: that's not possible with dumb scheduler
<enunes> but still we should try to do that in lowering, not scheduler
<anarsoul> i.e. one op per instruction
<anarsoul> there are no instructions with more than 2 sources
<enunes> well assuming we introduce the pipeline-aware scheduler, it will be more than 1 op per instruction in case there is a pipeline dependency, including consts
<anarsoul> right...
<anarsoul> hm
<anarsoul> OK
<anarsoul> I think solution is to combine lowering and node_to_instr
<anarsoul> or rather allocate instr with nodes from the very beginning
<anarsoul> during lowering some instructions will be merged
<anarsoul> i.e. consts will go to their users when possible
<anarsoul> const itself will be lowered to mov rX, const
<anarsoul> if it has no users it gets removed
<anarsoul> (we'll need to keep track of users)
<anarsoul> all in all it sounds like a rewrite of ppir compiler with preserving some of data structures :)
<anarsoul> basically most of lower.c node_to_instr.c and scheduler.c need to be rewritten
<anarsoul> nir.c is fine
<enunes> I'll try to do the smaller thing I mentioned to see if it is really easy or I hit some blocker, maybe we can do that in parts rather than a big rewrite
<anarsoul> instr.c probably also needs to be rewritten
<anarsoul> enunes: sure, sounds good
<anarsoul> enunes: I suspect that trying to treat nodes separately from instructions isn't a good solution
<anarsoul> most of the difficulties arise when we try to place nodes into instructions
<anarsoul> or rather from the fact that we have some code that is not instruction-aware
<enunes> yeah makes sense
<enunes> what I'm proposing also goes in the direction to make lowering kinda instruction-aware
<anarsoul> enunes: I think it's not enough just to have pipeline regs
<anarsoul> I'll try to summarize it all and send an email to ML tonight
<rellla> anarsoul: btw, gpu was wedged. it's not possible for me to run the whole piglit tests.
<anarsoul> rellla: :(
<anarsoul> I'll need to pinpoint later which test wedges it (likely it's the first failed test)
<rellla> i've got 436 passes, then it breaks, but i haven't looked in detail what the first one was.
<anarsoul> rellla: thanks for testing but it looks like we need some compiler refactoring before we can land control flow support
<rellla> :) i read backlog
<anarsoul> cwabbott: please merge your MR even without exp2/log2, it fixes some mysterious failures so it's definitely nice to have
<MoeIcenowy> anarsoul: how do we check whether the GPU is hung?
<MoeIcenowy> maybe we should add code to reset GPU in drm/lima
<anarsoul> MoeIcenowy: I don't know. Kernel driver doesn't complain, job still completes but result is wrong
<anarsoul> reset would be definitely useful