ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at and - Contact ARM for binary driver support!
<anarsoul> i.e. it turns: block_1: { r1 = 0 } block_2: { r1 = r1 + 1; branch_if (r1 < 10) block_2; } into block_1: {} block_2: { r1 = 0; r1 = r1 + 1; branch_if (r1 < 10) block_2; }
<anarsoul> because "r1 = 0" doesn't have any successors in block_1
<enunes> yeah, seems fairly complicated, I wonder if there are some design changes to be made in the scheduler first to respect these, before we start changing all the existing lowerings and node creation
<anarsoul> I'm afraid the problem is in the lowerings. We assumed that sources always come from the same block, but that's a wrong assumption
<enunes> isn't it also a problem even without the ppir lowerings? what if the nodes arrived in ppir already with sources from different blocks?
<anarsoul> enunes: That's totally fine. They're already evaluated
<enunes> I don't understand what evaluated means
<anarsoul> enunes: source is either ssa or register
<anarsoul> if it's in the same block we have to schedule it in the way that it's evaluated before it's used
<anarsoul> but we don't have to do that for the previous block
<enunes> so what is bad is changing it from ssa to register or the other way around?
<anarsoul> enunes: basically if we add a mov r1, 0 from block_1 as a dep to first node of block_2, it's not a root node anymore, and block_1 will be empty
<anarsoul> see ppir_create_instr_from_node()
<anarsoul> root node is a node that doesn't have any successors
<anarsoul> I hope it makes sense :)
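The root-node rule anarsoul describes can be sketched with a small model (illustrative Python with hypothetical names, not actual ppir code — the real logic lives in ppir_create_instr_from_node() in C):

```python
# Illustrative model: a node is a "root" in its block when nothing in
# the dependency graph consumes it; instruction creation starts from
# roots, so a block with no roots produces no instructions.

class Node:
    def __init__(self, name, block):
        self.name = name
        self.block = block
        self.succs = []   # nodes that consume this node's result
        self.preds = []   # nodes this node depends on

def add_dep(succ, pred):
    succ.preds.append(pred)
    pred.succs.append(succ)

def roots_in_block(nodes, block):
    """Roots are nodes in this block with no successors."""
    return [n for n in nodes if n.block == block and not n.succs]

# block_1: { r1 = 0 }   block_2: { r1 = r1 + 1; ... }
init = Node("r1 = 0", block=1)
add = Node("r1 = r1 + 1", block=2)
nodes = [init, add]

print([n.name for n in roots_in_block(nodes, 1)])  # init is block_1's root

# Adding the cross-block dep (add consumes init's r1) strips init of its
# root status, so block_1 has no roots left -- and ends up empty.
add_dep(add, init)
print([n.name for n in roots_in_block(nodes, 1)])
```

This is the failure mode described above: a perfectly valid value definition in block_1 vanishes from the emitted code once a successor in block_2 is recorded for it.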
<enunes> it does make sense in parts, but it's hard to follow or make smarter questions without seeing the code in detail and debugging
<anarsoul> enunes: in short, we can't assume that any source is in predecessors list, it can be from previous block
<anarsoul> it's still WIP, I just rebased it and tried to fix ppir_lower_select()
<anarsoul> another possible solution: patch ppir_node_add_dep() to create an extra mov if succ->block != pred->block
<anarsoul> and add a pass to remove redundant movs
<anarsoul> enunes: I think all we need is to have proper dependency graph within a block. Block ordering guarantees that inter-block dependencies are met
<anarsoul> I think the easiest way is to fix the lowerings
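anarsoul's alternative workaround — patching the dep-adding helper to never link nodes across blocks — might look roughly like this (a Python sketch with hypothetical names; the real ppir_node_add_dep() is C, and the exact placement of the extra mov is an assumption here):

```python
# Sketch of the proposed workaround: when a dependency crosses a block
# boundary, insert a mov in the successor's block instead of linking
# across blocks, so every block's dependency graph stays self-contained
# and the predecessor keeps its root status in its own block.

class Node:
    def __init__(self, op, block):
        self.op = op
        self.block = block
        self.preds = []
        self.succs = []

def link(succ, pred):
    succ.preds.append(pred)
    pred.succs.append(succ)

def add_dep_cross_block_safe(succ, pred):
    """Like ppir_node_add_dep(), but never links across blocks."""
    if succ.block != pred.block:
        mov = Node("mov", succ.block)  # extra mov lives in succ's block
        link(succ, mov)
        return mov                     # candidate for a later cleanup pass
    link(succ, pred)
    return None

a = Node("r1 = 0", block=1)
b = Node("r1 = r1 + 1", block=2)
mov = add_dep_cross_block_safe(b, a)
print(a.succs == [], b.preds[0].op)
```

The redundant movs this creates would then be removed by the extra cleanup pass mentioned above.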
<rellla> anarsoul: bad news :)
<rellla> i'm currently checking piglit against the corresponding master so that you can see the regressions...
<rellla> anarsoul: puh, the good news is that pp-control-flow fixes ~250 tests, the bad news is that we seem to have some regression in a previous commit.
<rellla> let me do some bisecting ...
<cwabbott> rellla: btw, for, it seems from a quick test that the abs modifier isn't supported with derivatives, the blob inserts a mov instruction before the derivative if you do dFdx(abs(...))
<cwabbott> anarsoul: sure, sorry I forgot about it!
<cwabbott> I got a little distracted trying to figure out why my fix for exp2/log2 didn't work, and discovered that we have to significantly change how we handle complex operations in general
<cwabbott> I spent some time fiddling with the backend to reverse-engineer what some of the opcodes do, and it seems that some of them (e.g. *_impl and preexp2) produce intermediate values that aren't supposed to be interpreted as floating-point at all
<cwabbott> the problem is that mov operations in the add and mul units (and maybe complex/pass as well?) are still floating-point operations, so they flush some "invalid" values that would never be produced by a floating-point operation
<cwabbott> e.g. they flush denorms -> 0 since denorms aren't supported, and NaN's with a nonstandard payload to a standard NaN
<cwabbott> the problems come when an intermediate result, interpreted as a floating-point number, happens to be one of those things: we insert a move node because we couldn't schedule the op right away, and the move flushes it to the wrong thing
<cwabbott> the tl;dr is that we mostly have to schedule the whole sequence *exactly* as the blob does so that we don't have to insert any moves
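The flushing cwabbott describes can be modeled on raw 32-bit patterns (an illustrative sketch of the assumed semantics, not verified hardware behavior):

```python
# Model of what a floating-point mov on the PP does to a raw bit
# pattern: denormals flush to signed zero, NaN payloads get
# canonicalized. Raw intermediate values from ops like *_impl or
# preexp2 are corrupted if such a mov gets scheduled in between.

QNAN = 0x7fc00000  # canonical quiet-NaN bit pattern

def float_mov_flush(bits):
    """Apply the modeled mov flushing to a 32-bit pattern."""
    exp = (bits >> 23) & 0xff
    mant = bits & 0x7fffff
    if exp == 0 and mant != 0:
        return bits & 0x80000000           # denormal -> signed zero
    if exp == 0xff and mant != 0:
        return (bits & 0x80000000) | QNAN  # NaN payload canonicalized
    return bits                            # normals pass through intact

# An intermediate result whose bits happen to look like a denormal is
# destroyed by the mov -- hence the need to schedule the whole sequence
# with no moves inserted.
print(hex(float_mov_flush(0x00000001)))  # denormal pattern -> 0x0
print(hex(float_mov_flush(0x7f800001)))  # -> 0x7fc00000
```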
<rellla> cwabbott: do you have some disassembly of what the blob does for glsl-derivs-abs-sign somewhere?
<cwabbott> it seems to even insert the move if the source is just negated
<cwabbott> I would've thought you could just swap which source gets the negate, apparently that doesn't work, or the blob isn't clever enough
<rellla> thanks i will have a look. it only affects the fdd/abs which is combined with negation!?
<rellla> as glsl-derivs-abs passes...
<cwabbott> weird indeed
<cwabbott> anarsoul: btw, is it maybe possible to have the scheduler create the predecessor lists itself and have the lowering passes ignore all that stuff? that's how it's usually done in other backends
<rellla> seems i've got some mess here... maybe piglit. rebuilding now.
<anarsoul> rellla: my branch can wedge the gpu, reloading the lima module is enough to unwedge it
<rellla> maybe thats it...
<anarsoul> cwabbott: yeah, it makes sense
<anarsoul> well, node_to_instr can also insert movs
<anarsoul> cwabbott: so I'm thinking about building predecessor list in scheduler, I guess I need to add ppir_node * to ppir_src to achieve that?
<anarsoul> oh, ppir_update_spilled_dest() and ppir_update_spilled_src() also create nodes and update predecessor lists...
<anarsoul> enunes: ideas?
<enunes> anarsoul: yes... back then I discussed with yuq whether we wanted to have a full scheduler re-run after introducing the spill/load nodes..., but he suggested to just add the nodes and instructions directly in regalloc anyway
<enunes> is it a problem for you even at regalloc time?
<enunes> I guess it adds some complexity as you might have to jump to the preceding instruction in case a load is inserted
<enunes> but I wonder if it is a real problem other than that
<enunes> loads and stores should always be in the same block as the instruction dealing with the spilled register
<anarsoul> well, I guess we can deal with it later
<anarsoul> darn
<anarsoul> I think we did too many premature optimizations and now it's really hard to add control flow support :\
<enunes> I would be in favour of sacrificing unnecessary optimizations, I just did one of those in the texture proj series :)
<anarsoul> like getting rid of successors/predecessors lists and just doing one op per instruction in order?
<anarsoul> that's technically a significant rewrite of ppir compiler
<enunes> don't we need successor/predecessor for other things than optimization?
<anarsoul> what things are you referring to?
<enunes> in general I don't think we have too many unnecessary optimizations, but we do have some specific things like trying to see if we can merge a mul and an add and switch to a pipeline register
<anarsoul> enunes: we reorder instructions
<enunes> is the reordering hard to remove?
<anarsoul> enunes: well, the problem is that we depend on pred/succ lists pretty much everywhere
<anarsoul> and we assume that they're complete (and that's true for a single block case)
<anarsoul> but it all breaks when we have more than one block
<anarsoul> grep for ppir_node_foreach_succ_safe() and ppir_node_foreach_pred_safe()
<enunes> I'd be ok with patches gradually removing the dependency on these lists and the reordering as preparation to support control flow, if that is what is blocking it
<enunes> I guess pred/succ makes sense on gpir as there are the dependency on the previous-result registers, but maybe ppir doesn't need to
<enunes> but yeah I suppose ultimately it is basically a redesign...
<anarsoul> enunes: it does need pred/succ lists for scheduler if we want to do any reordering
<anarsoul> btw I suspect that adding cf to gpir would be a pretty hard task
<enunes> can't we limit reordering only to intra-block?
<enunes> or otherwise I'd totally vote to remove reordering completely
<anarsoul> enunes: it's supposed to be intra-block only
<anarsoul> that's why we don't add nodes from other blocks to succ/pred lists
<anarsoul> but it just breaks everywhere else like in select lowering that assumes that all its sources are in the same block
<anarsoul> I believe ldtex is also broken in the same way
<anarsoul> so is inserting extra movs in node_to_instr
<anarsoul> spilling should be fine since we don't care there about succ/preds anymore
<anarsoul> but I'm not 100% sure though
<enunes> why do we have those movs in select anyway?
<enunes> the comment is not very clear, is it just optimization?
<enunes> one other thing I wonder sometimes is if it would be helpful to have some of these lima lowerings in lima-specific nir lowerings, like lima_nir_lower_uniform_to_scalar, so we don't have to deal with the pred/succ stuff manually
<anarsoul> enunes: it's not an optimization
<anarsoul> enunes: select is done by addition unit but it has only 2 inputs. 3rd input comes from mul unit
<anarsoul> so lowering inserts an extra mov for the 1st select argument (the condition) to guarantee that it'll take the mul slot of the instruction
<anarsoul> it's just like ldtex, single nir instruction multiple pp instructions
<enunes> yeah I see now, makes sense
<enunes> and then we rely on the scheduler to actually put these two ops in the same instruction, with no clear indication from the lowering code...
<anarsoul> yeah
<enunes> it should explicitly use a pipeline register instead?
<enunes> I did that in the texture projection patch
<anarsoul> enunes: well, the problem is that there's no pipeline register for it
<anarsoul> scalar addition has only 2 inputs - a and b
<anarsoul> 3rd input is implicit
<enunes> it is the same with varying fetch and texture fetch, varying fetch can just write to discard and the value shows up in the next instruction
<enunes> well, next op...
<enunes> we should be able to represent that and not rely on the scheduler
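The select constraint discussed above — the add unit executes select but only encodes two sources, with the condition coming implicitly from the mul slot of the same instruction — can be sketched like this (hypothetical node model, not real ppir code):

```python
# Sketch of ppir's select lowering: select(cond, a, b) runs in the
# scalar add unit, which encodes only two sources; the condition must
# come from the mul unit in the same instruction, so lowering routes it
# through an extra mov pinned to the mul slot.

class Node:
    def __init__(self, op, srcs=()):
        self.op = op
        self.srcs = list(srcs)

def lower_select(select):
    """Rewrite select so its condition flows through a mul-slot mov."""
    assert select.op == "select" and len(select.srcs) == 3
    cond, a, b = select.srcs
    mov = Node("mov", [cond])     # to be scheduled into the mul slot
    select.srcs = [mov, a, b]     # 3rd input stays implicit in encoding
    return mov

sel = Node("select", [Node("cmp"), Node("x"), Node("y")])
mov = lower_select(sel)
print(mov.op, sel.srcs[0] is mov)
```

The scheduler then has to place the mov and the select in the same instruction — which is exactly the implicit dependency enunes points out is never stated by the lowering code.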
<anarsoul> yeah, probably
<anarsoul> enunes: I guess we need a function like nir_ssa_def_rewrite_uses() for ppir
<enunes> so as a preparatory work for cf, would it make sense to replace the existing scheduler with a dumb one that schedules 1 op per instr, except when there is a pipeline dependency between ops, and add pipeline dependencies in lowering for the required cases?
<anarsoul> enunes: I'm thinking about following:
<anarsoul> 1) add ppir_node pointer into ppir_src
<anarsoul> 2) stop using succ/pred lists anywhere but in scheduler and add a function that builds succ/pred lists before scheduling
<anarsoul> 3) probably fixing a mess with implicit dependencies like mov <-> select
<anarsoul> what do you think?
<anarsoul> probably we still have to deal with extra movs in node_to_instr...
<anarsoul> so pred/succ lists should be built before calling node_to_instr
<anarsoul> but it shouldn't be used anywhere but in node_to_instr and scheduler
<anarsoul> we definitely don't need it in spilling code
<enunes> maybe we can get rid of the movs in node_to_instr, move them to lowering? I tried to do that initially but gave up as it became too complex while trying to add a feature at the same time...
<enunes> maybe by itself it would be doable
<enunes> why do you still need succ/pred in step 2?
<anarsoul> enunes: to avoid making scheduler dumb
<enunes> I would gladly have a dumb one over one that does random optimizations or needs things like "succ->op == ppir_op_select && ppir_node_first_pred(succ) == node"...
<enunes> from a dumb one we could maybe do a global optimization pass rather than handpicked ones
<anarsoul> yeah, probably
<anarsoul> I guess it would be beneficial to do 1 op per instruction first and then attempt to merge them
<enunes> so this would require moving these magic things like the select dependency into explicit dependencies inserted while lowering, and even the dumb scheduler needs to respect those
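A minimal sketch of that dumb scheduler, assuming pipeline dependencies are recorded explicitly during lowering (names and data layout are illustrative, not the real ppir structures):

```python
# One op per instruction, except when an op is tied to a predecessor by
# an explicit pipeline dependency (e.g. select's condition mov, texture
# fetch results) -- then it must share that predecessor's instruction.

def dumb_schedule(ops):
    """ops: list of (name, pipeline_pred_index or None), in block order."""
    instrs = []
    placed = {}                  # op index -> instruction index
    for i, (name, pipe_pred) in enumerate(ops):
        if pipe_pred is not None:
            # Pipeline-tied ops join their predecessor's instruction.
            instrs[placed[pipe_pred]].append(name)
            placed[i] = placed[pipe_pred]
        else:
            instrs.append([name])
            placed[i] = len(instrs) - 1
    return instrs

ops = [("mov_cond", None), ("select", 0), ("add", None)]
print(dumb_schedule(ops))  # [['mov_cond', 'select'], ['add']]
```

A smarter instruction-merging pass could then run on top of this as a separate global optimization, instead of the handpicked cases in the current scheduler.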
<anarsoul> enunes: I'm not sure if it's possible to move 'movs' from node_to_instr to lowering since lowering isn't aware of instruction format and limitations
<enunes> anarsoul: thinking about insert_to_load_tex for example... if the scheduler just respected pipeline register dependencies, it would be possible to move all insert_to_load_tex into ppir_lower_texture
<enunes> I could try to write a patch with what I'm thinking on this, if you think it's helpful in the end
<enunes> and attempt to move the tex and select dependencies out of the scheduler into lowering
<anarsoul> enunes: see insert_to_each_succ_instr() it also creates extra moves
<anarsoul> enunes: well, if you think that it's easy then it'll definitely be helpful. Otherwise it makes sense to start a discussion on ML
<enunes> yeah insert_to_each_succ_instr is not the greatest function ever to follow... 2 of its users have a comment saying "save a reg" so possibly just optimization? the other is const, which indeed can be tricky
<anarsoul> const is not tricky at all
<anarsoul> we can stick up to 2 consts to any instruction
<anarsoul> I believe that it'd be better to treat consts differently from all other sources (ssa or regs)
<enunes> maybe it is possible to just do the const duplicating in lowering too as op nodes and it's not terrible?
<anarsoul> enunes: what about adding 3rd source type - const?
<enunes> so we don't have to "insert to each succ instr"
<anarsoul> we can convert ssas that are consts into const src in lowering
<anarsoul> we don't need a reg for consts
<enunes> doesn't sound bad, I just wonder if we only have ssa and reg in a way to mirror the nir src options
<anarsoul> technically it's costless optimization
<enunes> in many places we have to check for reg or ssa, or even "if !ssa", so having a 3rd might require some work everywhere
<anarsoul> now we compile something like gl_FragColor = vec4(1.0); into mov $1.xyzw, const0(1.0, 1.0, 1.0, 1.0); mov $0, $1
<anarsoul> because in nir it's:
<anarsoul> ssa1 = const vec4(1.0)
<anarsoul> store_output(ssa1)
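The const-as-a-source-type idea could work roughly like this (illustrative sketch; `Src` and `fold_consts` are hypothetical names, and the two-consts-per-instruction limit comes from anarsoul's remark above):

```python
# Sketch of folding constant ssa defs straight into their users'
# sources (up to two consts fit in a PP instruction), instead of
# emitting mov $1, const0(...); mov $0, $1 through a register.

class Src:
    def __init__(self, kind, value=None):
        self.kind = kind      # "ssa", "reg", or the proposed "const"
        self.value = value

def fold_consts(instr_srcs, const_defs, max_consts=2):
    """Rewrite ssa sources that refer to constants into const sources."""
    used = 0
    for src in instr_srcs:
        if src.kind == "ssa" and src.value in const_defs and used < max_consts:
            src.kind, src.value = "const", const_defs[src.value]
            used += 1
    return instr_srcs

# store_output(ssa1) with ssa1 = const vec4(1.0) collapses to a single
# instruction carrying an inline const.
srcs = fold_consts([Src("ssa", "ssa1")], {"ssa1": (1.0, 1.0, 1.0, 1.0)})
print(srcs[0].kind, srcs[0].value)
```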
<anarsoul> enunes: OK, then again - pipeline reg to load_const node that's cloned to each successor
<enunes> hmm what?
<anarsoul> enunes: basically the same what you proposed. Do not introduce another source type. In lower create a clone node for each successor and use pipeline reg
<anarsoul> it's a bit harder though since we'll be lowering not the const itself but all nodes that have a const in their src...
<anarsoul> might be easier to treat consts differently
<enunes> ah yes, that might work too, indeed I think consts are like pipeline regs
<enunes> I guess duplicate the const node to all of its users and make it like an op outputting to a const pipeline reg
<enunes> there might be corner cases, like what if we run out of const slots and need to insert the mov
<anarsoul> enunes: that's not possible with dumb scheduler
<enunes> but still we should try to do that in lowering, not scheduler
<anarsoul> i.e. one op per instruction
<anarsoul> there are no instructions with more than 2 sources
<enunes> well assuming we introduce the pipeline-aware scheduler, it will be more than 1 op per instruction in case there is a pipeline dependency, including consts
<anarsoul> right...
<anarsoul> hm
<anarsoul> OK
<anarsoul> I think solution is to combine lowering and node_to_instr
<anarsoul> or rather allocate instr with nodes from the very beginning
<anarsoul> during lowering some instructions will be merged
<anarsoul> i.e. consts will go to their users when possible
<anarsoul> const itself will be lowered to mov rX, const
<anarsoul> if it has no users it gets removed
<anarsoul> (we'll need to keep track of users)
<anarsoul> all in all it sounds like a rewrite of ppir compiler with preserving some of data structures :)
<anarsoul> basically most of lower.c node_to_instr.c and scheduler.c need to be rewritten
<anarsoul> nir.c is fine
<enunes> I'll try to do the smaller thing I mentioned to see if it is really easy or I hit some blocker, maybe we can do that in parts rather than a big rewrite
<anarsoul> instr.c probably also needs to be rewritten
<anarsoul> enunes: sure, sounds good
<anarsoul> enunes: I suspect that trying to treat nodes separately from instructions isn't a good solution
<anarsoul> most of the difficulties arise when we try to place nodes into instructions
<anarsoul> or rather from the fact that we have some code that is not instruction-aware
<enunes> yeah makes sense
<enunes> what I'm proposing also goes in the direction to make lowering kinda instruction-aware
<anarsoul> enunes: I think it's not enough just to have pipeline regs
<anarsoul> I'll try to summarize it all and send an email to ML tonight
<rellla> anarsoul: btw, gpu was wedged. it's not possible for me to run the whole piglit tests.
<anarsoul> rellla: :(
<anarsoul> I'll need to pinpoint later which test wedges it (likely it's the first failed test)
<rellla> i've got 436 passes, then it breaks, but i haven't looked in detail what the first one was.
<anarsoul> rellla: thanks for testing but it looks like we need some compiler refactoring before we can land control flow support
<rellla> :) i read backlog
<anarsoul> cwabbott: please merge your MR even without exp2/log2, it fixes some mysterious failures so it's definitely nice to have
<MoeIcenowy> anarsoul: how do we check whether the GPU is hung?
<MoeIcenowy> maybe we should add code to reset GPU in drm/lima
<anarsoul> MoeIcenowy: I don't know. Kernel driver doesn't complain, job still completes but result is wrong
<anarsoul> reset would be definitely useful