#panfrost on 2019-05-05 — irc logs at freenode.irclog.whitequark.org

2019-02-15 17:52 alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - https://gitlab.freedesktop.org/panfrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature

01:10 <HdkR> Playing with RA classes with the RA helper in mesa

01:12 <alyssa> HdkR: Good luck, tell me when you understand how it works :p

01:12 <alyssa> And then teach me

01:13 <HdkR> It's a bit annoying, but I really need it for things like ld_var_addr + st_vary

01:20 jolan has joined #panfrost

01:31 <robclark> ahh, interference classes and graph coloring ;-)

01:32 <HdkR> If this was LLVM then it would do the magic for me via tablegen :P

01:33 <alyssa> Be glad it's not LLBM

01:33 <alyssa> *LLVM

01:35 <robclark> does tablegen understand vec10's ;-)

01:35 <HdkR> With SVE being mainline, maybe indirectly now?

01:35 <HdkR> I haven't checked in a few months

01:36 <HdkR> It not natively supporting vec3 was a real annoyance

01:37 * robclark has vec1/vec2/vec3/vec4/vec8/vec10 plus conflicting half-reg classes in backend..

01:37 <alyssa> Show-off :P

01:37 <robclark> and iirc my old dusty branch for clover/opencl added vec16

01:37 <alyssa> When do I get fancy OpenCLfrost :P

01:37 <robclark> some variants of tex fetch need vec10

01:38 <HdkR> Yea, I understand the concern. This is coming from over a year of working on a backend to try and convince a company that an LLVM backend is a step forward towards documenting their shit publicly :P

01:38 <robclark> you probably don't want opencl yet.. it is compiler nightmare ;-)

01:38 <robclark> anyways, get gl compute shaders first.. that is baby-cl ;-)

01:39 <robclark> without having to deal with nonsense like vec16

01:40 <robclark> (in an age when everyone already moved to scalar isa's, khronos saw fit to add vec8 and vec16 to opencl... wtf)

01:41 <alyssa> robclark: When do I get fancy GL compute then? :P

01:43 <robclark> umm, exercise for the reader?

01:43 <robclark> :-P

01:44 <HdkR> There we go. I have vec2, vec3, and vec4 classes working correctly through the RA

01:45 <robclark> \o/

01:46 <HdkR> It's pretty easy once you realize what the utility wants for setting up register conflicts

01:48 vstehle has quit [Ping timeout: 250 seconds]

01:50 <robclark> HdkR: just in case it wasn't obvious, setup the conflicts once per screen and re-use

01:51 <robclark> (ie. don't do it per-shader-compile)

01:57 <HdkR> Definitely not obvious

01:57 <HdkR> Currently nothing in the Bifrost compiler side does any form of caching

01:59 <HdkR> I'll put it on a kanban to remember to do it in the future

02:00 <HdkR> Need to figure out what to do about the temp registers as well...

02:01 * robclark split ir3_ra so we construct the interference graph once for ir3_compiler, associated w/ the screen.. and re-use for every shader.. that is kinda the diff between ir3_compiler (global) and ir3_context (per-shader-compile-context)

02:03 <HdkR> Makes sense

02:07 <HdkR> https://trello.com/b/i7NTRoQA/panfrost There we go. I'll fill it out as I remember more :P

03:25 <HdkR> What do backends do with nir for destroying nir_op_vec4 and just having RA handle correct allocation in the destination register?

03:26 <HdkR> Since Bifrost wants to lower ALU to scalar but keep IO vector based

03:26 <HdkR> (Trying to find if a pass already exists for this or something)

03:27 <HdkR> Or is the right way to generate a no-op move that gets RA'd correctly and have a pass that eliminates it later?

03:28 * robclark generates mov and cleans it up in backend

03:28 <HdkR> I see

03:28 <robclark> (although tbf I have to do same things w/ tex instructions and various other cases)

03:29 <HdkR> Yea, many scalar architectures will need to do it

03:30 <HdkR> I guess I'll also just support generating a vector construction op that gets destroyed after RA

03:32 * robclark has collect/fanin and split/fanout meta instructions for that sorta thing.. and the cp pass that tries to reduce mov's (and propagate load_immed/uniform).. since even mov's need delay slots before they can be consumed

03:32 <HdkR> oof

03:33 <robclark> and then ra maps those to register classes to try to put things in the right place

03:34 <robclark> because sampler instructions for 3d tex w/ explict lod and offsets and whatever not work out to needing 10 consecutive scalar regs

03:34 <robclark> (and then once you solve that shader inputs/outputs are already solved)

03:34 <HdkR> Right

03:36 <robclark> nouveau has something similar iirc.. I originally called the meta instructions fanin/fanout but then liked nouveau naming of collect/split better.. so the names are used semi-interchangably in ir3 code

03:37 <HdkR> Yea, Nouveau will hit the exact same issues, with all the things

03:42 <robclark> iirc i965 backend has similar.. which they solve w/ mov's and meta "payload" instruction, but not sure how they get rid of the mov's... or maybe mov's are not as constly because result immediately available?

03:42 <robclark> ir3 isa just happily lets you read a src register before the result you are waiting for is avail.. which is fun

03:43 <robclark> nice and fast when you get it right, super confusing before you figure out what is going wrong

03:43 <HdkR> Welcome to Nvidia scheduling hell :P

03:43 <robclark> heheh, iirc ir3 was like that before nv moved to that approach

03:44 <robclark> (or well, it was like that since a3xx, I thought it was more recent gens for nv when they went that route)

03:44 <HdkR> Wow, Freedreno has been around since pre-maxwell days? Time sure flies

03:45 <robclark> yeah

03:46 <HdkR> Seems like yesterday that I was complaining it doesn't run Dolphin :P

03:47 <robclark> really you shouldn't talk about dolphin until a driver is a few years old :-P

03:50 <HdkR> Just giving ya a hard time ;)

03:53 <robclark> it brings back memories of lotsa compiler head scratching ;-)

03:54 <robclark> but yeah, going from a2xx to something maxwell-like was.. interesting

03:55 <HdkR> It's a heck of a jump in a single generation

03:55 <robclark> and super confusing until you realize what is going on

03:56 <HdkR> I bet

03:58 <robclark> fun when you realize certain groups of instructions don't even consume their src regs immediately, so you have write-after-read hazzards

03:59 <robclark> or that certain groups of instructions don't count for alu delay slot accounting

03:59 <HdkR> Yea, scheduling can become a nasty mess

03:59 <robclark> because they are split off into parallel pipe at decode stage

03:59 <robclark> yeah

04:00 <HdkR> I was lucky enough to previously have had a spreadsheet that showed exactly latencies and a manual talking about scheduling hazards :P

04:01 <robclark> the sched part (or at least figuring out how far apart instructions need to be) isn't too hard once you realize what is going on

04:02 <robclark> although best way to maximize parallelism w/out blowing up register usage (within threshold of what bumps you to lower level of warps in flight) is still an open science project

04:03 <HdkR> perfect RA to occupancy ratio is a hard problem

04:03 <HdkR> Similar to knowing when spilling is more effective than using a few more registers because of it

04:04 <robclark> I think in general, post sched, we can know max # of live vals to feed into RA stage... and even to decide to spill before RA

04:04 <robclark> the problem is sched before RA vs sched after RA

04:04 <robclark> so I *think* we need to do both

04:06 <HdkR> Truely you need to do both twice so the first pass will inform the second pass about what it should do

04:06 <robclark> (ie. pre RA sched to reduce live vals, possible directed by current # of live vals, and then post RA sched to try to extract as much remaining parallelism as we can get)

04:06 <HdkR> Since they feed in to each other, but that's time consuming

04:06 <HdkR> More suited towards offline compilation sort of thing

04:07 <robclark> I've looked a bit at sethi-hulman (sp?) numbering to try to inform scheduling with mixed results..

04:08 <robclark> but yeah, doing this well in a sane O() amount of time is hard

04:10 <robclark> (and it would be nice if I didn't have to sci-hub all the papers on the topic :-/)

04:10 <HdkR> yea :|

04:21 <HdkR> Alright, working for output IO, now to do it for input IO which will be easy

04:25 <robclark> :-)

04:31 <robclark> HdkR: btw, fun thing on newer gen's is combining ssbo/image atomic input reg with output.. so you have a "src" reg.xy (or .xyz for compexchange) where .x is dst and y/yz is src.. that is super-fun for RA

04:32 <robclark> (and so far I solved it same way as blob, w/ a falsedep and mov)

04:32 <HdkR> haha. What a rude constraint

04:32 <robclark> yeah, srsl

04:33 <HdkR> I guess this is due to isa encoding limitations?

04:33 <robclark> tbh, I have no clue why they did that.. I know why they changed image/ssbo instructions but there were enough bits

04:34 <robclark> they seem to want to be semi-backwards compatible..

04:34 <HdkR> interesting

04:34 <robclark> I assume because of hand written shaders for camera and 2d blit lib

04:35 <robclark> I guess this wouldn't matter for 2d blit lib, but maybe for camera compute shader fancyness?

04:36 <HdkR> You'd think that even with a hand written assembler then it still wouldn't matter

04:37 <HdkR> You could catch a meta op in the backend regardless

04:37 <HdkR> 32bit ARM does it all the time

04:38 <robclark> well, if the shader is binary that isn't going to work.. which I assume is the case

04:39 <robclark> if they had some asm the shader was written in that compiled to binary you could

04:39 <robclark> but for a gpu you aren't going to solve those compat problem with extra transistors

04:39 <HdkR> Nobody should ever bitbang out a shader :P

04:40 <robclark> yeah, you shouldn't.. I assume they thought about that after the fact ;-)

04:41 <robclark> anyways, that is my speculation.. I can't imagine why else they would do that

05:00 vstehle has joined #panfrost

05:11 herbmillerjr has quit [Ping timeout: 255 seconds]

05:57 <HdkR> Oh, hm

05:58 <HdkR> Need to declare conflicts between subclasses even if they already have a stated conflict on the base class they derive from?

05:59 <HdkR> vec3 and vec4 both conflict on scalar, but not each other. I just assumed it would work since the conflict with the base should alias?

06:12 <HdkR> Hm. RA failure, I can only assume due to the different register classes

06:13 <HdkR> Oh wait, no. Just derp

06:14 <HdkR> fragment storing I don't have complete yet :P

06:22 <HdkR> Interesting. fragment store location uses a 64bit pointer that points to a 128bit descriptor?

06:23 <HdkR> 64bit index in to a 128bit descriptor table?

06:23 <HdkR> Just threw an MRT at the blob to make sure I was seeing things correctly

06:35 herbmillerjr has joined #panfrost

06:48 <HdkR> https://paste.fedoraproject.org/paste/esohVgyHSq46-d~rxNh48w Isn't showing some literal arguments in the IR. but that is workable

06:49 <HdkR> Explicit movi for the 0,0 target location in the fragment while the blob loads that in to uniforms but w/e

06:50 * HdkR just noticed destination printing is a derp for vectors

06:56 <HdkR> Oh, I'm a derp. Forgot that the STORE would store in to a vec4 sized location because the blend shader still needs to run

06:58 <HdkR> I still like that blend shaders get their registers prepopulated with the incoming colour

07:01 <HdkR> more work tomorrow on this.

07:03 <HdkR> Lyude: I presume you'll want a branch so you can do bundling things? Or are you still wanting to go with job handling bits?

07:10 <Lyude> HdkR: I'm fine with bundling things

07:11 <Lyude> Don't want to keep you waiting if you've got stuff to do!

07:12 <HdkR> Bundling doesn't really restrict what I'm working on currently but I'm not not doing MIR->Bin or scheduling

07:14 <HdkR> It'll matter more in a few weeks if I want to verify working code ;P

07:14 <Lyude> the scheduling bits you mean?

07:15 <HdkR> scheduling, bundling, emitting bin. all the stuff I put off and didn't do when I decided to rewrite quite a bit :D

07:17 <Lyude> Ah

07:18 <HdkR> It's in a state that you can compile...exactly that shader pair

07:18 <HdkR> So in a sane state again

07:19 <HdkR> Which means simultaneous work can occur again without us stepping on each other

07:19 <HdkR> So I can push a wip branch tomorrow

07:20 <Lyude> Alright

07:20 <Lyude> (sorry, very tired ATM lol)

07:20 <HdkR> no problem

07:20 <HdkR> It's only 3am where you are :P

08:40 stikonas has joined #panfrost

08:51 stikonas_ has joined #panfrost

08:52 stikonas has quit [Ping timeout: 245 seconds]

09:22 cwabbott_ has joined #panfrost

09:25 cwabbott has quit [Ping timeout: 245 seconds]

09:25 cwabbott_ is now known as cwabbott

09:33 Elpaulo has quit [Quit: Elpaulo]

09:44 stikonas_ has quit [Remote host closed the connection]

09:50 stikonas has joined #panfrost

09:56 cwabbott has quit [Ping timeout: 252 seconds]

10:04 stikonas has quit [Remote host closed the connection]

10:22 cwabbott has joined #panfrost

10:30 herbmillerjr has quit [Ping timeout: 268 seconds]

12:52 raster has joined #panfrost

13:13 cwabbott has quit [Ping timeout: 248 seconds]

13:15 cwabbott has joined #panfrost

13:42 herbmillerjr has joined #panfrost

13:53 Elpaulo has joined #panfrost

13:59 <alyssa> HdkR: Granted, you can ignore blend shaders for a while...

13:59 <alyssa> Fixed-function blending is absolutely a thing and works in 90% of real-world cases (and 10% of dEQP cases :P)

14:10 <alyssa> robclark: Midgard shaders are weirdly backwards compatible, so that's nice.

14:10 <alyssa> Saved me a ton of work rewriting the compiler

14:18 <alyssa> Making a ton of progress on blend shaders, fwiw

14:38 stikonas has joined #panfrost

14:56 herbmillerjr has quit [Ping timeout: 244 seconds]

16:40 cwabbott has quit [Ping timeout: 252 seconds]

16:43 cwabbott has joined #panfrost

16:44 <HdkR> alyssa: :)

17:42 <hanetzer> reminding myself, currently panfrost is not really x11 capable right?

17:50 <alyssa> Not yet

17:52 cwabbott has quit [Ping timeout: 248 seconds]

18:08 cwabbott has joined #panfrost

18:09 stikonas has quit [Remote host closed the connection]

18:33 stikonas has joined #panfrost

18:37 belgin has joined #panfrost

18:42 <hanetzer> but say, x11 app inside of wayland wm should work, maybe?

18:54 belgin has quit [Quit: Leaving]

19:14 adjtm_ has joined #panfrost

19:17 adjtm has quit [Ping timeout: 246 seconds]

19:39 cwabbott has quit [Ping timeout: 264 seconds]

19:52 cwabbott has joined #panfrost

19:53 belgin has joined #panfrost

19:55 belgin has quit [Client Quit]

20:14 <alyssa> hanetzer: Depends on the app

20:15 <hanetzer> true, but that's even true on 'finished' drivers :)

20:33 Elpaulo has quit [Quit: Elpaulo]

20:34 <HdkR> Oh look at that, a previously unseen ftrunc op

20:35 <HdkR> Supported in both FMA and ADD pipes

20:49 <HdkR> Interesting. Different ops for round versus roundeven

21:07 stikonas has quit [Remote host closed the connection]

21:16 cwabbott has quit [Ping timeout: 252 seconds]

21:17 <alyssa> HdkR: Same as Midgard.

21:18 <HdkR> So silly :P

21:18 <HdkR> alyssa: Did you find the difference between the two ops on midgard?

21:22 cwabbott has joined #panfrost

21:24 unoccupied has quit [Quit: WeeChat 2.4]

21:25 <alyssa> HdkR: Who cares? :)

21:26 <HdkR> well, sure

21:29 <HdkR> But that doesn't stop me from being curious

21:30 <HdkR> Would need to slap around spir-v a bit with rounding modes to find out I guess

21:30 <HdkR> Or OpenCL

21:46 <HdkR> Kanban slowly filling out now :D

22:07 <alyssa> HdkR: robclark: fragment_ops.blend.* passing 1060/1060!

22:07 <alyssa> (Forcing blend shaders unconditionally. No fixed-function allowed right now since I didn't want to distract myself.)

22:08 <alyssa> I will concede I've written a disgusting amount of code in the last 30 hours.

22:10 <alyssa> Now, let's reenable fixed-function and rerun dEQP blend

22:12 <alyssa> Argh that code path is deeply broken

22:16 cwabbott has quit [Ping timeout: 258 seconds]

22:21 <alyssa> Granted the fixed function code is going to be a pain to debug

22:22 raster has quit [Read error: Connection reset by peer]

22:38 <alyssa> It's a closed routine but still wat

22:40 <alyssa> Huh.

22:41 <alyssa> Good news is that I do have full debugging infra for fixed function, since I had to deal with this last time..

22:45 <alyssa> DDid some debugging and we're failing 21 fixed-function tests right now

22:45 <alyssa> So 2% failing for blend. Shouldn't be too hard to figure out.

23:10 cwabbott has joined #panfrost

23:12 <HdkR> alyssa: woop woop