alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - https://gitlab.freedesktop.org/panfrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
<HdkR> Playing with RA classes with the RA helper in mesa
<alyssa> HdkR: Good luck, tell me when you understand how it works :p
<alyssa> And then teach me
<HdkR> It's a bit annoying, but I really need it for things like ld_var_addr + st_vary
jolan has joined #panfrost
<robclark> ahh, interference classes and graph coloring ;-)
<HdkR> If this was LLVM then it would do the magic for me via tablegen :P
<alyssa> Be glad it's not LLBM
<alyssa> *LLVM
<robclark> does tablegen understand vec10's ;-)
<HdkR> With SVE being mainline, maybe indirectly now?
<HdkR> I haven't checked in a few months
<HdkR> It not natively supporting vec3 was a real annoyance
* robclark has vec1/vec2/vec3/vec4/vec8/vec10 plus conflicting half-reg classes in backend..
<alyssa> Show-off :P
<robclark> and iirc my old dusty branch for clover/opencl added vec16
<alyssa> When do I get fancy OpenCLfrost :P
<robclark> some variants of tex fetch need vec10
<HdkR> Yea, I understand the concern. This is coming from over a year of working on a backend to try and convince a company that an LLVM backend is a step forward towards documenting their shit publicly :P
<robclark> you probably don't want opencl yet.. it is compiler nightmare ;-)
<robclark> anyways, get gl compute shaders first.. that is baby-cl ;-)
<robclark> without having to deal with nonsense like vec16
<robclark> (in an age when everyone already moved to scalar isa's, khronos saw fit to add vec8 and vec16 to opencl... wtf)
<alyssa> robclark: When do I get fancy GL compute then? :P
<robclark> umm, exercise for the reader?
<robclark> :-P
<HdkR> There we go. I have vec2, vec3, and vec4 classes working correctly through the RA
<robclark> \o/
<HdkR> It's pretty easy once you realize what the utility wants for setting up register conflicts
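[Editor's sketch] The conflict setup HdkR mentions can be pictured like this: every vecN register covers N consecutive scalar slots, and two registers (from any classes) conflict when their scalar ranges overlap. This is only an illustration under that assumption. It is not the Mesa RA helper's API, and `NUM_SCALARS`, `build_classes`, `conflicts` and `conflict_table` are hypothetical names.

```python
from itertools import product

NUM_SCALARS = 8  # hypothetical register file size, for illustration only


def build_classes(sizes, num_scalars=NUM_SCALARS):
    """Map each class size to the scalar ranges its registers cover."""
    classes = {}
    for size in sizes:
        classes[size] = [range(base, base + size)
                         for base in range(0, num_scalars - size + 1)]
    return classes


def conflicts(a, b):
    """Two registers conflict when their scalar ranges overlap."""
    return a.start < b.stop and b.start < a.stop


def conflict_table(classes):
    """Return the set of ((size, index), (size, index)) pairs that conflict."""
    table = set()
    for (sa, regs_a), (sb, regs_b) in product(classes.items(), repeat=2):
        for ia, ra in enumerate(regs_a):
            for ib, rb in enumerate(regs_b):
                if conflicts(ra, rb):
                    table.add(((sa, ia), (sb, ib)))
    return table
```

In this model a vec3 and a vec4 register conflict with each other directly whenever they share a scalar slot, rather than relying on their separate conflicts with the scalar class.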
vstehle has quit [Ping timeout: 250 seconds]
<robclark> HdkR: just in case it wasn't obvious, set up the conflicts once per screen and re-use
<robclark> (ie. don't do it per-shader-compile)
<HdkR> Definitely not obvious
<HdkR> Currently nothing in the Bifrost compiler side does any form of caching
<HdkR> I'll put it on a kanban to remember to do it in the future
<HdkR> Need to figure out what to do about the temp registers as well...
* robclark split ir3_ra so we construct the interference graph once for ir3_compiler, associated w/ the screen.. and re-use for every shader.. that is kinda the diff between ir3_compiler (global) and ir3_context (per-shader-compile-context)
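[Editor's sketch] The split robclark describes, long-lived per-screen state versus per-shader-compile state, might look roughly like the following. It assumes the expensive part is building the conflict data. `Compiler`, `ShaderContext` and the placeholder `_build_reg_set` are hypothetical stand-ins, not the actual ir3 structures.

```python
class Compiler:
    """Long-lived, per-screen state (hypothetical stand-in for a global compiler object)."""

    def __init__(self):
        self.builds = 0       # counts how often the expensive setup actually ran
        self._reg_set = None

    def reg_set(self):
        # Build the conflict data lazily, exactly once, then hand out the cached copy.
        if self._reg_set is None:
            self.builds += 1
            self._reg_set = self._build_reg_set()
        return self._reg_set

    def _build_reg_set(self):
        # Placeholder for the expensive conflict construction.
        return frozenset((a, b) for a in range(16) for b in range(16)
                         if abs(a - b) < 4)


class ShaderContext:
    """Short-lived, per-shader-compile state that borrows the shared register set."""

    def __init__(self, compiler, name):
        self.compiler = compiler
        self.name = name

    def compile(self):
        regs = self.compiler.reg_set()  # reused, never rebuilt per shader
        return (self.name, len(regs))
```

Compiling several shaders against one `Compiler` leaves `builds` at 1, which is the point of keeping the register set on the screen-level object.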
<HdkR> Makes sense
<HdkR> https://trello.com/b/i7NTRoQA/panfrost There we go. I'll fill it out as I remember more :P
<HdkR> What do backends do with nir for destroying nir_op_vec4 and just having RA handle correct allocation in the destination register?
<HdkR> Since Bifrost wants to lower ALU to scalar but keep IO vector based
<HdkR> (Trying to find if a pass already exists for this or something)
<HdkR> Or is the right way to generate a no-op move that gets RA'd correctly and have a pass that eliminates it later?
* robclark generates mov and cleans it up in backend
<HdkR> I see
<robclark> (although tbf I have to do same things w/ tex instructions and various other cases)
<HdkR> Yea, many scalar architectures will need to do it
<HdkR> I guess I'll also just support generating a vector construction op that gets destroyed after RA
* robclark has collect/fanin and split/fanout meta instructions for that sorta thing.. and the cp pass that tries to reduce mov's (and propagate load_immed/uniform).. since even mov's need delay slots before they can be consumed
<HdkR> oof
<robclark> and then ra maps those to register classes to try to put things in the right place
<robclark> because sampler instructions for 3d tex w/ explicit lod and offsets and whatnot work out to needing 10 consecutive scalar regs
<robclark> (and then once you solve that shader inputs/outputs are already solved)
<HdkR> Right
<robclark> nouveau has something similar iirc.. I originally called the meta instructions fanin/fanout but then liked nouveau naming of collect/split better.. so the names are used semi-interchangeably in ir3 code
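[Editor's sketch] The "generate a mov and clean it up later" approach can be illustrated as follows. After RA, a collect of scalars into a vector becomes one mov per component, and any mov whose source already sits in the right register disappears. This is an assumption-laden sketch: `lower_collect` is a hypothetical helper, and it ignores ordering hazards between the remaining movs (the parallel-copy problem).

```python
def lower_collect(dst_base, src_regs):
    """Lower a collect of scalars into movs targeting consecutive registers.

    dst_base: first register of the vector chosen by RA.
    src_regs: register assigned to each scalar source, in component order.
    Returns the (dst, src) movs that are still needed.
    """
    movs = []
    for component, src in enumerate(src_regs):
        dst = dst_base + component
        if dst != src:
            # Source is not already in place, so a real mov is required.
            movs.append((dst, src))
    return movs
```

If RA manages to place every source in its final slot, the collect costs nothing; each misplaced component costs one mov.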
<HdkR> Yea, Nouveau will hit the exact same issues, with all the things
<robclark> iirc i965 backend has similar.. which they solve w/ mov's and meta "payload" instruction, but not sure how they get rid of the mov's... or maybe mov's are not as costly because result immediately available?
<robclark> ir3 isa just happily lets you read a src register before the result you are waiting for is avail.. which is fun
<robclark> nice and fast when you get it right, super confusing before you figure out what is going wrong
<HdkR> Welcome to Nvidia scheduling hell :P
<robclark> heheh, iirc ir3 was like that before nv moved to that approach
<robclark> (or well, it was like that since a3xx, I thought it was more recent gens for nv when they went that route)
<HdkR> Wow, Freedreno has been around since pre-maxwell days? Time sure flies
<robclark> yeah
<HdkR> Seems like yesterday that I was complaining it doesn't run Dolphin :P
<robclark> really you shouldn't talk about dolphin until a driver is a few years old :-P
<HdkR> Just giving ya a hard time ;)
<robclark> it brings back memories of lotsa compiler head scratching ;-)
<robclark> but yeah, going from a2xx to something maxwell-like was.. interesting
<HdkR> It's a heck of a jump in a single generation
<robclark> and super confusing until you realize what is going on
<HdkR> I bet
<robclark> fun when you realize certain groups of instructions don't even consume their src regs immediately, so you have write-after-read hazards
<robclark> or that certain groups of instructions don't count for alu delay slot accounting
<HdkR> Yea, scheduling can become a nasty mess
<robclark> because they are split off into parallel pipe at decode stage
<robclark> yeah
<HdkR> I was lucky enough to previously have had a spreadsheet that showed exactly latencies and a manual talking about scheduling hazards :P
<robclark> the sched part (or at least figuring out how far apart instructions need to be) isn't too hard once you realize what is going on
<robclark> although best way to maximize parallelism w/out blowing up register usage (within threshold of what bumps you to lower level of warps in flight) is still an open science project
<HdkR> perfect RA to occupancy ratio is a hard problem
<HdkR> Similar to knowing when spilling is more effective than using a few more registers because of it
<robclark> I think in general, post sched, we can know max # of live vals to feed into RA stage... and even to decide to spill before RA
<robclark> the problem is sched before RA vs sched after RA
<robclark> so I *think* we need to do both
<HdkR> Truly you need to do both twice so the first pass will inform the second pass about what it should do
<robclark> (ie. pre RA sched to reduce live vals, possibly directed by current # of live vals, and then post RA sched to try to extract as much remaining parallelism as we can get)
<HdkR> Since they feed in to each other, but that's time consuming
<HdkR> More suited towards offline compilation sort of thing
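[Editor's sketch] The "max # of live vals" figure that would feed RA or a spill decision can be computed over a straight-line schedule as below. It assumes SSA-style values, each defined once before use, with no control flow. `max_live_values` and the tuple encoding of instructions are hypothetical.

```python
def max_live_values(instrs):
    """instrs: list of (dest, [sources]) in schedule order; dest may be None.

    Returns the peak number of simultaneously live values.
    """
    # Record the index of the last instruction that reads each value.
    last_use = {}
    for index, (_, srcs) in enumerate(instrs):
        for src in srcs:
            last_use[src] = index

    live = set()
    peak = 0
    for index, (dest, srcs) in enumerate(instrs):
        # Sources read for the last time here die at this instruction.
        for src in srcs:
            if last_use[src] == index:
                live.discard(src)
        # A result only becomes live if something reads it later.
        if dest is not None and dest in last_use:
            live.add(dest)
        peak = max(peak, len(live))
    return peak


# Same computation, two schedules: all loads first vs. loads interleaved with use.
loads_first = [("a", []), ("b", []), ("c", []), ("d", []),
               ("e", ["a", "b"]), ("f", ["c", "d"]), ("g", ["e", "f"])]
interleaved = [("a", []), ("b", []), ("e", ["a", "b"]),
               ("c", []), ("d", []), ("f", ["c", "d"]), ("g", ["e", "f"])]
```

The loads-first order exposes more parallelism but peaks at four live values, while the interleaved order peaks at three, which is the tension between pre-RA and post-RA scheduling discussed above.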
<robclark> I've looked a bit at Sethi-Ullman numbering to try to inform scheduling with mixed results..
<robclark> but yeah, doing this well in a sane O() amount of time is hard
<robclark> (and it would be nice if I didn't have to sci-hub all the papers on the topic :-/)
<HdkR> yea :|
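[Editor's sketch] For reference, the textbook form of Sethi-Ullman numbering on a binary expression tree is short. A leaf needs one register, and an interior node needs one more than its children if they tie, otherwise the larger of the two. This is the classic formulation, not whatever variant was tried in ir3, and the tuple encoding of the tree is an assumption.

```python
def sethi_ullman(node):
    """node: a leaf name (str) or a tuple (op, left, right).

    Returns the number of registers needed to evaluate the tree
    without spilling, assuming every leaf must be loaded into a register.
    """
    if isinstance(node, str):
        return 1
    _, left, right = node
    l, r = sethi_ullman(left), sethi_ullman(right)
    # Equal needs force one extra register to hold the first result
    # while the second subtree is evaluated.
    return l + 1 if l == r else max(l, r)
```

Evaluating the child with the larger number first is what keeps register use at that bound. A balanced tree over four leaves needs three registers, while a left-leaning chain over the same leaves needs two.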
<HdkR> Alright, working for output IO, now to do it for input IO which will be easy
<robclark> :-)
<robclark> HdkR: btw, fun thing on newer gen's is combining ssbo/image atomic input reg with output.. so you have a "src" reg.xy (or .xyz for compexchange) where .x is dst and y/yz is src.. that is super-fun for RA
<robclark> (and so far I solved it same way as blob, w/ a falsedep and mov)
<HdkR> haha. What a rude constraint
<robclark> yeah, srsl
<HdkR> I guess this is due to isa encoding limitations?
<robclark> tbh, I have no clue why they did that.. I know why they changed image/ssbo instructions but there were enough bits
<robclark> they seem to want to be semi-backwards compatible..
<HdkR> interesting
<robclark> I assume because of hand written shaders for camera and 2d blit lib
<robclark> I guess this wouldn't matter for 2d blit lib, but maybe for camera compute shader fancyness?
<HdkR> You'd think that even with a hand written assembler then it still wouldn't matter
<HdkR> You could catch a meta op in the backend regardless
<HdkR> 32bit ARM does it all the time
<robclark> well, if the shader is binary that isn't going to work.. which I assume is the case
<robclark> if they had some asm the shader was written in that compiled to binary you could
<robclark> but for a gpu you aren't going to solve those compat problems with extra transistors
<HdkR> Nobody should ever bitbang out a shader :P
<robclark> yeah, you shouldn't.. I assume they thought about that after the fact ;-)
<robclark> anyways, that is my speculation.. I can't imagine why else they would do that
vstehle has joined #panfrost
herbmillerjr has quit [Ping timeout: 255 seconds]
<HdkR> Oh, hm
<HdkR> Need to declare conflicts between subclasses even if they already have a stated conflict on the base class they derive from?
<HdkR> vec3 and vec4 both conflict on scalar, but not each other. I just assumed it would work since the conflict with the base should alias?
<HdkR> Hm. RA failure, I can only assume due to the different register classes
<HdkR> Oh wait, no. Just derp
<HdkR> fragment storing I don't have complete yet :P
<HdkR> Interesting. fragment store location uses a 64bit pointer that points to a 128bit descriptor?
<HdkR> 64bit index in to a 128bit descriptor table?
<HdkR> Just threw an MRT at the blob to make sure I was seeing things correctly
herbmillerjr has joined #panfrost
<HdkR> https://paste.fedoraproject.org/paste/esohVgyHSq46-d~rxNh48w Isn't showing some literal arguments in the IR. but that is workable
<HdkR> Explicit movi for the 0,0 target location in the fragment while the blob loads that in to uniforms but w/e
* HdkR just noticed destination printing is a derp for vectors
<HdkR> Oh, I'm a derp. Forgot that the STORE would store in to a vec4 sized location because the blend shader still needs to run
<HdkR> I still like that blend shaders get their registers prepopulated with the incoming colour
<HdkR> more work tomorrow on this.
<HdkR> Lyude: I presume you'll want a branch so you can do bundling things? Or are you still wanting to go with job handling bits?
<Lyude> HdkR: I'm fine with bundling things
<Lyude> Don't want to keep you waiting if you've got stuff to do!
<HdkR> Bundling doesn't really restrict what I'm working on currently but I'm not doing MIR->Bin or scheduling
<HdkR> It'll matter more in a few weeks if I want to verify working code ;P
<Lyude> the scheduling bits you mean?
<HdkR> scheduling, bundling, emitting bin. all the stuff I put off and didn't do when I decided to rewrite quite a bit :D
<Lyude> Ah
<HdkR> It's in a state that you can compile...exactly that shader pair
<HdkR> So in a sane state again
<HdkR> Which means simultaneous work can occur again without us stepping on each other
<HdkR> So I can push a wip branch tomorrow
<Lyude> Alright
<Lyude> (sorry, very tired ATM lol)
<HdkR> no problem
<HdkR> It's only 3am where you are :P
stikonas has joined #panfrost
stikonas_ has joined #panfrost
stikonas has quit [Ping timeout: 245 seconds]
cwabbott_ has joined #panfrost
cwabbott has quit [Ping timeout: 245 seconds]
cwabbott_ is now known as cwabbott
Elpaulo has quit [Quit: Elpaulo]
stikonas_ has quit [Remote host closed the connection]
stikonas has joined #panfrost
cwabbott has quit [Ping timeout: 252 seconds]
stikonas has quit [Remote host closed the connection]
cwabbott has joined #panfrost
herbmillerjr has quit [Ping timeout: 268 seconds]
raster has joined #panfrost
cwabbott has quit [Ping timeout: 248 seconds]
cwabbott has joined #panfrost
herbmillerjr has joined #panfrost
Elpaulo has joined #panfrost
<alyssa> HdkR: Granted, you can ignore blend shaders for a while...
<alyssa> Fixed-function blending is absolutely a thing and works in 90% of real-world cases (and 10% of dEQP cases :P)
<alyssa> robclark: Midgard shaders are weirdly backwards compatible, so that's nice.
<alyssa> Saved me a ton of work rewriting the compiler
<alyssa> Making a ton of progress on blend shaders, fwiw
stikonas has joined #panfrost
herbmillerjr has quit [Ping timeout: 244 seconds]
cwabbott has quit [Ping timeout: 252 seconds]
cwabbott has joined #panfrost
<HdkR> alyssa: :)
<hanetzer> reminding myself, currently panfrost is not really x11 capable right?
<alyssa> Not yet
cwabbott has quit [Ping timeout: 248 seconds]
cwabbott has joined #panfrost
stikonas has quit [Remote host closed the connection]
stikonas has joined #panfrost
belgin has joined #panfrost
<hanetzer> but say, x11 app inside of wayland wm should work, maybe?
belgin has quit [Quit: Leaving]
adjtm_ has joined #panfrost
adjtm has quit [Ping timeout: 246 seconds]
cwabbott has quit [Ping timeout: 264 seconds]
cwabbott has joined #panfrost
belgin has joined #panfrost
belgin has quit [Client Quit]
<alyssa> hanetzer: Depends on the app
<hanetzer> true, but that's even true on 'finished' drivers :)
Elpaulo has quit [Quit: Elpaulo]
<HdkR> Oh look at that, a previously unseen ftrunc op
<HdkR> Supported in both FMA and ADD pipes
<HdkR> Interesting. Different ops for round versus roundeven
stikonas has quit [Remote host closed the connection]
cwabbott has quit [Ping timeout: 252 seconds]
<alyssa> HdkR: Same as Midgard.
<HdkR> So silly :P
<HdkR> alyssa: Did you find the difference between the two ops on midgard?
cwabbott has joined #panfrost
unoccupied has quit [Quit: WeeChat 2.4]
<alyssa> HdkR: Who cares? :)
<HdkR> well, sure
<HdkR> But that doesn't stop me from being curious
<HdkR> Would need to slap around spir-v a bit with rounding modes to find out I guess
<HdkR> Or OpenCL
<HdkR> Kanban slowly filling out now :D
<alyssa> HdkR: robclark: fragment_ops.blend.* passing 1060/1060!
<alyssa> (Forcing blend shaders unconditionally. No fixed-function allowed right now since I didn't want to distract myself.)
<alyssa> I will concede I've written a disgusting amount of code in the last 30 hours.
<alyssa> Now, let's reenable fixed-function and rerun dEQP blend
<alyssa> Argh that code path is deeply broken
cwabbott has quit [Ping timeout: 258 seconds]
<alyssa> Granted the fixed function code is going to be a pain to debug
raster has quit [Read error: Connection reset by peer]
<alyssa> It's a closed routine but still wat
<alyssa> Huh.
<alyssa> Good news is that I do have full debugging infra for fixed function, since I had to deal with this last time..
<alyssa> Did some debugging and we're failing 21 fixed-function tests right now
<alyssa> So 2% failing for blend. Shouldn't be too hard to figure out.
cwabbott has joined #panfrost
<HdkR> alyssa: woop woop