<HdkR>
Playing with RA classes with the RA helper in mesa
<alyssa>
HdkR: Good luck, tell me when you understand how it works :p
<alyssa>
And then teach me
<HdkR>
It's a bit annoying, but I really need it for things like ld_var_addr + st_vary
jolan has joined #panfrost
<robclark>
ahh, interference classes and graph coloring ;-)
<HdkR>
If this was LLVM then it would do the magic for me via tablegen :P
<alyssa>
Be glad it's not LLBM
<alyssa>
*LLVM
<robclark>
does tablegen understand vec10's ;-)
<HdkR>
With SVE being mainline, maybe indirectly now?
<HdkR>
I haven't checked in a few months
<HdkR>
It not natively supporting vec3 was a real annoyance
* robclark
has vec1/vec2/vec3/vec4/vec8/vec10 plus conflicting half-reg classes in backend..
<alyssa>
Show-off :P
<robclark>
and iirc my old dusty branch for clover/opencl added vec16
<alyssa>
When do I get fancy OpenCLfrost :P
<robclark>
some variants of tex fetch need vec10
<HdkR>
Yea, I understand the concern. This is coming from over a year of working on a backend, trying to convince a company that an LLVM backend is a step towards documenting their shit publicly :P
<robclark>
you probably don't want opencl yet.. it is compiler nightmare ;-)
<robclark>
anyways, get gl compute shaders first.. that is baby-cl ;-)
<robclark>
without having to deal with nonsense like vec16
<robclark>
(in an age when everyone already moved to scalar isa's, khronos saw fit to add vec8 and vec16 to opencl... wtf)
<alyssa>
robclark: When do I get fancy GL compute then? :P
<robclark>
umm, exercise for the reader?
<robclark>
:-P
<HdkR>
There we go. I have vec2, vec3, and vec4 classes working correctly through the RA
<robclark>
\o/
<HdkR>
It's pretty easy once you realize what the utility wants for setting up register conflicts
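(A rough sketch of what that setup looks like with Mesa's util/register_allocate helper, for a scalar register file with vec2/vec3/vec4 classes layered on top. It assumes the API of roughly this era, where classes are plain unsigned indices; exact signatures differ between Mesa versions, and NUM_SCALAR_REGS / make_reg_set are made-up names.)

    #include "util/register_allocate.h"

    #define NUM_SCALAR_REGS 64

    static struct ra_regs *
    make_reg_set(void *mem_ctx)
    {
       /* One RA register per scalar reg, plus one per possible vec2/vec3/vec4
        * base position. */
       unsigned total = NUM_SCALAR_REGS +
                        (NUM_SCALAR_REGS - 1) +   /* vec2 bases */
                        (NUM_SCALAR_REGS - 2) +   /* vec3 bases */
                        (NUM_SCALAR_REGS - 3);    /* vec4 bases */
       struct ra_regs *regs = ra_alloc_reg_set(mem_ctx, total, true);

       unsigned scalar_class = ra_alloc_reg_class(regs);
       unsigned vec_class[3] = {
          ra_alloc_reg_class(regs),   /* vec2 */
          ra_alloc_reg_class(regs),   /* vec3 */
          ra_alloc_reg_class(regs),   /* vec4 */
       };

       /* Scalar registers map 1:1 onto RA registers 0..N-1. */
       for (unsigned i = 0; i < NUM_SCALAR_REGS; i++)
          ra_class_add_reg(regs, scalar_class, i);

       /* A vecN RA register based at scalar b conflicts with the N scalars it
        * covers; adding the conflicts transitively also makes it conflict with
        * every previously added vec register overlapping those scalars. */
       unsigned reg = NUM_SCALAR_REGS;
       for (unsigned n = 2; n <= 4; n++) {
          for (unsigned b = 0; b + n <= NUM_SCALAR_REGS; b++, reg++) {
             ra_class_add_reg(regs, vec_class[n - 2], reg);
             for (unsigned c = 0; c < n; c++)
                ra_add_transitive_reg_conflict(regs, b + c, reg);
          }
       }

       ra_set_finalize(regs, NULL);   /* NULL: let the helper compute q values */
       return regs;
    }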
vstehle has quit [Ping timeout: 250 seconds]
<robclark>
HdkR: just in case it wasn't obvious, setup the conflicts once per screen and re-use
<robclark>
(ie. don't do it per-shader-compile)
<HdkR>
Definitely not obvious
<HdkR>
Currently nothing in the Bifrost compiler side does any form of caching
<HdkR>
I'll put it on a kanban to remember to do it in the future
<HdkR>
Need to figure out what to do about the temp registers as well...
* robclark
split ir3_ra so we construct the interference graph once for ir3_compiler, associated w/ the screen.. and re-use for every shader.. that is kinda the diff between ir3_compiler (global) and ir3_context (per-shader-compile-context)
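(A hedged sketch of that split, reusing the make_reg_set() sketch above: the register set and its conflicts hang off the long-lived compiler/screen object, and each shader compile only builds an interference graph against it. All of the bifrost_* names are hypothetical.)

    #include "util/ralloc.h"
    #include "util/register_allocate.h"

    struct bifrost_compiler {
       struct ra_regs *regs;   /* built once at screen/compiler creation */
    };

    struct bifrost_compiler *
    bifrost_compiler_create(void *mem_ctx)
    {
       struct bifrost_compiler *c = rzalloc(mem_ctx, struct bifrost_compiler);
       c->regs = make_reg_set(c);   /* the expensive conflict setup happens once */
       return c;
    }

    static bool
    bifrost_ra(struct bifrost_compiler *c, unsigned num_nodes)
    {
       /* Per shader compile: cheap, reuses the cached register set. */
       struct ra_graph *g = ra_alloc_interference_graph(c->regs, num_nodes);
       /* ... ra_set_node_class() / ra_add_node_interference() from liveness ... */
       bool ok = ra_allocate(g);
       /* ... ra_get_node_reg() to read the assignments back ... */
       ralloc_free(g);
       return ok;
    }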
<HdkR>
What do backends do with nir for destroying nir_op_vec4 and just having RA handle correct allocation in the destination register?
<HdkR>
Since Bifrost wants to lower ALU to scalar but keep IO vector based
<HdkR>
(Trying to find if a pass already exists for this or something)
<HdkR>
Or is the right way to generate a no-op move that gets RA'd correctly and have a pass that eliminates it later?
* robclark
generates mov and cleans it up in backend
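(A minimal sketch of that approach for the nir_op_vecN case asked about above, as a fragment of a hypothetical emit_alu() switch in the backend; emit_mov() and ctx are made-up names.)

    case nir_op_vec2:
    case nir_op_vec3:
    case nir_op_vec4: {
       unsigned num_comps = nir_op_infos[alu->op].num_inputs;
       for (unsigned i = 0; i < num_comps; i++) {
          /* One scalar mov per component, writing component i of the vector
           * destination.  RA gives the destination a vecN class so the
           * components land in consecutive registers, and a later copy-prop /
           * coalescing pass deletes any mov whose source is already there. */
          emit_mov(ctx, &alu->dest, i,
                   &alu->src[i].src, alu->src[i].swizzle[0]);
       }
       break;
    }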
<HdkR>
I see
<robclark>
(although tbf I have to do same things w/ tex instructions and various other cases)
<HdkR>
Yea, many scalar architectures will need to do it
<HdkR>
I guess I'll also just support generating a vector construction op that gets destroyed after RA
* robclark
has collect/fanin and split/fanout meta instructions for that sorta thing.. and the cp pass that tries to reduce mov's (and propagate load_immed/uniform).. since even mov's need delay slots before they can be consumed
<HdkR>
oof
<robclark>
and then ra maps those to register classes to try to put things in the right place
<robclark>
because sampler instructions for 3d tex w/ explicit lod and offsets and whatnot work out to needing 10 consecutive scalar regs
<robclark>
(and then once you solve that shader inputs/outputs are already solved)
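(In RA terms, that collect trick boils down to something like the fragment below: the whole payload is one node with a wide class, and the allocator hands back a register that maps to a run of consecutive scalars. payload_node, vec10_class and ra_reg_to_scalar_base are hypothetical names, not actual ir3 code.)

    ra_set_node_class(g, payload_node, compiler->vec10_class);
    /* ... interference edges from liveness, then ra_allocate(g) ... */
    unsigned base =
       ra_reg_to_scalar_base(compiler, ra_get_node_reg(g, payload_node));
    /* The collect's sources become movs into base+0 .. base+9; the cp pass
     * removes any mov whose source was already allocated to the right slot. */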
<HdkR>
Right
<robclark>
nouveau has something similar iirc.. I originally called the meta instructions fanin/fanout but then liked nouveau's naming of collect/split better.. so the names are used semi-interchangeably in ir3 code
<HdkR>
Yea, Nouveau will hit the exact same issues, with all the things
<robclark>
iirc i965 backend has similar.. which they solve w/ mov's and a meta "payload" instruction, but not sure how they get rid of the mov's... or maybe mov's are not as costly because the result is immediately available?
<robclark>
ir3 isa just happily lets you read a src register before the result you are waiting for is avail.. which is fun
<robclark>
nice and fast when you get it right, super confusing before you figure out what is going wrong
<HdkR>
Welcome to Nvidia scheduling hell :P
<robclark>
heheh, iirc ir3 was like that before nv moved to that approach
<robclark>
(or well, it was like that since a3xx, I thought it was more recent gens for nv when they went that route)
<HdkR>
Wow, Freedreno has been around since pre-Maxwell days? Time sure flies
<robclark>
yeah
<HdkR>
Seems like yesterday that I was complaining it doesn't run Dolphin :P
<robclark>
really you shouldn't talk about dolphin until a driver is a few years old :-P
<HdkR>
Just giving ya a hard time ;)
<robclark>
it brings back memories of lotsa compiler head scratching ;-)
<robclark>
but yeah, going from a2xx to something maxwell-like was.. interesting
<HdkR>
It's a heck of a jump in a single generation
<robclark>
and super confusing until you realize what is going on
<HdkR>
I bet
<robclark>
fun when you realize certain groups of instructions don't even consume their src regs immediately, so you have write-after-read hazards
<robclark>
or that certain groups of instructions don't count for alu delay slot accounting
<HdkR>
Yea, scheduling can become a nasty mess
<robclark>
because they are split off into parallel pipe at decode stage
<robclark>
yeah
<HdkR>
I was lucky enough to previously have had a spreadsheet that showed exact latencies and a manual talking about scheduling hazards :P
<robclark>
the sched part (or at least figuring out how far apart instructions need to be) isn't too hard once you realize what is going on
<robclark>
although the best way to maximize parallelism w/out blowing up register usage (beyond the threshold that bumps you to a lower number of warps in flight) is still an open science project
<HdkR>
perfect RA to occupancy ratio is a hard problem
<HdkR>
Similar to knowing when spilling is more effective than using a few more registers because of it
<robclark>
I think in general, post sched, we can know max # of live vals to feed into RA stage... and even to decide to spill before RA
<robclark>
the problem is sched before RA vs sched after RA
<robclark>
so I *think* we need to do both
<HdkR>
Truly you need to do both twice so the first pass will inform the second pass about what it should do
<robclark>
(ie. pre RA sched to reduce live vals, possibly directed by the current # of live vals, and then post RA sched to try to extract as much remaining parallelism as we can get)
<HdkR>
Since they feed in to each other, but that's time consuming
<HdkR>
More suited towards offline compilation sort of thing
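(The ordering being discussed here, as a very rough sketch; every function name is hypothetical.)

    void
    compile_shader(struct shader *s)
    {
       /* Pass 1: schedule to minimize live values, optionally guided by the
        * current register-pressure estimate. */
       sched_for_pressure(s);

       /* Allocate against that schedule, spilling until it fits. */
       while (!register_allocate(s))
          spill_something(s);

       /* Pass 2: with registers fixed, reschedule to fill delay slots and
        * hide latency -- extract whatever parallelism is left. */
       sched_for_latency(s);
    }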
<robclark>
I've looked a bit at Sethi-Ullman numbering to try to inform scheduling, with mixed results..
<robclark>
but yeah, doing this well in a sane O() amount of time is hard
<robclark>
(and it would be nice if I didn't have to sci-hub all the papers on the topic :-/)
<HdkR>
yea :|
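(Sethi-Ullman numbering itself is simple; the hard part is using it well for scheduling. A toy worked example, not ir3 or bifrost code: the label of a node is the minimum number of registers needed to evaluate its subtree, and evaluating the higher-labelled operand first keeps pressure down.)

    #include <stdio.h>

    struct node {
       struct node *left, *right;   /* both NULL for leaves */
    };

    static int
    sethi_ullman(const struct node *n)
    {
       if (!n->left && !n->right)
          return 1;                  /* a leaf needs one register */
       int l = sethi_ullman(n->left);
       int r = sethi_ullman(n->right);
       /* Equal labels mean one extra register is needed to hold the first
        * result while the second subtree is evaluated. */
       return (l == r) ? l + 1 : (l > r ? l : r);
    }

    int
    main(void)
    {
       /* (a + b) * (c - d): both operands label 2, so the product labels 3. */
       struct node a = {0}, b = {0}, c = {0}, d = {0};
       struct node add = { &a, &b }, sub = { &c, &d };
       struct node mul = { &add, &sub };
       printf("%d\n", sethi_ullman(&mul));   /* prints 3 */
       return 0;
    }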
<HdkR>
Alright, working for output IO, now to do it for input IO which will be easy
<robclark>
:-)
<robclark>
HdkR: btw, fun thing on newer gens is combining the ssbo/image atomic input reg with the output.. so you have a "src" reg.xy (or .xyz for compare-exchange) where .x is dst and y/yz is src.. that is super-fun for RA
<robclark>
(and so far I solved it same way as blob, w/ a falsedep and mov)
<HdkR>
haha. What a rude constraint
<robclark>
yeah, srsl
<HdkR>
I guess this is due to isa encoding limitations?
<robclark>
tbh, I have no clue why they did that.. I know why they changed image/ssbo instructions but there were enough bits
<robclark>
they seem to want to be semi-backwards compatible..
<HdkR>
interesting
<robclark>
I assume because of hand written shaders for camera and 2d blit lib
<robclark>
I guess this wouldn't matter for 2d blit lib, but maybe for camera compute shader fancyness?
<HdkR>
You'd think that even with a hand-written assembler it still wouldn't matter
<HdkR>
You could catch a meta op in the backend regardless
<HdkR>
32bit ARM does it all the time
<robclark>
well, if the shader is binary that isn't going to work.. which I assume is the case
<robclark>
if they had some asm the shader was written in that compiled to binary you could
<robclark>
but for a gpu you aren't going to solve those compat problems with extra transistors
<HdkR>
Nobody should ever bitbang out a shader :P
<robclark>
yeah, you shouldn't.. I assume they thought about that after the fact ;-)
<robclark>
anyways, that is my speculation.. I can't imagine why else they would do that
vstehle has joined #panfrost
herbmillerjr has quit [Ping timeout: 255 seconds]
<HdkR>
Oh, hm
<HdkR>
Need to declare conflicts between subclasses even if they already have a stated conflict on the base class they derive from?
<HdkR>
vec3 and vec4 both conflict on scalar, but not each other. I just assumed it would work since the conflicts with the base class would carry over?
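(Conflicts in Mesa's util/register_allocate are pairwise and not inherited through classes, so overlapping vec3/vec4 registers only conflict if that conflict is declared — either explicitly per pair or via the transitive helper. scalar_reg / vec4_reg below are placeholder names.)

    /* plain pairwise conflict: only scalar_reg <-> vec4_reg */
    ra_add_reg_conflict(regs, scalar_reg, vec4_reg);

    /* transitive: vec4_reg also picks up everything that already conflicts
     * with scalar_reg, e.g. a previously added overlapping vec3 register */
    ra_add_transitive_reg_conflict(regs, scalar_reg, vec4_reg);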
<HdkR>
Hm. RA failure, I can only assume due to the different register classes
<HdkR>
Oh wait, no. Just derp
<HdkR>
I don't have fragment storing complete yet :P
<HdkR>
Interesting. fragment store location uses a 64bit pointer that points to a 128bit descriptor?
<HdkR>
64bit index in to a 128bit descriptor table?
<HdkR>
Just threw an MRT at the blob to make sure I was seeing things correctly