<raster>
is panfrost's mesa implementation still re-using the arm mali kernel drivers "as-is" or do i need a new kernel build with changes?
<narmstrong>
still using the arm mali driver
jernej has joined #panfrost
<alyssa>
narmstrong: Oh, awesome!
<alyssa>
I have a workaround for the text, got interrupted trying to actually fix
<alyssa>
Also, try the tomeu-perf branch, which fixes a ton of performance issues and will make Kodi, ya know, be less bad
<raster>
narmstrong: cool.
<raster>
one less moving part at this stage then
<raster>
:)
<alyssa>
I would help, but I couldn't find GBM-backed Kodi anywhere :(
<alyssa>
(Apparently it's in the dev branch but not stable? If so, where do I get official arm64 binaries? Not looking forward to compiling Kodi from source, given the size)
<alyssa>
tomeu: Ooo!
<chewitt>
Kodi doesn't build any official arm64 binaries
<alyssa>
And yeah, looks like a stride issue somewhere
<alyssa>
chewitt: *blinks*
<chewitt>
there's an official ppa for Ubuntu, but that's the only official Linux release
<alyssa>
tomeu: Couldn't tell you where, since stride is... a winsys issue more than anything ;P
<chewitt>
we leave packaging to individual distros as Kodi has quite a few dependencies
<alyssa>
chewitt: So, uh, where do I get legit unstable Kodi binaries?
<chewitt>
self-compile
* alyssa
blinks
<chewitt>
it's not possible to ship a universal binary.. too many distros fiddle with the dependencies
<chewitt>
esp. ffmpeg
<chewitt>
but
<chewitt>
it's not hard to compile
<HdkR>
Start a build and walk away for a day :D
<chewitt>
as long as you're cross compiling .. it's 5-10 mins on a decent spec compile box
<chewitt>
there's a few people around that can assist if you get stuck anywhere
<chewitt>
#kodi-linux and #kodi-dev channels have Kodi team people lurking
rhyskidd has quit [Quit: rhyskidd]
<chewitt>
what distro are you using?
rhyskidd has joined #panfrost
rhyskidd has quit [Remote host closed the connection]
<chewitt>
some of them run their own nightlies
rhyskidd has joined #panfrost
<chewitt>
although none of them will be shipping GBM stuff, only Xorg
anarsoul has quit [Remote host closed the connection]
anarsoul has joined #panfrost
<alyssa>
What the heck? Kodi's internal dep system fetches code over HTTP, explicitly follows redirects, doesn't check signatures after, and silences everything
<alyssa>
chrisf: This falls firmly in the "I don't know what I don't know" category. The feature might be there, it might not, I genuinely don't know.
<HdkR>
Since pretty much all? hardware supports modifying the MSAA sampling pattern :)
<chrisf>
speaking of -- is setting the sample positions understood?
<HdkR>
Not in panfrost at least :P
<HdkR>
Those would be some late game features
<alyssa>
^^
* alyssa
is aiming for rock-solid ES 2.0 before venturing off into the unknown
<alyssa>
"Hi, can I run Kodi with Panfrost?" "No, but we have 12 pages of notes about multisampled instanced blending!"
<HdkR>
Time to implement meshlets and ray tracing in panfrost
<chrisf>
you joke, but i wouldn't be surprised if you can abuse it into doing something meshlet-like
<chrisf>
RT can go jump in the lake though; you have a couple watts, it's not an interesting thing to support
<HdkR>
Yea, I could see meshlets working
<HdkR>
Which would be cool to have a mobile GPU with that feature
<HdkR>
From what I've been hearing, Game devs really like the idea of them, just sad they are only available on Nvidia Turing atm
<chrisf>
perhaps not exactly what nv does, but there's definitely room for "more flexible geometry frontend" and mali hw ought to be able to do it
<HdkR>
Worst case you could probably have a compute job feed the data into the rest of the graphics pipeline bits
<HdkR>
So not optimal but actually supported
<HdkR>
Maybe in a few years
<cwabbott>
HdkR: actually, that's pretty much how the existing pipeline works
<HdkR>
Yea, vertex shaders are effectively compute jobs
<cwabbott>
the vertex shader is just a compute shader, plus a few fixed-function bits for fetching attributes and storing varyings
<chrisf>
cwabbott: the question is whether those ff units are flexible enough to kill the "one vertex out" constraint
<HdkR>
It'll be interesting to know if attribute fetching is roughly the same performance as generic memory fetching though
<HdkR>
That too
<cwabbott>
it is, I think
<cwabbott>
it just uses the standard load/store path
<cwabbott>
all it does is compute the index and do data conversion
<HdkR>
One of the reasons why Turing is so good at mesh shading is because Nvidia finally fixed the issue that "generic" memory fetching was slower than the explicit attribute fetching
<chrisf>
"by removing the attribute fetch hardware" ?
<cwabbott>
chrisf: considering that geometry shaders and tessellation are implemented entirely using the same fixed-function attribute unit (plus a shader to implement the fixed-function tessellator, I guess) I would think so
<HdkR>
One can assume the fetching and forwarded data fetching just behave the same between the different access types
<cwabbott>
you can just give it a vertex id and instance id, and it does the addressing calculation
<HdkR>
Nice
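(A rough C sketch of the addressing calculation being described, assuming a simple base + stride * index scheme; the struct and names are illustrative, not real Mali descriptors:)

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative attribute descriptor; actual hardware descriptors differ. */
struct attr_desc {
    uint64_t base;        /* buffer base address */
    uint32_t stride;      /* bytes between elements */
    bool per_instance;    /* step per instance instead of per vertex */
};

/* The fixed-function unit is handed a vertex id and an instance id and
 * just computes an address; data conversion happens on the loaded value. */
static uint64_t attr_address(const struct attr_desc *a,
                             uint32_t vertex_id, uint32_t instance_id)
{
    uint32_t index = a->per_instance ? instance_id : vertex_id;
    return a->base + (uint64_t)a->stride * index;
}
```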
<cwabbott>
and there's no special vertex cache at all -- the driver just computes the min/max vertex indices, then dispatches a thread per index that loads the attributes from a buffer and stores all the varyings to another buffer
<cwabbott>
the HW doesn't touch the index buffer at all until the tiler
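(In driver terms, that might look something like this minimal sketch; all names are hypothetical, not actual Mesa/panfrost code:)

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for submitting the compute-style vertex job. */
static void run_vertex_jobs(uint32_t first_index, uint32_t count)
{
    (void)first_index;
    (void)count;
}

/* No vertex cache: scan the index buffer for min/max, then run one
 * thread per index in that range.  Each thread loads attributes from a
 * buffer and stores varyings to another; the hardware itself only
 * reads the index buffer in the tiler. */
static void draw_indexed(const uint32_t *indices, size_t index_count)
{
    uint32_t min_idx = UINT32_MAX, max_idx = 0;

    for (size_t i = 0; i < index_count; i++) {
        if (indices[i] < min_idx) min_idx = indices[i];
        if (indices[i] > max_idx) max_idx = indices[i];
    }

    if (index_count)
        run_vertex_jobs(min_idx, max_idx - min_idx + 1);
}
```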
<HdkR>
chrisf: There have been some benchmarks done, and pretty much all accesses behave the same aside from a couple. Fully divergent indirect UBO fetching serializes to each thread in the warp, warp-uniform memory fetches are ~20x faster, and UBO fetches are still dumb quick regardless.
<chrisf>
HdkR: link?
<HdkR>
I'll have to find it again, I can never remember the person's name on twitter that does it :P
<alyssa>
cwabbott: That's what I thought. Also, that's terrifying ;P
<HdkR>
gah. It's not Sebbi who made this bench, but someone in the same circle of awesome game devs
<HdkR>
It would theoretically hit near 64x if it were uniform, right?
<HdkR>
:D
<bnieuwen1uizen>
not quite
<chrisf>
bnieuwen1uizen: need better uniform value analysis?
<bnieuwen1uizen>
chrisf: let's start by LLVM not messing around with loops so much that they become incomprehensible to any useful analysis
<HdkR>
ah, that's from the same test I guess
<HdkR>
Oh, yea. That's an issue
<HdkR>
I see there is some recent metadata added to keep at least some of that information around
<HdkR>
Now time to implement it in all optimization passes so the metadata doesn't vanish...
<bnieuwen1uizen>
yeah nicolai implemented something better for detecting uniformness for scalar loads, but now we have a readfirstLane before every load because the phis are not good yet
<HdkR>
oof
<bnieuwen1uizen>
the AMDGPU backend + control flow is not funny :(
<HdkR>
I looked at the divergence analysis code and was sad
* chrisf
is doing spirv for swiftshader currently, and ending up doing a lot of this stuff before going to llvm ir at all
<bnieuwen1uizen>
HdkR: I started trying to get that vector load + subgroup instr optimization implemented in nir though, so who knows, it may arrive in a driver near you ;)
<bnieuwen1uizen>
yeah, that is the easy way, llvmpipe also does that
<bnieuwen1uizen>
but all our vector ops are predicated and it would truly mess with RA to do it that way :(
<HdkR>
woo
<alyssa>
bnieuwen1uizen: NIR tbh
<alyssa>
:P
<bnieuwen1uizen>
? NIR to be honest?
<HdkR>
Bifrost+ really lends itself to having LLVM do easy codegen. Then you have to deal with LLVM :)
<alyssa>
HdkR: Yeah, but.. NIR ^^
<bnieuwen1uizen>
what makes it easy?
<alyssa>
HdkR: You have no excuses, there's shared code between bifrost and midgard and you don't get to share that if you do LLVM ;P
<bnieuwen1uizen>
alyssa: he can share more with amdgpu ;)
<bnieuwen1uizen>
heck LLVM is a lot more lines of code to share :P
<alyssa>
:VVV
<HdkR>
The bundles are fairly simple, more like what you'd see in a DSP
<HdkR>
So you could describe the instructions inside the bundles easily, and then use LLVM's packetizer to fit them into the clauses
<HdkR>
bundles = clauses
<bnieuwen1uizen>
is the LLVM packetizer any good?
<HdkR>
eeehhh. It's basic
<HdkR>
It's a simple slot based packetizer
<bnieuwen1uizen>
so write a NIR one!
* bnieuwen1uizen
runs
<cwabbott>
bnieuwenhuizen: heh... I wouldn't want to be using NIR that deep into the backend
<bnieuwen1uizen>
true
<alyssa>
cwabbott: Hush you're supposed to be on my side D:
<HdkR>
Basically you end up describing the machine's pipelines and the packetizer tries to fit them as well as possible. It would end up describing each clause's potential layouts as "pipelines" and fitting as aggressively as possible
<HdkR>
So it's meh
<HdkR>
It doesn't understand skewed pipelines at all
<HdkR>
So Midgard isn't a good fit there :P
<HdkR>
I don't think it supports any form of pipeline bypasses either, so Bifrost's temp registers would probably fail pretty hard in the basic implementation
<cwabbott>
HdkR: well, for temp registers, the whole thing is set up so that the compiler can be fairly oblivious to them
<cwabbott>
you just pretend that writes have no latency, and then if there's a conflict, you rewrite the read to use one of the temp registers
<cwabbott>
although now that I think about it, it does kinda affect how many instructions you can pack, since each FMA/ADD combination can only load 3 registers
<cwabbott>
or 2 if the previous instruction wrote 2
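(A sketch of that rewrite, assuming a simple two-slot bundle model; the types and fields here are invented for illustration:)

```c
enum bi_src_kind { SRC_REG, SRC_T0, SRC_T1 };

struct bi_src {
    enum bi_src_kind kind;
    unsigned reg;                 /* valid when kind == SRC_REG */
};

struct bi_bundle {
    unsigned fma_dest, add_dest;  /* registers the FMA/ADD write */
    struct bi_src srcs[3];        /* at most 3 register reads */
};

/* Schedule as if writes have no latency; afterwards, any read of a
 * register written by the immediately preceding bundle hasn't landed
 * in the register file yet, so redirect it to the pipeline temporary
 * (T0 = previous FMA result, T1 = previous ADD result). */
static void fixup_hazards(const struct bi_bundle *prev, struct bi_bundle *cur)
{
    for (int i = 0; i < 3; i++) {
        if (cur->srcs[i].kind != SRC_REG)
            continue;
        if (cur->srcs[i].reg == prev->fma_dest)
            cur->srcs[i].kind = SRC_T0;
        else if (cur->srcs[i].reg == prev->add_dest)
            cur->srcs[i].kind = SRC_T1;
    }
}
```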
<HdkR>
Yea, it's a nice model but it doesn't quite fit into LLVM's default packetizer well
<HdkR>
Which I've been bit with for other architectures
<HdkR>
bit by?
<alyssa>
HdkR: NIRrrrr
<alyssa>
You 'get' to write the scheduler yourself! No default assumptions to break!
<HdkR>
Yea, not planning on going the LLVM route :P
<HdkR>
Everyone knows LLVM is too slow. Nobody in their right mind would create a live compiler using LLVM right?
<HdkR>
</s>
<bnieuwen1uizen>
HdkR: I wonder how much of that is because people use it in stupid ways
<HdkR>
Using and abusing LLVM and assuming it'll optimize all the dumb things? Then it has to and it kills compile times? :P
<bnieuwen1uizen>
yep
<HdkR>
Yea, it's a big part of the issue
<HdkR>
Domain specific compilers are interesting. Once it becomes a generic problem then it is a time issue
<HdkR>
(Engineering time and computation time, woo)
<chrisf>
cwabbott: i thought the bifrost temp register was more limited-- there was exactly one register, it was available for exactly one instruction slot, .. ?
<HdkR>
Think there are T0 and T1 from what I remember, and I think they can be passed to the next clause
<HdkR>
If I'm remembering correctly. Bit tired
<HdkR>
I'm assuming T0 and T1 so it can fit full 64-bit things into it
<cwabbott>
chrisf, HdkR: so, there are 2 stages that are visible in the ISA, FMA and ADD
<HdkR>
Right
<chrisf>
cwabbott: ok
<cwabbott>
FMA and ADD can refer to the previous FMA and ADD, that's what I call T0 and T1
<HdkR>
oh
<cwabbott>
(the instructions also have T0 and T1 as the destination, to remind you which is which)
<cwabbott>
then ADD can also have FMA from the same instruction as a source, which I call T
<HdkR>
I'm having a hard time interpreting that one
<cwabbott>
register file writes for one instruction happen at the same time as reads for the next as long as they're in the same clause (they actually are encoded in the same field)
<HdkR>
labeled T for the result of the FMA passing directly into the ADD?
<cwabbott>
HdkR: yes
<cwabbott>
for example, you would use it with an unfused multiply-add
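(So a*b + c, unfused, could land in one FMA/ADD pair with the ADD reading the same-cycle multiply result through T; a toy sketch with made-up emit helpers, not a real encoder:)

```c
#include <stdio.h>

enum bi_src { SRC_R0, SRC_R1, SRC_R2, SRC_T /* same-cycle FMA result */ };

/* Toy emitters that just print; stand-ins for real encoding. */
static void emit_fma(const char *op, enum bi_src a, enum bi_src b)
{ printf("FMA  %s %d, %d   ; result visible as T\n", op, a, b); }

static void emit_add(const char *op, enum bi_src a, enum bi_src b)
{ printf("ADD  %s %d, %d\n", op, a, b); }

/* r0*r1 + r2, unfused: the multiply runs in the FMA stage and the ADD
 * stage consumes it through T instead of the register file. */
static void emit_unfused_mad(void)
{
    emit_fma("FMUL", SRC_R0, SRC_R1);
    emit_add("FADD", SRC_T, SRC_R2);
}

int main(void) { emit_unfused_mad(); return 0; }
```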
<HdkR>
Alright, so when you have {R%d, T0} that is T0 backed by reg R%d and will be written to both T0 and T%d?
<HdkR>
er
<HdkR>
R%d?
<cwabbott>
yeah
<HdkR>
R%d commit to the RF happens at the end of the clause, right?
<cwabbott>
not at the end of a clause
<cwabbott>
during the register read/write phase
<HdkR>
ah, interesting
<cwabbott>
i mean, stage
<HdkR>
That clears up a bit for me then
<cwabbott>
so, imagine you have a clause with two instructions, instruction 0 and instruction 1
<cwabbott>
the following things would happen in order:
<cwabbott>
1. instruction 0 register read
<cwabbott>
2. instruction 0 FMA
<cwabbott>
3. instruction 0 ADD
<cwabbott>
4. instruction 0 write/instruction 1 read (at the same time)
<cwabbott>
5. instruction 1 FMA
<cwabbott>
6. instruction 1 ADD
<cwabbott>
7. instruction 1 write
<cwabbott>
so it would be 7 cycles total (or maybe more, depending on how many cycles FMA and ADD take)
<cwabbott>
and of course this is all being interleaved with other quads
<HdkR>
Right, typical GPU stuff
urjaman has quit [Ping timeout: 250 seconds]
<HdkR>
Do we have documentation about what all the clause types are?
<cwabbott>
oh, and one last thing... 1 and 7 plus 2 and 3 are all encoded in the 78-bit instruction word (which then gets packed into the clause)
<cwabbott>
*the first 78-bit instruction word
<cwabbott>
then 4, 5, 6 are encoded in the second
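(Conceptually, then, each 78-bit word pairs one register-access block, reads for one instruction encoded together with writes for another, with one FMA op and one ADD op. A loose sketch that makes no claim about the actual bit widths or packing:)

```c
#include <stdint.h>

/* Conceptual layout only; the real 78-bit packing is not shown here. */
struct bifrost_instr_word {
    uint32_t reg_access;  /* one instruction's reads encoded together
                             with another's writes (the clause's final
                             write wraps around to the first word) */
    uint32_t fma;         /* FMA op for this word's instruction */
    uint32_t add;         /* ADD op for this word's instruction */
};
```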
<cwabbott>
HdkR: not really, but feel free to fill it out :)
<HdkR>
I have an enum currently with only two. Need to dump a few more
<cwabbott>
I think there's some in the assembler
<cwabbott>
but it's really easy to figure out with the disassembler
<HdkR>
aye, super nice. Dumping shaders from my little compute program
<cwabbott>
the clause type is uniquely determined by the one variable-latency instruction
urjaman has joined #panfrost
<cwabbott>
if there's no variable-latency instruction, it's always 0
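(In C, such an enum might start out like this; only the zero case is known from the above, everything else is left to be filled in from disassembler dumps:)

```c
/* Clause type is keyed off the clause's single variable-latency
 * instruction; 0 means there is none.  Other values are unknown here
 * and still need to be confirmed by dumping shaders. */
enum bifrost_clause_type {
    BIFROST_CLAUSE_NONE = 0,  /* no variable-latency instruction */
    /* load/store, texture, etc.; values to be confirmed */
};
```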
<Lyude>
this all keeps reminding me how cute bifrost is
<HdkR>
Oh yea, I have three thrown in my enum atm
<HdkR>
0 being zero latency :D
<HdkR>
I should pull up the assembler to not be bitten by the rules I don't know yet
<Lyude>
cwabbott: btw, just to make sure: we redid the docs so that the endianness in the field descriptions matches the assembler didn't we?