alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - https://gitlab.freedesktop.org/panfrost - Logs https://freenode.irclog.whitequark.org/panfrost - Transientification is terminating. Memory reductions in progress.
jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
stikonas_ has quit [Remote host closed the connection]
anarsoul|2 has quit [Ping timeout: 245 seconds]
<HdkR> Got some hacky MIR instruction outputting and RA happening. Next steps: making it less hacky and bundling correctly
<HdkR> If I wasn't riffing on alyssa's compiler so hard then I would have written this completely differently :P
anarsoul has quit [Remote host closed the connection]
anarsoul has joined #panfrost
mateo` has quit [Ping timeout: 245 seconds]
_whitelogger has quit [Ping timeout: 250 seconds]
_whitelogger has joined #panfrost
stikonas_ has joined #panfrost
stikonas_ has quit [Remote host closed the connection]
<narmstrong> got Kodi running on S912 with GBM
<narmstrong> no text
<narmstrong> not smooth at all
<HdkR> <3
<narmstrong> but still pretty cool!
<tomeu> wow!
raster has joined #panfrost
jailbox has quit [Ping timeout: 245 seconds]
_whitelogger has joined #panfrost
chewitt has quit [Quit: Zzz..]
chewitt has joined #panfrost
jailbox has joined #panfrost
<narmstrong> finally, we can do HW Kodi playback on the Amlogic S912! Thanks alyssa ;-)
<tomeu> first sight of weston-simple-egl :)
<tomeu> alyssa: any ideas on what's wrong with the rendering? https://people.collabora.com/~tomeu/panfrost_simple_egl.mp4
<tomeu> hmm, maybe it's not the rendering, but just a wrong stride somewhere in the presentation path
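(A wrong stride usually shows up as a diagonally sheared image: the consumer walks each row with a different pitch than the producer wrote. A minimal C sketch of the mismatch, with illustrative values only:)

```c
/* Sketch: row-by-row copy of a width x height RGBA image. If the
 * consumer assumes a pitch of width * 4 but the buffer was allocated
 * with an aligned pitch (alignment value illustrative), each row
 * starts a few pixels off and the image shears diagonally. */
#include <stdint.h>
#include <string.h>

static void copy_image(uint8_t *dst, size_t dst_pitch,
                       const uint8_t *src, size_t src_pitch,
                       unsigned width, unsigned height)
{
    for (unsigned y = 0; y < height; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width * 4);
}
```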
<HdkR> oof
<tomeu> for some reason, glmark2 seems to work fine
<tomeu> narmstrong: if you want to try it out on your board: https://gitlab.freedesktop.org/panfrost/mesa/merge_requests/14
<narmstrong> tomeu: thanks, I'll have a try
<raster> is panfrost's mesa implementation still re-using the arm mali kernel drivers "as-is" or do i need a new kernel build with changes?
<narmstrong> still using the arm mali driver
jernej has joined #panfrost
<alyssa> narmstrong: Oh, awesome!
<alyssa> I have a workaround for the text, got interrupted trying to actually fix
<alyssa> Also, try the tomeu-perf branch, which fixes a ton of performance issues and will make Kodi, ya know, be less bad
<raster> narmstrong: cool.
<raster> one less moving part at this stage then
<raster> :)
<alyssa> I would help, but I couldn't find GBM-backed Kodi anywhere :(
<alyssa> (Apparently it's in the dev branch but not stable? If so, where do I get official arm64 binaries? Not looking forward to compiling Kodi from source, given the size)
<alyssa> tomeu: Ooo!
<chewitt> Kodi doesn't build any official arm64 binaries
<alyssa> And yeah, looks like a stride issue somewhere
<alyssa> chewitt: *blinks*
<chewitt> there's an official ppa for Ubuntu, but that's the only official Linux release
<alyssa> tomeu: Couldn't tell you where, since stride is... a winsys issue more than anything ;P
<chewitt> we leave packaging to individual distros as Kodi has quite a few dependencies
<alyssa> chewitt: So, uh, where do I get legit unstable Kodi binaries?
<chewitt> self-compile
* alyssa blinks
<chewitt> it's not possible to ship a universal binary.. too many distros fiddle with the dependencies
<chewitt> esp. ffmpeg
<chewitt> but
<chewitt> it's not hard to compile
<HdkR> Start a build and walk away for a day :D
<chewitt> as long as you're cross compiling .. it's 5-10 mins on a decent spec compile box
<chewitt> max
<alyssa> ...Huh
<alyssa> Like I'm open to it but
<alyssa> Scary
<chewitt> there's a few people around that can assist if you get stuck anywhere
<chewitt> #kodi-linux and #kodi-dev channels have Kodi team people lurking
rhyskidd has quit [Quit: rhyskidd]
<chewitt> what distro are you using?
rhyskidd has joined #panfrost
rhyskidd has quit [Remote host closed the connection]
<chewitt> some of them run their own nightlies
rhyskidd has joined #panfrost
<chewitt> although none of them will be shipping GBM stuff, only Xorg
anarsoul has quit [Remote host closed the connection]
anarsoul has joined #panfrost
<alyssa> What the heck? Kodi's internal dep system fetches code over HTTP, explicitly follows redirects, doesn't check signatures after, and silences everything
<alyssa> I'm sorry, this is not okay.
<HdkR> Is that a makefile that calls cmake to build makefiles?
<alyssa> HdkR: Apparently.
<chewitt> I'll confess to not being that familiar with the 'depends' build process; LE builds all the dependencies independently
<Lyude> HdkR: ooooh-can I see what you've got so far for bifrost?
robertfoss_ has joined #panfrost
<HdkR> Lyude: You could see it, but it isn't even generating real instructions yet
<HdkR> :P
<Lyude> ahhh
<Lyude> yeah that part is gonna be challenging :)
<alyssa> Have fun you two :P
<HdkR> It's almost generating instructions "correctly", just haven't touched bundling
<HdkR> Vacation ends on Monday so I'm going to get it as far as I can and push Sunday
<HdkR> Since I can't guarantee time past that
<alyssa> \o/
lrusak has joined #panfrost
<chrisf> alyssa: does midgard allow control over subpixel precision?
<chrisf> alyssa: I'm looking at a blob that claims 8 bits; I'm wondering if you can run the rasterizer in 4-bit mode
<chrisf> this is possible on some other GPUs
<HdkR> Have you searched around to see if there were multiple midgard GPUs that reported different precision based on driver version? :D
<HdkR> I see a T880 with different drivers reporting 4 and 8
<HdkR> No idea if that is misreporting from the driver or not
<chrisf> gpuinfo.org lacks this value for some reason
<HdkR> It has it in the vulkan results
<HdkR> That's what I was looking at as well, it's in the limits
<HdkR> Different drivers, one showing 4, other showing 8
<alyssa> chrisf: I'm not sure..
<chrisf> HdkR: they're different physical implementations of the T880, too :(
<alyssa> It's hard to know what the hw is/is-not capable of if we've never seen the blob use a feature
<alyssa> We don't know how much we don't know, yeah?
<HdkR> yea, but the number of compute units shouldn't affect that
<HdkR> So it either supports it or they hecked up reporting in their driver :P
<HdkR> Assume latter, hope for former
<HdkR> If you ask on their forum then maybe Peter Harris will respond to you about it
<chrisf> yeah, I can find out from ARM; I was just hoping I might get a "yes, you can totally do that, it's this bit here!"
<alyssa> chrisf: I mean, is that feature exposed to GL anywhere?
<chrisf> alyssa: it's an implementation-dependent value. you can query it, but you can't control it
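(For reference, the query chrisf describes: GL exposes it as GL_SUBPIXEL_BITS, with a minimum of 4 bits in GLES 2.0; Vulkan reports the same limit as VkPhysicalDeviceLimits::subPixelPrecisionBits. A minimal snippet:)

```c
#include <GLES2/gl2.h>
#include <stdio.h>

/* Query the implementation-dependent rasterizer subpixel precision;
 * GLES 2.0 guarantees at least 4 bits. Readable, not settable. */
static void print_subpixel_bits(void)
{
    GLint bits = 0;
    glGetIntegerv(GL_SUBPIXEL_BITS, &bits);
    printf("subpixel bits: %d\n", bits);
}
```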
<alyssa> Hm
<alyssa> Unrelated to MSAA?
<chrisf> yes
<chrisf> this is the number of subpixel bits there are in window space positions of things
<HdkR> Yea, sampling pattern doesn't directly relate
* alyssa mumbles
<alyssa> chrisf: This falls firmly in the "I don't know what I don't know" category. The feature might be there, it might not, I genuinely don't know.
<HdkR> Since pretty much all? hardware supports modifying the MSAA sampling pattern :)
<chrisf> speaking of -- is setting the sample positions understood?
<HdkR> Not in panfrost at least :P
<HdkR> Those would be some late game features
<alyssa> ^^
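(For comparison, desktop GL does expose programmable sample positions through ARB_sample_locations; a hedged sketch of that API, not anything Panfrost implements today. The entry point is assumed to come from the usual extension loader:)

```c
/* Sketch using ARB_sample_locations: override the first two sample
 * positions of the bound framebuffer. Each location is an (x, y)
 * pair in [0, 1); the values here are arbitrary. */
static const GLfloat locations[] = {
    0.25f, 0.25f,
    0.75f, 0.75f,
};
glFramebufferSampleLocationsfvARB(GL_FRAMEBUFFER, 0 /* start */,
                                  2 /* count */, locations);
```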
* alyssa is aiming for rock-solid ES 2.0 before venturing off into the unknown
<alyssa> "Hi, can I run Kodi with Panfrost?" "No, but we have 12 pages of notes about multisampled instanced blending!"
<HdkR> Time to implement meshlets and ray tracing in panfrost
<chrisf> you joke, but I wouldn't be surprised if you can abuse it into doing something meshlet-like
<chrisf> RT can go jump in the lake though; you have a couple watts, it's not an interesting thing to support
<HdkR> Yea, I could see meshlets working
<HdkR> It would be cool to have a mobile GPU with that feature
<HdkR> From what I've been hearing, game devs really like the idea of them; just sad they're only available on Nvidia Turing atm
<chrisf> perhaps not exactly what nv does, but there's definitely room for "more flexible geometry frontend" and mali hw ought to be able to do it
<HdkR> Worst case you could probably have a compute job feed the data into the rest of the graphics pipeline bits
<HdkR> So not optimal but actually supported
<HdkR> Maybe in a few years
<cwabbott> HdkR: actually, that's pretty much how the existing pipeline works
<HdkR> Yea, vertex shaders are effectively compute jobs
<cwabbott> the vertex shader is just a compute shader, plus a few fixed-function bits for fetching attributes and storing varyings
<chrisf> cwabbott: the question is whether those ff units are flexible enough to kill the "one vertex out" constraint
<HdkR> It'll be interesting to know if attribute fetching is roughly the same performance as generic memory fetching though
<HdkR> That too
<cwabbott> it is, I think
<cwabbott> it just uses the standard load/store path
<cwabbott> all it does is compute the index and do data conversion
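(So in C terms the fixed-function fetch is roughly the following; a hypothetical sketch, with the format conversion hard-coded to unorm8x4 for illustration:)

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } vec4;

struct attrib_desc {
    const uint8_t *base;  /* attribute buffer base address */
    size_t stride;        /* bytes between consecutive elements */
    unsigned divisor;     /* 0 = per-vertex, else per-instance */
};

/* Hypothetical model of the fixed-function fetch: compute the index
 * from the vertex/instance id, form an address, load through the
 * normal load/store path, then convert the format. */
static vec4 fetch_attrib(const struct attrib_desc *a,
                         unsigned vertex_id, unsigned instance_id)
{
    unsigned index = a->divisor ? instance_id / a->divisor : vertex_id;
    const uint8_t *p = a->base + (size_t)index * a->stride;
    return (vec4){ p[0] / 255.0f, p[1] / 255.0f,
                   p[2] / 255.0f, p[3] / 255.0f };
}
```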
<HdkR> One of the reasons why Turing is so good at mesh shading is because Nvidia finally fixed the issue that "generic" memory fetching was slower than the explicit attribute fetching
<chrisf> "by removing the attribute fetch hardware" ?
<cwabbott> chrisf: considering that geometry shaders and tessellation are implemented entirely using the same fixed-function attribute unit (plus a shader to implement the fixed-function tessellator, I guess) I would think so
<HdkR> One can assume the fetching and forward data fetching just behaves the same between the different access types
<cwabbott> you can just give it a vertex id and instance id, and it does the addressing calculation
<HdkR> Nice
<cwabbott> and there's no special vertex cache at all -- the driver just computes the min/max vertex indices, then dispatches a thread per index that loads the attributes from a buffer and stores all the varyings to another buffer
<cwabbott> the HW doesn't touch the index buffer at all until the tiler
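(A sketch of that driver-side flow; dispatch_vertex_job() is a hypothetical stand-in, not the actual panfrost entry point:)

```c
#include <stdint.h>

extern void dispatch_vertex_job(unsigned first_vertex, unsigned count);

/* Sketch: no vertex cache, so the driver scans the index buffer for
 * the min/max index and runs the "vertex shader as compute" once per
 * index in that range; the tiler is the first consumer of the actual
 * indices. */
static void draw_indexed(const uint16_t *indices, unsigned count)
{
    if (!count)
        return;

    uint16_t min_idx = UINT16_MAX, max_idx = 0;
    for (unsigned i = 0; i < count; i++) {
        if (indices[i] < min_idx) min_idx = indices[i];
        if (indices[i] > max_idx) max_idx = indices[i];
    }

    /* One thread per vertex in [min, max]: load attributes, run the
     * shader, store varyings to a buffer the tiler will read. */
    dispatch_vertex_job(min_idx, max_idx - min_idx + 1u);
}
```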
<HdkR> chrisf: There have been some benchmarks done, and pretty much all accesses behave the same aside from a couple. Fully divergent indirect UBO fetching serializes to each thread in the warp, warp-uniform memory fetches are ~20x faster, and UBO fetches are still dumb quick regardless.
<chrisf> HdkR: link?
<HdkR> I'll have to find it again, I can never remember the person's name on twitter that does it :P
<alyssa> cwabbott: That's what I thought. Also, that's terrifying ;P
<HdkR> gah. It's not Sebbi who made this bench, but someone in the same circle of awesome game devs
<HdkR> Having a hard time
<HdkR> Oh, it was sebbi
<HdkR> You can see the tests that end in uniform hit something like 20x perf in cases
<HdkR> `cbuffer{float4} load linear: 328.935ms 0.050x` and then that dumb one
<HdkR> If you were to order that list by size then the different types would all be fairly consistent
<HdkR> (Intel is also fast at wave uniform fetches due to its architecture)
anarsoul|2 has joined #panfrost
<HdkR> Speaking of which. bnieuwenhuizen, does radeonsi do any load + broadcast for wave uniform loads? :)
<chrisf> if someone wants to convert to gles/vulkan and characterize our zoo of toy gpus... :)
<alyssa> chewitt: Got kodi-gbm going with the help of our friends over at #kodi-dev
<alyssa> Will play with this when I have some time (Monday, probably)
<alyssa> narmstrong: ^^
<HdkR> Woo kodi
stikonas has joined #panfrost
* HdkR is noticing that there are actually a lot of people in the channel
<bnieuwen1uizen> HdkR: what do you mean?
<bnieuwen1uizen> on AMD loads using a scalar address end up in a scalar reg, which can be used from every lane
<HdkR> ah, neat
<HdkR> Wonder why it isn't as great of a perf increase in that microbench then
<HdkR> Since the Nvidia Blob hits 28x perf improvement in one case
<bnieuwen1uizen> what bench?
* bnieuwen1uizen admits to not always following the log here too closely
<HdkR> Ah sorry, Sebbi's memory load microbench here https://github.com/sebbbi/perftest#nvidia-turing-rtx-2080-ti
<HdkR> There are GCN and vega tests above that one
<HdkR> `cbuffer{float4} load uniform: 8.011ms 2.778x` from Vega FE
<bnieuwen1uizen> I think the main limitation is that LLVM's test for uniformness in loops sort of comes down to "is it constant"
<HdkR> This is also a D3D bench sadly
<HdkR> So running Windows
<bnieuwen1uizen> oh no clue about the proprietary compiler
<bnieuwen1uizen> HdkR: You seen https://gist.github.com/sebbbi/ba4415339b535d22fb18e2d824564ec4 ? AFAIU nvidia does that optimization while AMD does not
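(The optimization in question, sketched in C-style pseudocode: when the address is provably uniform across the wave, issue one scalar load and broadcast the result instead of a load per lane. The wave_* helpers are hypothetical stand-ins for readfirstlane-style subgroup ops:)

```c
#include <stdbool.h>

/* Hypothetical subgroup helpers standing in for hardware ops;
 * not a real API. */
extern bool  wave_is_uniform(const float *addr);
extern float wave_broadcast_first(float v);

static float load_maybe_uniform(const float *addr)
{
    if (wave_is_uniform(addr)) {
        /* Scalar path: one load, result broadcast to every lane
         * (on AMD this would live in a scalar register). */
        return wave_broadcast_first(*addr);
    }
    /* Divergent path: each lane issues its own vector-unit load. */
    return *addr;
}
```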
<HdkR> It would theoretically hit near 64x if it was uniform right?
<HdkR> :D
<bnieuwen1uizen> not quite
<chrisf> bnieuwen1uizen: need better uniform value analysis?
<bnieuwen1uizen> chrisf: let's start by LLVM not messing around with loops so much that they become incomprehensible to any useful analysis
<HdkR> ah, that's from the same test I guess
<HdkR> Oh, yea. That's an issue
<HdkR> I see there is some recent metadata added to keep at least some of that information around
<HdkR> Now time to implement it in all optimization passes so the metadata doesn't vanish...
<bnieuwen1uizen> yeah, nicolai implemented something better for detecting uniformness for scalar loads, but now we have a readfirstlane before every load because the phis are not good yet
<HdkR> oof
<bnieuwen1uizen> the AMDGPU backend + control flow is not funny :(
<HdkR> I looked at the divergence analysis code and was sad
* chrisf is doing spirv for swiftshader currently, and ending up doing a lot of this stuff before going to llvm ir at all
<bnieuwen1uizen> HdkR: I started trying to get that vector load + subgroup instr optimization implemented in nir though, so who knows, it may arrive in a driver near you ;)
<bnieuwen1uizen> yeah, that is the easy way, llvmpipe also does that
<bnieuwen1uizen> but all our vector ops are predicated and it would truly mess with RA to do it that way :(
<HdkR> woo
<alyssa> bnieuwen1uizen: NIR tbh
<alyssa> :P
<bnieuwen1uizen> ? NIR to be honest?
<HdkR> Bifrost+ really lends itself to having LLVM do easy codegen. Then you have to deal with LLVM :)
<alyssa> HdkR: Yeah, but.. NIR ^^
<bnieuwen1uizen> what makes it easy?
<alyssa> HdkR: You have no excuses, there's shared code between bifrost and midgard and you don't get to share that if you do LLVM ;P
<bnieuwen1uizen> alyssa: he can share more with amdgpu ;)
<bnieuwen1uizen> heck LLVM is a lot more lines of code to share :P
<alyssa> :VVV
<HdkR> The bundles are fairly simple, more like what you'd see in a DSP
<HdkR> So you could describe the instructions inside the bundles easily, and then use LLVM's packetizer to fit them into the clauses pretty easily
<HdkR> bundles = clauses
<bnieuwen1uizen> is the LLVM packetizer any good?
<HdkR> eeehhh. It's basic
<HdkR> It's a simple slot based packetizer
<bnieuwen1uizen> so write a NIR one!
* bnieuwen1uizen runs
<cwabbott> bnieuwenhuizen: heh... I wouldn't want to be using NIR that deep into the backend
<bnieuwen1uizen> true
<alyssa> cwabbott: Hush you're supposed to be on my side D:
<HdkR> Basically you end up describing the machine's pipelines and the packetizer tries to fit them as well as possible. It would end up describing each clause's potential layouts as "pipelines" and fitting as aggressively as possible
<HdkR> So it's meh
<HdkR> It doesn't understand skewed pipelines at all
<HdkR> So Midgard isn't a good fit there :P
<HdkR> I don't think it supports any form of pipeline bypasses either, so Bifrost's temp registers would probably fail pretty hard in the basic implementation
<cwabbott> HdkR: well, for temp registers, the whole thing is setup so that the compiler can be fairly oblivious to them
<cwabbott> you just pretend that writes have no latency, and then if there's a conflict, you rewrite the read to use one of the temp registers
<cwabbott> although now that I think about it, it does kinda affect how many instructions you can pack, since each FMA/ADD combination can only load 3 registers
<cwabbott> or 2 if the previous instruction wrote 2
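(That constraint is cheap to model in a bundler; a sketch of the check, assuming exactly the rule stated above — 3 register reads per FMA/ADD pair, or 2 when the previous instruction wrote 2:)

```c
#include <stdbool.h>

/* Hypothetical bundler check: an FMA/ADD pair may read at most 3
 * registers, dropping to 2 when the previous instruction's writeback
 * occupies 2 of the ports (reads and writes share the same phase). */
static bool pair_fits(unsigned unique_reg_reads, unsigned prev_writes)
{
    unsigned read_ports = (prev_writes >= 2) ? 2 : 3;
    return unique_reg_reads <= read_ports;
}
```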
<HdkR> Yea, it's a nice model but it doesn't quite fit in to LLVM's default packetizer well
<HdkR> Which I've been bit with for other architectures
<HdkR> bit by?
<alyssa> HdkR: NIRrrrr
<alyssa> You 'get' to write the scheduler yourself! No default assumptions to break!
<HdkR> Yea, not planning on going the LLVM route :P
<HdkR> Everyone knows LLVM is too slow. Nobody in their right mind would create a live compiler using LLVM right?
<HdkR> </s>
<bnieuwen1uizen> HdkR: I wonder how much of that is because people use it in stupid ways
<HdkR> Using and abusing LLVM and assuming it'll optimize all the dumb things? Then it has to and it kills compile times? :P
<bnieuwen1uizen> yep
<HdkR> Yea, it's a big part of the issue
<HdkR> Domain specific compilers are interesting. Once it becomes a generic problem then it is a time issue
<HdkR> (Engineering time and computation time, woo)
<chrisf> cwabbott: I thought the Bifrost temp register was more limited -- there was exactly one register, it was available for exactly one instruction slot, ..?
<HdkR> Think there are T0 and T1 from what I remember, and I think they can be passed to the next clause
<HdkR> If I'm remembering correctly. Bit tired
<HdkR> I'm assuming T0 and T1 so it can fit full 64-bit things into it
<cwabbott> chrisf, HdkR: so, there are 2 stages that are visible in the ISA, FMA and ADD
<HdkR> Right
<chrisf> cwabbott: ok
<cwabbott> FMA and ADD can refer to the previous FMA and ADD, that's what I call T0 and T1
<HdkR> oh
<cwabbott> (the instructions also have T0 and T1 as the destination, to remind you which is which)
<cwabbott> then ADD can also have FMA from the same instruction as a source, which I call T
<HdkR> I'm having a hard time interpreting that one
<cwabbott> register file writes for one instruction happen at the same time as reads for the next as long as they're in the same clause (they actually are encoded in the same field)
<HdkR> labeled T for the result of the FMA passing directly in to the ADD?
<cwabbott> HdkR: yes
<cwabbott> for example, you would use it with an unfused multiply-add
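(In source terms: the FMA stage produces the multiply, and the ADD stage of the same instruction consumes it through T, with no register-file traffic in between. A tiny sketch:)

```c
/* Unfused multiply-add mapped onto one instruction:
 *   FMA stage: T = a * b   (result exists only as T)
 *   ADD stage: r = T + c   (reads the same-instruction FMA result)
 */
static float unfused_mad(float a, float b, float c)
{
    float t = a * b;  /* FMA slot -> T */
    return t + c;     /* ADD slot reads T */
}
```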
<HdkR> Alright, so when you have {R%d, T0} that is T0 backed by reg R%d and will be written to both T0 and T%d?
<HdkR> er
<HdkR> R%d?
<cwabbott> yeah
<HdkR> R%d commit to RF happens at the end of clause right?
<cwabbott> not at the end of a clause
<cwabbott> during the register read/write phase
<HdkR> ah, interesting
<cwabbott> i mean, stage
<HdkR> That clears up a bit for me then
<cwabbott> so, imagine you have a clause with two instructions, instruction 0 and instruction 1
<cwabbott> the following things would happen in order:
<cwabbott> 1. instruction 0 register read
<cwabbott> 2. instruction 0 FMA
<cwabbott> 3. instruction 0 ADD
<cwabbott> 4. instruction 0 write/instruction 1 read (at the same time)
<cwabbott> 5. instruction 1 FMA
<cwabbott> 6. instruction 1 ADD
<cwabbott> 7. instruction 1 write
<cwabbott> so it would be 7 cycles total (or maybe more, depending on how many cycles FMA and ADD take)
<cwabbott> and of course this is all being interleaved with other quads
<HdkR> Right, typical GPU stuff
urjaman has quit [Ping timeout: 250 seconds]
<HdkR> Do we have documentation about what all the clause types are?
<cwabbott> oh, and one last thing... 1 and 7 plus 2 and 3 are all encoded in the 78-bit instruction word (which then gets packed into the clause)
<cwabbott> *the first 78-bit instruction word
<cwabbott> then 4, 5, 6 are encoded in the second
<cwabbott> HdkR: not really, but feel free to fill it out :)
<HdkR> I have an enum currently with only two. Need to dump a few more
<cwabbott> I think there's some in the assembler
<cwabbott> but it's really easy to figure out with the disassembler
<HdkR> aye, super nice. Dumping shaders from my little compute program
<cwabbott> the clause type is uniquely determined by the one variable-latency instruction
urjaman has joined #panfrost
<cwabbott> if there's no variable-latency instruction, it's always 0
<Lyude> this all keeps reminding me how cute Bifrost is
<HdkR> Oh yea, I have three thrown in my enum atm
<HdkR> 0 being zero latency :D
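(Given that rule, a starting enum might look like this; only the 0 case is confirmed above, and the other names/values are placeholders to verify against the disassembler:)

```c
/* Hypothetical clause types: only the 0 case is stated above (no
 * variable-latency instruction in the clause); the rest are
 * placeholder names/values pending dumps from the disassembler. */
enum bifrost_clause_type {
    BIFROST_CLAUSE_NONE = 0,   /* no variable-latency instruction */
    BIFROST_CLAUSE_LOAD = 1,   /* placeholder: memory load */
    BIFROST_CLAUSE_TEXTURE = 2 /* placeholder: texture fetch */
};
```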
<HdkR> I should pull up the assembler to not be bitten by the rules I don't know yet
<Lyude> cwabbott: btw, just to make sure: we redid the docs so that the endianness in the field descriptions matches the assembler, didn't we?