alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - https://gitlab.freedesktop.org/panfrost - Logs https://freenode.irclog.whitequark.org/panfrost - Transientification is terminating. Memory reductions in progress.
jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
stikonas_ has quit [Remote host closed the connection]
anarsoul|2 has quit [Ping timeout: 245 seconds]
<HdkR> Got some hacky MIR instruction outputting and RA happening. Next steps: making it less hacky and bundling correctly
<HdkR> If I wasn't riffing on alyssa's compiler so hard then I would have written this completely differently :P
anarsoul has quit [Remote host closed the connection]
anarsoul has joined #panfrost
mateo` has quit [Ping timeout: 245 seconds]
_whitelogger has quit [Ping timeout: 250 seconds]
_whitelogger has joined #panfrost
stikonas_ has joined #panfrost
stikonas_ has quit [Remote host closed the connection]
<narmstrong> got Kodi running on S912 with GBM
<narmstrong> no text
<narmstrong> not smooth at all
<HdkR> <3
<narmstrong> but still pretty cool!
<tomeu> wow!
raster has joined #panfrost
jailbox has quit [Ping timeout: 245 seconds]
_whitelogger has joined #panfrost
chewitt has quit [Quit: Zzz..]
chewitt has joined #panfrost
jailbox has joined #panfrost
<narmstrong> finally, we can do HW Kodi playback on the Amlogic S912! Thanks alyssa ;-)
<tomeu> first sight of weston-simple-egl :)
<tomeu> alyssa: any ideas on what's wrong with the rendering? https://people.collabora.com/~tomeu/panfrost_simple_egl.mp4
<tomeu> hmm, maybe it's not the rendering, but just a wrong stride somewhere in the presentation path
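(A wrong stride usually shows up as a diagonally sheared image: the consumer walks each row with a different pitch than the producer wrote. A minimal C sketch of the mismatch, with illustrative values only:)

```c
/* Sketch: row-by-row copy of a width x height RGBA image. If the
 * consumer assumes a pitch of width * 4 but the buffer was allocated
 * with an aligned pitch (alignment value illustrative), each row
 * starts a few pixels off and the image shears diagonally. */
#include <stdint.h>
#include <string.h>

static void copy_image(uint8_t *dst, size_t dst_pitch,
                       const uint8_t *src, size_t src_pitch,
                       unsigned width, unsigned height)
{
    for (unsigned y = 0; y < height; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width * 4);
}
```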
<HdkR> oof
<tomeu> for some reason, glmark2 seems to work fine
<tomeu> narmstrong: if you want to try it out on your board: https://gitlab.freedesktop.org/panfrost/mesa/merge_requests/14
<narmstrong> tomeu: thanks, I'll have a try
<raster> is panfrost's mesa implementation still re-using the arm mali kernel drivers "as-is" or do i need a new kernel build with changes?
<narmstrong> still using the arm mali driver
jernej has joined #panfrost
<alyssa> narmstrong: Oh, awesome!
<alyssa> I have a workaround for the text, got interrupted trying to actually fix
<alyssa> Also, try the tomeu-perf branch, which fixes a ton of performance issues and will make Kodi, ya know, be less bad
<raster> narmstrong: cool.
<raster> one less moving part at this stage then
<raster> :)
<alyssa> I would help, but I couldn't find GBM-backed Kodi anywhere :(
<alyssa> (Apparently it's in the dev branch but not stable? If so, where do I get official arm64 binaries? Not looking forward to compiling Kodi from source, given the size)
<alyssa> tomeu: Ooo!
<chewitt> Kodi doesn't build any official arm64 binaries
<alyssa> And yeah, looks like a stride issue somewhere
<alyssa> chewitt: *blinks*
<chewitt> there's an official ppa for Ubuntu, but that's the only official Linux release
<alyssa> tomeu: Couldn't tell you where, since stride is... a winsys issue more than anything ;P
<chewitt> we leave packaging to individual distros as Kodi has quite a few dependencies
<alyssa> chewitt: So, uh, where do I get legit unstable Kodi binaries?
<chewitt> self-compile
* alyssa blinks
<chewitt> it's not possible to ship a universal binary.. too many distros fiddle with the dependencies
<chewitt> esp. ffmpeg
<chewitt> but
<chewitt> it's not hard to compile
<HdkR> Start a build and walk away for a day :D
<chewitt> as long as you're cross compiling .. it's 5-10 mins on a decent spec compile box
<chewitt> max
<alyssa> ...Huh
<alyssa> Like I'm open to it but
<alyssa> Scary
<chewitt> there's a few people around that can assist if you get stuck anywhere
<chewitt> #kodi-linux and #kodi-dev channels have Kodi team people lurking
rhyskidd has quit [Quit: rhyskidd]
<chewitt> what distro are you using?
rhyskidd has joined #panfrost
rhyskidd has quit [Remote host closed the connection]
<chewitt> some of them run their own nightlies
rhyskidd has joined #panfrost
<chewitt> although none of them will be shipping GBM stuff, only Xorg
anarsoul has quit [Remote host closed the connection]
anarsoul has joined #panfrost
<alyssa> What the heck? Kodi's internal dep system fetches code over HTTP, explicitly follows redirects, doesn't check signatures after, and silences everything
<alyssa> I'm sorry, this is not okay.
<HdkR> Is that a makefile that calls cmake to build makefiles?
<alyssa> HdkR: Apparently.
<chewitt> I'll confess to not being that familiar with the 'depends' build process; LE builds all the dependencies independently
<Lyude> HdkR: ooooh-can I see what you've got so far for bifrost?
robertfoss_ has joined #panfrost
<HdkR> Lyude: You could see it, but it isn't even generating real instructions yet
<HdkR> :P
<Lyude> ahhh
<Lyude> yeah that part is gonna be challenging :)
<alyssa> Have fun you two :P
<HdkR> It's almost generating instructions "correctly", just haven't touched bundling
<HdkR> Vacation ends on Monday so I'm going to get it as far as I can and push Sunday
<HdkR> Since I can't guarantee time past that
<alyssa> \o/
lrusak has joined #panfrost
<chrisf> alyssa: does midgard allow control over subpixel precision?
<chrisf> alyssa: I'm looking at a blob that claims 8 bits; I'm wondering if you can run the rasterizer in 4-bit mode
<chrisf> this is possible on some other GPUs
<HdkR> Have you searched around to see if there were multiple midgard GPUs that reported different precision based on driver version? :D
<HdkR> I see a T880 with different drivers reporting 4 and 8
<HdkR> No idea if that is misreporting from the driver or not
<chrisf> gpuinfo.org lacks this value for some reason
<HdkR> It has it in the vulkan results
<HdkR> That's what I was looking at as well, it's in the limits
<HdkR> Different drivers, one showing 4, other showing 8
<alyssa> chrisf: I'm not sure..
<chrisf> HdkR: they're different physical implementations of the T880, too :(
<alyssa> It's hard to know what the hw is/is-not capable of if we've never seen the blob use a feature
<alyssa> We don't know how much we don't know, yeah?
<HdkR> yea, but the number of compute units shouldn't affect that
<HdkR> So it either supports it or they hecked up reporting in their driver :P
<HdkR> Assume latter, hope for former
<HdkR> If you ask on their forum then maybe Peter Harris will respond to you about it
<chrisf> yeah, I can find out from ARM; I was just hoping I might get a "yes, you can totally do that, it's this bit here!"
<alyssa> chrisf: I mean, is that feature exposed to GL anywhere?
<chrisf> alyssa: it's an implementation-dependent value. you can query it, but you can't control it
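(For reference, the query chrisf describes: GL exposes it as GL_SUBPIXEL_BITS, with a minimum of 4 bits in GLES 2.0; Vulkan reports the same limit as VkPhysicalDeviceLimits::subPixelPrecisionBits. A minimal snippet:)

```c
#include <GLES2/gl2.h>
#include <stdio.h>

/* Query the implementation-dependent rasterizer subpixel precision;
 * GLES 2.0 guarantees at least 4 bits. Readable, not settable. */
static void print_subpixel_bits(void)
{
    GLint bits = 0;
    glGetIntegerv(GL_SUBPIXEL_BITS, &bits);
    printf("subpixel bits: %d\n", bits);
}
```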
<alyssa> Hm
<alyssa> Unrelated to MSAA?
<chrisf> yes
<chrisf> this is the number of subpixel bits there are in window space positions of things
<HdkR> Yea, sampling pattern doesn't directly relate
* alyssa mumbles
<alyssa> chrisf: This falls firmly in the "I don't know what I don't know" category. The feature might be there, it might not, I genuinely don't know.
<HdkR> Since pretty much all? hardware supports modifying the MSAA sampling pattern :)
<chrisf> speaking of -- is setting the sample positions understood?
<HdkR> Not in panfrost at least :P
<HdkR> Those would be some late game features
<alyssa> ^^
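(For comparison, desktop GL does expose programmable sample positions through ARB_sample_locations; a hedged sketch of that API, not anything Panfrost implements today. The entry point is assumed to come from the usual extension loader:)

```c
/* Sketch using ARB_sample_locations: override the first two sample
 * positions of the bound framebuffer. Each location is an (x, y)
 * pair in [0, 1); the values here are arbitrary. */
static const GLfloat locations[] = {
    0.25f, 0.25f,
    0.75f, 0.75f,
};
glFramebufferSampleLocationsfvARB(GL_FRAMEBUFFER, 0 /* start */,
                                  2 /* count */, locations);
```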
* alyssa is aiming for rock-solid ES 2.0 before venturing off into the unknown
<alyssa> "Hi, can I run Kodi with Panfrost?" "No, but we have 12 pages of notes about multisampled instanced blending!"
<HdkR> Time to implement meshlets and ray tracing in panfrost
<chrisf> you joke, but I wouldn't be surprised if you can abuse it into doing something meshlet-like
<chrisf> RT can go jump in the lake though; you have a couple watts, it's not an interesting thing to support
<HdkR> Yea, I could see meshlets working
<HdkR> It would be cool to have a mobile GPU with that feature
<HdkR> From what I've been hearing, game devs really like the idea of them; just sad they're only available on Nvidia Turing atm
<chrisf> perhaps not exactly what nv does, but there's definitely room for "more flexible geometry frontend" and mali hw ought to be able to do it
<HdkR> Worst case you could probably have a compute job feed the data into the rest of the graphics pipeline bits
<HdkR> So not optimal but actually supported
<HdkR> Maybe in a few years
<cwabbott> HdkR: actually, that's pretty much how the existing pipeline works
<HdkR> Yea, vertex shaders are effectively compute jobs
<cwabbott> the vertex shader is just a compute shader, plus a few fixed-function bits for fetching attributes and storing varyings
<chrisf> cwabbott: the question is whether those ff units are flexible enough to kill the "one vertex out" constraint
<HdkR> It'll be interesting to know if attribute fetching is roughly the same performance as generic memory fetching though
<HdkR> That too
<cwabbott> it is, I think
<cwabbott> it just uses the standard load/store path
<cwabbott> all it does is compute the index and do data conversion
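(So in C terms the fixed-function fetch is roughly the following; a hypothetical sketch, with the format conversion hard-coded to unorm8x4 for illustration:)

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } vec4;

struct attrib_desc {
    const uint8_t *base;  /* attribute buffer base address */
    size_t stride;        /* bytes between consecutive elements */
    unsigned divisor;     /* 0 = per-vertex, else per-instance */
};

/* Hypothetical model of the fixed-function fetch: compute the index
 * from the vertex/instance id, form an address, load through the
 * normal load/store path, then convert the format. */
static vec4 fetch_attrib(const struct attrib_desc *a,
                         unsigned vertex_id, unsigned instance_id)
{
    unsigned index = a->divisor ? instance_id / a->divisor : vertex_id;
    const uint8_t *p = a->base + (size_t)index * a->stride;
    return (vec4){ p[0] / 255.0f, p[1] / 255.0f,
                   p[2] / 255.0f, p[3] / 255.0f };
}
```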
<HdkR> One of the reasons why Turing is so good at mesh shading is because Nvidia finally fixed the issue that "generic" memory fetching was slower than the explicit attribute fetching
<chrisf> "by removing the attribute fetch hardware" ?
<cwabbott> chrisf: considering that geometry shaders and tessellation are implemented entirely using the same fixed-function attribute unit (plus a shader to implement the fixed-function tessellator, I guess) I would think so
<HdkR> One can assume the fetching and forward data fetching just behaves the same between the different access types
<cwabbott> you can just give it a vertex id and instance id, and it does the addressing calculation
<HdkR> Nice
<cwabbott> and there's no special vertex cache at all -- the driver just computes the min/max vertex indices, then dispatches a thread per index that loads the attributes from a buffer and stores all the varyings to another buffer
<cwabbott> the HW doesn't touch the index buffer at all until the tiler
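(A sketch of that driver-side flow; dispatch_vertex_job() is a hypothetical stand-in, not the actual panfrost entry point:)

```c
#include <stdint.h>

extern void dispatch_vertex_job(unsigned first_vertex, unsigned count);

/* Sketch: no vertex cache, so the driver scans the index buffer for
 * the min/max index and runs the "vertex shader as compute" once per
 * index in that range; the tiler is the first consumer of the actual
 * indices. */
static void draw_indexed(const uint16_t *indices, unsigned count)
{
    if (!count)
        return;

    uint16_t min_idx = UINT16_MAX, max_idx = 0;
    for (unsigned i = 0; i < count; i++) {
        if (indices[i] < min_idx) min_idx = indices[i];
        if (indices[i] > max_idx) max_idx = indices[i];
    }

    /* One thread per vertex in [min, max]: load attributes, run the
     * shader, store varyings to a buffer the tiler will read. */
    dispatch_vertex_job(min_idx, max_idx - min_idx + 1u);
}
```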
<HdkR> chrisf: There have been some benchmarks done, and pretty much all accesses behave the same aside from a couple. Fully divergent indirect UBO fetching serializes to each thread in the warp, warp-uniform memory fetches are ~20x faster, and UBO fetches are still dumb quick regardless.
<chrisf> HdkR: link?
<HdkR> I'll have to find it again, I can never remember the person's name on twitter that does it :P
<alyssa> cwabbott: That's what I thought. Also, that's terrifying ;P
<HdkR> gah. It's not Sebbi who made this bench, but someone in the same circle of awesome game devs
<HdkR> Having a hard time
<HdkR> Oh, it was sebbi
<HdkR> You can see the tests that end in uniform hit something like 20x perf in cases
<HdkR> `cbuffer{float4} load linear: 328.935ms 0.050x` and then that dumb one
<HdkR> If you were to order that list by size then the different types would all be fairly consistent
<HdkR> (Intel is also fast at wave uniform fetches due to its architecture)
anarsoul|2 has joined #panfrost
<HdkR> Speaking of which. bnieuwenhuizen, does radeonsi do any load + broadcast for wave uniform loads? :)
<chrisf> if someone wants to convert to gles/vulkan and characterize our zoo of toy gpus... :)
<alyssa> chewitt: Got kodi-gbm going with the help of our friends over at #kodi-dev
<alyssa> Will play with this when I have some time (Monday, probably)
<alyssa> narmstrong: ^^
<HdkR> Woo kodi
stikonas has joined #panfrost
* HdkR is noticing that there are actually a lot of people in the channel
<bnieuwen1uizen> HdkR: what do you mean?
<bnieuwen1uizen> on AMD loads using a scalar address end up in a scalar reg, which can be used from every lane
<HdkR> ah, neat
<HdkR> Wonder why it isn't as great of a perf increase in that microbench then
<HdkR> Since the Nvidia Blob hits 28x perf improvement in one case
<bnieuwen1uizen> what bench?
* bnieuwen1uizen admits to not always following the log here too closely
<HdkR> Ah sorry, Sebbi's memory load microbench here https://github.com/sebbbi/perftest#nvidia-turing-rtx-2080-ti
<HdkR> There are GCN and vega tests above that one
<HdkR> `cbuffer{float4} load uniform: 8.011ms 2.778x` from Vega FE
<bnieuwen1uizen> I think the main limitation is that LLVM's test for uniformness in loops sort of comes down to "is it constant"
<HdkR> This is also a D3D bench sadly
<HdkR> So running Windows
<bnieuwen1uizen> oh no clue about the proprietary compiler
<bnieuwen1uizen> HdkR: You seen https://gist.github.com/sebbbi/ba4415339b535d22fb18e2d824564ec4 ? AFAIU nvidia does that optimization while AMD does not
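(The optimization in question, sketched in C-style pseudocode: when the address is provably uniform across the wave, issue one scalar load and broadcast the result instead of a load per lane. The wave_* helpers are hypothetical stand-ins for readfirstlane-style subgroup ops:)

```c
#include <stdbool.h>

/* Hypothetical subgroup helpers standing in for hardware ops;
 * not a real API. */
extern bool  wave_is_uniform(const float *addr);
extern float wave_broadcast_first(float v);

static float load_maybe_uniform(const float *addr)
{
    if (wave_is_uniform(addr)) {
        /* Scalar path: one load, result broadcast to every lane
         * (on AMD this would live in a scalar register). */
        return wave_broadcast_first(*addr);
    }
    /* Divergent path: each lane issues its own vector-unit load. */
    return *addr;
}
```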
<HdkR> It would theoretically hit near 64x if it was uniform right?
<HdkR> :D
<bnieuwen1uizen> not quite
<chrisf> bnieuwen1uizen: need better uniform value analysis?
<bnieuwen1uizen> chrisf: let's start by LLVM not messing around with loops so much that they become incomprehensible to any useful analysis
<HdkR> ah, that's from the same test I guess
<HdkR> Oh, yea. That's an issue
<HdkR> I see there is some recent metadata added to keep at least some of that information around
<HdkR> Now time to implement it in all optimization passes so the metadata doesn't vanish...
<bnieuwen1uizen> yeah, nicolai implemented something better for detecting uniformness for scalar loads, but now we have a readfirstlane before every load because the phis are not good yet
<HdkR> oof
<bnieuwen1uizen> the AMDGPU backend + control flow is not funny :(
<HdkR> I looked at the divergence analysis code and was sad
* chrisf is doing spirv for swiftshader currently, and ending up doing a lot of this stuff before going to llvm ir at all
<bnieuwen1uizen> HdkR: I started trying to get that vector load + subgroup instr optimization implemented in nir though, so who knows, it may arrive in a driver near you ;)
<bnieuwen1uizen> yeah, that is the easy way, llvmpipe also does that
<bnieuwen1uizen> but all our vector ops are predicated and it would truly mess with RA to do it that way :(
<HdkR> woo
<alyssa> bnieuwen1uizen: NIR tbh
<alyssa> :P
<bnieuwen1uizen> ? NIR to be honest?
<HdkR> Bifrost+ really lends itself to having LLVM do easy codegen. Then you have to deal with LLVM :)
<alyssa> HdkR: Yeah, but.. NIR ^^
<bnieuwen1uizen> what makes it easy?
<alyssa> HdkR: You have no excuses, there's shared code between bifrost and midgard and you don't get to share that if you do LLVM ;P
<bnieuwen1uizen> alyssa: he can share more with amdgpu ;)
<bnieuwen1uizen> heck LLVM is a lot more lines of code to share :P
<alyssa> :VVV
<HdkR> The bundles are fairly simple, more like what you'd see in a DSP
<HdkR> So you could describe the instructions inside the bundles easily, and then use LLVM's packetizer to fit them into the clauses pretty easily
<HdkR> bundles = clauses
<bnieuwen1uizen> is the LLVM packetizer any good?
<HdkR> eeehhh. It's basic
<HdkR> It's a simple slot based packetizer
<bnieuwen1uizen> so write a NIR one!
* bnieuwen1uizen runs
<cwabbott> bnieuwenhuizen: heh... I wouldn't want to be using NIR that deep into the backend
<bnieuwen1uizen> true
<alyssa> cwabbott: Hush you're supposed to be on my side D:
<HdkR> Basically you end up describing the machine's pipelines and the packetizer tries to fit them as well as possible. It would end up describing each clause's potential layouts as "pipelines" and fitting as aggressively as possible
<HdkR> So it's meh
<HdkR> It doesn't understand skewed pipelines at all
<HdkR> So Midgard isn't a good fit there :P
<HdkR> I don't think it supports any form of pipeline bypasses either, so Bifrost's temp registers would probably fail pretty hard in the basic implementation
<cwabbott> HdkR: well, for temp registers, the whole thing is setup so that the compiler can be fairly oblivious to them
<cwabbott> you just pretend that writes have no latency, and then if there's a conflict, you rewrite the read to use one of the temp registers
<cwabbott> although now that I think about it, it does kinda affect how many instructions you can pack, since each FMA/ADD combination can only load 3 registers
<cwabbott> or 2 if the previous instruction wrote 2
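(That constraint is cheap to model in a bundler; a sketch of the check, assuming exactly the rule stated above — 3 register reads per FMA/ADD pair, or 2 when the previous instruction wrote 2:)

```c
#include <stdbool.h>

/* Hypothetical bundler check: an FMA/ADD pair may read at most 3
 * registers, dropping to 2 when the previous instruction's writeback
 * occupies 2 of the ports (reads and writes share the same phase). */
static bool pair_fits(unsigned unique_reg_reads, unsigned prev_writes)
{
    unsigned read_ports = (prev_writes >= 2) ? 2 : 3;
    return unique_reg_reads <= read_ports;
}
```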
<HdkR> Yea, it's a nice model but it doesn't quite fit in to LLVM's default packetizer well
<HdkR> Which I've been bit with for other architectures
<HdkR> bit by?
<alyssa> HdkR: NIRrrrr
<alyssa> You 'get' to write the scheduler yourself! No default assumptions to break!
<HdkR> Yea, not planning on going the LLVM route :P
<HdkR> Everyone knows LLVM is too slow. Nobody in their right mind would create a live compiler using LLVM right?
<HdkR> </s>
<bnieuwen1uizen> HdkR: I wonder how much of that is because people use it in stupid ways
<HdkR> Using and abusing LLVM and assuming it'll optimize all the dumb things? Then it has to and it kills compile times? :P
<bnieuwen1uizen> yep
<HdkR> Yea, it's a big part of the issue
<HdkR> Domain specific compilers are interesting. Once it becomes a generic problem then it is a time issue
<HdkR> (Engineering time and computation time, woo)
<chrisf> cwabbott: I thought the Bifrost temp register was more limited -- there was exactly one register, it was available for exactly one instruction slot, ..?
<HdkR> Think there are T0 and T1 from what I remember, and I think they can be passed to the next clause
<HdkR> If I'm remembering correctly. Bit tired
<HdkR> I'm assuming T0 and T1 so it can fit full 64-bit things into it
<cwabbott> chrisf, HdkR: so, there are 2 stages that are visible in the ISA, FMA and ADD
<HdkR> Right
<chrisf> cwabbott: ok
<cwabbott> FMA and ADD can refer to the previous FMA and ADD, that's what I call T0 and T1
<HdkR> oh
<cwabbott> (the instructions also have T0 and T1 as the destination, to remind you which is which)
<cwabbott> then ADD can also have FMA from the same instruction as a source, which I call T
<HdkR> I'm having a hard time interpreting that one
<cwabbott> register file writes for one instruction happen at the same time as reads for the next as long as they're in the same clause (they actually are encoded in the same field)
<HdkR> labeled T for the result of the FMA passing directly in to the ADD?
<cwabbott> HdkR: yes
<cwabbott> for example, you would use it with an unfused multiply-add
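(In source terms: the FMA stage produces the multiply, and the ADD stage of the same instruction consumes it through T, with no register-file traffic in between. A tiny sketch:)

```c
/* Unfused multiply-add mapped onto one instruction:
 *   FMA stage: T = a * b   (result exists only as T)
 *   ADD stage: r = T + c   (reads the same-instruction FMA result)
 */
static float unfused_mad(float a, float b, float c)
{
    float t = a * b;  /* FMA slot -> T */
    return t + c;     /* ADD slot reads T */
}
```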
<HdkR> Alright, so when you have {R%d, T0} that is T0 backed by reg R%d and will be written to both T0 and T%d?
<HdkR> er
<HdkR> R%d?
<cwabbott> yeah
<HdkR> R%d commit to RF happens at the end of clause right?
<cwabbott> not at the end of a clause
<cwabbott> during the register read/write phase
<HdkR> ah, interesting
<cwabbott> i mean, stage
<HdkR> That clears up a bit for me then
<cwabbott> so, imagine you have a clause with two instructions, instruction 0 and instruction 1
<cwabbott> the following things would happen in order:
<cwabbott> 1. instruction 0 register read
<cwabbott> 2. instruction 0 FMA
<cwabbott> 3. instruction 0 ADD
<cwabbott> 4. instruction 0 write/instruction 1 read (at the same time)
<cwabbott> 5. instruction 1 FMA
<cwabbott> 6. instruction 1 ADD
<cwabbott> 7. instruction 1 write
<cwabbott> so it would be 7 cycles total (or maybe more, depending on how many cycles FMA and ADD take)
<cwabbott> and of course this is all being interleaved with other quads
<HdkR> Right, typical GPU stuff
urjaman has quit [Ping timeout: 250 seconds]
<HdkR> Do we have documentation about what all the clause types are?
<cwabbott> oh, and one last thing... 1 and 7 plus 2 and 3 are all encoded in the 78-bit instruction word (which then gets packed into the clause)
<cwabbott> *the first 78-bit instruction word
<cwabbott> then 4, 5, 6 are encoded in the second
<cwabbott> HdkR: not really, but feel free to fill it out :)
<HdkR> I have an enum currently with only two. Need to dump a few more
<cwabbott> I think there's some in the assembler
<cwabbott> but it's really easy to figure out with the disassembler
<HdkR> aye, super nice. Dumping shaders from my little compute program
<cwabbott> the clause type is uniquely determined by the one variable-latency instruction
urjaman has joined #panfrost
<cwabbott> if there's no variable-latency instruction, it's always 0
<Lyude> this all keeps reminding me how cute Bifrost is
<HdkR> Oh yea, I have three thrown in my enum atm
<HdkR> 0 being zero latency :D
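(Given that rule, a starting enum might look like this; only the 0 case is confirmed above, and the other names/values are placeholders to verify against the disassembler:)

```c
/* Hypothetical clause types: only the 0 case is stated above (no
 * variable-latency instruction in the clause); the rest are
 * placeholder names/values pending dumps from the disassembler. */
enum bifrost_clause_type {
    BIFROST_CLAUSE_NONE = 0,   /* no variable-latency instruction */
    BIFROST_CLAUSE_LOAD = 1,   /* placeholder: memory load */
    BIFROST_CLAUSE_TEXTURE = 2 /* placeholder: texture fetch */
};
```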
<HdkR> I should pull up the assembler to not be bitten by the rules I don't know yet
<Lyude> cwabbott: btw, just to make sure: we redid the docs so that the endianness in the field descriptions matches the assembler, didn't we?