alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
cphealy has joined #panfrost
enunes has joined #panfrost
yann has quit [Ping timeout: 260 seconds]
yann has joined #panfrost
raster has quit [Quit: Gettin' stinky!]
enunes has quit [Ping timeout: 256 seconds]
enunes has joined #panfrost
stikonas has quit [Remote host closed the connection]
jgmdev has quit [Ping timeout: 272 seconds]
atler has quit [Killed (beckett.freenode.net (Nickname regained by services))]
atler has joined #panfrost
vstehle has quit [Ping timeout: 246 seconds]
kaspter has joined #panfrost
kaspter has quit [Excess Flood]
kaspter has joined #panfrost
archetech has joined #panfrost
camus1 has joined #panfrost
kaspter has quit [Remote host closed the connection]
camus1 is now known as kaspter
davidlt has joined #panfrost
ente has quit [Ping timeout: 272 seconds]
chewitt has quit [Quit: Zzz..]
ente has joined #panfrost
warpme_ has quit [Quit: Connection closed for inactivity]
_whitelogger has joined #panfrost
camus1 has joined #panfrost
kaspter has quit [Ping timeout: 264 seconds]
camus1 is now known as kaspter
guillaume_g has joined #panfrost
mixfix41 is now known as h0tp0ck3t
h0tp0ck3t has left #panfrost [#panfrost]
xdarklight has quit [Ping timeout: 272 seconds]
xdarklight has joined #panfrost
camus1 has joined #panfrost
kaspter has quit [Ping timeout: 256 seconds]
camus1 is now known as kaspter
orkid has quit [Quit: leaving]
orkid has joined #panfrost
tgall_foo has quit [Read error: Connection reset by peer]
tgall_foo has joined #panfrost
archetech has quit [Quit: Konversation terminated!]
m][sko has joined #panfrost
ente has quit [Ping timeout: 246 seconds]
m][sko has quit [Quit: Connection closed]
ente has joined #panfrost
pmjdebruijn has quit [Ping timeout: 240 seconds]
stikonas has joined #panfrost
raster has joined #panfrost
kaspter has quit [Ping timeout: 264 seconds]
kaspter has joined #panfrost
leah is now known as _4of7
warpme_ has joined #panfrost
kaspter has quit [Ping timeout: 260 seconds]
kaspter has joined #panfrost
kaspter has quit [Remote host closed the connection]
kaspter has joined #panfrost
alpernebbi has joined #panfrost
alyssa has joined #panfrost
<alyssa> icecream95: crc-hud is super neat!
* alyssa had to go 'test' it with glmark, and uh supertuxkart
<alyssa> Unsurprisingly, CRC does very well on 2D UI, and very poorly on 3D content.
<alyssa> I don't know if drirc would make sense here.
<alyssa> (Best would be calibrating based on perf counters at runtime but unfortunately mali doesn't expose them at enough granularity for that =p)
<alyssa> --Although actually supertuxkart fps seems to be helped?
<alyssa> Yeah, it looks like CRC either helps or makes no difference
<alyssa> So keeping it on unconditionally is probably fine
<macc24> alyssa: bifrost optimizations?
<macc24> :o
<alyssa> macc24: and midgard
<alyssa> Courtesy of icecream95 , I'm just 'testing'
<alyssa> For anyone too lazy to do the algebra themselves:
<alyssa> The expected cost of rendering with CRC is [m T_miss + (1 - m) T_hit] for the miss rate m
<alyssa> T_miss = T_r + T_w + F_w
<alyssa> For CRC read cost T_r, CRC write cost T_w, and framebuffer write cost F_w
<alyssa> T_hit = T_r
<alyssa> On the other hand, the expected cost of rendering without CRC is simply F_w
<alyssa> So CRC is a win iff [m T_miss + (1 - m) T_hit] < F_w
<alyssa> Noting that T_r = T_w = 8 bytes/tile since CRCs are 64-bit, and that F_w = (width)(height)(bpp) for the tile width/height and bpp bytes per pixel, CRC is a win iff:
<macc24> alyssa: finally gpu in my laptop will be faster than 2010 integrated graphics
<alyssa> m < ((width)(height)(bpp) - 8)/((width)(height)(bpp) + 8)
<alyssa> Said differently, letting h=1-m be the hit rate, CRC is a win iff:
<alyssa> h > 16 / ((width)(height)(bpp) + 8)
<alyssa> This is instructive: it says CRC is more effective as the tile size increases, and as the bytes per pixel increases.
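For readers following the algebra, the steps between the win condition and the two bounds can be written out as below. This is a sketch that takes the hit cost to be just the CRC read (T_hit = T_r), which is what the final bounds imply; the symbols are the ones used in the messages above.

```latex
\begin{align*}
m\,(T_r + T_w + F_w) + (1 - m)\,T_r &< F_w \\
T_r + m\,(T_w + F_w) &< F_w \\
m &< \frac{F_w - T_r}{F_w + T_w} \\
\intertext{With $T_r = T_w = 8$ bytes and $F_w = \text{width} \cdot \text{height} \cdot \text{bpp}$:}
m &< \frac{F_w - 8}{F_w + 8} \\
h = 1 - m &> 1 - \frac{F_w - 8}{F_w + 8} = \frac{16}{F_w + 8}
\end{align*}
```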
alyssa has quit [Remote host closed the connection]
alyssa has joined #panfrost
<alyssa> For the average case, where tile size is 16x16 and there are 4 bytes per pixel, the hit rate needs to be 1.6%
<alyssa> This agrees with Arm's marketing materials which mention ~2% if I recall, I just wanted to do the calculation myself :p
<alyssa> For the two limiting cases:
<alyssa> * Tile size 16x16, with 256-bits per pixel -- hit rate needs to be only 0.2% for it to be a win
<alyssa> * Tile size 4x4, with 8-bits per pixel -- hit rate needs to be 66%!
<alyssa> Actually, the latter case is totally fantasy -- small tile sizes should only be used at high bpps
<alyssa> So for Midgard, where 128-bits is the cap, we have...
<alyssa> 0.4% for any power-of-two bpp greater than 128-bits
<alyssa> at worst 0.8% for npot bpp
<alyssa> So really, the limiting case is 6% for 8bpp, but already down to less than 2% for the expected bpp4 case.
<alyssa> So this justifies turning it on even for 3D: you can fill that 2% just with the onscreen HUD, or the sky.
<alyssa> TL;DR transaction elimination is totally OP
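The break-even figures quoted above can be reproduced with a few lines of Python. This is a minimal sketch of the formula h > 16 / (width * height * bpp + 8); the function name and the table of cases are illustrative and not taken from any driver code.

```python
CRC_BYTES = 8  # 64-bit CRC per tile, as stated in the log


def breakeven_hit_rate(width, height, bytes_per_pixel, crc_bytes=CRC_BYTES):
    """Minimum CRC hit rate at which transaction elimination pays off.

    Implements h > 2 * crc / (F_w + crc), where F_w is the cost of
    writing one tile of the framebuffer in bytes.
    """
    fb_write = width * height * bytes_per_pixel
    return 2 * crc_bytes / (fb_write + crc_bytes)


# Cases mentioned in the discussion: (tile width, tile height, bytes per pixel)
cases = {
    "16x16, 4 bytes/pixel": (16, 16, 4),
    "16x16, 256 bits/pixel": (16, 16, 32),
    "4x4, 8 bits/pixel": (4, 4, 1),
    "16x16, 8 bits/pixel": (16, 16, 1),
}

for label, (w, h, bpp) in cases.items():
    print(f"{label}: {100 * breakeven_hit_rate(w, h, bpp):.2f}%")
```

The threshold falls as the tile gets larger in bytes, which is the point made above: the fixed 16 bytes of CRC traffic are amortised over more framebuffer data.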
<macc24> alyssa: will it help with performance on kevin
<macc24> ?
<alyssa> Hope so
<alyssa> Synthetic benchmarks are seeing slight win.
<alyssa> Most significant being glmark -bdesktop, which IIRC is bandwidth limited, which I see up by 4%
<macc24> does it help when some parts of window stay the same?
alyssa has quit [Remote host closed the connection]
kaspter has quit [Quit: kaspter]
_4of7 is now known as leah
nlhowell has joined #panfrost
amonakov has joined #panfrost
raster has quit [Quit: Gettin' stinky!]
alpernebbi has quit [Remote host closed the connection]
raster has joined #panfrost
<amonakov> Hi folks. Today for work I was looking into how f16vec2 fma is compiled for Bifrost, and using malisc with panfrost disassembler I discovered Mali compiler has a pretty bad performance bug:
<amonakov> despite presence of fma.v2f16, they unpack f16vec2 to f32 registers, perform fma.f32 twice, and repack back
<amonakov> This would have been impossible to notice and track down without panfrost disassembler, kudos to you!
<daniels> amonakov: glad to hear it! :) what's stopping you from using Panfrost directly, ooi?
<amonakov> well, it's for work, targeting Android with Mali drivers?..
<daniels> there are a couple of people looking into AOSP enablement, but it should largely work with gbm_gralloc as I understand it
<amonakov> non-rooted Android, and I believe Mesa does not speak the /dev/mali0 language :)
alyssa has joined #panfrost
<alyssa> amonakov: FMA.v2f16 should be used by the DDK under 'good' circumstances..
<amonakov> what is the DDK in this context?
<HdkR> Driver blob
<amonakov> yeah, they can use fma.v2f16 for 'x*y+z', but not when passed spir-v has the explicit fma insn
<HdkR> Only minor changes to the SoC. Guess they didn't want to disturb the design too much :D
<robmur01> those specs certainly sound like a "tock" to me
<macc24> HdkR: i hope chips like these find their way to chromebooks
<HdkR> Me too
* robmur01 hopes they find their way into useful computers :P
<HdkR> Chromebook is completely useful for RE and then copying a Linux install on to it instead ;)
<macc24> s/Linux/Cadmium/
<robmur01> meh, still not the same as a machine that's actually designed to run your own OS of choice, or at least can offer a normal EFI boot menu like this thing
<macc24> robmur01: still, depthcharge is better than traditional bios
<macc24> and developing linux distro for chromebooks is easier than developing linux distro for regular x86 hardware
<robmur01> equally, developing a towbar for blue Honda CB-F motorcycles is easier than developing a towbar for all cars
<robmur01> market size and diversity still doesn't make a blue motorcycle a *good* choice of towing machine
<macc24> robmur01: chromebooks have no legacy baggage and no need to put boot code in specific sector of a random drive
<HdkR> :D
<macc24> less choice = easier to develop for
<macc24> less choice for user, that is
<robmur01> macc24: what part of the standard EFI boot protocol on this SDM835 machine requires that, exactly?
<macc24> robmur01: i was talking about bios
<robmur01> pretty sure last time I tried I just booted a kernel image off a FAT32 partition on a USB stick
<HdkR> I boot my ProX off an EFI partition pointing to a Linux image on USB :P
* robmur01 doesn't understand what legacy PC BIOS has to do with anything Arm-related :/
<macc24> robmur01: my point was that making linux on chromebooks is easier than making linux on "regular" x86 hardware
<macc24> all chromebooks just load kernel from first partition without any additional config
<robmur01> and when exactly do you expect MTK SoCs to start turning up in regular x86 hardware?
* robmur01 is massively confused and going off to do something else
<macc24> TIL that i'm good at confusing people accidentally
<macc24> and i wish arm socs replaced intel/amd cpus
guillaume_g has quit [Quit: Konversation terminated!]
<HdkR> Gimme a 64 core post-hercules SoC and I'll replace a computer in my house :P
<macc24> HdkR: if mt8183 laptops had as many ports as my thinkpad x201 i would have typed this message from mt8183
<HdkR> :P
<macc24> the only reasons that i keep my xeon desktop are its gpu and x86 software
<HdkR> Luckily GPU is easy to fix in ATX form factor even with ARM
<HdkR> x86 software is a bit more rough without ARMv8.4
<macc24> well, it's more about fact that it can drive more than 2 displays at the same time
<amonakov> how do I build Mesa's in-tree panfrost disassembler? was using the one from ShaderProgramDisassembler repo, but it's probably very outdated by now
<robmur01> HdkR: y'know I've been here via x86 software on Armv8.0 for pushing a year now, right? :D
<HdkR> robmur01: FEX also supports that config. It's just a nightmare if you need to deal with say...unaligned atomics
<alyssa> amonakov: build mesa with -Dtools=panfrost
<alyssa> and it'll show up as `bifrost_compiler` in the build dir nested deeply
<alyssa> `bifrost_compiler disasm foo.bin` should work if mesa is built from git master
alyssa has quit [Remote host closed the connection]
alyssa has joined #panfrost
<macc24> HdkR: armv8.4?
<HdkR> macc24: ARMv8.4 mandates support for unaligned atomics
<HdkR> Where with ARMv8.1 it is optional to support and nobody supports it
<amonakov> alyssa: thanks (also disabled a bunch of stuff to cut down dependencies)
<amonakov> alyssa: do I understand correctly that I need to manually remove the mali blob shader header, unlike for ShaderProgramDisassembler?
<alyssa> if you built latest, not needed
<amonakov> ah, nice, thanks
<macc24> it would be embarrassing if linux on m1 had more features than cadmium in shorter time
<alyssa> this is offtopic for #panfrost
<macc24> i see no alyssa on ##panfrost-offtopic
<daniels> there's also the #asahi family of channels for M1 things :)
<daniels> I won't go into it here because way offtopic, but there are legal concerns raised around the Corellium port, which you can find out more about by looking into marcan's Twitter posts
<anarsoul> yay!
<alyssa> macc24: also, I would add "completely usable" is ... a stretch
<alyssa> by that metric panfrost has been "completely usable" since Jan 2019
<macc24> alyssa: completely usable as in all essential stuff like gpu, usb, suspending, display, whatever else working fine
<macc24> sound too
<macc24> anyway i'm gonna go run away since my desktop froze
<alyssa> macc24: Definitely no GPU support on that image.
<alyssa> Display is whatever was provided at boot only.
<alyssa> I don't believe suspend works.
<alyssa> I don't believe sound works.
<alyssa> I don't mean to knock the work -- the speed of the port is incredible -- but "completely usable" is misleading.
<robmur01> also good luck hotplugging monitors or changing resolution with simple-framebuffer
davidlt has quit [Ping timeout: 272 seconds]
<anarsoul> who cares about suspend on mac mini?
<robmur01> People who suspend their machine when they're not using it? I almost never cold-boot my desktop.
<robmur01> (note that hibernate needs baseline suspend support too)
nlhowell has quit [Ping timeout: 265 seconds]
<macc24> alyssa: i mean, sound and suspend doesn't work in cadmium too
<macc24> and i have seen people use llvmpipe on c201pa
<anarsoul> robmur01: I keep my laptop always on if its on charger
<anarsoul> anyway, they'll likely get to working suspend some day
<anarsoul> one step at a time
<macc24> shit i gotta speed up with getting suspend to work on duet
raster has quit [Quit: Gettin' stinky!]
raster has joined #panfrost
* alyssa is embarrassed by the # of open panfrost MRs
<HdkR> That just means more people need poked for review. Which is a better problem than a large downstream fork without any MRs :)
Net147 has quit [Read error: Connection reset by peer]
Net147 has joined #panfrost
<warpme_> Guys: i'm scratching my head why amlogic is only platform giving me app. segfault on Qt EGLFS (EGLFS: Qt draws to fullscreen EGL surface) while exactly the same sw. stack works well on rk/aw/rpi. Some datapoints: 1\GL provider (mesa) is the same on all HW; 2\ Qt X11/GL(glamour) works OK on AML. 3\AML issue is on all AML HW i have (lima on mali450, panfrost on t760 and bifrost on g31). My hypothesis is: EGLFS uses mesa
<warpme_> GLES call(s) which are exposing issue at mesa-drm in aml drm. What will be your opinion here?
<alyssa> backtrace? but yeah, sounds like amlogic display stuff is to blame
<alyssa> if both lima and panfrost are affected, but rk is not, it isn't a GL issue
<alyssa> scheduler constants are breaking dEQP-GLES3.functional.texture.format.sized.3d.rgba8_pot why...
<warpme_> alyssa: segfault is deep in Qt EGLplatformintegration driver. To get meaningful data from Qt internals I need a debug build of Qt. Even cross-compiled on an i7 it takes hours. I think it is not worth it, as in the end we will reach the same conclusion: the issue is in the aml drm driver??
<warpme_> asking here as suspect aml guys will point finger to mesa :-p
<anarsoul> if the crash happens in the same place on lima and panfrost it's very unlikely to be mesa
<anarsoul> anyway, try building mesa with debug info to see if it even appears in backtrace
<warpme_> anarsoul: can't compare exact call/stack regs - but Qt call seen in gdb seems to be the same...
<anarsoul> could also be some Qt bug
<anarsoul> aml folks will ask you for a backtrace anyway
<anarsoul> :)
<warpme_> hmm - might be but why then all is ok on: aw/rk/rpi/intel/amd/nvidia?
<macc24> i bet it will be ok with LIBGL_ALWAYS_SOFTWARE=1
<alyssa> macc24: doesn't mean much, s/w drivers are special-cased for the winsys
<macc24> 'ok' as in 'the same result'
<warpme_> macc24: nope. with LIBGL_ALWAYS_SOFTWARE=1 segfaults the same....
<alyssa> bbrezillon: why the heck do *{S,U}{8,16}_TO_{S,U}32 exist
<alyssa> that is strictly equivalent to *MKVEC.v2i16 [whatever], #0
<alyssa> The + versions make sense though
<amonakov> hm? the signed variants shouldn't be equivalent?
<HdkR> Saves a constant needing to be encoded?
<alyssa> amonakov: --right, they're not, my bad.
<alyssa> My point still stands for *U16_TO_U32 at least
<alyssa> HdkR: #0 is free in the FMA pipe
<HdkR> So just the signed bit that matters :D
<alyssa> rrright
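The point about zero extension versus sign extension can be checked with a small bit-level model. This is a sketch only: it assumes MKVEC.v2i16 packs its first operand into the low 16 bits and its second into the high 16 bits of a 32-bit register, and the Python function names are hypothetical stand-ins for the instructions, not Mesa code.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mkvec_v2i16(lo, hi):
    """Assumed model: pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def u16_to_u32(x):
    """Zero-extend a 16-bit value to 32 bits."""
    return x & MASK16


def s16_to_s32(x):
    """Sign-extend a 16-bit value to 32 bits (result as an unsigned bit pattern)."""
    x &= MASK16
    if x & 0x8000:
        x -= 0x10000  # reinterpret the 16-bit pattern as a negative number
    return x & MASK32


# Zero extension matches packing with a zero high half for every input.
unsigned_equivalent = all(
    u16_to_u32(x) == mkvec_v2i16(x, 0) for x in range(0x10000)
)

# Sign extension only matches while the sign bit is clear.
first_signed_mismatch = next(
    x for x in range(0x10000) if s16_to_s32(x) != mkvec_v2i16(x, 0)
)

print(unsigned_equivalent, hex(first_signed_mismatch))
```

Under that assumed packing, the unsigned conversion is redundant with a zero high half, while the signed one differs as soon as bit 15 is set, which is the correction made above.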
<alyssa> amonakov: oh hey, a Pidgin user! =)
<amonakov> yep, that I am :)
<alyssa> raster: in case you were wondering about yesterday's bug -- data race reading consecutive staging registers
<alyssa> my fault, but also so, so, bifrost
<raster> alyssa: ugh... race conditions in hw... :|
<alyssa> raster: Reading from staging registers is architecturally defined to be racy.
<alyssa> (I don't remember if it's undefined behaviour or flat out will never work.)
<raster> thats what i mean...
<raster> :)
<raster> ugh. :|
<alyssa> it's not a bug if it's documented right???
<raster> :P
<raster> x86 vs any other arch...
<raster> x86 is documented
<alyssa> [Anyway, I had accounted for this when a single reg is read, but had a subtle issue with vector regs]
<raster> but its ugly
<raster> well ok 6502 was not pretty either
<HdkR> Can confirm ugly x86
<alyssa> I like 6502.
<raster> but 69k, ppc, arm ... all much cleaner and nicer
<urjaman> lmao 69k
<alyssa> 68k?
<urjaman> Nice
<alyssa> but 1k more?
<raster> haha ok 68k :)
<alyssa> 1k more... another k, another destiny...
<amonakov> alyssa: hold up, there's architecture guide with register definitions for Bifrost? where?
<alyssa> amonakov: effectively, mesa's comments ;)
<amonakov> ah, and git commit messages :)
<raster> amonakov: lies! there can be no such thing! :)
<icecream95> z80 > ARM > AArch64 > everything else
<amonakov> [as an application dev looking at Arm Mali tools, the more I look, the less sense it makes]
<alyssa> amonakov: that's correct.
<alyssa> if this stuff was publicly documented, people would realize none of this makes sense, by design ;p
<alyssa> icecream95: z81 > z80
<icecream95> alyssa: Only if z is positive
<alyssa> drat, foiled again
<alyssa> whoops
<alyssa> Pass: 21949, Fail: 13, Warn: 28, Skip: 93, Flake: 3, Duration: 5:12, Remaining: 0
<alyssa> that's like, most of them, right?
<icecream95> macc24: Enabling pstore might help with debugging suspend
* alyssa broke reg spilling again
<alyssa> trying against CI
<alyssa> 24 files changed, 2667 insertions(+), 652 deletions(-)
<alyssa> ughhh
<HdkR> ooo, that's a good one
<alyssa> HdkR: still the scheduler
<HdkR> `15 files changed, 1621 insertions(+), 4 deletions(-)` I've got this fun one today
<HdkR> Doomed to massive commits
<alyssa> a single commit?
<alyssa> mine was across the series :<
<HdkR> Yea, half of it is unit tests though
<alyssa> tests can go in a 2nd commit :p
<HdkR> That would be too easy then :D
<macc24> icecream95: display doesn't power on on resume
<macc24> i can ssh into machine just fine
<alyssa> glmark2 -bterrain is improved on bifrost by schedule patches :)
<alyssa> (9fps to 11fps, at 1080p)
<alyssa> not that 11fps is anything to write home about...
<alyssa> :shader5: - MESA_SHADER_FRAGMENT shader: 1314 inst, 1314 nops, 1314 clauses, 1 threads, 0 loops, 0:0 spills:fills
<alyssa> :shader5: - MESA_SHADER_FRAGMENT shader: 1314 inst, 872 nops, 141 clauses, 1 threads, 0 loops, 0:0 spills:fills
<italove> why so many nops?
<alyssa> italove: 16% reduction in nops, think positive! :p
<alyssa> and nops are from failing to fill the pipeline
<italove> haha, right
<alyssa> (that's before/after the schedule MR)
<alyssa> italove: Technically Midgard has lots of nops too, you just don't notice them
<alyssa> if you have a bundle of 2 instructions, say, { vmul.fmul ... / vadd.fadd ... }, effectively sadd/smul/vlut are all nops
<italove> oh I see, so the disasm doesn't show it because it's not part of the code, it's just something that happens when the scheduling can't fill the bundle with instructions?
<italove> makes sense
<italove> scheduler*
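The two stats lines above can be compared mechanically. The sketch below parses them and computes two ratios; the log does not say how the 16% figure was derived, so treating it as the drop in total issued slots (inst + nops) is an assumption, and the helper names are illustrative.

```python
import re

STATS_RE = re.compile(r"(\d+) inst, (\d+) nops, (\d+) clauses")


def parse_stats(line):
    """Extract instruction, nop and clause counts from a shader stats line."""
    match = STATS_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised stats line: {line!r}")
    inst, nops, clauses = map(int, match.groups())
    return {"inst": inst, "nops": nops, "clauses": clauses}


def reduction(old, new):
    """Relative reduction going from old to new."""
    return (old - new) / old


before = parse_stats(
    ":shader5: - MESA_SHADER_FRAGMENT shader: 1314 inst, 1314 nops, "
    "1314 clauses, 1 threads, 0 loops, 0:0 spills:fills"
)
after = parse_stats(
    ":shader5: - MESA_SHADER_FRAGMENT shader: 1314 inst, 872 nops, "
    "141 clauses, 1 threads, 0 loops, 0:0 spills:fills"
)

nop_reduction = reduction(before["nops"], after["nops"])
slot_reduction = reduction(
    before["inst"] + before["nops"], after["inst"] + after["nops"]
)

print(f"nops: -{100 * nop_reduction:.1f}%, slots: -{100 * slot_reduction:.1f}%")
```

Counting nops alone gives a larger drop than counting all slots, so the quoted percentage depends on which denominator is meant.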
<icecream95> alyssa: G72 already gets 11 fps, so will scheduling make it 13 or 13.44444 fps?
<anarsoul> or 111?
<icecream95> alyssa: How should panpackcolor handle 12-byte formats?
<icecream95> pan_pack_color*
<alyssa> icecream95: 13.35195876432 +/- 1fps, with 30% confidence :p
<alyssa> wdym how?
<icecream95> alyssa: Should the size == 12 case be just a memcpy like size == 16 or does it need something like the size == 6 case?
raster has quit [Quit: Gettin' stinky!]
karolherbst has quit [Ping timeout: 272 seconds]
<alyssa> icecream95: First question is what the heck the size == 6 special case was about...
<alyssa> git blames me
<alyssa> 1b86e0927d4c829209a6134223b0ca5aff771c8d commit message sounds wholly unconvincing
<alyssa> icecream95: For a bit of context, the hardware just does a dumb copy of clear_color (128-bits) straight into the tilebuffer, completely disregarding the pixel format.
<alyssa> Which means pan_pack_color is really a CPU side tilebuffer pixel pack, analogous to lower_framebuffer for blend shaders
<alyssa> I don't think that function has been touched since GenXML.
<alyssa> But if I were writing the function now, the 'right' approach would be:
<alyssa> 1. Get the tilebuffer format, like we do in pan_mfbd.c -- check panfrost_blend_format and if none exists, use an appropriately sized RAW format
<alyssa> 2. For a blendable format, special case packs by tilebuffer format ("Color Buffer Internal Format")
<alyssa> The underlying principle is that the tib can be reinterpreted as RGBX8 and it should still make sense.
<alyssa> So RGBA8 is just a copy, RGB10A2 shifts off the lower 2 bits and then packs them in the top byte, RGBA4 has everything shifted 4, etc.
<alyssa> 3. For a raw format, *we* determine the interpretation (these are the formats requiring blend shaders). So we do the obvious thing and always use the same format as Gallium, which means we can just use Gallium packs...
<alyssa> ...except stuff needs to be replicated to keep the hardware happy for reasons I don't remember.
<alyssa> icecream95: ---I guess to answer your question, probably just a memcpy, and you can probably garbage collect the == 6 case to be just a memcpy too.
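The principle in step 2, that the tilebuffer should still make sense when reinterpreted as RGBX8, can be illustrated with a small sketch. Everything here is hypothetical: the function names are invented, channels are taken as already-quantised integers, and the order in which the leftover low bits are placed in the top byte is an assumption, not the hardware layout or the Mesa implementation.

```python
def pack_rgba4_sketch(r, g, b, a):
    """4-bit channels: each value is shifted up by 4 into its own byte."""
    return bytes((c & 0xF) << 4 for c in (r, g, b, a))


def pack_rgb10a2_sketch(r, g, b, a):
    """10-bit colour channels plus 2-bit alpha.

    The top 8 bits of each colour channel go into bytes 0..2, so reading
    the word as RGBX8 gives a sensible colour. The low 2 bits of each
    channel and the 2-bit alpha share the top byte (bit order assumed).
    """
    r, g, b, a = r & 0x3FF, g & 0x3FF, b & 0x3FF, a & 0x3
    top = (r & 3) | ((g & 3) << 2) | ((b & 3) << 4) | (a << 6)
    return bytes([r >> 2, g >> 2, b >> 2, top])


def as_rgbx8(packed):
    """Reinterpret the first three bytes as 8-bit R, G, B."""
    return tuple(packed[:3])


print(as_rgbx8(pack_rgb10a2_sketch(1023, 0, 512, 3)))
print(as_rgbx8(pack_rgba4_sketch(15, 0, 8, 1)))
```

In both cases the most significant bits of each channel land where an RGBX8 reader expects them, which is the property described above; RGBA8 needs no rearrangement at all.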