alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
nlhowell has quit [Ping timeout: 246 seconds]
nlhowell has joined #panfrost
macc24 has quit [Ping timeout: 265 seconds]
stikonas has quit [Ping timeout: 265 seconds]
macc24 has joined #panfrost
mixfix41_ has joined #panfrost
vstehle has quit [Ping timeout: 246 seconds]
NeuroScr has quit [Quit: NeuroScr]
nerdboy has quit [Ping timeout: 256 seconds]
davidlt has joined #panfrost
robink has quit [Ping timeout: 265 seconds]
robink has joined #panfrost
buzzmarshall has quit [Remote host closed the connection]
vstehle has joined #panfrost
robert_ancell has quit [Ping timeout: 272 seconds]
macc24 has quit [Ping timeout: 265 seconds]
icecream95 has joined #panfrost
kaspter has quit [Quit: kaspter]
davidlt has quit [Remote host closed the connection]
mixfix41_ has quit [Ping timeout: 256 seconds]
davidlt has joined #panfrost
raster has joined #panfrost
cwabbott has joined #panfrost
cwabbott has quit [Quit: cwabbott]
cwabbott has joined #panfrost
cwabbott has quit [Quit: cwabbott]
cwabbott has joined #panfrost
kaspter has joined #panfrost
stikonas has joined #panfrost
stikonas_ has joined #panfrost
stikonas has quit [Ping timeout: 240 seconds]
<robmur01> HdkR: FWIW we've got Radeons in the eMAG and TX2 desktops at work, although that's not overly helpful just now :)
<robmur01> (I don't have the patience for remote desktop shenanigans)
<robmur01> mmind00: since the switch to generic OPP code we fail to actually change the regulator voltage ever, so for boards with kernel-controlled regulators it depends on how close the initial default is to that of the max OPP as to how wonky things get - Chromebooks seem worse off than most
<mmind00> robmur01: so it's a matter of devfreq acting up ... I don't remember seeing cpufreq-related reports though
<robmur01> there have been at least 3 attempts to fix it, but they all seem to get stuck in the mess of clock/regulator/OPP/devfreq optionality
<robmur01> why would panfrost driver changes affect cpufreq? :P
<mmind00> :-P ... I only read tidbits here yesterday, so was wondering whether it was an OPP problem
<mmind00> [aka the TL;DR I deduced from the backlog yesterday was "panfrost broken on Kevin due to frequency scaling" ;-) ]
<robmur01> nope, it's a "panfrost fails to attach regulators to its own OPP table" problem ;)
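[For context on the bug robmur01 describes: the OPP core only scales the regulator together with the clock if the driver registers its supply before adding the OPP table. A minimal sketch of the missing step, using the kernel's dev_pm_opp_set_regulators()/dev_pm_opp_of_add_table() helpers as they exist in kernels of this era; the error handling is simplified and this is not the actual panfrost patch:]

    #include <linux/pm_opp.h>

    /* Register the GPU's "mali" supply with the OPP core *before* adding
     * the OPP table, so dev_pm_opp_set_rate() adjusts regulator voltage
     * together with the clock. Skipping this step leaves the regulator
     * at its boot-time default, which is the wonkiness described above.
     */
    static int panfrost_opp_init_sketch(struct device *dev)
    {
            static const char * const reg_names[] = { "mali" };
            struct opp_table *opp_table;
            int err;

            opp_table = dev_pm_opp_set_regulators(dev, reg_names,
                                                   ARRAY_SIZE(reg_names));
            if (IS_ERR(opp_table))
                    return PTR_ERR(opp_table);

            err = dev_pm_opp_of_add_table(dev);
            if (err) {
                    dev_pm_opp_put_regulators(opp_table);
                    return err;
            }

            return 0;
    }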
macc24 has joined #panfrost
Elpaulo has quit [Quit: Elpaulo]
<shadeslayer> austriancoder: did you manage to get a trace visualized ?
<shadeslayer> austriancoder: https://ui.perfetto.dev/#!/ has a sample chrome/android trace, but if you want something more specific to Panfrost, here you go https://people.collabora.com/~shadeslayer/trace.protobuf
<shadeslayer> It's a little outdated though :)
<icecream95> shadeslayer: I have uploaded a Perfetto trace of glmark2-es2 at https://gitlab.freedesktop.org/snippets/1023
<shadeslayer> icecream95: amazing :)
<daniels> icecream95: thanks! is it useful to you at all?
<HdkR> robmur01: Dang, no Xavier to get ARMv8.1 I guess?
<robmur01> HdkR: pff, once N1SDP boards start turning up in numbers to replace the Junos it'll be v8.2 all the way... and then we wait and hope for Altra (and possibly KunPeng) :D
<HdkR> haha sure, I'm not saying Xavier is a good choice :P
<HdkR> Is the N1SDP board even something that will be available to purchase?
<robmur01> On the more affordable side, I believe Macchiatobins are a popular "stick a GPU card in it" board
<icecream95> daniels: I have been too busy hacking the Midgard instruction scheduler to use it much so far...
<HdkR> Dang, only A72 on those though
<daniels> icecream95: heh, that's cool :) what are you doing in the scheduler ooi?
<HdkR> Alternatively, just need Panfrost to support GL 3.3 :p
<icecream95> HdkR: MESA_GL_VERSION_OVERRIDE=3.3 MESA_GLSL_VERSION_OVERRIDE=330 PAN_MESA_DEBUG=gles3
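[For anyone replaying this later: those are standard Mesa environment-variable overrides, so they simply prefix the client's command line, e.g. (glmark2 chosen purely as an example client):]

    MESA_GL_VERSION_OVERRIDE=3.3 MESA_GLSL_VERSION_OVERRIDE=330 PAN_MESA_DEBUG=gles3 glmark2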
<HdkR> er, Bifrost GL 3.3 for SoCs that support ARMv8.1*
<icecream95> HdkR: I'm sure you could very carefully remove the Bifrost GPU and glue in a Midgard one and everything would still work. :P
<HdkR> Those atomics are too good to live without :)
<robmur01> N1SDP> dunno - as far as I'm aware the original intent was a very-limited-scope CCIX demonstration platform, but I've since heard mumblings that there *might* be some shift to productise it at some point
<HdkR> dang
<robmur01> I wouldn't worry - if you don't care about CCIX then it's basically just 4 2-and-a-bit GHz cores plus a handful of PCIe lanes for ~$n000 ;)
<HdkR> hm
<robmur01> if you want a cheap v8.2 platform right now, consider bashing your head against S905X3
<HdkR> I have a couple of the ODROID-C4 boards, just can't use that for sticking a dGPU on it :P
<HdkR> And A55 isn't a great perf target...
nlhowell has quit [Ping timeout: 256 seconds]
<HdkR> Really I'm probably going to be waiting for an Nvidia Orin dev board, which is sad to say
<robmur01> yeah, Cortex-A55 + usable PCIe is probably an unlikely combination, except perhaps for high-core-count networking stuff
<HdkR> Especially with Orin being slated for 2022 and being Nvidia GPU. So no Bifrost/Valhall fun :P
<daniels> icecream95: heh! that's neat!
<HdkR> Looks like my best bet over the next few months is buying another Xavier and cringing at performance numbers though
* HdkR stacks JITs
<daniels> you can take the boy out of NVIDIA ...
nlhowell has joined #panfrost
<HdkR> haha
<HdkR> Sadly nobody makes Exynos devboards anymore, which would have been fun targets :)
<HdkR> I guess I just have unreasonable performance desires
<daniels> not interested in Snapdragon for perf?
<robmur01> what is the "performance" you speak of? This is 2020, where 'hello world' is 300MB of packaged standalone JavaScript environment...
<daniels> you can actually get pretty reasonably-priced devboards for those now
<HdkR> I'd like the Snapdragon 865 dev board if it was a reasonable Linux target :P
<daniels> we gave up on Exynos long before they gave up on actually selling the silicon to anyone else - a few iterations of us fixing Exynos for mainline in one kernel release, then Samsung breaking it for everyone apart from Tizen in the very next release, was pretty demoralising
<HdkR> yea, I saw that over the years. Such a pain
<robmur01> FWIW, I can vouch for the performance of SDM835 running x86 Thunderbird being somewhat less than "reasonable" (cue stall for ~10s in the middle of typing this...)
<daniels> HdkR: not to try to talk you out of Panfrost or anything, but :P there are patches out there atm for 865 display & GPU
<HdkR> My Snapdragon 850 device destroys my Snapdragon 8cx device in unit test run time. But that's just because of WSL being terrible on the 8cx and running real linux on the 850 device
<HdkR> daniels: Oh yea, I saw that! Going to be a good time soon there
<HdkR> I've already confirmed that Freedreno runs fine in an x86-64 environment; going to need to ensure Panfrost userspace also works in the same environment at some point :)
nlhowell has quit [Ping timeout: 256 seconds]
<HdkR> (Just need to test radv/radeonsi as well)
<robmur01> Just need to track down one of these... (according to wikipedia) https://ark.intel.com/content/www/us/en/ark/products/codename/80013/sofia-lte.html
<HdkR> no, gods no
<robmur01> I shudder to think what the integration of T720 into "everything is PCI" world looks like
nlhowell has joined #panfrost
<HdkR> Main thing is testing the kernel API (In AArch64) can communicate with the userspace (In x86-64), shouldn't be a big deal? :)
<daniels> robmur01: cursed
icecream95 has quit [Quit: leaving]
<HdkR> Sofia, super cursed
nlhowell has quit [Ping timeout: 256 seconds]
<robmur01> whoop-de-do, another year, another GPU tick... how very unexciting :D
<HdkR> Mali-G78, sounds like a good time for more Valhall
<HdkR> Can't tell if the 24-core unit can get near Adreno top-end perf
buzzmarshall has joined #panfrost
<alyssa> "up to some more radical changes such as a complete redesign of its FMA units."
<alyssa> robmur01: "The one key change of the Mali-G78 that Arm had talked about the most, was the change from a single global frequency domain for the whole GPU to a new two-tier hierarchy, with decoupled frequency domains between the top-level shared GPU blocks, and the actual shader cores."
<alyssa> I just read this as more devfreq bugs for us down the line.
<alyssa> Bugs are O(N^2) to complexity IME ;)
stikonas_ has quit [Remote host closed the connection]
macc24 has quit [Read error: Connection reset by peer]
macc24 has joined #panfrost
raster has quit [Quit: Gettin' stinky!]
nlhowell has joined #panfrost
raster has joined #panfrost
<robmur01> yeah, I can't even imagine off-hand how you'd present the OPP tables for that, and I do wonder whether software is expected to forecast shader vs. tiler load for itself :/
<alyssa> bbrezillon: what's the idea for BO_ACCESS_VERTEX/FRAGMENT flags?
<alyssa> oh, for the dep graph later. got it.
<bbrezillon> alyssa: yep, knowing which one is used in the frag job
<bbrezillon> and which ones are used in the !frag job
* alyssa is looking into refactoring away the hash tables so
<bbrezillon> alyssa: what's the key of this hashtab?
<alyssa> bbrezillon: currently, we have a lot indexed by panfrost_bo * with a hash table
<alyssa> when we could get away with using bo->gem_handle as an index into an array/bitset/etc
<alyssa> (see discussion with jekstrand in dri-devel yesterday - this is how it's handled in anv)
<bbrezillon> yep, I saw that one
<bbrezillon> and that sounds like a good idea, indeed
<bbrezillon> sounds similar to the xarray concept we have in the kernel https://www.kernel.org/doc/html/latest/core-api/xarray.html
<bbrezillon> alyssa: maybe something that should be made generic so others can easily re-use the same concept (or is it already the case)
<alyssa> mayhaps
<cwabbott> bbrezillon: there's already a lockless sparse array implementation in mesa
<alyssa> (my branch uses that)
<bbrezillon> cool
<alyssa> So many corner cases though
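[For reference, the implementation cwabbott is pointing at is util_sparse_array in Mesa's src/util. A minimal sketch of the refactor being discussed, indexing by bo->gem_handle instead of hashing the panfrost_bo pointer; struct bo_access and the surrounding names are illustrative, not the actual branch:]

    #include "util/sparse_array.h"

    /* Per-BO tracking slot; GEM handles are small, densely-allocated
     * integers, which is what makes direct array indexing viable here.
     */
    struct bo_access {
            uint32_t flags; /* e.g. read/write, vertex-tiler vs. fragment */
    };

    static struct util_sparse_array bo_accesses;

    static void
    bo_tracking_init(void)
    {
            /* 256 elements per node is an arbitrary illustrative choice */
            util_sparse_array_init(&bo_accesses, sizeof(struct bo_access), 256);
    }

    static struct bo_access *
    bo_access_for(const struct panfrost_bo *bo)
    {
            /* Replaces _mesa_hash_table_search(ht, bo); lockless, and the
             * slot is allocated on first use.
             */
            return util_sparse_array_get(&bo_accesses, bo->gem_handle);
    }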
cwabbott has quit [Quit: cwabbott]
cwabbott has joined #panfrost
<alyssa> 3 files changed, 33 insertions(+), 52 deletions(-)
<alyssa> So far so good :-)
<alyssa> bbrezillon: I don't see where PAN_BO_ACCESS_FRAGMENT is read, though
<alyssa> (It looks like we only track deps on a per-batch level)
<alyssa> and batch_submit_ioctl only uses the deps for the v/t side
* alyssa wonders if we're losing perf there
<bbrezillon> alyssa: yep, it's a per-batch thing
<bbrezillon> inter-batch dep is handled through FBs
<bbrezillon> alyssa: the dep of a fragment job, is the V/T job
<bbrezillon> which already has deps on other jobs defined
<bbrezillon> so the frag job indirectly depends on the V/T deps
<bbrezillon> is that wrong?
<bbrezillon> note that BO_ACCESS flags are here for resource refcounting, not deps
raster has quit [Quit: Gettin' stinky!]
marex-cloud has quit [Ping timeout: 256 seconds]
<bbrezillon> well, they also act as implicit deps, since the kernel driver waits for all referenced BOs to be idle before scheduling a job
<alyssa> bbrezillon: Suppose batch A renders a cat to FBO #1.
<alyssa> Then batch B renders a fullscreen quad (so no deps in vertex/tiler) which in the fragment shader textures from FBO #1 to do some post-processing to make the cat rainbow and bounce and say nyan.
<alyssa> Ideally we would have:
<alyssa> VERTEX: [ A ][ B ]
<alyssa> FRAGME:       [ A ][ B ]
<alyssa> since the vertex job of B does not depend on the fragment job of A, they can run concurrently
<alyssa> If I understand the code right, though, it would actually end up being
<alyssa> VERTEX: [ A ]       [ B ]
<alyssa> FRAGME:       [ A ]       [ B ]
<alyssa> which is slower due to the unnecessary dep.
<alyssa> The ACCESS flags would signal that that's unnecessary, but I don't see how the kernel would know since it just sees a dep of B on A, and it just sees B accesses a BO written from A (the FBO)
<bbrezillon> right, I forgot that the tiler job was not responsible for texture sampling
<bbrezillon> so we could indeed remove this dep
<alyssa> (it's a bit confusing -- TILER jobs specify all the fragment shaders but they don't actually run until FRAGMENT)
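[To make the dependency being discussed concrete, here is a rough sketch of the filtering the driver could do when building the vertex/tiler submit's BO list, using the PAN_BO_ACCESS_* flags already tracked per batch; the helper itself is illustrative, not existing Mesa code:]

    /* A BO touched only by the fragment job (e.g. a texture sampled in
     * the fragment shader) imposes no dependency on the vertex/tiler
     * submit, so batch B's vertex work need not wait for batch A's
     * fragment work.
     */
    static bool
    bo_needed_for_vertex_tiler(uint32_t access_flags)
    {
            if ((access_flags & PAN_BO_ACCESS_FRAGMENT) &&
                !(access_flags & PAN_BO_ACCESS_VERTEX_TILER))
                    return false;

            return true;
    }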
<Lyude> alyssa: you working on midgard perf stuff?
<alyssa> Lyude: Yeah :-)
<bbrezillon> yep, I think last time we discussed that you said tiler jobs were referencing textures, which is why I thought there was a hard dep here
<alyssa> yeah, it's tricky. the TILER job does reference it in the sense that the job has the pointer, but it doesn't actually access it
<bbrezillon> if that's not the case, then we should remove the explicit dep on BOs flagged with BO_ACCESS_FRAGMENT only
<bbrezillon> we'd still pass the BO to the BO list, that's not a problem
<bbrezillon> we can just get rid of the dep
<alyssa> would that work if the kernel does implicit deps from the BO list..?
<bbrezillon> anyway, none of that will help perf if the kernel is not patched to support skipping the implicit waits on BOs
<bbrezillon> which was in the pipe when I submitted the batch pipelining stuff
<alyssa> I'm not convinced we need to specify that texture in the vertex/tiler BO list at all, though
<alyssa> When I say it's a pointer, I literally just mean it's a pointer. It shouldn't ever get dereferenced by the GPU until the corresponding frag job executes.
<alyssa> Not sure if that's a kosher use of the BO list, but it should work at this point
<alyssa> and then it becomes UABI by default or something
<alyssa> robher: *ducks*
<bbrezillon> alyssa: I wouldn't worry about that, the BO is still referenced by the frag job
<bbrezillon> which is executed after the tiler job is done
<bbrezillon> so omitting the BO in the tiler BO list shouldn't be a problem
<alyssa> agreed, just not sure it's totally intended :)
<bbrezillon> probably not
<bbrezillon> but adding a flag to skip the implicit deps would also be a good thing
<bbrezillon> I mean, etnaviv has that too
<alyssa> Mm
<bbrezillon> don't you have cases where 2 jobs read from the same BO but never write it?
<bbrezillon> clearly we don't want things to be serialized in this case
<bbrezillon> but that's what happens
<alyssa> ah, right. good point
<bbrezillon> we have all the pieces to skip this unneccessary serialization already
<bbrezillon> we just need this flag (and a lot of testing to make sure it doesn't regress things :))
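[The etnaviv precedent mentioned above is the ETNA_SUBMIT_NO_IMPLICIT submit flag. A hypothetical Panfrost equivalent could look roughly like this; the flag name and placement are a sketch, only the struct layout matches the existing panfrost_drm.h UABI:]

    /* Hypothetical flag: userspace promises all dependencies are expressed
     * explicitly via in_syncs, so the kernel may skip the implicit wait on
     * the reservation objects of every BO in bo_handles.
     */
    #define PANFROST_SUBMIT_NO_IMPLICIT     (1 << 0)        /* not real UABI */

    struct drm_panfrost_submit {
            __u64 jc;               /* GPU address of the job chain */
            __u64 in_syncs;         /* explicit dependencies */
            __u32 in_sync_count;
            __u32 out_sync;         /* completion fence */
            __u64 bo_handles;       /* still listed for residency/refcounting */
            __u32 bo_handle_count;
            __u32 requirements;     /* PANFROST_JD_REQ_*; the new flag could live here */
    };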
<alyssa> testing? don't you mean pushing to master and waiting for the bug reports?
<alyssa> (thanks icecream95 ;P)
<bbrezillon> :D
<alyssa> bbrezillon: As an aside, I notice we spend serious CPU time in the SUBMIT ioctl.. wonder what's up with that
<alyssa> 9.21% on this trace in panfrost_ioctl_submit
<alyssa> within that 4.01% in panfrost_job_push, 1.23% in drm_gem..lookup, 1.18% in gem_mapping_get
<alyssa> 1.26% waiting on the wake up lock in drm_sched_wakeup
nerdboy has joined #panfrost
<alyssa> Maybe we're using way too many BOs
stikonas has joined #panfrost
<robher> I seem to recall some discussion on multiple readers. Related to resv_obj's I think.
<bbrezillon> but I can't find it
<bbrezillon> oh, no actually it was about flagging access types on BO
<bbrezillon> robher, alyssa: ^
<alyssa> DRM_IOCTL_PANFROST_CREATE_BO failed: No space left on device
<alyssa> ^ this seems bad
<bbrezillon> BO leak :)
<alyssa> yeah, but... why..
<bbrezillon> that's where shadeslayer's BO labeling could help :)
<alyssa> indeed
<alyssa> (^ the patch)
<alyssa> Oh wait
<alyssa> er no
<alyssa> also drm_syncobj_wait_ioctl is eating tremendous CPU uhh
<alyssa> am I doing something silly
NeuroScr has joined #panfrost
* alyssa is simplifying along..
<alyssa> TBD if it helps perf but shouldn't hurt, and should make things easier to follow.
<alyssa> and thus easier to fix for real perf things
nlhowell has quit [Ping timeout: 240 seconds]
<bbrezillon> alyssa: looks good to me (s/gaurantees/guarantees/)
<alyssa> bbrezillon: Thats the patch that's breaking the world :-)
davidlt has quit [Ping timeout: 256 seconds]
<bbrezillon> alyssa: well, it looked good :)
<alyssa> bbrezillon: :D
<alyssa> (`bo-v4` is what I'm working on. Nothing dramatic yet.)
<bbrezillon> alyssa: you probably want to release the BOs as soon as the fence is signalled
<alyssa> i'll try that
<alyssa> bbrezillon: nope, still not happy..
<alyssa> I'm suspicious of u_blitter interactions
<bbrezillon> alyssa: so BOs are not released as they should be
<bbrezillon> meaning that some fences are never signalled
<bbrezillon> or never tested
<bbrezillon> wait, what's the data of the hashtab?
<bbrezillon> don't we have a circular dep here (BO entry holding a reference on the fence which holds a reference on the BO)?
<alyssa> uhm
<alyssa> bbrezillon: So we do. :|
<bbrezillon> yeah, it's not that simple I fear
<alyssa> how did this work before :p
<bbrezillon> because there was no circular dep :P
<alyssa> thanks
<alyssa> :p
<alyssa> oh. right. fine.
<alyssa> :p
<alyssa> Okay, what if I keep the structure as is, but have a set of gem handles on the fence (no referencing)?
<alyssa> dereference at the usual time
<alyssa> but keep an in_flight refcnt on the BO
<alyssa> is that still circular
<bbrezillon> I was about to propose having weak refs on the BOs inserted in the ->accessed_bos hashtab
<alyssa> (I got rid of ->accessed_bos a few patches ago, whoops?)
<alyssa> but yes, I think that would also work
<bbrezillon> yep, but you did replace it by something else
<alyssa> a929ad7adacbd83f402ef890e0cf92043389a1e4
<bbrezillon> which holds a ref on the BOs inserted there
<bbrezillon> yes, so you need to be very careful here, since I'd expect the BO users to have a ref on the BOs they use
<alyssa> right
<alyssa> this is why i write compilers :p
<bbrezillon> and at the same time, the readers/writer arrays hold refs to those users
<alyssa> we can probably get rid of those refs..?
<bbrezillon> maybe :)
<alyssa> or not argh
<bbrezillon> can't we move to those smart-arrays without changing the struct relationships?
<alyssa> smart-arrays?
<bbrezillon> well, the replacement for hashtables
<alyssa> Oh, right
<alyssa> Is this about getting rid of accessed_bo or fussing with the BO array?
<bbrezillon> I feel like addressing all problems at once is complicating things quite a bit
<alyssa> I do have that problem quite a bit yes
<bbrezillon> it's mostly about isolating changes so we can easily debug each problem independently
<bbrezillon> and yes, having direct relationship between accesses and BOs is likely to cause many circular dep issues
* alyssa nods
<alyssa> Trying a much smaller subproblem then, let's see.
<bbrezillon> we probably want to address both ultimately
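[A rough illustration of the cycle-breaking scheme converged on above, with illustrative struct layouts rather than the actual branch: the fence holds bare GEM handles (weak references), while each BO carries a separate in-flight count, so no strong pointer ever runs from fence back to BO.]

    #include <stdint.h>

    struct sketch_fence {
            uint32_t *bo_handles;   /* weak: raw handles, no BO refs taken */
            unsigned bo_handle_count;
            /* ... syncobj handle, own refcount, etc. ... */
    };

    struct sketch_bo {
            uint32_t gem_handle;
            int refcnt;             /* userspace references */
            int in_flight;          /* submitted jobs not yet signalled */
    };

    /* On fence signal, drop the in-flight counts; a BO with refcnt == 0
     * and in_flight == 0 is then safe to release to the BO cache. The BO
     * may reference the fence, but never vice versa, so no cycle forms.
     */
    static void
    sketch_fence_signalled(struct sketch_fence *f,
                           struct sketch_bo *(*lookup_bo)(uint32_t handle))
    {
            for (unsigned i = 0; i < f->bo_handle_count; i++) {
                    struct sketch_bo *bo = lookup_bo(f->bo_handles[i]);
                    if (bo)
                            bo->in_flight--;
            }
    }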
nlhowell has joined #panfrost
<alyssa> okay. replicated the experiment with a much simpler change set. still leaking, so there's some other issue somewhere.
<bbrezillon> alyssa: same branch?
<alyssa> bbrezillon: just pushed to `bob`
<alyssa> I am ahem creative with branch names ;P
<alyssa> Indeed.. there are BOs hit in free_batch's bo check that are not hit the right number of times in is_signaled's fence_bo check
<alyssa> so I guess some fences aren't being signalled
<alyssa> (or checked)
<bbrezillon> I'd have to look at it more closely, but I fear attaching the readers/writer directly to the BO has an impact on refcounting
<alyssa> Possibly.
<alyssa> rebasing w/o my other changes
<alyssa> nope
<bbrezillon> do you have fence leaks without your changes?
<alyssa> let's find out.
<alyssa> Yes.
<bbrezillon> duh
<alyssa> Er, no
<alyssa> Er, yes, just somewhat slower.
<alyssa> duh?
<alyssa> answer is yes
<bbrezillon> leak as in valgrind reporting a leak, or as in the number of fences keeps increasing
<alyssa> In -bideas that bounces between 0 and 1, which is expected
<alyssa> In -bterrain (which makes heavy use of u_blitter), that increases unbounded.
<alyssa> (the leaks I'm chasing are in terrain)
<alyssa> also leaks very quickly in -bdesktop, which does not mipmap afaik, but uses u_blitter for wallpapering, as does terrain
<alyssa> so I bet it's wallpapering
<alyssa> (but then I'd expect to see the leak in weston, and I don't)
<alyssa> also seen in -brefract which doesn't use u_blitter, so that's a red herring
<alyssa> it's just FBO stuff then
<alyssa> (weston doesn't need FBOs)
<alyssa> --off-screen doesn't reproduce, though. So it's texturing from an FBO that's a problem
<alyssa> read-after-write
<bbrezillon> can you run valgrind on it?
<alyssa> any flag to get valgrind to be useful here?
<bbrezillon> dunno
<bbrezillon> I'm not super familiar with valgrind
<bbrezillon> other than the basics
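[For the record, a commonly useful invocation for CPU-side leaks like these fences, using standard valgrind options and the -bterrain reproducer from above; note that refcount leaks often show up as "still reachable" rather than "definitely lost", which is why --show-leak-kinds=all matters:]

    valgrind --leak-check=full --show-leak-kinds=all glmark2-es2 -bterrain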
<bbrezillon> you can also check if the leaked refs are those where a fence is explicitly requested
<bbrezillon> (*fence not NULL in ctx->flush())
<bbrezillon> and is it really read-after-write that's at fault? it could also be bad refcounting when we deal with inter-batch deps
<alyssa> sure
<alyssa> fence == NULL in flush() for this reproducer
<alyssa> Got it.
<bbrezillon> I'm curious :)
<alyssa> bob-v2
<alyssa> spoiler alert: access->writer
<bbrezillon> ouch
<bbrezillon> hm, wait
<bbrezillon> no, that's correct
<alyssa> The original or the fix?
<bbrezillon> the fix
<alyssa> cool
<alyssa> Better question - why does an unrelated change in the BO cache cause fences to leak again
<bbrezillon> I guess I'll find the answer in bob-v3 :)
<alyssa> :)
<alyssa> maybe the issue in terrain now isn't a leak, just a legitimate OOM :|
<HdkR> Time to swap your GPU buffers to disk :D
NeuroScr has quit [Remote host closed the connection]
NeuroScr has joined #panfrost
cphealy has quit [Remote host closed the connection]
<alyssa> HdkR: Oof.
<alyssa> Or time to switch gears to Bifrost :P
warpme_ has quit [Quit: Connection closed for inactivity]
macc24 has quit [Quit: WeeChat 2.8]
nerdboy has quit [Ping timeout: 265 seconds]
<alyssa> chewitt: Mind giving alyssa/mesa:bi-format a whirl on Kodi?
<alyssa> I don't have a convenient way to test kodi rn but iirc we saw this issue with midgard way back when and a similar patchset fixed it
cwabbott has quit [Ping timeout: 272 seconds]
<alyssa> although I'm still seeing corruption in weston in a few places so maybe tomorrow night instead ;)