alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
Moe_Icenowy has joined #panfrost
NeuroScr_ has joined #panfrost
AreaScout_ has quit [Ping timeout: 260 seconds]
AreaScout_ has joined #panfrost
megi1 has joined #panfrost
NeuroScr has quit [Read error: Connection reset by peer]
MoeIcenowy has quit [Quit: ZNC 1.7.2+deb3 - https://znc.in]
megi has quit [Ping timeout: 272 seconds]
Moe_Icenowy is now known as MoeIcenowy
NeuroScr_ is now known as NeuroScr
AreaScout_ has quit [Ping timeout: 240 seconds]
AreaScout_ has joined #panfrost
austriancoder has quit [Read error: Connection reset by peer]
austriancoder has joined #panfrost
Yardanico has quit [Quit: No Ping reply in 180 seconds.]
stikonas has quit [Remote host closed the connection]
stikonas has joined #panfrost
Yardanico has joined #panfrost
NeuroScr has quit [Quit: NeuroScr]
rhyskidd_ has joined #panfrost
nerdboy has quit [Ping timeout: 265 seconds]
rhyskidd has quit [Ping timeout: 260 seconds]
rhyskidd_ is now known as rhyskidd
stikonas has quit [Remote host closed the connection]
NeuroScr has joined #panfrost
bbrezillon has quit [Ping timeout: 240 seconds]
bbrezillon has joined #panfrost
nerdboy has joined #panfrost
vstehle has quit [Ping timeout: 260 seconds]
megi1 has quit [Quit: WeeChat 2.7]
megi has joined #panfrost
NeuroScr has quit [Quit: NeuroScr]
buzzmarshall has quit [Quit: Leaving]
<tlwoerner> wrt a good name... the two hardest things in computer science: https://www.reddit.com/r/ProgrammerHumor/comments/6hbrfg/the_two_hardest_things_in_computer_science_are/
NeuroScr has joined #panfrost
vstehle has joined #panfrost
nerdboy has quit [Ping timeout: 272 seconds]
warpme_ has joined #panfrost
JaceAlvejetti has quit [Quit: No Ping reply in 180 seconds.]
JaceAlvejetti has joined #panfrost
guillaume_g has joined #panfrost
davidlt has joined #panfrost
megi has quit [Ping timeout: 268 seconds]
<icecream95> Are performance counters global or per-process?
toggleton has quit [Ping timeout: 265 seconds]
<bbrezillon> icecream95: global
<bbrezillon> and the kernel iface is apparently broken in 5.4+ (got a report yesterday)
<icecream95> bbrezillon: It does seem broken...
yann|work has quit [Ping timeout: 268 seconds]
davidlt has quit [Remote host closed the connection]
megi has joined #panfrost
icecream95 has quit [Ping timeout: 252 seconds]
yann|work has joined #panfrost
chewitt has joined #panfrost
NeuroScr has quit [Quit: NeuroScr]
<tomeu> bbrezillon: alyssa any ideas on why, when zeroing BOs in panfrost_bo_mmap, we fail some deqp tests?
raster has joined #panfrost
chewitt has quit [Ping timeout: 272 seconds]
raster has quit [Quit: Gettin' stinky!]
raster has joined #panfrost
raster- has joined #panfrost
raster has quit [Ping timeout: 260 seconds]
raster- has quit [Client Quit]
raster has joined #panfrost
chewitt has joined #panfrost
<alyssa> tomeu: which tests?
<tomeu> alyssa: a lot :)
<tomeu> I can easily check
raster has quit [Quit: Gettin' stinky!]
<bbrezillon> tomeu: well, if the test is doing a readpixel, and you zero the buf at mmap() time, that means readpixel returns 0
<bbrezillon> unless the buf was previously mmap-ed
<tomeu> hmm, should check that
<tomeu> but I would expect more tests to fail if all readpixels were broken
<bbrezillon> zero-ing at mmap() is probably not what you want anyway
<bbrezillon> I guess you want to do that when the BO is allocated or picked from the cache
<tomeu> yeah, though one needs to mmap for that
<tomeu> something to look at
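(A minimal sketch of the zero-at-allocation idea bbrezillon suggests above, in C; all helper names here are hypothetical stand-ins, not the actual panfrost functions:)

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct device;
    struct bo { void *cpu; size_t size; bool zeroed; };

    /* Hypothetical stand-ins for the real BO cache/alloc/mmap helpers: */
    struct bo *bo_cache_fetch(struct device *dev, size_t size);
    struct bo *bo_alloc(struct device *dev, size_t size);
    void bo_mmap(struct bo *bo);

    /* Zero a BO once, when it is freshly allocated or recycled from the
     * BO cache, instead of on every mmap (which clobbers live contents
     * such as pending readpixel results). */
    static struct bo *
    bo_create_zeroed(struct device *dev, size_t size)
    {
        struct bo *bo = bo_cache_fetch(dev, size);
        if (!bo)
            bo = bo_alloc(dev, size);

        if (bo && !bo->zeroed) {
            bo_mmap(bo);         /* a CPU mapping is needed to write */
            memset(bo->cpu, 0, bo->size);
            bo->zeroed = true;
        }
        return bo;
    }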
<tomeu> alyssa: wonder if the scoreboard file shouldn't also be built with -O3
<tomeu> looks a bit heavy on the profiles
<tomeu> bbrezillon: the memcpys at submit time are still a bit heavy
<tomeu> 6.42% of total cpu time
<tomeu> wonder if we could do that in one go
<tomeu> also, looks like transient buffers should be a bit bigger
<bbrezillon> tomeu: mmap()+memset(0)+munmap()
<bbrezillon> for all !GROWABLE bufs
<tomeu> well, mmaps and munmaps are also prominent in the profiles
<bbrezillon> no, I meant for testing
<bbrezillon> to see if your bug persists
<tomeu> oh, ok
<tomeu> could be interesting indeed to have a flag to test occasionally with and find state leaks
<bbrezillon> state leaks?
<bbrezillon> is that what you were trying to do?
<MastaG> great steaks
<alyssa> ooooooo
* alyssa makes eerie noises
<bbrezillon> tomeu: Re: 'memcpys at submit time' => you mean copying the shadow BO into the final one?
<tomeu> bbrezillon: yep
<tomeu> bbrezillon: was thinking that uninitialized memory could be the cause for some of the intermittent failures
<tomeu> or some failures that depend on the order in which tests are run
<HdkR> Spooky scary state leaks
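(The mmap()+memset(0)+munmap() test bbrezillon describes could hang off a debug flag, roughly like this; PAN_DEBUG_ZERO_BOS is a made-up name, purely illustrative:)

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    /* Debug aid: scrub every !GROWABLE BO right after allocation, so tests
     * that depend on stale BO contents (state leaks) fail deterministically
     * instead of intermittently. */
    static void
    debug_scrub_bo(int fd, size_t size, off_t mmap_offset)
    {
        if (!getenv("PAN_DEBUG_ZERO_BOS"))  /* hypothetical env flag */
            return;

        void *cpu = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, mmap_offset);
        if (cpu == MAP_FAILED)
            return;

        memset(cpu, 0, size);
        munmap(cpu, size);
    }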
<alyssa> meanwhile bit identical output, no fault, just failing deqp test
raster has joined #panfrost
daniels has quit [Ping timeout: 265 seconds]
anarsoul|c has quit [Ping timeout: 260 seconds]
jstultz has quit [Ping timeout: 246 seconds]
<bbrezillon> robmur01: do you know what JOB_BUS_FAULT means?
<bbrezillon> I get it when trying to dump the perf counters
<bbrezillon> hm, I seem to get this fault only when the GPU is idle
robher has quit [Ping timeout: 260 seconds]
shadeslayer has joined #panfrost
jstultz has joined #panfrost
daniels has joined #panfrost
robher has joined #panfrost
<robmur01> bbrezillon: not sure off-hand, but I'd assume "bus fault" would be the equivalent of a CPU external abort, i.e. it tried to access a physical address with nothing behind it, or something that wasn't powered up
anarsoul|c has joined #panfrost
<tomeu> narmstrong: should the meson drm driver be setting the composite's connector status to connected if nothing is actually connected to it?
<bbrezillon> robmur01: figured it out
<bbrezillon> we were always using AS 0 without reserving it
<tomeu> bbrezillon: alyssa: btw, looks like copying the cmdstream before submission makes things quite a bit slower, rather than faster
<tomeu> at least with glmark2
<alyssa> interesting.
<alyssa> I'm not sure if that's surprising...?
<tomeu> well, I think we need to understand why mmap and unmap operations are so slow, and why writing to the transient BOs also takes so long (and probably to texture BOs)
<tomeu> because it's really limiting the speed at which the CPU can feed the GPU
* alyssa nods
<tomeu> the theory was that every write to the transient BOs caused a cache flush and that grouping them could make everything faster, but that doesn't seem to be true
<narmstrong> tomeu: we can't know if it's connected or not
<tomeu> narmstrong: can't the state be unknown or so?
<alyssa> tomeu: for write-combine memory, right..
<narmstrong> tomeu: maybe, never tested
<tomeu> narmstrong: because of being listed before hdmi, lots of naive kms programs break
<alyssa> That might still be true, but the overhead of a cache flush is much less than copying everything entirely
<alyssa> And it might be that we're already writing the big buffers in order enough
<narmstrong> tomeu: i know
<tomeu> alyssa: guess I should check that I only memcpy what I strictly need
<narmstrong> I was told these programs should be fixed...
<narmstrong> they should prioritize the already set pipeline by default then find connectors/crtc/... if not possible
<tomeu> poor meson :)
<tomeu> ah, makes sense
<tomeu> lots of work
<narmstrong> yep
<daniels> connector_state_unknown does exist
<narmstrong> no idea how these naive programs will behave if the only available connector is in unknown state
<narmstrong> but at some point idc, nobody uses composite...
<daniels> make it connector_state_disconnected then :P
<tomeu> better, delete the code :p
<narmstrong> ah ah
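(What daniels suggests would look roughly like this in the kernel driver: a connector .detect hook for the composite output that stops reporting "connected" unconditionally. The function and struct names are illustrative, not the actual meson code:)

    #include <drm/drm_connector.h>

    /* With no load-detect circuit on the CVBS output, we genuinely cannot
     * tell whether anything is attached, so report that honestly instead
     * of claiming "connected". */
    static enum drm_connector_status
    meson_cvbs_detect(struct drm_connector *connector, bool force)
    {
        return connector_status_unknown;
        /* ...or connector_status_disconnected, per daniels' second idea */
    }

    static const struct drm_connector_funcs meson_cvbs_connector_funcs = {
        .detect = meson_cvbs_detect,
        /* ...remaining hooks elided... */
    };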
<tomeu> alyssa: bbrezillon: copying only the transient_offset bytes of the last transient BO in a batch cut the difference by 50%, but it's still slower
<bbrezillon> tomeu: if the cmdstream is only written, that's normal
<bbrezillon> I guess
<bbrezillon> buffers are mapped write-combine
<bbrezillon> which means small writes will be combined, plus each cmdstream entry is 64 bytes minimum
<bbrezillon> so we never have really small writes
<tomeu> oh, now I got what WC really means
<robmur01> bear in mind that "write-combine" is utter nonsense on Arm - ask for it and you get non-cacheable
<bbrezillon> the problem is if bo->cpu is read from
kherbst has joined #panfrost
karolherbst has quit [Disconnected by services]
<bbrezillon> robmur01: oh, there's no write-cache?
kherbst is now known as karolherbst
<tomeu> ah ok, that's what my patch avoids I guess
<tomeu> so we want cacheable and a way to flush the caches?
<robmur01> bbrezillon: well, it should still be bufferable, but the effect of that is more dependent on the specific CPU implementation and certainly doesn't guarantee the specific optimisations of x86's write-combine type
<tomeu> (before handling to the GPU)
<robmur01> and reading from non-cacheable will always suck
<bbrezillon> tomeu: or just avoid reading back from the cmdstream once it reached the BO
<robmur01> the one advantage is it makes things easier to reason about
<bbrezillon> well, being bufferable matters when you have small writes
<bbrezillon> I thought there was a specific attribute for that on ARM
<MoeIcenowy> BTW how does Midgard reload depth/stencil buf?
<MoeIcenowy> on Utgard strangely it uses the same shader for color buf reload and Z buf reload
<bbrezillon> MoeIcenowy: not yet
<MoeIcenowy> but another shader for S buf reload
<MoeIcenowy> bbrezillon: I mean blob
<bbrezillon> oh, then yes
<MoeIcenowy> and it uses a strange sampler format when reloading Z/S
<MoeIcenowy> bbrezillon: does Midgard blob have the same situation?
<robmur01> back in the pre-VMSA days there were separate "B" and "C" bits; in VMSA I think the Normal type is inherently bufferable (as opposed to Device/Strongly-Ordered types which aren't)
<MoeIcenowy> (BTW Midgard sampler format seems to be more orthogonal than Utgard?
<bbrezillon> MoeIcenowy: I didn't try Z32, but for Z24{S,X}8 they use a float or uint sampler (depending on the situation)
<MoeIcenowy> why float?
<MoeIcenowy> uint seems normal
<bbrezillon> guess it has to do with precision loss when doing Z24_UNORM -> Z32_FLOAT conversion when you stay in 32-bit mode
<bbrezillon> but I'm not sure
<bbrezillon> when sampling only the depth, they use a float sampler, and then 'fdot sample, constant'
<MoeIcenowy> BTW we're guessing that Utgard also supports writing to Z/S buf in fragment shader (because it uses FS to reload Z/S buf)
<bbrezillon> robmur01: ok, good to know
<MoeIcenowy> bbrezillon: what's the value of the constant?
<bbrezillon> I don't have it, it's something like {256*256 / 0xffffff , 256/0xffffff , 1/0xffffff}
<MoeIcenowy> ah... so is it sampled as a RGBX8888?
<bbrezillon> yep, rgba, but each component is a float, not an integer
<MoeIcenowy> integer input float output?
<MoeIcenowy> ah I mean the value gets normalized
<bbrezillon> I don't remember the details, I'd have to look at the trace
<bbrezillon> anyway, we ended up sampling as R32UI, and doing bfe+fmul64 to get the normalized value
<bbrezillon> which is how u_blitter does it
<bbrezillon> except you need fp64 support for that
<MoeIcenowy> BTW I'm thinking how to do dirty experiments on mali
<MoeIcenowy> bbrezillon: we utgard people don't even have fp32
<MoeIcenowy> and no int
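(To make the 'fdot sample, constant' trick concrete: sampling the three Z24 bytes as normalized RGB8 and reconstructing depth comes out roughly as below, in C. The channel order and exact constants are inferred, since bbrezillon was quoting the constant from memory:)

    /* Each sampled channel is byte/255.0; undo that normalization and
     * reassemble the 24-bit value, renormalized over [0, 1]. */
    static float
    z24_from_unorm_rgb(float hi, float mid, float lo)
    {
        const float k_hi  = (255.0f * 65536.0f) / 16777215.0f; /* 0xffffff */
        const float k_mid = (255.0f * 256.0f)   / 16777215.0f;
        const float k_lo  =  255.0f             / 16777215.0f;

        return hi * k_hi + mid * k_mid + lo * k_lo;  /* the fdot */
    }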
<tomeu> bbrezillon: why are we allocating WC if all the jobs have JS_CONFIG_START_FLUSH_CLEAN_INVALIDATE and JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE anyway?
<robmur01> tomeu: because the GPU caches are not the CPU caches ;)
<tomeu> those only refer to the GPU caches?
tasinofan has quit [Ping timeout: 246 seconds]
<robmur01> anything on the GPU only affects GPU caches (Midgard doesn't support full ACE coherency at all)
<tomeu> hmm, so maybe the best we can do is to make sure that we don't read from the cmdstream BOs?
<tomeu> and leave them WC
<robmur01> in principle you can use cacheable CPU mappings and try to convince the DMA API to help at the points where things are logically transferred from CPU to GPU and back again
<robmur01> but it does open up a whole new world of edge cases and awkwardness (see adreno)
<tomeu> hmm, guess PANFROST_BO_RO could help with making sure we don't read from it
tasinofan has joined #panfrost
tasinofan has quit [Client Quit]
<tomeu> or rather, PANFROST_BO_WO :)
tasinofan has joined #panfrost
<robmur01> shame we can't enforce that ;)
<tasinofan> Hi all
<tasinofan> Can someone provide a panfrost-specific xorg.conf section example? My
<tasinofan> Xorg works fine on khadas vim2, but I have to explicitly delete or
<tasinofan> rename /dev/dri/card1, which is the vpu card, because it is chosen by
<tasinofan> modeset even if falling back to /dev/dri/card0, which is the gpu card.
<tasinofan> Basically I want to try to avoid this card0/card1 mismatch by explicitly
<tasinofan> providing the card0 coordinates in the appropriate xorg.conf section:
<tasinofan> falling back to /sys/devices/platform/soc/d0000000.apb/d00c0000.gpu/drm/card0
<tasinofan> (II) modeset(G0): using drv /dev/dri/card1
<tomeu> robmur01: no? was thinking of not setting IOMMU_READ
<alyssa> MoeIcenowy: "BTW
<alyssa> I'm
<robmur01> tomeu: I think you still want the *GPU* to be able to read BOs ;)
<alyssa> ...." thinking
<tomeu> robmur01: well, but can't we set that in the CPU's MMU?
<robmur01> it's on the CPU that you can't have write-only permission (well, at stage 1 anyway)
<tomeu> grr
<tomeu> ok, shouldn't be that much work to manually audit the source code
<robmur01> I can't remember if we have KVM support for stage 2 memory protection implemented, or whether it was just talked about...
<alyssa> tomeu: robmur01: We can enforce that in mesa
<alyssa> without the panfrost kernel's help
<alyssa> just mmap with PROT_WRITE but without PROT_READ
<alyssa> and if you try to read it should give a fault yes?
<alyssa> or does arm architecturally not do that
<robmur01> write implies read at stage 1, that's the point
<alyssa> I see.
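(For reference, alyssa's idea above would have been a one-liner; it just cannot work on Arm, where, as robmur01 says, stage-1 write permission implies read permission, so the kernel hands back a readable mapping anyway:)

    #include <sys/mman.h>
    #include <sys/types.h>

    /* Attempt a write-only CPU mapping of a BO. On Arm this behaves
     * exactly like PROT_READ | PROT_WRITE, so stray reads never fault. */
    void *
    map_bo_write_only(int fd, size_t size, off_t mmap_offset)
    {
        return mmap(NULL, size, PROT_WRITE, MAP_SHARED, fd, mmap_offset);
    }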
<alyssa> tomeu: At any rate, the main place I think you'll find us reading the cmdstream is pan_scoreboard.c since I was lazy
<alyssa> That file could stand a good refactor
<tomeu> is there any good reason for stuff in that file to appear high in the profiles?
<tomeu> (other than reading from WC BOs :p)
<alyssa> tomeu: It's a hot path, since those functions run every draw/frame
<alyssa> So if they're already slow from WC stuff, it'll balloon
<alyssa> Other place you'll find it is directly mapped pipe_transfers (so LINEAR resources) ... you could see what happens when shadowing those but I doubt it'd be much better
<alyssa> reading from GPU mapped memory is bad hygiene (because WC/etc) anyway
<tomeu> alyssa: btw, if we are going to touch the scoreboard, any reason not to have the write_value jobs in the same order in the cmdstream as the blob?
<tomeu> would make it more comfortable when diffing stuff
<alyssa> tomeu: I don't quite remember... I think I thought it might be faster this way but now I realize SET_VALUE jobs are so trivial it doesn't matter anyway
<tomeu> alyssa: one more thing: is there any reason to have the fragment job in a separate job chain?
<alyssa> tomeu: I don't know.
<tomeu> with all the bos we are sending around, it might involve quite some useless work
<alyssa> The blob does it that way.
<alyssa> I never checked if it's strictly necessary or not.
<tomeu> ok, something to check later I guess
<alyssa> I will say that working on compute shaders is a lot easier mentally than graphics
<alyssa> oh, oops, code was right I was reading wrong I feel silly
<alyssa> yes! passing more shared memory tests now!
<alyssa> found the uninitialized variable and initialized it
raster has quit [Quit: Gettin' stinky!]
<alyssa> and those tests are using barriers
<alyssa> so not sure what's up with the barriers tests but the issue isn't the barrier
<alyssa> in conclusion I made it past the barriers the hw threw at me (har-har) -- okay i'll show myself out
<alyssa> As for the mat tests failing, I'm guessing this is an issue with unaligned access to shared memory which I mean
<alyssa> Pretty sure I fixed this for UBOs at one point, should be easy enough to port a fix
toggleton has joined #panfrost
<bbrezillon> tomeu, alyssa: well, if you really want to prevent reads, you can always provide read/write accessors
<bbrezillon> and do the checks there
<bbrezillon> that means going through the code to convert all direct accesses to use those helpers, but you only have to do that once
<bbrezillon> tomeu: pan_scoreboard.c seems to read objects placed in BOs
<bbrezillon> having the job headers cached would probably help improve performance
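(A minimal sketch of the accessor idea, assuming hypothetical names; reads assert in debug builds, and could later be redirected to a cached CPU-side shadow of the job headers:)

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct bo_map {
        void *cpu;       /* write-combine CPU mapping */
        bool  readable;  /* set only for BOs we intend to read back */
    };

    static inline void
    bo_write(struct bo_map *map, size_t offset, const void *src, size_t len)
    {
        memcpy((char *)map->cpu + offset, src, len);
    }

    static inline void
    bo_read(struct bo_map *map, size_t offset, void *dst, size_t len)
    {
        /* Catch accidental readbacks from uncached memory. */
        assert(map->readable && "reading from a write-only (WC) BO");
        memcpy(dst, (char *)map->cpu + offset, len);
    }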
warpme_ has quit [Quit: Connection closed for inactivity]
nerdboy has joined #panfrost
raster has joined #panfrost
buzzmarshall has joined #panfrost
guillaume_g has quit [Quit: Konversation terminated!]
pH5 has joined #panfrost
yann|work has quit [Ping timeout: 272 seconds]
Elpaulo has joined #panfrost
Elpaulo has quit [Quit: Elpaulo]
stikonas has joined #panfrost
<alyssa> bbrezillon: accessors seem clunky ...
<alyssa> and yeah re pan_scoreboard.c, but more than caching the headers, you probably want to fundamentally restructure some of that file so the access patterns are less random
LinguinePenguiny has joined #panfrost
buzzmarshall has quit [Quit: Leaving]
buzzmarshall has joined #panfrost
<bbrezillon> alyssa: well, random or not, accesses are still uncached
<bbrezillon> but I was wondering if we couldn't build the dep graph at insertion time
buzzmarshall has quit [Client Quit]
buzzmarshall has joined #panfrost
<alyssa> \shrug/
<alyssa> Postponing building the dep graph is definitely an option!
LinguinePenguiny has quit [Quit: LinguinePenguiny]
<bbrezillon> I had the opposite in mind
<bbrezillon> apart from the write_value, other jobs are inserted in order, aren't they?
<alyssa> Oh, I see. That's also an option
<alyssa> Mostly we want scoreboarding to be flexible enough so we don't paint ourselves into a corner for ES3.2
<alyssa> when you introduce geometry and tessellation shaders, things get... odd.
<bbrezillon> well, let's leave it like that for now
<bbrezillon> caching job info should help already
<alyssa> nice! what's the perf impact?
<bbrezillon> haven't tested it yet
<alyssa> Okey!
<bbrezillon> and won't do it today
<alyssa> Fair enough :)
buzzmarshall has quit [Quit: Leaving]
buzzmarshall has joined #panfrost
raster has quit [Quit: Gettin' stinky!]
grw has quit [*.net *.split]
janrinze has quit [*.net *.split]
milkii has quit [*.net *.split]
milkii has joined #panfrost
janrinze has joined #panfrost
davidlt has joined #panfrost
raster has joined #panfrost
NeuroScr has joined #panfrost
davidlt has quit [Ping timeout: 240 seconds]
pH5 has quit [Ping timeout: 260 seconds]
leinax has quit [Remote host closed the connection]