alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
raster has quit [Quit: Gettin' stinky!]
stikonas has quit [Ping timeout: 272 seconds]
stikonas has joined #panfrost
stikonas has quit [Ping timeout: 272 seconds]
yann|work has joined #panfrost
vstehle has quit [Ping timeout: 268 seconds]
chewitt has quit [Read error: Connection reset by peer]
nerdboy has quit [Ping timeout: 260 seconds]
NeuroScr has quit [Ping timeout: 272 seconds]
NeuroScr has joined #panfrost
NeuroScr has quit [Quit: NeuroScr]
icecream95 has joined #panfrost
<icecream95> From running perf on a demo in xscreensaver:
<icecream95> 1.38% bouncingcow ld-2.30.so __aeabi_uidiv
<icecream95> Why does the dynamic linker need to do so many divisions?
<HdkR> Probably want a callstack on that one
<HdkR> What CPU is this?
<HdkR> Cortex-A7 has a udiv instruction, should recompile with -march=native if that is the case :P
<icecream95> This is on Arch Linux ARM for "armv7h" (I think the h is for "hard float"), so I guess they didn't set CFLAGS correctly.
<icecream95> At least with aarch64, GCC assumes an FPU, so
<urjaman> FPU != integer division
<icecream95> Are there any ARM CPUs with an FPU and no integer division?
<urjaman> also, does the A7 really have udiv? like always and not only in the -R (iirc, embeddedish, not application) profile
<urjaman> at least the Cortex-A8 doesn't have it
<icecream95> A7 was designed for big.LITTLE with A15 so they have the same features
<urjaman> i havent tried to write ARM assembler with a newer chip than that :P (was the pandora)
<urjaman> but they explicitly didnt have integer division in the application variants back then (basically i guess thinking that the compiler can math it away usually)
<HdkR> There are plenty of ARM CPUs with FPU and no integer division. Probably not any supporting Panfrost though :P
<HdkR> Cortex-A15 and A7 were first with intdiv
<urjaman> okay yeah "verified" that ...
<urjaman> in the ARMv7-A profile, if the virtualization extensions are supported, then udiv/sdiv are present
<urjaman> and A7/A15 have virtualization extensions
<urjaman> but the archlinux arm armhf target is built to run on anything ARMv7 so that's why it isnt using them
<urjaman> (or at least it runs on the Cortex-A8 of the pandora so :P)
<HdkR> Should probably get a stacktrace and see if those udivs can be avoided
<HdkR> Could just be something that bouncingcow does since intdiv has been around for a long time on x86
buzzmarshall has quit [Quit: Leaving]
megi has quit [Ping timeout: 265 seconds]
<icecream95> The uidiv calls are from ld-2.30.so, so it looks like it's used for searching the symbol hash table.
<icecream95> Counting the lines output when running with LD_DEBUG=symbols, the hash table is searched 2,502,134 times during startup
<HdkR> What would it be searching a symbol hash table for?
<HdkR> relocations happen at module load and/or startup
<icecream95> Symbols? You can't just get the address of memcpy from nowhere.
<icecream95> The majority (over 80%) of those symbols are for LLVM.
<icecream95> If I recompiled Mesa without LLVM support applications would probably start noticeably faster.
<HdkR> This is just a profile of startup time then?
<icecream95> Yes, I only ran it for about 5 seconds
<HdkR> ah, I was thinking this was runtime so eh
davidlt has joined #panfrost
vstehle has joined #panfrost
<icecream95> Disabling LLVM (-Dllvm=false) makes applications start about 0.1 seconds faster for me.
<tomeu> alyssa: bbrezillon: maybe restructuring how we emit the command stream is something we should reconsider
<tomeu> it always comes to my mind when I'm debugging flaky tests
<tomeu> for example, we're almost done with gles2, but in reality, if you change the order in which tests are run we start failing tons of them
<tomeu> if we stopped exposing the CPU pointers of transient BOs and confined all that code to a single file, it would be easier to audit that we're not doing any reads
jolan has quit [Quit: leaving]
<tomeu> that would also mean adding to the context some more state that currently we extract from descriptors
<tomeu> if we did that, it would be easier to audit what state really belongs to the context and should be preserved between batches, and what needs to be reinitialized
<tomeu> the latter would go to structs that aren't shared between batches
jolan has joined #panfrost
davidlt has quit [Ping timeout: 272 seconds]
megi has joined #panfrost
<bbrezillon> tomeu: +1
<icecream95> bbrezillon: b334fcfa881 ("panfrost: Avoid reading job desc headers from the BO") seems to have no effect on performance
<tomeu> bbrezillon: it could also help to check that we are writing to transient BOs sequentially, so we tend to fill cache lines
<tomeu> icecream95: with which workload? it should only help in those in which the scoreboard functions appear high in the profile
guillaume_g has joined #panfrost
<icecream95> tomeu: glmark2-es2
_whitelogger has joined #panfrost
<robmur01> tomeu: 64 bytes is hella optimistic ;) e.g. http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500j/CHDDJAFJ.html#CHDCAHGG
<robmur01> in general store buffer slots are likely to be the same width as the load/store interface itself
<tomeu> robmur01: hmm, I saw somewhere that the write cache should be able to store a whole cache line
<tomeu> might have been in x86?
<robmur01> again, capital-letter Write-Combine is a specifically defined part of the x86 architecture
<robmur01> merging store buffers are a more general mechanism and not the same thing (I believe WC is really closer to a write-through cache policy)
<tomeu> robmur01: do you know of any public documentation on how best to use WC buffers on Arm?
<robmur01> no, but it would essentially be "keep small writes to adjacent addresses as close together as possible in program order"
<robmur01> interleaving stores to different locations (e.g. the stack) is liable to flush partially-filled slots in a suboptimal manner
buzzmarshall has joined #panfrost
<robmur01> (incidentally, buffered Non-Cacheable is actually less doom-laden than x86 WC when it comes to reads - non-cacheable loads have no magic side-effects, and in fact the LSU might snoop and forward from its own store buffer to elide the external request altogether)
<robmur01> (weak memory ordering FTW!)
<tomeu> robmur01: awesome thanks, that's very helpful
<robmur01> in reality I'd imagine x86 CPUs almost certainly have actual store buffers as well to minimise traffic between the LSU and L1, it's just that the stronger memory model will make them effectively invisible
<bbrezillon> alyssa: OOC, what's the difference between packet-based and desc-based cmdstreams?
warpme_ has joined #panfrost
guillaume_g has quit [Quit: Konversation terminated!]
pH5 has quit [Quit: bye]
<alyssa> bbrezillon: Uhh, materially none, but semantically... it's like functional vs imperative programming
<alyssa> GPUs like Adreno literally have a command stream, as in, you give it a list of commands in serial order that the GPU executes.
<alyssa> Utgard works like that too
<alyssa> Linear and imperative.
<alyssa> By contrast, Midgard/Bifrost (and I assume Valhall) uses descriptors, memory-mapped structures containing all the relevant state "all at once"
<alyssa> There isn't a linear command stream, you get a tree instead (and sometimes with cycles but we don't talk about that)
<alyssa> Where different descriptors also point to other descriptors
<alyssa> The two forms are equivalent, of course, but they look very different on the binary level. With command/packet/etc architectures, it makes sense to literally dump the command stream. With descriptors, it doesn't (this is why pandecode is so complicated and error prone -- there's no logical start or end, you just have stuff in memory and we have to walk the tree like the hardware would, but without
<alyssa> knowing anything about the hardware a priori we can't do that without heuristics. Which is why early panfrost dev circa 2017 was so slow)
buzzmarshall has quit [Quit: Leaving]
<alyssa> For an API level analogy -- OpenGL uses a command stream. Vulkan uses descriptors.
<alyssa> (mesa/st translates OpenGL commands to Gallium descriptors; it turns out even for command-based hardware, having descriptors is more convenient for the driver, which is why Gallium drivers tend to outperform classic drivers, and why Vulkan tends to outperform OpenGL)
<alyssa> Practically for cmdstream packing, that just means the API is going to look rather different than the OUT_* macros you see in freedreno (those wouldn't make sense here)
<alyssa> So we'd instead tend towards utilities to build and upload descriptors and return their pointers, probably (whereas OUT_* doesn't care about pointers, it's just a big queue of commands)
buzzmarshall has joined #panfrost
<bbrezillon> alyssa: got it
<bbrezillon> thanks for this explanation
<alyssa> bbrezillon: :+1:
* alyssa pokes at atomics
<karolherbst> alyssa: check the comment I put for the out of bound stuff :p
<alyssa> karolherbst: I feel like I'm missing a punchline
<alyssa> Yes, I'm reading that -- as stated I feel like I'm missing a punchline? :p
<karolherbst> there is none I think
<alyssa> Oh ok
<karolherbst> I just implemented that stuff as well
<karolherbst> :p
<karolherbst> it's not as nice as it could be, but probably better than nothing
<alyssa> Alright... I think I really would prefer the non-branching versions, though for the moment I'd probably prefer no bounds checking at all
<karolherbst> I really miss predication; nir has no concept of it :/
<alyssa> Do you think maybe we could land the pass with the original commit + your atomic commit, and figure out bounds checking in a followup?
<karolherbst> alyssa: well... what does your hardware do when you go out of bound?
<alyssa> Depends how out of bound you go
<karolherbst> eg piglit/bin/arb_shader_storage_buffer_object-array-ssbo-binding goes out of bound
<alyssa> Read/write garbage or page fault \o/
<karolherbst> well.. I got some page faults :p
<karolherbst> hardware context dies
<karolherbst> because it's not recoverable
<alyssa> Peppering in `umin`s would fix that
<karolherbst> but then you don't return 0
<karolherbst> and what about writes?
<alyssa> so? and write garbage that's on the app?
<karolherbst> depends on what the spec says
<alyssa> oh! duh, there's a super obvious solution
<alyssa> Instead of doing umin(index, size - 1)
<alyssa> do umin(index, size)
<alyssa> and allocate an extra dummy element at the end of every SSBO
<karolherbst> mhhhhhhhh
<alyssa> So then stores write to the dummy element and are thus noops
<karolherbst> I don't think that works in all cases
<karolherbst> especially with CL where you can have host_ptrs
<alyssa> and loads... well, loads won't be zero if you both read and write oob
<alyssa> Maybe have two dummy elements then
<karolherbst> and the allocation is under the control of the application
<karolherbst> not the runtime
<karolherbst> the safest thing to do is to catch it
<alyssa> that seems like a separate problem, this is for OpenGL SSBOs
<alyssa> CL doesn't have SSBOs :p
<karolherbst> for us doing the min or what I came up with is essentially the same
<karolherbst> and compared to global mem access, one alu instruction is like nothing
<alyssa> True.
<alyssa> But branching can be non-nothing
<alyssa> especially if it totally disrupts the schedule.
<karolherbst> you won't branch
<alyssa> if there's an if we will
<karolherbst> do you have predicates?
<alyssa> No.
<karolherbst> uff
<HdkR> Not everyone needs predicates :P
<karolherbst> ohhh, now I see why codegen doesn't end up predicating for me as well... TGSI leads to 4 bounds checks
<karolherbst> with the nir lowering there is just one
<karolherbst> interesting
<karolherbst> and stupid
<karolherbst> uff, our RA is just terrible
<karolherbst> alyssa: do you know what your prop. stack is doing?
<alyssa> (Oh, re adding a canary/dummy element -- if you really need loads to read back 0 you can add an ult/csel, which is cheaper than a branch)
<alyssa> prop. stack?
raster has quit [Quit: Gettin' stinky!]
<karolherbst> alyssa: mhhh... mhhh, even though I agree that it might work, an OOB load also crashes our context
<alyssa> karolherbst: It's not OOB
<karolherbst> why not?
<alyssa> It's exactly one element out of bounds, and then you expand all the SSBOs in the driver by one element
<alyssa> So while it's logically out of bounds, to the hardware it's in bounds (at the last element)
<karolherbst> I think you can still have system memory mapped in with fancy extensions
<alyssa> this is all for OpenGL SSBOs..
<karolherbst> but... without those you are probably fine though
<karolherbst> alyssa: you can create GL buffer with user memory with exts
<karolherbst> alyssa: PIPE_CAP_RESOURCE_FROM_USER_MEMORY
<alyssa> we can't :V
<karolherbst> AMD_pinned_memory was the GL extension
<karolherbst> alyssa: why not?
<alyssa> kernel doesn't support that afaik
<karolherbst> okay.. sure, but that can be changed ;)
<karolherbst> it's just mapped user memory in the GPU's MMU
<karolherbst> so instead of VRAM page, it's sysram... which in your case doesn't even make a difference
<alyssa> if we're modifying the kernel, we can have the kernel pad to an extra 16 bytes (or a whole page but fine) anyway if we really need it ;)
<karolherbst> the application allocates
<karolherbst> with malloc
<karolherbst> or something
<karolherbst> you could even write into glibc (or whatever libc there is) state and corrupt everything
<karolherbst> the application is still at fault, still ugly
<HdkR> pinned_memory is such a fun extension :D
<karolherbst> alyssa: okay.. so I think the situation is like this: without robustness you can do whatever you want, but if you have a robustness context you are not allowed to do anything out of bounds. But the runtime could write somewhere into the buffer as it seems
davidlt has joined #panfrost
davidlt has quit [Ping timeout: 260 seconds]
TheKit has quit [Remote host closed the connection]
davidlt has joined #panfrost
stikonas has joined #panfrost
warpme_ has quit [Quit: Connection closed for inactivity]
icecream95 has joined #panfrost
<alyssa> karolherbst: I mean, the kernel can always just map the page after to a buffer it allocates itself internally. It's silly but it certainly *can* be done.
<alyssa> And fwiw, mali blob with robustness just does umin
<alyssa> Dunno if they pad it to hide writes or not.
<alyssa> In other news, oh gosh load/store packing is funny
<alyssa> Atomics use the `swizzle` field as an arg_0 of sorts. Which is... fine, but I'm not sure what the right way to represent ld/st ops are anymore
<alyssa> Their disassembly is awkward enough as it is ;)
stikonas has quit [Remote host closed the connection]
stikonas has joined #panfrost
<karolherbst> alyssa: operation type probably
<karolherbst> atomics are always stores btw
<karolherbst> and loads
<karolherbst> I would be very surprised if any hardware does it differently
<alyssa> karolherbst: Oh, I understand the encoding, it's just awkward
<alyssa> Aside - I'm conjecturing that ops 0x29-0x2B might be fmin with various rounding modes, and likewise for 0x2D-0x2F. Don't have a good way to test, but 0x2E comes up in the pattern for fsign(), which here would decode as fmax_rtp
<alyssa> Admittedly the fsign implementation uses like 3 different constructs so it's difficult to understand it just staring
<alyssa> (It also uses an .unk2 modifier I've never seen before - maybe it has something to do with NaN handling? idk)
davidlt has quit [Ping timeout: 268 seconds]