alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
Pak0st has quit [Remote host closed the connection]
nerdboy has joined #panfrost
<alyssa> Happy to hear it!
* alyssa still needs to figure out thread local storage sizing
<alyssa> So we have the inequality:
<alyssa> thread_tls_alloc <= max_threads
<alyssa> and the default value of max_threads is THREAD_MT_DEFAULT = 256
<alyssa> That concords with stuff saying a midgard shader core has up to 256 threads (see http://infocenter.arm.com/help/topic/com.arm.doc.dui0538f/CHDGJAIA.html)
<alyssa> Tentatively assuming max_threads <= 256
<alyssa> so we have $thread_tls_alloc \leq max_threads \leq 256 \Rightarrow thread_tls_alloc \leq 256$
<alyssa> Granted all of this info is routed from the kernel nowadays, but I digress.
<alyssa> Old kernels or something.
<alyssa> kbase describes tls_alloc as per-core (in mali_base_kernel.h)
<alyssa> The T860 in RK3399 is 4-core, so let's use that for calculations for now
<alyssa> So let's just try some things and see what breaks?
<HdkR> TLS size calculations are fun for recursive function calls that spill
<alyssa> HdkR: Recursion is forbidden in GLSL
<alyssa> I think
<HdkR> Yes it is
<HdkR> It's forbidden with subroutines as well
<HdkR> Subroutines on a driver stack that implements it as true subroutines can be cheesed to have real recursion though..
<alyssa> Uh oh
<alyssa> Well, the obvious formula doesn't work (too small)
<urjaman> the obvious being cores*threads*tls_size ? (or something like that)
<alyssa> urjaman: That's the one
<urjaman> yeah i was thinking about that a few hours ago already ...
tgall_foo has quit [Ping timeout: 265 seconds]
<urjaman> if that's too small then i'd throw a guess that maybe something is aligned to ... idk, pages
<alyssa> Probably
<alyssa> But it's off by a factor of 4 or 8 or something
<urjaman> like for each core, or for each thread
<alyssa> I mean --
<urjaman> hmmm....
<alyssa> urjaman: cores * threads * ALIGN(tls_size, 1024)
<alyssa> ^^ that matches the "off by a factor of 4 < 6.4 <= 8" (1024/160 = 6.4)
<alyssa> But does it generalize?
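A minimal sketch of the candidate formula from the lines above, cores * threads * ALIGN(tls_size, 1024). The helper names (align_pot, tls_candidate_size) are hypothetical and not taken from kbase or Panfrost; the 1024 alignment is only the guess being tested at this point in the discussion, not a confirmed hardware rule.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of `alignment`, which must be a power of two.
 * Hypothetical stand-in for the ALIGN() macro mentioned in the chat. */
static size_t align_pot(size_t x, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (x + alignment - 1) & ~(alignment - 1);
}

/* Candidate buffer size: cores * threads * ALIGN(tls_size, 1024).
 * The 1024 is the guess under test, not a known constant. */
static size_t tls_candidate_size(size_t cores, size_t threads, size_t tls_size)
{
    return cores * threads * align_pot(tls_size, 1024);
}
```

With 4 cores, 256 threads and tls_size = 160, this gives 4 * 256 * 1024 bytes, which is 6.4 times the unaligned product 4 * 256 * 160.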
<urjaman> i suppose testing is needed
<alyssa> (with tls_size = 160, that was 0x1e6 fwiw)
<alyssa> if I bump to 1e7, the formula is too small
<alyssa> but if I multiply by a factor of 2, it's fine for 1e7
<urjaman> oh yeah i didnt really get what that value was about when i looked at the hack patch (well, i guess maybe you dont either with how it was .unk0 ..:)
<urjaman> that's what made me think about what the size should be :P
<alyssa> with 1e8, need another factor of 2.
<alyssa> From which we conclude that unk0 is logarithmic
<alyssa> (log2, specifically)
<urjaman> has it always been 0x1e_ ? i'm feeling maybe the lower nibble is the shift and the rest is something else ...
<alyssa> Good question
<alyssa> no
<alyssa> Sometimes I've seen 0x0 but that's lazy
<alyssa> Usually it's 1e4 (with no spilling)
<alyssa> 1e5 with modest spilling
<alyssa> 1e6 with a lot of spilling
<alyssa> But uh
<alyssa> If I put an array on the stack, then it goes all wacky
* alyssa hasn't looked at this in 4 months, may take a bit to find her notes
<alyssa> ...Uhm they're not in the usual place
<alyssa> Oh, it was in here
<alyssa> Starts at 16:02
<alyssa> ----Oh, right. Those notes are pieces of paper thousands of km away, right.
stikonas has quit [Remote host closed the connection]
nerdboy has quit [Ping timeout: 276 seconds]
<alyssa> Agh, getting inconsistent results!
<alyssa> For the arrays, the formula:
<alyssa> max(log2(size), 0x5)
<alyssa> works just fine
<alyssa> Also note with a little algebra you get "size <= pow2(nibble)"
<alyssa> which explains some of the "special" alignments
<alyssa> But for regular register spilling, that doesn't seem to work
<alyssa> Not clear in which 'direction' it's wrong
<urjaman> maybe the 0x1e part is some sort of a "partitioning" of the TLS between the array/not-spilling vs for-spilling-area ?
<urjaman> so you'd need the shift for 0x1e + tls_size ... or something like that
<urjaman> conceptually, likely not exactly :P
Stary has joined #panfrost
<alyssa> Quite possible
<alyssa> But ... conceptually we're just spilling to an array, right...?
<alyssa> Here's issue #1: my offset calculations in the shader are off by a factor of 4
<urjaman> yeah, but you did say array access vs spilling used different instructions...
<alyssa> Slightly different
<alyssa> Same instruction but different magic number
<alyssa> Then again, with neither magic number, that instruction is used to load (/store resp.) to random addresses in GPU memory
<alyssa> Versatile.
<alyssa> With this off-by-factor-of-4 issue pushed out
<alyssa> I miiiight be making progress...?
<alyssa> Maybe not
<alyssa> I mean, that *also* needed to be fixed :p
vstehle has quit [Ping timeout: 268 seconds]
<alyssa> Definitely getting closer with the factor of 4 thing fixed
<alyssa> There's still a factor of 8 difference between spilling and arrays
<alyssa> ...Wait. What.
<alyssa> looks like it's an ~internal detail
<alyssa> and might not apply to panfrost's compiler anyway
<alyssa> So let's use the register spilling formula and pretend arrays don't exist for a minute.
<alyssa> So I guess we have the formula for unk0 then. and then getting the actual buffer size should follow from the above observation.
nerdboy has joined #panfrost
<alyssa> Yup!
<alyssa> With the above, the formula for the size of the buffer we allocate follows as an immediate corollary.
<alyssa> In particular, we have:
<alyssa> nibble = floor(log2(max(s/8, 31)))
<alyssa> ( roughly, not totally positive about the 31, might be 32, etc)
<alyssa> you can rearrange that to "solve" for s and yield in particular:
<alyssa> s <= 2 ^ (N + 4)
<alyssa> So we take the size of the stack per thread to be 2^(N + 4)
<alyssa> recall we allocate the stack for each thread and each core
<alyssa> So then we trivially need to allocate:
<alyssa> (2 ^ (N + 4)) * (# threads/core) * (# of cores)
<alyssa> And that's it :)
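The allocation rule just stated can be sketched as follows, assuming N is the low nibble of unk0 and that the per-thread stack really is 2^(N + 4) bytes. The function names are hypothetical, and the thread and core counts are plain parameters here instead of values queried from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Per-thread stack size implied by the nibble: 2^(N + 4) bytes.
 * Assumes N fits in four bits, as the term "nibble" suggests. */
static uint64_t tls_stack_per_thread(unsigned nibble)
{
    assert(nibble <= 0xF);
    return UINT64_C(1) << (nibble + 4);
}

/* Total buffer: (2^(N + 4)) * (threads per core) * (number of cores). */
static uint64_t tls_total_size(unsigned nibble, unsigned threads_per_core,
                               unsigned cores)
{
    return tls_stack_per_thread(nibble) * threads_per_core * cores;
}
```

For the 4-core T860 with 256 threads per core, N = 4 gives 256 bytes per thread and 262144 bytes in total.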
<urjaman> yeah sounds sensible - tho my guess above would've made the nibble like floor(log2(s/8 + 30))
<urjaman> so i'd say be sure to check the boundaries somehow, that is that there is as much spilling space available as we expect there to be
<urjaman> or maybe s/8 + 32 (alignment...)
nerdboy has quit [Ping timeout: 240 seconds]
_whitelogger has joined #panfrost
NeuroScr has joined #panfrost
nerdboy has joined #panfrost
NeuroScr has quit [Quit: NeuroScr]
vstehle has joined #panfrost
davidlt has joined #panfrost
davidlt has quit [Ping timeout: 276 seconds]
davidlt has joined #panfrost
stikonas has joined #panfrost
eballetbo[m] has quit [Quit: killed]
flacks has quit [Quit: killed]
EmilKarlson has quit [Quit: killed]
TheCycoONE1 has quit [Quit: killed]
thefloweringash has quit [Quit: killed]
thefloweringash has joined #panfrost
EmilKarlson has joined #panfrost
flacks has joined #panfrost
eballetbo[m] has joined #panfrost
TheCycoONE1 has joined #panfrost
<alyssa> Probably
sravn has joined #panfrost
davidlt has quit [Ping timeout: 250 seconds]
megi has quit [Ping timeout: 240 seconds]
megi has joined #panfrost
grw has quit [Ping timeout: 244 seconds]
davidlt has joined #panfrost
abordado has joined #panfrost
megi has quit [Quit: WeeChat 2.6]
jschwart has joined #panfrost
<jschwart> hi all
<jschwart> what's the current status of Panfrost?
<jschwart> I have a rk3288 device (Tinkerboard) and I'd like to use Panfrost on Armbian Bionic (or any distribution that's compatible)
<alyssa> jschwart: OpenGL ES 2.0 is pretty smooth on Rk3288 these days, desktop GL 2.1 is decent (depends on the app) :)
<alyssa> robher: Off hand, do you know which drm_panfrost_param returns the number of cores in the GPU?
<alyssa> Looooks like popcount(DRM_PANFROST_PARAM_SHADER_PRESENT) should do the trick?
<alyssa> Ah, yeah, in kbase:
<alyssa> gpu_props->num_cores = hweight64(raw->shader_present)
<robher> alyssa: glad I could help. ;)
<alyssa> Thank you ;P
<alyssa> oh, "hamming weight", okay, yes
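A small sketch of the core-count idea above: the number of shader cores is the Hamming weight (population count) of the shader_present bitmask. The function name count_shader_cores is hypothetical, and the mask is passed in directly; fetching DRM_PANFROST_PARAM_SHADER_PRESENT from the kernel is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Count the set bits in the shader-present mask, one bit per core.
 * Portable equivalent of the kernel's hweight64(). */
static unsigned count_shader_cores(uint64_t shader_present)
{
    unsigned cores = 0;

    while (shader_present) {
        shader_present &= shader_present - 1; /* clear the lowest set bit */
        cores++;
    }

    assert(cores <= 64);
    return cores;
}
```

A mask of 0xF would correspond to a 4-core part such as the T860 discussed earlier.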
<alyssa> Nothing quite like supporting old kernels!~
<alyssa> ("Or old userspace.")
cowsay has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
cowsay has joined #panfrost
<alyssa> urjaman: After more ping-pong, https://people.collabora.com/~alyssa/nibbles.pdf is something I'm quite pleased with :)
<alyssa> The other way to think about it is "round size to the next power of two >= 256"
<alyssa> and then N is log2(that) - 4
<alyssa> [ or log2(that/16) ]
<urjaman> i see
<urjaman> alyssa: i feel like this pdf (and the fact that you made it) is very you
<alyssa> urjaman: Thank you!
<alyssa> util_logbase2(MAX2(size, 256)) - 4
<alyssa> er
<alyssa> util_logbase2_ceil(MAX2(size, 256)) - 4
<alyssa> if you prefer
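A standalone sketch of the final expression, util_logbase2_ceil(MAX2(size, 256)) - 4. The Mesa helpers util_logbase2_ceil and MAX2 are replaced here by local stand-ins (log2_ceil and a plain comparison), so this is an illustration of the arithmetic and not the driver's code; tls_nibble is a hypothetical name.

```c
#include <assert.h>

/* ceil(log2(n)) for n >= 1; local stand-in for util_logbase2_ceil. */
static unsigned log2_ceil(unsigned n)
{
    unsigned bits = 0;

    assert(n >= 1);
    while ((1ull << bits) < n)
        bits++;
    return bits;
}

/* Round size up to the next power of two that is at least 256,
 * then N = log2(that) - 4. */
static unsigned tls_nibble(unsigned size)
{
    unsigned clamped = size > 256 ? size : 256;

    return log2_ceil(clamped) - 4;
}
```

Any size up to 256 yields N = 4, matching the 0x1e4 seen with no spilling; 257 through 512 yields N = 5.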
jschwart has quit [Ping timeout: 250 seconds]
NeuroScr has joined #panfrost
davidlt has quit [Ping timeout: 240 seconds]
NeuroScr has quit [Quit: NeuroScr]
NeuroScr has joined #panfrost
vstehle has quit [Ping timeout: 265 seconds]
vstehle has joined #panfrost
NeuroScr has quit [Quit: NeuroScr]
gcl has joined #panfrost