00:15
Pak0st has quit [Remote host closed the connection]
00:40
nerdboy has joined #panfrost
00:45
<alyssa> Happy to hear it!
00:46
* alyssa still needs to figure out thread local storage sizing
00:50
<alyssa> So we have the inequality:
00:50
<alyssa> thread_tls_alloc <= max_threads
00:51
<alyssa> and the default value of max_threads is THREAD_MT_DEFAULT = 256
00:56
<alyssa> Tentatively assuming max_threads <= 256
00:56
<alyssa> so we have $thread_tls_alloc \leq max_threads \leq 256 \Rightarrow thread_tls_alloc \leq 256$
00:57
<alyssa> Granted all of this info is routed from the kernel nowadays, but I digress.
00:57
<alyssa> Old kernels or something.
00:58
<alyssa> kbase describes tls_alloc as per-core (in mali_base_kernel.h)
00:58
<alyssa> The T860 in RK3399 is 4-core, so let's use that for calculations for now
00:59
<alyssa> So let's just try some things and see what breaks?
00:59
<HdkR> TLS size calculations are fun for recursive function calls that spill
01:00
<alyssa> HdkR: Recursion is forbidden in GLSL
01:00
<HdkR> It's forbidden with subroutines as well
01:01
<HdkR> Subroutines on a driver stack that implements it as true subroutines can be cheesed to have real recursion though..
01:03
<alyssa> Well, the obvious formula doesn't work (too small)
01:05
<urjaman> the obvious being cores*threads*tls_size ? (or something like that)
01:05
<alyssa> urjaman: That's the one
01:05
<urjaman> yeah i was thinking about that a few hours ago already ...
01:05
tgall_foo has quit [Ping timeout: 265 seconds]
01:06
<urjaman> if that's too small then i'd throw a guess that maybe something is aligned to ... idk, pages
01:06
<alyssa> But it's off by a factor of 4 or 8 or something
01:06
<urjaman> like for each core, or for each thread
01:08
<alyssa> urjaman: cores * threads * ALIGN(tls_size, 1024)
01:08
<alyssa> ^^ that matches the "off by a factor of 4 < 6.4 <= 8"
01:08
<alyssa> But does it generalize?
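As a sketch, the candidate formula being tested here can be written out directly. All names, and the 1024-byte alignment, are tentative guesses from this conversation, not confirmed hardware behavior:

```python
# Tentative sketch of the TLS sizing guess discussed above:
#   total = cores * threads * ALIGN(tls_size, 1024)
# Names and the 1024 alignment are assumptions from this conversation.

def align(x, a):
    """Round x up to the next multiple of a."""
    return (x + a - 1) // a * a

def tls_total_size(cores, threads_per_core, tls_size):
    return cores * threads_per_core * align(tls_size, 1024)

# e.g. the 4-core T860 with 256 threads/core and tls_size = 160:
print(tls_total_size(4, 256, 160))  # 1048576
```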
01:09
<urjaman> i suppose testing is needed
01:09
<alyssa> (with tls_size = 160, that was 0x1e6 fwiw)
01:09
<alyssa> if I bump to 1e7, the formula is too small
01:10
<alyssa> but if I multiply by a factor of 2, it's fine for 1e7
01:11
<urjaman> oh yeah i didn't really get what that value was about when i looked at the hack patch (well, i guess maybe you don't either with how it was .unk0 ..:)
01:11
<urjaman> that's what made me think about what the size should be :P
01:12
<alyssa> with 1e8, need another factor of 2.
01:12
<alyssa> From which we conclude that unk0 is logarithmic
01:12
<alyssa> (log2, specifically)
01:13
<urjaman> has it always been 0x1e_ ? i'm feeling maybe the lower nibble is the shift and the rest is something else ...
01:14
<alyssa> Good question
01:14
<alyssa> Sometimes I've seen 0x0 but that's lazy
01:14
<alyssa> Usually it's 1e4 (with no spilling)
01:14
<alyssa> 1e5 with modest spilling
01:14
<alyssa> 1e6 with a lot of spilling
01:14
<alyssa> If I put an array on the stack, then it goes all wacky
01:15
* alyssa hasn't looked at this in 4 months, may take a bit to find her notes
01:16
<alyssa> ...Uhm, they're not in the usual place
01:17
<alyssa> Oh, it was in here
01:17
<alyssa> Starts at 16:02
01:18
<alyssa> Oh, right. Those notes are pieces of paper thousands of km away, right.
01:25
stikonas has quit [Remote host closed the connection]
01:52
nerdboy has quit [Ping timeout: 276 seconds]
02:01
<alyssa> Agh, getting inconsistent results!
02:02
<alyssa> For the arrays, the formula:
02:02
<alyssa> max(log2(size), 0x5)
02:02
<alyssa> works just fine
02:02
<alyssa> Also note with a little algebra you get "size <= pow2(nibble)"
02:02
<alyssa> which explains some of the "special" alignments
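The array-case formula above can be sketched with the rearrangement "size <= pow2(nibble)" made explicit. Reading log2 as ceil(log2) (so the inequality holds exactly) and the 0x5 floor are both assumptions taken from this chat, not verified against hardware:

```python
import math

# Sketch of the array-case formula above: nibble = max(log2(size), 0x5).
# ceil(log2) is assumed so that size <= 2**nibble holds, per the
# rearrangement; this reading is a guess, not confirmed behavior.

def array_nibble(size):
    return max(math.ceil(math.log2(size)), 0x5)
```

Any size up to 2**nibble then maps back to the same nibble, which is where the "special" alignments come from.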
02:02
<alyssa> But for regular register spilling, that doesn't seem to work
02:03
<alyssa> Not clear in which 'direction' it's wrong
02:07
<urjaman> maybe the 0x1e part is some sort of a "partitioning" of the TLS between the array/not-spilling area vs the for-spilling area ?
02:08
<urjaman> so you'd need the shift for 0x1e + tls_size ... or something like that
02:09
<urjaman> conceptually, likely not exactly :P
02:10
Stary has joined #panfrost
02:16
<alyssa> Quite possible
02:16
<alyssa> But ... conceptually we're just spilling to an array, right...?
02:17
<alyssa> Here's issue #1: my offset calculations in the shader are off by a factor of 4
02:17
<urjaman> yeah, but you did say array access vs spilling used different instructions...
02:17
<alyssa> Slightly different
02:17
<alyssa> Same instruction but different magic number
02:18
<alyssa> Then again, with neither magic number, that instruction is used to load (/store resp.) to random addresses in GPU memory
02:18
<alyssa> Versatile.
02:19
<alyssa> With this off-by-factor-of-4 issue pushed out
02:19
<alyssa> I miiiight be making progress...?
02:27
<alyssa> I mean, that *also* needed to be fixed :p
02:37
vstehle has quit [Ping timeout: 268 seconds]
02:41
<alyssa> Definitely getting closer with the factor of 4 thing fixed
02:45
<alyssa> There's still a factor of 8 difference between spilling and arrays
02:48
<alyssa> ...Wait. What.
02:51
<alyssa> Looks like it's an ~internal detail
02:52
<alyssa> and might not apply to panfrost's compiler anyway
02:52
<alyssa> So let's use the register spilling formula and pretend arrays don't exist for a minute.
02:55
<alyssa> So I guess we have the formula for unk0 then, and then getting the actual buffer size should follow from the above observation.
02:56
nerdboy has joined #panfrost
03:05
<alyssa> With the above, the formula for the size of the buffer we allocate follows as an immediate corollary.
03:05
<alyssa> In particular, we have:
03:05
<alyssa> nibble = floor(log2(max(s/8, 31)))
03:05
<alyssa> (roughly; not totally positive about the 31, might be 32, etc.)
03:06
<alyssa> you can rearrange that to "solve" for s and yield in particular:
03:06
<alyssa> s <= 2 ^ (N + 4)
03:06
<alyssa> So we take the size of the stack per thread to be 2^(N + 4)
03:06
<alyssa> recall we allocate the stack for each thread and each core
03:06
<alyssa> So then we trivially need to allocate:
03:06
<alyssa> (2 ^ (N + 4)) * (# threads/core) * (# of cores)
03:07
<alyssa> And that's it :)
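Pulling the pieces above together as a sketch. The 31 boundary is explicitly uncertain in the discussion (might be 32), and the function names are illustrative, not from any driver:

```python
import math

# Sketch of the allocation recipe above. The 31 is the uncertain
# boundary noted in the chat (might be 32); names are illustrative.

def stack_nibble(s):
    # nibble = floor(log2(max(s/8, 31)))
    return math.floor(math.log2(max(s / 8, 31)))

def tls_buffer_size(nibble, threads_per_core, cores):
    # Per-thread stack rounds up to 2^(N + 4); allocate one stack
    # for every thread on every core.
    return (1 << (nibble + 4)) * threads_per_core * cores

# e.g. nibble = 5 on a 4-core GPU at 256 threads/core:
print(tls_buffer_size(5, 256, 4))  # 524288
```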
03:15
<urjaman> yeah sounds sensible - tho my guess above would've made the nibble like floor(log2(s/8 + 30))
03:17
<urjaman> so i'd say be sure to check the boundaries somehow, that is, that there is as much spilling space available as we expect there to be
03:18
<urjaman> or maybe s/8 + 32 (alignment...)
04:10
nerdboy has quit [Ping timeout: 240 seconds]
04:17
_whitelogger has joined #panfrost
04:42
NeuroScr has joined #panfrost
04:56
_whitelogger has joined #panfrost
05:19
nerdboy has joined #panfrost
05:21
NeuroScr has quit [Quit: NeuroScr]
06:00
vstehle has joined #panfrost
06:09
davidlt has joined #panfrost
06:35
_whitelogger has joined #panfrost
06:59
_whitelogger has joined #panfrost
07:59
davidlt has quit [Ping timeout: 276 seconds]
08:17
_whitelogger has joined #panfrost
09:05
_whitelogger has joined #panfrost
11:10
davidlt has joined #panfrost
11:10
stikonas has joined #panfrost
12:22
eballetbo[m] has quit [Quit: killed]
12:23
flacks has quit [Quit: killed]
12:23
EmilKarlson has quit [Quit: killed]
12:23
TheCycoONE1 has quit [Quit: killed]
12:23
thefloweringash has quit [Quit: killed]
13:09
thefloweringash has joined #panfrost
13:09
EmilKarlson has joined #panfrost
13:09
flacks has joined #panfrost
13:09
eballetbo[m] has joined #panfrost
13:09
TheCycoONE1 has joined #panfrost
14:14
sravn has joined #panfrost
14:20
davidlt has quit [Ping timeout: 250 seconds]
14:35
megi has quit [Ping timeout: 240 seconds]
15:11
megi has joined #panfrost
16:03
grw has quit [Ping timeout: 244 seconds]
16:14
davidlt has joined #panfrost
17:11
abordado has joined #panfrost
17:30
megi has quit [Quit: WeeChat 2.6]
20:15
jschwart has joined #panfrost
20:16
<jschwart> what's the current status of Panfrost?
20:17
<jschwart> I have an rk3288 device (Tinkerboard) and I'd like to use Panfrost on Armbian Bionic (or any distribution that's compatible)
20:54
<alyssa> jschwart: OpenGL ES 2.0 is pretty smooth on RK3288 these days, desktop GL 2.1 is decent (depends on the app) :)
21:11
<alyssa> robher: Offhand, do you know which drm_panfrost_param returns the number of cores in the GPU?
21:12
<alyssa> Looooks like popcount(DRM_PANFROST_PARAM_SHADER_PRESENT) should do the trick?
21:13
<alyssa> Ah, yeah, in kbase:
21:13
<alyssa> gpu_props->num_cores = hweight64(raw->shader_present)
21:14
<robher> alyssa: glad I could help. ;)
21:14
<alyssa> Thank you ;P
21:15
<alyssa> oh, "Hamming weight", okay, yes
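That core-count derivation is just a popcount over the shader-present bitmask (kbase's hweight64 is a 64-bit population count). A minimal model, with a hypothetical mask value for illustration:

```python
def popcount64(x):
    # Hamming weight of the low 64 bits, mirroring kbase's hweight64()
    return bin(x & (2 ** 64 - 1)).count("1")

# Hypothetical: a 4-core GPU reporting shader_present = 0b1111
num_cores = popcount64(0b1111)
print(num_cores)  # 4
```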
21:17
<alyssa> Nothing quite like supporting old kernels!~
21:17
<alyssa> ("Or old userspace.")
21:19
cowsay has joined #panfrost
21:30
<alyssa> The other way to think about it is "round size to the next power of two >= 256"
21:30
<alyssa> and then N is log2(that) - 4
21:31
<alyssa> [ or log2(that/16) ]
21:41
<urjaman> alyssa: i feel like this pdf (and the fact that you made it) is very you
21:42
<alyssa> urjaman: Thank you!
21:43
<alyssa> util_logbase2(MAX2(size, 256)) - 4
21:43
<alyssa> util_logbase2_ceil(MAX2(size, 256)) - 4
21:43
<alyssa> if you prefer
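That closed form can be modeled in Python for clarity. util_logbase2_ceil and MAX2 are Mesa helpers; the implementations here are stand-ins, not the Mesa source:

```python
def util_logbase2_ceil(x):
    # ceil(log2(x)) for x >= 1, modeled after Mesa's helper of the same name
    return (x - 1).bit_length()

def unk0_nibble(size):
    # "round size to the next power of two >= 256", then N = log2(that) - 4
    return util_logbase2_ceil(max(size, 256)) - 4

print(unk0_nibble(256))   # 4
print(unk0_nibble(4096))  # 8
```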
21:51
jschwart has quit [Ping timeout: 250 seconds]
21:52
NeuroScr has joined #panfrost
22:07
davidlt has quit [Ping timeout: 240 seconds]
22:22
NeuroScr has quit [Quit: NeuroScr]
22:31
NeuroScr has joined #panfrost
23:17
vstehle has quit [Ping timeout: 265 seconds]
23:22
vstehle has joined #panfrost
23:35
NeuroScr has quit [Quit: NeuroScr]
23:49
gcl has joined #panfrost