00:15
Pak0st has quit [Remote host closed the connection]
00:40
nerdboy has joined #panfrost
00:45
<alyssa> Happy to hear it!
00:46
* alyssa still needs to figure out thread local storage sizing
00:50
<alyssa> So we have the inequality:
00:50
<alyssa> thread_tls_alloc <= max_threads
00:51
<alyssa> and the default value of max_threads is THREAD_MT_DEFAULT = 256
00:56
<alyssa> Tentatively assuming max_threads <= 256
00:56
<alyssa> so we have $thread_tls_alloc \leq max_threads \leq 256 \Rightarrow thread_tls_alloc \leq 256$
00:57
<alyssa> Granted all of this info is routed from the kernel nowadays, but I digress.
00:57
<alyssa> Old kernels or something.
00:58
<alyssa> kbase describes tls_alloc as per-core (in mali_base_kernel.h)
00:58
<alyssa> The T860 in RK3399 is 4-core, so let's use that for calculations for now
00:59
<alyssa> So let's just try some things and see what breaks?
00:59
<HdkR> TLS size calculations are fun for recursive function calls that spill
01:00
<alyssa> HdkR: Recursion is forbidden in GLSL
01:00
<HdkR> It's forbidden with subroutines as well
01:01
<HdkR> Subroutines on a driver stack that implements it as true subroutines can be cheesed to have real recursion though..
01:03
<alyssa> Well, the obvious formula doesn't work (too small)
01:05
<urjaman> the obvious being cores*threads*tls_size ? (or something like that)
01:05
<alyssa> urjaman: That's the one
01:05
<urjaman> yeah i was thinking about that a few hours ago already ...
01:05
tgall_foo has quit [Ping timeout: 265 seconds]
01:06
<urjaman> if that's too small then i'd throw a guess that maybe something is aligned to ... idk, pages
01:06
<alyssa> But it's off by a factor of 4 or 8 or something
01:06
<urjaman> like for each core, or for each thread
01:08
<alyssa> urjaman: cores * threads * ALIGN(tls_size, 1024)
01:08
<alyssa> ^^ that matches the "off by a factor of 4 < 6.4 <= 8"
01:08
<alyssa> But does it generalize?
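As a sketch, the candidate formula being tested here can be written out directly. All names, and the 1024-byte alignment, are tentative guesses from this conversation, not confirmed hardware behavior:

```python
# Tentative sketch of the TLS sizing guess discussed above:
#   total = cores * threads * ALIGN(tls_size, 1024)
# Names and the 1024 alignment are assumptions from this conversation.

def align(x, a):
    """Round x up to the next multiple of a."""
    return (x + a - 1) // a * a

def tls_total_size(cores, threads_per_core, tls_size):
    return cores * threads_per_core * align(tls_size, 1024)

# e.g. the 4-core T860 with 256 threads/core and tls_size = 160:
print(tls_total_size(4, 256, 160))  # 1048576
```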
01:09
<urjaman> i suppose testing is needed
01:09
<alyssa> (with tls_size = 160, that was 0x1e6 fwiw)
01:09
<alyssa> if I bump to 1e7, the formula is too small
01:10
<alyssa> but if I multiply by a factor of 2, it's fine for 1e7
01:11
<urjaman> oh yeah i didn't really get what that value was about when i looked at the hack patch (well, i guess maybe you don't either with how it was .unk0 ..:)
01:11
<urjaman> that's what made me think about what the size should be :P
01:12
<alyssa> with 1e8, need another factor of 2.
01:12
<alyssa> From which we conclude that unk0 is logarithmic
01:12
<alyssa> (log2, specifically)
01:13
<urjaman> has it always been 0x1e_ ? i'm feeling maybe the lower nibble is the shift and the rest is something else ...
01:14
<alyssa> Good question
01:14
<alyssa> Sometimes I've seen 0x0 but that's lazy
01:14
<alyssa> Usually it's 1e4 (with no spilling)
01:14
<alyssa> 1e5 with modest spilling
01:14
<alyssa> 1e6 with a lot of spilling
01:14
<alyssa> If I put an array on the stack, then it goes all wacky
01:15
* alyssa hasn't looked at this in 4 months, may take a bit to find her notes
01:16
<alyssa> ...Uhm, they're not in the usual place
01:17
<alyssa> Oh, it was in here
01:17
<alyssa> Starts at 16:02
01:18
<alyssa> Oh, right. Those notes are pieces of paper thousands of km away, right.
01:25
stikonas has quit [Remote host closed the connection]
01:52
nerdboy has quit [Ping timeout: 276 seconds]
02:01
<alyssa> Agh, getting inconsistent results!
02:02
<alyssa> For the arrays, the formula:
02:02
<alyssa> max(log2(size), 0x5)
02:02
<alyssa> works just fine
02:02
<alyssa> Also note with a little algebra you get "size <= pow2(nibble)"
02:02
<alyssa> which explains some of the "special" alignments
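The array-case formula above can be sketched with the rearrangement "size <= pow2(nibble)" made explicit. Reading log2 as ceil(log2) (so the inequality holds exactly) and the 0x5 floor are both assumptions taken from this chat, not verified against hardware:

```python
import math

# Sketch of the array-case formula above: nibble = max(log2(size), 0x5).
# ceil(log2) is assumed so that size <= 2**nibble holds, per the
# rearrangement; this reading is a guess, not confirmed behavior.

def array_nibble(size):
    return max(math.ceil(math.log2(size)), 0x5)
```

Any size up to 2**nibble then maps back to the same nibble, which is where the "special" alignments come from.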
02:02
<alyssa> But for regular register spilling, that doesn't seem to work
02:03
<alyssa> Not clear in which 'direction' it's wrong
02:07
<urjaman> maybe the 0x1e part is some sort of a "partitioning" of the TLS between the array/not-spilling area vs the for-spilling area ?
02:08
<urjaman> so you'd need the shift for 0x1e + tls_size ... or something like that
02:09
<urjaman> conceptually, likely not exactly :P
02:10
Stary has joined #panfrost
02:16
<alyssa> Quite possible
02:16
<alyssa> But ... conceptually we're just spilling to an array, right...?
02:17
<alyssa> Here's issue #1: my offset calculations in the shader are off by a factor of 4
02:17
<urjaman> yeah, but you did say array access vs spilling used different instructions...
02:17
<alyssa> Slightly different
02:17
<alyssa> Same instruction but different magic number
02:18
<alyssa> Then again, with neither magic number, that instruction is used to load (/store resp.) to random addresses in GPU memory
02:18
<alyssa> Versatile.
02:19
<alyssa> With this off-by-factor-of-4 issue pushed out
02:19
<alyssa> I miiiight be making progress...?
02:27
<alyssa> I mean, that *also* needed to be fixed :p
02:37
vstehle has quit [Ping timeout: 268 seconds]
02:41
<alyssa> Definitely getting closer with the factor of 4 thing fixed
02:45
<alyssa> There's still a factor of 8 difference between spilling and arrays
02:48
<alyssa> ...Wait. What.
02:51
<alyssa> Looks like it's an ~internal detail
02:52
<alyssa> and might not apply to panfrost's compiler anyway
02:52
<alyssa> So let's use the register spilling formula and pretend arrays don't exist for a minute.
02:55
<alyssa> So I guess we have the formula for unk0 then, and then getting the actual buffer size should follow from the above observation.
02:56
nerdboy has joined #panfrost
03:05
<alyssa> With the above, the formula for the size of the buffer we allocate follows as an immediate corollary.
03:05
<alyssa> In particular, we have:
03:05
<alyssa> nibble = floor(log2(max(s/8, 31)))
03:05
<alyssa> (roughly; not totally positive about the 31, might be 32, etc.)
03:06
<alyssa> you can rearrange that to "solve" for s and yield in particular:
03:06
<alyssa> s <= 2 ^ (N + 4)
03:06
<alyssa> So we take the size of the stack per thread to be 2^(N + 4)
03:06
<alyssa> recall we allocate the stack for each thread and each core
03:06
<alyssa> So then we trivially need to allocate:
03:06
<alyssa> (2 ^ (N + 4)) * (# threads/core) * (# of cores)
03:07
<alyssa> And that's it :)
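Pulling the pieces above together as a sketch. The 31 boundary is explicitly uncertain in the discussion (might be 32), and the function names are illustrative, not from any driver:

```python
import math

# Sketch of the allocation recipe above. The 31 is the uncertain
# boundary noted in the chat (might be 32); names are illustrative.

def stack_nibble(s):
    # nibble = floor(log2(max(s/8, 31)))
    return math.floor(math.log2(max(s / 8, 31)))

def tls_buffer_size(nibble, threads_per_core, cores):
    # Per-thread stack rounds up to 2^(N + 4); allocate one stack
    # for every thread on every core.
    return (1 << (nibble + 4)) * threads_per_core * cores

# e.g. nibble = 5 on a 4-core GPU at 256 threads/core:
print(tls_buffer_size(5, 256, 4))  # 524288
```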
03:15
<urjaman> yeah sounds sensible - tho my guess above would've made the nibble like floor(log2(s/8 + 30))
03:17
<urjaman> so i'd say be sure to check the boundaries somehow, that is, that there is as much spilling space available as we expect there to be
03:18
<urjaman> or maybe s/8 + 32 (alignment...)
04:10
nerdboy has quit [Ping timeout: 240 seconds]
04:17
_whitelogger has joined #panfrost
04:42
NeuroScr has joined #panfrost
04:56
_whitelogger has joined #panfrost
05:19
nerdboy has joined #panfrost
05:21
NeuroScr has quit [Quit: NeuroScr]
06:00
vstehle has joined #panfrost
06:09
davidlt has joined #panfrost
06:35
_whitelogger has joined #panfrost
06:59
_whitelogger has joined #panfrost
07:59
davidlt has quit [Ping timeout: 276 seconds]
08:17
_whitelogger has joined #panfrost
09:05
_whitelogger has joined #panfrost
11:10
davidlt has joined #panfrost
11:10
stikonas has joined #panfrost
12:22
eballetbo[m] has quit [Quit: killed]
12:23
flacks has quit [Quit: killed]
12:23
EmilKarlson has quit [Quit: killed]
12:23
TheCycoONE1 has quit [Quit: killed]
12:23
thefloweringash has quit [Quit: killed]
13:09
thefloweringash has joined #panfrost
13:09
EmilKarlson has joined #panfrost
13:09
flacks has joined #panfrost
13:09
eballetbo[m] has joined #panfrost
13:09
TheCycoONE1 has joined #panfrost
14:14
sravn has joined #panfrost
14:20
davidlt has quit [Ping timeout: 250 seconds]
14:35
megi has quit [Ping timeout: 240 seconds]
15:11
megi has joined #panfrost
16:03
grw has quit [Ping timeout: 244 seconds]
16:14
davidlt has joined #panfrost
17:11
abordado has joined #panfrost
17:30
megi has quit [Quit: WeeChat 2.6]
20:15
jschwart has joined #panfrost
20:16
<jschwart> what's the current status of Panfrost?
20:17
<jschwart> I have an rk3288 device (Tinkerboard) and I'd like to use Panfrost on Armbian Bionic (or any distribution that's compatible)
20:54
<alyssa> jschwart: OpenGL ES 2.0 is pretty smooth on RK3288 these days, desktop GL 2.1 is decent (depends on the app) :)
21:11
<alyssa> robher: Offhand, do you know which drm_panfrost_param returns the number of cores in the GPU?
21:12
<alyssa> Looooks like popcount(DRM_PANFROST_PARAM_SHADER_PRESENT) should do the trick?
21:13
<alyssa> Ah, yeah, in kbase:
21:13
<alyssa> gpu_props->num_cores = hweight64(raw->shader_present)
21:14
<robher> alyssa: glad I could help. ;)
21:14
<alyssa> Thank you ;P
21:15
<alyssa> oh, "Hamming weight", okay, yes
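That core-count derivation is just a popcount over the shader-present bitmask (kbase's hweight64 is a 64-bit population count). A minimal model, with a hypothetical mask value for illustration:

```python
def popcount64(x):
    # Hamming weight of the low 64 bits, mirroring kbase's hweight64()
    return bin(x & (2 ** 64 - 1)).count("1")

# Hypothetical: a 4-core GPU reporting shader_present = 0b1111
num_cores = popcount64(0b1111)
print(num_cores)  # 4
```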
21:17
<alyssa> Nothing quite like supporting old kernels!~
21:17
<alyssa> ("Or old userspace.")
21:19
cowsay has joined #panfrost
21:30
<alyssa> The other way to think about it is "round size to the next power of two >= 256"
21:30
<alyssa> and then N is log2(that) - 4
21:31
<alyssa> [ or log2(that/16) ]
21:41
<urjaman> alyssa: i feel like this pdf (and the fact that you made it) is very you
21:42
<alyssa> urjaman: Thank you!
21:43
<alyssa> util_logbase2(MAX2(size, 256)) - 4
21:43
<alyssa> util_logbase2_ceil(MAX2(size, 256)) - 4
21:43
<alyssa> if you prefer
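That closed form can be modeled in Python for clarity. util_logbase2_ceil and MAX2 are Mesa helpers; the implementations here are stand-ins, not the Mesa source:

```python
def util_logbase2_ceil(x):
    # ceil(log2(x)) for x >= 1, modeled after Mesa's helper of the same name
    return (x - 1).bit_length()

def unk0_nibble(size):
    # "round size to the next power of two >= 256", then N = log2(that) - 4
    return util_logbase2_ceil(max(size, 256)) - 4

print(unk0_nibble(256))   # 4
print(unk0_nibble(4096))  # 8
```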
21:51
jschwart has quit [Ping timeout: 250 seconds]
21:52
NeuroScr has joined #panfrost
22:07
davidlt has quit [Ping timeout: 240 seconds]
22:22
NeuroScr has quit [Quit: NeuroScr]
22:31
NeuroScr has joined #panfrost
23:17
vstehle has quit [Ping timeout: 265 seconds]
23:22
vstehle has joined #panfrost
23:35
NeuroScr has quit [Quit: NeuroScr]
23:49
gcl has joined #panfrost