#panfrost on 2020-04-11 — irc logs at freenode.irclog.whitequark.org

2019-09-06 11:20 alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature

00:03 <alyssa> not sure about that second R0 argument... the port is being used..

00:03 <alyssa> regardless:

00:03 <alyssa> 2^x = opcd58(op7930(E_24(x), x))

00:05 <alyssa> opcd58 is near the various special TABLE ops so that's good

00:05 <alyssa> op7930 is actually near.. type conversions?

00:05 <alyssa> looks like a variant of f32 to i32, oddly

00:07 <alyssa> Time to break out the assembler, I guess. That or eat dinner.

00:10 <alyssa> Also, the blob seems unwilling to fold things into the lowered code. Not sure what's up with that yet.

00:21 <icecream95> Performance counters + Gallium HUD: https://gitlab.freedesktop.org/snippets/953

00:26 <alyssa> icecream95: niiiice :D

00:41 <anarsoul> icecream95: doom?

00:47 stikonas has quit [Remote host closed the connection]

00:57 * alyssa returns

00:58 <alyssa> 0x7930 is clearly another F32_TO_I32 variant

00:58 <alyssa> I'm not sure what the distinction is.. maybe handling of NaN / infinity or something

01:01 <HdkR> Rounding direction isn't too uncommon

01:02 <alyssa> HdkR: Ah, yes, could be

01:02 <alyssa> midgard has variants for rounding mode

01:03 <HdkR> If they moved even closer to NEON then maybe we have 4 rounding modes to choose :P

01:04 <alyssa> same behaviour at +inf

01:04 <alyssa> HdkR: we've had 4 modes since midgard t6xx :D

01:04 <HdkR> Perfect

01:05 <alyssa> and yes, can confirm it's round mode :)

01:05 <alyssa> which means there's way more variants. todo: fix packing code for that :'(

01:05 <HdkR> trunc, -inf, +inf, ties even? :D

01:06 <alyssa> sounds right

01:06 <alyssa> rtz, rte, rtn, rtp

01:06 <HdkR> yep yep

01:07 <alyssa> Anyway, interpolating a little bit, that means this is F32_TO_I32.RTE
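
[Editor's sketch] The four rounding modes named above (rtz, rte, rtn, rtp) can be illustrated with Python's standard library. The names `ROUND_MODES` and `f32_to_i32` are hypothetical and exist only for this illustration. The sketch shows the conventional meaning of each suffix; it does not model 32-bit float precision, saturation, or NaN handling on the hardware.

```python
import math

# Hypothetical helper table: each entry maps a float to an integer the way
# the corresponding rounding-mode suffix is conventionally defined.
ROUND_MODES = {
    "rtz": math.trunc,  # round toward zero
    "rte": round,       # round to nearest, ties to even (Python's round)
    "rtn": math.floor,  # round toward negative infinity
    "rtp": math.ceil,   # round toward positive infinity
}

def f32_to_i32(x, mode="rte"):
    """Convert x to an integer using the named rounding mode."""
    return ROUND_MODES[mode](x)

# Halfway cases are where the four modes disagree most visibly.
for v in (2.5, -2.5):
    print(v, {m: f32_to_i32(v, m) for m in ROUND_MODES})
```

The halfway inputs are chosen because they separate "ties to even" from the directed modes.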

01:08 <alyssa> So we have

01:08 <alyssa> 2^x = opCD58(f32_to_i32.rte(x * 2^24))

01:09 <alyssa> so I guess opCD58 is a lookup table for exp2... but with 24-bits of precision, I guess?

01:10 <alyssa> Odd that there's no fixup needed. Seems.. hm.

01:18 <alyssa> uh, ok. cd58(..) takes that reduction *and* the original x.

01:25 <icecream95> Does it make more sense for performance counter values to be per frame or per second?

01:26 <alyssa> icecream95: Probably depends what you're measuring? But I don't think mali's counters are reliable to per-frame granularity (though I'd love to be proven wrong)

01:26 <alyssa> what does gator do?

01:27 <icecream95> alyssa: This isn't about how often to read the values, but how to scale them.

01:27 <alyssa> Ah

01:27 <alyssa> Probably per second since fps is variable then?

01:27 <alyssa> er

01:28 <alyssa> but the counts will vary with the fps

01:28 <alyssa> um

01:28 <alyssa> this seems recursive
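
[Editor's sketch] The two scalings under discussion differ only in the divisor applied to a raw counter delta. The function name `scale_counter` and its parameters are hypothetical; this is not the Gallium HUD or kernel counter interface, just the arithmetic.

```python
def scale_counter(delta, elapsed_s, frames):
    """Scale a raw counter delta two ways.

    delta     -- counter increase over the sampling window (hypothetical input)
    elapsed_s -- length of the sampling window in seconds
    frames    -- number of frames rendered during the window
    """
    per_second = delta / elapsed_s
    per_frame = delta / frames
    return per_second, per_frame

# Same per-frame workload sampled at 60 fps and at 30 fps over one second:
# the per-frame value is unchanged, the per-second value tracks the frame rate.
print(scale_counter(6000, 1.0, 60))
print(scale_counter(3000, 1.0, 30))
```

In other words, per-second equals per-frame times fps, so a per-second graph moves with the frame rate even when the work per frame is constant.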

01:33 <alyssa> okay, the values returned from cd58(..) become quickly wrong..

01:34 <alyssa> I don't see how it would work for x > 15

01:34 <alyssa> Yeah, it only respects 31-bits of its input

01:35 <alyssa> top bit is ignored

01:35 <alyssa> er no

01:35 <alyssa> top nibble. so 28-bits

01:35 <alyssa> so maybe there's more to that mscale

01:38 <alyssa> also, the spec says exp2(x) has precision (3 + 2 |x|) ULP

01:47 <alyssa> and yet it works

01:56 <alyssa> okay, it genuinely uses both the fixed int version and the float. fun!

01:56 <alyssa> mystery closed, I guess.
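
[Editor's sketch] One way to picture the exp2 lowering above is as a fixed-point split: convert x to an integer with 24 fractional bits (the `f32_to_i32.rte(x * 2^24)` step), then treat the low 24 bits as the fraction and the remaining bits as the exponent. How opCD58 actually combines the fixed-point value with the original float is not stated in the log, so the split below is an assumption, and `exp2_sketch` is a hypothetical name. The `2.0 ** frac` call stands in for whatever table the hardware uses.

```python
import math

FRAC_BITS = 24  # from the x * 2^24 scaling in the log

def exp2_sketch(x):
    # Fixed-point conversion; Python's round() rounds ties to even, like .rte.
    fixed = round(x * (1 << FRAC_BITS))
    # Arithmetic shift gives floor(x) even for negative x.
    int_part = fixed >> FRAC_BITS
    # Masking yields a non-negative fraction in [0, 1).
    frac = (fixed & ((1 << FRAC_BITS) - 1)) / (1 << FRAC_BITS)
    # 2^x = 2^frac * 2^int_part; 2^frac stands in for the table lookup.
    return math.ldexp(2.0 ** frac, int_part)

for x in (3.5, -1.25, 0.1):
    print(x, exp2_sketch(x), 2.0 ** x)
```

With 24 fractional bits, a signed 32-bit integer leaves 8 bits for the integer part, so the fixed-point value alone cannot cover the full float range.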

01:57 <alyssa> Next up - log2

01:57 adjtm_ has quit [Ping timeout: 240 seconds]

01:57 adjtm has joined #panfrost

02:00 <alyssa> log2(x) = opCC68(x) * op1E80(-1, x, x) + i2f(log_frexpe(x))

02:00 <alyssa> CC68 looks like another table

02:00 <alyssa> (but cf60 is already flog2_table..)

02:01 <alyssa> At least the + i2f(log_frexpe(x)) part is the same as the g71 formula

02:02 <alyssa> op1e80 is fma fwiw

02:02 <alyssa> er unit

02:03 vstehle has quit [Ping timeout: 260 seconds]

02:04 <alyssa> e1e80, I think

02:11 <alyssa> to the asm!

02:17 <icecream95> alyssa: Per frame values now work: https://gitlab.freedesktop.org/snippets/955

02:27 <alyssa> icecream95: neat!

02:27 <alyssa> (image doesn't load here..)

02:35 <icecream95> alyssa: I re-uploaded the image. It looks like it broke when I had to enable Javascript to participate in training AI for cyclist detection

02:35 <alyssa> assimilate into the borg.js

02:36 <alyssa> and super neat!!

02:37 <Lyude> alyssa: any packing work I could help out with?

02:38 <alyssa> Lyude: Have anything in mind?

02:38 <alyssa> it's probably todo ;)

02:38 <Lyude> alyssa: well what do you have working so far?

02:39 <alyssa> Packing of some common ops generally to one unit (FMA *or* ADD but not both)

02:39 <alyssa> Clauses with exactly 1 instruction bundle (and possibly 1 constant)

02:39 <alyssa> (but not larger shaders)

02:39 <alyssa> s/shaders/clauses/

02:40 <alyssa> fp32 and a bit of fp16, no int yet or fp64

02:40 <alyssa> Oh, and unit testing so you can test new packing code against real hardware without touching anything else in the compiler

02:41 <alyssa> (so you can dig into alt units or bigger clauses without making the scheduler do that yet)

02:45 <alyssa> (Don't take that as pressure, if you want to just hack away I'd be happy to write the tests :) )

03:09 <Lyude> alyssa: i'm fine with writing tests

03:25 <alyssa> Lyude: fair enough, lmk if you have questions :)

03:40 <alyssa> Okay, it looks like I misinterpreted op1e80 and it's in fact a 2-op

03:40 <alyssa> op1e80(-1, x) seems to just be doing x - 1?

03:40 <alyssa> Not sure why they're not just using a regular ADD.f32 then

03:42 <alyssa> at least for -0.75 < x < 1.5, let's see what happens with bigger x

03:44 <alyssa> op1e80(-1, 2.5) = 0.25

03:45 <alyssa> so clearly it's more involved than an add. that'd be... an add and a >>1?

03:47 <alyssa> op1e80(-1, 3.5) = -0.125

03:48 <alyssa> op1e80(-1, 4.5) = 0.125

03:48 <alyssa> Uhm. Okay.

03:51 <alyssa> Ahh

03:51 <alyssa> op1e80(-1, x) = (x / 2^{log_frexpe(x)}) - 1

03:52 <alyssa> That is - reduce to [0.75, 1.5) and then subtract 1

03:56 <alyssa> = 2^{-f(x)} x - 1 ... might be clearer

03:56 <alyssa> letting opC..whatever be T(x), that means we have

03:57 <alyssa> log2(x) = (2^{-f(x)} x - 1) T(x) + f(x)

04:00 <alyssa> If $2^{-f(x)} x - 1 = 0$, then we see $x = 2^{f(x)}$, so we're just taking a log2 of a power of two (which is what f(x) = log_frexpe(x) does anyway)

04:00 <alyssa> Otherwise we can divide through and see for non-power-of-two x..

04:02 <alyssa> Letting $u = 2^{-f(x)} x$ -- that is, the reduced version of $x$, we see that T(x) = log2(u) / (u - 1)

04:02 <alyssa> Why that's easier to calculate in hw I don't yet know but I suspect...
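
[Editor's sketch] The decomposition log2(x) = (2^{-f(x)} x - 1) T(x) + f(x) can be checked numerically. The Python names below are hypothetical stand-ins: `log_frexpe` mimics the exponent that reduces x to [0.75, 1.5), `reduce_minus_one` plays the role of op1e80(-1, x), and `table_T` evaluates T exactly with `math.log2` instead of a hardware table. Only positive finite x is handled.

```python
import math

def log_frexpe(x):
    # math.frexp gives x = m * 2**e with 0.5 <= m < 1.
    # Shift the exponent so the reduced value lands in [0.75, 1.5).
    m, e = math.frexp(x)
    return e - 1 if m < 0.75 else e

def reduce_minus_one(x):
    # Stand-in for op1e80(-1, x): (x / 2^f(x)) - 1.
    return math.ldexp(x, -log_frexpe(x)) - 1.0

def table_T(x):
    # T(x) = log2(u) / (u - 1) with u the reduced x; limit 1/ln(2) at u = 1.
    u = math.ldexp(x, -log_frexpe(x))
    if u == 1.0:
        return 1.0 / math.log(2.0)
    return math.log2(u) / (u - 1.0)

def log2_sketch(x):
    return reduce_minus_one(x) * table_T(x) + float(log_frexpe(x))

# The reduction reproduces the values observed in the log for 2.5, 3.5, 4.5.
for x in (2.5, 3.5, 4.5):
    print(x, reduce_minus_one(x))
print(log2_sketch(10.0), math.log2(10.0))
```

The three printed reductions match the op1e80(-1, x) results quoted earlier (0.25, -0.125, 0.125), which is consistent with the [0.75, 1.5) reduction reading.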

04:03 <chewitt> I got excited reading the backlog .. bits of cube and such!

04:03 <chewitt> now I'm back to feeling stupid again :)

04:06 <alyssa> Ah, yes, okay

04:06 <alyssa> So, T(x) *is* our logarithm approximation, the rest is just setup, right?

04:08 <alyssa> And all this setup is argument reduction, so instead of dealing with (0, inf), we just need hw for (0.75, 1.5)

04:08 <alyssa> So going back to `T(x) = log2(u) / (u - 1)`, we can look at the Taylor expansion of log2

04:08 <alyssa> actually, one more thing to make this super textbook. Let u' = u + 1

04:09 <alyssa> er, no, u' = u - 1, sorry messy handwriting

04:09 <alyssa> T(x) = log2(1 + u') / u'

04:09 <alyssa> But log2(1 + u') has the Taylor series

04:09 <alyssa> $\sum_{n = 1}^{\infty} (-1)^{n + 1} \frac{(u')^n}{n}$

04:10 <alyssa> so log2(1 + u') / u' has the Taylor series

04:10 <alyssa> $\sum_{n = 1}^{\infty} (-1)^{n + 1} \frac{(u')^n}{n - 1}$

04:10 <alyssa> erm

04:10 <alyssa> $\sum_{n = 1}^{\infty} (-1)^{n + 1} \frac{(u')^{n - 1}}{n}$

04:12 <alyssa> So assuming they're using a Taylor approximation to compute this (this is pure speculation, this detail is invisible to us AFAIK) - the extra multiplication on the ISA side removes a bunch of multiplications in hw, which is important to keep everything to one cycle

04:13 <alyssa> I realize this is a "just so" explanation, but it's as good as any?

04:18 <alyssa> Oh, d'oh, that's the series for natural log, not log2

04:18 <alyssa> But the idea's probably similar

04:19 <alyssa> Just multiply by 1/log(2) at the end

04:22 <alyssa> ("Are you even doing panfrost anymore alyssa?"

04:23 <alyssa> "Calculus exam is approaching from the right...")

04:27 <alyssa> Main hole in my theory is that it takes bunches of terms for that series to converge but hey.
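
[Editor's sketch] The convergence concern can be made concrete by truncating the natural-log series for ln(1 + u') / u', scaling by 1/ln(2) as suggested above, and comparing against the exact value. The names `T_series` and `T_exact` are hypothetical, and nothing here claims the hardware uses a Taylor series; u' = 0.5 is the upper edge of the reduced range, where convergence is slowest.

```python
import math

def T_series(u_prime, terms):
    # Truncated series: sum_{n=1}^{terms} (-1)^(n+1) * u'^(n-1) / n,
    # then scaled by 1/ln(2) to turn the natural log into log2.
    s = 0.0
    for n in range(1, terms + 1):
        s += (-1) ** (n + 1) * u_prime ** (n - 1) / n
    return s / math.log(2.0)

def T_exact(u_prime):
    # Reference value: log2(1 + u') / u', valid for u' != 0.
    return math.log2(1.0 + u_prime) / u_prime

# Print the truncation error near the edge of the reduced range.
for terms in (4, 8, 16, 32):
    err = abs(T_series(0.5, terms) - T_exact(0.5))
    print(terms, err)
```

Running this shows the error shrinking with each doubling of the term count, which gives a feel for how many terms a given precision target would need.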

04:36 davidlt_ has joined #panfrost

04:40 davidlt_ is now known as davidlt

05:00 vstehle has joined #panfrost

05:19 buzzmarshall has quit [Remote host closed the connection]

05:54 <tomeu> alyssa: yep, happy to look at textures and sampling

05:54 <tomeu> great work!

05:54 <tomeu> icecream95: that looks very good!

06:16 tomboy65 has quit [Ping timeout: 240 seconds]

06:31 tomboy65 has joined #panfrost

06:54 <Werner> Hey there. I tried to get ioquake3 on my H6 based SBC running and got this error. Not sure if related to ioquake or Panfrost: tty]ioquake3.aarch64: ../src/gallium/drivers/panfrost/pan_sfbd.c:77: panfrost_sfbd_format: Assertion `!"Invalid format rendering"' failed.

07:05 tomboy65 has quit [Ping timeout: 240 seconds]

07:33 swiftgeek is now known as swiftgeek_

07:33 swiftgeek_ is now known as swiftgeek__

07:33 swiftgeek__ is now known as swiftgeek

08:21 <icecream95> Werner: The OpenGL 2 renderer tries to use a R16G16B16A16_FLOAT framebuffer, which I don't think SFBD GPUs support.

08:22 <icecream95> You can try using the OpenGL 1 renderer for ioquake by appending +set cl_renderer opengl1 to the command line

08:22 <Werner> Will do, thanks.

08:26 <Werner> Okay, starts, but it is unplayable due to artefacts. Well, it was worth the try :)

08:26 <icecream95> Werner: What Mesa version are you using?

08:27 <Werner> 2.1 Mesa 20.1.0-deve- (git-7aa6720ba4)

08:27 <Werner> *devel

08:28 <Werner> Running on Linux 5.6.2

09:03 davidlt has quit [Ping timeout: 265 seconds]

10:33 davidlt has joined #panfrost

10:37 stikonas has joined #panfrost

10:43 raster has joined #panfrost

11:09 icecream95 has quit [Ping timeout: 265 seconds]

12:06 yann|work has quit [Ping timeout: 260 seconds]

12:18 yann|work has joined #panfrost

12:55 ChanServ has quit [shutting down]

13:00 cwabbott has quit [Quit: cwabbott]

13:03 ChanServ has joined #panfrost

13:08 ChanServ has quit [*.net *.split]

13:09 stikonas has quit [Ping timeout: 246 seconds]

13:10 ChanServ has joined #panfrost

13:17 stikonas has joined #panfrost

13:22 stikonas has quit [Remote host closed the connection]

13:22 stikonas has joined #panfrost

13:28 stikonas has quit [Ping timeout: 246 seconds]

13:34 stikonas has joined #panfrost

13:43 cwabbott has joined #panfrost

13:53 mmind00 has quit [Remote host closed the connection]

14:05 mmind00 has joined #panfrost

14:13 cwabbott has quit [Quit: cwabbott]

14:29 buzzmarshall has joined #panfrost

14:48 yann|work has quit [Ping timeout: 265 seconds]

14:59 yann|work has joined #panfrost

15:11 yann|work has quit [Read error: No route to host]

15:12 yann|work has joined #panfrost

15:30 raster has quit [Quit: Gettin' stinky!]

15:34 yann|work has quit [Read error: No route to host]

15:35 yann|work has joined #panfrost

15:53 raster has joined #panfrost

17:23 nerdboy has joined #panfrost

18:42 stikonas_ has joined #panfrost

18:42 stikonas has quit [Ping timeout: 246 seconds]

19:13 raster has quit [Quit: Gettin' stinky!]

19:35 davidlt has quit [Ping timeout: 264 seconds]

20:20 indy has quit [Ping timeout: 264 seconds]

20:27 raster has joined #panfrost

20:35 indy has joined #panfrost

20:37 yann|work has quit [Ping timeout: 256 seconds]

21:57 yann has joined #panfrost

22:25 afaerber has quit [Quit: Leaving]

22:29 afaerber has joined #panfrost

23:27 buzzmarshall has quit [Remote host closed the connection]

23:35 unoccupied has joined #panfrost

23:44 stikonas_ has quit [Ping timeout: 246 seconds]

23:44 stikonas_ has joined #panfrost