marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
<bloom> fadd16/fmul16/fmadd16 seem fine?
vlixa has quit [Remote host closed the connection]
<bloom> if you want to know what my reference point for regularity is i'll link you the bifrost encoding :p
<dougall> yeah, they could definitely be worse - i'm just thinking of moving the Am/Bm/Cm field relative to their 32-bit equivalents (why did they do that?), and the fact that 32-bit ops can have 16-bit sources and destinations too
<bloom> "and the fact that 32-bit ops can have 16-bit sources and destinations too"
<bloom> This part makes a ton of sense.
<bloom> The 32-bit ops are heavier weight. Yes, you _can_ run a fadd.32 with all operands 16-bit, but that will (depending on uarch details that are not ISA visible) be slower or higher power.
<bloom> It's fundamentally a different operation. Convert, fp32 multiply, convert, versus fp16 multiply. The latter is much cheaper. (The converts are cheap regardless.)
<dougall> ah, good point, yeah... that makes sense
<bloom> on some arches, fp16 multiply is even vectorized (where fp32 is scalar, conversions be damned)
<bloom> it's that much cheaper :>
<bloom> Honestly the most annoying part of the encoding is the presence of >64-bit instructions
<bloom> Makes the bit arithmetic awful.
<dougall> yeah, C is particularly painful for that... i'd probably use __int128, and i'd probably end up regretting it :p
<bloom> lol
<bloom> ok, added some generic ALU packing code
<bloom> 2 lines shorter than I was before :-p
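A minimal sketch of why instructions wider than 64 bits are awkward in C but not in Python, where arbitrary-precision integers let a generic packer treat the whole instruction as one number. The field positions, the 10-byte length, and the little-endian byte order are illustrative assumptions, not the real AGX layout, and `pack_fields` / `unpack_field` are hypothetical helper names.

```python
def pack_fields(fields, size_bytes):
    """Pack (value, lsb, width) triples into a little-endian instruction.

    Python ints are unbounded, so a field may sit above bit 63
    without any special casing.
    """
    word = 0
    for value, lsb, width in fields:
        assert 0 <= value < (1 << width), "value does not fit in field"
        assert lsb + width <= size_bytes * 8, "field exceeds instruction"
        mask = ((1 << width) - 1) << lsb
        assert word & mask == 0, "field overlaps an earlier one"
        word |= value << lsb
    return word.to_bytes(size_bytes, "little")


def unpack_field(raw, lsb, width):
    """Extract one field from an encoded instruction."""
    return (int.from_bytes(raw, "little") >> lsb) & ((1 << width) - 1)


# Hypothetical 80-bit instruction: a 7-bit opcode at bit 0 and a
# 2-bit field at bit 70, which would straddle nothing in a uint64_t.
raw = pack_fields([(0x0E, 0, 7), (0x3, 70, 2)], size_bytes=10)
```

In C the same thing needs either `__int128` or splitting every field access across two 64-bit words, which is where the bit arithmetic gets unpleasant.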
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
vijfhoek has quit [Ping timeout: 246 seconds]
vijfhoek has joined #asahi-gpu
anuejn has quit [Ping timeout: 246 seconds]
anuejn has joined #asahi-gpu
<bloom> ...and with the generic stuff, it was a cinch to add support for all the funops
<bloom> dougall: " if sx and source.thread_bit_size >= 16:"
<bloom> I suspect s/>= 16/< 64/ was intended.
<bloom> I do wonder, if there's native 64-bit adds, why I see the blob lowering to a pair of adds
<bloom> Oh, maybe because there's no 64-bit access to uniform registers.
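For reference, lowering a 64-bit add to a pair of 32-bit adds generally looks like the sketch below: add the low halves, then add the high halves plus the carry out of the low add. This is the generic technique, not a claim about the exact instruction sequence the blob emits; `add64_via_32` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def add64_via_32(a, b):
    """Add two 64-bit values using only 32-bit-wide additions."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    lo_sum = a_lo + b_lo
    lo = lo_sum & MASK32
    carry = lo_sum >> 32  # 1 if the low add overflowed, else 0

    hi = (a_hi + b_hi + carry) & MASK32  # wraps modulo 2**64 overall
    return (hi << 32) | lo
```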
<dougall> hmm, yeah, i think you're right about < 64...
<bloom> not a r/e thing, just a "what is sign-extension?" thing ;)
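A short sketch of the point about `< 64`: sign extension only does anything when the source is narrower than the destination, so the guard belongs on "narrower than 64 bits", not on "at least 16 bits". The function name and the 64-bit default are assumptions for illustration and are not taken from dougall's disassembler.

```python
def sign_extend(value, bits, to_bits=64):
    """Sign-extend a `bits`-wide value to `to_bits` bits (as an unsigned int)."""
    out_mask = (1 << to_bits) - 1
    if bits >= to_bits:
        # Already full width: there is nothing to extend.
        return value & out_mask
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    # Flip the sign bit, then subtract it back: negative inputs go negative.
    return ((value ^ sign) - sign) & out_mask
```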
<bloom> also, the encoding for iadd seems really odd. this is probably the weirdest of the ISA.
<bloom> It's like it's supposed to be a 48-bit instruction and they added an extra 2 bytes of padding for no reason? what?
<dougall> fwiw i saw apple's compiler emit 64-bit subtracts and 64-bit add+shift, but (as far as i can recall) not 64-bit adds
<bloom> ...Interesting.
<dougall> yeah, not sure what's up with that encoding... i do think there's _something_ in the high couple of bits in most/all instructions that i haven't figured out, which might make it make a tiny bit more sense
TheJollyRoger has quit [Quit: TheJollyRoger]
TheJollyRoger has joined #asahi-gpu
<bloom> I'm not worried about 2 unknown bits in the extended encoding
<bloom> it's iadd specifically (imadd is fine) that's all weird..
Necrosporus has quit [Killed (beckett.freenode.net (Nickname regained by services))]
Necrosporus has joined #asahi-gpu
Necrosporus has quit [Killed (verne.freenode.net (Nickname regained by services))]
Necrosporus has joined #asahi-gpu
<dougall> (or maybe i was trying to say that immediates don't get sign extended? not really the best way to represent that... hmm)
phiologe has quit [Ping timeout: 250 seconds]
phiologe has joined #asahi-gpu
pthariensflame has joined #asahi-gpu
pthariensflame has quit []
bpye has quit [Quit: The Lounge - https://thelounge.chat]
bpye has joined #asahi-gpu
vlixa has joined #asahi-gpu
bpye has quit [Quit: Ping timeout (120 seconds)]
bpye has joined #asahi-gpu
Bastian[m] has quit [Quit: Idle for 30+ days]
neobrain has quit [Remote host closed the connection]
Bastian[m] has joined #asahi-gpu
gabboman has joined #asahi-gpu
gabboman has quit [Quit: Ping timeout (120 seconds)]
gabboman has joined #asahi-gpu
gabboman has quit [Quit: Connection closed]
tomtastic has quit [Ping timeout: 240 seconds]
tomtastic has joined #asahi-gpu
vup has quit [Ping timeout: 245 seconds]
vup has joined #asahi-gpu
chrisf has quit [Quit: ZNC - https://znc.in]
chrisf has joined #asahi-gpu
odmir has joined #asahi-gpu
<bloom> dougall: ok, my curiosity got the best of me, poked at sin_pt_1/2
<bloom> The heavy lifting is done by sin_pt_2. However, the function it computes is *not* sin(x), rather it's sin(x)/x
<bloom> (This is standard, there are numeric advantages here.)
<bloom> But it only computes in a single quadrant. So given 0 <= x < 1, it'll spit back sin(x * (pi/2)) / x
<bloom> Notice that's an even function. So sin_pt_2 is in fact defined over [-1, 1], but it ignores the sign bit of its input.
<bloom> This is a useful property: it lets sin_pt_1 pass the sign of the output over the sin_pt_2 call, to be recombined with a later multiplication.
<bloom> So what is sin_pt_1? It's just a quadrant fixup.
<bloom> For x in the first quadrant, it's simply the identity. sin_pt_2 is defined as such, so when we compute sin_pt_2(sin_pt_1(x)) * sin_pt_1(x) we're just computing sine.
<bloom> For x in the third quadrant, recall sin(x + pi) = -sin(x). So sin_pt_1 will just flip the sign, so we can compute in the first quadrant (pt_2), and then the sign gets restored with the multiply.
<bloom> For x in the second quadrant, recall sin(x + pi/2) = cos(x) = sin(pi/2 - x). So rather than flip the sign, we take the arithmetic complement.
<bloom> Likewise for the fourth quadrant, where we both complement and flip the sign.
<bloom> The last detail I glossed over is the units. sin_pt_2 wants its angle as [-1, 1] but sin_pt_1 takes in a rotation [0, 4]. This doesn't affect any of the math, but it means the constants work out to nice integers.
<bloom> Putting it together, we get definitions:
<bloom> sin_pt_1 : [0, 4] -> [-1, 1], sin_pt_2 : [-1, 1] -> R
<bloom> sin_pt_1(x) =
<bloom> { x if 0 <= x < 1
<bloom> { 2 - x if 1 <= x < 2
<bloom> { 2 - x if 2 <= x < 3
<bloom> { x - 4 if 3 <= x < 4
<bloom> Or more clearly:
<bloom> sin_pt_1(x) =
<bloom> { fract(x) if 0 <= x < 1
<bloom> { 1 - fract(x) if 1 <= x < 2
<bloom> { - fract(x) if 2 <= x < 3
<bloom> { fract(x) - 1 if 3 <= x < 4
<bloom> As well as:
<bloom> sin_pt_2(x) = sin((pi/2) * |x|) / |x|
<bloom> (Is sin_pt_2(0) undefined? Maybe, maybe not. It doesn't matter, as long as its value satisfies sin_pt_2(0) * 0 = 0 since sin(0) = 0.)
<bloom> (This requires the value to be finite, since Inf * 0 and NaN * 0 are both NaN under standard IEEE 754 rules.)
<bloom> And that's how AGX computes sine and cosine!
<bloom> (Up to constants, sin_pt_2 is the "sinc" function.)
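The definitions above can be checked numerically with a small Python model. This is a sketch of the math only, not of the hardware: the range reduction to quarter-turns, the choice of pi/2 for sin_pt_2(0) (the finite limit of the expression), and computing cosine by adding one quarter-turn are all assumptions, and `sin_quarter_turns`, `sin_radians` and `cos_radians` are hypothetical names.

```python
import math


def sin_pt_1(x):
    """Quadrant fixup: map a rotation in quarter-turns to a value in [-1, 1]."""
    q = math.floor(x)
    f = x - q          # fract(x)
    q %= 4             # assumption: wrap rotations outside [0, 4)
    if q == 0:
        return f
    if q == 1:
        return 1.0 - f
    if q == 2:
        return -f
    return f - 1.0


def sin_pt_2(x):
    """sin((pi/2) * |x|) / |x|, ignoring the sign of its input."""
    a = abs(x)
    if a == 0.0:
        return math.pi / 2  # assumed finite value, so sin_pt_2(0) * 0 == 0
    return math.sin((math.pi / 2) * a) / a


def sin_quarter_turns(x):
    r = sin_pt_1(x)
    return sin_pt_2(r) * r  # the multiply restores the sign carried by r


def sin_radians(theta):
    return sin_quarter_turns(theta / (math.pi / 2))


def cos_radians(theta):
    # cos(t) = sin(t + pi/2), i.e. one extra quarter-turn.
    return sin_quarter_turns(theta / (math.pi / 2) + 1.0)
```

Comparing `sin_radians(t)` and `cos_radians(t)` against `math.sin(t)` and `math.cos(t)` over a few turns is a quick way to confirm that the four cases of sin_pt_1 line up with the quadrant identities.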
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
nickiminjaj has joined #asahi-gpu
odmir has quit [Ping timeout: 260 seconds]
robinp has quit [Read error: Connection reset by peer]
robinp has joined #asahi-gpu
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 268 seconds]
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
odmir has joined #asahi-gpu
<glibc> # 48 00 C2 00 -unk48 h0, 2l, u0l.neg
<glibc> that appears even in empty fragment shaders
<glibc> with early-z, no alpha-to-coverage or other weird features
<bloom> yep..
<bloom> I *suspect* that controls some detail of the tilebuffer
<bloom> and if you look at how MRT writeout works you can see a pair of bits moving along in those unk48 ("writeout" in dougall's)
<bloom> similarly interesting patterns with depth/stencil writeouts, etc.
pthariensflame has joined #asahi-gpu
pthariensflame has quit [Client Quit]
chrisf has quit [Quit: ZNC - https://znc.in]
chrisf has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 268 seconds]
odmir has joined #asahi-gpu