<cyrozap>
Hi, all, I'm doing some ISA reverse engineering (CPU and DSP, though, not GPU), and I've been having trouble with learning some concepts because I seem to lack the vocabulary to describe what I'm trying to learn about.
<HdkR>
Hello cyrozap
<cyrozap>
Oh, hey, HdkR.
<HdkR>
What concept are you having a hard time with?
<cyrozap>
Yeah, hang on, I'm still typing, lol
<HdkR>
:)
<cyrozap>
So, the ISA I'm currently working on is called MD32, and it's used in a lot of MediaTek SoCs for SCP-like tasks, sensor hub, video decode controller (it's not doing DSP, just bitstream parsing and controlling the hardware decoder), and even some cellular modem tasks.
<cyrozap>
Interestingly, if you search for "MD32" or "MediaDSP", there's a MIPS-inspired CPU from a Chinese university by that same name, but I've read its ISA docs and its instruction encodings don't match up at all with what I'm seeing in real binaries. MediaTek's MD32 is also not the same ISA as their PCM ISA--the instructions don't match those, either.
<macc24>
!
<HdkR>
ah, I see your github repo readme
<HdkR>
Reminds me of how Broadcom's subdevices operate on its wifi chip
davidlt has quit [Ping timeout: 272 seconds]
<HdkR>
Those were tied to some sort of standard backplane communication method, and the code could select which device to communicate with at any given moment. Also had something like 6 devices hanging off of it
<cyrozap>
Anyways, so in one of the big (20GB+) "Android Code" archives released by Xunlong for their Orange Pi 4G-IoT board (MT6737-based), there's a compiler and full GNU binutils suite (without any source code, unfortunately, but MediaTek most likely didn't give it to Xunlong in the first place) for this MD32 CPU.
<cyrozap>
HdkR: Ah, please note that the MD32 is _also_ not the Coresonic DSP. There's a _lot_ of random ISAs in these chips.
<HdkR>
Yea, I understand that :)
<macc24>
mediatek chips are cursed
<cyrozap>
IMO, MediaTek chips are actually really nice because of how straightforward and open to modification things are, especially when compared to Qualcomm's chips.
<HdkR>
7-9 cores hanging off of it. Has things like MAC and PHY shenanigans
<HdkR>
So similar concept :>
<cyrozap>
So, to get to my actual question, rather than disassemble the MD32 as/objdump and try to figure out the instructions that way, I've decided to just "cleverly brute-force" the disassembler, to build my own mapping of opcodes to mnemonics. And so far what I've found is, there appear to be 3 different types of instructions: Pure 32-bit instructions, pure 16-bit instructions, and sort of a "fused" 32-bit instruction, where it contains two 16-bit instructions, but the first one (in the high bytes, big-endian) can't be decoded on its own.
<cyrozap>
(apologies, the second comma in the first sentence should be a colon)
<HdkR>
Clever
<cyrozap>
The "cleverly" part is that I'm using a combination of putting random 32-bit words into the disassembler, flipping bits in instructions that have already been decoded in order to find "adjacent" instructions, and using Z3 to pick "random" instructions that don't also match the encoding for other instructions.
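The bit-flipping step could look roughly like the sketch below. `toy_disassemble` is a made-up stand-in for the real MD32 objdump (its opcode table is invented and has nothing to do with actual MD32 encodings); only the single-bit neighbour search is the point. The Z3 step is not shown.

```python
from typing import Callable, Dict, Optional


def toy_disassemble(word: int) -> Optional[str]:
    """Hypothetical stand-in for the real disassembler. NOT MD32 encodings."""
    opcode = (word >> 26) & 0x3F
    return {0x00: "nop", 0x01: "add", 0x03: "sub"}.get(opcode)


def adjacent_mnemonics(
    word: int, disassemble: Callable[[int], Optional[str]]
) -> Dict[int, str]:
    """Flip each of the 32 bits in turn and record flips that decode to a different mnemonic."""
    base = disassemble(word)
    found: Dict[int, str] = {}
    for bit in range(32):
        neighbour = word ^ (1 << bit)
        mnemonic = disassemble(neighbour)
        # Keep only neighbours that decode and differ from the starting instruction.
        if mnemonic is not None and mnemonic != base:
            found[bit] = mnemonic
    return found
```

Starting from a word the toy table decodes as "add", `adjacent_mnemonics(0x04000000, toy_disassemble)` reports which bit positions lead to other known mnemonics, which hints at where the opcode field sits.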
<cyrozap>
So far I've found like 84 different opcodes.
<cyrozap>
And that's just the 32-bit operations.
camus has joined #panfrost
<cyrozap>
Oh, so the actual question is, what is it called when you have "combined" instructions like that?
kaspter has quit [Ping timeout: 240 seconds]
camus is now known as kaspter
<cyrozap>
I'll post an example of what I'm talking about.
<HdkR>
A bundle is common in VLIW speak, but depends on who you ask. It can just be a pair, bonded instructions, or a name you think up :P
<cyrozap>
Note the instructions that get decoded with two on the same line, with a pipe character separating them.
<cyrozap>
HdkR: Ah, I see.
<HdkR>
Common representation would be `{ \n <inst 1>, \n <inst 2> \n }`
<HdkR>
Maybe with some tabs for alignment :P
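As a rough illustration of that brace notation, here is a small sketch that rewrites a pipe-separated disassembly line into the `{ ... }` form with tab indentation. The function name and the sample mnemonics are assumptions for illustration, not output from the MD32 tools.

```python
def format_bundle(line: str) -> str:
    """Rewrite 'inst1 | inst2' as a brace-delimited bundle, one instruction per line."""
    insts = [part.strip() for part in line.split("|")]
    if len(insts) == 1:
        # A lone instruction is not a bundle; leave it as-is.
        return insts[0]
    body = ",\n".join("\t" + inst for inst in insts)
    return "{\n" + body + "\n}"
```

For example, `format_bundle("nop | nop")` yields a three-line block with each `nop` on its own tab-indented line.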
<cyrozap>
Of course, that then brings me to my next question: Why? Instructions seem to be aligned on 2-byte boundaries, and it already supports decoding 16-bit instructions on its own, so why have a third "bonded" instruction format? Lack of opcode space? And why encode two NOPs in the same 4-byte instruction when there's already a separate 4-byte NOP instruction?
<HdkR>
could be the nops take different amounts of time, thus not really being used for a nop
<HdkR>
This actually looks very similar to an ISA I also RE'd. It would do different nops based on pipeline timings
<HdkR>
Also potentially something as basic as alignment and that's just want the compiler dumps out :P
<HdkR>
s/want/what
<cyrozap>
Alignment was my first guess, but there seem to be plenty of 4-byte instructions appearing at addresses divisible by 2 but not by 4.
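That observation is easy to check mechanically. A minimal sketch, assuming the disassembly has already been reduced to (address, size-in-bytes) pairs; the function name and the sample listing are made up:

```python
from typing import Iterable, List, Tuple


def misaligned_wide(insts: Iterable[Tuple[int, int]]) -> List[int]:
    """Addresses of 4-byte instructions that are 2-byte aligned but not 4-byte aligned."""
    return [addr for addr, size in insts if size == 4 and addr % 4 == 2]


# Made-up listing: a 16-bit instruction at 0x0 pushes the next 32-bit one to 0x2.
listing = [(0x0, 2), (0x2, 4), (0x6, 2), (0x8, 4)]
print([hex(a) for a in misaligned_wide(listing)])
```

If the result is non-empty for real binaries, padding to 4-byte alignment cannot be the whole explanation for the paired encodings.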
<cyrozap>
btw, these are the mnemonics and opcodes I've discovered so far (not final, still missing 16-bit and 16+16 instructions, and I need to re-do the "is this the same instruction" logic to take into account the argument format): https://paste.debian.net/hidden/6aff2409
<archetech>
Nov 24 08:04:04 alarm kernel: panfrost ffe40000.gpu: js fault
robmur01_ is now known as robmur01
alpernebbi has joined #panfrost
<alyssa>
cyrozap: Mali has VLIW encodings like that too, actually.
<alyssa>
For us, it's that there are different units of the hardware -- on Bifrost, a heavyweight FMA unit and a lightweight ADD unit
<alyssa>
And the operations supported by each unit vary a bit. Simple things like moves can run anywhere, but floating multiplies can only go on FMA, and by convention things like branches can only go on ADD.
<alyssa>
Midgard is similar, but adds the twist of some of the units executing in parallel and some executing in series..
<alyssa>
And some units being vector and some being scalar
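A toy sketch of the FMA/ADD constraint described above: ops are packed in order into (FMA, ADD) pairs according to which unit each may use. The op names and the greedy strategy are assumptions for illustration. This ignores data dependencies and everything else a real Bifrost scheduler has to handle, and it is not Mesa's code.

```python
from typing import List, Optional, Tuple

# Which unit(s) each op may use, following the rules stated in the chat.
# Op names are illustrative, not real Bifrost mnemonics.
ALLOWED = {
    "mov": {"FMA", "ADD"},   # simple moves can run anywhere
    "fmul": {"FMA"},         # floating multiplies only on FMA
    "branch": {"ADD"},       # branches only on ADD (by convention)
}


def pack(ops: List[str]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Greedily pack ops, in order, into (FMA, ADD) pairs; None marks an empty slot."""
    bundles = []
    i = 0
    while i < len(ops):
        fma = add = None
        if "FMA" in ALLOWED[ops[i]]:
            fma = ops[i]
            i += 1
        if i < len(ops) and "ADD" in ALLOWED[ops[i]]:
            add = ops[i]
            i += 1
        bundles.append((fma, add))
    return bundles
```

With the input `["fmul", "mov", "branch", "mov"]`, the branch cannot take an FMA slot, so it ends up alone in the ADD half of its pair.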
<robmur01>
Also sounds a bit like the quirk of the original Thumb ISA - BL (and later BLX too) actually consisted of a pair of separate instructions that were only valid to execute in sequence, but had individually-defined semantics and could be interrupted in the middle
<robmur01>
it was much later with Thumb-2 that those pairs got officially redefined as single 32-bit encodings
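For reference, the original Thumb BL pair can be decoded as below. Each halfword has its own effect (the first sets LR from PC plus the high offset, the second branches to LR plus the low offset), and the sketch just composes the two. The function names are mine, and only the BL suffix (not BLX) is handled.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def thumb_bl_target(addr: int, first: int, second: int) -> int:
    """Branch target of an original-Thumb BL pair whose first halfword is at `addr`."""
    if first >> 11 != 0b11110 or second >> 11 != 0b11111:
        raise ValueError("not a BL prefix/suffix pair")
    # First halfword: LR = PC + (sign-extended 11-bit high offset << 12).
    # In Thumb state, PC reads as the instruction address + 4.
    lr = (addr + 4) + (sign_extend(first & 0x7FF, 11) << 12)
    # Second halfword: PC = LR + (11-bit low offset << 1).
    return (lr + ((second & 0x7FF) << 1)) & 0xFFFFFFFF
```

The pair `0xF7FF 0xFFFE`, familiar from unrelocated object files, decodes to a branch back to its own address.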
rando25892 has quit [Ping timeout: 256 seconds]
rando25892 has joined #panfrost
<robmur01>
but yeah, my guess from the shape of that code would be some kind of manual pipeline scheduling/delay slot type shenanigans
<daniels>
still not as good as the quirk of an extension to the non-Thumb ISA which accidentally included an opcode called BXJ
<robmur01>
daniels: hey, don't forget that Jazelle is still mandatory in Armv8 :P
<daniels>
robmur01: I, er ...
<daniels>
assuming that just immediately traps to the sw handler?
popolon has joined #panfrost
<robmur01>
indeed it is also mandatory to *not* actually implement any opcodes :D
<robmur01>
the one extension that was explicitly designed from the outset to have a limited lifespan and be phased out...
<daniels>
now that's one Jazelle mandate I can get behind ...
<alyssa>
The wrmask of xzw doesn't work with how Bifrost models stores...
<alyssa>
There is nir_lower_wrmask, but that would lower to two stores, which seems needlessly expensive
<alyssa>
More to the point, it assumes base is per-component, instead of per-vector. So doesn't work on our hardware as-is.
<alyssa>
It isn't obvious to me how the blob handles this
<alyssa>
Ah! We can use lower_io_to_temporaries and then fill in the holes in the writemask since the holes are undefined.
<alyssa>
ok, that was a ton of thinking for a 2 line change :p
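One way to read "fill in the holes" is sketched below, assuming bit 0 is x and that filling means setting every component up to the highest one already written. This is a guess at the idea, not the actual Mesa change.

```python
def fill_writemask_holes(mask: int) -> int:
    """Set every component bit up to and including the highest one already set (bit 0 = x)."""
    return (1 << mask.bit_length()) - 1


def mask_name(mask: int) -> str:
    """Render a 4-bit writemask as a swizzle-style string, e.g. 0b1101 -> 'xzw'."""
    return "".join(c for i, c in enumerate("xyzw") if mask & (1 << i))


xzw = 0b1101
print(mask_name(xzw), "->", mask_name(fill_writemask_holes(xzw)))
```

Under that assumption the xzw store becomes a full xyzw store, which stays a single store; the extra y write is harmless only because the hole's contents are undefined.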
<alyssa>
ok, it compiles. onwards :-)
yann|work has joined #panfrost
<alyssa>
--Or not onwards. The whole gles2/gl2.1 set of shaders in my shader-db compile now.
<alyssa>
and the gles3.0 set :)
<alyssa>
Although probably the MRT ones are wrong
yann has quit [Ping timeout: 265 seconds]
<alyssa>
----Wait. Nope. I can't compile.
<alyssa>
that was midgard, this is embarrassing
<alyssa>
yeah, bifrost still has some crashing
<alyssa>
shaders/tesseract/229.shader_test is our next problem shader
karolherbst has joined #panfrost
stikonas has quit [Remote host closed the connection]
stikonas has joined #panfrost
<alyssa>
alright, handled
<alyssa>
ok, *now* gles2 shaderdb finishes
<alyssa>
This is good, both because we just fixed a bunch of bugs, and also because we now have a baseline shader-db available, so when we start optimizing things we can measure against a standard set
raster has quit [Quit: Gettin' stinky!]
yann|work has quit [Read error: No route to host]
yann|work has joined #panfrost
archetech has quit [Quit: Leaving]
yann|work has quit [Read error: Connection reset by peer]
yann|work has joined #panfrost
yann|work is now known as yann
stikonas has quit [Remote host closed the connection]
stikonas has joined #panfrost
icecream95 has joined #panfrost
davidlt has quit [Ping timeout: 240 seconds]
karolherbst has quit [Remote host closed the connection]
karolherbst has joined #panfrost
raster has joined #panfrost
alpernebbi has quit [Quit: alpernebbi]
karolherbst has quit [Ping timeout: 272 seconds]
Ntemis has joined #panfrost
<Ntemis>
howdy
<Ntemis>
tested rk3288 (miqi) mali-T764 and it is not there yet