pinknok has quit [Remote host closed the connection]
thinknok has joined #nmigen
thinknok has quit [Remote host closed the connection]
thinknok has joined #nmigen
Vinalon has joined #nmigen
well, I finally got an RV32I soc with a 'neopixel' peripheral working, which is exciting. But it uses about 5,000 cells, which seems very large compared to other RV32I cores that I've seen.
but the peripherals and memories use about 2,000 of those cells, and I wonder if some of the smaller designs use commercial toolchains to get better results. Is it reasonable to expect that I could shave much off of that size?
or would that be tilting at windmills?
which core?
it's an RV32I on an iCE40UP5K - it sounds like those might tend to have larger designs because of the LUT4s?
there's hundreds
oh, I wrote one - but vexriscv claims to use less than 1200 cells on an iCE40 with only the integer instruction set
oh i see
and my decoder, ALU, and CSRs use about 2500-3000 put together, so it seems like there should be room for improvement; I just don't know how much of it is down to the tooling
i wouldn't think so
that that much of it depends on the tooling
I wouldn't say Yosys produces *fantastic* synthesis results, but it's probably not doing what you expect if it's that big
okay, thanks - guess I should try to read about how SoCs are supposed to be designed, then.
Vinalon: Have you got per-module size stats?
e.g. from `synth_ice40 -noflatten`?
yeah, but some things still get flattened. The ALU is ~750 cells, and the decoder/CSR module is ~2000-2500.
but I sort of hit a wall with simplifying logic, because those are all whittled down to switch cases with what I think is minimal logic
the decoder switches on the opcode, the alu switches on function bits, and the CSRs switch on register address.
so...I dunno, maybe there's only so much that you can do with high-level syntax and a rudimentary understanding of the underlying fundamentals
alexhw has quit [Ping timeout: 265 seconds]
The way I'd think about minimising logic is that your goal is not to use fewer operations, but fewer terms
If you were targeting ASIC, then minimising operations is good too, but
For FPGA, a LUT4 is "something made up of 4 terms"
"A & B & C & D" and "~(A | B) ^ C & ~D" both result in a LUT4
how many CSRs are you implementing?
picorv32 and vex both implement not-quite-standard CSR sets because the list in the manual is not appropriate for minimal FPGA implementations
just the basics; mstatus, mcycle, minstret, mtvec, mcause, mscratch, mepc, mtval, mcountinhibit
okay, thanks; so how does LUT4 encoding map to things like selecting and comparing a range of bits?
like, A/B/C/D are all one-bit values in that example, right?
Vinalon: If you're switching on, say, RV funct3, then that would be ~8 LUT4s
For something wider than 4, it gets more complicated to estimate
But yes, they're all one-bit values
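To make the LUT4 point concrete, here is a minimal Python sketch (plain Python, not nMigen or Yosys code; `lut4_init` and its bit ordering are made up for illustration). It tabulates any function of four one-bit inputs into a 16-entry truth table. Both expressions quoted above reduce to one such table, which is why they cost the same on a LUT4 fabric.

```python
from itertools import product


def lut4_init(fn):
    """Return the 16-bit truth table of a function of four one-bit inputs.

    Bit i of the result is fn(a, b, c, d) for i = a + 2*b + 4*c + 8*d.
    This bit ordering is a convention chosen for the sketch.
    """
    init = 0
    for index, (d, c, b, a) in enumerate(product((0, 1), repeat=4)):
        if fn(a, b, c, d) & 1:  # keep only the low bit of Python's ~ result
            init |= 1 << index
    return init


# The two expressions from the discussion, each fitting in one table.
and_all = lut4_init(lambda a, b, c, d: a & b & c & d)
mixed = lut4_init(lambda a, b, c, d: ~(a | b) ^ c & ~d)

print(f"{and_all:#06x} {mixed:#06x}")
```

However many operators the expression contains, the result is always one 16-bit constant, so operator count alone says little about area.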
so, switching on the 7-bit opcode field or the 12-bit CSR address is probably not very efficient?
For example, funct7 would be 2 LUT4s per case, since you have one LUT4 to check if the lower 4 bits match what you want, which outputs to another LUT4 that checks if the upper 3 bits match and the lower 4 match
However, funct7 is not used as much as funct3, so all illegal opcode groups boil down to two LUT4s
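A rough way to extend that estimate to wider fields, as a sketch only: assume each case is matched by a chain of LUT4s, where the first LUT consumes four bits and every later LUT consumes three new bits plus the previous LUT's output. The helper name `match_luts` is hypothetical, and a real synthesis run will share logic between cases, so treat the numbers as a per-case upper bound rather than what Yosys will report.

```python
from math import ceil


def match_luts(width, lut_inputs=4):
    """Estimate LUTs needed to compare a `width`-bit field with one constant.

    Assumes a simple chain: the first LUT takes `lut_inputs` bits, each
    following LUT takes the previous result plus `lut_inputs - 1` new bits.
    """
    if width <= lut_inputs:
        return 1
    return 1 + ceil((width - lut_inputs) / (lut_inputs - 1))


for name, width in (("funct3", 3), ("funct7", 7), ("opcode", 7), ("CSR address", 12)):
    print(f"{name}: {match_luts(width)} LUT4 per case")
```

Under these assumptions a 12-bit CSR address costs about twice as much per case as a 7-bit field, before any sharing.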
yeah, I kind of cheated by sending funct3 and bit 6 of funct7 to the ALU to pick its operations
Thinking about it, actually, I think the synthesis tool can get away with 2 LUT4s per bit for logical ops, an adder, a subtractor and possibly a shifter
it's too bad you can't get estimates of how much logic different Module operations like If/Elif/Case/State use, but it's probably hard enough to have visibility into what the synthesizer does
Actually, how do you implement shifts?
I use Python arithmetic/logic operators for all of the ALU ops, so '<<' and '>>'
So, that's going to eat up a *lot* of area
hehe, is this going to be about barrel shifters again?
Yep, you're asking it to build a barrel shifter
Or possibly two, actually
That's going to be 32 * 5 = 160 LUT4s per barrel shifter
VexRiscV cheats a little here: it gets away with one barrel right-shifter by flipping the bits of a left shift on the input and output
well, that's one thing to improve - thanks!
Which turns, say, a 320 LUT4 barrel shifter into a 224 LUT4 barrel shifter
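The bit-flipping trick and the arithmetic behind those numbers can be checked with a small Python model. This is a sketch, not VexRiscv's actual implementation: the function names are made up, and the cost model (one LUT4 per bit per shift stage, plus one LUT4 per bit for each of the two optional reversals) is an assumption that happens to reproduce the 160, 320 and 224 figures.

```python
MASK32 = 0xFFFFFFFF


def reverse32(x):
    """Reverse the bit order of a 32-bit value."""
    return int(f"{x & MASK32:032b}"[::-1], 2)


def shift_left_via_right(x, amount):
    """Left shift built from a single right shifter and two bit reversals."""
    return reverse32(reverse32(x) >> amount)


# Assumed cost model: 5 mux stages (shift by 1, 2, 4, 8, 16), one LUT4 per bit.
WIDTH, STAGES = 32, 5
one_shifter = WIDTH * STAGES            # a single barrel shifter
two_shifters = 2 * one_shifter          # separate left and right shifters
shared = one_shifter + 2 * WIDTH        # one shifter plus input/output flips

print(one_shifter, two_shifters, shared)
```

In hardware the reversal itself is only wiring; the extra cost comes from the muxes that choose between the reversed and unreversed bits.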
but what I really don't get is why the decoder and CSRs are so large, when they're basically just a lot of Cat/bit_select/Repl logic. I thought those were very efficient operations in FPGAs
But I suppose it doesn't matter too much
yeah, but when I reduce it to 32 registers and remove the 'irq' flag in the port addresses, it doesn't seem to make any difference in size
I guess I should remove it anyways, but I liked how Cortex-M cores save context automatically
So, your entire CPU is an FSM?
yes - is that bad?
It can be, yes
Essentially, your cheap decoder is now dependent on FSM state
Well, isn't that going to lead to basically everything having the whole current state in it?
Which makes it notably less cheap
So everything will depend on N bits where N is log2(number of states)?
oh...but if there are only 3 states, shouldn't that fit in a LUT4?
MadHacker: No, it'll depend on exactly the number of states, because nMigen makes everything one-hot
Oh god. Yeah. Good luck.
That'll be why everything's huge then.
innnnteresting. Well that's encouraging - sometimes it's nice to have one big problem instead of a bunch of small ones.
Vinalon: Yes, it's theoretically only one LUT4, but you need that LUT4 for everything you do
BRAMs have a minimum depth, on ice40 you won't see any difference between 32 and 64 registers because the RAMs don't come in any size smaller than 256 registers
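A quick sketch of why the register count does not change the block RAM usage, assuming each iCE40 block RAM is used as 256 entries of 16 bits (its widest configuration) and counting storage for a single copy of the register file only. The constant and function names are illustrative, and extra copies for additional read ports are ignored.

```python
from math import ceil

BLOCK_DEPTH = 256  # assumed entries per block RAM in its widest mode
BLOCK_WIDTH = 16   # assumed bits per entry in that mode


def blocks_needed(registers, width=32):
    """Block RAMs needed to store `registers` words of `width` bits each."""
    return ceil(registers / BLOCK_DEPTH) * ceil(width / BLOCK_WIDTH)


for count in (32, 64, 256):
    print(f"{count} registers -> {blocks_needed(count)} block RAMs")
```

With these assumptions, 32 and 64 registers both occupy the same two blocks, so dropping the interrupt register bank frees nothing.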
The CPU decoder will always operate, but the results will be conditional on FSM state
Compare that to, say, a pipelined CPU, where the CPU decoder will still always operate, but the results will be written (almost - stalls can happen) unconditionally.
oh, interesting - so if I implement a traditional pipeline, the fetch/decode/execute stages could all act at once on different values instead of acting on the same one in sequence?
This is why CPUs nowadays are pipelined
If you have an FSM, the logic doesn't go away when not in that state
Better to put it to use if possible.
For example, with your FSM, in 6 cycles (assuming simple instructions), you can retire 2 instructions.
With a pipeline you can retire 3 (if I have my math correct)
<ZirconiumX> With a pipeline you can retire 4
cool - I guess I never understood how that architecture decision came from the way that the hardware was laid out in the chip
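The retire counts above can be reproduced with a toy model. It assumes three steps per instruction (fetch, decode, execute), one cycle per step, and no stalls or branches; both function names are hypothetical.

```python
def retired_fsm(cycles, states=3):
    """Instructions finished when one instruction walks through every state."""
    return cycles // states


def retired_pipeline(cycles, stages=3):
    """Instructions finished by an ideal pipeline with no stalls.

    The first instruction needs `stages` cycles to get through; after that,
    one instruction retires every cycle.
    """
    return max(0, cycles - (stages - 1))


print(retired_fsm(6), retired_pipeline(6))
```

The gap widens with longer runs: the FSM stays at one instruction per three cycles, while the ideal pipeline approaches one per cycle.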
Granted, you then have small headaches with branches, and instruction dependencies
But that's the fun part of writing a CPU, right?
well, at least the instruction set includes an explicit 'fence' instruction for the compiler to use
thanks so much for taking a look and pointing that out!
Vinalon: FENCE doesn't help you with those :P
Actually for a strictly in-order CPU I think you can just add a pipeline bubble and be spec-compliant, but anyway
hm - well like they say, I can burn that bridge when I come to it. It'll probably take some trial and error to figure out how to avoid using an FSM for the core logic
Vinalon: Extract your FSM states into independent modules; that'll be a good start
And you *can* use an FSM, but the less depends on that FSM, the better
Here your *entire CPU* depends on that FSM.
yeah. They'll still have to be gated on the instruction fetching though, because SPI Flash access takes many clock cycles.
Vinalon has quit [Remote host closed the connection]
Vinalon has joined #nmigen
anyways, thanks again for the advice!
It won't make your design smaller, perhaps, but it'll be more efficient, I think
e.g. if you fetch a small stream of data that you cache (it doesn't have to be big), there'll still be pretty big gains
yeah, an I-cache is pretty high on the list of things I want to add, which is why I'm trying to optimize for size; the iCE40UP5K seems like the upper bound of what you can get within an order of magnitude of the cost of an actual microcontroller
If I needed a $100 ECP5 board to do the work of a Cortex-M0 chip, I wouldn't be able to actually use it in many throwaway projects
Vinalon: Minerva is 2,589 SB_LUT4s, as a datapoint
The cheapest ECP5 chip is about $5
But the support components and assembly costs are a fortune compared to the UP5K
yeah, but they're BGA and all the boards I've seen pack in extras like DDR RAM
How much is an OrangeCrab? That seems like the smallest ECP5 board I know of
I think the groupget is $99
I think it was the usual $99 mark
if only they made an HX8K in QFN form factor
For a microcontroller replacement the SPRAM and DSPs are probably more useful than the extra LUTs
Unless you need the speed
yeah, I was thinking 12MHz with 128KB of RAM wouldn't be too bad if you could also have a handful of peripherals
Also, you don't really need an icache in this kind of situation
I think the SB_HFOSC can do 48MHz, right?
You can just copy the performance critical parts to RAM on startup manually
Or even the whole program if it's small enough
There is a PLL too although it may not be reliable on the Upduino
yeah, it is nice that the SPRAMs are so large
but I really need to save another 1000 cells or so to hit that target; atm, I can't fit much more than a GPIO peripheral, and it doesn't include multiplication/division, interrupts, timers, or debug support
Multiplication at least is cheap for a UP5K
yeah, with the '-dsp' option right? That'll probably be next, once I can spare a few hundred cells...but at least I have somewhere to start looking now.
I feel like you could unify these with some level of microcoding
e.g. an AUIPC is a LUI followed by adding the PC
And if speed's not an issue (you're bottlenecked on SPI, right?)
yeah, that's a good point; and I did save a little bit of space by having the branch operations use the SLT/SLTU ALU operations, maybe other instructions could do the same sort of thing
* ZirconiumX is actually curious how minimal of an internal instruction set you'd actually need
One instruction.
OISCs are a thing.
MadHacker: I don't feel like implementing RISC-V in terms of OISC
Usually subtract and branch if not zero.
ZirconiumX: It's probably not the most efficient way unless you want to use most of your RAM as storage space for a RISC-V emulator. :)
too bad GCC doesn't have "avoid this instruction" flags
You'd probably need a temporary accumulator to properly microcode these
What's one more stack of registers? :P
well, apparently there is extra room in the register map's BRAM cell...
Tsk, you mean you're not just keeping a bank of registers per pipeline stage for temporary results? How are you going to do speculative execution then, huh? :D
Vinalon: So, AUIPC can be done in terms of LUI and an ADD of PC; SUB can be done in terms of XORI/ADD/ADDI (-n == (n ^ -1) + 1); left-shifts can be done in terms of right-shifts with bits flipped
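Those identities are easy to sanity-check with 32-bit arithmetic in plain Python. This is only a behavioural sketch of the decompositions, not a microcode design; the helper names are made up, and `imm20` stands for the 20-bit upper immediate.

```python
MASK32 = 0xFFFFFFFF


def add(a, b):
    """32-bit wrapping ADD."""
    return (a + b) & MASK32


def xori(a, imm):
    """32-bit XORI; an immediate of -1 is all ones after sign extension."""
    return (a ^ imm) & MASK32


def sub_via_xor_add(a, b):
    """SUB expressed as XORI/ADDI/ADD, using -b == (b ^ -1) + 1."""
    negated = add(xori(b, MASK32), 1)
    return add(a, negated)


def lui(imm20):
    """LUI places the 20-bit immediate in the upper bits of the result."""
    return (imm20 << 12) & MASK32


def auipc_via_lui_add(pc, imm20):
    """AUIPC expressed as LUI followed by adding the PC."""
    return add(lui(imm20), pc)


print(hex(sub_via_xor_add(5, 7)), hex(auipc_via_lui_add(0x1000, 1)))
```

Each decomposition reuses the one adder, at the price of extra cycles per instruction.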
I think there's a lot of room for microcoding, actually
But I already saved you a hardware subtractor and barrel left-shifter :P
SLT is a subtraction followed by checking the carry bit
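A small model of that idea, with one addition that is not from the discussion above: the carry check as written gives the unsigned comparison (SLTU), and the sketch gets the signed SLT by flipping the sign bit of both operands first, which is a common trick rather than anything the chat specifies. All names are illustrative.

```python
MASK32 = 0xFFFFFFFF
SIGN = 0x80000000


def sub_with_carry(a, b):
    """Compute a - b as a + ~b + 1 and return (result, carry_out)."""
    total = (a & MASK32) + (~b & MASK32) + 1
    return total & MASK32, total >> 32


def sltu(a, b):
    """Unsigned less-than: carry_out is 0 exactly when a - b borrows."""
    _, carry = sub_with_carry(a, b)
    return 1 - carry


def slt(a, b):
    """Signed less-than via the unsigned compare with sign bits flipped."""
    return sltu(a ^ SIGN, b ^ SIGN)


print(sltu(0xFFFFFFFF, 1), slt(0xFFFFFFFF, 1))
```

Here `0xFFFFFFFF` is the largest unsigned value but represents -1 when signed, so the two comparisons against 1 disagree.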
It'd require quite a lot of reorganising, but I think if size is your target you can do it
yeah, multi-cycle operations would be interesting to investigate. Although, it's interesting that RISC-V doesn't require N/Z/V flags
You can even go down to an accumulator machine if you need to
It doesn't, but you'll probably need them anyway
Mm, but you might be able to pull them out of the normal critical path.
If you only need them for conditional branches, calculating them can be on the branch path not the normal one.
well, I guess I'll start by trying to modularize the main state machine. Maybe after consolidating the shifters.
Hope I've given you some things to think about though
definitely, thanks!
Vinalon has quit [Remote host closed the connection]