Vinalon has joined #nmigen
<Vinalon>
well, I finally got an RV32I soc with a 'neopixel' peripheral working, which is exciting. But it uses about 5,000 cells, which seems very large compared to other RV32I cores that I've seen.
<Vinalon>
but the peripherals and memories use about 2,000 of those cells, and I wonder if some of the smaller designs use commercial toolchains to get better results. Is it reasonable to expect that I could shave much off of that size?
<Vinalon>
or would that be tilting at windmills?
<tpw_rules>
which core?
<Vinalon>
it's an RV32I on an iCE40UP5K - it sounds like those might tend to have larger designs because of the LUT4s?
<tpw_rules>
"an"?
<tpw_rules>
which?
<tpw_rules>
there's hundreds
<Vinalon>
oh, I wrote one - but vexriscv claims to use less than 1200 cells on an iCE40 with only the integer instruction set
<tpw_rules>
oh i see
<Vinalon>
and my decoder, ALU, and CSRs use about 2500-3000 put together, so it seems like there should be room for improvement; I just don't know how much of it is down to the tooling
<tpw_rules>
i wouldn't think so
<tpw_rules>
that that much of it depends on the tooling
<ZirconiumX>
I wouldn't say Yosys produces *fantastic* synthesis results, but it's probably not doing what you expect if it's that big
<Vinalon>
okay, thanks - guess I should try to read about how SoCs are supposed to be designed, then.
<ZirconiumX>
Vinalon: Have you got per-module size stats?
<ZirconiumX>
e.g. from `synth_ice40 -noflatten`?
<Vinalon>
yeah, but some things still get flattened. The ALU is ~750 cells, and the decoder/CSR module is ~2000-2500.
<Vinalon>
but I sort of hit a wall with simplifying logic, because those are all whittled down to switch cases with what I think is minimal logic
<Vinalon>
the decoder switches on the opcode, the alu switches on function bits, and the CSRs switch on register address.
<Vinalon>
so...I dunno, maybe there's only so much that you can do with high-level syntax and a rudimentary understanding of the underlying fundamentals
<ZirconiumX>
The way I'd think about minimising logic is that your goal is not to use fewer operations, but fewer terms
<ZirconiumX>
If you were targeting ASIC, then minimising operations is good too, but
<ZirconiumX>
For FPGA, a LUT4 is "something made up of 4 terms"
<ZirconiumX>
"A & B & C & D" and "~(A | B) ^ C & ~D" both result in a LUT4
<sorear>
how many CSRs are you implementing?
<sorear>
picorv32 and vex both implement not-quite-standard CSR sets because the list in the manual is not appropriate for minimal FPGA implementations
<Vinalon>
just the basics; mstatus, mcycle, minstret, mtvec, mcause, mscratch, mepc, mtval, mcountinhibit
<Vinalon>
okay, thanks; so how does LUT4 encoding map to things like selecting and comparing a range of bits?
<Vinalon>
like, A/B/C/D are all one-bit values in that example, right?
<ZirconiumX>
Vinalon: If you're switching on, say, RV funct3, then that would be ~8 LUT4s
<ZirconiumX>
For something wider than 4, it gets more complicated to estimate
<ZirconiumX>
But yes, they're all one-bit values
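A rough Python sketch of the point about one-bit terms, not anything taken from the toolchain: a LUT4 is just a 16-entry truth table indexed by four one-bit inputs, so both expressions quoted above fit in one. The helper name `lut4_init` and the bit ordering (A as the least significant index bit) are assumptions made for illustration.

```python
from itertools import product

def lut4_init(fn):
    """Return the 16-bit truth table of a 4-input, 1-bit function."""
    init = 0
    for index, (d, c, b, a) in enumerate(product((0, 1), repeat=4)):
        # index == d*8 + c*4 + b*2 + a, so 'a' is the least significant input
        init |= (fn(a, b, c, d) & 1) << index
    return init

# The two expressions from the discussion, each reduced to one truth table.
and4 = lut4_init(lambda a, b, c, d: a & b & c & d)
mixed = lut4_init(lambda a, b, c, d: ~(a | b) ^ c & ~d)

print(f"{and4:016b}")
print(f"{mixed:016b}")
```

Both results are plain 16-bit constants, which is why the number of operators in the expression does not matter, only the number of distinct one-bit inputs.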
<Vinalon>
so, switching on the 7-bit opcode field or the 12-bit CSR address is probably not very efficient?
<ZirconiumX>
For example, funct7 would be 2 LUT4s per case, since you have one LUT4 to check if the lower 4 bits match what you want, which outputs to another LUT4 that checks if the upper 3 bits match and the lower 4 match
<ZirconiumX>
However, funct7 is not used as much as funct3, so all illegal opcode groups boil down to two LUT4s
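The "2 LUT4s per case" estimate for funct7 can be generalised with a small back-of-the-envelope sketch. It assumes a simple chain where the first LUT4 checks four bits and each further LUT4 takes the previous result plus three fresh bits; the function name is hypothetical, and a real synthesizer may share logic between cases and do better.

```python
import math

def lut4s_for_constant_match(width):
    """Rough LUT4 count to test 'field == constant' for a field of `width` bits."""
    if width <= 4:
        return 1
    # First LUT4 consumes 4 bits; each later LUT4 takes the previous
    # result plus 3 new bits.
    return 1 + math.ceil((width - 4) / 3)

for name, width in (("funct3", 3), ("funct7", 7), ("opcode", 7), ("CSR address", 12)):
    print(name, lut4s_for_constant_match(width))
```

Under these assumptions a 12-bit CSR address match costs roughly twice what a 7-bit opcode match does, per case.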
<Vinalon>
yeah, I kind of cheated by sending funct3 and bit 6 of funct7 to the ALU to pick its operations
<ZirconiumX>
Thinking about it, actually, I think the synthesis tool can get away with 2 LUT4s per bit for logical ops, an adder, a subtractor and possibly a shifter
<Vinalon>
it's too bad you can't get estimates of how much logic different Module operations like If/Elif/Case/State use, but it's probably hard enough to have visibility into what the synthesizer does
<ZirconiumX>
Actually, how do you implement shifts?
<Vinalon>
I use Python arithmetic/logic operators for all of the ALU ops, so '<<' and '>>'
<ZirconiumX>
Ouch
<ZirconiumX>
So, that's going to eat up a *lot* of area
<Vinalon>
hehe, is this going to be about barrel shifters again?
<ZirconiumX>
Yep, you're asking it to build a barrel shifter
<ZirconiumX>
Or possibly two, actually
<ZirconiumX>
That's going to be 32 * 5 = 160 LUT4s per barrel shifter
<ZirconiumX>
VexRiscV cheats a little here: it gets away with one barrel right-shifter by flipping the bits of a left shift on the input and output
<Vinalon>
well, that's one thing to improve - thanks!
<ZirconiumX>
Which turns, say, a 320 LUT4 barrel shifter into a 224 LUT4 barrel shifter
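A minimal Python model of the bit-flipping trick described above, written as plain integer arithmetic rather than nMigen code. The function names are made up for illustration. One reading of the 224 figure is 160 LUT4s for the shifter plus 32 for each of the two conditional bit reversals; that breakdown is an assumption, not something stated in the chat.

```python
MASK32 = 0xFFFFFFFF

def reverse32(x):
    """Reverse the bit order of a 32-bit value."""
    return int(f"{x & MASK32:032b}"[::-1], 2)

def shift_right(x, amount):
    """Stand-in for the single logical barrel right-shifter."""
    return (x & MASK32) >> (amount & 31)

def shift_left_via_right(x, amount):
    """Left shift built from: reverse, shift right, reverse again."""
    return reverse32(shift_right(reverse32(x), amount))

print(hex(shift_left_via_right(0x80000001, 4)))
```

The same right-shifter serves both directions, so only one barrel shifter needs to exist in hardware.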
<Vinalon>
but what I really don't get is why the decoder and CSRs are so large, when they're basically just a lot of Cat/bit_select/Repl logic. I thought those were very efficient operations in FPGAs
<ZirconiumX>
But I suppose it doesn't matter too much
<Vinalon>
yeah, but when I reduce it to 32 registers and remove the 'irq' flag in the port addresses, it doesn't seem to make any difference in size
<Vinalon>
I guess I should remove it anyways, but I liked how Cortex-M cores save context automatically
<ZirconiumX>
So, your entire CPU is an FSM?
<Vinalon>
yes - is that bad?
<ZirconiumX>
It can be, yes
<ZirconiumX>
Essentially, your cheap decoder is now dependent on FSM state
<MadHacker>
Well, isn't that going to lead to basically everything having the whole current state in it?
<ZirconiumX>
Which makes it notably less cheap
<MadHacker>
So everything will depend on N bits where N is log2(number of states)?
<Vinalon>
oh...but if there are only 3 states, shouldn't that fit in a LUT4?
<ZirconiumX>
MadHacker: No, it'll depend on exactly the number of states, because nMigen makes everything one-hot
<MadHacker>
Oh god. Yeah. Good luck.
<MadHacker>
That'll be why everything's huge then.
<Vinalon>
innnnteresting. Well that's encouraging - sometimes it's nice to have one big problem instead of a bunch of small ones.
<ZirconiumX>
Vinalon: Yes, it's theoretically only one LUT4, but you need that LUT4 for everything you do
<sorear>
BRAMs have a minimum depth, on ice40 you won't see any difference between 32 and 64 registers because the RAMs don't come in any size smaller than 256 registers
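The arithmetic behind that can be sketched as follows, assuming the iCE40 block RAM is used in its 256x16 configuration and ignoring any duplication a synthesizer might add for a second read port. `ebr_blocks` is a hypothetical helper, not a toolchain function.

```python
import math

# One iCE40 block RAM (4 kbit) in its widest mode: 256 entries of 16 bits.
EBR_DEPTH, EBR_WIDTH = 256, 16

def ebr_blocks(depth, width):
    """Block RAMs needed for a memory of `depth` words, each `width` bits wide."""
    return math.ceil(depth / EBR_DEPTH) * math.ceil(width / EBR_WIDTH)

print(ebr_blocks(32, 32))  # 32 registers of 32 bits
print(ebr_blocks(64, 32))  # 64 registers of 32 bits
```

Both register-file sizes round up to the same number of blocks, which matches the observation that shrinking from 64 to 32 registers changed nothing.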
<ZirconiumX>
The CPU decoder will always operate, but the results will be conditional on FSM state
<ZirconiumX>
Compare that to, say, a pipelined CPU, where the CPU decoder will still always operate, but the results will be written (almost - stalls can happen) unconditionally.
<Vinalon>
oh, interesting - so if I implement a traditional pipeline, the fetch/decode/execute stages could all act at once on different values instead of acting on the same one in sequence?
<ZirconiumX>
Correct.
<ZirconiumX>
This is why CPUs nowadays are pipelined
<ZirconiumX>
If you have an FSM, the logic doesn't go away when not in that state
<ZirconiumX>
Better to put it to use if possible.
<ZirconiumX>
For example, with your FSM, in 6 cycles (assuming simple instructions), you can retire 2 instructions.
<ZirconiumX>
With a pipeline you can retire 4 (if I have my math correct)
<Vinalon>
cool - I guess I never understood how that architecture decision came from the way that the hardware was laid out in the chip
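The cycle counts being compared can be reproduced with a small sketch. It assumes three FSM states and three pipeline stages (fetch/decode/execute), one cycle per state or stage, and no stalls or branches; the function names are illustrative only.

```python
def retired_fsm(cycles, states=3):
    """Instructions finished when each one walks through every state in turn."""
    return cycles // states

def retired_pipeline(cycles, stages=3):
    """Instructions finished once the pipeline has filled (stages - 1 cycles)."""
    return max(0, cycles - (stages - 1))

print(retired_fsm(6), retired_pipeline(6))
```

After the initial fill, the pipeline finishes one instruction per cycle, while the FSM still needs one full pass through its states per instruction.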
<ZirconiumX>
Granted, you then have small headaches with branches, and instruction dependencies
<ZirconiumX>
But that's the fun part of writing a CPU, right?
<Vinalon>
well, at least the instruction set includes an explicit 'fence' instruction for the compiler to use
<Vinalon>
thanks so much for taking a look and pointing that out!
<ZirconiumX>
Vinalon: FENCE doesn't help you with those :P
<ZirconiumX>
Actually for a strictly in-order CPU I think you can just add a pipeline bubble and be spec-compliant, but anyway
<Vinalon>
hm - well like they say, I can burn that bridge when I come to it. It'll probably take some trial and error to figure out how to avoid using an FSM for the core logic
<ZirconiumX>
Vinalon: Extract your FSM states into independent modules; that'll be a good start
<ZirconiumX>
And you *can* use an FSM, but the less depends on that FSM, the better
<ZirconiumX>
Here your *entire CPU* depends on that FSM.
<Vinalon>
yeah. They'll still have to be gated on the instruction fetching though, because SPI Flash access takes many clock cycles.
Vinalon has quit [Remote host closed the connection]
Vinalon has joined #nmigen
<Vinalon>
anyways, thanks again for the advice!
<ZirconiumX>
It won't make your design smaller, perhaps, but it'll be more efficient, I think
<ZirconiumX>
e.g. if you fetch a small stream of data that you cache (it doesn't have to be big), there'll still be pretty big gains
<Vinalon>
yeah, an I-cache is pretty high on the list of things I want to add, which is why I'm trying to optimize for size; the iCE40UP5K seems like the upper bound of what you can get within an order of magnitude of the cost of an actual microcontroller
<Vinalon>
If I needed a $100 ECP5 board to do the work of a Cortex-M0 chip, I wouldn't be able to actually use it in many throwaway projects
<ZirconiumX>
Vinalon: Minerva is 2,589 SB_LUT4s, as a datapoint
<daveshah>
The cheapest ECP5 chip is about $5
<daveshah>
But the support components and assembly costs are a fortune compared to the UP5K
<Vinalon>
yeah, but they're BGA and all the boards I've seen pack in extras like DDR RAM
<ZirconiumX>
How much is an OrangeCrab? That seems like the smallest ECP5 board I know of
<Vinalon>
I think the groupget is $99
<daveshah>
I think it was the usual $99 mark
<daveshah>
Yeah
<Vinalon>
if only they made an HX8K in QFN form factor
<daveshah>
For a microcontroller replacement the SPRAM and DSPs are probably more useful than the extra LUTs
<daveshah>
Unless you need the speed
<Vinalon>
yeah, I was thinking 12MHz with 128KB of RAM wouldn't be too bad if you could also have a handful of peripherals
<daveshah>
Also, you don't really need an icache in this kind of situation
<ZirconiumX>
I think the SB_HFOSC can do 48MHz, right?
<daveshah>
You can just copy the performance critical parts to RAM on startup manually
<daveshah>
Or even the whole program if it's small enough
<daveshah>
Yes
<daveshah>
There is a PLL too although it may not be reliable on the Upduino
<Vinalon>
yeah, it is nice that the SPRAMs are so large
<Vinalon>
but I really need to save another 1000 cells or so to hit that target; atm, I can't fit much more than a GPIO peripheral, and it doesn't include multiplication/division, interrupts, timers, or debug support
<ZirconiumX>
Multiplication at least is cheap for a UP5K
<Vinalon>
yeah, with the '-dsp' option right? That'll probably be next, once I can spare a few hundred cells...but at least I have somewhere to start looking now.
<ZirconiumX>
I feel like you could unify these with some level of microcoding
<ZirconiumX>
e.g. an AUIPC is a LUI followed by adding the PC
<ZirconiumX>
And if speed's not an issue (you're bottlenecked on SPI, right?)
<Vinalon>
yeah, that's a good point; and I did save a little bit of space by having the branch operations use the SLT/SLTU ALU operations, maybe other instructions could do the same sort of thing
* ZirconiumX
is actually curious how minimal of an internal instruction set you'd actually need
<MadHacker>
One instruction.
<MadHacker>
OISCs are a thing.
<ZirconiumX>
MadHacker: I don't feel like implementing RISC-V in terms of OISC
<MadHacker>
Usually subtract and branch if not zero.
<MadHacker>
ZirconiumX: It's probably not the most efficient way unless you want to use most of your RAM as storage space for a RISC-V emulator. :)
<Vinalon>
too bad GCC doesn't have "avoid this instruction" flags
<ZirconiumX>
You'd probably need a temporary accumulator to properly microcode these
<ZirconiumX>
What's one more stack of registers? :P
<Vinalon>
well, apparently there is extra room in the register map's BRAM cell...
<MadHacker>
Tsk, you mean you're not just keeping a bank of registers per pipeline stage for temporary results? How are you going to do speculative execution then, huh? :D
<ZirconiumX>
Vinalon: So, AUIPC can be done in terms of LUI and an ADD of PC; SUB can be done in terms of XORI/ADD/ADDI (-n == (n ^ -1) + 1); left-shifts can be done in terms of right-shifts with bits flipped
<ZirconiumX>
I think there's a lot of room for microcoding, actually
<ZirconiumX>
But I already saved you a hardware subtractor and barrel left-shifter :P
<ZirconiumX>
SLT is a subtraction followed by checking the carry bit
<ZirconiumX>
It'd require quite a lot of reorganising, but I think if size is your target you can do it
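The identities suggested above can be checked with a short Python model of 32-bit arithmetic. This is only a sketch of the arithmetic, not a microcode design; all function names are hypothetical. Note that the carry check shown here covers the unsigned comparison (SLTU); signed SLT also needs the sign and overflow bits, which the sketch leaves out.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b):
    """32-bit wrapping add, standing in for the one hardware adder."""
    return (a + b) & MASK32

def sub_via_xor_add(a, b):
    """SUB as XORI b, -1 then ADD then ADDI 1: a - b == a + (b ^ -1) + 1."""
    not_b = (b ^ -1) & MASK32
    return add32(add32(a, not_b), 1)

def auipc_via_lui_add(pc, imm20):
    """AUIPC as LUI (imm20 << 12) followed by adding the PC."""
    lui = (imm20 << 12) & MASK32
    return add32(lui, pc)

def sltu_via_carry(a, b):
    """Unsigned a < b: compute a + ~b + 1 in 33 bits; no carry-out means a borrow."""
    total = (a & MASK32) + ((b ^ -1) & MASK32) + 1
    carry_out = total >> 32
    return 1 - carry_out

print(hex(sub_via_xor_add(5, 7)), hex(auipc_via_lui_add(0x1000, 0x12345)))
```

All three helpers reuse the same adder, which is the space saving being discussed.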
<Vinalon>
yeah, multi-cycle operations would be interesting to investigate. Although, it's interesting that RISC-V doesn't require N/Z/V flags
<ZirconiumX>
You can even go down to an accumulator machine if you need to
<ZirconiumX>
It doesn't, but you'll probably need them anyway
<MadHacker>
Mm, but you might be able to pull them out of the normal critical path.
<MadHacker>
If you only need them for conditional branches, calculating them can be on the branch path not the normal one.
<Vinalon>
well, I guess I'll start by trying to modularize the main state machine. Maybe after consolidating the shifters.
<ZirconiumX>
Hope I've given you some things to think about though
<Vinalon>
definitely, thanks!
Vinalon has quit [Remote host closed the connection]