ChanServ changed the topic of #nmigen to: nMigen hardware description language · code at https://github.com/nmigen · logs at https://freenode.irclog.whitequark.org/nmigen
pinknok has quit [Ping timeout: 272 seconds]
Degi_ has joined #nmigen
Degi has quit [Ping timeout: 258 seconds]
Degi_ is now known as Degi
proteus-dude has joined #nmigen
proteus-guy has quit [Ping timeout: 240 seconds]
proteus-dude has quit [Ping timeout: 256 seconds]
Vinalon has quit [Remote host closed the connection]
Vinalon has joined #nmigen
_florent_ has quit [*.net *.split]
pdp7 has quit [*.net *.split]
Degi has quit [Ping timeout: 264 seconds]
pdp7 has joined #nmigen
_florent_ has joined #nmigen
Degi has joined #nmigen
____ has joined #nmigen
pinknok has joined #nmigen
Asu has joined #nmigen
<_whitenotifier-3> [nmigen] sjolsen opened issue #359: Python chokes on simulated memories of depth greater than ~3k - https://git.io/JfJJh
<_whitenotifier-3> [nmigen] whitequark commented on issue #359: Python chokes on simulated memories of depth greater than ~3k - https://git.io/JfJUn
FFY00 has quit [Remote host closed the connection]
FFY00 has joined #nmigen
FFY00 has quit [Max SendQ exceeded]
FFY00 has joined #nmigen
chipmuenk has joined #nmigen
Vinalon has quit [Ping timeout: 258 seconds]
proteus-guy has joined #nmigen
proteus-guy has quit [Ping timeout: 240 seconds]
FFY00 has quit [Remote host closed the connection]
FFY00 has joined #nmigen
____2 has joined #nmigen
____ has quit [Ping timeout: 264 seconds]
proteus-guy has joined #nmigen
lkcl_ has joined #nmigen
____2 has quit [Quit: Nettalk6 - www.ntalk.de]
____ has joined #nmigen
lkcl has quit [Ping timeout: 265 seconds]
Asuu has joined #nmigen
Asu has quit [Ping timeout: 250 seconds]
lkcl__ has joined #nmigen
lkcl_ has quit [Ping timeout: 240 seconds]
lkcl_ has joined #nmigen
lkcl__ has quit [Ping timeout: 260 seconds]
lkcl has joined #nmigen
lkcl_ has quit [Ping timeout: 256 seconds]
lkcl__ has joined #nmigen
lkcl has quit [Ping timeout: 265 seconds]
lkcl_ has joined #nmigen
lkcl__ has quit [Ping timeout: 264 seconds]
lkcl has joined #nmigen
lkcl_ has quit [Ping timeout: 256 seconds]
lkcl_ has joined #nmigen
lkcl has quit [Ping timeout: 250 seconds]
lkcl__ has joined #nmigen
lkcl has joined #nmigen
lkcl_ has quit [Ping timeout: 258 seconds]
lkcl__ has quit [Ping timeout: 240 seconds]
lkcl_ has joined #nmigen
lkcl__ has joined #nmigen
lkcl has quit [Ping timeout: 265 seconds]
lkcl_ has quit [Ping timeout: 256 seconds]
lkcl_ has joined #nmigen
lkcl__ has quit [Ping timeout: 256 seconds]
proteus-guy has quit [Ping timeout: 256 seconds]
lkcl__ has joined #nmigen
lkcl_ has quit [Ping timeout: 256 seconds]
proteus-guy has joined #nmigen
lkcl_ has joined #nmigen
lkcl__ has quit [Ping timeout: 256 seconds]
Asu has joined #nmigen
Asuu has quit [Ping timeout: 256 seconds]
pinknok has quit [Remote host closed the connection]
thinknok has joined #nmigen
thinknok has quit [Remote host closed the connection]
thinknok has joined #nmigen
Vinalon has joined #nmigen
<Vinalon> well, I finally got an RV32I SoC with a 'neopixel' peripheral working, which is exciting. But it uses about 5,000 cells, which seems very large compared to other RV32I cores that I've seen.
<Vinalon> but the peripherals and memories use about 2,000 of those cells, and I wonder if some of the smaller designs use commercial toolchains to get better results. Is it reasonable to expect that I could shave much off of that size?
<Vinalon> or would that be tilting at windmills?
<tpw_rules> which core?
<Vinalon> it's an RV32I on an iCE40UP5K - it sounds like those might tend to have larger designs because of the LUT4s?
<tpw_rules> "an"?
<tpw_rules> which?
<tpw_rules> there's hundreds
<Vinalon> oh, I wrote one - but vexriscv claims to use less than 1200 cells on an iCE40 with only the integer instruction set
<tpw_rules> oh i see
<Vinalon> and my decoder, ALU, and CSRs use about 2500-3000 put together, so it seems like there should be room for improvement; I just don't know how much of it is down to the tooling
<tpw_rules> i wouldn't think so
<tpw_rules> that that much of it depends on the tooling
<ZirconiumX> I wouldn't say Yosys produces *fantastic* synthesis results, but it's probably not doing what you expect if it's that big
<Vinalon> okay, thanks - guess I should try to read about how SoCs are supposed to be designed, then.
<ZirconiumX> Vinalon: Have you got per-module size stats?
<ZirconiumX> e.g. from `synth_ice40 -noflatten`?
<Vinalon> yeah, but some things still get flattened. The ALU is ~750 cells, and the decoder/CSR module is ~2000-2500.
<Vinalon> but I sort of hit a wall with simplifying logic, because those are all whittled down to switch cases with what I think is minimal logic
<Vinalon> the decoder switches on the opcode, the alu switches on function bits, and the CSRs switch on register address.
<Vinalon> so...I dunno, maybe there's only so much that you can do with high-level syntax and a rudimentary understanding of the underlying fundamentals
alexhw has quit [Ping timeout: 265 seconds]
<ZirconiumX> The way I'd think about minimising logic is that your goal is not to use fewer operations, but fewer terms
<ZirconiumX> If you were targeting ASIC, then minimising operations is good too, but
<ZirconiumX> For FPGA, a LUT4 is "something made up of 4 terms"
<ZirconiumX> "A & B & C & D" and "~(A | B) ^ C & ~D" both result in a LUT4
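[A LUT4 is a 16-entry truth table, so any function of four one-bit inputs fits in one LUT regardless of how many operators the expression uses. A plain-Python sketch of that idea; `lut4_init` is an illustrative helper, not a real tool API:]

```python
from itertools import product

def lut4_init(fn):
    """Pack a 4-input boolean function into a 16-bit LUT init word."""
    init = 0
    for i, (d, c, b, a) in enumerate(product((0, 1), repeat=4)):
        if fn(a, b, c, d):
            init |= 1 << i
    return init

# Both expressions from the discussion are functions of four inputs,
# so each one fits a single LUT4 (a single 16-bit init word).
and4 = lut4_init(lambda a, b, c, d: a & b & c & d)
mixed = lut4_init(lambda a, b, c, d: (~(a | b) ^ (c & ~d)) & 1)
```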
<sorear> how many CSRs are you implementing?
<sorear> picorv32 and vex both implement not-quite-standard CSR sets because the list in the manual is not appropriate for minimal FPGA implementations
<Vinalon> just the basics; mstatus, mcycle, minstret, mtvec, mcause, mscratch, mepc, mtval, mcountinhibit
<Vinalon> okay, thanks; so how does LUT4 encoding map to things like selecting and comparing a range of bits?
<Vinalon> like, A/B/C/D are all one-bit values in that example, right?
<ZirconiumX> Vinalon: If you're switching on, say, RV funct3, then that would be ~8 LUT4s
<ZirconiumX> For something wider than 4, it gets more complicated to estimate
<ZirconiumX> But yes, they're all one-bit values
<Vinalon> so, switching on the 7-bit opcode field or the 12-bit CSR address is probably not very efficient?
<ZirconiumX> For example, funct7 would be 2 LUT4s per case, since you have one LUT4 to check if the lower 4 bits match what you want, which outputs to another LUT4 that checks if the upper 3 bits match and the lower 4 match
<ZirconiumX> However, funct7 is not used as much as funct3, so all illegal opcode groups boil down to two LUT4s
<Vinalon> yeah, I kind of cheated by sending funct3 and bit 6 of func7 to the ALU to pick its operations
<ZirconiumX> Thinking about it, actually, I think the synthesis tool can get away with 2 LUT4s per bit for logical ops, an adder, a subtractor and possibly a shifter
<Vinalon> it's too bad you can't get estimates of how much logic different Module operations like If/Elif/Case/State use, but it's probably hard enough to have visibility into what the synthesizer does
<ZirconiumX> Actually, how do you implement shifts?
<Vinalon> I use Python arithmetic/logic operators for all of the ALU ops, so '<<' and '>>'
<ZirconiumX> Ouch
<ZirconiumX> So, that's going to eat up a *lot* of area
<Vinalon> hehe, is this going to be about barrel shifters again?
<ZirconiumX> Yep, you're asking it to build a barrel shifter
<ZirconiumX> Or possibly two, actually
<ZirconiumX> That's going to be 32 * 5 = 160 LUT4s per barrel shifter
<ZirconiumX> VexRiscV cheats a little here: it gets away with one barrel right-shifter by flipping the bits of a left shift on the input and output
<Vinalon> well, that's one thing to improve - thanks!
<ZirconiumX> Which turns, say, a 320 LUT4 barrel shifter into a 224 LUT4 barrel shifter
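[The flip trick can be checked in plain Python (`rev32` and `sll_via_srl` are illustrative names): a logical left shift is a right shift with the operand bit-reversed on the way in and the result bit-reversed on the way out, so one right-shifter serves both directions.]

```python
MASK = 0xFFFFFFFF

def rev32(x):
    """Reverse the bit order of a 32-bit word."""
    return int(f"{x & MASK:032b}"[::-1], 2)

def sll_via_srl(x, n):
    """Logical left shift implemented with only a right shifter."""
    return rev32(rev32(x) >> n)

# Matches a true masked left shift for any 32-bit operand and shift amount.
for x, n in [(0b1011, 4), (0xDEADBEEF, 13), (1, 31)]:
    assert sll_via_srl(x, n) == (x << n) & MASK
```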
<Vinalon> but what I really don't get is why the decoder and CSRs are so large, when they're basically just a lot of Cat/bit_select/Repl logic. I thought those were very efficient operations in FPGAs
<ZirconiumX> They *are*
<ZirconiumX> Can I see your code?
<Vinalon> yeah, this is the decoder for example: https://github.com/WRansohoff/nmigen_rv32i_min/blob/master/cpu.py
<ZirconiumX> https://github.com/WRansohoff/nmigen_rv32i_min/blob/master/cpu.py#L28-L33 <-- this isn't strictly RV compliant
<ZirconiumX> But I suppose it doesn't matter too much
<Vinalon> yeah, but when I reduce it to 32 registers and remove the 'irq' flag in the port addresses, it doesn't seem to make any difference in size
<Vinalon> I guess I should remove it anyways, but I liked how Cortex-M cores save context automatically
<ZirconiumX> So, your entire CPU is an FSM?
<Vinalon> yes - is that bad?
<ZirconiumX> It can be, yes
<ZirconiumX> Essentially, your cheap decoder is now dependent on FSM state
<MadHacker> Well, isn't that going to lead to basically everything having the whole current state in it?
<ZirconiumX> Which makes it notably less cheap
<MadHacker> So everything will depend on N bits where N is log2(number of states)?
<Vinalon> oh...but if there are only 3 states, shouldn't that fit in a LUT4?
<ZirconiumX> MadHacker: No, it'll depend on exactly the number of states, because nMigen makes everything one-hot
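[A rough sketch of the width difference being discussed (a toy helper, not nMigen API): one-hot spends one register bit per state, binary spends log2 of the state count, and every signal gated on the FSM sees those bits as extra LUT inputs.]

```python
from math import ceil, log2

def state_bits(n_states, one_hot):
    """Register bits needed to encode an FSM's current state."""
    return n_states if one_hot else max(1, ceil(log2(n_states)))

# A 3-state FSM: 2 bits binary-encoded, 3 bits one-hot -- and every
# state-dependent signal takes all of those bits as inputs.
```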
<MadHacker> Oh god. Yeah. Good luck.
<MadHacker> That'll be why everything's huge then.
<Vinalon> innnnteresting. Well that's encouraging - sometimes it's nice to have one big problem instead of a bunch of small ones.
<ZirconiumX> Vinalon: Yes, it's theoretically only one LUT4, but you need that LUT4 for everything you do
<sorear> BRAMs have a minimum depth, on ice40 you won't see any difference between 32 and 64 registers because the RAMs don't come in any size smaller than 256 registers
<ZirconiumX> The CPU decoder will always operate, but the results will be conditional on FSM state
<ZirconiumX> Compare that to, say, a pipelined CPU, where the CPU decoder will still always operate, but the results will be written (almost - stalls can happen) unconditionally.
<Vinalon> oh, interesting - so if I implement a traditional pipeline, the fetch/decode/execute stages could all act at once on different values instead of acting on the same one in sequence?
<ZirconiumX> Correct.
<ZirconiumX> This is why CPUs nowadays are pipelined
<ZirconiumX> If you have an FSM, the logic doesn't go away when not in that state
<ZirconiumX> Better to put it to use if possible.
<ZirconiumX> For example, with your FSM, in 6 cycles (assuming simple instructions), you can retire 2 instructions.
<ZirconiumX> With a pipeline you can retire 3 (if I have my math correct)
<ZirconiumX> With a pipeline you can retire 4
<Vinalon> cool - I guess I never understood how that architecture decision came from the way that the hardware was laid out in the chip
<ZirconiumX> Granted, you then have small headaches with branches, and instruction dependencies
<ZirconiumX> But that's the fun part of writing a CPU, right?
<Vinalon> well, at least the instruction set includes an explicit 'fence' instruction for the compiler to use
<Vinalon> thanks so much for taking a look and pointing that out!
<ZirconiumX> Vinalon: FENCE doesn't help you with those :P
<ZirconiumX> Actually for a strictly in-order CPU I think you can just add a pipeline bubble and be spec-compliant, but anyway
<Vinalon> hm - well like they say, I can burn that bridge when I come to it. It'll probably take some trial and error to figure out how to avoid using an FSM for the core logic
<ZirconiumX> Vinalon: Extract your FSM states into independent modules; that'll be a good start
<ZirconiumX> And you *can* use an FSM, but the less depends on that FSM, the better
<ZirconiumX> Here your *entire CPU* depends on that FSM.
<Vinalon> yeah. They'll still have to be gated on the instruction fetching though, because SPI Flash access takes many clock cycles.
Vinalon has quit [Remote host closed the connection]
Vinalon has joined #nmigen
<Vinalon> anyways, thanks again for the advice!
<ZirconiumX> It won't make your design smaller, perhaps, but it'll be more efficient, I think
<ZirconiumX> e.g. if you fetch a small stream of data that you cache (it doesn't have to be big), there'll still be pretty big gains
<Vinalon> yeah, an I-cache is pretty high on the list of things I want to add, which is why I'm trying to optimize for size; the iCE40UP5K seems like the upper bound of what you can get within an order of magnitude of the cost of an actual microcontroller
<Vinalon> If I needed a $100 ECP5 board to do the work of a Cortex-M0 chip, I wouldn't be able to actually use it in many throwaway projects
<ZirconiumX> Vinalon: Minerva is 2,589 SB_LUT4s, as a datapoint
<daveshah> The cheapest ECP5 chip is about $5
<daveshah> But the support components and assembly costs are a fortune compared to the UP5K
<Vinalon> yeah, but they're BGA and all the boards I've seen pack in extras like DDR RAM
<ZirconiumX> How much is an OrangeCrab? That seems like the smallest ECP5 board I know of
<Vinalon> I think the GroupGets price is $99
<daveshah> I think it was the usual $99 mark
<daveshah> Yeah
<Vinalon> if only they made an HX8K in QFN form factor
<daveshah> For a microcontroller replacement the SPRAM and DSPs are probably more useful than the extra LUTs
<daveshah> Unless you need the speed
<Vinalon> yeah, I was thinking 12MHz with 128KB of RAM wouldn't be too bad if you could also have a handful of peripherals
<daveshah> Also, you don't really need an icache in this kind of situation
<ZirconiumX> I think the SB_HFOSC can do 48MHz, right?
<daveshah> You can just copy the performance critical parts to RAM on startup manually
<daveshah> Or even the whole program if its small enough
<daveshah> Yes
<daveshah> There is a PLL too although it may not be reliable on the Upduino
<Vinalon> yeah, it is nice that the SPRAMs are so large
<Vinalon> but I really need to save another 1000 cells or so to hit that target; atm, I can't fit much more than a GPIO peripheral, and it doesn't include multiplication/division, interrupts, timers, or debug support
<ZirconiumX> Multiplication at least is cheap for a UP5K
<Vinalon> yeah, with the '-dsp' option right? That'll probably be next, once I can spare a few hundred cells...but at least I have somewhere to start looking now.
<ZirconiumX> I feel like you could unify these with some level of microcoding
<ZirconiumX> e.g. an AUIPC is a LUI followed by adding the PC
<ZirconiumX> And if speed's not an issue (you're bottlenecked on SPI, right?)
<Vinalon> yeah, that's a good point; and I did save a little bit of space by having the branch operations use the SLT/SLTU ALU operations, maybe other instructions could do the same sort of thing
* ZirconiumX is actually curious how minimal of an internal instruction set you'd actually need
<MadHacker> One instruction.
<MadHacker> OISCs are a thing.
<ZirconiumX> MadHacker: I don't feel like implementing RISC-V in terms of OISC
<MadHacker> Subtract and branch if not zero, usually.
<MadHacker> ZirconiumX: It's probably not the most efficient way unless you want to use most of your RAM as storage space for a RISC-V emulator. :)
<Vinalon> too bad GCC doesn't have "avoid this instruction" flags
<ZirconiumX> You'd probably need a temporary accumulator to properly microcode these
<ZirconiumX> What's one more stack of registers? :P
<Vinalon> well, apparently there is extra room in the register map's BRAM cell...
<MadHacker> Tsk, you mean you're not just keeping a bank of registers per pipeline stage for temporary results? How are you going to do speculative execution then, huh? :D
<ZirconiumX> Vinalon: So, AUIPC can be done in terms of LUI and an ADD of PC; SUB can be done in terms of XORI/ADD/ADDI (-n == (n ^ -1) + 1); left-shifts can be done in terms of right-shifts with bits flipped
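[The identities behind this microcoding sketch can be checked in plain Python, with 32-bit wraparound made explicit (helper names are illustrative):]

```python
MASK = 0xFFFFFFFF

def neg(n):
    """Two's-complement negate: -n == (n ^ -1) + 1."""
    return ((n ^ MASK) + 1) & MASK

def sub(a, b):
    """SUB expressed as XORI (against all-ones), ADDI 1, then ADD."""
    return (a + neg(b)) & MASK

def auipc(pc, imm20):
    """AUIPC as a LUI (imm << 12) followed by adding the PC."""
    lui = (imm20 << 12) & MASK
    return (lui + pc) & MASK
```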
<ZirconiumX> I think there's a lot of room for microcoding, actually
<ZirconiumX> But I already saved you a hardware subtractor and barrel left-shifter :P
<ZirconiumX> SLT is a subtraction followed by checking the carry bit
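[That SLT construction can be sketched in plain Python (helper names are illustrative; inputs are assumed already masked to 32 bits): SLTU falls out of the subtractor's carry directly, while signed SLT needs the result's sign bit XORed with the subtraction's overflow flag.]

```python
MASK = 0xFFFFFFFF

def sltu(a, b):
    """Unsigned set-less-than from the subtractor's carry-out."""
    # a - b computed as a + ~b + 1; bit 32 of the sum is the carry-out
    carry = (a + ((b ^ MASK) & MASK) + 1) >> 32
    return 1 - carry  # no carry-out means a borrow occurred, i.e. a < b

def slt(a, b):
    """Signed set-less-than: result sign bit XOR subtraction overflow."""
    diff = (a - b) & MASK
    sa, sb, sd = a >> 31, b >> 31, diff >> 31
    overflow = 1 if (sa != sb and sd != sa) else 0
    return sd ^ overflow
```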
<ZirconiumX> It'd require quite a lot of reorganising, but I think if size is your target you can do it
<Vinalon> yeah, multi-cycle operations would be interesting to investigate. Although, it's interesting that RISC-V doesn't require N/Z/V flags
<ZirconiumX> You can even go down to an accumulator machine if you need to
<ZirconiumX> It doesn't, but you'll probably need them anyway
<MadHacker> Mm, but you might be able to pull them out of the normal critical path.
<MadHacker> If you only need them for conditional branches, calculating them can be on the branch path not the normal one.
<Vinalon> well, I guess I'll start by trying to modularize the main state machine. Maybe after consolidating the shifters.
<ZirconiumX> Hope I've given you some things to think about though
<Vinalon> definitely, thanks!
Vinalon has quit [Remote host closed the connection]
Vinalon has joined #nmigen
Asuu has joined #nmigen
Asu has quit [Ping timeout: 264 seconds]
Asuu has quit [Ping timeout: 256 seconds]
Asu has joined #nmigen
Asuu has joined #nmigen
Asu has quit [Ping timeout: 260 seconds]
____ has quit [Quit: Nettalk6 - www.ntalk.de]
Asu has joined #nmigen
Asuu has quit [Ping timeout: 260 seconds]
chipmuenk has quit [Quit: chipmuenk]
Asu has quit [Remote host closed the connection]
Vinalon has quit [Remote host closed the connection]
Vinalon has joined #nmigen