#nmigen on 2020-04-17 — irc logs at freenode.irclog.whitequark.org

2020-01-27 18:31 ChanServ changed the topic of #nmigen to: nMigen hardware description language · code at https://github.com/nmigen · logs at https://freenode.irclog.whitequark.org/nmigen

00:02 pinknok has quit [Ping timeout: 272 seconds]

00:19 Degi_ has joined #nmigen

00:23 Degi has quit [Ping timeout: 258 seconds]

00:23 Degi_ is now known as Degi

03:07 proteus-dude has joined #nmigen

03:08 proteus-guy has quit [Ping timeout: 240 seconds]

04:17 proteus-dude has quit [Ping timeout: 256 seconds]

04:36 Vinalon has quit [Remote host closed the connection]

04:49 Vinalon has joined #nmigen

05:10 _florent_ has quit [*.net *.split]

05:10 pdp7 has quit [*.net *.split]

05:14 Degi has quit [Ping timeout: 264 seconds]

05:15 pdp7 has joined #nmigen

05:15 _florent_ has joined #nmigen

05:16 Degi has joined #nmigen

05:27 ____ has joined #nmigen

06:45 pinknok has joined #nmigen

07:52 Asu has joined #nmigen

09:28 <_whitenotifier-3> [nmigen] sjolsen opened issue #359: Python chokes on simulated memories of depth greater than ~3k - https://git.io/JfJJh

09:35 <_whitenotifier-3> [nmigen] whitequark commented on issue #359: Python chokes on simulated memories of depth greater than ~3k - https://git.io/JfJUn

09:46 FFY00 has quit [Remote host closed the connection]

09:47 FFY00 has joined #nmigen

09:49 FFY00 has quit [Max SendQ exceeded]

09:49 FFY00 has joined #nmigen

10:13 chipmuenk has joined #nmigen

10:32 Vinalon has quit [Ping timeout: 258 seconds]

11:02 proteus-guy has joined #nmigen

11:15 proteus-guy has quit [Ping timeout: 240 seconds]

11:17 FFY00 has quit [Remote host closed the connection]

11:21 FFY00 has joined #nmigen

11:24 ____2 has joined #nmigen

11:27 ____ has quit [Ping timeout: 264 seconds]

11:28 proteus-guy has joined #nmigen

12:00 lkcl_ has joined #nmigen

12:01 ____2 has quit [Quit: Nettalk6 - www.ntalk.de]

12:03 ____ has joined #nmigen

12:04 lkcl has quit [Ping timeout: 265 seconds]

12:05 Asuu has joined #nmigen

12:05 Asu has quit [Ping timeout: 250 seconds]

12:13 lkcl__ has joined #nmigen

12:16 lkcl_ has quit [Ping timeout: 240 seconds]

12:25 lkcl_ has joined #nmigen

12:29 lkcl__ has quit [Ping timeout: 260 seconds]

12:29 lkcl has joined #nmigen

12:32 lkcl_ has quit [Ping timeout: 256 seconds]

12:32 lkcl__ has joined #nmigen

12:36 lkcl has quit [Ping timeout: 265 seconds]

12:38 lkcl_ has joined #nmigen

12:42 lkcl__ has quit [Ping timeout: 264 seconds]

12:43 lkcl has joined #nmigen

12:45 lkcl_ has quit [Ping timeout: 256 seconds]

12:47 lkcl_ has joined #nmigen

12:50 lkcl has quit [Ping timeout: 250 seconds]

12:52 lkcl__ has joined #nmigen

12:55 lkcl has joined #nmigen

12:56 lkcl_ has quit [Ping timeout: 258 seconds]

12:57 lkcl__ has quit [Ping timeout: 240 seconds]

12:57 lkcl_ has joined #nmigen

13:01 lkcl__ has joined #nmigen

13:01 lkcl has quit [Ping timeout: 265 seconds]

13:05 lkcl_ has quit [Ping timeout: 256 seconds]

13:05 lkcl_ has joined #nmigen

13:09 lkcl__ has quit [Ping timeout: 256 seconds]

13:17 proteus-guy has quit [Ping timeout: 256 seconds]

13:20 lkcl__ has joined #nmigen

13:23 lkcl_ has quit [Ping timeout: 256 seconds]

13:29 proteus-guy has joined #nmigen

13:49 lkcl_ has joined #nmigen

13:53 lkcl__ has quit [Ping timeout: 256 seconds]

14:31 Asu has joined #nmigen

14:32 Asuu has quit [Ping timeout: 256 seconds]

14:47 pinknok has quit [Remote host closed the connection]

14:48 thinknok has joined #nmigen

14:51 thinknok has quit [Remote host closed the connection]

15:00 thinknok has joined #nmigen

15:08 Vinalon has joined #nmigen

15:34 <Vinalon> well, I finally got an RV32I soc with a 'neopixel' peripheral working, which is exciting. But it uses about 5,000 cells, which seems very large compared to other RV32I cores that I've seen.

15:38 <Vinalon> but the peripherals and memories use about 2,000 of those cells, and I wonder if some of the smaller designs use commercial toolchains to get better results. Is it reasonable to expect that I could shave much off of that size?

15:40 <Vinalon> or would that be tilting at windmills?

15:46 <tpw_rules> which core?

15:47 <Vinalon> it's an RV32I on an iCE40UP5K - it sounds like those might tend to have larger designs because of the LUT4s?

15:47 <tpw_rules> "an"?

15:47 <tpw_rules> which?

15:48 <tpw_rules> there's hundreds

15:48 <Vinalon> oh, I wrote one - but vexriscv claims to use less than 1200 cells on an iCE40 with only the integer instruction set

15:49 <tpw_rules> oh i see

15:49 <Vinalon> and my decoder, ALU, and CSRs use about 2500-3000 put together, so it seems like there should be room for improvement; I just don't know how much of it is down to the tooling

15:50 <tpw_rules> i wouldn't think so

15:51 <tpw_rules> that that much of it depends on the tooling

15:54 <ZirconiumX> I wouldn't say Yosys produces *fantastic* synthesis results, but it's probably not doing what you expect if it's that big

15:55 <Vinalon> okay, thanks - guess I should try to read about how SoCs are supposed to be designed, then.

15:58 <ZirconiumX> Vinalon: Have you got per-module size stats?

15:58 <ZirconiumX> e.g. from `synth_ice40 -noflatten`?

16:00 <Vinalon> yeah, but some things still get flattened. The ALU is ~750 cells, and the decoder/CSR module is ~2000-2500.

16:01 <Vinalon> but I sort of hit a wall with simplifying logic, because those are all whittled down to switch cases with what I think is minimal logic

16:02 <Vinalon> the decoder switches on the opcode, the alu switches on function bits, and the CSRs switch on register address.

16:02 <Vinalon> so...I dunno, maybe there's only so much that you can do with high-level syntax and a rudimentary understanding of the underlying fundamentals

16:04 alexhw has quit [Ping timeout: 265 seconds]

16:04 <ZirconiumX> The way I'd think about minimising logic is that your goal is not to use fewer operations, but fewer terms

16:05 <ZirconiumX> If you were targeting ASIC, then minimising operations is good too, but

16:05 <ZirconiumX> For FPGA, a LUT4 is "something made up of 4 terms"

16:05 <ZirconiumX> "A & B & C & D" and "~(A | B) ^ C & ~D" both result in a LUT4
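
A rough way to see the point above: a LUT4 is just a 16-entry truth table, so any function of four one-bit inputs costs the same single LUT however many operators it is written with. The sketch below is plain Python, not nMigen or Yosys output; the helper name `lut4_init` and the bit ordering (A as the least significant index bit) are assumptions made for illustration.

```python
from itertools import product

def lut4_init(fn):
    """Return the 16-bit truth table of a 4-input, 1-bit function.

    Bit `index` of the result is fn(a, b, c, d) for
    index = d*8 + c*4 + b*2 + a (an assumed ordering).
    """
    init = 0
    for index, (d, c, b, a) in enumerate(product((0, 1), repeat=4)):
        if fn(a, b, c, d) & 1:  # mask to one bit, since ~ on a Python int is negative
            init |= 1 << index
    return init

# The two expressions from the discussion, written with Python operators.
and_all = lambda a, b, c, d: a & b & c & d
mixed = lambda a, b, c, d: ~(a | b) ^ c & ~d

print(f"{lut4_init(and_all):#06x}")  # one 16-bit table
print(f"{lut4_init(mixed):#06x}")    # also one 16-bit table
```

Both expressions reduce to a single 16-bit constant, which is why counting operators says little about FPGA area.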

16:06 <sorear> how many CSRs are you implementing?

16:06 <sorear> picorv32 and vex both implement not-quite-standard CSR sets because the list in the manual is not appropriate for minimal FPGA implementations

16:07 <Vinalon> just the basics; mstatus, mcycle, minstret, mtvec, mcause, mscratch, mepc, mtval, mcountinhibit

16:08 <Vinalon> okay, thanks; so how does LUT4 encoding map to things like selecting and comparing a range of bits?

16:08 <Vinalon> like, A/B/C/D are all one-bit values in that example, right?

16:09 <ZirconiumX> Vinalon: If you're switching on, say, RV funct3, then that would be ~8 LUT4s

16:09 <ZirconiumX> For something wider than 4, it gets more complicated to estimate

16:09 <ZirconiumX> But yes, they're all one-bit values

16:10 <Vinalon> so, switching on the 7-bit opcode field or the 12-bit CSR address is probably not very efficient?

16:10 <ZirconiumX> For example, funct7 would be 2 LUT4s per case, since you have one LUT4 to check if the lower 4 bits match what you want, which outputs to another LUT4 that checks if the upper 3 bits match and the lower 4 match
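
The two-LUT4 estimate for funct7 can be generalised as a back-of-the-envelope count. This is only a sketch under the assumption that the comparison is built as a simple chain (the first LUT4 checks four bits, each further LUT4 takes the previous result plus three new bits); a real synthesizer may share logic between cases and do better or worse. The function name is hypothetical.

```python
def lut4s_for_equality(width):
    """Rough LUT4 count for comparing a `width`-bit field against a constant."""
    if width <= 0:
        raise ValueError("width must be positive")
    if width <= 4:
        return 1
    # One LUT4 for the first 4 bits, then ceil((width - 4) / 3) more,
    # each combining the previous match result with 3 new bits.
    return 1 + -(-(width - 4) // 3)

for name, width in (("funct3", 3), ("funct7 / opcode", 7), ("CSR address", 12)):
    print(name, lut4s_for_equality(width))
```

Under that assumption a 7-bit field needs 2 LUT4s per case and a 12-bit CSR address needs 4, before any sharing between cases.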

16:12 <ZirconiumX> However, funct7 is not used as much as funct3, so all illegal opcode groups boil down to two LUT4s

16:13 <Vinalon> yeah, I kind of cheated by sending funct3 and bit 6 of funct7 to the ALU to pick its operations

16:17 <ZirconiumX> Thinking about it, actually, I think the synthesis tool can get away with 2 LUT4s per bit for logical ops, an adder, a subtractor and possibly a shifter

16:17 <Vinalon> it's too bad you can't get estimates of how much logic different Module operations like If/Elif/Case/State use, but it's probably hard enough to have visibility into what the synthesizer does

16:17 <ZirconiumX> Actually, how do you implement shifts?

16:18 <Vinalon> I use Python arithmetic/logic operators for all of the ALU ops, so '<<' and '>>'

16:18 <ZirconiumX> Ouch

16:18 <ZirconiumX> So, that's going to eat up a *lot* of area

16:19 <Vinalon> hehe, is this going to be about barrel shifters again?

16:19 <ZirconiumX> Yep, you're asking it to build a barrel shifter

16:19 <ZirconiumX> Or possibly two, actually

16:19 <ZirconiumX> That's going to be 32 * 5 = 160 LUT4s per barrel shifter

16:21 <ZirconiumX> VexRiscV cheats a little here: it gets away with one barrel right-shifter by flipping the bits of a left shift on the input and output

16:22 <Vinalon> well, that's one thing to improve - thanks!

16:22 <ZirconiumX> Which turns, say, a 320 LUT4 barrel shifter into a 224 LUT4 barrel shifter
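
The bit-flipping trick can be checked in plain Python. This is a behavioural sketch only, not VexRiscV's actual implementation; the function names are made up for illustration. The area figures seem to follow from 32 * 5 = 160 LUT4s per shifter, so two shifters cost 320, while one shifter plus two 32-bit reversal mux layers costs 160 + 2 * 32 = 224 (that breakdown is my reading of the numbers, not something stated in the log).

```python
MASK32 = 0xFFFFFFFF

def reverse32(x):
    """Reverse the bit order of a 32-bit value (bit 0 <-> bit 31)."""
    result = 0
    for i in range(32):
        if x & (1 << i):
            result |= 1 << (31 - i)
    return result

def srl(x, shamt):
    """Logical right shift: the only 'real' shifter in this sketch."""
    return (x & MASK32) >> (shamt & 31)

def sll_via_srl(x, shamt):
    """Left shift built from the right shifter by flipping input and output."""
    return reverse32(srl(reverse32(x & MASK32), shamt))

print(hex(sll_via_srl(0x00000001, 4)))
print(hex(sll_via_srl(0x80000001, 1)))
```

In hardware the reversal is only wiring plus a mux selecting between the flipped and unflipped value, which is why it is cheaper than a second barrel shifter.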

16:22 <Vinalon> but what I really don't get is why the decoder and CSRs are so large, when they're basically just a lot of Cat/bit_select/Repl logic. I thought those were very efficient operations in FPGAs

16:22 <ZirconiumX> They *are*

16:23 <ZirconiumX> Can I see your code?

16:24 <Vinalon> yeah, this is the decoder for example: https://github.com/WRansohoff/nmigen_rv32i_min/blob/master/cpu.py

16:25 <ZirconiumX> https://github.com/WRansohoff/nmigen_rv32i_min/blob/master/cpu.py#L28-L33 <-- this isn't strictly RV compliant

16:26 <ZirconiumX> But I suppose it doesn't matter too much

16:26 <Vinalon> yeah, but when I reduce it to 32 registers and remove the 'irq' flag in the port addresses, it doesn't seem to make any difference in size

16:26 <Vinalon> I guess I should remove it anyways, but I liked how Cortex-M cores save context automatically

16:26 <ZirconiumX> So, your entire CPU is an FSM?

16:26 <Vinalon> yes - is that bad?

16:27 <ZirconiumX> It can be, yes

16:27 <ZirconiumX> Essentially, your cheap decoder is now dependent on FSM state

16:27 <MadHacker> Well, isn't that going to lead to basically everything having the whole current state in it?

16:27 <ZirconiumX> Which makes it notably less cheap

16:27 <MadHacker> So everything will depend on N bits where N is log2(number of states)?

16:28 <Vinalon> oh...but if there are only 3 states, shouldn't that fit in a LUT4?

16:28 <ZirconiumX> MadHacker: No, it'll depend on exactly the number of states, because nMigen makes everything one-hot

16:28 <MadHacker> Oh god. Yeah. Good luck.

16:28 <MadHacker> That'll be why everything's huge then.

16:30 <Vinalon> innnnteresting. Well that's encouraging - sometimes it's nice to have one big problem instead of a bunch of small ones.

16:30 <ZirconiumX> Vinalon: Yes, it's theoretically only one LUT4, but you need that LUT4 for everything you do

16:30 <sorear> BRAMs have a minimum depth, on ice40 you won't see any difference between 32 and 64 registers because the RAMs don't come in any size smaller than 256 registers
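
A quick calculation shows why 32 and 64 registers cost the same. The sketch assumes each iCE40 block RAM is used in a 256-deep by 16-bit-wide configuration (an assumption here, consistent with the minimum depth of 256 mentioned above) and ignores any duplication needed for extra read ports.

```python
BRAM_DEPTH = 256  # assumed minimum depth per block RAM
BRAM_WIDTH = 16   # assumed width in that configuration

def ceil_div(a, b):
    return -(-a // b)

def brams_needed(depth, width):
    """Block RAMs needed for a depth x width memory with one read port."""
    return ceil_div(depth, BRAM_DEPTH) * ceil_div(width, BRAM_WIDTH)

print(brams_needed(32, 32))  # 32 registers of 32 bits
print(brams_needed(64, 32))  # 64 registers of 32 bits
```

Both cases round up to the same number of blocks, so trimming the register count below 256 frees nothing.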

16:31 <ZirconiumX> The CPU decoder will always operate, but the results will be conditional on FSM state

16:31 <ZirconiumX> Compare that to, say, a pipelined CPU, where the CPU decoder will still always operate, but the results will be written (almost - stalls can happen) unconditionally.

16:33 <Vinalon> oh, interesting - so if I implement a traditional pipeline, the fetch/decode/execute stages could all act at once on different values instead of acting on the same one in sequence?

16:33 <ZirconiumX> Correct.

16:34 <ZirconiumX> This is why CPUs nowadays are pipelined

16:34 <ZirconiumX> If you have an FSM, the logic doesn't go away when not in that state

16:34 <ZirconiumX> Better to put it to use if possible.

16:35 <ZirconiumX> For example, with your FSM, in 6 cycles (assuming simple instructions), you can retire 2 instructions.

16:36 <ZirconiumX> With a pipeline you can retire 3 (if I have my math correct)

16:36 <ZirconiumX> Correction: with a pipeline you can retire 4
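
The arithmetic behind those numbers, as a small sketch. It assumes three states or stages (fetch/decode/execute, matching the 3-state FSM mentioned earlier), one cycle per stage, and no stalls or branches; the function names are illustrative.

```python
def retired_fsm(cycles, states=3):
    """An FSM core finishes one instruction every `states` cycles."""
    return cycles // states

def retired_pipeline(cycles, stages=3):
    """A full pipeline retires one instruction per cycle after it fills."""
    return max(0, cycles - (stages - 1))

print(retired_fsm(6))       # 2
print(retired_pipeline(6))  # 4
```

The gap grows with run length: the FSM stays at one third of an instruction per cycle while the pipeline approaches one per cycle.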

16:36 <Vinalon> cool - I guess I never understood how that architecture decision came from the way that the hardware was laid out in the chip

16:37 <ZirconiumX> Granted, you then have small headaches with branches, and instruction dependencies

16:37 <ZirconiumX> But that's the fun part of writing a CPU, right?

16:38 <Vinalon> well, at least the instruction set includes an explicit 'fence' instruction for the compiler to use

16:38 <Vinalon> thanks so much for taking a look and pointing that out!

16:39 <ZirconiumX> Vinalon: FENCE doesn't help you with those :P

16:40 <ZirconiumX> Actually for a strictly in-order CPU I think you can just add a pipeline bubble and be spec-compliant, but anyway

16:43 <Vinalon> hm - well like they say, I can burn that bridge when I come to it. It'll probably take some trial and error to figure out how to avoid using an FSM for the core logic

16:43 <ZirconiumX> Vinalon: Extract your FSM states into independent modules; that'll be a good start

16:44 <ZirconiumX> And you *can* use an FSM, but the less depends on that FSM, the better

16:44 <ZirconiumX> Here your *entire CPU* depends on that FSM.

16:46 <Vinalon> yeah. They'll still have to be gated on the instruction fetching though, because SPI Flash access takes many clock cycles.

16:46 Vinalon has quit [Remote host closed the connection]

16:46 Vinalon has joined #nmigen

16:47 <Vinalon> anyways, thanks again for the advice!

16:48 <ZirconiumX> It won't make your design smaller, perhaps, but it'll be more efficient, I think

16:49 <ZirconiumX> e.g. if you fetch a small stream of data that you cache (it doesn't have to be big), there'll still be pretty big gains

16:51 <Vinalon> yeah, an I-cache is pretty high on the list of things I want to add, which is why I'm trying to optimize for size; the iCE40UP5K seems like the upper bound of what you can get within an order of magnitude of the cost of an actual microcontroller

16:52 <Vinalon> If I needed a $100 ECP5 board to do the work of a Cortex-M0 chip, I wouldn't be able to actually use it in many throwaway projects

16:54 <ZirconiumX> Vinalon: Minerva is 2,589 SB_LUT4s, as a datapoint

16:55 <daveshah> The cheapest ECP5 chip is about $5

16:55 <daveshah> But the support components and assembly costs are a fortune compared to the UP5K

16:55 <Vinalon> yeah, but they're BGA and all the boards I've seen pack in extras like DDR RAM

16:55 <ZirconiumX> How much is an OrangeCrab? That seems like the smallest ECP5 board I know of

16:55 <Vinalon> I think the groupget is $99

16:55 <daveshah> I think it was the usual $99 mark

16:56 <daveshah> Yeah

16:56 <Vinalon> if only they made an HX8K in QFN form factor

16:56 <daveshah> For a microcontroller replacement the SPRAM and DSPs are probably more useful than the extra LUTs

16:56 <daveshah> Unless you need the speed

16:57 <Vinalon> yeah, I was thinking 12MHz with 128KB of RAM wouldn't be too bad if you could also have a handful of peripherals

16:58 <daveshah> Also, you don't really need an icache in this kind of situation

16:59 <ZirconiumX> I think the SB_HFOSC can do 48MHz, right?

16:59 <daveshah> You can just copy the performance critical parts to RAM on startup manually

16:59 <daveshah> Or even the whole program if it's small enough

16:59 <daveshah> Yes

16:59 <daveshah> There is a PLL too although it may not be reliable on the Upduino

16:59 <Vinalon> yeah, it is nice that the SPRAMs are so large

17:00 <Vinalon> but I really need to save another 1000 cells or so to hit that target; atm, I can't fit much more than a GPIO peripheral, and it doesn't include multiplication/division, interrupts, timers, or debug support

17:00 <ZirconiumX> Multiplication at least is cheap for a UP5K

17:01 <Vinalon> yeah, with the '-dsp' option right? That'll probably be next, once I can spare a few hundred cells...but at least I have somewhere to start looking now.

17:02 <ZirconiumX> I feel like you could unify these with some level of microcoding

17:03 <ZirconiumX> e.g. an AUIPC is a LUI followed by adding the PC

17:03 <ZirconiumX> And if speed's not an issue (you're bottlenecked on SPI, right?)

17:05 <Vinalon> yeah, that's a good point; and I did save a little bit of space by having the branch operations use the SLT/SLTU ALU operations, maybe other instructions could do the same sort of thing

17:07 * ZirconiumX is actually curious how minimal of an internal instruction set you'd actually need

17:10 <MadHacker> One instruction.

17:10 <MadHacker> OISCs are a thing.

17:10 <ZirconiumX> MadHacker: I don't feel like implementing RISC-V in terms of OISC

17:10 <MadHacker> Usually subtract and branch if not zero usually.

17:11 <MadHacker> *-usually

17:11 <MadHacker> ZirconiumX: It's probably not the most efficient way unless you want to use most of your RAM as storage space for a RISC-V emulator. :)

17:11 <Vinalon> too bad GCC doesn't have "avoid this instruction" flags

17:12 <ZirconiumX> You'd probably need a temporary accumulator to properly microcode these

17:12 <ZirconiumX> What's one more stack of registers? :P

17:13 <Vinalon> well, apparently there is extra room in the register map's BRAM cell...

17:14 <MadHacker> Tsk, you mean you're not just keeping a bank of registers per pipeline stage for temporary results? How are you going to do speculative execution then, huh? :D

17:18 <ZirconiumX> Vinalon: So, AUIPC can be done in terms of LUI and an ADD of PC; SUB can be done in terms of XORI/ADD/ADDI (-n == (n ^ -1) + 1); left-shifts can be done in terms of right-shifts with bits flipped
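
The identities above can be sanity-checked with 32-bit wraparound arithmetic in plain Python. This only models the maths, not a microcode sequencer; all function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b):
    """The single adder everything else reuses."""
    return (a + b) & MASK32

def neg32(n):
    """-n == (n ^ -1) + 1, i.e. XORI with all ones followed by ADDI 1."""
    return add32(n ^ MASK32, 1)

def sub32(a, b):
    """SUB expressed as an ADD of the negated operand."""
    return add32(a, neg32(b))

def auipc(pc, imm20):
    """AUIPC as LUI (imm20 << 12) followed by adding the PC."""
    lui = (imm20 << 12) & MASK32
    return add32(lui, pc)

print(hex(sub32(5, 7)))          # wraps around like a 32-bit register
print(hex(auipc(0x1000, 0x12345)))
```

Each helper is built only from `add32` and an XOR, which is the sense in which a separate hardware subtractor is not needed.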

17:18 <ZirconiumX> I think there's a lot of room for microcoding, actually

17:19 <ZirconiumX> But I already saved you a hardware subtractor and barrel left-shifter :P

17:20 <ZirconiumX> SLT is a subtraction followed by checking the carry bit
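
A sketch of that idea in plain Python. For the unsigned compare (SLTU) the carry out of `a + ~b + 1` is enough; for the signed compare (SLT) this sketch also looks at the sign of the result and signed overflow, which is a detail added here and not spelled out in the log. Function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sub_with_carry(a, b):
    """Compute a - b as a + ~b + 1; return (32-bit result, carry out)."""
    total = (a & MASK32) + ((b & MASK32) ^ MASK32) + 1
    return total & MASK32, total >> 32

def sltu(a, b):
    """Unsigned a < b: true exactly when the subtraction borrows (carry == 0)."""
    _, carry = sub_with_carry(a, b)
    return 1 - carry

def slt(a, b):
    """Signed a < b: sign of the difference, corrected for signed overflow."""
    diff, _ = sub_with_carry(a, b)
    sign_a, sign_b, sign_d = (a >> 31) & 1, (b >> 31) & 1, (diff >> 31) & 1
    overflow = 1 if (sign_a != sign_b and sign_d != sign_a) else 0
    return sign_d ^ overflow

print(sltu(1, 2), sltu(2, 1))
print(slt(0xFFFFFFFF, 0), slt(0, 0xFFFFFFFF))  # -1 < 0, but 0 is not < -1
```

The same subtraction result can feed the branch comparisons, which fits the earlier remark about branches reusing the SLT/SLTU ALU operations.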

17:20 <ZirconiumX> It'd require quite a lot of reorganising, but I think if size is your target you can do it

17:21 <Vinalon> yeah, multi-cycle operations would be interesting to investigate. Although, it's interesting that RISC-V doesn't require N/Z/V flags

17:21 <ZirconiumX> You can even go down to an accumulator machine if you need to

17:21 <ZirconiumX> It doesn't, but you'll probably need them anyway

17:22 <MadHacker> Mm, but you might be able to pull them out of the normal critical path.

17:22 <MadHacker> If you only need them for conditional branches, calculating them can be on the branch path not the normal one.

17:22 <Vinalon> well, I guess I'll start by trying to modularize the main state machine. Maybe after consolidating the shifters.

17:23 <ZirconiumX> Hope I've given you some things to think about though

17:23 <Vinalon> definitely, thanks!

19:24 Vinalon has quit [Remote host closed the connection]

19:44 Vinalon has joined #nmigen

20:24 Asuu has joined #nmigen

20:24 Asu has quit [Ping timeout: 264 seconds]

21:01 Asuu has quit [Ping timeout: 256 seconds]

21:01 Asu has joined #nmigen

21:09 Asuu has joined #nmigen

21:10 Asu has quit [Ping timeout: 260 seconds]

21:24 ____ has quit [Quit: Nettalk6 - www.ntalk.de]

21:55 Asu has joined #nmigen

21:58 Asuu has quit [Ping timeout: 260 seconds]

22:38 chipmuenk has quit [Quit: chipmuenk]

22:41 Asu has quit [Remote host closed the connection]

23:55 Vinalon has quit [Remote host closed the connection]

23:56 Vinalon has joined #nmigen