#nmigen on 2021-05-13 — irc logs at freenode.irclog.whitequark.org

2021-02-27 05:40 whitequark[m] changed the topic of #nmigen to: nMigen hardware description language · code https://github.com/nmigen · logs https://freenode.irclog.whitequark.org/nmigen

00:03 lf has quit [Ping timeout: 250 seconds]

00:03 lf has joined #nmigen

00:20 <_whitenotifier-1> [YoWASP/nextpnr] whitequark pushed 1 commit to develop [+0/-0/±1] https://git.io/JsLIY

00:20 <_whitenotifier-1> [YoWASP/nextpnr] whitequark b8708c2 - Update dependencies.

00:51 pftbest has quit [Remote host closed the connection]

00:52 pftbest has joined #nmigen

01:00 <_whitenotifier-1> [YoWASP/yosys] whitequark pushed 1 commit to develop [+0/-0/±1] https://git.io/JsLYn

01:00 <_whitenotifier-1> [YoWASP/yosys] whitequark 2c32c5c - Update dependencies.

02:34 revolve has quit [Ping timeout: 265 seconds]

02:37 revolve has joined #nmigen

03:53 Degi_ has joined #nmigen

03:54 Degi has quit [Ping timeout: 265 seconds]

03:54 Degi_ is now known as Degi

04:18 roamingryan has joined #nmigen

04:52 roamingryan has quit [Ping timeout: 260 seconds]

05:16 modwizcode has quit [Ping timeout: 252 seconds]

06:23 modwizcode has joined #nmigen

09:04 revolve has quit [Read error: Connection reset by peer]

09:06 revolve has joined #nmigen

09:24 bvernoux has joined #nmigen

10:14 falteckz has joined #nmigen

10:55 <lkcl> twam: yosys should - does - break it down into appropriate (massive) smaller multiplies plus a cascade of adds

10:56 <lkcl> all of that however will be combinatorial. so, you think you're doing a single "sync", actually a massive chain of adds/muls is put behind it.

10:56 <agg> yosys should/can infer the 18x18 hardware multiplies, but i don't know if it can infer those as part of a wider multiplication

10:56 <lkcl> you can check for yourself by writing a *very* simple program with a single multiply, run yosys "synth_ecp5" then "show top"

10:57 <agg> afaik it can't infer or use 36x36 hardware multiplies, those are a special configuration of four 18x18 in a dsp slice

10:57 <lkcl> agg: interesting

10:57 <agg> as in, if you write "a = Signal(18); b = Signal(18); c = Signal(36); m.d.comb += c.eq(a * b)" and target ecp5, you should find it uses one MULT18X18D and is fast

10:57 <lkcl> i was going to suggest to twam to break it down explicitly, into pipelined long-multiplication

10:58 <lkcl> agg: ha, nice example.

10:58 <agg> if you instantiate the MULT18X18D yourself and figure it out, you can have it do 36x36 multiplication using four of them, but i don't think it knows how to do that itself

10:59 <lkcl> agg: we currently have a 64x64->128 mul by just "comb += .eq(a*b)" at the moment

10:59 <lkcl> and yes, it produces correct results

10:59 <lkcl> but

10:59 <agg> bet that's fun for timing

10:59 <lkcl> it's one of ... lol yes

10:59 <agg> are you targeting ecp5 though? mapping to hardware dsp units will often be quite hardware specific

11:00 <lkcl> it's massive, and one of the critical paths @ 50 mhz on the ECP5

11:00 <agg> ah, hm

11:00 <agg> so it doesn't infer any MULT18X18Ds in the usage report when you have a 64x64?

11:00 <lkcl> no, we're using the ECP5 as "just a platform on the way to ASIC"

11:00 <lkcl> 1 sec

11:00 <agg> if you can write it as a bunch of 18x18 (or 16x16) yourself, it might manage to infer correctly

11:00 <lkcl> Info: MULT18X18D: 16/ 72 22%

11:00 <agg> huh

11:00 <agg> well, it's using them

11:00 <agg> for something...

11:01 <agg> 16 is the right number for a single 64x64 multiply using 16x16 building blocks, too, right?

11:01 <agg> four 16x16 to get a 32x32, four of those to get a 64x64
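
The count of 16 can be checked with a small software model. This is only a sketch in plain Python integers, not nmigen and not what yosys actually emits; the function name `mul64_from_16x16` is made up for illustration. It splits each 64-bit operand into four 16-bit limbs and sums the shifted limb-by-limb products.

```python
def mul64_from_16x16(a, b):
    """Multiply two 64-bit integers using only 16x16-bit partial products.

    Hypothetical helper for illustration; returns (product, multiply_count).
    """
    mask16 = 0xFFFF
    a_limbs = [(a >> (16 * i)) & mask16 for i in range(4)]
    b_limbs = [(b >> (16 * j)) & mask16 for j in range(4)]

    total = 0
    n_mults = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            # Each 16x16 product is weighted by the position of its limbs.
            total += (ai * bj) << (16 * (i + j))
            n_mults += 1
    return total, n_mults


a = 0xFEDCBA9876543210
b = 0x0123456789ABCDEF
product, n_mults = mul64_from_16x16(a, b)
print(product == a * b, n_mults)
```

In hardware the additions would be a separate adder tree in logic, which is the part that hurts timing when it is all combinatorial.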

11:02 <lkcl> well, you'll like this: our "live" nmigen multiplier is SIMD-partitionable

11:02 <agg> it won't be using the ALU in the DSP slice, though, so the addition is happening in logic

11:02 <lkcl> (designed for ASIC primarily)

11:02 <agg> it could probably be faster if it could, but maybe that's not worth the effort when ecp5 is just a prototype platform

11:02 <lkcl> so we do it like you would with "long multiplication", and use a Dadda Multiplier algorithm

11:03 <lkcl> wait, sorry - a Wallace

11:03 <lkcl> https://en.wikipedia.org/wiki/Wallace_tree

11:04 <lkcl> so we actually have QTY 64 of 8x8->16-bit multiplies that are then cascade-added in a multi-stage pipeline
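
A rough software model of that structure, assuming word-level 3:2 carry-save compressors rather than the bit-level Wallace tree used in real hardware. It is plain Python, not the Libre-SOC code, and the names `partial_products_8x8`, `csa` and `wallace_reduce` are invented for this sketch. Pipelining and SIMD partitioning are not modelled.

```python
def partial_products_8x8(a, b, width=64):
    """All 8x8->16-bit partial products of a*b, pre-shifted to their weight."""
    n = width // 8
    return [
        (((a >> (8 * i)) & 0xFF) * ((b >> (8 * j)) & 0xFF)) << (8 * (i + j))
        for i in range(n)
        for j in range(n)
    ]


def csa(x, y, z):
    """3:2 compressor: three addends in, two out, with the same total."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry


def wallace_reduce(terms):
    """Compress the addends layer by layer until only two remain."""
    layers = 0
    while len(terms) > 2:
        full = len(terms) - len(terms) % 3
        reduced = []
        for k in range(0, full, 3):
            reduced.extend(csa(*terms[k:k + 3]))
        reduced.extend(terms[full:])  # leftovers pass through to the next layer
        terms = reduced
        layers += 1
    return sum(terms), layers  # a single carry-propagate add at the end


a, b = 0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF
terms = partial_products_8x8(a, b)
product, layers = wallace_reduce(terms)
print(len(terms), product == a * b)
```

Only the last step needs a full carry-propagate adder; every earlier layer is carry-free, which is why this shape pipelines well.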

11:04 <agg> so not just m.d.comb += p.eq(a * b) for 64-bit a and b?

11:04 <lkcl> no, heck no

11:05 <agg> huh, I'm confused

11:05 <agg> are you talking about two different things?

11:05 <lkcl> how would that be sub-divided at runtime into 2x 32-bit 4x 16-bit 8x 8-bit multiplies

11:05 <lkcl> no, 1 sec

11:05 <agg> the "live" nmigen multiplier is different from the 64x64 you were talking about a minute ago?

11:05 <agg> I mean "11:59:17 lkcl> agg: we currently have a 64x64->128 mul by just "comb += .eq(a*b)" at the moment"

11:06 <lkcl> right, yes, that's just the "test" (prototype) code

11:06 <agg> ah, got you

11:06 <agg> is that the one that's inferring 16 MULT18X18Ds in nextpnr-ecp5?

11:06 <lkcl> the full partitioned pipelined multiplier is... well... a bit slow on simulations :)

11:06 <lkcl> https://git.libre-soc.org/?p=ieee754fpu.git;a=tree;f=src/ieee754/part_mul_add;hb=HEAD

11:07 <lkcl> comb += c.eq(a*b) results in using some of those 16 MULT18X18Ds and creates its own adder tree, yes

11:07 <lkcl> which is all fine and great for prototype testing, but not fine for dynamic partitioned SIMD

11:07 <lkcl> on an ASIC

11:08 <lkcl> hence why we had to do a Wallace Multiplier "by hand"

11:09 <lkcl> here's how you'd do a wallace multiplier explicitly laid out in verilog

11:09 <lkcl> https://github.com/pareddy113/Design-of-various-multiplier-Array-Booth-Wallace-/blob/master/Wallace%20Tree%20Multiplier/Wallace%20Tree%20multiplier.v

11:09 <lkcl> we use python nmigen so it's entirely auto-generated and parameterised :)

11:10 <lkcl> and does it in a modular fashion so that if you inspect even the top level with "show top" in yosys it doesn't melt your machine :)

11:11 <lkcl> ASIC synthesis tools, these days, are supposed to infer a Dadda Multiplier for you

11:11 <lkcl> https://en.wikipedia.org/wiki/Dadda_multiplier

11:11 <lkcl> because it's the most gate-efficient and fastest known in computer science today

11:12 <lkcl> but, hoo-boy: gate-level layout using 2-bit and 3-bit adders? QTY several thousand? :)

11:13 <lkcl> so, twam, are you concerned about meeting fast timing, or about resource utilisation?

11:14 <lkcl> if you want a high-speed implementation, you'll need to implement "long multiplication" - doing a batch of by-radix-digit multiplies in the first stage

11:15 <lkcl> then doing cascading adds (like the Dadda and Wallace) in the subsequent pipeline stages until you get the answer out

11:16 <lkcl> but if you want to keep resources down, you'll need to do a FSM which performs only (say) the one multiply (effectively one digit at a time)

11:16 <lkcl> accumulate a series of digit-by-digit multiplies...

11:16 <lkcl> and *then* go through the same cascade-add
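
The low-resource option can be pictured with a toy cycle-by-cycle model. This is a sketch in plain Python, assuming 72-bit operands and 18-bit digits to match one MULT18X18D; the class `SequentialMultiplier` and the function `multiply` are hypothetical, and a plain accumulator stands in for the final cascade-add.

```python
class SequentialMultiplier:
    """Toy FSM model: one digit-by-digit multiply per step() call."""

    def __init__(self, a, b, width=72, digit=18):
        self.digit = digit
        self.n = width // digit
        mask = (1 << digit) - 1
        self.a = [(a >> (digit * i)) & mask for i in range(self.n)]
        self.b = [(b >> (digit * i)) & mask for i in range(self.n)]
        self.i = 0          # index of the current digit of a
        self.j = 0          # index of the current digit of b
        self.acc = 0        # running sum of shifted digit products
        self.cycles = 0
        self.done = False

    def step(self):
        if self.done:
            return
        # The single shared multiplier is used once per cycle.
        shift = self.digit * (self.i + self.j)
        self.acc += (self.a[self.i] * self.b[self.j]) << shift
        self.cycles += 1
        # Advance the digit counters, as the FSM state would.
        self.j += 1
        if self.j == self.n:
            self.j = 0
            self.i += 1
            if self.i == self.n:
                self.done = True


def multiply(a, b):
    m = SequentialMultiplier(a, b)
    while not m.done:
        m.step()
    return m.acc, m.cycles


a = (1 << 72) - 1
b = 0x123456789ABCDEF012
product, cycles = multiply(a, b)
print(product == a * b, cycles)
```

The trade-off is the usual one: a single multiplier is reused across 16 cycles, where the pipelined version spends 16 multipliers to accept a new operand pair every cycle.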

11:16 <lkcl> leave that with you to review the irc logs

11:40 chipmuenk has joined #nmigen

11:43 phire has quit [*.net *.split]

12:21 phire has joined #nmigen

13:32 Bertl_oO is now known as Bertl

13:35 chipmuenk has quit [Ping timeout: 250 seconds]

13:37 roamingryan has joined #nmigen

14:31 emeb has joined #nmigen

15:06 <d1b2> <twam> Puh, that's a lot of input. 🙂 I had already thought of manually instantiating the MULT18X18Ds in 36-bit mode and then combining several of those for a 72x72, but hoped for an easier solution. yosys already does it with MULT18X18D, but like you said in a big combinatorial way which really slows down my timings.

15:08 <agg> instantiating four MULT18X18Ds plus two ALU54Bs in a DSP slice to get a 36x36 multiplier should be doable but I don't expect anyone's tried it with nextpnr and it might not work or need some work (it should work fine with diamond though)

15:10 <agg> twam: do you need it fully combinatorial or can you put some registers in?

15:10 <d1b2> <twam> I don't need it combinatorial.

15:11 <agg> diamond can generate 72x72 mult verilog using 16 MULT18X18D and 8 ALU54B instances

15:11 <agg> conceivably you could even just copy that verilog into a file and include it in your build

15:11 <agg> you can tell it whether you want registers enabled or not at in/out etc

15:11 <agg> it's like 7kloc verilog, lol

15:11 <agg> and no idea if nextpnr can synthesise it, though I'd be interested to try it

15:12 <agg> but it ends up being a module with two 72-bit inputs and a 144-bit output, so...

15:12 <d1b2> <twam> oO. Never used Diamond so far. I assume there's no support for OS X, so I need to get a VM first 😉

15:12 roamingryan has quit [Ping timeout: 265 seconds]

15:12 <agg> I can just generate the verilog if you want

15:12 <d1b2> <twam> That would be awesome!

15:12 <agg> your choices are input register on/off, pipeline register on/off, output register on/off

15:13 <agg> and it estimates 8 DSP slices (that means 16 mults and 8 alus) and 294 luts

15:14 <d1b2> <twam> Pipeline registers off means fully combinatorial or that I can use it in a pipelined way?

15:14 <agg> well, I was simplifying slightly, there's actually two choices there too, one sec

15:15 <agg> https://imgur.com/a/i2AuDc1

15:15 <agg> you can have "pipelined mode" where you get latency 3 (or 4 with input registers) but presumably can keep shifting stuff through

15:15 <agg> or instead, you can have optional input registers, optional internal pipeline register, and optional output register, which I think you can still keep shifting stuff through

15:15 <agg> I haven't investigated what "pipelined mode" does precisely

15:15 <agg> it uses a bunch of logic FFs though

15:16 <d1b2> <twam> Let's keep it simple and switch it off 🙂

15:16 <agg> yea probably best

15:16 <agg> the other registers are shown in the dsp docs and are pretty straightforward, turn them on for better timing but more latency basically

15:17 <agg> http://agg.io/u/bigmult3.v

15:18 <agg> that's with all three registers on but not pipeline mode, happy to generate other configs if you want

15:18 <d1b2> <twam> Thanks a lot! I'll try to get it running.

15:18 <agg> I recommend testing that nextpnr actually synthesises it into what you expect, a lot of the alu54 support is untested and/or incomplete (i'm slowly working on some of it, but haven't tried this)

15:19 <agg> in principle that ^ should build into the fastest possible 72x72 multiply the ecp5 can do, though

15:19 <d1b2> <twam> That's good to know. I thought that ECP5 is fully supported ^^

15:19 <agg> https://github.com/YosysHQ/prjtrellis#current-status says " Inference and more advanced DSP features are not yet supported."

15:20 <agg> which is not true: inference of 18x18 does work, and i've recently helped add a little support for alus so that you can do multiply-accumulate

15:20 <agg> but yea, it doesn't promise full DSP support yet :p

15:21 <agg> let me know how you get on, anyway, i've got some tooling for building designs with diamond and nextpnr and comparing the dsp bits and stuff like that so could take a look

15:21 <d1b2> <twam> Must have overlooked that 😉 I'll give feedback! Thanks a lot

15:29 bvernoux has quit [Quit: Leaving]

15:34 revolve has quit [Read error: Connection reset by peer]

15:36 revolve has joined #nmigen

15:58 <d1b2> <twam> Is there any documentation or examples of how to use verilog modules in nmigen? It's unclear to me how I can tell it where to find the code for the Instances I use.

16:00 <agg> https://github.com/adamgreig/nmigen-examples/blob/master/nmigen_examples/instance.py

16:00 <agg> basically, platform.add_file(name, contents)

16:00 <agg> and then m.submodules.mult = Instance("mult", i_A=a, i_B=b, o_P=p)

16:02 <d1b2> <twam> Awesome! That platform.add_file was missing.

16:05 roamingryan has joined #nmigen

16:06 roamingryan has quit [Client Quit]

16:15 <d1b2> <twam> Hmm... bigmult3.v seems to use a VLO module which yosys/nextpnr don't know. It looks like it always just provides 0. If I replace it in the verilog, yosys seems fine, but nextpnr crashes: libc++abi: terminating with uncaught exception of type std::out_of_range: unordered_map::at: key not found build_top.sh: line 9: 25980 Abort trap: 6 "$NEXTPNR_ECP5" --quiet --log top.tim --45k --package CABGA381 --speed 8 --json top.json --lpf top.lpf --textcfg top.config

16:18 <agg> I wish unordered_map key not found would tell you what the key was

16:19 <agg> replacing vlo/vhi with 0/1 should be fine

17:42 <mindw0rk> speaking as a complete noob here - is there an nmigen UART<->Wishbone Bridge?

17:46 <d1b2> <twam> Haven't tried it, but I think https://github.com/lambdaconcept/lambdasoc/blob/master/lambdasoc/periph/serial.py should be one

17:56 <d1b2> <twam> @agg It fails at https://github.com/YosysHQ/nextpnr/blob/master/ecp5/pack.cc#L886 and ctx->id("CIN").c_str(ctx) is CIN.

17:57 <d1b2> <twam> (I hope that's the correct way to convert that ctx->id("CIN") index to a human-readable string)

18:04 <agg> Hmm, yea, afaik the carry input is totally untested, could be something missing there, thanks for digging in... if I have time I'll try and look later too

18:07 <lkcl> twam: lol well you were away :)

18:07 <lkcl> agg: " plus two ALU54Bs" - i did wonder why those were zero usage

18:09 <lkcl> fantastic to hear you're working on nextpnr-ecp5

18:10 <lkcl> ahh https://github.com/YosysHQ/nextpnr/blob/master/ecp5/docs/primitives.md

18:10 <lkcl> more yosys i take it?

18:10 <d1b2> <twam> @agg cell->name.c_str(ctx) is U$$0.Cadd_bigmult3_10_1 on failure. Let me know if I can provide more details, ...

18:16 <d1b2> <twam> Looks like the CIN is not set on 3 of the CCU2C. If I manually set those to 0 (Don't know yet if this is a good idea), it continues until Routing where it fails with Warning: Failed to find a route for arc 932 of net $PACKER_GND_NET.

18:19 <agg> Yea, the cin is a fixed input that can only come from a cout I believe, there's no general routing so the router can't connect 0 to it

18:19 <agg> What if you leave it unspecified?

18:19 <agg> Not sure what you connect it to on the edges...

18:23 revolve has quit [Ping timeout: 240 seconds]

18:30 revolve has joined #nmigen

18:35 pftbest has quit [Remote host closed the connection]

18:39 pftbest has joined #nmigen

18:48 <d1b2> <twam> It was unconnected (.CIN()) before, when I got the key error.

18:48 <agg> I mean just remove it entirely, i.e. delete .CIN()

18:49 <d1b2> <twam> I get the key error again 🙂

19:05 <agg> looking at a 36x36, after deleting a bunch of 0s on unused ports it's now failing to route from R output of one ALU into CIN on the next ALU, which probably just means it doesn't know about a fixed route

19:06 <agg> hm, i thought that would use CO but perhaps/apparently not

19:08 <agg> checking the db it does know about at least some of these, though

19:08 <agg> oh, I bet I know what's up

19:12 <d1b2> <dub_dub_11> the DSP units use fabric FFs to pipeline? weird

19:13 <agg> only in "pipeline mode", i don't know what that means precisely

19:14 <agg> normally no, they have internal input, pipeline, and output registers

19:21 djr has joined #nmigen

19:24 <agg> so, the fixed connection failure is because the npnr placer only very recently learnt how to place alu54 at all, and so far only places it together with the two multipliers that make a single slice

19:24 <agg> so it doesn't know that if two alus are connected using R->CIN they need to be next to each other

19:24 <agg> if I manually place all the ALU and MULTs in a compatible row, I'm able to synthesise a 36x36 multiplier

19:24 <agg> dunno if it... works, though

19:30 djr has quit [Quit: Connection closed]

19:33 <d1b2> <twam> Do you have a corresponding 36x36 verilog file? How can I place them manually in nmigen or is this done in the verilog module?

19:36 <agg> you place them by adding an attribute to the instance instantiation, so in this case inside the verilog file

19:36 <agg> one sec, just seeing if i can do the same for a clocked 72x72

19:38 <d1b2> <twam> no hurry, i'll need to go offline soon anyhow, but i'm eager to try it out tomorrow and test on hw if calculation is correct

19:39 <agg> ok, i synthesised the 72x72 using diamond through nmigen and had it spit out the result and it's definitely doing a 72x72->144 multiplication, good start

19:40 <agg> now to get npnr to reproduce it

19:40 <agg> at least i can test it on hardware

19:40 <agg> cheating and using diamond to do the placement and copying its placement output into the verilog, lol

19:40 <agg> there's a lot of instances @_@

19:44 <agg> it reckons 142MHz incidentally, on this speed grade 6 device

19:44 <d1b2> <twam> That sounds promising 😉

19:47 XgF has quit [Remote host closed the connection]

19:49 XgF has joined #nmigen

20:00 <agg> sweet, works

20:00 <agg> i have nmigen+nextpnr doing 72x72->144, the only hack required was manual placement of the mults and alus

20:00 <agg> well

20:00 <agg> probably anyway

20:00 <agg> my local trellis database has one or two other hacks that i don't think will be relevant :/

20:01 <agg> however i have synthesised for a cabga256 device which has different dsp locations to your 381

20:01 <agg> https://agg.io/u/mult72.v

20:02 <agg> i butchered the generated verilog: removed the unused inputs, vhi and vlo to 1 and 0, and added BEL attributes to get npnr to place

20:02 <agg> use the relevant tilemap for your device http://yosyshq.net/prjtrellis-db/ECP5/LFE5U-45F/index.html to work out the right locations, uh...

20:02 <agg> maybe you can get away with exactly the same or just changing the row, actually it depends on the 25/45/85F not the package

20:03 <agg> oh, I see you're using a 45F too, so it should actually work

20:05 phire has quit [Quit: ZNC - http://znc.in]

20:06 <agg> if it doesn't just work you might need to apply https://github.com/adamgreig/prjtrellis-db/commit/50364318029a1f2a9a81b15e4f39a0e983024eef which I think is the only significant/relevant change I haven't upstreamed yet

20:07 <agg> nextpnr has a much more pessimistic 90MHz for this design, and it doesn't even know how to time ALUs yet, lol

20:08 phire has joined #nmigen

20:17 <cr1901_modern> Use the multipnr :P?

20:20 <agg> m...multipnr?

20:20 pftbest has quit [Remote host closed the connection]

20:21 pftbest has joined #nmigen

20:35 <lkcl> whitequark: i have a crazed, completely mad off-the-wall idea

20:35 <lkcl> we're using nmigen in an OpenPOWER ISA instruction decoder...

20:36 <lkcl> ... but that's then used in a (very slow) python-based OpenPOWER simulator

20:36 <lkcl> wins no speed contests at all, but it's functional, and we have to do the exact same "yield" tricks

20:37 <lkcl> it occurred to me a couple days ago: what if we used the Liskov Substitution Principle to create a *non-hardware* version of nmigen, via an abstraction API?

20:38 <lkcl> we could then substitute - through a class Factory - an alternative... "thing" with the exact same API as nmigen

20:38 <lkcl> but instead of doing HDL AST, it spat out... ooo... i dunno... c++ source code?

20:38 <lkcl> :)

20:39 <lkcl> or SAIL formal correctness proofs
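
One way to picture the substitution idea is a toy with two interchangeable backends behind the same small API. This is purely a sketch: none of these classes or methods exist in nmigen, and `EvalBackend`, `CppBackend` and `half_adder` are hypothetical names chosen for illustration. The same circuit description either computes a value or emits C++ expression text, depending on which backend is passed in.

```python
class EvalBackend:
    """Hypothetical backend that evaluates directly on Python integers."""

    def __init__(self, env):
        self.env = env

    def signal(self, name):
        return self.env[name]

    def and_(self, x, y):
        return x & y

    def xor(self, x, y):
        return x ^ y


class CppBackend:
    """Hypothetical backend with the same API that emits C++ expressions."""

    def signal(self, name):
        return name

    def and_(self, x, y):
        return f"({x} & {y})"

    def xor(self, x, y):
        return f"({x} ^ {y})"


def half_adder(backend):
    """Combinatorial description written once against the shared API."""
    a = backend.signal("a")
    b = backend.signal("b")
    return backend.xor(a, b), backend.and_(a, b)


print(half_adder(EvalBackend({"a": 1, "b": 1})))
print(half_adder(CppBackend()))
```

Because `half_adder` only touches the shared methods, either backend can be swapped in without changing the description, which is the substitution property being appealed to.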

20:45 <lkcl> i thought you might appreciate that one

20:45 thorns514 has joined #nmigen

20:46 <lkcl> whitequark: also, we're fairly close to having a working RADIX MMU, about 2-3 weeks at a guess. means being able to run Linux, just like Microwatt

20:47 <lkcl> we're working closely with the OpenPOWER Foundation to make sure they're properly apprised of the ISA Augmentations needed for the 3D GPU and VPU aspects

20:48 <lkcl> this is absolutely critical, to properly submit instruction extensions to the new ISA WG, because IBM's reputation - and google's, and Raptor's, etc. all critically depend on OpenPOWER remaining rock-stable

20:49 <lkcl> if we proceeded without going through the proper channels we'd get sued to the bedrock, basically

21:09 Lord_Nightmare has quit [Quit: ZNC - http://znc.in]

21:28 <d1b2> <bob_twinkles> isn't cxxrtl basically what you're suggesting (at least re c++ source code)

21:30 thorns514 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

22:04 revolve has quit [Read error: Connection reset by peer]

22:09 revolve has joined #nmigen

22:15 Lord_Nightmare has joined #nmigen

23:22 <lkcl> bob_twinkles: not quite - this is to actually generate an actual c++ *program*

23:23 <lkcl> with very limited use-cases: i'd likely start off exclusively with combinatorial logic

23:25 <lkcl> the primary objective is to be able to use the OpenPOWER ISA decoder in Libre-SOC, in c / c++ applications