whitequark[m] changed the topic of #nmigen to: nMigen hardware description language · code https://github.com/nmigen · logs https://freenode.irclog.whitequark.org/nmigen
lf has quit [Ping timeout: 250 seconds]
lf has joined #nmigen
<_whitenotifier-1> [YoWASP/nextpnr] whitequark pushed 1 commit to develop [+0/-0/±1] https://git.io/JsLIY
<_whitenotifier-1> [YoWASP/nextpnr] whitequark b8708c2 - Update dependencies.
pftbest has quit [Remote host closed the connection]
pftbest has joined #nmigen
<_whitenotifier-1> [YoWASP/yosys] whitequark pushed 1 commit to develop [+0/-0/±1] https://git.io/JsLYn
<_whitenotifier-1> [YoWASP/yosys] whitequark 2c32c5c - Update dependencies.
revolve has quit [Ping timeout: 265 seconds]
revolve has joined #nmigen
Degi_ has joined #nmigen
Degi has quit [Ping timeout: 265 seconds]
Degi_ is now known as Degi
roamingryan has joined #nmigen
roamingryan has quit [Ping timeout: 260 seconds]
modwizcode has quit [Ping timeout: 252 seconds]
modwizcode has joined #nmigen
revolve has quit [Read error: Connection reset by peer]
revolve has joined #nmigen
bvernoux has joined #nmigen
falteckz has joined #nmigen
<lkcl> twam: yosys should - does - break it down into appropriate (massive) smaller multiplies plus a cascade of adds
<lkcl> all of that however will be combinatorial. so, you think you're doing a single "sync", actually a massive chain of adds/muls is put behind it.
<agg> yosys should/can infer the 18x18 hardware multiplies, but i don't know if it can infer those as part of a wider multiplication
<lkcl> you can check for yourself by writing a *very* simple program with a single multiply, run yosys "synth_ecp5" then "show top"
<agg> afaik it can't infer or use 36x36 hardware multiplies, those are a special configuration of four 18x18 in a dsp slice
<lkcl> agg: interesting
<agg> as in, if you write "a = Signal(18); b = Signal(18); c = Signal(36); m.d.comb += c.eq(a * b)" and target ecp5, you should find it uses one MULT18X18D and is fast
<lkcl> i was going to suggest to twam to break it down explicitly, into pipelined long-multiplication
<lkcl> agg: ha, nice example.
<agg> if you instantiate the MULT18X18D yourself and figure it out, you can have it do 36x36 multiplication using four of them, but i don't think it knows how to do that itself
<lkcl> agg: we currently have a 64x64->128 mul by just "comb += .eq(a*b)" at the moment
<lkcl> and yes, it produces correct results
<lkcl> but
<agg> bet that's fun for timing
<lkcl> it's one of ... lol yes
<agg> are you targeting ecp5 though? mapping to hardware dsp units will often be quite hardware specific
<lkcl> it's massive, and one of the critical paths @ 50 mhz on the ECP5
<agg> ah, hm
<agg> so it doesn't infer any MULT18X18Ds in the usage report when you have a 64x64?
<lkcl> no, we're using the ECP5 as "just a platform on the way to ASIC"
<lkcl> 1 sec
<agg> if you can write it as a bunch of 18x18 (or 16x16) yourself, it might manage to infer correctly
<lkcl> Info: MULT18X18D: 16/ 72 22%
<agg> huh
<agg> well, it's using them
<agg> for something...
<agg> 16 is the right number for a single 64x64 multiply using 16x16 building blocks, too, right?
<agg> four 16x16 to get a 32x32, four of those to get a 64x64
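The count agg gives can be checked with a small pure-Python model. This is only an arithmetic sketch, not what yosys emits: the function name `limb_multiply` and its parameters are made up for illustration. It splits both operands into 16-bit limbs, forms every limb-by-limb product, and sums them at the right bit offsets.

```python
def limb_multiply(a, b, width=64, limb=16):
    """Multiply two unsigned `width`-bit integers using only limb x limb products.

    Returns (product, number_of_partial_products).
    """
    n = width // limb
    mask = (1 << limb) - 1
    a_limbs = [(a >> (i * limb)) & mask for i in range(n)]
    b_limbs = [(b >> (j * limb)) & mask for j in range(n)]

    partials = []
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            # Each small product is weighted by the positions of both limbs.
            partials.append((ai * bj) << ((i + j) * limb))

    return sum(partials), len(partials)


a = 0xFEDCBA9876543210
b = 0x0123456789ABCDEF
product, count = limb_multiply(a, b)
print(hex(product), count)
```

With 16-bit limbs a 64-bit operand has four limbs, so there are 4 x 4 small products, which matches the 16 multipliers in the utilisation report. The additions are the part that ends up in fabric logic.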
<lkcl> well, you'll like this: our "live" nmigen multiplier is SIMD-partitionable
<agg> it won't be using the ALU in the DSP slice, though, so the addition is happening in logic
<lkcl> (designed for ASIC primarily)
<agg> it could probably be faster if it could, but maybe that's not worth the effort when ecp5 is just a prototype platform
<lkcl> so we do it like you would with "long multiplication", and use a Dadda Multiplier algorithm
<lkcl> wait, sorry - a Wallace
<lkcl> so we actually have QTY 64 of 8x8->16-bit multiplies that are then cascade-added in a multi-stage pipeline
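A rough model of the cascade-add idea, assuming word-level carry-save (3:2) compression rather than the bit-level layout a real Wallace tree uses. It is not the Libre-SOC code; `carry_save_add` and `wallace_multiply` are hypothetical names. Each pass of the loop stands in for one pipeline stage: groups of three rows are compressed to two until only two rows remain, and those go through one ordinary add.

```python
def carry_save_add(x, y, z):
    """3:2 compressor on whole words: x + y + z == s + c, with no carry chain."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def wallace_multiply(a, b, width=64, limb=8):
    """Return (product, partial_product_count, reduction_stage_count)."""
    n = width // limb
    mask = (1 << limb) - 1
    # First stage: all limb x limb products, already shifted into place.
    rows = [
        (((a >> (i * limb)) & mask) * ((b >> (j * limb)) & mask)) << ((i + j) * limb)
        for i in range(n)
        for j in range(n)
    ]
    n_partials = len(rows)

    stages = 0
    while len(rows) > 2:
        reduced = []
        # Compress each full group of three rows into two rows.
        for k in range(0, len(rows) - 2, 3):
            reduced.extend(carry_save_add(rows[k], rows[k + 1], rows[k + 2]))
        # Rows that did not fit into a group of three pass through unchanged.
        reduced.extend(rows[len(rows) - len(rows) % 3:])
        rows = reduced
        stages += 1

    # Final stage: a single carry-propagating add of the last rows.
    return sum(rows), n_partials, stages


a = 0xFEDCBA9876543210
b = 0x0123456789ABCDEF
product, n_partials, stages = wallace_multiply(a, b)
print(hex(product), n_partials, stages)
```

The point of the carry-save step is that no stage has to wait for a carry to ripple across the full word, which is what keeps each pipeline stage short.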
<agg> so not just m.d.comb += p.eq(a * b) for 64-bit a and b?
<lkcl> no, heck no
<agg> huh, I'm confused
<agg> are you talking about two different things?
<lkcl> how would that be sub-divided at runtime into 2x 32-bit, 4x 16-bit, or 8x 8-bit multiplies?
<lkcl> no, 1 sec
<agg> the "live" nmigen multiplier is different from the 64x64 you were talking about a minute ago?
<agg> I mean "11:59:17 lkcl> agg: we currently have a 64x64->128 mul by just "comb += .eq(a*b)" at the moment"
<lkcl> right, yes, that's just the "test" (prototype) code
<agg> ah, got you
<agg> is that the one that's inferring 16 MULT18X18Ds in nextpnr-ecp5?
<lkcl> the full partitioned pipelined multiplier is... well... a bit slow on simulations :)
<lkcl> comb += c.eq(a*b) results in using some of those 16 MULT18x18Ds and creates its own adder tree, yes
<lkcl> which is all fine and great for prototype testing, but not fine for dynamic partitioned SIMD
<lkcl> on an ASIC
<lkcl> hence why we had to do a Wallace Multiplier "by hand"
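One way to picture why a plain a*b cannot be partitioned at runtime while a grid of 8x8 products can: in the sketch below, a partial product is kept only when both of its byte lanes fall in the same partition, and is gated off otherwise. This is a guess at the general principle, not the Libre-SOC implementation; the function names and the output layout (each lane's result in a slot twice the lane width) are assumptions.

```python
def lanes(x, lane_bits, count):
    """Split x into `count` fields of `lane_bits` bits, least significant first."""
    mask = (1 << lane_bits) - 1
    return [(x >> (k * lane_bits)) & mask for k in range(count)]


def partitioned_multiply(a, b, lane_bits, width=64, limb=8):
    """SIMD multiply of `width // lane_bits` independent lanes.

    The same grid of limb x limb products serves every lane width; only the
    gating condition changes with `lane_bits`.
    """
    n = width // limb
    limbs_per_lane = lane_bits // limb
    mask = (1 << limb) - 1
    total = 0
    for i in range(n):
        for j in range(n):
            if i // limbs_per_lane != j // limbs_per_lane:
                continue  # cross-lane product: gated off in this mode
            ai = (a >> (i * limb)) & mask
            bj = (b >> (j * limb)) & mask
            total += (ai * bj) << ((i + j) * limb)
    return total


a = 0x0123456789ABCDEF
b = 0xFEDCBA9876543210
out = partitioned_multiply(a, b, lane_bits=16)
print([hex(v) for v in lanes(out, 32, 4)])
```

With `lane_bits=64` nothing is gated and the result is the ordinary 128-bit product; with smaller lanes each 2x-wide output slot holds one independent product.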
<lkcl> here's how you'd do a wallace multiplier explicitly laid out in verilog
<lkcl> we use python nmigen so it's entirely auto-generated and parameterised :)
<lkcl> and does it in a modular fashion so that if you inspect even the top level with "show top" in yosys it doesn't melt your machine :)
<lkcl> ASIC synthesis tools, these days, are supposed to infer a Dadda Multiplier for you
<lkcl> because it's the most gate-efficient and fastest known in computer science today
<lkcl> but, hoo-boy: gate-level layout using 2-bit and 3-bit adders? QTY several thousand? :)
<lkcl> so, twam, are you concerned about meeting fast timing, or about resource utilisation?
<lkcl> if you want a high-speed implementation, you'll need to implement "long multiplication" - doing a batch of by-radix-digit multiplies in the first stage
<lkcl> then doing cascading adds (like the Dadda and Wallace) in the subsequent pipeline stages until you get the answer out
<lkcl> but if you want to keep resources down, you'll need to do a FSM which performs only (say) the one multiply (effectively one digit at a time)
<lkcl> accumulate a series of digit-by-digit multiplies...
<lkcl> and *then* go through the same cascade-add
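The low-resource option can be sketched as a tiny state machine in plain Python, with one call to `tick()` standing in for one clock cycle. The class name, state names and limb width are illustrative assumptions. For brevity this model adds each digit product into the accumulator in the same cycle, rather than storing the products and running a separate cascade-add afterwards.

```python
class SequentialMultiplier:
    """One shared limb x limb multiplier, reused once per clock tick."""

    def __init__(self, width=64, limb=16):
        self.limb = limb
        self.n = width // limb
        self.mask = (1 << limb) - 1
        self.state = "IDLE"
        self.acc = 0

    def start(self, a, b):
        self.a, self.b = a, b
        self.i = self.j = 0
        self.acc = 0
        self.state = "MULTIPLY"

    def tick(self):
        if self.state != "MULTIPLY":
            return
        ai = (self.a >> (self.i * self.limb)) & self.mask
        bj = (self.b >> (self.j * self.limb)) & self.mask
        # The only multiply performed in this cycle.
        self.acc += (ai * bj) << ((self.i + self.j) * self.limb)
        # Advance the digit counters; finish after the last pair.
        self.j += 1
        if self.j == self.n:
            self.j = 0
            self.i += 1
            if self.i == self.n:
                self.state = "DONE"


def run(a, b):
    m = SequentialMultiplier()
    m.start(a, b)
    ticks = 0
    while m.state != "DONE":
        m.tick()
        ticks += 1
    return m.acc, ticks


product, ticks = run(0xFEDCBA9876543210, 0x0123456789ABCDEF)
print(hex(product), ticks)
```

The trade is visible in the tick count: one multiplier instead of sixteen, but sixteen cycles per result instead of one result per cycle.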
<lkcl> leave that with you to review the irc logs
chipmuenk has joined #nmigen
phire has quit [*.net *.split]
phire has joined #nmigen
Bertl_oO is now known as Bertl
chipmuenk has quit [Ping timeout: 250 seconds]
roamingryan has joined #nmigen
emeb has joined #nmigen
<d1b2> <twam> Phew, that's a lot of input. 🙂 I had already thought of instantiating the MULT18X18D in 36-bit mode manually and then combining several of those for a 72x72, but hoped for an easier solution. yosys already does it with MULT18X18D, but like you said in a big combinatorial way, which really slows down my timings.
<agg> instantiating four MULT18X18Ds plus two ALU54Bs in a DSP slice to get a 36x36 multiplier should be doable but I don't expect anyone's tried it with nextpnr and it might not work or need some work (it should work fine with diamond though)
<agg> twam: do you need it fully combinatorial or can you put some registers in?
<d1b2> <twam> I don't need it combinatorial.
<agg> diamond can generate 72x72 mult verilog using 16 MULT18X18D and 8 ALU54B instances
<agg> conceivably you could even just copy that verilog into a file and include it in your build
<agg> you can tell it whether you want registers enabled or not at in/out etc
<agg> it's like 7kloc verilog, lol
<agg> and no idea if nextpnr can synthesise it, though I'd be interested to try it
<agg> but it ends up being a module with two 72-bit inputs and a 144-bit output, so...
<d1b2> <twam> oO. Never used Diamond so far. I assume there's no support for OS X, so I need to get a VM first 😉
roamingryan has quit [Ping timeout: 265 seconds]
<agg> I can just generate the verilog if you want
<d1b2> <twam> That would be awesome!
<agg> your choices are input register on/off, pipeline register on/off, output register on/off
<agg> and it estimates 8 DSP slices (that means 16 mults and 8 alus) and 294 luts
<d1b2> <twam> Pipeline registers off means fully combinatorial or that I can use it in a pipelined way?
<agg> well, I was simplifying slightly, there's actually two choices there too, one sec
<agg> you can have "pipelined mode" where you get latency 3 (or 4 with input registers) but presumably can keep shifting stuff through
<agg> or instead, you can have optional input registers, optional internal pipeline register, and optional output register, which I think you can still keep shifting stuff through
<agg> I haven't investigated what "pipelined mode" does precisely
<agg> it uses a bunch of logic FFs though
<d1b2> <twam> Let's keep it simple and switch it off 🙂
<agg> yea probably best
<agg> the other registers are shown in the dsp docs and are pretty straightforward, turn them on for better timing but more latency basically
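A minimal behavioural model of the register options, assuming each enabled register (input, pipeline, output) adds exactly one cycle of latency and that a new pair of operands can be presented every cycle. It says nothing about how the generated Verilog is actually structured, and `RegisteredMultiplier` is a hypothetical name.

```python
from collections import deque


class RegisteredMultiplier:
    """Multiplier whose result is delayed by one cycle per enabled register."""

    def __init__(self, input_reg=True, pipeline_reg=True, output_reg=True):
        self.latency = sum([input_reg, pipeline_reg, output_reg])
        # One slot per register; None marks a stage that holds no result yet.
        self.stages = deque([None] * self.latency)

    def tick(self, a, b):
        """Clock in one operand pair and return whatever leaves the pipeline."""
        self.stages.append(a * b)
        return self.stages.popleft()


mult = RegisteredMultiplier()
outputs = [mult.tick(k, 3) for k in range(1, 7)]
print(mult.latency, outputs)
```

With all three registers on, the first result appears three ticks after its operands went in, and after that one result comes out per tick. With all three off, `tick()` returns the product immediately, which is the fully combinatorial case.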
<agg> that's with all three registers on but not pipeline mode, happy to generate other configs if you want
<d1b2> <twam> Thanks a lot! I'll try to get it running.
<agg> I recommend testing that nextpnr actually synthesises it into what you expect, a lot of the alu54 support is untested and/or incomplete (i'm slowly working on some of it, but haven't tried this)
<agg> in principle that ^ should build into the fastest possible 72x72 multiply the ecp5 can do, though
<d1b2> <twam> That's good to know. I thought that ECP5 is fully supported ^^
<agg> https://github.com/YosysHQ/prjtrellis#current-status says " Inference and more advanced DSP features are not yet supported."
<agg> which is not true: inference of 18x18 does work, and i've recently helped add a little support for alus so that you can do multiply-accumulate
<agg> but yea, it doesn't promise full DSP support yet :p
<agg> let me know how you get on, anyway, i've got some tooling for building designs with diamond and nextpnr and comparing the dsp bits and stuff like that so could take a look
<d1b2> <twam> Must have overlooked that 😉 I'll give feedback! Thanks a lot
bvernoux has quit [Quit: Leaving]
revolve has quit [Read error: Connection reset by peer]
revolve has joined #nmigen
<d1b2> <twam> Is there any documentation or examples of how to use verilog modules in nmigen? It's unclear to me how I can tell it where to find the code for the Instances I use.
<agg> basically, platform.add_file(name, contents)
<agg> and then m.submodules.mult = Instance("mult", i_A=a, i_B=b, o_P=p)
<d1b2> <twam> Awesome! That platform.add_file was missing.
roamingryan has joined #nmigen
roamingryan has quit [Client Quit]
<d1b2> <twam> Hmm... bigmult3.v seems to use a VLO module which yosys/nextpnr don't know. It looks like it always just provides 0. If I replace this in the verilog yosys seems fine, but nextpnr crashes: libc++abi: terminating with uncaught exception of type std::out_of_range: unordered_map::at: key not found build_top.sh: line 9: 25980 Abort trap: 6 "$NEXTPNR_ECP5" --quiet --log top.tim --45k --package CABGA381 --speed 8 --json top.json --lpf top.lpf --textcfg top.config
<agg> I wish unordered_map key not found would tell you what the key was
<agg> replacing vlo/vhi with 0/1 should be fine
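The replacement could be scripted rather than done by hand. The sketch below assumes the generated file ties constants with instances of the form `VLO name (.Z(net));` and `VHI name (.Z(net));`; the instance and net names in `sample` are hypothetical, and the real file should be checked before trusting the pattern. It rewrites each such instance into a plain `assign` of 0 or 1.

```python
import re

# Assumed shape of a tie-cell instance: CELL inst_name (.Z(net_name));
_TIE_CELL = re.compile(r"\b(VLO|VHI)\s+\w+\s*\(\s*\.Z\s*\(\s*(\w+)\s*\)\s*\)\s*;")


def replace_tie_cells(verilog):
    """Replace VLO/VHI instances with constant assigns to the same net."""

    def to_assign(match):
        value = "1'b0" if match.group(1) == "VLO" else "1'b1"
        return f"assign {match.group(2)} = {value};"

    return _TIE_CELL.sub(to_assign, verilog)


# Hypothetical snippet, only to exercise the function.
sample = """
    VLO scuba_vlo_inst (.Z(scuba_vlo));
    VHI scuba_vhi_inst (.Z(scuba_vhi));
"""
print(replace_tie_cells(sample))
```

Anything that does not match the assumed shape is left untouched, so a quick grep for remaining `VLO` and `VHI` afterwards shows whether the pattern covered the whole file.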
<mindw0rk> speaking as a complete noob here - is there an nmigen UART<->Wishbone Bridge?
<d1b2> <twam> Haven't tried it, but I think https://github.com/lambdaconcept/lambdasoc/blob/master/lambdasoc/periph/serial.py should be one
<d1b2> <twam> @agg It fails at https://github.com/YosysHQ/nextpnr/blob/master/ecp5/pack.cc#L886 and ctx->id("CIN").c_str(ctx) is CIN.
<d1b2> <twam> (I hope that's the correct way to convert that ctx->id("CIN") index to human-readable string)
<agg> Hmm, yea, afaik the carry input is totally untested, could be something missing there, thanks for digging in... if I have time I'll try and look later too
<lkcl> twam: lol well you were away :)
<lkcl> agg: " plus two ALU54Bs" - i did wonder why those were zero usage
<lkcl> fantastic to hear you're working on nextpnr-ecp5
<lkcl> more yosys i take it?
<d1b2> <twam> @agg cell->name.c_str(ctx) is U$$0.Cadd_bigmult3_10_1 on failure. Let me know if I can provide more details, ...
<d1b2> <twam> Looks like the CIN is not set on 3 of the CCU2C. If I manually set those to 0 (Don't know yet if this is a good idea), it continues until Routing where it fails with Warning: Failed to find a route for arc 932 of net $PACKER_GND_NET.
<agg> Yea, the cin is a fixed input that can only come from a cout I believe, there's no general routing so the router can't connect 0 to it
<agg> What if you leave it unspecified?
<agg> Not sure what you connect it to on the edges...
revolve has quit [Ping timeout: 240 seconds]
revolve has joined #nmigen
pftbest has quit [Remote host closed the connection]
pftbest has joined #nmigen
<d1b2> <twam> It was unconnected (.CIN()) before, when I got the key error.
<agg> I mean just remove it entirely, i.e. delete .CIN()
<d1b2> <twam> I get the key error again 🙂
<agg> looking at a 36x36, after deleting a bunch of 0s on unused ports it's now failing to route from R output of one ALU into CIN on the next ALU, which probably just means it doesn't know about a fixed route
<agg> hm, i thought that would use CO but perhaps/apparently not
<agg> checking the db it does know about at least some of these, though
<agg> oh, I bet I know what's up
<d1b2> <dub_dub_11> the DSP units use fabric FFs to pipeline? weird
<agg> only in "pipeline mode", i don't know what that means precisely
<agg> normally no, they have internal input, pipeline, and output registers
djr has joined #nmigen
<agg> so, the fixed connection failure is because the npnr placer only very recently learnt how to place alu54 at all, and so far only places it together with the two multipliers that make a single slice
<agg> so it doesn't know that if two alus are connected using R->CIN they need to be next to each other
<agg> if I manually place all the ALU and MULTs in a compatible row, I'm able to synthesise a 36x36 multiplier
<agg> dunno if it... works, though
djr has quit [Quit: Connection closed]
<d1b2> <twam> Do you have a corresponding 36x36 verilog file? How can I place them manually in nmigen or is this done in the verilog module?
<agg> you place them by adding an attribute to the instance instantiation, so in this case inside the verilog file
<agg> one sec, just seeing if i can do the same for a clocked 72x72
<d1b2> <twam> no hurry, i'll need to go offline soon anyhow, but i'm eager to try it out tomorrow and test on hw if calculation is correct
<agg> ok, i synthesised the 72x72 using diamond through nmigen and had it spit out the result, and it's definitely doing a 72x72->144 multiplication, good start
<agg> now to get npnr to reproduce it
<agg> at least i can test it on hardware
<agg> cheating and using diamond to do the placement and copying its placement output into the verilog, lol
<agg> there's a lot of instances @_@
<agg> it reckons 142MHz incidentally, on this speed grade 6 device
<d1b2> <twam> That sounds promising 😉
XgF has quit [Remote host closed the connection]
XgF has joined #nmigen
<agg> sweet, works
<agg> i have nmigen+nextpnr doing 72x72->144, the only hack required was manual placement of the mults and alus
<agg> well
<agg> probably anyway
<agg> my local trellis database has one or two other hacks that i don't think will be relevant :/
<agg> however i have synthesised for a cabga256 device which has different dsp locations to your 381
<agg> i butchered the generated verilog: removed the unused inputs, vhi and vlo to 1 and 0, and added BEL attributes to get npnr to place
<agg> use the relevant tilemap for your device http://yosyshq.net/prjtrellis-db/ECP5/LFE5U-45F/index.html to work out the right locations, uh...
<agg> maybe you can get away with exactly the same or just changing the row, actually it depends on the 25/45/85F not the package
<agg> oh, I see you're using a 45F too, so it should actually work
phire has quit [Quit: ZNC - http://znc.in]
<agg> if it doesn't just work you might need to apply https://github.com/adamgreig/prjtrellis-db/commit/50364318029a1f2a9a81b15e4f39a0e983024eef which I think is the only significant/relevant change I haven't upstreamed yet
<agg> nextpnr has a much more pessimistic 90MHz for this design, and it doesn't even know how to time ALUs yet, lol
phire has joined #nmigen
<cr1901_modern> Use the multipnr :P?
<agg> m...multipnr?
pftbest has quit [Remote host closed the connection]
pftbest has joined #nmigen
<lkcl> whitequark: i have a crazed, completely mad off-the-wall idea
<lkcl> we're using nmigen in an OpenPOWER ISA instruction decoder...
<lkcl> ... but that's then used in a (very slow) python-based OpenPOWER simulator
<lkcl> wins no speed contests at all, but it's functional, and we have to do the exact same "yield" tricks
<lkcl> it occurred to me a couple days ago: what if we used the Liskov Substitution Principle to create a *non-hardware* version of nmigen, via an abstraction API?
<lkcl> we could then substitute - through a class Factory - an alternative... "thing" with the exact same API as nmigen
<lkcl> but instead of doing HDL AST, it spat out... ooo... i dunno... c++ source code?
<lkcl> :)
<lkcl> or SAIL formal correctness proofs
<lkcl> i thought you might appreciate that one
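A toy illustration of the substitution idea, under the assumption that the decoder logic is written once against a small builder interface and a factory picks the backend. None of this is nmigen's API: `PythonBackend`, `CBackend`, `make_backend` and the method names are invented for the sketch, and it covers purely combinatorial expressions only. One backend evaluates directly on Python integers; the other emits a C expression as text.

```python
class PythonBackend:
    """Evaluates expressions immediately on Python integers."""

    def input(self, name, value=0):
        return value

    def bits(self, x, lo, width):
        return (x >> lo) & ((1 << width) - 1)

    def eq(self, x, const):
        return int(x == const)


class CBackend:
    """Builds the same expressions as C source text instead of values."""

    def input(self, name, value=0):
        return name

    def bits(self, x, lo, width):
        return f"(({x} >> {lo}) & {hex((1 << width) - 1)})"

    def eq(self, x, const):
        return f"({x} == {const})"


BACKENDS = {"python": PythonBackend, "c": CBackend}


def make_backend(kind):
    """Factory: callers never name a concrete backend class."""
    return BACKENDS[kind]()


def is_major_opcode(b, insn, opcode):
    """Shared 'decoder' logic: do the top 6 bits of a 32-bit word equal opcode?"""
    return b.eq(b.bits(insn, 26, 6), opcode)


py = make_backend("python")
c = make_backend("c")
print(is_major_opcode(py, py.input("insn", 0x38600001), 14))
print(is_major_opcode(c, c.input("insn"), 14))
```

Because `is_major_opcode` only touches the shared interface, either backend can stand in for the other without the decoder code changing, which is the Liskov-style substitution being described.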
thorns514 has joined #nmigen
<lkcl> whitequark: also, we're fairly close to having a working RADIX MMU, about 2-3 weeks at a guess. means being able to run Linux, just like Microwatt
<lkcl> we're working closely with the OpenPOWER Foundation to make sure they're properly apprised of the ISA Augmentations needed for the 3D GPU and VPU aspects
<lkcl> this is absolutely critical, to properly submit instruction extensions to the new ISA WG, because IBM's reputation - and google's, and Raptor's, etc. all critically depend on OpenPOWER remaining rock-stable
<lkcl> if we proceeded without going through the proper channels we'd get sued to the bedrock, basically
Lord_Nightmare has quit [Quit: ZNC - http://znc.in]
<d1b2> <bob_twinkles> isn't cxxrtl basically what you're suggesting (at least re c++ source code)
thorns514 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
revolve has quit [Read error: Connection reset by peer]
revolve has joined #nmigen
Lord_Nightmare has joined #nmigen
<lkcl> bob_twinkles: not quite - this is to actually generate an actual c++ *program*
<lkcl> with very limited use-cases: i'd likely start off exclusively with combinatorial logic
<lkcl> the primary objective is to be able to use the OpenPOWER ISA decoder in Libre-SOC, in c / c++ applications