Lofty changed the topic of #prjmistral to: Project Mistral: Yosys (and hopefully nextpnr) on Cyclone FPGAs - https://github.com/ZirconiumX/mistral - logs: https://freenode.irclog.whitequark.org/prjmistral
mwk has quit [Ping timeout: 244 seconds]
fdalleau` has joined #prjmistral
mwk has joined #prjmistral
fdalleau` has quit [Quit: It is time for you to leave]
<Sarayan> oh that sucks
<Sarayan> I currently have the route nodes as 8.8.8.8 type/x/y/z
<Sarayan> there are up to 4 io pads associated to one node
<Sarayan> there are *97* route nodes of the same type (IO_RE) associated to each io pad
<Lofty> Sarayan: so it needs to be extended further?
<Sarayan> Lofty: either that (I can reduce the number of bits on type/x/y) or pre-classify the IO_RE, or even go 64 bits
<Lofty> Well, x/y is, what, 64 max?
<Sarayan> hmmm, I'm close to getting somewhere w.r.t the timings for yosys amusingly
<Lofty> Maybe it's higher for bigger dies, but
<Sarayan> sx120f is 90x82
<Lofty> So, uh, you can save two whole bits in x/y coord
<Sarayan> yup
<Sarayan> which is, in fact, a lot
<Lofty> That's 1024 subtypes
<Sarayan> that or 1024 z values
<Sarayan> both are possible
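A minimal Python sketch of the repacking discussed above, assuming a hypothetical 8.7.7.10 split (type/x/y/z) of a 32-bit route node. The field widths and function names are illustrative only, not Mistral's actual encoding. Seven bits each for x and y cover the 90x82 grid of the sx120f, and the two bits freed from each coordinate widen z to 10 bits, i.e. 1024 values.

```python
# Hypothetical field widths: 8 + 7 + 7 + 10 = 32 bits.
TYPE_BITS, X_BITS, Y_BITS, Z_BITS = 8, 7, 7, 10


def pack_rnode(rtype, x, y, z):
    """Pack type/x/y/z into one 32-bit integer, rejecting out-of-range fields."""
    fields = (("type", rtype, TYPE_BITS), ("x", x, X_BITS),
              ("y", y, Y_BITS), ("z", z, Z_BITS))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return ((rtype << (X_BITS + Y_BITS + Z_BITS))
            | (x << (Y_BITS + Z_BITS))
            | (y << Z_BITS)
            | z)


def unpack_rnode(node):
    """Inverse of pack_rnode: return (type, x, y, z)."""
    z = node & ((1 << Z_BITS) - 1)
    y = (node >> Z_BITS) & ((1 << Y_BITS) - 1)
    x = (node >> (Y_BITS + Z_BITS)) & ((1 << X_BITS) - 1)
    rtype = node >> (X_BITS + Y_BITS + Z_BITS)
    return rtype, x, y, z
```

If the 97 IO_RE nodes per pad and up to 4 pads per position are read as needing 4 x 97 = 388 distinct z indices (an interpretation of the messages above, not a confirmed fact), that exceeds 8 bits but fits comfortably in 10.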
<Lofty> inb4 we get access to the commercial dies and they're huge
<Sarayan> commercial dies?
<Lofty> Commercially-licensed dies
<Lofty> AKA the ones you have to pay Intel to get Quartus for
<Sarayan> there are massively many more LABs & co?
<Lofty> So, Cyclone V tops out at like 120k ALMs, right?
<Lofty> Arria V GX tops out at 190,240
<Sarayan> 11356 (M)LABs, x10 to get ALMs (we have 4191 in the de10-nano)
<Sarayan> a (m)lab being what is in the grid, hence why I count that
<Sarayan> that puts the d9 over 128 I'm sure
<Lofty> Stratix V GX tops out at 359,200
<Lofty> That *definitely* does
<Sarayan> Is there the x10 in there?
<Lofty> Yes
<Lofty> I'm counting ALMs rather than LABs because I'm reading from the spec sheet
<Sarayan> stratix 10 gx tops are 10M, dunno if it's a x10, x20 or more
<Sarayan> still, fucking big
<Lofty> Indeed.
<Sarayan> I suspect you can more easily afford a 64-bit cpu than even one of these beasts :-)
<Lofty> Quite probably
<Lofty> I know daveshah can target UltraScale+ thanks to RapidWright, and those chips are huge, but for the poorer of us, it's probably a good idea not to target the larger chips until we can afford them :P
<Sarayan> the de10-pro, with 2.8M (want!) costs $10K (maybe not)
<Lofty> Even ignoring the hardware, quartus alone is like $2K a year there
<Sarayan> plus it's pcie and doesn't even have a hdmi connector
<Sarayan> so meh
<Sarayan> I've yet to see a successor to the nano that can do the same things
<Lofty> The Nano's a really nice little board
<Sarayan> the han pilot platform could be except for its price ($3K)
<Sarayan> but fpga-wise it's 10 times bigger
<Sarayan> and the memory is 5 times faster or so I think
<Lofty> "Stratix V device configuration is enhanced for ease-of-use, speed, and cost."
<Lofty> "Enhanced-cost FPGA"
<Sarayan> yep, I can decode the dmf tables, need to write the python code for that
<Sarayan> lofty?
<Sarayan> og.kervella.org/db_cyclonev_sx120f-7_revprod_1100mv_100c.dmf.txt
<Sarayan> The timings you were asking for, I think the values are picoseconds, the scaling may need to be taken into account
<Sarayan> that's what the synthesis uses for our fpga
<Sarayan> nothing else
<Lofty> Holy shit Sarayan, thank you
<Sarayan> terminology detail: EAB/MEAB = M10K
<Lofty> Useful to note
<Sarayan> also, MAC = DSP
<Lofty> That much I knew, given how DSP cells are cyclonev_mac
<Sarayan> heh
<Lofty> Gonna be honest with you, Sarayan
<Lofty> These timings are detailed enough that they'd probably work okay for PnR
<Sarayan> they're missing the net delays
<Lofty> Synthesis timings tend to be pessimistic anyway
<Sarayan> that's in the fdi file, I think
<Sarayan> (pretty sure actually, since fit uses it in addition)
<Lofty> Do you think LE_COMB is a LAB, or an ALM?
<Sarayan> there is no such thing as an ALM in these files
<Sarayan> seriously, you won't find ALM anywhere
<Lofty> ROOT LE_COMB RC_RISE RC_RISE MICRO - 6LUT_A_TO_COMBOUT = 605
<Lofty> ROOT LE_COMB RC_RISE RC_RISE MICRO - 6LUT_C_TO_COMBOUT = 432
<Lofty> ROOT LE_COMB RC_RISE RC_RISE MICRO - 6LUT_D_TO_COMBOUT = 433
<Lofty> ROOT LE_COMB RC_RISE RC_RISE MICRO - 6LUT_B_TO_COMBOUT = 583
<Lofty> I dunno, this seems promising :P
<Sarayan> LE_COMB is the left part of the LAB
<Sarayan> I mean the name "ALM" is used nowhere
<Lofty> Ah
<Sarayan> there is comb outside of (m)labs?
<Sarayan> looks like my set: are off by one
<Sarayan> updated the file
<Lofty> > DATAH
<Lofty> Oh god, yet another internal naming scheme
<Sarayan> that one is easy :-)
<Sarayan> one second, lemme find the table
<Sarayan> a-h = e0/f0/a/b/c0/c1/e1/f1
<Sarayan> in that order
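The mapping just given can be written down as a lookup table. This is a sketch of the mapping exactly as stated in the message above; the dictionary name is hypothetical, and the mapping itself is Sarayan's reading of the Quartus data rather than a verified fact about the hardware.

```python
# DATAA..DATAH in order, mapped to the hardware diagram names quoted above.
HW_NAMES = ["E0", "F0", "A", "B", "C0", "C1", "E1", "F1"]

# Hypothetical helper table: Quartus port name -> hardware diagram input.
DATA_TO_HW = {f"DATA{letter}": hw for letter, hw in zip("ABCDEFGH", HW_NAMES)}

for port, hw in DATA_TO_HW.items():
    print(f"{port} -> {hw}")
```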
<Sarayan> hmmm, looks like set: can have negative values
<Sarayan> fixed
<Sarayan> (see XTALK)
<Lofty> Okay, hmm
<Lofty> (This is useful to know)
<Lofty> But it means I need to figure out how the fuck to translate these into Yosys timings
<Sarayan> yep
<Sarayan> that's a fucklot of information
<Lofty> Time for epic diagram annotation time!
<Sarayan> and that's just our fpga, there is an infinite number of files like this one
<Lofty> I suspect they'll all have a pattern
<Sarayan> hmm, copy/paste fuckup, file name should be ddb_*, whatever
<Sarayan> db_cyclonev_sx120f-7_revprod_1100mv_n40c.dmf.txt for the -40C version if you want to compare
<Lofty> I think the approach daveshah took was to just use the slow corner timings everywhere
<daveshah> The nextpnr API supports four quadrant timings (slow/fast and rising/falling)
<Sarayan> synth uses the 100c only, fit uses both and the fdi file (net delays)
<daveshah> I think ECP5 is just two quadrant though, fast/slow as the rising/falling are the same in Diamond
<Sarayan> the fdi file seems to use distances between connection points and rc values
<Sarayan> plus some kind of driver power I suspect
<Sarayan> only handling cyclonev should simplify things though
<Lofty> "should"
<daveshah> I'm vaguely interested in having an RC-based timing model that could be used in a couple of nextpnr arches, as that is similar to the model xilinx uses
<daveshah> So if someone here is interested in that, it would be great to have
<daveshah> it might be that that model would only be enabled for signoff timing, and a faster approximation used during routing (and necessarily placement where you don't know the exact interconnect path anyway)
<Sarayan> don't know as in not decided yet?
<daveshah> Yes, the exact interconnect path is the job of the router
<daveshah> in practice, in some of the most advanced flows, there's more of a blurring between place and route but nextpnr is some way off that kind of stuff yet
<Sarayan> well, I find the idea of optimizing for the cv interesting, but that requires managing to do stuff for the cv in the first place :-)
<daveshah> yeah, I think there could be quite a bit of shared work between cv (and bigger) and xilinx
<daveshah> first I need to get ripple done, which seems to be expanding into a never-ending set of subproblems
<Sarayan> what's ripple?
<daveshah> it's a routeability driven placement algorithm
<daveshah> I'm working on some tweaks on top of it too, to make it generic to stuff other than Xilinx and add a few features like SLR (chiplet) partitioning
<Sarayan> oh, there's going to be an interesting issue in cv
<Lofty> daveshah: to place and route your FPGA you must first invent the universer
<Lofty> -r
<daveshah> yeah, things like discovering that I needed to do hypergraph partitioning myself has been something of an interesting distraction
<Sarayan> a lab has 40 ffs. the ff clocks can be connected to one of 3 clock lines. The clock lines are created globally in the lab, from (n) clock inputs and an optional inverter for each line
<daveshah> and now I'm looking at how to deal with cases of fracturable hard blocks (like the Xilinx RAM36/2x RAM18 and eventually Lattice DSPs) in the bipartite matching based legalisation
<daveshah> That does seem like it should be representable in the per-tile legality checks
<daveshah> I guess a LAB would be a tile?
<Sarayan> yeah
<Lofty> Cyclone V DSPs are three-way partitionable, daveshah
<Lofty> Have fun with that
<daveshah> Interesting
<Sarayan> a (m)lab tile has 20 lut-6-equivalent and 40 ffs
<Lofty> 3 9x9s, 2 18x18s or 1 27x27
<Sarayan> mlab can also switch to memory mode instead of comb
<daveshah> Interesting, Lattice DSPs can be 1x 18x18 or 2x 9x9
<daveshah> or four of them combined into a 36x36
<daveshah> but I haven't even finished fuzzing that for ECP5 yet, they are very painful in terms of the number of random bits toggling everywhere
<Lofty> Sarayan: is PROPAGATEIN the same as DATAA?
<Lofty> [note: pronounced dat<screaming>]
<daveshah> lol
<daveshah> I've always thought the ecp5 JTAGG primitive to be really cute for some reason
<Sarayan> well, propagatein is 3 while dataa is 12, so your guess is as good as mine
<Sarayan> (it also has data and ndataa, the problem mostly is that the enum is generic for everything quartus handles)
<Sarayan> (ndata, not ndataa)
<Sarayan> oh, and datain too
<daveshah> could it be two ways of routing to the same pin, with another bit somewhere to select what is used?
<Sarayan> no, there are 8 input pins on the data side
<Sarayan> they really seemed to have used PROPAGATEIN instead of datas just because
<Lofty> It's what it seems like from the dump, anyway
<Lofty> mwk: ^ have fun
<jevinskie[m]> They call both m4k and m10k MEAB? Not MEAB and LEAB? :)
<Lofty> Remember Intel have M20Ks and M144Ks
<Lofty> Anyway, so
<Lofty> These numbers *kinda* line up with mine
<daveshah> Ah, so Intel have a large RAM primitive that is all the rage these days too
<Lofty> I'm sure Yosys is going to infer them on a regular basis /s
<daveshah> I did to UltraRAM inference for Yosys
<daveshah> It managed to pull a few out of a LiteX test case
<Lofty> I'd imagine for ROM, right?
<daveshah> s/to/do
<daveshah> No, must have been the system RAM
<mwk> ultraram is useless for rom, not being initializable
<daveshah> It is in Versal
<Lofty> Oh, right, I forgot
<daveshah> But yeah, not in US+
<Lofty> The timings here are wack, holy shit.
<Lofty> ABC9's gonna have fun with this
<Lofty> (A, B, C, D, E, G) => COMBOUT = (368, 1342, 1323, 887, 927, 785)
<Lofty> Which is...substantially slower than the numbers I had before
<Lofty> (there are no [FH] => COMBOUT numbers)
<Lofty> the FRACT/6LUT_TO_COMBOUT numbers line up better, but they seem incomplete...
<Lofty> ...I have a suspicion the numbers are correct
<Lofty> The separate (A, B, C, D, E, G) => REGOUT numbers line up nicely
<Lofty> mwk, daveshah: do you think I should use the COMBOUT numbers or the REGOUT numbers?
<Lofty> (well in this case the numbers are (B, C, D, E, G, H) but anyway)
<Lofty> It'd be funny to have a pass that examines the netlist and sets a parameter which lets you change the timings according to what something is connected to
<Lofty> Okay, I've done some more research
<Lofty> I'm pretty sure PROPAGATEIN is *not* DATAA
<Lofty> But instead possibly - possibly - carry in
<Lofty> Or the share input
<Sarayan> where is e0 then?
<Sarayan> incidentally, sharein exists (273) and so does cin (2)
<Sarayan> og.kervella.org/enums.txt, we're talking about DB_INPUT_PORT_TYPE_STRING here
<Lofty> I'm just grepping through your data dump (which is very handy)
<Sarayan> the file is organized as a tree of nodes with a DBS_DTM_NODE_ENUM_STRING types that finally (the dash) go to a table indexed with a vector of (DTM_ENUM or DB_BURIED_PORT_TYPE or DB_INPUT_PORT_TYPE or DB_OUTPUT_PORT_TYPE or CDB_RE_TYPE or DEV_IO_STANDARD_ENUM) to which is associated an integer or float value
<Sarayan> so find a table, then find a value in the table
<Sarayan> enums.txt gives you, well, the vocabulary I guess
<Sarayan> doesn't tell which mapping they decided on for cv
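A minimal Python sketch of reading one line of the decoded text dump, assuming the layout described above: a path of node types, a dash, a table key made of one or more enum names, then `=` and an integer or float. The function name and the returned tuple shape are hypothetical, and the format is inferred only from the lines quoted in this log.

```python
def parse_dmf_line(line):
    """Split 'PATH... - KEY... = VALUE' into (path, key, value)."""
    path_part, sep, rest = line.partition(" - ")
    if not sep:
        raise ValueError(f"no ' - ' separator in: {line!r}")
    key_part, sep, value_part = rest.rpartition(" = ")
    if not sep:
        raise ValueError(f"no ' = ' separator in: {line!r}")
    text = value_part.strip()
    try:
        value = int(text)
    except ValueError:
        value = float(text)  # tables may hold float values too
    return tuple(path_part.split()), tuple(key_part.split()), value


example = "ROOT LE_COMB RC_RISE RC_RISE MICRO - 6LUT_A_TO_COMBOUT = 605"
path, key, value = parse_dmf_line(example)
print(path, key, value)
```

Splitting on the first " - " (with surrounding spaces) keeps node names such as FLIP-FLOP and negative values such as -196 intact.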
<Sarayan> oh also, s2t_dump_delay_info=on
<Sarayan> s2t_delay_model_dump_delays=on
<Sarayan> in quartus.ini makes it dump the real times for every pair of node and quartus_map time, *if* it doesn't segfault (and even then you get intermediate files)
<Sarayan> way more interesting than the slack files which give you how much margin you have
<Lofty> Indeed :P
<Lofty> I have this nasty suspicion that both "COMBOUT" and "PROPAGATE{IN,OUT}" are overloaded here
<Lofty> I think COMBOUT is being used for both COMBOUT and SUMOUT
<Sarayan> could be
<Lofty> And PROPAGATE is both carry and share{in,out}
<Sarayan> I gave you the files as soon as I've been able to decode them
<Lofty> And I appreciate it
<Lofty> I hope I'm helping too >.>
<Sarayan> haven't yet tried to trace if they're actually used
<Sarayan> sure you are
<Lofty> So, there's a *mention* of SUMOUT
<Lofty> ROOT LE_COMB MICRO - FRACT_REG_FEEDBACK_D_TO_SUMOUT = 1113
<Lofty> ROOT LE_COMB MICRO - FRACT_LUT_CASCADE_D_TO_SUMOUT = 1173
<Sarayan> but I'm grepping on "ROOT LE_FF" and ending up kinda terrified
<Lofty> grep "ROOT LE_FF" db_cyclonev_sx120f-7_revprod_1100mv_100c.dmf.txt | grep -v "RC_RISE" | grep -v "RC_FALL" | grep -v "MLAB" | grep -v "MIN"
<Lofty> Filters out a lot of the noise
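The same filter can be sketched in Python for use in a script. This assumes plain substring matching, as the grep pipeline above does; the function name is hypothetical, and the second sample line is made up purely to show the exclusion.

```python
# Substrings that the grep -v stages above remove.
EXCLUDED = ("RC_RISE", "RC_FALL", "MLAB", "MIN")


def summary_lines(lines, prefix="ROOT LE_FF"):
    """Keep lines containing prefix and none of the excluded substrings."""
    return [line for line in lines
            if prefix in line and not any(tag in line for tag in EXCLUDED)]


sample = [
    "ROOT LE_FF INTEGER - TSU set:0 = -196",
    "ROOT LE_FF MIN INTEGER - TSU set:0 = -150",  # hypothetical line
    "ROOT LE_COMB TABLE - CARRY_ADDER MAX = 52",
]
for line in summary_lines(sample):
    print(line)
```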
<Sarayan> I've built a small extract from the NES timing files (before it crashed):
<Sarayan> 282 = iterm 0 to oterm 0 in DFFE ic=9
<Sarayan> 731 = iterm 2 to oterm 0 in DFFE ic=9
<Sarayan> 731 = iterm 3 to oterm 0 in DFFE ic=9
<Sarayan> 813 = iterm 1 to oterm 0 in DFFE ic=9
<Sarayan> 1009 = iterm 4 to oterm 0 in DFFE ic=9
<Lofty> Yay, negative setup times!
<Sarayan> not sure what iterm is what, doesn't seem to match what's in the place file
<Sarayan> ROOT LE_FF P2P in_buried_node P2P_IN_BUR - CLK EXC_MREADY_DFF = 731
<Sarayan> ROOT LE_FF P2P in_buried_node P2P_IN_BUR - ENA EXC_MREADY_DFF = 813
<Sarayan> ROOT LE_FF P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 1009
<Sarayan> ROOT LE_FF P2P in_buried_node P2P_IN_BUR - LATCH_ENABLE EXC_MREADY_DFF = 282
<Lofty> 282 roughly matches the CLK -> DATAOUT arrival time
<Lofty> (which I have as 262 but)
<Sarayan> Seems to match, kinda. I'm not sure I understand latch_enable vs. enable
<Lofty> There's also WE
<Lofty> And I bet that's an enable too :P
<Sarayan> It feels like P2P means "point-to-point timing inside an element", with (in|buried)_(buried|out)_node telling you where the points are placed w.r.t the logical periphery of the element
<Lofty> That sounds about right
<Lofty> What are the numbers in set:N?
<Lofty> Do you know?
<Sarayan> mux settings I'm pretty sure
<Sarayan> configurable stuff in any case
<Lofty> I think P2P stuff is effectively arrival time
<Sarayan> not sure what you mean by arrival
<Lofty> Right
<Lofty> When the clock edge triggers, there's a propagation delay between that and the flop output changing
phire has quit [Remote host closed the connection]
<Lofty> That delay is the arrival time, because it's when the effect of the clock edge arrives at the flop output
<Sarayan> ok, we agree then
<Sarayan> Not sure I entirely understand something like that though:
<Sarayan> ROOT IO HPAD FF INPUT RC_RISE RC_RISE P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 466
<Sarayan> ROOT IO HPAD FF INPUT RC_FALL RC_RISE P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 470
<Sarayan> ROOT IO HPAD FF INPUT RC_FALL RC_FALL P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 478
<Sarayan> ROOT IO HPAD FF INPUT P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 487
<Sarayan> ROOT IO HPAD FF INPUT RC_RISE RC_FALL P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 487
<Sarayan> unless the last time is just the worst case in general
<Lofty> That's exactly what it is
<Lofty> Which is why I exclude the RC_RISE/RC_FALL attributes
<Sarayan> except when it's on a MIN line of course
<Sarayan> then it's the min of the mins
<Lofty> It still is even on that I think
<Lofty> Okay
<Sarayan> ROOT IO HPAD FF INPUT MIN RC_RISE RC_RISE P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 407
<Sarayan> ROOT IO HPAD FF INPUT MIN RC_FALL RC_RISE P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 408
<Sarayan> ROOT IO HPAD FF INPUT MIN P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 407
<Sarayan> ROOT IO HPAD FF INPUT MIN RC_RISE RC_FALL P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 428
<Sarayan> ROOT IO HPAD FF INPUT MIN RC_FALL RC_FALL P2P in_buried_node P2P_IN_BUR - IN EXC_MREADY_DFF = 413
<Sarayan> you really can do a detailed sim of the timings there
<Sarayan> well, once you have the routing stuff too
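A short Python check of the "worst case" reading, using only the HPAD FF INPUT values quoted above. It tests the hypothesis that the line without RC_RISE/RC_FALL qualifiers is the maximum of the four qualified lines, and that the MIN line without qualifiers is the minimum of its four. The variable names are illustrative.

```python
# (first qualifier, second qualifier) -> delay, copied from the lines above.
max_variants = {
    ("RC_RISE", "RC_RISE"): 466,
    ("RC_FALL", "RC_RISE"): 470,
    ("RC_FALL", "RC_FALL"): 478,
    ("RC_RISE", "RC_FALL"): 487,
}
max_unqualified = 487

min_variants = {
    ("RC_RISE", "RC_RISE"): 407,
    ("RC_FALL", "RC_RISE"): 408,
    ("RC_RISE", "RC_FALL"): 428,
    ("RC_FALL", "RC_FALL"): 413,
}
min_unqualified = 407

# Does each unqualified line summarise its four qualified variants?
max_matches = max(max_variants.values()) == max_unqualified
min_matches = min(min_variants.values()) == min_unqualified
print(max_matches, min_matches)
```

Both comparisons hold for this one table, which supports the hypothesis here without proving it for the rest of the file.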
<Lofty> daveshah: do all synchronous flop inputs need a setup time, or just some of them?
<Lofty> I suppose all of them
phire has joined #prjmistral
<daveshah> Yes, all of them would have a setup constraint
<daveshah> There are also some constraints on async set/reset inputs
<Lofty> Well, the data has two setup times: -196ps and 0ps
<Lofty> I'm assuming I should use -196?
<daveshah> No, that is less demanding than 0ps
<Lofty> ROOT LE_FF INTEGER - TSU set:0 = -196
<Lofty> ROOT LE_FF INTEGER - TSU set:1 = -196
<Lofty> ROOT LE_FF INTEGER - TSU set:2 = -196
<Lofty> ROOT LE_FF FLIP-FLOP - EXC_MREADY_DFF TSU = 0
<daveshah> It seems like this is more than two different corners
<Lofty> Nope
<Lofty> This is exactly one corner
<Lofty> 1.1V, 100C
<daveshah> Hmm, not sure why there would be four values and all but one the same
<daveshah> Might be different modes or something?
<daveshah> This seems more like a Quartus internals question than anything else
<Lofty> Mhm
<Lofty> Honestly, I think -196 is the value to choose
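A simplified Python sketch of why a setup time of -196 ps is less demanding than 0 ps. It assumes a single-clock register-to-register path with no routing delay and no clock skew; the function name and the 1000 ps clock period are hypothetical, while 282 and 605 are values quoted elsewhere in this log, used here only as sample inputs.

```python
def setup_slack_ps(period_ps, clk_to_q_ps, logic_ps, tsu_ps):
    """Slack left after launch delay, logic delay and setup requirement."""
    return period_ps - (clk_to_q_ps + logic_ps) - tsu_ps


PERIOD_PS = 1000   # hypothetical clock period
CLK_TO_Q_PS = 282  # sample flop clock-to-output delay
LOGIC_PS = 605     # sample LUT delay

for tsu in (0, -196):
    slack = setup_slack_ps(PERIOD_PS, CLK_TO_Q_PS, LOGIC_PS, tsu)
    print(f"TSU = {tsu:5d} ps -> slack = {slack} ps")
```

A negative setup time adds slack, so choosing 0 ps is the more pessimistic option and -196 ps the more optimistic one.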
<Lofty> ROOT LE_FF INTEGER - TH set:0 = 270
<Lofty> ROOT LE_FF INTEGER - TH set:2 = 557
<Lofty> ROOT LE_FF INTEGER - TH set:1 = 327
<Lofty> ROOT LE_FF FLIP-FLOP - EXC_MREADY_DFF TH = 0
<Lofty> I find it difficult to believe they have a flop with a zero hold time
<Lofty> But while the hold times differ, the setup times match
<Sarayan> Not sure what TH means in the first place
<Lofty> Hold time
<Lofty> Sarayan: increasingly unconvinced the mapping you have of DATAX to the hardware diagram is correct
<Lofty> For example, according to the timing list, there's a connection between G = E1 and SUMOUT
<Lofty> Not according to the hardware diagram there isn't
<Sarayan> well, it may not be :-)
<Sarayan> fgrep ' DATAG ' db_cyclonev_sx120f-7_revprod_1100mv_100c.dmf.txt | fgrep SUMOUT
<Sarayan> that gives nothing?
<Lofty> They call it COMBOUT there
<Lofty> But it's definitely SUMOUT not COMBOUT
<Sarayan> Not convinced
<Lofty> Okay, look at the path for PROPAGATEIN
<Lofty> Not connected for REGOUT
<Lofty> Connected for COMBOUT and PROPAGATEOUT
<Lofty> Answer: PROPAGATEIN is carry, and COMBOUT is sum
<Lofty> And REGOUT is actually COMBOUT
<Sarayan> ROOT LE_COMB TABLE - CARRY_ADDER MAX = 52
<Sarayan> or the whole adder is in that
<Sarayan> (it's fast, it doesn't go through any ram)
<Sarayan> 52ps is almost long :-)
<Sarayan> I guess LUTRAM is MLAB in memory mode
<Lofty> Sarayan: I'm using the PROPAGATEIN to PROPAGATEOUT time of 71ps per ALM
<Sarayan> that's either carry or share
<Lofty> I think it's both.
<Sarayan> could be
<Lofty> Not like we have any better ideas at present
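A rough Python estimate of carry-chain delay from the 71 ps per ALM figure above. It assumes two adder bits per ALM and ignores the delay into the first ALM, out of the last one, and any LAB-to-LAB hop; the names and the model are a sketch, not what Yosys or Quartus computes.

```python
PROPAGATE_PS_PER_ALM = 71  # PROPAGATEIN -> PROPAGATEOUT, from the log
BITS_PER_ALM = 2           # assumption: two adder bits per ALM


def carry_chain_delay_ps(width_bits):
    """Estimated carry propagation delay for an adder of width_bits."""
    if width_bits < 0:
        raise ValueError("width_bits must be non-negative")
    alms = -(-width_bits // BITS_PER_ALM)  # ceiling division
    return alms * PROPAGATE_PS_PER_ALM


for width in (8, 16, 32):
    print(width, carry_chain_delay_ps(width))
```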
<Sarayan> you can try to build small designs and have quartus_map dump the timings, you should get answers eventually :-)
<Sarayan> anyway, have fun and good night :-)