tpb has quit [Remote host closed the connection]
tpb has joined #symbiflow
<sf-slack> <pgielda> That will happen most probably tomorrow
<sf-slack> <pgielda> Sorry for the delay, one of the engineers responsible for this part happens to be out of office this week
<_whitenotifier-f> [yosys] HackerFoo opened issue #79: Need updated Yosys to successfully run attosoc test with nextpnr-xilinx - https://git.io/JfyNj
<sf-slack> <pgielda> But thanks for the comments on the PR; it would be great to get it in and then expand and fix things
<sf-slack> <pgielda> We will also then have one set of packages for both xc7 and eos3 flows which would be a dream ;)
<HackerFoo> nextpnr takes 104s on vexriscv, where essentially all of that time (all but ~100ms) is spent in fasm and bitstream generation: https://docs.google.com/document/d/1-lDeYYwmfxanod441FUjIoAKW061qmgKsqTcKTb_mXE/edit?usp=sharing
andrewb1999 has quit [Read error: Connection reset by peer]
<tpb> Title: Performance Comparison - Google Docs (at docs.google.com)
<HackerFoo> At least according to fpga-tool-perf
<mithro> @HackerFoo Did it actually generate anything? There is a lot of N/A output
<HackerFoo> mithro: It generated a 2MB .bit file; I haven't tried it. fpga-tool-perf doesn't seem to keep the output from nextpnr yet, but it didn't seem to fail.
<HackerFoo> I re-ran it to capture the output: https://gist.github.com/HackerFoo/7b2a7bed7632a981399d54cba389d749
<tpb> Title: vexriscv/nextpnr-xilinx log · GitHub (at gist.github.com)
<HackerFoo> Placement takes about 10 seconds.
citypw has joined #symbiflow
futarisIRCcloud has quit [Quit: Connection closed for inactivity]
rvalles_ has joined #symbiflow
rvalles has quit [Ping timeout: 256 seconds]
Degi has quit [Ping timeout: 246 seconds]
Degi has joined #symbiflow
rvalles_ is now known as rvalles
az0re has quit [Remote host closed the connection]
futarisIRCcloud has joined #symbiflow
tpagarani has joined #symbiflow
smkz has quit [Quit: reboot@]
smkz has joined #symbiflow
smkz has quit [Client Quit]
smkz has joined #symbiflow
tpagarani has quit [Remote host closed the connection]
lns1 has joined #symbiflow
lns1 has left #symbiflow [#symbiflow]
OmniMancer has joined #symbiflow
OmniMancer1 has quit [Ping timeout: 256 seconds]
az0re has joined #symbiflow
epony has quit [Quit: QUIT]
epony has joined #symbiflow
epony has quit [Max SendQ exceeded]
epony has joined #symbiflow
kraiskil has joined #symbiflow
<sf-slack> <acomodi> nextpnr in tool-perf needs to be fixed to have all the output results correctly parsed, I am dealing with that now
mkru has joined #symbiflow
mkru has quit [Quit: Leaving]
lnsharma has joined #symbiflow
lnsharma has left #symbiflow [#symbiflow]
craigo has joined #symbiflow
<Lofty> tnt: you around?
<tnt> Lofty: I am
<Lofty> Could you send me your designs so I can play about with them?
<tnt> Note that you can only run it through yosys atm ... you can't actually run VPR because I haven't ported the RAM or IO (so you have SB_IO and SB_RAM_4K blackboxes ...), nor worked out the proper interconnect.
anuejn_ is now known as anuejn
<tnt> But so far it's been good enough for me to look at the deepest path with 'show' (something like show n:$abc$8664$techmap\tx_pkt_I.$0\len[10:0][6] %ci*:-dff:-dffc:-SB_RAM40_4K ) and see if I can do better.
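A minimal Yosys sketch of this kind of path inspection (the net name is a placeholder, the exclusion list is taken from tnt's command above, and it assumes the design has already been through synthesis):

    ltp -noff                  # report the longest topological path, ignoring flip-flops
    # trace the input cone of one path endpoint, excluding FFs and the RAM blackboxes:
    show -width n:some_net %ci*:-dff:-dffc:-SB_RAM40_4K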
<Lofty> If ABC9 gets hooked up there'll be sta, which is even more useful
<Lofty> tnt: Is this after you've manually optimised it, or before?
<tnt> What I sent? That's before I did anything.
<tnt> I haven't actually done any optimization anywhere other than on paper ...
<Lofty> Ah, okay
<Lofty> Let's start by seeing what happens if we map muxes before giving them to ABC
<Lofty> Since implementing a MUX8 in pure LUTs requires 11 inputs (8 data + 3 select), and ABC thinks we only have 4-input LUTs
kraiskil has quit [Ping timeout: 256 seconds]
<Lofty> Doesn't seem like you have anything mappable there. At least with `muxcover -nodecode`
<Lofty> Let's see what happens if we allow decoders
<Lofty> Again, nothing
<Lofty> Ah well
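A sketch of the experiment above, assuming it is inserted after the coarse-grained synthesis steps and before ABC; the exact position in the script is a guess:

    techmap                    # lower $mux cells to gate-level $_MUX_ cells so muxcover can see them
    muxcover -nodecode         # try to cover $_MUX_ trees with wide mux cells, without decoder logic
    stat                       # check whether any wide mux cells were actually produced
    # dropping -nodecode corresponds to the "allow decoders" variant tried above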
<tnt> Unfortunately they're probably never pure mux. The 8:1 mux example is just a synthetic worst case. For the few paths I looked at, it's more subtle how you could use the muxes better.
<Lofty> Alright, 1467 cells to 1261 cells by using `abc -luts 1,2,2,4`
<Lofty> Let's use a crappy formula for LC usage: LCs = LUT2/2 + LUT3/2 + LUT4
<Lofty> Before: 385/2 + 509/2 + 56 = 407 LCs
<Lofty> After: 107/2 + 420/2 + 217 = 498 LCs.
<Lofty> Hmm
<tnt> How's the ltp ?
<Lofty> length=7
<Lofty> So equivalent to before it seems
<Lofty> Okay, what happens if I reduce the area cost of LUT4s to make ABC try to cover more logic with them?
<Lofty> 109/2 + 389/2 + 238 = 486 LCs
kraiskil has joined #symbiflow
<Lofty> That's `abc -luts 1,2,2,3`
<Lofty> I just double-checked with flowmap: it seems the critical path is 7 and it can't actually be shortened
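A guess at how that depth check could look, assuming flowmap is run on a saved copy of the pre-ABC netlist so the normal mapping can continue afterwards:

    design -save pre_abc       # keep the current netlist
    flowmap -maxlut 4          # depth-optimal LUT4 mapping (FlowMap); area is not a goal here
    ltp -noff                  # if this also reports 7 levels, mapping alone can't reduce depth
    design -load pre_abc       # restore and carry on with the usual abc mapping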
<Lofty> What if we make LUT4s the same cost as LUT3s?
<Lofty> 87/2 + 322/2 + 305 = 509 LCs
<Lofty> Hrm.
<Lofty> What if we make LUT4s unnecessarily expensive?
<tnt> I'll try to come up with a more synthetic example that's easier to analyze and that comes up quite a bit (which is basically a loadable counter with enable).
<Lofty> Sure
<tnt> For instance, it doesn't _use_ dff enable from what I've seen, which increases the length of the logic for nothing.
<Lofty> That's because the flow doesn't map dff enable
<Lofty> Which might help :P
<Lofty> I have no idea what the primitive for it is
<tnt> dffe ( Q, D, CLK, EN );
<Lofty> Okay, so literally just inserting a dff2dffe in the flow pulls out 191 dffes
<Lofty> <Lofty> Before: 385/2 + 509/2 + 56 = 407 LCs
<Lofty> After: 472/2 + 375/2 + 46 = 469
<Lofty> Eh?
<Lofty> ...Ah
<Lofty> Here's the answer: 385/2 + 509/2 + 56 = 502 LCs :P
<Lofty> Thanks windows calculator
<Lofty> Real helpful of you
<tnt> lol
<Lofty> Okay, so, dffes help
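A sketch of where the dff2dffe call might sit in the synthesis script; the surrounding passes are assumptions:

    # before FF techmapping: fold enable-mux feedback loops into enable flip-flops
    dff2dffe
    opt_clean
    stat                       # the run above reports 191 dffes pulled out this way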
<Lofty> Let's go back to LUT mapping with the math problems solved
<Lofty> 186/2 + 345/2 + 178 = 443 LCs (abc -luts 1,2,2,4)
<Lofty> So I've shaved 60 LCs by configuring ABC better and using DFFEs
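And a sketch of the matching ABC call, using the per-size LUT area costs quoted above (LUT1=1, LUT2=2, LUT3=2, LUT4=4):

    abc -luts 1,2,2,4          # per-size LUT costs steer ABC's area recovery
    clean
    stat                       # compare the LUT2/LUT3/LUT4 counts against the earlier runs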
futarisIRCcloud has quit [Quit: Connection closed for inactivity]
citypw has quit [Ping timeout: 240 seconds]
<Lofty> Surprisingly, this design isn't too awful to run `freduce` on
<Lofty> 200/2 + 291/2 + 168 = 413 LCs
<Lofty> Although maybe put that behind an option :P
<tnt> What's freduce ?
<Lofty> It looks for functionally-equivalent subcircuits (i.e. subcircuits that return the same value) to reduce area
<Lofty> Unfortunately the pass is horrifically slow for anything larger than toy examples
<Lofty> Because it attempts to SAT compare stuff without even trying
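A sketch of adding freduce to the flow; where exactly it went in the experiment above isn't stated, so the position is a guess, and it can be very slow on larger designs:

    freduce -v                 # SAT-based search for functionally equivalent signals, merged to save area
    opt_clean                  # sweep drivers that became unused
    stat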
olegfink1 has joined #symbiflow
<Lofty> Hmmm
<Lofty> Without going to the effort of writing an ABC9 flow, I think I've mostly exhausted the things I can think of
smkz has quit [Ping timeout: 256 seconds]
<Lofty> Actually, there is one
<Lofty> Nope, doesn't apply here.
xobs1 has joined #symbiflow
FFY00 has quit [*.net *.split]
promach3 has quit [*.net *.split]
xobs has quit [*.net *.split]
olegfink has quit [*.net *.split]
<Lofty> Anyway, 60 LCs by configuring ABC better, another 30 by using freduce
FFY00 has joined #symbiflow
<tnt> Do you have a diff I can apply to yosys to try that here ?
<tnt> Did the dffe have any impact on ltp btw ?
<tpb> Title: synth_quicklogic.diff · GitHub (at gist.github.com)
<Lofty> No, but it reduces area
<Lofty> As I said: you can't do any better with depth
<Lofty> At least without resynthesis
<tnt> I'm also wondering why the flow is two steps of mapping. synth_quicklogic is the first stage but then they actually re-run yosys a second time with a different techmapping pass.
smkz has joined #symbiflow
<Lofty> ?!
<Lofty> That's...what.
<tpb> Title: synth.sh: yosys -p "tcl ${SYNTH_TCL_PATH}" -l $LOG ${VERILOG_FILES[*]} pytho - Pastebin.com (at pastebin.com)
<Lofty> This should be within synth_quicklogic
<tnt> Yeah, they have a second level cells_map.v that maps to T_FRAG / B_FRAG and other types of cells used by vtr rather than lut{2,3,4}.
* Lofty sighs
<tpb> Title: [VeriLog] // ============================================================================ - Pastebin.com (at pastebin.com)
<tnt> It seems to also have stuff like "// FIXME: Always select QDI as the FF's input" which seems rather sub-optimal ... or maybe vpr's packing step can 'fix' that if it sees it can pack the Q FRAG with the T/B FRAG
<tnt> Also the LUT4s are mapped to 2 LUT3s + 1 F_FRAG; not sure why that is. I'm hoping the F_FRAG can also be placed by VTR as the TBS_mux and not just the bottom Fmux, because otherwise that seems like a huge waste.
<Lofty> Mmm
az0re has quit [Remote host closed the connection]
<Lofty> tnt: But presumably shaving 20% off the area helps a bit, right?
<tnt> It helps for sure, but atm I'm more concerned about the fmax / depth.
<tnt> I have to finish porting so I can do a full synth->pnr run and see the actual frequency. Less area will help placement for sure.
<Lofty> I'm fairly sure ABC9 will help a lot here
<Lofty> By the way, tnt, I added some other stuff you'll probably find helpful
<tnt> where ?
<Lofty> I just updated the diff I gave earlier
<Lofty> Sorry, Git was being a pain
<Lofty> You'll now see the names of wires in the LTP output
<Lofty> Should help your optimisation efforts
<tnt> Ah yeah, naming :)
<tnt> tx
<tnt> Lofty: it might not solve the longest one, but still provides a good reduction : https://pastebin.com/Swhjs6Gn
<tpb> Title: Paths length distribution: Before After 0: 270 307 1: 19 - Pastebin.com (at pastebin.com)
<tnt> (the method I use to collect them is flawed, but I think it still gives a decent idea ...)
<Lofty> Yay
<Lofty> That should improve routability a bit then
<Lofty> i.e. less shitty runtime
futarisIRCcloud has joined #symbiflow
kraiskil has quit [Ping timeout: 265 seconds]
kraiskil has joined #symbiflow
<sf-slack> <acomodi> mithro, HackerFoo: I have merged a fix in tool-perf to parse nextpnr results (frequency, runtime and resources)
OmniMancer has quit [Quit: Leaving.]
OmniMancer has joined #symbiflow
OmniMancer has quit [Client Quit]
epony has quit [Ping timeout: 258 seconds]
mangelis has quit [Ping timeout: 258 seconds]
craigo has quit [Quit: Leaving]
kraiskil has quit [Ping timeout: 256 seconds]
mkru has joined #symbiflow
mangelis has joined #symbiflow
mkru has quit [Quit: Leaving]
<mithro> Lofty: awesome work on the synthesis stuff
<Lofty> It was a handful of hours that I could spare
<mithro> Lofty: did you see that QuickLogic provide full liberty files which include timing for all the parts?
<tpb> Title: EOS-S3/Timing Data Files at master · QuickLogic-Corp/EOS-S3 · GitHub (at github.com)
<Lofty> I did, yes
<Lofty> But porting to ABC9 is a big enough task that I'm not going to do it for free
<Lofty> Household expenditures, etc
<sf-slack> <acomodi> litghost, HackerFoo: I have an almost-working lookahead. I have merged the lookahead creation from the connection box and used the SRC/OPIN --> CHAN and CHAN --> IPIN information from the upstream lookahead map
<litghost> acomodi: Define almost working? It routes well?
<sf-slack> <acomodi> almost working because it still takes more time than the connection box lookahead (271 seconds vs ~130 seconds)
<sf-slack> <acomodi> I still need to check whether the bitstream works though, but the routing step completed successfully
<litghost> acomodi: How long did the untuned version take?
<sf-slack> <acomodi> By untuned you mean without the information on SRC/OPIN --> CHAN and CHAN --> IPIN?
<sf-slack> <acomodi> If that's the case it took 313 seconds
<sf-slack> <acomodi> Also, the lookahead generation took 473.11 seconds with NUM_WORKERS set to 20
<litghost> acomodi: I assume this is on the A50 fabric?
<sf-slack> <acomodi> Yes
<sf-slack> <acomodi> I still need to do more testing and probably fix some things, but I think we are on the right path
<litghost> acomodi: 313 -> 271 isn't a huge improvement, but it is moving in the right direction. It is likely worth comparing how and when the lookahead mispredictions occur. I expect you can use that to inform how to iterate
az0re has joined #symbiflow
epony has joined #symbiflow
kraiskil has joined #symbiflow
mkru has joined #symbiflow
<sf-slack> <acomodi> Sure, I'll try and get it down to reasonable run-time. Hopefully matching the current lookahead. I have tested the routed design on HW and it works
<litghost> acomodi: The runtime doesn't have to match exactly, but 271 vs 130 is a large enough difference to suggest the refactored map lookahead is still mispredicting pretty badly
<litghost> acomodi: Identifying how the new map lookahead mispredicts will enable targeted development to reduce the mispredicts
<litghost> acomodi: Another thing I expect is the new map lookahead output file to be significantly smaller than the connection box lookahead. Is that true?
<sf-slack> <acomodi> Indeed, here there is a pretty huge benefit: • refactored lookahead: 17M • connection box lookahead: 557M
<sf-slack> <acomodi> I will need to run more detailed routes, maybe there are some specific nets that encounter lots of mispredicts
<litghost> acomodi: That is really good! Hopefully with some further development we can capture the relevant data to make the map lookahead as good as the connection box lookahead, without the extra data
<litghost> acomodi: When I was doing initial development for the connection box lookahead, I enabled router profiling, and made sure to print per-connection route times per iteration, and printed the worst connection (OPIN -> IPIN) times
<litghost> acomodi: From there I used the routing_diag tool at criticality == 1 to find lookahead mispredictions
<litghost> acomodi: You may find that you need to look at several connections to suss out where the mispredictions are coming from
<litghost> acomodi: However, if the lookahead is close, you will likely see a step function somewhere in the path that points to the current error source
<litghost> acomodi: Another thing is when you are examining just 1 connection, you can easily compare the ideal route (e.g. astar_fac = 0) versus the route at various A* levels
<litghost> acomodi: With a well-tuned lookahead, the router should behave similarly at A* = 0 and A* ~= .1 - .5
<litghost> acomodi: Once that is working well, A* = 1 should return a good route quickly, and A* = 1.05 ~ 1.2 should return a route quickly, but with somewhat worse path delay
<litghost> acomodi: Ideally A* <= 1 should all return the best (or close to best) delay, and 1 < A* < 1.2 should return worse path delay, but with increased speed (via a decreased search space)
<sf-slack> <acomodi> litghost: All right, thanks for the insight, I will get to analyze all of this on smaller circuits then, to keep things simple
<litghost> acomodi: No, my suggestion was not to use a smaller circuit, but to take a full circuit, pick the worst connection route time, and debug that connection specifically
<sf-slack> <acomodi> One question, by router profiling you mean to have the debug logging enabled or is that something else entirely?
<tpb> Title: vtr-verilog-to-routing/route_profiling.h at master+wip · SymbiFlow/vtr-verilog-to-routing · GitHub (at github.com)
<tpb> Title: vtr-verilog-to-routing/route_profiling.cpp at master+wip · SymbiFlow/vtr-verilog-to-routing · GitHub (at github.com)
<sf-slack> <acomodi> Ok, great, thanks
mkru has quit [Quit: Leaving]
mkru has joined #symbiflow
mkru has quit [Remote host closed the connection]
futarisIRCcloud has quit [Quit: Connection closed for inactivity]
kraiskil has quit [Ping timeout: 246 seconds]
kraiskil has joined #symbiflow
tonlage has joined #symbiflow
OmniMancer has joined #symbiflow
andrewb1999 has joined #symbiflow
<andrewb1999> Is there a reason that the wire CLK_HROW_TOP_R_X60Y130/CLK_HROW_CK_BUFHCLK_L7 would exist in the database but not CLK_HROW_TOP_R_X60Y130/CLK_HROW_CK_BUFHCLK_L6?
<andrewb1999> Both of these wires were chosen by Vivado as partition pins, but the xray python lookup scripts can find L7 and not L6
<andrewb1999> All other partition pins can be found fine too
<andrewb1999> Or is it possible this is an issue with using 2 clocks?
<andrewb1999> Does symbiflow support multiple clocks yet?
<litghost> andrewb1999: 7-series VPR does support multiple clocks, but it doesn't currently support explicit BUFH instancing
<litghost> andrewb1999: There is not support for fully route-through sites right now in VPR
<litghost> andrewb1999: You can operate with all explicit BUFH instancing, or all implicit BUFH instancing (via routing)
<litghost> andrewb1999: Mixing them is not supported at this time
<litghost> andrewb1999: Support could be added, but it is unclear why it would be a priority
<andrewb1999> litghost: Ok thanks, will probably switch to one clock now for simplicity
<litghost> andrewb1999: FYI, one vs multiple clocks has no relevance to the BUFH discussion
<andrewb1999> litghost: Oh ok
futarisIRCcloud has joined #symbiflow
az0re has quit [Remote host closed the connection]
az0re has joined #symbiflow
gsmecher has quit [Ping timeout: 246 seconds]
kraiskil has quit [Ping timeout: 246 seconds]
<andrewb1999> Is the difference between graph_limit and roi just whether synth_tiles are created and if fasm from the harness gets merged?
<andrewb1999> Or does it get treated differently in other ways?
<litghost> andrewb1999: The ROI provides the graph limit as part of the ROI spec (along with the FASM for the bits for the harness)
<litghost> andrewb1999: The graph_limit is for ROI-less designs, where we want a sub-graph excluding some of the fabric
tonlage has quit [Remote host closed the connection]