clifford changed the topic of #yosys to: Yosys Open SYnthesis Suite: http://www.clifford.at/yosys/ -- Channel Logs: https://irclog.whitequark.org/yosys
emeb has quit [Quit: Leaving.]
Degi has joined #yosys
Degi has quit [Quit: ZNC 1.6.6+deb1ubuntu0.2 - http://znc.in]
Degi has joined #yosys
enigma has quit [Quit: leaving]
anticw_ has quit [Remote host closed the connection]
anticw has joined #yosys
Degi_ has joined #yosys
Degi has quit [Ping timeout: 246 seconds]
Degi_ is now known as Degi
strongsaxophone has quit [Remote host closed the connection]
citypw has joined #yosys
Cerpin has quit [Read error: Connection reset by peer]
Cerpin has joined #yosys
emeb_mac has quit [Quit: Leaving.]
knielsen_ has joined #yosys
dys has joined #yosys
Jybz has joined #yosys
knielsen_ is now known as knielsen
N2TOH has quit [Ping timeout: 240 seconds]
N2TOH has joined #yosys
strongsaxophone has joined #yosys
emeb has joined #yosys
citypw has quit [Ping timeout: 256 seconds]
dys has quit [Ping timeout: 256 seconds]
strongsaxophone has quit [Remote host closed the connection]
strongsaxophone has joined #yosys
<grazfather> Hm, from the readme: "type identifiers must currently be enclosed in (parentheses) when declaring signals of that type (this is syntactically incorrect SystemVerilog)". So what would that look like? I get errors with `(input bla::x) name` and other variations of what I put in parens
<ZirconiumX> Maybe it's the "bla::x" bit?
<daveshah> You would put just the bla::x in parens
<daveshah> But there is a WIP to fix this, see https://github.com/YosysHQ/yosys/pull/1725
strongsaxophone has quit [Ping timeout: 246 seconds]
strongsaxophone has joined #yosys
strongsaxophone has quit [Ping timeout: 265 seconds]
strongsaxophone has joined #yosys
voxadam has quit [Read error: Connection reset by peer]
voxadam has joined #yosys
<thardin> yosys is deceptively good at optimizing things
<thardin> I thought I was able to fit a 128x128-bit multiplier in a hx1k, but most of it got optimized out
<ZirconiumX> thardin: it can always be made better :P
<thardin> in reality 24x24-bit is closer to what the hx1k can actually do
<ZirconiumX> If you can afford the timing delay, the UP5K has hardware multipliers
<thardin> could probably do some pipelining
<thardin> mostly just kind of getting a feel for what the different devices can do at the moment
<ZirconiumX> The downside is that it's a UP5K :P
<thardin> made a 24-bit Toom-3 squarer implementation, takes ~7 seconds just in yosys
<thardin> 99% LCs used :p
<ZirconiumX> I have a fun testbench for combinational logic
N2TOH has quit [Ping timeout: 250 seconds]
N2TOH has joined #yosys
<ZirconiumX> It calculates all possible chess moves on a chessboard in a single cycle through pure combinational logic
<thardin> nice
<ZirconiumX> 3.8k LUT4s :P
<thardin> I think I'll throw perf at yosys and see where it spends most of its time
<thardin> what's the delay through all of those?
<ZirconiumX> There are a couple of optimisation PRs out there
<thardin> I see it's single-threaded. a bit of openmp #pragmas in strategic places would be an easy way to speed it up
<ZirconiumX> Unfortunately it's not thread-safe
<ZirconiumX> And making it thread-safe is not trivial
<ZirconiumX> Internally, Yosys represents a netlist as a giant graph, and so any operation on this graph would need to be made thread-safe, so that no two threads could operate on the same node at the same time
<ZirconiumX> Or else the graph needs to be partitioned
<ZirconiumX> The easiest place to partition it would be at the module level
<ZirconiumX> However, the modules get almost immediately flattened unless you pass `-noflatten`
<ZirconiumX> (and not flattening is generally suboptimal)
<thardin> building with ENABLE_DEBUG to hopefully get some symbols in my perf log
<ZirconiumX> Actually, you get symbols
<ZirconiumX> `make` outputs a file with full symbols
<ZirconiumX> `make install` strips the installed executable but leaves the original intact
<ZirconiumX> So the `yosys` in your source directory will have symbols
<thardin> ah
<thardin> already did make clean so, thumb twiddling time
<thardin> I have ccache though, so future builds should be faster
N2TOH has quit [Ping timeout: 256 seconds]
<thardin> I have found it can take a long time for yosys+nextpnr to figure out that "nope your design doesn't fit". would be nice if it could come to that conclusion faster
<ZirconiumX> You can generally judge pretty quickly from the nextpnr "Device utilisation" info
<thardin> mm
<qu1j0t3> ZirconiumX: that is a nice benchmark! you should publish it
<thardin> or just yosys taking more than 10 seconds or so
<mwk> yosys doesn't even have the device geometry info
<mwk> it only knows *what* resources exist on the target device, not how many
<ZirconiumX> mwk: I feel like some `select -assert-max` might be a good rule of thumb
<mwk> (it could perhaps be fixed some day, but that'd require some serious changes)
<ZirconiumX> qu1j0t3: It's written in nMigen and maybe not the most readable thing ever
<qu1j0t3> still!
<qu1j0t3> also a good nmigen case study
<ZirconiumX> I also have a very partial implementation of the PS2's GPU pipelines
peepsalot has quit [Quit: Connection reset by peep]
<thardin> 15% of yosys-abc's time is spent just in clock_gettime()
<ZirconiumX> "yosys-abc" != "yosys"
<ZirconiumX> ABC is the bit performing LUT mapping
peepsalot has joined #yosys
<thardin> I see
<thardin> for yosys malloc() and free() accounts for 16%
<ZirconiumX> The wonderful thing about ABC is that ABC is a wonderful thing /s
<thardin> let's see what callgrind thinks
<thardin> dumdidum.. taking a few minutes
<thardin> there we go
<thardin> RTLIL::put_reference dominates self%
<thardin> SigSpec::~SigSpec() also accounts for quite a lot
<ZirconiumX> I think this is a case of something being called really often (and string interning happens a *lot* in Yosys)
<thardin> yes, it seems roughly half the time is spent doing stuff like this
<ZirconiumX> <thardin> what's the delay through all of those? <--- presently 24ns :P
<ZirconiumX> So 40ish MHz on HX
<thardin> not too bad
<thardin> SigSpec::check() eats quite a bit too, but it's surrounded by #ifndef NDEBUG
<thardin> so maybe I should build it without DEBUG and with NDEBUG
<thardin> got some jobs queued up, bbl
<thardin> I'll try sticking -DYOSYS_NO_IDS_REFCNT on CFLAGS/CXXFLAGS too
<thardin> getting somewhere now. replace_const_cells() is a huge part of it, and is itself.. big
<thardin> so OptExprPass is the thing to look at
<ZirconiumX> pull #1789 and pull #1790 both improve opt_expr a bit
<thardin> there's about a bajillion IdString objects being created and destroyed in there
<thardin> the call graph tool in kcachegrind is a godsend
emeb_mac has joined #yosys
<whitequark> i'm wondering if IdString objects would be better off owned by the design and never freed except with the design itself
<daveshah> That's what nextpnr does
<whitequark> is there a reason yosys doesn't?
<daveshah> I think the reason is that Yosys can create large numbers of temporary IDs that last far less time than a design
<whitequark> ah
<daveshah> Whereas nextpnr does far fewer netlist manipulations so the set of IDs doesn't change anywhere near as much
<daveshah> I wonder if ID garbage collection could be done in between passes
<daveshah> It would be a problem if passes had state outside of the RTLIL structures though
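The "owned by the design, never freed" arrangement discussed above can be sketched as a small interning pool. This is a minimal illustration under stated assumptions, not Yosys or nextpnr code: the class name `IdPool` and its methods are hypothetical, and IDs are plain integers that stay valid for as long as the pool lives.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interning pool: every distinct string gets one integer ID,
// and nothing is released until the pool itself is destroyed.
class IdPool {
public:
    // Return the existing ID for s, or assign the next free one.
    int intern(const std::string &s) {
        auto it = index_.find(s);
        if (it != index_.end())
            return it->second;
        int id = static_cast<int>(strings_.size());
        strings_.push_back(s);
        index_.emplace(s, id);
        return id;
    }

    // Look the string back up; throws std::out_of_range for an unknown ID.
    const std::string &str(int id) const { return strings_.at(id); }

    // Number of distinct strings interned so far (this only ever grows).
    std::size_t size() const { return strings_.size(); }

private:
    std::vector<std::string> strings_;            // ID -> string
    std::unordered_map<std::string, int> index_;  // string -> ID
};
```

Copying an ID here is just copying an `int`, so there is no reference-count traffic. The cost is the one daveshah raises: every temporary name stays in the pool until the pool is destroyed, so `size()` keeps growing across passes.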
<thardin> what does RTLIL stand for btw?
<mwk> RTL intermediate language
<mwk> (register transfer level)
<whitequark> passes having state outside of RTLIL seems like it would be troublesome all on its own
<thardin> got it
<ZirconiumX> Oooh, nextpnr is really struggling to route this...
<daveshah> What device and utilisation?
<ZirconiumX> hx8k, 98%
<daveshah> Yeah, iCE40 is generous in routing resources but 98% is ambitious
<thardin> opt_expr.cc:719 alone is ~20% of the time
<ZirconiumX> I'm a bit confused though, since I thought an ICESTORM_LC was roughly an SB_LUT4 plus an SB_DFF
<daveshah> Might be worth trying --router router2 just for fun
<daveshah> Yes, that's correct
<ZirconiumX> 6255 SB_LUT4s gets turned into 7538 ICESTORM_LCs, though
<ZirconiumX> So I thought I had a lot more leeway than this...
<daveshah> You can't use the LUT and FF separately
<ZirconiumX> 268 DFFEs, 1076 DFFESRs, 757 DFFSR
<daveshah> That's a lot of control signals too
<ZirconiumX> This is nMigen which loves to reset everything to zero
<daveshah> If it's just one reset signal then it should be promoted to a global at least
<ZirconiumX> Yep
<ZirconiumX> Info: promoting fsm_state_SB_LUT4_I3_O [cen] (fanout 1024) <-- ouch
<thardin> so, the ID macro is supposed to allocate things statically right? but it seems it isn't. or it's returning copies of the strings
<daveshah> Yes, it's definitely supposed to be static
<ZirconiumX> Is it worth trying to reduce number of SB_IOs?
<thardin> perhaps if I change the args for IdString::in() to const refs..
<ZirconiumX> ...That'd be useful, yeah
<thardin> compiling and rerunning.. will be a few minutes
<ZirconiumX> daveshah: I... think router2 just borked
<ZirconiumX> Number of wires suddenly explodes
<daveshah> Oh yeah I fixed that in nextpnr-xilinx but not upstream
strongsaxophone has quit [Quit: Lost terminal]
<daveshah> Right, that bug should be fixed although looking at that progress I don't think it will succeed in any reasonable time anyway
<ZirconiumX> I mean, router1 was like 225 seconds
<ZirconiumX> ECP5 should handle this a bit better, at least
<ZirconiumX> Okay it's still struggling with ECP5
<ZirconiumX> And as soon as I say that, it fails
<daveshah> Possibly to do with connectivity as much as utilisation
<thardin> making them const& seems to have sped things up a bit. taking a closer look at IdString::in()
<ZirconiumX> I suppose that makes it a good benchmark then :P
<ZirconiumX> It's interesting that this routes on HX8K but not UM-45F
<daveshah> iCE40 does have more routing resources
<ZirconiumX> Sure, but the ECP5 is notably bigger
<thardin> daveshah: ooh
<ZirconiumX> So I thought it'd balance out
<daveshah> But the placer isn't routeability driven so won't necessarily space things out more
<daveshah> What utilisation is it ending up with for ECP5?
<ZirconiumX> 38% SLICE, 55% IO
<ZirconiumX> For UM-45F
<daveshah> Interesting, I've never seen an ECP5 design below 95% fail to route
<ZirconiumX> And a DCCA, but I don't know what that is
<daveshah> global buffer
<thardin> daveshah: changing in(IdString rhs) to const& might still be worth it
<daveshah> Yes, might be worth mentioning on that PR
<ZirconiumX> I noted that iCE40 seemed to promote more globals than ECP5 does
<daveshah> Yes, ECP5 global promotion is a bit conservative at the moment
Jybz has quit [Ping timeout: 246 seconds]
<thardin> I'll run some tests and see what callgrind says
<daveshah> it's been on my TODO list for a while
<ZirconiumX> If it fails to route on 85F I'm going to laugh
<daveshah> I'd be surprised if there was that much difference
<daveshah> it's not going to be placed any more spaced out
<ZirconiumX> 211k iterations so far and still trying
<daveshah> Well, this would be a good benchmark for routeability driven placement
<ZirconiumX> I didn't intend to make it *quite* this hairy ^^;;
<daveshah> If you want to try an experiment, you could reduce 0.9 here to 0.6 https://github.com/YosysHQ/nextpnr/blob/master/common/placer_heap.cc#L1720 to create a more spread out placement
<ZirconiumX> Well, I'm probably due for a nextpnr rebuild anyway
<ZirconiumX> 11 minutes into 85k route :P
<ZirconiumX> Aw drat, it just failed
<ZirconiumX> Info: 277000 | 98597 178402 | 406 594 | 1178| 5.02 799.11|
<ZirconiumX> daveshah: I think much of the spaced out solution that HeAP provides gets ""optimised"" by the SA refinement pass
<ZirconiumX> Since the total solution wire lengths after SA aren't that far apart
<daveshah> Yes, that could be a problem
<daveshah> You could go even lower with beta to see what that does
<daveshah> The SA refinement pass does ultimately have quite a small radius
<ZirconiumX> Well, I'll see if 0.6 routes, but I'm not confident because the difference post-SA is 2000 wirelen units
<thardin> changing to in(const IdString& rhs) reduces the number of calls to put_reference from 61.6M to 45.4M
<daveshah> thardin: that seems like a very worthwhile change then
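Why passing a reference-counted handle by `const&` cuts reference-count calls can be shown with a small stand-in. This is a sketch under stated assumptions, not the Yosys `IdString` implementation: `Handle`, `g_refcount_ops`, `in_by_value`, `in_by_ref` and `count_ops` are all hypothetical names, and the counter only models "one update per copy, one per destruction".

```cpp
#include <cassert>
#include <cstddef>

// Counts every simulated reference-count update (take or drop).
static std::size_t g_refcount_ops = 0;

// Hypothetical stand-in for a reference-counted interned-string handle.
struct Handle {
    int index;
    explicit Handle(int i) : index(i) { ++g_refcount_ops; }                  // take a reference
    Handle(const Handle &other) : index(other.index) { ++g_refcount_ops; }   // a copy takes one too
    ~Handle() { ++g_refcount_ops; }                                          // drop a reference

    // By value: each call copies the argument and later destroys the copy.
    bool in_by_value(Handle rhs) const { return index == rhs.index; }
    // By const reference: no copy is made, so no reference-count traffic.
    bool in_by_ref(const Handle &rhs) const { return index == rhs.index; }
};

// Run `calls` comparisons and return how many reference-count updates
// the calls themselves caused (construction of a and b is excluded).
std::size_t count_ops(bool by_value, int calls) {
    assert(calls >= 0);
    Handle a(1), b(2);
    std::size_t before = g_refcount_ops;
    for (int i = 0; i < calls; ++i) {
        bool hit = by_value ? a.in_by_value(b) : a.in_by_ref(b);
        (void)hit;  // the result is irrelevant; only the side effects are counted
    }
    return g_refcount_ops - before;
}
```

In this model `count_ops(true, n)` comes to two updates per call (one copy, one destruction), while `count_ops(false, n)` comes to none, which is the kind of saving the drop in `put_reference` calls reported above points to.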
<thardin> yeah
N2TOH has joined #yosys
<thardin> commented on PR 1767
<ZirconiumX> 0.6 still fails to route
<az0re> Does anyone know if there is any support at all for solving exists-forall problems directly in yosys (i.e. not yosys-smtbmc)? Nothing I have tried with the `sat` command has worked. Instead I get errors like: "ERROR: Failed to import cell $allconst$3 (type $allconst) to SAT database."
<thardin> std::map<RTLIL::Cell*, RTLIL::Cell*, CompareCells> sharemap(CompareCells(this)); in opt_merge.cc now sticks out. ~13% of runtime
<ZirconiumX> Eddie's beat you there too
<thardin> damn, I was just about to suggest unordered_map
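The `unordered_map` idea can be sketched as follows: instead of ordering cells with a comparator, hash the properties that make two cells interchangeable and look candidates up in a hash table. This is only an illustration of that idea, not the code in opt_merge.cc or in the PR ZirconiumX refers to; `Cell`, `CellKeyHash`, `CellKeyEq` and `merge_targets` are hypothetical and far simpler than real RTLIL cells.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, heavily simplified cell: a type name, input signal IDs, and an ID.
struct Cell {
    std::string type;
    std::vector<int> inputs;
    int id;
};

// Hash only the properties that decide whether two cells can be merged.
struct CellKeyHash {
    std::size_t operator()(const Cell *c) const {
        std::size_t h = std::hash<std::string>()(c->type);
        for (int in : c->inputs)
            h = h * 31 + std::hash<int>()(in);
        return h;
    }
};

// Two cells are equivalent when type and inputs match; the ID is ignored.
struct CellKeyEq {
    bool operator()(const Cell *a, const Cell *b) const {
        return a->type == b->type && a->inputs == b->inputs;
    }
};

// For each cell, return the ID of the first equivalent cell seen so far
// (its own ID if it is the first of its kind).
std::vector<int> merge_targets(const std::vector<Cell> &cells) {
    std::unordered_map<const Cell *, int, CellKeyHash, CellKeyEq> seen;
    std::vector<int> target;
    target.reserve(cells.size());
    for (const Cell &c : cells) {
        auto res = seen.emplace(&c, c.id);  // no-op if an equivalent cell exists
        target.push_back(res.first->second);
    }
    return target;
}
```

Lookups become average constant time instead of logarithmic, but every insertion now pays for hashing the key. That is the cost that shows up next in the profile once the map itself stops dominating.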
<thardin> guess I'll merge that locally and see what else sticks out
<thardin> *compiling*
<ZirconiumX> daveshah: hell, even beta = 0.2 fails to route, because of the SA pass.
<ZirconiumX> Though the difference from 0.2 to 0.6 is way bigger than 0.6 to 0.9
<daveshah> Interesting
<thardin> alright, with eddie's optimizations it's actually spending most of its time hashing cell parameters
<ZirconiumX> 0.9: 87323; 0.6: 89983; 0.2: 95205
<ZirconiumX> What happens with beta = 0?
<ZirconiumX> ...I was half expecting it to not converge
<daveshah> beta=0 means expand to fill up the chip, more or less
<ZirconiumX> Yeah, beta = 0 gives 95205 again which leads me to suspect it's the same solution as 0.2
<daveshah> Yeah, 0.2 means expand to 20% utilisation
<daveshah> As the design is 38% then anything below .38 will have little difference
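daveshah's explanation can be restated as a toy model: beta is the utilisation the placer spreads towards, and a design cannot be spread thinner than its own overall utilisation. The function below is an assumption made purely for illustration, not nextpnr code; `effective_spread_target` is a hypothetical name, and the real placer's behaviour below the threshold is described above only as "little difference", not "none".

```cpp
#include <algorithm>
#include <cassert>

// Toy model only: the density the placement can actually be spread to,
// given the requested beta and the design's overall utilisation (both 0..1).
double effective_spread_target(double beta, double utilisation) {
    assert(beta >= 0.0 && beta <= 1.0);
    assert(utilisation > 0.0 && utilisation <= 1.0);
    // Asking for less than the design's own utilisation cannot spread it further.
    return std::max(beta, utilisation);
}
```

With the 38% SLICE utilisation quoted earlier, beta = 0 and beta = 0.2 both map to 0.38 in this model, which is consistent with both runs reporting the same wire length of 95205.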
<ZirconiumX> Yep, beta = 0 fails to route
<ZirconiumX> So how do I disable the SA refinement pass?
<ZirconiumX> Found the function call
<ZirconiumX> Yep
<thardin> lots of methods in SigPool could also use some const&
<ZirconiumX> ...Dang, even after removing the SA refinement pass it still fails to route with beta = 0
<ZirconiumX> I think the design is just too difficult for ECP5 to route
<daveshah> Yup
<ZirconiumX> would -nowidelut help here?
<daveshah> Probably
<ZirconiumX> So, 24%, let's see how this goes.
<ZirconiumX> Wire length is already way down
<ZirconiumX> Nope :(
<ZirconiumX> Wonder how Quartus handles it.
<thardin> got tripped up by the way kernel/ gets copied on every build
<thardin> "why is my header change getting reverted?"
<ZirconiumX> "Info (170202): The Fitter performed an Auto Fit compilation. No optimizations were skipped because the design's timing and routability requirements required full optimization."
<ZirconiumX> Proud of that
<thardin> inlining some getters and making use of const& in some places reduces the instruction count of rmunused_module_signals() from 2800M to 2000M
<thardin> probably more overall
<thardin> >I have a patch for that incoming
<thardin> scooped again!
<ZirconiumX> You can try some openmp pragmas if you really want.