<adamgreig>
going to see how far i can push ice40's pseudo-lvds down utp
<whitequark>
that's a lot of FPGAs
<adamgreig>
with some sort of 8b10b and etc
<adamgreig>
i want to make a circuit switched network
<whitequark>
uhhhhh
<adamgreig>
yes yes "not far"
<whitequark>
ice40 cannot do clock recovery
<adamgreig>
well
<whitequark>
it's pointless to do 8b10b for the most part
<adamgreig>
I was hoping to use ddr on the gpio, ice40 at 100MHz, and the data clock is less
<whitequark>
I mean, you've run at least two pairs to each port, right
<adamgreig>
so you can oversample
<adamgreig>
yea, there's a tx and an rx pair on each port
<whitequark>
that sounds like it'll fail but I'm curious.
<swetland>
could do a dedicated diffpair clock and one or more data lanes then, no?
<swetland>
similar to MIPI CSI
<adamgreig>
(and power on the other pairs)
<whitequark>
yeah that's what I would do
<whitequark>
clock pair
<whitequark>
could still do half-duplex I guess
<adamgreig>
not easily; the ice40 lvds is hard wired to tx or rx
<whitequark>
oh right
<whitequark>
ok
<adamgreig>
you don't reckon you could do cdr with 4x oversampling?
<adamgreig>
not looking to push bandwidth or distance records here really
<adamgreig>
already not going to have equalisation and the diff voltage is small too
<whitequark>
i mean... with that level of oversampling, you could run, like, uart
<adamgreig>
well sure :p
<whitequark>
what's the point in 8b10b if you're not actually doing clock recovery?
<whitequark>
is it capacitively coupled even?
<adamgreig>
guess it still gives you some sort of framing
<adamgreig>
no, dc
<whitequark>
so
<whitequark>
you don't need dc balance
<whitequark>
you don't need guaranteed transitions
<whitequark>
you just use it as a framing with 20% overhead
<whitequark>
this is literally uart but more complex
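To make the comparison concrete, here is a minimal Python model of an 8N1 UART receiver working on a 4x oversampled line, as discussed above. It is only a sketch: the function names, the idle-high convention and the mid-bit sampling point are assumptions chosen for illustration, not anything taken from adamgreig's design.

```python
def uart_encode(byte, oversample=4):
    """Line samples for one 8N1 frame; each bit is repeated `oversample` times."""
    # Start bit (0), eight data bits LSB first, stop bit (1): 10 line bits per byte.
    bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    return [b for b in bits for _ in range(oversample)]


def uart_decode(samples, oversample=4):
    """Recover bytes from an oversampled, idle-high line."""
    out = []
    i = 0
    n = len(samples)
    while i < n:
        if samples[i] == 1:          # still idle (or inside a stop bit)
            i += 1
            continue
        mid = i + oversample // 2    # middle of the start bit
        last = mid + 9 * oversample  # middle of the stop bit
        if last >= n:
            break                    # incomplete frame at the end of the buffer
        if samples[mid] != 0:        # start bit did not hold: treat as a glitch
            i += 1
            continue
        byte = 0
        for k in range(8):
            byte |= samples[mid + (k + 1) * oversample] << k
        if samples[last] == 1:       # valid stop bit: accept the byte
            out.append(byte)
            i = last                 # resume the search inside the stop bit
        else:                        # framing error: slide forward one sample
            i += 1
    return out


line = [1] * 3 + uart_encode(0xA5) + [1] * 5 + uart_encode(0x3C) + [1] * 4
decoded = uart_decode(line)
```

Both this framing and 8b10b put ten bits on the wire for every eight data bits; the difference is only in what the two extra bits buy you.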
<adamgreig>
you can see how that's appealing, though?
<whitequark>
no?
<adamgreig>
fun to write an 8b10b enc/dec
<whitequark>
that's just a LUT
<adamgreig>
hmm
<whitequark>
i'd probably put it into a BRAM, even
<adamgreig>
well in any event the objective here was strictly to make some fpgas and experiment with connecting them
<adamgreig>
so really anything goes
<swetland>
did that on ZYBO to drive HDMI. sadly without OSERDES you can't really get the data you need for something like that
<adamgreig>
anyway uart also has 20% overhead ;)
<adamgreig>
if I'm going to transmit ten bits for each eight data bits, 8b10b seems like it'll be more fun than a start and stop bit
<whitequark>
uart gives you higher clock rate
<whitequark>
and less device utilization
<whitequark>
with everything else being equal
<swetland>
I think the only advantage to a symbol based system is that if you plug together two sides where one is constantly chattering you might avoid character mis-alignment
<whitequark>
indeed
<adamgreig>
honestly I'd do it just because I've implemented uarts in fpgas before
<adamgreig>
step one is the ethernet side anyway
<swetland>
if you want to run 100Mbps over a reasonable distance, ethernet PHYs are about $1, and RJ45 + Magnetics are about $4 (qty 1), and RMII is a 2bit/clk 50MHz interface, very easy to talk to with FPGAs
<adamgreig>
totally, I already have ethernet on this for "uplink"
<adamgreig>
the objective for the other side is having a synchronised system clock and circuit switched data though
<adamgreig>
which okay you could just send udp packets and maybe even use ptp
<swetland>
yeah, there is plenty of knowledge about how to do clock sync
pie___ has joined ##openfpga
pie__ has quit [Ping timeout: 268 seconds]
egg|egg is now known as egg|zzz|egg
azonenberg_work has quit [Ping timeout: 245 seconds]
unixb0y has quit [Ping timeout: 268 seconds]
unixb0y has joined ##openfpga
<whitequark>
siiiiigh
<whitequark>
so i'm gonna write a techmapper i think.
Miyu has quit [Ping timeout: 272 seconds]
catplant has joined ##openfpga
catplant has quit [Ping timeout: 250 seconds]
rohitksingh_work has joined ##openfpga
Bike has quit [Quit: Lost terminal]
prpplague has joined ##openfpga
<prpplague>
anyone know if the details for orconf2019 have been announced?
catplant has joined ##openfpga
catplant has quit [Ping timeout: 250 seconds]
azonenberg_work has joined ##openfpga
emeb has quit [Quit: Leaving.]
<whitequark>
daveshah: lmao what the fuck
<whitequark>
naive techmapping: 51 LUT
<whitequark>
naive techmapping followed by my opt_lut: 18 LUT
<whitequark>
abc: ............. 17 LUT
<whitequark>
this isn't even in C, this mostly just uses Yosys techmap pass...
azonenberg_work has quit [Ping timeout: 250 seconds]
<swetland>
ooh, I need to give this a try. yosys is using 60% more LUTs than icecube2
jevinskie has joined ##openfpga
<whitequark>
swetland: grab my other PR
<whitequark>
and try doing synth_ice40 -relut
<swetland>
717?
<whitequark>
717?
<whitequark>
oh yeah
<whitequark>
that one
jevinski_ has quit [Ping timeout: 268 seconds]
_whitelogger has joined ##openfpga
jevinski_ has joined ##openfpga
<swetland>
ERROR: timing analysis failed due to presence of combinatorial loops, incomplete specification of timing ports, etc.
genii has joined ##openfpga
<swetland>
w/ tot+717 (vs tot which works without complaint)
<whitequark>
interesting
<whitequark>
can you try to reduce the design?
jevinskie has quit [Ping timeout: 250 seconds]
<whitequark>
or, can you post it in the issue? yosys json or something like that
<swetland>
I can toss the json up right now and can poke at it a bit later and see if I can find a smaller failure case
<whitequark>
sure, that works
<swetland>
actually is the json (output from yosys) useful here?
<whitequark>
I think so yeah
<swetland>
interesting. only fails if I infer this 256x16b ram instead of invoking SB_RAM40_4K manually.
pie___ has quit [Quit: Leaving]
<whitequark>
interesting
<whitequark>
if you instantiate, does the design work?
<swetland>
provided I don't use -relut it does work
<swetland>
with -relut nextpnr fails
<swetland>
without -relut both inferred and instantiated versions of the design work. with -relut instantiated version will not pass nextpnr, but inferred version does and also works
<whitequark>
what fails exactly?
<whitequark>
timing?
<whitequark>
wait
<whitequark>
with -relut instantiated version will not pass nextpnr, but inferred version does and also works
<whitequark>
I'm confused
<whitequark>
didn't you just say the opposite of this?..
<swetland>
sorry, I may have misspoken. if I infer the ram, nextpnr succeeds whether or not I used -relut with yosys synth_ice40 and both resulting bitfiles work
<swetland>
if I instantiate the ram nextpnr only succeeds if I do not use -relut, and the resulting bitfile works
<q3k>
whitequark: or you know, karnaugh maps if you did that manually :)
<whitequark>
oh!
<q3k>
i'm not sure it makes sense to run that per-lut (especially on narrow 4luts)
<whitequark>
yes, probably not per lut
<tnt>
per-lut ... at best you'd find useless inputs.
<q3k>
yeah
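As a small illustration of tnt's point, this is roughly what a per-LUT check can find: inputs that never affect the output. The sketch below assumes a LUT is described by an integer init value whose bit `a` is the output for input pattern `a`; the function name is made up for this example and is not a Yosys API.

```python
def unused_inputs(init, width):
    """Indices of LUT inputs that never change the output."""
    unused = []
    for i in range(width):
        # Input i is redundant if flipping it leaves the output unchanged
        # for every possible input pattern.
        if all(
            ((init >> a) & 1) == ((init >> (a ^ (1 << i))) & 1)
            for a in range(1 << width)
        ):
            unused.append(i)
    return unused


# A LUT4 that only computes I0 XOR I2; I1 and I3 are dead inputs.
xor_init = sum((((a >> 0) ^ (a >> 2)) & 1) << a for a in range(16))
dead = unused_inputs(xor_init, 4)
```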
<whitequark>
might still be valuable
<whitequark>
but not very generic
<tnt>
I'm not sure how a Karnaugh map helps to map an N-input comb function to a minimal amount of LUT4s (and then ... what do you consider minimal, depth? or total # of LUTs)
catplant has joined ##openfpga
<whitequark>
tnt: can you give me the json that needs setundef?
<daveshah>
whitequark: Couldn't resist experimenting with the topological ordering idea
<tnt>
whitequark: it's the verilog source that creates the issue
<tnt>
(well ... a minimal test case)
<whitequark>
tnt: oh thanks!
<daveshah>
Doesn't help boneless much sadly
<whitequark>
daveshah: oh it's okay, boneless has a real awful alu i think
<whitequark>
i mean
<whitequark>
this whole thing grew out of me trying to make a less bad alu for boneless
<whitequark>
and discovering that yosys generates absurdly bad output for it
<whitequark>
and fixing that
<daveshah>
boneless is down to 713 vs 745
<daveshah>
without abc
<whitequark>
that's actually pretty good
<whitequark>
that's approaching abc quality, which is 669
<daveshah>
482 for me?
<whitequark>
oh, LUTs
<whitequark>
not total cells
<daveshah>
yeah
<whitequark>
ok sure
<whitequark>
still a nice improvement
<whitequark>
what about -abc -relut?
<daveshah>
gives me 463 LUTs
<daveshah>
with the topological ordering, it seems to converge (in the noabc case) after two runs of -relut
<daveshah>
don't know if that is different to before
<whitequark>
oh, that's a bug i'm about to fix
<whitequark>
it should converge immediately
<tnt>
Damn, the default yosys output for that minimal example is really bad ... I mean, there are 3 LUT-1 following each other ...
<daveshah>
picorv32 does pretty well without abc. 1953 LUTs without compared to 1538 LUTs with (so only about 27% overhead)
* daveshah
eats hat....
rohitksingh_work has quit [Ping timeout: 268 seconds]
rohitksingh_work has joined ##openfpga
<daveshah>
but Fmax is 16MHz compared to 56MHz with abc
<whitequark>
yes, I've noticed that Fmax gets pretty bad
<whitequark>
there should probably be some kind of K-map based (?) logic rebalancing (?)
<daveshah>
Yes, it's definitely the rebalancing that's the issue
<whitequark>
I mean, that could probably be done naively, even
<daveshah>
This might be as simple as a heuristic when merging LUTs to start with
<whitequark>
oh, yeah!
<daveshah>
Just try and merge the one with the larger path length
<whitequark>
bleh, probably need to base gate2lut PR on opt_lut PR...
<whitequark>
kind of messy
<whitequark>
or, hm
<whitequark>
hmmmm
m4ssi has joined ##openfpga
<whitequark>
daveshah: take a look at what i just pushed
<daveshah>
yeap
<whitequark>
converges immediately now?
<whitequark>
or did i miss something?
<whitequark>
seems to converge right away here
<daveshah>
Yes, looks good
<daveshah>
I think that should always converge fine now
<whitequark>
let me add some stats to opt_lut while I'm at it.
<whitequark>
oh, this fails a test...
<whitequark>
ah I think this is the same issue tnt hits
<whitequark>
daveshah: ok, figured the cause i think
<whitequark>
daveshah: Found top.$abc$163$auto$blifparse.cc:492:parse_blif$187 (cell A) feeding top.$auto$alumacc.cc:474:replace_alu$19.slice[2].adder (cell B).
<whitequark>
Cell A is a 1-LUT. Cell B is a 3-LUT. Cells share 0 input(s) and can be merged into one 3-LUT.
<whitequark>
Not combining LUTs into cell A (cell B has attribute \lut_keep).
<whitequark>
Combining LUTs into cell B.
<whitequark>
Connecting input 0 as \d [2].
<whitequark>
Leaving input 1 as \c [2].
<whitequark>
Leaving input 2 as $abc$163$n52.
<whitequark>
Leaving input 3 as $auto$alumacc.cc:474:replace_alu$19.C [2].
<whitequark>
this is... an off by one of some sort?
<whitequark>
ok I think I see
jevinskie has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<whitequark>
tnt: can you recheck?
<whitequark>
I think I fixed all the bugs you've hit
<tnt>
whitequark: sure
<tnt>
whitequark: seems to work :) builds and the bitstream appear to operate properly on the device.
<whitequark>
wonderful :D
<whitequark>
tnt: what about timing? how bad is it?
<tnt>
It really didn't change anything wrt timing.
<tnt>
I mean on that particular design it only combined 4 LUTs out of 260.
<whitequark>
ah ok
<tnt>
I tried another where it combined a bit more LUTs but they were not in the critical path either.
<tnt>
whitequark: you only consider LUT -> LUT connections where there is only 1 user of the signal ?
<whitequark>
tnt: yes
<whitequark>
it might make sense to consider more than that, e.g. 1-LUTs can *always* be folded
<whitequark>
yeah, definitely, that would be a significant improvement
<tnt>
yeah, I was looking at a couple of netlists and I saw plenty of cases of 1- or 2-LUTs feeding other 2/3-LUTs ...
<tnt>
the original one has to be kept because sometimes the signal goes elsewhere where it can't be folded, but that would still be cutting the path for the other signals, at the expense of a higher fanout ...
<whitequark>
right
<whitequark>
hm it might make sense to do that as a part of a more general pass...
<cr1901_modern>
How can you merge a 1-LUT and a 3-LUT into a 3-LUT when none of the inputs are shared?
<whitequark>
cr1901_modern: no, it's a different case
<whitequark>
it's a case of 1-LUT feeding a 3-LUT and something else
<whitequark>
merging 1-LUT into this 3-LUT *and* keeping the original 1-LUT trades fanout for logic levels
<whitequark>
this should be almost always advantageous
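The folding described here can be sketched on truth tables. The Python below is an illustrative model only, not the opt_lut implementation: a LUT is an init integer plus a list of input net names, and `merge_luts` and the net names are hypothetical.

```python
def lut_eval(init, values):
    """Output of a LUT whose bit `idx` of `init` is the result for pattern `idx`."""
    idx = sum(v << i for i, v in enumerate(values))
    return (init >> idx) & 1


def merge_luts(a_init, a_inputs, b_init, b_inputs, a_output):
    """Fold LUT A into LUT B, where net `a_output` drives one of B's inputs."""
    # B keeps its other inputs and gains whichever of A's inputs it lacks.
    merged_inputs = [n for n in b_inputs if n != a_output]
    for n in a_inputs:
        if n not in merged_inputs:
            merged_inputs.append(n)
    merged_init = 0
    for idx in range(1 << len(merged_inputs)):
        env = {n: (idx >> i) & 1 for i, n in enumerate(merged_inputs)}
        # Recompute A's output, then evaluate B with it.
        env[a_output] = lut_eval(a_init, [env[n] for n in a_inputs])
        if lut_eval(b_init, [env[n] for n in b_inputs]):
            merged_init |= 1 << idx
    return merged_init, merged_inputs


# A: 1-LUT inverter on sig_in, driving net n1. B: 3-LUT computing n1 & x & y.
init, inputs = merge_luts(0b01, ["sig_in"], 0x80, ["n1", "x", "y"], "n1")
```

The merged cell is still a 3-LUT, now computing `~sig_in & x & y` directly from `sig_in`. If net `n1` has other users, the original 1-LUT stays in place for them, which is the fanout-for-logic-levels trade mentioned above.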
<whitequark>
gonna try that soon
s_frit has joined ##openfpga
<daveshah>
whitequark: Small issue with the LUT merging stuff
<daveshah>
If a CARRY input is 1'b0, then the corresponding LUT input needs to stay 1'b0 too
rohitksingh_work has quit [Read error: Connection reset by peer]
<daveshah>
it seems this is not being preserved and creates a monstrous carry chain full of legalisation LCs which breaks nextpnr on picorv32
<whitequark>
lol
<cr1901_modern>
I understand the fanout decreases if it's merged, but what do you mean by "trades fanout for logic levels"?
<whitequark>
can you reduce a testcase?
<daveshah>
sure
<whitequark>
cr1901_modern: fanout *increases*
<cr1901_modern>
How does it increase? 1-LUT is no longer driving the 3-LUT if it's merged.
<cr1901_modern>
Oh, whatever was driving the 1-LUT has its fanout increase tho...
<whitequark>
yes.
<whitequark>
actually
<whitequark>
in case of 1-LUT that doesn't increase the fanout at all
<whitequark>
it just moves things around
* cr1901_modern
nods
<whitequark>
daveshah: can you rebase your branch btw?
<whitequark>
I refactored opt_lut a bit, to use a worker
<cr1901_modern>
so what did you mean by the "logic levels" part then?
<whitequark>
this should go nicely with timing reports... once proc learns to not assign some dumbass internal names
<whitequark>
that is on my shortlist
<whitequark>
i want to have ZERO $fuckyou$ names in the reports.
<tnt>
cr1901_modern: well imagine sig_in -> LUT1 -> LUT3 -> sig_out .. if you merge the LUT1 function into the LUT3 and you get sig_in -> LUT3 -> sig_out (and in parallel you may still have sig_in -> LUT1 -> other places that signal went).
<tnt>
cr1901_modern: you reduced the depth of the path from sig_in to sig_out but you increased the sig_in fanout.
* cr1901_modern
nods
<whitequark>
tnt: however you decreased LUT1 fanout
<whitequark>
so in this case it's even
<whitequark>
now, if you are merging LUT2 to LUT3, it is not as clear cut
<tnt>
sure ... but the delay on the net depends on the fanout of that net, not the total fanout of the whole fpga.
<tnt>
so propagation time for sig_in are a bit worse.
<tnt>
(tbh, I'm not sure if that works like that on the ice40, I'm just basing that on my experience of xilinx where nets driving lots of loads are slower)
<sorear>
whitequark: it completely destroys buffer trees though
<daveshah>
The I2(1'b0) should be preserved so the carry and LUT can be packed
<daveshah>
will sort out rebase in a bit
<whitequark>
sorear: can you elaborate?
<whitequark>
daveshah: so... the constraint here is that I2 must be the same as I2.
<whitequark>
er.
<whitequark>
lut_i.I2 must be the same as carry_i.I1.
<sorear>
*finds keyboard*
<sorear>
let's say you have a signal with a fanout of 256. maybe a clock or a reset
<daveshah>
whitequark: yes, ditto with I1 and I0
<sorear>
an electrical fanout of 256 will be *extremely slow* because you have far more capacitance on the output than a gate is designed to drive
<sorear>
but if you turn it into a 4-level tree of inverters, each with a fanout of 4, you have a faster circuit
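The arithmetic behind sorear's numbers is just repeated multiplication: each level of the tree multiplies the number of reachable loads by the per-stage fanout. A tiny sketch, with a made-up function name and no delay model at all:

```python
def buffer_tree_levels(loads, fanout):
    """Smallest number of buffer levels so that fanout**levels >= loads."""
    levels = 0
    reach = 1
    while reach < loads:
        reach *= fanout
        levels += 1
    return levels


levels_for_256 = buffer_tree_levels(256, 4)
```

For 256 loads and a per-stage fanout of 4 this gives 4 levels, since 4^4 = 256.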
<whitequark>
daveshah: ooooh, so effectively... the inputs bound to SB_CARRY should not be considered "free" like normal constant inputs
<whitequark>
and should not be used for reencoding
<whitequark>
that's definitely doable
<daveshah>
Yes
<sorear>
of course multipass optimizers quite frequently do "pessimize something in pass A that you know pass B will clean up", and it probably makes more sense to do this kind of selective duplication after logic optimization (possibly even combined with placement)
<daveshah>
Probably best as a attr on SB_CARRY
<whitequark>
daveshah: are you sure?
<sorear>
so I'm not saying abc/relut would be *wrong* to do this, merely that it's not *prima facie optimal*
<whitequark>
daveshah: oh hm, is this because SB_CARRY can be optimized out?
<whitequark>
sorear: but modern FPGAs have routing buffers instead of routing pass transistors
<whitequark>
so in effect you have buffer trees whether you want it or not, no?
<sorear>
yes, but I was on a terrible phone keyboard and thought "buffer tree" sufficiently implied "ASIC"
<whitequark>
oh!
<whitequark>
I have no idea about anything related to ASICs
<whitequark>
and besides
<whitequark>
opt_lut is not intended for ASIC flow?
<whitequark>
in fact, *abc* is probably good for ASIC flow, it does area optimizations and stuff
<whitequark>
I mean, I assume it is good for at least something. definitely not FPGA flow.
<tnt>
lol
<sorear>
pretty sure I've heard boomcpu complain about critical paths and net naming
<whitequark>
there are 2 kinds of people: those who complain about critical paths and net naming, and those who suffer silently.
<whitequark>
daveshah: now that I think about it... might be a better idea to ditch attributes entirely
<whitequark>
and have something like...
<whitequark>
-dlogic SB_CARRY:1=I0:2=I1:3=CI
<whitequark>
daveshah: this could help ecp5 too, maybe?
Miyu has joined ##openfpga
scrts has joined ##openfpga
rohitksingh has joined ##openfpga
rohitksingh has quit [Ping timeout: 250 seconds]
<daveshah>
whitequark: looks good
<daveshah>
The problem with ECP5 is that the CCU2C carry primitive is 2 LUT4s with the output XORd with carry for the sum output and 2 LUT2s sharing inits with the bottom of the LUT4s plus some add and ors to generate carry
<daveshah>
It's a pretty tricky one to optimise or even split
<whitequark>
now just need to wire it to avoid disturbing those
<whitequark>
ok, I *think* I'm done.
<whitequark>
daveshah: oh holy shit
<whitequark>
this *really* improves timing *dramatically*
<whitequark>
like by 10 MHz
<daveshah>
sweeet
<daveshah>
I guess the timing problems before might have been excessive feedthroughs being inserted
<whitequark>
yeah
<whitequark>
let me check with -noabc too
<daveshah>
The Yosys/nextpnr changes over the last month must mean we are close to a 30-40% improvement in timing overall by now
<whitequark>
that's a lot
<whitequark>
this would make UP5K Glasgow actually usable :D
<daveshah>
next big step will be vpr-style criticality driven placement
<daveshah>
I might play with that now in fact
<daveshah>
I'm not sure if that will actually lead to an overall improvement in performance, or just make the opt-timing pass redundant
ZipCPU|Laptop has quit [Ping timeout: 245 seconds]
<daveshah>
The other thing I want to try is swapping macros, at the moment I think the inability to perform swaps after constraint legalisation limits Fmax with carrys
<daveshah>
without macro swapping support, LUTCascade will probably cause a step back in QoR too
<whitequark>
daveshah: ah no, I misread the report
<daveshah>
:(
<whitequark>
doesn't seem to lead to that much of an improvement in timing, sadly
<daveshah>
definitely not the first time I did that
<whitequark>
ok
<daveshah>
once I remember thinking that I had like a 30% increase in Fmax
<daveshah>
turns out I was comparing hx8k against lp8k
<tnt>
whitequark: is it on your repo already ?
<tnt>
daveshah: lol
<whitequark>
lol
<sorear>
improve timing 30% with this one weird trick
<whitequark>
daveshah: can you check if this actually works as intended?
<whitequark>
just pushed
<daveshah>
sure
GuzTech has quit [Quit: Leaving]
<whitequark>
daveshah: I looked at your MCVE and it looks like there's no actual change if I run opt_lut on it at all?
<whitequark>
I mean
<whitequark>
it has one LUT
<whitequark>
opt_lut would not change it...
<daveshah>
It should have two LUTs
<daveshah>
opt_lut was previously illegally merging those two
<whitequark>
oh, `a+b`
<whitequark>
oh sorry
<daveshah>
yeah
<whitequark>
let me recheck
<daveshah>
2 LUT4s, looks good
<whitequark>
hm, the log is a bit confusing
<whitequark>
let me tweak it a bit
<daveshah>
picorv32 example seems to work fine too now
<daveshah>
:)
<whitequark>
:D :D
<whitequark>
so, what changed? fmax before/after? lc before/after?
<whitequark>
is this -noabc or?
<daveshah>
No -noabc
<daveshah>
But a big jump in timing
<daveshah>
from 67MHz average without -relut to 72MHz with
<whitequark>
ooooh wow
<daveshah>
let me test on a soc design to make sure it still works on hardware
<whitequark>
I test on hardware periodically, seems to work still
<daveshah>
cool
<daveshah>
just want to test it together with my nextpnr carry changes
<daveshah>
That example that's at 72MHz now was pretty much stuck around 52MHz for a long time
<whitequark>
yeah, definitely curious
<whitequark>
oh wow
<daveshah>
like until a few weeks ago
<daveshah>
I don't think I even have min_ce_use in there, so it can probably get even better
<daveshah>
But I know opt-timing and the nextpnr carry changes each added about 10%
<whitequark>
what is opt-timing?
<daveshah>
It's a post-placement pass that uses a fairly odd algorithm to minimise the critical path
<daveshah>
basically, a BFS of neighbour bels of critical path bels
<daveshah>
hardware test is working (design is a picorv32 soc, qspi controller, and CSI-2 interface if you are curious)
<daveshah>
that design is now getting 24MHz on an ultraplus
<daveshah>
which is pretty good
<whitequark>
Cell A is a 3-LUT with 3 dedicated connections. Cell B is a 2-LUT.
<whitequark>
Cells share 0 input(s) and can be merged into one 4-LUT.
<whitequark>
Not combining LUTs into cell B (combined LUT wider than cell B).
<whitequark>
Combining LUTs into cell A.
<daveshah>
oops, forgot to add relut to the syn script for that hardware test
<daveshah>
let me actually check again
<daveshah>
yep, still works
<whitequark>
:D :D
<whitequark>
any change in fmax?
<daveshah>
dropped to 22MHz
<whitequark>
or is it just size?
<whitequark>
huh
<whitequark>
average?
<daveshah>
this is one run
<daveshah>
unlike the previous test
<daveshah>
let me run some proper 16-run comparisons on this design too
<daveshah>
size drops from 3371 LCs to 3311 LCs
dingbat has quit [Quit: Updating details, brb]
dingwat has joined ##openfpga
<tnt>
I tried it on a couple of designs here (over 10 runs each). Doesn't seem to affect F_avg / F_max (it's within the noise ... < 1 MHz variation on a 70 MHz design)
dingwat has quit [Client Quit]
dingwat has joined ##openfpga
<daveshah>
I dare say, this is where a Threadripper was a good buy :P
<tnt>
~ 5 % less LUTs
<whitequark>
I'm guessing the critical path is some sort of long carry chain
<daveshah>
difference is in the noise here too
<daveshah>
with relut: min = 23.45 MHz, avg = 25.30 MHz, max = 27.32 MHz
<daveshah>
without relut: min = 24.22 MHz, avg = 25.35 MHz, max = 27.05 MHz
<miek>
i'm having some trouble bringing up a Glasgow revB - `glasgow factory` seems to read back all 0s from the eeprom but `fx2tool` suggests it programmed ok? https://pastebin.com/raw/ZLjxQWce
<sensille>
shapr: i just used ghc and looked at some results
<tnt>
Is there such things as gearboxes ICs that take 2 * 5G serdes and make a 10G one ?
<whitequark>
miek: hm, interesting
<sensille>
shapr: but it might be too offtopic for this channel
<shapr>
sensille: in general (very broad brush strokes) , naive straightforward Haskell runs in about twice the time of naive straightforward C or C++
<shapr>
I'd argue that naive straightforward Haskell takes less than half the human thinking time, compared to C or C++, to implement the same solution.
<shapr>
sorear: you have experience on both sides, what do you think?
<sensille>
what i really need to understand is copying data vs. manipulating in place
<shapr>
I'd really like to see someone solving the Advent of Code puzzles on an FPGA
<shapr>
(going back on topic)
<shapr>
sensille: want to try #haskell-beginners or just #haskell for this topic?
<miek>
whitequark: same results with that firmware
<whitequark>
miek: very strange
<sensille>
shapr: i haven't even read one third of the book, so definitely -beginners :(
<whitequark>
might actually (gasp) look at floorplanning
<miek>
so i checked/reflowed a bunch of stuff but no joy. i haven't got anything around to decode easily, but the waveform is identical on the scope between using `glasgow flash` (all 0s) and `fx2tool read_eeprom` (correct readback)
<whitequark>
miek: very strange
<whitequark>
unfortunately, i don't really know how to help you at the moment
<whitequark>
i'll let you know if i have ideas, or ping me in a few days
<miek>
ok, no worries, i'll keep playing around. cheers for the help so far
<whitequark>
daveshah: so, like half of the boneless cpu design is attributed to the FSM
<whitequark>
cells wise
<whitequark>
and nets
<daveshah>
whitequark: PR looks good, thanks
<daveshah>
Interesting that the FSM is so significant even with a 16 bit datapath
<miek>
oh good it gets stranger, wireshark shows the correct data coming in
<whitequark>
miek: ohhhhh
<whitequark>
now *this* is something i know
<whitequark>
try updating your python-libusb
<whitequark>
miek: are you using the one in debian by any chance?
<miek>
ubuntu, but yeah. i just installed one from pip and it works! thanks!
<tnt>
Yeah T: glasgow.device.hardware: USB: BULK EP8 IN data=<dcf3b8e771cfe29ec43d8 .....
<whitequark>
ok
<whitequark>
your hardware is likely fine
<whitequark>
this is probably my shitty FX2 arbiter then
<whitequark>
I really need to rewrite it and, I dunno, add tests...
<SolraBizna>
why route when you can have a 128-layer board and each signal its own plane
<tnt>
whitequark: Do you use the IO registers ?
<whitequark>
tnt: yes
<whitequark>
before that it barely worked
<tnt>
yeah not surprising, timing would be highly dependent on the PnR results.
<whitequark>
i need a model of fx2 in migen...
<tnt>
why would it affect only D1 though ?
<whitequark>
no idea
<whitequark>
i have not observed this particular failure
<whitequark>
can you try hmmm
<whitequark>
tnt: can you locally modify migen to pass --randomize-seed to nextpnr
<whitequark>
and see if that changes things
<tnt>
Yeah it seems it does
<tnt>
Is there a way to force a rebuild ?
<whitequark>
yes
<whitequark>
--rebuild :p
<tnt>
It actually works most of the time ... I guess just not with the default seed in my particular machine.
<whitequark>
tnt: so this is a timing issue... bleh
<whitequark>
:S
<whitequark>
i was afraid of that
<tnt>
nextpnr doesn't really analyze the path to/from D_{IN,OUT} as part of the sync logic. It doesn't seem to know when IO registers are enabled or not.
<daveshah>
Yes, that needs fixing
<daveshah>
It will count as the $async <-> clock paths though
<whitequark>
ohhhh
<tnt>
yeah, that's how I know it doesn't work atm :) because I see those path in <async>
<daveshah>
I'm not convinced icetime handles them entirely correctly either
<daveshah>
However, if the delay in the <async> path is still less than the clock period then it's not a problem
<daveshah>
If it is, then that will be it
<whitequark>
daveshah: there is also setup/hold timing of fx2
<whitequark>
which is rather complicated.
<daveshah>
Yes, it is on my masters todo list to look at this kind of stuff in nextpnr
<daveshah>
But that won't be until next year now
<whitequark>
the fx2 timing is nightmarish in places
<whitequark>
because it has setup/hold timings... longer than one clock cycle
<whitequark>
like, what?
<daveshah>
yeah that's crazy
<tnt>
whitequark: mmm ...
<tnt>
whitequark: instead of having SB_IO followed by a SB_GB, can't you use SB_GB_IO ?
<whitequark>
tnt: where?
<whitequark>
also, is that actually different?
<tnt>
Yes.
<whitequark>
shit
<whitequark>
ok fine
<daveshah>
Yes SB_IO, SB_GB adds a bit of fabric routing
<tnt>
As is, the clock will be routed to the fabric and brought to a random SB_GB depending on placement.
<tnt>
which means the clock phase will vary run to run.
<whitequark>
ughhhhhh
<daveshah>
Seems that the ice40up5k input register has a whole 4ns of its own setup time
<daveshah>
And clock to out of 1.5ns
<daveshah>
Just the pin and register excluding global network etc
<whitequark>
daveshah: what the fuck
<Richard_Simmons>
I'm seeing more and more of these Gowin fpgas, yet I still know nothing about them
<whitequark>
this makes the benchmark applet fail on my glasgow
<tnt>
It's weird how that one board seems to behave so differently from other people's ... (and from the other board I built). First the FX2 LEDs and now this :p
<whitequark>
well, yeah
<whitequark>
it's interesting
<whitequark>
tnt: think you can try and abuse the design a bit in that branch?
<whitequark>
if it works decently enough i might just merge it...
<whitequark>
or write a model...
<tnt>
You could try to run the design in icecube to get the "official" timing number for the IO (i.e. how much sys_clk is delayed compared to the clock at the io pin, and how much setup/hold is expected on each pin and the clk to out etc ...)
<whitequark>
I don't even have icecube
<tnt>
Ah :) Well, I can give it a shot.
<tnt>
Is there an option to save the .v / .pcf ?
<tnt>
CTRL-C during nextpnr works :p
<whitequark>
`glasgow build -t v`
<whitequark>
just for verilog
<whitequark>
or
<whitequark>
`glasgow build -t zip`
<whitequark>
for the entire design
<whitequark>
caution: zipbomb
<tnt>
E2792: Instance SB_IO_18 incorrectly constrained at SB_IO_OD location
<tnt>
damn
<tnt>
wtf ... they removed all the underscore in the ports names from SB_IO to SB_IO_OD ...
Bike has joined ##openfpga
ZipCPU|Laptop has quit [Ping timeout: 240 seconds]