<awygle>
"the code is moved into hardware" is an interesting description
<awygle>
i get what they're saying but still
<awygle>
it looks like you might be able to slot cxxrtl into place in this flow where Verilator currently lives?
<awygle>
i guess that wouldn't be hugely useful tho
<whitequark>
why not?
<awygle>
in the context of nmigen it seems like what you'd want is an nmigen frontend, not a cxxrtl middle-end
<whitequark>
ohh, i misunderstood what you wanted
<awygle>
what did you think i wanted? i might want that too :p
<awygle>
hm i'd never heard of "ujprog" either
<whitequark>
awygle: using cxxrtl as a jit backend
<whitequark>
given that it *almost* supports proper separate compilation, not inconceivable
<awygle>
oh, yes, sure
<awygle>
we should do that lol
<awygle>
the more i read this the more interesting cascade is
<awygle>
>> Cascade ... can target [the ULX3S'] reprogrammable fabric to improve virtual clock frequency for most applications.
futarisIRCcloud has joined #nmigen
<whitequark>
at which point does it stop being a JIT compiler and become a synthesizer with an ILA?
<whitequark>
i don't quite get it
<awygle>
i _think_ they mean "jit compiler" as "to a bitstream"
<awygle>
i also misunderstood at first
<whitequark>
hrm
<awygle>
but then why verilator
<sorear>
I think they mean it in the sense of "tiered compilation"
<awygle>
*confused*
<whitequark>
okay that *is* interesting
<whitequark>
but also confusing
<sorear>
using a SW sim as tier 1, and PnR as tier 2
<awygle>
actually i think it's
<awygle>
pure SW sim -> verilated compiled -> PnR
<whitequark>
right so they have deopt support, right?
<whitequark>
that's a lot of fun
<sorear>
except that there's no profiling and no reasonable way to split the design anyway, so you just migrate the whole thing when the background compile finishes
<awygle>
"deopt"?
<whitequark>
deoptimization
<sorear>
yes, if I'm reading the readme right they deopt for $printf etc
<awygle>
the "virtualization tasks" section looks like it would be quite nice for my interactive simulator dream
<awygle>
like, it's not quite that, but it's quite similar
Stary has quit [Ping timeout: 246 seconds]
Stary has joined #nmigen
<TD-Linux>
awygle, ujprog is the tool used to program the ulx3s via the ft2232h on board
<TD-Linux>
(it is somewhat difficult to make it work correctly also)
<awygle>
interesting
<awygle>
Why are all jtag api things bad
<whitequark>
awygle we have an entire channel literally dedicated to that
<awygle>
... we do?
<whitequark>
#glasgow ;p
<awygle>
ah :p
<awygle>
i thought you might mean that
<awygle>
different problem tho no?
<whitequark>
eh
<whitequark>
not very serious here
<awygle>
mhm
<awygle>
oh speaking of
<awygle>
i saw a comment on the glasgow (?) issue tracker that said you weren't interested in using libjtaghal and that the glasgow native support was strictly superior (i think)
<awygle>
i was curious about that
<awygle>
libjtaghal seems excessively complex to me, but i am interested in what you found objectionable (or if you did)
<whitequark>
awygle: mh, i might have worded that poorly
<whitequark>
there were a few realizations mixed up there
<whitequark>
first, it turns out i did not really need libjtaghal for... well, jtag. at the time i did not understand jtag very well. i do now. it is beautiful and not really hard to use
<whitequark>
second, interacting with glasgow from foreign c++ code is hard because glasgow, the USB device, doesn't (yet?) have a "stable ABI"
<TD-Linux>
I mean, I use the glasgow as my ecp5 jtag adapter of choice...
<whitequark>
third, it turned out that heavy vertical integration in glasgow gives almost exponential benefits
<awygle>
i see
<TD-Linux>
I actually find ice40 spi flashing more obnoxious because you have to hold reset and not all spi programmers support doing that
* cr1901_modern
has a use for stable glasgow USB interface in the mid-future (few months from now?)
<_whitenotifier-9>
[nmigen] codecov[bot] edited a comment on pull request #364: Fix `_yosys_version()` - https://git.io/JfkMQ
<whitequark>
Sarayan: that has "68000 gates", right? or more like 2/3 that amount
<_whitenotifier-9>
[nmigen] codecov[bot] edited a comment on pull request #364: Fix `_yosys_version()` - https://git.io/JfkMQ
<whitequark>
mh no, 1.2 mil
<whitequark>
it's really hard to make any sound prediction, but i'd expect you to be able to fit it into a larger FPGA
<whitequark>
not sure about the mister specifically
<Sarayan>
yeah
<Sarayan>
cyclone V, the one with a dual-core arm in
<daveshah>
I think what mostly affects the size of an FPU is how microcoded/multicycle it is
<daveshah>
The Rocket FPU is pretty large (needing pretty much an Artix-7 100T for SoC+FPU, whereas SoC on its own is fine in an ECP5 45k)
<daveshah>
but I think that their implementation is fairly inefficient
<Sarayan>
if the fpu has no sin() and friends, is there anything to microcode in the first place?
* whitequark
. o O ( bit-serial FPU )
<Sarayan>
oh damn, I nerd-sniped wq, sorry
<daveshah>
Division might well benefit from some kind of microcoding
<whitequark>
oh no, i'm not olofk :p
<MadHacker>
There's plenty FP emulators on 8 bit micros, so that sets an upper bound for how bad it can be. You can always implement it as a tiny 8-bit micro.
<daveshah>
Yeah, a picorv32 or VexRiscv would be even easier and sets an upper bound for a "microcoded" FPU (~2k LUTs)
<Sarayan>
true. Note that 8bit micros are not ieee usually
<_whitenotifier-9>
[nmigen] hofstee commented on issue #363: Can I create an active-low (asynchronous) reset? - https://git.io/JfkMx
<Sarayan>
the 68k itself is microcoded for the "normal" instructions
<Sarayan>
I really wonder how small one can make a 68040-equivalent while keeping similar performance
<Sarayan>
I guess I'll start on the integer instructions when I'm bored
<_whitenotifier-9>
[nmigen] whitequark commented on issue #363: Can I create an active-low (asynchronous) reset? - https://git.io/JfkDv
<_whitenotifier-9>
[nmigen] whitequark edited a comment on issue #363: Can I create an active-low (asynchronous) reset? - https://git.io/JfkDv
<daveshah>
I think it is usually about 6-10 ASIC gates to the LUT used in an ASIC emulation context
<whitequark>
so 120-200k LUT?
<daveshah>
Perhaps less, as it is 1.2M transistors not gates
<whitequark>
not that big, but not ecp5 sized either
<whitequark>
ah, right
<daveshah>
and memories will be more efficient than that
<daveshah>
ditto DSPs if there are any multiplies in there
<Sarayan>
note that a large part of these transistors are just the caches
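The LUT estimate above is simple arithmetic and can be reproduced as a rough sketch. The helper name `lut_range` is made up for illustration, and the 1.2M figure is treated as a gate count here, which overestimates: as noted, it is a transistor count and a large part of it is cache.

```python
# Back-of-the-envelope LUT estimate from the figures quoted in the discussion.
# Assumption: 1.2M is taken as "gates", although it is really transistors.

def lut_range(gates, gates_per_lut=(6, 10)):
    """Return (low, high) LUT estimates for a given ASIC gate count."""
    low_ratio, high_ratio = gates_per_lut
    # More gates absorbed per LUT means fewer LUTs, so the larger ratio
    # gives the lower bound and the smaller ratio gives the upper bound.
    return gates // high_ratio, gates // low_ratio

low, high = lut_range(1_200_000)
print(f"{low:,} to {high:,} LUTs")
```

Memories and DSP blocks map far more efficiently than this ratio suggests, so the real number would likely land below the lower bound.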
<Sarayan>
I have a feeling it would be fun to reimplement old workstations on fpga, and one only needs external ram, there aren't the bw issues of distributed roms of arcade games
<daveshah>
Yeah, you could use DDR3 without worrying about latency issues too
<daveshah>
68040 computer with 1GiB RAM...
<whitequark>
and PCIe?
<MadHacker>
I've friends who tried to emulate various machines on modern hardware and found out the hard way that RAM latency really isn't that much better. :/
<MadHacker>
Meanwhile I'm sticking an HX4K on a BBC Master ROM cartridge for fun and USB.
<MadHacker>
I wish the ECP5 was easier to place, I'd prefer to give it PCIe for a laugh. :)
<daveshah>
Depending on what machine, the whole thing should fit in cache given a decent CPU!
<MadHacker>
True that.
<_whitenotifier-9>
[nmigen] hofstee commented on issue #363: Can I create an active-low (asynchronous) reset? - https://git.io/JfkDr
<Sarayan>
daveshah: with mister you can have 128M SDRAM easily nowadays
<_whitenotifier-9>
[nmigen] hofstee commented on issue #185: ASIC support tracking issue - https://git.io/JfkDy
<_whitenotifier-9>
[nmigen] whitequark commented on issue #363: Can I create an active-low (asynchronous) reset? - https://git.io/JfkDx
rohitksingh has quit [Quit: No Ping reply in 180 seconds.]
<sorear>
there's a big difference between something like rocket's FPU, which has a 52x52 multiplier and several barrel shifters as (retimed) combinatorial logic and can complete double-precision FMAs at 1/cycle, and an 8080-era FPU which just has a couple of 80-bit registers, shift left/right one, an adder, and finite state logic
<sorear>
so rocket's FPU is huge (in ASIC processes it takes up half of the tile, the other half being "rest of core + I1$ + D1$"), but it's 122 times the throughput of what you're simulating
<Sarayan>
so it can be made quite small by sacrificing performance that doesn't need to be there anyway
<daveshah>
Given that multipliers are cheap on FPGAs I suspect you could make it quite a bit faster without costing that much more area
<Sarayan>
yeah, the cyclone v has a bunch of wide multipliers
<Sarayan>
the shift is probably costlier
<daveshah>
A fixed shift wouldn't be
<Sarayan>
not sure if nmigen/yosys can actually use the multipliers though
<daveshah>
ZirconiumX: ^
<Sarayan>
fadd requires a very not fixed shift
<daveshah>
Yeah
<daveshah>
I think there are tricks to use the multipliers for shifting, too
<ZirconiumX>
Yeah, you can't presently use the multipliers
<Sarayan>
you need a 2**n then
<whitequark>
isn't shift-by-mul just a mul by one hot?
<daveshah>
2**n is cheap, just a decoder
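A minimal sketch of the shift-by-multiply idea, written as a bit-level Python model rather than nMigen or vendor code. The function names and the 18-bit width are assumptions for illustration, and the right-shift variant (take the high half of the product) is one common way to do it, not necessarily the trick from the app note mentioned later.

```python
def one_hot(n, width):
    """Decoder: turn a shift amount n into the one-hot value 2**n."""
    assert 0 <= n < width
    return 1 << n

def shl_via_mul(x, n, width=18):
    """Left shift of a width-bit value as a multiply by a one-hot operand."""
    mask = (1 << width) - 1
    product = (x & mask) * one_hot(n, width)  # full 2*width-bit product
    return product & mask                     # low half is x << n, truncated

def shr_via_mul(x, n, width=18):
    """Logical right shift: multiply by 2**(width - n), keep the high half."""
    mask = (1 << width) - 1
    if n == 0:
        return x & mask  # 2**width does not fit in a width-bit operand
    product = (x & mask) * one_hot(width - n, width)
    return product >> width
```

The model only covers operands that fit the multiplier; shifting something wider than the multiplier needs several partial products, which is where the muxing cost raised below comes in.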
<ZirconiumX>
Even worse, there doesn't appear to be a Quartus IP core for this
<Sarayan>
ZX: for the multipliers?
<daveshah>
The tricks come in when the thing you are shifting is larger than the multiply but I can't remember the details
<daveshah>
There was an old Xilinx app note that I saw about it
<Sarayan>
wq: yeah, but between the size of the one hot and the muxing of the multiplier input and output I kinda wonder if directly barrel-shifting isn't better
<ZirconiumX>
Sarayan: yeah
<ZirconiumX>
Well
<ZirconiumX>
There's lpm_mult
<ZirconiumX>
Or altera_mult_add.
<ZirconiumX>
The Intel FPGA Multiply Adder (Intel Stratix 10, Intel Arria 10, and Intel Cyclone 10 GX devices) or ALTERA_MULT_ADD (Arria V, Stratix V, and Cyclone V devices) IP core allows you to implement a multiplier-adder.
<ZirconiumX>
This isn't going to be horrendously cursed at all
<ZirconiumX>
The alternative is direct cell instantiation
<Sarayan>
it's interesting though, how do you map a multiplier you write without thinking to whatever a fpga offers?
<whitequark>
you dont
<ZirconiumX>
You use `*` and hope for the best
<Sarayan>
ok, then how do you hit fpga-specific resources?
<whitequark>
you use an instance
<Sarayan>
is there a generic way to describe/use them?
<sorear>
given that your 680x0 core necessarily already has microcode, it probably doesn't make sense to have a fully separate FPU if you're not going for cycle accuracy
<Sarayan>
sorear: No intention to have it fully separate, but it's visible in the isa that it runs separately, as in the main program waits for the results
<Sarayan>
(iirc, I never had a 68k with a fpu)
<sorear>
x86 has FWAIT too but it's a no-op on everything recent
<Sarayan>
well, I need to do the integer part for a start, it's going to be a large enough work :-)
<Sarayan>
caches, mmu, fun
<Sarayan>
can an instance "polyfill" for sim or for other fpgas that don't have the function?
<ZirconiumX>
No, but you can write a module to wrap around the instance
<ZirconiumX>
Essentially Instance is nMigen's FFI
<whitequark>
instance polyfills are very much planned
<Sarayan>
sorear: So you use the fabric capabilities to have a single-cycle fpu or so, then forget about the async?
<ZirconiumX>
I know there's Intel IP for FPU functions
<sorear>
if you have a FPGA that will fit a single-cycle FPU, then yes
<ZirconiumX>
58 arguments to altera_mult_add, 227 parameters
* ZirconiumX
cries
<Sarayan>
mwahahahhaa nice
<ZirconiumX>
Why, Intel? I don't need saturating arithmetic
<ZirconiumX>
I don't need you to rotate the input
<ZirconiumX>
I don't need you to register the inputs and outputs either
<ZirconiumX>
daveshah: how bad is the ECP5 MULT18X18 cell? I'll admit I haven't looked at it.
<daveshah>
In its simple form not too bad
<daveshah>
The only real weirdness are the various undocumented cascade modes
<daveshah>
and the DDR registers and associated /2 clock dividers
pinknok has joined #nmigen
thinknok has quit [Ping timeout: 265 seconds]
<ZirconiumX>
Good news, at least
<ZirconiumX>
The cyclonev_mac primitive has *only* 22 arguments and 44 parameters
<ZirconiumX>
On the other hand, it has an encrypted simulation model, so I have no clue how it works other than cargo-culting
<whitequark>
i can probably decrypt it if you give me a testbench that uses it
<ZirconiumX>
Sure, just need to do a bit of error-driven development
<ZirconiumX>
Honestly I'm surprised this synthesises
<whitequark>
ZirconiumX: oh, it's just cyclonev_atoms_ncrypt.v?
<ZirconiumX>
Probably
<whitequark>
... why is it only CV and 55nm?
<whitequark>
(what was 55nm again?)
<ZirconiumX>
I think 55nm was like C III
<ZirconiumX>
It's apparently also MAX 10
<whitequark>
looks like the mentor models are encrypted, the rest aren't?
<whitequark>
i have no idea. doesn't matter anyway
<ZirconiumX>
Yeah, googling 55nm Altera parts brings up the MAX 10 as using a TSMC 55nm process
<ZirconiumX>
<whitequark> looks like the mentor models are encrypted, the rest aren't? <-- the unencrypted sim model library makes reference to some encrypted models, so
<ZirconiumX>
e.g. cyclonev_clkena is also apparently in here somewhere
<whitequark>
ahh
<ZirconiumX>
There's gotta be some irony in discussing encrypted vendor models while writing coursework on encryption and how to break it
<tpw_rules>
isn't that just not irony?
<ZirconiumX>
Maybe my sense of humour is broken then
<tpw_rules>
"oohoohoo i'm talking about breaking encryption while breaking encryption"
<tpw_rules>
not ironic
<whitequark>
circumventing, not breaking
<_whitenotifier-9>
[nmigen] whitequark edited a comment on pull request #364: Fix `_yosys_version()` - https://git.io/JfkSe
cr1901_modern has quit [Read error: Connection reset by peer]
<ronyrus>
wq: I used the debug ring log + uart example from your Yumewatari project. It's extremely useful and the state decode trick is awesome!!!
<ronyrus>
Is there a resource teaching these kind of tricks somewhere? Are there more?
<whitequark>
ronyrus: i'm afraid that one was made after working a lot with migen (and patching it too)
<whitequark>
actually i had to implement .decoding[]
<ronyrus>
:) it's very useful :)
Vinalon has quit [Remote host closed the connection]
Vinalon has joined #nmigen
<_whitenotifier-9>
[nmigen] Fatsie commented on issue #185: ASIC support tracking issue - https://git.io/JfkA7
<_whitenotifier-9>
[nmigen] Fatsie edited a comment on issue #185: ASIC support tracking issue - https://git.io/JfkA7
pinknok has quit [Remote host closed the connection]
pinknok has joined #nmigen
<awygle>
Guessing yumewatari is fairly far down on your priority list at this point?
<whitequark>
awygle: no, actually
<whitequark>
it's more that i have to make the universe to get some apple pie
<whitequark>
depth first bugfixing
<Sarayan>
yumewatari?
<whitequark>
my PCIe stack
<awygle>
Right
<awygle>
Which particular bits of the universe are missing?
<whitequark>
FSM stuff, parser stuff
<whitequark>
(parser stuff likely dependent on good FSM stuff)
<awygle>
Makes sense
<awygle>
Is there a use case for yumewatari in particular?
<whitequark>
would be the first OSS PCIe PHY
<whitequark>
well... upper-PHY
<whitequark>
technically it already is, depending on how conformant you want it to be
<whitequark>
it has an LTSSM, it's buggy and doesn't implement a bunch of PM features, but so are lots of devices that silicon vendors actually ship. a question of magnitude, really :p
<Sarayan>
target is one of the numerous fpga-on-a-pcie cards?
<Sarayan>
or glawgow with a rusty wire connector?
<Sarayan>
s/w/s/
<whitequark>
versa ecp5 5g
<Sarayan>
E226 on mouser, not insane
<Sarayan>
not sure what I could use it for, at least for the mister I have some ideas :-)
<daveshah>
It's a nice board
<daveshah>
I designed a hat with SDRAM and VGA (albeit I only ever assembled and tested the RAM)
<daveshah>
Which might be useful for Mister devel, if you didn't want to use the DDR3
<Sarayan>
what I'd love is a hat for that, or for a mister, with which I can plonk and torture yamaha sound chips
<Sarayan>
glasgow looks nice but is a little short for the ones that read pcm from rom
<awygle>
What about litepcie? Doesn't cover that layer?
<daveshah>
No, it relies on Xilinx hard IP for the LTSSM etc, at least last I looked
<daveshah>
Xilinx don't just have a SERDES, they have a much bigger part of the PCIe stack as hard IP too
<daveshah>
(Lattice have this with CrossLink NX, too, now, in fact I think that might provide even more than Xilinx does)
<awygle>
Ah
<awygle>
Lame
cr1901_modern has joined #nmigen
futarisIRCcloud has joined #nmigen
<Vinalon>
hey, I just wanted to say thanks again to ZirconiumX / MadHacker / sorear and the rest of y'all for the advice on how to shrink a CPU design last week; I managed to drop ~1000 cells by following your advice.
<ZirconiumX>
Wow, damn
<ZirconiumX>
Can you resend your source link?
<Vinalon>
removing extraneous CSRs, combining some ALU operations, and reducing the decoder's dependence on CPU state each dropped a few hundred
<Vinalon>
so now the ALU/CSR/CPU logic looks like it's a little less than 2000 cells, and it can fit 4 'neopixel' peripherals. I appreciate the help! :)
<ZirconiumX>
Glad to hear
<sorear>
"The spec does not define behavior when an unspecified opcode is encountered." illegal instruction exceptions are specified as mcause=2
<sorear>
all opcodes and bit patterns which are not specified are illegal
<ZirconiumX>
I don't think it's too burdensome to say "only execute legal instructions"
<ZirconiumX>
e.g. SERV requires this in the pursuit of absolute minimalism
<whitequark>
if you don't define every opcode, it's kind of implied that you can only ever use the defined ones as a software developer, no?
<sorear>
yes, but if you're going to do that it makes more sense to rip out the entire CSR system like picorv32 did
<Vinalon>
yeah, I can probably just add a default 'with m.Case()' to the end of the decoder to trigger a trap.
<Vinalon>
I would remove all of the CSRs, but the tests use 'minstret' to figure out if the program is still running and I want to add configurable interrupts eventually, like when a neopixel peripheral finishes sending its colors
<sorear>
you don't need 31 bits of mcause.ecode, it's a WLRL field so only valid values need to be representable
<Vinalon>
oh, that's a good point, thanks. I guess I could get away with just the first few bits.
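To make the WLRL point concrete, here is a small sketch that counts how many exception-code bits a core needs to store. The set of codes is a hypothetical choice for a small M-mode-only RV32I core (the numeric values are the standard ones), and `ecode_bits` is a made-up helper name.

```python
# Hypothetical set of exception codes a small M-mode-only core might raise.
USED_CODES = {
    0: "instruction address misaligned",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    6: "store/AMO address misaligned",
    11: "environment call from M-mode",
}

def ecode_bits(codes):
    """Number of low-order bits needed to represent every code in `codes`."""
    return max(max(codes).bit_length(), 1)

print(ecode_bits(USED_CODES))
```

With this set the largest code is 11, so four flops plus the interrupt bit would be enough; the remaining bits can read as zero.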
<sorear>
re. default, the annoying part is that this applies to everything, not just the 7-bit opcode, so slli can trigger an exception in some cases that addi can't because slli has must-be-zero bits
<sorear>
etc
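A rough Python model of the two decoder points above: a catch-all that traps with mcause=2, and the extra check slli needs because its upper immediate bits must be zero. This is a sketch, not the decoder from the design under discussion; `decode` and its return format are invented for illustration, and only two instructions are modelled.

```python
ILLEGAL_INSTRUCTION = 2  # mcause value for illegal instruction exceptions
OP_IMM = 0b0010011       # major opcode shared by addi, slli, and friends

def decode(insn):
    """Return ("ok", name) or ("trap", mcause) for a 32-bit instruction word.

    Only addi and slli are modelled; every other encoding traps.
    """
    opcode = insn & 0x7F
    funct3 = (insn >> 12) & 0x7
    funct7 = (insn >> 25) & 0x7F
    if opcode == OP_IMM:
        if funct3 == 0b000:
            return ("ok", "addi")  # all 12 immediate bits are meaningful
        if funct3 == 0b001:
            if funct7 == 0:
                return ("ok", "slli")
            # slli on RV32 has must-be-zero bits above the 5-bit shamt
            return ("trap", ILLEGAL_INSTRUCTION)
    # Equivalent of a default case at the end of the decoder
    return ("trap", ILLEGAL_INSTRUCTION)
```

For example, `decode(0x00309093)` (slli x1, x1, 3) is accepted, while the same word with bit 25 set, `0x02309093`, traps even though the 7-bit opcode is valid.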
<Vinalon>
ah - yeah, I guess it'll never strictly comply with the specification...but I'm happy if it works with GCC using the '-march=rv32i' flag.
<sorear>
for a microcontroller it doesn't really matter but once you get into OSes with multiple privilege levels "undocumented instructions" become rather problematic
lkcl__ has quit [Ping timeout: 265 seconds]
<Vinalon>
oh, that's good to know. But on the bright side, multiple privilege levels probably wouldn't fit easily in the target chip's 5000 logic cells :P
<Vinalon>
anyways, it was really nice of y'all to take a look and offer advice, and it definitely helped my learning.