ChanServ changed the topic of #nmigen to: nMigen hardware description language · code at · logs at
Degi_ has joined #nmigen
Degi has quit [Ping timeout: 240 seconds]
Degi_ is now known as Degi
danfoster has joined #nmigen
danfoster has quit [Remote host closed the connection]
<awygle> That's... Pretty awesome
<TD-Linux> >supports ulx3s
<awygle> thank you twitter
<TD-Linux> neat
<awygle> "the code is moved into hardware" is an interesting description
<awygle> i get what they're saying but still
<awygle> it looks like you might be able to slot cxxrtl into place in this flow where Verilator currently lives?
<awygle> i guess that wouldn't be hugely useful tho
<whitequark> why not?
<awygle> in the context of nmigen it seems like what you'd want is an nmigen frontend, not a cxxrtl middle-end
<whitequark> ohh, i misunderstood what you wanted
<awygle> what did you think i wanted? i might want that too :p
<awygle> hm i'd never heard of "ujprog" either
<whitequark> awygle: using cxxrtl as a jit backend
<whitequark> given that it *almost* supports proper separate compilation, not inconceivable
<awygle> oh, yes, sure
<awygle> we should do that lol
<awygle> the more i read this the more interesting cascade is
<awygle> >> Cascade ... can target [the ULX3S'] reprogrammable fabric to improve virtual clock frequency for most applications.
futarisIRCcloud has joined #nmigen
<whitequark> at which point does it stop being a JIT compiler and become a synthesizer with an ILA?
<whitequark> i don't quite get it
<awygle> i _think_ they mean "jit compiler" as "to a bitstream"
<awygle> i also misunderstood at first
<whitequark> hrm
<awygle> but then why verilator
<sorear> I think they mean it in the sense of "tiered compilation"
<awygle> *confused*
<whitequark> okay that *is* interesting
<whitequark> but also confusing
<sorear> using a SW sim as tier 1, and PnR as tier 2
<awygle> actually i think it's
<awygle> pure SW sim -> verilated compiled -> PnR
<whitequark> right so they have deopt support, right?
<whitequark> that's a lot of fun
<sorear> except that there's no profiling and no reasonable way to split the design anyway, so you just migrate the whole thing when the background compile finishes
<awygle> "deopt"?
<whitequark> deoptimization
<sorear> yes, if I'm reading the readme right they deopt for $printf etc
<awygle> the "virtualization tasks" section looks like it would be quite nice for my interactive simulator dream
<awygle> like, it's not quite that, but it's quite similar
Stary has quit [Ping timeout: 246 seconds]
Stary has joined #nmigen
<TD-Linux> awygle, ujprog is the tool used to program the ulx3s via the ft2232h on board
<TD-Linux> (it is somewhat difficult to make it work correctly also)
<awygle> interesting
<awygle> Why are all jtag api things bad
<whitequark> awygle we have an entire channel literally dedicated to that
<awygle> ... we do?
<whitequark> #glasgow ;p
<awygle> ah :p
<awygle> i thought you might mean that
<awygle> different problem tho no?
<whitequark> eh
<whitequark> not very serious here
<awygle> mhm
<awygle> oh speaking of
<awygle> i saw a comment on the glasgow (?) issue tracker that said you weren't interested in using libjtaghal and that the glasgow native support was strictly superior (i think)
<awygle> i was curious about that
<awygle> libjtaghal seems excessively complex to me, but i am interested in what you found objectionable (or if you did)
<whitequark> awygle: mh, i might have worded that poorly
<whitequark> there were a few realizations mixed up there
<whitequark> first, it turns out i did not really need libjtaghal for... well, jtag. at the time i did not understand jtag very well. i do now. it is beautiful and not really hard to use
<whitequark> second, interacting with glasgow from foreign c++ code is hard because glasgow, the USB device, doesn't (yet?) have a "stable ABI"
<TD-Linux> I mean, I use the glasgow as my ecp5 jtag adapter of choice...
<whitequark> third, it turned out that heavy vertical integration in glasgow gives almost exponential benefits
<awygle> i see
<TD-Linux> I actually find ice40 spi flashing more obnoxious because you have to hold reset and not all spi programmers support doing that
* cr1901_modern has a use for stable glasgow USB interface in the mid-future (few months from now?)
____ has joined #nmigen
_whitelogger has joined #nmigen
XgF has quit [Quit: - Chat comfortably. Anywhere.]
Vinalon has joined #nmigen
thinknok has joined #nmigen
thinknok has quit [Ping timeout: 272 seconds]
chipmuenk has joined #nmigen
Asu has joined #nmigen
thinknok has joined #nmigen
____2 has joined #nmigen
____ has quit [Ping timeout: 256 seconds]
<Sarayan> How insanely big would the nmigen code for an fpu be? And would it map to fpga hardware?
<Sarayan> I'm thinking about the feasibility of something like a NeXT on mister
futarisIRCcloud has quit [Quit: Connection closed for inactivity]
Ekho has quit [Quit: An alternate universe was just created where I didn't leave. But here, I left you. I'm sorry.]
Vinalon has quit [Ping timeout: 264 seconds]
Ekho has joined #nmigen
<_whitenotifier-9> [nmigen] hofstee opened issue #363: Can I create an active-low (asynchronous) reset? -
<whitequark> Sarayan: shouldn't be that huge
<whitequark> i mean, depends on what kind of fpu, but they aren't inherently massive
<_whitenotifier-9> [nmigen] whitequark commented on issue #363: Can I create an active-low (asynchronous) reset? -
<_whitenotifier-9> [nmigen] whitequark commented on issue #363: Can I create an active-low (asynchronous) reset? -
<_whitenotifier-9> [nmigen] whitequark edited a comment on issue #363: Can I create an active-low (asynchronous) reset? -
<_whitenotifier-9> [nmigen] hofstee opened pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen] hofstee commented on issue #363: Can I create an active-low (asynchronous) reset? -
<Sarayan> wq: 68040 or so, so trying a perfect cycle-exact simulation, just an equivalent behaviour, to make a full-speed NeXT for instance
<_whitenotifier-9> [nmigen] hofstee edited a comment on issue #363: Can I create an active-low (asynchronous) reset? -
<Sarayan> s/so trying/not trying/
<Sarayan> I think it has standard fp ops, possibly dropped the functions, for fp32, 64 and 80 iirc
<_whitenotifier-9> [nmigen] whitequark commented on issue #363: Can I create an active-low (asynchronous) reset? -
<_whitenotifier-9> [nmigen] codecov[bot] commented on pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen] codecov[bot] edited a comment on pull request #364: Fix `_yosys_version()` -
<whitequark> Sarayan: that has "68000 gates", right? or more like 2/3 that amount
<_whitenotifier-9> [nmigen] codecov[bot] edited a comment on pull request #364: Fix `_yosys_version()` -
<whitequark> mh no, 1.2 mil
<whitequark> it's really hard to make any sound prediction, but i'd expect you to be able to fit it into a larger FPGA
<whitequark> not sure about the mister specifically
<Sarayan> yeah
<Sarayan> cyclone V, the one with a dual-core arm in
<daveshah> I think what mostly affects the size of an FPU is how microcoded/multicycle it is
<daveshah> The Rocket FPU is pretty large (needing pretty much an Artix-7 100T for SoC+FPU, whereas SoC on its own is fine in an ECP5 45k)
<daveshah> but I think that their implementation is fairly inefficient
<Sarayan> if the fpu has no sin() and friends, is there anything to microcode in the first place?
* whitequark . o O ( bit-serial FPU )
<Sarayan> oh damn, I nerd-sniped wq, sorry
<daveshah> Division might well benefit from some kind of microcoding
<whitequark> oh no, i'm not olofk :p
<MadHacker> There's plenty FP emulators on 8 bit micros, so that sets an upper bound for how bad it can be. You can always implement it as a tiny 8-bit micro.
<daveshah> Yeah, a picorv32 or VexRiscv would be even easier and sets an upper bound for a "microcoded" FPU (~2k LUTs)
<Sarayan> true. Note that 8bit micros are not ieee usually
<_whitenotifier-9> [nmigen] hofstee commented on issue #363: Can I create an active-low (asynchronous) reset? -
<Sarayan> the 68k itself is microcoded for the "normal" instructions
<Sarayan> I really wonder how small one can make a 68040-equivalent while keeping similar performance
<Sarayan> I guess I'll start on the integer instructions when I'm bored
<_whitenotifier-9> [nmigen] whitequark commented on issue #363: Can I create an active-low (asynchronous) reset? -
<_whitenotifier-9> [nmigen] whitequark edited a comment on issue #363: Can I create an active-low (asynchronous) reset? -
<daveshah> I think it is usually about 6-10 ASIC gates to the LUT used in an ASIC emulation context
<whitequark> so 120-200k LUT?
<daveshah> Perhaps less, as it is 1.2M transistors not gates
<whitequark> not that big, but not ecp5 sized either
<whitequark> ah, right
<daveshah> and memories will be more efficient than that
<daveshah> ditto DSPs if there are any multiplies in there
<Sarayan> note that a large part of these transistors are just the caches
<daveshah> Yeah, those become BRAM
<daveshah> so probably it's a mid ECP5 type design
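The back-of-envelope sizing in this exchange can be reproduced in a few lines of Python. This is only a rough sketch: the 1.2M figure and the 6-10 gates-per-LUT ratio come from the discussion, while the figure of 4 transistors per gate (one 2-input CMOS NAND) is an assumption introduced here, and nothing is subtracted for caches mapping to BRAM or multiplies mapping to DSPs.

```python
# Back-of-envelope LUT estimate from the figures quoted in the discussion.
TRANSISTORS = 1_200_000      # 68040 figure quoted in the chat
GATES_PER_LUT = (6, 10)      # ASIC gates per LUT, range quoted in the chat
TRANSISTORS_PER_GATE = 4     # ASSUMPTION: one 2-input CMOS NAND per "gate"


def lut_range(gate_count, per_lut=GATES_PER_LUT):
    """Return a (low, high) LUT estimate for a design of `gate_count` gates."""
    low_ratio, high_ratio = per_lut
    return gate_count // high_ratio, gate_count // low_ratio


# Reading 1.2M as gates reproduces the 120-200k LUT figure.
as_gates = lut_range(TRANSISTORS)

# Reading it as transistors shrinks the estimate by the assumed factor of 4.
as_transistors = lut_range(TRANSISTORS // TRANSISTORS_PER_GATE)

print(as_gates)        # (120000, 200000)
print(as_transistors)  # (30000, 50000)
```

Under that assumption the logic estimate lands in the tens of thousands of LUTs before caches are moved to block RAM, which is in line with the "mid ECP5" guess.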
<_whitenotifier-9> [nmigen] whitequark commented on pull request #364: Fix `_yosys_version()` -
<Sarayan> I have a feeling it would be fun to reimplement old workstations on fpga, and one only needs external ram, there aren't the bw issues of distributed roms of arcade games
<_whitenotifier-9> [nmigen] whitequark reviewed pull request #364 commit -
<daveshah> Yeah, you could use DDR3 without worrying about latency issues too
<daveshah> 68040 computer with 1GiB RAM...
<whitequark> and PCIe?
<MadHacker> I've friends who tried to emulate various machines on modern hardware and found out the hard way that RAM latency really isn't that much better. :/
<MadHacker> Meanwhile I'm sticking an HX4K on a BBC Master ROM cartridge for fun and USB.
<MadHacker> I wish the ECP5 was easier to place, I'd prefer to give it PCIe for a laugh. :)
<daveshah> Depending on what machine, the whole thing should fit in cache given a decent CPU!
<MadHacker> True that.
<_whitenotifier-9> [nmigen] hofstee commented on issue #363: Can I create an active-low (asynchronous) reset? -
<Sarayan> daveshah: with mister you can have 128M SDRAM easily nowadays
<_whitenotifier-9> [nmigen] hofstee commented on issue #185: ASIC support tracking issue -
<_whitenotifier-9> [nmigen] whitequark commented on issue #363: Can I create an active-low (asynchronous) reset? -
rohitksingh has quit [Quit: No Ping reply in 180 seconds.]
rohitksingh has joined #nmigen
<_whitenotifier-9> [nmigen] hofstee synchronize pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen] codecov[bot] edited a comment on pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen] whitequark closed pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen/nmigen] whitequark pushed 1 commit to master [+0/-0/±1]
<_whitenotifier-9> [nmigen/nmigen] hofstee 875579e - back.verilog: make Yosys version check compatible with Verific.
<_whitenotifier-9> [nmigen] whitequark commented on pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen] whitequark commented on issue #185: ASIC support tracking issue -
<_whitenotifier-9> [nmigen] hofstee commented on pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen] whitequark commented on pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen] whitequark commented on pull request #364: Fix `_yosys_version()` -
<sorear> there's a big difference between something like rocket's FPU, which has a 52x52 multiplier and several barrel shifters as (retimed) combinatorial logic and can complete double-precision FMAs at 1/cycle, and a 8080-era FPU which just has a couple of 80-bit registers, shift left/right one, an adder, and finite state logic
<sorear> *8087
<_whitenotifier-9> [nmigen] whitequark commented on pull request #364: Fix `_yosys_version()` -
<sorear> <- 50+ cycles for multiply up through the 387, which postdates 68881
<sorear> so rocket's FPU is huge (in ASIC processes it takes up half of the tile, the other half being "rest of core + I1$ + D1$"), but it's 122 times the throughput of what you're simulating
<Sarayan> so it can be made quite small by sacrificing performance that doesn't need to be there anyway
<daveshah> Given that multipliers are cheap on FPGAs I suspect you could make it quite a bit faster without costing that much more area
<Sarayan> yeah, the cyclone v has a bunch of wide multipliers
<Sarayan> the shift is probably costlier
<daveshah> A fixed shift wouldn't be
<Sarayan> not sure if nmigen/yosys can actually use the multipliers though
<daveshah> ZirconiumX: ^
<Sarayan> fadd requires a very not fixed shift
<daveshah> Yeah
<daveshah> I think there are tricks to use the multipliers for shifting, too
<ZirconiumX> Yeah, you can't presently use the multipliers
<Sarayan> you need a 2**n then
<whitequark> isn't shift-by-mul just a mul by one hot?
<daveshah> 2**n is cheap, just a decoder
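The shift-by-multiply idea mentioned here can be modelled in plain Python: a decoder turns the shift amount into a one-hot value (2**n), and the multiplier does the rest. This is a behavioural sketch only, not nMigen code and not a statement about how any particular DSP block is wired; the function names are made up for illustration.

```python
def one_hot(n, width):
    """Decoder: return 2**n as a `width`-bit one-hot value."""
    assert 0 <= n < width
    return 1 << n


def shl_via_mul(x, n, width):
    """Left shift by n: multiply by 2**n and keep the low `width` bits."""
    mask = (1 << width) - 1
    return ((x & mask) * one_hot(n, width)) & mask


def shr_via_mul(x, n, width):
    """Logical right shift by n: multiply by 2**(width-n), keep the high half."""
    mask = (1 << width) - 1
    if n == 0:
        return x & mask  # 2**width does not fit in the one-hot operand
    product = (x & mask) * one_hot(width - n, width)  # up to 2*width bits
    return product >> width


print(hex(shl_via_mul(0x35, 3, 8)))  # 0xa8
print(hex(shr_via_mul(0x35, 3, 8)))  # 0x6
```

The right-shift variant is the one an fadd alignment step would want; it needs the full double-width product, which is where the multiplier output muxing cost raised later in the conversation comes from.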
<ZirconiumX> Even worse, there doesn't appear to be a Quartus IP core for this
<Sarayan> ZX: for the multipliers?
<daveshah> The tricks come in when the thing you are shifting is larger than the multiply but I can't remember the details
<daveshah> There was an old Xilinx app note that I saw about it
<Sarayan> wq: yeah, but between the size of the one hot and the muxing of the multiplier input and output I kinda wonder if directly barrel-shifting isn't better
<ZirconiumX> Sarayan: yeah
<ZirconiumX> Well
<ZirconiumX> There's lpm_mult
<ZirconiumX> Or altera_mult_add.
<ZirconiumX> The Intel FPGA Multiply Adder (Intel Stratix 10, Intel Arria 10, and Intel Cyclone 10 GX devices) or ALTERA_MULT_ADD (Arria V, Stratix V, and Cyclone V devices) IP core allows you to implement a multiplier-adder.
<ZirconiumX> This isn't going to be horrendously cursed at all
<ZirconiumX> The alternative is direct cell instantiation
<ZirconiumX> Which, uh
<ZirconiumX> Hasn't gone well so far
<Sarayan> it's interesting though, how do you map a multiplier you write without thinking to whatever a fpga offers?
<whitequark> you dont
<ZirconiumX> You use `*` and hope for the best
<Sarayan> ok, then how do you hit fpga-specific resources?
<whitequark> you use an instance
<Sarayan> is there a generic way to describe/use them?
<sorear> given that your 680x0 core necessarily already has microcode, it probably doesn't make sense to have a fully separate FPU if you're not going for cycle accuracy
<Sarayan> sorear: No intention to have it fully separate, but it's visible in the isa that it runs separately, as in the main program waits for the results
<Sarayan> (iirc, I never had a 68k with a fpu)
<sorear> x86 has FWAIT too but it's a no-op on everything recent
<Sarayan> well, I need to do the integer part for a start, it's going to be a large enough work :-)
<Sarayan> caches, mmu, fun
<Sarayan> can an instance "polyfill" for sim or for other fpgas that don't have the function?
<ZirconiumX> No, but you can write a module to wrap around the instance
<ZirconiumX> Essentially Instance is nMigen's FFI
<whitequark> instance polyfills are very much planned
<Sarayan> sorear: So you use the fabric capabilities to have a single-cycle fpu or so, then forget about the async?
<ZirconiumX> I know there's Intel IP for FPU functions
<sorear> if you have a FPGA that will fit a single-cycle FPU, then yes
<ZirconiumX> 58 arguments to altera_mult_add, 227 parameters
* ZirconiumX cries
<Sarayan> mwahahahhaa nice
<ZirconiumX> Why, Intel? I don't need saturating arithmetic
<ZirconiumX> I don't need you to rotate the input
<ZirconiumX> I don't need you to register the inputs and outputs either
<ZirconiumX> daveshah: how bad is the ECP5 MULT18X18 cell? I'll admit I haven't looked at it.
<daveshah> In its simple form not too bad
<daveshah> The only real weirdness are the various undocumented cascade modes
<daveshah> and the DDR registers and associated /2 clock dividers
pinknok has joined #nmigen
thinknok has quit [Ping timeout: 265 seconds]
<ZirconiumX> Good news, at least
<ZirconiumX> The cyclonev_mac primitive has *only* 22 arguments and 44 parameters
<ZirconiumX> On the other hand, it has an encrypted simulation model, so I have no clue how it works other than cargo-culting
<whitequark> i can probably decrypt it if you give me a testbench that uses it
<ZirconiumX> Sure, just need to do a bit of error-driven development
<ZirconiumX> Honestly I'm surprised this synthesises
<whitequark> ZirconiumX: oh, it's just cyclonev_atoms_ncrypt.v?
<ZirconiumX> Probably
<whitequark> ... why is it only CV and 55nm?
<whitequark> (what was 55nm again?)
<ZirconiumX> I think 55nm was like C III
<ZirconiumX> It's apparently also MAX 10
<whitequark> looks like the mentor models are encrypted, the rest aren't?
<whitequark> i have no idea. doesn't matter anyway
<ZirconiumX> Yeah, googling 55nm Altera parts brings up the MAX 10 as using a TSMC 55nm process
<ZirconiumX> <whitequark> looks like the mentor models are encrypted, the rest aren't? <-- the unencrypted sim model library makes reference to some encrypted models, so
<ZirconiumX> e.g. cyclonev_clkena is also apparently in here somewhere
<whitequark> ahh
<ZirconiumX> There's gotta be some irony in discussing encrypted vendor models while writing coursework on encryption and how to break it
<tpw_rules> isn't that just not irony?
<ZirconiumX> Maybe my sense of humour is broken then
<tpw_rules> "oohoohoo i'm talking about breaking encryption while breaking encryption"
<tpw_rules> not ironic
<whitequark> circumventing, not breaking
<_whitenotifier-9> [nmigen] whitequark edited a comment on pull request #364: Fix `_yosys_version()` -
<_whitenotifier-9> [nmigen] whitequark commented on pull request #364: Fix `_yosys_version()` -
Vinalon has joined #nmigen
cr1901_modern has quit [Read error: Connection reset by peer]
<ronyrus> wq: I used the debug ring log + uart example from your Yumewatari project. It's extremely useful and the state decode trick is awesome!!!
<ronyrus> Is there a resource teaching these kind of tricks somewhere? Are there more?
<whitequark> ronyrus: i'm afraid that one was made after working a lot with migen (and patching it too)
<whitequark> actually i had to implement .decoding[]
<ronyrus> :) it's very useful :)
Vinalon has quit [Remote host closed the connection]
Vinalon has joined #nmigen
<_whitenotifier-9> [nmigen] Fatsie commented on issue #185: ASIC support tracking issue -
<_whitenotifier-9> [nmigen] Fatsie edited a comment on issue #185: ASIC support tracking issue -
pinknok has quit [Remote host closed the connection]
pinknok has joined #nmigen
<awygle> Guessing yumewatari is fairly far down on your priority list at this point?
<whitequark> awygle: no, actually
<whitequark> it's more that i have to make the universe to get some apple pie
<whitequark> depth first bugfixing
<Sarayan> yumewatari?
<whitequark> my PCIe stack
<awygle> Right
<awygle> Which particular bits of the universe are missing?
<whitequark> FSM stuff, parser stuff
<whitequark> (parser stuff likely dependent on good FSM stuff)
<awygle> Makes sense
<awygle> Is there a use case for yumewatari in particular?
<whitequark> would be the first OSS PCIe PHY
<whitequark> well... upper-PHY
<whitequark> technically it already is, depending on how conformant you want it to be
<whitequark> it has an LTSSM, it's buggy and doesn't implement a bunch of PM features, but so are lots of devices that silicon vendors actually ship. a question of magnitude, really :p
<Sarayan> target is one of the numerous fpga-on-a-pcie card?
<Sarayan> or glawgow with a rusty wire connector?
<Sarayan> s/w/s/
<whitequark> versa ecp5 5g
<Sarayan> E226 on mouser, not insane
<Sarayan> not sure what I could use it for, at least for the mister I have some ideas :-)
<daveshah> It's a nice board
<daveshah> I designed a hat with SDRAM and VGA (albeit I only ever assembled and tested the RAM)
<daveshah> Which might be useful for Mister devel, if you didn't want to use the DDR3
<Sarayan> what I'd love is a hat for that, or for a mister, with which I can plonk and torture yamaha sound chips
<Sarayan> glasgow looks nice but is a little short for the ones that read pcm from rom
<awygle> What about litepcie? Doesn't cover that layer?
<daveshah> No, it relies on Xilinx hard IP for the LTSSM etc, at least last I looked
<daveshah> Xilinx don't just have a SERDES, they have a much bigger part of the PCIe stack as hard IP too
<daveshah> (Lattice have this with CrossLink NX, too, now, in fact I think that might provide even more than Xilinx does)
<awygle> Ah
<awygle> Lame
cr1901_modern has joined #nmigen
futarisIRCcloud has joined #nmigen
<Vinalon> hey, I just wanted to say thanks again to ZirconiumX / MadHacker / sorear and the rest of y'all for the advice on how to shrink a CPU design last week; I managed to drop ~1000 cells by following your advice.
<ZirconiumX> Wow, damn
<ZirconiumX> Can you resend your source link?
<Vinalon> removing extraneous CSRs, combining some ALU operations, and reducing the decoder's dependence on CPU state each dropped a few hundred
<Vinalon> it's here, but I'm still cleaning up the ALU changes and haven't committed them yet:
<Vinalon> so now the ALU/CSR/CPU logic looks like it's a little less than 2000 cells, and it can fit 4 'neopixel' peripherals. I appreciate the help! :)
<ZirconiumX> Glad to hear
<sorear> "The spec does not define behavior when an unspecified opcode is encountered." illegal instruction exceptions are specified as mcause=2
<sorear> all opcodes and bit patterns which are not specified are illegal
<ZirconiumX> I don't think it's too burdensome to say "only execute legal instructions"
<ZirconiumX> e.g. SERV requires this in the pursuit of absolute minimalism
<whitequark> if you don't define every opcode, it's kind of implied that you can only ever use the defined ones as a software developer, no?
<sorear> yes, but if you're going to do that it makes more sense to rip out the entire CSR system like picorv32 did
<Vinalon> yeah, I can probably just add a default 'with m.Case()' to the end of the decoder to trigger a trap.
<Vinalon> I would remove all of the CSRs, but the tests use 'minstret' to figure out if the program is still running and I want to add configurable interrupts eventually, like when a neopixel peripheral finishes sending its colors
<sorear> you don't need 31 bits of mcause.ecode, it's a WLRL field so only valid values need to be representable
<Vinalon> oh, that's a good point, thanks. I guess I could get away with just the first few bits.
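To put a number on "the first few bits": the exception-code field only has to be wide enough for the largest cause the core can actually raise. The sketch below uses exception codes from the RISC-V privileged spec, but the particular subset chosen is a hypothetical example, not a description of Vinalon's core.

```python
# Exception codes as numbered in the RISC-V privileged spec.
# The subset listed here is a HYPOTHETICAL choice for a small M-mode-only core.
IMPLEMENTED_CAUSES = {
    0: "instruction address misaligned",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    6: "store/AMO address misaligned",
    11: "environment call from M-mode",
}


def exception_code_bits(causes):
    """Number of flip-flops needed to hold the largest implemented cause code."""
    return max(1, max(causes).bit_length())


print(exception_code_bits(IMPLEMENTED_CAUSES))  # 4
```

The Interrupt flag in the top bit of mcause is separate from this field, so a core with interrupts still needs that one extra bit.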
<sorear> re. default, the annoying part is that this applies to everything, not just the 7-bit opcode, so slli can trigger an exception in some cases that addi can't because slli has must-be-zero bits
<sorear> etc
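A small Python model of that slli/addi difference, for the base RV32I encoding only (extensions define some of the otherwise-reserved patterns). The helper names are invented for this sketch; a real decoder would apply the same kind of check to every opcode, not just OP-IMM.

```python
OP_IMM = 0b0010011  # major opcode shared by addi, slti, xori, slli, srli, srai, ...


def encode_i(imm12, rs1, funct3, rd):
    """Assemble an I-type word with the OP-IMM opcode (illustrative helper)."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | OP_IMM


def op_imm_is_legal(instr):
    """Return True if an OP-IMM word is a defined base-RV32I instruction."""
    assert instr & 0x7F == OP_IMM
    funct3 = (instr >> 12) & 0x7
    funct7 = (instr >> 25) & 0x7F  # upper 7 bits of the 12-bit immediate
    if funct3 == 0b001:  # slli: upper immediate bits must be zero
        return funct7 == 0b0000000
    if funct3 == 0b101:  # srli (0000000) or srai (0100000)
        return funct7 in (0b0000000, 0b0100000)
    return True  # addi, slti, sltiu, xori, ori, andi use all 12 immediate bits


print(op_imm_is_legal(encode_i(0xFFF, 1, 0b000, 1)))  # addi x1, x1, -1 -> True
print(op_imm_is_legal(encode_i(0xFFF, 1, 0b001, 1)))  # slli with junk bits -> False
```

The same immediate that is fine for addi makes slli an illegal instruction, so a single default case keyed on the 7-bit opcode is not enough to catch it.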
<Vinalon> ah - yeah, I guess it'll never strictly comply with the specification...but I'm happy if it works with GCC using the '-march=rv32i' flag.
<sorear> for a microcontroller it doesn't really matter but once you get into OSes with multiple privilege levels "undocumented instructions" become rather problematic
lkcl__ has quit [Ping timeout: 265 seconds]
<Vinalon> oh, that's good to know. But on the bright side, multiple privilege levels probably wouldn't fit easily in the target chip's 5000 logic cells :P
<Vinalon> anyways, it was really nice of y'all to take a look and offer advice, and it definitely helped my learning.
____2 has quit [Quit: Nettalk6 -]
chipmuenk has quit [Quit: chipmuenk]
pinknok has quit [Ping timeout: 272 seconds]
Asu has quit [Quit: Konversation terminated!]
<whitequark> ZirconiumX: so, how do i actually simulate your example?
<ZirconiumX> Since I don't have modelsim installed (for hopefully obvious reasons) I'm not actually sure
<whitequark> hrm, okay
<whitequark> i'm going to do this some other day then, sorry
<ZirconiumX> Sure
lkcl has joined #nmigen