ChanServ changed the topic of #nmigen to: nMigen hardware description language · code at · logs at · IRC meetings each Monday at 1800 UTC · next meeting August 17th
<d1b2> <emeb> Isn't it true that all clock domains in a design are part of self, and the clock and reset signals associated with them are accessible?
<tannewt> oh, initialized by the super class?
<d1b2> <emeb> well, I think I misspoke saying "all clock domains", but for any module that's instantiated, the higher level hooks up the clock domain, and the clock & reset signals are available if you need them.
<tannewt> huh, interesting. I'd expect it to be passed in
<tannewt> (obviously I haven't used it much)
<d1b2> <emeb> but usually you don't since they're implicit when you use .sync process
<tannewt> right, you assume most things are using the common clock
<d1b2> <emeb> things get interesting when there's multiple domains and signals crossing of course - that's the topic of a lot of discussion here.
<tannewt> ya, I've seen that but haven't experimented with it yet
<tannewt> I know it from the "I've worked with a SAMD micro that does it" standpoint but not the HDL level
<d1b2> <emeb> Right - there are a lot of hardware patterns that are used to ensure good behavior in those cases. AIUI nmigen is working to make a lot of those available easily.
<tannewt> nice! I love the nmigen primitives
<tannewt> is it usually done at the register level or peripheral level?
<tannewt> I'm probably thinking too high level
<d1b2> <emeb> Yeah - special structures to keep pulses from getting lost or getting stretched, special ways to make sure you only get out what's put in, etc.
<d1b2> <emeb> async FIFOs figure heavily. Lots of handshaking. etc.
<tannewt> cool cool. I haven't gotten that low level yet
<whitequark> tannewt: so, most signals are eagerly bound, meaning you have to possess a reference to the actual Signal object to do anything with them
<whitequark> clock domains are special in that they are late bound
<tannewt> at elaboration time?
<whitequark> yeah
<whitequark> or rather, *after* elaboration
<whitequark> well
<tannewt> 🤯
<whitequark> it's a combination of both, i guess
<tannewt> both during and after? or before?
<whitequark> you're explicitly binding them by manipulating, and then nmigen binds them through the hierarchy for you
<tannewt> and `sync` is the default domain right?
<whitequark> pretty much
<whitequark> for the most part, `sync` is a convention
<whitequark> it's never treated specially other than by serving as a default name
<tannewt> makes sense. how are different clocks mapped to multiple domains?
<whitequark> can you elaborate?
<moony> is there a clean way to set a Signal to the current state of an FSM?
<moony> eh, nvm
<moony> just noticed it's in the signal viewer as FSM_STATE
<lkcl_> DaKnig: one "trick" (or two) that i learned is possible with nmigen, which helps keep line lengths to below 80 chars is:
<lkcl_> 1) use functions, passing in expressions and parameters (i'll send a link to a file i did that, in a mo)
<lkcl_> 2) assign AST sub-expressions to python variables then use those on the next line
<tannewt> wq, I'm wondering how you map two clock domains in an elaborate call
<lkcl_> here's an example of a python function which was created by taking the contents of a VHDL "case" statement out:
<lkcl_> the radix_read_wait() function just above it, you can see how many If If If indentations there are
<lkcl_> if those had also been inside the original Case statement at line 384, which is in *yet another* m.If, you can see quite easily how the indentation builds up and gets completely out of hand
<lkcl_> and yet it is perfectly reasonable to expect to have just one Switch statement - not even two *nested* nmigen Switch Statements - then do maybe one or two m.If indented pieces of work
<lkcl_> by using a python function for the majority of the Switch work, you can "go back" to a clean indent level.
<lkcl_> plus, i think the Switch statement looks a lot more understandable and readable because you don't have to scroll page-up, page-down dozens of times to take in all the Cases
<lkcl_> the second trick: using a python variable to store AST fragments: this is *usually* something that's not recommended
<lkcl_> because people tend to use those fragments multiple times, not realising that it's literally going to insert that exact same AST into the yosys output
<lkcl_> but if you use it carefully, and make sure that each python variable is used *once*, i've found that it's a really good way to stay below 80 chars.
<lkcl_> example, at line 205 (which is annoyingly unreadable)
<lkcl_> do this instead:
<lkcl_> data01 = data[1] | data[0]
<lkcl_> followed by
<lkcl_> comb += perm_ok.eq(data01 &
<lkcl_> i *had* to develop these techniques, because of short-term memory issues. if the files are not all on-screen at once, i literally cannot recall the details of a file or function from a hidden tab or backgrounded window that i viewed only two seconds ago!
<moony> anyone have an idea why this doesn't work, and if yes, how I could do it correctly? (also i'd love to just get rid of that for loop, too.)
<lkcl_> oh - also, another thing you can see, there: "comb = m.d.comb" followed by just "comb +=".
<lkcl_> this gives you 5 characters back which would otherwise be unavailable. if you can stand abbreviations and it's *really* important, you could do "c = m.d.comb" followed by "c += ...."
<lkcl_> moony: functions usually have to be "yielded from"
<lkcl_> yield from test_for_req()
<lkcl_> however...
<lkcl_> the for-loop... i *believe* you are doing the right thing, there (for i in test_for_req())
<lkcl_> is the testee done combinatorially?
<lkcl_> it would help to provide that source code as well
<moony> it worked with `yield from`
<lkcl_> oh! it did?? :)
<moony> I think the issue may lie with `(`yield testee.read_en) == 1`
<moony> oops, stray `
<lkcl_> oh, you mean you had:
<lkcl_> yield from test_for_req()
<lkcl_> yield i
<lkcl_> repeated a lot of times?
<lkcl_> now i am "engaging brain" a little more, the for-loop is not making any sense to me, neither is "yield i"
<lkcl_> this would make more sense:
<lkcl_> for i in range(5): # some arbitrary number
<lkcl_> yield from test_for_req()
<moony> yea, that's what I moved to
<moony> so, for future note: I'm still learning Python. I'm learning python only to use nmigen :p So how generators work in it is still a bit mysterious to me
<lkcl_> i know there's a reason why that works, it's just a bit too late for me to think it through and explain it :)
<lkcl_> they're very cool, however, yes, the fact that you can "jump" the control around - even sequentially - from one yield to the next
<moony> anywayssss....
* moony celebrates
<lkcl_> and even have for-loops around things that "yield" results, and even call functions that hierarchically do "more yields"... :)
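The generator delegation being discussed can be seen in a small pure-Python sketch (no nmigen needed; `test_for_req` and the "commands" here are made-up stand-ins, not real simulator API): a testbench-style generator hands control to a helper with `yield from`, which is why plain `test_for_req()` did nothing while `yield from test_for_req()` worked.

```python
# Pure-Python sketch of how nmigen-style testbench generators compose.
# "test_for_req" is a hypothetical helper, invented for illustration.

def test_for_req():
    # a helper that yields several "commands", like a sub-testbench
    yield "check read_en"
    yield "advance clock"

def bench():
    # without "yield from", calling test_for_req() would just create
    # a generator object and silently discard it
    for _ in range(2):          # some arbitrary repeat count
        yield from test_for_req()

commands = list(bench())
print(commands)
# each loop iteration delegates to the helper, so 2 * 2 = 4 commands
```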
<lkcl_> yay! :)
<moony> i'm now part of the way toward a working CPU
<lkcl_> cool!
<lkcl_> let us know how it goes
<moony> so far so good, surprisingly so as I bit off a bit more than I knew how to accomplish and somehow was mostly successful anyways
<moony> what's a clean way to write to a Memory in test code? (so using the write port isn't an option as that would take up cycles)
<moony> I assume prodding at _array?
<moony> uh oh
<moony> oh, no
<moony> my issue
<moony> successfully spent forever hunting down a bug that didn't exist
<moony> oops
<DaKnig> lkcl_ : thanks for sharing. what's the problem with using the same fragment many times? so what if it copies the logic, it's something I am sure tools can optimize out...
<DaKnig> that's common subexpression elimination, it's very common in compilers and I'm sure your backend would deal with it
<DaKnig> about memory issues, I totally get you. I really have the same problem. when working with big designs I have to have at least 3 files on screen at once (sometimes up to 6!) and my screen is tiny... with C/C++ I feel like having just the file I'm working on + the headers for other parts of the project is enough, and I usually include examples of usage in the headers. but of course in HDL it's completely different.
<DaKnig> using (sync|comb) instead of m.d.$1 is a good idea! I didn't think about it, but if the module only deals with one clock domain it makes total sense
<lkcl_> DaKnig: it's not necessarily guaranteed that yosys will optimise out repeated expressions
<DaKnig> I doubt it won't :)
<lkcl_> for example, on a "unit test" (small case) yosys successfully identifies a comparator-cascade and makes only 44 LUT4s
<DaKnig> if a few wires on the netlist have exactly the same drivers and same truth table, usually its optimized out
<lkcl_> however when that same module is *used* the result is 1,000+ LUT4s
<DaKnig> wow really? that's bad
<lkcl_> in other words, the easy cases, no problem
<lkcl_> but for the more complex patterns, the current hypothesis under investigation is that it's unable to identify the pattern
<lkcl_> after flattening, basically.
<lkcl_> just a word of caution not to rely on the (quite reasonable) expectation/assumption
<DaKnig> why is it so hard to notice that two wires in the netlist have the same expression assigned to them?
<DaKnig> many dependencies?
<lkcl_> the hypothesis that i have is that the substitution of expressions-into-expressions from yosys flatten causes it to no longer be capable of recognising the "outer" expression that it was, in the smaller (sub-module) case, perfectly well able to recognise
<lkcl_> a way to test that would be:
<lkcl_> for each identified tree-node module:
<lkcl_> optimise
<lkcl_> flatten
<lkcl_> repeat until completely flattened
<DaKnig> but even then you would still have subexpressions that can be merged, then merged again, all the way up, I'd assume?
<DaKnig> is that a fair assumption?
<lkcl_> well, the idea is with that approach that the expression that is repeated *within* a module is optimised and reduced down, *before* its inputs are multiply-substituted by "global flattening"
<moony> this is really strange, I have an FSM that's refusing to actually do anything, even if I make my start condition a simple dummy that sets itself to another condition, it'll always stay on condition 0. Will post my (poor) code in a min
<moony> all the other logic seems to work
<moony> i.e. the fetcher gets busy immediately while the CPU is hung
<lkcl_> happy to take a look, moony.
<moony> it never advances beyond this state. Ever.
<lkcl_>, not m.mode
<moony> bah
<lkcl_> :)
<moony> that should've been obvious
<lkcl_> weell...
<lkcl_> i'll not enumerate the number of things i missed that were obvious in hindsight :)
<lkcl_> moony: are you designing your own ISA?
<moony> mhm
<moony> came up with a simple(r) CISC that I could actually pull off
<lkcl_> one day i will implement an idea i came up with in 1990. 8-bit "escape-sequenced" instructions based on 2-bit bands, that gives an ISA very similar to the LZO compression algorithm
<lkcl_> without an escape-sequence to extend the "2-bit" of RA, it's only 2 bits.
<lkcl_> however _with_ a (first) escape-sequence RA becomes 16 bit register numbers
<lkcl_> another escape-sequence: 64-bit
<moony> went out of my way to make the CPU fetcher support async writes so I could just write and forget about it unless the write gets very stalled, so I could easily support the 6 different modes (Technically 12, as you need to be able to access the upper 8 registers and that's done by offsetting the "target" register for the mode by 8)
<lkcl_> bad-cpu is like a "proper" RISC, isn't it?
<moony> debatable
<moony> it's got decently complex op modes
<moony> i.e. the one I marked TODO is `(RA + RC * N + X), RB`
<moony> (and its inverse)
<moony> I wrote them down beforehand to make sure I didn't forget, but didn't commit that with the repo, oops
<lkcl_> that's... MAC-with-immediate-offset. nice.
<moony> ish. N is always a power of 2
<moony> as it is on x86
<moony> so it's cheap
<moony> shift and add
<lkcl_> makes sense for word/etc.-alignment
<lkcl_> are you considering doing PowerISA style "load with update"?
<moony> hm, right
<moony> i never modified the fetcher to handle offset read/writes
<moony> it currently only does aligned
<moony> yet another TODO
<moony> hmm, i'll look at that instr
<lkcl_> as in: after the load effective-address calculation, update RA with that as a result
<moony> I kinda just copied my personal favorite operand modes out of the VAX
<lkcl_> it means doing two writes
<lkcl_> two reg writes
<lkcl_> nice :)
<moony> Oh, yea, see store_back_addr
<moony> which is a bit messily handled as I wanted to avoid the extra clock cycle
<lkcl_> the efficiency saving from load/store-with-update comes when referencing a struct in a loop
<moony> its name is a misnomer and needs renaming
<lkcl_> because the 1st LD computes the address that you can then "add 12" (or whatever sizeof(struct)" to, to get the next struct
<moony> as it actually functions for both sides of the instruction
<lkcl_> where's store_back_addr?
<moony> in cpu
<lkcl_> ohh ok ah yeah i see
<moony> technically works for anything I like if I just set it during instruction execute
<lkcl_> so yes, looks like that's ld/st-with-update.
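The "load with update" being described can be modelled in a few lines of plain Python (a behavioural sketch, not HDL; the register file is just a list and `load_with_update` is an invented name): the effective address goes back into RA in the same instruction that writes the loaded data into RT.

```python
def load_with_update(regs, mem, rt, ra, offset):
    """Behavioural model of PowerISA-style ldu: two register writes,
    one for the loaded data and one for the effective address."""
    ea = regs[ra] + offset      # effective-address calculation
    regs[rt] = mem[ea]          # first write: loaded data -> RT
    regs[ra] = ea               # second write: EA written back to RA
    return ea

# walking structs in a loop: each load leaves RA pointing at the element
# just read, so the next access only needs "add sizeof(struct)" again
regs = [0] * 8
mem = {12: 111, 24: 222}
load_with_update(regs, mem, rt=1, ra=2, offset=12)
```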
<lkcl_> did VAX have separate address and data registers, like the 68000 and CDC6600?
<moony> nah, it's a CISC
<moony> a very, very, very CISC CISC
<lkcl_> :)
<moony> too CISC for me to actually copy
<moony> I just stole the operand notation and some basic operands and called it a day
<lkcl_> lol
<lkcl_> if you get this right it should be frickin quick.
<moony> i.e. (RA + RC * N + X) would be X(RC)[RA] for the VAX (Note N isn't there, it's specified by instruction width, while I can only do 32-bit accesses)
<moony> lkcl_: it's sitting at 0.5 IPC if I don't need to add another state
<moony> at best, obv
<lkcl_> there's almost nothing to the decoder
<lkcl_> oh i mean if you end up pipelining it
<moony> oh, yea
<moony> you're 100% right
<moony> although the DECODE_2 step is a bit more complex
<moony> or will be
<moony> as it has to handle all the 2 word variants (aka any instr with an immediate following it)
<lkcl_> the CDC6600's decode phase is laughably trivial. bits from the instruction literally go directly through a 3 bit binary-to-unary expander and end up as the "enable" lines on any one of 8 function units (pipelines)
<lkcl_> ahh yeh
<moony> if you'd like, I can give you a link to a VAX manual. It's... less than trivial
<lkcl_> ahh... :)
<moony> as in, so un-trivial they dumped the arch because 1 IPC was impossible
<moony> operands had to be resolved sequentially
<lkcl_> i am presently trying to focus on POWER9. my brain will explode if i try to include a new ISA
<moony> and they were all at least 1 byte
<moony> (RA + RC * N + X) as X(RC)[RA] was 3 bytes at best (1 byte X), 6 bytes at worst (4 byte X)
<lkcl_> argh
<moony> fun arch, great to program for, but absolute hell to implement
<moony> it's a 1977 mainframe arch though
<moony> is it really surprising
<lkcl_> i'm a huge fan of the CDC6600, by James Thornton and Seymour Cray
<moony> at least it contributed some amazing things to the world. Like Ethernet, the IEEE 754 32-bit and 64-bit formats, the BSDs, and several other things
<lkcl_> the OoO design solved problems that hadn't even been realised were problems
<lkcl_> yehyeh
<moony> another arch I have on my desk rn is the MC88100, which is a very traditional RISC
<lkcl_> i was able to log in to a microvax at imperial college, played aroudn with it
<moony> I have a habit of giving SHL/SHR the boot and putting MAK/EXT (bitfield instrs) in their place because of it
<lkcl_> ooo that's another one that Mitch Alsup was involved in
<lkcl_> ah you know the story behind POWER9's shift routines?
<lkcl_> mask/rotate?
<moony> I'm just a teen hobbyist who absolutely loves old PCs, and most of the time people don't know what the VAX is somehow. No, but i'm open to listening.
<moony> s/PCs/computers/
<lkcl_> short version: they were on a serious gate budget, so rather than have a separate shift/mask set of gates in LD/ST
<lkcl_> what they did was: micro-op LD/ST and SHIFT/MASK, and join the two together via broadcast buses based on the register numbers RA, RB, RC, RS and RT
<moony> oh, one good question: You possibly know why Bitsavers might take down a scan? The scan for the VAX Architecture Handbook (which I personally own) seems to have been removed from their site at some point in the past. Neat.
<lkcl_> so LD would do a 32 (64?) bit wide aligned LD, then rather than pass it straight to the regfile, pass it to shift/rot which *then* did the mask/insert, and *then* did the store to regfile
<lkcl_> bitsavers?
<moony> these guys. They've scanned countless old documents
<lkcl_> no idea
<lkcl_> wow
<moony> they have most DEC stuff scanned
<moony> same for Motorola stuff
<lkcl_> woooow
<moony> even have some of DEC's internal VAX design documents scanned which is impressive
<lkcl_> and Acorn RISC Machines (before they tried renaming to ARM)
<moony> this one was great help when I was working on a VAX emulator to learn emulation. (Was way more than I could chew, but I got it all the way to running unprivileged code successfully :D)
<moony> basically DEC's internal spec for the arch
<lkcl_> they've got apricot on there! i knew the son of the founder of that, at school, back in 1980
<moony> they have many, many things
<lkcl_> no CDC6600 though :)
<moony> yea, they have tons
<lkcl_> ok ok enough, i will be sucked in forever. wow, thank you though
<moony> np
<moony> it's a great resource
<moony> and I probably wouldn't have gone out and bought paperback copies of several VAX/PDP-11 things for learning purposes if it didn't exist, as I would've never even got to know the architecture (I initially learned the ISA through the scans)
<moony> it's a wonderful design, just not the most scalable :p
<lkcl_> :)
<moony> tbh, if I had the skill, I would absolutely try and make an FPGA VAX clone.
<moony> it'd obv have to be heavily microcoded to actually fit, just like the real-world arch :P
<moony> and I don't know anything about uCode design/theory/etc
<moony> I've got some info on the uCode for the VAX-11/780 on hand, but not enough to truly understand it
<lkcl_> micro-coding is very simple, you don't just do the "actual" op, you translate internally
<DaKnig> split your instructions into smaller common operations. implement those. then have a "translator" block as part of your fetching mechanism.
<DaKnig> that's not that hard :)
<lkcl_> so for example on POWER9, the microwatt team, instead of implementing add, sub, neg, addc, addex etc. etc.
<lkcl_> they have *one* operation, "OP_ADD"
<moony> VAX-11/780 has a 96 bit wide ucode op, and it barely pulls 1 instruction per 10 cycles, so there's some heavy complexity to it. Could also just be an intimidating fact to know though :p
<moony> hm
<lkcl_> and the instruction decoder says, "if this is subtract, then invert A and set carry-in to 1"
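That trick falls straight out of two's complement: RB - RA = RB + ¬RA + 1, so one adder serves both operations. A quick sketch (64-bit wrap-around assumed; `op_add` is an illustrative stand-in for the single micro-op, not microwatt's actual code):

```python
MASK = (1 << 64) - 1  # 64-bit wrap-around

def op_add(a, b, invert_a=False, carry_in=0):
    """One adder serving add and subtract: the decoder sets
    invert_a and carry_in=1 when the opcode is a subtract."""
    if invert_a:
        a = ~a & MASK           # bitwise NOT of RA
    return (a + b + carry_in) & MASK

assert op_add(3, 5) == 8                               # plain add
assert op_add(3, 5, invert_a=True, carry_in=1) == 2    # subf: 5 - 3
```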
* moony will think about it
<DaKnig> its not uncommon for microcode to be very wide
<DaKnig> at least when you count all the control bits flying around
<lkcl_> DaKnig: sigh. finding that out, the hard way
<lkcl_> 192 extra wires into some of the pipelines in LibreSOC
<moony> I think there's already the core idea behind ucode just sitting in my CPU design, various flags the EXECUTE, FETCH, DECODE, etc stages all pass to each-other for control could be bundled up into a control block
<DaKnig> look, I'm a beginner; it's just something I noticed from my own experience and from how I imagine other systems are implemented.
<DaKnig> it's not a bad thing - more info available means you can have less complex logic
<lkcl_> moony: exactly. that's micro-coding, basically.
<DaKnig> basically think about your instructions in such a machine as "compressed instructions"
<moony> "compressed"
<moony> also VAX: 120 byte instruction, yay!
<lkcl_> yay! :)
<Lofty> Itanium eat your heart out /s
<moony> (the architecture is capped at 255 bytes for an instruction, as well.)
<DaKnig> probably the ucode for that instruction is even longer :)
<lkcl_> "compressed", basically, you have a 16-bit to 32-bit ISA "map" of the most commonly-used instructions
<DaKnig> isn't that why people say that CISC has a RISC inside it plus a decoder?
<lkcl_> so you only fetch 16 bits, but internally you first expand it to an "equivalent" 32-bit opcode then put *that* into the decoder
<moony> my "bad" CPU (time to go rename it) just directly feeds a bunch of bits to the ALU for common ops
<moony> but for more complex ops maybe I should work out a way to dedup their logic
<lkcl_> which is extremely efficient, and is the whole basis of the "proper" RISC paradigm
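The "expand first, then decode" idea can be sketched like this (every field layout and opcode value below is invented purely for illustration; it is not any real compressed encoding): a 16-bit instruction is mapped to its 32-bit equivalent before reaching the one and only decoder.

```python
# Hypothetical 16-bit "compressed" formats: 2-bit major opcode,
# two 3-bit register fields, 8-bit immediate. All values made up.

def expand16(insn16: int) -> int:
    """Map a 16-bit compressed instruction to a 32-bit one, so the
    main decoder only ever sees the 32-bit encoding."""
    op  = insn16 & 0x3          # 2-bit major opcode
    rd  = (insn16 >> 2) & 0x7   # 3-bit register fields expand into
    rs  = (insn16 >> 5) & 0x7   # the wider 32-bit register slots
    imm = (insn16 >> 8) & 0xFF
    if op == 0:                 # compressed ADD -> full ADD
        return 0x33 | (rd << 7) | (rs << 15) | (rd << 20)
    if op == 1:                 # compressed ADDI -> full ADDI
        return 0x13 | (rd << 7) | (rs << 15) | (imm << 20)
    raise ValueError("escape to the full 32-bit encoding")
```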
<lkcl_> :)
<lkcl_> love the chat - really have to get back to fixing a LD/ST FSM...
<moony> yea.. the only "complex" to decode instr is any instr using the (RA + RC * N + X) mode, and that's only as the effective address gen will need to split out some more bitfields
<moony> I came up with something simple, and I'll stick to it for now to avoid feature creep :p
<DaKnig> why is this complex?
<moony> see the quotes.
<moony> ""complex"" (it's not)
<DaKnig> if you have hardware mul, you can just turn it into MAC, add, deref/do whatever
<moony> it's a shift-add + deref, yep
<DaKnig> the back propagation net in the pipeline should take care of the rest
<moony> simple too
<moony> in an earlier version of the arch I was considering making whatever NOP instr I decide on "zero cycle" in that the decode will immediately move to the next half of the 32-bit word it currently has, but that's not necessary anymore with my smarter fetcher design (which, among other things, will pre-emptively queue up the next 4 words)
<moony> s/4/3 as the 0th is the one it's currently giving to the CPU :p
<moony> anyways, question: What might be a good way to handle converting an unaligned load/store request into multiple aligned ones?
<moony> It's probably easy, but I only know the software way of emulating that behavior that you p much only use when you're trying for cycle accuracy in an emulator :p
<DaKnig> if you load 8 bits when the bus is 64bits wide (for example), you can load the whole 64 bits, then have a word_select thing (aka a mux)
<lkcl_> moony: well... i can describe the way that we're doing it for LibreSOC
<moony> i'm trying to convert a 32-bit unaligned load/store to an aligned one
<moony> sure
<lkcl_> however it's designed for serious throughput
<DaKnig> if the request loads 16 bits that are unaligned with said 64bits wide data bus, you can load the (rounded down) address and the one right after it, combine them and *then* use word_select
<lkcl_> basically, take the lower bits of the address, and the ld/st-length, and turn them into a bytemask
<moony> lkcl_: considering I was somehow able to make my fetch unit not be garbage, I probably have some small chance of understanding
<lkcl_> effectively, just like wishbone "sel" when you set 8-bit granularity
<DaKnig> lkcl_: question, what about my design that I just described doesnt give you enough throughput?
<lkcl_> DaKnig: we're planning an advanced (simultaneous multi-ld/st) version of that, in effect
<DaKnig> I see.
<moony> my fetcher could be better (it could simply never drop READ_EN when doing bulk reads) but it's already fast enough as is and when working with SRAM that can respond the next cycle the CPU can't out-pace it
<DaKnig> do you check if you can avoid unaligned access to save on bandwidth?
<lkcl_> so now you can think of those LD/STs as just *literally* being like wishbone 64-bit requests plus an 8-bit mask of "sel"
<lkcl_> DaKnig: no, what we do is:
<lkcl_> * use the bottom 4 bits of the address (0-15)
<moony> I assume the underlying code for LibreSoC isn't public yet?
<lkcl_> * create a 16-bit mask
<lkcl_> * split it into two halves (2 separate 64-bit requests, each with their own 8-bit wishbone-style "sel")
<lkcl_> * do *BOTH* of those simultaneously
<moony> oh, no
<moony> it is
<lkcl_> moony: yes
<DaKnig> how wide is the data bus?
<lkcl_> a whopping 256 bits.
<moony> just on your own git instead of github or similar, alright
<lkcl_> 4x 64-bit
<lkcl_> moony: yes. because we take the "libre" bit seriously
<moony> 256b seems p normal in hindsight, at least knowing how big modern CPU busses usually are
<lkcl_> moony, yeah. in GPU terms (those that use GDDR5) it's peanuts
<moony> i.e. Zen 2's internal bus is 512-bit iirc, while the external is just raw DDR4 memory lanes (as it's a SoC)
<DaKnig> and after that you just concat and select with the bottom 4 bits
<DaKnig> with the mask created by* the bottom 4 bits
<DaKnig> right?
<lkcl_> that makes sense, given if you want to handle 4k
<lkcl_> yes, once you have those 2x 64-bit aligned requests, some simple masking followed by shift/concatenate, and you're done
<lkcl_> in the "first iteration" we're thunking down those 2x 64-bit requests onto the same internal 64-bit wishbone bus
<lkcl_> so they can't happen simultaneously (ever)
<DaKnig> you might save on one of those requests tho with a bit of extra logic
<lkcl_> exactly, yes.
<moony> I'm getting an FPGA board soon, a ULX3S, so I'll probably get to enjoy the wonders of writing an sdram driver soon.
<DaKnig> not sure how that's gonna make it faster tho
<moony> ECP5-85F, so I can use Yosys and be happy
<lkcl_> if the mask (wishbone 8-bit "sel") is zero for either of the 2 64-bit requests, then you don't need to do the request at all.
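The scheme just described can be sketched in plain Python (`split_unaligned` is an invented name, and the access is assumed to fit the 16-byte window): the low 4 bits of the address plus the access length become a 16-bit bytemask, which splits into two 64-bit-aligned requests each carrying a wishbone-style 8-bit "sel"; a half whose sel is zero is skipped entirely.

```python
def split_unaligned(addr: int, length: int):
    """Turn one possibly-unaligned access (length <= 8 bytes) into up
    to two 64-bit-aligned requests, each with an 8-bit byte-enable."""
    lo = addr & 0xF                         # bottom 4 bits (0-15)
    assert lo + length <= 16, "access must fit the 16-byte window"
    mask16 = ((1 << length) - 1) << lo      # 16-bit bytemask
    base = addr & ~0xF                      # 16-byte-aligned base
    reqs = []
    for half, sel in ((0, mask16 & 0xFF), (1, (mask16 >> 8) & 0xFF)):
        if sel:                             # skip a request whose sel == 0
            reqs.append((base + 8 * half, sel))
    return reqs

# aligned 4-byte access: only one request survives
print(split_unaligned(0x1000, 4))   # [(0x1000, 0b1111)]
# 4-byte access straddling the 8-byte boundary: two requests
print(split_unaligned(0x1006, 4))   # [(0x1000, 0b11000000), (0x1008, 0b00000011)]
```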
Sarayan has joined #nmigen
<lkcl_> moony: i knooow, i'm jealous :)
<moony> You probably need a larger/more capable board to work on the SoC :P
<lkcl_> i have a versa-ecp5 and it's... ok.
<moony> oh, huh
<lkcl_> (45k LUT4s). at the moment because we're limiting what's being done, we're at 16k LUT4s.
<moony> lkcl_: ah. I might tinker with it sometime then, as it'll fit nicely and that'd be coooool
<lkcl_> :)
<DaKnig> the only thing stopping me from getting an ecp5 board is that I didn't see any with enough IO for my needs. ones with HDMI, PLLs and DDR memory built in are rare... I can only get 7-series ones for my price range and those requirements
<moony> I'm probably not knowledgeable enough to help though.
<lkcl_> the litex sim works fine.;a=blob;f=src/soc/litex/florent/;hb=HEAD
<moony> considering all I've done for the past forever is just computer programming, I'm just now learning hardware design :p
<moony> this CPU is my first ever serious project
<moony> and I have no idea why it's going so well lol
<lkcl_> i'm currently fighting with the LD/ST FSM, which is being... odd and obtuse
<lkcl_> lol
<moony> i.e. how the heck did I manage to make the fetcher work so well
<lkcl_> probably _because_ you've done programming
<DaKnig> making useful (simple) CPUs is not that hard; compare that to, say, compilers...
<moony> getting stuff running nicely async is still a bit new to me
<lkcl_> DaKnig, yyeah, daveshah's ones are really good and have everything
<lkcl_> except he doesn't get much in the way of demand in order to turn it into a business
<moony> once I finish this design i'll probably just start a new design that's more complex :p
<lkcl_> at least he libre-licenses the full CAD files of what he does
<moony> ...I wonder how hard a PDP-11 soft-core would be
<moony> might look at that
<moony> most of the complexity is, like VAX, from the operand modes
<moony> hardware complexity that is
<lkcl_> DaKnig, so you could, hypothetically, do your own PCB run of his ECP5 boards
<lkcl_> PDP-11, the precursor to the 68000, right? i think it has those separate address and data registers
<moony> I have a big ol pile of DEC manuals I want to put to use
<moony> nah, PDP-11 is the VAX's precursor
<moony> it heavily inspired multiple MCUs though
<moony> probably 68k too
<moony> it doesn't have address registers iirc?
<lkcl_> the 68k (68000) was designed by Mitch Alsup. he was - still is - a fan of address/data registers because he studied the CDC 6600
<moony> unless i'm missing something about how it handles the extra 8 registers (Like my bad-cpu, it needs a different instruction mode/something to access the upper 8 registers)
<moony> they are GPR though, just checked
<lkcl_> i can't remember. it was... 1990 when i last looked at 68000 :)
<DaKnig> I ... really don't wanna make my own pcb. at some point, maintaining *all* of your stack is annoying. gotta use others' work
<DaKnig> less time consuming, therefore cheaper
<moony> my PDP-11 Architecture Handbook is probably the most grimy of the set I have here, it's clearly been used a good bit :p
<lkcl_> DaKnig: sigh yehhh. i'd really like one of daveshah's more powerful ECP5 boards, too
<moony> lkcl_: yea, i'll almost definitely play with the SoC a bit when I get the board. It sounds fun
<moony> where's the main CPU? shakti-core?
<lkcl_> the OpenPOWER community's pretty nice.
<lkcl_> ok that's the litex simulation (you need to generate the verilog file first though)
<lkcl_> this is the "simple" FSM - the main instruction issuer loop
<lkcl_> it's temporary because i'm focussing on getting the instructions right, first
<moony> I should learn how pipelines truly function
<moony> yea, makes sense
<moony> something simple to test against
<moony> if my ""bad"" CPU design comes out well, I might try pipelining it
<lkcl_> pipelines are just some combinatorial logic that's joined with some clock-synchronised "registers"
<lkcl_> registers/latches
<moony> so basically the relationship my fetcher has with my CPU core? :P
<lkcl_> so where the gates would normally ripple and not stabilise before the clock goes "ping"
<lkcl_> you capture *partial* results in "latches"
<lkcl_> then continue on processing on the next cycle, in a new combinatorial block
<lkcl_> the "pipeline" bit is that you allow the 1st stage to start a *new* result whilst the 2nd stage is completing the next part
<lkcl_> extend it to 3-stage, 4-stage, however-many-you-want stage
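The structure just described (combinatorial blocks separated by clocked latches, with stage 1 starting a new item while stage 2 finishes the previous one) can be modelled cycle by cycle in plain Python; the stage functions here are arbitrary placeholders:

```python
def simulate_two_stage(inputs):
    """Cycle-accurate toy model of a 2-stage pipeline. stage1/stage2
    are the combinatorial blocks; r1 is the clocked register ("latch")
    that captures the partial result between them."""
    stage1 = lambda x: x + 1            # first combinatorial block
    stage2 = lambda x: x * 2            # second combinatorial block
    r1 = None                           # pipeline register between stages
    outputs = []
    for cycle_in in inputs + [None]:    # one extra cycle to drain
        if r1 is not None:
            outputs.append(stage2(r1))  # stage 2 consumes the OLD r1...
        r1 = stage1(cycle_in) if cycle_in is not None else None
        # ...while stage 1 captures a NEW partial result in the same
        # cycle: that overlap is exactly what "pipelined" means
    return outputs

print(simulate_two_stage([1, 2, 3]))    # [4, 6, 8]
```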
* lkcl_ goes and reopens bad-cpu
<moony> you could theoretically apply that idea to the VAX, but you'd need a huge, x86-64 rivaling pipeline.
<lkcl_> indeed
<moony> and even then, complex instructions like `ADDL2 4(R1)[r2], 4(R3)[r2]` would be a pain in the butt
<lkcl_> so, if the fetcher FSM can process one instruction *at the same time* as the ALU is munching on the previously-fetched instruction, then yes, this is termed a "pipelined" design
<Sarayan> the fun part being managing the shared resources, like the memory port shared between instruction fetches and instructions accessing memory
<moony> it indeed can, the fetcher procures instruction words asynchronously from the rest of the cpu core
<lkcl_> Sarayan: had a loootta fun with that, a couple days ago...
<moony> again, comparing to VAX, managing shared resources on it would be... not fun.
<moony> 12 registers in one instruction is the max.
<moony> aka basically the entire GPR file
<moony> as the last 4 have special uses per the cc
<lkcl_> moony: ah ok, so although you may have designed the fetcher FSM to be pipelined, here's the thing: if the fetcher FSM can only fetch one instruction every 2 to 3 clock cycles, then the ALU "pipelines" are going to be idle.
<lkcl_> yeah?
<moony> well, technically
<moony> but it helps that the CPU core takes at least 2 cycles per instr
<moony> and an instr is only half a 32-bit word
<lkcl_> the "solution" to that is to read 2, 3, 4 or 8 instructions
<lkcl_> ahh :)
<moony> the fetcher fetches 32 bits at a time
<moony> so, as I mentioned earlier, the fetcher will always keep up with the CPU if the external memory is fast enough
<lkcl_> ok so you can... ah, yes, *now* you can feed the 1st 16 bits on 1 cycle and the 2nd 16 bits on the next
<lkcl_> and during the "off" cycle, it fetches *another* 32 bits, right?
<moony> yep
<lkcl_> cool! then you've got a pipelined design
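The fetch scheme being described — one 32-bit fetch supplying two 16-bit instructions, so the fetcher stays ahead of the core — can be modelled in a few lines of plain Python (names and memory contents are made up; this is a behavioural sketch, not nMigen):

```python
# Behavioural model of the fetch scheme above: each 32-bit memory word
# holds two 16-bit instructions, so one fetch can cover two issue
# cycles and the fetcher keeps the core fed at one instr per cycle.

def fetch_stream(memory_words, cycles):
    """Yield one 16-bit instruction per cycle from 32-bit fetches."""
    buffer = []           # 16-bit halves fetched but not yet issued
    addr = 0
    issued = []
    for _ in range(cycles):
        # refill when the buffer is running low (the "off" cycle)
        if len(buffer) <= 1 and addr < len(memory_words):
            word = memory_words[addr]          # one 32-bit fetch
            addr += 1
            buffer += [word >> 16, word & 0xFFFF]
        if buffer:
            issued.append(buffer.pop(0))       # issue one instr/cycle
    return issued

words = [0xAAAABBBB, 0xCCCCDDDD]               # made-up program words
print([hex(i) for i in fetch_stream(words, 4)])
```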
<moony> I'm going to try and make it so it doesn't have to disable READ_EN if it has more to fetch, which should make it even faster
<moony> well
<moony> fast enough it doesn't even matter lol, it'll just save some contention in edge cases
<lkcl_> i mean, it may only be a 2-stage, but it's still pipelined
<lkcl_> now as Sarayan says: if you can get the register reads / writes to not corrupt, and operate on a *3rd* cycle, now you have a 3-stage pipeline
<moony> once again I have no idea how I pulled this off with no real prior knowledge of CPU design beyond "hey pipelines exist"
<lkcl_> lol. the thing is: now you run into problems with instructions trying to read results that aren't ready yet
<lkcl_> you put 2 instructions:
<lkcl_> mul r1 <- r2, r3
<lkcl_> add r5 <- r1, r1
<moony> yeeep
<lkcl_> the result of the mul takes 3 cycles... you need to decide:
<DaKnig> lkcl_: are you assuming multi cycle mul?
<moony> at that point i'd have to use, say, a scoreboard. I only know what a scoreboard is because the MC88100 manual explains it lol
<lkcl_> a) do i care? :)
<DaKnig> ah.
<DaKnig> cant you have 1-cycle mul in your FPGA?
<DaKnig> DSP slices are quite fast
<DaKnig> certainly faster than 50MHz I think
<lkcl_> DaKnig, not necessarily, just the fact that the design *is* pipelined (even 2 stages) is enough
<moony> also the problem of jumps, which have to eat a cycle as they're tricky to pipeline
<lkcl_> if you're interested i can email you Mitch Alsup's book chapters on scoreboard design
<DaKnig> moony: I would really suggest to find some computer architecture design course online. it should answer all your questions and teach you much more
<lkcl_> in-order designs, they're much simpler: you "stall".
<DaKnig> just have a good predictor :)
<lkcl_> if that r1 hasn't been written yet, (the mul followed by add), you simply stall the instruction issue
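The in-order "stall" scheme for the mul/add pair above can be sketched as a toy issue loop: hold an instruction at issue while any of its source registers still has a result in flight. The instruction format and the 3-cycle mul latency are invented for illustration; real scoreboards track considerably more state.

```python
# Minimal model of in-order stall-on-RAW-hazard: an instruction that
# reads a register still being written by an earlier multi-cycle op
# (the mul followed by add) stalls at issue until the result lands.

MUL_LATENCY = 3  # hypothetical: cycles until a mul result is written

def issue(program):
    pending = {}          # dest reg -> cycles until its value is ready
    trace = []            # (issue_cycle, op) for each instruction
    cycle = 0
    for op, dest, srcs in program:
        # stall while any source register has a result in flight
        while any(r in pending for r in srcs):
            pending = {r: c - 1 for r, c in pending.items() if c > 1}
            cycle += 1
        trace.append((cycle, op))
        if op == "mul":
            pending[dest] = MUL_LATENCY
        pending = {r: c - 1 for r, c in pending.items() if c > 1}
        cycle += 1
    return trace

prog = [("mul", "r1", ("r2", "r3")),
        ("add", "r5", ("r1", "r1"))]
print(issue(prog))  # [(0, 'mul'), (3, 'add')] - the add stalls 2 cycles
```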
<lkcl_> DaKnig: yeah, predictors are... fuuun.
<lkcl_> an in-order "stall" system is basically a degenerate Scoreboard Matrix of width and height 1
<DaKnig> why stall? having a network that returns the new result back to the system would be much better
<DaKnig> ... no?
<lkcl_> DaKnig: it's... complicated. it took me 5 months of talking with Mitch Alsup on comp.arch to fully understand scoreboards
<DaKnig> scoreboards are the term? hm
<lkcl_> there's 2 basic designs
<DaKnig> I had a very simple pipeline with that, I just kept track of what regs are used where in the pipeline
<lkcl_> Tomasulo, and Scoreboards
<DaKnig> ok will watch. thanks
<Sarayan> alternatively, the toshiba sh2 has a well-described 5-stage pipeline that can be seen as an example of practical implementation, given the manual indicates all the interactions requiring interlocks
<lkcl_> ah that's an in-order-style solution, which, if you're not careful (i.e. you don't stall), will get you data corruption - i can't tell you exactly why offhand
<lkcl_> Sarayan, you mean, it defines the timing and the compiler / assembly-writer has to respect that?
<lkcl_> there's historic and current designs which successfully do that, like the TI VLIW DSPs
Asuu has joined #nmigen
Asu has quit [Ping timeout: 265 seconds]
<Sarayan> Page 168+
<Sarayan> they describe in detail how opcodes are executed so that compiler writers/assembly programmers can use the processor efficiently
<lkcl_> nice. makes for very simple hardware
<lkcl_> the first SPARC processors did this, iirc. the first compilers? simply inserted a bunch of NOPs... :)
<lkcl_> urrrr
<Sarayan> all the interlocks are automatic in the sh, but the description is so complete that it tells you what you should look at when making your own
<lkcl_> i remember my colleagues doing assembly-level programming of TI DSPs, in 1994. CEDAR Audio. there were only 1024 clock cycles available per 24-bit audio sample and the compiler wasn't efficient enough
<Sarayan> oh yeah, that must be fun
<lkcl_> now of course they can just use the main processor (doh)
<lkcl_> this was like 386sx16-->486dx25 with Windows 3.1 "locking up interrupts" days, whereas the TI DSPs were 50 MFLOPs sustained
<Sarayan> trivia: the mu100 synthesizer integrated effects DSP runs 768 instructions per sample, and afaict had no branching
<lkcl_> ooOoo :)
<Sarayan> I guess all instructions are single-clock
<Sarayan> external memory access (for reverb ram) is 3 clocks, memory instructions are at addresses multiple of 3. Even more amusing, on reads, the instruction that does something with the result is at 3n+2
<Sarayan> a very synchronous dsp
<Sarayan> instructions are vliw too, because why not
<moony> i've always wondered if there's more ways to make our fundamentally in-order CPU designs more efficient by, well, dumping a bit of said order :P
<moony> I know, for example, Mill is working on smth like that
<moony> something I do kinda miss in modern computers is also the idea of having more specialized hardware for tasks. Where's our DSPs? :(
<moony> everything's loaded onto the main CPU, even if it turns out to be one of the least efficient ways to do a task
<moony> GPUs exist obv, but they're really the only big co-processor in our computers :P
<lkcl_> i've spent some time on comp.arch, learning about the Mill. it's extremely cool.
<MadHacker> Well, GPUs make fairly decent DSPs; bearing in mind communications overhead with other devices, I'm not sure that other copros have much to offer any more.
<lkcl_> it only has "ADD" and "MUL" (not ADD8, ADD16, ADD32, ADD64, ADD-signed blah blah)
<MadHacker> Also there's a lot of very specialised ones you never really see, like in NICs and the like.
<lkcl_> the "width" (and type) is taken from the LD operation and carried right the way through even to ST
<lkcl_> moony: this is what ARM SoCs advocate, having specialist blocks for AES, Video, etc.
<moony> mhm
<Sarayan> lkcl: the issue tends to be if/how it interacts with the cache
<Sarayan> if your aes block flushes the cache because it's a dma, well, reloading it afterwards kills all the gain usually
<lkcl_> yeah
<lkcl_> oh interesting
<lkcl_> of course
<Sarayan> that makes accelerators hard
<lkcl_> and the software gets more complex (DMA, userspace-kernelspace)
<lkcl_> it's why we chose libre-soc to be a hybrid CPU-VPU-GPU. actually extending POWER9 to include sin, cos, texture interpolation, yuv2rgb and so on
<lkcl_> i don't know if you've seen how normal GPU architectures handle the software side - it's mental :)
<lkcl_> inter-process communication and synchronisation of multi-megabyte data structures in shared memory!
<lkcl_> that has to involve userspace-kernelspace-interprocessor_bridge-kernelspace-userspace interaction
<lkcl_> mmmental
Asuu has quit [Read error: Connection reset by peer]
emeb has joined #nmigen
<Sarayan> plus massive parallelism
<lkcl_> ah there is that :)
Asuu has joined #nmigen
<moony> lkcl_: yea, GPUs are nuts
<moony> and imo i'd absolutely love that kind of massive parallelism for some tasks
<moony> a CPU that has a large number of "little" cores designed for high parallelism would be fun
<moony> (and some "big" cores for single-thread/dual-thread tasks)
<MadHacker> Isn't that just a CPU with an integrated GPU? :D
<Sarayan> You mean a recent intel gpu with integrated graphics? ;-)
<Sarayan> mwahaha MH
phire has quit [Remote host closed the connection]
<moony> p much :p
phire has joined #nmigen
<MadHacker> The Intel knight's [landing, corner, whatever] series were an x86-flavoured variation on that theme.
<MadHacker> Lots of small x86 cores running in parallel.
<MadHacker> Of course, you can get 64 cores in a mainstream CPU now, so it's not THAT much parallelism.
<moony> alright, got bad-cpu executing instructions again (I scrapped the old fetcher and it was too tightly integrated so I had to just snip most of the core), and now it's running at 0.5 IPC
<moony> yay
<Sarayan> yeah, a good pipeline should get you almost 1ipc mean
<moony> yea. This is only a 2-stage pipeline though
<Sarayan> superscaling is yet another kettle of fish
<moony> I'll finish this design then design a good pipeline :P
<Sarayan> are you load-store?
<Sarayan> (most everything in register, dedicated instruction for load or store from/to ram)
<moony> no, loads/stores are handled with operand modes
<Sarayan> hmmm
<Sarayan> there's a fair chance you can't really go over 0.5ipc then
<Sarayan> fetch is going to collide with instruction memory access
<moony> actually
<Sarayan> unless you go harvard, of course, or are really good with I$
<moony> it won't. instrs are 16-bit, so the CPU will likely have the next instr words pre-fetched, and the fetcher will prioritize, well, anything that isn't an instr fetch.
<moony> under normal operation rn, the fetcher fetches the next word every other cycle for normal form instructions
<moony> though
<moony> with a bigger pipeline that'd probably change
<Sarayan> how wide is your bus?
<moony> 32-bit external
<moony> so it's usually ahead of the CPU
<Sarayan> the sh2 has 16-bits instructions and a 32-bits internal bus, so it actually fetches every other instruction
<moony> tl;dr similar to the SH2, unless of course a 2 word instruction is being executed
<Sarayan> it's not really speculating, just reading with a 32bits granularity
<Sarayan> you have 2-word instructions?
<moony> only used when an instr has an immediate attached
<moony> they, of course
<moony> slow down the CPU a bit
<Sarayan> yeah, sh2 doesn't have that
<moony> mhm
<moony> alright, finally running instructions properly. Only the register/register ones though
<moony> next up, make FETCH work
<Degi> Is there something faster than using if/else for MUXing 2 signals?
<moony> presumably, Mux()
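In nMigen, `Mux(sel, a, b)` is an expression form of the same 2-to-1 multiplexer that `with m.If(sel):` describes as statements; both should synthesize to equivalent hardware, so the win is source brevity rather than speed. A plain-Python model of the semantics (not real nMigen; the nMigen forms are shown only in comments):

```python
# Behavioural model of nMigen's Mux(sel, a, b): yields a when sel is
# truthy, else b - the same multiplexer the If/Else form describes,
# just as a single expression.
#
#   m.d.comb += out.eq(Mux(sel, a, b))   # expression form
#
#   with m.If(sel):                      # equivalent statement form
#       m.d.comb += out.eq(a)
#   with m.Else():
#       m.d.comb += out.eq(b)

def mux(sel, a, b):
    return a if sel else b

print(hex(mux(1, 0xAA, 0x55)))  # 0xaa
```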
<moony> yay, load store works
<moony> now for one last puzzle piece, a working external bus that isn't a thunk :p
<moony> what's a good way to handle a ROM in nMigen?
hitomi2507 has quit [Quit: Nettalk6 -]
<Degi> Maybe a RAM with data already filled in?
<Degi> you can pass 'init=' to a Memory
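Degi's suggestion — a ROM is just a `Memory` whose contents come from `init=` and whose write port is never used — sketched below. The nMigen construction is shown in comments (names like `pc` are placeholders; check the nMigen docs for exact port semantics), with a plain-Python behavioural stand-in underneath:

```python
# Sketch of a ROM in nMigen: a Memory preloaded via init=, read-only.
# Hypothetical nMigen fragment (note the default read port is
# synchronous, so data appears one cycle after the address):
#
#   rom = Memory(width=16, depth=len(contents), init=contents)
#   rdport = rom.read_port()
#   m.submodules.rdport = rdport
#   m.d.comb += rdport.addr.eq(pc)     # result later on rdport.data
#
# Plain-Python behavioural stand-in for quick testing:

contents = [0x1234, 0xABCD, 0x0000, 0xFFFF]   # made-up program words

def rom_read(addr):
    return contents[addr]

print(hex(rom_read(1)))  # 0xabcd
```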
<sorear> Can we reserve this for conversations at least somewhat relevant to nmigen?
<moony> sorear: I think discussing a CPU being written in nmigen, over an hour ago, is perfectly fine.
<sorear> it’s not about how long ago it was, it’s about how many pages of scrollback you take up
<lkcl_> sorear: apologies
Asu has joined #nmigen
Asuu has quit [Ping timeout: 240 seconds]
lkcl__ has joined #nmigen
lkcl_ has quit [Ping timeout: 265 seconds]
Asuu has joined #nmigen
Asu has quit [Ping timeout: 264 seconds]
Lord_Nightmare has joined #nmigen
<_whitenotifier-3> [nmigen] pbsds opened pull request #481: hdl.ast.Value: Add .to_signed() method -
Asuu has quit [Ping timeout: 240 seconds]
Asu has joined #nmigen
emeb_mac has joined #nmigen
<Degi> What is the best way to find out time-consuming things?
<vup> `python3 -m cProfile -s time` or maybe `-s tottime`?
<Degi> Ah I mean taking up time as in long carry chains etc. since my thingy only compiles to like 300 MHz
<vup> ah
<vup> although 300MHz doesn't sound too bad
<vup> what are you using for pnr?
<Degi> nextpnr ecp5
<Degi> Hm, AsyncFIFO(Buffered) seems to be slow
<Degi> Without that it compiles to 1100 MHz
<daveshah> nextpnr is being somewhat optimistic here, as it doesn't take into account the fact the global clock tree is only rated to 370MHz
<Degi> Hm, in practice it works at 800+ MHz
<Degi> (not this example, but the clock tree itself, in that case it was some 30 bit counter or so)
SpaceCoaster has quit [Quit: ZNC 1.7.2+deb3 -]
SpaceCoaster has joined #nmigen
<daveshah> Is this a 1.2V part by any chance?
<Degi> yes
<daveshah> I think that spec wasn't updated accordingly, so it is not surprising it can go a lot higher
<Degi> And what is the limit on clock domains?
<Degi> Like on the number of them
<daveshah> 16
<Degi> hmh okay
<daveshah> In theory up to 64 with cleverer placement but nextpnr's global code would need some changes to support this
<Degi> That would be nice heh
<daveshah> I'd rather work on cross clock constraints first, that would probably be more useful
<daveshah> At the moment actually using even a few clock domains with complex crossings is quite annoying
<daveshah> Out of curiosity, why are you needing more than 16 domains?
emeb has quit [Ping timeout: 240 seconds]
<Degi> Somehow something is broken and that led to me making gateware which takes 4 clock domains for 1 data lane...
emeb has joined #nmigen
<Degi> And there can be up to 4 data lanes. Maybe I can optimize that to 2 clock domains or even combine clock domains of lanes (for later), a problem was that the SERDES gearing seems to be broken
Asu has quit [Quit: Konversation terminated!]
lkcl_ has joined #nmigen
lkcl__ has quit [Ping timeout: 240 seconds]
<_whitenotifier-3> [YoWASP/yosys] whitequark pushed 3 commits to release [+0/-0/±3]
<_whitenotifier-3> [YoWASP/yosys] github-actions[bot] 7fc68da - Update upstream code
<_whitenotifier-3> [YoWASP/yosys] github-actions[bot] a38714b - Update upstream code
<_whitenotifier-3> [YoWASP/yosys] github-actions[bot] 7017d1c - Update upstream code
<_whitenotifier-3> [YoWASP/nextpnr] whitequark pushed 3 commits to release [+0/-0/±4]
<_whitenotifier-3> [YoWASP/nextpnr] github-actions[bot] a390d11 - Update upstream code
<_whitenotifier-3> [YoWASP/nextpnr] github-actions[bot] b6894f6 - Update upstream code
<_whitenotifier-3> [YoWASP/nextpnr] github-actions[bot] 5385082 - Update upstream code