<d1b2>
<emeb> Isn't it true that all clock domains in a design are part of self, and the clock and reset signals associated with them are accessible?
<tannewt>
oh, initialized by the super class?
<d1b2>
<emeb> well, I think I misspoke saying "all clock domains", but for any module that's instantiated, the higher level hooks up the clock domain, and the clock & reset signals are available if you need them.
<tannewt>
huh, interesting. I'd expect it to be passed in
<tannewt>
(obviously I haven't used it much)
<d1b2>
<emeb> but usually you don't since they're implicit when you use .sync process
<tannewt>
right, you assume most things are using the common clock
<d1b2>
<emeb> things get interesting when there's multiple domains and signals crossing of course - that's the topic of a lot of discussion here.
<tannewt>
ya, I've seen that but haven't experimented with it yet
<tannewt>
I know it from the "I've worked with a SAMD micro that does it" standpoint but not the HDL level
<d1b2>
<emeb> Right - there are a lot of hardware patterns that are used to ensure good behavior in those cases. AIUI nmigen is working to make a lot of those available easily.
<tannewt>
nice! I love the nmigen primitives
<tannewt>
is it usually done at the register level or peripheral level?
<tannewt>
I'm probably thinking too high level
<d1b2>
<emeb> Yeah - special structures to keep pulses from getting lost or getting stretched, special ways to make sure you only get out what's put in, etc.
<d1b2>
<emeb> async FIFOs figure heavily. Lots of handshaking. etc.
<tannewt>
cool cool. I haven't gotten that low level yet
<whitequark>
tannewt: so, most signals are eagerly bound, meaning you have to possess a reference to the actual Signal object to do anything with them
<whitequark>
clock domains are special in that they are late bound
<tannewt>
at elaboration time?
<whitequark>
yeah
<whitequark>
or rather, *after* elaboration
<whitequark>
well
<tannewt>
🤯
<whitequark>
it's a combination of both, i guess
<tannewt>
both during and after? or before?
<whitequark>
you're explicitly binding them by manipulating m.domains, and then nmigen binds them through the hierarchy for you
<tannewt>
and `sync` is the default domain right?
<whitequark>
pretty much
<whitequark>
for most part, `sync` is a convention
<whitequark>
it's never treated specially other than by serving as a default name
<tannewt>
makes sense. how are different clocks mapped to multiple domains?
<whitequark>
can you elaborate?
<moony>
is there a clean way to set a Signal to the current state of an FSM?
<moony>
eh, nvm
<moony>
just noticed it's in the signal viewer as FSM_STATE
<lkcl_>
DaKnig: one "trick" (or two) that i learned is possible with nmigen, which helps keep line lengths to below 80 chars is:
<lkcl_>
1) use functions, passing in expressions and parameters (i'll send a link to a file i did that, in a mo)
<lkcl_>
2) assign AST sub-expressions to python variables then use those on the next line
<tannewt>
wq, I'm wondering how you map two clock domains in an elaborate call
<lkcl_>
here's an example of a python function which was created by taking the contents of a VHDL "case" statement out:
<lkcl_>
the radix_read_wait() function just above it, you can see how many If If If indentations there are
<lkcl_>
if those had also been inside the original Case statement at line 384, which is in *yet another* m.If, you can see quite easily how the indentation builds up and gets completely out of hand
<lkcl_>
and yet it is perfectly reasonable to expect to have just one Switch statement - not even two *nested* nmigen Switch Statements - then do maybe one or two m.If indented pieces of work
<lkcl_>
by using a python function for the majority of the Switch work, you can "go back" to a clean indent level.
<lkcl_>
plus, i think the Switch statement looks a lot more understandable and readable because you don't have to scroll page-up, page-down dozens of times to take in all the Cases
<lkcl_>
the second trick: using a python variable to store AST fragments: this is *usually* something that's not recommended
<lkcl_>
because people tend to use those fragments multiple times, not realising that it's literally going to insert that exact same AST into the yosys output
<lkcl_>
but if you use it carefully, and make sure that each python variable is used *once*, i've found that it's a really good way to stay below 80 chars.
<lkcl_>
example, at line 205 (which is annoyingly unreadable)
<lkcl_>
do this instead:
<lkcl_>
data01 = data[1] | data[0]
<lkcl_>
followed by
<lkcl_>
comb += perm_ok.eq(data01 & ~r.store)
<lkcl_>
i *had* to develop these techniques, because of short-term memory issues. if the files are not all on-screen at once, i literally cannot recall the details of a file or function from a hidden tab or backgrounded window that i viewed only two seconds ago!
<lkcl_>
oh - also, another thing you can see, there: "comb = m.d.comb" followed by just "comb +=".
<lkcl_>
this gives you 5 characters back which would otherwise be unavailable. if you can stand abbreviations and it's *really* important, you could do "c = m.d.comb" followed by "c += ...."
<lkcl_>
moony: functions usually have to be "yielded from"
<lkcl_>
yield from test_for_req()
<lkcl_>
however...
<lkcl_>
the for-loop... i *believe* you are doing the right thing, there (for i in test_for_req())
<lkcl_>
is the testee done combinatorially?
<lkcl_>
it would help to provide that source code as well
<moony>
it worked with `yield from`
<lkcl_>
oh! it did?? :)
<moony>
I think the issue may lie with `(`yield testee.read_en) == 1`
<moony>
oops, stray `
<lkcl_>
oh, you mean you had:
<lkcl_>
yield from test_for_req()
<lkcl_>
yield i
<lkcl_>
repeated a lot of times?
<lkcl_>
now i am "engaging brain" a little more, the for-loop is not making any sense to me, neither is "yield i"
<lkcl_>
this would make more sense:
<lkcl_>
for i in range(5): # some arbitrary number
<lkcl_>
yield from test_for_req()
<moony>
yea, that's what I moved to
<moony>
so, for future note: I'm still learning Python. I'm learning python only to use nmigen :p So how generators work in it is still a bit mysterious to me
<lkcl_>
i know there's a reason why that works, it's just a bit too late for me to think it through and explain it :)
<lkcl_>
they're very cool, however, yes, the fact that you can "jump" the control around - even sequentially - from one yield to the next
<moony>
anywayssss....
* moony
celebrates
<lkcl_>
and even have for-loops around things that "yield" results, and even call functions that hierarchically do "more yields"... :)
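A plain-Python illustration (no nmigen needed) of why `yield from` works in testbenches: it delegates to a sub-generator, so the caller sees one flat stream of yielded values, even through nested calls and loops:

```python
# Generator delegation demo: `yield from` splices a sub-generator's
# yields into the outer generator's stream.
def inner():
    yield "a"
    yield "b"

def outer():
    yield "start"
    yield from inner()   # control "jumps" into inner(), then back
    for _ in range(2):   # for-loops around yields work the same way
        yield from inner()
    yield "end"

collected = list(outer())
# collected is ["start", "a", "b", "a", "b", "a", "b", "end"]
```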
<lkcl_>
yay! :)
<moony>
i'm now some of the way toward a working CPU
<lkcl_>
cool!
<lkcl_>
let us know how it goes
<moony>
so far so good, surprisingly so as I bit off a bit more than I knew how to accomplish and somehow was mostly successful anyways
<moony>
what's a clean way to write to a Memory in test code? (so using the write port isn't an option as that would take up cycles)
<moony>
I assume prodding at _array?
<moony>
uh oh
<moony>
oh, no
<moony>
my issue
<moony>
successfully spent forever hunting down a bug that didn't exist
<moony>
oops
<DaKnig>
lkcl_ : thanks for sharing. whats the problem with using the same fragment many times? so what if it copies the logic, its something I am sure tools can optimize out...
<DaKnig>
that's common subexpression elimination, its very common to do this in compilers and im sure your backend would deal with it
<DaKnig>
about memory issues, I totally get you. I really have the same problem. when working with big designs I have to have at least 3 files on screen at once (sometimes up to 6!) and my screen is tiny... with C/++ I feel like having just the file Im working on + the headers for other parts of the project is enough and I include usually examples of usage in the headers. but ofc in HDL its completely
<DaKnig>
different.
<DaKnig>
using (sync|comb) instead of m.d.($1) is a good idea! I didnt think about it, but actually if the module only deals with one time domain, makes total sense
<lkcl_>
DaKnig: it's not necessarily guaranteed that yosys will optimise out repeated expressions
<DaKnig>
I doubt it wont :)
<lkcl_>
for example, on a "unit test" (small case) yosys successfully identifies a comparator-cascade and makes only 44 LUT4s
<DaKnig>
if a few wires on the netlist have exactly the same drivers and same truth table, usually its optimized out
<lkcl_>
however when that same module is *used* the result is 1,000+ LUT4s
<DaKnig>
wow really? that's bad
<lkcl_>
in other words, the easy cases, no problem
<lkcl_>
but the more complex patterns, the current hypothesis under investigation is that it's unable to identify the pattern
<lkcl_>
after flattening, basically.
<lkcl_>
just a word of caution not to rely on the (quite reasonable) expectation/assumption
<DaKnig>
why is it so hard to notice that two wires in the netlist have the same expression assigned to them?
<DaKnig>
many dependencies?
<lkcl_>
the hypothesis that i have is that the substitution of expressions-into-expressions from yosys flatten causes it to no longer be capable of recognising the "outer" expression that it was, in the smaller (sub-module) case, perfectly well able to recognise
<lkcl_>
a way to test that would be:
<lkcl_>
for each identified tree-node module:
<lkcl_>
optimise
<lkcl_>
flatten
<lkcl_>
repeat until completely flattened
<DaKnig>
but even then you would still have subexpressions that can be merged, then merged again, all the way up, I'd assume?
<DaKnig>
is that a fair assumption?
<lkcl_>
well, the idea is with that approach that the expression that is repeated *within* a module is optimised and reduced down, *before* its inputs are multiply-substituted by "global flattening"
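The experiment lkcl_ describes might look something like this as a yosys script; this is a hypothetical sketch, not a script from the chat, and assumes only the standard `opt` and `flatten` passes:

```
# Hypothetical yosys script for the experiment above: optimise each
# module *before* flattening, so repeated in-module expressions are
# merged while the pattern is still recognisable.
read_verilog design.v   # or the nmigen-generated ilang output
opt -full               # per-module optimisation first
flatten
opt -full               # then optimise again on the flat netlist
```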
<moony>
this is really strange, I have an FSM that's refusing to actually do anything, even if I make my start condition a simple dummy that sets itself to another condition, it'll always stay on condition 0. Will post my (poor) code in a min
<moony>
all the other logic seems to work
<moony>
i.e. the fetcher gets busy immediately while the CPU is hung
<moony>
came up with a simple(r) CISC that I could actually pull off
<lkcl_>
one day i will implement an idea i came up with in 1990. 8-bit "escape-sequenced" instructions based on 2-bit bands, that gives an ISA very similar to the LZO compression algorithm
<lkcl_>
without an escape-sequence to extend the "2-bit" of RA, it's only 2 bits.
<lkcl_>
however _with_ a (first) escape-sequence RA becomes 16 bit register numbers
<lkcl_>
another escape-sequence: 64-bit
<moony>
went out of my way to make the CPU fetcher support async writes so I could just write and forget about it unless the write gets very stalled, so I could easily support the 6 different modes (Technically 12, as you need to be able to access the upper 8 registers and that's done by offsetting the "target" register for the mode by 8)
<lkcl_>
bad-cpu is like a "proper" RISC, isn't it?
<moony>
debatable
<moony>
it's got decently complex op modes
<moony>
i.e. the one I marked TODO is `(RA + RC * N + X), RB`
<moony>
(and its inverse)
<moony>
I wrote them down beforehand to make sure I didn't forget, but didn't commit that with the repo, oops
<lkcl_>
are you considering doing PowerISA style "load with update"?
<moony>
hm, right
<moony>
i never modified the fetcher to handle offset read/writes
<moony>
it currently only does aligned
<moony>
yet another TODO
<moony>
hmm, i'll look at that instr
<lkcl_>
as in: after the load effective-address calculation, update RA with that as a result
<moony>
I kinda just copied my personal favorite operand modes out of the VAX
<lkcl_>
it means doing two writes
<lkcl_>
two reg writes
<lkcl_>
nice :)
<moony>
Oh, yea, see store_back_addr
<moony>
which is a bit messily handled as I wanted to avoid the extra clock cycle
<lkcl_>
the efficiency saving from load/store-with-update comes when referencing a struct in a loop
<moony>
its name is a misnomer and needs renaming
<lkcl_>
because the 1st LD computes the address that you can then "add 12" (or whatever sizeof(struct) is) to, to get the next struct
<moony>
as it actually functions for both sides of the instruction
<lkcl_>
where's store_back_addr?
<moony>
in cpu
<lkcl_>
ohh ok ah yeah i see
<moony>
technically works for anything I like if I just set it during instruction execute
<lkcl_>
so yes, looks like that's ld/st-with-update.
<lkcl_>
did VAX have separate address and data registers, like the 68000 and CDC6600?
<moony>
nah, it's a CISC
<moony>
a very, very, very CISC CISC
<lkcl_>
:)
<moony>
too CISC for me to actually copy
<moony>
I just stole the operand notation and some basic operands and called it a day
<lkcl_>
lol
<lkcl_>
if you get this right it should be frickin quick.
<moony>
i.e. (RA + RC * N + X) would be X(RC)[RA] for the VAX (Note N isn't there, it's specified by instruction width, while I can only do 32-bit accesses)
<moony>
lkcl_: it's sitting at 0.5 IPC if I don't need to add another state
<moony>
at best, obv
<lkcl_>
there's almost nothing to the decoder
<lkcl_>
oh i mean if you end up pipelining it
<moony>
oh, yea
<moony>
you're 100% right
<moony>
although the DECODE_2 step is a bit more complex
<moony>
or will be
<moony>
as it has to handle all the 2 word variants (aka any instr with an immediate following it)
<lkcl_>
the CDC6600's decode phase is laughably trivial. bits from the instruction literally go directly through a 3 bit binary-to-unary expander and end up as the "enable" lines on any one of 8 function units (pipelines)
<lkcl_>
ahh yeh
<moony>
if you'd like, I can give you a link to a VAX manual. It's... less than trivial
<lkcl_>
ahh... :)
<moony>
as in, so un-trivial they dumped the arch because 1 IPC was impossible
<moony>
operands had to be resolved sequentially
<lkcl_>
i am presently trying to focus on POWER9. my brain will explode if i try to include a new ISA
<moony>
and they were all at least 1 byte
<moony>
(RA + RC * N + X) as X(RC)[RA] was 3 bytes at best (1 byte X), 6 bytes at worst (4 byte X)
<lkcl_>
argh
<moony>
fun arch, great to program for, but absolute hell to implement
<moony>
it's a 1977 mainframe arch though
<moony>
is it really surprising
<lkcl_>
i'm a huge fan of the CDC6600, by James Thornton and Seymour Cray
<moony>
at least it contributed some amazing things to the world. Like Ethernet, the IEEE 754 32-bit and 64-bit formats, the BSDs, and several other things
<lkcl_>
the OoO design solved problems that hadn't even been realised were problems
<lkcl_>
yehyeh
<moony>
another arch I have on my desk rn is the MC88100, which is a very traditional RISC
<lkcl_>
i was able to log in to a microvax at imperial college, played around with it
<moony>
I have a habit of giving SHL/SHR the boot and putting MAK/EXT (bitfield instrs) in their place because of it
<lkcl_>
ooo that's another one that Mitch Alsup was involved in
<lkcl_>
ah you know the story behind POWER9's shift routines?
<lkcl_>
mask/rotate?
<moony>
I'm just a teen hobbyist who absolutely loves old PCs, and most of the time people don't know what the VAX is somehow. No, but i'm open to listening.
<moony>
s/PCs/computers/
<lkcl_>
short version: they were on a serious gate budget, so rather than have a separate shift/mask set of gates in LD/ST
<lkcl_>
what they did was: micro-op LD/ST and SHIFT/MASK, and join the two together via broadcast buses based on the register numbers RA, RB, RC, RS and RT
<moony>
oh, one good question: You possibly know why Bitsavers might take down a scan? The scan for the VAX Architecture Handbook (which I personally own) seems to have been removed from their site at some point in the past. Neat.
<lkcl_>
so LD would do a 32 (64?) bit wide aligned LD, then rather than pass it straight to the regfile, pass it to shift/rot which *then* did the mask/insert, and *then* did the store to regfile
<lkcl_>
ok ok enough, i will be sucked in forever. wow, thank you though
<moony>
np
<moony>
it's a great resource
<moony>
and I probably wouldn't have gone out and bought paperback copies of several VAX/PDP-11 things for learning purposes if it didn't exist, as I would've never even got to know the architecture (I initially learned the ISA through the scans)
<moony>
it's a wonderful design, just not the most scalable :p
<lkcl_>
:)
<moony>
tbh, if I had the skill, I would absolutely try and make an FPGA VAX clone.
<moony>
it'd obv have to be heavily microcoded to actually fit, just like the real-world arch :P
<moony>
and I don't know anything about uCode design/theory/etc
<moony>
I've got some info on the uCode for the VAX-11/780 on hand, but not enough to truly understand it
<lkcl_>
micro-coding is very simple, you don't just do the "actual" op, you translate internally
<DaKnig>
split your instructions into smaller common operations. implement those. then have a "translator" block as part of your fetching mechanism.
<DaKnig>
that's not that hard :)
<lkcl_>
so for example on POWER9, the microwatt team, instead of implementing add, sub, neg, addc, addex etc. etc.
<lkcl_>
they have *one* operation, "OP_ADD"
<moony>
VAX-11/780 has a 96 bit wide ucode op, and it barely pulls 1 instruction per 10 cycles, so there's some heavy complexity to it. Could also just be an intimidating fact to know though :p
<moony>
hm
<lkcl_>
and the instruction decoder says, "if this is subtract, then invert A and set carry-in to 1"
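A plain-Python check of that micro-op trick (the function names are illustrative, not microwatt's actual identifiers): subtract is just the adder with A inverted and carry-in set, by the two's-complement identity `b - a = ~a + b + 1`:

```python
# One shared 64-bit "OP_ADD"; the decoder supplies invert/carry flags.
MASK = (1 << 64) - 1

def op_add(a, b, invert_a=False, carry_in=0):
    if invert_a:
        a = ~a & MASK
    return (a + b + carry_in) & MASK

def op_sub(a, b):
    # decoder's view of subtract: "invert A and set carry-in to 1",
    # which makes the same adder compute B - A
    return op_add(a, b, invert_a=True, carry_in=1)
```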
* moony
will think about it
<DaKnig>
its not uncommon for microcode to be very wide
<DaKnig>
at least when you count all the control bits flying around
<lkcl_>
DaKnig: sigh. finding that out, the hard way
<lkcl_>
192 extra wires into some of the pipelines in LibreSOC
<moony>
I think there's already the core idea behind ucode just sitting in my CPU design, various flags the EXECUTE, FETCH, DECODE, etc stages all pass to each-other for control could be bundled up into a control block
<DaKnig>
look im a beginner, its just something I noticed from my own experience and from how I imagine other systems are implemented.
<DaKnig>
its not a bad thing - more info being available means you can have less complex logic
<lkcl_>
love the chat - really have to get back to fixing a LD/ST FSM...
<moony>
yea.. the only "complex" to decode instr is any instr using the (RA + RC * N + X) mode, and that's only as the effective address gen will need to split out some more bitfields
<moony>
I came up with something simple, and I'll stick to it for now to avoid feature creep :p
<DaKnig>
why is this complex?
<moony>
see the quotes.
<moony>
""complex"" (it's not)
<DaKnig>
if you have hardware mul, you can just turn it into MAC, add, deref/do whatever
<moony>
it's a shift-add + deref, yep
<DaKnig>
the back propogation net in the pipeline should take care of the rest
<moony>
simple too
<moony>
in an earlier version of the arch I was considering making whatever NOP instr I decide on "zero cycle" in that the decode will immediately move to the next half of the 32-bit word it currently has, but that's not necessary anymore with my smarter fetcher design (which, among other things, will pre-emptively queue up the next 4 words)
<moony>
s/4/3 as the 0th is the one it's currently giving to the CPU :p
<moony>
anyways, question: What might be a good way to handle converting an unaligned load/store request into multiple aligned ones?
<moony>
It's probably easy, but I only know the software way of emulating that behavior that you p much only use when you're trying for cycle accuracy in an emulator :p
<DaKnig>
if you load 8 bits when the bus is 64bits wide (for example), you can load the whole 64 bits, then have a word_select thing (aka a mux)
<lkcl_>
moony: well... i can describe the way that we're doing it for LibreSOC
<moony>
i'm trying to convert a 32-bit unaligned load/store to an aligned one
<moony>
sure
<lkcl_>
however it's designed for serious throughput
<DaKnig>
if the request loads 16 bits that are unaligned with said 64bits wide data bus, you can load the (rounded down) address and the one right after it, combine them and *then* use word_select
<lkcl_>
basically, take the lower bits of the address, and the ld/st-length, and turn them into a bytemask
<moony>
lkcl_: considering I was somehow able to make my fetch unit not be garbage, I probably have some small chance of understanding
<lkcl_>
effectively, just like wishbone "sel" when you set 8-bit granularity
<DaKnig>
lkcl_: question, what about my design that I just described doesnt give you enough throughput?
<lkcl_>
DaKnig: we're planning an advanced (simultaneous multi-ld/st) version of that, in effect
<DaKnig>
I see.
<moony>
my fetcher could be better (it could simply never drop READ_EN when doing bulk reads) but it's already fast enough as is and when working with SRAM that can respond the next cycle the CPU can't out-pace it
<DaKnig>
do you check if you can avoid unaligned access to save on bandwidth?
<lkcl_>
so now you can think of those LD/STs as just *literally* being like wishbone 64-bit requests plus an 8-bit mask of "sel"
<lkcl_>
DaKnig: no, what we do is:
<lkcl_>
* use the bottom 4 bits of the address (0-15)
<moony>
I assume the underlying code for LibreSoC isn't public yet?
<lkcl_>
* create a 16-bit mask
<lkcl_>
* split it into two halves (2 separate 64-bit requests, each with their own 8-bit wishbone-style "sel")
<moony>
just on your own git instead of github or similar, alright
<lkcl_>
4x 64-bit
<lkcl_>
moony: yes. because we take the "libre" bit seriously
<moony>
256b seems p normal in hindsight, at least knowing how big modern CPU busses usually are
<lkcl_>
moony, yeah. in GPU terms (those that use GDDR5) it's peanuts
<moony>
i.e. Zen 2's internal bus is 512-bit iirc, while the external is just raw DDR4 memory lanes (as it's a SoC)
<DaKnig>
and after that you just concat and select with the bottom 4 bits
<DaKnig>
with the mask created by* the bottom 4 bits
<DaKnig>
right?
<lkcl_>
that makes sense, given if you want to handle 4k
<lkcl_>
yes, once you have those 2x 64-bit aligned requests, some simple masking followed by shift/concatenate, and you're done
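The address-splitting scheme above can be modelled in a few lines of plain Python (a sketch of the idea as described, not LibreSOC code; it assumes the access fits within one 16-byte window):

```python
# Bottom 4 address bits + access length -> 16-bit bytemask, split into
# two 64-bit-aligned requests, each with an 8-bit wishbone-style "sel".
def split_unaligned(addr, length):
    lo = addr & 0xF                       # bottom 4 bits (0-15)
    assert lo + length <= 16              # wider spill needs a 3rd request
    mask16 = ((1 << length) - 1) << lo    # 16-bit byte mask
    sel_lo = mask16 & 0xFF                # first 64-bit request's sel
    sel_hi = (mask16 >> 8) & 0xFF         # second 64-bit request's sel
    base = addr & ~0xF
    reqs = []
    if sel_lo:                            # a request with sel == 0
        reqs.append((base, sel_lo))       # can be skipped entirely
    if sel_hi:
        reqs.append((base + 8, sel_hi))
    return reqs
```

An aligned 4-byte access yields a single request; a 4-byte access at offset 6 straddles the two 64-bit halves and yields both.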
<lkcl_>
in the "first iteration" we're thunking down those 2x 64-bit requests onto the same internal 64-bit wishbone bus
<lkcl_>
so they can't happen simultaneously (ever)
<DaKnig>
you might save on one of those requests tho with a bit of extra logic
<lkcl_>
exactly, yes.
<moony>
I'm getting an FPGA board soon, a ULX3S, so I'll probably get to enjoy the wonders of writing an sdram driver soon.
<DaKnig>
not sure how that's gonna make it faster tho
<moony>
ECP5-85F, so I can use Yosys and be happy
<lkcl_>
if the mask (wishbone 8-bit "sel") is zero for either of the 2 64-bit requests, then you don't need to do the request at all.
Sarayan has joined #nmigen
<lkcl_>
moony: i knooow, i'm jealous :)
<moony>
You probably need a larger/more capable board to work on the SoC :P
<lkcl_>
i have a versa-ecp5 and it's... ok.
<moony>
oh, huh
<lkcl_>
(45k LUT4s). at the moment because we're limiting what's being done, we're at 16k LUT4s.
<moony>
lkcl_: ah. I might tinker with it sometime then, as it'll fit nicely and that'd be coooool
<lkcl_>
:)
<DaKnig>
the only thing stopping me from getting an ecp5 board is I didnt see any that has enough IO for my need. ones with HDMI, PLLs and DDR memory builtin are rare... can only get 7-series ones for my price range and those requirements
<moony>
I'm probably not knowledgeable enough to help though.
<DaKnig>
making useful (simple) CPUs is not that hard; compare that to , say, compilers...
<moony>
getting stuff running nicely async is still a bit new to me
<lkcl_>
DaKnig, yyeah, daveshah's ones are really good and have everything
<lkcl_>
except he doesn't get much in the way of demand in order to turn it into a business
<moony>
once I finish this design i'll probably just start a new design that's more complex :p
<lkcl_>
at least he libre-licenses the full CAD files of what he does
<moony>
...I wonder how hard a PDP-11 soft-core would be
<moony>
might look at that
<moony>
most of the complexity is, like VAX, from the operand modes
<moony>
hardware complexity that is
<lkcl_>
DaKnig, so you could, hypothetically, do your own PCB run of his ECP5 boards
<lkcl_>
PDP-11, the precursor to the 68000, right? i think it has those separate address and data registers
<moony>
I have a big ol pile of DEC manuals I want to put to use
<moony>
nah, PDP-11 is the VAX's precursor
<moony>
it heavily inspired multiple MCUs though
<moony>
probably 68k too
<moony>
it doesn't have address registers iirc?
<lkcl_>
the 68k (68000) was designed by Mitch Alsup. he was - still is - a fan of address/data registers because he studied the CDC 6600
<moony>
unless i'm missing something about how it handles the extra 8 registers (Like my bad-cpu, it needs a different instruction mode/something to access the upper 8 registers)
<moony>
they are GPR though, just checked
<lkcl_>
i can't remember. it was... 1990 when i last looked at 68000 :)
<DaKnig>
I ... really dont wanna make my own pcb. at some point, maintaining *all* your stack is annoying. gotta use others' works
<DaKnig>
less time consuming, therefore cheaper
<moony>
my PDP-11 Architecture Handbook is probably the most grimy of the set I have here, it's clearly been used a good bit :p
<lkcl_>
DaKnig: sigh yehhh. i'd really like one of daveshah's more powerful ECP5 boards, too
<moony>
lkcl_: yea, i'll almost definitely play with the SoC a bit when I get the board. It sounds fun
<lkcl_>
it's temporary because i'm focussing on getting the instructions right, first
<moony>
I should learn how pipelines truly function
<moony>
yea, makes sense
<moony>
something simple to test against
<moony>
if my ""bad"" CPU design comes out well, I might try pipelining it
<lkcl_>
pipelines are just some combinatorial logic that's joined with some clock-synchronised "registers"
<lkcl_>
registers/latches
<moony>
so basically the relationship my fetcher has with my CPU core? :P
<lkcl_>
so where the gates would normally ripple and not stabilise before the clock goes "ping"
<lkcl_>
you capture *partial* results in "latches"
<lkcl_>
then continue on processing on the next cycle, in a new combinatorial block
<lkcl_>
the "pipeline" bit is that you allow the 1st stage to start a *new* result whilst the 2nd stage is completing the next part
<lkcl_>
extend it to 3-stage, 4-stage, however-many-you-want stage
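A toy plain-Python model of that description (purely illustrative, not HDL): each stage is a combinatorial function followed by a latch, and on every "clock" all latches capture at once, so stage 1 starts a new item while later stages finish older ones:

```python
# Pipeline = comb functions separated by clocked latches.
def make_pipeline(stages):
    regs = [None] * len(stages)           # one latch after each comb stage
    def tick(x):                          # one clock edge; x = new input
        nonlocal regs
        out = regs[-1]                    # what falls out this cycle
        # all latches capture simultaneously (old regs read, then replaced)
        regs = [None if x is None else stages[0](x)] + [
            None if regs[i] is None else stages[i + 1](regs[i])
            for i in range(len(stages) - 1)
        ]
        return out
    return tick

# 2-stage example: stage 1 adds one, stage 2 doubles; latency = 2 ticks
pipe = make_pipeline([lambda v: v + 1, lambda v: v * 2])
outs = [pipe(x) for x in [1, 2, 3, None, None]]
# outs == [None, None, 4, 6, 8]  (results appear 2 cycles after issue)
```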
* lkcl_
goes and reopens bad-cpu
<moony>
you could theoretically apply that idea to the VAX, but you'd need a huge, x86-64 rivaling pipeline.
<lkcl_>
indeed
<moony>
and even then, complex instructions like `ADDL2 4(R1)[r2], 4(R3)[r2]` would be a pain in the butt
<lkcl_>
so, if the fetcher FSM can process one instruction *at the same time* as the ALU is munching on the previously-fetched instruction, then yes, this is termed a "pipelined" design
<Sarayan>
the fun part being managing the shared resources, like the memory port been fetches and instructions accessing memory
<moony>
it indeed can, the fetcher procures instruction words asynchronously from the rest of the cpu core
<lkcl_>
Sarayan: had a loootta fun with that, a couple days ago...
<moony>
again, comparing to VAX, managing shared resources on it would be... not fun.
<moony>
12 registers in one instruction is the max.
<moony>
aka basically the entire GPR file
<moony>
as the last 4 have special uses per the cc
<lkcl_>
moony: ah ok, so although you may have designed the fetcher FSM to be pipelined, here's the thing: if the fetcher FSM can only fetch one instruction every 2 to 3 clock cycles, then the ALU "pipelines" are going to be idle.
<lkcl_>
yeah?
<moony>
well, technically
<moony>
but it helps that the CPU core takes at least 2 cycles per instr
<moony>
and an instr is only half a 32-bit word
<lkcl_>
the "solution" to that is to read 2, 3, 4 or 8 instructions
<lkcl_>
ahh :)
<moony>
the fetcher fetches 32 bits at a time
<moony>
so, as I mentioned earlier, the fetcher will always keep up with the CPU if the external memory is fast enough
<lkcl_>
ok so you can... ah, yes, *now* you can feed the 1st 16 bits on 1 cycle and the 2nd 16 bits on the next
<lkcl_>
and during the "off" cycle, it fetches *another* 32 bits, right?
<moony>
yep
<lkcl_>
cool! then you've got a pipelined design
<moony>
I'm going to try and make it so it doesn't have to disable READ_EN if it has more to fetch, which should make it even faster
<moony>
well
<moony>
fast enough it doesn't even matter lol, it'll just save some contention in edge cases
<lkcl_>
i mean, it may only be a 2-stage, but it's still pipelined
<lkcl_>
now as Sarayan says: if you can get the register reads / writes to not corrupt, and operate on a *3rd* cycle, now you have a 3-stage pipeline
<moony>
once again I have no idea how I pulled this off with no real prior knowledge of CPU design beyond "hey pipelines exist"
<lkcl_>
lol. the thing is: now you run into problems with instructions trying to read results that aren't ready yet
<lkcl_>
you put 2 instructions:
<lkcl_>
mul r1 <- r2, r3
<lkcl_>
add r5 <- r1, r1
<moony>
yeeep
<lkcl_>
the result of the mul takes 3 cycles... you need to decide:
<DaKnig>
lkcl_: are you assuming multi cycle mul?
<moony>
at that point i'd have to use, say, a scoreboard. I only know what a scoreboard is because the MC88100 manual explains it lol
<lkcl_>
a) do i care? :)
<DaKnig>
ah.
<DaKnig>
cant you have 1-cycle mul in your FPGA?
<DaKnig>
DSP slices are quite fast
<DaKnig>
certainly faster than 50MHz I think
<lkcl_>
DaKnig, not necessarily, just the fact that the design *is* pipelined (even 2 stages) is enough
<moony>
also the problem of jumps, which have to eat a cycle as they're tricky to pipeline
<lkcl_>
if you're interested i can email you Mitch Alsup's book chapters on scoreboard design
<DaKnig>
moony: I would really suggest to find some computer architecture design course online. it should answer all your questions and teach you much more
<lkcl_>
in-order designs, they're much simpler: you "stall".
<DaKnig>
just have a good predictor :)
<lkcl_>
if that r1 hasn't been written yet, (the mul followed by add), you simply stall the instruction issue
<lkcl_>
DaKnig: yeah, predictors are... fuuun.
<lkcl_>
an in-order "stall" system is basically a degenerate Scoreboard Matrix of width and height 1
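The in-order stall rule can be sketched as a tiny issue-scheduling model (plain Python, names invented for illustration): track how many cycles each in-flight destination register still needs, and hold issue while any source is busy:

```python
# Degenerate 1-wide "scoreboard": stall issue until sources are written.
def issue_schedule(program, result_latency=3):
    busy = {}                 # dest reg -> cycles until result is written
    cycle = 0
    issued = []               # (issue cycle, dest reg)
    for dest, srcs in program:
        # stall while any needed source register is still in flight
        while any(busy.get(r, 0) > 0 for r in srcs):
            cycle += 1
            busy = {r: c - 1 for r, c in busy.items() if c > 1}
        issued.append((cycle, dest))
        busy[dest] = result_latency
        cycle += 1
        busy = {r: c - 1 for r, c in busy.items() if c > 1}
    return issued

# the mul-then-dependent-add example from above: the add must wait
# until the mul's 3-cycle result lands in r1
mul_add = [("r1", ("r2", "r3")),   # mul r1 <- r2, r3
           ("r5", ("r1", "r1"))]   # add r5 <- r1, r1
```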
<DaKnig>
why stall? having a network that returns the new result back to the system would be much better
<DaKnig>
... no?
<lkcl_>
DaKnig: it's... complicated. it took me 5 months of talking with Mitch Alsup on comp.arch to fully understand scoreboards
<DaKnig>
scoreboards are the term? hm
<lkcl_>
there's 2 basic designs
<DaKnig>
I had a very simple pipeline with that, I just kept track of what regs are used where in the pipeline
<Sarayan>
alternatively, the Hitachi SH-2 has a well-described 5-stage pipeline that can be seen as an example of a practical implementation, given the manual indicates all the interactions requiring interlocks
<lkcl_>
ah, that's an in-order-style solution which, if you're not careful (i.e. if you don't stall), will give you data corruption - i can't tell you exactly why offhand
<lkcl_>
Sarayan, you mean, it defines the timing and the compiler / assembly-writer has to respect that?
<lkcl_>
there's historic and current designs which successfully do that, like the TI VLIW DSPs
<Sarayan>
they describe in detail how opcodes are executed so that compiler programmers/assembly writes can use the processor efficiently
<lkcl_>
nice. makes for very simple hardware
<lkcl_>
the first SPARC processors did this, iirc. the first compilers? simply inserted a bunch of NOPs... :)
<lkcl_>
urrrr
<Sarayan>
all the interlocks are automatic in the sh, but the description is so complete that it tells you what you should look at when making your own
<lkcl_>
i remember my colleagues doing assembly-level programming of TI DSPs, in 1994. CEDAR Audio. there were only 1024 clock cycles available per 24-bit audio sample and the compiler wasn't efficient enough
<Sarayan>
oh yeah, that must be fun
<lkcl_>
now of course they can just use the main processor (doh)
<lkcl_>
this was like 386sx16-->486dx25 with Windows 3.1 "locking up interrupts" days, whereas the TI DSPs were 50 MFLOPs sustained
<Sarayan>
trivia: the mu100 synthesizer integrated effects DSP runs 768 instructions per sample, and afaict had no branching
<lkcl_>
ooOoo :)
<Sarayan>
I guess all instructions are single-clock
<Sarayan>
external memory access (for reverb ram) is 3 clocks, memory instructions are at addresses multiple of 3. Even more amusing, on reads, the instruction that does something with the result is at 3n+2
<Sarayan>
a very synchronous dsp
<Sarayan>
instructions are vliw too, because why not
<moony>
i've always wondered if there's more ways to make our fundamentally in-order CPU designs more efficient by, well, dumping a bit of said order :P
<moony>
I know, for example, Mill is working on smth like that
<moony>
something I do kinda miss in modern computers is also the idea of having more specialized hardware for tasks. Where's our DSPs? :(
<moony>
everything's loaded onto the main CPU, even if it turns out to be one of the least efficient ways to do a task
<moony>
GPUs exist obv, but they're really the only big co-processor in our computers :P
<lkcl_>
i've spent some time on comp.arch, learning about the Mill. it's extremely cool.
<MadHacker>
Well, GPUs make fairly decent DSPs; bearing in mind communications overhead with other devices, I'm not sure that other copros have much to offer any more.
<lkcl_>
it only has "ADD" and "MUL" (not ADD8, ADD16, ADD32, ADD64, ADD-signed blah blah)
<MadHacker>
Also there's a lot of very specialised ones you never really see, like in NICs and the like.
<lkcl_>
the "width" (and type) is taken from the LD operation and carried right the way through even to ST
<lkcl_>
moony: this is what ARM SoCs advocate, having specialist blocks for AES, Video, etc.
<moony>
mhm
<Sarayan>
lkcl: the issue tends to be if/how it interacts with the cache
<Sarayan>
if your aes block flushes the cache because it's a dma, well, reloading it afterwards kills all the gain usually
<lkcl_>
yeah
<lkcl_>
oh interesting
<lkcl_>
of course
<Sarayan>
that makes accelerators hard
<lkcl_>
and the software gets more complex (DMA, userspace-kernelspace)
<lkcl_>
it's why we chose libre-soc to be a hybrid CPU-VPU-GPU. actually extending POWER9 to include sin, cos, texture interpolation, yuv2rgb and so on
<lkcl_>
i don't know if you've seen how normal GPU architectures handle the software side - it's mental :)
<lkcl_>
inter-process communication and synchronisation of multi-megabyte data structures in shared memory!
<lkcl_>
that has to involve userspace-kernelspace-interprocessor_bridge-kernelspace-userspace interaction
<lkcl_>
mmmental
Asuu has quit [Read error: Connection reset by peer]
emeb has joined #nmigen
<Sarayan>
plus massive parallelism
<lkcl_>
ah there is that :)
Asuu has joined #nmigen
<moony>
lkcl_: yea, GPUs are nuts
<moony>
and imo i'd absolutely love that kind of massive parallelism for some tasks
<moony>
a CPU that has a large number of "little" cores designed for high parallelism would be fun
<moony>
(and some "big" cores for single-thread/dual-thread tasks)
<MadHacker>
Isn't that just a CPU with an integrated GPU? :D
<Sarayan>
You mean a recent intel gpu with integrated graphics? ;-)
<Sarayan>
mwahaha MH
phire has quit [Remote host closed the connection]
<moony>
p much :p
phire has joined #nmigen
<MadHacker>
The Intel knight's [landing, corner, whatever] series were an x86-flavoured variation on that theme.
<MadHacker>
Lots of small x86 cores running in parallel.
<MadHacker>
Of course, you can get 64 cores in a mainstream CPU now, so it's not THAT much parallelism.
<moony>
alright, got bad-cpu executing instructions again (I scrapped the old fetcher and it was too tightly integrated so I had to just snip most of the core), and now it's running at 0.5 IPC
<moony>
yay
<Sarayan>
yeah, a good pipeline should get you almost 1ipc mean
<moony>
yea. This is only a 2-stage pipeline though
<Sarayan>
superscalar is yet another kettle of fish
<moony>
I'll finish this design then design a good pipeline :P
<Sarayan>
are you load-store?
<Sarayan>
(most everything in register, dedicated instruction for load or store from/to ram)
<moony>
no, loads/stores are handled with operand modes
<Sarayan>
hmmm
<Sarayan>
there's a fair chance you can't really go over 0.5ipc then
<Sarayan>
fetch is going to collide with instruction memory access
<moony>
actually
<Sarayan>
unless you go harvard, of course, or are really good with I$
<moony>
it won't. instrs are 16-bit, so the CPU will likely have the next instr words pre-fetched, and the fetcher will prioritize, well, anything that isn't an instr fetch.
<moony>
under normal operation rn, the fetcher fetches the next word every other cycle for normal form instructions
<moony>
though
<moony>
with a bigger pipeline that'd probably change
<Sarayan>
how wide is your bus?
<moony>
32-bit external
<moony>
so it's usually ahead of the CPU
<Sarayan>
the sh2 has 16-bit instructions and a 32-bit internal bus, so it actually only fetches once every other instruction
<moony>
tl;dr similar to the SH2, unless of course a 2 word instruction is being executed
<Sarayan>
it's not really speculating, just reading with a 32-bit granularity
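[A behavioral sketch of the fetch scheme being discussed: each 32-bit bus read supplies two 16-bit instructions, so the instruction bus is only touched every other instruction. Plain Python; the little-endian halfword layout is an assumption for illustration.]

```python
# Behavioral sketch: a 32-bit fetch supplies two 16-bit instructions,
# so the bus is only accessed once per two instructions fetched.
# (Plain Python model; halfword order within a word is assumed little-endian.)

def fetch_stream(mem_words, n_instrs):
    """Return n_instrs 16-bit instructions and the number of 32-bit bus reads."""
    buf = []          # prefetched 16-bit halfwords not yet consumed
    accesses = 0
    word_addr = 0
    out = []
    while len(out) < n_instrs:
        if not buf:
            w = mem_words[word_addr]        # one 32-bit bus access...
            word_addr += 1
            accesses += 1
            buf = [w & 0xFFFF, w >> 16]     # ...holds two 16-bit instructions
        out.append(buf.pop(0))
    return out, accesses

words = [0x2222_1111, 0x4444_3333]          # two 32-bit memory words
instrs, n_bus = fetch_stream(words, 4)      # 4 instructions, 2 bus reads
```

Two-word instructions (an opcode halfword plus an immediate halfword) would consume the buffer twice as fast, which is why they slow the fetcher down.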
<Sarayan>
you have 2-word instructions?
<moony>
only used when an instr has an immediate attached
<moony>
they, of course
<moony>
slow down the CPU a bit
<Sarayan>
yeah, sh2 doesn't have that
<moony>
mhm
<moony>
alright, finally running instructions properly. Only the register/register ones though
<moony>
next up, make FETCH work
<Degi>
Is there something faster than using if/else for MUXing 2 signals?
<moony>
presumably, Mux()
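[For reference, the two equivalent nMigen forms are sketched in the comment below; both describe the same 2:1 multiplexer, so `Mux()` is a convenience rather than a speedup. The executable part is just a plain-Python behavioral model of the select.]

```python
# In nMigen, Mux(sel, a, b) and the m.If/m.Else form describe the same
# 2:1 multiplexer; Mux is simply more compact for simple selects:
#
#   m.d.comb += out.eq(Mux(sel, a, b))
#
#   with m.If(sel):
#       m.d.comb += out.eq(a)
#   with m.Else():
#       m.d.comb += out.eq(b)
#
# Behavioral model of that mux (sel truthy -> first operand is selected):
def mux(sel, a, b):
    return a if sel else b
```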
<moony>
yay, load store works
<moony>
now for one last puzzle piece, a working external bus that isn't a thunk :p
<moony>
what's a good way to handle a ROM in nMigen?
hitomi2507 has quit [Quit: Nettalk6 - www.ntalk.de]
<Degi>
Maybe a RAM with data already filled in?
<Degi>
you can pass 'init=' to a Memory
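[A sketch of Degi's suggestion: an nMigen `Memory` preloaded via `init=` and given only a read port behaves as a ROM. The nMigen construction lives in the comment (the `pc` signal there is hypothetical); the executable part just models the read-only lookup.]

```python
# ROM as an initialized nMigen Memory with no write port:
#
#   from nmigen import Memory
#   rom = Memory(width=32, depth=len(ROM_IMAGE), init=ROM_IMAGE)
#   rdport = rom.read_port()             # sync by default; domain="comb" for async
#   m.submodules.rdport = rdport
#   m.d.comb += rdport.addr.eq(pc)       # data appears on rdport.data
#
# Behaviorally, a ROM is just an indexed, read-only table (contents illustrative):
ROM_IMAGE = [0xDEAD_BEEF, 0x0000_1234, 0xCAFE_F00D]

def rom_read(addr):
    return ROM_IMAGE[addr]
```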
<sorear>
Can we reserve this for conversations at least somewhat relevant to nmigen?
<moony>
sorear: I think discussing a CPU being written in nmigen, over an hour ago, is perfectly fine.
<sorear>
it’s not about how long ago it was, it’s about how many pages of scrollback you take up
<Degi>
What is the best way to find out time-consuming things?
<vup>
`python3 -m cProfile -s time yourfile.py` or maybe `-s tottime`?
<Degi>
Ah I mean taking up time as in long carry chains etc. since my thingy only compiles to like 300 MHz
<vup>
ah
<vup>
although 300MHz doesn't sound too bad
<vup>
what are you using for pnr?
<Degi>
nextpnr ecp5
<Degi>
Hm, AsyncFIFO(Buffered) seems to be slow
<Degi>
Without that it compiles to 1100 MHz
<daveshah>
nextpnr is being somewhat optimistic here, as it doesn't take into account the fact the global clock tree is only rated to 370MHz
<Degi>
Hm, in practice it works at 800+ MHz
<Degi>
(not this example, but the clock tree itself, in that case it was some 30 bit counter or so)
SpaceCoaster has quit [Quit: ZNC 1.7.2+deb3 - https://znc.in]
SpaceCoaster has joined #nmigen
<daveshah>
Is this a 1.2V part by any chance?
<Degi>
yes
<daveshah>
I think that spec wasn't updated accordingly, so it is not surprising it can go a lot higher
<Degi>
And what is the limit on clock domains?
<Degi>
Like on the number of them
<daveshah>
16
<Degi>
hmh okay
<daveshah>
In theory up to 64 with cleverer placement but nextpnr's global code would need some changes to support this
<Degi>
That would be nice heh
<daveshah>
I'd rather work on cross clock constraints first, that would probably be more useful
<daveshah>
At the moment actually using even a few clock domains with complex crossings is quite annoying
<daveshah>
Out of curiosity, why are you needing more than 16 domains?
emeb has quit [Ping timeout: 240 seconds]
<Degi>
Somehow something is broken and that led to me making gateware which takes 4 clock domains for 1 data lane...
emeb has joined #nmigen
<Degi>
And there can be up to 4 data lanes. Maybe I can optimize that to 2 clock domains or even combine clock domains of lanes (for later), a problem was that the SERDES gearing seems to be broken
Asu has quit [Quit: Konversation terminated!]
lkcl_ has joined #nmigen
lkcl__ has quit [Ping timeout: 240 seconds]
<_whitenotifier-3>
[YoWASP/yosys] whitequark pushed 3 commits to release [+0/-0/±3] https://git.io/JJA6I