#nmigen on 2020-08-19 — irc logs at freenode.irclog.whitequark.org

2020-08-09 01:55 ChanServ changed the topic of #nmigen to: nMigen hardware description language · code at https://github.com/nmigen · logs at https://freenode.irclog.whitequark.org/nmigen · IRC meetings each Monday at 1800 UTC · next meeting August 17th

00:01 <d1b2> <emeb> Isn't it true that all clock domains in a design are part of self, and the clock and reset signals associated with them are accessible?

00:03 <tannewt> oh, initialized by the super class?

00:05 <d1b2> <emeb> well, I think I misspoke saying "all clock domains", but for any module that's instantiated, the higher level hooks up the clock domain, and the clock & reset signals are available if you need them.

00:06 <tannewt> huh, interesting. I'd expect it to be passed in

00:06 <tannewt> (obviously I haven't used it much)

00:06 <d1b2> <emeb> but usually you don't since they're implicit when you use .sync process

00:06 <tannewt> right, you assume most things are using the common clock

00:07 <d1b2> <emeb> things get interesting when there's multiple domains and signals crossing of course - that's the topic of a lot of discussion here.

00:08 <tannewt> ya, I've seen that but haven't experimented with it yet

00:08 <tannewt> I know it from the "I've worked with a SAMD micro that does it" standpoint but not the HDL level

00:09 <d1b2> <emeb> Right - there area lot of hardware patterns that are used to ensure good behavior in those cases. AIUI nmigen is working to make a lot of those available easily.

00:10 <tannewt> nice! I love the nmigen primitives

00:11 <tannewt> is it usually done at the register level or peripheral level?

00:11 <tannewt> I'm probably thinking too high level

00:12 <d1b2> <emeb> Yeah - special structures to keep pulses from getting lost or getting stretched, special ways to make sure you only get out what's put in, etc.

00:12 <d1b2> <emeb> async FIFOs figure heavily. Lots of handshaking. etc.

00:13 Degi has quit [Ping timeout: 260 seconds]

00:14 Degi has joined #nmigen

00:15 <tannewt> cool cool. I haven't gotten that low level yet

00:15 <whitequark> tannewt: so, most signals are eagerly bound, meaning you have to possess a reference to the actual Signal object to do anything with them

00:15 <whitequark> clock domains are special in that they are late bound

00:15 <tannewt> at elaboration time?

00:15 <whitequark> yeah

00:16 <whitequark> or rather, *after* elaboration

00:16 <whitequark> well

00:16 <tannewt> 🤯

00:16 <whitequark> it's a combination of both, i guess

00:17 <tannewt> both during and after? or before?

00:17 <whitequark> you're explicitly binding them by manipulating m.domains, and then nmigen binds them through the hierarchy for you

00:17 <tannewt> and `sync` is the default domain right?

00:19 <whitequark> pretty much

00:19 <whitequark> for most part, `sync` is a convention

00:20 <whitequark> it's never treated specially other than by serving as a default name

00:20 <tannewt> makes sense. how are different clocks mapped to multiple domains?

00:20 <whitequark> can you elaborate?

00:27 <moony> is there a clean way to set a Signal to the current state of an FSM?

00:29 <moony> eh, nvm

00:29 <moony> just noticed it's in the signal viewer as FSM_STATE

00:44 <lkcl_> DaKnig: one "trick" (or two) that i learned is possible with nmigen, which helps keep line lengths to below 80 chars is:

00:44 <lkcl_> 1) use functions, passing in expressions and parameters (i'll send a link to a file i did that, in a mo)

00:44 <lkcl_> 2) assign AST sub-expressions to python variables then use those on the next line

00:46 <tannewt> wq, I'm wondering how you map two clock domains in an elaborate call

00:46 <lkcl_> here's an example of a python function which was created by taking the contents of a VHDL "case" statement out:

00:46 <lkcl_> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/mmu.py;h=6af299aab9149273c0e029232a644d9031e55871;hb=HEAD#l237

00:46 <lkcl_> the radix_read_wait() function just above it, you can see how many If If If indentations there are

00:47 <lkcl_> if those had also been inside the original Case statement at line 384, which is in *yet another* m.If, you can see quite easily how the indentation builds up and gets completely out of hand

00:48 <lkcl_> and yet it is perfectly reasonable to expect to have just one Switch statement - not even two *nested* nmigen Switch Statements - then do maybe one or two m.If indented pieces of work

00:49 <lkcl_> by using a python function for the majority of the Switch work, you can "go back" to a clean indent level.

00:50 <lkcl_> plus, i think the Switch statement looks a lot more understandable and readable because you don't have to scroll page-up, page-down dozens of times to take in all the Cases

00:51 <lkcl_> the second trick: using a python variable to store AST fragments: this is *usuallly* something that's not recommended

00:51 <lkcl_> because people tend to use those fragments multiple times, not realising that it's literally going to insert that exact same AST into the yosys output

00:52 <lkcl_> but if you use it carefully, and make sure that each python variable is used *once*, i've found that it's a really good way to stay below 80 chars.

00:52 <lkcl_> example, at line 205 (which is annoyingly unreadable)

00:52 <lkcl_> do this instead:

00:52 emeb_mac has quit [Ping timeout: 240 seconds]

00:52 <lkcl_> data01 = data[1] | data[0]

00:52 <lkcl_> followed by

00:52 <lkcl_> comb += perm_ok.eq(data01 & ~r.store)

00:55 <lkcl_> i *had* to develop these techniques, because of short-term memory issues. if the files are not all on-screen at once, i literally cannot recall the details of a file or function from a hidden tab or backgrounded window that i viewed only two seconds ago!

00:56 <moony> anyone have an idea why this doesn't work, and if yes, how I could do it correctly? (also i'd love to just get rid of that for loop, too.) https://thelounge.hellomouse.net/uploads/a642990a4a09d8de/image.png

00:56 <lkcl_> oh - also, another thing you can see, there: "comb = m.d.comb" followed by just "comb +=".

00:56 emeb_mac has joined #nmigen

00:57 <lkcl_> this gives you 5 characters back which would otherwise be unavailable. if you can stand abbreviations and it's *really* important, you could do "c = m.d.comb" followed by "c += ...."

00:58 <lkcl_> moony: functions usually have to be "yielded from"

00:58 <lkcl_> yield from test_for_req()

00:58 <lkcl_> however...

00:59 <lkcl_> the for-loop... i *believe* you are doing the right thing, there (for i in test_for_req())

01:00 <lkcl_> is the testee done combinatorially?

01:00 <lkcl_> it would help to provide that source code as well

01:01 <moony> it worked with `yield from`

01:01 <lkcl_> oh! it did?? :)

01:01 <moony> I think the issue may lie with `(`yield testee.read_en) == 1`

01:01 <moony> oops, stray `

01:02 <lkcl_> oh, you mean you had:

01:02 <lkcl_> yield from test_for_req()

01:02 <lkcl_> yield i

01:02 <lkcl_> repeated a lot of times?

01:03 <lkcl_> now i am "engaging brain" a little more, the for-loop is not making any sense to me, neither is "yield i"

01:03 <lkcl_> this would make more sense:

01:04 <lkcl_> for i in range(5): # some arbitrary number

01:04 <lkcl_> yield from test_for_req()

01:04 <moony> yea, that's what I moved to

01:04 <moony> so, for future note: I'm still learning Python. I'm learning python only to use nmigen :p So how generators work in it is still a bit mysterious to me

01:04 <lkcl_> i know there's a reason why that works, it's just a bit too late for me to think it through and explain it :)

01:05 <lkcl_> they're very cool, however, yes, the fact that you can "jump" the control around - even sequentially - from one yield to the next

01:06 <moony> anywayssss....

01:06 * moony celebrates

01:06 <lkcl_> and even have for-loops around things that "yield" results, and even call functions that hierarchically do "more yields"... :)

01:06 <lkcl_> yay! :)

01:06 <moony> i now am something of the way there toward a working CPU

01:06 <lkcl_> cool!

01:06 <lkcl_> let us know how it goes

01:08 <moony> so far so good, surprisingly so as I bit off a bit more than I knew how to accomplish and somehow was mostly successful anyways

01:28 <moony> what's a clean way to write to a Memory in test code? (so using the write port isn't an option as that would take up cycles)

01:29 <moony> I assume prodding at _array?

01:40 <moony> uh oh

01:40 <moony> oh, no

01:40 <moony> my issue

01:50 <moony> successfully spent forever hunting down a bug that didn't exist

01:50 <moony> oops

02:04 aaaa has quit [Ping timeout: 265 seconds]

02:04 aaaa has joined #nmigen

02:05 Degi has quit [Ping timeout: 265 seconds]

02:05 ronyrus has quit [Ping timeout: 265 seconds]

02:06 ronyrus has joined #nmigen

02:06 Stary has quit [Ping timeout: 265 seconds]

02:06 plaes has quit [Ping timeout: 265 seconds]

02:06 Stary has joined #nmigen

02:12 plaes has joined #nmigen

02:12 plaes has quit [Changing host]

02:13 Degi has joined #nmigen

02:29 jaseg has quit [Ping timeout: 272 seconds]

02:31 jaseg has joined #nmigen

03:16 electronic_eel has quit [Ping timeout: 260 seconds]

03:16 electronic_eel has joined #nmigen

03:37 PyroPeter_ has joined #nmigen

03:39 PyroPeter has quit [Ping timeout: 246 seconds]

03:39 PyroPeter_ is now known as PyroPeter

03:46 lkcl__ has joined #nmigen

03:48 lkcl_ has quit [Ping timeout: 256 seconds]

04:33 jaseg has quit [Ping timeout: 240 seconds]

04:35 jaseg has joined #nmigen

06:27 emeb_mac has quit [Quit: Leaving.]

06:37 hitomi2507 has joined #nmigen

07:31 proteusguy has quit [Ping timeout: 246 seconds]

07:31 proteus-guy has quit [Ping timeout: 260 seconds]

07:38 peeps[zen] has joined #nmigen

07:38 peepsalot has quit [Ping timeout: 256 seconds]

07:44 proteusguy has joined #nmigen

08:07 proteus-guy has joined #nmigen

08:37 proteus-guy has quit [Remote host closed the connection]

08:40 Asu has joined #nmigen

08:41 cr1901_modern has quit [Ping timeout: 240 seconds]

08:41 cr1901_modern1 has joined #nmigen

08:42 Asuu has joined #nmigen

08:45 Asu has quit [Ping timeout: 265 seconds]

08:54 cr1901_modern1 has quit [Quit: Leaving.]

08:54 cr1901_modern has joined #nmigen

09:39 _whitelogger has joined #nmigen

09:41 Asuu has quit [Ping timeout: 265 seconds]

09:43 Asuu has joined #nmigen

09:53 lkcl_ has joined #nmigen

09:56 lkcl__ has quit [Ping timeout: 240 seconds]

11:09 <DaKnig> lkcl_ : thanks for sharing. whats the problem with using the same fragment many times? so what if it copies the logic, its someting I am sure tools can optimize out...

11:10 <DaKnig> that's common subexpression consolidation, its very common to do this in compilers and im sure your backend would deal with it

11:14 <DaKnig> about memory issues, I totally get you. I really have the same problem. when working with big designs I have to have at least 3 files on screen at once (sometimes up to 6!) and my screen is tiny... with C/++ I feel like having just the file Im working on + the headers for other parts of the project is enough and I include usually examples of usage in the headers. but ofc in HDL its completely

11:14 <DaKnig> different.

11:27 <DaKnig> using (sync|comb) instead of m.d.($1) is a good idea! I didnt think about it, but actually if the module only deals with one time domain, makes total sense

11:44 <lkcl_> DaKnig: it's not necessarily guaranteed that yosys will optimise out repeated expressions

11:45 <DaKnig> I doubt it wont :)

11:45 <lkcl_> for example, on a "unit test" (small case) yosys successfully identifies a comparator-cascade and makes only 44 LUT4s

11:45 <DaKnig> if a few wires on the netlist have exactly the same drivers and same truth table, usually its optimized out

11:45 <lkcl_> however when that same module is *used* the result is 1,000+ LUT4s

11:45 <DaKnig> wow really? that's bad

11:46 <lkcl_> in other words, the easy cases, no problem

11:46 <lkcl_> but the more complex patterns, the current hypothesis under investigation is that it's unable to indentify the pattern

11:46 <lkcl_> after flattening, basically.

11:47 <lkcl_> just a word of caution not to rely on the (quite reasonable) expectation/assumption

11:51 <DaKnig> why is it so hard to notice that two wires in the netlist have the same expression assigned to them?

11:51 <DaKnig> many dependencies?

12:14 <lkcl_> the hypothesis that i have is that the substitution of expressions-into-expressions from yosys flatten causes it to no longer be capable of recognising the "outer" expression that it was, in the smaller (sub-module) case, perfectly well able to recognise

12:15 <lkcl_> a way to test that would be:

12:15 <lkcl_> for each identified tree-node module:

12:15 <lkcl_> optimise

12:15 <lkcl_> flatten

12:15 <lkcl_> repeat until completely flattened

12:31 <DaKnig> but even then you would still have subexpressions that can be merged, then merged again, all the way up, I'd assume?

12:31 <DaKnig> is that a fair assumption?

12:38 <lkcl_> well, the idea is with that approach that the expression that is repeated *within* a module is optimised and reduced down, *before* its inputs are multiply-substituted by "global flattening"

12:59 Asuu has quit [Quit: Konversation terminated!]

13:00 Asu has joined #nmigen

13:06 <moony> this is really strange, I have an FSM that's refusing to actually do anything, even if I make my start condition a simple dummy that sets itself to another condition, it'll always stay on condition 0. Will post my (poor) code in a min

13:09 <moony> all the other logic seems to work

13:09 <moony> i.e. the fetcher gets busy immediately while the CPU is hung

13:09 <lkcl_> happy to take a look, moony.

13:10 <moony> https://github.com/moonheart08/bad-cpu/blob/master/cpu.py#L57 it never advances beyond this state. Ever.

13:10 <lkcl_> m.next, not m.mode

13:10 <moony> bah

13:10 <lkcl_> :)

13:10 <moony> that should've been obvious

13:11 <lkcl_> weell...

13:11 <lkcl_> i'll not enumerate the number of things i missed that were obvious in hindsight :)

13:12 <lkcl_> moony: are you designing your own ISA?

13:12 <lkcl_> looks like it - cool! https://github.com/moonheart08/bad-cpu/blob/master/operand%20modes

13:13 <moony> mhm

13:13 <moony> came up with a simple(r) CISC that I could actually pull off

13:13 <lkcl_> one day i will implement an idea i came up with in 1990. 8-bit "escape-sequenced" instructions based on 2-bit bands, that gives an ISA very similar to the LZO compression algorithm

13:14 <lkcl_> without an escape-sequence to extend the "2-bit" of RA, it's only 2 bits.

13:14 <lkcl_> however _with_ a (first) escape-sequence RA becomes 16 bit register numbers

13:14 <lkcl_> another escape-sequence: 64-bit

13:15 <moony> went out of my way to make the CPU fetcher support async writes so I could just write and forget about it unless the write gets very stalled, so I could easily support the 6 different modes (Technically 12, as you need to be able to access the upper 8 registers and that's done by offsetting the "target" register for the mode by 8)

13:15 <lkcl_> bad-cpu is like a "proper" RISC, isn't it?

13:16 <moony> debatable

13:16 <moony> it's got decently complex op modes

13:16 <moony> i.e. the one I marked TODO is `(RA + RC * N + X), RB`

13:16 <moony> (and it's inverse)

13:17 <moony> I wrote them down beforehand to make sure I didn't forget, but didn't commit that with the repo, oops

13:17 <lkcl_> that's... MAC-with-immediate-offset. nice.

13:17 <moony> ish. N is always a power of 2

13:17 <moony> as it is on x86

13:17 <moony> so it's cheap

13:17 <moony> shift and add

13:17 <lkcl_> makes sense for word/etc.-alignment

13:18 <lkcl_> are you considering doing PowerISA style "load with update"?

13:18 <moony> hm, right

13:18 <moony> i never modified the fetcher to handle offset read/writes

13:18 <moony> it currently only does aligned

13:18 <moony> yet another TODO

13:18 <moony> hmm, i'll look at that instr

13:19 <lkcl_> as in: after the load effective-address calculation, update RA with that as a result

13:19 <moony> I kinda just copied my personal favorite operand modes out of the VAX

13:19 <lkcl_> it means doing two writes

13:19 <lkcl_> two reg writes

13:19 <lkcl_> nice :)

13:19 <moony> Oh, yea, see store_back_addr

13:19 <moony> which is a bit messily handled as I wanted to avoid the extra clock cycle

13:20 <lkcl_> the efficiency saving from load/store-with-update comes when referencing a struct in a loop

13:20 <moony> it's name is a misnomer which needs renamed

13:21 <lkcl_> because the 1st LD computes the address that you can then "add 12" (or whatever sizeof(struct)" to, to get the next struct

13:21 <moony> as it actually functions for both sides of the instruction

13:21 <lkcl_> where's store_back_addr?

13:21 <moony> in cpu

13:21 <lkcl_> ohh ok ah yeah i see

13:22 <moony> technically works for anything I like if I just set it during instruction execute

13:22 <lkcl_> so yes, looks like that's ld/st-with-update.

13:23 <lkcl_> did VAX have separate address and data registers, like the 68000 and CDC6600?

13:23 <moony> nah, it's a CISC

13:23 <moony> a very, very, very CISC CISC

13:23 <lkcl_> :)

13:23 <moony> too CISC for me to actually copy

13:23 <moony> I just stole the operand notation and some basic operands and called it a day

13:23 <lkcl_> lol

13:24 <lkcl_> if you get this right it should be frickin quick.

13:24 <moony> i.e. (RA + RC * N + X) would be X(RC)[RA] for the VAX (Note N isn't there, it's specified by instruction width, while I can only do 32-bit accesses)

13:24 <moony> lkcl_: it's sitting at 0.5 IPC if I don't need to add another state

13:24 <moony> at best, obv

13:24 <lkcl_> there's almost nothing to the decoder

13:24 <lkcl_> oh i mean if you end up pipelining it

13:25 <moony> oh, yea

13:25 <moony> you're 100% right

13:25 <moony> although the DECODE_2 step is a bit more complex

13:25 <moony> or will be

13:26 <moony> as it has to handle all the 2 word variants (aka any instr with an immediate following it)

13:26 <lkcl_> the CDC6600's decode phase is laughably trivial. bits from the instruction literally go directly through a 3 bit binary-to-unary expander and end up as the "enable" lines on any one of 8 function units (pipelines)

13:26 <lkcl_> ahh yeh

13:26 <moony> if you'd like, I can give you a link to a VAX manual. It's... less than trivial

13:26 <lkcl_> ahh... :)

13:26 <moony> as in, so un-trivial they dumped the arch because 1 IPC was impossible

13:27 <moony> operands had to be resolved sequentially

13:27 <lkcl_> i am presently trying to focus on POWER9. my brain will explode if i try to include a new ISA

13:27 <moony> and they were all at least 1 byte

13:28 <moony> (RA + RC * N + X) as X(RC)[RA] was 3 bytes at best (1 byte X), 6 bytes at worst (4 byte X)

13:28 <lkcl_> argh

13:28 <moony> fun arch, great to program for, but absolute hell to implement

13:28 <moony> it's a 1977 mainframe arch though

13:28 <moony> is it really surprising

13:29 <lkcl_> i'm a huge fan of the CDC6600, by James Thornton and Seymour Cray

13:29 <moony> at least it contributed some amazing things to the world. Like Ethernet, the IEEE 754 32-bit and 64-bit formats, the BSDs, and several other things

13:29 <lkcl_> they OoO design solved problems that hadn't even been realised were problems

13:29 <lkcl_> yehyeh

13:30 <moony> another arch I have on my desk rn is the MC88100, which is a very traditional RISC

13:30 <lkcl_> i was able to log in to a microvax at imperial college, played aroudn with it

13:30 <moony> I have a habit of giving SHL/SHR the boot and putting MAK/EXT (bitfield instrs) in their place because of it

13:30 <lkcl_> ooo that's another one that Mitch Alsup was involved in

13:30 <lkcl_> ah you know the story behind POWER9's shift routines?

13:31 <lkcl_> mask/rotate?

13:31 <moony> I'm just a teen hobbyist who absolutely loves old PCs, and most of the time people don't know what the VAX is somehow. No, but i'm open to listening.

13:31 <moony> s/PCs/computers/

13:31 <lkcl_> short version: they were on a serious gate budget, so rather than have a separate shift/mask set of gates in LD/ST

13:32 <lkcl_> what they did was: micro-op LD/ST and SHIFT/MASK, and join the two together via broadcast buses based on the register numbers RA, RB, RC, RS and RT

13:32 <moony> oh, one good question: You possibly know why Bitsavers might take down a scan? The scan for the VAX Architecture Handbook (which I personally own) seems to have been removed from their site at some point in the past. Neat.

13:33 <lkcl_> so LD would do a 32 (64?) bit wide aligned LD, then rather than pass it straight to the regfile, pass it to shift/rot which *then* did the mask/insert, and *then* did the store to regfile

13:33 <lkcl_> bitsavers?

13:33 <moony> http://bitsavers.org/ these guys. They've scanned countless old documents

13:34 <lkcl_> no idea

13:34 <lkcl_> wow

13:34 <moony> they have most DEC stuff scanned

13:34 <moony> same for Motorola stuff

13:34 <lkcl_> woooow

13:34 <moony> even have some of DEC's internal VAX design documents scanned which is impressive

13:35 <lkcl_> and Acorn RISC Machines (before they tried renaming to ARM)

13:35 <moony> http://bitsavers.trailing-edge.com/pdf/dec/vax/archSpec/EL-00032-00-decStd32_Jan90.pdf this one was great help when I was working on a VAX emulator to learn emulation. (Was way more than I could chew, but I got it all the way to running unprivileged code successfully :D)

13:35 <moony> basically DEC's internal spec for the arch

13:36 <lkcl_> they've got apricot on there! i knew the son of the founder of that, at school, back in 19801

13:36 <moony> they have many, many things

13:36 <lkcl_> http://bitsavers.org/pdf/apricot/pictures/Apricot_T.jpg

13:37 <lkcl_> no CDC6600 though :)

13:37 <lkcl_> no wait - found it! http://bitsavers.org/pdf/apricot/pictures/Apricot_T.jpg

13:37 <lkcl_> http://bitsavers.org/pdf/cdc/cyber/books/

13:37 <moony> yea, they have tons

13:37 <lkcl_> ok ok enough, i will be sucked in forever. wow, thank you though

13:37 <moony> np

13:38 <moony> it's a great resource

13:38 <moony> and I probably wouldn't have gone out and bought paperback copies of several VAX/PDP-11 things for learning purposes if it didn't exist, as I would've never even got to know the architecture (I initially learned the ISA through the scans)

13:39 <moony> it's a wonderful design, just not the most scalable :p

13:40 <lkcl_> :)

13:40 <moony> tbh, if I had the skill, I would absolutely try and make an FPGA VAX clone.

13:41 <moony> it'd obv have to be heavily microcoded to actually fit, just like the real-world arch :P

13:41 <moony> and I don't know anything about uCode design/theory/etc

13:41 <moony> I've got some info on the uCode for the VAX-11/780 on hand, but not enough to truly understand it

13:42 <lkcl_> micro-coding is very simple, you don't just do the "actual" op, you translate internally

13:42 <DaKnig> split your instructions into smaller common operations. implement those. then have a "translator" block as part of your fetching mechanism.

13:42 <DaKnig> that's not that hard :)

13:42 <lkcl_> so for example on POWER9, the microwatt team, instead of implementing add, sub, neg, addc, addex etc. etc.

13:42 <lkcl_> they have *one* operation, "OP_ADD"

13:42 <moony> VAX-11/780 has a 96 bit wide ucode op, and it barely pulls 1 instruction per 10 cycles, so there's some heavy complexity to it. Could also just be an intimidating fact to know though :p

13:43 <moony> hm

13:43 <lkcl_> and the instruction decoder says, "if this is subtract, then invert A and set carry-in to 1"

13:43 * moony will think about it

13:43 <DaKnig> its not uncommon for microcode to be very wide

13:43 <DaKnig> at least when you count all the control bits flying around

13:43 <lkcl_> DaKnig: sigh. finding that out, the hard way

13:44 <lkcl_> 192 extra wires into some of the pipelines in LibreSOC

13:44 <moony> I think there's already the core idea behind ucode just sitting in my CPU design, various flags the EXECUTE, FETCH, DECODE, etc stages all pass to each-other for control could be bundled up into a control block

13:44 <DaKnig> look im a beginner, its just something I noticed from my own experience and from how I imagine other systems are implemented.

13:44 <DaKnig> its not a bad thing- more info is available means you can have less complex logic

13:44 <lkcl_> moony: exactly. that's micro-coding, basically.

13:44 <DaKnig> basically think about your instructions in such a machine as "compressed instructions"

13:44 <moony> "compressed"

13:45 <moony> also VAX: 120 byte instruction, yay!

13:45 <lkcl_> yay! :)

13:45 <Lofty> Itanium eat your heart out /s

13:45 <moony> (the architecture is capped at 255 bytes for an instruction, as well.)

13:45 <DaKnig> probably the ucode for that instruction is even longer :)

13:45 <lkcl_> "compressed", basically, you have a 16-bit to 32-bit ISA "map" of the most commonly-used instructions

13:46 <DaKnig> isnt that why people say that CISC has a RISC inside it plus a decoder?

13:46 <lkcl_> so you only fetch 16 bits, but internally you first expand it to an "equivalent" 32-bit opcode then put *that* into the decoder

13:46 <moony> my "bad" CPU (time to go rename it) just directly feeds a bunch of bits to the ALU for common ops

13:46 <moony> but for more complex ops maybe I should work out a way to dedup their logic

13:46 <lkcl_> which is extremely efficient, and is the whole basis of the "proper" RISC paradigm

13:47 <lkcl_> DaKnig, this is ridiculously small https://github.com/moonheart08/bad-cpu/blob/master/instr_decode.py

13:47 <lkcl_> :)

13:48 <lkcl_> love the chat - really have to get back to fixing a LD/ST FSM...

13:48 <moony> yea.. the only "complex" to decode instr is any instr using the (RA + RC * N + X) mode, and that's only as the effective address gen will need to split out some more bitfields

13:48 <moony> I came up with something simple, and I'll stick to it for now to avoid feature creep :p

13:49 <DaKnig> why is this complex?

13:49 <moony> see the quotes.

13:49 <moony> ""complex"" (it's not)

13:49 <DaKnig> if you have hardware mul, you can just turn it into MAC, add, deref/do whatever

13:49 <moony> it's a shift-add + deref, yep

13:49 <DaKnig> the back propogation net in the pipeline should take care of the rest

13:49 <moony> simple too

13:52 <moony> in an earlier version of the arch I was considering making whatever NOP instr I decide on "zero cycle" in that the decode will immediately move to the next half of the 32-bit word it currently has, but that's not necessary anymore with my smarter fetcher design (which, among other things, will pre-emptively queue up the next 4 words)

13:52 <moony> s/4/3 as the 0th is the one it's currently giving to the CPU :p

13:54 <moony> anyways, question: What might be a good way to handle converting an unaligned load/store request into multiple aligned ones?

13:55 <moony> It's probably easy, but I only know the software way of emulating that behavior that you p much only use when you're trying for cycle accuracy in an emulator :p

14:01 <DaKnig> if you load 8 bits when the bus is 64bits wide (for example), you can load the whole 64 bits, then have a word_select thing (aka a mux)

14:02 <lkcl_> moony: well... i can describe the way that we're doing it for LibreSOC

14:02 <moony> i'm trying to convert a 32-bit unaligned load/store to an aligned one

14:02 <moony> sure

14:02 <lkcl_> however it's designed for serious throughput

14:02 <DaKnig> if the request loads 16 bits that are unaligned with said 64bits wide data bus, you can load the (rounded down) address and the one right after it, combine them and *then* use word_select

14:03 <lkcl_> basically, take the lower bits of the address, and the ld/st-length, and turn them into a bytemask

14:03 <moony> lkcl_: considering I was somehow able to make my fetch unit not be garbage, I probably have some small chance of understanding

14:03 <lkcl_> effectively, just like wishbone "sel" when you set 8-bit granularity

14:03 <DaKnig> lkcl_: question, what about my design that I just described doesnt give you enough throughput?

14:04 <lkcl_> DaKnig: we're planning an advanced (simultaneous multi-ld/st) version of that, in effect

14:04 <DaKnig> I see.

14:04 <moony> my fetcher could be better (it could simply never drop READ_EN when doing bulk reads) but it's already fast enough as is and when working with SRAM that can respond the next cycle the CPU can't out-pace it

14:04 <DaKnig> do you check if you can avoid unaligned access to save on bandwidth?

14:04 <lkcl_> so now you can think of those LD/STs as just *literally* being like wishbone 64-bit requests plus an 8-bit mask of "sel"

14:04 <lkcl_> DaKnig: no, what we do is:

14:05 <lkcl_> * use the bottom 4 bits of the address (0-15)

14:05 <moony> I assume the underlying code for LibreSoC isn't public yet?

14:05 <lkcl_> * create a 16-bit mask

14:05 <lkcl_> * split it into two halves (2 separate 64-bit requests, each with their own 8-bit wishbone-style "sel")

14:06 <lkcl_> * do *BOTH* of those simultaneously

14:06 <moony> oh, no

14:06 <moony> it is

14:06 <lkcl_> moony: yes

14:06 <lkcl_> http://git.libre-soc.org

14:06 <DaKnig> how wide is the data bus?

14:06 <lkcl_> a whopping 256 bits.

14:06 <moony> just on your own git instead of github or similar, alright

14:06 <lkcl_> 4x 64-bit

14:06 <lkcl_> moony: yes. because we take the "libre" bit seriously

14:06 <moony> 256b seems p normal in hindsight, at least knowing how big modern CPU busses usually are

14:07 <lkcl_> moony, yeah. in GPU terms (those that use GDDR5) it's peanuts

14:07 <moony> i.e. Zen 2's internal bus is 512-bit iirc, while the external is just raw DDR4 memory lanes (as it's a SoC)

14:08 <DaKnig> and after that you just concat and select with the bottom 4 bits

14:09 <DaKnig> with the mask created by* the bottom 4 bits

14:09 <DaKnig> right?

14:09 <lkcl_> that makes sense, given if you want to handle 4k

14:09 <lkcl_> yes, once you have those 2x 64-bit aligned requests, some simple masking followed by shift/concatenate, and you're done

14:10 <lkcl_> in the "first iteration" we're thunking down those 2x 64-bit requests onto the same internal 64-bit wishbone bus

14:10 <lkcl_> so they can't happen simultaneously (ever)

14:10 <DaKnig> you might save on one of those requests tho with a bit of extra logic

14:10 <lkcl_> exactly, yes.

14:10 <moony> I'm getting an FPGA board soon, a ULX3S, so I'll probably get to enjoy the wonders of writing an sdram driver soon.

14:10 <DaKnig> not sure how that's gonna make it faster tho

14:10 <moony> ECP5-85F, so I can use Yosys and be happy

14:10 <lkcl_> if the mask (wishbone 8-bit "sel") is zero for either of the 2 64-bit requests, then you don't need to do the request at all.

14:10 Sarayan has joined #nmigen

14:11 <lkcl_> moony: i knooow, i'm jealous :)

14:11 <moony> You probably need a larger/more capable board to work on the SoC :P

14:11 <lkcl_> i have a versa-ecp5 and it's... ok.

14:11 <moony> oh, huh

14:12 <lkcl_> (45k LUT4s). at the moment because we're limiting what's being done, we're at 16k LUT4s.

14:12 <moony> lkcl_: ah. I might tinker with it sometime then, as it'll fit nicely and that'd be coooool

14:12 <lkcl_> :)

14:12 <DaKnig> the only thing stopping me from getting an ecp5 board is I didnt see any that has enough IO for my need. ones with HDMI, PLLs and DDR memory builtin are rare... can only get 7-series ones for my price range and those requirements

14:12 <moony> I'm probably not knowledgeable enough to help though.

14:13 <lkcl_> the litex sim works fine. https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/litex/florent/sim.py;hb=HEAD

14:13 <moony> considering all I've done for the past forever is just computer programming, I'm just now learning hardware design :p

14:13 <moony> this CPU is my first ever serious project

14:13 <moony> and I have no idea why it's going so well lol

14:13 <lkcl_> i'm currently fighting with the LD/ST FSM, which is being... odd and obtuse

14:13 <lkcl_> lol

14:13 <moony> i.e. how the heck did I manage to make the fetcher work so well

14:14 <lkcl_> probably _because_ you've done programming

14:14 <DaKnig> making useful (simple) CPUs is not that hard; compare that to , say, compilers...

14:14 <moony> getting stuff running nicely async is still a bit new to me

14:14 <lkcl_> DaKnig, yyeah, daveshah's ones are really good and have everything

14:15 <lkcl_> except he doesn't get much in the way of demand in order to turn it into a business

14:15 <moony> once I finish this design i'll probably just start a new design that's more complex :p

14:15 <lkcl_> at least he libre-licenses the full CAD files of what he does

14:15 <moony> ...I wonder how hard a PDP-11 soft-core would be

14:15 <moony> might look at that

14:15 <moony> most of the complexity is, like VAX, from the operand modes

14:15 <moony> hardware complexity that is

14:15 <lkcl_> DaKnig, so you could, hypothetically, do your own PCB run of his ECP5 boards

14:16 <lkcl_> PDP-11, the precursor to the 68000, right? i think it has those separate address and data registers

14:16 <moony> I have a big ol pile of DEC manuals I want to put to use

14:16 <moony> nah, PDP-11 is the VAX's precursor

14:16 <moony> it heavily inspired multiple MCUs though

14:16 <moony> probably 68k too

14:17 <moony> it doesn't have address registers iirc?

14:17 <lkcl_> the 68k (68000) was designed by Mitch Alsup. he was - still is - a fan of address/data registers because he studied the CDC 6600

14:17 <moony> unless i'm missing something about how it handles the extra 8 registers (Like my bad-cpu, it needs a different instruction mode/something to access the upper 8 registers)

14:18 <moony> they are GPR though, just checked

14:18 <lkcl_> i can't remember. it was... 1990 when i last looked at 68000 :)

14:18 <DaKnig> I ... really dont wanna make my own pcb. at some point, maintaining *all* your stack is annoying. gotta use others' works

14:18 <DaKnig> less time consuming, therefore cheaper

14:18 <moony> my PDP-11 Architecture Handbook is probably the most grimy of the set I have here, it's clearly been used a good bit :p

14:19 <lkcl_> DaKnig: sigh yehhh. i'd really like one of daveshah's more powerful ECP5 boards, too

14:24 <moony> lkcl_: yea, i'll almost definitely play with the SoC a bit when I get the board. It sounds fun

14:24 <moony> where's the main CPU? shakti-core?

14:25 <lkcl_> the OpenPOWER community's pretty nice.

14:25 <lkcl_> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/litex/florent/sim.py;hb=HEAD

14:25 <lkcl_> ok that's the litex simulation (you need to generate the verilog file first though)

14:26 <lkcl_> this is the "simple" FSM - the main instruction issuer loop

14:26 <lkcl_> https://git.libre-soc.org/?p=soc.git;a=tree;f=src/soc/simple;hb=HEAD

14:26 <lkcl_> it's temporary because i'm focussing on getting the instructions right, first

14:26 <moony> I should learn how pipelines truly function

14:26 <moony> yea, makes sense

14:26 <moony> something simple to test against

14:27 <moony> if my ""bad"" CPU design comes out well, I might try pipelining it

14:27 <lkcl_> pipelines are just some combinatorial logic that's joined with some clock-synchronised "registers"

14:27 <lkcl_> registers/latches

14:27 <moony> so basically the relationship my fetcher has with my CPU core? :P

14:28 <lkcl_> so where the gates would normally ripple and not stabilise before the clock goes "ping"

14:28 <lkcl_> you capture *partial* results in "latches"

14:28 <lkcl_> then continue on processing on the next cycle, in a new combinatorial block

14:29 <lkcl_> the "pipeline" bit is that you allow the 1st stage to start a *new* result whilst the 2nd stage is completing the next part

14:29 <lkcl_> extend it to 3-stage, 4-stage, however-many-you-want stage

14:30 * lkcl_ goes and reopens bad-cpu

14:30 <moony> you could theoretically apply that idea to the VAX, but you'd need a huge, x86-64 rivaling pipeline.

14:30 <lkcl_> indeed

14:31 <moony> and even then, complex instructions like `ADDL2 4(R1)[r2], 4(R3)[r2]` would be a pain in the butt

14:31 <lkcl_> so, if the fetcher FSM can process one instruction *at the same time* as the ALU is munching on the previously-fetched instruction, then yes, this is termed a "pipelined" design

14:32 <Sarayan> the fun part being managing the shared resources, like the memory port been fetches and instructions accessing memory

14:32 <moony> it indeed can, the fetcher procures instruction words asyncronously from the rest of the cpu core

14:32 <lkcl_> Sarayan: had a loootta fun with that, a couple days ago...

14:32 <moony> again, comparing to VAX, managing shared resources on it would be... not fun.

14:32 <moony> 12 registers in one instruction is the max.

14:32 <moony> aka basically the entire GPR file

14:33 <moony> as the last 4 have special uses per the cc

14:33 <lkcl_> moony: ah ok, so although you may have designed the fetcher FSM to be pipelined, here's the thing: if the fetcher FSM can only fetch one instruction every 2 to 3 clock cycles, then the ALU "pipelines" are going to be idle.

14:34 <lkcl_> yeah?

14:34 <moony> well, technically

14:34 <moony> but it helps that the CPU core takes at least 2 cycles per instr

14:34 <moony> and an instr is only half a 32-bit word

14:34 <lkcl_> the "solution" to that is to read 2, 3, 4 or 8 instructions

14:34 <lkcl_> ahh :)

14:34 <moony> the fetcher fetches 32 bits at a time

14:34 <moony> so, as I mentioned earlier, the fetcher will always keep up with the CPU if the external memory is fast enough

14:35 <lkcl_> ok so you can... ah, yes, *now* you can feed the 1st 16 bits on 1 cycle and the 2nd 16 bits on the next

14:35 <lkcl_> and during the "off" cycle, it fetches *another* 32 bits, right?

14:35 <moony> yep

14:35 <lkcl_> cool! then you've got a pipelined design

14:35 <moony> I'm going to try and make it so it doesn't have to disable READ_EN if it has more to fetch, which should make it even faster

14:35 <moony> well

14:35 <moony> fast enough it doesn't even matter lol, it'll just save some contention in edge cases

14:36 <lkcl_> i mean, it may only be a 2-stage, but it's still pipelined

14:36 <lkcl_> now as Sarayan says: if you can get the register reads / writes to not corrupt, and operate on a *3rd* cycle, now you have a 3-stage pipeline

14:37 <moony> once again I have no idea how I pulled this off with no real prior knowledge of CPU design beyond "hey pipelines exist"

14:37 <lkcl_> lol. the thing is: now you run into problems with instructions trying to read results that aren't ready yet

14:37 <lkcl_> you put 2 instructions:

14:37 <lkcl_> mul r1 <- r2, r3

14:38 <lkcl_> add r5 <- r1, r1

14:38 <moony> yeeep

14:38 <lkcl_> the result of the mul takes 3 cycles... you need to decide:

14:38 <DaKnig> lkcl_: are you assuming multi cycle mul?

14:38 <moony> at that point i'd have to use, say, a scoreboard. I only know what a scoreboard is because the MC88100 manual explains it lol

14:38 <lkcl_> a) do i care? :)

14:38 <DaKnig> ah.

14:38 <DaKnig> cant you have 1-cycle mul in your FPGA?

14:38 <DaKnig> DSP slices are quite fast

14:39 <DaKnig> certainly faster than 50MHz I think

14:39 <lkcl_> DaKnig, not necessarily, just the fact that the design *is* pipelined (even 2 stages) is enough

14:39 <moony> also the problem of jumps, which have to eat a cycle as they're tricky to pipeline

14:39 <lkcl_> moony: https://libre-soc.org/3d_gpu/architecture/6600scoreboard/

14:39 <lkcl_> if you're interested i can email you Mitch Alsup's book chapters on scoreboard design

14:40 <DaKnig> moony: I would really suggest to find some computer architecture design course online. it should answer all your questions and teach you much more

14:40 <lkcl_> in-order designs, they're much simpler: you "stall".

14:40 <DaKnig> just have a good predictor :)

14:40 <lkcl_> if that r1 hasn't been written yet, (the mul followed by add), you simply stall the instruction issue

14:40 <lkcl_> DaKnig: yeah, predictors are... fuuun.

14:41 <lkcl_> an in-order "stall" system is basically a degenerate Scoreboard Matrix of width and height 1

14:41 <DaKnig> why stall? having a network that returns the new result back to the system would be much better

14:41 <DaKnig> ... no?

14:42 <lkcl_> DaKnig: it's... complicated. it took me 5 months of talking with Mitch Alsup on comp.arch to fully understand scoreboards

14:42 <DaKnig> scoreboards are the term? hm

14:43 <lkcl_> there's 2 basic designs

14:43 <DaKnig> I had a very simple pipeline with that, I just kept track of what regs are used where in the pipeline

14:43 <lkcl_> Tomasulo, and Scoreboards https://www.youtube.com/watch?v=zS9ngvUQPNM

14:43 <DaKnig> ok will watch. thanks

14:44 <Sarayan> alternatively, the toshiba sh2 has a well-described 5-stage pipeline that can be seen as an example of practical implementation, given the manual indicates all the interactions requiring interlocks

14:44 <lkcl_> ah that's an in-order-style solution, which, if you're not careful (not stalling) i can't tell you exactly why, but you'll get data corruption

14:44 <lkcl_> Sarayan, you mean, it defines the timing and the compiler / assembly-writer has to respect that?

14:46 <lkcl_> there's historic and current designs which successfully do that, like the TI VLIW DSPs

14:47 Asuu has joined #nmigen

14:48 Asu has quit [Ping timeout: 265 seconds]

14:49 <Sarayan> https://antime.kapsi.fi/sega/files/h12p0.pdf

14:49 <Sarayan> Page 168+

14:49 <Sarayan> they describe in detail how opcodes are executed so that compiler programmers/assembly writes can use the processor efficiently

14:50 <lkcl_> nice. makes for very simple hardware

14:50 <lkcl_> the first SPARC processors did this, iirc. the first compilers? simply inserted a bunch of NOPs... :)

14:50 <lkcl_> urrrr

14:51 <Sarayan> all the interlocks are automatic in the sh, but the description is so complete that tells you what you should look at when making your own

14:53 <lkcl_> i remember my colleagues doing assembly-level programming of TI DSPs, in 1994. CEDAR Audio. there were only 1024 clock cycles available per 24-bit audio sample and the compiler wasn't efficient enough

14:53 <Sarayan> oh yeah, that must be fun

14:54 <lkcl_> now of course they can just use the main processor (doh)

14:55 <lkcl_> this was like 386sx16-->486dx25 with Windows 3.1 "locking up interrupts" days, whereas the TI DSPs were 50 MFLOPs sustained

14:55 <Sarayan> trivia: the mu100 synthesizer integrated effects DSP runs 768 instructions per sample, and afaict had no branching

14:56 <lkcl_> ooOoo :)

14:56 <Sarayan> I guess all instructions are single-clock

14:57 <Sarayan> external memory access (for reverb ram) is 3 clocks, memory instructions are at addresses multiple of 3. Even more amusing, on reads, the instruction that does something with the result is at 3n+2

14:57 <Sarayan> a very synchronous dsp

14:58 <Sarayan> instructions are vliw too, because why not

14:59 <moony> i've always wondered if there's more ways to make our fundamentally in-order CPU designs more efficient by, well, dumping a bit of said order :P

15:00 <moony> I know, for example, Mill is working on smth like that

15:02 <moony> something I do kinda miss in modern computers is also the idea of having more specialized hardware for tasks. Where's our DSPs? :(

15:03 <moony> everything's loaded onto the main CPU, even if it turns out to be one of the least efficient ways to do a task

15:04 <moony> GPUs exist obv, but they're really the only big co-processor in our computers :P

15:04 <lkcl_> i've spent some time on comp.arch, learning about the Mill. it's extremely cool.

15:05 <MadHacker> Well, GPUs make fairly decent DSPs; bearing in mind communications overhead with other devices, I'm not sure that other copros have much to offer any more.

15:05 <lkcl_> it only has "ADD" and "MUL" (not ADD8, ADD16, ADD32, ADD64, ADD-signed blah blah)

15:05 <MadHacker> Also there's a lot of very specialised ones you never really see, like in NICs and the like.

15:05 <lkcl_> the "width" (and type) is taken from the LD operation and carried right the way through even to ST

15:06 <lkcl_> moony: this is what ARM SoCs advocate, having specialist blocks for AES, Video, etc.

15:06 <moony> mhm

15:07 <Sarayan> lkcl: the issue tends to be if/how it interacts with the cache

15:08 <Sarayan> if your aes block flushes the cache because it's a dma, well, reloading it afterwards kills all the gain usually

15:08 <lkcl_> yeah

15:08 <lkcl_> oh interesting

15:08 <lkcl_> of course

15:09 <Sarayan> that makes accelerators hard

15:09 <lkcl_> and the software gets more complex (DMA, userspace-kernelspace)

15:10 <lkcl_> it's why we chose libre-soc to be a hybrid CPU-VPU-GPU. actually extending POWER9 to include sin, cos, texture interpolation, yuv2rgb and so on

15:11 <lkcl_> i don't know if you've seen how normal GPU architectures handle the software side - it's mental :)

15:11 <lkcl_> inter-process communication and synchronisation of multi-megabyte data structures in shared memory!

15:12 <lkcl_> that has to involve userspace-kernelspace-interprocessor_bridge-kernelspace-userspace interaction

15:12 <lkcl_> mmmental

15:12 Asuu has quit [Read error: Connection reset by peer]

15:12 emeb has joined #nmigen

15:13 <Sarayan> plus massive parallelism

15:13 <lkcl_> ah there is that :)

15:15 Asuu has joined #nmigen

15:18 <moony> lkcl_: yea, GPUs are nuts

15:18 <moony> and imo i'd absolutely love that kind of massive parallelism for some tasks

15:19 <moony> a CPU that has a large number of "little" cores designed for high parallelism would be fun

15:19 <moony> (and some "big" cores for single-thread/dual-thread tasks)

15:20 <MadHacker> Isn't that just a CPU with an integrated GPU? :D

15:20 <Sarayan> You mean a recent intel gpu with integrated graphics? ;-)

15:20 <Sarayan> mwahaha MH

15:20 phire has quit [Remote host closed the connection]

15:21 <moony> p much :p

15:21 phire has joined #nmigen

15:21 <MadHacker> The Intel knight's [landing, corner, whatever] series were an x86-flavoured variation on that theme.

15:22 <MadHacker> Lots of small x86 cores running in parallel.

15:23 <MadHacker> https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing - those things.

15:24 <MadHacker> Of course, you can get 64 cores in a mainstream CPU now, so it's not THAT much parallelism.

15:25 <moony> alright, got bad-cpu executing instructions again (I scrapped the old fetcher and it was too tightly integrated so I had to just snip most of the core), and now it's running at 0.5 IPC

15:25 <moony> yay

15:26 <Sarayan> yeah, a good pipeline should get you almost 1ipc mean

15:26 <moony> yea. This is only a 2-stage pipeline though

15:26 <Sarayan> superscaling is yet another keetle of fish

15:27 <moony> I'll finish this design then design a good pipeline :P

15:27 <Sarayan> are you load-store?

15:27 <Sarayan> (most everything in register, dedicated instruction for load or store from/to ram)

15:28 <moony> no, loads/stores are handled with operand modes

15:28 <Sarayan> hmmm

15:28 <Sarayan> there's a fair change you can't really go over 0.5ipc then

15:28 <Sarayan> fetch is going to collide with instruction memory access

15:28 <moony> actually

15:29 <Sarayan> unless you go harvard, of course, or are really good with I$

15:30 <moony> it won't. instrs are 16-bit, so the CPU will likely have the next instr words pre-fetched, and the fetcher will prioritize, well, anything that isn't a instr fetch.

15:30 <moony> under normal operation rn, the fetcher fetches the next word every other cycle for normal form instructions

15:30 <moony> though

15:31 <moony> with a bigger pipeline that'd probably change

15:31 <Sarayan> how wide is your bus?

15:31 <moony> 32-bit external

15:31 <moony> so it's usually ahead of the CPU

15:31 <Sarayan> the sh2 has 16-bits instructions and a 32-bits internal bus, so it actually fetches every other instruction

15:32 <moony> tl;dr similar to the SH2, unless of course a 2 word instruction is being executed

15:32 <Sarayan> it's not really speculating, just reading with a 32bits granularity

15:32 <Sarayan> you have 2-word instructions?

15:33 <moony> only used when an instr has an immediate attached

15:33 <moony> they, of course

15:33 <moony> slow down the CPU a bit

15:33 <Sarayan> yeah, sh2 doesn't have that

15:33 <moony> mhm

15:35 <moony> alright, finally running instructions properly. Only the register/register ones though

15:35 <moony> next up, make FETCH work

15:50 <Degi> Is there something faster than using if/else for MUXing 2 signals?

15:53 <moony> presumably, Mux()

15:57 <moony> yay, load store works

15:58 <moony> now for one last puzzle piece, a working external bus that isn't a thunk :p

16:08 <moony> what's a good way to handle a ROM in nMigen?

16:13 hitomi2507 has quit [Quit: Nettalk6 - www.ntalk.de]

16:19 <Degi> Maybe a RAM with data already filled in?

16:20 <Degi> you can pass 'init=' to a Memory

16:51 <sorear> Can we reserve this for conversations at least somewhat relevant to nmigen?

16:58 <moony> sorear: I think discussing a CPU being written in nmigen, over an hour ago, is perfectly fine.

17:00 <sorear> it’s not about how long ago it was, it’s about how many pages of scrollback you take up

17:36 <lkcl_> sorear: apologies

18:53 Asu has joined #nmigen

18:54 Asuu has quit [Ping timeout: 240 seconds]

18:55 lkcl__ has joined #nmigen

18:56 lkcl_ has quit [Ping timeout: 265 seconds]

19:00 Asuu has joined #nmigen

19:00 Asu has quit [Ping timeout: 264 seconds]

19:14 Lord_Nightmare has joined #nmigen

19:26 <_whitenotifier-3> [nmigen] pbsds opened pull request #481: hdl.ast.Value: Add .to_signed() method - https://git.io/JJAuR

20:12 Asuu has quit [Ping timeout: 240 seconds]

20:13 Asu has joined #nmigen

20:31 emeb_mac has joined #nmigen

20:51 <Degi> What is the best way to find out time-consuming things?

20:55 <vup> `python3 -m cProfile -s time yourfile.py` or maybe `-s tottime`?

20:59 <Degi> Ah I mean taking up time as in long carry chains etc. since my thingy only compiles to like 300 MHz

20:59 <vup> ah

20:59 <vup> although 300MHz doesn't sound too bad

20:59 <vup> what are you using for pnr?

21:00 <Degi> nextpnr ecp5

21:00 <Degi> Hm, AsyncFIFO(Buffered) seems to be slow

21:00 <Degi> Without that it compiles to 1100 MHz

21:03 <daveshah> nextpnr is being somewhat optimistic here, as it doesn't take into account the fact the global clock tree is only rated to 370MHz

21:04 <Degi> Hm, in practice it works at 800+ MHz

21:04 <Degi> (not this example, but the clock tree itself, in that case it was some 30 bit counter or so)

21:06 SpaceCoaster has quit [Quit: ZNC 1.7.2+deb3 - https://znc.in]

21:06 SpaceCoaster has joined #nmigen

21:06 <daveshah> Is this a 1.2V part by any chance?

21:06 <Degi> yes

21:06 <daveshah> I think that spec wasn't updated accordingly, so it is not surprising it can go a lot higher

21:06 <Degi> And what is the limit on clock domains?

21:06 <Degi> Like on the number of them

21:06 <daveshah> 16

21:07 <Degi> hmh okay

21:07 <daveshah> In theory up to 64 with cleverer placement but nextpnr's global code would need some changes to support this

21:12 <Degi> That would be nice heh

21:13 <daveshah> I'd rather work on cross clock constraints first, that would probably be more useful

21:13 <daveshah> At the moment actually using even a few clock domains with complex crossings is quite annoying

21:14 <daveshah> Out of curiosity, why are you needing more than 16 domains?

21:22 emeb has quit [Ping timeout: 240 seconds]

21:22 <Degi> Somehow something is broke and that led to me making gateware which takes 4 clock domains for 1 data lane...

21:22 emeb has joined #nmigen

21:22 <Degi> And there can be up to 4 data lanes. Maybe I can optimize that to 2 clock domains or even combine clock domains of lanes (for later), a problem was that the SERDES gearing seems to be broken

21:32 Asu has quit [Quit: Konversation terminated!]

22:11 lkcl_ has joined #nmigen

22:14 lkcl__ has quit [Ping timeout: 240 seconds]

23:07 <_whitenotifier-3> [YoWASP/yosys] whitequark pushed 3 commits to release [+0/-0/±3] https://git.io/JJA6I

23:07 <_whitenotifier-3> [YoWASP/yosys] github-actions[bot] 7fc68da - Update upstream code

23:07 <_whitenotifier-3> [YoWASP/yosys] github-actions[bot] a38714b - Update upstream code

23:07 <_whitenotifier-3> [YoWASP/yosys] github-actions[bot] 7017d1c - Update upstream code

23:07 <_whitenotifier-3> [YoWASP/nextpnr] whitequark pushed 3 commits to release [+0/-0/±4] https://git.io/JJA6L

23:07 <_whitenotifier-3> [YoWASP/nextpnr] github-actions[bot] a390d11 - Update upstream code

23:07 <_whitenotifier-3> [YoWASP/nextpnr] github-actions[bot] b6894f6 - Update upstream code

23:07 <_whitenotifier-3> [YoWASP/nextpnr] github-actions[bot] 5385082 - Update upstream code