#nmigen on 2020-08-21 — irc logs at freenode.irclog.whitequark.org

2020-08-09 01:55 ChanServ changed the topic of #nmigen to: nMigen hardware description language · code at https://github.com/nmigen · logs at https://freenode.irclog.whitequark.org/nmigen · IRC meetings each Monday at 1800 UTC · next meeting August 17th

00:00 esden has joined #nmigen

00:07 Degi has quit [Ping timeout: 246 seconds]

00:09 Degi has joined #nmigen

00:25 lkcl has joined #nmigen

00:39 emeb has quit [Quit: Leaving.]

02:28 jaseg has quit [Ping timeout: 244 seconds]

02:30 jaseg has joined #nmigen

03:04 emeb_mac has quit [Ping timeout: 240 seconds]

03:06 emeb_mac has joined #nmigen

03:16 electronic_eel has quit [Ping timeout: 260 seconds]

03:16 electronic_eel has joined #nmigen

03:20 emeb_mac has quit [Ping timeout: 264 seconds]

03:22 emeb_mac has joined #nmigen

03:28 emeb_mac has quit [Ping timeout: 240 seconds]

03:29 emeb_mac has joined #nmigen

03:33 emeb_mac has quit [Ping timeout: 240 seconds]

03:34 PyroPeter_ has joined #nmigen

03:36 emeb_mac has joined #nmigen

03:38 PyroPeter has quit [Ping timeout: 260 seconds]

03:38 PyroPeter_ is now known as PyroPeter

04:18 _whitelogger has joined #nmigen

04:23 proteusguy has quit [Ping timeout: 264 seconds]

04:27 <d1b2> <edbordin> whitequark, cr1901_modern, not sounding like there is a lot of enthusiasm for that sby PR. do you think we can convince them? main issue seems to be the current code is "battle tested" and there aren't comprehensive regression tests for this core code. I can understand them not wanting to ruin it for customers.

04:29 <d1b2> <edbordin> I notice that at some point in the past year, there was a workaround inserted on windows to call sleep(0.1) instead of select so the loop wouldn't spin too hard

04:31 <d1b2> <edbordin> I could compromise and PR the console process group on windows so that there is less chance of runaway processes at least. In practice I haven't hit that issue because the engine tends to crash anyway when the stdin pipe closes

04:33 <d1b2> <edbordin> a more chaotic option would be to ship this patched version in open-tool-forge and crowdsource regression testing 😛

04:35 proteusguy has joined #nmigen

04:46 <cr1901_modern> At this point, I want to wait and see what upstream wants.

05:56 cr1901_modern has quit [Quit: Leaving.]

05:58 cr1901_modern has joined #nmigen

06:38 TiltMeSenpai has quit [*.net *.split]

06:38 FL4SHK has quit [*.net *.split]

06:38 cr1901_modern has quit [*.net *.split]

06:38 MadHacker has quit [*.net *.split]

06:42 emeb_mac has quit [Quit: Leaving.]

06:43 cr1901_modern has joined #nmigen

06:43 TiltMeSenpai has joined #nmigen

06:43 FL4SHK has joined #nmigen

06:43 MadHacker has joined #nmigen

06:49 hitomi2507 has joined #nmigen

07:23 cr1901_modern has quit [Quit: Leaving.]

07:24 cr1901_modern has joined #nmigen

07:28 cr1901_modern has quit [Ping timeout: 240 seconds]

07:50 PyroPeter has quit [*.net *.split]

07:50 lkcl has quit [*.net *.split]

07:50 Degi has quit [*.net *.split]

07:50 esden has quit [*.net *.split]

08:22 esden has joined #nmigen

08:22 lkcl has joined #nmigen

08:22 PyroPeter has joined #nmigen

08:22 Degi has joined #nmigen

08:30 cr1901_modern has joined #nmigen

08:31 cr1901_modern has quit [Client Quit]

08:34 Asu has joined #nmigen

08:35 Asu is now known as Asuu

08:35 Asuu is now known as Asu

08:37 cr1901_modern has joined #nmigen

08:39 cr1901_modern has quit [Client Quit]

08:39 cr1901_modern has joined #nmigen

08:41 cr1901_modern has quit [Client Quit]

08:44 sorear has quit [Read error: Connection reset by peer]

08:45 sorear has joined #nmigen

09:20 cr1901_modern has joined #nmigen

10:59 <Degi> Can I somehow get how full a FIFO is?

11:00 <DaKnig> switches dont have to cover all cases right?

11:01 <DaKnig> (useful for me when doing some synchronous stuff)

11:01 <Lofty> Degi: https://github.com/nmigen/nmigen/commit/73f672f57c606ac29087ffb09c43a7e4d9a9dfc6

11:04 <vup> Degi: Lofty: note r_level and w_level for {Buffered,}AsyncFifo is currently broken in some ways

11:26 <Degi> Oh I want a sync one anyway

11:26 <Degi> Cool, that's new, nice

11:29 <vup> well sync fifos have had level since they were added

11:30 <Degi> Is w_level and r_level for the respective domains?

11:30 <vup> yes

11:31 <Degi> Nice

11:42 <DaKnig> whitequark: I remember you saying that inheriting from Signal is not supported, can you explain why?

11:43 <DaKnig> I thought about making an Integer class, that knows the max and min value, but otherwise behaves like a Signal

11:45 <DaKnig> it could reduce some logic in some cases - if you have an unsigned(4) number and it only goes up to 10, checking `num==9` is the same as checking `num[0] & num[3]` (of course for larger examples the benefit is bigger, I think)

11:45 <DaKnig> wouldnt inheriting from Signal here be beneficial?
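
A quick exhaustive check of the claim above, as a plain-Python sketch rather than nMigen code. It assumes "only goes up to 10" means the inclusive range 0..10; the names `MAX_VALUE` and `bit` are made up for the illustration.

```python
# Check that, for a 4-bit value restricted to 0..10, testing bits 0 and 3
# singles out exactly the value 9.
MAX_VALUE = 10  # assumed inclusive upper bound

def bit(value, index):
    """Return bit `index` of `value` as 0 or 1."""
    return (value >> index) & 1

# Values in the restricted range where both bit 0 and bit 3 are set.
restricted = [n for n in range(MAX_VALUE + 1) if bit(n, 0) & bit(n, 3)]

# The same test over the full 4-bit range, where it is no longer unique.
unrestricted = [n for n in range(16) if bit(n, 0) & bit(n, 3)]

print(restricted)    # [9]
print(unrestricted)  # [9, 11, 13, 15]
```

The shortcut only holds because 11, 13 and 15 can never occur, which is exactly the range information a plain shape does not carry.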

12:10 lkcl_ has joined #nmigen

12:13 Asu has quit [Remote host closed the connection]

12:13 Asu has joined #nmigen

12:14 lkcl has quit [Ping timeout: 272 seconds]

12:19 Chips4Makers has joined #nmigen

12:23 Asuu has joined #nmigen

12:26 Asu has quit [Ping timeout: 240 seconds]

12:49 zignig has joined #nmigen

12:49 * zignig has been erratic for some time, working boneless-v3 shell coming soon.

13:56 <pepijndevos> DaKnig, my understanding of it is that it's just an AST node so the backends need to have specific understanding of each class. So while it might not immediately break, it's not part of the "public API" and subject to breakage.

13:57 <pepijndevos> Subclassing UserValue is supported for that particular case of course...

13:58 <pepijndevos> What I'm doing with my fixed point number class is that it wraps an internal signal

13:58 <pepijndevos> This is actually a pretty smooth experience, except when it comes to arrays.

14:00 <pepijndevos> IMHO if you're just adding functionality to Signal that has general usefulness and does not conflict with nMigen philosophy, maybe a PR is the best solution? :)))

14:16 <DaKnig> it doesnt go well with the nmigen philosophy though

14:17 <DaKnig> as nmigen, according to wq, should be typeless, having just the "shape" thing

14:17 <DaKnig> so no integers and such

14:17 <DaKnig> but I find integers to be quite useful

14:17 <DaKnig> what's your problem with arrays?

14:23 emeb has joined #nmigen

15:00 <DaKnig> pepijndevos

15:41 <lkcl_> whitequark: about DaKnig's question (inheriting from Signal), i (sort-of) remember the discussions we had 2 years ago. PartitionedSignal is the result of (in 4-6 months time) needing to implement SIMD operations behind a nmigen-identical API

15:41 <lkcl_> including Arrays of SIMD Signals

15:42 <lkcl_> i'd be interested to have a (short) conversation about it some time. it's a roadmap item for us though - not urgent

15:47 * lkcl_ has a conversion of MiSOC wishbone DownConverter up and running

15:47 <lkcl_> unit tests work, litex sim of libresoc works, FPGA building now...

15:48 <lkcl_> similar to this - i think https://github.com/nmigen/nmigen-soc/pull/21

15:49 <lkcl_> although... i'm having difficulty understanding this code:

15:49 <lkcl_> https://github.com/Fatsie/nmigen-soc/blob/wishbone_connector/nmigen_soc/wishbone/bus.py#L262

15:53 Asuu has quit [Ping timeout: 258 seconds]

15:53 Asu has joined #nmigen

15:58 hitomi2507 has quit [Quit: Nettalk6 - www.ntalk.de]

15:59 <DaKnig> lkcl_: can you tell me more about that?

16:00 <DaKnig> what do you exactly mean

16:12 <lkcl_> DaKnig: the SIMD Signal class? or the wishbone downconverter?

16:13 <DaKnig> SIMD Signal class

16:13 <lkcl_> ok, so what would be like the "normal" way to do a SIMD ALU? assume in nmigen

16:13 <DaKnig> you can use `map` for this kinda stuff (plus Array) no?

16:14 <DaKnig> or for loop with word_select

16:14 <lkcl_> and there would be a construct along the lines of:

16:14 <lkcl_> switch (SIMD_width):

16:14 <lkcl_> case (64bit): comb += o.eq(a + b)

16:15 <lkcl_> case 32bit: for i in range(2) ... word_select()...

16:15 <lkcl_> etc. etc

16:15 <lkcl_> yes?

16:15 <lkcl_> would that sound like a "reasonable" way to do a SIMD ALU for an "add" operation?

16:16 <lkcl_> case statements with a 1x 64-bit, 2x loop on 32-bit, 4x loop on 16-bit, 8x loop on 8-bit

16:17 <lkcl_> most people would say "yes, sounds perfectly reasonable and normal".

16:18 <lkcl_> it's only 1 line of code "comb += o.eq(a + b)" so what's the problem, right?

16:19 <lkcl_> now imagine that you need to implement e.g. POWER9 SIMD

16:19 <lkcl_> this involves carry-in and carry-out bits

16:20 <lkcl_> now imagine what happens when it comes to implementing DIV in SIMD.

16:20 <lkcl_> for every single line of 64-bit scalar code, and i do mean absolutely every single line

16:20 <lkcl_> you need that switch statement

16:21 <lkcl_> switch (SIMD_width): case statement for 1x64, case statement for 2x32, case statement for 4x16, case statement for 8x8

16:21 <lkcl_> at which point any hope of being able to understand and maintain the resultant code is a lost cause

16:22 <lkcl_> what if there was a class, named PartitionedSignal, that could, at runtime, be given a "context" (the current SIMD_width)?

16:23 <lkcl_> what if that class had exactly the same properties and behaviour as Signal?

16:23 <lkcl_> that would be really neat, wouldn't it?

16:24 <lkcl_> because instead of the god-awful pervasive switch statement invading every single line of code, it's hidden from sight, set up once and only once when the PartitionedSignal is created

16:24 <lkcl_> "all" (i say all) you'd need to do is: implement PartitionedSignal.__add__, PartitionedSignal.__eq__, PartitionedSignal.__xor__ and so on

16:25 <lkcl_> the problem then comes when you try to do "with m.If(SomePartitionedSignal)" because what goes on in the m.If *critically depends* on that SIMD_width "context"

16:26 <lkcl_> likewise, Arrays of PartitionedSignals - again: what happens is critically dependent on the SIMD_width context

16:31 <lkcl_> DaKnig: does that give some context? in summary, we want *transparent* SIMD that looks exactly like it's a scalar operation, however "behind the scenes" a context (SIMD_width) which is set at runtime (in HDL) can "break" that signal down into multiple smaller (SIMD) fragments.

17:38 <Lofty> If I try to build a belt-machine based RISC processor, how many patents do you think I'd infringe upon simultaneously?

17:39 <lkcl_> Lofty: absolutely none if it's for the purposes of developing your own experiment

17:40 <Lofty> Little bit easier said than done.

17:40 <lkcl_> enshrined into patent law is the right to implement QTY1 of any patent, for the purposes of developing your *own* inventions and further patents

17:40 <whitequark> pepijndevos: a PR is very rarely, almost never, a solution for language changes

17:40 <whitequark> an RFC issue would be most welcome though

17:41 <lkcl_> if however you intend to commercialise what you're doing, _then_ you may run into problems - e.g. most of the Mill patents :)

17:41 <Lofty> I mean, PRs might not be a solution, but it gives thinking material IMO

17:42 <whitequark> DaKnig: inheriting from Signal isn't supported for a few reasons; the main reason is that it would interfere with the stability guarantees I'd like to provide, by forcibly sharing a namespace between nmigen and downstream code

17:42 <whitequark> this is also a problem with UserValue, which is why we're replacing that with ValueCastable

18:09 <lkcl_> whitequark: so how would you think might be the best way to implement a SIMD-style Signal?

18:11 <lkcl_> bearing in mind that we need to do SIMD IEEE754 FP, and a full SIMD Integer processor including multiply, divide, shift, everything.

18:11 <Lofty> lkcl_: why not just a Signal?

18:13 <lkcl_> Lofty: see above, about the massive proliferation of switch/case statements

18:13 <lkcl_> on liiiterally eeeevery single (scalar) line of code (comb += o.eq(a + b))

18:14 <whitequark> lkcl_: i still have no idea what the semantics of it would be

18:14 <lkcl_> that is now replaced with an unbelievably tedious switch (SIMD_width) case 64: do this, case 32: do 2x 32-bit case 16: .. etc.

18:14 <Lofty> Well, nMigen has functions for this

18:14 <whitequark> i've read your code, which doesn't concern If/Switch

18:14 <whitequark> that code is relatively clear

18:15 <Lofty> Nothing obligates you to put all your logic into elaborate()

18:15 <whitequark> what you want your "SIMD signal" to do when used in If/Switch, I don't know

18:15 <lkcl_> Lofty: it becomes unreadable and unmaintainable

18:15 * Lofty blinks

18:15 <lkcl_> whitequark: to do exactly the same thing as if it were "just a plain Signal(64)"

18:15 <Lofty> I *really* must be missing something.

18:16 <lkcl_> we did successfully create something called "PMux()" which is a Dynamic SIMD capable version of Mux()

18:16 <Lofty> To me this seems like a problem with a lot of inherent complexity, but also a lot of inherent redundancy.

18:16 <Lofty> Which sounds like something you'd use a function for.

18:16 <lkcl_> and wherever we had any code involving m.If() we replaced it with PMux

18:17 <lkcl_> what i would _like_ to have happen is "with simdcontext.If(x):" to behave *exactly* like "with m.If()"

18:17 <lkcl_> and behind the scenes, "with simdcontext.If()" hides the switch (SIMD_width)

18:18 <Lofty> But doesn't that still have problems when you need to do width-specific things?

18:18 <lkcl_> Lofty: i considered that approach. to have a class / function which creates, with a parameter, the different versions

18:19 <lkcl_> Lofty: bizarrely... for the majority of cases...no! if it's designed correctly.

18:19 <lkcl_> where it does become a problem is things like shift and shift_rot

18:20 <Lofty> Well, if you can do four 32-bit ops or two 64-bit ops (for example), doesn't that mean you have to do something width-specific or else duplicate hardware?

18:20 <lkcl_> ok, so we've actually implemented much of this already :)

18:20 <lkcl_> and yes there's an overhead

18:21 <lkcl_> however it is NOTHING compared to trying to do it as a switch statement

18:21 <lkcl_> one lot of 64 bit adds, another lot of 2x 32-bit adds, etc.

18:22 <lkcl_> this is some notes on how "add" works: https://libre-soc.org/3d_gpu/architecture/dynamic_simd/add/

18:22 <lkcl_> let's say you have 32-bit add and it's broken into 4 chunks

18:23 <lkcl_> turns out that if you insert 3 bits - extending the 32-bit into 32+3 - you can set bits...err... 8, 17, 26 as "0" and voila, you have divided it into 4 separate 8-bit adds

18:23 <lkcl_> if you want 2x 16, you set bit 17 to 1

18:23 <lkcl_> it becomes a "carry"

18:24 <lkcl_> and the carry from bit 16 "rolls over" from bit 17 into bit 18!

18:24 <lkcl_> ta-daaa

18:24 <lkcl_> jacob came up with that one :)
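
The padding trick described above can be modelled in plain Python. This is only a sketch under stated assumptions, not the libre-soc implementation: it assumes the three inserted bits sit at positions 8, 17 and 26 of the 35-bit operands, that the second operand always carries 0 in those positions, and that in the first operand a 0 isolates the neighbouring lanes while a 1 lets the carry pass. Under that assumption, 2x16 here means setting the bits at 8 and 26 and clearing the one at 17. The names `expand`, `compress` and `partitioned_add` are hypothetical.

```python
CHUNK = 8                  # narrowest lane width in bits
CHUNKS = 4                 # 4 x 8 = 32-bit operands
MASK = (1 << CHUNK) - 1

def expand(value, gap_bits):
    """Spread a 32-bit value over 35 bits, placing gap_bits[i] just above chunk i."""
    out = 0
    for i in range(CHUNKS):
        chunk = (value >> (i * CHUNK)) & MASK
        out |= chunk << (i * (CHUNK + 1))
        if i < CHUNKS - 1:
            # Inserted bit lands at position 8, 17 or 26.
            out |= gap_bits[i] << (i * (CHUNK + 1) + CHUNK)
    return out

def compress(value):
    """Drop the inserted bits again, returning a 32-bit result."""
    out = 0
    for i in range(CHUNKS):
        out |= ((value >> (i * (CHUNK + 1))) & MASK) << (i * CHUNK)
    return out

def partitioned_add(a, b, join):
    """Add with one ordinary wide adder; join[i] == 1 joins chunk i to chunk i+1."""
    wide_sum = expand(a, join) + expand(b, [0, 0, 0])
    return compress(wide_sum)

a, b = 0xFFFFFFFF, 0x00000001
print(hex(partitioned_add(a, b, [0, 0, 0])))  # 4 x 8-bit lanes:  0xffffff00
print(hex(partitioned_add(a, b, [1, 0, 1])))  # 2 x 16-bit lanes: 0xffff0000
print(hex(partitioned_add(a, b, [1, 1, 1])))  # 1 x 32-bit lane:  0x0
```

A joined position holds 1 + 0, so an incoming carry turns it into 0 and carries onward; a split position holds 0 + 0, so an incoming carry is absorbed there and discarded by `compress`.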

18:24 <Lofty> Well, Yosys applies that transform in reverse.

18:25 <lkcl_> turns out you can do the same thing with Multiply, by implementing a Wallace Tree multiplier

18:25 <Lofty> While it's a neat trick, to me it seems too slow...

18:25 <lkcl_> Lofty: cool. well, the nice thing is: we can use a straight 35-bit adder - nothing special

18:26 <lkcl_> ah that's the beauty of it. because we're using straight 35-bit add, it's not "slow" at all

18:26 <lkcl_> not in simulation and not in FPGA.

18:26 <Lofty> And the 128-bit elements or whatever?

18:26 <lkcl_> ok Multiply yes, that's a bit sloooow because in 64-bit it's 8x 16-bit multiplies followed by a 20-long chain of ADD operations

18:26 <Lofty> Does that not require a > 128-bit adder to implement?

18:27 <lkcl_> if doing SIMD 128?

18:27 <lkcl_> then that would be... if you wanted to break into 8-bit... 128/8 equals 16 "partitions"

18:27 <Lofty> So, that's, what, 128+15 bits?

18:27 <lkcl_> so you'd add one partition bit in every 8 bits, 128+15... yeah

18:28 <Lofty> I'm fairly sure that would become a critical path pretty quickly

18:28 <lkcl_> and the FPGA tools (and yosys) would allocate suitable hardware

18:28 <Lofty> Yosys performs that transform in reverse where it can prove it to break up adders.

18:28 <lkcl_> we're not doing 128-bit SIMD (yet). limiting it to 64-bit

18:28 <lkcl_> nice

18:29 <lkcl_> so it turns out that you can do a similar partitioning trick for greater-than, less-than, eq

18:29 <lkcl_> and, xor and or are obviously trivial

18:29 <lkcl_> shift took *3* weeks to work out :)

18:29 <Lofty> Well, GT/LT/EQ can all reduce to subtraction followed by checking bits, so yeah.
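
As a rough illustration of "subtraction followed by checking bits", here is a plain-Python model for unsigned operands. It is a sketch, not anything from nMigen or libre-soc; the function name `compare_unsigned` is invented for the example.

```python
def compare_unsigned(a, b, width):
    """Return (lt, eq, gt) for unsigned `width`-bit a and b via one subtraction."""
    # Subtract in width+1 bits so that the top bit acts as the borrow flag.
    diff = (a - b) & ((1 << (width + 1)) - 1)
    borrow = diff >> width          # 1 exactly when a < b
    is_lt = borrow == 1
    is_eq = diff == 0               # all result bits clear
    is_gt = not is_lt and not is_eq
    return is_lt, is_eq, is_gt

print(compare_unsigned(3, 7, 4))   # (True, False, False)
print(compare_unsigned(7, 7, 4))   # (False, True, False)
print(compare_unsigned(9, 2, 4))   # (False, False, True)
```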

18:30 <lkcl_> we did it as bit-wise compares, using "count leading zeros", on the basis that lt is much more efficiently implemented that way

18:30 <lkcl_> so we allow the "count leading" to simply cascade through the partitions if they're set

18:30 <lkcl_> or not

18:31 <Lofty> Well, to me using no extra resources seems like a pretty good deal, but what do I know about CPU architecture.

18:32 <lkcl_> it seems that way until you think through the answer, "how many gates would it be if i did a separate 8x8 pipeline, a separate 4x16 pipeline, a separate 2x32 pipeline and a separate 1x64 pipeline"

18:32 <lkcl_> the answer to that is so massive (like... 4x the number of gates) that at that point it's a no-brainer

18:32 <lkcl_> 3-4x

18:33 <Lofty> But...you don't need that much logic at all.

18:33 <lkcl_> turns out that jacob's Partitioned Multiplier is a 50% overhead compared to a "straight" (non-partitioned multiplier)

18:33 <Lofty> You certainly don't need N pipelines if you can partition subtractions.

18:33 <lkcl_> remember: we're doing IEEE754 FP as well

18:33 <Lofty> For comparison at least

18:34 <lkcl_> SIMD IEEE754 FP - including sin, cos, atan2, rsqrt - the works

18:36 <lkcl_> at which point, the last thing we need is embedded SIMD-style switch statements, or wrapper classes creating separate FP pipelines for each!

18:36 <lkcl_> from a commercial perspective the product would be a failure right off the bat

18:36 <lkcl_> due to far too many gates

18:37 <Lofty> Clearly I'm not understanding the problems at hand here, and I'll take your word for it.

18:38 <lkcl_> it's easy to estimate "how many gates would this idea take", and "what would the code look like, would anyone be able to understand it?"

18:38 <lkcl_> :)

18:39 <lkcl_> Lofty, whitequark: thank you for taking the time here. whitequark i especially appreciate that you read the existing code (and understood it)

18:49 <pepijndevos> DaKnig, my problem with arrays is that you can't stick your number class into them easily.

18:49 <pepijndevos> Indexing an Array with a Value gives an ArrayProxy

18:51 <pepijndevos> So if you make an array with MyCustomNumber and then do array[Const(5)].custom_method() it... kinda doesn't really work

19:03 jaseg has quit [Ping timeout: 272 seconds]

19:05 jaseg has joined #nmigen

19:32 _whitelogger has quit [Ping timeout: 240 seconds]

19:33 proteusguy has quit [Ping timeout: 240 seconds]

19:34 _whitelogger_ has joined #nmigen

19:39 sorear_ has joined #nmigen

19:42 proteusguy has joined #nmigen

19:42 sorear has quit [Ping timeout: 240 seconds]

19:42 sorear_ is now known as sorear

20:17 <DaKnig> lkcl_: Im not against such a thing, but doesnt that seem a bit... overspecialized? also can you not do this with a few smart map functions in python? like a function that takes a function, width and applies said function assuming w-wide words in the signal? or something like that. then you just call this in a loop for all your operations...

20:17 <DaKnig> I think
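
The kind of helper DaKnig describes might look like the following in plain Python. This is a software model on integers only, with a hypothetical name `lanewise`; it says nothing about the gate count or about how such a helper would interact with m.If or Array in real HDL.

```python
import operator

def lanewise(op, a, b, lane_width, total_width=64):
    """Apply `op` independently to each `lane_width`-bit lane of a and b."""
    mask = (1 << lane_width) - 1
    out = 0
    for shift in range(0, total_width, lane_width):
        # Extract one lane from each operand, combine, and truncate to the lane.
        lane = op((a >> shift) & mask, (b >> shift) & mask) & mask
        out |= lane << shift
    return out

a, b = 0x00FF00FF, 0x00010001
print(hex(lanewise(operator.add, a, b, 8, 32)))   # 0x0
print(hex(lanewise(operator.add, a, b, 16, 32)))  # 0x1000100
print(hex(lanewise(operator.xor, a, b, 8, 32)))   # 0xfe00fe
```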

20:19 <DaKnig> whitequark: I see. so I think making a wrapper with internal Signal would be the way to go.

20:20 <whitequark> yup

20:20 <whitequark> i should definitely cover this topic in the manual

20:31 <DaKnig> lkcl_: oh I saw your later implementation with the "buffer bits"! that's cool. does it work for div and mul?

20:36 <DaKnig> whitequark: where should I put suggestions for things that are not covered in the manual?

20:36 <DaKnig> I find a lot of those tiny things that are not covered by either the manual or the examples provided

20:37 <DaKnig> example- all the ways to use Case statements (with strings and dont care, numbers, what else? see? I dont know!)

20:37 <DaKnig> but I find much more of those tiny things that I kinda have to piece together or ask here

20:40 <lkcl_> DaKnig: hypothetically, we could write a type of "macro" that effectively uses the same Signals underneath as "storage" for the same computations.

20:40 <lkcl_> so you instantiate 4 different "macro" classes: 1 for 1x64, 1 for 2x32 ... etc. and they all use the exact same Signals

20:41 <lkcl_> then at the *top level*, there's a class that instantiates all those 4 "macros" and it is there - and only there - that you do that "switch(SIMD_width)"

20:42 emeb_mac has joined #nmigen

20:42 <lkcl_> the PartitionedSignal effectively takes that "macro" concept and embeds it *right* inside at the *micro* level.

20:43 <lkcl_> DaKnig: div we haven't implemented yet, mul we have. it was one of the first pieces of NLNet-sponsored work. https://git.libre-soc.org/?p=ieee754fpu.git;a=blob;f=src/ieee754/part_mul_add/multiply.py;hb=HEAD

20:45 <lkcl_> absolutely superb work by jacob. he implemented a Wallace Tree multiplier. it's like "an efficient version of long multiplication"

20:45 <whitequark> DaKnig: if it's a part of the language proper, you don't need to put them anywhere, i'll just get around to them at some point

20:47 <DaKnig> lkcl_: it took me a few minutes to understand how you did the addition, can you please explain it in simple terms? otherwise I might not get it

20:49 <lkcl_> DaKnig: it's outlined here https://libre-soc.org/3d_gpu/architecture/dynamic_simd/add/

20:49 <DaKnig> I mean mul, I got add

20:50 <lkcl_> this is how a wallace tree works

20:50 <lkcl_> https://en.wikipedia.org/wiki/Wallace_tree

20:50 <DaKnig> ah changing the URL with s/add/mul/ gave me the answer

20:51 <lkcl_> to save some gates (and also to be useful on FPGAs), we didn't go all the way down to bit-level

20:51 <lkcl_> that would be insane :)

20:51 <lkcl_> so we went down as far as 8-bit, just like you do with "Long Multiplication"

20:51 <DaKnig> insanely efficient

20:51 <DaKnig> you mean

20:51 <DaKnig> : )

20:51 <lkcl_> :)

20:51 <DaKnig> you can have a gate-level thing that is useful on FPGAs

20:52 <DaKnig> by having some tool show equivalence to simpler form of the same thing

20:52 <lkcl_> so you have 8x8 16-bit multiply partial results (which is good for FPGA because you can use the on-board DSPs)

20:52 <DaKnig> (for example converting the fancy adders back to the operator + for FPGA to use the builtin adder)

20:52 <lkcl_> and then because it is "adds" from that point onwards, we simply use exactly the same "partition" trick.
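
The "long multiplication down to 8-bit" step can be sketched in plain Python. This models only the scalar 64-bit case and sums the partial products with ordinary addition; the Wallace tree reduction and the partitioning of the adds are not modelled. `split_bytes` and `long_multiply` are names chosen for the example.

```python
def split_bytes(value, count=8):
    """Split a value into `count` 8-bit chunks, least significant first."""
    return [(value >> (8 * i)) & 0xFF for i in range(count)]

def long_multiply(a, b):
    """Multiply two 64-bit values from 8x8 = 64 partial products of 16 bits each."""
    total = 0
    partial_count = 0
    for i, a_byte in enumerate(split_bytes(a)):
        for j, b_byte in enumerate(split_bytes(b)):
            partial = a_byte * b_byte          # an 8-bit x 8-bit product
            assert partial < (1 << 16)         # always fits in 16 bits
            total += partial << (8 * (i + j))  # place at its byte offset
            partial_count += 1
    return total, partial_count

a, b = 0xFEDCBA9876543210, 0x0123456789ABCDEF
product, count = long_multiply(a, b)
print(product == a * b, count)  # True 64
```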

20:53 <lkcl_> well, jacob did a huge number of unit tests on it.

21:19 <DaKnig> lkcl_: very nice solution. still do you not feel like having this SIMD vector is ... too specific?

21:20 <DaKnig> I mean, sigints are awesome, but I wouldnt want them to be part of the python lang.

21:48 <lkcl_> DaKnig: it comes from consideration of a large number of factors. and we're doing a CPU / VPU / GPU.

21:48 <lkcl_> SIMD - vectors - is essential.

21:50 <lkcl_> and if you want to keep the gate count down, partitioning is a *lot* more efficient than having multiple separate pipelines.

21:53 <DaKnig> who even suggested separate pipelines lol

21:54 Asu has quit [Quit: Konversation terminated!]

22:10 <lkcl_> well ok you know what i mean - separate hardware for each type of SIMD operation :) if it's in the same pipeline, it's still separate