ChanServ changed the topic of #nmigen to: nMigen hardware description language · code at · logs at · IRC meetings each Monday at 1800 UTC · next meeting August 17th
esden has joined #nmigen
Degi has quit [Ping timeout: 246 seconds]
Degi has joined #nmigen
lkcl has joined #nmigen
emeb has quit [Quit: Leaving.]
jaseg has quit [Ping timeout: 244 seconds]
jaseg has joined #nmigen
emeb_mac has quit [Ping timeout: 240 seconds]
emeb_mac has joined #nmigen
electronic_eel has quit [Ping timeout: 260 seconds]
electronic_eel has joined #nmigen
emeb_mac has quit [Ping timeout: 264 seconds]
emeb_mac has joined #nmigen
emeb_mac has quit [Ping timeout: 240 seconds]
emeb_mac has joined #nmigen
emeb_mac has quit [Ping timeout: 240 seconds]
PyroPeter_ has joined #nmigen
emeb_mac has joined #nmigen
PyroPeter has quit [Ping timeout: 260 seconds]
PyroPeter_ is now known as PyroPeter
_whitelogger has joined #nmigen
proteusguy has quit [Ping timeout: 264 seconds]
<d1b2> <edbordin> whitequark, cr1901_modern, not sounding like there is a lot of enthusiasm for that sby PR. do you think we can convince them? main issue seems to be the current code is "battle tested" and there aren't comprehensive regression tests for this core code. I can understand them not wanting to ruin it for customers.
<d1b2> <edbordin> I notice that at some point in the past year, there was a workaround inserted on windows to call sleep(0.1) instead of select so the loop wouldn't spin too hard
<d1b2> <edbordin> I could compromise and PR the console process group on windows so that there is less chance of runaway processes at least. In practice I haven't hit that issue because the engine tends to crash anyway when the stdin pipe closes
<d1b2> <edbordin> a more chaotic option would be to ship this patched version in open-tool-forge and crowdsource regression testing 😛
proteusguy has joined #nmigen
<cr1901_modern> At this point, I want to wait and see what upstream wants.
cr1901_modern has quit [Quit: Leaving.]
cr1901_modern has joined #nmigen
TiltMeSenpai has quit [*.net *.split]
FL4SHK has quit [*.net *.split]
cr1901_modern has quit [*.net *.split]
MadHacker has quit [*.net *.split]
emeb_mac has quit [Quit: Leaving.]
cr1901_modern has joined #nmigen
TiltMeSenpai has joined #nmigen
FL4SHK has joined #nmigen
MadHacker has joined #nmigen
hitomi2507 has joined #nmigen
cr1901_modern has quit [Quit: Leaving.]
cr1901_modern has joined #nmigen
cr1901_modern has quit [Ping timeout: 240 seconds]
PyroPeter has quit [*.net *.split]
lkcl has quit [*.net *.split]
Degi has quit [*.net *.split]
esden has quit [*.net *.split]
esden has joined #nmigen
lkcl has joined #nmigen
PyroPeter has joined #nmigen
Degi has joined #nmigen
cr1901_modern has joined #nmigen
cr1901_modern has quit [Client Quit]
Asu has joined #nmigen
Asu is now known as Asuu
Asuu is now known as Asu
cr1901_modern has joined #nmigen
cr1901_modern has quit [Client Quit]
cr1901_modern has joined #nmigen
cr1901_modern has quit [Client Quit]
sorear has quit [Read error: Connection reset by peer]
sorear has joined #nmigen
cr1901_modern has joined #nmigen
<Degi> Can I somehow get how full a FIFO is?
<DaKnig> switches dont have to cover all cases right?
<DaKnig> (useful for me when doing some synchronous stuff)
<vup> Degi: Lofty: note r_level and w_level for {Buffered,}AsyncFifo is currently broken in some ways
<Degi> Oh I want a sync one anyways
<Degi> Cool, that's new, nice
<vup> well sync fifos have had level since they were added
<Degi> Is w_level and r_level for the respective domains?
<vup> yes
<Degi> Nice
<DaKnig> whitequark: I remember you saying that inheriting from Signal is not supported, can you explain why?
<DaKnig> I thought about making an Integer class, that knows the max and min value, but otherwise behaves like a Signal
<DaKnig> it could reduce some logic in some cases - if you have an unsigned(4) number and it only goes up to 10, checking `num==9` is the same as checking `num[0] & num[3]` (of course for larger examples the benefit is bigger, I think)
<DaKnig> wouldnt inheriting from Signal here be beneficial?
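A quick plain-Python check of DaKnig's example above. It assumes "only goes up to 10" means the value stays within 0..10 inclusive. `eq9_full` and `eq9_reduced` are names made up for this sketch and are not part of nMigen.

```python
def eq9_full(num):
    """Full 4-bit comparison: every bit must match 0b1001."""
    return num == 0b1001


def eq9_reduced(num):
    """Reduced comparison: only bits 0 and 3 are inspected."""
    return bool(num & 1) and bool((num >> 3) & 1)


# Inside the assumed range 0..10 the two checks always agree.
agree_in_range = all(eq9_full(n) == eq9_reduced(n) for n in range(0, 11))

# Over the whole 4-bit range they do not, so the shortcut depends on the bound.
mismatches = [n for n in range(16) if eq9_full(n) != eq9_reduced(n)]
print(agree_in_range, mismatches)
```

The mismatching values are exactly the ones above the bound that also have bits 0 and 3 set, which is why the range information has to come from somewhere (here, the hypothetical Integer class).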
lkcl_ has joined #nmigen
Asu has quit [Remote host closed the connection]
Asu has joined #nmigen
lkcl has quit [Ping timeout: 272 seconds]
Chips4Makers has joined #nmigen
Asuu has joined #nmigen
Asu has quit [Ping timeout: 240 seconds]
zignig has joined #nmigen
* zignig has been erratic for some time, working boneless-v3 shell coming soon.
<pepijndevos> DaKnig, my understanding of it is that it's just an AST node so the backends need to have specific understanding of each class. So while it might not immediately break, it's not part of the "public API" and subject to breakage.
<pepijndevos> SubClassing UserValue is supported for that particular case of course...
<pepijndevos> What I'm doing with my fixed point number class is that it wraps an internal signal
<pepijndevos> This is actually a pretty smooth experience, except when it comes to arrays.
<pepijndevos> IMHO if you're just adding functionality to Signal that has general usefulness and does not conflict with nMigen philosophy, maybe a PR is the best solution? :)))
<DaKnig> it doesnt go well with the nmigen philosophy though
<DaKnig> as nmigen, according to wq, should be typeless, having just the "shape" thing
<DaKnig> so no integers and such
<DaKnig> but I find integers to be quite useful
<DaKnig> what's your problem with arrays?
emeb has joined #nmigen
<DaKnig> pepijndevos
<lkcl_> whitequark: about DaKnig's question (inheriting from Signal), i (sort-of) remember the discussions we had 2 years ago. PartitionedSignal is the result of (in 4-6 months time) needing to implement SIMD operations behind a nmigen-identical API
<lkcl_> including Arrays of SIMD Signals
<lkcl_> i'd be interested to have a (short) conversation about it some time. it's a roadmap item for us though - not urgent
* lkcl_ has a conversion of MiSOC wishbone DownConverter up and running
<lkcl_> unit tests work, litex sim of libresoc works, FPGA building now...
<lkcl_> similar to this - i think
<lkcl_> although... i'm having difficulty understanding this code:
Asuu has quit [Ping timeout: 258 seconds]
Asu has joined #nmigen
hitomi2507 has quit [Quit: Nettalk6 -]
<DaKnig> lkcl_: can you tell me more about that?
<DaKnig> what do you exactly mean
<lkcl_> DaKnig: the SIMD Signal class? or the wishbone downconverter?
<DaKnig> SIMD Signal class
<lkcl_> ok, so what would be like the "normal" way to do a SIMD ALU? assume in nmigen
<DaKnig> you can use `map` for this kinda stuff (plus Array) no?
<DaKnig> or for loop with word_select
<lkcl_> and there would be a construct along the lines of:
<lkcl_> switch (SIMD_width):
<lkcl_> case (64bit): comb += o.eq(a + b)
<lkcl_> case 32bit: for i in range(2) ... word_select()...
<lkcl_> etc. etc
<lkcl_> yes?
<lkcl_> would that sound like a "reasonable" way to do a SIMD ALU for an "add" operation?
<lkcl_> case statements with a 1x 64-bit, 2x loop on 32-bit, 4x loop on 16-bit, 8x loop on 8-bit
<lkcl_> most people would say "yes, sounds perfectly reasonable and normal".
<lkcl_> it's only 1 line of code "comb += o.eq(a + b)" so what's the problem, right?
<lkcl_> now imagine that you need to implement e.g. POWER9 SIMD
<lkcl_> this involves carry-in and carry-out bits
<lkcl_> now imagine what happens when it comes to implementing DIV in SIMD.
<lkcl_> for every single line of 64-bit scalar code, and i do mean absolutely every single line
<lkcl_> you need that switch statement
<lkcl_> switch (SIMD_width): case statement for 1x64, case statement for 2x32, case statement for 4x16, case statement for 8x8
<lkcl_> at which point any hope of being able to understand and maintain the resultant code is a foregone lost cause
<lkcl_> what if there was a class, named PartitionedSignal, that could, at runtime, be given a "context" (the current SIMD_width)?
<lkcl_> what if that class had exactly the same properties and behaviour as Signal?
<lkcl_> that would be really neat, wouldn't it?
<lkcl_> because instead of the god-awful pervasive switch statement invading every single line of code, it's hidden from sight, set up once and only once when the PartitionedSignal is created
<lkcl_> "all" (i say all) you'd need to do is: implement PartitionedSignal.__add__, PartitionedSignal.__eq__, PartitionedSignal.__xor__ and so on
<lkcl_> the problem then comes when you try to do with m.If(SomePartitionedSignal) because what goes on in the m.If *critically depends* on that SIMD_width "context"
<lkcl_> likewise, Arrays of PartitionedSignals - again: what happens is critically dependent on the SIMD_width context
<lkcl_> DaKnig: does that give some context? in summary, we want *transparent* SIMD that looks exactly like it's a scalar operation, however "behind the scenes" a context (SIMD_width) which is set at runtime (in HDL) can "break" that signal down into multiple smaller (SIMD) fragments.
<Lofty> If I try to build a belt-machine based RISC processor, how many patents do you think I'd infringe upon simultaneously?
<lkcl_> Lofty: absolutely none if it's for the purposes of developing your own experiment
<Lofty> Little bit easier said than done.
<lkcl_> enshrined into patent law is the right to implement QTY1 of any patent, for the purposes of developing your *own* inventions and further patents
<whitequark> pepijndevos: a PR is very rarely, almost never, a solution for language changes
<whitequark> an RFC issue would be most welcome though
<lkcl_> if however you intend to commercialise what you're doing, _then_ you may run into problems - e.g. most of the Mill patents :)
<Lofty> I mean, PRs might not be a solution, but it gives thinking material IMO
<whitequark> DaKnig: inheriting from Signal isn't supported for a few reasons; the main reason is that it would interfere with the stability guarantees I'd like to provide, by forcibly sharing a namespace between nmigen and downstream code
<whitequark> this is also a problem with UserValue, which is why we're replacing that with ValueCastable
<lkcl_> whitequark: so how would you think might be the best way to implement a SIMD-style Signal?
<lkcl_> bearing in mind that we need to do SIMD IEEE754 FP, and a full SIMD Integer processor including multiply, divide, shift, everything.
<Lofty> lkcl_: why not just a Signal?
<lkcl_> Lofty: see above, about the massive proliferation of switch/case statements
<lkcl_> on liiiterally eeeevery single (scalar) line of code (comb += o.eq(a + b))
<whitequark> lkcl_: i still have no idea what the semantics of it would be
<lkcl_> that is now replaced with an unbelievably tedious switch (SIMD_width) case 64: do this, case 32: do 2x 32-bit case 16: .. etc.
<Lofty> Well, nMigen has functions for this
<whitequark> i've read your code, which doesn't concern If/Switch
<whitequark> that code is relatively clear
<Lofty> Nothing obligates you to put all your logic into elaborate()
<whitequark> what you want your "SIMD signal" to do when used in If/Switch, I don't know
<lkcl_> Lofty: it becomes unreadable and unmaintainable
* Lofty blinks
<lkcl_> whitequark: to do exactly the same thing as if it were "just a plain Signal(64)"
<Lofty> I *really* must be missing something.
<lkcl_> we did successfully create something called "PMux()" which is a Dynamic SIMD capable version of Mux()
<Lofty> To me this seems like a problem with a lot of inherent complexity, but also a lot of inherent redundancy.
<Lofty> Which sounds like something you'd use a function for.
<lkcl_> and wherever we had any code involving m.If() we replaced it with PMux
<lkcl_> what i would _like_ to have happen is "with simdcontext.If(x):" to behave *exactly* like "with m.If()"
<lkcl_> and behind the scenes, "with simdcontext.If()" hides the switch (SIMD_width)
<Lofty> But doesn't that still have problems when you need to do width-specific things?
<lkcl_> Lofty: i considered that approach. to have a class / function which creates, with a parameter, the different versions
<lkcl_> Lofty: bizarrely... for the majority of! if it's designed correctly.
<lkcl_> where it does become a problem is things like shift and shift_rot
<Lofty> Well, if you can do four 32-bit ops or two 64-bit ops (for example), doesn't that mean you have to do something width-specific or else duplicate hardware?
<lkcl_> ok, so we've actually implemented much of this already :)
<lkcl_> and yes there's an overhead
<lkcl_> however it is NOTHING compared to trying to do it as a switch statement
<lkcl_> one lot of 64 bit adds, another lot of 2x 32-bit adds, etc.
<lkcl_> this is some notes on how "add" works:
<lkcl_> let's say you have 32-bit add and it's broken into 4 chunks
<lkcl_> turns out that if you insert 3 bits - extending the 32-bit into 32+3 - you can set bits...err... 8, 17, 26 as "0" and voila, you have divided it into 4 separate 8-bit adds
<lkcl_> if you want 2x 16, you set bit 17 to 1
<lkcl_> it becomes a "carry"
<lkcl_> and the carry from bit 16 "rolls over" from bit 17 into bit 18!
<lkcl_> ta-daaa
<lkcl_> jacob came up with that one :)
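The partition-bit trick described above can be modelled with plain Python integers. This is only a sketch under stated assumptions, not the libre-soc implementation: `expand`, `contract`, `partitioned_add` and the `join` tuple are hypothetical names, and the convention assumed here is that the gap bit is inserted into operand `a` only (operand `b` always gets 0 there). A gap bit of 0 absorbs the carry, so the chunks on either side stay separate. A gap bit of 1 lets the carry roll through into the next chunk.

```python
CHUNK = 8                       # data bits per partition
NCHUNKS = 4                     # 4 x 8 = 32 data bits
STRIDE = CHUNK + 1              # a chunk plus the gap bit that follows it
EXPANDED_BITS = NCHUNKS * CHUNK + (NCHUNKS - 1)   # 32 + 3 = 35


def expand(value, gap_bits):
    """Spread the 8-bit chunks apart and place gap bits at positions 8, 17, 26."""
    out = 0
    for i in range(NCHUNKS):
        out |= ((value >> (i * CHUNK)) & 0xFF) << (i * STRIDE)
    for i, bit in enumerate(gap_bits):
        out |= (bit & 1) << (i * STRIDE + CHUNK)
    return out


def contract(value):
    """Drop the gap bits again, keeping only the 8-bit data chunks."""
    out = 0
    for i in range(NCHUNKS):
        out |= ((value >> (i * STRIDE)) & 0xFF) << (i * CHUNK)
    return out


def partitioned_add(a, b, join):
    """One ordinary 35-bit add; join[i] == 1 merges chunk i with chunk i + 1."""
    total = expand(a, join) + expand(b, (0, 0, 0))
    return contract(total & ((1 << EXPANDED_BITS) - 1))


a, b = 0x01FF80FF, 0x00018001
as_4x8 = partitioned_add(a, b, (0, 0, 0))    # four independent 8-bit adds
as_2x16 = partitioned_add(a, b, (1, 0, 1))   # two independent 16-bit adds
as_1x32 = partitioned_add(a, b, (1, 1, 1))   # one 32-bit add
print(hex(as_4x8), hex(as_2x16), hex(as_1x32))
```

Under this convention, two 16-bit lanes correspond to gap bits 8 and 26 set and gap bit 17 clear. The adder itself never changes; only the three gap bits do.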
<Lofty> Well, Yosys applies that transform in reverse.
<lkcl_> turns out you can do the same thing with Multiply, by implementing a Wallace Tree multiplier
<Lofty> While it's a neat trick, to me it seems too slow...
<lkcl_> Lofty: cool. well, the nice thing is: we can use a straight 35-bit adder - nothing special
<lkcl_> ah that's the beauty of it. because we're using straight 35-bit add, it's not "slow" at all
<lkcl_> not in simulation and not in FPGA.
<Lofty> And the 128-bit elements or whatever?
<lkcl_> ok Multiply yes, that's a bit sloooow because in 64-bit it's 8x 16-bit multiplies followed by a 20-long chain of ADD operations
<Lofty> Does that not require a > 128-bit adder to implement?
<lkcl_> if doing SIMD 128?
<lkcl_> then that would be... if you wanted to break into 8-bit... 128/8 equals 16 "partitions"
<Lofty> So, that's, what, 128+15 bits?
<lkcl_> so you'd add one partition bit in every 8 bits, 128+15... yeah
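The widths quoted above follow from one gap bit between each pair of neighbouring lanes. A tiny sketch of that arithmetic, with `expanded_width` as a made-up helper name:

```python
def expanded_width(total_bits, lane_bits=8):
    """Adder width once one gap bit is inserted between neighbouring lanes."""
    lanes = total_bits // lane_bits
    return total_bits + (lanes - 1)


widths = {bits: expanded_width(bits) for bits in (32, 64, 128)}
print(widths)
```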
<Lofty> I'm fairly sure that would become a critical path pretty quickly
<lkcl_> and the FPGA tools (and yosys) would allocate suitable hardware
<Lofty> Yosys performs that transform in reverse where it can prove it to break up adders.
<lkcl_> we're not doing 128-bit SIMD (yet). limiting it to 64-bit
<lkcl_> nice
<lkcl_> so it turns out that you can do a similar partitioning trick for greater-than, less-than, eq
<lkcl_> and, xor and or are obviously trivial
<lkcl_> shift took *3* weeks to work out :)
<Lofty> Well, GT/LT/EQ can all reduce to subtraction followed by checking bits, so yeah.
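A small plain-Python model of Lofty's point, assuming unsigned operands: subtract in one extra bit, then read the result. `compare_unsigned` is a hypothetical name for this sketch, not an nMigen function.

```python
def compare_unsigned(a, b, width):
    """Return (lt, eq, gt) for unsigned a, b using one (width + 1)-bit subtraction."""
    diff = (a - b) & ((1 << (width + 1)) - 1)   # wrap to width + 1 bits
    lt = bool((diff >> width) & 1)              # top bit set means a borrow: a < b
    eq = diff == 0                              # no bits set means a == b
    gt = not lt and not eq
    return lt, eq, gt


print(compare_unsigned(3, 200, 8), compare_unsigned(7, 7, 8), compare_unsigned(200, 3, 8))
```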
<lkcl_> we did it as bit-wise compares, using "count leading zeros", on the basis that lt is much more efficiently implemented that way
<lkcl_> so we allow the "count leading" to simply cascade through the partitions if they're set
<lkcl_> or not
<Lofty> Well, to me using no extra resources seems like a pretty good deal, but what do I know about CPU architecture.
<lkcl_> it seems that way until you think through the answer, "how many gates would it be if i did a separate 8x8 pipeline, a separate 4x16 pipeline, a separate 2x32 pipeline and a separate 1x64 pipeline"
<lkcl_> the answer to that is so massive (like... 4x the number of gates) that at that point it's a no-brainer
<lkcl_> 3-4x
<Lofty> don't need that much logic at all.
<lkcl_> turns out that jacob's Partitioned Multiplier is a 50% overhead compared to a "straight" (non-partitioned multiplier)
<Lofty> You certainly don't need N pipelines if you can partition subtractions.
<lkcl_> remember: we're doing IEEE754 FP as well
<Lofty> For comparison at least
<lkcl_> SIMD IEEE754 FP - including sin, cos, atan2, rsqrt - the works
<lkcl_> at which point, the last thing we need is embedded SIMD-style switch statements, or wrapper classes creating separate FP pipelines for each!
<lkcl_> from a commercial perspective the product would be a failure right off the bat
<lkcl_> due to far too many gates
<Lofty> Clearly I'm not understanding the problems at hand here, and I'll take your word for it.
<lkcl_> it's easy to estimate "how many gates would this idea take", and "what would the code look like, would anyone be able to understand it?"
<lkcl_> :)
<lkcl_> Lofty, whitequark: thank you for taking the time here. whitequark i especially appreciate that you read the existing code (and understood it)
<pepijndevos> DaKnig, my problem with arrays is that you can't stick your number class into them easily.
<pepijndevos> Indexing an Array with a Value gives an ArrayProxy
<pepijndevos> So if you make an array with MyCustomNumber and then do array[Const(5)].custom_method() it... kinda doesn't really work
jaseg has quit [Ping timeout: 272 seconds]
jaseg has joined #nmigen
_whitelogger has quit [Ping timeout: 240 seconds]
proteusguy has quit [Ping timeout: 240 seconds]
_whitelogger_ has joined #nmigen
sorear_ has joined #nmigen
proteusguy has joined #nmigen
sorear has quit [Ping timeout: 240 seconds]
sorear_ is now known as sorear
<DaKnig> lkcl_: Im not against such a thing, but doesnt that seem a bit... overspecialized? also can you not do this with a few smart map functions in python? like a function that takes a function, width and applies said function assuming w-wide words in the signal? or something like that. then you just call this in a loop for all your operations...
<DaKnig> I think
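A behavioural sketch of the kind of helper DaKnig describes: a function that takes an operation and a lane width and applies the operation lane by lane. It works on plain Python integers rather than nMigen Signals, and `map_lanes` is a name invented for this example.

```python
import operator


def map_lanes(fn, lane_bits, total_bits, a, b):
    """Apply the binary function fn to each lane_bits-wide lane of a and b."""
    assert total_bits % lane_bits == 0, "lanes must tile the word exactly"
    mask = (1 << lane_bits) - 1
    out = 0
    for shift in range(0, total_bits, lane_bits):
        lane = fn((a >> shift) & mask, (b >> shift) & mask) & mask  # wrap per lane
        out |= lane << shift
    return out


a, b = 0x01FF80FF, 0x00018001
for lane_bits in (8, 16, 32):
    print(lane_bits, hex(map_lanes(operator.add, lane_bits, 32, a, b)))
```

In hardware the lane width would still have to be selected at runtime, so a helper like this generates one datapath per width. That duplication is the gate-count concern lkcl_ raises elsewhere in the discussion.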
<DaKnig> whitequark: I see. so I think making a wrapper with internal Signal would be the way to go.
<whitequark> yup
<whitequark> i should definitely cover this topic in the manual
<DaKnig> lkcl_: oh I saw your later implementation with the "buffer bits"! that's cool. does it work for div and mul?
<DaKnig> whitequark: where should I put suggestions for things that are not covered in the manual?
<DaKnig> I find a lot of those tiny things that are not covered by either the manual or the examples provided
<DaKnig> example- all the ways to use Case statements (with strings and dont care, numbers, what else? see? I dont know!)
<DaKnig> but I find much more of those tiny things that I kinda have to piece together or ask here
<lkcl_> DaKnig: hypothetically, we could write a type of "macro" that effectively uses the same Signals underneath as "storage" for the same computations.
<lkcl_> so you instantiate 4 different "macro" classes: 1 for 1x64, 1 for 2x32 ... etc. and they all use the exact same Signals
<lkcl_> then at the *top level*, there's a class that instantiates all those 4 "macros" and it is there - and only there - that you do that "switch(SIMD_width)"
emeb_mac has joined #nmigen
<lkcl_> the PartitionedSignal effectively takes that "macro" concept and embeds it *right* inside at the *micro* level.
<lkcl_> DaKnig: div we haven't implemented yet, mul we have. it was one of the first pieces of NLNet-sponsored work.;a=blob;f=src/ieee754/part_mul_add/;hb=HEAD
<lkcl_> absolutely superb work by jacob. he implemented a Wallace Tree multiplier. it's like "an efficient version of long multiplication"
<whitequark> DaKnig: if it's a part of the language proper, you don't need to put them anywhere, i'll just get around to them at some point
<DaKnig> lkcl_: it took me a few minutes to understand how you did the addition, can you please explain it in simple terms? otherwise I might not get it
<DaKnig> I mean mul, I got add
<lkcl_> this is how a wallace tree works
<DaKnig> ah changing the URL with s/add/mul/ gave me the answer
<lkcl_> to save some gates (and also to be useful on FPGAs), we didn't go all the way down to bit-level
<lkcl_> that would be insane :)
<lkcl_> so we went down as far as 8-bit, just like you do with "Long Multiplication"
<DaKnig> insanely efficient
<DaKnig> you mean
<DaKnig> : )
<lkcl_> :)
<DaKnig> you can have a gate-level thing that is useful on FPGAs
<DaKnig> by having some tool show equivalence to simpler form of the same thing
<lkcl_> so you have 8x8 16-bit multiply partial results (which is good for FPGA because you can use the on-board DSPs)
<DaKnig> (for example converting the fancy adders back to the operator + for FPGA to use the builtin adder)
<lkcl_> and then because it is "adds" from that point onwards, we simply use exactly the same "partition" trick.
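The "long multiplication down to 8-bit" idea can be sketched in plain Python: split both 64-bit operands into eight 8-bit limbs, form the 8x8 grid of partial products (each fits in 16 bits), shift each into place, and add. This sketch assumes a simple `sum` for the final reduction. It does not model the Wallace tree ordering or the partitioned adds, and `partial_products` and `long_multiply` are hypothetical names.

```python
def partial_products(a, b, limbs=8, limb_bits=8):
    """Return the shifted limb-by-limb products of a and b (limbs * limbs terms)."""
    mask = (1 << limb_bits) - 1
    terms = []
    for i in range(limbs):
        for j in range(limbs):
            # An 8-bit by 8-bit product always fits in 16 bits.
            p = ((a >> (i * limb_bits)) & mask) * ((b >> (j * limb_bits)) & mask)
            terms.append(p << ((i + j) * limb_bits))
    return terms


def long_multiply(a, b):
    """Sum the partial products; a real design reduces them with an adder tree."""
    return sum(partial_products(a, b))


x, y = 0x0123456789ABCDEF, 0xFEDCBA9876543210
print(len(partial_products(x, y)), long_multiply(x, y) == x * y)
```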
<lkcl_> well, jacob did a huge number of unit tests on it.
<DaKnig> lkcl_: very nice solution. still do you not feel like having this SIMD vector is ... too specific?
<DaKnig> I mean, sigints are awesome, but I wouldnt want them to be part of the python lang.
<lkcl_> DaKnig: it comes from consideration of a large number of factors. and we're doing a CPU / VPU / GPU.
<lkcl_> SIMD - vectors - is essential.
<lkcl_> and if you want to keep the gate count down, partitioning is a *lot* more efficient than having multiple separate pipelines.
<DaKnig> who even suggested separate pipelines lol
Asu has quit [Quit: Konversation terminated!]
<lkcl_> well ok you know what i mean - separate hardware for each type of SIMD operation :) if it's in the same pipeline, it's still separate