<whitequark>
mh, guess not, let me finish describing this stuff anyway
<awygle>
oh sorry
<awygle>
i'm back
<awygle>
drifted away
<whitequark>
awygle: alright
<whitequark>
so, i explained the state of cxxrtl, where the inelegance of triggers matters relatively little
<awygle>
mhm
<whitequark>
ah, one thing that cxxrtl would greatly benefit from is if you could trigger on clock domains directly rather than on clock signals. with the current edge detector architecture that can't work as well.
<awygle>
ah
<whitequark>
i haven't worked on that goal specifically because almost everyone who tried cxxrtl said that it's more than fast enough
<whitequark>
anyway. let's look at cxxsim now.
<whitequark>
cxxsim performs cosimulation of python processes with the cxxrtl process (singular), which adds a lot of moving parts
<whitequark>
for example, cxxrtl inputs (of the toplevel module) are value<>, which is fine for cxxrtl, which never writes to them. however, python processes most certainly can both read and write. what do? making those toplevel inputs wire<> would double the number of delta cycles in the best case.
<whitequark>
well, the solution i came up with is that cxxsim (the python module) creates a virtual wire<> whose `curr` part is a c++-owned value<> that's part of the netlist, and whose `next` part is a python-owned pseudo-value<> that only exists to make python processes deterministic.
<whitequark>
which is lowkey cursed, but it works as long as multiple processes only ever arise python-side
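A minimal sketch of the "virtual wire<>" idea above, modeled in Python (the class and method names here are made up for illustration, not cxxsim's actual code). `curr` stands in for the C++-owned value<> that lives in the netlist; `next` is the Python-owned shadow that only exists so python processes stay deterministic:

```python
# Hedged sketch: a Python model of the virtual wire<> described above.
class VirtualWire:
    def __init__(self, width):
        self.width = width
        self.curr = 0   # read side: what the netlist (and python code) observes
        self.next = 0   # write side: what python processes mutate

    def set(self, value):
        # python processes write here; the netlist never sees it until commit
        self.next = value

    def commit(self):
        # returns True if the value changed, i.e. another delta cycle is needed
        changed = self.curr != self.next
        self.curr = self.next
        return changed

clk = VirtualWire(1)
clk.set(1)
assert clk.curr == 0   # the netlist still sees the old value
assert clk.commit()    # commit propagates the write and reports a change
assert clk.curr == 1
```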
<whitequark>
but that's not the problem. the problem is triggering
<whitequark>
see, if a python process waits on some signal (currently, that's pretty much always a clock), and the cxxrtl process has registers that trigger on that same signal, they *must* be evaluated concurrently, or you'll get a race.
<whitequark>
okay, i explained enough context to explain the actual issue
<whitequark>
so, the issue arises when python processes wait on signals. with the cxxrtl process it's easy: it simply polls the async input on every call to eval(). this works because the simulator can advance the simulated time once the cxxrtl process converges.
<whitequark>
python processes can't busy wait, so instead they register a trigger and go to sleep.
<whitequark>
right now, the way i process triggers is that during commit i check every one of them in sequence and compare curr/next
<whitequark>
(similar to what cxxrtl does when the clock is a wire<>)
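The linear trigger scan described above can be sketched like this (all names are hypothetical, not the actual cxxsim internals): during commit, every registered trigger is checked in sequence by comparing a signal's curr/next pair, and any python process sleeping on a changed signal is woken.

```python
# Hedged sketch of the per-commit trigger scan (made-up names).
class Signal:
    def __init__(self):
        self.curr = 0
        self.next = 0

def commit_and_wake(signals, triggers):
    """triggers: list of (signal, waiter) pairs; returns the woken waiters."""
    woken = []
    for sig, waiter in triggers:      # O(n) scan, repeated on every commit
        if sig.curr != sig.next:
            woken.append(waiter)
    for sig in signals:
        sig.curr = sig.next           # the actual commit
    return woken

clk, data = Signal(), Signal()
clk.next = 1
woken = commit_and_wake([clk, data], [(clk, "proc_a"), (data, "proc_b")])
assert woken == ["proc_a"]   # only the process waiting on the changed signal wakes
```

Note this only works when the waited-on signal actually has a curr/next pair, which is exactly the limitation discussed next.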
<whitequark>
this works for inputs and registers, but it doesn't work for comb outputs, aliases, or anything else
<whitequark>
since comb outputs in cxxrtl netlists don't have curr/next, to fix this, i would have to save the old value somewhere else
<whitequark>
so basically, python processes right now only work if you wait on a wire
<whitequark>
conversely, cxxrtl processes only work if their clock is *not* a wire, because cxxsim does an equivalent of this c++ code: top.p_clk.set(true); top.commit(); top.eval();
<whitequark>
and in that sequence, commit() makes curr == next, and then eval() doesn't actually trigger any sync logic
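A small model of why that sequence loses the edge when the clock is a wire<> (simplified, hypothetical names): if the edge detector compares the wire's curr against its next, then committing first makes them equal, and the eval() that follows sees no transition.

```python
# Hedged sketch: commit-then-eval erases the posedge on a wire<> clock.
class Wire:
    def __init__(self):
        self.curr = 0
        self.next = 0

    def set(self, v):
        self.next = v

    def commit(self):
        self.curr = self.next

def posedge(clk):
    # simplified edge detector over the wire's curr/next pair
    return clk.curr == 0 and clk.next == 1

clk = Wire()
clk.set(1)
assert posedge(clk)        # eval() before commit() would see the posedge
clk.commit()
assert not posedge(clk)    # after commit(), curr == next: the edge is gone
```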
<whitequark>
you might ask: but if p_clk is a value<>, then prev_p_clk is set during commit, why doesn't that break in the same way?
<whitequark>
and the answer is that it works more or less by accident. specifically, because i commit signals driven by the cxxrtl process first, and signals driven by python processes second
<whitequark>
to make this even worse, currently every single cxxrtl-owned signal you use is registered as a trigger for the cxxrtl process
<whitequark>
and is linearly searched multiple times for every simulation instant
<whitequark>
i don't know how coherent what i just wrote is (i'm pretty sure it's impossible to understand), but it's ok if it isn't. i'm only really trying to convey the amount of ad-hoc hacks related to edge triggered logic here
<whitequark>
and it's not just that they're ad-hoc, it's that they violate basic invariants that both pysim and cxxsim are built around
<whitequark>
- order of eval is unimportant
<whitequark>
- order of commit is unimportant
<whitequark>
i want something that would:
<whitequark>
- trigger cxxrtl processes whether the clock is wire<> or value<>, and without doing unsound reads from .next
<whitequark>
- trigger python processes in exactly the same way as cxxrtl processes are
<whitequark>
- free python code from the need to do O(n) operations on large amounts of signals
<whitequark>
- ideally, greatly reduce the cost of eval on the inactive edge of the clock
<cesar[m]>
Regarding the last point, what if you split the eval function into eval_level_sensitive and eval_edge_sensitive?
<cesar[m]>
eval_edge_sensitive would only run if it's triggered by an edge
<cesar[m]>
if it's the wrong edge, it would not be run, and neither would eval_level_sensitive
<cesar[m]>
if eval_edge_sensitive does run, eval_level_sensitive would be run afterwards.
<cesar[m]>
eval_level_sensitive would not have any clocks in its sensitivity list.
<whitequark>
yes, that's exactly what i'm thinking about here
<whitequark>
the devil is in the details, really; i'd like to preserve the beautifully simple `top.step()` interface, yet also enable this
<whitequark>
well, it's not really "edge sensitive" and "level sensitive"
<whitequark>
both comb and sync processes are edge sensitive, because comb processes are iterated to fixpoint
<whitequark>
unlike in verilog, a process that reads its own output will self-trigger
<whitequark>
so, perhaps something like eval_comb() and eval_sync(), where eval_sync() would perhaps further delegate its job to eval_posedge_p_clk_negedge_p_rst() or something like that
<whitequark>
and then eval() would just be eval_comb();eval_sync()
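The split being discussed could look roughly like this in Python (every name here is hypothetical, not cxxrtl's actual API). eval_comb() iterates the comb processes to a fixpoint, so a process reading its own output self-triggers until it settles; eval_sync() delegates to per-edge functions; eval() keeps the one-call interface by composing the two.

```python
# Hedged sketch of an eval_comb()/eval_sync() split (all names made up).
def make_top():
    state = {"clk_prev": 0, "clk": 0, "d": 0, "q": 0, "out": 0}

    def eval_comb():
        # iterate comb logic to fixpoint: rerun while any output changes
        changed = True
        while changed:
            new_out = state["q"] ^ 1          # comb logic: out = ~q
            changed = new_out != state["out"]
            state["out"] = new_out

    def eval_sync():
        # could further delegate to e.g. eval_posedge_clk()
        if state["clk_prev"] == 0 and state["clk"] == 1:
            state["q"] = state["d"]           # posedge-triggered register
        state["clk_prev"] = state["clk"]

    def eval():
        eval_comb()
        eval_sync()
        eval_comb()   # let comb logic settle after the sync update; in the
                      # real thing the outer step() loop would handle this

    return state, eval

top, top_eval = make_top()
top["d"] = 1
top["clk"] = 1
top_eval()
assert top["q"] == 1 and top["out"] == 0   # register captured d, comb re-settled
```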
<_whitenotifier>
[nmigen/nmigen] whitequark pushed 2 commits to cxxsim [+0/-0/±2] https://git.io/JIiLi
<_whitenotifier>
[nmigen/nmigen] whitequark 0807149 - sim.cxxsim: dump simulation-only signals to VCD, when possible.
<_whitenotifier>
[nmigen] whitequark commented on issue #556: cxxsim: simulator-only signals not included in VCD and GTKWave files - https://git.io/JIiLy
<_whitenotifier>
[nmigen] whitequark closed issue #556: cxxsim: simulator-only signals not included in VCD and GTKWave files - https://git.io/JIRpP
<_whitenotifier>
[nmigen] whitequark edited a comment on issue #556: cxxsim: simulator-only signals not included in VCD and GTKWave files - https://git.io/JIiLy
<_whitenotifier>
[nmigen] nfbraun commented on issue #558: write_cfgmem is writing to arty board when not specified to in the code - https://git.io/JIipq
<cesar[m]>
whitequark: I've identified a couple of tests where CxxSim and PySim still disagree. I'll investigate.
<cesar[m]>
Otherwise, CxxSim is well reproducing PySim results.
<whitequark>
cesar[m]: excellent
<whitequark>
looking forward to your testcases
<whitequark>
i was initially planning to do randomized testing, but that turned out to be more difficult than i anticipated, and on top of it there is still significant missing functionality
<whitequark>
testing it on a large real-world project is the next best thing
<whitequark>
cesar[m]: how's the performance so far?
<_whitenotifier>
[nmigen] whitequark commented on issue #558: write_cfgmem is writing to arty board when not specified to in the code - https://git.io/JIPkI
<cesar[m]>
Sorry, any potential gain in run time is being completely offset by C++ compilation time.
<whitequark>
cesar[m]: are you able to reuse a simulator by repeatedly resetting it?
<cesar[m]>
Sure.
<whitequark>
still compiles for too long?
<cesar[m]>
What I mean is, it sure would be faster if we repeatedly reset the simulator instead of recreating it as we do today.
<whitequark>
ah.
<whitequark>
yes. if you recreate it then the unfortunate result you are observing is expected
<whitequark>
in principle it would not be too hard to add caching, but the knob you would have to use for that is not currently exposed.
<cesar[m]>
Also, I guess we could leverage the greater performance by increasing the number of iterations and test vectors.
<whitequark>
the main reason I'm asking is that ctypes can have an... unfortunate effect on performance
<whitequark>
and the exact steps I will have to take to tackle that should, I think, mostly reflect real-world use
<cesar[m]>
Maybe there could be a way to measure the simulation run time, excluding compile time, in both CxxSim and PySim?
<cesar[m]>
It could be done in a Python process, I guess.
<whitequark>
that could be done, but I'm worried about Amdahl's law
<whitequark>
what you ultimately care about is how long the tests run, not how quickly CXXRTL itself runs
<cesar[m]>
Indeed.
<whitequark>
CXXRTL is fast enough that it's competitive with single threaded Verilator, which in practice means that even relatively small inefficiencies in Python have a dramatically higher effect on runtime
<whitequark>
at one point, I measured a prototype of CXXSim running *slower* than PySim, even though the same design ran in pure C++ at over 1 MCPS
<whitequark>
I know there's multiple places in the current version of CXXSim where I have great opportunities for optimizing the interface, but I'd like to actually do that with a benchmark in hand and not blindly
<cesar[m]>
Unfortunately, I cannot try CXXSim on the full design at the moment, because it uses Litex peripherals, so it must be simulated in Litex / Verilator, not in nMigen.
<whitequark>
ah I see
<_whitenotifier>
[nmigen] davidlattimore commented on issue #558: write_cfgmem is writing to arty board when not specified to in the code - https://git.io/JIPY9
<cesar[m]>
So, I do the next best thing, which is to simulate (in nMigen) the big integration test.
<lkcl>
whitequark: some background there - nmigen-soc is still in its infancy so we had to use something that's well-established
<whitequark>
yep, that's perfectly sensible
<lkcl>
also, the initialisation of even 64k of "memory" (loading a BIOS that would allow further extensive testing) into pysim is... awful.
<whitequark>
pysim memories are about to get a whole lot faster
<lkcl>
microwatt's "random instructions" unit tests are around... 128k in size?
<lkcl>
ah brilliant. that would be superb
<whitequark>
I'm kind of forced to optimize them because otherwise I could not integrate cxxrtl properly
<whitequark>
because of some tricky internal interface issues
<lkcl>
:)
<whitequark>
the reason memories are so painful right now is they are O(n)
<lkcl>
ouch.
<whitequark>
which as you have just discovered turns into O(n^2) the moment you load the entire thing
<whitequark>
(before someone submits this to the "accidentally quadratic" tumblr blog: I did them this way on purpose, it was the right decision at the time)
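A toy illustration of that O(n) to O(n^2) blowup (my own model, not pysim's actual generated code): if the memory is compiled so that every write evaluates a mux per row, then a single write is O(n) work, and initializing all n rows costs O(n^2).

```python
# Hedged illustration: per-write O(n) row scan makes a full load O(n^2).
def write_scan(mem, addr, data):
    ops = 0
    for row in range(len(mem)):   # every row's mux is evaluated on each write
        ops += 1
        if row == addr:
            mem[row] = data
    return ops

def load_all(n):
    # load every row, as when initializing a memory from a BIOS image
    mem = [0] * n
    return sum(write_scan(mem, a, a) for a in range(n))

assert load_all(10) == 100        # n writes * n rows = n^2 mux evaluations
assert load_all(100) == 10_000
```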
<lkcl>
i had to write directly to the _init internal data structure, btw, to load in BIOSes in a reasonable time. didn't tell you about it so as not to freak you out :)
<whitequark>
wait, it was Memory.init setter that was this slow?
<whitequark>
interesting
<whitequark>
*that* is not supposed to be O(n)
<whitequark>
just the pysim generated code
<lkcl>
i bypassed it - long story.
<whitequark>
yes, it's reasonable that you did
<whitequark>
I'm just surprised that it's so slow
<lkcl>
also... *sigh*... OpenPOWER ISA has big-endian / little-endian, plus the data was in 32-bit format and needed to be in a 64-bit Memory... *sigh*
<lkcl>
if that can be sped up
<whitequark>
i'm fairly certain that can be sped up
<lkcl>
and there's a UART in nmigen that is even remotely "semi-compatible" with 16550 (even for "read"), then we can try - pretty much straight away - to run e.g. the microwatt "helloworld" example
<whitequark>
the whole "Memory is secretly an Array" thing is going to be in the past
<lkcl>
it's only 1500 bytes
<lkcl>
hooray :)
<whitequark>
yet another decision I unthinkingly pulled from Migen where I should have really known better
<whitequark>
ah well
<lkcl>
:)
<whitequark>
the whole fragment transformer thing makes compilation probably an order of magnitude slower than it should be, too
<whitequark>
not to mention more buggy
<lkcl>
cesar[m]: thank you for doing the extensive report on the libresoc unit tests. you saw i went through them?
<lkcl>
fragment transformer? before handing to cxxsim?
<_whitenotifier>
[nmigen] nfbraun commented on issue #558: write_cfgmem is writing to arty board when not specified to in the code - https://git.io/JIP30
<lkcl>
you mean the node-walker? (i took a look a couple days ago at xfrm.py)
<whitequark>
DomainRenamer, DomainLowerer, etc work by term rewriting essentially
<whitequark>
it's something i kept like it was in Migen because it looked reasonable at first glance
<whitequark>
but if you think about it, you'll notice that virtually no other compiler uses that approach
<lkcl>
hmm
<whitequark>
well, it's because it is not only massively wasteful, but also hard to reason about
<lkcl>
luckily they do actually "work", putting them off from being high-priority
<whitequark>
to be perhaps more fair to Migen, Migen's version was a lot less wasteful, but on the other hand it was even harder to use correctly because of all the mutation going on
<whitequark>
really, none of this stuff should exist in the first place
<lkcl>
i know litex is a fantastic accumulation of incredibly valuable expertise and recipes
<lkcl>
but... gaah, it's just impossible to consider committing any resources to it because migen gives zero warnings.
<lkcl>
recently we got as far as *P&R* in coriolis2 before discovering an error! that's 4 hours compilation time!
<whitequark>
ouch!
<lkcl>
it's to do with netlists that should have been amalgamated. the assignment (an input connected to an input) is detected by coriolis2 and converted to an *output*
<whitequark>
sounds kinda cursed
<lkcl>
you're supposed to "fix" this in verilog by having a register that's assigned to both inputs
<_whitenotifier>
[nmigen] whitequark commented on issue #558: write_cfgmem is writing to arty board when not specified to in the code - https://git.io/JIP31
<_whitenotifier>
[nmigen/nmigen] whitequark pushed 1 commit to master [+0/-0/±1] https://git.io/JIP3M