zng has quit [Quit: ZNC 1.8.x-nightly-20181211-72c5f57b - https://znc.in]
zng has joined ##openfpga
Miyu has quit [Ping timeout: 240 seconds]
ayjay_t has quit [Read error: Connection reset by peer]
ayjay_t has joined ##openfpga
<tnt>
whitequark: yeah, but casez(mysig) will also make a 'z' in 'mysig' be a don't care. I'd like to have only the case options have don't cares, not the input signal I'm matching against.
rohitksingh has quit [Ping timeout: 250 seconds]
<whitequark>
right
<whitequark>
no idea how to do that
rohitksingh has joined ##openfpga
catplant has quit [Quit: WeeChat 2.2]
catplant has joined ##openfpga
rohitksingh has quit [Remote host closed the connection]
rohitksingh has joined ##openfpga
<whitequark>
ZipCPU: so, I am trying to convert an FSM-driven CPU into a pipelined CPU, for various reasons
<whitequark>
I am interested in formally proving their equivalence
<ZipCPU>
nMigen?
<whitequark>
yeah
<whitequark>
both in nMigen
<whitequark>
really, the pipelined one will have the same CPI as FSM-driven one
<whitequark>
but it turns out that FPGA tooling is kind of bad at optimizing FSM-driven CPUs
<whitequark>
right now the FSM logic takes well over 50% of LUT count
<ZipCPU>
Ouch
<whitequark>
and that's with many different toolchains
<ZipCPU>
Ok
<whitequark>
yeah, it should be somewhere in the range of 200 LUT, and in reality it's more like 500
<whitequark>
they will be both in-order and scalar
<ZipCPU>
How much of a proof do you have in place already?
<whitequark>
absolutely nothing, I have a set of testcases
<whitequark>
(and a number of bugs in them, some of which I probably haven't found yet)
<ZipCPU>
RISC-V CPU?
<whitequark>
no, it's my own architecture
<ZipCPU>
Ok
<ZipCPU>
Clock for clock equivalence?
<ZipCPU>
Or more general?
<whitequark>
general. I think I need a bit of background
<ZipCPU>
Never done the formal thing before?
<whitequark>
this CPU is intended for FPGA control plane and the idea is that it takes absolute minimal amount of resources
<whitequark>
so it stores everything except PC and flags in a single block RAM
<whitequark>
it is also designed to exploit single-port RAM like in UP5K, but scale up to true multiport RAM
<whitequark>
there are two reasons I am converting it to a pipeline design in spite of the fact that every instruction, by design, involves up to 3 loads and up to 1 memory store
<whitequark>
to a single BRAM
<whitequark>
the first one is that inferredd FSM control logic is unusably bloated
<ZipCPU>
"stores everything ... in a single block RAM": does that include CPU state variables, or just CPU registers?
<whitequark>
general-purpose registers
<whitequark>
the only CPU state variables that are defined by architecture is PC and flags (ZSCV)
<ZipCPU>
Just GP registers, or ... other memory as well?
<whitequark>
GP registers, code and data share address space
<whitequark>
and live in a single BRAM
<whitequark>
it is an exceptionally compact design, which AFAIK has no precedent, or at least I haven't seen any
<ZipCPU>
How much formal verification have you done so far?
<whitequark>
it takes a number of lessons from PicoBlaze though
<whitequark>
I have never done any formal verification for programmable logic, though I have some experience with Coq
<whitequark>
I've looked at RVFI.
<ZipCPU>
Ok ... so you'll be working through a learning curve at the same time. Sure. Here's the lesson I learned recently on CPU verification:
<ZipCPU>
Keep the instruction around. Re-decode it within your formal logic at every step of the pipeline, and then verify that all of the local values within the pipeline continue to match the instruction you are processing.
<ZipCPU>
That's similar to what RVFI is doing. They use a packet that goes through the pipeline, and then verify it at the end
<ZipCPU>
s/They use/It uses/
<ZipCPU>
You can do something similar, but have (as a minimum) the instruction as your packet, and then verify everything against it.
<whitequark>
right, so what I'm thinking is: have an external interface where each clock cycle it is announced: a) the instruction opcode, b) what it is storing to memory, c) what is the new state of flags (if changed) and PC
<whitequark>
because this should match for all in-order scalar implementations, no matter what they do inside.
<ZipCPU>
Why an *external* interface?
<sorear>
if you have a mechanism to $display instructions one at a time as they're executed, you have what you need for RVFI-alike
<whitequark>
external to the implementation, I mean
<whitequark>
hm
<whitequark>
what I was thinking about is using the FSM implementation as a reference and comparing the pipelined implementation against it
<whitequark>
but maybe that's a bad idea
<ZipCPU>
The difficult part will be induction
<ZipCPU>
You might manage to go several cycles with that in BMC, but ... induction can be more of a challenge
<ZipCPU>
So there's somewhat of a division between Clifford and I: I use induction for everything, he does not
<ZipCPU>
We joke about my being crazy for doing it, but I like the extra assurance it provides. That and ... somethings can only be proven via induction
<ZipCPU>
Induction requires .... more intrusive properties to make certain the CPU state is consistent at every state
<ZipCPU>
This is why I now "re-decode" the instruction at every CPU state--it helps to insure a consistent state within the CPU at every stage of the pipeline
<ZipCPU>
I might also recommend, since you are starting from scratch, that you don't worry so much about verifying the equivalence of the implementations, but rather that one of the implementations (or both) is equivalent to the ISA behavior
* ZipCPU
clicks on the link
<whitequark>
essentially, it has three load/decode stages, then some combinatorial logic that actually executes the instruction, then a writeback stage.
* ZipCPU
had forgotten yo uwere using a spreadsheet for your pipeline ;)
<ZipCPU>
Four stages?
<whitequark>
yes.
<whitequark>
three-address code, so load opA, load opB, store result, and fetch.
<whitequark>
in the case of ALU instructions.
<whitequark>
of course, since every stage is competing for memory access, it will be always mostly stalled. but this is fine.
<ZipCPU>
Memory ... everything is using block RAM, right?
<whitequark>
yep.
<ZipCPU>
How big is the block RAM?
<whitequark>
16 bit by however much storage you want.
<whitequark>
it should run with 256x16 iCE40 fine, and address space permits up to 65536x16.
<whitequark>
which would be single port RAM territory on iCE40UP5K.
<ZipCPU>
Ok, so ... 65kx16 then for the size of the proof
<whitequark>
that seems like a very large state space...
<ZipCPU>
Is the memory external or internal to the CPU?
<whitequark>
internal to CPU.
<whitequark>
(there is an external bus also, which you can use to attach peripherals, expand memory, etc.)
<ZipCPU>
(It's not nearly as large a state space ... if you only have a 10 stage pipeline, only 10 memory values will ever be relevant in any proof)
<whitequark>
ah I see.
<ZipCPU>
So ... after your memory is larger than 10, bigger memories don't really change things
<ZipCPU>
But it is internal to the CPU so .... you'll need assertions mid-pipeline that the instructions are consistent with what's in memory ...
<ZipCPU>
What kind of states can the CPU ever be halted within? Are there specific places where it can be halted?
<whitequark>
it doesn't have opcode space for a halt instruction. right now I use a special jump in a simulator to signal halt.
oter_ has quit [Quit: My iMac has gone to sleep. ZZZzzz…]
<whitequark>
so, HLT ~ J .-1
<ZipCPU>
(One of my frustrations is that, with the ZipCPU, it can be halted in *any* state ... that necessitated a method of verification)
<ZipCPU>
Ok, what sort of stalls might you expect? Does the CPU ever stall?
<sorear>
what does "state" mean in a pipelined context? do you not just stall the frontend and wait for everything to drain?
<whitequark>
a lot. with single port RAM, each ALU instruction will stall four times
<ZipCPU>
"stall" four times? But your memory is internal to the CPU. What's causing the stalls?
<whitequark>
the memory may be single-port.
<whitequark>
so only one access to the memory may go on per cycle
<whitequark>
basically the challenge with this CPU is scheduling access to its internal memory
<whitequark>
using the FSM as an abstraction for that did not work out
<whitequark>
so now I am using a pipeline as an abstraction for that
<ZipCPU>
whitequark: This sounds like a fun problem, and a good one for formal verification
<whitequark>
right? I think so too
<whitequark>
there are definitely some very good formal properties to be stated, like "in a SPRAM implementation, RE and WE are never on at the same time"
<ZipCPU>
I'm not sure I have any wonderful ideas for how to get started. Usually I recommend starting with something simpler, but ... the problem at hand is always the one you want to work with
<ZipCPU>
That's good
<ZipCPU>
How about for instruction I, that reads from memory A, that it should be reading from memory A ?
<whitequark>
I was thinking about this, but, depending on the implementation, reads may well go in different order
<ZipCPU>
Or that, in your pipeline implementation, only instruction I is ever in the pipeline at a time
<ZipCPU>
I think that was what you had described to me, no?
<whitequark>
I think these are both not very good properties for formal verification of an *arbitrary* implementation, as opposed to a specific one
<whitequark>
for example:
<whitequark>
for any ALU instruction, the Load A and Load B stages are not interdependent
<whitequark>
so they can be swapped
<whitequark>
and loads will go in different order
<ZipCPU>
The tricky (but important) one is writeback. If instruction I is an ADD instruction, then OpA should match the memory, OpB should match memory, and OpA+OpB should be written back
<ZipCPU>
You could start there
<whitequark>
hmmm
<ZipCPU>
Be aware of multiplies ... those will definitely get in your way, so leave those for the end when you have everything else verified
<whitequark>
so, in my formal model, I parse the instruction, extract operands from memory, execute the instruction, and verify that the output matches the FI packet?
<whitequark>
there are no multiples in this CPU :D
<ZipCPU>
Oh, sorry, looking closer at the "M-class" instructions it looks like 'M' stood for memory, rather than multiply
<ZipCPU>
Since you have specific classes, you could create an assertion for each class: if the instruction being retired at this point is of class X, then ... these properties should hold
<ZipCPU>
Lesson 3 gets into using SymbiYosys (don't give up on the lesson too early for being too basic)
<ZipCPU>
Yeah, slide 30 of lesson three shows the basic outline of a SymbiYosys script
<whitequark>
ah I see! I haven't looked at that since I don't really write Verilog directly
<whitequark>
in fact nMigen does not use Verilog anywhere, it synthesizes to RTLIL
<ZipCPU>
So ... how is your flow going to work then? nMigen->RTLIL->Verilog->SymbiYosys ?
<whitequark>
I think so, yeah, since Verilog is (a) the human-readable output (b) can be used with other toolchains.
<whitequark>
so, making sure that Yosys write_verilog pass did not mess up is important
<ZipCPU>
Looks like lesson three of the tutorial walks you through the initial basics of using SymbiYosys and all the details of how it works
<whitequark>
thanks, will use that
<daveshah>
Making sure write verilog works is the only reason to use verilog
<whitequark>
hehehe
<daveshah>
read_ilang should be fine in SymbiYosys too
<ZipCPU>
Will you be creating your formal statements within nMigen?
<ZipCPU>
formal *properties* .. sorry
<whitequark>
nMigen doesn't yet support formal properties, and I'm not entirely sure if they survive write_verilog in Yosys yet
<daveshah>
The alternative to that is just exposing the necessary internal state as external outputs
<whitequark>
so I think I'll start with writing them in Verilog
<daveshah>
like RVFI
<whitequark>
yeah, I'm going to follow RVFI
<whitequark>
right now I really want to treat the CPU as a black box
<whitequark>
maybe later I will add some assertions to the CPU itself
<ZipCPU>
Ok ... that'll work until you get to induction ... then you need properties describing internal states
<ZipCPU>
In other words, start out with "mode bmc" in your sby file
<daveshah>
I'm not convinced you ever need induction
<whitequark>
ahhh I think I see why induction needs properties describing internal states
<daveshah>
riscv-formal gets away wiyhout it
<whitequark>
because it ignores initial statements
<daveshah>
Exactly
<ZipCPU>
whitequark: Didn't I share that there was a division between Clifford (and now David) and I over using induction?
<ZipCPU>
:D
<whitequark>
yeah, but I didn't really understand why, until now
<ZipCPU>
For now, it's a healthy debate
<daveshah>
The third option is abc pdr. This provides a full proof too, but doesn't tend to need so many internal assertions as it determines reachability itself
<ZipCPU>
None of us are religious about it (yet--or at least, I'm not)
<daveshah>
However, I think a CPU might be too much for it
<whitequark>
daveshah: even a boneless CPU? :P
<daveshah>
Might be worth a try
<ZipCPU>
daveshah: whitequark is dealing with up to 65kx16 of memory. I didn't think abc pdr handled memory well
<daveshah>
Sure, but there's no need for more than a few 10s or 100s of words in that case
<ZipCPU>
whitequark: The fastest engines right now for CPU work are yices and v3 of boolector. If you just use the "smtbmc" engine in SymbiYosys, you should do well
<whitequark>
ZipCPU: thanks. I am off to watch FV find the bugs in my CPU I currently know the existence of.
<ZipCPU>
Wheeee!!!
<ZipCPU>
My guess is that you'll start finding a bunch of bugs in your properties first ;)
<whitequark>
yeah, when I was writing testcases, it was about 50/50 split.
<whitequark>
but I expect this to change once I switch to a pipelined design.
<whitequark>
especially once I give it a true multiport memory and hazards become something that needs to be handled...
* ZipCPU
found a lot of unchecked hazards when he first verified his ZipCPU
GuzTech has joined ##openfpga
<whitequark>
yeah, I am *definitely* not trusting a pipelined design without a full formal proof
<whitequark>
humans are just not good at finding this kind of bugs at all
* ZipCPU
trusted his design before the formal proof, and now realizes his error
<SolraBizna>
I didn't verify my pipelined design at all
<whitequark>
ever since I seriously looked at FV of software, I am acutely aware that most code I write works, at most, by coincidence
<SolraBizna>
(Relatedly, I never got it fully working)
<ZipCPU>
I'm not sure if I'd write a CPU again without starting with formal at the beginning, just based on all I've learned in the process
<ZipCPU>
;D
<whitequark>
but for software, FV is still often intractable
<whitequark>
you cannot really use it for e.g. Python code
<ZipCPU>
There are approaches to doing so ... I've just never tried them
<whitequark>
there are well established approaches
<ZipCPU>
SolraBizna: Still working with it and want to give formal a try?
<whitequark>
for example, Coq extraction, and there's Frama-C for C code
<SolraBizna>
I've been thinking about starting from scratch again and using formal verification as an excuse to do it
<ZipCPU>
I like your excuse
<SolraBizna>
It's one of those things where I'm going to hate learning it, but once I know it and use it it will save countless hours of work
<ZipCPU>
I actually enjoyed learning FV. I had anticipated that I would not, but I found it to be a lot of fun
<whitequark>
I really liked learning Coq
<sorear>
my impression is that the major difference is that in RTLIL your state space is always finite, while nontrivial C or Python programs have infinite state spaces
<SolraBizna>
I expect it will be similar with me, once I get past the "quit dithering and actually figure out how to talk to SymbiYosys" stage
<whitequark>
however, I never got far in Coq
<whitequark>
working with natural numbers is very different from working with inherently bounded state spaces like in programmable logic
<ZipCPU>
sorear: You may have hit the nail on the head there
<sorear>
one of these is NP-hard, the other is undecidable
<whitequark>
it's an excellent mental exercise btu I think if you want it to be practical, you need many years of it
<whitequark>
and maybe a PhD or two
<whitequark>
that you write while learning
<whitequark>
such is the state of the field...
<ZipCPU>
No, I disagree
<ZipCPU>
I found bugs in the first design that I (as a complete FV newbie) tried to formally verify
<whitequark>
no no, I am talking about Coq h ere
<ZipCPU>
Ah, ok
<whitequark>
Coq is far more general, it is a bit like a formal verifier construction kit
<whitequark>
there is a verified C compiler called CompCert, in Coq
<ZipCPU>
whitequark: Do keep me posted on how this works out for you. I'd love to hear (and catalog) your experiences.
<whitequark>
and most of the work that went into it, was in figuring out how to make *any* compiler verified in Coq
<ZipCPU>
The bottom of slide four discusses the differences between induction and BMC
<whitequark>
ah I see, that's a good explanation
<whitequark>
but here's something weird
<whitequark>
I used induction a lot in Coq and in Coq, it works like this: you prove the basis of induction, and then you prove the step
<sorear>
then cakeml went and did an arguably stronger approach using a completely different system
<ZipCPU>
whitequark: Go on
<whitequark>
but in Yosys it seems that the proof of the basis is missing; instead, you take an arbitrary but valid state of the basis, without trying to prove that it is reachable
<whitequark>
it feels to me that there should be something in the middle between Yosys BMC and Yosys induction
<daveshah>
The proof of the basis is done by running BMC first
<sorear>
"proving that a state is reachable" is … harder than NP
<whitequark>
hmm, but then why do you need a white box approach?
<daveshah>
sorear: that's what abc pdr does
<whitequark>
can you take the results of BMC and feed them into induction?
<daveshah>
It either completes in a second, or doesn't complete at all
<daveshah>
whitequark: no, because the starting state is arbitrary
<whitequark>
yeah, that's what I don't understand
<whitequark>
why is it arbitrary?
<ZipCPU>
To avoid calculating reachability ;)
<sorear>
state reachability is PSPACE-complete
<sorear>
both "BMC" and "induction" are only solving NP problems
<whitequark>
ok, so it is basically an optimization
<whitequark>
I see
<ZipCPU>
Optimization? Not what I'd call it
<ZipCPU>
It's not optimizing an existing solution to work faster, it's just not calculating reachability at all
<whitequark>
well, optimization might not be a good word
<ZipCPU>
Your assertions are used to force your design into a reachable state
<whitequark>
but I think I understand why this happens now
<ZipCPU>
This was the important part of the last FV quiz: if you find a bug in section (A) (the first part of your trace), you don't have enough assertions
* sorear
has an irrational dislike of automatic theorem provers with an exponential or larger gap between average/heuristic case and worst/provable case, which is most of them
<whitequark>
of course, I still want reachability, but I assume that there are very good reasons it isn't done
<ZipCPU>
If you find a bug in section (B) or (C) of the trace (last clock, or last timestep), then you may have a logic bug
<daveshah>
Well, try abc pdr and see if it works
<ZipCPU>
No ... run from abc pdr!
* whitequark
has an irrational dislike of abc
<whitequark>
(or maybe rational)
<ZipCPU>
Somtimes abc pdr works fast, and gives you a solution very quickly
<sorear>
a circuit with 200 state bits can have states which are reachable, but only after 2^200 time steps
<ZipCPU>
Other times, smtbmc gives you a result and ... abc pdr gets stuck
* ZipCPU
heads back to bed
rohitksingh has quit [Ping timeout: 250 seconds]
<_whitenotifier-6>
[whitequark/Boneless-CPU] whitequark pushed 1 commit to master [+0/-0/±2] https://git.io/fhUYK
<_whitenotifier-6>
[whitequark/Boneless-CPU] whitequark 26d7202 - Remove explicit HALT state, use J(-1) instead.
_whitelogger has joined ##openfpga
<tnt>
Does yosys take advantage of 'x' ? Like if in a always_comb I assign a buch of signals to 'x' most of the time, does it know I don't care about the output in this case and its free to pick whatever is most convenient to generate ?
rohitksingh has joined ##openfpga
rohitksingh has quit [Ping timeout: 272 seconds]
Miyu has joined ##openfpga
Miyu has quit [Ping timeout: 244 seconds]
pie__ has quit [Ping timeout: 252 seconds]
sgstair has quit [Remote host closed the connection]
<_whitenotifier-6>
[whitequark/Boneless-CPU] whitequark pushed 3 commits to master [+6/-1/±2] https://git.io/fhUcC
<_whitenotifier-6>
[whitequark/Boneless-CPU] whitequark 92fee90 - Separate core and testbench code.
<_whitenotifier-6>
[whitequark/Boneless-CPU] whitequark 9bef717 - Add a disassembler, usable as a GTKWave filter.