<ZirconiumX>
pepijndevos: but the dumb state machine can hit ~170 MHz on a Cyclone V, which I'm proud of
<pepijndevos>
Nice nice
<ZirconiumX>
Even if the Quartus Timing Analyser tried its best to make things worse :P
<pepijndevos>
huh?
<ZirconiumX>
pepijndevos: since I apparently missed that, the Timing Analyser tries to helpfully suggest improvements to your source, such as disabling an optimisation on a critical path to make the critical path latency worse
<pepijndevos>
Nice
<whitequark>
w-why
<mwk>
quality toolchain
<ZirconiumX>
I have no idea, but when I added the third setup stage to make the combinational logic shorter I re-enabled that optimisation for an extra 15MHz Fmax
<ZirconiumX>
I was wondering how the GS worked and where it got its performance from
<ZirconiumX>
So I looked in the data sheet
<ZirconiumX>
I'm still clueless, but now my eyes are bleeding
<ZirconiumX>
mwk: since it was for a console where people had to control the GPU directly rather than via an API, it's *kind of* well documented
<ZirconiumX>
Even if the console has like three A4 pages of bugs that it tells you to work around
<ZirconiumX>
Such as short loops in the CPU *not* looping, the "disable interrupt" instruction sometimes not disabling interrupts when you get an interrupt after said "disable interrupt" instruction, etc
<mwk>
... what do you mean, loops not looping
<whitequark>
did they fuck up the pipeline
<ZirconiumX>
If you have a loop that's less than 6 instructions long some of the branch instructions don't calculate the condition properly and return false
<ZirconiumX>
So they don't loop
<mwk>
but but how
<whitequark>
forgot an interlock or a forward?
<ZirconiumX>
They don't say, sadly, only that the workaround is to pad loops to six instructions or more
<mwk>
it's the distance from loop beginning to end that counts?
<ZirconiumX>
Yep
<mwk>
huh
<ZirconiumX>
It's a 7-stage pipeline
<ZirconiumX>
They mention that if there's a cache stall inside the loop it's okay
<ZirconiumX>
But if you're cache stalling on every single loop iteration, what are you doing
<sorear>
this is on the mips core??
<ZirconiumX>
Oh yeah
<ZirconiumX>
The GPU has bugs too but they're less interesting to talk about
<ZirconiumX>
Things like truncating your textures if you write them to misaligned addresses
<ZirconiumX>
And the Z test disable bit not working
<ZirconiumX>
Amusingly though the PS2 officially has no "bugs" just "inconveniences"
<mwk>
bwahahah
<mwk>
I have to remember that one
<ZirconiumX>
"Return from the interrupt handler after executing the EI instruction. If this restriction is not followed, an inconvenience may happen when an interrupt occurs immediately after executing the DI instruction."
<ZirconiumX>
"Instructions that operate the cache or TLB must be directly preceded by and followed by a SYNC instruction. Detailed information is given to the respective instructions. If this restriction is not followed, an inconvenience may be caused when a COP0 Unusable exception occurs."
<ZirconiumX>
"The TLBR instruction must not be immediately followed by a jump/branch instruction. Four instructions or more are required between them, excluding the SYNC.P instruction next to the TLBR instruction. Also, the TLBR instruction must not be placed at the end of a page. Six instructions or more from the end of the page are required for the TLBR instruction. If this restriction is not followed, an inconvenience may be caused when an ITLB
<ZirconiumX>
miss occurs immediately after the TLBR instruction cancellation, due to the occurrence of an exception, etc."
<ZirconiumX>
It's pretty funny
<Finde>
you have to wonder how much extra performance you could get from correct codegen if it weren't so buggy
<emily>
lol @ "an inconvenience may be caused"
<mwk>
The Enrichment Center apologizes for the inconvenience and wishes you the best of luck.
<whitequark>
lmao
<pepijndevos>
ZirconiumX, are you making a PS2 GPU now?
<ZirconiumX>
pepijndevos: Truthfully I have no idea :P
<ZirconiumX>
It just started because I was curious how the PS2 GPU worked
<pepijndevos>
That's how most things start, tbqh
<ZirconiumX>
I'm still a little confused as to how they got the setup times so short
<ZirconiumX>
They mention using DDA, and yet that requires you to calculate dy/dx
<ZirconiumX>
Which is Not Cheap
<ZirconiumX>
Yet they can do Gouraud shading with 4 cycles of setup time
<ZirconiumX>
My hunch is they're doing Bresenham instead
<ZirconiumX>
Because DDA would require a lot of division
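A minimal sketch of the Bresenham-style idea, assuming a single 8-bit colour channel interpolated across a span of `dx` pixels. The function name `bresenham_lerp` and its parameters are hypothetical, and nothing here is taken from the GS data sheet. It only shows how an error accumulator can replace the up-front division with adds, subtracts and compares.

```python
def bresenham_lerp(c0, c1, dx):
    """Return dx + 1 values stepping from c0 to c1 without any division.

    Hypothetical helper: an error accumulator decides when the colour
    value should advance, in the style of Bresenham's line algorithm.
    """
    dc = abs(c1 - c0)                 # total colour change over the span
    step = 1 if c1 >= c0 else -1      # direction of the change
    c = c0
    err = 0
    out = []
    for _ in range(dx + 1):
        out.append(c)
        err += dc
        # When the colour delta exceeds the span length this inner loop
        # runs several times per pixel; it is repeated subtraction, so
        # the cost of dividing is spread out rather than eliminated.
        while dx > 0 and err >= dx:
            err -= dx
            c += step
    return out


print(bresenham_lerp(0, 3, 6))    # shallow slope: colour changes every other pixel
print(bresenham_lerp(0, 255, 5))  # steep slope: several colour steps per pixel
```

The steep case is the awkward one for hardware: a colour ramp over a short span needs many colour steps per pixel, which may be why the division question does not go away entirely.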
<ZirconiumX>
Though lerp also requires a lot of division, so
<ZirconiumX>
It must be one hell of an FSM though.
<ZirconiumX>
8-bit division in 4 cycles implies, what, radix-4 SRT?
<ZirconiumX>
Actually it's going to be bigger than that, hmm.
<sorear>
8-bit division?
<sorear>
is that a gpu thing
<ZirconiumX>
sorear: (most) GPUs work on 8 bits per channel colour
<ZirconiumX>
To do Gouraud shading, you need to linearly interpolate between colours
<ZirconiumX>
For that you need the slope of the line between them, which is where the 8-bit division comes into play
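To make the role of the division concrete, here is a rough DDA sketch for one colour channel, assuming 8 fractional bits of fixed-point precision. `FRAC_BITS` and `dda_lerp` are illustrative names, not anything from the GS documentation. The single division happens once at setup; the per-pixel work is one addition.

```python
FRAC_BITS = 8  # assumed fixed-point precision, chosen for illustration


def dda_lerp(c0, c1, dx):
    """Interpolate one colour channel across dx pixels with a DDA.

    Setup computes the slope (the division under discussion); after
    that each pixel costs a single fixed-point addition.
    """
    if dx == 0:
        return [c0]
    # The one division, performed once during setup.
    slope = ((c1 - c0) << FRAC_BITS) // dx
    acc = c0 << FRAC_BITS
    out = []
    for _ in range(dx + 1):
        out.append(acc >> FRAC_BITS)  # drop the fractional bits
        acc += slope
    return out


print(dda_lerp(0, 255, 5))
print(dda_lerp(0, 3, 6))
```

Because the slope is truncated, the last value can land one short of `c1` when the delta does not divide evenly; more fractional bits reduce that error.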
<kernlbob_>
Could be a big lookup table.
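One way a lookup table could stand in for the divider, sketched under the assumption of 8-bit operands: store a 16-bit rounded-up reciprocal per divisor, then divide with one multiply and a shift. `RECIP_BITS`, `RECIP` and `lut_divide` are hypothetical names; whether the GS does anything like this is not known from the discussion.

```python
RECIP_BITS = 16  # assumed reciprocal precision for 8-bit operands

# 256-entry table of ceil(2**16 / d). Entry 0 is a placeholder because
# division by zero has no meaningful result.
RECIP = [0] + [((1 << RECIP_BITS) + d - 1) // d for d in range(1, 256)]


def lut_divide(n, d):
    """Approximate n // d as a table lookup, one multiply and a shift."""
    return (n * RECIP[d]) >> RECIP_BITS


print(lut_divide(255, 5))
print(lut_divide(200, 7))
```

Rounding the reciprocal up matters: with a rounded-down table, `255 / 5` would come out as 50. With 16 reciprocal bits the result matches integer division for every 8-bit numerator and nonzero 8-bit divisor.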
<ZirconiumX>
I initially thought it might be a flavour of SRT division if it's 8-bit, but if it's 16-bit you couldn't do SRT in 4 cycles
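The cycle counts can be illustrated with a plain radix-4 restoring divider, which retires two quotient bits per iteration. This is deliberately not SRT (SRT uses a redundant digit set and picks digits from a truncated remainder); it is a simpler stand-in that has the same bits-per-cycle arithmetic. `radix4_divide` and its `width` parameter are illustrative names.

```python
def radix4_divide(n, d, width=8):
    """Unsigned restoring division producing two quotient bits per step.

    Returns (quotient, remainder, cycles). Each loop iteration models
    one clock cycle of a hypothetical radix-4 divider.
    """
    assert width % 2 == 0
    assert 0 <= n < (1 << width) and 0 < d < (1 << width)
    rem = 0
    quo = 0
    cycles = 0
    for shift in range(width - 2, -1, -2):
        # Bring down the next two dividend bits.
        rem = (rem << 2) | ((n >> shift) & 0b11)
        # Pick the largest digit in 0..3 with digit * d <= rem.
        # In hardware this would be three comparators in parallel.
        digit = 0
        for k in (3, 2, 1):
            if rem >= k * d:
                digit = k
                break
        rem -= digit * d
        quo = (quo << 2) | digit
        cycles += 1
    return quo, rem, cycles


print(radix4_divide(255, 5))              # 8-bit operands: 4 cycles
print(radix4_divide(65535, 7, width=16))  # 16-bit operands: 8 cycles
```

At two bits per cycle, 8-bit operands finish in 4 iterations and 16-bit operands need 8, which matches the concern that a 16-bit radix-4 divide would not fit in a 4-cycle setup.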
<kernlbob_>
Is SRT anything like Goldschmidt division?
<kernlbob_>
Never mind -- I too can read Wikipedia. (-:
<GenTooMan>
Well, this (https://pastebin.com/cdqjWL6s) isn't working correctly in nmigen (not a surprise): the simulation gets stuck in an infinite loop (likely adding to vcount) at the "m.d.comb += self.vcount.eq(vcount + 1)" line. Suggestions and thoughts welcome; I'm out of thoughts.
<ZirconiumX>
GenTooMan: you have a combinational loop
<ZirconiumX>
self.vcount.eq(self.vcount + 1) means "whenever vcount is updated, set vcount to vcount + 1"
<kernlbob_>
You can't run a counter in combinatorial logic. You probably want `m.d.sync += self.vcount.eq...`.
<ZirconiumX>
And setting vcount to vcount + 1 is an update of vcount
<ZirconiumX>
Thus the logic is infinite
<ZirconiumX>
You can do this synchronously as kernlbob_ suggests, but I don't know your goal here
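A toy model of why the simulator never finishes, written in plain Python rather than nmigen so it runs anywhere. It treats combinational logic as "re-evaluate until nothing changes". `settle`, `comb_increment` and the 10-bit `MASK` are hypothetical names and values picked for illustration; this is not how the nmigen simulator is implemented.

```python
MASK = 0x3FF  # hypothetical 10-bit vcount


def settle(comb_fn, value, max_iters=100):
    """Re-evaluate a combinational function until its output is stable."""
    for _ in range(max_iters):
        new = comb_fn(value)
        if new == value:
            return value  # fixed point reached: the logic has settled
        value = new
    raise RuntimeError("combinational loop: value never settles")


def comb_increment(vcount):
    """Models m.d.comb += vcount.eq(vcount + 1): output feeds its own input."""
    return (vcount + 1) & MASK


try:
    settle(comb_increment, 0)
except RuntimeError as exc:
    print(exc)
```

A register breaks the loop because the new value is only sampled on a clock edge, so each evaluation happens once per cycle instead of retriggering itself.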
<GenTooMan>
Hmm yes I am semi aware of that. What I desire is vcount to increment each time hcount rolls over (only). I thought I could create a second clock and toggle that clock each time hcount rolled over but creating another clock didn't seem that simple.
<tnt>
don't create clocks ...
<tnt>
really, if you don't know why you should avoid creating clocks, you really shouldn't be doing it. Stick to single clock domain logic for now.
<tnt>
Use enables.
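A cycle-level sketch of the enable approach, again in plain Python so it needs no HDL toolchain. Both counters update on the same clock; `vcount` only changes on cycles where `hcount` wraps. `H_TOTAL`, `V_TOTAL`, `tick` and `run` are made-up names with deliberately tiny limits. In nmigen the equivalent would presumably be a `with m.If(...)` block around the `m.d.sync +=` statement for `vcount`.

```python
H_TOTAL = 4  # hypothetical line length, kept tiny for the demo
V_TOTAL = 3  # hypothetical number of lines


def tick(hcount, vcount):
    """Model one rising clock edge and return the next (hcount, vcount)."""
    h_wrap = hcount == H_TOTAL - 1  # this is the enable for vcount
    next_h = 0 if h_wrap else hcount + 1
    if h_wrap:
        next_v = 0 if vcount == V_TOTAL - 1 else vcount + 1
    else:
        next_v = vcount  # enable low: the register holds its value
    return next_h, next_v


def run(cycles):
    """Collect the counter state seen on each of the given clock cycles."""
    h = v = 0
    trace = []
    for _ in range(cycles):
        trace.append((h, v))
        h, v = tick(h, v)
    return trace


for state in run(6):
    print(state)
```

Everything stays in one clock domain, so there is no derived clock to constrain and no domain crossing to worry about.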
<GenTooMan>
Time to dig through enables, and yes, I still have a headache from trying to create a second clock domain.
<ZirconiumX>
This brings up a question. Say you've got a 300 MHz main clock, but parts of the chip only need to run at half that. Do those bits have to meet 300 MHz timing still?
<tnt>
ZirconiumX: technically no, if you can ensure you won't use the result anywhere else that doesn't have an enable.
<tnt>
nextpnr however doesn't support multi-cycle clock constraints.
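The arithmetic behind a multi-cycle path, as a back-of-the-envelope sketch: if a register is only enabled every N cycles, the logic feeding it has N clock periods to settle. `path_budget_ns` is a hypothetical helper and ignores setup/hold margins, clock skew and routing.

```python
def path_budget_ns(clock_mhz, cycles_between_enables=1):
    """Time available for a path to settle, in nanoseconds.

    Simplified model: one clock period per cycle between enables,
    with no allowance for setup time, skew or jitter.
    """
    period_ns = 1000.0 / clock_mhz
    return cycles_between_enables * period_ns


print(round(path_budget_ns(300), 3))     # every cycle at 300 MHz
print(round(path_budget_ns(300, 2), 3))  # enabled every other cycle
```

A path enabled every other cycle at 300 MHz gets the same budget as a 150 MHz clock would give it, but only if the tool is told so; without a multi-cycle constraint the path is still checked against the full 300 MHz period.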
<ZirconiumX>
Fair. I guess the alternative is to run the 150MHz bit at 300MHz but pipeline bits over to meet timing?
<tnt>
yes, that's one option.
<davidc__>
ZirconiumX: or create two clock domains and use FIFOs for the domain crossing if that makes sense in your design
<ZirconiumX>
Some of the bits of the GS are FIFO-insulated, some aren't