proteus-guy has quit [Remote host closed the connection]
futarisIRCcloud has quit [Quit: Connection closed for inactivity]
futarisIRCcloud has joined #nmigen
Degi has quit [Ping timeout: 258 seconds]
Degi has joined #nmigen
____ has joined #nmigen
rohitksingh has joined #nmigen
futarisIRCcloud has quit [Quit: Connection closed for inactivity]
proteus-guy has joined #nmigen
proteus-dude has joined #nmigen
futarisIRCcloud has joined #nmigen
jeanthom has joined #nmigen
chipmuenk has joined #nmigen
<Sarayan>
I'm looking for a recommendation on how to do something in hardware/on a fpga
<Sarayan>
but I'm not sure if anybody is awake :-)
<sorear>
o/
<Sarayan>
yay
<Sarayan>
so, it's about sprite drawing
<Sarayan>
I want to draw a line of a zoomed sprite into a line buffer (double-buffered)
<Sarayan>
the line buffer has one entry (24 bits) per horizontal pixel
<Sarayan>
the hardware passes over all the active sprites and draw the line of the sprite in the buffer
<zignig>
Sarayan: unce unce unce unce.
<Sarayan>
an interesting characteristic is that the peak drawing speed is 8 pixels per pixel clock
<Sarayan>
and the rom read capability is also 8 pixels at a time
<zignig>
Sarayan: nice work, sauce ?
<Sarayan>
so, I have the position of the horizontal center of the sprite, and the 10-bit 4.6 zoom factor
<Sarayan>
the zoom factor adds to the position in the original image, e.g. 1 is zoom by 64x, 0x3ff is divide by almost 16
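A small behavioral model of how a 10-bit 4.6 zoom value can be read as "source pixels advanced per screen pixel". It assumes the position accumulator is kept in the same 4.6 fixed-point format and that nearest-neighbour sampling means taking the integer part. The names `step_to_ratio` and `source_columns` are hypothetical, and this is not the original hardware's logic:

```python
from typing import List

FRAC_BITS = 6  # 4.6 format: 4 integer bits, 6 fractional bits


def step_to_ratio(step: int) -> float:
    """Source pixels advanced per screen pixel for a 10-bit 4.6 zoom value."""
    assert 0 < step < (1 << 10), "zoom value must be a non-zero 10-bit number"
    return step / (1 << FRAC_BITS)


def source_columns(step: int, count: int, start: int = 0) -> List[int]:
    """Nearest-neighbour source column for each of `count` screen pixels.

    `start` is the source position of the first screen pixel in 4.6 fixed point.
    Keeping the accumulator in fixed point avoids rounding drift along the line.
    """
    cols = []
    pos = start
    for _ in range(count):
        cols.append(pos >> FRAC_BITS)  # integer part selects the source pixel
        pos += step                    # advance by the zoom step, still fixed point
    return cols
```

With this reading, `step_to_ratio(1)` is 1/64 (64x magnification) and `step_to_ratio(0x3ff)` is 1023/64, about 15.98 (shrink by almost 16). A step of 32 repeats every source pixel twice: `source_columns(32, 4)` gives `[0, 0, 1, 1]`.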
<sorear>
8 pixels = 24 bytes?
<sorear>
are the sprites in the same format as the line buffer?
<Sarayan>
sorear: yeah, it's 24 bytes, the sprite pixels are a little smaller, there is per-sprite information added that generates the extra bits. Not an issue
<sorear>
what does "peak drawing speed" mean? how much already exists and how much needs to exist?
<sorear>
this sounds like emulating an old console gpu?
<Sarayan>
I mean that the hardware I'm trying to kinda-imitate can draw 3072 pixels in a 384 pixel clock line
<Sarayan>
So I guess I have to make the linebuffer 24 bytes wide and pretty much do write coalescing
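A sketch of the "24 bytes wide" line buffer as a behavioral model: eight 24-bit pixels per word (192 bits), 512 / 8 = 64 words, and a single write port with a per-pixel write enable so a partially covered word can be written in one access. The class name and the use of `None` for a masked-off lane are illustrative assumptions:

```python
from typing import List, Optional

PIXELS_PER_WORD = 8
LINE_WIDTH = 512
WORDS = LINE_WIDTH // PIXELS_PER_WORD  # 64 words of 8 x 24-bit pixels (24 bytes each)


class LineBuffer:
    """Line buffer with one 8-pixel-wide write port and per-pixel write enables."""

    def __init__(self) -> None:
        self.words = [[0] * PIXELS_PER_WORD for _ in range(WORDS)]
        self.writes = 0  # number of write-port accesses used

    def write(self, word_addr: int, pixels: List[Optional[int]]) -> None:
        """Write one 8-pixel word; None lanes are masked off (left unchanged)."""
        assert len(pixels) == PIXELS_PER_WORD
        word = self.words[word_addr]
        for lane, px in enumerate(pixels):
            if px is not None:
                word[lane] = px & 0xFFFFFF  # 24-bit entry
        self.writes += 1

    def read_pixel(self, x: int) -> int:
        return self.words[x // PIXELS_PER_WORD][x % PIXELS_PER_WORD]
```

At one such write per pixel clock, a 384-clock line allows 384 word writes, which matches the 3072-pixel (384 x 8) peak only if every write is fully coalesced.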
<sorear>
is that 3072 pre-scaled or post-scaled pixels
<Sarayan>
post-scaled
<sorear>
is scaling always up or down, and is it nearest-neighbor or more complicated
<Sarayan>
in fact it's peak, so being slower when reducing wouldn't be an issue
<Sarayan>
nearest-neighbor
<Sarayan>
scaling is 10 bits 4.6, goes from 64x to /16
<Sarayan>
rom read access is 8 aligned pixels per clock
<Sarayan>
I'm not sure at all how to efficiently go from the unscaled pixels to the scaled ones, seems awfully muxy
<sorear>
I'm wondering if it would make more sense to render multiple sprites simultaneously at 1 pixel/sprite/clock
<sorear>
are the sprites always 8 px wide
<Sarayan>
nope, 8 to 128
<Sarayan>
sorry, 16 to 128
<Sarayan>
linebuffer is 512 pixels wide fwiw
<Sarayan>
I suspect pipelining will be needed to meet decent timings too
<sorear>
wondering if the original hardware ran at a significant multiple of the pixel clock
<Sarayan>
No, it didn't
<sorear>
you could have a decoupled pipeline that reads up to 8 pixels per clock from the sprite memory, feeds it into a scaler-black-box, then reads 8-pixel scaled "fragments" from the scaler's output FIFO and writes them to the line buffer at 8 pixels per clock
<sorear>
the scaler-black-box doesn't need to access any memory, so it could be replicated as needed
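A cycle-level sketch of the decoupled pipeline described above: reader, input FIFO, scaler black box, output FIFO, writer. Each stage moves at most one 8-pixel block per cycle and a full FIFO stalls the stage feeding it. `run_pipeline`, `scale_block` and `fifo_depth` are hypothetical; `scale_block` stands in for the scaler and maps one input block to zero or more output blocks:

```python
from collections import deque


def run_pipeline(src_blocks, scale_block, fifo_depth=4, max_cycles=10_000):
    """Simulate reader -> in_fifo -> scaler -> out_fifo -> writer, one block per stage per cycle."""
    src = deque(src_blocks)
    in_fifo, out_fifo = deque(), deque()
    pending = deque()  # scaler output not yet pushed into out_fifo
    written = []
    cycles = 0
    while src or in_fifo or pending or out_fifo:
        if cycles >= max_cycles:
            raise RuntimeError("pipeline did not drain")
        # Writer: one line-buffer write per cycle.
        if out_fifo:
            written.append(out_fifo.popleft())
        # Scaler: take a new input only when it has nothing left to emit.
        if not pending and in_fifo:
            pending.extend(scale_block(in_fifo.popleft()))
        if pending and len(out_fifo) < fifo_depth:
            out_fifo.append(pending.popleft())
        # Reader: one ROM read per cycle, stalls while the input FIFO is full.
        if src and len(in_fifo) < fifo_depth:
            in_fifo.append(src.popleft())
        cycles += 1
    return written, cycles
```

With a 2x upscale, `scale_block=lambda b: [b, b]`, the writer side becomes the bottleneck and the reader simply stalls. Because the scaler never touches memory, several copies could sit between the same two FIFOs.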
<Sarayan>
yeah, I guess something like that is needed
<Sarayan>
you just don't want to halve your draw rate just because your sprite horizontal position is not on a multiple of 8
<sorear>
if your memory is banked by pixel%8 you can do misaligned reads at full speed
<sorear>
although that does need a barrel shifter
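A sketch of the banking idea, assuming the sprite data could be stored as 8 banks indexed by pixel % 8 (the ROM in the original setup returns 8 aligned pixels, so this is a hypothetical layout). Each bank gets its own row address, so any 8 consecutive pixels come out in one access, rotated; a barrel rotate by start % 8 restores the order:

```python
from typing import List

BANKS = 8


def store_banked(pixels: List[int]) -> List[List[int]]:
    """Split a sprite row into 8 banks: bank b holds pixels b, b+8, b+16, ..."""
    return [pixels[b::BANKS] for b in range(BANKS)]


def read_misaligned(banks: List[List[int]], start: int) -> List[int]:
    """Fetch the 8 consecutive pixels starting at any `start` in a single access."""
    row, shift = divmod(start, BANKS)
    # Banks below start % 8 must read the next row; the rest read the current row.
    lanes = [banks[b][row + (1 if b < shift else 0)] for b in range(BANKS)]
    # Lane b now holds pixel (start rounded to the block) + b, so rotate by shift.
    return lanes[shift:] + lanes[:shift]
```

The caller must pad the row so `row + 1` exists for the last misaligned read; that is one of the boundary cases mentioned later in the discussion.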
<sorear>
the output position may be more of a problem
<sorear>
wait, you're talking about screen-space position, not sprite-space
<sorear>
is the 3072-pixel figure a "best case" or "guaranteed" number
<Sarayan>
probably best case
<Sarayan>
Each OBJ has 10(hex) attributes, you can define max 256 OBJs. The OBJ is line buffer system max 4096 dot (dot clock
<Sarayan>
8MHz) or 3084 dot (dot clock 6 or 12 MHz). If the OBJs exceeds this limit, the last OBJ to be written will disappear
<Sarayan>
(translated by a Japanese native speaker, please excuse the slightly broken English)
<Sarayan>
dot clock 8MHz has 512 pixels/line, 6MHz has 384, 12MHz has 384*2 but I suspect it just means the hardware can only do rendering at 6MHz and not 12
<Sarayan>
only the line read can reach the 12MHz
<Sarayan>
ok, the renderer does have a clock that's two times the dot clock
<Sarayan>
(forget the 12MHz case, it's not used anyway)
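A sketch of the per-line budget from the translated description, counting post-scaled pixels as stated earlier. At 8 pixels per dot clock, 512 dots give 4096 and 384 dots give 3072 (the translated text says 3084). How the original hardware treats the overflowing OBJ is not clear from the quote, so dropping it and every later OBJ in drawing order is an assumption, as is the function name:

```python
from typing import List


def visible_sprites(widths: List[int], budget: int) -> List[int]:
    """Indices of sprites drawn on one line under a post-scaled pixel budget.

    Assumption: sprites are drawn in list order, each costs its scaled width,
    and the first sprite that would exceed the budget (and all after it) is dropped.
    """
    used = 0
    drawn = []
    for i, width in enumerate(widths):
        if used + width > budget:
            break
        used += width
        drawn.append(i)
    return drawn
```

For example, four 1024-pixel-wide scaled sprites all fit in the 4096 budget, but only three fit in 3072.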
<sorear>
anyway I'm thinking that the scaler produces on its output fifo 8-aligned blocks of 8 pixels (transparent)
<sorear>
when scaling up you need to look at pairs of adjacent input blocks, that's a 16x8 crossbar
<sorear>
scaling down (at full speed allowed by the fifos) would be an 8x16 or 16x8 crossbar mapping incoming blocks into a work register that's periodically discharged
<sorear>
well I guess you don't need a work register, you could send a bunch of partial fragments down the pipe since that side isn't limiting
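A sketch of the "partial fragments" option: scaled pixels, given as (screen_x, color) pairs in increasing x, are grouped into 8-aligned fragments with masked-off lanes, and no work register is kept across words. When downscaling, the stream is sparse and simply yields partially filled fragments. `to_fragments` is a hypothetical name:

```python
from typing import List, Optional, Tuple


def to_fragments(pixels: List[Tuple[int, int]]) -> List[Tuple[int, List[Optional[int]]]]:
    """Group (screen_x, color) pairs into (word_addr, 8 lanes) fragments; None = lane not written."""
    frags: List[Tuple[int, List[Optional[int]]]] = []
    for x, color in pixels:
        addr, lane = divmod(x, 8)
        if not frags or frags[-1][0] != addr:
            frags.append((addr, [None] * 8))  # start a new aligned fragment
        frags[-1][1][lane] = color
    return frags
```

Each fragment maps onto one masked write of an 8-pixel-wide line buffer port.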
<Sarayan>
maybe it's two writes per pixel clock, and then you don't need coalescing
* zignig is glad people are noodling with nmigen.
<Sarayan>
I love nmigen
<Sarayan>
only sane hdl I've ever seen
<zignig>
yep same here, there have been other Python tries, but this one is cogent for me.
<sorear>
"two writes" raises a lot of cans of worms re. conflicting writes, you can do it with banking and multiport and both are terrible
<sorear>
if you can do a single wide write port the system will be much easier to reason about
<Sarayan>
sorear: I just mean using the clock that's 2x the pixel clock
<sorear>
ah
<sorear>
well if you can run the whole thing at 2x pixel clock, you can shrink the entire datapath by 2x, which cuts your crossbars by 4x
<Sarayan>
yeah, I can run the whole thing at 2x, but I don't see how it reduces the crossbars
<Sarayan>
so, let's see
<Sarayan>
you get the 8 pixels in, which go into a scaler
<Sarayan>
the scaler outputs (up to) 8 scaled pixels, and shifts out as much as needed
<Sarayan>
behind the scaler, another mux splits the up to 8 pixels into two writes depending on the screen alignment
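A sketch of that post-scaler split, assuming the scaler hands over up to 8 consecutive screen pixels starting at screen x. Rotating by x % 8 puts every pixel into its final lane; lanes at or after x % 8 belong to word x // 8, and the lanes that wrapped around belong to the next word. The function name is hypothetical:

```python
from typing import List, Optional, Tuple


def split_into_two_writes(x: int, pixels: List[int]) -> List[Tuple[int, List[Optional[int]]]]:
    """Split up to 8 pixels starting at screen x into at most two aligned, masked writes."""
    assert 0 < len(pixels) <= 8
    shift = x % 8
    lanes: List[Optional[int]] = [None] * 8
    for i, px in enumerate(pixels):
        lanes[(shift + i) % 8] = px  # rotate into the final screen lane
    first = [px if lane >= shift else None for lane, px in enumerate(lanes)]
    second = [px if lane < shift else None for lane, px in enumerate(lanes)]
    writes = [(x // 8, first)]
    if any(px is not None for px in second):
        writes.append((x // 8 + 1, second))  # only needed when the run crosses a word boundary
    return writes
```

With a clock at twice the pixel clock, the two writes fit in one pixel clock, which is the trade-off discussed next.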
<sorear>
4 pixels in
<sorear>
since you doubled the clock, the number of pixels you need to handle at once is halved
<Sarayan>
the input datapath is 64 bits wide, for 8x(up to)8bpp pixels
<Sarayan>
the rom reacts at pixel clock speed
<Sarayan>
don't want to have to accelerate it if I don't have to
<sorear>
in that case you can put a gearbox before the scaler to break 8-pixel input fragments into 4-pixel fragments at twice the clock
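A cycle-level sketch of such a gearbox, assuming the ROM word is latched once per pixel clock and a phase bit that toggles on every fast (2x) clock selects which half goes to the scaler. The function name is hypothetical:

```python
from typing import List


def gearbox_8_to_4(blocks: List[List[int]]) -> List[List[int]]:
    """Turn one 8-pixel block per slow clock into one 4-pixel block per fast clock."""
    latch: List[int] = []
    out = []
    for fast_cycle in range(2 * len(blocks)):
        phase = fast_cycle % 2  # toggles every fast (2x) clock
        if phase == 0:
            latch = blocks[fast_cycle // 2]  # new ROM word once per pixel clock
        out.append(latch[4 * phase: 4 * phase + 4])  # 2:1 mux on the latched word
    return out
```

In hardware this is just a 64-bit register and a 2:1 mux, so the ROM side stays at pixel clock speed.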
<Sarayan>
because if it ever goes on a real fpga that's going to route to an external sdram interface shared with other "roms"
<sorear>
my intent was that the scaler would only output already-aligned blocks
<Sarayan>
screen-aligned you mean?
<sorear>
yes
<Sarayan>
that's hard
<Sarayan>
because the zoom in is sprite-space, not screen-space
<Sarayan>
well, specifically it's "advance by this distance in the source for each pixel on the screen"
<sorear>
ok, I got that backward, I don't think it changes much but I need to update my model
<Sarayan>
and the zero is in sprite-space
<Sarayan>
technically, the information I have is "position 8/16/32/64 in the 16/32/64/128-wide sprite is at exact position x on the screen"
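A sketch of turning that anchor into a screen span, assuming source column width // 2 lands exactly on screen pixel center_x, each screen pixel advances the source by the 4.6 step, and sampling truncates. The original hardware's exact rounding is unknown; `sprite_span` is a hypothetical name:

```python
from typing import Tuple

FRAC_BITS = 6  # 4.6 fixed point


def sprite_span(center_x: int, width: int, step: int) -> Tuple[int, int, int]:
    """Return (first_x, end_x_exclusive, start_pos) for one zoomed sprite line.

    start_pos is the source position of first_x in 4.6 fixed point, in [0, step).
    """
    anchor = (width // 2) << FRAC_BITS  # source position at screen pixel center_x
    left = anchor // step               # screen pixels left of center still inside the sprite
    start_pos = anchor - left * step
    end = width << FRAC_BITS
    count = (end - start_pos + step - 1) // step  # ceil: pixels until source passes the right edge
    first_x = center_x - left
    return first_x, first_x + count, start_pos
```

For a 16-wide sprite at 1:1 (step 64) centered on x = 100 this gives screen pixels 92 to 107; at step 32 (2x) it gives 84 to 115.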
<sorear>
so when scaling up, the scaler's state variables are (a) current (leftmost) screen position (b) corresponding sprite position (fraction mod 4) (c) scale factor (less than 1 per assumption) (d,e) the sprite pixel blocks corresponding to the current and next sprite block (8 pixels total)
<sorear>
on each clock (a) compute sprite positions (relative to the sliding base) for the 4 screen pixels and the 5th next position (b) 8->4 crossbar mux to generate the screen pixel values (c) if next > 4, shift the current and next sprite-blocks and read a new next block from the input FIFO (d) update the screen and sprite position
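A behavioral sketch of those four steps for the upscaling case, assuming 4-pixel blocks after the 8->4 gearbox, a step of at most 64 (at most one source pixel per screen pixel), and a start position inside the first block. Left/right boundary handling is omitted, and the zero padding after the last block is a hypothetical choice:

```python
from typing import List

FRAC_BITS = 6
LANES = 4  # pixels per fast clock after the 8->4 gearbox


def upscale_line(blocks: List[List[int]], step: int, start_pos: int, n_clocks: int) -> List[List[int]]:
    """Produce one 4-pixel screen block per clock from 4-pixel sprite blocks."""
    assert 0 < step <= 1 << FRAC_BITS
    assert 0 <= start_pos < LANES << FRAC_BITS
    fifo = iter(blocks)
    pad = [0] * LANES  # hypothetical filler once the sprite data runs out
    cur, nxt = next(fifo, pad), next(fifo, pad)
    pos = start_pos
    out = []
    for _ in range(n_clocks):
        window = cur + nxt  # 8 candidate source pixels
        # (a) sprite positions of the 4 screen pixels and of the 5th (next clock's first)
        positions = [pos + i * step for i in range(LANES)]
        following = pos + LANES * step
        # (b) 8->4 crossbar: each output lane selects one window pixel
        out.append([window[p >> FRAC_BITS] for p in positions])
        # (c) once the current block is fully consumed, shift and pull a new block
        if following >> FRAC_BITS >= LANES:
            cur, nxt = nxt, next(fifo, pad)
            following -= LANES << FRAC_BITS
        # (d) update the sprite position; the screen position just advances by 4
        pos = following
    return out
```

Since pos stays below 4 source pixels and step is at most 1, every index stays inside the 8-pixel window, so only a current/next pair of blocks is needed.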
<sorear>
when scaling _down_ you have a different set of problems because the screen position isn't always %4 but similar logic applies
<sorear>
at least half the work will be handling left/right boundary cases :x
<Sarayan>
yeah
<Sarayan>
I don't exactly see how to make the scaler spout screen-aligned blocks though
<sorear>
it's like a funnel shift, but scaling
<Sarayan>
I feel like the initial shift may be complicated :-)
Asu has joined #nmigen
proteus-dude has quit [Ping timeout: 256 seconds]
proteus-dude has joined #nmigen
proteus-guy has quit [Read error: Connection reset by peer]
proteus-dude has quit [Remote host closed the connection]