_whitelogger has joined ##openfpga
ZipCPU has joined ##openfpga
<awygle> forgive my ignorance - how is configuration RAM organized in an SRAM FPGA (e.g. Lattice)?
<awygle> i had the naive idea that it was a giant shift register for some reason but clifford's documentation clearly says otherwise
<awygle> is it physically distributed around the chip? or is it a big block of RAM and then wires going all over the chip to carry the signals to muxes and whatnot?
<azonenberg> awygle: It's physically distributed around the chip
<azonenberg> A big block of ram would be absurd
<azonenberg> If the device has on-chip nonvolatile RAM, that is usually in one block
<azonenberg> then there's SRAM in the logic itself
<azonenberg> and some glue logic that copies the NVRAM to SRAM at boot
<awygle> azonenberg: yeah that didn't seem right. but it seems to be addressed in the same way as a block of RAM, if i'm reading this right
<azonenberg> Correct, except not necessarily byte sized
<awygle> 16-bit rows
<azonenberg> There are row and column addresses
<azonenberg> in xilinx parts, actually, the config ram is blockwise addressed
<azonenberg> in virtex5, for example, a frame is 1312 bits (41 32-bit words) and that's the smallest addressable unit
<azonenberg> So there may actually be a shift register going on within that area, i'm not sure
<awygle> interesting. that seems like it would shrink the address logic somewhat
<azonenberg> But each block has a unique address in (x,y)
<azonenberg> a typical bitstream just writes to addresses sequentially
<azonenberg> but if you're doing partial reconfig, you can direct a bitstream to portions of the device as small as a single block
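The Virtex-5 frame arithmetic quoted above can be checked in a few lines. This is only a sketch: `FRAME_WORDS`, `WORD_BITS` and `frame_span` are hypothetical names, and laying frames out back to back in a flat payload is a simplifying assumption, not a description of the real frame address format.

```python
# Frame size quoted in the discussion: 41 words of 32 bits each.
FRAME_WORDS = 41
WORD_BITS = 32
FRAME_BITS = FRAME_WORDS * WORD_BITS  # 1312


def frame_span(frame_index):
    """Return (first, last) bit offsets of a frame in a flat payload.

    Assumes frames are packed back to back, which is a simplification.
    """
    first = frame_index * FRAME_BITS
    return first, first + FRAME_BITS - 1


print(FRAME_BITS)     # 1312
print(frame_span(2))  # (2624, 3935)
```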
<awygle> so is the addressing done essentially for flexibility then? it seems like a chip-wide shift register would be smaller, silicon-wise, since there's no need for row and column decoders etc
<azonenberg> it might be, but that renders you unable to ever configure part of the chip
<awygle> right
<azonenberg> This is an XC2C32A
<azonenberg> You're looking at bare silicon after etching off all metal interconnect and polysilicon gates
<azonenberg> and staining P-type doping brown
<azonenberg> (N vs undoped are indistinguishable in this image)
<azonenberg> This image is optical with a 100x objective, you can see gates but don't have the resolution to see individual transistors clearly
<azonenberg> (the chip is 180 nm which is just a little smaller than half the wavelength of the light we're using to image)
<azonenberg> anyway, this particular device is SRAM based but also has on-chip EEPROM for nonvolatile config
<azonenberg> you should easily be able to see four distinct large areas of the chip plus some smaller ones
<azonenberg> First, you have the I/O pad ring around the whole chip
<azonenberg> Then you have the main logic area which is the top ~2/3
<azonenberg> Then there's the mostly-dark memory arrays bottom center, and a U-shaped area around them that's generally lighter in color
<azonenberg> got it?
<awygle> yep
<azonenberg> So, the dark arrays are the EEPROM
<azonenberg> It's split into five blocks
<azonenberg> All are 49 rows high
<azonenberg> You have one ten bits wide at the left that has a valid bit and nine bits of macrocell config
<azonenberg> then one 112 bits wide that has PLA AND/OR array config
<azonenberg> one 16 bits wide that has global routing config
<azonenberg> another 112 bit PLA config memory
<azonenberg> then (way off at the far right side) another ten bit macrocell + valid
<azonenberg> The U-shaped light colored area is JTAG and EEPROM programming logic that has not been well studied
<azonenberg> Somewhere either in that area, or running horizontally across the top of the memory, is a 274 bit shift register
<azonenberg> Which has a couple of padding bits, six address bits, and 260 data bits (one bit per EEPROM column)
<azonenberg> During JTAG programming of this part, you shift in an address+data block then it programs either EEPROM or the configuration SRAM directly
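The column widths and shift register fields described above can be tallied as a rough sketch. The group names, the field order (padding, then address, then data) and the MSB-first address are assumptions made for illustration. The log only gives the widths. By this arithmetic 8 bits are left over as padding.

```python
# EEPROM column groups as described above, left to right: (name, width in bits).
COLUMN_GROUPS = [
    ("macrocell_left", 10),   # valid bit + nine macrocell config bits
    ("pla_left", 112),
    ("global_routing", 16),
    ("pla_right", 112),
    ("macrocell_right", 10),
]
DATA_BITS = sum(width for _, width in COLUMN_GROUPS)  # one bit per EEPROM column
ADDR_BITS = 6
SHIFT_REG_BITS = 274
PAD_BITS = SHIFT_REG_BITS - ADDR_BITS - DATA_BITS     # whatever is left over


def build_shift_word(address, data):
    """Pack one row write as a list of bits.

    Field order and address bit order are assumptions, not the real layout.
    """
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address does not fit in six bits")
    if len(data) != DATA_BITS:
        raise ValueError("expected one data bit per EEPROM column")
    addr_bits = [(address >> i) & 1 for i in reversed(range(ADDR_BITS))]
    return [0] * PAD_BITS + addr_bits + list(data)


word = build_shift_word(5, [0] * DATA_BITS)
print(DATA_BITS, PAD_BITS, len(word))
```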
<azonenberg> Anyway, if you turn your attention to the actual logic fabric...
<azonenberg> You can see it's split into five regions horizontally left to right
<azonenberg> and is symmetric about the X axis
<azonenberg> (roughly)
<awygle> sure
<azonenberg> The X axis layout matches that of the config ram
<azonenberg> at the left and right side, you have the macrocells
<azonenberg> Sixteen high on each half of the chip for a total of 32
<azonenberg> eight above and below the central spine
<azonenberg> Each macrocell has 27 config bits, organized in a 3 row x 9 bit wide block
<azonenberg> The config bits are individual SRAM cells that share bit/word lines spanning the entire chip in an X-Y grid, but have poly or metal-1 interconnect going from the SRAM Q/nQ lines to various logic throughout the device
<azonenberg> anyway, as you can see the macrocell logic is directly above the macrocell EEPROM (though wider, so the SRAM bitlines have to go diagonally a bit during the fanout from the EEPROM)
<azonenberg> The bitlines run directly from the EEPROM sense amps / JTAG shift register group at the top of the EEPROM up vertically to the associated CPLD logic fabric, although they're not visible in this image since they got etched off
<azonenberg> anyway, moving closer to the center of the device
<azonenberg> the wide blocks just left/right of center are the actual PLA
<azonenberg> Each of the PLA blocks is divided into 3 vertically
<azonenberg> At top and bottom you have AND array, at middle you have OR array
<azonenberg> (this chip is based on sum-of-products expressions rather than LUTs)
<azonenberg> Each of the AND array blocks is 56 product terms wide and 20 inputs high
<awygle> right
<azonenberg> Since each AND can be either X or nX as input
<azonenberg> there are 112 SRAM bits in each row
<azonenberg> with x-enable and nX-enable, one-hot
<azonenberg> Then the output of the blocks goes into the OR array, which is logically 56 product terms wide x 16 OR gates high
<azonenberg> although physically, two OR gates share one row to keep the config memory 112 bits wide
<azonenberg> So instead of having x-enable and nX-enable for each pterm
<azonenberg> you have or1-enable and or2-enable for each pterm
<azonenberg> in each row
<azonenberg> Make sense?
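A small behavioural model of the AND/OR arrays described above may help. It is a sketch under assumptions: enables are treated as active-high, an input with neither enable set is left out of the product term, and the function names are made up for illustration.

```python
N_PTERMS = 56
ROW_BITS = N_PTERMS * 2  # two enable bits per product term in each config row


def eval_pterm(inputs, true_en, comp_en):
    """AND together the enabled literals of one product term.

    true_en[i] enables input X_i, comp_en[i] enables its complement nX_i.
    Inputs with neither enable set do not take part in the AND.
    """
    result = True
    for x, t, c in zip(inputs, true_en, comp_en):
        if t:
            result = result and bool(x)
        if c:
            result = result and not x
    return result


def eval_or(pterms, or_en):
    """OR together the product terms enabled for one OR gate."""
    return any(p for p, en in zip(pterms, or_en) if en)


# Product term "a AND NOT b" over three inputs (a, b, c); c is unused.
print(eval_pterm([1, 0, 1], [1, 0, 0], [0, 1, 0]))  # True
print(eval_pterm([1, 1, 1], [1, 0, 0], [0, 1, 0]))  # False
```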
<awygle> almost entirely, the one thing where i got lost for a bit was "The config bits are individual SRAM cells that share bit/word lines spanning the entire chip in an X-Y grid, but have poly or metal-1 interconnect going from the SRAM Q/nQ lines to various logic throughout the device", specifically the bit/word line part towards the beginning
<azonenberg> I'm getting there :)
<awygle> haha okay :)
<azonenberg> let me describe the floorplan so you know what happens when you zoom in
<awygle> sure
<azonenberg> Anyway, the last bit is the central spine
<azonenberg> The very center of the chip is global muxes and things for the clock tree etc
<azonenberg> Above and below that is global routing
<azonenberg> 20 rows above and below, each feeding one PLA input left and one right
<azonenberg> The circuits are paired in a mirror image relative to each other
<azonenberg> so you'll see ten identical blocks of stuff
<azonenberg> each feeding two bits left and two right
<azonenberg> but those ten blocks are just two identical blocks mirrored back to back
<azonenberg> Although it looks symmetric left-right, it is not quite
<azonenberg> Fundamentally, that whole structure is a bunch of 8:1 one-hot muxes
<azonenberg> You have a bit to select Vdd, Vss, or one of six data inputs
<azonenberg> then drive it out into a high-fanout driver
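The 8:1 one-hot mux just described can be modelled roughly as follows. The order of the select bits (Vdd, Vss, then the six data inputs) is an assumption, and `one_hot_mux` is a hypothetical name. Returning `None` stands in for a floating output when nothing is selected.

```python
def one_hot_mux(select, data_inputs):
    """Model one 8:1 one-hot mux of the global routing.

    select is eight bits ordered [vdd, vss, in0..in5] (assumed order).
    Returns 1 or 0, or None if no select bit is set.
    """
    if len(select) != 8 or len(data_inputs) != 6:
        raise ValueError("expected 8 select bits and 6 data inputs")
    if sum(select) > 1:
        raise ValueError("more than one select bit set: contention")
    candidates = [1, 0] + list(data_inputs)
    for sel, value in zip(select, candidates):
        if sel:
            return value
    return None


print(one_hot_mux([1, 0, 0, 0, 0, 0, 0, 0], [0] * 6))             # constant 1
print(one_hot_mux([0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 1, 0, 0, 0]))  # third data input
```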
<azonenberg> See how each row of the routing looks like a bunch of small gates then one giant block left and right?
<awygle> yes, guessing those are the drivers and are built with bigger transistors for higher fanout
<azonenberg> Yes
<azonenberg> That giant block is a 3-stage inverter cascade
<azonenberg> If you zoom in closer you'll see the 3 stages going from small to large as you move out from the center cascade
<azonenberg> First two stages are vertical then the third is horizontal
<azonenberg> they're multi-fingered transistors so you'll see multiple channels in the image
<azonenberg> on the metal layer the channels are parallelled
<awygle> makes sense
<azonenberg> You can also see the symmetry about the X axis when you zoom in a bit
<azonenberg> the driver has a separator down the centerline, everything above vs below is the same but mirrored
<azonenberg> http://thanatos.virtual.antikernel.net/unlisted/zia_final_gate_transistor_and_m1_render.png is the global routing and a little bit of the PLA AND gate traced out (by hand) and with net names labeled
<azonenberg> Blue = metal 1, red = poly, black = via, yellow = P doping, green = N doping
<azonenberg> If you look at the far left of the image, you'll see a bunch of SRAMv1x1 cells
<azonenberg> which are 6-transistor SRAM cells
<azonenberg> The horizontal red wire is the SRAM word line, any time it crosses the "donut" shape that's a row-select transistor
<azonenberg> then the two vertical red wires in the SRAM cell are the inverter loop that stores the actual config bit
<azonenberg> when it crosses the yellow those are P-fets for the high side of the inverter
<azonenberg> when it crosses the green that's an N-fet for the low side
<azonenberg> The vias at 3 and 9 o'clock of the donut go up to the SRAM word lines
<azonenberg> bit lines*
<azonenberg> sorry
<azonenberg> The bit lines run vertically through the array on metal 2 so you can't see them here
<azonenberg> Make sense or are you not familiar with how SRAM works at this level?
<awygle> no, i've got it now i think
<azonenberg> Well, this is for the PLA
<azonenberg> It gets more fun when you move inboard to the logic area
<azonenberg> So if you take a look just to the right of the leftmost high-fanout buffer
<azonenberg> at the top
<azonenberg> you'll see a single SRAMv0x1 cell
<azonenberg> This one has the metal layer drawn, you can see FB1_PULLUP_x drawn on the metal layer for the inverter loop
<azonenberg> WL_TOP is the word line, this is actually routed on metal 3 so you don't see a long continuous word line on the poly layer
<azonenberg> it just connects up through the metal stack to each bit cell individually
<azonenberg> The SRAM cell is unfolded vs being in a donut shape
<azonenberg> it's now a straight line
<azonenberg> And you can see FB1_PULLUP_BL_x as the bit lines
<azonenberg> These are closely routed as a diffpair on metal 2
<azonenberg> You can see in this case, the _N output of the SRAM is connected directly on the poly layer to the input of a NOR2
<azonenberg> Which then drives a PASSPv0x1 cell (P-channel pass transistor) to connect MUXOUT_FB1 to either VCCINT or float
<awygle> that would be FB1_PULLUP_N?
<azonenberg> Correct
<azonenberg> Which is the net that says "connect MUXOUT_FB1 to drive a constant 1"
<azonenberg> As you move to the right there's similar circuitry to connect MUXOUT_FB1 to Vss or six data inputs
<azonenberg> you can see the SRAM bitlines
<azonenberg> FBx_PULLUP_BL_x
<awygle> mhm
<azonenberg> Those span the entire chip vertically from the top of the logic array down to the JTAG shift register / EEPROM readout circuitry
<azonenberg> This is metal 3
<azonenberg> You can see the one long MUXOUT_FB1 line that connects to all 8 mux blocks and then to the left/right row driver
<awygle> and the word line is going horizontally across the chip on this layer
DocScrutinizer05 has quit [Disconnected by services]
<azonenberg> Correct
DocScrutinizer05 has joined ##openfpga
DocScrutinizer05 has quit [Read error: Connection reset by peer]
<azonenberg> WL_TOP
<azonenberg> You can see the MUXIN_TOP_xx lines as well
<azonenberg> which are the mux inputs
<azonenberg> The global routing is structured as a sparse crossbar
DocScrutinizer05 has joined ##openfpga
<azonenberg> On metal 4 (http://thanatos.virtual.antikernel.net/unlisted/zia_final_m4_render.png) there's a big bus for VCCINT and GND, which drive 0 and 1 plus obviously powering stuff
<azonenberg> then there's six groups of 11 (or 10 in the rightmost case) wires
<azonenberg> These are all of the possible inputs to the CPLD logic (32 flipflops and 33 input pins... simplifying a bit, there's some muxing in the io cells)
<azonenberg> Each of the 40 rows of the global interconnect can route Vdd, Vss, or one signal from each group to its output
<azonenberg> there's a via from M3 to M4 that connects MUXIN_TOP_xx to one of the 11 signals
<azonenberg> in a different spot for each row
<azonenberg> You don't need 100% connectivity since all inputs of an AND gate are logically indistinguishable from each other
<awygle> thus explaining "via mux for input #"
<azonenberg> Yep
<azonenberg> so you don't have to be able to route FB1_1_IBUF to all 40 rows
<azonenberg> as long as you have it routed to enough rows that any possible 40 of the 65 can be routed
<azonenberg> in practice i think a tiny fraction of 40-of-65 combinations are not possible
<azonenberg> i read a paper on how they designed the coolrunner XPLA3 routing and i assume CR-2 is basically the same process
digshadow has quit [Ping timeout: 240 seconds]
<azonenberg> They added enough routing to fully route all 36-input functions
<azonenberg> as well as the vast majority of 37-40
<azonenberg> but accepted a tiny fraction of very complex designs not fitting in exchange for not making the matrix larger
<azonenberg> The via mux settings start out logically 1,2,3,4,5 at the top left of the array then progressively get more scrambled as you go right and down
<azonenberg> presumably they used some kind of iterative algorithm to perturb the mux settings until it was good enough
<azonenberg> there's no rhyme or reason to the mux settings and extracting that pattern is one of the requirements to RE any particular coolrunner bitstream
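Whether a set of signals can be routed through a sparse crossbar like this is a bipartite matching problem: every needed signal has to land on its own row, and each row can only pick from the signals its vias reach. The sketch below uses augmenting-path matching on a made-up four-row table. The signal names and `ROWS` contents are hypothetical and are not the real XC2C32A via pattern.

```python
def route(required, row_choices):
    """Assign each required signal to a distinct row, or return None.

    row_choices[r] is the set of signals row r can select through its via mux.
    Uses augmenting-path bipartite matching.
    """
    row_of = {}  # row index -> signal currently assigned to it

    def try_assign(signal, seen):
        for r, choices in enumerate(row_choices):
            if signal in choices and r not in seen:
                seen.add(r)
                # Take a free row, or move its current signal somewhere else.
                if r not in row_of or try_assign(row_of[r], seen):
                    row_of[r] = signal
                    return True
        return False

    for s in required:
        if not try_assign(s, set()):
            return None
    return {sig: r for r, sig in row_of.items()}


# Toy connectivity table, purely illustrative.
ROWS = [{"a", "d"}, {"a", "e"}, {"b", "d"}, {"c", "e"}]
print(route(["a", "b", "c", "d"], ROWS))
print(route(["a", "b", "c", "d", "e"], ROWS))  # five signals, four rows
```

In the first call, signal `a` has to be moved off row 0 so that `d` can use it, which is the kind of rearrangement a fitter has to find.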
<azonenberg> So far i've only dumped the rom for the 32a
<azonenberg> but i could do it for others as needed
<azonenberg> The 64a appears to be basically the same routing structure, the 128 and larger may be a multi-level tree of some sort vs a one-level tree based on what we've seen in the bitstream
<azonenberg> but we don't have silicon photos to figure out the details yet
<azonenberg> Anyway, in this particular chip the word lines run across the entire die
<azonenberg> Which means the smallest reconfigurable unit is one row across the entire CPLD
<azonenberg> not a very practical way to do partial reconfiguration, but experimentally it does actually seem possible to reconfigure one row at a time over jtag
<azonenberg> it sometimes doesn't work, i think i might have metastability or reset issues or something if i don't do the full programming algorithm
<azonenberg> in a more complex device like an FPGA with real partial reconfig support, the word lines would be segmented
<azonenberg> so you could write a couple of rows for one block of the chip and reconfigure a contiguous 2D region of the device
<azonenberg> And your physical bitstream addresses would then be a series of words each with a (row, column) address
<awygle> okay, so in this chip the 6-bit address from the JTAG shift register gets decoded into a word line, and the 260 data bits from the shift register drive the bit lines for programming
<azonenberg> Correct
<azonenberg> I haven't actually looked for the WL decode logic, it might be along the left/right between the macrocells and the IO pads or it might be at the bottom of the chip in the JTAG block
<azonenberg> Wasn't important for what i was doing
<awygle> but in a smaller-chunk-size FPGA your address might decode into row and column bits, and you'd have sizeof(chunk) data bits programmed in parallel
<azonenberg> Yeah
<azonenberg> Most likely an actual config block would be 2D
<azonenberg> So you'd have say a 128-bit-long wordline
<azonenberg> and you'd write to 64 contiguous addresses with the same col and incrementing row
<azonenberg> to write to a 64x128 bit block that configured some 2D region of the device
<azonenberg> Since it makes no sense to reconfigure e.g. half of a LUT
<awygle> sure
<azonenberg> the reconfigurable blocks are generally a bunch of logic or io resources and the associated switch boxes
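The 2D block write described above (a 128-bit word line, written at 64 consecutive row addresses in the same column) can be sketched as a toy memory model. `ConfigMemory` and its methods are hypothetical names. The sizes are the example figures from the discussion, not those of any particular device.

```python
class ConfigMemory:
    """Toy model of configuration SRAM with segmented word lines."""

    def __init__(self, wordline_bits=128):
        self.wordline_bits = wordline_bits
        self.words = {}  # (row, col) -> list of bits on that word-line segment

    def write_word(self, row, col, bits):
        if len(bits) != self.wordline_bits:
            raise ValueError("data must fill exactly one word-line segment")
        self.words[(row, col)] = list(bits)

    def write_block(self, col, first_row, rows_of_bits):
        """Write consecutive rows in one column: same col, incrementing row."""
        for offset, bits in enumerate(rows_of_bits):
            self.write_word(first_row + offset, col, bits)


mem = ConfigMemory()
mem.write_block(col=3, first_row=0, rows_of_bits=[[0] * 128 for _ in range(64)])
print(len(mem.words), sum(len(w) for w in mem.words.values()))
```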
<awygle> only semi-related, do the JTAG drivers have to be stronger than the SRAM feedback inverters to drive properly?
<awygle> to write the SRAM properly, rather
<azonenberg> This is true for SRAM in general, whether in an FPGA or otherwise
<azonenberg> Typically the feedback inverters are just strong enough to hold the bit reliably
<azonenberg> and the bitline drivers drive a lot harder
<awygle> right, just wanted to double check that i understood that right
<awygle> thanks for the 101 class, i really appreciate it!
Zarutian has quit [Quit: Zarutian]
<awygle> easily 4 credits
<azonenberg> lol
DocScrutinizer05 has quit [Ping timeout: 260 seconds]
DocScrutinizer05 has joined ##openfpga
theMagnumOrange has joined ##openfpga
<cyrozap> azonenberg: Ah, that's kind of a bummer about the Spartan-6, since there's so many cheap dev boards/products out there that use them, and so few cheap (i.e. sub-$50 range) 7-series dev boards.
<azonenberg> That will change when spartan7 comes out, i think
<cyrozap> azonenberg: And it's a bummer for me personally because I have a bunch of LX100 and LX150-based devices (because... uh... "reasons") that I'd love to have a FOSS toolchain for. I guess I know what I'll be working on after the PSoC stuff :P
<azonenberg> Lol
<azonenberg> Well we can potentially work on s6 at some point, i just don't see it as a priority
<azonenberg> vs s3 (simple) and 7 series (modern)
<cyrozap> I totally understand
<cyrozap> The 7-series stuff is definitely much more interesting
<rqou> i'm still in favor of jumping straight to 7-series without doing s3
<azonenberg> I feel like s3 is a close enough ancestor of 7 that i'd learn a lot from studying the simpler interconnect
<azonenberg> the chip is larger process and easier to deprocess/image
<azonenberg> cheaper to get samples for destructive imaging
<rqou> i also really want to see (as a test) vpr support for ice40
<azonenberg> That would be cool too
<azonenberg> Honestly i don't think vpr is the best way to go if scaling to large 7-series parts is the goal
<azonenberg> afaik vpr is basically smart annealing
<azonenberg> i'd rather go with a global mass-spring type routing algorithm from the get-go
<azonenberg> and design for multithreading and maybe even multi-server builds
<azonenberg> even if we normally run locally with only 1-4 threads
<azonenberg> that scalability essentially doesn't exist in any toolchain that targets a real architecture
<rqou> hmm i recall seeing a paper (that i couldn't be arsed to download) about a hybrid annealing+quadratic-wirelength approach for FPGAs
<rqou> that might be interesting to look at at some point
<azonenberg> Well what i'm saying is
<azonenberg> annealing doesn't parallelize well
<azonenberg> And i think once we have the bitstream figured out
<rqou> right, but quadratic-wirelength does
<azonenberg> oh? i'm not familiar with that alg
<azonenberg> Because it would be potentially worthwhile to try developing routing algorithms that scale all the way from tens to thousands of cores
<azonenberg> not 4 like ise
<rqou> that's basically the "global mass/spring intuition" algorithm
<azonenberg> oh, ok
<azonenberg> i'm imagining a pile of ec2 spot instances or just a rack of servers in a lab somewhere
<azonenberg> fully routing an xcku035 in <5 minutes
<rqou> we should focus on getting xc2par to return more right answers first :P
<azonenberg> lol yes
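The mass/spring idea mentioned above amounts to minimizing the sum of squared wire lengths, and its minimum puts every movable cell at the average position of the cells it connects to. The one-dimensional sketch below relaxes towards that point with repeated averaging. The function name, the two-pin net model and the iteration count are assumptions for illustration. A real placer would also need legalization and a proper sparse solver.

```python
def quadratic_place(movable, fixed, nets, iterations=200):
    """Place cells in 1D by minimizing the sum of squared two-pin net lengths.

    fixed maps pad names to positions; nets is a list of (u, v) pairs.
    Each pass moves every movable cell to the mean of its neighbours.
    """
    pos = dict(fixed)
    for cell in movable:
        pos[cell] = 0.0
    neighbors = {cell: [] for cell in movable}
    for u, v in nets:
        if u in neighbors:
            neighbors[u].append(v)
        if v in neighbors:
            neighbors[v].append(u)
    for _ in range(iterations):
        for cell in movable:
            if neighbors[cell]:
                total = sum(pos[n] for n in neighbors[cell])
                pos[cell] = total / len(neighbors[cell])
    return {cell: pos[cell] for cell in movable}


# Chain: pad0 at 0.0 -- a -- b -- pad1 at 3.0
placed = quadratic_place(
    movable=["a", "b"],
    fixed={"pad0": 0.0, "pad1": 3.0},
    nets=[("pad0", "a"), ("a", "b"), ("b", "pad1")],
)
print(placed)  # a settles near 1.0, b near 2.0
```

This loop updates cells one after another. A variant in which every cell reads only the previous pass's positions lets all cells update independently, which is where the parallelism comes from.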
<rqou> alright, time for me to get up and have brunch
amclain has quit [Quit: Leaving]
<cyrozap> Regarding P&R, would it be possible to design a library so that it could be used for many different architectures? I'm just thinking it would be really time-consuming to have to write a new tool for every chip (arachne-pnr for ICE40, xc2par, gp4par, etc.) and as a FOSS project, one of our strengths over the proprietary tools is that we can share code (and by extension algo/implementation improvements).
<azonenberg> That is what VPR does
<azonenberg> but AIUI it's based on annealing and i want to try a different algorithm
* cyrozap isn't familiar with the acronym "VPR"
<azonenberg> "versatile place and route" iirc
<azonenberg> it's a primarily research based par tool that is meant to work on toy architectures for researching par algorithms
<azonenberg> afaik
<azonenberg> it hasn't been used much with real chips that i know of
<azonenberg> Also, xc2par and gp4par are already sharing code
<azonenberg> xbpar is the "crossbar and logic stuff" par engine
<azonenberg> gp4par is the greenpak front end to it
<pointfree> I'm trying to figure out the PSoC 5LP status and control blocks. I can't imagine the HC switches below status and control are any different and the HC still needs to be configured to route to the status or control blocks. I guess the control block is shorting away routes inside the UDB and that's how it does its thing.
<pointfree> (I thought that's what vpr is)
<cyrozap> azonenberg: I was thinking more along the lines of a library/tool that contains a bunch of different algos, and you just feed it tech cells and routing models (and maybe specify which algo to use)
m_w has quit [Quit: leaving]
<azonenberg> So more modular like yosys? or what
<azonenberg> I have to spend a while reading papers on lut-based par engines and playing with raw bitfiles before i even attempt to do something with FPGAs
<azonenberg> i know crossbar architectures very well
<azonenberg> The verilog-to-routing project uses ODIN for synthesis, but afaik yosys is a lot better
<azonenberg> then they use ABC for techmapping, which yosys does internally
<azonenberg> then VPR for P&R
<cyrozap> azonenberg: Yeah, something like yosys, I think. Really I'm just lazy and terrible at software and would much rather build on the (proven) work of others than do everything myself :P
<azonenberg> Lol
<azonenberg> well i want to share code too
<azonenberg> i just need to spend time fooling with vpr and see if it does what we want
pie_ has quit [Ping timeout: 260 seconds]
wolfspra1l has quit [Ping timeout: 260 seconds]
eduardo__ has quit [Ping timeout: 276 seconds]
eduardo__ has joined ##openfpga
X-Scale has quit [Read error: Connection reset by peer]
Hootch has joined ##openfpga
<azonenberg> Sooo let's see, i still have to figure out what is wrong with the ZIA in my coolrunner emulator
<azonenberg> Components for the greenpak thermal characterization board arrived
<azonenberg> Stencil for the level shifter arrived
<azonenberg> Still waiting on the PCBs
<azonenberg> Those are ETA the 11th
<rqou> azonenberg: can you id the fab for this? :P https://s.zeptobars.com/batteriser-btr004k-HD.jpg
pie_ has joined ##openfpga
Zorix has quit [Ping timeout: 258 seconds]
Zorix has joined ##openfpga
pie_ has quit [Ping timeout: 276 seconds]
pie_ has joined ##openfpga
pie_ has quit [Ping timeout: 260 seconds]
pie_ has joined ##openfpga
<qu1j0t3> rqou: what _is_ that?
<qu1j0t3> nrossi: thanks
<azonenberg> rqou: "analog" :p
<azonenberg> i only know digital fabs
<azonenberg> analog processes have too much hand done stuff
azonenberg_work has quit [Ping timeout: 255 seconds]
<balrog> azonenberg: how about MEMS?
pie_ has quit [Changing host]
pie_ has joined ##openfpga
kristianpaul has quit [Quit: leaving]
kristianpaul has joined ##openfpga
X-Scale has joined ##openfpga
sn00n has joined ##openfpga
<sn00n> hi
amclain has joined ##openfpga
lexano has quit [Ping timeout: 276 seconds]
lexano has joined ##openfpga
digshadow has joined ##openfpga
<pie_> whatd i miss
azonenberg_work has joined ##openfpga
philpem has joined ##openfpga
Hootch has quit [Quit: Leaving]
<qu1j0t3> pie_: I'm happy with my TDS460A 400MHz that arrived this week
<cr1901_modern> I'm happy with my... oh wait, I don't have an oscilloscope
lexano has quit [Ping timeout: 260 seconds]
lexano has joined ##openfpga
philpem has quit [Ping timeout: 240 seconds]
<pie_> cr1901_modern, \o/
m_w has joined ##openfpga
mifune has joined ##openfpga
mifune has quit [Changing host]
philpem has joined ##openfpga
digshadow has quit [Quit: Leaving.]
uovo has joined ##openfpga
cr1901_modern1 has joined ##openfpga
rqou_ has joined ##openfpga
cr1901_modern has quit [Read error: Connection reset by peer]
oeuf has quit [Read error: Connection reset by peer]
rqou has quit [Quit: ZNC 1.7.x-git-709-1bb0199 - http://znc.in]
rqou_ is now known as rqou
ChickeNES has quit [Ping timeout: 240 seconds]
ChickeNES has joined ##openfpga
digshadow has joined ##openfpga
mifune has quit [Ping timeout: 260 seconds]
mifune has joined ##openfpga
lexano has quit [Ping timeout: 260 seconds]
azonenberg_work has quit [Ping timeout: 240 seconds]
azonenberg_work has joined ##openfpga
mifune has quit [Ping timeout: 268 seconds]
cr1901_modern1 is now known as cr1901_modern
lexano has joined ##openfpga