<cyrozap-ZNC>
Hey, all! Before I go on a yak-shaving adventure, does anyone know of any logic analyzer cores I can just drop into an XC6SLX16 that support a 100 MSps sample rate and streaming data to a PC via an FT232H in synchronous 245 FIFO mode?
cyrozap-ZNC has quit [Quit: Client quit]
cyrozap has joined ##openfpga
<cyrozap>
... I realize this is a very specific request, but I'd be really embarrassed if I went and coded something like this up and it turned out that all or even just most of that effort was unnecessary.
<cyrozap>
Oh, and the context of this is that I want to snoop on an SPI bus running at around 20 MHz for several seconds, so I need to stream 100 MSps of at least 4 channels of data to my PC, and I don't want to buy new hardware to do all that (assuming that hardware even exists and works on Linux), so I'd like to use my Digilent Analog Discovery if I can, but the Discovery/WaveForms only has a buffer depth of 16k samples, so at 100 MSps that's only ~164 microseconds of data.
<cyrozap>
So my current idea is to write my own dumb LA core that always samples 8 channels at 100 MSps, stuffs each byte of that data into a deflate core (https://github.com/tomtor/HDL-deflate), and then spits the compressed stream out to the FT232H at (hopefully) <35 MB/s.
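A back-of-envelope check of that plan, as a sketch in Python (the 35 MB/s FT232H budget and the 100 MSps / 8-channel figures are the ones quoted in this conversation, not measurements):

# Rough numbers for the plan above: 8 channels at 100 MSps, squeezed through an
# FT232H in synchronous 245 FIFO mode.
sample_rate = 100e6                      # samples per second
channels = 8                             # one bit per channel
raw_rate = sample_rate * channels / 8    # bytes per second -> 100 MB/s raw

usb_budget = 35e6                        # assumed usable FT232H throughput, bytes/s
required_ratio = raw_rate / usb_budget

print(f"raw: {raw_rate / 1e6:.0f} MB/s, budget: {usb_budget / 1e6:.0f} MB/s, "
      f"need >= {required_ratio:.1f}:1 sustained compression")

So the compressor only has to hold roughly 3:1 sustained for the stream to fit the link.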
Bike has quit [Quit: leaving]
Bird|otherbox has quit [Read error: Connection reset by peer]
<azonenberg>
cyrozap: why deflate if you could just run length encode?
<azonenberg>
RLE works very well on most digital data especially if oversampled
<azonenberg>
and uses much less gates
<azonenberg>
if it's not enough compression then consider something more aggressive
<azonenberg>
but i'd be very doubtful that you could fit full LZ in a 6slx16 for a sane number of channels
<cyrozap>
azonenberg: Well, because it's done a good job of compressing my sigrok captures 10:1 (the SR file format is literally just a zip file with files filled with bytes, one byte for every 8 channels), it doesn't take up too many LUTs, it's fast in hardware (100 MHz according to that repo), and overall it just seemed like a good tradeoff between compression ratio, speed, and resource usage.
<cyrozap>
I suppose I should do a benchmark between deflate and RLE on my real-world captures, so I know for sure if deflate or RLE would be better.
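Something like the following would do for that benchmark, as a rough sketch (it assumes the capture is available as a raw one-byte-per-8-channel-sample file; the path is a placeholder, and the (count, value) RLE here is only a stand-in for whatever the naive implementation actually does; pure Python, so slow on a multi-hundred-MB capture, but fine for a sanity check):

import zlib

def rle_encode(data: bytes) -> bytes:
    """Naive byte-level RLE: (count, value) pairs with an 8-bit run length."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while run < 255 and i + run < len(data) and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

raw = open("capture.raw", "rb").read()          # placeholder path
deflated = zlib.compress(raw, 6)
rled = rle_encode(raw)
print(f"input   {len(raw):>12,} bytes")
print(f"deflate {len(deflated):>12,} bytes ({len(deflated) / len(raw):.1%})")
print(f"rle     {len(rled):>12,} bytes ({len(rled) / len(raw):.1%})")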
<cyrozap>
Ok, so far, with my naive RLE implementation, deflate is winning: 2.4 MB for deflate vs. 14 MB for RLE, both performed on a 182 MB input. Yeah, I could probably fiddle with things like increasing the number of bytes for the encoded length, using certain bits to indicate the number of bytes for the encoded length, etc., but I'm pretty sure I'm not a better compression engineer than the people who made deflate.
<TD-Linux>
deflate is going to crush rle because it has entropy coding and can take advantage of correlation between channels
<TD-Linux>
even moreso if you have a synchronized clock as then the dictionary will kick in
<sorear>
you can be a pretty terrible compression engineer and make something better *for a specific purpose* than the best general-purpose algorithms
<TD-Linux>
honestly I think you're going to have trouble beating general purpose lossless compression on signal waveforms
<TD-Linux>
especially with a few tweaks like how you format the bytes going in and maybe choice of dictionary size
<cyrozap>
Yeah, Huffman coding is good shit. And I haven't even measured how well (or rather, how poorly) my RLE encoder handles the hard-to-compress parts of the capture--that's really important since I only have so much instantaneous bandwidth available on the USB bus, and no DRAM to buffer the compressed data in.
<TD-Linux>
yeah, the worst case is of course no compression so it really depends on your device. you'll definitely want a long fifo. if your data is bursty that might obviate the need for deflate
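To put a number on the "long fifo" point: during a fully incompressible burst the FIFO gains data at the difference between the raw rate and the USB rate, so on-chip block RAM alone doesn't buy much time. A sketch with nominal figures (the 576 Kb block-RAM total is the nominal Spartan-6 LX16 number; the rates are the ones quoted above):

# Worst case: completely incompressible input, FIFO filling at (raw in) - (USB out).
fill_rate = 100e6 - 35e6          # bytes per second gained during such a burst
bram_bytes = 576 * 1024 / 8       # ~72 KB: nominal total block RAM in an XC6SLX16
survivable_burst = bram_bytes / fill_rate

print(f"{survivable_burst * 1e6:.0f} us of incompressible input before even an "
      f"all-BRAM FIFO overflows")

That is on the order of a millisecond, so the sustained ratio on the ugly parts of the capture matters more than burst absorption.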
<TD-Linux>
now you're making me want to try implementing the av1 entropy coder (which consumes up to 4 bits of entropy per cycle)
<cyrozap>
Here's the results of the additional tests: For the least-compressible part of the capture (largest filesize output), my RLE encoding was 5.6 times the size of the deflated one. For the most-compressible parts, my RLE encoding was consistently about 8 times the size of what deflate produced.
<sorear>
I've done twice-as-good-as-LZMA on the Unicode Character Database, personally
<sorear>
haven't tried with large waveform captures, never needed to
<sorear>
ultimately it comes down to (a) THINK for half a second about the local and nonlocal correlations in your data (b) throw it into an entropy coder
<sorear>
are you storing data synchronous to the SPI clock or to an independent sampling clock?
<cyrozap>
And last interesting comparison between my naive RLE and deflate: On the least-compressible data, RLE was 60% of the size, while deflate was 11%, which is really important because at 35 MB/s max throughput, RLE would let me do 50 MSps, while deflate would enable (in theory) over 300 MSps, though of course in practice I'm limited by the speed of the encoder and the input characteristics of the LA.
<sorear>
you said the SPI was 20 MHz, why do you care about 300 MSps?
<cyrozap>
sorear: Yeah, if I did that, then I wouldn't even need the compression (20 MHz SPI data would easily fit in that 35 MB/s max transfer rate, since I could omit the clock bit and stuff 8 samples of CSN+MOSI+MISO into 3 bytes, which would give me 7.5 MB/s), but I'm trying to make something a little more generic.
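For reference, the 3-bits-per-sample packing works out like this, as a small sketch (one (CSN, MOSI, MISO) triple per 20 MHz SPI clock gives 20e6 * 3 / 8 = 7.5 MB/s, matching the figure above):

def pack8(samples):
    """Pack 8 (csn, mosi, miso) bit-triples into 3 bytes, first sample in the MSBs."""
    word = 0
    for csn, mosi, miso in samples:
        word = (word << 3) | (csn << 2) | (mosi << 1) | miso
    return word.to_bytes(3, "big")

print(pack8([(1, 0, 1)] * 8).hex())   # -> 'b6db6d'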
<cyrozap>
And I'm only trying for 100 MSps so I can have at least 4x oversampling.
<cyrozap>
That 300 MSps was just a theoretical number based only on the compression ratio I got from deflate, and not something I'd ever expect in practice.
<sorear>
I suspect deflate will work much better if you give it protocol-level bytes rather than 8-consecutive-samples
<cyrozap>
Right, but what I'm saying is that this is mostly just an excuse to get me to write a general-purpose FOSS bitstream for this USB scope/LA, so I can get it working in sigrok and not have to use Digilent WaveForms. If I just wanted to look at SPI data, of course it'd probably be better to just write an SPI sniffer peripheral and send the raw bytes over the wire (which at 20 MHz I think would be 5 MB/s, since that's just the MOSI+MISO data).
<cyrozap>
Oh, and I forgot the other constraint: The timing and grouping of the SPI transactions is important here, so if I were to do that, I'd have to find some way to add a timestamp to each burst (or something).
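One hypothetical way to keep that timing with the sniffer approach would be to prefix each burst with a free-running sample-counter timestamp; the field widths and framing below are invented purely for illustration:

import struct

# 64-bit free-running sample-count timestamp + 32-bit payload length (invented layout)
HDR = struct.Struct("<QI")

def frame_burst(sample_count: int, payload: bytes) -> bytes:
    return HDR.pack(sample_count, len(payload)) + payload

def parse_bursts(stream: bytes):
    off = 0
    while off < len(stream):
        ts, n = HDR.unpack_from(stream, off)
        off += HDR.size
        yield ts, stream[off:off + n]
        off += n

blob = frame_burst(123456, b"\xde\xad") + frame_burst(789012, b"\xbe\xef")
for ts, data in parse_bursts(blob):
    print(ts, data.hex())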
hitomi2504 has joined ##openfpga
OmniMancer has joined ##openfpga
emeb_mac has quit [Quit: Leaving.]
mossmann has quit [Ping timeout: 258 seconds]
unkraut has quit [Remote host closed the connection]
mossmann has joined ##openfpga
unkraut has joined ##openfpga
Asu has joined ##openfpga
gregdavill_ has joined ##openfpga
gregdavill has quit [Ping timeout: 256 seconds]
Asuu has joined ##openfpga
Asu has quit [Ping timeout: 256 seconds]
Bike has joined ##openfpga
q3k has quit [Ping timeout: 260 seconds]
q3k has joined ##openfpga
nickjohnson has quit [Ping timeout: 272 seconds]
nickjohnson has joined ##openfpga
Asuu has quit [Read error: Connection reset by peer]
Asu has joined ##openfpga
<azonenberg>
cyrozap: i mean of course rle is less good compression than deflate
<azonenberg>
what i wonder more about is gate count and timing performance
<azonenberg>
how much bigger/slower is deflate?
<azonenberg>
the big advantage of rle is simplicity, not compression rate
* whitequark
read that as "rle is less good compression than delete"
_whitelogger has joined ##openfpga
gregdavill_ has quit [Quit: Leaving]
<Hoernchen>
cyrozap, beaglelogic? huge buffer depth due to system ram...
<TD-Linux>
the big limitation of anything more complex than rle is usually the entropy coder on a fpga
<TD-Linux>
outputting more than one symbol per clock gets very difficult
<whitequark>
is it fundamentally difficult or just the limitation of our tooling
<TD-Linux>
fundamentally difficult. each symbol depends on the state of the entropy coder after the previous one
<whitequark>
right but can you pipeline that
<tnt>
well no ...
<tnt>
if the next state depends on the prev state + next input, you can't just freely pipeline.
<TD-Linux>
it is a latency limitation. you can't compute the next entropy coder state until the previous one is computed
<TD-Linux>
for a decoder, there is a trick you can do by computing all of the possibilities for the second/future symbols and then selecting the proper one
<whitequark>
what if you update the state once per n symbols
<whitequark>
it would be a custom algorithm
<whitequark>
but just... stuff the pipeline depth into the header or something
<tnt>
TD-Linux: that sounds like it would get expensive very quickly :p
<TD-Linux>
tnt, yeah, I know it is done on certain shipped hardware but only one into the future
<TD-Linux>
whitequark, you can do that, it's basically multiple parallel entropy coders
<whitequark>
aha
<TD-Linux>
but you want the "once per n" to be relatively high because there's overhead per each output block
<TD-Linux>
because the output is variable length, if you want it to be decodeable in parallel, you have to encode start positions for all of the output blocks
<TD-Linux>
(oh also, in the case of huffman, the state is just a bit pointer so in that special case you can just concatenate the symbols to do multiple per clock)
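A software sketch of that parallel-coder layout, with zlib standing in for the entropy coder and an invented header of per-lane compressed lengths so a decoder can find each lane's start position:

import struct, zlib

def encode_lanes(data: bytes, n: int) -> bytes:
    lanes = [zlib.compress(data[i::n]) for i in range(n)]          # round-robin split
    header = struct.pack(f"<I{n}I", n, *(len(l) for l in lanes))   # lane start info
    return header + b"".join(lanes)

def decode_lanes(blob: bytes) -> bytes:
    (n,) = struct.unpack_from("<I", blob)
    lens = struct.unpack_from(f"<{n}I", blob, 4)
    off = 4 + 4 * n
    lanes = []
    for ln in lens:                       # each lane is independently decodable
        lanes.append(zlib.decompress(blob[off:off + ln]))
        off += ln
    # re-interleave the round-robin split
    out = bytearray(sum(len(l) for l in lanes))
    for i, lane in enumerate(lanes):
        out[i::n] = lane
    return bytes(out)

data = bytes(range(256)) * 100
assert decode_lanes(encode_lanes(data, 4)) == data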
cr1901_modern has quit [Read error: Connection reset by peer]
<TD-Linux>
one other complication is in the case of video coders, the context model features adaptive probabilities. so every time you code a symbol, the probabilities adapt to make it more likely to code (and take less bits). the decoder model runs in lockstep
cr1901_modern has joined ##openfpga
<TD-Linux>
CABAC, used in H.264 and H.265, is the pessimal example. every symbol is binary (one bit), and each bit causes a probability update
Asu has quit [Remote host closed the connection]
Asu has joined ##openfpga
<TD-Linux>
AV1 instead uses variable symbol sizes, up to 16 (4 bits), so it's up to 4 times as fast. also AV1 has tiles which are basically the break point for the parallel entropy coders
<tnt>
Ah, never dug much into CABAC, but I did make an MQ decoder for jpeg2k for V4/V5 a long time ago.
<tnt>
Same thing with probabilities updated at each step.
genii has joined ##openfpga
cr1901_modern1 has joined ##openfpga
<tpw_rules>
TD-Linux: have you heard of finite state entropy?
<tpw_rules>
i want to try and put it on an FPGA. but it's trivially pipelineable because iirc you don't have to deal with encoding start positions and demultiplexing the streams
<tpw_rules>
you can just round robin symbols on both the encoder and decoder. also it doesn't require divides or anything messy
<tpw_rules>
the RAD people did a GPU parallelized version called BitKnit
cr1901_modern has quit [Ping timeout: 256 seconds]
<tpw_rules>
the only weird kink is that it's LIFO: you have to decode in the reverse order you encode. usually they do encoding backwards, so you have to be able to buffer an entire chunk (and build the tables if you're doing that) before you can read it backwards to encode it
cr1901_modern1 has quit [Quit: Leaving.]
cr1901_modern has joined ##openfpga
<TD-Linux>
tpw_rules, encoding ANS actually super sucks on a FPGA because you have to buffer all of the (uncompressed) symbols and then encode them backwards
<TD-Linux>
which is the main reason AV1 doesn't use it
<TD-Linux>
but if you can tolerate that limitation it's pretty good
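For the curious, a minimal rANS sketch (the range variant with a division, not the tabled tANS/FSE form tpw_rules mentions, and not BitKnit), mostly to make the LIFO point concrete: the encoder walks the symbols in reverse so the decoder can read them back in order. The framing, with the final state appended as four little-endian bytes, is invented:

PROB_BITS = 12
PROB_SCALE = 1 << PROB_BITS
RANS_L = 1 << 23                      # lower bound of the normalized state

def cumulative(freqs):
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    assert cum[-1] == PROB_SCALE      # frequencies must sum to PROB_SCALE
    return cum

def rans_encode(symbols, freqs):
    cum, state, out = cumulative(freqs), RANS_L, bytearray()
    for s in reversed(symbols):       # LIFO: encode in reverse order
        f, c = freqs[s], cum[s]
        x_max = ((RANS_L >> PROB_BITS) << 8) * f
        while state >= x_max:         # renormalize by streaming out low bytes
            out.append(state & 0xFF)
            state >>= 8
        state = ((state // f) << PROB_BITS) + (state % f) + c
    return bytes(out) + state.to_bytes(4, "little")

def rans_decode(blob, freqs, count):
    cum = cumulative(freqs)
    slot2sym = [s for s, f in enumerate(freqs) for _ in range(f)]
    pos = len(blob) - 4
    state = int.from_bytes(blob[pos:], "little")
    symbols = []
    for _ in range(count):
        slot = state & (PROB_SCALE - 1)
        s = slot2sym[slot]
        symbols.append(s)
        state = freqs[s] * (state >> PROB_BITS) + slot - cum[s]
        while state < RANS_L:         # renormalize by pulling the bytes back in
            pos -= 1
            state = (state << 8) | blob[pos]
    return symbols

freqs = [PROB_SCALE - 3, 1, 1, 1]     # a heavily skewed 4-symbol alphabet
msg = [0] * 50 + [1, 2, 3] + [0] * 50
assert rans_decode(rans_encode(msg, freqs), freqs, len(msg)) == msg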
Asuu has joined ##openfpga
<tpw_rules>
ah ok. yeah i was thinking it was a good opportunity to build the tables for the tabled version and not a problem if you had lots of DRAM. i have an application in mind but i'm not sure if i'll actually need it
Asu has quit [Ping timeout: 272 seconds]
<TD-Linux>
and yeah you need the tabled version if you don't want the divide
<TD-Linux>
luckily fpgas are pretty good at tables
<azonenberg>
whitequark: so i started the scopehal cleanup
<azonenberg>
first step was merging scopehal-cmake into scopehal-apps, which appears to have successfully completed
<azonenberg>
So now azonenberg/scopehal-apps is the top level repo for the project and includes the build system and code for glscopeclient and some other utilities
<azonenberg>
and has submodules for azonenberg/scopehal (library code) and azonenberg/scopehal-docs (documentation)
<azonenberg>
i may merge those as well at some point
<whitequark>
azonenberg: excellent
<whitequark>
ah, another question
<whitequark>
can you stuff ffts as a submodule? it already uses cmake anyway
<azonenberg>
File a ticket against scopehal-apps and i'll look into it
<whitequark>
thanks
<azonenberg>
i may have to make a secondary build system around it or something in the parent repo because cmake has some global config and importing a top level cmakelists via add_subdirectory doesnt always work well
<whitequark>
wait, who is github.com/antikerneldev?
<azonenberg>
monochroma
<whitequark>
ah
<azonenberg>
She originally made the account for working on antikernel stuff back in the day and the name stuck
<whitequark>
usually i add subprojects with add_subdirectory(dir EXCLUDE_FROM_ALL)
<hell__>
azonenberg: thanks for reorganizing the scopehal repos! one last detail that would help is to pin scopehal-apps instead of scopehal in your profile (that's how I went from `scopehal-apps` to `scopehal` without noticing `scopehal-cmake`)
<azonenberg>
hell__: will do, i missed that one
<hell__>
or, if moar merging is to be done, I can wait
<azonenberg>
I'm not sure yet
<azonenberg>
merging scopehal-cmake with scopehal would be a much more major undertaking
<whitequark>
the doc is just for glscopeclient, right?
<azonenberg>
scopehal-apps with*
<azonenberg>
whitequark: at the moment yes, but there will be developer documentation eventually
<whitequark>
ah hm
<whitequark>
in the same document?
<azonenberg>
no
<azonenberg>
but the same repo
<azonenberg>
the repo will eventually have multiple documents in it
<azonenberg>
maybe even appnotes etc
<whitequark>
right, ok
<azonenberg>
guides on using the library
<whitequark>
i have no strong opinion on that
<azonenberg>
My plan is to give it a week or two at least, and see how things shake out
<whitequark>
other than "pdfs are inaccessible" but you already know that
<azonenberg>
That's because i havent actually set up a proper host for them yet
<whitequark>
no i mean things like
<whitequark>
you can't link to a section in a pdf
* hell__
looks at their own github profile
<azonenberg>
pdf.js i think supports that?
<azonenberg>
though not everyone uses firefox
<whitequark>
i have no idea how to do that; if your doc is in html you just right click
<whitequark>
the other reason is that contributing to latex docs is pure suffering
<whitequark>
there have been multiple times when i thought about describing something in the yosys manual, remembering it's latex, deciding i can just not do that
<TAL>
(for pdf.js and chrome/chromium's pdf reader add #page=<pagenum>)
<whitequark>
TAL: that's not the same thing at all
<TAL>
ah you mean actual sections?
<whitequark>
yes
<whitequark>
linking to pages is stupid, it's like linking to lines of code on `master`
<whitequark>
it just means the link rots in a week
<TAL>
ah, nvm. yeah, sorry
emeb has joined ##openfpga
hitomi2504 has quit [Quit: Nettalk6 - www.ntalk.de]