electronic_eel has quit [Ping timeout: 256 seconds]
electronic_eel has joined #scopehal
<_whitenotifier-f>
[scopehal] azonenberg pushed 2 commits to master [+0/-0/±8] https://git.io/JIftr
<_whitenotifier-f>
[scopehal] azonenberg d015a6c - Filter: removed double-precision FindZeroCrossings(), redundant now that we have femtosecond timing resolution
electronic_eel has quit [Ping timeout: 256 seconds]
electronic_eel has joined #scopehal
<azonenberg>
bluezinc: So, i had a chance to look at your i2s filter
<azonenberg>
It looks decent but i question whether the display format you chose really makes the most sense
<azonenberg>
think about the purpose of i2s, it's ultimately encoding analog data. So does outputting a protocol analyzer style text symbol view really make sense?
<bluezinc>
You're suggesting a pair of waveforms out for L/R?
<azonenberg>
Yes
<azonenberg>
DownconvertFilter does exactly this, it outputs two I/Q waveforms
<azonenberg>
so it's probably a good reference for how to do it
<azonenberg>
The style you have isn't *wrong*, i just question whether it's the most useful given that ultimately you're encoding audio data
m4ssi has joined #scopehal
<azonenberg>
also just a style note... even if you're the author of a single file, I try to keep the header up top consistent project wide
<azonenberg>
i'm actually going to be reformatting that probably in January
<azonenberg>
so it will say copyright 2012-2021 andrew zonenberg + contributors
<azonenberg>
and also say "glscopeclient" rather than 'antikernel' which is where i lifted the comment from back when glscopeclient and scopehal were part of the antikernel project
<azonenberg>
The Doxygen @author comment is intended to be used to tag a file with the primary author, but ultimately copyright is shared
<azonenberg>
Other than that, it looks good. Do you agree that the dual waveform output style makes more sense in this case?
<_whitenotifier-f>
[scopehal] azonenberg opened issue #356: TappedDelayLineFilter: support resampling for when tap delay is not an integer multiple of the sample rate - https://git.io/JIfGP
<_whitenotifier-f>
[scopehal] azonenberg labeled issue #356: TappedDelayLineFilter: support resampling for when tap delay is not an integer multiple of the sample rate - https://git.io/JIfGP
m4ssi has quit [Remote host closed the connection]
<marshallh>
speculation on what the traces are? i was thinking 1st layer of a pcb inductor but i have no idea
<marshallh>
chip is intel iris XE max graphics
<marshallh>
specifically DG1
<azonenberg>
very interesting
<azonenberg>
If i'm interpreting this right the darker color is copper and lighter is substrate
<marshallh>
yes
<azonenberg>
most of the fat traces seem to be connected to capacitors
<marshallh>
they have filled microvias so there could be layer changes that aren't visible
<azonenberg>
Which does lead me to believe that those are from pcb inductors
<marshallh>
yeah
<azonenberg>
some kind of IVR
<marshallh>
interesting that they would parallel many coils to get more current maybe
<azonenberg>
possibly, yeah
<marshallh>
definitely high current
<marshallh>
my 3090 draws 420W during benchmark lmao
<azonenberg>
Yeah
<azonenberg>
Also, i'm probably going to be experimenting with pushing some scopehal compute to GPU in the nearish future
<azonenberg>
like, actual waveform processing rather than just rendering
<marshallh>
cuda?
<azonenberg>
gl compute shaders
<marshallh>
so vendor agnostic
<azonenberg>
it's portable and integrates nicely with gl rendering
<azonenberg>
the eye pattern in particular is pretty compute heavy
<marshallh>
what part of the processing can be moved easily? i assume you aren't talking protocol decode
<marshallh>
maybe just symbol-level decode?
<azonenberg>
So i want to experiment
<azonenberg>
The stuff i want to do on GPU is mostly a few categories
<azonenberg>
Lower level symbol decoding, DSP/math (FFTs, channel emulation, etc), and the eye pattern
<azonenberg>
Also, some line coding as well
<azonenberg>
sampling a signal on clock edges is probably easy to parallelize because you can just do one output sample per gpu thread
<azonenberg>
decoding a sampled serial data stream to 8b10b symbols should be easy to parallelize too. as soon as you have digital samples on a uniform timebase you don't need to do any random access seeking
<azonenberg>
i think the eye pattern is going to be a highish priority because it isn't SIMD friendly
<azonenberg>
most of the other time consuming compute is already making heavy use of AVX2
<marshallh>
cool
<azonenberg>
the eye pattern filter is not. I have a test setup that shows off jitter histograms and eye patterns of pcie before and after removing de-emphasis
<azonenberg>
according to vtune, a *single line* of code in the eye pattern filter is 3.4% of the total cpu-seconds used by the entire demo
<azonenberg>
the eye pattern filter as a whole is 25.1% of the total cpu time
<marshallh>
ouch
<azonenberg>
It's been my top priority for optimization for a while and i've tuned it a bunch, it was worse before
<azonenberg>
but i'm starting to hit limits
<azonenberg>
one thing that seems to be a bit heavy is all of the float to int64 and back conversions
<azonenberg>
unfortunately bitmaps are indexed by integer coordinates
<azonenberg>
and math on voltages kinda has to be floating point
<azonenberg>
especially if you're doing sub sample interpolation
<marshallh>
does it have to be float though?
<marshallh>
with 64bits you have more than enough for fixed
<marshallh>
especially since you are targeting display
<azonenberg>
the issue is that the voltages originally come in as fp32 from elsewhere in scopehal
<azonenberg>
and DSP math is much friendlier in fp32
<azonenberg>
also, AVX gets you double throughput in fp32 vs i64
<marshallh>
hmm
<marshallh>
i'm just used to doing all fixed point on fpga
<azonenberg>
(and half the ram usage)
<marshallh>
and integer always being faster on most embedded cpus
<azonenberg>
GPUs are generally better at float
<marshallh>
yeah if you are going GPU then def float
<azonenberg>
in fact one of the reasons glscopeclient doesn't run on a lot of older intel integrated gpus is that i need int64s in shaders for timestamps of waveforms
<azonenberg>
all internal timebase units are int64 femtoseconds
<azonenberg>
(recently converted from picoseconds, fs gives much needed extra resolution)
<sorear>
much more than double
<azonenberg>
sorear: i haven't checked latency or throughput on the FPU itself
<azonenberg>
but i know you get twice as many per instruction
<azonenberg>
This is what i'm working with. lots of scalar float instructions, it isn't even gonna be super easy to parallelize fine-grained because of the potentially variable clock frequency
<azonenberg>
so my current thought is to statically partition into N sub-blocks, one thread for each
<azonenberg>
(sub-blocks of waveform samples)
<sorear>
huh, haven't seen that tool before in use
<marshallh>
cool
<azonenberg>
then binary search to find the clock offset for each block
<azonenberg>
then evaluate each block in parallel using essentially this same inner loop
<azonenberg>
i don't see super easy opportunities for SIMD-ification here
<azonenberg>
There might be potential to unroll at least some of the inner loop and SIMD-ify the interpolation or something
<azonenberg>
but the actual output math has to be sequential because the pixel locations you're writing to are not consecutive
<sorear>
well you can do a big SIMD-friendly loop to make a list of (address, increment) pairs, then a non-SIMD loop to apply them
<azonenberg>
Yeah that is a possibility
<azonenberg>
but the stuff up top finding the timestamp offsets also isn't super SIMD friendly
<sorear>
(this is why riscv has the vectorized relaxed atomic adds everyone hates)
<azonenberg>
the question is ultimately if all of the marshaling and unmarshaling costs more time than this way
<azonenberg>
i'm also not sure how much i trust the current profile output
<azonenberg>
i might need to run longer?
<azonenberg>
because this is showing 2.9% of my time on "mov ecx, 0x40"
<azonenberg>
which can't be right
<sorear>
roughly what fraction of the time is the condition on line 472 true?
<azonenberg>
depends on the multiplier of sample clock rate to symbol rate
<azonenberg>
with 5 Gbps PCIe on a 40 Gsps scope, it's true 1/8 of the time
<sorear>
it's superscalar execution, the machine picks one instruction executed on any given cycle to account that cycle against and it doesn't always make sense (my experience is with perf(1), but I think they're using the same hw features)
<azonenberg>
interesting, they don't count every executing instruction? they pick one?
<azonenberg>
in any case i would expect that the *actual* bottleneck in line 508 is the 0xaec1f add instruction, as that hits memory
<sorear>
well it depends on what you're counting
<azonenberg>
This does not bode well for the simd-ifying idea though
<sorear>
perf defaults to "cycles", which stops every N cycles and records whatever the (I think, there are other possibilities) oldest unretired instruction is
<azonenberg>
as it implies most of the expense here is the increments rather than the math
massi_ has joined #scopehal
<sorear>
how much do we know about clock_edges? is it completely arbitrary?
<azonenberg>
It's the output of a CDR PLL normally, but could also be something like a DDR DQS
<sorear>
can we at least assume it's monotone?
<azonenberg>
Yes, it is monotonic
<sorear>
you could break the loop into N segments, and find the start point for each segment by binary search
<azonenberg>
I just discussed that up top
<azonenberg>
that would allow multithreading
<azonenberg>
that would also be my strategy for running it on GPU
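The block-partition strategy discussed here could look something like this. A minimal sketch under stated assumptions (monotonic clock_edges, dense packed samples); names are hypothetical, not scopehal's actual API. Each block's starting edge is found by binary search, after which the blocks can be evaluated by independent threads using the same inner loop.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: statically partition the waveform into N blocks
// and binary search the monotonic clock_edges array for each block's
// first edge, so each block can be processed independently.
std::vector<size_t> FindBlockStartEdges(
    const std::vector<int64_t>& clock_edges,  // monotonic, in fs
    int64_t t0,                               // timestamp of sample 0, fs
    int64_t interval,                         // sample interval, fs
    size_t nsamples,
    size_t nblocks)
{
    std::vector<size_t> starts(nblocks);
    size_t block_len = nsamples / nblocks;
    for(size_t b = 0; b < nblocks; b++)
    {
        int64_t tstart = t0 + (int64_t)(b * block_len) * interval;

        // First edge at or after the block's start time
        auto it = std::lower_bound(clock_edges.begin(), clock_edges.end(), tstart);
        starts[b] = it - clock_edges.begin();
    }
    return starts;
}
```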
<azonenberg>
i think what might make sense at this point is setting up the infrastructure for doing scopehal filters on GPU
<sorear>
if you think about it you're basically doing a sorted-list merge
<azonenberg>
implement some simple DSP filters like "subtract" as a start
<azonenberg>
then work on some of the more complicated stuff
<azonenberg>
i also have some more work i wanted to do on reducing unnecessary allocations and memcpys
<azonenberg>
so that might also be good to do first
<azonenberg>
but basically i think what i will need to do is keep track of what waveform data lives on the CPU or GPU
<azonenberg>
and move between them only when needed
<azonenberg>
right now i'm using stl vectors with my AlignedAllocator to allocate memory on the CPU but that object model might need some retooling when it comes to GPU memory
<sorear>
if the clock frequency is relatively stable, min_clock <= (clock_edges[i+1] - clock_edges[i]) <= max_clock, max_clock <= "a few" * min_clock you could make an inverse map
<azonenberg>
Do i want to have that actual vector live on the gpu?
<azonenberg>
or do i want a separate buffer on the gpu?
<azonenberg>
how do i sync between them if so?
<azonenberg>
etc
<azonenberg>
especially during the transitional period there will be lots of processing done on both
<sorear>
can GL4.5 compute shaders access app memory? I've only done things with OpenCL which can
<azonenberg>
Not sure. I've only done the opposite
<azonenberg>
mapping GPU memory so i can access it from CPU code
<sorear>
it's just NUMA and I wish we could consistently treat it that way
<azonenberg>
Yeah
<azonenberg>
Anyway, i guess the time has come
<azonenberg>
i was always going to be pushing compute to the gpu and now is as good a time as any
massi_ has quit [Remote host closed the connection]
<azonenberg>
Let me take care of #354 first though
juli966 has joined #scopehal
<_whitenotifier-f>
[scopehal] azonenberg pushed 3 commits to master [+0/-0/±15] https://git.io/JIJbz
<_whitenotifier-f>
[scopehal] azonenberg ef8328e - EyePattern: removed antialiasing which isn't needed after femtosecond timebase conversion
<_whitenotifier-f>
[scopehal] azonenberg 63b5fa3 - Initial implementation of dense packed waveform processing. Only supported for LeCroy and Tek scopes, and tapped delay line filters. Not persisted to files yet. Fixes #354.
<_whitenotifier-f>
[scopehal] azonenberg 083769b - Added dense packed optimizations to a lot more filters
<_whitenotifier-f>
[scopehal] azonenberg closed issue #354: Add flag to Waveform indicating "waveform is dense packed" - https://git.io/JIeQW
<_whitenotifier-f>
[scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JIJbw
<_whitenotifier-f>
[scopehal-apps] azonenberg 56046b0 - Updated to latest scopehal with zero crossing cache and dense pack optimizations
<_whitenotifier-f>
[scopehal] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JIJpZ
<azonenberg>
So by propagating the dense pack attribute and avoiding just a handful of int64-fp32 conversions in the eye decode
<azonenberg>
I processed 23% more waveforms in the same time, and used only 93% of the CPU time in the eye filter
<azonenberg>
average 40.7 ms/wfm (with two eyes on screen) before, now 30.8
<azonenberg>
so that's roughly a 25% speedup
<azonenberg>
I still retain all of the flexibility i had before for working with signals that have gone through complex processing and resampling, i'm just optimizing the common case of monotonic sampling at uniform intervals
<_whitenotifier-f>
[scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIJjV
<_whitenotifier-f>
[scopehal-apps] azonenberg a7c8ccd - Updated to latest scopehal
<azonenberg>
I still think there is potential to vectorize too
<azonenberg>
since it seems that fp32 to int64 conversions are so slow, even if most of the other stuff is serialized
<azonenberg>
actually, i bet i can optimize even more by using int32s
<azonenberg>
while timestamps can be large, no eye pattern is going to be more than 4 gigapixels
<azonenberg>
Looks like the jitter spectrum filter is going to be another target for performance tuning soon
<azonenberg>
every time i think it's time to start pushing compute to the GPU i find ways to squeeze more performance out of the software side lol
m4ssi has joined #scopehal
jn__ has quit [Quit: No Ping reply in 180 seconds.]
jn__ has joined #scopehal
m4ssi has quit [Remote host closed the connection]
<bvernoux>
haha during testing my Rigol froze
<bvernoux>
I have ANSI C code working on both Windows & Linux ;)
<bvernoux>
Anyway so far the Rigol MSO5000 is so unstable with the SCPI I use ...
<bvernoux>
need to check whether using additional SCPI commands improves stability, like adding ":WAV:STAR 1" & ":WAV:STOP " before retrieving the samples
<azonenberg>
The impression i'm getting is rigol stuff just isn't stable enough for serious use
electronic_eel has quit [Ping timeout: 240 seconds]