azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/azonenberg/scopehal-apps, https://github.com/azonenberg/scopehal, https://github.com/azonenberg/scopehal-docs | Logs: https://freenode.irclog.whitequark.org/scopehal
<azonenberg> AKL-PT2 v0.4 boards sent to fab, ETA the 9th
Degi_ has joined #scopehal
Degi has quit [Ping timeout: 256 seconds]
Degi_ is now known as Degi
<_whitenotifier-f> [scopehal] azonenberg pushed 2 commits to master [+2/-0/±4] https://git.io/JIfes
<_whitenotifier-f> [scopehal] azonenberg f0fbd17 - LeCroyOscilloscope: fixed comment error
<_whitenotifier-f> [scopehal] azonenberg 8338bf1 - Added emphasis filter. Fixes #190.
<_whitenotifier-f> [scopehal] azonenberg closed issue #190: Add filter to apply pre/de emphasis to a signal - https://git.io/JJRNV
electronic_eel has quit [Ping timeout: 256 seconds]
electronic_eel has joined #scopehal
<_whitenotifier-f> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±8] https://git.io/JIftr
<_whitenotifier-f> [scopehal] azonenberg d015a6c - Filter: removed double-precision FindZeroCrossings(), redundant now that we have femtosecond timing resolution
<_whitenotifier-f> [scopehal] azonenberg 211e07f - FindZeroCrossings: cache results to avoid repeated calls. Fixes #355.
<_whitenotifier-f> [scopehal] azonenberg closed issue #355: Cache Filter::FindZeroCrossings results - https://git.io/JIe7W
electronic_eel has quit [Ping timeout: 256 seconds]
electronic_eel has joined #scopehal
<azonenberg> bluezinc: So, i had a chance to look at your i2s filter
<azonenberg> It looks decent but i question whether the display format you chose really makes the most sense
<azonenberg> think about the purpose of i2s, it's ultimately encoding analog data. So does outputting a protocol analyzer style text symbol view really make sense?
<bluezinc> You're suggesting a pair of waveforms out for L/R?
<azonenberg> Yes
<azonenberg> DownconvertFilter does exactly this, it outputs two I/Q waveforms
<azonenberg> so it's probably a good reference for how to do it
<azonenberg> The style you have isn't *wrong*, i just question whether it's the most useful given that ultimately you're encoding audio data
m4ssi has joined #scopehal
<azonenberg> also just a style note... even if you're the author of a single file, I try to keep the header up top consistent project wide
<azonenberg> i'm actually going to be reformatting that probably in January
<azonenberg> so it will say copyright 2012-2021 andrew zonenberg + contributors
<azonenberg> and also say "glscopeclient" rather than 'antikernel' which is where i lifted the comment from back when glscopeclient and scopehal were part of the antikernel project
<azonenberg> The Doxygen @author comment is intended to be used to tag a file with the primary author, but ultimately copyright is shared
<azonenberg> Other than that, it looks good. Do you agree that the dual waveform output style makes more sense in this case?
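A minimal C++ sketch of the dual-waveform output style discussed above, in the spirit of DownconvertFilter's I/Q outputs. The names (AnalogWaveform, EmitStereoWaveforms) are illustrative stand-ins, not the actual scopehal Filter API:

    // Hypothetical sketch: emit decoded I2S audio as two analog waveforms
    // (left / right) instead of protocol-analyzer text symbols.
    #include <cstdint>
    #include <vector>

    // Stand-in for an analog output waveform: one float sample per decoded frame
    struct AnalogWaveform
    {
        std::vector<int64_t> timestamps;   // frame start times, in timebase units
        std::vector<float>   voltages;     // normalized audio amplitude
    };

    // Convert decoded 16-bit PCM frames into two waveforms, one per channel
    void EmitStereoWaveforms(
        const std::vector<int64_t>& frameTimes,
        const std::vector<int16_t>& left,
        const std::vector<int16_t>& right,
        AnalogWaveform& outLeft,
        AnalogWaveform& outRight)
    {
        for(size_t i = 0; i < frameTimes.size(); i++)
        {
            outLeft.timestamps.push_back(frameTimes[i]);
            outLeft.voltages.push_back(left[i] / 32768.0f);

            outRight.timestamps.push_back(frameTimes[i]);
            outRight.voltages.push_back(right[i] / 32768.0f);
        }
    }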
<_whitenotifier-f> [scopehal] mjgerm closed pull request #353: I2S Decoder Implementation - https://git.io/JkxFl
<_whitenotifier-f> [scopehal] azonenberg opened issue #356: TappedDelayLineFilter: support resampling for when tap delay is not an integer multiple of the sample rate - https://git.io/JIfGP
<_whitenotifier-f> [scopehal] azonenberg labeled issue #356: TappedDelayLineFilter: support resampling for when tap delay is not an integer multiple of the sample rate - https://git.io/JIfGP
m4ssi has quit [Remote host closed the connection]
juli966 has quit [Quit: Nettalk6 - www.ntalk.de]
m4ssi has joined #scopehal
electronic_eel has quit [Ping timeout: 260 seconds]
electronic_eel has joined #scopehal
<azonenberg> So i'm looking at the mold design i was working on for the AKL-PT2
<azonenberg> trying to remember if i had any last minute edits i needed to make
<azonenberg> or if it's good to order
<marshallh> speculation on what the traces are? i was thinking 1st layer of a pcb inductor but i have no idea
<marshallh> chip is intel iris XE max graphics
<marshallh> specifically DG1
<azonenberg> very interesting
<azonenberg> If i'm interpreting this right the darker color is copper and lighter is substrate
<marshallh> yes
<azonenberg> most of the fat traces seem to be connected to capacitors
<marshallh> they have filled microvias so there could be layer changes that aren't visible
<azonenberg> Which does lead me to believe that those are from pcb inductors
<marshallh> yeah
<azonenberg> some kind of IVR
<marshallh> interesting that they would parallel many coils to get more current maybe
<azonenberg> possibly, yeah
<marshallh> definitely high current
<marshallh> my 3090 draws 420W during benchmark lmao
<azonenberg> Yeah
<azonenberg> Also, i'm probably going to be experimenting with pushing some scopehal compute to GPU in the nearish future
<azonenberg> like, actual waveform processing rather than just rendering
<marshallh> cuda?
<azonenberg> gl compute shaders
<marshallh> so vendor agnostic
<azonenberg> it's portable and integrates nicely with gl rendering
<azonenberg> the eye pattern in particular is pretty compute heavy
<marshallh> what part of the processing can be moved easily? i assume you arent talking protocol decode
<marshallh> maybe just symbol-level decode?
<azonenberg> So i want to experiment
<azonenberg> The stuff i want to do on GPU is mostly a few categories
<azonenberg> Lower level symbol decoding, DSP/math (FFTs, channel emulation, etc), and the eye pattern
<azonenberg> Also, some line coding as well
<azonenberg> sampling a signal on clock edges is probably easy to parallelize because you can just do one output sample per gpu thread
<azonenberg> decoding a sampled serial data stream to 8b10b symbols should be easy to parallelize too. as soon as you have digital samples on a uniform timebase you don't need to do any random access seeking
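A minimal C++ sketch of why sampling on clock edges parallelizes cleanly: with a dense packed waveform each output sample depends only on its own clock edge, so each loop iteration could map to one GPU thread. Names and types are illustrative, not scopehal's actual classes:

    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> SampleOnClockEdges(
        const std::vector<uint8_t>& samples,      // dense packed digital waveform, samples spaced "timescale" ticks apart
        int64_t timescale,                        // ticks per sample
        const std::vector<int64_t>& clockEdges)   // monotonic edge timestamps, in ticks
    {
        std::vector<uint8_t> out(clockEdges.size(), 0);
        for(size_t i = 0; i < clockEdges.size(); i++)   // each iteration is independent: one GPU thread per output sample
        {
            // Uniform timebase, so the sample index is a direct divide, no searching
            size_t index = static_cast<size_t>(clockEdges[i] / timescale);
            if(index < samples.size())
                out[i] = samples[index];
        }
        return out;
    }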
<azonenberg> i think the eye pattern is going to be a highish priority because it isn't SIMD friendly
<azonenberg> most of the other time consuming compute is already making heavy use of AVX2
<marshallh> cool
<azonenberg> the eye pattern filter is not. I have a test setup that shows off jitter histograms and eye patterns of pcie before and after removing de-emphasis
<azonenberg> according to vtune, a *single line* of code in the eye pattern filter is 3.4% of the total cpu-seconds used by the entire demo
<azonenberg> the eye pattern filter as a whole is 25.1% of the total cpu time
<marshallh> ouch
<azonenberg> It's been my top priority for optimization for a while and i've tuned it a bunch, it was worse before
<azonenberg> but i'm starting to hit limits
<azonenberg> one thing that seems to be a bit heavy is all of the float to int64 and back conversions
<azonenberg> unfortunately bitmaps are indexed by integer coordinates
<azonenberg> and math on voltages kinda has to be floating point
<azonenberg> especially if you're doing sub sample interpolation
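A simplified sketch of the kind of inner-loop cost being described (illustrative, not the actual EyePattern filter code): interpolation and scaling are float math, but the accumulation bitmap is indexed by integers, so every plotted point pays for float-to-int conversions:

    #include <cstdint>
    #include <vector>

    void AccumulateEyePoint(
        std::vector<int64_t>& bitmap,   // width*height accumulation buffer
        int64_t width, int64_t height,
        float xFrac,                    // 0..1 position within the unit interval
        float voltage,                  // interpolated sample voltage
        float vmin, float vmax)
    {
        // Float math for sub-sample interpolation and scaling...
        float xf = xFrac * (width - 1);
        float yf = (voltage - vmin) / (vmax - vmin) * (height - 1);

        // ...then conversion to integer pixel coordinates for the increment
        int64_t x = static_cast<int64_t>(xf);
        int64_t y = static_cast<int64_t>(yf);
        if( (x >= 0) && (x < width) && (y >= 0) && (y < height) )
            bitmap[y*width + x]++;
    }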
<marshallh> does it have to be float though?
<marshallh> with 64bits you have more than enough for fixed
<marshallh> especially since you are targeting display
<azonenberg> the issue is that the voltages originally come in as fp32 from elsewhere in scopehal
<azonenberg> and DSP math is much friendlier in fp32
<azonenberg> also, AVX gets you double throughput in fp32 vs i64
<marshallh> hmm
<marshallh> i'm just used to doing all fixed point on fpga
<azonenberg> (and half the ram usage)
<marshallh> and integer always being faster on most embedded cpus
<azonenberg> GPUs are generally better at float
<marshallh> yeah if you are going GPU then def float
<azonenberg> in fact one of the reasons glscopeclient doesnt run on a lot of older intel integrated gpus is that i need int64s in shaders for timestamps of waveforms
<azonenberg> all internal timebase units are int64 femtoseconds
<azonenberg> (recently converted from picoseconds, fs gives much needed extra resolution)
<sorear> much more than double
<azonenberg> sorear: i havent checked latency or throughput on the FPU itself
<azonenberg> but i know you get twice as many per instruction
<azonenberg> not sure how IPC compares
<azonenberg> btw
<azonenberg> This is what i'm working with. lots of scalar float instructions, it isnt even gonna be super easy to parallelize fine grained because of the potentially variable clock frequency
<azonenberg> so my current thought is to statically partition into N sub-blocks, one thread for each
<azonenberg> (sub-blocks of waveform samples)
<sorear> huh, haven't seen that tool before in use
<marshallh> cool
<azonenberg> then binary search to find the clock offset for each block
<azonenberg> then evaluate each block in parallel using essentially this same inner loop
<azonenberg> i don't see super easy opportunities for SIMD-ification here
<azonenberg> There might be potential to unroll at least some of the inner loop and SIMD-ify the interpolation or something
<azonenberg> but the actual output math has to be sequential because the pixel locations you're writing to are not consecutive
<sorear> well you can do a big SIMD-friendly loop to make a list of (address, increment) pairs, then a non-SIMD loop to apply them
<azonenberg> Yeah that is a possibility
<azonenberg> but the stuff up top finding the timestamp offsets also isnt super SIMD friendly
<sorear> (this is why riscv has the vectorized relaxed atomic adds everyone hates)
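A sketch of that two-pass idea, under the assumption that pass 1 (pure math, one independent output per input sample) is what gets vectorized while pass 2 stays scalar because several samples can hit the same pixel; the names are illustrative, not scopehal code:

    #include <cstdint>
    #include <vector>

    void AccumulateTwoPass(
        const std::vector<float>& xFrac,     // per-sample position in the unit interval, assumed already clipped to [0,1)
        const std::vector<float>& yNorm,     // per-sample normalized voltage, assumed already clipped to [0,1)
        int64_t width, int64_t height,
        std::vector<int64_t>& bitmap)        // width*height accumulation buffer
    {
        // Pass 1: pure math, one independent output per input sample (SIMD friendly)
        std::vector<int64_t> addresses(xFrac.size());
        for(size_t i = 0; i < xFrac.size(); i++)
        {
            int64_t x = static_cast<int64_t>(xFrac[i] * (width - 1));
            int64_t y = static_cast<int64_t>(yNorm[i] * (height - 1));
            addresses[i] = y*width + x;
        }

        // Pass 2: scalar scatter, because several samples can hit the same pixel
        for(int64_t addr : addresses)
            if( (addr >= 0) && (addr < width*height) )
                bitmap[addr]++;
    }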
<azonenberg> the question is ultimately if all of the marshaling and unmarshaling costs more time than this way
<azonenberg> i'm also not sure how much i trust the current profile output
<azonenberg> i might need to run longer?
<azonenberg> because this is showing 2.9% of my time on "mov ecx, 0x40"
<azonenberg> which can't be right
<sorear> roughly what fraction of the time is the condition on line 472 true?
<azonenberg> depends on the ratio of sample rate to symbol rate
<azonenberg> with 5 Gbps PCIe on a 40 Gsps scope, it's true 1/8 of the time
<sorear> it's superscalar execution, the machine picks one instruction executed on any given cycle to account that cycle against and it doesn't always make sense (my experience is with perf(1), but I think they're using the same hw features)
<azonenberg> interesting, they don't count every executing instruction? they pick one?
<azonenberg> in any case i would expect that the *actual* bottleneck in line 508 is the 0xaec1f add instruction, as that hits memory
<sorear> well it depends on what you're counting
<azonenberg> This does not bode well for the simd-ifying idea though
<sorear> perf defaults to "cycles", which stops every N cycles and records whatever the (I think, there are other possibilities) oldest unretired instruction is
<azonenberg> as it implies most of the expense here is the increments rather than the math
massi_ has joined #scopehal
<sorear> how much do we know about clock_edges? is it completely arbitrary?
<azonenberg> It's the output of a CDR PLL normally, but could also be something like a DDR DQS
<sorear> can we at least assume it's monotone?
<azonenberg> Yes, it is monotonic
<sorear> you could break the loop into N segments, and find the start point for each segment by binary search
<azonenberg> I just discussed that up top
<azonenberg> that would allow multithreading
<azonenberg> that would also be my strategy for running it on GPU
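A sketch of that partitioning strategy, assuming a uniformly sampled waveform and a monotonic clock edge list; names are illustrative rather than actual scopehal code:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct BlockRange
    {
        size_t firstSample;   // index of first waveform sample in this block
        size_t lastSample;    // one past the last sample
        size_t firstEdge;     // index of first clock edge at or after the block start
    };

    std::vector<BlockRange> PartitionForThreads(
        size_t numSamples,
        int64_t timescale,                       // ticks per sample
        const std::vector<int64_t>& clockEdges,  // monotonic
        size_t numBlocks)
    {
        std::vector<BlockRange> blocks;
        size_t samplesPerBlock = (numSamples + numBlocks - 1) / numBlocks;

        for(size_t b = 0; b < numBlocks; b++)
        {
            BlockRange r;
            r.firstSample = b * samplesPerBlock;
            r.lastSample  = std::min(numSamples, r.firstSample + samplesPerBlock);

            // Binary search: first clock edge at or after this block's start time
            int64_t startTime = static_cast<int64_t>(r.firstSample) * timescale;
            r.firstEdge = std::lower_bound(clockEdges.begin(), clockEdges.end(), startTime)
                            - clockEdges.begin();

            blocks.push_back(r);
        }
        return blocks;   // each block can then run the existing inner loop on its own thread
    }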
<azonenberg> i think what might make sense at this point is setting up the infrastructure for doing scopehal filters on GPU
<sorear> if you think about it you're basically doing a sorted-list merge
<azonenberg> implement some simple DSP filters like "subtract" as a start
<azonenberg> then work on some of the more complicated stuff
<azonenberg> i also have some more work i wanted to do on reducing unnecessary allocations and memcpy'ing
<azonenberg> so that might also be good to do first
<azonenberg> but basically i think what i will need to do is keep track of what waveform data lives on the CPU or GPU
<azonenberg> and move between them only when needed
<azonenberg> right now i'm using stl vectors with my AlignedAllocator to allocate memory on the CPU but that object model might need some retooling when it comes to GPU memory
<sorear> if the clock frequency is relatively stable, min_clock <= (clock_edges[i+1] - clock_edges[i]) <= max_clock, max_clock <= "a few" * min_clock you could make an inverse map
<azonenberg> Do i want to have that actual vector live on the gpu?
<azonenberg> or do i want a separate buffer on the gpu?
<azonenberg> how do i sync between them if so?
<azonenberg> etc
<azonenberg> especially during the transitional period there will be lots of processing done on both
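One possible shape for that bookkeeping, sketched with hypothetical names (WaveformBuffer, PrepareForCpuAccess, etc.) rather than the existing AlignedAllocator-based object model; the GL upload/download calls are left as stubs:

    #include <cstdint>
    #include <vector>

    class WaveformBuffer
    {
    public:
        enum class Residency { CpuOnly, GpuOnly, Both };

        // CPU-side filters call this before touching m_cpuData
        void PrepareForCpuAccess()
        {
            if(m_residency == Residency::GpuOnly)
            {
                CopyGpuToCpu();
                m_residency = Residency::Both;
            }
        }

        // GPU-side filters call this before dispatching a compute shader
        void PrepareForGpuAccess()
        {
            if(m_residency == Residency::CpuOnly)
            {
                CopyCpuToGpu();
                m_residency = Residency::Both;
            }
        }

        // Whichever side writes the buffer invalidates the other copy
        void MarkModifiedByCpu() { m_residency = Residency::CpuOnly; }
        void MarkModifiedByGpu() { m_residency = Residency::GpuOnly; }

    private:
        // Stubs: a real implementation would wrap SSBO upload / readback here
        void CopyCpuToGpu() { }
        void CopyGpuToCpu() { }

        std::vector<float> m_cpuData;
        uint32_t           m_gpuBufferHandle = 0;
        Residency          m_residency = Residency::CpuOnly;
    };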
<sorear> can GL4.5 compute shaders access app memory? I've only done things with OpenCL which can
<azonenberg> Not sure. I've only done the opposite
<azonenberg> mapping GPU memory so i can access it from CPU code
<sorear> it's just NUMA and I wish we could consistently treat it that way
<azonenberg> Yeah
<azonenberg> Anyway, i guess the time has come
<azonenberg> i was always going to be pushing compute to the gpu and now is as good a time as any
massi_ has quit [Remote host closed the connection]
<azonenberg> Let me take care of #354 first though
juli966 has joined #scopehal
<_whitenotifier-f> [scopehal] azonenberg pushed 3 commits to master [+0/-0/±15] https://git.io/JIJbz
<_whitenotifier-f> [scopehal] azonenberg ef8328e - EyePattern: removed antialiasing which isn't needed after femtosecond timebase conversion
<_whitenotifier-f> [scopehal] azonenberg 63b5fa3 - Initial implementation of dense packed waveform processing. Only supported for LeCroy and Tek scopes, and tapped delay line filters. Not persisted to files yet. Fixes #354.
<_whitenotifier-f> [scopehal] azonenberg 083769b - Added dense packed optimizations to a lot more filters
<_whitenotifier-f> [scopehal] azonenberg closed issue #354: Add flag to Waveform indicating "waveform is dense packed" - https://git.io/JIeQW
<_whitenotifier-f> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JIJbw
<_whitenotifier-f> [scopehal-apps] azonenberg 56046b0 - Updated to latest scopehal with zero crossing cache and dense pack optimizations
<_whitenotifier-f> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JIJpZ
<_whitenotifier-f> [scopehal] azonenberg 29fb6dd - DeEmbedFilter: implemented dense pack optimizations
<_whitenotifier-f> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±4] https://git.io/JIJp4
<_whitenotifier-f> [scopehal-apps] azonenberg c961070 - WaveformArea: fixed bug with handling of large (more than one sample) trigger phase shifts
<azonenberg> holy moley
<_whitenotifier-f> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIJhF
<_whitenotifier-f> [scopehal] azonenberg fc4641c - EyePattern: dense pack optimizations. Massive speedups.
<azonenberg> So by propagating the dense pack attribute and avoiding just a handful of int64-fp32 conversions in the eye decode
<azonenberg> I processed 23% more waveforms in the same time, and used only 93% of the CPU time in the eye filter
<azonenberg> average 40.7 ms/wfm (with two eyes on screen) before, now 30.8
<azonenberg> so that's roughly a 25% speedup
<azonenberg> I still retain all of the flexibility i had before for working with signals that have gone through complex processing and resampling, i'm just optimizing the common case of monotonic sampling at uniform intervals
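A simplified illustration of what the dense-pack flag buys (not the actual Waveform class): with uniform spacing, a timestamp maps to a sample index with a single divide instead of a search through per-sample offsets:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct SparseWaveform
    {
        std::vector<int64_t> offsets;   // per-sample start time, arbitrary spacing
        std::vector<float>   voltages;
    };

    // Sparse case: general but slow, needs a search
    size_t IndexAtTimeSparse(const SparseWaveform& w, int64_t t)
    {
        // Last sample whose start time is <= t
        auto it = std::upper_bound(w.offsets.begin(), w.offsets.end(), t);
        if(it == w.offsets.begin())
            return 0;
        return static_cast<size_t>(it - w.offsets.begin()) - 1;
    }

    // Dense case: uniform spacing, so the index is one integer divide
    size_t IndexAtTimeDense(int64_t t, int64_t timescale)
    {
        return static_cast<size_t>(t / timescale);
    }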
<_whitenotifier-f> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIJjV
<_whitenotifier-f> [scopehal-apps] azonenberg a7c8ccd - Updated to latest scopehal
<azonenberg> I still think there is potential to vectorize too
<azonenberg> since it seems that fp32 to int64 conversions are so slow, even if most of the other stuff is serialized
<azonenberg> actually, i bet i can optimize even more by using int32s
<azonenberg> while timestamps can be large, no eye pattern is going to be more than 4 gigapixels
juli966 has quit [Quit: Nettalk6 - www.ntalk.de]
<Bird|otherbox> azonenberg: we call that a "beholder pattern"
<azonenberg> Bird|otherbox: lol
<azonenberg> also the eye is actually vectorizing better than i thought
<azonenberg> i'm not even quite done and i'm at a ~2x speedup *of the entire test case*
<azonenberg> 464 to 782 waveforms processed in 1 minute, using less cpu time
<azonenberg> i'm now up to almost 20 WFM/s from 8-9
<azonenberg> and i think 6 before i started this round of optimizing
_whitelogger has joined #scopehal
m4ssi has quit [Remote host closed the connection]
bvernoux has joined #scopehal
<bvernoux> hi
<bvernoux> I'm doing even more tests on the Rigol MSO5000 and I can confirm the Rigol firmware is really buggy when capturing data ...
<azonenberg> Duh :p
<bvernoux> I heavily suspect a buffer overflow in the firmware ;)
<bvernoux> which corrupts the size of the data ...
<bvernoux> as sometimes there is no data at all and sometimes too much ;)
<_whitenotifier-f> [scopehal] azonenberg pushed 5 commits to master [+0/-0/±7] https://git.io/JIUKj
<_whitenotifier-f> [scopehal] azonenberg 5664039 - EyePattern: refactored inner loop out to separate function
<_whitenotifier-f> [scopehal] azonenberg 23c24b9 - EyePattern: Refactoring of inner loop in preparation for AVX2 optimizations
<_whitenotifier-f> [scopehal] azonenberg bf5304f - EyePattern: Partial vectorization of inner loop
<bvernoux> it is a real mess
<_whitenotifier-f> [scopehal] ... and 2 more commits.
<bvernoux> they clearly have very bad synchronization between the scope and the SCPI commands, probably a mutex issue (or a missing mutex ...)
<_whitenotifier-f> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIU6L
<_whitenotifier-f> [scopehal-apps] azonenberg 62c2b78 - Updated to latest scopehal
<bvernoux> also the fun part is the Rigol MSO5000 has Eye Analysis & Jitter ;)
<bvernoux> but those functions don't work and do nothing ;)
<bvernoux> like they planned to develop that, never finished it, and shipped the FW anyway ;)
<miek> maybe they share bits of the firmware with the mso8k
<bvernoux> haha yes ;)
<d1b2> <TiltMeSenpai> do you have the "student discount" on your MSO5k
<bvernoux> I think there is potentially a workaround
<bvernoux> by setting the parameters again with SCPI commands
<bvernoux> since they are probably overwritten sometimes, to be checked ;)
<bvernoux> their synchronization between the scope and the SCPI commands is crazy bad anyway
<bvernoux> we see the latency with the trigger ...
<bvernoux> something like 400ms
<bvernoux> it is like they are synchronizing some tasks badly (or even worse, creating/destroying threads and losing a lot of time)
<bvernoux> the worst part is they stack all the commands like a FIFO
<bvernoux> so it ends in a death spiral when something fails, like corrupting the number of points to send
<_whitenotifier-f> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIUPR
<_whitenotifier-f> [scopehal] azonenberg a2e1fec - Vectorized address calculation
nelgau has joined #scopehal
<azonenberg> Looks like the jitter spectrum filter is going to be another target for performance tuning soon
<azonenberg> every time i think it's time to start pushing compute to the GPU i find ways to squeeze more performance out of the software side lol
m4ssi has joined #scopehal
jn__ has quit [Quit: No Ping reply in 180 seconds.]
jn__ has joined #scopehal
m4ssi has quit [Remote host closed the connection]
<bvernoux> haha during testing my Rigol froze
<bvernoux> I have ANSI C code working on both Windows & Linux ;)
<bvernoux> Anyway, so far the Rigol MSO5000 is very unstable with the SCPI commands I use ...
<bvernoux> need to check whether using additional SCPI commands improves stability, like adding ":WAV:STAR 1" & ":WAV:STOP" before retrieving the samples
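A sketch of the command sequence being considered, using :WAV:STAR / :WAV:STOP from the chat plus :WAV:DATA? from Rigol's documented :WAVeform subsystem; exact syntax and transfer limits should be verified against the MSO5000 programming guide:

    // Build the SCPI commands to fetch one chunk of waveform data, re-sending the
    // transfer window each time so stale or corrupted settings can't persist.
    #include <string>
    #include <vector>

    std::vector<std::string> BuildWaveformFetchCommands(size_t start, size_t stop)
    {
        return
        {
            ":WAV:STAR " + std::to_string(start),   // first point of the transfer window
            ":WAV:STOP " + std::to_string(stop),    // last point of the transfer window
            ":WAV:DATA?"                            // request the samples (definite-length block reply);
                                                    // validate the length header before trusting the payload
        };
    }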
<azonenberg> The impression i'm getting is rigol stuff just isn't stable enough for serious use
electronic_eel has quit [Ping timeout: 240 seconds]
electronic_eel has joined #scopehal
bvernoux has quit [Quit: Leaving]
smkz has quit [Quit: smkz]