#scopehal on 2020-12-01 — irc logs at freenode.irclog.whitequark.org

2020-07-13 15:00 azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/azonenberg/scopehal-apps, https://github.com/azonenberg/scopehal, https://github.com/azonenberg/scopehal-docs | Logs: https://freenode.irclog.whitequark.org/scopehal

01:19 <azonenberg> AKL-PT2 v0.4 boards sent to fab, ETA the 9th

01:41 Degi_ has joined #scopehal

01:44 Degi has quit [Ping timeout: 256 seconds]

01:44 Degi_ is now known as Degi

01:59 <_whitenotifier-f> [scopehal] azonenberg pushed 2 commits to master [+2/-0/±4] https://git.io/JIfes

01:59 <_whitenotifier-f> [scopehal] azonenberg f0fbd17 - LeCroyOscilloscope: fixed comment error

01:59 <_whitenotifier-f> [scopehal] azonenberg 8338bf1 - Added emphasis filter. Fixes #190.

01:59 <_whitenotifier-f> [scopehal] azonenberg closed issue #190: Add filter to apply pre/de emphasis to a signal - https://git.io/JJRNV

02:02 <azonenberg> https://www.antikernel.net/temp/emphasis-insertion.png

02:51 electronic_eel has quit [Ping timeout: 256 seconds]

02:51 electronic_eel has joined #scopehal

03:11 <_whitenotifier-f> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±8] https://git.io/JIftr

03:11 <_whitenotifier-f> [scopehal] azonenberg d015a6c - Filter: removed double-precision FindZeroCrossings(), redundant now that we have femtosecond timing resolution

03:11 <_whitenotifier-f> [scopehal] azonenberg 211e07f - FindZeroCrossings: cache results to avoid repeated calls. Fixes #355.

03:11 <_whitenotifier-f> [scopehal] azonenberg closed issue #355: Cache Filter::FindZeroCrossings results - https://git.io/JIe7W

03:42 electronic_eel has quit [Ping timeout: 256 seconds]

03:42 electronic_eel has joined #scopehal

03:49 <azonenberg> bluezinc: So, i had a chance to look at your i2s filter

03:50 <azonenberg> It looks decent but i question whether the display format you chose really makes the most sense

03:50 <azonenberg> think about the purpose of i2s, it's ultimately encoding analog data. So does outputting a protocol analyzer style text symbol view really make sense?

03:51 <bluezinc> You're suggesting a pair of waveforms out for L/R?

03:51 <azonenberg> Yes

03:51 <azonenberg> DownconvertFilter does exactly this, it outputs two I/Q waveforms

03:51 <azonenberg> so it's probably a good reference for how to do it

03:51 <azonenberg> The style you have isn't *wrong*, i just question whether it's the most useful given that ultimately you're encoding audio data

03:52 m4ssi has joined #scopehal

03:52 <azonenberg> also just a style note... even if you're the author of a single file, I try to keep the header up top consistent project wide

03:52 <azonenberg> i'm actually going to be reformatting that probably in January

03:53 <azonenberg> and also say "glscopeclient" rather than 'antikernel' which is where i lifted the comment from back when glscopeclient and scopehal were part of the antikernel project

03:54 <azonenberg> The Doxygen @author comment is intended to be used to tag a file with the primary author, but ultimately copyright is shared

03:55 <azonenberg> Other than that, it looks good. Do you agree that the dual waveform output style makes more sense in this case?

03:58 <_whitenotifier-f> [scopehal] mjgerm closed pull request #353: I2S Decoder Implementation - https://git.io/JkxFl

04:09 <_whitenotifier-f> [scopehal] azonenberg opened issue #356: TappedDelayLineFilter: support resampling for when tap delay is not an integer multiple of the sample rate - https://git.io/JIfGP

04:09 <_whitenotifier-f> [scopehal] azonenberg labeled issue #356: TappedDelayLineFilter: support resampling for when tap delay is not an integer multiple of the sample rate - https://git.io/JIfGP

06:20 m4ssi has quit [Remote host closed the connection]

06:52 juli966 has quit [Quit: Nettalk6 - www.ntalk.de]

07:37 m4ssi has joined #scopehal

08:14 electronic_eel has quit [Ping timeout: 260 seconds]

08:15 electronic_eel has joined #scopehal

11:54 <azonenberg> So i'm looking at the mold design i was working on for the AKL-PT2

11:55 <azonenberg> trying to remember if i had any last minute edits i needed to make

11:55 <azonenberg> or if it's good to order

12:30 <marshallh> https://i.imgur.com/JgGzbd5.jpg

12:30 <marshallh> speculation on what the traces are? i was thinking 1st layer of a pcb inductor but i have no idea

12:31 <marshallh> chip is intel iris XE max graphics

12:31 <marshallh> specifally DG1

12:32 <azonenberg> very interesting

12:32 <azonenberg> If i'm interpreting this right the darker color is copper and lighter is substrate

12:33 <marshallh> yes

12:33 <azonenberg> most of the fat traecs seem to be connected to capacitors

12:33 <marshallh> they have filled microvias so there could be layer changes that aren't visible

12:33 <azonenberg> Which does lead me to believe that those are from pcb inductors

12:33 <marshallh> yeah

12:33 <azonenberg> some kind of IVR

12:34 <marshallh> interesting that they would parallel many coils to get more current maybe

12:34 <azonenberg> possibly, yeah

12:34 <marshallh> definitely high current

12:35 <marshallh> my 3090 draws 420W during benchmark lmao

12:35 <azonenberg> Yeah

12:35 <azonenberg> Also, i'm probably going to be experimenting with pushing some scopehal compute to GPU in the nearish future

12:35 <azonenberg> like, actual waveform processing rather than just rendering

12:35 <marshallh> cuda?

12:35 <azonenberg> gl compute shaders

12:35 <marshallh> so vendor agnostic

12:35 <azonenberg> it's portable and integrates nicely with gl rendering

12:36 <azonenberg> the eye pattern in particular is pretty compute heavy

12:36 <marshallh> what part of the processing can be moved easily? i assume you arent talking protocol decode

12:36 <marshallh> maybe just symbol-level decode?

12:36 <azonenberg> So i want to experiment

12:36 <azonenberg> The stuff i want to do on GPU is mostly a few categories

12:37 <azonenberg> Lower level symbol decoding, DSP/math (FFTs, channel emulation, etc), and the ey epattern

12:37 <azonenberg> Also, some line coding as well

12:38 <azonenberg> sampling a signal on clock edges is probably easy to parallelize because you can just do one output sample per gpu thread

12:38 <azonenberg> decoding a sampled serial data stream to 8b10b symbols should be easy to parallelize too. as soon as you have digital samples on a uniform timebase you don't need to do any random access seeking

12:39 <azonenberg> i think the eye pattern is going to be a highish priority because it isn't SIMD friendly

12:39 <azonenberg> most of the other time consuming compute is already making heavy use of AVX2

12:40 <marshallh> cool

12:40 <azonenberg> the eye pattern filter is not. I have a test setup that shows off jitter histograms and eye patterns of pcie before and after removing de-emphasis

12:41 <azonenberg> according to vtune, a *single line* of code in the eye pattern filter is 3.4% of the total cpu-seconds used by the entire demo

12:41 <azonenberg> the eye pattern filter as a whole is 25.1% of the total cpu time

12:41 <marshallh> ouch

12:41 <azonenberg> It's been my top priority for optimization for a while and i've tuned it a bunch, it was worse before

12:41 <azonenberg> but i'm starting to hit limits

12:42 <azonenberg> one thing that seems to be a bit heavy is all of the float to int64 and back conversions

12:42 <azonenberg> unfortunately bitmaps are indexed by integer coordinates

12:42 <azonenberg> and math on voltages kinda has to be floating point

12:42 <azonenberg> especially if you're doing sub sample interpolation

12:43 <marshallh> does it have to be float though?

12:43 <marshallh> with 64bits you have more than enough for fixed

12:43 <marshallh> especially since you are targeting display

12:43 <azonenberg> the issue is that the voltages originally come in as fp32 from elsewhere in scopehal

12:44 <azonenberg> and DSP math is much friendlier in fp32

12:44 <azonenberg> also, AVX gets you double throughput in fp32 vs i64

12:44 <marshallh> hmm

12:44 <marshallh> i'm just used to doing all fixed point on fpga

12:44 <azonenberg> (and half the ram usage)

12:44 <marshallh> and integer always being faster on most embedded cpus

12:44 <azonenberg> GPUs are generally better at float

12:44 <marshallh> yeah if oyu are going GPU then def float

12:45 <azonenberg> in fact one of the reasons glscopeclient doesnt run on a lot of older intel integrated gpus is that i need int64s in shaders for timestamps of waveforms

12:45 <azonenberg> all internal timebase units are int64 femtoseconds

12:45 <azonenberg> (recently converted from picoseconds, fs gives much needed extra resolution)

12:45 <sorear> much more than double

12:46 <azonenberg> sorear: i havent checked latency or throughput on the FPU itself

12:46 <azonenberg> but i know you get twice as many per instruction

12:46 <azonenberg> not sure how IPC compares

12:48 <azonenberg> https://www.antikernel.net/temp/eye-loop.png

12:48 <azonenberg> btw

12:49 <azonenberg> This is what i'm working with. lots of scalar float instructions, it isnt even gonna be super easy to parallelize fine grained because of the potentially variable clock frequency

12:49 <azonenberg> so my current thought is to statically partition into N sub-blocks, one thread for each

12:50 <azonenberg> (sub-blocks of waveform samples)

12:50 <sorear> huh, haven't seen that tool before in use

12:50 <marshallh> cool

12:50 <azonenberg> then binary search to find the clock offset for each block

12:50 <azonenberg> then evaluate each block in parallel using essentially this same inner loop

12:50 <azonenberg> i don't see super easy opportunities for SIMD-ification here

12:51 <azonenberg> There might be potential to unroll at least some of the inner loop and SIMD-ify the interpolation or something

12:51 <azonenberg> but the actual output math has to be sequential because the pixel locations you're writing to are not consecutive

12:52 <sorear> well you can do a big SIMD-friendly loop to make a list of (address, increment) pairs, then a non-SIMD loop to apply them

12:53 <azonenberg> Yeah that is a possibility

12:53 <azonenberg> but the stuff up top finding the timestamp offsets also isnt super SIMD friendly

12:53 <sorear> (this is why riscv has the vectorized relaxed atomic adds everyone hates)

12:53 <azonenberg> the question is ultimately if all of the marshaling and unmarshaling costs more time than this way

12:54 <azonenberg> i'm also not sure how much i trust the current profile output

12:54 <azonenberg> i might need to run longer?

12:54 <azonenberg> because this is showing 2.9% of my time on "mov ecx, 0x40"

12:54 <azonenberg> which can't be right

12:55 <sorear> roughly what fraction of the time is the condition on line 472 true?

12:56 <azonenberg> depends on the multiplier of sample clock rate to symbol rate

12:56 <azonenberg> with 5 Gbps PCIe on a 40 Gsps scope, it's true 1/8 of the time

12:56 <sorear> it's superscalar execution, the machine picks one instruction executed on any given cycle to account that cycle against and it doesn't always make sense (my experience is with perf(1), but I think they're using the same hw features)

12:57 <azonenberg> interesting, they don't count every executing instruction? they pick one?

12:57 <azonenberg> in any case i would expect that the *actual* bottleneck in line 508 is the 0xaec1f add instruction, as that hits memory

12:57 <sorear> well it depends on what you're counting

12:58 <azonenberg> This does not bode well for the simd-ifying idea though

12:58 <sorear> perf defaults to "cycles", which stops every N cycles and records whatever the (I think, there are other possibilities) oldest unretired instruction is

12:58 <azonenberg> as it implies most of the expense here is the increments rather than the math

12:58 massi_ has joined #scopehal

13:00 <sorear> how much do we know about clock_edges? is it completely arbitrary?

13:00 <azonenberg> It's the output of a CDR PLL normally, but could also be something like a DDR DQS

13:01 <sorear> can we at least assume it's monotone?

13:01 <azonenberg> Yes, it is monotonic

13:03 <sorear> you could break the loop into N segments, and find the start point for each segment by binary search

13:03 <azonenberg> I just discussed that up top

13:03 <azonenberg> that would allow multithreading

13:03 <azonenberg> that would also be my strategy for running it on GPU

13:05 <azonenberg> i think what might make sense at this point is setting up the infrastructure for doing scopehal filters on GPU

13:05 <sorear> if you think about it you're basically doing a sorted-list merge

13:05 <azonenberg> implement some simple DSP filters like "subtract" as a start

13:06 <azonenberg> then work on some of the more complicated stuff

13:06 <azonenberg> i also have some more work i wanted to do on reducing unnecessary allocations and memcpy'

13:06 <azonenberg> so that might also be good to do first

13:07 <azonenberg> but basically i think what i will need to do is keep track of what waveform data lives on the CPU or GPU

13:07 <azonenberg> and move between them only when needed

13:08 <azonenberg> right now i'm using stl vectors with my AlignedAllocator to allocate memory on the CPU but that object model might need some retooling when it comes to GPU memory

13:08 <sorear> if the clock frequency is relatively stable, min_clock <= (clock_edges[i+1] - clock_edges[i]) <= max_clock, max_clock <= "a few" * min_clock you could make an inverse map

13:08 <azonenberg> Do i want to have that actual vector live on the gpu?

13:08 <azonenberg> or do i want a separate buffer on the gpu?

13:08 <azonenberg> how do i sync between them if so?

13:08 <azonenberg> etc

13:08 <azonenberg> especially during the transitional period there will be lots of processing done on both

13:09 <sorear> can GL4.5 compute shaders access app memory? I've only done things with OpenCL which can

13:10 <azonenberg> Not sure. I've only done the opposite

13:10 <azonenberg> mapping GPU memory so i can access it from CPU code

13:12 <sorear> it's just NUMA and I wish we could consistently treat it that way

13:13 <azonenberg> Yeah

13:14 <azonenberg> Anyway, i guess the time has come

13:14 <azonenberg> i was always going to be pushing compute to the gpu and now is as good a time as any

13:14 massi_ has quit [Remote host closed the connection]

13:14 <azonenberg> Let me take care of #354 first though

13:28 juli966 has joined #scopehal

14:02 <_whitenotifier-f> [scopehal] azonenberg pushed 3 commits to master [+0/-0/±15] https://git.io/JIJbz

14:02 <_whitenotifier-f> [scopehal] azonenberg ef8328e - EyePattern: removed antialiasing which isn't needed after femtosecond timebase conversion

14:02 <_whitenotifier-f> [scopehal] azonenberg 63b5fa3 - Initial implementation of dense packed waveform processing. Only supported for LeCroy and Tek scopes, and tapped delay line filters. Not persisted to files yet. Fixes #354.

14:02 <_whitenotifier-f> [scopehal] azonenberg 083769b - Added dense packed optimizations to a lot more filters

14:02 <_whitenotifier-f> [scopehal] azonenberg closed issue #354: Add flag to Waveform indicating "waveform is dense packed" - https://git.io/JIeQW

14:02 <_whitenotifier-f> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JIJbw

14:02 <_whitenotifier-f> [scopehal-apps] azonenberg 56046b0 - Updated to latest scopehal with zero crossing cache and dense pack optimizations

14:18 <_whitenotifier-f> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JIJpZ

14:18 <_whitenotifier-f> [scopehal] azonenberg 29fb6dd - DeEmbedFilter: implemented dense pack optimizations

14:18 <_whitenotifier-f> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±4] https://git.io/JIJp4

14:18 <_whitenotifier-f> [scopehal-apps] azonenberg c961070 - WaveformArea: fixed bug with handling of large (more than one sample) trigger phase shifts

14:25 <azonenberg> holy moley

14:25 <_whitenotifier-f> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIJhF

14:25 <_whitenotifier-f> [scopehal] azonenberg fc4641c - EyePattern: dense pack optimizations. Massive speedups.

14:26 <azonenberg> So by propagating the dense pack attribute and avoiding just a handful of int64-fp32 conversions in the eye decode

14:26 <azonenberg> I processed 23% more waveforms in the same time, and used only 93% of the CPU time in the eye filter

14:27 <azonenberg> average 40.7 ms/wfm (with two eyes on screen) before, now 30.8

14:27 <azonenberg> so that's roughly a 25% speedup

14:28 <azonenberg> I still retain all of the flexibility i had before for working with signals that have gone through complex processing and resampling, i'm just optimizing the common case of monotonic sampling at uniform intervals

14:28 <_whitenotifier-f> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIJjV

14:28 <_whitenotifier-f> [scopehal-apps] azonenberg a7c8ccd - Updated to latest scopehal

14:29 <azonenberg> I still think there is potential to vectorize too

14:29 <azonenberg> since it seems that fp32 to int64 conversions are so slow, even if most of the other stuff is serialized

14:30 <azonenberg> actually, i bet i can optimize even more by using int32s

14:31 <azonenberg> while timestamps can be large, no eye pattern is going to be more than 4 gigapixels

14:58 juli966 has quit [Quit: Nettalk6 - www.ntalk.de]

16:25 <Bird|otherbox> azonenberg: we call that a "beholder pattern"

16:28 <azonenberg> Bird|otherbox: lol

16:31 <azonenberg> also the eye is actually vectorizing better than i thought

16:31 <azonenberg> i'm not even quite done and i'm at a ~2x speedup *of the entire test case*

16:32 <azonenberg> 464 to 782 waveforms processed in 1 minute, using less cpu time

16:34 <azonenberg> i'm now up to almost 20 WFM/s from 8-9

16:34 <azonenberg> and i think 6 before i started this round of optimizing

17:14 _whitelogger has joined #scopehal

17:31 m4ssi has quit [Remote host closed the connection]

17:57 bvernoux has joined #scopehal

18:04 <bvernoux> hi

18:04 <bvernoux> I'm doing even more test on Rigol MSO5000 and I can confirm Rigol Firmware is really buggy to capture data ...

18:04 <azonenberg> Duh :p

18:04 <bvernoux> I heavily suspect a buffer overflow in the firmware ;)

18:05 <bvernoux> which corrupt the size of data ...

18:05 <bvernoux> as sometimes there is 0 data and sometimes too much ;)

18:05 <_whitenotifier-f> [scopehal] azonenberg pushed 5 commits to master [+0/-0/±7] https://git.io/JIUKj

18:05 <_whitenotifier-f> [scopehal] azonenberg 5664039 - EyePattern: refactored inner loop out to separate function

18:05 <_whitenotifier-f> [scopehal] azonenberg 23c24b9 - EyePattern: Refactoring of inner loop in preparation for AVX2 optimizations

18:05 <_whitenotifier-f> [scopehal] azonenberg bf5304f - EyePattern: Partial vectorization of inner loop

18:05 <bvernoux> it is a real mess

18:05 <_whitenotifier-f> [scopehal] ... and 2 more commits.

18:05 <bvernoux> they have clearly very bad synchro between scope and SCPI commands with probably mutex issue (or mutex missing ...)

18:07 <_whitenotifier-f> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIU6L

18:07 <_whitenotifier-f> [scopehal-apps] azonenberg 62c2b78 - Updated to latest scopehal

18:08 <bvernoux> also the fun is Rigol MSO5000 have Eye Analysis & Jitter ;)

18:08 <bvernoux> but those functions does not work and do nothing ;)

18:08 <bvernoux> like they planned to develop that and they never finished and they deliver the FW ;)

18:11 <miek> maybe they share bits of the firmware with the mso8k

18:12 <bvernoux> haha yes ;)

18:12 <d1b2> <TiltMeSenpai> do you have the "student discount" on your MSO5k

18:12 <bvernoux> I think there is potentially workaround

18:12 <bvernoux> by setting again the parameter with SCPI commands

18:12 <bvernoux> which are probably overwritten sometimes to be checked ;)

18:13 <bvernoux> their synchro is crazy bad anyway between scope and scpi commands

18:13 <bvernoux> we see the latency with the trigger ...

18:13 <bvernoux> something like 400ms

18:14 <bvernoux> it is like they are synchronizing badly some tasks (or even worse creating thread/destroying them to loose so much time)

18:15 <bvernoux> the worse is they are stacking like a FIFO all commands

18:16 <bvernoux> so it end in a dead spiral when something fail like it corrupt number of point to send

18:20 <_whitenotifier-f> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JIUPR

18:20 <_whitenotifier-f> [scopehal] azonenberg a2e1fec - Vectorized address calculation

20:10 nelgau has joined #scopehal

20:29 <azonenberg> Looks like the jitter spectrum filter is going to be another target for performance tuning soon

20:29 <azonenberg> every time i think it's time to start pushing compute to the GPU i find ways to squeeze more performance out of the software side lol

20:30 m4ssi has joined #scopehal

20:50 jn__ has quit [Quit: No Ping reply in 180 seconds.]

20:51 jn__ has joined #scopehal

21:16 m4ssi has quit [Remote host closed the connection]

21:21 <bvernoux> haha during test my Rigol has frozen

21:21 <bvernoux> I have C ansi code working on both Windows & Linux ;)

21:22 <bvernoux> Anyway so far Rigol MSO5000 is so instable with the SCPI I use ...

21:23 <bvernoux> need to check to use additional SCPI to see if that improve stability like adding ":WAV:STAR 1" & ":WAV:STOP " before to retrive the samples

21:26 <azonenberg> The impression i'm getting is rigol stuff just isn't stable enough for serious use

21:34 electronic_eel has quit [Ping timeout: 240 seconds]

21:34 electronic_eel has joined #scopehal

22:22 bvernoux has quit [Quit: Leaving]

23:53 smkz has quit [Quit: smkz]