azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/azonenberg/scopehal-cmake, https://github.com/azonenberg/scopehal-apps, https://github.com/azonenberg/scopehal | Logs: https://freenode.irclog.whitequark.org/scopehal
<azonenberg> So i'm spending a bit of time on performance to see how much i can improve things
<azonenberg> it seems I have a *lot* of time wasted in thread sync overhead
<azonenberg> my first test was 55.729 sec total, 511.5 sec of cpu time
<azonenberg> with 485 sec spent spinning
<azonenberg> mostly in gomp_simple_barrier_wait and gomp_team_barrier_wait_end
<azonenberg> then 52.5 sec in gomp_simple_barrier_wait
<azonenberg> now let's see what happens if i have no protocol decodes active
<azonenberg> 12 sec real, 62.6 sec cpu time, 50.5 spent in gomp_simple_barrier_wait
<azonenberg> so it seems like i have some very unbalanced openmp code that doesn't parallelize well
<azonenberg> Setting OMP_WAIT_POLICY=PASSIVE fixes that. i guess the default is to use spinlocks for some reason which seems a bit odd
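For context, a minimal sketch of the kind of code involved, with hypothetical names rather than the actual scopehal code: by default libgomp's worker threads spin for a while (an implementation-defined spin count) at the barrier that ends each parallel region before going to sleep, so a workload made of many short or unbalanced regions racks up time in gomp_*_barrier_wait; OMP_WAIT_POLICY=PASSIVE makes waiting threads block right away instead.

    // Hypothetical unbalanced parallel loop, for illustration only.
    // Build with -fopenmp; run with OMP_WAIT_POLICY=PASSIVE in the environment
    // to have idle threads sleep at the implicit barrier rather than spin.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    void Process(std::vector<float>& samples)
    {
        #pragma omp parallel for
        for(int64_t i = 0; i < (int64_t)samples.size(); i++)
        {
            // If per-iteration cost varies (or the region is very short),
            // threads that finish early wait at the barrier for the slowest one.
            samples[i] = std::sin(samples[i]);
        }
    }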
_whitelogger has joined #scopehal
<_whitenotifier-c> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±2] https://git.io/JfBxI
<_whitenotifier-c> [scopehal] azonenberg f553007 - DifferenceDecoder: don't specify an unusual thread count, this causes lots of forking overhead
<_whitenotifier-c> [scopehal] azonenberg 50f03fe - LeCroyOscilloscope: optimizations to waveform download postprocessing
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 2 commits to master [+0/-0/±2] https://git.io/JfBxL
<_whitenotifier-c> [scopehal-apps] azonenberg 1c8894a - Set thread pool size for scope drivers
<_whitenotifier-c> [scopehal-apps] azonenberg a710522 - Don't specify thread pool size for rendering prep
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JfBxO
<_whitenotifier-c> [scopehal-apps] azonenberg ca8fee7 - OscilloscopeWindow: don't run event loop when polling scopes
<azonenberg> well, that massively reduced cpu usage
<azonenberg> this was the first time i've actually run a profiler on glscopeclient lol
<lain> lol
<azonenberg> and i see a bunch more spots to improve things
<azonenberg> WFM/s i don't think is up by much, but cpu usage in operation is definitely down a lot
<azonenberg> and i think general responsiveness is up too but i need to try some bigger waveforms to have better data to see that
<azonenberg> just found a bunch of time wasted calling a virtual function that could have been just a simple member variable access
<azonenberg> etc
<azonenberg> ooooh just got a huge boost from that
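The pattern in question looks roughly like this (hypothetical names, not the actual glscopeclient code): a per-sample virtual getter in the hot loop is replaced by reading the value once into a local, or by a plain member variable access.

    #include <cstddef>
    #include <vector>

    class Channel
    {
    public:
        virtual ~Channel() {}
        // Virtual call: an indirect call per invocation, which the compiler
        // can't inline or hoist through a base-class pointer.
        virtual float GetVoltageRange() const = 0;
    };

    void ScaleSamples(const Channel* chan, std::vector<float>& samples)
    {
        // Read the value once instead of calling chan->GetVoltageRange()
        // on every iteration; it can't change mid-loop anyway.
        const float range = chan->GetVoltageRange();
        for(size_t i = 0; i < samples.size(); i++)
            samples[i] /= range;
    }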
Degi has quit [Ping timeout: 258 seconds]
Degi has joined #scopehal
<azonenberg> I now see some other red flags like OscilloscopeSampleBase::OscilloscopeSampleBase() taking 8.5 SECONDS of cpu time in what really should be a no-op
<monochroma> :o
<azonenberg> and that's now fixed yay
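For illustration, with made-up names rather than the real class: a default constructor that initializes its members runs once per element every time an STL container of samples is resized, which is how a "no-op" constructor can eat seconds of cpu time; the commit further down adds an empty default constructor so resize() does no redundant per-element work.

    #include <cstdint>
    #include <vector>

    struct Sample
    {
        int64_t m_offset;
        int64_t m_duration;
        float   m_voltage;

        // Zero-initializing here costs a full pass over memory the caller is
        // about to overwrite anyway:
        //   Sample() : m_offset(0), m_duration(0), m_voltage(0) {}

        // Empty default constructor: members stay uninitialized, so
        // vector::resize() becomes (nearly) free.
        Sample() {}

        Sample(int64_t off, int64_t dur, float v)
            : m_offset(off), m_duration(dur), m_voltage(v)
        {}
    };

    void Fill(std::vector<Sample>& samples, size_t n)
    {
        samples.resize(n);                          // calls Sample() n times
        for(size_t i = 0; i < n; i++)
            samples[i] = Sample(i, 1, 0.0f);        // every element overwritten here
    }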
<azonenberg> monochroma: so i'm continuing to optimize glscopeclient core stuff
<azonenberg> the test procedure is to load a save file that's preconfigured for my waverunner, analog channels only, both legs of two diffpairs for 100baseTX
<azonenberg> subtract them, CDR, eye pattern, bathtub for the upper eye, and eth protocol decode
<azonenberg> run that for 1 minute
<azonenberg> there's a loop in LeCroyOscilloscope::AcquireData() that took 22.3 sec of cpu time, i now have it down to 9.9
<azonenberg> that's 1 min of triggering as fast as i can on 1M points per waveform
<azonenberg> average cpu load is actually only about 1.3 cores active of 32
<azonenberg> the loop in question takes the raw adc samples that came off the scope and does a bunch of repacking and floating point math to convert adc codes to volts and scopehal sample objects
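A rough sketch of what such a conversion pass does (the function name and signature are made up for illustration, not the actual LeCroyOscilloscope code): the waveform descriptor supplies a per-channel vertical gain and offset, and each raw signed 8-bit code maps to code * gain - offset volts.

    #include <cstdint>
    #include <vector>

    // Illustrative only: convert raw signed 8-bit ADC codes to volts using the
    // gain/offset from the waveform descriptor.
    void ConvertCodesToVolts(const int8_t* codes, size_t count,
                             float gain, float offset,
                             std::vector<float>& volts)
    {
        volts.resize(count);

        // A simple branch-free pass over contiguous arrays; this form is easy
        // for the compiler to auto-vectorize.
        for(size_t i = 0; i < count; i++)
            volts[i] = codes[i] * gain - offset;
    }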
<_whitenotifier-c> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±2] https://git.io/JfRvx
<_whitenotifier-c> [scopehal] azonenberg 9fd5f0a - OscilloscopeSample: added empty default constructor for STL to use
<_whitenotifier-c> [scopehal] azonenberg 6b52304 - LeCroyOscilloscope: massive speedup of loop that converts ADC codes to volts
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 3 commits to master [+0/-0/±8] https://git.io/JfRvp
<_whitenotifier-c> [scopehal-apps] azonenberg 9617a3c - WaveformArea: no longer call lots of virtual functions in inner loop of waveform preparation
<_whitenotifier-c> [scopehal-apps] azonenberg 8e85aa1 - Added --nodata argument to load saved UI config without data on the command line
<_whitenotifier-c> [scopehal-apps] azonenberg d691d03 - Added --retrigger argument to start the trigger immediately upon loading a save file on the command line
<azonenberg> let's see, in my 1 minute test EyeDecoder2::Refresh spends 6.285 seconds in floor()
<azonenberg> wonder if i can do something about that
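One common way to shave that off, as a general sketch rather than what EyeDecoder2 necessarily ended up doing: when the value being floored is known to be non-negative (e.g. a time offset mapped into an eye-pattern bin), a plain integer cast truncates to the same result and skips the libm call.

    #include <cstdint>

    // Valid only for x >= 0: truncation toward zero equals floor() there.
    // For negative x the two differ, so this can't be a blind drop-in replacement.
    inline int64_t FastFloorNonNegative(double x)
    {
        return static_cast<int64_t>(x);
    }

    // Example use: picking the eye-pattern column for a sample at time t
    // (both in the same time unit), assuming t >= 0:
    //   size_t col = FastFloorNonNegative(t / time_per_column);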
<azonenberg> Just the last few changes i've done have brought CPU time in this 1-minute test down from 98.16 to 80.88 sec
<azonenberg> i really wish i could multithread the eye processing but since it's locked to a clock that can vary in frequency that would be tricky
<azonenberg> i might get back to that later
<azonenberg> i do at least multithread if you have more than one eye to process
<_whitenotifier-c> [scopehal] azonenberg opened issue #114: Add control for eye pattern saturation - https://git.io/JfRIo
<_whitenotifier-c> [scopehal] azonenberg labeled issue #114: Add control for eye pattern saturation - https://git.io/JfRIo
<_whitenotifier-c> [scopehal] azonenberg pushed 3 commits to master [+0/-0/±3] https://git.io/JfRtL
<_whitenotifier-c> [scopehal] azonenberg 5fa9d33 - DifferenceDecoder: performance optimizations to inner loop
<_whitenotifier-c> [scopehal] azonenberg 8ea61be - EyeDecoder2: performance optimizations
<_whitenotifier-c> [scopehal] azonenberg 44bb022 - EyeDecoder2: moved some UI calculations out of inner loop
<_whitenotifier-c> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JfRql
<_whitenotifier-c> [scopehal] azonenberg 8c0a439 - FindZeroCrossings: performance optimizations
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JfRq8
<_whitenotifier-c> [scopehal-apps] azonenberg 36e3fce - Indentation/comment fixes
<_whitenotifier-c> [scopehal-cmake] azonenberg pushed 2 commits to master [+0/-0/±3] https://git.io/JfRqB
<_whitenotifier-c> [scopehal-cmake] azonenberg df3e9c5 - Updated submodules
<_whitenotifier-c> [scopehal-cmake] azonenberg 5a98330 - Updated ignore properties for release/debug build directories
juli964 has quit [Quit: Nettalk6 - www.ntalk.de]
<_whitenotifier-c> [scopehal-apps] azonenberg opened issue #98: When reconnecting to a scope via a save file, channels are not added to the "add channel" menu - https://git.io/JfROv
<_whitenotifier-c> [scopehal-apps] azonenberg labeled issue #98: When reconnecting to a scope via a save file, channels are not added to the "add channel" menu - https://git.io/JfROv
<_whitenotifier-c> [scopehal] azonenberg pushed 3 commits to master [+0/-0/±3] https://git.io/JfROD
<_whitenotifier-c> [scopehal] azonenberg d29f824 - Minor performance tweaks in FindZeroCrossings, fixed some warnings
<_whitenotifier-c> [scopehal] azonenberg 2cfa2b2 - Performance tuning to Ethernet100BaseTDecoder
<_whitenotifier-c> [scopehal] azonenberg f8f2231 - LeCroyOscilloscope: optimized waveform downloading for digital channels
futarisIRCcloud has quit [Quit: Connection closed for inactivity]
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JfR3C
<_whitenotifier-c> [scopehal-apps] azonenberg de8b4e4 - WaveformArea: optimized index generation loop
<_whitenotifier-c> [scopehal-cmake] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JfR34
<_whitenotifier-c> [scopehal-cmake] azonenberg 5bd78ea - Updated submodules
futarisIRCcloud has joined #scopehal
juli964 has joined #scopehal
juli964 has quit [Quit: Nettalk6 - www.ntalk.de]
<azonenberg> So i did a bunch more poking around in vtune. i'm seeing a lot of NUMA accesses that i think can be optimized if i lock stuff to run on one package
<azonenberg> But that didn't pan out
<azonenberg> I'm also starting to wonder about retooling the waveform structure to be vector start, vector len, vector voltage
<azonenberg> rather than vector<start, len, voltage>
<azonenberg> this would allow much more efficient memory accesses i think
<azonenberg> but would also be a very nontrivial refactoring
<azonenberg> sounds like a weekend project perhaps :p
<azonenberg> The other possible optimization is non-sparse waveforms
<azonenberg> but that would be a lot more work
<azonenberg> in either case i think the refactoring to separate arrays makes more sense to do first
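Roughly, the layout change under consideration (illustrative member names only):

    #include <cstdint>
    #include <vector>

    // Today: array-of-structures. Touching just the voltages strides across
    // the interleaved timestamps, which wastes cache bandwidth and defeats SIMD.
    struct OscilloscopeSample
    {
        int64_t m_offset;    // start of the sample, in timebase ticks
        int64_t m_duration;  // length of the sample, in timebase ticks
        float   m_voltage;
    };
    // std::vector<OscilloscopeSample> m_samples;

    // Proposed: structure-of-arrays. Each field is one contiguous block, so the
    // voltage array can be streamed through SIMD loops or uploaded to a GPU as-is,
    // and a dense (uniformly sampled) waveform could later omit the timestamp
    // arrays entirely.
    class Waveform
    {
    public:
        std::vector<int64_t> m_offsets;
        std::vector<int64_t> m_durations;
        std::vector<float>   m_samples;
    };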
<_whitenotifier-c> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JfRMw
<_whitenotifier-c> [scopehal] azonenberg 3e3d0c0 - Clarified log message
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±3] https://git.io/JfRMr
<_whitenotifier-c> [scopehal-apps] azonenberg c0749a9 - Status bar now shows total number of trigger events for performance analysis. Removed a bunch of sleeps from ScopeThread.
<azonenberg> ok soooo let's see how much i can break with this refactoring
<azonenberg> deleting one class i use through literally the entire codebase (OscilloscopeSample), then completely retooling another (CaptureChannel)
<azonenberg> Which is now known as Waveform because i've changed the interface so much it's not even recognizable as the same class by this point :P
<azonenberg> (and that was a more logical name anyway)
<funkylab[m]> azonenberg: thanks for the twitter-dm heads up
<azonenberg> funkylab[m]: yeah so basically the end goal is to avoid strided access to all the waveform data
<azonenberg> because that's not SIMD friendly
<azonenberg> but this is going to be one heck of a massive diff and there's not really going to be any way to break it up into anything smaller
<funkylab[m]> honestly, if the stride is small enough, it's not that big of a deal – modern SIMD extensions do have slightly more elegant loading instructions
<azonenberg> there's some other advantages re GPU compute etc on this
<azonenberg> it also will eventually allow special-casing sparse vs dense waveforms
<azonenberg> without much of a code change to the datapath
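To illustrate that "without much of a code change" point, reusing the hypothetical Waveform sketch above: a filter that only transforms voltages never touches the timestamp arrays, so the same contiguous inner loop serves sparse and dense waveforms and vectorizes cleanly.

    #include <algorithm>
    #include <cstddef>

    // Waveform as sketched earlier: contiguous m_offsets / m_durations / m_samples.
    void Subtract(const Waveform& p, const Waveform& n, Waveform& out)
    {
        size_t len = std::min(p.m_samples.size(), n.m_samples.size());
        out.m_samples.resize(len);

        // Contiguous float arrays in, contiguous float array out: trivially
        // auto-vectorizable, and identical for sparse or dense data.
        for(size_t i = 0; i < len; i++)
            out.m_samples[i] = p.m_samples[i] - n.m_samples[i];

        // Timestamps are handled separately: copied for a sparse waveform,
        // or left empty for a dense one.
        out.m_offsets   = p.m_offsets;
        out.m_durations = p.m_durations;
    }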
<funkylab[m]> yep, serialization of uniform sample data gets more compact and everything
<funkylab[m]> basically, all PC-connected SDR devices I can think of just deliver plain contiguous IQ samples, within relatively large packets
<azonenberg> Yeah
<funkylab[m]> (some do have ... interesting wire formats; signed int 12, anyone?)
<azonenberg> Lol
<azonenberg> aka raw adc samples
<azonenberg> The reason i have this architecture is to allow for things that aren't directly digitized waveforms. Like "I2C sensor readings" or the instantaneous frequency of a waveform
<azonenberg> etc
<azonenberg> in which case you are very likely to have irregular sample intervals
<funkylab[m]> not even that; it's just that if you have a << 16 ENOB ADC, and don't do too much decimation in the DDC on-device, you can fit more through the USB link
<funkylab[m]> (that's Ettus B2xx btw)
<funkylab[m]> (without losing any significant digits)
<funkylab[m]> yep, fully understand why the more flexible format makes sense to something that essentially is meant to abstract all kinds of scopes
<azonenberg> yeah
<azonenberg> or sampling scopes, or specans, or really any sampled analog or digital data of some sort lol
<funkylab[m]> yep
<azonenberg> btw if you havent already, in just the past couple of days there have been massive (>20% in a few commits) performance boosts from various profiling and tweaking i've done
<azonenberg> as well as a great reduction in idle CPU
<funkylab[m]> You do NOT want your DDR5 development-enabling scope to deliver full-rate sampling continuously
<azonenberg> lol
<azonenberg> what, you don't have petabit ethernet on your workstation?
<funkylab[m]> talking of optical comms testers: same!
<funkylab[m]> (honestly, whenever I talk to people working on high-rate optical links, I get ADC envy. Like: anything I ever did in radio is totally baseband to these ADCs)
<azonenberg> lol i know the feeling
<funkylab[m]> and you've definitely worked larger BWs than I did!
<funkylab[m]> anyways, I've got to get some rest
<azonenberg> btw not sure if i mentioned on twitter or whatever
<azonenberg> But my long term plan is to push as much of the DSP as possible to compute shaders
<azonenberg> and maybe even look into stuff like RDMA
<funkylab[m]> :+1:
<funkylab[m]> that'd be extremely nice
<azonenberg> Because my vision for some of my longer term scopes (think 8 channels 1 GHz bw, one AD9213 and a SODIMM of DDR4 per channel)
<azonenberg> involves 40G or 100GbE as the backhaul to the host system
<azonenberg> and i want to be able to do analysis on that data in real time
<azonenberg> in my recent testing, the fastest performance i've got was 363 triggers in 1 minute pulling four channel 1M point 8-bit waveforms off a LeCroy WaveRunner 8104-MS
<funkylab[m]> Nice, bit of background on that: GNU Radio as a project is currently (since fosdem) trying to figure out how to come up with an architecture for incorporating accelerators (currently: FPGAs, GPUs, and there's stakeholders with domain-specific ASIC accelerators) into a signal processing workflow
<azonenberg> if you do the math that comes out to only 193.6 Mbps of actual waveform data hitting the PC, and glscopeclient spent 47 seconds of that minute waiting in Socket::RecvLooped()
<azonenberg> i.e. the majority of my time was waiting for the DSO to send me data
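For reference, the arithmetic behind the 193.6 Mbps figure:

    363 waveforms × 4 channels × 1,000,000 samples × 8 bits / 60 s = 193.6 Mbit/s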
<azonenberg> BTW, that was not me just downloading waveforms and throwing them away
<funkylab[m]> that's a good sign!
<azonenberg> i was subtracting two differential inputs for tx and rx of a 100 Mbit ethernet waveform
<azonenberg> doing full 100baseTX protocol decode on both lanes
<funkylab[m]> niiice
<azonenberg> then doing a separate CDR PLL filter
<azonenberg> rendering separate eye patterns
<azonenberg> AND generating a BER bathtub curve for the top eye of each lane
<azonenberg> (it's 3-level signaling, so there are two openings in the eye)
<funkylab[m]> :D enough getting my mouth watery! I'm off to bed!
<azonenberg> and cpu usage was near zero because i was spending most of my time waiting for samples lol
<azonenberg> of course "near zero" on a dual socket xeon 6144 workstation is still a fair bit of compute, but still
<azonenberg> anyway, my eventual goal is to be able to process >10 Gbps of realtime waveform data
<azonenberg> with nontrivial analytics on it