azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/azonenberg/scopehal-cmake, https://github.com/azonenberg/scopehal-apps, https://github.com/azonenberg/scopehal | Logs: https://freenode.irclog.whitequark.org/scopehal
<azonenberg> So i'm spending a bit of time on performance to see how much i can improve things
<azonenberg> it seems I have a *lot* of time wasted in thread sync overhead
<azonenberg> my first test was 55.729 sec total, 511.5 sec of cpu time
<azonenberg> with 485 sec spent spinning
<azonenberg> mostly in gomp_simple_barrier_wait and gomp_team_barrier_wait_end
<azonenberg> then 52.5 sec in gomp_simple_barrier_wait
<azonenberg> now let's see what happens if i have no protocol decodes active
<azonenberg> 12 sec real, 62.6 sec cpu time, 50.5 spent in gomp_simple_barrier_wait
<azonenberg> so it seems like i have some very unbalanced openmp code that doesn't parallelize well
<azonenberg> Setting OMP_WAIT_POLICY=PASSIVE fixes that. i guess the default is to use spinlocks for some reason which seems a bit odd
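For context, a minimal sketch of the kind of code involved, with hypothetical names rather than the actual scopehal code: by default libgomp's worker threads spin for a while (an implementation-defined spin count) at the barrier that ends each parallel region before going to sleep, so a workload made of many short or unbalanced regions racks up time in gomp_*_barrier_wait; OMP_WAIT_POLICY=PASSIVE makes waiting threads block right away instead.

    // Hypothetical unbalanced parallel loop, for illustration only.
    // Build with -fopenmp; run with OMP_WAIT_POLICY=PASSIVE in the environment
    // to have idle threads sleep at the implicit barrier rather than spin.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    void Process(std::vector<float>& samples)
    {
        #pragma omp parallel for
        for(int64_t i = 0; i < (int64_t)samples.size(); i++)
        {
            // If per-iteration cost varies (or the region is very short),
            // threads that finish early wait at the barrier for the slowest one.
            samples[i] = std::sin(samples[i]);
        }
    }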
_whitelogger has joined #scopehal
<_whitenotifier-c> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±2] https://git.io/JfBxI
<_whitenotifier-c> [scopehal] azonenberg f553007 - DifferenceDecoder: don't specify an unusual thread count, this causes lots of forking overhead
<_whitenotifier-c> [scopehal] azonenberg 50f03fe - LeCroyOscilloscope: optimizations to waveform download postprocessing
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 2 commits to master [+0/-0/±2] https://git.io/JfBxL
<_whitenotifier-c> [scopehal-apps] azonenberg 1c8894a - Set thread pool size for scope drivers
<_whitenotifier-c> [scopehal-apps] azonenberg a710522 - Don't specify thread pool size for rendering prep
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JfBxO
<_whitenotifier-c> [scopehal-apps] azonenberg ca8fee7 - OscilloscopeWindow: don't run event loop when polling scopes
<azonenberg> well, that massively reduced cpu usage
<azonenberg> this was the first time i've actually run a profiler on glscopeclient lol
<lain> lol
<azonenberg> and i see a bunch more spots to improve things
<azonenberg> WFM/s i don't think is up by much, but cpu usage in operation is definitely down a lot
<azonenberg> and i think general responsiveness is up too but i need to try some bigger waveforms to have better data to see that
<azonenberg> just found a bunch of time wasted calling a virtual function that could have been just a simple member variable access
<azonenberg> etc
<azonenberg> ooooh just got a huge boost from that
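The pattern in question looks roughly like this (hypothetical names, not the actual glscopeclient code): a per-sample virtual getter in the hot loop is replaced by reading the value once into a local, or by a plain member variable access.

    #include <cstddef>
    #include <vector>

    class Channel
    {
    public:
        virtual ~Channel() {}
        // Virtual call: an indirect call per invocation, which the compiler
        // can't inline or hoist through a base-class pointer.
        virtual float GetVoltageRange() const = 0;
    };

    void ScaleSamples(const Channel* chan, std::vector<float>& samples)
    {
        // Read the value once instead of calling chan->GetVoltageRange()
        // on every iteration; it can't change mid-loop anyway.
        const float range = chan->GetVoltageRange();
        for(size_t i = 0; i < samples.size(); i++)
            samples[i] /= range;
    }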
Degi has quit [Ping timeout: 258 seconds]
Degi has joined #scopehal
<azonenberg> I now see some other red flags like OscilloscopeSampleBase::OscilloscopeSampleBase() taking 8.5 SECONDS of cpu time in what really should be a no-op
<monochroma> :o
<azonenberg> and that's now fixed yay
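For illustration, with made-up names rather than the real class: a default constructor that initializes its members runs once per element every time an STL container of samples is resized, which is how a "no-op" constructor can eat seconds of cpu time; the commit further down adds an empty default constructor so resize() does no redundant per-element work.

    #include <cstdint>
    #include <vector>

    struct Sample
    {
        int64_t m_offset;
        int64_t m_duration;
        float   m_voltage;

        // Zero-initializing here costs a full pass over memory the caller is
        // about to overwrite anyway:
        //   Sample() : m_offset(0), m_duration(0), m_voltage(0) {}

        // Empty default constructor: members stay uninitialized, so
        // vector::resize() becomes (nearly) free.
        Sample() {}

        Sample(int64_t off, int64_t dur, float v)
            : m_offset(off), m_duration(dur), m_voltage(v)
        {}
    };

    void Fill(std::vector<Sample>& samples, size_t n)
    {
        samples.resize(n);                          // calls Sample() n times
        for(size_t i = 0; i < n; i++)
            samples[i] = Sample(i, 1, 0.0f);        // every element overwritten here
    }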
<azonenberg> monochroma: so i'm continuing to optimize glscopeclient core stuff
<azonenberg> the test procedure is to load a save file that's preconfigured for my waverunner, analog channels only, both legs of two diffpairs for 100baseTX
<azonenberg> subtract them, CDR, eye pattern, bathtub for the upper eye, and eth protocol decode
<azonenberg> run that for 1 minute
<azonenberg> there's a loop in LeCroyOscilloscope::AcquireData() that took 22.3 sec of cpu time, i now have it down to 9.9
<azonenberg> that's 1 min of triggering as fast as i can on 1M points per waveform
<azonenberg> average cpu load is actually only about 1.3 cores active of 32
<azonenberg> the loop in question takes the raw adc samples that came off the scope and does a bunch of repacking and floating point math to convert adc codes to volts and scopehal sample objects
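A rough sketch of what such a conversion pass does (the function name and signature are made up for illustration, not the actual LeCroyOscilloscope code): the waveform descriptor supplies a per-channel vertical gain and offset, and each raw signed 8-bit code maps to code * gain - offset volts.

    #include <cstdint>
    #include <vector>

    // Illustrative only: convert raw signed 8-bit ADC codes to volts using the
    // gain/offset from the waveform descriptor.
    void ConvertCodesToVolts(const int8_t* codes, size_t count,
                             float gain, float offset,
                             std::vector<float>& volts)
    {
        volts.resize(count);

        // A simple branch-free pass over contiguous arrays; this form is easy
        // for the compiler to auto-vectorize.
        for(size_t i = 0; i < count; i++)
            volts[i] = codes[i] * gain - offset;
    }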
<_whitenotifier-c> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±2] https://git.io/JfRvx
<_whitenotifier-c> [scopehal] azonenberg 9fd5f0a - OscilloscopeSample: added empty default constructor for STL to use
<_whitenotifier-c> [scopehal] azonenberg 6b52304 - LeCroyOscilloscope: massive speedup of loop that converts ADC codes to volts
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 3 commits to master [+0/-0/±8] https://git.io/JfRvp
<_whitenotifier-c> [scopehal-apps] azonenberg 9617a3c - WaveformArea: no longer call lots of virtual functions in inner loop of waveform preparation
<_whitenotifier-c> [scopehal-apps] azonenberg 8e85aa1 - Added --nodata argument to load saved UI config without data on the command line
<_whitenotifier-c> [scopehal-apps] azonenberg d691d03 - Added --retrigger argument to start the trigger immediately upon loading a save file on the command line
<azonenberg> let's see, in my 1 minute test EyeDecoder2::Refresh spends 6.285 seconds in floor()
<azonenberg> wonder if i can do something about that
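One common way to shave that off, as a general sketch rather than what EyeDecoder2 necessarily ended up doing: when the value being floored is known to be non-negative (e.g. a time offset mapped into an eye-pattern bin), a plain integer cast truncates to the same result and skips the libm call.

    #include <cstdint>

    // Valid only for x >= 0: truncation toward zero equals floor() there.
    // For negative x the two differ, so this can't be a blind drop-in replacement.
    inline int64_t FastFloorNonNegative(double x)
    {
        return static_cast<int64_t>(x);
    }

    // Example use: picking the eye-pattern column for a sample at time t
    // (both in the same time unit), assuming t >= 0:
    //   size_t col = FastFloorNonNegative(t / time_per_column);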
<azonenberg> Just the last few changes i've done have brought CPU time in this 1-minute test down from 98.16 to 80.88 sec
<azonenberg> i really wish i could multithread the eye processing but since it's locked to a clock that can vary in frequency that would be tricky
<azonenberg> i might get back to that later
<azonenberg> i do at least multithread if you have more than one eye to process
<_whitenotifier-c> [scopehal] azonenberg opened issue #114: Add control for eye pattern saturation - https://git.io/JfRIo
<_whitenotifier-c> [scopehal] azonenberg labeled issue #114: Add control for eye pattern saturation - https://git.io/JfRIo
<_whitenotifier-c> [scopehal] azonenberg pushed 3 commits to master [+0/-0/±3] https://git.io/JfRtL
<_whitenotifier-c> [scopehal] azonenberg 5fa9d33 - DifferenceDecoder: performance optimizations to inner loop
<_whitenotifier-c> [scopehal] azonenberg 8ea61be - EyeDecoder2: performance optimizations
<_whitenotifier-c> [scopehal] azonenberg 44bb022 - EyeDecoder2: moved some UI calculations out of inner loop
<_whitenotifier-c> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JfRql
<_whitenotifier-c> [scopehal] azonenberg 8c0a439 - FindZeroCrossings: performance optimizations
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JfRq8
<_whitenotifier-c> [scopehal-apps] azonenberg 36e3fce - Indentation/comment fixes
<_whitenotifier-c> [scopehal-cmake] azonenberg pushed 2 commits to master [+0/-0/±3] https://git.io/JfRqB
<_whitenotifier-c> [scopehal-cmake] azonenberg df3e9c5 - Updated submodules
<_whitenotifier-c> [scopehal-cmake] azonenberg 5a98330 - Updated ignore properties for release/debug build directories
juli964 has quit [Quit: Nettalk6 - www.ntalk.de]
<_whitenotifier-c> [scopehal-apps] azonenberg opened issue #98: When reconnecting to a scope via a save file, channels are not added to the "add channel" menu - https://git.io/JfROv
<_whitenotifier-c> [scopehal-apps] azonenberg labeled issue #98: When reconnecting to a scope via a save file, channels are not added to the "add channel" menu - https://git.io/JfROv
<_whitenotifier-c> [scopehal] azonenberg pushed 3 commits to master [+0/-0/±3] https://git.io/JfROD
<_whitenotifier-c> [scopehal] azonenberg d29f824 - Minor performance tweaks in FindZeroCrossings, fixed some warnings
<_whitenotifier-c> [scopehal] azonenberg 2cfa2b2 - Performance tuning to Ethernet100BaseTDecoder
<_whitenotifier-c> [scopehal] azonenberg f8f2231 - LeCroyOscilloscope: optimized waveform downloading for digital channels
futarisIRCcloud has quit [Quit: Connection closed for inactivity]
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JfR3C
<_whitenotifier-c> [scopehal-apps] azonenberg de8b4e4 - WaveformArea: optimized index generation loop
<_whitenotifier-c> [scopehal-cmake] azonenberg pushed 1 commit to master [+0/-0/±2] https://git.io/JfR34
<_whitenotifier-c> [scopehal-cmake] azonenberg 5bd78ea - Updated submodules
futarisIRCcloud has joined #scopehal
juli964 has joined #scopehal
juli964 has quit [Quit: Nettalk6 - www.ntalk.de]
<azonenberg> So i did a bunch more poking around in vtune. i'm seeing a lot of NUMA accesses that i think can be optimized if i lock stuff to run on one package
<azonenberg> But that didn't pan out
<azonenberg> I'm also starting to wonder about retooling the waveform structure to be vector start, vector len, vector voltage
<azonenberg> rather than vector<start, len, voltage>
<azonenberg> this would allow much more efficient memory accesses i think
<azonenberg> but would also be a very nontrivial refactoring
<azonenberg> sounds like a weekend project perhaps :p
<azonenberg> The other possible optimization is non-sparse waveforms
<azonenberg> but that would be a lot more work
<azonenberg> in either case i think the refactoring to separate arrays makes more sense to do first
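Roughly, the layout change under consideration (illustrative member names only):

    #include <cstdint>
    #include <vector>

    // Today: array-of-structures. Touching just the voltages strides across
    // the interleaved timestamps, which wastes cache bandwidth and defeats SIMD.
    struct OscilloscopeSample
    {
        int64_t m_offset;    // start of the sample, in timebase ticks
        int64_t m_duration;  // length of the sample, in timebase ticks
        float   m_voltage;
    };
    // std::vector<OscilloscopeSample> m_samples;

    // Proposed: structure-of-arrays. Each field is one contiguous block, so the
    // voltage array can be streamed through SIMD loops or uploaded to a GPU as-is,
    // and a dense (uniformly sampled) waveform could later omit the timestamp
    // arrays entirely.
    class Waveform
    {
    public:
        std::vector<int64_t> m_offsets;
        std::vector<int64_t> m_durations;
        std::vector<float>   m_samples;
    };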
<_whitenotifier-c> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://git.io/JfRMw
<_whitenotifier-c> [scopehal] azonenberg 3e3d0c0 - Clarified log message
<_whitenotifier-c> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±3] https://git.io/JfRMr
<_whitenotifier-c> [scopehal-apps] azonenberg c0749a9 - Status bar now shows total number of trigger events for performance analysis. Removed a bunch of sleeps from ScopeThread.
<azonenberg> ok soooo let's see how much i can break with this refactoring
<azonenberg> deleting one class i use through literally the entire codebase (OscilloscopeSample), then completely retooling another (CaptureChannel)
<azonenberg> Which is now known as Waveform because i've changed the interface so much it's not even recognizable as the same class by this point :P
<azonenberg> (and that was a more logical name anyway)
<funkylab[m]> azonenberg: thanks for the twitter-dm heads up
<azonenberg> funkylab[m]: yeah so basically the end goal is to avoid strided access to all the waveform data
<azonenberg> because that's not SIMD friendly
<azonenberg> but this is going to be one heck of a massive diff and there's not really going to be any way to break it up into anything smaller
<funkylab[m]> honestly, if the stride is small enough, it's not that big of a deal – modern SIMD extensions do have slightly more elegant loading instructions
<azonenberg> there's some other advantages re GPU compute etc on this
<azonenberg> it also will eventually allow special-casing sparse vs dense waveforms
<azonenberg> without much of a code change to the datapath
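To illustrate that "without much of a code change" point, reusing the hypothetical Waveform sketch above: a filter that only transforms voltages never touches the timestamp arrays, so the same contiguous inner loop serves sparse and dense waveforms and vectorizes cleanly.

    #include <algorithm>
    #include <cstddef>

    // Waveform as sketched earlier: contiguous m_offsets / m_durations / m_samples.
    void Subtract(const Waveform& p, const Waveform& n, Waveform& out)
    {
        size_t len = std::min(p.m_samples.size(), n.m_samples.size());
        out.m_samples.resize(len);

        // Contiguous float arrays in, contiguous float array out: trivially
        // auto-vectorizable, and identical for sparse or dense data.
        for(size_t i = 0; i < len; i++)
            out.m_samples[i] = p.m_samples[i] - n.m_samples[i];

        // Timestamps are handled separately: copied for a sparse waveform,
        // or left empty for a dense one.
        out.m_offsets   = p.m_offsets;
        out.m_durations = p.m_durations;
    }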
<funkylab[m]> yep, serialization of uniform sample data gets more compact and everything
<funkylab[m]> basically, all PC-connected SDR devices I can think of just deliver plain contiguous IQ samples, within relatively large packets
<azonenberg> Yeah
<funkylab[m]> (some do have ... interesting wire formats; signed int 12, anyone?)
<azonenberg> Lol
<azonenberg> aka raw adc samples
<azonenberg> The reason i have this architecture is to allow for things that aren't directly digitized waveforms. Like "I2C sensor readings" or the instantaneous frequency of a waveform
<azonenberg> etc
<azonenberg> in which case you are very likely to have irregular sample intervals
<funkylab[m]> not even that; it's just that if you have a << 16 ENOB ADC, and don't do too much decimation in the DDC on-device, you can fit more through the USB link
<funkylab[m]> (that's Ettus B2xx btw)
<funkylab[m]> (without losing any significant digits)
<funkylab[m]> yep, fully understand why the more flexible format makes sense to something that essentially is meant to abstract all kinds of scopes
<azonenberg> yeah
<azonenberg> or sampling scopes, or specans, or really any sampled analog or digital data of some sort lol
<funkylab[m]> yep
<azonenberg> btw if you havent already, in just the past couple of days there have been massive (>20% in a few commits) performance boosts from various profiling and tweaking i've done
<azonenberg> as well as a great reduction in idle CPU
<funkylab[m]> You do NOT want your DDR5 development-enabling scope to deliver full-rate sampling continuously
<azonenberg> lol
<azonenberg> what, you don't have petabit ethernet on your workstation?
<funkylab[m]> talking of optical comms testers: same!
<funkylab[m]> (honestly, whenever I talk to people working on high-rate optical links, I get ADC envy. Like: anything I ever did in radio is totally baseband to these ADCs)
<azonenberg> lol i know the feeling
<funkylab[m]> and you've definitely worked larger BWs than I did!
<funkylab[m]> anyways, I've got to get some rest
<azonenberg> btw not sure if i mentioned on twitter or whatever
<azonenberg> But my long term plan is to push as much of the DSP as possible to compute shaders
<azonenberg> and maybe even look into stuff like RDMA
<funkylab[m]> :+1:
<funkylab[m]> that'd be extremely nice
<azonenberg> Because my vision for some of my longer term scopes (think 8 channels 1 GHz bw, one AD9213 and a SODIMM of DDR4 per channel)
<azonenberg> involves 40G or 100GbE as the backhaul to the host system
<azonenberg> and i want to be able to do analysis on that data in real time
<azonenberg> in my recent testing, the fastest performance i've got was 363 triggers in 1 minute pulling four channel 1M point 8-bit waveforms off a LeCroy WaveRunner 8104-MS
<funkylab[m]> Nice, bit of background on that: GNU Radio as a project is currently (since fosdem) trying to figure out how to come up with an architecture for incorporating accelerators (currently: FPGAs, GPUs, and there's stakeholders with domain-specific ASIC accelerators) into a signal processing workflow
<azonenberg> if you do the math that comes out to only 193.6 Mbps of actual waveform data hitting the PC, and glscopeclient spent 47 seconds of that minute waiting in Socket::RecvLooped()
<azonenberg> i.e. the majority of my time was waiting for the DSO to send me data
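For reference, the arithmetic behind the 193.6 Mbps figure:

    363 waveforms × 4 channels × 1,000,000 samples × 8 bits / 60 s = 193.6 Mbit/s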
<azonenberg> BTW, that was not me just downloading waveforms and throwing them away
<funkylab[m]> that's a good sign!
<azonenberg> i was subtracting two differential inputs for tx and rx of a 100 Mbit ethernet waveform
<azonenberg> doing full 100baseTX protocol decode on both lanes
<funkylab[m]> niiice
<azonenberg> then doing a separate CDR PLL filter
<azonenberg> rendering separate eye patterns
<azonenberg> AND generating a BER bathtub curve for the top eye of each lane
<azonenberg> (it's 3-level signaling, so there are two openings in the eye)
<funkylab[m]> :D enough getting my mouth watery! I'm off to bed!
<azonenberg> and cpu usage was near zero because i was spending most of my time waiting for samples lol
<azonenberg> of course "near zero" on a dual socket xeon 6144 workstation is still a fair bit of compute, but still
<azonenberg> anyway, my eventual goal is to be able to process >10 Gbps of realtime waveform data
<azonenberg> with nontrivial analytics on it