azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing |,, | Logs:
maartenBE has quit [Ping timeout: 240 seconds]
maartenBE has joined #scopehal
Degi has quit [Ping timeout: 264 seconds]
Degi has joined #scopehal
<_whitenotifier-b> [scopehal] azonenberg pushed 3 commits to master [+0/-0/±5]
<_whitenotifier-b> [scopehal] azonenberg 39993e7 - Merged Convert8BitSamples and FillWaveformHeaders
<_whitenotifier-b> [scopehal] azonenberg 6c83322 - LeCroyOscilloscope: use OpenMP to parallelize conversions of >1M point waveforms
<_whitenotifier-b> [scopehal] azonenberg 5f083bb - Set up preferred thread count
<_whitenotifier-b> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±4]
<_whitenotifier-b> [scopehal] azonenberg 932bc15 - Performance improvements to LeCroy/VICP driver, reduced a bunch of needless data copying
electronic_eel has quit [Ping timeout: 256 seconds]
electronic_eel has joined #scopehal
Nero_ has joined #scopehal
Nero_ is now known as Guest86320
Guest86320 is now known as NeroTHz
<_whitenotifier-b> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1]
<_whitenotifier-b> [scopehal] azonenberg 22a7b22 - Don't mess with thread count here, it's now done in ScopeThread
<_whitenotifier-b> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±3]
<_whitenotifier-b> [scopehal-apps] azonenberg e0dde4c - Set thread count in ScopeThread based on number of cores
<azonenberg> So I'm thinking about bumping the minimum OpenGL version up to 4.5
<azonenberg> This requires a Radeon HD 5000 series, Broadwell or newer integrated GPU, or GeForce 400 or newer
<azonenberg> Current minimum is 4.3, which has the same discrete GPU requirements but will work back to Haswell integrated gfx
<azonenberg> So basically bumping to 4.5 means we will no longer work on 2013-vintage integrated GPUs
<azonenberg> and anything with a more recent cpu or discrete gpu is unaffected
<azonenberg> Any objections?
<azonenberg> monochroma, lain, Degi?
<azonenberg> noopwafel?
<lain> doesn't bother me
<monochroma> for what gain?
<azonenberg> monochroma: Direct state access
<azonenberg> basically allows you to modify a GL object by the handle rather than binding and then using the implied current object
<azonenberg> it's a much nicer API
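A minimal sketch of the bind-to-edit vs. direct state access contrast described above, assuming a valid GL 4.5 context is current; these are illustrative calls, not code from the scopehal tree:

```cpp
// Hypothetical buffer setup, for illustration only
const GLsizeiptr size = 1024;
const void* data = nullptr;

// GL 4.3 style: the buffer must be bound before it can be modified,
// which mutates the implicit "current buffer" global state.
GLuint buf;
glGenBuffers(1, &buf);
glBindBuffer(GL_ARRAY_BUFFER, buf);
glBufferData(GL_ARRAY_BUFFER, size, data, GL_STREAM_DRAW);

// GL 4.5 DSA style: operate on the object by handle, no binding needed,
// so code doesn't have to juggle (or fight over) global bind points.
GLuint buf2;
glCreateBuffers(1, &buf2);
glNamedBufferData(buf2, size, data, GL_STREAM_DRAW);
```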
<monochroma> i think restricting what hardware it can work on even more is not great, if both were supported in parallel that would be fine. but that's just my $0.02
<azonenberg> yes I know. But at the same time, how many people are trying to use glscopeclient on something that supports 4.3 but not 4.5?
<azonenberg> as far as i can tell the only possible configuration is a haswell system with no discrete gpu
<azonenberg> That's a pretty niche thing to be worried about
<azonenberg> If it required a new discrete gpu generation i'd be a lot more concerned
<azonenberg> almost anybody who still has a system that old probably has a discrete gpu for it to make it usable for anything modern, in which case they're good
<azonenberg> the only configuration i can imagine being affected would be a scope or other embedded platform with no free pcie slots and a haswell iGPU
<azonenberg> and at least lecroy went from ivy bridge to skylake with their motherboards :p
<monochroma> i guess we will see what happens
<azonenberg> I mean in general i want to have good hardware support, but i also do not have the resources to have a zillion implementations of everything
<azonenberg> and haswell is fairly old
<azonenberg> and like i said a haswell desktop can just get a 2014-or-newer gpu installed and it'll be fine
<azonenberg> It already doesn't work in virtualbox because virtualbox doesn't support gl newer than 2.1
* monochroma looks at her main desktop with an i5-4440 haswell that usually doesn't have a discrete GPU in it :P
<azonenberg> Lol
<azonenberg> i mean i still have a haswell desktop too
<azonenberg> But it has a discrete gpu
<azonenberg> Do you actually use glscopeclient on it?
<azonenberg> basically what i'm trying to do is multithread the "copy waveform data to the gpu" logic
<sorear> I misread that as Radeon RX 5000 series and was about to object to your deprecation timeline
<azonenberg> This is hard to do without direct state access
<azonenberg> sorear: lol
<monochroma> user@scopedev:~$ lscpu | grep 'Model name'
<monochroma> Model name: Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
<sorear> they have the same number so they're the same year right
<monochroma> :P
<azonenberg> monochroma: ok fine i'll see what i can do to keep it at 4.3 for now
<azonenberg> if you care that much?
<azonenberg> I will give you a discrete gpu if it comes to that :p
<monochroma> it's not a big deal, i haven't had time for scope dev in a while, so when i have time i will figure something out
<azonenberg> ah ok
<monochroma> but, idk how many other people are running similar configs
<azonenberg> i mean to be quite honest you're due for an upgrade anyway
<azonenberg> Yeah thats why i was asking around
<azonenberg> i don't want to require the latest and greatest hardware, so for example all of the AVX optimization i've done recently has runtime cpu detection and there are fallbacks to the old versions
<azonenberg> But i feel like needing a 6-year-old computer is not an unreasonable requirement
<azonenberg> needing a modern xeon is :p
<azonenberg> what i'm hoping to do here is eliminate a bunch of copies
<azonenberg> right now i create temporary buffers, write waveform data into them, then glBufferData them to the GPU
<azonenberg> What i want to do instead is glMapNamedBuffer() then write directly into the buffer as i convert from the internal representation to the GPU-friendly representation
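A sketch of that zero-copy idea, assuming a GL 4.5 context; `buf`, `raw`, `npoints`, and `ConvertSample` are hypothetical stand-ins, not scopehal code:

```cpp
// Instead of converting into a temporary std::vector and then calling
// glBufferData (an extra copy), map the GL buffer and convert straight
// into driver-owned memory.
glNamedBufferData(buf, npoints * sizeof(float), nullptr, GL_STREAM_DRAW); // allocate only
float* dst = reinterpret_cast<float*>(glMapNamedBuffer(buf, GL_WRITE_ONLY));
for(size_t i = 0; i < npoints; i++)
    dst[i] = ConvertSample(raw[i]);   // hypothetical ADC-code-to-volts conversion
glUnmapNamedBuffer(buf);
```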
<Degi> hi
<Degi> Hm then it probably won't run on my laptop but I've never used it there either
<Degi> Oh nevermind, I don't think that 4.3 does either
<Degi> Hmh, my newest laptop has a 4210M... It'd be kinda nice if we can support both, or an alt mode where it does CPU rendering
<azonenberg> Degi: that's not happening any time soon. i'm moving more and more stuff to compute shaders
<Degi> Hmh, maybe some kinda legacy mode
<azonenberg> that was kinda always the endgame, doing protocol decodes and math on the gpu
<azonenberg> That would basically require two implementations of everything
<azonenberg> I don't have the resources for that
<Degi> Okay
<azonenberg> Right now the core non-negotiable requirements are a 64-bit CPU and a gpu with compute shader support
<azonenberg> I'm going to try to hold off on needing gl4.5
<_whitenotifier-b> [scopehal-apps] azonenberg pushed 3 commits to master [+0/-0/±12]
<_whitenotifier-b> [scopehal-apps] azonenberg 7367430 - Refactoring: split PrepareGeometry into PrepareGeometry (parallelizable) + DownloadGeometry (must run in render thread)
<_whitenotifier-b> [scopehal-apps] azonenberg 4d44724 - Parallelize PrepareGeometry()
<_whitenotifier-b> [scopehal-apps] azonenberg e51ce62 - WaveformArea::PrepareGeometry now outputs directly to OpenGL memory rather than doing separate glBufferData operations
<azonenberg> There's still just a little bit more work being done in the main thread than I'd like but this is good progress and has eliminated a lot of wasted work
<Degi> Oh neat
<noopwafel> as long as it's just glscopeclient, do whatever makes sense for you, I think
<noopwafel> (from my perspective)
<azonenberg> noopwafel: right now my "has to still work on this" system is a haswell i5 i got near the end of my phd, with an nvidia discrete gpu
<noopwafel> I will swap out haswell thing for something with a Radeon 540
<noopwafel> myself
<azonenberg> anyway i decided to stick with GL 4.3 for the short term
<azonenberg> so haswell integrated gfx will work
<azonenberg> sandy/ivy bridge will not, but they never did
* monochroma pats her sad x220 :<
<azonenberg> monochroma: oh come on get lain to put in some pcie bodgewires
<azonenberg> swap out one of the usb2 ports w/ usb3 and put pcie on the SS pins
<azonenberg> then make a 2080 ti dongle
<azonenberg> :P
<noopwafel> it is not going to work on my poor x200 whatever happens, which is one of the cheap disposable laptops I dumped on tables so I could do workshops at CCC
<azonenberg> Yeah. I don't expect it will ever work without compute shader support
<azonenberg> That was a core requirement almost from day one
<noopwafel> but for this I can just do a horrible alternative frontend, given my previous UI was a python thing with a graph, some comboboxes and an 'arm!' button :-)
<azonenberg> lol
<azonenberg> yeah the scopehal library will work on whatever
<azonenberg> I do have special case optimizations for avx2 and avx512 in the lecroy driver now
<azonenberg> but i dynamically swap those in based on cpuid detection and fall back to a default "gcc -m64" build otherwise
<noopwafel> ah right, I rescued my pico
<noopwafel> so I should see how much I can get
<electronic_eel> my pc on the electronics workbench is a sandy bridge i3-2100. currently I mostly need it for viewing ibom and hooking up some jtag or serial adapters. I guess I'll have to upgrade when I want a 40GbE link to the scope there ;)
<noopwafel> it can pull 200MS/s so I guess I will be very cpu-limited too
<noopwafel> might be good motivation to add oversampling support
<azonenberg> noopwafel: oversampling what?
<noopwafel> azonenberg: oversampling on the scope side, because playing with streaming
<azonenberg> oh you mean to inflate the number of samples we get?
<azonenberg> electronic_eel: yeah good luck saturating 40GbE with that
<noopwafel> I mean getting the fpga to average together 10 traces or so, reducing stream to 100MS/s
<azonenberg> oh you mean hardware averaging?
<noopwafel> yeah
<noopwafel> pico call it 'resolution enhancement' and people around me use all kinds of different terms to refer to this :-)
<noopwafel> it's a bit fiddly driver-side because suddenly you can't just pass around int8_t any more
* monochroma imagines an oscilloscope with a "Turbo" button
<azonenberg> noopwafel: yeah i already have "HD mode" support on lecroy for reading 16-bit samples (typically 10/12 bits on the hardware side but padded to 16) from HDO series scopes
<noopwafel> ah right they call it downsampling internally, because different modes
<noopwafel> azonenberg: nice, I can re-use more of your code :-)
<azonenberg> noopwafel: also based on current profiling of the code running on a lecroy at 400 Mbps
<azonenberg> i have concerns re our ability to do 10G or 40G on a single thread viably
<azonenberg> Even in push mode
<azonenberg> We may want to consider a multi-stream protocol so we can use several threads for RX
<azonenberg> say one socket per channel or something
<noopwafel> that would I guess be very easy to do
<azonenberg> Yeah one per channel would map nicely to what we have now. Would just need some sync logic to ensure the waveforms are aligned right to the same trigger
<_whitenotifier-b> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1]
<_whitenotifier-b> [scopehal] azonenberg d8db8f2 - LeCroyOscilloscope: fixed loop bounds error
<azonenberg> soooo it seems i've been doing my fft's all wrong
<azonenberg> the setup is more expensive than the execution
<azonenberg> i need to be caching the setup
<azonenberg> like, by a factor of 20
<miek> ohh yeah, the wisdom stuff?
<azonenberg> ffts_init_1d_real is taking 20 cpu-sec in my test
<azonenberg> ffts_execute_1d_real is taking 1 sec
<azonenberg> :p
<miek> lol
<noopwafel> oof :)
<azonenberg> The other thing i'm spending a lot of time in is SParameterVector::InterpolatePoint
<azonenberg> Because right now when doing de-embedding and channel emulation i resample the s2p to the FFT bin resolution every single waveform
<azonenberg> that should also be cached
<azonenberg> So it looks like in general my channel emulation and de-embedding can be *massively* sped up
<azonenberg> probably by an order of magnitude or close to it
_whitelogger has joined #scopehal
<azonenberg> Gonna try and at least get the bare PCBs going to the US (i.e. not needing customs forms which take time to fill out) shipped asap
<azonenberg> hopefully everything this week
<NeroTHz> azonenberg, how do you interpolate the sparameters?
<NeroTHz> Just linear? or do you fit a curve?
<azonenberg> NeroTHz: As of now, linear. But improving that is on the wishlist
<azonenberg> most of the s2ps i measure on my VNA have a ton of points, like 10K from 300 kHz to 6 GHz
<azonenberg> So the spacing between them is tight enough almost any interpolation will work just fine :p
<azonenberg> with field solver output or vendor models that are less dense it will be a bigger deal
<_whitenotifier-b> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±4]
<_whitenotifier-b> [scopehal] azonenberg 9614f63 - DeEmbedDecoder: cache interpolated S-parameters
<_whitenotifier-b> [scopehal] azonenberg ede70ff - DeEmbedDecoder: Cache FFT plan between iterations
<azonenberg> noopwafel, miek: ok so yeah i had no idea fft setup was such an expensive operation. i thought the function was just filling out a context object or something (I know ~zilch about how FFTs work under the hood)
<azonenberg> Test setup: 1 minute on 4 channels, three just rendering and one with four different emulated channels on it (an AKL-PT1 with each ground accessory)
<azonenberg> Profiling run 32: 90.48 CPU-sec, 11.139 sec in SParameterVector::InterpolatePoint, 23.661 in ffts_init_1d_real, and a total of 45.248 in DeEmbedDecoder::DoRefresh()
<azonenberg> Run 34: 60.974 CPU-sec, interpolation doesn't even show up in the list of hot spots, init is just a couple of milliseconds
<azonenberg> DeEmbedDecoder::DoRefresh() now takes 13.114 cpu-sec
<azonenberg> So it's 2.5x faster now lol
<NeroTHz> azonenberg, if you have low point count, my experience is that a spline interpolation can be good. You avoid sharp edges causing timedomain ripple
<azonenberg> and that's without any vectorization or optimization on the actual de-embedding loop itself which is where most of the time goes i think
<azonenberg> the main inner loop is 4.1 sec of that 13
<NeroTHz> of course, if you have enough points it is fine, but I also worked a lot with simulated results, and when you have your cluster spend 2 hours per frequency point, I want to have as few of those as I can get away with
<NeroTHz> but I guess for now it is not a priority, as your application usually means you can just measure it :p
<azonenberg> Yeah one of the things on my near term todo is getting data to ground truth my channel emulation
<azonenberg> so measure a signal directly, measure s-parameters of a 2-port network
<miek> azonenberg: lol, nice
<azonenberg> then measure the signal through that network and compare to channel emulation on the original signal
<azonenberg> Anyway there's definitely room to optimize more here
<miek> azonenberg: iirc the fftw docs are pretty good on mentioning stuff like that - may be worth a read even if you're not using fftw itself, it probably applies across other impls
<azonenberg> 6.089 sec are in the output loop, of which most is spent in STL push-back's
<azonenberg> because i forgot to preallocate the output buffer
<azonenberg> then 4.178 is the actual de-embedding loop which should be possible to vectorize
<azonenberg> then 0.882 is the forward FFT and 1.274 is the inverse fft
<Degi> Is it possible to make that part more efficient?
<azonenberg> I think all of it except the FFT can be optimized quite a bit
<azonenberg> i'm not going to try to make ffts faster :p
<Degi> I mean maybe doing away with it idk
<azonenberg> Once i'm done tuning this, the CTLE and FFT filters can likely get some of the same tweaks applied
<NeroTHz> the way many commercial software packages use it is to use the s-parameters to generate a few taps worth of equalizer
<azonenberg> NeroTHz: I may make a *separate* filter that does this
<azonenberg> But this filter is specifically for full channel emulation and de-embedding, rather than basic cable loss compensation
<azonenberg> Long term i want to have a large toolbox of filters that do similar things but have different implementations that are specialized for various purposes
<miek> i spent quite a bit of time optimising fft stuff for my spectrogram viewer, that happily lets you pan through >>200GB sdr captures like it's nothing ;D
<azonenberg> for example a FFT based CTLE that's very faithful to how an actual hardware CTLE would work
<Degi> miek: Oh neat!
<azonenberg> or a separate FIR based equalizer that has similar frequency response
<azonenberg> and lets you trade speed off against accuracy
<azonenberg> i.e. more or less taps
<azonenberg> But first i want to get the mathematically "clean" implementation done
<azonenberg> then worry about "close enough for most purposes" optimizations
<azonenberg> All of the tuning i've done so far has been either trivial algorithmic optimizations like caching values instead of recomputing them, or straightforward vectorizations of an existing implementation
<azonenberg> Long term i want the vast majority of the performance critical DSP code to be either GPU or heavily vectorized CPU
<azonenberg> But obviously initial implementations are focusing on 'make it work first'
<miek> btw, one other thing i found when profiling stuff like this way back was log10 being really slow. swapping out for log2(x)/log2(10) was way better ¯\_(ツ)_/¯
maartenBE has quit [Ping timeout: 265 seconds]
maartenBE has joined #scopehal
_whitelogger has joined #scopehal
NeroTHz has quit [Read error: Connection reset by peer]
m4ssi has joined #scopehal
m4ssi has quit [Remote host closed the connection]
m4ssi has joined #scopehal
m4ssi has quit [Remote host closed the connection]
bvernoux has quit [Quit: Leaving]
m4ssi has joined #scopehal
m4ssi has quit [Remote host closed the connection]