<sb0_>
whitequark, why did you put the try/finally inside the loop for PulseRate?
<sb0_>
test zero cost exceptions?
<sb0_>
this should have been applied to stable-1 as well I think
<whitequark>
yes
<whitequark>
(test zero cost exceptions)
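For context, a minimal sketch of the pattern being discussed (illustrative only, not the actual ARTIQ test source; device names are assumed): a try/finally placed inside the timing loop exercises the zero-cost exception machinery on every iteration without adding work to the happy path.

    from artiq.experiment import *

    class PulseRate(EnvExperiment):
        def build(self):
            self.setattr_device("core")
            self.setattr_device("ttl_out")

        @kernel
        def run(self):
            dt = 1000*ns
            for i in range(10000):
                try:
                    delay(dt)
                    self.ttl_out.pulse(dt)
                finally:
                    pass  # nothing to clean up; only the unwind tables matter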
<whitequark>
ok, I got everything working on 3.7
<whitequark>
but I can't push again because of this crappy ADSL...
<whitequark>
I must say that migrating up two LLVM versions was rather painless, nothing was broken
<whitequark>
I had to fix two bugs where I used a wrong identifier in a merge, and a shitload of git merge issues, but that's it
<sb0_>
whitequark, should we do 1.0rc2 now or merge your compiler changes before?
<sb0_>
if we don't merge your compiler changes, I suppose that building 1.0rc2 with the old llvm on the buildserver will cause all sorts of conda problems?
<whitequark>
it will not if you restrict the llvm-or1k version in artiq conda package
<sb0_>
ok, how do you do that?
<whitequark>
- llvm-or1k 3.5.*
<whitequark>
note no = or == ... conda is being very straightforward again
<sb0_>
there is no llvm-or1k in artiq deps
<whitequark>
yes
<sb0_>
there is llvmlite-artiq
<whitequark>
add it there
<whitequark>
and restrict it
<sb0_>
will that be OK?
<whitequark>
the dependency solver will then select the right version of both llvm-or1k and llvmlite-artiq
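A sketch of what that restriction could look like in the artiq recipe (standard conda-build meta.yaml layout assumed; the version shown is purely illustrative):

    requirements:
      run:
        - llvmlite-artiq 0.5.*    # illustrative pin selecting the LLVM-3.5-based build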
<whitequark>
well, at least llvmlite merges PRs quickly
<rjo>
why did you get so many merge conflicts in the tests?
<whitequark>
rjo: tests?
<whitequark>
in llvm-or1k or where?
<rjo>
llvm-3.7 merge
<rjo>
yes
<whitequark>
because I went from 3.6.2 to 3.7.1
<whitequark>
both of which were branched at some point from the common history
<whitequark>
I know 3.7.0 unexpectedly broke ABI, and the 3.6 releases before 3.6.2 might have had some similar shit
<whitequark>
so I did not go hunting for 3.7.0 and merging there, and then merging .2 again
<whitequark>
all this crap in the history is irrelevant anyway because once I'm in sync with 3.9 I'll submit it upstream
<whitequark>
it'll probably take half a year, but hopefully after 3.9 all this refactoring becomes the job of whoever wants the refactoring.
<rjo>
sb0_: if they get a cpu, and if they get artiq-python, then floating point could be a reason.
<sb0_>
the only things I can think of are 1) running a DSP card standalone, which we don't want to do because of enclosure/power supply issues 2) some hypothetical situation where the ARM provides a clear processing advantage and we want to avoid the round trip latency to the master board
<rjo>
and the 5-10 fold increase in cpu speed.
<rjo>
but no. i see no reason for zynq unless they are in fact cheaper and the loss of fabric IO pins does not hurt us.
<sb0_>
what about SDRAM?
<sb0_>
connect it to the fabric and make the MPSoC permanently unusable? or connect to MPSoC pins and deal with the hardblock memory controller stuff?
<sb0_>
do we even have the first option i.e. didn't they remove odelays or something similar...
<sb0_>
ethernet has the same issue, but io/bandwidth requirements are less
<sb0_>
and we don't really need ethernet on dsp cards anyway.
<sb0_>
rjo, I'm still hesitant over SMP+digital vs. FMC
<sb0_>
FMC is really cramped
<sb0_>
and for the first option, well the daughter card can potentially use most of the space above the AMC, so the space taken by the SMP plugs doesn't sound like an issue
<rjo>
you can also use both sides of the mezzanine.
<sb0_>
and not having the DACs there makes the RF daughtercards simpler and potentially hand-solderable, they do not require any FMC hacks for the necessary voltages
<rjo>
but it's not that cramped. afaict 4 smp (maybe 5 if somebody wants a clock to go there) plus a decent 40-pin digital connector with lots of grounds uses more space.
<rjo>
and the space on the carrier is also a bit limited looking at afck/afc and the additional power supplies that are needed.
<sb0_>
is there any standard for transceiver pins on FMC or do you just put them anywhere?
<rjo>
hand soldering the mezzanine is an illusion imho. if that's needed people can still castellate on a "prototyping"/"basic" mezzanine
<rjo>
standard for where they are located on the package?
<rjo>
i am pretty sure you can choose any of the LA* pairs.
evilspirit has joined #m-labs
bb-m-labs has quit [Quit: buildmaster reconfigured: bot disconnecting]
bb-m-labs has joined #m-labs
kyak_ has joined #m-labs
<sb0_>
do I read that right that only HPC has them?
<sb0_>
what is GBTCLK0_M2C on LPC then...
larsc has quit [Ping timeout: 244 seconds]
kyak has quit [Ping timeout: 244 seconds]
<GitHub147>
[artiq] whitequark pushed 2 new commits to master: https://git.io/vVGn3
<GitHub147>
artiq/master b8bd344 whitequark: compiler: use correct data layout.
<GitHub147>
artiq/master 6e02839 whitequark: compiler: update for LLVM 3.7.
<sb0_>
rjo, no. LPC has only two transceiver pairs defined by the standard. plus a transceiver clock.
<sb0_>
rjo, btw, if we have two FMCs, it sounds difficult to have another connector for TTLs on the AMC
<rjo>
i see no way of having a ttl connector and as many sma connectors as we are planning. independent of FMC versus NIH
<whitequark>
bb-m-labs: force build --props=package=llvm-or1k conda-all
<bb-m-labs>
build #35 forced
<bb-m-labs>
I'll give a shout when the build finishes
<sb0_>
whitequark, can you move those new packages outside of the main channel? i'll do it this time...
<sb0_>
conda breakage on the buildserver is one thing, on users machines, it's much worse.
<whitequark>
sb0_: llvm-or1k is not installed on user machines
<sb0_>
there was a llvmlite-artiq uploaded too
<whitequark>
that one isn't broken
<whitequark>
but, I will move it out
<sb0_>
did you thoroughly test it with 1.0rc1?
<whitequark>
because with it debug information isn't emitted, which is bad
<sb0_>
on windows and linux?
<sb0_>
if not, then it should not be in the main channel
<whitequark>
like I said, the package isn't broken
<whitequark>
but the combination of it and ARTIQ are
<whitequark>
is
<whitequark>
oh, you already moved it. good.
<sb0_>
the problem I'm trying to avoid is: someone installs 1.0rc1, it pulls this new llvmlite-artiq package that might have some bug (since it's untested), and artiq "stable" doesn't work.
<rjo>
anybody have the actual ansi/vita 57.1 standard around?
<whitequark>
yes, you are correct in moving it away.
<sb0_>
rjo, yes
<rjo>
i can't find where my alma mater has hidden it...
<_florent_>
sb0: how does one use SDRAM on artix-7 if there is no ODELAY? --> MIG uses other internal primitives (that are not documented...)
<whitequark>
rjo: while working on llvmlite I have devised a way to speed up compilation substantially without redoing the entirety of llvmlite
<whitequark>
I think it's possible to rework the innards of llvmlite to use the C API underneath without breaking the interface
<rjo>
whitequark: without serializing as IR?
<whitequark>
yeah, remove all the string concatenation crap
<whitequark>
the LLVM project explicitly says not to do this, I don't know why they did it in the first place
<whitequark>
not only is it slow, it's also very fragile... they have a few things covered in tests in master that don't even parse as LLVM IR with the corresponding LLVM version
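A rough sketch of the idea (this is not llvmlite's actual internals, just the shape of it): construct IR through the LLVM-C API via ctypes instead of concatenating IR strings and reparsing them.

    import ctypes

    llvm = ctypes.CDLL("libLLVM-3.7.so")   # library name is platform-dependent

    llvm.LLVMModuleCreateWithName.restype = ctypes.c_void_p
    llvm.LLVMModuleCreateWithName.argtypes = [ctypes.c_char_p]
    llvm.LLVMInt32Type.restype = ctypes.c_void_p
    llvm.LLVMFunctionType.restype = ctypes.c_void_p
    llvm.LLVMFunctionType.argtypes = [ctypes.c_void_p, ctypes.c_void_p,
                                      ctypes.c_uint, ctypes.c_int]
    llvm.LLVMAddFunction.restype = ctypes.c_void_p
    llvm.LLVMAddFunction.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

    # no textual IR anywhere: objects are created and wired up directly
    mod = llvm.LLVMModuleCreateWithName(b"kernel")
    i32 = llvm.LLVMInt32Type()
    fnty = llvm.LLVMFunctionType(i32, None, 0, 0)
    fn = llvm.LLVMAddFunction(mod, b"entry", fnty)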
<sb0_>
whitequark, you may find it ironic, but the reason I heard is compatibility with future versions of LLVM
<sb0_>
I would also have thought the pure C API was better.
<rjo>
but llvm-c probably is not much more observant as to the validity of its object tree representation of the IR
<whitequark>
rjo: it's 50/50
<whitequark>
the debug builds of LLVM have lots of asserts, which provide more insight into errors during construction
<whitequark>
on the other hand that crashes your program
<rjo>
sb0_: if the notion is that we hijack a few hpc ground pins for +-15 and +-5 then we might as well use the hpc transceiver pins.
<rjo>
that sounds pragmatic to me.
<rjo>
whitequark: how thoroughly does the llvm-or1k testsuite exercise the or1k parts?
rohitksingh1 has joined #m-labs
<whitequark>
I don't think there is an llvm-or1k testsuite
<rjo>
the llvm testsuite then.
<sb0_>
rjo, in addition to SDRAM problems, artix also has slower transceivers, which are a bigger issue I think.
<whitequark>
I mean, the OR1K backend is not tested within LLVM essentially at all
<sb0_>
rjo, there are LVDS high speed DACs, but they aren't too nice
<rjo>
6 GB/s?
<sb0_>
yes
<whitequark>
there are a few tests but they hardly do anything
<sb0_>
you need 12 Gbps transceivers for 4-channel 16-bit 1.25 Gsps
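The arithmetic behind that number, assuming JESD204B-style 8b/10b encoding spread over 8 lanes (the lane count is an assumption here):

    channels, bits, gsps = 4, 16, 1.25
    payload_gbps = channels * bits * gsps   # 80 Gb/s of raw sample data
    line_gbps = payload_gbps * 10 / 8       # 100 Gb/s on the wire after 8b/10b
    lanes = 8
    print(line_gbps / lanes)                # 12.5 Gb/s per transceiver lane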
rohitksingh has quit [Ping timeout: 244 seconds]
<rjo>
sb0_: ack
<sb0_>
so basically this sort of high speed DACs -> fast transceiver -> expensive FPGA or maybe the ARM mess
<sb0_>
but 1) synchronization issues 2) PCB routing hassle and probably no FMC 3) got to check if the crippled Artix-7 IOs can handle it anyway 4) few second-sources unlike JESD204
<rjo>
whitequark: just wondering.
<whitequark>
rjo: I'll have to write a functional testsuite to merge it upstream, anyway
<whitequark>
the OR1K backend is very small and simple, so it isn't going to prove a problem
<whitequark>
(also that is the reason why there was so little breakage while upgrading)
<sb0_>
nah, let's not use that.
<rjo>
sb0_: but why would a US+ kintex with N LUTs and M GT?s ever be more expensive than a US+ zynq with >=N LUTs and >=M GT?. apart from cross-subsidization.
<sb0_>
yield management like airlines
<rjo>
but the market dynamics are completely different.
<whitequark>
ok. I added !dereferenceable.
<whitequark>
PulseRate went 1500ns->1374ns, PulseRateDDS went 300us->146us
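For illustration, a minimal llvmlite sketch of attaching !dereferenceable metadata to a load (the names and the 4-byte size are made up; this is not the ARTIQ compiler's actual code):

    from llvmlite import ir

    module = ir.Module()
    i32 = ir.IntType(32)
    fn = ir.Function(module, ir.FunctionType(i32, [i32.as_pointer()]), "read")
    builder = ir.IRBuilder(fn.append_basic_block("entry"))
    value = builder.load(fn.args[0])
    # tell LLVM the pointer is always dereferenceable, so the load may be
    # hoisted out of loops and moved past branches
    value.set_metadata("dereferenceable",
                       module.add_metadata([ir.Constant(ir.IntType(64), 4)]))
    builder.ret(value)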
<rjo>
whitequark: but the asm for some test programs looks much nicer now.
<whitequark>
oh, it does?
<whitequark>
that's interesting
<whitequark>
can you quantify the change? I'm having a hard time imagining why it would be so dramatic
<rjo>
this is comparing the state before the custom pipeline with llvm-3.5 and now.
<whitequark>
oh.
<whitequark>
oh, duh, yeah, the standard pipeline produced complete garbage
<whitequark>
ha, you really should look at the code for PulseRateDDS after the commit I'm about to push
<sb0_>
rjo, the zynq ultrascale+ are available in smaller sizes than kintex ultrascale+
evilspirit has quit [Ping timeout: 252 seconds]
<sb0_>
and it seems they target more consumer-ish/automotive applications, eg. they contain built-in GPU, CAN core, USB
<sb0_>
you can overcharge for those like you do for the telecom/military markets
<sb0_>
*cannot
<rjo>
whitequark: even comparing the custom pipeline-3.5 and now-3.7 mandelbrot is ~40% smaller.
<sb0_>
rjo, btw, 7k160t is only slightly more expensive than 7a200t
<GitHub74>
[artiq] whitequark pushed 1 new commit to master: https://git.io/vVGBe
<GitHub74>
artiq/master 10108e6 whitequark: compiler: mark loaded pointers as !dereferenceable....
<whitequark>
should shrink a bit more after ^
<whitequark>
and that's actually not all, I see some more obvious inefficiencies
<rjo>
sb0_: hmm.
<sb0_>
but the DAC would eat all the transceivers
<rjo>
sb0_: are there price lists for the ultrascale(+) stuff?
<sb0_>
might be ok if we don't use JESD for the ADC
<sb0_>
and we have only coupled IQ channels
<sb0_>
rjo, for ultrascale you can find some at digikey, for ultrascale+ I could not find anything
<rjo>
yes. i have seen the ~2 ultrascale things from digikey
<whitequark>
hm, for some reason it's not coalescing loads..
<whitequark>
bizarre
<whitequark>
all that is out of the inner loop, but it's still pretty stupid to repeat those
<whitequark>
OH
<whitequark>
those are two DDSes
<whitequark>
it cannot prove that the core_dds loaded from dds0 and core_dds loaded from dds1 are the same core_dds
<rjo>
sb0_: and i am worried about this SSI stuff. might be that's what's making stuff cheaper to fab. but i guess it's a mess for Vivado. let me try to put a few DSP channels on some US+ device.
<rjo>
also US+ could be out of a Trump playbook: "US+: Make America great again"
<whitequark>
can we have *one* irc channel that doesn't talk about trump, please?
kyak_ is now known as kyak
<whitequark>
amazing, LLVM even ordered basic blocks so that all of the ones that handle the exception are at the end of the function
<whitequark>
rjo: fyi if you request IR from artiq_run and then translate it with llc you get helpful comments
<whitequark>
even more helpful on 3.7
<whitequark>
e.g. it explains where the loops are
<whitequark>
I should enable that for ARTIQ_DUMP_ASM too but it needs some llvmlite plumbing, so later.
<rjo>
ah. that's the asm i have been looking for.
<whitequark>
ok. I do not see anything else obviously wrong
<whitequark>
ah, no, I do see
<whitequark>
there's this stupid call to FP round right in the middle of the loop. I wonder why it isn't hoisted
<whitequark>
no. not even enabling all -ffast-math flags makes LLVM hoist it
<rjo>
sb0_: by the way, looks like we will need to make drtio "packets": the data for a timed dsp channel event can legitimately be anywhere from 4 to ~56 bytes.
<sb0_>
ok
<sb0_>
I can take care of all DRTIO things if you want
<rjo>
let's hash all that stuff out in HK.
<whitequark>
ok. I see some extremely minor low-hanging fruit: missed opportunities in LLVM
<bb-m-labs>
I'll give a shout when the build finishes
<rjo>
whitequark: did the numba guys say what their plans are w.r.t. llvmlite for llvm 3.8/3.9?
<whitequark>
haven't asked.
<whitequark>
but I'll do the port to 3.8 as there are few changes. and I should probably withhold port to 3.9 until we get 3.9
<whitequark>
3.9 is likely to have some seriously breaking changes late in the release cycle...
<whitequark>
rjo: have you measured compilation time?
<whitequark>
specifically between 3.5 pre and post new pipeline
<rjo>
oh by the way. sb0_, mithro and whoever is interested in networking stacks: i asked the altran/tass people about slapping "or later" onto the picotcp licensing and they seemed not completely unopposed. weighing in by others would probably help as well.
<rjo>
whitequark: not quantitatively.
<whitequark>
ok.
rohitksingh1 has quit [Read error: Connection reset by peer]
<sb0_>
whitequark, how do you envision a rust runtime? what TCP/IP stack? what about the linker?
<whitequark>
once we're on 3.8 I'll pull in LLD, which will allow us to ditch binutils forever
<whitequark>
well, not immediately, but at some point after 1.0
<whitequark>
Rust does not currently have a usable TCP/IP stack, so I'll have to write one
<whitequark>
but I've been planning to do so anyway, for my personal projects...
<sb0_>
what about interfacing to picotcp, or even lwip which, even though it is mediocre, is already ported and works?
<whitequark>
that can be done, sure
<whitequark>
as a transitional step or even forever. Rust is very good at interfacing with C
<whitequark>
the tooling has good integration too, e.g. it is trivial to build C code as a part of Rust build process
<sb0_>
so we could use rust for just the high level behavior. do things like device-assisted experiment scheduling (queuing kernels and switching to the next in microseconds), the protocol for talking to the PC, etc.
<whitequark>
sure
<sb0_>
the rest - TCP/IP stack, kernel linking/loading, could keep using the current code
<whitequark>
kernel linking would benefit from rust too...
<sb0_>
in what way?
<whitequark>
well, basically anything that uses pointers would
<sb0_>
since we have to use a C compiler anyway, we might as well recycle the existing code
<whitequark>
I'd rather get better integration with the networking code
<whitequark>
but the point of discussion is moot, the kernel loading/linking code is trivial
<sb0_>
what about exception handling?
<whitequark>
what about it?
<sb0_>
all the unwinding logic.
<sb0_>
stays in C?
<whitequark>
Rust uses the same C++ zero-cost EH, except the language disallows catching exceptions
<whitequark>
oh
<whitequark>
libunwind stays in C++, yes. Rust uses libunwind as well.
<whitequark>
good. the LLVM people say D18643 is safe.
<whitequark>
so we'll make PulseRateDDS even faster.
rohitksingh has quit [Ping timeout: 246 seconds]
<sb0_>
fwiw the desired number I got from Oxford is 10us/channel
<rjo>
for what?
<sb0_>
DDS programming
<whitequark>
right now it's somewhere around 40us.
<whitequark>
let's see where it gets once we get rid of all FP in the loop.
<rjo>
dds programming itself is much faster
<rjo>
this is sustained dds programming.
<sb0_>
yes, of course
<rjo>
maybe they don't appreciate the difference.
<sb0_>
this was in Chris's email from last Friday, I think he probably understands it
<rjo>
let's check
<rjo>
with the current cpu i would guess that 10 us per dds event is marginal.
<whitequark>
we could inline dds_set.
<sb0_>
yes, maybe all the DDS logic could move from C to ARTIQ-Python now...
<whitequark>
yeah, I see no problems with that, except maybe increased compile time
<rjo>
i am pretty sure that the 10us per dds event that I got from Oxford can be deferred to a future with DMA.
<whitequark>
you know, in the ARTIQ Python compiler, I initially added support for modules aka compilation units
<whitequark>
actually being able to use those would decrease compilation time dramatically
<whitequark>
but Python, as a language, is extremely hostile to attempts at this
<rjo>
that would need csr->python header support.
<whitequark>
that sounds trivial.
<rjo>
it needs to be done.
<rjo>
and i would be surprised if inlining dds_set helps a lot.
<whitequark>
well, you save some loads, some calculation, the compiler gets higher visibility into EH...
<rjo>
apart from making compilation and upload slower.
<whitequark>
going forward, we need to devise some way of compartmentalizing ARTIQ Python code.
<rjo>
i would like to see a speed up in the submit-to-execute pipeline first.
<whitequark>
there's really no need to re-typecheck all the stdlib code every time a kernel is compiled
<whitequark>
fair.
<sb0_>
whitequark, there is the phase compensation array that may need to be kept from one kernel to the next
<whitequark>
sb0_: put it in the cache?
<sb0_>
yes. but there is the detail of loading it the first time...
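A hedged sketch of what keeping that table in the core device cache could look like (the "core_cache" device name and get/put semantics follow the usual ARTIQ conventions, but treat this as illustrative, including the key and table size):

    from artiq.experiment import *

    class PhaseTracking(EnvExperiment):
        def build(self):
            self.setattr_device("core")
            self.setattr_device("core_cache")

        @kernel
        def run(self):
            comp = self.core_cache.get("dds_phase_comp")
            if len(comp) == 0:
                # first kernel after boot: seed the table once
                self.core_cache.put("dds_phase_comp", [0]*12)
                comp = self.core_cache.get("dds_phase_comp")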
<sb0_>
also the C code makes use of integer overflows
<whitequark>
ARTIQ Python does not trap on overflow, it wraps
<whitequark>
trapping on overflow would be very expensive
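A quick host-side illustration of the wraparound semantics (ctypes.c_int32 standing in for ARTIQ Python's 32-bit integers on the core device):

    import ctypes
    print(ctypes.c_int32(2**31 - 1 + 1).value)   # -2147483648: wraps, no trap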
<whitequark>
why do these tests produce lower results on my machine?..
<whitequark>
I'm very confused
<sb0_>
whitequark, no idea. i haven't looked at the overflow flag at all, i just know it's there.
<sb0_>
my guess is no one uses it, therefore 1) it has bugs 2) it can be changed =]
sb0_ has quit [Quit: Leaving]
<whitequark>
can be changed in what sense?
evilspirit has joined #m-labs
key2 has quit [Ping timeout: 276 seconds]
kuldeep has quit [Ping timeout: 268 seconds]
sb0 has joined #m-labs
kristian1aul has quit [Quit: Reconnecting]
kristianpaul has joined #m-labs
kuldeep has joined #m-labs
<sb0>
in any way you want, probably
<whitequark>
ah
<whitequark>
an instruction l.bo like l.bf would be perfect, and maybe also a cmov variant
<whitequark>
but that's speculation, I'll need to look at the actual code generation issues to say for sure
<whitequark>
the problem with raising an exception is that LLVM assumes that e.g. `add` does not raise
<whitequark>
you can use `add nuw`, in which case it assumes that unsigned overflow produces undefined behavior
<whitequark>
but it will still, say, mark the function as nounwind if it only contains arithmetic
<whitequark>
so... implementing this is not entirely trivial
<whitequark>
though I think there will be interest from LLVM core in having better support for overflow traps
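To make the cost concrete, a hedged llvmlite sketch of what every trapping add would have to expand to: the overflow-checking intrinsic plus a branch to a raise path (shown here as `unreachable`):

    from llvmlite import ir

    mod = ir.Module()
    i32 = ir.IntType(32)
    fn = ir.Function(mod, ir.FunctionType(i32, [i32, i32]), "checked_add")
    entry = fn.append_basic_block("entry")
    ok = fn.append_basic_block("ok")
    overflow = fn.append_basic_block("overflow")

    b = ir.IRBuilder(entry)
    res = b.sadd_with_overflow(fn.args[0], fn.args[1])   # {i32 sum, i1 overflow}
    b.cbranch(b.extract_value(res, 1), overflow, ok)

    b.position_at_end(ok)
    b.ret(b.extract_value(res, 0))

    b.position_at_end(overflow)
    b.unreachable()   # in ARTIQ this would raise an exception instead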
<sb0>
ok. well this is very low priority anyway.
<sb0>
rust runtime would be interesting. then I imagine we can have device-assisted scheduling in a nicer way, which is a useful feature for atomic clocks
FelixVi has joined #m-labs
fengling has joined #m-labs
fengling has quit [Ping timeout: 240 seconds]
<cr1901_modern>
I've been looking into writing Rust bare metal code. I've found that I need to know the language well before I can competently write bare metal Rust. Contrast that with C, where I'd be very comfortable teaching someone who doesn't know any C in a microcontroller environment.