sb0 changed the topic of #m-labs to: ARTIQ, Migen, MiSoC, Mixxeo & other M-Labs projects :: fka #milkymist :: Logs http://irclog.whitequark.org/m-labs
_rht has joined #m-labs
<mithro>
sb0: What is the current status of the DVI Sampler and frame buffer in the current misoc? _florent_ was mentioning something around the DMA interfaces changing and there are a couple of "TODO: rewrite dma_lasmi module" type things in the dvi_sampler code?
<sb0>
mithro, I haven't tested it for ages and there have been major misoc refactorings since then, so sure enough it's broken
<mithro>
sb0: okay, that is where I thought it was at
<sb0>
the bugfixes shouldn't be substantial, though
rohitksingh has joined #m-labs
kuldeep has quit [Ping timeout: 248 seconds]
kuldeep has joined #m-labs
mumptai has quit [Quit: Verlassend]
mumptai has joined #m-labs
<GitHub189>
[migen] sbourdeauducq pushed 2 new commits to master: https://git.io/vVvOa
<whitequark>
rjo: with my latest optimizer tweaking i improved #298 by a factor of 100 and #338 by a factor of 150
<whitequark>
should reduce compile time too
<rjo>
what is the final absolute number? that's what matters.
<whitequark>
1us
<rjo>
on both?
<whitequark>
that's for PulseRateDDS
<whitequark>
for that RTIO loop it's 250ns
<rjo>
that sounds reasonable. please tie down the unittests so that we don't regress again.
<whitequark>
and I can actually further improve both, though not by much
<whitequark>
mainly, I need to factor out very cold bounds checking code out of the loops
<rjo>
it's a reasonable number. i remember having 170 ns for a 75 MHz sys_clock in a very old version of RTIO (ventilator) with hand-written C and lm32 a few years back.
<whitequark>
since it pessimizes the inliner
<whitequark>
and the second thing is it constantly reads and writes the global now
<whitequark>
170ns might be achievable
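A source-level sketch of the outlining whitequark describes above; the compiler would do this on its own IR, so the function names here (sum_inline, _bounds_fail, sum_outlined) are purely illustrative:

    # hot loop with the cold raise inlined: every inlined copy of
    # this function drags the exception path along with it
    def sum_inline(buf, n):
        total = 0
        for i in range(n):
            if i >= len(buf):
                raise IndexError("index {} out of bounds".format(i))
            total += buf[i]
        return total

    # cold path factored out: the loop body stays small, so the
    # inliner is no longer pessimized by code that almost never runs
    def _bounds_fail(i):
        raise IndexError("index {} out of bounds".format(i))

    def sum_outlined(buf, n):
        total = 0
        for i in range(n):
            if i >= len(buf):
                _bounds_fail(i)
            total += buf[i]
        return total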
<rjo>
yeah. i can see that 64 bit stuff actually dominating eventually.
<whitequark>
pulse_rate_dds still has FP math
<rjo>
to repeat: RTIO pulse rate is 1/250ns now?
<whitequark>
mostly because frequency_to_ftw has a division, which has a ZeroDivisionError branch, which ends up inflating that function
<whitequark>
i.e. it's only really useful if everything is inlined into a single function
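For context, the general shape of such a conversion; this is an illustrative sketch rather than the actual ARTIQ frequency_to_ftw, and sysclk stands in for whatever clock the real driver divides by:

    def frequency_to_ftw(frequency, sysclk):
        # the division below is why the compiler must emit a
        # ZeroDivisionError branch: a cold path that inflates the
        # function body and only disappears once everything is
        # inlined into a single function
        return round(frequency * (2 ** 32) / sysclk)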
<whitequark>
other than fixing this `now` issue and marking the TTLOut.channel attribute as immutable, there is nothing to be done to increase the TTL pulse rate
<whitequark>
since the code is basically optimal already
<rjo>
ack.
<whitequark>
there's some modest PIC overhead, but not too much
<rjo>
one thing that is still in there is marking a few of those registers non-volatile.
<whitequark>
the inner loop is composed of 52 instructions
<whitequark>
going to non-PIC can save you, uh, I think four?
<whitequark>
(52 instructions not counting those in rtio_output)
<whitequark>
actually, nope
<whitequark>
two instructions
<whitequark>
the non-PIC version is 50.
<whitequark>
I think two of them stopped being loads, but that's not really much difference
<whitequark>
so I think PIC overhead can be considered negligible.
<whitequark>
ok. let me see what I can do with PulseRateDDS.
<whitequark>
also, I looked at the PulseRate test (the actual test code) that uses exceptions
<whitequark>
and the reason it's just 50ns worse than the code in that hastebin, which doesn't use exceptions, is because I used LLVM's zero-cost exception handling
<whitequark>
actually not even 50ns, it's exactly the same
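A minimal sketch of an exception-driven rate search like the one being discussed, assuming an illustrative device name (ttl_out) and starting period; the actual PulseRate test differs in detail. With zero-cost exception handling, the try block adds nothing to the non-throwing path:

    from artiq.experiment import *
    from artiq.coredevice.exceptions import RTIOUnderflow

    class PulseRateSketch(EnvExperiment):
        def build(self):
            self.setattr_device("core")
            self.setattr_device("ttl_out")

        @kernel
        def run(self):
            dt = 1000                   # starting period, machine units
            while True:
                self.core.break_realtime()
                try:
                    for _ in range(1000):
                        self.ttl_out.pulse_mu(dt)
                        delay_mu(dt)
                except RTIOUnderflow:
                    dt += 10            # too fast: back off and retry
                else:
                    return              # dt is a sustainable period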
<GitHub176>
[artiq] whitequark pushed 3 new commits to master: https://git.io/vVfel
<GitHub176>
artiq/master 186a564 whitequark: compiler: make quoted functions independent of outer environment.
<GitHub176>
artiq/master 5aec82b whitequark: test_pulse_rate: tighten upper bound to 1310ns.
<GitHub176>
artiq/master 20ad762 whitequark: llvm_ir_generator: generate code more amenable to LLVM's GlobalOpt....
<whitequark>
rjo: I think there is a problem with the PulseRateDDS test.
<whitequark>
it does 1000 iterations of setting DDSes
<whitequark>
and this currently results in a 500us value
<whitequark>
however, if I increase the number of iterations to 10000, it results in 2500us per pulse
<whitequark>
so I think with 1000 iterations, the measured value is lower than the real one; what happens is that every time it runs, it "borrows" a chunk of time from break_realtime
<whitequark>
but the iteration count is not high enough for this to result in an underflow.
<whitequark>
the higher I make the mu value in break_realtime, the lower the measured value becomes.
<whitequark>
whereas, if I raise the iteration count to 10000, then the value is the same as with 30000 iterations
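A hypothetical sketch of the effect, assuming break_realtime grants a fixed slack (the 5 ms rjo mentions below) and illustrative names (n, freq); the real PulseRateDDS test differs in detail:

    @kernel
    def run(self):
        self.core.break_realtime()      # pushes now ~5 ms past the counter
        t0 = self.core.get_rtio_counter_mu()
        for _ in range(self.n):
            self.dds.set(self.freq)     # each set() advances now
        t1 = self.core.get_rtio_counter_mu()
        # with small n the whole batch fits inside the 5 ms slack, so
        # (t1 - t0)/n under-reports the real per-set cost: part of every
        # iteration is paid out of the slack instead of wall-clock time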
<rjo>
cut that 5 ms to something like 100 us. pretty sure that's sufficient that subsequent batches don't overlap in your case. if you cut that 5 ms too much you will see RTIOSequenceError because of overlapping batches.
<whitequark>
yep. with this new benchmark i get ~266us on a very wide range of n's
<rjo>
that code is weird.
<whitequark>
is it?
<whitequark>
it's a way to ensure that fifos are cleared in time
<rjo>
i think you are just measuring the fifo depth here.
<whitequark>
am I?
<whitequark>
it returns 266us even with n=10
<whitequark>
well, 268us. close enough.
<rjo>
you push a bunch of events always 1ms in the future over and over again.
<whitequark>
hm.
<whitequark>
i see your point
<rjo>
that will generally succeed unless there are events in the fifo that prevent new events from getting in and through in time.
<whitequark>
yes
<whitequark>
so if there are none, am i not measuring the time it takes to submit events?
<rjo>
yes. if the fifo is empty and stays non-full during the entire game, you will measure that time, modulo the overhead due to setting and getting now.
<whitequark>
excellent. that's what i wanted to measure.
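A sketch of the scheme rjo describes, with illustrative names: each event is re-anchored a fixed 1 ms past the hardware counter, so the FIFO stays near-empty and the measured time approximates pure event-submission cost:

    @kernel
    def run(self):
        t0 = self.core.get_rtio_counter_mu()
        for _ in range(self.n):
            # re-anchor 1 ms ahead of the counter: the FIFO never
            # fills, so the loop is dominated by submission overhead
            at_mu(self.core.get_rtio_counter_mu()
                  + self.core.seconds_to_mu(1e-3))
            self.dds.set(self.freq)
        t1 = self.core.get_rtio_counter_mu()
        dt_mu = (t1 - t0) // self.n     # per-event submission cost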
<rjo>
but for large n (>~ 50) i expect this to be wrong.
<rjo>
anyway. good night. see you tomorrow.
<whitequark>
night.
<whitequark>
this actually returns the same value even for n=100000.