<GitHub92>
[smoltcp] whitequark commented on issue #19: > I propose I rename what I called SocketDispatcher to SocketDispatchTable, put it inside SocketSet... https://git.io/vQ3Ds
<GitHub79>
[smoltcp] whitequark pushed 1 new commit to master: https://git.io/vQ3DV
<GitHub79>
smoltcp/master b3e3554 whitequark: Add missing #[derive]s on wire::IpVersion.
<travis-ci>
m-labs/smoltcp#132 (master - b3e3554 : whitequark): The build passed.
<GitHub109>
[smoltcp] whitequark commented on issue #19: Moving forward, do you think you can cover EthernetInterface with tests? Basic coverage of every `Ok` and `Err` returned from the `process_*` functions would be a great start; I'll chime in then and add support for the newly landed rustc support for gcov. https://git.io/vQ3SL
<rjo>
i am ok with the scripts as long as they are clean, documented, and maintained.
<GitHub65>
[smoltcp] batonius commented on issue #19: >Moving forward, do you think you can cover EthernetInterface with tests?... https://git.io/vQ3H6
attie has quit [Remote host closed the connection]
<travis-ci>
batonius/smoltcp#13 (master - b3e3554 : whitequark): The build passed.
<whitequark>
rjo: did I already ask you why you're not using them, btw?
<whitequark>
do you have your own scripts?
<whitequark>
or just do everything on lab.* via ssh?
rohitksingh_work has quit [Ping timeout: 246 seconds]
<rjo>
whitequark: a mixture of git, rsync, and mosh/tmux. yes. but i always wanted to give your scripts a try.
<whitequark>
rjo: ack.
<whitequark>
rjo: I believe I found the root cause behind all our throughput issues btw.
<whitequark>
smoltcp did not send duplicate ACKs when it detected a missing segment
<whitequark>
this meant that every missing segment incurred at least 0.5s of delay waiting for the host to retransmit
<whitequark>
if I send duplicate ACKs this gets resolved within milliseconds in my testing (not on the core device yet)
<whitequark>
I missed this because sending duplicate ACKs is an implementation detail of congestion control algorithms and isn't in RFC 793, though it is in RFC 1122
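A minimal sketch of the fast-retransmit trigger described above (illustrative types, not smoltcp's actual code): when a segment arrives past the next expected sequence number, the receiver immediately re-acknowledges the sequence number it is still waiting for, so the peer's fast retransmit (three duplicate ACKs, RFC 5681) fires within a round trip instead of after a ~0.5 s retransmission timeout.

```rust
#[derive(Clone, Copy, Debug)]
struct AckDecision {
    ack_number: u32,
    is_duplicate: bool,
}

struct Receiver {
    rcv_nxt: u32, // next sequence number we expect
}

impl Receiver {
    fn on_segment(&mut self, seq: u32, len: u32) -> AckDecision {
        if seq == self.rcv_nxt {
            // In-order segment: advance and acknowledge normally.
            self.rcv_nxt = self.rcv_nxt.wrapping_add(len);
            AckDecision { ack_number: self.rcv_nxt, is_duplicate: false }
        } else {
            // Out-of-order segment: something went missing, so send a
            // duplicate ACK for the sequence number we still expect.
            AckDecision { ack_number: self.rcv_nxt, is_duplicate: true }
        }
    }
}

fn main() {
    let mut rx = Receiver { rcv_nxt: 1000 };
    println!("{:?}", rx.on_segment(1000, 500)); // in order: ACK 1500
    println!("{:?}", rx.on_segment(2000, 500)); // 1500..2000 missing: duplicate ACK 1500
}
```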
rohitksingh_work has joined #m-labs
<GitHub77>
[smoltcp] whitequark pushed 3 new commits to master: https://git.io/vQ3F2
<GitHub77>
smoltcp/master ac6efbf whitequark: Try to trigger fast retransmit when we detect a missing TCP segment....
<GitHub77>
smoltcp/master a2f233e whitequark: In examples, trace the packets being dropped by the fault injector.
<GitHub77>
smoltcp/master 86c1cba whitequark: In examples, print packet dumps with timestamps, too....
<travis-ci>
m-labs/smoltcp#133 (master - ac6efbf : whitequark): The build was broken.
rohitksingh_work has quit [Ping timeout: 260 seconds]
<GitHub163>
[artiq] whitequark commented on issue #685: > It should be perfectly workable to keep a free list backed by a static pool around to store per-segment metadata (sequence number ranges, …), while building up the payload directly in the circular buffer.... https://github.com/m-labs/artiq/issues/685#issuecomment-311006676
<whitequark>
sb0: rjo: I get 1 Mbps of throughput consistently, up to 1.8 Mbps of throughput in good conditions
<whitequark>
so this actually exceeds lwip I believe
<whitequark>
that's host pushing data to the core device
<whitequark>
wait, no
<whitequark>
1 mega*byte* per second
<rjo>
whitequark: by the way, could you or sb0 look at upgrading the rigol's firmware? it hangs much more than mine on the same commands.
<whitequark>
is there an upgrade?
<rjo>
iirc there have been some in the last months.
<whitequark>
ah.
<whitequark>
rjo: I'm not in HK
<rjo>
whitequark: ok. 1 MB/s is what i remember from lwip as well. good if we can go faster, better if we can identify how we can go even faster then ;)
<rjo>
sb0: no urgency though, i can work around it.
<whitequark>
sb0: rjo: there's another issue though, because of a silly bug the throughput in the *other* direction is 28 kBps
<whitequark>
but that's even easier to fix and it's completely obvious why that happens.
<whitequark>
(smoltcp waits for an ACK after sending exactly one packet)
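A sketch of the fix implied here, with illustrative names rather than smoltcp's API: keep emitting segments until the data in flight fills the peer's advertised window, instead of pausing for an ACK after every single packet.

```rust
struct Sender {
    snd_una: u32,    // oldest unacknowledged sequence number
    snd_nxt: u32,    // next sequence number to send
    remote_win: u32, // window advertised by the peer
    mss: u32,
    unsent: u32,     // bytes queued in the transmit buffer
}

impl Sender {
    /// Sizes of the segments we may transmit right now.
    fn segments_to_send(&mut self) -> Vec<u32> {
        let mut segments = Vec::new();
        loop {
            let in_flight = self.snd_nxt.wrapping_sub(self.snd_una);
            let win_left = self.remote_win.saturating_sub(in_flight);
            let size = self.unsent.min(self.mss).min(win_left);
            if size == 0 { break; }
            self.snd_nxt = self.snd_nxt.wrapping_add(size);
            self.unsent -= size;
            segments.push(size);
        }
        segments
    }
}

fn main() {
    let mut tx = Sender { snd_una: 0, snd_nxt: 0, remote_win: 8192, mss: 1460, unsent: 100_000 };
    // Several segments go out back-to-back before we wait for an ACK,
    // instead of one segment per round trip.
    println!("{:?}", tx.segments_to_send());
}
```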
<whitequark>
rjo: another issue with sitting there with a full window is that it destroys throughput through artiq_devtool
<whitequark>
not sure exactly why, it might be something about the way ssh does forwarding
<whitequark>
but via artiq_devtool I only get about half that
<whitequark>
over a long fat pipe, I mean
<whitequark>
this *shouldn't* matter since the transfer only goes in one direction...
<whitequark>
rjo: ah no unfortunately we already do as few copies as possible
<whitequark>
there's copy #1 that takes the ethmac buffer and puts it into the TCP circular buffer
<whitequark>
and there's copy #2 that takes the TCP circular buffer and puts it into an allocation owned by kernel CPU
<whitequark>
neither can *really* be eliminated
<whitequark>
then it looks like the only option here is implementing proper window management
<whitequark>
we essentially get permanently stuck with an MTU*4 window right now despite the TCP buffer being almost entirely empty
<whitequark>
er, correction. MTU*4 window regardless of the state of the buffer so long as it isn't almost entirely full
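A sketch of the window management being talked about, assuming a ring-buffer receive queue (names are illustrative, not smoltcp's): the advertised window tracks the actual free space in the receive buffer, so an almost-empty buffer opens the window wide instead of pinning it at MTU*4.

```rust
struct RxBuffer {
    capacity: usize,
    len: usize, // bytes buffered but not yet read by the application
}

impl RxBuffer {
    fn window(&self) -> u16 {
        let free = self.capacity - self.len;
        // The base TCP window field is 16 bits; larger buffers need window scaling.
        free.min(u16::MAX as usize) as u16
    }
}

fn main() {
    let buf = RxBuffer { capacity: 65536, len: 1024 };
    println!("advertised window: {}", buf.window()); // 64512, not 4 * MTU
}
```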
<GitHub77>
[artiq] klickverbot commented on issue #685: @sbourdeauducq: It wasn't better at the time of your message, but the duplicate ACK handling from earlier today should indeed make a difference. I'll have a go at reproducing the results soon.... https://github.com/m-labs/artiq/issues/685#issuecomment-311031321
<rjo>
whitequark: ethernet DMA...
<cr1901_modern>
Does misoc support a DMA controller? (I suppose you could just implement the ethernet core as a WB master for DMA if not)
<sb0>
what is a DMA controller? hardware memcpy? no
<cr1901_modern>
sb0: Yes, basically. But I recall you saying something else a while back: that a DMA controller shouldn't use the same bus as the CPU to read/dump data to memory.
<cr1901_modern>
^So I was asking whether that approach is supported (do other SoCs use it?)
<sb0>
rjo, did your modified vivado script improve the dma/rtio timing, or is that still an important issue?
<rjo>
it did have a small impact but the path is still there. and it is extremely long and i expect it to cause problems again soon.
<GitHub39>
[smoltcp] whitequark opened issue #20: TCP reset generation is not quite correct https://git.io/vQsn5
<GitHub57>
[smoltcp] whitequark opened issue #21: Challenge ACKs are not always generated https://git.io/vQsnb
<GitHub51>
[smoltcp] whitequark opened issue #22: ACKs are not generated when receiving segments and the window is zero https://git.io/vQscm
rohitksingh_wor1 has quit [Read error: Connection reset by peer]
<sb0>
rjo, okay, it's the ack (flow control) path
<sb0>
it's long because many components have combinatorial logic in flow control
<sb0>
that can be broken with a 2-entry FIFO
<whitequark>
I'm not sure how much sense it makes to have a hardware memcpy
<whitequark>
or1k is a pipelined CPU with prefetch, right?
<sb0>
or, more simply, by having a component that reads/writes sequentially. unlike the FIFO this limits the throughput but should not be the bottleneck
<sb0>
that maybe can be combined with the time offset stage, which would use a negligible amount of FPGA resources (the wide data makes buffers expensive)
<whitequark>
if you unroll the memcpy loop then it can spend many of its cycles actually doing copying
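To illustrate that argument, a rough sketch of a word-wise copy loop unrolled four times; the only point is that the per-iteration branch and counter overhead is amortised, so most issued instructions are the loads and stores doing the actual copying.

```rust
fn copy_words_unrolled(dst: &mut [u32], src: &[u32]) {
    assert_eq!(dst.len(), src.len());
    let mut d = dst.chunks_exact_mut(4);
    let mut s = src.chunks_exact(4);
    for (d4, s4) in (&mut d).zip(&mut s) {
        // Four independent word copies per loop iteration.
        d4[0] = s4[0];
        d4[1] = s4[1];
        d4[2] = s4[2];
        d4[3] = s4[3];
    }
    // Tail when the length is not a multiple of four.
    for (d1, s1) in d.into_remainder().iter_mut().zip(s.remainder()) {
        *d1 = *s1;
    }
}

fn main() {
    let src: Vec<u32> = (0..19).collect();
    let mut dst = vec![0u32; 19];
    copy_words_unrolled(&mut dst, &src);
    assert_eq!(dst, src);
}
```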
<sb0>
whitequark, you can access the SDRAM with a wider bus than the CPU
<whitequark>
if you go the hardware memcpy route though then you will pay the cache flush penalty
<whitequark>
unless you make it cache-coherent which is not gonna happen for ARTIQ
<sb0>
cache coherency isn't *that* bad
<whitequark>
well, can you justify implementing MOESI just to get a faster memcpy?
<sb0>
even in FPGAs it can work, for example milkymist had a VGA framebuffer that had cache coherency with the L2 cache
<whitequark>
well
<whitequark>
it will also give us faster kernel/comms CPU data transfer
<whitequark>
right now every RPC is a slog
<whitequark>
so maybe it can be justified after all
<sb0>
are cache misses the main slowdown for RPCs?
<whitequark>
they were a significant slowdown, iirc, the last time I measured that
<whitequark>
first, you have this massive loop that iterates through entire l2 cache, as a fixed penalty
<whitequark>
and then you get all of your working set evicted
<whitequark>
also does or1k really have no way to flush *specific* dcache lines?
<whitequark>
it already has the CAM...
<sb0>
access another address with an offset at a multiple of the cache size
<whitequark>
oh, you don't flush l2 cache for RPCs, my bad
<whitequark>
sb0: that's a waste of time
<whitequark>
well I suppose since we have no MMU we could calculate it from the way/set/block count
<whitequark>
and do an appropriate SPR_DCBIR write
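A sketch of that idea; `dcache_invalidate_line` and the line size are assumptions standing in for an `l.mtspr` write to SPR_DCBIR per cache line. With no MMU the effective address is the physical address, so it is enough to walk the buffer in line-sized steps instead of flushing the whole dcache.

```rust
const DCACHE_LINE_SIZE: usize = 32; // illustrative; the real value comes from DCCFGR

// Hypothetical low-level primitive: invalidate the dcache line containing `addr`.
// On real hardware this would be a single mtspr(SPR_DCBIR, addr) write.
fn dcache_invalidate_line(addr: usize) {
    let _ = addr;
}

fn dcache_invalidate_range(base: usize, len: usize) {
    let end = base + len;
    let mut addr = base & !(DCACHE_LINE_SIZE - 1);
    while addr < end {
        dcache_invalidate_line(addr);
        addr += DCACHE_LINE_SIZE;
    }
}

fn main() {
    // Invalidate only the lines covering a 1 KiB RPC buffer at an example address.
    dcache_invalidate_range(0x4000_0000, 1024);
}
```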
<sb0>
if you access in the on-chip SRAM it's a lesser waste of time
<GitHub184>
[smoltcp] whitequark commented on issue #19: Something I just remembered that might be very relevant to your work is that having several open sockets in LISTEN state with the same local endpoint is perfectly legal. That's how listen backlog is implemented (by a layer on top of smoltcp). https://git.io/vQs4Q
<GitHub44>
[smoltcp] whitequark opened issue #23: Revise errors returned from `TcpSocket::process()` https://git.io/vQs0K
<GitHub78>
[smoltcp] batonius commented on issue #19: Right, I somehow missed that point completely: it's not enough to dispatch TCP packets based on the dst endpoint alone, since a server can have several clients connected to it. We need to consider the src endpoint as well, and we don't know it until a connection has been established. This means we need another layer of dispatching, and a way for a socket to report that it has established a connection to a remote endpoint.
<GitHub106>
[smoltcp] batonius commented on issue #19: Now that I think of it, it should be easy enough to do by checking whether the socket's `remote_endpoint` has changed after `process` in `process_tcpv4`. https://git.io/vQsgX
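A minimal sketch of the dispatch rule under discussion (illustrative types, not smoltcp's): prefer the socket whose (local, remote) endpoint pair matches the segment exactly, and fall back to a socket still in LISTEN state, several of which may legitimately share the same local endpoint.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Endpoint { addr: [u8; 4], port: u16 }

#[derive(Debug)]
struct TcpSocket {
    local: Endpoint,
    remote: Option<Endpoint>, // None while in LISTEN
}

fn dispatch<'a>(sockets: &'a mut [TcpSocket],
                dst: Endpoint, src: Endpoint) -> Option<&'a mut TcpSocket> {
    // Prefer an established connection with a matching remote endpoint...
    if let Some(i) = sockets.iter().position(|s| s.local == dst && s.remote == Some(src)) {
        return Some(&mut sockets[i]);
    }
    // ...otherwise hand the segment to any listening socket on that local endpoint.
    sockets.iter_mut().find(|s| s.local == dst && s.remote.is_none())
}

fn main() {
    let local = Endpoint { addr: [192, 168, 1, 1], port: 1234 };
    let peer = Endpoint { addr: [192, 168, 1, 2], port: 50000 };
    let mut sockets = vec![
        TcpSocket { local, remote: None },       // LISTEN (backlog slot)
        TcpSocket { local, remote: Some(peer) }, // established with peer
    ];
    println!("{:?}", dispatch(&mut sockets, local, peer));
}
```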
rohitksingh has joined #m-labs
hartytp has joined #m-labs
<hartytp>
sb0: DRTIO switching
<hartytp>
why do you need to store an entry for each DRTIO channel in a table?
<hartytp>
Isn’t 1 entry per satellite device enough?
<hartytp>
what is the planned implementation of the DRTIO switching funded by ARL?
<sb0>
hartytp, the ARL design is for the Sayma RTM FPGA, and unlike Kasli, the two ends of the switch operate at different data rates
<sb0>
currently the DRTIO master needs to store how much space is available in the RTIO FIFO of each channel, to avoid querying the satellite every time which would cause poor performance
<sb0>
with the current switch support plan, there is only one hop at most, and the number of RTIO channels on the Sayma RTM is rather small. this makes the DRTIO master block RAM more manageable ...
<hartytp>
Okay, so this is about keeping track of room in FIFOs, rather than about constructing a routing table?
<hartytp>
and, we can't just use overflow errors for flow control?
<hartytp>
"If we design the route->index mapper in a naive and trivial way (encode each hop with 2 bits, concatenate the results, and multiply by the memory allocation for one device) then the required amount of memory is very high at 10 megabytes, with a tree 5 layers deep."
<hartytp>
I assumed we'd store a list of DRTIO devices and, for each one, store the route information.
<hartytp>
Thus, it's only a few extra bits of information for each DRTIO slave we add, rather than an exponentially increasing amount of data
<hartytp>
"hartytp, the ARL design is for the Sayma RTM FPGA, and unlike Kasli, the two ends of the switch operate at different data rates"
<hartytp>
If you can do different data rates, don't you get switching with the same data rate more or less for free? (different data rates sounds like a general case)
<sb0>
the different data rate design will have lower performance (it needs to buffer whole packets etc.). with the same data rate, you can do cut-through switching
<sb0>
no, we can't just use overflow errors for flow control
<sb0>
having a list of drtio devices and storing route information in gateware is an option, yes. but it needs to be done...
<sb0>
and even with that, it's still a 200KB table
<hartytp>
sb0, okay so the latency is quite high for the current ARL-funded DRTIO switch (how high?). The estimate you gave me is for reducing the latency in the same-speed case by implementing cut-through switching, right?
<hartytp>
"having a list of drtio devices and storing route information in gateware is an option, yes. but it needs to be done..." would that be a lot simpler to implement?
<sb0>
it's much simpler in gateware than encoding the route in the RTIO channel numbers and then having to map that efficiently to table addresses
<sb0>
but then there is the problem of loading the route table.
<sb0>
I suppose the only option is to put it as a config option in the core device flash, otherwise startup/idle kernels would not run properly
<sb0>
yes, that estimate is implementing cut-through switching
<hartytp>
"and even with that, it's still a 200KB table" true, assuming we need to store 10 bytes per DRTIO channel (what are they for?) and we want to support 1024 RTIO channels per device (256 seems plenty IMO)....
<hartytp>
"I suppose the only option is to put it as a config option in the core device flash, otherwise startup/idle kernels would not run properly" that doesn't sound too bad to me
<sb0>
well, the user interface needs a bit of thought
<hartytp>
yes
<sb0>
10 bytes = last timestamp (64 bits) + FIFO space (16 bits)
<hartytp>
remind me what last timestamp is needed for
<sb0>
sequence error detection
<hartytp>
detecting out-of-order events?
<sb0>
trying to post an event on a channel with a timestamp smaller than the previous one
<hartytp>
okay
<sb0>
with SRTIO it is generally OK to do that, so this "sequence error" doesn't exist anymore
<hartytp>
can the error detection be done by the DRTIO slave, rather than the master?
<hartytp>
That way, you're down to 2 bytes per DRTIO channel on the master
<sb0>
then either you lose precise exceptions, or performance, since a round-trip would be required for every event
<sb0>
if you add a microsecond of latency by crossing switches, then the event rate really drops...
<hartytp>
precise exceptions? The DRTIO slave can raise an exception over the DRTIO aux channel, telling you which instruction caused the error. What other information do you need?
<sb0>
there are two problems with that:
<sb0>
1) it cannot work like a Python exception, e.g. the CPU may already be out of the "except:" clause when the error arrives
<hartytp>
yes, it'd be more like an underflow error
<sb0>
2) the kernel may even have already terminated by the time the error arrives, so if you store just a program counter value it still is a bit tricky to know where the error came from
<sb0>
underflow errors also use precise exceptions.
<sb0>
you can catch them etc
<sb0>
"try: ttl.on() except RTIOUnderflow: ..." has precisely defined behavior
<hartytp>
okay.
<hartytp>
how are underflows handled if not via drtio aux?
<sb0>
locally, by looking at the local timestamp counter and checking that there is enough time, taking the various latencies into account
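A rough sketch of that local check (the names and the single lumped latency term are illustrative): an event is rejected with an underflow as soon as its timestamp no longer leaves enough slack over the current timeline counter.

```rust
struct TimelineCheck {
    now: i64,           // current RTIO timestamp counter, in machine units
    total_latency: i64, // lumped output/link/switch latency budget
}

impl TimelineCheck {
    fn submit(&self, event_timestamp: i64) -> Result<(), &'static str> {
        if event_timestamp < self.now + self.total_latency {
            Err("RTIOUnderflow") // raised precisely at the submitting call site
        } else {
            Ok(())
        }
    }
}

fn main() {
    let t = TimelineCheck { now: 1_000_000, total_latency: 5_000 };
    assert!(t.submit(1_010_000).is_ok());
    assert!(t.submit(1_002_000).is_err()); // too late: underflow
}
```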
<sb0>
drtio switches also complicate that, by the way
<sb0>
contrary to what you think, they are not easy, even for small networks
<hartytp>
Never thought this was easy
<hartytp>
just trying to understand the issues
<hartytp>
(sorry, just re-read the drtio docs and noticed some of my questions were answered there)
<hartytp>
In general:
<hartytp>
- Kasli as master is something I'm keen on, as some of our experiments will only need minimal uTCA hardware (others will need lots of it, so will use Metlino).
<hartytp>
- but, I am a bit concerned with potential resource/speed limitations of Kasli (as discussed previously)
<hartytp>
- we can potentially fund the SRTIO proposal, depending on the costs
<sb0>
ok, good :)
<rjo>
hartytp: are you guys doing CameraLink-based readout from (Andor?) (EM) CCDs? just heard it mentioned here (PTB) that someone from oxford had that in a thesis.
<hartytp>
Chris has done some (unpublished) stuff. IIRC, triggering the camera via TTL and then reading back later via a PC card
<rjo>
oh. i remember that. ack.
<hartytp>
real-time readout via CameraLink is something we're keen on/thinking about. Are PTB considering funding it?
<hartytp>
sb0: the simpler/cheaper we can keep the switching proposal, the easier it will be for us to fund; even if it doesn't address all the issues required for long-term scalability, at least it would be a start
<hartytp>
other than that, I'll wait to hear from you re Kasli speed and a more firm estimate of switching costs.
cjbe has joined #m-labs
<cjbe>
rjo: I have looked into the Andor EMCCD CameraLink implementation, and stuck a scope on it to confirm the protocol and latency are not crazy, but have not written any gateware for this (yet...)
hartytp has quit [Quit: Page closed]
<GitHub91>
[smoltcp] whitequark opened issue #24: Use timestamp for TCP initial sequence number https://git.io/vQGLb
<whitequark>
sb0: I have an idea for handling exceptions
<whitequark>
we could add a hook so that before the try: block is exited, the compiler issues an exception barrier and pulls in any exceptions that might have arisen
<whitequark>
basically, mark the RTIOUnderflow (or whichever) exception as "this needs additional code emitted before the try: block that catches it finishes"
<whitequark>
it could even just be Python code, to be fully generic
<whitequark>
easy to implement, seems pretty ergonomic
<travis-ci>
batonius/smoltcp#14 (master - 1746702 : whitequark): The build passed.
<GitHub93>
[smoltcp] whitequark pushed 2 new commits to master: https://git.io/vQGYB
<GitHub93>
smoltcp/master 5c3fc49 whitequark: Discard packets with non-unicast source addresses at IP level....
<GitHub93>
smoltcp/master e47e94e whitequark: Transmit actual UDP checksum of all-zeroes as all-ones instead.
<GitHub168>
[smoltcp] klickverbot commented on issue #24: Also see RFC 1948/6528 – timestamps have since been augmented with a PRNG to avoid sequence number attacks. https://git.io/vQGY2
<GitHub133>
[smoltcp] whitequark commented on issue #24: @klickverbot Is there some source of truth for which RFCs are actually authoritative for TCP? RFC 793 is hopelessly outdated and has errata, RFC 1122 fixes some of that, highlights a few common errors, many of which I did make, but also piles completely useless junk on top of it (I think every ICMP message it specifically mentions except unreachables and echo request/reply is deprecated, strongly disco
<GitHub110>
[smoltcp] klickverbot commented on issue #24: @whitequark: Unfortunately, I don't know of any up to date list of RFCs relevant for the various areas, but I found the review in RFC 7414 to be quite useful (from 2015). https://git.io/vQGOa
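For reference, a sketch of the RFC 6528 scheme mentioned above: the ISN is a timer-derived value plus a keyed hash of the connection 4-tuple, so sequence numbers stay monotonic per connection but are not predictable across connections. `DefaultHasher` stands in for the keyed cryptographic hash a real implementation would use, and the timer source is illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn initial_sequence_number(timestamp_us: u64, secret: u64,
                           local: ([u8; 4], u16), remote: ([u8; 4], u16)) -> u32 {
    // RFC 6528 uses a timer that ticks roughly every 4 microseconds.
    let m = (timestamp_us / 4) as u32;

    // F(localip, localport, remoteip, remoteport, secretkey).
    let mut hasher = DefaultHasher::new();
    (secret, local, remote).hash(&mut hasher);
    let f = hasher.finish() as u32;

    m.wrapping_add(f)
}

fn main() {
    let isn = initial_sequence_number(
        1_000_000, 0xDEAD_BEEF_CAFE_F00D,
        ([192, 168, 1, 1], 1234), ([192, 168, 1, 2], 50000));
    println!("ISN: {:#010x}", isn);
}
```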
raghu has joined #m-labs
<raghu>
bb-m-labs: force build --props=package=artiq-kc705-nist_qc2 artiq-board
<bb-m-labs>
build forced [ETA 16m43s]
<bb-m-labs>
I'll give a shout when the build finishes
raghu has quit [Client Quit]
mumptai has joined #m-labs
<GitHub118>
[smoltcp] whitequark commented on issue #24: @klickverbot Thanks https://git.io/vQG3m
<GitHub78>
[smoltcp] whitequark commented on issue #19: Yeah that works. https://git.io/vQG3Y
<GitHub187>
[smoltcp] whitequark commented on issue #19: Hmm, I'm not sure if I like this idea very much, we already have drop magic in Device and that's pretty bad already. But I can give it a look. https://git.io/vQGDU