_florent_ changed the topic of #litex to: LiteX FPGA SoC builder and Cores / Github : https://github.com/enjoy-digital, https://github.com/litex-hub / Logs: https://freenode.irclog.whitequark.org/litex
tpb has quit [Remote host closed the connection]
tpb has joined #litex
lf has quit [Ping timeout: 260 seconds]
lf has joined #litex
st-gourichon-f has joined #litex
st-gourichon-fid has quit [Ping timeout: 256 seconds]
<gregdavill> _florent_: Thanks for looking into that so quickly! I've pulled your litedram changes and altered the CSR module in my design with the external reset signal. It's looking good here.
<gregdavill> An interesting observation, I'm getting different bitslip results if I load my design via JTAG, compared to if it's loaded from FLASH.
Degi has quit [Ping timeout: 260 seconds]
Degi has joined #litex
jaseg has quit [Ping timeout: 272 seconds]
jaseg has joined #litex
guan has quit [Read error: Connection reset by peer]
bubble_buster has quit [Ping timeout: 240 seconds]
mithro has quit [Ping timeout: 272 seconds]
levi has quit [Read error: Connection reset by peer]
mithro has joined #litex
bubble_buster has joined #litex
guan has joined #litex
levi has joined #litex
m4ssi has joined #litex
kgugala_ has joined #litex
kgugala has quit [Ping timeout: 265 seconds]
m4ssi has quit [Remote host closed the connection]
st-gouri- has joined #litex
st-gourichon-f has quit [Ping timeout: 264 seconds]
kgugala has joined #litex
kgugala_ has quit [Ping timeout: 260 seconds]
gregdavill has quit [Ping timeout: 264 seconds]
leons has quit [Quit: killed]
CarlFK[m] has quit [Quit: killed]
disasm[m] has quit [Quit: killed]
sajattack[m] has quit [Quit: killed]
nrossi has quit [Quit: killed]
david-sawatzke[m has quit [Quit: killed]
john_k[m] has quit [Quit: killed]
xobs has quit [Quit: killed]
david-sawatzke[m has joined #litex
gregdavill has joined #litex
xobs has joined #litex
disasm[m] has joined #litex
john_k[m] has joined #litex
CarlFK[m] has joined #litex
nrossi has joined #litex
sajattack[m] has joined #litex
leons has joined #litex
kgugala_ has joined #litex
scanakci has quit [Quit: Connection closed for inactivity]
kgugala_ has quit [Read error: Connection reset by peer]
kgugala has quit [Ping timeout: 258 seconds]
kgugala has joined #litex
<_florent_> gregdavill: indeed i also get different bitslip results when loading multiple times via JTAG, that's the next thing to investigate :)
<somlo> _florent_, gregdavill: memtest on rocket+litex on the trellisboard (ecp5-85k) used to fail 30-50% of the time, depending on the week. After yesterday's litedram update, it's down to 10% :) I used to think it had maybe something to do with the board and chips warming up after a while (when the error rate would decrease)...
<somlo> not sure I'm helping, but figured I'd throw in an extra data point, fwiw...
<somlo> never made it to actually tinkering with the litedram settings in a systematic way, so thanks for doing that!
Skip has joined #litex
gregdavill has quit [Ping timeout: 240 seconds]
<_florent_> somlo: thanks for the feedback
CarlFK has quit [Ping timeout: 240 seconds]
CarlFK has joined #litex
<st-gouri-> Hi! We are sending bulk (around 150kbytes total) data to a wishbone target, and it's very very slow, around 50 bytes per second. Is there some documentation about overhead and what we could do?
<lf> st-gouri-: i am new to this but could you give some extra info like: is that over a bridge, how many masters are on the bus?
<st-gouri-> lf, sure, thanks.
<st-gouri-> To be clearer, the PC runs wishbone client code in python, connected to a litex_server; that litex_server sends data through a 3-wire UART to the design. so far so good?
<st-gouri-> The bulk data is used to drive a second UART that sends the bulk data to some other device.
<st-gouri-> Currently, we send bytes one by one and it's not clear if that is the actual performance killer.
<st-gouri-> We have understood that an event register is necessary to read data back to our client. Writing 2 to the UART_EV_PENDING register signals that we have read the byte from UART_RXTX, so that the design makes the next byte received by that UART available at the register.
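A minimal sketch of the host-side access pattern described above, assuming litex_server is already running and the SoC uses the default uart_* CSR names (uart_rxtx, uart_txfull, uart_ev_pending); the exact register names depend on the CSR map of the actual design:

```python
from litex import RemoteClient

wb = RemoteClient()  # connects to litex_server (localhost:1234 by default)
wb.open()

def send_byte(b):
    # Poll TXFULL, then push one byte into the UART TX FIFO.
    # Note: each poll is a full read request over the bridge.
    while wb.regs.uart_txfull.read():
        pass
    wb.regs.uart_rxtx.write(b)

def recv_byte():
    # Wait for the RX event (bit 1), read the byte, then write 2 to
    # EV_PENDING to acknowledge it so the next byte is presented.
    while not (wb.regs.uart_ev_pending.read() & 0x2):
        pass
    b = wb.regs.uart_rxtx.read()
    wb.regs.uart_ev_pending.write(0x2)
    return b
```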
<zyp> AFAIK the uart protocol does 32-bit transfers, so if you're only using 8 of them, that's a 4x overhead in itself
<lf> but still you should be able to get nearly 1000 requests per sec
<st-gouri-> lf, interesting.
<zyp> what baudrate does the bridge run at?
<st-gouri-> Let me check.
<st-gouri-> zyp, bridge at 115200 bauds.
<st-gouri-> Any idea about the overhead of a request?
<tpb> Title: litex/uart.py at master · enjoy-digital/litex · GitHub (at github.com)
<lf> it checks the data_width not address_width
<zyp> are you doing interleaved reads and writes as well?
<zyp> if so, you're also subject to the latency of any usb-uart involved
<zyp> I've found that one of my usb-uart adapters takes like 10x longer to transfer a litescope buffer than another, because it's using a buffering strategy optimized for throughput rather than latency, or something like that
<_florent_> st-gouri-: i also think it could be related to interleaved reads/writes as zyp suggests
<_florent_> st-gouri-: the UART bridge is not very fast, but it shouldn't be that slow...
<st-gouri-> How many bytes are sent by litex_server for each request?
<zyp> I'm guessing 8
<zyp> four for the address and four for the data
<st-gouri-> Sounds reasonable.
<lf> cmd, length, 4x addr, 4x data i think for write
<st-gouri-> Regarding interleaved read and write, in spirit yes, let me check.
<zyp> ah, yeah, of course there has to be a cmd as well
<zyp> yeah, lf is correct, here's a usb-capture I did of wishbone-bridge traffic some weeks ago
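For reference, a sketch of the write framing just described (cmd, length, 4-byte address, then 4 bytes per data word); the opcode value here is a placeholder, not taken from the UARTBone source:

```python
import struct

CMD_WRITE = 0x01  # placeholder opcode; see litex's uartbone code for the real value

def make_write_request(addr, words):
    # cmd (1) + length (1) + address (4) + 4 bytes per 32-bit word.
    pkt = struct.pack(">BBI", CMD_WRITE, len(words), addr)
    for w in words:
        pkt += struct.pack(">I", w)
    return pkt

# A single-word write is 10 bytes on the wire; when only 8 bits of the
# word are useful, that is 1 payload byte per 10 bridge bytes.
assert len(make_write_request(0x82000000, [0x41])) == 10
```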
scanakci has joined #litex
<lf> but if i read the UARTBone correctly it can only make data_width aligned writes
<zyp> that'd be pretty natural
<zyp> I'm assuming it doesn't do much in the way of width conversion at all
<st-gouri-> Can we in one etherbone request send several values in sequence to the same register?
<zyp> I don't think so
<st-gouri-> Like, send 16 bytes to be sent to the same address?
<zyp> I mean, I don't know, but I would guess not
<st-gouri-> Or we would need to extend the protocol. Haven't actually looked at it. Might be an option.
<st-gouri-> We might have to send a lot of data through it later on. Perhaps an interactive terminal. Perhaps the gdb server protocol.
<lf> ok, any more details? you have one UARTBone and one UART on the bus, and nothing else?
<lf> both from litex?
<zyp> judging by https://github.com/enjoy-digital/litex/blob/master/litex/soc/cores/uart.py#L344, it's always incrementing the address if writing more than one word
<tpb> Title: litex/uart.py at master · enjoy-digital/litex · GitHub (at github.com)
<st-gouri-> lf, both from litex. Not much activity inside the design at this point in time.
<lf> st-gouri-: so only one master on the bus?
<st-gouri-> zyp, interesting.
<st-gouri-> Mhmhm, most probably only one active master on the bus, the uart fed by the litex_server.
<zyp> st-gouri-, what is your goal for what you're doing?
<st-gouri-> Our design is a kind of I/O chip for an external (separate chip) CPU.
<zyp> and you're planning to use uart for the interface towards the CPU?
<st-gouri-> The UART that we drive from the PC, along with some GPIO pins, is now capable of sending the correct sequence to boot the CPU (microcontroller). But it's slow.
<st-gouri-> Booting the CPU involves two (2) GPIOs and one (1) UART.
<zyp> and what will drive this?
<st-gouri-> zyp, not sure what you're asking.
<lf> ok, if it looks like this "INFO:SoCBusHandler:Interconnect: InterconnectShared (1 <-> 2)." i can cross rogue bus master off my list, and go to python/usb.
<zyp> st-gouri-, right now you're using a PC to talk to the wishbone bridge
<st-gouri-> After the CPU is booted, it will drive the I/O chip by targeting the wishbone bus. Probably through SPI.
<st-gouri-> In the field, the design will manage to setup the CPU by itself. No PC will be needed.
<zyp> so if you won't use the uart bridge in the field, I guess the performance of it is kinda moot
<st-gouri-> One simple solution in the field is to have the design (in an FPGA) boot a softcore that runs some code to target wishbone. Same actions, lower overhead.
<st-gouri-> zyp, not so moot, because currently, to boot the CPU we need to send 160k of data and that takes 41 minutes. That is a problem.
<zyp> if you want a faster bridge, I guess you could try valentyusb
<st-gouri-> In another setup, the PC directly driving the CPU with an UART, we get 10kbytes/s, not 50 bytes/sec.
<st-gouri-> zyp, at this point I'm trying to understand the actual source of slowness. (I have a simple external logic analyzer at my disposal, too.)
<lf> load a bitstream to connect the pins, boot the cpu, then load a new bitstream?
<lf> you are polling the status register to see if it finished?
<_florent_> st-gouri-: are you able to provide a minimal design to reproduce the issue?
<_florent_> i could look at this
<st-gouri-> lf, planned in the field: the FPGA gets the design from flash, which instantiates a small softcore CPU. That CPU boots from the same flash and runs software. That software drives the GPIO and UART through wishbone. These boot the main processor. Then the main (fast) processor can do whatever, targeting wishbone through SPI.
<_florent_> the bridge supports bursts, but not sure the current software makes use of it
<st-gouri-> lf, which status register? EV_PENDING ?
<st-gouri-> _florent_, bursts to same address?
<_florent_> st-gouri-: no this is incrementing
<_florent_> st-gouri-: but we could eventually add a different command for non-incrementing writes
<st-gouri-> If the UART has, say, a 16-byte buffer, and we send bursts of 16 bytes, then we might have only 60% overhead instead of 1000%.
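Rough arithmetic behind these overhead figures (a sketch; it assumes the 6-byte cmd/length/address header discussed above, and the exact percentages depend on whether bytes could be packed four to a word):

```python
HEADER = 6  # cmd + length + 4-byte address

def overhead(payload_bytes, bytes_per_word=4):
    # One payload byte per 32-bit word on the wire; a burst shares one header.
    wire = HEADER + payload_bytes * bytes_per_word
    return (wire - payload_bytes) / payload_bytes * 100

print(overhead(1))   # 900.0  -> the ~1000% single-byte case
print(overhead(16))  # 337.5  -> 16-byte burst, still one word per byte
# If bytes could be packed four per word, a 16-byte burst would cost
# (6 + 16 - 16) / 16 = 37.5% overhead.
```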
<st-gouri-> But I suspect there is something else.
<_florent_> st-gouri-: yes i also suspect there is something else
<_florent_> can you share the part of the python code on the host that is doing the upload?
<st-gouri-> Yes.
<st-gouri-> Will be all open-source eventually, but is not yet. Will pastebin parts.
<lf> and can you hook up a logic analyser to both uarts? maybe the USB-UART adapter is really slow with short messages
<st-gouri-> One question about EV_PENDING. Is it set after any byte sent? Or when send buffer becomes empty?
<st-gouri-> Probably the former.
<tpb> Title: Ubuntu Pastebin (at paste.ubuntu.com)
<st-gouri-> Wow, the indentation appears wrong.
<st-gouri-> The first line "def" should be aligned with the second "def".
<st-gouri-> It's part of a Python class that mimics the standard class for Uart, but routes through the python wishbone client.
<_florent_> st-gouri-: for the IRQ/Pending:
<tpb> Title: litex/uart.py at master · enjoy-digital/litex · GitHub (at github.com)
<tpb> Title: litex/uart.py at master · enjoy-digital/litex · GitHub (at github.com)
<st-gouri-> _florent_, ahah, "non-full".
<st-gouri-> That's very nice for a local ISR. This allows nice pipelining to use the full UART bandwidth. Cool.
<_florent_> st-gouri-: could you do a quick test to see if the issue is related to the interleaved reads/writes: remove the while (self.isTxFull()) and use a time.sleep() after self._wb.regs.uart_rxtx.write(byte)
<st-gouri-> _florent_, good idea.
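The suggested test, as a sketch adapted from st-gouri-'s helper class (the 1 ms delay is an arbitrary placeholder; at 115200 baud one byte needs ~87 µs on the wire, so it is generous):

```python
import time

def send_bytes_no_readback(wb, data):
    for byte in data:
        # was: while isTxFull(): pass  <- one extra read request per byte
        wb.regs.uart_rxtx.write(byte)
        time.sleep(0.001)  # pace writes with a fixed delay instead
```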
<st-gouri-> Full class if it is of any use: https://paste.ubuntu.com/p/zvj2TbVHnF/
<tpb> Title: Ubuntu Pastebin (at paste.ubuntu.com)
<st-gouri-> MMh, I have a strange unrelated error "index out of range", have to check that.
<st-gouri-> pycharm does not let me know exactly where the exception is raised.
FFY00 has quit [Ping timeout: 260 seconds]
<st-gouri-> Ah, some progress.
FFY00 has joined #litex
FFY00 has quit [Remote host closed the connection]
FFY00 has joined #litex
<st-gouri-> _florent_, I simply disabled the call to isTxFull(), and get speed up to 973 bytes/sec.
<st-gouri-> So, the problem is most probably the interleaved read/writes.
<st-gouri-> 973 bytes/sec is much more in line with what is expected from a 10-byte-long packet sending one payload byte.
<st-gouri-> The transfer was successful and took 2 minutes 25 seconds instead of 41 minutes when interleaving reads and writes.
<st-gouri-> I know it's successful because the CPU has definitely booted and runs our code.
<lf> ya, i would say zyp is right about the slow uart adapter. it probably takes some time for it to flush its buffer.
<st-gouri-> Is the problem in the design of the UART, litex-level?
<st-gouri-> Or in a third-party USB-UART bridge?
<lf> not sure. but i do know that some ppl curse about uart adapters buffering small transfers for long times.
<st-gouri-> In this experiment, the UART adapter is soldered in the TinyFPGA BX board being used.
<st-gouri-> Ah, no, wrong.
<st-gouri-> The UART advertises FTDI TTL232R-3V3 idVendor=0403, idProduct=6001, bcdDevice= 6.00
<st-gouri-> Do you expect we might have better performance with another one?
<lf> ya i never had that problem so i don't know.
<lf> and if it were a problem with that chip i don't think they would use it
<st-gouri-> Was wrong when mentioning TinyFPGA Bx. It's not that one. It's a separate independent cable, with the device details I provided.
<st-gouri-> So, now I'm sending data as fast as I can without checking the TX-Full register, and it definitely can't be full, because it has n times more time than it needs, where n is the size of the etherbone packet!
<lf> ok the ftdi should flush its buffer every 16ms
<lf> mmh, that is like 60Hz, or you know, ~50 bytes over the uart
<lf> page 6, er, 7
<st-gouri-> lf, very interesting.
<st-gouri-> "For application programmers it must be stressed that data should be sent or received using buffersand not individual characters." ... well, this protocol kind of needs interleaving read and writes.
<st-gouri-> "3.2Adjusting the Receive Buffer Latency Timer" -> howdo you know how to to that on Linux? From Python?
<lf> ya, the problem is the read, as the response from the bridge gets stuck in the buffer
<tpb> Title: FTDI Linux USB latency - Granite Devices Knowledge Wiki (at granitedevices.com)
<st-gouri-> Thanks lf for those relevant links!
<st-gouri-> Setting latency_timer to 1 I get 218b/s instead of 50b/s.
<st-gouri-> This shows that the latency timer is indeed involved in the delay.
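For reference, on Linux the ftdi_sio driver exposes this latency timer through sysfs, so it can be changed from Python without extra libraries (a sketch; ttyUSB0 is whatever the adapter enumerates as, and writing the attribute needs root or a udev rule):

```python
# Lower the FTDI latency timer from its 16 ms default to 1 ms.
with open("/sys/bus/usb-serial/devices/ttyUSB0/latency_timer", "w") as f:
    f.write("1")
```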
<lf> well, that is that mystery solved. but how to solve the problem?
<st-gouri-> Let n=10 be the size of an Etherbone packet writing one byte to the UART.
<st-gouri-> As long as the baudrate on the PC side is not higher than n times the baudrate of the other UART, we're safe.
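The same safety argument worked through with this session's numbers (a sketch; 10 bits per UART byte = start + 8 data + stop):

```python
bridge_baud = 115200  # PC <-> litex_server UART
n           = 10      # bridge bytes per payload byte (one write packet)

payload_rate = bridge_baud / 10 / n   # 1152 payload bytes/s max
target_drain = 115200 / 10            # 11520 bytes/s at the same baudrate

# The target UART drains ten times faster than the bridge can fill it,
# so its TX FIFO can never overflow even without polling TXFULL.
assert payload_rate <= target_drain
```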
<lf> lol yes
<st-gouri-> And... guess what the next step will be to gain some performance. ;-)
<st-gouri-> More seriously, what we have done today is very good. We understood the reason for such slow performance, got a fix, and proved that, ugly as it looks, it is actually safe.
<lf> do i even dare guessing
<st-gouri-> Currently, the litex_server runs at 115200. One solution is to pump it up to 10 times that speed and call it a day.
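A sketch of that solution, assuming the current litex_server flags and the add_uartbone() helper on the SoC side; both ends must agree on the rate, and the FTDI cable has to support it:

```python
import subprocess

# Host side: restart the bridge server at ten times the rate.
subprocess.Popen([
    "litex_server", "--uart",
    "--uart-port", "/dev/ttyUSB0",
    "--uart-baudrate", "1152000",
])

# Gateware side (in the SoC definition, requires a rebuild):
#   self.add_uartbone(name="serial", baudrate=1152000)
```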
<lf> st-gouri-: can you just bypass all the logic and switch the bypass off after you are done? but sure, if it's only for development it's probably the best solution
<st-gouri-> Bypass all which logic?
<lf> my writing is bad
<lf> just connect uart_a to uart_b with comb logic. like a mux on the tx pin
FFY00 has quit [Read error: Connection reset by peer]
FFY00 has joined #litex
FFY00 has quit [Remote host closed the connection]
FFY00 has joined #litex
<st-gouri-> Mfmfmf. We would lose the multiplexing property of the wishbone bridge. Would need to kill litex_server to free the UART. Then would need some way to revert. Once booted the CPU can do that. That could actually work.
<st-gouri-> If we can beef up the first UART to 1152000 we have the same benefits and no downside.
<lf> true
<st-gouri-> Still, it's interesting.
<st-gouri-> Many ideas flying. Even if we don't implement most, it's still interesting.
FFY00 has quit [Read error: Connection reset by peer]
FFY00 has joined #litex
m4ssi has joined #litex
<zyp> sorry, I had to go put the kid to bed :)
<zyp> the 16ms buffer flush that lf is quoting sounds like the same I ran into
<zyp> one of the adapters I used was based on FT232X
<zyp> and a different adapter (based on a stm32 running a usb-uart firmware) got 10x the throughput at the same baudrate
<zyp> before I left I proposed looking into running the valentyusb bridge, i.e. usb directly to the fpga instead of uart
<lf> zyp: we could add an option to send an event character at the end of a read. but you would still need to configure that in the ftdi chip
<zyp> I haven't tested the performance of that myself, but it should eliminate the buffer latency completely
<zyp> the valentyusb bridge stuff is based on control requests, which usb should be able to send a couple thousand of per second, so if I'm guessing, you might see a kilobyte or more per second from that
<zyp> bottleneck is going to be how fast the host side is passing requests and replies between userland and the usb controller
<lf> ya
<st-gouri-> reading
<st-gouri-> Thanks for the hints.
<lf> i would need to look into litescope, but maybe there is a way to change the read behavior to get less overhead when reading, or optimize it more for this buffering behavior
<st-gouri-> TinyFPGA BX and Fomu use valentyUSB for their bootloader, IIRC.
<zyp> st-gouri-, yes
<zyp> lf, the problem with the wishbone bridge is that currently the only way to get flow control is to poll a register in between reads or writes
<zyp> although litescope shouldn't need that when dumping the buffer, so the protocol could be extended to add a «read address A N times»
<zyp> as far as I can see, the length argument is not considered for reads currently, only writes
<zyp> and non-incrementing reads and writes are necessary for dealing with fifo registers
<lf> but does that help? it will just read the address as fast as the bus can. i think using the event char or the flow control lines of the ftdi chip to force a buffer flush would be more helpful.
<lf> or using the usb bridge
<zyp> help what? it'd help for litescope
<zyp> the slow part of using litescope is dumping the capture buffer after the capture is finished
<lf> ah, i have not read how litescope transfers data. but if that is behind one address then yes, that should give a big boost
<zyp> yeah, litescope puts everything behind a set of CSRs
<zyp> one of them pops from a fifo
<zyp> actually, it's not even doing flow control, it's just the constant back and forth between «READ 1 from ADDR» «DATA» with latency in between
<lf> ya you could just not wait for the response and send the next request
<zyp> that is true
<zyp> but I wonder if that would risk overflowing the uart on the fpga
<zyp> although as long as it has deep enough buffers it should be fine
<lf> i don't think UARTBone has a buffer. only the UART for the CSR has fifos.
<lf> ah there is a "FT245 Asynchronous FIFO mode" but i think that uses extra pins
<zyp> but we're discussing software fixes now :)
<lf> ya
<zyp> hardware solutions are not very useful if you already have a hardware design you don't want to modify
<st-gouri-> will go afk
<lf> but i think you are right, that would overwhelm the uart.
<zyp> yeah, it'd need enough buffering that it could start receiving the second read while replying to the first
<lf> you could set the event char to zero because reading a csr register will always return 3 zero bytes
<zyp> not for 32-bit CSRs
<lf> arg right
<lf> we need our own uart adapter that is uartbone aware
<st-gouri-> What? sorry
<zyp> lf, just have the rx buffer flush at four bytes :)
<st-gouri-> A modified UARTbone could be fed 4 bytes at a time... but how to tell it there's only 3 bytes?
<lf> ya, or let it handle read/write and you just give it a list of addresses you need
<st-gouri-> The fourth byte would be a status to tell if there's one, two or three bytes in the packet?
<st-gouri-> Overhead divided by 3.
<st-gouri-> Not only when driven through litex_server. Also locally.
<st-gouri-> Unused bits of the fourth byte could be UART lines.
<st-gouri-> But I digress.
<lf> ya, for now the biggest delay is the 1-16ms buffer delay of the ftdi. and without changing the bitstream, the only way to make that fast is to put litex_server on the adapter
<lf> mmh pi zero?
<st-gouri-> Seeya.
<lf> bye
<lf> i think when i get that problem i will just try some cortex-a with linux and run litex_server on that with its native uart.
Skip has quit [Remote host closed the connection]
<lf> n8
daveshah has quit [Read error: Connection reset by peer]
daveshah has joined #litex
st-gouri- has quit [Ping timeout: 240 seconds]
m4ssi has quit [Remote host closed the connection]