<GitHub17>
[smoltcp] whitequark commented on pull request #244 3d9b73b: The purpose of this method is to be able to update IP addresses without assigning a different ManagedSlice. It's for memory-constrained devices without an allocator. https://github.com/m-labs/smoltcp/pull/244#discussion_r197364017
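The pattern whitequark describes — mutating the address storage in place rather than assigning a new `ManagedSlice` — can be sketched without smoltcp itself. In this sketch, `AddrTable` and its closure-based `update_addrs` are hypothetical stand-ins for the real `Interface` API, using a fixed-size array as the allocator-free backing store:

```rust
use std::net::Ipv4Addr;

/// Fixed-capacity address table, standing in for smoltcp's
/// `ManagedSlice<IpCidr>` on a device without an allocator.
struct AddrTable {
    addrs: [Ipv4Addr; 2],
}

impl AddrTable {
    fn new() -> Self {
        AddrTable { addrs: [Ipv4Addr::UNSPECIFIED; 2] }
    }

    /// Mutate the addresses in place through a closure, in the spirit of
    /// smoltcp's `update_ip_addrs`: the backing storage is never replaced,
    /// so no (re)allocation ever happens.
    fn update_addrs<F: FnOnce(&mut [Ipv4Addr])>(&mut self, f: F) {
        f(&mut self.addrs);
    }
}

fn main() {
    let mut table = AddrTable::new();
    table.update_addrs(|addrs| addrs[0] = Ipv4Addr::new(192, 168, 1, 52));
    println!("{:?}", table.addrs);
}
```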
<sb0>
ffs the sayma bug festival never ends, does it? cannot load RTM FPGA gateware: "Did not exit INIT after releasing PROGRAM" appeared out of the blue on one board
<sb0>
meanwhile, the other one developed new power supply problems
<sb0>
mh, the RTM loading failure seems to be another symptom of the general sayma memory corruption/insanity ...
<GitHub-m-labs>
artiq/master f87da95 Sebastien Bourdeauducq: jesd204: use jesd clock domain for sysref sampler...
<GitHub-m-labs>
artiq/master 76fc63b Sebastien Bourdeauducq: jesd204: use separate controls for reset and input buffer disable
<GitHub-m-labs>
artiq/master d9955fe Sebastien Bourdeauducq: jesd204: make sure IOB FF is used to sample SYSREF at FPGA
<sb0>
sayma as satellite with a kasli master seems to work just fine. the sayma master, on the other hand, is completely trashed since I added SAWG
<sb0>
it seems even crashier than the standalone target
hartytp_ has joined #m-labs
<sb0>
sync between the two sayma doesn't work at all, on the other hand...
<sb0>
the phase even varies without rebooting any board
<sb0>
there are discrete phase jumps in the output. i guess there's jitter on sysref or something
<sb0>
those jumps are present on one board only (this is two satellites, same gateware, driven by kasli)
<sb0>
that board is Florent's board, on which one DAC is dead... could be just a hardware problem?
<sb0>
hartytp, can you test?
<sb0>
once the drtio link is established, there are no more sysref adjustments, and I didn't see phase jumps on the other board, so it looks like a one-off board issue
<sb0>
could also try running the standalone design on Florent's board and look at phase jumps to confirm ...
<hartytp_>
so, a related question: when we discussed the crashes related to the HMC7043 noise, I thought you said that all logic clocked from the HMC7043 was held in reset during the boot
<hartytp_>
but, that's not true, is it?
<hartytp_>
the SAWG is clocked from that and is *not* held in reset during boot
<hartytp_>
so, we were running a pretty big chunk of logic from a crap clock
<hartytp_>
that may explain why the issues we saw were so bad. I wonder if we would have had such bad HMC7043 issues if the rtio_phy CD had been held in reset until we'd guaranteed a stable clock...
<sb0>
normally there is no difference, especially since it's a synchronous reset
<sb0>
additionally, nothing in the rio_phy domain is supposed to interfere with the CPU or SDRAM
<hartytp_>
sure
<hartytp_>
but, normally you would assume a stable clock for that
<hartytp_>
the HMC7043 isn't really designed to do that. at least not during boot
<hartytp_>
anyway, not saying that that was our issue, but it does mean that one of the assumptions that I thought we had agreed on when talking about the HMC7043 was not correct
<sb0>
not really; timing violations may corrupt the state of FFs, but then a reset would clear them
<hartytp_>
yes, so the model here would have to be some kind of PI/SI issue
<sb0>
sending 2GHz noise through the FPGA clock networks, on the other hand, can cause problems, and synchronous resets won't help
<hartytp_>
yes, but it's probably best practice to hold the logic in reset until the clock is good
<hartytp_>
anyway, I think I'll OR the rtio_phy reset with a CSR that defaults to 1 and then release it at the end of boot
<hartytp_>
then run mem tests at a few points during boot and see if I can identify what on earth is going on
<sb0>
just use the existing ResetSignal("rtio")
<sb0>
that one should be asserted until the "rtio" clock is stable
<hartytp_>
that's unlikely with an array size of 0x10000
<hartytp_>
isn't it?
<hartytp_>
32 bit addresses, right?
<sb0_>
yes, u32
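For scale, the numbers just confirmed pin down the array's footprint: 0x10000 entries of u32 is 256 KiB, as a quick check shows.

```rust
fn main() {
    // 0x10000 entries of u32 (32-bit addresses), as confirmed above.
    let entries: usize = 0x10000;
    let bytes = entries * core::mem::size_of::<u32>();
    assert_eq!(bytes, 256 * 1024); // 262144 bytes = 256 KiB
    println!("{} entries x 4 B = {} KiB", entries, bytes / 1024);
}
```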
<GitHub-m-labs>
[artiq] jordens commented on issue #1065: We also had discussed adding a blinking LED (or SMA) and reproduce it toggling erratically in the corrupted state (iirc that's something that was observed at one point). That would allow debugging of the clocking when the board is in the failed/corrupted state. https://github.com/m-labs/artiq/issues/1065#issuecomment-399481337
<GitHub-m-labs>
[artiq] hartytp commented on issue #1065: Thanks for the reminder. So, you were looking on the DRTIO master build without SAWG (which didn't crash for you). You looked at the blink signal using microscope. Expectation is that it should toggle at about 2Hz (150MHz / 2^28). What exactly did you see? I'm happy to try adding that to my build at some point soon... https://github.com/m-labs/artiq/issues/1065#issu
<jkeller>
bb-m-labs: force build --props=package=artiq-board,artiq_target=kc705,artiq_variant=nist_qc2 artiq-board --branch=release-3
<bb-m-labs>
build forced [ETA 43m49s]
<bb-m-labs>
I'll give a shout when the build finishes
jkeller has quit [Client Quit]
<GitHub-m-labs>
[artiq] hartytp commented on issue #1065: how long did it do that for? You expect the HMC7043 to startup at the wrong frequency for a while before it is configured via SPI. There, you have the CB enabled even during HMC7043 configuration, so there will be a period of "noise". https://github.com/m-labs/artiq/issues/1065#issuecomment-399491943
<GitHub-m-labs>
[artiq] jonaskeller commented on issue #1076: I'd like to test this but can't build the newest `kc705-nist_qc2` 3.6 gateware. The bot is building 4.0.dev despite the argument `--branch=release-3`:... https://github.com/m-labs/artiq/issues/1076#issuecomment-399496511
<whitequark>
first, don't expect this to work on the comms CPU (in runtime code), other than by accident
<whitequark>
the runtime stack is much smaller than that
<whitequark>
second, there is no logger registered in the code running on comms CPU
<whitequark>
you can use println! instead of info!
<whitequark>
the comms CPU stack is large enough, so the stack-allocated array is fine
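The kind of stack-allocated memory test being discussed can be sketched as below. This is a host-side illustration only, assuming std; the real kernel code is no_std and gets its `println!` from ksupport/lib.rs, and the pattern and buffer size here are made up for the example:

```rust
/// Minimal walk over a stack-allocated buffer: write an address-derived
/// pattern, read it back, and report any flipped bits via println!.
fn memtest(buf: &mut [u32]) -> usize {
    // Write phase: derive each word from its index so mismatches localize.
    for (i, w) in buf.iter_mut().enumerate() {
        *w = (i as u32) ^ 0xAAAA_5555;
    }
    // Read-back phase: count and report words that no longer match.
    let mut errors = 0;
    for (i, w) in buf.iter().enumerate() {
        let expected = (i as u32) ^ 0xAAAA_5555;
        if *w != expected {
            println!("mismatch at {:#x}: {:#010x} != {:#010x}", i, w, expected);
            errors += 1;
        }
    }
    errors
}

fn main() {
    // Deliberately much smaller than 0x10000 so it fits on any stack.
    let mut buf = [0u32; 1024];
    println!("memtest errors: {}", memtest(&mut buf));
}
```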
jkeller has quit [Quit: Page closed]
_whitelogger has joined #m-labs
<GitHub-m-labs>
[artiq] whitequark commented on issue #1072: Back when it was added, @sbourdeauducq said that the "proper" way to do debug printing is with the `print` RPC; `core_log` was always internal. https://github.com/m-labs/artiq/issues/1072#issuecomment-399571654
<GitHub-m-labs>
[artiq] mfe5003 commented on issue #1078: So it looks like I can communicate with kasli using `artiq_coremgmt` and the log (0x01) and reboot (0x05) commands seem to work fine. I can change the log level to debug then try to write a key value pair.... https://github.com/m-labs/artiq/issues/1078#issuecomment-399578773
<GitHub-m-labs>
[artiq] marmeladapk commented on issue #1078: @mfe5003 This is a question to @sbourdeauducq or @jordens. But you don't need a idle kernel to work with Kasli and schedule experiments, it's just an experiment that activates when nothing else is happening (for example to toggle diode). https://github.com/m-labs/artiq/issues/1078#issuecomment-399579952
<hartytp__>
whitequark: thanks!
<hartytp__>
so, I just need to swap the info! for println! and all should be good
<GitHub-m-labs>
[artiq] mfe5003 commented on issue #1078: @whitequark This is my first time trying to use artiq, so I am trying to figure out how it all works. I did not intend to use different gateware/firmware. It seems like I need to use version 4 to use kasli, because conda ends up pulling from the version 4 dev branch when I do:... https://github.com/m-labs/artiq/issues/1078#issuecomment-399582990
<hartytp__>
whitequark: "error: cannot find macro `println!` in this scope"
<whitequark>
hartytp__: yes, println! in kernels is defined in ksupport/lib.rs
<whitequark>
so put your code there
<hartytp__>
aah, thanks
<hartytp__>
well, I already hacked it to just return the results rather than printing
<hartytp__>
but good to know
<GitHub-m-labs>
[artiq] whitequark commented on issue #1078: If you want to stay on the release versions, you can remove the dev channel from conda instead. Alternatively, you could flash the dev channel gateware using the artiq_flash script. https://github.com/m-labs/artiq/issues/1078#issuecomment-399585726
<GitHub-m-labs>
[artiq] hartytp commented on issue #1065: So, we cannot find any evidence of a SI/PI problem after probing the HW, and I can't find any evidence of memory corruption occurring during boot or during kernel operation.... https://github.com/m-labs/artiq/issues/1065#issuecomment-399591428
<GitHub-m-labs>
[artiq] gkasprow commented on issue #1080: that's true. But this was the only modification I did. There is 3.3V -> 1.8V conversion using a 200R resistor that injects current into the 1.8V port of the DAC and FPGA. Theoretically the FPGA has protection diodes, but the DAC may not like voltage peaks of roughly 2.5V (1.8V + 0.7V of diode). I have no idea how this could affect the second DAC channel in such a bizarre way.... https:/
<GitHub-m-labs>
[artiq] hartytp commented on issue #1080: My guess was that it's due to one of the recent ARTIQ commits rather than the HW changes. But, I might be wrong -- I haven't given it too much thought yet. https://github.com/m-labs/artiq/issues/1080#issuecomment-399593872
<GitHub-m-labs>
[artiq] gkasprow commented on issue #1080: the funny thing is that I started seeing PRBS errors on one board a few days ago, while another was working well. And the next day the second board also got the PRBS "sickness". https://github.com/m-labs/artiq/issues/1080#issuecomment-399594020
<GitHub-m-labs>
[artiq] whitequark commented on issue #1065: And of course the crash happens in code because you evict all code from L2 cache during the memory test, whatever is executed during memory test doesn't fit in L1, and so on the next code fetch from DRAM you get a crash. https://github.com/m-labs/artiq/issues/1065#issuecomment-399594376
<GitHub-m-labs>
[artiq] hartytp commented on issue #1065: Okay, but does this *really* seem like PI/SI noise? I'm not seeing any memory issues during my random reads or writes, but always the same bits getting flipped in the same places. That doesn't sound like noise to me. https://github.com/m-labs/artiq/issues/1065#issuecomment-399600708
<hartytp__>
whitequark/sb0/rjo: okay, I'm a bit out of my depth here, but this doesn't feel like a simple noise issue, as it seems far too deterministic
<hartytp__>
let me know if you can think of anything else I should try
<hartytp__>
but, given that greg has checked the PI carefully, I think we need to keep looking at ARTIQ to make sure this isn't an issue in the code
<hartytp__>
would be good to hear what your plan for dealing with this is, as these problems have gone on for far too long...
<GitHub-m-labs>
[artiq] whitequark commented on issue #1065: > I'm not seeing any memory issues during my random reads or writes, but always the same bits getting flipped in the same places. That doesn't sound like noise to me.... https://github.com/m-labs/artiq/issues/1065#issuecomment-399602545
<GitHub-m-labs>
[artiq] whitequark commented on issue #1065: Oh, and to add to this: all `println!` statements in the kernel have to go through the runtime, they don't go directly via UART. This means that when your memory test *did* successfully corrupt memory, chances are, the runtime code is *already* corrupted as well. I don't think that you will ever see a failure message with the way this memory test code is composed.
<GitHub-m-labs>
[artiq] whitequark commented on issue #1065: You can run the profiler on the comms CPU and I bet the addresses where you see bitflips will also be at the very top of the profiler report. (I already know from the logs you posted that these addresses are some of the hottest in the runtime.) Conversely, if you look at more crashes you'll see different ones too. If you adjust the runtime code so that it does nothin
<GitHub-m-labs>
[artiq] hartytp commented on issue #1065: It's also interesting that I always seem to get 28 successful memory tests before a crash. @whitequark I get what you're saying, but I'm still not sure that this feels like white noise. We do a lot of successful reading/writing from RAM and then always have a crash in the same place. Seems like a cop out to say it's PI/SI.... https://github.com/m-labs/artiq/issues/1
<GitHub-m-labs>
[artiq] gkasprow commented on issue #1065: It's quite possible, e.g. due to the amount of SSO (simultaneously switching outputs), that at a certain moment there is a voltage peak on one of the supply rails, clock signal, termination voltage, etc.... https://github.com/m-labs/artiq/issues/1065#issuecomment-399604575
<GitHub-m-labs>
[artiq] hartytp commented on issue #1065: @gkasprow okay, what I'm seeing looks deterministic. So, you can take the fork of ARTIQ I linked to above and flash that, as well as the startup Kernel I posted. Check that you see the same crashes as me. Then add a line in ARTIQ that pulses a TTL before each mem test. That gives you your trigger. https://github.com/m-labs/artiq/issues/1065#issuecomment-399605264
<GitHub-m-labs>
[artiq] gkasprow commented on issue #1065: This is tricky; I can use e.g. the SDRAM read signal as a trigger, but cannot say which address is currently being written. I have only four 1GHz active probes and one 5GHz active probe. And the scope has only 4 inputs. I also have logic analyzers, but connecting the probes would kill the SI.... https://github.com/m-labs/artiq/issues/1065#issuecomment-399606180
<GitHub-m-labs>
[artiq] hartytp commented on issue #1065: > But you don't crash in the same place. You provided four crash logs, and there are three different crash addresses in them. Yes, they are on the same bit, but that's just because of the illegal instruction encodings in or1k.... https://github.com/m-labs/artiq/issues/1065#issuecomment-399606212
<GitHub-m-labs>
[artiq] gkasprow commented on issue #1065: I can generate trigger based on sequence of input signals, but this is still not enough to isolate certain address read/write. For this purpose I'd need some logic that toggles IO line.... https://github.com/m-labs/artiq/issues/1065#issuecomment-399606382
hartytp__ has quit [Quit: Page closed]
<GitHub-m-labs>
[artiq] klickverbot commented on issue #1065: Potentially a silly idea @hartytp, but what if you add sleeps/busy spins between memtests? Might help to disambiguate between time being the factor vs. number of writes or something else weirdly stateful. (E.g. is this something heating up leading to SI/PI issues? DRAM refresh being borked?) https://github.com/m-labs/artiq/issues/1065#issuecomment-399606890