<aw>
lekernel, 0x71: I plugged a keyboard into the usb-A port and a mouse into the usb-B port. both are detected fine, but port-A shows 'USB: HC: Transfer start: RX timeout error'. so I swapped keyboard and mouse, and the error did not show up. then I swapped them back, and it showed up again. then I let 0x71 boot into gui mode, and I can still use mouse and keyboard together, even after swapping them. What does this mean?
<wolfspraul>
aw: are you trying with the silicone keyboard? try with another keyboard too (either a completely different non-silicone keyboard, or with a second silicone keyboard)
<wolfspraul>
but definitely X for now
<aw>
yes, silicone keyboard
<aw>
okay...let me try another silicone keyboard.
<wolfspraul>
and a different keyboard (non-silicone) too, if you have one that works
<wolfspraul>
just to get some more data
<wolfspraul>
but the board stays FAIL anyway, so maybe just a waste of time...
<aw>
hmmm... same results with another silicone keyboard. also, after swapping in the gui 'login', it still responded well to my typing. strange indeed.
<aw>
well...marked 'x' still..next board to test. ;-)
<aw>
s.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/6C-reflash-results-1 9. replaced new u17. 10. reflashed successfully by erase version: http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/6C-reflash-results-e
<aw>
0x6c now renders successfully. ;-) amazing, isn't it?
<wolfspraul>
with this history, you cannot set it to 'available'
<aw>
also no d2/d3 dimly lit, although i haven't done that many power cycles.
<wolfspraul>
we need to understand the flash/boot/dimly lit issues first
<aw>
sure, i just wanted to say it passed all tests. strange and amazing.
<wolfspraul>
that's why we need to do more research before starting to sell
<aw>
anyway...i just posted here any news i found/tested. ;-)
<wpwrak>
so the USB transceivers are acting up, too ? (U16, U17)
<lekernel>
the FPGA design has tons of (painful) bugs in the USB "UART", it could just be that part tolerances are tickling them
<wolfspraul>
I don't think I'm worried about those bugs right now
<wolfspraul>
they are clearly identified, and the boards singled out
<wolfspraul>
wpwrak: you proposed a reset IC with different threshold voltage and supplied from 5V. any candidates?
<lekernel>
by the way, it never ceases to amaze me how people seem happy with the opencores USB "UARTs", for example this one: http://opencores.org/project,usbhostslave
<wolfspraul>
I realize to save time we should order some parts...
<lekernel>
"Works like a champ for me" ... this thing has MORE bugs than my crappy design. I used it at the beginning and had to throw it away because it did not work at all.
<wpwrak>
wolfspraul: (not worried) okay, but what is the trigger for replacing U17 ?
<lekernel>
you could see that piece of crap obviously misbehaving on trivial corner cases of bit stuffing etc.
<lekernel>
for example it doesn't bitstuff correctly the last bit of USB packets
<lekernel>
every USB application note tells you to be careful about that
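[editor's sketch] The corner case lekernel describes can be illustrated with a minimal bit-stuffing model. This is only a sketch under stated assumptions, not the Milkymist or opencores logic: the names bit_stuff and bit_unstuff are made up for illustration, and the model ignores the sync field and NRZI encoding. It only shows the rule that a 0 is inserted after six consecutive 1s, including when those six 1s are the last bits of the packet.

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1s (simplified USB rule)."""
    out = []
    ones = 0
    for b in bits:
        out.append(b)
        if b == 1:
            ones += 1
            if ones == 6:
                # The stuffed 0 is added even if this was the last data bit.
                out.append(0)
                ones = 0
        else:
            ones = 0
    return out


def bit_unstuff(bits):
    """Remove stuffed bits; raise ValueError if a required stuffed 0 is missing."""
    out = []
    ones = 0
    i = 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        if b == 1:
            ones += 1
            if ones == 6:
                i += 1  # the next bit must be the stuffed 0
                if i >= len(bits) or bits[i] != 0:
                    raise ValueError("bit-stuffing error")
                ones = 0
        else:
            ones = 0
        i += 1
    return out
```

A transmitter that skips the final stuffed bit would send six 1s with nothing after them, and bit_unstuff above would reject that packet, which is roughly how a strict receiver reacts.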
<wpwrak>
wolfspraul: for the reset IC, there's a -440 part (4.4 V) of the same chip. that would allow you to operate perfectly within specs.
<wolfspraul>
maybe Adam should order that right on Monday morning?
<wolfspraul>
wpwrak: [trigger for replacing u17] don't know. when the test fails? :-)
<wolfspraul>
my feeling from looking at testing results is that (if it's 1 problem only), the root cause is not a simple flash write somewhere
<wolfspraul>
that wouldn't explain why some boards cannot be reflashed anymore, sometimes for a day or several days, sometimes forever
<wolfspraul>
if it's just 1 problem, it must be some out-of-spec electrical shock/impact on the chip that may sometimes result in corrupt data, sometimes in different kinds of damage
<wpwrak>
(part) U24, instead of A4809E3R-263DN, use A4809E3R-440DN, 4.312-4.488 V
<wpwrak>
yes, there seems to be too much of a connection to power cycling for it to be just some weird writes
<wolfspraul>
also the boards that end up in "stopped at 'Bitstream length: 1484404' while reflashing" state
<wpwrak>
block locking may still make the problem "go away", but i wouldn't rely on it
<wpwrak>
that sounds like USB
<wolfspraul>
nah, too related to prior reconfig or d2/d3 dimly lit problems
<wolfspraul>
and why does it not go away? and the same board could be flashed before?
<wpwrak>
my guess would be that, if you switch to full-speed, these "stopped at bitstream" things will vanish
<wolfspraul>
ok, definitely one of the first tests to do
<wpwrak>
i suspect high-speed USB signal integrity issues. you probably have tons of CRC errors you never see. and every once in a while, one slips through and spoils your day.
<wolfspraul>
even if that is so, it doesn't explain why boards that render eventually experience flash problems and then eventually end up unflashable
<wpwrak>
i think it's unrelated
<wolfspraul>
correct
<wolfspraul>
but that's why I think the high-speed CRC problem, if it exists, is already contained now
<wolfspraul>
because once the flash is written, and the crc checks of the test software pass, we are behind this potential failure case
<wolfspraul>
that doesn't explain why at some later point this same board suddenly and persistently cannot be flashed at all anymore
<wpwrak>
the design of the reset circuit does not seem to offer protection when powering down. not by design and, if the voltage rail traces are still representative, also not by accident. so if the underlying reason for using the reset circuit in the first place is correct, then that might be the problem.
<wpwrak>
of course, if the reset circuit is actually completely unnecessary, then it's not ;-)
<wpwrak>
(crc contained) maybe. depends a bit on how it's implemented. do you remember how fdformat works ? (from a user's perspective)
<wolfspraul>
ah no, that sounds 80's, forgot
<wpwrak>
lekernel: there's a lot you can get away with on USB a lot of times. and people are probably just happy they don't have to give evil FTDI their money ;-)
<wpwrak>
wolfspraul: maybe qi-hw's first ASIC should be a completely open USB-to-serial converter ;-)
<wolfspraul>
wpwrak: the reset ic in 4.4v variant would offer protection when powering down as well?
<lekernel>
do we really want to spend time on something as mundane, overengineered and pesky as USB? :-)
<wpwrak>
(fdformat) hey, that must have been '92 ! :) well, what it does is that it formats tracks 1-N, then seeks back to track 1 and verifies tracks 1-N. unlike the approach the DOS tools use, which format and verify track 1, then format and verify track 2, etc.
<wolfspraul>
basically I am trying to think whether there are other alternatives and whether we should order more parts, to speed things up
<wpwrak>
(fdformat) can you guess why it does this ? hint: i wrote the whole formatting stuff and my floppy drive was a little bit defective :)
<wolfspraul>
because whatever we do, Adam will have to do some testing of this and that variant. and if parts are missing we will quickly have another 'couple days' waiting time in between...
<wpwrak>
lekernel: i think no matter how stubborn we are, we can't defeat USB ;-)
<wpwrak>
(a809) yeah, dunno how long it takes to get that one. i think they're in taiwan. it's one of those never-to-be-seen-at-digi-key parts :)
<wolfspraul>
ok, attack plan seems to be
<wolfspraul>
1. order 4.4v variant of reset ic
<wpwrak>
(fdformat) the problem was that the stepper motor sometimes didn't step. so i could get logical tracks 1-2-3-5-6-7-... on successive physical tracks. a per-track verification would have succeeded. the whole disk verification would spot the problem
<wolfspraul>
2. for a board in 'unflashable' state, try to reseat jtag board, try to force USB to full-speed, try to enable urjtag debug messages, try Xilinx Impact
<wolfspraul>
3. for a board in 'cannot reconfigure' state, run the test software for CRC checking
<wolfspraul>
4. for a board that is in 'available' state right now, try to do 100 thirty second render cycles to see whether the 'cannot reconfigure' problem can be enforced
<wpwrak>
(fdformat) lesson learned: if location is unreliable, separate write and verification phases. (another lesson, implicit in the floppy structure, would be to have location information embedded in the data. alas, that would be difficult in this case. but then, we have a lot of entropy, so i wouldn't be worried about that)
<wolfspraul>
5. if we feel better about reproducing the 'cannot reconfigure' problem, compare the different ways to power cycle - unplug DC, unplug mains, three-button reset
<wolfspraul>
6. once we have the 4.4v reset ic, rework a board and see whether we can reproduce the 'cannot reconfigure' problem still, on that board
<wolfspraul>
7. make some power-down scope measurements to collect more data points?
<wolfspraul>
a lot depends on us being able to reproduce the 'cannot reconfigure' state better
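[editor's sketch] How many render/power cycles are needed to reproduce an intermittent failure can be estimated with basic probability. This is a rough sketch that assumes each cycle fails independently with the same probability p, which may well not hold for these boards; the function names and the value of p are hypothetical, not measured.

```python
import math


def p_at_least_one(p, n):
    """Probability of seeing at least one failure in n independent cycles."""
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("need 0 <= p <= 1 and n >= 0")
    return 1.0 - (1.0 - p) ** n


def cycles_needed(p, confidence):
    """Smallest number of cycles that shows a failure with the given confidence."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("need 0 < p < 1 and 0 < confidence < 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 0.01  # hypothetical per-cycle failure probability
    print("chance to see it in 100 cycles:", round(p_at_least_one(p, 100), 3))
    print("cycles for 95% confidence:", cycles_needed(p, 0.95))
```

Under this model, 100 cycles without a failure only rules out fairly frequent problems; a rare failure needs many more cycles before its absence says much.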
<wolfspraul>
wpwrak: [fdformat] but we do have that already. the crc checks are separate, because the test software checks later, completely independent of the flashing operation
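[editor's sketch] The fdformat lesson can be shown with a toy model of a drive whose stepper sometimes fails to step. This is an illustrative sketch only: format_disk, verify_whole_disk and the missed_steps parameter are hypothetical names, the model assumes the head positions reliably during the separate verify pass, and it does not try to reproduce the exact track sequence wpwrak saw.

```python
def format_disk(n_tracks, missed_steps):
    """Format tracks 0..n_tracks-1 on a drive that may fail to step.

    missed_steps is the set of logical tracks for which the step command
    silently did nothing. Returns (physical, per_track_ok), where
    physical[p] is the logical track id stored at physical position p.
    """
    physical = [None] * n_tracks
    head = 0
    per_track_ok = []
    for logical in range(n_tracks):
        if logical > 0 and logical not in missed_steps:
            head += 1
        physical[head] = logical
        # Per-track verify: read back immediately at the same head position.
        per_track_ok.append(physical[head] == logical)
    return physical, per_track_ok


def verify_whole_disk(physical):
    """Separate verify pass from track 0; returns the mismatching positions."""
    return [p for p, track_id in enumerate(physical) if track_id != p]


if __name__ == "__main__":
    physical, per_track_ok = format_disk(8, missed_steps={4})
    print("per-track verify passed:", all(per_track_ok))
    print("whole-disk verify mismatches:", verify_whole_disk(physical))
```

In this model the immediate per-track check always passes, because it reads back at whatever position the head actually reached; only the separate pass notices that the data ended up in the wrong place.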
<wolfspraul>
does my attack plan #1 - #7 sound about right?
<wolfspraul>
I will dwell over the testing results a bit more...
<wpwrak>
hmm, for 2., i'd say to estimate the current rate of flash failures / CRC errors with the long cable. then switch to full-speed permanently and see if the error rate drops to a low-probability percentile.
<wolfspraul>
you mean full-speed with long cable?
<wolfspraul>
I would like to get my head off of usb asap, because like I said the crc checks are already independent, so there is no way _ANY_ jtag flashing issue can still be around that much later
<wolfspraul>
so even if the jtag usb is unreliable like hell, once we managed to write nor properly, it's there because it will be independently verified by the test software later, in a totally different code path
<wolfspraul>
that's my understanding at least, I do not see how a USB issue can matter then, even if one exists
<wolfspraul>
the test software is loaded via serial, it checks the crc of the data on nor. if that is ok, any potential usb/jtag flashing bug is behind us.
<wolfspraul>
8. implement locking of standby+rescue partitions
<wolfspraul>
wpwrak: do you understand/agree that flashing and checking are already completely separate?
<wolfspraul>
maybe I misunderstand our process...
<wpwrak>
back from phone
<wpwrak>
(usb issue) whether it's truly solved or not depends on how the protocol is designed. so i'd rather eliminate the root cause, just to be sure.
<wpwrak>
but yes, if you do a verification with the test sw afterwards, the NOR is good
<wolfspraul>
I'm looking for the root cause of the 'cannot reconfigure' problem, not the root cause of any jtag/usb flashing issue
<wolfspraul>
because the latter one isn't a sales showstopper, but the former one is
<wpwrak>
(locking) i would also lock the regular bitstream. maybe also APP, in case it's mostly read-only. basic rule: lock everything you're not going to write to often.
<wolfspraul>
those sound like software improvements
<wolfspraul>
the highest priority is to make a decision whether we have boards (any of the 90) that we believe are electrically good
<wpwrak>
what makes me uncomfortable about USB is that it also complicates analysis. so each analysis step needs to include retries to make sure any supposed NOR errors found are not from USB
<wolfspraul>
yes but it's easy to run the test sw
<wpwrak>
(locking) correct. you shouldn't _need_ the locking. it's another safety belt. while chasing the NOR corruption, i wouldn't lock at all. i.e., leave things as they were
<wpwrak>
if you have a single-bit error, that will work
<wpwrak>
if you have multibit errors, it's more work. then you need to implement the crc also on the pc, to verify that the (failing) CRC is the same on both sides
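[editor's sketch] Comparing CRCs on the PC side, with retries to separate NOR corruption from transport errors, could look roughly like this. It is a sketch under assumptions: it uses CRC-32 from Python's zlib, which may not be the CRC the test software uses, and classify and its return strings are hypothetical. The readbacks would come from repeated dumps of the same NOR region, obtained by whatever means is available.

```python
import zlib


def crc32_of(data):
    """CRC-32 of a bytes object as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def classify(expected_image, readbacks):
    """Compare several readbacks of one region against the expected image."""
    if not readbacks:
        raise ValueError("need at least one readback")
    expected = crc32_of(expected_image)
    crcs = [crc32_of(r) for r in readbacks]
    if all(c == expected for c in crcs):
        return "good"
    if len(set(crcs)) == 1:
        # Every retry gives the same wrong CRC: the stored data is likely bad.
        return "stable corruption"
    # CRCs differ between retries: suspect the transport rather than the NOR.
    return "unstable readback"


if __name__ == "__main__":
    image = bytes(range(256))
    damaged = bytes([image[0] ^ 0x01]) + image[1:]
    print(classify(image, [image, image]))
    print(classify(image, [damaged, damaged]))
    print(classify(image, [image, damaged]))
```

The point of the retries is the one raised about USB: a single failing CRC does not say whether the NOR content or the path used to read it is at fault.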
<wolfspraul>
I don't think (guess) we are dealing with any 'proper' nor write
<wolfspraul>
I think it's a maltreatment of some wires into the chip that may also express itself in the form of a bad bit
<wolfspraul>
that's just my uninformed guess of course
<wpwrak>
(proper nor write) locking may also protect against other write actions
<wolfspraul>
maybe I'm trying to find a root cause to fix all flash issues at once :-)
<wpwrak>
for now, i would assume that the software is perfect and thus doesn't need NOR locking to survive :)
<wolfspraul>
how does my attack plan above sound? right direction overall?
<wolfspraul>
yes definitely
<wolfspraul>
once we are on the software level it's a different thing already
<wolfspraul>
I think we have some issue below software however
<wolfspraul>
this is not a 'clean' nor write going astray sometimes
<wolfspraul>
the data doesn't add up to that theory
<wpwrak>
1. sounds good. 2., i would simplify to "force full-speed" (and report if the stopped bitstream ever appears again)
<wpwrak>
3. i would run the test sw always, independent of "cannot reconfigure". if NOR corruption happens at random locations, you'll encounter the problem 20x as often.
<wpwrak>
4. does "render cycle" include power-cycling ?
<wpwrak>
if things go as expected, 7. may be unnecessary :)
<wpwrak>
regarding 3, i would switch from "deal with NOR corruption when you happen to observe it" to "specifically look for it"
<wpwrak>
also, NOR corruption could also cause other upsets than just failure to reconfigure. e.g., do BIOS/RTEMS/flickernoise check their consistency when booting ?
<wolfspraul>
I doubt it
<wpwrak>
the BIOS is quite small, so it shouldn't be hit very often. FN is more than twice the size of the bitstream. so assuming uniform distribution, for each hit reconfiguration takes, you may have two hits on FN.
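[editor's sketch] The "two hits on FN per hit on the bitstream" estimate follows from assuming corruption lands uniformly over the flash, so expected hits scale with region size. In this sketch the bitstream size is taken from the 'Bitstream length: 1484404' message quoted earlier (assumed to be bytes); the flickernoise and BIOS sizes are placeholders, not measured values.

```python
def relative_hits(sizes, reference):
    """Expected hits per region for each hit on the reference region,
    assuming corruption locations are uniformly distributed."""
    ref = sizes[reference]
    if ref <= 0:
        raise ValueError("reference region must have a positive size")
    return {name: size / ref for name, size in sizes.items()}


if __name__ == "__main__":
    sizes = {
        "bitstream": 1484404,        # from the flashing log, assumed bytes
        "flickernoise": 2 * 1484404, # placeholder: "more than twice" the bitstream
        "bios": 100000,              # placeholder: "quite small"
    }
    for name, ratio in relative_hits(sizes, "bitstream").items():
        print(f"{name}: {ratio:.2f} expected hits per bitstream hit")
```

If observed failures do not follow these proportions, that would argue against uniformly distributed corruption.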
<wolfspraul>
maybe from now on, when Adam tests the 10 render cycles, he should run the test software in between
<wpwrak>
on the other hand, if there's checksumming to protect FN, so that APP NOR corruption would have a clear and distinct indication, then this would tell us something about the distribution of where NOR corruption happens. but let's worry about that later
<wpwrak>
yes, definitely run the test sw in between
<wpwrak>
NOR corruption hitting FN could also cause other failures. failures which will go away if adam then reworks a (perfectly good) chip and reflashes :)
<wpwrak>
he seems to reflash very often, so that may hide such things
<wpwrak>
so i would reflash only if there's a known corruption (or if the NOR content needs updating for some reason)
<wolfspraul>
good point [from now on reflash more carefully after a board was flashed knowingly good for the first time]
<strangeloop>
hello
<strangeloop>
anyone at camp who wants to chat (ie explain) a bit about milkymist to me? :D
<wpwrak>
is lekernel still there ? if yes, he'd be the man to catch :)
<wpwrak>
(lekernel = sebastien)
<strangeloop>
na he said he already left camp
<strangeloop>
of course can always discuss online, but this kind of stuff is more fun face to face :)
<wpwrak>
(left) pity. hmm, dunno if there's anyone else. roh (joachim) knows the M1 a bit, at least mechanically, but i don't know how familiar he is with its software
<wpwrak>
could be anything from not even having powered up the board to him having a rave with the wildest video effects in town every night ;-)
<strangeloop>
hehe i see  :)
<strangeloop>
i'll keep my eyes and ears open then
<strangeloop>
(and obviously have a few more technical questions here as soon as i manage to get my hands on a board :)
<wolfspraul>
strangeloop: wow, nice to hear from you and definitely, stop back here...
<wpwrak>
ah, and the reset chip rework (to 5V) would be as follows: unsolder old chip, bend pin 3 (the one on the side with only one pin) up, solder the two other pins, run a patch wire from pin 3 to 5V, put something isolating between pin 3 and the pad underneath. a bit hackish.
<kristianpaul>
"uCLinux USB driver." l :-)
<kristianpaul>
lekernel: as it seems you have a lot of experience with testing the opencores stuff, what about this one http://opencores.org/project,ethmac ?
<lekernel>
sure, that would be cool. get it to work, kristianpaul!
<lekernel>
I used it before; it's bloated
<lekernel>
but it works
<lekernel>
that's rare enough for something from opencores, so it's worth being mentioned
<lekernel>
it's about the size of LM32
<kristianpaul>
maybe it is bloated in order to be fully IEEE compliant?
<lekernel>
kristianpaul, if you dislike my choice of not wasting my time implementing all the useless/legacy features of ethernet into minimac, go ahead and fix it. i'm always waiting for your patches. and I think a better job can be done than what the opencores people did, even when sticking to the standard.
<kristianpaul>
i do _not_ dislike anything, i respect others' work, bloated or not :-)
<kristianpaul>
and what's the problem if i don't send patches? i'm not as good at coding as you or others here, is that a problem? please tell me--
<kristianpaul>
or i'd better hold my comments, which seem not to be welcome if a patch is not attached..
<lekernel>
kristianpaul, simply stating the obvious things that do not work or are not implemented in open source FPGA cores is not going to get many things done, so I'm simply gently prodding you in the right direction :-P