#milkymist on 2011-08-13 — irc logs at freenode.irclog.whitequark.org

07:25 <aw> lekernel, 0x71: I plugged a keyboard into the usb-A port and a mouse into the usb-B port. both are detected fine, but port-A shows 'USB: HC: Transfer start: RX timeout error'. so I swapped keyboard and mouse and the error didn't show up; then I swapped them back and it shows up again. so I let 0x71 boot into gui mode, and I can still use mouse and keyboard together, even with them swapped. what does this mean?

07:26 <aw> 0x71: http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/71-results

07:27 <aw> I will mark 0x71 as 'X' still though. ;-)

07:29 <wolfspraul> aw: are you trying with the silicone keyboard? try with another keyboard too (either a completely different non-silicone keyboard, or with a second silicone keyboard)

07:30 <wolfspraul> but definitely X for now

07:30 <aw> yes, silicone keyboard

07:31 <aw> okay...let me try another silicone keyboard.

07:33 <wolfspraul> and a different keyboard (non-silicone) too, if you have one that works

07:33 <wolfspraul> just to get some more data

07:34 <wolfspraul> but the board stays FAIL anyway, so maybe just a waste of time...

07:38 <aw> hmmm....same results after using another silicone keyboard. also, with the devices swapped, the gui 'login' still responded well to my typing. strange indeed.

07:38 <aw> well...marked 'x' still..next board to test. ;-)

07:38 <wolfspraul> yes, correct. mark 'x' and move on.

07:39 <wolfspraul> definitely not pass like this

07:48 <wpwrak> try a shorter cable ? (-:C

08:22 <aw> 0x6c interesting histories: 1. TP4 – 174 ohm, 2. current 0.53A normally 3. http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/6C-reflash-results 4. No VGA screen. 5. usb-B 6. d2/d3 always dimly lit after powered-on since finished first test program firstly 7. d2/d3 dimly lit after powered on since replaced u7/u19/u20 couple days later 8. reflashed successfully by BEN usb cable:

08:22 <aw> http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/6C-reflash-results-1 9. replaced new u17. 10. reflashed successfully by erase version: http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/6C-reflash-results-e

08:22 <aw> 0x6c now renders successfully. ;-) amazing, isn't it?

08:23 <wolfspraul> with this history, you cannot set it to 'available'

08:23 <aw> also no d2/d3 dimly lit although the power-cycle is not too much.

08:23 <wolfspraul> we need to understand the flash/boot/dimly lit issues first

08:24 <aw> sure, i just wanted to say it passed all tests. strange and amazing.

08:24 <wolfspraul> that's why we need to do more research before starting to sell

08:26 <aw> anyway...i just posted here any news i found/tested. ;-)

12:34 <wpwrak> so the USB transceivers are acting up, too ? (U16, U17)

12:36 <lekernel> the FPGA design has tons of (painful) bugs in the USB "UART", it could just be that part tolerances are tickling them

12:37 <wolfspraul> I don't think I'm worried about those bugs right now

12:37 <wolfspraul> they are clearly identified, and the boards singled out

12:37 <wolfspraul> wpwrak: you proposed a reset IC with different threshold voltage and supplied from 5V. any candidates?

12:38 <lekernel> by the way, it never ceases to amaze me how people seem happy with the opencores USB "UARTs", for example this one: http://opencores.org/project,usbhostslave

12:38 <wolfspraul> I realize to save time we should order some parts...

12:38 <lekernel> "Works like a champ for me" ... this thing has MORE bugs than my crappy design. I used it at the beginning and had to throw it away because it did not work at all.

12:39 <wpwrak> wolfspraul: (not worried) okay, but what is the trigger for replacing U17 ?

12:39 <lekernel> you could see that piece of crap obviously misbehaving on trivial corner cases of bit stuffing etc.

12:39 <lekernel> for example it doesn't bitstuff correctly the last bit of USB packets

12:39 <lekernel> every USB application note tells you to be careful about that
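
For readers unfamiliar with the corner case mentioned above, here is a minimal sketch of USB-style bit stuffing in Python. It is not taken from the Milkymist or opencores sources; the function names are hypothetical, and it models only the rule that a 0 is inserted after six consecutive 1s, including when those six 1s are the last bits of the data.

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1s (transmit side)."""
    out = []
    ones = 0
    for b in bits:
        out.append(b)
        if b == 1:
            ones += 1
            if ones == 6:
                # The stuffed 0 is emitted even if no data bit follows,
                # i.e. also at the very end of the packet.
                out.append(0)
                ones = 0
        else:
            ones = 0
    return out


def bit_unstuff(bits):
    """Remove stuffed 0s (receive side); raise if a stuffed bit is missing."""
    out = []
    ones = 0
    i = 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        if b == 1:
            ones += 1
            if ones == 6:
                i += 1  # the next bit must be the stuffed 0
                if i >= len(bits) or bits[i] != 0:
                    raise ValueError("bit stuffing violation")
                ones = 0
        else:
            ones = 0
        i += 1
    return out
```

A transmitter that skips the final stuffed bit produces a stream that `bit_unstuff` rejects, which is the kind of misbehaviour described in the lines above.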

12:40 <wpwrak> wolfspraul: for the reset IC, there's a -440 part (4.4 V) of the same chip. that would allow you to operate perfectly within specs.

12:41 <wolfspraul> maybe Adam should order that right on Monday morning?

12:41 <wolfspraul> wpwrak: [trigger for replacing u17] don't know. when the test fails? :-)

12:45 <wolfspraul> my feeling from looking at testing results is that (if it's 1 problem only), the root cause is not a simple flash write somewhere

12:46 <wolfspraul> that wouldn't explain why some boards cannot be reflashed anymore, sometimes for a day or several days, sometimes forever

12:46 <wolfspraul> if it's just 1 problem, it must be some out-of-spec electrical shock/impact on the chip that may sometimes result in corrupt data, sometimes in different kinds of damage

12:47 <wpwrak> (part) U24, instead of A4809E3R-263DN, use A4809E3R-440DN, 4.312-4.488 V

12:47 <wpwrak> yes, there seems to be too much of a connection to power cycling for it to be just some weird writes

12:48 <wolfspraul> also the boards that end up in "stopped at 'Bitstream length: 1484404' while reflashing" state

12:48 <wpwrak> block locking may still make the problem "go away", but i wouldn't rely on it

12:48 <wpwrak> that sounds like USB

12:48 <wolfspraul> nah, too related to prior reconfig or d2/d3 dimly lit problems

12:48 <wolfspraul> and why does it not go away? and the same board could be flashed before?

12:49 <wpwrak> my guess would be that, if you switch to full-speed, these "stopped at bitstream" things will vanish

12:50 <wolfspraul> ok, definitely one of the first tests to do

12:50 <wpwrak> i suspect high-speed USB signal integrity issues. you probably have tons of CRC errors you never see. and every once in a while, one slips through and spoils your day.

12:50 <wolfspraul> even if that is so, it doesn't explain why boards that render eventually experience flash problems and then eventually end up unflashable

12:51 <wpwrak> i think it's unrelated

12:51 <wolfspraul> correct

12:51 <wolfspraul> but that's why I think the high-speed CRC problem, if it exists, is already contained now

12:51 <wolfspraul> because once the flash is written, and the crc checks of the test software pass, we are behind this potential failure case

12:52 <wolfspraul> that doesn't explain why at some later point this same board suddenly and persistently cannot be flashed at all anymore

12:53 <wpwrak> the design of the reset circuit does not seem to offer protection when powering down. not by design and, if the voltage rail traces are still representative, also not by accident. so if the underlying reason for using the reset circuit in the first place is correct, then that might be the problem.

12:53 <wpwrak> of course, if the reset circuit is actually completely unnecessary, then it's not ;-)

12:55 <wpwrak> (crc contained) maybe. depends a bit on how it's implemented. do you remember how fdformat works ? (from a user's perspective)

12:56 <wolfspraul> ah no, that sounds 80's, forgot

12:57 <wpwrak> lekernel: there's a lot you can get away with on USB a lot of times. and people are probably just happy they don't have to give evil FTDI their money ;-)

12:58 <wpwrak> wolfspraul: maybe qi-hw's first ASIC should be a completely open USB-to-serial converter ;-)

12:58 <wolfspraul> wpwrak: the reset ic in 4.4v variant would offer protection when powering down as well?

12:58 <lekernel> do we really want to spend time on something as mundane, overengineered and pesky as USB? :-)

12:59 <wpwrak> (fdformat) hey, that must have been '92 ! :) well, what it does is that it formats tracks 1-N, then seeks back to track 1 and verifies tracks 1-N. unlike the approach the DOS tools use, which format and verify track 1, then format and verify track 2, etc.

12:59 <wolfspraul> basically I am trying to think whether there are other alternatives and whether we would order more parts, to speedup

13:00 <wpwrak> (fdformat) can you guess why it does this ? hint: i wrote the whole formatting stuff and my floppy drive was a little bit defective :)

13:00 <wolfspraul> because whatever we do, Adam will have to do some testing of this and that variant. and if parts are missing we will quickly have another 'couple days' waiting time in between...

13:00 <wpwrak> lekernel: i think no matter how stubborn we are, we can't defeat USB ;-)

13:03 <wpwrak> (a809) yeah, dunno how long it takes to get that one. i think they're in taiwan. it's one of those never-to-be-seen-at-digi-key parts :)

13:06 <wolfspraul> ok, attack plan seems to be

13:06 <wolfspraul> 1. order 4.4v variant of reset ic

13:06 <wpwrak> (fdformat) the problem was that the stepper motor sometimes didn't step. so i could get logical tracks 1-2-3-5-6-7-... on successive physical tracks. a per-track verification would have succeeded. the whole disk verification would spot the problem

13:07 <wolfspraul> 2. for a board in 'unflashable' state, try to reseat jtag board, try to force USB to full-speed, try to enable urjtag debug messages, try Xilinx Impact

13:07 <wolfspraul> 3. for a board in 'cannot reconfigure' state, run the test software for CRC checking

13:08 <wolfspraul> 4. for a board that is in 'available' state right now, try to do 100 thirty second render cycles to see whether the 'cannot reconfigure' problem can be enforced

13:09 <wpwrak> (fdformat) lesson learned: if location is unreliable, separate write and verification phases. (another lesson, implicit in the floppy structure, would be to have location information embedded in the data. alas, that would be difficult in this case. but then, we have a lot of entropy, so i wouldn't be worried about that)
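
The fdformat argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not the real fdformat code: `Drive`, `format_per_track` and `format_then_verify` are hypothetical names, tracks are numbered from 0, and the faulty stepper is modelled as a fixed set of step commands that silently do nothing.

```python
class Drive:
    """Toy floppy drive whose stepper silently misses some step commands."""

    def __init__(self, tracks, missed_steps):
        self.media = [None] * tracks      # physical track -> logical id stored there
        self.head = 0
        self.step_count = 0
        self.missed = set(missed_steps)   # step numbers (1-based) that fail

    def home(self):
        self.head = 0                     # assumes a reliable track-0 sensor

    def step(self):
        self.step_count += 1
        if self.step_count not in self.missed:
            self.head += 1

    def write(self, logical):
        self.media[self.head] = logical

    def read(self):
        return self.media[self.head]


def format_per_track(drive, tracks):
    """Format and verify each track before moving on. Returns True if all verifies pass."""
    drive.home()
    for t in range(tracks):
        if t > 0:
            drive.step()
        drive.write(t)
        if drive.read() != t:             # head has not moved, so this always matches
            return False
    return True


def format_then_verify(drive, tracks):
    """Format all tracks, seek back, then verify all. Returns the list of bad tracks."""
    drive.home()
    for t in range(tracks):
        if t > 0:
            drive.step()
        drive.write(t)
    drive.home()
    bad = []
    for t in range(tracks):
        if t > 0:
            drive.step()
        if drive.read() != t:
            bad.append(t)
    return bad
```

With one missed step, `format_per_track(Drive(6, {3}), 6)` reports success, while `format_then_verify(Drive(6, {3}), 6)` flags every track from the missed step onward.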

13:09 <wolfspraul> 5. if we feel better about reproducing the 'cannot reconfigure' problem, compare the different ways to power cycle - unplug DC, unplug mains, three-button reset

13:09 <wolfspraul> 6. once we have the 4.4v reset ic, rework a board and see whether we can reproduce the 'cannot reconfigure' problem still, on that board

13:10 <wolfspraul> 7. make some power-down scope measurements to collect more data points?

13:10 <wolfspraul> a lot depends on us being able to reproduce the 'cannot reconfigure' state better

13:11 <wolfspraul> wpwrak: [fdformat] but we do have that already. the crc checks are separate, because the test software checks later, completely independent of the flashing operation

13:11 <wolfspraul> does my attack plan #1 - #7 sound about right?

13:11 <wolfspraul> I will dwell over the testing results a bit more...

13:12 <wpwrak> hmm, for 2., i'd say to estimate the current rate of flash failures / CRC errors with the long cable. then switch to full-speed permanently and see if the error rate drops to a low-probability percentile.

13:12 <wolfspraul> you mean full-speed with long cable?

13:13 <wolfspraul> I would like to get my head off of usb asap, because like I said the crc checks are already independent, so there is no way _ANY_ jtag flashing issue can still be around that much later

13:13 <wolfspraul> so even if the jtag usb is unreliable like hell, once we managed to write nor properly, it's there because it will be independently verified by the test software later, in a totally different code path

13:14 <wolfspraul> that's my understanding at least, I do not see how a USB issue can matter then, even if one exists

13:15 <wolfspraul> the test software is loaded via serial, it checks the crc of the data on nor. if that is ok, any potential usb/jtag flashing bug is behind us.

13:17 <wolfspraul> 8. implement locking of standby+rescue partitions

13:23 <wolfspraul> wpwrak: do you understand/agree that flashing and checking are already completely separate?

13:23 <wolfspraul> maybe I misunderstand our process...

13:24 <wpwrak> back from phone

13:25 <wpwrak> (usb issue) whether it's truly solved or not depends on how the protocol is designed. so i'd rather eliminate the root cause, just to be sure.

13:26 <wpwrak> but yes, if you do a verification with the test sw afterwards, the NOR is good

13:26 <wolfspraul> I'm looking for the root cause of the 'cannot reconfigure' problem, not the root cause of any jtag/usb flashing issue

13:26 <wolfspraul> because the latter one isn't a sales showstopper, but the former one is

13:27 <wpwrak> (locking) i would also lock the regular bitstream. maybe also APP, in case it's mostly read-only. basic rule: lock everything you're not going to write to often.

13:28 <wolfspraul> those sound like software improvements

13:28 <wolfspraul> the highest priority is to make a decision whether we have boards (any of the 90) that we believe are electrically good

13:28 <wpwrak> what makes me uncomfortable about USB is that it also complicates analysis. so each analysis step needs to include retries to make sure any supposed NOR errors found are not from USB

13:29 <wolfspraul> yes but it's easy to run the test sw

13:29 <wpwrak> (locking) correct. you shouldn't _need_ the locking. it's another safety belt. while chasing the NOR corruption, i wouldn't lock at all. i.e., leave things as they were

13:30 <wpwrak> if you have a single-bit error, that will work

13:30 <wpwrak> if you have multibit errors, it's more work. then you need to implement the crc also on the pc, to verify that the (failing) CRC is the same on both sides
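
One way to read the "same failing CRC on both sides" idea is sketched below. This is an illustration only: `diagnose` is a hypothetical helper, and it assumes the board-side test software reports a CRC-32 compatible with Python's `zlib.crc32`, which the log does not confirm.

```python
import zlib


def diagnose(expected_image: bytes, board_crc: int, readback: bytes) -> str:
    """Compare the CRC reported by the board with one computed on the PC.

    expected_image: the image that was supposed to be flashed
    board_crc:      CRC the board computed over its NOR contents (assumed CRC-32)
    readback:       NOR contents as read back to the PC over USB/JTAG
    """
    expected_crc = zlib.crc32(expected_image)
    pc_crc = zlib.crc32(readback)

    if board_crc == expected_crc and pc_crc == expected_crc:
        return "nor-ok"
    if board_crc != expected_crc and pc_crc == board_crc:
        # Both sides independently see the same wrong data.
        return "nor-corrupt"
    # The two sides disagree, so the readback path itself is suspect.
    return "readback-suspect"
```

If the PC and the board agree on the same wrong CRC, the corruption is in the NOR; if they disagree, the transfer should be retried before blaming the flash.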

13:31 <wolfspraul> I don't think (guess) we are dealing with any 'proper' nor write

13:31 <wolfspraul> I think it's a maltreatment of some wires into the chip that may also express itself in the form of a bad bit

13:31 <wolfspraul> that's just my uninformed guess of course

13:31 <wpwrak> (proper nor write) locking may also protect against other write actions

13:31 <wolfspraul> maybe I'm trying to find a root cause to fix all flash issues at once :-)

13:32 <wpwrak> for now, i would assume that the software is perfect and thus doesn't need NOR locking to survive :)

13:32 <wolfspraul> how does my attack plan above sound? right direction overall?

13:32 <wolfspraul> yes definitely

13:32 <wolfspraul> once we are on the software level it's a different thing already

13:32 <wolfspraul> I think we have some issue below software however

13:33 <wolfspraul> this is not a 'clean' nor write going astray sometimes

13:33 <wolfspraul> the data doesn't add up to that theory

13:33 <wpwrak> 1. sounds good. 2., i would simplify to "force full-speed" (and report if the stopped bitstream ever appears again)

13:34 <wpwrak> 3. i would run the test sw always, independent of "cannot reconfigure". if NOR corruption happens at random locations, you'll encounter the problem 20x as often.

13:34 <wpwrak> 4. does "render cycle" include power-cycling ?

13:35 <wpwrak> if things go as expected, 7. may be unnecessary :)

13:37 <wpwrak> regarding 3, i would switch from "deal with NOR corruption when you happen to observe it" to "specifically look for it"

13:38 <wpwrak> also, NOR corruption could also cause other upsets than just failure to reconfigure. e.g., do BIOS/RTEMS/flickernoise check their consistency when booting ?

13:41 <wolfspraul> I doubt it

13:41 <wpwrak> the BIOS is quite small, so it shouldn't be hit very often. FN is more than twice the size of the bitstream. so assuming uniform distribution, for each hit reconfiguration takes, you may have two hits on FN.
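
The proportionality argument can be written out as a few lines of Python. The bitstream length is the figure quoted earlier in this log; the flickernoise size is an assumption (exactly twice the bitstream, whereas the log only says "more than twice"), and `expected_hits` is a hypothetical helper.

```python
BITSTREAM_LEN = 1484404  # "Bitstream length" as quoted in this log

# Region sizes; only the ratio between them matters here.
regions = {
    "bitstream": BITSTREAM_LEN,
    "flickernoise": 2 * BITSTREAM_LEN,  # assumption, not a measured size
}


def expected_hits(regions, observed_region, observed_hits):
    """Scale hits observed in one region to the others, assuming corruption
    lands uniformly at random across the flash."""
    base = regions[observed_region]
    return {name: observed_hits * size / base for name, size in regions.items()}


print(expected_hits(regions, "bitstream", 1))
# {'bitstream': 1.0, 'flickernoise': 2.0}
```

So under the uniform assumption, every corruption that breaks reconfiguration would be accompanied by about two that land in FN and go unnoticed unless FN is checked.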

13:41 <wolfspraul> maybe from now on, when Adam tests the 10 render cycles, he should run the test software in between

13:42 <wpwrak> on the other hand, if there's checksumming to protect FN, so that APP NOR corruption would have a clear and distinct indication, then this would tell us something about the distribution of where NOR corruption happens. but let's worry about that later

13:42 <wpwrak> yes, definitely run the test sw in between

13:43 <wpwrak> NOR corruption hitting FN could also cause other failures. failures which will go away if adam then reworks a (perfectly good) chip and reflashes :)

13:43 <wpwrak> he seems to reflash very often, so that may hide such things

13:44 <wpwrak> so i would reflash only if there's a known corruption (or if the NOR content needs updating for some reason)

13:46 <wolfspraul> good point [from now on reflash more carefully after a board was flashed knowingly good for the first time]

13:49 <strangeloop> hello

13:50 <strangeloop> anyone at camp who wants to chat (ie explain) a bit about milkymist to me? :D

13:51 <wpwrak> is lekernel still there ? if yes, he'd be the man to catch :)

13:51 <wpwrak> (lekernel = sebastien)

13:54 <strangeloop> na he said he already left camp

13:54 <strangeloop> of course can always discuss online, but this kind of stuff is more fun face to face :)

13:56 <wpwrak> (left) pity. hmm, dunno if there's anyone else. roh (joachim) knows the M1 a bit, at least mechanically, but i don't know how familiar he is with its software

13:57 <wpwrak> could be anything from not even having powered up the board to him having a rave with the wildest video effects in town every night ;-)

13:57 <strangeloop> hehe i see :)

13:57 <strangeloop> i'll keep my eyes and ears open then

13:58 <strangeloop> (and obviously have a few more technical questions here as soon as i manage to get my hands on a board :)

14:00 <wolfspraul> strangeloop: wow, nice to hear from you and definitely, stop back here...

14:30 <wpwrak> ah, and the reset chip rework (to 5V) would be as follows: unsolder old chip, bend pin 3 (the one on the side with only one pin) up, solder the two other pins, run a patch wire from pin 3 to 5V, put something isolating between pin 3 and the pad underneath. a bit hackish.

15:37 <kristianpaul> "uCLinux USB driver." l :-)

15:39 <kristianpaul> lekernel: as it seems you have a lot of experience with testing the opencores stuff, what about this one http://opencores.org/project,ethmac ?

15:39 <lekernel> sure, that would be cool. get it to work, kristianpaul!

15:39 <lekernel> I used it before; it's bloated

15:39 <lekernel> but it works

15:39 <lekernel> that's rare enough for something from opencores, so it's worth being mentioned

15:39 <lekernel> it's about the size of LM32

15:40 <kristianpaul> maybe it's bloated to be fully IEEE compliant?

16:23 <lekernel> kristianpaul, if you dislike my choice of not wasting my time implementing all the useless/legacy features of ethernet into minimac, go ahead and fix it. i'm always waiting for your patches. and I think a better job can be done than what the opencores people did, even when sticking to the standard.

17:13 <kristianpaul> i do _not_ dislike anything, i respect others' work, bloated or not :-)

17:15 <kristianpaul> and what's the problem if i dont send patches? i'm not as good as you or others here coding, is that a problem? please tell me--

17:16 <kristianpaul> or i'd better hold my comments, which it seems are not welcome if a patch is not attached..

17:49 <lekernel> kristianpaul, simply stating the obvious things that do not work or are not implemented in open source FPGA cores is not going to get many things done, so I'm simply gently prodding you in the right direction :-P