#milkymist on 2011-08-18 — irc logs at freenode.irclog.whitequark.org

00:58 <wpwrak> and now, live from the arena, the eternal struggle of good versus evil ! today, the champion of the powers of good will be ... adam ! the forces of evil are represented by board 0x32. will the good prevail ? will one more M1rc3 perish in darkness ? watch it live, only on #milkymist !

01:02 <roh> *lol*

01:02 <roh> :)

01:02 <roh> how's the score atm?

01:09 <wpwrak> roh: so far, adam has just entered the area but hasn't uttered a word yet. perhaps he's meditating, gathering his spiritual forces to defeat the foe :)

01:11 <wpwrak> s/area/arena/

01:17 <roh> heh

01:17 <roh> i am sure he'll get through that pile of stuff to fix. will just take some time.

01:26 <wpwrak> yeah. let's hope there aren't too many of the truly weird problems left in that pile

01:48 <wpwrak> hmm, seems that tonight will be quiet

02:00 <kristianpaul> and still you thinking rework 90 pcbs again?

02:06 <wolfspraul> alright, I saw some confusion in questions from Sebastien and Kristian Paul

02:06 <wolfspraul> it's a little difficult to keep overview in the flood of details

02:07 <kristianpaul> well, i'm asking with no awareness of last day of backlog

02:07 <wolfspraul> 1. we cannot say there is a problem with 'bad nor chips'

02:07 <wolfspraul> all is fine with the nor chips

02:07 <wolfspraul> 2. urjtag may have bugs, but right now it seems Xilinx Impact is only as good as urjtag, so we will continue to use urjtag and the jtag-serial board

02:10 <wolfspraul> 3. of course we will apply fix2b, or any additional fix we may find to be needed - to all boards that are being sold

02:10 <wolfspraul> all boards that are being sold are sold in the same condition, and that condition is 100% pass and 100% bug-free

02:10 <wolfspraul> quite simple :-)

02:10 <wolfspraul> 4. yesterday was a bit messy, or rather slow, because it was not really a production day, but more a fix2b design verification day

02:10 <wolfspraul> normally we would not spend so much time on one board that shows a NOR problem (_for whatever reason_), but we did because we had to verify fix2b and cannot risk jumping over too many unknowns

02:10 <wolfspraul> after yesterday, I feel pretty good about fix2b

02:10 <wolfspraul> Adam will now approach the other 15/16 fix2b candidates in a more production style, that is - fast

02:10 <wolfspraul> if something doesn't work - note what went wrong, then next board

02:11 <wolfspraul> 5. I am very happy that now it seems we can remove the long wire

02:11 <wolfspraul> the long wire is an invitation for trouble

02:11 <kristianpaul> aka FM antenna :)

02:11 <wolfspraul> the one thing special about rc3 is that we mix design verification and production run

02:11 <wolfspraul> almost since day smt+1 if you remember

02:12 <wolfspraul> we started fiddling with the reset circuit right at the beginning because boards wouldn't boot

02:12 <wolfspraul> remember?

02:12 <kristianpaul> sure, that was messy but still well managed which is good !

02:12 <wolfspraul> so that's causing a big problem

02:12 <kristianpaul> yes i have good memory

02:12 <wolfspraul> because design verification and production are so different

02:12 <wolfspraul> production is about speed, economics

02:12 <wolfspraul> every board goes through a predefined and efficient process

02:13 <wolfspraul> and it doesn't matter why it fails, if it fails the failure point is recorded and we go to the next board

02:13 <wolfspraul> but if we are uncertain about the design (!), then we cannot do that

02:13 <wolfspraul> so...

02:13 <kristianpaul> keep going !:)

02:13 <kristianpaul> yes, i now understand your point

02:13 <wolfspraul> why did sales not start yet?

02:14 <wolfspraul> it's easy

02:14 <wolfspraul> it's _NOT_ because some boards have problems

02:14 <wolfspraul> in a run of 90 there will always be problems

02:15 <wolfspraul> it's because our test procedure, which ended in 10 thirty second render cycles, showed sudden failures (cannot reconfig) on boards that were perfectly rendering fine, but stopped doing so at the 2nd, 5th, 9th render cycle

02:15 <wolfspraul> that's bad!!!!

02:15 <wolfspraul> that's the bad thing

02:15 <wolfspraul> not the schmitt-triggers, usb transceivers, nor chips, urjtag bugs, long wires, etc.

02:15 <kristianpaul> sure, thats uncertain

02:15 <kristianpaul> at least was..

02:15 <wolfspraul> no it still is

02:15 <wolfspraul> so I have 40 boards (for example)

02:15 <wolfspraul> 30 pass

02:15 <wolfspraul> 10 fail, some at 2nd rendering, some at 5th, some at 9th

02:15 <wolfspraul> ok?

02:15 <wolfspraul> can we sell the 30 pass?

02:15 <wolfspraul> NO!

02:15 <wolfspraul> why?

02:16 <kristianpaul> not working

02:16 <wolfspraul> because if we had done 20 render cycles, or 30, then maybe only 20 boards would have 'passed'

02:16 <wolfspraul> or 15

02:16 <wolfspraul> maybe if we do 100 render cycles, none would pass

02:16 <wolfspraul> got it?

02:16 <kristianpaul> totally :)

02:16 <wolfspraul> I cannot start selling like this, not _ANY_ board
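
The render-cycle argument above can be put into rough numbers. The sketch below is only an illustration under a loudly stated assumption that is not in the log: every board has the same constant chance of failing in each render cycle, independent of earlier cycles. The function names are made up for this example; the 40/30/10 figures are wolfspraul's own hypothetical.

```python
def per_cycle_failure(failed, total, cycles):
    """Per-cycle failure probability implied by `failed` of `total`
    boards dropping out within `cycles` render cycles.
    ASSUMPTION: constant, independent failure chance per cycle."""
    survived = (total - failed) / total
    return 1.0 - survived ** (1.0 / cycles)


def expected_passes(total, p, cycles):
    """Expected number of boards still passing after `cycles` cycles."""
    return total * (1.0 - p) ** cycles


# wolfspraul's example: 40 boards, 10 of them fail within 10 cycles
p = per_cycle_failure(failed=10, total=40, cycles=10)
for n in (10, 20, 30, 100):
    print(n, "cycles:", round(expected_passes(40, p, n), 1), "boards expected to pass")
```

Under that assumption about 22 boards would survive 20 cycles, about 17 would survive 30, and only 2 or so would survive 100, which is the shape of the concern: a "pass" after 10 cycles says little while boards are still dropping out at random points.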

02:17 <wolfspraul> so what I want to see with fix2b now is this:

02:17 <wolfspraul> 1. Adam works on a lot of boards, one after another, fast

02:17 <wolfspraul> some pass, some fail

02:17 <wolfspraul> ok?

02:17 <wolfspraul> he will do 10 render cycles on each one

02:17 <wolfspraul> I have two conditions to start sales:

02:18 <wolfspraul> 1) at least 50% of boards must pass (otherwise something so big may be covered up somewhere that we are better off pausing for a day or two to study it)

02:19 <wolfspraul> 2) from the boards that fail, _NONE_ must fail at any point after running the test software (all peripherals). If they fail before that - fine. But once they boot for the first time after the test software, there must be no failures.

02:19 <wpwrak> yeah, no regressions
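
The two sales conditions could be checked mechanically along these lines. This is a hedged sketch, not the project's actual test tooling: the `BoardResult` record, its field names, and `can_start_sales` are hypothetical, and the sample board ids are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BoardResult:
    board_id: int
    passed_test_software: bool
    # Render cycle (1-based) at which the board failed AFTER passing the
    # test software, or None if it never failed afterwards.
    failed_at_cycle: Optional[int] = None

    @property
    def passed(self) -> bool:
        return self.passed_test_software and self.failed_at_cycle is None

    @property
    def regressed(self) -> bool:
        return self.passed_test_software and self.failed_at_cycle is not None


def can_start_sales(results: List[BoardResult], min_pass_ratio: float = 0.5) -> bool:
    """Condition 1: at least `min_pass_ratio` of the boards pass.
    Condition 2: no board fails after the test software has passed it."""
    if not results:
        return False
    pass_ratio = sum(r.passed for r in results) / len(results)
    no_regressions = not any(r.regressed for r in results)
    return pass_ratio >= min_pass_ratio and no_regressions


# Invented sample data: two passes and one early failure are acceptable...
ok_run = [BoardResult(1, True), BoardResult(2, True), BoardResult(3, False)]
# ...but a single board that stops after its 2nd render cycle blocks sales.
bad_run = ok_run + [BoardResult(4, True, failed_at_cycle=2)]
print(can_start_sales(ok_run), can_start_sales(bad_run))
```

A board that fails before the test software completes only lowers the pass ratio; a board that fails afterwards blocks sales outright, whatever the ratio.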

02:20 <wolfspraul> so I think we need to give Adam one or two full days, where he can just focus on speed and no chatting and no time consuming analysis.

02:20 <wolfspraul> unless something really worrying comes up with fix2b, but we can follow the wiki page as he updates it

02:20 <wpwrak> that's what we had with 0x3a. maybe one day we'll even know why ;-)

02:20 <wpwrak> (no chatting) hehe ;-)

02:20 <wolfspraul> wpwrak: 0x3A rendered fine, and then stopped?

02:21 <wolfspraul> kristianpaul: is this all clear now?

02:21 <kristianpaul> wolfspraul: yes

02:22 <kristianpaul> i lack patience thats all :)

02:22 <wolfspraul> he, ok. sorry. it's a flood of details I know.

02:22 <kristianpaul> you think in long term

02:22 <kristianpaul> which is GOOD

02:22 <wpwrak> 0x3a didn't get that far. but at one point in time correct NOR content could be read back. after that, either writing failed or readback has 100% reproducible corruption.

02:22 <kristianpaul> i havent even thought about that 50% pass

02:22 <wolfspraul> ok but that's way before 100% pass of the test software

02:22 <wolfspraul> I don't care about those cases

02:22 <wolfspraul> kristianpaul: that's just to protect us from a potentially still remaining design mistake

02:23 <wolfspraul> if Adam works on 20 boards, and 2 pass - something is wrong :-)

02:23 <kristianpaul> oh sure, as those popping with rc2 :)

02:23 <kristianpaul> s/with/from

02:23 <wolfspraul> wpwrak: it's good that we dug into the nor of 0x3A so much yesterday, but from a failure analysis standpoint, it may be just one of the 5-10 boards with 'various' problems here or there in the end

02:23 <wolfspraul> now we are sure it's not related to fix2b

02:23 <roh> re

02:24 <wolfspraul> kristianpaul: you can also see it this way:

02:24 <wolfspraul> the test software must catch 100% of failures

02:24 <wolfspraul> if the test software passes, after that the board must work

02:25 <wpwrak> wolfspraul: yeah, i feel good about fix2b. also in 0x3a, besides the actual corruption, the rest of the system behaviour makes sense

02:25 <wolfspraul> if boards fail after the test software has determined they are 100% ok, that's a big problem

02:25 <wpwrak> wolfspraul: this means that we can now do a little better than just "d2/d3 dimly lit" :)

02:25 <wolfspraul> that means our test software is bad (or a design mistake that the test software cannot detect), and it means we have to re-test the entire batch after that issue is cleared up

02:25 <wpwrak> yeah. these things would suck

02:26 <wolfspraul> wpwrak: yes, we learnt a lot with 0x3a

02:26 <wolfspraul> interesting board

02:26 <wolfspraul> but not the time to go deeper there now, and maybe never

02:26 <wolfspraul> maybe just replace the nor chip and it's all fine

02:27 <wolfspraul> manufacturing is about economics, not every weird case needs to be analyzed to the total root cause

02:27 <wolfspraul> if we have a strong design, and strong test software, we have a basis to run manufacturing economics on

02:27 <wolfspraul> then most decisions become economic decisions

02:27 <wolfspraul> but if we are uncertain about the design, or test software - BAD! :-)

02:27 <wolfspraul> then it gets messy

02:28 <wolfspraul> because then we cannot just make quick economic decisions

02:28 <wpwrak> if we get a cluster of NOR troubles like in 0x3a, it may make sense to write a pattern of 0x0000 or 0xffff and read it back. then probe the bus lines. that would show whether the problem is on reading or on writing.
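
The readback half of that pattern test can be sketched as follows. This only illustrates how all-zeros and all-ones readbacks expose stuck data bits; the function name and the sample words are hypothetical, a 16-bit data bus is assumed, and how the words are actually written to and read from the NOR on an M1 is not shown. On its own it cannot tell a read fault from a write fault, which is why the bus lines would still need probing.

```python
def stuck_bits(read_after_zeros, read_after_ones, width=16):
    """Return (stuck_high, stuck_low) bit masks.

    read_after_zeros: words read back after writing the all-zeros pattern
    read_after_ones:  words read back after writing the all-ones pattern
    ASSUMPTION: `width`-bit data words (16 here).
    """
    mask = (1 << width) - 1
    stuck_high = 0
    stuck_low = 0
    for word in read_after_zeros:
        stuck_high |= word & mask      # any 1 read where 0 was written
    for word in read_after_ones:
        stuck_low |= ~word & mask      # any 0 read where 1 was written
    return stuck_high, stuck_low


# Invented readback data: bit 2 reads high once, bit 8 reads low once
high, low = stuck_bits([0x0000, 0x0004], [0xFFFF, 0xFEFF])
print(f"stuck high: {high:#06x}, stuck low: {low:#06x}")
```

If the same bits show up on every run, that matches the "100% reproducible corruption" case; bits that change between runs point more towards noise.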

02:28 <wolfspraul> I think it's crazy that we first solder this long wire and diode to 90 boards, and a few weeks later determine that it was not needed :-)

02:28 <wolfspraul> that shows how we are mixing design and production work

02:28 <wolfspraul> but it's ok, we go full power forward on everything now

02:29 <wpwrak> yeah, that was the scenic route :)

02:29 <wolfspraul> if more design problems show up, well, sorry, we have to go through the entire batch again...

02:29 <roh> true. on the other hand.. better to find and fix those bugs before shipping.. not like some other vendors selling green bananas

02:29 <wolfspraul> no worries.

02:30 <roh> in the end there were not many 'design' errors right? mostly 'bad parts' as far as i understood

02:30 <wolfspraul> oh no

02:30 <wolfspraul> :-)

02:30 <roh> like the schmitt-triggers and now this diodes?

02:30 <wolfspraul> design 'error' maybe too much, but definitely design 'uncertainty'

02:30 <wolfspraul> and that is enough to disrupt the normal testing of the run

02:31 <wolfspraul> roh: like for example, it turned out that the way we produced the boards in SMT, none of them would have booted

02:31 <roh> huh? i thought all design changes were prototyped and 'tested'

02:31 <wolfspraul> :-)

02:31 <wolfspraul> would you call that a 'design error'? :-)

02:31 <roh> by reworking a rc2

02:31 <wolfspraul> so they went back to SMT for rework multiple times (!)

02:31 <wolfspraul> plus reworks on Adam's side

02:31 <roh> uh.. what was the root cause of that? bad smt params?

02:31 <wolfspraul> roh: no, we made mistakes there

02:31 <wolfspraul> no bad smt params

02:32 <wolfspraul> the smt shop did everything as told

02:32 <kristianpaul> up to how many reworks are acceptable? (which was my concern when asked first)

02:32 <wolfspraul> I'm telling you - design 'weaknesses'

02:32 <wolfspraul> so we find 'oops'. 'they all don't boot'

02:32 <wolfspraul> :-)

02:32 <wolfspraul> that's how it started

02:32 <wolfspraul> but it's ok, these things can happen in small runs

02:33 <wolfspraul> kristianpaul: theoretically it's infinite, you can rework many times - why not

02:33 <wolfspraul> but practically maybe not

02:33 <roh> wolfspraul: well.. design is schematic and pcb layout for me. when it comes to parts and the mechanical mounting.. thats 'craft' for me ;)

02:33 <wolfspraul> because every rework means new errors get introduced

02:33 <wolfspraul> roh: we definitely ran into design issues

02:33 <roh> wolfspraul: can you be more precise what assumptions were 'wrong' there?

02:33 <wolfspraul> that's why you hear about fix1, fix2, fix3, fix4, fix2b, etc.

02:33 <kristianpaul> yeah, that's increasing the mix of possible bugs

02:34 <wolfspraul> roh: phew, hard. the reset circuit is really subtle.

02:34 <wolfspraul> lots of details, lots of pins

02:34 <wolfspraul> I have no full electrical overview.

02:34 <wolfspraul> sebastien and werner do

02:34 <wolfspraul> it's actually supposed to be 'simple' (ahem)

02:34 <wolfspraul> but you know how it is

02:34 <roh> maybe we should collaboratively write a book after that... list all classes of 'errors' and 'stuff to go wrong' ... to help other people do it better or not do the same errors again ;)

02:34 <wolfspraul> a little thing wrong there and the board won't boot

02:35 <wolfspraul> so before the boards even left the SMT shop for the first time, Adam was already on the phone trying to tell them "ahh, can you please do the following reworks before sending out: a, b, c"

02:35 <wpwrak> i don't think anyone really has a full overview of the current reset circuit ;-) the very non-ideal diode(s) make it rather complicated.

02:36 <roh> wpwrak: what i dont get.. why does it need to be so complicated?

02:36 <roh> doesnt xilinx provide a proper, simple reset example?

02:36 <wolfspraul> yeah, we already have more 'improvements' for the reset circuit lined up (gates, second reset ic, etc)

02:37 <wpwrak> roh: now it's relatively simple again. but the problem is that the diodes have relatively large capacitance.

02:37 <wpwrak> roh: the issue is that we (think we) need to hold the NOR in reset while power ramps up

02:37 <wolfspraul> collectively we must have spent 3 months on the reset circuit now

02:37 <wolfspraul> :-)

02:38 <wpwrak> roh: (and we hope we don't really need to hold it in reset while power ramps down ... because rc3 doesn't do that :)

02:38 <wolfspraul> Adam went to Xilinx FAE etc. etc. crazy.

02:38 <wpwrak> heh ;-)

02:38 <kristianpaul> (FAE) oh, what they said i dont remember that..

02:38 <wolfspraul> don't even ask

02:38 <roh> wolfspraul: yes. and it will be a complete waste of time if xilinx does a new revision of the spartan ;)

02:38 <wolfspraul> I don't know

02:38 <wolfspraul> too many details

02:38 <kristianpaul> sure

02:39 <wolfspraul> well.

02:39 <wpwrak> wolfspraul: (2nd reset IC) actually, that was a bad idea. we need a gate to act as our "diode"

02:39 <wolfspraul> I think about the platform, Milkymist platform.

02:39 <wolfspraul> so once more boards are out, I think our effectiveness in those things will go up.

02:39 <wpwrak> roh: it may not be an FPGA-specific issue

02:39 <wolfspraul> because in the end it is a simple circuit

02:39 <kristianpaul> oh yes, now adam should learn to use the /ignore nick command :)

02:39 <wolfspraul> but we are struggling because we operate with so few people, so few boards

02:40 <wolfspraul> I just need to stabilize the bloody rc3 run, get reliable 100% pass testing results, and start to ship those monsters out :-)

02:42 <wolfspraul> roh: for sure the value of the m1 board is not in its reset circuit

02:42 <wolfspraul> :-)

02:42 <wolfspraul> if anything it's in the Milkymist SoC, Flickernoise, case :-)

02:42 <wolfspraul> so I'm not worried that this is a very spartan-6 specific little circuit, that's fully understood.

02:43 <wolfspraul> these are no 'investments', just cost and nastiness

02:44 <wolfspraul> kristianpaul: in general you want to avoid reworks completely

02:44 <wolfspraul> rework = heat

02:44 <wolfspraul> heat = bad

02:44 <wolfspraul> make the production process as determined as possible

02:44 <wolfspraul> deterministic

02:44 <wpwrak> crispy chips. yumm :)

02:45 <wolfspraul> the rework heat can cause a nearly infinite number of side-effects in chips, passive parts, the pcb, etc.

02:45 <wolfspraul> why is there all this fuss about precise reflow temperature curves?

02:45 <wpwrak> ... diodes ... :)

02:45 <wolfspraul> because it's so important, even 1 degree makes a difference

02:45 <kristianpaul> ah, thats why wpwrak like freeze boards :)

02:45 <wolfspraul> whether the top is at 246 celsius or 247 celsius...

02:45 <wolfspraul> so think about that

02:46 <wolfspraul> if that is important, how crude a rework is!

02:46 <wpwrak> kristianpaul: naw, that was actually something else :)

02:46 <wolfspraul> it's like a hammer on delicate china

02:46 <wpwrak> kristianpaul: what mystifies me in 0x3a is that we went from read noise to either write noise or stable read failure

02:47 <kristianpaul> good, you hammer the monster before leave the cave :)

02:47 <wolfspraul> so the best number of reworks is 0. but theoretically there is nothing wrong with reworks either.

02:47 <wolfspraul> sorry, doesn't get more precise than that...

02:47 <roh> wpwrak: well.. maybe its really a broken nor chip. or even only bad soldering for some reason. there are always flukes making sure you are puzzled about the rules

02:49 <wolfspraul> kristianpaul: in terms of life expectancy of a particular board, I don't think you can say in general that a board with 0 reworks has a longer life expectancy than one with 5 reworks.

02:49 <wolfspraul> the key is the test software here

02:49 <wolfspraul> if the board with 5 reworks passes the test software, and the test software (or process) is good, we can safely assume the life expectancy of that board to be the same as the one with 0 reworks (and also pass the entire test process)

02:50 <wolfspraul> that's because there may be a small lingering problem in the one with 0 reworks as well, I don't see how reworks in general increase the number of small lingering problems

02:50 <wolfspraul> I have no such data.

02:52 <roh> well.. heat can reduce the life expectancy of caps, semiconductors and other parts as well.. but i dont think that reworks have more influence than regular production or weird designs.

02:53 <roh> weird meaning e.g. mis-spec-ing a smps and killing the caps over time in the process

02:54 <roh> happens from time to time on mainboards (exploding caps are not always a 'bad caps' cause)

02:56 <wolfspraul> in terms of life expectancy, there's some interesting stuff in the solder process.

02:56 <wolfspraul> unfortunately more and more consumer electronics are designed for a 2 year or even less life span

02:56 <roh> that also. i bet that lead-free mania will bite us in the ass atleast once ;)

02:56 <wolfspraul> so the process gets optimized towards that

02:57 <wolfspraul> but there are dramatic differences in life expectancy (say for example temperature impact over time), so if you want to you can solder in a way that will be dozens of times more robust towards temperature cycles

02:57 <wolfspraul> but one by one

02:57 <roh> maybe we shouldnt design like that and use that for marketing ;)

02:57 <wolfspraul> we cannot work on all these details now

02:57 <wolfspraul> it's not the design, it's the soldering process

02:59 <wolfspraul> every time any part gets hot or cools down again, the different materials expand differently

02:59 <wolfspraul> if you want to manufacture for 20 or 30 year life expectancy, there's a lot of good stuff you can do

02:59 <wolfspraul> but increasingly the consumer electronics industry moves away from that

02:59 <wolfspraul> so that's only for aviation, cars, medicine, etc.

03:00 <wolfspraul> anyway we are not at that level yet

03:00 <wolfspraul> just trying to get m1 rc3 out as a good product, that boots and works at all :-)

03:01 <wolfspraul> but I looked at some data once of a comparison of different solder processes and techniques, and failure rates of temperature cycles (say up to 60 and back down to 30).

03:01 <wolfspraul> and the differences were huge

03:01 <wolfspraul> giant

03:02 <wolfspraul> so after 500 cycles, 1000 cycles, 2000 cycles. some may have failure rates of 30-40%, and other processes maybe only 1%

10:35 <wpwrak> so, how's the battle going ? cluster got smaller ?

10:37 <wolfspraul> I don't dare to ask :-)

10:37 <wolfspraul> sometimes gotta give people time...

10:40 <wpwrak> ah, i thought you had a little window monitoring adam's vital functions, heart rate, blood pressure, transpiration, ... :)

10:41 <wolfspraul> hmm

10:41 <wolfspraul> some things I read on the wiki do not look that great

10:42 <wolfspraul> search for fix2b

10:42 <wolfspraul> 0x32 is still not right somehow

10:43 <wolfspraul> 0x34 and 0x39 are ok

10:43 <wolfspraul> 0x3A has the nor problems we observed yesterday

10:43 <wolfspraul> now...

10:43 <wolfspraul> 0x3C, hmm

10:44 <wolfspraul> 0x40 is good, but then: 0x48 (!) what's that?

10:44 <wolfspraul> cannot configure after 2nd rendering!

10:44 <wpwrak> 0x34 and 0x39 were already good yesterday, no ?

10:44 <wolfspraul> yes

10:45 <wolfspraul> there is more stuff we have to find out about

10:45 <wolfspraul> 0x3C - strange (like 0x32?)

10:46 <wolfspraul> 0x48 is really bad

10:46 <wolfspraul> because that means we still have boards that fall back from rendering to unreconfigurable, even after fix2b

10:47 <wpwrak> yes

10:47 <wolfspraul> 0x54 is good, 0x55 could be something with the nor chip (=ignore)

10:47 <wolfspraul> 0x5C is good

10:47 <wolfspraul> that seems to be all fix2b results so far

10:47 <wolfspraul> we need to look into 3C and 48

10:47 <wolfspraul> unfortunately

10:47 <wolfspraul> bah

10:48 <lekernel> delays delays delays delays delays delays

10:48 <lekernel> :(

10:49 <wpwrak> 0x54 is good, no ?

10:49 <wolfspraul> the test results are very clear, that's good. I'm the eternal optimist.

10:49 <wolfspraul> yes, perfect

10:49 <wpwrak> ah, 0x55 was the NOR .. checking ...

10:49 <wolfspraul> I think we can ignore 0x55, not important in our quest for a stable design and reliable test process for 100% pass boards

10:50 <wolfspraul> I would look at 0x48 first, really dig in there. because that board regressed!

10:50 <wolfspraul> and then 0x3C maybe, if needed

10:50 <wpwrak> 0x55 seems bad, yes. maybe we have a NOR cluster now. but let's see when adam is through with fix2b

10:50 <wolfspraul> don't worry about problems with the nor chip per se

10:50 <wpwrak> rc2 used the same NOR chips as rc3 ?

10:50 <wolfspraul> yes

10:51 <wolfspraul> there are no problems with the nor chips

10:51 <wolfspraul> even if there are, they are easily replaced and done

10:51 <wolfspraul> we are not debugging nor chips

10:51 <wpwrak> i was thinking of the interaction with the FPGA. it's a fairly complex process. FPGA apparently needs to read the NOR's configuration data, etc.

10:52 <wolfspraul> nah. we have way too many working boards to suspect a design issue there.

10:52 <wpwrak> can be borderline parameters

10:52 <wolfspraul> if we knew our design and test process was 100% stable, we would replace the nor chip on 0x55 and most likely it would pass then.

10:53 <wolfspraul> I would not look at 0x55, waste of time imho.

10:53 <wolfspraul> 0x48 and 0x3C are interesting

10:53 <wolfspraul> (and maybe more later since Adam is not finished yet)

10:53 <wpwrak> 0x55 is scary, yes. i'd rather look at 0x3a :)

10:54 <wpwrak> 0x3a looks as if one could figure out what's going on. and it somehow almost worked in the past. so if the NOR problems have a common cause, that may provide some clues.

10:55 <wolfspraul> 0x48 is my favorite

10:55 <wolfspraul> crystal clear test path

10:55 <wolfspraul> everything picture perfect, but then

10:55 <wolfspraul> let me check 0x3A...

10:55 <wpwrak> but .. the next tests would be harder to make: write synthetic patterns, check them on the bus, read them back, etc. not the things adam usually does. well, when you send me my M1(s) maybe include 0x3a :)

10:55 <wolfspraul> ahh. 0x3A never booted before.

10:56 <wolfspraul> I'm not so interested in those (maybe a mistake).

10:56 <wpwrak> yes, it never booted. that's the fly in the ointment :)

10:56 <wolfspraul> I don't suspect a big problem with the design.

10:56 <wolfspraul> we made rc1, rc2, etc.

10:56 <wolfspraul> that's all fine

10:56 <wolfspraul> it must be something small, like we already fiddled with the reset circuit 3 times now.

10:56 <wpwrak> the reset circuit looks good on 0x3a

10:56 <wolfspraul> let me read 0x3A notes carefully

10:57 <wolfspraul> oh wait

10:57 <wolfspraul> 0x3A is the one from yesterday!

10:57 <wolfspraul> no - not go back to that :-)

10:57 <wpwrak> yes

10:57 <wpwrak> ;-)

10:57 <wolfspraul> just replace the nor chip (we have no spares right now so cannot try)

10:58 <wolfspraul> I'm 80% sure after replacing nor chip it works

10:58 <wolfspraul> not so interesting

10:58 <wolfspraul> how about 0x3C ?

10:58 <wolfspraul> that's exactly like 0x32, with the 'pulses' etc.

10:59 <wpwrak> mmh, i think replacing the NOR is too radical. you may just mask a real problem. i wouldn't replace the NOR of 0x3a before a) checking the data that goes in really gets corrupted before it comes out again and b) verifying the signal timing.

10:59 <wolfspraul> the difference is that 0x32 never booted before, but 0x3A did

10:59 <wolfspraul> sorry I meant 0x3C did

10:59 <wolfspraul> no really, no more time into 0x3A

10:59 <wolfspraul> it's not worth it

11:00 <wolfspraul> look at the difference between 0x3C and 0x32

11:00 <wpwrak> 0x3a didn't boot

11:00 <wolfspraul> they have pretty much the same state now

11:00 <wpwrak> ah, 0x3c :)

11:00 <wolfspraul> those crazy nor bit corruption searches take huge time and don't help us in the big picture with the run

11:00 <wpwrak> 0x32 has a long patient's history :)

11:00 <wolfspraul> we are not fixing every board here

11:00 <wolfspraul> we are only trying to come up with a stable design and reliable test process (!)

11:01 <wolfspraul> so that we can start sending boards out

11:01 <wpwrak> (nor a waste of time) dunno. i wouldn't be so quick to assume that the chips just go bad randomly.

11:01 <wolfspraul> of course I understand the _real_ bug may hide anywhere...

11:01 <wolfspraul> true, but we have lower hanging fruits

11:01 <wolfspraul> compare 0x3C and 0x32

11:01 <lekernel> by the way, have you tried assembling one complete unit already?

11:01 <wpwrak> yeah, looking at 0x32

11:01 <lekernel> with case, box, etc.

11:02 <wolfspraul> sure everybody has 1 unit, I think Adam too (his own)

11:02 <lekernel> rc3?

11:02 <lekernel> with the case and the box?

11:02 <wolfspraul> but not from 0x30 on and higher

11:02 <wolfspraul> no probably not

11:02 <wolfspraul> you worried it won't fit? :-)

11:02 <lekernel> yes. given that absolutely everything in this run has gone wrong in one way or another, there could be surprises there as well

11:03 <wolfspraul> nah it will fit. I'm not getting distracted on that now.

11:03 <wolfspraul> fix2b is a big step forward, looking at today's results

11:03 <wolfspraul> but not 100% yet, it seems

11:04 <wolfspraul> I don't want to trample over test results and ignore them etc.

11:04 <wolfspraul> not good

11:04 <wpwrak> wow. 0x32 is crazy.

11:04 <wolfspraul> well, read 0x3C now :-)

11:06 <wolfspraul> from the boards that we have fix2b results for so far, I would look at 0x48 first

11:07 <wolfspraul> tp36/37 is good, but it won't reconfigure currently (after rendering before)

11:09 <lekernel> wolfspraul, maybe you should ship problem boards around (including to a Xilinx FAE) so people can look at them in parallel?

11:09 <wpwrak> for 0x32 and 0x3c, the next thing to analyze would be to bring R60 back. if that doesn't help, try without D16 (without D16, the board is a likely NOR corruption candidate, though. so not for normal sale)

11:10 <wpwrak> lekernel: yes, wolfgang seems to have a few boards he's already given up on. i think he could let these out.

11:11 <wolfspraul> no, I think that will be the ultimate delay producer

11:11 <wolfspraul> the quality and consistency of our test results would go down

11:11 <wpwrak> worst case: something is found that fixes all these boards but is hard to apply in the field

11:11 <wolfspraul> I made that mistake with rc2, so no way I'm going to make it again :-)

11:11 <wpwrak> wolfspraul: rc2 went to people who didn't even turn it on :)

11:11 <wolfspraul> if we do that we will not sell any rc3, so I won't do it

11:12 <wpwrak> lekernel: i think he just doesn't want to spend money on fedex :)

11:12 <wolfspraul> wpwrak: for 32/3c, bring R60 back involves bringing the long wire back as well?

11:12 <wpwrak> wolfspraul: no no. just solder one resistor to an existing footprint

11:12 <wolfspraul> ah ok

11:13 <wpwrak> wolfspraul: r60 was removed as part of fix2, to lower the current on the reset chip a little

11:13 <lekernel> by doing that we had mwalle fix the video chip (Adam and I failed), as well as me fixing the intermittent video-in failure and audio output noise

11:13 <wpwrak> wolfspraul: but with fix2b, we're already nicer to the reset chip, so ...

11:14 <wolfspraul> lekernel: no it wouldn't work. it would be the end of rc3.

11:14 <wolfspraul> I will not do it.

11:14 <wolfspraul> we need to be able to look across multiple boards.

11:14 <wolfspraul> and if they are in different locations with different people the consistency will completely break down.

11:14 <wolfspraul> of course all sorts of random results will pop up, and the general quality will go up

11:14 <wolfspraul> then we can try in rc4 what the results are :-)

11:15 <wolfspraul> that was what rc2 was for and we didn't do that well in this system

11:15 <wolfspraul> not again

11:15 <wolfspraul> I am not writing off rc3, no need.

11:16 <wolfspraul> from the fix2b results so far, the only one that pops out is 0x48

11:16 <wolfspraul> that one is not right

11:16 <lekernel> the time sinks in rc3 are: protection circuit, counterfeit buffers, and now flash/reset circuit

11:16 <wpwrak> maybe fly sebastien to taipei ? make R&D division do double shifts ;-)

11:16 <wolfspraul> but it's only one board, so I suggest to wait until Adam has finished the entire fix2b plan

11:16 <lekernel> very little of that can be attributed to our shipping of rc2 boards around

11:17 <wolfspraul> ahh :-) you can ask Adam later, without politics for an honest answer. we have seen 'similar' problems like the one we are dealing with here on rc2.

11:17 <wpwrak> wolfspraul: so your plan is, if there aren't a lot more 0x48, just consider them outliers and go ahead ?

11:17 <wolfspraul> but because of the way we sent rc2 out, we lost focus and consistency to get to the root causes and eliminate them for rc3.

11:17 <wolfspraul> that's my analysis

11:17 <wolfspraul> hmm

11:17 <wolfspraul> we can make that judgment

11:17 <lekernel> we did eliminate the video in instability and the audio noise

11:18 <wolfspraul> 0x48 is tough though

11:18 <wolfspraul> lekernel: yes! :-) so those we don't have to worry about now :-)

11:18 <wolfspraul> wpwrak: would you be willing to ignore 0x48 and assume our design is stable and our test process is reliable?

11:18 <lekernel> also, a lot of successful improvements between rc1 and rc2 were done in a more 'distributed' way

11:18 <wpwrak> wolfspraul: the thing is that, if something needs deep analysis, the current process is very inefficient. so if you can exclude needing deep analysis, then you're right not to spread the work

11:19 <wolfspraul> sending boards anywhere now will delay rc3 sales by at least a month

11:19 <wolfspraul> just saying

11:19 <wpwrak> wolfspraul: (0x48) i think i'd want to know if the board responds to environmental parameters

11:19 <wolfspraul> I will simply refuse to sell CRAP.

11:20 <wpwrak> wolfspraul: (1 month) i don't think so. pick problems you don't expect to be able to analyze with the current process. then you can only win (well, minus the shipping cost)

11:20 <wolfspraul> so as long as it's crap, I keep improving on it, until it's not crap anymore :-)

11:20 <wolfspraul> nah there are no such problems

11:20 <wolfspraul> I am reading the test results, not speculating or ranting at whatever targets come to my mind.

11:20 <wolfspraul> the problem we have that stops rc3 sales is very isolated

11:21 <wolfspraul> and _almost_ eliminated

11:21 <wpwrak> again, the current process is inefficient for deep analysis. it is efficient, though, for things that need broad rework with non-trivial parts.

11:25 <wolfspraul> any result that would come in from anywhere non-taipei delays sales for over a month

11:25 <wolfspraul> I'm still more optimistic.

11:25 <wpwrak> another problem with the current process is that, if adam makes any systematic mistakes, they may go undiscovered. debugging his workflow is very time-consuming.

11:25 <wolfspraul> plus any board that goes anywhere quickly falls out of the logic with which we can right now still compare boards and group them

11:26 <wpwrak> yes, that's true

11:26 <wolfspraul> yes. so let's have 90 people produce 1 board each :-)

11:26 <wolfspraul> anyway I can say clearly that I have 100% trust in rc3 and our current approach

11:26 <wolfspraul> that design is good

11:26 <wpwrak> naw, you don't have so many people :) mwalle is on vacation, so lekernel, me, anyone else ?

11:26 <wolfspraul> :-)

11:27 <wolfspraul> here's the procedure: Adam first finishes the fix2b test plan.

11:27 <wolfspraul> I don't want to interrupt him now.

11:28 <wolfspraul> if 0x48 is super isolated, maybe all is good already

11:28 <wpwrak> finishing fix2b is good, agreed

11:30 <wpwrak> (trust in the process) there's a thing colloquially called "get-there-ism". it's the determination of following one's current path of action to achieve a certain objective, ignoring evidence that this may not be possible. it's a well-known phenomenon in the aviation industry. makes planes crash a few meters from the runway, with empty fuel tanks, because the pilot didn't want to divert.

11:31 <wpwrak> (get-there-ism) this is something to watch out for. it's easy to get caught up in it :)

11:31 <wolfspraul> these are the ones missing: 0x61 0x63 0x6B 0x6C 0x77 0x7A 0x7D 0x7F 0x85

11:32 <wpwrak> so .. fix2b completion ~saturday evening

11:32 <wpwrak> then re-clustering

11:33 <wpwrak> i dislike those where the TP36/TP37 voltage goes crazy after fix2b again

11:33 <wolfspraul> we had that yesterday already, no?

11:33 <wpwrak> with 0x3a ? no. that was rock solid.

11:34 <wolfspraul> 0x32

11:34 <wpwrak> (0x3a, the TP36/37 results)

11:34 <wpwrak> ah, we never got to having a good look at 0x32

11:38 <wolfspraul> man I just think through sending a board to Werner. so painful. the number of dead-end scenarios is staggering. which one? 0x3A?

11:38 <wolfspraul> on 0x3A, Werner would disappear into nor analysis land

11:39 <wolfspraul> he would never replace the chip because a) he doesn't have a spare b) he needs to study that board because that's the one he has

11:39 <wpwrak> he'd run a few more tests, that's for sure :)

11:39 <wolfspraul> 0x32 ? total guess. what if after 2 hours we find it's some plain and simple problem somewhere, without any relation to the fix2/fix2b issues?

11:39 <wolfspraul> see that's the problem

11:40 <wpwrak> when do the new NOR reach adam anyway ? i think he has ordered some, no ?

11:40 <wolfspraul> of course it would be great if more people would be where the boards are, in one place

11:40 <wolfspraul> because then we could parallelice

11:40 <wolfspraul> lize

11:40 <wolfspraul> but sending a board out? argh

11:40 <wpwrak> i'm still confused about 0x32 and 0x3c. TP36/TP37 is inconsistent with what we know so far.

11:40 <wolfspraul> that's the end, I really think it through

11:40 <wpwrak> err, TP36/Tp37 going wild

11:40 <wolfspraul> the impossible to answer question is already the first one: _WHICH_ board? :-)

11:41 <wolfspraul> best would be all 90

11:41 <wolfspraul> and then overnight magically back to Taipei for assembly and packig

11:41 <wolfspraul> packing

11:41 <wpwrak> (which) 0x3a would be a good start :)

11:41 <wolfspraul> after we have design and test process stable, we will have plenty of boards

11:41 <wpwrak> hehe :)

11:41 <wolfspraul> yeah I know you like that one

11:41 <wolfspraul> NOR study galore

11:42 <wpwrak> you have to start somewhere :)

11:42 <wolfspraul> but not on a one-off NOR chip problem

11:42 <wolfspraul> at least not while the rc3 run is in showstopper mode

11:42 <wpwrak> what i'd look for is whether the issue is on the NOR side or the FPGA side. i think we can determine this.

11:42 <wolfspraul> you will get plenty of boards

11:42 <wpwrak> then it becomes a question of bad chip or bad connectivity

11:43 <wolfspraul> but now we have to focus on consistent rc3 quality

11:43 <wpwrak> bad connectivity can be probed on the NOR. not so easily on the FPGA (well, you can flex the board a little, see if it gets worse)

11:43 <wolfspraul> otherwise no rc3 can be sold, and I will continue to work on getting them to be good

11:43 <wolfspraul> absolutely not exhausted yet, just warming up

11:44 <wolfspraul> ok 9 more to go

11:44 <wpwrak> bad connectivity could point to an SMT process issue. of course, you wouldn't like to have any such thing pop up :)

11:44 <wolfspraul> will be interesting

11:44 <wolfspraul> the pulse thing is nasty because we think we overlooked something in the reset circuit

11:44 <wpwrak> bad chips are easier. once we're sure they're just bad, swap them, new chip, new luck

11:45 <wolfspraul> and 0x48 is nasty because it falls back from rendering to unreconfigurable

11:45 <wpwrak> pulse thing would be which board ?

11:45 <wolfspraul> 0x32 and 0x3C

11:45 <wolfspraul> 'bad' chip may come from the process

11:45 <wolfspraul> I think we will see more with pulses

11:45 <wpwrak> 0x32/0x3c also show regressions. they're regressing to a pre-fix2b behaviour

11:46 <wpwrak> (bad chip) yes, could just be a bad SMT profile

11:46 <wolfspraul> ok, I meant regression as in pass the test software, but then fail afterwards

11:46 <wolfspraul> no - not SMT profile

11:46 <wpwrak> yup, 0x48 has a high-level regression

11:46 <wolfspraul> we are way past that. the design is mostly good, the process is mostly good.

11:46 <wolfspraul> there are no fundamental issues.

11:47 <wolfspraul> the schmitt-trigger was a fundamental issue - fixed.

11:47 <wolfspraul> what we have now are statistical and manufacturing issues.

11:47 <wpwrak> or maybe the through-hole pass didn't agree with all the components. no idea what this one actually is.

11:47 <wpwrak> schmitt-trigger was the fake part ?

11:47 <wolfspraul> but our inability to test for 100% good boards means we cannot sell anything!

11:47 <wolfspraul> some were irregular, yes

11:47 <wolfspraul> we replaced all 270, done

11:48 <wolfspraul> one thing is funny in this

11:48 <wolfspraul> I just recently realized we should add a few 'render cycles' after the test program.

11:48 <wolfspraul> don't know why, intuition

11:48 <wolfspraul> I felt uneasy that we never let the board do what it's supposed to do with our users.

11:48 <wolfspraul> which is to - RENDER

11:49 <wpwrak> well, worst case, you can decide to just sell them with the promise to replace all of them in case something major pops up. better than losing rc3 entirely.

11:49 <wpwrak> yeah, end user testing is missing, too :)

11:49 <wolfspraul> and now these render cycles are what give us the most unsettling feedback about our test process, even our design.

11:49 <wolfspraul> good catch Wolfgang!

11:49 <wolfspraul> I don't mind the unsettling feedback, I can handle that.

11:49 <wolfspraul> no way, you don't know how expensive support is

11:50 <wolfspraul> the render cycles are a godsend

11:50 <wolfspraul> very good

11:50 <wolfspraul> it lifts m1 to the next level

11:50 <wolfspraul> I already want to do 1h render testing on each board :-)

11:50 <wolfspraul> or 24h :-)

11:50 <wpwrak> ;-)

11:50 <wolfspraul> I'm sure if we would do that, we would find more issues.

11:50 <wpwrak> next you'll want the temperature chamber :)

11:50 <wolfspraul> isn't it funny. if we remove the render cycle test (which we did not have in rc2), we would already sell now :-)

11:50 <wolfspraul> keep that in mind when complaining

11:51 <wolfspraul> so I think we should not go berserk, no 24h test etc. but we should do a few render cycles, yes.

11:51 <wolfspraul> and we have to handle the fallout.

11:51 <wpwrak> i'm not complaining about the meticulous process. i'm merely suggesting that you could widen your bottleneck :)

11:52 <wolfspraul> fly to Taipei

11:52 <wolfspraul> that widens the bottleneck

11:52 <wolfspraul> you can be there Saturday, no? :-)

11:52 <wolfspraul> anyway just kidding. sometimes it just needs a bit of relaxation. I will think more about get-there-ism.

11:52 <wpwrak> of course, there's no guarantee. e.g., further analysis on a problem board could be inconclusive, the board may suffer additional failures on its journey, just the shipping may take too long for the results to be meaningful, etc.

11:53 <wolfspraul> oh sure, it wouldn't help with the rc3 sales showstopper problem at all.

11:53 <wolfspraul> it would be nice for rc4 though

11:53 <wpwrak> (fly to tpe) heh, i'd also want my lab :) some of adam's equipment is pretty marginal.

11:53 <wolfspraul> it would even harm the rc3 showstopper resolution because it takes valuable data from Adam (from a consistent overview)

11:54 <wolfspraul> so I'd rather pick 'safe' boards for this strange exercise, which defeats the purpose already. bottom line: it doesn't work.

11:54 <wolfspraul> I've been in too many runs and tried too many things.

11:54 <wpwrak> (overview) only if he ever plans to return to those boards for analysis.

11:54 <wolfspraul> most important is consistency (for what we are trying to improve now).

11:57 <wpwrak> my view is simply that, before you've isolated a problem, you don't know whether it's a one-off or something systemic. the current approach of having lots of boards is good for common problems. you get a lot of data, can do clustering, etc., and you can modify a lot of boards and gather many new results. very useful.

11:58 <wpwrak> however, when you run out of these big clusters, then you need to track down seemingly individual problems. and there, the mass analysis approach doesn't scale.

11:59 <wpwrak> so you really have distinct phases: first, get the lay of the land. second, examine the widespread issues and apply the corresponding mass cure. go back to step one and repeat until things settle.

11:59 <wolfspraul> fix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good

11:59 <wolfspraul> that's my grouping

11:59 <wolfspraul> very good results so far!

12:00 <wolfspraul> remember these are all boards that had problems before

12:00 <wolfspraul> if people don't see the progress, well, sorry, I cannot help

12:00 <wolfspraul> huge progress

12:00 <wolfspraul> fix2b is fantastic

12:00 <wpwrak> phase 2: hunt down the ones that are weird. then see if this yields anything to apply to the rest. e.g., new, targeted experiments.

12:00 <wolfspraul> a life saver

12:00 <wpwrak> yeah, fix2b is good

12:00 <wpwrak> but ... 0x3a and 0x3c worry me

12:00 <wolfspraul> those 2 pulse boards are strange

12:00 <wpwrak> err, 0x32 and 0x3c

12:00 <wolfspraul> they charge us

12:01 <wolfspraul> and 0x48 is an insult

12:01 <wolfspraul> :-)

12:01 <wolfspraul> but we get to it, really

12:01 <wpwrak> i don't like boards to regress to a pre-fix2b state. that's not supposed to happen :)

12:01 <wolfspraul> I don't know which board to send where now, it just doesn't work. I hope for some understanding.

12:01 <wolfspraul> let's wait for the other 9 boards

12:01 <wolfspraul> but those results are really good. all of those boards didn't work before - keep in mind.

12:01 <wpwrak> well, you can think a bit about which boards you'd want to send where :)

12:02 <wolfspraul> we took them all out of the fail pool!

12:02 <wpwrak> pity that the weekend is near. so unless you make a quick decision, you lose 1-2 days of fedex transit. but i can see the scheduling conflict, also with adam.

12:02 <wpwrak> he really needs an assistant :)

12:03 <wolfspraul> oh of course, we think about that carefully. and there will be enough selection. but I need to suck out anything valuable from the perspective of the entire run and yield first.

12:03 <wolfspraul> I did not do this well on rc2, plain and simple. over-excited I guess.

12:03 <wolfspraul> my problem, but this time I fix it.

12:03 <wolfspraul> also if not, rc4 would bankrupt me :-) (I wouldn't do it unless rc3 was under control)

12:04 <wolfspraul> Adam knows this too, we have to catch more cases here, otherwise the next run will totally blow up.

12:04 <wolfspraul> we should not forget - Adam has far more production experience than we do. many runs of many thousand units, even some runs with millions I think.

12:04 <wolfspraul> so it's not like we tell him - solder here, solder there. our solder monkey. Adam knows what is needed to get the _manufacturing_ quality (yield, efficiency, etc) up.

12:05 <wpwrak> well, he currently does operate a bit in the solder monkey way. and i agree, that's not a very efficient thing to do.

12:06 <wpwrak> that's again some cost of the centralized approach. he has all the boards, but he's also got the only hands available for solder monkeying.

12:06 <wolfspraul> yes but I'm saying we (Adam and me) know what we need to do for rc4, and we will do it because we want a successful rc4.

12:07 <wolfspraul> oh sure, that I agree with. but our resources are limited there. no need to argue with me that it would be better to have more people for this, in Taipei.

12:07 <wolfspraul> or wherever the run is. the testing needs to be fast and efficient and in one place.

12:15 <wpwrak> btw, here's a nice picture from 0x3a: http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x3a_ch1-FLASH_RESET_N_ch2-INIT_B.JPG

12:16 <wpwrak> those narrow drops on CH2 are probably configuration retries, after hitting a CRC error

12:16 <wpwrak> (CH2 is INIT_B)

12:17 <wpwrak> this is a fairly distinct pattern. something that can help to classify problems, in case we see this anywhere else.

12:17 <wolfspraul> wpwrak: here's 0x3C 'pulsing' http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x3c_ch1-tp37.JPG

12:19 <wpwrak> ah :)

12:19 <wolfspraul> here's the text for it http://en.qi-hardware.com/mmlogs/milkymist_2011-08-16.log.html#t06:11

12:19 <wpwrak> interesting voltages. 2V :)

12:20 <wpwrak> would need to see TP36 for a better picture. real trouble is probably there, TP37 just follows it

12:21 <wolfspraul> if you look at the 0x3C testing notes, this is after fix2b applied

12:22 <wolfspraul> maybe another malfunctioning part in the same circuit

12:22 <wpwrak> looks like one of those fix2b regressions. maybe adam really needs a new soldering iron :)

12:22 <wolfspraul> if it's a malfunctioning part that's no problem at all actually

12:22 <wolfspraul> then the validity of fix2b still stands

12:22 <wolfspraul> we don't even need to look into it in that case

12:23 <wolfspraul> question is whether we want to make that assumption :-)

12:23 <wpwrak> i think fix2b is valid, no matter what else we observe. it undoes an unnecessary extension of the reset circuit.

12:23 <wolfspraul> so I suggest 0x48 first (but before even that wait for full fix2b results)

12:23 <wolfspraul> things really look good, I am not worried right now.

12:24 <wolfspraul> that's all I can say and now I get some tasty dinner! :-)

12:24 <wpwrak> what bothers me is that we see so many weird effects on D16. here's a "test plan":

12:25 <wpwrak> - if the voltages are "weird", scope TP36 and TP37 and archive the screenshot

12:25 <wpwrak> - inject current from a 3.3 V source into TP36 and measure how much current flows

12:26 <wpwrak> - if the current is low, add D60, and try again

12:26 <wpwrak> - if the current is high (> 1 mA), stop for further headbanging

12:28 <wpwrak> well, or make this 100 uA even

12:28 <wpwrak> sorry, no D60. R60.

12:30 <wpwrak> well, if the current is high, remove C238, then try again

12:30 <wpwrak> removing C238 can cause FLASH_RESET_N to PROGRAM_B contamination, but that should be relatively benign

12:31 <wpwrak> i.e., if i understood lekernel correctly, what would happen is that, if you try to command a "software reset" (from the GUI or such), the M1 would just shut down.

12:32 <wpwrak> heh, if we remove the "software reset" feature, we could even connect PROGRAM_B to FLASH_RESET_N, throw away C238 and D16 for good ;-)

12:33 <wpwrak> sometimes, all a puzzling knot needs is a good sword ;-)

12:34 <wpwrak> (but don't try this - you'd also have to change the use of P22. else, you could drive P22 high into reset out low. not sure what happens then.)

12:42 <lekernel> if giving up the software reset on the rc3 boards prevents those already huge delays from growing up even further, i'm for it

12:43 <wpwrak> lekernel: i was just waiting for you to say this ;-))

12:44 <wpwrak> anyway, up to and including C238 removal, i think the above test plan looks reasonable, doesn't it ?

12:44 <wpwrak> if we end up with C238 removed, we can then figure out what to do about it
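
[editor's sketch] The test plan wpwrak lays out between 12:24 and 12:30 can be read as a small decision procedure. The Python below is only one interpretation of those chat lines, not anything used in the run: the function name, the action strings, and the final "stop" branch after C238 has already been removed are assumptions added for illustration. The threshold uses the revised 100 uA figure rather than the first-mentioned 1 mA.

```python
# Hypothetical encoding of the D16/TP36 test plan from the log.
# Names and return strings are illustrative assumptions.

THRESHOLD_A = 100e-6  # "make this 100 uA even" (first proposal was 1 mA)


def next_actions(voltages_weird, tp36_current_a, c238_removed=False,
                 threshold_a=THRESHOLD_A):
    """Return the next steps for one board.

    tp36_current_a is the current (in amperes) measured while injecting
    from a 3.3 V source into TP36.
    """
    actions = []
    if voltages_weird:
        actions.append("scope TP36 and TP37, archive the screenshot")
    if tp36_current_a <= threshold_a:
        # Low current: the plan says to add R60 and try again.
        actions.append("add R60, try again")
    elif not c238_removed:
        # High current: remove C238 and repeat the measurement.
        actions.append("remove C238, try again")
    else:
        # Assumption: still high without C238 means deeper analysis.
        actions.append("stop for further analysis")
    return actions
```

For example, `next_actions(True, 50e-6)` yields the scope step followed by the R60 step, while a 2 mA reading on a board that still has C238 yields only the C238 removal step.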

12:47 <wpwrak> wolfspraul: background / memory refresh: C238 protects PROGRAM_B from falling edges on INIT_B or FLASH_RESET_N propagating through the diodes. a falling edge on FLASH_RESET_N can happen (only ?) when a software reset is commanded, e.g., through the GUI. in this case, propagation into PROGRAM_B would also reset the FPGA, which according to lekernel just shuts down the M1. (why does it shut down and not just reconfigure ?)

12:51 <wpwrak> wolfspraul: INIT_B would drop when there's a CRC error. so contamination of PROGRAM_B would create a feedback loop, where each failed try to configure would reset the FPGA. that sounds undesirable. with fix2b, we already remove INIT_B from the equation, leaving only the much friendlier FLASH_RESET_N connection.

12:55 <lekernel> wpwrak: reconfigure to standby bitstream = shutdown

12:56 <lekernel> the nasty problem we may have here, though, is that the reset pulse may not be long enough for the flash

12:56 <lekernel> so that can become another headache

12:56 <lekernel> because as soon as the fpga is deconfigured by program_b, the reset will be deasserted immediately

12:57 <wpwrak> why does reconfigure to standby mean shutdown but initial configuration to standby means that the system starts ? how do the two paths diverge ?

13:01 <wpwrak> (glitch on flash reset) yeah, could be tricky. the NOR wants at least 100 ns.

13:05 <wolfspraul> no even in initial configuration, it ends with the standby bitstream and you have to press the middle button to actually boot further (start)

13:06 <wolfspraul> man we need to get one of those boards to you :-)

13:07 <wolfspraul> how about one of the good ones? including fix2b. 0x34 ?

13:07 <wolfspraul> let me check the history of that one

13:07 <wolfspraul> yeah looks perfect. a typical rc3 story :-)

13:10 <wolfspraul> I had an evil thought on 0x48: nor corruption after first power-down. in that case we may have to try the 4.4v reset ics...

13:10 <wolfspraul> but we see later what we find

13:10 <wpwrak> (middle button) aah, now it makes sense, thanks :) and now i also understand why it's called "standby" bitstream :)

13:11 <wpwrak> 0x48. yes, that would be a possibility. bring it up and read back the NOR. we now know that urjtag works :)

13:12 <wolfspraul> that'd be the worst case. nor corruption on 0x48 requiring a reset ic rework on the entire run :-)

13:12 <wolfspraul> and then who knows it may not even fix the nor corruption... well, think positive.

13:12 <wpwrak> wolfspraul: actually, you could go to taipei to help adam ;-) alas, it seems that you'd then have to relocate rejon as well

13:13 <wolfspraul> we did see a nor corruption on rc2 (xiangfu), and also on 0x3A (unexplained, I'm just leaning towards 'replace nor chip' right now)

13:15 <wpwrak> i don't like these "replace the chip" operations. at least not without having isolated the fault. otherwise, you just roll the dice and you have no idea where they fall.

13:15 <wolfspraul> yes and no. as long as it's efficient it may scale up well into the thousands of units.

13:15 <wolfspraul> our difficulty is our own uncertainty into the design and our test process

13:15 <wolfspraul> that complicates things

13:16 <wolfspraul> now we have too many unknowns

13:16 <wolfspraul> so we cannot effectively kill the bugs

13:16 <wpwrak> and of course, if the problem is anywhere NOR-related, you may very well make it go away. e.g., by eliminating all the parts with tolerances in the region of the bell curve the design doesn't cover :) (and, of course, in the next run, you'll run into more of the same again)

13:16 <wolfspraul> if at the same time you question your design, your test process, and the chips, what then?

13:17 <wolfspraul> we need to get the design and test process off the table first

13:17 <wolfspraul> no matter what

13:17 <wolfspraul> the design and test process must be of unquestionable standard

13:17 <wpwrak> that's when you need systematic analysis :) yes, you may waste your time on random freak accidents. but chances are there's more to these things.

13:17 <wolfspraul> otherwise we can never manufacture effectively

13:17 <wpwrak> the test process is a separate issue

13:18 <wpwrak> right now, we're still trying to find causes

13:30 <wolfspraul> good news 0x61 0x63 also good

13:31 <wolfspraul> fix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good / 0x61 good / 0x63 good

13:32 <wolfspraul> 7 good, 2 pulse, 2 nor (my grouping), 1 render then fail

13:33 <wolfspraul> 7 more to go: 0x6B 0x6C 0x77 0x7A 0x7D 0x7F 0x85

13:35 <wolfspraul> the good news about 0x48 is also that it failed on the second render cycle, i.e. after the first power cycle

13:35 <wolfspraul> I like that much better than failing on the 6th or 9th one as we saw before

13:50 <wpwrak> (0x61, 0x63) great !

13:50 <wpwrak> (0x48) will be interesting to see the NOR dump

14:16 <aw_> (0x6B, 0x6C) good

14:17 <wolfspraul> aw_: great!

14:17 <aw_> let's dump 0x48 now

14:20 <aw_> dumping...

14:20 <wolfspraul> there you go

14:20 <wolfspraul> already?

14:23 <aw_> needs 5 minutes to dump. :)

14:23 <wolfspraul> aw_: yes but the reading seems to work

14:23 <wolfspraul> ?

14:23 <wolfspraul> so 0x48 cannot reconfigure now?

14:24 <wolfspraul> maybe after the dumping you try to boot, just to see whether it's still stuck somewhere (cannot boot)

14:24 <aw_> it's said that but I've never calculate it

14:24 <wolfspraul> aw_: we followed the wiki a bit today, excellent work!

14:24 <aw_> 0x48 is quite a little same with 0x3a yesterday we did

14:25 <wolfspraul> fix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good / 0x61 good / 0x63 good / 0x6b good / 0x6c good

14:25 <wolfspraul> 9 good, 2 pulse, 2 nor (my grouping), 1 render then fail
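
[editor's sketch] The per-board grouping line wolfspraul keeps posting can be tallied mechanically, which makes it easy to check the summary counts. This is a minimal sketch; the function name `tally` is an assumption, and the input string is the 14:25 grouping line copied from the log.

```python
from collections import Counter


def tally(grouping):
    """Count boards per status in a 'board status / board status' line."""
    counts = Counter()
    for entry in grouping.split("/"):
        # Each entry looks like "0x48 render fail": board id, then status.
        _board, status = entry.strip().split(" ", 1)
        counts[status] += 1
    return counts


FIX2B_1425 = (
    "0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / "
    "0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good / "
    "0x61 good / 0x63 good / 0x6b good / 0x6c good"
)

counts = tally(FIX2B_1425)
```

Run on that line, the tally gives 9 good, 2 pulse, 2 nor and 1 render fail across 14 boards, matching the summary above.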

14:25 <wpwrak> i like the trend in the last few :)

14:25 <wolfspraul> 5 more to go: 0x77 0x7A 0x7D 0x7F 0x85

14:26 <aw_> so even d2/d3 is still dimly lit and make sure tp36 and tp37 is fully pull high , also init_b is okay, then it tried to enter reconfiguration stage

14:26 <wolfspraul> aw_: maybe I misread the 0x48 notes? the 0x48 notes say that this board rendered, and then failed after the first power cycle?

14:26 <wolfspraul> aw_: yes but did 0x48 render before?

14:26 <wpwrak> (5 to go) kewl. that was really a productive day.

14:28 <larsc> wpwrak: do you happen to know any rtc chips, which could be used for the milkymist?

14:28 <aw_> wolfspraul, the 0x48 has never rendering successfully before.

14:28 <wolfspraul> oh!

14:28 <wolfspraul> that's good

14:28 <wolfspraul> then I misunderstood the notes, one sec

14:29 <wolfspraul> aw_: 0x48 notes are saying "5. applied fix2b 6. D16(in-circuit): For.V.=152mV, Rev.V = 1548mV 7. d2/d3 is fully off after power on 8. reflashed successfully 9. cant reconfigure @2nd rendering, tp36/tp37 is 3.3V "

14:29 <wolfspraul> see 9. can't reconfigure @2nd rendering

14:29 <wpwrak> so "@2nd rendering" really means "@2nd power cycle" ?

14:30 <aw_> sorry that i should say in the first round test, it has never rendering before

14:30 <wolfspraul> did the test software run?

14:30 <aw_> then after fix2b, can't reconfigure at 2nd power-cycle

14:30 <wolfspraul> I don't understand the notes

14:30 <wolfspraul> aw_: did the test software run on 0x48 ?

14:30 <aw_> yes, it's passed in test program

14:31 <wolfspraul> then after the test software, you power cycle?

14:31 <wolfspraul> and then it doesn't reconfigure?

14:31 <aw_> http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/48-results

14:31 <aw_> yes

14:31 <wolfspraul> interesting

14:31 <wolfspraul> well but that's even better than I thought

14:32 <wolfspraul> aw_: how do you feel about fix2b today, and the boards you worked on?

14:32 <aw_> see the last bottom, you can see I've only copied first one time boot up log

14:32 <wpwrak> eagerly awaits sinking his greedy fingers into the dump of 0x48 :)

14:34 <aw_> i have a strange feelings that if the board has smoothly and fully passed (including rendering )in FIRST run (fix2 circuit) , then the applied fix2b will be also passed the rendering job

14:34 <wpwrak> (i actually thought of extending my bit error checker to look for algorithmic patterns in the address bits. could be fun.)

14:35 <wpwrak> yeah, if a board was happy with fix2, it should be only happier with fix2b.

14:35 <wolfspraul> aw_: ok I don't fully understand the 0x48 process, but maybe you can update the notes a little with what you remember. maybe like werner said "@2nd power cycle" or "@1st boot to render"

14:36 <aw_> oka..sure

14:36 <wolfspraul> still strange how 0x48 failed, oh well

14:36 <aw_> good note from werner, I'll do like that

14:36 <aw_> i don't know, just second power-cycle then byebye

14:36 <aw_> :-)

14:36 <wpwrak> passing with fix2 and failing with fix2b could have the following explanations: 1) some rework mistake in fix2b, 2) something was borderline and went over the limit (e.g., a temperature dependency)

14:37 <wpwrak> aw_: how's the 0x48 dump coming along ?

14:37 <aw_> second

14:37 <wolfspraul> wpwrak: which board?

14:37 <wolfspraul> aw_: I think the results today are super encouraging.

14:37 <aw_> dump done...let me mv

14:37 <wpwrak> wolfspraul: no, in general

14:38 <wolfspraul> we are on a very good path with fix2b

14:38 <aw_> wolfspraul, yes, no; i still feel something strange though.

14:39 <wolfspraul> aw_: today, you set 9 boards to 'available' status, that means 90 thirty second rendering cycles. and not a single board failed after the test software, with the exception of 0x48 which failed right after.

14:39 <wpwrak> it seems that the diodes are unreliable. with fix2b alone, we're removing about 50% of the unreliability :) and with the extra testing adam does as part of fix2b, most of the rest as well

14:39 <aw_> tomorrow when I test all cluster batch boards, then back to work with werner to check failed board

14:39 <wolfspraul> yes exactly

14:39 <wolfspraul> but I am still happy - look at the numbers I just said - because we can now safely distinguish between 100% good and failed boards

14:39 <wolfspraul> and I do trust the ones that are 100% good

14:39 <wolfspraul> they are stable and good and will stay like that

14:40 <wpwrak> yes, looks pretty good now

14:40 <wolfspraul> we can do another 10 render cycles on them in 2 days to verify

14:40 <wolfspraul> never say never

14:40 <wpwrak> heh :)

14:40 <wpwrak> wait for a hot day

14:40 <aw_> wpwrak, it could be on diode. but this leave tomorrow to check. since d16's one terminal was soldering twice: one is my fix2, the other is to take apart for fix2b.

14:40 <wolfspraul> aw_: yes we notice, sure. 0x3C, 0x48, 0x55

14:40 <wolfspraul> there are still problems

14:42 <wolfspraul> this bloody diode has to go in rc4 :-)

14:42 <wpwrak> wolfspraul: ah, and the reflashing may need some clarification: is the reflash script as adam uses it supposed to do a verification (e.g., CRC) ? because it appears that it doesn't do this

14:42 <wolfspraul> no

14:42 <wpwrak> yeah, the diode is evil

14:43 <wolfspraul> the problem is that jtag verification is too slow

14:43 <wolfspraul> crazy slow

14:43 <wpwrak> waiting for the dump

14:43 <wolfspraul> that's why we added crc checks to the test software

14:43 <wpwrak> wolfspraul: hmm. okay, so writing with urjtag is unreliable. okay.

14:43 <wolfspraul> not unreliable

14:43 <wolfspraul> verification is too slow to be practical (30 minutes or more)

14:43 <wpwrak> wolfspraul: unchecked = unreliable ;-)

14:43 <wolfspraul> don't know why, we can easily enable it

14:43 <wolfspraul> but then it's crazy slow

14:44 <wolfspraul> so we check crc in the test software, which runs right after urjtag

14:44 <wolfspraul> and the results are logged

14:44 <wpwrak> wolfspraul: that doesn't make sense ;-) if write + read is faster than write + verify, something doesn't add up :)

14:44 <wolfspraul> http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/48-results

14:44 <wolfspraul> I'm sure there are inefficiencies

14:44 <wolfspraul> but if we enable 'verification' now in the script, it's crazy slow

14:44 <wolfspraul> unusable

14:44 <wolfspraul> so we moved the crc check to the test software instead

14:45 <wolfspraul> read is also slow

14:45 <wolfspraul> as you can see right now

14:45 <wolfspraul> reading the entire 32 megabytes takes over 4 hours

14:45 <wolfspraul> the only thing urjtag is fast at is unverified writing

14:45 <wolfspraul> as of right now

14:45 <aw_> wpwrak, http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/bitstream/0x48-standby1.bit/

14:45 <wpwrak> *hmm*

14:46 <wolfspraul> yeah

14:46 <wolfspraul> :-)

14:46 <wolfspraul> that's why there is a crc check in the test software

14:46 <wpwrak> whee, six 1 -> 0 transitions !

14:47 <wpwrak> http://pastebin.com/JqqWVik1

14:47 <wpwrak> format is  bit: #to0 #to1

14:47 <wpwrak> there #toX is "number of transitions from !X to X"

14:48 <wpwrak> s/there/where/
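
[editor's sketch] wpwrak's bit error checker itself is not shown in the log. The following is a hypothetical Python reconstruction of the report format he describes (per bit position, the number of 1 -> 0 and 0 -> 1 changes between the reference image and the dump). It assumes the comparison is done byte by byte with 8 bit positions; the real tool may well work on wider words, and all names here are illustrative.

```python
def bit_transitions(reference, dump):
    """Compare two byte strings and count bit flips per bit position.

    Returns (to0, to1): to0[b] counts bytes where bit b went 1 -> 0,
    to1[b] counts bytes where bit b went 0 -> 1.
    Only the overlapping length of the two inputs is compared.
    """
    to0 = [0] * 8
    to1 = [0] * 8
    for ref, got in zip(reference, dump):
        diff = ref ^ got  # bits that differ between image and dump
        for bit in range(8):
            mask = 1 << bit
            if diff & mask:
                if got & mask:
                    to1[bit] += 1  # reference had 0, dump has 1
                else:
                    to0[bit] += 1  # reference had 1, dump has 0
    return to0, to1


def format_report(to0, to1):
    """Render one 'bit: #to0 #to1' line per bit position."""
    return "\n".join(f"{bit}: {z} {o}"
                     for bit, (z, o) in enumerate(zip(to0, to1)))
```

A byte that reads back as all zeros where the reference had 0xFF shows up as one 1 -> 0 transition on every bit, whereas a stuck data line would concentrate all errors on a single bit position.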

14:48 <aw_> which bit you think?

14:48 <aw_> sorry don't understand your format. :)

14:49 <wpwrak> http://pastebin.com/19s7kHKQ

14:49 <wpwrak> one word was zeroed out for some reason

14:50 <wpwrak> so this is quite different from 0x3a

14:50 <wolfspraul> oh

14:50 <wolfspraul> that could finally be a software bug! :-)

14:50 <wpwrak> i'd re-write, read back without power-cycling, then power-cycle and see what happens

14:50 <wolfspraul> we are moving uuupppp!

14:50 <wpwrak> yes :)

14:51 <wolfspraul> lekernel: man this is great!

14:51 <aw_> wpwrak, yes, different from 0x3a though

14:51 <wolfspraul> the nor becomes so stable now that we can see actual software bugs making it all the way back in (well, likely software bugs)

14:51 <wolfspraul> I think that's good news

14:51 <wolfspraul> the hardware becomes stable...

14:52 <lekernel> what kind of software bug?

14:52 <lekernel> urjtag?

14:52 <wpwrak> aw_: yes, very different. 0x3a has all the errors on the same bit and scattered over many many addresses

14:52 <aw_> wpwrak, so you want me to re-write/reflash it again?

14:52 <wpwrak> aw_: if you don't feel too tired, yes please

14:52 <aw_> wpwrak, yes, i noticed that.

14:52 <wolfspraul> lekernel: no worries, I was half joking. just extrapolating what it could be...

14:52 <aw_> wpwrak, wait

14:52 <aw_> should we use xilinx tool?

14:52 <wolfspraul> Werner just saw an entire word in nor zeroed out.

14:52 <wpwrak> aw_: and then read back before power-cycling, so that we can see whether the writing was okay

14:53 <wpwrak> aw_: naw, urjtag is fine

14:53 <wolfspraul> aw_: no, use urjtag, I trust it

14:53 <aw_> alright...let's use urjtag first. ;-)

14:54 <wpwrak> wolfspraul: it's a bit too early to blame sw. could also be a urjtag glitch for all we know. or powering down.

14:54 <wolfspraul> yes yes sure

14:54 <wolfspraul> I was just expressing my joy

14:54 <wolfspraul> a full word!!!

14:54 <wolfspraul> we are clearly moving upwards

14:55 <wolfspraul> actually, in that theory, it must have happened before flickernoise

14:55 <wolfspraul> but anyway, just speculation

14:56 <wolfspraul> I don't care much because this was caught by the test process

14:56 <wolfspraul> and safely caught, not at last second

14:59 <aw_> reflash done

14:59 <aw_> now dump again.

15:05 <aw_> wpwrak, bad...sorry that I didn't notice that you wanted to dump without power-cycling...

15:05 <aw_> i redo now..sorry

15:05 <wpwrak> naw, take this one then

15:05 <aw_> wpwrak, also okay?

15:05 <wpwrak> yeah

15:05 <aw_> alright

15:06 <aw_> phew~ almost my finger to power off. :)

15:07 <wpwrak> heh :)

15:08 <wpwrak> if 0x48-2 is okay, which is what i'd expect, then the power cycle didn't matter. in case we find an error also in 0x48-2, the power cycle will need investigating.

15:09 <wolfspraul> if it were that easy, we are lucky

15:09 <wolfspraul> wpwrak: it could well be 1 out of 10 power cycles

15:09 <wolfspraul> remember that we are zooming in on troublemakers in a run. whenever you do that your cases get stranger and stranger.

15:09 <wpwrak> wolfspraul: it could be, yes.

15:10 <wolfspraul> don't forget all the dozens of boards in hundreds of tests that have never shown anything like this

15:10 <wolfspraul> and how we are looking at the one time that we saw this

15:10 <wpwrak> wolfspraul: we also have the risk of an undefined power state in the down ramp in all of rc3

15:10 <wolfspraul> yes sure, I know

15:10 <wolfspraul> I am aware of it

15:10 <wolfspraul> but today, we had 9 boards pass a total of 90 rendering power cycles
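
[Editor's sketch] The "1 out of 10 power cycles" guess and the "90 rendering power cycles" figure can be put side by side with a little arithmetic. The sketch below assumes every power cycle is an independent trial with the same corruption probability and that the 90 cycles on the 9 passing boards can be pooled; neither assumption is established in the log, and 0x48 itself is not among those 9 boards.

```python
def prob_all_pass(p_fail: float, cycles: int) -> float:
    """Probability of seeing zero failures in `cycles` independent power cycles."""
    return (1.0 - p_fail) ** cycles


def upper_bound_p_fail(cycles: int, confidence: float = 0.95) -> float:
    """Largest per-cycle failure probability still consistent, at the given
    confidence level, with observing zero failures in `cycles` trials."""
    return 1.0 - (1.0 - confidence) ** (1.0 / cycles)


if __name__ == "__main__":
    # If corruption really hit 1 in 10 cycles, 90 clean cycles would be very unlikely.
    print(f"P(90 clean cycles | p=0.1) = {prob_all_pass(0.1, 90):.2e}")
    # What 90 clean cycles alone can and cannot rule out.
    print(f"95% upper bound on p after 90 clean cycles = {upper_bound_p_fail(90):.3f}")
```

Under these assumptions, 90 clean cycles argue strongly against a rate as high as 1 in 10 on those boards, but they cannot exclude a rate of a few percent, let alone a rare event that is specific to one board.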

15:10 <wpwrak> i hope applying locking wherever possible will reduce the risk of the down ramp doing too much damage

15:11 <wpwrak> would be good to have a CRC check for the unprotected partitions, though
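
[Editor's sketch] A per-partition CRC check of the kind suggested here could look roughly like the following. It works on a flash image held in memory and uses `zlib.crc32` from the Python standard library. The partition names, offsets, and lengths are placeholders, and the reference CRCs would have to come from a known-good image; none of these values are taken from the log or from the Milkymist One flash layout.

```python
import zlib


def partition_crcs(image: bytes, partitions: dict) -> dict:
    """Compute a CRC-32 for each partition.

    `partitions` maps a name to (start_offset, length) in bytes.
    The layout passed in is the caller's assumption.
    """
    crcs = {}
    for name, (start, length) in partitions.items():
        chunk = image[start:start + length]
        if len(chunk) != length:
            raise ValueError(f"partition {name!r} extends past end of image")
        crcs[name] = zlib.crc32(chunk) & 0xFFFFFFFF
    return crcs


def corrupted_partitions(image: bytes, partitions: dict, expected: dict) -> list:
    """Return the sorted names of partitions whose CRC differs from `expected`."""
    actual = partition_crcs(image, partitions)
    return sorted(name for name, crc in actual.items() if crc != expected.get(name))
```

Usage would be to record `partition_crcs()` of a freshly written and verified image, then run `corrupted_partitions()` against a dump taken after one or more power cycles.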

15:13 <aw_> wpwrak, http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/bitstream/0x48-standby2.bit/

15:14 <wpwrak> perfect

15:14 <wpwrak> try to boot and render :)

15:15 <wolfspraul> aw_: if you do 10 full render cycles + crc checks on 0x48, I think you can add it to avail - fix2b

15:15 <wolfspraul> I wouldn't know why not

15:15 <aw_> d2 is ON..and rendering

15:15 <wolfspraul> but you can also do that tomorrow, it gets later and later and it was a long day...

15:16 <aw_> yes..i'd go for sleep ...hehe ;-)

15:16 <wolfspraul> I'm thinking whether we should be suspicious about 0x48, but I don't see why

15:16 <wolfspraul> yes

15:16 <wolfspraul> thanks for the excellent work today!

15:17 <wolfspraul> fantastic, really

15:17 <wolfspraul> so many boards

15:17 <aw_> but good that we caught known issue on 0x48 today

15:17 <wolfspraul> very enlightening fix2b results

15:17 <wolfspraul> well

15:17 <wolfspraul> we will do some more thinking

15:17 <wolfspraul> remember this in the notes history...

15:17 <aw_> wpwrak, thanks a lot though.

15:17 <aw_> okay

15:17 <wolfspraul> we can hold onto 0x48 for a while

15:18 <wolfspraul> but I am 99% sure there's no problem with 0x48

15:19 <aw_> alright...night

15:19 <wpwrak> a great day indeed !

15:19 <wpwrak> aw_: sweet dreams ! :)

15:19 <aw_> k

15:22 <wolfspraul> wpwrak: 0x48 is one of those cases that I would/might end up holding back or not selling

15:22 <wolfspraul> I always sell the best things first

15:23 <wolfspraul> but it's too early to tell. if we find a clear software bug one day then it changes.

15:23 <wolfspraul> I think we can leave 0x48 alone now.

15:24 <wolfspraul> so 0x3C/0x32 are interesting, or maybe the ones that I grouped as 'nor failure' (0x3A/0x55)

15:25 <wolfspraul> wpwrak: do you have any idea for an rtc chip we could add to rc4?

15:26 <wpwrak> 0x48 may just be the first one to exhibit a down ramp corruption

15:26 <wolfspraul> nah

15:26 <wolfspraul> very speculative, almost wishful thinking

15:26 <wpwrak> (rtc chip) no idea :)

15:26 <wpwrak> i don't exactly "wish" for down ramp corruption ;-)

15:27 <wolfspraul> no but it's too speculative for me - no reason

15:27 <wolfspraul> will think more

15:27 <wolfspraul> adam did a great job today, lots of hard data

15:27 <wolfspraul> fix2b looks good, all on track

15:28 <wpwrak> i think we may currently have a very low probability of encountering down ramp corruption. maybe it needs a bus access plus the right power drop. a synthetic test may be able to make it happen more often.

15:28 <wpwrak> or maybe down ramp corruption never happens and this was something else

15:28 <wpwrak> maybe it's a one in a hundred years sw bug :)

15:31 <wolfspraul> here's an important question: should adam investigate 32/3c or 3a/55 first, or first proceed with fix2b across all 90 boards?

15:32 <wpwrak> hmm, let's give 0x32/0x3c a try first. maybe there's a low-hanging fruit there. in 0x3a, we already know that things are a bit harder.

15:33 <wolfspraul> ok, but with time limit probably

15:34 <wolfspraul> I feel good about fix2b across all 90 boards

15:34 <wolfspraul> calling it a day as well, n8 (reading backlog tmr)

15:34 <wpwrak> 0x55 looks worse than 0x3a

15:34 <wpwrak> 0x32/0x3c still don't act as a fix2b'ed board should

15:35 <wolfspraul> yes but can 0x55 raise or lower fix2b validity? I doubt it...

15:35 <wolfspraul> same for 32/3c. just some small problem on those particular boards, nothing to do with fix2b.

15:35 <wpwrak> 0x3a and 0x55 don't affect fix2b. 0x32/0x3c might.

15:35 <wolfspraul> the more reworks we make, the more manual mistakes we introduce into the run, which then have to be fixed again.

15:36 <wpwrak> but if we find something "interesting" in 0x3a/0x55, it may make sense to include it in the post-fix2b testing, to save time.

15:36 <wolfspraul> ok so 32/3c first, I guess

15:36 <wpwrak> (manual errors) yes, that could very well be the problem of 0x32 and 0x3c

15:41 <wolfspraul> once we can safely assume that, there is no value in looking at them at all, even until after rc3 sales start (not just fix2b verification)

15:42 <wolfspraul> but let's ping them quickly, see what we find, then decide

15:45 <wpwrak> yeah, we won't know for sure before we've fixed them :)

15:45 <wpwrak> i don't expect this to be overly hard

15:45 <wpwrak> C238, most likely