<wpwrak>
and now, live from the arena, the eternal struggle of good versus evil ! today, the champion of the powers of good will be ... adam ! the forces of evil are represented by board 0x32. will the good prevail ? will one more M1rc3 perish in darkness ? watch it live, only on #milkymist !
<roh>
*lol*
<roh>
:)
<roh>
how's the score atm?
<wpwrak>
roh: so far, adam has just entered the area but hasn't uttered a word yet. perhaps he's meditating, gathering his spiritual forces to defeat the foe :)
<wpwrak>
s/area/arena/
<roh>
heh
<roh>
i am sure he'll get through that pile of stuff to fix. will just take some time.
<wpwrak>
yeah. let's hope there aren't too many of the truly weird problems left in that pile
<wpwrak>
hmm, seems that tonight will be quiet
<kristianpaul>
and are you still thinking of reworking 90 pcbs again?
<wolfspraul>
alright, I saw some confusion in questions from Sebastien and Kristian Paul
<wolfspraul>
it's a little difficult to keep overview in the flood of details
<kristianpaul>
well, i'm asking with no awareness of last day of backlog
<wolfspraul>
1. we cannot say that there is a problem with 'bad nor chips'
<wolfspraul>
all is fine with the nor chips
<wolfspraul>
2. urjtag may have bugs, but right now it seems Xilinx Impact is only as good as urjtag, so we will continue to use urjtag and the jtag-serial board
<wolfspraul>
3. of course we will apply fix2b, or any additional fix we may find to be needed - to all boards that are being sold
<wolfspraul>
all boards that are being sold are sold in the same condition, and that condition is 100% pass and 100% bug-free
<wolfspraul>
quite simple :-)
<wolfspraul>
4. yesterday was a bit messy, or rather slow, because it was not really a production day, but more a fix2b design verification day
<wolfspraul>
normally we would not spend so much time on one board that shows a NOR problem (_for whatever reason_), but we did because we had to verify fix2b and cannot risk jumping over too many unknowns
<wolfspraul>
after yesterday, I feel pretty good about fix2b
<wolfspraul>
Adam will now approach the other 15/16 fix2b candidates in a more production style, that is - fast
<wolfspraul>
if something doesn't work - note what went wrong, then next board
<wolfspraul>
5. I am very happy that now it seems we can remove the long wire
<wolfspraul>
the long wire is an invitation for trouble
<kristianpaul>
aka FM antenna :)
<wolfspraul>
the one thing special about rc3 is that we mix design verification and production run
<wolfspraul>
almost since day smt+1 if you remember
<wolfspraul>
we started fiddling with the reset circuit right at the beginning because boards wouldn't boot
<wolfspraul>
remember?
<kristianpaul>
sure, that was messy but still well managed, which is good !
<wolfspraul>
so that's causing a big problem
<kristianpaul>
yes i have good memory
<wolfspraul>
because design verification and production are so different
<wolfspraul>
production is about speed, economics
<wolfspraul>
every board goes through a predefined and efficient process
<wolfspraul>
and it doesn't matter why it fails, if it fails the failure point is recorded and we go to the next board
<wolfspraul>
but if we are uncertain about the design (!), then we cannot do that
<wolfspraul>
so...
<kristianpaul>
keep going !:)
<kristianpaul>
yes, i now understand your point
<wolfspraul>
why did sales not start yet?
<wolfspraul>
it's easy
<wolfspraul>
it's _NOT_ because some boards have problems
<wolfspraul>
in a run of 90 there will always be problems
<wolfspraul>
it's because our test procedure, which ended in 10 thirty-second render cycles, showed sudden failures (cannot reconfig) on boards that were rendering perfectly fine, but stopped doing so at the 2nd, 5th, 9th render cycle
<wolfspraul>
that's bad!!!!
<wolfspraul>
that's the bad thing
<wolfspraul>
not the schmitt-triggers, usb transceivers, nor chips, urjtag bugs, long wires, etc.
<kristianpaul>
sure, thats uncertain
<kristianpaul>
at least was..
<wolfspraul>
no it still is
<wolfspraul>
so I have 40 boards (for example)
<wolfspraul>
30 pass
<wolfspraul>
10 fail, some at 2nd rendering, some at 5th, some at 9th
<wolfspraul>
ok?
<wolfspraul>
can we sell the 30 pass?
<wolfspraul>
NO!
<wolfspraul>
why?
<kristianpaul>
not working
<wolfspraul>
because if we had done 20 render cycles, or 30, then maybe only 20 boards would have 'passed'
<wolfspraul>
or 15
<wolfspraul>
maybe if we do 100 render cycles, none would pass
<wolfspraul>
got it?
<kristianpaul>
totally :)
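The argument above can be put into numbers with a small sketch. It assumes every board has the same, independent chance of failing in each render cycle, which the log does not claim (real boards may be either solid or marginal). The per-cycle failure rate is a made-up value, chosen only so that about 30 of 40 boards survive 10 cycles as in the example.

```python
# Sketch only: a hypothetical failure model, not measured M1 data.

def pass_probability(p_fail_per_cycle: float, cycles: int) -> float:
    """Chance that one board survives `cycles` render cycles,
    assuming each cycle fails independently with the same probability."""
    return (1.0 - p_fail_per_cycle) ** cycles


def expected_passes(boards: int, p_fail_per_cycle: float, cycles: int) -> float:
    """Expected number of boards still 'passing' after `cycles` render cycles."""
    return boards * pass_probability(p_fail_per_cycle, cycles)


# Hypothetical rate: picked so that 75% of boards survive 10 cycles.
P_EXAMPLE = 1.0 - 0.75 ** (1.0 / 10.0)

for n in (10, 20, 30, 100):
    print(f"{n:3d} cycles: about {expected_passes(40, P_EXAMPLE, n):.1f} of 40 boards pass")
```

Under this model the number of 'passing' boards keeps shrinking as the cycle count grows, which is why a pass after only 10 cycles says little while the failure is unexplained.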
<wolfspraul>
I cannot start selling like this, not _ANY_ board
<wolfspraul>
so what I want to see with fix2b now is this:
<wolfspraul>
1. Adam works on a lot of boards, one after another, fast
<wolfspraul>
some pass, some fail
<wolfspraul>
ok?
<wolfspraul>
he will do 10 render cycles on each one
<wolfspraul>
I have two conditions to start sales:
<wolfspraul>
1) at least 50% of boards must pass (otherwise something so big may be covered up somewhere that we are better off pausing for a day or two to study it)
<wolfspraul>
2) from the boards that fail, _NONE_ must fail at any point after running the test software (all peripherals). If they fail before that - fine. But once they boot for the first time after the test software, there must be no failures.
<wpwrak>
yeah, no regressions
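The two sales conditions can be written down as a simple gate. This is a minimal sketch: the record layout, the field names and the board ids are hypothetical and not taken from the wiki page or any real test tooling.

```python
# Sketch only: hypothetical record format for per-board fix2b results.
from dataclasses import dataclass
from typing import List


@dataclass
class BoardResult:
    board_id: int
    passed_test_software: bool   # full peripheral test completed without error
    failed_afterwards: bool      # any failure after the test software passed,
                                 # e.g. cannot reconfigure at a later render cycle


def board_passes(r: BoardResult) -> bool:
    """A board counts as a pass only if it never failed at any stage."""
    return r.passed_test_software and not r.failed_afterwards


def can_start_sales(results: List[BoardResult], min_pass_ratio: float = 0.5) -> bool:
    """Condition 1: at least `min_pass_ratio` of the boards pass.
    Condition 2: no board fails after the test software has passed it."""
    if not results:
        return False
    regressions = [r for r in results if r.passed_test_software and r.failed_afterwards]
    passes = sum(1 for r in results if board_passes(r))
    return not regressions and passes / len(results) >= min_pass_ratio


# Made-up example: two clean boards, one early failure, one regression.
example = [
    BoardResult(1, True, False),
    BoardResult(2, True, False),
    BoardResult(3, False, False),  # failed before or during the test software: acceptable
    BoardResult(4, True, True),    # passed, then failed later: blocks sales
]
print(can_start_sales(example))       # regression present
print(can_start_sales(example[:3]))   # 2 of 3 pass, no regression
```

A single regression blocks sales regardless of the pass ratio, while early failures only count against the 50% threshold.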
<wolfspraul>
so I think we need to give Adam one or two full days, where he can just focus on speed and no chatting and no time consuming analysis.
<wolfspraul>
unless something really worrying comes up with fix2b, but we can follow the wiki page as he updates it
<wpwrak>
that's what we had with 0x3a. maybe one day we'll even know why ;-)
<wpwrak>
(no chatting) hehe ;-)
<wolfspraul>
wpwrak: 0x3A rendered fine, and then stopped?
<wolfspraul>
kristianpaul: is this all clear now?
<kristianpaul>
wolfspraul: yes
<kristianpaul>
i lack patience thats all :)
<wolfspraul>
he, ok. sorry. it's a flood of details I know.
<kristianpaul>
you think in long term
<kristianpaul>
which is GOOD
<wpwrak>
0x3a didn't get that far. but at one point in time correct NOR content could be read back. after that, either writing failed or readback has 100% reproducible corruption.
<kristianpaul>
i haven't even thought about that 50% pass
<wolfspraul>
ok but that's way before 100% pass of the test software
<wolfspraul>
I don't care about those cases
<wolfspraul>
kristianpaul: that's just to protect us from a potentially still remaining design mistake
<wolfspraul>
if Adam works on 20 boards, and 2 pass - something is wrong :-)
<kristianpaul>
oh sure, as those popping with rc2 :)
<kristianpaul>
s/with/from
<wolfspraul>
wpwrak: it's good that we dug into the nor of 0x3A so much yesterday, but from a failure analysis standpoint, it may be just one of the 5-10 boards with 'various' problems here or there in the end
<wolfspraul>
now we are sure it's not related to fix2b
<roh>
re
<wolfspraul>
kristianpaul: you can also see it this way:
<wolfspraul>
the test software must catch 100% of failures
<wolfspraul>
if the test software passes, after that the board must work
<wpwrak>
wolfspraul: yeah, i feel good about fix2b. also in 0x3a, besides the actual corruption, the rest of the system behaviour makes sense
<wolfspraul>
if boards fail after the test software has determined they are 100% ok, that's a big problem
<wpwrak>
wolfspraul: this means that we can now do a little better than just "d2/d3 dimly lit" :)
<wolfspraul>
that means our test software is bad (or a design mistake that the test software cannot detect), and it means we have to re-test the entire batch after that issue is cleared up
<wpwrak>
yeah. these things would suck
<wolfspraul>
wpwrak: yes, we learnt a lot with 0x3a
<wolfspraul>
interesting board
<wolfspraul>
but not the time to go deeper there now, and maybe never
<wolfspraul>
maybe just replace the nor chip and it's all fine
<wolfspraul>
manufacturing is about economics, not every weird case needs to be analyzed to the total root cause
<wolfspraul>
if we have a strong design, and strong test software, we have a basis to run manufacturing economics on
<wolfspraul>
then most decisions become economic decisions
<wolfspraul>
but if we are uncertain about the design, or test software - BAD! :-)
<wolfspraul>
then it gets messy
<wolfspraul>
because then we cannot just make quick economic decisions
<wpwrak>
if we get a cluster of NOR troubles like in 0x3a, it may make sense to write a pattern of 0x0000 or 0xffff and read it back. then probe the bus lines. that would show whether the problem is on reading or on writing.
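The constant-pattern idea can be sketched as a small classifier. It assumes a 16-bit data bus (suggested by the 0x0000/0xffff patterns) and three observations per test word: what a probe sees on the bus during the write, what it sees during readback, and what the FPGA side reports. All names are hypothetical; nothing here talks to real hardware or to urjtag.

```python
# Sketch only: classifies hand-collected observations, no hardware access.

WORD_MASK = 0xFFFF  # assumption: 16-bit NOR data bus


def bad_bits(expected: int, observed: int) -> int:
    """Bit mask of the data lines that differ from the expected pattern."""
    return (expected ^ observed) & WORD_MASK


def classify(pattern: int, bus_during_write: int, bus_during_read: int, value_read: int) -> str:
    """Decide where a constant test pattern got corrupted.

    pattern          -- word that was supposed to be written (0x0000 or 0xffff)
    bus_during_write -- levels probed on the data lines while writing
    bus_during_read  -- levels probed on the data lines while reading back
    value_read       -- word the FPGA side reports
    """
    if bad_bits(pattern, bus_during_write):
        return "write path: data is already wrong before it reaches the NOR"
    if bad_bits(pattern, bus_during_read):
        return "NOR or programming: correct data went in, wrong data comes out"
    if bad_bits(pattern, value_read):
        return "read path: bus is correct, but the FPGA side sees something else"
    return "ok"


# Made-up observations: bit 3 reads back as 1 although the bus looked clean.
print(classify(0x0000, 0x0000, 0x0000, 0x0008))
print(f"bad data lines: {bad_bits(0x0000, 0x0008):#06x}")
```

With a constant pattern every data line should sit at the same level, so a single odd line stands out on a scope without decoding any bus traffic.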
<wolfspraul>
I think it's crazy that we first solder this long wire and diode to 90 boards, and a few weeks later determine that it was not needed :-)
<wolfspraul>
that shows how we are mixing design and production work
<wolfspraul>
but it's ok, we go full power forward on everything now
<wpwrak>
yeah, that was the scenic route :)
<wolfspraul>
if more design problems show up, well, sorry, we have to go through the entire batch again...
<roh>
true. on the other hand.. better to find and fix those bugs before shipping.. not like some other vendors selling green bananas
<wolfspraul>
no worries.
<roh>
in the end there were not many 'design' errors right? mostly 'bad parts' as far as i understood
<wolfspraul>
oh no
<wolfspraul>
:-)
<roh>
like the schmitt-triggers and now these diodes?
<wolfspraul>
design 'error' maybe too much, but definitely design 'uncertainty'
<wolfspraul>
and that is enough to disrupt the normal testing of the run
<wolfspraul>
roh: like for example, it turned out that the way we produced the boards in SMT, none of them would have booted
<roh>
huh? i thought all design changes were prototyped and 'tested'
<wolfspraul>
:-)
<wolfspraul>
would you call that a 'design error'? :-)
<roh>
by reworking a rc2
<wolfspraul>
so they went back to SMT for rework multiple times (!)
<wolfspraul>
plus reworks on Adam's side
<roh>
uh.. what was the root cause of that? bad smt params?
<wolfspraul>
roh: no, we made mistakes there
<wolfspraul>
no bad smt params
<wolfspraul>
the smt shop did everything as told
<kristianpaul>
up to how many reworks are acceptable? (which was my concern when i first asked)
<wolfspraul>
I'm telling you - design 'weaknesses'
<wolfspraul>
so we find 'oops'. 'they all don't boot'
<wolfspraul>
:-)
<wolfspraul>
that's how it started
<wolfspraul>
but it's ok, these things can happen in small runs
<wolfspraul>
kristianpaul: theoretically it's infinite, you can rework many times - why not
<wolfspraul>
but practically maybe not
<roh>
wolfspraul: well.. design is schematic and pcb layout for me. when it comes to parts and the mechanical mounting.. thats 'craft' for me ;)
<wolfspraul>
because every rework means new errors get introduced
<wolfspraul>
roh: we definitely ran into design issues
<roh>
wolfspraul: can you be more precise what assumptions were 'wrong' there?
<wolfspraul>
that's why you hear about fix1, fix2, fix3, fix4, fix2b, etc.
<kristianpaul>
yeah, that's increasing the mix of possible bugs
<wolfspraul>
roh: phew, hard. the reset circuit is really subtle.
<wolfspraul>
lots of details, lots of pins
<wolfspraul>
I have no full electrical overview.
<wolfspraul>
sebastien and werner do
<wolfspraul>
it's actually supposed to be 'simple' (ahem)
<wolfspraul>
but you know how it is
<roh>
maybe we should collaboratively write a book after that... list all classes of 'errors' and 'stuff to go wrong' ... to help other people do it better or not make the same errors again ;)
<wolfspraul>
a little thing wrong there and the board won't boot
<wolfspraul>
so before the boards even left the SMT shop for the first time, Adam was already on the phone trying to tell them "ahh, can you please do the following reworks before sending out: a, b, c"
<wpwrak>
i don't think anyone really has a full overview of the current reset circuit ;-) the very non-ideal diode(s) make it rather complicated.
<roh>
wpwrak: what i dont get.. why does it need to be so complicated?
<roh>
doesnt xilinx provide a proper, simple reset example?
<wolfspraul>
yeah, we already have more 'improvements' for the reset circuit lined up (gates, second reset ic, etc)
<wpwrak>
roh: now it's relatively simple again. but the problem is that the diodes have relatively large capacitance.
<wpwrak>
roh: the issue is that we (think we) need to hold the NOR in reset while power ramps up
<wolfspraul>
collectively we must have spent 3 months on the reset circuit now
<wolfspraul>
:-)
<wpwrak>
roh: (and we hope we don't really need to hold it in reset while power ramps down ... because rc3 doesn't do that :)
<wolfspraul>
Adam went to Xilinx FAE etc. etc. crazy.
<wpwrak>
heh ;-)
<kristianpaul>
(FAE) oh, what did they say? i don't remember that..
<wolfspraul>
don't even ask
<roh>
wolfspraul: yes. and it will be a complete waste of time if xilinx does a new revision of the spartan ;)
<wolfspraul>
I don't know
<wolfspraul>
too many details
<kristianpaul>
sure
<wolfspraul>
well.
<wpwrak>
wolfspraul: (2nd reset IC) actually, that was a bad idea. we need a gate to act as our "diode"
<wolfspraul>
I think about the platform, Milkymist platform.
<wolfspraul>
so once more boards are out, I think our effectiveness in those things will go up.
<wpwrak>
roh: it may not be an FPGA-specific issue
<wolfspraul>
because in the end it is a simple circuit
<kristianpaul>
oh yes, now adam should learn to use the /ignore nick  command :)
<wolfspraul>
but we are struggling because we operate with so few people, so few boards
<wolfspraul>
I just need to stabilize the bloody rc3 run, get reliable 100% pass testing results, and start to ship those monsters out :-)
<wolfspraul>
roh: for sure the value of the m1 board is not in its reset circuit
<wolfspraul>
:-)
<wolfspraul>
if anything it's in the Milkymist SoC, Flickernoise, case :-)
<wolfspraul>
so I'm not worried that this is a very spartan-6 specific little circuit, that's fully understood.
<wolfspraul>
these are no 'investments', just cost and nastiness
<wolfspraul>
kristianpaul: in general you want to avoid reworks completely
<wolfspraul>
rework = heat
<wolfspraul>
heat = bad
<wolfspraul>
make the production process as determined as possible
<wolfspraul>
deterministic
<wpwrak>
crispy chips. yumm :)
<wolfspraul>
the rework heat can cause a nearly infinite number of side-effects in chips, passive parts, the pcb, etc.
<wolfspraul>
why is there all this fuss about precise reflow temperature curves?
<wpwrak>
... diodes ... :)
<wolfspraul>
because it's so important, even 1 degree makes a difference
<kristianpaul>
ah, that's why wpwrak likes to freeze boards :)
<wolfspraul>
whether the top is at 246 celsius or 247 celsius...
<wolfspraul>
so think about that
<wolfspraul>
if that is important, how crude a rework is!
<wpwrak>
kristianpaul: naw, that was actually something else :)
<wolfspraul>
it's like a hammer on delicate china
<wpwrak>
kristianpaul: what mystifies me in 0x3a is that we went from read noise to either write noise or stable read failure
<kristianpaul>
good, you hammer the monster before leave the cave :)
<wolfspraul>
so the best number of reworks is 0. but theoretically there is nothing wrong with reworks either.
<wolfspraul>
sorry, doesn't get more precise than that...
<roh>
wpwrak: well.. maybe its really a broken nor chip. or even only bad soldering for some reason. there are always flukes making sure you are puzzled about the rules
<wolfspraul>
kristianpaul: in terms of life expectancy of a particular board, I don't think you can say in general that a board with 0 reworks has a longer life expectancy than one with 5 reworks.
<wolfspraul>
the key is the test software here
<wolfspraul>
if the board with 5 reworks passes the test software, and the test software (or process) is good, we can safely assume the life expectancy of that board to be the same as the one with 0 reworks (and also pass the entire test process)
<wolfspraul>
that's because there may be a small lingering problem in the one with 0 reworks as well, I don't see how reworks in general increase the number of small lingering problems
<wolfspraul>
I have no such data.
<roh>
well.. heat can reduce the life expectancy of caps, semiconductors and other parts as well.. but i dont think that reworks have more influence than regular production or weird designs.
<roh>
weird meaning e.g. mis-spec-ing a smps and killing the caps over time in the process
<roh>
happens from time to time on mainboards (exploding caps are not always a 'bad caps' cause)
<wolfspraul>
in terms of life expectancy, there's some interesting stuff in the solder process.
<wolfspraul>
unfortunately more and more consumer electronics are designed for a 2 year or even less life span
<roh>
that also. i bet that lead-free mania will bite us in the ass at least once ;)
<wolfspraul>
so the process gets optimized towards that
<wolfspraul>
but there are dramatic differences in life expectancy (say for example temperature impact over time), so if you want to you can solder in a way that will be dozens of times more robust towards temperature cycles
<wolfspraul>
but one by one
<roh>
maybe we shouldnt design like that and use that for marketing ;)
<wolfspraul>
we cannot work on all these details now
<wolfspraul>
it's not the design, it's the soldering process
<wolfspraul>
every time any part gets hot or cools down again, the different materials expand differently
<wolfspraul>
if you want to manufacture for 20 or 30 year life expectancy, there's a lot of good stuff you can do
<wolfspraul>
but increasingly the consumer electronics industry moves away from that
<wolfspraul>
so that's only for aviation, cars, medicine, etc.
<wolfspraul>
anyway we are not at that level yet
<wolfspraul>
just trying to get m1 rc3 out as a good product, that boots and works at all :-)
<wolfspraul>
but I looked at some data once of a comparison of different solder processes and techniques, and failure rates of temperature cycles (say up to 60 and back down to 30).
<wolfspraul>
and the differences were huge
<wolfspraul>
giant
<wolfspraul>
so after 500 cycles, 1000 cycles, 2000 cycles. some may have failure rates of 30-40%, and other processes maybe only 1%
<wpwrak>
so, how's the battle going ? cluster got smaller ?
<wolfspraul>
I don't dare to ask :-)
<wolfspraul>
sometimes gotta give people time...
<wpwrak>
ah, i thought you had a little window monitoring adam's vital functions, heart rate, blood pressure, transpiration, ... :)
<wolfspraul>
hmm
<wolfspraul>
some things I read on the wiki do not look that great
<wolfspraul>
search for fix2b
<wolfspraul>
0x32 is still not right somehow
<wolfspraul>
0x34 and 0x39 are ok
<wolfspraul>
0x3A has the nor problems we observed yesterday
<wolfspraul>
now...
<wolfspraul>
0x3C, hmm
<wolfspraul>
0x40 is good, but then: 0x48 (!) what's that?
<wolfspraul>
cannot configure after 2nd rendering!
<wpwrak>
0x34 and 0x39 were already good yesterday, no ?
<wolfspraul>
yes
<wolfspraul>
there is more stuff we have to find out about
<wolfspraul>
0x3C - strange (like 0x32?)
<wolfspraul>
0x48 is really bad
<wolfspraul>
because that means we still have boards that fall back from rendering to unreconfigurable, even after fix2b
<wpwrak>
yes
<wolfspraul>
0x54 is good, 0x55 could be something with the nor chip (=ignore)
<wolfspraul>
0x5C is good
<wolfspraul>
that seems to be all fix2b results so far
<wolfspraul>
the test results are very clear, that's good. I'm the eternal optimist.
<wolfspraul>
yes, perfect
<wpwrak>
ah, 0x55 was the NOR .. checking ...
<wolfspraul>
I think we can ignore 0x55, not important in our quest for a stable design and reliable test process for 100% pass boards
<wolfspraul>
I would look at 0x48 first, really dig in there. because that board regressed!
<wolfspraul>
and then 0x3C maybe, if needed
<wpwrak>
0x55 seems bad, yes. maybe we have a NOR cluster now. but let's see when adam is through with fix2b
<wolfspraul>
don't worry about problems with the nor chip per se
<wpwrak>
rc2 used the same NOR chips as rc3 ?
<wolfspraul>
yes
<wolfspraul>
there are no problems with the nor chips
<wolfspraul>
even if there are, they are easily replaced and done
<wolfspraul>
we are not debugging nor chips
<wpwrak>
i was thinking of the interaction with the FPGA. it's a fairly complex process. FPGA apparently needs to read the NOR's configuration data, etc.
<wolfspraul>
nah. we have way too many working boards to suspect a design issue there.
<wpwrak>
can be borderline parameters
<wolfspraul>
if we knew our design and test process was 100% stable, we would replace the nor chip on 0x55 and most likely it would pass then.
<wolfspraul>
I would not look at 0x55, waste of time imho.
<wolfspraul>
0x48 and 0x3C are interesting
<wolfspraul>
(and maybe more later since Adam is not finished yet)
<wpwrak>
0x55 is scary, yes. i'd rather look at 0x3a :)
<wpwrak>
0x3a looks as if one could figure out what's going on. and it somehow almost worked in the past. so if the NOR problems have a common cause, that may provide some clues.
<wolfspraul>
0x48 is my favorite
<wolfspraul>
crystal clear test path
<wolfspraul>
everything picture perfect, but then
<wolfspraul>
let me check 0x3A...
<wpwrak>
but .. the next tests would be harder to make: write synthetic patterns, check them on the bus, read them back, etc. not the things adam usually does. well, when you send me my M1(s) maybe include 0x3a :)
<wolfspraul>
ahh. 0x3A never booted before.
<wolfspraul>
I'm not so interested in those (maybe a mistake).
<wpwrak>
yes, it never booted. that's the fly in the ointment :)
<wolfspraul>
I don't suspect a big problem with the design.
<wolfspraul>
we made rc1, rc2, etc.
<wolfspraul>
that's all fine
<wolfspraul>
it must be something small, like we already fiddled with the reset circuit 3 times now.
<wpwrak>
the reset circuit looks good on 0x3a
<wolfspraul>
let me read 0x3A notes carefully
<wolfspraul>
oh wait
<wolfspraul>
0x3A is the one from yesterday!
<wolfspraul>
no - not go back to that :-)
<wpwrak>
yes
<wpwrak>
;-)
<wolfspraul>
just replace the nor chip (we have no spares right now so cannot try)
<wolfspraul>
I'm 80% sure after replacing nor chip it works
<wolfspraul>
not so interesting
<wolfspraul>
how about 0x3C ?
<wolfspraul>
that's exactly like 0x32, with the 'pulses' etc.
<wpwrak>
mmh, i think replacing the NOR is too radical. you may just mask a real problem. i wouldn't replace the NOR of 0x3a before a) checking the data that goes in really gets corrupted before it comes out again and b) verifying the signal timing.
<wolfspraul>
the difference is that 0x32 never booted before, but 0x3A did
<wolfspraul>
sorry I meant 0x3C did
<wolfspraul>
no really, no more time into 0x3A
<wolfspraul>
it's not worth it
<wolfspraul>
look at the difference between 0x3C and 0x32
<wpwrak>
0x3a didn't boot
<wolfspraul>
they have pretty much the same state now
<wpwrak>
ah, 0x3c :)
<wolfspraul>
those crazy nor bit corruption searches take huge time and don't help us in the big picture with the run
<wpwrak>
0x32 has a long patient's history :)
<wolfspraul>
we are not fixing every board here
<wolfspraul>
we are only trying to come up with a stable design and reliable test process (!)
<wolfspraul>
so that we can start sending boards out
<wpwrak>
(nor a waste of time) dunno. i wouldn't be so quick to assume that the chips just go bad randomly.
<wolfspraul>
of course I understand the _real_ bug may hide anywhere...
<wolfspraul>
true, but we have lower hanging fruits
<wolfspraul>
compare 0x3C and 0x32
<lekernel>
by the way, have you tried assembling one complete unit already?
<wpwrak>
yeah, looking at 0x32
<lekernel>
with case, box, etc.
<wolfspraul>
sure everybody has 1 unit, I think Adam too (his own)
<lekernel>
rc3?
<lekernel>
with the case and the box?
<wolfspraul>
but not from 0x30 on and higher
<wolfspraul>
no probably not
<wolfspraul>
you worried it won't fit? :-)
<lekernel>
yes. given that absolutely everything in this run has gone wrong in one way or another, there could be surprises there as well
<wolfspraul>
nah it will fit. I'm not getting distracted on that now.
<wolfspraul>
fix2b is a big step forward, looking at today's results
<wolfspraul>
but not 100% yet, it seems
<wolfspraul>
I don't want to trample over test results and ignore them etc.
<wolfspraul>
not good
<wpwrak>
wow. 0x32 is crazy.
<wolfspraul>
well, read 0x3C now :-)
<wolfspraul>
from the boards that we have fix2b results for so far, I would look at 0x48 first
<wolfspraul>
tp36/37 is good, but it won't reconfigure currently (after rendering before)
<lekernel>
wolfspraul, maybe you should ship problem boards around (including to a Xilinx FAE) so people can look at them in parallel?
<wpwrak>
for 0x32 and 0x3c, the next thing to analyze would be to bring R60 back. if that doesn't help, try without D16 (without D16, the board is a likely NOR corruption candidate, though. so not for normal sale)
<wpwrak>
lekernel: yes, wolfgang seems to have a few boards he's already given up on. i think he could let these out.
<wolfspraul>
no, I think that will be the ultimate delay producer
<wolfspraul>
the quality and consistency of our test results would go down
<wpwrak>
worst case: something is found that fixes all these boards but is hard to apply in the field
<wolfspraul>
I made that mistake with rc2, so no way I'm going to make it again :-)
<wpwrak>
wolfspraul: rc2 went to people who didn't even turn it on :)
<wolfspraul>
if we do that we will not sell any rc3, so I won't do it
<wpwrak>
lekernel: i think he just doesn't want to spend money on fedex :)
<wolfspraul>
wpwrak: for 32/3c, bring R60 back involves bringing the long wire back as well?
<wpwrak>
wolfspraul: no no. just solder one resistor to an existing footprint
<wolfspraul>
ah ok
<wpwrak>
wolfspraul: r60 was removed as part of fix2, to lower the current on the reset chip a little
<lekernel>
by doing that we had mwalle fix the video chip (Adam and I failed), as well as me fixing the intermittent video-in failure and audio output noise
<wpwrak>
wolfspraul: but with fix2b, we're already nicer to the reset chip, so ...
<wolfspraul>
lekernel: no it wouldn't work. it would be the end of rc3.
<wolfspraul>
I will not do it.
<wolfspraul>
we need to be able to look across multiple boards.
<wolfspraul>
and if they are in different locations with different people the consistency will completely break down.
<wolfspraul>
of course all sorts of random results will pop up, and the general quality will go up
<wolfspraul>
then we can try in rc4 what the results are :-)
<wolfspraul>
that was what rc2 was for and we didn't do that well in this system
<wolfspraul>
not again
<wolfspraul>
I am not writing off rc3, no need.
<wolfspraul>
from the fix2b results so far, the only one that pops out is 0x48
<wolfspraul>
that one is not right
<lekernel>
the time sinks in rc3 are: protection circuit, counterfeit buffers, and now flash/reset circuit
<wpwrak>
maybe fly sebastien to taipei ? make R&D division do double shifts ;-)
<wolfspraul>
but it's only one board, so I suggest to wait until Adam finished the entire fix2b plan
<lekernel>
very little of that can be attributed to our shipping of rc2 boards around
<wolfspraul>
ahh :-) you can ask Adam later, without politics for an honest answer. we have seen 'similar' problems like the one we are dealing with here on rc2.
<wpwrak>
wolfspraul: so your plan is, if there aren't a lot more 0x48, just consider them outliers and go ahead ?
<wolfspraul>
but because of the way we sent rc2 out, we lost focus and consistency to get to the root causes and eliminate them for rc3.
<wolfspraul>
that's my analysis
<wolfspraul>
hmm
<wolfspraul>
we can make that judgment
<lekernel>
we did eliminate the video in instability and the audio noise
<wolfspraul>
0x48 is tough though
<wolfspraul>
lekernel: yes! :-) so those we don't have to worry about now :-)
<wolfspraul>
wpwrak: would you be willing to ignore 0x48 and assume our design is stable and our test process is reliable?
<lekernel>
also, a lot of successful improvements between rc1 and rc2 were done in a more 'distributed' way
<wpwrak>
wolfspraul: the thing is that, if something needs deep analysis, the current process is very inefficient. so if you can exclude needing deep analysis, then you're right not to spread the work
<wolfspraul>
sending boards anywhere now will delay rc3 sales by at least a month
<wolfspraul>
just saying
<wpwrak>
wolfspraul: (0x48) i think i'd want to know if the board responds to environmental parameters
<wolfspraul>
I will simply refuse to sell CRAP.
<wpwrak>
wolfspraul: (1 month) i don't think so. pick problems you don't expect to be able to analyze with the current process. then you can only win (well, minus the shipping cost)
<wolfspraul>
so as long as it's crap, I keep improving on it, until it's not crap anymore :-)
<wolfspraul>
nah there are no such problems
<wolfspraul>
I am reading the test results, not speculating or ranting at whatever targets come to my mind.
<wolfspraul>
the problem we have that stops rc3 sales is very isolated
<wolfspraul>
and _almost_ eliminated
<wpwrak>
again, the current process is inefficient for deep analysis. it is efficient, though, for things that need broad rework with non-trivial parts.
<wolfspraul>
any result that would come in from anywhere non-taipei delays sales for over a month
<wolfspraul>
I'm still more optimistic.
<wpwrak>
another problem with the current process is that, if adam makes any systematic mistakes, they may go undiscovered. debugging his workflow is very time-consuming.
<wolfspraul>
plus any board that goes anywhere quickly falls out of the logic with which we can right now still compare boards and group them
<wpwrak>
yes, that's true
<wolfspraul>
yes. so let's have 90 people produce 1 board each :-)
<wolfspraul>
anyway I can say clearly that I have 100% trust in rc3 and our current approach
<wolfspraul>
that design is good
<wpwrak>
naw, you don't have so many people :) mwalle is on vacation, so lekernel, me, anyone else ?
<wolfspraul>
:-)
<wolfspraul>
here's the procedure: Adam first finishes the fix2b test plan.
<wolfspraul>
I don't want to interrupt him now.
<wolfspraul>
if 0x48 is super isolated, maybe all is good already
<wpwrak>
finishing fix2b is good, agreed
<wpwrak>
(trust in the process) there's a thing colloquially called "get-there-ism". it's the determination of following one's current path of action to achieve a certain objective, ignoring evidence that this may not be possible. it's a well-known phenomenon in the aviation industry. makes planes crash a few meters from the runway, with empty fuel tanks, because the pilot didn't want to divert.
<wpwrak>
(get-there-ism) this is something to watch out for. it's easy to get caught up in it :)
<wolfspraul>
these are the ones missing: 0x61 0x63 0x6B 0x6C 0x77 0x7A 0x7D 0x7F 0x85
<wpwrak>
so .. fix2b completion ~saturday evening
<wpwrak>
then re-clustering
<wpwrak>
i dislike those where the TP36/TP37 voltage goes crazy after fix2b again
<wolfspraul>
we had that yesterday already, no?
<wpwrak>
with 0x3a ? no. that was rock solid.
<wolfspraul>
0x32
<wpwrak>
(0x3a, the TP36/37 results)
<wpwrak>
ah, we never got to having a good look at 0x32
<wolfspraul>
man I just think through sending a board to Werner. so painful. the number of dead-end scenarios is staggering. which one? 0x3A?
<wolfspraul>
on 0x3A, Werner would disappear into nor analysis land
<wolfspraul>
he would never replace the chip because a) he doesn't have a spare b) he needs to study that board because that's the one he has
<wpwrak>
he'd run a few more tests, that's for sure :)
<wolfspraul>
0x32 ? total guess. what if after 2 hours we find it's some plain and simple problem somewhere, without any relation to the fix2/fix2b issues?
<wolfspraul>
see that's the problem
<wpwrak>
when do the new NOR reach adam anyway ? i think he has ordered some, no ?
<wolfspraul>
of course it would be great if more people would be where the boards are, in one place
<wolfspraul>
because then we could parallelice
<wolfspraul>
lize
<wolfspraul>
but sending a board out? argh
<wpwrak>
i'm still confused about 0x32 and 0x3c. TP36/TP37 is inconsistent with what we know so far.
<wolfspraul>
that's the end, I really think it through
<wpwrak>
err, TP36/TP37 going wild
<wolfspraul>
the impossible to answer question is already the first one: _WHICH_ board? :-)
<wolfspraul>
best would be all 90
<wolfspraul>
and then overnight magically back to Taipei for assembly and packig
<wolfspraul>
packing
<wpwrak>
(which) 0x3a would be a good start :)
<wolfspraul>
after we have design and test process stable, we will have plenty of boards
<wpwrak>
hehe :)
<wolfspraul>
yeah I know you like that one
<wolfspraul>
NOR study galore
<wpwrak>
you have to start somewhere :)
<wolfspraul>
but not on a one-off NOR chip problem
<wolfspraul>
at least not while the rc3 run is in showstopper mode
<wpwrak>
what i'd look for is whether the issue is on the NOR side or the FPGA side. i think we can determine this.
<wolfspraul>
you will get plenty of boards
<wpwrak>
then it becomes a question of bad chip or bad connectivity
<wolfspraul>
but now we have to focus on consistent rc3 quality
<wpwrak>
bad connectivity can be probed on the NOR. not so easily on the FPGA (well, you can flex the board a little, see if it gets worse)
<wolfspraul>
otherwise no rc3 can be sold, and I will continue to work on getting them to be good
<wolfspraul>
absolutely not exhausted yet, just warming up
<wolfspraul>
ok 9 more to go
<wpwrak>
bad connectivity could point to an SMT process issue. of course, you wouldn't like to have any such thing pop up :)
<wolfspraul>
will be interesting
<wolfspraul>
the pulse thing is nasty because we think we overlooked something in the reset circuit
<wpwrak>
bad chips are easier. once we're sure they're just bad, swap them, new chip, new luck
<wolfspraul>
and 0x48 is nasty because it falls back from rendering to unreconfigurable
<wpwrak>
pulse thing would be which board ?
<wolfspraul>
0x32 and 0x3C
<wolfspraul>
'bad' chip may come from the process
<wolfspraul>
I think we will see more with pulses
<wpwrak>
0x32/0x3c also show regressions. they're regressing to a pre-fix2b behaviour
<wpwrak>
(bad chip) yes, could just be a bad SMT profile
<wolfspraul>
ok, I meant regression as in pass the test software, but then fail afterwards
<wolfspraul>
no - not SMT profile
<wpwrak>
yup, 0x48 has a high-level regression
<wolfspraul>
we are way past that. the design is mostly good, the process is mostly good.
<wolfspraul>
there are no fundamental issues.
<wolfspraul>
the schmitt-trigger was a fundamental issue - fixed.
<wolfspraul>
what we have now are statistical and manufacturing issues.
<wpwrak>
or maybe the through-hole pass didn't agree with all the components. no idea what this one actually is.
<wpwrak>
schmitt-trigger was the fake part ?
<wolfspraul>
but our inability to test for 100% good boards means we cannot sell anything!
<wolfspraul>
some were irregular, yes
<wolfspraul>
we replaced all 270, done
<wolfspraul>
one thing is funny in this
<wolfspraul>
I just recently realized we should add a few 'render cycles' after the test program.
<wolfspraul>
don't know why, intuition
<wolfspraul>
I felt uneasy that we never let the board do what it's supposed to do with our users.
<wolfspraul>
which is to - RENDER
<wpwrak>
well, worst case, you can decide to just sell them with the promise to replace all of them in case something major pops up. better than losing rc3 entirely.
<wpwrak>
yeah, end user testing is missing, too :)
<wolfspraul>
and now these render cycles are what give us the most unsettling feedback about our test process, even our design.
<wolfspraul>
good catch Wolfgang!
<wolfspraul>
I don't mind the unsettling feedback, I can handle that.
<wolfspraul>
no way, you don't know how expensive support is
<wolfspraul>
the render cycles are a godsend
<wolfspraul>
very good
<wolfspraul>
it lifts m1 to the next level
<wolfspraul>
I already want to do 1h render testing on each board :-)
<wolfspraul>
or 24h :-)
<wpwrak>
;-)
<wolfspraul>
I'm sure if we would do that, we would find more issues.
<wpwrak>
next you'll want the temperature chamber :)
<wolfspraul>
isn't it funny. if we remove the render cycle test (which we did not have in rc2), we would already sell now :-)
<wolfspraul>
keep that in mind when complaining
<wolfspraul>
so I think we should not go berserk, no 24h test etc. but we should do a few render cycles, yes.
<wolfspraul>
and we have to handle the fallout.
<wpwrak>
i'm not complaining about the meticulous process. i'm merely suggesting that you could widen your bottleneck :)
<wolfspraul>
fly to Taipei
<wolfspraul>
that widens the bottleneck
<wolfspraul>
you can be there Saturday, no? :-)
<wolfspraul>
anyway just kidding. sometimes it just needs a bit of relaxation. I will think more about get-there-ism.
<wpwrak>
of course, there's no guarantee. e.g., further analysis on a problem board could be inconclusive, the board may suffer additional failures on its journey, just the shipping may take too long for the results to be meaningful, etc.
<wolfspraul>
oh sure, it wouldn't help with the rc3 sales showstopper problem at all.
<wolfspraul>
it would be nice for rc4 though
<wpwrak>
(fly to tpe) heh, i'd also want my lab :) some of adam's equipment is pretty marginal.
<wolfspraul>
it would even harm the rc3 showstopper resolution because it takes valuable data from Adam (from a consistent overview)
<wolfspraul>
so I'd rather pick 'safe' boards for this strange exercise, which defeats the purpose already. bottom line: it doesn't work.
<wolfspraul>
I've been in too many runs and tried too many things.
<wpwrak>
(overview) only if he ever plans to return to those boards for analysis.
<wolfspraul>
most important is consistency (for what we are trying to improve now).
<wpwrak>
my view is simply that, before you've isolated a problem, you don't know whether it's a one-off or something systemic. the current approach of having lots of boards is good for common problems. you get a lot of data, can do clustering, etc., and you can modify a lot of boards and gather many new results. very useful.
<wpwrak>
however, when you run out of these big clusters, then you need to track down seemingly individual problems. and there, the mass analysis approach doesn't scale.
<wpwrak>
so you really have distinct phases: first, get the lay of the land. second, examine the widespread issues and apply the corresponding mass cure. go back to step one and repeat until things settle.
<wolfspraul>
fix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good
<wolfspraul>
that's my grouping
<wolfspraul>
very good results so far!
<wolfspraul>
remember these are all boards that had problems before
<wolfspraul>
if people don't see the progress, well, sorry, I cannot help
<wolfspraul>
huge progress
<wolfspraul>
fix2b is fantastic
<wpwrak>
phase 2: hunt down the ones that are weird. then see if this yields anything to apply to the rest. e.g., new, targeted experiments.
<wolfspraul>
a life saver
<wpwrak>
yeah, fix2b is good
<wpwrak>
but ... 0x3a and 0x3c worry me
<wolfspraul>
those 2 pulse boards are strange
<wpwrak>
err, 0x32 and 0x3c
<wolfspraul>
they charge us
<wolfspraul>
and 0x48 is an insult
<wolfspraul>
:-)
<wolfspraul>
but we get to it, really
<wpwrak>
i don't like boards to regress to a pre-fix2b state. that's not supposed to happen :)
<wolfspraul>
I don't know which board to send where now, it just doesn't work. I hope for some understanding.
<wolfspraul>
let's wait for the other 9 boards
<wolfspraul>
but those results are really good. all of those boards didn't work before - keep in mind.
<wpwrak>
well, you can think a bit about which boards you'd want to send where :)
<wolfspraul>
we took them all out of the fail pool!
<wpwrak>
pity that the weekend is near. so unless you make a quick decision, you lose 1-2 days of fedex transit. but i can see the scheduling conflict, also with adam.
<wpwrak>
he really needs an assistant :)
<wolfspraul>
oh of course, we think about that carefully. and there will be enough selection. but I need to suck out anything valuable from the perspective of the entire run and yield first.
<wolfspraul>
I did not do this well on rc2, plain and simple. over-excited I guess.
<wolfspraul>
my problem, but this time I fix it.
<wolfspraul>
also if not, rc4 would bankrupt me :-) (I wouldn't do it unless rc3 was under control)
<wolfspraul>
Adam knows this too, we have to catch more cases here, otherwise the next run will totally blow up.
<wolfspraul>
we should not forget - Adam has far more production experience than we do. many runs of many thousand units, even some runs with millions I think.
<wolfspraul>
so it's not like we tell him - solder here, solder there. our solder monkey. Adam knows what is needed to get the _manufacturing_ quality (yield, efficiency, etc) up.
<wpwrak>
well, he currently does operate a bit in the solder monkey way. and i agree, that's not a very efficient thing to do.
<wpwrak>
that's again some cost of the centralized approach. he has all the boards, but he's also got the only hands available for solder monkeying.
<wolfspraul>
yes but I'm saying we (Adam and me) know what we need to do for rc4, and we will do it because we want a successful rc4.
<wolfspraul>
oh sure, that I agree with. but our resources are limited there. no need to argue with me that it would be better to have more people for this, in Taipei.
<wolfspraul>
or wherever the run is. the testing needs to be fast and efficient and in one place.
<wpwrak>
would need to see TP36 for a better picture. real trouble is probably there, TP37 just follows it
<wolfspraul>
if you look at the 0x3C testing notes, this is after fix2b applied
<wolfspraul>
maybe another malfunctioning part in the same circuit
<wpwrak>
looks like one of those fix2b regressions. maybe adam really needs a new soldering iron :)
<wolfspraul>
if it's a malfunctioning part that's no problem at all actually
<wolfspraul>
then the validity of fix2b still stands
<wolfspraul>
we don't even need to look into it in that case
<wolfspraul>
question is whether we want to make that assumption :-)
<wpwrak>
i think fix2b is valid, no matter what else we observe. it undoes an unnecessary extension of the reset circuit.
<wolfspraul>
so I suggest 0x48 first (but before even that wait for full fix2b results)
<wolfspraul>
things really look good, I am not worried right now.
<wolfspraul>
that's all I can say and now I get some tasty dinner! :-)
<wpwrak>
what bothers me is that we see so many weird effects on D16. here's a "test plan":
<wpwrak>
- if the voltages are "weird", scope TP36 and TP37 and archive the screenshot
<wpwrak>
- inject current from a 3.3 V source into TP36 and measure how much current flows
<wpwrak>
- if the current is low, add D60, and try again
<wpwrak>
- if the current is high (> 1 mA), stop for further headbanging
<wpwrak>
well, or make this 100 uA even
<wpwrak>
sorry, no D60. R60.
<wpwrak>
well, if the current is high, remove C238, then try again
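The current-injection part of the test plan above can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `Action` and `next_step` are hypothetical, the default threshold is the 1 mA first proposed (100 uA being the stricter alternative mentioned), and the scope-and-archive step is not modelled.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the two rework steps named in the chat.
    ADD_R60 = "add R60, then test again"
    REMOVE_C238 = "remove C238, then test again"


def next_step(tp36_current_a, threshold_a=1e-3):
    """Pick the next rework step for a board with weird TP36/TP37 voltages.

    tp36_current_a: current in amperes flowing into TP36 when fed from a
                    3.3 V source.
    threshold_a:    boundary between "low" and "high" current. 1 mA was the
                    first proposal, 100 uA the stricter alternative.
    """
    if tp36_current_a > threshold_a:
        return Action.REMOVE_C238
    return Action.ADD_R60
```

For example, `next_step(500e-6)` suggests adding R60 with the 1 mA limit, while `next_step(500e-6, threshold_a=100e-6)` suggests removing C238.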
<wpwrak>
removing C238 can cause FLASH_RESET_N to PROGRAM_B contamination, but that should be relatively benign
<wpwrak>
i.e., if i understood lekernel correctly, what would happen is that, if you try to command a "software reset" (from the GUI or such), the M1 would just shut down.
<wpwrak>
heh, if we remove the "software reset" feature, we could even connect PROGRAM_B to FLASH_RESET_N, throw away C238 and D16 for good ;-)
<wpwrak>
sometimes, all a puzzling knot needs is a good sword ;-)
<wpwrak>
(but don't try this - you'd also have to change the use of P22. else, you could drive P22 high into reset out low. not sure what happens then.)
<lekernel>
if giving up the software reset on the rc3 boards prevents those already huge delays from growing up even further, i'm for it
<wpwrak>
lekernel: i was just waiting for you to say this ;-))
<wpwrak>
anyway, up to and including C238 removal, i think the above test plan looks reasonable, doesn't it ?
<wpwrak>
if we end up with C238 removed, we can then figure out what to do about it
<wpwrak>
wolfspraul: background / memory refresh: C238 protects PROGRAM_B from falling edges on INIT_B or FLASH_RESET_N propagating through the diodes. a falling edge on FLASH_RESET_N can happen (only ?) when a software reset is commanded, e.g., through the GUI. in this case, propagation into PROGRAM_B would also reset the FPGA, which according to lekernel just shuts down the M1. (why does it shut down and not just reconfigure ?)
<wpwrak>
wolfspraul: INIT_B would drop when there's a CRC error. so contamination of PROGRAM_B would create a feedback loop, where each failed try to configure would reset the FPGA. that sounds undesirable. with fix2b, we already remove INIT_B from the equation, leaving only the much friendlier FLASH_RESET_N connection.
<lekernel>
wpwrak: reconfigure to standby bitstream = shutdown
<lekernel>
the nasty problem we may have here, though, is that the reset pulse may not be long enough for the flash
<lekernel>
so that can become another headache
<lekernel>
because as soon as the fpga is deconfigured by program_b, the reset will be deasserted immediately
<wpwrak>
why does reconfigure to standby mean shutdown but initial configuration to standby means that the system starts ? how do the two paths diverge ?
<wpwrak>
(glitch on flash reset) yeah, could be tricky. the NOR wants at least 100 ns.
<wolfspraul>
no even in initial configuration, it ends with the standby bitstream and you have to press the middle button to actually boot further (start)
<wolfspraul>
man we need to get one of those boards to you :-)
<wolfspraul>
how about one of the good ones? including fix2b. 0x34 ?
<wolfspraul>
let me check the history of that one
<wolfspraul>
yeah looks perfect. a typical rc3 story :-)
<wolfspraul>
I had an evil thought on 0x48: nor corruption after first power-down. in that case we may have to try the 4.4v reset ics...
<wolfspraul>
but we see later what we find
<wpwrak>
(middle button) aah, now it makes sense, thanks :) and now i also understand why it's called "standby" bitstream :)
<wpwrak>
0x48. yes, that would be a possibility. bring it up and read back the NOR. we now know that urjtag works :)
<wolfspraul>
that'd be the worst case. nor corruption on 0x48 requiring a reset ic rework on the entire run :-)
<wolfspraul>
and then who knows it may not even fix the nor corruption... well, think positive.
<wpwrak>
wolfspraul: actually, you could go to taipei to help adam ;-) alas, it seems that you'd then have to relocate rejon as well
<wolfspraul>
we did see a nor corruption on rc2 (xiangfu), and also on 0x3A (unexplained, I'm just leaning towards 'replace nor chip' right now)
<wpwrak>
i don't like these "replace the chip" operations. at least not without having isolated the fault. otherwise, you just roll the dice and you have no idea where they fall.
<wolfspraul>
yes and no. as long as it's efficient it may scale up well into the thousands of units.
<wolfspraul>
our difficulty is our own uncertainty into the design and our test process
<wolfspraul>
that complicates things
<wolfspraul>
now we have too many unknowns
<wolfspraul>
so we cannot effectively kill the bugs
<wpwrak>
and of course, if the problem is anywhere NOR-related, you may very well make it go away. e.g., by eliminating all the parts with tolerances in the region of the bell curve the design doesn't cover :) (and, of course, in the next run, you'll run into more of the same again)
<wolfspraul>
if at the same time you question your design, your test process, and the chips, what then?
<wolfspraul>
we need to get the design and test process off the table first
<wolfspraul>
no matter what
<wolfspraul>
the design and test process must be of unquestionable standard
<wpwrak>
that's when you need systematic analysis :) yes, you may waste your time on random freak accidents. but chances are there's more to these things.
<wolfspraul>
otherwise we can never manufacture effectively
<wpwrak>
the test process is a separate issue
<wpwrak>
right now, we're still trying to find causes
<wolfspraul>
good news 0x61 0x63 also good
<wolfspraul>
fix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good / 0x61 good / 0x63 good
<wolfspraul>
7 good, 2 pulse, 2 nor (my grouping), 1 render then fail
<wolfspraul>
7 more to go: 0x6B 0x6C 0x77 0x7A 0x7D 0x7F 0x85
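The tally above can be reproduced mechanically from the status line. The sketch below assumes the "board group / board group" format used in the chat. The helper name `parse_status` is hypothetical.

```python
from collections import Counter

# Status line as posted in the chat (fix2b results so far).
status_line = (
    "0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / "
    "0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / "
    "0x5C good / 0x61 good / 0x63 good"
)


def parse_status(line):
    """Return {board_id: group} parsed from a 'board group / ...' line."""
    boards = {}
    for entry in line.split("/"):
        # The first token is the board ID, the rest is the group name.
        board, _, group = entry.strip().partition(" ")
        boards[int(board, 16)] = group
    return boards


boards = parse_status(status_line)
tally = Counter(boards.values())
print(dict(tally))
```

Keeping the board IDs as integers makes it easy to compare against the list of boards still waiting for fix2b.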
<wolfspraul>
the good news about 0x48 is also that it failed on the second render cycle, i.e. after the first power cycle
<wolfspraul>
I like that much better than failing on the 6th or 9th one as we saw before
<wpwrak>
(0x61, 0x63) great !
<wpwrak>
(0x48) will be interesting to see the NOR dump
<aw_>
(0x6B, 0x6C) good
<wolfspraul>
aw_: great!
<aw_>
let's dump 0x48 now
<aw_>
dumping...
<wolfspraul>
there you go
<wolfspraul>
already?
<aw_>
needs 5 minutes to dump. :)
<wolfspraul>
aw_: yes but the reading seems to work
<wolfspraul>
?
<wolfspraul>
so 0x48 cannot reconfigure now?
<wolfspraul>
maybe after the dumping you try to boot, just to see whether it's still stuck somewhere (cannot boot)
<aw_>
it's said that but I've never calculate it
<wolfspraul>
aw_: we followed the wiki a bit today, excellent work!
<aw_>
0x48 is quite a little same with 0x3a yesterday we did
<wolfspraul>
fix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good / 0x61 good / 0x63 good / 0x6b good / 0x6c good
<wolfspraul>
9 good, 2 pulse, 2 nor (my grouping), 1 render then fail
<wpwrak>
i like the trend in the last few :)
<wolfspraul>
5 more to go: 0x77 0x7A 0x7D 0x7F 0x85
<aw_>
so even d2/d3 is still dimly lit and make sure tp36 and tp37 is fully pull high , also init_b is okay, then it tried to enter reconfiguration stage
<wolfspraul>
aw_: maybe I misread the 0x48 notes? the 0x48 notes say that this board rendered, and then failed after the first power cycle?
<wolfspraul>
aw_: yes but did 0x48 render before?
<wpwrak>
(5 to go) kewl. that was really a productive day.
<larsc>
wpwrak: do you happen to know any rtc chips, which could be used for the milkymist?
<aw_>
wolfspraul, the 0x48 has never rendering successfully before.
<wolfspraul>
oh!
<wolfspraul>
that's good
<wolfspraul>
then I misunderstood the notes, one sec
<wolfspraul>
aw_: 0x48 notes are saying "5. applied fix2b 6. D16(in-circuit): For.V.=152mV, Rev.V = 1548mV 7. d2/d3 is fully off after power on 8. reflashed successfully 9. cant reconfigure @2nd rendering, tp36/tp37 is 3.3V "
<wolfspraul>
see 9. can't reconfigure @2nd rendering
<wpwrak>
so "@2nd rendering" really means "@2nd power cycle" ?
<aw_>
sorry that i should say in the first round test, it has never rendering before
<wolfspraul>
did the test software run?
<aw_>
then after fix2b, can't reconfigure at 2nd power cycle
<wolfspraul>
I don't understand the notes
<wolfspraul>
aw_: did the test software run on 0x48 ?
<aw_>
yes, it's passed in test program
<wolfspraul>
then after the test software, you power cycle?
<wolfspraul>
well but that's even better than I thought
<wolfspraul>
aw_: how do you feel about fix2b today, and the boards you worked on?
<aw_>
see the last bottom, you can see I've only copied first one time boot up log
<wpwrak>
eagerly awaits sinking his greedy fingers into the dump of 0x48 :)
<aw_>
i have a strange feelings that if the board has smoothly and fully passed (including rendering )in FIRST run (fix2 circuit) , then the applied fix2b will be also passed the rendering job
<wpwrak>
(i actually thought of extending my bit error checker to look for algorithmic patterns in the address bits. could be fun.)
<wpwrak>
yeah, if a board was happy with fix2, it should be only happier with fix2b.
<wolfspraul>
aw_: ok I don't fully understand the 0x48 process, but maybe you can update the notes a little with what you remember. maybe like werner said "@2nd power cycle" or "@1st boot to render"
<aw_>
oka..sure
<wolfspraul>
still strange how 0x48 failed, oh well
<aw_>
good note from werner, I'll do like that
<aw_>
i don't know, just second power-cycle then byebye
<aw_>
:-)
<wpwrak>
passing with fix2 and failing with fix2b could have the following explanations: 1) some rework mistake in fix2b, 2) something was borderline and went over the limit (e.g., a temperature dependency)
<wpwrak>
aw_: how's the 0x48 dump coming along ?
<aw_>
second
<wolfspraul>
wpwrak: which board?
<wolfspraul>
aw_: I think the results today are super encouraging.
<aw_>
dump done...let me mv
<wpwrak>
wolfspraul: no, in general
<wolfspraul>
we are on a very good path with fix2b
<aw_>
wolfspraul, yes, no; i still feel something's strange though.
<wolfspraul>
aw_: today, you set 9 boards to 'available' status, that means 90 thirty second rendering cycles. and not a single board failed after the test software, with the exception of 0x48 which failed right after.
<wpwrak>
it seems that the diodes are unreliable. with fix2b alone, we're removing about 50% of the unreliability :) and with the extra testing adam does as part of fix2b, most of the rest as well
<aw_>
tomorrow when I test all cluster batch boards, then back to work with werner to check failed board
<wolfspraul>
yes exactly
<wolfspraul>
but I am still happy - look at the numbers I just said - because we can now safely distinguish between 100% good and failed boards
<wolfspraul>
and I do trust the ones that are 100% good
<wolfspraul>
they are stable and good and will stay like that
<wpwrak>
yes, looks pretty good now
<wolfspraul>
we can do another 10 render cycles on them in 2 days to verify
<wolfspraul>
never say never
<wpwrak>
heh :)
<wpwrak>
wait for a hot day
<aw_>
wpwrak, it could be on diode. but this leave tomorrow to check. since d16's one terminal was soldering twice: one is my fix2, the other is to take apart for fix2b.
<wolfspraul>
aw_: yes we notice, sure. 0x3C, 0x48, 0x55
<wolfspraul>
there are still problems
<wolfspraul>
this bloody diode has to go in rc4 :-)
<wpwrak>
wolfspraul: ah, and the reflashing may need some clarification: is the reflash script as adam uses it supposed to do a verification (e.g., CRC) ? because it appears that it doesn't do this
<wolfspraul>
no
<wpwrak>
yeah, the diode is evil
<wolfspraul>
the problem is that jtag verification is too slow
<wolfspraul>
crazy slow
<wpwrak>
waiting for the dump
<wolfspraul>
that's why we added crc checks to the test software
<wpwrak>
wolfspraul: hmm. okay, so writing with urjtag is unreliable. okay.
<wolfspraul>
not unreliable
<wolfspraul>
verification is too slow to be practical (30 minutes or more)
<wpwrak>
wolfspraul: unchecked = unreliable ;-)
<wolfspraul>
don't know why, we can easily enable it
<wolfspraul>
but then it's crazy slow
<wolfspraul>
so we check crc in the test software, which runs right after urjtag
<wolfspraul>
and the results are logged
<wpwrak>
wolfspraul: that doesn't make sense ;-) if write + read is faster than write + verify, something doesn't add up :)
<wolfspraul>
that could finally be a software bug! :-)
<wpwrak>
i'd re-write, read back without power-cycling, then power-cycle and see what happens
<wolfspraul>
we are moving uuupppp!
<wpwrak>
yes :)
<wolfspraul>
lekernel: man this is great!
<aw_>
wpwrak, yes, different from 0x3a though
<wolfspraul>
the nor becomes so stable now that we can see actual software bugs making it all the way back in (well, likely software bugs)
<wolfspraul>
I think that's good news
<wolfspraul>
the hardware becomes stable...
<lekernel>
what kind of software bug?
<lekernel>
urjtag?
<wpwrak>
aw_: yes, very different. 0x3a has all the errors on the same bit and scattered over many many addresses
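The two failure signatures described here (errors always on the same bit across many addresses, versus a whole word zeroed) can be told apart by comparing a dump against the image that was written. The sketch below is not the bit error checker mentioned in the chat. The function names are hypothetical, and the 16-bit little-endian word size is an assumption.

```python
from collections import Counter


def diff_words(reference, dump, word_bytes=2):
    """Yield (word_index, expected, actual) for every mismatching word."""
    limit = min(len(reference), len(dump))
    for off in range(0, limit - limit % word_bytes, word_bytes):
        exp = int.from_bytes(reference[off:off + word_bytes], "little")
        act = int.from_bytes(dump[off:off + word_bytes], "little")
        if exp != act:
            yield off // word_bytes, exp, act


def summarize(reference, dump, word_bytes=2):
    """Return (bad_word_count, flips_per_bit, zeroed_word_indices)."""
    flips_per_bit = Counter()
    zeroed = []
    bad = 0
    for idx, exp, act in diff_words(reference, dump, word_bytes):
        bad += 1
        if act == 0:
            zeroed.append(idx)
        changed = exp ^ act  # set bits mark positions that differ
        for bit in range(word_bytes * 8):
            if (changed >> bit) & 1:
                flips_per_bit[bit] += 1
    return bad, flips_per_bit, zeroed
```

A dump where `flips_per_bit` has a single key and many bad words would match the 0x3a pattern. A single entry in `zeroed` would match what was seen on 0x48.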
<aw_>
wpwrak, so you want me to re-write/reflash it again?
<wpwrak>
aw_: if you don't feel too tired, yes please
<aw_>
wpwrak, yes, i noticed that.
<wolfspraul>
lekernel: no worries, I was half joking. just extrapolating what it could be...
<aw_>
wpwrak, wait
<aw_>
should we use xilinx tool?
<wolfspraul>
Werner just saw an entire word in nor zeroed out.
<wpwrak>
aw_: and then read back before power-cycling, so that we can see whether the writing was okay
<wpwrak>
aw_: naw, urjtag is fine
<wolfspraul>
aw_: no use urjtag, I trust it
<aw_>
aalright...let's use urjtag first. ;-)
<wpwrak>
wolfspraul: it's a bit too early to blame sw. could also be a urjtag glitch for all we know. or powering down.
<wolfspraul>
yes yes sure
<wolfspraul>
I was just expressing my joy
<wolfspraul>
a full word!!!
<wolfspraul>
we are clearly moving upwards
<wolfspraul>
actually, in that theory, it must have happened before flickernoise
<wolfspraul>
but anyway, just speculation
<wolfspraul>
I don't care much because this was caught by the test process
<wolfspraul>
and safely caught, not at last second
<aw_>
reflashed done
<aw_>
now dump again.
<aw_>
wpwrak, bad...sorry that I didn't notice that you wanted to dump without power-cycling...
<aw_>
i redo now..sorry
<wpwrak>
naw, take this one then
<aw_>
wpwrak, also okay?
<wpwrak>
yeah
<aw_>
alright
<aw_>
phew~ almost my finger to power off. :)
<wpwrak>
heh :)
<wpwrak>
if 0x48-2 is okay, which is what i'd expect, then the power cycle didn't matter. in case we find an error also in 0x48-2, the power cycle will need investigating.
<wolfspraul>
if it were that easy, we are lucky
<wolfspraul>
wpwrak: it could well be 1 out of 10 power cycles
<wolfspraul>
remember that we are zooming in on troublemakers in a run. whenever you do that your cases get stranger and stranger.
<wpwrak>
wolfspraul: it could be, yes.
<wolfspraul>
don't forget all the dozens of boards in hundreds of tests that have never shown anything like this
<wolfspraul>
and how we are looking at the one time that we saw this
<wpwrak>
wolfspraul: we also have the risk of an undefined power state in the down ramp in all of rc3
<wolfspraul>
yes sure, I know
<wolfspraul>
I am aware of it
<wolfspraul>
but today, we had 9 boards pass a total of 90 rendering power cycles
<wpwrak>
i hope applying locking wherever possible will reduce the risk of the down ramp doing too much damage
<wpwrak>
would be good to have a CRC check for the unprotected partitions, though
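A per-partition CRC check of this kind could look roughly like the sketch below, run on a dump taken before and after a power cycle. Everything specific is an assumption: the partition names and offsets are made up for illustration, and `zlib.crc32` stands in for whatever CRC the test software actually uses.

```python
import zlib


def partition_crcs(image, partitions):
    """Map partition name -> CRC-32 of that slice of the flash image.

    partitions: {name: (start_offset, length)}; a hypothetical layout.
    """
    return {
        name: zlib.crc32(image[start:start + length]) & 0xFFFFFFFF
        for name, (start, length) in partitions.items()
    }


def changed_partitions(before, after):
    """Return the sorted names whose CRC differs between two snapshots."""
    return sorted(name for name in before if before[name] != after.get(name))


# Illustrative layout only, not the real M1 partition table.
layout = {"part_a": (0, 256), "part_b": (256, 768)}
```

Comparing `partition_crcs(dump_before, layout)` with `partition_crcs(dump_after, layout)` would then show which unprotected regions changed across the power cycle.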
<wolfspraul>
aw_: if you do 10 full render cycles + crc checks on 0x48, I think you can add it to avail - fix2b
<wolfspraul>
I wouldn't know why not
<aw_>
d2 is ON..and rendering
<wolfspraul>
but you can also do that tomorrow, it gets later and later and it was a long day...
<aw_>
yes..i'd go for sleep ...hehe ;-)
<wolfspraul>
I'm thinking whether we should be suspicious about 0x48, but I don't see why
<wolfspraul>
yes
<wolfspraul>
thanks for the excellent work today!
<wolfspraul>
fantastic, really
<wolfspraul>
so many boards
<aw_>
but good that we caught known issue on 0x48 today
<wolfspraul>
very enlightening fix2b results
<wolfspraul>
well
<wolfspraul>
we will do some more thinking
<wolfspraul>
remember this in the notes history...
<aw_>
wpwrak, thanks a lot though.
<aw_>
okay
<wolfspraul>
we can hold onto 0x48 for a while
<wolfspraul>
but I am 99% sure there's no problem with 0x48
<aw_>
alright...night
<wpwrak>
a great day indeed !
<wpwrak>
aw_: sweet dreams ! :)
<aw_>
k
<wolfspraul>
wpwrak: 0x48 is one of those cases that I would/might end up holding back or not selling
<wolfspraul>
I always sell the best things first
<wolfspraul>
but it's too early to tell. if we find a clear software bug one day then it changes.
<wolfspraul>
I think we can leave 0x48 alone now.
<wolfspraul>
so 0x3C/0x32 are interesting, or maybe the ones that I grouped as 'nor failure' (0x3A/0x55)
<wolfspraul>
wpwrak: do you have any idea for an rtc chip we could add to rc4?
<wpwrak>
0x48 may just be the first one to exhibit a down ramp corruption
<wolfspraul>
nah
<wolfspraul>
very speculative, almost wishful thinking
<wpwrak>
(rtc chip) no idea :)
<wpwrak>
i don't exactly "wish" for down ramp corruption ;-)
<wolfspraul>
no but it's too speculative for me - no reason
<wolfspraul>
will think more
<wolfspraul>
adam did a great job today, lots of hard data
<wolfspraul>
fix2b looks good, all on track
<wpwrak>
i think we may currently have a very low probability of encountering down ramp corruption. maybe it needs a bus access plus the right power drop. a synthetic test may be able to make it happen more often.
<wpwrak>
or maybe down ramp corruption never happens and this was something else
<wpwrak>
maybe it's a one in a hundred years sw bug :)
<wolfspraul>
here's an important question: should adam investigate 32/3c or 3a/55 first, or first proceed with fix2b across all 90 boards?
<wpwrak>
hmm, let's give 0x32/0x3c a try first. maybe there's a low-hanging fruit there. in 0x3a, we already know that things are a bit harder.
<wolfspraul>
ok, but with time limit probably
<wolfspraul>
I feel good about fix2b across all 90 boards
<wolfspraul>
calling it a day as well, n8 (reading backlog tmr)
<wpwrak>
0x55 looks worse than 0x3a
<wpwrak>
0x32/0x3c still don't act as a fix2b'ed board should
<wolfspraul>
yes but can 0x55 raise or lower fix2b validity? I doubt it...
<wolfspraul>
same for 32/3c. just some small problem on those particular boards, nothing to do with fix2b.
<wpwrak>
0x3a and 0x55 don't affect fix2b. 0x32/0x3c might.
<wolfspraul>
the more reworks we make, the more manual mistakes we introduce into the run, which then have to be fixed again.
<wpwrak>
but if we find something "interesting" in 0x3a/0x55, it may make sense to include it in the post-fix2b testing, to save time.
<wolfspraul>
ok so 32/3c first, I guess
<wpwrak>
(manual errors) yes, that could very well be the problem of 0x32 and 0x3c
<wolfspraul>
once we can safely assume that, there is no value in looking at them at all, even until after rc3 sales start (not just fix2b verification)
<wolfspraul>
but let's ping them quickly, see what we find, then decide
<wpwrak>
yeah, we won't know for sure before we've fixed them :)