#milkymist on 2011-08-22 — irc logs at freenode.irclog.whitequark.org

00:24 <wpwrak> hmm, regarding the M1 extension connectors, are they keyed (i.e., is it possible/easy to reverse the polarity of a plug ?)

00:25 <wpwrak> also, how's the mechanical firmness of the board around them when in the case ? does it flex when inserting/removing a plug ?

00:26 <wpwrak> also, regarding keying, since they're both 9x2, can they be told apart except by position ? e.g., all other things being equal, color-coding would accomplish this

00:28 <wpwrak> J21 (the one with 3V3) has a nasty failure scenario: short any of pins 1 and 2 with 3 and 4, and the whole 3V3 rail becomes 5 V. i wonder if the M1 survives this :)

00:43 <wolfspraul> all very good points. Mechanically, the two 9x2 headers are quite close to each other, I'm wondering whether it's possible to connect an expansion board into each one at the same time.

00:44 <wolfspraul> maybe we should define a size for them, so when people start to build something they know it in advance. then later if things are moved around on the board, we can keep that guaranteed size in mind so people can continue to use their older expansion boards in newer m1 boards...

00:50 <wpwrak> expansion boards would probably also want some form of additional attachment. not nice to have a board hang off a header, acting as lever

00:51 <wpwrak> the closeness of the connectors also eliminates a simple "bridge" structure, e.g., used by arduino

01:14 <wpwrak> aw_: heya ! well rested from the weekend, and ready for battle ? what's the plan for today ?

01:17 <aw_> wpwrak, go all rest boards to be fix2b version. ;-) meanwhile you can interrupt me if need. so long big rework again this week.

01:17 <wpwrak> ah, so no further analysis planned on the ones with incorrect TP36/TP37 voltage ? or the ones with NOR/bus issues ?

01:18 <aw_> wpwrak, the more rework the more new problems i encounter.

01:18 <wpwrak> heh ;-)

01:18 <aw_> wpwrak, no, we can still plan analysis

01:18 <aw_> even now

01:19 <wpwrak> for those with incorrect TP36/TP37 voltage, I would suggest to try the 3.3 V injection again (on TP36/37)

01:19 <wpwrak> i.e., connect 3.3 V through 100 Ohm and via an amperemeter, then see how much current flows into TP36/37

01:20 <aw_> so later if I see any not a constant measured by in-circuit for D16, it has much probability is that D16 or C238 bad from soldering, i can just replace them as new ones.

01:20 <aw_> i see

01:20 <aw_> let me set up a little then we check them

01:21 <wpwrak> yes, C238 would be a likely candidate there. the measurements should guide us :)

01:33 <wolfspraul> aw_: let me explain the 'big picture' plan as I see it this week, only high level.

01:34 <wolfspraul> first you continue with a mix of applying fix2b to all 90 boards, and analysis of boards with problems

01:35 <wolfspraul> after we have 30 boards that are 100% pass with fix2b applied, you pause that fixing and testing work, and spend a day or two to assemble (case) and package 30 full retail units of M1

01:36 <wolfspraul> once those 30 are ready for shipping, I will start catching up with some people that are waiting to buy, launch, shop, etc. you need to reserve maybe 2h / day or so for shipping out stuff. No worries you will get all paperwork prepared (invoices, HS code, 1040 form for US, etc)

01:36 <wolfspraul> after the 30 are ready for shipping, you go back to step 1, that is applying fix2b to all 90 together with analyzing boards with problems

01:37 <wolfspraul> that's how I see it :-) Let me know if it sounds wrong somewhere...

01:40 <wpwrak> wolfspraul: ("bridge" connector) here's a more professional variant, the USRP: http://gnuradio.org/redmine/projects/gnuradio/wiki/USRP

01:41 <wpwrak> wolfspraul: note that it also includes holes for spacers, for very good mechanical support. here's what it looks like with boards on it: http://gempillar.com/blog/2009/01/23/gnu-radio-rfid-reader/

01:42 <wpwrak> wolfspraul: (spacers) the extra mechanical support is also needed because you can have boards that are TX/RX only, so they don't form a "bridge"

01:43 <aw_> wolfspraul, since current "available" boards is more than 40pcs, this could be the first round of fix2b rework at one time job in order to get 30 full retail units.

01:45 <wolfspraul> aw_: I don't think you should only go through the current 'available' boards. That's a little risky because we may still overlook a bigger problem somewhere.

01:45 <aw_> or you just wanted an exactly 30 full retail units of M1, thus is 30pcs main boards 'available' enough, then immediately pause?

01:45 <wolfspraul> so I think you can mix in some interesting or failure boards, if only to see that our fix2b and testing process is now strong and can always identify well between pass and failure.

01:46 <wolfspraul> no not really, I just explain my thinking

01:46 <wolfspraul> which is that after we have ca. 30 'avail - fix2b' boards, you need to pause that work for 1-2 days to do assembly and packing

01:46 <wolfspraul> then continue with fixing and testing

01:47 <wolfspraul> but right now, it's still important to analyze some failed boards, as planned. so that we are sure everything is under control.

01:47 <wolfspraul> here is a simpler version: :-)

01:48 <wolfspraul> 1. you continue exactly as you did last week. fixing and testing, analyzing some failure boards.

01:48 <wolfspraul> 2. at some point I will interrupt you, when I see enough boards (ca. 30) that I believe we can sell

01:48 <wolfspraul> that's all :-)

01:48 <wpwrak> wolfspraul: i think what adam is saying that he already has 30 "available" boards, so your stopping condition is already fulfilled

01:48 <wolfspraul> no

01:48 <wolfspraul> they don't have fix2b applied

01:48 <wpwrak> aah, i see

01:48 <kristianpaul> they lack fix2b

01:48 <kristianpaul> ah yes :)

01:48 <kristianpaul> hi+

01:49 <wolfspraul> I just want to prepare Adam that there will be an interruption at some point, which is when we have ca. 30 fix2b 100% pass boards, and we are confident in our design, fix2b, and testing process.

01:49 <wolfspraul> then there's an assembly and packaging interruption, then back to fix2b/testing/fixing for all 90 boards

01:50 <wolfspraul> aw_: sorry now I wrote so much :-) but just repeat the same thing 3 times. did you understand / agree with the process?

01:51 <aw_> wolfspraul, l agreed , but here that I do here:

01:52 <aw_> 1. there's already more than 40 pcs put in "available" stage, not include passed "avail - fix2b"

01:55 <aw_> 2. mix good and bad boards, from the wiki results; the facts: the failure board are currently big "impedance" and few usb/midi failure boards that I haven't fixed as "available" , Which of them are useless to approve fix2b design now.

01:56 <aw_> 3. from last first cluster with fix2b, we encountered a new branch of likely 0x32/0x3c/0x77 failure boards which I could say we they are bad board so far now. Those are we can keep to analysis.

01:57 <aw_> 4. so idea to use those more than 40 pcs "available" boards to meet/accumulate 30pcs "Avail -fix2b", how do you think?

01:57 <aw_> 5, once we reach 30 pcs "avai -fix2b" then we pause.

01:58 <wpwrak> aw_: what is "big impedance" ?

01:59 <aw_> wpwrak, before m1 to be powered on, I firstly measured their impedance on TP1 ~ TP4, TP33. They had have 'short' condition, which surely no need to apply 'fix2b' circuit.

02:00 <aw_> wolfspraul, is that reasonable i replied?

02:01 <wpwrak> ah, so it's "low impedance" :)

02:05 <aw_> for me, preparation is likely to be done as material to One-time job. so rework should be done from those 40 pcs "available" firstly

02:06 <wolfspraul> ok wait, reading :-)

02:06 <aw_> later when everyday i tested, the avail-fix2b will be accumulate rising to 30pcs then we pause

02:08 <kristianpaul> oh

02:08 <wolfspraul> aw_: yes, all reasonable. _BUT_ I think you should definitely do some analysis in parallel, today, tomorrow. 0x32/0x3C/0x77/0x85/etc.

02:08 <wolfspraul> not 100%, but in parallel with finishing more fix2b boards

02:09 <wolfspraul> maybe 20% analysis, 80% finish fix2b boards :-)

02:09 <aw_> wolfspraul, yes...just in parallel to feedback info from bad boards

02:10 <wolfspraul> correct

02:10 <wolfspraul> we have a good plan :-)

02:12 <aw_> well..sometimes let's see how amperemeter measured firstly....these would always be as ping pong status, hard to define a day that belongs to design validation day or productive day. ;-)

02:13 <aw_> well...no more chats now...only work or logical analysis from now. ;-)

02:34 <aw_> 0x32: d2/d3 is fully off, tp36/tp37 is stable 3.3V, tp36 - 0.08mA, tp37 - 0.014mA

02:34 <aw_> 0x32: d2/d3 is fully off after power on, tp36/tp37 is stable 3.3V, tp36 - 0.08mA, tp37 - 0.014mA

02:35 <wpwrak> does it boot ?

02:36 <aw_> i didn't put it boot stage, so NO, it's not.

02:36 <aw_> just standby stage for boot

02:37 <wpwrak> so what you mean is that the NOR isn't fully programmed yet ?

02:38 <wpwrak> from last week, i see that 0x32 had some garbled NOR content (in the standby bitstream)

02:40 <aw_> not exactly to say that. this have to be measured/triggered tp36 with tp35(DONE pin), to know if it's been finished reconfiguration stage.

02:40 <wpwrak> maybe it just needs a reflash

02:40 <aw_> wpwrak, yes, i dumped 0x32 last week. good that you checked dump file already

02:41 <wpwrak> there's something strange with this board, though: sometimes, TP36/TP37 voltages are good, sometimes they're not. maybe give the reset circuit a good visual inspection. look for broken solder joints or things that could short a component.

02:42 <aw_> no, before assert a nex reflash, do we miss some info or need to scope somewhere?

02:42 <wpwrak> if you want, you can take another dump to check whether the NOR content is the same (or if something "magically" changes it)

02:43 <aw_> wpwrak, yes, exactly sometimes it indeed, this morning I 've seen 0x32 auto boot again (d2 is ON) after powered on.

02:43 <aw_> when it auto boot, i didn't touch it though

02:44 <aw_> wpwrak, alright, try to if can dump it...

02:44 <wpwrak> hmm, i'll understand all these LEDs much better once i get to play a bit with the M1 that's currently in ... memphis ;-)

02:45 <aw_> wpwrak, yeah..you bet will.

02:45 <aw_> dumping...

02:48 <aw_> fact on 0x32: i 've soldered pins on flash chip...it could be worse than hide the problem from my soldering

02:48 <wpwrak> when did you do this ? recently or long ago ?

02:49 <aw_> wpwrak, long day ago, before applied for fix2b from histories,

02:50 <aw_> so i would just dump one time then we don't spend much time on 0x32, then back to 0x3c to see

02:50 <wpwrak> ah, you replaced the NOR chip. okay, that could indeed cause a lot of fun :)

02:50 <aw_> no, not replaced chip

02:50 <wolfspraul> maybe the focus on 0x32 should also be to get it to work, rather than to analyze its current state?

02:50 <aw_> resoldered pins of NOR

02:51 <wpwrak> aw_: why did you resolder the pins ? did anything look wrong with them ?

02:51 <wpwrak> wolfspraul: well, at the end of the day, that's the objective of the analysis ;-)

02:52 <aw_> wpwrak, i was thought it had have soldering problem. ;-) long days ago. at that time you've not jumped here. ;-)

02:52 <wpwrak> wolfspraul: i suspect we may end up with an uncertain status for that board, though: it may work but with a history of failures where nothing has been done to correct them

02:53 <wpwrak> aw_: so the soldering problem was just a theory, but you didn't actually see or measure anything wrong ?

02:53 <wolfspraul> I stay back. I just hope we are focused. Either we learn something, or we produce a result (make it sellable). But not get stuck on 0x32...

02:54 <aw_> wpwrak, agreed. so let's just dump once, yes, no see actually voltage/or measure

02:55 <aw_> wpwrak, i would be later we back to see 0x32 after later I replace a new chip. not now.

02:57 <wpwrak> aw_: the next step for 0x32 would probably be to reflash. it may then boot and render without further rework. but as i said, we may not be able to trust it.

02:59 <aw_> wpwrak, agreed. then we stop 0x32

03:00 <wpwrak> aw_: but the dump first :) and afterwards maybe reflash and see if it comes up. then we can forget it. let's not have a gazillion almost finished boards around. that just causes confusion.

03:02 <aw_> http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/bitstream/0x32-standby2.bit/

03:02 <wpwrak> wolfspraul: with 0x32, i just want to make sure it doesn't tell us anything new about the NOR. if it behaves, which it currently seems to be inclined to do, then it'll be a "kinda works but prefer not to sell it to customers" board. in a few minutes :)

03:03 <wpwrak> funny. 0x32-2 differs from 0x32-1.

03:04 <aw_> how different?

03:04 <wpwrak> 0x32-1: ten 0->1 changes on DQ7

03:04 <wpwrak> 0x32-2: three 0->1 changes on DQ5

03:04 <wpwrak> also on different locations than 0x32-1.

03:05 <wpwrak> http://pastebin.com/haGjQ6yn
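
A comparison like the one summarized above (counting 0->1 changes per DQ line in each dump) can be sketched in a few lines of Python. This is only an illustration, not the tool wpwrak actually used. It assumes each dump is compared byte by byte against a known-good reference image and that bit n of a byte corresponds to data line DQn; the function names and the sample bytes are made up.

```python
from collections import Counter


def bit_flips(reference: bytes, dump: bytes):
    """List (offset, bit, direction) for every bit that differs.

    Assumption: bit n of each byte maps to data line DQn.
    """
    flips = []
    for offset, (ref, got) in enumerate(zip(reference, dump)):
        diff = ref ^ got
        for bit in range(8):
            mask = 1 << bit
            if diff & mask:
                direction = "0->1" if got & mask else "1->0"
                flips.append((offset, bit, direction))
    return flips


def summarize(flips):
    """Count flips per (bit, direction) pair."""
    return Counter((bit, direction) for _, bit, direction in flips)


# Made-up sample data, not taken from the real dumps.
reference = bytes([0x00, 0x7F, 0x10, 0xFF])
dump_a = bytes([0x80, 0x7F, 0x10, 0xFF])  # bit 7 reads 1 instead of 0 at offset 0
dump_b = bytes([0x00, 0x7F, 0x30, 0xFF])  # bit 5 reads 1 instead of 0 at offset 2

for name, dump in (("dump_a", dump_a), ("dump_b", dump_b)):
    print(name, dict(summarize(bit_flips(reference, dump))))
```

If two dumps of the same, unmodified NOR content produce different flip lists, the errors are in the read path or the cells rather than in what was programmed, which is the conclusion drawn for 0x32 here.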

03:07 <wpwrak> very interesting pattern.

03:07 <wpwrak> okay, so ... don't reflash. this one is for the "unstable NOR/bus" pile then. company for 0x3a ;-)

03:08 <aw_> okay

03:09 <wpwrak> just to be sure: there was no reflash between dump 0x32-1 and 0x32-2, correct ?

03:10 <aw_> yes, no reflash it again

03:11 <wpwrak> okay, perfect. then we have two different dumps from the same content. just like 0x3a.

03:12 <aw_> so at least a good consistent on 0x3a and 0x3c

03:12 <wpwrak> okay, who's next ? 0x3c ?

03:12 <aw_> 0x3c, yes

03:12 <wpwrak> according to the last dump, 0x3c's NOR content is perfect

03:13 <wpwrak> so it should be able to boot. now, does it ? :)

03:13 <aw_> moment...

03:13 <kristianpaul> https://bugzilla.redhat.com/show_bug.cgi?id=732291

03:14 <aw_> some sort of pieces of your words I need to record in notes ;-)

03:15 <wpwrak> kristianpaul: 121 is EREMOTEIO. that's a funny one, never saw it

03:16 <aw_> wpwrak, 0x3c: you are right, good boot to rendering.

03:16 <wpwrak> kristianpaul: (Remote I/O error)

03:16 <wpwrak> aw_: heh ;-)

03:17 <wpwrak> aw_: does it take a lot of time to run the other tests on 0x3c ? (the various tests you normally run, CRC, USB, MIDI, etc.)

03:18 <kristianpaul> I'll upgrade next week, when get a new laptop to play..., i wasted too much time on this.. but may be some Fedora 15 user around want to give it a try to the package in F-16 :-)

03:18 <aw_> wpwrak, http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/3C-results

03:18 <wpwrak> kristianpaul: ah, your bug has a response. good ...

03:19 <aw_> you can scroll down a bit, there's three notes I marked

03:19 <wolfspraul> bah. I would reflash and resolder and replace NOR on 0x32 and 0x3A until they work. what is the value of NOR paranoia that just leads us to some bad soldering in the end...

03:19 <wolfspraul> but sure, move them aside now and fix later. but not study forever - no value.

03:19 <wolfspraul> just fix them

03:19 <kristianpaul> wpwrak: yeah. but i'm back to trusty debian now ;)

03:20 <aw_> wpwrak, so yes surely I tested 0x3c by test program, but no more tests log I recorded after applied fix2b circuit.

03:22 <wpwrak> wolfspraul: that's just hiding the problem :)

03:24 <aw_> wpwrak, 0x3c is weird , why it did have messy pulsing on tp36/tp37 before, then then dump shows correctly then it works?

03:25 <wpwrak> wolfspraul: maybe it's bad soldering. maybe not. the board also has strange things happening on the reset circuit. just making random changes until it works tweaks the statistics against you - you're actually decreasing the coverage of your production test.

03:25 <aw_> guessed that prober's capacitance influence 0x3c's tp36 weh i probered it.

03:26 <aw_> s/weh/when

03:26 <wpwrak> aw_: measure again ? if it's really only a probing problem, that would also shed some new light on the issue

03:27 <wpwrak> aw_: but i somehow don't think it's so easy :)

03:27 <aw_> wpwrak, 0x3c notes: 1. No VGA screen, replaced a new u19 then pass also as well as video input shows normally 2. rendering @ 2nd then can't reconfigure 3. replaced new u7/u19/u20 4. d2/d3 dimly lit while TP37 and TP36 is unstable level, range 1.2V to 3.3V: http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x3c_ch1-tp37.JPG 5. when I attached prober on tp37, few seconds the messy pulse disappeared and stays 3V3 steadily and

03:27 <aw_> d2/d3 is fully off. Messy pulses again after pressing middle btn. 6. applied fix2b 7. D16(in-circuit): For.V.=165mV, Rev.V = 1549mV 8. reflashed successfully 9. d2/d3 dimly lit after powered-cycle(tp36/tp37 pull high well) 10. d2/d3 dimly lit(tp36/tp37 is messy signal level 1.2V ~ 3.3V) after power-cycle, used prober touched TP36 can intermittencely let board d2/d3 is fully OFF and can boot up after pressing middle btn. 11. dum

03:27 <aw_> p after power on, d2/d3 is fully off: http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/bitstream/0x3c-standby1.bit/ 0x3c's NOR content is perfect, goo boot to rendering,

03:28 <aw_> s/goo/good,

03:28 <wpwrak> aw_: regarding your notes, so the board tested okay (after some rework) but then it started to develop the tp36/37 instability ?

03:28 <wpwrak> aw_: yes yes, i'm looking at http://en.qi-hardware.com/wiki/Milkymist_One_run_3_schedule#Test_Results ;-)

03:29 <aw_> yes...from notes, you can say that...sot it's weird, that prober's capacitace can influence tp36 surely on 0x3c

03:31 <wpwrak> aw_: hmm, i don't think the probe should be able to upset TP36/TP37 so much. something doesn't seem right.

03:31 <wpwrak> aw_: but maybe just probe them again (with the scope). see if they're still unstable.

03:31 <aw_> wpwrak, 0x3c today firstly powered on, d2 is fully off well, then good boot to rendering

03:32 <wpwrak> aw_: ah, when you determine the TP36/37 voltage, is this with the scope or a voltmeter ?

03:32 <aw_> wpwrak, moment...i scope it for one minute to see,

03:32 <aw_> i used scope

03:32 <wpwrak> okay, then please try again with the scope

03:34 <aw_> bad that d2/d3 dimly lit after powered cycle...let's see if prober can let d2 is fully off

03:34 <wpwrak> kristianpaul: (debian) so you won't be able to try the package hans de goede suggested ?

03:34 <aw_> or you want to dump firstly?

03:35 <wpwrak> aw_: let's look at TP36/37 first

03:35 <aw_> wpwrak, not just let my prober to let d2 off?

03:35 <aw_> ok

03:36 <aw_> tp36 is stable 3.3V about 5 couple seconds then messy pulsing, d2 is still dimly lit

03:37 <wpwrak> aha ! can you take a picture of TP36 and TP37 that shows the stable and the unstable part ?

03:37 <kristianpaul> wpwrak: not until next week

03:39 <aw_> wpwrak, tp36 is still some sort likely of http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x3c_ch1-tp37.JPG

03:40 <aw_> if the scope I press "stop" will like the waveform as above, or even messy one.

03:41 <aw_> wpwrak, can you imagine the waveform's scenario? you know the pulsing is super messy

03:41 <wpwrak> aw_: let's try the current measurement then. 3.3 V through 100 Ohm into TP36 and TP37

03:42 <aw_> i got ready

03:44 <aw_> no current uA i can measured on tp36/tp37

03:44 <wpwrak> and you still get the "messy" signal ?

03:45 <aw_> no messy now, it's stable 3.3V so far

03:45 <wpwrak> that's consistent with no current. at least something :)

03:45 <aw_> need monitor lone time...

03:46 <wpwrak> hmm. tricky.

03:46 <wpwrak> does C238 look good (visually) ?

03:46 <aw_> yes, lemme trying to both probering and current

03:47 <aw_> C238 looks good

03:47 <aw_> from last note, the Voltage on D16 is constant...

03:47 <wpwrak> and i suppose U24 looks good too .. (it's a big component, hard to get wrong :)

03:48 <aw_> sure

03:49 <aw_> I would to measure D16 again to see if pulsing means D16/C238 has been draftly varies after pulsing. ;-)

03:49 <wpwrak> well, why not

03:50 <wpwrak> i have to admit that i'm a bit puzzled. some of the effect may actually be a digital interaction. but it's all very strange.

03:53 <aw_> fedex picker is coming...strange..wait

03:53 <wpwrak> lemme research reset chips a bit more. there are some parameters AiT don't mention ...

03:56 <aw_> D16(in-circuit, power off): For.V. = 157mV, Rev.V. = 1546mV

03:57 <wpwrak> sounds good. can you check the 3.3 V supply for variations/glitches ?

03:58 <aw_> impedance to gnd: tp36 - 10.25 k ohm, tp37 - 18.45 k ohm....good to compare with good board.

03:59 <aw_> wpwrak, ha...good reminder..you mean how's the ripple on 3.3V, right?

03:59 <wpwrak> you never know :) it would explain things

03:59 <wpwrak> maybe there's a short somewhere else that gets triggered from time to time

03:59 <aw_> yup...this is a good direction way...

04:03 <wpwrak> maybe set up the scope as follows: CH1 on TP36, CH2 on 3V3. 2 ms/div. then trigger on one of those pulses on TP36.

04:04 <wpwrak> i just with your scope had more memory. that way, you could get the "big picture" but also zoom into the details.

04:04 <wpwrak> s/with/wish/

04:16 <aw_> hmm....messy pulsing it not listening to me...no happened now. :(

04:16 <aw_> thinking how to reproducible it. :(

04:20 <wpwrak> how many times did you try ?

04:21 <aw_> couples seconds only last time

04:22 <aw_> i'm monitoring scope

04:24 <wpwrak> i mean, how many times did you power cycle, trying to reproduce the instability ?

04:24 <wpwrak> or does it come and go without power cycling ?

04:25 <aw_> without power cycling....just wait couple seconds after prober touch TP36

04:26 <aw_> but now even I used prober to TP36, it has not happened more..:(

04:27 <wpwrak> a heisenbug :-) http://en.wikipedia.org/wiki/Heisenbug#Heisenbug

04:30 <aw_> phew...heisenbug?

04:31 <wpwrak> a bug that disappears when you try to analyze it

04:32 <aw_> tp36 - 4mA, tp37 - 0

04:32 <wpwrak> whoopie !

04:32 <aw_> wait

04:32 <aw_> sorry

04:33 <aw_> tp36 - 0.004mA, tp37 - 0

04:33 <wpwrak> ah. boring :)

04:33 <aw_> typed too fast...sorry to confuse. :)

04:33 <wpwrak> still ... 4 uA may be significant. lemme calculate ...

04:35 <wpwrak> hmm, even a voltage difference of 1.3 V, that would be ~200 kOhm. weaker even than the pull-up.

04:36 <wpwrak> but .. what was the voltage ?

04:36 <aw_> 3.29V

04:36 <wpwrak> ah okay, then it may just be resistance along 3V3

04:37 <aw_> read from scope

04:37 <wpwrak> 0.4 mV difference between the two ends of your 100R+meter setup. that's quite reasonable
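
The arithmetic behind these estimates is plain Ohm's law, and a small Python sketch makes it easy to redo for other readings. The helper names are hypothetical and the inputs are just the values quoted in the log: a 100 Ohm series resistor, a 4 uA reading, and the 1.3 V worst-case difference mentioned above. The meter's own resistance is ignored, which is an assumption.

```python
def series_drop_volts(current_a, resistance_ohm=100.0):
    """Voltage across the series resistor, V = I * R."""
    return current_a * resistance_ohm


def equivalent_resistance_ohm(voltage_v, current_a):
    """Resistance that would pass current_a at voltage_v, R = V / I."""
    return voltage_v / current_a


# Values quoted in the log: 4 uA measured on TP36 of 0x3c.
drop = series_drop_volts(4e-6)
leak = equivalent_resistance_ohm(1.3, 4e-6)

print(f"drop across 100 Ohm: {drop * 1e3:.1f} mV")
print(f"equivalent resistance at 1.3 V: {leak / 1e3:.0f} kOhm")
```

With these inputs the drop comes out at 0.4 mV, matching the figure above, and the equivalent resistance at roughly 325 kOhm, the same order of magnitude as the ~200 kOhm estimate and likewise weaker than the pull-up.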

04:38 <wpwrak> maybe power cycle a few times to see if the instability comes back

04:39 <aw_> yup

04:43 <wpwrak> aw_: when you saw the instability happen before, was that just during an attempt to power on ? or did you have to do something in addition to this ? e.g., press the middle button ?

04:44 <aw_> wpwrak, no need to press middle button, just used prober to touch tp36(sometime kept 3.3V stable, then instable level/pulsing happen.

04:45 <wpwrak> okay. let's see if more power cycling causes it to appear again

04:46 <aw_> i just finished 5 times powered - cycle with two probers to wait 10 seconds, no instability happen (also d2 is fully off) , seems that heisenbug like me.

04:46 <wpwrak> do you have an estimate of how many power cycles you did after fix2b and how many times the instability appeared ?

04:48 <aw_> the instability once happened I immediately recorded into note. but no a system way to count how many or times I had have met totally

04:48 <aw_> but I just felt one condition:

04:48 <wpwrak> okay. let's just try a few more times.

04:48 <wpwrak> let's say up to 20, so 15 more

04:48 <aw_> 1. will this bug related to temperature-oriented? seems now I can't reproduce it..

04:49 <aw_> 2. this morning when I firstly powered on, and probered tp36, it can be easier to see messy pulsing..

04:49 <wpwrak> if that doesn't make it happen, then try letting it boot a little more (i.e., press the middle button), then power cycle

04:49 <wpwrak> temperature could be a factor, yes

04:50 <wpwrak> residual charges stored in caps may be another factor

04:53 <wpwrak> but let's vary one parameter at a time. if simple power cycling doesn't help, try maybe 5-10 times with booting the system (middle button)

04:55 <wpwrak> if it still doesn't happen. try cycling with longer off periods. e.g., leave it off for ~1 min between tries. also 5-10 times. (you could combine this with lunch or some other fix2b rework :)

04:55 <aw_> yup...i just accumulated 5 times of powered -cycle then still d2 is fully off good, no messy scope happened

04:55 <wpwrak> if still nothing happens, i'd let the board cool down and discharge itself until tomorrow morning.

04:58 <aw_> trys boot to rendering and power-cycle now

05:01 <aw_> while this, i keep an eye on watch scope.

05:05 <aw_> 5 times of boot to rendering with power cycle

05:05 <aw_> all worked well, no unnormal condition

05:05 <wpwrak> this bug is a slippery one

05:05 <wpwrak> let's increase the power-off time then to ~1 min

05:06 <aw_> okay

05:09 <wpwrak> if still nothing happens, 0x3c gets a rest until tomorrow morning. maybe we can then give 0x77 a quick try.

05:09 <wpwrak> for the measurements, you're soldering wires to the test points ?

05:10 <aw_> yes, for preparations(soldering) on 0x77's tp36/tp37

05:11 <wolfspraul> have we tried fixing 0x32 by not putting the focus on learning/analyzing, but by simply replacing parts that could potentially be the source?

05:12 <wpwrak> okay. so you're adding the wires to 0x77 already. good. that way, we can measure when it powers up the very first time.

05:12 <wolfspraul> c238, d16, reset ic, nor chip, etc.

05:12 <wolfspraul> that'd be my approach

05:12 <wolfspraul> if the problem is not even reproducible now, don't spend more time on 0x32, just put aside (like you are doing)

05:12 <wpwrak> cargo cult engineering ;-)

05:13 <wpwrak> 0x32 is already back on the pile

05:13 <wpwrak> we're at 0x3c now

05:13 <wolfspraul> yes I read it, just thought I throw my 2c in for 0x32

05:14 <wpwrak> 0x32 has some strange NOR data path problem. but it's not clear where it is. the reset circuit is most likely not to blame for this.

05:17 <wpwrak> what's odd about 0x32 is that data that is read from the NOR seems to change. so that could be: failing NOR cells, bad I/O buffers in the NOR, some disturbance of the data or address bus (interference ?), bad I/O buffers on the FPGA, some obscure problem on the usb-jtag side.

05:17 <wpwrak> ah, also badly programmed NOR cells could be a cause (i.e., a "soft" error)

05:19 <wolfspraul> replace nor chip

05:19 <wpwrak> so my next steps would be to read the NOR once or twice more tomorrow, see how the pattern behaves. try to program it again. see if it boots. if not, read back and see if there's corruption.

05:19 <wolfspraul> I would replace the nor chip right away

05:19 <wolfspraul> :-)

05:20 <wolfspraul> and not even now, because nothing to learn for fix2b now, so we can do that later

05:20 <wpwrak> naw, that's way too drastic. if it's a soft problem, you don't need to replace the chip

05:20 <wpwrak> yes, it's unrelated to fix2b

05:20 <wolfspraul> then move forward

05:20 <wpwrak> we're already at 0x3c :)

05:20 <wolfspraul> yes I know, good

05:20 <wpwrak> and possibly soon at 0x77

05:21 <aw_> wpwrak, finished 5 times with ~ 1 min power-off time, it all goes well

05:21 <aw_> i stop 0x3c now, we check this tomorrow morning again

05:21 <aw_> let's at 0x77 firstly:

05:22 <wpwrak> the problem with "changing chips until it works" is that you may never solve the problem. so in the next run, you'll just get N times the boards that need arbitrary changes. worse, if the problem persists, you'll just rework the board to death and haven't learned anything.

05:22 <wpwrak> aw_: yes

05:28 <wpwrak> wolfspraul: also, replacing the NOR chip is high risk. you need to heat up a relatively large area, pull the chip without force, clean up all the pads, maybe clean the board from flux (optional, but things get messy quickly if you don't), then solder the new part (this is the easiest bit)

05:28 <wpwrak> wolfspraul: so there's plenty of potential for damaging pads

05:29 <wolfspraul> I have a different perspective. Not everything needs to be understood, it's about economics.

05:29 <wolfspraul> once we have the feeling that there is nothing that we will learn from 0x32 (for example) that applies to any other board, the value of 0x32 drops dramatically.

05:29 <wolfspraul> in fact at that moment it's probably not worth even 5 minutes of the time of someone like Werner

05:30 <wolfspraul> it's difficult to make the decision about 'can we still learn something that applies to other boards?' though

05:30 <wolfspraul> I'm close to saying: no, we cannot

05:30 <wpwrak> wolfspraul: but that's for the production phase. there, based on prior analysis/experience, you just have standard set of attempts at fixes, which may include replacing NOR chips. but so far, we don't even have any evidence that there is anything wrong with the chip. maybe it's cross-talk on the bus.

05:31 <wpwrak> wolfspraul: i'm not convinced yet that it's just a freak board. we already have two with similar issues. it's a cluster in the making ;-)

05:33 <wpwrak> wolfspraul: so for now, i'd just examine boards that show good D16 and TP36/37 result but still don't boot for NOR corruption and add any that exhibit variations to that cluster

05:37 <wpwrak> wolfspraul: so for now, we seem to have two trouble areas: at least two boards that exhibit variations when reading the NOR (good old 0x3a and now 0x32), and those that "out of the blue" get instability on TP36/37 (0x3c, 0x77, i think there's at least one more)

05:37 <aw_> 0x77: d2/d3 is fully off, tp36/tp37 is stable 3.3V now with two probers touchs

05:37 <wpwrak> wolfspraul: from the symptoms, it appears that the TP36/37 instability may not be the cause but the effect of some other problem

05:37 <wpwrak> aw_: 0x77 is a bastard !

05:38 <aw_> no messy happened yet

05:38 <wpwrak> aw_: so you can boot etc. ?

05:39 <aw_> tp36 - 0.214mA, tp37 - 0.039mA

05:39 <wpwrak> 214 uA ? that's quite a bit

05:39 <aw_> wait...watch for a while. ;-)

05:39 <wpwrak> what's the voltage ?

05:39 <aw_> still 3.3V @ tp36, yup ..weird current

05:40 <aw_> alright...time to press middle btn

05:40 <wpwrak> 21 mV at 100 Ohm. hmm.

05:41 <aw_> no boot....d2/d3 dimly lit

05:41 <wpwrak> hah !

05:41 <aw_> messy now

05:41 <wpwrak> excellent ! ;-

05:41 <aw_> hope i can capture

05:41 <wpwrak> )

05:41 <aw_> stay tuned

05:41 <wpwrak> yeah :)

05:42 <aw_> got it..but seems not related to 3.3V...

05:42 <aw_> moment...upload

05:47 <aw_> http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-TP1.JPG

05:47 <aw_> TP1 is 3V3 net

05:47 <aw_> trigger at falling edge

05:48 <aw_> the instable pulsing now still keeps...

05:48 <wpwrak> very good. let's keep it pulsating :)

05:49 <wpwrak> (scope picture) fascinating

05:49 <aw_> but it acts not always like messy pulsing...sometimes back to pull high good..then drop to messy pulsing

05:49 <wpwrak> can you please move CH2 to TP37 and see if the mess is there, too ?

05:50 <aw_> yup

05:50 <wpwrak> (sometimes stable/sometimes messy) yes, like t < 0 vs. t > 0 on the screen shot

05:51 <aw_> too bad...

05:51 <aw_> now all stable @ 3.3V, d2/d3 keeps dimly off

05:51 <aw_> dimly lit , sorry

05:52 <wpwrak> while watching TP36, can you gently push against the terminals of C238 ?

05:53 <aw_> wpwrak, gently push against?

05:53 <wpwrak> (with something non-conductive, fingernail, toothpick, etc.)

05:53 <aw_> but now tp36 is pull high good enough

05:53 <wpwrak> apply a bit of pressure on the terminals from various sides

05:54 <wpwrak> see if there's a bad solder joint or something else

05:54 <aw_> i see

05:54 <wpwrak> if C238 is okay, repeat for the reset chip

05:55 <aw_> okay

05:56 <aw_> no, i think their soldering is quite good, i can use microscope to catch them though. ;-)

05:57 <aw_> after I put more pressures on c238 and reset ic is the same, no changes now...

05:57 <wpwrak> if this still doesn't yield anything. then try pushing on the PCB such that it bends a little. around the reset chip, C238, D16, and then anywhere on the board (maybe in a grid pattern with ~1-2 cm spacing)

05:57 <aw_> weird

05:57 <wpwrak> maybe there's a hairline crack somewhere

05:57 <wpwrak> could also be inside a chip

05:58 <wpwrak> if nothing happens, maybe try power-cycling followed by the middle button 1-2 times and see if the problem comes back

06:01 <aw_> hmm..no come back...start to power cycle

06:02 <wpwrak> if all looks good, try to boot

06:02 <aw_> i saw it

06:02 <wpwrak> it came back ? great ! :)

06:03 <aw_> tp37 synchronized to tp36 at 2nd power cycle

06:03 <aw_> caught!

06:03 <wpwrak> can you take a picture showing just two peaks ?

06:04 <aw_> yup...upload

06:06 <aw_> http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-TP37.JPG

06:07 <wpwrak> can you "zoom" in until you have just 2-3 peaks on the screen ?

06:09 <wpwrak> and after that, try the 3.3V -> 100 Ohm -> amperemeter experiment again. first to TP37. check the current and also see if this ends the instability ?

06:12 <aw_> http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-TP37_zoomin.JPG

06:12 <wpwrak> oh, and you may want to set the voltage offset of CH2 to exactly -4.00 V :-)

06:13 <wpwrak> wow, looks like a data transmission ;-)

06:14 <aw_> wait...you wanted me set voltage offset of CH2 to exactly -4.00V now, or no need?

06:15 <wpwrak> yes, please set it to -4.00 V. that way, it's easier to compare voltages. (just for the future)

06:16 <aw_> when the TP37 starts a messy pulsing...the current goes up to roughly 1mA

06:16 <aw_> pull high...surely no current on TP37

06:16 <wpwrak> okay, now TP36

06:18 <aw_> yes, TP36 is 1mA when stays pulsing too. pull high is 3uA

06:18 <aw_> now...the instability is rare to happen though

06:19 <aw_> it seems to be warm up then goes to disappear a bit

06:19 <aw_> but need to monitor more

06:19 <wpwrak> can you try to catch it again at ~10 us/div ?

06:22 <wpwrak> and then, remove CH2 from TP37 and check the voltage rails. TP2 (2V5), TP3 (1V8), and TP4 (1V2)

06:23 <aw_> http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-TP37_10usDiv.JPG

06:24 <wpwrak> ah, and also TP33 (5V) and TP26 (4V3)

06:24 <wpwrak> (did i catch them all ? there are so many :)

06:25 <aw_> yes, not big discoveries. ;-)

06:25 <aw_> i have to power off and soldering wires

06:25 <wpwrak> very crazy noise

06:25 <wpwrak> can't you just touch the TPs with the probe of CH2 ?

06:25 <aw_> three pads under usb-jtag board

06:26 <wpwrak> argh

06:26 <wpwrak> maybe just remove the board

06:26 <aw_> I'll firstly unplug usb-jtag

06:26 <aw_> then just touch them sure

06:26 <aw_> moment

06:27 <aw_> hope still reproduce it

06:27 <wpwrak> oh, and how is your M1 supplied ? from a regular power supply or from a lab power supply ? in the latter case, you may want to have a look at the current consumption in the unstable state

06:29 <aw_> from regular power supply

06:29 <aw_> good news at least tp36 pulsing ie not related to TP1~4, TP33, TP26

06:29 <wpwrak> okay, let's keep checking the total system current for later

06:29 <wpwrak> grrr

06:30 <aw_> what else we think TP36 came from ?

06:30 <aw_> okay...sorry that what's "grrr"?

06:31 <wpwrak> "grrr" = i was hoping for an unstable supply rail

06:31 <aw_> okay...;-)

06:31 <wpwrak> and it's unstable also without the jtag board ?

06:32 <wpwrak> is there anything else connected ? ethernet, audio, ... ?

06:32 <aw_> no connections on 0x77 at all

06:32 <wpwrak> that would have been too easy, i guess :)

06:33 <aw_> unless we just removed usb-jtag board and went though all if TPs related to TP36. ;-)

06:34 <aw_> yup..so your guessing on regular supply and lab power supply is reasonable

06:34 <wpwrak> hmm. it would seem that PROGRAM_B becomes an output. or that something else connects into the PROGRAM_B/TP36/reset out net

06:34 <aw_> switching to lab power supply now...

06:37 <aw_> hmm...set limited 1A at lab power supply, 0x77 TP36 still got pulsing

06:38 <aw_> a total lab power current shows 0.55A

06:39 <aw_> alright...seems 0x77 is easier to reproduce messy pulsing

06:39 <wpwrak> good :)

06:39 <aw_> wpwrak, i think we stop now analysis today

06:40 <wpwrak> wait a minute. two more ideas.

06:40 <aw_> listening

06:40 <wpwrak> but i need to look up something first. i'd be curious about how adjacent traces behave in relation to TP36

06:41 <wpwrak> there are two candidates: the trace "north" of D16, roughly under the white bar that marks the polarity

06:41 <aw_> not very clear on this , can you slowly describe it

06:41 <wpwrak> and the one coming out next to R30

06:41 <aw_> yes

06:41 <aw_> go on

06:42 <wpwrak> i need to find places where you can actually measure them

06:42 <wolfspraul> it just gets interesting with 0x77 :-)

06:43 <aw_> wpwrak, get synchronized to Dram routes?

06:44 <aw_> which other partly circuit you want to scope, i can check here. ;-)

06:44 <wpwrak> aw_: dunno. doesn't look like DRAM. but lemme find the package definitions ...

06:44 <wpwrak> sigh, if all this was done in kicad, i could just click on the pad and know what it does ...

06:45 <wpwrak> lekernel_: you aren't awake by any chance ? :)

06:47 <wpwrak> one would be ball AA2 ... now, what is this ...

06:52 <wpwrak> ah, AA2 = FLASH_D8 = DQ8 = U9 pin 34

06:52 <aw_> just let me know your idea, i can open my windows tool to see surrounding signals or ball under fpga

06:52 <wpwrak> so that's one potential correlation to check: TP36 noise vs. pin 34 of the NOR

06:52 <aw_> got it

06:55 <wpwrak> the other would be AA4 = BTN2 = ... naw

06:58 <wpwrak> another candidate would be FLASH_CE_N, pin 14 of U9 (2 "up" from pin 16 of FLASH_RESET_N)

07:00 <wpwrak> but FLASH_D8 is much more likely. FLASH_CE_N is also on the "wrong" side of D16

07:06 <wpwrak> aw_: to get to the instability, is it sufficient to just connect power ? or did you also have to press the middle button ?

07:08 <aw_> wpwrak, no need to press middle, just power cycle and you can see d2/d3 is either dimly lit or fully off. 0x77 has both messy pulsing.

07:09 <wolfspraul> it must be caused by some part behaving differently from other (functioning) boards. but which part can cause this?

07:09 <wpwrak> ok. are you checking for correlation now ? TP36 vs. pin 34 of the NOR

07:09 <wolfspraul> I would just make a priority list from most likely to least likely, and then replace them one by one with new ones.

07:10 <aw_> wpwrak, yes..the pins are close.. so need to touch very carefully

07:10 <wpwrak> sorry, pin 34

07:11 <wpwrak> err .. pin 34 was correct. getting tired :)

07:11 <aw_> yes, i knew,

07:11 <aw_> so you go sleep first, i

07:11 <wpwrak> aw_: naw, i'll wait ;-)

07:11 <wpwrak> wolfspraul: for the 0x77 problem ? good question :)

07:12 <aw_> I'll scope p34 and FLASH_CE_N

07:12 <wpwrak> wolfspraul: right now, it looks plain impossible. we're seeing signals on TP36 that have no business to be there

07:13 <wpwrak> wolfspraul: plus, they also look wrong. like two outputs working against each other

07:14 <wpwrak> wolfspraul: while all we really should have there is an input

07:15 <wpwrak> wolfspraul: so if the correlation check doesn't yield anything, the next step would be to see if lekernel recognizes something familiar in the scope screenshots. or if he knows of a condition where PROGRAM_B can become a weird output.

07:16 <wpwrak> wolfspraul: if he doesn't have any magic rabbit in his hat, i'd start simplifying the circuit. maybe start with pulling U24

07:17 <wpwrak> wolfspraul: if this doesn't do the trick, remove C238. if the instability is still there, D16.

07:18 <wpwrak> wolfspraul: one problem is that i'm not sure we have enough data to be able to tell when the instability has gone for good. so if it can't be observed at a given point in time, it may be necessary to let the board rest overnight.

07:18 <aw_> wpwrak, yes, you seems that right, tp36 seems synchronized to pin 34 of NOR

07:18 <aw_> moment...i still need to catch a very firm waveform. ;-)

07:18 <wpwrak> wheee ! now that's nice news ;-)

07:20 <aw_> caught, yes!

07:21 <wpwrak> the next two steps: resistance between TP36 and pin 34 of the NOR (for both polarities). compare with the same resistance of two known to be good boards.

07:21 <wpwrak> maybe start with the good boards first to give 0x77 some time to discharge

07:22 <wpwrak> but let's first wait for the evidence ;-)

07:22 <wpwrak> (i.e., the picture :)

07:23 <wpwrak> wolfspraul: if the impedance is unusually low, you'll like the next step: visual inspection of the underside of the FPGA :)

07:24 <aw_> http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-NOR-pin34-DQ8.JPG

07:25 <wpwrak> nice ! the voltage levels are a little odd, though

07:25 <wpwrak> but let's see about the impedance now

07:26 <wpwrak> or wait

07:27 <wpwrak> maybe take another scope shot at 500 ns/div

07:31 <wpwrak> hmm. thinking a bit more about it. the image does not suggest a simple short. otherwise, DQ8 would have to be at ~1.6 V too (i think the noisy "floor" of DQ8 is Z)

07:32 <aw_> http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-NOR-pin34-DQ8_500ns.JPG.JPG

07:34 <wpwrak> *hmm*

07:34 <aw_> a high impedance of Z as we knew, should pull high chip inside or outside resistor, this waveform still can not say that a possible 'short' under between them

07:35 <aw_> even if it's short, the waveform should not be like that level.

07:35 <wpwrak> let's forget about the impedance for now. this is more mysterious.

07:35 <aw_> do you think that that could be an interconnection inside fpga? as related to program_b?

07:36 <wpwrak> next try: TP36 and NOR pin 54 (OE#/FLASH_OE_N)

07:36 <aw_> okay

07:36 <wpwrak> it could be the FPGA just acting crazy, yes. but then, that's a bit too convenient an assumption :)

07:37 <wpwrak> i'd leave all the "crazy FPGA" theories to sebastien. he's probably read about a good number of FPGA madness issues, so something may sound familiar. if not, we probably have something else.

07:38 <aw_> wpwrak, no pulse on pin 54 of NOR

07:39 <wolfspraul> is there a chance that this board (this particular one, 0x77), could pass our test program and 10 render cycles?

07:39 <wolfspraul> or is it behaving badly enough from what we see so far that that is impossible?

07:39 <wolfspraul> not theoretically, but practically this one, 0x77

07:39 <aw_> wpwrak, no relations between TP36 and pin 54 of NOR

07:40 <wpwrak> wolfspraul: this one may be badly off enough. but the very similar 0x32 went for a while without showing problems.

07:40 <wpwrak> aw_: is OE# low all the time ?

07:40 <aw_> yes, you got it, always at low

07:44 <wpwrak> the results would correspond to having about 5 kOhm between AA1 (PROGRAM_B) and AA2 (FLASH_DQ8)

07:45 <wpwrak> that way, we'd just end up at around 1.6 V, with the kind of charge/discharge pattern on PROGRAM_B we see (C238)

07:46 <wpwrak> where it gets a little mysterious is what would provide these 5 kOhm. flux residues ? cooked I/O driver ?
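
The ~5 kOhm / ~1.6 V reasoning can be sketched as a plain resistive divider plus an RC time constant. This is a simplified model, not the actual M1 circuit: it assumes the PROGRAM_B net is pulled to 3.3 V through a single resistance, that DQ8 sits at 0 V when low, and that the leak behaves ohmically. The 220 pF value for C238 is the one adam gives later in the log; every name below is illustrative.

```python
V_SUPPLY = 3.3    # V, the 3V3 rail
V_NODE = 1.6      # V, approximate low level seen on TP36
R_LEAK = 5e3      # Ohm, estimated path between AA1 and AA2
C238 = 220e-12    # F, value reported later in the log

def implied_pullup(v_supply: float, v_node: float, r_leak: float) -> float:
    """Pull-up resistance that yields v_node when r_leak goes to 0 V (assumed model)."""
    return r_leak * (v_supply - v_node) / v_node

def parallel(r1: float, r2: float) -> float:
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)

r_pullup = implied_pullup(V_SUPPLY, V_NODE, R_LEAK)
# Time constant seen by C238: Thevenin resistance of the divider times C.
tau = parallel(r_pullup, R_LEAK) * C238

print(f"implied pull-up: {r_pullup / 1e3:.1f} kOhm")
print(f"RC time constant: {tau * 1e6:.2f} us")
```

Under these assumptions the pull-up would have to be of the same order as the leak, and the time constant comes out well below a microsecond, so the 500 ns/div shot is the one where such charge/discharge curves would be visible.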

07:46 <aw_> so next steps to measure impedance on bad and good board.

07:47 <wpwrak> maybe let's do the impedance test now. TP36 vs. pin 34. both ways, i.e., swap probe +/-.

07:47 <wpwrak> yes

07:47 <aw_> regards to if flux residues, this 0x77 i see it that is clear,

07:48 <aw_> need to power off now

07:48 <aw_> okay

07:49 <wpwrak> aw_: (0x77 and flux) no change of something trapped under the FPGA ? :)

07:49 <wpwrak> s/change/chance/

07:50 <wpwrak> i've never seen flux do as little as 5 kOhm. but then, there's a first time for everything ;-)

07:51 <aw_> wpwrak, it could still have chance under FPGA, yes, but you were right, me too on that a flux can take a 5 k ohm?

07:52 <wpwrak> how big is C238 now ?

07:54 <wpwrak> maybe your flux has extra-powerful ions ;-)

07:56 <aw_> 10.6 K ohm from TP36 to pin 54, 118K ohm reversely

07:57 <aw_> sorry that it's pin34

07:57 <wpwrak> hmm

07:57 <wpwrak> how big is C238 ?

07:58 <aw_> bad that i don't have equipment can measure capacitor now.

07:59 <wpwrak> you could build an oscillator with a 555 ;-)

07:59 <wpwrak> then you could _hear_ the capacitance :)

07:59 <aw_> i saw an arduino with few parts to do. ;-)

08:01 <wpwrak> 10/120 kOhm is on 0x77 ?

08:01 <aw_> yes

08:01 <aw_> moment

08:01 <wpwrak> now, two known to be good boards for comparison

08:04 <aw_> 59 / 118 k ohm on 0x40

08:06 <aw_> 58 / 129 kohm on 0x7a

08:08 <wpwrak> something's a little off :)

08:08 <wpwrak> maybe measure 0x32 too

08:08 <aw_> 0x3c: 63 / 131 k ohm

08:09 <wpwrak> so, "normal" = ~60 / ~120-130 kOhm. ~11 / ~120 kOhm of 0x77 is bad.

08:09 <aw_> 0x32: 53 / 129 k ohm

08:09 <wpwrak> aw_: do you remember the value of C238 ?

08:09 <aw_> so yes, at least 0x77 is bad
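
The board-to-board comparison can be written down as a small outlier check over the five readings quoted above. This is only a sketch: the 30% tolerance is an arbitrary assumption, and so is the reading of each "a / b" pair as (red probe on TP36, red probe on pin 34), which the log confirms only for 0x77.

```python
from statistics import median

# TP36 <-> NOR pin 34 resistance in kOhm, as quoted in the log.
readings_kohm = {
    "0x77": (10.6, 118),
    "0x40": (59, 118),
    "0x7a": (58, 129),
    "0x3c": (63, 131),
    "0x32": (53, 129),
}

def outliers(readings: dict, tolerance: float = 0.3) -> list:
    """Return boards whose reading in either direction deviates from the
    median of all boards by more than `tolerance` (as a fraction)."""
    med_a = median(a for a, _ in readings.values())
    med_b = median(b for _, b in readings.values())
    flagged = []
    for board, (a, b) in readings.items():
        if abs(a - med_a) / med_a > tolerance or abs(b - med_b) / med_b > tolerance:
            flagged.append(board)
    return flagged

print(outliers(readings_kohm))
```

With these numbers only 0x77 is flagged. 0x32 and 0x3c stay within the tolerance even though they show similar symptoms on TP36.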

08:10 <aw_> C238 is 220 pF

08:11 <wpwrak> thanks !

08:12 <aw_> also thanks to you !

08:13 <wpwrak> simulation says this: http://downloads.qi-hardware.com/people/werner/m1/tmp/pin34.ps

08:13 <wpwrak> a bit over-simplified of course

08:14 <aw_> 4KOhm is an equivalent resistor inside that pin34?

08:15 <wpwrak> hmm, 0x32 is about normal. so either it has a separate problem with just the same symptoms or we haven't found the real cause just yet

08:15 <wpwrak> (4 kOhm) yes

08:16 <aw_> agreed >>> so either it has a separate problem with just the same symptoms or we haven't found the real cause just yet

08:16 <wpwrak> do you have any cleaning process that has some hope of removing flux or dirt from under the fpga ?

08:16 <wpwrak> (preferably without just moving the dirt/particle to another area of the fpga)

08:19 <wpwrak> well, but then that probably doesn't make sense

08:19 <wpwrak> hmm. thinking ...

08:19 <aw_> hmm..this quitely needed to be think more if need to see if flux or dirt from under fpga... I've ever not dealt this topic.

08:19 <aw_> thinking...

08:24 <aw_> as you really knew that it's still possible a flux to be as likely huge ( 60 - 10 )= 50 KOhm...to bring resistance down, no big surprising; But if it indeed is. How few boards got similar problem like this. and Won't any other balls under FPGA surrounding Program_B be influenced too? and just only Program_B?

08:24 <wpwrak> well, this is the corner in which the rework was done

08:25 <wpwrak> AA1 and AA2 are in the second row of balls, very close

08:25 <aw_> That's too weird, 0x3c and 0x77 has been tested successfully on all I/Os though..

08:26 <aw_> yeah...it's close enough to cause this

08:26 <wpwrak> but ... if it was just flux, it should have the same conductance in both directions

08:26 <aw_> yeah...so that's too weird to say it's a flux problem now. ;-)

08:26 <wpwrak> maybe heat damage in the FPGA from the rework ?

08:27 <wpwrak> but then, i'm still not entirely sure whether we're seeing cause or effect here

08:27 <aw_> hmm....don't know exactly . but i knew factory used heat air to blow C238 and R30

08:28 <aw_> so only I go to Xray to find secrets on 0x3c/0x77

08:28 <wpwrak> do you have an xray session planned ?

08:28 <aw_> yes, no cause surely known now

08:28 <aw_> yes, sure

08:29 <aw_> but I hope do X-ray later

08:29 <aw_> so i think that I go for next reworks on fix2b to accumulate 30pcs of 'avail-fix2b' done

08:30 <aw_> then I think from them, we may get more boards like 0x77 similar too.

08:30 <wpwrak> heh :)

08:31 <aw_> at the end, we go for X-ray to see if any consistence existed inside of this weird problem.

08:31 <wpwrak> ah, one more test please: 0x77, correlation of TP36 and pin 17 (A11)

08:31 <aw_> okay

08:31 <wpwrak> that would be a neighbour of FLASH_RESET_N

08:32 <wpwrak> (the other one is VPEN, which is tied to 3V3)

08:33 <aw_> no instability more.. :(

08:34 <wpwrak> hi murphy ! good to see that you're watching :)

08:34 <aw_> good ...reproduced now

08:34 <wpwrak> ah, nice ;-)

08:34 <wpwrak> i'm beginning to like 0x77. that's a good board. fails whenever we ask it to. not like 0x3c ;-)

08:40 <aw_> not A11, no correlation

08:41 <wolfspraul> is there anything in 0x77 that we believe we can learn that impacts other boards?

08:41 <wolfspraul> if no - put 0x77 aside. if yes - continue studying it.

08:41 <wolfspraul> writing off 0x77 is no scary thought to me

08:41 <wolfspraul> delaying rc3 sales by one day (for example) is a scary thought

08:41 <wolfspraul> so we need to balance between those two...

08:42 <wolfspraul> I'm not following the electrical analysis and logic in detail today, so can just repeat the obvious high-level thinking...

08:42 <wpwrak> aw_: (no correlation) good, thanks

08:44 <wpwrak> wolfspraul: 0x77 is tricky. what's worrying is that 0x77 and 0x3c both show intermittent instability. on 0x77 it's fairly easy to reproduce, on 0x3c not so easy.

08:45 <wpwrak> wolfspraul: so far, we don't have a good explanation of what's going on. 0x77 exhibits one anomaly that could be causally linked, but 0x3c doesn't show this anomaly.

08:45 <wolfspraul> like I said. the key question is "can we learn _anything_ that applies to other boards?"

08:46 <wolfspraul> I understand that question is not easy to answer, but that's what it's all about.

08:46 <wolfspraul> if you are the first who can say "no" to that question, and you are right, that's great value

08:46 <wolfspraul> of course not if you were wrong :-)

08:46 <wolfspraul> xray pile...

08:47 <wolfspraul> glancing over the result today doesn't make me worried about fix2b and our ability to produce 100% pass boards

08:47 <wpwrak> wolfspraul: yes, xray pile sounds best for 0x77 and 0x3c for now

08:47 <wolfspraul> so let's move forward

08:49 <wpwrak> maybe we'll have new ideas in a while as well. i don't have anything else i'd want to try on 0x77 at the moment. there are a few "destructive" tests (repairable) one could do to 0x77, but i'd save them for as late as possible

08:50 <aw_> wpwrak, so I'll go for fix2b reworks firstly...but if you just think out any possible cause reason/or idea, you ping me.

08:51 <wpwrak> also because they may just make the problem disappear for the wrong reason. e.g., there may be a feedback loop. if you break it, the instability may vanish, but once you restore the normal functionality, it'll be right back.

08:52 <wpwrak> aw_: yes, sounds good. thanks for all the testing ! we made some good progress into understanding the behaviour of that critter :)

08:52 <aw_> wpwrak, hmm...good reminder that this could be possible as transfer functions: positive feedback or negative one. maybe

08:52 <aw_> so i go reworks firstly. ;-)

08:53 <wpwrak> aw_: the feedback loop i have in mind would be unknown -> PROGRAM_B -> FLASH_RESET_N -> unknown -> ...

08:54 <wpwrak> aw_: we could break the loop by removing D16, but that wouldn't remove the unknown -> PROGRAM_B path. without the feedback, maybe the board will just reset a few times and then boot, so you never notice that something is wrong. but of course, when you bring the flash reset back, also the feedback loop returns.

08:55 <wolfspraul> just replace 0x77 with 0x78 :-) (joking, joking)

08:56 <aw_> as you knew that system transferring is open and close types. I hope this weird problem is not existed as close type, so you remove D16, will let it acted as open type.

08:56 <wpwrak> wolfspraul: like the airlines do when one of their flights crashes ? ;-)

08:58 <wpwrak> aw_: it would be a bit like treating a broken bone with painkillers ;-)

08:58 <wpwrak> very efficient symptom treatment, but ... :)

08:58 <aw_> wpwrak, alright. we do such as later. I am poor adam to do reworks firstly though...keep this surely good idea to get some approach later

08:59 <wpwrak> aw_: and maybe get that xray trip scheduled :)

09:00 <aw_> wpwrak, yup...may more later. but will.

09:01 <wpwrak> wolfspraul: worst-case outcome: FPGA shows extensive heat damage on 0x77 (and similar) but also on "good" boards examined for comparison. would be hard to decide what to do in this case.

11:12 <kristianpaul> oh, wpwrak have a mm1 as well, nice !

11:25 <wpwrak> not yet ! at the moment, it may be in memphis or maybe a few hours south-southeast of memphis

11:45 <wpwrak> wolfspraul: phrew. just caught up with some really old stuff on #qi-hardware. the RTC thread was scary. pages of supercaps before finally the CR2032 was mentioned ;-)

12:19 <wpwrak> aw_: how's the fix2b on the "good" boards going ? any boards that have gone from "good" to "bad" ?

12:47 <aw_> wpwrak, tonight, wont test just rework. ;-)

12:57 <wpwrak> aw_: a wise decision ;-)

12:59 <wpwrak> aw_: ah, maybe tomorrow morning, you could try board 0x3c. first, see if you can reproduce the problem. and if yes, show TP36 and pin 34 at 10 us/div and at 500 ns/div. that way, we can see if the signal shape is the same or if it's different. also, measure the impedance between TP36 and pin 34 again.

13:00 <wpwrak> aw_: (measure again) e.g., if thermal expansion is part of the equation, the impedance may be "normal" when warm but decrease when cold. i very much hope this isn't the case, but let's be safe and check.

13:04 <wpwrak> i'll be afk for a bit

13:04 <aw_> wpwrak, okay...good reminder on thermal equation thing, thanks.

13:15 <wpwrak> already back :)

13:17 <wpwrak> i missed that today is a holiday ... the bastards introduced it just some 3 months ago, so it doesn't show up in any printed calendar

13:40 <wpwrak> aw_: oh, and when you measured the resistance, which side had the "high" voltage ?

13:43 <aw_> resistance between TP36 and pin 34?

13:45 <wpwrak> yes

13:46 <aw_> it's measured both ways( prober +/-) after powered-off

13:46 <aw_> so i wont be known which side was the high voltage. ;-)

13:47 <wpwrak> which side was the red wire ? ;-)

13:47 <aw_> your question about this was strange, hope that i was misunderstood your meaning. ;-)

13:48 <wpwrak> if you have two multimeters, you could even check that the red wire is really the high voltage :)

13:48 <aw_> oah....sure

13:48 <wpwrak> what i mean is this: when you do a resistance measurement, the multimeter injects a current. one of the two sides must be high and the other low :)

13:48 <aw_> the 10 / 120 was:

13:50 <aw_> 10 KOhm measured was red (high) on TP36, so 120 KOhm was red on pin 34

13:51 <aw_> phew~ just understood your question though. he ;-)

13:51 <wpwrak> kewl, thanks ! that would even make sense

13:53 <aw_> oh, yup

14:14 <roh> wpwrak: wouldnt it be helpful if you would be in taipei now?

14:15 <wpwrak> not sure. if i had my lab with me as well, yes :)

15:54 <roh> wpwrak: heh.. i see.. so we don't have a lab in tpe?

17:02 <wpwrak> roh: well, adam's home lab. TDS1012, etc.

19:22 <Fallenou> lekernel_: I've seen you merged a few commits from rtems cvs head into mmstaging 5 days ago, but you didn't merge the changes to cpukit/zlib/zconf.h.in , is it on purpose that you stay with the v1.1 instead of the v1.1.1.2 ?

19:23 <lekernel_> no

19:23 <lekernel_> maybe that's some git-cvs bug

19:24 <lekernel_> that would explain the problems

19:24 <Fallenou> yes maybe

19:34 <wpwrak> lekernel_: hey, good to see you here ! :) look what new tricks your M1 has learned:

19:35 <wpwrak> lekernel_: http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-TP1.JPG

19:35 <wpwrak> CH1 is TP36 / PROGRAM_B, CH2 is the 3.3 V supply (we were looking for power supply glitches - without finding any)

19:36 <lekernel_> yeah I saw that thing already

19:36 <lekernel_> :(

19:36 <wpwrak> that's when the board should be loading its standby bitstream. instead, it's having fun with "interesting" signals on PROGRAM_B

19:36 <wpwrak> ah, all of it ?

19:36 <wpwrak> also this ? http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-NOR-pin34-DQ8_500ns.JPG.JPG

19:36 <lekernel_> well, that one scope picture

19:36 <lekernel_> no, not the second

19:36 <lekernel_> what's that?

19:37 <wpwrak> that's CH1 on TP36, as before, CH2 on NOR DQ8

19:37 <wpwrak> here a bit zoomed out: http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-NOR-pin34-DQ8.JPG

19:38 <lekernel_> I don't see anything worrisome about DQ8 on those pictures, but TP36 is pure crap

19:38 <wpwrak> now, you may wonder what on earth PROGRAM_B and NOR DQ8 have in common. well, PROGRAM_B is on ball AA1, and NOR DQ8 is on ball AA2.

19:38 <lekernel_> hmm...

19:38 <lekernel_> solder bridge?

19:38 <wpwrak> the 500 ns pic shows that DQ8 seems to push PROGRAM_B

19:39 <wpwrak> adam measured and found ~10 kOhm in one direction, ~120 kOhm in the other

19:39 <wpwrak> so there's a diode somewhere in the mix

19:40 <wpwrak> joerg thinks it's a fried FPGA. ESD or such.

19:40 <wpwrak> i was wondering if you had any other interpretation of this mess, or any ideas what to try

19:41 <lekernel_> you could be simply measuring some CMOS protection diode to VCC through some pull-up resistor

19:41 <wpwrak> we have another board with similar symptoms but DQ8-PROGRAM_B "resistance" normal (about 60/120 kOhm, like on several "good" boards)

19:42 <wpwrak> yes, we must be measuring something like this. what's remarkable is that all the other boards are around 60/120 kOhm, while this one is about 10/120 kOhm

19:43 <wpwrak> the measurement itself is very dirty, because it's not anything ohmic we're measuring there. so the value also depends on adam's instrument, etc.

19:43 <lekernel_> maybe you could show that to a xilinx fae?

19:43 <wpwrak> but it seems the measurement is repeatable among several "good" boards. and the one with that rather interesting correlation between PROGRAM_B and DQ8 happens to be different.

19:44 <lekernel_> 'interesting'... hm :-)

19:44 <wpwrak> maybe you can try that. i have no idea about their support :)

19:45 <lekernel_> I would rather call it pesky, annoying, time-wasting, and other such adjectives :-)

19:45 <lekernel_> well, debugging those things is basically their job

19:45 <lekernel_> they do that all the time

19:45 <lekernel_> and they're often good at it

19:46 <wpwrak> i'm actually less worried about "saving" this board, but about finding a reliable test that tells us if something is amiss there. because we have another one with similar symptoms on TP36, but where DQ8 measures normally. we don't know yet if DQ8 and TP36 appear connected there, too, or if it's maybe another pair of pins that join forces

19:46 <lekernel_> could it be some problem in the PCB substrate? or regular capacitive/inductive crosstalk?

19:46 <lekernel_> seems unlikely because you're seeing a diode...

19:47 <wpwrak> plus, we have some more boards with strange effects on the NOR signals. of course, if the FPGA's I/O pads in that area are damaged, that could explain a lot of strange effects. but i wouldn't jump to conclusions just yet. maybe it's a completely unrelated problem.

19:47 <lekernel_> unless Murphy created some FR4-based semiconductor ofc :-)

19:47 <wpwrak> at first, i suspected flux. then we found the diode ;-)

19:52 <wpwrak> DQ8 was also more of a lucky discovery. we found it by examining traces adjacent to PROGRAM_B or FLASH_RESET_N (the latter is of course affected as well, so we may even see a feedback loop. of course, if we were to break the feedback by removing D16, we wouldn't solve the underlying problem.)

19:53 <wpwrak> that is, if it's really DQ8 affecting PROGRAM_B. the signal shape would agree with this theory. there could of course be another signal that just looks the same, and DQ8 simply happens to have the same pattern.

19:55 <wpwrak> also, DQ8 isn't all that pretty. there are some interesting little runts in the 500 ns picture. note that, according to adam, OE# is held low throughout all this. so DQ8 should be driven all the time.

19:57 <wpwrak> ah, and to make it all more interesting: the boards affected by this don't always show this "noise" on PROGRAM_B. 0x77 does it almost always, but 0x3c is much less eager. it seems that, with increasing temperature, the probability of this occurring drops in 0x3c.

19:58 <wpwrak> as in: in the morning. adam just had it happen quite often, but later on, he booted into standby and sometimes even further about 15 times, without a single such anomaly

21:10 <wpwrak> lekernel_: anyway, so you haven't come across any xilinx errata saying that PROGRAM_B can become an output with weird signals ? or anything else that would explain the madness ? (besides the hypothesis that the chip is indeed damaged)

21:56 <roh> wpwrak: are there more than 1 board with that behaviour?

22:11 <wpwrak> roh: two known cases so far with the same weird pattern on TP36/PROGRAM_B

22:12 <wpwrak> roh: we haven't analyzed the 2nd one far enough to know if there's also a correlation between NOR.DQ8 and FPGA.PROGRAM_B, though. one problem with this board is also that it is a lot more reluctant to exhibit the problem.

22:14 <lekernel_> no, this is completely unexpected

22:14 <lekernel_> are you sure this comes from the fpga? it might be the reset IC too...

22:15 <wpwrak> roh: it's probably upset that we caught it. presumably, it was hoping for the opportunity to embarrass the VJ at some great festival

22:15 <lekernel_> also, maybe it works this way: PROGRAM_B is (wrongly) pulsed, FPGA deconfigures itself, and reads some memory address that happens to drive DQ8 high

22:17 <wpwrak> lekernel_: i've refrained from rework so far. but ... it seems odd for the reset chip. 1) the input voltage is perfectly stable. 2) the voltage jumps between ~1.7 V and 3.3 V. 3) the correlation with DQ8 would make even less sense then. so for now, i don't suspect the reset ic. but when we enter the "remove components" phase, then this would be the first component to go.

22:19 <wpwrak> lekernel_: (PROGRAM_B -> DQ8) not impossible, but the timing seems to fit a bit too well. e.g., why would PROGRAM_B drop just when DQ8 drops ?

23:39 <wpwrak> lekernel_: btw, can you tell me what signals connect to the balls around AA1 ? i.e., AB1-2, we already know AA2, and Z1-2 ? faster if you just click on them in altium than me visually searching the PDF :)

23:41 <wpwrak> err, make that Y1-2. there's no Z row, sorry

23:51 <wpwrak> i found AB2 = FLASH_D9, Y2 = SDRAM_DQ0, and Y1 = FPGA_VREF

23:53 <wpwrak> ah, and AB1 is ground

23:56 <wpwrak> so potential candidates would be NOR.DQ9 = pin 36 and U14/U15 DQ0 = pin 2
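
The search for balls around AA1 can be sketched as a neighbour lookup on the BGA grid. The sketch assumes a 22x22 grid with the usual BGA row lettering that skips I, O, Q, S, X and Z (which is why there is no Z row); the signal table holds only the assignments mentioned in this log, and all names are illustrative.

```python
# Row letters of a 22-row BGA, skipping I, O, Q, S, X, Z (assumed lettering).
ROWS = "A B C D E F G H J K L M N P R T U V W Y AA AB".split()

# Ball assignments as stated in the log.
SIGNALS = {
    "AA1": "PROGRAM_B",
    "AA2": "FLASH_D8",
    "AB1": "GND",
    "AB2": "FLASH_D9",
    "Y1": "FPGA_VREF",
    "Y2": "SDRAM_DQ0",
}

def neighbours(ball: str, n_cols: int = 22) -> list:
    """Return the balls directly or diagonally adjacent to `ball`."""
    row = ball.rstrip("0123456789")
    col = int(ball[len(row):])
    r = ROWS.index(row)
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, col + dc
            if 0 <= rr < len(ROWS) and 1 <= cc <= n_cols:
                found.append(f"{ROWS[rr]}{cc}")
    return found

for ball in neighbours("AA1"):
    print(ball, SIGNALS.get(ball, "unknown"))
```

For AA1 this yields Y1, Y2, AA2, AB1 and AB2, the same set of balls asked about above.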

23:59 <wpwrak> if the supposed damage spreads wider, then we would have USBB on AB3, Y3. there's at least one board with an unexplained USB-B failure. but this is a little thin evidence this far.