<wpwrak>
hmm, regarding the M1 extension connectors, are they keyed (i.e., is it possible/easy to reverse the polarity of a plug ?)
<wpwrak>
also, how's the mechanical firmness of the board around them when in the case ? does it flex when inserting/removing a plug ?
<wpwrak>
also, regarding keying, since they're both 9x2, can they be told apart except by position ? e.g., all other things being equal, color-coding would accomplish this
<wpwrak>
J21 (the one with 3V3) has a nasty failure scenario: short any of pins 1 and 2 with 3 and 4, and the whole 3V3 rail becomes 5 V. i wonder if the M1 survives this :)
<wolfspraul>
all very good points. Mechanically, the two 9x2 headers are quite close to each other, I'm wondering whether it's possible to connect an expansion board into each one at the same time.
<wolfspraul>
maybe we should define a size for them, so when people start to build something they know it in advance. then later if things are moved around on the board, we can keep that guaranteed size in mind so people can continue to use their older expansion boards in newer m1 boards...
<wpwrak>
expansion boards would probably also want some form of additional attachment. not nice to have a board hang off a header, acting as lever
<wpwrak>
the closeness of the connectors also eliminates a simple "bridge" structure, e.g., used by arduino
<wpwrak>
aw_: heya ! well rested from the weekend, and ready for battle ? what's the plan for today ?
<aw_>
wpwrak, go all rest boards to be fix2b version. ;-) meanwhile you can interrupt me if need. so long big rework again this week.
<wpwrak>
ah, so no further analysis planned on the ones with incorrect TP36/TP37 voltage ? or the ones with NOR/bus issues ?
<aw_>
wpwrak, the more rework the more new problems i encounter.
<wpwrak>
heh ;-)
<aw_>
wpwrak, no, we can still plan analysis
<aw_>
even now
<wpwrak>
for those with incorrect TP36/TP37 voltage, I would suggest to try the 3.3 V injection again (on TP36/37)
<wpwrak>
i.e., connect 3.3 V through 100 Ohm and via an amperemeter, then see how much current flows into TP36/37
<aw_>
so later, if the in-circuit measurement of D16 is not constant on any board, it's most likely that D16 or C238 is bad from soldering, and i can just replace them with new ones.
<aw_>
i see
<aw_>
let me set up a little, then we check them
<wpwrak>
yes, C238 would be a likely candidate there. the measurements should guide us :)
<wolfspraul>
aw_: let me explain the 'big picture' plan as I see it this week, only high level.
<wolfspraul>
first you continue with a mix of applying fix2b to all 90 boards, and analysis of boards with problems
<wolfspraul>
after we have 30 boards that are 100% pass with fix2b applied, you pause that fixing and testing work, and spend a day or two to assemble (case) and package 30 full retail units of M1
<wolfspraul>
once those 30 are ready for shipping, I will start catching up with some people that are waiting to buy, launch, shop, etc. you need to reserve maybe 2h / day or so for shipping out stuff. No worries you will get all paperwork prepared (invoices, HS code, 1040 form for US, etc)
<wolfspraul>
after the 30 are ready for shipping, you go back to step 1, that is applying fix2b to all 90 together with analyzing boards with problems
<wolfspraul>
that's how I see it :-) Let me know if it sounds wrong somewhere...
<wpwrak>
wolfspraul: (spacers) the extra mechanical support is also needed because you can have boards that are TX/RX only, so they don't form a "bridge"
<aw_>
wolfspraul, since current "available" boards is more than 40pcs, this could be the first round of fix2b rework at one time job in order to get 30 full retail units.
<wolfspraul>
aw_: I don't think you should only go through the current 'available' boards. That's a little risky because we may still overlook a bigger problem somewhere.
<aw_>
or you just wanted an exactly 30 full retail units of M1, thus is 30pcs main boards 'available' enough, then immediately pause?
<wolfspraul>
so I think you can mix in some interesting or failure boards, if only to see that our fix2b and testing process is now strong and can always identify well between pass and failure.
<wolfspraul>
no not really, I just explain my thinking
<wolfspraul>
which is that after we have ca. 30 'avail - fix2b' boards, you need to pause that work for 1-2 days to do assembly and packing
<wolfspraul>
then continue with fixing and testing
<wolfspraul>
but right now, it's still important to analyze some failed boards, as planned. so that we are sure everything is under control.
<wolfspraul>
here is a simpler version: :-)
<wolfspraul>
1. you continue exactly as you did last week. fixing and testing, analyzing some failure boards.
<wolfspraul>
2. at some point I will interrupt you, when I see enough boards (ca. 30) that I believe we can sell
<wolfspraul>
that's all :-)
<wpwrak>
wolfspraul: i think what adam is saying that he already has 30 "available" boards, so your stopping condition is already fulfilled
<wolfspraul>
no
<wolfspraul>
they don't have fix2b applied
<wpwrak>
aah, i see
<kristianpaul>
they lack fix2b
<kristianpaul>
ah yes :)
<kristianpaul>
hi+
<wolfspraul>
I just want to prepare Adam that there will be an interruption at some point, which is when we have ca. 30 fix2b 100% pass boards, and we are confident in our design, fix2b, and testing process.
<wolfspraul>
then there's an assembly and packaging interruption, then back to fix2b/testing/fixing for all 90 boards
<wolfspraul>
aw_: sorry now I wrote so much :-) but just repeat the same thing 3 times. did you understand / agree with the process?
<aw_>
wolfspraul, i agree, but here is what I'd do:
<aw_>
1. there's already more than 40 pcs put in "available" stage, not include passed "avail - fix2b"
<aw_>
2. mix good and bad boards, from the wiki results; the facts: the failure boards are currently the big "impedance" ones and a few usb/midi failure boards that I haven't fixed to "available", which are useless for approving the fix2b design now.
<aw_>
3. from the first cluster with fix2b, we encountered a new branch of failure boards like 0x32/0x3c/0x77, which I'd say are bad boards so far. Those we can keep for analysis.
<aw_>
4. so the idea is to use those more than 40 pcs "available" boards to accumulate 30 pcs "avail - fix2b", what do you think?
<aw_>
5. once we reach 30 pcs "avail - fix2b", then we pause.
<wpwrak>
aw_: what is "big impedance" ?
<aw_>
wpwrak, before the m1 is powered on, I first measure the impedance on TP1 ~ TP4, TP33. Those boards had a 'short' condition, so there's surely no need to apply the 'fix2b' circuit.
<aw_>
wolfspraul, is that reasonable i replied?
<wpwrak>
ah, so it's "low impedance" :)
<aw_>
for me, preparation is likely to be done as material to One-time job. so rework should be done from those 40 pcs "available" firstly
<wolfspraul>
ok wait, reading :-)
<aw_>
later when everyday i tested, the avail-fix2b will be accumulate rising to 30pcs then we pause
<kristianpaul>
oh
<wolfspraul>
aw_: yes, all reasonable. _BUT_ I think you should definitely do some analysis in parallel, today, tomorrow. 0x32/0x3C/0x77/0x85/etc.
<wolfspraul>
not 100%, but in parallel with finishing more fix2b boards
<aw_>
wolfspraul, yes...just in parallel to feedback info from bad boards
<wolfspraul>
correct
<wolfspraul>
we have a good plan :-)
<aw_>
well..sometimes let's see how amperemeter measured firstly....these would always be as ping pong status, hard to define a day that belongs to design validation day or productive day. ;-)
<aw_>
well...no more chats now...only work or logical analysis from now. ;-)
<aw_>
0x32: d2/d3 is fully off, tp36/tp37 is stable 3.3V, tp36 - 0.08mA, tp37 - 0.014mA
<aw_>
0x32: d2/d3 is fully off after power on, tp36/tp37 is stable 3.3V, tp36 - 0.08mA, tp37 - 0.014mA
<wpwrak>
does it boot ?
<aw_>
i didn't put it boot stage, so NO,  it's not.
<aw_>
just standby stage for boot
<wpwrak>
so what you mean is that the NOR isn't fully programmed yet ?
<wpwrak>
from last week, i see that 0x32 had some garbled NOR content (in the standby bitstream)
<aw_>
not exactly to say that. this have to be measured/triggered tp36 with tp35(DONE pin), to know if it's been finished reconfiguration stage.
<wpwrak>
maybe it just needs a reflash
<aw_>
wpwrak, yes, i dumped 0x32 last week. good that you checked dump file already
<wpwrak>
there's something strange with this board, though: sometimes, TP36/TP37 voltages are good, sometimes they're not. maybe give the reset circuit a good visual inspection. look for broken solder joints or things that could short a component.
<aw_>
no, before we do a next reflash, are we missing some info or do we need to scope somewhere?
<wpwrak>
if you want, you can take another dump to check whether the NOR content is the same (or if something "magically" changes it)
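A byte-wise comparison is enough to tell whether two NOR dumps agree or whether something changed between reads. The following is a minimal Python sketch, not the tool used in this session; `diff_dumps` and the sample bytes are hypothetical, and a real run would read the two dump files from disk instead.

```python
def diff_dumps(first: bytes, second: bytes):
    """Return (offset, byte_in_first, byte_in_second, xor) for every differing byte."""
    if len(first) != len(second):
        raise ValueError("dumps differ in length")
    return [
        (offset, a, b, a ^ b)
        for offset, (a, b) in enumerate(zip(first, second))
        if a != b
    ]


# Made-up sample data standing in for two reads of the same flash region.
read_1 = bytes([0xFF, 0x12, 0x34, 0x56])
read_2 = bytes([0xFF, 0x12, 0x30, 0x56])

for offset, a, b, flipped in diff_dumps(read_1, read_2):
    print(f"offset 0x{offset:06x}: 0x{a:02x} -> 0x{b:02x} (bits 0x{flipped:02x})")
```

The XOR column shows which bits flipped, which may help separate a stuck data line (same bit every time) from random cell or bus errors.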
<aw_>
wpwrak, yes, exactly sometimes it indeed, this morning I 've seen 0x32 auto boot again (d2 is ON) after powered on.
<aw_>
when it auto boot, i didn't touch it though
<aw_>
wpwrak, alright, try to if can dump it...
<wpwrak>
hmm, i'll understand all these LEDs much better once i get to play a bit with the M1 that's currently in ... memphis ;-)
<aw_>
wpwrak, yeah..you bet will.
<aw_>
dumping...
<aw_>
fact on 0x32: i 've soldered pins on flash chip...it could be worse than hide the problem from my soldering
<wpwrak>
when did you do this ? recently or long ago ?
<aw_>
wpwrak, long day ago, before applied for fix2b from histories,
<aw_>
so i would just dump one time then we don't spend much time on 0x32, then back to 0x3c to see
<wpwrak>
ah, you replaced the NOR chip. okay, that could indeed cause a lot of fun :)
<aw_>
no, not replaced chip
<wolfspraul>
maybe the focus on 0x32 should also be to get it to work, rather than to analyze its current state?
<aw_>
resoldered pins of NOR
<wpwrak>
aw_: why did you resolder the pins ? did anything look wrong with them ?
<wpwrak>
wolfspraul: well, at the end of the day, that's the objective of the analysis ;-)
<aw_>
wpwrak, i thought it had a soldering problem. ;-) long ago. at that time you hadn't jumped in here yet. ;-)
<wpwrak>
wolfspraul: i suspect we may end up with an uncertain status for that board, though: it may work but with a history of failures where nothing has been done to correct them
<wpwrak>
aw_: so the soldering problem was just a theory, but you didn't actually see or measure anything wrong ?
<wolfspraul>
I stay back. I just hope we are focused. Either we learn something, or we produce a result (make it sellable). But not get stuck on 0x32...
<aw_>
wpwrak, agreed. so let's just dump once. yes, i didn't actually see or measure anything wrong
<aw_>
wpwrak, we can come back to 0x32 later, after i replace the chip with a new one. not now.
<wpwrak>
aw_: the next step for 0x32 would probably be to reflash. it may then boot and render without further rework. but as i said, we may not be able to trust it.
<aw_>
wpwrak, agreed. then we stop 0x32
<wpwrak>
aw_: but the dump first :) and afterwards maybe reflash and see if it comes up. then we can forget it. let's not have a gazillion almost finished boards around. that just causes confusion.
<wpwrak>
wolfspraul: with 0x32, i just want to make sure it doesn't tell us anything new about the NOR. if it behaves, which it currently seems to be inclined to do, then it'll be a "kinda works but prefer not to sell it to customers" board. in a few minutes :)
<aw_>
some sort of pieces of your words I need to record in notes ;-)
<wpwrak>
kristianpaul: 121 is EREMOTEIO. that's a funny one, never saw it
<aw_>
wpwrak, 0x3c: you are right, good boot to rendering.
<wpwrak>
kristianpaul: (Remote I/O error)
<wpwrak>
aw_: heh ;-)
<wpwrak>
aw_: does it take a lot of time to run the other tests on 0x3c ? (the various tests you normally run, CRC, USB, MIDI, etc.)
<kristianpaul>
I'll upgrade next week, when get a new laptop to play..., i wasted too much time on this.. but may be some Fedora 15 user around want to give it a try to the package in  F-16 :-)
<wpwrak>
kristianpaul: ah, your bug has a response. good ...
<aw_>
you can scroll down a bit, there's three notes I marked
<wolfspraul>
bah. I would reflash and resolder and replace NOR on 0x32 and 0x3A until they work. what is the value of NOR paranoia that just leads us to some bad soldering in the end...
<wolfspraul>
but sure, move them aside now and fix later. but not study forever - no value.
<wolfspraul>
just fix them
<kristianpaul>
wpwrak: yeah. but i'm back to trusty debian now ;)
<aw_>
wpwrak, so yes surely I tested 0x3c by test program, but no more tests log I recorded after applied fix2b circuit.
<wpwrak>
wolfspraul: that's just hiding the problem :)
<aw_>
wpwrak, 0x3c is weird, why did it have messy pulsing on tp36/tp37 before, then the dump shows correctly and then it works?
<wpwrak>
wolfspraul: maybe it's bad soldering. maybe not. the board also has strange things happening on the reset circuit. just making random changes until it works tweaks the statistics against you - you're actually decreasing the coverage of your production test.
<aw_>
guessed that prober's capacitance influence 0x3c's tp36 weh i probered it.
<aw_>
s/weh/when
<wpwrak>
aw_: measure again ? if it's really only a probing problem, that would also shed some new light on the issue
<wpwrak>
aw_: but i somehow don't think it's so easy :)
<aw_>
wpwrak, 0x3c notes: 1. No VGA screen, replaced a new u19 then pass also as well as video input shows normally 2. rendering @ 2nd then can't reconfigure 3. replaced new u7/u19/u20 4. d2/d3 dimly lit while TP37 and TP36 is unstable level, range 1.2V to 3.3V: http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x3c_ch1-tp37.JPG 5. when I attached prober on tp37, after a few seconds the messy pulse disappeared and stayed at 3V3 steadily and
<aw_>
d2/d3 is fully off. Messy pulses again after pressing middle btn. 6. applied fix2b 7. D16(in-circuit): For.V.=165mV, Rev.V = 1549mV 8. reflashed successfully 9. d2/d3 dimly lit after powered-cycle(tp36/tp37 pull high well) 10. d2/d3 dimly lit(tp36/tp37 is messy signal level 1.2V ~ 3.3V) after power-cycle, touching TP36 with the prober can intermittently make d2/d3 go fully OFF, and the board can boot up after pressing middle btn. 11. dum
<wpwrak>
a bug that disappears when you try to analyze it
<aw_>
tp36 - 4mA, tp37 - 0
<wpwrak>
whoopie !
<aw_>
wait
<aw_>
sorry
<aw_>
tp36 - 0.004mA, tp37 - 0
<wpwrak>
ah. boring :)
<aw_>
typed too fast...sorry to confuse. :)
<wpwrak>
still  ... 4 uA may be significant. lemme calculate ...
<wpwrak>
hmm, even with a voltage difference of 1.3 V, that would be ~325 kOhm. weaker even than the pull-up.
<wpwrak>
but .. what was the voltage ?
<aw_>
3.29V
<wpwrak>
ah okay, then it may just be resistance along 3V3
<aw_>
read from scope
<wpwrak>
0.4 mV difference between the two ends of your 100R+meter setup. that's quite reasonable
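The 0.4 mV figure is plain Ohm's law applied to the 100 Ohm series resistor at the measured 0.004 mA. As a small Python sketch (the function names are made up for illustration, and the ammeter's own burden resistance is ignored, which is an assumption):

```python
def drop_across_resistor(current_a: float, resistance_ohm: float) -> float:
    """Voltage drop (V) across a series resistor carrying current_a."""
    return current_a * resistance_ohm


def equivalent_resistance(voltage_v: float, current_a: float) -> float:
    """Resistance (Ohm) that would pass current_a at voltage_v."""
    return voltage_v / current_a


tp36_current = 0.004e-3  # 0.004 mA measured into TP36
series_resistor = 100.0  # the 100 Ohm injection resistor

drop = drop_across_resistor(tp36_current, series_resistor)
print(f"drop across 100R: {drop * 1e3:.2f} mV")

# Leak resistance implied by the same current at an assumed voltage difference.
leak = equivalent_resistance(1.3, tp36_current)
print(f"equivalent leak at 1.3 V: {leak / 1e3:.0f} kOhm")
```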
<wpwrak>
maybe power cycle a few times to see if the instability comes back
<aw_>
yup
<wpwrak>
aw_: when you saw the instability happen before, was that just during an attempt to power on ? or did you have to do something in addition to this ? e.g., press the middle button ?
<aw_>
wpwrak, no need to press middle button, just used prober to touch tp36 (sometimes it stayed stable at 3.3V, then the unstable level/pulsing happened).
<wpwrak>
okay. let's see if more power cycling causes it to appear again
<aw_>
i just finished 5 power cycles with two probers attached, waiting 10 seconds each, no instability happened (also d2 is fully off), seems that heisenbug likes me.
<wpwrak>
do you have an estimate of how many power cycles you did after fix2b and how many times the instability appeared ?
<aw_>
i recorded the instability in my notes each time it happened, but i have no systematic count of how many times i've seen it in total
<aw_>
but I just felt one condition:
<wpwrak>
okay. let's just try a few more times.
<wpwrak>
let's say up to 20, so 15 more
<aw_>
1. could this bug be temperature-related? seems now I can't reproduce it..
<aw_>
2. this morning when I first powered on and probed tp36, it was easier to see the messy pulsing..
<wpwrak>
if that doesn't make it happen, then try letting it boot a little more (i.e., press the middle button), then power cycle
<wpwrak>
temperature could be a factor, yes
<wpwrak>
residual charges stored in caps may be another factor
<wpwrak>
but let's vary one parameter at a time. if simple power cycling doesn't help, try maybe 5-10 times with booting the system (middle button)
<wpwrak>
if it still doesn't happen. try cycling with longer off periods. e.g., leave it off for ~1 min between tries. also 5-10 times. (you could combine this with lunch or some other fix2b rework :)
<aw_>
yup...i just accumulated 5 times of powered -cycle then still d2 is fully off good, no messy scope happened
<wpwrak>
if still nothing happens, i'd let the board cool down and discharge itself until tomorrow morning.
<aw_>
trying boot to rendering and power-cycle now
<aw_>
while this, i keep an eye on watch scope.
<aw_>
5 times of boot to rendering with power cycle
<aw_>
all worked well, no abnormal condition
<wpwrak>
this bug is a slippery one
<wpwrak>
let's increase the power-off time then to ~1 min
<aw_>
okay
<wpwrak>
if still nothing happens, 0x3c gets a rest until tomorrow morning. maybe we can then give 0x77 a quick try.
<wpwrak>
for the measurements, you're soldering wires to the test points ?
<aw_>
yes, for preparations(soldering) on 0x77's tp36/tp37
<wolfspraul>
have we tried fixing 0x32 by not putting the focus on learning/analyzing, but by simply replacing parts that could potentially be the source?
<wpwrak>
okay. so you're adding the wires to 0x77 already. good. that way, we can measure when it powers up the very first time.
<wolfspraul>
c238, d16, reset ic, nor chip, etc.
<wolfspraul>
that'd be my approach
<wolfspraul>
if the problem is not even reproducible now, don't spend more time on 0x32, just put aside (like you are doing)
<wpwrak>
cargo cult engineering ;-)
<wpwrak>
0x32 is already back on the pile
<wpwrak>
we're at 0x3c now
<wolfspraul>
yes I read it, just thought I throw my 2c in for 0x32
<wpwrak>
0x32 has some strange NOR data path problem. but it's not clear where it is. the reset circuit is most likely not to blame for this.
<wpwrak>
what's odd about 0x32 is that data that is read from the NOR seems to change. so that could be: failing NOR cells, bad I/O buffers in the NOR, some disturbance of the data or address bus (interference ?), bad I/O buffers on the FPGA, some obscure problem on the usb-jtag side.
<wpwrak>
ah, also badly programmed NOR cells could be a cause (i.e., a "soft" error)
<wolfspraul>
replace nor chip
<wpwrak>
so my next steps would be to read the NOR once or twice more tomorrow, see how the pattern behaves. try to program it again. see if it boots. if not, read back and see if there's corruption.
<wolfspraul>
I would replace the nor chip right away
<wolfspraul>
:-)
<wolfspraul>
and not even now, because nothing to learn for fix2b now, so we can do that later
<wpwrak>
naw, that's way too drastic. if it's a soft problem, you don't need to replace the chip
<wpwrak>
yes, it's unrelated to fix2b
<wolfspraul>
then move forward
<wpwrak>
we're already at 0x3c :)
<wolfspraul>
yes I know, good
<wpwrak>
and possibly soon at 0x77
<aw_>
wpwrak, finished 5 times with ~ 1 min power-off time, it all goes well
<aw_>
i stop 0x3c now, we check this tomorrow morning again
<aw_>
let's at 0x77 firstly:
<wpwrak>
the problem with "changing chips until it works" is that you may never solve the problem. so in the next run, you'll just get N times the boards that need arbitrary changes. worse, if the problem persists, you'll just rework the board to death and haven't learned anything.
<wpwrak>
aw_: yes
<wpwrak>
wolfspraul: also, replacing the NOR chip is high risk. you need to heat up a relatively large area, pull the chip without force, clean up all the pads, maybe clean the board from flux (optional, but things get messy quickly if you don't), then solder the new part (this is the easiest bit)
<wpwrak>
wolfspraul: so there's plenty of potential for damaging pads
<wolfspraul>
I have a different perspective. Not everything needs to be understood, it's about economics.
<wolfspraul>
once we have the feeling that there is nothing that we will learn from 0x32 (for example) that applies to any other board, the value of 0x32 drops dramatically.
<wolfspraul>
in fact at that moment it's probably not worth even 5 minutes of the time of someone like Werner
<wolfspraul>
it's difficult to make the decision about 'can we still learn something that applies to other boards?' though
<wolfspraul>
I'm close to saying: no, we cannot
<wpwrak>
wolfspraul: but that's for the production phase. there, based on prior analysis/experience, you just have a standard set of attempts at fixes, which may include replacing NOR chips. but so far, we don't even have any evidence that there is anything wrong with the chip. maybe it's cross-talk on the bus.
<wpwrak>
wolfspraul: i'm not convinced yet that it's just a freak board. we already have two with similar issues. it's a cluster in the making ;-)
<wpwrak>
wolfspraul: so for now, i'd just examine boards that show good D16 and TP36/37 result but still don't boot for NOR corruption and add any that exhibit variations to that cluster
<wpwrak>
wolfspraul: so for now, we seem to have two trouble areas: at least two boards that exhibit variations when reading the NOR (good old 0x3a and now 0x32), and those that "out of the blue" get instability on TP36/37 (0x3c, 0x77, i think there's at least one more)
<aw_>
0x77: d2/d3 is fully off, tp36/tp37 is stable 3.3V now with two probers touchs
<wpwrak>
wolfspraul: from the symptoms, it appears that the TP36/37 instability may not be the cause but the effect of some other problem
<wpwrak>
apply a bit of pressure on the terminals from various sides
<wpwrak>
see if there's a bad solder joint or something else
<aw_>
i see
<wpwrak>
if C238 is okay, repeat for the reset chip
<aw_>
okay
<aw_>
no, i think their soldering is quite good, i can use microscope to catch them though. ;-)
<aw_>
after I put more pressures on c238 and reset ic is the same, no changes now...
<wpwrak>
if this still doesn't yield anything. then try pushing on the PCB such that it bends a little. around the reset chip, C238, D16, and then anywhere on the board (maybe in a grid pattern with ~1-2 cm spacing)
<aw_>
weird
<wpwrak>
maybe there's a hairline crack somewhere
<wpwrak>
could also be inside a chip
<wpwrak>
if nothing happens, maybe try power-cycling followed by the middle button 1-2 times and see if the problem comes back
<aw_>
hmm..no come back...start to power cycle
<wpwrak>
if all looks good, try to boot
<aw_>
i saw it
<wpwrak>
it came back ? great ! :)
<aw_>
tp37 synchronized to tp36 at 2nd power cycle
<aw_>
caught!
<wpwrak>
can you take a picture showing just two peaks ?
<wpwrak>
can you "zoom" in until you have just 2-3 peaks on the screen ?
<wpwrak>
and after that, try the 3.3V -> 100 Ohm -> amperemeter experiment again. first to TP37. check the current and also see if this ends the instability ?
<wpwrak>
(did i catch them all ? there are so many :)
<aw_>
yes, not big discoveries. ;-)
<aw_>
i have to power off and soldering wires
<wpwrak>
very crazy noise
<wpwrak>
can't you just touch the TPs with the probe of CH2 ?
<aw_>
three pads under usb-jtag board
<wpwrak>
argh
<wpwrak>
maybe just remove the board
<aw_>
I'll firstly unplug usb-jtag
<aw_>
then just touch them sure
<aw_>
moment
<aw_>
hope still reproduce it
<wpwrak>
oh, and how is your M1 supplied ? from a regular power supply or from a lab power supply ? in the latter case, you may want to have a look at the current consumption in the unstable state
<aw_>
from regular power supply
<aw_>
good news, at least tp36 pulsing is not related to TP1~4, TP33, TP26
<wpwrak>
okay, let's keep checking the total system current for later
<wpwrak>
grrr
<aw_>
what else we think TP36 came from ?
<aw_>
okay...sorry that what's "grrr"?
<wpwrak>
"grrr" = i was hoping for an unstable supply rail
<aw_>
okay...;-)
<wpwrak>
and it's unstable also without the jtag board ?
<wpwrak>
is there anything else connected ? ethernet, audio, ... ?
<aw_>
no connections on 0x77 at all
<wpwrak>
that would have been too easy, i guess :)
<aw_>
unless we just removed usb-jtag board and went though all if TPs related to TP36. ;-)
<aw_>
yup..so your guessing on regular supply and lab power supply is reasonable
<wpwrak>
hmm. it would seem that PROGRAM_B becomes an output. or that something else connects into the PROGRAM_B/TP36/reset out net
<aw_>
switching to lab power supply now...
<aw_>
hmm...set limited 1A at lab power supply, 0x77 TP36 still got pulsing
<aw_>
a total lab power current shows 0.55A
<aw_>
alright...seems the messy pulsing is easier to reproduce on 0x77
<wpwrak>
good :)
<aw_>
wpwrak, i think we stop now analysis today
<wpwrak>
wait a minute. two more ideas.
<aw_>
listening
<wpwrak>
but i need to look up something first. i'd be curious about how adjacent traces behave in relation to TP36
<wpwrak>
there are two candidates: the trace "north" of D16, roughly under the white bar that marks the polarity
<aw_>
not very clear on this , can you slowly describe it
<wpwrak>
and the one coming out next to R30
<aw_>
yes
<aw_>
go on
<wpwrak>
i need to find places where you can actually measure them
<wolfspraul>
it just gets interesting with 0x77 :-)
<aw_>
wpwrak, could it be synchronized to the DRAM routes?
<aw_>
which other partly circuit you want to scope, i can check here. ;-)
<wpwrak>
aw_: dunno. doesn't look like DRAM. but lemme find the package definitions ...
<wpwrak>
sigh, if all this was done in kicad, i could just click on the pad and know what it does ...
<wpwrak>
lekernel_: you aren't awake by any chance ? :)
<wpwrak>
one would be ball AA2 ... now, what is this ...
<wpwrak>
ah, AA2 = FLASH_D8 = DQ8 = U9 pin 34
<aw_>
just let me know your idea, i can open my windows tool to see the surrounding signals or balls under the fpga
<wpwrak>
so that's one potential correlation to check: TP36 noise vs. pin 34 of the NOR
<aw_>
got it
<wpwrak>
the other would be AA4 = BTN2 = ... naw
<wpwrak>
another candidate would be FLASH_CE_N, pin 14 of U9 (2 "up" from pin 16 of FLASH_RESET_N)
<wpwrak>
but FLASH_D8 is much more likely. FLASH_CE_N is also on the "wrong" side of D16
<wpwrak>
aw_: to get to the instability, is it sufficient to just connect power ? or did you also have to press the middle button ?
<aw_>
wpwrak, no need to press middle, just power cycle and you can see d2/d3 is either dimly lit or fully off. 0x77 has both messy pulsing.
<wolfspraul>
it must be caused by some part behaving differently from other (functioning) boards. but which part can cause this?
<wpwrak>
ok. are you checking for correlation now ? TP36 vs. pin 34 of the NOR
<wolfspraul>
I would just make a priority list from most likely to least likely, and then replace them one by one with new ones.
<aw_>
wpwrak, yes..the pins are close.. so need to touch very carefully
<wpwrak>
sorry, pin 34
<wpwrak>
err .. pin 34 was correct. getting tired :)
<aw_>
yes, i knew,
<aw_>
so you go sleep first, i
<wpwrak>
aw_: naw, i'll wait ;-)
<wpwrak>
wolfspraul: for the 0x77 problem ? good question :)
<aw_>
I'll scope p34 and FLASH_CE_N
<wpwrak>
wolfspraul: right now, it looks plain impossible. we're seeing signals on TP36 that have no business to be there
<wpwrak>
wolfspraul: plus, they also look wrong. like two outputs working against each other
<wpwrak>
wolfspraul: while all we really should have there is an input
<wpwrak>
wolfspraul: so if the correlation check doesn't yield anything, the next step would be to see if lekernel recognizes something familiar in the scope screenshots. or if he knows of a condition where PROGRAM_B can become a weird output.
<wpwrak>
wolfspraul: if he doesn't have any magic rabbit in his hat, i'd start simplifying the circuit. maybe start with pulling U24
<wpwrak>
wolfspraul: if this doesn't do the trick, remove C238. if the instability is still there, D16.
<wpwrak>
wolfspraul: one problem is that i'm not sure we have enough data to be able to tell when the instability has gone for good. so if it can't be observed at a given point in time, it may be necessary to let the board rest overnight.
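How many clean power cycles it takes before an intermittent fault can be called "gone" can be roughed out with a simple model. This Python sketch assumes every power cycle is an independent trial with the same failure probability, which the log itself puts in doubt (temperature and residual charge may matter), so the numbers are only a lower bound on the effort. The function names and the probabilities used are illustrative assumptions, not measured values.

```python
import math


def miss_probability(p_fail: float, cycles: int) -> float:
    """Chance of seeing zero failures in `cycles` independent power cycles."""
    return (1.0 - p_fail) ** cycles


def clean_cycles_needed(p_fail: float, confidence: float = 0.95) -> int:
    """Clean cycles needed so a fault with rate p_fail would have shown up
    at least once with the given confidence."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# Illustrative rates only: a fault that shows on half of the power cycles
# versus one that shows on one cycle in ten.
for p in (0.5, 0.1):
    print(f"p_fail={p}: {clean_cycles_needed(p)} clean cycles for 95% confidence")
```

Under this model a rare fault needs far more clean cycles than a handful of retries provides, which is consistent with letting the board rest and retrying under changed conditions.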
<aw_>
wpwrak, yes, it seems you're right, tp36 seems synchronized to pin 34 of NOR
<aw_>
moment...i still need to catch a very firm waveform. ;-)
<wpwrak>
wheee ! now that's nice news ;-)
<aw_>
caught, yes!
<wpwrak>
the next two steps: resistance between TP36 and pin 34 of the NOR (for both polarities). compare with the same resistance of two known to be good boards.
<wpwrak>
maybe start with the good boards first to give 0x77 some time to discharge
<wpwrak>
but let's first wait for the evidence ;-)
<wpwrak>
(i.e., the picture :)
<wpwrak>
wolfspraul: if the impedance is unusually low, you'll like the next step: visual inspection of the underside of the FPGA :)
<wpwrak>
nice ! the voltage levels are a little odd, though
<wpwrak>
but let's see about the impedance now
<wpwrak>
or wait
<wpwrak>
maybe take another scope shot at 500 ns/div
<wpwrak>
hmm. thinking a bit more about it. the image does not suggest a simple short. otherwise, DQ8 would have to be at ~1.6 V too (i think the noisy "floor" of DQ8 is Z)
<aw_>
a high impedance of Z as we knew, should pull high chip inside or outside resistor, this waveform still can not say that a possible 'short' under between them
<aw_>
even if it's short, the waveform should not be like that level.
<wpwrak>
let's forget about the impedance for now. this is more mysterious.
<aw_>
do you think that that could be an interconnection inside fpga? as related to program_b?
<wpwrak>
next try: TP36 and NOR pin 54 (OE#/FLASH_OE_N)
<aw_>
okay
<wpwrak>
it could be the FPGA just acting crazy, yes. but then, that's a bit too convenient an assumption :)
<wpwrak>
i'd leave all the "crazy FPGA" theories to sebastien. he's probably read about a good number of FPGA madness issues, so something may sound familiar. if not, we probably have something else.
<aw_>
wpwrak, no pulse on pin 54 of NOR
<wolfspraul>
is there a chance that this board (this particular one, 0x77), could pass our test program and 10 render cycles?
<wolfspraul>
or is it behaving badly enough from what we see so far that that is impossible?
<wolfspraul>
not theoretically, but practically this one, 0x77
<aw_>
wpwrak, no relations between TP36 and pin 54 of NOR
<wpwrak>
wolfspraul: this one may be badly off enough. but the very similar 0x32 went for a while without showing problems.
<wpwrak>
aw_: is OE# low all the time ?
<aw_>
yes, you got it, always at low
<wpwrak>
the results would correspond to having about 5 kOhm between AA1 (PROGRAM_B) and AA2 (FLASH_DQ8)
<wpwrak>
that way, we'd just end up at around 1.6 V, with the kind of charge/discharge pattern on PROGRAM_B we see (C238)
<wpwrak>
where it gets a little mysterious is what would provide these 5 kOhm. flux residues ? cooked I/O driver ?
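The ~1.6 V level is what a resistive divider gives when the pull-up on PROGRAM_B and the leak towards DQ8 are of similar size. The pull-up value is not stated in this log, so the 5 kOhm used for it below is purely an assumption for illustration, as is treating DQ8 as sitting at 0 V. A Python sketch:

```python
def divider_voltage(v_supply: float, r_pullup: float, r_leak: float,
                    v_other_end: float = 0.0) -> float:
    """Voltage at the node between a pull-up to v_supply and a leak
    resistance to a net held at v_other_end."""
    return v_other_end + (v_supply - v_other_end) * r_leak / (r_pullup + r_leak)


V_3V3 = 3.3
R_PULLUP_ASSUMED = 5_000.0  # assumption: pull-up value is not given in the log
R_LEAK = 5_000.0            # the ~5 kOhm between AA1 and AA2 suggested above

node = divider_voltage(V_3V3, R_PULLUP_ASSUMED, R_LEAK)
print(f"PROGRAM_B node: {node:.2f} V")

# A much weaker leak would barely move the node away from 3.3 V.
print(f"with 200 kOhm leak: {divider_voltage(V_3V3, R_PULLUP_ASSUMED, 200_000.0):.2f} V")
```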
<aw_>
so next steps to measure impedance on bad and good board.
<wpwrak>
maybe let's do the impedance measurement now. TP36 vs. pin 34. both ways, i.e., swap probe +/-.
<wpwrak>
yes
<aw_>
regards to if flux residues, this 0x77 i see it that is clear,
<aw_>
need to power off now
<aw_>
okay
<wpwrak>
aw_: (0x77 and flux) no change of something trapped under the FPGA ? :)
<wpwrak>
s/change/chance/
<wpwrak>
i've never seen flux do as little as 5 kOhm. but then, there's a first time for everything ;-)
<aw_>
wpwrak, it could still have chance under FPGA, yes, but you were right, me too on that a flux can take a 5 k ohm?
<wpwrak>
how big is C238 now ?
<wpwrak>
maybe your flux has extra-powerful ions ;-)
<aw_>
10.6 K ohm  from TP36 to pin 54, 118K ohm reversely
<aw_>
sorry that it's pin34
<wpwrak>
hmm
<wpwrak>
how big is C238 ?
<aw_>
bad that i don't have equipment can measure capacitor now.
<wpwrak>
you could build an oscillator with a 555 ;-)
<wpwrak>
then you could _hear_ the capacitance :)
<aw_>
i saw it can be done with an arduino and a few parts. ;-)
<wpwrak>
10/120 kOhm is on 0x77 ?
<aw_>
yes
<aw_>
moment
<wpwrak>
now, two known to be good boards for comparison
<aw_>
59 / 118 k ohm on 0x40
<aw_>
58 / 129 kohm on 0x7a
<wpwrak>
something's a little off :)
<wpwrak>
maybe measure 0x32 too
<aw_>
0x3c: 63 / 131 k ohm
<wpwrak>
so, "normal" = ~60 / ~120-130 kOhm. ~11 / ~120 kOhm of 0x77 is bad.
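The comparison against the good boards can be written down as a small screening check. This is a sketch, not a validated test: the readings are the ones quoted in the chat (kOhm, TP36 to pin 34 first, then reversed), the reference is the median of the two boards introduced as known good, and the 50 % threshold `MAX_DEVIATION` is a hypothetical choice.

```python
from statistics import median

# (TP36 -> pin 34, pin 34 -> TP36) in kOhm, as reported in the chat
READINGS = {
    "0x77": (10.6, 118.0),
    "0x40": (59.0, 118.0),
    "0x7a": (58.0, 129.0),
    "0x3c": (63.0, 131.0),
}
KNOWN_GOOD = ("0x40", "0x7a")
MAX_DEVIATION = 0.5  # ASSUMPTION: flag readings more than 50 % off the reference


def reference(readings, boards):
    """Median reading of the reference boards, per direction."""
    fwd = median(readings[b][0] for b in boards)
    rev = median(readings[b][1] for b in boards)
    return fwd, rev


def suspicious(readings, ref, max_dev=MAX_DEVIATION):
    """Boards whose reading deviates too far from the reference in either direction."""
    flagged = []
    for board, (fwd, rev) in readings.items():
        dev_fwd = abs(fwd - ref[0]) / ref[0]
        dev_rev = abs(rev - ref[1]) / ref[1]
        if dev_fwd > max_dev or dev_rev > max_dev:
            flagged.append(board)
    return flagged


ref = reference(READINGS, KNOWN_GOOD)
flagged = suspicious(READINGS, ref)
print(ref, flagged)
```

With these numbers only 0x77 is flagged; 0x3c stays within the threshold in both directions.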
<aw_>
4KOhm is an equivalent resistor inside that pin34?
<wpwrak>
hmm, 0x32 is about normal. so either it has a separate problem with just the same symptoms or we haven't found the real cause just yet
<wpwrak>
(4 kOhm) yes
<aw_>
agreed >>> so either it has a separate problem with just the same symptoms or we haven't found the real cause just yet
<wpwrak>
do you have any cleaning process that has some hope of removing flux or dirt from under the fpga ?
<wpwrak>
(preferably without just moving the dirt/particle to another area of the fpga)
<wpwrak>
well, but then that probably doesn't make sense
<wpwrak>
hmm. thinking ...
<aw_>
hmm.. this needs quite some more thought, how to check for flux or dirt under the fpga... I've never dealt with this before.
<aw_>
thinking...
<aw_>
as you know, it's still possible for flux to bring the resistance down by as much as (60 - 10) = 50 kOhm, no big surprise. But if it indeed is flux: how come only a few boards got a problem like this? And wouldn't the other balls around PROGRAM_B under the FPGA be affected too, not just PROGRAM_B?
<wpwrak>
well, this is the corner in which the rework was done
<wpwrak>
AA1 and AA2 are in the second row of balls, very close
<aw_>
That's too weird, 0x3c and 0x77 have been tested successfully on all I/Os though..
<aw_>
yeah...it's close enough to cause this
<wpwrak>
but ... if it was just flux, it should have the same conductance in both directions
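The "same conductance in both directions" argument can be made concrete with a rough model. It assumes, purely for illustration, that the multimeter readings can be treated as resistances and that flux would add one ohmic path in parallel with the normal ~60 / ~120 kOhm readings. Readings across semiconductor junctions are not really ohmic, so the numbers are indicative only.

```python
def parallel(r1, r2):
    """Equivalent resistance of two resistances in parallel."""
    return r1 * r2 / (r1 + r2)


def extra_path_needed(r_normal, r_observed):
    """Ohmic resistance that, in parallel with r_normal, yields r_observed."""
    return 1.0 / (1.0 / r_observed - 1.0 / r_normal)


# kOhm values from the chat
R_FWD_NORMAL = 60.0   # "normal" boards, TP36 -> pin 34
R_REV_NORMAL = 120.0  # "normal" boards, reversed probes
R_FWD_0X77 = 10.6     # 0x77, TP36 -> pin 34
R_REV_0X77 = 118.0    # 0x77, reversed probes

# Ohmic path that would explain the low forward reading on 0x77
r_flux = extra_path_needed(R_FWD_NORMAL, R_FWD_0X77)

# The same path would also have to show up with the probes reversed
r_rev_predicted = parallel(R_REV_NORMAL, r_flux)

print(f"extra path: {r_flux:.1f} kOhm")
print(f"predicted reverse: {r_rev_predicted:.1f} kOhm, measured: {R_REV_0X77} kOhm")
```

Under these assumptions a purely ohmic path would pull the reverse reading down to roughly 12 kOhm, far from the 118 kOhm actually measured, which is why a flux-only explanation does not fit.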
<aw_>
yeah...so that's too weird to say it's a flux problem now. ;-)
<wpwrak>
maybe heat damage in the FPGA from the rework ?
<wpwrak>
but then, i'm still not entirely sure whether we're seeing cause or effect here
<aw_>
hmm.... i don't know exactly. but i know the factory used hot air on C238 and R30
<aw_>
so the only way is for me to go for X-ray to find the secrets of 0x3c/0x77
<wpwrak>
do you have an xray session planned ?
<aw_>
yes, no cause is known for sure now
<aw_>
yes, sure
<aw_>
but I hope to do the X-ray later
<aw_>
so i think that I go for next reworks on fix2b to accumulate 30pcs of 'avail-fix2b' done
<aw_>
then I think among them we may find more boards similar to 0x77.
<wpwrak>
heh :)
<aw_>
in the end, we go for X-ray to see if there is any consistent pattern behind this weird problem.
<wpwrak>
ah, one more test please: 0x77, correlation of TP36 and pin 17 (A11)
<aw_>
okay
<wpwrak>
that would be a neighbour of FLASH_RESET_N
<wpwrak>
(the other one is VPEN, which is tied to 3V3)
<aw_>
no more instability.. :(
<wpwrak>
hi murphy ! good to see that you're watching :)
<aw_>
good ...reproduced now
<wpwrak>
ah, nice ;-)
<wpwrak>
i'm beginning to like 0x77. that's a good board. fails whenever we ask it to. not like 0x3c ;-)
<aw_>
not A11, no correlation
<wolfspraul>
is there anything in 0x77 that we believe we can learn that impacts other boards?
<wolfspraul>
if no - put 0x77 aside. if yes - continue studying it.
<wolfspraul>
writing off 0x77 is no scary thought to me
<wolfspraul>
delaying rc3 sales by one day (for example) is a scary thought
<wolfspraul>
so we need to balance between those two...
<wolfspraul>
I'm not following the electrical analysis and logic in detail today, so can just repeat the obvious high-level thinking...
<wpwrak>
aw_: (no correlation) good, thanks
<wpwrak>
wolfspraul: 0x77 is tricky. what's worrying is that 0x77 and 0x3c both show intermittent instability. on 0x77 it's fairly easy to reproduce, on 0x3c not so easy.
<wpwrak>
wolfspraul: so far, we don't have a good explanation of what's going on. 0x77 exhibits one anomaly that could be causally linked, but 0x3c doesn't show this anomaly.
<wolfspraul>
like I said. the key question is "can we learn _anything_ that applies to other boards?"
<wolfspraul>
I understand that question is not easy to answer, but that's what it's all about.
<wolfspraul>
if you are the first who can say "no" to that question, and you are right, that's great value
<wolfspraul>
of course not if you were wrong :-)
<wolfspraul>
xray pile...
<wolfspraul>
glancing over the result today doesn't make me worried about fix2b and our ability to produce 100% pass boards
<wpwrak>
wolfspraul: yes, xray pile sounds best for 0x77 and 0x3c for now
<wolfspraul>
so let's move forward
<wpwrak>
maybe we'll have new ideas in a while as well. i don't have anything else i'd want to try on 0x77 at the moment. there are a few "destructive" tests (repairable) one could do to 0x77, but i'd save them for as late as possible
<aw_>
wpwrak, so I'll do the fix2b reworks first... but if you think of any possible cause or idea, ping me.
<wpwrak>
also because they may just make the problem disappear for the wrong reason. e.g., there may be a feedback loop. if you break it, the instability may vanish, but once you restore the normal functionality, it'll be right back.
<wpwrak>
aw_: yes, sounds good. thanks for all the testing ! we made some good progress into understanding the behaviour of that critter :)
<aw_>
wpwrak, hmm...good reminder that this could be possible as transfer functions: positive feedback or negative one. maybe
<aw_>
so i go reworks firstly. ;-)
<wpwrak>
aw_: the feedback loop i have in mind would be unknown -> PROGRAM_B -> FLASH_RESET_N -> unknown -> ...
<wpwrak>
aw_: we could break the loop by removing D16, but that wouldn't remove the unknown -> PROGRAM_B path. without the feedback, maybe the board will just reset a few times and then boot, so you never notice that something is wrong. but of course, when you bring the flash reset back, also the feedback loop returns.
<wolfspraul>
just replace 0x77 with 0x78 :-) (joking, joking)
<aw_>
as you know, systems can be open-loop or closed-loop. I hope this weird problem is not the closed-loop type; if you remove D16, it will act as open-loop.
<wpwrak>
wolfspraul: like the airlines do when one of their flights crashes ? ;-)
<wpwrak>
aw_: it would be a bit like treating a broken bone with painkillers ;-)
<wpwrak>
very efficient symptom treatment, but ... :)
<aw_>
wpwrak, alright, we'll do that later. poor adam has to do the reworks first, though... let's keep this good idea and try it later
<wpwrak>
aw_: and maybe get that xray trip scheduled :)
<aw_>
wpwrak, yup...may more later. but will.
<wpwrak>
wolfspraul: worst-case outcome: FPGA shows extensive heat damage on 0x77 (and similar) but also on "good" boards examined for comparison. would be hard to decide what to do in this case.
<kristianpaul>
oh, wpwrak have a mm1 as well, nice !
<wpwrak>
not yet ! at the moment, it may be in memphis or maybe a few hours south-southeast of memphis
<wpwrak>
wolfspraul: phrew. just caught up with some really old stuff on #qi-hardware. the RTC thread was scary. pages of supercaps before finally the CR2032 was mentioned ;-)
<wpwrak>
aw_: how's the fix2b on the "good" boards going ? any boards that have gone from "good" to "bad" ?
<aw_>
wpwrak, tonight, wont test just rework. ;-)
<wpwrak>
aw_: a wise decision ;-)
<wpwrak>
aw_: ah, maybe tomorrow morning, you could try board 0x3c. first, see if you can reproduce the problem. and if yes, show TP36 and pin 34 at 10 us/div and at 500 ns/div. that way, we can see if the signal shape is the same or if it's different. also, measure the impedance between TP36 and pin 34 again.
<wpwrak>
aw_: (measure again) e.g., if thermal expansion is part of the equation, the impedance may be "normal" when warm but decrease when cold. i very much hope this isn't the case, but let's be safe and check.
<wpwrak>
i'll be afk for a bit
<aw_>
wpwrak, okay...good reminder on thermal equation thing, thanks.
<wpwrak>
already back :)
<wpwrak>
i missed that today is a holiday ... the bastards introduced it just some 3 months ago, so it doesn't show up in any printed calendar
<wpwrak>
aw_: oh, and when you measured the resistance, which side had the "high" voltage ?
<aw_>
resistance between TP36 and pin 34?
<wpwrak>
yes
<aw_>
it was measured both ways (probe +/-) after power-off
<aw_>
so i wouldn't know which side was the high voltage. ;-)
<wpwrak>
which side was the red wire ? ;-)
<aw_>
your question about this was strange, hope that i was misunderstood your meaning. ;-)
<wpwrak>
if you have two multimeters, you could even check that the red wire is really the high voltage :)
<aw_>
oah....sure
<wpwrak>
what i mean is this: when you do a resistance measurement, the multimeter injects a current. one of the two sides must be high and the other low :)
<aw_>
the 10 / 120 was:
<aw_>
10 KOhm measured was red (high) on TP36, so 120 KOhm was red on pin 34
<aw_>
phew~ just understood your question though. he ;-)
<wpwrak>
kewl, thanks ! that would even make sense
<aw_>
oh, yup
<roh>
wpwrak: wouldnt it be helpful if you would be in taipei now?
<wpwrak>
not sure. if i had my lab with me as well, yes :)
<roh>
wpwrak: heh.. i see.. so we don't have a lab in tpe?
<wpwrak>
roh: well, adam's home lab. TDS1012, etc.
<Fallenou>
lekernel_: I've seen you merged a few commits from rtems cvs head into mmstaging 5 days ago, but you didn't merge the changes to cpukit/zlib/zconf.h.in , is it on purpose that you stay with the v1.1 instead of the v1.1.1.2 ?
<lekernel_>
no
<lekernel_>
maybe that's some git-cvs bug
<lekernel_>
that would explain the problems
<Fallenou>
yes maybe
<wpwrak>
lekernel_: hey, good to see you here ! :) look what new tricks your M1 has learned:
<lekernel_>
I don't see anything worrisome about DQ8 on those pictures, but TP36 is pure crap
<wpwrak>
now, you may wonder what on earth PROGRAM_B and NOR DQ8 have in common. well, PROGRAM_B is on ball AA1, and NOR DQ8 is on ball AA2.
<lekernel_>
hmm...
<lekernel_>
solder bridge?
<wpwrak>
the 500 ns pic shows that DQ8 seems to push PROGRAM_B
<wpwrak>
adam measured and found ~10 kOhm in one direction, ~120 kOhm in the other
<wpwrak>
so there's a diode somewhere in the mix
<wpwrak>
joerg thinks it's a fried FPGA. ESD or such.
<wpwrak>
i was wondering if you had any other interpretation of this mess, or any ideas what to try
<lekernel_>
you could be simply measuring some CMOS protection diode to VCC through some pull-up resistor
<wpwrak>
we also have another board with similar symptoms but DQ8-PROGRAM_B "resistance" normal (about 60/120 kOhm, like on several "good" boards)
<wpwrak>
yes, we must be measuring something like this. what's remarkable is that all the other boards are around 60/120 kOhm, while this one is about 10/120 kOhm
<wpwrak>
the measurement itself is very dirty, because it's not anything ohmic we're measuring there. so the value also depends on adam's instrument, etc.
<lekernel_>
maybe you could show that to a xilinx fae?
<wpwrak>
but it seems the measurement is repeatable among several "good" boards. and the one with that rather interesting correlation between PROGRAM_B and DQ8 happens to be different.
<lekernel_>
'interesting'... hm :-)
<wpwrak>
maybe you can try that. i have no idea about their support :)
<lekernel_>
I would rather call it pesky, annoying, time-wasting, and other such adjectives :-)
<lekernel_>
well, debugging those things is basically their job
<lekernel_>
they do that all the time
<lekernel_>
and they're often good at it
<wpwrak>
i'm actually less worried about "saving" this board, but about finding a reliable test that tells us if something is amiss there. because we have another one with similar symptoms on TP36, but where DQ8 measures normally. we don't know yet if DQ8 and TP36 appear connected there, too, or if it's maybe another pair of pins that join forces
<lekernel_>
could it be some problem in the PCB substrate? or regular capacitive/inductive crosstalk?
<lekernel_>
seems unlikely because you're seeing a diode...
<wpwrak>
plus, we have some more boards with strange effects on the NOR signals. of course, if the FPGA's I/O pads in that area are damaged, that could explain a lot of strange effects. but i wouldn't jump to conclusions just yet. maybe it's a completely unrelated problem.
<lekernel_>
unless Murphy created some FR4-based semiconductor ofc :-)
<wpwrak>
at first, i suspected flux. then we found the diode ;-)
<wpwrak>
DQ8 was also more of a lucky discovery. we found it by examining traces adjacent to PROGRAM_B or FLASH_RESET_N (the latter is of course affected as well, so we may even see a feedback loop. of course, if we were to break the feedback by removing D16, we wouldn't solve the underlying problem.)
<wpwrak>
that is, if it's really DQ8 affecting PROGRAM_B. the signal shape would agree with this theory. there could of course be another signal that just looks the same, and DQ8 simply happens to have the same pattern.
<wpwrak>
also, DQ8 isn't all that pretty. there are some interesting little runts in the 500 ns picture. note that, according to adam, OE# is held low throughout all this. so DQ8 should be driven all the time.
<wpwrak>
ah, and to make it all more interesting: the boards affected by this don't always show this "noise" on PROGRAM_B. 0x77 does it almost always, but 0x3c is much less eager. it seems that, with increasing temperature, the probability of this occurring drops in 0x3c.
<wpwrak>
as in: in the morning, adam had it happen quite often, but later on, he booted into standby and sometimes even further about 15 times, without a single such anomaly
<wpwrak>
lekernel_: anyway, so you haven't come across any xilinx errata saying that PROGRAM_B can become an output with weird signals ? or anything else that would explain the madness ? (besides the hypothesis that the chip is indeed damaged)
<roh>
wpwrak: are there more than 1 board with that behaviour?
<wpwrak>
roh: two known cases so far with the same weird pattern on TP36/PROGRAM_B
<wpwrak>
roh: we haven't analyzed the 2nd one far enough to know if there's also a correlation between NOR.DQ8 and FPGA.PROGRAM_B, though. one problem with this board is also that it is a lot more reluctant to exhibit the problem.
<lekernel_>
no, this is completely unexpected
<lekernel_>
are you sure this comes from the fpga? it might be the reset IC too...
<wpwrak>
roh: it's probably upset that we caught it. presumably, it was hoping for the opportunity to embarrass the VJ at some great festival
<lekernel_>
also, maybe it works this way: PROGRAM_B is (wrongly) pulsed, FPGA deconfigures itself, and reads some memory address that happens to drive DQ8 high
<wpwrak>
lekernel_: i've refrained from rework so far. but ... it seems odd for the reset chip. 1) the input voltage is perfectly stable. 2) the voltage jumps between ~1.7 V and 3.3 V. 3) the correlation with DQ8 would make even less sense then. so for now, i don't suspect the reset ic. but when we enter the "remove components" phase, then this would be the first component to go.
<wpwrak>
lekernel_: (PROGRAM_B -> DQ8) not impossible, but the timing seems to fit a bit too well. e.g., why would PROGRAM_B drop just when DQ8 drops ?
<wpwrak>
lekernel_: btw, can you tell me what signals connect to the balls around AA1 ? i.e., AB1-2, we already know AA2, and Z1-2 ? faster if you just click on them in altium than me visually searching the PDF :)
<wpwrak>
err, make that Y1-2. there's no Z row, sorry
<wpwrak>
i found AB2 = FLASH_D9, Y2 = SDRAM_DQ0, and Y1 = FPGA_VREF
<wpwrak>
ah, and AB1 is ground
<wpwrak>
so potential candidates would be NOR.DQ9 = pin 36 and U14/U15 DQ0 = pin 2
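The neighbour search around AA1 can be sketched in a few lines. JEDEC-style BGA row lettering skips I, O, Q, S, X and Z, which is why there is no Z row between Y and AA. The 22x22 grid size is an assumption about the package, and the signal table contains only the balls named in this chat.

```python
import re

# BGA row letters: I, O, Q, S, X and Z are not used
SKIPPED = set("IOQSXZ")
SINGLE = [c for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if c not in SKIPPED]
ROWS = SINGLE + ["A" + c for c in SINGLE]  # A..Y, then AA, AB, ...

# Ball -> signal, only as far as identified in the chat
SIGNALS = {
    "AA1": "PROGRAM_B",
    "AA2": "FLASH_DQ8",
    "AB1": "GND",
    "AB2": "FLASH_D9",
    "Y1": "FPGA_VREF",
    "Y2": "SDRAM_DQ0",
}


def parse(ball):
    """Split a ball name like 'AA1' into (row index, column number)."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", ball)
    return ROWS.index(m.group(1)), int(m.group(2))


def neighbours(ball, size=22):
    """Balls directly or diagonally adjacent to `ball` on a size x size grid.

    ASSUMPTION: size=22 (a 484-ball package); adjust for other packages.
    """
    row, col = parse(ball)
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < size and 1 <= c <= size:
                out.append(f"{ROWS[r]}{c}")
    return out


for ball in neighbours("AA1"):
    print(ball, SIGNALS.get(ball, "?"))
```

For AA1 this lists Y1, Y2, AA2, AB1 and AB2, the same set of balls checked by hand above.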
<wpwrak>
if the supposed damage spreads wider, then we would have USBB on AB3, Y3. there's at least one board with an unexplained USB-B failure. but this is rather thin evidence so far.