<zumbi>
wolfspraul: each vendor has different bitstream afaik
<wolfspraul>
yes different, but I'm wondering whether the fundamentals are different, or more 'how' different they are
<zumbi>
i don't really know
<zumbi>
but i suspect those differ quite a bit
<aw>
wpwrak, about the A4809E3R-440DN, 4.312-4.488 V; bad that we need to search for a compatible part at Digi-Key or Mouser for easier sample orders.
<wolfspraul>
aw: in the future, we choose components preferably from standard digi-key parts unless there is a very good reason to not do so
<aw>
wolfspraul, okay
<aw>
including Mouser? or NO?
<wolfspraul>
also OK. _COMMON_ part, that's the key
<wolfspraul>
the choice of the AIC reset part looks wrong to me. in hindsight we are always smarter but I see nothing that's good about it.
<wolfspraul>
we even had to buy a whole reel of 3000 parts for 270 USD. all wrong ;-)
<wolfspraul>
that alone costs 3 USD / board for a run of 90, and 2910 parts forever in our 'archive of bad sourcing decisions'
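The reel arithmetic in this exchange can be checked with a small sketch. The figures (270 USD, 3000 parts, run of 90) come from the discussion; the function name and the assumption of one reset IC per board are illustrative only.

```python
def reel_cost_breakdown(reel_price_usd, reel_qty, boards, parts_per_board=1):
    """Spread the price of a full reel over a small production run.

    parts_per_board=1 is an assumption: one reset IC per board.
    """
    used = boards * parts_per_board
    return {
        # Effective cost charged to each board if the whole reel is written off
        "cost_per_board_usd": reel_price_usd / boards,
        # Parts left over after the run
        "leftover_parts": reel_qty - used,
        # Nominal price per part on the reel
        "unit_price_usd": reel_price_usd / reel_qty,
    }


breakdown = reel_cost_breakdown(270.0, 3000, 90)
print(breakdown)
```

The gap between the nominal unit price and the effective per-board cost is what makes a reel-only part expensive for a small run.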
<wolfspraul>
if we are lucky, we find a matching part from another manufacturer, but I won't hold my breath
<wolfspraul>
there's a lot of reset ics, but once you go through the exact requirements we have here it shrinks fast (I did a little digikey searching...)
<wpwrak>
hah, i was wondering how that part ended up in M1 :) and i'd say the best parts come from digi-key and at least one other source :)
<wpwrak>
wouldn't do if some shiny new part was previously in digi-key's archive of bad sourcing decisions ;-) well, they tell you when it becomes non-stocked, so i guess that's a warning
<wpwrak>
hmm, what's the maximum "5V" voltage the chip needs to survive ? are 6 V enough ?
<wpwrak>
(the A4809 goes up to 12 V)
<wolfspraul>
6V sounds enough (a bit more would probably be better though, I assume this is coming directly from the power adapter?)
<wpwrak>
directly after L10, so after the protection circuit, if that one is still around (not sure what the status is there, i remember you have some problems with it)
<wpwrak>
s/have/had/
<wolfspraul>
don't understand
<wolfspraul>
what are you getting at?
<wpwrak>
weren't there some issues with the protection circuit causing troubles ? or are they resolved now ?
<wpwrak>
something like bad beads
<wpwrak>
or a bad fuse or such
<wpwrak>
i don't remember the details. only that some parts were removed. but i don't know if this applies to rc3.
<wolfspraul>
all problems turned out to be faulty measurements
<wpwrak>
oh, cool :) very good. then 6 V should be plenty :)
<wpwrak>
also exists in a slower variant ;-) (1.12 s)
<wolfspraul>
we can just buy several at once to try some theories, if that helps
<wpwrak>
that may not be a bad idea. something for the R&D lab :)
<wpwrak>
SOT-23 for such a part seems mighty big, though
<wpwrak>
let me run a package comparison ...
<lekernel>
sot-23 is fine... that's what being used atm
<wpwrak>
hmm no, sot-23 seems to be the most common choice
<lekernel>
so there is space for it
<lekernel>
and it's easier to rework in case of yet another fuckup
<wpwrak>
lekernel: yes, i was looking for what makes sense to stock for future R&D
<wpwrak>
of course :)
<wolfspraul>
aw: you there? before you reflash your next board, can you ping us here? then we can try to force USB into full-speed mode as Werner described
<wolfspraul>
just wait until the next time you need to reflash, then we do it...
<wolfspraul>
of course for samples we can buy a few
<wolfspraul>
I'm wondering how you can sell something that's 5 or 10 times more expensive than a competitor that can be used as a drop-in replacement
<wpwrak>
(buy maxim) i wouldn't bother. they're for the "due diligence" appendix ;-)
<wolfspraul>
maybe they have some very outstanding performance parameters that some customers need?
<wolfspraul>
or tolerances? or some customers just totally trust their brand?
<wpwrak>
maybe the military likes them ? :)
<wolfspraul>
ok, some old government contracts or other large bureaucratic customers keeping those parts alive? another option
<wolfspraul>
the diodes inc. one is ca. 14 cents / 1k, the Maxim one 1.35 USD / 1k
<wolfspraul>
almost 10 times more
<wolfspraul>
interesting
<wpwrak>
maybe it's because they have such a large choice of parameter and output configurations
<wpwrak>
of course, for all we know, AIT may have a lot more. that data sheet alone could "generate" something like 700 different parts.
<wolfspraul>
AIT parts were used on Ben/AVT
<wpwrak>
(if all specified part number combinations really exist, which seems unlikely)
<wpwrak>
aah, that's where it comes from :)
<wolfspraul>
yes, I also wondered :-)
<wpwrak>
it had that "friends from taiwan" feeling to it :)
<wpwrak>
like so many of those parts we had in openmoko. without data sheets, no second source in the known universe, etc. :)
<wolfspraul>
we can't even say much about the part or manufacturer, but us being such a little guy with so much design verification and changes all the time, it's a difficult source
<wpwrak>
and of course, the company dead before openmoko :)
<wolfspraul>
once you are making large quantities of whatever all the time, they may be the best source of all
<wolfspraul>
who knows
<wpwrak>
you can always switch back once you're sure
<wolfspraul>
in large quantities datasheet availability doesn't matter
<wolfspraul>
oh yes, definitely
<wpwrak>
there's probably great potential in penny-pinching parts
<wpwrak>
each cent you save is a million dollars once you reach 100M+ quantities :)
<wolfspraul>
what matters is that your source can follow your forecast flexibly, that the quality of their parts is stable, that you have a good sales contact for problems, etc.
<wolfspraul>
but at our quantities and level of uncertainty, that's all pretty much the last thing we worry about :-)
<wpwrak>
that's what we dream of worrying about ;-)
<wolfspraul>
ok so those 3 reset parts are all the same idea, should we buy a few of each? anything else to add?
<wolfspraul>
I understand that this fix is surely a fix, since with the 2.6v reset ic we are out of spec. so the fix is correct in any case. the unknown is whether it fixes the flash corruption.
<wolfspraul>
if it does - nothing else to worry about. if it does not - then what?
<wpwrak>
if it doesn't, then it may be a sw or fpga problem. e.g., sending out spurious transactions
<aw>
new steps:1. insert DC jack
<aw>
2. middle button
<aw>
3. wait for booting, wait for render, let it render 30 seconds
<aw>
4. unplug DC jack
<aw>
5. insert DC jack
<aw>
6. press middle button but then run the test software over jtag serial
<aw>
7. run the test software only until the CRC check is finished, and record the results
<aw>
8. if the CRC check fails, abort the render cycles here
<aw>
9. if the CRC check passes, unplug DC jack
<aw>
10. go back to step #1
<wpwrak>
i would only get the ~200 ms from diodes and the 1.12 s from micrel
<wolfspraul>
1.1s ?
<aw>
now 0x7c is available. hope that we run into a flash problem soon
<wolfspraul>
aw: you ran 10 render cycles with crc checks on 0x7c?
<aw>
yes
<wolfspraul>
ok
<wolfspraul>
remember when you do the next flashing, ping us here
<aw>
hope from now on we can catch the flash problem and then dig into it
<wolfspraul>
for the usb full-speed thing
<wpwrak>
aw: (new steps) sounds good. i wouldn't call it a "render cycle", though :)
<aw>
okay. ping guys. ;-)
<wolfspraul>
ah ok
<wolfspraul>
:-)
<wolfspraul>
let's see (opening werner's instructions :-))
<wolfspraul>
aw: get the board ready, plug usb cable into your notebook as usual
<wolfspraul>
after connecting the cable, run 'dmesg'
<wolfspraul>
in the last few lines, you should see something like "usb 2-1: new high speed USB device [...]"
<wolfspraul>
do you see that?
<wpwrak>
(nor corruption analysis) i think we'll know more about this when we get better data from the crc experiment. e.g., whether there are patterns in where and when it strikes.
<aw>
wolfspraul, what does this mean? for each board or when meet "next flashing"?
<aw>
yes, i just saw Werner's email and marked firstly
<wolfspraul>
let's try now
<wolfspraul>
if it works, we will probably do it for each board
<wolfspraul>
but let's try
<wolfspraul>
you ready?
<aw>
second
<wolfspraul>
1. plug in usb cable, like you normally flash
<wolfspraul>
2. run 'dmesg'
<wpwrak>
(analysis) so far, we only have very spurious results, and many have causal dependencies in them, which further twist the probabilities. so it's hard to tell anything from the existing data, except that bad things happen.
<wolfspraul>
wpwrak: how would you call it [instead of render cycle]
<wolfspraul>
'render cycle' because it's a full cycle from power on to rendering back to power off
<wpwrak>
(cycle) does the cycle even involve rendering anything ? i thought it was now just  power up -> CRC -> power down
<aw>
[16147.624074] usb 6-1: new full speed USB device using uhci_hcd and address 2
<aw>
[16147.767106] usb 6-1: not running at top speed; connect to a high speed hub
<aw>
[16147.795229] usb 6-1: configuration #1 chosen from 1 choice
<aw>
[16147.803204] usb 6-1: Ignoring serial port reserved for JTAG
<aw>
[16147.807510] ftdi_sio 6-1:1.1: FTDI USB Serial Device converter detected
<aw>
[16147.807554] usb 6-1: Detected FT2232H
<aw>
[16147.807557] usb 6-1: Number of endpoints 2
<aw>
[16147.807559] usb 6-1: Endpoint 1 MaxPacketSize 64
<aw>
[16147.807562] usb 6-1: Endpoint 2 MaxPacketSize 64
<wpwrak>
yes ! :)
<aw>
[16147.807564] usb 6-1: Setting MaxPacketSize 64
<aw>
[16147.808882] usb 6-1: FTDI USB Serial Device converter now attached to ttyUSB0
<aw>
mm...full speed device now.
<wpwrak>
triumph ! :)
<aw>
then? does this mean that I have to enter the commands every time I test a board?
<wpwrak>
make sure you use the longest cable you have ;-)
<wpwrak>
the port configuration should be permanent (until you reboot the PC)
<aw>
oah..sorry that i used a shorter cable..okay...change to long cable
<wpwrak>
but you can check with dmesg. unplug and replug, then see if it still comes up as full-speed
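A minimal sketch of automating that dmesg check, assuming the kernel prints enumeration lines in the form shown in the pasted log ("new full speed USB device"). The function names are hypothetical, and the pattern also tolerates a hyphenated "full-speed" spelling; other kernel versions may word the line differently.

```python
import re

# Matches lines such as:
#   usb 6-1: new full speed USB device using uhci_hcd and address 2
SPEED_RE = re.compile(
    r"usb (\S+): new (low|full|high)[ -]speed USB device", re.IGNORECASE
)


def enumeration_speeds(dmesg_text):
    """Return (port, speed) tuples for every USB enumeration line found."""
    return [(m.group(1), m.group(2).lower()) for m in SPEED_RE.finditer(dmesg_text)]


def last_speed(dmesg_text):
    """Speed of the most recently enumerated device, or None if none was seen."""
    speeds = enumeration_speeds(dmesg_text)
    return speeds[-1][1] if speeds else None


sample = """\
[16147.624074] usb 6-1: new full speed USB device using uhci_hcd and address 2
[16147.767106] usb 6-1: not running at top speed; connect to a high speed hub
[16147.807554] usb 6-1: Detected FT2232H
"""
print(last_speed(sample))
```

Feeding it the output of `dmesg` after each replug would show whether the port still comes up as full-speed.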
<wolfspraul>
argh
<aw>
umm..sounds good (until reboot the PC)
<wolfspraul>
why long cable?
<aw>
i see
<wolfspraul>
we are not trying to fix every bug on the planet
<wpwrak>
worst case: you need to run the command each time you re-plug the usb-jtag
<wpwrak>
wolfspraul: opportunistic testing :)
<wolfspraul>
wait, let's be clear and precise
<aw>
hmm...sounds different idea..i standby and listening firstly. ;-)
<wolfspraul>
yes
<wolfspraul>
I am focusing on the run of 90 boards, already badly delayed
<wolfspraul>
we can postpone discoveries of all kinds until after sales have started
<wolfspraul>
now...
<wolfspraul>
full-speed is good
<wolfspraul>
Adam can switch to 100% full-speed for the rest of the run now
<wolfspraul>
but I would say the same thing about the short cable
<wolfspraul>
we are trying to fix rc3 bugs, not make sure Adam's entire lab is bug free
<wolfspraul>
my opinion
<wpwrak>
(postpone) well, as you wish. confirmation that full-speed is the cure may create an action item before shipping, though.
<wolfspraul>
cure of which bug?
<wolfspraul>
libusb bug?
<wolfspraul>
we don't even know which bug :-)
<wpwrak>
cure of the reflash failures
<wolfspraul>
hmm
<wpwrak>
well, there's that, yes
<wolfspraul>
aw: which m1 board do you have attached now?
<wpwrak>
of course, are we sure there's even a bug in libusb ? :)
<wolfspraul>
:-)
<wolfspraul>
that's exactly what I want to avoid getting into now
<aw>
wolfspraul, 0x7c
<wpwrak>
that's the fun bit with stochastic bugs - it happens, then you change X and it doesn't happen. but are you sure it went away because you changed X or just because you didn't test often enough ? :)
<wpwrak>
anyway, we can deal with this later, okay
<wolfspraul>
aw: above you said 0x7C is available (testing finished)
<wolfspraul>
are you planning to reflash 0x7C now?
<wpwrak>
i think a fully tested and okay board is a good start
<wpwrak>
no need to reflash until CRC errors happen
<wolfspraul>
yes but I don't understand whether or why Adam wants to reflash 0x7C now, if he just said it's 100% pass
<wolfspraul>
probably a misunderstanding somewhere...
<aw>
wolfspraul, yes 0x7c was done successfully with "new steps" for rendering.
<wolfspraul>
aw: ok, so that sounds like 0x7C is finished.
<wolfspraul>
let's make a little test with our new full-speed happiness
<aw>
but 0x7c not ready for reflashing with "full speed" reflash. i just tried to learn commands. ;-)
<aw>
so what's next step here though?
<aw>
or just when I meet d2/d3 dimly lit again? then ping here?
<wolfspraul>
aw: you don't need to reflash anything just because the USB speed is full-speed now
<wolfspraul>
the idea is that for new boards that you reflash from now on, you make sure they are flashed in full-speed
<aw>
so i keep using shorter usb cable and fix usb failure boards first. ;-)
<wolfspraul>
aw: should we try a test on 0x32 ?
<wolfspraul>
those things are unrelated
<wolfspraul>
yes, keep using the short cable
<wpwrak>
aw: you should reflash after each CRC failure. we assume that "d2/d3 dim" would also be a CRC failure. but there can be other CRC failures that do not cause "d2/d3 dim"
<wolfspraul>
aw: I just told him earlier to not reflash after crc failure to not remove evidence.
<wolfspraul>
I meant wpwrak : I just told adam earlier ...
<wolfspraul>
phew
<wolfspraul>
that's the hard part now, avoiding confusion
<aw>
[17656.533069] usb 6-1: new full speed USB device using uhci_hcd and address 3
<aw>
[17656.673148] usb 6-1: not running at top speed; connect to a high speed hub
<aw>
[17656.700375] usb 6-1: configuration #1 chosen from 1 choice
<aw>
[17656.707317] usb 6-1: Ignoring serial port reserved for JTAG
<aw>
[17656.712410] ftdi_sio 6-1:1.1: FTDI USB Serial Device converter detected
<aw>
[17656.712563] usb 6-1: Detected FT2232H
<aw>
[17656.712570] usb 6-1: Number of endpoints 2
<aw>
[17656.712576] usb 6-1: Endpoint 1 MaxPacketSize 64
<aw>
[17656.712582] usb 6-1: Endpoint 2 MaxPacketSize 64
<aw>
[17656.712587] usb 6-1: Setting MaxPacketSize 64
<aw>
[17656.717182] usb 6-1: FTDI USB Serial Device converter now attached to ttyUSB0
<aw>
now is full speed, so what steps you want to try?
<wolfspraul>
aw: perfect. just run reflash_m1.sh
<wolfspraul>
yes 0x3A is nice too! thanks. I didn't see it...
<aw>
note that now it stays d2/d3 dimly lit.
<aw>
okay
<aw>
second
<aw>
wait...use xiangfu's last 'erase' version, right?
<wolfspraul>
sure why not
<aw>
okay
<wolfspraul>
always use the new reflash_m1.sh with erase now, I see no reason why not
<wpwrak>
seems we have more: 0x55, 0x67, 0x6d, 0x6f, 0x70, 0x77, ...
<wpwrak>
0x7a is a bit weird, but may also be the same
<wolfspraul>
well you are brave
<aw>
hmm...stops at 'Bitstream length: 1484404'
<aw>
standby next analysis step now..he he ;-)
<wpwrak>
GRRRR
<aw>
what's meaning of "GRRRR"? ;-)
<wpwrak>
seems that my "full speed" theory is wrong :-(
<wpwrak>
ah well, in any case it shouldn't make things worse ...
<aw>
wpwrak, not bad that an idea came out from you. :-) don't be sad though..we hear you.
<wolfspraul>
aw: let's try the same quick test on 0x3A
<aw>
okay
<aw>
second
<wolfspraul>
I do believe full-speed is good, we should always use it and it will help eliminate a few strange flashing problems. But I don't believe it has any impact on the physical/electrical condition of a particular m1 board.
<wolfspraul>
I trust the little jtag board and the ftdi chip. once the nor is written it's written. the strangeness must come from the m1 boards themselves.
<aw>
0x3a: good still detect with full speed and stays d2/d3 dimly lit after powered -on. now to reflash. ;-)
<aw>
mm...same stopped at 'Bitstream length: 1484404'
<wolfspraul>
aw: try to disconnect/reconnect the jtag-serial board too
<wolfspraul>
aw: now try to disconnect/reconnect the jtag-serial board
<wolfspraul>
(power off everything first)
<aw>
hmm...need to power off
<wolfspraul>
then reflash_m1.sh in full-speed again
<aw>
same stopped there. :-(
<wolfspraul>
ok
<wolfspraul>
one sec
<wolfspraul>
can you try reflashing with Xilinx Impact?
<wolfspraul>
and the xilinx cable
<aw>
hmm...seems to need a different image, i don't quite know this.
<aw>
need to ask xiangfu before do this. :-)
<wolfspraul>
ok
<wolfspraul>
we can do that later
<aw>
last time rc2 I used Lekernel's image
<wolfspraul>
on all boards with flashing problems, we can try Xilinx Impact and Xilinx cable later
<wolfspraul>
aw: ok, let's stop the full-speed tests right now
<wolfspraul>
Werner had another idea I like
<aw>
mmm..okay.
<aw>
wait
<wolfspraul>
aw: you just finished 0x7C, right?
<aw>
so from now on i still use full-speed to continue tests?
<wolfspraul>
absolutely
<aw>
yes, finished 0x7c
<wolfspraul>
always full-speed
<aw>
alright full-speed now
<wolfspraul>
so werner wants to make a special test on 0x7c
<wolfspraul>
like this:
<aw>
now?
<wolfspraul>
yes
<wolfspraul>
wait I write first
<wolfspraul>
1. plug DC jack in
<wolfspraul>
2. middle button, escape to test software, run test software until CRC checks
<wolfspraul>
3. unplug DC jack
<wolfspraul>
4. go back to step #1
<wolfspraul>
just that
<aw>
okay
<wolfspraul>
We are hoping that after some cycles, the CRC checks will find a corruption
<wolfspraul>
the cycles should be fast, so you can try 100 or 200
<wolfspraul>
start with 100 :-)
<wpwrak>
and please count the cycles
<wpwrak>
err, i'd stop at the first CRC error
<wolfspraul>
oh sure
<wpwrak>
then analyze
<wolfspraul>
sorry that wasn't clear
<wolfspraul>
aw: of course you stop at the first CRC error
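The simplified power-cycle test (plug in, run the test software up to the CRC check, unplug, repeat, stop at the first CRC error, count the cycles) can be sketched as a loop. This is only an illustration: in the lab the steps are manual, and `crc_check` is a hypothetical callback standing in for one full cycle that reports whether the CRC check passed.

```python
def first_crc_failure(crc_check, max_cycles=100):
    """Run up to max_cycles power cycles and stop at the first CRC error.

    crc_check(cycle) is a hypothetical hook: it performs one cycle
    (power on, run the test software until the CRC check, power off)
    and returns True if the CRC check passed.

    Returns the 1-based number of the first failing cycle, or None if
    all cycles passed.
    """
    for cycle in range(1, max_cycles + 1):
        if not crc_check(cycle):
            return cycle  # stop here and analyze, do not keep cycling
    return None


# Stand-in checks for demonstration: one board that never fails,
# one that fails on its 7th cycle.
always_ok = first_crc_failure(lambda cycle: True, max_cycles=100)
fails_at_7 = first_crc_failure(lambda cycle: cycle != 7, max_cycles=100)
print(always_ok, fails_at_7)
```

Returning the cycle number keeps the count that was asked for and leaves the board untouched after the first error.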
<aw>
alright
<wolfspraul>
wpwrak: be warned (well, I warn myself). I believe this kind of testing may damage the nor chip or more, and turn a board unflashable for days or forever. :-)
<wolfspraul>
aw: no worries, I just explain my theory to Werner... You can have fun :-) We have enough boards now to ruin some :-)
<aw>
wolfspraul, ha...yes, from last rc2 experiences. ;-)
<wolfspraul>
yes
<wolfspraul>
we should have taken it much more seriously on rc2
<wolfspraul>
I learnt a lot
<wolfspraul>
but that's another story, now we try to rescue rc3 and make good boards
<wpwrak>
wolfspraul: the chip should be good for a few kcycles
<wolfspraul>
no
<wolfspraul>
you will see soon
<wolfspraul>
it's a bug somewhere, an electrical problem
<wolfspraul>
some kind of shock, over-current, over-voltage, whatever
<wolfspraul>
you saw Adam's reaction just now when I wrote this :-)
<aw>
2 times
<wpwrak>
wolfspraul: hmm, let's hope it's not overvoltage or such. the reset chip replacement couldn't fix that.
<wolfspraul>
correct
<wolfspraul>
I know
<wolfspraul>
so I keep asking "how comfortable are we" :-)
<wolfspraul>
because I'm not :-)
<wolfspraul>
I made some big mistakes in rc2, like I said - already learning...
<aw>
5
<wolfspraul>
but that analysis doesn't help now, so let's make the best out of the rc3 situation we have right in front of us
<wolfspraul>
sometimes all you have left is that some luck happens
<wolfspraul>
a lucky day!
<wolfspraul>
maybe today?
<wolfspraul>
:-)
<wolfspraul>
let's look for signs!!
<wpwrak>
signs and portents :)
<aw>
10
<wolfspraul>
wpwrak: do we have any theory what kind of damage or impact may turn the nor chip, or something else, unflashable for several days, but then flashable again?
<wolfspraul>
because Adam has seen that so many times now that we can rule out it just being some sort of noise
<aw>
15
<wolfspraul>
Adam will regularly let an unflashable board 'rest' for several days, and then try again, because we have seen a lot come back alive after such a resting period
<xiangfu>
wpwrak, you may have already seen my patches on urjtag 'lockflash' 'unlockflash'. I have some questions about how urjtag works.
<wolfspraul>
it's not 5 minutes, or an hour, the effect is noticeable after 1 day or 2 days or so
<wolfspraul>
because the boards worked fine before, including reflashing
<wpwrak>
one test could be like this: if board X magically recovers, try all other boards with reflash problems at that time too. if it's temperature, some of them may also come back
<wolfspraul>
you mean room temperature?
<wpwrak>
yes
<wolfspraul>
definitely not. it's a time based phenomenon.
<xiangfu>
wpwrak, I want to know how urjtag knows the 'cfi_array->address' ?  for now I understand: 1. upload the fjmem.bit 2. then nor flash works 3. how does urjtag know the address of the nor flash data port?
<aw>
wolfspraul, last in rc2 we damaged our boards by "fast-powered cycling" though..not keep 5 seconds between power-on like this time
<wolfspraul>
yes sure, and the reset circuit is also there. let's just focus on trying to reproduce the flash bug now, I'm only saying if it falls into an unflashable state, I wouldn't be surprised.
<wolfspraul>
like 0x32 or 0x3A we just looked at
<aw>
20
<wpwrak>
xiangfu: (address of flash) isn't this configured somewhere ?
<aw>
wpwrak, why not use high speed to capture the tests I am doing?
<wpwrak>
wolfspraul: (time) hard to distinguish the two
<aw>
i felt this test if use full-speed?
<wolfspraul>
it shouldn't matter. you think it's slower now?
<wolfspraul>
I think you should always use full-speed, even for this test.
<aw>
since last week i met CRC err by high speed. :-)
<wolfspraul>
don't say that otherwise Werner will jump up and hurt his head
<aw>
so i just wanted to clarify what purpose you wanted to catch?
<aw>
oah...sorry ;-)
<wolfspraul>
no no, just joking
<wolfspraul>
I am just joking
<wolfspraul>
:-)
<wolfspraul>
aw: I think always use full-speed
<wolfspraul>
for everything
<aw>
alright .;-)
<wpwrak>
xiangfu: are you sure about  URJ_BUS_WRITE (bus, adr + 0x02, CFI_INTEL_CMD_READ_IDENTIFIER);  ?
<wpwrak>
xiangfu: the data sheet seems to want 0x1a (table 8, page 19)
<aw>
25
<wpwrak>
xiangfu: ah, sorry, misread it. it's not 0x1A but IA :)
<wpwrak>
xiangfu: so, if i understand things right: URJ_BUS_WRITE (bus, adr, CFI_INTEL_CMD_READ_IDENTIFIER);
<wpwrak>
xiangfu: and then sr = URJ_BUS_READ (bus, cfi_array->address+2);
<wpwrak>
hmm, vanished :(
<aw>
30
<wpwrak>
aw: you're running the CRC check each time ?
<aw>
sure
<aw>
haven't spotted CRC err though. ;-)
<aw>
35
<wolfspraul>
it could take 100-200 cycles
<wpwrak>
wasn't 100-200 the rate of "dim LEDs" ?
<wolfspraul>
unfortunately we know so little. it could be that some boards will never exhibit the problem.
<wpwrak>
with the CRC check, we should hit it ~10-20 times more often, assuming uniform distribution
<wolfspraul>
maybe it is caused by some unfortunate part tolerances coming together
<wpwrak>
that could be the case, too
<wolfspraul>
I don't believe that, but let's see
<aw>
40
<wpwrak>
maybe it's also a question of giving the board enough time to discharge
<wolfspraul>
if we know for sure that some boards are safe, they are good to go
<wolfspraul>
the bad thing is that we currently do 10 render cycles (30 seconds each) in our testing
<wolfspraul>
and we had boards failing on cycle #2 #6 #9 etc.
<wolfspraul>
not good
<wolfspraul>
why should '10' be the magic number to determine that the board is stable?
<wpwrak>
yeah
<wpwrak>
if we have the baseline probability, we can calculate how many tests you need to be, say, 99% sure the problem doesn't appear
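A worked sketch of that calculation, under a strong assumption that the discussion itself questions: each cycle fails independently with the same probability p. Then n clean cycles occur with probability (1 - p)^n, and the smallest n with (1 - p)^n <= 1 - confidence is the number of tests needed. The function names and the example value p = 0.05 are illustrative, not measured.

```python
import math


def cycles_needed(p_fail, confidence=0.99):
    """Smallest n such that a per-cycle failure rate of p_fail would have
    shown up at least once in n cycles with the given confidence.

    Assumes independent cycles with a constant failure probability.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


def failure_rate_upper_bound(clean_cycles, confidence=0.95):
    """Largest per-cycle failure rate still consistent, at the given
    confidence, with having seen clean_cycles cycles without any failure."""
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_cycles)


n_for_5_percent = cycles_needed(0.05)            # hypothetical baseline of 5 %
bound_after_100 = failure_rate_upper_bound(100)  # 100 clean cycles observed
print(n_for_5_percent, round(bound_after_100, 4))
```

Read the other way around, a fixed count such as 10 cycles only rules out fairly high failure rates, which matches the doubt about '10' being a magic number.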
<wolfspraul>
we don't need to look at or find root causes for all sorts of strange flash/dim lit/reconfig/whatever boards. we have enough time for that once we have cleared 40, 50, 60 or more to go out
<wolfspraul>
I think what helps is if we can more clearly see the different bugs separately that are probably overlapping here
<wolfspraul>
which is why I like the full-speed stuff, short cable, crc checks, etc.
<wolfspraul>
also the reset ic idea
<aw>
45
<wolfspraul>
not just idea, that seems to be a clean fix/improvement that is good no matter what other things we find
<wolfspraul>
wpwrak: speaking about that. you really want the 1.12s delay ic?
<wpwrak>
yeah, if the reset chip does anything useful at all, then this is an improvement
<wolfspraul>
I mean - can that work at all?
<wolfspraul>
yes I think the reset ic is fine, helps
<wpwrak>
i usually get at least 10, unless the item is very expensive :)
<wolfspraul>
I'm half Chinese
<wolfspraul>
so 5
<wolfspraul>
:-)
<wolfspraul>
for the cycle testing adam is doing on 0x7A now, I propose we stop that at 100 successful cycles
<aw>
55
<wolfspraul>
and let Adam continue to go through the whole batch as planned
<aw>
no, 0x7c
<wolfspraul>
sorry 0x7C
<wolfspraul>
that's because we already have several improvements now (short cable, full-speed, crc checks in test software which is logged), and then we have more testing data to look at and thing about
<wolfspraul>
think about
<wolfspraul>
then we can zoom in on clusters, or try to find clusters, or try to find boards where it is easy to reproduce some particularly interesting behavior
<wpwrak>
yeah, 100 should be plenty. i would have expected to see an error much earlier. maybe we've removed the step that actually causes the problem. but let's try a few more boards first.
<wolfspraul>
yes we may have removed the step
<wpwrak>
(zoom in) yes
<wolfspraul>
or the problem is only showing on particular boards
<aw>
so remove new steps?
<wpwrak>
that could also be
<wolfspraul>
aw: no, just continue
<wolfspraul>
Werner and I are discussing the next steps
<aw>
:-)
<wpwrak>
maybe, when 0x7c is done, pick one that has had NOR corruption before
<wolfspraul>
:-)
<wolfspraul>
Werner cannot wait to look at the interesting stuff NOW ;-)
<wolfspraul>
also from now on, Adam will run crc checks between the 10 render cycles
<wolfspraul>
that may show something (or not)
<wolfspraul>
we could increase the 10 render cycles to 15 ?
<wolfspraul>
they are time consuming though
<wolfspraul>
that's 30 minutes testing for each board, easily
<aw>
60
<wolfspraul>
nah let's only do 10 now
<wolfspraul>
I don't need more evidence that boards fail at #12 or #14
<wolfspraul>
I need to find the root cause
<wolfspraul>
in that thinking we could even reduce the cycles to 5 :-)
<wpwrak>
lekernel: in adam's usual test, boot and render for some minutes, do NOR access (read or write) occur after the rendering starts ?
<wolfspraul>
30 seconds render
<wpwrak>
wolfspraul: wait wait .. for now, we don't have rendering in the loop
<wolfspraul>
yes
<wolfspraul>
correct
<wolfspraul>
I am thinking once he's back to going through the batch
<wolfspraul>
I think we can reduce to 5 cycles
<wpwrak>
wolfspraul: let's keep the simplified loop and apply it to a board that's known not to be immune
<wolfspraul>
but with crc checks in between
<wolfspraul>
which one?
<aw>
65
<wolfspraul>
(looking at list)
<wolfspraul>
wpwrak: how about 0x39 ?
<wolfspraul>
clean and simple
<wolfspraul>
one cycle - and out :-)
<wolfspraul>
maybe too simple, maybe a little later...
<wpwrak>
0x39 sounds excellent :)
<wolfspraul>
0x54, also nice
<wolfspraul>
8th cycle
<wolfspraul>
ok, 0x39
<wpwrak>
another up to 100 tries with 0x39
<wolfspraul>
oh my
<wolfspraul>
dinner time for Adam :-)
<wpwrak>
if that still doesn't do anything, add rendering to the loop
<wpwrak>
can you go from test to render ? or do you have to reset in between ?
<aw>
70
<wolfspraul>
like I said, instead of doing time consuming tests on single boards now, we can also proceed going through the batch with the process that we improved in details
<wolfspraul>
wpwrak: maybe he could go from test to render over software reset (press three buttons), instead of pulling the DC cable
<wpwrak>
we need a larger number of tests for now. statistical baseline.
<wolfspraul>
yes but on which boards?
<wpwrak>
(sw reset) yes, that's an option
<wolfspraul>
you may be hitting on a board that may never show the problem, just wasting time
<wpwrak>
0x39 looks promising :) it did it once. we know it can ;-)
<wpwrak>
if it all of a sudden doesn't do it, that's interesting, too
<aw>
75
<wolfspraul>
wpwrak: what do you want to do on 0x39 actually? reflash it?
<wolfspraul>
first try whether it boots now
<wolfspraul>
boards have come back after X days, though I am not sure exactly from which of the multiple failure conditions we may actually be looking at
<wpwrak>
yeah, would be fun if the NOR corruption would somehow have healed itself ;-)
<wolfspraul>
so first try to boot 0x39, see what happens. if no reconfigure -> reflash_m1.sh with erase and full-speed
<wolfspraul>
I am telling you we have seen enough such cases now
<wpwrak>
try to boot and if it boots, run the CRC check
<wolfspraul>
good idea
<aw>
80
<aw>
85
<wpwrak>
i have that mental image of a guy in a prison cell counting the days with scratch marks on the wall. adam must be doing something similar, counting the tries until he can lay the board to rest :)
<aw>
90
<aw>
95
<wolfspraul>
wpwrak: I do think he should continue going through the batch first, before 0x39
<wpwrak>
aw: ah, and please paste (to pastebin.com, or similar) the console output of the 100th run
<wolfspraul>
but if you want him to do 0x39 next, ok with me
<wpwrak>
i'd prefer 0x39. let's make the thing happen before changing an unknown set of variables
<aw>
100
<wolfspraul>
yay!
<wolfspraul>
aw: thanks a lot!
<wolfspraul>
can you post the console output of the last run to pastebin.com ?
<wolfspraul>
I suggest you go back to the normal procedure, continue with all boards and all known fixes
<wolfspraul>
I propose a change to the render cycles, we already said that you run the crc test software after each render cycle
<wolfspraul>
I also think you should reduce the number from 10 to 5
<wolfspraul>
so here's the list:
<wolfspraul>
1. only use short cable (as before)
<aw>
mm..so your normal procedure is now becoming:
<aw>
go on
<wolfspraul>
2. always run reflash_m1.sh in usb full-speed mode
<wolfspraul>
3. run the test software (crc part) after each render cycle
<wolfspraul>
4. reduce the number of render cycles from 10 to 5
<wolfspraul>
that's all
<aw>
got it
<wolfspraul>
aw: so what you've found is that if a board is in d2/d3 dimly lit status, it cannot be reflashed over jtag-serial, and the nor can also not be read over jtag-serial
<wolfspraul>
we could try to reseat (disconnect/reconnect) the jtag-serial board, and we could try to reflash with Xilinx Impact
<wolfspraul>
but I suggest to do that later
<wolfspraul>
the real showstopper is to find the reason why a board can go from seemingly normal to this state. we have to fix that before boards can go out.
<wolfspraul>
and the only idea right now seems to be the new reset ic
<wolfspraul>
I think whether it's from the fpga, software or electrical, the m1 is doing something really bad to the nor chip under some circumstances
<aw>
wolfspraul, after it's in d2/d3 dimly lit, it cannot be read over jtag-serial, but bad that I forgot to reflash it again. but from the histories of previous other boards, once a board is in d2/d3 dimly lit, it always seems to stop at "Bitstream length: 1484404"
<aw>
but we can try reflash 0x39 tomorrow
<wolfspraul>
:-)
<wolfspraul>
the famous "let's wait 1 day"
<aw>
oah..yeah...
<wolfspraul>
I think let's continue with all boards first
<wolfspraul>
more fixes, more data
<wolfspraul>
I need a complete overview over the failure clusters
<wolfspraul>
in parallel the new reset ics are ordered
<aw>
okay..i continue tests
<wolfspraul>
aw: do you know how to _READ_ the nor chip with Xilinx Impact?
<wolfspraul>
you could try to read the nor from 0x39 with Xilinx Impact
<wolfspraul>
but yeah, I suggest - do that later
<aw>
hmm...need to do this later though ;-)
<wpwrak>
hmm, interesting ..
<wpwrak>
(sorry, fell asleep and missed part of the fun)
<wpwrak>
so maybe we don't have a NOR corruption after all. that would be good :)
<lekernel>
wpwrak, any other ideas about what is happening, then?
<lekernel>
temperature dependent timing failures?
<wpwrak>
could be some analog domain weirdness of the diode-based reset circuit ... but i don't have any clear error path for that
<wpwrak>
what's puzzling me is that JTAG and normal operation run into trouble with the NOR
<wpwrak>
otherwise, i would have suspected problems with the timing of NOR bus access cycles
<wpwrak>
maybe some of the signals are just too weak ? a voltage check could help to clarify this
<wolfspraul>
the sample set was smaller (run of 40 instead of 90), and there was a lot less testing in rc2 than rc3, but I am pretty sure this same 'kind' of bug already existed in rc2
<wolfspraul>
so I think that rules out anything new that got introduced by the reset ic or diode
<wolfspraul>
big guess though, just from thinking about what cases I saw or remember
<wpwrak>
wolfspraul: you think it's the same as the NOR corruption ?
<wolfspraul>
well, the best data I have now are the rc3 test results
<wpwrak>
(which could of course just be invalid data showing up, without the NOR itself being compromised)
<wolfspraul>
so I scan them, top to bottom and back up, on the 'notes' column
<wolfspraul>
what I see now, even though Adam is not finished yet, is easily 20-30 boards that all fall into one 'group'
<wolfspraul>
46 have passed, 1 adam, 17 in that 'group', 26 in other failure states currently
<wolfspraul>
that 26 will come down more
<wpwrak>
plus, 0x39 seems to be able to enter this state, whatever it is, relatively easily. let's make this our preferred candidate for now.
<wpwrak>
and if it is in this state, it doesn't seem to get out without a power cycle. but maybe this is just a lack of time
<wolfspraul>
so that's a big group already (17, counted conservatively), and growing
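A quick cross-check of the tally quoted above, as a minimal sketch. The four counts and the rc3 run size of 90 come from the chat; the dictionary keys are illustrative labels only.

```python
# Tally of rc3 board states as quoted in the chat (run of 90 boards).
counts = {"passed": 46, "adam": 1, "dim_led_group": 17, "other_failures": 26}

total = sum(counts.values())                 # should match the run size
share = counts["dim_led_group"] / total      # conservative size of the 'group'

print(f"total boards: {total}")
print(f"dim-LED group: {share:.1%} of the run")
```

The counts add up to the full run, and the conservatively counted group is a little under a fifth of it.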
<wpwrak>
has a board with dim LEDs been left running for a long time, say, overnight ?
<wolfspraul>
you mean in dim LED state?
<wolfspraul>
afaik it's not running then
<wolfspraul>
dim LED means no boot
<wpwrak>
lekernel: on CRC failure, will the FPGA just keep on trying forever ? or does it eventually give up ?
<lekernel>
iirc it tries 3 times or something like that
<wpwrak>
wolfspraul: i mean leave it on, see if it eventually succeeds
<lekernel>
but i'm not sure
<wpwrak>
lekernel: and then ?
<lekernel>
stays in unconfigured state
<wpwrak>
bleh :-(
<lekernel>
in any case, loading fjmem.bit will stop all other configuration attempts
<wpwrak>
what's fjmem.bit ?
<wolfspraul>
wpwrak: you want to do voltage check on which wires?
<wpwrak>
the "three button salute" triggers a reset, right ? does it also work if unconfigured ?
<lekernel>
wpwrak, the bitstream that is loaded to give urjtag a "fast" jtag access to the flash
<wpwrak>
wolfspraul: basically all the NOR signals. pick a convenient line, e.g., OE, do, say, a read cycle, then see how they behave
<wolfspraul>
wpwrak: 0x39 does not just 'enter' this state easily, more importantly it is 'in' this state right now and we don't know how to get it out
<wpwrak>
wolfspraul: if one can't quite decide whether it should be 0 or 3.3 V, we may have found our problem. maybe set trigger on OE#, then start with RP#, WE#, DQ0, A0, then do the rest of DQx and Ax
<wpwrak>
wolfspraul: it does seem to get out of the state sometimes.
<wpwrak>
wolfspraul: ah, before DQ0, also CE0
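The "can't quite decide whether it should be 0 or 3.3 V" check can be sketched as a simple classification of settled scope samples. The thresholds below are the common 3.3 V LVTTL/LVCMOS input levels and are an assumption here, not values taken from the NOR or FPGA datasheets; the `samples` structure is hypothetical.

```python
# Assumed 3.3 V LVTTL/LVCMOS input thresholds; check the actual datasheets.
V_IL_MAX = 0.8   # at or below this, the level reads as a solid 0
V_IH_MIN = 2.0   # at or above this, the level reads as a solid 1

# Probe order suggested in the chat (trigger on OE#, then these lines first).
PROBE_ORDER = ["RP#", "WE#", "CE0", "DQ0", "A0"]

def classify(volts):
    """Return 'low', 'high', or 'indeterminate' for one settled sample."""
    if volts <= V_IL_MAX:
        return "low"
    if volts >= V_IH_MIN:
        return "high"
    return "indeterminate"

def suspicious_lines(samples):
    """samples maps a signal name to a list of settled voltages (hypothetical data)."""
    return [name for name in PROBE_ORDER
            if any(classify(v) == "indeterminate" for v in samples.get(name, []))]
```

For example, `suspicious_lines({"WE#": [0.1, 1.4], "DQ0": [3.2]})` flags only `WE#`. Samples taken during an edge will naturally fall between the thresholds, so only settled levels are meaningful input.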
<wolfspraul>
yes, but next time we have a board in a state we may zoom in then
<wolfspraul>
there's two different things I think
<wpwrak>
lekernel: maybe something to check out would be whether all the FPGA I/O cells of the NOR pins are properly configured
<wolfspraul>
some event that gets it into this state, and some situation or effect that holds it there
<wpwrak>
wolfspraul: yes. could be temperature plus tolerances. the tolerances enable. the temperature makes it happen.
<wpwrak>
or maybe humidity, phase of the moon, ... ;-)
<wolfspraul>
I doubt it's room temperature. parts temperature - yes, possible.
<wpwrak>
part temp. starts at room temp. :) then you do a bit of testing, it fails, keeps on failing, you give up, put it away, and then it works, until ...
<wpwrak>
of course, if we're unlucky, probing the signals "fixes" it
<wolfspraul>
pah, tough
<wolfspraul>
can't seem to be able to pin it down
<wolfspraul>
I already ordered some more nor flash, just in case :-)
<wolfspraul>
have to ramp up the efforts a bit
<wpwrak>
you suspect the NOR could simply be bad ?
<wolfspraul>
unfortunately adam doesn't have a tsop-56 or whatever package it is tester that could test and scan the entire nor chip at once :-)
<wolfspraul>
no
<wolfspraul>
well
<wolfspraul>
I don't know
<wolfspraul>
'bad' as in what?
<wpwrak>
where's a cheap 56 channel analog scope with active probes when you need one ? ;-)
<wolfspraul>
maybe we are operating it outside of spec?
<wolfspraul>
not 'bad' as in broken parts or so, no
<wolfspraul>
not at this rate of 20% or more
<wpwrak>
you got it from a reputable source ?
<wolfspraul>
ahh :-)
<wolfspraul>
yes I think so
<wpwrak>
(-:C
<wolfspraul>
and no, there is no indication that that's the problem
<wolfspraul>
this chip is made on a 65nm process
<wolfspraul>
afaik nobody in China can do it yet
<wolfspraul>
anyway, no, the parts are good
<wolfspraul>
although if replacing some 'fixes' the bug of course I'd do that for now
<wpwrak>
lekernel: is there also a "slow" jtag access to flash ? i.e., just good old bit-banging ?
<wolfspraul>
but since we don't even know how to test whether the bug 'exists' on a particular board or not (if it is even board dependent), that wouldn't help either
<wpwrak>
wolfspraul: yeah, let's consider signal integrity for now
<lekernel>
afaik fjmem.bit is just bit banging, but you don't need to scan the 450+ pins of the BGA every time
<wolfspraul>
lekernel: can you make images of the same sources we have now for Xilinx Impact?
<wolfspraul>
or are they the same?
<wpwrak>
lekernel: so fjmem.bit is different from the regular NOR access algorithm ? i.e., much slower bus cycles ?
<lekernel>
but this has nothing to do with failure of the flash _after_ it has been written
<lekernel>
you need to convert to .mcs to use xilinx impact
<wolfspraul>
that's something we can try to bypass any libusb/urjtag/jtag-serial issue, although I don't think that's the root cause of the problem
<lekernel>
with srecord for example
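For orientation, an .mcs file is Intel HEX text. The sketch below shows roughly what such a conversion produces from a raw binary image; it is not srecord's implementation, the function names are made up for illustration, and whether Impact accepts a file built this way for the M1 flash is untested.

```python
def record(rectype, addr16, payload):
    """Format one Intel HEX record: length, 16-bit address, type, data, checksum."""
    body = bytes([len(payload), (addr16 >> 8) & 0xFF, addr16 & 0xFF, rectype]) + payload
    checksum = (-sum(body)) & 0xFF           # two's complement of the byte sum
    return ":" + (body + bytes([checksum])).hex().upper()

def to_mcs(data, base=0, chunk=16):
    """Convert raw bytes to Intel HEX text, starting at flash offset `base`."""
    # Assumption: base is chunk-aligned and chunk divides 64 KiB, so no data
    # record ever straddles a 64 KiB boundary.
    assert base % chunk == 0 and 0x10000 % chunk == 0
    lines = []
    upper = None
    for off in range(0, len(data), chunk):
        addr = base + off
        if addr >> 16 != upper:
            # Type 04: extended linear address (upper 16 bits of the address).
            upper = addr >> 16
            lines.append(record(0x04, 0, upper.to_bytes(2, "big")))
        lines.append(record(0x00, addr & 0xFFFF, data[off:off + chunk]))
    lines.append(record(0x01, 0, b""))       # end-of-file record
    return "\n".join(lines) + "\n"
```

Calling `to_mcs(open("image.bin", "rb").read(), base=0)` with a hypothetical `image.bin` yields text that can be written to an `.mcs` file. In practice a maintained tool such as srecord is the safer route.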
<wpwrak>
lekernel: and when the FPGA boots from NOR, does it always use a built-in bus protocol or does it, say, load a bit from NOR, then switches ?
<lekernel>
it uses the hardwired configuration system
<lekernel>
it seems you can send commands from the flash to change a few things while it's running, but I don't know
<wpwrak>
okay, so we have 2-3 entirely different bus protocol implementations. seems unlikely that all of them would just be wrong.
<wolfspraul>
wpwrak: hey, you will like this
<wpwrak>
ducks
<wolfspraul>
I followed the wiki to find our flash source, and it is the World Peace Industrial Group!!!
<wolfspraul>
if that's not trustworthy, then sorry, I cannot help you
<wpwrak>
Fallenou: well, osama got the nobel peace prize, so ...
<wpwrak>
err, obama. damn.
<Fallenou>
lol
<wolfspraul>
no but they are fine, really
<Fallenou>
i wtf'ed a few seconds
<wolfspraul>
also in this kind of part you rarely have problems, unless you really buy returned/used parts or so, and who does that...
<wolfspraul>
this part is too high-end
<wpwrak>
well, could be rejects
<wolfspraul>
no
<wolfspraul>
it's not
<Fallenou>
well you would not be the first to have troubles with flash parts
<wpwrak>
but let's assume for now it's a bus problem
<wpwrak>
Fallenou: understatement of the year ;-) you should work as a nuclear power spokesperson :)
<wpwrak>
Fallenou: of course, you admitted that a problem exists at all. so maybe not :)
<Fallenou>
hehe
<wolfspraul>
is it possible that the problem is in the fpga not the nor chip?
<wpwrak>
of course. same story there.
<wpwrak>
it doesn't seem to be a configuration problem, though, since the hardwired bus protocol also trips
<Fallenou>
i meant i heard a few people complaining about flash parts behaving strangely , even just soldered brand new ones
<Fallenou>
bad blocks problems and so on
<wolfspraul>
no I don't mean in terms of bad parts or so, that's not the case for sure. I'm just wondering what kind of problem it might be, theoretically.
<wpwrak>
Fallenou: NOR or NAND ?
<wolfspraul>
we could unsolder the nor of 0x39 and put it on a good board to see what happens there
<Fallenou>
was nand i think
<wolfspraul>
oh no
<wolfspraul>
now Werner will be busy for a while
<wolfspraul>
:-)
<Fallenou>
maybe it does not apply here at all
<wpwrak>
Fallenou: bad blocks are normal in NAND. and they have a very subtle definition of what constitutes a "good" block, too :)
<wolfspraul>
the problem of reseating a nor chip to another board is that it's quite intrusive and may create or mask problems
<wpwrak>
Fallenou: a "good" block is one with 0 or 1 error, i.e., few enough errors that the ECC can still fix it
<wolfspraul>
so we may just get noise back
<wpwrak>
wolfspraul: i'd look at the signal first. if we trust both FPGA and NOR, the problem must be on the bus :)
<Fallenou>
well ok nevermind sorry for the noise ;)
<wpwrak>
wolfspraul: first step: do something that exercises the bus and see if there's an anomaly
<wolfspraul>
we could take 0x39 and try to read the standby bitstream
<wolfspraul>
and compare with a board where that works
<wolfspraul>
what happens on 0x39 now - the ftdi chip loads a small bitstream into the fpga, and then it tries to read from nor via fpga
<wolfspraul>
but that fails/hangs ?
<wolfspraul>
any visibility into that?
<wolfspraul>
can the fpga 'log' all bus activity? :-)
<wolfspraul>
he he
<wolfspraul>
just thinking, maybe nonsense
<lekernel>
urjtag might have some debug mode
<lekernel>
also, inputting the commands manually one by one instead of using the batch script would already help
<lekernel>
and you can 'pld load' directly the soc design
<lekernel>
and run the test program
<wpwrak>
lekernel: does the FPGA take its master clock from the video codec ? or is there some other crystal ?
<wpwrak>
ah, Y2 .. so there must be a Y1 ...
<lekernel>
for configuration, it uses an internal oscillator
<wpwrak>
(found Y1, it's audio)
<wolfspraul>
pld load bitstream is interesting, we should have a little script for that too
<wolfspraul>
just in case
<wolfspraul>
but most likely the test program would then fail accessing the nor, no?
<wpwrak>
(internal) okay, so no risk of weird clock due to an unconfigured oscillator
<wolfspraul>
definitely something to try though
<wpwrak>
maybe crosstalk, reflections, ...
<lekernel>
why would they happen suddenly?
<lekernel>
also the jtag interface has a very slow clock
<wpwrak>
maybe it happens all of the time, just barely below the threshold
<wpwrak>
(jtag) so fjmem.bit is clocked by jtag, not the internal osc ?
<wolfspraul>
xiangfu: can you make a script that uses urjtag to pld load the soc and then runs the test program, all without accessing the nor chip?
<lekernel>
apparently there's some clock in fjmem too
<kristianpaul>
zlib magically solves by recompiling it again i think
<kristianpaul>
compile and install
<kristianpaul>
well that was time agoooo
<Fallenou>
I guess now flickernoise is using rtems' zlib, since it's no longer in the requirements on the wiki
<Fallenou>
too bad rtems zlib does not compile :o
<Fallenou>
at least on my mac os
<wpwrak>
maybe it's just an analog domain problem on the signals that we can't tell by looking at the schematics. or maybe it's a chain of events that sets off the trouble.
<wpwrak>
anyway, next step: try to boot 0x39. while it does reconfigure, reboot. when it fails to reconfigure, try to read back the NOR. that should clarify the NOR corruption theory.
<wpwrak>
(at least a little :)
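Once a readback exists, it can be compared against the image that was flashed, to see whether the NOR content actually changed. This is a minimal sketch, assuming both images are available as byte strings; `diff_ranges` is an illustrative helper, not part of any existing tool.

```python
def diff_ranges(reference, readback):
    """Return (start, end) byte ranges, end exclusive, where the images differ."""
    ranges = []
    start = None
    length = max(len(reference), len(readback))
    for i in range(length):
        # A missing byte (one image is shorter) counts as a difference.
        a = reference[i] if i < len(reference) else None
        b = readback[i] if i < len(readback) else None
        if a != b:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, length))
    return ranges
```

An empty result would suggest the NOR content is intact and the failure lies in reading it. Clustered ranges would point back toward corruption, and their offsets show which region was hit.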
<lekernel>
Fallenou, (zlib issue) please post that to the RTEMS mailing list; I have it, JP Bonn has it, and my friend Ralf is denying any problem exists
<kristianpaul>
(denying) that used to happen :-)
<Fallenou>
lekernel: ahah ok
<Fallenou>
lekernel: you opened a PR ?
<lekernel>
no I posted on the ML and all I got was stupid replies from Ralf
<wpwrak>
wonders if the NOR problem could still be INIT_B -> FLASH_RESET contamination
<wpwrak>
e.g., if "fix2" has a design flaw or if it frequently gets implemented in the wrong way
<wpwrak>
one test could be to remove D16 (FLASH_RESET_N to reset out). this should then remove any contamination, but may bring back the NOR corruption.
<wpwrak>
oh, and an alternative to using logic gates instead of the diodes in rc4 could be to have a second reset chip, dedicated to FLASH_RESET_N.
<wpwrak>
that could also be used to test whether properly separating FLASH_RESET_N from PROGRAM_B_2 and INIT_B would solve all the NOR problems. i.e., remove D16, add a reset chip in parallel to the existing one, and let it drive exclusively FLASH_RESET_N