<aw>
wpwrak, 0x3c: no critter being reproduced now after powered on, d2/d3 dimly lit. before powered-on, impedance TP36 - pin 34 of NOR: 20 KOhm(not constant) / 125 KOhm
<aw>
wpwrak, sorry, mend above as: no critter being reproduced now after powered on, d2/d3 is fully off. before powered-on, impedance TP36 - pin 34 of NOR: 20 KOhm(not constant) / 125 KOhm
<aw>
s/mend/amend
<aw>
continue to test 'fix2b' boards...
<wpwrak>
hmm, so 0x3c pretends to be good now
<wpwrak>
aw: did you try 0x77 again ? this time the 3.3 V injection on TP36
<wpwrak>
aw: (0x77) monitor TP36 and pin 34. reproduce the nastiness. then, while monitoring, connect TP36 through 100 Ohm to 3V3. see what happens
<aw>
wpwrak, (0x77) is not easy to reproduce today though. before I connected 100 Ohm to 3.3V, I saw instability once. after connecting TP36 through 100 Ohm to 3V3, I've NOT seen instability so far, maybe 5 minutes have passed
<aw>
wpwrak, not sure if pulling TP36 high through 100 Ohm is the cause.
<aw>
wpwrak, (after TP36 through 100 Ohm to 3.3V) DQ8 is normally low when rendering. It stays HIGH (3 V) during the reconfigure stage, shows normal access pulses after pressing the middle btn (to boot up), then stays steadily low
<xiangfu>
kristianpaul, which toolchain version are you using?
<xiangfu>
I try to compile the latest rtems gcc.
<xiangfu>
but it always stops at "checking whether the target assembler supports thread-local storage..." and hangs there for hours.
<GitHub68>
[scripts] xiangfu pushed 1 new commit to master: http://git.io/kHTZZg
<GitHub68>
[scripts/master] compile-lm32-rtems: update gcc to 4.6.1 - Xiangfu Liu
<wpwrak>
aw: (0x77) can you try to let it run without pulling TP36 up ? wait until the anomaly appears, and only then pull TP36 ? (i.e., when it has already started to act weird)
<wpwrak>
aw: from what you've described so far, pulling seems to prevent it from entering anomalous behaviour. but i'd also be interested to know if pulling makes it exit anomalous behaviour.
<aw>
wpwrak, I'll try it, but I'm testing other boards right now. ;-)
<wpwrak>
aw: ok. and after that, the next test would be similar: monitor TP36 and pin 34, wait until the anomaly happens, then pull DQ8 and see what happens.
<wpwrak>
aw: what i'm trying to find out is whether the synchronization between DQ8 and PROGRAM_B is cause or effect
<lekernel>
yes, but it lacks the divider enabled multilibs
<lekernel>
if you can get pesky ralf to add them, good
<xiangfu>
will try. then I will add the crc button in flickernoise. next to the "Check Version"
<xiangfu>
I don't want crc command any more :)
<lekernel>
mh
<lekernel>
where is flickernoise supposed to get the CRCs from?
<xiangfu>
lekernel, (4.5.3) I just install then uninstall them. needs compile version. maybe setup all those stuff one day in BUILDHOST
<xiangfu>
button maybe not. crc needs length  :(
<lekernel>
and what feature does a "CRC" button bring to the user?
<lekernel>
imo this only belongs in RTEMS
<lekernel>
the GUI shouldn't have system programming functions, only end user stuff
<Fallenou>
11:15 < lekernel> if you can get pesky ralf to add them, good < ahah
<xiangfu>
lekernel, ok. got it.
<lekernel>
if you want to be able to use the RTEMS shell without a serial cable you can 1) use telnet 2) write a terminal program in the GUI (activated with a relatively long keyboard shortcut, e.g. ctrl-alt-shift-t)
<xiangfu>
ok
<lekernel>
for the terminal program, you could certainly use the "text editor" widget as a starting point
<lekernel>
it shouldn't be hard, if you don't want colors, cursor control sequences, and things like that
<xiangfu>
definitely without colors etc. :)
<kristianpaul>
xiangfu, gcc versión 4.5.2 (GCC)
<kristianpaul>
old-rtems i guess... i haven't updated the toolchain in months..
<wolfspraul>
wpwrak: oh well. bad news (for me :-)) we got the first failure after 4th rendering cycle with fix2b applied
<wolfspraul>
can't believe it but I guess that's what we found... So - Adam is still wrapping up some work, he will stop by here later to discuss what we can learn from 0x4C
<wpwrak>
wolfspraul: ah well, it had to happen sooner or later. will be interesting to see which symptoms he discovers.
<wolfspraul>
why did it have to happen sooner or later?
<wolfspraul>
you knew it was coming?
<lekernel>
because fix2b isn't supposed to do anything, I'd guess .....
<wolfspraul>
I was hopeful that the other boards were contained at some earlier state of testing, but I guess you were right all along.
<wpwrak>
wolfspraul: i suspect we're not done yet with the 0x3c/0x77 cluster. and that one already showed all the promises of a long tail.
<wolfspraul>
I cannot even read the notes of 0x3C/0x77 reasonably, so long are they.
<wolfspraul>
tough
<wolfspraul>
wpwrak: but why does it fail after several rendering cycles? what triggers the failure?
<wpwrak>
(whiskers) hard to tell. i've never seen such things in real life and i don't quite know what you have to do wrong to get them.
<wolfspraul>
it seems Sebastien says if Adam uses leaded solder we can rule whiskers out
<wpwrak>
wolfspraul: maybe temperature. maybe it has a certain trigger condition. etc.
<wolfspraul>
temperature, argh
<wolfspraul>
but how can we get the design stable enough so this goes away?
<wolfspraul>
with the 0x4C results as I understand them now (not yet 100% confirmed), my confidence in selling boards has dropped quite low again
<wolfspraul>
how can we rule out that boards just spontaneously fail?
<wpwrak>
wolfspraul: there are probably symptoms we can detect all the time. just rendering isn't a good test.
<wolfspraul>
any scope measurements we can do to see early warning symptoms?
<wpwrak>
wolfspraul: e.g., 0x77 has a "resistance" between TP36 and pin 34 that's different from the rest of the herd. there may be more such symptoms that can be used for diagnosis.
<wolfspraul>
sure if we can identify a strong pass/fail test that would solve the most critical rc3 problem
<wolfspraul>
actually in all the crazy reworks and testing, the yield is getting quite good, he he. not that there is much to laugh about in this run.
<wpwrak>
wolfspraul: i'd look more for DC things. the scope is often a tricky instrument to use. particularly if you're looking for the absence of an event
<wolfspraul>
we are up to 60 'good' boards now
<wolfspraul>
in the end we make 80 I think
<wolfspraul>
but this stuff is painful, so many people are waiting for their m1...
<wpwrak>
yes, the overall yield now looks very promising
<wolfspraul>
oh sure. measure resistance between tp36 and pin34 is a promising test?
<wpwrak>
if you want to push things forward, you can also indicate that there may be this problem that's still being analyzed, and offer a rebate or replacement for rc4 in case this turns out to be bad for those who decide to take the risk
<wolfspraul>
now all the good work that went into 0x77 and 0x3C comes to help. I apologize for rushing earlier. I was wrong :-)
<wpwrak>
of course, it's trading time vs. risk of future expenses
<wolfspraul>
nah, most people after fully understanding the issue would say "please ship me the m1 when it really works"
<wolfspraul>
if I say "it could spontaneously fail at any time", that's not good
<wpwrak>
(tp36-pin34) it didn't yield anything suspicious on 0x3c. so that still needs more investigation.
<wolfspraul>
if we can find a strong test, that's all we may need
<wolfspraul>
the rendering cycles test is no good for that
<wolfspraul>
unless we find a nor corruption now on 0x4C and we believe it's caused by the power down/reset ic situation.
<wpwrak>
(spontaneous failure) depends a bit on the use. if it's for development or evaluation, that would be acceptable. maybe even for studio work if a reasonable work-around can be found (such as "let it cool down for 10 minutes")
<wolfspraul>
but we suspected that many times and so far I believe it hasn't materialized yet
<wpwrak>
of course, unreliable hw is the last thing you want during a live performance :)
<wolfspraul>
yes, no. it's not good.
<wolfspraul>
and if it were as easy as 10 minutes that would be nice.
<wolfspraul>
it could be a day, it could be forever.
<wolfspraul>
test! that's a good approach. we need to do some comparative testing to find early warning signs.
<wpwrak>
lekernel: btw, under what conditions is NOR accessed (read or write) after booting. e.g., is the "file system" mirrored in RAM or do reads go to NOR as well ?
<wpwrak>
(test) yes, step one: find a pattern that leads to the underlying defect. then look for things the defect may affect and see if any of them can be tested.
<wolfspraul>
wpwrak: that's pin 34 of which chip? nor chip?
<wpwrak>
i just hope it's not just some wild ESD havoc, because that could be fairly unpredictable
<wpwrak>
pin 34 of NOR, yes. it's on a ball next to PROGRAM_B on the FPGA's BGA
<wpwrak>
and the trace to pin 34 is adjacent to out rework zone. that's why i looked for it in the first place.
<wpwrak>
s/out/our/
<wpwrak>
(i was hoping for some soldering bridge that somehow reached the trace. didn't quite expect something that looks like a semi-fried FPGA. but well, you have to take things as they come ;-)
<wpwrak>
if adam continues with the testing, we may also find more boards for the cluster. the more, the merrier as far as analysis is concerned :)
<wpwrak>
afk for a bit. have a quick medical checkup today and then have to be back in time for the fedex man.
<wolfspraul>
I keep thinking about a connection and whether there is any insight to be discovered there.
<wpwrak>
btw, the jtag test joerg suggested would be a good thing to have. a systematic test would also catch things that don't cause a noticeable upset during regular operation.
<wolfspraul>
so lekernel says "fix2b is doing nothing".
<wolfspraul>
that neglects the dynamics of the production and testing process of course. in reality fix2b helped us to reduce the number of boards that failed after x rendering cycles a lot.
<wolfspraul>
so let's only think about those now - boards that fail after rendering cycle >= 2
<wolfspraul>
why did fix2b fix them?
<wpwrak>
fix2b removes potentially troublesome components. potentially as in we've already seen them act up.
<wolfspraul>
and why does there seem to be another case now with 0x4C ?
<wpwrak>
statistics :)
<wolfspraul>
in other words - whatever risk fix2b removed, in the same line of thought may be more risks
<wolfspraul>
well it could be unrelated phenomena
<wolfspraul>
or just statistical nirvana, yes
<wolfspraul>
but to me a board that failed after successful render cycles is still special
<wolfspraul>
something happened
<wolfspraul>
and that something went away with fix2b
<wolfspraul>
my thinking may be looking for the wrong root cause of course, I admit. just trying different logic.
<wolfspraul>
wpwrak: but that doesn't explain why they suddenly fail
<wolfspraul>
my point is not bad soldering, bad component, etc.
<wolfspraul>
I'm thinking about the event that triggers the failure.
<wolfspraul>
which jtag test did joerg suggest? can we implement it easily?
<wpwrak>
(jtag) drive the pins to a set of states, see if they read back correct values, and measure system current all the while. not _easy_ to implement. but worthwhile :)
<wpwrak>
(sudden failure) statistics can explain all this. make five tests, then do something, make another five tests. probability of failure is 10% in each test. some will fail before the change, some after, some before and after, some never.
<wolfspraul>
yes, you are right, it could be statistical issues
<wpwrak>
if you have 100 such boards, in fact about 35% would pass all ten tests without showing any problem. 3% of them will fail the very next test. and so on :)
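The figures quoted here can be checked with a short calculation. This is a minimal sketch, assuming (as stated in the discussion) that every test fails independently with probability 10%; the function names are illustrative only.

```python
def pass_all(p_fail: float, n_tests: int) -> float:
    """Probability that one board passes n_tests independent tests."""
    return (1.0 - p_fail) ** n_tests


def fail_on_next(p_fail: float, n_passed: int) -> float:
    """Probability of passing n_passed tests and then failing the next one."""
    return pass_all(p_fail, n_passed) * p_fail


def before_after_split(p_fail: float, n_each: int) -> dict:
    """Outcome probabilities for n_each tests before and after a change
    that has no real effect on the failure rate."""
    ok = pass_all(p_fail, n_each)   # a clean block of n_each tests
    bad = 1.0 - ok                  # at least one failure in a block
    return {
        "only_before": bad * ok,
        "only_after": ok * bad,
        "both": bad * bad,
        "never": ok * ok,
    }


if __name__ == "__main__":
    print(f"pass all ten tests:    {pass_all(0.1, 10):.1%}")
    print(f"fail on the 11th test: {fail_on_next(0.1, 10):.1%}")
    for outcome, prob in before_after_split(0.1, 5).items():
        print(f"{outcome:12s} {prob:.1%}")
```

Under these assumptions roughly a quarter of the boards would fail only before the change and pass afterwards, even though the change did nothing, which is why a run of clean tests after a rework is weak evidence on its own.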
<wolfspraul>
so maybe we focus on a strong pass/fail test only
<wolfspraul>
that must be possible, statistics or not ;-)
<wolfspraul>
ok the jtag test doesn't sound like something we can have in a few days, for rc3 in fact
<wpwrak>
yes, we need to get behind the statistics. find something that's not statistical. or if we really can't, characterize the pattern and design tests that have a high probability of producing the problem. that usually means to automate the tests.
<wpwrak>
(jtag) more like weeks
<wpwrak>
maybe for rc4 ;-)
<wolfspraul>
first I want to read back NOR on 0x4C
<wolfspraul>
you think it will show writes (corruptions)?
<wpwrak>
dunno. we don't understand the NOR corruption we've seen well enough yet.
<wpwrak>
but i wouldn't be surprised if it did
<wolfspraul>
I hope not
<wpwrak>
also, we don't know for sure if the data gets corrupted on read, on write, or both ways
<wolfspraul>
but I think we never saw corruptions after the first rendering, which is still my hope that at that point we have a 'good' board
<wpwrak>
the CFI may give us some clues. at least there, we have a bit of static information.
<wpwrak>
may also be statistics ;)
<wolfspraul>
CFI?
<wpwrak>
my rule of thumb is to do manual tests until i see something happen at least ~3 times. repeat 2-5 times to see if the frequency stays the same. then calculate the number of test cycles i need for sufficient probability of the thing happening. multiply with 10. then automate and let machines do what machines do best ;-)
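A rough sketch of the "calculate the number of test cycles" step in this rule of thumb, assuming independent cycles with a constant per-cycle probability of the event. The example numbers (3 events in 30 manual cycles, 95% target probability) are made up for illustration and are not taken from the tests discussed here.

```python
import math


def cycles_needed(p_event: float, target: float) -> int:
    """Smallest n such that the event shows up at least once in n
    independent cycles with probability >= target."""
    if not 0.0 < p_event < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("probabilities must be strictly between 0 and 1")
    # 1 - (1 - p)^n >= target  <=>  n >= log(1 - target) / log(1 - p)
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_event))


def automated_cycles(events_seen: int, manual_cycles: int,
                     target: float, safety_factor: int = 10) -> int:
    """Estimate the per-cycle rate from manual testing, then scale the
    required cycle count by a safety factor (10 in the rule of thumb)."""
    p_estimate = events_seen / manual_cycles
    return safety_factor * cycles_needed(p_estimate, target)


if __name__ == "__main__":
    # Hypothetical observation: 3 events in 30 manual cycles.
    print(cycles_needed(3 / 30, 0.95))
    print(automated_cycles(3, 30, 0.95))
```

The factor of 10 covers the fact that a rate estimated from only a handful of observed events is itself very uncertain.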
<wpwrak>
CFI = an information structure in the flash. basically a set of parameters. with factory-defined (and a priori known to us) content.
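Because the CFI content is known in advance, a readback can be compared bit by bit against a reference. The sketch below assumes the query data has already been read into a list of words indexed by query address (how the flash is put into query mode and read on the M1 is not shown and would be hardware specific). The only fact relied on is that the CFI query structure starts with the ASCII string "QRY" at query addresses 0x10 to 0x12; the reference values and the flipped bit in the example are invented.

```python
CFI_QUERY_BASE = 0x10       # query address of the identification string
CFI_SIGNATURE = b"QRY"      # defined by the CFI specification


def has_cfi_signature(words: list) -> bool:
    """Check the low byte of the three words at 0x10..0x12 for 'QRY'."""
    sig = bytes(w & 0xFF for w in words[CFI_QUERY_BASE:CFI_QUERY_BASE + 3])
    return sig == CFI_SIGNATURE


def diff_bits(observed: list, reference: list) -> list:
    """Return (address, xor_mask) for every word that differs.
    Set bits in xor_mask mark the data lines that read back wrong."""
    return [(addr, o ^ r)
            for addr, (o, r) in enumerate(zip(observed, reference))
            if o != r]


if __name__ == "__main__":
    # Invented reference: zeros up to 0x10, then the signature.
    reference = [0] * CFI_QUERY_BASE + [ord(c) for c in "QRY"]
    observed = list(reference)
    observed[0x11] ^= 1 << 8    # pretend data bit 8 read back high once
    print(has_cfi_signature(observed))
    for addr, mask in diff_bits(observed, reference):
        print(f"address 0x{addr:02x}: wrong bits 0x{mask:04x}")
```

Reporting the XOR mask rather than just "mismatch" shows which data lines are affected, which helps separate a single stuck or coupled line from random corruption.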
<lekernel>
wpwrak, reads go to the NOR
<wpwrak>
lekernel: do reads happen "all the time" ? particularly in the tests adam does, do reads happen frequently throughout the test ? or maybe only at the beginning ?
<lekernel>
after flickernoise has booted, they only happen when the system configuration is read and patches are read for compilation
<lekernel>
should be only at the beginning
<lekernel>
compiled patches are stored in the SDRAM
<wpwrak>
so this means that "n hours of rendering" wouldn't tell us of any gremlins on the NOR bus
<wpwrak>
power cycling would, though
<wpwrak>
afk for ~30-60 min
<lekernel>
yes
<wpwrak>
back
<wpwrak>
kewl. fedex say there's nothing to pay for the M1 :)
<wpwrak>
they'll deliver it tomorrow
<wpwrak>
lekernel: have you ever used jtag for a boundary scan ? that would allow a more systematic examination of things than the current functional testing
<GitHub88>
[scripts] xiangfu force-pushed master from 09211c4 to afda277: http://git.io/DOsw5Q
<GitHub88>
[scripts/master] compile-lm32-rtems: update gcc to 4.5.3 - Xiangfu Liu
<xiangfu>
after updating gcc to 4.5.3 and using newlib 1.19.0 when compiling rtems, I still get the error "configure: error: missing define CLOCK_PROCESS_CPUTIME_ID" :(
<xiangfu>
I missed the newlib patch :(
<Fallenou>
xiangfu: use 1.21.1 newlib
<xiangfu>
the newlib upload today: "newlib-1.19.0-rtems4.11-20110724.diff  24-Jul-2011 09:14  204K"
<xiangfu>
?
<xiangfu>
yes. trying the new newlib patch now.
<Fallenou>
oops sorry
<Fallenou>
was thinking about binutils
<wolfspraul>
calling it a day, n8 everybody
<xiangfu>
just grepped and found that CLOCK_PROCESS_CPUTIME_ID is in the newlib patches.
<wolfspraul>
let's not be too worried about 0x4C, soon we have a high quality rc3 result. I can sense it :-)
<Fallenou>
damn it Ralf
<GitHub164>
[scripts] xiangfu pushed 1 new commit to master: http://git.io/eWPeRQ
<GitHub164>
[scripts/master] compile-lm32-rtems: update newlib patch - Xiangfu Liu
<xiangfu>
needs another hour to build from scratch. then another hour to build flickernoise from scratch. compiling...
<xiangfu>
's job is to keep the fan at 6000 RPM :)
<xiangfu>
Fallenou, you saw the email from Ralf ? :)
<Fallenou>
yes
<Fallenou>
he's really pissing me off sometimes
<wpwrak>
Fallenou: who is Ralf and what did he do ?
<lekernel>
xiangfu, Joel proposed that you compile some newlib code with and without the -mdivide-enabled/-mbarrel-shift-enabled and show that there is an optimization made by using those flags
<wpwrak>
nice ;-)
<lekernel>
xiangfu, just take some source file in newlib, and compile it with -c and with/without the -m*-enabled flags, then disassemble with objdump
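The comparison described here could be scripted roughly as follows. This is only a sketch that builds and prints the command lines; it does not run them. The `lm32-rtems4.11-` tool prefix is inferred from the toolchain path mentioned elsewhere in this log, and the source file name, the `-O2` level and the output file names are placeholders.

```python
import shlex
from pathlib import Path

# The two lm32 options named in the discussion.
HW_FLAGS = ["-mdivide-enabled", "-mbarrel-shift-enabled"]


def build_commands(source: str, prefix: str = "lm32-rtems4.11-",
                   opt: str = "-O2") -> list:
    """Return compile and disassemble commands for one source file,
    once without and once with the hardware option flags."""
    stem = Path(source).stem
    commands = []
    for tag, extra in (("plain", []), ("hwopt", HW_FLAGS)):
        obj = f"{stem}.{tag}.o"
        commands.append([prefix + "gcc", opt, *extra, "-c", source, "-o", obj])
        commands.append([prefix + "objdump", "-d", obj])
    return commands


if __name__ == "__main__":
    # "example.c" stands in for whichever newlib source file is chosen.
    for cmd in build_commands("example.c"):
        print(shlex.join(cmd))
```

Saving the two objdump outputs to files and diffing them would show whether the flags change the generated code for that file, for instance whether calls to software division or shift helpers disappear.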
<wpwrak>
maybe someone could convince him that multiplication is overrated, too :) and a barrel shifter. c'mon ! all this fancy new stuff.
<lekernel>
also, this patch is upstream GCC now (but unfortunately, right now only in those 4.6 releases that do not work at all)
<xiangfu>
lekernel, ok. I will do that. maybe tomorrow. very late today. and I need to wait about 1 hour for the toolchain to compile
<wpwrak>
so this ralf was quite right about "short-sighted" ;-)
<xiangfu>
:D
<kristianpaul>
new excuses to migrate to linux? :)
<wpwrak>
oh yes :)
<Fallenou>
lekernel: can't you back port the multilib patch to gcc upstream 4.5.3 ?
<Fallenou>
maybe it would make Ralk stfu
<Fallenou>
Ralf*
<lekernel>
it's done
<lekernel>
should be in 4.5.4, but it's not released yet
<Fallenou>
ok
<xiangfu>
cool
<kristianpaul>
does openwrt implement a memtest app somewhere?..
<roh>
dunno. don't think so
<roh>
but i guess you can use the bootloader for that
<roh>
uboot can do weird stuff sometimes
<kristianpaul>
cool, the memtester found in debian looks pretty portable so far..
<kristianpaul>
hum... /opt/rtems-4.11/libexec/gcc/lm32-rtems4.11/4.5.2/cc1: error while loading shared libraries: libmpc.so.2: cannot open shared object file: No such file or directory
<kristianpaul>
Fallenou: you can run ISE on mac os?
<kristianpaul>
natively
<kristianpaul>
larsc: is it possible to use mmap with the current uClibc?
<larsc>
should be
<Fallenou>
kristianpaul: never tried to run ISE on mac sorry
<Fallenou>
dunno if it's even possible
<kristianpaul>
sure not, just wondering :)
<kristianpaul>
hum, seems there is no mmap support in rtems either..
<kristianpaul>
too much to ask, it seems. is it arch dependent?
<GitHub114>
[milkymist] sbourdeauducq pushed 2 new commits to master: http://git.io/L9-4Dw
<GitHub114>
[milkymist/master] tools: flterm: add log - Xiangfu Liu
<GitHub114>
[milkymist/master] flterm: cosmetic changes + bump version number - Sebastien Bourdeauducq