#milkymist on 2011-08-24 — irc logs at freenode.irclog.whitequark.org

01:40 <aw> wpwrak, 0x3c: no critter being reproduced now after powered on, d2/d3 dimly lit. before powered-on, impedance TP36 - pin 34 of NOR: 20 KOhm(not constant) / 125 KOhm

01:41 <aw> wpwrak, sorry, mend above as: no critter being reproduced now after powered on, d2/d3 is fully off. before powered-on, impedance TP36 - pin 34 of NOR: 20 KOhm(not constant) / 125 KOhm

01:41 <aw> s/mend/amend

01:47 <aw> continue to test 'fix2b' boards...

01:57 <wpwrak> hmm, so 0x3c pretends to be good now

01:57 <wpwrak> aw: did you try 0x77 again ? this time the 3.3 V injection on TP36

01:58 <wpwrak> aw: (0x77) monitor TP36 and pin 34. reproduce the nastiness. then, while monitoring, connect TP36 through 100 Ohm to 3V3. see what happens

02:43 <aw> wpwrak, (0x77) is not easy to reproduce today though. before I connected 100 Ohm to 3.3V, i saw instability once. after connecting TP36 through 100 Ohm to 3V3, I've NOT seen instability so far, maybe 5 minutes have passed

02:44 <aw> wpwrak, not sure if pulling TP36 high through 100 Ohm is the cause.

02:56 <aw> wpwrak, (after TP36 through 100 Ohm to 3.3V) DQ8 is normally low when rendering. Stay HIGH(3V) when in reconfigure stage, normal pulses accessed after pressed middle btn(to boot up) then kept low steadily

05:03 <kristianpaul> http://www.linuxfordevices.com/c/a/News/IBM-SyNAPSE-neural-computing-project-demonstrated/ <- Moving beyond von Neumann

05:22 <xiangfu> kristianpaul, which toolchain version are you using?

05:23 <xiangfu> I try to compile the latest rtems gcc.

05:23 <xiangfu> but it always stops at "checking whether the target assembler supports thread-local storage..." and hangs there for hours.

06:56 <GitHub68> [scripts] xiangfu pushed 1 new commit to master: http://git.io/kHTZZg

06:56 <GitHub68> [scripts/master] compile-lm32-rtems: update gcc to 4.6.1 - Xiangfu Liu

07:01 <wpwrak> aw: (0x77) can you try to let it run without pulling TP36 up ? wait until the anomaly appears, and only then pull TP36 ? (i.e., when it has already started to act weird)

07:02 <wpwrak> aw: from what you've described so far, pulling seems to prevent it from entering anomalous behaviour. but i'd also be interested to know if pulling makes it exit anomalous behaviour.

07:03 <aw> wpwrak, I'll try it but i'm testing other. ;-)

07:19 <wpwrak> aw: ok. and after that, the next test would be similar: monitor TP36 and pin 34, wait until the anomaly happens, then pull DQ8 and see what happens.

07:20 <wpwrak> aw: what i'm trying to find out is whether the synchronization between DQ8 and PROGRAM_B is cause or effect

07:20 <aw> wpwrak, ok

07:22 <wpwrak> aw: also, when you saw it boot normally, did DQ8 also give the runts ? runts = the little spikes at t = -900 ns, -500 ns, -200 ns, +200 ns, +550 ns, +900 ns, +1200 ns in http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-NOR-pin34-DQ8_500ns.JPG.JPG

07:25 <aw> wpwrak, used a 10us/div, so didn't see runts in details. I'll watch it.

07:29 <wpwrak> aw: at 10 us/div, you still see them as ~1 V "noise floor": http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x77_ch1-TP36_ch2-NOR-pin34-DQ8.JPG

07:33 <aw> wpwrak, i didn't trigger this morning though. ;-)

07:34 <wpwrak> aw: you mean you didn't see the anomaly on 0x77 today ? or that you don't remember seeing the runts today ?

07:36 <aw> wpwrak, NO. i saw the anomaly once before TP36 was pulled through 100 Ohm to 3.3V. but I didn't trigger. So if it had runts, I don't know. ;-)

07:37 <wpwrak> ah, i see

07:40 <aw> so 100 Ohm // 10 KOhm is almost 99 Ohm, which seems to prevent the anomalous behaviour, well... will see.
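
The "almost 99 Ohm" figure is the standard parallel-resistance formula applied to the two values aw quotes. A minimal sketch, assuming the 10 KOhm is a resistor already present on the TP36 net (the log does not say which part it is); the function name is only illustrative:

```python
def parallel(r1_ohm: float, r2_ohm: float) -> float:
    """Equivalent resistance of two resistors in parallel, in ohms."""
    return (r1_ohm * r2_ohm) / (r1_ohm + r2_ohm)


# 100 Ohm external pull-up in parallel with an assumed 10 KOhm on the board
print(round(parallel(100, 10_000), 2))  # 99.01
```

So the added 100 Ohm dominates: the effective pull-up is about a hundred times stronger than 10 KOhm alone.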

09:08 <lekernel> xiangfu, gcc 4.6.x doesn't work

09:08 <xiangfu> lekernel, oh, then which version I should update to?

09:09 <lekernel> 4.5.3 + the latest RTEMS patches

09:09 <xiangfu> oh

09:11 <xiangfu> ok. try to compile now.

09:12 <xiangfu> the ubuntu repo already have the compiled 4.5.3: http://www.rtems.org/ftp/pub/rtems/linux/4.11/ubuntu/

09:13 <lekernel> yes, but it lacks the divider enabled multilibs

09:13 <lekernel> if you can get pesky ralf to add them, good

09:15 <xiangfu> will try. then I will add the crc button in flickernoise. next to the "Check Version"

09:15 <xiangfu> I don't want crc command any more :)

09:15 <lekernel> mh

09:16 <lekernel> where is flickernoise supposed to get the CRCs from?

09:16 <xiangfu> lekernel, (4.5.3) I just install then uninstall them. needs compile version. maybe setup all those stuff one day in BUILDHOST

09:17 <xiangfu> button maybe not. crc needs length :(

09:17 <lekernel> and what feature does a "CRC" button bring to the user?

09:17 <lekernel> imo this only belongs in RTEMS

09:18 <lekernel> the GUI shouldn't have system programming functions, only end user stuff

09:19 <Fallenou> 11:15 < lekernel> if you can get pesky ralf to add them, good < ahah

09:21 <xiangfu> lekernel, ok. got it.

09:22 <lekernel> if you want to be able to use the RTEMS shell without a serial cable you can 1) use telnet 2) write a terminal program in the GUI (activated with a relatively long keyboard shortcut, e.g. ctrl-alt-shift-t)

09:24 <xiangfu> ok

09:24 <lekernel> for the terminal program, you could certainly use the "text editor" widget as a starting point

09:25 <lekernel> it shouldn't be hard, if you don't want colors, cursor control sequences, and things like that

09:44 <xiangfu> definitely without colors etc. :)

11:10 <kristianpaul> xiangfu, gcc versión 4.5.2 (GCC)

11:11 <kristianpaul> old-rtems i guess... i haven't updated the toolchain in months..

11:22 <wolfspraul> wpwrak: oh well. bad news (for me :-)) we got the first failure after 4th rendering cycle with fix2b applied

11:22 <wolfspraul> 0x4C is the magic number. http://en.qi-hardware.com/wiki/Milkymist_One_run_3_schedule#Test_Results

11:24 <wolfspraul> can't believe it but I guess that's what we found... So - Adam is still wrapping up some work, he will stop by here later to discuss what we can learn from 0x4C

11:26 <wpwrak> wolfspraul: ah well, it had to happen sooner or later. will be interesting to see which symptoms he discovers.

11:26 <wolfspraul> Sebastien was wondering whether Adam is using leaded solder, otherwise he suspected whiskers http://en.wikipedia.org/wiki/Whisker_%28metallurgy%29

11:26 <wolfspraul> why did it have to happen sooner or later?

11:26 <wolfspraul> you knew it coming?

11:27 <lekernel> because fix2b isn't supposed to do anything, I'd guess .....

11:27 <wolfspraul> I was hopeful that the other boards were contained at some earlier state of testing, but I guess you were right all along.

11:27 <wpwrak> wolfspraul: i suspect we're not done yet with the 0x3c/0x77 cluster. and that one already showed all the promises of a long tail.

11:27 <wolfspraul> I cannot even read the notes of 0x3C/0x77 reasonably, so long are they.

11:28 <wolfspraul> tough

11:28 <wolfspraul> wpwrak: but why does it fail after several rendering cycles? what triggers the failure?

11:28 <wpwrak> (whiskers) hard to tell. i've never seen such things in real life and i don't quite know what you have to do wrong to get them.

11:29 <wolfspraul> it seems Sebastien says if Adam uses leaded solder we can rule whiskers out

11:29 <wpwrak> wolfspraul: maybe temperature. maybe it has a certain trigger condition. etc.

11:29 <wolfspraul> temperature, argh

11:29 <wolfspraul> but how can we get the design stable enough so this goes away?

11:30 <wolfspraul> with the 0x4C results as I understand them now (not yet 100% confirmed), my confidence in selling boards has dropped quite low again

11:30 <wolfspraul> how can we rule out that boards just spontaneously fail?

11:30 <wpwrak> wolfspraul: there are probably symptoms we can detect all the time. just rendering isn't a good test.

11:31 <wolfspraul> any scope measurements we can do to see early warning symptoms?

11:31 <wpwrak> wolfspraul: e.g., 0x77 has a "resistance" between TP36 and pin 34 that's different from the rest of the herd. there may be more such symptoms that can be used for diagnosis.

11:31 <wolfspraul> sure if we can identify a strong pass/fail test that would solve the most critical rc3 problem

11:32 <wolfspraul> actually in all the crazy reworks and testing, the yield is getting quite good, he he. not that there is much to laugh about in this run.

11:32 <wpwrak> wolfspraul: i'd look more for DC things. the scope is often a tricky instrument to use. particularly if you're looking for the absence of an event

11:32 <wolfspraul> we are up to 60 'good' boards now

11:32 <wolfspraul> in the end we make 80 I think

11:32 <wolfspraul> but this stuff is painful, so many people are waiting for their m1...

11:32 <wpwrak> yes, the overall yield now looks very promising

11:33 <wolfspraul> oh sure. measure resistance between tp36 and pin34 is a promising test?

11:34 <wpwrak> if you want to push things forward, you can also indicate that there may be this problem that's still being analyzed, and offer a rebate or replacement for rc4 in case this turns out to be bad for those who decide to take the risk

11:34 <wolfspraul> now all the good work that went into 0x77 and 0x3C comes to help. I apologize for rushing earlier. I was wrong :-)

11:34 <wpwrak> of course, it's trading time vs. risk of future expenses

11:35 <wolfspraul> nah, most people after fully understanding the issue would say "please ship me the m1 when it really works"

11:35 <wolfspraul> if I say "it could spontaneously fail at any time", that's not good

11:35 <wpwrak> (tp36-pin34) it didn't yield anything suspicious on 0x3c. so that still needs more investigation.

11:35 <wolfspraul> if we can find a strong test, that's all we may need

11:35 <wolfspraul> the rendering cycles test is no good for that

11:36 <wolfspraul> unless we find a nor corruption now on 0x4C and we believe it's caused by the power down/reset ic situation.

11:36 <wpwrak> (spontaneous failure) depends a bit on the use. if it's for development or evaluation, that would be acceptable. maybe even for studio work if a reasonable work-around can be found (such as "let it cool down for 10 minutes")

11:36 <wolfspraul> but we suspected that many times and so far I believe it hasn't materialized yet

11:37 <wpwrak> of course, unreliable hw is the last thing you want during a live performance :)

11:37 <wolfspraul> yes, no. it's not good.

11:37 <wolfspraul> and if it were as easy as 10 minutes that would be nice.

11:37 <wolfspraul> it could be a day, it could be forever.

11:37 <wolfspraul> test! that's a good approach. we need to do some comparative testing to find early warning signs.

11:37 <wpwrak> lekernel: btw, under what conditions is NOR accessed (read or write) after booting. e.g., is the "file system" mirrored in RAM or do reads go to NOR as well ?

11:39 <wpwrak> (test) yes, step one: find a pattern that leads to the underlying defect. then look for things the defect may affect and see if any of them can be tested.

11:39 <wolfspraul> wpwrak: that's pin 34 of which chip? nor chip?

11:39 <wpwrak> i just hope it's not just some wild ESD havoc, because that could be fairly unpredictable

11:40 <wpwrak> pin 34 of NOR, yes. it's on a ball next to PROGRAM_B on the FPGA's BGA

11:40 <wpwrak> and the trace to pin 34 is adjacent to out rework zone. that's why i looked for it in the first place.

11:40 <wpwrak> s/out/our/

11:41 <wpwrak> (i was hoping for some soldering bridge that somehow reached the trace. didn't quite expect something that looks like a semi-fried FPGA. but well, you have to take things as they come ;-)

11:43 <wpwrak> if adam continues with the testing, we may also find more boards for the cluster. the more, the merrier as far as analysis is concerned :)

11:44 <wpwrak> afk for a bit. have a quick medical checkup today and then have to be back in time for the fedex man.

11:59 <wolfspraul> I keep thinking about a connection and whether there is any insight to be discovered there.

11:59 <wpwrak> btw, the jtag test joerg suggested would be a good thing to have. a systematic test would also catch things that don't cause a noticeable upset during regular operation.

11:59 <wolfspraul> so lekernel says "fix2b is doing nothing".

12:00 <wolfspraul> that neglects the dynamics of the production and testing process of course. in reality fix2b helped us to reduce the number of boards that failed after x rendering cycles a lot.

12:00 <wolfspraul> so let's only think about those now - boards that fail after rendering cycle >= 2

12:00 <wolfspraul> why did fix2b fix them?

12:00 <wpwrak> fix2b removes potentially troublesome components. potentially as in we've already seen them act up.

12:01 <wolfspraul> and why does there seem to be another case now with 0x4C ?

12:01 <wpwrak> statistics :)

12:01 <wolfspraul> in other words - whatever risk fix2b removed, in the same line of thought may be more risks

12:01 <wolfspraul> well it could be unrelated phenomena

12:01 <wolfspraul> or just statistical nirvana, yes

12:02 <wolfspraul> but to me a board that failed after successful render cycles is still special

12:02 <wolfspraul> something happened

12:02 <wolfspraul> and that something went away with fix2b

12:02 <wolfspraul> my thinking may look for the wrong root cause of course, I admit. just trying different logic.

12:03 <wolfspraul> wpwrak: but that doesn't explain why they suddenly fail

12:03 <wolfspraul> my point is not bad soldering, bad component, etc.

12:03 <wolfspraul> I'm thinking about the event that triggers the failure.

12:03 <wolfspraul> which jtag test did joerg suggest? can we implement it easily?

12:27 <wpwrak> (jtag) drive the pins to a set of states, see if they read back correct values, and measure system current all the while. not _easy_ to implement. but worthwhile :)

12:28 <wpwrak> (sudden failure) statistics can explain all this. make five tests, then do something, make another five tests. probability of failure is 10% in each test. some will fail before the change, some after, some before and after, some never.

12:29 <wolfspraul> yes you are right could be statistical issues

12:29 <wpwrak> if you have 100 such boards, in fact about 35% would pass all ten tests without showing any problem. 3% of them will fail the very next test. and so on :)
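
The numbers wpwrak quotes can be reproduced with a few lines of arithmetic. This is a sketch under the same simplifying assumption he makes, namely that every test fails independently with probability 10%; the variable names are illustrative:

```python
P_FAIL = 0.10   # assumed per-test failure probability
N_BEFORE = 5    # tests before the change
N_AFTER = 5     # tests after the change

p_clean_before = (1 - P_FAIL) ** N_BEFORE
p_clean_after = (1 - P_FAIL) ** N_AFTER

outcomes = {
    "never fails": p_clean_before * p_clean_after,
    "fails only before": (1 - p_clean_before) * p_clean_after,
    "fails only after": p_clean_before * (1 - p_clean_after),
    "fails before and after": (1 - p_clean_before) * (1 - p_clean_after),
}

# Boards that pass all ten tests and then fail the eleventh
p_fail_next = outcomes["never fails"] * P_FAIL

for name, p in outcomes.items():
    print(f"{name}: {p:.1%}")
print(f"passes ten, fails the next: {p_fail_next:.1%}")
```

With these inputs about 35% of the boards never fail in ten tests and roughly 3.5% pass all ten and then fail the eleventh, which matches the "35%" and "3%" above. Note that about 24% fail only before the change, so a change that does nothing would still look like a fix on those boards.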

12:29 <wolfspraul> so maybe we focus on a strong pass/fail test only

12:30 <wolfspraul> that must be possible, statistics or not ;-)

12:30 <wolfspraul> ok the jtag test doesn't sound like something we can have in a few days, for rc3 in fact

12:31 <wpwrak> yes, we need to get behind the statistics. find something that's not statistical. or if we really can't, characterize the pattern and design tests that have a high probability of producing the problem. that usually means to automate the tests.

12:31 <wpwrak> (jtag) more like weeks

12:31 <wpwrak> maybe for rc4 ;-)

12:32 <wolfspraul> first I want to read back NOR on 0x4C

12:32 <wolfspraul> you think it will show writes (corruptions)?

12:33 <wpwrak> dunno. we don't understand the NOR corruption we've seen well enough yet.

12:33 <wpwrak> but i wouldn't be surprised if it did

12:34 <wolfspraul> I hope not

12:34 <wpwrak> also, we don't know for sure if the data gets corrupted on read, on write, or both ways

12:34 <wolfspraul> but I think we never saw corruptions after the first rendering, which is still my hope that at that point we have a 'good' board

12:35 <wpwrak> the CFI may give us some clues. at least there, we have a bit of static information.

12:35 <wpwrak> may also be statistics ;)

12:36 <wolfspraul> CFI?

12:36 <wpwrak> my rule of thumb is to do manual tests until i see something happen at least ~3 times. repeat 2-5 times to see if the frequency stays the same. then calculate the number of test cycles i need for sufficient probability of the thing happening. multiply with 10. then automate and let machines do what machines do best ;-)
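
The "calculate the number of test cycles" step can be sketched as follows, assuming independent cycles with a per-cycle probability estimated from the manual runs. The 99% target and the function name are assumptions for illustration, not something stated in the log:

```python
import math


def cycles_needed(p_event: float, confidence: float) -> int:
    """Smallest n such that the event shows up at least once in n
    independent cycles with probability >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_event))


p_estimated = 0.10                        # e.g. seen about once per ten manual cycles
n = cycles_needed(p_estimated, 0.99)      # 44
print(n, "cycles; with the x10 safety margin:", 10 * n)
```

The x10 factor covers the uncertainty of a frequency estimated from only a handful of manual observations.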

12:37 <wpwrak> CFI = an information structure in the flash. basically a set of parameters. with factory-defined (and a priori known to us) content.
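
Because the CFI content is known in advance, a dump of it can be checked mechanically. The sketch below only verifies the "QRY" identification string that the CFI specification places at word offsets 0x10 to 0x12 of the query table. It assumes the 16-bit words have already been read from the flash in query mode; how to obtain such a dump on the M1 is not covered here, and the function name is hypothetical:

```python
CFI_QRY_OFFSET = 0x10  # word offset of the "QRY" string in the CFI query table


def has_cfi_signature(query_words) -> bool:
    """True if the low bytes at offsets 0x10..0x12 spell 'QRY'."""
    chunk = query_words[CFI_QRY_OFFSET:CFI_QRY_OFFSET + 3]
    return bytes(w & 0xFF for w in chunk) == b"QRY"


# Synthetic example data, not a real dump
good = [0] * 0x20
good[0x10:0x13] = [0x51, 0x52, 0x59]      # 'Q', 'R', 'Y'
bad = list(good)
bad[0x11] ^= 0x01                         # single flipped data bit

print(has_cfi_signature(good), has_cfi_signature(bad))  # True False
```

Repeating such a read many times and counting mismatches would give a static reference for read corruption, independent of what was written to the array.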

12:37 <lekernel> wpwrak, reads go to the NOR

12:38 <wpwrak> lekernel: do reads happen "all the time" ? particularly in the tests adam does, do reads happen frequently throughout the test ? or maybe only at the beginning ?

12:39 <lekernel> after flickernoise has booted, they only happen when the system configuration is read and patches are read for compilation

12:39 <lekernel> should be only at the beginning

12:39 <lekernel> compiled patches are stored in the SDRAM

12:40 <wpwrak> so this means that "n hours of rendering" wouldn't tell us of any gremlins on the NOR bus

12:45 <wpwrak> power cycling would, though

12:45 <wpwrak> afk for ~30-60 min

13:14 <lekernel> yes

13:45 <wpwrak> back

13:48 <wpwrak> kewl. fedex say there's nothing to pay for the M1 :)

13:49 <wpwrak> they'll deliver it tomorrow

13:50 <wpwrak> lekernel: have you ever used jtag for a boundary scan ? that would allow a more systematic examination of things than the current functional testing

13:52 <GitHub88> [scripts] xiangfu force-pushed master from 09211c4 to afda277: http://git.io/DOsw5Q

13:52 <GitHub88> [scripts/master] compile-lm32-rtems: update gcc to 4.5.3 - Xiangfu Liu

14:08 <xiangfu> after updating gcc to 4.5.3 and using newlib 1.19.0 when compiling rtems, I still get the error "configure: error: missing define CLOCK_PROCESS_CPUTIME_ID" :(

14:28 <xiangfu> I missed the newlib patch :(

14:30 <Fallenou> xiangfu: use 1.21.1 newlib

14:30 <xiangfu> the newlib upload today: "newlib-1.19.0-rtems4.11-20110724.diff 24-Jul-2011 09:14 204K"

14:30 <xiangfu> ?

14:30 <xiangfu> yes. try new newlib now ,

14:30 <Fallenou> oops sorry

14:30 <Fallenou> was thinking about binutils

14:31 <wolfspraul> calling it a day, n8 everybody

14:31 <xiangfu> just grep found the CLOCK_PROCESS_CPUTIME_ID is in newlib patches.

14:31 <wolfspraul> let's not be too worried about 0x4C, soon we have a high quality rc3 result. I can sense it :-)

14:32 <Fallenou> damn it Ralf

14:35 <GitHub164> [scripts] xiangfu pushed 1 new commit to master: http://git.io/eWPeRQ

14:35 <GitHub164> [scripts/master] compile-lm32-rtems: update newlib patch - Xiangfu Liu

14:42 <xiangfu> needs another hour to build from scratch. then another hour for build flickernoise from scratch. compiling...

14:42 <xiangfu> 's job is keep fan stay 6000 RPM :)

14:43 <xiangfu> Fallenou, you saw the email from Ralf ? :)

14:43 <Fallenou> yes

14:43 <Fallenou> he's really pissing me off sometimes

14:56 <wpwrak> Fallenou: who is Ralf and what did he do ?

14:57 <lekernel> we should offer him this t-shirt for his birthday: http://images4.cpcache.com/product/retentive-monk-is+there+a+hyphen+in+anal-retentive%3F/142359664v4_225x225_Front.jpg

14:57 <Fallenou> Ralf is a RTEMS maintainer/developer/guy

14:57 <Fallenou> and he is just refusing some patches for obscure reasons sometimes

14:57 <Fallenou> it's annoying

14:58 <kristianpaul> wpwrak: http://www.rtems.org/pipermail/rtems-users/2011-August/008850.html

14:58 <Fallenou> lekernel: LOL

14:59 <wpwrak> lekernel: ;-))

14:59 <lekernel> xiangfu, Joel proposed that you compile some newlib code with and without the -mdivide-enabled/-mbarrel-shift-enabled and show that there is an optimization made by using those flags

15:00 <wpwrak> nice ;-)

15:00 <lekernel> xiangfu, just take some source file in newlib, and compile it with -c and with/without the -m*-enabled flags, then disassemble with objdump
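
The comparison lekernel describes amounts to four commands. The sketch below only builds and prints them. The tool prefix `lm32-rtems4.11-`, the `-O2` level and the file names are assumptions; the two `-m*-enabled` flags are the ones named above:

```python
HW_FLAGS = ["-mdivide-enabled", "-mbarrel-shift-enabled"]


def build_commands(source: str, prefix: str = "lm32-rtems4.11-"):
    """Compile `source` with and without the hardware flags, then
    disassemble both objects so the output can be compared."""
    commands = []
    for tag, extra in (("plain", []), ("hw", HW_FLAGS)):
        obj = f"{tag}.o"
        commands.append([prefix + "gcc", "-O2", "-c", *extra, source, "-o", obj])
        commands.append([prefix + "objdump", "-d", obj])
    return commands


for cmd in build_commands("example.c"):   # example.c is a placeholder name
    print(" ".join(cmd))
```

Diffing the two disassemblies should show whether divide and shift operations turn into single instructions instead of calls to software helper routines.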

15:01 <wpwrak> maybe someone could convince him that multiplication is overrated, too :) and a barrel shifter. c'mon ! all this fancy new stuff.

15:01 <lekernel> also, this patch is upstream GCC now (but unfortunately, right now only in those 4.6 releases that do not work at all)

15:02 <xiangfu> lekernel, ok. I will do that. maybe tomorrow. very late today. and I need about 1 hour wait toolchain compile

15:02 <wpwrak> so this ralf was quite right about "short-sighted" ;-)

15:02 <xiangfu> :D

15:03 <kristianpaul> new excuses to migrate to linux? :)

15:03 <wpwrak> oh yes :)

15:05 <Fallenou> lekernel: can't you back port the multilib patch to gcc upstream 4.5.3 ?

15:05 <Fallenou> maybe it would make Ralk stfu

15:05 <Fallenou> Ralf*

15:05 <lekernel> it's done

15:05 <lekernel> should be in 4.5.4, but it's not released yet

15:06 <Fallenou> ok

15:06 <xiangfu> cool

17:00 <kristianpaul> does openwrt implement a memtest app somewhere?..

17:03 <roh> dunno. don't think so

17:03 <roh> but i guess you can use the bootloader for that

17:03 <roh> uboot can do weird stuff sometimes

17:56 <kristianpaul> cool, the memtester found in debian looks pretty portable so far..

18:09 <kristianpaul> hum... /opt/rtems-4.11/libexec/gcc/lm32-rtems4.11/4.5.2/cc1: error while loading shared libraries: libmpc.so.2: cannot open shared object file: No such file or directory

18:25 <kristianpaul> Fallenou: you can run ISE on mac os?

18:26 <kristianpaul> natively

18:45 <kristianpaul> larsc: is it possible to use mmap on the current uClibc?

18:55 <larsc> should be

20:20 <Fallenou> kristianpaul: never tried to run ISE on mac sorry

20:20 <Fallenou> dunno if it's even possible

21:16 <kristianpaul> sure not, just wondering :)

21:22 <kristianpaul> hum seems there is no mmap support in rtems either..

21:34 <kristianpaul> too much to ask, it seems it is arch dependent?

21:49 <GitHub114> [milkymist] sbourdeauducq pushed 2 new commits to master: http://git.io/L9-4Dw

21:49 <GitHub114> [milkymist/master] tools: flterm: add log - Xiangfu Liu

21:49 <GitHub114> [milkymist/master] flterm: cosmetic changes + bump version number - Sebastien Bourdeauducq