<ignatius->
Ok. I've downloaded the kernel source that supports the bigger NAND partition (openwrt-xburst-release_2010-11-17) and compiled it. Booted it up, and I get this message after a bunch of UBIFS debugging messages (none of which help me): "ubimkvol: error!: bad volume type "ubifs"" -- Anyone know what I could possibly be doing wrong?
<ignatius->
And, yes, I have UBI compiled in the kernel.
<ignatius->
It'll boot to a prompt, the partition size just doesn't use the entire NAND.
<xiangfu>
ignatius-, you have to flash the correct UBI rootfs to your big partition.
<xiangfu>
ignatius-, or you can run 'mtd.nn mount data /data' to mount the 1.5GB data partition to /data
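A side note on the ubimkvol error quoted above: mtd-utils' ubimkvol only accepts "dynamic" or "static" as volume types, so "-t ubifs" is rejected; the filesystem is chosen later, at format/mount time. A minimal sketch of setting up a UBI volume by hand, where the MTD partition number and volume name are assumptions rather than values from this log:

    ubiattach /dev/ubi_ctrl -m 2                # attach the NAND MTD partition to UBI (partition number is an example)
    ubimkvol /dev/ubi0 -N rootfs -m -t dynamic  # -t must be "dynamic" or "static", never "ubifs"
    mount -t ubifs ubi0:rootfs /mnt             # the ubifs choice happens here, at mount time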
<ignatius->
Ah. That's what I was trying to avoid: having to reinstall the rootfs. I have a lot of stuff that I'll have to download all over again.
<ignatius->
What would the "proper" UBI rootfs be based on?
<ignatius->
I just don't understand why, when I compiled kernel sources a year or so ago, everything worked right. I could compile (working) kernels on my desktop. I was able to configure the kernel sources to meet my needs. Problem was, for some reason (by my own doing?) I messed everything up, and can no longer (no matter which source tree I've tried) recompile a working kernel.
<ignatius->
And I followed the Debian instructions. That's the route I took.
<ignatius->
There are some precompiled kernels that _DO_ work with my system. Most others don't. I don't get it.
<ignatius->
Like the "JLime" kernels, for example.
<xiangfu>
ignatius-, a different GCC/uClibc may make the kernel unable to boot into the rootfs; something goes wrong loading 'init'.
<ignatius->
Ah. I see.
<ignatius->
Yes. VFS error. Unable to load init.
<ignatius->
I remember seeing that before. Didn't make the connection.
<ignatius->
What did you mean by "flashing the correct UBI rootfs to the big partition"? Is there a targeted way to compile the UMID/MTD partitions?
<ignatius->
Er. s/compile/access
<xiangfu>
ignatius-, sorry, the UBI rootfs is the same.
<xiangfu>
since your rootfs was already used in a 512MB partition, it will not work if you just modify the partition size.
<xiangfu>
you have to reflash it again.
<wolfspraul>
it's amazing how long a change in partition size has been haunting us
<wpwrak>
yeah, such things need to be pushed, not offered for "pickup at your convenience"
<wpwrak>
there's also the problem of coordinating the distributions
<wolfspraul>
there's a worrying pattern I see in boards that pass the 'rendering cycle' fine (meaning boot to render, let it render for 30 seconds), but then fail to reconfigure the fpga after a power cycle
<wolfspraul>
I will wait for Adam to go through the entire lot first and let the dust settle, but I think it's already clear there's something valuable to be discovered there.
<wolfspraul>
for example 0x34, 0x39
<wpwrak>
and what brings them back ?
<wolfspraul>
0x3C
<wolfspraul>
right now when there's a problem Adam will mostly just stop with the board 'failed'
<wolfspraul>
later when we have data for all 90 we zoom in on any cluster of problems we can see
<wolfspraul>
but I think I can see that one already :-)
<wolfspraul>
a board that renders just fine, some once, twice, some 6 times, 9 times, and then suddenly cannot reconfigure anymore
<wolfspraul>
'renders' means a full boot and render for 30 seconds without any noticeable issues
<wpwrak>
"reconfigure" is just the boot, no reflashing, correct ? (i.e., only 34 of these three also has the reflashing problem, correct ?)
<wolfspraul>
some of them then stay in this condition, others may come back the next day or several days late
<wolfspraul>
later
<wpwrak>
odd indeed :)
<wolfspraul>
read the notes
<wolfspraul>
then you see the sequence of steps on each board
<wolfspraul>
yes it could be related to flash
<wolfspraul>
it could be related to fix2
<wolfspraul>
it could be related to tolerances of the diode or capacitors we added
<wolfspraul>
it could be related to something we could fix in software (bitstream), depending on where exactly in the 'reconfigure fails' it stops
<wpwrak>
how hard is it to read back the flash via jtag ?
<wolfspraul>
should be easy, we can try
<wpwrak>
i've heard that it is possible but slow. how slow ? :)
<wolfspraul>
I don't know though, just guessing
<wpwrak>
reading back the flash would eliminate mere flash corruption as an issue
<wpwrak>
well, unless it _is_ flash corruption ;-)
<wpwrak>
in which case you'd have identified the disease :)
<wolfspraul>
you mean the initial flashing is faulty?
<wolfspraul>
that is impossible now since we do crc checks, and also those are all boards that first rendered fine
<wolfspraul>
and then after X power cycles, they stop reconfiguring
<wpwrak>
you could still get later flash corruption for some reason
<wolfspraul>
oh yes, we should definitely read back and compare
<wolfspraul>
if the data has changed, what could be the cause?
<wpwrak>
now you're asking the tricky questions ;-)
<wpwrak>
could be some more unexpected behaviour of the reset circuit
<wpwrak>
or maybe it's something else. for all we know, it could be the FPGA sending some junk
<wolfspraul>
let's see what lekernel says later. I feel somewhat uneasy about those failure cases.
<wpwrak>
as i understand it, the flash is only accessed when booting (and when updating). so if it gets corrupted later on, then you wouldn't notice until the next boot
<wolfspraul>
because if a board fails on the 9th rendering cycle (out of 10), that's only one tiny test away from selling it. and what would guarantee that it won't fail on the 13th cycle, i.e. 3 cycles into the user's hands...
<wpwrak>
yeah, rc3 has quite a few troubles. a bit frustrating after things have gone so well before. but hey, that's how you earn the official endorsement from murphy ;-)
<wolfspraul>
ah no, I don't see it that way
<wolfspraul>
rc1 had a whole number of major issues, expectedly
<wpwrak>
okay. if an rc1 is flawless, you're cheating ;-)
<wolfspraul>
rc2 was 40 boards, and I helped Adam with testing for several days, so I know quite well what I saw
<wolfspraul>
and we were nowhere near the testing quality we have with rc3
<wolfspraul>
if we had done that, we would have never shipped or sold a single rc2
<wolfspraul>
every single rc2 board has trouble booting
<wpwrak>
*grin*
<wolfspraul>
as we've discussed before. we were just pretty good at selling boards to people that never did much with them, or that were smart enough to use workarounds, or simply got used to having to cycle several times etc.
<wolfspraul>
but that won't scale
<wolfspraul>
so we do need to get to the real root cause sooner or later
<wolfspraul>
the later the more expensive
<wpwrak>
are you sure the rc3 reconf cluster has a "memory" ? i.e., is it good-good-good-good-bad-bad-bad-bad ? or is it good-good-good-good-bad-ring the alarm and stop testing ?
<wolfspraul>
the latter right now
<wolfspraul>
but soon we will dig in there
<wpwrak>
so it could be that just 1/N tests fail
<wolfspraul>
I doubt that
<wolfspraul>
there is firm evidence of 'memory'
<wolfspraul>
read the notes
<wpwrak>
it would be good to automate all those things. such that you can do, say, 100 or 1000 boot tests without adam actually pushing buttons
<wolfspraul>
0x34 0x39 0x3c
<wolfspraul>
well
<wolfspraul>
Adam requested that already but it's hard to implement, and we want to do a real physical disconnect of the power cable as well, at least that's the test now.
<wolfspraul>
so we have 0x34 0x39 0x3C now, and I'm sure by the time Adam went all the way through after the schmitt-trigger fix, there'll be more
<wolfspraul>
in which case I would rather do more testing before we start selling
<wolfspraul>
for example increase to 20 cycles and do all boards again
<wolfspraul>
see whether more 'drop out'
<wpwrak>
(notes) you mean the "notes" column ? or the .results link ? the "notes" column doesn't suggest any memory. and for 34, i see mainly a troubled history
<wolfspraul>
I mean all boards that are 'available' right now
<wolfspraul>
notes column
<wolfspraul>
0x34 is interesting
<wolfspraul>
if I understand the notes correctly it did flash, boot and render after replacing diode + c238
<wpwrak>
i don't see a "memory" anywhere. in the sense of a persistent state change of the system
<wolfspraul>
and then it went back to failure mode again
<wolfspraul>
well
<wolfspraul>
with 0x34 you see it very clearly
<wolfspraul>
that's why Adam had to resort to trying diode/c238 replacements
<wolfspraul>
because there was no other way to kick it back alive
<wolfspraul>
let me look at 0x39 and 0x3C
<wpwrak>
0x34 also has the reflashing issue. is this also with the short cable ?
<wolfspraul>
yes, those two show no memory yet, testing just stopped there for now
<wolfspraul>
I am sure Adam is 100% on the short cable now
<wolfspraul>
no need to add more uncertainty
<wolfspraul>
as soon as we have identified _any_ improvement, we'll go for that
<wpwrak>
in other words, the short cable doesn't always make the problem go away
<wolfspraul>
there may be multiple problems
<wolfspraul>
and the analysis of that short vs. long theory was also lacking
<wpwrak>
can you downgrade to full-speed usb ?
<wolfspraul>
it was an empirical decision
<wolfspraul>
I don't think it's a flashing issue, because once the crc test passes, we can assume everything was written correctly.
<wolfspraul>
so why doubt that?
<wpwrak>
you probably have impedance issues in the USB signals. high-speed is hairy
<wpwrak>
no, board 34 ends with: 10. stopped at 'Bitstream length: 1484404' while reflashing
<wolfspraul>
when Adam is back from lunch break, we can ask him to try to boot 0x39 and 0x3C, I would think we will see the 'memory' effect then
<wolfspraul>
yes now it does
<wpwrak>
so this is with the short cable ?
<wolfspraul>
but under 7. it rendered
<wolfspraul>
that's my point
<wolfspraul>
how can a board that boots and renders fall back (!) to an unreconfigurable state
<wpwrak>
no no ... different issue :)
<wolfspraul>
and yes, now it cannot be reflashed anymore
<wolfspraul>
but before it could be flashed
<wpwrak>
the "10. stopped at 'Bitstream length: 1484404' while reflashing" is also what you got when the cable was too long, correct ?
<wolfspraul>
don't know
<wolfspraul>
there are several boards in this state, but I believe all/most of them rendered before
<wolfspraul>
checking
<wolfspraul>
long cable issue was "d2/d3 dimly lit after flashing"
<wpwrak>
i would propose the following test: take a board that boots okay, verify that it boots okay. download the flash via jtag 3-5 times, verify that all the copies are identical (cmp, maybe also md5sum so that we have a reference), then boot again. if it boots okay, reconfig and everything, proceed. else, bring it back to life or pick a different board and repeat from the first step
<wolfspraul>
my concern now is on boards that rendered just fine, and then without any hw action, REGRESSED to unreconfigurable state
<wpwrak>
this will give you a known to be good downloaded flash content. this may or may not be identical to what you upload. if it is, even better. if not, that may just be something the flash process does, that's why i'd treat it as a black box for now.
<wolfspraul>
'download' you mean write the flash or read the flash?
<wolfspraul>
read multiple times? or write multiple times?
<wpwrak>
now, equipped with a reference downloaded content, take one of the boards that have regressed and jtag-download its flash. then compare with the reference. if they're different, something may have disturbed the flash, flash access may be unreliable, or it has a jtag/usb issue. if they're the same, it's something else
<wpwrak>
download = read from NOR to USB/PC via JTAG
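A minimal sketch of the check described above, assuming a shell on the test PC; "jtag_read_flash" is a hypothetical placeholder for whatever JTAG readback command is actually used, which the log does not name:

    # read the NOR flash several times over JTAG and confirm the copies agree
    for i in 1 2 3 4 5; do
        jtag_read_flash flash-copy-$i.bin       # hypothetical readback step
    done
    md5sum flash-copy-1.bin | tee reference.md5 # keep one hash as the reference
    for i in 2 3 4 5; do
        cmp flash-copy-1.bin flash-copy-$i.bin || echo "copy $i differs"
    done
    # later, a readback from a regressed board can be cmp'd against flash-copy-1.bin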
<wolfspraul>
yes sounds good, we'll prepare that
<wolfspraul>
for now Adam will go through the entire lot with another whole round of testing
<wpwrak>
oh, and if the 3-5 images from the "good" board differ among each other, that would also be interesting. if the board then still boots okay, then it means that USB is bad. otherwise, the flash may have become compromised.
<wolfspraul>
I highly doubt we are looking at usb issues here
<wolfspraul>
if anything your test may lead us in the wrong direction because it exposes usb issues that are unrelated to the regression on the running boards we are trying to study
<wolfspraul>
the issue is a running board (renders fine), that suddenly regresses to a nonconfigurable board
<wpwrak>
ph i'm absolutely sure you have an usb issue ;-) but i don't think it causes the reconfig problem. but it may hit you on other occasions, making tests less predictable
<wolfspraul>
how can that be related to any usb issue?
<wpwrak>
s/ph/oh/
<wolfspraul>
I agree, but that's a completely different problem, and doesn't block much right now.
<wolfspraul>
what blocks a lot is boards that render fine and then fall back.
<wpwrak>
the problem with the usb issue is that it may cause the jtag-download to be corrupted. errors CAN get past the USB CRC.
<wolfspraul>
that may mean none of the boards that pass can be sold, until we find the root cause
<wolfspraul>
how can that lead to a board first rendering fine, and then failing on the xth power cycle
<wpwrak>
no no. that's not what i'm saying
<wpwrak>
what i'm saying is that you need jtag-download as a tool to analyze this problem. but that tool itself may be compromised, due to the usb issue. so you need to factor in tests for data integrity over usb as well.
<wolfspraul>
even if it takes several reflash cycles, why bother about that unreliability right now? the interesting part is that after a proven successful flash (proven by a crc in the test image, and then proven by successful rendering), the board falls back
<wolfspraul>
anyway I agree on your download multiple times and md5sum etc.
<wolfspraul>
that will give us some visibility into the lower levels
<wolfspraul>
but the really interesting thing happens somewhere between 2 power cycles
<wpwrak>
and again, the failure to reflash is also a regression. although possibly in a different domain. (we don't know for sure that the "usb problem" is really usb. could also be, say, the flash occasionally just not wanting to talk to the world)
<wolfspraul>
where the first one ends in a board that renders, and the next one ends in a board that won't reconfigure
<wolfspraul>
what happens in between?
<wpwrak>
well, does anything happen in between ? :) we don't know this yet
<wpwrak>
it could be a memory-less effect
<wpwrak>
20% chance of failure on boards X, Y, and Z
<wolfspraul>
hah, there is Adam :-)
<wolfspraul>
aw: how's testing going?
<wolfspraul>
wpwrak: that is not what i gather from closely following the testing so far, this issue seems more sticky, like a switch, and then it stays.
<wpwrak>
something like that. you need automated tests (or a patient operator ;-) to analyze such things. remember those capacitor issues we had in gta02, where i did hundreds of automated runs. it's this kind of thing. if you have a stochastic problem, you need to collect enough data before you can be sure what it is that you're seeing
<wpwrak>
a "switch" would be nice. easier to debug :)
<wolfspraul>
my most pressing problem is to decide whether the boards labeled 'available' currently can be sold or not
<wolfspraul>
aw: I am wondering about 0x39 and 0x3C...
<aw>
don't know yet why
<aw>
continuing to finish the previous vga/midi/usb failures first..;-) my hands are not pipelined. :)
<aw>
will be back to see. :)
<wpwrak>
aw: you need more caffeine pills ;-)
<wpwrak>
wolfspraul: it's a pity that only adam can do testing. there's the risk of tunnel vision and other systematic errors in this.
<wpwrak>
wolfspraul: (besides the sheer workload, of course)
<wolfspraul>
true but unfixable
<wolfspraul>
there's a lot of risks
<wolfspraul>
we should already run rc4 with 160 units in parallel
<wolfspraul>
throw a lot more resources at it, take a lot more risks
<wolfspraul>
develop Milkymist Two in parallel
<wolfspraul>
and so on
<wolfspraul>
a lot of 'should'
<wpwrak>
(unfixable) yeah, possibly. lekernel didn't want to do a little trip to taipei ? :)
<wolfspraul>
Sebastien would need a few helping hands on the IC design as well, ahh. not 'would', but 'should'
<wolfspraul>
good idea, never thought about it
<wolfspraul>
Sebastien could surely learn some manufacturing patience, if he would be able to handle it :-)
<wolfspraul>
you are fatalistic enough that no matter what happens, you stay calm
<wolfspraul>
he he
<wpwrak>
(rc4) naw, i think it would be too early for this. there are still a number of things that seem a bit too vague and that can probably be debugged in rc3.
<wolfspraul>
such a trip would have been quite costly in time and money though; even if we had had the proposal earlier, I'm not sure everybody would have liked it
<wolfspraul>
rc3 is already far better than rc2, as a product. the downside is that as we raise the testing level, more unpleasant surprises turn up.
<wolfspraul>
well, at least for my part, I want that
<wolfspraul>
I have no illusions about my ability to sell boards one by one to people that shelve them, and so will never report back problems.
<wpwrak>
(if he would be able to handle it) yeah, that's what may be a concern. i remember harald freaking out ;-) (okay, that was mostly about FIC incompetence)
<wolfspraul>
incompetence or not, it's a very demanding mental challenge
<wolfspraul>
most software and 'design' people would run away
<wpwrak>
hehe :)
<wolfspraul>
it's frustrating to have to accept the randomness and seemingly unending stream of rare/unexpected/strange/should-not-be cases
<wolfspraul>
makes you feel like a small little nobody in the big wide universe
<wolfspraul>
so much better to hack on software, right?
<wolfspraul>
:-)
<wolfspraul>
anyway, good idea with the trip, but nobody had the idea before
<wolfspraul>
I'm not even sure it would have helped much, I think progress is just fine
<wpwrak>
(trip) before actually seeing all the little rc3 problems, it would indeed have sounded unnecessary. the problem is that, with those widely spread out issues and the large amount of rework, you really have to be "there"
<wolfspraul>
neither the acrylic cases nor the boxes/eva, labels and other print material has arrived in Taipei as of today
<wpwrak>
customs having a slow week ?
<wolfspraul>
no, all fine. moving.
<wolfspraul>
it's hardware
<wolfspraul>
doesn't travel at the speed of light
<wpwrak>
but i thought they're already at customs ?
<wpwrak>
and have been so for at least a day. maybe more ?
<wolfspraul>
yes, I don't know exactly which gps coordinate they are at at any given minute :-)
<wolfspraul>
A DAY!
<wolfspraul>
wow
<wolfspraul>
so many things can happen in a day, right?
<wolfspraul>
:-)
<wolfspraul>
Steve Jobs makes the entire iPhone in one single day.
<wolfspraul>
he gets up in the morning, and in the evening he shows his friends
<wpwrak>
well, i'm used to BUE customs sitting on stuff for a day and i hate every minute of it
<wolfspraul>
so let's see. Adam is making a full round of testing now.
<wolfspraul>
then we see new numbers
<wpwrak>
i thought he'd shit the next iMarvel right after breakfast. didn't realize it took him a whole day ;-)
<wolfspraul>
but I already have this suspicion there is more valuable stuff to discover
<wolfspraul>
if there were one board one time that went from rendering to unconfigurable, ok, i'd ignore it
<wolfspraul>
but it's several boards, and growing. too many.
<wpwrak>
can you downgrade the jtag dongle to full-speed ? and if yes, is it difficult to do ? if it's easy, i would recommend doing this before starting the flash corruption analysis
<wpwrak>
(jtag dongle) well, at least the one adam will use for this
<wolfspraul>
I think we can short or otherwise disable the eeprom on the daughterboard
<wolfspraul>
that way it will fall back to full-speed
<wolfspraul>
something like that
<wolfspraul>
it's a bit difficult to enforce on the Linux side, afaik
<wolfspraul>
missing feature
<wolfspraul>
can't just echo '0' into some sys file, afaik
<wpwrak>
you had them do only full-speed in the past (by accident), was that an eeprom issue ?
<wolfspraul>
no
<wolfspraul>
but now that that bug is fixed, I believe shorting the eeprom will cause it to fall to full
<wolfspraul>
I don't know how to force it on the Linux side, you tell me
<wolfspraul>
I looked once, couldn't find anything
<wpwrak>
okay. may be worth trying then
<wpwrak>
(linux side) dunno either. use an old full-speed-only hub ? ;-)
<wpwrak>
linux seems to have a mechanism, not for forcing full-speed per se, but for assigning the port to the companion UHCI/OHCI, which also has the effect of not using high-speed
<wpwrak>
now, let's see how this works ...
<wpwrak>
hmm. you need to know the PCI ID of your EHCI. how convenient. let's see if there's a link somewhere deeper in sysfs ...
<larsc>
well you could go by the driver node
<wpwrak>
larsc: the only path that seems to lead anywhere is via class/usbmon/. a little odd.
<wpwrak>
ah, there's the usb bus also in the pci hierarchy. let's see if this helps
<larsc>
wpwrak: /sys/bus/usb/devices/usbX
<larsc>
hmpf! where did my coffee go?
<wpwrak>
yes ... okay, with .. i can get to the proper PCI hierarchy. now, where's the companion file again ...
<wpwrak>
well, maybe. that only works for two EHCIs. the others have weirder paths.
<wpwrak>
yes ! :)
<wpwrak>
okay, here is how it works:
<wpwrak>
- plug in the critter and let it enumerate at high-speed. check the dmesg output. should say something like "usb 2-1: new high speed USB device"
<wpwrak>
- that's usb-$BUS-$PORT. you remember these values
<wpwrak>
- echo the port number into the EHCI's companion file; the device should then reset and re-enumerate as full-speed. dmesg should then show something like "usb 6-1: new full speed USB device"
<wpwrak>
in the example above, the command would have been: echo 1 >/sys/bus/usb/drivers/usb/usb2/../companion
<wpwrak>
as usual, mind the space between 1 and > ;-)
<wpwrak>
this only works with EHCIs that do things the old way, of having an UHCI/OHCI on their side. if they are themselves capable of lower speeds, this mechanism isn't available (and, it seems, there's no alternative means in this case)
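Condensed into commands, the procedure above looks roughly like this; the bus and port numbers are only those from wpwrak's example and will differ per machine:

    dmesg | tail    # e.g. "usb 2-1: new high speed USB device" -> bus 2, port 1
    echo 1 >/sys/bus/usb/drivers/usb/usb2/../companion   # hand port 1 of bus 2 over to the companion UHCI/OHCI
    dmesg | tail    # should now show something like "usb 6-1: new full speed USB device"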
<wpwrak>
`antonio`: oh, interesting. seems that you didn't cross-compile the programs. if you run "file iz" on /usr/sbin/iz, what does it say ? (you may have to copy it back to the Linux PC first, as your ben may not have "file" installed)
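A hedged example of that check; the copy step is an assumption and the exact output varies, but a correctly cross-compiled binary for the Ben should report a MIPS target rather than x86:

    scp root@ben:/usr/sbin/iz .   # or any other way of getting the binary back to the PC
    file iz
    # expected for a cross-compiled binary: something like
    #   iz: ELF 32-bit LSB executable, MIPS, MIPS32 ...
    # an x86/i386 line here would mean the programs were built for the PC instead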
<`antonio`>
wpwrak, do you have any image I can flash with the dirtpan modifications?
<wpwrak>
i don't have a full system image. but i can make and upload kernel and tools binaries. lemme check what i have ...
<`antonio`>
wpwrak, that would be great
<wpwrak>
but first, i have to fix my build environment ... the last experiments with trying to teach openwrt to cross-build SDL didn't go so well ...
<wpwrak>
(and make sure you don't reset/power down/etc. between starting the flash_eraseall and the end of the nandwrite. otherwise, it's usbboot time ;-)
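For reference, a sketch of the erase-then-write sequence that warning refers to; the MTD partition number and image name are assumptions:

    flash_eraseall /dev/mtd2                            # erase the target NAND partition first
    nandwrite -p /dev/mtd2 openwrt-xburst-rootfs.img    # then write the image, padded to the NAND page size
    # no reset or power-down between these two commands, otherwise it's usbboot time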
<`antonio`>
wpwrak, thanks i'll let you know how that goes
<wpwrak>
`antonio`: do you have 2 atbens or atben to atusb ?
<`antonio`>
wpwrak, yes i have 2 atbens
<wpwrak>
great. then you don't have to worry about the PC kernel :)
<`antonio`>
wpwrak, thank you ! we got it working !
<wpwrak>
`antonio`: whee ! welcome "on the air" ! ;-)
<wpwrak>
hmm, that CCCamp is in a place called "Finowfurt". is it just coincidence that this name almost contains the letters of "Fnord", and in the right sequence ?
<kristianpaul>
wolfspraul: if it's too expensive to travel, maybe give him access by ssh/vpn to a laptop connected to an M1 so some tests can be done remotely?
<wolfspraul>
ah no worries, we'll all work in parallel
<wolfspraul>
everybody cranking here ;-)
<kristianpaul>
hum, afaik still missing vga feedback for remote operation :(
<kristianpaul>
wpwrak:(Fnord) a good excuse to go to cccamp? :)