<wpwrak>
wolfspraul: good news: M1 arrived ! and it seems to behave :)
<wolfspraul>
nice!
<wolfspraul>
of course it behaves
<wolfspraul>
or you want to claim that we ship out untested goods? :-)
<wpwrak>
hehe :)
<wpwrak>
hmm, those bottoms ... i feel a strange urge to just mount a block of aluminium and mill a monolithic one
<wolfspraul>
bottoms?
<wolfspraul>
oh buttons?
<wpwrak>
er yes, buttons. wakeup not quite complete yet :)
<wolfspraul>
wakeup? wow
<wolfspraul>
my morning coffee just ready, one sec... (picking up from stove) :-)
<wolfspraul>
how can it be both my morning and your morning at the same time? strange :-)
<wolfspraul>
about test results: Adam finished 37 boards now, all fine
<wolfspraul>
only 0x4D stepped out of the line
<wolfspraul>
another 9 to go
<wpwrak>
wolfspraul: with fedex syncing me to day time and some nasty toothache the last two nights keeping me from sleeping (fixed today - dentists are amazingly efficient nowadays), my pattern is even crazier than ever :)
<wpwrak>
(boards) nice !
<wpwrak>
roh: basic button shape ~3 mm + 0.7 mm shaft followed by ~1.1 mm disc, thickness about 4.8 mm in total ?
<wpwrak>
roh: shaft diameter ... duh .. 300 mil ? disc diameter 1.2 mm ?
<wpwrak>
looks at a 100 x 150 x 5 mm Al plate from some misguided experiments at thermal distribution
<wpwrak>
well, fun for later. for now, buttons aren't convenient to have anyway
<wpwrak>
grmbl. X crash.
<wpwrak>
now .. a 12 V wall wart for the camera. hmm ...
<kristianpaul>
same from wrt should work..
<wpwrak>
hmm, no wrt supply at hand
<wpwrak>
must be hiding
<wpwrak>
i'll just try 9 V
<wpwrak>
hmm, video brightness seems to be quite hard to set
<wpwrak>
at least at night. maybe it's better with daylight
<kristianpaul>
already tried to increase the brightness with flickernoise?
<wpwrak>
yeah. but the range in which i get useful images is incredibly narrow
<kristianpaul>
and that's ccd, 'cause cmos was not that good
<kristianpaul>
zoom range?
<wpwrak>
zoom ?
<wpwrak>
it's the standard M1 cam. ain't no zoom :)
<kristianpaul>
no?
<kristianpaul>
i mean you can focus..
<kristianpaul>
ah yeah, i forgot :)
<wpwrak>
(focus) hmm, i can unscrew the lens. but that doesn't look like focus
<kristianpaul>
personally i felt confident at no more than 3 meters far from camera
<roh>
wpwrak: it should be 8mm diameter button caps
<wpwrak>
"explosive minds" may look even cooler with the direction reversed. now it's more like imploding :)
<roh>
the spacer (about 7.9mm diam) should be 0.5mm thick and the inner end cap (12mm diam) should be 1mm thick
<wpwrak>
kristianpaul: oh, i'm about 30-50 cm from the cam :)
<roh>
the button cap has (should have) the same thickness as the sidewall
<wpwrak>
so the spacer should have a smaller diameter than the cap ?
<kristianpaul>
wpwrak: wow, too near
<kristianpaul>
wpwrak: about to make video chat with lekernel ;)?
<wpwrak>
space constraints :)
<kristianpaul>
(glup)
<wpwrak>
short cable, limited length of arms, and unwillingness to rise to touch anything :)
<kristianpaul>
(short cable) oh, well i bought a 3m cable ;-D
<kristianpaul>
also a black cloth :)
<roh>
wpwrak: that difference is only to make it easier to glue without it standing over and hindering it from sliding in completely
<wpwrak>
roh: aah, i see. nice.
<wpwrak>
kristianpaul: (cloth) to hide behind ? :)
<kristianpaul>
lol
<kristianpaul>
no, i wanted to see if the quality of the video-in effects would improve
<wpwrak>
and, did it ?
<kristianpaul>
not to my own liking
<kristianpaul>
but i hold some comments to avoid recall the topic of this channel :)
<wpwrak>
lekernel: idea for future improvement: if no local display of some sort is added, maybe have a LED next to each input. turn it on if the patch is using that channel (video, audio, etc.). blink it if the patch is using the channel but the signal doesn't look right (e.g., no sync, too much black / too much white, etc.)
<wpwrak>
lekernel: regarding recompiling patches, does it actually need to do this for each setup change ? i.e., do the patches depend on setup items ? if not, you could just have a flag in RAM which patches you've already compiled. should be much easier to implement than a persistent cache that survives power cycling.
<wpwrak>
oh, and if you go multi-core, you could just compile patches in the background, while rendering ;-))
<kristianpaul>
;-)
<stekern>
wpwrak: the problem with the flag approach is: how do you know that the patch hasn't changed?
<wpwrak>
stekern: clear the flag when you overwrite/edit a patch
<stekern>
of course the 'flag' could be some crc/hash of the source, that might solve it
<wpwrak>
yes, or do a hash if you want to get fancy :)
<stekern>
wpwrak: yes, but what if the patch has been modified externally
<stekern>
(admittedly, I am not too familiar with how things work, is it possible that it would be externally modified?)
<wpwrak>
i don't think without you noticing. i.e., you'd still have to transfer it.
<stekern>
well, in that case, the flag approach might work
<stekern>
if the cpu time needed for calculating the hash vs compiling the patch is about the same, then there's no point in that
<wpwrak>
yeah. no idea how they compare
<stekern>
me neither ;)
<wpwrak>
just noticed that the M1 spends quite a bit of time compiling patches, even if all i do is go to the camera settings
<stekern>
I'm in larval stage, at the point where I've got the toolchain compiled and tested to run flickernoise in qemu
<kristianpaul>
thats good !
<kristianpaul>
since yesterday i've been trying to port the debian memtester package. i got it to compile, but only after dirtily commenting out the mmu related code..
<kristianpaul>
also some posix functions that rtems disliked (mlock and related..)
<kristianpaul>
i haven't tested it yet, i still need to hardcode some memory lengths..
<kristianpaul>
maybe you can take a look at the code, it's really hackish now... but it compiles ! ;)
<wpwrak>
kristianpaul: hmm, porting a memtester that tries to defeat virtual memory to a MMU-less system, and having to defeat the VM-dependent feature the program uses, somehow sounds wrong to me ;-)
<kristianpaul>
humm
<kristianpaul>
wpwrak: what do you suggest for a memtest/stress test?
<roh>
something simple and small which runs completely from sram and tests the complete dram?
<roh>
output/input via serial
<kristianpaul>
also according to the changelog mmap was for adding the feature of testing specific physical regions of memory
<kristianpaul>
my main concern about dram in M1 is possible corruption. it's just a *guess*, as i never understood well the DMA problem with the first minimac core
<stekern>
kristianpaul: as roh said, start out with something simple, like just writing all '0's and all '1's and see if they read back ok, do walking '0'/'1's and see if they read back ok
<stekern>
if those simple tests pass, then you can start looking into more complex algorithms
<stekern>
if they don't pass, you might have saved yourself some trouble :)
<kristianpaul>
good plan ;)
<stekern>
(but perhaps got yourself into the trouble figuring out why they don't pass)
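The simple first pass stekern describes (all '0's, all '1's, then walking '1's and walking '0's, each read back and compared) can be sketched as below. This is only an illustration in Python over a list standing in for DRAM; on the M1 it would presumably be a small program running from SRAM with serial output, as roh suggests, and that part is an assumption. The names `simple_memtest` and `StuckBitMemory` are hypothetical and not from any existing tool.

```python
def simple_memtest(mem, width=32):
    """Write each pattern to every word, read it back, and collect mismatches.

    `mem` is any indexable, writable sequence of words (a plain list here).
    Returns a list of (index, written, read_back) tuples; empty means pass.
    """
    mask = (1 << width) - 1
    patterns = [0, mask]                                  # all '0's, all '1's
    patterns += [1 << b for b in range(width)]            # walking '1'
    patterns += [mask ^ (1 << b) for b in range(width)]   # walking '0'
    errors = []
    for pattern in patterns:
        for i in range(len(mem)):
            mem[i] = pattern
        for i in range(len(mem)):
            value = mem[i]
            if value != pattern:
                errors.append((i, pattern, value))
    return errors


class StuckBitMemory(list):
    """Hypothetical faulty memory: one bit of one word always reads back as 0."""

    def __init__(self, size, bad_index, bad_bit):
        super().__init__([0] * size)
        self._bad_index = bad_index
        self._bad_mask = ~(1 << bad_bit)

    def __getitem__(self, i):
        value = super().__getitem__(i)
        if i == self._bad_index:
            return value & self._bad_mask   # simulate the stuck-at-0 bit
        return value


if __name__ == "__main__":
    print("good memory:", simple_memtest([0] * 16))
    bad = simple_memtest(StuckBitMemory(16, bad_index=5, bad_bit=3))
    print("faulty memory:", len(bad), "mismatches at words", {i for i, _, _ in bad})
```

A pass here only says the data lines and cells hold these patterns; address-line faults and timing problems would need the more complex algorithms mentioned above.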
<lekernel>
I think hashing a patch will be much faster than compiling it
<lekernel>
and yes patches can be modified externally, via FTP, shell, file manager, etc.
<lekernel>
afaik there's no "file modified" notification API in RTEMS like there is in Linux, and given how badly the RTEMS filesystem is designed I'd rather not touch it
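Since patches can change behind the application's back and there is no change notification to rely on, the hash variant of the "flag in RAM" idea could look roughly like this. A minimal sketch, assuming a hypothetical `compile_fn` callback standing in for the real patch compiler; `PatchCache` and its methods are made-up names, not Flickernoise code, and the cache deliberately lives in RAM only, so it does not survive power cycling.

```python
import hashlib


class PatchCache:
    """In-RAM cache of compiled patches, keyed by a hash of the patch source."""

    def __init__(self, compile_fn):
        self._compile = compile_fn   # hypothetical: source text -> compiled patch
        self._entries = {}           # patch name -> (source digest, compiled patch)
        self.compilations = 0        # counts real compiler runs

    def get(self, name, source):
        """Return the compiled patch, recompiling only if the source changed."""
        digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
        cached = self._entries.get(name)
        if cached is not None and cached[0] == digest:
            return cached[1]
        compiled = self._compile(source)
        self.compilations += 1
        self._entries[name] = (digest, compiled)
        return compiled


if __name__ == "__main__":
    cache = PatchCache(lambda src: src.upper())   # stand-in "compiler"
    cache.get("demo.fnp", "zoom=1.01")
    cache.get("demo.fnp", "zoom=1.01")            # unchanged: served from cache
    cache.get("demo.fnp", "zoom=1.02")            # edited externally: recompiled
    print("compiler runs:", cache.compilations)
```

The source still has to be read and hashed on every lookup, so this only pays off if hashing is much cheaper than compiling, which is what lekernel expects.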
<lekernel>
kristianpaul, I have done tons of SDRAM tests, check the archives
<wolfspraul>
wpwrak: you up?
<lekernel>
wolfspraul, hi
<wolfspraul>
hi
<lekernel>
any prospect regarding when the first boards are shipped?
<lekernel>
yes, there seems to be fully working ones. but how about packaging them and selling them?
<lekernel>
sorry to be insistent, but so many things are depending on that...
<wolfspraul>
it doesn't worry you that 2 boards that worked perfectly stopped working after a little bit of rendering?
<wolfspraul>
it's your brand. you think we can ship products that are known to fail after a few times rendering?
<wolfspraul>
here's the plan: Adam is currently dumping the nor of those two
<wolfspraul>
overall the test results look really good now
<wolfspraul>
but I would like to have at least a theory for what happened on those 2 boards
<wolfspraul>
can you rule out that the bad reset ic we chose causes nor corruption on power down?
<lekernel>
maybe it's just the same thing that happened to the video chips on the RC1 boards Adam reworked and sent me
<wolfspraul>
I have an idea
<wolfspraul>
why don't we just erase all trace of those two boards, 0x4C and 0x7C, from the production and testing plans, and sell the rest as if everything was always perfect?
<wolfspraul>
:-)
<wolfspraul>
kidding...
<wolfspraul>
when Adam is here we ask him about the solder he used
<wolfspraul>
how would that explain a board that first works and then fails?
<wolfspraul>
a whisker - where? which chip?
<wolfspraul>
and it shows up after a few render cycles?
<wolfspraul>
are you trying to find a theory that can explain what we find, or are you trying to find a theory that will allow you to sell the remaining boards with a straight face?
<wolfspraul>
so the best would be if we can come up with a quick test to identify boards that will later fail
<wolfspraul>
the worst would be if we find that the wrong reset ic we have causes nor corruptions
<wolfspraul>
we can also close our eyes really hard and just sell the stuff even though we cannot produce it at a consistent quality
<wolfspraul>
I think that's suicidal for the Milkymist brand in the long run though.
<wolfspraul>
aw: hey Adam :-)
<wolfspraul>
congratulations on finishing the reworks of another 47 boards!
<lekernel>
perfectionism is suicidal too, because you can't get anything done in the end
<wolfspraul>
we have a question for you: which solder are you using for the reworks?
<aw>
wpwrak, have you settled down on your board? ;-)
<wolfspraul>
lekernel: oh that's why I'm asking you, it's your brand. Please think about the test results carefully.
<wolfspraul>
I am every bit aware that perfect is the enemy of good.
<wolfspraul>
but boards that fail after successful rendering worry me, that's all.
<wolfspraul>
in the companies I've worked so far (all Western brand companies), something like this would not ship.
<wolfspraul>
a Chinese company would long have started shipping, of course
<aw>
leaded solder is used for the reworks
<wolfspraul>
lekernel: does that settle your whisker theory?
<lekernel>
aw, and what solder did you use for the two video chips you reworked on rc1?
<wolfspraul>
aw: have you dumped some nor partitions from 0x4C and 0x7C ?
<lekernel>
the ones that failed
<aw>
lekernel, the same leaded solder i'm using currently, it's on a reel. same as for rc1
<lekernel>
maybe wrong power-down ramps, as Werner suggested
<wolfspraul>
what is the chance in your opinion that the power-down ramps cause this zero word?
<xiangfu>
4c: is only one bit. but 7c is more.
<lekernel>
if that's the case, locking certainly would help restrict the incidence of the problem to the unlocked partitions, which means the board would always be able to boot in rescue mode
<wolfspraul>
from 7 to 0 is 3 bits, no?
<wolfspraul>
it may even make it go away entirely if this mostly affects small addresses (IF)
<xiangfu>
oh. sorry. yes 3bits. :(
<lekernel>
and it seems the standby bitstream is affected more often (since when the board fails, it's usually no reconfiguration at all and not other issues) so there's some chance locking would make the problem disappear entirely
<wolfspraul>
I'm willing to accept all sorts of theories and support the product, but I need to do it with a straight face, i.e. after doing my best to understand and mitigate the problem.
<wolfspraul>
ok, so let's get the locking done first
<wolfspraul>
could the writing of zeroes also come from a software bug?
<lekernel>
yes, maybe
<wolfspraul>
that'd be best for me :-)
<lekernel>
actually the whole flash corruption could come from software bugs
<wolfspraul>
until Werner is back and requests some tests, Adam & Xiangfu will get the locking setup
<wolfspraul>
then we lock the rescue partitions on some boards (let's say 10), and do some render cycling
<wolfspraul>
xiangfu: are you on this with Adam?
<wolfspraul>
also make a short script that will let adam lock the partitions of an existing board without reflashing...
<wolfspraul>
lock-only.sh
<xiangfu>
wolfspraul, yes. we are talking
<wolfspraul>
great, thanks
<xiangfu>
wolfspraul, (lock-only.sh) sound good.
<wolfspraul>
lekernel: I think so far all zero words I've seen are at low addresses
<wolfspraul>
even within the 640 KB standby bitstream
<wolfspraul>
but I haven't paid close attention to all cases we had, Werner knows them all
<wolfspraul>
0x7C is not an entire word. offset 0x1EC: from 44 0C -> 40 00
<wolfspraul>
one bit remaining :-)
<wolfspraul>
a low offset again
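The bit counts can be checked mechanically. NOR flash programming can only turn 1 bits into 0 bits (only an erase brings them back to 1), so a corruption that clears bits and sets none is at least consistent with a stray program operation. A small sketch, treating the two bytes at offset 0x1EC as one 16-bit value; `bit_changes` is a hypothetical helper name.

```python
def bit_changes(before, after):
    """Return (bits cleared 1->0, bits set 0->1, bits still set afterwards)."""
    cleared = before & ~after
    newly_set = after & ~before
    return bin(cleared).count("1"), bin(newly_set).count("1"), bin(after).count("1")


if __name__ == "__main__":
    cleared, newly_set, remaining = bit_changes(0x440C, 0x4000)
    print("cleared:", cleared, "set:", newly_set, "remaining:", remaining)
```

For 44 0C -> 40 00 this reports three cleared bits, none set, and one bit remaining, matching what is said above.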
<wolfspraul>
if it's a power ramp down problem, why would only small addresses be affected?
<wolfspraul>
is it easier for a wire to be 0 than to be 1?
<lekernel>
it's also interesting to notice that the two corruption events occurred at different addresses but with very similar content
<lekernel>
hmm, no, actually the whole beginning of the bitstream contains an almost periodic pattern
<wolfspraul>
in the second case one bit remains
<wolfspraul>
7c: offset 0x1EC from 44 0C -> 40 00
<wolfspraul>
well, all I've seen was at low addresses
<wolfspraul>
so if we are lucky, a locking of the standby bitstream will make the problem go away entirely
<wolfspraul>
although if it's power ramp-down cause, who knows maybe the locking will not work? :-)
<wolfspraul>
lekernel: if it's a ramp-down problem, is there a theory that suggests that low addresses are more likely to get hit than higher ones?
<wolfspraul>
is it more likely that an address line is 0 than 1?
<lekernel>
the power ramp down theory is that underpowering the FPGA while the flash is still running causes the FPGA to put out incorrect signals that are interpreted as valid writes by the flash
<wolfspraul>
it sounds very far fetched to me
<lekernel>
locking makes the accepted write sequence a lot more complex
<wolfspraul>
but I know too little about the signals between fpga and nor and how likely this is to happen
<wolfspraul>
well ok. we definitely try locking.
<wolfspraul>
eventually luck has to be with us
<lekernel>
he
<lekernel>
that flash does receive write commands
<wolfspraul>
if it's a software bug, that's also ok
<wolfspraul>
eventually we'll hunt it down, or at least defuse it first with locking etc.
<lekernel>
the 3.3V supply is correct, so if the flash gets written, then it has received a proper write command
<lekernel>
unless the flash chips are counterfeit/crappy, but you do not think this is true
<wolfspraul>
sounds pretty unlikely to me in an uncontrolled ramp-down
<wolfspraul>
no no
<wolfspraul>
get your mind off of that, that's a mental trap
<lekernel>
what can cause write commands are:
<lekernel>
* incorrect signals during power up/down - the reset IC was supposed to prevent that by holding the reset during those events. it does it during power up, but the power down case is less clear as Werner pointed out
<lekernel>
* software bugs
<lekernel>
* FPGA configuration system going mad
<lekernel>
by making the accepted write sequence way more complex, locking would probably rid us of the symptoms of any of those problems
<wolfspraul>
if in addition for whatever reason this happens only on low addresses, we are all set
<wolfspraul>
our users will never experience the downside of the bandaid we use to keep the product working -> perfect solution
<wolfspraul>
if it also happens on higher addresses, we may still decide to ship, because the event is rare and will 'only' trigger the need for a web update
<wolfspraul>
(assuming the rescue path and web update actually work, which I assume now)
<wolfspraul>
basically in 470 render cycles (30 seconds each), we had this happen twice
<wolfspraul>
the numbers are a little low, but it seems to be in about 1 out of 200 render cycles
<wolfspraul>
[numbers low] I mean our statistical data is limited to really say 1/200
<wolfspraul>
but something like that
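To get a feel for how little a short clean run proves at this failure rate, one can ask how often a test of N cycles would show zero corruptions even if nothing had been fixed. A rough sketch, assuming every render cycle fails independently with the same probability (a strong assumption that the data so far cannot confirm); `prob_no_failure` is a hypothetical helper.

```python
def prob_no_failure(failures_seen, cycles_seen, test_cycles):
    """Chance of seeing zero failures in `test_cycles` if the rate is unchanged."""
    p = failures_seen / cycles_seen          # observed per-cycle failure rate
    return (1.0 - p) ** test_cycles


if __name__ == "__main__":
    for n in (100, 200, 1000):
        print(n, "cycles:", round(prob_no_failure(2, 470, n), 3))
```

With 2 failures in 470 cycles, a clean run of 200 cycles would still happen a bit over 40% of the time by chance alone, while a clean run of 1000 cycles would be quite unlikely.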
<wolfspraul>
xiangfu: can you also update Adam's flterm to the latest version?
<wolfspraul>
let's just get both flterm and urjtag updated
<xiangfu>
wolfspraul, yes. already done that.
<wolfspraul>
perfect
<aw>
xiangfu, thanks for your instructions, now my jtag is new
<wolfspraul>
xiangfu: everything updated?
<xiangfu>
wolfspraul, we just finished the update. now I'm finishing the small lock_only.sh
<wolfspraul>
wow, great
<wolfspraul>
ok good
<wolfspraul>
aw: here is what I propose
<wolfspraul>
1. xiangfu writes a little lock_only.sh script that you can use to lock the partitions of already flashed good boards
<wolfspraul>
2. I think we can reflash 0x4C and 0x7C and see whether they boot again
<wolfspraul>
3. we pick 10 boards, 0x4C and 0x7C and 8 others, and run lock_only.sh on them
<wolfspraul>
4. then we do 10 render cycles on those 10 boards
<GitHub173>
[milkymist] sbourdeauducq pushed 1 new commit to master: http://git.io/2_lHRQ
<GitHub173>
[milkymist/master] flterm: add check if c is 0x00 - Xiangfu Liu
<xiangfu>
thanks lekernel
<wolfspraul>
well, that's only 100 render cycles, so maybe not enough
<wolfspraul>
aw: do you think we should reflash 0x4C and 0x7C ?
<wolfspraul>
until Werner is back, I have no reason for any measurements now. just want to reflash them (including locking)
<wolfspraul>
should we do that?
<aw>
wolfspraul, yes, i think before locking the flash, we can reflash 0x4c and 0x7c first
<wolfspraul>
yes, let's reflash both and see whether they boot to render
<wolfspraul>
first step
<aw>
BUT, our data base is no bigger even with 10 power cycles each. my question is:
<wolfspraul>
maybe we should buy a programmable power supply :-)
<wolfspraul>
then we still have a problem how to press the middle button automatically
<wolfspraul>
we don't have this now
<aw>
if these 10 boards go 10 times each through the flash locking test and, say, NO error happens, can we trust that and say this step is safe?
<wolfspraul>
good question
<wolfspraul>
from your tests, it seems we need about 200 cycles for 1 failure
<wolfspraul>
but let's do step by step, not speculate too much
<aw>
yupp...
<wolfspraul>
let's reflash 4C and 7C and see whether they boot
<xiangfu>
aw, let me test first. ..
<wolfspraul>
then we lock
<aw>
i don't quite think that we should pick 10 boards first
<wolfspraul>
then we think :-)
<wolfspraul>
agree
<wolfspraul>
first step: reflash 4C and 7C, see whether they boot
<aw>
how about we just use 0x4c and 0x7c and do 100 test cycles on each after reflash and lock?
<wolfspraul>
yes, why not. good idea.
<aw>
that's total 200 times
<wolfspraul>
but let's reflash first and see whether they boot :-)
<aw>
yup
<wolfspraul>
I have seen too many surprises, don't want to speculate too much.
<GitHub130>
[scripts] xiangfu pushed 2 new commits to master: http://git.io/V6b2WA
<GitHub130>
[scripts/master] compile-lm32-rtems: add clean-rtems for easy re-build rtems - Xiangfu Liu
<GitHub130>
[scripts/master] scripts: lockflash only script file - Xiangfu Liu
<aw>
sorry that we do this firstly even if Werner say later we were wrong
<wolfspraul>
then we just speculate speculate, and then the test results don't come out as expected -> time wasted speculating :-)
<aw>
;-)
<wolfspraul>
well
<wolfspraul>
we should move forward
<wolfspraul>
it cannot be so totally wrong :-)
<wolfspraul>
btw, I am online for about 1h, then I need to go to some club opening to demo m1
<wolfspraul>
so if I'm offline later, just fyi
<aw>
i meant that missed some good chance to find...well
<wolfspraul>
xiangfu: I even think the reflash_m1.sh original should enable locking by default
<wolfspraul>
it almost becomes part of the m1 design/architecture :-)
<wolfspraul>
locking only the standby and rescue partitions, but that should be enabled by default
<aw>
xiangfu, okay
<wolfspraul>
imho
<xiangfu>
wolfspraul, yes. agree.
<aw>
xiangfu, the difference between 'lockflash_only_m1_rc3.sh' and 'reflash_m1_rc3.sh' is just that one only locks and the other reflashes too?
<xiangfu>
aw, have you updated your local version of reflash_m1.sh?
<aw>
not yet...change now...my one line cmd is that with log function you gave me before. ;-)
<aw>
xiangfu, i.e.: ./reflash_m1_rc3.sh $1 $2 2>&1 | tee -a log/urjtag_$2.log
<xiangfu>
aw, ok
<xiangfu>
you'd better delete the old reflash_m1.sh, to avoid confusion.
<wolfspraul>
disconnected
<wolfspraul>
xiangfu: why do we have a separate reflash_m1_rc3.sh ? can we have just one m1 reflash script?
<xiangfu>
wolfspraul, no. it's just the name in my repo.
<aw>
wolfspraul, no need though
<aw>
wolfspraul, sometimes is managed on my site i think...
<aw>
xiangfu, btw, i rename log file name as: ./reflash_m1_rc3.sh $1 $2 2>&1 | tee -a log/urjtag_lock_$2.log
<aw>
alright..now to reflash/lock those two.
<wolfspraul>
yes good
<wolfspraul>
xiangfu: name? don't understand. well. the name says _rc3 and that is hopefully temporary. there should be only one m1 reflash script.
<wolfspraul>
if we need multiple variants, there should be options (command line parameters)
<wolfspraul>
I didn't even look inside the script, just saying from the name - this will cause confusion, guaranteed.
<xiangfu>
wolfspraul, yes. I know. just don't have time to merge them. we have 'snapshots' and 'updates', with different URLs and different ways to generate the bios.bin file.
<wolfspraul>
so there should be only 1 script
<wolfspraul>
the script should have a version number right at the beginning in some variable, maybe just the date it was last edited
<wolfspraul>
so when someone has the script locally, they can quickly check whether they have the latest version
<xiangfu>
maybe I can do that this weekend :)
<wolfspraul>
ok
<xiangfu>
wolfspraul, (version) yes. should be already in adam's log file
<wpwrak>
good morning ! :) catching up and replenishing my caffeine store
<wolfspraul>
well I'm sure there are reasons for the different scripts, it's all work.
<wolfspraul>
just remember to fix it at some point (merge) - this will GUARANTEED create confusions
<wolfspraul>
even among ourselves :-)
<wolfspraul>
you will see :-)
<wolfspraul>
so if we don't merge them, we pay the price in a different way
<wolfspraul>
but sort it in with your other priorities, you have overview...
<wolfspraul>
he
<wolfspraul>
I'm already with the first evening beer :-)
<wolfspraul>
gotta get ready for the club opening...
<wolfspraul>
wpwrak: have you seen any nor corruptions at higher addresses?
<wolfspraul>
(after you caught up...)
<aw>
0x4c reflash and lock okay, 0x7c is not...wait..upload log...
<xiangfu>
after cleanup the reflash_m1.sh will send email to list. I am already lazy on this task :)
<wolfspraul>
"7c is not" - bah
<wolfspraul>
:-)
<wolfspraul>
wpwrak: what's your take on the new 4C and 7C findings?
<wolfspraul>
curious about the log update and why 7C did not reflash...
<wolfspraul>
we are hoping that locking will safely eliminate this problem
<wpwrak>
both have a single-word corruption. so a reflash should fix them.
<xiangfu>
UrJtag 0011<tab>xc6slx45<tab>3 sent out
<wolfspraul>
aw: let's do 100 each first
<wolfspraul>
sorry that we don't have this better automated right now
<aw>
4C rendered
<wolfspraul>
wpwrak: do you have any other ideas? do you agree with the approach to reflash 4c/7C (already done), and then 100 thirty-second render cycles on each?
<xiangfu>
aw, you have to unplug the power cable for reboot right?
<aw>
xiangfu, yes
<wolfspraul>
xiangfu: some automation thoughts
<xiangfu>
aw, ok. there is a command in 'flterm' that can reboot m1, but anyway we cannot use that command in our case
<wolfspraul>
first - we are not sure which exact sequence triggers the problem
<wolfspraul>
for example whether a soft-reboot is enough
<wolfspraul>
so to be safe, we do a cold power cycling right now (unplug dc jack)
<wolfspraul>
simply because that's how we always tested so far
<wpwrak>
a bias towards small numbers is common in real life. so that may not mean much. particularly if it's a sw bug :)
<xiangfu>
wolfspraul, yes.
<wolfspraul>
we don't really have any comparison data for cutting power at the mains, or for soft reboots
<aw>
i think that no way that I have to simulate a real power on and off action. ;-)
<wolfspraul>
aw: we know too little now
<wolfspraul>
and we just want to start selling :-)
<wolfspraul>
so it's difficult
<wolfspraul>
we need your help in manual testing, because that's how we tested so far
<wolfspraul>
and we cannot get a better automation understood and setup fast
<wolfspraul>
xiangfu: the next problem is the middle button, which needs to be pressed
<wolfspraul>
in the future we would use programmable power supplies, but they can only simulate certain types of power cycling
<wolfspraul>
they cannot simulate the user unplugging the DC jack with his hands (potentially even causing effects simply from touching the metal...)
<aw>
wolfspraul, ha...sorry that you would misunderstand my last sentence. sorry. i meant that I have to manually power on and off to simulate. ;-)
<wolfspraul>
and we will always run into the middle button press as well
<wolfspraul>
well, we will try to improve some of those things, but it will take time
<aw>
no complain at all. ;-)
<wolfspraul>
wpwrak: ok, so you are good with the 2*100 cycles test?
<wpwrak>
(making the sequence more complex) the unlocking would also be an uncommon code path. so if it's a sw bug of just using the wrong address somewhere, you'd never hit this
<wolfspraul>
let's just see what we get
<wolfspraul>
then we move from there
<wolfspraul>
I gotta run to the club...
<wolfspraul>
aw: see you tomorrow or Monday. I think we are close :-)
<wolfspraul>
thanks for all the hard work!
<wolfspraul>
l8... will read the backlog...
<wolfspraul>
good luck!
<aw>
wait...so
<aw>
so agree to test 200 times?
<wolfspraul>
I do
<aw>
wpwrak, agreed?
<wolfspraul>
then you just follow what Werner agrees with too :-)
<aw>
he...okay ;-) you go first. ;-)
<wolfspraul>
plus you will probably need dinner at some time first :-) and it's Friday evening!
<wolfspraul>
we are close I think
<wolfspraul>
maybe the locking is the final nail
<wolfspraul>
I certainly hope so
<wpwrak>
reflashing 0x4c and 0x7c sounds good to me. the single-word corruption we've already seen a few times doesn't look related to what happens in 0x3c/0x77. which is good news. it means that no new boards have joined the "something very very wrong but we don't quite know what" cluster.
<wolfspraul>
but we have to see the real data, what can we do
<wolfspraul>
wpwrak: yes
<wolfspraul>
and the addresses are all small, even in the 640 kb block we look at
<wolfspraul>
so there's a good chance wherever this comes from, it will never hit anything past the standby bitstream
<wolfspraul>
all wishful thinking of course...
<aw>
alright...so after dinner. I'll go for test 200 times.
<wolfspraul>
great
<wolfspraul>
and I will read the backlog later :-)
<wolfspraul>
aw: THANK YOU!
<wolfspraul>
thanks so much for the great energy and passion
<wolfspraul>
almost there!
<aw>
alright...no problems, i ought to.
<wpwrak>
the single NOR word corruption cluster may be: 1) fixed 100% by locking (unlikely, imho); 2) fixable in the field; 3) not fixable in the field but with a not too hard recovery path, so people can work around the issue; 4) point to a NOR defect (unlikely, imho)
<wolfspraul>
I think locking stands a good chance
<wolfspraul>
unless the problem just bypasses locking entirely
<wolfspraul>
what do you mean with "fixable in the field"?
<wolfspraul>
anyway gotta run
<wolfspraul>
backlog
<wolfspraul>
l8
<lekernel>
wpwrak, how can locking not fix the standby bitstream problem?
<wpwrak>
if it's a bad NOR cell (and not a rogue write), it may still lose data later
<wpwrak>
also, we may hit other addresses, which could still render the M1 unusable (that is, without human intervention)
<wpwrak>
i'm thinking of the VJ at club scenario: you plug it in and it doesn't start flickernoise, or comes up with a friendly message telling you to fix your bitstream or whatever. the crowd cheers, the VJ gets nervous :)
<lekernel>
rendering is possible in rescue mode
<wpwrak>
now, if we can properly protect standby and recovery, which i hope and expect we can, it's not insanely difficult to bring the system back to life after such a mishap
<wpwrak>
you could still lack new FN features, or your patches themselves may get corrupted
<wpwrak>
but yes, we can make recovery from NOR trouble relatively benign, even if it's not possible to prevent it from occurring in the first place
<wpwrak>
also, the users could simply be instructed to plan to have a few minutes before the show to deal with any potential NOR problem. plus, don't power cycle during the show.
<wpwrak>
not nice, but it would reduce the impact of the issue further
<wpwrak>
now, for testing what's really going on. i'd suggest to do the current power cycling test at least 1000 times and until the corruption has happened at least 10 times, i.e., whichever comes last.
<lekernel>
can you name a single technology device these days that has none of such problems?
<wpwrak>
each time a corruption is found, record the location of the corruption and fix, then continue
<lekernel>
even those overrated apple macbooks suffer display problems because of poor BGA soldering
<lekernel>
even with all the money and resources apple has, they failed to fix it in the first place
<scrts2>
power cycling at least 1000 times... :D
<wpwrak>
after this, do the same test, but with a soft reset. that avoids the power drop. if the corruptions magically go away, we know it's a power up/down issue. if they don't, it's software, FPGA logic, NOR itself, EMI, etc.
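The stop condition wpwrak proposes (at least 1000 cycles and at least 10 corruptions, whichever comes last, logging and fixing each corruption before continuing) is easy to get wrong when scripted, so here is a minimal sketch of just that loop. Everything hardware-related is hidden behind a hypothetical `do_cycle` callback, which is assumed to power cycle or soft reset the board, render, verify the NOR, repair any corruption, and return the corrupted offset or `None`; the `max_cycles` safety limit is an addition of this sketch.

```python
def run_cycle_test(do_cycle, min_cycles=1000, min_corruptions=10, max_cycles=None):
    """Run test cycles until both minimums are reached; return (cycles, log).

    `log` holds one (cycle number, corrupted offset) entry per corruption.
    """
    log = []
    cycles = 0
    while cycles < min_cycles or len(log) < min_corruptions:
        if max_cycles is not None and cycles >= max_cycles:
            break                      # safety limit, not part of the original plan
        cycles += 1
        offset = do_cycle()            # hypothetical: None if the NOR verified OK
        if offset is not None:
            log.append((cycles, offset))
    return cycles, log


if __name__ == "__main__":
    counter = {"n": 0}

    def fake_cycle():
        """Stand-in for real hardware: corrupts offset 0x1EC every 200th cycle."""
        counter["n"] += 1
        return 0x1EC if counter["n"] % 200 == 0 else None

    cycles, log = run_cycle_test(fake_cycle)
    print(cycles, "cycles,", len(log), "corruptions")
```

The same loop can then be rerun with a `do_cycle` that uses a soft reset instead of cutting power, so the two logs can be compared directly.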
<scrts2>
I wonder who wouldn't bother doing this
<lekernel>
so i'm more than willing to accept a little incidence of NOR corruption in unlocked partitions here
<wpwrak>
scrts: you're saying aw will run screaming to the other end of taiwan when he reads this ? ;-)
<lekernel>
of course we should fix it, but we should balance it against the massive delays a perfect solution would cause
<wpwrak>
lekernel: some products do in fact much worse, e.g., recently, it was in the news that Intel SSDs are losing data quite predictably. they fixed one path via a firmware upgrade and are still guessing about another one. that much about the power of big corps :)
<wpwrak>
my hope is that it won't take all that long
<wpwrak>
if it's a general problem, each of us should be able to reproduce it
<wpwrak>
so the question is simply who manages to automate the test first :)
<wpwrak>
btw, any magic key combination to switch rendering to 1024x768 ?
<wpwrak>
btw2, it may be cool to have some patch that has a camera reaction in an augmented reality way. e.g., show the camera input; overlay it with white blocks in some area; sample the camera image "behind" these white blocks; if there's a sudden brightness/color change of a large number of pixels, let the block "explode"
<wpwrak>
that may motivate people to experiment with interactive effects, which i think could be very cool. alas, if you don't show the way in a simple example as the one i've described, it will take much much longer before someone gets motivated enough to try.
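The detection step of this idea (sample the pixels "behind" a block and trigger when many of them change suddenly) can be sketched independently of how patches are actually written. The snippet below is plain Python over lists of brightness values, not Flickernoise patch code; `block_triggered` and both threshold values are assumptions chosen for illustration.

```python
def block_triggered(prev_pixels, curr_pixels, pixel_threshold=40, min_fraction=0.5):
    """Return True if enough pixels behind a block changed brightness sharply.

    `prev_pixels` and `curr_pixels` are equal-length lists of 0..255 brightness
    samples taken from the same block area in two consecutive camera frames.
    """
    if len(prev_pixels) != len(curr_pixels) or not curr_pixels:
        raise ValueError("need two non-empty sample lists of equal length")
    changed = sum(
        1 for old, new in zip(prev_pixels, curr_pixels)
        if abs(new - old) > pixel_threshold
    )
    return changed >= min_fraction * len(curr_pixels)


if __name__ == "__main__":
    still = [10] * 100
    hand_in_front = [200] * 60 + [10] * 40    # most of the block got brighter
    print(block_triggered(still, [12] * 100))
    print(block_triggered(still, hand_in_front))
```

Requiring a large fraction of pixels to change, rather than any single pixel, is what keeps camera noise from "exploding" the blocks on its own.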
<scrts2>
I did not read the problem, but I suppose the device hangs up?
<lekernel>
nah, rendering in 1024x768 is only supported on git head with the demo firmware (not FN)
<lekernel>
it's slow too (~7-12 fps)
<lekernel>
and buggy
<wpwrak>
lekernel: (1024x768) :-( any hope to be able to get it to work ? your earlier experiments sounded encouraging
<lekernel>
maybe by doubling the SDRAM frequency
<wpwrak>
ah, and can midi control adjust audio sensitivity and maybe camera brightness ? these two often seem to need some tweaking
<wpwrak>
(double sdram) sounds scary :)
<lekernel>
there are already Fx keys to adjust camera brightness and contrast
<wpwrak>
oh, cool
<lekernel>
(sdram) yeah, i'll probably feel motivated to do that if/when this project becomes popular
<wpwrak>
(sdram) nice :)
<wpwrak>
btw, i think a tutorial mode would be nice. the current default of going as quickly as possible into "show" mode doesn't really seem to fit what most people will expect. e.g., first you want to explore, getting all the feedback and guidance you can. only once you're familiar with the system, you'd turn off those things. of course, someone would have to program this ...
<wpwrak>
(at least it's not scary verilog ;-)
<wpwrak>
aw: how did the 100 cycles go ? ;)
<wpwrak>
or was it 200 ? :)
<aw>
wpwrak, hi sorry, i just started .;-)
<kristianpaul>
lekernel: cross talk?
<lekernel>
yes
<adamw_>
0x4c: 10th power-cycle pass
<wpwrak>
"Shift" by Geiss is really cool
<adamw_>
wpwrak, i bought a relay card with Christopher in om to do tons of tests via auto tests with programmable power supply and multimeters (GPIB)
<adamw_>
with that way can verify many things. ;-)
<wpwrak>
adamw_: you still have them ?
<wpwrak>
adamw_: oh, and what multimeter do you have ?
<wpwrak>
heh, conduirebourre ;-) best camera effect, i think
<adamw_>
at that time we used Keithley 2303 and Agilent 34401A, 16 channels relay card through GPIB
<wpwrak>
adamw_: and what do you have now ?
<adamw_>
wpwrak,  now i have 34401A
<adamw_>
no programmable power supply. :(
<wpwrak>
ah, okay. do you have GPIB to the PC ?
<adamw_>
need buy one. ;-)
<adamw_>
so you want me to capture NOR corruption as it happens while auto measure current. ;-)
<adamw_>
well...hope we don't do this. then solve, but as a lab site with auto equipments is good. ;-)
<wpwrak>
naw, just thinking ahead
<wpwrak>
yes, automation is good. very good :)
<adamw_>
we probably will go for this auto... ;-)
<adamw_>
20th
<adamw_>
i even do think 100 times is not enough. ;-) you know that we can't five up any reasons caused especially that it's not a probability distribution.
<wpwrak>
yeah, my guess would be more like 1000
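
A rough way to see why 100 clean cycles says little and 1000 says more, under the loud assumption that each power cycle fails independently with the same probability (adamw_ doubts exactly this above, so treat it as a best case): n clean cycles in a row happen with probability (1 - p)^n. The function names are hypothetical.

```python
import math


def cycles_needed(max_failure_rate, confidence=0.95):
    """Clean power cycles needed to claim failure rate < max_failure_rate.

    Assumes independent cycles with a constant failure probability.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_failure_rate))


def failure_rate_bound(clean_cycles, confidence=0.95):
    """Largest per-cycle failure rate still compatible with seeing no failures."""
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_cycles)
```

Under that assumption, 100 clean cycles only bound the failure rate to about 3% per cycle at 95% confidence, while 1000 clean cycles bring the bound down to about 0.3%.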
<adamw_>
the single NOR word corruption cluster may be: 1) fixed 100% by locking (unlikely, imho); 2) fixable in the field; 3) not fixable in the field but with a not too hard recovery path, so people can work around the issue; 4) point to a NOR defect (unlikely, imho)
<adamw_>
you just posted those four candidates. ;-)
<adamw_>
s/five/give
<wpwrak>
yup. by the way, do you run the CRC check or just see if standby loads ?
<adamw_>
process of boot to rendering with power-cycle
<adamw_>
NO CRC check
<adamw_>
that'd be long period...;-)
<wpwrak>
heh ;-)
<adamw_>
wpwrak, btw, how do you think that boards were failed in CRC test?
<adamw_>
wpwrak, since one board I caught it and re-performed CRC test without power off then just pass, how to explain this?
<adamw_>
that was 0x85: got "flickernoise.fbi(rescue)(CRC)CRC failed(expected aa12a56a, got b0c6b06d)" and "splash.raw CRC failed(expected 978f860c, got 33d3152a)" while using test program 10. keep performing CRC test again, then pass without power-cycle. 11. rendering and CRC test pass
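
The 0x85 observation (CRC fails once, then passes on a re-read without a power cycle) separates a flaky read from corrupted flash contents. A minimal sketch of that distinction follows. It assumes the image CRC is the common CRC-32 that `zlib.crc32` computes, which the log does not confirm, and `read_image` is a hypothetical callable returning the bytes read back from NOR.

```python
import zlib


def classify_crc(read_image, expected_crc, attempts=3):
    """Read the image several times and compare each CRC with the expected value.

    Returns "ok" if every read matches, "transient read error" if only some
    reads match (contents are fine, the read path is not), and
    "persistent mismatch" if no read matches (contents likely corrupted).
    """
    results = []
    for _ in range(attempts):
        crc = zlib.crc32(read_image()) & 0xFFFFFFFF
        results.append(crc == expected_crc)
    if all(results):
        return "ok"
    if any(results):
        return "transient read error"
    return "persistent mismatch"
```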
<adamw_>
30th
<wpwrak>
hmm, 0x85 sounds like one of those NOR bus problem boards then
<wpwrak>
may be similar to 0x3c and 0x77. or maybe the NOR bus problems (without the "pulses" on PROGRAM_B) are something else
<adamw_>
40th
<wpwrak>
this is a touch one
<wpwrak>
s/touch/tough/
<wpwrak>
lekernel: your USB stack can't be all that bad - it managed to find the first device (the keyboard) in this little mess: http://pastebin.com/p1ymfXL7
<lekernel>
what is sad here is you need to go through all that crap just to receive stupid keystrokes
<adamw_>
i gotta go and 0x7c will be the next one. cool. ;-)
<wpwrak>
grr, vanished
<wpwrak>
would have been nice to get a CRC check at the end
<wolfspraul>
wpwrak: ok, 100 tests on 0x4C succeeded - good sign
<wolfspraul>
until I see evidence against it, I am assuming/hoping the locking fixes the bug ;-)
<lekernel>
it rather fixes the symptom, but that's good enough for now
<wpwrak>
ah, that was with locking ?
<lekernel>
er... hopefully
<wpwrak>
yeah ;-)
<wpwrak>
hmm, there seems to be another issue with external connections. connected line in to my stereo (had used the battery-powered kaossilator before). then it stopped responding to audio. even when i connected back to the kaos.
<lekernel>
this totally sucks
<wpwrak>
power-cycled. everything okay again. connected stereo again. M1 froze (wouldn't get to the desktop with a mouse click)
<lekernel>
there's another FB between analog and digital ground, maybe that's the same problem as on the video in
<wpwrak>
power-cycled. still no reaction to the stereo.
<wpwrak>
hehe ;-)
<lekernel>
FYI, audio chip failure when rendering would freeze the software
<wpwrak>
went back to kaossilator. audio dead. power-cycling ...
<lekernel>
those run3 boards are the worst disaster that ever happened in this project
<wpwrak>
one issue that quite clearly exists in M1 is that it combines a lot of different grounds. and you can't quite know at what potential they are.
<wpwrak>
well, i think it's also seeing more intensive testing now. so it's normal that more critters come out. we turn more stones ;-)
<wpwrak>
audio back to normal after power-cycling
<lekernel>
phew...
<wpwrak>
at least it seems i can paralyze audio quite reliably :)
<lekernel>
try shorting L3 ...
<wpwrak>
i'm kinda curious what exactly my stereo sends out there
<lekernel>
the wm9707 datasheet says avss/dvss voltage should be max +/- 0.3V
<lekernel>
it could easily be exceeded by transients across L3 ...
<lekernel>
yay, smells like even more rework delays
<lekernel>
(and of course, the problem never manifested itself with the lm4550 nor on my wm9707 test board ...)
<wpwrak>
maybe your signal sources have better/different grounding
<wpwrak>
i'm also a little suspicious about DMX. those expensive USB-DMX dongles all seem to have galvanic isolation. that's probably not just because it sounds cool ...
<wpwrak>
and DMX seems a particularly good candidate for potential differences because the devices will be far away from the DJ desk, probably connecting to very different points in the mains wiring
<wpwrak>
(well, that's my layman's suspicion. i didn't know DMX even existed before i saw it in the M1 schematics, so maybe i'm all wrong :)
<wpwrak>
anyway, let's see what's up with the audio
<lekernel>
I haven't had any DMX issue so far, but it seems to be a persistent and inconvenient pattern that all problems happen on other people's boards
<lekernel>
just crank up the volume, this is a totally trivial issue
<lekernel>
kristianpaul, easy to say for you
<wpwrak>
lekernel: even more so if we consider that the "normal" level wolfson considers seems to be around only +/- 100 mV, so 0.2 Vpp
<wpwrak>
lekernel: no, i mean the voltage the M1 input is designed to handle
<wpwrak>
lekernel: the codec does up to 0.6 Vpp (absolute maximum ratings), you have a 1:2 divider, so you get 1.2 Vpp for the input signal
<wpwrak>
lekernel: (probably already with distortions, etc., but that may not matter so much)
<wpwrak>
lekernel: but it seems that "LINE" levels you may encounter can go up to about 1.8-2.2 Vpp, particularly with "professional" equipment
<wpwrak>
my sony, with ~1.3 Vpp would be high for a consumer electronics device, but still well below "professional equipment"
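
The line-level arithmetic above, written out. The numbers are the ones quoted in the log (0.6 Vpp codec limit, 1:2 divider, ~1.3 Vpp from the Sony, roughly 1.8-2.2 Vpp for "professional" gear). The names are hypothetical, and reading "1:2" as a divider that halves the signal is an assumption.

```python
CODEC_MAX_VPP = 0.6   # codec input limit as quoted from the data sheet in the log
DIVIDER_RATIO = 2.0   # assumed meaning of "1:2 divider": the codec sees half the input


def max_line_in_vpp(codec_max=CODEC_MAX_VPP, ratio=DIVIDER_RATIO):
    """Largest line-in swing that still stays within the codec limit."""
    return codec_max * ratio


def within_limit(source_vpp, codec_max=CODEC_MAX_VPP, ratio=DIVIDER_RATIO):
    """True if the divided-down signal stays at or below the codec limit."""
    return source_vpp / ratio <= codec_max
```

With these figures the input tolerates 1.2 Vpp, so the ~1.3 Vpp Sony already lands slightly over the quoted limit and 1.8-2.2 Vpp sources land well over it.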
<wpwrak>
hmm, new hypothesis: the data sheet is simply wrong ;-)
<wpwrak>
and the absolute maximum rating is in truth AVss-0.3 V to AVdd+0.3 V
<wpwrak>
in which case everything is nice and well
<wpwrak>
adam will be disappointed that we failed to create yet another rework item for him ;-) L3 is still on, though. let's see about it ...
<lekernel>
I was talking about "Difference DVSS to AVSS"
<lekernel>
which is also +/- 0.3V
<wpwrak>
ah, i see. yes, that doesn't agree with L3.
<wpwrak>
heh, i see L19 also has a history of being eliminated ;-) (huge solder blob)
<wpwrak>
reworked. works like a charm
<wpwrak>
doing a few unplug/plug cycles
<wpwrak>
solid as a rock
<wpwrak>
what's funny is that the stereo makes noise when i connect the stereo:line-out to m1:line-in. some interesting things must be passing over that ground.
<kristianpaul>
okay never mind my easy comments i regret now