<wolfspraul>
hey btw, I just did a little counting
<wolfspraul>
the rc3 yield as of right now is 49 100% perfect units
<wolfspraul>
49 out of 90
<wolfspraul>
the original goal was 80 (out of 90)
<wolfspraul>
adam takes a few days off from tomorrow (friday) to tuesday, and then he's back at bringing this up more
<wolfspraul>
next target: 60
<wolfspraul>
:-)
<wolfspraul>
who knows maybe in the end we can get close to the original 80... too early to tell now, have to wait and see which troublemakers remain at the end and what analysis shows for them
<wolfspraul>
aw: thanks a lot for the hard and persistent work!
<wpwrak>
i now made the test for equivalent output more comprehensive/strict. and the latest version(s) produce perfect matches for all patches we have.
<wolfspraul>
wow
<wpwrak>
the generated code is now a bit less efficient than what i had on monday. so now there are four patches that get longer even with optimization. all the rest is about the same and some significantly more compact.
<wpwrak>
without optimization, it still gets a bit worse. in all cases, the new scheduler is a lot faster than the original one. with profiling, no optimization (-O), on x86-64, and including parsing and all other compilation steps, on average about 10x faster.
<kristianpaul>
wow indeed
<wpwrak>
the next step is to build flickernoise and see how it works in its native context. i would expect the new scheduler to optimize (-O) better than the old one, because all frequently traveled code paths are in the same compilation unit and there are no terms greater than O(n^2) in any of the processing.
<wpwrak>
and even O(n^2) would be a degenerate case. things like foo = <expression> and then gazillions of varN = fn(foo), i.e., everything becomes executable after the first operation
<wpwrak>
the O(n^2) would hit the optimizer (the longest critical path first algorithm) hardest, because it always considers all available choices. (i could defang it with a merge sort, but that's probably excessive)
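(aside, not from the log) a minimal sketch of why a "longest critical path first" pick is O(n) per choice and hence O(n^2) when everything is ready at once. the struct layout and the name pick_lcpf are assumptions made for illustration, not the actual scheduler code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-instruction record; not the real scheduler's layout. */
struct insn {
    int crit;    /* length of the longest dependency chain starting here */
    int ready;   /* non-zero once all operands are available */
    int issued;  /* non-zero once the instruction has been scheduled */
};

/*
 * Return the index of the ready, unissued instruction with the longest
 * critical path, or -1 if there is none. Ties go to the lower index.
 * The scan is O(n) per pick, so n picks with everything ready cost O(n^2).
 */
int pick_lcpf(const struct insn *v, size_t n)
{
    int best = -1;
    size_t i;

    assert(v || !n);
    for (i = 0; i < n; i++) {
        if (!v[i].ready || v[i].issued)
            continue;
        if (best < 0 || v[i].crit > v[best].crit)
            best = (int) i;
    }
    return best;
}
```

keeping the candidates in sorted order would avoid the rescan, which is presumably what the merge sort remark is about.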
<wpwrak>
of the bad things that may still happen, we have the relatively inefficient parser. it relies heavily on string compares and identifier lookups are always O(n*m) while they could be O(m) or at least O(m+C*log n). n would be the number of known identifiers, m the average length of an identifier, C a constant.
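(aside, not from the log) one way to get close to O(m) per lookup is a hash table: hashing reads each character once, and with few collisions only one or two string compares follow. this is a sketch under assumptions; struct ident, the table size and the function names are made up here and are not the parser's real data structures:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IDENT_BUCKETS 256  /* hypothetical table size, must be a power of two */

struct ident {
    const char *name;
    struct ident *next;    /* next identifier in the same bucket */
};

static struct ident *ident_table[IDENT_BUCKETS];

/* FNV-1a hash: touches each character once, so hashing is O(m). */
static unsigned ident_hash(const char *s)
{
    unsigned h = 2166136261u;

    while (*s) {
        h ^= (unsigned char) *s++;
        h *= 16777619u;
    }
    return h & (IDENT_BUCKETS - 1);
}

/* Add an identifier at the head of its bucket's chain. */
void ident_insert(struct ident *id)
{
    unsigned h;

    assert(id && id->name);
    h = ident_hash(id->name);
    id->next = ident_table[h];
    ident_table[h] = id;
}

/* Look up by name; only entries in one bucket are compared. */
struct ident *ident_lookup(const char *name)
{
    struct ident *id;

    for (id = ident_table[ident_hash(name)]; id; id = id->next)
        if (!strcmp(id->name, name))
            return id;
    return NULL;
}
```

a sorted table with binary search would be the other obvious route, at log n string compares per lookup.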
<wpwrak>
but we'll see. maybe it doesn't matter so much and the scheduler dominates.
<wpwrak>
wolfspraul: in what categories do the remaining gremlins in M1 fall ? i know we have a few (2 ?) "NOR suddenly going completely mad" cases, which may be damaged chips, but what are the rest ?
<wolfspraul>
haven't tried to categorize yet
<wolfspraul>
sorry bbiab
<wpwrak>
then i guess, while aw is taking his days off, wolfgang will be data mining in the mines of mordor :)
<larsc>
one bug to break them all!
<wpwrak>
that would be too easy ;-)
<wpwrak>
sometimes i hate statistics. after realizing that the NOR corruption distribution seems to lack a very "late" corruption, my M1 promptly proceeded to have a run that takes forever to have a corruption. now in the 4th day ...
<kristianpaul>
:/
<wpwrak>
i somehow suspect that temperature does have an effect after all. the last days were relatively warm (today, the first day of spring brought excellent weather)
<kristianpaul>
(warm) good !!
<larsc>
time to make room in the fridge
<wpwrak>
i might have to return to the idea of putting the M1 into the fridge ... well, i could also cool my guest room down to 18 C and move things over there
<wpwrak>
larsc: yeah :)
<larsc>
any suspicions what might cause the corruption?
<wpwrak>
kristianpaul: of course, outdoors events in winter may be about the last places where you want a surprise NOR corruption :) at least they should be less common than hot indoors events
<wpwrak>
i suspect it's some glitches caused by power ramping down unevenly
<kristianpaul>
there is no winter here ;)
<kristianpaul>
ramping?
<kristianpaul>
leakage?
<wpwrak>
e.g., I/O power still good but FPGA core power dropping out of range, and the core then acting crazy
<larsc>
but it should be possible to powerdown the flash before the fpga, or not?
<wpwrak>
with a little hardware change, we could hold it in reset, yes
<wpwrak>
alas, the current reset circuit only does this reliably when powering up, not when powering down
<larsc>
but there is nothing on the board which asserts reset globally if the voltage drops below a certain threshold?
<wpwrak>
a) there's no global reset, and b) the voltage the reset monitors is 3V3, not the 5 V input. so if the input drops but one rail stays up longer than the others, things can get nasty
<wolfspraul>
wpwrak: back. I think it's too early to categorize already. let's wait until more dust settles.
<wolfspraul>
there are probably a lot more low hanging fruits in terms of boards that will pass all fixes and tests just fine
<wolfspraul>
then we focus on analyzing the rest
<wpwrak>
good. let's hope for the best then :)
<wpwrak>
the ones with mad NOR will be tricky
<wpwrak>
we should try to see if we can detect them reliably with a boundary scan
<wolfspraul>
let's see what you find in the end
<wolfspraul>
if you feel you can reliably reproduce the nor corruption, one angle is to manually rework a board with the planned rc4 design (if that is possible)
<wpwrak>
no, i mean the ones where the NOR gets some oscillations. not the single-word corruption
<wolfspraul>
including gate and 4.4v reset ic
<wolfspraul>
if that proves that the nor corruption becomes unreproducible, at least we have an exit path
<wpwrak>
yup. i think for NOR corruption the path is clear. just need to find the hidden variables :)
<wolfspraul>
I'm not clear about the difference between nor oscillation and single-word corruption right now
<wolfspraul>
waiting for dust to settle...
<wolfspraul>
meanwhile I feel good about the rc3 we sell
<wolfspraul>
Adam hasn't seen a single problem in 49 boards now
<wpwrak>
(the two types of NOR corruptions) i think they're radically different. single-word corruption seems to affect all boards and the cause appears to be relatively benign (some glitch). the oscillation is signals that should be unrelated getting synchronized, with a strong smell of chip-level damage.
<wolfspraul>
which chip? the nor chip?
<wolfspraul>
then we just replace it :-)
<wpwrak>
well, NOR problems actually. i'm not sure if we've actually seen corruption connected to the oscillation
<wpwrak>
probably the FPGA
<wolfspraul>
ah ok
<wpwrak>
trickier :)
<wolfspraul>
then replace that, or write off the board
<wolfspraul>
xray etc will also still come
<wolfspraul>
what do you mean with "some glitch"?
<wpwrak>
let's hope the number of such boards stays around 2. i wouldn't really trust a board where the FPGA has been reworked.
<wolfspraul>
you mean something fixable in software/soc ?
<wolfspraul>
why not [fpga rework]
<wpwrak>
according to joerg, the smt fab grade xray won't show anything. but we can of course try. maybe there are surprises.
<wolfspraul>
the smt fab xray is only good for checking the soldering joints
<wpwrak>
(49 good boards) yes, it seems we have a neat division between trouble boards and regular boards. that's encouraging.
<wolfspraul>
it cannot see much inside a chip, unless a really big burn maybe
<wpwrak>
(fpga rework) seems difficult -> good chance of creating new/more problems
<wolfspraul>
nah, that's why we run tests afterwards
<wolfspraul>
I can still use such boards, for example for internal units (like my own, xiangfu, sebastien), or for journalist review units, etc.
<wpwrak>
which may or may not catch them :)
<wpwrak>
of course, yes
<wolfspraul>
then the test needs to be improved
<wolfspraul>
I trust the test, by definition
<wpwrak>
the problem with this testing approach is that you drive the bugs into corners your tests don't reach
<wolfspraul>
well let's see
<wolfspraul>
I have no problem reworking the fpga, if the smt fab (who would do it) thinks they can do it then why not
<wpwrak>
i think the tests are still relatively narrowly focused. you'd need very broad coverage to catch truly exotic bugs that way.
<wolfspraul>
maybe but everybody is testing
<wolfspraul>
we don't need to be worried about ghosts or invisible things, I am not
<wolfspraul>
let's just see
<wolfspraul>
also the smt fab etc. have a lot of rework experience. they can give us advice what makes sense and what not.
<wpwrak>
(some glitch) my current pet theory is that, when powering down, the FPGA core loses power before the I/O (3V3) does. then the core may act funny but since the I/O is still sufficiently powered, it would send out all the dying spasms of the core at full power. some of that may be write pulses to the NOR
<wpwrak>
(everybody is testing) with tests designed for broad coverage you have a reasonable chance. but i don't think the current process is very strong there. i'm not saying that it's bad but that the tests are fairly narrow and probably high-level, too. e.g., some glitches may even get auto-corrected without you noticing.
<wolfspraul>
that glitch would be fixed with the gate+4.4v reset ic solution planned for rc4, no?
<wpwrak>
(4V4 reset) i would expect that, yes
<wpwrak>
meanwhile, my suspicion grows that temperature has an effect, too. the last few days and nights have been quite warm and lo and behold, i've had a run that's been free of NOR corruption for > 3 days. and still counting. it could of course be coincidence, but ... :)
<wpwrak>
i wonder if i already have enough data points for a frequency domain analysis ... a temperature pattern should show up as a ~24 hours cycle, too
<wpwrak>
anyway, i don't think we're in a great hurry with this (yet :). so i'm taking my time to collect data and improve my analysis methods. will be handy when a real emergency hits.
<wpwrak>
ah, and after the latest firmware improvements, i haven't seen a single glitch of labsw. so i think my sw-based debouncing/denoising works well. the next hw revision will also have analog filters, for even better interference hardening.
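(aside, not from the log) for readers unfamiliar with software debouncing: the usual trick is to accept a new input level only after it has been seen for several consecutive samples. the sketch below is a generic counter-based debouncer with made-up names; it is not the labsw firmware:

```c
#include <assert.h>

/* Hypothetical debouncer state for one digital input. */
struct debounce {
    int state;  /* last accepted (debounced) level, 0 or 1 */
    int count;  /* consecutive samples that disagree with state */
};

/*
 * Feed one raw sample. The debounced level changes only after
 * `threshold` consecutive samples differ from it; any sample that
 * agrees with the current level resets the counter, so short
 * glitches are ignored. Returns the debounced level.
 */
int debounce_update(struct debounce *d, int raw, int threshold)
{
    assert(d && threshold > 0);
    raw = !!raw;
    if (raw == d->state) {
        d->count = 0;
        return d->state;
    }
    if (++d->count >= threshold) {
        d->state = raw;
        d->count = 0;
    }
    return d->state;
}
```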
<wpwrak>
my plan is, once the new scheduler is done, to update the labsw design, make another prototype, and if it behaves well, also make one for adam. then he can do his testing in his sleep, much like i do ;-)
<wpwrak>
then i should document the new scheduler while my memory is still reasonably fresh. the efficiency comes at the price of some non-obvious dependencies. maybe i'll also find some bugs when documenting. wouldn't be the first time :)
<wolfspraul>
do you think tuxbrain can sell labsw boards?
<wolfspraul>
or anybody?
<wolfspraul>
maybe that is something sparkfun/adafruit and friends could be interested in...
<wpwrak>
i don't know. if there's enough interest, it would make sense to make proper boards, yes. also for internal use.
<wpwrak>
you could then add all the loose components and sell it as a kit. save assembly ;-)
<wolfspraul>
sure sure
<wolfspraul>
I'd leave that to sparkfun :-)
<wolfspraul>
I haven't looked at the tech details of labsw at all yet, I confess
<wpwrak>
one problem is the case. i use a locally sourced case and replace the front (and later also the rear) plate. works great but may not be very portable.
<wolfspraul>
maybe as part of the 10-01 news
<wpwrak>
hehe ;-)
<wpwrak>
ah yes, just a few days left. time flies :)
<roh>
well.. in that case its difficult.. maybe all non-smt parts
<roh>
kits are not really much trouble weee wise... devices are more complicated
<wpwrak>
labsw is a bit messy to build, yes. a bit of smt at the bottom, but then plenty of through-hole on top. and then a lot more items on the front panel.
<wpwrak>
(weee) is see ;-)
<wpwrak>
s/is/i/
<wpwrak>
hmm, the flickernoise build instructions imply removal of flex and bison. very funny :-(
<wpwrak>
does the slowdown also happen if i rebuild the things in the milkymist (for libfpvm) and flickernoise (for src/compiler.c) repos ?
<xiangfu_>
wpwrak, no. it's RTEMS bug.
<wpwrak>
excellent ;-)
<xiangfu_>
wpwrak, unless there is a new bug in your new libfpvm or compiler.c :D
<wpwrak>
we'll see :)
<xiangfu_>
wpwrak, what can I do to help you speed up compiler.c?
<wpwrak>
let's see how the SDK and then the build goes
<wpwrak>
if i'm lucky, i can just drop in the new scheduler and things will fly
<wpwrak>
of course, i think murphy won't agree with this plan, as usual :)
<wpwrak>
now the moment of truth ... compiling flickernoise ...
<wolfspra1l>
if we have a mmu and really good Linux support one day, what are the reasons that still go for rtems then?
<wolfspra1l>
or is rtems just a temporary placeholder because Linux is harder to pull off?
<wolfspra1l>
will rtems always be smaller and easier to customize?
<wpwrak>
it'll probably be smaller. but i think switching to linux would make a lot of sense. more drivers and protocols, widely known environment, standard tools, and so on.
<wpwrak>
of course, all the RT aspects need to be handled as well. not sure how demanding flickernoise is in that regard. and i also don't know the status of the RT extensions (some RT features are in the standard kernel, but there's more stuff)
<wpwrak>
hmm, make -C compile-flickernoise flickernoise.fbi  still seems to build the SDK. let's see how this goes ...
<wpwrak>
or maybe not ... confusing :)
<xiangfu__>
not SDK. but all the libs it depends on. like RTEMS, gtk etc..
<xiangfu__>
those libs + cross toolchain  is the SDK :)
<xiangfu__>
wpwrak, do you have 'lm32-rtems4.11-gcc' installed?
<wpwrak>
yes. and it's in the PATH
<xiangfu__>
where did you get it? compiled from scripts.git?
<wpwrak>
which lm32-rtems4.11-gcc
<wpwrak>
/opt/rtems-4.11/bin/lm32-rtems4.11-gcc
<wpwrak>
from the SDK
<xiangfu__>
the ***-0000 sdk is gcc 4.5.2 and old newlib code. which cannot build the latest source code :(
<xiangfu__>
you have to use 4.5.3
<wpwrak>
so the SDK is useless ?
<xiangfu__>
wpwrak, if you already have the SDK and you want to compile flickernoise, you don't need scripts.git
<xiangfu__>
just clone the flickernoise.git and compile it
<wpwrak>
ah :) okay. let's see how this goes ...
<xiangfu__>
wpwrak, make -C compile-flickernoise flickernoise.fbi will compile all from 0
<wpwrak>
cd /opt/milkymist/flickernoise.git/src#
<wpwrak>
make
<wpwrak>
yaffs.c:27:31: fatal error: yaffs/rtems_yaffs.h: No such file or directory
<xiangfu__>
wpwrak, checkout to 'stable_1.0' branch
<xiangfu__>
the compiler.c is same in 'master' and 'stable_1.0'
<wpwrak>
much better :)
<xiangfu__>
too many updates here and there. :)
<wpwrak>
now i have a bin/flickernoise
<xiangfu__>
'make load' will compile a bin and copy it to the tftp folder for netboot. then you don't need to reflash
<xiangfu__>
if you want reflash, there is a flickernoise.git/flash/flash.sh
<xiangfu__>
wpwrak, for netboot, the m1 will setup ip address 192.168.0.42, and try to fetch 'boot.bin' from 192.168.0.14
<wpwrak>
hmm, but i didn't see it rebuild milkymist/software/libfpvm/  i guess i need to build there too ...
<wpwrak>
i'll try flashing via jtag. haven't even set up ether yet.
<wpwrak>
ah, nice. make bin/flickernoise.fbi  that was easy :)
<wpwrak>
now the milkymist libs ...
<xiangfu__>
wpwrak, libfpvm. you can manually compile it. and copy the 'libfpvm.a' to '/opt/rtems-4.11/lm32-rtems4.11/milkymist/lib'  then recompile flickernoise by 'make clean load'
<wpwrak>
... make bin/flickernoise.fbi  works. excellent :)
<wpwrak>
does stable_1.0 already have the skipping of patches that use the camera if there's no camera connected ?
<xiangfu__>
wpwrak, no.
<xiangfu__>
but cherry-pick should work fine.
<wpwrak>
pity. that'll be a great feature to have.
<wpwrak>
pheew. make the tree grow stranger and stranger ;-)
<xiangfu__>
wpwrak, for libfpvm, just found that the Makefile is under: 'milkymist.git/software/libfpvm/lm32-rtems'
<xiangfu__>
'make clean install' should work fine in that folder.
<wpwrak>
cool, even better
<xiangfu__>
(skip video) needs those three commits: 'f8e9008016285560e1826a48e0716d719d330387' '2776d11c50d88aa88f4372adabace3803004779e' '9523567c8bdc07b750e6b922e5c0d3acf90865f6', just run git cherry-pick ... should be ok.
<xiangfu__>
it will also skip MIDI, OSC, DMX
<wpwrak>
writing it down ...
<wpwrak>
would there be an easy way to make the master branch compile ? i'd rather be at the current head, also for making patches
<xiangfu__>
it doesn't skip compiling them, it skips rendering those patches. just fyi
<xiangfu__>
wpwrak, I guess recompiling and installing the new yaffs libs should be ok.
<wpwrak>
heh, skipping compilation would be something ;-)
<xiangfu__>
try 'make -C compile-flickernoise rtems-yaffs2' under scripts.git
<xiangfu__>
then compile the 'master' branch
<xiangfu__>
I am trying now. :)
<wpwrak>
hmm, it's unhappy. after make -C compile...
<xiangfu__>
not working,  the new yaffs2 needs new RTEMS API.
<wpwrak>
rtems/rtems_yaffs.c:839:13: error: 'rtems_filesystem_default_write' undeclared here (not in a function)
<wpwrak>
(and more errors)
<xiangfu__>
yes.
<xiangfu__>
wpwrak, there isn't much new in 'master'
<xiangfu__>
there is the new 'yaffs' api and the new 'skip video' code in 'master'
<xiangfu__>
we should fix the rtems bug fast.
<xiangfu__>
then we will be happy on the 'master' branch
<wpwrak>
do you already know what the rtems bug is ? in the irc log, it looks as if it hadn't been quite identified yet
<wpwrak>
funny address. but indeed, it's our old friend. is that an rc2 or an rc3 board ?
<xiangfu__>
rc2 board.
<wpwrak>
seems that rc2 gets NOR corruption more often than rc3. the reset circuit in rc3 probably does help a bit..
<wpwrak>
(at least my rc3 needs on average 500 power cycles)
<xiangfu__>
wpwrak, I will reflash standby.bin now. there is not much info from me. just normal use. normal reflash
<wpwrak>
yes. i think that's "normal" for rc2. if it was as rare as in rc3, it may have gone unnoticed.
<lekernel>
xiangfu__, just a quick test. can you take FN 1.0RC1 (the old RTEMS/YAFFS and such) and verify that the flash write function does not get called at inappropriate times in YAFFS?
<lekernel>
xiangfu__, there's a task that periodically flushes the YAFFS cache every 10 seconds or so
<lekernel>
maybe it triggers flash writes every time, even when the cache is in sync. that would be bad in every case, not only because it can cause unexpected corruption when the power is shut down in the middle, but also because it wears out the flash
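(aside, not from the log) the behaviour being asked for is that a periodic flush is a no-op when nothing changed. a minimal sketch with a dirty flag; struct cache and both function names are hypothetical and do not reflect the YAFFS API:

```c
#include <assert.h>

/* Hypothetical write-back cache bookkeeping. */
struct cache {
    int dirty;        /* non-zero if the cache differs from flash */
    unsigned writes;  /* number of flash writes performed so far */
};

/* Record that cached data changed and must eventually be written. */
void cache_modify(struct cache *c)
{
    assert(c);
    c->dirty = 1;
}

/*
 * Periodic flush. Touches the flash only when the cache is dirty,
 * so an idle system generates no writes. Returns 1 if a write
 * happened, 0 otherwise.
 */
int cache_flush(struct cache *c)
{
    assert(c);
    if (!c->dirty)
        return 0;
    c->writes++;   /* stands in for the real flash write */
    c->dirty = 0;
    return 1;
}
```

counting writes, here or with a printf in the real write function, is enough to tell whether a timer-driven flush is hitting the flash while the system is only rendering.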
<xiangfu__>
lekernel, ok. I will do that. I have old stuff in my system.
<xiangfu__>
I will do that today (if I have time). I will be on the train to ChangeChun in 3 hours. I will do this today or in the next few days.
<xiangfu__>
I asked for some days off, but will read email as always :-)
<lekernel>
ha wow
<wpwrak>
now produces functionally identical code to your scheduler for all the patches. i'll give it a spin on the M1 after a nap.
<lekernel>
LCPF = long latency instructions first?
<wpwrak>
Longest Critical Path First. also considers the dependant operations.
<lekernel>
and "new (no optimizer)" is the same as my basic scheduler, except that it schedules the instruction with the largest latency when there are several choices?
<wpwrak>
realizes that "longest" and "critical path" are a bit redundant
<wpwrak>
it schedules the one that comes first in fpvm
<GitHub152>
[rtems-yaffs2/master] Fixed return value of ycb_file_lseek(). - Sebastian Huber
<lekernel>
this is cool, I never thought someone would contribute something on PFPU. I'm impressed :)
<xiangfu__>
indeed.
<wpwrak>
well, no, one more level: in each cycle, it adds the instructions that become available to its "to do" list. each list of additions is sorted by fpvm position. but the to do list doesn't get reordered. so the ones that become available first usually get issued first (unless the destination slot is blocked)
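(aside, not from the log) the "to do" list described above can be sketched as an append-only queue from which the oldest unblocked entry is issued. the names, the fixed capacity and the `blocked` array are assumptions for illustration only; the real scheduler tracks destination slots differently:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TODO 64  /* hypothetical capacity */

struct todo {
    int insn[MAX_TODO];  /* fpvm positions, in the order they became available */
    size_t n;
};

/*
 * Append this cycle's newly available instructions. `fresh` is assumed
 * to be sorted by fpvm position already; older entries are not reordered.
 */
void todo_add(struct todo *t, const int *fresh, size_t count)
{
    size_t i;

    assert(t && t->n + count <= MAX_TODO);
    for (i = 0; i < count; i++)
        t->insn[t->n++] = fresh[i];
}

/*
 * Issue the oldest entry whose destination slot is free.
 * blocked[p] != 0 means the instruction at fpvm position p cannot
 * be issued in this cycle. Returns the position, or -1 if none.
 */
int todo_issue(struct todo *t, const unsigned char *blocked)
{
    size_t i, j;

    assert(t && blocked);
    for (i = 0; i < t->n; i++) {
        int insn = t->insn[i];

        if (blocked[insn])
            continue;
        for (j = i + 1; j < t->n; j++)  /* close the gap, keep order */
            t->insn[j - 1] = t->insn[j];
        t->n--;
        return insn;
    }
    return -1;
}
```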
<wpwrak>
the pfpu design is quite nice and straightforward. maybe with a bit of tweaking of the handling of static registers, even my scheduler could get a bit simpler.
<lekernel>
wpwrak,  "the flickernoise build instructions imply removal of flex and bison"?
<lekernel>
mh?
<lekernel>
you can install both flex/bison and re2c/lemon at the same time
<wpwrak>
lekernel: for ubuntu, there's deinstallation of M4, which in turn deinstalls flex and bison. i could of course manually install them again later. but it's not so nice if you're forced to have a lot of manually installed packages.
<lekernel>
seems it's an ubuntu-specific bug
<wpwrak>
sigh. cycles without trouble. time to make room in the fridge ...
<wpwrak>
s/cycles/4000 cycles/
<wpwrak>
hmm, does RTEMS/FN show the build date somewhere ?
<xiangfu__>
the "About"
<xiangfu__>
in "Control Panel"
<wpwrak>
great, thanks !
<xiangfu__>
now add some printf to 'yaffs_flush_whole_cache'
<wpwrak>
grmbl. it hits an assertion. now .. where did i go wrong ...
<lekernel>
xiangfu__, add it to the flash write function
<lekernel>
and check that the flash doesn't get written when it's only rendering
<lekernel>
you shouldn't need to dive into the cache flushing code unless you see that it's writing the flash when it should not
<xiangfu__>
lekernel, yes. I have added some printf to my_write.
<xiangfu__>
there is no 'my_write' called when rendering.
<wpwrak>
hmm, rtems isn't very smart when it comes to concurrent console output, is it ? even an abort concurrent with a printf from the same program yields alphabet soup :)
<wpwrak>
s/abort/assert/
<kristianpaul>
wpwrak: rtems has some surprises :)
<kristianpaul>
i basically use it because of network support, but i'm actually using a hacked version of m1 bios, as i don't need too much throughput
<kristianpaul>
(labsw) for my personal case, it will be nice for full control of a milkymist one remotely
<kristianpaul>
so far i always keep my m1 turned on, and thanks to jtag-boot i can remotely do something that i previously needed to do in place (push button ;))
<kristianpaul>
also your neocon logging support is extremely useful for me now :)
<wpwrak>
yeah, with jtag control, you need to power cycle only in some rare cases. i like how well urjtag works. at openmoko, we used openocd. that was a complete and utter nightmare. locked up at the slightest difficulty. and because it was daemon plus client, it wasn't easily scripted either.
<wpwrak>
(neocon logging) sometimes it's the simplest things ... ;-)
<wpwrak>
hmm, the wayward compilation spits out slightly different registers. interesting. let's see if the code input is the same ... about the only thing i can think of that might seriously derail the scheduler would be an incorrect value in nbindings.
<wpwrak>
valgrind is very happy with my code. also -O9 doesn't have any complaints. this normally means that it should work ;-)
<lekernel>
wolfspraul, who's Christiaan Virant ?
<wpwrak>
very interesting. same source, but scheduler input is a little different on lm32 and on x86-64. hmm ...
<lekernel>
input or output?
<wpwrak>
input ! and the output (of my scheduler) has troubles, too. not sure yet whether that's because of something weird in the input or a yet undiscovered bug.
<wpwrak>
Lekernel & Rovastar & Fvese - Subconscious Objects.fnp, lm32 has 108 bindings, x86 has 106 bindings (just unnamed constants). how peculiar. maybe this has something to do with the conversion in get_registers. the rest of the code looks pretty tame, though.
<lekernel>
hm, maybe FN feeds the patch code differently to libfpvm than your x86 test program?
<lekernel>
does it work?
<wpwrak>
nope. my scheduler trips over an assert. so there's something it doesn't like. still searching ...
<wpwrak>
hmm, but nourishment first. this feels as if it may take a while to sort out ...
<GitHub23>
[llvm-lm32] jpbonn pushed 659 new commits to master: http://git.io/GJe5RA
<GitHub23>
[llvm-lm32/master] Make IC_VEX* not inherit from IC_*. Prevents instructions with no VEX form from disassembling to their non-VEX form. Also prevents weak filter collisons that were keeping valid VEX instructions from decoding properly. Make VEX_L* not inherit from VEX_* because the VEX.L bit always important. This stops packed int VEX encodings from being disassembled when specified with VEX.L=1. Fixes PR10831 and PR10806. - Craig Topper
<GitHub23>
[llvm-lm32/master] Pass signed (not unsigned) 10 bit field to SPU 'ori' instruction. - Kalle Raiskila
<GitHub23>
[llvm-lm32/master] Compare type size instead of type _store_ size to make sure that BitCastInst - Jakub Staszak
<wpwrak>
hmm, -DPRINTF_FLOAT ... where on earth would not setting it make sense, ever ? :)