<GitHub99>
[scripts/master] compile-flickernoise, use full path for MILKYMIST_GIT_DIR - Xiangfu Liu
<wpwrak>
step one: send the covert black ops ninjas to collect all voodoo dolls of me
<wpwrak>
step two: make sure to be outside the range of weapons commonly used to kill messengers
<wpwrak>
step three: bring on the news.
<wpwrak>
after < 4325 cycles (due to a bug in labsw, not every test cycle resulted in a full on/off cycle), the standby partition still heroically resisted all corruption
<wpwrak>
however, i found a single-word corruption in the flickernoise partition, causing a CRC error and subsequent boot failure. this seems to have happened in the 3704th cycle
<lekernel>
well... worst case we'll also lock this one in the field
<lekernel>
btw, there's an easy way to rule out software bugs. from JTAG, issue "pld reconfigure" and then the boot commands - instead of a full power cycle
<wpwrak>
i think what this really means is that nothing in the NOR is safe from corruption
<wpwrak>
("soft-boot") ah, good idea. that would rule out power up/down issues.
<wpwrak>
(that is, if the problem strikes during this testing)
<wpwrak>
i suspect that it may be a combination of loss of power and NOR accesses
<lekernel>
that would be ok, because then there's an easy field fix: make sure you shut down the M1 before cutting power
<wpwrak>
yes, i hope that works
<wolfspraul>
that's not a realistic 'fix', but it doesn't matter too much since the issue seems to be rare
<wolfspraul>
anyway it seems we are collecting very valuable data, great!
<wolfspraul>
maybe the rc4 reset circuit will fix the root cause? too early to tell now...
<wpwrak>
i need more data before being able to declare any test of some rc4 circuit a success. that's the problem here - if nothing happens, it could just mean that we were "lucky", but the problem is still there
<wpwrak>
with a few more data points however, it'll be possible to make a crude statistical model that allows making statements about the probability of events, and what the absence of events in a certain sample size means
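A minimal sketch of the kind of crude model mentioned here, assuming every power cycle is an independent trial with the same failure probability p. That assumption is not established in this discussion, and the function names are hypothetical. Under it, the chance of seeing no corruption in n cycles is (1 - p)^n, which tells how many clean cycles a test needs before "nothing happened" means anything.

```python
import math


def prob_no_failures(p, n):
    """Chance of zero failures in n independent cycles, each failing with probability p."""
    return (1.0 - p) ** n


def cycles_needed(p, confidence=0.95):
    """Smallest n for which n clean cycles would be seen with probability
    at most (1 - confidence) if the true per-cycle failure rate were p."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


def upper_bound_rate(n, confidence=0.95):
    """Upper bound on the per-cycle failure rate after n cycles with zero failures."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    # Illustrative rate only: one observed event in roughly 3704 cycles.
    p = 1.0 / 3704
    print(cycles_needed(p))
    print(upper_bound_rate(3000))
```

With a rate near one event per 3704 cycles, the required number of clean cycles at 95% confidence is on the order of 11,000, which illustrates why a short uneventful run of a modified circuit proves little.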
<wpwrak>
anyway, for now i'm still in investigation mode - see what failure patterns are possible. only makes sense after that to hammer the M1 with a specific pattern.
<GitHub125>
[rtems-yaffs2/master] Flush during close (similar to yaffs_close()). - Sebastian Huber
<lekernel>
wolfspraul, many modern operating systems implement this "fix" you find unrealistic
<wolfspraul>
you can say this is a fix, but you will only embarrass yourself
<wolfspraul>
so better is to not talk about it at all
<wolfspraul>
it's a very rare thing it seems
<wolfspraul>
3700 cycles? :-)
<wpwrak>
there may be yet unknown parameters that affect the frequency
<wolfspraul>
yes sure
<wolfspraul>
first we need to learn more, those are excellent results
<wolfspraul>
is there a button-press to shut down?
<wpwrak>
with enough analysis, i can probably make it happen on each try. but then, i don't think i'm quite persistent enough for that ;-))
<wolfspraul>
maybe we can just focus on the rc4 fix and prove that fixes it entirely?
<lekernel>
wolfspraul, yes.... hold middle pushbutton in FN and it shuts down
<wpwrak>
wolfspraul: i agree with the general direction. but ... first more data is needed to make sure any "proof" we attempt actually proves anything. right now, my statistics are based on a whole three events. this is scarcely more than a proof of existence :)
<wpwrak>
of course, i'm a little optimistic here - if the rc4 fix doesn't work either, then it may not take all that much preparation to prove that it doesn't work.
<wolfspraul>
hold middle button, ok. we can start to always talk about that when people ask how to turn off the device.
<wpwrak>
one thing that's interesting is that all these corruptions seem to affect only data that's actually been written. i haven't seen one change the 0xffff ... unused end of a partition yet
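The observation above can be checked mechanically. The following is a sketch, assuming a known-good reference image and a dump of the same partition are available as byte strings of equal length, and treating any reference word equal to 0xffff as erased. That is a simplification, since written data can legitimately contain 0xffff, and the function name is hypothetical.

```python
ERASED_WORD = b"\xff\xff"  # erased 16-bit NOR word


def classify_corruption(reference, dump):
    """Compare two images 16-bit word by word.

    Returns two lists of byte offsets where the dump differs from the
    reference: offsets of words that held data, and offsets of words
    that were erased (0xffff) in the reference.
    """
    if len(reference) != len(dump):
        raise ValueError("images must have the same length")
    in_written, in_erased = [], []
    for off in range(0, len(reference) - 1, 2):
        ref_word = reference[off:off + 2]
        if ref_word != dump[off:off + 2]:
            if ref_word == ERASED_WORD:
                in_erased.append(off)
            else:
                in_written.append(off)
    return in_written, in_erased


if __name__ == "__main__":
    ref = b"\x12\x34\xab\xcd\xff\xff\xff\xff"
    bad = b"\x12\x34\xab\xcf\xff\xff\xff\xff"
    print(classify_corruption(ref, bad))
```

If the second list stays empty over many corruption events, that would support the idea that only programmed words are affected. With only a handful of events it could still be coincidence.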
<wpwrak>
lekernel: do RTEMS/FN normally read anything from the standby partition ? e.g., some system constants or such ?
<lekernel>
no
<wpwrak>
hmm. maybe it's all just coincidence then.
<kristianpaul>
well, if FN doesn't need to write to NOR why not disable this support in the norflash16 core?
<kristianpaul>
"/* register only when needed to reduce EMI */" what problems were encountered when developing this core?
<wpwrak>
kristianpaul: (disable writes) i think it may be a condition where the FPGA doesn't actually intentionally command a write. that may happen only in the confused state in which it ends up when powering down. (well, if the power-down ramp theory is correct)
<wpwrak>
evidence implicating power cycling mounts: 1338 cycles (with NOR unlocked) and standby is still healthy
<kristianpaul>
wpwrak: milkymist/cores/norflash16/rtl/norflash16.v line 58
<kristianpaul>
confused state :)
<wpwrak>
do we have something like a "NOR poke" in urjtag, BIOS, or RTEMS, that is known to work if the NOR is unlocked and known to fail (or show incorrect readback) if the NOR is locked ?
<wpwrak>
let's see if i have the prefix somewhere ...
<wpwrak>
maybe lekernel once suspected the NOR corruption could be caused by EMI ? (if that code is from him)
<kristianpaul>
or mwalle ? :)
<lekernel>
no, it's just to avoid unnecessary toggling of external FPGA signals whenever there's system bus activity
<wpwrak>
or mwalle. or maybe the code is inspired by something else
<wpwrak>
aaah !
<lekernel>
especially since the system bus toggles much faster than the flash can handle
<wpwrak>
what does the "register only when needed to reduce EMI" mean then ?
<wpwrak>
if sounds like #ifdef HAVE_EMI_TROUBLE
<wpwrak>
s/if/it/
<wpwrak>
from your description, it seems that you'd always want to use this glitch avoidance, the contrary of what the comment suggests
<wpwrak>
about the flyer, page 2: it lists many fancy features but it doesn't actually say that you can also just connect audio "line in" :)
<wpwrak>
only page 3 has it
<lekernel>
it's not a glitch problem, the system bus signals remain constant when the flash is being read
<lekernel>
but when it's not, there would be all sort of sorts on the flash address lines, causing completely unnecessary EMI and power consumption
<lekernel>
s/sorts/signals
<wpwrak>
yes, that's what i mean. glitches that don't affect principal functionality but that are undesirable nevertheless
<wpwrak>
"register only when needed to reduce EMI" sounds as if it was something you normally don't want to enable.
<wpwrak>
so there seems to be a bit of a contradiction :)
<kristianpaul>
boards/milkymist-one/rtl/system.v line 233, how does this flash reset go with the ic reset recently added?
<wpwrak>
regarding the flyer, page 3. maybe put a line break in the "USB ports" box, before "You can even write [...]" ?
<wpwrak>
"to stimulate your guests" why do i have that fuzzy mental image of someone either handing out little pills or fondling people's genitals ? ;-)
<stekern>
both sound like a great party though ;)
<wpwrak>
(sorry, didn't have time yesterday for a closer look at the flyer. we had a rather interesting owner's meeting of my building last night. they're fighting some dirty mobbing war over the position of administrator. that's been going on for months. at the meeting, it came to blows, leading to its cancellation and postponement of further deliberation. the story is slowly approaching movie-grade levels of interestingness :)
<kristianpaul>
ah nv, this flash release delay is more for the soc than the flash itself
<mwalle>
ho
<mwalle>
mh too much backlog ;)
<wpwrak>
the curse of returning from vacations :)
<mwalle>
already worked for almost one week now again ;)
<wpwrak>
and only now you found the courage to even contemplate the backlog. qed ;-)
<mwalle>
haha ;)
<mwalle>
so was that writepld working?
<wpwrak>
writereg ? only partially. for reconfiguring a specific bitstream, i had to do a pld load. there, the stuff surrounding the writes is right.
<mwalle>
err writereg.. ;) write pld is sth from my work
<wpwrak>
what i found is that urj_tap_reset_bypass before the urj_tap_chain_flush is essential
<mwalle>
ah so you are loading a bitstream, which causes a reconfiguration with a specific address?
<wpwrak>
yup. still need to teach the script a few tricks, such as selecting which bitstream to load (standby / recovery / regular). but that's trivial ;-)
<mwalle>
nice, sth like pld reconfigure [address] would be handy too ;)
<wpwrak>
yeah, that would be a nice extrapolation
<wpwrak>
not sure how general it would be, though
<mwalle>
mh?
<wpwrak>
anyway, this thing works quite well. the only anomalies i found were after flashing the NOR (without power-cycling between flashing and trying to reconfigure). not sure if it was really not working or if i was just clumsy.
<wpwrak>
(how general) i mean that other chips than the x6 may have slightly different sequences
<mwalle>
yeah
<mwalle>
at least for xc6s devices, your commands could be used with jtag the same way
<mwalle>
Hacked 2001 by Werner Almesberger << 2001?
<wpwrak>
argh
<wpwrak>
you know that you've been around too long when such things don't even look wrong
<mwalle>
hehe
<wpwrak>
fixed :)
<wpwrak>
btw, is there some kind of peek/poke command (in urjtag/BIOS/RTEMS) i could use to test if the NOR is really not locked ?