<splitice>
We have been seeing it sporadically too in our fleet
<splitice>
Anyone got any gut feelings for this one?
<splitice>
clock monotonic jumps forward, sometimes there is a cpu stall (governor) but that may be because of the increase in clock_monotonic creating an uptime overflow
<splitice>
the a64 errata comes to mind, however surely someone else in this community would have noticed by now
_whitelogger has joined #linux-sunxi
dev1990 has quit [Quit: Konversation terminated!]
<megi>
never saw that on any of my H3 boards, except for those rcu stalls years ago when debugging cpufreq
<splitice>
I'll put the kernel trace up in case anyone has any ideas that I haven't - https://paste.ee/p/05p13
<splitice>
unfortunately I haven't been able to work out how to replicate it in the lab (that's from a remote device). So I can't get 100% access during the issue (openvpn ssh fails, but an http "backdoor" we have installed on some of these devices continues working)
<megi>
It's not clear to me what 4.19.57 armbian kernel is, so it's hard to see what cpu clock rate change code looks like
<megi>
I.e. what it looks like after applying all the patches armbian carries for this kernel
<megi>
try reproducing it with the latest mainline kernel
<splitice>
that kernel's already been stripped of most of the non-essential sunxi patches
<megi>
5.5 or 5.6-rc
<splitice>
I'll check the list
<splitice>
I'm working on a 5.4 port for testing but without a method to replicate it's going to be difficult to work out if it's fixed...
<splitice>
what I know so far is that 3/15 devices deployed have the issue. Two have had it once each (in the past fortnight) and one has the issue repeatedly every 2-4 days.
<splitice>
none of the 6 in our office exhibit the issue (nor did the 15 while they were also there, powered on, for 2 weeks)
<megi>
do the locations have differing ambient temperatures?
<megi>
what is the board dts?
<splitice>
friendlyarm nanopi neo
<splitice>
air
<splitice>
the main armbian patches we have left enabled are for thermal zone support
<splitice>
Temp: They are human occupied houses so they shouldn't be too extreme. The currently failed device is reporting 47deg right now, no idea if there were thermal stress events during operation.
<splitice>
(SoC temp)
<splitice>
big heatsinks and reasonable cpu settings are employed
<megi>
hmm, the board doesn't have CPU voltage regulator?
<splitice>
3 voltages I believe
<megi>
dts doesn't have it set up
<megi>
this means cpufreq will alter the cpu frequency freely, without touching whatever default voltage is there after boot
<splitice>
pretty sure it's patched in in the thermal areas.
<splitice>
ok, what block is that normally? I'll see if I can find a patch.
<megi>
if the default voltage is >= 1.32V it may work I guess, otherwise you're probably undervolting the CPU
<megi>
how to add the regulator node depends on the board
<megi>
you can try limiting the top frequency to 816MHz to see if it will stop the stalls
<megi>
that only requires 1.1V
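
One way to try the 816 MHz cap suggested above is through the cpufreq sysfs files. This is a minimal Python sketch, assuming the usual `scaling_available_frequencies` and `scaling_max_freq` files (values in kHz) inside a policy directory such as `/sys/devices/system/cpu/cpu0/cpufreq`. The helper name `cap_max_freq` is hypothetical and not part of any existing tool.

```python
from pathlib import Path


def cap_max_freq(policy_dir, cap_khz):
    """Write the highest available frequency <= cap_khz to scaling_max_freq.

    policy_dir is a cpufreq policy directory. cap_max_freq is an
    illustrative helper, not an existing utility.
    """
    policy = Path(policy_dir)
    # The file holds a space-separated list of frequencies in kHz.
    available = [
        int(tok)
        for tok in (policy / "scaling_available_frequencies").read_text().split()
    ]
    allowed = [f for f in available if f <= cap_khz]
    if not allowed:
        raise ValueError(f"no available frequency at or below {cap_khz} kHz")
    chosen = max(allowed)
    (policy / "scaling_max_freq").write_text(f"{chosen}\n")
    return chosen
```

On a real board this would be called as root, for example `cap_max_freq("/sys/devices/system/cpu/cpu0/cpufreq", 816_000)`. Picking the nearest available step at or below the cap avoids depending on the board exposing exactly 816000 kHz.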
<smaeul>
I threw my timer testing tools on my OPi+2E, and I immediately got:
<smaeul>
CPU 2: jumped back 0μs: 0x00000007997a7177 → 0x00000007997a7177 > 0x00000007997a7176 → 0x00000007997a7176
<smaeul>
CPU 0: jumped back 0μs: 0x00000007997a7177 → 0x00000007997a7177 → 0x00000007997a7177 > 0x00000007997a7176
<megi>
fun
<megi>
smaeul: do you have code somewhere?
<splitice>
@smaeul: indeed I'
<splitice>
can run on a fleet fairly easily
<megi>
splitice: regulator issue is real too, though :)
tllim has quit [Read error: Connection reset by peer]
<splitice>
megi currently investigating the cpu voltage situation, it's a possibility, however given we have some devices up over 180 days I suspect it might have just been my patch reduction (reduction of surface area)
ganbold_ has joined #linux-sunxi
<splitice>
bpi uses basically the same regulator setup so I should be able to make a patch if not. More a kernel network subsys guy but how hard can it be... lol
<megi>
some H3 SoC may be stable at 1.3V/1.2GHz some may need slightly more
<smaeul>
I never got around to cleaning it up after the a64 patch got merged, so... http://ix.io/2dRj
<megi>
Allwinner doesn't do binning for H3, AFAIK
<megi>
doesn't mean process doesn't vary
<smaeul>
I get one or two CPUs jumping back, but only right when the test starts, so maybe it really is a cpufreq thing
ganbold has quit [Ping timeout: 256 seconds]
<splitice>
our workload is one with a lot of peaks and troughs, so the frequency would be changing
<smaeul>
definitely CPU frequency related (i.e. clocks, bus stalls, power spikes/voltage drops, etc.). I run this and the backward ticks come rolling in:
<smaeul>
while true; do cut -d' ' -f$(((RANDOM%9)+1)) scaling_available_frequencies > scaling_max_freq; done
<megi>
other thing is that I don't like mainline implementation of CPUX rate changing code, so I run my own - mainline intentionally uses simple but broken CPU rate changing code, locks up PLL and then tries to recover
<splitice>
megi got a patch?
<megi>
sure
<splitice>
I know for sure mainline cpu online code is broken
<splitice>
it's actually how we are restarting the devices locked up due to this bug
<smaeul>
got it to jump more than one tick backward: CPU 2: jumped back 1μs: 0x0000000d44b2e1ff → 0x0000000d44b2e1ff > 0x0000000d44b2e1e0 → 0x0000000d44b2e1e0
<splitice>
megi I'm thinking https://paste.ee/p/5WvOd for the regulator. It's the same regulator setup as the BPI zero plus but with a different pin for control.
<megi>
smaeul: I also get jumps
<smaeul>
even with your PLL patch?
<megi>
yes
<megi>
well I get output, I don't know what it means :)
<megi>
have to check the code
<smaeul>
it's just the raw CNTVCT values (or whatever the armv7 equivalent is called), so a 24MHz counter
<megi>
cyclic buffer
<megi>
just buffer :)
<smaeul>
yeah, it just reads 4 times and compares :)
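
The "read 4 times and compare" check can be sketched on already captured values. This is not smaeul's tool (which reads the hardware counter directly), only a Python illustration of the comparison and of converting 24 MHz ticks to microseconds. The function names are made up for the example.

```python
HOSC_HZ = 24_000_000  # 24 MHz counter, per the chat


def find_backward_jump(samples):
    """Return (index, ticks_back) for the first decrease in samples, or None."""
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1]:
            return i, samples[i - 1] - samples[i]
    return None


def ticks_to_us(ticks, hz=HOSC_HZ):
    """Convert counter ticks to whole microseconds (rounded down)."""
    return ticks * 1_000_000 // hz


# The four reads from the "jumped back 1μs" report earlier in the log.
reads = [0x0000000D44B2E1FF, 0x0000000D44B2E1FF,
         0x0000000D44B2E1E0, 0x0000000D44B2E1E0]
hit = find_backward_jump(reads)
if hit is not None:
    index, ticks = hit
    print(f"read {index} jumped back {ticks} ticks (~{ticks_to_us(ticks)}us)")
```

The gap in that report is 0x1f = 31 ticks, which at 24 MHz is a little over one microsecond. A single-tick step rounds down to 0μs, which matches the earlier "jumped back 0μs" lines.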
<splitice>
the end result for us was an uptime that overflowed the 32bit counter (or under flowed the uptime?)
<splitice>
that's what actually causes the irreparable problems, systemd doesn't like that...
<splitice>
and I'm guessing neither do some parts of the kernel
<smaeul>
yes, the size doesn't matter when the jump is backwards
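
Why the size of a backward step does not matter can be shown with a little arithmetic. The Python sketch below assumes elapsed time is computed as a masked, unsigned difference between two counter reads. Whether the kernel in question does exactly this is an assumption, and the 56-bit default width is also an assumption chosen for illustration.

```python
HOSC_HZ = 24_000_000  # 24 MHz counter, per the chat


def elapsed_ticks(last, now, bits=56):
    """Unsigned, wrapping difference between two counter reads.

    bits is the assumed counter width used for masking. 56 is an
    assumption for illustration, not something stated in the chat.
    """
    return (now - last) & ((1 << bits) - 1)


forward = elapsed_ticks(0x0D44B2E1E0, 0x0D44B2E1FF)   # normal case
backward = elapsed_ticks(0x0D44B2E1FF, 0x0D44B2E1E0)  # 31 ticks back

seconds_per_year = 365.25 * 24 * 3600
print(f"forward:  {forward} ticks")
print(f"backward: {backward} ticks, about "
      f"{backward / HOSC_HZ / seconds_per_year:.0f} years")
```

A step back of one tick or of a thousand ticks both wrap to nearly the full counter range, so the apparent forward jump is enormous either way.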
<megi>
it's possible kernel will not even notice, since it's probably not getting cntvct in a tight loop
<smaeul>
very likely even, which is why the date jumps so rarely
<smaeul>
it just takes once :)
<megi>
it would likely happen on all boards then, not just some
<smaeul>
I'll leave the test running for 12h on my OPi+2E locked to the max freq (with active cooling) to see if I can get it to jump without a cpufreq transition
<megi>
maybe cpufreq code manipulates some counter values?
<megi>
or is the counter fixed freq?
<smaeul>
the counter is fixed frequency, running off HOSC
<smaeul>
Linux writing to the counter value at runtime would be crazy
<splitice>
I see date jumps every ~0.3s over multiple cpu on a 4.14 kernel during a cpu speed transition loop (scaling_setspeed). Usually 0ms other times 1ms back.
<megi>
even more bizarre why it would change on CPU freq change then
<megi>
hmm, mainline switches CPU to HOSC during CPUX rate change
<megi>
code execution slows down quite a bit during that time
<smaeul>
wow, so it's not like the a64 bug at all (where we read indeterminate values). the clock actually *goes backward* and then counts the same values over again
<megi>
that's why I suspect kernel is doing some correction :)
<smaeul>
in that picture you can see CPU 0 reaching [the same value] three times (can't copy/paste from your screenshots :/)
<megi>
hmmh, does the same thing happen on A64 during cpufreq?
<megi>
increase in backjumps?
<megi>
smaeul: it doesn't happen on H5
<smaeul>
I don't know, there are so many jumps already that it would be nontrivial to filter out from the noise (and all of my A64 systems have the workaround already)
<megi>
nevermind, H5 is 64-bit and doesn't jump on cpufreq
iyzsong has joined #linux-sunxi
megi has quit [Quit: WeeChat 2.7.1]
megi has joined #linux-sunxi
<MoeIcenowy>
smaeul: what does *actually goes backward* mean?
<MoeIcenowy>
cannot parse the context
<megi>
smaeul: I fixed it :)
<megi>
I dropped the PLL gating and reparenting and there are no jumps anymore
<megi>
so it's caused by that
<smaeul>
MoeIcenowy: on the a64 when the clock jumps back from (say) 0xfffff back to 0x7ffff, it immediately jumps forward again to 0x100000 and counts from there
<smaeul>
on the h3, after jumping back, it would count 0x7ffff -> 0x80000 -> 0x80001 and so on
<MoeIcenowy>
why is the clock jumping back?
<smaeul>
so on the a64, the counter hardware is still counting forward, but what you can see from the CPU has some wrong bits
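
The difference between the two behaviours described here can be expressed as a small classifier over a captured sequence of reads. This is only an illustrative Python sketch. The labels "transient" and "rewind" and the function name are invented for the example.

```python
def classify_backstep(values):
    """Classify the first backward step in a list of counter reads.

    Returns "transient" if the read after the low value is above the
    value seen before the drop (A64-style bad read), "rewind" if the
    counter carries on from the low value (H3-style), or None if there
    is no backward step or no read after it to decide.
    """
    for i in range(1, len(values) - 1):
        before, low, after = values[i - 1], values[i], values[i + 1]
        if low < before:
            return "transient" if after > before else "rewind"
    return None


# Sequences built from the values quoted in the chat.
a64_like = [0xFFFFF, 0x7FFFF, 0x100000, 0x100001]
h3_like = [0xFFFFF, 0x7FFFF, 0x80000, 0x80001]
print(classify_backstep(a64_like))
print(classify_backstep(h3_like))
```

In the first sequence the low read is a one-off and the next read is past the old value, so the underlying counter never went back. In the second the counter keeps counting from the low value, which is the case that replays the same timestamps.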
<MoeIcenowy>
megi: then please test CPU frequency switching stability
<megi>
MoeIcenowy: I will not do that again
<megi>
I already did that and it works fine
<MoeIcenowy>
megi: okay
<megi>
anyway, I don't think this is really the cause of splitice's problems
<MoeIcenowy>
so I still cannot understand what happened. Is it that CPUfreq scaling causes the clock to be tweaked back, which then triggers the bug?
<megi>
it happens when switching parents of CPUX clock
<megi>
to HOSC and back
<megi>
no idea why
<megi>
it only happens on H3, not H5
<MoeIcenowy>
switching the parent will cause the clock to be tweaked back?
<megi>
yes
<splitice>
Unfortunately even I have little idea what causes my issue. That will likely be the case until I can replicate it in lab. This certainly is a strong contender for the cause.
<MoeIcenowy>
and this happens on A64 too, right? (although the HW seems to be trying to hide it)
<megi>
probably not
<megi>
I don't think A64 had CPUX reparenting until recently
<MoeIcenowy>
I know that the current A64 bugfix is not optimal
<MoeIcenowy>
it cannot prevent timetravel, only reduces it
<MoeIcenowy>
(I mean the timer fix)
<megi>
yes
<megi>
someone reported some fsl bugfix worked for them better
<MoeIcenowy>
which fsl bugfix?
<megi>
but I had infinite loops using it in u-boot :)
<megi>
so no
<MoeIcenowy>
CONFIG_FSL_ERRATUM_A008585 ?
JohnDoe_71Rus has joined #linux-sunxi
<megi>
probably
<megi>
it does some unbounded loop
<MoeIcenowy>
megi: when I'm reading the code
<MoeIcenowy>
it's forced to be bounded to 200
<megi>
it randomly locked up my bootloader depending on the code size (which affected boot timing)
<megi>
maybe it's bounded in Linux
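
The bounded versus unbounded question can be illustrated with a re-read loop that stops after a fixed number of attempts. This Python sketch shows the general shape of such a workaround (read twice, accept when both reads agree, give up after a cap), using the limit of 200 mentioned above as the default. It is not the kernel or u-boot code, and `read_counter` is a hypothetical stand-in for the hardware read.

```python
def stable_read(read_counter, max_retries=200):
    """Read until two back-to-back reads agree, giving up after max_retries.

    read_counter is any zero-argument callable returning an int. It stands
    in for the hardware counter read. Returns (value, stable).
    """
    new = None
    for _ in range(max_retries):
        old = read_counter()
        new = read_counter()
        if old == new:
            return new, True
    # Bounded: report failure instead of spinning forever.
    return new, False


# A reader that glitches once and then settles.
glitchy = iter([0xFFFFF, 0x7FFFF, 0x100000, 0x100000])
print(stable_read(lambda: next(glitchy)))
```

Without the cap, a counter that never yields two equal consecutive reads would keep the loop spinning, which is consistent with the bootloader lockups described above.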
<MoeIcenowy>
let me check git log
<smaeul>
if you have a better fix, I'll review a patch, but I have never seen time travel on any of my 6 A64 devices with the current fix
<MoeIcenowy>
(My git repo is on HDD, so it's slow)
<megi>
I did not either
<MoeIcenowy>
smaeul: I saw it on Pinebook once, PinePhone once
<MoeIcenowy>
both to the 22nd century
<MoeIcenowy>
BTW dhclient seems to be not y2038-ready
<MoeIcenowy>
it starts to segfault after the timetravel
<MoeIcenowy>
megi: BTW, as we know this bug, I think you should submit a patch that hides the divider from the CCU driver
<megi>
and u-boot patch
<megi>
yeah, I'll try
<MoeIcenowy>
but keeping compatibility with older u-boot is an issue
<megi>
but this is a tough pill to swallow
<megi>
this will break kernels running on incompatible u-boots
<splitice>
thanks megi, i'll test that. Working on applying your patches currently. Broke my kernel in the last build so stepping back with patches.
<megi>
I don't think it will be accepted
<splitice>
I need a faster build machine :(
<megi>
splitice: good luck, I'm off :)
<MoeIcenowy>
megi: if this bug is confirmed, we MUST find a way to get it accepted
<splitice>
thanks mate, have a good morning/day/evening
<MoeIcenowy>
reset the divisor when booting?
<MoeIcenowy>
this will have a one-shot chance of triggering the bug
<MoeIcenowy>
but it prevents triggering the bug all the time
<splitice>
a one shot on boot (only on an old u-boot) is infinitely better than it occurring at a random time.
<splitice>
it would be great if it could be done after watchdog init then a restart could be triggered, but that would probably be more effort than it's worth
lurchi_ has joined #linux-sunxi
lurchi__ has quit [Ping timeout: 265 seconds]
aloo_shu has quit [Disconnected by services]
chewitt has quit [Quit: Zzz..]
selfbg has joined #linux-sunxi
<splitice>
Ah my failure wasn't megi's patches, it was `CONFIG_RTC_CLASS=y CONFIG_RTC_INTF_DEV=y`. Turns out that's broken on h3 (sun6i-rtc/sun8i-h3-rtc)
<Werner>
splitice: And I was just about to ask if this issue is related to the 1978 date thingy ^^
airgapp has joined #linux-sunxi
reinforce has joined #linux-sunxi
<splitice>
certainly is related I think, assuming the issue I'm seeing is same as OPs
<montjoie>
wow, the TRNG seems working on R40
JohnDoe_71Rus has quit [Ping timeout: 256 seconds]
JohnDoe_71Rus has joined #linux-sunxi
<splitice>
any idea on how to easily verify cpu voltage regulation is working?
<splitice>
i guess I could try and probe the component... so tiny
<KotCzarny>
temperature
<KotCzarny>
it's almost directly related to voltage, not freq
<KotCzarny>
so stick to 628 mhz or something and start playing
<KotCzarny>
montjoie: congrats! :)
Corkhat has joined #linux-sunxi
Corkhat has quit [Remote host closed the connection]
<splitice>
multimeter confirms my patch works
<splitice>
default configuration of the neo air dts is to always run at 1.3v i.e over-volt
<KotCzarny>
i hope you put some heatsink/fan combo on them
<splitice>
I'll package up the patch tomorrow
<splitice>
KotCzarny comes with a very effective heatsink
Putti has joined #linux-sunxi
Putti has quit [Changing host]
ldevulder_ has joined #linux-sunxi
kaspter has quit [Quit: kaspter]
kaspter has joined #linux-sunxi
<splitice>
I can confirm that megi's patch works to fix the issue as tested with smaeul's tool
ldevulder has quit [Ping timeout: 256 seconds]
<splitice>
I'll do some testing for any introduced issues then ship to some testing devices and see if the issue continues
maccraft has joined #linux-sunxi
<KotCzarny>
12MB cache on cpu, that thing could run linux directly, har har
mauz555 has joined #linux-sunxi
suprothunderbolt has quit [Ping timeout: 258 seconds]
mauz555 has quit [Ping timeout: 272 seconds]
matthias_bgg has joined #linux-sunxi
gaston1980 has joined #linux-sunxi
yann has quit [Ping timeout: 255 seconds]
JohnDoe_71Rus has quit [Read error: Connection reset by peer]
JohnDoe_71Rus has joined #linux-sunxi
gsz has joined #linux-sunxi
dddddd has quit [Ping timeout: 258 seconds]
markk__ has joined #linux-sunxi
JohnDoe_71Rus has quit [Ping timeout: 255 seconds]
JohnDoe_71Rus has joined #linux-sunxi
tnovotny has joined #linux-sunxi
JohnDoe_71Rus has quit [Remote host closed the connection]
JohnDoe_71Rus has joined #linux-sunxi
florian_kc has joined #linux-sunxi
gsz has quit [Quit: Konversation terminated!]
JohnDoe_71Rus has quit [Client Quit]
selfbg has quit [Ping timeout: 255 seconds]
selfbg has joined #linux-sunxi
<obbardc>
jernej: turns out, whatever has happened in drm since v5.4 has broken my HDMI>VGA converter
<obbardc>
as it works fine on X and modetest on 1080p HDMI screen
<obbardc>
but not on 1024x768 VGA monitor via the converter
<obbardc>
it may have been before v5.4 i last tested the converter, not sure
<obbardc>
unplugging the 1080p then plugging in the low-res screen via converter does get some output though
<obbardc>
strange :-)
<KotCzarny>
bad/unsupported mode (ie. clocks out of range?)
DrFrankensteinUK has quit [Ping timeout: 256 seconds]
<obbardc>
maybe, it worked before with the defaults though
<montjoie>
KotCzarny: I am just surprised it works, only H6 has it working until now
<KotCzarny>
montjoie: fixing allwinner bugs and holes is like adventure game
AneoX has joined #linux-sunxi
DrFrankensteinUK has joined #linux-sunxi
ldevulder_ is now known as ldevulder
yann has joined #linux-sunxi
florian_kc is now known as florian
hlauer has joined #linux-sunxi
markk__ has quit [Ping timeout: 265 seconds]
matthias_bgg has quit [Ping timeout: 268 seconds]
matthias_bgg has joined #linux-sunxi
markk__ has joined #linux-sunxi
mauz555 has joined #linux-sunxi
markk__ has quit [Ping timeout: 255 seconds]
<mru>
KotCzarny: a maze of twisty little passages, all alike?
JohnDoe_71Rus has joined #linux-sunxi
cnxsoft1 has quit [Read error: Connection reset by peer]
cnxsoft has joined #linux-sunxi
_whitelogger has joined #linux-sunxi
splitice has quit [Remote host closed the connection]
<willmore>
You have been eaten by the Grue.
<mru>
we finally know the name of the grue
<mru>
it is musb
<KotCzarny>
passages created by a copy paste with little changes and often not properly mapped (maps are also copypasted, but by another person)
<megi>
MoeIcenowy: it's not one shot with good odds of not happening, lockup always happens on first thermal event, when changing CPU frequency, when using dividers
<megi>
MoeIcenowy: yes, you can set divider to 1 in a safe way in the kernel, you just have to wait for PLL VCO to lock on the lower frequency first, before changing the post-divider
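
The ordering described here (lower the VCO, wait for lock, then change the post-divider) can be modelled with a toy PLL. Everything in this Python sketch is hypothetical: `FakePll`, its lock behaviour, and `drop_post_divider` are made up to show the sequence and do not correspond to real H3 registers or to megi's patch.

```python
class FakePll:
    """Toy PLL model: output = vco_hz / post_div.

    Purely hypothetical. Real hardware registers and lock timing differ.
    """

    def __init__(self, vco_hz, post_div, lock_polls=3):
        self.vco_hz = vco_hz
        self.post_div = post_div
        self._lock_polls = lock_polls
        self._pending = 0

    def set_vco(self, hz):
        self.vco_hz = hz
        self._pending = self._lock_polls  # lock is lost until polled enough

    def is_locked(self):
        if self._pending:
            self._pending -= 1
            return False
        return True

    def output_hz(self):
        return self.vco_hz // self.post_div


def drop_post_divider(pll, max_polls=1000):
    """Move to post_div == 1 without the output exceeding its start value."""
    target = pll.output_hz()
    pll.set_vco(target)            # 1. lower the VCO to the wanted output rate
    for _ in range(max_polls):     # 2. wait (bounded) for the VCO to lock
        if pll.is_locked():
            break
    else:
        raise TimeoutError("PLL did not lock")
    pll.post_div = 1               # 3. only now remove the post-divider
    return pll.output_hz()
```

In this model, setting the divider to 1 before the VCO has settled at the lower rate would briefly drive the output at the old VCO rate, which is one reading of why the order matters.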
dev1990 has joined #linux-sunxi
cnxsoft has quit [Read error: Connection reset by peer]
cnxsoft has joined #linux-sunxi
lurchi_ is now known as lurchi__
yann has quit [Read error: Connection reset by peer]
<MoeIcenowy>
megi: good, then a kernel patch that locks the divider can be created
<megi>
splitice: 1.3V is slightly undervolted for 1.2GHz
<megi>
that's probably why you see some boards failing and some not, depending on SoC variability
<megi>
and probably outside conditions, like precision of the voltage regulator
yann has joined #linux-sunxi
lurchi__ is now known as lurchi_
afaerber has joined #linux-sunxi
dddddd has joined #linux-sunxi
aloo_shu has joined #linux-sunxi
matthias_bgg has quit [Ping timeout: 255 seconds]
reinforce has quit [Quit: Leaving.]
mauz555 has quit []
aalm has quit [Ping timeout: 268 seconds]
aalm has joined #linux-sunxi
selfbg has quit [Remote host closed the connection]
cnxsoft has quit [Remote host closed the connection]
markk__ has joined #linux-sunxi
afaerber has quit [Quit: Leaving]
afaerber has joined #linux-sunxi
lurchi_ is now known as lurchi__
AneoX has joined #linux-sunxi
AneoX has quit [Client Quit]
mauz555 has joined #linux-sunxi
gsz has joined #linux-sunxi
netlynx has joined #linux-sunxi
netlynx has quit [Changing host]
netlynx has joined #linux-sunxi
hlauer has quit [Ping timeout: 258 seconds]
florian_kc has joined #linux-sunxi
matthias_bgg has joined #linux-sunxi
maccraft has quit [Quit: WeeChat 2.7.1]
maccraft has joined #linux-sunxi
lkcl has quit [Ping timeout: 260 seconds]
yann has quit [Ping timeout: 272 seconds]
lurchi__ is now known as lurchi_
lkcl has joined #linux-sunxi
reinforce has joined #linux-sunxi
gediz0x539 has joined #linux-sunxi
florian_kc has quit [Ping timeout: 258 seconds]
matthias_bgg has quit [Ping timeout: 265 seconds]
lurchi_ is now known as lurchi__
tnovotny has quit [Quit: Leaving]
gsz has quit [Quit: Konversation terminated!]
maccraft123 has joined #linux-sunxi
maccraft has quit [Ping timeout: 255 seconds]
markk__ has quit [Ping timeout: 256 seconds]
arete74 has quit [Ping timeout: 256 seconds]
maccraft123 is now known as maccraft
arete74 has joined #linux-sunxi
matthias_bgg has joined #linux-sunxi
maccraft123 has joined #linux-sunxi
matthias_bgg has quit [Ping timeout: 260 seconds]
afaerber has quit [Remote host closed the connection]