<bbrezillon>
my guess is that the timeout happens because the MMU is stuck and the job wants to do a flush/invalidate
<bbrezillon>
oh, and we also ignore the return of write_cmd(), meaning that the MMU command might be skipped entirely without ever blocking the rest of the submission
yann has joined #panfrost
clementp[m] has quit [Quit: killed]
Ke has quit [Quit: killed]
nhp[m] has quit [Quit: killed]
l-as has quit [Quit: killed]
clementp[m] has joined #panfrost
<tomeu>
ah, that's bad in itself
<tomeu>
so what we should do: propagate errors and don't try to submit a job if we weren't able to prepare its AS, and reset the whole GPU if the MMU appears stuck?
<bbrezillon>
sounds like a sane approach
<tomeu>
hmm, or maybe propagate errors so submission fails, and reset the GPU whenever that happens?
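[editor's note] The approach discussed above (return the error from write_cmd(), refuse to submit the job when its address space could not be prepared, and reset the GPU when that happens) can be sketched roughly as below. This is a minimal simulation, not the panfrost driver code: `struct fake_gpu`, `wait_ready()`, `reset_gpu()`, `submit_job()` and the command value are all hypothetical stand-ins, and only the name write_cmd() comes from the conversation.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the GPU state; not the real driver structures. */
struct fake_gpu {
    bool as_active_stuck;   /* simulates AS_ACTIVE never clearing */
    int  resets;            /* number of full GPU resets performed */
    int  jobs_submitted;    /* jobs that actually reached the hardware */
};

/* Poll until the simulated MMU is idle, giving up after a bounded number of tries. */
static int wait_ready(const struct fake_gpu *gpu)
{
    for (int tries = 0; tries < 1000; tries++) {
        if (!gpu->as_active_stuck)
            return 0;
    }
    return -ETIMEDOUT;
}

/* Issue an MMU command and report failure instead of silently skipping it. */
static int write_cmd(struct fake_gpu *gpu, unsigned int cmd)
{
    int ret = wait_ready(gpu);

    if (ret)
        return ret;
    (void)cmd; /* a real driver would write the command register here */
    return 0;
}

/* Full reset: in this simulation it simply un-sticks the MMU. */
static void reset_gpu(struct fake_gpu *gpu)
{
    gpu->as_active_stuck = false;
    gpu->resets++;
}

/* Only submit the job if its address space could be prepared. */
static int submit_job(struct fake_gpu *gpu)
{
    int ret = write_cmd(gpu, 1 /* hypothetical flush/invalidate command */);

    if (ret) {
        reset_gpu(gpu);   /* the MMU appears stuck, so recover the whole GPU */
        return ret;       /* and let the submission fail visibly */
    }
    gpu->jobs_submitted++;
    return 0;
}
```

With `as_active_stuck` set, the first submit_job() call fails with -ETIMEDOUT, triggers one reset, and submits nothing; a second call then goes through.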
l-as has joined #panfrost
Ke has joined #panfrost
nhp[m] has joined #panfrost
icecream95 has joined #panfrost
<icecream95>
bbrezillon: tomeu: AS_ACTIVE got stuck three times on rk3288-veyron-jaq-cbg-0 this month but none on -1
<tomeu>
hmm
<tomeu>
and when did it happen for the first time?
<icecream95>
I suspect that -0 needs more voltage when running the GPU at 600MHz than -1
<icecream95>
The kernel used for CI still doesn't have dynamic voltage scaling, right?
<tomeu>
not yet, indeed
<tomeu>
could be that, let me check where the patches are
warpme_ has joined #panfrost
<icecream95>
To confirm, try adding 'echo 600000000 >/sys/class/devfreq/*.gpu/min_freq' before dEQP runs and see if that causes -0 to fail
<icecream95>
alyssa: I use zram for swap (zstd, currently 3.6G/5G used with 50% compression) and have never had OOM issues like you mention
<daniels>
icecream95: that's a _really_ good spot, thank you! I know we've had issues with -0 and not -1 in the past; they should have identical firmware, but even that shouldn't matter as the kernel sets up the whole clock tree; I wonder if it's having thermal issues, or if it's simply a bit older and needs to be put out to pasture
<daniels>
robmur01: could you please register an account on https://gitlab.freedesktop.org so I can harass you there? :)
<icecream95>
daniels: Because dynamic voltage scaling isn't being used, the GPU is kept at the same low voltage the firmware sets it to. I think -1 can undervolt better, so still works at the low voltage, but -0 needs a higher voltage to be stable
<icecream95>
Setting a maximum frequency of 400 MHz until voltage scaling arrives should make it more stable: echo 400000000 >/sys/class/devfreq/*.gpu/max_freq
icecream95 has quit [Ping timeout: 240 seconds]
<robmur01>
daniels: you know I'm just the pagetable guy, right? :P
<daniels>
think of it as a personal growth plan?
<daniels>
(more seriously, does this mean I should be tagging stepri01 for non-MMU things?)
<robmur01>
Why yes Office 365, the confirmation email most definitely deserves to be quarantined as a phishing attempt. Sigh...
<daniels>
Office365 is generally pretty skeptical of fd.o due to the volume of spam which comes through Mailman
<robmur01>
daniels: technically Steve and RobH are more officially involved than I am
<daniels>
sure :)
<robmur01>
I'm mostly squeezing it under my general "upstream kernel support" remit because it's more fun and interesting than reviewing SMMU patches ;)
<robmur01>
anyway, I'm in - usual work username because laziness
davidlt has quit [Ping timeout: 246 seconds]
nlhowell has quit [Ping timeout: 246 seconds]
davidlt has joined #panfrost
<alyssa>
tomeu: bbrezillon: I first saw that with the genxml attribute/varying series, but I couldn't bisect it because of the nondeterminism, and nothing stood out as wrong, so I thought it was a fluke..
<alyssa>
"think of it as a personal growth plan?" lol
<alyssa>
robmur01: the trick is to just quarantine EVERYTHING, as 2020 has taught us :p
<alyssa>
icecream95: hm, interesting. It's certainly a lot better on 4gb than 2gb as mentioned. I also run without swap at all since I'm stubborn, so that isn't helping :)
nlhowell has joined #panfrost
<alyssa>
icecream95: Oh, TIL about heaptrack, neat!
<alyssa>
seems a lot more pleasant to use for leaks than valgrind :)
davidlt has quit [Read error: Connection reset by peer]
raster has joined #panfrost
davidlt has joined #panfrost
guillaume_g has quit [Quit: Konversation terminated!]
Elpaulo has quit [Read error: Connection reset by peer]
Elpaulo has joined #panfrost
BenG83 has quit [Ping timeout: 246 seconds]
<HdkR>
alyssa: Can confirm, heaptrack is great
<HdkR>
Really helped me smash down small allocations
<alyssa>
HdkR: :D
<HdkR>
TFW hunting a SIGBUS in an application that catches SIGBUS
raster has quit [Remote host closed the connection]
<alyssa>
;-;
<HdkR>
What's even more fun is that it seems to be a SIGBUS that my SIGBUS handler just doesn't catch ¯\_(ツ)_/¯
<HdkR>
Oh frick frack, I missed a commit, so sigprocmask was killing it :|
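[editor's note] A handler that never seems to run because of sigprocmask can be reproduced in miniature. The sketch below is an assumption-laden illustration, not HdkR's code: it uses SIGUSR1 sent with raise() so that it is safe to run, and the function name is hypothetical. A blocked signal sent this way just stays pending until it is unblocked. A real SIGBUS caused by a hardware fault is different: POSIX leaves the result undefined when such a signal is generated while blocked, and on Linux the process is typically killed rather than the handler being called.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

static volatile sig_atomic_t handled = 0;

static void on_signal(int signo)
{
    (void)signo;
    handled = 1;
}

/*
 * Hypothetical helper: returns true if the handler did not run while the
 * signal was blocked, the signal was reported as pending, and the handler
 * ran once the original mask was restored.
 */
static bool blocked_signal_is_deferred(void)
{
    struct sigaction sa = {0};
    sigset_t block, old, pending;

    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return false;

    /* Block SIGUSR1, remembering the previous mask. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return false;

    handled = 0;
    raise(SIGUSR1);                       /* queued, not delivered */
    bool ran_while_blocked = handled != 0;

    sigemptyset(&pending);
    sigpending(&pending);
    bool was_pending = sigismember(&pending, SIGUSR1) == 1;

    /* Restoring the old mask lets the pending signal be delivered. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    bool ran_after_unblock = handled != 0;

    return !ran_while_blocked && was_pending && ran_after_unblock;
}
```

Calling blocked_signal_is_deferred() from a single-threaded program should return true on a POSIX system, assuming SIGUSR1 was not already blocked by the caller.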
<urjaman>
catching SIGBUS sounds oddly like you're talking public transit :P
<HdkR>
I tried to catch the SIGBUS but it turns out it was SIGILL
gcl_ has joined #panfrost
gcl has quit [Ping timeout: 240 seconds]
gcl_ has quit [Ping timeout: 246 seconds]
<HdkR>
Oh, Valhal device is arriving today
<HdkR>
Valhall even
gcl has joined #panfrost
<Lyude>
HdkR: see you soon in Valhall[a]
<HdkR>
:P
davidlt has quit [Ping timeout: 240 seconds]
stikonas has quit [Remote host closed the connection]
davidlt has joined #panfrost
jgmdev has joined #panfrost
jgmdev has quit [Client Quit]
AreaScout_ has quit [Ping timeout: 240 seconds]
ezequielg has quit [Read error: Connection reset by peer]
enunes has quit [Read error: Connection reset by peer]
enunes has joined #panfrost
ezequielg has joined #panfrost
enunes has quit [Ping timeout: 240 seconds]
davidlt has quit [Ping timeout: 256 seconds]
stikonas has joined #panfrost
enunes has joined #panfrost
buzzmarshall has joined #panfrost
<Lyude>
tomeu: do you have any idea how the panfrost tests in IGT get built for autotools? I thought this would be more obvious but I don't see anything listed in tests/Makefile.sources
<alyssa>
Lyude: *distant voice* they don't
<Lyude>
alyssa: figured it might be something like that, I'm just a little surprised because it seems like something we test in CI according to the gitlab pipeline from here: https://patchwork.freedesktop.org/series/74811/
<Lyude>
oh wait
<Lyude>
duh, it says right there | grep -v vc4\|v4d\|panfrost
* alyssa
shrugs
* Lyude
has answered her question :), will just make nouveau exempt from that check as well