<bbrezillon>
stepri01: so, on second thought, I'm not sure having the drm_sched_start() call protected by the reset_lock makes things easier. We still have a race if a timeout handler is called before the reset_lock is released, and in that case we never reset the GPU.
<stepri01>
bbrezillon: yeah I think we need a better way of coordinating the two schedulers. Rather than assuming that nothing needs to be done if the mutex is held, the code should somehow check whether a reset is going to happen and either bail out (if it is), or block on the mutex and then trigger another reset if necessary
<stepri01>
kbase has a kbase_prepare_to_reset_gpu() function (and friends) which keeps track of whether a reset is in progress or not and ensures the dance happens correctly. But equally, kbase doesn't have to deal with two schedulers driving the same GPU, which is Panfrost's problem
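A minimal sketch of the "check whether a reset is going to happen, else bail out" idea, written as standalone C11 rather than kernel code. It is not kbase's or Panfrost's implementation: `struct gpu_reset`, `gpu_reset_claim()` and `gpu_reset_complete()` are hypothetical names. Every scheduler's timeout handler would try to claim the reset with an atomic compare-and-swap; only the winner resets, and the others can bail out knowing a reset is already on its way. Clearing the state before the schedulers are restarted is an assumed ordering meant to avoid the race described above, where a timeout on a resubmitted job is silently dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical reset bookkeeping shared by every scheduler driving the
 * same GPU. Illustration only; not kbase or Panfrost code.
 */
enum reset_state {
	RESET_NOT_PENDING = 0, /* nobody has asked for a reset */
	RESET_PENDING = 1,     /* a timeout handler owns the upcoming reset */
};

struct gpu_reset {
	atomic_int state;
	atomic_int resets_done; /* completed resets, for illustration only */
};

static void gpu_reset_init(struct gpu_reset *r)
{
	atomic_init(&r->state, RESET_NOT_PENDING);
	atomic_init(&r->resets_done, 0);
}

/*
 * Called from any scheduler's timeout handler. Returns true if the caller
 * won the race and must perform the reset. Returns false if a reset is
 * already pending: the caller can bail out, since that reset also
 * recovers its job.
 */
static bool gpu_reset_claim(struct gpu_reset *r)
{
	int expected = RESET_NOT_PENDING;

	return atomic_compare_exchange_strong(&r->state, &expected,
					      RESET_PENDING);
}

/*
 * Called by the winner once the hardware reset is done. The state is
 * cleared BEFORE the schedulers are restarted (assumed ordering), so a
 * timeout on a resubmitted job can claim a fresh reset instead of
 * seeing "reset pending" and being dropped.
 */
static void gpu_reset_complete(struct gpu_reset *r)
{
	int prev = atomic_exchange(&r->state, RESET_NOT_PENDING);

	assert(prev == RESET_PENDING); /* only the claimer may complete */
	atomic_fetch_add(&r->resets_done, 1);
	/* ...restart the schedulers here, e.g. with drm_sched_start()... */
}
```

The compare-and-swap replaces "is the mutex held?" with an explicit question, "is a reset already owned by someone?", which is the distinction stepri01 is pointing at.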
<bbrezillon>
yep
<bbrezillon>
and drm_sched_job_timedout() seems to expect the timeout handling to happen synchronously (it restarts the timer after calling ->job_timedout()), which is another problem
<bbrezillon>
otherwise we could schedule a separate work item to do the reset
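A rough, single-threaded model of deferring the reset to a separate work item, under stated assumptions: `struct deferred_reset`, `queue_reset()`, `on_job_timeout()` and `run_reset_work()` are hypothetical, and the "work queue" is a single flag that is run explicitly rather than a real kernel workqueue. The point it illustrates is that the timeout handler only queues the reset and returns, so the scheduler core can re-arm its timer immediately; a second timeout before the worker runs is harmless because queuing an already-queued item is a no-op (mirroring how the kernel's queue_work() returns false when the work is already pending).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-slot deferred work, standing in for a kernel work item. */
struct deferred_reset {
	bool queued;     /* reset work is queued but has not run yet */
	int timer_armed; /* how many times the timeout timer was re-armed */
	int resets_run;  /* how many times the reset work actually executed */
};

/* Queue the reset; returns false if it was already queued (no-op). */
static bool queue_reset(struct deferred_reset *d)
{
	if (d->queued)
		return false;
	d->queued = true;
	return true;
}

/* Asynchronous timeout handler: it only queues the reset and returns. */
static void on_job_timeout(struct deferred_reset *d)
{
	queue_reset(d);
	/* Models the scheduler core re-arming its timer after the handler. */
	d->timer_armed++;
}

/* Worker context: performs the (simulated) GPU reset later on. */
static void run_reset_work(struct deferred_reset *d)
{
	if (!d->queued)
		return;
	d->queued = false;
	d->resets_run++;
}
```

Two timeouts followed by one worker run yield exactly one reset, which is the behaviour the synchronous ->job_timedout() contract makes awkward to get today.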
<stepri01>
yeah - kbase mostly makes reset asynchronous - it's natural for the GPU design, but sadly doesn't fit well with the DRM architecture
<stepri01>
the other option is to simply stop using the reset hammer and actually handle job failure properly ;)
<bbrezillon>
so maybe the solution is to allow asynchronous timeout handling
<stepri01>
but there might be some bugs hiding which still require resets to recover from
<bbrezillon>
yep, that's what I was about to ask
<bbrezillon>
I guess we sometimes need a reset
<stepri01>
kbase effectively has a watchdog for when the GPU stops behaving
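To make the watchdog idea concrete, here is a small sketch. It is not how kbase implements its watchdog; `struct watchdog` and `watchdog_tick()` are hypothetical. The assumption is that some monotonically increasing progress counter (for example, a count of completed jobs) is sampled periodically: if jobs are in flight and the counter has not moved for `limit` consecutive ticks, the GPU is treated as hung and a reset is requested.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical watchdog state; not kbase's actual implementation. */
struct watchdog {
	unsigned long last_progress; /* progress counter seen at the last tick */
	unsigned int stalled_ticks;  /* consecutive busy ticks with no progress */
	unsigned int limit;          /* stalled ticks tolerated before a reset */
};

/*
 * Called periodically. `progress` is any monotonically increasing count
 * of finished work; `busy` says whether jobs are in flight. Returns true
 * when the GPU looks hung and a reset should be requested.
 */
static bool watchdog_tick(struct watchdog *w, unsigned long progress, bool busy)
{
	if (!busy || progress != w->last_progress) {
		/* Idle, or work completed since the last tick: all good. */
		w->last_progress = progress;
		w->stalled_ticks = 0;
		return false;
	}
	if (++w->stalled_ticks < w->limit)
		return false;
	w->stalled_ticks = 0; /* start counting again after requesting a reset */
	return true;
}
```

With `limit = 3`, three consecutive busy ticks with an unchanged counter trigger a reset request, while an idle GPU never does.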
<bbrezillon>
so the problem remains
<bbrezillon>
this being said, I'd like to find a fix that does not involve invasive changes :)
<tomeu>
can't we just mainline kbase? :p
<alyssa>
please no
<alyssa>
:p
<Lyude>
no
<Lyude>
my response means nothing but i don't like kbase :P
<alyssa>
tomeu: A channel op said it, how can you disagree? :p
<kinkinkijkin>
im not entirely sure what kbase is in this context and that frightens me
<alyssa>
kinkinkijkin: Arm's kernel driver for Mali Midgard+
<kinkinkijkin>
oh THAT
<macc24>
nope nope nope nope
<alyssa>
Notoriously legacy code, that's why we're here ;p
<kinkinkijkin>
might as well merge the libmali blob into mesa while you're at it
<kinkinkijkin>
blobs*
<HdkR>
#include <libmali.so>
<daniels>
i mean that was basically lima
<macc24>
daniels: huh?
<daniels>
the original lima driver used kbase, did extremely primitive command-stream construction in standalone demo programs, and linked to Mali's offline shader compiler DSO to do all the compilation
<macc24>
that's... cursed...
<narmstrong>
Luc Verhaegen is a weird guy. He spent an infinite amount of time on r-e, but when asked to make something that could be useful long-term, he said nah, I prefer hacking my stuff on my side and doing weird q3 demos while adapting the GL API to fit my hacks
<narmstrong>
we lost 5 years until Qiang took over; we could have had lima upstream in Linux & Mesa in 2012
<narmstrong>
it's insane
<HdkR>
There's no need to bang on history. Everyone has different motivations that may not necessarily align with what people want.
<narmstrong>
yeah, no offense, he really changed things by r-e'ing lima
* alyssa <-- certified weirdo
<narmstrong>
alyssa: all people on this channel could be certified weirdos :-)
<narmstrong>
I mean idling on IRC on a GPU r-e dev channel
<alyssa>
well.. :p
<kinkinkijkin>
don't look at me, i failed my weirdo certification by missing my exam date