<bbrezillon>
stepri01: did you get a chance to look at my reply to "drm/panfrost: Fix a race in the job timeout handling (again)"?
<bbrezillon>
I know this patch is not perfect, but I don't feel like going for more invasive changes right now
<bbrezillon>
archetech: DATA_INVALID_FAULT points to a userspace bug (malformed cmdstream/shader)
<stepri01>
sorry - I thought you were going to make some changes. AFAICT the WARN_ON is reachable (although the handling in that case is reasonable)
<stepri01>
it might be that we can simply remove the WARN_ON and just handle the 'double start' case
<bbrezillon>
stepri01: are you sure it's reachable?
<stepri01>
well you say "one of the queue might be started while another thread (attached to a different queue) is resetting the GPU" - surely in that case after the reset another start attempt will be made?
<archetech>
the effect of that bug is that plasmashell can't run, FYI
<bbrezillon>
stepri01: but the timeout work shouldn't be scheduled before we actually start the queue, right
<bbrezillon>
so how can we end up with another stop on the same queue in that case
<bbrezillon>
we can certainly have stops on queues we have restarted before that (in the for loop)
<stepri01>
can't we get a timeout (from the other slot) after the call to drm_sched_resubmit_jobs()?
<bbrezillon>
hm, let me check if sched_resubmit_jobs() starts the timer
<stepri01>
but either way we could (in theory) have a timeout after the first call to panfrost_scheduler_start() but before the second call in the loop
<bbrezillon>
yes, but that shouldn't be a problem
<bbrezillon>
(and no, sched_resubmit_jobs() does not start the timer)
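For reference, in the 5.10-era scheduler the timeout work is only re-armed by drm_sched_start() when full recovery is requested; drm_sched_resubmit_jobs() just re-queues the jobs that were in flight. A paraphrased (not verbatim) sketch:

    void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
    {
            /* ... re-register the done-fence callbacks of the jobs still in the mirror list ... */

            if (full_recovery)
                    drm_sched_start_timeout(sched);  /* re-arms sched->work_tdr */

            kthread_unpark(sched->thread);
    }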
<stepri01>
it would be a problem if the thread is scheduled out for 'a long time'. i.e.:
<bbrezillon>
if you have a timeout on one of the previous queues, that means this queue has already been restarted
<stepri01>
slot 1 times out, runs through the reset logic
<bbrezillon>
and we already passed the WARN_ON()
<stepri01>
slot 0 (the first slot in the restart loop) is restarted, but then the thread is scheduled out before it can restart slot 1
<stepri01>
the first slot (slot 0) times out, runs through the reset logic
<stepri01>
the slot 0 timeout handler then re-starts *both* slots
<stepri01>
the slot 1 timeout handler then wakes up and calls the scheduler_start() function (again), i.e. the double start
<bbrezillon>
the slot 0 handler enters the critical reset section and *waits for all timeout handlers to be idle*
<stepri01>
... ah, yes I was just working that out as I typed ;p
<stepri01>
yep - ok I now agree, it's safe. But this code could really do with a cleaner way of handling this.
<stepri01>
if/when I've got some free time I might see if I can improve it
<bbrezillon>
"But this code could really do with a cleaner way of handling this."
<bbrezillon>
I couldn't agree more
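For anyone following along, the timeout path under discussion has roughly this shape. This is a schematic reconstruction from the conversation and the 5.10-era panfrost_job.c, not the exact patch under review; the queue->stopped flag, panfrost_scheduler_stop()/panfrost_scheduler_start() and the reset lock are the pieces named above, everything else is sketched:

    static void panfrost_job_timedout(struct drm_sched_job *sched_job)
    {
            struct panfrost_job *job = to_panfrost_job(sched_job);
            struct panfrost_device *pfdev = job->pfdev;
            int js = panfrost_job_get_slot(job);
            int i;

            /* Stop the queue that timed out; bail out if it was already stopped. */
            if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
                    return;

            /* Critical reset section: only one thread resets the GPU at a time. */
            mutex_lock(&pfdev->reset_lock);

            for (i = 0; i < NUM_JOB_SLOTS; i++) {
                    struct drm_gpu_scheduler *sched = &pfdev->js->queue[i].sched;

                    /* Wait for an in-flight timeout handler, but only for a
                     * queue that is still active ("the first cancel"). */
                    if (!pfdev->js->queue[i].stopped)
                            cancel_delayed_work_sync(&sched->work_tdr);

                    panfrost_scheduler_stop(&pfdev->js->queue[i], NULL);
            }

            /* ... reset the GPU ... */

            for (i = 0; i < NUM_JOB_SLOTS; i++)
                    drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);

            mutex_unlock(&pfdev->reset_lock);

            /* Restart scheduler: a queue restarted early in this loop can time
             * out again before the loop finishes, which is the scenario walked
             * through above. */
            for (i = 0; i < NUM_JOB_SLOTS; i++)
                    panfrost_scheduler_start(&pfdev->js->queue[i]);
    }

The WARN_ON being debated is the one in panfrost_scheduler_start(), which fires if it is asked to start a queue that is not currently stopped (the "double start" case).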
<stepri01>
do we have a problem with the ordering of sched_stop and cancel_delayed_work_sync() though?
<bbrezillon>
don't we call it twice?
<bbrezillon>
uh, no, that should be fine
<bbrezillon>
if a timeout was pending, the cancel_delayed_work_sync() call should force a stop
<bbrezillon>
and if not, our panfrost_scheduler_stop() in the critical section should have stopped it already
<bbrezillon>
or am I missing something?
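The guarantee being leaned on here is just the stock workqueue one, assuming the scheduler timeout is driven by the delayed work sched->work_tdr as in the 5.10 scheduler:

    /*
     * cancel_delayed_work_sync() cancels the timer if the timeout work is
     * merely pending, and blocks until the callback returns if it is
     * already running; it returns true only if the work was still pending.
     */
    bool was_pending = cancel_delayed_work_sync(&sched->work_tdr);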
<stepri01>
The call to cancel_delayed_work_sync() is after the call to stop the scheduler. So if another thread was stuck just before the final "restart scheduler" loop, I can't see what stops that thread from restarting the scheduler (right before our reset)
<stepri01>
sorry - no I'm missing something ;)
<bbrezillon>
well we wait for all handlers to return ;)
<stepri01>
or am I? (I'm sure I'm just getting confused about this!)
<bbrezillon>
actually, we might need another stop :)
<stepri01>
if there's a thread at the comment about "restart scheduler" then we know:
<stepri01>
a) the queue for that thread should be 'stopped=true'
<stepri01>
b) the first cancel_delayed_work_sync() is only called if stopped=false
<stepri01>
so the thread stuck at the comment won't be waited for until *after* the panfrost_scheduler_stop() call
<stepri01>
I think we do need another stop
<stepri01>
or a different condition for the first cancel_delayed_work_sync()
<stepri01>
(or a complete rewrite... ;) )
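Spelling the race out against the sketch above (thread A = the stale handler parked at the "restart scheduler" comment, thread B = a new timeout handler; the A/B labels and the exact interleaving are just for illustration):

    /*
     *   A: finishes its reset, drops the reset lock, and is scheduled out
     *      right before the "restart scheduler" loop, so every queue it
     *      stopped is still stopped == true
     *   B: times out (e.g. its timer fired while A still held the lock),
     *      stops its own queue, and takes the reset lock once A drops it
     *   B: in the stop loop, sees stopped == true for the other queues,
     *      skips cancel_delayed_work_sync(), and therefore never waits for A
     *   B: starts resetting the GPU
     *   A: wakes up and calls panfrost_scheduler_start() on every queue
     *      while B's reset is still in progress
     *
     * The candidate fixes floated above: another stop, or a different
     * condition for the first cancel_delayed_work_sync() so that B does
     * wait for A.
     */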
<archetech>
is this bug in the mesa driver or the kernel?
<bbrezillon>
stepri01: the problem with the complete rewrite is that it involves touching the sched core if we really want to make things simple
<stepri01>
yes I know - but in the end it might turn out easier than repeatedly fixing the code...
<stepri01>
but I'm not going to block point fixes, so if you can make it work then that's fine too
<robmur01>
[ BTW do any drm-misc committers fancy picking that series up now that it has the relevant ack? :P ]
<chewitt_>
If anyone needs a Tested-by for ^ those, I've had them in my kernel branch for a while now
chewitt_ is now known as chewitt
<archetech>
robmur01: thanks. I think I should wait for the commit/patch with bbrezillon's fix
<bbrezillon>
robmur01: duh, I thought those patches were merged already
<robmur01>
given that the Amlogic platforms are quite popular and pretty much unusable without them, maybe they can be justified as fixes for 5.10?
<robmur01>
(particularly given LTS)
<archetech>
is rc2 still taking MRs, or is the window closed?
<bbrezillon>
robmur01: I can queue them, but I'd rather let narmstrong do it since he's also the meson maintainer
<archetech>
idk what Linus's rules are for MRs
<archetech>
vs patching rc1
<narmstrong>
bbrezillon: hmm, I'm OoO today, but tomorrow yes
<robmur01>
cool - no great rush as far as I'm concerned, just trying to stay on top of things ;)
<bbrezillon>
stepri01: crap, v2 is racy too (the scheduler stops dequeuing jobs at some point)
<stepri01>
bbrezillon: Doh! I was a bit afraid there was a good reason for the if (!stopped) condition, but I (still) haven't worked out how it actually goes wrong
archetech has quit [Quit: Konversation terminated!]
archetech has joined #panfrost
<bbrezillon>
stepri01: well, I only added it as an optimization IIRC, so if there's something helping there, it was not intentional :)
<stepri01>
do you end up with a kernel thread stuck, or is the scheduler just not picking up new jobs?
<bbrezillon>
I think it's the latter, but I'll double check
<bbrezillon>
stepri01: ok, so the schedulers are restarted, but things are blocked
<bbrezillon>
unloading/reloading panfrost unblocks the situation
<bbrezillon>
that doesn't make any sense
<brads>
I want to make it work with PREEMPT_RT, so no stuck/blocked threads please :) I'll buy beers
<brads>
you can put me in the weirdo class haha
<alyssa>
might I suggest formal methods? ;p
<HdkR>
Formal method? Is that when I complain on IRC?
<brads>
with a bit of time I'll test it (if I can make a mesa/panfrost dev install work on Debian with RT)
<brads>
maybe I can make a robot to detect and swat flies and take frame-by-frame pictures as the action happens ;p
<robmur01>
real-time expectations vs. the GPU's hardware job manager? I admire your optimism :P
<alyssa>
^^
<alyssa>
If you want RT use the CPU ;P
<archetech>
brads: I'll test your bullseye if ya make it
<archetech>
or simpler: I have bullseye, so I just need 5.10 / mesa-git debs