alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
warpme_ has quit [Quit: Connection closed for inactivity]
robmur01_ has joined #panfrost
robmur01 has quit [Ping timeout: 260 seconds]
robmur01 has joined #panfrost
robmur01_ has quit [Ping timeout: 264 seconds]
karolherbst has joined #panfrost
stikonas has quit [Remote host closed the connection]
paulk-leonov has quit [Ping timeout: 246 seconds]
kaspter has quit [Ping timeout: 240 seconds]
kaspter has joined #panfrost
karolherbst has quit [Quit: duh 🐧]
raster has quit [Quit: Gettin' stinky!]
lvrp16 has joined #panfrost
<chewitt> it worked for me (tm)
<chewitt> kodi restored to normal colours again
<chewitt> now to put @narmstrong's patch for afbc in the gxm (midgard) drm driver and see if that still results in wonky colours
paulk-leonov has joined #panfrost
kaspter has quit [Ping timeout: 246 seconds]
kaspter has joined #panfrost
karolherbst has joined #panfrost
<chewitt> nope, that's still wonky :)
steev has quit [Read error: Connection reset by peer]
steev has joined #panfrost
karolherbst has quit [Ping timeout: 268 seconds]
vstehle has quit [Ping timeout: 246 seconds]
karolherbst has joined #panfrost
paulk-leonov has quit [Ping timeout: 260 seconds]
paulk-leonov has joined #panfrost
paulk-leonov has quit [Ping timeout: 264 seconds]
paulk-leonov has joined #panfrost
davidlt has joined #panfrost
icecream95 has joined #panfrost
chewitt has quit [Read error: Connection reset by peer]
chewitt_ has joined #panfrost
vstehle has joined #panfrost
felipealmeida has quit [Ping timeout: 272 seconds]
icecream95 has quit [Ping timeout: 264 seconds]
icecream95 has joined #panfrost
stepri01 has quit [Quit: leaving]
archetech has joined #panfrost
<archetech> gpu sched timeout, js=1, config=0x7300, status=0x58, head=0x3121100, tail=0x3121100, sched_job=00000000e>
<archetech> : js fault, js=1, status=DATA_INVALID_FAULT, head=0x3121100, tail=0x3121100
<archetech> fyi, from manjaro 5.10rc1
<archetech> weston
<archetech> runs ok
nlhowell has quit [Ping timeout: 265 seconds]
<bbrezillon> archetech: what's the GPU?
<archetech> N2 g52
<bbrezillon> that's not surprising :-)
<archetech> no just an fyi
davidlt_ has joined #panfrost
davidlt has quit [Ping timeout: 240 seconds]
stepri01 has joined #panfrost
<macc24> archetech: do you have any patches applied?
<archetech> Linux manj-n2 5.10.0-rc1-1-MANJARO-ARM #1 SMP Mon Oct 26 19:03:39 CET 2020 aarch64 GNU/Linux
<archetech> stock
<archetech> I'll pass that to the manjaro dev
<stepri01> isn't that commit already in v5.10-rc1?
<bbrezillon> macc24: it really looks like a userspace bug
<bbrezillon> I might be wrong of course
<macc24> oh yeah it is in 5.10-rc1
<archetech> kernel: panfrost ffe40000.gpu: dev_pm_opp_set_regulators: no regulator (mali) found
<archetech> from module load at boot
icecream95 has quit [Ping timeout: 268 seconds]
<archetech> kernel: panfrost ffe40000.gpu: js fault, js=1, status=INSTR_INVALID_ENC, head=0x31>
<archetech> Oct 29 04:23:50 manj-n2 kernel: panfrost ffe40000.gpu: gpu sched timeout, js=1, config=0x7300, status=0x5
<archetech> later in boot
<archetech> js fault, js=1, status=DATA_INVALID_FAULT
<archetech> and a crash too ill paste that
<bbrezillon> stepri01: did you have any chance to look at my reply to "drm/panfrost: Fix a race in the job timeout handling (again)"?
<bbrezillon> I know this patch is not perfect, but I don't feel like going for more invasive changes right now
<bbrezillon> archetech: DATA_INVALID_FAULT points to a userspace bug (malformed cmdstream/shader)
<stepri01> sorry - I thought you were going to make some changes. AFAICT the WARN_ON is reachable (although the handling in that case is reasonable)
<stepri01> it might be that we can simply remove the WARN_ON and just handle the 'double start' case
<bbrezillon> stepri01: are you sure it's reachable?
<stepri01> well you say "one of the queue might be started while another thread (attached to a different queue) is resetting the GPU" - surely in that case after the reset another start attempt will be made?
<archetech> the effect of that bug is that plasmashell can't run, fyi
<bbrezillon> stepri01: but the timeout work shouldn't be scheduled before we actually start the queue, right
<bbrezillon> so how can we end up with another stop on the same queue in that case
<bbrezillon> we can certainly have stops on queues we have restarted before that (in the for loop)
<stepri01> can't we get a timeout (from the other slot) after the call to drm_sched_resubmit_jobs()?
<bbrezillon> hm, let me check if sched_resubmit_jobs() starts the timer
<stepri01> but either way we could (in theory) have a timeout after the first call to panfrost_scheduler_start() but before the second call in the loop
<bbrezillon> yes, but that should be a problem
<bbrezillon> (and no, sched_resubmit_jobs() does not start the timer)
<bbrezillon> *shouldn't
<stepri01> it would be a problem if the thread is scheduled out for 'a long time'. i.e.:
<bbrezillon> if you have a timeout on one of the previous queues, that means this queue has already been restarted
<stepri01> slot 1 times out, runs through the reset logic
<bbrezillon> and we already passed the WARN_ON()
<stepri01> the first slot is restarted, but then the thread hangs
<stepri01> the first slot (slot 0) times out, runs through the reset logic
<stepri01> the first slot timeout then re-starts *both* slots
<stepri01> the second slot timeout then wakes up to call the scheduler_start() function (again)
<bbrezillon> enters the critical reset section and *waits for all timeout handlers to be idle*
<stepri01> ... ah, yes I was just working that out as I typed ;p
<stepri01> yep - ok I now agree, it's safe. But this code could really do with a cleaner way of handling this.
<stepri01> if/when I've got some free time I might see if I can improve it
<bbrezillon> "But this code could really do with a cleaner way of handling this."
<bbrezillon> I couldn't agree more
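A minimal sketch of the reset ordering worked out in the exchange above. panfrost_scheduler_stop()/panfrost_scheduler_start(), the per-slot restart loop, drm_sched_resubmit_jobs() and the "wait for all timeout handlers to be idle" step come from the conversation; the struct layout, NUM_JOB_SLOTS, the work_tdr flush and the reset helper are illustrative assumptions, not the actual drm/panfrost code.

    #include <linux/workqueue.h>
    #include <drm/gpu_scheduler.h>
    /* struct panfrost_device, NUM_JOB_SLOTS, etc. would come from the driver headers. */

    /* Per-slot queue state as discussed above (layout is a guess). */
    struct panfrost_queue_state {
        struct drm_gpu_scheduler sched;
        bool stopped;                    /* set by panfrost_scheduler_stop() */
    };

    /* Invoked from the timeout handler of the slot whose job hung. */
    static void panfrost_reset_sketch(struct panfrost_device *pfdev,
                                      struct drm_sched_job *bad)
    {
        int i;

        /* Stop every queue so no new jobs reach the hardware... */
        for (i = 0; i < NUM_JOB_SLOTS; i++)
            panfrost_scheduler_stop(&pfdev->js->queue[i], bad);

        /*
         * ...and wait for the other slots' timeout handlers to finish (the
         * "critical reset section" above): a concurrent handler has either
         * already stopped its queue or is flushed right here, so it cannot
         * restart anything while the GPU is being reset.
         */
        for (i = 0; i < NUM_JOB_SLOTS; i++)
            cancel_delayed_work_sync(&pfdev->js->queue[i].sched.work_tdr);

        panfrost_device_reset(pfdev);        /* reset helper, name assumed */

        /* Re-queue the in-flight jobs, then restart each slot. As noted
         * above, drm_sched_resubmit_jobs() does not rearm the timeout. */
        for (i = 0; i < NUM_JOB_SLOTS; i++) {
            drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
            panfrost_scheduler_start(&pfdev->js->queue[i]);
        }
    }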
<stepri01> do we have a problem with the ordering of sched_stop and cancel_delayed_work_sync() though?
<bbrezillon> don't we call it twice?
<bbrezillon> uh, no, that should be fine
<bbrezillon> if a timeout was pending, the cancel_delayed_work_sync() call should force a stop
<bbrezillon> and if not, our panfrost_scheduler_stop() in the critical section should have stopped it already
<bbrezillon> or am I missing something
<stepri01> the call to cancel_delayed_work_sync() is after the call to stop the scheduler. so if another thread was stuck just before the final "restart scheduler" loop, then I can't see what stops that thread from restarting the scheduler (just before the reset)
<stepri01> sorry - no I'm missing something ;)
<bbrezillon> well we wait for all handlers to return ;)
<stepri01> or am I? (I'm getting confused I'm sure about that!)
<bbrezillon> actually, we might need another stop :)
<stepri01> if there's a thread at the comment about "restart scheduler" then we know:
<stepri01> a) the queue for that thread should be 'stopped=true'
<stepri01> b) the first cancel_delayed_work_sync() is only called if stopped=false
<stepri01> so the thread stuck at the comment won't be waited for until *after* the panfrost_scheduler_stop() call
<stepri01> I think we do need another stop
<stepri01> or a different condition for the first cancel_delayed_work_sync()
<stepri01> (or a complete rewrite... ;) )
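A compressed sketch of the ordering problem just described, reusing the assumed names from the sketch above; the conditional and the "another stop" idea are from the conversation, the exact code is illustrative.

    /* Ordering discussed above: the sync cancel is skipped for queues that
     * are already marked stopped... */
    if (!queue->stopped)
        cancel_delayed_work_sync(&queue->sched.work_tdr);
    /* ...so a handler that set stopped = true and is now parked just before
     * its "restart scheduler" loop is never waited for, and can restart the
     * queue right before (or during) the reset. */

    /* One of the fixes floated above: flush unconditionally and/or stop the
     * queue again afterwards, so a late handler cannot undo the stop. */
    cancel_delayed_work_sync(&queue->sched.work_tdr);
    panfrost_scheduler_stop(queue, NULL);    /* NULL: no specific bad job */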
<archetech> is this bug happening in the mesa driver or the kernel?
<bbrezillon> stepri01: the problem with the complete rewrite is that it involves touching the sched core if we really want to make things simple
<stepri01> yes I know - but in the end it might turn out easier than repeatedly fixing the code...
<stepri01> but I'm not going to block point fixes, so if you can make it work then that's fine too
<bbrezillon> how about that version?
<stepri01> it looks like it should work to me - might be worth giving it a stress test to see if we've missed anything
<stepri01> thanks for reworking it
stikonas has joined #panfrost
tomboy64 has quit [Remote host closed the connection]
tomboy64 has joined #panfrost
raster has joined #panfrost
<bbrezillon> archetech: it's a mesa driver bug
<archetech> ok
nlhowell has joined #panfrost
<robmur01> archetech: oh wait, GPU seeing nonsense data on Odroid N2? These patches aren't merged yet - https://patchwork.freedesktop.org/series/81713/#rev2
warpme_ has joined #panfrost
<robmur01> [ BTW do any drm-misc committers fancy picking that series up now that it has the relevant ack? :P ]
<chewitt_> If anyone needs a Tested-by for ^ those, I've had them in my kernel branch for a while now
chewitt_ is now known as chewitt
<archetech> robmur01: thanks, I should wait for the commit/patch for what bbrezillon fixed, I think
<bbrezillon> robmur01: duh, I thought those patches were merged already
<robmur01> given that the Amlogic platforms are quite popular any pretty much unusable without, maybe they can be justified as fixes for 5.10?
<robmur01> (particularly given LTS)
<archetech> is rc2 taking MRs or is its window closed?
<robmur01> oops, s/any/and/
<bbrezillon> robmur01: I can queue them, but I'd rather let narmstrong do it since he's also the meson maintainer
<archetech> idk what Linus's rules are for MRs
<archetech> vs patching rc1
<narmstrong> bbrezillon: hmm I'm OoO today, but tomorrow yes
<robmur01> cool - no great rush as far as I'm concerned, just trying to stay on top of things ;)
<bbrezillon> stepri01: crap, v2 is racy too (the scheduler stops dequeuing jobs at some point)
<stepri01> bbrezillon: Doh! I was a bit afraid there was a good reason for the if (!stopped) condition, but I (still) haven't worked out how it actually goes wrong
archetech has quit [Quit: Konversation terminated!]
archetech has joined #panfrost
<bbrezillon> stepri01: well, I only added it as an optimization IIRC, so if there's something helping there, it was not intentional :)
<stepri01> do you end up with a kernel thread stuck, or is the scheduler just not picking up new jobs?
<bbrezillon> I think it's the latter, but I'll double check
alpernebbi has joined #panfrost
<bbrezillon> stepri01: ok, so the schedulers are restarted, but things are blocked
<bbrezillon> unloading/reloading panfrost unblocks the situation
<bbrezillon> that doesn't make any sense
<brads> I want to make it work with PREEMPT_RT so no stuck blocking threads please :) I'll buy beers
<brads> you can put me in the weirdo class haha
<alyssa> might I suggest formal methods? ;p
<HdkR> Formal method? Is that when I complain on IRC?
<brads> with a bit of time I'll test it (if I can get a mesa/panfrost dev install working on debian with RT)
<brads> maybe I can make a robot to detect and swat flies and take frame-by-frame pictures as the action happens ;p
<robmur01> real-time expectations vs. the GPU's hardware job manager? I admire your optimism :P
<alyssa> ^^
<alyssa> If you want RT use the CPU ;P
<archetech> brads: I'll test your bullseye if ya make it
<archetech> or simpler: I have bullseye, so I just need 5.10 / mesa-git debs
kaspter has quit [Ping timeout: 240 seconds]
kaspter has joined #panfrost
camus1 has joined #panfrost
davidlt_ has quit [Read error: Connection reset by peer]
kaspter has quit [Ping timeout: 240 seconds]
camus1 is now known as kaspter
gtucker has joined #panfrost
davidlt has joined #panfrost
alpernebbi has quit [Quit: alpernebbi]
kaspter has quit [Ping timeout: 240 seconds]
kaspter has joined #panfrost
<bbrezillon> stepri01: let's see if that works => https://gitlab.freedesktop.org/-/snippets/1290
<stepri01> bbrezillon: it's certainly longer ;) I'll take a look
<stepri01> However I've spotted another issue with your original commit: https://gitlab.arm.com/linux-arm/linux-sp/-/commit/0668071cdf4b4f305870870de209024e111a0a60
<stepri01> bbrezillon: looks reasonable, but then so did your previous version! Hopefully your testing will show if it works
<bbrezillon> stepri01: ouch
robclark has quit [Read error: Connection reset by peer]
robmur01_ has joined #panfrost
robclark has joined #panfrost
<bbrezillon> stepri01: and no, the version moving the reset to its own work doesn't work
<stepri01> :(
<bbrezillon> stepri01: can you send the fix you have with my R-b?
<stepri01> sure - I've been tracking down some other issues too. But I'll send that one out now (thanks for the R-b)
robmur01 has quit [Ping timeout: 264 seconds]
<alyssa> stepri01: gitlab.arm.com? Is that new?
<stepri01> yes, shiny and new ;)
<stepri01> we had a previous external git, but it was very underpowered
<alyssa> Shiny :o
<alyssa> stepri01: was kbase this racy?
robmur01_ is now known as robmur01
<robmur01> woop, we've finally announced the thing
<alyssa> robmur01: bwah?
camus1 has joined #panfrost
kaspter has quit [Ping timeout: 268 seconds]
camus1 is now known as kaspter
kaspter has quit [Remote host closed the connection]
kaspter has joined #panfrost
archetech has quit [Quit: Konversation terminated!]
archetech has joined #panfrost
<alyssa> daniels: 👀
ckeepax has quit [Ping timeout: 260 seconds]
felipealmeida has joined #panfrost
kaspter has quit [Ping timeout: 240 seconds]
<bbrezillon> stepri01: ok, I think I found the culprit (see this comment https://gitlab.freedesktop.org/-/snippets/1290#LC135)
kaspter has joined #panfrost
raster has quit [Quit: Gettin' stinky!]
raster has joined #panfrost
camus1 has joined #panfrost
kaspter has quit [Ping timeout: 246 seconds]
camus1 is now known as kaspter
ckeepax has joined #panfrost
archetech has quit [Remote host closed the connection]
davidlt has quit [Ping timeout: 240 seconds]
archetech has joined #panfrost
archetech has quit [Client Quit]
vstehle has quit [Ping timeout: 260 seconds]
vstehle has joined #panfrost
nhp has quit [Quit: ZNC 1.8.1 - https://znc.in]
nhp has joined #panfrost
nhp has quit [Client Quit]