#panfrost on 2020-10-29 — irc logs at freenode.irclog.whitequark.org

2019-09-06 11:20 alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature

00:02 warpme_ has quit [Quit: Connection closed for inactivity]

00:23 robmur01_ has joined #panfrost

00:25 robmur01 has quit [Ping timeout: 260 seconds]

00:26 robmur01 has joined #panfrost

00:28 robmur01_ has quit [Ping timeout: 264 seconds]

00:41 karolherbst has joined #panfrost

00:52 stikonas has quit [Remote host closed the connection]

01:13 paulk-leonov has quit [Ping timeout: 246 seconds]

01:20 kaspter has quit [Ping timeout: 240 seconds]

01:20 kaspter has joined #panfrost

01:23 karolherbst has quit [Quit: duh 🐧]

01:23 raster has quit [Quit: Gettin' stinky!]

01:28 lvrp16 has joined #panfrost

01:45 <chewitt> it worked for me (tm)

01:46 <chewitt> kodi restored to normal colours again

01:47 <chewitt> now to put @narmstrong patch for afbc in the gxm (midgard) drm driver and see if that still results in wonky colours

01:47 paulk-leonov has joined #panfrost

01:48 kaspter has quit [Ping timeout: 246 seconds]

01:48 kaspter has joined #panfrost

01:51 karolherbst has joined #panfrost

01:58 <chewitt> nope, that's still wonky :)

02:00 steev has quit [Read error: Connection reset by peer]

02:01 steev has joined #panfrost

02:06 karolherbst has quit [Ping timeout: 268 seconds]

02:14 vstehle has quit [Ping timeout: 246 seconds]

02:32 karolherbst has joined #panfrost

02:47 paulk-leonov has quit [Ping timeout: 260 seconds]

02:56 paulk-leonov has joined #panfrost

03:15 paulk-leonov has quit [Ping timeout: 264 seconds]

03:16 paulk-leonov has joined #panfrost

04:18 davidlt has joined #panfrost

04:30 icecream95 has joined #panfrost

05:35 chewitt has quit [Read error: Connection reset by peer]

05:35 chewitt_ has joined #panfrost

06:00 vstehle has joined #panfrost

07:33 felipealmeida has quit [Ping timeout: 272 seconds]

07:36 icecream95 has quit [Ping timeout: 264 seconds]

07:36 icecream95 has joined #panfrost

08:15 stepri01 has quit [Quit: leaving]

08:32 archetech has joined #panfrost

08:33 <archetech> gpu sched timeout, js=1, config=0x7300, status=0x58, head=0x3121100, tail=0x3121100, sched_job=00000000e>

08:33 <archetech> : js fault, js=1, status=DATA_INVALID_FAULT, head=0x3121100, tail=0x3121100

08:33 <archetech> fyi fr manjaro 5.10rc1

08:33 <archetech> weston

08:33 <archetech> runs ok

08:36 nlhowell has quit [Ping timeout: 265 seconds]

08:38 <bbrezillon> archetech: what's the GPU?

08:39 <archetech> N2 g52

08:39 <bbrezillon> that's not surprizing :-)

08:39 <archetech> no just an fyi

08:41 davidlt_ has joined #panfrost

08:43 davidlt has quit [Ping timeout: 240 seconds]

08:44 stepri01 has joined #panfrost

08:44 <macc24> archetech: do you have any patches applied?

08:44 <archetech> Linux manj-n2 5.10.0-rc1-1-MANJARO-ARM #1 SMP Mon Oct 26 19:03:39 CET 2020 aarch64 GNU/Linux

08:45 <archetech> stock

08:45 <macc24> can you try with https://gitlab.freedesktop.org/tomeu/linux/-/commit/025731eebd1cec6da60fa2e5851c2286b85c06e3.diff applied?

08:46 <archetech> Ill pass that to man j dev

08:47 <stepri01> isn't that commit already in v5.10-rc1?

08:48 <bbrezillon> macc24: it really looks like a userspace bug

08:48 <bbrezillon> I might be wrong of course

08:49 <macc24> oh yeah it is in 5.10-rc1

09:08 <archetech> kernel: panfrost ffe40000.gpu: dev_pm_opp_set_regulators: no regulator (mali) found

09:08 <archetech> fr module load at boot

09:11 icecream95 has quit [Ping timeout: 268 seconds]

09:12 <archetech> kernel: panfrost ffe40000.gpu: js fault, js=1, status=INSTR_INVALID_ENC, head=0x31>

09:12 <archetech> Oct 29 04:23:50 manj-n2 kernel: panfrost ffe40000.gpu: gpu sched timeout, js=1, config=0x7300, status=0x5

09:13 <archetech> later in boot

09:13 <archetech> js fault, js=1, status=DATA_INVALID_FAULT

09:15 <archetech> and a crash too ill paste that

09:15 <bbrezillon> stepri01: did you have any chance to look at my reply to "drm/panfrost: Fix a race in the job timeout handling (again)"?

09:15 <bbrezillon> I know this patch is not perfect, but I don't feel like going for more invasive changes right now

09:17 <bbrezillon> archetech: DATA_INVALID_FAULT points to a userspace bug (malformed cmdstream/shader)

09:17 <stepri01> sorry - I thought you were going to make some changes. AFAICT the WARN_ON is reachable (although the handling in that case is reasonable)

09:17 <stepri01> it might be that we can simply remove the WARN_ON and just handle the 'double start' case

09:17 <bbrezillon> stepri01: are you sure it's reachable?

09:18 <stepri01> well you say "one of the queue might be started while another thread (attached to a different queue) is resetting the GPU" - surely in that case after the reset another start attempt will be made?

09:19 <archetech> the effect of that bug is plasmashell cant run fyi

09:23 <bbrezillon> stepri01: but the timeout work shouldn't be scheduled before we actually start the queue, right

09:23 <bbrezillon> so how can we end up with another stop on the same queue in that case

09:24 <bbrezillon> we can certainly have stops on queues we have restarted before that (in the for loop)

09:24 <stepri01> can't we get a timeout (from the other slot) after the call to drm_sched_resubmit_jobs()?

09:25 <bbrezillon> hm, let me check if sched_resubmit_jobs() starts the timer

09:25 <stepri01> but either way we could (in theory) have a timeout after the first call to panfrost_scheduler_start() but before the second call in the loop

09:25 <bbrezillon> yes, but that should be a problem

09:25 <bbrezillon> (and no, sched_resubmit_jobs() does not start the timer)

09:26 <bbrezillon> *shouldn't

09:26 <stepri01> it would be a problem if the thread is scheduled out for 'a long time'. i.e.:

09:26 <bbrezillon> if you have a timeout on one of the previous queue, that means this queue has already been restarted

09:26 <stepri01> slot 1 times out, runs through the reset logic

09:26 <bbrezillon> and we already passed the WARN_ON()

09:27 <stepri01> the first slot is restarted, but then the thread hangs

09:27 <stepri01> the first slot (slot 0) times out, runs through the reset logic

09:27 <stepri01> the first slot timeout then re-starts *both* slots

09:27 <stepri01> the second slot timeout then wakes up to call the scheduler_start() function (again)

09:27 <bbrezillon> enters the critical reset section and *waits for all timeout handlers to be idle*

09:28 <stepri01> ... ah, yes I was just working that out as I typed ;p

09:29 <stepri01> yep - ok I now agree, it's safe. But this code could really do with a cleaner way of handling this.

09:29 <stepri01> if/when I've got some free time I might see if I can improve it

09:29 <bbrezillon> "But this code could really do with a cleaner way of handling this."

09:30 <bbrezillon> I couldn't agree more

09:32 <stepri01> do we have a problem with the ordering of sched_stop and cancel_delayed_work_sync() though?

09:32 <bbrezillon> don't we call it twice?

09:33 <bbrezillon> uh, no, that should be fine

09:34 <bbrezillon> if a timeout was pending, the cancel_delayed_work_sync() call should force a stop

09:34 <bbrezillon> and if not, our panfrost_scheduler_stop() in the critical section should have stopped it already

09:35 <bbrezillon> or am I missing something

09:36 <stepri01> the call to cancel_delayed_work_sync() is after the call to stop the scheduler. so if another thread was stuck just before the final "restart scheduler" loop, then I can't see what stops that thread from restarting the scheduler (just before the reset)

09:36 <stepri01> sorry - no I'm missing something ;)

09:36 <bbrezillon> well we wait for all handlers to return ;)

09:37 <stepri01> or am I? (I'm getting confused I'm sure about that!)

09:37 <bbrezillon> actually, we might need another stop :)

09:37 <stepri01> if there's a thread at the comment about "restart scheduler" then we know:

09:37 <stepri01> a) the queue for that thread should be 'stopped=true'

09:38 <stepri01> b) the first cancel_delayed_work_sync() is only called if stopped=false

09:39 <stepri01> so the thread stuck at the comment won't be waited for until *after* the panfrost_scheduler_stop() call

09:40 <stepri01> I think we do need another stop

09:40 <stepri01> or a different condition for the first cancel_delayed_work_sync()

09:40 <stepri01> (or a complete rewrite... ;) )

09:41 <archetech> is this bug happening in mesa driver vs kernel?

09:46 <bbrezillon> stepri01: the problem with the complete rewrite is that it involves touching the sched core if we really want to make things simple

09:48 <stepri01> yes I know - but in the end it might turn out easier than repeatedly fixing the code...

09:48 <stepri01> but I'm not going to block point fixes, so if you can make it work then that's fine too

09:50 <bbrezillon> stepri01: https://gitlab.freedesktop.org/-/snippets/1288

09:50 <bbrezillon> how about that version?

10:06 <stepri01> it looks like it should work to me - might be worth giving it a stress test to see if we've missed anything

10:06 <stepri01> thanks for reworking it

10:15 stikonas has joined #panfrost

10:16 tomboy64 has quit [Remote host closed the connection]

10:19 tomboy64 has joined #panfrost

10:26 raster has joined #panfrost

10:27 <bbrezillon> archetech: it's a mesa driver bug

10:28 <archetech> ok

10:31 nlhowell has joined #panfrost

10:55 <robmur01> archetech: oh wait, GPU seeing nonsense data on Odroid N2? These patches aren't merged yet - https://patchwork.freedesktop.org/series/81713/#rev2

11:00 warpme_ has joined #panfrost

11:01 <robmur01> [ BTW do any drm-misc committers fancy picking that series up now that it has the relevant ack? :P ]

11:01 <chewitt_> If anyone needs a Tested-by for ^ those, I've had them in my kernel branch for a while now

11:01 chewitt_ is now known as chewitt

11:02 <archetech> robmur01: thks I should wait for the commit/patch from what bbrezillon fixed I think

11:04 <bbrezillon> robmur01: duh, I thought those patches were merged already

11:13 <robmur01> given that the Amlogic platforms are quite popular any pretty much unusable without, maybe they can be justified as fixes for 5.10?

11:13 <robmur01> (particularly given LTS)

11:13 <archetech> is rc2 taking MR's or its window is closed?

11:13 <robmur01> oops, s/any/and/

11:15 <bbrezillon> robmur01: I can queue them, but I'd rather let narmstrong do it since he's also the meson maintainer

11:17 <archetech> idk what linus's rules are for mr's

11:17 <archetech> vs patching rc1

11:17 <narmstrong> bbrezillon: hmm I’m OoO today, but tommorow yes

11:25 <robmur01> cool - no great rush as far as I'm concerned, just trying to stay on top of things ;)

11:29 <bbrezillon> stepri01: crap, v2 is racy too (the scheduler stops dequeuing jobs at some point)

11:30 <stepri01> bbrezillon: Doh! I was a bit afraid there was a good reason for the if (!stopped) condition, but I (still) haven't worked out how it actually goes wrong

11:32 archetech has quit [Quit: Konversation terminated!]

11:37 archetech has joined #panfrost

11:41 <bbrezillon> stepri01: well, I only added as an optimization IIRC, so if there's something helping there, it was not intentional :)

11:43 <stepri01> do you end up with a kernel thread stuck, or is the scheduler just not picking up new jobs?

11:44 <bbrezillon> I think it's the latter, but I'll double check

12:51 alpernebbi has joined #panfrost

13:07 <bbrezillon> stepri01: ok, to the schedulers are restarted, but things are blocked

13:07 <bbrezillon> unloading/reloading panfrost unblocks the situation

13:08 <bbrezillon> that doesn't make any sense

13:21 <brads> I want to make it work with PREEMPT_RT so no stuck blocking threads please :) I'll buy beers

13:28 <brads> you can put me in the weirdo class haha

13:28 <alyssa> might I suggest formal methods? ;p

13:29 <HdkR> Formal method? Is that when I complain on IRC?

13:47 <brads> with a bit of time i'll test it (if can make mesa/ panfrost dev install and work on debian with RT)

13:48 <brads> maybe I can make a robot to detect and swat fly's and take frame by frame pictures as the action happens ;p

13:54 <robmur01> real-time expectations vs. the GPU's hardware job manager? I admire your optimism :P

13:57 <alyssa> ^^

13:57 <alyssa> If you want RT use the CPU ;P

14:04 <archetech> brads: I'll test your bullseye if ya make it

14:07 <archetech> or simpler I have bullseye so just need a 5.10 / mesa-git debs

14:23 kaspter has quit [Ping timeout: 240 seconds]

14:23 kaspter has joined #panfrost

14:57 camus1 has joined #panfrost

14:57 davidlt_ has quit [Read error: Connection reset by peer]

14:58 kaspter has quit [Ping timeout: 240 seconds]

14:58 camus1 is now known as kaspter

14:58 gtucker has joined #panfrost

15:20 davidlt has joined #panfrost

15:31 alpernebbi has quit [Quit: alpernebbi]

15:32 kaspter has quit [Ping timeout: 240 seconds]

15:34 kaspter has joined #panfrost

16:11 <bbrezillon> stepri01: let's see if that works => https://gitlab.freedesktop.org/-/snippets/1290

16:16 <stepri01> bbrezillon: it's certainly longer ;) I'll take a look

16:17 <stepri01> However I've spotted another issue with your original commit: https://gitlab.arm.com/linux-arm/linux-sp/-/commit/0668071cdf4b4f305870870de209024e111a0a60

16:39 <stepri01> bbrezillon: looks reasonable, but then so did your previous version! Hopefully your testing will show if it works

16:54 <bbrezillon> stepri01: ouch

16:55 robclark has quit [Read error: Connection reset by peer]

16:55 robmur01_ has joined #panfrost

16:55 robclark has joined #panfrost

16:56 <bbrezillon> stepri01: and no, the version moving the reset to its own work doesn't work

16:56 <stepri01> :(

16:56 <bbrezillon> stepri01: can you send the fix you have with my R-b?

16:58 <stepri01> sure - I've been tracking down some other issues too. But I'll send that one out now (thanks for the R-b)

16:59 robmur01 has quit [Ping timeout: 264 seconds]

17:07 <alyssa> stepri01: gitlab.arm.com? Is that new?

17:08 <stepri01> yes, shiny and new ;)

17:08 <stepri01> we had a previous external git, but it was very underpowered

17:10 <alyssa> Shiny :o

17:42 <alyssa> stepri01: was kbase this racy?

18:03 robmur01_ is now known as robmur01

18:07 <robmur01> woop, we've finally announced the thing

18:08 <alyssa> robmur01: bwah?

18:24 camus1 has joined #panfrost

18:24 kaspter has quit [Ping timeout: 268 seconds]

18:26 camus1 is now known as kaspter

18:27 <daniels> alyssa: https://www.arm.com/company/news/2020/10/morello-program-one-year-on

18:29 kaspter has quit [Remote host closed the connection]

18:29 kaspter has joined #panfrost

18:31 archetech has quit [Quit: Konversation terminated!]

18:44 archetech has joined #panfrost

18:45 <alyssa> daniels: 👀

19:24 ckeepax has quit [Ping timeout: 260 seconds]

19:32 felipealmeida has joined #panfrost

19:57 kaspter has quit [Ping timeout: 240 seconds]

20:05 <bbrezillon> stepri01: ok, I think I found the culprit (see this comment https://gitlab.freedesktop.org/-/snippets/1290#LC135)

20:07 kaspter has joined #panfrost

20:48 raster has quit [Quit: Gettin' stinky!]

20:59 raster has joined #panfrost

21:01 camus1 has joined #panfrost

21:01 kaspter has quit [Ping timeout: 246 seconds]

21:02 camus1 is now known as kaspter

21:19 ckeepax has joined #panfrost

21:58 archetech has quit [Remote host closed the connection]

22:21 davidlt has quit [Ping timeout: 240 seconds]

23:24 archetech has joined #panfrost

23:24 archetech has quit [Client Quit]

23:45 vstehle has quit [Ping timeout: 260 seconds]

23:46 vstehle has joined #panfrost

23:52 nhp has quit [Quit: ZNC 1.8.1 - https://znc.in]

23:52 nhp has joined #panfrost

23:56 nhp has quit [Client Quit]