#panfrost on 2020-11-04 — irc logs at freenode.irclog.whitequark.org

2019-09-06 11:20 alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature

00:03 youcai has quit [Read error: Connection reset by peer]

00:04 youcai has joined #panfrost

00:15 <alyssa> robmur01: \o/

00:17 <HdkR> Woo bifrost

00:58 raster has quit [Quit: Gettin' stinky!]

00:59 archetech has quit [Quit: Konversation terminated!]

01:17 vstehle has quit [Read error: Connection reset by peer]

02:12 stikonas has quit [Remote host closed the connection]

02:24 camus1 has joined #panfrost

02:24 kaspter has quit [Ping timeout: 264 seconds]

02:24 camus1 is now known as kaspter

02:39 kaspter has quit [Remote host closed the connection]

02:39 kaspter has joined #panfrost

02:44 kaspter has quit [Client Quit]

02:44 robink has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

02:45 robink has joined #panfrost

02:46 kaspter has joined #panfrost

02:46 kaspter has quit [Excess Flood]

02:47 kaspter has joined #panfrost

02:52 robink has quit [Ping timeout: 272 seconds]

02:52 robink has joined #panfrost

02:55 camus1 has joined #panfrost

02:56 kaspter has quit [Ping timeout: 256 seconds]

02:56 camus1 is now known as kaspter

03:09 <kinkinkijkin> duet will be around in a week or so

03:10 kaspter has quit [Quit: kaspter]

03:10 <kinkinkijkin> and ive just realized that, somewhat annoyingly, i actually cannot find anything on google about installing a base gnu distro directly on a chromebook

03:11 <kinkinkijkin> it's all about running a distro in a container on top of the existing chromeos

03:11 <kinkinkijkin> which is obviously not what i want

03:12 kaspter has joined #panfrost

03:21 icecream95 has joined #panfrost

03:28 <icecream95> kinkinkijkin: The instructions at https://archlinuxarm.org/platforms/armv8/rockchip/samsung-chromebook-plus should work, except the rootfs tarball linked from there has a too old kernel

03:29 <kinkinkijkin> thanks, bookmarking

03:29 <alyssa> icecream95: blog post is up, not sure if you saw, let me know if I butchered the description of your changes and we'll fix it 😇

03:37 <icecream95> alyssa: You misspelt 'OpenGL 3.3 with working geometry shaders' in the last line :P

03:40 <HdkR> Now we just need Valhall and a devboard that has it

03:40 <HdkR> :)

04:11 <anarsoul> I thought that midgard has geometry shaders

04:16 <HdkR> They are implemented as compute

04:18 <HdkR> I presume Bifrost would be the first to have it as a real hardware stage

05:01 <chewitt> no more PAN_MESA_DEBUG=bifrost! .. congrats and thanks to all involved :)

05:02 <HdkR> next up PAN_MESA_DEBUG=valhall

05:28 kaspter has quit [Remote host closed the connection]

05:28 kaspter has joined #panfrost

05:52 davidlt has joined #panfrost

06:04 kaspter has quit [Ping timeout: 256 seconds]

06:04 kaspter has joined #panfrost

06:43 <chewitt> :)

06:45 <icecream95> then PAN_MESA_DEBUG=nv

07:37 vstehle has joined #panfrost

07:41 nlhowell has joined #panfrost

08:01 <macc24> kinkinkijkin: when i get my duet i will make debian run on it without any containers :D

08:12 camus1 has joined #panfrost

08:12 kaspter has quit [Ping timeout: 256 seconds]

08:12 camus1 is now known as kaspter

08:19 chewitt has quit [Read error: Connection reset by peer]

08:19 chewitt has joined #panfrost

08:36 <narmstrong> VIM3L (G31) on GloDroid \o/

08:36 <narmstrong> rsglobal did all the job

08:37 <narmstrong> only integration with amlogic specific stuff was needed

08:41 <tomeu> nice!

08:44 stikonas has joined #panfrost

08:48 stikonas has quit [Remote host closed the connection]

08:54 stikonas has joined #panfrost

09:13 icecream95 has quit [Ping timeout: 260 seconds]

09:20 _whitelogger has joined #panfrost

09:29 raster has joined #panfrost

09:52 kaspter has quit [Ping timeout: 260 seconds]

09:52 camus1 has joined #panfrost

09:55 sphalerite has quit [Ping timeout: 260 seconds]

09:55 camus1 is now known as kaspter

09:57 alpernebbi has joined #panfrost

09:58 sphalerite has joined #panfrost

10:15 raster has quit [Quit: Gettin' stinky!]

10:20 raster has joined #panfrost

11:27 <brads> I now have more frost in my pan by adding "-Dgles2=true -Dglvnd=true -Dglx-direct=true -Dgbm=true -Ddri3=true", gnome runs like a rocket (definitely outperforms libMali now, no doubt) and my mouse has become smooth :)

11:43 kaspter has quit [Ping timeout: 240 seconds]

11:43 kaspter has joined #panfrost

11:44 archetech has joined #panfrost

11:48 <alyssa> icecream95: >:

11:49 <alyssa> HdkR: Bifrost has no native geom/tess either

12:47 nlhowell has quit [Quit: WeeChat 2.9]

12:48 nlhowell has joined #panfrost

13:08 kaspter has quit [Ping timeout: 256 seconds]

13:09 kaspter has joined #panfrost

13:11 <bbrezillon> robmur01, stepri01: I faced this error https://gitlab.freedesktop.org/-/snippets/1305, which makes me wonder if our MMU AS removal/re-assignment is safe

13:13 <bbrezillon> say we have a context that's assigned an AS on which an MMU fault happens, but by the time we reach the MMU fault handler (threaded IRQ), the AS gets re-assigned to a different context

13:13 <alyssa> Hmm can I bang out +ZS_EMIT support in the next 45 minutes? let's find out!

13:16 <robmur01> bbrezillon: hmm, so panfrost_mmu_map_fault_addr() tries to resolve the fault, looks up the wrong context and maps the page into someone else's pagetable?

13:17 <robmur01> bleh :(

13:17 <bbrezillon> well, that trace says it tries to map something that's already mapped

13:17 <bbrezillon> which means the region the fault happened on matches a heap BO in both contexts

13:18 <robmur01> yup, it seems entirely possible

13:18 <bbrezillon> so there shouldn't be any security issues, but we might try to map something that's not needed

13:18 <bbrezillon> or remap something that's already mapped (the case I hit here)

13:18 <alyssa> okay dropping the internet, hopefully bbiab with working z/s stuff, bbiab

13:19 <bbrezillon> robmur01: clearing IRQs when we re-assign an AS should help

13:19 <robmur01> a simple approach might be to just not reschedule an AS while it's in fault state, but that seems antiproductive...

13:19 <bbrezillon> but I'm not sure it's enough

13:20 <robmur01> since ideally a fault would be a great time to schedule something else in to do useful work while we resolve it :/

13:22 nlhowell has quit [Ping timeout: 240 seconds]

13:22 <robmur01> do we have anything to uniquely identify a context irrespective of which AS it happens to be running in (or not) at any given time?

13:32 <robmur01> Actually, isn't "heap BO in both contexts" the best case? If a legitimate fault on a heap BO is pending and we switch in a context where a fault at that address *isn't* valid, won't that end up killing the second (innocent) job?

13:33 <bbrezillon> yep

13:33 <bbrezillon> probably

13:34 <bbrezillon> I don't think we have anything identifying the context apart from the AS it's been assigned

13:35 <bbrezillon> maybe we should collect/clear faults in the hard irq and assign them to the currently bound context

13:35 <robmur01> suddenly I feel unusually glad that I need to go off and do other things now :P

13:35 <bbrezillon> :D

13:36 * bbrezillon regrets that the lockdown happened one week earlier in France

13:37 <stepri01> panfrost_mmu_as_get() does look faulty - it should only reclaim an address space if that other address space is actually free (i.e. not running a job on the hardware). There is a separate potential issue of the MMU fault handler still dealing with a fault *after* the job it belongs to has finished (one of those 'really shouldn't happen, but technically can' situations)

13:39 <brads> bbrezillon: just had screen freeze doing silly Xwayland stuff and these locks being held on closedown of glmark2 (CTRL-C) - https://pastebin.com/eSe5HLab

13:40 <stepri01> but I think we should hit a WARN_ON() in panfrost_mmu_as_get() if we attempt a reclaim on an in-use AS, so I'm not sure why that isn't triggering too if that's the bug

13:41 <bbrezillon> brads: do you have https://gitlab.freedesktop.org/bbrezillon/linux/-/commit/9f3211a185ec94950eeaba6486026d2e4ad9e0f5 applied?

13:42 <bbrezillon> stepri01: maybe the AS is no longer used but still has faults pending

13:43 <stepri01> that's the shouldn't really happen situation. If there's a fault pending the hardware will stall. However it is possible:

13:44 <stepri01> if the fault happens, and another action restarts the hardware (e.g. userspace maps/unmaps something on the GPU) then the job can continue, if it just so happens that the fault condition has gone away (e.g. userspace mapped something in the area that caused the fault) then the job can complete. And the kernel might then handle the JOB irq before it gets far enough with the MMU irq

13:45 <stepri01> the upshot is really we should synchronise with the MMU irq before reassigning an address space - but it's an unlikely situation as far as I know
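The race discussed above can be sketched as a toy model. This is not the panfrost code: `AddressSpace`, `handle_faults`, `reassign_unsafe`, `reassign_synced` and `replay` are hypothetical names, and the only assumption carried over from the discussion is that a deferred fault handler looks up the context through whatever is bound to the AS slot at handling time.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AddressSpace:
    """Toy stand-in for one hardware AS slot (hypothetical, not driver code)."""
    owner: Optional[str] = None                   # context currently bound to the slot
    pending: list = field(default_factory=list)   # fault addresses not yet handled


def handle_faults(slot, log):
    """Deferred handler: attributes each fault to the slot's *current* owner."""
    while slot.pending:
        log.append((slot.owner, slot.pending.pop(0)))


def reassign_unsafe(slot, new_owner, log):
    """Rebind the slot without looking at pending faults."""
    slot.owner = new_owner


def reassign_synced(slot, new_owner, log):
    """Drain pending faults for the old owner first, then rebind the slot."""
    handle_faults(slot, log)
    slot.owner = new_owner


def replay(reassign):
    """Fault under ctx_a, recycle the slot to ctx_b, then run the deferred handler."""
    log = []
    slot = AddressSpace(owner="ctx_a")
    slot.pending.append(0x1000)    # fault raised while ctx_a is bound
    reassign(slot, "ctx_b", log)   # slot recycled before the handler has run
    handle_faults(slot, log)       # deferred handler finally runs
    return log
```

With `reassign_unsafe` the fault ends up attributed to `ctx_b`; with `reassign_synced` it stays with `ctx_a`. A real fix would also have to deal with locking and with the hard-IRQ/threaded-IRQ split, which this sketch ignores.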

13:46 <bbrezillon> note that this has been triggered while debugging the timeout/reset handling stuff

13:46 <bbrezillon> so I had a lot of job faults and reset happening

13:46 <stepri01> ah - perhaps the DRM scheduler's idea of if a job is running is different from the hardware's...

13:48 <bbrezillon> and I still have a drm_sched hand BTW :'-(

13:48 <bbrezillon> *hang

13:49 <bbrezillon> which I can only reproduce on CI despite using the same kernel+config locally

13:50 <stepri01> :(

13:51 <bbrezillon> and as soon as I add traces, it goes away, of course

13:52 <brads> bbrezillon: it seems not, I might have to move to a newer kernel I think

14:06 <alyssa> answer: no, but I have the compiler side piped through and pushed

14:07 <alyssa> need to fix a few things on the cmdstream before the tests pass but class

14:07 <alyssa> bbrezillon: branch pushed if you want to take a look

14:25 <bbrezillon> stepri01: nailed it (I think)

14:26 kaspter has quit [Remote host closed the connection]

14:26 kaspter has joined #panfrost

14:28 <bbrezillon> stepri01: https://gitlab.freedesktop.org/bbrezillon/linux/-/commit/6fd2df0ae3defe6517ce1bf4fa46f1f836a58df1

14:48 kaspter has quit [Read error: Connection reset by peer]

14:49 kaspter has joined #panfrost

15:09 <stepri01> bbrezillon: cool - I hope it survives some testing this time! ;)

15:10 * alyssa shills for TLA+

15:18 archetech has quit [Quit: Konversation terminated!]

15:21 archetech has joined #panfrost

15:33 <bbrezillon> stepri01: well, it's hard to be sure given the number of times I thought I had it fixed

15:34 * alyssa shills more for TLA+

15:36 * bbrezillon waits for alyssa to convert the linux kernel (or even just the DRM part of it) to TLA+ :P

15:38 <daniels> 'the language is similar to LaTeX'

15:38 * daniels closes tab

15:38 <bbrezillon> :D

15:40 <alyssa> bbrezillon: It's not a programming language

15:40 <alyssa> It's a specification language, it's about precisely expressing what the system _should_ do.

15:41 <alyssa> The actual implementation is still in C or Rust or VHDL or whatever; it's ballparked that the actual code will be 10x larger than the spec.

15:41 <stepri01> I think at the moment the 'spec' is "run stuff on the hardware" and the code is significantly larger ;)

15:42 <alyssa> But the precision of it forces things to be really explicit, makes it possible to do formal proofs, and allows a lot of invariants to be machine-checked,

15:42 <alyssa> specialty is exposing concurrency bugs
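As a rough analogue of what machine-checking an invariant means (this is plain Python, not TLA+, and every name in it is made up for illustration), the sketch below enumerates every interleaving of two workers that each do a non-atomic `counter += 1` and reports the schedules that break the invariant "final counter is 2".

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def execute(schedule):
    """Run read/write steps of workers that each increment a shared counter."""
    counter = 0
    local = {}
    for worker, op in schedule:
        if op == "read":
            local[worker] = counter        # worker snapshots the shared value
        else:                              # "write"
            counter = local[worker] + 1    # worker stores its snapshot plus one
    return counter


def violations():
    """Return every schedule in which the final counter is not 2."""
    a = [("A", "read"), ("A", "write")]
    b = [("B", "read"), ("B", "write")]
    return [s for s in interleavings(a, b) if execute(s) != 2]
```

Only the two fully serial schedules keep the invariant; every schedule in which both reads happen before either write loses an update. A model checker does the same kind of exhaustive search over a much richer state space.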

15:43 <bbrezillon> if only I knew what drm_sched tries to do/expects :p

15:43 <alyssa> ^^ exactly :p

15:44 <daniels> I mean if you manage to write a meaningful spec for Mali I'll be _super_ impressed

15:44 <bbrezillon> more seriously, the real problem boils down to the fact that drm_sched expects things to be controlled at the queue/scheduler granularity, including resets, while panfrost wants 3 schedulers (one per job slot) and reset to happen globally

15:45 <stepri01> well what we really need is one queue that feeds the three slots, but where jobs can overtake other jobs to keep the hardware busy
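A minimal sketch of the "one queue feeding three slots, with overtaking" idea, assuming each job targets exactly one slot and ignoring dependencies between jobs. `dispatch`, the job names and the slot numbering are hypothetical; this is neither drm_sched nor kbase code.

```python
from collections import deque


def dispatch(queue, busy_slots):
    """Start every queued job whose target slot is idle, in submission order.

    queue: deque of (job_name, slot) pairs, oldest first (mutated in place).
    busy_slots: set of slot indices currently running a job (mutated in place).
    Returns the list of job names started in this pass.
    """
    started = []
    waiting = deque()
    while queue:
        name, slot = queue.popleft()
        if slot in busy_slots:
            waiting.append((name, slot))   # stays queued; later jobs may overtake it
        else:
            busy_slots.add(slot)
            started.append(name)
    queue.extend(waiting)                  # blocked jobs keep their relative order
    return started
```

For example, with `deque([("frag0", 0), ("frag1", 0), ("vert0", 1), ("comp0", 2)])` and no busy slots, one pass starts `frag0`, `vert0` and `comp0`, while `frag1` stays queued behind `frag0` on slot 0.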

15:45 <bbrezillon> we're bending the drm_sched logic to make it fit our needs

15:45 <alyssa> bbrezillon: ^ then that's the sort of level you spec at

15:46 <bbrezillon> stepri01: yes, that's also an option I thought about

15:46 <alyssa> Internals are a black box, but you would model drm_sched (as a black box) and mali's hw schedulers (as black boxes) with the interactions between them precisely specced

15:46 <stepri01> that's the logic that kbase uses - but instead of a "queue" it's a tree of dependencies

15:46 <alyssa> Assuming a priori that drm_sched and the hardware are both correct

15:46 <bbrezillon> stepri01: but drm_sched is really not designed for that

15:46 <bbrezillon> AFAICT

15:46 <stepri01> it certainly doesn't seem to be :(

15:47 <bbrezillon> so maybe the right thing to do would be to have our own scheduler

15:47 <stepri01> of course I wouldn't say kbase's scheduler is great either. I was responsible for the rewrite to the current design, but it was constrained somewhat by having to have a migration path from the previous design

15:48 <stepri01> and then of course it evolved

15:48 <stepri01> although my biggest bugbear is that atom ID 0 is 'special' but it doesn't need to be ;)

15:50 * alyssa squints

15:50 <alyssa> Did you say bugbear?

15:51 <stepri01> yes...?

15:51 * alyssa sighs

15:51 <alyssa> My name isn't Alyssa. It's Special Agent Sweetie Drops. I worked...

15:51 <alyssa> :p

15:52 <stepri01> *whoosh* not a reference I understand :p

15:53 <alyssa> mlp:fim

15:54 <stepri01> yeah I got that much from google - but my knowledge of mlp is very limited!

15:57 <alyssa> I'd send the video but trying to cram fragdepth support before the branchpoint

15:58 <tomeu> priorities!

16:31 <alyssa> bbrezillon: https://people.collabora.com/~alyssa/0001-L8-hack.patch

16:31 <alyssa> ^^ This hack fixes a bunch of L8/A8/L8A8 tests failing. Probably worth finding the root cause but if it's too complicated that'll be easy to backport later.

16:31 <bbrezillon> stepri01: looks like amdgpu has pretty much the same model, with one scheduler per queue (there's also the distinction between gfx and compute queues there) and a global reset used when the per queue reset is not possible

16:34 <stepri01> bbrezillon: yeah it should work - it just feels like we're hacking around a limitation in the drm scheduler unfortunately

16:37 <bbrezillon> stepri01: I had a quick look, and I could add a multi-queue sched, re-using most of the logic with the single-queue one, I'm just wondering if it's the right solution

16:38 <bbrezillon> s/re-using/sharing/

16:38 <stepri01> might be worth trying to get a view from the amdgpu folks to see if they'd be interested in it as well

16:39 <alyssa> ^^

16:50 <bbrezillon> stepri01: ok, let's get this hack/fix merged first

16:53 <stepri01> yes it would be good to get the fix in first

17:02 raster has quit [Quit: Gettin' stinky!]

17:08 raster has joined #panfrost

17:42 raster has quit [Quit: Gettin' stinky!]

18:08 raster has joined #panfrost

18:48 archetech has quit [Ping timeout: 265 seconds]

18:56 tomboy64 has quit [Remote host closed the connection]

18:57 tomboy64 has joined #panfrost

19:21 rando25892 has quit [Ping timeout: 272 seconds]

19:35 tgall_fo_ has joined #panfrost

19:36 tgall_foo has quit [Ping timeout: 256 seconds]

19:46 rando25892 has joined #panfrost

20:17 davidlt has quit [Ping timeout: 240 seconds]

20:17 mfilion has left #panfrost ["The Lounge - https://thelounge.chat"]

22:28 alpernebbi has quit [Quit: alpernebbi]

22:28 raster has quit [Quit: Gettin' stinky!]

22:29 stikonas has quit [Remote host closed the connection]

22:30 stikonas has joined #panfrost

22:58 camus1 has joined #panfrost

22:58 raster has joined #panfrost

22:59 kaspter has quit [Ping timeout: 260 seconds]

22:59 camus1 is now known as kaspter

23:18 archetech has joined #panfrost