#panfrost on 2019-08-28 — irc logs at freenode.irclog.whitequark.org

2019-02-15 17:52 alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - https://gitlab.freedesktop.org/panfrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature

00:19 davidlt has quit [Ping timeout: 245 seconds]

00:57 davidlt has joined #panfrost

01:00 megi has quit [Ping timeout: 248 seconds]

02:29 davidlt has quit [Ping timeout: 245 seconds]

02:46 davidlt has joined #panfrost

03:48 rcf has quit [Quit: WeeChat 2.4]

03:51 rcf has joined #panfrost

03:58 davidlt_ has joined #panfrost

04:00 davidlt has quit [Ping timeout: 245 seconds]

04:22 sravn has quit [Quit: WeeChat 2.4]

04:43 davidlt__ has joined #panfrost

04:46 davidlt_ has quit [Ping timeout: 272 seconds]

05:10 davidlt__ has quit [Remote host closed the connection]

05:14 davidlt has joined #panfrost

05:32 davidlt has quit [Remote host closed the connection]

05:33 davidlt has joined #panfrost

06:12 davidlt has quit [Ping timeout: 268 seconds]

06:17 guillaume_g has joined #panfrost

06:38 megi has joined #panfrost

06:39 guillaume_g has quit [Remote host closed the connection]

06:54 <bbrezillon> alyssa: I know what I happened with the buggy patch I pushed => I tested before rebasing on master, and commit 4fa09329c104 ("pan/midgard: Use ralloc on ctx/blocks") turned things into ralloc() allocations

06:54 <bbrezillon> I did push the rebased version to tomeu's tree though, wonder why CI didn't catch the issue

06:55 <tomeu> guess it would be good to find out why

07:00 yann has quit [Ping timeout: 272 seconds]

07:23 <bbrezillon> tomeu: weston crashed here https://gitlab.freedesktop.org/tomeu/mesa/-/jobs/540727 and none of the dEQP tests were run

07:23 <bbrezillon> yet the CI is green

07:34 guillaume_g has joined #panfrost

07:37 <tomeu> cool, we should fix that

07:39 <bbrezillon> alyssa: do you remember how you triggered the bug you fixed in 0c5633036195 ("panfrost: Workaround bug in partial update implementation") ?

07:40 <bbrezillon> I'd like to reproduce it to understand how we could end up with an empty damage extent

07:49 pH5 has joined #panfrost

08:26 stepri01 has joined #panfrost

08:31 stikonas has joined #panfrost

08:32 stikonas has quit [Remote host closed the connection]

08:48 <narmstrong> damn I lost track of panfrost and now it's in an absolute bad state, nothing works at all on t820

08:59 <tomeu> narmstrong: we should have have it on CI :)

08:59 <tomeu> bbrezillon: tried to address that problem like this, but for some reason it isn't working: https://gitlab.freedesktop.org/tomeu/mesa/commit/2f7fd701791946bbcf17e3a9f40c49422b8aa0fd

08:59 <tomeu> any ideas why?

09:00 <narmstrong> tomeu: yep, but I'll need to fix the driver first, then find what caused the mesa part to stop working

09:00 <narmstrong> then CI will be cool !

09:01 <tomeu> ah, the driver isn't even probing?

09:02 <narmstrong> tomeu: apart the mmu 33bit issue, the power control override issue, the driver shows a error while powering the cores, bisecting it

09:02 <narmstrong> then I'll focus on the `panfrost d00c0000.gpu: Unhandled Page fault in AS0 at VA 0x000000000863C580` :-p

09:03 <tomeu> ok, for the latter, PAN_MESA_DEBUG=trace could give some info on what that pointer points to

09:06 <bbrezillon> tomeu: because you fork when lauching weston?

09:07 <narmstrong> tomeu: smells pretty bad when the fault is on the `mali_job_descriptor_header` no ?

09:39 raster has joined #panfrost

09:46 <raster> bbrezillon: we've advanced. your patch now nicely segvs... let me figure it out

09:48 <raster> need to drop optimizations for gdb to work now

10:12 davidlt has joined #panfrost

10:14 davidlt has quit [Read error: Connection reset by peer]

10:33 <raster> well it seems its my debugs printfs segving :)

10:38 <raster> bbrezillon: ok. we have problems... maybe i have too many things appiled

10:38 <raster> let me undo stuff

10:38 robmur01 has joined #panfrost

10:38 <raster> i'm now seeing buffer age swap between 1 and 3

10:39 <raster> which will be masking/hiding issues

10:39 <raster> yeah. ... your changes now break buffer age

10:39 <raster> specifically http://code.bulix.org/jcn8f9-849883 does

10:40 <raster> i see buffer age go 1, 3, 1, 3, 1, 3

10:41 megi has quit [Ping timeout: 246 seconds]

10:53 <daniels> that doesn't touch buffer age though ...

10:53 <daniels> going 1/3/1/3 will just be an artifact of how Mesa picks a buffer to use

10:53 <raster> bbrezillon: the good news is... nouw panres is the same

10:53 <raster> daniels: dunno right now... but as a result eflis not doing partial update

10:54 <raster> perhaps a reboot will help :)

10:54 <daniels> sure, but that's been the case for quite a while

10:54 <robmur01> Any chance anyone has an idea of why glmark2 would have started barfing "Error: eglCreateWindowSurface failed with error: 0x3009" with recent mesa/master?

10:55 <robmur01> kmscube still works but now spews a load of "unhandled 67" (and similar, depending on mode)

10:56 <daniels> raster: see https://gitlab.freedesktop.org/mesa/mesa/commit/4f1d27a406478d405eac6f9894ccc46a80034adb

10:57 <daniels> robmur01: is this using GBM, Wayland, or X11?

10:57 <rcf> I only get that with glmark2-es2-drm

10:57 <daniels> raster: the last paragraph is relevant

10:57 <rcf> X and Wayland are fine here.

10:58 <daniels> rcf: i'm not 100% sure why glmark2 would fail, but BAD_MATCH could be caused by it picking an EGLConfig that's not compatible with the GBM surface format

10:59 <robmur01> daniels: GBM, I assume (I'm driving it over SSH with no desktop running)

10:59 <rcf> (well, weston, anyway -- sway continues to fail miserably as always)

11:00 <raster> daniels: well unfortunately... http://code.bulix.org/jcn8f9-849883 is definitely the culprit

11:00 <raster> with it - buffer age swaps 1, 3, 1, 3, 1, 3

11:01 <raster> without it (only change i made) its back to 2, 2, 2, 2, 2

11:01 <raster> :(

11:04 <daniels> robmur01: assuming glmark needs the equivalent of https://gitlab.freedesktop.org/mesa/kmscube/commit/56c3917ffd1f05942246e2532ca4a5707554a2fc

11:04 <daniels> raster: i can't really see why that would be the case, but you might want to look at when and how you do rendering, e.g. if Boris's change introduces more reload regions which make paints take longer, which pushes you into triple- rather than double-buffering

11:06 <raster> well that's my next port of call - look at the changes and as to why this is the case :)

11:20 <raster> fascinating...

11:20 <raster> so we have 4 buffers available - only 2 of which have had a bo actually allocated

11:21 <raster> 2 of them are age 0 with no bo

11:21 <raster> the other 2 are age 1, 3 and have bo's allocated

11:21 <raster> it never picks these because they are age 0

11:22 <daniels> not at all sure how you're getting age 3 when double-buffering

11:22 <raster> also not sure

11:22 <raster> just looking into what it;'s even seeing in making its decisions

11:23 <raster> so one of the buffers keeps changing ages from 2 to 3

11:23 <raster> every 2nd frame

11:24 <raster> oh wait no. sorry

11:24 <raster> not quite. more complicated than that

11:25 <raster> https://pastebin.com/qzvv1RwN

11:25 <raster> the "look for new backbuffer" stuff is me tracing the logic at the top of get_back_bo()

11:25 <raster> in platform_drm.c

11:27 <bbrezillon> raster: my patch is probably wrong

11:27 <tomeu> narmstrong: iirc those descriptors are allocated in the transient pool

11:27 <raster> bbrezillon: the good news - panres is the same at swap time as well as when regins are set

11:28 <raster> so it does fix that

11:28 <raster> but it has a bi-product

11:28 <bbrezillon> yep, but it's probably breaking other things

11:28 <tomeu> narmstrong: so I would check how the BOs backing those are allocated, and whether by chance aren't being released or unmapped from the GPU

11:30 <bbrezillon> raster: ->set_damage_region() is called after a swap/flush to reset the damage region

11:30 <bbrezillon> maybe it's too early

11:30 <bbrezillon> (before the the aging update has taken place?)

11:31 <raster> we get buffer age first then set regions

11:31 <bbrezillon> tbh, I'm reading this piece of code for the first time, so there are probably a ton of things I didn't get right

11:32 <raster> so once something asks for buffer age it has to be evaluated and locked in stone from that point until a swap

11:32 <raster> so the query surface for buffer age should have done that job already and locked it down. based on buffer age we can calculate update regions (we then set them), then render, then swap (with damage rects too )

11:33 yann has joined #panfrost

11:35 <raster> bbrezillon: reading your patch... it's all about passing ctx in to verious places just so u can call st_validate_state() when seting dmg region

11:35 <bbrezillon> yep

11:35 <raster> shouldnt we do this actually when we query buffer age instead?

11:35 <bbrezillon> maybe

11:36 <bbrezillon> we should probably do it in both places

11:37 <bbrezillon> (the validate_state() should be cheap when called the 2nd time)

11:38 <raster> yeah

11:38 <raster> and its really only done once per frame as someone might submit 1 or more update rects before they begin their draw

11:43 <bbrezillon> raster: hm, "1 or more", remember that new calls to set_damage_region() resets the whole thing

11:44 <raster> they may be being silly and set one then decide they are wrong and set another

11:44 <raster> and so on

11:44 <raster> until they finally decide on what it needs to be

11:44 <bbrezillon> all good

11:44 <bbrezillon> just wanted to make it clear that it's not cumulative

11:44 <raster> it'd be silly to do that... but they might and it'd be "valid" i guess.

11:45 <raster> i wonder if the spec allows for things like

11:45 <raster> set_region() draw(); set_region(); draw(); swap();

11:46 <raster> so each time you set 1 region then draw it then move on to the next then swap

11:46 <raster> this would probably be horrible for pipelining and deferred rendering

11:46 <raster> but...

11:46 <raster> is it actually valid?

11:47 <bbrezillon> we definitely don't support that

11:48 <narmstrong> tomeu: It also needs another iopgtable fix from robmur01

11:48 <narmstrong> I missed also this one

11:48 <bbrezillon> raster: but that shouldn't be too hard to support

11:49 <bbrezillon> does adding a validate_framebuffer() before querying the buffer age helps?

11:49 <raster> bbrezillon: i just wonder if it's valid or not though.

11:50 <raster> bbrezillon: i'm trying to figure out how to even call st_validate_state or st_manager_validate_framebuffers from the buffer age query

11:50 <raster> like dri2_query_buffer_age or dri2_drm_query_buffer_age

11:51 <raster> how do dig out st_context :)

11:53 <robmur01> narmstrong: FWIW T820 is what I happen to have on my board currently, which has been reasonably happy until my glmark setup started falling apart :)

11:54 <narmstrong> robmur01: pretty cool, I thought I was alone in the dark with the T820 on the Amlogic S912

11:56 <bbrezillon> raster: hm, just looking at dri2_drm_query_buffer_age and I don't understand how the returned age can be wrong

11:56 <raster> it is though :)

11:56 <bbrezillon> you don't need to call the validate() function here, because the buffer age is directly stored in the platform-specific object

11:56 <raster> i put printf's in get_back_bo

11:57 <raster> https://pastebin.com/qzvv1RwN

11:58 <raster> https://pastebin.com/9dnUGXNJ

11:58 <bbrezillon> dri2_drm_query_buffer_age() calls get_back_bo()

11:58 <raster> my specific hacks/printfs

11:58 <raster> so u can grab the same daya

11:58 <raster> yeah

11:58 <raster> thats why i put my printfs there as that was the core logic

11:58 <raster> the above output is what i see and it explains why it returns what it does

11:59 <narmstrong> robmur01: when starting kmscube (since the first panfrost driver), we have systematically a `panfrost d00c0000.gpu: Unhandled Page fault in AS0 at VA 0x000000000213C200`, `PAN_MESA_DEBUG=trace` doesn't show anything related

11:59 <bbrezillon> the problem I had with the set_damage_region() function was that the new back BO info was not propagated to drawable->textures[BACK_LEFT]

11:59 <raster> it doesnt explain how we get buffers with those ages in the cbufs

11:59 <robmur01> narmstrong: even with the 4-level hack?

12:00 <narmstrong> robmur01: yes, but only 1, each time we start kmscube, than all runs fine

12:01 <narmstrong> robmur01: I have a doubt on the Amlogic T820 integration, for mali_kbase to work correctly they need to write `GPU_PWR_KEY=0x2968A819` then `GPU_PWR_OVERRIDE1=0xfff | (0x20 << 16)` but this doesn't seems to solve anything for panfrost

12:02 <tomeu> narmstrong: guess one could print on buffer creation the GPU addr and the size, to figure out what BO that addr belongs to

12:02 <robmur01> I do see a js fault (DATA_INVALID FAULT) fault plus a sched timeout when it starts

12:02 <robmur01> but no unhandled page fault

12:03 <narmstrong> the DATA_INVALID FAULT disappears with the 4-level hack

12:04 <robmur01> (and the timeout may well be more to do with the speed of the FPGA)

12:06 <bbrezillon> raster: looks like buffer age is incremented twice

12:06 <raster> yeah

12:06 <raster> but where and why... :)

12:07 <bbrezillon> can you add traces to dri2_drm_swap_buffers() ?

12:07 <robmur01> I guess it could be a flush timing thing, since the js fault comes immediately after mapping the relevant IOVA

12:07 <bbrezillon> see how many times it's called and what the age values are

12:08 <tomeu> robmur01: a timeout is expected after a fault, as we don't cancel jobs atm

12:09 <raster> bbrezillon: was just adding to dri2_drm_swap_buffers

12:10 <raster> interesting

12:10 <raster> it does age++ to bother buffers

12:10 <raster> inside swap

12:10 <raster> --panfrost: incr cbuf 2 [0xaaaad18c46a0] from 1 += 1

12:10 <raster> --panfrost: incr cbuf 3 [0xaaaad1f6ab10] from 2 += 1

12:11 <raster> thats tracking the age++ in dri2_drm_swap_buffers

12:11 <raster> thats part of swapbuffers it seesm

12:11 <raster> so we have 2 buffers, age 1, 2

12:11 <raster> and so this should lead to 2 buffer with age 2, 3

12:15 <bbrezillon> except one of them becomes the current buffer

12:15 <bbrezillon> and is reset to 1

12:15 <bbrezillon> (current == front)

12:15 <raster> yeah

12:15 <bbrezillon> so 3 and 1 sounds good, right?

12:16 <bbrezillon> or should it be 2 and 1

12:16 <bbrezillon> ?

12:17 <raster> 2 and 1 i think

12:18 <raster> so what i see is its sometimes 1, 2 and sometimes 3, 1

12:18 <raster> as opposed to always 2, 1 (or 1, 2) ..

12:18 <raster> u get the idea

12:18 <raster> but why?

12:19 <raster> ok

12:20 <raster> the problem is...

12:20 <raster> it's the locked

12:20 <raster> so its the buffer that is always age 1 that is locked when it works right

12:20 <raster> when it fails its the buffer of age 2 that is sometimes locked

12:20 <raster> sometiems buffer of age 1

12:21 <raster> thus since its locked some other buffer is chosen

12:27 <raster> food. brb

12:35 <raster> now here is a question

12:35 <raster> why does dri2_drm_swap_buffers ++ all buffers of age > 0

12:35 <raster> shouldnt it incr the swapped buffer age only?

12:36 <raster> actually wait

12:36 <raster> it SETS the current backbuffer after the swap to age 1

12:36 <bbrezillon> it's what it does

12:37 <raster> that's find

12:37 <raster> but why modify the others?

12:37 <raster> they havent been "used"

12:37 <bbrezillon> if (dri2_surf->color_buffers[i].age > 0)

12:37 <bbrezillon> dri2_surf->color_buffers[i].age++;

12:38 <bbrezillon> it only modifies ages on buffers that were used at some point

12:38 <bbrezillon> which is correct

12:38 <bbrezillon> and current is not the backbuffer

12:38 <bbrezillon> it's the front buffer

12:38 <raster> hnmm actually htis is just of implicitly counting how long they sit in the pipeline for

12:39 <bbrezillon> yes, so that if they ever get used again, you know how many frame passed since the last update

12:39 <raster> well current becomes the back buffer there :)

12:39 <bbrezillon> and can determine what has been updated in between

12:41 <bbrezillon> current back buffer becomes the front buffer at swap time, which is correct I guess

12:42 <bbrezillon> and if you need a proof that ->current is the front buffer, you can look at lock_front_buffer()

12:42 <bbrezillon> :)

12:44 <raster> just a question

12:45 <raster> to me... the logic here on buffer choosing and aging just seems... unusual to me

12:45 <raster> it's very liable to have buffer age variances if buffers stay locked for longer than we'd like

12:45 <bbrezillon> I'm sure that will be a question for daniels :)

12:46 <raster> i would have implmented it as much of a pipeline

12:46 <raster> with a fixed # of buffers

12:46 <raster> maybe start at 2

12:46 <raster> then grow to 3 or 4

12:46 <raster> and literally shuffle them down the array rather than pick one from a pool with age numbers

12:47 <bbrezillon> isn't it less efficient to do that?

12:47 <raster> so grow if needed, and maybe every now and again consider a shrink if we seem to have to many spare

12:47 <bbrezillon> I mean, the older the frame the more content you'll have to update, right?

12:47 <raster> well it's just copying the buffer header/struct data

12:47 <raster> sure

12:47 <raster> the older

12:47 <raster> thus grow if needed

12:48 <raster> e.g. we have 2 buffers locked/queued for display and so we have to have a 3rd if anothe r swap is submitted

12:48 <raster> another swap

12:48 <raster> this implicitly allocates bufferson demand

12:48 <bbrezillon> isn't it what happens right now?

12:48 <raster> so it kind of works that way

12:48 <raster> but its a pool with no ordering

12:48 <raster> or the ordering is the age

12:48 <raster> kind of

12:48 <bbrezillon> locked+age combination yes

12:49 <raster> yeah

12:49 <bbrezillon> pick the newest buf that's not locked

12:49 <bbrezillon> which sounds like a good solution to me

12:49 <raster> it's just now how i'd do this so thus it's feeling a bit odd to scrape throught it and understand how it works, so i'm asking :)

12:50 <raster> i'd literally keep it as a "ring buffer"

12:50 <raster> and shuffle the frames through

12:51 <raster> though i'd actually memcpy() the array down to shuffle so head is alwaYS AT 0

12:51 <bbrezillon> anyway, still does not explain why we end up with 1, 3 ages

12:51 <raster> since these are small structs with a few flags and a ptr - that memcpy would be pretty moot

12:51 <bbrezillon> don't you have a NOOP swapbuf?

12:51 <raster> hmm

12:52 <raster> i doubt we have a noop swap buf

12:52 <raster> we really try hard to do nothing if there is nothing to draw

12:52 <bbrezillon> might be something something doing an implicit swap maybe

12:53 <bbrezillon> s/something something/something/

12:53 <raster> like lots of layers of higher level object/scene graph tracking to minimize update regions and only begin a render sycle at all if something needs a redraw

12:53 <raster> it figures out obscured objects changing and turns that into nops etc.

12:54 <bbrezillon> that's the only explanation I see for having the front buffer with age = 2 when doing double-buffering

12:54 <raster> the buffer age change each frame is tickling our logic to force full redraws each frame

12:55 <bbrezillon> buffer age should definitely stay 2

12:56 <raster> well each render cycle i only see 1 call to dri2_drm_swap_buffers

12:57 <raster> wait up let me be sure of that

12:57 <bbrezillon> and when you comment the validate_state() call in set_damage_region(), the aging is correct

12:59 <raster> so yes

12:59 <raster> that validate state causes this to happen

12:59 <raster> commenting it out fixes it

13:00 <bbrezillon> but you're sure dri2_drm_swap_buffers

13:00 <bbrezillon> is called only once

13:00 <raster> yup

13:00 <bbrezillon> AFAICT, the only place that can swap front/back buffers is this function

13:01 <raster> just throw in printfs tobe sure

13:01 <raster> only once per frame

13:01 <bbrezillon> yes, once per frame

13:01 <bbrezillon> and with the validate_state() call uncommented

13:02 <raster> uncommented once per frame

13:02 <bbrezillon> that's crazy

13:05 <raster> gah

13:05 <raster> pastebin quota hit

13:06 <bbrezillon> raster: http://code.bulix.org/

13:06 <raster> http://code.bulix.org/ykrqsh-850421

13:07 <raster> mega-sized-fonts.bulix.org :)

13:07 <bbrezillon> wait, age is printed in get_back_bo()

13:07 <bbrezillon> so just after age has been incremented

13:07 <raster> the "look for new backbuffer" is in get_back_bo()

13:08 <bbrezillon> and front buffer has not be swapped yet

13:08 <bbrezillon> *front/back have not been swapped yet

13:08 <raster> ajd yes

13:09 <bbrezillon> getting 2 on the locked buf is fine, then

13:09 <raster> when it goes from having 1 to 2 buffers in the pool

13:09 <raster> 0xaaaafc7c3f90 is still locked

13:10 <bbrezillon> can you move the buffers aging dump at the end of dri2_drm_swap_buffers() ?

13:12 <raster> let me dump that

13:13 <raster> http://code.bulix.org/fiwhe1-850425

13:14 rcf has quit [Quit: WeeChat 2.4]

13:19 <bbrezillon> raster: can you add a trace to lock_front_buffer()?

13:19 <bbrezillon> with the age and pointer of the buffer being locked

13:21 <bbrezillon> raster: buffer @ idx 2 should be locked after the 2nd swap

13:21 <bbrezillon> but the get_back_bo() happening just after this swap shows that buffer @ idx 3 is locked

13:21 <raster> --panfrost: buf [0xaaaac1a978f0] 2 locked=0, age=1

13:21 <raster> --panfrost: chosen [0xaaaac1a978f0] age=1

13:21 <raster> --panfrost: buf [0xaaaac1454340] 3 locked=1, age=2

13:21 <raster> that bit?

13:21 <raster> and it's seemingly not locked?

13:22 <bbrezillon> would be interesting to know where the get_back_bo() call comes from

13:23 <raster> i was assuming that was from getting buffer age...

13:23 <bbrezillon> it's definitely not the one done in drm_swap_buffers() (->back is probably already assigned when we call it from there)

13:23 <bbrezillon> yes, that's what I suspect

13:23 <raster> but you're right

13:23 <raster> it could be other places

13:24 <bbrezillon> oh no, actually I suspect it comes from the validate_state() call we added to the set_damage_region()

13:25 <raster> errrrr

13:25 <raster> no...

13:25 <raster> dri2_drm_image_get_buffers()

13:25 <bbrezillon> and I fear the ->lock_front_buffer()/->release_buffer() calls have not been done yet when we call validate_state() in the set_damage_region() path

13:26 <bbrezillon> can you surround the added validate_state() call with printf("%s:%i\n", __func__, __LINE__); traces?

13:28 <raster> done

13:28 <raster> sec

13:29 <raster> yeah

13:29 <raster> the validate state calls image_get buffers

13:30 <raster> http://code.bulix.org/jzy7ks-850435

13:30 <bbrezillon> ok, so here is the problem

13:31 <bbrezillon> new front/back buffers have not been locked/released yet

13:31 <bbrezillon> and get_back_bo() fails to pick the right BO

13:32 <bbrezillon> can you check when lock_front_buffer()/release_buffer() are called and who call them?

13:32 <raster> by a read of the logs the release only happens at the next swap

13:32 <raster> which makes sense

13:33 <raster> the lock/release logs are there in that paste

13:33 <raster> withe the buffer ptr

13:33 <raster> well bo ptr

13:34 <raster> mesa doesnt happen to have a handy "dump a bt to stdout now" macro?

13:34 <raster> or func?

13:34 <bbrezillon> actually, they happen just after the set_damage_region()

13:35 <raster> yeah

13:35 <raster> the do

13:35 <raster> pourquoi

13:36 herbmillerjr has quit [Remote host closed the connection]

13:36 herbmillerjr has joined #panfrost

13:37 <raster> that's us doing that

13:38 <raster> http://code.bulix.org/jgcgwv-850440

13:38 <bbrezillon> those are caused by the implicit ->set_damage_region() calls

13:39 <bbrezillon> used to reset the damage region

13:39 <raster> not the lock/unlocks

13:41 <bbrezillon> raster: can you make the call to validate_state() conditional on if (nrects > 0)

13:41 <bbrezillon> just to validate this theory

13:41 <raster> we would be locking the new frontbuffer and releasing the old one when we swap

13:42 <bbrezillon> that's what happens

13:42 <raster> indeed

13:42 <raster> nrects > 0 fixes it

13:43 <bbrezillon> but it happens after dri2_swap_buffers() has returned

13:43 <bbrezillon> and we are calling ->set_damage_region() from inside dri2_swap_buffers(à

13:43 <raster> we're calling it from inside?

13:43 <raster> oh...

13:43 <raster> well tyhen

13:43 <raster> then

13:43 <raster> :)

13:44 <raster> i didnt notice that

13:44 <bbrezillon> still don't know how to solve that properly

13:44 <raster> we're setting region with 0 rects?

13:44 <raster> (from inside)

13:45 <bbrezillon> yes, that's the semantics we use to mean "reset damage region"

13:45 <raster> yeah

13:45 <raster> thats the definition

13:45 <raster> but in THIS case we do not want to do another validate

13:45 <raster> just reset

13:45 <raster> right?

13:45 <bbrezillon> it's not that simple

13:46 <bbrezillon> if you look at the spec, ->set_damage_region(0, NULL) is also valid

13:46 <raster> oh i know

13:46 <raster> it resets

13:46 <raster> well the egl call - yes

13:47 <bbrezillon> last time I read the spec it was not so clear

13:47 <raster> but in this case the region implicitly resets after a swap

13:47 <raster> oh i read the spec quickly

13:47 <raster> it does say that above

13:47 <bbrezillon> I understood it as "empty damage region"

13:47 <raster> so when called from OUTSIDE mesa - yes

13:47 <raster> this is right

13:47 <raster> but this set amage region is being called from inside implicitly as part of swapping

13:48 <raster> and thus is also implicitly shuffling up buffer ages etc.

13:48 <bbrezillon> yes

13:48 <raster> so its a different path

13:48 <bbrezillon> because of the validate_state() addition

13:48 <raster> and in this case ... you want to reset the region without age++

13:48 <bbrezillon> which am no sure is correct :)

13:48 <raster> my take is getting buffer age should validate

13:48 <raster> not setting region

13:48 <raster> imho

13:49 <raster> basically... everyone is going to use these together

13:49 <raster> they will get age FIRST to figure out regions

13:49 <raster> then set them

13:49 <bbrezillon> the reason I added this validate_state() in the first state was to update drawable->textures[BACK_LEFT]

13:49 <raster> unless of course they already know its a full re-draw anyway

13:49 <raster> in which case everything is already by default set up right region-wise

13:50 <bbrezillon> which is used by the set_damage_region() logic to pick the right resource

13:53 <raster> well at least you've made progress

13:53 <raster> you now know what specifically is the issue

13:53 <raster> and why

13:53 <raster> the question now is... what is the better thing to do that doesn't cause issues? :)

13:53 <bbrezillon> yep

13:53 <bbrezillon> maybe daniels knows :)

13:54 <raster> as it stands... i'd have a special case for resetting dmg region

13:55 <raster> or reorder things so when its called it doesnt mess things up

13:55 <bbrezillon> that's a solution, but I'm still not sure calling validate_state() from set_damage_region() is the right thing to do

13:56 <raster> damn

13:56 <raster> mesa has 2 dri2_set_damage_region()'s

13:57 <raster> :(

13:59 <raster> ok

13:59 <raster> really really really super dirty

13:59 <raster> use rects -1 or rect ptr of -1 (fffffffffffff...) to mean reset without validate

14:00 <raster> otherwise extend the function to have more params like flags .... :)

14:01 <alyssa> bbrezillon: The empty damage bug was triggering pretty much everywhere, iirc?

14:02 <alyssa> Like, no, I don't remember how, but it was pretty quick after playing with your branch

14:03 <alyssa> robmur01: ......Panfrost w/ an FPGA? Dang. :P

14:08 <bbrezillon> alyssa: ok, I'll try to reproduce

14:13 herbmillerjr has quit [Ping timeout: 246 seconds]

14:13 <robmur01> alyssa: https://developer.arm.com/tools-and-software/development-boards/fpga-prototyping-boards/logictile-express

14:14 <robmur01> just don't ask me how much they cost (beyond "a lot") :P

14:14 herbmillerjr has joined #panfrost

14:15 <raster> robmur01: trying midgard or bifrost bit files?

14:19 <robmur01> raster: yes - I've managed to nab some of each

14:21 <raster> robmur01: same... but as i have a juno r0... y dtb is broken ad i only have a dtb for r1+biforst

14:21 <raster> so i have to look at hacking my own up

14:21 <raster> and then... welll i decided i might just try the kihey960 again

14:21 <raster> after spending like 2 weeks getting a juno to boot reliably :)

14:22 <raster> (don't aske. front usb ports, ueif vs tftp vs uboot vs udb hdd stopping to work vs ....)

14:32 <robmur01> hooray for meetings; gave me a chance to bisect my glmark woes to https://gitlab.freedesktop.org/mesa/mesa/commit/7b4ed2b513efad86616e932eb4bca20557addc78

14:33 <robmur01> seems plausible; I guess it's not a panfrost issue then

14:37 <raster> yargh

14:38 <raster> that was a rejig of how color masks are stored

14:38 <raster> i wonder if there is a mystery typo/off-by-one there

14:38 <raster> :)

14:38 <alyssa> robmur01: Oh!

14:38 <alyssa> anarsoul: also blamed that commit for them having issues

14:38 <alyssa> with lima

14:39 <shadeslayer> alyssa: I'm seeing something interesting over here, just wanted to double check with you, are jobs shared across contexts?

14:39 <alyssa> shadeslayer: They... shouldn't be...

14:40 <shadeslayer> alyssa: huh, so I made these changes https://paste.ubuntu.com/p/2SYfM5gCSy/

14:41 <shadeslayer> alyssa: and I see this : http://paste.ubuntu.com/p/qPpxMPKzQZ/ , job 0xffff4c1ced70 seems to be created by 0xaaaae5a0ca30 but freed by ctx 0xaaaae4a80990 ( line 38 )

14:43 <raster> hmm

14:43 <raster> that match increases struct sizes.... thats not nice

14:44 <raster> it could have used shorts for shifts/sizes and at least not gotten more bloaty

14:44 <raster> :)

14:44 <raster> actually could have used chars and it'd have been find and used half the size then - even better

14:44 <raster> :)

14:48 megi has joined #panfrost

14:48 <daniels> bbrezillon, raster: tbh I can't really piece the backlog together - is there a tl;dr somewhere?

14:49 <raster> daniels: i could tell you in about 2h :)

14:49 <daniels> :)

14:50 <raster> but the simple version

14:50 <daniels> tomeu: ^ do you know why shadeslayer is seeing panfrost_jobs migrate between ctx? seems like it shoudln't be possible :\

14:50 <raster> the "Fix" to add a st_validate_state() in dri2_set_damage_region() does fix a core issue

14:51 <raster> it makes sure the right panfrost resource (panres) is being used for setting the update regions to as well as then being used int he swapbuffer later

14:51 <raster> without that the regions are attached to something else not using in the swap thus - it gets them wrong

14:51 <raster> probably to do with different code paths driven from higher up

14:52 <raster> anyway

14:52 <raster> the problem is this extra validate is then called as part of a swapbuffer()

14:52 <raster> because the update regions need to be reset to "empty"

14:52 <shadeslayer> the only other explanation I can come up with is that the output is just messed up because of threading, and that the same piece of memory get's free'd by one context and alloc'd to another ctx?

14:52 <raster> so this is called with 0 rects to do that

14:53 <raster> the issue now is that this then bumps up the buffer age of one of the buffers as a side-effect

14:53 <raster> inside of the swapbuffer

14:53 <raster> (in addition to all the other buffer age fun that is done here tht was correct)

14:53 <raster> and this causes us to have 1 buffer wildly out of step

14:53 <raster> thus ages of 1, 3, 1, 3 instead of 2, 2, 2, 2

14:55 <raster> so realistically the regiosn need a reset without bumping any buffer ages around (.e. drop the validate in this case)

15:01 davidlt has joined #panfrost

15:01 <raster> bbrezillon: oh... and now do some more rendering involving fbo's and i get rendering bugs again even with if (nrects > 0) disabling the validate

15:01 <raster> :(

15:01 <raster> so much more to go it seems

15:05 <bbrezillon> raster: :-(

15:05 <raster> i was using a simpler test case so far

15:05 <raster> i just expanded it

15:05 <raster> and well... flickery black blobs appear

15:06 <raster> and i know this will be using fbo's to render to

15:06 <raster> so .... :/

15:06 <bbrezillon> we'd need to trace the various calls again

15:06 <raster> yeah

15:06 <raster> damn

15:07 <raster> i wont have time today

15:10 <bbrezillon> np

15:17 davidlt has quit [Remote host closed the connection]

15:17 davidlt has joined #panfrost

15:19 tbueno has joined #panfrost

15:23 <shadeslayer> yeah, http://paste.ubuntu.com/p/3ghrvzC5c6/, this still makes no sense, TID 9112 free's job 0xffff78171510 when it was allocated by TID 9158

15:25 davidlt has quit [Remote host closed the connection]

15:28 raster has quit [Remote host closed the connection]

15:29 <bbrezillon> shadeslayer: can you protect the create/free section with a mutex (with the trace inside) so that we are sure the traces are accurate

15:30 <shadeslayer> bbrezillon: sure can do

15:44 pH5 has quit [Quit: bye]

15:49 davidlt has joined #panfrost

16:17 yann has quit [Ping timeout: 245 seconds]

16:32 raster has joined #panfrost

16:43 raster has quit [Remote host closed the connection]

16:44 raster has joined #panfrost

16:49 raster has quit [Ping timeout: 248 seconds]

16:57 raster has joined #panfrost

17:05 raster has quit [Ping timeout: 245 seconds]

17:23 <bbrezillon> daniels: could you have a look at http://code.bulix.org/9v8bhi-850562 and tell me if that makes sense (I doubt it does, but I also don't know how to fix the problem reported by raster)

17:37 yann has joined #panfrost

17:40 <bbrezillon> hm, I could do something simpler by letting drivers reset the damage region at flush time

18:18 <shadeslayer> tomeu: I should lockout the entirety of get_job and free_job right?

18:19 <shadeslayer> tomeu: https://paste.ubuntu.com/p/p6MyBHPsMP/

18:32 <shadeslayer> tomeu: http://paste.ubuntu.com/p/kz4VQJkjjf/

18:33 <shadeslayer> PANFROST DEBUG: TID 2783 CTX 0xaaaaefc43290 RENDERING JOB 0xffff58373e20

18:33 <shadeslayer> PANFROST DEBUG: TID 2770 CTX 0xaaaaeef640a0 FREE'ING JOB 0xffff58373e20P

18:33 <shadeslayer> definitely free'd by a different context?

18:34 <shadeslayer> tomeu: https://paste.ubuntu.com/p/tfBZcBr8yJ/ is the final patch that I used to check things

18:44 <alyssa> Help, this blend shader stuff is WATing me

18:45 * alyssa is afraid of running out of time before even getting to start scheduling

18:46 <shadeslayer> heh, running out of time ... start scheduling :P

18:46 <alyssa> >.<

18:46 <shadeslayer> pun intended? :P

18:46 <alyssa> Nope

18:46 <shadeslayer> even better

18:48 <alyssa> Oh, *seriosuly*, ugh

18:49 <alyssa> Getting clobbered with a blend constant

18:50 <bbrezillon> shadeslayer: yep

18:50 <bbrezillon> and we indeed have a bug

18:52 <bbrezillon> shadeslayer: hm, wait

18:52 <bbrezillon> can you also protect the secting printing the RENDERING trace?

18:53 <alyssa> ACK!

18:53 <alyssa> It's upside-down

18:53 <alyssa> That's why

18:56 <bbrezillon> shadeslayer: can you also add an \n at the end of the FREEING trace

19:00 <alyssa> Now dealing with a RA issue with blend shaders :|

19:00 * alyssa feels herself running out of time for lack of a better word

19:03 <bbrezillon> shadeslayer: here is the bug => https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/gallium/drivers/panfrost/pan_drm.c#L357

19:03 <bbrezillon> screen is shared by all contexts

19:06 <bbrezillon> shadeslayer: last_fragment_flushed, last_job should probably be moved to panfrost_context

19:07 <bbrezillon> transient_bo, free_transient and bo_cache might stay in panfrost_screen if they are protected with a mutex

19:32 <bbrezillon> shadeslayer: http://code.bulix.org/8kohss-850630

19:34 <shadeslayer> I move away for dinner and you fix it :(

19:34 <bbrezillon> oh, I doubt it's fixed

19:34 <shadeslayer> bbrezillon: but yeah, awesome, I wasn't hallucinating then :D

19:35 <bbrezillon> it's just the beginning of a long series of issues related to multi-ctx

19:35 <bbrezillon> :)

19:41 <shadeslayer> hooray

19:42 <bbrezillon> shadeslayer: it works?

19:43 <bbrezillon> so, patches I don't even test work, but those I test regress the entire world when I push them :'-(

19:46 stikonas has joined #panfrost

19:47 <shadeslayer> bbrezillon: I'm going through them now

19:47 <shadeslayer> - return;

19:47 <shadeslayer> + pthread_mutex_unlock(&screen->bo_cache_lock);

19:47 <shadeslayer> that looks wrong?

19:47 <shadeslayer> oh, I guess that return is superfluous

19:48 <bbrezillon> shadeslayer: and s/transiant/transient/

19:54 <shadeslayer> bbrezillon: yeah I figured

19:54 <shadeslayer> I just wanted to go through it myself to understand what's going on, I see you just moved things around from screen to ctx

19:55 <bbrezillon> sure, np

19:57 <bbrezillon> sorry I read "I just want to", hence the "sure, np". Sorry for chiming in and fixing that for you

19:57 <bbrezillon> I'll try to stay away next time :-/

19:57 <shadeslayer> bbrezillon: a lot less crashy for sure :)

19:59 <shadeslayer> I feel like it has a perf impact though

19:59 <shadeslayer> but I have no metrics to back it up

20:00 <bbrezillon> the lock certainly has an impact, though I doubt it's a huge one

20:00 <shadeslayer> bbrezillon: yeah definitely fixes the issue at hand

20:00 <bbrezillon> the other option would be to have a bo_cache and transient pool per context

20:01 <bbrezillon> but that means increasing the memory consumption

20:01 <bbrezillon> maybe something we can think about when the madvise stuff is merged

20:03 <alyssa> Only 474 regressions left according to CI... this is really ughle

20:03 <shadeslayer> I get nice INSTR_MISMATCH messages now

20:03 <shadeslayer> [ 6963.918002] panfrost ff9a0000.gpu: js fault, js=0, status=INSTR_TYPE_MISMATCH, head=0x5b08180, tail=0x5b08180

20:03 <shadeslayer> [ 6963.918909] panfrost ff9a0000.gpu: gpu sched timeout, js=0, status=0x52, head=0x5b08180, tail=0x5b08180, sched_job=0000000051927e83

20:04 <shadeslayer> but yay, less crashing

20:05 davidlt has quit [Remote host closed the connection]

20:07 <alyssa> That's a compiler issue

20:07 <alyssa> MIDGARD_MESA_DEBUG=shaders?

20:08 davidlt has joined #panfrost

20:14 <shadeslayer> I gotta rebase on master first

20:19 davidlt has quit [Remote host closed the connection]

20:19 davidlt has joined #panfrost

20:23 <alyssa> I don't even understand how `brx.discard.false` could come about from my compiler

20:57 <shadeslayer> meh, I'll take a look tomorrow, I'm *exhausted*

21:30 raster has joined #panfrost

21:40 stikonas has quit [Read error: Connection reset by peer]

21:43 davidlt has quit [Remote host closed the connection]

21:59 davidlt has joined #panfrost

22:06 davidlt has quit [Remote host closed the connection]

22:07 davidlt has joined #panfrost

22:14 raster has quit [Remote host closed the connection]

22:24 * alyssa poooooooofs

22:25 <alyssa> What even

22:31 <HdkR> What

22:32 <alyssa> HdkR: I've been doing a scheduler and uh

22:32 <alyssa> my head hurts.

22:33 <alyssa> I have done nothing today except sort out regressions and I don't do any scheduling yet.

22:33 <HdkR> Whoa hey, that's what I'm doing all the time :D

22:47 <daniels> bbrezillon, shadeslayer: bo_cache _must_ be per-screen, since the BO namespace is per-fd

22:47 <daniels> so that needs to be mutexed

22:47 <daniels> all the job/flush pointers should definitely be per-ctx however

23:04 <anarsoul> panfrost already has bo cache?

23:04 <alyssa> Yeah

23:05 <anarsoul> we definitely need it too

23:09 davidlt has quit [Remote host closed the connection]

23:15 <alyssa> I don't even know what I'm doing anymore

23:16 <anarsoul> alyssa: have some rest?

23:16 <alyssa> 10 more minutes :p

23:19 <alyssa> Oh ffs

23:20 <alyssa> How am I

23:20 <alyssa> still

23:20 <alyssa> on the same bug

23:26 davidlt has joined #panfrost

23:27 <Lyude> alyssa: you know I still run into bugs that have taken me even longer then that :P

23:29 davidlt has quit [Remote host closed the connection]

23:35 hopetech has quit [Ping timeout: 245 seconds]

23:37 alyssa has quit [Ping timeout: 272 seconds]

23:37 hopetech has joined #panfrost

23:41 davidlt has joined #panfrost