davidlt__ has quit [Remote host closed the connection]
davidlt has joined #panfrost
davidlt has quit [Remote host closed the connection]
davidlt has joined #panfrost
davidlt has quit [Ping timeout: 268 seconds]
guillaume_g has joined #panfrost
megi has joined #panfrost
guillaume_g has quit [Remote host closed the connection]
<bbrezillon>
alyssa: I know what happened with the buggy patch I pushed => I tested before rebasing on master, and commit 4fa09329c104 ("pan/midgard: Use ralloc on ctx/blocks") turned things into ralloc() allocations
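(For context on the bug class bbrezillon hit: ralloc gives allocations a parent, and freeing the parent frees the whole subtree, so code that keeps calling plain free()/manual cleanup on pointers that a rebase turned into parented allocations double-frees. A minimal hypothetical sketch of the parented-allocation idea follows; this is illustrative only, not Mesa's ralloc implementation, and all names are made up.)

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal parented-allocation sketch (hypothetical, NOT Mesa's ralloc):
 * children are linked to a parent and freed when the parent is freed. */
struct node {
    struct node *first_child;
    struct node *next_sibling;
    /* user data would follow in a real implementation */
};

static struct node *pctx_alloc(struct node *parent)
{
    struct node *n = calloc(1, sizeof(*n));
    if (parent) {
        n->next_sibling = parent->first_child;
        parent->first_child = n;
    }
    return n;
}

/* Freeing a node frees its whole subtree -- so individual free() calls
 * on children are both unnecessary and wrong (double free). */
static void pctx_free(struct node *n)
{
    struct node *c = n->first_child;
    while (c) {
        struct node *next = c->next_sibling;
        pctx_free(c);
        c = next;
    }
    free(n);
}
```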
<bbrezillon>
I did push the rebased version to tomeu's tree though, wonder why CI didn't catch the issue
<bbrezillon>
alyssa: do you remember how you triggered the bug you fixed in 0c5633036195 ("panfrost: Workaround bug in partial update implementation") ?
<bbrezillon>
I'd like to reproduce it to understand how we could end up with an empty damage extent
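(One plausible way to end up with an empty damage extent, sketched below: if the extent is the bounding box of the damage rects and the min/max accumulators are seeded with sentinel values, then zero rects leaves an inverted, i.e. empty, box unless that case is handled explicitly. This is a hypothetical stand-alone model with made-up names, not the actual panfrost/Mesa damage code.)

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct rect { int x0, y0, x1, y1; };

/* Bounding box of n damage rects. With n == 0 the sentinel seeds
 * survive and the result is an inverted (empty) extent. */
static struct rect damage_extent(const struct rect *rects, int n)
{
    struct rect e = { INT_MAX, INT_MAX, INT_MIN, INT_MIN };
    for (int i = 0; i < n; i++) {
        if (rects[i].x0 < e.x0) e.x0 = rects[i].x0;
        if (rects[i].y0 < e.y0) e.y0 = rects[i].y0;
        if (rects[i].x1 > e.x1) e.x1 = rects[i].x1;
        if (rects[i].y1 > e.y1) e.y1 = rects[i].y1;
    }
    return e;
}

static bool extent_empty(struct rect e)
{
    return e.x0 >= e.x1 || e.y0 >= e.y1;
}
```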
pH5 has joined #panfrost
stepri01 has joined #panfrost
stikonas has joined #panfrost
stikonas has quit [Remote host closed the connection]
<narmstrong>
damn I lost track of panfrost and now it's in a really bad state, nothing works at all on t820
<tomeu>
narmstrong: we should have it on CI :)
<daniels>
that doesn't touch buffer age though ...
<daniels>
going 1/3/1/3 will just be an artifact of how Mesa picks a buffer to use
<raster>
bbrezillon: the good news is... now panres is the same
<raster>
daniels: dunno right now... but as a result EFL is not doing partial updates
<raster>
perhaps a reboot will help :)
<daniels>
sure, but that's been the case for quite a while
<robmur01>
Any chance anyone has an idea of why glmark2 would have started barfing "Error: eglCreateWindowSurface failed with error: 0x3009" with recent mesa/master?
<robmur01>
kmscube still works but now spews a load of "unhandled 67" (and similar, depending on mode)
<daniels>
robmur01: is this using GBM, Wayland, or X11?
<rcf>
I only get that with glmark2-es2-drm
<daniels>
raster: the last paragraph is relevant
<rcf>
X and Wayland are fine here.
<daniels>
rcf: i'm not 100% sure why glmark2 would fail, but BAD_MATCH could be caused by it picking an EGLConfig that's not compatible with the GBM surface format
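(The compatibility rule daniels is describing can be sketched offline: on GBM, an EGLConfig only matches a surface when the config's native visual ID equals the gbm_surface's fourcc format, otherwise surface creation fails with BAD_MATCH. The config table, IDs, and `pick_config` helper below are made up for illustration; real code would query EGL_NATIVE_VISUAL_ID with eglGetConfigAttrib().)

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Little-endian fourcc, as used for GBM/DRM formats. */
#define FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

struct fake_config { int id; uint32_t native_visual_id; };

/* Return the id of a config whose visual matches the surface format,
 * or -1 -- the case that would surface as EGL_BAD_MATCH (0x3009). */
static int pick_config(const struct fake_config *cfgs, size_t n,
                       uint32_t surface_format)
{
    for (size_t i = 0; i < n; i++)
        if (cfgs[i].native_visual_id == surface_format)
            return cfgs[i].id;
    return -1;
}
```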
<robmur01>
daniels: GBM, I assume (I'm driving it over SSH with no desktop running)
<rcf>
(well, weston, anyway -- sway continues to fail miserably as always)
<daniels>
raster: i can't really see why that would be the case, but you might want to look at when and how you do rendering, e.g. if Boris's change introduces more reload regions which make paints take longer, which pushes you into triple- rather than double-buffering
<raster>
well that's my next port of call - look at the changes and work out why this is the case :)
<raster>
fascinating...
<raster>
so we have 4 buffers available - only 2 of which have had a bo actually allocated
<raster>
2 of them are age 0 with no bo
<raster>
the other 2 are age 1, 3 and have bo's allocated
<raster>
it never picks these because they are age 0
<daniels>
not at all sure how you're getting age 3 when double-buffering
<raster>
also not sure
<raster>
just looking into what it's even seeing in making its decisions
<raster>
so one of the buffers keeps changing ages from 2 to 3
<raster>
the "look for new backbuffer" stuff is me tracing the logic at the top of get_back_bo()
<raster>
in platform_drm.c
<bbrezillon>
raster: my patch is probably wrong
<tomeu>
narmstrong: iirc those descriptors are allocated in the transient pool
<raster>
bbrezillon: the good news - panres is the same at swap time as well as when regions are set
<raster>
so it does fix that
<raster>
but it has a by-product
<bbrezillon>
yep, but it's probably breaking other things
<tomeu>
narmstrong: so I would check how the BOs backing those are allocated, and whether by chance they aren't being released or unmapped from the GPU
<bbrezillon>
raster: ->set_damage_region() is called after a swap/flush to reset the damage region
<bbrezillon>
maybe it's too early
<bbrezillon>
(before the aging update has taken place?)
<raster>
we get buffer age first then set regions
<bbrezillon>
tbh, I'm reading this piece of code for the first time, so there are probably a ton of things I didn't get right
<raster>
so once something asks for buffer age it has to be evaluated and locked in stone from that point until a swap
<raster>
so the query surface for buffer age should have done that job already and locked it down. based on buffer age we can calculate update regions (we then set them), then render, then swap (with damage rects too)
yann has joined #panfrost
<raster>
bbrezillon: reading your patch... it's all about passing ctx into various places just so you can call st_validate_state() when setting the damage region
<bbrezillon>
yep
<raster>
shouldn't we do this actually when we query buffer age instead?
<bbrezillon>
maybe
<bbrezillon>
we should probably do it in both places
<bbrezillon>
(the validate_state() should be cheap when called the 2nd time)
<raster>
yeah
<raster>
and it's really only done once per frame, as someone might submit 1 or more update rects before they begin their draw
<bbrezillon>
raster: hm, "1 or more", remember that new calls to set_damage_region() reset the whole thing
<raster>
they may be being silly and set one then decide they are wrong and set another
<raster>
and so on
<raster>
until they finally decide on what it needs to be
<bbrezillon>
all good
<bbrezillon>
just wanted to make it clear that it's not cumulative
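(The replace-not-accumulate semantics bbrezillon is pointing out can be modelled in a few lines; the `surface` struct and `set_damage_region` helper here are a hypothetical mini-model, not the EGL API or Mesa's implementation.)

```c
#include <assert.h>
#include <string.h>

#define MAX_RECTS 16

struct surface {
    int rects[MAX_RECTS][4]; /* x, y, width, height */
    int num_rects;
};

/* Each call REPLACES the stored damage region; it is not cumulative.
 * Calling with n == 0 resets the region to empty. */
static void set_damage_region(struct surface *s, int rects[][4], int n)
{
    if (n > 0)
        memcpy(s->rects, rects, (size_t)n * sizeof(s->rects[0]));
    s->num_rects = n;
}
```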
<raster>
it'd be silly to do that... but they might and it'd be "valid" i guess.
<raster>
i wonder if the spec allows for things like
<raster>
so each time you set 1 region then draw it then move on to the next then swap
<raster>
this would probably be horrible for pipelining and deferred rendering
<raster>
but...
<raster>
is it actually valid?
<bbrezillon>
we definitely don't support that
<narmstrong>
tomeu: It also needs another iopgtable fix from robmur01
<narmstrong>
I missed also this one
<bbrezillon>
raster: but that shouldn't be too hard to support
<bbrezillon>
does adding a validate_framebuffer() before querying the buffer age help?
<raster>
bbrezillon: i just wonder if it's valid or not though.
<raster>
bbrezillon: i'm trying to figure out how to even call st_validate_state or st_manager_validate_framebuffers from the buffer age query
<raster>
like dri2_query_buffer_age or dri2_drm_query_buffer_age
<raster>
how do I dig out st_context :)
<robmur01>
narmstrong: FWIW T820 is what I happen to have on my board currently, which has been reasonably happy until my glmark setup started falling apart :)
<narmstrong>
robmur01: pretty cool, I thought I was alone in the dark with the T820 on the Amlogic S912
<bbrezillon>
raster: hm, just looking at dri2_drm_query_buffer_age and I don't understand how the returned age can be wrong
<raster>
it is though :)
<bbrezillon>
you don't need to call the validate() function here, because the buffer age is directly stored in the platform-specific object
<raster>
thats why i put my printfs there as that was the core logic
<raster>
the above output is what i see and it explains why it returns what it does
<narmstrong>
robmur01: when starting kmscube (since the first panfrost driver), we systematically get a `panfrost d00c0000.gpu: Unhandled Page fault in AS0 at VA 0x000000000213C200`; `PAN_MESA_DEBUG=trace` doesn't show anything related
<bbrezillon>
the problem I had with the set_damage_region() function was that the new back BO info was not propagated to drawable->textures[BACK_LEFT]
<raster>
it doesn't explain how we get buffers with those ages in the cbufs
<robmur01>
narmstrong: even with the 4-level hack?
<narmstrong>
robmur01: yes, but only 1, each time we start kmscube, then all runs fine
<narmstrong>
robmur01: I have doubts about the Amlogic T820 integration; for mali_kbase to work correctly they need to write `GPU_PWR_KEY=0x2968A819` then `GPU_PWR_OVERRIDE1=0xfff | (0x20 << 16)`, but this doesn't seem to solve anything for panfrost
<tomeu>
narmstrong: guess one could print on buffer creation the GPU addr and the size, to figure out what BO that addr belongs to
<robmur01>
I do see a js fault (DATA_INVALID) plus a sched timeout when it starts
<robmur01>
but no unhandled page fault
<narmstrong>
the DATA_INVALID FAULT disappears with the 4-level hack
<robmur01>
(and the timeout may well be more to do with the speed of the FPGA)
<bbrezillon>
raster: looks like buffer age is incremented twice
<raster>
yeah
<raster>
but where and why... :)
<bbrezillon>
can you add traces to dri2_drm_swap_buffers() ?
<robmur01>
I guess it could be a flush timing thing, since the js fault comes immediately after mapping the relevant IOVA
<bbrezillon>
see how many times it's called and what the age values are
<tomeu>
robmur01: a timeout is expected after a fault, as we don't cancel jobs atm
<raster>
bbrezillon: was just adding to dri2_drm_swap_buffers
<raster>
interesting
<raster>
it does age++ to both buffers
<raster>
inside swap
<raster>
--panfrost: incr cbuf 2 [0xaaaad18c46a0] from 1 += 1
<raster>
--panfrost: incr cbuf 3 [0xaaaad1f6ab10] from 2 += 1
<raster>
that's tracking the age++ in dri2_drm_swap_buffers
<raster>
that's part of swapbuffers it seems
<raster>
so we have 2 buffers, age 1, 2
<raster>
and so this should lead to 2 buffers with ages 2, 3
<bbrezillon>
except one of them becomes the current buffer
<bbrezillon>
and is reset to 1
<bbrezillon>
(current == front)
<raster>
yeah
<bbrezillon>
so 3 and 1 sounds good, right?
<bbrezillon>
or should it be 2 and 1
<bbrezillon>
?
<raster>
2 and 1 i think
<raster>
so what i see is it's sometimes 1, 2 and sometimes 3, 1
<raster>
as opposed to always 2, 1 (or 1, 2) ..
<raster>
u get the idea
<raster>
but why?
<raster>
ok
<raster>
the problem is...
<raster>
it's the locking
<raster>
so it's the buffer that is always age 1 that is locked when it works right
<raster>
when it fails it's the buffer of age 2 that is sometimes locked
<raster>
sometimes the buffer of age 1
<raster>
thus since it's locked some other buffer is chosen
<raster>
food. brb
<raster>
now here is a question
<raster>
why does dri2_drm_swap_buffers ++ all buffers of age > 0
<raster>
shouldn't it increment the swapped buffer's age only?
<raster>
actually wait
<raster>
it SETS the current backbuffer after the swap to age 1
<bbrezillon>
it's what it does
<raster>
that's fine
<raster>
but why modify the others?
<raster>
they havent been "used"
<bbrezillon>
if (dri2_surf->color_buffers[i].age > 0)
<bbrezillon>
dri2_surf->color_buffers[i].age++;
<bbrezillon>
it only modifies ages on buffers that were used at some point
<bbrezillon>
which is correct
<bbrezillon>
and current is not the backbuffer
<bbrezillon>
it's the front buffer
<raster>
hmm actually this is just implicitly counting how long they sit in the pipeline for
<bbrezillon>
yes, so that if they ever get used again, you know how many frames passed since the last update
<raster>
well current becomes the back buffer there :)
<bbrezillon>
and can determine what has been updated in between
<bbrezillon>
current back buffer becomes the front buffer at swap time, which is correct I guess
<bbrezillon>
and if you need a proof that ->current is the front buffer, you can look at lock_front_buffer()
<bbrezillon>
:)
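(The aging scheme quoted above from dri2_drm_swap_buffers() can be simulated stand-alone: at swap, every buffer that has ever been used (age > 0) ages by one, and the buffer just submitted becomes the front buffer with age 1. With plain double buffering this settles into the age query returning 2 every frame, and the spare pool slots never get touched, matching raster's "2 of them are age 0 with no bo" observation. The pool model and `get_back` tie-breaking below are a hypothetical sketch, not the Mesa code.)

```c
#include <assert.h>

#define NUM_BUFFERS 4

struct pool {
    int age[NUM_BUFFERS];
    int front;   /* index of current front buffer, -1 initially */
};

/* Pick a back buffer: the unlocked-equivalent here is "not front";
 * prefer the highest age, ties broken by lowest index. */
static int get_back(const struct pool *p)
{
    int best = -1;
    for (int i = 0; i < NUM_BUFFERS; i++) {
        if (i == p->front)
            continue;
        if (best < 0 || p->age[i] > p->age[best])
            best = i;
    }
    return best;
}

static void swap(struct pool *p, int back)
{
    /* only buffers that were used at some point age */
    for (int i = 0; i < NUM_BUFFERS; i++)
        if (p->age[i] > 0)
            p->age[i]++;
    p->age[back] = 1;   /* just presented: becomes front, age 1 */
    p->front = back;
}
```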
<raster>
just a question
<raster>
to me the logic here on buffer choosing and aging just seems... unusual
<raster>
it's very liable to have buffer age variances if buffers stay locked for longer than we'd like
<bbrezillon>
I'm sure that will be a question for daniels :)
<raster>
i would have implemented it as more of a pipeline
<raster>
with a fixed # of buffers
<raster>
maybe start at 2
<raster>
then grow to 3 or 4
<raster>
and literally shuffle them down the array rather than pick one from a pool with age numbers
<bbrezillon>
isn't it less efficient to do that?
<raster>
so grow if needed, and maybe every now and again consider a shrink if we seem to have too many spare
<bbrezillon>
I mean, the older the frame the more content you'll have to update, right?
<raster>
well it's just copying the buffer header/struct data
<raster>
sure
<raster>
the older
<raster>
thus grow if needed
<raster>
e.g. we have 2 buffers locked/queued for display and so we have to have a 3rd if another swap is submitted
<raster>
this implicitly allocates buffers on demand
<bbrezillon>
isn't it what happens right now?
<raster>
so it kind of works that way
<raster>
but its a pool with no ordering
<raster>
or the ordering is the age
<raster>
kind of
<bbrezillon>
locked+age combination yes
<raster>
yeah
<bbrezillon>
pick the newest buf that's not locked
<bbrezillon>
which sounds like a good solution to me
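(The "locked + age" choice bbrezillon describes can be sketched: skip any buffer still locked, e.g. on-screen or queued for display, and take the best-aged one that's free. If the usually-chosen buffer happens to be locked this frame, a different buffer with a different age is returned, which is exactly the age variance raster is worried about. Illustrative model with made-up names, not the Mesa code.)

```c
#include <assert.h>
#include <stdbool.h>

struct buf { int age; bool locked; };

/* Return the index of the best unlocked buffer (highest age wins),
 * or -1 when every buffer is locked and we'd have to allocate/wait. */
static int choose_back(const struct buf *bufs, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (bufs[i].locked)
            continue;
        if (best < 0 || bufs[i].age > bufs[best].age)
            best = i;
    }
    return best;
}
```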
<raster>
it's just not how i'd do this, so it feels a bit odd to scrape through it and understand how it works, so i'm asking :)
<raster>
i'd literally keep it as a "ring buffer"
<raster>
and shuffle the frames through
<raster>
though i'd actually memcpy() the array down to shuffle so head is always at 0
<bbrezillon>
anyway, still does not explain why we end up with 1, 3 ages
<raster>
since these are small structs with a few flags and a ptr - that memcpy would be pretty moot
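(raster's alternative can be sketched as follows: treat the buffers as a pipeline ordered newest-first, and on each swap shuffle the small header structs down so the head is always index 0; a buffer's age then falls out of its array position instead of a per-buffer counter. This is a hypothetical design sketch of what he describes, with made-up names, not what Mesa does.)

```c
#include <assert.h>
#include <string.h>

#define MAX_BUFS 4

struct bufhdr { int id; /* flags + bo pointer in a real version */ };

struct ring {
    struct bufhdr bufs[MAX_BUFS];
    int count;
};

/* Present a buffer: everything shifts down one slot (the small-struct
 * memmove raster calls "pretty moot"), newest lands at index 0, and
 * the oldest entry falls off the end when the ring is full. */
static void ring_push_front(struct ring *r, struct bufhdr b)
{
    int n = r->count < MAX_BUFS ? r->count : MAX_BUFS - 1;
    memmove(&r->bufs[1], &r->bufs[0], (size_t)n * sizeof(r->bufs[0]));
    r->bufs[0] = b;
    if (r->count < MAX_BUFS)
        r->count++;
}

/* Age is just 1 + position in the pipeline; 0 means never presented. */
static int ring_age(const struct ring *r, int id)
{
    for (int i = 0; i < r->count; i++)
        if (r->bufs[i].id == id)
            return i + 1;
    return 0;
}
```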
<bbrezillon>
don't you have a NOOP swapbuf?
<raster>
hmm
<raster>
i doubt we have a noop swap buf
<raster>
we really try hard to do nothing if there is nothing to draw
<bbrezillon>
might be something doing an implicit swap maybe
<raster>
like lots of layers of higher level object/scene graph tracking to minimize update regions and only begin a render cycle at all if something needs a redraw
<raster>
it figures out obscured objects changing and turns that into nops etc.
<bbrezillon>
that's the only explanation I see for having the front buffer with age = 2 when doing double-buffering
<raster>
the buffer age changing each frame is tickling our logic to force full redraws each frame
<bbrezillon>
buffer age should definitely stay 2
<raster>
well each render cycle i only see 1 call to dri2_drm_swap_buffers
<raster>
wait up let me be sure of that
<bbrezillon>
and when you comment the validate_state() call in set_damage_region(), the aging is correct
<raster>
so yes
<raster>
that validate state causes this to happen
<raster>
commenting it out fixes it
<bbrezillon>
but you're sure dri2_drm_swap_buffers
<bbrezillon>
is called only once
<raster>
yup
<bbrezillon>
AFAICT, the only place that can swap front/back buffers is this function
<raster>
just threw in printfs to be sure
<raster>
only once per frame
<bbrezillon>
yes, once per frame
<bbrezillon>
and with the validate_state() call uncommented
<bbrezillon>
would be interesting to know where the get_back_bo() call comes from
<raster>
i was assuming that was from getting buffer age...
<bbrezillon>
it's definitely not the one done in drm_swap_buffers() (->back is probably already assigned when we call it from there)
<bbrezillon>
yes, that's what I suspect
<raster>
but you're right
<raster>
it could be other places
<bbrezillon>
oh no, actually I suspect it comes from the validate_state() call we added to the set_damage_region()
<raster>
errrrr
<raster>
no...
<raster>
dri2_drm_image_get_buffers()
<bbrezillon>
and I fear the ->lock_front_buffer()/->release_buffer() calls have not been done yet when we call validate_state() in the set_damage_region() path
<bbrezillon>
can you surround the added validate_state() call with printf("%s:%i\n", __func__, __LINE__); traces?
<raster>
done
<raster>
sec
<raster>
yeah
<raster>
the validate state calls image_get_buffers
<shadeslayer>
alyssa: and I see this : http://paste.ubuntu.com/p/qPpxMPKzQZ/ , job 0xffff4c1ced70 seems to be created by 0xaaaae5a0ca30 but freed by ctx 0xaaaae4a80990 ( line 38 )
<raster>
hmm
<raster>
that patch increases struct sizes... that's not nice
<raster>
it could have used shorts for shifts/sizes and at least not gotten more bloaty
<raster>
:)
<raster>
actually could have used chars and it'd have been fine and used half the size then - even better
<raster>
:)
megi has joined #panfrost
<daniels>
bbrezillon, raster: tbh I can't really piece the backlog together - is there a tl;dr somewhere?
<raster>
daniels: i could tell you in about 2h :)
<daniels>
:)
<raster>
but the simple version
<daniels>
tomeu: ^ do you know why shadeslayer is seeing panfrost_jobs migrate between ctx? seems like it shouldn't be possible :\
<raster>
the "Fix" to add a st_validate_state() in dri2_set_damage_region() does fix a core issue
<raster>
it makes sure the right panfrost resource (panres) is being used for setting the update regions as well as then being used in the swapbuffer later
<raster>
without that the regions are attached to something else not used in the swap, thus it gets them wrong
<raster>
probably to do with different code paths driven from higher up
<raster>
anyway
<raster>
the problem is this extra validate is then called as part of a swapbuffer()
<raster>
because the update regions need to be reset to "empty"
<shadeslayer>
the only other explanation I can come up with is that the output is just messed up because of threading, and that the same piece of memory gets freed by one context and alloc'd to another ctx?
<raster>
so this is called with 0 rects to do that
<raster>
the issue now is that this then bumps up the buffer age of one of the buffers as a side-effect
<raster>
inside of the swapbuffer
<raster>
(in addition to all the other buffer age fun that is done here that was correct)
<raster>
and this causes us to have 1 buffer wildly out of step
<raster>
thus ages of 1, 3, 1, 3 instead of 2, 2, 2, 2
<raster>
so realistically the regions need a reset without bumping any buffer ages around (i.e. drop the validate in this case)
davidlt has joined #panfrost
<raster>
bbrezillon: oh... and now do some more rendering involving fbo's and i get rendering bugs again even with if (nrects > 0) disabling the validate
<raster>
:(
<raster>
so much more to go it seems
<bbrezillon>
raster: :-(
<raster>
i was using a simpler test case so far
<raster>
i just expanded it
<raster>
and well... flickery black blobs appear
<raster>
and i know this will be using fbo's to render to
<raster>
so .... :/
<bbrezillon>
we'd need to trace the various calls again
<raster>
yeah
<raster>
damn
<raster>
i wont have time today
<bbrezillon>
np
davidlt has quit [Remote host closed the connection]
davidlt has joined #panfrost
tbueno has joined #panfrost
<shadeslayer>
yeah, http://paste.ubuntu.com/p/3ghrvzC5c6/, this still makes no sense, TID 9112 frees job 0xffff78171510 when it was allocated by TID 9158
davidlt has quit [Remote host closed the connection]
raster has quit [Remote host closed the connection]
<bbrezillon>
shadeslayer: can you protect the create/free section with a mutex (with the trace inside) so that we are sure the traces are accurate?
<shadeslayer>
bbrezillon: sure can do
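(The pattern bbrezillon is asking for looks roughly like this: take one mutex around both the create and free paths, with the trace emitted while the lock is held, so the logged order matches the real alloc/free order across threads. The `job` struct, counters, and function names below are placeholders for illustration, not panfrost types.)

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;
static int live_jobs;   /* tracked under job_lock */

struct job { int id; };

static struct job *create_job(int id)
{
    struct job *j = malloc(sizeof(*j));
    j->id = id;
    pthread_mutex_lock(&job_lock);
    live_jobs++;
    /* trace inside the lock: lines can't interleave or reorder */
    fprintf(stderr, "create job %p (id %d) tid %lu\n",
            (void *)j, id, (unsigned long)pthread_self());
    pthread_mutex_unlock(&job_lock);
    return j;
}

static void free_job(struct job *j)
{
    pthread_mutex_lock(&job_lock);
    live_jobs--;
    fprintf(stderr, "free job %p (id %d) tid %lu\n",
            (void *)j, j->id, (unsigned long)pthread_self());
    pthread_mutex_unlock(&job_lock);
    free(j);
}
```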
pH5 has quit [Quit: bye]
davidlt has joined #panfrost
yann has quit [Ping timeout: 245 seconds]
raster has joined #panfrost
raster has quit [Remote host closed the connection]
raster has joined #panfrost
raster has quit [Ping timeout: 248 seconds]
raster has joined #panfrost
raster has quit [Ping timeout: 245 seconds]
<bbrezillon>
daniels: could you have a look at http://code.bulix.org/9v8bhi-850562 and tell me if that makes sense (I doubt it does, but I also don't know how to fix the problem reported by raster)
yann has joined #panfrost
<bbrezillon>
hm, I could do something simpler by letting drivers reset the damage region at flush time
<shadeslayer>
tomeu: I should lock out the entirety of get_job and free_job, right?