alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - https://gitlab.freedesktop.org/panfrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
davidlt has quit [Ping timeout: 245 seconds]
davidlt has joined #panfrost
megi has quit [Ping timeout: 248 seconds]
davidlt has quit [Ping timeout: 245 seconds]
davidlt has joined #panfrost
rcf has quit [Quit: WeeChat 2.4]
rcf has joined #panfrost
davidlt_ has joined #panfrost
davidlt has quit [Ping timeout: 245 seconds]
sravn has quit [Quit: WeeChat 2.4]
davidlt__ has joined #panfrost
davidlt_ has quit [Ping timeout: 272 seconds]
davidlt__ has quit [Remote host closed the connection]
davidlt has joined #panfrost
davidlt has quit [Remote host closed the connection]
davidlt has joined #panfrost
davidlt has quit [Ping timeout: 268 seconds]
guillaume_g has joined #panfrost
megi has joined #panfrost
guillaume_g has quit [Remote host closed the connection]
<bbrezillon> alyssa: I know what happened with the buggy patch I pushed => I tested before rebasing on master, and commit 4fa09329c104 ("pan/midgard: Use ralloc on ctx/blocks") turned those allocations into ralloc() allocations
<bbrezillon> I did push the rebased version to tomeu's tree though, wonder why CI didn't catch the issue
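[editor's note: the failure mode bbrezillon describes is an allocator mismatch — once ctx/blocks moved to ralloc, any leftover free()/calloc() against those objects corrupts the heap. A minimal sketch of the relevant ralloc API from Mesa's util/ralloc.h; the struct name is hypothetical:]

```c
#include "util/ralloc.h"   /* Mesa's hierarchical allocator */

struct midgard_block { int id; };   /* hypothetical stand-in */

static void example(void)
{
   /* Parent context: freeing it frees every child allocation. */
   void *mem_ctx = ralloc_context(NULL);

   /* Child allocation owned by mem_ctx. */
   struct midgard_block *block = ralloc(mem_ctx, struct midgard_block);
   block->id = 0;

   /* Correct: tear down the whole hierarchy in one call. */
   ralloc_free(mem_ctx);

   /* free(block) here would be undefined behaviour: ralloc'd
    * pointers must never reach the libc allocator, which is exactly
    * the mismatch a patch written against the pre-ralloc code would
    * introduce after the rebase. */
}
```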
<tomeu> guess it would be good to find out why
yann has quit [Ping timeout: 272 seconds]
<bbrezillon> tomeu: weston crashed here https://gitlab.freedesktop.org/tomeu/mesa/-/jobs/540727 and none of the dEQP tests were run
<bbrezillon> yet the CI is green
guillaume_g has joined #panfrost
<tomeu> cool, we should fix that
<bbrezillon> alyssa: do you remember how you triggered the bug you fixed in 0c5633036195 ("panfrost: Workaround bug in partial update implementation") ?
<bbrezillon> I'd like to reproduce it to understand how we could end up with an empty damage extent
pH5 has joined #panfrost
stepri01 has joined #panfrost
stikonas has joined #panfrost
stikonas has quit [Remote host closed the connection]
<narmstrong> damn I lost track of panfrost and now it's in a really bad state, nothing works at all on t820
<tomeu> narmstrong: we should have it on CI :)
<tomeu> bbrezillon: tried to address that problem like this, but for some reason it isn't working: https://gitlab.freedesktop.org/tomeu/mesa/commit/2f7fd701791946bbcf17e3a9f40c49422b8aa0fd
<tomeu> any ideas why?
<narmstrong> tomeu: yep, but I'll need to fix the driver first, then find what caused the mesa part to stop working
<narmstrong> then CI will be cool !
<tomeu> ah, the driver isn't even probing?
<narmstrong> tomeu: apart from the MMU 33-bit issue and the power control override issue, the driver shows an error while powering the cores; bisecting it
<narmstrong> then I'll focus on the `panfrost d00c0000.gpu: Unhandled Page fault in AS0 at VA 0x000000000863C580` :-p
<tomeu> ok, for the latter, PAN_MESA_DEBUG=trace could give some info on what that pointer points to
<bbrezillon> tomeu: because you fork when launching weston?
<narmstrong> tomeu: smells pretty bad when the fault is on the `mali_job_descriptor_header`, no?
raster has joined #panfrost
<raster> bbrezillon: we've advanced. your patch now nicely segvs... let me figure it out
<raster> need to drop optimizations for gdb to work now
davidlt has joined #panfrost
davidlt has quit [Read error: Connection reset by peer]
<raster> well it seems it's my debug printfs segfaulting :)
<raster> bbrezillon: ok. we have problems... maybe i have too many things applied
<raster> let me undo stuff
robmur01 has joined #panfrost
<raster> i'm now seeing buffer age swap between 1 and 3
<raster> which will be masking/hiding issues
<raster> yeah. ... your changes now break buffer age
<raster> specifically http://code.bulix.org/jcn8f9-849883 does
<raster> i see buffer age go 1, 3, 1, 3, 1, 3
megi has quit [Ping timeout: 246 seconds]
<daniels> that doesn't touch buffer age though ...
<daniels> going 1/3/1/3 will just be an artifact of how Mesa picks a buffer to use
<raster> bbrezillon: the good news is... now panres is the same
<raster> daniels: dunno right now... but as a result EFL is not doing partial updates
<raster> perhaps a reboot will help :)
<daniels> sure, but that's been the case for quite a while
<robmur01> Any chance anyone has an idea of why glmark2 would have started barfing "Error: eglCreateWindowSurface failed with error: 0x3009" with recent mesa/master?
<robmur01> kmscube still works but now spews a load of "unhandled 67" (and similar, depending on mode)
<daniels> robmur01: is this using GBM, Wayland, or X11?
<rcf> I only get that with glmark2-es2-drm
<daniels> raster: the last paragraph is relevant
<rcf> X and Wayland are fine here.
<daniels> rcf: i'm not 100% sure why glmark2 would fail, but BAD_MATCH could be caused by it picking an EGLConfig that's not compatible with the GBM surface format
<robmur01> daniels: GBM, I assume (I'm driving it over SSH with no desktop running)
<rcf> (well, weston, anyway -- sway continues to fail miserably as always)
<raster> daniels: well unfortunately... http://code.bulix.org/jcn8f9-849883 is definitely the culprit
<raster> with it - buffer age swaps 1, 3, 1, 3, 1, 3
<raster> without it (the only change i made) it's back to 2, 2, 2, 2, 2
<raster> :(
<daniels> raster: i can't really see why that would be the case, but you might want to look at when and how you do rendering, e.g. if Boris's change introduces more reload regions which make paints take longer, which pushes you into triple- rather than double-buffering
<raster> well that's my next port of call - look at the changes and work out why this is the case :)
<raster> fascinating...
<raster> so we have 4 buffers available - only 2 of which have had a bo actually allocated
<raster> 2 of them are age 0 with no bo
<raster> the other 2 are age 1, 3 and have bo's allocated
<raster> it never picks these because they are age 0
<daniels> not at all sure how you're getting age 3 when double-buffering
<raster> also not sure
<raster> just looking into what it's even seeing when making its decisions
<raster> so one of the buffers keeps changing ages from 2 to 3
<raster> every 2nd frame
<raster> oh wait no. sorry
<raster> not quite. more complicated than that
<raster> the "look for new backbuffer" stuff is me tracing the logic at the top of get_back_bo()
<raster> in platform_drm.c
<bbrezillon> raster: my patch is probably wrong
<tomeu> narmstrong: iirc those descriptors are allocated in the transient pool
<raster> bbrezillon: the good news - panres is the same at swap time as well as when regions are set
<raster> so it does fix that
<raster> but it has a by-product
<bbrezillon> yep, but it's probably breaking other things
<tomeu> narmstrong: so I would check how the BOs backing those are allocated, and whether by chance aren't being released or unmapped from the GPU
<bbrezillon> raster: ->set_damage_region() is called after a swap/flush to reset the damage region
<bbrezillon> maybe it's too early
<bbrezillon> (before the aging update has taken place?)
<raster> we get buffer age first then set regions
<bbrezillon> tbh, I'm reading this piece of code for the first time, so there are probably a ton of things I didn't get right
<raster> so once something asks for buffer age it has to be evaluated and locked in stone from that point until a swap
<raster> so the query surface for buffer age should have done that job already and locked it down. based on buffer age we can calculate update regions (we then set them), then render, then swap (with damage rects too )
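[editor's note: the flow raster describes maps onto the standard EGL extension entry points (EGL_EXT_buffer_age, EGL_KHR_partial_update, EGL_KHR_swap_buffers_with_damage). A hedged sketch — extension checks and the eglGetProcAddress lookups real code needs are omitted, and the damage rectangle is made up for illustration:]

```c
#include <EGL/egl.h>
#define EGL_EGLEXT_PROTOTYPES
#include <EGL/eglext.h>

/* One frame of the query-age -> set-damage -> draw -> swap flow.
 * Assumes dpy/surf are valid and the extensions were verified. */
static void draw_frame(EGLDisplay dpy, EGLSurface surf)
{
   /* 1. Query the buffer age FIRST; it is then locked in until the
    *    next swap. Age 0 means undefined content: full redraw. */
   EGLint age = 0;
   eglQuerySurface(dpy, surf, EGL_BUFFER_AGE_EXT, &age);

   /* 2. From `age`, compute the region that must be repainted
    *    (here a made-up rectangle: x, y, width, height). */
   EGLint rect[4] = { 0, 0, 64, 64 };

   /* 3. Declare the damage region before rendering... */
   eglSetDamageRegionKHR(dpy, surf, rect, 1);

   /* 4. ...draw... */

   /* 5. ...then swap, handing the damage rects to the compositor. */
   eglSwapBuffersWithDamageKHR(dpy, surf, rect, 1);
}
```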
yann has joined #panfrost
<raster> bbrezillon: reading your patch... it's all about passing ctx in to various places just so you can call st_validate_state() when setting the damage region
<bbrezillon> yep
<raster> shouldnt we do this actually when we query buffer age instead?
<bbrezillon> maybe
<bbrezillon> we should probably do it in both places
<bbrezillon> (the validate_state() should be cheap when called the 2nd time)
<raster> yeah
<raster> and it's really only done once per frame, as someone might submit 1 or more update rects before they begin their draw
<bbrezillon> raster: hm, "1 or more", remember that new calls to set_damage_region() reset the whole thing
<raster> they may be being silly and set one then decide they are wrong and set another
<raster> and so on
<raster> until they finally decide on what it needs to be
<bbrezillon> all good
<bbrezillon> just wanted to make it clear that it's not cumulative
<raster> it'd be silly to do that... but they might and it'd be "valid" i guess.
<raster> i wonder if the spec allows for things like
<raster> set_region() draw(); set_region(); draw(); swap();
<raster> so each time you set 1 region then draw it then move on to the next then swap
<raster> this would probably be horrible for pipelining and deferred rendering
<raster> but...
<raster> is it actually valid?
<bbrezillon> we definitely don't support that
<narmstrong> tomeu: It also needs another iopgtable fix from robmur01
<narmstrong> I also missed this one
<bbrezillon> raster: but that shouldn't be too hard to support
<bbrezillon> does adding a validate_framebuffer() before querying the buffer age help?
<raster> bbrezillon: i just wonder if it's valid or not though.
<raster> bbrezillon: i'm trying to figure out how to even call st_validate_state or st_manager_validate_framebuffers from the buffer age query
<raster> like dri2_query_buffer_age or dri2_drm_query_buffer_age
<raster> how do I dig out the st_context :)
<robmur01> narmstrong: FWIW T820 is what I happen to have on my board currently, which has been reasonably happy until my glmark setup started falling apart :)
<narmstrong> robmur01: pretty cool, I thought I was alone in the dark with the T820 on the Amlogic S912
<bbrezillon> raster: hm, just looking at dri2_drm_query_buffer_age and I don't understand how the returned age can be wrong
<raster> it is though :)
<bbrezillon> you don't need to call the validate() function here, because the buffer age is directly stored in the platform-specific object
<raster> i put printf's in get_back_bo
<bbrezillon> dri2_drm_query_buffer_age() calls get_back_bo()
<raster> my specific hacks/printfs
<raster> so you can grab the same data
<raster> yeah
<raster> that's why i put my printfs there, as that was the core logic
<raster> the above output is what i see and it explains why it returns what it does
<narmstrong> robmur01: when starting kmscube (since the first panfrost driver), we systematically get a `panfrost d00c0000.gpu: Unhandled Page fault in AS0 at VA 0x000000000213C200`, and `PAN_MESA_DEBUG=trace` doesn't show anything related
<bbrezillon> the problem I had with the set_damage_region() function was that the new back BO info was not propagated to drawable->textures[BACK_LEFT]
<raster> it doesn't explain how we get buffers with those ages in the cbufs
<robmur01> narmstrong: even with the 4-level hack?
<narmstrong> robmur01: yes, but only 1 each time we start kmscube, then all runs fine
<narmstrong> robmur01: I have doubts about the Amlogic T820 integration: for mali_kbase to work correctly they need to write `GPU_PWR_KEY=0x2968A819` then `GPU_PWR_OVERRIDE1=0xfff | (0x20 << 16)`, but this doesn't seem to solve anything for panfrost
<tomeu> narmstrong: guess one could print on buffer creation the GPU addr and the size, to figure out what BO that addr belongs to
<robmur01> I do see a js fault (DATA_INVALID FAULT) plus a sched timeout when it starts
<robmur01> but no unhandled page fault
<narmstrong> the DATA_INVALID FAULT disappears with the 4-level hack
<robmur01> (and the timeout may well be more to do with the speed of the FPGA)
<bbrezillon> raster: looks like buffer age is incremented twice
<raster> yeah
<raster> but where and why... :)
<bbrezillon> can you add traces to dri2_drm_swap_buffers() ?
<robmur01> I guess it could be a flush timing thing, since the js fault comes immediately after mapping the relevant IOVA
<bbrezillon> see how many times it's called and what the age values are
<tomeu> robmur01: a timeout is expected after a fault, as we don't cancel jobs atm
<raster> bbrezillon: was just adding to dri2_drm_swap_buffers
<raster> interesting
<raster> it does age++ to both buffers
<raster> inside swap
<raster> --panfrost: incr cbuf 2 [0xaaaad18c46a0] from 1 += 1
<raster> --panfrost: incr cbuf 3 [0xaaaad1f6ab10] from 2 += 1
<raster> thats tracking the age++ in dri2_drm_swap_buffers
<raster> that's part of swapbuffers it seems
<raster> so we have 2 buffers, age 1, 2
<raster> and so this should lead to 2 buffer with age 2, 3
<bbrezillon> except one of them becomes the current buffer
<bbrezillon> and is reset to 1
<bbrezillon> (current == front)
<raster> yeah
<bbrezillon> so 3 and 1 sounds good, right?
<bbrezillon> or should it be 2 and 1
<bbrezillon> ?
<raster> 2 and 1 i think
<raster> so what i see is it's sometimes 1, 2 and sometimes 3, 1
<raster> as opposed to always 2, 1 (or 1, 2) ..
<raster> u get the idea
<raster> but why?
<raster> ok
<raster> the problem is...
<raster> it's the locking
<raster> so it's the buffer that is always age 1 that is locked when it works right
<raster> when it fails it's the buffer of age 2 that is sometimes locked
<raster> sometimes the buffer of age 1
<raster> thus since it's locked some other buffer is chosen
<raster> food. brb
<raster> now here is a question
<raster> why does dri2_drm_swap_buffers ++ all buffers of age > 0
<raster> shouldn't it increment only the swapped buffer's age?
<raster> actually wait
<raster> it SETS the current backbuffer after the swap to age 1
<bbrezillon> it's what it does
<raster> that's fine
<raster> but why modify the others?
<raster> they havent been "used"
<bbrezillon> if (dri2_surf->color_buffers[i].age > 0)
<bbrezillon>     dri2_surf->color_buffers[i].age++;
<bbrezillon> it only modifies ages on buffers that were used at some point
<bbrezillon> which is correct
<bbrezillon> and current is not the backbuffer
<bbrezillon> it's the front buffer
<raster> hmm actually this is just a way of implicitly counting how long they sit in the pipeline
<bbrezillon> yes, so that if they ever get used again, you know how many frames have passed since the last update
<raster> well current becomes the back buffer there :)
<bbrezillon> and can determine what has been updated in between
<bbrezillon> current back buffer becomes the front buffer at swap time, which is correct I guess
<bbrezillon> and if you need a proof that ->current is the front buffer, you can look at lock_front_buffer()
<bbrezillon> :)
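[editor's note: to keep the backlog readable, here is a paraphrased sketch of the swap-time aging the two are stepping through. The types are hypothetical stand-ins for Mesa's platform_drm.c structures, not the real ones:]

```c
#include <stddef.h>

#define NUM_BUFFERS 4

struct color_buffer { int age; int locked; };

struct surface {
   struct color_buffer color_buffers[NUM_BUFFERS];
   struct color_buffer *back;     /* buffer being rendered to */
   struct color_buffer *current;  /* front buffer, locked for scanout */
};

/* Paraphrase of the aging step in dri2_drm_swap_buffers(): age every
 * buffer that has ever been used, then promote back to front. */
static void swap_ages(struct surface *s)
{
   for (size_t i = 0; i < NUM_BUFFERS; i++)
      if (s->color_buffers[i].age > 0)
         s->color_buffers[i].age++;

   s->current = s->back;   /* back becomes the front buffer... */
   s->current->age = 1;    /* ...and is now exactly one frame old */
   s->back = NULL;         /* a new back buffer is picked lazily by
                              get_back_bo() on the next frame */
}
```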
<raster> just a question
<raster> the logic here on buffer choosing and aging just seems... unusual to me
<raster> it's very liable to have buffer age variances if buffers stay locked for longer than we'd like
<bbrezillon> I'm sure that will be a question for daniels :)
<raster> i would have implemented it more as a pipeline
<raster> with a fixed # of buffers
<raster> maybe start at 2
<raster> then grow to 3 or 4
<raster> and literally shuffle them down the array rather than pick one from a pool with age numbers
<bbrezillon> isn't it less efficient to do that?
<raster> so grow if needed, and maybe every now and again consider a shrink if we seem to have too many spare
<bbrezillon> I mean, the older the frame the more content you'll have to update, right?
<raster> well it's just copying the buffer header/struct data
<raster> sure
<raster> the older
<raster> thus grow if needed
<raster> e.g. we have 2 buffers locked/queued for display and so we have to have a 3rd if another swap is submitted
<raster> this implicitly allocates buffers on demand
<bbrezillon> isn't it what happens right now?
<raster> so it kind of works that way
<raster> but its a pool with no ordering
<raster> or the ordering is the age
<raster> kind of
<bbrezillon> locked+age combination yes
<raster> yeah
<bbrezillon> pick the newest buf that's not locked
<bbrezillon> which sounds like a good solution to me
<raster> it's just not how i'd do this, so it feels a bit odd to scrape through it and understand how it works, so i'm asking :)
<raster> i'd literally keep it as a "ring buffer"
<raster> and shuffle the frames through
<raster> though i'd actually memcpy() the array down to shuffle so head is always at 0
<bbrezillon> anyway, still does not explain why we end up with 1, 3 ages
<raster> since these are small structs with a few flags and a ptr - that memcpy would be pretty moot
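[editor's note: raster's alternative, sketched with hypothetical types; note memmove() rather than memcpy(), since the source and destination ranges overlap when shuffling in place:]

```c
#include <string.h>

#define MAX_BUFS 4

struct frame_buf { void *bo; int locked; };

/* Keep frames ordered newest-first, so a buffer's age is simply its
 * array index + 1 and no per-buffer age counters are needed. */
static void push_frame(struct frame_buf ring[MAX_BUFS],
                       struct frame_buf newest)
{
   /* Shuffle everything down one slot; the structs are just a
    * pointer and some flags, so the copy is effectively free. */
   memmove(&ring[1], &ring[0], (MAX_BUFS - 1) * sizeof(ring[0]));
   ring[0] = newest;   /* head is always at index 0 */
}
```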
<bbrezillon> don't you have a NOOP swapbuf?
<raster> hmm
<raster> i doubt we have a noop swap buf
<raster> we really try hard to do nothing if there is nothing to draw
<bbrezillon> might be something doing an implicit swap maybe
<raster> like lots of layers of higher level object/scene graph tracking to minimize update regions and only begin a render cycle at all if something needs a redraw
<raster> it figures out obscured objects changing and turns that into nops etc.
<bbrezillon> that's the only explanation I see for having the front buffer with age = 2 when doing double-buffering
<raster> the buffer age changing each frame is tickling our logic to force full redraws each frame
<bbrezillon> buffer age should definitely stay 2
<raster> well each render cycle i only see 1 call to dri2_drm_swap_buffers
<raster> wait up let me be sure of that
<bbrezillon> and when you comment the validate_state() call in set_damage_region(), the aging is correct
<raster> so yes
<raster> that validate state causes this to happen
<raster> commenting it out fixes it
<bbrezillon> but you're sure dri2_drm_swap_buffers
<bbrezillon> is called only once
<raster> yup
<bbrezillon> AFAICT, the only place that can swap front/back buffers is this function
<raster> just throw in printfs to be sure
<raster> only once per frame
<bbrezillon> yes, once per frame
<bbrezillon> and with the validate_state() call uncommented
<raster> uncommented once per frame
<bbrezillon> that's crazy
<raster> gah
<raster> pastebin quota hit
<bbrezillon> raster: http://code.bulix.org/
<raster> mega-sized-fonts.bulix.org :)
<bbrezillon> wait, age is printed in get_back_bo()
<bbrezillon> so just after age has been incremented
<raster> the "look for new backbuffer" is in get_back_bo()
<bbrezillon> and front/back have not been swapped yet
<raster> and yes
<bbrezillon> getting 2 on the locked buf is fine, then
<raster> when it goes from having 1 to 2 buffers in the pool
<raster> 0xaaaafc7c3f90 is still locked
<bbrezillon> can you move the buffer aging dump to the end of dri2_drm_swap_buffers()?
<raster> let me dump that
rcf has quit [Quit: WeeChat 2.4]
<bbrezillon> raster: can you add a trace to lock_front_buffer()?
<bbrezillon> with the age and pointer of the buffer being locked
<bbrezillon> raster: buffer @ idx 2 should be locked after the 2nd swap
<bbrezillon> but the get_back_bo() happening just after this swap shows that buffer @ idx 3 is locked
<raster> --panfrost: buf [0xaaaac1a978f0] 2 locked=0, age=1
<raster> --panfrost: chosen [0xaaaac1a978f0] age=1
<raster> --panfrost: buf [0xaaaac1454340] 3 locked=1, age=2
<raster> that bit?
<raster> and it's seemingly not locked?
<bbrezillon> would be interesting to know where the get_back_bo() call comes from
<raster> i was assuming that was from getting buffer age...
<bbrezillon> it's definitely not the one done in drm_swap_buffers() (->back is probably already assigned when we call it from there)
<bbrezillon> yes, that's what I suspect
<raster> but you're right
<raster> it could be other places
<bbrezillon> oh no, actually I suspect it comes from the validate_state() call we added to the set_damage_region()
<raster> errrrr
<raster> no...
<raster> dri2_drm_image_get_buffers()
<bbrezillon> and I fear the ->lock_front_buffer()/->release_buffer() calls have not been done yet when we call validate_state() in the set_damage_region() path
<bbrezillon> can you surround the added validate_state() call with printf("%s:%i\n", __func__, __LINE__); traces?
<raster> done
<raster> sec
<raster> yeah
<raster> the validate state calls image_get_buffers
<bbrezillon> ok, so here is the problem
<bbrezillon> new front/back buffers have not been locked/released yet
<bbrezillon> and get_back_bo() fails to pick the right BO
<bbrezillon> can you check when lock_front_buffer()/release_buffer() are called and who calls them?
<raster> by a read of the logs the release only happens at the next swap
<raster> which makes sense
<raster> the lock/release logs are there in that paste
<raster> with the buffer ptr
<raster> well bo ptr
<raster> mesa doesn't happen to have a handy "dump a bt to stdout now" macro?
<raster> or func?
<bbrezillon> actually, they happen just after the set_damage_region()
<raster> yeah
<raster> they do
<raster> why
herbmillerjr has quit [Remote host closed the connection]
herbmillerjr has joined #panfrost
<raster> that's us doing that
<bbrezillon> those are caused by the implicit ->set_damage_region() calls
<bbrezillon> used to reset the damage region
<raster> not the lock/unlocks
<bbrezillon> raster: can you make the call to validate_state() conditional on if (nrects > 0)
<bbrezillon> just to validate this theory
<raster> we would be locking the new frontbuffer and releasing the old one when we swap
<bbrezillon> that's what happens
<raster> indeed
<raster> nrects > 0 fixes it
<bbrezillon> but it happens after dri2_swap_buffers() has returned
<bbrezillon> and we are calling ->set_damage_region() from inside dri2_swap_buffers()
<raster> we're calling it from inside?
<raster> oh...
<raster> well then
<raster> :)
<raster> i didn't notice that
<bbrezillon> still don't know how to solve that properly
<raster> we're setting region with 0 rects?
<raster> (from inside)
<bbrezillon> yes, that's the semantics we use to mean "reset damage region"
<raster> yeah
<raster> thats the definition
<raster> but in THIS case we do not want to do another validate
<raster> just reset
<raster> right?
<bbrezillon> it's not that simple
<bbrezillon> if you look at the spec, ->set_damage_region(0, NULL) is also valid
<raster> oh i know
<raster> it resets
<raster> well the egl call - yes
<bbrezillon> last time I read the spec it was not so clear
<raster> but in this case the region implicitly resets after a swap
<raster> oh i read the spec quickly
<raster> it does say that above
<bbrezillon> I understood it as "empty damage region"
<raster> so when called from OUTSIDE mesa - yes
<raster> this is right
<raster> but this set damage region is being called implicitly from inside as part of swapping
<raster> and thus is also implicitly shuffling up buffer ages etc.
<bbrezillon> yes
<raster> so its a different path
<bbrezillon> because of the validate_state() addition
<raster> and in this case ... you want to reset the region without age++
<bbrezillon> which I'm not sure is correct :)
<raster> my take is getting buffer age should validate
<raster> not setting region
<raster> imho
<raster> basically... everyone is going to use these together
<raster> they will get age FIRST to figure out regions
<raster> then set them
<bbrezillon> the reason I added this validate_state() in the first place was to update drawable->textures[BACK_LEFT]
<raster> unless of course they already know its a full re-draw anyway
<raster> in which case everything is already by default set up right region-wise
<bbrezillon> which is used by the set_damage_region() logic to pick the right resource
<raster> well at least you've made progress
<raster> you now know what specifically is the issue
<raster> and why
<raster> the question now is... what is the better thing to do that doesn't cause issues? :)
<bbrezillon> yep
<bbrezillon> maybe daniels knows :)
<raster> as it stands... i'd have a special case for resetting dmg region
<raster> or reorder things so when it's called it doesn't mess things up
<bbrezillon> that's a solution, but I'm still not sure calling validate_state() from set_damage_region() is the right thing to do
<raster> damn
<raster> mesa has 2 dri2_set_damage_region()'s
<raster> :(
<raster> ok
<raster> really really really super dirty
<raster> use rects -1 or rect ptr of -1 (fffffffffffff...) to mean reset without validate
<raster> otherwise extend the function to have more params like flags .... :)
<alyssa> bbrezillon: The empty damage bug was triggering pretty much everywhere, iirc?
<alyssa> Like, no, I don't remember how, but it was pretty quick after playing with your branch
<alyssa> robmur01: ......Panfrost w/ an FPGA? Dang. :P
<bbrezillon> alyssa: ok, I'll try to reproduce
herbmillerjr has quit [Ping timeout: 246 seconds]
<robmur01> just don't ask me how much they cost (beyond "a lot") :P
herbmillerjr has joined #panfrost
<raster> robmur01: trying midgard or bifrost bit files?
<robmur01> raster: yes - I've managed to nab some of each
<raster> robmur01: same... but as i have a juno r0... my dtb is broken and i only have a dtb for r1+bifrost
<raster> so i have to look at hacking my own up
<raster> and then... well i decided i might just try the hikey960 again
<raster> after spending like 2 weeks getting a juno to boot reliably :)
<raster> (don't ask. front usb ports, uefi vs tftp vs uboot vs usb hdd stopping working vs ....)
<robmur01> hooray for meetings; gave me a chance to bisect my glmark woes to https://gitlab.freedesktop.org/mesa/mesa/commit/7b4ed2b513efad86616e932eb4bca20557addc78
<robmur01> seems plausible; I guess it's not a panfrost issue then
<raster> yargh
<raster> that was a rejig of how color masks are stored
<raster> i wonder if there is a mystery typo/off-by-one there
<raster> :)
<alyssa> robmur01: Oh!
<alyssa> anarsoul: also blamed that commit for them having issues
<alyssa> with lima
<shadeslayer> alyssa: I'm seeing something interesting over here, just wanted to double check with you, are jobs shared across contexts?
<alyssa> shadeslayer: They... shouldn't be...
<shadeslayer> alyssa: huh, so I made these changes https://paste.ubuntu.com/p/2SYfM5gCSy/
<shadeslayer> alyssa: and I see this : http://paste.ubuntu.com/p/qPpxMPKzQZ/ , job 0xffff4c1ced70 seems to be created by 0xaaaae5a0ca30 but freed by ctx 0xaaaae4a80990 ( line 38 )
<raster> hmm
<raster> that patch increases struct sizes.... that's not nice
<raster> it could have used shorts for shifts/sizes and at least not gotten more bloaty
<raster> :)
<raster> actually could have used chars and it'd have been fine and used half the size then - even better
<raster> :)
megi has joined #panfrost
<daniels> bbrezillon, raster: tbh I can't really piece the backlog together - is there a tl;dr somewhere?
<raster> daniels: i could tell you in about 2h :)
<daniels> :)
<raster> but the simple version
<daniels> tomeu: ^ do you know why shadeslayer is seeing panfrost_jobs migrate between ctx? seems like it shoudln't be possible :\
<raster> the "Fix" to add a st_validate_state() in dri2_set_damage_region() does fix a core issue
<raster> it makes sure the right panfrost resource (panres) is being used for setting the update regions, as well as then being used in the swapbuffer later
<raster> without that the regions are attached to something else not used in the swap, thus it gets them wrong
<raster> probably to do with different code paths driven from higher up
<raster> anyway
<raster> the problem is this extra validate is then called as part of a swapbuffer()
<raster> because the update regions need to be reset to "empty"
<shadeslayer> the only other explanation I can come up with is that the output is just messed up because of threading, and that the same piece of memory gets freed by one context and allocated to another ctx?
<raster> so this is called with 0 rects to do that
<raster> the issue now is that this then bumps up the buffer age of one of the buffers as a side-effect
<raster> inside of the swapbuffer
<raster> (in addition to all the other buffer age fun that is done here that was correct)
<raster> and this causes us to have 1 buffer wildly out of step
<raster> thus ages of 1, 3, 1, 3 instead of 2, 2, 2, 2
<raster> so realistically the regions need a reset without bumping any buffer ages around (i.e. drop the validate in this case)
davidlt has joined #panfrost
<raster> bbrezillon: oh... and now doing some more rendering involving fbo's i get rendering bugs again, even with if (nrects > 0) disabling the validate
<raster> :(
<raster> so much more to go it seems
<bbrezillon> raster: :-(
<raster> i was using a simpler test case so far
<raster> i just expanded it
<raster> and well... flickery black blobs appear
<raster> and i know this will be using fbo's to render to
<raster> so .... :/
<bbrezillon> we'd need to trace the various calls again
<raster> yeah
<raster> damn
<raster> i wont have time today
<bbrezillon> np
davidlt has quit [Remote host closed the connection]
davidlt has joined #panfrost
tbueno has joined #panfrost
<shadeslayer> yeah, http://paste.ubuntu.com/p/3ghrvzC5c6/, this still makes no sense, TID 9112 frees job 0xffff78171510 when it was allocated by TID 9158
davidlt has quit [Remote host closed the connection]
raster has quit [Remote host closed the connection]
<bbrezillon> shadeslayer: can you protect the create/free section with a mutex (with the trace inside) so that we are sure the traces are accurate
<shadeslayer> bbrezillon: sure can do
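[editor's note: a sketch of what bbrezillon is asking for, with hypothetical names standing in for the real panfrost job helpers: one mutex spans both the allocation/free and its trace, so the log order cannot misrepresent the event order:]

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t job_trace_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for panfrost's job allocation. */
static void *create_job_traced(void *ctx)
{
   pthread_mutex_lock(&job_trace_lock);
   void *job = calloc(1, 64);   /* placeholder for the real alloc */
   fprintf(stderr, "PANFROST DEBUG: TID %lu CTX %p CREATE JOB %p\n",
           (unsigned long)pthread_self(), ctx, job); /* cast for display */
   pthread_mutex_unlock(&job_trace_lock);
   return job;
}

/* Hypothetical stand-in for panfrost's job free. */
static void free_job_traced(void *ctx, void *job)
{
   pthread_mutex_lock(&job_trace_lock);
   fprintf(stderr, "PANFROST DEBUG: TID %lu CTX %p FREE JOB %p\n",
           (unsigned long)pthread_self(), ctx, job);
   free(job);
   pthread_mutex_unlock(&job_trace_lock);
}
```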
pH5 has quit [Quit: bye]
davidlt has joined #panfrost
yann has quit [Ping timeout: 245 seconds]
raster has joined #panfrost
raster has quit [Remote host closed the connection]
raster has joined #panfrost
raster has quit [Ping timeout: 248 seconds]
raster has joined #panfrost
raster has quit [Ping timeout: 245 seconds]
<bbrezillon> daniels: could you have a look at http://code.bulix.org/9v8bhi-850562 and tell me if that makes sense (I doubt it does, but I also don't know how to fix the problem reported by raster)
yann has joined #panfrost
<bbrezillon> hm, I could do something simpler by letting drivers reset the damage region at flush time
<shadeslayer> tomeu: I should lock out the entirety of get_job and free_job right?
<shadeslayer> PANFROST DEBUG: TID 2783 CTX 0xaaaaefc43290 RENDERING JOB 0xffff58373e20
<shadeslayer> PANFROST DEBUG: TID 2770 CTX 0xaaaaeef640a0 FREE'ING JOB 0xffff58373e20P
<shadeslayer> definitely free'd by a different context?
<shadeslayer> tomeu: https://paste.ubuntu.com/p/tfBZcBr8yJ/ is the final patch that I used to check things
<alyssa> Help, this blend shader stuff is WATing me
* alyssa is afraid of running out of time before even getting to start scheduling
<shadeslayer> heh, running out of time ... start scheduling :P
<alyssa> >.<
<shadeslayer> pun intended? :P
<alyssa> Nope
<shadeslayer> even better
<alyssa> Oh, *seriously*, ugh
<alyssa> Getting clobbered with a blend constant
<bbrezillon> shadeslayer: yep
<bbrezillon> and we indeed have a bug
<bbrezillon> shadeslayer: hm, wait
<bbrezillon> can you also protect the section printing the RENDERING trace?
<alyssa> ACK!
<alyssa> It's upside-down
<alyssa> That's why
<bbrezillon> shadeslayer: can you also add an \n at the end of the FREEING trace
<alyssa> Now dealing with a RA issue with blend shaders :|
* alyssa feels herself running out of time for lack of a better word
<bbrezillon> screen is shared by all contexts
<bbrezillon> shadeslayer: last_fragment_flushed, last_job should probably be moved to panfrost_context
<bbrezillon> transient_bo, free_transient and bo_cache might stay in panfrost_screen if they are protected with a mutex
<bbrezillon> shadeslayer: http://code.bulix.org/8kohss-850630
<shadeslayer> I move away for dinner and you fix it :(
<bbrezillon> oh, I doubt it's fixed
<shadeslayer> bbrezillon: but yeah, awesome, I wasn't hallucinating then :D
<bbrezillon> it's just the beginning of a long series of issues related to multi-ctx
<bbrezillon> :)
<shadeslayer> hooray
<bbrezillon> shadeslayer: it works?
<bbrezillon> so, patches I don't even test work, but those I test regress the entire world when I push them :'-(
stikonas has joined #panfrost
<shadeslayer> bbrezillon: I'm going through them now
<shadeslayer> - return;
<shadeslayer> + pthread_mutex_unlock(&screen->bo_cache_lock);
<shadeslayer> that looks wrong?
<shadeslayer> oh, I guess that return is superfluous
<bbrezillon> shadeslayer: and s/transiant/transient/
<shadeslayer> bbrezillon: yeah I figured
<shadeslayer> I just wanted to go through it myself to understand what's going on, I see you just moved things around from screen to ctx
<bbrezillon> sure, np
<bbrezillon> sorry I read "I just want to", hence the "sure, np". Sorry for chiming in and fixing that for you
<bbrezillon> I'll try to stay away next time :-/
<shadeslayer> bbrezillon: a lot less crashy for sure :)
<shadeslayer> I feel like it has a perf impact though
<shadeslayer> but I have no metrics to back it up
<bbrezillon> the lock certainly has an impact, though I doubt it's a huge one
<shadeslayer> bbrezillon: yeah definitely fixes the issue at hand
<bbrezillon> the other option would be to have a bo_cache and transient pool per context
<bbrezillon> but that means increasing the memory consumption
<bbrezillon> maybe something we can think about when the madvise stuff is merged
<alyssa> Only 474 regressions left according to CI... this is really ugly
<shadeslayer> I get nice INSTR_MISMATCH messages now
<shadeslayer> [ 6963.918002] panfrost ff9a0000.gpu: js fault, js=0, status=INSTR_TYPE_MISMATCH, head=0x5b08180, tail=0x5b08180
<shadeslayer> [ 6963.918909] panfrost ff9a0000.gpu: gpu sched timeout, js=0, status=0x52, head=0x5b08180, tail=0x5b08180, sched_job=0000000051927e83
<shadeslayer> but yay, less crashing
davidlt has quit [Remote host closed the connection]
<alyssa> That's a compiler issue
<alyssa> MIDGARD_MESA_DEBUG=shaders?
davidlt has joined #panfrost
<shadeslayer> I gotta rebase on master first
davidlt has quit [Remote host closed the connection]
davidlt has joined #panfrost
<alyssa> I don't even understand how `brx.discard.false` could come about from my compiler
<shadeslayer> meh, I'll take a look tomorrow, I'm *exhausted*
raster has joined #panfrost
stikonas has quit [Read error: Connection reset by peer]
davidlt has quit [Remote host closed the connection]
davidlt has joined #panfrost
davidlt has quit [Remote host closed the connection]
davidlt has joined #panfrost
raster has quit [Remote host closed the connection]
* alyssa poooooooofs
<alyssa> What even
<HdkR> What
<alyssa> HdkR: I've been doing a scheduler and uh
<alyssa> my head hurts.
<alyssa> I have done nothing today except sort out regressions and I don't do any scheduling yet.
<HdkR> Whoa hey, that's what I'm doing all the time :D
<daniels> bbrezillon, shadeslayer: bo_cache _must_ be per-screen, since the BO namespace is per-fd
<daniels> so that needs to be mutexed
<daniels> all the job/flush pointers should definitely be per-ctx however
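[editor's note: the split daniels describes, sketched with hypothetical struct shapes rather than the real panfrost_screen/panfrost_context fields: the BO cache stays on the per-fd screen behind a mutex, while job tracking moves into the context where no lock is needed:]

```c
#include <pthread.h>
#include <stddef.h>

struct panfrost_bo;   /* opaque for this sketch */

/* Per-fd: BO names are shared across every context on the screen,
 * so the cache must be shared too, and therefore locked. */
struct screen {
   pthread_mutex_t bo_cache_lock;
   struct panfrost_bo *bo_cache[64];
   size_t bo_cache_count;
};

/* Per-context: only ever touched by its owning context. */
struct context {
   struct screen *screen;
   void *last_job;   /* job/flush pointers need no locking here */
};

static struct panfrost_bo *bo_cache_take(struct screen *s)
{
   struct panfrost_bo *bo = NULL;
   pthread_mutex_lock(&s->bo_cache_lock);
   if (s->bo_cache_count > 0)
      bo = s->bo_cache[--s->bo_cache_count];
   pthread_mutex_unlock(&s->bo_cache_lock);
   return bo;
}
```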
<anarsoul> panfrost already has bo cache?
<alyssa> Yeah
<anarsoul> we definitely need it too
davidlt has quit [Remote host closed the connection]
<alyssa> I don't even know what I'm doing anymore
<anarsoul> alyssa: have some rest?
<alyssa> 10 more minutes :p
<alyssa> Oh ffs
<alyssa> How am I
<alyssa> still
<alyssa> on the same bug
davidlt has joined #panfrost
<Lyude> alyssa: you know I still run into bugs that have taken me even longer than that :P
davidlt has quit [Remote host closed the connection]
hopetech has quit [Ping timeout: 245 seconds]
alyssa has quit [Ping timeout: 272 seconds]
hopetech has joined #panfrost
davidlt has joined #panfrost