<anarsoul>
you can't detect all the infinite loops
* alyssa
hears Turing rolling in his grave
<alyssa>
I suppose not... In practice this has not been an issue
<alyssa>
With WebGL, conceivably it could be.
<alyssa>
WebGL is usually the problem.
<alyssa>
Joking aside, the hardware has a watchdog. If the job takes too long, it times out which is equivalent to a fault.
<alyssa>
As long as you don't overwhelm it, it usually recovers automatically, since the hw/kernel is reliable
<anarsoul>
right
<anarsoul>
but what if vertex job failed
<alyssa>
What about it?
<anarsoul>
do you run fragment job in this case?
<alyssa>
Dunno
rcf has joined #panfrost
vstehle has quit [Ping timeout: 246 seconds]
<TheCycoTWO>
has anyone investigated the weston 7 segfault?
lvrp16 has quit [Read error: Connection reset by peer]
austriancoder has quit [Ping timeout: 276 seconds]
austriancoder has joined #panfrost
lvrp16 has joined #panfrost
belgin has joined #panfrost
vstehle has joined #panfrost
fysa has joined #panfrost
chewitt has quit [Quit: Adios!]
NeuroScr has joined #panfrost
davidlt has joined #panfrost
davidlt_ has joined #panfrost
belgin has quit [Quit: Leaving]
davidlt has quit [Ping timeout: 245 seconds]
davidlt__ has joined #panfrost
davidlt_ has quit [Read error: Connection reset by peer]
davidlt_ has joined #panfrost
davidlt__ has quit [Ping timeout: 240 seconds]
<bbrezillon>
alyssa: ok, I think the part I was missing is that tiles are written back to DRAM and the fragment job has an fbd that describes where those tiles are (tiler_heap I guess)
<daniels>
TheCycoTWO: which 'weston 7 segfault'? a fair few people here are running weston 7 and it works fine
<bbrezillon>
anarsoul: just had a look at the scheduler code, and the fence is not signaled when a job fails. AFAIU, if the vertex job fails repeatedly (the scheduler retries a few times), the fragment job will never be executed (actually, all jobs coming from this context/FD will be ignored after that)
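A minimal sketch of the dependency this implies, using the panfrost submit ioctl and DRM syncobjs as exposed around Linux 5.2; the helper name and the timeout-based failure check are illustrative assumptions, not Mesa's actual code:

    /* Sketch: run the vertex/tiler chain, then only queue the FRAGMENT job
     * once its out_sync fence signals. If the vertex job faults and the fence
     * is never signalled, the wait below times out instead of succeeding. */
    #include <stdint.h>
    #include <time.h>
    #include <xf86drm.h>
    #include <drm/panfrost_drm.h>   /* kernel UAPI header; install path may vary */

    /* hypothetical helper: returns 0 on success, -1 on failure/timeout */
    static int submit_vt_then_fragment(int fd, uint64_t vt_jc, uint64_t frag_jc,
                                       uint32_t *bos, uint32_t bo_count)
    {
        uint32_t out_sync;
        struct timespec ts;
        int64_t deadline;

        if (drmSyncobjCreate(fd, 0, &out_sync))
            return -1;

        struct drm_panfrost_submit vt = {
            .jc = vt_jc,
            .out_sync = out_sync,
            .bo_handles = (uintptr_t)bos,
            .bo_handle_count = bo_count,
            .requirements = 0,                  /* VERTEX/TILER slot */
        };
        if (drmIoctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &vt))
            return -1;

        /* syncobj waits take an absolute CLOCK_MONOTONIC timeout: wait up to 1s */
        clock_gettime(CLOCK_MONOTONIC, &ts);
        deadline = (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec + 1000000000;
        if (drmSyncobjWait(fd, &out_sync, 1, deadline, 0, NULL))
            return -1;                          /* vertex job failed or hung */

        struct drm_panfrost_submit frag = {
            .jc = frag_jc,
            .bo_handles = (uintptr_t)bos,
            .bo_handle_count = bo_count,
            .requirements = PANFROST_JD_REQ_FS, /* FRAGMENT slot */
        };
        /* (syncobj cleanup omitted for brevity) */
        return drmIoctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &frag) ? -1 : 0;
    }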
<tomeu>
narmstrong: btw, I have started work on moving our CI to mesa's main CI scripts
<tomeu>
so it's probably not worth it to make many changes to the current panfrost setup
<narmstrong>
tomeu: ok, so in this case, using tags is preferable, to test only t820 on my runner
raster has quit [Remote host closed the connection]
<tomeu>
narmstrong: maybe yeah, is the runner ready now?
<narmstrong>
tomeu: nope, I need to finalize the LAVA part
<tomeu>
ok
<tomeu>
I need to check some regressions that have come from core mesa
<tomeu>
narmstrong: btw, one more comment: for now we should probably only test one arch per board
<tomeu>
once we think we have free capacity, we can change that
chewitt has joined #panfrost
raster has joined #panfrost
<TheCycoTWO>
daniels: on start, right after showing the GL info, it prints "segmentation fault (core dumped)  weston" (the rest of the pasted output is garbled fragments of other log lines: 'unknown' model/serial, etc.)
<daniels>
TheCycoTWO: i've not seen that and it generally works really well for us - one thing you can do is to redirect stdout and stderr to a file to capture that, as well as look at the core dump to see where the crash happened
<TheCycoTWO>
Stack trace shows #0 0x0 n/a (n/a) #1 0x0000ffffab6e0668 n/a (gl-renderer.so) and one more n/a level above that.
<TheCycoTWO>
Was trying to build weston myself with symbols but ran into complaints about egl.
<TheCycoTWO>
That's archlinuxarm on kevin with latest master mesa (and several other mesa versions back to its release) linux 5.2
<TheCycoTWO>
will the coredump be useful without symbols?
warpme_ has joined #panfrost
chewitt has quit [Quit: Adios!]
<TheCycoTWO>
daniels ^?
<daniels>
TheCycoTWO: unfortunately not really useful without symbols, no - getting a dump with symbols would be really useful
<TheCycoTWO>
ok. First thing to do is figure out why it says "Runtime dependency egl found: NO" when I run meson then
raster has quit [Remote host closed the connection]
<alyssa>
bbrezillon: Re tiles written to memory and fragment reading it back, correct. Think what you will of the efficiency..... but that's what the tiler_* structures are for.
raster has joined #panfrost
raster- has joined #panfrost
<TheCycoTWO>
ah, mesa had stopped providing the pc file for egl when compiled with glvnd - um - back in july. I have no need for glvnd anyway.
<bbrezillon>
looks like I have a few BOs that I was expecting to be used only by the vertex+tiler jobs, but if I don't attach them to the fragment job it starts failing
<bbrezillon>
alyssa: the SSBO buf for instance
<alyssa>
bbrezillon: Ah! :)
<alyssa>
Fragment shaders run during FRAGMENT, not during TILER
<alyssa>
Ummm
<alyssa>
VERTEX jobs = upload the vertex shader, run the vertex shader
<alyssa>
TILER jobs = upload the fragment shader, run the *tiler*
<alyssa>
FRAGMENT jobs = run the fragment shader, copy to framebuffer
<alyssa>
So if a fragment shader needs a BO, it needs to be attached to the FRAGMENT job
<alyssa>
But if a vertex shader needs it, to the VERTEX
<alyssa>
And if tiling itself needs it (varyings, tiler structures, fragment shader binaries, etc), to the TILER
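As a rough illustration of that attachment rule against the panfrost submit ioctl (the batch struct and BO names below are hypothetical; only the ioctl layout is real):

    #include <stdint.h>
    #include <xf86drm.h>
    #include <drm/panfrost_drm.h>

    /* Hypothetical batch: which stage needs each BO decides which submit
     * (vertex/tiler vs. fragment) must list it in bo_handles. */
    struct example_batch {
        uint32_t vertex_shader_bo;    /* read while VERTEX runs              */
        uint32_t varyings_bo;         /* written by VERTEX, read by TILER    */
        uint32_t fragment_shader_bo;  /* binary uploaded for TILER           */
        uint32_t fragment_ssbo;       /* read by the fragment shader         */
        uint32_t framebuffer_bo;      /* written during FRAGMENT             */
    };

    static void submit_example(int fd, const struct example_batch *b,
                               uint64_t vt_jc, uint64_t frag_jc)
    {
        /* The vertex+tiler chain is one non-FS submit, so it carries the
         * union of the VERTEX and TILER dependencies. */
        uint32_t vt_bos[] = {
            b->vertex_shader_bo, b->varyings_bo, b->fragment_shader_bo,
        };
        struct drm_panfrost_submit vt = {
            .jc = vt_jc,
            .bo_handles = (uintptr_t)vt_bos,
            .bo_handle_count = sizeof(vt_bos) / sizeof(vt_bos[0]),
        };
        drmIoctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &vt);

        /* Anything the fragment shader reads (e.g. the SSBO mentioned above)
         * must be on the FRAGMENT submit, or that job faults. */
        uint32_t frag_bos[] = { b->fragment_ssbo, b->framebuffer_bo };
        struct drm_panfrost_submit frag = {
            .jc = frag_jc,
            .bo_handles = (uintptr_t)frag_bos,
            .bo_handle_count = sizeof(frag_bos) / sizeof(frag_bos[0]),
            .requirements = PANFROST_JD_REQ_FS,
        };
        drmIoctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &frag);
    }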
raster- has quit [Read error: Connection reset by peer]
NeuroScr has quit [Quit: NeuroScr]
<bbrezillon>
alyssa: ok
<TheCycoTWO>
ok - weston 7 is working for me now. (mesa built without glvnd) so I guess it's not panfrost's fault at any rate (probably my fault)
NeuroScr has joined #panfrost
<bbrezillon>
alyssa: so, the FRAGMENT job knows where the fragment shader (and associated BOs) are
<tomeu>
2019-09-11T11:18:09 dEQP-GLES2.functional.uniform_api.random.3 Fail REGRESSED from (NotListed)
<tomeu>
2019-09-11T11:18:09 dEQP-GLES2.functional.uniform_api.random.21 Fail REGRESSED from (NotListed)
<alyssa>
bbrezillon: Yeah, somehow
<alyssa>
I suspect it's somewhere in the tiler_heap but I never investigated
<TheCycoTWO>
popolon: you were the other person that mentioned weston 7, and you are on almost an identical setup to me. Did you sort it out already, or see my libglvnd notes?
<alyssa>
tomeu: That is a good question :|
<alyssa>
I mean
<alyssa>
The obvious answer is that Panfrost ingests NIR but llvmpipe ingests LLVM IR :P
<alyssa>
(and that commit is in the glsl_to_nir pass)
<tomeu>
guess that iris isn't broken by it either, but I haven't seen the CI run for it
<alyssa>
Mm, the uniform/varying/attribute counting code is... fragile... so I'm not surprised it broke on some edge cases
<alyssa>
Could you grab a testlog.xml so I can see what the shaders are?
<tomeu>
hmm, let me check
* tomeu
is at plumbers now
raster has quit [Read error: Connection reset by peer]
<alyssa>
How's the plumbing going?
<tomeu>
there's no plumbing that I can see, only plumbing professionals going around and eating and drinking
<alyssa>
No plumbing?
<alyssa>
What if you need to use the washroom? :/
raster has joined #panfrost
raster has quit [Read error: Connection reset by peer]
<alyssa>
bbrezillon: Eek, yeah, that sounds like it could defn contribute..
<alyssa>
Wonder if that can be mitigated in userspace somehow...?
<anarsoul>
why do you call bo_wait()?
<bbrezillon>
alyssa: it does not explain the 100 fps diff
<bbrezillon>
but accounts for at least 20 AFAICT
<bbrezillon>
we can mark BOs as ready, but they need to be tested at least once
<alyssa>
bbrezillon: I'm questioning if we made a bad ABI choice that we're stuck with now :-(
* alyssa
wishes she understood fences/syncs when this stuff was first being done
<alyssa>
Wait
<alyssa>
bbrezillon: Why can't we use syncs?
<alyssa>
When a BO is in flight, track "BO<->sync"
<alyssa>
-----------Oh!
<alyssa>
I am in Class but you know what this seems like fun, sec
<alyssa>
Yeah, can't we specify the dependency graph in terms of syncs and eliminate the waits entirely?
<alyssa>
You would still need wait_bo for shared BOs, I guess.... but the majority of BOs we deal with are not shared so it doesn't -terribly- matter
<alyssa>
If there's really a reason that can't work, I mean, we can patch the kernel to do waits on `bo_handles` automatically, but ideally we make it work without any ABI changes so everything continues to work against 5.2
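A sketch of the "BO <-> sync" tracking idea, assuming each submit gets an out_sync and every BO remembers the syncobj of its last user; the struct and helper are made up, and syncobj refcounting/cleanup is left out:

    #include <assert.h>
    #include <stdint.h>
    #include <xf86drm.h>
    #include <drm/panfrost_drm.h>

    /* Hypothetical per-BO tracking: the syncobj of the last job using the BO. */
    struct tracked_bo {
        uint32_t handle;
        uint32_t last_use_sync;   /* 0 when idle */
    };

    static int submit_with_deps(int fd, uint64_t jc, uint32_t requirements,
                                struct tracked_bo *bos, uint32_t bo_count)
    {
        uint32_t handles[16], in_syncs[16], n_syncs = 0, out_sync;

        assert(bo_count <= 16);   /* fixed-size arrays keep the sketch short */
        if (drmSyncobjCreate(fd, 0, &out_sync))
            return -1;

        /* Express the dependency graph as in_syncs instead of bo_wait()s. */
        for (uint32_t i = 0; i < bo_count; i++) {
            handles[i] = bos[i].handle;
            if (bos[i].last_use_sync)
                in_syncs[n_syncs++] = bos[i].last_use_sync;
        }

        struct drm_panfrost_submit submit = {
            .jc = jc,
            .in_syncs = (uintptr_t)in_syncs,
            .in_sync_count = n_syncs,
            .out_sync = out_sync,
            .bo_handles = (uintptr_t)handles,
            .bo_handle_count = bo_count,
            .requirements = requirements,
        };
        if (drmIoctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &submit))
            return -1;

        /* This job is now the last user of every BO it touched. */
        for (uint32_t i = 0; i < bo_count; i++)
            bos[i].last_use_sync = out_sync;
        return 0;
    }

As noted above, shared BOs would still need wait_bo, since external users don't go through this tracking.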
<anarsoul>
bbrezillon: just check the first BO: if it's ready, grab it; if it's not, the others aren't ready either, so allocate a new BO. Make sure you add BOs to the list tail when you free them
<anarsoul>
that'd be 1-2 syscalls per BO allocation, and that's not bad
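A rough sketch of that scheme, assuming a FIFO free list and that WAIT_BO with a zero timeout acts as a non-blocking poll; the list helpers are hypothetical and size-bucketing is omitted (a real cache would match sizes and close stale GEM handles):

    #include <stdint.h>
    #include <stdlib.h>
    #include <xf86drm.h>
    #include <drm/panfrost_drm.h>

    /* Hypothetical FIFO free list: freed BOs go on the tail, so the head is
     * always the oldest entry and therefore the most likely to be idle. */
    struct cached_bo {
        uint32_t handle;
        struct cached_bo *next;
    };

    static struct cached_bo *cache_head, *cache_tail;

    static int bo_is_idle(int fd, uint32_t handle)
    {
        /* Zero timeout = non-blocking poll (assumption about the 5.2 UAPI):
         * a non-zero return means the GPU is still using the BO. */
        struct drm_panfrost_wait_bo wait = { .handle = handle, .timeout_ns = 0 };
        return drmIoctl(fd, DRM_IOCTL_PANFROST_WAIT_BO, &wait) == 0;
    }

    static void bo_cache_put(uint32_t handle)
    {
        struct cached_bo *node = malloc(sizeof(*node));
        if (!node)
            return;   /* a real cache would close the GEM handle here */
        node->handle = handle;
        node->next = NULL;
        if (cache_tail)
            cache_tail->next = node;
        else
            cache_head = node;
        cache_tail = node;
    }

    static uint32_t bo_cache_get(int fd, uint32_t size)
    {
        /* Check only the head: if the oldest BO is still busy, the newer ones
         * behind it are busy too, so don't scan the rest of the list. */
        if (cache_head && bo_is_idle(fd, cache_head->handle)) {
            struct cached_bo *node = cache_head;
            uint32_t handle = node->handle;
            cache_head = node->next;
            if (!cache_head)
                cache_tail = NULL;
            free(node);
            return handle;
        }

        /* Nothing reusable: allocate a fresh BO, so the common path stays at
         * one or two syscalls per allocation. */
        struct drm_panfrost_create_bo create = { .size = size };
        if (drmIoctl(fd, DRM_IOCTL_PANFROST_CREATE_BO, &create))
            return 0;
        return create.handle;
    }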
* alyssa
reads up on how sync objects work
<bbrezillon>
alyssa: that's what I do
<bbrezillon>
to track job lifetime
<bbrezillon>
but we still need to wait for something when testing BO readiness
<bbrezillon>
and calling bo_wait() is the same as calling wait_syncobj
<bbrezillon>
anarsoul: exactly what we do already
<bbrezillon>
we have extra BO waits in transfer_map()
<bbrezillon>
but that's all
<anarsoul>
it doesn't look like it's possible to optimize it further
<bbrezillon>
the big difference is that, by serializing jobs (and waiting on their out_sync fences), we had a single sync point, instead of one per BO-cache alloc plus one per transfer_map
<anarsoul>
don't serialize jobs?
<bbrezillon>
that's what I'm trying to do
<bbrezillon>
and it regresses the -bideas benchmark
<bbrezillon>
it's definitely not the only problem, but still accounts for almost 20 fps out of 100