<bbrezillon>
I realized DRM_IOCTL_SYNCOBJ_TRANSFER is only supported if the driver has the SYNCOBJ_TIMELINE cap set, which is not yet the case on panfrost
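    (A minimal sketch in C of probing for timeline-syncobj support before relying on DRM_IOCTL_SYNCOBJ_TRANSFER; only existing libdrm calls are used, the helper names are made up:)

    #include <stdbool.h>
    #include <stdint.h>
    #include <xf86drm.h>

    /* The transfer ioctl is rejected by drivers that don't expose the
     * timeline capability, so probe it first. */
    static bool
    has_syncobj_transfer(int drm_fd)
    {
       uint64_t cap = 0;

       return !drmGetCap(drm_fd, DRM_CAP_SYNCOBJ_TIMELINE, &cap) && cap;
    }

    /* Copy the current fence of src into dst (binary-to-binary transfer). */
    static int
    copy_syncobj_fence(int drm_fd, uint32_t dst, uint32_t src)
    {
       return drmSyncobjTransfer(drm_fd, dst, 0, src, 0, 0);
    }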
<bbrezillon>
ok, so that works for intra-cmdbuf signaling, but IIUC, vkEvent also handles CPU <-> GPU sync, and inter-queue/cmdbuf sync
<tomeu>
tbh, I thought this was such a huge hammer that it covers all possible sync needs
<bbrezillon>
I think we need to embed a syncobj in our vkevent, and then transfer the batch out_fence to those syncobjs
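    (A rough sketch of that idea; struct and function names are hypothetical, not actual panvk code:)

    #include <stdint.h>
    #include <xf86drm.h>

    /* Hypothetical: one DRM syncobj embedded per VkEvent. */
    struct panvk_event {
       uint32_t syncobj;
    };

    /* After queueing a batch, mirror its out_fence into every event targeted
     * by a vkCmdSetEvent recorded in that batch. */
    static int
    propagate_out_fence(int drm_fd, uint32_t batch_out_sync,
                        struct panvk_event **events, unsigned count)
    {
       for (unsigned i = 0; i < count; i++) {
          int ret = drmSyncobjTransfer(drm_fd, events[i]->syncobj, 0,
                                       batch_out_sync, 0, 0);
          if (ret)
             return ret;
       }
       return 0;
    }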
<bbrezillon>
tomeu: no, what you do doesn't really wait for the batch to complete
<bbrezillon>
unless you have the trace or sync debug options
<tomeu>
but, that wait is the same one we do when the trace or sync debug options are enabled
<bbrezillon>
ok, maybe I missed something
<bbrezillon>
I'm a bit lost, you wait for events before queueing the batch?
<tomeu>
before queueing the second half of the batch we split
<tomeu>
but of course, after submitting the first half
<bbrezillon>
oh, I get it now
<bbrezillon>
still not a huge fan of this wait in the submit path
<bbrezillon>
I feel like the syncobj solution wouldn't be much more complicated
<bbrezillon>
and would avoid this wait
<tomeu>
yeah, tbh, I would prefer to get conformant first
<tomeu>
or at least conformant enough so we can run deqp in CI
<tomeu>
then it will be much easier to improve stuff without regressing
<tomeu>
also, I'm not really sure how much better waiting in the kernel would be if we had a separate submission thread
<tomeu>
but it's something I would very gladly leave for the future :)
<tomeu>
this passes all the dEQP-VK.synchronization.basic.* tests and allows me to move on to the dEQP-VK.api.command_buffers.* ones
<bbrezillon>
yes, but again, I'm not sure what you're doing here is correct
<bbrezillon>
I still don't see how GPU waits can work
<tomeu>
so, when we insert a wait cmd, any subsequent commands will be submitted after all the previous ones have completed
<tomeu>
inefficient, but seems correct to me?
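    (A sketch of the scheme being described, with the wait sitting in the submit path; everything here is illustrative:)

    #include <stdint.h>
    #include <xf86drm.h>

    /* Before submitting the half of the batch that follows a wait command,
     * block until every wait-event's syncobj has signaled. */
    static int
    wait_events_then_submit(int drm_fd, uint32_t *event_syncobjs,
                            uint32_t count)
    {
       if (count) {
          int ret = drmSyncobjWait(drm_fd, event_syncobjs, count, INT64_MAX,
                                   DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
          if (ret)
             return ret;
       }

       /* ...then issue the SUBMIT ioctl for the second half of the batch. */
       return 0;
    }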
<bbrezillon>
that's only intra-queue sync
<bbrezillon>
AFAICT
<tomeu>
bbrezillon: vkQueue you mean? or hw queue?
<bbrezillon>
vkQueue
<tomeu>
how does that work, multiple app threads could each have their own vkQueue and call the submit ioctl concurrently?
<tomeu>
ah, looks like that case isn't supported by the spec
<tomeu>
"Events must not be used to insert a dependency between commands submitted to different queues."
<tomeu>
and earlier:
<tomeu>
"Events are a synchronization primitive that can be used to insert a fine-grained dependency between commands submitted to the same queue, or between the host and a queue. "
<bbrezillon>
ok
<amonakov>
(inter-thread synch is done via semaphores)
<amonakov>
erm, inter-queue I mean
<bbrezillon>
that leaves the "GPU waits on CPU signaling an event" case, which AFAICT is not handled
<bbrezillon>
or did I miss it
<bbrezillon>
tomeu: and I'd rather think that through than introduce an implementation that adds waits in the submit path
<tomeu>
yeah, we miss that, v3dv handles those waits in a separate thread
<bbrezillon>
yeah, but if you do that you clearly can't wait in the submit path
<bbrezillon>
you'd need to have a deferred queue for batches that have such deps
<tomeu>
yep
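    (A very rough sketch of that deferred-queue idea, reportedly what v3dv does; all names are hypothetical:)

    #include <pthread.h>
    #include <stdbool.h>

    struct deferred_batch {
       struct deferred_batch *next;
       /* event syncobjs to wait on, submit payload, ... */
    };

    struct panvk_queue_thread {
       pthread_t thread;
       pthread_mutex_t lock;
       pthread_cond_t cond;
       struct deferred_batch *pending;
       bool stop;
    };

    /* Batches with host-signaled-event deps go on q->pending instead of
     * being submitted directly; this thread flushes them once the events
     * signal, so vkQueueSubmit itself never blocks. */
    static void *
    submit_thread(void *data)
    {
       struct panvk_queue_thread *q = data;

       pthread_mutex_lock(&q->lock);
       while (!q->stop) {
          while (!q->pending && !q->stop)
             pthread_cond_wait(&q->cond, &q->lock);

          while (q->pending) {
             struct deferred_batch *batch = q->pending;
             q->pending = batch->next;
             pthread_mutex_unlock(&q->lock);

             /* wait on the batch's event syncobjs, then SUBMIT it */

             pthread_mutex_lock(&q->lock);
          }
       }
       pthread_mutex_unlock(&q->lock);
       return NULL;
    }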
<bbrezillon>
which would make the implementation a lot messier than if we were using syncobjs and extending the SUBMIT_IOCTL to allow unsignaling a syncobj (or any other solution we find to handle the reset case)
<tomeu>
hmm, wonder if we need to unsignal a syncobj, if we couldn't just replace an event's syncobj with a new one
<bbrezillon>
you can definitely transfer an unsignaled fence to a syncobj
<tomeu>
ok, that would work
<tomeu>
I'm wondering now if we shouldn't then use write_value jobs for the set and reset commands
<tomeu>
then add an ioctl for the kernel to poll a memory region and signal when it changes value
<bbrezillon>
ok, so a fence for the sync, and a GPU buf for the value?
<tomeu>
yep
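    (The poll ioctl mentioned above doesn't exist; purely to make the idea concrete, its UAPI could look something like this, all of it hypothetical:)

    #include <linux/types.h>

    /* Hypothetical ioctl payload: the GPU uses WRITE_VALUE jobs to set/reset
     * a word in a BO, and the kernel signals out_sync once that word reads
     * back the expected value. */
    struct drm_panfrost_wait_bo_value {
       __u32 handle;     /* BO holding the event word */
       __u32 offset;     /* offset of the word within the BO */
       __u32 value;      /* value to wait for (e.g. "set") */
       __u32 out_sync;   /* syncobj to signal when the word matches */
    };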
<tomeu>
we would need to get the caching right I guess
<tomeu>
as it would be writable by both gpu and cpu
<bbrezillon>
if it's in a separate batch it shouldn't be an issue
<bbrezillon>
(at least not yet :))
<tomeu>
then we would be only splitting batches on waits
<tomeu>
ok, let me think about all this over lunch
<bbrezillon>
GPU buffers are mapped NC on the CPU side, that leaves GPU caches, which are flushed at job boundaries
<tomeu>
guess if we find a relatively simple solution that is optimal and we can be sure it's the way forward, it's better than a more complex one that doesn't require kernel changes
<bbrezillon>
but I wouldn't throw away the syncobj-only solution so quickly
<bbrezillon>
I mean, if there's a way to transfer an unsignaled fence to the vkevent syncobj, it should work without requiring kernel changes
<bbrezillon>
just not sure when this transfer should happen
<bbrezillon>
and I feel it would have to be done kernel side
<tomeu>
ah, I thought we needed to do it in the kernel so we don't have to block as we do now
<bbrezillon>
most likely yes
<tomeu>
then maybe the write_value option is better because we avoid splits?
<bbrezillon>
I mean, unsignaling the syncobj should happen when the job is done executing, not earlier
<tomeu>
the mali supports only two queued jobs, so small chains are worse for latency than on other gpus
<bbrezillon>
we avoid splits at the expense of active waits
<bbrezillon>
and it's still not clear to me how you'd implement GPU waits without splitting the batch
<tomeu>
with write_value and a poll ioctl? I think we don't have any additional active waits that way
<bbrezillon>
the only thing you could avoid is splits on Set/Reset
<tomeu>
for the waits we need to split
<tomeu>
because the HW doesn't support it
<tomeu>
the question is now whether the wait should happen in the kernel or in userspace
<tomeu>
yeah, I think we are in agreement on that
<bbrezillon>
if GPU has to wait, I'd rather do that in kernel space, with an explicit dep (in_sync)
<tomeu>
yeah, I also like that
<tomeu>
in the mem-backed solution, it would be waiting on the syncobj returned by the new poll ioctl
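    (A sketch of expressing that GPU-side wait as an explicit kernel dependency through the existing in_syncs field of drm_panfrost_submit; in_sync here would be the event's syncobj, or the one signaled by the hypothetical poll ioctl:)

    #include <stdint.h>
    #include <xf86drm.h>
    #include "panfrost_drm.h"   /* kernel UAPI header, path depends on the tree */

    static int
    submit_with_in_sync(int drm_fd, uint64_t jc, uint32_t in_sync,
                        uint32_t out_sync)
    {
       struct drm_panfrost_submit submit = {
          .jc = jc,
          .in_syncs = (uintptr_t)&in_sync,   /* array of one u32 syncobj handle */
          .in_sync_count = 1,
          .out_sync = out_sync,
          /* bo_handles/requirements omitted for brevity */
       };

       return drmIoctl(drm_fd, DRM_IOCTL_PANFROST_SUBMIT, &submit);
    }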
<bbrezillon>
I feel like the "split batches on events" option, with a way to unsignal syncobjs would be the simplest solution, but I guess we can try the write-value approach too
<tomeu>
by unsignaling syncobjs, that would be transferring a new syncobj to the event's?
<bbrezillon>
no, same syncobj, new unsignaled fence attached to it
<tomeu>
ok, so it's all userspace then
<bbrezillon>
not really, because it has to be done when the batch is done
<bbrezillon>
otherwise you screw up the sequencing (or maybe not, I'm not sure)
<tomeu>
but, do we really want to change the kernel for something that we know is not optimal?
<bbrezillon>
is it really suboptimal?
<tomeu>
well, we have the added splits, right?
<bbrezillon>
sure, it adds a GPU -> CPU -> GPU round trip
<tomeu>
what you suggest is probably what the ddk does, though
<bbrezillon>
I'll think about it and see if we can do it without kernel changes
<bbrezillon>
unsignaling syncobjs at submit time might work
<bbrezillon>
because we only care about intra-queue synchronisation
<bbrezillon>
and submission is serialized at the queue level
<tomeu>
ah, true
<bbrezillon>
so, for reset, the only thing you'd need to do is transfer an unsignaled fence to the syncobj, and for set, transfer the queue syncobj
<bbrezillon>
after queueing the batch that has events attached to it
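    (A sketch of the "set" half of that scheme, relying on per-queue submission being serialized; the helper name is made up:)

    #include <stdint.h>
    #include <xf86drm.h>

    /* After queueing the batch containing the vkCmdSetEvent, copy the
     * queue/batch out_sync fence into the event's syncobj, so later waits
     * on the event pick up that fence. */
    static int
    panvk_event_set_after_submit(int drm_fd, uint32_t event_syncobj,
                                 uint32_t batch_out_sync)
    {
       return drmSyncobjTransfer(drm_fd, event_syncobj, 0,
                                 batch_out_sync, 0, 0);
    }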
<bbrezillon>
tomeu: BTW, for the reset operation, you can use DRM_IOCTL_SYNCOBJ_RESET
<tomeu>
ah, cool, thanks
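    (And the "reset" half using the reset ioctl just mentioned, again only a sketch:)

    #include <stdint.h>
    #include <xf86drm.h>

    /* drmSyncobjReset() wraps DRM_IOCTL_SYNCOBJ_RESET: it drops the event
     * syncobj's fence so it reads as unsignaled again. */
    static int
    panvk_event_reset_after_submit(int drm_fd, uint32_t event_syncobj)
    {
       return drmSyncobjReset(drm_fd, &event_syncobj, 1);
    }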
<bbrezillon>
oh, and I just found out that msm has a flag for this 'reset syncobj at the end of the job' case => MSM_SUBMIT_SYNCOBJ_RESET,
<bbrezillon>
but it seems to apply to in_fences, not out_fences
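    (For reference, the msm UAPI bits being referred to look roughly like this, from include/uapi/drm/msm_drm.h; the flag is indeed applied per in-syncobj:)

    #define MSM_SUBMIT_SYNCOBJ_RESET 0x00000001  /* reset the syncobj after the wait */

    struct drm_msm_gem_submit_syncobj {
       __u32 handle;   /* syncobj handle */
       __u32 flags;    /* MSM_SUBMIT_SYNCOBJ_FLAGS */
       __u64 point;    /* timeline point for timeline syncobjs */
    };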