<raimo>
i.e. the synchronization method is not entirely clear; theoretically the fetch dispatcher does not need to come up at all for the simd arbiter to start
<raimo>
halt keeps the fetch arbitration from coming up, but i do not really understand how they capture the exception for the SIMD one
<raimo>
since the SIMD arbiter can put the chip into an incorrect state when it deals with memory accesses
<raimo>
aah ok, pardon, it is a finished arbiter in the code, so they probably defer the rst signal for those flops and put some command buffer instructions in the middle
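To make that guess concrete: a minimal Verilog sketch of a deferred reset with synchronous release, the kind of thing being speculated about for the rst signal of those flops. Module and signal names are illustrative, not MIAOW's actual RTL.

    module reset_sync (
        input  wire clk,
        input  wire rst_async_n,  // async reset pulse from the dispatcher
        output wire rst_sync_n    // deferred reset seen by the arbiter flops
    );
        reg [1:0] sync_ff;
        // assert asynchronously, release synchronously: the pulse knocks
        // the flops down at once, but they come back out of reset aligned
        // to a clock edge
        always @(posedge clk or negedge rst_async_n) begin
            if (!rst_async_n)
                sync_ff <= 2'b00;
            else
                sync_ff <= {sync_ff[0], 1'b1};
        end
        assign rst_sync_n = sync_ff[1];
    endmodule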
<raimo>
since everything dealt with via command buffers is not implemented in miaow
<raimo>
so CUDA and OpenCL just handle them properly via command processors, which exchange some data that i can not derive just like that
<raimo>
cause i do not have their code
<raimo>
I myself got a look into NVIDIA's design headquarters via youtube :), a very interesting video about how they simulate the gpu as the design changes, emulating all the behavior until all the bugs are gotten rid of and the chip is fabbed
<raimo>
that is a pretty complex process, but doable with fewer resources too
<raimo>
there is one thing i am sure about from looking into the amd clone RTL, regarding the async reset pulse sent from miaow2.0 aka scratch, which uses the fpga interconnect and a softcore to implement dispatching to the cores
<raimo>
it is sent when endpgm is met there, and it resets all the flops that use the rst signal
<raimo>
hence i can be sure that the program counter only changes upon starting a new stream
<raimo>
the relevant flops that put the threads to sleep are in the finished arbiter verilog file in the issue module; when those flops are reset by the reset pulse sent from the softcore's dispatcher
<raimo>
then all the scheduling arbiters start to work again
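A minimal sketch of how such sleep flops could look, assuming a per-wavefront "finished" flag that masks the wave out of issue arbitration until the dispatcher's reset pulse clears it. This illustrates the behavior described above; it is not MIAOW's actual finished arbiter code.

    module finished_flags #(
        parameter NWAVES = 8
    ) (
        input  wire              clk,
        input  wire              rst_n,        // async pulse from the softcore dispatcher
        input  wire [NWAVES-1:0] endpgm_seen,  // decode flags endpgm per wavefront
        output reg  [NWAVES-1:0] wave_asleep   // masks the wave out of issue arbitration
    );
        always @(posedge clk or negedge rst_n) begin
            if (!rst_n)
                wave_asleep <= {NWAVES{1'b0}};             // pulse wakes all waves: arbiters run again
            else
                wave_asleep <= wave_asleep | endpgm_seen;  // sticky sleep until the next reset
        end
    endmodule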
<raimo>
I still claim, since there are many smart communities on the network, that realistically i have no trouble whatsoever releasing my solutions; all the real problems are on the financial side, not technological
<raimo>
I have some buddies on IRC who, like libv, spotted that i have some clue and structure in my thoughts
<raimo>
i have worked hard on accessing my brain, and i like to think i am quite a talented theoretician; i generally like to place technological bets
<raimo>
it is all just training: anticipating how things work, then checking how the designers actually did it and whether i would have done it similarly, as has been the case most of the time
<raimo>
i like to test my thinking that way: whether my expectations are real and whether i am really not a hallucinating personality
<rellla>
enunes: which kernel are you booting on your A64?
<enunes>
rellla: upstream v5.3-rc8
raimo has quit [Quit: Leaving]
mardestan has joined #lima
<mardestan>
I also looked at the multi2sim code to some degree, right; i will try to express the intention behind it -- the instruction queueing is there only for VLIW, or rather because of the VLIW architecture
<mardestan>
you see, when we talk about this type of arch, the in-queue computation is no longer bundle based; it actually reorders stuff into queues
<mardestan>
while when you are in full pipeline mode, it will schedule the whole bundle if possible, i mean all the contents of a single VLIW word
<mardestan>
however, it will place the issued instructions out-of-order into queues for future reference
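A toy hardware analogue of that issue model (multi2sim itself is a C++ simulator; this Verilog sketch and every name in it are invented for illustration): the bundle issues as a unit, but each slot lands in its own per-FU queue and drains independently, so completion order decouples from bundle order.

    module bundle_dispatch (
        input wire        clk,
        input wire        rst_n,
        input wire        bundle_valid,
        input wire [31:0] slot_alu,   // ALU slot of the VLIW bundle
        input wire [31:0] slot_lsu    // LD/ST slot of the VLIW bundle
    );
        // per-FU in-flight queues: filled together at issue time,
        // drained independently by each functional unit
        reg [31:0] alu_q [0:7];
        reg [31:0] lsu_q [0:7];
        reg [2:0]  alu_wr, lsu_wr;

        always @(posedge clk or negedge rst_n) begin
            if (!rst_n) begin
                alu_wr <= 3'd0;
                lsu_wr <= 3'd0;
            end else if (bundle_valid) begin
                alu_q[alu_wr] <= slot_alu;  // both slots enqueue in one cycle...
                lsu_q[lsu_wr] <= slot_lsu;  // ...into separate per-FU queues
                alu_wr <= alu_wr + 3'd1;
                lsu_wr <= lsu_wr + 3'd1;
            end
        end
    endmodule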
<mardestan>
And i will soon start to program my mali GPU, mostly against the proprietary driver, but it can be plugged into yours as well
<mardestan>
but i won't write the kind of code you may be thinking of at the moment, where you place smaller or midsized kernels entirely into the queues and just run that
<mardestan>
i will implement a slightly different approach, what i talked about as redirecting FUs, or functional units
<mardestan>
in my opinion a hack like coresight, or JTAG in general, is normally not implemented on GPUs in a way where you can load the queues without going through the full pipeline
<mardestan>
there may be some facility on them for so-called interconnect-based general or random flop targeting; on FPGAs there is a chance for that type of hack
<mardestan>
however this is not really fleshed out into a readable or comprehensible spec
<mardestan>
which means that I can not replace the queue content directly without also fetching & decoding, i.e. i can't skip fetch and decode to replace a queue entry with a new alu op based on the data i would want to fill in directly
<mardestan>
you are maybe thinking of an interrupt-based hack, right, where the decoder gets interrupted by another warp; probably yes, that makes sense somewhat and maybe can somehow be done, but this is where i agree with karol: interrupts on most cards are prolly just slow
jrmuizel has joined #lima
<mardestan>
if one wanted to bring in such interrupt-based termination of a running wave without waiting for the graduation of the previous instructions
<mardestan>
then there are not awfully many ways to do this; synchronization would have to be software controlled with NOPs
<mardestan>
aka the programmer needs to rely on developer-side sw-controlled synchronization which ensures that the previous commands really were executed before terminating waves via the async reset pulse
<mardestan>
for instance, if the highest-latency instruction in flight is a div, or the worst possible memory latency, then enough NOPs need to be padded before the interrupt
<mardestan>
that the whole latency is covered with dummy cycles
<mardestan>
but some hybrid type of solution could also work, where NOPs are padded only to the worst-case decoding latency and the chip takes care that the instructions graduate afterwards
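A sketch of that hybrid idea in Verilog, under the stated assumption that a worst-case latency bound is known: the kill request is not forwarded as a reset until a drain counter has covered the bound, so in-flight instructions can graduate first. Entirely hypothetical logic, not taken from any real chip.

    module drain_then_reset #(
        parameter WORST_LAT = 64  // assumed worst-case latency in cycles
    ) (
        input  wire clk,
        input  wire rst_n,
        input  wire kill_req,     // interrupt asking to terminate the wave
        output reg  kill_rst_n    // reset actually forwarded to the wave's flops
    );
        reg [7:0] cnt;
        reg       draining;
        always @(posedge clk or negedge rst_n) begin
            if (!rst_n) begin
                cnt        <= 8'd0;
                draining   <= 1'b0;
                kill_rst_n <= 1'b1;
            end else if (kill_req && !draining) begin
                cnt      <= WORST_LAT;      // start covering the latency window
                draining <= 1'b1;
            end else if (draining) begin
                if (cnt == 8'd0) begin
                    kill_rst_n <= 1'b0;     // window elapsed: safe to pulse the reset
                    draining   <= 1'b0;
                end else
                    cnt <= cnt - 8'd1;
            end else
                kill_rst_n <= 1'b1;         // deassert after a one-cycle pulse
        end
    endmodule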
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
megi has quit [Ping timeout: 245 seconds]
Tofe has joined #lima
<mardestan>
Anyway, on GCN for instance the epilog is always generated as a branch, which should take care of the decode delays.
<mardestan>
an s_waitcnt actually seems to be generated before the endpgm
_whitelogger has joined #lima
jrmuizel has joined #lima
<mardestan>
this may not always work, it would work on miaow though; well, some boys on the net say that GPUs actually also have integrated jtag support, i.e. boundary scan
<mardestan>
but that is something vendors do not provide supporting files for, like BSDL files
<mardestan>
so the regions would need to be brute-forced out.
drod has quit [Ping timeout: 265 seconds]
<mardestan>
yeah, this does not appear to be possible without risking damage to the board.
<mardestan>
so jtag pin layouts can not be identified easily
drod has joined #lima
afaerber has joined #lima
jbrown has quit [Quit: Leaving]
mardestan has quit [Quit: Leaving]
jbrown has joined #lima
jrmuizel has quit [Remote host closed the connection]
nerdboy has quit [Ping timeout: 245 seconds]
jrmuizel has joined #lima
nerdboy has joined #lima
jrmuizel has quit [Remote host closed the connection]
nerdboy has quit [Ping timeout: 276 seconds]
mardestan has joined #lima
megi has joined #lima
<mardestan>
do you remember how moving around in the CORE queues works? There are really two easier ways, one of which does not work on GCN -- writing into the address registers from a preceding LD/ST instruction, which gets clamped -- the reason why this does not work on GCN is unknown, but the chip freezes there
<mardestan>
the other was that you also have duplets per instruction: a preceding LSU op and a following ALU op
<mardestan>
and to skip it on the same core, you need to write to the very same address on the very same texture unit
jrmuizel has joined #lima
<mardestan>
but of course another method would do, if one were to use indirect loads preceding the ALUs; i won't favor it though, cause they prolly count as an ALU, so you hit the alu limit sooner
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
<mardestan>
yeah, i understand that on some chips the indirection flags are part of the instruction's opcode
Elpaulo has quit [Quit: Elpaulo]
<mardestan>
well yeah, it appears the out-of-range nops can be used too; i had forgotten that just now
<mardestan>
so you set the address register to a certain value which will trigger nops
<mardestan>
for instance you have vgpr1 + addrreg + 5, assuming that negative offsets are illegal
<mardestan>
that targets reg 6, the 7th reg
<mardestan>
while the alus use registers in decrementing order, from 10 down to 1, for the 10 ALUs
<mardestan>
starting from 10 and going down with each alu
<mardestan>
say you have determined that you want to schedule the 3rd instruction, or even the 3rd and 4th
<mardestan>
hmm, hell why decrementing, silly me
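Reading that example with ascending allocation, i.e. v0..v9 feeding the 10 ALUs (which seems to be the intended order after the correction above), a tiny hypothetical Verilog check of when the indirect index falls out of range and the slot degrades to a nop. All names and widths here are invented for illustration.

    module indirect_nop_check (
        input  wire [3:0] addrreg,       // runtime address register
        output wire [4:0] idx,           // effective register index
        output wire       issues_as_nop  // true when the slot degrades to a nop
    );
        // vgpr1 + addrreg + 5: with addrreg = 0 this hits v6, the 7th reg
        assign idx = 5'd1 + addrreg + 5'd5;
        // assuming the 10 ALUs consume v0..v9, anything above v9 is out
        // of range, so the hardware issues the slot as a nop instead
        assign issues_as_nop = (idx > 5'd9);
    endmodule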
<mardestan>
i think everyone understood that the real problem with this approach is that when source registers are redirected via indirection on specific alus, those ALUs will all still be issued, and it causes the
<mardestan>
column to change all the time
<mardestan>
but not all chips allow redirecting destination registers :( so one would have to put the stalling load at the front of every row, for instance
<mardestan>
and hence it appears that even done that way, it will only work on true SIMD
<mardestan>
ah heck, this is not true
<mardestan>
it will also work cross bundle on VLIW
nerdboy has joined #lima
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
armessia_ has joined #lima
armessia_ has quit [Client Quit]
armessia has joined #lima
armessia has quit [Quit: Leaving]
armessia has joined #lima
<anarsoul>
indirect load seems to be working fine in ppir