#lima on 2019-09-14 — irc logs at freenode.irclog.whitequark.org

2019-07-03 10:24 ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=lima and https://freenode.irclog.whitequark.org/lima - Contact ARM for binary driver support!

00:25 StarfishPrime__ has joined #lima

00:43 drod has quit [Remote host closed the connection]

01:27 megi has quit [Ping timeout: 245 seconds]

02:25 nerdboy has quit [Ping timeout: 265 seconds]

02:25 nerdboy has joined #lima

02:26 nerdboy has quit [Client Quit]

02:26 nerdboy has joined #lima

03:23 _whitelogger has joined #lima

04:04 <MoeIcenowy> anarsoul: if both lima and sun4i-drm are built-in

04:05 <MoeIcenowy> then lima will always be card0

04:05 <MoeIcenowy> because sun4i-drm is complex and needs more probe work
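
[Editor's sketch] Since the card numbering depends on probe order, user space can look a card up by its kernel driver instead of hard-coding card0. The snippet below is a minimal illustration, not something posted in the channel. It assumes the usual Linux sysfs layout, where `/sys/class/drm/cardN/device/driver` is a symlink to the bound driver; the function name `drm_cards_by_driver` is made up for this example.

```python
import os


def drm_cards_by_driver(sysfs_root="/sys/class/drm"):
    """Map kernel driver name -> list of primary DRM nodes, e.g. {"lima": ["card0"]}.

    Assumes the usual sysfs layout (cardN/device/driver is a symlink to the driver).
    """
    result = {}
    for entry in sorted(os.listdir(sysfs_root)):
        # Keep only primary nodes named cardN; skip connectors such as
        # card1-HDMI-A-1 and render nodes such as renderD128.
        if not entry.startswith("card") or not entry[4:].isdigit():
            continue
        driver_link = os.path.join(sysfs_root, entry, "device", "driver")
        if not os.path.islink(driver_link):
            continue  # no driver bound, or unexpected layout
        driver = os.path.basename(os.readlink(driver_link))
        result.setdefault(driver, []).append(entry)
    return result


if __name__ == "__main__":
    for driver, cards in drm_cards_by_driver().items():
        print(driver, cards)
```

On a board where both drivers are built in, this would be expected to report lima and sun4i-drm on different card nodes, whichever order they probed in.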

04:07 <MoeIcenowy> enunes: could you help me to generate a regression list for my flatten-cf branch? thanks

04:11 dllud_ has joined #lima

04:12 dllud has quit [Read error: Connection reset by peer]

04:12 dllud_ is now known as dllud

04:18 jrmuizel has joined #lima

04:29 _whitelogger has joined #lima

04:30 jrmuizel has quit [Remote host closed the connection]

04:34 jrmuizel has joined #lima

04:44 dddddd has quit [Remote host closed the connection]

04:59 _whitelogger has joined #lima

05:03 jrmuizel has quit [Read error: Connection reset by peer]

05:03 jrmuizel has joined #lima

05:25 jrmuizel has quit [Remote host closed the connection]

05:48 <rellla> anarsoul: i haven't

06:21 nerdboy has quit [Ping timeout: 276 seconds]

06:22 nerdboy has joined #lima

06:27 nerdboy has quit [Excess Flood]

06:28 nerdboy has joined #lima

08:10 <wens> maybe still traveling back from LPC / Maintainer summit? Maintainer summit was Thursday

09:44 megi has joined #lima

10:00 mardestan has joined #lima

10:02 raimo has joined #lima

10:03 <raimo> anyways, on the sequence of calls and how streams are managed, i'll quickly elaborate on what happens under the hood

10:04 <raimo> first you create the stream, which initializes a bunch of scheduling registers and program counters to certain values

10:04 <raimo> then it is time for the chip to execute all the arithmetic instructions in the stream

10:04 drod has joined #lima

10:04 mardestan has quit [Ping timeout: 245 seconds]

10:05 <raimo> until the program calls halt, which is captured

10:05 <raimo> by the chip again... next up you are supposed to call stream destruction

10:06 <raimo> if this is not called and no new stream is created either, it will stay in the queues and do thread-reuse based computation
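
[Editor's sketch] The create / run-until-halt / destroy sequence described above can be pictured with a toy state machine. This is only an illustration of the call order, not MIAOW, CUDA or OpenCL code. Every name here (`ToyStream`, `State`, the `add`/`mul`/`halt` ops) is hypothetical, and the model assumes the program ends with a `halt`.

```python
from enum import Enum, auto


class State(Enum):
    CREATED = auto()    # registers and program counter initialised
    HALTED = auto()     # program reached halt, stream still allocated
    DESTROYED = auto()  # stream explicitly torn down


class ToyStream:
    """Toy model of a stream: create, run until halt, then destroy."""

    def __init__(self, program):
        # "Creating the stream" sets the program counter and registers.
        self.program = list(program)
        self.pc = 0
        self.acc = 0
        self.state = State.CREATED

    def run(self):
        # Execute arithmetic instructions until the program calls halt.
        while self.state is State.CREATED:
            op, arg = self.program[self.pc]
            self.pc += 1
            if op == "add":
                self.acc += arg
            elif op == "mul":
                self.acc *= arg
            elif op == "halt":
                self.state = State.HALTED
            else:
                raise ValueError(f"unknown op: {op}")
        return self.acc

    def destroy(self):
        # Until this is called, the halted stream stays allocated.
        self.state = State.DESTROYED


if __name__ == "__main__":
    stream = ToyStream([("add", 2), ("mul", 5), ("halt", None)])
    print(stream.run(), stream.state)
    stream.destroy()
    print(stream.state)
```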

10:29 dddddd has joined #lima

10:34 <raimo> I have a slight issue where I do not have export bits available in the miaow RTL, i guess i need to make some educated guesses as to how the

10:34 <raimo> done bits of the final export work

10:48 <hellsenberg> huh, TIL about http://miaowgpu.org/

11:00 <raimo> i.e. the synchronization method is not entirely clear; theoretically it does not need the fetch dispatcher to come up to start the simd arbiter at all

11:00 <raimo> halt keeps the fetch arbitration from coming up, but i do not really understand how they capture the exception for the SIMD one

11:01 <raimo> since the SIMD arbiter can make the chip go into an incorrect state when it deals with memory accesses

11:04 <raimo> aah ok pardon, it is a finished arbiter in code, so they probably defer the rst signal for those flops, and put some command buffer instructions in the middle

11:04 <raimo> since everything dealing with command buffers is not implemented in miaow

11:05 <raimo> so CUDA and OPENCL just do them correctly via command processors, where they exchange some data which i can not derive just like that

11:05 <raimo> cause i do not have their code

11:22 <raimo> I myself looked into NVIDIA design headquarters from youtube :), very interesting video was how they simulate the gpu when the design is changing to emulate all the behavior, until all bugs are gotten rid of and the chip is fabbed

11:22 <raimo> pretty complex process that is, but doable with fewer resources too

11:35 <raimo> there is one thing i am sure about looking into the amd clone RTL, when async reset pulse is sent from miaow2.0 aka scratch, which uses fpga interconnect and softcore to implement the dispatching to the cores

11:36 <raimo> it is sent when endpgm there is met, and it resets all the flops that use rst signals

11:37 <raimo> so hence i can be sure, that program counter is changing only upon starting the new stream

11:39 <raimo> the relevant flops that put the threads to sleep are in the finished arbiter verilog file in the issue module; when those flops are reset with the reset pulse sent by the softcore's dispatcher

11:40 <raimo> then all the scheduling arbiters start to work again

11:47 <raimo> I still claim since there are many smart communities on the network, that realistically i have no troubles whatsoever to release my solutions, all the real problems are financial sided not technological

11:47 <raimo> I have some buddies on IRC who, like libv, spotted that i have somewhat of a clue and structure in my thoughts

11:48 <raimo> i have worked hard on accessing my brain and i like to think i am quite a talented theoretician, and i generally like to place technological bets

11:50 <raimo> it is all just a training, anticipating how things work and testing the designers how they did it, and if i was about to do it similarly as have been the case most the time

11:50 <raimo> i like to test my thinking that way, if my expectations are real and am i really not a hallucinating personality

11:52 <rellla> enunes: which kernel are you booting on your A64?

11:53 <enunes> rellla: upstream v5.3-rc8

11:54 raimo has quit [Quit: Leaving]

13:02 mardestan has joined #lima

13:13 <mardestan> I also to some degree looked at the multi2sim code, right, i'll try to express what the intention was in doing so -- It is only for VLIW or because of VLIW architecture instruction queueing

13:14 <mardestan> you see when we talk about this type of arch, then in queue computation is no longer bundle based, but it actually reorders stuff into queues

13:15 <mardestan> while when you are in full pipeline mode, it will schedule the bundle if possible i mean all content of a single core words

13:16 <mardestan> however it will place the issued instructions out-of-order into queues for future references

13:18 <mardestan> And i soon start to program my mali GPU but doing it for the proprietary driver mostly, but can be plugged to yours as well

13:22 <mardestan> but i won't do such code like you may be thinking at the moment, that you place smaller or midsized kernels entirely into queues and just run this

13:23 <mardestan> i implement a slightly different approach, which i talked about as redirecting FUs, or functional units

13:26 <mardestan> in my opinion such hack as coresight or JTAG in general is normally not implemented on GPUs, where you can load the queues without going through the full pipeline

13:29 <mardestan> there may be some facility on them, a so-called interconnect-based general or random flop targeting; like on FPGAs, there is a chance for that type of a hack

13:29 <mardestan> however this is not really fleshed out much into a readable or comprehensible spec

13:32 <mardestan> which means that I can not replace the queue content directly without fetch & decode too, i.e. can't skip fetch and decode when replacing the queue with a new alu based on the data i would want to fill in directly

13:51 <mardestan> you are maybe thinking an interruption based hack right, where decoder gets interrupted by another warp, probably yes that makes sense somewhat, maybe can be somehow done, but this is where i agree with karol, prolly interrupts on most cards are just slow

14:11 jrmuizel has joined #lima

14:15 <mardestan> if one were to want to bring such interrupt based termination of a running wave without waiting for the graduation of the previous instructions

14:15 <mardestan> then there are not awfully many ways to do this, but synchronization would have to be software controlled with NOPs

14:16 <mardestan> aka the programmer needs to rely on a developer's sw-controlled synchronization, which ensures that previous commands really were executed before terminating waves via async reset pulse

14:18 <mardestan> for instance if the highest latency instruction in flight is div or worst possible memory latency, then so many NOPs need to be padded before the interrupt

14:19 <mardestan> that the latency is occupied with dummy cycles
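
[Editor's sketch] The NOP padding idea above amounts to inserting as many dummy cycles as the slowest in-flight instruction needs before the interrupt. The snippet below is a conservative illustration only. The latency table is made up, one NOP is assumed to cost one cycle, and the names `nops_before_interrupt` and `pad_with_nops` are hypothetical.

```python
# Hypothetical per-instruction latencies in cycles; real values are chip specific.
EXAMPLE_LATENCY = {"add": 1, "mul": 3, "div": 20, "load": 100}


def nops_before_interrupt(in_flight, latency):
    """Worst-case padding: latency of the slowest instruction still in flight."""
    return max((latency[op] for op in in_flight), default=0)


def pad_with_nops(program, latency):
    """Append enough NOPs to cover the worst case, then the interrupt marker.

    Conservative: treats every instruction in `program` as still in flight and
    assumes one NOP occupies exactly one cycle.
    """
    padding = nops_before_interrupt(program, latency)
    return list(program) + ["nop"] * padding + ["interrupt"]


if __name__ == "__main__":
    padded = pad_with_nops(["add", "div"], EXAMPLE_LATENCY)
    print(padded.count("nop"), padded[-1])
```

With `div` as the slowest instruction in flight, the padding equals the assumed `div` latency; a memory load would dominate instead if one were pending.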

14:26 <mardestan> but there can also be some hybrid type of solution where NOPs are padded to worst case decoding latency, and the chip takes care that they graduate afterwards

14:30 jrmuizel has quit [Remote host closed the connection]

14:32 jrmuizel has joined #lima

14:34 dllud has quit [Quit: ZNC 1.7.4 - https://znc.in]

14:42 dllud has joined #lima

14:52 jrmuizel has quit [Remote host closed the connection]

14:57 jrmuizel has joined #lima

15:24 jrmuizel has quit [Remote host closed the connection]

15:25 jrmuizel has joined #lima

15:45 jrmuizel has quit [Remote host closed the connection]

15:57 megi has quit [Ping timeout: 245 seconds]

15:58 Tofe has joined #lima

16:41 <mardestan> Anyways for instance on GCN epilog is generated always as a branch, which should take care of the decode delays.

16:59 <mardestan> s_waitcnt seems to be generated before the endpgm actually

17:26 _whitelogger has joined #lima

17:43 jrmuizel has joined #lima

17:44 <mardestan> this may not always work, would work on miaow though, well some boys on the net say that actually GPUs also have integrated jtag support, i.e. boundary scan

17:45 <mardestan> but that is something that vendors do not provide supporting files for, like BSDL files

17:45 <mardestan> so the regions would need to be brute forced out.

17:50 drod has quit [Ping timeout: 265 seconds]

17:58 <mardestan> this yeah does not appear to be possible without risking damage to the board.

17:59 <mardestan> so jtag pin-layouts can not be identified easily

18:02 drod has joined #lima

18:05 afaerber has joined #lima

18:13 jbrown has quit [Quit: Leaving]

18:15 mardestan has quit [Quit: Leaving]

18:18 jbrown has joined #lima

18:19 jrmuizel has quit [Remote host closed the connection]

18:53 nerdboy has quit [Ping timeout: 245 seconds]

18:55 jrmuizel has joined #lima

18:57 nerdboy has joined #lima

19:12 jrmuizel has quit [Remote host closed the connection]

19:22 nerdboy has quit [Ping timeout: 276 seconds]

19:25 mardestan has joined #lima

19:30 megi has joined #lima

19:31 <mardestan> do you remember how moving around in the CORE queues works? There are really two easier ways, one of which does not work on GCN -- if you write into address registers from a preceding LD/ST instruction which gets clamped -- the reason why this does not work on GCN is unknown, but the chip freezes there?

19:32 <mardestan> the other was that you also have duplets per instruction, a preceding LSU and a following ALU

19:32 <mardestan> and on the same core to skip it, you need to write to the very same address on very same texture unit

19:33 jrmuizel has joined #lima

19:36 <mardestan> but of course another method would do, if one were to use indirect loads preceding the ALUs, but i won't favor it, cause they prolly count as an ALU, so hitting the limit on ALUs sooner

19:36 jrmuizel has quit [Remote host closed the connection]

19:38 jrmuizel has joined #lima

19:50 <mardestan> yeah i understand that on some chips indirection flags are part of the instructions opcode

19:51 Elpaulo has quit [Quit: Elpaulo]

19:56 <mardestan> well yeah it appears the out-of-range nops can be used too, at this very moment i had forgotten it

19:56 <mardestan> so you place address register to certain value which will trigger nops

19:58 <mardestan> for instance you have vgpr1 + addrreg +5 assuming that negative offsets are illegal

19:58 <mardestan> that targets reg 6, the 7th reg

19:59 <mardestan> when the alus use registers from 10 to 1 decrementing order for 10ALUs

19:59 <mardestan> starting from 10 and going under with each alu

20:00 <mardestan> you have determined that you want to schedule the 3rd instruction or even the 3rd and 4th

20:07 <mardestan> hmm, hell why decrementing, silly me

20:16 <mardestan> i think all understood that with this approach the problem is really that source registers redirected with indirections on specific alus, those ALUs will all be issued and it causes the

20:16 <mardestan> column to change all the time

20:17 <mardestan> but not all chips allow to redirect destination registers :( so one would have to put the stalling load in the front of every row for instance

20:19 <mardestan> and it appears hence that it will work also that way only on true SIMD

20:23 <mardestan> ah heck, this is not true

20:24 <mardestan> it will also work cross bundle on VLIW

20:38 nerdboy has joined #lima

21:10 jrmuizel has quit [Remote host closed the connection]

21:11 jrmuizel has joined #lima

21:31 jrmuizel has quit [Remote host closed the connection]

21:40 armessia_ has joined #lima

21:40 armessia_ has quit [Client Quit]

21:41 armessia has joined #lima

21:52 armessia has quit [Quit: Leaving]

21:52 armessia has joined #lima

21:53 <anarsoul> indirect load seems to be working fine in ppir

22:02 armessia has quit [Quit: Leaving]

22:02 armessia has joined #lima

22:04 armessia has left #lima [#lima]

22:04 armessia has joined #lima

22:05 armessia has left #lima [#lima]

22:06 armessia has joined #lima

22:07 armessia has quit [Client Quit]

22:08 armessia has joined #lima

22:08 armessia has quit [Client Quit]

23:41 jrmuizel has joined #lima