#lima on 2019-09-24 — irc logs at freenode.irclog.whitequark.org

2019-07-03 10:24 ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=lima and https://freenode.irclog.whitequark.org/lima - Contact ARM for binary driver support!

00:09 jrmuizel has joined #lima

00:09 jrmuizel has quit [Remote host closed the connection]

00:09 jrmuizel has joined #lima

00:45 mardikene193 has joined #lima

00:48 <mardikene193> about the many-to-many scheduler, it is compromise between many-to-one and 1:1. As it is also possible to hijack syscalls, this comes with a similar security risk, when the upcall system call is known, security issue is slightly more but not too big of an issue either compared to 1:1.

00:49 jrmuizel_ has joined #lima

00:49 <mardikene193> For an example all the POSIX layer on windows nt the emulated POSIX API would work on system privileges, which has kernel execution support via some specific calls, on.

00:50 <mardikene193> on./on linux similar thing would happen and on BSD/XNU

00:51 jrmuizel has quit [Ping timeout: 276 seconds]

00:52 <MoeIcenowy> anarsoul: not a fix

00:53 <anarsoul> :(

00:56 <anarsoul> btw, gnome is not really friendly for tiling GPUs

00:56 <anarsoul> s/for/to

00:56 <anarsoul> it seems to use a huge fbo and renders only into part of it

00:57 <mardikene193> the win is already around 800-fold, because a context switch from userspace done via upcalls is inherently vastly faster.

00:58 <mardikene193> on CPUs and when you combine that with coresight or pipelined jtag boundary scan hw api

00:58 <mardikene193> things go into mad heights for performance.

00:59 <mardikene193> I also developed this two's complement instruction stream compression, which as mentioned I call binary stream instruction checksumming

01:00 <mardikene193> that saves all the instruction cache bandwidth

01:07 <anarsoul> MoeIcenowy: btw, unrelated issue, but this part from cogl: "cogl_color_out = cogl_color_in * texture2D (tex, vec2 (cogl_tex_coord_in[0].xy));" where cogl_tex_coord_in is vec4 array of 1 element will result in poor image quality on lima

01:08 <mardikene193> VLIW is great because such bitwise operations on top of twos complement subtract can be taken as entirely parallel operations, cause it comes with more alus since scoreboarding in hw is expensive, 16 not parallel ops can be taken on terascale and on mali gpus they have sort of weird specs

01:09 <MoeIcenowy> anarsoul: oops

01:10 <anarsoul> as well as later calculations with coords

01:10 <mardikene193> based of the docs, mali 400 vliws have pipeline of 128 stages per core

01:10 <MoeIcenowy> why will it result in a poor image?

01:10 <anarsoul> MoeIcenowy: because varying is not passed directly into sampler

01:10 <anarsoul> it will be loaded into register

01:10 <MoeIcenowy> the register is only 16-bit?

01:10 <anarsoul> it's fp16

01:10 <anarsoul> so it's only 10 bits of precision

01:11 <MoeIcenowy> oh

01:11 <anarsoul> 5 for mantissa, 1 for sign

01:11 <mardikene193> each bundle is 4words of instructions, i think they count simd width each stage of simd separately also as stage

01:11 <anarsoul> it's not enough to sample high-res textures accurately

01:11 <mardikene193> so 128/4/4 is 16 there as well

01:12 <MoeIcenowy> so directly using varying is the only way to keep 32-bit float number in PP?

01:12 <anarsoul> s/mantissa/exponent

01:12 <mardikene193> you can take 16bundles per core in parallel but 4*16 when there are no deps involved

01:12 <anarsoul> MoeIcenowy: yes, but only for samplers. Anything else uses fp16.

01:13 <anarsoul> I guess cogl authors couldn't imagine that some GPUs will be fp16-only :)
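The fp16 point above (1 sign bit, 5 exponent bits, 10 mantissa bits) can be checked numerically. The following is a minimal sketch, not lima or Mesa code: `to_fp16`, `fp16_step` and `worst_texel_error` are hypothetical helper names, and it assumes the texture coordinate is simply rounded to IEEE 754 half precision before sampling.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_step(x: float) -> float:
    """Spacing between adjacent fp16 values around x (normal range only)."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - 10)  # 10 explicit mantissa bits


def worst_texel_error(width: int) -> float:
    """Largest rounding error, in texels, over all texel centres of one row."""
    worst = 0.0
    for i in range(width):
        u = (i + 0.5) / width  # normalised coordinate of texel centre i
        worst = max(worst, abs(to_fp16(u) - u) * width)
    return worst


if __name__ == "__main__":
    print("fp16 step near 0.75:", fp16_step(0.75))
    for width in (256, 800, 1104, 4096):
        print(width, "texels wide ->", worst_texel_error(width), "texel max error")
```

In the range [0.5, 1.0) the fp16 step is 2^-11, so only about 2048 distinct positions are available across that half of the texture. That is consistent with the remark that fp16 is not enough to sample high-res textures accurately.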

01:14 yuq825 has joined #lima

01:14 <mardikene193> when you think a little, then you should see why VLIW is considered the best architecture of all times.

01:16 <mardikene193> elbrus CPU has been done so that 32words per CPU core, and there is a coupled array prefetch buffer to the instruction windows, you can grab instruction even if it was dependent out from any position

01:17 <mardikene193> without causing excessive pipeline stalls due to bad ordering

01:22 <anarsoul> MoeIcenowy: unless fragment shader uses spilling the only place PP can read from is texture 0

01:23 <anarsoul> that has weird size of 800x1104, but I think it's OK

01:24 <MoeIcenowy> anarsoul: do you think GNOME Shell is very weird when analyzing the apitrace? ;-)

01:24 <anarsoul> that's definitely not the best piece of software

01:25 <MoeIcenowy> but also not the most weird?

01:25 <anarsoul> I've been in software for quite a while. I saw much weirder things :)

01:27 <anarsoul> it's quite surprising that it mixes GLSL shaders with ARB

01:28 <MoeIcenowy> anarsoul: BTW what's ARB shader?

01:28 <MoeIcenowy> I have 0 knowledge about it

01:28 <MoeIcenowy> some lowlevel thing?

01:28 <mardikene193> since the branches according to the patent can be done in the queues, as well as being async in nature, it is different on CPUs how branches are handled as i spoke

01:28 <anarsoul> https://en.wikipedia.org/wiki/ARB_assembly_language

01:29 <MoeIcenowy> anarsoul: why is there only fragment in this situation?

01:29 <mardikene193> however the concept is in generalisation the same there, everything is a branch

01:29 <anarsoul> no idea

01:34 <mardikene193> so yes, when the bundle is from 1-32 you can call 32 branches for their target more precisely as every instruction would be a short branch, and elbrus cpu will do the reordering at runtime

01:34 <mardikene193> it will run first the non-dependent instructions until the bundle is over and everything falls into place as it should

01:35 <MoeIcenowy> finally the eMMC on my PineTab bombed totally

01:37 <anarsoul> :(

01:37 <anarsoul> use NFS?

01:38 <bshah|matrix> O.o

01:39 <MoeIcenowy> oh? I can still read it with the USB reader?

01:39 <MoeIcenowy> (but it totally doesn't boot now on PineTab itself)

01:41 <MoeIcenowy> maybe the connector is loose

01:44 <mardikene193> is that in nature more expensive using bypass registers and array prefetch buffers and loop counters than that of scoreboard, yeah well definitely not

01:45 <mardikene193> it is just a fifo structure everything from branches are played until they are done

01:48 <mardikene193> and they come in flavors of 512 to 1024 queue entries, plenty of running scheduling, it was a surprising read that sometimes in history also russians are pretty smart designers, and pretty early on

01:52 <anarsoul> some IPA won't hurt :)

02:26 jrmuizel_ has quit [Remote host closed the connection]

02:37 <mardikene193> this is not some rant, I have no elbrus chip, it may be something similar to what transmeta tried to achieve and tell people though, the latter as I understand no longer exists as a company

02:38 <mardikene193> I see there, as some russian wanted to connect me how to use different processors in oil refining industry

02:38 <mardikene193> pretty easy task, even with their own chips

02:48 <mardikene193> they have less than half the power consumption of intel chips on full pipeline already, because of doing things correctly, i say it will get drastically lower still on IB based computation on nested loops with sw pipelining triggered in

02:48 <mardikene193> I say those designers were very sanely spot on

02:56 <mardikene193> I assume their hw reordering is faster than that i could up with on radeon VLIWs hehee, it is superior chip to AMD vliws, 2048 queue entries against 2560 though, radeon does reorder inside a driver like i told, it has no unstalling async mechanism in hw

02:56 <mardikene193> but it instead has a driver, and possibility to feed zero operands and do it during runtime in hw with sw controlled

02:57 <mardikene193> as i mentioned this is due to difference in accelerators like GPUs and CPUs instruction buffer structures i.e instruction queues, issue queues or instruction windows all are referring to the same thing

02:58 <mardikene193> if you wanted to do the same thing on stanford CGRA like elbrus does , or at least similarly in functionality that is easily possible

02:59 <mardikene193> however GPUs would have rework the branch module and issue queue modules to do the similar

03:02 <mardikene193> I do not advise doing so my own, instead you do the scoreboarding in the sw with using twos complement procedures, which are only two gate delays and skipping with zero operands also two gate delays

03:04 <mardikene193> cause GPUs have something that can be used for redirecting operands which CPUs normally like X86 CISC and ARMs RISC do not have

03:04 <mardikene193> as mentioned throughout my talks this is register indirect addressing

03:07 megi has quit [Ping timeout: 276 seconds]

03:11 <mardikene193> and about different events that you can not do (that includes programming), i would advise/recommend to keep your mouths shut as well as fingers off the keyboard when stars who can do that start to talk ok?

03:13 <mardikene193> AS I TOLD YOU MORE THAN SEVERAL times allready, the in queue rendering/computation triggers column change on two instructions in sequence issued , i.e not graduated just issued consequently back to back from , consequent wfids, not 1 and 7 but either 1 and 2

03:13 <mardikene193> 4 and 5

03:13 <mardikene193> not 8 and 10 etc.

03:13 <mardikene193> otherwise it will fetch the current wfid row or line and stay there

03:15 <mardikene193> because the sum of valid_entries of from previous iteration and current instruction in scbd_feeder.v for hungry bits , this line puts +1 to the chip

03:16 <mardikene193> when the valid_entry.v captures those bits, the first instruction is no longer driven, but second is captured to instr_info_table.v only

03:17 <mardikene193> so it needs to capture the +1 only, and it can do so by running the next instruction also while the +1 is broadcasted, that way stuff gets driven

03:18 <mardikene193> i have fucking told you this more than one years time allready

03:21 <mardikene193> when you replace the values in next_hungry line, vacant | hungry & ~40{feed_valid} & previous wfid , what the fuck do you get than?

03:22 <mardikene193> I do not reread or structure anything, since I KNOW THAT!

03:23 <mardikene193> this will add +1

03:24 <mardikene193> you are fucking morans

03:25 <mardikene193> I also gave the mathematical expression which is absolutely correct

03:27 <mardikene193> why on full length pipeline the queues do not switch columns, i already told that, they have automatic instance indexes

03:27 <mardikene193> it is round robin scheduling, all queue entries will be filled in

03:31 <mardikene193> it works in a way that , fetch counter always wraps around as the comments suggest, it fetches to wr_decode_data buffers 40*40 on GCN as understood

03:32 <mardikene193> fetch wfid of one will be put to the first column1 line 1

03:32 <mardikene193> second appearance of 1 will be put to column 1 line 2 etc.

03:32 <mardikene193> FUCKING HELL!

03:33 <mardikene193> how are you so nuts to blame me about anything when you have fucking head filled with only water, no synapses no neurons

03:35 <mardikene193> the procedure continues until halt aka endpgm is met, which kills the fetch

03:35 <mardikene193> now if the chip is not being reset with a pulse or the memory underlying in case of graphics

03:37 <mardikene193> does get memory errors the chip would start doing in queue computation

03:41 <mardikene193> during the reset pulse not sent to the chip, it starts to send 6{1'bX} forward as wf_id

03:42 <mardikene193> which is lifted to 111111 always accepting wave

03:51 <mardikene193> if the interconnect delays the rst signal , like when it is not captured , everything will be rendered/computed from shader engine queues

03:52 <mardikene193> however when it sends the reset, and restarts the program counter hence things will go in full length of the pipeline

04:05 <mardikene193> and how you can imagine the stuff, from the queues everything gets issued in parallel upto 16 scheduled for execution if all they succeed the scoreboard test

04:07 <mardikene193> and sanity wise, it is absolutely correct to put things like that into queues, they are fetched from wr_decode_data stored buffers in case of fetch itself is sleeping, and yes all the column of 40 is being fetched

04:07 <mardikene193> and stored to the line out buffer

04:08 <mardikene193> consistency wise it is just playing with words

04:09 <mardikene193> maybe i was inaccurate to state that column is fetched to the row, how else can i say that? the line is fetched to the line perhaps one after another to form a line out

04:11 dddddd has quit [Remote host closed the connection]

04:17 <mardikene193> my code does not care of variable latencies

04:17 <mardikene193> cause you see everything is a branch

04:19 <mardikene193> all you have to do is to pull in all the queue entries initially with zero operand values

04:29 _whitelogger has joined #lima

05:19 Barada has joined #lima

06:09 niceplace has quit [Read error: Connection reset by peer]

06:13 niceplace has joined #lima

06:38 raimo has joined #lima

06:38 mardikene193 has quit [Read error: Connection reset by peer]

06:42 raimo has quit [Read error: Connection reset by peer]

07:40 niceplace has quit [Read error: Connection reset by peer]

07:42 niceplace has joined #lima

07:54 mardikene193 has joined #lima

08:12 <mardikene193> variable length latencies are the ones that generate carry or borrows, hence the anandtech link is not very correct about deterministic delays, well it partly is cause the framerate is mostly capped per instruction, but in the queues they are not

08:12 <mardikene193> neither SIMD nor VLIW offer guarantees that the same shader will take the same time to finish in queued mode on two different iterations

08:13 <mardikene193> similarly the variable latency instruction is memory load as you have noticed, depending whether it hits the cache or not

08:29 <mardikene193> And on VLIW the best optimal way is reordered during the runtime, it can not do any better than how queues put things to natural order, hence i tell that mannerovs scheduling does not make difference on multiiteration shaders

08:32 yuq8251 has joined #lima

08:34 yuq825 has quit [Ping timeout: 276 seconds]

08:42 UnivrslSuprBox has quit [Ping timeout: 245 seconds]

08:42 kaspter has quit [Remote host closed the connection]

08:42 kaspter has joined #lima

08:44 jailbox has quit [Ping timeout: 245 seconds]

08:45 eightdot has quit [Ping timeout: 245 seconds]

08:45 UnivrslSuprBox has joined #lima

08:46 jailbox has joined #lima

08:46 eightdot has joined #lima

10:15 Barada has quit [Quit: Barada]

10:16 Barada has joined #lima

10:33 <mardikene193> I am not feeling well either when i am pissed off, i think miaow is reasonably over average tricky, people should not behave the way they do when i am talking the truth regardless

10:38 yuq8251 has quit [Quit: Leaving.]

10:39 <mardikene193> most complex to understand could be that my simulator shows fetch arbiter on full pipeline always trailing by one, the bitwise operation flips the bits with NOT after AND of the previous simd arbiters wfid, this results in bigger value kept

10:40 <mardikene193> since it is the line of continuous assignment, it will be driven again

10:40 <mardikene193> shortly after the bigger value is kept

10:47 jbrown has joined #lima

10:48 megi has joined #lima

11:13 <mardikene193> kospadin anarsoul then when you cancel say 8 with 8 it will drive zero, simd arbiter arbitrates to say 10 as one wishes, fetch arbitrates to 0...next time in cancels out 0, all 1s will be evaluated in the last part of the next_hungry hence

11:14 <mardikene193> how much will 1 | 10 next time equal?

11:36 <mardikene193> 2+8+1 is yeah 11, but lets do that in simulator
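The arithmetic in the `1 | 10` question can be checked without a simulator. This is a small sketch in Python rather than Verilog; `show_or` is a hypothetical helper, and it only illustrates bitwise OR on plain integers, not the MIAOW `next_hungry` logic itself.

```python
def show_or(a: int, b: int, width: int = 4) -> str:
    """Format a bitwise OR of two integers with zero-padded binary operands."""
    result = a | b
    return f"{a:0{width}b} | {b:0{width}b} = {result:0{width}b} ({result})"


if __name__ == "__main__":
    # 1 is 0001 and 10 is 1010; no bits overlap, so OR behaves like addition.
    print(show_or(1, 10))
```

Because 1 (binary 0001) and 10 (binary 1010) share no set bits, the OR equals the sum 2+8+1.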

11:39 <mardikene193> since i am not sure whether i did not still do some mistake in the calculation

11:40 <mardikene193> weird is that when 9 is driven it allready has 1 bit in

11:40 <mardikene193> aaaah well yeah, this is zero bit but decoded differently

11:42 dddddd has joined #lima

12:55 mardikene193 has quit [Read error: No route to host]

12:56 jrmuizel has joined #lima

13:09 jrmuizel has quit [Remote host closed the connection]

13:14 jrmuizel has joined #lima

13:23 <rellla> huh, lowering the abs modifier with ppir_op_ge to sqrt(mul(x,x)) solves another 18 tests :)

13:24 <rellla> *within

13:26 <enunes> nice

13:27 <Tofe> what's "lowering", exactly ? replacing a costly op with less costly ones ?

13:28 jrmuizel has quit [Remote host closed the connection]

13:29 <MoeIcenowy> Tofe: replace an unsupported op with a supported one

13:29 <Tofe> ah ok
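The idea of lowering described above can be sketched on a toy expression tree. This is an illustration only, not the ppir or NIR API: `Op`, `SUPPORTED` and `lower` are hypothetical names, and the set of supported ops is an assumption made for the example. The rewrite used is the `abs` to `sqrt(mul(x, x))` one mentioned earlier in the log.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Op:
    """A node in a toy expression tree: an operation applied to arguments."""
    name: str
    args: Tuple["Expr", ...]


Expr = Union[Op, str, float]

# Ops the imaginary backend can emit directly (assumption for this sketch).
SUPPORTED = {"sqrt", "mul", "add"}


def lower(expr: Expr) -> Expr:
    """Recursively replace unsupported ops with supported equivalents."""
    if not isinstance(expr, Op):
        return expr  # variables and constants need no lowering
    args = tuple(lower(a) for a in expr.args)
    if expr.name == "abs":
        (x,) = args
        return Op("sqrt", (Op("mul", (x, x)),))
    if expr.name not in SUPPORTED:
        raise ValueError(f"no lowering known for {expr.name!r}")
    return Op(expr.name, args)


if __name__ == "__main__":
    print(lower(Op("add", (Op("abs", ("x",)), 1.0))))
```

After the pass, the tree contains only ops from `SUPPORTED`, which is the property a backend relies on.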

13:39 <rellla> enunes: http://imkreisrum.de/piglit/mali450/fed5b60..d6c7fcc-lima-absneg-fix/fixes.html

13:42 <rellla> i have the strong guess, that fs-exp-float for example fails due to accuracy/tolerance issues, as only 1 of the 4 probes fails. maybe i should try to tweak the piglit test a bit for testing...

13:43 mardikene193 has joined #lima

13:47 <mardikene193> I realised that this yeah was not a decoder issue, but nonetheless paired and unpaired numbers, or odd and even.

14:02 Barada has quit [Quit: Barada]

14:09 jrmuizel has joined #lima

14:16 <mardikene193> I would personally not do that this way probably, i consider this as a bug, or maybe someone held a patent with the usual way.

14:32 mardikene193 has quit [Remote host closed the connection]

14:32 raimo has joined #lima

14:42 <raimo> well well, i agree this works too what they do, and is not still considered a bug either saves some resources instead

14:43 <raimo> but i need to confess that i did not see that path in the beginning and i already thought i had everything figured out, so it is a good day over a long period , one new implementation detail studied

14:45 <raimo> i did not do any bigger tests today, only this for some who do not understand bitwise OR aka | in verilog

14:46 <raimo> https://www.edaplayground.com/x/3m6K

14:55 <raimo> i did not spot this even though it is very clever way, cause it does not appear to save ultra lot of resources, just very few, i obviously did not expect it hence

14:56 <raimo> so i elaborate what happens there:

14:57 <raimo> when you have an odd wave transitioning from the even wave the +1 is done partly inside another path of a chip than that of even to odd

14:57 <raimo> what happens is wr_decode_data is ignoring as well as valid_entry.v that type of repeated odd address

14:58 <raimo> but the queues pick them up cause scoreboard will probably give green light to some instruction...

14:59 <raimo> so it is tried to be computed from the table and if it is not there as it should not be in miaow case

15:02 <raimo> valid bits of if it was a transition from 5 to 5 that time, gets a six, but now it zeros out the six instead

15:06 <raimo> now let us elaborate to complication, why the miaow does not have that instruction in the tbl part of the queues!

15:06 <raimo> because it zeros anything on the line while fetching the last instruction it zeros the upcoming ones

15:10 joss193 has joined #lima

15:12 <joss193> anyhow this could be some clever method to dissipate heat more around the chip, whatever really i am looking on is a bit weird to me, but should work indeed

15:12 raimo has quit [Ping timeout: 250 seconds]

15:16 <joss193> so when five no longer goes through it drives six into the table and all things starts again

15:17 <joss193> six forwarded as decode_wr_data or whatever matches the last issued instruction and in this case six is finally fetched too

15:27 <joss193> cheers i need to go now.

15:27 joss193 has quit [Quit: Leaving]

16:23 xdarklight has quit [Quit: ZNC - http://znc.in]

16:26 xdarklight has joined #lima

16:44 <rellla> cwabbott: i implemented a lowering, where abs modifier is lowered to sqrt(mul(x, x)) and could get some more piglit tests passing.

16:45 <rellla> with fs-exp-float i ran into a fail, so i tweaked the test a bit to see, what values are on the way. this is the result:

16:45 <rellla> https://pastebin.com/raw/DuVGui7E

16:45 <anarsoul> rellla: that's expensive lowering

16:46 <rellla> anarsoul: maybe, but it works and it's what the blob does.

16:46 <anarsoul> that's crazy

16:48 <rellla> anyway, it seems, that sth is wrong with the values when they are computed in the shader...

16:49 <anarsoul> rellla: max(a, -a) doesn't work?

16:50 <rellla> for example, in the first example (expected / 10.0) is reported as 0.011765. shouldn't it be 0.013533 ?

16:50 <rellla> what is the reason for this difference?
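One possible explanation for the two numbers quoted above, offered as a guess rather than a diagnosis: if the value is written to an 8-bit normalised colour channel and read back, it can only be a multiple of 1/255. The sketch below assumes exactly that; `quantize_unorm8` is a hypothetical helper and nothing here is taken from the piglit test or the pastebin output.

```python
def quantize_unorm8(value: float) -> float:
    """Round a [0, 1] value to the nearest step of an 8-bit normalised channel."""
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * 255) / 255


if __name__ == "__main__":
    expected = 0.013533  # the value rellla expected to see
    stored = quantize_unorm8(expected)
    print(f"{expected:.6f} -> step {round(expected * 255)} -> {stored:.6f}")
```

Under that assumption, 0.013533 falls between steps 3 and 4 of 255 and rounds down to 3/255, which prints as 0.011765, the reported value.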

16:51 <rellla> anarsoul: i haven't tried other lowerings, but will do before i do a PR. a simple mov would be the one i prefer, but iirc that didn't work.

16:52 <rellla> and yes, the sqrt/mul one is an expensive one :)
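The two candidate lowerings for abs can be compared numerically. This is a rough sketch under a loud assumption: every intermediate is rounded to IEEE 754 half precision, which may not match how the Mali PP actually evaluates these ops. `to_fp16`, `abs_via_sqrt_mul` and `abs_via_max_neg` are hypothetical names.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to half precision; values too large for fp16 become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def abs_via_sqrt_mul(x: float) -> float:
    """abs(x) lowered to sqrt(mul(x, x)), rounding each step to fp16."""
    x = to_fp16(x)
    squared = to_fp16(x * x)
    return to_fp16(math.sqrt(squared))


def abs_via_max_neg(x: float) -> float:
    """abs(x) lowered to max(x, -x); negation and max are exact in fp16."""
    x = to_fp16(x)
    return max(x, -x)


if __name__ == "__main__":
    for value in (-3.0, -300.0, 1e-4):
        print(value, abs_via_sqrt_mul(value), abs_via_max_neg(value))
```

With this model, the sqrt/mul form agrees with max/neg for moderate inputs, but the squared intermediate overflows for large magnitudes and underflows to zero for small ones, so the cost is not the only difference between the two.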

16:53 drod has joined #lima

16:53 <rellla> i have to go now, but would be glad, if so can give me a hint, where i should have a look at ...

16:53 <rellla> in case of the fs-exp-float test

16:54 <rellla> anarsoul: btw, this is the extensive code https://gitlab.freedesktop.org/rellla/mesa/commits/lima-abs-fix :)

16:57 <anarsoul> rellla: do it in nir

17:02 <anarsoul> rellla: basically if you can do any lowering in nir - do it in nir. Doing it in ppir is error-prone

17:29 jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]

17:34 jernej has joined #lima

18:04 armessia has joined #lima

18:23 <anarsoul> rellla: enunes: could you review https://gitlab.freedesktop.org/mesa/mesa/merge_requests/2094 ?

18:23 <anarsoul> it's small one

22:05 enunes has quit [Ping timeout: 276 seconds]

22:11 cwabbott has quit [Remote host closed the connection]

22:11 cwabbott has joined #lima

22:11 jernej has quit [Ping timeout: 250 seconds]

22:18 armessia has quit [Quit: Leaving]

22:20 jrmuizel has quit [Remote host closed the connection]

22:25 drod has quit [Remote host closed the connection]

23:23 megi has quit [Ping timeout: 268 seconds]

23:50 jrmuizel has joined #lima