ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at and - Contact ARM for binary driver support!
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
mardikene193 has joined #lima
<mardikene193> about the many-to-many scheduler, it is a compromise between many-to-one and 1:1. As it is also possible to hijack syscalls, this comes with a similar security risk; when the upcall system call is known, the security issue is slightly bigger, but not too big of an issue compared to 1:1.
jrmuizel_ has joined #lima
<mardikene193> For an example all the POSIX layer on windows nt the emulated POSIX API would work on system privileges, which has kernel execution support via some specific calls, on.
<mardikene193> on./on linux similar thing would happen and on BSD/XNU
jrmuizel has quit [Ping timeout: 276 seconds]
<MoeIcenowy> anarsoul: not a fix
<anarsoul> :(
<anarsoul> btw, gnome is not really friendly for tiling GPUs
<anarsoul> s/for/to
<anarsoul> it seems to use a huge fbo and renders only into part of it
<mardikene193> the win is already around 800-fold, because a context switch from userspace done via upcalls is inherently vastly faster.
<mardikene193> on CPUs and when you combine that with coresight or pipelined jtag boundary scan hw api
<mardikene193> things go into mad heights for performance.
<mardikene193> I also developed this twos complement instruction stream compression, which as mentioned i call binary stream instruction checksumming
<mardikene193> that saves all the instruction cache bandwidth
<anarsoul> MoeIcenowy: btw, unrelated issue, but this part from cogl: "cogl_color_out = cogl_color_in * texture2D (tex, vec2 (cogl_tex_coord_in[0].xy));" where cogl_tex_coord_in is vec4 array of 1 element will result in poor image quality on lima
<mardikene193> VLIW is great because such bitwise operations on top of twos complement subtract can be taken as entirely parallel operations, cause it comes with more alus since scoreboarding in hw is expensive, 16 not parallel ops can be taken on terascale and on mali gpus they have sort of weird specs
<MoeIcenowy> anarsoul: oops
<anarsoul> as well as later calculations with coords
<mardikene193> based on the docs, mali 400 vliws have a pipeline of 128 stages per core
<MoeIcenowy> why it will result in poor image?
<anarsoul> MoeIcenowy: because varying is not passed directly into sampler
<anarsoul> it will be loaded into register
<MoeIcenowy> the register is only 16-bit?
<anarsoul> it's fp16
<anarsoul> so it's only 10 bits of precision
<MoeIcenowy> oh
<anarsoul> 5 for mantissa, 1 for sign
<mardikene193> each bundle is 4words of instructions, i think they count simd width each stage of simd separately also as stage
<anarsoul> it's not enough to sample high-res textures accurately
<mardikene193> so 128/4/4 is 16 there as well
<MoeIcenowy> so directly using varying is the only way to keep 32-bit float number in PP?
<anarsoul> s/mantissa/exponent
<mardikene193> you can take 16bundles per core in parallel but 4*16 when there are no deps involved
<anarsoul> MoeIcenowy: yes, but only for samplers. Anything else uses fp16.
<anarsoul> I guess cogl authors couldn't imagine that some GPUs will be fp16-only :)
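A rough way to see why fp16 coordinates hurt sampling is to round normalised texel-centre coordinates to half precision and check whether they still land in the texel they started from. This is only a sketch: it assumes the PP register behaves like IEEE 754 half precision (1 sign bit, 5 exponent bits, 10 mantissa bits), which the log does not confirm, and the helper names `fp16_roundtrip` and `texels_hit` and the texture widths are illustrative, not taken from cogl or mesa.

```python
import struct


def fp16_roundtrip(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def texels_hit(width: int) -> int:
    """Count texel centres that still map to their own texel after fp16 rounding."""
    hits = 0
    for i in range(width):
        u = (i + 0.5) / width  # normalised coordinate of the centre of texel i
        if int(fp16_roundtrip(u) * width) == i:
            hits += 1
    return hits


# Illustrative widths only; prints how many texel centres survive for each.
for width in (256, 1024, 4096):
    print(width, texels_hit(width))
```

In the range [0.5, 1) adjacent half-precision values are 2^-11 apart, so once a texture is a few thousand texels wide the coordinate can no longer address every texel, let alone sub-texel positions for filtering.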
yuq825 has joined #lima
<mardikene193> when you think a little, then you should see why VLIW is considered the best architecture of all times.
<mardikene193> elbrus CPU has been done so that 32words per CPU core, and there is a coupled array prefetch buffer to the instruction windows, you can grab instruction even if it was dependent out from any position
<mardikene193> without causing excessive pipeline stalls due to bad ordering
<anarsoul> MoeIcenowy: unless fragment shader uses spilling the only place PP can read from is texture 0
<anarsoul> that has weird size of 800x1104, but I think it's OK
<MoeIcenowy> anarsoul: do you think GNOME Shell is very weird when analyzing the apitrace? ;-)
<anarsoul> that's definitely not the best piece of software
<MoeIcenowy> but also not the most weird?
<anarsoul> I've been in software for quite a while. I saw much weirder things :)
<anarsoul> it's quite surprising that it mixes GLSL shaders with ARB
<MoeIcenowy> anarsoul: BTW what's ARB shader?
<MoeIcenowy> I have 0 knowledge about it
<MoeIcenowy> some lowlevel thing?
<mardikene193> since the branches according to the patent can be done in the queues, as well as being async in nature, it is different on CPUs how branches are handled as i spoke
<MoeIcenowy> anarsoul: why is there only fragment in this situation?
<mardikene193> however the concept is in generalisation the same there, everything is a branch
<anarsoul> no idea
<mardikene193> so yes, when the bundle is from 1-32 you can call 32 branches for their target more precisely as every instruction would be a short branch, and elbrus cpu will do the reordering at runtime
<mardikene193> it will run first the non-dependent instructions until the bundle is over and everything falls into place as it should
<MoeIcenowy> finally the eMMC on my PineTab bombed totally
<anarsoul> :(
<anarsoul> use NFS?
<bshah|matrix> O.o
<MoeIcenowy> oh? I can still read it with the USB reader?
<MoeIcenowy> (but it totally doesn't boot now on PineTab itself)
<MoeIcenowy> maybe the connector is loose
<mardikene193> is that in nature more expensive using bypass registers and array prefetch buffers and loop counters than that of scoreboard, yeah well definitely not
<mardikene193> it is just a fifo structure everything from branches are played until they are done
<mardikene193> and they come in flavors of 512 to 1024 queue entries, plenty of running scheduling, it was a surprising read that sometimes in history also russians are pretty smart designers, and pretty early on
<anarsoul> some IPA won't hurt :)
jrmuizel_ has quit [Remote host closed the connection]
<mardikene193> this is not some rant, I have no elbrus chip, it may be something similar to what transmeta tried to achieve and tell people, though the latter, as i understand, no longer exists as a company
<mardikene193> I see there, as some russian wanted to connect me how to use different processors in oil refining industry
<mardikene193> pretty easy task, even with their own chips
<mardikene193> they have more than half fewer power consumption compared to intel chips on full pipeline allready cause of doing things correctly, i say it will further drastically get lowered on IB based computation on nested loops and sw pipelining triggered in
<mardikene193> I say those designers were very sanely spot on
<mardikene193> I assume their hw reordering is faster than what i could come up with on radeon VLIWs hehee, it is a superior chip to AMD vliws, 2048 queue entries against 2560 though, radeon does reorder inside a driver like i told, it has no unstalling async mechanism in hw
<mardikene193> but it instead has a driver, and possibility to feed zero operands and do it during runtime in hw with sw controlled
<mardikene193> as i mentioned this is due to difference in accelerators like GPUs and CPUs instruction buffer structures i.e instruction queues, issue queues or instruction windows all are referring to the same thing
<mardikene193> if you wanted to do the same thing on stanford CGRA like elbrus does , or at least similarly in functionality that is easily possible
<mardikene193> however GPUs would have rework the branch module and issue queue modules to do the similar
<mardikene193> I do not advise doing so my own, instead you do the scoreboarding in the sw with using twos complement procedures, which are only two gate delays and skipping with zero operands also two gate delays
<mardikene193> cause GPUs have something that can be used for redirecting operands which CPUs normally like X86 CISC and ARMs RISC do not have
<mardikene193> as mentioned throughout my talks this is register indirect addressing
megi has quit [Ping timeout: 276 seconds]
<mardikene193> and about different events that you can not do (that includes programming), i would advise/recommend to keep your mouths shut as well as fingers off the keyboard when stars who can do that start to talk ok?
<mardikene193> AS I TOLD YOU MORE THAN SEVERAL times allready, the in queue rendering/computation triggers column change on two instructions in sequence issued , i.e not graduated just issued consequently back to back from , consequent wfids, not 1 and 7 but either 1 and 2
<mardikene193> 4 and 5
<mardikene193> not 8 and 10 etc.
<mardikene193> otherwise it will fetch the current wfid row or line and stay there
<mardikene193> because the sum of valid_entries of from previous iteration and current instruction in scbd_feeder.v for hungry bits , this line puts +1 to the chip
<mardikene193> when the valid_entry.v captures those bits, the first instruction is no longer driven, but second is captured to instr_info_table.v only
<mardikene193> so it needs to capture the +1 only, and it can do so by running the next instruction also while the +1 is broadcasted, that way stuff gets driven
<mardikene193> i have fucking told you this more than one years time allready
<mardikene193> when you replace the values in next_hungry line, vacant | hungry & ~40{feed_valid} & previous wfid , what the fuck do you get than?
<mardikene193> I do not reread or structure anything, since I KNOW THAT!
<mardikene193> this will add +1
<mardikene193> you are fucking morans
<mardikene193> I also gave the mathematical expression which is absolutely correct
<mardikene193> why on full length pipeline the queues do not switch columns, i allready told that, they have automatic instance indexes
<mardikene193> it is round robin scheduling, all queue entries will be filled in
<mardikene193> it works in a way that , fetch counter always wraps around as the comments suggest, it fetches to wr_decode_data buffers 40*40 on GCN as understood
<mardikene193> fetch wfid of one will be put to the first column1 line 1
<mardikene193> second appearance of 1 will be put to column 1 line 2 etc.
<mardikene193> FUCKING HELL!
<mardikene193> how are you so nuts to blame me about anything when you have fucking head filled with only water, no synapses no neurons
<mardikene193> the procedure continues until halt aka endpgm is met, which kills the fetch
<mardikene193> now if the chip is not being reset with a pulse or the memory underlying in case of graphics
<mardikene193> does get memory errors the chip would start doing in queue computation
<mardikene193> during the reset pulse not sent to the chip, it starts to send 6{1'bX} forward as wf_id
<mardikene193> which is lifted to 111111 always accepting wave
<mardikene193> if the interconnect delays the rst signal , like when it is not captured , everything will be rendered/computed from shader engine queues
<mardikene193> however when it sends the reset, and restarts the program counter hence things will go in full length of the pipeline
<mardikene193> and how you can imagine the stuff, from the queues everything gets issued in parallel upto 16 scheduled for execution if all they succeed the scoreboard test
<mardikene193> and sanity wise, it is absolutely correct to put things like that into queues, they are fetched from wr_decode_data stored buffers in case of fetch itself is sleeping, and yes all the column of 40 is being fetched
<mardikene193> and stored to the line out buffer
<mardikene193> consistency wise it is just playing with words
<mardikene193> maybe i was inaccurate to state that column is fetched to the row, how else can i say that? the line is fetched to the line perhaps one after another to form a line out
dddddd has quit [Remote host closed the connection]
<mardikene193> my code does not care of variable latencies
<mardikene193> cause you see everything is a branch
<mardikene193> all you have to do is to pull in all the queue entries initially with zero operand values
_whitelogger has joined #lima
Barada has joined #lima
niceplace has quit [Read error: Connection reset by peer]
niceplace has joined #lima
raimo has joined #lima
mardikene193 has quit [Read error: Connection reset by peer]
raimo has quit [Read error: Connection reset by peer]
niceplace has quit [Read error: Connection reset by peer]
niceplace has joined #lima
mardikene193 has joined #lima
<mardikene193> variable length latencies are the ones that generate carry or borrows, hence the anandtech link is not very correct about deterministic delays, well it partly is cause the framerate is majorly capped per instruction, but in the queues they are not
<mardikene193> neither SIMD nor VLIW offer guarantees that the same shader will take the same time to finish in queued mode on two different iterations
<mardikene193> similarly the variable latency instruction is memory load as you have noticed, depending whether it hits the cache or not
<mardikene193> And on VLIW the best optimal way is reordered during the runtime, it can not do any better than how queues put things to natural order, hence i tell that mannerovs scheduling does not make difference on multiiteration shaders
yuq8251 has joined #lima
yuq825 has quit [Ping timeout: 276 seconds]
UnivrslSuprBox has quit [Ping timeout: 245 seconds]
kaspter has quit [Remote host closed the connection]
kaspter has joined #lima
jailbox has quit [Ping timeout: 245 seconds]
eightdot has quit [Ping timeout: 245 seconds]
UnivrslSuprBox has joined #lima
jailbox has joined #lima
eightdot has joined #lima
Barada has quit [Quit: Barada]
Barada has joined #lima
<mardikene193> I am not feeling well either when i am pissed off, i think miaow is reasonably over average tricky, people should not behave the way they do when i am talking the truth regardless
yuq8251 has quit [Quit: Leaving.]
<mardikene193> most complex to understand could be that my simulator shows fetch arbiter on full pipeline always trailing by one, the bitwise operation flips the bits with NOT after AND of the previous simd arbiters wfid, this results in bigger value kept
<mardikene193> since it is the line of continuous assignment, it will be driven again
<mardikene193> shortly after the bigger value is kept
jbrown has joined #lima
megi has joined #lima
<mardikene193> kospadin anarsoul then when you cancel say 8 with 8 it will drive zero, simd arbiter arbitrates to say 10 as one wishes, fetch arbitrates to time in cancels out 0, all 1s will be evaluated in the last part of the next_hungry hence
<mardikene193> how much will 1 | 10 next time equal?
<mardikene193> 2+8+1 is yeah 11, but lets do that in simulator
<mardikene193> since i am not sure whether i did not still do some mistake in the calculation
<mardikene193> weird is that when 9 is driven it allready has 1 bit in
<mardikene193> aaaah well yeah, this is zero bit but decoded differently
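The bitwise OR arithmetic in the messages above can be checked directly. This is plain integer arithmetic only; it says nothing about how the MIAOW Verilog actually drives these values, and the variable names are illustrative.

```python
a, b = 1, 10
result = a | b  # 0b0001 | 0b1010

# Decompose the result into the powers of two that are set.
set_bits = [1 << i for i in range(result.bit_length()) if (result >> i) & 1]

print(format(result, "04b"), result, set_bits)

# OR-ing 1 into 9 changes nothing, because bit 0 of 9 is already set.
print(1 | 9)
```

So 1 | 10 is 8 + 2 + 1 = 11, while 1 | 9 stays 9.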
dddddd has joined #lima
mardikene193 has quit [Read error: No route to host]
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
<rellla> huh, lowering the abs modifier with ppir_op_ge to sqrt(mul(x,x)) solves another 18 tests :)
<rellla> *within
<enunes> nice
<Tofe> what's "lowering", exactly ? replacing a costly op with less costly ones ?
jrmuizel has quit [Remote host closed the connection]
<MoeIcenowy> Tofe: replace an unsupported op with a supported one
<Tofe> ah ok
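As a toy illustration of lowering in the sense described above, the sketch below rewrites an unsupported `abs` node in a small expression tree into operations assumed to be supported. The tuple-based IR and the function `lower_abs` are hypothetical and much simpler than NIR or ppir, where abs is a source modifier rather than a separate op.

```python
def lower_abs(expr):
    """Recursively replace ('abs', x) with ('max', x, ('neg', x))."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a variable name or a constant
    op, *args = expr
    args = [lower_abs(a) for a in args]  # lower the operands first
    if op == "abs":
        (x,) = args
        return ("max", x, ("neg", x))
    return (op, *args)


# ge(abs(x), y) becomes ge(max(x, neg(x)), y)
print(lower_abs(("ge", ("abs", "x"), "y")))
```

The pass leaves every other node untouched, so it can run before the backend without the backend ever seeing the unsupported op.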
<rellla> i have the strong guess, that fs-exp-float for example fails due to accuracy/tolerance issues, as only 1 of the 4 probes fails. maybe i should try to tweak the piglit test a bit for testing...
mardikene193 has joined #lima
<mardikene193> I realised that this yeah was not a decoder's issue, but nonetheless one of paired and unpaired numbers, or odd and even.
Barada has quit [Quit: Barada]
jrmuizel has joined #lima
<mardikene193> I would personally not do that this way probably, i consider this as a bug, or maybe someone held a patent with the usual way.
mardikene193 has quit [Remote host closed the connection]
raimo has joined #lima
<raimo> well well, i agree this works too what they do, and is not still considered a bug either saves some resources instead
<raimo> but i need to confess that i did not see that path in the beginning and i allready thought i had everything figured out, so it is a good day over a long period , one new implementation detail studied
<raimo> i did not do any bigger tests today, only this for some who do not understand bitwise OR aka | in verilog
<raimo> i did not spot this even though it is very clever way, cause it does not appear to save ultra lot of resources, just very few, i obviously did not expect it hence
<raimo> so i elaborate what happens there:
<raimo> when you have an odd wave transitioning from the even wave the +1 is done partly inside another path of a chip than that of even to odd
<raimo> what happens is wr_decode_data is ignoring as well as valid_entry.v that type of repeated odd address
<raimo> but the queues pick them up cause scoreboard will probably give green light to some instruction...
<raimo> so it is tried to be computed from the table and if it is not there as it should not be in miaow case
<raimo> valid bits of if it was a transition from 5 to 5 that time, gets a six, but now it zeros out the six instead
<raimo> now let us elaborate to complication, why the miaow does not have that instruction in the tbl part of the queues!
<raimo> because it zeros anything on the line, while fetching the last instruction it zeros the upcoming ones
joss193 has joined #lima
<joss193> anyhow this could be some clever method to dissipate heat more around the chip, whatever really i am looking on is a bit weird to me, but should work indeed
raimo has quit [Ping timeout: 250 seconds]
<joss193> so when five no longer goes through it drives six into the table and all things starts again
<joss193> six forwarded as decode_wr_data or whatever matches the last issued instruction and in this case six is finally fetched too
<joss193> cheers i need to go now.
joss193 has quit [Quit: Leaving]
xdarklight has quit [Quit: ZNC -]
xdarklight has joined #lima
<rellla> cwabbott: i implemented a lowering, where abs modifier is lowered to sqrt(mul(x, x)) and could get some more piglit tests passing.
<rellla> with fs-exp-float i ran into a fail, so i tweaked the test a bit to see, what values are on the way. this is the result:
<anarsoul> rellla: that's expensive lowering
<rellla> anarsoul: maybe, but it works and it's what the blob does.
<anarsoul> that's crazy
<rellla> anyway, it seems, that sth is wrong with the values when they are computed in the shader...
<anarsoul> rellla: max(a, -a) doesn't work?
<rellla> for example, in the first example (expected / 10.0) is reported as 0.011765. shouldn't it be 0.013533 ?
<rellla> what is the reason for this difference?
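One possible explanation, offered only as arithmetic and not confirmed anywhere in this log: 0.011765 is exactly 3/255 to six decimals, which is what 0.013533 becomes if it is written to an 8-bit normalised colour channel and read back. The sketch assumes the probe reads such a channel; `quantize_unorm8` is an illustrative helper, not a piglit or mesa function.

```python
def quantize_unorm8(value: float) -> float:
    """Store value in an 8-bit normalised channel and read it back."""
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * 255) / 255


expected = 0.013533  # value quoted in the log
print(f"{quantize_unorm8(expected):.6f}")  # prints 0.011765
```

If that assumption holds, the difference of about 0.0018 is smaller than half a quantisation step (1/255 is roughly 0.0039), so it would come from the framebuffer format rather than from the shader.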
<rellla> anarsoul: i haven't tried other lowerings, but will do before i do a PR. a simple mov would be the one i prefer, but iirc that didn't work.
<rellla> and yes, the sqrt/mul one is an expensive one :)
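Besides cost, the two candidate lowerings differ in range if the intermediate x*x is held in fp16. The sketch below emulates half precision by rounding after every operation. It assumes IEEE 754 half-precision behaviour and that the product really is rounded to fp16, neither of which is confirmed for the PP in this log; `to_fp16`, `abs_via_sqrt_mul`, and `abs_via_max` are illustrative names.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to half precision; finite values that are too large become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the fp16 range; model them as +/-inf.
        return math.copysign(math.inf, x)


def abs_via_sqrt_mul(x: float) -> float:
    """abs(x) lowered to sqrt(mul(x, x)), rounding every step to fp16."""
    h = to_fp16(x)
    return to_fp16(math.sqrt(to_fp16(h * h)))


def abs_via_max(x: float) -> float:
    """abs(x) lowered to max(x, -x), which needs no intermediate rounding."""
    h = to_fp16(x)
    return max(h, -h)


for x in (-3.0, -300.0, 1e-4):
    print(x, abs_via_sqrt_mul(x), abs_via_max(x))
```

Under these assumptions the sqrt/mul form overflows to infinity once |x| exceeds roughly 256 and flushes small inputs to zero, while max(x, -x) is exact over the whole fp16 range.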
drod has joined #lima
<rellla> i have to go now, but would be glad, if someone can give me a hint, where i should have a look at ...
<rellla> in case of the fs-exp-float test
<rellla> anarsoul: btw, this is the extensive code :)
<anarsoul> rellla: do it in nir
<anarsoul> rellla: basically if you can do any lowering in nir - do it in nir. Doing it in ppir is error-prone
jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC:]
jernej has joined #lima
armessia has joined #lima
<anarsoul> rellla: enunes: could you review ?
<anarsoul> it's small one
enunes has quit [Ping timeout: 276 seconds]
cwabbott has quit [Remote host closed the connection]
cwabbott has joined #lima
jernej has quit [Ping timeout: 250 seconds]
armessia has quit [Quit: Leaving]
jrmuizel has quit [Remote host closed the connection]
drod has quit [Remote host closed the connection]
megi has quit [Ping timeout: 268 seconds]
jrmuizel has joined #lima