<rellla>
i did a manual disassembly of the binaries and compared both, lima's and the offline compiler's. i didn't find any unknown bits set. the only difference which *could* have an effect is that the offline compiler combines the dFdx and dFdy instructions...
<rellla>
though i only checked the control words for now...
<rellla>
but cwabbott said, that 2 instructions should not be the problem...
<enunes>
rellla: again I don't know if it makes sense... but can you do only one of fddx or fddy at a time rather than both, to make it simpler?
Elpaulo has joined #lima
<rellla>
uahah
<rellla>
passing 4|6 now
<rellla>
To get dFdx(-x, x) as described in "Note: dFdx(x) is actually implemented as dFdx(-x, x) (same for dFdy)" we need to !negate alu->src[1] and not alu->src[0]
jbrown has quit [Read error: Connection reset by peer]
<rellla>
cwabbott: fyi, as soon as i re-enable nir_lower_wpos_ytransform, all tests fail again
jbrown has joined #lima
Wizzup has quit [Ping timeout: 248 seconds]
Wizzup has joined #lima
<enunes>
rellla: when I had issues with precision I tried to run the same tests with the blob, and they also failed, so I noted it in the MR and submitted it anyway, as it is not lima's fault
<mardinator_>
I remember now why i first described things differently. Most other chips besides miaow do bring in more instructions at a time via the fetch module.
<mardinator_>
then yeah if you go out of order, to be able to change the queue column, two instructions on the line in question should be scheduled to switch into the other column
jrmuizel has quit [Remote host closed the connection]
<mardinator_>
but this can be demonstrated with a little simulation of that module, giving values to the logic of this module, and it is seen that single values get eliminated while duplets won't
<mardinator_>
it's like a two-level recursive procedure as it seems, when only 4 is scheduled, it will eliminate 4 in the upper long bitwise line, when 4 and 5 are scheduled by simd vacant=00001100..., it will eliminate 4 and then 5 will be passed with valid and f_decode_wfid=4
<mardinator_>
err f_decode_wfid=5
<mardinator_>
pretty sure that verilog in the spec schedules writes or rewrites before subsequent reads, worth to confirm
<mardinator_>
I do not think that anyone is in particular a full idiot in this crew, so by far am I not, it is just that you have been told to stop violating me already by many people, you just pee on yourself continuing to do that.
<mardinator_>
skills generally develop only by doing practice sessions, when you instead go violating someone that is not a good model, and you are not able to capitalize on your possible talent this way.
<mardinator_>
those ideas possibly only with minor drawbacks or occasional faults in case of me, they accumulate not via natural intelligence, but because i spend time every day in practicing stuff, i am not even a gamer but i consider it better than drinking all the time with my fucked up life
jrmuizel has joined #lima
chewitt has quit [Quit: Zzz..]
libv_ has joined #lima
libv has quit [Disconnected by services]
libv_ is now known as libv
<mardinator_>
maybe there are some patents involved for this, maybe fd.o guys are afraid of something, however as libv said obsess compulsive and excess power demonstration instead of ignoring or working instead, is just never something useful, the likes of stalking someone absolutely wasted time, likes of conspiring, there are much better ways to spend time more wisely.
<cwabbott>
rellla: sorry for that! We had the source order backwards in the original lima project, which made a number of things more awkward, but the mesa driver + disassembler has it correct
chewitt has joined #lima
chewitt has quit [Client Quit]
<cwabbott>
and yeah, dFdx and dFdy are not commutative -- they're basically an add where one of the sources comes from the pixel's horizontal or vertical neighbor
<cwabbott>
the negate turns the sum into a difference
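A minimal sketch of the point above: the derivative op modelled as an add of two sources, where a negate on one source turns the sum into a difference. The function name `derivative_op` and the choice of which source comes from the neighbor are assumptions made for illustration only; the log does not say which operand the hardware reads from the neighboring pixel.

```python
def derivative_op(own, neighbor, negate_src0=False, negate_src1=False):
    """Toy model of a derivative instruction as an add with optional negates.

    Assumption (hypothetical, not confirmed by the log): src0 is the value
    from the horizontal neighbor, src1 is the value from the pixel itself.
    """
    src0 = -neighbor if negate_src0 else neighbor
    src1 = -own if negate_src1 else own
    return src0 + src1


# f at pixel x and at its horizontal neighbor x + 1
own, neighbor = 1.0, 4.0

plain = derivative_op(own, neighbor)                    # no negate: just a sum
good = derivative_op(own, neighbor, negate_src1=True)   # neighbor - own
bad = derivative_op(own, neighbor, negate_src0=True)    # own - neighbor

print(plain, good, bad)
```

Under this assumed operand order, negating the wrong source still yields a difference, but with the opposite sign, which is why the op is not commutative.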
<mardinator_>
yeah i remember the derivative showing the gain with respect to some time interval
<mardinator_>
i was not able to read the code in greater detail, but looks like all know that it is done by summing up the lanes
<mardinator_>
interlane communication indeed does this type of thing the fastest, it can also be emulated, but this is quite bad performance then
chewitt has joined #lima
Barada has quit [Quit: Barada]
<mardinator_>
cwabbott: looks almost ok, but you can also take the derivative with respect to the first coordinate in a four lane or 64 lane or whatever setup, or can't you? not only from the last element pixel
<mardinator_>
since division is with higher latency the arithmetics are add and subtract
<mardinator_>
on gcn this appears to be done on the image unit, the compiler puts the indices properly and makes the arithmetic based on them
<mardinator_>
imo just the summing up needs to be done, but just the in respect sum needs to be discarded, like it was subtracted from the result
<rellla>
cwabbott: you do not have to apologize. any opinion on the nir_lower_wpos_ytransform?
<rellla>
i still do not understand, what this is needed for anyway :)
<cwabbott>
rellla: iirc it's for window system buffers, which for historical reasons are rendered upside-down, and for whatever reason the way gallium flips the framebuffer means that you need to flip y derivatives
<cwabbott>
that pass is part of gallium hiding the flipping from the driver so you don't have to worry about it
<cwabbott>
I think mali has a different way of flipping rendering that doesn't require flipping derivatives
<cwabbott>
hence why the blob doesn't do it
<cwabbott>
so avoiding that pass would require (a) reverse-engineering how the blob flips rendering and (b) adding, and setting, a cap + driver interface that says "I'll flip rendering myself" and wiring it up in lima
<rellla>
cwabbott: ok thanks, but when this doesn't cause a problem, i wonder why it breaks my tests again. the fddy ones for example...
<cwabbott>
no idea on that one :)
<mardinator_>
I would do derivatives with a sampler and clamping either with mirrored repeat or based on the virtual address
<cwabbott>
go through the assembly, take a look at the uniform values submitted by the driver, one of them should be wrong
<mardinator_>
so you accumulate the results which are needed to two locations and later sum them together
<cwabbott>
unless you're rendering to a system window it should be the same as if the pass didn't exist
<cwabbott>
(I mean, as long as you're rendering to an FBO and not rendering to the implicit window-system window)
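A rough sketch of why flipped rendering needs flipped y derivatives, and why a wrong uniform would break fddy even for an FBO. The names `fddy`, `lowered_fddy` and `y_scale` are hypothetical stand-ins for illustration; they are not taken from the pass or from lima.

```python
def fddy(column, y):
    """Forward difference along y, in the order rows are stored."""
    return column[y + 1] - column[y]


def lowered_fddy(column, y, y_scale):
    """Derivative after a (hypothetical) lowering that multiplies by a
    driver-supplied uniform: +1.0 when not flipped, -1.0 when flipped."""
    return fddy(column, y) * y_scale


logical = [0.0, 1.0, 3.0]   # f(y) for y = 0, 1, 2 in GL orientation
flipped = logical[::-1]     # same data stored upside-down

# FBO: no flip, so the scale is +1 and the pass changes nothing.
fbo_result = lowered_fddy(logical, 0, 1.0)

# Window-system buffer: logical rows 0 and 1 are stored as rows 2 and 1,
# so the raw difference has the wrong sign and the scale of -1 repairs it.
window_result = lowered_fddy(flipped, 1, -1.0)

# If the driver submitted the wrong scale for an FBO, fddy would flip sign.
broken_result = lowered_fddy(logical, 0, -1.0)

print(fbo_result, window_result, broken_result)
```

In this toy model the FBO and window cases agree only when the uniform matches the actual orientation, which fits the suggestion to check the uniform values the driver submits.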
mardinator_ has left #lima ["Leaving"]
mardinator_ has joined #lima
<mardinator_>
all this mentioned scbd_feeder.v does is it detects when simd arbiter schedules two instructions in program order, in other words the equivalent in that case, is last simd scheduled wfid +1, only difference is that you have to schedule the next one too for it to work to switch the column there, otherwise single scheduled instruction gets cleaved
guillaume_g has left #lima ["Konversation terminated!"]
<mardinator_>
if you won't, i.e. like when the next one is dependent, valid_entry gets a zero, the wavefront turns off the vacant right away cause issued_wfid goes in but issued_valid did not due to the dependency, hence valid_entry gets zero, and scbd_feeder.v eliminates the last scheduled instruction, but without giving any of the f_decode_valid signals forward
<mardinator_>
from there on it will try to schedule the subsequent instructions with giving plus 1 to the last simd wfid
<mardinator_>
if none of them schedule it will get X from the round_robin.v from fetch module in other words on a complete stall it switches in the end
gtucker has quit [Ping timeout: 252 seconds]
tomeu has quit [Ping timeout: 252 seconds]
gtucker has joined #lima
xexaxo1 has quit [Ping timeout: 252 seconds]
xexaxo1 has joined #lima
tomeu has joined #lima
<mardinator_>
so in case of round robin fetch it will go back to the first wavefront, in case of greedy-then-oldest, it will take the next one in the priority list
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
<mardinator_>
hence on the full pipeline you can never get a freeze with either round-robin or greedy-then-oldest, on the short pipeline you can get a freeze on both if you do not do things properly, but it is easier to get the freeze with round-robin
<mardinator_>
there on fast pipeline
<mardinator_>
if you manage to look into the code, you may notice about my talks that we are not talking about a troll here, but nonetheless a complete expert
drod has joined #lima
mardinator_ has quit [Quit: Leaving]
mardinator_ has joined #lima
<mardinator_>
it does not particularly matter if there is an opencl based scheduling abstraction available, though chips that do not even have OpenCL EP are very rare, powervr535 is one i know though! It also does not matter which type of scheduling is used, both will work but round robin with branching is faster
<mardinator_>
you can do some type of mix too on GCN since priorities can be altered in shader, however there is not much point, since round robin the default will do nicely
<mardinator_>
that was about embedded world, but desktop world has lots of desktop gpus that did not have opencl but were programmed entirely incorrectly.
<mardinator_>
not only under mesa, since mesa hackers use reverse engineered proprietary methods, though don't forget NVIDIA branched their OpenGL from mesa a long time ago, Brian Paul's stack at that time, but it is the same case for proprietary stacks, they implemented the driver in a wrong way.
<mardinator_>
same goes for all kernels in CPU world, the schedulers are incorrectly programmed not taking advantage of sw based tomasulo derivatives
<mardinator_>
in other words, maybe the hw is scripted, designers did their work properly and there was not much chance to avoid doing it correctly either, but sw developers haven't yet taken all under control
<mardinator_>
properly
jrmuizel has joined #lima
<mardinator_>
Little rant over this: recently i discovered vmware publishing their verilog simulator on github, the company who acquired tungsten graphics, those guys coded like real men under pressure but there is room for very large enhancements.
<mardinator_>
and i dunno maybe some of them are dealing with hw those days, it seems to be a great way of finally doing all correctly, when knowing hw you also know how to run the very last bit.
<mardinator_>
Maybe there are risks in making such code available for general public, but i really in some areas will practice further with being on those directions, cause in the end i need money on my bank account too.
<mardinator_>
no one lets you scam in a way that cause mart spammed the channel we utterly failed cause of the distraction, one in the past was talking how my energy waves from distance distracted him, so he did not tolerate me living too on my own.
<mardinator_>
you receive very critical staring and views when trying to do such thing, you only think you pull utter crap and hope me to die right, after scoring ten thousand bans are you seeing that someone succeeded in doing it?
<mardinator_>
so i do not have much more to say too, every time i have been attacked there are interfering people, what looks to be one sided fight for nutters, isn't at all like this, conflict has two sides, and there are supporters who adore me too.
<mardinator_>
bye
mardinator_ has quit [Quit: Leaving]
drod has quit [Read error: Connection reset by peer]
drod has joined #lima
jbrown has quit [Remote host closed the connection]
jbrown has joined #lima
libv has quit [Ping timeout: 245 seconds]
libv has joined #lima
jrmuizel has quit [Remote host closed the connection]
drod has quit [Remote host closed the connection]
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]