ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at and - Contact ARM for binary driver support!
<MoeIcenowy> anarsoul: the bit after texture_2d is texture_3d
<anarsoul> great
<MoeIcenowy> enabling it and setting depth changes the behavior of the current texture load instruction
<anarsoul> do you clear texture_2d?
<MoeIcenowy> yes
<MoeIcenowy> BTW this bit seems to also be used for cubemaps
<MoeIcenowy> but cubemaps also have the bit before texture_2d set
<anarsoul> interesting
<anarsoul> MoeIcenowy: btw, see my comment to your uniforms fix
<anarsoul> my guess is that we don't need to specify size in this reg at all
<anarsoul> since what you do is essentially setting lower bits to 0
<MoeIcenowy> armessia: how are the 6 faces stored for a cubemap?
<anarsoul> uniform array size is 4
<anarsoul> 4 / 4 - 1 = 0
<anarsoul> so if it fixes ppmmu faults for it we don't need to set it
<MoeIcenowy> yes, although maybe someday we can find out how a bigger uniform array works?
<anarsoul> MoeIcenowy: mali4x0 for some reason uses double indirection for uniforms
<anarsoul> i.e. register contains pointer to a table of single entry
<anarsoul> and this table contains a pointer to uniform buffer
<MoeIcenowy> yes... but maybe we can have multiple uniform buffers?
<anarsoul> MoeIcenowy: but why?
<MoeIcenowy> ah... right
<MoeIcenowy> really useless for GL
<anarsoul> they use similar table for textures, but it actually makes sense in case of textures
<anarsoul> since each table entry has a pointer to a texture descriptor
<anarsoul> and we actually have multiple entries in this table
<anarsoul> but why they made it for uniforms - I have no idea
<anarsoul> MoeIcenowy: uniform load opcode has a lot of zeroes in it, so in theory it may be possible to use uniforms from another block
<anarsoul> but I don't think that we even need to explore it
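The double indirection anarsoul describes can be sketched in C (hypothetical struct and function names, purely illustrative; the real lima command stream uses GPU addresses rather than host pointers):

```c
#include <assert.h>

/* Sketch of the Mali4x0 uniform indirection described above: a register
 * points at a single-entry table, and that entry points at the actual
 * uniform buffer. Names are made up for illustration and are not the
 * real lima/mesa structures. */
typedef struct {
    float *uniform_buffer;       /* the actual uniform data */
} uniform_table_entry;

typedef struct {
    uniform_table_entry *table;  /* one-entry table the register points to */
} uniform_reg;

static float load_uniform(const uniform_reg *reg, unsigned idx)
{
    /* double indirection: reg -> table[0] -> uniform_buffer[idx] */
    return reg->table[0].uniform_buffer[idx];
}
```

The texture path uses the same shape of table, but with multiple entries, each pointing at a texture descriptor, which is why the indirection actually makes sense there.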
Da_Coynul has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
yuq825 has joined #lima
Da_Coynul has joined #lima
Da_Coynul has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
<anarsoul> yuq825: looks like I know why mipmapping is broken for linear textures
<anarsoul> yuq825: we align level size to 16 bytes boundary but hardware doesn't expect that and there's no stride for levels except 1
<anarsoul> I mean level 0
<anarsoul> I'm fixing it now
<anarsoul> and that probably means that mipmap levels can't be render target
<anarsoul> yuq825: I guess that's why blob uses tiled textures whenever possible
<yuq825> mipmap levels can't be render target for only linear texture?
<anarsoul> yes
<anarsoul> for tiled mipmap levels are aligned to tile boundaries
<yuq825> then we need to do something to stop using unaligned textures as render targets
<anarsoul> any ideas?
<yuq825> like in lima_set_framebuffer_state
<yuq825> check the start address of render buffer with level
<anarsoul> we can actually check width and stride
<anarsoul> and fail if it's not a multiple of 16
jrmuizel has joined #lima
<anarsoul> looks like we can't throw an error from lima_set_framebuffer_state()
armessia has quit [Quit: Leaving]
<anarsoul> yuq825: I think we'll have to create a shadow framebuffer for these cases
<yuq825> yeah, but painful
<anarsoul> that's why blob uses tiled textures :)
<anarsoul> render target requires buffer to be padded to 16 pixels in each direction
<anarsoul> but linear textures have mipmap levels with stride=width and width is not necessarily aligned to 16
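A minimal sketch of the check being discussed, assuming the 16-pixel padding rule above (hypothetical helper name, not the actual lima_set_framebuffer_state() code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the constraint from the chat: a render target must be padded
 * to 16 pixels, but linear miplevels have stride == width, and levels
 * other than level 0 carry no independent stride. So a linear miplevel
 * is only usable as a render target if width and stride are 16-aligned.
 * Hypothetical helper, not real lima code. */
static bool linear_level_ok_as_rt(uint32_t width, uint32_t stride)
{
    return (width % 16 == 0) && (stride % 16 == 0);
}
```

A check like this could run when the framebuffer state is set; since an error cannot be thrown from there, the fallback discussed is a shadow framebuffer (or simply using tiled textures, as the blob does).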
<mardikene193> I try to read about full frame rasterization and tiled rasterization, but in both cases the question remains the same: when you spread different geometry to different cores, be those macrotiles in AMD's case or tiles in Mali's case
<mardikene193> how does it really map the same kernel to different cores, i.e shader instructions
<mardikene193> ?
<mardikene193> you may have an MMU to be in charge of doing that right, i tried to look at VMIDs on AMD
<mardikene193> but this has not made much sense either, only max 16 of them, but fiji GCN gpu for instance could have 32CUs
<anarsoul> my mipmap fix fixes 20 piglit tests and breaks 1
<mardikene193> 8*8 macrotile is aligned to 256b cacheline on single precision and 512b cacheline on double precision, yeah sure I understand that
<mardikene193> but...
<mardikene193> still how can it map the same kernel/shader for parallel access to different cores?
<mardikene193> There is then enormous bus traffic and buses need to be very wide to accommodate relevant data during the data exchange
<mardikene193> David Kanter and some AMD guys say it is sort of an L2 coherency protocol which brings the bits into the cache, i.e. duplicated cache, which somehow makes sense
<mardikene193> even
<mardikene193> ok time for the whitepaper again, i am not sure whether it was l2 or l1 that was shared by a cluster of 4 compute units
<mardikene193> then 64 that of VEGA max Compute units divided by 4 is indeed maximum of 16VMIDs
<mardikene193> can you imagine some of the parallel memory buses that could do 64*2048/4096/8192 byte transfers, well this probably isn't realistic right?
<mardikene193> likewise it isn't realistic to do 16*any of the mentioned
<mardikene193> or is it?
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
<mardikene193> it brings in 32B per cycle on AMD GCN to all compute units in the cluster it appears, i was in the woods with my stuff, absolutely incorrect before
<mardikene193> ok covered, this does make a lot of sense
<mardikene193> 32 bytes, that means either 8 single precision instructions or 4 double precision instructions at a time
wens has left #lima [#lima]
megi has quit [Ping timeout: 240 seconds]
<mardikene193> are VMIDs cached automatically in the L1 of the appropriate CU banks, since instruction cache content should not change, or do they go through l2?
<mardikene193> I think the docs say they are brought in from l2 though still
Barada has joined #lima
<mardikene193> well i can perfectly understand the info, when MMU is involved indeed.
<mardikene193> however on say r300 there is no MMU on the chip, there is hardcoded memory controller
<mardikene193> since it does not have any cache-coherency protocol involved nor an MMU, i think my solution never needs them, but in long/full pipeline mode this chip probably needs to fetch different data from memory for separate pixel shaders?
<mardikene193> this has to be either that, or just lock-step delayed interleaving execution if all the pixel and vertex shaders share the cache
<mardikene193> yes of course the second one probably
<mardikene193> rasterizer works in lock-step not parallel, by the time it has data the instruction is still in the cache
jrmuizel has quit [Remote host closed the connection]
<mardikene193> that brings me to final question, i know local/shared memory can be used to migrate a thread according to some korean reports/papers
<mardikene193> what is the high level abstraction with texture units, 16 of them right on sm4.0 instead of NUMCUS*16 indexed separately, that means even though
<mardikene193> r300 with 4 pixel shaders has a total of 16, the programmer is given a chance to use 4 of them right?
<mardikene193> it appears not so, probably cause texture units are shared also for cluster of four shaders
<mardikene193> like the TC l1 cache
armessia has joined #lima
<mardikene193> I can query that programmatically on my CI apu chip, i have two of them, but it is pretty sure to me, that the number is 64 there.
<mardikene193> you may be wondering why I am sure of that: 2048 entries for 16 SIMDs of 4-word wide bundles only, cause attila shows this, cause probably 32*32*4 no longer fits on the die somehow
<mardikene193> it's max 2048 for elbrus too on that process 28nm
<armessia> MoeIcenowy: the 6 faces are stored right after each other, their position is implicit. The start positions of each face must be tile aligned though, also in the linear case.
<mardikene193> however calculations on NAVI, which is a lot smaller, about 4 fold indeed, show that they have queues with a length of 10240 per CU
<armessia> MoeIcenowy: the stride also needs to be 16 aligned
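armessia's layout description could be sketched roughly like this (the helper name and the exact alignment granularities are assumptions for illustration; the chat only states that faces are stored back to back, face starts are tile aligned, and the stride is 16-aligned):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the cubemap layout described above: the 6 faces are stored
 * consecutively with implicit positions; each face's start must be tile
 * aligned (a 16x16 pixel tile is assumed here) and the stride must be
 * 16-aligned. ALIGN requires a power-of-two alignment. All names and the
 * tile size are assumptions, not verified lima behavior. */
#define ALIGN(v, a) (((v) + (a) - 1) & ~((uint32_t)(a) - 1))

static uint32_t cubemap_face_offset(uint32_t width, uint32_t height,
                                    uint32_t bpp, unsigned face)
{
    uint32_t stride = ALIGN(width * bpp, 16);                    /* 16-aligned stride */
    uint32_t face_size = ALIGN(stride * height, 16 * 16 * bpp);  /* tile-aligned start */
    return face * face_size;                                     /* faces back to back */
}
```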
armessia has quit [Quit: Leaving]
<mardikene193> armessia you know as i was saying, from cache things are fetched (If they are in cache -- in cacheline lengths)
<mardikene193> no matter if that is data or instruction cache, however no matter the alignment for tiling
<mardikene193> tiling only works faster when the underlying memory is physically contiguous
<mardikene193> but yeah they say the tiles from rasterizer are best managed in 16x16 indeed
deesix has quit [Ping timeout: 265 seconds]
dddddd has quit [Ping timeout: 268 seconds]
deesix has joined #lima
dddddd has joined #lima
<mardikene193> yeah absolutely in shader you can access smaller tiles too, but rasterizer is known to do best with 16*16
<mardikene193> in other words vertex buffers should be probably aligned as such indeed
<mardikene193> double precision is more rare but this would in theory require 32x16 also 16byte aligned in fact indeed
<mardikene193> so what you said was correct for vertex buffers indeed
yuq8251 has joined #lima
yuq825 has quit [Ping timeout: 240 seconds]
<mardikene193> 16*16 is indeed 256B, which is quite often the cacheline size, yeah true
<mardikene193> since 32*32 is already 1024B, which is too much, then yeah 16 is the best solution
dddddd has quit [Remote host closed the connection]
<mardikene193> a cacheline of 256B means that there is a flop-based staging buffer of that size in hw
<mardikene193> it is loaded with a burst of 256B to there, and only those bits that you loaded are forwarded to cache
<mardikene193> no matter what you do it brings always the full cacheline into that staging flop
<mardikene193> and with memory as cache it works in similar fashion , bridgman once talked about it on phoronix
<mardikene193> memory is accessed as tiles as well
<mardikene193> but it needs to be physically contiguous obviously that it can happen
<mardikene193> and right, it needs to be aligned too somehow, geesh
<mardikene193> but i do not remember since that in depth post is gone what is hw memory tile alignment
mardikene193 has quit [Quit: Leaving]
Elpaulo1 has joined #lima
Elpaulo has quit [Ping timeout: 246 seconds]
Elpaulo1 is now known as Elpaulo
_whitelogger has joined #lima
mardikene193 has joined #lima
<mardikene193> the code i have in mind is likely going to be materialized, it carries world wide importance but i am unsure if I will receive any help from the community to push those bits, which is fine if not, but ...
<mardikene193> myself I have issues too that need to be solved, I have some chance to get into better form than i am in now, but this requires a bit of effort to deal with, and therapy and all, for me to participate as an older man again in some tournaments; for me dealing with my health is of pretty high rank importance as well
<mardikene193> But overconsuming the world's resources is a very bad thing, luckily some swedish activist made a pretty decent speech, i reckon this was some young girl.
<mardikene193> definitely a mid-teenager or something
<mardikene193> sweden is a welfare country, i see proof that swedish big numbers are very smart people, as well as finnish internationals, both are welfare countries; it has been a while since i last talked with swedes though, those on IRC are more like trolling, but they have something special in science terms as i can see too.
<mardikene193> one of my pals during my new zealand stay was swedish, i met him in a hostel; i right away asked something like how do swedes deal with their own, do they have envy between them, and the answer from that street musician was that in ghetto type areas they do, which i expected.
<mardikene193> and it proved my theory that it seems to be all about welfare not genetics, which is why the issue is probably sharper in estonia.
gaulishcoin has joined #lima
<mardikene193> things have gone towards better here locally, but when i was young we had chaos, nowadays there is structure already; i had enough money but i was in need of showing my fast legs everywhere i entered, and all this did not even save me from perverse violations arranged later still.
<mardikene193> nowadays it is already friendlier here, where authorities are settled in mainly police power and military and such; seems like some corruption in the courts is still present though.
<mardikene193> the nineties men slightly older than me, and me included as slightly younger, 10 years old in '90, were very motivated and strong, but about them i can not speak, whether they were stupid or why this shit happened to them; maybe cause they were a little stubborn, maybe because the situation was in chaos, and maybe both
<mardikene193> so what i think is, if you retain stable and clear thinking and have every chance to also bring your athleticism to some level without a threat from others violating you all the time, this is a clear mixture for success
<mardikene193> you mean what does it mean that some get violative feedback each and every day? You see, this is stereotypical, i have understood, in our country at least in the past.
<mardikene193> pretty boys, physically developed/advanced and in a good position to do something, were the ones they tried to wipe out the most back in those days, and this jealousy was generated, in my personal opinion, mostly due to the bad welfare situation.
<rellla> MoeIcenowy: i ran piglit with your uniforms fix and these are the results:
<rellla> wow.
<rellla> consider, i'm running the tests with a more tolerant piglit version.
<mardikene193> so if you did not do anything relevant to violate me, the guy who has a world record number of them on his belt, then why are you afraid?
<mardikene193> in almost 99.5% of cases i never responded to any of the fabrications nor violations i got; in other words i was a victim.
<mardikene193> rellla: what i think with uniforms: there is address-register based indexing; with that said, anything on the clamping path in the fragment shader is even more easily possible in the vertex shader
<mardikene193> sm3.0 has this also in the pixel shader, and sm4.0 has an address register for register indexing already
<mardikene193> rellla: there are two possibilities to skip an instruction with a bunch of uniform access LSU operations: 1. you repeat the last address, or 2. possibly index out-of-range, hoping that not 0 is returned but the instruction skips; one of them definitely works, maybe even both
<mardikene193> sampler based clamping is a stage before lsu based indexing/offsetting; it is an easy way to put an offset of a texture fetch into some value that gets clamped into the previous address by the effective on-chip calculator
<mardikene193> and voila, as it does not graduate, its dependency, which has to be an alu, can not run either
<mardikene193> remember, since you have to schedule two lsu operations in sequence to change column?
<mardikene193> how do you think this can be done?
yuq8251 has quit [Remote host closed the connection]
<mardikene193> pretty easy, you can throw two offsets into the game, or whatever indexes, so perhaps one offset is in ascending order, the other in descending, and write into those two regs
<mardikene193> so that whether 3 and 4, 1 and 2, 2 and 3 or whatever sequence is scheduled, it will change to the column of the last one
Da_Coynul has joined #lima
Da_Coynul has quit [Client Quit]
adjtm has quit [Ping timeout: 276 seconds]
kaspter has quit [Quit: kaspter]
<plaes> o_O
raimo has joined #lima
mardikene193 has quit [Read error: Connection reset by peer]
megi has joined #lima
raimo has quit [Read error: Connection reset by peer]
joss193 has joined #lima
raimo has joined #lima
joss193 has quit [Read error: Connection reset by peer]
raimo has quit [Read error: Connection reset by peer]
raimo has joined #lima
raimo has quit [Read error: Connection reset by peer]
Barada has quit [Quit: Barada]
raimo has joined #lima
<plaes> which merge request is this?
dddddd has joined #lima
<rellla> plaes: i'm actually running another on, because either the cubemaps or anarsoul's mipmapping patch causes some regressions
<rellla> s/on/one/
<raimo> I am trying to do final research on whether i could be wrong in some of the aspects; my core understanding relies on the fact that the full length of the pipeline is always in order unless there is a branch, which would mess everything up
<raimo> that means the full length of the pipeline always changes column, because by the time one instruction issues, there is never another yet available
guestt0876541 has joined #lima
<raimo> it is because amd docs say , one cycle and one instruction is fetched
<raimo> and valid bits start after reset from all toggled on
abelvesa has joined #lima
jrmuizel has joined #lima
<rellla> ah, cubemaps breaks it :)
<raimo> It should be true and incredibly logical to state as such, but you never know; i do not want to make more simulator tests, to me everything has been successful on the simulator, i said some days ago to libv that i am finally moving to hardware to see if those also work there
<raimo> those theories
guestt0876541 has quit [Remote host closed the connection]
<raimo> i've studied this stuff for 11 years now altogether, it would be nice if i was able to pull something off, but all this would take time still; if i can't it does not matter either, i got to be a more stable theoretician anyways, and i can handle fine
<raimo> in other fields
<raimo> i am considering the whole pipeline command processor and all shader stages to be optimized, not only vertex and fragment, but also tess and geom, and even compute shader
<raimo> I still make some or even lots of assumptions, one of them which i never tested was: Kayden once commented that they do not adjust wavefront/thread occupancy
<raimo> so does not any chip in my opinion, even though not tested, i almost see that path
<raimo> wave amount is always maximum when chip schedules stuff :D , yeah really it is the writebacks that start to block ;)
<raimo> enough, me out, and i do not have much to tell either.
raimo has quit [Quit: Leaving]
gaulishcoin has quit [Remote host closed the connection]
gaulishcoin has joined #lima
jrmuizel has quit [Remote host closed the connection]
gaulishcoin has quit [Quit: Leaving]
<MoeIcenowy> rellla: looks like no regressions
adjtm has joined #lima
<MoeIcenowy> (the only regression is a failure because of memory alloc)
<anarsoul> MoeIcenowy: and 105 fixes
<anarsoul> send an MR? just make sure that you set lower bits on uniform address to zero, I don't think it makes any sense to do any calculations there
<enunes> I wonder if any of these fixes finally fixes ideas
<enunes> no access to board/display today
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
<anarsoul> let me try
<MoeIcenowy> anarsoul: as "consider, i'm running the tests with a more tolerant piglit version", I don't know whether the fixes are really fixes...
<MoeIcenowy> or maybe I misunderstood this sentence?
<anarsoul> rellla: ^^
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
<rellla> both tests ran the same piglit version. they fail with master and pass with the uniform patch
<MoeIcenowy> oh interesting
<MoeIcenowy> it's really fixing things
<plaes> \o/
<anarsoul> enunes: it doesn't fix ideas :(
<anarsoul> lamp still has wrong colors
<enunes> damn that's a persistent one
<rellla> MoeIcenowy: tolerant piglit just means that this patch is included:
<MoeIcenowy> rellla: I misunderstood your sentence...
<MoeIcenowy> sorry
<MoeIcenowy> anarsoul: although the code that sets the size is a NOP, maybe we should still keep it?
<MoeIcenowy> as we have a buff for the 1-item array
adjtm has quit [Ping timeout: 268 seconds]
<anarsoul> MoeIcenowy: no. git will keep the history
<MoeIcenowy> anarsoul: we never have the correct thing in the history, right?
<anarsoul> what do you mean?
<MoeIcenowy> the correct code should be set the field based on the length of the array of uniform storages, right?
<anarsoul> MoeIcenowy: we don't know that
adjtm has joined #lima
kaspter has joined #lima
<anarsoul> the code was taken from original lima project and looks like it was incorrect
<anarsoul> you can try REing it if you want
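To make the question concrete, here is a sketch of the two packings being compared (the low-bit size encoding is exactly the unverified part, so the layout below is an assumption for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the two variants discussed above: the code inherited from the
 * original lima project packed a size field (size/4 - 1, in vec4 units)
 * into the low bits of the uniform array address; the fix leaves those
 * bits at zero. For the common single-vec4 array (size 4: 4/4 - 1 = 0)
 * both produce the same word, which is why dropping the field is safe in
 * practice. The field layout is an assumption, not verified RE. */
static uint32_t pack_with_size(uint32_t addr, uint32_t size_in_floats)
{
    return addr | (size_in_floats / 4 - 1);
}

static uint32_t pack_without_size(uint32_t addr)
{
    return addr; /* low bits left at zero */
}
```

Only if a bigger uniform array ever mattered would the two differ, and that is the case nobody has reverse engineered yet.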
megi has quit [Ping timeout: 268 seconds]
<rellla> does this sound like a reasonable fix? it prevents glsl-no-vertex-attribs from asserting in u_upload_mgr when the size is 0...
jrmuizel has joined #lima
kaspter has quit [Ping timeout: 250 seconds]
<rellla> at least :)
<anarsoul> lima 1e80000.gpu: mmu page fault at 0x443d37a0 from bus id 0 of type read on gpmmu
<anarsoul> is it with fix or without fix?
abelvesa has quit [Ping timeout: 276 seconds]
abelvesa has joined #lima
jrmuizel has quit [Ping timeout: 240 seconds]
<MoeIcenowy> anarsoul: I think 6dd0ad6 doesn't come with my uniform fix
<MoeIcenowy> oh my fix applies on pp, not gp
<rellla> anarsoul: with
<anarsoul> then your fix is likely incorrect
<rellla> :)
<anarsoul> I'd suggest running the same test with blob and dumping what it does
<rellla> yeah, that would be the best.
adjtm has quit [Ping timeout: 245 seconds]
kaspter has joined #lima
megi has joined #lima
kaspter has quit [Read error: Connection reset by peer]
drod has joined #lima
abelvesa has quit [Ping timeout: 268 seconds]
abelvesa has joined #lima
jrmuizel has joined #lima
abelvesa has quit [Ping timeout: 245 seconds]
abelvesa has joined #lima
<anarsoul> MoeIcenowy: please add my r-b tag to and I'll merge it
jrmuizel has quit [Ping timeout: 245 seconds]
abelvesa has quit [Ping timeout: 245 seconds]
drod has quit [Ping timeout: 240 seconds]
abelvesa has joined #lima
jrmuizel has joined #lima
drod has joined #lima
jrmuizel has quit [Remote host closed the connection]
<plaes> 5~
<anarsoul> ?
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
Da_Coynul has joined #lima