ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at and - Contact ARM for binary driver support!
<MoeIcenowy> anarsoul: the bit after texture_2d is texture_3d
<anarsoul> great
<MoeIcenowy> enabling it and setting depth changes the behavior of the current texture load instruction
<anarsoul> do you clear texture_2d?
<MoeIcenowy> yes
<MoeIcenowy> BTW this bit seems to also be used for cubemaps
<MoeIcenowy> but cubemaps also have the bit before texture_2d set
<anarsoul> interesting
<anarsoul> MoeIcenowy: btw, see my comment to your uniforms fix
<anarsoul> my guess is that we don't need to specify size in this reg at all
<anarsoul> since what you do is essentially setting lower bits to 0
<MoeIcenowy> armessia: how are the 6 faces stored for a cubemap?
<anarsoul> uniform array size is 4
<anarsoul> 4 / 4 - 1 = 0
<anarsoul> so if it fixes ppmmu faults for it we don't need to set it
<MoeIcenowy> yes, although maybe someday we can find out how a bigger uniform array works?
<anarsoul> MoeIcenowy: mali4x0 for some reason uses double indirection for uniforms
<anarsoul> i.e. register contains pointer to a table of single entry
<anarsoul> and this table contains a pointer to uniform buffer
<MoeIcenowy> yes... but maybe we can have multiple uniform buffers?
<anarsoul> MoeIcenowy: but why?
<MoeIcenowy> ah... right
<MoeIcenowy> really useless for GL
<anarsoul> they use similar table for textures, but it actually makes sense in case of textures
<anarsoul> since each table entry has a pointer to a texture descriptor
<anarsoul> and we actually have multiple entries in this table
<anarsoul> but why they made it for uniforms - I have no idea
<anarsoul> MoeIcenowy: uniform load opcode has a lot of zeroes in it, so in theory it may be possible to use uniforms from another block
<anarsoul> but I don't think that we even need to explore it
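The double indirection anarsoul describes can be sketched in C (hypothetical struct and function names, purely illustrative; the real lima command stream uses GPU addresses rather than host pointers):

```c
#include <assert.h>

/* Sketch of the Mali4x0 uniform indirection described above: a register
 * points at a single-entry table, and that entry points at the actual
 * uniform buffer. Names are made up for illustration and are not the
 * real lima/mesa structures. */
typedef struct {
    float *uniform_buffer;       /* the actual uniform data */
} uniform_table_entry;

typedef struct {
    uniform_table_entry *table;  /* one-entry table the register points to */
} uniform_reg;

static float load_uniform(const uniform_reg *reg, unsigned idx)
{
    /* double indirection: reg -> table[0] -> uniform_buffer[idx] */
    return reg->table[0].uniform_buffer[idx];
}
```

The texture path uses the same shape of table, but with multiple entries, each pointing at a texture descriptor, which is why the indirection actually makes sense there.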
Da_Coynul has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
yuq825 has joined #lima
Da_Coynul has joined #lima
Da_Coynul has quit [Quit: My MacBook Air has gone to sleep. ZZZzzz…]
<anarsoul> yuq825: looks like I know why mipmapping is broken for linear textures
<anarsoul> yuq825: we align level size to 16 bytes boundary but hardware doesn't expect that and there's no stride for levels except 1
<anarsoul> I mean level 0
<anarsoul> I'm fixing it now
<anarsoul> and that probably means that mipmap levels can't be render target
<anarsoul> yuq825: I guess that's why blob uses tiled textures whenever possible
<yuq825> mipmap levels can't be render target for only linear texture?
<anarsoul> yes
<anarsoul> for tiled mipmap levels are aligned to tile boundaries
<yuq825> then we need to do something to stop using unaligned textures as render targets
<anarsoul> any ideas?
<yuq825> like in lima_set_framebuffer_state
<yuq825> check the start address of render buffer with level
<anarsoul> we can actually check width and stride
<anarsoul> and fail if it's not a multiple of 16
jrmuizel has joined #lima
<anarsoul> looks like we can't throw an error from lima_set_framebuffer_state()
armessia has quit [Quit: Leaving]
<anarsoul> yuq825: I think we'll have to create a shadow framebuffer for these cases
<yuq825> yeah, but painful
<anarsoul> that's why blob uses tiled textures :)
<anarsoul> render target requires buffer to be padded to 16 pixels in each direction
<anarsoul> but linear textures have mipmap levels with stride=width and width is not necessarily aligned to 16
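A minimal sketch of the check being discussed, assuming the 16-pixel padding rule above (hypothetical helper name, not the actual lima_set_framebuffer_state() code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the constraint from the chat: a render target must be padded
 * to 16 pixels, but linear miplevels have stride == width, and levels
 * other than level 0 carry no independent stride. So a linear miplevel
 * is only usable as a render target if width and stride are 16-aligned.
 * Hypothetical helper, not real lima code. */
static bool linear_level_ok_as_rt(uint32_t width, uint32_t stride)
{
    return (width % 16 == 0) && (stride % 16 == 0);
}
```

A check like this could run when the framebuffer state is set; since an error cannot be thrown from there, the fallback discussed is a shadow framebuffer (or simply using tiled textures, as the blob does).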
<mardikene193> I try to read about full frame rasterization and tiled rasterization, but in both cases the question remains the same: when you spread different geometry to different cores, be those macrotiles in AMD's case or tiles in Mali's case
<mardikene193> how does it really map the same kernel to different cores, i.e shader instructions
<mardikene193> ?
<mardikene193> you may have an MMU to be in charge of doing that right, i tried to look at VMIDs on AMD
<mardikene193> but this has not made much sense either, only max 16 of them, but fiji GCN gpu for instance could have 32CUs
<anarsoul> my mipmap fix fixes 20 piglit tests and breaks 1
<mardikene193> 8*8 macrotile is aligned to 256b cacheline on single precision and 512b cacheline on double precision, yeah sure I understand that
<mardikene193> but...
<mardikene193> still how can it map the same kernel/shader for parallel access to different cores?
<mardikene193> There is then enormous bus traffic and buses need to be very wide to accommodate relevant data during the data exchange
<mardikene193> David Kanter and some AMD guys say it is sort of an L2 coherency protocol which brings the bits into the cache, i.e. duplicated cache, which somehow makes sense
<mardikene193> even
<mardikene193> ok time for the whitepaper again, i am not sure whether it was l2 or l1 that was shared by a cluster of 4 compute units
<mardikene193> then 64 that of VEGA max Compute units divided by 4 is indeed maximum of 16VMIDs
<mardikene193> can you imagine some of the parallel memory buses that could do 64*2048/4096/8192 byte transfers, well this probably isn't realistic right?
<mardikene193> likewise it isn't realistic to do 16*any of the mentioned
<mardikene193> or is it?
jrmuizel has quit [Remote host closed the connection]
jrmuizel has joined #lima
<mardikene193> it brings in 32B per cycle on AMD GCN to all compute units in the cluster it appears, i was in the woods with my stuff, absolutely incorrect before
<mardikene193> ok covered, this does make a lot of sense
<mardikene193> 32 bytes, that means either 8 single precision instructions or 4 double precision instructions at a time
wens has left #lima [#lima]
megi has quit [Ping timeout: 240 seconds]
<mardikene193> are VMIDs cached automatically in the L1 of the appropriate CU banks, since instruction cache content should not change, or do they go through l2?
<mardikene193> I think the docs say they are brought in from l2 though still
Barada has joined #lima
<mardikene193> well i can perfectly understand the info, when MMU is involved indeed.
<mardikene193> however on say r300 there is no MMU on the chip, there is hardcoded memory controller
<mardikene193> since it does not have any cache-coherency protocol involved nor an MMU, i think my solution never needs them, but in long/full pipeline mode this chip probably needs to fetch different data from memory for separate pixel shaders?
<mardikene193> this has to be either that, or just lock-step delayed interleaving execution if all the pixel and vertex shaders share the cache
<mardikene193> yes of course the second one probably
<mardikene193> rasterizer works in lock-step not parallel, by the time it has data the instruction is still in the cache
jrmuizel has quit [Remote host closed the connection]
<mardikene193> that brings me to final question, i know local/shared memory can be used to migrate a thread according to some korean reports/papers
<mardikene193> what is the high level abstraction with texture units, 16 of them right on sm4.0 instead of NUMCUS*16 indexed separately, that means even though
<mardikene193> r300 with 4 pixel shaders has a total of 16, the programmer is given a chance to use 4 of them right?
<mardikene193> it appears not so, probably cause texture units are shared also for cluster of four shaders
<mardikene193> like the TC l1 cache
armessia has joined #lima
<mardikene193> I can query that programmatically on my CI apu chip, i have two of them, but it is pretty sure to me, that the number is 64 there.
<mardikene193> you may be wondering why I am sure of that: 2048 entries for 16 SIMDs of 4-word wide bundles only, cause attila shows this, cause probably 32*32*4 no longer fits on the die somehow
<mardikene193> it's max 2048 for elbrus too on that process 28nm
<armessia> MoeIcenowy: the 6 faces are stored right after each other, their position is implicit. The start positions of each face must be tile aligned though, also in the linear case.
<mardikene193> however calculations on NAVI, which is a lot smaller, about 4 fold indeed, show that they have queues with a length of 10240 per CU
<armessia> MoeIcenowy: the stride also needs to be 16 aligned
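armessia's layout description could be sketched roughly like this (the helper name and the exact alignment granularities are assumptions for illustration; the chat only states that faces are stored back to back, face starts are tile aligned, and the stride is 16-aligned):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the cubemap layout described above: the 6 faces are stored
 * consecutively with implicit positions; each face's start must be tile
 * aligned (a 16x16 pixel tile is assumed here) and the stride must be
 * 16-aligned. ALIGN requires a power-of-two alignment. All names and the
 * tile size are assumptions, not verified lima behavior. */
#define ALIGN(v, a) (((v) + (a) - 1) & ~((uint32_t)(a) - 1))

static uint32_t cubemap_face_offset(uint32_t width, uint32_t height,
                                    uint32_t bpp, unsigned face)
{
    uint32_t stride = ALIGN(width * bpp, 16);                    /* 16-aligned stride */
    uint32_t face_size = ALIGN(stride * height, 16 * 16 * bpp);  /* tile-aligned start */
    return face * face_size;                                     /* faces back to back */
}
```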
armessia has quit [Quit: Leaving]
<mardikene193> armessia you know as i was saying, from cache things are fetched (If they are in cache -- in cacheline lengths)
<mardikene193> no matter if that is data or instruction cache, however no matter the alignment for tiling
<mardikene193> tiling only works faster when the underlying memory is physically contiguous
<mardikene193> but yeah they say the tiles from rasterizer are best managed in 16x16 indeed
deesix has quit [Ping timeout: 265 seconds]
dddddd has quit [Ping timeout: 268 seconds]
deesix has joined #lima
dddddd has joined #lima
<mardikene193> yeah absolutely in shader you can access smaller tiles too, but rasterizer is known to do best with 16*16
<mardikene193> in other words vertex buffers should be probably aligned as such indeed
<mardikene193> double precision is more rare but this would in theory require 32x16 also 16byte aligned in fact indeed
<mardikene193> so what you said was correct for vertex buffers indeed
yuq8251 has joined #lima
yuq825 has quit [Ping timeout: 240 seconds]
<mardikene193> 16*16 is indeed 256B, which is quite often the cacheline size, yeah true
<mardikene193> since 32*32 is already 1024B, which is too much, then yeah 16 is the best solution
dddddd has quit [Remote host closed the connection]
<mardikene193> a cacheline of 256B means that there is a flop-based staging buffer of that size in hw
<mardikene193> it is loaded with a burst of 256B to there, and only those bits that you loaded are forwarded to cache
<mardikene193> no matter what you do it brings always the full cacheline into that staging flop
<mardikene193> and with memory as cache it works in similar fashion , bridgman once talked about it on phoronix
<mardikene193> memory is accessed as tiles as well
<mardikene193> but it needs to be physically contiguous obviously that it can happen
<mardikene193> and right, it needs to be aligned too somehow, geesh
<mardikene193> but i do not remember since that in depth post is gone what is hw memory tile alignment
mardikene193 has quit [Quit: Leaving]
Elpaulo1 has joined #lima
Elpaulo has quit [Ping timeout: 246 seconds]
Elpaulo1 is now known as Elpaulo
_whitelogger has joined #lima
mardikene193 has joined #lima
<mardikene193> the code i have in mind is likely going to be materialized, it carries world wide importance but i am unsure if I will receive any help from the community to push those bits, which is fine if not, but ...
<mardikene193> myself I have issues too that need to be solved, I have some chance to get into better form than i am in now, but this requires a bit of effort to deal with, and therapy and all, for me to participate as an older man again in some tournaments; for me dealing with my health is of pretty high rank importance as well
<mardikene193> But overconsuming the world's resources is a very bad thing, luckily some swedish activist made a pretty decent speech, i reckon this was some young girl.
<mardikene193> definitely a mid-teenager or something
<mardikene193> sweden is a welfare country, i see proof that swedish big numbers are very smart people, as well as finnish internationals, both are welfare countries; it has been a while since i last talked with swedes though, those on IRC are more like trolling, but they have something special in science terms as i can see too.
<mardikene193> one of my pals during my new zealand stay was swedish, i met him in a hostel; i right away asked something like how do swedes deal with their own, do they have envy between them, and the answer from that street musician was that in ghetto type areas they do, which i expected.
<mardikene193> and it proved my theory that it seems to be all about welfare not genetics, which is why the issue is probably sharper in estonia.
gaulishcoin has joined #lima
<mardikene193> things have gone towards better here locally, but when i was young we had chaos, nowadays there is structure already; i had enough money but i was in need of showing my fast legs everywhere i entered, and all this did not even save me from perverse violations arranged later still.
<mardikene193> nowadays it is already friendlier here, where authorities are settled in mainly police power and military and such; seems like some corruption in the courts is still present though.
<mardikene193> the nineties men slightly older than me, and me included as slightly younger, 10 years old in '90, were very motivated and strong, but about them i can not speak, whether they were stupid or why this shit happened to them; maybe cause they were a little stubborn, maybe because the situation was in chaos, and maybe both
<mardikene193> so what i think is, if you retain stable and clear thinking and have every chance to also bring your athleticism to some level without a threat from others violating you all the time, this is a clear mixture for success
<mardikene193> you mean what does it mean that some get violative feedback each and every day? You see, this is stereotypical, i have understood, in our country at least in the past.
<mardikene193> pretty boys, physically developed/advanced and in a good position to do something, were the ones they tried to wipe out the most back in those days, and this jealousy was generated, in my personal opinion, mostly due to the bad welfare situation.
<rellla> MoeIcenowy: i ran piglit with your uniforms fix and these are the results:
<rellla> wow.
<rellla> consider, i'm running the tests with a more tolerant piglit version.
<mardikene193> so if you did not do anything relevant to violate me, the guy who has a world record number of them on his belt, then why are you afraid?
<mardikene193> in almost 99.5% of cases i never responded to any of the fabrications nor violations i got; in other words i was a victim.
<mardikene193> rellla: what i think with uniforms: there is address-register based indexing; with that said, anything on the clamping path in the fragment shader is even more easily possible in the vertex shader
<mardikene193> sm3.0 has this also in the pixel shader, and sm4.0 has an address register for register indexing already
<mardikene193> rellla: there are two possibilities to skip an instruction with a bunch of uniform access LSU operations: 1. you repeat the last address, or 2. possibly index out-of-range, hoping that not 0 is returned but the instruction skips; one of them definitely works, maybe even both
<mardikene193> sampler based clamping is a stage before lsu based indexing/offsetting; it is an easy way to put an offset of a texture fetch into some value that gets clamped into the previous address by the effective on-chip calculator
<mardikene193> and voila, as it does not graduate, its dependency, which has to be an alu, can not run either
<mardikene193> remember, since you have to schedule two lsu operations in sequence to change column?
<mardikene193> how do you think this can be done?
yuq8251 has quit [Remote host closed the connection]
<mardikene193> pretty easy, you can throw two offsets into the game, or whatever indexes, so perhaps one offset is in ascending order, the other in descending, and write into those two regs
<mardikene193> so that whether 3 and 4, 1 and 2, 2 and 3 or whatever sequence is scheduled, it will change to the column of the last one
Da_Coynul has joined #lima
Da_Coynul has quit [Client Quit]
adjtm has quit [Ping timeout: 276 seconds]
kaspter has quit [Quit: kaspter]
<plaes> o_O
raimo has joined #lima
mardikene193 has quit [Read error: Connection reset by peer]
megi has joined #lima
raimo has quit [Read error: Connection reset by peer]
joss193 has joined #lima
raimo has joined #lima
joss193 has quit [Read error: Connection reset by peer]
raimo has quit [Read error: Connection reset by peer]
raimo has joined #lima
raimo has quit [Read error: Connection reset by peer]
Barada has quit [Quit: Barada]
raimo has joined #lima
<plaes> which merge request is this?
dddddd has joined #lima
<rellla> plaes: i'm actually running another on, because either the cubemaps or anarsoul's mipmapping patch causes some regressions
<rellla> s/on/one/
<raimo> I am trying to do final research on whether i could be wrong in some of the aspects; my core understanding relies on the fact that the full length of the pipeline is always in order unless there is a branch, which would mess everything up
<raimo> that means the full length of the pipeline always changes column, because by the time one instruction issues, there is never another yet available
guestt0876541 has joined #lima
<raimo> it is because amd docs say , one cycle and one instruction is fetched
<raimo> and valid bits start after reset from all toggled on
abelvesa has joined #lima
jrmuizel has joined #lima
<rellla> ah, cubemaps breaks it :)
<raimo> It should be true and incredibly logical to state as such, but you never know; i do not want to make more simulator tests, to me everything has been successful on the simulator, i said some days ago to libv that i am finally moving to hardware to see if those also work there
<raimo> those theories
guestt0876541 has quit [Remote host closed the connection]
<raimo> i've studied this stuff for 11 years now altogether, it would be nice if i was able to pull something off, but all this would take time still; if i can't it does not matter either, i got to be a more stable theoretician anyways, and i can handle fine
<raimo> in other fields
<raimo> i am considering the whole pipeline command processor and all shader stages to be optimized, not only vertex and fragment, but also tess and geom, and even compute shader
<raimo> I still make some or even lots of assumptions, one of them which i never tested was: Kayden once commented that they do not adjust wavefront/thread occupancy
<raimo> so does not any chip in my opinion, even though not tested, i almost see that path
<raimo> wave amount is always maximum when chip schedules stuff :D , yeah really it is the writebacks that start to block ;)
<raimo> enough, me out, and i do not have much to tell either.
raimo has quit [Quit: Leaving]
gaulishcoin has quit [Remote host closed the connection]
gaulishcoin has joined #lima
jrmuizel has quit [Remote host closed the connection]
gaulishcoin has quit [Quit: Leaving]
<MoeIcenowy> rellla: looks like no regressions
adjtm has joined #lima
<MoeIcenowy> (the only regression is a failure because of memory alloc)
<anarsoul> MoeIcenowy: and 105 fixes
<anarsoul> send an MR? just make sure that you set lower bits on uniform address to zero, I don't think it makes any sense to do any calculations there
<enunes> I wonder if any of these fixes finally fixes ideas
<enunes> no access to board/display today
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
<anarsoul> let me try
<MoeIcenowy> anarsoul: as "consider, i'm running the tests with a more tolerant piglit version", I don't know whether the fixes are really fixes...
<MoeIcenowy> or maybe I misunderstood this sentence?
<anarsoul> rellla: ^^
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
<rellla> both tests ran the same piglit version. they fail with master and pass with the uniform patch
<MoeIcenowy> oh interesting
<MoeIcenowy> it's really fixing things
<plaes> \o/
<anarsoul> enunes: it doesn't fix ideas :(
<anarsoul> lamp still has wrong colors
<enunes> damn that's a persistent one
<rellla> MoeIcenowy: tolerant piglit just means that this patch is included:
<MoeIcenowy> rellla: I misunderstood your sentence...
<MoeIcenowy> sorry
<MoeIcenowy> anarsoul: although the code that sets the size is a NOP, maybe we should still keep it?
<MoeIcenowy> as we have a buff for the 1-item array
adjtm has quit [Ping timeout: 268 seconds]
<anarsoul> MoeIcenowy: no. git will keep the history
<MoeIcenowy> anarsoul: we never have the correct thing in the history, right?
<anarsoul> what do you mean?
<MoeIcenowy> the correct code should be set the field based on the length of the array of uniform storages, right?
<anarsoul> MoeIcenowy: we don't know that
adjtm has joined #lima
kaspter has joined #lima
<anarsoul> the code was taken from original lima project and looks like it was incorrect
<anarsoul> you can try REing it if you want
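To make the question concrete, here is a sketch of the two packings being compared (the low-bit size encoding is exactly the unverified part, so the layout below is an assumption for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the two variants discussed above: the code inherited from the
 * original lima project packed a size field (size/4 - 1, in vec4 units)
 * into the low bits of the uniform array address; the fix leaves those
 * bits at zero. For the common single-vec4 array (size 4: 4/4 - 1 = 0)
 * both produce the same word, which is why dropping the field is safe in
 * practice. The field layout is an assumption, not verified RE. */
static uint32_t pack_with_size(uint32_t addr, uint32_t size_in_floats)
{
    return addr | (size_in_floats / 4 - 1);
}

static uint32_t pack_without_size(uint32_t addr)
{
    return addr; /* low bits left at zero */
}
```

Only if a bigger uniform array ever mattered would the two differ, and that is the case nobody has reverse engineered yet.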
megi has quit [Ping timeout: 268 seconds]
<rellla> does this sound like a reasonable fix? it prevents glsl-no-vertex-attribs from asserting in u_upload_mgr when the size is 0...
jrmuizel has joined #lima
kaspter has quit [Ping timeout: 250 seconds]
<rellla> at least :)
<anarsoul> lima 1e80000.gpu: mmu page fault at 0x443d37a0 from bus id 0 of type read on gpmmu
<anarsoul> is it with fix or without fix?
abelvesa has quit [Ping timeout: 276 seconds]
abelvesa has joined #lima
jrmuizel has quit [Ping timeout: 240 seconds]
<MoeIcenowy> anarsoul: I think 6dd0ad6 doesn't come with my uniform fix
<MoeIcenowy> oh my fix applies on pp, not gp
<rellla> anarsoul: with
<anarsoul> then your fix is likely incorrect
<rellla> :)
<anarsoul> I'd suggest running the same test with blob and dumping what it does
<rellla> yeah, that would be the best.
adjtm has quit [Ping timeout: 245 seconds]
kaspter has joined #lima
megi has joined #lima
kaspter has quit [Read error: Connection reset by peer]
drod has joined #lima
abelvesa has quit [Ping timeout: 268 seconds]
abelvesa has joined #lima
jrmuizel has joined #lima
abelvesa has quit [Ping timeout: 245 seconds]
abelvesa has joined #lima
<anarsoul> MoeIcenowy: please add my r-b tag to and I'll merge it
jrmuizel has quit [Ping timeout: 245 seconds]
abelvesa has quit [Ping timeout: 245 seconds]
drod has quit [Ping timeout: 240 seconds]
abelvesa has joined #lima
jrmuizel has joined #lima
drod has joined #lima
jrmuizel has quit [Remote host closed the connection]
<plaes> 5~
<anarsoul> ?
jrmuizel has joined #lima
jrmuizel has quit [Remote host closed the connection]
Da_Coynul has joined #lima