ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel has landed in mainline, userspace driver is part of mesa - Logs at https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=lima and https://freenode.irclog.whitequark.org/lima - Contact ARM for binary driver support!
kaspter has joined #lima
jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
jernej has joined #lima
jernej has quit [Client Quit]
jernej has joined #lima
jernej has quit [Client Quit]
jernej has joined #lima
camus has joined #lima
kaspter has quit [Ping timeout: 246 seconds]
camus is now known as kaspter
camus has joined #lima
kaspter has quit [Ping timeout: 246 seconds]
camus is now known as kaspter
camus has joined #lima
camus1 has joined #lima
kaspter has quit [Ping timeout: 256 seconds]
camus1 is now known as kaspter
camus has quit [Ping timeout: 265 seconds]
<bshah> I am afraid I did not manage to bisect it (well, "bisect" is a strong word, because I don't have any
<bshah> erm
<bshah> I don't have a good commit to begin with, or even know which repo
<bshah> as I understand it, the frequency of this was previously quite low, but now it's an almost instant GPU crash on rotation :(
<anarsoul> so did it start with your compositor update?
<anarsoul> or with mesa update?
<anarsoul> or something else?
<bshah> now that I think about it, I realize that pmOS, where I cannot reproduce this, has mesa 20.2, and the other systems where I can reproduce it are on 20.3 or master
kaspter has quit [Ping timeout: 256 seconds]
camus has joined #lima
camus is now known as kaspter
<bshah> hm or not
<bshah> both had mesa 20.2.3 :(
<bshah> 20.3 was released just yesterday
Barada has joined #lima
Net147 has quit [Quit: Quit]
Net147 has joined #lima
kaspter has quit [Ping timeout: 265 seconds]
kaspter has joined #lima
Barada has quit [Quit: Barada]
Viciouss has quit [Ping timeout: 246 seconds]
<enunes> bshah: can you upload all shaders you captured somewhere, just to take a look?
<bshah> sure, give me a moment
<enunes> bshah: hmm, nothing too weird, I don't remember seeing one with that many texture references
<enunes> maybe trying to simplify the most complex ones and seeing if it makes a difference could be something
<enunes> but kind of a blind shot
<bshah> enunes: if I can somehow figure out exactly which shader is causing this, then I can do something about it; currently I have almost zero idea where to look :/
<enunes> as in, finding wherever they are defined in Qt or something, patching the shader (removing mostly loops, conditionals, long calculation sequences and things that are hard to optimize), rebuilding that component, and running it to see if it makes a difference
<enunes> I'd just try to grep some part of that shader in the sources for the involved components
<enunes> again, I don't know if this will solve anything, but it would be interesting data, to at least eliminate that possibility
<bshah> is there no way to add some debug output in mesa or something, to see what shader it is processing?
<bshah> mostly because these 20-ish shaders are kind of spread across 5-6 different repos
<enunes> I mostly use MESA_SHADER_CAPTURE_PATH, but you already have the shaders
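(For context: MESA_SHADER_CAPTURE_PATH is a standard Mesa debug variable that writes every GLSL shader the driver compiles into the given directory, one file per shader. A minimal usage sketch; kwin_wayland is just a stand-in for whatever GL client is being debugged:

    mkdir -p /tmp/shaders
    MESA_SHADER_CAPTURE_PATH=/tmp/shaders kwin_wayland

)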
<rellla> bshah: I didn't follow the whole discussion, but did you already try with LIMA_DEBUG=gp,pp ?
<bshah> no I did not, but I can try
<rellla> you can try first pp, then gp, and see what the lima compiler does with the shaders and whether there is anything dubious ...
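(For context: LIMA_DEBUG is the lima Mesa driver's debug variable, a comma-separated list of flags; the gp and pp flags print the vertex (gp) and fragment (pp) compiler output for every shader the application compiles. An illustrative invocation, with glmark2-es2-wayland standing in for any GLES client that triggers the suspect shaders:

    LIMA_DEBUG=pp glmark2-es2-wayland
    LIMA_DEBUG=gp,pp kwin_wayland

)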
<rellla> thanks for the link
<enunes> that probably requires some knowledge of how pp and gp work; I was thinking more of finding a way to help narrow the issue down without having to know that
<bshah> huh, of course, now with LIMA_DEBUG exported I somehow can't reproduce this
<bshah> :'(
<bshah> it had been running for like 4 mins, but that's already more than what I was able to get in the past
<rellla> enunes: different context question: shouldn't the gp scheduler be successful even if https://gitlab.freedesktop.org/mesa/mesa/-/blob/master/src/gallium/drivers/lima/ir/gp/scheduler.c#L1507 is not done?
camus has joined #lima
kaspter has quit [Ping timeout: 256 seconds]
camus is now known as kaspter
<enunes> rellla: I'm a bit out of the loop in gpir, but it seems like it tries to schedule a node as is, and if that's not successful it needs to insert a move for that node and try again later
<enunes> in the loop at 1563, so the logic makes sense to me
e is now known as demiurge
<rellla> true - I'm trying to figure out why we have an endless loop here ...
<enunes> endless loop while compiling or endless loop in the generated code?
Viciouss has joined #lima
<rellla> I tried the remaining deqp test that hits the 512 limit, and if I drop the sched_move I end up with a ready list that can't be scheduled any more and generates no-op instructions until we reach 512
<rellla> but I guess I have to re-add the sched_move and see what it does...
<enunes> yeah, it is probably important; it is by design, since the gp doesn't have registers, so those moves are needed to keep the values alive until their nodes can be scheduled
<rellla> I guess the bug results in having all the slots (mul0|mul1|add0|add1|pass|cmpl) blocked with sched_moves, so we are not able to schedule the instruction anymore
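(To illustrate the failure mode rellla describes: gpir packs nodes into a fixed set of slots per instruction, and when a value must survive until a later instruction, a move is inserted to carry it forward. A toy model in C, with made-up types rather than the real mesa data structures, shows how an instruction whose slots are all consumed by such moves leaves no room for the node that actually needs scheduling, so the scheduler can make no progress and keeps emitting filler until the 512-instruction cap:

    /* Toy model of the gpir "schedule or insert a move" slot pressure;
     * types and slot names are illustrative, not the real mesa code. */
    #include <stdbool.h>
    #include <stdio.h>

    enum slot { MUL0, MUL1, ADD0, ADD1, PASS, CMPL, NUM_SLOTS };

    struct instr {
        const char *holder[NUM_SLOTS]; /* NULL = slot still free */
    };

    /* try to place a node into any free slot of this instruction */
    static bool try_place(struct instr *in, const char *node)
    {
        for (int s = 0; s < NUM_SLOTS; s++) {
            if (!in->holder[s]) {
                in->holder[s] = node;
                printf("placed %s in slot %d\n", node, s);
                return true;
            }
        }
        return false;
    }

    int main(void)
    {
        struct instr in = {0};

        /* moves inserted to keep earlier values alive fill every slot... */
        for (int i = 0; i < NUM_SLOTS; i++)
            try_place(&in, "sched_move");

        /* ...so the node we actually need never fits; the real scheduler
         * would then emit no-op instructions until it hits the 512 limit */
        if (!try_place(&in, "real_node"))
            printf("all slots blocked: no progress possible\n");
        return 0;
    }

)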
<bshah> sorry for the noob question ... so we have been trying to narrow down the issue; one thing I noticed is that this crash happens only when the device is locked, and even after we dismantled our lock greeter to the bare bones it still caused the crash
<bshah> now our current theory is that there is some issue wrt windows in the background stopping their rendering when the screen is locked
<bshah> (well, not the windows, but the compositor deciding not to update any windows in the background)
camus has joined #lima
kaspter has quit [Ping timeout: 240 seconds]
camus is now known as kaspter
<enunes> bshah: when you hit the issue, what does it say in status= ?
<enunes> in dmesg
<bshah> enunes: the first failing process has:
<bshah> [ 85.037167] lima 1c40000.gpu: pp task error 0 int_state=0 status=1
<bshah> [ 85.043382] lima 1c40000.gpu: pp task error 1 int_state=0 status=1
<bshah> and then the next:
<bshah> [ 85.581390] lima 1c40000.gpu: pp task error 0 int_state=0 status=5
<bshah> [ 85.587612] lima 1c40000.gpu: pp task error 1 int_state=0 status=5
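(These errors come from the lima kernel driver's pp task error handling; int_state and status appear to be raw register values read back from the failing pp core. A simple way to watch for them live while reproducing, assuming a util-linux dmesg:

    dmesg -w | grep lima

)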
<enunes> so, a different status, but the same way to reproduce?
<bshah> they both happen at the same time
<bshah> one after another
<bshah> I guess one fails and then borks the other processes?
<enunes> I have a different way to cause a pp task error, but it's status=5 only
<enunes> that issue report has status=0
<enunes> so it's possible there are multiple issues in here; a pp task error can be many things, I suppose
<bshah> https://invent.kde.org/plasma/kwin/-/blob/master/composite.cpp#L674 we verified that if we remove these 3 lines from kwin, the crash is gone
<enunes> in my case I can only reproduce it by running multiple things in parallel, running one at a time doesn't reproduce it
<enunes> which is bad news I guess
<bshah> more or less the same thing here, I guess
<enunes> did you try bisecting kernel versions?
<bshah> I have 3 things running: the compositor, and 2 clients (lockscreen and app)
<enunes> like, going back to something like 5.7 or some time where it didn't reproduce
<bshah> I did not tbh
<enunes> lima didn't change much, but there are many reworks going on in drm; maybe we missed something
<enunes> if you could try it, that would save some debug time too
<bshah> I am, erm, kinda time-stressed at the moment, but can probably try it the week after
<enunes> yeah, same for me
<bshah> the kwin code I refer to basically causes only the rendering of the lockscreen client and the compositor to stay active; it stops rendering other clients
<bshah> maybe that kinda provides some hints on what might be going wrong... dunno
<enunes> to me it's a bit far away from something we can relate to lima
<enunes> but if it reduces the things running in parallel, it can be the same issue I hit
<bshah> yep, sorry, can't help much :D
<enunes> I don't remember this from before, so if I had time now I'd try a fairly old kernel
<enunes> the problem is it takes a while to reproduce so I'd have to run it for hours to ensure it really doesn't reproduce and bisect successfully
<enunes> maybe it can be accelerated by coming up with some application that renders a whole lot of things at the same time, to stress it
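(One cheap way to approximate such a stress load, assuming the GLES2 build of glmark2 is installed, is to run several instances in parallel so that multiple clients keep submitting gp/pp jobs at once; glmark2's --run-forever flag loops its benchmarks indefinitely:

    for i in 1 2 3 4; do glmark2-es2-wayland --run-forever & done

)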
<bshah> in a way, the userspace setup I have can reproduce it easily
<bshah> I will try to find time to bisect the kernel
<enunes> is there some image?
<enunes> I have a pinephone too
<bshah> oooh
<enunes> so just boot it and rotate the screen?
<bshah> the easiest way to reproduce this issue is to boot the phone in landscape
<bshah> and it will crash within 1 minute max
kaspter has quit [Quit: kaspter]
<bshah> (username password kde/123456)
<enunes> ok, no promises but maybe next week I can give it a try
<bshah> thanks, meanwhile I will see if we can do some "userspace" hack :D
kaspter has joined #lima
mmind00 has quit [Quit: No Ping reply in 180 seconds.]
mmind00 has joined #lima
yann has quit [Ping timeout: 272 seconds]
yann has joined #lima
champagneg has quit [Quit: WeeChat 2.3]
yann has quit [Ping timeout: 260 seconds]
dev1990 has quit [Quit: Konversation terminated!]
dev1990 has joined #lima
yann has joined #lima
warpme_ has joined #lima