buzzmarshall has quit [Remote host closed the connection]
kinkinkijkin has joined #panfrost
Green has joined #panfrost
Green has quit [Quit: Ping timeout (120 seconds)]
Green has joined #panfrost
vstehle has joined #panfrost
rcf has quit [Quit: WeeChat 2.7]
rcf has joined #panfrost
rcf has quit [Client Quit]
rcf has joined #panfrost
Elpaulo has quit [Quit: Elpaulo]
nerdboy has joined #panfrost
icecream95 has quit [Quit: leaving]
kinkinkijkin has quit [Remote host closed the connection]
NeuroScr has quit [Ping timeout: 240 seconds]
NeuroScr has joined #panfrost
yann has joined #panfrost
yann has quit [Ping timeout: 272 seconds]
NeuroScr has quit [Read error: Connection reset by peer]
NeuroScr has joined #panfrost
raster has joined #panfrost
<la-s>
alyssa: how would I debug bad GPU performance? I am using sway now, and though it works mostly great (the background has some graphical glitches), the performance is still not great, just as with weston.
<la-s>
was thinking of trying to fix it myself
nerdboy has quit [Ping timeout: 264 seconds]
<tomeu>
la-s: first step is figuring out if the bottleneck is cpu or gpu
<la-s>
good point yeah
<la-s>
should figure out how to profile sway
adjtm has joined #panfrost
Green has quit [Ping timeout: 256 seconds]
adjtm_ has quit [Ping timeout: 256 seconds]
Green has joined #panfrost
<tomeu>
well, if it's gpu, then you can look at performance counters to figure out why
<tomeu>
but if it's cpu, then something like perf top could give an indication quite quickly
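A minimal sketch of that cpu-vs-gpu triage, assuming a sway or weston session is already running; the perf invocations are standard, but the devfreq path uses a wildcard because the GPU node name is board-specific:

    # CPU side: see which functions are hot while the compositor is busy
    sudo perf top -g

    # or sample just the compositor for a few seconds, then inspect
    sudo perf record -g -p "$(pidof sway)" -- sleep 10
    sudo perf report

    # GPU side, as a rough proxy: is the GPU pinned at its top frequency?
    cat /sys/class/devfreq/*.gpu/cur_freq
    cat /sys/class/devfreq/*.gpu/available_frequencies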
<alyssa>
robher: I'm seeing some pretty serious regressions in 5.6 (from 5.4)
<alyssa>
Easy reproduction: open weston and run glmark2-es2-wayland -bterrain
<alyssa>
(Or even -bshadow)
<alyssa>
Anything that uses FBOs is hosed.
<alyssa>
Even glmark2-es2-drm -bterrain (w/o a display manager) reproduces.
yann has joined #panfrost
<alyssa>
I've downgraded to 5.4 in the meantime.
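For anyone else trying the reproduction above, a hedged sketch; the benchmark flags are taken from the log, and the dmesg filter is just one convenient way to watch for faults:

    # inside a running weston session
    glmark2-es2-wayland -bterrain
    glmark2-es2-wayland -bshadow

    # or without a compositor, straight against the DRM device
    glmark2-es2-drm -bterrain

    # in another terminal, watch for panfrost job timeouts / faults
    sudo dmesg --follow | grep -i panfrost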
<urjaman>
is that the same thing that i have with 5.7 (rc any) or less severe? (it complains a bunch, fails to reset the gpu, and eventually just kinda hangs the process doing the GPU stuff)
<urjaman>
and yeah i jumped from 5.4 to 5.7rc so it could've been introduced in 5.6 for all i know
<alyssa>
urjaman: not sure, try the above repro (super obvious with weston)
<urjaman>
... i'ma build glmark2 then ...
<alyssa>
urjaman: fair enough :p
<alyssa>
it's a fast build, dw
<urjaman>
yeah more surprised i havent used it before
<urjaman>
i've legit just been super lazy since 5.4 works fine for me :P
<alyssa>
relatable
<urjaman>
umm i'll update mesa too first
<urjaman>
i did a for-comparison test of running -bterrain on weston and got weston crashing after a few seconds, with a "pan_bo.c:176: pan_bucket_index: Assertion 'bucket_index >= MIN_BO_CACHE_BUCKET' failed" in the terminal
<urjaman>
(comparison on 5.4 that is...)
<urjaman>
i assume that's fixed already but like whoops
<urjaman>
good idea to do a control test first :P
<alyssa>
uhhhh
<urjaman>
we'll see after about some 800 objects by this lap warmer of a C201 :P
<urjaman>
yep updated mesa, and this repro runs fine on 5.4, now to reboot into 5.7rcsomething ÖP
<urjaman>
*:P
<alyssa>
Nyoof
<urjaman>
okay interesting ... it flickered white like a handful of times and dmesg shows a bunch of gpu sched timeouts and 2+ faults
<urjaman>
actually two faults and one "There were multiple GPU faults - some have not been reported"
<alyssa>
urjaman: That sounds about right
<alyssa>
I mean wrong but
<urjaman>
and now i need to ssh in to restart this thing because i tried to start my Xorg session (just to confirm it still fails the same way-ish i guess... yup it was laggy and then hung a bit after starting firefox, same as before)
<urjaman>
... i suppose that was pointless since the kernel doesnt manage to reboot from a "reboot -f" in this state
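When the machine wedges like that, capturing the kernel log from another host before it stops responding is the easiest way to keep the evidence; a sketch, with the hostname obviously a placeholder:

    # stream the kernel log from the affected board and keep a copy locally
    ssh user@c201 dmesg --follow | tee panfrost-fault.log

    # or, if the systemd journal is persistent, read it back after the forced reboot
    ssh user@c201 journalctl -k -b -1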
<Lyude>
alyssa: sounds like it's time for a bisect?
<alyssa>
I mean wrong but
<alyssa>
uhm
<alyssa>
silly arrow keys
<alyssa>
Lyude: probably, yeah. though there haven't been many changes, so
<Lyude>
might be a change outside of panfrost maybe
davidlt has quit [Ping timeout: 260 seconds]
<alyssa>
Perhaps
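If it does come down to a bisect, the usual shape of it between those two kernels would be something like the sketch below; the good/bad tags come from the conversation, everything else is generic:

    git bisect start
    git bisect bad v5.6      # first known-bad release
    git bisect good v5.4     # last known-good release
    # build + boot the suggested commit, run the glmark2 repro, then:
    git bisect good          # or: git bisect bad
    # repeat until git names the first bad commit, then:
    git bisect reset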
<robmur01>
hmm, -ENOREPRO here: 5.4-rc7 and glmark2-es2-drm runs all the way through just fine
<bbrezillon>
robmur01: the problem is on 5.6+
<robmur01>
derp, that was supposed to say 5.7-rc4
<robmur01>
been playing with Firefox under GDM with 5.6/5.7-rc with no issue either
<alyssa>
Anyway, I have thousands of conformance fails to fix for fp16 now. tata :p
<urjaman>
my kernel building process isnt really set up for bisecting :/
<urjaman>
i guess i could set something up, but like that sounds like work
<urjaman>
i guess i should check with 5.7-rc4 for completeness too (my last one was rc3)
<urjaman>
but right now upgrading the Arch linux on my C201 (since i realized that was over a month old too)
<alyssa>
Ahhh working in Weston feels so different after being in GNOME for so long
<urjaman>
somehow the situation(TM) feels like time doesnt exist (and isnt really moving) but then suddenly you havent updated your linuces in a month+
<alyssa>
urjaman: I had a terrible nightmare a few days ago where there was a worldwide pandemic
<urjaman>
alyssa: how do you distinguish that from reality tho
<alyssa>
I was asleep.
<urjaman>
ah yeah that bit
<alyssa>
fails.txt is 1519 lines long, wee. but just fixed a bunch
<alyssa>
so down to 1438 :P
<alyssa>
er 1133, one thing fixed a bunch
* alyssa
under 1000 in her to-triage list, this is going faster than expected :~)
<bbrezillon>
alyssa: same as robmur01, works fine here with mesa/master and linux/master (AKA 5.7-rc4)
<alyssa>
bbrezillon: Maybe something was fixed between 5.6.1 and master?
<bbrezillon>
I can test on 5.6.1
<alyssa>
vmlinuz-5.6.0-1-arm64 from debian
<bbrezillon>
ok, so 5.6
<bbrezillon>
alyssa: and I did not test things extensively, just ran glmark2 under weston
<alyssa>
also not sure why I'm not seeing a statistically significant fps difference with fp16 on glmark
<alyssa>
I guess except for -bterrain, register pressure isn't the bottleneck since they're simple enough
<HdkR>
Not bounded by ALU? :)
<alyssa>
HdkR: Well, lower pressure ==> more threads in flight
<HdkR>
Ah right
<alyssa>
But if it's memory bound, well.
<HdkR>
Sounds like we just need more SoCs with >100GB/s memory bandwidth
<robmur01>
Oh FFS... how do we keep forgetting this? :P
<robmur01>
what does -bterrain do? pretty much guarantee running at max OPP
<robmur01>
what landed since 5.4? The generic OPP support that broke voltage scaling :(
<robmur01>
default GPU voltage on my board seems to be nominally 1.0V, so probably close enough to the top OPP's 1.1V to squeak by
<robmur01>
and more than enough for 600MHz and below
<alyssa>
robmur01: sorry? :innocent:
<urjaman>
oh i thought that was something that only applied to some other board
<urjaman>
not to everything
<urjaman>
(like yes i had read about it here but...)
<urjaman>
(also, how many kernel versions you need to fix setting a voltage...............................)
<robmur01>
urjaman: the default voltage (and thus how likely higher OPPs are to go wrong) is somewhat board-dependent
<robmur01>
Chromebooks seem to hurt the most since they have a different regulator setup to most reference-design-based boards
<robmur01>
as far as I've seen, fixing it has turned out to be really quite fiddly thanks to awkward interaction between the regulator and devfreq APIs, and both devfreq and/or explicit regulators being optional from our PoV
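A rough way to check whether a given board is hitting that voltage-scaling problem, assuming debugfs is mounted; the regulator and devfreq node names vary per board, and the OPP debugfs layout can differ between kernel versions:

    # what voltage is the GPU regulator actually at?
    grep . /sys/class/regulator/regulator.*/name
    cat /sys/class/regulator/regulator.*/microvolts

    # which OPPs (frequency/voltage pairs) did the driver register?
    sudo ls /sys/kernel/debug/opp/

    # and which frequency is the GPU currently asked to run at?
    cat /sys/class/devfreq/*.gpu/cur_freq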
<alyssa>
Erg why is this test failing CI but passing local
<robmur01>
"Continuous Instability"
<alyssa>
>:D
<urjaman>
that also applies to my experience with the kernel development process
<alyssa>
Oh, joy - the behaviour changes with gles3 exposed
<alyssa>
Okay, I see the problem. But making that test pass still doesn't fix -bterrain
icecream95 has joined #panfrost
<robmur01>
does `echo 300000000 | sudo tee /sys/class/devfreq/ff9a0000.gpu/max_freq` fix it?
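For reference, a sketch of that experiment end to end; the ff9a0000.gpu path is from robmur01's command and is board-specific, and the value written should be one of the frequencies the driver advertises:

    # pick a low OPP from what the driver actually exposes
    cat /sys/class/devfreq/ff9a0000.gpu/available_frequencies

    # cap the GPU to it and re-run the glmark2 repro
    echo 300000000 | sudo tee /sys/class/devfreq/ff9a0000.gpu/max_freq

    # undo by writing the highest advertised frequency back into max_freq (or reboot)
    cat /sys/class/devfreq/ff9a0000.gpu/max_freq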
<icecream95>
Speaking of things that got broken in the last few kernel releases, the microphone doesn't work anymore on c201 - it tries recording through the speaker instead
<alyssa>
has it worked recently?
<alyssa>
it's been broken on kevin since forever..
<icecream95>
alyssa: I'm pretty sure it was working on 5.3, or at least 5.1
<alyssa>
Neigh
<alyssa>
(have you tried various alsa devices btw?)
<alyssa>
still a bug but maybe a userspace workaround
<icecream95>
I spent a while trying to change stuff in alsamixer, but didn't manage to fix it
<alyssa>
meh
<alyssa>
(also, same here for kevin but I digress)