#panfrost on 2020-03-25 — irc logs at freenode.irclog.whitequark.org

2019-09-06 11:20 alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature

01:29 icecream95 has quit [Ping timeout: 260 seconds]

01:33 icecream95 has joined #panfrost

01:38 stikonas has quit [Remote host closed the connection]

02:08 <chewitt> alyssa: https://github.com/LibreELEC/mali-bifrost

02:27 vstehle has quit [Ping timeout: 246 seconds]

02:27 yann|work has quit [Ping timeout: 260 seconds]

02:40 vstehle has joined #panfrost

03:04 yann has joined #panfrost

03:21 davidlt has joined #panfrost

03:31 buzzmarshall has quit [Remote host closed the connection]

03:37 davidlt has quit [Ping timeout: 258 seconds]

04:20 chewitt has quit [Quit: Zzz..]

04:25 davidlt has joined #panfrost

04:31 chewitt has joined #panfrost

04:50 _whitelogger has joined #panfrost

05:06 cwabbott_ has joined #panfrost

05:07 cwabbott has quit [Ping timeout: 246 seconds]

05:07 cwabbott_ is now known as cwabbott

05:21 <tomeu> alyssa: yeah, I was already there :/

05:22 <tomeu> flakiness in power supply could lead to the GPU reading wrong values from the cmdstream or behaving erratically, but I think it's far less likely that it would start overwriting random parts of the cmdstream

05:24 <tomeu> that said, I never checked for cmdstream changes in the spurious failures in the gles3 tests on the kevin when running at the highest frequencies

05:24 <tomeu> because I don't have a kevin, mainly :p

05:32 _whitelogger has joined #panfrost

05:59 <tomeu> narmstrong: guess the kbase you are using on the G52 is using the aarch64 page tables?

05:59 <tomeu> if so, wonder if it would be too much work to test with the midgard page table format and see if the same erratic behaviour is observed

06:34 blaze_cornbread has joined #panfrost

06:47 <tomeu> narmstrong: nm, I'm trying to hack aarch64 format support in

06:47 <tomeu> may be faster

07:04 <tomeu> grr, why am I getting permission faults when the GPU tries to write to the cmdstream?

07:05 blaze_cornbread has quit [Quit: blaze_cornbread]

07:31 <tomeu> ah, I was using ARM_64_LPAE_S2 instead of ARM_64_LPAE_S1
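
A possible reading of the S2/S1 mix-up above, as a minimal sketch. The bit values are modelled on the ARM_LPAE_PTE_AP_* and ARM_LPAE_PTE_HAP_* definitions in Linux's drivers/iommu/io-pgtable-arm.c. It is an assumption here that the GPU MMU decodes descriptor bits [7:6] the way an ARMv8 stage-1 walker does. The helper names are hypothetical.

```python
# Permission bits [7:6] of an LPAE leaf descriptor.
# Assumption: the GPU decodes them like an ARMv8 stage-1 table walker.
AP_UNPRIV = 1 << 6   # stage 1: AP[1], unprivileged access allowed
AP_RDONLY = 2 << 6   # stage 1: AP[2], entry is read-only
HAP_READ = 1 << 6    # stage 2: S2AP[0], reads allowed
HAP_WRITE = 2 << 6   # stage 2: S2AP[1], writes allowed


def stage2_prot(read: bool, write: bool) -> int:
    """Encode permissions the way a stage-2 (S2) table format would."""
    return (HAP_READ if read else 0) | (HAP_WRITE if write else 0)


def stage1_writable(pte: int) -> bool:
    """Decode bits [7:6] the way a stage-1 (S1) walker would."""
    return not (pte & AP_RDONLY)


# A read/write mapping built with the S2 encoding sets both bits...
pte = stage2_prot(read=True, write=True)
# ...and an S1 decoder sees the "read-only" bit set in that value.
print(hex(pte), "writable when decoded as S1:", stage1_writable(pte))
```

Under that assumption, the bit that grants write access in the S2 encoding is the same bit that marks an entry read-only in the S1 encoding, which would be consistent with permission faults on GPU writes.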

07:32 <tomeu> robher: I guess the same erratic behavior with the aarch64 page table format, so must be something else

07:49 bbrezillon has joined #panfrost

08:32 <tomeu> robher: narmstrong: robmur01: so, if I check the contents of the command stream *before* submitting to the GPU, in the bad runs it's all zeroes

08:32 <tomeu> in the good runs, it's the expected cmdstream

08:33 <tomeu> so it has nothing to do with the GPU, and rather with how the panfrost kernel driver allocates those buffers
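
The check described above (inspect the command stream before it is submitted) can be sketched as follows. This is only an illustration: `describe_cmdstream` is a hypothetical helper, and the byte strings stand in for a real buffer dump.

```python
def is_all_zero(buf: bytes) -> bool:
    """True when every byte of the buffer is zero."""
    return not any(buf)


def describe_cmdstream(buf: bytes) -> str:
    """Summarise a buffer snapshot taken just before submission."""
    if is_all_zero(buf):
        return "all zeroes (bad run)"
    nonzero = sum(1 for b in buf if b)
    return f"{nonzero} non-zero bytes out of {len(buf)}"


# Stand-in data: a zeroed buffer and one with a few populated fields.
print(describe_cmdstream(bytes(64)))
print(describe_cmdstream(bytes(60) + b"\x01\x02\x03\x04"))
```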

08:34 <narmstrong> tomeu: how is this possible ?

08:35 <tomeu> no idea, because seems to be plain shmem

08:35 <tomeu> so maybe mesa is overwriting the cmdstream?

08:36 <tomeu> no idea how that could happen, though, in this random fashion

08:37 <narmstrong> tomeu: do you dump at panfrost or mesa level ?

08:43 raster has joined #panfrost

08:43 <narmstrong> tomeu: are you on the Odroid-N2 ?

08:44 <narmstrong> tomeu: are you using the Hardkernel U-boot ?

08:45 <narmstrong> I hope it doesn't allocate in a reserved memory zone; it can't be, otherwise the kernel and userspace would crash at some point

08:46 <narmstrong> tomeu: what do I need to reproduce ? I can try on the VIM3

08:47 <narmstrong> I also have the N2 if necessary

08:50 <tomeu> narmstrong: yep, on an odroid-n2

08:50 <tomeu> haven't tried yet to map the buffer from within the kernel

08:51 <tomeu> let me do some sanity checking within mesa first

08:59 <tomeu> narmstrong: afaics, if I make a new mapping, then I read zeroes instead of what I last wrote with a previous mapping

09:04 <tomeu> well, not always, maybe 4 out of 5 times

09:05 icecream95 has quit [Ping timeout: 240 seconds]

09:05 <tomeu> when I read back using the same mapping as when I wrote, then I get the expected contents back fine
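
The write-with-one-mapping, read-with-another experiment can be reproduced in userspace with an ordinary shared file mapping. This is a minimal sketch, assuming a temporary file as a stand-in for the shmem-backed BO; it does not touch the panfrost driver, and `write_then_remap` is a hypothetical name.

```python
import mmap
import tempfile


def write_then_remap(payload: bytes):
    """Write through one shared mapping, then read through a fresh one.

    Returns (read via the original mapping, read via the new mapping).
    The payload must fit in one page.
    """
    size = mmap.PAGESIZE
    with tempfile.TemporaryFile() as f:
        f.truncate(size)
        first = mmap.mmap(f.fileno(), size)   # shared, read/write by default
        first[:len(payload)] = payload
        same_mapping = bytes(first[:len(payload)])
        first.close()

        second = mmap.mmap(f.fileno(), size)  # new mapping of the same pages
        new_mapping = bytes(second[:len(payload)])
        second.close()
    return same_mapping, new_mapping


same, fresh = write_then_remap(b"cmdstream")
print("same mapping:", same, "new mapping:", fresh)
```

With a plain shared mapping both reads are expected to return the written bytes, so zeroes from the second mapping point at the software doing the mapping rather than at the memory itself.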

09:05 cwabbott has quit [Remote host closed the connection]

09:05 <narmstrong> wtf

09:05 cwabbott has joined #panfrost

09:05 <narmstrong> can you share your kernel tree ?

09:06 <tomeu> narmstrong: also, using odroid's u-boot

09:07 <tomeu> yeah, guess I should go back first to the mali page table format

09:08 <narmstrong> if you lose data between maps, it can't be an hw issue

09:09 <tomeu> ok, think it's a hw issue

09:09 <tomeu> when it works, the second mmap is at the same address as before

09:09 <tomeu> when it doesn't, I get a different address

09:10 <tomeu> I suspect some problem with BO caching or so

09:10 <narmstrong> all of this is pure sw

09:10 <tomeu> yeah, sorry, meant the opposite

09:10 <narmstrong> ok

09:11 <narmstrong> are you using a stable kernel tree working for midgard ?

09:14 <tomeu> it's 5.6-rc5 plus the reset hacks, but it's

09:14 <tomeu> ..also the one we use in the Mesa CI

09:35 <tomeu> oops, the thing about different mappings and reading zeroes from there was all my fault

09:35 <tomeu> too many hacks on top of other hacks

09:36 <tomeu> we're back now to zeroes appearing around the fields in the cmdstream that the GPU is expected to be updating

09:37 <tomeu> as if writes were spilling around, could be a problem with the write back cache not being prepopulated?
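
One way to see where writes "spill" is to diff a snapshot taken before submission against one taken after the job completes. This is a sketch under the assumption that both snapshots are available as byte strings; `changed_ranges` is a hypothetical helper.

```python
def changed_ranges(before: bytes, after: bytes):
    """Return half-open (start, end) byte ranges where the snapshots differ."""
    ranges = []
    start = None
    for i, (a, b) in enumerate(zip(before, after)):
        if a != b and start is None:
            start = i                      # a differing run begins
        elif a == b and start is not None:
            ranges.append((start, i))      # the run ended at i
            start = None
    if start is not None:
        ranges.append((start, min(len(before), len(after))))
    return ranges


# Stand-in data: two 4-byte fields get zeroed in the "after" snapshot.
before = bytes([1] * 16)
after = bytearray(before)
after[4:8] = bytes(4)
after[12:16] = bytes(4)
for start, end in changed_ranges(before, bytes(after)):
    print(f"bytes {start}..{end - 1} changed")
```

Comparing the reported ranges with the offsets of the fields the GPU is expected to update shows how far beyond those fields the writes reach.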

10:01 cwabbott has quit [Quit: cwabbott]

10:01 cwabbott has joined #panfrost

10:24 stikonas has joined #panfrost

10:36 <tomeu> alyssa: if one allocates a bo for the checksum data, it gets much more reliable

10:36 <tomeu> and looks like the bigger the BO, the more reliable it becomes :p

10:38 <tomeu> I've been allocating transiently for this experiment, and noticed that if I made it too small, the header descriptor which is allocated next in the transient pool is overwritten with sequences of ff808080 c0008080 :p

10:38 <tomeu> which are the values in the new fields in the extra descriptor
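
Searching a dump for that repeating pair of words can be sketched as below. It is an assumption that the values are stored as little-endian 32-bit words; `find_pattern` is a hypothetical helper and the buffer is made-up test data.

```python
import struct

# Assumption: the two words appear in memory as little-endian 32-bit values.
PATTERN = struct.pack("<II", 0xFF808080, 0xC0008080)


def find_pattern(buf: bytes, pattern: bytes = PATTERN):
    """Return the byte offsets of non-overlapping occurrences of pattern."""
    offsets = []
    pos = buf.find(pattern)
    while pos != -1:
        offsets.append(pos)
        pos = buf.find(pattern, pos + len(pattern))
    return offsets


# Stand-in dump: eight zero bytes, the pattern twice, then four zero bytes.
dump = bytes(8) + PATTERN * 2 + bytes(4)
print([hex(off) for off in find_pattern(dump)])
```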

10:59 <tomeu> alyssa: you should be able to reproduce that with https://gitlab.freedesktop.org/tomeu/mesa/-/commits/bifrost

11:09 robmur01_ has joined #panfrost

11:16 <robmur01_> tomeu: could it be an insufficient alignment thing? i.e. does the point where the corruption starts look like a rounded-up/rounded-down version of some pointer the GPU was previously given?
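
robmur01_'s question can be checked mechanically: round each pointer handed to the GPU up and down to a few candidate alignments and compare with the address where the corruption starts. This is a minimal sketch; the candidate alignments and the addresses are illustrative assumptions, not values from the hardware documentation.

```python
def align_down(addr: int, alignment: int) -> int:
    """Round addr down to a power-of-two alignment."""
    return addr & ~(alignment - 1)


def align_up(addr: int, alignment: int) -> int:
    """Round addr up to a power-of-two alignment."""
    return (addr + alignment - 1) & ~(alignment - 1)


def matching_alignments(corruption_start, pointer,
                        alignments=(64, 128, 256, 1024, 4096)):
    """List (alignment, direction) pairs that map pointer onto corruption_start."""
    hits = []
    for a in alignments:
        if align_down(pointer, a) == corruption_start:
            hits.append((a, "down"))
        elif align_up(pointer, a) == corruption_start:
            hits.append((a, "up"))
    return hits


# Made-up addresses: a pointer at 0x1040 and corruption starting at 0x1080.
print(matching_alignments(0x1080, 0x1040, alignments=(64, 128)))
```

A hit suggests the GPU expects that alignment for the pointer in question; no hit for any pointer makes an alignment explanation less likely.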

11:31 maciejjo has quit [Remote host closed the connection]

11:38 cwabbott has quit [Ping timeout: 246 seconds]

11:40 cwabbott has joined #panfrost

11:47 <tomeu> robmur01_: hard to tell because there's a lot of zeroes around the values that change

11:48 <tomeu> but there was indeed a clear alignment requirement on the first header descriptor, that I already took care of

12:07 <tomeu> hmm, there's a bunch of cache_clean-related functions in mali_kbase_device_hw.c that weren't in the kbase I had before

12:09 <tomeu> one more difference: we don't handle BASE_HW_FEATURE_CLEAN_ONLY_SAFE

12:11 <tomeu> one more:

12:11 <tomeu> + /* Ensure page-tables reads use read-allocate cache-policy in

12:11 <tomeu> + * the L2

12:11 <tomeu> + */

12:11 <tomeu> + transcfg |= AS_TRANSCFG_R_ALLOCATE;

12:31 maciejjo has joined #panfrost

12:33 Depau_ has quit [Quit: ZNC 1.7.5 - https://znc.in]

12:35 Depau has joined #panfrost

12:36 cwabbott has quit [Quit: cwabbott]

12:37 cwabbott has joined #panfrost

12:48 robmur01_1 has joined #panfrost

12:48 robmur01_1 has quit [Client Quit]

12:49 <narmstrong> the bifrost kbase is slightly different

12:49 <narmstrong> kind of astonishing ARM distributes 2 different versions of kbase...

12:49 robmur01_ has quit [Ping timeout: 250 seconds]

12:50 robmur01_ has joined #panfrost

12:50 robmur01_ has quit [Client Quit]

12:51 robmur01_ has joined #panfrost

12:54 <chewitt> someone from Amlogic has previously pointed out that you can use the bifrost kbase on midgard too

12:54 davidlt has quit [Remote host closed the connection]

12:56 enunes has quit [Quit: ZNC 1.7.2 - https://znc.in]

12:59 Depau has quit [Quit: ZNC 1.7.5 - https://znc.in]

13:00 Depau has joined #panfrost

13:05 Depau has quit [Quit: ZNC 1.7.5 - https://znc.in]

13:07 Depau has joined #panfrost

13:17 enunes has joined #panfrost

13:20 yann has quit [Ping timeout: 264 seconds]

13:21 <narmstrong> tomeu: https://gitlab.freedesktop.org/narmstrong/mesa/-/jobs/2053874 `Fatal Python error: initfsencoding: unable to load the file system codec`

13:21 davidlt has joined #panfrost

13:21 enunes has quit [Quit: ZNC - https://znc.in]

13:21 <narmstrong> trying to run your stuff on an aarch64 runner

13:21 enunes has joined #panfrost

13:23 <tomeu> narmstrong: what python version are you using?

13:23 <narmstrong> tomeu: the version in the arm_build !

13:23 <tomeu> oh, so the same docker image?

13:24 <tomeu> well, how could it be? ...

13:24 <narmstrong> no idea

13:24 <tomeu> hmm, I'm quite lost

13:25 <tomeu> let's see if somebody else in this channel has any ideas

13:26 <narmstrong> tomeu: what do you use as aarch64 runner ?

13:27 <tomeu> narmstrong: I think it's in one of the arm64 servers from packet, that we use to build

13:27 <tomeu> daniels: is that right?

13:27 <daniels> correct

13:27 <daniels> it's one of the fd.o shared runners

13:27 Depau has quit [Quit: ZNC 1.7.5 - https://znc.in]

13:29 <narmstrong> ok, but what's the soc ? weird it faults on my runner

13:29 Depau has joined #panfrost

13:30 <daniels> Cavium ThunderX

13:30 <narmstrong> ok, can't compete :-p

13:32 <daniels> i mean, it's not running Gentoo or anything, it's just a Debian system which should run on any armv8 ...

13:32 <daniels> this _shouldn't_ be it, but could you push a script change which executes 'locale' right before it tries to run the Python script which fails?

13:32 yann has joined #panfrost

13:32 <narmstrong> yeah I know, the system is running ubuntu with a shitload of python already

13:33 <narmstrong> I restarted a pipeline, and I'll do that if it still fails

13:33 <daniels> on that machine, LANG/LANGUAGE/LC_ALL are all unset, and the rest of the LC_* come out as POSIX

13:33 <daniels> you can do this to get a shell in the exact same environment btw: docker run -ti registry.freedesktop.org/narmstrong/mesa/debian/arm_build:2020-03-24 /bin/bash

13:38 tomboy64 has quit [Remote host closed the connection]

13:39 tomboy64 has joined #panfrost

13:39 <narmstrong> https://www.irccloud.com/pastebin/ZAsAslpL/

13:41 <narmstrong> ok, python3 faults alone

13:47 <narmstrong> ok with registry.freedesktop.org/tomeu/mesa/debian/arm_build:2020-03-24 it's fine :-/

13:50 <tomeu> narmstrong: in case you can spot something obvious: https://gitlab.freedesktop.org/tomeu/linux/-/tree/bifrost

13:50 <tomeu> it belongs exactly the same with either aarch64 page tables or mali legacy

13:50 <tomeu> s/belongs/behaves

14:05 <daniels> narmstrong: so that's an interesting point of difference then - does 'LC_ALL=C python' work?

14:05 <daniels> or 'LANG= python3'
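
The locale experiments daniels suggests can be scripted so they run the same way inside the container and on the host. This is a sketch: it starts a child interpreter with the overridden variables and reports whether it starts at all and which filesystem encoding it picks. `interpreter_starts` is a hypothetical helper, and using `sys.executable` as the interpreter under test is an assumption.

```python
import os
import subprocess
import sys


def interpreter_starts(extra_env):
    """Run a child Python with extra_env applied on top of the current env.

    Returns (exit code, filesystem encoding reported by the child).
    """
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c",
         "import sys; print(sys.getfilesystemencoding())"],
        env=env, capture_output=True, text=True,
    )
    return result.returncode, result.stdout.strip()


# The two overrides mentioned above: LC_ALL=C and an empty LANG.
for override in ({"LC_ALL": "C"}, {"LANG": ""}):
    print(override, interpreter_starts(override))
```

A non-zero exit code with an empty encoding would match the `initfsencoding` failure; a clean start under both overrides points away from the locale settings.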

14:21 megi has quit [Quit: WeeChat 2.7.1]

14:21 megi has joined #panfrost

14:37 enunes has quit [Quit: ZNC - https://znc.in]

14:47 <narmstrong> daniels: i deleted all my images in the registry and now it's ok, seems something got corrupted

14:49 <daniels> narmstrong: bizarre! thanks for working through it though :)

14:56 <narmstrong> daniels: tomeu: got the `Serve files for LAVA via separate service` running, but with an internal URL (no https, :8080 port)

14:57 <narmstrong> I think the FILES_HOST_NAME and a new FILES_HOST_URL should be a runner-specific variable

14:57 <daniels> narmstrong: \o/ you should be able to change the URL definition to have your LAVA dispatcher variable pull from there

14:57 <narmstrong> done https://gitlab.freedesktop.org/narmstrong/mesa/-/jobs/2055042

14:57 <daniels> yeah, that would work

14:58 <daniels> narmstrong: awesome! thanks a lot!

16:04 anarsoul|c has joined #panfrost

16:08 cwabbott has quit [Read error: Connection reset by peer]

16:08 cwabbott has joined #panfrost

16:46 clementp[m] has quit [Ping timeout: 240 seconds]

16:47 thefloweringash has quit [Ping timeout: 256 seconds]

16:52 thefloweringash has joined #panfrost

16:55 clementp[m] has joined #panfrost

16:58 mixfix41 has quit [Quit: Leaving.]

17:12 robmur01_ has quit [Quit: robmur01_]

17:12 robmur01_ has joined #panfrost

17:35 raster has quit [Quit: Gettin' stinky!]

18:15 rcf has quit [Quit: WeeChat 2.7]

18:16 rcf has joined #panfrost

18:26 raster has joined #panfrost

19:08 chewitt has quit [Quit: Zzz..]

19:51 icecream95 has joined #panfrost

19:57 chewitt has joined #panfrost

20:01 TheKit has quit [Ping timeout: 246 seconds]

20:11 icecream95 has quit [Ping timeout: 250 seconds]

20:11 icecream95 has joined #panfrost

20:47 mias has joined #panfrost

20:47 mias_ has quit [Ping timeout: 256 seconds]

20:55 adjtm_ has joined #panfrost

20:57 adjtm has quit [Ping timeout: 260 seconds]

21:33 davidlt has quit [Ping timeout: 250 seconds]

21:51 robmur01_ has quit [Quit: robmur01_]

22:02 cwabbott has quit [Quit: cwabbott]

22:35 raster has quit [Quit: Gettin' stinky!]