austriancoder changed the topic of #etnaviv to: #etnaviv - the home of the reverse-engineered Vivante GPU driver - Logs https://freenode.irclog.whitequark.org/etnaviv
DPA has quit [Ping timeout: 240 seconds]
DPA has joined #etnaviv
mth has quit [Quit: Konversation terminated!]
mth has joined #etnaviv
srk has joined #etnaviv
lynxeye has joined #etnaviv
<marex> lynxeye: hey, morning, so now that I have the devcoredump(s) , what next ? I can run viv-unpack on those and they indicate MMU checks failed with a long list of entries
<austriancoder> marex: you are a really impatient guy
<austriancoder> marex: when doing the viv_unpack it tells you where the fetch engine was stuck .. in your case it is in the ring buffer
<austriancoder> marex: use the cmd stream dumper on the ring buffer and see what the GPU should do at the bad address
<austriancoder> marex: also look at what the MMU fault address is
<austriancoder> marex: this should give you a starting point
<marex> austriancoder: impatient ? I've been at this for a month and a half already with zero progress ...
<lynxeye> marex: I don't think the MMU checker supports MMUv2, so ignore those errors.
<marex> austriancoder: how did you find out it's in the ring ?
<marex> lynxeye: ha, ok
<marex> austriancoder: the cmd stream dumper is this dump_cmdstream.py from etna_viv ?
<austriancoder> marex: viv-unpack tells you that it is in the ring: "* 2 ring 00001000 00001000 4096" -- with a little * at the beginning of the line
<austriancoder> marex: yes
<marex> $ ./tools/dump_cmdstream.py ring.bin gives me 'Magic value 40000005 not recognized'
<marex> maybe the format is different than the FDR files ?
<austriancoder> marex:
<austriancoder> marex: sorry.. should be dump_separate_cmdbuf.py
<marex> austriancoder: ah ok, that crashes on UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 64: ordinal not in range(128)
<marex> austriancoder: I'll take a look into this
<austriancoder> marex: ./dump_separate_cmdbuf.py -b /stuff/ring.bin
<austriancoder> works for me
<marex> austriancoder: oh, yeah, that works
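(For reference, the extraction flow used above as a minimal sketch. It assumes viv-unpack takes the devcoredump file as its only argument and writes ring.bin / bo-*.bin into the current directory, and that dump_separate_cmdbuf.py sits there as well; tool names, flags and output filenames are taken from this conversation, not verified against the repositories.)

    #!/usr/bin/env python3
    # Sketch of the triage flow discussed above: unpack the devcoredump,
    # then decode the extracted ring buffer with etna_viv's command stream dumper.
    # The viv-unpack command line and the ring.bin filename are assumptions.
    import subprocess
    import sys

    def triage(devcoredump_path):
        # Step 1: unpack the devcoredump; the tool is expected to print which
        # buffer the FE was stuck in (the line marked with '*') and to write
        # ring.bin / bo-*.bin files.
        subprocess.run(["viv-unpack", devcoredump_path], check=True)

        # Step 2: decode the ring buffer as a bare command stream.
        subprocess.run(["./dump_separate_cmdbuf.py", "-b", "ring.bin"], check=True)

    if __name__ == "__main__":
        triage(sys.argv[1])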
pcercuei has joined #etnaviv
<marex> austriancoder: all right, the MMU fault address is nowhere in the ring dump
<marex> but there is a BO with that address
<marex> bo-fd945000.bin in this dump here
<lynxeye> marex: What's the thing that is supposed to be executed at that ring address or slightly before?
<lynxeye> A cache flush?
<marex> lynxeye: what address ?
<lynxeye> the ring address where the FE stopped
<marex> lynxeye: that's the MMU fault address? that address is not in the ring dump
<austriancoder> marex: 00000660 = 00000815 Cmd: [stall DMA: idle Fetch: valid] Req idle Cal idle
<austriancoder> 00000664 = 00001138 Command DMA address
<austriancoder> 00001138 is the address
<marex> 1138 is also not in the ring dump
<austriancoder> 00001138 = the ring address where the FE stopped
<lynxeye> marex: 1138 is an address in the ring
<lynxeye> What command is at that address?
<marex> dump_separate_cmdbuf.py -b ring.bin | grep 1138 gives 0 results
<austriancoder> that's not how it works..
<marex> that's not what the documentation says ;-)
* austriancoder is in a video conference the next 15 minutes
<marex> https://paste.debian.net/hidden/6e983c05/ could it be line 85 here ?
<marex> lynxeye: ^
<austriancoder> marex: https://paste.debian.net/1192588/
<austriancoder> lynxeye: some flushing is happening
<marex> austriancoder: so where is the 1138 in that ?
<austriancoder> marex: 00001138 Command DMA address ... ring starts at 00001000 --> offset 138
<marex> austriancoder: how do you get from 0x1000 to offset 138 ?
<marex> austriancoder: how do you get from 0x1000 to offset 0x138 ?
<austriancoder> as I wrote.. the FE DMA engine stopped at 0x1138 .. that address is in the ring (which starts at 0x1000) -> offset = 0x138
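(The lookup being described is just the subtraction below. The register values 0x660/0x664 and the addresses are the ones quoted above; the variable names are only for illustration.)

    # Sketch of the address math above: the FE "Command DMA address" register
    # (0x664) holds a GPU virtual address, while dump_separate_cmdbuf.py prints
    # offsets relative to the start of the buffer, so the two must be reconciled.
    RING_BASE = 0x1000        # GPU VA where the kernel ring buffer is mapped
    FE_DMA_ADDRESS = 0x1138   # register 0x664 from the devcoredump register dump

    ring_offset = FE_DMA_ADDRESS - RING_BASE
    print(f"FE stopped at ring offset {ring_offset:#x}")   # -> 0x138
    # Look at that offset (and the commands just before it, e.g. a cache flush)
    # in the dump_separate_cmdbuf.py output, not for the literal value 0x1138.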
lrusak has quit [Ping timeout: 248 seconds]
<marex> austriancoder: d'oh ...
<austriancoder> next video call is coming quickly
<marex> sigh
chewitt has quit [Quit: Adios!]
<lynxeye> marex: Is there a buffer at the MMU fault address?
<marex> there is bo-fd945000.bin produced by viv_unpack, which matches the MMU fault address
<marex> lynxeye: ^
<lynxeye> hm, maybe now is the time to add MMUv2 support to the checker...
<lynxeye> If the BO is in the dump, it's clearly still alive and there is no MMU context switching going on in your ring dump
<lynxeye> so I'm not sure why this would fault
<lynxeye> What's the fault status?
<marex> 2
<marex> lynxeye: 2
<lynxeye> which is a page not present, so either the pagetables are really wrong, or we are hitting some more obscure GPU bug
<marex> lynxeye: what exactly does "page not present" mean ?
<lynxeye> marex: Some engine of the GPU tried to access an address where no valid VA->PA translation is present in the pagetables.
<marex> lynxeye: the GPU page tables, right ?
<lynxeye> yep
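(A quick sketch of what "fault status 2" decodes to, assuming the usual Vivante MMU fault encoding as reported by the etnaviv kernel driver: one 4-bit nibble per MMU, with the non-zero value indexing a reason table. The nibble layout and the table below are assumptions borrowed from the driver; only the "page not present" meaning of value 2 is taken from this conversation.)

    # Sketch: decoding a Vivante MMU fault status word.
    FAULT_REASONS = [
        "slave not present",
        "page not present",        # <- the status 2 seen in this dump
        "write violation",
        "out of bounds",
        "read security violation",
        "write security violation",
    ]

    def decode_mmu_status(status):
        for mmu in range(4):
            nibble = (status >> (mmu * 4)) & 0xF
            if not nibble:
                continue
            reason = FAULT_REASONS[nibble - 1] if nibble - 1 < len(FAULT_REASONS) else "unknown"
            print(f"MMU {mmu}: fault reason {nibble} = {reason}")

    decode_mmu_status(0x2)   # -> MMU 0: fault reason 2 = page not present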
<marex> lynxeye: could there be some race ?
<marex> i.e. the pagetables are populated too late ?
<lynxeye> I wouldn't rule out that possibility. But what's happening here is that you return from a user command stream and only the forced cache flush hits the MMU fault. I would expect that the user command stream did in fact write something into the buffer already, so the page entries should have been there.
<lynxeye> So from that it seems more likely that the pagetables get depopulated too early, but I'm not sure how this would happen, as long as the BO is alive. We don't depopulate pagetable entries for live BOs on MMUv2 IIRC.
<marex> lynxeye: I am running glmark on weston, I would expect that to be rather linear, i.e. not something that would trigger races
<marex> lynxeye: could it be one of the locking issues we had with BOs ?
<marex> but then, with glmark ... that sounds odd
<lynxeye> I don't think so. The kernel's view of the buffer's alive status seems to match what is programmed into the GPU hw.
<lynxeye> Really the first thing now would be to actually type up that MMU checker for MMUv2, to see if the pagetables look sane.
<marex> lynxeye: is that something that goes into viv_unpack, or does it also need changes in the devcoredump kernel part ?
<lynxeye> marex: The kernel already dumps the pagetables. viv_unpack needs to learn the new dump format for MMUv2.
<marex> oh
<lynxeye> MMUv1 is just a large linear one level pagetable. MMUv2 is a two level pagetable. First page in the dump is first level. Following pages are the populated second level entries.
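(Based on that description, a checker for the MMUv2 part of the dump could look roughly like the sketch below. The two-level walk, the page order in the dump and the 32-bit entries with a present bit in bit 0 are assumptions modelled on the kernel's MMUv2 code; the actual dump format would need to be confirmed against etnaviv_dump / etnaviv_iommuv2 before relying on it.)

    # Rough sketch of an MMUv2-aware check on the viv-unpack MMU dump:
    # page 0 of the dump is the 1024-entry first-level (MTLB) table, the
    # following 4K pages are the populated second-level (STLB) tables, assumed
    # to appear in the order of their MTLB entries. Entry format (32-bit LE,
    # present bit in bit 0, address in the upper bits) is an assumption.
    import struct
    import sys

    PAGE = 4096
    PRESENT = 1 << 0

    def load_pages(path):
        data = open(path, "rb").read()
        return [data[i:i + PAGE] for i in range(0, len(data), PAGE)]

    def entries(page):
        return struct.unpack("<1024I", page)

    def check_address(pages, va):
        mtlb_idx = (va >> 22) & 0x3FF     # which 4MB region
        stlb_idx = (va >> 12) & 0x3FF     # which 4KB page inside it
        mtlb = entries(pages[0])

        if not (mtlb[mtlb_idx] & PRESENT):
            print(f"{va:#010x}: MTLB entry {mtlb_idx} not present")
            return False

        # Assumption: the N-th populated MTLB entry corresponds to dump page N+1.
        stlb_page = 1 + sum(1 for e in mtlb[:mtlb_idx] if e & PRESENT)
        if stlb_page >= len(pages):
            print(f"{va:#010x}: dump has no STLB page for MTLB entry {mtlb_idx}")
            return False

        stlb = entries(pages[stlb_page])
        if not (stlb[stlb_idx] & PRESENT):
            print(f"{va:#010x}: STLB entry {stlb_idx} (dump page {stlb_page}) not present")
            return False

        print(f"{va:#010x} -> PA {stlb[stlb_idx] & ~0xFFF:#010x}")
        return True

    if __name__ == "__main__":
        check_address(load_pages(sys.argv[1]), int(sys.argv[2], 16))

(Run against the MMU portion of the devcoredump and the fault address from the register dump, this would show whether the pagetables really miss the entry or whether the fault points at a different problem.)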
berton has joined #etnaviv
karolherbst has quit [Quit: duh 🐧]
karolherbst has joined #etnaviv
lrusak has joined #etnaviv
chewitt has joined #etnaviv
gbisson has quit [Remote host closed the connection]
flto has quit [Remote host closed the connection]
gbisson has joined #etnaviv
srk has quit [Ping timeout: 260 seconds]
kherbst has joined #etnaviv
karolherbst has quit [Ping timeout: 260 seconds]
kherbst has quit [Client Quit]
karolherbst has joined #etnaviv
pcercuei has quit [Quit: dodo]
berton has quit [Remote host closed the connection]
lynxeye has quit [Quit: lynxeye]