#etnaviv on 2021-03-22 — irc logs at freenode.irclog.whitequark.org

2020-05-12 17:40 austriancoder changed the topic of #etnaviv to: #etnaviv - the home of the reverse-engineered Vivante GPU driver - Logs https://freenode.irclog.whitequark.org/etnaviv

01:30 chewitt has joined #etnaviv

02:30 karolherbst has quit [Read error: Connection reset by peer]

02:31 karolherbst has joined #etnaviv

02:54 chewitt has quit [Quit: Adios!]

03:07 afaerber has quit [Ping timeout: 260 seconds]

03:07 afaerber has joined #etnaviv

08:10 Net147 has quit [Quit: Quit]

08:33 Net147 has joined #etnaviv

08:42 Net147 has quit [Quit: Quit]

08:43 Net147 has joined #etnaviv

08:57 lynxeye has joined #etnaviv

10:15 pcercuei has joined #etnaviv

12:45 <marex> hm, is it possible that the vivante GPU is overwriting memory randomly ?

12:45 <marex> and if so, is there some way to debug that ?

12:46 <marex> austriancoder: lynxeye: ^

12:47 <austriancoder> there is an iommu .. so.. chances are low

12:48 <marex> austriancoder: that iommu is disabled on the MP1 because ... hold on

12:48 <marex> 1db012790446 ("drm/etnaviv: move linear window on MC1.0 parts if necessary")

12:48 <marex> this

12:49 <marex> but wait, that only talks about fast clear, I thought there was something about mmuv1 too

12:52 <marex> austriancoder: the issue I am seeing happens once every 3-4 hours under heavy load ... so chances of triggering it are indeed low

12:52 <marex> austriancoder: it is the same issue I have been chasing for the past 2-3 weeks btw

13:05 <austriancoder> do you see 'only' misrenderings or even crashed apps etc. due to overwritten memory?

13:06 <austriancoder> maybe there is a race condition regarding BOs

13:07 <marex> austriancoder: I see the machine print a BUG about a corrupted page

13:08 <marex> austriancoder: and sometimes the machine even reboots

13:08 <marex> austriancoder: I think something like that is happening

13:08 <marex> austriancoder: but how do you debug that ?

13:16 <lynxeye> marex: on MMUv1 you have a 2GB linear window, so you can bypass the MMU with reads/writes through that window

13:17 <lynxeye> MMUv2 has real isolation and all accesses go through the MMU, so on MMUv2 chances for random memory corruption are much lower

13:19 <austriancoder> stm32 should have mmuv2

13:21 <lynxeye> austriancoder: You sure about this? My recollection is that the STM32 is MMUv1 (but MC2.0).

13:25 <lynxeye> marex: You can force buffers to be mapped through the MMU with a BO flag, but that only helps if your issue isn't caused by random state corruption. Also MMUv1 isn't able to trigger exception IRQs, but just retargets accesses to the bad page. So you need to manually check that the bad page magic isn't overwritten in order to find out if some access is going astray.
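The manual check lynxeye describes (confirm that the bad page still holds its magic fill) can be modelled in user space. This is only a sketch of the idea, not driver code: the fill value `0xDEAD55AA`, the 4 KiB page size, and the helper names `make_bad_page` and `find_overwritten_words` are assumptions made for illustration, so check the pattern your kernel actually writes.

```python
import struct

# Assumed fill pattern and page size; both are illustrative, not taken from the log.
BAD_PAGE_MAGIC = 0xDEAD55AA
PAGE_SIZE = 4096


def make_bad_page(magic=BAD_PAGE_MAGIC, size=PAGE_SIZE):
    """Build a page-sized buffer filled with the 32-bit magic (little-endian)."""
    return bytearray(struct.pack("<I", magic) * (size // 4))


def find_overwritten_words(page, magic=BAD_PAGE_MAGIC):
    """Return (byte offset, value) for every 32-bit word that no longer holds the magic."""
    hits = []
    for offset in range(0, len(page) - len(page) % 4, 4):
        (value,) = struct.unpack_from("<I", page, offset)
        if value != magic:
            hits.append((offset, value))
    return hits


if __name__ == "__main__":
    page = make_bad_page()
    # Simulate a stray GPU write that was retargeted to the bad page.
    struct.pack_into("<I", page, 0x40, 0x12345678)
    for offset, value in find_overwritten_words(page):
        print(f"offset {offset:#x}: {value:#010x}")
```

Any hit means some access went astray and was redirected to the bad page. The offset within the page only tells you the low address bits of that access, so it narrows the search rather than identifying the culprit.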

13:55 <austriancoder> lynxeye: I just booted up a stm32: minor_features1: 0xbe13b219 & chipMinorFeatures1_MMU_VERSION (0x10000000) --> ETNAVIV_IOMMU_V2

14:05 <lynxeye> austriancoder: Okay, thanks, so my memory was wrong.

14:06 <austriancoder> time for a debugfs patch to print the mmu version :)

14:16 <marex> austriancoder: but then why disable TS on MP1 ?

14:17 <marex> but then that might mean there is a random state corruption

14:23 <austriancoder> marex: mmu and mc are two different things

14:23 <marex> austriancoder: from what I understand from the above, MP1 should be MC2 and IOMMU2 ?

14:25 <austriancoder> no: minor_features0: 0xe1299fff & chipMinorFeatures0_MC20 (0x00400000) --> MC2 .. but that bit is not set --> stm32 = mc1 + mmuv2

14:26 <lynxeye> marex: Nope, seems I mixed things up. According to feature bits the GC400 on STM32 is MC1.0, but MMUv2.
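The feature-bit reasoning above can be reproduced in a few lines. The register values and bit masks below are the ones quoted in the log; the function name `classify` is hypothetical and not part of the etnaviv driver, so treat this as a sketch of the decoding step only.

```python
# Bit masks as quoted by austriancoder in the log above.
CHIP_MINOR_FEATURES0_MC20 = 0x00400000
CHIP_MINOR_FEATURES1_MMU_VERSION = 0x10000000


def classify(minor_features0, minor_features1):
    """Return (memory controller, MMU) generation implied by the feature bits."""
    mc = "MC2.0" if minor_features0 & CHIP_MINOR_FEATURES0_MC20 else "MC1.0"
    mmu = "MMUv2" if minor_features1 & CHIP_MINOR_FEATURES1_MMU_VERSION else "MMUv1"
    return mc, mmu


if __name__ == "__main__":
    # Values read from the STM32 GC400 in the log.
    print(classify(0xE1299FFF, 0xBE13B219))  # ('MC1.0', 'MMUv2')
```

The two bits are independent, which is austriancoder's point that "mmu and mc are two different things": a part can combine the older memory controller with the newer MMU.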

14:26 <marex> meaning my problem is likely state corruption ?

14:29 <lynxeye> marex: Nope, meaning that it is very unlikely that the GPU is writing to unmapped regions (that could be a result of state corruption) as MMUv2 signals bad writes via an exception. So if it's really the GPU corrupting sysmem, the address must be mapped into the MMU address space.

14:30 <marex> lynxeye: I am seeing various MMU errors in the kernel log here and there, could those be related then ?

15:55 chewitt has joined #etnaviv

16:08 <austriancoder> marex: if there is an mmu exception for etnaviv you could try to collect a devcoredump

16:08 <austriancoder> marex: https://github.com/etnaviv/etna-gpu-tools

16:22 <marex> austriancoder: er, details please ?

16:52 <austriancoder> marex: that repo contains a udev rule and a devcoredump extractor. sudo make install should work. Then if an etnaviv mmu exception happens you get all the information you need to find the cause.
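For readers who only want to save the raw dump before it disappears, a minimal sketch follows. It is independent of the etna-gpu-tools repository and does not decode anything. It assumes the generic kernel devcoredump interface, where each pending dump appears as `/sys/class/devcoredump/devcd<N>/data`. The function name `collect_devcoredumps` and the `.bin` output naming are made up for this example.

```python
import glob
import os
import shutil


def collect_devcoredumps(out_dir, sysfs_root="/sys/class/devcoredump"):
    """Copy each pending devcoredump to out_dir and return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for data_path in sorted(glob.glob(os.path.join(sysfs_root, "devcd*", "data"))):
        # Name the output after the devcd<N> directory that holds the dump.
        name = os.path.basename(os.path.dirname(data_path))
        dest = os.path.join(out_dir, name + ".bin")
        with open(data_path, "rb") as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)
        saved.append(dest)
    return saved


if __name__ == "__main__":
    # Reading the sysfs files usually requires root.
    for path in collect_devcoredumps("./devcoredumps"):
        print("saved", path)
```

The kernel drops a pending dump after a timeout, so the copy has to run soon after the exception. That is why a udev rule, as shipped in the repository mentioned above, is the more practical trigger than running a script by hand.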

16:53 <austriancoder> fyi: during deqp runs in CI I always get a devcoredump

16:53 <marex> austriancoder: I didn't manage to crash the machine when running deqp in a loop for days

17:35 lynxeye has quit [Quit: lynxeye]

19:05 shoragan has quit [Ping timeout: 258 seconds]

19:08 shoragan has joined #etnaviv

21:16 JohnnyonFlame has joined #etnaviv