buzzmarshall has quit [Remote host closed the connection]
davidlt has quit [Ping timeout: 258 seconds]
chewitt has quit [Quit: Zzz..]
davidlt has joined #panfrost
chewitt has joined #panfrost
_whitelogger has joined #panfrost
cwabbott_ has joined #panfrost
cwabbott has quit [Ping timeout: 246 seconds]
cwabbott_ is now known as cwabbott
<tomeu>
alyssa: yeah, I was already there :/
<tomeu>
flakiness in power supply could lead to the GPU reading wrong values from the cmdstream or behaving erratically, but I think it's far less likely that it would start overwriting random parts of the cmdstream
<tomeu>
that said, I never checked for cmdstream changes in the spurious failures in the gles3 tests on the kevin when running at the highest frequencies
<tomeu>
because I don't have a kevin, mainly :p
_whitelogger has joined #panfrost
<tomeu>
narmstrong: guess the kbase you are using on the G52 is using the aarch64 page tables?
<tomeu>
if so, wonder if it would be too much work to test with the midgard page table format and see if the same erratic behaviour is observed
blaze_cornbread has joined #panfrost
<tomeu>
narmstrong: nm, I'm trying to hack aarch64 format support in
<tomeu>
may be faster
<tomeu>
grr, why am I getting permission faults when the GPU tries to write to the cmdstream?
blaze_cornbread has quit [Quit: blaze_cornbread]
<tomeu>
ah, I was using ARM_64_LPAE_S2 instead of ARM_64_LPAE_S1
<tomeu>
robher: I get the same erratic behavior with the aarch64 page table format, so must be something else
bbrezillon has joined #panfrost
<tomeu>
robher: narmstrong: robmur01: so, if I check the contents of the command stream *before* submitting to the GPU, in the bad runs it's all zeroes
<tomeu>
in the good runs, it's the expected cmdstream
<tomeu>
so it has nothing to do with the GPU, and rather with how the panfrost kernel driver allocates those buffers
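The pre-submit check described above (is the cmdstream already all zeroes before the GPU ever sees it?) can be sketched roughly as follows. This is only an illustration: `describe_buffer` is a hypothetical helper name, and it works on a plain bytes snapshot of the buffer rather than on a real panfrost BO.

```python
def describe_buffer(buf: bytes) -> str:
    """Classify a buffer snapshot taken just before submission.

    Hypothetical helper: returns "all zeroes" for a zero-filled (or empty)
    snapshot, otherwise reports where the first non-zero byte sits.
    """
    view = bytes(buf)
    stripped = view.lstrip(b"\x00")
    if not stripped:
        return "all zeroes"
    # Number of leading zero bytes == offset of the first non-zero byte.
    first = len(view) - len(stripped)
    return f"first non-zero byte at offset {first:#x}"
```

Logging this string for every submission would make the bad runs stand out without having to dump the whole cmdstream each time.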
<narmstrong>
tomeu: how is this possible ?
<tomeu>
no idea, because seems to be plain shmem
<tomeu>
so maybe mesa is overwriting the cmdstream?
<tomeu>
no idea how that could happen, though, in this random fashion
<narmstrong>
tomeu: do you dump at panfrost or mesa level ?
raster has joined #panfrost
<narmstrong>
tomeu: are you on the Odroid-N2 ?
<narmstrong>
tomeu: are you using the Hardkernel U-boot ?
<narmstrong>
I hope it doesn't allocate in a reserved memory zone; it can't be, otherwise the kernel and userspace would crash at some point
<narmstrong>
tomeu: what do I need to reproduce ? I can try on the VIM3
<narmstrong>
I also have the N2 if necessary
<tomeu>
narmstrong: yep, on an odroid-n2
<tomeu>
haven't tried yet to map the buffer from within the kernel
<tomeu>
let me do some sanity checking within mesa first
<tomeu>
narmstrong: afaics, if I make a new mapping, then I read zeroes instead of what I last wrote with a previous mapping
<tomeu>
well, not always, maybe 4 out of 5 times
icecream95 has quit [Ping timeout: 240 seconds]
<tomeu>
when I read back using the same mapping as when I wrote, then I get the expected contents back fine
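The expectation being tested here, that data written through one shared mapping is visible through a fresh mapping of the same object, can be sketched in userspace as below. This is a stand-in under stated assumptions: it maps an ordinary temporary file, not a shmem-backed GPU buffer, and `readback_through_new_mapping` is a hypothetical name.

```python
import mmap
import tempfile


def readback_through_new_mapping(payload: bytes, size: int = mmap.PAGESIZE) -> bytes:
    """Write payload via one shared mapping, read it back via a second one.

    Assumption: a temporary file stands in for the buffer object, and
    len(payload) <= size.
    """
    with tempfile.TemporaryFile() as f:
        f.truncate(size)                      # back the mapping with `size` bytes
        first = mmap.mmap(f.fileno(), size)   # shared, read/write by default
        second = None
        try:
            first[:len(payload)] = payload
            second = mmap.mmap(f.fileno(), size)  # new mapping of the same object
            return bytes(second[:len(payload)])
        finally:
            if second is not None:
                second.close()
            first.close()
```

With a correctly behaving kernel the returned bytes equal the payload every time, which is why reading zeroes through the second mapping pointed at a software problem.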
cwabbott has quit [Remote host closed the connection]
<narmstrong>
wtf
cwabbott has joined #panfrost
<narmstrong>
can you share your kernel tree ?
<tomeu>
narmstrong: also, using odroid's u-boot
<tomeu>
yeah, guess I should go back first to the mali page table format
<narmstrong>
if you lose data between maps, it can't be an hw issue
<tomeu>
ok, think it's a hw issue
<tomeu>
when it works, the second mmap is the same address as before
<tomeu>
when it doesn't, I get a different address
<tomeu>
I suspect some problem with BO caching or so
<narmstrong>
all of this is pure sw
<tomeu>
yeah, sorry, meant the opposite
<narmstrong>
ok
<narmstrong>
are you using a stable kernel tree working for midgard ?
<tomeu>
it's 5.6-rc5 plus the reset hacks, but it's
<tomeu>
..also the one we use in the Mesa CI
<tomeu>
oops, the thing about different mappings and reading zeroes from there was all my fault
<tomeu>
too many hacks on top of other hacks
<tomeu>
we're back now to zeroes appearing around the fields in the cmdstream that the GPU is expected to be updating
<tomeu>
as if writes were spilling around; could be a problem with the write-back cache not being prepopulated?
cwabbott has quit [Quit: cwabbott]
cwabbott has joined #panfrost
stikonas has joined #panfrost
<tomeu>
alyssa: if one allocates a bo for the checksum data, it gets much more reliable
<tomeu>
and looks like the bigger the BO, the more reliable it becomes :p
<tomeu>
I've been allocating transiently for this experiment, and noticed that if I made it too small, the header descriptor which is allocated next in the transient pool is overwritten with sequences of ff808080 c0008080 :p
<tomeu>
which are the values in the new fields in the extra descriptor
<robmur01_>
tomeu: could it be an insufficient alignment thing? i.e. does the point where the corruption starts look like a rounded-up/rounded-down version of some pointer the GPU was previously given?
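One way to check the alignment theory is to compare the address where the corruption starts against rounded-up and rounded-down versions of a pointer handed to the GPU. The sketch below is illustrative only: the function names are hypothetical, and the candidate alignments (64, 128, 256, 4096) are assumptions, not values taken from the hardware documentation.

```python
def align_down(addr: int, alignment: int) -> int:
    """Round addr down to a power-of-two alignment."""
    return addr & ~(alignment - 1)


def align_up(addr: int, alignment: int) -> int:
    """Round addr up to a power-of-two alignment."""
    return (addr + alignment - 1) & ~(alignment - 1)


def matching_alignments(corruption_start, pointer, alignments=(64, 128, 256, 4096)):
    """List (alignment, direction) pairs for which rounding `pointer`
    lands exactly on `corruption_start`."""
    hits = []
    for a in alignments:
        if corruption_start == align_down(pointer, a):
            hits.append((a, "down"))
        elif corruption_start == align_up(pointer, a):
            hits.append((a, "up"))
    return hits
```

An empty result for every pointer in the job would argue against the rounding explanation; a consistent hit at one alignment would point at a descriptor that needs stricter placement.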
maciejjo has quit [Remote host closed the connection]
cwabbott has quit [Ping timeout: 246 seconds]
cwabbott has joined #panfrost
<tomeu>
robmur01_: hard to tell because there's a lot of zeroes around the values that change
<tomeu>
but there was indeed a clear alignment requirement on the first header descriptor, that I already took care of
<tomeu>
hmm, there's a bunch of cache_clean-related functions in mali_kbase_device_hw.c that weren't in the kbase I had before
<tomeu>
one more difference: we don't handle BASE_HW_FEATURE_CLEAN_ONLY_SAFE
<tomeu>
one more:
<tomeu>
+ /* Ensure page-tables reads use read-allocate cache-policy in
<narmstrong>
ok, but what's the soc ? weird it faults on my runner
Depau has joined #panfrost
<daniels>
Cavium ThunderX
<narmstrong>
ok, can't compete :-p
<daniels>
i mean, it's not running Gentoo or anything, it's just a Debian system which should run on any armv8 ...
<daniels>
this _shouldn't_ be it, but could you push a script change which executes 'locale' right before it tries to run the Python script which fails?
yann has joined #panfrost
<narmstrong>
yeah I know, the system is running ubuntu with a shitload of python already
<narmstrong>
I restarted a pipeline, and I'll do that if it still fails
<daniels>
on that machine, LANG/LANGUAGE/LC_ALL are all unset, and the rest of the LC_* come out as POSIX
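Since the failing step is a Python script, the same locale information can also be gathered from inside Python right before it runs. A minimal sketch, assuming only the standard library; `locale_report` is a hypothetical helper and not part of the CI scripts.

```python
import locale
import os


def locale_report(environ=os.environ) -> dict:
    """Collect the locale-related variables discussed above.

    Unset variables are reported as None; the preferred encoding is what
    Python would pick for text I/O without changing the process locale.
    """
    names = ("LANG", "LANGUAGE", "LC_ALL")
    report = {name: environ.get(name) for name in names}
    report["preferred_encoding"] = locale.getpreferredencoding(False)
    return report
```

Printing this dictionary on both runners would show whether the two environments really differ in encoding.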
<daniels>
you can do this to get a shell in the exact same environment btw: docker run -ti registry.freedesktop.org/narmstrong/mesa/debian/arm_build:2020-03-24 /bin/bash
tomboy64 has quit [Remote host closed the connection]