<alyssa>
Every time I have to use autotools I get a little sadder.
<HdkR>
never use autotools :D
<icecream95>
Every time I have to download config.guess and config.sub I get a little sadder
<macc24>
Every time I have to use x86 I get a little sadder.
<alyssa>
I don't understand why I have to compile a scanner driver in 2021.
<alyssa>
Or why scanner drivers are using autotools in 2021.
<alyssa>
---
<alyssa>
I think I've decided that our madvise handling is totally backwards.
<kinkinkijkin>
perhaps the scanner was made in 2019, a perfectly acceptable year to still be using autotools
<macc24>
why would you use a scanner in 2021?
<kinkinkijkin>
2019 was 12 years ago according to real math
<alyssa>
It doesn't make sense to constantly allocate and free memory, marking it as purgeable so in low mem the cache gets snarfed out of to reclaim memory,
<macc24>
wait what
<alyssa>
when we'll just reallocate it again immediately, fighting the kernel to free/allocate until our low-memory situation spirals into a freeze.
<alyssa>
That doesn't solve the actual underlying issue.
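For context, the madvise dance alyssa is describing looks roughly like the sketch below, against the panfrost UAPI. The ioctl, flags, and `retained` field come from the kernel's panfrost_drm.h; the helper names and cache structure are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>
#include "drm-uapi/panfrost_drm.h"  /* header path as carried in Mesa; may differ elsewhere */

/* On "free": instead of GEM_CLOSE, stash the BO in the userspace cache and
 * tell the kernel it may purge the backing pages under memory pressure. */
static void cache_put(int fd, uint32_t handle)
{
        struct drm_panfrost_madvise madv = {
                .handle = handle,
                .madv = PANFROST_MADV_DONTNEED,
        };
        drmIoctl(fd, DRM_IOCTL_PANFROST_MADVISE, &madv);
        /* ...then link the handle into the per-size cache bucket... */
}

/* On "allocate": pull a cached BO back out and mark it needed again. If the
 * kernel already purged it (retained == 0), the handle is useless and we
 * allocate a fresh BO -- the purge/realloc tug-of-war described above. */
static bool cache_get(int fd, uint32_t handle)
{
        struct drm_panfrost_madvise madv = {
                .handle = handle,
                .madv = PANFROST_MADV_WILLNEED,
        };
        if (drmIoctl(fd, DRM_IOCTL_PANFROST_MADVISE, &madv) != 0)
                return false;
        return madv.retained != 0;
}
```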
<macc24>
alyssa: whatever you are doing right now, you are doing a good job :D
<alyssa>
More to the point: a poorly behaved client eating up all the heap should never be able to take down the compositor.
<alyssa>
If x11 or weston or whatever allocated a fixed amount of command buffers upfront and _never freed them_, what would happen?
<alyssa>
In the short term, maybe higher memory usage, ok.
<alyssa>
But memory usage is just a number. If it's too high we can optimize for memory usage, but that isn't 'really' the issue we're seeing.
<alyssa>
In the long term, it should handle low-memory situations much more gracefully -- rather than fighting anything, the compositor isn't even on the hook anymore. It uses a certain amount of memory, and if it isn't the process getting the OOM killer's wrath, there's no problem.
<alyssa>
[I type this from laptop #2 since laptop #1 is now frozen for the second time in 10 minutes trying to compile the scanner driver.]
<alyssa>
madvise, swap, and the BO cache are all well intentioned but their interaction is highly unstable and counterproductive.
<alyssa>
Historically madvise solved the issue of "a GPU process just keeps allocating until it takes down the machine" when the BO cache was first added. That hasn't been an issue since the LRU test was added, which is much less invasive than madvise.
<alyssa>
Actually I'd go so far as to exempt a small amount of memory from the LRU test. Mali data structures are tiny, keeping around the bare minimum to keep pushing frames even when allocations are failing isn't such a scary idea.
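The "LRU test" plus the exempt floor suggested here could look something like the sketch below. All names and thresholds are invented; the point is just to free only entries that have sat idle for a while, and never drop below a small working set.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical cache entry -- not the real Mesa structures. */
struct cache_entry {
        struct cache_entry *next;     /* list is LRU-ordered, oldest first */
        uint32_t handle;
        size_t size;
        time_t last_used;
};

#define CACHE_EXEMPT_BYTES (8u << 20) /* always keep ~8 MiB of tiny BOs around */
#define CACHE_STALE_SECONDS 2         /* entries idle longer than this are fair game */

/* Free stale entries, but leave a small floor so we can keep pushing frames
 * even while allocations elsewhere are failing. */
static void cache_trim(struct cache_entry **list, size_t *cached_bytes,
                       void (*free_bo)(uint32_t handle))
{
        time_t now = time(NULL);

        while (*list && *cached_bytes > CACHE_EXEMPT_BYTES &&
               now - (*list)->last_used > CACHE_STALE_SECONDS) {
                struct cache_entry *victim = *list;

                *list = victim->next;
                *cached_bytes -= victim->size;
                free_bo(victim->handle);        /* ultimately a GEM_CLOSE */
                free(victim);
        }
}
```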
<alyssa>
No amount of clever tricks is going to magically make less memory get used; if the kernel overcommits we're already toast.
<alyssa>
("zstd?" "Never mind.")
<icecream95>
alyssa: It's zram + zstd as swap, zstd by itself doesn't do anything
<alyssa>
The least we can do is get the offending process to fail fast and keep the rest out of the line of fire, as opposed to the current game we play of drawing out a long painful death until the whole system hangs.
<macc24>
so that's why my laptop was hanging randomly
<alyssa>
(Also, madvise adds nontrivial CPU overhead for the 99.99% of the frames that don't need it. But I care more about stability.)
<alyssa>
(And I've just really hit my limit for system freezes and I shouldn't keep blaming the kernel.)
<alyssa>
I'm strongly inclined to make a WIP branch ripping out madvise from userspace and dogfooding it for a week or something and seeing if it's noticeably better.
<alyssa>
Low-memory situations are going to suck regardless, OpenGL as an API isn't really great here. But Panfrost is aggravating it.
<icecream95>
Current zram stats for me are used:599M uncompressed:1.64G, with 3.5/3.8G RAM used, everything is still running perfectly smoothly
<icecream95>
It really is magic
<alyssa>
bbrezillon: ^^
<icecream95>
alyssa: Try it, try it, green eggs and zram. You may like it, you will see. You may like it. In a device-tree?
<kinkinkijkin>
on a horse? in a source?
<alyssa>
saying neigh? June through May?
<alyssa>
---Wait, a minute
<alyssa>
("A minute waits")
<kinkinkijkin>
everything would be a lot more hunky and dory if shared memory weren't the only gpu memory in thes-- *receives a ticket for insinuating that splitting gpu memory between shared and private like that would be better*
<alyssa>
Okay, this is inconsistent.
<alyssa>
When we close the panfrost device, we evict everything from the cache (==> GEM_CLOSE ioctls).
<alyssa>
But BOs used by jobs that are currently in flight when closing won't be in the cache, so they won't be evicted with that call (==> those BOs may not have GEM_CLOSE called).
<alyssa>
To my knowledge, the driver doesn't block on in-flight jobs terminating before exiting, although the kernel is supposed to terminate any in-flight jobs from a process that ends.
<alyssa>
There are two cases here.
<alyssa>
1. The kernel automatically cleans up BOs from dead processes. In this case the cache eviction routine is totally useless and can be removed.
<alyssa>
2. The kernel does not clean up BOs from dead processes. In this case we have a, possibly quite serious, BO memory leak.
<alyssa>
I don't know enough of the Linux side to know which case it is, but both suck and demand a fix.
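The teardown path in question amounts to walking the userspace cache and GEM_CLOSE-ing every handle in it, roughly as below (DRM_IOCTL_GEM_CLOSE is the real core-DRM ioctl; the cache walk is a made-up stand-in). BOs held by in-flight jobs never made it back into the cache, so this loop simply never sees them.

```c
#include <stdint.h>
#include <xf86drm.h>    /* drmIoctl(); pulls in the core DRM UAPI definitions */

/* Hypothetical: close every handle sitting in the userspace BO cache when the
 * device/screen is destroyed. Anything still referenced by an in-flight job is
 * *not* in the cache and therefore never gets a GEM_CLOSE from this path. */
static void cache_evict_all(int fd, const uint32_t *handles, unsigned count)
{
        for (unsigned i = 0; i < count; i++) {
                struct drm_gem_close args = { .handle = handles[i] };

                drmIoctl(fd, DRM_IOCTL_GEM_CLOSE, &args);
        }
}
```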
<alyssa>
(Why is this single C++ file taking gigabytes of RAM to compile? This isn't even Chromium.)
tlwoerner has quit [Remote host closed the connection]
<urjaman>
Why am i not surprised that it's C++ tho...
<icecream95>
"The kernel does not clean up BOs from dead processes." At least in 5.10, it does, even if there are still running jobs (see complaints about "Memory manager not clean during takedown" kernel warnings)
<HdkR>
alyssa: Template hell usually
kaspter has quit [Ping timeout: 240 seconds]
<alyssa>
icecream95: I don't know that I've seen those warnings? But if they're there I think that indicates an actual leak.
kaspter has joined #panfrost
<HdkR>
Alternatively, C++ has regex support built in. Constexpression expansion for that can be rough
tlwoerner has joined #panfrost
<HdkR>
clang has a -ftime-trace argument, now it just needs a -fmemory-trace argument D:
<HdkR>
Although bad times could imply large amounts of allocations I guess :D
<kinkinkijkin>
not related to current convo, hdkr I have a bunch of audio productivity softwares I've been trying to make work on this chromebook, got famitracker in through box86, want to try renoise now and use the x86_64 build, anything I should keep in mind trying to use FEX
<icecream95>
heaptrack is very useful for tracking memory usage and leaks
<HdkR>
Heaptrack is quality
<alyssa>
Is opening Aquarium on both Firefox and Chromium simultaneously not sufficient for OOMing anymore? Bah.
<HdkR>
OOM situation improving? bah humming birds
<HdkR>
kinkinkijkin: ah, make sure to set up a sane working config. Default is very slow, but fast may currently encounter bugs :P
<alyssa>
ok, let's open Mattermost too
<alyssa>
that's React, it should "help" move along the memory
<alyssa>
and uh start a compile with -j6 in the background
* icecream95
feels sorry for those with only two big cores
<alyssa>
icecream95: I'm almost a little scared to compile Mesa on the M1, lest I never want to go back to rockchip :p
<icecream95>
I rarely boot veyron-speedy any more except to make sure the battery is charged
<HdkR>
Cross-compile life then :)
<alyssa>
Still managed to get it to freeze :|
<icecream95>
With zram?
<alyssa>
No, with madvise ripped out... why didn't the OOM killer take down { SuperTuxKart, Firefox, Chromium, gcc }?
<icecream95>
alyssa: Is allocated but unmapped GPU memory attributed to processes?
<icecream95>
(or even mapped memory, for that matter)
<alyssa>
Not sure. I don't do much kernel.
<alyssa>
bbrezillon: should know.
<alyssa>
After properly ninja installing, things do seem better?
<alyssa>
Like I don't want to jinx it but
<alyssa>
The fact I can have Chromium and Firefox open at once, and a Youtube video playing, and not be totally broken is a good thing.
<alyssa>
Speaking of, only getting 20fps... groan. Though it looks like the s/w video decoding isn't the "fastest" thing..
<icecream95>
mpv can software decode 1080p even on RK3288, at least if ffmpeg and codecs are compiled with the right options
<alyssa>
This was Chromium with the default yt proprietary js player
<alyssa>
mpv has been fine since forever
<icecream95>
Fix: Don't watch videos in Chromium
<alyssa>
👈
<HdkR>
I use mpv exclusively to watch Youtube videos on my Linux devices. Google won't ever enable hardware video decode on Linux :<
<alyssa>
So if we drop madvise, that revert can be reverted.
<alyssa>
I think
rando25892 has quit [Ping timeout: 240 seconds]
<HdkR>
32bit process should prioritize munmap instead of madvise anyway because of limited virtual memory range
<anarsoul>
what's the point of madvise anyway? it kind of defeats the purpose of BO cache
<alyssa>
anarsoul: Ostensibly to deal with low mem situations
<alyssa>
But I'm starting to convince myself it might've been the single worst design decision of the project to date...
<anarsoul>
alyssa: limit BO cache size?
<alyssa>
(and this is my fault)
<alyssa>
anarsoul: nowadays we have an LRU cache thanks to bbrezillon which probably defeats the purpose of madvise
<HdkR>
Tells the kernel that those pages can have their physical memory backing dropped, but you still want the virtual address mapping. The kernel will do its fault dancing to reload or zero the backing if it gets accessed again
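That behaviour is easy to see on plain anonymous memory; a minimal standalone demo (nothing Panfrost-specific):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;
        unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        memset(p, 0xab, len);
        printf("before madvise: %#x\n", p[0]);

        /* Drop the physical backing, keep the virtual address mapping. */
        madvise(p, len, MADV_DONTNEED);

        /* The next access faults in a fresh zero page. */
        printf("after madvise:  %#x\n", p[0]);

        munmap(p, len);
        return 0;
}
```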
* anarsoul
doesn't remember whether lima BO cache is LRU
<alyssa>
gfxbench numbers on midgard seem a bit higher than a few months ago
<anarsoul>
HdkR: yeah, but do you really want that for a GPU driver?
<alyssa>
not complaining :p
<HdkR>
anarsoul: in a 64bit process it is probably fine regardless
<HdkR>
Plenty of virtual memory to go around and fault dance is slightly faster
<anarsoul>
HdkR: what I mean is if you kept a BO around in the cache you may want it back ASAP
<HdkR>
Right, you shouldn't madvise or munmap it in that case :P
<icecream95>
Unless my calculations are wrong, the cache size is about 350MB for SuperTuxKart
<anarsoul>
icecream95: do you have cache size limit?
<icecream95>
Okay, calculations are wrong...
<icecream95>
anarsoul: There isn't a size limit
<anarsoul>
ouch
<icecream95>
Revised figures are a peak at 200MB during level loading, and 60MB during a race
<alyssa>
Well, this was a big tangent for "scanner drivers"
<icecream95>
Most other games use 5-10MB, with only a couple (like Neverball and LZDoom) using 30-40MB
<icecream95>
Maybe this figure should be added to Gallium HUD upstream?
davidlt has joined #panfrost
<anarsoul>
I guess you can expose it as a perf counter
<anarsoul>
not really a counter though
<alyssa>
ok, calling it a night
<alyssa>
thanks for all the fish
<icecream95>
Pinch-zooming on firefox causes it to go up to a cache size of about 250MB
archetech has quit [Quit: Konversation terminated!]
<chewitt>
@robmur01 if you need some T6xx hardware to advance that dark corner of midgard support, I'll be happy to fund/arrange/ship an XU4 to you
<chewitt>
either pm or email me some details on where to ship it, or if you order from somewhere that takes paypal we (LibreELEC) will refund the cost
<icecream95>
alyssa: After doing some testing in low memory situations, I'm no longer opposed to removing the madvise calls
<icecream95>
The problem is that by the time all the BOs make it to the cache, the system is no longer under memory pressure and so freeing them isn't very useful
vstehle has joined #panfrost
cdu13a has quit [Quit: Konversation terminated!]
rak-zero has joined #panfrost
rak-zero has quit [Ping timeout: 260 seconds]
kaspter has quit [Ping timeout: 246 seconds]
kaspter has joined #panfrost
megi has quit [Ping timeout: 272 seconds]
Stenzek has quit [Ping timeout: 264 seconds]
megi has joined #panfrost
Stenzek has joined #panfrost
chewitt has quit [Read error: Connection reset by peer]
chewitt_ has joined #panfrost
chewitt has joined #panfrost
chewitt_ has quit [Ping timeout: 272 seconds]
<bbrezillon>
alyssa: the kernel is supposed to release BOs (and the underlying mem) when they are no longer used, if it doesn't that's a bug
raster has joined #panfrost
<bbrezillon>
all BOs are refcounted, when we issue a job, refcount is incremented on all referenced BOs, and decremented when the job is done
<bbrezillon>
a close on the drm file will release all the refs the process has to open BOs
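A toy model of that lifecycle (plain userspace C, not kernel code): the handle owns one reference, each in-flight job that uses the BO owns another, and the memory only actually goes away when the last reference is dropped -- so closing the DRM file while a job is still running is safe.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct bo {
        atomic_int refcount;
        size_t size;
};

static struct bo *bo_create(size_t size)
{
        struct bo *bo = calloc(1, sizeof(*bo));

        atomic_init(&bo->refcount, 1);  /* the handle's reference */
        bo->size = size;
        return bo;
}

static void bo_get(struct bo *bo)
{
        atomic_fetch_add(&bo->refcount, 1);
}

static void bo_put(struct bo *bo)
{
        if (atomic_fetch_sub(&bo->refcount, 1) == 1) {
                printf("freeing %zu bytes\n", bo->size);
                free(bo);
        }
}

int main(void)
{
        struct bo *bo = bo_create(4096);

        bo_get(bo);  /* job submit: the job takes a ref on every referenced BO */
        bo_put(bo);  /* process dies: closing the drm file drops the handle's ref */
        bo_put(bo);  /* job completes: last ref gone, memory actually freed */
        return 0;
}
```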
<bbrezillon>
icecream95: Re: pre-process memory accounting => I'll have to check, I'd say mmap-ed() memory is counted, but I'm not sure if that's the case for allocated but unmapped (maybe stepri01 knows)
<bbrezillon>
*per-process
<bbrezillon>
note that over-provisioning also happens on GPU buffers, you can have BOs that are not pinned to physical memory...
<bbrezillon>
regarding the whole "LRU+madvise => OOM" situation, I still don't see how the combination can make things worse. I do agree that madvise has an overhead (extra ioctls + some potential contention on locks when checking/updating the madvise status), but it should actually improve the situation under high mem pressure.
<bbrezillon>
I remember that we had a bug with mmap-ed buffer not being reclaimable, and IIRC, mesa tries to keep buffers mmap-ed
<bbrezillon>
the LRU cache has 2 downsides:
<bbrezillon>
1/ entries are only evicted when BOs are allocated => a process allocating a lot and then doing nothing might keep a huge amount of memory in its BO cache
camus1 has joined #panfrost
kaspter has quit [Ping timeout: 240 seconds]
camus1 is now known as kaspter
<bbrezillon>
2/ even with this userspace-BO cache, the memory usage can get quite high pretty quickly because we don't limit the number of pending batches (actually that's also true for the madvised buffers, but at least those can be reclaimed right away if the system needs memory)
<bbrezillon>
(by right away, I mean just after the GPU jobs have been flushed and executed, which means we might still have a lot of memory reserved before batches are flushed :-/)
yann has quit [Ping timeout: 256 seconds]
icecream95 has quit [Ping timeout: 246 seconds]
alpernebbi has joined #panfrost
<bbrezillon>
if I read mm/shmem.c correctly, pages allocated through shmem_read_mapping_page() are charged to the task allocating them, so it lets the OOM-killer pick the right task
<bbrezillon>
tried it on firefox+aquarium, and it seems to consume around 200MB (with the idle pool oscillating between 20 and 30M)
kaspter has quit [Ping timeout: 256 seconds]
camus1 is now known as kaspter
chewitt has joined #panfrost
<bbrezillon>
when I close the aquarium tab most BOs stay allocated (~180M), and they are only freed when I load something else in another tab
chewitt has quit [Quit: Zzz..]
chewitt has joined #panfrost
chewitt has quit [Quit: Zzz..]
chewitt has joined #panfrost
<bbrezillon>
that's the problem I was mentioning above, the userspace solution doesn't work well when apps stop using the GL context, with madvise we at least make sure the kernel can reclaim the memory (maybe we should start reclaiming pro-actively though, and there might be bugs in the madvise implem too :-/)
stikonas has quit [Remote host closed the connection]
stikonas has joined #panfrost
gcl has quit [Ping timeout: 246 seconds]
rcf has quit [Ping timeout: 264 seconds]
gcl has joined #panfrost
rcf has joined #panfrost
kaspter has quit [Ping timeout: 256 seconds]
kaspter has joined #panfrost
urjaman has quit [Read error: Connection reset by peer]
urjaman has joined #panfrost
<tomeu>
narmstrong: any ideas why roughly the same deqp-gles3 run on the vim3 takes a minute longer than on the kevin (rk3399)?
<tomeu>
we are especially seeing it on tests that do a lot of cpu work calculating reference images, so I'm suspecting cpufreq
<alyssa>
bbrezillon: "LRU+madvise => OOM" maybe I misspoke, it's the "madvise+BO cache+swap+OOM => hang" that I'm worried about, and LRU avoids the issue that madvise was trying to solve
kaspter has quit [Remote host closed the connection]
<alyssa>
so I'm now trying to figure out if "LRU + BO cache + swap + OOM" will recover gracefully in real workloads as opposed to hanging
kaspter has joined #panfrost
<alyssa>
There are compelling theoretical reasons why it might but, theory and practice..
<alyssa>
i.e. there's some evidence madvise might be making these worse.
<narmstrong>
tomeu: good question
<narmstrong>
tomeu: i saw a perf regression with the final g12b fixup for bifrost
<narmstrong>
Disabling outer cache sync was much faster
<narmstrong>
Do you have a heatsink on the vim3 ?
<narmstrong>
Without it, it may get too hot and cpufreq may act on the gpu and cpu freq
Green has quit [Ping timeout: 240 seconds]
Green has joined #panfrost
<bbrezillon>
alyssa: well, as I said, userspace-cache-eviction doesn't work if there's no activity on the GL context, which might force the OOM to needlessly kill an app
<bbrezillon>
my point is, it does cover part of the madvise feature, but some corner cases are not handled
<bbrezillon>
that's not to say we shouldn't temporarily revert back to a non-madvise solution if there's a kernel bug, but I keep thinking letting madvise reclaim unused BOs is preferable to letting the OOM kill an app (or swapping pages to disk)
kaspter has quit [Quit: kaspter]
<alyssa>
bbrezillon: I don't see how madvise claiming from the cache actually solves the issue.
<alyssa>
Since if the app _is_ GL using (all but the corner case), it will just try to reallocate immediately thereafter, succeed, and then madvise will claim back again
<alyssa>
And the whole system will grind to a halt.
<alyssa>
OOM killer at least breaks the cycle.
<alyssa>
And killing the app is preferable to hanging the entire system, including "innocent" processes like the compositor.
<alyssa>
(which will continue to use GL!)
<bbrezillon>
madvise is only set when we return a BO to the cache, not when the BO is used
<alyssa>
right, which will happen on frame n+2
<bbrezillon>
which we'll do as soon as the GPU is done executing jobs, okay
<alyssa>
and that madvise shrinking is not exactly free either
<alyssa>
Under current assumptions, if you get to the point of madvise claiming anything, _it's already too late_
<alyssa>
There is no winning except for the user to try to ctrl-c as fast as possible and hope it's enough.
<alyssa>
At least letting the OOM killer do its job automates that.
enunes has joined #panfrost
<alyssa>
If the issue is "we use too much memory", we can optimize our memory usage. But no matter how optimal there _will_ come a time when a bad process (maybe not even a bad Panfrost process, but something like a compile in the background when I switch to Mattermost)...
<alyssa>
...causes the system to hit low mem, and we need to be able to handle that gracefully. I am speaking from experience of using Panfrost daily since Oct 2019, and having experienced thousands of system freezes: despite what it seems on paper, madvise is not graceful.
<bbrezillon>
well, the issue is more "the cache size is currently unbounded"
<alyssa>
bbrezillon: We can bound the cache size in userspace, even something high like 128MB per process would make it bounded so a runaway app can't take down the system from cache behaviour.
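Bounding the cache could be as simple as enforcing a cap whenever a BO is returned to it. A sketch with invented names; the 128 MiB figure is just the number floated above:

```c
#include <stdbool.h>
#include <stddef.h>

#define CACHE_CAP_BYTES ((size_t)128 << 20)   /* per-process cap suggested above */

/* Hypothetical: called when the driver frees a BO. Returns true if the BO was
 * cached, false if the caller should GEM_CLOSE it immediately. */
static bool cache_put_bounded(size_t *cached_bytes, size_t bo_size,
                              void (*evict_lru)(size_t *cached_bytes))
{
        if (bo_size > CACHE_CAP_BYTES)
                return false;   /* huge BOs skip the cache entirely */

        /* Evict oldest entries until the new BO fits under the cap. */
        while (*cached_bytes + bo_size > CACHE_CAP_BYTES && *cached_bytes > 0)
                evict_lru(cached_bytes);

        *cached_bytes += bo_size;
        /* ...then link the BO into its size bucket... */
        return true;
}
```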
<alyssa>
Or more to the point, we should strongly consider limiting in-flight batches.
kaspter has joined #panfrost
<alyssa>
Freedreno limits to max 32 (or maybe 64), which is plenty, and lets them use single-word bitsets for the tracking logic (as opposed to full blown sets like we resort to, which have higher CPU overhead... but that's irrelevant to this discussion)
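The Freedreno trick being referenced: with a hard cap of 32 in-flight batches, "the set of batches touching this resource" is just a 32-bit word, so dependency tracking becomes bit twiddling instead of hash sets. A minimal sketch of the idea, names invented:

```c
#include <stdint.h>

/* At most 32 batches in flight; each one owns a slot in a 32-bit word. */
static uint32_t active_batches;

/* Returns a free slot, or -1 if all 32 are in use (the caller would then
 * flush the oldest batch to make room). */
static int batch_slot_alloc(void)
{
        if (active_batches == UINT32_MAX)
                return -1;

        int slot = __builtin_ctz(~active_batches);   /* lowest clear bit */
        active_batches |= 1u << slot;
        return slot;
}

static void batch_slot_free(int slot)
{
        active_batches &= ~(1u << slot);
}

/* Per-resource tracking is then just masks of batch slots: */
struct resource_track {
        uint32_t readers;   /* batches reading the resource */
        uint32_t writer;    /* batch writing it, as a one-bit mask */
};
```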
<bbrezillon>
an app using a lot of mem and being killed because of that is okay, but mesa caching things behind the app's back and not releasing them, leading to the OOM killing the app, is not great. Anyway, I get your point, when we're under mem pressure, the shrinker does more harm than it helps (moving from one process to another, leading to huge delays)
<alyssa>
right
<alyssa>
as for the "allocate a ton then no rendering" worst case for the LRU strategy... does that actually apply to anything?
dstzd has joined #panfrost
<bbrezillon>
alyssa: I gave you a real example with firefox
<alyssa>
Games will be rendering constantly, compositors/desktops/x will hit that case but we explicitly _do not_ want them killed if we can help it
dstzd has quit [Client Quit]
<alyssa>
right, browsers. need to give that some thought, thank you
<bbrezillon>
open a tab, load a webgl page, close the tab, the mem stays around until you open another tab and load a GL app into it
<alyssa>
fwiw that doesn't apply to chrome which kills the context
<bbrezillon>
didn't try with chrome
<alyssa>
Better question being "if we bound the size of the BO cache, and establish it's basically just Firefox that has an actual issue here, does it follow that the size of the problem is bounded for the whole system?"
<alyssa>
(And if so, would Firefox cause the OOM? Or would low-mem only happen because of something else happening in the background that behaves worse, and that will get the OOM treatment, possibly not even graphics.)
<alyssa>
[I am guilty of compiling -j5 with Firefox running. Killing the build is the expected thing to do.]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
<bbrezillon>
well, the corner case still exists. I wonder how other drivers deal with this madvise issue...
<bbrezillon>
anyway, I'm all for a quick solution (AKA disable madvise and limit the cache size + batch size)
dstzd has joined #panfrost
<alyssa>
I do wonder as well.
<bbrezillon>
I'm just curious why we're the only ones to have problems with madvise
stikonas has left #panfrost ["Konversation terminated!"]
rcf has quit [Quit: WeeChat 2.9]
rcf has joined #panfrost
kaspter has quit [Remote host closed the connection]
kaspter has joined #panfrost
dstzd has quit [Client Quit]
chewitt has quit [Quit: Zzz..]
yann has quit [Ping timeout: 256 seconds]
<alyssa>
Yeah..
kaspter has quit [Remote host closed the connection]
kaspter has joined #panfrost
jernej has joined #panfrost
dstzd has joined #panfrost
kaspter has quit [Quit: kaspter]
alpernebbi has quit [Quit: alpernebbi]
warpme_ has quit [Quit: Connection closed for inactivity]
icecream95 has joined #panfrost
icecream95 has quit [Read error: Connection reset by peer]
icecream95 has joined #panfrost
raster has quit [Quit: Gettin' stinky!]
davidlt has quit [Ping timeout: 264 seconds]
<icecream95>
Maybe everyone else has found out about zram and never has problems with OOM?
raster has joined #panfrost
nlhowell has quit [Ping timeout: 240 seconds]
nlhowell has joined #panfrost
icecream95 has quit [Read error: Connection reset by peer]
icecream95 has joined #panfrost
<icecream95>
Why does doing a printf whenever a BO is allocated fix the GPU faults in Firefox?
<macc24>
icecream95: timing?
nlhowell has quit [Ping timeout: 240 seconds]
nlhowell has joined #panfrost
<icecream95>
Looks like the address it faults on was an imported BO...
<macc24>
icecream95: are you fixing weird issues when firefox is opening a download window?
<alyssa>
icecream95: wat.mp4
<icecream95>
Making panfrost_bo_unreference a no-op didn't fix it...
<alyssa>
bbrezillon: Hey, question -- if process A creates and exports a resource, then process B imports the resource, then process A is killed, what happens to the resource?
<alyssa>
If it's freed, there are probably use-after-frees in mesa (maybe icecream95's bug related)
<alyssa>
If it isn't freed, there is a memory leak: even if it's freed when A is killed, it's possible A could substantially outlive B.
<alyssa>
(Specifically relevant to window framebuffers, with A being a GL client and B being the compositor)
raster has quit [Quit: Gettin' stinky!]
<anarsoul>
alyssa: isn't it ref-counted in kernel?
raster has joined #panfrost
<alyssa>
anarsoul: sure, does the compositor own a reference though?
<anarsoul>
well, if it has a fd, then yes
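For the cross-process case the export goes through a dma-buf fd, and the fd and the importer's own GEM handle each hold a reference, so the buffer can outlive process A for as long as B keeps either one around. A sketch using the libdrm prime helpers; passing the fd between processes (SCM_RIGHTS over a unix socket) is elided:

```c
#include <stdint.h>
#include <xf86drm.h>

/* Process A: turn a GEM handle into a dma-buf fd it can hand to B.
 * The fd owns its own reference to the underlying buffer. */
static int export_bo(int drm_fd, uint32_t handle)
{
        int prime_fd = -1;

        if (drmPrimeHandleToFD(drm_fd, handle, DRM_CLOEXEC, &prime_fd) != 0)
                return -1;
        return prime_fd;
}

/* Process B: import the dma-buf into a GEM handle of its own. That handle
 * holds another reference, so even if A dies and its handles are cleaned up,
 * the memory stays alive until B closes both the fd and its handle. */
static uint32_t import_bo(int drm_fd, int prime_fd)
{
        uint32_t handle = 0;

        if (drmPrimeFDToHandle(drm_fd, prime_fd, &handle) != 0)
                return 0;
        return handle;
}
```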
<alyssa>
I guess that gets freed when the compositor cleans up the window. Probably fine. Never mind, just paranoid now.
<icecream95>
Oops... The faults were happening at 0x3E013F40, but the imported BO is at 0x3e01000-0x3e01fff