alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://freenode.irclog.whitequark.org/panfrost - <daniels> avoiding X is a huge feature
raster has quit [Quit: Gettin' stinky!]
dstzd has quit [Quit: ZNC - https://znc.in]
jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
stikonas has quit [Remote host closed the connection]
stikonas has joined #panfrost
dstzd has joined #panfrost
dstzd has quit [Client Quit]
jernej has joined #panfrost
dstzd has joined #panfrost
dstzd has quit [Client Quit]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
jernej has joined #panfrost
jernej has quit [Remote host closed the connection]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
stikonas has quit [Remote host closed the connection]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
jernej has joined #panfrost
jernej has quit [Client Quit]
archetech has quit [Quit: Textual IRC Client: www.textualapp.com]
jernej has joined #panfrost
jernej has quit [Client Quit]
jernej has joined #panfrost
cdu13a has joined #panfrost
dstzd has joined #panfrost
gcl has quit [Ping timeout: 260 seconds]
archetech has joined #panfrost
kaspter has joined #panfrost
gcl has joined #panfrost
vstehle has quit [Ping timeout: 264 seconds]
<alyssa> Every time I have to use autotools I get a little sadder.
<HdkR> never use autotools :D
<icecream95> Every time I have to download config.guess and config.sub I get a little sadder
<macc24> Every time I have to use x86 I get a little sadder.
<alyssa> I don't understand why I have to compile a scanner driver in 2021.
<alyssa> Or why scanner drivers are using autotools in 2021.
<alyssa> ---
<alyssa> I think I've decided that our madvise handling is totally backwards.
<kinkinkijkin> perhaps the scanner was made in 2019, a perfectly acceptable year to still be using autotools
<macc24> why would you use a scanner in 2021?
<kinkinkijkin> 2019 was 12 years ago according to real math
<alyssa> It doesn't make sense to constantly allocate and free memory, marking it purgeable so that in low-mem situations the cache gets snarfed to reclaim memory,
<macc24> wait what
<alyssa> when we'll just reallocate it again immediately, fighting the kernel over free/allocate until our low-memory situation spirals into a freeze.
<alyssa> That doesn't solve the actual underlying issue.
<macc24> alyssa: whatever you are doing right now, you are doing a good job :D
<alyssa> More to the point: a poorly behaved client eating up all the heap should never be able to take down the compositor.
<alyssa> If x11 or weston or whatever allocated a fixed amount of command buffers upfront and _never freed them_, what would happen?
<alyssa> In the short term, maybe higher memory usage, ok.
<alyssa> But memory usage is just a number. If it's too high we can optimize for memory usage, but that isn't 'really' the issue we're seeing.
<alyssa> In the long term, it should handle low-memory situations much more gracefully -- rather than fighting anything, the compositor isn't even on the hook anymore. It uses a certain amount of memory, and if it isn't the process getting the OOM killer's wrath, there's no problem.
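(The thought experiment above, as code: a hypothetical fixed pool allocated once at startup and never freed, so worst-case memory is a known constant instead of an unbounded cache. The names are made up for illustration, and malloc stands in for a real BO allocation.)

    #include <stdlib.h>

    #define N_CMDBUFS   16
    #define CMDBUF_SIZE (64 * 1024)

    static void *cmdbufs[N_CMDBUFS];

    /* Allocate the whole pool upfront; total usage is pinned at
     * N_CMDBUFS * CMDBUF_SIZE = 1 MiB, forever. */
    static void pool_init(void)
    {
        for (int i = 0; i < N_CMDBUFS; i++)
            cmdbufs[i] = malloc(CMDBUF_SIZE);
    }

    /* Per frame: reuse slots round-robin rather than alloc/free. */
    static void *pool_get(unsigned frame)
    {
        return cmdbufs[frame % N_CMDBUFS];
    }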
<alyssa> [I type this from laptop #2 since laptop #1 is now frozen for the second time in 10 minutes trying to compile the scanner driver.]
<alyssa> madvise, swap, and the BO cache are all well intentioned but their interaction is highly unstable and counterproductive.
<alyssa> Historically madvise solved the issue of "a GPU process just keeps allocating until it takes down the machine" when the BO cache was first added. That hasn't been an issue since the LRU test was added, which is much less invasive than madvise.
<alyssa> Actually I'd go so far as to exempt a small amount of memory from the LRU test. Mali data structures are tiny, keeping around the bare minimum to keep pushing frames even when allocations are failing isn't such a scary idea.
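(For reference, the alloc/free cycle alyssa is criticizing maps onto the panfrost madvise UAPI roughly as follows. This is a simplified sketch of the userspace BO-cache pattern, not the verbatim Mesa code; the include path follows Mesa's tree, and error handling is omitted.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <xf86drm.h>
    #include "drm-uapi/panfrost_drm.h"

    /* On cache put: tell the kernel these pages may be reclaimed. */
    static void cache_put(int fd, uint32_t handle)
    {
        struct drm_panfrost_madvise madv = {
            .handle = handle,
            .madv = PANFROST_MADV_DONTNEED,
        };
        drmIoctl(fd, DRM_IOCTL_PANFROST_MADVISE, &madv);
        /* ...then link the BO into the cache's LRU bucket... */
    }

    /* On cache get: ask for the pages back. If the shrinker already
     * reclaimed them (retained == 0), the BO must be thrown away and
     * reallocated -- the free/allocate fight described above. */
    static bool cache_get(int fd, uint32_t handle)
    {
        struct drm_panfrost_madvise madv = {
            .handle = handle,
            .madv = PANFROST_MADV_WILLNEED,
        };
        drmIoctl(fd, DRM_IOCTL_PANFROST_MADVISE, &madv);
        return madv.retained;
    }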
<alyssa> No amount of clever tricks is going to magically reduce memory usage; if the kernel overcommits, we're already toast.
<alyssa> ("zstd?" "Never mind.")
<icecream95> alyssa: It's zram + zstd as swap, zstd by itself doesn't do anything
<alyssa> The least we can do is get the offending process to fail fast and keep the rest out of the line of fire, as opposed to the current game we play of drawing out a long painful death until the whole system hangs.
<macc24> so that's why my laptop was hanging randomly
<alyssa> (Also, madvise adds nontrivial CPU overhead for the 99.99% of the frames that don't need it. But I care more about stability.)
<alyssa> (And I've just about hit my limit for system freezes, and I shouldn't keep blaming the kernel.)
<alyssa> I'm strongly inclined to make a WIP branch ripping out madvise from userspace and dogfooding it for a week or something and seeing if it's noticeably better.
<alyssa> Low-memory situations are going to suck regardless, OpenGL as an API isn't really great here. But Panfrost is aggravating it.
<icecream95> Current zram stats for me are used:599M uncompressed:1.64G, with 3.5/3.8G RAM used, everything is still running perfectly smoothly
<icecream95> It really is magic
<alyssa> bbrezillon: ^^
<icecream95> alyssa: Try it, try it, green eggs and zram. You may like it, you will see. You may like it. In a device-tree?
<kinkinkijkin> on a horse? in a source?
<alyssa> saying neigh? June through May?
<alyssa> ---Wait, a minute
<alyssa> ("A minute waits")
<kinkinkijkin> everything would be a lot more hunkey and dorey if shared memory weren't the only gpu memory in thes-- *receives a ticket for insinuating that splitting gpu memory between shared and private like that would be better*
<alyssa> Okay, this is inconsistent.
<alyssa> When we close the panfrost device, we evict everything from the cache (==> GEM_CLOSE ioctls).
<alyssa> But BOs used by jobs that are currently in flight when closing won't be in the cache, so they won't be evicted with that call (==> those BOs may not have GEM_CLOSE called).
<alyssa> To my knowledge, the driver doesn't block on in-flight jobs terminating before exiting, although the kernel is supposed to terminate any in-flight jobs from a process that ends.
<alyssa> There are two cases here.
<alyssa> 1. The kernel automatically cleans up BOs from dead processes. In this case the cache eviction routine is totally useless and can be removed.
<alyssa> 2. The kernel does not clean up BOs from dead processes. In this case we have a, possibly quite serious, BO memory leak.
<alyssa> I don't know enough of the Linux side to know which case it is, but both suck and demand a fix.
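(For context, the eviction routine in question boils down to one GEM_CLOSE ioctl per cached handle, roughly like the sketch below. BOs still held by in-flight jobs are not in the cache at this point, so they never reach this loop, which is exactly why case 1 vs. case 2 matters.)

    #include <stdint.h>
    #include <xf86drm.h>

    /* Sketch: on device close, free every BO sitting in the cache. */
    static void cache_evict_all(int fd, const uint32_t *handles, int count)
    {
        for (int i = 0; i < count; i++) {
            struct drm_gem_close req = { .handle = handles[i] };
            drmIoctl(fd, DRM_IOCTL_GEM_CLOSE, &req);
        }
    }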
<alyssa> (Why is this single C++ file taking gigabytes of RAM to compile? This isn't even Chromium.)
tlwoerner has quit [Remote host closed the connection]
<urjaman> Why am i not surprised that it's C++ tho...
<icecream95> "The kernel does not clean up BOs from dead processes." At least in 5.10, it does, even if there are still running jobs (see complaints about "Memory manager not clean during takedown" kernel warnings)
<HdkR> alyssa: Template hell usually
kaspter has quit [Ping timeout: 240 seconds]
<alyssa> icecream95: I don't know that I've seen those warnings? But if they're there, I think that indicates an actual leak.
kaspter has joined #panfrost
<HdkR> Alternatively, C++ has regex support built in. Constexpression expansion for that can be rough
tlwoerner has joined #panfrost
<HdkR> clang has a -ftime-trace argument, now it just needs a -fmemory-trace argument D:
<HdkR> Although bad times could imply large amounts of allocations I guess :D
<kinkinkijkin> not related to current convo, hdkr I have a bunch of audio productivity softwares I've been trying to make work on this chromebook, got famitracker in through box86, want to try renoise now and use the x86_64 build, anything I should keep in mind trying to use FEX
<icecream95> heaptrack is very useful for tracking memory usage and leaks
<HdkR> Heaptrack is quality
<alyssa> Is opening Aquarium on both Firefox and Chromium simultaneously not sufficient for OOMing anymore? Bah.
<HdkR> OOM situation improving? bah humming birds
<HdkR> kinkinkijkin: ah, make sure to set up a sane working config. Default is very slow, but fast may currently encounter bugs :P
<alyssa> ok, let's open Mattermost too
<alyssa> that's React, it should "help" move along the memory
<alyssa> and uh start a compile with -j6 in the background
* icecream95 feels sorry for those with only two big cores
<alyssa> icecream95: I'm almost a little scared to compile Mesa on the M1, lest I never want to go back to rockchip :p
<icecream95> I rarely boot veyron-speedy any more except to make sure the battery is charged
<HdkR> Cross-compile life then :)
<alyssa> Still managed to get it to freeze :|
<icecream95> With zram?
<alyssa> No, with madvise ripped out... why didn't the OOM killer take down { SuperTuxKart, Firefox, Chromium, gcc }?
<icecream95> alyssa: Is allocated but unmapped GPU memory attributed to processes?
<icecream95> (or even mapped memory, for that matter)
<alyssa> Not sure. I don't do much kernel.
<alyssa> bbrezillon: should know.
<alyssa> After properly ninja installing, things do seem better?
<alyssa> Like I don't want to jinx it but
<alyssa> The fact I can have Chromium and Firefox open at once, and a Youtube video playing, and not be totally broken is a good thing.
<alyssa> Speaking of, only getting 20fps... groan. Though it looks like the s/w video decoding isn't the "fastest" thing..
<icecream95> mpv can software decode 1080p even on RK3288, at least if ffmpeg and codecs are compiled with the right options
<alyssa> This was Chromium with the default yt proprietary js player
<alyssa> mpv has been fine since forever
<icecream95> Fix: Don't watch videos in Chromium
<alyssa> 👈
<HdkR> I use mpv exclusively to watch Youtube videos on my Linux devices. Google won't ever enable hardware video decode on Linux :<
<alyssa> Was that bug madvise related or am I imagining things?
<icecream95> Firefox just got OOM killed, re-enabling zram
<alyssa> right, shrinker
<alyssa> So if we drop madvise, that revert can be reverted.
<alyssa> I think
rando25892 has quit [Ping timeout: 240 seconds]
<HdkR> 32bit process should prioritize munmap instead of madvise anyway because of limited virtual memory range
<anarsoul> what's the point of madvise anyway? it kind of defeats the purpose of BO cache
<alyssa> anarsoul: Ostensibly to deal with low mem situations
<alyssa> But I'm starting to convince myself it might've been the single worst design decision of the project to date...
<anarsoul> alyssa: limit BO cache size?
<alyssa> (and this is my fault)
<alyssa> anarsoul: nowadays we have an LRU cache thanks to bbrezillon, which probably defeats the purpose of madvise
<HdkR> Tells the kernel that those pages can have their physical memory backing dropped, but you still want the virtual address mapping. The kernel will do its fault dance to reload or zero the backing if the range gets accessed again
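(In anonymous-memory terms, what HdkR describes is the madvise(2) pattern; a minimal standalone example, not Panfrost code:)

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1 << 20;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        memset(buf, 0xaa, len);      /* pages are now resident */

        /* Drop the physical backing but keep the virtual mapping;
         * the kernel is free to reclaim the pages immediately. */
        madvise(buf, len, MADV_DONTNEED);

        /* Touching the range again faults in fresh zero-filled pages
         * (for private anonymous memory), so this prints 0. */
        printf("%d\n", buf[0]);

        munmap(buf, len);
        return 0;
    }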
* anarsoul doesn't remember whether lima BO cache is LRU
<alyssa> gfxbench numbers on midgard seem a bit higher than a few months ago
<anarsoul> HdkR: yeah, but do you really want that for a GPU driver?
<alyssa> not complaining :p
<HdkR> anarsoul: in a 64bit process it is probably fine regardless
<HdkR> Plenty of virtual memory to go around and fault dance is slightly faster
<anarsoul> HdkR: what I mean is if you kept a BO around in the cache you may want it back ASAP
<HdkR> Right, you shouldn't madvise or munmap it in that case :P
<icecream95> Unless my calculations are wrong, the cache size is about 350MB for SuperTuxKart
<anarsoul> icecream95: do you have cache size limit?
<icecream95> Okay, calculations are wrong...
<icecream95> anarsoul: There isn't a size limit
<anarsoul> ouch
<icecream95> Revised figures are a peak at 200MB during level loading, and 60MB during a race
<alyssa> Well, this was a big tangent for "scanner drivers"
<icecream95> Most other games use 5-10MB, with only a couple (like Neverball and LZDoom) using 30-40MB
<icecream95> Maybe this figure should be added to Gallium HUD upstream?
davidlt has joined #panfrost
<anarsoul> I guess you can expose it as a perf counter
<anarsoul> not really a counter though
<alyssa> ok, calling it a night
<alyssa> thanks for all the fish
<icecream95> Pinch-zooming on firefox causes it to go up to a cache size of about 250MB
archetech has quit [Quit: Konversation terminated!]
<chewitt> @robmur01 if you need some T6xx hardware to advance that dark corner of midgard support, I'll be happy to fund/arrange/ship an XU4 to you
<chewitt> either pm or email me some details on where to ship it, or if you order from somewhere that takes paypal we (LibreELEC) will refund the cost
<icecream95> alyssa: After doing some testing in low memory situations, I'm no longer opposed to removing the madvise calls
<icecream95> The problem is that by the time all the BOs make it to the cache, the system is no longer under memory pressure and so freeing them isn't very useful
vstehle has joined #panfrost
cdu13a has quit [Quit: Konversation terminated!]
rak-zero has joined #panfrost
rak-zero has quit [Ping timeout: 260 seconds]
kaspter has quit [Ping timeout: 246 seconds]
kaspter has joined #panfrost
megi has quit [Ping timeout: 272 seconds]
Stenzek has quit [Ping timeout: 264 seconds]
megi has joined #panfrost
Stenzek has joined #panfrost
chewitt has quit [Read error: Connection reset by peer]
chewitt_ has joined #panfrost
chewitt has joined #panfrost
chewitt_ has quit [Ping timeout: 272 seconds]
<bbrezillon> alyssa: the kernel is supposed to release BOs (and the underlying mem) when they are no longer used, if it doesn't that's a bug
raster has joined #panfrost
<bbrezillon> all BOs are refcounted, when we issue a job, refcount is incremented on all referenced BOs, and decremented when the job is done
<bbrezillon> a close on the drm file will release all the refs the process has to open BOs
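(In kernel terms, that lifetime rule looks roughly like the sketch below. This is simplified pseudo-kernel code, not the actual panfrost_job.c; the struct is trimmed to the two fields that matter, but drm_gem_object_get()/drm_gem_object_put() are the real DRM GEM refcounting helpers.)

    #include <drm/drm_gem.h>

    struct panfrost_job {
        struct drm_gem_object **bos;  /* BOs referenced by the job */
        int bo_count;
        /* ... */
    };

    /* At submit: take a reference on every BO the job touches. */
    static void job_take_bo_refs(struct panfrost_job *job)
    {
        int i;

        for (i = 0; i < job->bo_count; i++)
            drm_gem_object_get(job->bos[i]);
    }

    /* When the job completes (or is torn down): drop them again.
     * Closing the DRM file drops a dying process's handle refs, so
     * the BO memory is finally freed once both kinds of reference
     * are gone. */
    static void job_drop_bo_refs(struct panfrost_job *job)
    {
        int i;

        for (i = 0; i < job->bo_count; i++)
            drm_gem_object_put(job->bos[i]);
    }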
<bbrezillon> icecream95: Re: per-process memory accounting => I'll have to check, I'd say mmap-ed() memory is counted, but I'm not sure if that's the case for allocated but unmapped (maybe stepri01 knows)
<bbrezillon> note that over-provisioning also happens on GPU buffers, you can have BOs that are not pinned to physical memory...
<bbrezillon> regarding the whole "LRU+madvise => OOM" situation, I still don't see how the combination makes things worse. I do agree that madvise has an overhead (extra ioctls + some potential contention on locks when checking/updating the madvise status), but it should actually improve the situation under high mem pressure.
<bbrezillon> I remember that we had a bug with mmap-ed buffers not being reclaimable, and IIRC, mesa tries to keep buffers mmap-ed
<bbrezillon> the LRU cache has 2 downsides:
<bbrezillon> 1/ entries are only evicted when BOs are allocated => a process allocating a lot and then doing nothing might keep a huge amount of memory in its BO cache
camus1 has joined #panfrost
kaspter has quit [Ping timeout: 240 seconds]
camus1 is now known as kaspter
<bbrezillon> 2/ even with this userspace-BO cache, the memory usage can get quite high pretty quickly because we don't limit the number of pending batches (actually that's also true for the madvised buffers, but at least those can be reclaimed right away if the system needs memory)
<bbrezillon> (by right away, I mean just after the GPU jobs have been flushed and executed, which means we might still have a lot of memory reserved before batches are flushed :-/)
yann has quit [Ping timeout: 256 seconds]
icecream95 has quit [Ping timeout: 246 seconds]
alpernebbi has joined #panfrost
<bbrezillon> if I read mm/shmem.c correctly, pages allocated through shmem_read_mapping_page() are charged to the task allocating them, so it lets the OOM-killer pick the right task
yann has joined #panfrost
stikonas has joined #panfrost
tomeu has joined #panfrost
nlhowell has joined #panfrost
BenG83 has joined #panfrost
chewitt has quit [Quit: Zzz..]
karolherbst has joined #panfrost
klaxa has joined #panfrost
camus1 has joined #panfrost
<bbrezillon> tried it on firefox+aquarium, and it seems to consume around 200MB (with the idle pool oscillating between 20 and 30M)
kaspter has quit [Ping timeout: 256 seconds]
camus1 is now known as kaspter
chewitt has joined #panfrost
<bbrezillon> when I close the aquarium tab most BOs stay allocated (~180M), and they are only freed when I load something else in another tab
chewitt has quit [Quit: Zzz..]
chewitt has joined #panfrost
chewitt has quit [Quit: Zzz..]
chewitt has joined #panfrost
<bbrezillon> that's the problem I was mentioning above, the userspace solution doesn't work well when apps stop using the GL context, with madvise we at least make sure the kernel can reclaim the memory (maybe we should start reclaiming pro-actively though, and there might be bugs in the madvise implem too :-/)
stikonas has quit [Remote host closed the connection]
stikonas has joined #panfrost
gcl has quit [Ping timeout: 246 seconds]
rcf has quit [Ping timeout: 264 seconds]
gcl has joined #panfrost
rcf has joined #panfrost
kaspter has quit [Ping timeout: 256 seconds]
kaspter has joined #panfrost
urjaman has quit [Read error: Connection reset by peer]
urjaman has joined #panfrost
<tomeu> narmstrong: any ideas why roughly the same deqp-gles3 run on the vim3 takes a minute longer than on the kevin (rk3399)?
<tomeu> we are especially seeing it on tests that do a lot of cpu work calculating reference images, so I'm suspecting cpufreq
<alyssa> bbrezillon: "LRU+madvise => OOM" maybe I misspoke, it's the "madvise+BO cache+swap+OOM => hang" that I'm worried about, and LRU avoids the issue that madvise was trying to solve
kaspter has quit [Remote host closed the connection]
<alyssa> so I'm now trying to figure out if "LRU + BO cache + swap + OOM" will recover gracefully in real workloads as opposed to hanging
kaspter has joined #panfrost
<alyssa> There are compelling theoretical reasons why it might but, theory and practice..
<alyssa> i.e. there's some evidence madvise might be making these worse.
<narmstrong> tomeu: good question
<narmstrong> tomeu: i saw a perf regression with the final g12b fixup for bifrost
<narmstrong> Disabling outer cache sync was much faster
<narmstrong> Do you have a heatsink on the vim3 ?
<narmstrong> Without it, it may get too hot and cpufreq may act on the GPU and CPU freqs
Green has quit [Ping timeout: 240 seconds]
Green has joined #panfrost
<bbrezillon> alyssa: well, as I said, userspace-cache-eviction doesn't work if there's no activity on the GL context, which might force the OOM to needlessly kill an app
<bbrezillon> my point is, it does cover part of the madvise feature, but some corner cases are not handled
<bbrezillon> that's not to say we shouldn't temporarily revert back to a non-madvise solution if there's a kernel bug, but I keep thinking letting madvise reclaim unused BOs is preferable to letting the OOM kill an app (or swapping pages to disk)
kaspter has quit [Quit: kaspter]
<alyssa> bbrezillon: I don't see how madvise claiming from the cache actually solves the issue.
<alyssa> Since if the app _is_ using GL (all but the corner case), it will just try to reallocate immediately afterwards, succeed, and then madvise will claim it back again
<alyssa> And the whole system will grind to a halt.
<alyssa> OOM killer at least breaks the cycle.
<alyssa> And killing the app is preferable to hanging the entire system, including "innocent" processes like the compositor.
<alyssa> (which will continue to use GL!)
<bbrezillon> madvise is only set when we return a BO to the cache, not when the BO is used
<alyssa> right, which will happen on frame n+2
<bbrezillon> which we'll do as soon as the GPU is done executing jobs, okay
<alyssa> and that madvise shrinking is not exactly free either
<alyssa> Under current assumptions, if you get to the point of madvise claiming anything, _it's already too late_
<alyssa> There is no winning except for the user to try to ctrl-c as fast as possible and hope it's enough.
<alyssa> At least letting the OOM killer do its job automates that.
enunes has joined #panfrost
<alyssa> If the issue is "we use too much memory", we can optimize our memory usage. But no matter how optimal, there _will_ come a time when a bad process (maybe not even a bad Panfrost process, but something like a compile in the background when I switch to Mattermost)...
dstzd has quit [Quit: ZNC - https://znc.in]
<alyssa> ...causes the system to hit low mem, and we need to be able to handle that gracefully. I'm speaking from experience of using Panfrost daily since Oct 2019 and having experienced thousands of system freezes: despite what it seems on paper, madvise is not graceful.
jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
<bbrezillon> well, the issue is more "the cache size is currently unbounded"
<alyssa> bbrezillon: We can bound the cache size in userspace, even something high like 128MB per process would make it bounded so a runaway app can't take down the system from cache behaviour.
<alyssa> Or more to the point, we should strongly consider limiting in-flight batches.
kaspter has joined #panfrost
<alyssa> Freedreno limits to max 32 (or maybe 64), which is plenty, and lets them use single-word bitsets for the tracking logic (as opposed to full-blown sets like we resort to, which have higher CPU overhead... but that's irrelevant to this discussion)
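(A single-word bitset for a 32-batch cap could look like this; a hypothetical sketch with made-up names, not anything Panfrost or Freedreno actually ships.)

    #include <stdint.h>
    #include <strings.h> /* ffs() */

    #define MAX_BATCHES 32

    struct panfrost_batch; /* opaque here */

    struct batch_pool {
        uint32_t busy; /* bit i set => slot i is in flight */
        struct panfrost_batch *slots[MAX_BATCHES];
    };

    /* Returns a free slot, or -1 when all 32 are in flight; the
     * caller would then flush the oldest batch first, which is what
     * bounds the memory pinned by pending batches. */
    static int batch_pool_acquire(struct batch_pool *p)
    {
        int bit = ffs(~p->busy); /* first zero bit, 1-based; 0 if none */
        if (!bit)
            return -1;
        p->busy |= 1u << (bit - 1);
        return bit - 1;
    }

    static void batch_pool_release(struct batch_pool *p, int slot)
    {
        p->busy &= ~(1u << slot);
    }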
<bbrezillon> an app using a lot of mem and being killed because of that is okay, but mesa caching things behind the app's back and not releasing them, leading to OOM killing the app, is not great. Anyway, I get your point: when we're under mem pressure, the shrinker does more harm than it helps (moving from one process to another, leading to huge delays)
<alyssa> right
<alyssa> as for the "allocate a ton then no rendering" worst case for the LRU strategy... does that actually apply to anything?
dstzd has joined #panfrost
<bbrezillon> alyssa: I gave you a real example with firefox
<alyssa> Games will be rendering constantly, compositors/desktops/x will hit that case but we explicitly _do not_ want them killed if we can help it
dstzd has quit [Client Quit]
<alyssa> right, browsers. need to give that some thought, thank you
<bbrezillon> open a tab, load a webgl page, close the tab, the mem stays around until you open another tab and load a GL app into it
<alyssa> fwiw that doesn't apply to chrome which kills the context
<bbrezillon> didn't try with chrome
<alyssa> Better question being "if we bound the size of the BO cache, and establish it's basically just Firefox that has an actual issue here, does it follow that the size of the problem is bounded for the whole system?"
<alyssa> (And if so, would Firefox cause the OOM? Or would low-mem only happen because of something else happening in the background that behaves worse, and that will get the OOM treatment, possibly not even graphics.)
<alyssa> [I am guilty of compiling -j5 with Firefox running. Killing the build is the expected thing to do.]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
<bbrezillon> well, the corner case still exists. I wonder how other drivers deal with this madvise issue...
<bbrezillon> anyway, I'm all for a quick solution (AKA disable madvise and limit the cache size + batch size)
dstzd has joined #panfrost
<alyssa> I do wonder as well.
<bbrezillon> I'm just curious why we're the only ones to have problems with madvise
dstzd has quit [Client Quit]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
stikonas has left #panfrost ["Konversation terminated!"]
rcf has quit [Quit: WeeChat 2.9]
rcf has joined #panfrost
dstzd has joined #panfrost
jernej has joined #panfrost
jernej has quit [Client Quit]
jernej has joined #panfrost
dstzd has quit [Client Quit]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
jernej has quit [Client Quit]
dstzd has joined #panfrost
jernej has joined #panfrost
dstzd has quit [Client Quit]
dstzd has joined #panfrost
jernej has quit [Remote host closed the connection]
jernej has joined #panfrost
kaspter has quit [Remote host closed the connection]
kaspter has joined #panfrost
dstzd has quit [Client Quit]
chewitt has quit [Quit: Zzz..]
jernej has quit [Remote host closed the connection]
jernej has joined #panfrost
jernej has quit [Remote host closed the connection]
jernej has joined #panfrost
jernej has quit [Client Quit]
jernej has joined #panfrost
jernej has quit [Client Quit]
jernej has joined #panfrost
jernej has quit [Client Quit]
jernej has joined #panfrost
jernej has quit [Client Quit]
yann has quit [Ping timeout: 256 seconds]
jernej has joined #panfrost
jernej has quit [Client Quit]
dstzd has joined #panfrost
dstzd has quit [Client Quit]
<alyssa> Yeah..
jernej has joined #panfrost
jernej has quit [Client Quit]
kaspter has quit [Remote host closed the connection]
kaspter has joined #panfrost
jernej has joined #panfrost
dstzd has joined #panfrost
kaspter has quit [Quit: kaspter]
alpernebbi has quit [Quit: alpernebbi]
warpme_ has quit [Quit: Connection closed for inactivity]
icecream95 has joined #panfrost
icecream95 has quit [Read error: Connection reset by peer]
icecream95 has joined #panfrost
raster has quit [Quit: Gettin' stinky!]
davidlt has quit [Ping timeout: 264 seconds]
<icecream95> Maybe everyone else has found out about zram and never has problems with OOM?
raster has joined #panfrost
nlhowell has quit [Ping timeout: 240 seconds]
nlhowell has joined #panfrost
icecream95 has quit [Read error: Connection reset by peer]
icecream95 has joined #panfrost
<icecream95> Why does doing a printf whenever a BO is allocated fix the GPU faults in Firefox?
<macc24> icecream95: timing?
nlhowell has quit [Ping timeout: 240 seconds]
nlhowell has joined #panfrost
<icecream95> Looks like the address it faults on was an imported BO...
<macc24> icecream95: are you fixing weird issues when firefox is opening a download window?
<alyssa> icecream95: wat.mp4
<icecream95> Making panfrost_bo_unreference a no-op didn't fix it...
<alyssa> bbrezillon: Hey, question -- if process A creates and exports a resource, then process B imports the resource, then process A is killed, what happens to the resource?
<alyssa> If it's freed, there are probably use-after-frees in mesa (maybe icecream95's bug related)
<alyssa> If it isn't freed, there is a memory leak: even if it's freed when A is killed, it's possible A could substantially outlive B.
<alyssa> (Specifically relevant to window framebuffers, with A being a GL client and B being the compositor)
raster has quit [Quit: Gettin' stinky!]
<anarsoul> alyssa: isn't it ref-counted in kernel?
raster has joined #panfrost
<alyssa> anarsoul: sure, does the compositor own a reference though?
<anarsoul> well, if it has a fd, then yes
<alyssa> I guess that gets freed when the compositor cleans up the window. Probably fine. Never mind, just paranoid now.
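(For the record, the cross-process sharing discussed here goes through PRIME file descriptors, and it is the fd/handle that holds the kernel-side reference. A sketch using the generic libdrm helpers:)

    #include <stdint.h>
    #include <xf86drm.h>

    /* Process A: export a GEM handle as a PRIME fd to hand to B.
     * The fd itself holds a reference on the underlying object. */
    static int export_bo(int drm_fd, uint32_t handle)
    {
        int prime_fd = -1;
        drmPrimeHandleToFD(drm_fd, handle, DRM_CLOEXEC, &prime_fd);
        return prime_fd;
    }

    /* Process B: import the fd. The resulting handle keeps the BO
     * alive even if process A dies, because the object is refcounted
     * in the kernel, as anarsoul says. */
    static uint32_t import_bo(int drm_fd, int prime_fd)
    {
        uint32_t handle = 0;
        drmPrimeFDToHandle(drm_fd, prime_fd, &handle);
        return handle;
    }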
<icecream95> Oops... The faults were happening at 0x3E013F40, but the imported BO is at 0x3e01000-0x3e01fff
raster has quit [Quit: Gettin' stinky!]