marcan changed the topic of #asahi to: Asahi Linux: porting Linux to Apple Silicon macs | General project discussion | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Topics: #asahi-dev #asahi-re #asahi-gpu #asahi-offtopic | Keep things on topic | Logs: https://alx.sh/l/asahi
<jn__>
do both CPU types (firestorm, icestorm) in the M1 SoC have this bug?
<marcan>
jn__: it's not a bug, it's a deliberate design choice in violation of the spec
<maz>
jn__: no idea.
<jn__>
marcan: ah, ok
klaus has joined #asahi
<marcan>
I think it is safe to assume that both CPU types will behave identically in details such as this; I would only expect differences in corner cases and errata
<maz>
marcan: my sentiment as well. you don't have this "bug" by accident, you have to design it as such. and write all your SW accordingly.
<marcan>
yup
<marcan>
so... as for that patch... are you happy with it as it is, or do we need to come up with another solution?
<maz>
marcan: I'm happy with it for now, and we can always have some discussion on the list. I expect Mark to nitpick at it, but that's the rule of the game!
<marcan>
fair :)
<marcan>
I'll also base it off of Mark's tree with the FIQ changes
<maz>
the really ugly part is the interaction with the cpufeature override.
<jn__>
("CPU known as Apple M1" in the patch text is a bit imprecise, because M1 the SoC has two kinds of CPU, that are at least in some regards different — hence my question)
<marcan>
fair point
<dhewg>
on that note "fruity cpu" sounds like raspberry
<marcan>
AIUI, the way these big.LITTLE systems are engineered (at least in Apple's case) is that the CPUs basically share the same design, and instead are just largely different configurations and synthesis options
<marcan>
so the power-efficient cores will have less wide dispatch, be built with less leaky / slower cells, have downsized elements, and stuff like that
<marcan>
but not so much be a completely different design
<marcan>
so you would expect them to match in behavior (which is also pretty important not to run into horribleness with SMP)
<maz>
marcan: it'd be good if they had a similar uarch. the traditional BL crap is awful to deal with.
<dhewg>
maz: yeah, I know. I've just seen various projects using "fruit" to refer to rpi or clones like banana pi
<marcan>
part of this I'm guessing based on the chicken bit names
<marcan>
some of them are the same, some of them have identical function in the E/P registers
<maz>
dhewg: well, fruit had a meaning before dev boards, so I feel perfectly fine with my comment! :-)
<dhewg>
hehe alright ;)
<marcan>
maz: oh yeah, one more thing, since I had a request for documenting ioremap_np... I haven't found any actual arch-agnostic ioremap docs, and I have no idea if I'd dare write one from scratch given all the "fun" details involved... but maybe it's worth giving a shot at writing an arm64 one?
<marcan>
at least listing out the mapping between ioremap types, MAIR settings, and what they mean
<maz>
marcan: I don't think this should be arm64-specific, as the ioremap API should have a consistent meaning across architectures (after all, that's how we write "portable" code).
<maz>
marcan: but if you can start with something that also describes the MAIR settings for arm64 as an example of an architectural implementation, that'd be great!
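A rough sketch of the kind of table being discussed here, pairing arm64 ioremap variants with memory types and MAIR attribute bytes. The struct, array and function names are hypothetical and exist only for illustration. The attribute bytes follow the architectural MAIR_ELx encoding; the pairing of each variant with a memory type is an assumption based on arch/arm64 around the time of this discussion (with ioremap_np() still a proposal), so check it against the tree before relying on it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table, for illustration only: this is not kernel code. */
struct ioremap_kind {
    const char *api;       /* ioremap variant */
    const char *mem_type;  /* arm64 memory type it is assumed to select */
    unsigned char mair;    /* MAIR_ELx attribute byte for that memory type */
};

static const struct ioremap_kind arm64_ioremap_kinds[] = {
    { "ioremap_np",    "Device-nGnRnE",        0x00 }, /* non-posted (proposed) */
    { "ioremap",       "Device-nGnRE",         0x04 }, /* posted writes allowed */
    { "ioremap_wc",    "Normal Non-cacheable", 0x44 }, /* write-combining */
    { "ioremap_cache", "Normal Write-Back",    0xff }, /* fully cached */
};

/* Return the table entry for a variant, or NULL if it has no entry. */
const struct ioremap_kind *lookup_ioremap_kind(const char *api)
{
    size_t n = sizeof(arm64_ioremap_kinds) / sizeof(arm64_ioremap_kinds[0]);

    for (size_t i = 0; i < n; i++)
        if (strcmp(arm64_ioremap_kinds[i].api, api) == 0)
            return &arm64_ioremap_kinds[i];
    return NULL;
}
```

A generic document would presumably describe each variant's guarantees first and only then list per-architecture rows like these as examples.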
<marcan>
maybe a new section in Documentation/driver-api/device-io.rst then
<marcan>
that mentions ioremap, but none of the variants
<maz>
sounds very good to me! cc Jon Corbet for this.
<marcan>
thanks, will do :)
<arnd>
marcan: if you start a Documentation text in a wiki or your favourite web based shared text editing platform, I'll help fill in some blanks for the other functions and architecture specifics
<marcan>
oh, that works for me; I can throw it on the Asahi wiki as a WIP text, or google docs or something like that if you prefer real-time collaboration as a one-off thing?
<maz>
it may be worth looking at what x86 has done in their own corner: Documentation/x86/pat.rst
<marcan>
yup, I saw that one
<arnd>
I prefer google docs, but wiki might be easier for others to join in
<arnd>
I see we are also lacking proper documentation for the differences between readl()/readl_relaxed()/ioread32()/ioread32_be()/__raw_readl()
<arnd>
I can write that section, probably should have done that years ago, because I keep having to explain it in code review ;-)
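To make the accessor differences a little more concrete, here is a userspace model, not the kernel implementation. All model_* names are made up for this sketch. It only captures two axes: byte order (readl(), readl_relaxed() and ioread32() treat the register as little-endian, ioread32be() as big-endian, __raw_readl() does no conversion) and ordering (readl() carries barriers, the relaxed and raw forms do not). The C11 fence is merely a stand-in for the arch-specific barriers.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Models __raw_readl(): native byte order, no barriers. */
uint32_t model_raw_readl(const uint8_t *reg)
{
    uint32_t v;

    memcpy(&v, reg, sizeof(v));
    return v;
}

/* Models readl_relaxed(): little-endian register, no extra barriers. */
uint32_t model_readl_relaxed(const uint8_t *reg)
{
    return (uint32_t)reg[0] | (uint32_t)reg[1] << 8 |
           (uint32_t)reg[2] << 16 | (uint32_t)reg[3] << 24;
}

/* Models readl(): same value as the relaxed form, plus ordering. */
uint32_t model_readl(const uint8_t *reg)
{
    uint32_t v = model_readl_relaxed(reg);

    /* Stand-in for the arch barrier that orders the read against later accesses. */
    atomic_thread_fence(memory_order_seq_cst);
    return v;
}

/* Models ioread32be(): big-endian register. */
uint32_t model_ioread32be(const uint8_t *reg)
{
    return (uint32_t)reg[0] << 24 | (uint32_t)reg[1] << 16 |
           (uint32_t)reg[2] << 8 | (uint32_t)reg[3];
}
```

The model deliberately leaves out what the barriers order against (DMA, spinlocks, other MMIO), which is the part that needs the long write-up.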
<marcan>
yeah, honestly I think Documentation/driver-api/device-io.rst probably needs a big revamp to make this all clearer, and also give arches a template to work off of
<marcan>
google docs should work for everyone, I can just have by-link access
<marcan>
it should be a short-term thing anyway, not a living doc
<arnd>
right
<marcan>
how about I de-wrap Documentation/driver-api/device-io.rst into a doc, we can mess with it there, and then I can take care of re-wrapping it for the submission?
<marcan>
(I'll make sure not to spuriously re-wrap stuff we didn't touch, but it's easier to do all that in one final pass anyway rather than try to keep stuff wrapped)
<arnd>
marcan: I'd prefer to just add one or two sections to the document for the moment and not touch the other bits
<marcan>
oh sure, but I figured I might as well have the whole doc in there for context
<marcan>
(bbiab, need to make a quick trip to the pharmacy)
<marcan>
(comments welcome, I'm no expert on the ioremap_*() semantics :))
plainbits has joined #asahi
<marcan>
(uhh nevermind the pharmacy, mission abort on that)
<arnd>
My brain dump on readl/writel gets fairly long, but that probably means it was indeed necessary to document it properly
<marcan>
absolutely
<marcan>
it's very hard to wrap your head around this stuff if you're not already familiar with the semantics
<marcan>
I had to think pretty hard about readl() vs _relaxed for AIC and I'm still not 100% on the details
<arnd>
marcan: I think the description of ioremap_wc() is misleading as it sounds like this is cached plus allows write-combining, while I think it is the reverse:
<arnd>
ioremap_wc() is like ioremap(), plus write-combining
<arnd>
ioremap_wt() is like ioremap_wc(), plus it is cached (but not write-back)
<marcan>
ah, right, I have them in the wrong order then
amw has joined #asahi
<arnd>
I have to read up on previous mailing list discussions about ioremap_uc(), I think this is also wrong. It's definitely specific to x86 and ia64
<marcan>
it's used in... two drivers outside of arch/
<marcan>
atyfb and intel-lpss
<arnd>
right, there was a larger cleanup to remove it from almost all drivers that previously used it
<arnd>
see "git grep ioremap_nocache v3.0"
<arnd>
a8ff78f7f773 ("mfd: intel-lpss: Use devm_ioremap_uc for MMIO") explains one of the two, I think the other is similar but old
<marcan>
well I just realized something silly perhaps
<marcan>
ioremap_wt is not implemented on ARM64, which means it falls back to ioremap
<marcan>
but ioremap_wc is
<marcan>
that feels odd, if the order is ioremap > ioremap_wc > ioremap_wt, then ioremap_wt should fall back to ioremap_wc, not ioremap
<arnd>
that sounds correct, yes
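The fallback question can be modelled with the same #ifndef pattern the generic headers use. This is a userspace sketch with made-up model_* names, assuming an architecture that provides plain ioremap and _wc but no _wt; it is not the actual header text.

```c
/* Mapping strengths, ordered ioremap > ioremap_wc > ioremap_wt as in the discussion. */
enum mem_type { MT_DEVICE, MT_WRITE_COMBINE, MT_WRITE_THROUGH };

/* What the hypothetical architecture provides: plain ioremap and _wc, no _wt. */
enum mem_type model_ioremap(void)    { return MT_DEVICE; }
enum mem_type model_ioremap_wc(void) { return MT_WRITE_COMBINE; }
#define model_ioremap_wc model_ioremap_wc  /* tells the generic code _wc exists */

/* Generic fallback as described above: a missing _wt becomes plain ioremap. */
#ifndef model_ioremap_wt
#define model_ioremap_wt model_ioremap
#endif

/* Alternative under discussion: degrade one step at a time, to _wc. */
#ifndef model_ioremap_wt_stepwise
#define model_ioremap_wt_stepwise model_ioremap_wc
#endif
```

With these definitions model_ioremap_wt() yields MT_DEVICE while model_ioremap_wt_stepwise() yields MT_WRITE_COMBINE, which is the difference being pointed out.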
amw has quit [Ping timeout: 272 seconds]
<arnd>
Don't know why there is no real ioremap_wt() though
<marcan>
yeah, it's a thing anyway, so it should be implemented
<arnd>
arm32 has "#define ioremap_wt ioremap_wc"
<marcan>
yeah
<marcan>
we just need #define ioremap_wt(addr, size) __ioremap((addr), (size), __pgprot(PROT_NORMAL_WT))
<marcan>
I hope that doesn't break any broken drivers that wrongly used it :)
<marcan>
(cc maz, any idea why we don't have this yet?)
<j`ey>
there was an ioremap_wt that was removed in d092a87073269677b7ff09e71a8d91912b7f969a, but it used PROT_DEVICE_nGnRE
<j`ey>
oh nvm, that's just the same as default ioremap
<marcan>
looks like that was a multi-arch NFC cleanup patch
<marcan>
the better question is why was it nGnRE to begin with
<arnd>
probably copied from arm32
<j`ey>
556269c138a8b2d3f5714b8105fa6119ecc505f2 added it
<arnd>
yes, found the same
<marcan>
the only things using ioremap_wt are certain framebuffers, and I bet there are 0 instances of any of those being used on ARM64
<arnd>
looks like I was on Cc, but none of the arm (or other) architecture maintainers were
<marcan>
and z2ram
<marcan>
amiga, atari, powermac...
<arnd>
efifb was changed to memremap()
<marcan>
this makes me wonder why other FBs do not use _wt, but just _wc
<arnd>
my guess is that they predated the ioremap_wt interface, so they used what they could get
<arnd>
drivers/video/fbdev is 95% obsolete
<marcan>
but it's the other way around; a few old-looking drivers are using ioremap_wt, while many more are using _wc
<arnd>
those are the even older ones that are m68k specific
<marcan>
yeah, so nothing outside of m68k uses ioremap_wt
<arnd>
and powermac
<marcan>
right
<marcan>
but on the face of it ioremap_wt sounds appropriate for framebuffers? so why does nobody use it for modern hardware?
<arnd>
I suspect nobody cares enough about the speed of the text console on fbdev
<arnd>
anything else is in user space, which uses a separate mapping
<marcan>
I need to slap some kernel devs then, you have no idea how much slow fbcons frustrate me at the worst times ;) (as recently as a few months ago I had one of those, I still have a video somewhere...)
<arnd>
marcan: be careful, or you might end up as the fbdev subsystem maintainer
<marcan>
nooooooo
<arnd>
I think this has changed hands more frequently than any other subsystem. Usually it's someone who needs to get something done, but then gets frustrated, moves on to other work and abandons it again
<sven>
sounds like the perfect job for marcan! :P
<marcan>
sven: I hate you :D
<sven>
I know ;)
aaronsmall99a[m] has left #asahi ["User left"]
amw has joined #asahi
<marcan>
arnd: incidentally, the thing seems to be fully coherent on M1, as we're getting away with normal cached in m1n1 for the FB
<marcan>
yay for unified memory
<marcan>
simplefb still tries to map it with something less efficient though
<maz>
marcan: I'm not surprised. they have also implemented ARMv8.4-FWB, which is going to be a godsend for KVM.
<maz>
no more silly cache maintenance!
<marcan>
huh, does it require icache coherence?
<marcan>
I don't think icache is coherent here (i.e. you still need cache maintenance if you touch code)
<maz>
icache coherency isn't that big a deal.
<maz>
the killer is the PoC flush on each page being mapped into the guest.
<maz>
(because the guest could run with MMU off or an NC mapping)
<marcan>
right...
<marcan>
(I'm just going off the commit msg of e48d53a9 )
<j`ey>
marcan: lol ok, i thought it might be slow, but not that slow
<marcan>
(afk, 10m)
<never_released>
maz: Apple M1 is defined by Apple as "ARMv8.5 minus BTI"
<never_released>
which is pretty nice
<arnd>
I have this vague memory of PCI prefetchable memory BARs being mapped to write-combining page mappings on some architectures, but can't find that now
<never_released>
marcan: yes, Arm designs with coherent icache are quite rare
<arnd>
I see that some old x86 fbdev drivers force write-combining with MTRR
<never_released>
(Neoverse N1, but is broken on almost-all-revisions-but-the-recent-ones there, and NVIDIA Denver are the ones that you can easily get access to)
<never_released>
I'm wondering if memory types is why Apple didn't support eGPUs
nonemu has joined #asahi
<arnd>
Ah, I found arch_can_pci_mmap_wc(), which makes the user space mapping for a PCI BAR write-combined when it comes from an IORESOURCE_PREFETCH BAR
<arnd>
but it does not affect the in-kernel mapping
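A simplified model of that user-space decision, for readers who have not looked at the PCI mmap code. The function name and its parameters are hypothetical; the flag value is meant to match IORESOURCE_PREFETCH in include/linux/ioport.h, and the assumption that user space also has to ask for write-combining explicitly should be checked against the PCI sysfs and procfs code.

```c
#include <stdbool.h>

/* Assumed to match the kernel's IORESOURCE_PREFETCH flag value. */
#define IORESOURCE_PREFETCH 0x00002000UL

/*
 * Hypothetical helper: would a user space mapping of this BAR be
 * write-combined? Models three conditions only:
 *   - the architecture opts in (what arch_can_pci_mmap_wc() reports),
 *   - user space asked for a write-combined mapping (assumption),
 *   - the BAR is marked prefetchable.
 */
bool user_bar_gets_wc(unsigned long res_flags, bool arch_can_wc,
                      bool wc_requested)
{
    if (!arch_can_wc || !wc_requested)
        return false;
    return (res_flags & IORESOURCE_PREFETCH) != 0;
}
```

Nothing comparable is applied to the in-kernel mapping, which is the asymmetry raised here.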
<never_released>
a thousand-dollar question: what happens if we map the same range with different memory types for kernel and for user-space
<never_released>
I'd expect things to blow up essentially
<arnd>
and now I read proc_bus_pci_ioctl() and wish I had never learned about that
<arnd>
never_released: on powerpc it blows up spectacularly, on arm you technically get undefined behavior but it will (mostly?) work in practice
<never_released>
and on some Arm designs, memory model guarantees vary too depending on the memory type in some designs
<arnd>
I have seen multiple reports about broken framebuffer access on arm64, in particular with unaligned cacheable writes to the framebuffer, but I think that is a separate bug
<never_released>
ugh, I repeated myself above
<never_released>
> For coherent memory types, Carmel cores provide a single, sequentially consistent view of coherent memory. Accordingly if no non-coherent access, Cache maintenance or TLB maintenance instruction has been executed since the last memory barrier, memory barriers behave similarly to a single-cycle NOP.
<never_released>
arm64 is everything from that to weakly-ordered for everything
<never_released>
(except some device memory types)
<arnd>
never_released: I thought it was NVIDIA who wanted RISC-V to adopt the weakest possible ordering model, I wonder if I mixed that up or if it was simply a different part of the company with different requirements
<never_released>
arnd: NVIDIA's own CPU cores all implement sequential consistency
<never_released>
as the memory model
<arnd>
never_released: what about the GPUs?
<never_released>
arnd: weakly-ordered there of course
<arnd>
ok, then I probably remembered correctly, as they have been talking about RISC-V in their GPUs but seem to favor ARM for applications cores
<never_released>
for the firmware cores in the GPUs
<never_released>
already shipped in Ampere as far as I know
<marcan>
I don't buy this being about their hardware "display pipes"
<marcan>
more like the display pipe code ;)
mechpilotace has quit [Quit: WeeChat 2.8]
<marcan>
never_released: playing with xrandr transforms on my old iMac 2015 retina results in glitchless scaling changes; and of course all modern GPUs must support page flipping
<marcan>
those are the only two building blocks you need to do glitchless resolution switching, *if* the software stack is up to par
<marcan>
for some reason going to 1080p scaled to 4K glitches, but going back from that to plain 4K does not at all, other than windows moving around and stuff because X11 sucks at this
mechpilotace has joined #asahi
<marcan>
so, glitch-free resolution switching is clearly a software problem and has been for many years
vimal has joined #asahi
<marcan>
actually I think there might be 1 frame of black with the xrandr change here, but I'm sure that's the driver being stupid; obviously there is no output mode re-configuration as otherwise you get a very obvious slow black blink including a fade-in at the end (which is part of Apple's display controller)
<marcan>
I should try wayland again one of these days
<j`ey>
is resolution change that big of a deal? maybe I'm just not fussy about these kinda things :P
<marcan>
it's just one of those polish things
<marcan>
either way, this isn't possible with external displays if you *actually* change the signal resolution (not just GPU-side scaling), because there's no way of stopping external displays from blinking for a whole second or more on mode changes
<arnd>
Our mac mini on a 27" screen randomly comes up with text much too small or much too big. Changing the resolution from 4K to 1080p and back usually gets it into the other mode. There is probably a setting somewhere to change the font size to the desired value, but I have never found that. "many years ahead of the industry" ;-)
<marcan>
the M1 is many years behind on bug fixes ;)
<marcan>
I've seen graphical corruption in the settings menu of all places
<marcan>
this launch was very, very obviously rushed, and it's rough around the edges
<marcan>
to be fair, that *is* a loud song... and the remasters are even louder :/
<marcan>
japan has a real problem with the loudness war
<marcan>
but I'm guessing you wouldn't be the kind to enjoy my own music then ;)
<marcan>
at least I don't crush it with a limiter though :p
<marcan>
arnd: moved the relevant ioremap_uc() points to ioremap_wc() then
<j`ey>
marcan: listening to Endless Scarlet Night
<marcan>
ha :)
<marcan>
original or soundflora's remix?
<j`ey>
marcan: Tsuioku Circuit linked from your yt
<marcan>
yeah, but there are two versions
<j`ey>
ah, not the soundflora one
<marcan>
hers is much groovier than mine :)
* eta
likes the original endless scarlet night
<marcan>
(also turn on the english subs if you haven't yet)
<marcan>
arnd: do you have any idea about what the store buffer story is with _wc, if any?
<eta>
re Wayland: it's nice, but be prepared to run bleeding edge if you want stuff like screen sharing to work
<marcan>
AIUI write combining technically means stuff can stick around in the store buffer for a while, possibly forever if no other stores happen? but I'm not sure if we kind of assume that real CPUs will flush that at some point, or there is some barrier that forces that
<marcan>
eta: for a long time I could not use Wayland because there was no standard middle-button-click protocol (and thus KDE did not support it) and I refuse to give that up
<marcan>
for paste I mean
<eta>
marcan: yeah, this is par for the course with Wayland IME
<marcan>
now it does, but I'm a bit scared because I still have a lot of legacy X11 software doing very X11-ish things
<eta>
there is no "standard" screen sharing protocol
<marcan>
I'm getting the feeling this M1 stuff is also going to become my own personal wayland testbed, good excuse to try it on the new platform
<eta>
there's the undocumented gnome framebuffer thing and then xdg-desktop-portal
<eta>
the latter "works" but only under Firefox with some config flags twiddled and with pipewire running
<marcan>
hah
* eta
ends up using OBS to screen share mostly
<marcan>
well, that's also a good excuse to use pipewire :)
<eta>
(which has a hacky pipewire source)
<eta>
I love pipewire
<marcan>
amusingly enough though, I don't actually care about OBS screen sharing any more... since I use HDMI capture with a different PC for that
<eta>
so nice
<marcan>
but OTOH I need support for scaled mirroring
<marcan>
(which xrandr can do, no idea about wayland)
<marcan>
I really *really* want to replace JACK+PA with PW
* eta
shrugs
<eta>
probably depends on your compositor
<eta>
marcan: do it!!
<marcan>
but first I need to spend a weekend finishing that libffado plugin I started writing for it
<eta>
ah
<eta>
oh yeah, firewire
<marcan>
I got audio in/out working at least, but it's still broken-ish
<arnd>
marcan: as far as I can tell, the use of a store buffer is what does the write-combining, so you can't have one without the other
<eta>
fwiw you can run PW as a JACK client
<marcan>
need to polish it into something usable
<eta>
that's cursed though
<eta>
it's easiest if you just replace libjack with pw
<marcan>
arnd: right, but are there any guarantees as to the lifetime of writes in the store buffer, or operations guaranteed to flush them?
<arnd>
so no guarantees about completion with _wc
<marcan>
eta: yeah I don't want to do that, half the reason to run PW is to hopefully avert JACK's brokenness
<marcan>
JACK1 and JACK2 both have *different* problems
<marcan>
arnd: i.e. if there are zero guarantees, then technically _wc is like _cached, but that isn't how it is used in practice
<arnd>
marcan: my understanding is that normal ioremap() being posted already requires a read-back to flush it (as you write there), so that would be the same with _wc
<marcan>
ah, so reads are guaranteed to flush the wc buffer?
<marcan>
though it's still not quite the same thing; ioremap() being posted means you need a read to ensure ordering, but a write via ioremap() *is* guaranteed to hit the device in due time
<arnd>
I would assume so, but I don't know for sure
<marcan>
while something stuck in the wc buffer isn't
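The read-back idiom for posted writes through a plain ioremap() mapping looks roughly like this. It is a userspace stand-in with a hypothetical helper name; in a driver the two accesses would be writel() and readl() on an __iomem pointer. Whether the same read also drains a write-combining buffer is exactly the open question in this exchange, so the sketch makes no claim about _wc.

```c
#include <stdint.h>

/*
 * Hypothetical helper modelling "write, then read back to flush".
 * With posted writes the store may still be in flight when it retires;
 * a read from the same device cannot return until that earlier write
 * has reached it, so the caller knows the write has landed once the
 * read completes.
 */
uint32_t write_then_read_back(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;   /* posted write: fire and forget */
    return *reg;  /* read-back: completes only after the write has */
}
```

The returned value is often ignored; the read is issued for its ordering effect rather than its result.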
<arnd>
isn't there an implementation specific upper bound on the write buffer?
<marcan>
time to pull up some docs
<marcan>
let me see what PPC750CL says about this (which I happen to have some experience with)
<arnd>
I mean for write-back cached mappings there isn't, because nothing necessarily enforces the writeback, but for a normal write-buffer I would assume that it gets flushed out as soon as there is bandwidth available on the bus
odmir has joined #asahi
<marcan>
I don't think that is necessarily the case, because you could end up with somewhat pathological cases
<marcan>
but it probably depends on the implementation
<kettenis>
my understanding is that you need a memory barrier to guarantee that store buffers have been flushed
<kettenis>
btw, the reason nobody bothers mapping framebuffers write-through is that most graphics code these days only writes to the framebuffer
<kettenis>
so it doesn't really offer a speed benefit over write-combining
<kettenis>
(write-combining tends to speed things up a lot though)
<marcan>
ok, so the ARM ARM actually says writes must reach the endpoint in finite time for NC/WT modes
<marcan>
so clearly staying forever in a write buffer is not allowable
<arnd>
kettenis: good point, no reason to throw away your dcache when filling a 4K framebuffer
<arnd>
marcan: ok, let's assume that is the sensible behavior then. Without this, you could end up never seeing a framebuffer update
<marcan>
yeah
<marcan>
I think we can just say nothing in that case
<arnd>
ok
<arnd>
marcan: do you mind if I share the current draft on another channel with more kernel developers? I'd like to find out if anyone else knows why mapping a prefetchable PCI resource from user space uses a _wc mapping while pci_ioremap_bar() doesn't
<marcan>
sure, go for it!
<marcan>
(where?)
<arnd>
it's an invitation-only channel, I could never figure out who exactly qualifies to get invited
<marcan>
oh, huh
<marcan>
well, I guess I shouldn't be surprised the kernel cabal has one of those too ;)
<kettenis>
hardware that incorrectly tags a BAR as prefetchable exists
<arnd>
kettenis: that would explain the in-kernel interface, but not the user space side