The first conformant M1 GPU driver

from On Life and Lisp [alt+shift+b] in programming

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux! For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers. Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry. To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the...

over a year ago

Remove from reading list Add to reading list [alt+a] Read now [→]

Comments

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from On Life and Lisp

Dissecting the Apple M1 GPU, the end

In 2020, Apple released the M1 with a custom GPU. We got to work reverse-engineering the hardware and porting Linux. Today, you can run Linux on a range of M1 and M2 Macs, with almost all hardware working: wireless, audio, and full graphics acceleration. Our story begins in December 2020, when Hector Martin kicked off Asahi Linux. I was working for Collabora working on Panfrost, the open source Mesa3D driver for Arm Mali GPUs. Hector put out a public call for guidance from upstream open source maintainers, and I bit. I just intended to give some quick pointers. Instead, I bought myself a Christmas present and got to work. In between my university coursework and Collabora work, I poked at the shader instruction set. One thing led to another. Within a few weeks, I drew a triangle. In 3D graphics, once you can draw a triangle, you can do anything. Pretty soon, I started work on a shader compiler. After my final exams that semester, I took a few days off from Collabora to bring up an OpenGL driver capable of spinning gears with my new compiler. Over the next year, I kept reverse-engineering and improving the driver until it could run 3D games on macOS. Meanwhile, Asahi Lina wrote a kernel driver for the Apple GPU. My userspace OpenGL driver ran on macOS, leaving her kernel driver as the missing piece for an open source graphics stack. In December 2022, we shipped graphics acceleration in Asahi Linux. In January 2023, I started my final semester in my Computer Science program at the University of Toronto. For years I juggled my courses with my part-time job and my hobby driver. I faced the same question as my peers: what will I do after graduation? Maybe Panfrost? I started reverse-engineering of the Mali Midgard GPU back in 2017, when I was still in high school. That led to an internship at Collabora in 2019 once I graduated, turning into my job throughout four years of university. During that time, Panfrost grew from a kid’s pet project based on blackbox reverse-engineering, to a professional driver engineered by a team with Arm’s backing and hardware documentation. I did what I set out to do, and the project succeeded beyond my dreams. It was time to move on. What did I want to do next? Finish what I started with the M1. Ship a great driver. Bring full, conformant OpenGL drivers to the M1. Apple’s drivers are not conformant, but we should strive for the industry standard. Bring full, conformant Vulkan to Apple platforms, disproving the myth that Vulkan isn’t suitable for Apple hardware. Bring Proton gaming to Asahi Linux. Thanks to Valve’s work for the Steam Deck, Windows games can run better on Linux than even on Windows. Why not reap those benefits on the M1? Panfrost was my challenge until we “won”. My next challenge? Gaming on Linux on M1. Once I finished my coursework, I started full-time on gaming on Linux. Within a month, we shipped OpenGL 3.1 on Asahi Linux. A few weeks later, we passed official conformance for OpenGL ES 3.1. That put us at feature parity with Panfrost. I wanted to go further. OpenGL (ES) 3.2 requires geometry shaders, a legacy feature not supported by either Arm or Apple hardware. The proprietary OpenGL drivers emulate geometry shaders with compute, but there was no open source prior art to borrow. Even though multiple Mesa drivers need geometry/tessellation emulation, nobody did the work to get there. My early progress on OpenGL was fast thanks to the mature common code in Mesa. It was time to pay it forward. Over the rest of the year, I implemented geometry/tessellation shader emulation. And also the rest of the owl. In January 2024, I passed conformance for the full OpenGL 4.6 specification, finishing up OpenGL. Vulkan wasn’t too bad, either. I polished the OpenGL driver for a few months, but once I started typing a Vulkan driver, I passed 1.3 conformance in a few weeks. What remained was wiring up the geometry/tessellation emulation to my shiny new Vulkan driver, since those are required for Direct3D. Et voilà, Proton games. Along the way, Karol Herbst passed OpenCL 3.0 conformance on the M1, running my compiler atop his “rusticl” frontend. Meanwhile, when the Vulkan 1.4 specification was published, we were ready and shipped a conformant implementation on the same day. After that, I implemented sparse texture support, unlocking Direct3D 12 via Proton. …Now what? Ship a great driver? Check. Conformant OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0? Check. Conformant Vulkan 1.4? Check. Proton gaming? Check. That’s a wrap. We’ve succeeded beyond my dreams. The challenges I chased, I have tackled. The drivers are fully upstream in Mesa. Performance isn’t too bad. With the Vulkan on Apple myth busted, conformant Vulkan is now coming to macOS via LunarG’s KosmicKrisp project building on my work. Satisfied, I am now stepping away from the Apple ecosystem. My friends in the Asahi Linux orbit will carry the torch from here. As for me? Onto the next challenge!

5 days ago • 14 votes

Vulkan 1.4 sur Asahi Linux

English version follows. Aujourd’hui, Khronos Group a sorti la spécification 1.4 de l’API graphique standard Vulkan. Le projet Asahi Linux est fier d’annoncer le premier pilote Vulkan 1.4 pour le matériel d’Apple. En effet, notre pilote graphique Honeykrisp est reconnu par Khronos comme conforme à cette nouvelle version dès aujourd’hui. Ce pilote est déjà disponible dans nos dépôts officiels. Après avoir installé Fedora Asahi Remix, executez dnf upgrade --refresh pour obtenir la dernière version du pilote. Vulkan 1.4 standardise plusieurs fonctionnalités importantes, y compris les horodatages et la lecture locale avec le rendu dynamique. L’industrie suppose que ces fonctionnalités devront être plus courantes, et nous y sommes préparés. Sortir un pilote conforme reflète notre engagement en faveur des standards graphiques et du logiciel libre. Asahi Linux est aussi compatible avec OpenGL 4.6, OpenGL ES 3.2, et OpenCL 3.0, tous conformes aux spécifications pertinentes. D’ailleurs, les notres sont les seules pilotes conformes pour le materiel d’Apple de n’importe quel standard graphique. Même si le pilote est sorti, il faut encore compiler une version expérimentale de Vulkan-Loader pour accéder à la nouvelle version de Vulkan. Toutes les nouvelles fonctionnalités sont néanmoins disponsibles comme extensions à notre pilote Vulkan 1.3 pour en profiter tout de suite. Pour plus d’informations, consultez l’article de blog de Khronos. Today, the Khronos Group released the 1.4 specification of Vulkan, the standard graphics API. The Asahi Linux project is proud to announce the first Vulkan 1.4 driver for Apple hardware. Our Honeykrisp driver is Khronos-recognized as conformant to the new version since day one. That driver is already available in our official repositories. After installing Fedora Asahi Remix, run dnf upgrade --refresh to get the latest drivers. Vulkan 1.4 standardizes several important features, including timestamps and dynamic rendering local read. The industry expects that these features will become more common, and we are prepared. Releasing a conformant driver reflects our commitment to graphics standards and software freedom. Asahi Linux is also compatible with OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the relevant specifications. For that matter, ours are the only conformant drivers on Apple hardware for any graphics standard graphics. Although the driver is released, you still need to build an experimental version of Vulkan-Loader to access the new Vulkan version. Nevertheless, you can immediately use all the new features as extensions in Vulkan 1.3 driver. For more information, see the Khronos blog post.

9 months ago • 105 votes

AAA gaming on Asahi Linux

Gaming on Linux on M1 is here! We’re thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0. Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today’s release is an alpha, Control runs well! Installation First, install Fedora Asahi Remix. Once installed, get the latest drivers with dnf upgrade --refresh && reboot. Then just dnf install steam and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead. The stack Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference: FEX emulates x86 on Arm. Wine translates Windows to Linux. DXVK and vkd3d-proton translate DirectX to Vulkan. There’s one curveball: page size. Operating systems allocate memory in fixed size “pages”. If an application expects smaller pages than the system uses, they will break due to insufficient alignment of allocations. That’s a problem: x86 expects 4K pages but Apple systems use 16K pages. While Linux can’t mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you’re happy because you can play Fallout 4. Vulkan The final piece is an adult-level Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK support. Let’s look at some new features. Tessellation Tessellation enables games like The Witcher 3 to generate geometry. The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today’s talk at XDC2024. Geometry shaders Geometry shaders are an older, cruder method to generate geometry. Like tessellation, the M1 lacks geometry shader hardware so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don’t need to be fast – just fast enough for games like Ghostrunner. Enhanced robustness “Robustness” permits an application’s shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1. Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires VK_EXT_robustness2, a Vulkan extension strengthening robustness. Like before, we implement robustness with compare-and-select instructions. A naïve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load. load R, buffer, index * 16 ulesel R[0], index, size, R[0], 0 ulesel R[1], index, size, R[1], 0 ulesel R[2], index, size, R[2], 0 ulesel R[3], index, size, R[3], 0 There’s a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects. ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi load R, buffer, index * 16 Two instructions, not four. Next steps Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don’t require sparse, like Cyberpunk 2077. While many games are playable, newer AAA titles don’t hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed. Beyond gaming, we’re adding general purpose x86 emulation based on this stack. For more information, see the FAQ. Today’s alpha is a taste of what’s to come. Not the final form, but enough to enjoy Portal 2 while we work towards “1.0”. Acknowledgements This work has been years in the making with major contributions from… Alyssa Rosenzweig Asahi Lina chaos_princess Davide Cavalca Dougall Johnson Ella Stanforth Faith Ekstrand Janne Grunau Karol Herbst marcan Mary Guillemard Neal Gompa Sergio López TellowKrinkle Teoh Han Hui Rob Clark Ryan Houdek … Plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the magic of open source. We hope you enjoy the magic. Happy gaming.

10 months ago • 94 votes

Vulkan 1.3 on the M1 in 1 month

u{text-decoration-thickness:0.09em;text-decoration-color:skyblue} Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without “portability” waivers. Honeykrisp is not yet released for end users. We’re continuing to add features, improve performance, and port to more hardware. Source code is available for developers. HoloCure running on Honeykrisp ft. DXVK, FEX, and Proton. Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand’s open source NVK driver for NVIDIA GPUs. In her words: All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I’m building NVK with all the best practices we’ve developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized. Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA’s desktop architecture differs from the M1’s mobile roots. In exchange, we get a modern driver designed for desktop games. We’ll need to pass a half-million tests ensuring correctness, submit the results, and then we’ll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver… can we write a driver passing the Vulkan 1.3 conformance test suite faster than the 30 day review period? It’s unprecedented… Challenge accepted. April 2 It begins with a text. Faith… I think I want to write a Vulkan driver. Her advice? Just start typing. Thre’s no copy-pasting yet – we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting “NVK” to Asahi Lina’s kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay. April 3 To access resources, GPUs use “descriptors” containing the address, format, and size of a resource. Vulkan bundles descriptors into “sets” per the application’s “descriptor set layout”. When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware’s data structures. As our descriptors differ from NVIDIA’s, our next task is adapting NVK’s descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add. April 4 With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 “control streams”, then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works. That’s enough to move on to “copies” of buffers and images. We implement Vulkan’s copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes. April 5 Fleshing out yesterday’s code, all copy tests pass. April 6 We’re ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That’s straightforward, but there’s a lot of state to handle. Faith’s code collects all “dynamic state” into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on. April 7 What makes state “dynamic”? Dynamic state can change without recompiling shaders. By contrast, static state is baked into shader binaries called “pipelines”. If games create all their pipelines during a loading screen, there is no compiler “stutter” during gameplay. The idea hasn’t quite panned out: many game developers don’t know their state ahead-of-time so cannot create pipelines early. In response, Vulkan has made ever more state dynamic, punctuated with the EXT_shader_object extension that makes pipelines optional. We want full dynamic state and shader objects. Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1’s designers bet on pipelines. Faced with this hardware, a reasonable driver developer would double-down on pipelines. DXVK would stutter, but we’d pass conformance. I am not reasonable. To eliminate stuttering in OpenGL, we make state dynamic with four strategies: Conditional code. Precompiled variants. Indirection. Prologs and epilogs. Wait, what-a-logs? AMD also bakes state into shaders… with a twist. They divide the hardware binary into three parts: a prolog, the shader, and an epilog. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that’s fast and doesn’t stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too. For Honeykrisp, let’s follow NVK’s lead and treat all state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs. Putting it together, we get a (dynamic) triangle: April 8 Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours. /* Translate an American VkBorderColor into a Canadian agx_border_colour */ enum agx_border_colour translate_border_color(VkBorderColor color) { switch (color) { case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK: return AGX_BORDER_COLOUR_TRANSPARENT_BLACK; ... } } Test results are getting there. Pass: 149770, Fail: 7741, Crash: 2396 That’s good enough for vkQuake. April 9 Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to the finish line. Pass: 255209, Fail: 3818, Crash: 599 98.3% pass rate for 1.3 on our 1 week anniversary. Not bad. April 10 SuperTuxKart has a Vulkan renderer. April 11 Zink works too. April 12 I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it’s resolved within a few weeks. April 16 The tests for “descriptor indexing” revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1’s shuffle instruction is quirky, but it’s easy to workaround. Fixing that fixes the descriptor indexing tests. April 17 A few tests crash inside our register allocator. Their shaders contain a peculiar construction: if (condition) { while (true) { } } condition is always false, but the compiler doesn’t know that. Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. “All loops contain a break” seems obvious for a shader, but it’s false. It’s straightforward to fix register allocation, but what a doozy. April 18 Remember copies? They’re slow, and every frame currently requires a copy to get on screen. For “zero copy” rendering, we need enough Linux window system integration to negotiate an efficient surface layout across process boundaries. Linux uses “modifiers” for this purpose, so we implement the EXT_image_drm_format_modifier extension. And by implement, I mean copy. Copies to avoid copies. April 20 “I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac.” … “Ma’am, this is a Wendy’s.” April 22 As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don’t pre-pack control words during pipeline creation. That adds theoretical CPU overhead. Is that a problem? After some optimization, vkoverhead says we’re pushing 100 million draws per second. I think we’re okay. April 24 Time to light up YCbCr. If we don’t use special YCbCr hardware, this feature is “software-only”. However, it touches a lot of code. It touches so much code that Mohamed Ahmed spent an entire summer adding it to NVK. Which means he spent a summer adding it to Honeykrisp. Thanks, Mohamed ;-) April 25 Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque “query pool”. The result can be copied from the query pool on the CPU or GPU. For the CPU, the driver maps the pool’s internal data structure and copies the result. This may require nontrivial repacking. For the GPU, we need to repack in a compute shader. That’s harder, because we can’t just run C code on the GPU, right? …Actually, we can. A little witchcraft makes GPU query copies as easy as C. void copy_query(struct params *p, int i) { uintptr_t dst = p->dest + i * p->stride; int query = p->first + i; if (p->available[query] || p->partial) { int q = p->index[query]; write_result(dst, p->_64, p->results[q]); } ... } April 26 The final boss: border colours, hard mode. Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours: (0, 0, 0, 0) – transparent black (0, 0, 0, 1) – opaque black (1, 1, 1, 1) – opaque white We handled these on April 8. Unfortunately, there are two problems. First, we need custom border colours for Direct3D compatibility. Both DXVK and vkd3d-proton require the EXT_custom_border_color extension. Second, there’s a subtle problem with our hardware, causing dozens of fails even without custom border colours. To understand the issue, let’s revisit texture descriptors, which contain a pixel format and a component reordering swizzle. Some formats are implicitly reordered. Common “BGRA” formats swap red and blue for historical reasons. The M1 does not directly support these formats. Instead, the driver composes the swizzle with the format’s reordering. If the application uses a BARB swizzle with a BGRA format, the driver uses an RABR swizzle with an RGBA format. There’s a catch: swizzles apply to the border colour, but formats do not. We need to undo the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn’t give us that information. Without custom border colour support, we “should” be okay. Swapping red and blue doesn’t change anything if the colour is white or black. There’s an even subtler catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format… but with reversed “endianness”, swapping red and alpha. That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn’t change the result. The problem is opaque black: (0, 0, 0, 1). Swapping red and alpha gives (1, 0, 0, 0). Transparent red? Uh-oh. We’re stuck. No known hardware configuration implements correct Vulkan semantics. Is hope lost? Do we give up? A reasonable person would. I am not reasonable. Let’s jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1’s custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support. As you know, I am not reasonable. Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we’ll inject code to fix up the border colour. This emulation is simple, correct, and slow. We’ll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests. April 27 All that’s left is some last minute bug fixing, and… Pass: 686930, Fail: 0 Success. The future The next task is implementing everything that DXVK and vkd3d-proton require to layer Direct3D. That includes esoteric extensions like transform feedback. Then Wine and an open source x86 emulator will run Windows games on Asahi Linux. That’s getting ahead of ourselves. In the mean time, enjoy Linux games with our conformant OpenGL 4.6 drivers… and stay tuned. Baby Storm running on Honeykrisp ft. DXVK, FEX, and Proton.

a year ago • 143 votes

Conformant OpenGL 4.6 on the M1

For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers. Already installed? Just dnf –refresh upgrade. Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender, Ryujinx, and Citra. Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2. While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan. Compared to 4.1, OpenGL 4.6 adds dozens of required features, including: Robustness SPIR-V Clip control Cull distance Compute shaders Upgraded transform feedback Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set. How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on. For a taste of the challenges we overcame, let’s look at robustness. Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance. For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”. “Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface. All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility. Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled: Zero (Direct3D, Vulkan with robustBufferAccess2) Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess) Arbitrary values, but can’t crash (OpenGL ES) OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result. As an example, a uniform buffer load without robustness might look like: load.i32 result, buffer, index Robustness adds a single unsigned minimum (umin) instruction: umin idx, index, last load.i32 result, buffer, idx Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency. There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up. We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot. Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address: Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads. One instruction set feature helps. In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like: imad idx, stride/4, vertex, offset/4 load.i32 result, base, idx The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction: load.v4i32 result, base, vertex << 2 …with the hardware calculating the address: What about robustness? We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”. Let’s handle the latter problem. We can rewrite the addressing equation as: That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness: umin idx, vertex, last valid load.v4i32 result, base, idx << 2 We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base: The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as: The driver calculates the right-hand side and passes it into the shader. One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes. Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction. In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes. …But it would be no fun if our hardware was reasonable. Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”. For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality. With robustness, the specifications all agree that image loads return… Zero if the X- or Y-coordinate is out-of-bounds Zero if the level is out-of-bounds Meanwhile, image loads on the M1 GPU return… Zero if the X- or Y-coordinate is out-of-bounds Values from the last level if the level is out-of-bounds Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why. The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance. The obvious workaround is to never load from an invalid level: if (level <= levels) { return imageLoad(x, y, level); } else { return 0; } That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching: vec4 data = imageLoad(x, y, level); return (level <= levels) ? data : 0; This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not: image_load R, x, y, level ulesel R[0], level, levels, R[0], 0 ulesel R[1], level, levels, R[1], 0 ulesel R[2], level, levels, R[2], 0 ulesel R[3], level, levels, R[3], 0 Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround: bool valid = (level <= levels); int x_ = valid ? x : 20000; return imageLoad(x_, y, level); Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly: ulesel x_, level, levels, x, #20000 image_load R, x_, y, level If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance. Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.

a year ago • 82 votes

More in programming

Why Amateur Radio

I always had a diffuse idea of why people are spending so much time and money on amateur radio. Once I got my license and started to amass radios myself, it became more clear.

2 days ago • 5 votes

strongly typed?

What does it mean when someone writes that a programming language is “strongly typed”? I’ve known for many years that “strongly typed” is a poorly-defined term. Recently I was prompted on Lobsters to explain why it’s hard to understand what someone means when they use the phrase. I came up with more than five meanings! how strong? The various meanings of “strongly typed” are not clearly yes-or-no. Some developers like to argue that these kinds of integrity checks must be completely perfect or else they are entirely worthless. Charitably (it took me a while to think of a polite way to phrase this), that betrays a lack of engineering maturity. Software engineers, like any engineers, have to create working systems from imperfect materials. To do so, we must understand what guarantees we can rely on, where our mistakes can be caught early, where we need to establish processes to catch mistakes, how we can control the consequences of our mistakes, and how to remediate when somethng breaks because of a mistake that wasn’t caught. strong how? So, what are the ways that a programming language can be strongly or weakly typed? In what ways are real programming languages “mid”? Statically typed as opposed to dynamically typed? Many languages have a mixture of the two, such as run time polymorphism in OO languages (e.g. Java), or gradual type systems for dynamic languages (e.g. TypeScript). Sound static type system? It’s common for static type systems to be deliberately unsound, such as covariant subtyping in arrays or functions (Java, again). Gradual type systems migh have gaping holes for usability reasons (TypeScript, again). And some type systems might be unsound due to bugs. (There are a few of these in Rust.) Unsoundness isn’t a disaster, if a programmer won’t cause it without being aware of the risk. For example: in Lean you can write “sorry” as a kind of “to do” annotation that deliberately breaks soundness; and Idris 2 has type-in-type so it accepts Girard’s paradox. Type safe at run time? Most languages have facilities for deliberately bypassing type safety, with an “unsafe” library module or “unsafe” language features, or things that are harder to spot. It can be more or less difficult to break type safety in ways that the programmer or language designer did not intend. JavaScript and Lua are very safe, treating type safety failures as security vulnerabilities. Java and Rust have controlled unsafety. In C everything is unsafe. Fewer weird implicit coercions? There isn’t a total order here: for instance, C has implicit bool/int coercions, Rust does not; Rust has implicit deref, C does not. There’s a huge range in how much coercions are a convenience or a source of bugs. For example, the PHP and JavaScript == operators are made entirely of WAT, but at least you can use === instead. How fancy is the type system? To what degree can you model properties of your program as types? Is it convenient to parse, not validate? Is the Curry-Howard correspondance something you can put into practice? Or is it only capable of describing the physical layout of data? There are probably other meanings, e.g. I have seen “strongly typed” used to mean that runtime representations are abstract (you can’t see the underlying bytes); or in the past it sometimes meant a language with a heavy type annotation burden (as a mischaracterization of static type checking). how to type So, when you write (with your keyboard) the phrase “strongly typed”, delete it, and come up with a more precise description of what you really mean. The desiderata above are partly overlapping, sometimes partly orthogonal. Some of them you might care about, some of them not. But please try to communicate where you draw the line and how fuzzy your line is.

3 days ago • 12 votes

Logical Duals in Software Engineering

(Last week's newsletter took too long and I'm way behind on Logic for Programmers revisions so short one this time.1) In classical logic, two operators F/G are duals if F(x) = !G(!x). Three examples: x || y is the same as !(!x && !y). <>P ("P is possibly true") is the same as ![]!P ("not P isn't definitely true"). some x in set: P(x) is the same as !(all x in set: !P(x)). (1) is just a version of De Morgan's Law, which we regularly use to simplify boolean expressions. (2) is important in modal logic but has niche applications in software engineering, mostly in how it powers various formal methods.2 The real interesting one is (3), the "quantifier duals". We use lots of software tools to either find a value satisfying P or check that all values satisfy P. And by duality, any tool that does one can do the other, by seeing if it fails to find/check !P. Some examples in the wild: Z3 is used to solve mathematical constraints, like "find x, where f(x) >= 0. If I want to prove a property like "f is always positive", I ask z3 to solve "find x, where !(f(x) >= 0), and see if that is unsatisfiable. This use case powers a LOT of theorem provers and formal verification tooling. Property testing checks that all inputs to a code block satisfy a property. I've used it to generate complex inputs with certain properties by checking that all inputs don't satisfy the property and reading out the test failure. Model checkers check that all behaviors of a specification satisfy a property, so we can find a behavior that reaches a goal state G by checking that all states are !G. Here's TLA+ solving a puzzle this way.3 Planners find behaviors that reach a goal state, so we can check if all behaviors satisfy a property P by asking it to reach goal state !P. The problem "find the shortest traveling salesman route" can be broken into some route: distance(route) = n and all route: !(distance(route) < n). Then a route finder can find the first, and then convert the second into a some and fail to find it, proving n is optimal. Even cooler to me is when a tool does both finding and checking, but gives them different "meanings". In SQL, some x: P(x) is true if we can query for P(x) and get a nonempty response, while all x: P(x) is true if all records satisfy the P(x) constraint. Most SQL databases allow for complex queries but not complex constraints! You got UNIQUE, NOT NULL, REFERENCES, which are fixed predicates, and CHECK, which is one-record only.4 Oh, and you got database triggers, which can run arbitrary queries and throw exceptions. So if you really need to enforce a complex constraint P(x, y, z), you put in a database trigger that queries some x, y, z: !P(x, y, z) and throws an exception if it finds any results. That all works because of quantifier duality! See here for an example of this in practice. Duals more broadly "Dual" doesn't have a strict meaning in math, it's more of a vibe thing where all of the "duals" are kinda similar in meaning but don't strictly follow all of the same rules. Usually things X and Y are duals if there is some transform F where X = F(Y) and Y = F(X), but not always. Maybe the category theorists have a formal definition that covers all of the different uses. Usually duals switch properties of things, too: an example showing some x: P(x) becomes a counterexample of all x: !P(x). Under this definition, I think the dual of a list l could be reverse(l). The first element of l becomes the last element of reverse(l), the last becomes the first, etc. A more interesting case is the dual of a K -> set(V) map is the V -> set(K) map. IE the dual of lived_in_city = {alice: {paris}, bob: {detroit}, charlie: {detroit, paris}} is city_lived_in_by = {paris: {alice, charlie}, detroit: {bob, charlie}}. This preserves the property that x in map[y] <=> y in dual[x]. And after writing this I just realized this is partial retread of a newsletter I wrote a couple months ago. But only a partial retread! ↩ Specifically "linear temporal logics" are modal logics, so "eventually P ("P is true in at least one state of each behavior") is the same as saying !always !P ("not P isn't true in all states of all behaviors"). This is the basis of liveness checking. ↩ I don't know for sure, but my best guess is that Antithesis does something similar when their fuzzer beats videogames. They're doing fuzzing, not model checking, but they have the same purpose check that complex state spaces don't have bugs. Making the bug "we can't reach the end screen" can make a fuzzer output a complete end-to-end run of the game. Obvs a lot more complicated than that but that's the general idea at least. ↩ For CHECK to constraint multiple records you would need to use a subquery. Core SQL does not support subqueries in check. It is an optional database "feature outside of core SQL" (F671), which Postgres does not support. ↩

4 days ago • 11 votes

Omarchy 2.0

Omarchy 2.0 was released on Linux's 34th birthday as a gift to perhaps the greatest open-source project the world has ever known. Not only does Linux run 95% of all servers on the web, billions of devices as an embedded OS, but it also turns out to be an incredible desktop environment! It's crazy that it took me more than thirty years to realize this, but while I spent time in Apple's walled garden, the free software alternative simply grew better, stronger, and faster. The Linux of 2025 is not the Linux of the 90s or the 00s or even the 10s. It's shockingly more polished, capable, and beautiful. It's been an absolute honor to celebrate Linux with the making of Omarchy, the new Linux distribution that I've spent the last few months building on top of Arch and Hyprland. What began as a post-install script has turned into a full-blown ISO, dedicated package repository, and flourishing community of thousands of enthusiasts all collaborating on making it better. It's been improving rapidly with over twenty releases since the premiere in late June, but this Version 2.0 update is the biggest one yet. If you've been curious about giving Linux a try, you're not afraid of an operating system that asks you to level up and learn a little, and you want to see what a totally different computing experience can look and feel like, I invite you to give it a go. Here's a full tour of Omarchy 2.0.

5 days ago • 9 votes

Dissecting the Apple M1 GPU, the end

5 days ago • 14 votes

New here?