The first conformant M1 GPU driver

from On Life and Lisp [alt+shift+b] in programming

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux! For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers. Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry. To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the...

a year ago

Remove from reading list Add to reading list [alt+a] Read now [→]

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from On Life and Lisp

Vulkan 1.4 sur Asahi Linux

English version follows. Aujourd’hui, Khronos Group a sorti la spécification 1.4 de l’API graphique standard Vulkan. Le projet Asahi Linux est fier d’annoncer le premier pilote Vulkan 1.4 pour le matériel d’Apple. En effet, notre pilote graphique Honeykrisp est reconnu par Khronos comme conforme à cette nouvelle version dès aujourd’hui. Ce pilote est déjà disponible dans nos dépôts officiels. Après avoir installé Fedora Asahi Remix, executez dnf upgrade --refresh pour obtenir la dernière version du pilote. Vulkan 1.4 standardise plusieurs fonctionnalités importantes, y compris les horodatages et la lecture locale avec le rendu dynamique. L’industrie suppose que ces fonctionnalités devront être plus courantes, et nous y sommes préparés. Sortir un pilote conforme reflète notre engagement en faveur des standards graphiques et du logiciel libre. Asahi Linux est aussi compatible avec OpenGL 4.6, OpenGL ES 3.2, et OpenCL 3.0, tous conformes aux spécifications pertinentes. D’ailleurs, les notres sont les seules pilotes conformes pour le materiel d’Apple de n’importe quel standard graphique. Même si le pilote est sorti, il faut encore compiler une version expérimentale de Vulkan-Loader pour accéder à la nouvelle version de Vulkan. Toutes les nouvelles fonctionnalités sont néanmoins disponsibles comme extensions à notre pilote Vulkan 1.3 pour en profiter tout de suite. Pour plus d’informations, consultez l’article de blog de Khronos. Today, the Khronos Group released the 1.4 specification of Vulkan, the standard graphics API. The Asahi Linux project is proud to announce the first Vulkan 1.4 driver for Apple hardware. Our Honeykrisp driver is Khronos-recognized as conformant to the new version since day one. That driver is already available in our official repositories. After installing Fedora Asahi Remix, run dnf upgrade --refresh to get the latest drivers. Vulkan 1.4 standardizes several important features, including timestamps and dynamic rendering local read. The industry expects that these features will become more common, and we are prepared. Releasing a conformant driver reflects our commitment to graphics standards and software freedom. Asahi Linux is also compatible with OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the relevant specifications. For that matter, ours are the only conformant drivers on Apple hardware for any graphics standard graphics. Although the driver is released, you still need to build an experimental version of Vulkan-Loader to access the new Vulkan version. Nevertheless, you can immediately use all the new features as extensions in Vulkan 1.3 driver. For more information, see the Khronos blog post.

7 months ago • 88 votes

AAA gaming on Asahi Linux

Gaming on Linux on M1 is here! We’re thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0. Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today’s release is an alpha, Control runs well! Installation First, install Fedora Asahi Remix. Once installed, get the latest drivers with dnf upgrade --refresh && reboot. Then just dnf install steam and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead. The stack Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference: FEX emulates x86 on Arm. Wine translates Windows to Linux. DXVK and vkd3d-proton translate DirectX to Vulkan. There’s one curveball: page size. Operating systems allocate memory in fixed size “pages”. If an application expects smaller pages than the system uses, they will break due to insufficient alignment of allocations. That’s a problem: x86 expects 4K pages but Apple systems use 16K pages. While Linux can’t mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you’re happy because you can play Fallout 4. Vulkan The final piece is an adult-level Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK support. Let’s look at some new features. Tessellation Tessellation enables games like The Witcher 3 to generate geometry. The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today’s talk at XDC2024. Geometry shaders Geometry shaders are an older, cruder method to generate geometry. Like tessellation, the M1 lacks geometry shader hardware so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don’t need to be fast – just fast enough for games like Ghostrunner. Enhanced robustness “Robustness” permits an application’s shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1. Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires VK_EXT_robustness2, a Vulkan extension strengthening robustness. Like before, we implement robustness with compare-and-select instructions. A naïve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load. load R, buffer, index * 16 ulesel R[0], index, size, R[0], 0 ulesel R[1], index, size, R[1], 0 ulesel R[2], index, size, R[2], 0 ulesel R[3], index, size, R[3], 0 There’s a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects. ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi load R, buffer, index * 16 Two instructions, not four. Next steps Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don’t require sparse, like Cyberpunk 2077. While many games are playable, newer AAA titles don’t hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed. Beyond gaming, we’re adding general purpose x86 emulation based on this stack. For more information, see the FAQ. Today’s alpha is a taste of what’s to come. Not the final form, but enough to enjoy Portal 2 while we work towards “1.0”. Acknowledgements This work has been years in the making with major contributions from… Alyssa Rosenzweig Asahi Lina chaos_princess Davide Cavalca Dougall Johnson Ella Stanforth Faith Ekstrand Janne Grunau Karol Herbst marcan Mary Guillemard Neal Gompa Sergio López TellowKrinkle Teoh Han Hui Rob Clark Ryan Houdek … Plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the magic of open source. We hope you enjoy the magic. Happy gaming.

9 months ago • 85 votes

Vulkan 1.3 on the M1 in 1 month

u{text-decoration-thickness:0.09em;text-decoration-color:skyblue} Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without “portability” waivers. Honeykrisp is not yet released for end users. We’re continuing to add features, improve performance, and port to more hardware. Source code is available for developers. HoloCure running on Honeykrisp ft. DXVK, FEX, and Proton. Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand’s open source NVK driver for NVIDIA GPUs. In her words: All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I’m building NVK with all the best practices we’ve developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized. Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA’s desktop architecture differs from the M1’s mobile roots. In exchange, we get a modern driver designed for desktop games. We’ll need to pass a half-million tests ensuring correctness, submit the results, and then we’ll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver… can we write a driver passing the Vulkan 1.3 conformance test suite faster than the 30 day review period? It’s unprecedented… Challenge accepted. April 2 It begins with a text. Faith… I think I want to write a Vulkan driver. Her advice? Just start typing. Thre’s no copy-pasting yet – we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting “NVK” to Asahi Lina’s kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay. April 3 To access resources, GPUs use “descriptors” containing the address, format, and size of a resource. Vulkan bundles descriptors into “sets” per the application’s “descriptor set layout”. When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware’s data structures. As our descriptors differ from NVIDIA’s, our next task is adapting NVK’s descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add. April 4 With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 “control streams”, then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works. That’s enough to move on to “copies” of buffers and images. We implement Vulkan’s copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes. April 5 Fleshing out yesterday’s code, all copy tests pass. April 6 We’re ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That’s straightforward, but there’s a lot of state to handle. Faith’s code collects all “dynamic state” into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on. April 7 What makes state “dynamic”? Dynamic state can change without recompiling shaders. By contrast, static state is baked into shader binaries called “pipelines”. If games create all their pipelines during a loading screen, there is no compiler “stutter” during gameplay. The idea hasn’t quite panned out: many game developers don’t know their state ahead-of-time so cannot create pipelines early. In response, Vulkan has made ever more state dynamic, punctuated with the EXT_shader_object extension that makes pipelines optional. We want full dynamic state and shader objects. Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1’s designers bet on pipelines. Faced with this hardware, a reasonable driver developer would double-down on pipelines. DXVK would stutter, but we’d pass conformance. I am not reasonable. To eliminate stuttering in OpenGL, we make state dynamic with four strategies: Conditional code. Precompiled variants. Indirection. Prologs and epilogs. Wait, what-a-logs? AMD also bakes state into shaders… with a twist. They divide the hardware binary into three parts: a prolog, the shader, and an epilog. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that’s fast and doesn’t stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too. For Honeykrisp, let’s follow NVK’s lead and treat all state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs. Putting it together, we get a (dynamic) triangle: April 8 Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours. /* Translate an American VkBorderColor into a Canadian agx_border_colour */ enum agx_border_colour translate_border_color(VkBorderColor color) { switch (color) { case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK: return AGX_BORDER_COLOUR_TRANSPARENT_BLACK; ... } } Test results are getting there. Pass: 149770, Fail: 7741, Crash: 2396 That’s good enough for vkQuake. April 9 Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to the finish line. Pass: 255209, Fail: 3818, Crash: 599 98.3% pass rate for 1.3 on our 1 week anniversary. Not bad. April 10 SuperTuxKart has a Vulkan renderer. April 11 Zink works too. April 12 I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it’s resolved within a few weeks. April 16 The tests for “descriptor indexing” revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1’s shuffle instruction is quirky, but it’s easy to workaround. Fixing that fixes the descriptor indexing tests. April 17 A few tests crash inside our register allocator. Their shaders contain a peculiar construction: if (condition) { while (true) { } } condition is always false, but the compiler doesn’t know that. Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. “All loops contain a break” seems obvious for a shader, but it’s false. It’s straightforward to fix register allocation, but what a doozy. April 18 Remember copies? They’re slow, and every frame currently requires a copy to get on screen. For “zero copy” rendering, we need enough Linux window system integration to negotiate an efficient surface layout across process boundaries. Linux uses “modifiers” for this purpose, so we implement the EXT_image_drm_format_modifier extension. And by implement, I mean copy. Copies to avoid copies. April 20 “I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac.” … “Ma’am, this is a Wendy’s.” April 22 As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don’t pre-pack control words during pipeline creation. That adds theoretical CPU overhead. Is that a problem? After some optimization, vkoverhead says we’re pushing 100 million draws per second. I think we’re okay. April 24 Time to light up YCbCr. If we don’t use special YCbCr hardware, this feature is “software-only”. However, it touches a lot of code. It touches so much code that Mohamed Ahmed spent an entire summer adding it to NVK. Which means he spent a summer adding it to Honeykrisp. Thanks, Mohamed ;-) April 25 Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque “query pool”. The result can be copied from the query pool on the CPU or GPU. For the CPU, the driver maps the pool’s internal data structure and copies the result. This may require nontrivial repacking. For the GPU, we need to repack in a compute shader. That’s harder, because we can’t just run C code on the GPU, right? …Actually, we can. A little witchcraft makes GPU query copies as easy as C. void copy_query(struct params *p, int i) { uintptr_t dst = p->dest + i * p->stride; int query = p->first + i; if (p->available[query] || p->partial) { int q = p->index[query]; write_result(dst, p->_64, p->results[q]); } ... } April 26 The final boss: border colours, hard mode. Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours: (0, 0, 0, 0) – transparent black (0, 0, 0, 1) – opaque black (1, 1, 1, 1) – opaque white We handled these on April 8. Unfortunately, there are two problems. First, we need custom border colours for Direct3D compatibility. Both DXVK and vkd3d-proton require the EXT_custom_border_color extension. Second, there’s a subtle problem with our hardware, causing dozens of fails even without custom border colours. To understand the issue, let’s revisit texture descriptors, which contain a pixel format and a component reordering swizzle. Some formats are implicitly reordered. Common “BGRA” formats swap red and blue for historical reasons. The M1 does not directly support these formats. Instead, the driver composes the swizzle with the format’s reordering. If the application uses a BARB swizzle with a BGRA format, the driver uses an RABR swizzle with an RGBA format. There’s a catch: swizzles apply to the border colour, but formats do not. We need to undo the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn’t give us that information. Without custom border colour support, we “should” be okay. Swapping red and blue doesn’t change anything if the colour is white or black. There’s an even subtler catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format… but with reversed “endianness”, swapping red and alpha. That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn’t change the result. The problem is opaque black: (0, 0, 0, 1). Swapping red and alpha gives (1, 0, 0, 0). Transparent red? Uh-oh. We’re stuck. No known hardware configuration implements correct Vulkan semantics. Is hope lost? Do we give up? A reasonable person would. I am not reasonable. Let’s jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1’s custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support. As you know, I am not reasonable. Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we’ll inject code to fix up the border colour. This emulation is simple, correct, and slow. We’ll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests. April 27 All that’s left is some last minute bug fixing, and… Pass: 686930, Fail: 0 Success. The future The next task is implementing everything that DXVK and vkd3d-proton require to layer Direct3D. That includes esoteric extensions like transform feedback. Then Wine and an open source x86 emulator will run Windows games on Asahi Linux. That’s getting ahead of ourselves. In the mean time, enjoy Linux games with our conformant OpenGL 4.6 drivers… and stay tuned. Baby Storm running on Honeykrisp ft. DXVK, FEX, and Proton.

a year ago • 124 votes

Conformant OpenGL 4.6 on the M1

For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers. Already installed? Just dnf –refresh upgrade. Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender, Ryujinx, and Citra. Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2. While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan. Compared to 4.1, OpenGL 4.6 adds dozens of required features, including: Robustness SPIR-V Clip control Cull distance Compute shaders Upgraded transform feedback Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set. How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on. For a taste of the challenges we overcame, let’s look at robustness. Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance. For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”. “Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface. All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility. Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled: Zero (Direct3D, Vulkan with robustBufferAccess2) Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess) Arbitrary values, but can’t crash (OpenGL ES) OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result. As an example, a uniform buffer load without robustness might look like: load.i32 result, buffer, index Robustness adds a single unsigned minimum (umin) instruction: umin idx, index, last load.i32 result, buffer, idx Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency. There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up. We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot. Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address: Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads. One instruction set feature helps. In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like: imad idx, stride/4, vertex, offset/4 load.i32 result, base, idx The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction: load.v4i32 result, base, vertex << 2 …with the hardware calculating the address: What about robustness? We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”. Let’s handle the latter problem. We can rewrite the addressing equation as: That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness: umin idx, vertex, last valid load.v4i32 result, base, idx << 2 We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base: The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as: The driver calculates the right-hand side and passes it into the shader. One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes. Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction. In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes. …But it would be no fun if our hardware was reasonable. Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”. For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality. With robustness, the specifications all agree that image loads return… Zero if the X- or Y-coordinate is out-of-bounds Zero if the level is out-of-bounds Meanwhile, image loads on the M1 GPU return… Zero if the X- or Y-coordinate is out-of-bounds Values from the last level if the level is out-of-bounds Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why. The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance. The obvious workaround is to never load from an invalid level: if (level <= levels) { return imageLoad(x, y, level); } else { return 0; } That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching: vec4 data = imageLoad(x, y, level); return (level <= levels) ? data : 0; This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not: image_load R, x, y, level ulesel R[0], level, levels, R[0], 0 ulesel R[1], level, levels, R[1], 0 ulesel R[2], level, levels, R[2], 0 ulesel R[3], level, levels, R[3], 0 Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround: bool valid = (level <= levels); int x_ = valid ? x : 20000; return imageLoad(x_, y, level); Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly: ulesel x_, level, levels, x, #20000 image_load R, x_, y, level If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance. Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.

a year ago • 72 votes

More in programming

Google is tracking you even when you use DuckDuckGo

12 hours ago • 2 votes

The 6 Hours of Lex

When I drive the 24 Hours of Le Mans, I spend a total of about 6-9 hours in the car, divided into stints of roughly two hours at a time. It's intense. But talking with Lex Fridman in Austin on his podcast? Over six hours straight! We only interrupted the session for five minutes total to take three bathroom breaks. All that endurance training has clearly paid off! But the magic of a good conversation, like the magic of driving at Le Mans, is that time flies by. Those six hours felt more like sixty minutes. This is what flow does: it compresses the moment. Besides, we had plenty to talk about. Lex prepares like no other podcast I've ever been on. Pages and pages of notes. Deep questions, endless attention for tangents. We covered the beauty of Ruby for half an hour alone! But also the future of AI, small teams, why we left the cloud, Elon Musk, fatherhood, money and happiness, and a million other topics (which Lex mercifully timestamps, so listeners without six hours to spare can hop around). It was a privilege to appear. If you're interested, the conversation is on YouTube, on Spotify, on X, and as a regular podcast.

2 hours ago • 2 votes

Measurement and Numbers

Here’s Jony Ive talking to Patrick Collison about measurement and numbers: People generally want to talk about product attributes that you can measure easily with a number…schedule, costs, speed, weight, anything where you can generally agree that six is a bigger number than two He says he used to get mad at how often people around him focused on the numbers of the work over other attributes of the work. But after giving it more thought, he now has a more generous interpretation of why we do this: because we want relate to each other, understand each other, and be inclusive of one another. There are many things we can’t agree on, but it’s likely we can agree that six is bigger than two. And so in this capacity, numbers become a tool for communicating with each other, albeit a kind of least common denominator — e.g. “I don’t agree with you at all, but I can’t argue that 134 is bigger than 87.” This is conducive to a culture where we spend all our time talking about attributes we can easily measure (because then we can easily communicate and work together) and results in a belief that the only things that matter are those which can be measured. People will give lip service to that not being the case, e.g. “We know there are things that can’t be measured that are important.” But the reality ends up being: only that which can be assigned a number gets managed, and that which gets managed is imbued with importance because it is allotted our time, attention, and care. This reminds me of the story of the judgement of King Solomon, an archetypal story found in cultures around the world. Here’s the story as summarized on Wikipedia: Solomon ruled between two women who both claimed to be the mother of a child. Solomon ordered the baby be cut in half, with each woman to receive one half. The first woman accepted the compromise as fair, but the second begged Solomon to give the baby to her rival, preferring the baby to live, even without her. Solomon ordered the baby given to the second woman, as her love was selfless, as opposed to the first woman's selfish disregard for the baby's actual well-being In an attempt to resolve the friction between two individuals, an appeal was made to numbers as an arbiter. We can’t agree on who the mother is, so let’s make it a numbers problem. Reduce the baby to a number and we can agree! But that doesn’t work very well, does it? I think there is a level of existence where measurement and numbers are a sound guide, where two and two make four and two halves make a whole. But, as humans, there is another level of existence where mathematical propositions don’t translate. A baby is not a quantity. A baby is an entity. Take a whole baby and divide it up by a sword and you do not half two halves of a baby. I am not a number. I’m an individual. Indivisible. What does this all have to do with software? Software is for us as humans, as individuals, and because of that I believe there is an aspect of its nature where metrics can’t take you.cIn fact, not only will numbers not guide you, they may actually misguide you. I think Robin Rendle articulated this well in his piece “Trust the vibes”: [numbers] are not representative of human experience or human behavior and can’t tell you anything about beauty or harmony or how to be funny or what to do next and then how to do it. Wisdom is knowing when to use numbers and when to use something else. Email · Mastodon · Bluesky

2 hours ago • 1 votes

German court rules Meta’s tracking tech violates GDPR

4 days ago • 11 votes

New Edna feature: multiple notes

I started working on Edna several months ago and I’ve implemented lots of functionality. Edna is a note taking application with super powers. I figured I’ll make a series of posts about all the features I’ve added in last few months. The first is multiple notes. By default we start with 3 notes: scratch inbox daily journal Here’s a note switcher (Ctrl + K): From note switcher you can: quickly find a note by partial name open selected note with Enter or mouse click create new note: enter fully unique note name and Enter or Ctrl + Enter if it partially matches existing note. I learned this trick from Notational Velocity delete note with Ctrl + Delete archive notes with icon on the right star / un-star (add to favorites, remove from favorites) by clicking star icon on the left assign quick access shortcut Alt + <n> You can also rename notes: context menu (right click mouse) and This note / Rename Rename current note in command palette (Ctrl + Shift + K) Use context menu This note sub-menu for note-related commands. Note: I use Windows keyboard bindings. For Mac equivalent, visit https://edna.arslexis.io/help#keyboard-shortcuts

4 days ago • 8 votes

New here?