Exposure Notifications is a protocol developed by Apple and Google for facilitating COVID-19 contact tracing on mobile phones by exchanging codes with nearby phones over Bluetooth, implemented within the Android and iOS operating systems, now available here in Toronto. Wait – phones? Android and iOS only? Can’t my Debian laptop participate? It has a recent Bluetooth chip. What about phones running GNU/Linux distributions like the PinePhone or Librem 5? Exposure Notifications breaks down neatly into three sections: a Bluetooth layer, some cryptography, and integration with local public health authorities. Linux is up to the task, via BlueZ, OpenSSL, and some Python. Given my background, will this build up to be a reverse-engineering epic resulting in a novel open stack for a closed system? … Not at all. The specifications for Exposure Notifications are available for both the Bluetooth protocol and the underlying cryptography. A partial reference implementation is available for Android,...
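The cryptography layer in particular is approachable with standard tools. The excerpt's own stack is Python; purely as an illustration, here is roughly what the Rolling Proximity Identifier derivation looks like in C against OpenSSL (1.1.1 or newer). The "EN-RPIK"/"EN-RPI" labels and the padded-block layout reflect my reading of the published Exposure Notifications crypto specification and should be double-checked against it; error handling is trimmed.

```c
/* Sketch of the Exposure Notifications key schedule:
 * RPIK = HKDF-SHA256(TEK, info = "EN-RPIK"), and each Rolling Proximity
 * Identifier is AES-128 of a padded block containing the interval number.
 * Constants follow my reading of the published spec; verify before use. */
#include <stdint.h>
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/kdf.h>

static int derive_rpik(const uint8_t tek[16], uint8_t rpik[16])
{
    EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new_id(EVP_PKEY_HKDF, NULL);
    size_t len = 16;
    int ok = ctx != NULL
        && EVP_PKEY_derive_init(ctx) > 0
        && EVP_PKEY_CTX_set_hkdf_md(ctx, EVP_sha256()) > 0
        && EVP_PKEY_CTX_set1_hkdf_key(ctx, tek, 16) > 0
        && EVP_PKEY_CTX_add1_hkdf_info(ctx, (const unsigned char *)"EN-RPIK", 7) > 0
        && EVP_PKEY_derive(ctx, rpik, &len) > 0;
    EVP_PKEY_CTX_free(ctx);
    return ok;
}

static int derive_rpi(const uint8_t rpik[16], uint32_t interval, uint8_t rpi[16])
{
    /* "EN-RPI" label, zero padding, then the 10-minute interval number
     * (Unix time / 600) little-endian in the final four bytes. */
    uint8_t padded[16] = { 'E', 'N', '-', 'R', 'P', 'I' };
    for (int i = 0; i < 4; i++)
        padded[12 + i] = (uint8_t)(interval >> (8 * i));

    EVP_CIPHER_CTX *c = EVP_CIPHER_CTX_new();
    int outl = 0;
    int ok = c != NULL
        && EVP_EncryptInit_ex(c, EVP_aes_128_ecb(), NULL, rpik, NULL) == 1
        && EVP_EncryptUpdate(c, rpi, &outl, padded, 16) == 1
        && outl == 16;
    EVP_CIPHER_CTX_free(c);
    return ok;
}

int main(void)
{
    uint8_t tek[16] = { 0 };   /* dummy key, for illustration only */
    uint8_t rpik[16], rpi[16];

    if (!derive_rpik(tek, rpik) || !derive_rpi(rpik, 2650000u, rpi))
        return 1;
    for (int i = 0; i < 16; i++)
        printf("%02x", rpi[i]);
    printf("\n");
    return 0;
}
```

The Bluetooth side then amounts to broadcasting the 16-byte identifier in a BLE advertisement carrying the Exposure Notifications service UUID, which BlueZ can do from userspace.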


More from On Life and Lisp

Vulkan 1.4 on Asahi Linux

Today, the Khronos Group released the 1.4 specification of Vulkan, the standard graphics API. The Asahi Linux project is proud to announce the first Vulkan 1.4 driver for Apple hardware. Our Honeykrisp driver is recognized by Khronos as conformant to the new version from day one.

That driver is already available in our official repositories. After installing Fedora Asahi Remix, run dnf upgrade --refresh to get the latest drivers.

Vulkan 1.4 standardizes several important features, including timestamps and dynamic rendering local read. The industry expects that these features will become more common, and we are prepared. Releasing a conformant driver reflects our commitment to graphics standards and software freedom.

Asahi Linux is also compatible with OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the relevant specifications. For that matter, ours are the only conformant drivers on Apple hardware for any graphics standard.

Although the driver is released, you still need to build an experimental version of Vulkan-Loader to access the new Vulkan version. Nevertheless, you can immediately use all the new features as extensions to our Vulkan 1.3 driver.

For more information, see the Khronos blog post.
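That Vulkan-Loader caveat is easy to observe from an application. A minimal sketch using standard Vulkan API calls (not part of the Asahi tooling; build against a recent Vulkan SDK and link with -lvulkan): an older loader caps the instance-level version below 1.4 even when the device's driver already reports 1.4.

```c
#include <stdio.h>
#include <vulkan/vulkan.h>

int main(void)
{
    /* The loader reports the newest instance-level version it understands. */
    uint32_t loader_version = VK_API_VERSION_1_0;
    vkEnumerateInstanceVersion(&loader_version);
    printf("loader:  %u.%u\n", VK_API_VERSION_MAJOR(loader_version),
           VK_API_VERSION_MINOR(loader_version));

    VkApplicationInfo app = {
        .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
        .apiVersion = VK_MAKE_API_VERSION(0, 1, 4, 0),
    };
    VkInstanceCreateInfo create = {
        .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        .pApplicationInfo = &app,
    };
    VkInstance instance;
    if (vkCreateInstance(&create, NULL, &instance) != VK_SUCCESS)
        return 1;

    /* Each physical device reports the API version its driver implements. */
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, NULL);
    VkPhysicalDevice devs[4];
    count = count > 4 ? 4 : count;
    vkEnumeratePhysicalDevices(instance, &count, devs);
    for (uint32_t i = 0; i < count; i++) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(devs[i], &props);
        printf("%s: %u.%u\n", props.deviceName,
               VK_API_VERSION_MAJOR(props.apiVersion),
               VK_API_VERSION_MINOR(props.apiVersion));
    }
    vkDestroyInstance(instance, NULL);
    return 0;
}
```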

AAA gaming on Asahi Linux

Gaming on Linux on M1 is here! We’re thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0. Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today’s release is an alpha, Control runs well!

Installation

First, install Fedora Asahi Remix. Once installed, get the latest drivers with dnf upgrade --refresh && reboot. Then just dnf install steam and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead.

The stack

Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference:

- FEX emulates x86 on Arm.
- Wine translates Windows to Linux.
- DXVK and vkd3d-proton translate DirectX to Vulkan.

There’s one curveball: page size. Operating systems allocate memory in fixed-size “pages”. If an application expects smaller pages than the system uses, it will break due to insufficient alignment of allocations. That’s a problem: x86 expects 4K pages but Apple systems use 16K pages.

While Linux can’t mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you’re happy because you can play Fallout 4.

Vulkan

The final piece is an adult-level Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK support. Let’s look at some new features.

Tessellation

Tessellation enables games like The Witcher 3 to generate geometry. The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today’s talk at XDC2024.

Geometry shaders

Geometry shaders are an older, cruder method to generate geometry. Like tessellation, the M1 lacks geometry shader hardware, so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don’t need to be fast – just fast enough for games like Ghostrunner.

Enhanced robustness

“Robustness” permits an application’s shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1.

Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires VK_EXT_robustness2, a Vulkan extension strengthening robustness.

Like before, we implement robustness with compare-and-select instructions. A naïve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load.

```
load R, buffer, index * 16
ulesel R[0], index, size, R[0], 0
ulesel R[1], index, size, R[1], 0
ulesel R[2], index, size, R[2], 0
ulesel R[3], index, size, R[3], 0
```

There’s a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo.
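In the real driver that reservation lives in the GPU's virtual address space, but the idea is easy to sketch with ordinary Linux virtual memory. A rough illustration, assuming 64-bit Linux and not taken from the driver's actual code: an anonymous, read-only, MAP_NORESERVE mapping costs almost nothing, because untouched pages all alias the kernel's shared zero page.

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Reserve 64 GiB of virtual address space backed by zero pages.
       No physical memory or swap is committed up front; reads through
       any pointer into this region return 0. */
    size_t size = 64ull << 30;
    void *zeroes = mmap(NULL, size, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (zeroes == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* An out-of-bounds descriptor can now point at `zeroes` instead of
       the application's buffer; loads through it read back zero. */
    printf("reserved %zu bytes at %p\n", size, zeroes);
    return 0;
}
```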
Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects.

```
ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo
ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi
load R, buffer, index * 16
```

Two instructions, not four.

Next steps

Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don’t require sparse, like Cyberpunk 2077. While many games are playable, newer AAA titles don’t hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed. Beyond gaming, we’re adding general purpose x86 emulation based on this stack. For more information, see the FAQ. Today’s alpha is a taste of what’s to come. Not the final form, but enough to enjoy Portal 2 while we work towards “1.0”.

Acknowledgements

This work has been years in the making with major contributions from Alyssa Rosenzweig, Asahi Lina, chaos_princess, Davide Cavalca, Dougall Johnson, Ella Stanforth, Faith Ekstrand, Janne Grunau, Karol Herbst, marcan, Mary Guillemard, Neal Gompa, Sergio López, TellowKrinkle, Teoh Han Hui, Rob Clark, and Ryan Houdek, plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the magic of open source. We hope you enjoy the magic. Happy gaming.

Vulkan 1.3 on the M1 in 1 month

Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without “portability” waivers. Honeykrisp is not yet released for end users. We’re continuing to add features, improve performance, and port to more hardware. Source code is available for developers.

HoloCure running on Honeykrisp ft. DXVK, FEX, and Proton.

Honeykrisp is not based on prior M1 Vulkan efforts, but rather on Faith Ekstrand’s open source NVK driver for NVIDIA GPUs. In her words:

All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I’m building NVK with all the best practices we’ve developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized.

Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA’s desktop architecture differs from the M1’s mobile roots. In exchange, we get a modern driver designed for desktop games.

We’ll need to pass a half-million tests ensuring correctness, submit the results, and then we’ll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver… can we write a driver passing the Vulkan 1.3 conformance test suite faster than the 30 day review period? It’s unprecedented… Challenge accepted.

April 2

It begins with a text. Faith… I think I want to write a Vulkan driver. Her advice? Just start typing. There’s no copy-pasting yet – we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting “NVK” to Asahi Lina’s kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay.

April 3

To access resources, GPUs use “descriptors” containing the address, format, and size of a resource. Vulkan bundles descriptors into “sets” per the application’s “descriptor set layout”. When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware’s data structures. As our descriptors differ from NVIDIA’s, our next task is adapting NVK’s descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add.

April 4

With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 “control streams”, then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works. That’s enough to move on to “copies” of buffers and images. We implement Vulkan’s copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes.

April 5

Fleshing out yesterday’s code, all copy tests pass.

April 6

We’re ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That’s straightforward, but there’s a lot of state to handle. Faith’s code collects all “dynamic state” into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on.

April 7

What makes state “dynamic”?
Dynamic state can change without recompiling shaders. By contrast, static state is baked into shader binaries called “pipelines”. If games create all their pipelines during a loading screen, there is no compiler “stutter” during gameplay. The idea hasn’t quite panned out: many game developers don’t know their state ahead-of-time so cannot create pipelines early. In response, Vulkan has made ever more state dynamic, punctuated with the EXT_shader_object extension that makes pipelines optional. We want full dynamic state and shader objects.

Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1’s designers bet on pipelines. Faced with this hardware, a reasonable driver developer would double-down on pipelines. DXVK would stutter, but we’d pass conformance. I am not reasonable.

To eliminate stuttering in OpenGL, we make state dynamic with four strategies:

- Conditional code.
- Precompiled variants.
- Indirection.
- Prologs and epilogs.

Wait, what-a-logs? AMD also bakes state into shaders… with a twist. They divide the hardware binary into three parts: a prolog, the shader, and an epilog. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that’s fast and doesn’t stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too.

For Honeykrisp, let’s follow NVK’s lead and treat all state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs. Putting it together, we get a (dynamic) triangle:

April 8

Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours.

```
/* Translate an American VkBorderColor into a Canadian agx_border_colour */
enum agx_border_colour
translate_border_color(VkBorderColor color)
{
   switch (color) {
   case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK:
      return AGX_BORDER_COLOUR_TRANSPARENT_BLACK;
   ...
   }
}
```

Test results are getting there.

Pass: 149770, Fail: 7741, Crash: 2396

That’s good enough for vkQuake.

April 9

Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to the finish line.

Pass: 255209, Fail: 3818, Crash: 599

98.3% pass rate for 1.3 on our 1 week anniversary. Not bad.

April 10

SuperTuxKart has a Vulkan renderer.

April 11

Zink works too.

April 12

I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it’s resolved within a few weeks.

April 16

The tests for “descriptor indexing” revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1’s shuffle instruction is quirky, but it’s easy to work around. Fixing that fixes the descriptor indexing tests.

April 17

A few tests crash inside our register allocator. Their shaders contain a peculiar construction:

```
if (condition) {
    while (true) { }
}
```

condition is always false, but the compiler doesn’t know that. Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. “All loops contain a break” seems obvious for a shader, but it’s false. It’s straightforward to fix register allocation, but what a doozy.
April 18

Remember copies? They’re slow, and every frame currently requires a copy to get on screen. For “zero copy” rendering, we need enough Linux window system integration to negotiate an efficient surface layout across process boundaries. Linux uses “modifiers” for this purpose, so we implement the EXT_image_drm_format_modifier extension. And by implement, I mean copy. Copies to avoid copies.

April 20

“I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac.” … “Ma’am, this is a Wendy’s.”

April 22

As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don’t pre-pack control words during pipeline creation. That adds theoretical CPU overhead. Is that a problem? After some optimization, vkoverhead says we’re pushing 100 million draws per second. I think we’re okay.

April 24

Time to light up YCbCr. If we don’t use special YCbCr hardware, this feature is “software-only”. However, it touches a lot of code. It touches so much code that Mohamed Ahmed spent an entire summer adding it to NVK. Which means he spent a summer adding it to Honeykrisp. Thanks, Mohamed ;-)

April 25

Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque “query pool”. The result can be copied from the query pool on the CPU or GPU. For the CPU, the driver maps the pool’s internal data structure and copies the result. This may require nontrivial repacking. For the GPU, we need to repack in a compute shader. That’s harder, because we can’t just run C code on the GPU, right? …Actually, we can. A little witchcraft makes GPU query copies as easy as C.

```
void copy_query(struct params *p, int i)
{
   uintptr_t dst = p->dest + i * p->stride;
   int query = p->first + i;

   if (p->available[query] || p->partial) {
      int q = p->index[query];
      write_result(dst, p->_64, p->results[q]);
   }
   ...
}
```

April 26

The final boss: border colours, hard mode. Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours:

- (0, 0, 0, 0) – transparent black
- (0, 0, 0, 1) – opaque black
- (1, 1, 1, 1) – opaque white

We handled these on April 8. Unfortunately, there are two problems. First, we need custom border colours for Direct3D compatibility. Both DXVK and vkd3d-proton require the EXT_custom_border_color extension. Second, there’s a subtle problem with our hardware, causing dozens of fails even without custom border colours.

To understand the issue, let’s revisit texture descriptors, which contain a pixel format and a component reordering swizzle. Some formats are implicitly reordered. Common “BGRA” formats swap red and blue for historical reasons. The M1 does not directly support these formats. Instead, the driver composes the swizzle with the format’s reordering. If the application uses a BARB swizzle with a BGRA format, the driver uses an RABR swizzle with an RGBA format.

There’s a catch: swizzles apply to the border colour, but formats do not. We need to undo the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn’t give us that information. Without custom border colour support, we “should” be okay. Swapping red and blue doesn’t change anything if the colour is white or black.
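To make that swizzle/format composition concrete, here is a small self-contained sketch (illustrative only, not Mesa code) that composes an application swizzle with the reordering used to emulate BGRA on an RGBA texture; it reproduces the BARB-to-RABR example above.

```c
#include <stdio.h>

/* Component selectors, in the spirit of VkComponentSwizzle. */
enum chan { R = 0, G = 1, B = 2, A = 3 };

/* Compose the application's view swizzle with the reordering the driver
   applies to emulate a format (e.g. BGRA stored as RGBA in hardware). */
static void compose(const enum chan fmt_reorder[4],
                    const enum chan app_swizzle[4],
                    enum chan out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = fmt_reorder[app_swizzle[i]];
}

int main(void)
{
    const char *names = "RGBA";
    /* BGRA emulated on RGBA hardware: red and blue trade places. */
    enum chan bgra_on_rgba[4] = { B, G, R, A };
    /* The application's swizzle from the example above: BARB. */
    enum chan app[4] = { B, A, R, B };

    enum chan composed[4];
    compose(bgra_on_rgba, app, composed);

    for (int i = 0; i < 4; i++)
        putchar(names[composed[i]]);
    putchar('\n'); /* prints RABR, matching the example in the text */
    return 0;
}
```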
There’s an even subtler catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format… but with reversed “endianness”, swapping red and alpha. That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn’t change the result. The problem is opaque black: (0, 0, 0, 1). Swapping red and alpha gives (1, 0, 0, 0). Transparent red? Uh-oh.

We’re stuck. No known hardware configuration implements correct Vulkan semantics. Is hope lost? Do we give up? A reasonable person would. I am not reasonable.

Let’s jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1’s custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support. As you know, I am not reasonable.

Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we’ll inject code to fix up the border colour. This emulation is simple, correct, and slow. We’ll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests.

April 27

All that’s left is some last minute bug fixing, and…

Pass: 686930, Fail: 0

Success.

The future

The next task is implementing everything that DXVK and vkd3d-proton require to layer Direct3D. That includes esoteric extensions like transform feedback. Then Wine and an open source x86 emulator will run Windows games on Asahi Linux. That’s getting ahead of ourselves. In the meantime, enjoy Linux games with our conformant OpenGL 4.6 drivers… and stay tuned.

Baby Storm running on Honeykrisp ft. DXVK, FEX, and Proton.

Conformant OpenGL 4.6 on the M1

For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers. Already installed? Just dnf --refresh upgrade.

Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender, Ryujinx, and Citra. Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2. While the vendor doesn’t yet support graphics standards like modern OpenGL, we do.

For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan.

Compared to 4.1, OpenGL 4.6 adds dozens of required features, including:

- Robustness
- SPIR-V
- Clip control
- Cull distance
- Compute shaders
- Upgraded transform feedback

Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set.

How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on. For a taste of the challenges we overcame, let’s look at robustness.

Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance. For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”.

“Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface. All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility.

Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled:

- Zero (Direct3D, Vulkan with robustBufferAccess2)
- Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess)
- Arbitrary values, but can’t crash (OpenGL ES)

OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access.
Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result.

As an example, a uniform buffer load without robustness might look like:

```
load.i32 result, buffer, index
```

Robustness adds a single unsigned minimum (umin) instruction:

```
umin idx, index, last
load.i32 result, buffer, idx
```

Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency.

There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up. We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot.

Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address:

    base + (stride * vertex) + offset

Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads.

One instruction set feature helps. In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:

```
imad idx, stride/4, vertex, offset/4
load.i32 result, base, idx
```

The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:

```
load.v4i32 result, base, vertex << 2
```

…with the hardware calculating the address:

    base + 4 * (vertex << 2) = base + (16 * vertex)

What about robustness? We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”. Let’s handle the latter problem. We can rewrite the addressing equation as:

    (base + offset) + (stride * vertex)

That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset.
This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:

```
umin idx, vertex, last valid
load.v4i32 result, base, idx << 2
```

We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base:

    offset + (stride * vertex) + (format size)

The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as:

    vertex <= (buffer size - offset - format size) / stride    (integer division)

The driver calculates the right-hand side and passes it into the shader.

One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes.

Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction.
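That per-attribute driver math is simple enough to sketch in C. The names here are hypothetical and this is not the driver's actual code; it just computes the per-attribute base and the last valid vertex index described above, falling back to a zero-filled buffer when nothing at all can be loaded.

```c
#include <stdint.h>
#include <stdio.h>

/* GPU address of a small zero-filled buffer the driver keeps around
   (hypothetical; the real driver would allocate one). */
#define ZERO_BUFFER_ADDR ((uint64_t)0)

struct attr_fetch {
    uint64_t base;  /* vertex buffer address + attribute offset */
    uint32_t last;  /* last vertex index that stays in bounds */
};

static struct attr_fetch
setup_attribute(uint64_t buffer_addr, uint64_t buffer_size,
                uint32_t offset, uint32_t stride, uint32_t format_size)
{
    struct attr_fetch a;

    if (buffer_size < (uint64_t)offset + format_size) {
        /* Too small to load even vertex 0: point the attribute at a
           buffer of zeroes and clamp every index to 0. */
        a.base = ZERO_BUFFER_ADDR;
        a.last = 0;
    } else {
        /* One buffer with an offset becomes a per-attribute base with no
           offset; the shader's umin then clamps the vertex index. */
        a.base = buffer_addr + offset;
        a.last = stride ? (uint32_t)((buffer_size - offset - format_size) / stride)
                        : 0; /* stride 0: every vertex reads the same data */
    }
    return a;
}

int main(void)
{
    /* 100 vertices of a 16-byte attribute at offset 8, stride 32 bytes. */
    struct attr_fetch a = setup_attribute(0x10000, 100 * 32, 8, 32, 16);
    printf("base=0x%llx last=%u\n", (unsigned long long)a.base, a.last);
    return 0;
}
```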
In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes. …But it would be no fun if our hardware was reasonable.

Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”. For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality.

With robustness, the specifications all agree that image loads return…

- Zero if the X- or Y-coordinate is out-of-bounds
- Zero if the level is out-of-bounds

Meanwhile, image loads on the M1 GPU return…

- Zero if the X- or Y-coordinate is out-of-bounds
- Values from the last level if the level is out-of-bounds

Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why. The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance.

The obvious workaround is to never load from an invalid level:

```
if (level <= levels) {
    return imageLoad(x, y, level);
} else {
    return 0;
}
```

That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching:

```
vec4 data = imageLoad(x, y, level);
return (level <= levels) ? data : 0;
```

This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:

```
image_load R, x, y, level
ulesel R[0], level, levels, R[0], 0
ulesel R[1], level, levels, R[1], 0
ulesel R[2], level, levels, R[2], 0
ulesel R[3], level, levels, R[3], 0
```

Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:

```
bool valid = (level <= levels);
int x_ = valid ? x : 20000;
return imageLoad(x_, y, level);
```

Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:

```
ulesel x_, level, levels, x, #20000
image_load R, x_, y, level
```

If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance.

Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.

The first conformant M1 GPU driver

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux! For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers. Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry. To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the M1, M1 Pro/Max/Ultra, M2, and M2 Pro/Max. Today’s milestone isn’t just about OpenGL ES. We’re releasing the first conformant implementation of any graphics standard for the M1. And we don’t plan to stop here ;-) Unlike ours, the manufacturer’s M1 drivers are unfortunately not conformant for any standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means that there is no guarantee that applications using the standards will work on your M1/M2 (if you’re not running Linux). This isn’t just a theoretical issue. Consider Vulkan. The third-party MoltenVK layers a subset of Vulkan on top of the proprietary drivers. However, those drivers lack key functionality, breaking valid Vulkan applications. That hinders developers and users alike, if they haven’t yet switched their M1/M2 computers to Linux. Why did we pursue standards conformance when the manufacturer did not? Above all, our commitment to quality. We want our users to know that they can depend on our Linux drivers. We want standard software to run without M1-specific hacks or porting. We want to set the right example for the ecosystem: the way forward is implementing open standards, conformant to the specifications, without compromises for “portability”. We are not satisfied with proprietary drivers, proprietary APIs, and refusal to implement standards. The rest of the industry knows that progress comes from cross-vendor collaboration. We know it, too. Achieving conformance is a win for our community, for open source, and for open graphics. Of course, Asahi Lina and I are two individuals with minimal funding. It’s a little awkward that we beat the big corporation… It’s not too late though. They should follow our lead! OpenGL ES 3.1 updates the experimental OpenGL ES 3.0 and OpenGL 3.1 we shipped in June. Notably, ES 3.1 adds compute shaders, typically used to accelerate general computations within graphics applications. For example, a 3D game could run its physics simulations in a compute shader. The simulation results can then be used for rendering, eliminating stalls that would otherwise be required to synchronize the GPU with a CPU physics simulation. That lets the game run faster. Let’s zoom in on one new feature: atomics on images. Older versions of OpenGL ES allowed an application to read an image in order to display it on screen. ES 3.1 allows the application to write to the image, typically from a compute shader. This new feature enables flexible image processing algorithms, which previously needed to fit into the fixed-function 3D pipeline. 
However, GPUs are massively parallel, running thousands of threads at the same time. If two threads write to the same location, there is a conflict: depending which thread runs first, the result will be different. We have a race condition.

“Atomic” access to memory provides a solution to race conditions. With atomics, special hardware in the memory subsystem guarantees consistent, well-defined results for select operations, regardless of the order of the threads. Modern graphics hardware supports various atomic operations, like addition, serving as building blocks to complex parallel algorithms.

Can we put these two features together to write to an image atomically? Yes. A ubiquitous OpenGL ES extension, required for ES 3.2, adds atomics operating on pixels in an image. For example, a compute shader could atomically increment the value at pixel (10, 20).

Other GPUs have dedicated instructions to perform atomics on an image, making the driver implementation straightforward. For us, the story is more complicated. The M1 lacks hardware instructions for image atomics, even though it has non-image atomics and non-atomic images. We need to reframe the problem.

The idea is simple: to perform an atomic on a pixel, we instead calculate the address of the pixel in memory and perform a regular atomic on that address. Since the hardware supports regular atomics, our task is “just” calculating the pixel’s address.

If the image were laid out linearly in memory, this would be straightforward: multiply the Y-coordinate by the number of bytes per row (“stride”), multiply the X-coordinate by the number of bytes per pixel, and add. That gives the pixel’s offset in bytes relative to the first pixel of the image. To get the final address, we add that offset to the address of the first pixel.

Alas, images are rarely linear in memory. To improve cache efficiency, modern graphics hardware interleaves the X- and Y-coordinates. Instead of one row after the next, pixels in memory follow a spiral-like curve.

We need to amend our previous equation to interleave the coordinates. We could use many instructions to mask one bit at a time, shifting to construct the interleaved result, but that’s inefficient. We can do better.

There is a well-known “bit twiddling” algorithm to interleave bits. Rather than shuffle one bit at a time, the algorithm shuffles groups of bits, parallelizing the problem. Implementing this algorithm in shader code improves performance.

In practice, only the lower 7 bits (or less) of each coordinate are interleaved. That lets us use 32-bit instructions to “vectorize” the interleave, by putting the X- and Y-coordinates in the low and high 16 bits of a 32-bit register. Those 32-bit instructions let us interleave X and Y at the same time, halving the instruction count. Plus, we can exploit the GPU’s combined shift-and-add instruction. Putting the tricks together, we interleave in 10 instructions of M1 GPU assembly:

```
# Inputs x, y in r0l, r0h.
# Output in r1.
add r2, #0, r0, lsl 4
or r1, r0, r2
and r1, r1, #0xf0f0f0f
add r2, #0, r1, lsl 2
or r1, r1, r2
and r1, r1, #0x33333333
add r2, #0, r1, lsl 1
or r1, r1, r2
and r1, r1, #0x55555555
add r1, r1l, r1h, lsl 1
```
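For reference, here is the same idea in plain scalar C rather than the packed 16-bit-halves form the GPU assembly uses above: a sketch (illustrative, not driver code) of the classic "binary magic numbers" interleave, followed by the byte-offset step a pixel atomic would then feed into a regular memory atomic.

```c
#include <stdint.h>
#include <stdio.h>

/* Spread the low 16 bits of v so they occupy the even bit positions:
   the classic "binary magic numbers" interleave, using the same masks
   that appear in the GPU assembly above. */
static uint32_t spread(uint32_t v)
{
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Interleave (x, y) into a Morton-style pixel index, then turn it into a
   byte offset that a regular memory atomic could use. */
static uint64_t pixel_offset(uint32_t x, uint32_t y, uint32_t bytes_per_pixel)
{
    uint32_t interleaved = spread(x) | (spread(y) << 1);
    return (uint64_t)interleaved * bytes_per_pixel;
}

int main(void)
{
    /* x = 3 (binary 11) and y = 1 (binary 01) interleave to binary 0111 = 7. */
    printf("%llu\n", (unsigned long long)pixel_offset(3, 1, 1));
    return 0;
}
```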
We could stop here, but what if there’s a dedicated instruction to interleave bits? PowerVR has a “shuffle” instruction shfl, and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too. Unfortunately, even if it was, the proprietary compiler won’t use it when compiling our test shaders. That makes it difficult to reverse-engineer the instruction – if it exists – by observing compiled shaders. It’s time to dust off a powerful reverse-engineering technique from magic kindergarten: guess and check.

Dougall Johnson provided the guess. When considering the instructions we already know about, he took special notice of the “reverse bits” instruction. Since reversing bits is a type of bit shuffle, the interleave instruction should be encoded similarly. The bit reverse instruction has a two-bit field specifying the operation, with value 01. Related instructions to count the number of set bits and find the first set bit have values 10 and 11 respectively. That encompasses all known “complex bit manipulation” instructions.

    00  ? ? ?
    01  Reverse bits
    10  Count set bits
    11  Find first set

There is one value of the two-bit enumeration that is unobserved and unknown: 00. If this interleave instruction exists, it’s probably encoded like the bit reverse but with operation code 00 instead of 01.

There’s a difficulty: the three known instructions have one single input source, but our instruction interleaves two sources. Where does the second source go? We can make a guess based on symmetry. Presumably to simplify the hardware decoder, M1 GPU instructions usually encode their sources in consistent locations across instructions. The other three instructions have a gap where we would expect the second source to be, in a two-source arithmetic instruction. Probably the second source is there.

Armed with a guess, it’s our turn to check. Rather than handwrite GPU assembly, we can hack our compiler to replace some two-source integer operation (like multiply) with our guessed encoding of “interleave”. Then we write a compute shader using this operation (by “multiplying” numbers) and run it with the newfangled compute support in our driver.

All that’s left is writing a shader that checks that the mystery instruction returns the interleaved result for each possible input. Since the instruction takes two 16-bit sources, there are about 4 billion (2^32) inputs. With our driver, the M1 GPU manages to check them all in under a second, and the verdict is in: this is our interleave instruction.

As for our clever vectorized assembly to interleave coordinates? We can replace it with one instruction. It’s anticlimactic, but it’s fast and it passes the conformance tests. And that’s what matters.

Thank you to Khronos and Software in the Public Interest for supporting open drivers.


More in programming

Diagnosis in engineering strategy.

Once you’ve written your strategy’s exploration, the next step is working on its diagnosis. Diagnosis is understanding the constraints and challenges your strategy needs to address. In particular, it’s about doing that understanding while slowing yourself down from deciding how to solve the problem at hand before you know the problem’s nuances and constraints. If you ever find yourself wanting to skip the diagnosis phase–let’s get to the solution already!–then maybe it’s worth acknowledging that every strategy that I’ve seen fail did so due to a lazy or inaccurate diagnosis. It’s very challenging to fail with a proper diagnosis, and almost impossible to succeed without one.

The topics this chapter will cover are:

- Why diagnosis is the foundation of effective strategy, on which effective policy depends. Conversely, how skipping the diagnosis phase consistently ruins strategies
- A step-by-step approach to diagnosing your strategy’s circumstances
- How to incorporate data into your diagnosis effectively, and where to focus on adding data
- Dealing with controversial elements of your diagnosis, such as pointing out that your own executive is one of the challenges to solve
- Why it’s more effective to view difficulties as part of the problem to be solved, rather than a blocking issue that prevents making forward progress
- The near impossibility of an effective diagnosis if you don’t bring humility and self-awareness to the process

Into the details we go! This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Diagnosis is strategy’s foundation

One of the challenges in evaluating strategy is that, after the fact, many effective strategies are so obvious that they’re pretty boring. Similarly, most ineffective strategies are so clearly flawed that their authors look lazy. That’s because, as a strategy is operated, the reality around it becomes clear. When you’re writing your strategy, you don’t know if you can convince your colleagues to adopt a new approach to specifying APIs, but a year later you know very definitively whether it’s possible.

Building your strategy’s diagnosis is your attempt to correctly recognize the context that the strategy needs to solve before deciding on the policies to address that context. Done well, the subsequent steps of writing strategy often feel like an afterthought, which is why I think of diagnosis as strategy’s foundation.

Where exploration was an evaluation-free activity, diagnosis is all about evaluation. How do teams feel today? Why did that project fail? Why did the last strategy go poorly? What will be the distractions to overcome to make this new strategy successful? That said, not all evaluation is equal. If you state your judgment directly, it’s easy to dispute. An effective diagnosis is hard to argue against, because it’s a web of interconnected observations, facts, and data. Even for folks who dislike your conclusions, the weight of evidence should be hard to shift.

Strategy testing, explored in the Refinement section, takes advantage of the reality that it’s easier to diagnose by doing than by speculating. It proposes a recursive diagnosis process until you have real-world evidence that the strategy is working.

How to develop your diagnosis

Your strategy is almost certain to fail unless you start from an effective diagnosis, but how to build a diagnosis is often left unspecified.
That’s because, for most folks, building the diagnosis is indeed a dark art: unspecified, undiscussed, and uncontrollable. I’ve been guilty of this as well, with The Engineering Executive’s Primer’s chapter on strategy staying silent on the details of how to diagnose for your strategy. So, yes, there is some truth to the idea that forming your diagnosis is an emergent, organic process rather than a structured, mechanical one. However, over time I’ve come to adopt a fairly structured approach:

- Braindump: starting from a blank sheet of paper, write down your best understanding of the circumstances that inform your current strategy. Then set that piece of paper aside for the moment.
- Summarize exploration: on a new piece of paper, review the contents of your exploration. Pull in every piece of diagnosis from similar situations that resonates with you. This is true for both internal and external works! For each diagnosis, tag whether it fits perfectly, or needs to be adjusted for your current circumstances. Then, once again, set the piece of paper aside.
- Mine for distinct perspectives: on yet another blank page, talk to different stakeholders and colleagues who you know are likely to disagree with your early thinking. Your goal is not to agree with this feedback. Instead, it’s to understand their view. The Crux by Richard Rumelt anchors diagnosis in this approach, emphasizing the importance of “testing, adjusting, and changing the frame, or point of view.”
- Synthesize views into one internally consistent perspective. Sometimes the different perspectives you’ve gathered don’t mesh well. They might well explicitly differ in what they believe the underlying problem is, as is typical in tension between platform and product engineering teams. The goal is to competently represent each of these perspectives in the diagnosis, even the ones you disagree with, so that later on you can evaluate your proposed approach against each of them. When synthesizing feedback goes poorly, it tends to fail in one of two ways. First, the author’s opinion shines through so strongly that it renders the author suspect. Your goal is never to agree with every team’s perspective, just as your diagnosis should typically avoid crowning any perspective as correct: a reader should generally be apprised of the details and unaware of the author. The second common issue is when a group tries to jointly own the synthesis, but creates a fractured perspective rather than a unified one. I generally find that having one author who is accountable for representing all views works best to address both of these issues.
- Test drafts across perspectives. Once you’ve written your initial diagnosis, you want to sit down with the people who you expect to disagree most fervently. Iterate with them until they agree that you’ve accurately captured their perspective. It might be that they disagree with some other view points, but they should be able to agree that others hold those views. They might argue that the data you’ve included doesn’t capture their full reality, in which case you can caveat the data by saying that their team disagrees that it’s a comprehensive lens.

Don’t worry about getting the details perfectly right in your initial diagnosis. You’re trying to get the right crumbs to feed into the next phase, strategy refinement. Allowing yourself to be directionally correct, rather than perfectly correct, makes it possible to cover a broad territory quickly.
Getting caught up in perfecting details is an easy way to anchor yourself into one perspective prematurely. At this point, I hope you’re starting to predict how I’ll conclude any recipe for strategy creation: if these steps feel overly mechanical to you, adjust them to something that feels more natural and authentic. There’s no perfect way to understand complex problems. That said, if you feel uncertain, or are skeptical of your own track record, I do encourage you to start with the above approach as a launching point.

Incorporating data into your diagnosis

The strategy for Navigating Private Equity ownership’s diagnosis includes a number of details to help readers understand the status quo. For example, the section on headcount growth explains headcount growth, how it compares to the prior year, and provides a mental model for readers to translate engineering headcount into engineering headcount costs:

Our Engineering headcount costs have grown by 15% YoY this year, and 18% YoY the prior year. Headcount grew 7% and 9% respectively, with the difference between headcount and headcount costs explained by salary band adjustments (4%), a focus on hiring senior roles (3%), and increased hiring in higher cost geographic regions (1%).

If everyone evaluating a strategy shares the same foundational data, then evaluating the strategy becomes vastly simpler. Data is also your mechanism for supporting or critiquing the various views that you’ve gathered when drafting your diagnosis; to an impartial reader, data will speak louder than passion. If you’re confident that a perspective is true, then include a data narrative that supports it. If you believe another perspective is overstated, then include data that the reader will require to come to the same conclusion.

Do your best to include data analysis with a link out to the full data, rather than requiring readers to interpret the data themselves while they are reading. As your strategy document travels further, there will be inevitable requests for different cuts of data to help readers understand your thinking, and this is somewhat preventable by linking to your original sources.

If much of the data you want doesn’t exist today, that’s a fairly common scenario for strategy work: if the data to make the decision easy already existed, you probably would have already made a decision rather than needing to run a structured thinking process. The next chapter on refining strategy covers a number of tools that are useful for building confidence in low-data environments.

Whisper the controversial parts

At one time, the company I worked at rolled out a bar raiser program styled after Amazon’s, where there was an interviewer from outside the team that had to approve every hire. I spent some time arguing against adding this additional step as I didn’t understand what we were solving for, and I was surprised at how uninterested management was in knowing whether the new process actually improved outcomes.

What I didn’t realize until much later was that most of the senior leadership distrusted one of their peers, and had rolled out the bar raiser program solely to create a mechanism to control that manager’s hiring bar when the CTO was uninterested in holding that leader accountable. (I also learned that these leaders didn’t care much about implementing this policy, resulting in bar raiser rejections being frequently ignored, but that’s a discussion for the Operations for strategy chapter.)
This is a good example of a strategy that does make sense with the full diagnosis, but makes little sense without it, and where stating part of the diagnosis out loud is nearly impossible. Even senior leaders are not generally allowed to write a document that says, “The Director of Product Engineering is a bad hiring manager.” When you’re writing a strategy, you’ll often find yourself trying to choose between two awkward options: Say something awkward or uncomfortable about your company or someone working within it Omit a critical piece of your diagnosis that’s necessary to understand the wider thinking Whenever you encounter this sort of debate, my advice is to find a way to include the diagnosis, but to reframe it into a palatable statement that avoids casting blame too narrowly. I think it’s helpful to discuss a few concrete examples of this, starting with the strategy for navigating private equity, whose diagnosis includes: Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don’t have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction. There are many things the authors of this strategy likely feel about their state of reality. First, they are probably upset about the fact that their new private equity ownership is likely to eliminate colleagues. Second, they are likely upset that there is no clear plan around what they need to do, so they are stuck preparing for a wide range of potential outcomes. However they feel, they don’t say any of that, they stick to precise, factual statements. For a second example, we can look to the Uber service migration strategy: Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate as product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing. The team didn’t agree that their headcount should not be growing, but it was the reality they were operating in. They acknowledged their reality as a factual statement, without any additional commentary about that statement. In both of these examples, they found a professional, non-judgmental way to acknowledge the circumstances they were solving. The authors would have preferred that the leaders behind those decisions take explicit accountability for them, but it would have undermined the strategy work had they attempted to do it within their strategy writeup. Excluding critical parts of your diagnosis makes your strategies particularly hard to evaluate, copy or recreate. Find a way to say things politely to make the strategy effective. As always, strategies are much more about realities than ideals. Reframe blockers as part of diagnosis When I work on strategy with early-career leaders, an idea that comes up a lot is that an identified problem means that strategy is not possible. For example, they might argue that doing strategy work is impossible at their current company because the executive team changes their mind too often. That core insight is almost certainly true, but it’s much more powerful to reframe that as a diagnosis: if we don’t find a way to show concrete progress quickly, and use that to excite the executive team, our strategy is likely to fail. 
This transforms the thing preventing your strategy into a condition your strategy needs to address. Whenever you run into a reason why your strategy seems unlikely to work, or why strategy overall seems difficult, you’ve found an important piece of your diagnosis to include. There are never reasons why strategy simply cannot succeed, only diagnoses you’ve failed to recognize.

For example, in our work on Uber’s service provisioning strategy, we knew that we weren’t getting more headcount for the team, that the product engineering team was going to continue growing rapidly, and that engineering leadership was unwilling to constrain how product engineering worked. Rather than preventing us from implementing a strategy, those constraints clarified what sort of approach could actually succeed.

The role of self-awareness

Every problem of today is partially rooted in the decisions of yesterday. If you’ve been with your organization for any length of time, you are directly or indirectly responsible for a portion of the problems your diagnosis ought to recognize. Acknowledging the impact of your prior actions in your diagnosis is a powerful demonstration of self-awareness, and your next strategy’s success is rooted in that same self-awareness about your prior choices. Don’t be afraid to recognize the failures in your past work. While changing your mind without new data is a sign of chaotic leadership, changing your mind with new data is a sign of thoughtful leadership.

Summary

Because diagnosis is the foundation of effective strategy, I’ve always found it the most intimidating phase of strategy work. While I think that’s a somewhat unavoidable reality, my hope is that this chapter has prepared you for the challenge. The four most important things to remember are: form your diagnosis before deciding how to solve it, try especially hard to capture perspectives you initially disagree with, supplement intuition with data where you can, and accept that sometimes you’re missing the data you need to fully understand. The last piece in particular is why many good strategies never get shared, and it is the topic we’ll address in the next chapter on strategy refinement.

My friend, JT

I’ve had a cat for almost a third of my life.

[Course Launch] Hands-on Introduction to X86 Assembly

A Live, Interactive Course for Systems Engineers

It’s cool to care

I’m sitting in a small coffee shop in Brooklyn. I have a warm drink, and it’s just started to snow outside. I’m visiting New York to see Operation Mincemeat on Broadway – I was at the dress rehearsal yesterday, and I’ll be at the opening preview tonight. I’ve seen this show more times than I care to count, and I hope US theater-goers love it as much as Brits.

The people who make the show will tell you that it’s about a bunch of misfits who thought they could do something ridiculous, who had the audacity to believe in something unlikely. That’s certainly one way to see it. The musical tells the true story of a group of British spies who tried to fool Hitler with a dead body, fake papers, and an outrageous plan that could easily have failed. Decades later, the show’s creators would mirror that same spirit of unlikely ambition. Four friends, armed with their creativity, determination, and a wardrobe full of hats, created a new musical in a small London theatre. And after a series of transfers, they’re about to open the show under the bright lights of Broadway.

But when I watch the show, I see a story about friendship. It’s about how we need our friends to help us, to inspire us, to push us to be the best versions of ourselves. I see the swaggering leader who needs a team to help him truly achieve. The nervous scientist who stands up for himself with the support of his friends. The enthusiastic secretary who learns wisdom and resilience from her elder.

And so, I suppose, it’s fitting that I’m not in New York on my own. I’m here with friends – dozens of wonderful people who I met through this ridiculous show.

At first, I was just an audience member. I sat in my seat, I watched the show, and I laughed and cried in equal measure. After the show, I waited at stage door to thank the cast. Then I came to see the show a second time. And a third. And a fourth. After a few trips, I started to see familiar faces waiting with me at stage door. So before the cast came out, we started chatting. Those conversations became a Twitter community, then a Discord, then a WhatsApp. We swapped fan art, merch, and stories of our favourite moments. We went to other shows together, and we hung out outside the theatre. I spent New Year’s Eve with a few of these friends, sitting on somebody’s floor and laughing about a bowl of limes like it was the funniest thing in the world.

And now we’re together in New York. Meeting this kind, funny, and creative group of people might seem as unlikely as the premise of Mincemeat itself. But I believed it was possible, and here we are. I feel so lucky to have met these people, to take this ridiculous trip, to share these precious days with them. I know what a privilege this is – the time, the money, the ability to say let’s do this and make it happen. How many people can gather a dozen friends for even a single evening, let alone a trip halfway round the world?

You might think it’s silly to travel this far for a theatre show, especially one we’ve seen plenty of times in London. Some people would never see the same show twice, and most of us are comfortably into double or triple figures. Whenever somebody asks why, I don’t have a good answer. Because it’s fun? Because it’s moving? Because I enjoy it? I feel the need to justify it, as if there’s some logical reason that will make all of this okay. But maybe I don’t have to. Maybe joy doesn’t need justification.

A theatre show doesn’t happen without people who care. Neither does a friendship.
So much of our culture tells us that it’s not cool to care. It’s better to be detached, dismissive, disinterested. Enthusiasm is cringe. Sincerity is weakness. I’ve certainly felt that pressure – the urge to play it cool, to pretend I’m above it all. To act as if I only enjoy something a “normal” amount. Well, fuck that.

I don’t know where the drive to be detached comes from. Maybe it’s to protect ourselves, a way to guard against disappointment. Maybe it’s to seem sophisticated, as if having passions makes us childish or less mature. Or perhaps it’s about control – if we stay detached, we never have to depend on others, we never have to trust in something bigger than ourselves. Being detached means you can’t get hurt – but you’ll also miss out on so much joy.

I’m a big fan of being a big fan of things. So many of the best things in my life have come from caring, from letting myself be involved, from finding people who are a big fan of the same things as me. If I pretended not to care, I wouldn’t have any of that.

Caring – deeply, foolishly, vulnerably – is how I connect with people. My friends and I care about this show, we care about each other, and we care about our joy. That care and love for each other is what brought us together, and without it we wouldn’t be here in this city.

I know this is a once-in-a-lifetime trip. So many stars had to align – for us to meet, for the show we love to be successful, for us to be able to travel together. But if we didn’t care, none of those stars would have aligned. I know so many other friends who would have loved to be here but can’t be, for all kinds of reasons. Their absence isn’t for lack of caring, and they want the show to do well whether or not they’re here. I know they care, and that’s the important thing.

To butcher Tennyson: I think it’s better to care about something you cannot affect, than to care about nothing at all. In a world that’s full of cynicism and spite and hatred, I feel that now more than ever.

I’d recommend you go to the show if you haven’t already, but that’s not really the point of this post. Maybe you’ve already seen Operation Mincemeat, and it wasn’t for you. Maybe you’re not a theatre kid. Maybe you aren’t into musicals, or history, or war stories. That’s okay. I don’t mind if you care about different things to me. (Imagine how boring the world would be if we all cared about the same things!)

But I want you to care about something. I want you to find it, find people who care about it too, and hold on to them. Because right now, in this city, with these people, at this show? I’m so glad I did. And I hope you find that sort of happiness too.

(Photo: some of the people who made this trip special. Taken by Chloe, from her Twitter.)

Timing note: I wrote this on February 15th, but I delayed posting it because I didn’t want to highlight the fact I was away from home.

Stick with the customer

One of the biggest mistakes that new startup founders make is trying to get away from customer-facing roles too early. Whether it's customer support or sales, it's an incredible advantage to have the founders doing that work directly, and for much longer than they find comfortable.

The absolute worst thing you can do is hire a salesperson or a customer service agent too early. You'll miss all the golden nuggets that customers throw at you for free when they're rejecting your pitch or complaining about the product. Seeing those reasons paraphrased or summarized destroys all the nutrients in their insights. You want that whole-grain feedback straight from the customer's mouth!

When we launched Basecamp in 2004, Jason was doing all the customer service himself. And he kept doing it like that for three years!! By the time we hired our first customer service agent, Jason was doing 150 emails/day, and the business was doing millions of dollars in ARR. Basecamp got infinitely better, both as a market proposition and as a product, because Jason could funnel all that feedback into decisions and positioning.

For a long time after that, we did "Everyone on Support", frequently rotating programmers, designers, and founders through a day of answering emails directly to customers. The dividends of doing this were almost as high as having Jason run it all in the early years. We fixed an incredible number of minor niggles and annoying bugs because programmers found it easier to solve the problem than to apologize for why it was there.

It's not easy doing this! Customers often offer their valuable insights wrapped in rude language, unreasonable demands, and bad suggestions. That's why many founders quit the business of dealing with them at the first opportunity. That's why few companies ever do "Everyone on Support". That's why there's such eagerness to reduce support to an AI-only interaction.

But quitting dealing with customers early, not just in support but also in sales, is an incredible handicap for any startup. You don't have to do everything that every customer demands of you, but you should certainly listen to them. And you can't listen well if the sound is being muffled by early layers of indirection.
