Vulkan 1.4 sur Asahi Linux

from On Life and Lisp [alt+shift+b] in programming

English version follows. Aujourd’hui, Khronos Group a sorti la spécification 1.4 de l’API graphique standard Vulkan. Le projet Asahi Linux est fier d’annoncer le premier pilote Vulkan 1.4 pour le matériel d’Apple. En effet, notre pilote graphique Honeykrisp est reconnu par Khronos comme conforme à cette nouvelle version dès aujourd’hui. Ce pilote est déjà disponible dans nos dépôts officiels. Après avoir installé Fedora Asahi Remix, executez dnf upgrade --refresh pour obtenir la dernière version du pilote. Vulkan 1.4 standardise plusieurs fonctionnalités importantes, y compris les horodatages et la lecture locale avec le rendu dynamique. L’industrie suppose que ces fonctionnalités devront être plus courantes, et nous y sommes préparés. Sortir un pilote conforme reflète notre engagement en faveur des standards graphiques et du logiciel libre. Asahi Linux est aussi compatible avec OpenGL 4.6, OpenGL ES 3.2, et OpenCL 3.0, tous conformes aux spécifications pertinentes. D’ailleurs, les...

7 months ago

Remove from reading list Add to reading list [alt+a] Read now [→]

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from On Life and Lisp

AAA gaming on Asahi Linux

Gaming on Linux on M1 is here! We’re thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0. Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today’s release is an alpha, Control runs well! Installation First, install Fedora Asahi Remix. Once installed, get the latest drivers with dnf upgrade --refresh && reboot. Then just dnf install steam and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead. The stack Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference: FEX emulates x86 on Arm. Wine translates Windows to Linux. DXVK and vkd3d-proton translate DirectX to Vulkan. There’s one curveball: page size. Operating systems allocate memory in fixed size “pages”. If an application expects smaller pages than the system uses, they will break due to insufficient alignment of allocations. That’s a problem: x86 expects 4K pages but Apple systems use 16K pages. While Linux can’t mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you’re happy because you can play Fallout 4. Vulkan The final piece is an adult-level Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK support. Let’s look at some new features. Tessellation Tessellation enables games like The Witcher 3 to generate geometry. The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today’s talk at XDC2024. Geometry shaders Geometry shaders are an older, cruder method to generate geometry. Like tessellation, the M1 lacks geometry shader hardware so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don’t need to be fast – just fast enough for games like Ghostrunner. Enhanced robustness “Robustness” permits an application’s shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1. Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires VK_EXT_robustness2, a Vulkan extension strengthening robustness. Like before, we implement robustness with compare-and-select instructions. A naïve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load. load R, buffer, index * 16 ulesel R[0], index, size, R[0], 0 ulesel R[1], index, size, R[1], 0 ulesel R[2], index, size, R[2], 0 ulesel R[3], index, size, R[3], 0 There’s a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects. ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi load R, buffer, index * 16 Two instructions, not four. Next steps Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don’t require sparse, like Cyberpunk 2077. While many games are playable, newer AAA titles don’t hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed. Beyond gaming, we’re adding general purpose x86 emulation based on this stack. For more information, see the FAQ. Today’s alpha is a taste of what’s to come. Not the final form, but enough to enjoy Portal 2 while we work towards “1.0”. Acknowledgements This work has been years in the making with major contributions from… Alyssa Rosenzweig Asahi Lina chaos_princess Davide Cavalca Dougall Johnson Ella Stanforth Faith Ekstrand Janne Grunau Karol Herbst marcan Mary Guillemard Neal Gompa Sergio López TellowKrinkle Teoh Han Hui Rob Clark Ryan Houdek … Plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the magic of open source. We hope you enjoy the magic. Happy gaming.

9 months ago • 86 votes

Vulkan 1.3 on the M1 in 1 month

u{text-decoration-thickness:0.09em;text-decoration-color:skyblue} Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without “portability” waivers. Honeykrisp is not yet released for end users. We’re continuing to add features, improve performance, and port to more hardware. Source code is available for developers. HoloCure running on Honeykrisp ft. DXVK, FEX, and Proton. Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand’s open source NVK driver for NVIDIA GPUs. In her words: All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I’m building NVK with all the best practices we’ve developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized. Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA’s desktop architecture differs from the M1’s mobile roots. In exchange, we get a modern driver designed for desktop games. We’ll need to pass a half-million tests ensuring correctness, submit the results, and then we’ll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver… can we write a driver passing the Vulkan 1.3 conformance test suite faster than the 30 day review period? It’s unprecedented… Challenge accepted. April 2 It begins with a text. Faith… I think I want to write a Vulkan driver. Her advice? Just start typing. Thre’s no copy-pasting yet – we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting “NVK” to Asahi Lina’s kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay. April 3 To access resources, GPUs use “descriptors” containing the address, format, and size of a resource. Vulkan bundles descriptors into “sets” per the application’s “descriptor set layout”. When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware’s data structures. As our descriptors differ from NVIDIA’s, our next task is adapting NVK’s descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add. April 4 With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 “control streams”, then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works. That’s enough to move on to “copies” of buffers and images. We implement Vulkan’s copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes. April 5 Fleshing out yesterday’s code, all copy tests pass. April 6 We’re ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That’s straightforward, but there’s a lot of state to handle. Faith’s code collects all “dynamic state” into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on. April 7 What makes state “dynamic”? Dynamic state can change without recompiling shaders. By contrast, static state is baked into shader binaries called “pipelines”. If games create all their pipelines during a loading screen, there is no compiler “stutter” during gameplay. The idea hasn’t quite panned out: many game developers don’t know their state ahead-of-time so cannot create pipelines early. In response, Vulkan has made ever more state dynamic, punctuated with the EXT_shader_object extension that makes pipelines optional. We want full dynamic state and shader objects. Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1’s designers bet on pipelines. Faced with this hardware, a reasonable driver developer would double-down on pipelines. DXVK would stutter, but we’d pass conformance. I am not reasonable. To eliminate stuttering in OpenGL, we make state dynamic with four strategies: Conditional code. Precompiled variants. Indirection. Prologs and epilogs. Wait, what-a-logs? AMD also bakes state into shaders… with a twist. They divide the hardware binary into three parts: a prolog, the shader, and an epilog. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that’s fast and doesn’t stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too. For Honeykrisp, let’s follow NVK’s lead and treat all state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs. Putting it together, we get a (dynamic) triangle: April 8 Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours. /* Translate an American VkBorderColor into a Canadian agx_border_colour */ enum agx_border_colour translate_border_color(VkBorderColor color) { switch (color) { case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK: return AGX_BORDER_COLOUR_TRANSPARENT_BLACK; ... } } Test results are getting there. Pass: 149770, Fail: 7741, Crash: 2396 That’s good enough for vkQuake. April 9 Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to the finish line. Pass: 255209, Fail: 3818, Crash: 599 98.3% pass rate for 1.3 on our 1 week anniversary. Not bad. April 10 SuperTuxKart has a Vulkan renderer. April 11 Zink works too. April 12 I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it’s resolved within a few weeks. April 16 The tests for “descriptor indexing” revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1’s shuffle instruction is quirky, but it’s easy to workaround. Fixing that fixes the descriptor indexing tests. April 17 A few tests crash inside our register allocator. Their shaders contain a peculiar construction: if (condition) { while (true) { } } condition is always false, but the compiler doesn’t know that. Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. “All loops contain a break” seems obvious for a shader, but it’s false. It’s straightforward to fix register allocation, but what a doozy. April 18 Remember copies? They’re slow, and every frame currently requires a copy to get on screen. For “zero copy” rendering, we need enough Linux window system integration to negotiate an efficient surface layout across process boundaries. Linux uses “modifiers” for this purpose, so we implement the EXT_image_drm_format_modifier extension. And by implement, I mean copy. Copies to avoid copies. April 20 “I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac.” … “Ma’am, this is a Wendy’s.” April 22 As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don’t pre-pack control words during pipeline creation. That adds theoretical CPU overhead. Is that a problem? After some optimization, vkoverhead says we’re pushing 100 million draws per second. I think we’re okay. April 24 Time to light up YCbCr. If we don’t use special YCbCr hardware, this feature is “software-only”. However, it touches a lot of code. It touches so much code that Mohamed Ahmed spent an entire summer adding it to NVK. Which means he spent a summer adding it to Honeykrisp. Thanks, Mohamed ;-) April 25 Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque “query pool”. The result can be copied from the query pool on the CPU or GPU. For the CPU, the driver maps the pool’s internal data structure and copies the result. This may require nontrivial repacking. For the GPU, we need to repack in a compute shader. That’s harder, because we can’t just run C code on the GPU, right? …Actually, we can. A little witchcraft makes GPU query copies as easy as C. void copy_query(struct params *p, int i) { uintptr_t dst = p->dest + i * p->stride; int query = p->first + i; if (p->available[query] || p->partial) { int q = p->index[query]; write_result(dst, p->_64, p->results[q]); } ... } April 26 The final boss: border colours, hard mode. Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours: (0, 0, 0, 0) – transparent black (0, 0, 0, 1) – opaque black (1, 1, 1, 1) – opaque white We handled these on April 8. Unfortunately, there are two problems. First, we need custom border colours for Direct3D compatibility. Both DXVK and vkd3d-proton require the EXT_custom_border_color extension. Second, there’s a subtle problem with our hardware, causing dozens of fails even without custom border colours. To understand the issue, let’s revisit texture descriptors, which contain a pixel format and a component reordering swizzle. Some formats are implicitly reordered. Common “BGRA” formats swap red and blue for historical reasons. The M1 does not directly support these formats. Instead, the driver composes the swizzle with the format’s reordering. If the application uses a BARB swizzle with a BGRA format, the driver uses an RABR swizzle with an RGBA format. There’s a catch: swizzles apply to the border colour, but formats do not. We need to undo the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn’t give us that information. Without custom border colour support, we “should” be okay. Swapping red and blue doesn’t change anything if the colour is white or black. There’s an even subtler catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format… but with reversed “endianness”, swapping red and alpha. That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn’t change the result. The problem is opaque black: (0, 0, 0, 1). Swapping red and alpha gives (1, 0, 0, 0). Transparent red? Uh-oh. We’re stuck. No known hardware configuration implements correct Vulkan semantics. Is hope lost? Do we give up? A reasonable person would. I am not reasonable. Let’s jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1’s custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support. As you know, I am not reasonable. Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we’ll inject code to fix up the border colour. This emulation is simple, correct, and slow. We’ll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests. April 27 All that’s left is some last minute bug fixing, and… Pass: 686930, Fail: 0 Success. The future The next task is implementing everything that DXVK and vkd3d-proton require to layer Direct3D. That includes esoteric extensions like transform feedback. Then Wine and an open source x86 emulator will run Windows games on Asahi Linux. That’s getting ahead of ourselves. In the mean time, enjoy Linux games with our conformant OpenGL 4.6 drivers… and stay tuned. Baby Storm running on Honeykrisp ft. DXVK, FEX, and Proton.

a year ago • 126 votes

Conformant OpenGL 4.6 on the M1

For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers. Already installed? Just dnf –refresh upgrade. Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender, Ryujinx, and Citra. Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2. While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan. Compared to 4.1, OpenGL 4.6 adds dozens of required features, including: Robustness SPIR-V Clip control Cull distance Compute shaders Upgraded transform feedback Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set. How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on. For a taste of the challenges we overcame, let’s look at robustness. Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance. For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”. “Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface. All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility. Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled: Zero (Direct3D, Vulkan with robustBufferAccess2) Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess) Arbitrary values, but can’t crash (OpenGL ES) OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result. As an example, a uniform buffer load without robustness might look like: load.i32 result, buffer, index Robustness adds a single unsigned minimum (umin) instruction: umin idx, index, last load.i32 result, buffer, idx Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency. There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up. We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot. Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address: Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads. One instruction set feature helps. In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like: imad idx, stride/4, vertex, offset/4 load.i32 result, base, idx The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction: load.v4i32 result, base, vertex << 2 …with the hardware calculating the address: What about robustness? We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”. Let’s handle the latter problem. We can rewrite the addressing equation as: That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness: umin idx, vertex, last valid load.v4i32 result, base, idx << 2 We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base: The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as: The driver calculates the right-hand side and passes it into the shader. One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes. Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction. In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes. …But it would be no fun if our hardware was reasonable. Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”. For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality. With robustness, the specifications all agree that image loads return… Zero if the X- or Y-coordinate is out-of-bounds Zero if the level is out-of-bounds Meanwhile, image loads on the M1 GPU return… Zero if the X- or Y-coordinate is out-of-bounds Values from the last level if the level is out-of-bounds Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why. The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance. The obvious workaround is to never load from an invalid level: if (level <= levels) { return imageLoad(x, y, level); } else { return 0; } That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching: vec4 data = imageLoad(x, y, level); return (level <= levels) ? data : 0; This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not: image_load R, x, y, level ulesel R[0], level, levels, R[0], 0 ulesel R[1], level, levels, R[1], 0 ulesel R[2], level, levels, R[2], 0 ulesel R[3], level, levels, R[3], 0 Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround: bool valid = (level <= levels); int x_ = valid ? x : 20000; return imageLoad(x_, y, level); Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly: ulesel x_, level, levels, x, #20000 image_load R, x_, y, level If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance. Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.

a year ago • 73 votes

The first conformant M1 GPU driver

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux! For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers. Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry. To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the M1, M1 Pro/Max/Ultra, M2, and M2 Pro/Max. Today’s milestone isn’t just about OpenGL ES. We’re releasing the first conformant implementation of any graphics standard for the M1. And we don’t plan to stop here ;-) Unlike ours, the manufacturer’s M1 drivers are unfortunately not conformant for any standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means that there is no guarantee that applications using the standards will work on your M1/M2 (if you’re not running Linux). This isn’t just a theoretical issue. Consider Vulkan. The third-party MoltenVK layers a subset of Vulkan on top of the proprietary drivers. However, those drivers lack key functionality, breaking valid Vulkan applications. That hinders developers and users alike, if they haven’t yet switched their M1/M2 computers to Linux. Why did we pursue standards conformance when the manufacturer did not? Above all, our commitment to quality. We want our users to know that they can depend on our Linux drivers. We want standard software to run without M1-specific hacks or porting. We want to set the right example for the ecosystem: the way forward is implementing open standards, conformant to the specifications, without compromises for “portability”. We are not satisfied with proprietary drivers, proprietary APIs, and refusal to implement standards. The rest of the industry knows that progress comes from cross-vendor collaboration. We know it, too. Achieving conformance is a win for our community, for open source, and for open graphics. Of course, Asahi Lina and I are two individuals with minimal funding. It’s a little awkward that we beat the big corporation… It’s not too late though. They should follow our lead! OpenGL ES 3.1 updates the experimental OpenGL ES 3.0 and OpenGL 3.1 we shipped in June. Notably, ES 3.1 adds compute shaders, typically used to accelerate general computations within graphics applications. For example, a 3D game could run its physics simulations in a compute shader. The simulation results can then be used for rendering, eliminating stalls that would otherwise be required to synchronize the GPU with a CPU physics simulation. That lets the game run faster. Let’s zoom in on one new feature: atomics on images. Older versions of OpenGL ES allowed an application to read an image in order to display it on screen. ES 3.1 allows the application to write to the image, typically from a compute shader. This new feature enables flexible image processing algorithms, which previously needed to fit into the fixed-function 3D pipeline. However, GPUs are massively parallel, running thousands of threads at the same time. If two threads write to the same location, there is a conflict: depending which thread runs first, the result will be different. We have a race condition. “Atomic” access to memory provides a solution to race conditions. With atomics, special hardware in the memory subsystem guarantees consistent, well-defined results for select operations, regardless of the order of the threads. Modern graphics hardware supports various atomic operations, like addition, serving as building blocks to complex parallel algorithms. Can we put these two features together to write to an image atomically? Yes. A ubiquitous OpenGL ES extension, required for ES 3.2, adds atomics operating on pixels in an image. For example, a compute shader could atomically increment the value at pixel (10, 20). Other GPUs have dedicated instructions to perform atomics on an images, making the driver implementation straightforward. For us, the story is more complicated. The M1 lacks hardware instructions for image atomics, even though it has non-image atomics and non-atomic images. We need to reframe the problem. The idea is simple: to perform an atomic on a pixel, we instead calculate the address of the pixel in memory and perform a regular atomic on that address. Since the hardware supports regular atomics, our task is “just” calculating the pixel’s address. If the image were laid out linearly in memory, this would be straightforward: multiply the Y-coordinate by the number of bytes per row (“stride”), multiply the X-coordinate by the number of bytes per pixel, and add. That gives the pixel’s offset in bytes relative to the first pixel of the image. To get the final address, we add that offset to the address of the first pixel. Alas, images are rarely linear in memory. To improve cache efficiency, modern graphics hardware interleaves the X- and Y-coordinates. Instead of one row after the next, pixels in memory follow a spiral-like curve. We need to amend our previous equation to interleave the coordinates. We could use many instructions to mask one bit at a time, shifting to construct the interleaved result, but that’s inefficient. We can do better. There is a well-known “bit twiddling” algorithm to interleave bits. Rather than shuffle one bit at a time, the algorithm shuffles groups of bits, parallelizing the problem. Implementing this algorithm in shader code improves performance. In practice, only the lower 7-bits (or less) of each coordinate are interleaved. That lets us use 32-bit instructions to “vectorize” the interleave, by putting the X- and Y-coordinates in the low and high 16-bits of a 32-bit register. Those 32-bit instructions let us interleave X and Y at the same time, halving the instruction count. Plus, we can exploit the GPU’s combined shift-and-add instruction. Putting the tricks together, we interleave in 10 instructions of M1 GPU assembly: # Inputs x, y in r0l, r0h. # Output in r1. add r2, #0, r0, lsl 4 or r1, r0, r2 and r1, r1, #0xf0f0f0f add r2, #0, r1, lsl 2 or r1, r1, r2 and r1, r1, #0x33333333 add r2, #0, r1, lsl 1 or r1, r1, r2 and r1, r1, #0x55555555 add r1, r1l, r1h, lsl 1 We could stop here, but what if there’s a dedicated instruction to interleave bits? PowerVR has a “shuffle” instruction shfl, and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too. Unfortunately, even if it was, the proprietary compiler won’t use it when compiling our test shaders. That makes it difficult to reverse-engineer the instruction – if it exists – by observing compiled shaders. It’s time to dust off a powerful reverse-engineering technique from magic kindergarten: guess and check. Dougall Johnson provided the guess. When considering the instructions we already know about, he took special notice of the “reverse bits” instruction. Since reversing bits is a type of bit shuffle, the interleave instruction should be encoded similarly. The bit reverse instruction has a two-bit field specifying the operation, with value 01. Related instructions to count the number of set bits and find the first set bit have values 10 and 11 respectively. That encompasses all known “complex bit manipulation” instructions. tr:first-child > td:nth-child(2) { text-align:center !important } td > strong > a:visited { color: #0000EE } 00 ? ? ? 01 Reverse bits 10 Count set bits 11 Find first set There is one value of the two-bit enumeration that is unobserved and unknown: 00. If this interleave instruction exists, it’s probably encoded like the bit reverse but with operation code 00 instead of 01. There’s a difficulty: the three known instructions have one single input source, but our instruction interleaves two sources. Where does the second source go? We can make a guess based on symmetry. Presumably to simplify the hardware decoder, M1 GPU instructions usually encode their sources in consistent locations across instructions. The other three instructions have a gap where we would expect the second source to be, in a two-source arithmetic instruction. Probably the second source is there. Armed with a guess, it’s our turn to check. Rather than handwrite GPU assembly, we can hack our compiler to replace some two-source integer operation (like multiply) with our guessed encoding of “interleave”. Then we write a compute shader using this operation (by “multiplying” numbers) and run it with the newfangled compute support in our driver. All that’s left is writing a shader that checks that the mystery instruction returns the interleaved result for each possible input. Since the instruction takes two 16-bit sources, there are about 4 billion (\(2^32\)) inputs. With our driver, the M1 GPU manages to check them all in under a second, and the verdict is in: this is our interleave instruction. As for our clever vectorized assembly to interleave coordinates? We can replace it with one instruction. It’s anticlimactic, but it’s fast and it passes the conformance tests. And that’s what matters. Thank you to Khronos and Software in the Public Interest for supporting open drivers.

a year ago • 46 votes

More in programming

Increase software sales by 50% or more

This is re-post of How to Permanently Increase Your Sales by 50% or More in Only One Day article by Steve Pavlina Of all the things you can do to increase your sales, one of the highest leverage activities is attempting to increase your products’ registration rate. Increasing your registration rate from 1.0% to 1.5% means that you simply convince one more downloader out of every 200 to make the decision to buy. Yet that same tiny increase will literally increase your sales by a full 50%. If you’re one of those developers who simply slapped the ubiquitous 30-day trial incentive on your shareware products without going any further than that, then I think a 50% increase in your registration rate is a very attainable goal you can achieve if you spend just one full day of concentrated effort on improving your product’s ability to sell. My hope is that this article will get you off to a good start and get you thinking more creatively. And even if you fail, your result might be that you achieve only a 25% or a 10% increase. How much additional money would that represent to you over the next five years of sales? What influence, if any, did the title of this article have on your decision to read it? If I had titled this article, “Registration Incentives,” would you have been more or less likely to read it now? Note that the title expresses a specific and clear benefit to you. It tells you exactly what you can expect to gain by reading it. Effective registration incentives work the same way. They offer clear, specific benefits to the user if a purchase is made. In order to improve your registration incentives, the first thing you need to do is to adopt some new beliefs that will change your perspective. I’m going to introduce you to what I call the “lies of success” in the shareware industry. These are statements that are not true at all, but if you accept them as true anyway, you’ll achieve far better results than if you don’t. Rule 1: What you are selling is merely the difference between the shareware and the registered versions, not the registered version itself. Note that this is not a true statement, but if you accept it as true, you’ll immediately begin to see the weaknesses in your registration incentives. If there are few additional benefits for buying the full version vs. using the shareware version, then you aren’t offering the user strong enough incentives to make the full purchase. Rule 2: The sole purpose of the shareware version is to close the sale. This is our second lie of success. Note the emphasis on the word “close.” Your shareware version needs to act as a direct sales vehicle. It must be able to take the user all the way to the point of purchase, i.e. your online order form, ideally with nothing more than a few mouse clicks. Anything that detracts from achieving a quick sale is likely to hurt sales. Rule 3: The customer’s perspective is the only one that matters. Defy this rule at your peril. Customers don’t care that you spent 2000 hours creating your product. Customers don’t care that you deserve the money for your hard work. Customers don’t care that you need to do certain things to prevent piracy. All that matters to them are their own personal wants and needs. Yes, these are lies of success. Some customers will care, but if you design your registration incentives assuming they only care about their own self-interests, your motivation to buy will be much stronger than if you merely appeal to their sense of honesty, loyalty, or honor. Assume your customers are all asking, “What’s in it for me if I choose to buy? What will I get? How will this help me?” I don’t care if you’re selling to Fortune 500 companies. At some point there will be an individual responsible for causing the purchase to happen, and that individual is going to consider how the purchase will affect him/her personally: “Will this purchase get me fired? Will it make me look good in front of my peers? Will this make my job easier or harder?” Many shareware developers get caught in the trap of discriminating between honest and dishonest users, believing that honest users will register and dishonest ones won’t. This line of thinking will ultimately get you nowhere, and it violates the third lie of success. When you make a purchase decision, how often do you use honesty as the deciding factor? Do you ever say, “I will buy this because I’m honest?” Or do you consider other more selfish factors first, such as how it will make you feel to purchase the software? The truth is that every user believes s/he is honest, so no user applies the honesty criterion when making a purchase decision. Thinking of your users in terms of honest ones vs. dishonest ones is a complete waste of time because that’s not how users primarily view themselves. Rule 4: Customers buy on emotion and justify with fact. If you’re honest with yourself, you’ll see that this is how you make most purchase decisions. Remember the last time you bought a computer. Is it fair to say that you first became emotionally attached to the idea of owning a new machine? For me, it’s the feeling of working faster, owning the latest technology, and being more productive that motivates me to go computer shopping. Once I’ve become emotionally committed, the justifications follow: “It’s been two years since I’ve upgraded, it will pay for itself with the productivity boost I gain, I can easily afford it, I’ve worked hard and I deserve a new machine, etc.” You use facts to justify the purchase. Once you understand how purchase decisions are made, you can see that your shareware products need to first get the user emotionally invested in the purchase, and then you give them all the facts they need to justify it. Now that we’ve gotten these four lies of success out of the way, let’s see how we might apply them to create some compelling registration incentives. Let’s start with Rule 1. What incentives can be spawned from this rule? The common 30-day trial is one obvious derivative. If you are only selling the difference between the shareware and registered versions, then a 30-day trial implies that you are selling unlimited future days of usage of the program after the trial period expires. This is a powerful incentive, and it’s been proven effective for products that users will continue to use month after month. 30-day trials are easy for users to understand, and they’re also easy to implement. You could also experiment with other time periods such as 10 days, 14 days, or 90 days. The only way of truly knowing which will work best for your products is to experiment. But let’s see if we can move a bit beyond the basic 30-day trial here by mixing in a little of Rule 3. How would the customer perceive a 30-day trial? In most cases 30 days is plenty of time to evaluate a product. But in what situations would a 30-day trial have a negative effect? A good example is when the user downloads, installs, and briefly checks out a product s/he may not have time to evaluate right away. By the time the user gets around to fully evaluating it, the shareware version has already expired, and a sale may be lost as a result. To get around this limitation, many shareware developers have started offering 30 days of actual program usage instead of 30 consecutive days. This allows the user plenty of time to try out the program at his/her convenience. Another possibility would be to limit the number of times the program can be run. The basic idea is that you are giving away limited usage and selling unlimited usage of the program. This incentive definitely works if your product is one that will be used frequently over a long period of time (much longer than the trial period). The flip side of usage limitation is to offer an additional bonus for buying within a certain period of time. For instance, in my game Dweep, I offer an extra 5 free bonus levels to everyone who buys within the first 10 days. In truth I give the bonus levels to everyone who buys, but the incentive is real from the customer’s point of view. Remember Rule 3 - it doesn’t matter what happens on my end; it only matters what the customer perceives. Any customer that buys after the first 10 days will be delighted anyway to receive a bonus they thought they missed. So if your product has no time-based incentives at all, this is the first place to start. When would you pay your bills if they were never due, and no interest was charged on late payments? Use time pressure to your advantage, either by disabling features in the shareware version after a certain time or by offering additional bonuses for buying sooner rather than later. If nothing else and if it’s legal in your area, offer a free entry in a random monthly drawing for a small prize, such as one of your other products, for anyone who buys within the first X days. Another logical derivative of Rule 1 is the concept of feature limitation. On the crippling side, you can start with the registered version and begin disabling functionality to create the shareware version. Disabling printing in a shareware text editor is a common strategy. So is corrupting your program’s output with a simple watermark. For instance, your shareware editor could print every page with your logo in the background. Years ago the Association of Shareware Professionals had a strict policy against crippling, but that policy was abandoned, and crippling has been recognized as an effective registration incentive. It is certainly possible to apply feature limitation without having it perceived as crippling. This is especially easy for games, which commonly offer a limited number of playable levels in the shareware version with many more levels available only in the registered version. In this situation you offer the user a seemingly complete experience of your product in the shareware version, and you provide additional features on top of that for the registered version. Time-based incentives and feature-based incentives are perhaps the two most common strategies used by shareware developers for enticing users to buy. Which will work best for you? You will probably see the best results if you use both at the same time. Imagine you’re the end user for a moment. Would you be more likely to buy if you were promised additional features and given a deadline to make the decision? I’ve seen several developers who were using only one of these two strategies increase their registration rates dramatically by applying the second strategy on top of the first. If you only use time-based limitations, how could you apply feature limitation as well? Giving the user more reasons to buy will translate to more sales per download. One you have both time-based and feature-based incentives to buy, the next step is to address the user’s perceived risk by applying a risk-reversal strategy. Fortunately, the shareware model already reduces the perceived risk of purchasing significantly, since the user is able to try before buying. But let’s go a little further, keeping Rule 3 in mind. What else might be a perceived risk to the user? What if the user reaches the end of the trial period and still isn’t certain the product will do what s/he needs? What if the additional features in the registered version don’t work as the user expects? What can we do to make the decision to purchase safer for the user? One approach is to offer a money-back guarantee. I’ve been offering a 60-day unconditional money-back guarantee on all my products since January 2000. If someone asks for their money back for any reason, I give them a full refund right away. So what is my return rate? Well, it’s about 8%. Just kidding! Would it surprise you to learn that my return rate at the time of this writing is less than 0.2%? Could you handle two returns out of every 1000 sales? My best estimate is that this one technique increased my sales by 5-10%, and it only took a few minutes to implement. When I suggest this strategy to other shareware developers, the usual reaction is fear. “But everyone would rip me off,” is a common response. I suggest trying it for yourself on an experimental basis; a few brave souls have already tried it and are now offering money-back guarantees prominently. Try putting it up on your web site for a while just to convince yourself it works. You can take it down at any time. After a few months, if you’re happy with the results, add the guarantee to your shareware products as well. I haven’t heard of one bad outcome yet from those who’ve tried it. If you use feature limitation in your shareware products, another important component of risk reversal is to show the user exactly what s/he will get in the full version. In Dweep I give away the first five levels in the demo version, and purchasing the full version gets you 147 more levels. When I thought about this from the customer’s perspective (Rule 3), I realized that a perceived risk is that s/he doesn’t know if the registered version levels will be as fun as the demo levels. So I released a new demo where you can see every level but only play the first five. This lets the customer see all the fun that awaits them. So if you have a feature-limited product, show the customer how the feature will work. For instance, if your shareware version has printing disabled, the customer could be worried that the full version’s print capability won’t work with his/her printer or that the output quality will be poor. A better strategy is to allow printing, but to watermark the output. This way the customer can still test and verify the feature, and it doesn’t take much imagination to realize what the output will look like without the watermark. Our next step is to consider Rule 2 and include the ability close the sale. It is imperative that you include an “instant gratification” button in your shareware products, so the customer can click to launch their default web browser and go directly to your online order form. If you already have a “buy now” button in your products, go a step further. A small group of us have been finding that the more liberally these buttons are used, the better. If you only have one or two of these buttons in your shareware program, you should increase the count by at least an order of magnitude. The current Dweep demo now has over 100 of these buttons scattered throughout the menus and dialogs. This makes it extremely easy for the customer to buy, since s/he never has to hunt around for the ordering link. What should you label these buttons? “Buy now” or “Register now” are popular, so feel free to use one of those. I took a slightly different approach by trying to think like a customer (Rule 3 again). As a customer the word “buy” has a slightly negative association for me. It makes me think of parting with my cash, and it brings up feelings of sacrifice and pressure. The words “buy now” imply that I have to give away something. So instead, I use the words, “Get now.” As a customer I feel much better about getting something than buying something, since “getting” brings up only positive associations. This is the psychology I use, but at present, I don’t know of any hard data showing which is better. Unless you have a strong preference, trust your intuition. Make it as easy as possible for the willing customer to buy. The more methods of payment you accept, the better your sales will be. Allow the customer to click a button to print an order form directly from your program and mail it with a check or money order. On your web order form, include a link to a printable text order form for those who are afraid to use their credit cards online. If you only accept two or three major credit cards, sign up with a registration service to handle orders for those you don’t accept. So far we’ve given the customer some good incentives to buy, minimized perceived risk, and made it easy to make the purchase. But we haven’t yet gotten the customer emotionally invested in making the purchase decision. That’s where Rule 4 comes in. First, we must recognize the difference between benefits and features. We need to sell the sizzle, not the steak. Features describe your product, while benefits describe what the user will get by using your product. For instance, a personal information manager (PIM) program may have features such as daily, weekly, and monthly views; task and event timers; and a contact database. However, the benefits of the program might be that it helps the user be more organized, earn more money, and enjoy more free time. For a game, the main benefit might be fun. For a nature screensaver, it could be relaxation, beauty appreciation, or peace. Features are logical; benefits are emotional. Logical features are an important part of the sale, but only after we’ve engaged the customer’s emotions. Many products do a fair job of getting the customer emotionally invested during the trial period. If you have an addictive program or one that’s fun to use, such as a game, you may have an easy time getting the customer emotionally attached to using it because the experience is already emotional in nature. But whatever your product is, you can increase your sales by clearly illustrating the benefits of making the purchase. A good place to do this is in your nag screens. I use nag screens both before and after the program runs to remind the user of the benefits of buying the full version. At the very least, include a nag screen when the customer exits the program, so the last thing s/he sees will be a reminder of the product’s benefits. Take this opportunity to sell the user on the product. Don’t expect features like “customizable colors” to motivate anyone to buy. Paint a picture of what benefits the user will obtain with the full version. Will I save time? Will I have more fun? Will I live longer, save money, or feel better? The simple change from feature-oriented selling to benefit-oriented selling can easily double or triple your sales. Be sure to use this approach on your web site as well if you don’t already. Developers who’ve recently made the switch have been reporting some amazing results. If you’re drawing a blank when trying to come up with benefits for your products, the best thing you can do is to email some of your old customers and ask them why they bought your program. What did it do for them? I’ve done this and was amazed at the answers I got back. People were buying my games for reasons I’d never anticipated, and that told me which benefits I needed to emphasize in my sales pitch. The next key is to make your offer irresistible to potential customers. Find ways to offer the customer so much value that it would be harder to say no than to say yes. Take a look at your shareware product as if you were a potential customer who’d never seen it before. Being totally honest with yourself, would you buy this program if someone else had written it? If not, don’t stop here. As a potential customer, what additional benefits or features would put you over the top and convince you to buy? More is always better than less. In the original version of Dweep, I offered ten levels in the demo and thirty in the registered version. Now I offer only five demo levels and 152 in the full version, plus a built-in level editor. Originally, I offered the player twice the value of the demo; now I’m offering over thirty times the value. I also offer free hints and solutions to every level; the benefit here is that it minimizes player frustration. As I keep adding bonuses for purchasing, the offer becomes harder and harder to resist. What clever bonuses can you throw in for registering? Take the time to watch an infomercial. Notice that there is always at least one “FREE” bonus thrown in. Consider offering a few extra filters for an image editor, ten extra images for a screensaver, or extra levels for a game. What else might appeal to your customers? Be creative. Your bonus doesn’t even have to be software-based. Offer a free report about building site traffic with your HTML editor, include an essay on effective time management with your scheduling program, or throw in a small business success guide with your billing program. If you make such programs, you shouldn’t have too much trouble coming up with a few pages of text that would benefit your customers. Keep working at it until your offer even looks irresistible to you. If all the bonuses you offer can be delivered electronically, how many can you afford to include? If each one only gains one more customer in a thousand (0.1%), would it be worth the effort over the lifetime of your sales? So how do you know if your registration incentives are strong enough? And how do you know if your product is over-crippled? Where do you draw the line? These are tough issues, but there is a good way to handle them if your product is likely to be used over a long period of time, particularly if it’s used on a daily basis. Simply make your program gradually increase its registration incentives over time. One easy way to do this is with a delay timer on your nag screens that increases each time the program is run. Another approach is to disable certain features at set intervals. You begin by disabling non-critical features and gradually move up to disabling key functionality. The program becomes harder and harder to continue using for free, so the benefits of registering become more and more compelling. Instead of having your program completely disable itself after your trial period, you gradually degrade its usability with additional usage. This approach can be superior to a strict 30-day trial, since it allows your program to still be used for a while, but after prolonged usage it becomes effectively unusable. However, you don’t simply shock the user by taking away all the benefits s/he has become accustomed to on a particular day. Instead, you begin with a gentle reminder that becomes harder and harder to ignore. There may be times when your 30-day trial shuts off at an inconvenient time for the user, and you may lose a sale as a result. For instance, the user may not have the money at the time, or s/he may be busy at the trial’s end and forget to register. In that case s/he may quickly replace what was lost with a competitor’s trial version. The gradual degradation approach allows the user to continue using your product, but with increasing difficulty over time. Eventually, there is a breaking point where the user either decides to buy or to stop using the program completely, but this can be done within a window of time at the user’s convenience. Hopefully this article has gotten you thinking creatively about all the overlooked ways you can entice people to buy your shareware products. The most important thing you can do is to begin seeing your products through your customers’ eyes. What additional motivation would convince you to buy? What would represent an irresistible offer to you? There is no limit to how many incentives you can add. Don’t stop at just one or two; instead, give the customer a half dozen or more reasons to buy, and you’ll see your registration rate soar. Is it worth spending a day to do this? I think so.

yesterday • 4 votes

Maybe writing speed actually is a bottleneck for programming

I'm a big (neo)vim buff. My config is over 1500 lines and I regularly write new scripts. I recently ported my neovim config to a new laptop. Before then, I was using VSCode to write, and when I switched back I immediately saw a big gain in productivity. People often pooh-pooh vim (and other assistive writing technologies) by saying that writing code isn't the bottleneck in software development. Reading, understanding, and thinking through code is! Now I don't know how true this actually is in practice, because empirical studies of time spent coding are all over the place. Most of them, like this study, track time spent in the editor but don't distinguish between time spent reading code and time spent writing code. The only one I found that separates them was this study. It finds that developers spend only 5% of their time editing. It also finds they spend 14% of their time moving or resizing editor windows, so I don't know how clean their data is. But I have a bigger problem with "writing is not the bottleneck": when I think of a bottleneck, I imagine that no amount of improvement will lead to productivity gains. Like if a program is bottlenecked on the network, it isn't going to get noticeably faster with 100x more ram or compute. But being able to type code 100x faster, even with without corresponding improvements to reading and imagining code, would be huge. We'll assume the average developer writes at 80 words per minute, at five characters a word, for 400 characters a minute.What could we do if we instead wrote at 8,000 words/40k characters a minute? Writing fast Boilerplate is trivial Why do people like type inference? Because writing all of the types manually is annoying. Why don't people like boilerplate? Because it's annoying to write every damn time. Programmers like features that help them write less! That's not a problem if you can write all of the boilerplate in 0.1 seconds. You still have the problem of reading boilerplate heavy code, but you can use the remaining 0.9 seconds to churn out an extension that parses the file and presents the boilerplate in a more legible fashion. We can write more tooling This is something I've noticed with LLMs: when I can churn out crappy code as a free action, I use that to write lots of tools that assist me in writing good code. Even if I'm bottlenecked on a large program, I can still quickly write a script that helps me with something. Most of these aren't things I would have written because they'd take too long to write! Again, not the best comparison, because LLMs also shortcut learning the relevant APIs, so also optimize the "understanding code" part. Then again, if I could type real fast I could more quickly whip up experiments on new apis to learn them faster. We can do practices that slow us down in the short-term Something like test-driven development significantly slows down how fast you write production code, because you have to spend a lot more time writing test code. Pair programming trades speed of writing code for speed of understanding code. A two-order-of-magnitude writing speedup makes both of them effectively free. Or, if you're not an eXtreme Programming fan, you can more easily follow the The Power of Ten Rules and blanket your code with contracts and assertions. We could do more speculative editing This is probably the biggest difference in how we'd work if we could write 100x faster: it'd be much easier to try changes to the code to see if they're good ideas in the first place. How often have I tried optimizing something, only to find out it didn't make a difference? How often have I done a refactoring only to end up with lower-quality code overall? Too often. Over time it makes me prefer to try things that I know will work, and only "speculatively edit" when I think it be a fast change. If I could code 100x faster it would absolutely lead to me trying more speculative edits. This is especially big because I believe that lots of speculative edits are high-risk, high-reward: given 50 things we could do to the code, 49 won't make a difference and one will be a major improvement. If I only have time to try five things, I have a 10% chance of hitting the jackpot. If I can try 500 things I will get that reward every single time. Processes are built off constraints There are just a few ideas I came up with; there are probably others. Most of them, I suspect, will share the same property in common: they change the process of writing code to leverage the speedup. I can totally believe that a large speedup would not remove a bottleneck in the processes we currently use to write code. But that's because those processes are developed work within our existing constraints. Remove a constraint and new processes become possible. The way I see it, if our current process produces 1 Utils of Software / day, a 100x writing speedup might lead to only 1.5 UoS/day. But there are other processes that produce only 0.5 UoS/d because they are bottlenecked on writing speed. A 100x speedup would lead to 10 UoS/day. The problem with all of this that 100x speedup isn't realistic, and it's not obvious whether a 2x improvement would lead to better processes. Then again, one of the first custom vim function scripts I wrote was an aid to writing unit tests in a particular codebase, and it lead to me writing a lot more tests. So maybe even a 2x speedup is going to be speed things up, too. Patreon Stuff I wrote a couple of TLA+ specs to show how to model fork-join algorithms. I'm planning on eventually writing them up for my blog/learntla but it'll be a while, so if you want to see them in the meantime I put them up on Patreon.

2 days ago • 6 votes

Occupation and Preoccupation

Here’s Jony Ive in his Stripe interview: What we make stands testament to who we are. What we make describes our values. It describes our preoccupations. It describes beautiful succinctly our preoccupation. I’d never really noticed the connection between these two words: occupation and preoccupation. What comes before occupation? Pre-occupation. What comes before what you do for a living? What you think about. What you’re preoccupied with. What you think about will drive you towards what you work on. So when you’re asking yourself, “What comes next? What should I work on?” Another way of asking that question is, “What occupies my thinking right now?” And if what you’re occupied with doesn’t align with what you’re preoccupied with, perhaps it's time for a change. Email · Mastodon · Bluesky

2 days ago • 3 votes

American hype

There's no country on earth that does hype better than America. It's one of the most appealing aspects about being here. People are genuinely excited about the future and never stop searching for better ways to work, live, entertain, and profit. There's a unique critical mass in the US accelerating and celebrating tomorrow. The contrast to Europe couldn't be greater. Most Europeans are allergic to anything that even smells like a commercial promise of a better tomorrow. "Hype" is universally used as a term to ridicule anyone who dares to be excited about something new, something different. Only a fool would believe that real progress is possible! This is cultural bedrock. The fault lines have been settling for generations. It'll take an earthquake to move them. You see this in AI, you saw it in the Internet. Europeans are just as smart, just as inventive as their American brethren, but they don't do hype, so they're rarely the ones able to sell the sizzle that public opinion requires to shift its vision for tomorrow. To say I have a complicated relationship with venture capital is putting it mildly. I've spent a career proving the counter narrative. Proving that you can build and bootstrap an incredible business without investor money, still leave a dent in the universe, while enjoying the spoils of capitalism. And yet... I must admit that the excesses of venture capital are integral to this uniquely American advantage on hype. The lavish overspending during the dot-com boom led directly to a spectacular bust, but it also built the foundation of the internet we all enjoy today. Pets.com and Webvan flamed out such that Amazon and Shopify could transform ecommerce out of the ashes. We're in the thick of peak hype on AI right now. Fantastical sums are chasing AGI along with every dumb derivative mirage along the way. The most outrageous claims are being put forth on the daily. It's easy to look at that spectacle with European eyes and roll them. Some of it is pretty cringe! But I think that would be a mistake. You don't have to throw away your critical reasoning to accept that in the face of unknown potential, optimism beats pessimism. We all have to believe in something, and you're much better off believing that things can get better than not. Americans fundamentally believe this. They believe the hype, so they make it come to fruition. Not every time, not all of them, but more of them, more of the time than any other country in the world. That really is exceptional.

2 days ago • 3 votes

File sync is very slow

I’m working on a Go library appendstore for append-only store of lots of things in a single file. To make things as robust as possible I was calling os.File.Sync() after each append. Sync() is waiting until the data is acknowledged as truly, really written to disk (as opposed to maybe floating somewhere in disk drive’s write buffer). Oh boy, is it slow. A test of appending 1000 records would take over 5 seconds. After removing the Sync() it would drop to 5 milliseconds. 1000x faster. I made sync optional - it’s now up to the user of the library to pick it, defaults to non-sync. Is it unsafe now? Well, the reality is that it probably doesn’t matter. I don’t think lots of software does the sync due to slowness and the world still runs.

2 days ago • 2 votes

New here?