The first conformant M1 GPU driver

from On Life and Lisp in technology

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux! For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers. Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry. To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the...

a year ago


More from On Life and Lisp

Vulkan 1.4 on Asahi Linux

Today, the Khronos Group released the 1.4 specification of Vulkan, the standard graphics API. The Asahi Linux project is proud to announce the first Vulkan 1.4 driver for Apple hardware. Our Honeykrisp driver has been Khronos-recognized as conformant to the new version since day one. That driver is already available in our official repositories. After installing Fedora Asahi Remix, run dnf upgrade --refresh to get the latest drivers. Vulkan 1.4 standardizes several important features, including timestamps and dynamic rendering local read. 
The industry expects that these features will become more common, and we are prepared. Releasing a conformant driver reflects our commitment to graphics standards and software freedom. Asahi Linux is also compatible with OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the relevant specifications. For that matter, ours are the only conformant drivers on Apple hardware for any graphics standard. Although the driver is released, you still need to build an experimental version of Vulkan-Loader to access the new Vulkan version. Nevertheless, you can immediately use all the new features as extensions in our Vulkan 1.3 driver. For more information, see the Khronos blog post.

4 months ago • 68 votes

AAA gaming on Asahi Linux

Gaming on Linux on M1 is here! We’re thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0.

Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today’s release is an alpha, Control runs well!

Installation

First, install Fedora Asahi Remix. Once installed, get the latest drivers with dnf upgrade --refresh && reboot. Then just dnf install steam and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead.

The stack

Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference:

FEX emulates x86 on Arm.
Wine translates Windows to Linux.
DXVK and vkd3d-proton translate DirectX to Vulkan.

There’s one curveball: page size. Operating systems allocate memory in fixed size “pages”. If an application expects smaller pages than the system uses, it will break due to insufficient alignment of allocations. That’s a problem: x86 expects 4K pages but Apple systems use 16K pages.

While Linux can’t mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you’re happy because you can play Fallout 4.

Vulkan

The final piece is an adult-level Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK support. Let’s look at some new features.

Tessellation

Tessellation enables games like The Witcher 3 to generate geometry. The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today’s talk at XDC2024.

Geometry shaders

Geometry shaders are an older, cruder method to generate geometry. As with tessellation, the M1 lacks geometry shader hardware, so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don’t need to be fast – just fast enough for games like Ghostrunner.

Enhanced robustness

“Robustness” permits an application’s shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1.

Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires VK_EXT_robustness2, a Vulkan extension strengthening robustness.

Like before, we implement robustness with compare-and-select instructions. A naïve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load.

    load R, buffer, index * 16
    ulesel R[0], index, size, R[0], 0
    ulesel R[1], index, size, R[1], 0
    ulesel R[2], index, size, R[2], 0
    ulesel R[3], index, size, R[3], 0

There’s a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects.

    ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo
    ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi
    load R, buffer, index * 16

Two instructions, not four.

Next steps

Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don’t require sparse, like Cyberpunk 2077.

While many games are playable, newer AAA titles don’t hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed.

Beyond gaming, we’re adding general purpose x86 emulation based on this stack. For more information, see the FAQ.

Today’s alpha is a taste of what’s to come. Not the final form, but enough to enjoy Portal 2 while we work towards “1.0”.

Acknowledgements

This work has been years in the making with major contributions from…

Alyssa Rosenzweig
Asahi Lina
chaos_princess
Davide Cavalca
Dougall Johnson
Ella Stanforth
Faith Ekstrand
Janne Grunau
Karol Herbst
marcan
Mary Guillemard
Neal Gompa
Sergio López
TellowKrinkle
Teoh Han Hui
Rob Clark
Ryan Houdek

… Plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the magic of open source.

We hope you enjoy the magic.

Happy gaming.
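The address-swap trick from the robustness section can be modelled on the CPU. This is only a sketch of the arithmetic, not the driver's code: `ulesel` here is a plain C stand-in for the GPU instruction, the bound is assumed to be the last valid index, and `robust_load_address` and the reserved address are hypothetical names chosen for illustration.

```c
#include <stdint.h>

/* Stand-in for the GPU's compare-and-select: x if a <= b (unsigned), else y. */
uint32_t ulesel(uint32_t a, uint32_t b, uint32_t x, uint32_t y)
{
    return a <= b ? x : y;
}

/*
 * Compute the byte address a robust 16-byte load would fetch from.
 * In bounds: the application's buffer. Out of bounds: the reserved
 * region of zeroes, with the index preserved. Only the two 32-bit
 * halves of the base address are selected, never the loaded data.
 */
uint64_t robust_load_address(uint64_t buffer, uint64_t reserved,
                             uint32_t index, uint32_t last)
{
    uint32_t lo = ulesel(index, last, (uint32_t)buffer, (uint32_t)reserved);
    uint32_t hi = ulesel(index, last, (uint32_t)(buffer >> 32),
                         (uint32_t)(reserved >> 32));
    uint64_t base = ((uint64_t)hi << 32) | lo;

    /* Any 32-bit index times 16 is below 2^36 bytes, i.e. 64 GiB. */
    return base + (uint64_t)index * 16;
}
```

Because the largest 32-bit index times 16 stays below 2^36, every out-of-bounds address still falls inside a 64 GiB reserved region, which is why the index does not need clamping.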

6 months ago • 71 votes

Vulkan 1.3 on the M1 in 1 month

Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without “portability” waivers.

Honeykrisp is not yet released for end users. We’re continuing to add features, improve performance, and port to more hardware. Source code is available for developers.

HoloCure running on Honeykrisp ft. DXVK, FEX, and Proton.

Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand’s open source NVK driver for NVIDIA GPUs. In her words:

“All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I’m building NVK with all the best practices we’ve developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized.”

Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA’s desktop architecture differs from the M1’s mobile roots. In exchange, we get a modern driver designed for desktop games.

We’ll need to pass a half-million tests ensuring correctness, submit the results, and then we’ll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver… can we write a driver passing the Vulkan 1.3 conformance test suite faster than the 30 day review period?

It’s unprecedented…

Challenge accepted.

April 2

It begins with a text: “Faith… I think I want to write a Vulkan driver.” Her advice? “Just start typing.”

There’s no copy-pasting yet – we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting “NVK” to Asahi Lina’s kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay.
April 3

To access resources, GPUs use “descriptors” containing the address, format, and size of a resource. Vulkan bundles descriptors into “sets” per the application’s “descriptor set layout”. When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware’s data structures. As our descriptors differ from NVIDIA’s, our next task is adapting NVK’s descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add.

April 4

With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 “control streams”, then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works.

That’s enough to move on to “copies” of buffers and images. We implement Vulkan’s copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes.

April 5

Fleshing out yesterday’s code, all copy tests pass.

April 6

We’re ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That’s straightforward, but there’s a lot of state to handle. Faith’s code collects all “dynamic state” into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on.

April 7

What makes state “dynamic”? Dynamic state can change without recompiling shaders. By contrast, static state is baked into shader binaries called “pipelines”.

If games create all their pipelines during a loading screen, there is no compiler “stutter” during gameplay. The idea hasn’t quite panned out: many game developers don’t know their state ahead-of-time so cannot create pipelines early. 
In response, Vulkan has made ever more state dynamic, punctuated with the EXT_shader_object extension that makes pipelines optional.

We want full dynamic state and shader objects. Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1’s designers bet on pipelines.

Faced with this hardware, a reasonable driver developer would double-down on pipelines. DXVK would stutter, but we’d pass conformance.

I am not reasonable.

To eliminate stuttering in OpenGL, we make state dynamic with four strategies:

Conditional code.
Precompiled variants.
Indirection.
Prologs and epilogs.

Wait, what-a-logs?

AMD also bakes state into shaders… with a twist. They divide the hardware binary into three parts: a prolog, the shader, and an epilog. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that’s fast and doesn’t stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too.

For Honeykrisp, let’s follow NVK’s lead and treat all state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs.

Putting it together, we get a (dynamic) triangle.

April 8

Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours.

    /* Translate an American VkBorderColor into a Canadian agx_border_colour */
    enum agx_border_colour
    translate_border_color(VkBorderColor color)
    {
       switch (color) {
       case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK:
          return AGX_BORDER_COLOUR_TRANSPARENT_BLACK;
       ...
       }
    }

Test results are getting there.

    Pass: 149770, Fail: 7741, Crash: 2396

That’s good enough for vkQuake.
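To make the “quick concatenation” of shader parts from April 7 concrete, here is a minimal sketch. It assumes a compiled part is nothing more than a byte string; `struct shader_part` and `link_shader` are hypothetical names for illustration, not Mesa or Honeykrisp APIs, and a real driver would also cache the prologs and epilogs it compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one compiled piece of shader machine code. */
struct shader_part {
    const uint8_t *code;
    size_t size;
};

/*
 * Link prolog + main shader + epilog by plain concatenation into `out`.
 * Returns the linked size in bytes, or 0 if `out` (capacity `cap`)
 * is too small to hold all three parts.
 */
size_t link_shader(const struct shader_part *prolog,
                   const struct shader_part *body,
                   const struct shader_part *epilog,
                   uint8_t *out, size_t cap)
{
    size_t total = prolog->size + body->size + epilog->size;
    if (total > cap)
        return 0;

    memcpy(out, prolog->code, prolog->size);
    memcpy(out + prolog->size, body->code, body->size);
    memcpy(out + prolog->size + body->size, epilog->code, epilog->size);
    return total;
}
```

The main shader is compiled once. Only the small prolog and epilog vary with dynamic state, so a state change costs a cheap relink rather than a full recompile.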
April 9

Lots of little fixes bring us to a 99.6% pass rate… for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let’s claim 1.3 and skip to the finish line.

    Pass: 255209, Fail: 3818, Crash: 599

98.3% pass rate for 1.3 on our 1 week anniversary.

Not bad.

April 10

SuperTuxKart has a Vulkan renderer.

April 11

Zink works too.

April 12

I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it’s resolved within a few weeks.

April 16

The tests for “descriptor indexing” revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1’s shuffle instruction is quirky, but it’s easy to work around. Fixing that fixes the descriptor indexing tests.

April 17

A few tests crash inside our register allocator. Their shaders contain a peculiar construction:

    if (condition) {
       while (true) { }
    }

condition is always false, but the compiler doesn’t know that.

Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. “All loops contain a break” seems obvious for a shader, but it’s false. It’s straightforward to fix register allocation, but what a doozy.

April 18

Remember copies? They’re slow, and every frame currently requires a copy to get on screen.

For “zero copy” rendering, we need enough Linux window system integration to negotiate an efficient surface layout across process boundaries. Linux uses “modifiers” for this purpose, so we implement the EXT_image_drm_format_modifier extension. And by implement, I mean copy.

Copies to avoid copies.

April 20

“I’d like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac.”

…

“Ma’am, this is a Wendy’s.”

April 22

As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don’t pre-pack control words during pipeline creation. That adds theoretical CPU overhead.

Is that a problem? 
After some optimization, vkoverhead says we’re pushing 100 million draws per second.

I think we’re okay.

April 24

Time to light up YCbCr. If we don’t use special YCbCr hardware, this feature is “software-only”. However, it touches a lot of code.

It touches so much code that Mohamed Ahmed spent an entire summer adding it to NVK.

Which means he spent a summer adding it to Honeykrisp.

Thanks, Mohamed ;-)

April 25

Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque “query pool”. The result can be copied from the query pool on the CPU or GPU.

For the CPU, the driver maps the pool’s internal data structure and copies the result. This may require nontrivial repacking.

For the GPU, we need to repack in a compute shader. That’s harder, because we can’t just run C code on the GPU, right?

…Actually, we can.

A little witchcraft makes GPU query copies as easy as C.

    void copy_query(struct params *p, int i) {
      uintptr_t dst = p->dest + i * p->stride;
      int query = p->first + i;

      if (p->available[query] || p->partial) {
        int q = p->index[query];
        write_result(dst, p->_64, p->results[q]);
      }

      ...
    }

April 26

The final boss: border colours, hard mode.

Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours:

(0, 0, 0, 0) – transparent black
(0, 0, 0, 1) – opaque black
(1, 1, 1, 1) – opaque white

We handled these on April 8. Unfortunately, there are two problems.

First, we need custom border colours for Direct3D compatibility. Both DXVK and vkd3d-proton require the EXT_custom_border_color extension.

Second, there’s a subtle problem with our hardware, causing dozens of fails even without custom border colours. To understand the issue, let’s revisit texture descriptors, which contain a pixel format and a component reordering swizzle.

Some formats are implicitly reordered. Common “BGRA” formats swap red and blue for historical reasons. 
The M1 does not directly support these formats. Instead, the driver composes the swizzle with the format’s reordering. If the application uses a BARB swizzle with a BGRA format, the driver uses an RABR swizzle with an RGBA format.

There’s a catch: swizzles apply to the border colour, but formats do not. We need to undo the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn’t give us that information.

Without custom border colour support, we “should” be okay. Swapping red and blue doesn’t change anything if the colour is white or black.

There’s an even subtler catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format… but with reversed “endianness”, swapping red and alpha.

That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn’t change the result.

The problem is opaque black: (0, 0, 0, 1). Swapping red and alpha gives (1, 0, 0, 0). Transparent red? Uh-oh.

We’re stuck. No known hardware configuration implements correct Vulkan semantics.

Is hope lost?

Do we give up?

A reasonable person would.

I am not reasonable.

Let’s jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1’s custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support.

As you know, I am not reasonable.

Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we’ll inject code to fix up the border colour. This emulation is simple, correct, and slow. We’ll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests.
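The swizzle composition and the border colour fix-up can be checked with a small CPU model. This is an illustrative sketch only: swizzles are written as four-character strings, and every name below (`bgra_reorder`, `compose_bgra_swizzle`, `undo_bgra`, `apply_swizzle`) is hypothetical rather than taken from the driver.

```c
#include <stddef.h>

struct colour { float r, g, b, a; };

/* BGRA formats swap red and blue relative to RGBA. */
char bgra_reorder(char channel)
{
    switch (channel) {
    case 'R': return 'B';
    case 'B': return 'R';
    default:  return channel; /* G, A and the constants 0/1 are unchanged */
    }
}

/* Compose an application swizzle on a BGRA format into one on RGBA. */
void compose_bgra_swizzle(const char app[4], char out[5])
{
    for (size_t i = 0; i < 4; i++)
        out[i] = bgra_reorder(app[i]);
    out[4] = '\0';
}

/* Undo the format reordering on a border colour before programming it. */
struct colour undo_bgra(struct colour c)
{
    struct colour programmed = { c.b, c.g, c.r, c.a };
    return programmed;
}

float select_channel(struct colour c, char channel)
{
    switch (channel) {
    case 'R': return c.r;
    case 'G': return c.g;
    case 'B': return c.b;
    case 'A': return c.a;
    case '1': return 1.0f;
    default:  return 0.0f;
    }
}

/* Model of the hardware applying a swizzle to a colour. */
struct colour apply_swizzle(struct colour c, const char sw[4])
{
    struct colour out = {
        select_channel(c, sw[0]), select_channel(c, sw[1]),
        select_channel(c, sw[2]), select_channel(c, sw[3]),
    };
    return out;
}
```

Composing "BARB" yields "RABR", matching the example in the text. Applying the composed swizzle to the pre-swapped border colour gives the same result as applying the application's swizzle to the original colour, and that pre-swap is exactly the step that needs the texture format at sampler creation.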
April 27

All that’s left is some last minute bug fixing, and…

    Pass: 686930, Fail: 0

Success.

The future

The next task is implementing everything that DXVK and vkd3d-proton require to layer Direct3D. That includes esoteric extensions like transform feedback. Then Wine and an open source x86 emulator will run Windows games on Asahi Linux.

That’s getting ahead of ourselves. In the meantime, enjoy Linux games with our conformant OpenGL 4.6 drivers… and stay tuned.

Baby Storm running on Honeykrisp ft. DXVK, FEX, and Proton.

10 months ago • 96 votes

Conformant OpenGL 4.6 on the M1

For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers. Already installed? Just dnf upgrade --refresh.

Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender, Ryujinx, and Citra.

Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2.

While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan.

Compared to 4.1, OpenGL 4.6 adds dozens of required features, including:

Robustness
SPIR-V
Clip control
Cull distance
Compute shaders
Upgraded transform feedback

Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set.

How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on.

For a taste of the challenges we overcame, let’s look at robustness.
Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance.

For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”.

“Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface.

All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility.

Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled:

Zero (Direct3D, Vulkan with robustBufferAccess2)
Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess)
Arbitrary values, but can’t crash (OpenGL ES)

OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result.
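The clamping idea is easy to express on the CPU. The sketch below is an illustration rather than driver code: `robust_load_i32` is a hypothetical helper, and it assumes the buffer holds at least one element so that a last valid index exists.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned minimum, the CPU analogue of a umin instruction. */
uint32_t umin32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/*
 * Load element `index` from `buffer` (holding `count` elements).
 * Out-of-bounds indices are clamped to the last element, so the load
 * always returns some data from the buffer and never reads outside it.
 */
int32_t robust_load_i32(const int32_t *buffer, uint32_t count, uint32_t index)
{
    assert(count > 0); /* assumption: an empty buffer is handled elsewhere */

    uint32_t last = count - 1;
    uint32_t idx = umin32(index, last);
    return buffer[idx];
}
```

Because the comparison is unsigned, a “negative” index that wrapped around to a huge value clamps to the last element as well.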
As an example, a uniform buffer load without robustness might look like:

    load.i32 result, buffer, index

Robustness adds a single unsigned minimum (umin) instruction:

    umin idx, index, last
    load.i32 result, buffer, idx

Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency.

There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up.

We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot.

Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address:

    Base + Offset + Stride × Vertex

Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads.

One instruction set feature helps. 
In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:

    imad idx, stride/4, vertex, offset/4
    load.i32 result, base, idx

The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:

    load.v4i32 result, base, vertex << 2

…with the hardware calculating the address:

    Base + 4 × (Vertex << 2) = Base + 16 × Vertex

What about robustness?

We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”.

Let’s handle the latter problem. We can rewrite the addressing equation as:

    (Base + Offset) + Stride × Vertex

That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:

    umin idx, vertex, last valid
    load.v4i32 result, base, idx << 2

We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. 
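As a rough CPU-side illustration of that per-attribute bound, the sketch below assumes a load touches bytes `offset + stride × vertex` up to that plus the format size, all of which must fit within the buffer size. The struct and function names are hypothetical, and the stride-zero case is an extra assumption not discussed in the text.

```c
#include <stdbool.h>
#include <stdint.h>

struct attrib_bound {
    bool valid;           /* false: no vertex can be loaded at all */
    uint32_t last_vertex; /* last vertex index whose load stays in bounds */
};

/*
 * Compute the last valid vertex index for one attribute, given the
 * buffer size, attribute offset, stride and format size in bytes.
 */
struct attrib_bound last_valid_vertex(uint32_t size, uint32_t offset,
                                      uint32_t stride, uint32_t format_size)
{
    struct attrib_bound bound = { false, 0 };

    /* End of the load for vertex 0; 64-bit to avoid overflow. */
    uint64_t first_end = (uint64_t)offset + format_size;
    if (first_end > size)
        return bound; /* buffer too small: the attribute is entirely invalid */

    bound.valid = true;
    if (stride == 0) {
        /* Assumption: with zero stride every vertex reads the same bytes. */
        bound.last_vertex = UINT32_MAX;
        return bound;
    }

    /* floor((size - offset - format_size) / stride) */
    bound.last_vertex = (uint32_t)((size - first_end) / stride);
    return bound;
}
```

When `valid` is false, clamping alone cannot help. That is the case where the application's buffer has to be replaced by a small buffer of zeroes.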
Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base:

    Offset + Stride × Vertex + Format size

The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as:

    Vertex ≤ ⌊(Size − Offset − Format size) / Stride⌋

The driver calculates the right-hand side and passes it into the shader.

One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes.

Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction.

In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes.

…But it would be no fun if our hardware was reasonable.

Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”.

For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality.

With robustness, the specifications all agree that image loads return…

Zero if the X- or Y-coordinate is out-of-bounds
Zero if the level is out-of-bounds

Meanwhile, image loads on the M1 GPU return…

Zero if the X- or Y-coordinate is out-of-bounds
Values from the last level if the level is out-of-bounds

Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why. 
The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance.

The obvious workaround is to never load from an invalid level:

    if (level <= levels) {
        return imageLoad(x, y, level);
    } else {
        return 0;
    }

That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching:

    vec4 data = imageLoad(x, y, level);

    return (level <= levels) ? data : 0;

This workaround is okay, but it could be improved.

While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:

    image_load R, x, y, level
    ulesel R[0], level, levels, R[0], 0
    ulesel R[1], level, levels, R[1], 0
    ulesel R[2], level, levels, R[2], 0
    ulesel R[3], level, levels, R[3], 0

Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:

    bool valid = (level <= levels);
    int x_ = valid ? x : 20000;

    return imageLoad(x_, y, level);

Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:

    ulesel x_, level, levels, x, #20000
    image_load R, x_, y, level

If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance.

Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.
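The out-of-bounds level workaround can be exercised against a toy model of the hardware. Everything here is an assumption made for illustration: `hw_image_load` merely mimics the two behaviours described above (clamp the level, return zero for out-of-bounds coordinates) and returns placeholder data, while `struct image`, `last_level`, and `robust_image_load` are hypothetical names.

```c
#include <stdint.h>

struct texel { float c[4]; };

struct image {
    uint32_t width, height; /* size of the base level in pixels */
    uint32_t last_level;    /* index of the smallest mip level */
};

/* Size of one dimension at a mip level, never smaller than 1. */
uint32_t mip_size(uint32_t base, uint32_t level)
{
    uint32_t s = base >> level;
    return s ? s : 1;
}

/* Toy model of the hardware: the level is clamped, X/Y are bounds-checked. */
struct texel hw_image_load(const struct image *img,
                           uint32_t x, uint32_t y, uint32_t level)
{
    struct texel zero = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    struct texel data = { { 1.0f, 1.0f, 1.0f, 1.0f } }; /* placeholder texel */

    if (level > img->last_level)
        level = img->last_level; /* the quirk: clamp instead of zero */

    if (x >= mip_size(img->width, level) || y >= mip_size(img->height, level))
        return zero;

    return data;
}

/* Workaround: push X out-of-bounds when the level is invalid. */
struct texel robust_image_load(const struct image *img,
                               uint32_t x, uint32_t y, uint32_t level)
{
    uint32_t x_ = (level <= img->last_level) ? x : 20000;
    return hw_image_load(img, x_, y, level);
}
```

Only the scalar X coordinate is selected, so the four-component result never needs per-component fix-ups.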

a year ago • 53 votes

More in technology

Greatest Hits

I’ve been blogging now for approximately 8,465 days since my first post on Movable Type. My colleague Dan Luu helped me compile some of the “greatest hits” from the archives of ma.tt; perhaps some posts will stir some memories for you as well: Where Did WordCamps Come From? (2023) A look back at how Foo …

21 hours ago • 2 votes

Let's give PRO/VENIX a barely adequate, pre-C89 TCP/IP stack (featuring Slirp-CK)

TCP/IP Illustrated (what would now be called the first edition prior to the 2011 update) for a hundred-odd bucks on sale which has now sat on my bookshelf, encased in its original shrinkwrap, for at least twenty years. It would be fun to put up the 4.4BSD data structures poster it came with but that would require opening it. Fortunately, today we have AI we have many more excellent and comprehensive documents on the subject, and more importantly, we've recently brought back up an oddball platform that doesn't have networking either: our DEC Professional 380 running the System V-based PRO/VENIX V2.0, which you met a couple articles back. The DEC Professionals are a notoriously incompatible member of the PDP-11 family and, short of DECnet (DECNA) support in its unique Professional Operating System, there's officially no other way you can get one on a network — let alone the modern Internet. Are we going to let that stop us? Crypto Ancienne proxy for TLS 1.3. And, as we'll discuss, if you can get this thing on the network, you can get almost anything on the network! Easily portable and painfully verbose source code is included. Recall from our lengthy history of DEC's early misadventures with personal computers that, in Digital's ill-advised plan to avoid the DEC Pros cannibalizing low-end sales from their categorical PDP-11 minicomputers, Digital's Small Systems Group deliberately made the DEC Professional series nearly totally incompatible despite the fact they used the same CPUs. In their initial roll-out strategy in 1982, the Pros (as well as their sibling systems, the Rainbow and the DECmate II) were only supposed to be mere desktop office computers — the fact the Pros were PDP-11s internally was mostly treated as an implementation detail. The idea backfired spectacularly against the IBM PC when the Pros and their promised office software failed to arrive on time and in 1984 DEC retooled around a new concept of explicitly selling the Pros as desktop PDP-11s. 
This required porting operating systems that PDP-11 minis typically ran: RSX-11M Plus was already there as the low-level layer of the Professional Operating System (P/OS), and DEC internally ported RT-11 (as PRO/RT-11) and COS. PDP-11s were also famous for running Unix and so DEC needed a Unix for the Pro as well, though eventually only one official option was ever available: a port of VenturCom's Venix based on V7 Unix and later System V Release 2.0 called PRO/VENIX. After the last article, I had the distinct pleasure of being contacted by Paul Kleppner, the company's first paid employee in 1981, who was part of the group at VenturCom that did the Pro port and stayed at the company until 1988. Venix was originally developed from V6 Unix on the PDP-11/23 incorporating Myron Zimmerman's real-time extensions to the kernel (such as semaphores and asynchronous I/O), then a postdoc in physics at MIT; Kleppner's father was the professor of the lab Zimmerman worked in. Zimmerman founded VenturCom in 1981 to capitalize on the emerging Unix market, becoming one of the earliest commercial Unix licensees. Venix-11 was subsequently based on the later V7 Unix, as was Venix/86, which was the first Unix on the IBM PC in January 1983 and was ported to the DEC Rainbow as Venix/86R. In addition to its real-time extensions and enhanced segmentation capability, critical for memory management in smaller 16-bit address spaces, it also included a full desktop graphics package. Notably, DEC themselves were also a Unix licensee through their Unix Engineering Group and already had an enhanced V7 Unix of their own running on the PDP-11, branded initially as V7M. Subsequently the UEG developed a port of 4.2BSD with some System V components for the VAX and planned to release it as Ultrix-32, simultaneously retconning V7M as Ultrix-11 even though it had little in common with the VAX release. 
Paul recalls that DEC did attempt a port of Ultrix-11 to the Pro 350 themselves but ran into intractable performance problems. By then the clock was ticking on the Pro relaunch and the issues with Ultrix-11 likely prompted DEC to look for alternatives. Crucially, Zimmerman had managed to upgrade Venix-11's kernel while still keeping it small, a vital aspect on his 11/23 which lacked split instruction and data addressing and would have had to page in and out a larger kernel otherwise. Moreover, the 11/23 used an F-11 CPU — the same CPU as the original Professional 350 and 325. DEC quickly commissioned VenturCom to port their own system over to the Pro, which Paul says was a real win for VenturCom, and the first release came out in July 1984 complete with its real-time features intact and graphics support for the Pro's bitmapped screen. It was upgraded ("PRO/VENIX Rev 2.0") in October 1984, adding support for the new top-of-the-line DEC Professional 380, and then switched to System V (SVR2) in July 1985 with PRO/VENIX V2.0. (For its part Ultrix-11 was released as such in 1984 as well, but never for the Pro series.) Keep that kernel version history in mind for when we get to oddiments of the C compiler. As for networking, though, with the exception of UUCP over serial, none of these early versions of Venix on either the PDP-11 or 8086 supported any kind of network connectivity out of the box — officially the only Pro operating system to support its Ethernet upgrade option was P/OS 2.0. Although all Pros have a 15-pin AUI network port, it isn't activated until an Ethernet CTI card is installed. (While Stan P. found mention of a third-party networking product called Fusion by Network Research Corporation which could run on PRO/VENIX, Paul's recollection is that this package ran into technical problems with kernel size during development. No examples of the PRO/VENIX version have so far been located and it may never have actually been released. 
You'll hear about it if a copy is found. The unofficial Pro 2.9BSD port also supports the network card, but that was always an under-the-table thing.) Since we run Venix on our Pro, that means currently our only realistic option to get this on the 'Nets is also over a serial port. lower speed port for our serial IP implementation. PRO/VENIX supports using only the RS-423 port as a remote terminal, and because it's twice as fast, it's more convenient for logins and file exchange over Kermit (which also has no TCP/IP overhead). Using the printer port also provides us with a nice challenge: if our stack works acceptably well at 4800bps, it should do even better at higher speeds if we port it elsewhere. On the Pro, we connect to our upstream host using a BCC05 cable (in the middle of this photograph), which terminates in a regular 25-pin RS-232 on the other end. Now for the software part. There are other small TCP/IP stacks, notably things like Adam Dunkels's lwIP and so on. But even SVR2 Venix is by present standards an old Unix with a much less extensive libc and more primitive C compiler — in a short while you'll see just how primitive — and relatively modern code like lwIP's would require a lot of porting. Ideally we'd like a very minimal, indeed barely adequate, stack that can do simple tasks and can be expressed in a fashion acceptable to a now antiquated compiler. Once we've written it, it would be nice if it were also easily portable to other very limited systems, even by directly translating it to assembly language if necessary. What we want this barebones stack to accomplish will inform its design: and the hardware 24-7 to make such a use case meaningful. The Ethernet option was reportedly competent at server tasks, but Ethernet has more bandwidth, and that card also has additional on-board hardware. 
Let's face the cold reality: as a server, we'd find interacting with it over the serial port unsatisfactory at best and we'd use up a lot of power and MTBF keeping it on more than we'd like to. Therefore, we really should optimize for the client case, which means we also only need to run the client when we're performing a network task. no remote login capacity, like, I dunno, a C64, the person on the console gets it all. Therefore, we really should optimize for the single user case, which means we can simplify our code substantially by merely dealing with sockets sequentially, one at a time, without having to worry about routing packets we get on the serial port to other tasks or multiplexing them. Doing so would require extra work for dual-socket protocols like FTP, but we're already going to use directly-attached Kermit for that, and if we really want file transfer over TCP/IP there are other choices. (On a larger antique system with multiple serial ports, we could consider a setup where each user uses a separate outgoing serial port as their own link, which would also work under this scheme.) Some of you may find this conflicts hard with your notion of what a "stack" should provide, but I also argue that the breadth of a full-service driver would be wasted on a limited configuration like this and be unnecessarily more complex to write and test. Worse, in many cases, is better, and I assert this particular case is one of them. Keeping the above in mind, what are appropriate client tasks for a microcomputer from 1984, now over 40 years old — even a fairly powerful one by the standards of the time — to do over a slow TCP/IP link? Crypto Ancienne's carl can serve as an HTTP-to-HTTPS proxy to handle the TLS part, if necessary.) We could use protocols like these to download and/or view files from systems that aren't directly connected, or to send and receive status information. 
One task that is also likely common is an interactive terminal connection (e.g., Telnet, rlogin) to another host. However, as a client this particular deployment is still likely to hit the same sorts of latency problems for the same reasons we would experience connecting to it as a server. These other tasks here are not highly sensitive to latency, require only a single "connection" and no multiplexing, and are simple protocols which are easy to implement. Let's call this feature set our minimum viable product. Because we're writing only for a couple of specific use cases, and to make them even more explicit and easy to translate, we're going to take the unusual approach of having each of these clients handle their own raw packets in a bytewise manner. For the actual serial link we're going to go even more barebones and use old-school RFC 1055 SLIP instead of PPP (uncompressed, too, not even Van Jacobson CSLIP). This is trivial to debug and straightforward to write, and if we do so in a relatively encapsulated fashion, we could consider swapping in CSLIP or PPP later on. A couple of utility functions will do the IP checksum algorithm and reading and writing the serial port, and DNS and some aspects of TCP also get their own utility subroutines, but otherwise all of the programs we will create will read and write their own network datagrams, using the SLIP code to send and receive over the wire. The C we will write will also be intentionally very constrained, using bytewise operations assuming nothing about endianness and using as little of the C standard library as possible. For types, you only need some sort of 32-bit long, which need not be native, an int of at least 16 bits, and a char type — which can be signed, and in fact has to be to run on earlier Venices (read on). 
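As a rough illustration of the bytewise, endian-neutral style described above, here is a minimal sketch of the standard Internet checksum (RFC 1071) that an IP checksum utility function would compute. It is not BASS's actual code: the name `ipcksum` is hypothetical, and it is written in C89 for a modern compiler. In particular, `unsigned long` is used for a well-defined 32-bit accumulator, which is an assumption that would need revisiting on the older Venix compilers.

```c
#include <assert.h>

/*
 * RFC 1071 Internet checksum over len bytes.
 * Bytes are combined into 16-bit big-endian words by hand, so the host's
 * endianness never matters, and each (possibly signed) char is masked
 * with 0xff before use.
 */
unsigned int ipcksum(const char *buf, int len)
{
    unsigned long sum = 0;
    int i;

    assert(buf != 0 && len >= 0);
    for (i = 0; i + 1 < len; i += 2)
        sum += ((unsigned long)(buf[i] & 0xff) << 8)
             | (unsigned long)(buf[i + 1] & 0xff);
    if (i < len)                          /* odd trailing byte, zero padded */
        sum += (unsigned long)(buf[i] & 0xff) << 8;
    while (sum >> 16)                     /* fold carries back into 16 bits */
        sum = (sum & 0xffffUL) + (sum >> 16);
    return (unsigned int)(~sum & 0xffffUL);
}
```

To fill in a header, the checksum field is zeroed, the function is run over the header, and the result is stored high byte first. Running it again over a header that already holds a correct checksum yields zero, which makes a convenient receive-side check.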
You can run the entirety of the code with just malloc/free, read/write/open/close, strlen/strcat, sleep, rand/srand and time for the srand seed (and fprintf for printing debugging information, if desired). On a system with little or no operating system support, almost all of these primitive library functions are easy to write or simulate, and we won't even assume we're capable of non-blocking reads despite the fact Venix can do so. After all, from that which little is demanded, even less is expected. slattach which effectively makes a serial port directly into a network interface. Such an arrangement would be the most flexible approach from the user's perspective because you necessarily have a fixed, bindable external address, but obviously such a scheme didn't scale over time. With the proliferation of dialup Unix shell accounts in the late 1980s and early 1990s, closed-source tools like 1993's The Internet Adapter ("TIA") could provide the SLIP and later PPP link just by running them from a shell prompt. Because they synthesize artificial local IP addresses, sort of NAT before the concept explicitly existed, the architecture of such tools prevented directly creating listening sockets — though for some situations this could be considered more of a feature than a bug. Any needed external ports could be proxied by the software anyway and later network clients tended not to require it, so for most tasks it was more than sufficient. Closed-source and proprietary SLIP/PPP-over-shell solutions like TIA were eventually displaced by open source alternatives, most notably SLiRP. SLiRP (hereafter Slirp so I don't gouge my eyes out) emerged in 1995 and used a similar architecture to TIA, handing out virtual addresses on a synthetic network and bridging that network to the Internet through the host system. It rapidly became the SLIP/PPP shell solution of choice, leading to its outright ban by some shell ISPs who claimed it violated their terms of service. 
As direct SLIP/PPP dialup became more common than shell accounts, during which time yours truly upgraded to a 56K Mac modem I still have around here somewhere, Slirp eventually became most useful for connecting small devices via their serial ports (PDAs and mobile phones especially, but really anything — subsets of Slirp are still used in emulators today like QEMU for a similar purpose) to a LAN. By a shocking and completely contrived coincidence, that's exactly what we'll be doing! Slirp has not been officially maintained since 2006. There is no package in Fedora, which is my usual desktop Linux, and the one in Debian reportedly has issues. A stack of patch sets circulated thereafter, but the planned 1.1 release never happened and other crippling bugs remain, some of which were addressed in other patches that don't seem to have made it into any release, source or otherwise. If you tried to build Slirp from source on a modern system and it just immediately exits, you got bit. I have incorporated those patches and a couple of my own to port naming and the configure script, plus some additional fixes, into an unofficial "Slirp-CK" which is on Github. It builds the same way as prior versions and is tested on Fedora Linux. I'm working on getting it functional on current macOS also. Next, I wrote up our four basic functional clients: ping, DNS lookup, NTP client (it doesn't set the clock, just shows you the stratum, refid and time which you can use for your own purposes), and TCP client. The TCP client accepts strings up to a defined maximum length, opens the connection, sends those strings (optionally separated by CRLF), and then reads the reply until the connection closes. This all seemed to work great on the Linux box, which you yourself can play with as a toy stack (directions at the end). Unfortunately, I then pushed it over to the Pro with Kermit and the compiler immediately started complaining. SLIP is a very thin layer on IP packets. 
There are exactly four metabytes, which I created preprocessor defines for: A SLIP packet ends with SLIP_END, or hex $c0. Where this must occur within a packet, it is replaced by a two byte sequence for unambiguity, SLIP_ESC SLIP_ESC_END, or hex $db $dc, and where the escape byte must occur within a packet, it gets a different two byte sequence, SLIP_ESC SLIP_ESC_ESC, or hex $db $dd. Although I initially set out to use defines and symbols everywhere instead of naked bytes, and wrote slip.c on that basis, I eventually settled on raw bytes afterwards using copious comments so it was clear what was intended to be sent. That probably saved me a lot of work renaming everything, because: Dimly I recalled that early C compilers, including System V, limit their identifiers to eight characters (the so-called "Ritchie limit"). At this point I probably should have simply removed them entirely for consistency with their absence elsewhere, but I went ahead and trimmed them down to more opaque, pithy identifiers. That wasn't the only problem, though. I originally had two functions in slip.c, slip_start and slip_stop, and it didn't like that either despite each appearing to have a unique eight-character prefix: That's because their symbols in the object file are actually prepended with various metacharacters like _ and ~, so effectively you only get seven characters in function identifiers, an issue this error message fails to explain clearly. The next problem: there's no unsigned char, at least not in PRO/VENIX Rev. 2.0 which I want to support because it's more common, and presumably the original versions of PRO/VENIX and Venix-11. (This type does exist in PRO/VENIX V2.0, but that's because it's System V and has a later C compiler.) In fact, the unsigned keyword didn't exist at all in the earliest C compilers, and even when it did, it couldn't be applied to every basic type. 
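The SLIP framing rules described at the start of this passage fit in a few lines. The following is a minimal sketch of an encoder, not the contents of slip.c: the function name `slipenc` and the shortened macro names are hypothetical, chosen to stay within seven characters in the spirit of the identifier limit just discussed. It uses plain `char` with `& 0xff` masks rather than `unsigned char`, and is written as C89 for a modern compiler.

```c
#include <assert.h>

#define S_END  0xc0   /* ends a SLIP packet */
#define S_ESC  0xdb   /* introduces a two-byte escape sequence */
#define S_EEND 0xdc   /* after S_ESC: a literal 0xc0 byte */
#define S_EESC 0xdd   /* after S_ESC: a literal 0xdb byte */

/*
 * Encode len bytes of datagram into out and terminate the frame.
 * out needs room for 2 * len + 1 bytes in the worst case.
 * Returns the number of bytes written.
 */
int slipenc(const char *in, int len, char *out)
{
    int i, n = 0;

    assert(in != 0 && out != 0 && len >= 0);
    for (i = 0; i < len; i++) {
        int b = in[i] & 0xff;          /* strip any propagated sign bits */

        if (b == S_END) {
            out[n++] = (char)S_ESC;
            out[n++] = (char)S_EEND;
        } else if (b == S_ESC) {
            out[n++] = (char)S_ESC;
            out[n++] = (char)S_EESC;
        } else {
            out[n++] = (char)b;
        }
    }
    out[n++] = (char)S_END;
    return n;
}
```

Decoding is the mirror image: read until an unescaped 0xc0, and translate each 0xdb 0xdc or 0xdb 0xdd pair back into the single byte it stands for.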
Although unsigned char was introduced in V7 Unix and is documented as legal in the PRO/VENIX manual, and it does exist in Venix/86 2.1 which is also a V7 Unix derivative, the PDP-11 and 8086 C compilers have different lineages and Venix's V7 PDP-11 compiler definitely doesn't support it: I suspect this may not have been intended because unsigned int works (unsigned long would be pointless on this architecture, and indeed correctly generates Misplaced 'long' on both versions of PRO/VENIX). Regardless of why, however, the plain char type on the PDP-11 is signed, and for compatibility reasons here we'll have no choice but to use it. Recall that when C89 was being codified, plain char was left as an ambiguous type since some platforms (notably PDP-11 and VAX) made it signed by default and others made it unsigned, and C89 was more about codifying existing practice than establishing new ones. That's why you see this on a modern 64-bit platform, e.g., my POWER9 workstation, where plain char is unsigned: If we change the original type explicitly to signed char on our POWER9 Linux machine, that's different: and, accounting for different sizes of int, seems similar on PRO/VENIX V2.0 (again, which is System V): but the exact same program on PRO/VENIX Rev. 2.0 behaves a bit differently: The differences in int size we expect, but there's other kinds of weird stuff going on here. The PRO/VENIX manual lists all the various permutations about type conversions and what gets turned into what where, but since the manual is already wrong about unsigned char I don't think we can trust the documentation for this part either. Our best bet is to move values into int and mask off any propagated sign bits before doing comparisons or math, which is agonizing, but reliable. That means throwing around a lot of seemingly superfluous & 0xff to make sure we don't get negative numbers where we don't want them. Once I got it built, however, there were lots of bugs. 
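The masking habit described above can be shown with a tiny helper. This is only a sketch under assumptions: `ubyte` is a hypothetical name, and `signed char` is spelled out so the example behaves the same on platforms where plain `char` is unsigned. On the PDP-11 compilers discussed here the parameter would simply be plain `char`.

```c
#include <assert.h>

/*
 * Widen a signed char to an int in the range 0..255.
 * Without the mask, a byte such as 0xc0 sign-extends to a negative int
 * and never compares equal to the constant 0xc0.
 */
int ubyte(signed char c)
{
    int v = c;      /* sign-extends: 0xc0 becomes -64 */

    v &= 0xff;      /* discard the propagated sign bits */
    assert(v >= 0 && v <= 255);
    return v;
}
```

A comparison such as `ubyte(c) == 0xc0` then succeeds for the SLIP end byte regardless of whether the platform's `char` is signed.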
Many were because it turns out the compiler isn't too good with 32-bit long, which is not a native type on the 16-bit PDP-11. This (part of the NTP client) worked on my regular Linux desktop, but didn't work in Venix: The first problem is that the intermediate shifts are too large and overshoot, even though they should be in range for a long. Consider this example: On the POWER9, accounting for the different semantics of %lx, But on Venix, the second shift blows out the value. We can get an idea of why from the generated assembly in the adb debugger (here from PRO/VENIX V2.0, since I could cut and paste from the Kermit session): (Parenthetical notes: csav is a small subroutine that pushes volatiles r2 through r4 on the stack and turns r5 into the frame pointer; the corresponding cret unwinds this. The initial branch in this main is used to reserve additional stack space, but is often practically a no-op.) The first shift is here at ~main+024. Remember the values are octal, so 010 == 8. r0 is 16 bits wide — no 32-bit registers — so an eight-bit shift is fine. When we get to the second shift, however, it's the same instruction on just one register (030 == 24) and the overflow is never checked. In fact, the compiler never shifts the second part of the long at all. The result is thus zero. The second problem in this example is that the compiler never treats the constant as a long even though statically there's no way it can fit in a 16-bit int. To get around those two gotchas on both Venices here, I rewrote it this way: An alternative to a second variable is to explicitly mark the epoch constant itself as long, e.g., by casting it, which also works. Here's another example for your entertainment. At least some sort of pseudo-random number generator is crucial, especially for TCP when selecting the pseudo-source port and initial sequence numbers, or otherwise Slirp seemed to get very confused because we would "reuse" things a lot. 
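The original listings for the NTP shift problem did not survive in this copy, so the following is only a sketch of the general approach, not the author's rewrite. It assembles a big-endian 32-bit value by widening each byte to a long type before shifting, and gives the epoch constant an explicit long suffix. The names `be32`, `ntp2unx` and `NTPDIFF` are hypothetical. It uses `unsigned long`, which works on a modern compiler but, per the text above, is rejected by the PRO/VENIX compilers, so a port would need a different spelling.

```c
#include <assert.h>

/* Seconds between the NTP epoch (1900) and the Unix epoch (1970). */
#define NTPDIFF 2208988800UL

/* Assemble four big-endian bytes into a 32-bit value. */
unsigned long be32(const char *b)
{
    unsigned long v;

    assert(b != 0);
    /* Widen each byte first so the shift happens on a long, not an int. */
    v  = (unsigned long)(b[0] & 0xff) << 24;
    v |= (unsigned long)(b[1] & 0xff) << 16;
    v |= (unsigned long)(b[2] & 0xff) << 8;
    v |= (unsigned long)(b[3] & 0xff);
    return v;
}

/* Convert the seconds field of an NTP timestamp to Unix time. */
unsigned long ntp2unx(const char *b)
{
    return (be32(b) - NTPDIFF) & 0xffffffffUL;
}
```

The final mask keeps the result within 32 bits on hosts where `unsigned long` is wider than that.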
Unfortunately, the obvious typical idiom to seed it like srand(time(NULL)) doesn't work: srand() expects a 16-bit int but time(NULL) returns a 32-bit long, and it turns out the compiler only passes the 16 most significant bits of the time — i.e., the ones least likely to change — to srand(). Here's the disassembly as proof (contents trimmed for display here; since this is a static binary, we can see everything we're calling): At the time we call the glue code for time from main, the value under the stack pointer (i.e., r6) is cleared immediately beforehand since we're passing NULL (at ~main+06). We then invoke the system call, which per the Venix manual for time(2) uses two registers for the 32-bit result, namely r0 (high bits) and r1 (low bits). We passed a null pointer, so the values remain in those registers and aren't written anywhere (branch at _time+014). When we return to ~main+014, however, we only put r0 on the stack for srand (remember that r5 is being used as the frame pointer; see the disassembly I provided for csav) and r1 is completely ignored. Why would this happen? It's because time(2) isn't declared anywhere in /usr/include or /usr/include/sys (the two C include directories), nor for that matter rand(3) or srand(3). This is true of both Rev. 2.0 and V2.0. Since the symbols are statically present in the standard library, linking will still work, but since the compiler doesn't know what it's supposed to be working with, it assumes int and fails to handle both halves of the long. One option is to manually declare everything ourselves. However, from the assembly at _time+016 we do know that if we pass a pointer, the entire long value will get placed there. That means we can also do this: Now this gets the lower bits and there is sufficient entropy for our purpose (though obviously not a cryptographically-secure PRNG). Interestingly, the Venix manual recommends using the time as the seed, but doesn't include any sample code. 
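The seeding code referred to above is missing from this copy. One plausible shape, offered as a sketch and not as the author's code, is to have `time()` store its result through a pointer and then hand the low 16 bits to `srand()`. The helper names `lowbits` and `seedrng` are hypothetical, and the includes assume a modern C library that declares these functions, which the text notes Venix's headers do not.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Low 16 bits of a time value: the bits that change most often. */
unsigned int lowbits(long t)
{
    return (unsigned int)(t & 0xffffL);
}

/* Seed rand() from the clock via a pointer, not the return value. */
void seedrng(void)
{
    time_t now = 0;

    time(&now);                    /* the full value is stored through the pointer */
    assert(now != (time_t)-1);     /* (time_t)-1 means the clock is unavailable */
    srand(lowbits((long)now));
}
```

As the text says, this is adequate for picking source ports and initial sequence numbers but is not a cryptographically secure source of randomness.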
At any rate this was enough to make the pieces work for IP, ICMP and UDP, but TCP would bug out after just a handful of packets. As it happens, Venix has rather small serial buffers by modern standards: tty(7), based on the TIOCQCNT ioctl(2), appears to have just a 256-byte read buffer (sg_ispeed is only char-sized). If we don't make adjustments for this, we'll start losing framing when the buffer gets overrun, as in this extract from a test build with debugging dumps on and a maximum segment size/window of 512 bytes. Here, the bytes marked by dashes are the remote end and the bytes separated by dots are what the SLIP driver is scanning for framing and/or throwing away; you'll note there is obvious ASCII data in them. If we make the TCP MSS and window on our client side 256 bytes, there is still retransmission, but the connection is more reliable since overrun occurs less often and seems to work better than a hard cap on the maximum transmission unit (e.g., "mtu 256") from SLiRP's side. Our only consequence to dropping the TCP MSS and window size is that the TCP client is currently hard-coded to just send one packet at the beginning (this aligns with how you'd do finger, HTTP/1.x, gopher, etc.), and that datagram uses the same size which necessarily limits how much can be sent. If I did the extra work to split this over several datagrams, it obviously wouldn't be a problem anymore, but I'm lazy and worse is better! The connection can be made somewhat more reliable still by improving the SLIP driver's notion of framing. RFC 1055 only specifies that the SLIP end byte (i.e., $c0) occur at the end of a SLIP datagram, though it also notes that it was proposed very early on that it could also start datagrams — i.e., if two occur back to back, then it just looks like a zero length or otherwise obviously invalid entity which can be trivially discarded. However, since there's no guarantee or requirement that the remote link will do this, we can't assume it either. 
We also can't just look for a $45 byte (i.e., IPv4 and a 20 byte length) because that's an ASCII character and appears frequently in text payloads. However, $45 followed by a valid DSCP/ECN byte is much less frequent, and most of the time this byte will be either $00, $08 or $10; we don't currently support ECN (maybe we should) and we wouldn't find other DSCP values meaningful anyway. The SLIP driver uses these sequences to find the start of a datagram and $c0 to end it. While that doesn't solve the overflow issue, it means the SLIP driver will be less likely to go out of framing when the buffer does overrun and thus can better recover when the remote side retransmits. And, well, that's it. There are still glitches to bang out but it's good enough to grab Hacker News: src/ directory, run configure and then run make (parallel make is fine, I use -j24 on my POWER9). Connect your two serial ports together with a null modem, which I assume will be /dev/ttyUSB0 and /dev/ttyUSB1. Start Slirp-CK with a command line like ./slirp -b 4800 "tty /dev/ttyUSB1" but adjusting the baud and path to your serial port. Take note of the specified virtual and nameserver addresses: Unlike the given directions, you can just kill it with Control-C when you're done; the five zeroes are only if you're running your connection over standard output such as direct shell dial-in (this is a retrocomputing blog so some of you might). To see the debug version in action, next go to the BASS directory and just do a make. You'll get a billion warnings but it should still work with current gcc and clang because I specifically request -std=c89. If you use a different path for your serial port (i.e., not /dev/ttyUSB0), edit slip.c before you compile. You don't do anything like ifconfig with these tools; you always provide the tools the client IP address they'll use (or create an alias or script to do so). 
Try this initial example, with slirp already running: Because I'm super-lazy, you separate the components of the IPv4 address with spaces, not dots. In Slirp-land, 10.0.2.2 is always the host you are connected to. You can see the ICMP packet being sent, the bytes being scanned by the SLIP driver for framing (the ones with dots), and then the reply (with dashes). These datagram dumps have already been pre-processed for SLIP metabytes. Unfortunately, you may not be able to ping other hosts through Slirp because there's no backroute but you could try this with a direct SLIP connection, an exercise left for the reader. If Slirp doesn't want to respond and you're sure your serial port works (try testing both ends with Kermit?), you can recompile it with -DDEBUG (change this in the generated Makefile) and pass your intended debug level like -d 1 or -d 3. You'll get a file called slirp_debug with some agonizingly detailed information so you can see if it's actually getting the datagrams and/or liking the datagrams it gets. For nslookup, ntp and minisock, the second address becomes your accessible recursive nameserver (or use -i to provide an IP). The DNS dump is also given in the debug mode with slashes for the DNS answer section. nslookup and ntp are otherwise self-explanatory: minisock takes a server name (or IP) and port, followed by optional strings. The strings, up to 255 characters total (in this version), are immediately sent with CR-LFs between them except if you specify -n. If you specify no strings, none are sent. It then waits on that port for data and exits when the socket closes. This is how we did the HTTP/1.0 requests in the screenshots. On the DEC Pro, this has been tested on my trusty DEC Professional 380 running PRO/VENIX V2.0. It should compile and run on a 325 or 350, and on at least PRO/VENIX Rev. 
V2.0, though I don't have any hardware for this and Xhomer's serial port emulation is not good enough for this purpose (so unfortunately you'll need a real DEC Pro until I or Tarek get around to fixing it). The easiest way to get it over there is Kermit. Assuming you have this already, connect your host and the Pro on the "real" serial port at 9600bps. Make sure both sides are set to binary and just push all the files over (except the Markdown documentation unless you really want), and then do a make -f Makefile.venix (it may have been renamed to makefile.venix; adjust accordingly). Establishing the link is as simple as connecting your server's serial port to the other end of the BCC05 or equivalent from the Pro and starting Slirp to talk to that port (on my system, it's even the same port, so the same command line suffices). If you experience issues with the connection, the easiest fix is to just bounce Slirp — because there are no timeouts, there are also no retransmits. I don't know if this is hitting bugs in Slirp or in my code, though it's probably the latter. Nevertheless, I've been able to run stuff most of the day without issue. It's nice to have a simple network option and the personal satisfaction of having written it myself. There are many acknowledged deficiencies, mostly because I assume little about the system itself and tried to keep everything very simplistic. There are no timeouts and thus no retransmits, and if you break the TCP connection in the middle there will be no proper teardown. Also, because I used Slirp for the other side (as many others will), and because my internal network is full of machines that have no idea what IPv6 is, there is no IPv6 support. I agree there should be and SLIP doesn't care whether it gets IPv4 or IPv6, but for now that would require patching Slirp which is a job I just don't feel up to at the moment. I'd also like to support at least CSLIP in the future. 
In the meantime, if you want to try this on other operating systems, the system-dependent portions are in compat.h and slip.c with a small amount in ntp.c for handling time values. You will likely want to make changes to where your serial ports are and the speed they run at and how to make that port "raw" in slip.c. You should also add any extra #includes to compat.h that your system requires. I'd love to hear about it running other places. Slirp-CK remains under the original modified Slirp license and BASS is under the BSD 2-clause license. You can get Slirp-CK and BASS at Github.

15 hours ago • 2 votes

Transactions are a protocol

Transactions are not an intrinsic part of a storage system. Any storage system can be made transactional: Redis, S3, the filesystem, etc. Delta Lake and Orleans demonstrated techniques to make S3 (or cloud storage in general) transactional. Epoxy demonstrated techniques to make Redis (and any other system) transactional. And of course there's always good old Two-Phase Commit. If you don't want to read those papers, I wrote about a simplified implementation of Delta Lake and also wrote about a simplified MVCC implementation over a generic key-value storage layer. It is both the beauty and the burden of transactions that they are not intrinsic to a storage system. Postgres and MySQL and SQLite have transactions. But you don't need to use them. It isn't possible to require you to use transactions. Many developers, myself a few years ago included, do not know why you should use them. (Hint: read Designing Data Intensive Applications.) And you can take it even further by ignoring the transaction layer of an existing transactional database and implementing your own transaction layer as Convex has done (the Epoxy paper above also does this). It isn't entirely clear that you have a lot to lose by implementing your own transaction layer since the indexes you'd want on the version field of a value would only be as expensive or slow as any other secondary index in a transactional database. Though why you'd do this isn't entirely clear (I would like to read about this from Convex some time). It's useful to see transaction protocols as another tool in your system design tool chest when you care about consistency, atomicity, and isolation. Especially as you build systems that span data systems. Maybe, as Ben Hindman hinted at the last NYC Systems, even proprietary APIs will eventually provide something like two-phase commit so physical systems outside our control can become transactional too. 
Transactions are a protocol short new post pic.twitter.com/nTj5LZUpUr — Phil Eaton (@eatonphil) April 20, 2025

21 hours ago • 2 votes

Humanities Crash Course Week 16: The Art of War

In week 16 of the humanities crash course, I revisited the Tao Te Ching and The Art of War. I just re-read the Tao Te Ching last year, so I only revisited my notes now. I’ve also read The Art of War a few times, but decided to revisit it now anyway.

Readings
Both books are related. The Art of War is older; Sun Tzu wrote it around 500 BCE, at a time when war was becoming more “professionalized” in China. The book aims to convey what had (or hadn’t) worked on the battlefield. The starting point is conflict. There’s an enemy we’re looking to defeat. The best victory is achieved without engagement. That’s not always possible, so the book offers pragmatic suggestions on tactical maneuvers and such. It gives good advice for situations involving conflict, which is why it has influenced leaders (including businesspeople) throughout the centuries: It’s better to win before any shots are fired (i.e., through cunning and calculation). Use deception. Don’t let conflicts drag on. Understand the context to use it to your advantage. Keep your forces unified and disciplined. Adapt to changing conditions on the ground. Consider economics and logistics. Gather intelligence on the opposition. The goal is winning through foresight rather than brute force: good advice!

The Tao Te Ching, written by Lao Tzu around the late 4th century BCE, is the central text in Taoism, a philosophy that aims for skillful action by aligning with the natural order of the universe, i.e., doing through “non-doing” and transcending distinctions (which aren’t present in reality but layered onto experiences by humans). Tao means Way, as in the Way to achieve such alignment. The book is a guide to living the Tao. (Living in Tao?) But as it makes clear from its very first lines, you can’t really talk about it: the Tao precedes language. It’s a practice, and the practice entails non-striving.

Audiovisual
Music: Gioia recommended the Beatles (The White Album, Sgt. Pepper’s, and Abbey Road) and Rolling Stones (Let it Bleed, Beggars Banquet, and Exile on Main Street). I’d heard all three Rolling Stones albums before, but don’t know them by heart (like I do with the Beatles). So I revisited all three. Some songs sounded a bit cringe-y, especially after having heard “real” blues a few weeks ago. Of the three albums, Exile on Main Street sounds the most authentic. (Perhaps because of the band members’ altered states?) In any case, it sounded most “in the Tao” to me, that is, as though the musicians surrendered to the experience of making this music. It’s about as rock ‘n roll as it gets.

Arts: Gioia recommended looking at Chinese architecture. As usual, my first thought was to look for short documentaries or lectures on YouTube. I was surprised by how little there was. Instead, I read the webpage Gioia suggested.

Cinema: Since we headed again to China, I took in another classic Chinese film that had long been on my to-watch list: Wong Kar-wai’s IN THE MOOD FOR LOVE. I found it more Confucian than Taoist, although its slow pacing, gentleness, focus on details, and passivity strike something of a Taoist mood.

Reflections
When reading the Tao Te Ching, I’m often reminded of this passage from the Gospel of Matthew:

No man can serve two masters: for either he will hate the one, and love the other; or else he will hold to the one, and despise the other. Ye cannot serve God and mammon. Therefore I say unto you, Take no thought for your life, what ye shall eat, or what ye shall drink; nor yet for your body, what ye shall put on. Is not the life more than meat, and the body than raiment? Behold the fowls of the air: for they sow not, neither do they reap, nor gather into barns; yet your heavenly Father feedeth them. Are ye not much better than they? Which of you by taking thought can add one cubit unto his stature? And why take ye thought for raiment? Consider the lilies of the field, how they grow; they toil not, neither do they spin: And yet I say unto you, That even Solomon in all his glory was not arrayed like one of these. Wherefore, if God so clothe the grass of the field, which to day is, and to morrow is cast into the oven, shall he not much more clothe you, O ye of little faith? Therefore take no thought, saying, What shall we eat? or, What shall we drink? or, Wherewithal shall we be clothed? (For after all these things do the Gentiles seek:) for your heavenly Father knoweth that ye have need of all these things. But seek ye first the kingdom of God, and his righteousness; and all these things shall be added unto you. Take therefore no thought for the morrow: for the morrow shall take thought for the things of itself. Sufficient unto the day is the evil thereof.

The Tao Te Ching is older and from a different culture, but “Consider the lilies of the field, how they grow; they toil not, neither do they spin” has always struck me as very Taoistic: both texts emphasize non-striving and putting your trust in a higher order. Even though it’s even older, that spirit is also evident in The Art of War. It’s not merely letting things happen, but aligning mindfully with the needs of the time. Sometimes we must fight. Best to do it quickly and efficiently. And best yet if the conflict can be settled before it begins.

Notes on Note-taking
This week, I started using ChatGPT’s new o3 model. Its answers are a bit better than what I got with previous models, but there are downsides. For one thing, o3 tends to format answers in tables rather than lists. This works well if you use ChatGPT in a wide window, but is less useful on a mobile device or (as in my case) in a narrow window to the side, which is how I usually use ChatGPT on my Mac. o3’s responses often include tables that get cut off in this window. For another, replies take much longer as the AI does more “research” in the background. As a result, it feels less conversational than 4o, which changes how I interact with it. I’ll play more with o3 for work, but for this use case, I’ll revert to 4o.

Up Next
Gioia recommends Apuleius’s The Golden Ass. I’ve never read this, and frankly feel weary about returning to the period of Roman decline. (Too close to home?) But I’ll approach it with an open mind. Again, there’s a YouTube playlist for the videos I’m sharing here. I’m also sharing these posts via Substack if you’d like to subscribe and comment. See you next week!

14 hours ago • 1 votes

My approach to teaching electronics

Explaining the reasoning behind my series of articles on electronics -- and asking for your thoughts.

yesterday • 2 votes

New here?