Vulkan 1.3 on the M1 in 1 month

133

from On Life and Lisp in programming

Finally, conformant Vulkan for the M1! The new “Honeykrisp” driver is the first conformant Vulkan® for Apple hardware on any operating system, implementing the full 1.3 spec without “portability” waivers.

Honeykrisp is not yet released for end users. We’re continuing to add features, improve performance, and port to more hardware. Source code is available for developers.

Figure: HoloCure running on Honeykrisp ft. DXVK, FEX, and Proton.

Honeykrisp is not based on prior M1 Vulkan efforts, but rather Faith Ekstrand’s open source NVK driver for NVIDIA GPUs. In her words:

All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I’m building NVK with all the best practices we’ve developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean...

a year ago


More from On Life and Lisp

Vulkan 1.4 sur Asahi Linux

Today, the Khronos Group released the 1.4 specification of Vulkan, the standard graphics API. The Asahi Linux project is proud to announce the first Vulkan 1.4 driver for Apple hardware. Our Honeykrisp driver has been recognized by Khronos as conformant to the new version since day one.

That driver is already available in our official repositories. After installing Fedora Asahi Remix, run dnf upgrade --refresh to get the latest drivers.

Vulkan 1.4 standardizes several important features, including timestamps and dynamic rendering local read. The industry expects that these features will become more common, and we are prepared.

Releasing a conformant driver reflects our commitment to graphics standards and software freedom. Asahi Linux is also compatible with OpenGL 4.6, OpenGL ES 3.2, and OpenCL 3.0, all conformant to the relevant specifications. For that matter, ours are the only conformant drivers on Apple hardware for any graphics standard.

Although the driver is released, you still need to build an experimental version of Vulkan-Loader to access the new Vulkan version. Nevertheless, you can immediately use all the new features as extensions in our Vulkan 1.3 driver.

For more information, see the Khronos blog post.

8 months ago • 96 votes

AAA gaming on Asahi Linux

Gaming on Linux on M1 is here! We’re thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0. Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today’s release is an alpha, Control runs well!

Installation

First, install Fedora Asahi Remix. Once installed, get the latest drivers with dnf upgrade --refresh && reboot. Then just dnf install steam and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead.

The stack

Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference: FEX emulates x86 on Arm. Wine translates Windows to Linux. DXVK and vkd3d-proton translate DirectX to Vulkan.

There’s one curveball: page size. Operating systems allocate memory in fixed size “pages”. If an application expects smaller pages than the system uses, it will break due to insufficient alignment of allocations. That’s a problem: x86 expects 4K pages but Apple systems use 16K pages.

While Linux can’t mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you’re happy because you can play Fallout 4.

Vulkan

The final piece is an adult-level Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I’ve since added DXVK support. Let’s look at some new features.

Tessellation

Tessellation enables games like The Witcher 3 to generate geometry.
The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today’s talk at XDC2024.

Geometry shaders

Geometry shaders are an older, cruder method to generate geometry. Like tessellation, the M1 lacks geometry shader hardware so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don’t need to be fast – just fast enough for games like Ghostrunner.

Enhanced robustness

“Robustness” permits an application’s shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1.

Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires VK_EXT_robustness2, a Vulkan extension strengthening robustness.

Like before, we implement robustness with compare-and-select instructions. A naïve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load.

    load R, buffer, index * 16
    ulesel R[0], index, size, R[0], 0
    ulesel R[1], index, size, R[1], 0
    ulesel R[2], index, size, R[2], 0
    ulesel R[3], index, size, R[3], 0

There’s a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects.
    ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo
    ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi
    load R, buffer, index * 16

Two instructions, not four.

Next steps

Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don’t require sparse, like Cyberpunk 2077.

While many games are playable, newer AAA titles don’t hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed.

Beyond gaming, we’re adding general purpose x86 emulation based on this stack. For more information, see the FAQ.

Today’s alpha is a taste of what’s to come. Not the final form, but enough to enjoy Portal 2 while we work towards “1.0”.

Acknowledgements

This work has been years in the making with major contributions from… Alyssa Rosenzweig, Asahi Lina, chaos_princess, Davide Cavalca, Dougall Johnson, Ella Stanforth, Faith Ekstrand, Janne Grunau, Karol Herbst, marcan, Mary Guillemard, Neal Gompa, Sergio López, TellowKrinkle, Teoh Han Hui, Rob Clark, Ryan Houdek… Plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today’s release is thanks to the magic of open source.

We hope you enjoy the magic. Happy gaming.
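The reserved-zeroes trick from the "Enhanced robustness" section above can be modelled on the CPU. This is only a sketch under stated assumptions: the 16-byte element size and the 64 gigabyte region come from the text, while `RESERVED_BASE`, the example addresses, and the helper names are made up for illustration. The `ulesel` helper follows the pseudo-assembly as written (select the first value when `index <= size`, unsigned).

```python
# CPU-side model of the "reserved zero region" address swap.
# Assumptions (not from the driver): RESERVED_BASE and all helper names.

MASK32 = 0xFFFFFFFF
ELEMENT_SIZE = 16               # bytes per element, from "index * 16"
RESERVED_SIZE = 64 * 2**30      # 64 gigabytes of zeroes, from the text
RESERVED_BASE = 0x10_0000_0000  # hypothetical address of the zero region


def ulesel(a: int, b: int, if_true: int, if_false: int) -> int:
    """Unsigned less-or-equal select, as read from the pseudo-assembly."""
    return if_true if a <= b else if_false


def robust_address(buffer_base: int, index: int, size: int) -> int:
    """Return the address a robust load would fetch from.

    The 64-bit base is swapped half by half (two 32-bit selects);
    the index itself is left untouched.
    """
    lo = ulesel(index, size, buffer_base & MASK32, RESERVED_BASE & MASK32)
    hi = ulesel(index, size, buffer_base >> 32, RESERVED_BASE >> 32)
    return ((hi << 32) | lo) + index * ELEMENT_SIZE


# The largest 32-bit index still ends inside the reserved region:
# (2**32 - 1) * 16 + 16 == 2**36 == 64 gigabytes.
largest_end = (2**32 - 1) * ELEMENT_SIZE + ELEMENT_SIZE
```

An in-bounds index resolves inside the application's buffer, and any out-of-bounds 32-bit index resolves inside the zero region, so the load itself needs no per-component fix-up.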

10 months ago • 89 votes

Conformant OpenGL 4.6 on the M1

For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers. Already installed? Just dnf --refresh upgrade.

Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender, Ryujinx, and Citra.

Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2.

While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan.

Compared to 4.1, OpenGL 4.6 adds dozens of required features, including robustness, SPIR-V, clip control, cull distance, compute shaders, and upgraded transform feedback.

Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set.

How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on.

For a taste of the challenges we overcame, let’s look at robustness.
Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance.

For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”.

“Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface.

All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility.

Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled:

1. Zero (Direct3D, Vulkan with robustBufferAccess2)
2. Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess)
3. Arbitrary values, but can’t crash (OpenGL ES)

OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result.
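The clamp-to-last-index idea can be illustrated in a few lines. This is a minimal CPU-side sketch, not driver code: `robust_load` is a hypothetical name, the buffer is assumed non-empty, and indices are assumed unsigned (non-negative), as they would be on the GPU.

```python
def robust_load(buffer: list, index: int):
    """Load buffer[index], clamping out-of-bounds indices to the last element.

    Assumes a non-empty buffer and an unsigned (non-negative) index.
    """
    last = len(buffer) - 1   # last valid index, computed from the buffer size
    idx = min(index, last)   # the single "unsigned minimum" step
    return buffer[idx]
```

A valid index is unchanged by the minimum, while an invalid one collapses to the last element, which is "some data in the buffer" and therefore satisfies the second definition in the list above.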
As an example, a uniform buffer load without robustness might look like:

    load.i32 result, buffer, index

Robustness adds a single unsigned minimum (umin) instruction:

    umin idx, index, last
    load.i32 result, buffer, idx

Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency.

There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up.

We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot.

Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address:

$\text{address} = \text{base} + \text{stride} \times \text{vertex} + \text{offset}$

Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads.

One instruction set feature helps.
In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:

    imad idx, stride/4, vertex, offset/4
    load.i32 result, base, idx

The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:

    load.v4i32 result, base, vertex << 2

…with the hardware calculating the address:

$\text{address} = \text{base} + 4 \times (\text{vertex} \ll 2) = \text{base} + 16 \times \text{vertex}$

What about robustness? We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”.

Let’s handle the latter problem. We can rewrite the addressing equation as:

$\text{address} = (\text{base} + \text{offset}) + \text{stride} \times \text{vertex}$

That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:

    umin idx, vertex, last valid
    load.v4i32 result, base, idx << 2

We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size.
Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base:

$\text{offset} + \text{stride} \times \text{vertex} + \text{format size}$

The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as:

$\text{vertex} \le \left\lfloor \frac{\text{size} - \text{offset} - \text{format size}}{\text{stride}} \right\rfloor$

The driver calculates the right-hand side and passes it into the shader.

One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes.

Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction.

In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes.

…But it would be no fun if our hardware was reasonable.

Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”.

For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality.

With robustness, the specifications all agree that image loads return…

- Zero if the X- or Y-coordinate is out-of-bounds
- Zero if the level is out-of-bounds

Meanwhile, image loads on the M1 GPU return…

- Zero if the X- or Y-coordinate is out-of-bounds
- Values from the last level if the level is out-of-bounds

Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why.
The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance.

The obvious workaround is to never load from an invalid level:

    if (level <= levels) {
        return imageLoad(x, y, level);
    } else {
        return 0;
    }

That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching:

    vec4 data = imageLoad(x, y, level);
    return (level <= levels) ? data : 0;

This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:

    image_load R, x, y, level
    ulesel R[0], level, levels, R[0], 0
    ulesel R[1], level, levels, R[1], 0
    ulesel R[2], level, levels, R[2], 0
    ulesel R[3], level, levels, R[3], 0

Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:

    bool valid = (level <= levels);
    int x_ = valid ? x : 20000;
    return imageLoad(x_, y, level);

Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:

    ulesel x_, level, levels, x, #20000
    image_load R, x_, y, level

If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance.

Figure: Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.
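The out-of-bounds-X workaround for mip levels can be checked against a toy model of the hardware behaviour described in the article. Everything here is an illustrative assumption: `hw_image_load` only imitates the two rules stated in the text (zero for out-of-bounds X or Y, clamped level), images are nested Python lists, coordinates are assumed non-negative, and `last_level` is taken to be the highest valid level index.

```python
OUT_OF_BOUNDS_X = 20000  # constant from the text, larger than the 16384 maximum width
ZERO = (0, 0, 0, 0)


def hw_image_load(mips, x, y, level):
    """Toy model of the described hardware: clamps the level, zeroes OOB x/y."""
    level = min(level, len(mips) - 1)     # out-of-bounds level is clamped
    image = mips[level]                   # image[y][x] is an RGBA tuple
    if y >= len(image) or x >= len(image[0]):
        return ZERO                       # out-of-bounds coordinate reads zero
    return image[y][x]


def robust_image_load(mips, x, y, level):
    """Force x out-of-bounds when the level is invalid, then load once."""
    last_level = len(mips) - 1            # assumed meaning of the bound
    x_ = x if level <= last_level else OUT_OF_BOUNDS_X
    return hw_image_load(mips, x_, y, level)


# A 2x2 base level and a 1x1 second level.
example_mips = [
    [[(1, 2, 3, 4), (5, 6, 7, 8)], [(9, 9, 9, 9), (8, 8, 8, 8)]],
    [[(7, 7, 7, 7)]],
]
```

With `example_mips`, the raw model returns the last level's texel for level 5, while the wrapped load returns zeroes, and valid loads are unaffected. Only one scalar (`x_`) is selected, mirroring the single-instruction version in the article.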

a year ago • 76 votes

The first conformant M1 GPU driver

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? Just install Linux! For existing Asahi Linux users, upgrade your system with dnf upgrade (Fedora) or pacman -Syu (Arch) for the latest drivers. Our reverse-engineered, free and open source graphics drivers are the world’s only conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry. To become conformant, an “implementation” must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a 30-day review period, if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the M1, M1 Pro/Max/Ultra, M2, and M2 Pro/Max. Today’s milestone isn’t just about OpenGL ES. We’re releasing the first conformant implementation of any graphics standard for the M1. And we don’t plan to stop here ;-) Unlike ours, the manufacturer’s M1 drivers are unfortunately not conformant for any standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means that there is no guarantee that applications using the standards will work on your M1/M2 (if you’re not running Linux). This isn’t just a theoretical issue. Consider Vulkan. The third-party MoltenVK layers a subset of Vulkan on top of the proprietary drivers. However, those drivers lack key functionality, breaking valid Vulkan applications. That hinders developers and users alike, if they haven’t yet switched their M1/M2 computers to Linux. Why did we pursue standards conformance when the manufacturer did not? Above all, our commitment to quality. We want our users to know that they can depend on our Linux drivers. 
We want standard software to run without M1-specific hacks or porting. We want to set the right example for the ecosystem: the way forward is implementing open standards, conformant to the specifications, without compromises for “portability”. We are not satisfied with proprietary drivers, proprietary APIs, and refusal to implement standards. The rest of the industry knows that progress comes from cross-vendor collaboration. We know it, too. Achieving conformance is a win for our community, for open source, and for open graphics. Of course, Asahi Lina and I are two individuals with minimal funding. It’s a little awkward that we beat the big corporation… It’s not too late though. They should follow our lead! OpenGL ES 3.1 updates the experimental OpenGL ES 3.0 and OpenGL 3.1 we shipped in June. Notably, ES 3.1 adds compute shaders, typically used to accelerate general computations within graphics applications. For example, a 3D game could run its physics simulations in a compute shader. The simulation results can then be used for rendering, eliminating stalls that would otherwise be required to synchronize the GPU with a CPU physics simulation. That lets the game run faster. Let’s zoom in on one new feature: atomics on images. Older versions of OpenGL ES allowed an application to read an image in order to display it on screen. ES 3.1 allows the application to write to the image, typically from a compute shader. This new feature enables flexible image processing algorithms, which previously needed to fit into the fixed-function 3D pipeline. However, GPUs are massively parallel, running thousands of threads at the same time. If two threads write to the same location, there is a conflict: depending which thread runs first, the result will be different. We have a race condition. “Atomic” access to memory provides a solution to race conditions. 
With atomics, special hardware in the memory subsystem guarantees consistent, well-defined results for select operations, regardless of the order of the threads. Modern graphics hardware supports various atomic operations, like addition, serving as building blocks to complex parallel algorithms.

Can we put these two features together to write to an image atomically? Yes. A ubiquitous OpenGL ES extension, required for ES 3.2, adds atomics operating on pixels in an image. For example, a compute shader could atomically increment the value at pixel (10, 20).

Other GPUs have dedicated instructions to perform atomics on images, making the driver implementation straightforward. For us, the story is more complicated. The M1 lacks hardware instructions for image atomics, even though it has non-image atomics and non-atomic images. We need to reframe the problem.

The idea is simple: to perform an atomic on a pixel, we instead calculate the address of the pixel in memory and perform a regular atomic on that address. Since the hardware supports regular atomics, our task is “just” calculating the pixel’s address.

If the image were laid out linearly in memory, this would be straightforward: multiply the Y-coordinate by the number of bytes per row (“stride”), multiply the X-coordinate by the number of bytes per pixel, and add. That gives the pixel’s offset in bytes relative to the first pixel of the image. To get the final address, we add that offset to the address of the first pixel.

Alas, images are rarely linear in memory. To improve cache efficiency, modern graphics hardware interleaves the X- and Y-coordinates. Instead of one row after the next, pixels in memory follow a spiral-like curve. We need to amend our previous equation to interleave the coordinates. We could use many instructions to mask one bit at a time, shifting to construct the interleaved result, but that’s inefficient. We can do better.

There is a well-known “bit twiddling” algorithm to interleave bits.
Rather than shuffle one bit at a time, the algorithm shuffles groups of bits, parallelizing the problem. Implementing this algorithm in shader code improves performance.

In practice, only the lower 7 bits (or fewer) of each coordinate are interleaved. That lets us use 32-bit instructions to “vectorize” the interleave, by putting the X- and Y-coordinates in the low and high 16 bits of a 32-bit register. Those 32-bit instructions let us interleave X and Y at the same time, halving the instruction count. Plus, we can exploit the GPU’s combined shift-and-add instruction. Putting the tricks together, we interleave in 10 instructions of M1 GPU assembly:

    # Inputs x, y in r0l, r0h.
    # Output in r1.
    add r2, #0, r0, lsl 4
    or r1, r0, r2
    and r1, r1, #0xf0f0f0f
    add r2, #0, r1, lsl 2
    or r1, r1, r2
    and r1, r1, #0x33333333
    add r2, #0, r1, lsl 1
    or r1, r1, r2
    and r1, r1, #0x55555555
    add r1, r1l, r1h, lsl 1

We could stop here, but what if there’s a dedicated instruction to interleave bits? PowerVR has a “shuffle” instruction shfl, and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too. Unfortunately, even if it was, the proprietary compiler won’t use it when compiling our test shaders. That makes it difficult to reverse-engineer the instruction – if it exists – by observing compiled shaders.

It’s time to dust off a powerful reverse-engineering technique from magic kindergarten: guess and check.

Dougall Johnson provided the guess. When considering the instructions we already know about, he took special notice of the “reverse bits” instruction. Since reversing bits is a type of bit shuffle, the interleave instruction should be encoded similarly. The bit reverse instruction has a two-bit field specifying the operation, with value 01. Related instructions to count the number of set bits and find the first set bit have values 10 and 11 respectively. That encompasses all known “complex bit manipulation” instructions.
    00    ? ? ?
    01    Reverse bits
    10    Count set bits
    11    Find first set

There is one value of the two-bit enumeration that is unobserved and unknown: 00. If this interleave instruction exists, it’s probably encoded like the bit reverse but with operation code 00 instead of 01.

There’s a difficulty: the three known instructions have one single input source, but our instruction interleaves two sources. Where does the second source go? We can make a guess based on symmetry. Presumably to simplify the hardware decoder, M1 GPU instructions usually encode their sources in consistent locations across instructions. The other three instructions have a gap where we would expect the second source to be, in a two-source arithmetic instruction. Probably the second source is there.

Armed with a guess, it’s our turn to check. Rather than handwrite GPU assembly, we can hack our compiler to replace some two-source integer operation (like multiply) with our guessed encoding of “interleave”. Then we write a compute shader using this operation (by “multiplying” numbers) and run it with the newfangled compute support in our driver.

All that’s left is writing a shader that checks that the mystery instruction returns the interleaved result for each possible input. Since the instruction takes two 16-bit sources, there are about 4 billion ($2^{32}$) inputs. With our driver, the M1 GPU manages to check them all in under a second, and the verdict is in: this is our interleave instruction.

As for our clever vectorized assembly to interleave coordinates? We can replace it with one instruction. It’s anticlimactic, but it’s fast and it passes the conformance tests. And that’s what matters.

Thank you to Khronos and Software in the Public Interest for supporting open drivers.
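For readers who want to experiment with the vectorized interleave listed earlier in this post, here is a rough CPU-side transcription in Python next to a bit-at-a-time reference. It is a sketch, not the driver's code: the function names are made up, the shift-and-add instructions are modelled as plain shifts, and the packing of x into the low half and y into the high half follows the assembly comment ("Inputs x, y in r0l, r0h").

```python
def interleave_reference(x: int, y: int, bits: int = 8) -> int:
    """Bit-at-a-time interleave: x bits go to even positions, y bits to odd."""
    out = 0
    for i in range(bits):
        out |= ((x >> i) & 1) << (2 * i)
        out |= ((y >> i) & 1) << (2 * i + 1)
    return out


def interleave_vectorized(x: int, y: int) -> int:
    """Spread x and y in parallel within one 32-bit value, then merge.

    Mirrors the three shift/or/mask rounds and the final shifted add.
    Valid for coordinates that fit in 8 bits, which covers the 7-bit case.
    """
    r = (x & 0xFFFF) | ((y & 0xFFFF) << 16)   # x in low half, y in high half
    r = (r | (r << 4)) & 0x0F0F0F0F           # spread groups of 4 bits
    r = (r | (r << 2)) & 0x33333333           # spread groups of 2 bits
    r = (r | (r << 1)) & 0x55555555           # spread single bits
    return (r & 0xFFFF) + ((r >> 16) << 1)    # low half + (high half << 1)
```

Comparing the two functions over every pair of 7-bit coordinates is a small-scale version of the exhaustive guess-and-check approach the post describes for the hardware instruction.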

a year ago • 49 votes

More in programming

We Are Still the Web

Twenty years ago, Kevin Kelly wrote an absolutely seminal piece for Wired. This week is a great opportunity to look back at it. The post We Are Still the Web appeared first on The History of the Web.

18 hours ago • 4 votes

Omarchy is on the move

Omarchy has been improving at a furious pace. Since it was first released on June 26, I've pushed out 18(!) new releases together with a rapidly growing community of collaborators, users, and new-to-Linux enthusiasts. We have about 3,500 early adopters on the Omarchy Discord, 250 pull requests processed, and one heck of an awesome Arch + Hyprland Linux environment to show for it! The latest release is 1.11.0, and it brings an entirely overhauled control menu to the experience. Now everything is controlled through a single, unified system that makes it super fast to operate Omarchy's settings and options through the keyboard. It's exactly the kind of hands-off-the-mouse operation that I've always wanted, and with Linux, I've been able to build it just to my tastes. It's a delight. There's really something special going on in Linux at the moment. Arch has been around for twenty years, but with Hyprland on top, it's been catapulted in front of an entirely new audience. Folks who'd never thought that open source could be able to deliver a desktop experience worth giving up Windows or macOS for. Of course, Linux isn't for everyone. It's still an adventure! An awesome, teach-you-about-computers adventure, but not everyone is into computer adventures. Plenty of people are content with a computer appliance where they never have to look under the hood. All good. Microsoft and Apple have those people covered. But the world is a big place! And in that big place, there are a growing number of computer enthusiasts who've grown very disillusioned with both Microsoft and Apple. Folks who could be enticed to give Linux a look, if the barrier was a little lower and the benefits a little clearer. Those are the folks I'm building Omarchy for.

13 hours ago • 3 votes

HTML is Dead, Long Live HTML

Rethinking DOM from first principles

Browsers are in a very weird place. While WebAssembly has succeeded, even on the server, the client still feels largely the same as it did 10 years ago. Enthusiasts will tell you that accessing native web APIs via WASM is a solved problem, with some minimal JS glue. But the question not asked is why you would want to access the DOM. It's just the only option. So I'd like to explain why it really is time to send the DOM and its assorted APIs off to a farm somewhere, with some ideas on how. I won't pretend to know everything about browsers. Nobody knows everything anymore, and that's the problem.

The 'Document' Model

Few know how bad the DOM really is. In Chrome, document.body now has 350+ keys. This doesn't include the CSS properties in document.body.style of which there are... 660. The boundary between properties and methods is very vague. Many are just facades with an invisible setter behind them. Some getters may trigger a just-in-time re-layout. There's ancient legacy stuff, like all the onevent properties nobody uses anymore. The DOM is not lean and continues to get fatter. Whether you notice this largely depends on whether you are making web pages or web applications.

Most devs now avoid working with the DOM directly, though occasionally some purist will praise pure DOM as being superior to the various JS component/templating frameworks. What few declarative facilities the DOM has, like innerHTML, do not resemble modern UI patterns at all. The DOM has too many ways to do the same thing, none of them nice.

    connectedCallback() {
      const shadow = this.attachShadow({ mode: 'closed' }),
        template = document.getElementById('hello-world')
          .content.cloneNode(true),
        hwMsg = `Hello ${ this.name }`;

      Array.from(template.querySelectorAll('.hw-text'))
        .forEach(n => n.textContent = hwMsg);

      shadow.append(template);
    }

Web Components deserve a mention, being the web-native equivalent of JS component libraries.
But they came too late and are unpopular. The API seems clunky, with its Shadow DOM introducing new nesting and scoping layers. Proponents kinda read like apologetics. The achilles heel is the DOM's SGML/XML heritage, making everything stringly typed. React-likes do not have this problem, their syntax only looks like XML. Devs have learned not to keep state in the document, because it's inadequate for it. For HTML itself, there isn't much to critique because nothing has changed in 10-15 years. Only ARIA (accessibility) is notable, and only because this was what Semantic HTML was supposed to do and didn't. Semantic HTML never quite reached its goal. Despite dating from around 2011, there is e.g. no <thread> or <comment> tag, when those were well-established idioms. Instead, an article inside an article is probably a comment. The guidelines are... weird. There's this feeling that HTML always had paper-envy, and couldn't quite embrace or fully define its hypertext nature, and did not trust its users to follow clear rules. Stewardship of HTML has since firmly passed to WHATWG, really the browser vendors, who have not been able to define anything more concrete as a vision, and have instead just added epicycles at the margins. Along the way even CSS has grown expressions, because every templating language wants to become a programming language. Editability of HTML remains a sad footnote. While technically supported via contentEditable, actually wrangling this feature into something usable for applications is a dark art. I'm sure the Google Docs and Notion people have horror stories. Nobody really believes in the old gods of progressive enhancement and separating markup from style anymore, not if they make apps. Most of the applications you see nowadays will kitbash HTML/CSS/SVG into a pretty enough shape. But this comes with immense overhead, and is looking more and more like the opposite of a decent UI toolkit. 
[Figure: The Slack input box]

[Figure: Off-screen clipboard hacks]

Lists and tables must be virtualized by hand, taking over for layout, resizing, dragging, and so on. Making a chat window's scrollbar stick to the bottom is somebody's TODO, every single time. And the more you virtualize, the more you have to reinvent find-in-page, right-click menus, etc. The web blurred the distinction between UI and fluid content, which was novel at the time. But it makes less and less sense, because the UI part is a decade obsolete, and the content has largely homogenized.

CSS is inside-out

CSS doesn't have a stellar reputation either, but few can put their finger on exactly why. Where most people go wrong is to start with the wrong mental model, approaching it like a constraint solver. This is easy to show with e.g.:

    <div>
      <div style="height: 50%">...</div>
      <div style="height: 50%">...</div>
    </div>

    <div>
      <div style="height: 100%">...</div>
      <div style="height: 100%">...</div>
    </div>

The first might seem reasonable: divide the parent into two halves vertically. But what about the second? Viewed as a set of constraints, it's contradictory, because the parent div is twice as tall as... itself. What will happen instead in both cases is the height is ignored. The parent height is unknown and CSS doesn't backtrack or iterate here. It just shrink-wraps the contents. If you set e.g. height: 300px on the parent, then it works, but the latter case will still just spill out.

[Figure: Outside-in and inside-out layout modes]

Instead, your mental model of CSS should be applying two passes of constraints, first going outside-in, and then inside-out. When you make an application frame, this is outside-in: the available space is divided, and the content inside does not affect sizing of panels. When paragraphs stack on a page, this is inside-out: the text stretches out its containing parent. This is what HTML wants to do naturally. By being structured this way, CSS layouts are computationally pretty simple.
You can propagate the parent constraints down to the children, and then gather up the children's sizes in the other direction. This is attractive and allows webpages to scale well in terms of elements and text content. CSS is always inside-out by default, reflecting its document-oriented nature. The outside-in is not obvious, because it's up to you to pass all the constraints down, starting with body { height: 100%; }. This is why they always say vertical alignment in CSS is hard.

[Figure: Use flex grow and shrink for spill-free auto-layouts with completely reasonable gaps]

The scenario above is better handled with a CSS3 flex box (display: flex), which provides explicit control over how space is divided. Unfortunately flexing muddles the simple CSS model. To auto-flex, the layout algorithm must measure the "natural size" of every child. This means laying it out twice: first speculatively, as if floating in aether, and then again after growing or shrinking to fit.

This sounds reasonable but can come with hidden surprises, because it's recursive. Doing speculative layout of a parent often requires full layout of unsized children, e.g. to know how text will wrap. If you nest it right, it could in theory cause an exponential blow up, though I've never heard of it being an issue. Instead you will only discover this when someone drops some large content in somewhere, and suddenly everything gets stretched out of whack. It's the opposite of the problem on the mug.

To avoid the recursive dependency, you need to isolate the children's contents from the outside, thus making speculative layout trivial. This can be done with contain: size, or by manually setting the flex-basis size.

CSS has gained a few constructs like contain or will-change, which work directly with the layout system, and drop the pretense of one big happy layout. It reveals some of the layer-oriented nature underneath, and is a substitute for e.g. using position: absolute wrappers to do the same.
What these do is strip off some of the semantics, and break the flow of DOM-wide constraints. These are overly broad by default and too document-oriented for the simpler cases. This is really a metaphor for all DOM APIs. The Good Parts? That said, flex box is pretty decent if you understand these caveats. Building layouts out of nested rows and columns with gaps is intuitive, and adapts well to varying sizes. There is a "CSS: The Good Parts" here, which you can make ergonomic with sufficient love. CSS grids also work similarly, they're just very painfully... CSSy in their syntax. But if you designed CSS layout from scratch, you wouldn't do it this way. You wouldn't have a subtractive API, with additional extra containment barrier hints. You would instead break the behavior down into its component facets, and use them à la carte. Outside-in and inside-out would both be legible as different kinds of containers and placement models. The inline-block and inline-flex display models illustrate this: it's a block or flex on the inside, but an inline element on the outside. These are two (mostly) orthogonal aspects of a box in a box model. Text and font styles are in fact the odd ones out, in hypertext. Properties like font size inherit from parent to child, so that formatting tags like <b> can work. But most of those 660 CSS properties do not do that. Setting a border on an element does not apply the same border to all its children recursively, that would be silly. It shows that CSS is at least two different things mashed together: a system for styling rich text based on inheritance... and a layout system for block and inline elements, nested recursively but without inheritance, only containment. They use the same syntax and APIs, but don't really cascade the same way. Combining this under one style-umbrella was a mistake. Worth pointing out: early ideas of relative em scaling have largely become irrelevant. 
We now think of logical vs device pixels instead, which is a far more sane solution, and closer to what users actually expect.

SVG is natively integrated as well. Having SVGs in the DOM instead of just as <img> tags is useful to dynamically generate shapes and adjust icon styles. But while SVG is powerful, it's neither a subset nor superset of CSS. Even when it overlaps, there are subtle differences, like the affine transform. It has its own warts, like serializing all coordinates to strings.

CSS has also gained the ability to round corners, draw gradients, and apply arbitrary clipping masks: it clearly has SVG-envy, but falls very short. SVG can e.g. do polygonal hit-testing for mouse events, which CSS cannot, and SVG has its own set of graphical layer effects. Whether you use HTML/CSS or SVG to render any particular element is based on specific annoying trade-offs, even if they're all scalable vectors on the back-end.

In either case, there are also some roadblocks. I'll just mention three:

- text-ellipsis (i.e. text-overflow: ellipsis) can only be used to truncate unwrapped text, not entire paragraphs. Detecting truncated text is even harder, as is just measuring text: the APIs are inadequate. Everyone just counts letters instead.
- position: sticky lets elements stay in place while scrolling with zero jank. While tailor-made for this purpose, it's subtly broken. Having elements remain unconditionally sticky requires an absurd nesting hack, when it should be trivial.
- The z-index property determines layering by absolute index. This inevitably leads to a z-index-war.css where everyone is putting in a new number +1 or -1 to make things layer correctly. There is no concept of relative Z positioning.

For each of these features, we got stuck with v1 of whatever they could get working, instead of providing the right primitives. Getting this right isn't easy, it's the hard part of API design. You can only iterate on it, by building real stuff with it before finalizing it, and looking for the holes.
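To make the two-pass mental model from the "CSS is inside-out" discussion concrete, here is a minimal sketch in JavaScript. It is not the real CSS algorithm, only a toy vertical-stacking model. The node shape ({ height, content, children }) and the layout function are hypothetical names invented for this illustration. It shows why a percentage height is ignored when the parent height is unknown, and why two 100% children spill out of a 300px parent.

```javascript
// Toy model of vertical block layout. NOT the real CSS algorithm.
// Hypothetical node shape, invented for this sketch:
//   height:   'auto', a number of pixels, or a percentage string like '50%'
//   content:  pixels of the node's own content (e.g. text), default 0
//   children: child nodes, stacked vertically
function layout(node, parentDefiniteHeight = null) {
  // Pass 1 (outside-in): resolve this node's height from what the parent hands down.
  let resolved = null; // null means "unknown for now, shrink-wrap later"
  if (typeof node.height === 'number') {
    resolved = node.height;
  } else if (typeof node.height === 'string' && node.height.endsWith('%')) {
    // A percentage only resolves against a parent height that is already known.
    // There is no backtracking: otherwise it is treated like 'auto'.
    if (parentDefiniteHeight !== null) {
      resolved = (parentDefiniteHeight * parseFloat(node.height)) / 100;
    }
  }

  const children = (node.children || []).map((child) => layout(child, resolved));

  // Pass 2 (inside-out): gather the children's sizes on the way back up.
  const contentHeight =
    (node.content || 0) + children.reduce((sum, child) => sum + child.height, 0);
  const height = resolved !== null ? resolved : contentHeight;

  // Anything taller than a fixed height spills out of the box.
  return { height, overflow: Math.max(0, contentHeight - height), children };
}

const kids = (h) => [
  { height: h, content: 40 },
  { height: h, content: 40 },
];

// Parent height unknown: the percentages are ignored and the parent shrink-wraps.
const unknownParent = layout({ height: 'auto', children: kids('50%') });

// Parent fixed at 300px: two halves fit, two full-height children spill out.
const fixedHalves = layout({ height: 300, children: kids('50%') });
const fixedFulls = layout({ height: 300, children: kids('100%') });
```

Each node is visited once: constraints travel down through the function argument, and sizes travel back up through the return value. That single traversal is the sense in which this style of layout stays computationally simple. Flex layout breaks the pattern because it has to measure natural sizes before dividing space.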
Oil on Canvas So, DOM is bad, CSS is single-digit X% good, and SVG is ugly but necessary... and nobody is in a position to fix it? Well no. The diagnosis is that the middle layers don't suit anyone particularly well anymore. Just an HTML6 that finally removes things could be a good start. But most of what needs to happen is to liberate the functionality that is there already. This can be done in good or bad ways. Ideally you design your system so the "escape hatch" for custom use is the same API you built the user-space stuff with. That's what dogfooding is, and also how you get good kernels. A recent proposal here is HTML in Canvas, to draw HTML content into a <canvas>, with full control over the visual output. It's not very good. While it might seem useful, the only reason the API has the shape that it does is because it's shoehorned into the DOM: elements must be descendants of <canvas> to fully participate in layout and styling, and to make accessibility work. There are also "technical concerns" with using it off-screen. One example is this spinny cube: To make it interactive, you attach hit-testing rectangles and respond to paint events. This is a new kind of hit-testing API. But it only works in 2D... so it seems 3D-use is only cosmetic? I have many questions. Again, if you designed it from scratch, you wouldn't do it this way! In particular, it's absurd that you'd have to take over all interaction responsibilities for an element and its descendants just to be able to customize how it looks i.e. renders. Especially in a browser that has projective CSS 3D transforms. The use cases not covered by that, e.g. curved re-projection, will also need more complicated hit-testing than rectangles. Did they think this through? What happens when you put a dropdown in there? To me it seems like they couldn't really figure out how to unify CSS and SVG filters, or how to add shaders to CSS. Passing it thru canvas is the only viable option left. "At least it's programmable." 
Is it really? Screenshotting DOM content is 1 good use-case, but not what this is sold as at all. The whole reason to do "complex UIs on canvas" is to do all the things the DOM doesn't do, like virtualizing content, just-in-time layout and styling, visual effects, custom gestures and hit-testing, and so on. It's all nuts and bolts stuff. Having to pre-stage all the DOM content you want to draw sounds... very counterproductive. From a reactivity point-of-view it's also a bad idea to route this stuff back through the same document tree, because it sets up potential cycles with observers. A canvas that's rendering DOM content isn't really a document element anymore, it's doing something else entirely. Canvas-based spreadsheet that skips the DOM entirely The actual achilles heel of canvas is that you don't have any real access to system fonts, text layout APIs, or UI utilities. It's quite absurd how basic it is. You have to implement everything from scratch, including Unicode word splitting, just to get wrapped text. The proposal is "just use the DOM as a black box for content." But we already know that you can't do anything except more CSS/SVG kitbashing this way. text-ellipsis and friends will still be broken, and you will still need to implement UIs circa 1990 from scratch to fix it. It's all-or-nothing when you actually want something right in the middle. That's why the lower level needs to be opened up. Where To Go From Here The goals of "HTML in Canvas" do strike a chord, with chunks of HTML used as free-floating fragments, a notion that has always existed under the hood. It's a composite value type you can handle. But it should not drag 20 years of useless baggage along, while not enabling anything truly novel. The kitbashing of the web has also resulted in enormous stagnation, and a loss of general UI finesse. When UI behaviors have to be mined out of divs, it limits the kinds of solutions you can even consider. 
Fixing this within DOM/HTML seems unwise, because there's just too much mess inside. Instead, new surfaces should be opened up outside of it. WebGPU-based box model My schtick here has become to point awkwardly at Use.GPU's HTML-like renderer, which does a full X/Y flex model in a fraction of the complexity or code. I don't mean my stuff is super great, no, it's pretty bare-bones and kinda niche... and yet definitely nicer. Vertical centering is easy. Positioning makes sense. There is no semantic HTML or CSS cascade, just first-class layout. You don't need 61 different accessors for border* either. You can just attach shaders to divs. Like, that's what people wanted right? Here's a blueprint, it's mostly just SDFs. Font and markup concerns only appear at the leaves of the tree, where the text sits. It's striking how you can do like 90% of what the DOM does here, with a fraction of the complexity of HTML/CSS/SVG, if you just reinvent that wheel. Done by 1 guy. And yes, I know about the second 90% too. The classic data model here is of a view tree and a render tree. What should the view tree actually look like? And what can it be lowered into? What is it being lowered into right now, by a giant pile of legacy crud? Alt-browser projects like Servo or Ladybird are in a position to make good proposals here. They have the freshest implementations, and are targeting the most essential features first. The big browser vendors could also do it, but well, taste matters. Good big systems grow from good small ones, not bad big ones. Maybe if Mozilla hadn't imploded... but alas. Platform-native UI toolkits are still playing catch up with declarative and reactive UI, so that's that. Native Electron-alternatives like Tauri could be helpful, but they don't treat origin isolation as a design constraint, which makes security teams antsy. There's a feasible carrot to dangle for them though, namely in the form of better process isolation. 
Because of CPU exploits like Spectre, multi-threading via SharedArrayBuffer and Web Workers is kinda dead on arrival anyway, and that affects all WASM. The details are boring but right now it's an impossible sell when websites have to have things like OAuth and Zendesk integrated into them. Reinventing the DOM to ditch all legacy baggage could coincide with redesigning it for a more multi-threaded, multi-origin, and async web. The browser engines are already multi-process... what did they learn? A lot has happened since Netscape, with advances in structured concurrency, ownership semantics, FP effects... all could come in handy here. * * * Step 1 should just be a data model that doesn't have 350+ properties per node tho. Don't be under the mistaken impression that this isn't entirely fixable.

8 hours ago • 2 votes

Extending My Japanese Visa as a Freelancer

With TokyoDev as my sponsor, I extended my Engineer/Specialist in Humanities/International Services visa for another three years. I’m thrilled by this result, because my family and I recently moved to a small town in Kansai and have been enjoying our lives in Japan more than ever. Since I have some experience with bureaucracy in Japan, I was prepared for things to get . . . complicated. Instead, I was pleasantly surprised. Despite the fact that I’d changed jobs and had three dependents, the process was much simpler than I expected. Below I’ll share my particular experience, which should be especially helpful to those in the Kansai area, and cover the following: What a visa extension is What happens when you change jobs mid-visa The documents your new sponsoring company needs to provide The documents you need to assemble yourself Some paperwork issues you might encounter What you can expect when visiting an immigration office (particularly in Osaka) Follow-up actions you’ll be required to take Information I wish I’d had What do I mean by “visa extension”? In 2022, I was a permanent employee at a company in Tokyo, which agreed to sponsor my Engineer/Specialist in Humanities/International Services visa and bring me to Japan. Initially I received a three-year work visa, and at the same time my husband and two children each received a three-year Dependent visa. Our original visas were set to expire in August 2025, but we’ve decided to remain in Japan long-term, so we wanted to prolong our stay. Since Japan’s immigration offices accept visa extension applications beginning three months before the visa end date, I began preparing my application in May 2025 and submitted it in June. It’s a good idea to begin the visa extension process as soon as possible. There are no downsides to doing so, and beginning early can help prevent serious complications. 
If you have a bank account in Japan, it can be frozen when your original visa expires; you will either need to show the bank your new residence card before that date, or demonstrate that you are currently in the process of extending your visa. Your My Number Card also expires on the original visa expiration date. This process is also often called a “visa renewal,” but it’s the same procedure. There is no difference between an extension and a renewal. New employment status and employer In the three years since my visa was originally issued, I became a freelancer, or sole proprietor (個人事業主, kojin jigyou nushi), and left my original sponsoring company. Paul McMahon was not only one of my first clients in Japan, but also the first to offer me an ongoing contract, which was enormously helpful. When I made my formal exit from my initial company, I was able to list TokyoDev as my new employer when notifying Immigration. The documents required TokyoDev also agreed to sponsor my visa, which meant Paul would provide documentation about the company to Immigration. I’d assumed this paperwork might be difficult or time-intensive, but Paul reassured me that the entire process was quite simple and only took a few hours. This work does not increase linearly per international employee; once a company knows which documents are required, it is relatively simple to repeat the process for each employee. I’m not the first worker TokyoDev has sponsored. In fact, TokyoDev successfully sponsored a contractor within a month of incorporation, with the only fees being those required for gathering the paperwork. Company documents Exactly what documents are required varies according to the status of the company. 
In this specific case, the documents Paul provided for TokyoDev, a category 4 company, were:

- The company portion of my visa extension application
- TokyoDev’s legal report summary (法定調書合計表, hotei chosho goukei-hyou) for the previous fiscal year
- TokyoDev’s Certificate of Registration (登記事項証明書, touki jikou shoumei-sho)
- A copy of TokyoDev’s financial statements (決算書, kessan-sho) for the latest fiscal year
- A business description of TokyoDev, which in this case was a sales presentation in Japanese that explained the premise of the company

Personal documents

The documents I supplied myself were:

- My passport and residence card
- My portions of my visa extension application
- A visa-sized photo (taken at a photo booth)
- The signed contract between myself and TokyoDev
- A contract with a secondary client
- My tax payment certificate for the previous year (納税証明書, nouzei shoumei-sho), which I got from our town hall
- My resident tax certificate (住民税の課税, juuminzei no kazei), which I got from our town hall

I had to prepare some additional documents for my dependents. These were:

- The residence cards and passports of my children
- Copies of my own residence card and passport, for my husband’s application
- Visa extension applications for my dependent children and husband
- A visa-sized photo of my husband (children under 16 don’t need photos)
- Copies and Japanese translations of the children’s birth certificates
- A copy and Japanese translation of our American wedding certificate

Paperwork tips

A few questions and complications did arise while I was assembling the paperwork.

Japanese translations

I had Japanese translations of my children’s birth certificates and my marriage certificate already, left over from registering my initial address with City Hall. These translations were done by a coworker, and weren’t certified. I’ve used them repeatedly for procedures in Japan and never had them rejected.

Dependent applications

First, I had a hard time locating the correct application for my dependents.
I could only find the one I’ve linked above, which initially didn’t seem to apply, since it’s for dependents of those who have a Designated Activities visa (such as researchers). I ended up filling out another, totally erroneous version of the application and had to re-do it all at the immigration office. To my chagrin, I found the paper version they had on hand was identical to this linked form!

Resident tax certificate in a new town

Next, my resident tax certificate was complicated by the fact that I’d lived in my new town in Nara for about seven months, and hadn’t yet paid any resident tax locally. Fortunately my first resident tax installment came due about that time, so I paid it promptly, then got the form from City Hall demonstrating that it had indeed been paid. I wasn’t sure a single payment would be enough to satisfy immigration, but it seemed to work. If I’d needed to prove payment for previous years, I would have had to request that certificate from the previous town I’d lived in, Hachioji. Since this would have been a tedious process involving mailing things back and forth and a money order, I was glad to avoid it.

Giving a “reason for extension”

When filling out my application, Paul advised that I ask for a five-year extension: he said Immigration might not grant it, but it probably wouldn’t hurt my chances. I did that, and in the brief space where you write “Reason for extension,” I crammed in several sentences about how my career is based in Japan, my husband is studying shakuhachi, and my children attend public Japanese school and speak Japanese. All our applications included at least some of these details. This probably wasn’t necessary, and it’s hard to say if it influenced the final result or not, but that was how I approached it.

That pesky middle name

I worried that, since I’d signed my TokyoDev contract without my middle name, which is present on my passport and residence card, the application would be rejected.
This sort of name-based nitpicking is common enough at Japanese banks—would Immigration react in the same way? Paul assured me that other employees had submitted their contracts without middle names and had no trouble. He was right and it wasn’t an issue, but I’ve decided in future to sign everything with all three of my names, just to be sure. Never make this mistake Finally, my husband wrote his own application, then had to rewrite it at the immigration office because they realized he’d used a Frixion (erasable) pen. This is strictly not allowed, so save yourself some trouble and use a regular ballpoint with blue or black ink! The application process Before making the trip to an immigration office, I polled my friends and checked Google Maps reviews. The nearest office to me had some one-star reviews, and a friend of mine described a negative experience there, so I was leery of simply going with the closest option. Instead, I decided to apply at an office farther from home, the Osaka Regional Immigration Bureau by Cosmosquare Station, which my friend had used for years. I wasn’t entirely sure that this was permitted, but nobody at the Osaka office raised an eyebrow at my Nara address. Getting there I took the train to Cosmosquare Station and arrived around lunchtime on Friday, June 20th. The station itself has an odd quirk: every time I try to use Google Maps inside or near it, I receive bizarrely inaccurate directions. Whatever the building is made of, it really messes with Maps! Luckily the signage around Cosmosquare is quite clear, and I had no difficulty locating the immigration office once I stopped trying to use my phone. Unfortunately I must have picked one of the worst times to visit. The office is on the second floor, but the line extended out the door and down the staircase. At least it was moving quickly, and I soon discovered that there is a convenience store on the second floor, which proved important later on. 
Asking for information The line I was standing in led to two counters, Application and Information. Since I wasn’t sure I had filled out the correct forms for my dependents, I stopped by the Information desk first. The man there spoke English well, and informed me that I had, in fact, filled out the wrong paperwork. This mistake was easily fixed because there were printed copies of the correct form—and of every other form used by Immigration—right by the doorway. The clerk also confirmed what I’d already suspected, that I couldn’t submit an application on behalf of my husband. Since I’d come alone while he watched the kids, he’d have to come by himself later. I took fresh copies of the applications for my children. Since the office itself was quite full, I went to the convenience store and enjoyed a soda while filling out the paperwork again. That convenience store also has an ID photo booth, a copier, and revenue stamps, so it’s well-equipped to assist applicants. Submitting the application Armed with the correct paperwork, I got back into line and waited around 10 minutes for my turn to submit. The woman behind the desk glanced quickly through my documents. Mostly she wanted to know if I needed to make any copies, because I wouldn’t be receiving these documents back. Once I’d confirmed I didn’t need any papers returned, she gave me a number and asked me to wait to be called. In addition to my number, she handed me a postcard on which to write my own address. This would be sent to me if and when Immigration approved the visa extension, to indicate by what date I needed to pick up my new residence card. Based on the messages I periodically sent my husband, my number wasn’t called for three and a half hours. The office was crowded and hot, but there were also screens showing the numbers called in the hallway and downstairs in the lobby, so it’s possible to visit the convenience store or stretch your legs without missing your appointment. 
Being able to purchase snacks and drinks at will certainly helped. Mostly, I wished I had brought a good book with me. When my number was finally called, I was surprised they had no questions for me. The clerks had spotted one place in the documents where I’d forgotten to sign; once that minor error was corrected, I was free to go. A paper was stapled into my passport, and my residence card was stamped on the back to show that I was going through the visa extension process. My husband’s experience My husband visited the Osaka Regional Immigration Bureau at 9:30 a.m. on Monday, June 26th. Although he described it as “quite busy” already, there was no line down the staircase, and he was finished by noon. If you want to avoid long wait times, arriving early in the morning might help. Approval and picking up Given the crowd that had packed the Osaka immigration office, and also knowing how backed up the immigration offices in Tokyo can be, I fully expected not to see our postcards for several months. Immigration regularly publishes statistics on the various visas and related processing times based on national averages. In fact, my husband and I received our postcards the same day, July 11th, just three weeks after I’d submitted my and my children’s applications. As usual, there was no indication on the postcard as to how long our visa extension would be: we would only find out if we’d qualified for a one-, three-, or five-year extension once we picked up our new residence cards. I had until July 18th to collect the cards for myself and the kids, and my husband had until the 25th to get his. We opted to go together on the same day, July 14th. The postcards also indicated that we’d need four 6,000 yen revenue stamps, one for each applicant. Revenue stamps (収入印紙, shuunyuu inshi) are a cash replacement, like a money order, to affix to specific documents. 
Though we knew that the convenience store at the Osaka Regional Immigration Bureau sold revenue stamps, we decided to secure them in advance, just in case. The morning we left, we stopped by our local post office and showed the staff our postcards. They had no trouble identifying and providing the stamps we needed.

We arrived at the immigration office around 10:45 a.m. Foolishly, we’d assumed that picking up the cards would be a faster process. Instead, we waited for nearly four hours. Fortunately we’d discussed this possibility with several family friends, who were prepared to help pick up our children from school when we were running late.

We finally got our cards and the news was good: we’d all received three-year extensions!

Aftermath

Extending our visa, and receiving new residence cards, entails some further paperwork. Specifically:

My husband will need to reapply for permission to work.

We’ll need new My Number cards for all family members, as those expire with the original visa expiration date.

Our Japanese bank account will also be frozen upon the original visa expiration date, so it’s important that we inform our bank of the visa extension and provide copies of our new cards as soon as possible.

If you are still going through the extension process when your original visa expires, you can show the bank your residence card, which should be stamped to indicate you are currently extending your visa, to prevent them from freezing your account in the interim.

Top Takeaways

Here’s a brief list of the most important questions I had during the process, and the answers I found.

Can I apply for a visa extension on behalf of my spouse and children?

Yes to underage children, no to the spouse, unless there are serious extenuating circumstances (such as the spouse being hospitalized). If you and your spouse don’t apply at the same time, make sure your dependent spouse has a copy of your passport and residence card to take with them.
Can you only apply at the nearest immigration office?

Not necessarily. I applied to one slightly further from my house, and actually in another prefecture, for personal reasons. However, this only worked because the Osaka office was a regional branch, with broader jurisdiction that included Nara. It probably wouldn’t have worked in reverse: for example, if I lived in Osaka and applied to the satellite office in Nara, which only has jurisdiction over Nara and Wakayama. Be sure to check the jurisdiction of the immigration office you choose.

Is there any downside to applying early?

There is no downside to getting your application in as soon as possible. Immigration will begin accepting applications within three months of the visa expiration date. I originally questioned whether an early extension would mean you “lost” a few months of your visa. For example, if I received my new card in June, but my visa was originally due to expire in August, would the new expiration date be in June? This isn’t the case: the new expiration date is based on the previous expiration date, not on when you submit your application. My visa’s prior expiration date was August 2025, and it’s now August 2028.

If you’re extending a visa that was for longer than one year, how many years of tax certificates and records do you need to provide?

I only provided my previous fiscal year’s tax certificate and proof of one resident tax payment in my local area, and that seemed to be enough. I wasn’t asked for documentation of previous years or paperwork from my prior town hall.

Conclusion

I’ve lived in several countries over the last fifteen years, so I’m experienced in general at acquiring and retaining visas. Japan’s visa system is paperwork-intensive, but it’s also fair, stable, and reasonably transparent. The fact that my Japanese visa isn’t attached to a single company, but rather to the type of work I wish to perform, gives me peace of mind as I continue to establish our lives here.
I also feel more comfortable as a freelancer in Japan, now that I know how easy it is for a company to sponsor my visa. Paul was able to assemble the documents needed in a single afternoon, and it didn’t cost TokyoDev anything beyond the price of the papers and postage. As freelancing and gig work are on the rise, I’d encourage more Japanese companies to consider sponsoring visas for their international contractors.

Likewise, I hope that the experience I’ve shared here will help other immigrants to explore their freelancing options in Japan, and approach their visa extension process with both good information and a solid plan.

If you’d like to continue the conversation on visa extensions and company sponsorship, you can join the TokyoDev Discord, or see more articles on visas for developers, starting your own business in Japan, and remaining here long-term.

2 days ago • 6 votes

p-fast trie: lexically ordered hash map

Here’s a sketch of an idea that might or might not be a good idea. Dunno if it’s similar to something already described in the literature – if you know of something, please let me know via the links in the footer!

The gist is to throw away the tree and interior pointers from a qp-trie. Instead, the p-fast trie is stored using a hash map organized into stratified levels, where each level corresponds to a prefix of the key.

Exact-match lookups are normal O(1) hash map lookups. Predecessor / successor searches use binary chop on the length of the key. Where a qp-trie search is O(k), where k is the length of the key, a p-fast trie search is O(log k).

This smaller O(log k) bound is why I call it a “p-fast trie” by analogy with the x-fast trie, which has O(log log N) query time. (The “p” is for popcount.) I’m not sure if this asymptotic improvement is likely to be effective in practice; see my thoughts towards the end of this note.

layout

A p-fast trie consists of:

Leaf objects, each of which has a name. Each leaf object refers to its successor forming a circular linked list. (The last leaf refers to the first.) Multiple interior nodes refer to each leaf object.

A hash map containing every (strict) prefix of every name in the trie. Each prefix maps to a unique interior node. Names are treated as bit strings split into chunks of (say) 6 bits, and prefixes are whole numbers of chunks.

An interior node contains a (1<<6) == 64 wide bitmap with a bit set for each chunk where prefix+chunk matches a key. Following the bitmap is a popcount-compressed array of references to the leaf objects that are the closest predecessor of the corresponding prefix+chunk key.

Prefixes are strictly shorter than names so that we can avoid having to represent non-values after the end of a name, and so that it’s OK if one name is a prefix of another.

The size of chunks and bitmaps might change; 6 is a guess that I expect will work OK.
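The bitmap plus popcount-compressed array can be illustrated with a small Python sketch. This is not the author's code: the class name `InteriorNode` and its methods are hypothetical, and plain strings stand in for leaf objects. The only idea it demonstrates is that the number of set bits below a chunk's bit gives that chunk's index in the compressed array.

```python
class InteriorNode:
    """Hypothetical interior node: a 64-bit bitmap plus a compressed leaf array."""

    def __init__(self):
        self.bitmap = 0    # bit c is set when prefix+chunk c is present
        self.leaves = []   # one entry per set bit, in ascending chunk order

    def _rank(self, chunk):
        # Popcount of the bits strictly below `chunk` is the array index.
        below = self.bitmap & ((1 << chunk) - 1)
        return bin(below).count("1")

    def has(self, chunk):
        return (self.bitmap >> chunk) & 1 == 1

    def get(self, chunk):
        # Return the leaf stored for this chunk, or None if its bit is clear.
        return self.leaves[self._rank(chunk)] if self.has(chunk) else None

    def set(self, chunk, leaf):
        i = self._rank(chunk)
        if self.has(chunk):
            self.leaves[i] = leaf          # overwrite the existing slot
        else:
            self.bitmap |= 1 << chunk      # mark the chunk as present
            self.leaves.insert(i, leaf)    # keep the array in chunk order


node = InteriorNode()
node.set(5, "leaf-e")
node.set(2, "leaf-b")
node.set(63, "leaf-z")
print(node.leaves, node.get(5), node.get(3))
```

The array only holds as many slots as there are set bits, which is the point of the compression; a real implementation would use a hardware popcount instruction rather than counting characters in a binary string.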
For restricted alphabets you can use something like my DNS trie name preparation trick to squash 8-bit chunks into sub-64-wide bitmaps.

In Rust where cross-references are problematic, there might have to be a hash map that owns the leaf objects, so that the p-fast trie can refer to them by name. Or use a pool allocator and refer to leaf objects by numerical index.

search

To search, start by splitting the query string at its end into prefix + final chunk of bits. Look up the prefix in the hash map and check the chunk’s bit in the bitmap. If it’s set, you can return the corresponding leaf object because it’s either an exact match or the nearest predecessor.

If it isn’t found, and you want the predecessor or successor, continue with a binary chop on the length of the query string.

Look up the chopped prefix in the hash map. The next chunk is the chunk of bits in the query string immediately after the prefix.

If the prefix is present and the next chunk’s bit is set, remember the chunk’s leaf pointer, make the prefix longer, and try again.

If the prefix is present and the next chunk’s bit is not set and there’s a lesser bit that is set, return the leaf pointer for the lesser bit. Otherwise make the prefix shorter and try again.

If the prefix isn’t present, make the prefix shorter and try again.

When the binary chop bottoms out, return the longest-matching leaf you remembered.

The leaf’s key and successor bracket the query string.

modify

When inserting a name, all its prefixes must be added to the hash map from longest to shortest. At the point where it finds that the prefix already exists, the insertion routine needs to walk down the (implicit) tree of successor keys, updating pointers that refer to the new leaf’s predecessor so they refer to the new leaf instead.

Similarly, when deleting a name, remove every prefix from longest to shortest from the hash map where they only refer to this leaf.
At the point where the prefix has sibling nodes, walk down the (implicit) tree of successor keys, updating pointers that refer to the deleted leaf so they refer to its predecessor instead.

I can’t “just” use a concurrent hash map and expect these algorithms to be thread-safe, because they require multiple changes to the hash map. I wonder if the search routine can detect when the hash map is modified underneath it and retry.

thoughts

It isn’t obvious how a p-fast trie might compare to a qp-trie in practice.

A p-fast trie will use a lot more memory than a qp-trie because it requires far more interior nodes. They need to exist so that the random-access binary chop knows whether to shorten or lengthen the prefix.

To avoid wasting space the hash map keys should refer to names in leaf objects, instead of making lots of copies. This is probably tricky to get right.

In a qp-trie the costly part of the lookup is less than O(k) because non-branching interior nodes are omitted. How does that compare to a p-fast trie’s O(log k)?

Exact matches in a p-fast trie are just a hash map lookup. If they are worth optimizing then a qp-trie could also be augmented with a hash map.

Many steps of a qp-trie search are checking short prefixes of the key near the root of the tree, which should be well cached. By contrast, a p-fast trie search will typically skip short prefixes and instead bounce around longer prefixes, which suggests its cache behaviour won’t be so friendly.

A qp-trie predecessor/successor search requires two traversals, one to find the common prefix of the key and another to find the prefix’s predecessor/successor. A p-fast trie requires only one.
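To make the “binary chop on the length of the key” concrete, here is a minimal Python sketch of just one piece of the search: finding the longest prefix of a query that is present in the hash map. It relies on the layout rule that every strict prefix of every name is stored, so if a prefix is present then all shorter prefixes are present too, and binary search over the length is valid. The helper names (`chunks`, `build_prefix_set`, `longest_present_prefix`) are made up for illustration, zero-padding the final chunk is an assumption, and the bitmap checks and leaf pointers from the search description are deliberately left out.

```python
CHUNK_BITS = 6  # the note's guess for the chunk size


def chunks(name: bytes) -> tuple:
    """Split a byte string into 6-bit chunks (final chunk zero-padded: an assumption)."""
    bits = int.from_bytes(name, "big")
    nbits = len(name) * 8
    pad = (-nbits) % CHUNK_BITS
    bits <<= pad
    nbits += pad
    return tuple((bits >> shift) & 63
                 for shift in range(nbits - CHUNK_BITS, -1, -CHUNK_BITS))


def build_prefix_set(names):
    """Collect every strict prefix (in whole chunks) of every name."""
    prefixes = set()
    for name in names:
        cs = chunks(name)
        for n in range(len(cs)):   # lengths 0 .. len-1, so strictly shorter
            prefixes.add(cs[:n])
    return prefixes


def longest_present_prefix(prefixes, query):
    """Binary chop on prefix length; returns (length, number of hash probes).

    Assumes a non-empty trie, so the empty prefix is always present.
    """
    lo, hi = 0, len(query)
    probes = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        probes += 1
        if query[:mid] in prefixes:
            lo = mid        # present: the answer is at least this long
        else:
            hi = mid - 1    # absent: every longer prefix is absent too
    return lo, probes


prefixes = build_prefix_set([b"apple", b"apply", b"banana"])
query = chunks(b"applesauce")
print(len(query), longest_present_prefix(prefixes, query))
```

This toy re-slices and re-hashes each probed prefix, which costs time proportional to the prefix length, so the O(log k) figure here counts hash probes only.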

2 days ago • 5 votes
