
New here?

Welcome! BoredReading is a fresh way to read high-quality articles (updated every hour). Our goal is to curate (with your help) Michelin-star-quality articles (stuff that's really worth reading). We currently have articles in categories spanning architecture, history, design, technology, and more. Grab a cup of freshly brewed coffee and start reading. This is the best way to increase your attention span, grow as a person, and get a better understanding of the world (or at least that's why we built it).

54
How the Live effect run-time is implemented In this post I describe how the internals of the Live run-time, which drives Use.GPU, are implemented. Some pre-existing React and FP effect knowledge is useful. I have written about Live before, but in general terms. You may therefore have a wrong impression of this endeavor. When a junior engineer sees an application doing complex things, they're often intimidated by the prospect of working on it. They assume that complex functionality must be the result of complexity in code. The fancier the app, the less understandable it must be. This is what their experience has been so far: seniority correlates to more and hairier lines of code. After 30 years of coding though, I know it's actually the inverse. You cannot get complex functionality working if you've wasted all your complexity points on the code itself. This is the main thing I want to show here, because this post mainly describes 1 data structure and a handful of methods. Live has a...
a year ago


More from Acko.net

The Bouquet Residence

Keeping up appearances in tech I saw a remarkable pair of tweets the other day. In the wake of the outage, the CEO of CrowdStrike sent out a public announcement. It's purely factual. The scope of the problem is identified, the known facts are stated, and the logistics of disaster relief are set in motion. Millions of computers were affected. This is the equivalent of a frazzled official giving a brief statement in the aftermath of an earthquake, directing people to the Red Cross. Everything is basically on fire for everyone involved. Systems are failing everywhere, some critical, and quite likely people are panicking. The important thing is to give the technicians the information and tools to fix it, and for everyone else to do what they can, and stay out of the way. In response, a communication professional posted an 'improved' version: Credit where credit is due, she nailed the style. 10/10. It seems unobjectionable, at first. Let's go through, shall we? Opposite Day First is that the CEO is "devastated." A feeling. And they are personally going to ensure it's fixed for every single user. This focuses on the individual who is inconvenienced. Not the disaster. They take a moment out of their time to say they are so, so sorry a mistake was made. They have let you and everyone else down, and that shouldn't happen. That's their responsibility. By this point, the original statement had already told everyone the relevant facts. Here the technical details are left to the imagination. The writer's self-assigned job is to wrap the message in a more palatable envelope. Everyone will be working "all day, all night, all weekends," indeed, "however long it takes," to avoid it happening again. I imagine this is meant to be inspiring and reassuring. But if I were a CrowdStrike technician or engineer, I would find it demoralizing: the boss, who will actually be personally fixing diddly-squat, is saying that the long hours of others are a sacrifice they're willing to make. Plus, CrowdStrike's customers are in the same boat: their technicians get volunteered too. They can't magically unbrick PCs from a distance, so "until it's fully fixed for every single user" would be a promise outsiders will have to keep. Lovely. There's even a punch line: an invitation to go contact them, the quickest way linked directly. It thanks people for reaching out. If everything is on fire, that includes the phone lines, the inboxes, and so on. The most stupid thing you could do in such a situation is to tell more people to contact you, right away. Don't encourage it! That's why the original statement refers to pre-existing lines of communication, internal representatives, and so on. The Support department would hate the CEO too. Root Cause If you're wondering about the pictures, it's Hyacinth Bucket, from the 90s UK sitcom Keeping Up Appearances, who would always insist "it's pronounced Bouquet." Hyacinth's ambitions always landed her out of her depth, surrounded by upper-class people she's trying to impress, in the midst of an embarrassing disaster. Her increasingly desperate attempts to save face, which invariably made things worse, are the main source of comedy. Try reading that second statement in her voice. I’m devastated to see the scale of today’s outage and will be personally working on it together with our team until it’s fully fixed for every single user. But I wanted to take a moment to come here and tell you that I am sorry. People around the world rely on us, and incidents like this can’t happen.
This came from an error that ultimately is my responsibility. I can hear it perfectly, telegraphing Britishness to restore dignity for all. If she were in tech she would give that statement. It's about reputation management first, projecting the image of competence and accountability. But she's giving the speech in front of a burning building, not realizing the entire exercise is futile. Worse, she thinks she's nailing it. If CrowdStrike had sent this out, some would've applauded and called it an admirable example of wise and empathetic communication. Real leadership qualities. But it's the exact opposite. It focuses on the wrong things, it alienates the staff, and it definitely amplifies the chaos. It's Monty Python-esque. Apologizing is pointless here, the damage is already done. What matters is how severe it is and whether it could've been avoided. This requires a detailed root-cause analysis and remedy. Otherwise you only have their word. Why would that re-assure you? The original restated the company's mission: security and stability. Those are the stakes to regain a modicum of confidence. You may think that I'm reading too much into this. But I know the exact vibe on an engineering floor when the shit hits the fan. I also know how executives and staff without that experience end up missing the point entirely. I once worked for a Hyacinth Bucket. It's not an anecdote, it's allegory. They simply don't get the engineering mindset, and confuse authority with ownership. They step on everyone's toes without realizing, because they're constantly wearing clown shoes. Nobody tells them. Softness as a Service The change in style between #1 and #2 is really a microcosm of the conflict that has been broiling in tech for ~15 years now. I don't mean the politics, but the shifting of norms, of language and behavior. It's framed as a matter of interpersonal style, which needs to be welcoming and inclusive. In practice this means they assert or demand that style #2 be the norm, even when #1 is advisable or required. Factuality is seen as deficient, improper and primitive. It's a form of doublethink: everyone's preference is equally valid, except yours, specifically. But the difference is not a preference. It's about what actually works and what doesn't. Style #1 is aimed at the people who have to fix it. Style #2 is aimed at the people who can't do anything until it's fixed. Who should they be reaching out to? In #2, communication becomes an end in itself, not a means of conveying information. It's about being seen saying the words, not living them. Poking at the statement makes it fall apart. When this becomes the norm in a technical field, it has deep consequences: Critique must be gift-wrapped in flattery, and is not allowed to actually land. Mistakes are not corrected, and sentiment takes precedence over effectiveness. Leaders speak lofty words far from the trenches to save face. The people they thank the loudest are the ones they pay the least. Inevitably, quiet competence is replaced with gaudy chaos. Everyone says they're sorry and responsible, but nobody actually is. Nobody wants to resign either. Sound familiar? Cope and Soothe The elephant in the room is that #1 is very masculine, while #2 is more feminine. When you hear "women are more empathetic communicators", this is what they mean. They tend to focus on the individual and their relation to them, not the team as a whole and its mission. Complaints that tech is too "male dominated" and "notoriously hostile to women" are often just this. 
Tech was always full of types who won't preface their proposals and criticisms with fluff, and instead lean into autism. When you're used to being pandered to, neutrality feels like vulgarity. The notable exceptions are rare and usually have an exasperating lead up. Tech is actually one of the most accepting and egalitarian fields around. The maintainers do a mostly thankless job. "Oh so you're saying there's no misogyny in tech?" No I'm just saying misogyny doesn't mean "something 1 woman hates". The tone is really a distraction. If someone drops an analysis, saying shit or get off the pot, even very kindly and patiently, some will still run away screaming. Like an octopus spraying ink, they'll deploy a nasty form of #2 as a distraction. That's the real issue. Many techies, in their naiveté, believed the cultural reformers when they showed up to gentrify them. They obediently branded heretics like James Damore, and burned witches like Richard Stallman. Thanks to racism, words like 'master' and 'slave' are now off-limits as technical terms. Ironic, because millions of computers just crashed because they worked exactly like that. The cope is to pretend that nothing has truly changed yet, and more reform is needed. In fact, everything has already changed. Tech forums used to be crucibles for distilling insight, but now they are guarded jealously by people more likely to flag and ban than strongly disagree. I once got flagged on HN because I pointed out Twitter's mass lay-offs were a response to overhiring, and that people were rooting for the site to fail after Musk bought it. It suggested what we all know now: that the company would not implode after trimming the dead weight, and that they'd never forgive him for it. Diversity is now associated with incompetence, because incompetent people have spent over a decade reaching for it as an excuse. In their attempts to fight stereotypes, they ensured the stereotypes came true. Bait and Snitch The outcry tends to be: "We do all the same things you do, but still we get treated differently!" But they start from the conclusion and work their way backwards. This is what the rewritten statement does: it tries to fix the relationship before fixing the problem. The average woman and man actually do things very differently in the first place. Individual men and women choose. And others respond accordingly. The people who build and maintain the world's infrastructure prefer the masculine style for a reason: it keeps civilization running, and helps restore it when it breaks. A disaster announcement does not need to be relatable, it needs to be effective. Furthermore, if the job of shoveling shit falls on you, no amount of flattery or oversight will make that more pleasant. It really won't. Such commentary is purely for the benefit of the ones watching and trying to look busy. It makes it worse, stop pretending otherwise. There's little loyalty in tech companies nowadays, and it's no surprise. Project and product managers are acting more like demanding clients to their own team, than leaders. "As a user, I want..." Yes, but what are you going to do about it? Do you even know where to start? What's perceived as a lack of sensitivity is actually the presence of sensibility. It's what connects the words to the reality on the ground. It does not need to be improved or corrected, it just needs to be respected. And yes it's a matter of gender, because bashing men and masculine norms has become a jolly recreational sport in the overculture. Mature women know it. 
It seems impossible to admit. The entire edifice of gender equality depends on there not being a single thing men are actually better at, even just on average. Where men and women's instincts differ, women must be right. It's childish, and not harmless either. It dares you to call it out, so they can then play the wounded victim, and paint you as the unreasonable asshole who is mean. This is supposed to invalidate the argument. * * * This post is of course a giant cannon pointing in the opposite direction, sitting on top of a wall. Its message will likely fly over the reformers' heads. If they read it at all, they'll selectively quote or paraphrase, call me a tech-bro, and spool off some sentences they overheard, like an LLM. It's why they adore AI, and want it to be exactly as sycophantic as them. They don't care that it makes stuff up wholesale, because it makes them look and feel competent. It will never tell them to just fuck off already. Think less about what is said, more about what is being done. Otherwise the next CrowdStrike will probably be worse.

7 months ago 71 votes
I is for Intent

Why your app turned into spaghetti There's a certain kind of programmer. Let's call him Stanley. Stanley has been around for a while, and has his fair share of war stories. The common thread is that poorly conceived and specced solutions lead to disaster, or at least, ongoing misery. As a result, he has adopted a firm belief: it should be impossible for his program to reach an invalid state. Stanley loves strong and static typing. He's a big fan of pattern matching, and enums, and discriminated unions, which allow correctness to be verified at compile time. He also has strong opinions on errors, which must be caught, logged and prevented. He uses only ACID-compliant databases, wants foreign keys and triggers to be enforced, and wraps everything in atomic transactions. He hates any source of uncertainty or ambiguity, like untyped JSON or plain-text markup. His APIs will accept data only in normalized and validated form. When you use a Stanley lib, and it doesn't work, the answer will probably be: "you're using it wrong." Stanley is most likely a back-end or systems-level developer. Because nirvana in front-end development is reached when you understand that this view of software is not just wrong, but fundamentally incompatible with the real world. I will prove it. State Your Intent Take a text editor. What happens if you press the up and down arrows? The keyboard cursor (aka caret) moves up and down. Duh. Except it also moves left and right. The editor state at the start has the caret on line 1 column 6. Pressing down will move it to line 2 column 6. But line 2 is too short, so the caret is forcibly moved left to column 1. Then, pressing down again will move it back to column 6. It should be obvious that any editor that didn't remember which column you were actually on would be a nightmare to use. You know it in your bones. Yet this only works because the editor allows the caret to be placed on a position that "does not exist." What is the caret state in the middle? It is both column 1 and column 6. To accommodate this, you need more than just a View that is a pure function of a State, as is now commonly taught. Rather, you need an Intent, which is the source of truth that you mutate... and which is then parsed and validated into a State. Only then can it be used by the View to render the caret in the right place. To edit the intent, aka what a classic Controller does, is a bit tricky. When you press left/right, it should determine the new Intent.column based on the validated State.column +/- 1. But when you press up/down, it should keep the Intent.column you had before and instead change only Intent.line. New intent is a mixed function of both previous intent and previous state. The general pattern is that you reuse Intent if it doesn't change, but that new computed Intent should be derived from State. Note that you should still enforce normal validation of Intent.column when editing too: you don't allow a user to go past the end of a line. Any new intent should be as valid as possible, but old intent should be preserved as is, even if non-sensical or inapplicable. Functionally, for most of the code, it really does look and feel as if the state is just State, which is valid. It's just that when you make 1 state change, the app may decide to jump into a different State than one would think. When this happens, it means some old intent first became invalid, but then became valid again due to a subsequent intent/state change. This is how applications actually work IRL. FYI. 
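To make the caret example concrete, here is a minimal sketch of the Intent/State split described above. The names (Intent, State, validate, moveCaret) and the tiny document are hypothetical, not the article's actual code:

type Intent = { line: number; column: number };
type State = { line: number; column: number };

const lines = ["Hello!", "Hi", "Greetings"];

// Parse and validate intent into state: clamp to positions that actually exist.
function validate(intent: Intent): State {
  const line = Math.max(0, Math.min(intent.line, lines.length - 1));
  const column = Math.max(0, Math.min(intent.column, lines[line].length));
  return { line, column };
}

// Controller: new intent is a mixed function of previous intent and previous state.
function moveCaret(intent: Intent, key: "up" | "down" | "left" | "right"): Intent {
  const state = validate(intent);
  if (key === "up")   return { line: Math.max(0, state.line - 1), column: intent.column };
  if (key === "down") return { line: Math.min(lines.length - 1, state.line + 1), column: intent.column };
  if (key === "left") return { line: state.line, column: Math.max(0, state.column - 1) };
  return { line: state.line, column: Math.min(lines[state.line].length, state.column + 1) };
}

// Start on line 0, column 6. "down" clamps the rendered state to column 2 ("Hi"),
// but the intent still remembers column 6, so a second "down" restores it.
let caret: Intent = { line: 0, column: 6 };
caret = moveCaret(caret, "down");   // state: line 1, column 2; intent column stays 6
caret = moveCaret(caret, "down");   // state: line 2, column 6

Note that up/down reuse the old Intent.column verbatim, while left/right derive a fresh column from the validated State, exactly the mixed rule described above.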
Knives and Forks I chose a text editor as an example because Stanley can't dismiss this as just frivolous UI polish for limp-wristed homosexuals. It's essential that editors work like this. The pattern is far more common than most devs realize: A tree view remembers the expand/collapse state for rows that are hidden. Inspector tabs remember the tab you were on, even if currently disabled or inapplicable. Toggling a widget between type A/B/C should remember all the A, B and C options, even if they are mutually exclusive. All of these involve storing and preserving something unknown, invalid or unused, and bringing it back into play later. More so, if software matches your expected intent, it's a complete non-event. What looks like a "surprise hidden state transition" to a programmer is actually the exact opposite. It would be an unpleasant surprise if that extra state transition didn't occur. It would only annoy users: they already told the software what they wanted, but it keeps forgetting. The ur-example is how nested popup menus should work: good implementations track the motion of the cursor so you can move diagonally from parent to child, without falsely losing focus: This is an instance of the goalkeeper's curse: people rarely compliment or notice the goalkeeper if they do their job, only if they mess up. Successful applications of this principle are doomed to remain unnoticed and unstudied. Validation is not something you do once, discarding the messy input and only preserving the normalized output. It's something you do continuously and non-destructively, preserving the mess as much as possible. It's UI etiquette: the unspoken rules that everyone expects but which are mostly undocumented folklore. This poses a problem for most SaaS in the wild, both architectural and existential. Most APIs will only accept mutations that are valid. The goal is for the database to be a sequence of fully valid states: The smallest possible operation in the system is a fully consistent transaction. This flattens any prior intent. In practice, much software deviates from this ad hoc. For example, spreadsheets let you create cyclic references, which is by definition invalid. The reason it must let you do this is because fixing one side of a cyclic reference also fixes the other side. A user wants and needs to do these operations in any order. So you must allow a state transition through an invalid state: This requires an effective Intent/State split, whether formal or informal. Because cyclic references can go several levels deep, identifying one cyclic reference may require you to spider out the entire dependency graph. This is functionally equivalent to identifying all cyclic references—dixit Dijkstra. Plus, you need to produce sensible, specific error messages. Many "clever" algorithmic tricks fail this test. Now imagine a spreadsheet API that doesn't allow for any cyclic references ever. This still requires you to validate the entire resulting model, just to determine if 1 change is allowed. It still requires a general validate(Intent). In short, it means your POST and PUT request handlers need to potentially call all your business logic. That seems overkill, so the usual solution is bespoke validators for every single op. If the business logic changes, there is a risk your API will now accept invalid intent. And the app was not built for that. If you flip it around and assume intent will go out-of-bounds as a normal matter, then you never have this risk.
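As an illustration of validating continuously instead of refusing the write, here is a minimal sketch (hypothetical types and names, not from the article) of a spreadsheet Intent where cyclic references are allowed to exist and are merely flagged:

type CellIntent = { formula: string; refs: string[] };   // refs: names of referenced cells
type CellState = CellIntent & { cyclic: boolean };

// Validation doesn't reject the mutation; it recomputes which cells sit on a cycle.
function validateSheet(cells: Record<string, CellIntent>): Record<string, CellState> {
  const state: Record<string, CellState> = {};
  for (const name of Object.keys(cells)) {
    const seen = new Set<string>();
    const stack = [...cells[name].refs];
    let cyclic = false;
    while (stack.length > 0) {
      const ref = stack.pop() as string;
      if (ref === name) { cyclic = true; break; }         // we can reach ourselves: cycle
      if (seen.has(ref) || !(ref in cells)) continue;
      seen.add(ref);
      stack.push(...cells[ref].refs);
    }
    state[name] = { ...cells[name], cyclic };
  }
  return state;
}

// Both edits are accepted in either order; while the cycle exists, A1 and B1 are
// flagged with a specific error instead of the write being refused.
const sheet = validateSheet({
  A1: { formula: "=B1", refs: ["B1"] },
  B1: { formula: "=A1", refs: ["A1"] },
});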
You can write the validation in one place, and you reuse it for every change as a normal matter of data flow. Note that this is not cowboy coding. Records and state should not get irreversibly corrupted, because you only ever use valid inputs in computations. If the system is multiplayer, distributed changes should still be well-ordered and/or convergent. But the data structures you're encoding should be, essentially, entirely liberal to your user's needs. Consider git. Here, a "unit of intent" is just a diff applied to a known revision ID. When something's wrong with a merge, it doesn't crash, or panic, or refuse to work. It just enters a conflict state. This state is computed by merging two incompatible intents. It's a dirty state that can't be turned into a clean commit without human intervention. This means git must continue to work, because you need to use git to clean it up. So git is fully aware when a conflict is being resolved. As a general rule, the cases where you actually need to forbid a mutation which satisfies all the type and access constraints are small. A good example is trying to move a folder inside itself: the file system has to remain a sensibly connected tree. Enforcing the uniqueness of names is similar, but also comes with a caution: falsehoods programmers believe about names. Adding (Copy) to a duplicate name is usually better than refusing to accept it, and most names in real life aren't unique at all. Having user-facing names actually requires creating tools and affordances for search, renaming references, resolving duplicates, and so on. Even among front-end developers, few people actually grok this mental model of a user. It's why most React(-like) apps in the wild are spaghetti, and why most blog posts about React gripes continue to miss the bigger picture. Doing React (and UI) well requires you to unlearn old habits and actually design your types and data flow so it uses potentially invalid input as its single source of truth. That way, a one-way data flow can enforce the necessary constraints on the fly. The way Stanley likes to encode and mutate his data is how programmers think about their own program: it should be bug-free and not crash. The mistake is to think that this should also apply to any sort of creative process that program is meant to enable. It would be like making an IDE that only allows you to save a file if the code compiles and passes all the tests. Trigger vs Memo Coding around intent is a very hard thing to teach, because it can seem overwhelming. But what's overwhelming is not doing this. It leads to codebases where every new feature makes ongoing development harder, because no part of the codebase is ever finished. You will sprinkle copies of your business logic all over the place, in the form of request validation, optimistic local updaters, and guess-based cache invalidation. If this is your baseline experience, your estimate of what is needed to pull this off will simply be wrong. In the traditional MVC model, intent is only handled at the individual input widget or form level. e.g. While typing a number, the intermediate representation is a string. This may be empty, incomplete or not a number, but you temporarily allow that. I've never seen people formally separate Intent from State in an entire front-end. Often their state is just an adhoc mix of both, where validation constraints are relaxed in the places where it was most needed. They might just duplicate certain fields to keep a validated and unvalidated variant side by side. 
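For the number-field case just mentioned, the widget-level version of the split is tiny; a sketch with hypothetical names:

type FieldIntent = { raw: string };                       // whatever was typed, even "" or "1e"
type FieldState = { value: number; valid: boolean };

function validateField(intent: FieldIntent): FieldState {
  const value = parseFloat(intent.raw);
  return Number.isFinite(value)
    ? { value, valid: true }
    : { value: 0, valid: false };
}

// Typing "-" or "12." is preserved as intent; downstream code only ever reads the
// last valid state, so nothing else has to handle half-typed numbers.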
There is one common exception. In a React-like, when you do a useMemo with a derived computation of some state, this is actually a perfect fit. The eponymous useState actually describes Intent, not State, because the derived state is ephemeral. This is why so many devs get lost here. const state = useMemo( () => validate(intent), [intent] ); Their usual instinct is that every action that has knock-on effects should be immediately and fully realized, as part of one transaction. Only, they discover some of those knock-on effects need to be re-evaluated if certain circumstances change. Often to do so, they need to undo and remember what it was before. This is then triggered anew via a bespoke effect, which requires a custom trigger and mutation. If they'd instead deferred the computation, it could have auto-updated itself, and they would've still had the original data to work with. e.g. In a WYSIWYG scenario, you often want to preview an operation as part of mouse hovering or dragging. It should look like the final result. You don't need to implement custom previewing and rewinding code for this. You just need the ability to layer on some additional ephemeral intent on top of the intent that is currently committed. Rewinding just means resetting that extra intent back to empty. You can make this easy to use by treating previews as a special kind of transaction: now you can make preview states with the same code you use to apply the final change. You can also auto-tag the created objects as being preview-only, which is very useful. That is: you can auto-translate editing intent into preview intent automatically, by messing with the contents of a transaction. Sounds bad, is actually good. The same applies to any other temporary state, for example, highlighting of elements. Instead of manually changing colors, and creating/deleting labels to pull this off, derive the resolved style just-in-time. This is vastly simpler than doing it all on 1 classic retained model. There, you run the risk of highlights incorrectly becoming sticky, or shapes reverting to the wrong style when un-highlighted. You can architect it so this is simply impossible. The trigger vs memo problem also happens on the back-end, when you have derived collections. Each object of type A must have an associated type B, created on-demand for each A. What happens if you delete an A? Do you delete the B? Do you turn the B into a tombstone? What if the relationship is 1-to-N, do you need to garbage collect? If you create invisible objects behind the scenes as a user edits, and you never tell them, expect to see a giant mess as a result. It's crazy how often I've heard engineers suggest a user should only be allowed to create something, but then never delete it, as a "solution" to this problem. Everyday undo/redo precludes it. Don't be ridiculous. The problem is having an additional layer of bookkeeping you didn't need. The source of truth was collection A, but you created a permanent derived collection B. If you instead make B ephemeral, derived via a stateless computation, then the problem goes away. You can still associate data with B records, but you don't treat B as the authoritative source for itself. This is basically what a WeakMap is. In database land this can be realized with a materialized view, which can be incremental and subscribed to. Taken to its extreme, this turns into event-based sourcing, which might seem like a panacea for this mindset. But in most cases, the latter is still a system by and for Stanley. 
The event-based nature of those systems exists to support housekeeping tasks like migration, backup and recovery. Users are not supposed to be aware that this is happening. They do not have any view into the event log, and cannot branch and merge it. The exceptions are extremely rare. It's not a system for working with user intent, only for flattening it, because it's append-only. It has a lot of the necessary basic building blocks, but substitutes programmer intent for user intent. What's most nefarious is that the resulting tech stacks are often quite big and intricate, involving job queues, multi-layered caches, distribution networks, and more. It's a bunch of stuff that Stanley can take joy and pride in, far away from users, with "hard" engineering challenges. Unlike all this *ugh* JavaScript, which is always broken and unreliable and uninteresting. Except it's only needed because Stanley only solved half the problem, badly. Patch or Bust When factored in from the start, it's actually quite practical to split Intent from State, and it has lots of benefits. Especially if State is just a more constrained version of the same data structures as Intent. This doesn't need to be fully global either, but it needs to encompass a meaningful document or workspace to be useful. It does create an additional problem: you now have two kinds of data in circulation. If reading or writing requires you to be aware of both Intent and State, you've made your code more complicated and harder to reason about. More so, making a new Intent requires a copy of the old Intent, which you mutate or clone. But you want to avoid passing Intent around in general, because it's fishy data. It may have the right types, but the constraints and referential integrity aren't guaranteed. It's a magnet for the kind of bugs a type-checker won't catch. I've published my common solution before: turn changes into first-class values, and make a generic update of type Update<T> be the basic unit of change. As a first approximation, consider a shallow merge {...value, ...update}. This allows you to make an updateIntent(update) function where update only specifies the fields that are changing. In other words, Update<Intent> looks just like Update<State> and can be derived 100% from State, without Intent. Only one place needs to have access to the old Intent, all other code can just call that. You can make an app intent-aware without complicating all the code. If your state is cleaved along orthogonal lines, then this is all you need. i.e. If column and line are two separate fields, then you can selectively change only one of them. If they are stored as an XY tuple or vector, now you need to be able to describe a change that only affects either the X or Y component. const value = { hello: 'text', foo: { bar: 2, baz: 4 }, }; const update = { hello: 'world', foo: { baz: 50 }, }; expect( patch(value, update) ).toEqual({ hello: 'world', foo: { bar: 2, baz: 50 }, }); So in practice I have a function patch(value, update) which implements a comprehensive superset of a deep recursive merge, with full immutability. It doesn't try to do anything fancy with arrays or strings, they're just treated as atomic values. But it allows for precise overriding of merging behavior at every level, as well as custom lambda-based updates. You can patch tuples by index, but this is risky for dynamic lists. So instead you can express e.g. "append item to list" without the entire list, as a lambda. I've been using patch for years now, and the uses are myriad. 
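A stripped-down sketch of what such a patch can look like, under the constraints described above: plain objects merge recursively, arrays and strings are treated as atomic values, and a function field acts as a lambda-based update. This is an illustration, not the author's actual library:

function patch<T>(value: T, update: any): T {
  if (typeof update === "function") return update(value);          // lambda-based update
  const isObj = (x: any) => x !== null && typeof x === "object" && !Array.isArray(x);
  if (isObj(value) && isObj(update)) {
    const out: any = { ...(value as any) };                         // immutable: copy, then override
    for (const key of Object.keys(update)) out[key] = patch((value as any)[key], update[key]);
    return out;
  }
  return update as T;                                               // atomic replacement
}

// patch({hello: 'text', foo: {bar: 2, baz: 4}}, {hello: 'world', foo: {baz: 50}})
//   => {hello: 'world', foo: {bar: 2, baz: 50}}
// patch({list: [1, 2]}, {list: (xs: number[]) => [...xs, 3]})      // "append" as a lambda
//   => {list: [1, 2, 3]}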
To overlay a set of overrides onto a base template, patch(base, overrides) is all you need. It's the most effective way I know to erase a metric ton of {...splats} and ?? defaultValues and != null from entire swathes of code. This is a real problem. You could also view this as a "poor man's OT", with the main distinction being that a patch update only describes the new state, not the old state. Such updates are not reversible on their own. But they are far simpler to make and apply. It can still power a global undo/redo system, in combination with its complement diff(A, B): you can reverse an update by diffing in the opposite direction. This is an operation which is formalized and streamlined into revise(…), so that it retains the exact shape of the original update, and doesn't require B at all. The structure of the update is sufficient information: it too encodes some intent behind the change. With patch you also have a natural way to work with changes and conflicts as values. The earlier WYSIWYG scenario is just patch(committed, ephemeral) with bells on. The net result is that mutating my intent or state is as easy as doing a {...value, ...update} splat, but I'm not falsely incentivized to flatten my data structures. Instead it frees you up to think about what the most practical schema actually is from the data's point of view. This is driven by how the user wishes to edit it, because that's what you will connect it to. It makes you think about what a user's workspace actually is, and lets you align boundaries in UX and process with boundaries in data structure. Remember: most classic "data structures" are not about the structure of data at all. They serve as acceleration tools to speed up specific operations you need on that data. Having the reads and writes drive the data design was always part of the job. What's weird is that people don't apply that idea end-to-end, from database to UI and back. SQL tables are shaped the way they are because that shape enables complex filters and joins. However, I find this pretty counterproductive: it produces derived query results that are difficult to keep up to date on a client. They also don't look like any of the data structures I actually want to use in my code. A Bike Shed of Schemas This points to a very under-appreciated problem: it is completely pointless to argue about schemas and data types without citing specific domain logic and code that will be used to produce, edit and consume it. Because that code determines which structures you are incentivized to use, and which structures will require bespoke extra work. From afar, column and line are just XY coordinates. Just use a 2-vector. But once you factor in the domain logic and etiquette, you realize that the horizontal and vertical directions have vastly different rules applied to them, and splitting might be better. Which one do you pick? This applies to all data. Whether you should put items in a List<T> or a Map<K, V> largely depends on whether the consuming code will loop over it, or need random access. If an API only provides one, consumers will just build the missing Map or List as a first step. This is O(n log n) either way, because of sorting. The method you use to read or write your data shouldn't limit use of everyday structure. Not unless you have a very good reason. But this is exactly what happens. A lot of bad choices in data design come down to picking the "wrong" data type simply because the most appropriate one is inconvenient in some cases.
This then leads to Conway's law, where one team picks the types that are most convenient only for them. The other teams are stuck with it, and end up writing bidirectional conversion code around their part, which will never be removed. The software will now always have this shape, reflecting which concerns were considered essential. What color are your types?

{
  order: [4, 11, 9, 5, 15, 43],
  values: {
    4: {...},
    5: {...},
    9: {...},
    11: {...},
    15: {...},
    43: {...},
  },
}

For List vs Map, you can have this particular cake and eat it too. Just provide a List<Id> for the order and a Map<Id, T> for the values. If you structure a list or tree this way, then you can do both iteration and ID-based traversal in the most natural and efficient way. Don't underestimate how convenient this can be. This also has the benefit that "re-ordering items" and "editing items" are fully orthogonal operations. It decomposes the problem of "patching a list of objects" into "patching a list of IDs" and "patching N separate objects". It makes code for manipulating lists and trees universal. It lets you decide on a case-by-case basis whether you need to garbage collect the map, or whether preserving unused records is actually desirable. Limiting it to ordinary JSON or JS types, rather than going full-blown OT or CRDT, is a useful baseline. With sensible schema design, at ordinary editing rates, CRDTs are overkill compared to the ability to just replay edits, or notify conflicts. This only requires version numbers and retries. Users need those things anyway: just because a CRDT converges when two people edit, doesn't mean the result is what either person wants. The only case where OTs/CRDTs are absolutely necessary is rich-text editing, and you need bespoke UI solutions for that anyway. For simple text fields, last-write-wins is perfectly fine, and also far superior to what 99% of RESTy APIs do. A CRDT is just a mechanism that translates partially ordered intents into a single state. Like, it's cool that you can make CRDT counters and CRDT lists and whatnot... but each CRDT implements only one particular resolution strategy. If it doesn't produce the desired result, you've created invalid intent no user expected. With last-write-wins, you at least have something 1 user did intend. Whether this is actually destructive or corrective is mostly a matter of schema design and minimal surface area, not math. The main thing that OTs and CRDTs do well is resolve edits on ordered sequences, like strings. If two users are typing text in the same doc, edits higher-up will shift edits down below, which means the indices change when rebased. But if you are editing structured data, you can avoid referring to indices entirely, and just use IDs instead. This sidesteps the issue, like splitting order from values. For the order, there is a simple solution: a map with a fractional index, effectively a dead-simple list CRDT. It just comes with some overhead. Using a CRDT for string editing might not even be enough. Consider Google Docs-style comments anchored to that text: their indices also need to shift on every edit. Now you need a bespoke domain-aware CRDT. Or you work around it by injecting magic markers into the text. Either way, it seems non-trivial to decouple a CRDT from the specific target domain of the data inside. The constraints get mixed in. If you ask me, this is why the field of real-time web apps is still in somewhat of a rut.
It's mainly viewed as a high-end technical problem: how do we synchronize data structures over a P2P network without any data conflicts? What they should be asking is: what is the minimal amount of structure we need to reliably synchronize, so that users can have a shared workspace where intent is preserved, and conflicts are clearly signposted. And how should we design our schemas, so that our code can manipulate the data in a straightforward and reliable way? Fixing non-trivial user conflicts is simply not your job. Most SaaS out there doesn't need any of this technical complexity. Consider that a good multiplayer app requires user presence and broadcast anyway. The simplest solution is just a persistent process on a single server coordinating this, one per live workspace. It's what most MMOs do. In fast-paced video games, this even involves lag compensation. Reliable ordering is not the big problem. The situations where this doesn't scale, or where you absolutely must be P2P, are a minority. If you run into them, you must be doing very well. The solution that I've sketched out here is explicitly designed so it can comfortably be done by small teams, or even just 1 person. The (private) CAD app I showed glimpses of above is entirely built this way. It's patch all the way down and it's had undo/redo from day 1. It also has a developer mode where you can just edit the user-space part of the data model, and save/load it. When the in-house designers come to me with new UX requests, they often ask: "Is it possible to do ____?" The answer is never a laborious sigh from a front-end dev with too much on their plate. It's "sure, and we can do more." If you're not actively aware the design of schemas and code is tightly coupled, your codebase will explode, and the bulk of it will be glue. Much of it just serves to translate generalized intent into concrete state or commands. Arguments about schemas are usually just hidden debates about whose job it is to translate, split or join something. This isn't just an irrelevant matter of "wire formats" because changing the structure and format of data also changes how you address specific parts of it. In an interactive UI, you also need a reverse path, to apply edits. What I hope you are starting to realize is that this is really just the forward path in reverse, on so many levels. The result of a basic query is just the ordered IDs of the records that it matched. A join returns a tuple of record IDs per row. If you pre-assemble the associated record data for me, you actually make my job as a front-end dev harder, because there are multiple forward paths for the exact same data, in subtly different forms. What I want is to query and mutate the same damn store you do, and be told when what changes. It's table-stakes now. With well-architected data, this can be wired up mostly automatically, without any scaffolding. The implementations you encounter in the wild just obfuscate this, because they don't distinguish between the data store and the model it holds. The fact that the data store should not be corruptible, and should enforce permissions and quotas, is incorrectly extended to the entire model stored inside. But that model doesn't belong to Stanley, it belongs to the user. This is why desktop applications didn't have a "Data Export". It was just called Load and Save, and what you saved was the intent, in a file. 
Having a universal query or update mechanism doesn't absolve you from thinking about this either, which is why I think the patch approach is so rare: it looks like cowboy coding if you don't have the right boundaries in place. Patch is mainly for user-space mutations, not kernel-space, a concept that applies to more than just OS kernels. User-space must be very forgiving. If you avoid it, you end up with something like GraphQL, a good example of solving only half the problem badly. Its getter assembles data for consumption by laboriously repeating it in dozens of partial variations. And it turns the setter part into an unsavory mix of lasagna and spaghetti. No wonder, it was designed for a platform that owns and hoards all your data. * * * Viewed narrowly, Intent is just a useful concept to rethink how you enforce validation and constraints in a front-end app. Viewed broadly, it completely changes how you build back-ends and data flows to support that. It will also teach you how adding new aspects to your software can reduce complexity, not increase it, if done right. A good metric is to judge implementation choices by how many other places of the code need to care about them. If a proposed change requires adjustments literally everywhere else, it's probably a bad idea, unless the net effect is to remove code rather than add. I believe reconcilers like React or tree-sitter are a major guide stone here. What they do is apply structure-preserving transforms on data structures, and incrementally. They actually do the annoying part for you. I based Use.GPU on the same principles, and use it to drive CPU canvases too. The tree-based structure reflects that one function's state just might be the next function's intent, all the way down. This is a compelling argument that the data and the code should have roughly the same shape. You will also conclude there is nothing more nefarious than a hard split between back-end and front-end. You know, coded by different people, where each side is only half-aware of the other's needs, but one sits squarely in front of the other. Well-intentioned guesses about what the other end needs will often be wrong. You will end up with data types and query models that cannot answer questions concisely and efficiently, and which must be babysat to not go stale. In the last 20 years, little has changed here in the wild. On the back-end, it still looks mostly the same. Even when modern storage solutions are deployed, people end up putting SQL- and ORM-like layers on top, because that's what's familiar. The split between back-end and database has the exact same malaise. None of this work actually helps make the app more reliable, it's the opposite: every new feature makes ongoing development harder. Many "solutions" in this space are not solutions, they are copes. Maybe we're overdue for a NoSQL revival, this time with a focus on practical schema design and mutation? SQL was designed to model administrative business processes, not live interaction. I happen to believe a front-end should sit next to the back-end, not in front of it, with only a thin proxy as a broker. What I can tell you for sure is: it's so much better when intent is a first-class concept. You neither need nor want to treat user data as something to pussy-foot around, or handle like it's radioactive. You can manipulate and transport it without a care. You can build rich, comfy functionality on top. Once implemented, you may find yourself not touching your network code for a very long time.
It's the opposite of overwhelming, it's lovely. You can focus on building the tools your users need. This can pave the way for more advanced concepts like OT and CRDT, but will show you that neither of them is a substitute for getting your application fundamentals right. In doing so, you reach a synthesis of Dijkstra and anti-Dijkstra: your program should be provably correct in its data flow, which means it can safely break in completely arbitrary ways. Because the I in UI meant "intent" all along. More: On Variance and Extensibility · Climbing Mount Effect · APIs are About Policy

a year ago 23 votes
Stable Fiddusion

Frequency-domain blue noise generator In computer graphics, stochastic methods are so hot right now. All rendering turns into calculus, except you solve the integrals by numerically sampling them. As I showed with Teardown, this is all based on random noise, hidden with a ton of spatial and temporal smoothing. For this, you need a good source of high quality noise. There have been a few interesting developments in this area, such as Alan Wolfe et al.'s Spatio-Temporal Blue Noise. This post is about how I designed noise in frequency space. I will cover: What is blue noise? Designing indigo noise How swap works in the frequency domain Heuristics and analysis to speed up search Implementing it in WebGPU Along the way I will also show you some "street" DSP math. This illustrates how getting comfy in this requires you to develop deep intuition about complex numbers. But complex doesn't mean complicated. It can all be done on a paper napkin. The WebGPU interface I built What I'm going to make is this: If properly displayed, this image should look eerily even. But if your browser is rescaling it incorrectly, it may not be exactly right. Colorless Blue Ideas I will start by just recapping the essentials. If you're familiar, skip to the next section. Ordinary random generators produce uniform white noise: every value is equally likely, and the average frequency spectrum is flat. Time domain Frequency domain To a person, this doesn't actually seem fully 'random', because it has clusters and voids. Similarly, a uniformly random list of coin flips will still have long runs of heads or tails in it occasionally. What a person would consider evenly random is usually blue noise: it prefers to alternate between heads and tails, and avoids long runs entirely. It is 'more random than random', biased towards the upper frequencies, i.e. the blue part of the spectrum. Time domain Frequency domain Blue noise is great for e.g. dithering, because when viewed from afar, or blurred, it tends to disappear. With white noise, clumps remain after blurring: Blurred white noise Blurred blue noise Blueness is a delicate property. If you have e.g. 3D blue noise in a volume XYZ, then a single 2D XY slice is not blue at all: XYZ spectrum XY slice XY spectrum The samples are only evenly distributed in 3D, i.e. when you consider each slice in front and behind it too. Blue noise being delicate means that nobody really knows of a way to generate it statelessly, i.e. as a pure function f(x,y,z). Algorithms to generate it must factor in the whole, as noise is only blue if every single sample is evenly spaced. You can make blue noise images that tile, and sample those, but the resulting repetition may be noticeable. Because blue noise is constructed, you can make special variants. Uniform Blue Noise has a uniform distribution of values, with each value equally likely. An 8-bit 256x256 UBN image will have each unique byte appear exactly 256 times. Projective Blue Noise can be projected down, so that a 3D volume XYZ flattened into either XY, YZ or ZX is still blue in 2D, and same for X, Y and Z in 1D. Spatio-Temporal Blue Noise (STBN) is 3D blue noise created specifically for use in real-time rendering: Every 2D slice XY is 2D blue noise Every Z row is 1D blue noise This means XZ or YZ slices of STBN are not blue.
Instead, it's designed so that when you average out all the XY slices over Z, the result is uniform gray, again without clusters or voids. This requires the noise in all the slices to perfectly complement each other, a bit like overlapping slices of translucent swiss cheese. This is the sort of noise I want to generate. Indigo STBN 64x64x16 XYZ spectrum Sleep Furiously A blur filter's spectrum is the opposite of blue noise: it's concentrated in the lowest frequencies, with a bump in the middle. If you blur the noise, you multiply the two spectra. Very little is left: only the ring-shaped overlap, creating a band-pass area. This is why blue noise looks good when smoothed, and is used in rendering, with both spatial (2D) and temporal smoothing (1D) applied. Blur filters can be designed. If a blur filter is perfectly low-pass, i.e. ~zero amplitude for all frequencies > $ f_{\rm{lowpass}} $ , then nothing is left of the upper frequencies past a point. If the noise is shaped to minimize any overlap, then the result is actually noise free. The dark part of the noise spectrum should be large and pitch black. The spectrum shouldn't just be blue, it should be indigo. When people say you can't design noise in frequency space, what they mean is that you can't merely apply an inverse FFT to a given target spectrum. The resulting noise is gaussian, not uniform. The missing ingredient is the phase: all the frequencies need to be precisely aligned to have the right value distribution. This is why you need a specialized algorithm. The STBN paper describes two: void-and-cluster, and swap. Both of these are driven by an energy function. It works in the spatial/time domain, based on the distances between pairs of samples. It uses a "fall-off parameter" sigma to control the effective radius of each sample, with a gaussian kernel. STBN (Wolfe et al.) The swap algorithm is trivially simple. It starts from white noise and shapes it: Start with e.g. 256x256 pixels initialized with the bytes 0-255 repeated 256 times in order Permute all the pixels into ~white noise using a random order Now iterate: randomly try swapping two pixels, check if the result is "more blue" This is guaranteed to preserve the uniform input distribution perfectly. The resulting noise patterns are blue, but they still have some noise in all the lower frequencies. The only blur filter that could get rid of it all, is one that blurs away all the signal too. My 'simple' fix is just to score swaps in the frequency domain instead. If this seems too good to be true, you should know that a permutation search space is catastrophically huge. If any pixel can be swapped with any other pixel, the number of possible swaps at any given step is O(N²). In a 256x256 image, it's ~2 billion. The goal is to find a sequence of thousands, millions of random swaps, to turn the white noise into blue noise. This is basically stochastic bootstrapping. It's the bulk of good old fashioned AI, using simple heuristics, queues and other tools to dig around large search spaces. If there are local minima in the way, you usually need more noise and simulated annealing to tunnel over those. Usually. This set up is somewhat simplified by the fact that swaps are symmetric (i.e. (A,B) = (B,A)), but also that applying swaps S1 and S2 is the same as applying swaps S2 and S1 as long as they don't overlap. Good Energy Let's take it one hurdle at a time. 
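Before moving on, here is the swap procedure from the previous section as a minimal single-threaded sketch. The score() callback stands in for whichever loss you use (the paper's spatial energy, or this post's spectral loss), with lower meaning better; names and sizes are illustrative:

// 1. Bytes 0..255 repeated in order, 2. permuted into white noise (Fisher-Yates).
function makeUniformWhiteNoise(width: number, height: number): Uint8Array {
  const n = width * height;
  const pixels = new Uint8Array(n);
  for (let i = 0; i < n; i++) pixels[i] = i % 256;
  for (let i = n - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pixels[i], pixels[j]] = [pixels[j], pixels[i]];
  }
  return pixels;
}

// 3. Randomly try swapping two pixels, keep the swap only if the result scores better.
function swapSearch(pixels: Uint8Array, score: (p: Uint8Array) => number, iterations: number) {
  let current = score(pixels);
  for (let i = 0; i < iterations; i++) {
    const a = Math.floor(Math.random() * pixels.length);
    const b = Math.floor(Math.random() * pixels.length);
    if (a === b || pixels[a] === pixels[b]) continue;
    [pixels[a], pixels[b]] = [pixels[b], pixels[a]];        // try the swap...
    const next = score(pixels);
    if (next < current) current = next;                     // ...keep it if "more blue"
    else [pixels[a], pixels[b]] = [pixels[b], pixels[a]];   // ...otherwise swap back
  }
}

Because swaps never change the histogram, the uniform value distribution of the input is preserved exactly, no matter how long the search runs.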
It's not obvious that you can change a signal's spectrum just by re-ordering its values over space/time, but this is easy to illustrate. Take any finite 1D signal, and order its values from lowest to highest. You will get some kind of ramp, approximating a sawtooth wave. This concentrates most of the energy in the first non-DC frequency: Now split the odd values from the even values, and concatenate them. You will now have two ramps, with twice the frequency: You can repeat this to double the frequency all the way up to Nyquist. So you have a lot of design freedom to transfer energy from one frequency to another. In fact the Fourier transform has the property that energy in the time and frequency domain is conserved: This means the sum of $ |\mathrm{spectrum}_k|^2 $ remains constant over pixel swaps. We then design a target curve, e.g. a high-pass cosine: This can be fit and compared to the current noise spectrum to get the error to minimize. However, I don't measure the error in energy $ |\mathrm{spectrum}_k|^2 $ but in amplitude $ |\mathrm{spectrum}_k| $. I normalize the spectrum and the target into distributions, and take the L2 norm of the difference, i.e. a sqrt of the sum of squared errors: This keeps the math simple, but also helps target the noise in the ~zero part of the spectrum. Otherwise, deviations near zero would count for less than deviations around one. Go Banana So I tried it. With a lot of patience, you can make 2D blue noise images up to 256x256 on a single thread. A naive random search with an FFT for every iteration is not fast, but computers are. Making a 64x64x16 with this is possible, but it's certainly like watching paint dry. It's the same number of pixels as 256x256, but with an extra dimension worth of FFTs that need to be churned. Still, it works and you can also make 3D STBN with the spatial and temporal curves controlled independently: Converged spectra I built command-line scripts for this, with a bunch of quality of life things. If you're going to sit around waiting for numbers to go by, you have a lot of time for this... Save and load byte/float-exact state to a .png, save parameters to .json Save a bunch of debug viz as extra .pngs with every snapshot Auto-save state periodically during runs Measure and show rate of convergence every N seconds, with smoothing Validate the histogram before saving to detect bugs and avoid corrupting expensive runs I could fire up a couple of workers to start churning, while continuing to develop the code liberally with new variations. I could also stop and restart workers with new heuristics, continuing where it left off. Protip: you can write C in JavaScript Drunk with power, I tried various sizes and curves, which created... okay noise. Each has the exact same uniform distribution so it's difficult to judge other than comparing to other output, or earlier versions of itself. To address this, I visualized the blurred result, using a [1 4 6 4 1] kernel as my base line. After adjusting levels, structure was visible: Semi-converged Blurred The resulting spectra show what's actually going on: Semi-converged Blurred The main component is the expected ring of bandpass noise, the 2D equivalent of ringing. But in between there is also a ton of redder noise, in the lower frequencies, which all remains after a blur. This noise is as strong as the ring. So while it's easy to make a blue-ish noise pattern that looks okay at first glance, there is a vast gap between having a noise floor and not having one. 
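For reference, the scoring just described as a sketch: both the measured amplitude spectrum and the target curve are normalized into distributions, and the loss is the L2 norm of their difference. The FFT that produces the amplitudes is omitted, and the names are mine, not the author's exact code:

// amplitudes[k] = |spectrum_k| of the current noise, target[k] = desired curve sample.
function spectralLoss(amplitudes: Float64Array, target: Float64Array): number {
  let sumA = 0, sumT = 0;
  for (let k = 0; k < amplitudes.length; k++) { sumA += amplitudes[k]; sumT += target[k]; }
  let loss = 0;
  for (let k = 0; k < amplitudes.length; k++) {
    const d = amplitudes[k] / sumA - target[k] / sumT;   // compare normalized distributions
    loss += d * d;
  }
  return Math.sqrt(loss);                                 // sqrt of the sum of squared errors
}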
So I kept iterating: It takes a very long time, but if you wait, all those little specks will slowly twinkle out, until quantization starts to get in the way, with a loss of about 1/255 per pixel (0.0039). Semi-converged Fully converged The effect on the blurred output is remarkable. All the large-scale structure disappears, as you'd expect from spectra, leaving only the bandpass ringing. That goes away with a strong enough blur, or a large enough dark zone. The visual difference between the two is slight, but nevertheless, the difference is significant and pervasive when amplified: Semi-converged Fully converged Difference Final spectrum I tried a few indigo noise curves, with different % of the curve zero. The resulting noise is all extremely equal, even after a blur and amplify. The only visible noise left is bandpass, and the noise floor is so low it may as well not be there. As you make the black exclusion zone bigger, the noise gets concentrated in the edges and corners. It becomes a bit more linear and squarish, a contender for violet noise. This is basically a smooth evolution towards a pure pixel checkerboard in the limit. Using more than 50% zero seems inadvisable for this reason: Time domain Frequency domain At this point the idea was validated, but it was dog slow. Can it be done faster? Spatially Sparse An FFT scales like O(N log N). When you are dealing with images and volumes, that N is actually an N² or N³ in practice. The early phase of the search is the easiest to speed up, because you can find a good swap for any pixel with barely any tries. There is no point in being clever. Each sub-region is very non-uniform, and its spectrum nearly white. Placing pixels roughly in the right location is plenty good enough. You might try splitting a large volume into separate blocks, and optimize each block separately. That wouldn't work, because all the boundaries remain fully white. Overlapping doesn't fix this, because they will actively create new seams. I tried it. What does work is a windowed scoring strategy. It avoids a full FFT for the entire volume, and only scores each NxN or NxNxN region around each swapped point, with N-sized FFTs in each dimension. This is enormously faster and can rapidly dissolve larger volumes of white noise into approximate blue even with e.g. N = 8 or N = 16. Eventually it stops improving and you need to bump the range or switch to a global optimizer. Here's the progression from white noise, to when sparse 16x16 gives up, followed by some additional 64x64: Time domain Frequency domain Time domain Frequency domain Time domain Frequency domain A naive solution does not work well, however. This is because the spectrum of a subregion does not match the spectrum of the whole. The Fourier transform assumes each signal is periodic. If you take a random subregion and forcibly repeat it, its new spectrum will have aliasing artifacts. This would cause you to consistently misjudge swaps. To fix this, you need to window the signal in the space/time-domain. This forces it to start and end at 0, and eliminates the effect of non-matching boundaries on the scoring. I used a smoothStep window because it's cheap, and haven't needed to try anything else: 16x16 windowed data This still alters the spectrum, but in a predictable way. A time-domain window is a convolution in the frequency domain. You don't actually have a choice here: not using a window is mathematically equivalent to using a very bad window.
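The window itself is cheap to build; a sketch of a 1D smoothstep window (hypothetical code, with the 2D or 3D weight taken as the product over each axis):

function smoothStepWindow(n: number): Float64Array {
  const w = new Float64Array(n);
  for (let i = 0; i < n; i++) {
    const x = i / (n - 1);                    // 0..1 across the window
    const t = 1 - Math.abs(2 * x - 1);        // triangle: 0 at the edges, 1 in the middle
    w[i] = t * t * (3 - 2 * t);               // smoothstep(0, 1, t)
  }
  return w;
}

// For a 16x16 region, the weight at (x, y) is w[x] * w[y], multiplied into the cut-out
// data before its small FFT, so mismatched boundaries stop influencing the score.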
Not windowing is effectively a box filter covering the cut-out area inside the larger volume, which causes spectral ringing. The effect of the chosen window on the target spectrum can be modeled via convolution of their spectral magnitudes:

$$ \mathbf{target}' = |\mathbf{target}| \circledast |\mathcal{F}(\mathbf{window})| $$

This can be done via the time domain as:

$$ \mathbf{target}' = \mathcal{F}(\mathcal{F}^{-1}(|\mathbf{target}|) \cdot \mathcal{F}^{-1}(|\mathcal{F}(\mathbf{window})|)) $$

Note that the forward/inverse Fourier pairs are not redundant, as there is an absolute value operator in between. This discards the phase component of the window, which is irrelevant.

Curiously, while it is important to window the noise data, it isn't very important to window the target. The effect of the spectral convolution is small, amounting to a small blur, and the extra error is random and absorbed by the smooth scoring function. The resulting local loss tracks the global loss function pretty closely. It massively speeds up the search in larger volumes, because the large FFT is the bottleneck. But it stops working well before anything resembling convergence in the frequency domain. It does not make true blue noise, only a lookalike. The overall problem is still that we can't tell good swaps from bad swaps without trying them and verifying.

Sleight of Frequency

So, let's characterize the effect of a pixel swap. Given a signal [A B C D E F G H], let's swap C and F. Swapping the two values is the same as adding F - C = Δ to C, and subtracting that same delta from F. That is, you add the vector:

V = [0 0 Δ 0 0 -Δ 0 0]

This remains true if you apply a Fourier transform and do it in the frequency domain. To best understand this, you need to develop some intuition around FFTs of Dirac deltas.

Consider the short filter kernel [1 4 6 4 1]. It's a little-known fact, but you can actually sight-read its frequency spectrum directly off the coefficients, because the filter is symmetrical. I will teach you. The extremes are easy:

The DC amplitude is the sum 1 + 4 + 6 + 4 + 1 = 16
The Nyquist amplitude is the modulated sum 1 - 4 + 6 - 4 + 1 = 0

So we already know it's an 'ideal' lowpass filter, which reduces the Nyquist signal +1, -1, +1, -1, ... to exactly zero. It also has 16x DC gain.

Now all the other frequencies. First, remember the Fourier transform works in symmetric ways. Every statement "____ in the time domain = ____ in the frequency domain" is still true if you swap the words time and frequency. This has led to the grotesquely named sub-field of cepstral processing where you have quefrencies and vawes, and it kinda feels like you're having a stroke. The cepstral convolution filter from earlier is called a lifter. Usually cepstral processing is applied to the real magnitude of the spectrum, i.e. $ |\mathrm{spectrum}| $, instead of its true complex value. This is a coward move.

So, decompose the kernel into symmetric pairs:

[· · 6 · ·]
[· 4 · 4 ·]
[1 · · · 1]

Every row but the first is a pair of real Dirac deltas in the time domain. Such a row is normally what you get when you Fourier transform a cosine, i.e.:

$$ \cos \omega = \frac{\mathrm{e}^{i\omega} + \mathrm{e}^{-i\omega}}{2} $$

A cosine in time is a pair of Dirac deltas in the frequency domain. The phase of a (real) cosine is zero, so both its deltas are real.

Now flip it around. The Fourier transform of a pair [x 0 0 ... 0 0 x] is a real cosine in frequency space. Must be true. Each new pair adds a new higher cosine on top of the existing spectrum.
For the central [... 0 0 x 0 0 ...] we add a DC term. It's just a Fourier transform in the other direction:

|FFT([1 4 6 4 1])| =

[· · 6 · ·] => 6
[· 4 · 4 ·] => 8 cos(ɷ)
[1 · · · 1] => 2 cos(2ɷ)

= 6 + 8 cos(ɷ) + 2 cos(2ɷ)

Normally you have to use the z-transform to analyze a digital filter. But the above is a shortcut. FFTs and inverse FFTs do have opposite phase, but that doesn't matter here because cos(ɷ) = cos(-ɷ). This works for the symmetric-even case too: you offset the frequencies by half a band, ɷ/2, and there is no DC term in the middle:

|FFT([1 3 3 1])| =

[· 3 3 ·] => 6 cos(ɷ/2)
[1 · · 1] => 2 cos(3ɷ/2)

= 6 cos(ɷ/2) + 2 cos(3ɷ/2)

So, symmetric filters have spectra that are made up of regular cosines. Now you know.

What about the delta vector [0 0 Δ 0 0 -Δ 0 0]? It's not symmetric, so we have to decompose it:

V1 = [· · · · · Δ · ·]
V2 = [· · Δ · · · · ·]
V  = V2 - V1

Each is now an unpaired Dirac delta. Each vector's Fourier transform is a complex wave $ Δ \cdot \mathrm{e}^{-i \omega k} $ in the frequency domain (the k'th quefrency). It lacks the usual complementary oppositely twisting wave $ Δ \cdot \mathrm{e}^{i \omega k} $, so it's not real-valued. It has constant magnitude Δ and varying phase:

FFT(V1) = [ Δ Δ Δ Δ Δ Δ Δ Δ ]
FFT(V2) = [ Δ Δ Δ Δ Δ Δ Δ Δ ]

These are vawes.

The effect of a swap is still just to add FFT(V), aka FFT(V2) - FFT(V1), to the (complex) spectrum. The effect is to transfer energy between all the bands simultaneously. Hence, FFT(V1) and FFT(V2) function as a source and destination mask for the transfer.

However, 'mask' is the wrong word, because the magnitude of $ \mathrm{e}^{i \omega k} $ is always 1. It doesn't have varying amplitude, only varying phase. -FFT(V1) and FFT(V2) define the complex direction in which to add/subtract energy. When added together their phases interfere constructively or destructively, resulting in an amplitude that varies between 0 and 2Δ: an actual mask. The resulting phase will be halfway between the two, as it's the sum of two equal-length complex numbers.

FFT(V) = [ · Δ Δ Δ Δ Δ Δ Δ ]

For any given pixel A and its delta FFT(V1), it can pair up with other pixels B to form N-1 different interference masks FFT(V2) - FFT(V1). There are N(N-1)/2 unique interference masks, if you account for (A,B) (B,A) symmetry.

Worth pointing out is the FFT of the first index:

FFT([Δ 0 0 0 0 0 0 0]) = [Δ Δ Δ Δ Δ Δ Δ Δ]

This is the DC quefrency, and the Fourier symmetry continues to work. Moving values in time causes the vawe's quefrency to change in the frequency domain. This is the upside-down version of how moving energy to another frequency band causes the wave's frequency to change in the time domain.

What's the Gradient, Kenneth?

Using vectors as masks... shifting energy in directions... this means gradient descent, no?

Well. It's indeed possible to calculate the derivative of your loss function as a function of input pixel brightness, with the usual bag of automatic differentiation/backprop tricks. You can also do it numerically. But, this doesn't help you directly because the only way you can act on that per-pixel gradient is by swapping a pair of pixels. You need to find two quefrencies FFT(V1) and FFT(V2) which interfere in exactly the right way to decrease the loss function across all bad frequencies simultaneously, while leaving the good ones untouched. Even if the gradient were to help you pick a good starting pixel, that still leaves the problem of finding a good partner.
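To make the interference-mask picture concrete, here is a sketch (1D case, illustrative names) of the exact change a single swap makes to the complex spectrum, built directly from the two 'vawes' instead of a fresh FFT:

```typescript
// Apply the spectral effect of swapping pixels i and j, without recomputing the FFT.
// `re` and `im` hold the current complex spectrum; the time-domain swap of
// values[i] and values[j] still has to happen separately. 1D sketch only.
function applySwapToSpectrum(re: Float32Array, im: Float32Array, values: Float32Array, i: number, j: number) {
  const N = values.length;
  const delta = values[j] - values[i]; // Δ is added at index i and subtracted at index j

  for (let k = 0; k < N; k++) {
    const wi = (-2 * Math.PI * k * i) / N;
    const wj = (-2 * Math.PI * k * j) / N;
    // FFT(V)[k] = Δ · (e^{-iω·k·i} − e^{-iω·k·j}): the interference mask for this pair,
    // whose magnitude varies between 0 and 2Δ depending on how the two vawes align.
    re[k] += delta * (Math.cos(wi) - Math.cos(wj));
    im[k] += delta * (Math.sin(wi) - Math.sin(wj));
  }
}
```

Updating every band this way is O(N) per swap; whether that actually beats redoing an FFT depends on constants and dimensionality, but it shows exactly where the 0 to 2Δ interference comes from.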
There are still O(N²) possible pairs to choose from, and the entire spectrum changes a little bit on every swap. Which means new FFTs to analyze it.

Random greedy search is actually tricky to beat in practice. Whatever extra work you spend on getting better samples translates into fewer samples tried per second. E.g. taking a best-of-3 approach is worse than just trying 3 swaps in a row. Swaps are almost always orthogonal. But random() still samples unevenly because it's white noise. If only we had.... oh wait.

Indeed if you already have blue noise of the right size, you can use that to mildly speed up the search for more. Use it as a random permutation to drive sampling, with some inevitable whitening over time to keep it fresh. You can't however use the noise you're generating to accelerate its own search, because the two are highly correlated.

What's really going on is all a consequence of the loss function. Given any particular frequency band, the loss function is only affected when its magnitude changes. Its phase can change arbitrarily, rolling around without friction. The complex gradient must point in the radial direction. In the tangential direction, the partial derivative is zero.

The value of a given interference mask FFT(V1) - FFT(V2) for a given frequency is also complex-valued. It can be projected onto the current phase, and split into its radial and tangential component with a dot product.

The interference mask has a dual action. As we saw, its magnitude varies between 0 and 2Δ, as a function of the two indices k1 and k2. This creates a window that is independent of the specific state or data. It defines a smooth 'hash' from the interference of two quefrency bands. But its phase adds an additional selection effect: whether the interference in the mask is aligned with the current band's phase. This determines the split between radial and tangential, and defines a smooth phase 'hash' on top. It cycles at the average of the two quefrencies, i.e. a different, third one.

Energy is only added/subtracted if both hashes are non-zero. If the phase hash is zero, the frequency band only turns. This does not affect loss, but changes how each mask will affect it in the future. This then determines how it is coupled to other bands when you perform a particular swap. Note that this is only true differentially: for a finite swap, the curvature of the complex domain comes into play.

The loss function is actually a hyper-cylindrical skate bowl you can ride around. It's just that the movement of all the bands is tied together. Frequency bands with significant error may 'random walk' freely clockwise or counterclockwise when subjected to swaps. A band can therefore drift until it gets a turn where its phase is in alignment with enough similar bands, where the swap makes them all descend along the local gradient, enough to counter any negative effects elsewhere.

In the time domain, each frequency band is a wave that oscillates between -1...1: it 'owns' some of the value of each pixel, but there are places where its weight is ~zero (the knots). So when a band shifts phase, it changes how much of the energy of each pixel it 'owns'. This allows each band to 'scan' different parts of the noise in the time domain. In order to fix a particular peak or dent in the frequency spectrum, the search must rotate that band's phase so it strongly owns any defect in the noise, and then perform a swap to fix that defect.

Thus, my mental model of this is not actually disconnected pixel swapping.
It's more like one of those Myst puzzles where flipping a switch flips some of the neighbors too. You press one pair of buttons at a time. It's a giant haunted dimmer switch. We're dealing with complex amplitudes, not real values, so the light also has a color. Mechanically it's like a slot machine, with dials that can rotate to display different sides. The cherries and bells are the color: they determine how the light gets brighter or darker. If a dial is set just right, you can use it as a /dev/null to 'dump' changes. That's what theory predicts, but does it work? Well, here is a (blurred noise) spectrum being late-optimized. The search is trying to eliminate the left-over lower frequency noise in the middle: Semi converged Here's the phase difference from the late stages of search, each a good swap. Left to right shows 4 different value scales: x2 x16 x256 x4096 x2 x16 x256 x4096 At first it looks like just a few phases are changing, but amplification reveals it's the opposite. There are several plateaus. Strongest are the bands being actively modified. Then there's the circular error area around it, where other bands are still swirling into phase. Then there's a sharp drop-off to a much weaker noise floor, present everywhere. These are the bands that are already converged. Compare to a random bad swap: x2 x16 x256 x4096 Now there is strong noise all over the center, and the loss immediately gets worse, as a bunch of amplitudes start shifting in the wrong direction randomly. So it's true. Applying the swap algorithm with a spectral target naturally cycles through focusing on different parts of the target spectrum as it makes progress. This information is positionally encoded in the phases of the bands and can be 'queried' by attempting a swap. This means the constraint of a fixed target spectrum is actually a constantly moving target in the complex domain. Frequency bands that reach the target are locked in. Neither their magnitude nor phase changes in aggregate. The random walks of such bands must have no DC component... they must be complex-valued blue noise with a tiny amplitude. Knowing this doesn't help directly, but it does explain why the search is so hard. Because the interference masks function like hashes, there is no simple pattern to how positions map to errors in the spectrum. And once you get close to the target, finding new good swaps is equivalent to digging out information encoded deep in the phase domain, with O(N²) interference masks to choose from. Gradient Sampling As I was trying to optimize for evenness after blur, it occurred to me to simply try selecting bright or dark spots in the blurred after-image. This is the situation where frequency bands are in coupled alignment: the error in the spectrum has a relatively concentrated footprint in the time domain. But, this heuristic merely picks out good swaps that are already 'lined up' so to speak. It only works as a low-hanging fruit sampler, with rapidly diminishing returns. Next I used the gradient in the frequency domain. The gradient points towards increasing loss, which is the sum of squared distance $ (…)^2 $. 
So the slope is $ 2(…) $, proportional to distance to the goal: $$ |\mathrm{gradient}_k| = 2 \cdot \left( \frac{|\mathrm{spectrum}_k|}{||\mathbf{spectrum}||} - \frac{\mathrm{target}_k}{||\mathbf{target}||} \right) $$ It's radial, so its phase matches the spectrum itself: $$ \mathrm{gradient}_k = \mathrm{|gradient_k|} \cdot \left(1 ∠ \mathrm{arg}(\mathrm{spectrum}_k) \right) $$ Eagle-eyed readers may notice the sqrt part of the L2 norm is missing here. It's only there for normalization, and in fact, you generally want a gradient that decreases the closer you get to the target. It acts as a natural stabilizer, forming a convex optimization problem. You can transport this gradient backwards by applying an inverse FFT. Usually derivatives and FFTs don't commute, but that's only when you are deriving in the same dimension as the FFT. The partial derivative here is neither over time nor frequency, but by signal value. The resulting time-domain gradient tells you how fast the (squared) loss would change if a given pixel changed. The sign tells you whether it needs to become lighter or darker. In theory, a pixel with a large gradient can enable larger score improvements per step. It says little about what's a suitable pixel to pair with though. You can infer that a pixel needs to be paired with one that is brighter or darker, but not how much. The gradient only applies differentially. It involves two pixels, so it will cause interference between the two deltas, and also with the signal's own phase. The time-domain gradient does change slowly after every swap—mainly the swapping pixels—so this only needs to add an extra IFFT every N swap attempts, reusing it in between. I tried this in two ways. One was to bias random sampling towards points with the largest gradients. This barely did anything, when applied to one or both pixels. Then I tried going down the list in order, and this worked better. I tried a bunch of heuristics here, like adding a retry until paired, and a 'dud' tracker to reject known unpairable pixels. It did lead to some minor gains in successful sample selection. But beating random was still not a sure bet in all cases, because it comes at the cost of ordering and tracking all pixels to sample them. All in all, it was quite mystifying. Pair Analysis Hence I analyzed all possible swaps (A,B) inside one 64x64 image at different stages of convergence, for 1024 pixels A (25% of total). The result was quite illuminating. There are 2 indicators of a pixel's suitability for swapping: % of all possible swaps (A,_) that are good score improvement of best possible swap (A,B) They are highly correlated, and you can take the geometric average to get a single quality score to order by: The curve shows that the best possible candidates are rare, with a sharp drop-off at the start. Here the average candidate is ~1/3rd as good as the best, though every pixel is pairable. This represents the typical situation when you have unconverged blue-ish noise. Order all pixels by their (signed) gradient, and plot the quality: The distribution seems biased towards the ends. A larger absolute gradient at A can indeed lead to both better scores and higher % of good swaps. Notice that it's also noisier at the ends, where it dips below the middle. If you order pixels by their quality, and then plot the gradient, you see: Selecting for large gradient at A will select both the best and the worst possible pixels A. 
This implies that there are pixels in the noise that are very significant, but are nevertheless currently 'unfixable'. This corresponds to the 'focus' described earlier. By drawing from the 'top', I was mining the imbalance between the good/left and bad/right distribution. Selecting for a vanishing gradient would instead select the average-to-bad pixels A. I investigated one instance of each: very good, average or very bad pixel A. I tried every possible swap (A, B). You can plot a curve of the potential score improvement: The three scenarios have similar curves, with the bulk of swaps being negative. Only a tiny bit of the curve is sticking out positive, even in the best case. The potential benefit of a good swap is dwarfed by the potential harm of bad swaps. The main difference is just how many positive swaps there are, if any. So let's focus on the positive case, where you can see best. You can order by score, and plot the gradient of all the pixels B, to see another correlation. It looks kinda promising. Here the sign matters, with left and right being different. If the gradient of pixel A is the opposite sign, then this graph is mirrored. But if you order by (signed) gradient and plot the score, you see the real problem, caused by the noise: The good samples are mixed freely among the bad ones, with only a very weak trend downward. This explains why sampling improvements based purely on gradient for pixel B are impossible. You can see what's going on if you plot Δv, the difference in value between A and B: For a given pixel A, all the good swaps have a similar value for B, which is not unexpected. Its mean is the ideal value for A, but there is a lot of variance. In this case pixel A is nearly white, so it is brighter than almost every other pixel B. If you now plot Δv * -gradient, you see a clue on the left: Almost all of the successful swaps have a small but positive value. This represents what we already knew: the gradient's sign tells you if a pixel should be brighter or darker. If Δv has the opposite sign, the chances of a successful swap are slim. Ideally both pixels 'face' the right way, so the swap is beneficial on both ends. But only the combined effect on the loss matters: i.e. Δv * Δgradient < 0. It's only true differentially so it can misfire. But compared to blind sampling of pairs, it's easily 5-10x better and faster, racing towards the tougher parts of the search. What's more... while this test is just binary, I found that any effort spent on trying to further prioritize swaps by the magnitude of the gradient is entirely wasted. Maximizing Δv * Δgradient by repeated sampling is counterproductive, because it selects more bad candidates on the right. Minimizing Δv * Δgradient creates more successful swaps on the left, but lowers the average improvement per step so the convergence is net slower. Anything more sophisticated incurs too much computation to be worth it. It does have a limit. This is what it looks like when an image is practically fully converged: Eventually you reach the point where there are only a handful of swaps with any real benefit, while the rest is just shaving off a few bits of loss at a time. It devolves back to pure random selection, only skipping the coin flip for the gradient. It is likely that more targeted heuristics can still work here. The gradient also works in the early stages. As it barely changes over successive swaps, this leads to a different kind of sparse mode. 
Instead of scoring only a subset of pixels, simply score multiple swaps as a group over time, without re-scoring intermediate states. This lowers the success rate roughly by a power (e.g. 0.8 -> 0.64), but cuts the number of FFTs by a constant factor (e.g. 1/2). Early on this trade-off can be worth it. Even faster: don't score steps at all. In the very early stage, you can easily get up to 80-90% successful swaps just by filtering on values and gradients. If you just swap a bunch in a row, there is a very good chance you will still end up better than before. It works better than sparse scoring: using the gradient of your true objective approximately works better than using an approximate objective exactly. The latter will miss the true goal by design, while the former continually re-aims itself to the destination despite inaccuracy. Obviously you can mix and match techniques, and gradient + sparse is actually a winning combo. I've only scratched the surface here. Warp Speed Time to address the elephant in the room. If the main bottleneck is an FFT, wouldn't this work better on a GPU? The answer to that is an unsurprising yes, at least for large sizes where the overhead of async dispatch is negligible. However, it would have been endlessly more cumbersome to discover all of the above based on a GPU implementation, where I can't just log intermediate values to a console. After checking everything, I pulled out my bag of tricks and ported it to Use.GPU. As a result, the algorithm runs entirely on the GPU, and provides live visualization of the entire process. It requires a WebGPU-enabled browser, which in practice means Chrome on Windows or Mac, or a dev build elsewhere. I haven't particularly optimized this—the FFT is vanilla textbook—but it works. It provides an easy ~8x speed up on an M1 Mac on beefier target sizes. With a desktop GPU, 128x128x32 and larger become very feasible. It lacks a few goodies from the scripts, and only does gradient + optional sparse. You can however freely exchange PNGs between the CPU and GPU version via drag and drop, as long as the settings match. Layout components Compute components Worth pointing out: this visualization is built using Use.GPU's HTML-like layout system. I can put div-like blocks inside a flex box wrapper, and put text beside it... while at the same time using raw WGSL shaders as the contents of those divs. These visualization shaders sample and colorize the algorithm state on the fly, with no CPU-side involvement other than a static dispatch. The only GPU -> CPU readback is for the stats in the corner, which are classic React and real HTML, along with the rest of the controls. I can then build an <FFT> component and drop it inside an async <ComputeLoop>, and it does exactly what it should. The rest is just a handful of <Dispatch> elements and the ordinary headache of writing compute shaders. <Suspense> ensures all the shaders are compiled before dispatching. While running, the bulk of the tree is inert, with only a handful of reducers triggering on a loop, causing a mere 7 live components to update per frame. The compute dispatch fights with the normal rendering for GPU resources, so there is an auto-batching mechanism that aims for approximately 30-60 FPS. The display is fully anti-aliased, including the pixelized data. I'm using the usual per-pixel SDF trickery to do this... it's applied as a generic wrapper shader for any UV-based sampler. 
It's a good showcase that Use.GPU really is React-for-GPUs with less hassle, but still with all the goodies. It bypasses most of the browser once the canvas gets going, and it isn't just for UI: you can express async compute just fine with the right component design. The robust layout and vector plotting capabilities are just extra on top. I won't claim it's the world's most elegant abstraction, because it's far too pragmatic for that. But I simply don't know any other programming environment where I could even try something like this and not get bogged down in GPU binding hell, or have to round-trip everything back to the CPU. * * * So there you have it: blue and indigo noise à la carte. What I find most interesting is that the problem of generating noise in the time domain has been recast into shaping and denoising a spectrum in the frequency domain. It starts as white noise, and gets turned into a pre-designed picture. You do so by swapping pixels in the other domain. The state for this process is kept in the phase channel, which is not directly relevant to the problem, but drifts into alignment over time. Hence I called it Stable Fiddusion. If you swap the two domains, you're turning noise into a picture by swapping frequency bands without changing their values. It would result in a complex-valued picture, whose magnitude is the target, and whose phase encodes the progress of the convergence process. This is approximately what you get when you add a hidden layer to a diffusion model. What I also find interesting is that the notion of swaps naturally creates a space that is O(N²) big with only N samples of actual data. Viewed from the perspective of a single step, every pair (A,B) corresponds to a unique information mask in the frequency domain that extracts a unique delta from the same data. There is redundancy, of course, but the nature of the Fourier transform smears it out into one big superposition. When you do multiple swaps, the space grows, but not quite that fast: any permutation of the same non-overlapping swaps is equivalent. There is also a notion of entanglement: frequency bands / pixels are linked together to move as a whole by default, but parts will diffuse into being locked in place. Phase is kind of the bugbear of the DSP world. Everyone knows it's there, but they prefer not to talk about it unless its content is neat and simple. Hopefully by now you have a better appreciation of the true nature of a Fourier transform. Not just as a spectrum for a real-valued signal, but as a complex-valued transform of a complex-valued input. During a swap run, the phase channel continuously looks like noise, but is actually highly structured when queried with the right quefrency hashes. I wonder what other things look like that, when you flip them around. More: Stable Fiddusion app CPU-side source code WebGPU source code

a year ago 22 votes
Sub-pixel Distance Transform

High quality font rendering for WebGPU This page includes diagrams in WebGPU, which has limited browser support. For the full experience, use Chrome on Windows or Mac, or a developer build on other platforms. In this post I will describe Use.GPU's text rendering, which uses a bespoke approach to Signed Distance Fields (SDFs). This was borne out of necessity: while SDF text is pretty common on GPUs, some of the established practice on generating SDFs from masks is incorrect, and some libraries get it right only by accident. So this will be a deep dive from first principles, about the nuances of subpixels. SDFs The idea behind SDFs is quite simple. To draw a crisp, anti-aliased shape at any size, you start from a field or image that records the distance to the shape's edge at every point, as a gradient. Lighter grays are inside, darker grays are outside. This can be a lower resolution than the target. Then you increase the contrast until the gradient is exactly 1 pixel wide at the target size. You can sample it to get a perfectly anti-aliased opacity mask: This works fine for text at typical sizes, and handles fractional shifts and scales perfectly with zero shimmering. It's also reasonably correct from a signal processing math point-of-view: it closely approximates averaging over a pixel-sized circular window, i.e. a low-pass convolution. Crucially, it takes a rendered glyph as input, which means I can remain blissfully unaware of TrueType font specifics, and bezier rasterization, and just offload that to an existing library. To generate an SDF, I started with MapBox's TinySDF library. Except, what comes out of it is wrong: The contours are noticeably wobbly and pixelated. The only reason the glyph itself looks okay is because the errors around the zero-level are symmetrical and cancel out. If you try to dilate or contract the outline, which is supposed to be one of SDF's killer features, you get ugly gunk. Compare to: The original Valve paper glosses over this aspect and uses high resolution inputs (4k) for a highly downscaled result (64). That is not an option for me because it's too slow. But I did get it to work. As a result Use.GPU has a novel subpixel-accurate distance transform (ESDT), which even does emoji. It's a combination CPU/GPU approach, with the CPU generating SDFs and the GPU rendering them, including all the debug viz. The Classic EDT The common solution is a Euclidean Distance Transform. Given a binary mask, it will produce an unsigned distance field. This holds the squared distance d² for either the inside or outside area, which you can sqrt. Like a Fourier Transform, you can apply it to 2D images by applying it horizontally on each row X, then vertically on each column Y (or vice versa). To make a signed distance field, you do this for both the inside and outside separately, and then combine the two as inside – outside or vice versa. The algorithm is one of those clever bits of 80s-style C code which is O(N), has lots of 1-letter variable names, and is very CPU cache friendly. Often copy/pasted, but rarely understood. In TypeScript it looks like this, where array is modified in-place and f, v and z are temporary buffers up to 1 row/column long. The arguments offset and stride allow the code to be used in either the X or Y direction in a flattened 2D array. 
for (let q = 1, k = 0, s = 0; q < length; q++) { f[q] = array[offset + q * stride]; do { let r = v[k]; s = (f[q] - f[r] + q * q - r * r) / (q - r) / 2; } while (s <= z[k] && --k > -1); k++; v[k] = q; z[k] = s; z[k + 1] = INF; } for (let q = 0, k = 0; q < length; q++) { while (z[k + 1] < q) k++; let r = v[k]; let d = q - r; array[offset + q * stride] = f[r] + d * d; } To explain what this code does, let's start with a naive version instead. Given a 1D input array of zeroes (filled), with an area masked out with infinity (empty): O = [·, ·, ·, 0, 0, 0, 0, 0, ·, 0, 0, 0, ·, ·, ·] Make a matching sequence … 3 2 1 0 1 2 3 … for each element, centering the 0 at each index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14] + ∞ [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13] + ∞ [2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12] + ∞ [3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11] + 0 [4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10] + 0 [5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9] + 0 [6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8] + 0 [7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7] + 0 [8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6] + ∞ [9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5] + 0 [10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4] + 0 [11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3] + 0 [12,11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2] + ∞ [13,12,11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1] + ∞ [14,13,12,11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0] + ∞ You then add the value from the array to each element in the row: [∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞] [∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞] [∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞] [3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11] [4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10] [5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8] [7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7] [∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞] [9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5] [10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4] [11,10,9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3] [∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞] [∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞] [∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞, ∞] And then take the minimum of each column: P = [3, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 3] This sequence counts up inside the masked out area, away from the zeroes. This is the positive distance field P. You can do the same for the inverted mask: I = [0, 0, 0, ·, ·, ·, ·, ·, 0, ·, ·, ·, 0, 0, 0] to get the complementary area, i.e. the negative distance field N: N = [0, 0, 0, 1, 2, 3, 2, 1, 0, 1, 2, 1, 0, 0, 0] That's what the EDT does, except it uses square distance … 9 4 1 0 1 4 9 …: When you apply it a second time in the second dimension, these outputs are the new input, i.e. values other than 0 or ∞. It still works because of Pythagoras' rule: d² = x² + y². This wouldn't be true if it used linear distance instead. The net effect is that you end up intersecting a series of parabolas, somewhat like a 1D slice of a Voronoi diagram: I' = [0, 0, 1, 4, 9, 4, 4, 4, 1, 1, 4, 9, 4, 9, 9] Each parabola sitting above zero is the 'shadow' of a zero-level paraboloid located some distance in a perpendicular dimension: The code is just a more clever way to do that, without generating the entire N² grid per row/column. It instead scans through the array left to right, building up a list v[k] of significant minima, with thresholds s[k] where two parabolas intersect. It adds them as candidates (k++) and discards them (--k) if they are eclipsed by a newer value. 
This is the first for/while loop: for (let q = 1, k = 0, s = 0; q < length; q++) { f[q] = array[offset + q * stride]; do { let r = v[k]; s = (f[q] - f[r] + q * q - r * r) / (q - r) / 2; } while (s <= z[k] && --k > -1); k++; v[k] = q; z[k] = s; z[k + 1] = INF; } Then it goes left to right again (for), and fills out the values, skipping ahead to the right minimum (k++). This is the squared distance from the current index q to the nearest minimum r, plus the minimum's value f[r] itself. The paper has more details: for (let q = 0, k = 0; q < length; q++) { while (z[k + 1] < q) k++; let r = v[k]; let d = q - r; array[offset + q * stride] = f[r] + d * d; } The Broken EDT So what's the catch? The above assumes a binary mask. As it happens, if you try to subtract a binary N from P, you have an off-by-one error: O = [·, ·, ·, 0, 0, 0, 0, 0, ·, 0, 0, 0, ·, ·, ·] I = [0, 0, 0, ·, ·, ·, ·, ·, 0, ·, ·, ·, 0, 0, 0] P = [3, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 3] N = [0, 0, 0, 1, 2, 3, 2, 1, 0, 1, 2, 1, 0, 0, 0] P - N = [3, 2, 1,-1,-2,-3,-2,-1, 1,-1,-2,-1, 1, 2, 3] It goes directly from 1 to -1 and back. You could add +/- 0.5 to fix that. But if there is a gray pixel in between each white and black, which we classify as both inside (0) and outside (0), it seems to work out just fine: O = [·, ·, ·, 0, 0, 0, 0, 0, ·, 0, 0, 0, ·, ·, ·] I = [0, 0, 0, 0, ·, ·, ·, 0, 0, 0, ·, 0, 0, 0, 0] P = [3, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 3] N = [0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0] P - N = [3, 2, 1, 0,-1,-2,-1, 0, 1, 0,-1, 0, 1, 2, 3] This is a realization that somebody must've had, and they reasoned on: "The above is correct for a 50% opaque pixel, where the edge between inside and outside falls exactly in the middle of a pixel." "Lighter grays are more inside, and darker grays are more outside. So all we need to do is treat l = level - 0.5 as a signed distance, and use l² for the initial inside or outside value for gray pixels. This will cause either the positive or negative distance field to shift by a subpixel amount l. And then the EDT will propagate this in both X and Y directions." The initial idea is correct, because this is just running SDF rendering in reverse. A gray pixel in an opacity mask is what you get when you contrast adjust an SDF and do not blow it out into pure black or white. The information inside the gray pixels is "correct", up to rounding. But there are two mistakes here. The first is that even in an anti-aliased image, you can have white pixels right next to black ones. Especially with fonts, which are pixel-hinted. So the SDF is wrong there, because it goes directly from -1 to 1. This causes the contours to double up, e.g. around this bottom edge: To solve this, you can eliminate the crisp case by deliberately making those edges very dark or light gray. But the second mistake is more subtle. The EDT works in 2D because you can feed the output of X in as the input of Y. But that means that any non-zero input to X represents another dimension Z, separate from X and Y. The resulting squared distance will be x² + y² + z². This is a 3D distance, not 2D. If an edge is shifted by 0.5 pixels in X, you would expect a 1D SDF like: […, 0.5, 1.5, 2.5, 3.5, …] = […, 0.5, 1 + 0.5, 2 + 0.5, 3 + 0.5, …] Instead, because of the squaring + square root, you will get: […, 0.5, 1.12, 2.06, 3.04, …] = […, sqrt(0.25), sqrt(1 + 0.25), sqrt(4 + 0.25), sqrt(9 + 0.25), …] The effect of l² = 0.25 rapidly diminishes as you get away from the edge, and is significantly wrong even just one pixel over. 
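To see how fast that error grows, here's a quick numeric check of the numbers above (illustration only):

```typescript
// A 0.5px horizontal shift folded in as a squared term gives sqrt(d² + 0.25),
// while the true shifted distance should be d + 0.5.
for (let d = 0; d <= 3; d++) {
  const broken = Math.sqrt(d * d + 0.25); // 0.50, 1.12, 2.06, 3.04
  const expected = d + 0.5;               // 0.50, 1.50, 2.50, 3.50
  console.log(d, broken.toFixed(2), expected.toFixed(2));
}
```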
The correct shift would need to be folded into (x + …)² + (y + …)² and depends on the direction. E.g. if an edge is shifted horizontally, it ought to be (x + l)² + y², which means there is a term of 2*x*l missing. If the shift is vertical, it's 2*y*l instead. This is also a signed value, not positive/unsigned.

Given all this, it's a miracle this worked at all. The only reason this isn't more visible in the final glyph is because the positive and negative fields contain the same but opposite errors around their respective gray pixels.

The Not-Subpixel EDT

As mentioned before, the EDT algorithm is essentially making a 1D Voronoi diagram every time. It finds the distance to the nearest minimum for every array index. But there is no reason for those minima themselves to lie at integer offsets, because the second for loop effectively resamples the data.

So you can take an input mask, and tag each index with a horizontal offset Δ:

O = [·, ·, ·, 0, 0, 0, 0, 0, ·, ·, ·]
Δ = [A, B, C, D, E, F, G, H, I, J, K]

As long as the offsets are small, no two indices will swap order, and the code still works. You then build the Voronoi diagram out of the shifted parabolas, but sample the result at unshifted indices.

Problem 1 - Opposing Shifts

This led me down the first rabbit hole, which was an attempt to make the EDT subpixel capable without losing its appealing simplicity. I started by investigating the nuances of subpixel EDT in 1D. This was a bit of a mirage, because most real problems only pop up in 2D. Though there was one important insight here.

O = [·, ·, ·, 0, 0, 0, 0, 0, ·, ·, ·]
Δ = [·, ·, ·, A, ·, ·, ·, B, ·, ·, ·]

Given a mask of zeroes and infinities, you can only shift the first and last point of each segment. Infinities don't do anything, while middle zeroes should remain zero. Using an offset A works sort of as expected: this will increase or decrease the values filled in by a fractional pixel, calculating a squared distance (d + A)² where A can be positive or negative.

But the value at the shifted index itself is always (0 + A)² (positive). This means it is always outside, regardless of whether it is moving left or right. If A is moving left (–), the point is inside, and the (unsigned) distance should be 0. At B the situation is reversed: the distance should be 0 if B is moving right (+).

It might seem like this is annoying but fixable, because the zeroes can be filled in by the opposite signed field. But this is only looking at the binary 1D case, where there are only zeroes and infinities. In 2D, a second pass has non-zero distances, so every index can be shifted:

O = [a, b, c, d, e, f, g, h, i, j, k]
Δ = [A, B, C, D, E, F, G, H, I, J, K]

Now, resolving every subpixel unambiguously is harder than you might think:

It's important to notice that the function being sampled by an EDT is not actually smooth: it is the minimum of a discrete set of parabolas, which cross at an angle. The square root of the output only produces a smooth linear gradient because it samples each parabola at integer offsets. Each center only shifts upward by the square of an integer in every pass, so the crossings are predictable. You never sample the 'wrong' side of (d + ...)². A subpixel EDT does not have this luxury.

Subpixel EDTs are not irreparably broken though. Rather, they are only valid if they cause the unsigned distance field to increase, i.e. if they dilate the empty space. This is a problem: any shift that dilates the positive field contracts the negative, and vice versa.
To fix this, you need to get out of the handwaving stage and actually understand P and N as continuous 2D fields. Problem 2 - Diagonals Consider an aliased, sloped edge. To understand how the classic EDT resolves it, we can turn it into a voronoi diagram for all the white pixel centers: Near the bottom, the field is dominated by the white pixels on the corners: they form diagonal sections downwards. Near the edge itself, the field runs perfectly vertical inside a roughly triangular section. In both cases, an arrow pointing back towards the cell center is only vaguely perpendicular to the true diagonal edge. Near perfect diagonals, the edge distances are just wrong. The distance of edge pixels goes up or right (1), rather than the more logical diagonal 0.707…. The true closest point on the edge is not part of the grid. These fields don't really resolve properly until 6-7 pixels out. You could hide these flaws with e.g. an 8x downscale, but that's 64x more pixels. Either way, you shouldn't expect perfect numerical accuracy from an EDT. Just because it's mathematically separable doesn't mean it's particularly good. In fact, it's only separable because it isn't very good at all. Problem 3 - Gradients In 2D, there is also only one correct answer to the gray case. Consider a diagonal edge, anti-aliased: Thresholding it into black, grey or white, you get: If you now classify the grays as both inside and outside, then the highlighted pixels will be part of both masks. Both the positive and negative field will be exactly zero there, and so will the SDF (P - N): This creates a phantom vertical edge that pushes apart P and N, and causes the average slope to be less than 45º. The field simply has the wrong shape, because gray pixels can be surrounded by other gray pixels. This also explains why TinySDF magically seemed to work despite being so pixelized. The l² gray correction fills in exactly the gaps in the bad (P - N) field where it is zero, and it interpolates towards a symmetrically wrong P and N field on each side. If we instead classify grays as neither inside nor outside, then P and N overlap in the boundary, and it is possible to resolve them into a coherent SDF with a clean 45 degree slope, if you do it right: What seemed like an off-by-one error is actually the right approach in 2D or higher. The subpixel SDF will then be a modified version of this field, where the P and N sides are changed in lock-step to remain mutually consistent. Though we will get there in a roundabout way. Problem 4 - Commuting It's worth pointing out: a subpixel EDT simply cannot commute in 2D. First, consider the data flow of an ordinary EDT: Information from a corner pixel can flow through empty space both when doing X-then-Y and Y-then-X. But information from the horizontal edge pixels can only flow vertically then horizontally. This is okay because the separating lines between adjacent pixels are purely vertical too: the red arrows never 'win'. But if you introduce subpixel shifts, the separating lines can turn: The data flow is still limited to the original EDT pattern, so the edge pixels at the top can only propagate by starting downward. They can only influence adjacent columns if the order is Y-then-X. For vertical edges it's the opposite. That said, this is only a problem on shallow concave curves, where there aren't any corner pixels nearby. The error is that it 'snaps' to the wrong edge point, but only when it is already several pixels away from the edge. 
In that case, the smaller x² term is dwarfed by the much larger y² term, so the absolute error is small after sqrt. The ESDT Knowing all this, here's how I assembled a "true" Euclidean Subpixel Distance Transform. Subpixel offsets To start we need to determine the subpixel offsets. We can still treat level - 0.5 as the signed distance for any gray pixel, and ignore all white and black for now. The tricky part is determining the exact direction of that distance. As an approximation, we can examine the 3x3 neighborhood around each gray pixel and do a least-squares fit of a plane. As long as there is at least one white and one black pixel in this neighborhood, we get a vector pointing towards where the actual edge is. In practice I apply some horizontal/vertical smoothing here using a simple [1 2 1] kernel. The result is numerically very stable, because the originally rasterized image is visually consistent. This logic is disabled for thin creases and spikes, where it doesn't work. Such points are treated as fully masked out, so that neighboring distances propagate there instead. This is needed e.g. for the pointy negative space of a W to come out right. I also implemented a relaxation step that will smooth neighboring vectors if they point in similar directions. However, the effect is quite minimal, and it rounds very sharp corners, so I ended up disabling it by default. The goal is then to do an ESDT that uses these shifted positions for the minima, to get a subpixel accurate distance field. P and N junction We saw earlier that only non-masked pixels can have offsets that influence the output (#1). We only have offsets for gray pixels, yet we concluded that gray pixels should be masked out, to form a connected SDF with the right shape (#3). This can't work. SDFs are both the problem and the solution here. Dilating and contracting SDFs is easy: add or subtract a constant. So you can expand both P and N fields ahead of time geometrically, and then undo it numerically. This is done by pushing their respective gray pixel centers in opposite directions, by half a pixel, on top of the originally calculated offset: This way, they can remain masked in in both fields, but are always pushed between 0 and 1 pixel inwards. The distance between the P and N gray pixel offsets is always exactly 1, so the non-zero overlap between P and N is guaranteed to be exactly 1 pixel wide everywhere. It's a perfect join anywhere we sample it, because the line between the two ends crosses through a pixel center. When we then calculate the final SDF, we do the opposite, shifting each by half a pixel and trimming it off with a max: SDF = max(0, P - 0.5) - max(0, N - 0.5) Only one of P or N will be > 0.5 at a time, so this is exact. To deal with pure black/white edges, I treat any black neighbor of a white pixel (horizontal or vertical only) as gray with a 0.5 pixel offset (before P/N dilation). No actual blurring needs to happen, and the result is numerically exact minus epsilon, which is nice. ESDT state The state for the ESDT then consists of remembering a signed X and Y offset for every pixel, rather than the squared distance. These are factored into the distance and threshold calculations, separated into its proper parallel and orthogonal components, i.e. X/Y or Y/X. Unlike an EDT, each X or Y pass has to be aware of both axes. But the algorithm is mostly unchanged otherwise, here X-then-Y. The X pass: At the start, only gray pixels have offsets and they are all in the range -1…1 (exclusive). 
With each pass of ESDT, a winning minima's offsets propagate to its affecting range, tracking the total distance (Δx, Δy) (> 1). At the end, each pixel's offset points to the nearest edge, so the squared distance can be derived as Δx² + Δy². The Y pass: You can see that the vertical distances in the top-left are practically vertical, and not oriented perpendicular to the contour on average: they have not had a chance to propagate horizontally. But they do factor in the vertical subpixel offset, and this is the dominant component. So even without correction it still creates a smooth SDF with a surprisingly small error. Fix ups The commutativity errors are all biased positively, meaning we get an upper bound of the true distance field. You could take the min of X then Y and Y then X. This would re-use all the same prep and would restore rotation-independence at the cost of 2x as many ESDTs. You could try X then Y then X at 1.5x cost with some hacks. But neither would improve diagonal areas, which were still busted in the original EDT. Instead I implemented an additional relaxation pass. It visits every pixel's target, and double checks whether one of the 4 immediate neighbors (with subpixel offset) isn't a better solution: It's a good heuristic because if the target is >1px off there is either a viable commutative propagation path, or you're so far away the error is negligible. It fixes up the diagonals, creating tidy lines when the resolution allows for it: You could take this even further, given that you know the offsets are supposed to be perpendicular to the glyph contour. You could add reprojection with a few dot products here, but getting it to not misfire on edge cases would be tricky. While you can tell the unrelaxed offsets are wrong when visualized, and the fixed ones are better, the visual difference in the output glyphs is tiny. You need to blow glyphs up to enormous size to see the difference side by side. So it too is disabled by default. The diagonals in the original EDT were wrong too and you could barely tell. Emoji An emoji is generally stored as a full color transparent PNG or SVG. The ESDT can be applied directly to its opacity mask to get an SDF, so no problem there. There are an extremely rare handful of emoji with semi-transparent areas, but you can get away with making those solid. For this I just use a filter that detects '+' shaped arrangements of pixels that have (almost) the same transparency level. Then I dilate those by 3x3 to get the average transparency level in each area. Then I divide by it to only keep the anti-aliased edges transparent. The real issue is blending the colors at the edges, when the emoji is being rendered and scaled. The RGB color of transparent pixels is undefined, so whatever values are there will blend into the surrounding pixels, e.g. creating a subtle black halo: Not Premultiplied Premultiplied A common solution is premultiplied alpha. The opacity is baked into the RGB channels as (R * A, G * A, B * A, A), and transparent areas must be fully transparent black. This allows you to use a premultiplied blend mode where the RGB channels are added directly without further scaling, to cancel out the error. But the alpha channel of an SDF glyph is dynamic, and is independent of the colors, so it cannot be premultiplied. We need valid color values even for the fully transparent areas, so that up- or downscaling is still clean. Luckily the ESDT calculates X and Y offsets which point from each pixel directly to the nearest edge. 
We can use them to propagate the colors outward in a single pass, filling in the entire background. It doesn't need to be very accurate, so no filtering is required. RGB channel Output The result looks pretty great. At normal sizes, the crisp edge hides the fact that the interior is somewhat blurry. Emoji fonts are supported via the underlying ab_glyph library, but are too big for the web (10MB+). So you can just load .PNGs on demand instead, at whatever resolution you need. Hooking it up to the 2D canvas to render native system emoji is left as an exercise for the reader. Use.GPU does not support complex Unicode scripts or RTL text yet—both are a can of worms I wish to offload too—but it does support composite emoji like "pirate flag" (white flag + skull and crossbones) or "male astronaut" (astronaut + man) when formatted using the usual Zero-Width Joiners (U+200D) or modifiers. Shading Finally, a note on how to actually render with SDFs, which is more nuanced than you might think. I pack all the SDF glyphs into an atlas on-demand, the same I use elsewhere in Use.GPU. This has a custom layout algorithm that doesn't backtrack, optimized for filling out a layout at run-time with pieces of a similar size. Glyphs are rasterized at 1.5x their normal font size, after rounding up to the nearest power of two. The extra 50% ensures small fonts on low-DPI displays still use a higher quality SDF, while high-DPI displays just upscale that SDF without noticeable quality loss. The rounding ensures similar font sizes reuse the same SDFs. You can also override the detail independent of font size. To determine the contrast factor to draw an SDF, you generally use screen-space derivatives. There are good and bad ways of doing this. Your goal is to get a ratio of SDF pixels to screen pixels, so the best thing to do is give the GPU the coordinates of the SDF texture pixels, and ask it to calculate the difference for that between neighboring screen pixels. This works for surfaces in 3D at an angle too. Bad ways of doing this will instead work off relative texture coordinates, and introduce additional scaling factors based on the view or atlas size, when they are all just supposed to cancel out. As you then adjust the contrast of an SDF to render it, it's important to do so around the zero-level. The glyph's ideal vector shape should not expand or contract as you scale it. Like TinySDF, I use 75% gray as the zero level, so that more SDF range is allocated to the outside than the inside, as dilating glyphs is much more common than contraction. At the same time, a pixel whose center sits exactly on the zero level edge is actually half inside, half outside, i.e. 50% opaque. So, after scaling the SDF, you need to add 0.5 to the value to get the correct opacity for a blend. This gives you a mathematically accurate font rendering that approximates convolution with a pixel-sized circle or box. But I go further. Fonts were not invented for screens, they were designed for paper, with ink that bleeds. Certain renderers, e.g. MacOS, replicate this effect. The physical bleed distance is relatively constant, so the larger the font, the smaller the effect of the bleed proportionally. I got the best results with a 0.25 pixel bleed at 32px or more. For smaller sizes, it tapers off linearly. 
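Put together, the shading math is roughly the following sketch — plain TypeScript standing in for the actual WGSL, with the decode of the SDF texel and the exact bleed taper treated as assumptions:

```typescript
// Illustrative only: compute glyph opacity from a sampled SDF value.
// `sdf` is assumed to already be decoded into a signed distance in SDF texels
// (positive = inside); `texelsPerPixel` comes from screen-space derivatives
// of the atlas coordinates; `fontSize` is in screen pixels.
function sdfOpacity(sdf: number, texelsPerPixel: number, fontSize: number): number {
  // Contrast: convert SDF texels to screen pixels
  const distance = sdf / texelsPerPixel;

  // Paper-like ink bleed: ~0.25 px at 32px and up, tapering off linearly below
  const bleed = 0.25 * Math.min(1, fontSize / 32);

  // A pixel centered exactly on the edge is half covered, hence the +0.5
  const opacity = distance + bleed + 0.5;
  return Math.min(1, Math.max(0, opacity));
}
```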
When you zoom out blocks of text, they get subtly fatter instead of thinning out, and this is actually a great effect when viewing document thumbnails, where lines of text become a solid mass at the point where the SDF resolution fails anyway. In Use.GPU I prefer to use gamma correct, linear RGB color, even for 2D. What surprised me the most is just how unquestionably superior this looks. Text looks rock solid and readable even at small sizes on low-DPI. Because the SDF scales, there is no true font hinting, but it really doesn't need it, it would just be a nice extra. Presumably you could track hinted points or edges inside SDF glyphs and then do a dynamic distortion somehow, but this is an order of magnitude more complex than what it is doing now, which is splat a contrasted texture on screen. It does have snapping you can turn on, which avoids jiggling of individual letters. But if you turn it off, you get smooth subpixel everything: I was always a big fan of the 3x1 subpixel rendering used on color LCDs (i.e. ClearType and the like), and I was sad when it was phased out due to the popularity of high-DPI displays. But it turns out the 3x res only offered marginal benefits... the real improvement was always that it had a custom gamma correct blend mode, which is a thing a lot of people still get wrong. Even without RGB subpixels, gamma correct AA looks great. Converting the entire desktop to Linear RGB is also not going to happen in our lifetime, but I really want it more now. The "blurry text" that some people associate with anti-aliasing is usually just text blended with the wrong gamma curve, and without an appropriate bleed for the font in question. * * * If you want to make SDFs from existing input data, subpixel accuracy is crucial. Without it, fully crisp strokes actually become uneven, diagonals can look bumpy, and you cannot make clean dilated outlines or shadows. If you use an EDT, you have to start from a high resolution source and then downscale away all the errors near the edges. But if you use an ESDT, you can upscale even emoji PNGs with decent results. It might seem pretty obvious in hindsight, but there is a massive difference between getting it sort of working, and actually getting all the details right. There were many false starts and dead ends, because subpixel accuracy also means one bad pixel ruins it. In some circles, SDF text is an old party trick by now... but a solid and reliable implementation is still a fair amount of work, with very little to go off for the harder parts. By the way, I did see if I could use voronoi techniques directly, but in terms of computation it is much more involved. Pretty tho: The ESDT is fast enough to use at run-time, and the implementation is available as a stand-alone import for drop-in use. This post started as a single live WebGPU diagram, which you can play around with. The source code for all the diagrams is available too.

a year ago 41 votes

More in programming

AI: Where in the Loop Should Humans Go?

This is a re-publishing of a blog post I originally wrote for work, but wanted on my own blog as well.

AI is everywhere, and its impressive claims are leading to rapid adoption. At this stage, I’d qualify it as charismatic technology—something that under-delivers on what it promises, but promises so much that the industry still leverages it because we believe it will eventually deliver on these claims. This is a known pattern. In this post, I’ll use the example of automation deployments to go over known patterns and risks in order to provide you with a list of questions to ask about potential AI solutions. I’ll first cover a short list of base assumptions, and then borrow from scholars of cognitive systems engineering and resilience engineering to list said criteria. At the core of it is the idea that when we say we want humans in the loop, it really matters where in the loop they are.

My base assumptions

The first thing I’m going to say is that we currently do not have Artificial General Intelligence (AGI). I don’t care whether we have it in 2 years or 40 years or never; if I’m looking to deploy a tool (or an agent) that is supposed to do stuff to my production environments, it has to be able to do it now. I am not looking to be impressed, I am looking to make my life and the system better.

Another mechanism I want you to keep in mind is something called the context gap. In a nutshell, any model or automation is constructed from a narrow definition of a controlled environment, which can expand as it gains autonomy, but remains limited. By comparison, people in a system start from a broad situation and narrow definitions down and add constraints to make problem-solving tractable. One side starts from a narrow context, and one starts from a wide one—so in practice, with humans and machines, you end up seeing a type of teamwork where one constantly updates the other:

    The optimal solution of a model is not an optimal solution of a problem unless the model is a perfect representation of the problem, which it never is. — Ackoff (1979, p. 97)

Because of that mindset, I will disregard all arguments of “it’s coming soon” and “it’s getting better real fast” and instead frame what current LLM solutions are shaped like: tools and automation. As it turns out, there are lots of studies about ergonomics, tool design, collaborative design, where semi-autonomous components fit into sociotechnical systems, and how they tend to fail.

Additionally, I’ll borrow from the framing used by people who study joint cognitive systems: rather than looking only at the abilities of what a single person or tool can do, we’re going to look at the overall performance of the joint system. This is important because if you have a tool that is built to be operated like an autonomous agent, you can get weird results in your integration. You’re essentially building an interface for the wrong kind of component—like using a joystick to ride a bicycle.

This lens will assist us in establishing general criteria about where the problems will likely be without having to test for every single one and evaluate them on benchmarks against each other.

Questions you'll want to ask

The following list of questions is meant to act as reminders—abstracting away all the theory from research papers you’d need to read—to let you think through some of the important stuff your teams should track, whether they are engineers using code generation, SREs using AIOps, or managers and execs making the call to adopt new tooling.
Are you better even after the tool is taken away?

An interesting warning comes from studying how LLMs function as learning aides. The researchers found that people who trained using LLMs tended to fail tests more when the LLMs were taken away compared to people who never studied with them, except if the prompts were specifically (and successfully) designed to help people learn. Likewise, it’s been known for decades that when automation handles standard challenges, the operators expected to take over when it reaches its limits end up worse off and generally require more training to keep the overall system performant.

While people can feel like they’re getting better and more productive with tool assistance, it doesn’t necessarily follow that they are learning or improving. Over time, there’s a serious risk that your overall system’s performance will be limited to what the automation can do—because without proper design, people keeping the automation in check will gradually lose the skills they had developed prior.

Are you augmenting the person or the computer?

Traditionally successful tools tend to work on the principle that they improve the physical or mental abilities of their operator: search tools let you go through more data than you could on your own and shift demands to external memory, a bicycle more effectively transmits force for locomotion, a blind spot alert on your car can extend your ability to pay attention to your surroundings, and so on.

Automation that augments users therefore tends to be easier to direct, and sort of extends the person’s abilities, rather than acting based on preset goals and framing. Automation that augments a machine tends to broaden the device’s scope and control by leveraging some known effects of their environment and successfully hiding them away. For software folks, an autoscaling controller is a good example of the latter.

Neither is fundamentally better nor worse than the other—but you should figure out what kind of automation you’re getting, because they fail differently. Augmenting the user implies that they can tackle a broader variety of challenges effectively. Augmenting the computer tends to mean that when the component reaches its limits, the challenges are worse for the operator.

Is it turning you into a monitor rather than helping build an understanding?

If your job is to look at the tool go and then say whether it was doing a good or bad job (and maybe take over if it does a bad job), you’re going to have problems. It has long been known that people adapt to their tools, and automation can create complacency. Self-driving cars that generally self-drive themselves well but still require a monitor are not effectively monitored.

Instead, having AI that supports people or adds perspectives to the work an operator is already doing tends to yield better long-term results than patterns where the human learns to mostly delegate and focus elsewhere.

(As a side note, this is why I tend to dislike incident summarizers. Don’t make it so people stop trying to piece together what happened! Instead, I prefer seeing tools that look at your summaries to remind you of items you may have forgotten, or that look for linguistic cues that point to biases or reductive points of view.)

Does it pigeonhole what you can look at?

When evaluating a tool, you should ask questions about where the automation lands:

- Does it let you look at the world more effectively?
- Does it tell you where to look in the world?
- Does it force you to look somewhere specific?
- Does it tell you to do something specific?
- Does it force you to do something?

This is a bit of a hybrid between “Does it extend you?” and “Is it turning you into a monitor?” The five questions above let you figure that out. As the tool becomes a source of assertions or constraints (rather than a source of information and options), the operator becomes someone who interacts with the world from inside the tool rather than someone who interacts with the world with the tool’s help. The tool stops being a tool and becomes a representation of the whole system, which means whatever limitations and internal constraints it has are then transmitted to your users.

Is it a built-in distraction?

People tend to do multiple tasks over many contexts. Some automated systems are built with alarms or alerts that require stealing someone’s focus, and unless they truly are the most critical thing their users could give attention to, they are going to be an annoyance that can lower the effectiveness of the overall system.

What perspectives does it bake in?

Tools tend to embody a given perspective. For example, AIOps tools that are built to find a root cause will likely carry the conceptual framework behind root causes in their design. More subtly, these perspectives are sometimes hidden in the type of data you get: if your AIOps agent can only see alerts, your telemetry data, and maybe your code, it will rarely be a source of suggestions on how to improve your workflows because that isn’t part of its world.

In roles that are inherently about pulling context from many disconnected sources, how on earth is automation going to make the right decisions? And moreover, who’s accountable when it makes a poor decision on incomplete data? Surely not the buyer who installed it!

This is also one of the many ways in which automation can reinforce biases—not just based on what is in its training data, but also based on its own structure and what inputs were considered most important at design time. The tool can itself become a keyhole through which your conclusions are guided.

Is it going to become a hero?

A common trope in incident response is heroes—the few people who know everything inside and out, and who end up being necessary bottlenecks to all emergencies. They can’t go away for vacation, they’re too busy to train others, they develop blind spots that nobody can fix, and they can’t be replaced. To avoid this, you have to maintain a continuous awareness of who knows what, and cross-train each other to always have enough redundancy.

If you have a team of multiple engineers and you add AI to it, having it do all of the tasks of a specific kind means it becomes a de facto hero to your team. If that’s okay, be aware that any outages or dysfunction in the AI agent would likely have no practical workaround. You will essentially have offshored part of your ops.

Do you need it to be perfect?

What a thing promises to be is never what it is—otherwise AWS would be enough, and Kubernetes would be enough, and JIRA would be enough, and the software would work fine with no one needing to fix things. That just doesn’t happen. Ever. Even if it’s really, really good, it’s gonna have outages and surprises, and it’ll mess up here and there, no matter what it is. We aren’t building an omnipotent computer god, we’re building imperfect software.

You’ll want to seriously consider whether the tradeoffs you’d make in terms of quality and cost are worth it, and this is going to be a case-by-case decision.
Just be careful not to fix the problem by adding a human in the loop that acts as a monitor!

Is it doing the whole job or a fraction of it?

We don’t notice major parts of our own jobs because they feel natural. A classic pattern here is one of AIs getting better at diagnosing patients, except the benchmarks are usually run on a patient chart where most of the relevant observations have already been made by someone else. Similarly, we often see AI pass a test with flying colors while it still can’t be productive at the job the test represents.

People in general have adopted a model of cognition based on information processing that’s very similar to how computers work (get data in, think, output stuff, rinse and repeat), but for decades, there have been multiple disciplines that looked harder at situated work and cognition, moving past that model. Key patterns of cognition are not just in the mind, but are also embedded in the environment and in the interactions we have with each other.

Be wary of acquiring a solution that solves what you think the problem is rather than what it actually is. We routinely show we don’t accurately know the latter.

What if we have more than one?

You probably know how straightforward it can be to write a toy project on your own, with full control of every refactor. You probably also know how this stops being true as your team grows.

As it stands today, a lot of AI agents are built within a snapshot of the current world: one or few AI tools added to teams that are mostly made up of people. By analogy, this would be like everyone selling you a computer assuming it were the first and only electronic device inside your household. Problems arise when you go beyond these assumptions: maybe AI that writes code has to go through a code review process, but what if that code review is done by another unrelated AI agent? What happens when you get to operations and common mode failures impact components from various teams that all have agents empowered to go fix things to the best of their ability with the available data?

Are they going to clash with people, or even with each other? Humans also have that ability and tend to solve it via processes and procedures, explicit coordination, announcing what they’ll do before they do it, and calling upon each other when they need help. Will multiple agents require something equivalent, and if so, do you have it in place?

How do they cope with limited context?

Some changes that cause issues might be safe to roll back, some not (maybe they include database migrations, maybe it is better to be down than corrupting data), and some may contain changes that rolling back wouldn’t fix (maybe the workload is controlled by one or more feature flags). Knowing what to do in these situations can sometimes be understood from code or release notes, but some situations can require different workflows involving broader parts of the organization.

A risk of automation without context is that if you have situations where waiting or doing little is the best option, then you’ll need to either have automation that requires input to act, or a set of actions to quickly disable multiple types of automation as fast as possible. Many of these may exist at the same time, and it becomes the operators’ jobs to not only maintain their own context, but also maintain a mental model of the context each of these pieces of automation has access to.

The fancier your agents, the fancier your operators’ understanding and abilities must be to properly orchestrate them.
The more surprising your landscape is, the harder it can become to manage with semi-autonomous elements roaming around.

After an outage or incident, who does the learning and who does the fixing?

One way to track accountability in a system is to figure out who ends up having to learn lessons and change how things are done. It’s not always the same people or teams, and generally, learning will happen whether you want it or not.

This is more of a rhetorical question right now, because I expect that in most cases, when things go wrong, whoever is expected to monitor the AI tool is going to have to steer it in a better direction and fix it (if they can); if it can’t be fixed, then the expectation will be that the automation, as a tool, will be used more judiciously in the future. In a nutshell, if the expectation is that your engineers are going to be doing the learning and tweaking, your AI isn’t an independent agent—it’s a tool that cosplays as an independent agent.

Do what you will—just be mindful

All in all, none of the above questions flat out say you should not use AI, nor where exactly in the loop you should put people. The key point is that you should ask that question and be aware that just adding whatever to your system is not going to substitute workers away. It will, instead, transform work and create new patterns and weaknesses. Some of these patterns are known and well-studied. We don’t have to go rushing to rediscover them all through failures as if we were the first to ever automate something.

If AI ever gets so good and so smart that it’s better than all your engineers, it won’t make a difference whether you adopt it only once it’s good. In the meanwhile, these things do matter and have real impacts, so please design your systems responsibly.

If you’re interested to know more about the theoretical elements underpinning this post, the following references—on top of whatever was already linked in the text—might be of interest:

Books:
- Joint Cognitive Systems: Foundations of Cognitive Systems Engineering by Erik Hollnagel
- Joint Cognitive Systems: Patterns in Cognitive Systems Engineering by David D. Woods
- Cognition in the Wild by Edwin Hutchins
- Behind Human Error by David D. Woods, Sidney Dekker, Richard Cook, Leila Johannesen, Nadine Sarter

Papers:
- Ironies of Automation by Lisanne Bainbridge
- The French-Speaking Ergonomists’ Approach to Work Activity by Daniellou
- How in the World Did We Ever Get into That Mode? Mode Error and Awareness in Supervisory Control by Nadine Sarter
- Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis by David D. Woods
- Ten Challenges for Making Automation a “Team Player” in Joint Human-Agent Activity by Gary Klein and David D. Woods
- MABA-MABA or Abracadabra? Progress on Human–Automation Co-ordination by Sidney Dekker
- Managing the Hidden Costs of Coordination by Laura Maguire
- Designing for Expertise by David D. Woods
- The Impact of Generative AI on Critical Thinking by Lee et al.

yesterday 4 votes
AMD YOLO

AMD is sending us the two MI300X boxes we asked for. They are in the mail. It took a bit, but AMD passed my cultural test. I now believe they aren’t going to shoot themselves in the foot on software, and if that’s true, there’s absolutely no reason they should be worth 1/16th of NVIDIA. CUDA isn’t really the moat people think it is, it was just an early ecosystem. tiny corp has a fully sovereign AMD stack, and soon we’ll port it to the MI300X. You won’t even have to use tinygrad proper, tinygrad has a torch frontend now. Either NVIDIA is super overvalued or AMD is undervalued. If the petaflop gets commoditized (tiny corp’s mission), the current situation doesn’t make any sense. The hardware is similar, AMD even got the double throughput Tensor Cores on RDNA4 (NVIDIA artificially halves this on their cards, soon they won’t be able to). I’m betting on AMD being undervalued, and that the demand for AI has barely started. With good software, the MI300X should outperform the H100. In for a quarter million. Long term. It can always dip short term, but check back in 5 years.

yesterday 2 votes
whippet lab notebook: untagged mallocs, bis

Earlier this week I wrote about untagged allocations in Guile and Whippet, without yet having a way to support them. But now I do! Today’s note is about how we can support untagged allocations of a few different kinds in Whippet’s mostly-marking collector.

Why bother supporting untagged allocations at all? Well, if I had my way, I wouldn’t; I would just slog through Guile and fix all uses to be tagged. There are only a finite number of use sites and I could get to them all in a month or so. The problem comes for uses of scm_gc_malloc from outside libguile itself, in C extensions and embedding programs. These users are loath to adapt to any kind of change, and garbage-collection-related changes are the worst. So, somehow, we need to support these users if we are not to break the Guile community.

The problem with scm_gc_malloc, though, is that it is missing an expression of intent, notably as regards tagging. You can use it to allocate an object that has a tag and thus can be traced precisely, or you can use it to allocate, well, anything else. I think we will have to add an API for the tagged case and assume that anything that goes through scm_gc_malloc is requesting an untagged, conservatively-scanned block of memory. Similarly for scm_gc_malloc_pointerless: you could be allocating a tagged object that happens to not contain pointers, or you could be allocating an untagged array of whatever. A new API is needed there too for pointerless untagged allocations.

Recall that the mostly-marking collector can be built in a number of different ways: it can support conservative and/or precise roots, it can trace the heap precisely or conservatively, it can be generational or not, and the collector can use multiple threads during pauses or not. Consider a basic configuration with precise roots. You can make tagged pointerless allocations just fine: the trace function for that tag is just trivial. You would like to extend the collector with the ability to make untagged pointerless allocations, for raw data. How to do this?

Consider first that when the collector goes to trace an object, it can’t use bits inside the object to discriminate between the tagged and untagged cases. Fortunately though, the main space of the mostly-marking collector has one metadata byte for each 16 bytes of payload. Of those 8 bits, 3 are used for the mark (five different states, allowing for future concurrent tracing), two for the precise field-logging write barrier, one to indicate whether the object is pinned or not, and one to indicate the end of the object, so that we can determine object bounds just by scanning the metadata byte array. That leaves 1 bit, and we can use it to indicate untagged pointerless allocations. Hooray!

However there is a wrinkle: when Whippet decides that it should evacuate an object, it tracks the evacuation state in the object itself; the embedder has to provide an implementation of a little state machine, allowing the collector to detect whether an object is forwarded or not, to claim an object for forwarding, to commit a forwarding pointer, and so on. We can’t do that for raw data, because all bit states belong to the object, not the collector or the embedder. So, we have to set the “pinned” bit on the object, indicating that these objects can’t move.

We could in theory manage the forwarding state in the metadata byte, but we don’t have the bits to do that currently; maybe some day. For now, untagged pointerless allocations are pinned.
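As an aside, here is a rough sketch in Python of how a metadata byte with the fields described above might be packed; the names, bit positions, and helpers are hypothetical, for illustration only, and do not reflect Whippet’s actual layout.

# Hypothetical bit assignments for the per-16-bytes metadata byte described
# above; Whippet's real names and positions may differ.
MARK_MASK                = 0b00000111  # 3 bits: mark state (five states in use)
LOG_MASK                 = 0b00011000  # 2 bits: precise field-logging write barrier
PINNED_BIT               = 0b00100000  # object may not be moved
END_BIT                  = 0b01000000  # marks the end of an object, for bounds scans
UNTAGGED_POINTERLESS_BIT = 0b10000000  # the one spare bit: untagged, pointerless data

def mark_untagged_pointerless(meta: int) -> int:
    # Raw data can't carry a forwarding state, so such objects are also pinned.
    return meta | UNTAGGED_POINTERLESS_BIT | PINNED_BIT

def is_untagged_pointerless(meta: int) -> bool:
    return bool(meta & UNTAGGED_POINTERLESS_BIT)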
You might also want to support untagged allocations that contain pointers to other GC-managed objects. In this case you would want these untagged allocations to be scanned conservatively. We can do this, but if we do, it will pin all objects.

Thing is, conservative stack roots are a kind of sweet spot in language run-time design. You get to avoid constraining your compiler, you avoid a class of bugs related to rooting, but you can still support compaction of the heap.

How is this, you ask? Well, consider that you can move any object for which we can precisely enumerate the incoming references. This is trivially the case for precise roots and precise tracing. For conservative roots, we don’t know whether a given edge is really an object reference or not, so we have to conservatively avoid moving those objects. But once you are done tracing conservative edges, any live object that hasn’t yet been traced is fair game for evacuation, because none of its predecessors have yet been visited.

But once you add conservatively-traced objects back into the mix, you don’t know when you are done tracing conservative edges; you could always discover another conservatively-traced object later in the trace, so you have to pin everything.

The good news, though, is that we have gained an easier migration path. I can now shove Whippet into Guile and get it running even before I have removed untagged allocations. Once I have done so, I will be able to allow for compaction / evacuation; things only get better from here.

Also as a side benefit, the mostly-marking collector’s heap-conservative configurations are now faster, because we have metadata attached to objects which allows tracing to skip known-pointerless objects. This regains an optimization that BDW has long had via its GC_malloc_atomic, used in Guile since time out of mind.

With support for untagged allocations, I think I am finally ready to start getting Whippet into Guile itself. Happy hacking, and see you on the other side!

yesterday 2 votes
Creating static map images with OpenStreetMap, Web Mercator, and Pillow

I’ve been working on a project where I need to plot points on a map. I don’t need an interactive or dynamic visualisation – just a static map with coloured dots for each coordinate. I’ve created maps on the web using Leaflet.js, which load map data from OpenStreetMap (OSM) and support zooming and panning – but for this project, I want a standalone image rather than something I embed in a web page. I want to put in coordinates, and get a PNG image back.

This feels like it should be straightforward. There are lots of Python libraries for data visualisation, but it’s not an area I’ve ever explored in detail. I don’t know how to use these libraries, and despite trying I couldn’t work out how to accomplish this seemingly simple task. I made several attempts with libraries like matplotlib and plotly, but I felt like I was fighting the tools. Rather than persist, I wrote my own solution with “lower level” tools.

The key was a page on the OpenStreetMap wiki explaining how to convert lat/lon coordinates into the pixel system used by OSM tiles. In particular, it allowed me to break the process into two steps:

1. Get a “base map” image that covers the entire world
2. Convert lat/lon coordinates into xy coordinates that can be overlaid on this image

Let’s go through those steps.

Get a “base map” image that covers the entire world

Let’s talk about how OpenStreetMap works, and in particular their image tiles. If you start at the most zoomed-out level, OSM represents the entire world with a single 256×256 pixel square. This is the Web Mercator projection, and you don’t get much detail – just a rough outline of the world.

We can zoom in, and this tile splits into four new tiles of the same size. There are twice as many pixels along each edge, and each tile has more detail. Notice that country boundaries are visible now, but we can’t see any names yet.

We can zoom in even further, and each of these tiles split again. There still aren’t any text labels, but the map is getting more detailed and we can see small features that weren’t visible before.

You get the idea – we could keep zooming, and we’d get more and more tiles, each with more detail. This tile system means you can get detailed information for a specific area, without loading the entire world. For example, if I’m looking at street information in Britain, I only need the detailed tiles for that part of the world. I don’t need the detailed tiles for Bolivia at the same time.

OpenStreetMap will only give you 256×256 pixels at a time, but we can download every tile and stitch them together, one-by-one. Here’s a Python script that enumerates all the tiles at a particular zoom level, downloads them, and uses the Pillow library to combine them into a single large image:

#!/usr/bin/env python3
"""
Download all the map tiles for a particular zoom level from OpenStreetMap,
and stitch them into a single image.
"""

import io
import itertools

import httpx
from PIL import Image

zoom_level = 2

width = 256 * 2**zoom_level
height = 256 * (2**zoom_level)

im = Image.new("RGB", (width, height))

for x, y in itertools.product(range(2**zoom_level), range(2**zoom_level)):
    resp = httpx.get(f"https://tile.openstreetmap.org/{zoom_level}/{x}/{y}.png", timeout=50)
    resp.raise_for_status()

    im_buffer = Image.open(io.BytesIO(resp.content))
    im.paste(im_buffer, (x * 256, y * 256))

out_path = f"map_{zoom_level}.png"
im.save(out_path)
print(out_path)

The higher the zoom level, the more tiles you need to download, and the larger the final image will be.
I ran this script up to zoom level 6, and this is the data involved:

Zoom level   Number of tiles   Pixels        File size
0            1                 256×256       17.1 kB
1            4                 512×512       56.3 kB
2            16                1024×1024     155.2 kB
3            64                2048×2048     506.4 kB
4            256               4096×4096     2.7 MB
5            1,024             8192×8192     13.9 MB
6            4,096             16384×16384   46.1 MB

I can just about open that zoom level 6 image on my computer, but it’s struggling. I didn’t try opening zoom level 7 – that includes 16,384 tiles, and I’d probably run out of memory.

For most static images, zoom level 3 or 4 should be sufficient – I ended up using a base map from zoom level 4 for my project. It takes a minute or so to download all the tiles from OpenStreetMap, but you only need to request it once, and then you have a static image you can use again and again.

This is a particularly good approach if you want to draw a lot of maps. OpenStreetMap is provided for free, and we want to be a respectful user of the service. Downloading all the map tiles once is more efficient than making repeated requests for the same data.

Overlay lat/lon coordinates on this base map

Now we have an image with a map of the whole world, we need to overlay our lat/lon coordinates as points on this map.

I found instructions on the OpenStreetMap wiki which explain how to convert GPS coordinates into a position on the unit square, which we can in turn add to our map. They outline a straightforward algorithm, which I implemented in Python:

import math

def convert_gps_coordinates_to_unit_xy(
    *, latitude: float, longitude: float
) -> tuple[float, float]:
    """
    Convert GPS coordinates to positions on the unit square, which
    can be plotted on a Web Mercator projection of the world.

    This expects the coordinates to be specified in **degrees**.

    The result will be (x, y) coordinates:

    - x will fall in the range (0, 1).
      x=0 is the left (180° west) edge of the map.
      x=1 is the right (180° east) edge of the map.
      x=0.5 is the middle, the prime meridian.

    - y will fall in the range (0, 1).
      y=0 is the top (north) edge of the map, at 85.0511 °N.
      y=1 is the bottom (south) edge of the map, at 85.0511 °S.
      y=0.5 is the middle, the equator.
    """
    # This is based on instructions from the OpenStreetMap Wiki:
    # https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames#Example:_Convert_a_GPS_coordinate_to_a_pixel_position_in_a_Web_Mercator_tile
    # (Retrieved 16 January 2025)

    # Convert the coordinate to the Web Mercator projection
    # (https://epsg.io/3857)
    #
    #     x = longitude
    #     y = arsinh(tan(latitude))
    #
    x_webm = longitude
    y_webm = math.asinh(math.tan(math.radians(latitude)))

    # Transform the projected point onto the unit square
    #
    #     x = 0.5 + x / 360
    #     y = 0.5 - y / 2π
    #
    x_unit = 0.5 + x_webm / 360
    y_unit = 0.5 - y_webm / (2 * math.pi)

    return x_unit, y_unit

Their documentation includes a worked example using the coordinates of the Hachiko Statue. We can run our code, and check we get the same results:

>>> convert_gps_coordinates_to_unit_xy(latitude=35.6590699, longitude=139.7006793)
(0.8880574425, 0.39385379958274735)

Most users of OpenStreetMap tiles will use these unit positions to select the tiles they need, and then download those images – but we can also position these points directly on the global map. I wrote some more Pillow code that converts GPS coordinates to these unit positions, scales those unit positions to the size of the entire map, then draws a coloured circle at each point on the map.
Here’s the code:

from PIL import Image, ImageDraw

gps_coordinates = [
    # Hachiko Memorial Statue in Tokyo
    {"latitude": 35.6590699, "longitude": 139.7006793},
    # Greyfriars Bobby in Edinburgh
    {"latitude": 55.9469224, "longitude": -3.1913043},
    # Fido Statue in Tuscany
    {"latitude": 43.955101, "longitude": 11.388186},
]

im = Image.open("base_map.png")
draw = ImageDraw.Draw(im)

for coord in gps_coordinates:
    x, y = convert_gps_coordinates_to_unit_xy(**coord)

    radius = 32

    draw.ellipse(
        [
            x * im.width - radius,
            y * im.height - radius,
            x * im.width + radius,
            y * im.height + radius,
        ],
        fill="red",
    )

im.save("map_with_dots.png")

and here’s the map it produces:

The nice thing about writing this code in Pillow is that it’s a library I already know how to use, and so I can customise it if I need to. I can change the shape and colour of the points, or crop to specific regions, or add text to the image. I’m sure more sophisticated data visualisation libraries can do all this, and more – but I wouldn’t know how.

The downside is that if I need more advanced features, I’ll have to write them myself. I’m okay with that – trading sophistication for simplicity. I didn’t need to learn a complex visualization library – I was able to write code I can read and understand. In a world full of AI-generated code, writing something I know I understand feels more important than ever.

[If the formatting of this post looks odd in your feed reader, visit the original article]

yesterday 4 votes
Introducing the blogroll

This website has a new section: blogroll.opml! A blogroll is a list of blogs - a lightweight way of people recommending other people’s writing on the indieweb.

What it includes

The blogs that I included are just sampled from my many RSS subscriptions that I keep in my Feedbin reader. I’m subscribed to about 200 RSS feeds, the majority of which are dead or only publish once a year. I like that about blogs, that there’s no expectation of getting a post out every single day, like there is in more algorithmically-driven media. If someone who I interacted with on the internet years ago decides to restart their writing, that’s great! There’s no reason to prune all the quiet feeds.

The picks are oriented toward what I’m into: niches, blogs that have a loose topic but don’t try to be general-interest, people with distinctive writing. If you import all of the feeds into your RSS reader, you’ll probably end up unsubscribing from some of them because some of the experimental electric guitar design or bonsai news is not what you’re into. Seems fine, or you’ll discover a new interest!

How it works

Ruben Schade figured out a brilliant way to show blogrolls and I copied him. Check out his post on styling OPML and RSS with XSLT to XHTML for how it works.

My only additions to that scheme were making the blogroll page blend into the rest of the website by using an include tag with Jekyll to add the basic site skeleton, and adding a link with the download attribute to provide a simple way to download the OPML file.

Oddly, if you try to save the OPML page using Save as… in Firefox, Firefox will save the transformed output via the XSLT, rather than the raw source code. XSLT is such an odd and rare part of the web ecosystem, I had to use it.

yesterday 1 votes