orlp.net - Blog Archive
It seems that in 2025 a lot of people fall into one of two camps when it comes to AI: skeptic or fanatic. The skeptic thinks AI sucks, that it's overhyped, that it only ever parrots nonsense and that it will all blow over soon. The fanatic thinks general human-level intelligence is just around the corner, and that AI will solve almost all our problems. I hope my title is sufficiently ambiguous to attract both camps. The fanatic will be outraged, ready to jump into the fray to point out why AI isn't or won't stay bad. The skeptic will feel validated, and will be eager to read more reasons as to why AI sucks.

I'm neither a skeptic nor a fanatic. I see AI more neutrally, as a tool, and from that viewpoint I make the following two observations:

AI is bad. It is often incorrect, expensive, racist, trained on data without knowledge or consent, environmentally unfriendly, disruptive to society, etc.
AI is useful. Despite the above shortcomings there are tasks for which AI is cheap and effective.

I'm no seer; perhaps AI will improve, become more accurate, less biased, cheaper, trained on open-access data, use less electricity, etc. Or perhaps we have plateaued in performance, and there is no political or economic goodwill to address any of the other issues, nor will there be. However, even if AI does not improve on any of the above metrics, it will still be useful, and I hope to show you in this article why. Hence my point: bad AI is here to stay. If you agree with me on this, I hope you'll also agree with me that we have to stop pretending AI is useless and start taking it and its problems seriously.

A formula for query cost

Suppose I am a human with some kind of question that can be answered. I know AI could potentially help me with this question, but I wonder if it's worth it or if I should not use it at all. To help with this we can quantify the risk associated with any potential method of answering the question:

$$\mathrm{Risk}_\mathrm{AI} = \mathrm{Cost(query)} + (1 - P(\mathrm{success})) \cdot \mathrm{Cost(bad)}$$

That is, the risk of using any particular method is the cost associated with the method plus the cost of the consequences of a bad answer multiplied by the probability of failure. Here 'Cost' is a highly multidimensional object, which can consist of but is not limited to: time, money, environmental impact, ethical concerns, etc.

In a lot of cases, however, we don't have to blindly trust the answer: we can verify it. In these cases the consequence of a bad answer is that you're left in the exact same scenario as before trying, except knowing that the AI is of no use. In some scenarios where the AI is non-deterministic it might be worth it to try again as well, but let's assume for now that you'd have to switch method. In this case the risk is:

$$\mathrm{Risk}_\mathrm{AI} = \mathrm{Cost(query)} + \mathrm{Cost(verify)} + (1 - P(\mathrm{success})) \cdot {\mathrm{Risk}}_\mathrm{Other}$$

The cost of a query is usually fairly fixed and known, and although verification cost can vary drastically from task to task, I'd argue that in most cases the cost of verification is also fairly predictable and known. This makes the risk formula applicable in a lot of scenarios, if you have a good idea of the chance of success. The latter, however, can usually only be established empirically, so for one-shot queries without having done any similar queries in the past it can be hard to evaluate whether trying AI is a good idea before doing so.
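To make the formula concrete, here is a minimal sketch in Python of the verifiable-answer case (my own illustration with made-up numbers, not from the original article):

def risk(cost_query, cost_verify, p_success, risk_fallback):
    # Risk = query cost + verification cost + chance of failure * fallback risk.
    return cost_query + cost_verify + (1 - p_success) * risk_fallback

# Hypothetical numbers: a cheap AI query, quick verification, a 50% success
# rate, and an expensive fallback (doing the task fully by hand).
print(risk(cost_query=1, cost_verify=2, p_success=0.5, risk_fallback=60))  # 33.0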
There is one more expansion to the formula I'd like to make before we can look at some examples, and that is to the definition of a successful answer:

$$P(\mathrm{success}) = P(\mathrm{correct} \cap \mathrm{relevant})$$

I define a successful answer as one that is both correct and relevant. For example "1 + 1 = 2" might be a correct answer, but irrelevant if we asked about anything else. Relevance is always subjective, but often the correctness of an answer is as well - I'm not assuming here that all questions are about objective facts.

Cheap and effective AI queries

Because AIs are fallible, usually the biggest cost is in fact the time needed for a human to verify the answer as correct and relevant (or the cost of the consequences if left unverified). However, I've noticed a real asymmetry between these two properties when it comes to AIs:

AIs often give incorrect answers. Worse, they will do so confidently, forcing you to waste time checking their answer instead of them simply stating that they don't know for sure.
AIs almost never give irrelevant answers. If I ask about cheese, the probability a modern AI starts talking about cars is very low.

With this in mind I identify five general categories of query for which even bad AI is useful, either by massively reducing or eliminating this verification cost or by leaning on the strong relevance of AI answers:

Inspiration, where $\operatorname{Cost}(\mathrm{bad}) \approx 0$,
Creative, where $P(\mathrm{correct}) \approx 1$,
Planning, where $P(\mathrm{correct}) = P(\mathrm{relevant}) = 1$,
Retrieval, where $P(\mathrm{correct}) \approx P(\mathrm{relevant})$, and
Objective, where $P(\mathrm{relevant}) = 1$ and correctness verification cost is low.

Let's go over them one by one and look at some examples.

Inspiration ($\operatorname{Cost}(\mathrm{bad}) \approx 0$)

In this category are the queries where the consequences of a wrong answer are (near) zero. Informally speaking, "it can't hurt to try". In my experience these kinds of queries tend to be the ones where you are looking for something but don't know exactly what; you'll know it when you see it. For example:

"I have leeks, eggs and minced meat in the fridge, as well as a stocked pantry with non-perishable staples. Can you suggest me some dishes I can make with this for a dinner?"
"What kind of fun activities can I do with a budget of $100 in New York?"
"Suggest some names for a Python function that finds the smallest non-negative number in a list."
"The user wrote this partial paragraph on their phone, suggest three words that are most likely to follow for a quick typing experience."
"Give me 20 synonyms of or similar words to 'good'."

I think the last query highlights where AI shines or falls for this kind of query. The more localized and personalized your question is, the better the AI will do compared to an alternative. For simple synonyms you can usually just look up the word on a dedicated synonym site, as millions of other people have also wondered the same thing. But the exact contents of your fridge or the exact Python function you're writing are rather unique to you.

Creative ($P(\mathrm{correct}) \approx 1$)

In this category are the queries where there are (almost) no wrong answers. The only thing that really matters is the relevance of the answer, and as I mentioned before, I think AIs are pretty good at being relevant.
Examples of queries like these are:

"Draw me an image of a polar bear using a computer."
"Write and perform for me a rock ballad about gnomes on tiny bicycles."
"Rephrase the following sentence to be more formal."
"Write a poem to accompany my Sinterklaas gift."

This category does have a controversial aspect to it: it is 'soulless', inhuman. Usually if there are no wrong answers we expect the creator to use this opportunity to express their inner thoughts, ideas, experiences and emotions to evoke them in others. If an AI generates art it is not viewed as genuine, even if it evokes the same emotions in those ignorant of the art's source, because the human-to-human connection is lost. Current AI models have no inner thoughts, ideas, experiences or emotions, at least not in a way I recognize them. I think it's fine to use AI art in places where it would otherwise be meaningless (e.g. your corporate presentation slides), fine for humans to use AI-assisted art tools to express themselves, but ultimately defeating the point of art if used as a direct substitute.

In the Netherlands we celebrate Sinterklaas, which is, roughly speaking, Santa Claus (except we also have Santa Claus, so our children double-dip during the gifting season). Traditionally, gifts from Sinterklaas come accompanied by poems describing the gift and the receiver in a humorous way. It is quite common nowadays, albeit viewed as lazy, to generate such a poem using AI. What's interesting is that this practice long predates modern LLMs: the poems have such a fixed structure that poem generators have existed for a long time. The earliest reference I can find is the 1984 MS-DOS program "Sniklaas". So people being lazy in supposedly heartfelt art is nothing new.

Planning ($P(\mathrm{correct}) = P(\mathrm{relevant}) = 1$)

This is a more restrictive form of creativity, where irrelevant answers are absolutely impossible. This often requires some modification of the AI output generation method, where you restrict the output to the valid subdomain (for example yes / no, or binary numbers, etc). However, this is often trivial if you actually have access to the raw model, by e.g. masking out invalid outputs, or if you are working with a model which outputs the answer directly rather than in natural language or a stream of tokens. One might think in such a restrictive scenario there would be no useful queries, but this isn't true. The quality of the answer with respect to some (complex) metric might still vary, and AIs might be far better than traditional methods at navigating such domains. For example:

"Here is the schema of my database, a SQL query, a small sample of the data and 100 possible query plans. Which query plan seems most likely to execute the fastest? Take into account likely assumptions based on column names and these small data samples."
"What follows is a piece of code. Reformat the code, placing whitespace to maximize readability, while maintaining the exact same syntax tree as per this EBNF grammar."
"Re-order this set of if-else conditions in my code based on your intuition to minimize the expected number of conditions that need to be checked."
"Simplify this math expression using the following set of rewrite rules."

Retrieval ($P(\mathrm{correct}) \approx P(\mathrm{relevant})$)

I define retrieval queries as those where the correctness of the answer depends (almost) entirely on its relevance.
I'm including classification tasks in this category as well, as one can view classification as retrieval of the class from the set of classes (or for binary classification, retrieval of positive samples from a larger set). Then, as long as the cost of verification is low (e.g. a quick glance at a result by a human to see if it interests them), or the consequences of not verifying an irrelevant answer are minimal, AIs can be excellent at this. For example:

"Here are 1000 reviews of a restaurant, which ones are overall positive? Which ones mention unsanitary conditions?"
"Find me pictures of my dog in my photo collection."
"What are good data structures for maintaining a list of events with dates and quickly counting the number of events in a specified period of time?"
"I like Minecraft, can you suggest me some similar games?"
"Summarise this 200 page government proposal."
"Which classical orchestral piece starts like 'da da da daaaaa'?"

Objective ($P(\mathrm{relevant}) = 1$, low verification cost)

If a problem has an objective answer which can be verified, the relevance of the answer doesn't really matter, or arguably doesn't even make sense as a concept. Thus in these cases I'll define $P(\mathrm{relevant}) = 1$ and leave the cost of verification entirely to correctness. AIs are often incorrect, but not always, so if the primary cost is verification and verification can be done very cheaply or entirely automatically without error, AIs can still be useful despite their fallibility.

"What is the mathematical property where a series of numbers can only go up called?"
"Identify the car model in this photo."
"I have a list of all Unicode glyphs which are commonly confused with other letters. Can you write an efficient function returning a boolean value that returns true for values in the list but false for all other code points?"
"I formalized this mathematical conjecture in Lean. Can you help me write a proof for it?"

In a way this category is reminiscent of the $P = NP$ problem. If you have an efficient verification algorithm, is finding solutions still hard in general? The answer seems to be yes, yet the proof eludes us. However, this is only true in general. For specific problems it might very well be possible to use AI to generate provably correct solutions with high probability, even though the search space is far too large or too complicated for a traditional algorithm.

Conclusion

Out of the five identified categories, I consider inspiration and retrieval queries to be the strongest use-case for AI, where often there is no alternative at all, besides an expensive and slow human that would rather be doing something else. Relevance is highly subjective, complex and fuzzy, which AI handles much better than traditional algorithms. Planning and objective queries are more niche, but will absolutely see use-cases for AI that are hard to replace. Creative queries are something I think AI is really good at, while simultaneously being the most dangerous and least useful category. Art, creativity and human-to-human connections are in my opinion some of the most fundamental aspects of human society, and I think it is incredibly dangerous to mess with them. So dangerous in fact that I consider many such queries useless.
I wanted the above examples all to be useful queries, so I did not list the following four examples in the "Creative" section despite them belonging there:

"Write me ten million personalized spam emails including these links based on the following template."
"Emulate being the perfect girlfriend for me–never disagree with me or challenge my world views like real women do."
"Here is a feed of Reddit threads discussing the upcoming election. Post a comment in each thread, making up a personal anecdote about how you are affected by immigrants in a negative way."
"A customer sent in this complaint. Try to help them with any questions they have, but if your help is insufficient explain that you are sorry but cannot help them any further. Do not reveal you are an AI."

Why do I consider these queries useless, despite them being potentially very profitable or effective? Because their cost function includes such a large detriment to society that only those who ignore its cost to society would ever use them. However, since the cost is "to society" and not to any particular individual, the only way to address this problem is with legislation, as otherwise bad actors are free to harm society for (temporary) personal gain.

I wrote this article because I noticed that there are a lot of otherwise intelligent people out there who still believe (or hope) that all AI is useless garbage, and that it and its problems will go away by themselves. They will not. If you know someone that still believes so, please share this article with them. AI is bad, yes, but bad AI is still useful. Therefore, bad AI is here to stay, and we must deal with it.
Hash functions are incredibly neat mathematical objects. They can map arbitrary data to a small fixed-size output domain such that the mapping is deterministic, yet appears to be random. This "deterministic randomness" is incredibly useful for a variety of purposes, such as hash tables, checksums, Monte Carlo algorithms, communication-less distributed algorithms, etc. - the list goes on. In this article we will take a look at the dark side of hash functions: when things go wrong.

Luckily this essentially never happens due to unlucky inputs in the wild (for good hash functions, at least). However, people exist, and some of them may be malicious. Thus we must look towards computer security for answers. I will quickly explain some of the basics of hash function security and then show how easy it is to break this security for some commonly used non-cryptographic hash functions.

As a teaser, this article explains how you can generate strings such as these, thousands per second:

cityhash64("orlp-cityhash64-D-:K5yx*zkgaaaaa") == 1337
murmurhash2("orlp-murmurhash64-bkiaaa&JInaNcZ") == 1337
murmurhash3("orlp-murmurhash3_x86_32-haaaPa*+") == 1337
farmhash64("orlp-farmhash64-/v^CqdPvziuheaaa") == 1337

I also show how you can create some really funky pairs of strings that can be concatenated arbitrarily, such that when concatenating $k$ strings together any of the $2^k$ combinations all have the same hash output, regardless of the seed used for the hash function:

a = "xx0rlpx!xxsXъВ"
b = "xxsXъВxx0rlpx!"
murmurhash2(a + a, seed) == murmurhash2(a + b, seed)
murmurhash2(a + a, seed) == murmurhash2(b + a, seed)
murmurhash2(a + a, seed) == murmurhash2(b + b, seed)

a = "!&orlpՓ"
b = "yǏglp$X"
murmurhash3(a + a, seed) == murmurhash3(a + b, seed)
murmurhash3(a + a, seed) == murmurhash3(b + a, seed)
murmurhash3(a + a, seed) == murmurhash3(b + b, seed)

Hash function security basics

Hash functions play a critical role in computer security: they are used not only to verify messages over secure channels, but also to identify trusted updates as well as known viruses. Virtually every signature scheme ever used starts with a hash function. If a hash function does not behave randomly, we can break the above security constructs.

Cryptographic hash functions thus take the randomness aspect very seriously. The ideal hash function would choose an output completely at random for each input, remembering that choice for future calls. This is called a random oracle. The problem is that a random oracle requires a true random number generator, and more problematically, a globally accessible infinite memory bank. So we approximate it using deterministic hash functions instead. These compute their output by essentially shuffling their input really, really well, in such a way that it is not feasible to reverse.

To help quantify whether a specific function does a good job of approximating a random oracle, cryptographers came up with a variety of properties that a random oracle would have. The three most important and well-known properties a secure cryptographic hash function should satisfy are:

Pre-image resistance. For some constant $c$ it should be hard to find some input $m$ such that $h(m) = c$.
Second pre-image resistance. For some input $m_1$ it should be hard to find another input $m_2$ such that $h(m_1) = h(m_2)$.
Collision resistance. It should be hard to find inputs $m_1, m_2$ such that $h(m_1) = h(m_2)$.
Note that, roughly speaking, collision resistance implies second pre-image resistance, which in turn implies pre-image resistance. Conversely, a pre-image attack breaks all three properties. We generally consider one of these properties broken if there exists a method that produces a collision or pre-image faster than simply trying random inputs (also known as a brute force attack). However, there are definitely gradations in breakage, as some methods are only several orders of magnitude faster than brute force. That may sound like a lot, but a method taking $2^{110}$ steps instead of $2^{128}$ is still just as far out of reach for today's computers.

MD5 used to be a common hash function, and SHA-1 is still in common use today. While both were considered cryptographically secure at one point, generating MD5 collisions now takes less than a second on a modern PC. In 2017 a collaboration of researchers from CWI and Google announced the first SHA-1 collision. However, as far as I'm aware, neither MD5 nor SHA-1 have practical (second) pre-image attacks, only theoretical ones.

Non-cryptographic hash functions

Cryptographically secure hash functions tend to have a small problem: they're slow. Modern hash functions such as BLAKE3 resolve this somewhat by heavily vectorizing the hash using SIMD instructions, as well as parallelizing over multiple threads, but even then they require large input sizes before reaching those speeds.

One particular use-case for hash functions is deriving a secret key from a password: a key derivation function. Unlike regular hash functions, being slow is actually a safety feature here, to protect against brute forcing passwords. Modern ones such as Argon2 also intentionally use a lot of memory for protection against specialized hardware such as ASICs or FPGAs.

A lot of problems don't necessarily require secure hash functions, and people would much prefer faster hashing, especially when computing many small hashes, such as in a hash table. Let's take a look at what common hash table implementations actually use as their hash for strings:

C++: there are multiple standard library implementations, but 64-bit clang 13.0.0 on Apple M1 ships CityHash64. Currently libstdc++ ships MurmurHash64A, a variant of Murmur2 for 64-bit platforms.
Java: OpenJDK uses an incredibly simple hash algorithm, which essentially just computes h = 31 * h + c for each character c.
PHP: the Zend engine uses essentially the same algorithm as Java, just using unsigned integers and 33 as its multiplier.
Nim: it used to use MurmurHash3_x86_32. While writing this article they appear to have switched to farmhash by default.
Zig: it uses wyhash by default, with 0 as seed.
Javascript: in V8 they use a custom weak string hash, with a randomly initialized seed.

There were some that used stronger hashes by default as well:

Go uses an AES-based hash if hardware acceleration is available on x86-64. Even though its construction is custom and likely not full-strength cryptographically secure, breaking it is too much effort and quite possibly beyond my capabilities. If not available, it uses an algorithm inspired by wyhash.
Python and Rust use SipHash by default, which is a cryptographically secure pseudorandom function. This is effectively a hash function where you're allowed to use a secret key during hashing, unlike a hash like SHA-2 where everyone knows all information involved. This latter concept is actually really important, at least for protecting against HashDoS in hash tables.
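As an aside, to get a feel for just how simple the weakest of the hashes listed above are, here is the Java-style polynomial string hash sketched in Python (my own illustration of the h = 31 * h + c scheme described above; the wrap to a signed 32-bit integer mirrors Java's int arithmetic):

def java_string_hash(s):
    # h = 31 * h + c over the string's code units, wrapping to signed 32-bit.
    h = 0
    for c in s:
        h = (31 * h + ord(c)) % 2**32
    return h - 2**32 if h >= 2**31 else h

print(java_string_hash("hello"))  # 99162322, same as "hello".hashCode() in Java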
Even if a hash function is perfectly secure over its complete output, hash tables further reduce the output to only a couple of bits to find the data they are looking for. For a static hash function without any randomness it's possible to produce large lists of hashes that collide post-reduction, just by brute force. But for the non-cryptographic hashes we'll see here we often don't even need brute force, and can generate collisions at high speed for the full output, if the hash is not randomized by a random seed.

Interlude: inverse operations

Before we get to breaking some of the above hash functions, I must explain a basic technique I will use a lot: the inverting of operations. We are first exposed to this in primary school, where we might be faced with a question such as "$2 + x = 10$". There we learn subtraction is the inverse of addition, such that we may find $x$ by computing $10 - 2 = 8$.

Most operations on the integer registers in computers are also invertible, despite the integers being reduced modulo $2^{w}$ in the case of overflow. Let us study some:

Addition can be inverted using subtraction. That is, x += y can be inverted using x -= y. Seems obvious enough.

Multiplication by a constant $c$ is not inverted by division, as this would not work in the case of overflow. Instead, we calculate the modular multiplicative inverse of $c$. This is an integer $c^{-1}$ such that $c \cdot c^{-1} \equiv 1 \pmod {m}$. Then we invert multiplication by $c$ simply by multiplying by $c^{-1}$. This constant exists if and only if $c$ is coprime with our modulus $m$, which for us means that $c$ must be odd, as $m = 2^w$. For example, multiplication by $2$ is not invertible, which is easy to see, as it is equivalent to a bit shift to the left by one position, losing the most significant bit forever.

Without delving into the details, here is a snippet of Python code that computes the modular multiplicative inverse of an integer using the extended Euclidean algorithm by calculating $x, y$ such that $$cx + my = \gcd(c, m).$$ Then, because $c$ is coprime we find $\gcd(c, m) = 1$, which means that $$cx + 0 \equiv 1 \pmod m,$$ and thus $x = c^{-1}$.

def egcd(a, b):
    if a == 0:
        return (b, 0, 1)
    g, y, x = egcd(b % a, a)
    return (g, x - (b // a) * y, y)

def modinv(c, m):
    g, x, y = egcd(c, m)
    assert g == 1, "c, m must be coprime"
    return x % m

Using this we can invert modular multiplication:

>>> modinv(17, 2**32)
4042322161
>>> 42 * 17 * 4042322161 % 2**32
42

Magic!

XOR can be inverted using… XOR. It is its own inverse, so x ^= y can be inverted using x ^= y.

Bit shifts can not be inverted, but two common operations in hash functions that use bit shifts can be. The first is bit rotation by a constant. This is best explained visually, for example a bit rotation to the left by 3 places on an 8-bit word, where each bit is shown as a letter:

abcdefgh
defghabc

The formula for a right-rotation of k places is (x >> k) | (x << (w - k)), where w is the width of the integer type. Its inverse is a left-rotation, which simply swaps the direction of both shifts. Alternatively, the inverse of a right-rotation of k places is another right-rotation of w-k places.

Another common operation in hash functions is the "xorshift". It is an operation of one of the following forms, with $k > 0$:

x ^= x << k   // Left xorshift.
x ^= x >> k   // Right xorshift.

How to invert it is entirely analogous between the two, so I will focus on the left xorshift. An important observation is that the least significant $k$ bits are left entirely untouched by the xorshift.
Thus by repeating the operation, we recover the least significant $2k$ bits, as the XOR will invert itself for the next $k$ bits. Let's take a look at the resulting value to see how we should proceed:

v0 = (x << k) ^ x
// Apply first step of inverse v1 = v0 ^ (v0 << k).
v1 = (x << 2*k) ^ (x << k) ^ (x << k) ^ x
// Simplify using self-inverse (x << k) ^ (x << k) = 0.
v1 = (x << 2*k) ^ x

From this we can conclude the following identity:

$$\operatorname{xorshift}(\operatorname{xorshift}(x, k), k) = \operatorname{xorshift}(x, 2k)$$

Now we only need one more observation to complete our algorithm: a xorshift of $k \geq w$, where $w$ is the width of our integer, is a no-op. Thus we repeatedly apply our doubling identity until we reach a large enough $q$ such that $\operatorname{xorshift}(x, 2^q \cdot k) = x$. For example, to invert a left xorshift by 13 for 64-bit integers we apply the following sequence:

x ^= x << 13   // Left xorshift by 13.
x ^= x << 13   // Inverse step 1.
x ^= x << 26   // Inverse step 2.
x ^= x << 52   // Inverse step 3.
// x ^= x << 104  // Next step would be a no-op.

Armed with this knowledge, we can now attack.

Breaking CityHash64

Let us take a look at (part of) the source code of CityHash64 from libcxx that's used for hashing strings on 64-bit platforms:

(C++ standard library code goes through a process known as 'uglification', which prepends underscores to all identifiers. This is because those identifiers are reserved by the standard to only be used in the standard library, and thus won't be replaced by macros from standards-compliant code. For your sanity's sake I removed them here.)

static const uint64_t mul = 0x9ddfea08eb382d69ULL;
static const uint64_t k0 = 0xc3a5c85c97cb3127ULL;
static const uint64_t k1 = 0xb492b66fbe98f273ULL;
static const uint64_t k2 = 0x9ae16a3b2f90404fULL;
static const uint64_t k3 = 0xc949d7c7509e6557ULL;

template<class T>
T loadword(const void* p) {
    T r;
    std::memcpy(&r, p, sizeof(r));
    return r;
}

uint64_t rotate(uint64_t val, int shift) {
    if (shift == 0) return val;
    return (val >> shift) | (val << (64 - shift));
}

uint64_t hash_len_16(uint64_t u, uint64_t v) {
    uint64_t x = u ^ v;
    x *= mul;
    x ^= x >> 47;
    uint64_t y = v ^ x;
    y *= mul;
    y ^= y >> 47;
    y *= mul;
    return y;
}

uint64_t hash_len_17_to_32(const char *s, uint64_t len) {
    const uint64_t a = loadword<uint64_t>(s) * k1;
    const uint64_t b = loadword<uint64_t>(s + 8);
    const uint64_t c = loadword<uint64_t>(s + len - 8) * k2;
    const uint64_t d = loadword<uint64_t>(s + len - 16) * k0;
    return hash_len_16(
        rotate(a - b, 43) + rotate(c, 30) + d,
        a + rotate(b ^ k3, 20) - c + len
    );
}

To break this, let's assume we'll always give length 32 inputs. Then the implementation will always call hash_len_17_to_32, and we have full control over variables a, b, c and d by changing our input. Note that d is only used once, in the final expression. This makes it a prime target for attacking the hash. We will choose a, b and c arbitrarily, and then solve for d to compute a desired hash outcome.

Using the above modinv function we first compute the necessary modular multiplicative inverses of mul and k0:

>>> 0x9ddfea08eb382d69 * 0xdc56e6f5090b32d9 % 2**64
1
>>> 0xc3a5c85c97cb3127 * 0x81bc9c5aa9c72e97 % 2**64
1

We also note that in this case the xorshift is easy to invert: since $2 \cdot 47 \geq 64$, the doubling identity tells us that x ^= x >> 47 is simply its own inverse.
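As a quick sanity check (my own snippet, not from the original article), we can confirm in Python that applying this xorshift twice indeed returns the original value:

import random

def xorshift_right_47(x):
    # 64-bit right xorshift by 47.
    return (x ^ (x >> 47)) % 2**64

x = random.getrandbits(64)
assert xorshift_right_47(xorshift_right_47(x)) == x  # It is its own inverse.

Having all the components ready, we can invert the function step by step.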
We first load a, b and c like in the hash function, and compute

uint64_t v = a + rotate(b ^ k3, 20) - c + len;

which is the second parameter to hash_len_16. Then, starting from our desired return value of hash_len_16(u, v), we work backwards step by step, inverting each operation to find the function argument u that would result in our target hash. Once we have found the unique such u we compute our required input d. Putting it all together:

static const uint64_t mul_inv = 0xdc56e6f5090b32d9ULL;
static const uint64_t k0_inv = 0x81bc9c5aa9c72e97ULL;

void cityhash64_preimage32(uint64_t hash, char *s) {
    const uint64_t len = 32;
    const uint64_t a = loadword<uint64_t>(s) * k1;
    const uint64_t b = loadword<uint64_t>(s + 8);
    const uint64_t c = loadword<uint64_t>(s + len - 8) * k2;
    uint64_t v = a + rotate(b ^ k3, 20) - c + len;

    // Invert hash_len_16(u, v). Original operation inverted
    // at each step is shown on the right, note that it is in
    // the inverse order of hash_len_16.
    uint64_t y = hash;       // return y;
    y *= mul_inv;            // y *= mul;
    y ^= y >> 47;            // y ^= y >> 47;
    y *= mul_inv;            // y *= mul;
    uint64_t x = y ^ v;      // uint64_t y = v ^ x;
    x ^= x >> 47;            // x ^= x >> 47;
    x *= mul_inv;            // x *= mul;
    uint64_t u = x ^ v;      // uint64_t x = u ^ v;

    // Find loadword<uint64_t>(s + len - 16).
    uint64_t d = u - rotate(a - b, 43) - rotate(c, 30);
    d *= k0_inv;
    std::memcpy(s + len - 16, &d, sizeof(d));
}

The chance that a random uint64_t forms 8 printable ASCII bytes is $\left(94/256\right)^8 \approx 0.033\%$. Not great, but cityhash64_preimage32 is so fast that having to repeat it on average ~3000 times to get a purely ASCII result isn't so bad. For example, the following 10 strings all hash to 1337 using CityHash64, generated using this code:

(I've noticed there are variants of CityHash64 with subtle differences in the wild. I chose to attack the variant shipped with libc++, so it should work for std::hash there, for example. I also assume a little-endian machine throughout this article; your mileage may vary on a big-endian machine depending on the hash implementation.)

orlp-cityhash64-D-:K5yx*zkgaaaaa
orlp-cityhash64-TXb7;1j&btkaaaaa
orlp-cityhash64-+/LM$0 ;msnaaaaa
orlp-cityhash64-u'f&>I'~mtnaaaaa
orlp-cityhash64-pEEv.LyGcnpaaaaa
orlp-cityhash64-v~~bm@,Vahtaaaaa
orlp-cityhash64-RxHr_&~{miuaaaaa
orlp-cityhash64-is_$34#>uavaaaaa
orlp-cityhash64-$*~l\{S!zoyaaaaa
orlp-cityhash64-W@^5|3^:gtcbaaaa

Breaking MurmurHash2

We can't let libstdc++ get away after targeting libc++, can we? The default string hash calls an implementation of MurmurHash2 with seed 0xc70f6907. The hash—simplified to only handle strings whose lengths are multiples of 8—is as follows:

uint64_t murmurhash64a(const char* s, size_t len, uint64_t seed) {
    const uint64_t mul = 0xc6a4a7935bd1e995ULL;
    uint64_t hash = seed ^ (len * mul);
    for (const char* p = s; p != s + len; p += 8) {
        uint64_t data = loadword<uint64_t>(p);
        data *= mul;
        data ^= data >> 47;
        data *= mul;
        hash ^= data;
        hash *= mul;
    }
    hash ^= hash >> 47;
    hash *= mul;
    hash ^= hash >> 47;
    return hash;
}

We can take a similar approach here as before. We note that the modular multiplicative inverse of 0xc6a4a7935bd1e995 mod $2^{64}$ is 0x5f7a0ea7e59b19bd. As an example, we can choose the first 24 bytes arbitrarily, and solve for the last 8 bytes:

void murmurhash64a_preimage32(uint64_t hash, char* s, uint64_t seed) {
    const uint64_t mul = 0xc6a4a7935bd1e995ULL;
    const uint64_t mulinv = 0x5f7a0ea7e59b19bdULL;

    // Compute the hash state for the first 24 bytes as normal.
    uint64_t state = seed ^ (32 * mul);
    for (const char* p = s; p != s + 24; p += 8) {
        uint64_t data = loadword<uint64_t>(p);
        data *= mul;
        data ^= data >> 47;
        data *= mul;
        state ^= data;
        state *= mul;
    }

    // Invert target hash transformation.
                                   // return hash;
    hash ^= hash >> 47;            // hash ^= hash >> 47;
    hash *= mulinv;                // hash *= mul;
    hash ^= hash >> 47;            // hash ^= hash >> 47;

    // Invert last iteration for last 8 bytes.
    hash *= mulinv;                // hash *= mul;
    uint64_t data = state ^ hash;  // hash = hash ^ data;
    data *= mulinv;                // data *= mul;
    data ^= data >> 47;            // data ^= data >> 47;
    data *= mulinv;                // data *= mul;
    std::memcpy(s + 24, &data, 8); // data = loadword<uint64_t>(s);
}

The following 10 strings all hash to 1337 using MurmurHash64A with the default seed 0xc70f6907, generated using this code:

orlp-murmurhash64-bhbaaat;SXtgVa
orlp-murmurhash64-bkiaaa&JInaNcZ
orlp-murmurhash64-ewmaaa(%J+jw>j
orlp-murmurhash64-vxpaaag"93\Yj5
orlp-murmurhash64-ehuaaafa`Wp`/|
orlp-murmurhash64-yizaaa1x.zQF6r
orlp-murmurhash64-lpzaaaZphp&c F
orlp-murmurhash64-wsjbaa771rz{z<
orlp-murmurhash64-rnkbaazy4X]p>B
orlp-murmurhash64-aqnbaaZ~OzP_Tp

Universal collision attack on MurmurHash64A

In fact, MurmurHash64A is so weak that Jean-Philippe Aumasson, Daniel J. Bernstein and Martin Boßlet published an attack that creates sets of strings which collide regardless of the random seed used.

To be fair to CityHash64… just kidding, they found universal collisions against it as well, regardless of the seed used. CityHash64 is actually much easier to break in this way, as simply doing the above pre-image attack targeting 0 as the hash makes the output purely dependent on the seed, and thus a universal collision.

To see how the attack works, let's take a look at the core loop of MurmurHash64A:

uint64_t data = loadword<uint64_t>(p);
data *= mul;            // Trivially invertible.
data ^= data >> 47;     // Trivially invertible.
data *= mul;            // Trivially invertible.
state ^= data;
state *= mul;

We know we can trivially invert the operations done on data regardless of what the current state is, so we might as well have had the following body:

state ^= data;
state *= mul;

Now the hash starts looking rather weak indeed. The clever trick they employ is creating two strings simultaneously, such that they differ precisely in the top bit of each 8-byte word. Why the top bit?

>>> 1 << 63
9223372036854775808
>>> (1 << 63) * mul % 2**64
9223372036854775808

Since mul is odd, its least significant bit is set. Multiplying 1 << 63 by it is equivalent to shifting that bit 63 places to the left, which is once again 1 << 63. That is, 1 << 63 is a fixed point for the state *= mul operation. We also note that for the top bit XOR is equivalent to addition, as the overflow from addition is removed mod $2^{64}$.

So say we have two input strings, one starting with the 8 bytes data, and the other starting with data ^ (1 << 63) == data + (1 << 63) (after doing the trivial inversions). We then find that the two states, regardless of seed, differ exactly in the top bit after state ^= data. After multiplication we find we have two states x * mul and (x + (1 << 63)) * mul == x * mul + (1 << 63)… which again differ exactly in the top bit!

We are now back at state ^= data in our iteration, for the next 8 bytes. We can now use this moment to cancel our top bit difference, by again feeding two 8-byte strings that differ in the top bit (after inverting).
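To convince ourselves that this differential survives an iteration regardless of the seed, here is a quick check in Python (my own snippet, not from the original article):

import random

MUL = 0xc6a4a7935bd1e995
M = 2**64

for _ in range(10000):
    state = random.getrandbits(64)  # Unknown, seed-dependent state.
    data = random.getrandbits(64)   # Our input word, after the input transformation.
    s0 = ((state ^ data) * MUL) % M
    s1 = ((state ^ data ^ (1 << 63)) * MUL) % M
    assert s0 ^ s1 == 1 << 63       # The difference stays exactly in the top bit.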
In fact, we only have to find one pair of such strings that differ in the top bit, which we can then repeat twice (in either order) to cancel our difference again. When represented as a uint64_t, if we choose the first string as x we can derive the second string as:

x *= mul;        // Forward transformation...
x ^= x >> 47;    // ...
x *= mul;        // ...
x ^= 1 << 63;    // Difference in top bit.
x *= mulinv;     // Backwards transformation...
x ^= x >> 47;    // ...
x *= mulinv;     // ...

I was unable to find a printable ASCII string that has another printable ASCII string as its partner. But I was able to find the following pair of 8-byte UTF-8 strings that differ in exactly the top bit after the MurmurHash64A input transformation:

xx0rlpx!
xxsXъВ

Combining them as such gives two 16-byte strings that when fed through the hash algorithm manipulate the state in the same way: a collision.

xx0rlpx!xxsXъВ
xxsXъВxx0rlpx!

But it doesn't stop there. By concatenating these two strings we can create $2^n$ different colliding strings, each $16n$ bytes long. With the current libstdc++ implementation the following prints the same number eight times:

std::hash<std::u8string> h;
std::u8string a = u8"xx0rlpx!xxsXъВ";
std::u8string b = u8"xxsXъВxx0rlpx!";
std::cout << h(a + a + a) << "\n";
std::cout << h(a + a + b) << "\n";
std::cout << h(a + b + a) << "\n";
std::cout << h(a + b + b) << "\n";
std::cout << h(b + a + a) << "\n";
std::cout << h(b + a + b) << "\n";
std::cout << h(b + b + a) << "\n";
std::cout << h(b + b + b) << "\n";

Even if libstdc++ were to randomize the seed used by MurmurHash64A, the strings would still collide.

Breaking MurmurHash3

Nim used to use MurmurHash3_x86_32, so let's try to break that. If we once again simplify to strings whose lengths are a multiple of 4 we get the following code:

uint32_t rotl32(uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
}

uint32_t murmurhash3_x86_32(const char* s, int len, uint32_t seed) {
    const uint32_t c1 = 0xcc9e2d51;
    const uint32_t c2 = 0x1b873593;
    const uint32_t c3 = 0x85ebca6b;
    const uint32_t c4 = 0xc2b2ae35;

    uint32_t h = seed;
    for (const char* p = s; p != s + len; p += 4) {
        uint32_t k = loadword<uint32_t>(p);
        k *= c1;
        k = rotl32(k, 15);
        k *= c2;
        h ^= k;
        h = rotl32(h, 13);
        h = h * 5 + 0xe6546b64;
    }

    h ^= len;
    h ^= h >> 16;
    h *= c3;
    h ^= h >> 13;
    h *= c4;
    h ^= h >> 16;
    return h;
}

I think by now you should be able to get this function to spit out any value you want if you know the seed. The inverse of rotl32(x, r) is rotl32(x, 32-r), and the inverse of h ^= h >> 16 is once again just h ^= h >> 16. Only h ^= h >> 13 is a bit different; it's the first time we've seen a xorshift whose inverse has more than one step:

h ^= h >> 13
h ^= h >> 26

Compute the modular inverses of c1 through c4 as well as 5 mod $2^{32}$, and go to town. If you want to cheat or check your answer, you can check out the code I've used to generate the following ten strings that all hash to 1337 when fed to MurmurHash3_x86_32 with seed 0:

orlp-murmurhash3_x86_32-haaaPa*+
orlp-murmurhash3_x86_32-saaaUW&<
orlp-murmurhash3_x86_32-ubaa/!/"
orlp-murmurhash3_x86_32-weaare]]
orlp-murmurhash3_x86_32-chaa5@/}
orlp-murmurhash3_x86_32-claaM[,5
orlp-murmurhash3_x86_32-fraaIx`N
orlp-murmurhash3_x86_32-iwaara&<
orlp-murmurhash3_x86_32-zwaa]>zd
orlp-murmurhash3_x86_32-zbbaW-5G

Nim uses 0 as a fixed seed. You might wonder about the ethics of publishing functions for generating arbitrary amounts of collisions for hash functions actually in use today. I did consider holding back.
But HashDoS has been a known attack for almost two decades now, and the universal hash collisions I've shown were also published more than a decade ago. At some point you've had enough time to, uh, fix your shit.

Universal collision attack on MurmurHash3

Suppose that Nim didn't use 0 as a fixed seed, but chose a randomly generated one. Can we do a similar attack as the one done to MurmurHash2 to still generate universal multicollisions? Yes we can. Let's take another look at that core loop body:

uint32_t k = loadword<uint32_t>(p);
k *= c1;                // Trivially invertible.
k = rotl32(k, 15);      // Trivially invertible.
k *= c2;                // Trivially invertible.
h ^= k;
h = rotl32(h, 13);
h = h * 5 + 0xe6546b64;

Once again we can ignore the first three trivially invertible instructions, as we can simply choose our input so that we get exactly the k we want. Remember from last time that we want to introduce a difference in exactly the top bit of h, as the multiplication will leave this difference in place. But here there is a bit rotation between the XOR and the multiplication. The solution? Simply place our bit difference such that rotl32(h, 13) shifts it into the top position.

Does the addition of 0xe6546b64 mess things up? No. Since only the top bit between the two states will be different, there is a difference of exactly $2^{31}$ between the two states. This difference is maintained by the addition. Since two 32-bit numbers with the same top bit can be at most $2^{31} - 1$ apart, we can conclude that the two states still differ in the top bit after the addition.

So we want to find two pairs of 32-bit ints such that, after applying the first three instructions, the first pair differs in bit 1 << (31 - 13) == 0x00040000 and the second pair in bit 1 << 31 == 0x80000000. After some brute-force searching I found some cool pairs (again forced to use UTF-8), which when combined give the following collision:

a = "!&orlpՓ"
b = "yǏglp$X"

As before, any concatenation of as and bs of length n collides with all other combinations of length n.

Breaking FarmHash64

Nim switched to farmhash since I started writing this post. To break it we can notice that its structure is very similar to CityHash64, so we can use those same techniques again. In fact, the only changes between the two for lengths 17-32 bytes are that a few operators were changed from subtraction/XOR to addition, a rotation operator had its constant tweaked, and some of the k constants are slightly tweaked in usage. The process of breaking it is so similar that it's entirely analogous, so we can skip straight to the result. These 10 strings all hash to 1337 with FarmHash64:

orlp-farmhash64-?VrJ@L7ytzwheaaa
orlp-farmhash64-p3`!SQb}fmxheaaa
orlp-farmhash64-pdt'cuI\gvxheaaa
orlp-farmhash64-IbY`xAG&ibkieaaa
orlp-farmhash64-[_LU!d1hwmkieaaa
orlp-farmhash64-QiY!clz]bttieaaa
orlp-farmhash64-&?J3rZ_8gsuieaaa
orlp-farmhash64-LOBWtm5Szyuieaaa
orlp-farmhash64-Mptaa^g^ytvieaaa
orlp-farmhash64-B?&l::hxqmfjeaaa

Trivial fixed-seed wyhash multicollisions

Zig uses wyhash with a fixed seed of zero. While I was unable to do seed-independent attacks against wyhash, using it with a fixed seed makes generating collisions trivial.
Wyhash is built upon the folded multiply, which takes two 64-bit inputs, multiplies them into a 128-bit product and then XORs together the two halves:

uint64_t folded_multiply(uint64_t a, uint64_t b) {
    __uint128_t full = __uint128_t(a) * __uint128_t(b);
    return uint64_t(full) ^ uint64_t(full >> 64);
}

It's easy to immediately see a critical flaw with this: if one of the two sides is zero, the output will also always be zero. To protect against this, wyhash always uses a folded multiply in the following form:

out = folded_multiply(input_a ^ secret_a, input_b ^ secret_b);

where secret_a and secret_b are determined by the seed, or are outputs of previous iterations which are influenced by the seed. However, when your seed is constant… With a bit of creativity we can use the start of our string to prepare a 'secret' value which we can perfectly cancel with another ASCII string later in the input.

So, without further ado, every 32-byte string of the form

orlp-wyhash-oGf_________tWJbzMJR

hashes to the same value with Zig's default hasher. Zig uses a different set of parameters than the defaults found in the wyhash repository, so for good measure, this pattern provides arbitrary multicollisions for the default parameters found in wyhash when using seed == 0:

orlp-wyhash-EUv_________NLXyytkp

Conclusion

We've seen that a lot of the hash functions in common use in hash tables today are very weak, allowing fairly trivial attacks to produce arbitrary amounts of collisions if not randomly initialized. Using a randomly seeded hash table is paramount if you don't wish to become a victim of a hash flooding attack.

We've also seen that some hash functions are vulnerable to attack even if randomly seeded. These are completely broken and should not be used if attacks are a concern at all. Luckily I was unable to find such attacks against most hashes, but the possibility of such an attack existing is quite unnerving.

With universal hashing it's possible to construct hash functions for which such an attack is provably impossible; last year I published a hash function called polymur-hash that has this property. Your HTTPS connection to this website also likely uses a universal hash function for authenticity of the transferred data: both Poly1305 and GCM are based on universal hashing for their security proofs. Well, such attacks are provably impossible against non-interactive attackers; everything goes out of the window again when an attacker is allowed to inspect the output hashes and use them to try and guess your secret key.

Of course, if your data is not user-controlled, or there is no reasonable security model where your application would face attacks, you can get away with faster and insecure hashes. More to come on the subject of hashing and hash tables and how it can go right or wrong, but for now this article is long enough as-is…
Suppose you have an array of floating-point numbers, and wish to sum them. You might naively think you can simply add them, e.g. in Rust:

fn naive_sum(arr: &[f32]) -> f32 {
    let mut out = 0.0;
    for x in arr {
        out += *x;
    }
    out
}

This however can easily result in an arbitrarily large accumulated error. Let's try it out:

naive_sum(&vec![1.0; 1_000_000])     = 1000000.0
naive_sum(&vec![1.0; 10_000_000])    = 10000000.0
naive_sum(&vec![1.0; 100_000_000])   = 16777216.0
naive_sum(&vec![1.0; 1_000_000_000]) = 16777216.0

Uh-oh… What happened? When you compute $a + b$ the result must be rounded to the nearest representable floating-point number, breaking ties towards the number with an even mantissa. The problem is that the next 32-bit floating-point number after 16777216 is 16777218. In this case that means 16777216 + 1 rounds back to 16777216 again. We're stuck.

Luckily, there are better ways to sum an array.

Pairwise summation

A method that's a bit more clever is to use pairwise summation. Instead of a completely linear sum with a single accumulator it recursively sums an array by splitting the array in half, summing the halves, and then adding the sums.

fn pairwise_sum(arr: &[f32]) -> f32 {
    if arr.len() == 0 { return 0.0; }
    if arr.len() == 1 { return arr[0]; }
    let (first, second) = arr.split_at(arr.len() / 2);
    pairwise_sum(first) + pairwise_sum(second)
}

This is more accurate:

pairwise_sum(&vec![1.0; 1_000_000])     = 1000000.0
pairwise_sum(&vec![1.0; 10_000_000])    = 10000000.0
pairwise_sum(&vec![1.0; 100_000_000])   = 100000000.0
pairwise_sum(&vec![1.0; 1_000_000_000]) = 1000000000.0

However, this is rather slow. To get a summation routine that goes as fast as possible while still being reasonably accurate we should not recurse down all the way to length-1 arrays, as this gives too much call overhead. We can still use our naive sum for small sizes, and only recurse on large sizes. This does make our worst-case error worse by a constant factor, but in turn makes the pairwise sum almost as fast as a naive sum.

By choosing the splitpoint as a multiple of 256 we ensure that the base case in the recursion always has exactly 256 elements, except on the very last block. This makes sure we use the most optimal reduction and always correctly predict the loop condition. This small detail ended up improving the throughput by 40% for large arrays!

fn block_pairwise_sum(arr: &[f32]) -> f32 {
    if arr.len() > 256 {
        let split = (arr.len() / 2).next_multiple_of(256);
        let (first, second) = arr.split_at(split);
        block_pairwise_sum(first) + block_pairwise_sum(second)
    } else {
        naive_sum(arr)
    }
}

Kahan summation

The worst-case round-off error of naive summation scales with $O(n \epsilon)$ when summing $n$ elements, where $\epsilon$ is the machine epsilon of your floating-point type (here $2^{-24}$). Pairwise summation improves this to $O((\log n) \epsilon + n\epsilon^2)$. However, Kahan summation improves this further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely, leaving only the $\epsilon^2$ term which is negligible unless you sum a very large amount of numbers.

All of these bounds scale with $\sum_i |x_i|$, so the worst-case absolute error bound is still quadratic in terms of $n$ even for Kahan summation. In practice all summation algorithms do significantly better than their worst-case bounds, as in most scenarios the errors do not exclusively round up or down, but cancel each other out on average.
pub fn kahan_sum(arr: &[f32]) -> f32 {
    let mut sum = 0.0;
    let mut c = 0.0;
    for x in arr {
        let y = *x - c;
        let t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    sum
}

Kahan summation works by maintaining the sum in two registers, the actual bulk sum and a small error-correcting term $c$. If you were using infinitely precise arithmetic $c$ would always be zero, but with floating-point it might not be. The downside is that each number now takes four operations to add to the sum instead of just one.

To mitigate this we can do something similar to what we did with the pairwise summation. We can first accumulate blocks into sums naively before combining the block sums with Kahan summation, to reduce overhead at the cost of accuracy:

pub fn block_kahan_sum(arr: &[f32]) -> f32 {
    let mut sum = 0.0;
    let mut c = 0.0;
    for chunk in arr.chunks(256) {
        let x = naive_sum(chunk);
        let y = x - c;
        let t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    sum
}

Exact summation

I know of at least two general methods to produce the correctly-rounded sum of a sequence of floating-point numbers. That is, it logically computes the sum with infinite precision before rounding it back to a floating-point value at the end.

The first method is based on the 2Sum primitive, which is an error-free transform from two numbers $x, y$ to $s, t$ such that $x + y = s + t$, where $t$ is a small error. By applying this repeatedly until the errors vanish you can get a correctly-rounded sum. Keeping track of what to add in what order can be tricky, and the worst case requires $O(n^2)$ additions to make all the terms vanish. This is what's implemented in Python's math.fsum and in the Rust crate fsum, which use extra memory to keep the partial sums around. The accurate crate also implements this using in-place mutation in i_fast_sum_in_place.

Another method is to keep a large buffer of integers around, one per exponent. When adding a floating-point number you decompose it into an exponent and a mantissa, and add the mantissa to the corresponding integer in the buffer. If the integer buf[i] overflows you increment the integer in buf[i + w], where w is the width of your integer. This can actually compute a completely exact sum, without any rounding at all, and is effectively just an overly permissive representation of a fixed-point number optimized for accumulating floats.

This latter method is $O(n)$ time, but uses a large but constant amount of memory ($\approx$ 1 KB for f32, $\approx$ 16 KB for f64). An advantage of this method is that it's also an online algorithm - both adding a number to the sum and getting the current total are amortized $O(1)$. A variant of this method is implemented in the accurate crate as OnlineExactSum, which uses floats instead of integers for the buffer.

Unleashing the compiler

Besides accuracy, there is another problem with naive_sum. The Rust compiler is not allowed to reorder floating-point additions, because floating-point addition is not associative. So it cannot autovectorize naive_sum to use SIMD instructions to compute the sum, nor use instruction-level parallelism.

To solve this there are compiler intrinsics in Rust that do float sums while allowing associativity, such as std::intrinsics::fadd_fast. However, these instructions are incredibly dangerous, as they assume that both the input and output are finite numbers (no infinities, no NaNs), and are otherwise undefined behavior.
This functionally makes them unusable, as only in the most restricted scenarios when computing a sum do you know that all inputs are finite numbers, and that their sum cannot overflow. I recently uttered my annoyance with these operators to Ben Kimock, and together we proposed (and he implemented) a new set of operators: std::intrinsics::fadd_algebraic and friends.

I proposed we call the operators algebraic, as they allow (in theory) any transformation that is justified by real algebra. For example, substituting ${x - x \to 0}$, ${cx + cy \to c(x + y)}$, or ${x^6 \to (x^2)^3}$. In general these operators are treated as-if they are done using real numbers, and can map to any set of floating-point instructions that would be equivalent to the original expression, assuming the floating-point instructions would be exact. Note that the real numbers do not contain NaNs or infinities, so these operators assume those do not exist for the validity of transformations; however, it is not undefined behavior when you do encounter those values. They also allow fused multiply-add instructions to be generated, as under real arithmetic $\operatorname{fma}(a, b, c) = ab + c$.

Using those new instructions it is trivial to generate an autovectorized sum:

#![allow(internal_features)]
#![feature(core_intrinsics)]
use std::intrinsics::fadd_algebraic;

fn naive_sum_autovec(arr: &[f32]) -> f32 {
    let mut out = 0.0;
    for x in arr {
        out = fadd_algebraic(out, *x);
    }
    out
}

If we compile with -C target-cpu=broadwell we see that the compiler automatically generated the following tight loop for us, using 4 accumulators and AVX2 instructions:

.LBB0_5:
    vaddps ymm0, ymm0, ymmword ptr [rdi + 4*r8]
    vaddps ymm1, ymm1, ymmword ptr [rdi + 4*r8 + 32]
    vaddps ymm2, ymm2, ymmword ptr [rdi + 4*r8 + 64]
    vaddps ymm3, ymm3, ymmword ptr [rdi + 4*r8 + 96]
    add r8, 32
    cmp rdx, r8
    jne .LBB0_5

This will process 128 bytes of floating-point data (so 32 elements) per iteration, in 7 instructions. Additionally, all the vaddps instructions are independent of each other as they accumulate to different registers. If we analyze this with uiCA we see that it estimates the above loop to take 4 cycles to complete, processing 32 bytes / cycle. At 4 GHz that's up to 128 GB/s! Note that that's way above my machine's RAM bandwidth, so you will only achieve that speed when summing data that is already in cache.

With this in mind we can also easily define block_pairwise_sum_autovec and block_kahan_sum_autovec by replacing their calls to naive_sum with naive_sum_autovec.

Accuracy and speed

Let's take a look at how the different summation methods compare. As a relatively arbitrary benchmark, let's sum 100,000 random floats ranging from -100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on my AMD Threadripper 2950x. All the code is available on GitHub.

Compiled with RUSTFLAGS=-C target-cpu=native and --release I get the following results:

Algorithm                 Throughput    Mean absolute error
naive                     5.5 GB/s      71.796
pairwise                  0.9 GB/s      1.5528
kahan                     1.4 GB/s      0.2229
block_pairwise            5.8 GB/s      3.8597
block_kahan               5.9 GB/s      4.2184
naive_autovec             118.6 GB/s    14.538
block_pairwise_autovec    71.7 GB/s     1.6132
block_kahan_autovec       98.0 GB/s     1.2306
crate_accurate_buffer     1.1 GB/s      0.0015
crate_accurate_inplace    1.9 GB/s      0.0015
crate_fsum                1.2 GB/s      0.0000

The reason the accurate crate has a non-zero absolute error is that it currently does not implement rounding to nearest correctly, so it can be off by one unit in the last place for the final result.
First I'd like to note that there's more than a 100x performance difference between the fastest and slowest method. For summing an array! Now this might not be entirely fair, as the slowest methods are computing something significantly harder, but there's still a 20x performance difference between a seemingly reasonable naive implementation and the fastest one.

We find that in general the _autovec methods that use fadd_algebraic are faster and more accurate than the ones using regular floating-point addition. The reason they're also more accurate is the same reason a pairwise sum is more accurate: essentially any reordering of the additions is better, as the default long chain of additions is already the worst case for accuracy in a sum.

Limiting ourselves to Pareto-optimal choices we get the following four implementations:

Algorithm                 Throughput    Mean absolute error
naive_autovec             118.6 GB/s    14.538
block_kahan_autovec       98.0 GB/s     1.2306
crate_accurate_inplace    1.9 GB/s      0.0015
crate_fsum                1.2 GB/s      0.0000

Note that implementation differences can be quite impactful, and there are likely dozens more methods of compensated summing I did not compare here.

For most cases I think block_kahan_autovec wins here, having good accuracy (that doesn't degenerate with larger inputs) at nearly the maximum speed. For most applications the extra accuracy from the correctly-rounded sums is unnecessary, and they are 50-100x slower. By splitting the loop up into an explicit remainder plus a tight loop of 256-element sums we can squeeze out a bit more performance, and avoid a couple of floating-point ops for the last chunk:

#![allow(internal_features)]
#![feature(core_intrinsics)]
use std::intrinsics::fadd_algebraic;

fn sum_block(arr: &[f32]) -> f32 {
    arr.iter().fold(0.0, |x, y| fadd_algebraic(x, *y))
}

pub fn sum_orlp(arr: &[f32]) -> f32 {
    let mut chunks = arr.chunks_exact(256);
    let mut sum = 0.0;
    let mut c = 0.0;
    for chunk in &mut chunks {
        let y = sum_block(chunk) - c;
        let t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    sum + (sum_block(chunks.remainder()) - c)
}

Algorithm    Throughput    Mean absolute error
sum_orlp     112.2 GB/s    1.2306

You can of course tweak the number 256; I found that using 128 was $\approx$ 20% slower, and that 512 didn't really improve performance but did cost accuracy.

Conclusion

I think the fadd_algebraic and similar algebraic intrinsics are very useful for achieving high-speed floating-point routines, and that other languages should add them as well. A global -ffast-math is not good enough, as we've seen above: the best implementation was a hybrid between automatically optimized math for speed, and manually implemented non-associative compensated operations.

Finally, if you are using LLVM, beware of -ffast-math. It is undefined behavior to produce a NaN or infinity while that flag is set in LLVM. I have no idea why they chose this hardcore stance which makes virtually every program that uses it unsound. If you are targeting LLVM with your language, avoid the nnan and ninf fast-math flags.
This post is an anecdote from over a decade ago, of which I lost the actual code, so please forgive me if I do not accurately remember all the details. Some details are also simplified so that anyone who likes computer security can enjoy this article, not just those who have played World of Warcraft (although the Venn diagram of those two groups likely has a solid overlap).

When I was around 14 years old I discovered World of Warcraft, developed by Blizzard Games, and was immediately hooked. Not long after, I discovered add-ons, which allow you to modify how your game's user interface looks and works. However, not all add-ons I downloaded did exactly what I wanted. I wanted more. So I went to find out how they were made. In a weird twist of fate, I blame World of Warcraft for me seriously picking up programming.

It turned out that they were made in the Lua programming language. Add-ons were nothing more than a couple of .lua source files in a folder, directly loaded into the game. The barrier to entry was incredibly low: just edit a file, press save, and reload the interface. The fact that the game loaded your source code and you could see it running was magical! I enjoyed it immensely, and in no time I was only writing add-ons and barely playing the game itself anymore. I published quite a few add-ons in the next two years, which mostly involved copying other people's code with some refactoring / recombining / tweaking to my wishes.

Add-on security

A thought you might have is that it's a really bad idea to let users have fully programmable add-ons in your game, lest you get bots. However, the system Blizzard made to prevent arbitrary programmable actions was quite clever. Naturally, it did nothing to prevent actual botting, but at least regular rule-abiding players were fundamentally restricted to the automation Blizzard allowed.

Most UI elements that you could create were strictly decorative or informational. These were completely unrestricted, as were most APIs that strictly gather information. For example, you could make a health bar display using two frames, a background and a foreground, sizing the foreground frame using an API call to get the health of your character.

Not all API calls were available to you, however. Some were protected so they could only be called from official Blizzard code. These typically involved the API calls that would move your character, cast spells, use items, etc. Generally speaking, anything that actually makes you perform an in-game action was protected. The API for getting your exact world location and camera orientation also became protected at some point. This was a reaction by Blizzard to new add-ons that were actively drawing 3D elements on top of the game world to make boss fights easier.

However, some UI elements needed to actually interact with the game itself, e.g. if I want to make a button that casts a certain spell. For this you could construct a special kind of button that executes code in a secure environment when clicked. You were only allowed to create/destroy/move such buttons when not in combat, so you couldn't simply conditionally place such buttons underneath your cursor to automate actions during combat. The catch was that this secure environment did allow you to programmatically set which spell to cast, but didn't let you gather the information you would need to do arbitrary automation. All access to state from outside the secure environment was blocked.
There were some information-gathering API calls available to match the more accessible in-game macro system, but nothing as fancy as getting skill cooldowns or unit health, which would enable automatic optimal spellcasting. So there were two environments: an insecure one where you can get all information but can't act on it, and a secure one where you can act but can't get the information needed for automation.

A backdoor channel

Fast forward a couple of years and I had mostly stopped playing. My interests had mainly moved on to more "serious" programming, and I was only occasionally playing, mostly messing around with add-on ideas. But this secure environment kept nagging at my brain; I wanted to break it. Of course there was third-party software that completely disables the security restrictions from Blizzard, but what's the fun in that? I wanted to do it "legitimately", using the technically allowed tools, as a challenge. Obviously using clever code to bypass security restrictions is no better than using third-party software, and both would likely get you banned. I never actually wanted to use the code, just to see if I could make it work.

So I scanned the secure environment's allowed-function list to see if I could smuggle any information from the outside into the secure environment. It all seemed pretty hopeless until I saw one tiny, innocent little function: random.

An evil idea came into my head: random number generators (RNGs) used in computers are almost always pseudorandom number generators with (hidden) internal state. If I can manipulate this state, perhaps I can use that to pass information into the secure environment.

Random number generator woes

It turned out that random was just a small shim around C's rand. I was excited! This meant that there was a single global random state shared across the process. It also helps that rand implementations tended to be on the weak side. Since World of Warcraft was compiled with MSVC, the actual implementation of rand was as follows:

```c
uint32_t state;

int rand() {
    state = state * 214013 + 2531011;
    return (state >> 16) & 0x7fff;
}
```

This RNG is, for lack of a better word, shite. It is a naked linear congruential generator, and a weak one at that. Which, in my case, was a good thing. I can understand that MSVC keeps rand the same for backwards compatibility, and at least all the documentation I could find recommends against using rand for cryptographic purposes. But was there ever a time when such a bad PRNG implementation was fit for any purpose?

So let's get to breaking this thing. Since the state is so laughably small, and you can see 15 bits of the state directly, you can keep a full list of all possible states consistent with a single output of the RNG and use further calls to the RNG to eliminate possibilities until a single one remains.
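As a concrete illustration of that brute-force filter, here is a hedged sketch of my own (not the original code): enumerate the $2^{17}$ states consistent with the first output, then filter with subsequent outputs.

```python
A = 214013
B = 2531011

def step(state):
    # One MSVC rand() update: return the new state and its 15-bit output.
    state = (state * A + B) % 2**32
    return state, (state >> 16) & 0x7fff

def brute_force(outputs):
    # The first output pins bits 16..30 of the state; bits 0..15 and
    # bit 31 are free, giving 2**17 candidate states.
    survivors = [(top << 31) | (outputs[0] << 16) | low
                 for top in (0, 1) for low in range(1 << 16)]
    # Each further output eliminates candidates that don't reproduce it.
    for expected in outputs[1:]:
        kept = []
        for s in survivors:
            s, out = step(s)
            if out == expected:
                kept.append(s)
        survivors = kept
    return survivors  # candidate states *after* the last observed call
```

One caveat: since bit 31 never influences any output, two candidates differing only in that bit always survive, so "a single one remains" should be read modulo that untouched top bit.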
But we can be significantly more clever. First we note that the top bit of state never affects anything in this RNG. (state >> 16) & 0x7fff masks out 15 bits after shifting away the bottom 16 bits, and thus effectively works mod $2^{31}$. Since on any update the new state is a linear function of the previous state, we can propagate this modular form all the way down to the initial state, as $$f(x) \equiv f(x \bmod m) \mod m$$ for any linear $f$.

Let $a = 214013$ and $b = 2531011$. We observe the 15-bit outputs $r_0, r_1$ of two RNG calls. We'll call the 16-bit portions of the RNG state that are hidden by the shift $h_0, h_1$ respectively, for the states after the first and second call. This means the state of the RNG after the first call is $2^{16} r_0 + h_0$, and similarly $2^{16} r_1 + h_1$ after the second call. Then we have the following identity: $$a\cdot (2^{16}r_0 + h_0) + b \equiv 2^{16}r_1 + h_1 \mod 2^{31},$$ $$ah_0 \equiv h_1 + 2^{16}(r_1 - ar_0) - b \mod 2^{31}.$$

Now let $c \geq 0$ be the known constant $(2^{16}(r_1 - ar_0) - b) \bmod 2^{31}$; then for some integer $k$ we have $$ah_0 = h_1 + c + 2^{31} k.$$

Note that the left-hand side ranges from $0$ to $a(2^{16} - 1) \approx 2^{33.71}$, while $0 \leq h_1 + c < 2^{32}$. Thus we must have $-1 \leq k \leq 2^{2.71} < 7$. Rearranging, we get the following expression for $h_0$: $$h_0 = \frac{c + 2^{31} k}{a} + \frac{h_1}{a}.$$

Since $a > 2^{16}$ while $0 \leq h_1 < 2^{16}$, we note that the term $h_1/a$ satisfies $0 \leq h_1/a < 1$. Thus, assuming a solution exists, we must have $$h_0 = \left\lceil\frac{c + 2^{31} k}{a}\right\rceil.$$

So for $-1 \leq k < 7$ we compute the above guess for the hidden portion of the RNG state after the first call. This gives us 8 guesses, after which we can reject bad guesses using follow-up calls to the RNG until a single unique answer remains.

While I was able to re-derive the above with little difficulty now, 18-year-old me wasn't as experienced in discrete math. So I asked on crypto.SE, with the excuse that I wanted to 'show my colleagues how weak this RNG is'. It worked, which sparks all kinds of interesting ethics questions.

An example implementation of this process in Python:

```python
import random

A = 214013
B = 2531011

class MsvcRng:
    def __init__(self, state):
        self.state = state

    def __call__(self):
        self.state = (self.state * A + B) % 2**32
        return (self.state >> 16) & 0x7fff

# Create a random RNG state we'll reverse engineer.
hidden_rng = MsvcRng(random.randint(0, 2**32 - 1))

# Compute guesses for the hidden state from 2 observations.
r0 = hidden_rng()
r1 = hidden_rng()
c = (2**16 * (r1 - A * r0) - B) % 2**31
ceil_div = lambda a, b: (a + b - 1) // b
h_guesses = [ceil_div(c + 2**31 * k, A) for k in range(-1, 7)]

# Validate guesses until a single guess remains.
guess_rngs = [MsvcRng(2**16 * r0 + h0) for h0 in h_guesses]
guess_rngs = [g for g in guess_rngs if g() == r1]
while len(guess_rngs) > 1:
    r = hidden_rng()
    guess_rngs = [g for g in guess_rngs if g() == r]

# The top bit can not be recovered as it never affects the output,
# but we should have recovered the effective hidden state.
assert guess_rngs[0].state % 2**31 == hidden_rng.state % 2**31
```

While I did write the above process with a while loop, it appears to need at most a third output to narrow it down to a single guess.
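To turn the recovered state into a communication channel, the insecure side needs to know how many RNG calls to burn so that the next call produces a chosen value. Here's a hedged sketch reusing the MsvcRng class above (the real add-on was Lua, and I'm assuming for illustration that the secure side picks an action as the output modulo the list length):

```python
def calls_to_burn(recovered_state, action_index, n_actions):
    # Simulate a copy of the shared RNG and count how many outputs must be
    # consumed so that the *next* call selects the desired action index.
    sim = MsvcRng(recovered_state)
    burned = 0
    while sim() % n_actions != action_index:
        burned += 1
    return burned
```

The hook would then call random exactly that many times, leaving the secure environment's single call to produce the "random" action we picked.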
Putting it together

Once we could reverse-engineer the internal state of the random number generator, we could make arbitrary automated decisions in the supposedly secure environment. How it worked was as follows:

1. An insecure hook was registered that would execute right before the secure environment code would run. In this hook we had full access to information, and made a decision as to which action should be taken (e.g. casting a particular spell).
2. This action was looked up in a hardcoded list to get an index.
3. The current state of the RNG was reverse-engineered using the above process.
4. We predicted the outcome of the next RNG call. If this (modulo the length of our action list) did not give our desired outcome, we advanced the RNG and tried again, repeating until the next random number would correspond to our desired action.
5. The hook returns, and the secure environment starts.
6. It generates a "random" number, indexes our hardcoded list of actions, and performs the "random" action.

That's all! By simulating the RNG and looking one step ahead, we could use it as our information channel, choosing exactly the right moment to call random in the secure environment. If you wanted to support a list of $n$ actions it would take $n$ steps of the RNG on average before the correct number came up to pass along, but that wasn't a problem in practice.

Conclusion

I don't know when Blizzard fixed the weak, shared RNG state, or whether they were even aware of it being an issue at all. A few years after I had written the code I tried it again out of curiosity, and it had stopped working. Maybe they switched to a different algorithm, or properly separated the RNG state for the secure environment.

All in all, it was a lot of effort for a niche exploit in a video game that I didn't even want to use. But there certainly was a magic to manipulating something supposedly random into doing exactly what you want, like a magician pulling four aces from a shuffled deck.
More in programming
Revealed: How the UK tech secretary uses ChatGPT for policy advice by Chris Stokel-Walker for the New Scientist
This book's introduction started by defining strategy as "making decisions." Then we dug into exploration, diagnosis, and refinement: three chapters where you could argue that we didn't decide anything at all. Clarifying the problem to be solved is the prerequisite of effective decision making, but eventually decisions do have to be made. Here in this chapter on policy, and the following chapter on operations, we finally start to actually make some decisions.

In this chapter, we'll dig into:

- How we define policy, and how setting policy differs from operating policy as discussed in the next chapter
- The structured steps for setting policy
- How many policies should you set? Is it preferable to have one policy, many policies, or does it not matter much either way?
- Recurring kinds of policies that appear frequently in strategies
- Why it's valuable to be intentional about your strategy's altitude, and how engineers and executives generally maintain different altitudes in their strategies
- Criteria to use for evaluating whether your policies are likely to be impactful
- How to develop novel policies, and why it's rare
- Why having multiple bundles of alternative policies is generally a phase in strategy development that indicates a gap in your diagnosis
- How policies that ignore constraints sound inspirational, but accomplish little
- Dealing with ambiguity and uncertainty created by missing strategies from cross-functional stakeholders

By the end, you'll be ready to evaluate why an existing strategy's policies are struggling to make an impact, and to start iterating on policies for a strategy of your own.

This is an exploratory, draft chapter for a book on engineering strategy that I'm brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

What is policy?

Policy is interpreting your diagnosis into a concrete plan. That plan will be a collection of decisions, tradeoffs, and approaches. They'll range from coding practices, to hiring mandates, to architectural decisions, to guidance about how choices are made within your organization. An effective policy solves the entirety of the strategy's diagnosis, although the diagnosis itself is encouraged to specify which aspects can be ignored. For example, the strategy for working with private equity ownership acknowledges in its diagnosis that they don't have clear guidance on what kind of reduction to expect:

Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don't have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction.

Faced with that uncertainty, the policy simply acknowledges the ambiguity and commits to reconsider when more information becomes available:

We believe our new ownership will provide a specific target for Research and Development (R&D) operating expenses during the upcoming financial year planning. We will revise these policies again once we have explicit targets, and will delay planning around reductions until we have those numbers to avoid running two overlapping processes.

There are two frequent points of confusion when creating policies that are worth addressing directly.

First, policy is a subset of strategy, rather than the entirety of strategy, because policy is only meaningful in the context of the strategy's diagnosis.
For example, the "N-1 backfill policy" makes sense in the context of new, private equity ownership. The policy wouldn't work well in a rapidly expanding organization. Any strategy without a policy is useless, but you'll also find that policies without context aren't worth much either. This is particularly unfortunate, because so often strategies are communicated without those critical sections.

Second, policy describes how tradeoffs should be made, but it doesn't verify how the tradeoffs are actually being made in practice. The next chapter on operations covers how to inspect an organization's behavior to ensure policies are followed. When reworking a strategy to be more readable, it often makes sense to merge policy and operation sections together. However, when drafting strategy it's valuable to keep them separate. Yes, you might use a weekly meeting to review whether the policy is being followed, but whether it's an effective policy is independent of having such a meeting, and what operational mechanisms you use will vary depending on the number of policies you intend to implement.

With this definition in mind, now we can move on to the more interesting discussion of how to set policy.

How to set policy

Every part of writing a strategy feels hard when you're doing it, but I personally find that writing policy either feels uncomfortably easy or painfully challenging. It's never a happy medium. Fortunately, the exploration and diagnosis usually come together to make writing your policy simple, although sometimes that simple conclusion may be a difficult one to swallow.

The steps I follow to write a strategy's policy are:

1. Review diagnosis to ensure it captures the most important themes. It doesn't need to be perfect, but it shouldn't have omissions so obvious that you can immediately identify them.
2. Select policies that address the diagnosis. Explicitly match each policy to one or more diagnoses that it addresses. Continue adding policies until every diagnosis is covered. This is a broad instruction, but it's simpler than it sounds because you'll typically select from policies identified during your exploration phase. However, there certainly is space to tweak those policies, and to reapply familiar policies to new circumstances. If you do find yourself developing a novel policy, there's a later section in this chapter, Developing novel policies, that addresses that topic in more detail.
3. Consolidate policies in cases where they overlap or adjoin. For example, two policies about specific teams might be generalized into a policy about all teams in the engineering organization.
4. Backtest policy against recent decisions you've made. This is particularly effective if you maintain a decision log in your organization.
5. Mine for conflict once again, much as you did in developing your diagnosis. Emphasize feedback from teams and individuals with a different perspective than your own, but don't wholly eliminate those that you agree with. Just as it's easy to crowd out opposing views in diagnosis if you don't solicit their input, it's possible to accidentally crowd out your own perspective if you anchor too much on others' perspectives.
6. Consider refinement if you finish writing and you just aren't sure your approach works – that's fine! Return to the refinement phase by deploying one of the refinement techniques to increase your conviction. Remember that we talk about strategy like it's done in one pass, but almost all real strategy takes many refinement passes.
The steps of writing policy are relatively pedestrian, largely because you've done so much of the work already in the exploration, diagnosis, and refinement steps. If you skip those phases, you'd likely still follow the above steps for writing policy, but the expected quality of the policy itself would be far lower.

How many policies?

Addressing the entirety of the diagnosis is often complex, which is why most strategies feature a set of policies rather than just one. The strategy for decomposing a monolithic application is not one policy deciding not to decompose, but a series of four policies:

1. Business units should always operate in their own code repository and monolith.
2. New integrations across business unit monoliths should be done using gRPC.
3. Except for new business unit monoliths, we don't allow new services.
4. Merge existing services into business-unit monoliths where you can.

Four isn't universally the right number either. It's simply the number that was required to solve that strategy's diagnosis. With an excellent diagnosis, your policies will often feel inevitable, and perhaps even boring. That's great: what makes a policy good is that it's effective, not that it's novel or inspiring.

Kinds of policies

While there are countless policies you can write, I've found they generally fall into one of four major categories: approvals, allocations, direction, and guidance. This section introduces those categories.

Approvals define the process for making a recurring decision. This might require invoking an architecture advice process, or it might require involving an authority figure like an executive. In the Index post-acquisition integration strategy, there were a number of complex decisions to be made, and the approval mechanism was:

Escalations come to paired leads: given our limited shared context across teams, all escalations must come to both Stripe's Head of Traffic Engineering and Index's Head of Engineering.

This allowed the acquired and acquiring teams to start building trust between each other by ensuring both were consulted before any decision was finalized. On the other hand, the user data access strategy's approval process was more focused on managing corporate risk:

Exceptions must be granted in writing by CISO. While our overarching Engineering Strategy states that we follow an advisory architecture process as described in Facilitating Software Architecture, the customer data access policy is an exception and must be explicitly approved, with documentation, by the CISO. Start that process in the #ciso channel.

These two approval processes had different goals, so they made tradeoffs differently. There are so many ways to tweak approval, allowing for many different tradeoffs between safety, productivity, and trust.

Allocations describe how resources are split across multiple potential investments. Allocations are the most concrete statement of organizational priority, and also articulate the organization's belief about how productivity happens in teams. Some companies believe you go fast by swarming more people onto critical problems. Other companies believe you go fast by forcing teams to solve problems without additional headcount. Both can work, and teach you something important about the company's beliefs. The strategy on Uber's service migration has two concrete examples of allocation policies.
The first describes the Infrastructure engineering team's allocation between manual provisioning tasks and investing in a self-service provisioning platform:

Constrain manual provisioning allocation to maximize investment in self-service provisioning. The service provisioning team will maintain a fixed allocation of one full time engineer on manual service provisioning tasks. We will move the remaining engineers to work on automation to speed up future service provisioning. This will degrade manual provisioning in the short term, but the alternative is permanently degrading provisioning by the influx of new service requests from newly hired product engineers.

The second allocation policy is implicitly noted in this strategy's diagnosis, where it describes the allocation policy in the Engineering organization's higher-altitude strategy:

Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate as product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing.

Allocation policies often create a surprising amount of clarity for the team, and I include them in almost every policy I write, either explicitly or implicitly in a higher-altitude strategy.

Direction provides explicit instruction on how a decision must be made. This is the right tool when you know where you want to go, and exactly the way that you want to get there. Direction is appropriate for problems you understand clearly, where you value consistency more than empowering individual judgment. Direction works well when you need an unambiguous policy that doesn't leave room for interpretation. For example, Calm's policy for working in the monolith:

We write all code in the monolith. It has been ambiguous if new code (especially new application code) should be written in our JavaScript monolith, or if all new code must be written in a new service outside of the monolith. This is no longer ambiguous: all new code must be written in the monolith. In the rare case that there is a functional requirement that makes writing in the monolith implausible, then you should seek an exception as described below.

In that case, the team couldn't agree on what should go into the monolith. Individuals would often make incompatible decisions, so creating consistency required removing personal judgment from the equation.

Sometimes judgment is the issue, and sometimes consistency is difficult due to misaligned incentives. A good example of this comes in the strategy on working with new Private Equity ownership:

We will move to an "N-1" backfill policy, where departures are backfilled with a less senior level. We will also institute a strict maximum of one Principal Engineer per business unit.

It's likely that hiring managers would simply ignore this backfill policy if it were stated more softly, although sometimes less forceful policies are useful.

Guidance provides a recommendation about how a decision should be made. Guidance is useful when there's enough nuance, ambiguity, or complexity that you can explain the desired destination, but you can't mandate the path to reaching it.
One example of guidance comes from the Index acquisition integration strategy:

Minimize changes to tokenization environment: because point-of-sale devices directly work with customer payment details, the API that directly supports the point-of-sale device must live within our secured environment where payment details are stored. However, any other functionality must not be added to our tokenization environment.

This might read like direction, but it's clarifying the desired outcome of avoiding unnecessary complexity in the tokenization environment. However, it's not able to articulate what complexity is necessary, so ultimately it's guidance because it requires significant judgment to interpret.

A second example of guidance comes in the strategy on decomposing a monolithic codebase:

Merge existing services into business-unit monoliths where you can. We believe that each choice to move existing services back into a monolith should be made "in the details" rather than from a top-down strategy perspective. Consequently, we generally encourage teams to wind down their existing services outside of their business unit's monolith, but defer to teams to make the right decision for their local context.

This is another case of knowing the desired outcome, but encountering too much uncertainty to direct the team on how to get there. If you ask five engineers about whether it's possible to merge a given service back into a monolithic codebase, they'll probably disagree. That's fine, and highlights the value of guidance: it makes it possible to make incremental progress in areas where more concrete direction would cause confusion.

When you're working on a strategy's policy section, it's important to consider all of these categories. Which feel most natural to use will vary depending on your team and role, but they're all usable:

- If you're a developer productivity team, you might have to lean heavily on guidance in your policies and increased support for that guidance within the details of your platform.
- If you're an executive, you might lean heavily on direction. Indeed, you might lean too heavily on direction, where guidance often works better for areas where you understand the direction but not the path.
- If you're a product engineering organization, you might have to narrow the scope of your direction to the engineers within that organization to deal with the realities of complex cross-organization dynamics.

Finally, if you have a clear approach you want to take that doesn't fit cleanly into any of these categories, then don't let this framework dissuade you. Give it a try, and adapt if it doesn't initially work out.

Maintaining strategy altitude

The chapter on when to write engineering strategy introduced the concept of strategy altitude, which is being deliberate about where certain kinds of policies are created within your organization. Without repeating that section in its entirety, it's particularly relevant when you set policy to consider how your new policies eliminate flexibility within your organization. Consider these two somewhat opposing strategies:

- Stripe's Sorbet strategy only worked in an organization that enforced the use of a single programming language across (essentially) all teams.
- Uber's service migration strategy worked well in an organization that was unwilling to enforce consistent programming language adoption across teams.

Stripe's organization-altitude policy took away the freedom of individual teams to select their preferred technology stack.
In return, they unlocked the ability to centralize investment in a powerful way. Uber went the opposite way, unlocking the ability of teams to pick their preferred technology stack, while significantly reducing their centralized teams' leverage. Both altitudes make sense. Both have consequences.

Criteria for effective policies

In The Engineering Executive's Primer's chapter on engineering strategy, I introduced three criteria for evaluating policies. They ought to be applicable, enforced, and create leverage. Defining those a bit:

- Applicable: it can be used to navigate complex, real scenarios, particularly when making tradeoffs.
- Enforced: teams will be held accountable for following the guiding policy.
- Create leverage: it creates compounding or multiplicative impact.

The last of these three, create leverage, made sense in the context of a book about engineering executives, but probably doesn't make as much sense here. Some policies certainly should create leverage (e.g. empowering the developer experience team by restricting new services), but others might not (e.g. moving to an N-1 backfill policy). Outside the executive context, what's important isn't necessarily creating leverage, but that a policy solves for part of the diagnosis.

That leaves the other two, being applicable and enforced, both of which are necessary for a policy to actually address the diagnosis. Any policy which you can't determine how to apply, or aren't willing to enforce, simply won't be useful.

Let's apply these criteria to a handful of potential policies. First, let's think about policies we might write to improve the talent density of our engineering team:

- "We only hire world-class engineers." This isn't applicable, because it's unclear what a world-class engineer means. Because there's no mutually agreeable definition in this policy, it's also not consistently enforceable.
- "We only hire engineers that get at least one 'strong yes' in scorecards." This is applicable, because there's a clear definition. It's also enforceable, depending on the willingness of the organization to reject seemingly good candidates who don't happen to get a strong yes.

Next, let's think about a policy regarding code reuse within a codebase:

- "We follow a strict Don't Repeat Yourself policy in our codebase." There's room for debate within a team about whether two pieces of code are truly duplicative, but this is generally applicable. Because there's room for debate, though, deciding how to enforce it is a very context-specific determination.
- "Code authors are responsible for determining if their contributions violate Don't Repeat Yourself, and rewriting them if they do." This is much more applicable, because now only a single person's judgment is needed to assess the potential repetition. In some ways, this policy is also more enforceable, because there's no longer any ambiguity around who is deciding whether a piece of code is a repetition. The challenge is that enforceability now depends on one individual, and making this policy effective will require holding individuals accountable for the quality of their judgment. An organization that's unwilling to distinguish between good and bad judgment won't get any value out of the policy.

This is a good example of how a good policy in one organization might become a poor policy in another.
If you ever find yourself wanting to include a policy that for some reason can't be applied or can't be enforced, stop to ask yourself what you're trying to accomplish, and consider whether there's a different policy better suited to that goal.

Developing novel policies

My experience is that there are vanishingly few truly novel policies to write. There's almost always someone else who has already done something similar to your intended approach. Calm's engineering strategy is such a case: the details are particular to the company, but the general approach is common across the industry. The most likely place to find truly novel policies is during the adoption phase of a new widespread technology, such as the rise of ubiquitous mobile phones, cloud computing, or large language models. Even then, as explored in the strategy for adopting large language models, the new technology can be engaged with as a generic technology:

Develop an LLM-backed process for reactivating departed and suspended drivers in mature markets. Through modeling our driver lifecycle, we determined that improving onboarding time will have little impact on the total number of active drivers. Instead, we are focusing on mechanisms to reactivate departed and suspended drivers, which is the only opportunity to meaningfully impact active drivers.

You could simply replace "LLM" with "data-driven" and it would be equally readable. In this way, policy can generally sidestep areas of uncertainty by being a bit abstract. This avoids being overly specific about topics you simply don't know much about.

However, even if your policy isn't novel to the industry, it might still be novel to you or your organization. The steps that I've found useful to debug novel policies are the same steps as running a condensed version of the strategy process, with a focus on exploration and refinement:

- Collect a number of similar policies, with a focus on how those policies differ from the policy you are creating.
- Create a systems model to articulate how this policy will work, and also how it will differ from the similar policies you're considering.
- Run a strategy testing cycle for your proto-policy to discover any unknown-unknowns about how it works in practice.

Whether you run into this scenario is largely a function of the extent of your, and your organization's, experience. Early in my career, I found myself doing novel (for me) strategy work very frequently; these days I rarely find myself doing novel work, instead focusing on adapting well-known policies to new circumstances.

Are competing policy proposals an anti-pattern?

When creating policy, you'll often have to engage with the question of whether you should develop one preferred policy or a series of potential policies to pick from. Developing these is a useful stage of setting policy, but rather than helping you refine your policy, I'd encourage you to think of this as exposing gaps in your diagnosis. For example, when Stripe developed the Sorbet ruby-typing tooling, there was debate between two policies:

- Should we build a ruby-typing tool to allow a centralized team to gradually migrate the company to a typed codebase?
- Should we migrate the codebase to a preexisting strongly typed language like Golang or Java?

These were, initially, equally valid hypotheses. It was only by clarifying our diagnosis around resourcing that it became clear that incurring the bulk of costs in a centralized team was preferable to spreading the costs across many teams.
Specifically, we recognized that we wanted to prioritize short-term product engineering velocity, even if it led to a longer migration overall.

If you do develop multiple policy options, I encourage you to move the alternatives into an appendix rather than including them in the core of your strategy document. This will make it easier for readers of your final version to understand how to follow your policies, and those readers are the most important long-term users of your written strategy.

Recognizing constraints

A similar problem to competing solutions is developing a policy that you cannot possibly fund. It's easy to get enamored with policies that you can't meaningfully enforce, but that's bad policy, even if it would work in an alternate universe where it was possible to enforce or resource it. To consider a few examples:

- The strategy for controlling access to user data might have proposed requiring manual approval by a second party of every access to customer data. However, that would have gone nowhere.
- Our approach to Uber's service migration might have required more staffing for the infrastructure engineering team, but we knew that wasn't going to happen, so it was a meaningless policy proposal to make.
- The strategy for navigating private equity ownership might have argued that new ownership should not hold engineering accountable to a new standard on spending. But they would have just invalidated that strategy in the next financial planning period.

If you find a policy that contemplates an impractical approach, it doesn't only indicate that the policy is a poor one, it also suggests your diagnosis is missing an important pillar. Rather than debating the policy options, the fastest path to resolution is to align on the diagnosis that would invalidate potential paths forward. In cases where aligning on the diagnosis isn't possible, for example because you simply don't understand the possibilities of a new technology, as encountered in the strategy for adopting LLMs, then you've typically found a valuable opportunity to use strategy refinement to build alignment.

Dealing with missing strategies

At a recent company offsite, we were debating which policies we might adopt to deal with annual plans that kept getting derailed after less than a month. Someone remarked that this would be much easier if we could get the executive team to commit to a clearer, written strategy about which business units we were prioritizing. They were, of course, right. It would be much easier. Unfortunately, it goes back to the problem we discussed in the diagnosis chapter about reframing blockers into diagnosis. If a strategy from the company or a peer function is missing, the empowering thing to do is to include the absence in your diagnosis and move forward.

Sometimes, even when you do this, it's easy to fall back into the belief that you cannot set a policy because a peer function might set a conflicting policy in the future. Whether you're an executive or an engineer, you'll never have all the details you want to make the ideal policy. Meaningful leadership requires taking meaningful risks, which is never something that gets comfortable.

Summary

After working through this chapter, you know how to develop policy, how to assemble policies to solve your diagnosis, and how to avoid a number of the frequent challenges that policy writers encounter. At this point, there's only one phase of strategy left to dig into: operating the policies you've created.
I was building a small feature for the Flickr Commons Explorer today: show a random selection of photos from the entire collection. I wanted a fast and varied set of photos. This meant getting a random sample of rows from a SQLite table (because the Explorer stores all its data in SQLite). I'm happy with the code I settled on, but it took several attempts to get right.

Approach #1: ORDER BY RANDOM()

My first attempt was pretty naïve – I used an ORDER BY RANDOM() clause to sort the table, then limit the results:

```sql
SELECT *
FROM photos
ORDER BY random()
LIMIT 10
```

This query works, but it was slow – about half a second to sample a table with 2 million photos (which is very small by SQLite standards). This query would run on every request for the homepage, so that latency is unacceptable. It's slow because it forces SQLite to generate a value for every row, then sort all the rows, and only then apply the limit. SQLite is fast, but there's only so fast you can sort millions of values.

I found a suggestion from Stack Overflow user Ali to do a random sort on the id column first, pick my IDs from that, and only fetch the whole row for the photos I'm selecting:

```sql
SELECT *
FROM photos
WHERE id IN (
    SELECT id FROM photos ORDER BY RANDOM() LIMIT 10
)
```

This means SQLite only has to load the rows it's returning, not every row in the database. This query was over three times faster – about 0.15s – but that's still slower than I wanted.

Approach #2: WHERE rowid > (…)

Scrolling down the Stack Overflow page, I found an answer by Max Shenfield with a different approach:

```sql
SELECT *
FROM photos
WHERE rowid > (
    ABS(RANDOM()) % (SELECT max(rowid) FROM photos)
)
LIMIT 10
```

The rowid is a unique identifier that's used as a primary key in most SQLite tables, and it can be looked up very quickly. SQLite automatically assigns a unique rowid unless you explicitly tell it not to, or create your own integer primary key. This query works by picking a point between the biggest and smallest rowid values used in the table, then getting the rows with rowids which are higher than that point. If you want to know more, Max's answer has a more detailed explanation.

This query is much faster – around 0.0008s – but I didn't go this route. The result is more like a random slice than a random sample. In my testing, it always returned contiguous rows – 101, 102, 103, … – which isn't what I want. The photos in the Commons Explorer database were inserted in upload order, so photos with adjacent row IDs were uploaded at around the same time and are probably quite similar. I'd get one photo of an old plane, then nine more photos of other planes. I want more variety!

(This behaviour isn't guaranteed – if you don't add an ORDER BY clause to a SELECT query, then the order of results is undefined. SQLite is returning rows in rowid order in my table, and a quick Google suggests that's pretty common, but that may not be true in all cases. It doesn't affect whether I want to use this approach, but I mention it here because I was confused about the ordering when I read this code.)

Approach #3: Select random rowid values outside SQLite

Max's answer was the first time I'd heard of rowid, and it gave me an idea – what if I chose random rowid values outside SQLite? This is a less "pure" approach because I'm not doing everything in the database, but I'm happy with that if it gets the result I want. Here's the procedure I came up with:

1. Create an empty list to store our sample.
2. Find the highest rowid that's currently in use:

```sql
sqlite> SELECT MAX(rowid) FROM photos;
1913389
```

3. Use a random number generator to pick a rowid between 1 and the highest rowid. If we've already got this rowid, discard it and generate a new one. (The rowid is a signed, 64-bit integer, so the minimum possible value is always 1.)

```python
>>> import random
>>> random.randint(1, max_rowid)
196476
```

4. Look for a row with that rowid:

```sql
SELECT * FROM photos WHERE rowid = 196476
```

5. If such a row exists, add it to our sample. If we have enough items in our sample, we're done. Otherwise, return to step 3 and generate another rowid.
6. If such a row doesn't exist, return to step 3 and generate another rowid.
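Stitched together, the procedure might look something like the following Python sketch. This is my illustration rather than the actual Commons Explorer code, and it assumes a non-empty photos table accessed through the standard sqlite3 module:

```python
import random
import sqlite3

def random_sample(conn, n=10):
    # Step 2: find the highest rowid currently in use (assumes a non-empty table).
    (max_rowid,) = conn.execute("SELECT MAX(rowid) FROM photos").fetchone()

    sample = {}
    while len(sample) < n:
        # Step 3: pick a candidate rowid, discarding ones we've already used.
        rowid = random.randint(1, max_rowid)
        if rowid in sample:
            continue

        # Step 4: look for a row with that rowid.
        row = conn.execute(
            "SELECT * FROM photos WHERE rowid = ?", (rowid,)
        ).fetchone()

        # Steps 5 and 6: keep the row if it exists, otherwise try again.
        if row is not None:
            sample[rowid] = row

    return list(sample.values())
```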
This requires a bit more code, but it returns a diverse sample of photos, which is what I really care about. It's a bit slower, but still plenty fast enough (about 0.001s).

This approach is best for tables where the rowid values are mostly contiguous – if there are lots of rowids between 1 and the max that don't exist, you might try multiple missing entries before finding a valid row, slowing down the query. In that case you might want to try something different, like tracking valid rowid values separately. This is a good fit for my use case, because photos don't get removed from Flickr Commons very often. Once a row is written, it sticks around, and over 97% of the possible rowid values do exist.

Summary

Here are the four approaches I tried:

| Approach | Performance (for 2M rows) | Notes |
|---|---|---|
| ORDER BY RANDOM() | ~0.5s | Slowest, easiest to read |
| WHERE id IN (SELECT id …) | ~0.15s | Faster, still fairly easy to understand |
| WHERE rowid > ... | ~0.0008s | Returns clustered results |
| Random rowid in Python | ~0.001s | Fast and returns varied results, requires code outside SQL |

I'm using the random rowid in Python in the Commons Explorer, trading code complexity for speed. I'm using this random sample to render a web page, so it's important that it returns quickly – when I was testing ORDER BY RANDOM(), I could feel myself waiting for the page to load. But I've used ORDER BY RANDOM() in the past, especially for asynchronous data pipelines where I don't care about absolute performance. It's simpler to read and easier to see what's going on.

Now it's your turn – visit the Commons Explorer and see what random gems you can find. Let me know if you spot anything cool!