More from orlp.net - Blog Archive
Hash functions are incredibly neat mathematical objects. They can map arbitrary data to a small fixed-size output domain such that the mapping is deterministic, yet appears to be random. This “deterministic randomness” is incredibly useful for a variety of purposes, such as hash tables, checksums, monte carlo algorithms, communication-less distributed algorithms, etc, the list goes on. In this article we will take a look at the dark side of hash functions: when things go wrong. Luckily this essentially never happens due to unlucky inputs in the wild (for good hash functions, at least). However, people exist, and some of them may be malicious. Thus we must look towards computer security for answers. I will quickly explain some of the basics of hash function security and then show how easy it is to break this security for some commonly used non-cryptographic hash functions. As a teaser, this article explains how you can generate strings such as these, thousands per second: cityhash64("orlp-cityhash64-D-:K5yx*zkgaaaaa") == 1337 murmurhash2("orlp-murmurhash64-bkiaaa&JInaNcZ") == 1337 murmurhash3("orlp-murmurhash3_x86_32-haaaPa*+") == 1337 farmhash64("orlp-farmhash64-/v^CqdPvziuheaaa") == 1337 I also show how you can create some really funky pairs of strings that can be concatenated arbitrarily such that when concatenating $k$ strings together any of the $2^k$ combinations all have the same hash output, regardless of the seed used for the hash function: a = "xx0rlpx!xxsXъВ" b = "xxsXъВxx0rlpx!" murmurhash2(a + a, seed) == murmurhash2(a + b, seed) murmurhash2(a + a, seed) == murmurhash2(b + a, seed) murmurhash2(a + a, seed) == murmurhash2(b + b, seed) a = "!&orlpՓ" b = "yǏglp$X" murmurhash3(a + a, seed) == murmurhash3(a + b, seed) murmurhash3(a + a, seed) == murmurhash3(b + a, seed) murmurhash3(a + a, seed) == murmurhash3(b + b, seed) Hash function security basics Hash functions play a critical role in computer security. Hash functions are used not only to verify messages over secure channels, they are also used to identify trusted updates as well as known viruses. Virtually every signature scheme ever used starts with a hash function. If a hash function does not behave randomly, we can break the above security constructs. Cryptographic hash functions thus take the randomness aspect very seriously. The ideal hash function would choose an output completely at random for each input, remembering that choice for future calls. This is called a random oracle. The problem is that a random oracle requires a true random number generator, and more problematically, a globally accessible infinite memory bank. So we approximate it using deterministic hash functions instead. These compute their output by essentially shuffling their input really, really well, in such a way that it is not feasible to reverse. To help quantify whether a specific function does a good job of approximating a random oracle, cryptographers came up with a variety of properties that a random oracle would have. The three most important and well-known properties a secure cryptographic hash function should satisfy are: Pre-image resistance. For some constant $c$ it should be hard to find some input $m$ such that $h(m) = c$. Second pre-image resistance. For some input $m_1$ it should be hard to find another input $m_2$ such that $h(m_1) = h(m_2)$. Collision resistance. It should be hard to find inputs $m_1, m_2$ such that $h(m_1) = h(m_2)$. Note that pre-image resistance implies second pre-image resistance which in turn implies collision resistance. Conversely, a pre-image attack breaks all three properties. We generally consider one of these properties broken if there exists a method that produces a collision or pre-image faster than simply trying random inputs (also known as a brute force attack). However, there are definitely gradations in breakage, as some methods are only several orders of magnitude faster than brute force. That may sound like a lot, but a method taking $2^{110}$ steps instead of $2^{128}$ are still both equally out of reach for today’s computers. MD5 used to be a common hash function, and SHA-1 is still in common use today. While both were considered cryptographically secure at one point, generating MD5 collisions now takes less than a second on a modern PC. In 2017 a collaboration of researchers from CWI and Google and announced the first SHA-1 collision. However, as far as I’m aware, neither MD5 nor SHA-1 have practical (second) pre-image attacks, only theoretical ones. Non-cryptographic hash functions Cryptographically secure hash functions tend to have a small problem: they’re slow. Modern hash functions such as BLAKE3 resolve this somewhat by heavily vectorizing the hash using SIMD instructions, as well as parallelizing over multiple threads, but even then they require large input sizes before reaching those speeds. One particular use-case for hash functions is deriving a secret key from a password: a key derivation function. Unlike regular hash functions, being slow is actually a safety feature here to protect against brute forcing passwords. Modern ones such as Argon2 also intentionally use a lot of memory for protection against specialized hardware such as ASICs or FPGAs. A lot of problems don’t necessarily require secure hash functions, and people would much prefer a faster hash speed. Especially when we are computing many small hashes, such as in a hash table. Let’s take a look what common hash table implementations actually use as their hash for strings: C++: there are multiple standard library implementations, but 64-bit clang 13.0.0 on Apple M1 ships CityHash64. Currently libstdc++ ships MurmurHash64A, a variant of Murmur2 for 64-bit platforms. Java: OpenJDK uses an incredibly simple hash algorithm, which essentially just computes h = 31 * h + c for each character c. PHP: the Zend engine uses essentially the same algorithm as Java, just using unsigned integers and 33 as its multiplier. Nim: it used to use MurmurHash3_x86_32. While writing this article they appeared to have switched to use farmhash by default. Zig: it uses wyhash by default, with 0 as seed. Javascript: in V8 they use a custom weak string hash, with a randomly initialized seed. There were some that used stronger hashes by default as well: Go uses an AES-based hash if hardware acceleration is available on x86-64. Even though its construction is custom and likely not full-strength cryptographically secure, breaking it is too much effort and quite possibly beyond my capabilities. If not available, it uses an algorithm inspired by wyhash. Python and Rust use SipHash by default, which is a cryptographically secure pseudorandom function. This is effectively a hash function where you’re allowed to use a secret key during hashing, unlike a hash like SHA-2 where everyone knows all information involved. This latter concept is actually really important, at least for protecting against HashDoS in hash tables. Even if a hash function is perfectly secure over its complete output, hash tables further reduce the output to only a couple bits to find the data it is looking for. For a static hash function without any randomness it’s possible to produce large lists of hashes that collide post-reduction, just by brute force. But for non-cryptographic hashes as we’ll see here we often don’t need brute force and can generate collisions at high speed for the full output, if not randomized by a random seed. Interlude: inverse operations Before we get to breaking some of the above hash functions, I must explain a basic technique I will use a lot: the inverting of operations. We are first exposed to this in primary school, where we might get faced by a question such as “$2 + x = 10$”. There we learn subtraction is the inverse of addition, such that we may find $x$ by computing $10 - 2 = 8$. Most operations on the integer registers in computers are also invertible, despite the integers being reduced modulo $2^{w}$ in the case of overflow. Let us study some: Addition can be inverted using subtraction. That is, x += y can be inverted using x -= y. Seems obvious enough. Multiplication by a constant $c$ is not inverted by division. This would not work in the case of overflow. Instead, we calculate the modular multiplicative inverse of $c$. This is an integer $c^{-1}$ such that $c \cdot c^{-1} \equiv 1 \pmod {m}$. Then we invert multiplication by $c$ simply by multiplying by $c^{-1}$. This constant exists if and only if $c$ is coprime with our modulus $m$, which for us means that $c$ must be odd as $m = 2^n$. For example, multiplication by $2$ is not invertible, which is easy to see as such, as it is equivalent to a bit shift to the left by one position, losing the most significant bit forever. Without delving into the details, here is a snippet of Python code that computes the modular multiplicative inverse of an integer using the extended Euclidean algorithm by calculating $x, y$ such that $$cx + my = \gcd(c, m).$$ Then, because $c$ is coprime we find $\gcd(c, m) = 1$, which means that $$cx + 0 \equiv 1 \pmod m,$$ and thus $x = c^{-1}$. def egcd(a, b): if a == 0: return (b, 0, 1) g, y, x = egcd(b % a, a) return (g, x - (b // a) * y, y) def modinv(c, m): g, x, y = egcd(c, m) assert g == 1, "c, m must be coprime" return x % m Using this we can invert modular multiplication: >>> modinv(17, 2**32) 4042322161 >>> 42 * 17 * 4042322161 % 2**32 42 Magic! XOR can be inverted using… XOR. It is its own inverse. So x ^= y can be inverted using x ^= y. Bit shifts can not be inverted, but two common operations in hash functions that use bit shifts can be. The first is bit rotation by a constant. This is best explained visually, for example a bit rotation to the left by 3 places on a 8-bit word, where each bit is shown as a letter: abcdefghi defghiabc The formula for a right-rotation of k places is (x >> k) | (x << (w - k)), where w is the width of the integer type. Its inverse is a left-rotation, which simply swaps the direction of both shifts. Alternatively, the inverse of a right-rotation of k places is another right-rotation of w-k places. Another common operation in hash functions is the “xorshift”. It is an operation of one of the following forms, with $k > 0$: x ^= x << k // Left xorshift. x ^= x >> k // Right xorshift. How to invert it is entirely analogous between the two, so I will focus on the left xorshift. An important observation is that the least significant $k$ bits are left entirely untouched by the xorshift. Thus by repeating the operation, we recover the least significant $2k$ bits, as the XOR will invert itself for the next $k$ bits. Let’s take a look at the resulting value to see how we should proceed: v0 = (x << k) ^ x // Apply first step of inverse v1 = v0 ^ (v0 << k). v1 = (x << 2*k) ^ (x << k) ^ (x << k) ^ x // Simplify using self-inverse (x << k) ^ (x << k) = 0. v1 = (x << 2*k) ^ x From this we can conclude the following identity: $$\operatorname{xorshift}(\operatorname{xorshift}(x, k), k) = \operatorname{xorshift}(x, 2k)$$ Now we only need one more observation to complete our algorithm: a xorshift of $k \geq w$ where $w$ is the width of our integer is a no-op. Thus we repeatedly apply our doubling identity until we reach large enough $q$ such that $\operatorname{xorshift}(x, 2^q \cdot k) = x$. For example, to invert a left xorshift by 13 for 64-bit integers we apply the following sequence: x ^= x << 13 // Left xorshift by 13. x ^= x << 13 // Inverse step 1. x ^= x << 26 // Inverse step 2. x ^= x << 52 // Inverse step 3. // x ^= x << 104 // Next step would be a no-op. Armed with this knowledge, we can now attack. Breaking CityHash64 Let us take a look at (part of) the source code of CityHash64 from libcxx that’s used for hashing strings on 64-bit platforms: C++ standard library code goes through a process known as 'uglification', which prepends underscores to all identifiers. This is because those identifiers are reserved by the standard to only be used in the standard library, and thus won't be replaced by macros from standards-compliant code. For your sanity's sake I removed them here. static const uint64_t mul = 0x9ddfea08eb382d69ULL; static const uint64_t k0 = 0xc3a5c85c97cb3127ULL; static const uint64_t k1 = 0xb492b66fbe98f273ULL; static const uint64_t k2 = 0x9ae16a3b2f90404fULL; static const uint64_t k3 = 0xc949d7c7509e6557ULL; template<class T> T loadword(const void* p) { T r; std::memcpy(&r, p, sizeof(r)); return r; } uint64_t rotate(uint64_t val, int shift) { if (shift == 0) return val; return (val >> shift) | (val << (64 - shift)); } uint64_t hash_len_16(uint64_t u, uint64_t v) { uint64_t x = u ^ v; x *= mul; x ^= x >> 47; uint64_t y = v ^ x; y *= mul; y ^= y >> 47; y *= mul; return y; } uint64_t hash_len_17_to_32(const char *s, uint64_t len) { const uint64_t a = loadword<uint64_t>(s) * k1; const uint64_t b = loadword<uint64_t>(s + 8); const uint64_t c = loadword<uint64_t>(s + len - 8) * k2; const uint64_t d = loadword<uint64_t>(s + len - 16) * k0; return hash_len_16( rotate(a - b, 43) + rotate(c, 30) + d, a + rotate(b ^ k3, 20) - c + len ); } To break this, let’s assume we’ll always give length 32 inputs. Then the implementation will always call hash_len_17_to_32, and we have full control over variables a, b, c and d by changing our input. Note that d is only used once, in the final expression. This makes it a prime target for attacking the hash. We will choose a, b and c arbitrarily, and then solve for d to compute a desired hash outcome. Using the above modinv function we first compute the necessary modular multiplicative inverses of mul and k0: >>> 0x9ddfea08eb382d69 * 0xdc56e6f5090b32d9 % 2**64 1 >>> 0xc3a5c85c97cb3127 * 0x81bc9c5aa9c72e97 % 2**64 1 We also note that in this case the xorshift is easy to invert, as x ^= x >> 47 is simply its own inverse. Having all the components ready, we can invert the function step by step. We first load a, b and c like in the hash function, and compute uint64_t v = a + rotate(b ^ k3, 20) - c + len; which is the second parameter to hash_len_16. Then, starting from our desired return value of hash_len_16(u, v) we work backwards step by step, inverting each operation to find the function argument u that would result in our target hash. Then once we have found such the unique u we compute our required input d. Putting it all together: static const uint64_t mul_inv = 0xdc56e6f5090b32d9ULL; static const uint64_t k0_inv = 0x81bc9c5aa9c72e97ULL; void cityhash64_preimage32(uint64_t hash, char *s) { const uint64_t len = 32; const uint64_t a = loadword<uint64_t>(s) * k1; const uint64_t b = loadword<uint64_t>(s + 8); const uint64_t c = loadword<uint64_t>(s + len - 8) * k2; uint64_t v = a + rotate(b ^ k3, 20) - c + len; // Invert hash_len_16(u, v). Original operation inverted // at each step is shown on the right, note that it is in // the inverse order of hash_len_16. uint64_t y = hash; // return y; y *= mul_inv; // y *= mul; y ^= y >> 47; // y ^= y >> 47; y *= mul_inv; // y *= mul; uint64_t x = y ^ v; // uint64_t y = v ^ x; x ^= x >> 47; // x ^= x >> 47; x *= mul_inv; // x *= mul; uint64_t u = x ^ v; // uint64_t x = u ^ v; // Find loadword<uint64_t>(s + len - 16). uint64_t d = u - rotate(a - b, 43) - rotate(c, 30); d *= k0_inv; std::memcpy(s + len - 16, &d, sizeof(d)); } The chance that a random uint64_t forms 8 printable ASCII bytes is $\left(94/256\right)^8 \approx 0.033%$. Not great, but cityhash64_preimage32 is so fast that having to repeat it on average ~3000 times to get a purely ASCII result isn’t so bad. For example, the following 10 strings all hash to 1337 using CityHash64, generated using this code: I’ve noticed there’s variants of CityHash64 with subtle differences in the wild. I chose to attack the variant shipped with libc++, so it should work for std::hash there, for example. I also assume a little-endian machine throughout this article, your mileage may vary on a big-endian machine depending on the hash implementation. orlp-cityhash64-D-:K5yx*zkgaaaaa orlp-cityhash64-TXb7;1j&btkaaaaa orlp-cityhash64-+/LM$0 ;msnaaaaa orlp-cityhash64-u'f&>I'~mtnaaaaa orlp-cityhash64-pEEv.LyGcnpaaaaa orlp-cityhash64-v~~bm@,Vahtaaaaa orlp-cityhash64-RxHr_&~{miuaaaaa orlp-cityhash64-is_$34#>uavaaaaa orlp-cityhash64-$*~l\{S!zoyaaaaa orlp-cityhash64-W@^5|3^:gtcbaaaa Breaking MurmurHash2 We can’t let libstdc++ get away after targetting libc++, can we? The default string hash calls an implementation of MurmurHash2 with seed 0xc70f6907. The hash—simplified to only handle strings whose lengths are multiples of 8—is as follows: uint64_t murmurhash64a(const char* s, size_t len, uint64_t seed) { const uint64_t mul = 0xc6a4a7935bd1e995ULL; uint64_t hash = seed ^ (len * mul); for (const char* p = s; p != s + len; p += 8) { uint64_t data = loadword<uint64_t>(p); data *= mul; data ^= data >> 47; data *= mul; hash ^= data; hash *= mul; } hash ^= hash >> 47; hash *= mul; hash ^= hash >> 47; return hash; } We can take a similar approach here as before. We note that the modular multiplicative inverse of 0xc6a4a7935bd1e995 mod $2^{64}$ is 0x5f7a0ea7e59b19bd. As an example, we can choose the first 24 bytes arbitrarily, and solve for the last 8 bytes: void murmurhash64a_preimage32(uint64_t hash, char* s, uint64_t seed) { const uint64_t mul = 0xc6a4a7935bd1e995ULL; const uint64_t mulinv = 0x5f7a0ea7e59b19bdULL; // Compute the hash state for the first 24 bytes as normal. uint64_t state = seed ^ (32 * mul); for (const char* p = s; p != s + 24; p += 8) { uint64_t data = loadword<uint64_t>(p); data *= mul; data ^= data >> 47; data *= mul; state ^= data; state *= mul; } // Invert target hash transformation. // return hash; hash ^= hash >> 47; // hash ^= hash >> 47; hash *= mulinv; // hash *= mul; hash ^= hash >> 47; // hash ^= hash >> 47; // Invert last iteration for last 8 bytes. hash *= mulinv; // hash *= mul; uint64_t data = state ^ hash; // hash = hash ^ data; data *= mulinv; // data *= mul; data ^= data >> 47; // data ^= data >> 47; data *= mulinv; // data *= mul; std::memcpy(s + 24, &data, 8); // data = loadword<uint64_t>(s); } The following 10 strings all hash to 1337 using MurmurHash64A with the default seed 0xc70f6907, generated using this code: orlp-murmurhash64-bhbaaat;SXtgVa orlp-murmurhash64-bkiaaa&JInaNcZ orlp-murmurhash64-ewmaaa(%J+jw>j orlp-murmurhash64-vxpaaag"93\Yj5 orlp-murmurhash64-ehuaaafa`Wp`/| orlp-murmurhash64-yizaaa1x.zQF6r orlp-murmurhash64-lpzaaaZphp&c F orlp-murmurhash64-wsjbaa771rz{z< orlp-murmurhash64-rnkbaazy4X]p>B orlp-murmurhash64-aqnbaaZ~OzP_Tp Universal collision attack on MurmurHash64A In fact, MurmurHash64A is so weak that Jean-Philippe Aumasson, Daniel J. Bernstein and Martin Boßlet published an attack that creates sets of strings which collide regardless of the random seed used. To be fair to CityHash64… just kidding they found universal collisions against it as well, regardless of seed used. CityHash64 is actually much easier to break in this way, as simply doing the above pre-image attack targetting 0 as hash makes the output purely dependent on the seed, and thus a universal collision. To see how it works, let’s take a look at the core loop of MurmurHash64A: uint64_t data = loadword<uint64_t>(p); data *= mul; // Trivially invertible. data ^= data >> 47; // Trivially invertible. data *= mul; // Trivially invertible. state ^= data; state *= mul; We know we can trivially invert the operations done on data regardless of what the current state is, so we might as well have had the following body: state ^= data; state *= mul; Now the hash starts looking rather weak indeed. The clever trick they employ is by creating two strings simultaneously, such that they differ precisely in the top bit in each 8-byte word. Why the top bit? >>> 1 << 63 9223372036854775808 >>> (1 << 63) * mul % 2**64 9223372036854775808 Since mul is odd, its least significant bit is set. Multiplying 1 << 63 by it is equivalent to shifting that bit 63 places to the left, which is once again 1 << 63. That is, 1 << 63 is a fixed point for the state *= mul operation. We also note that for the top bit XOR is equivalent to addition, as the overflow from addition is removed mod $2^{64}$. So if we have two input strings, one starting with the 8 bytes data, and the other starting with data ^ (1 << 63) == data + (1 << 63) (after doing the trivial inversions). We then find that the two states, regardless of seed, differ exactly in the top bit after state ^= data. After multiplication we find we have two states x * mul and (x + (1 << 63)) * mul == x * mul + (1 << 63)… which again differ exactly in the top bit! We are now back to state ^= data in our iteration, for the next 8 bytes. We can now use this moment to cancel our top bit difference, by again feeding two 8-byte strings that differ in the top bit (after inverting). In fact, we only have to find one pair of such strings that differ in the top bit, which we can then repeat twice (in either order) to cancel our difference again. When represented as a uint64_t if we choose the first string as x we can derive the second string as x *= mul; // Forward transformation... x ^= x >> 47; // ... x *= mul; // ... x ^= 1 << 63; // Difference in top bit. x *= mulinv; // Backwards transformation... x ^= x >> 47; // ... x *= mulinv; // ... I was unable to find a printable ASCII string that has another printable ASCII string as its partner. But I was able to find the following pair of 8-byte UTF-8 strings that differ in exactly the top bit after the Murmurhash64A input transformation: xx0rlpx! xxsXъВ Combining them as such gives two 16-byte strings that when fed through the hash algorithm manipulate the state in the same way: a collision. xx0rlpx!xxsXъВ xxsXъВxx0rlpx! But it doesn’t stop there. By concatenating these two strings we can create $2^n$ different colliding strings each $16n$ bytes long. With the current libstdc++ implementation the following prints the same number eight times: std::hash<std::u8string> h; std::u8string a = u8"xx0rlpx!xxsXъВ"; std::u8string b = u8"xxsXъВxx0rlpx!"; std::cout << h(a + a + a) << "\n"; std::cout << h(a + a + b) << "\n"; std::cout << h(a + b + a) << "\n"; std::cout << h(a + b + b) << "\n"; std::cout << h(b + a + a) << "\n"; std::cout << h(b + a + b) << "\n"; std::cout << h(b + b + a) << "\n"; std::cout << h(b + b + b) << "\n"; Even if the libstdc++ would randomize the seed used by MurmurHash64a, the strings would still collide. Breaking MurmurHash3 Nim uses used to use MurmurHash3_x86_32, so let’s try to break that. If we once again simplify to strings whose lengths are a multiple of 4 we get the following code: uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); } uint32_t murmurhash3_x86_32(const char* s, int len, uint32_t seed) { const uint32_t c1 = 0xcc9e2d51; const uint32_t c2 = 0x1b873593; const uint32_t c3 = 0x85ebca6b; const uint32_t c4 = 0xc2b2ae35; uint32_t h = seed; for (const char* p = s; p != s + len; p += 4) { uint32_t k = loadword<uint32_t>(p); k *= c1; k = rotl32(k, 15); k *= c2; h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64; } h ^= len; h ^= h >> 16; h *= c3; h ^= h >> 13; h *= c4; h ^= h >> 16; return h; } I think by now you should be able to get this function to spit out any value you want if you know the seed. The inverse of rotl32(x, r) is rotl32(x, 32-r) and the inverse of h ^= h >> 16 is once again just h ^= h >> 16. Only h ^= h >> 13 is a bit different, it’s the first time we’ve seen that a xorshift’s inverse has more than one step: h ^= h >> 13 h ^= h >> 26 Compute the modular inverses of c1 through c4 as well as 5 mod $2^{32}$, and go to town. If you want to cheat or check your answer, you can check out the code I’ve used to generate the following ten strings that all hash to 1337 when fed to MurmurHash3_x86_32 with seed 0: orlp-murmurhash3_x86_32-haaaPa*+ orlp-murmurhash3_x86_32-saaaUW&< orlp-murmurhash3_x86_32-ubaa/!/" orlp-murmurhash3_x86_32-weaare]] orlp-murmurhash3_x86_32-chaa5@/} orlp-murmurhash3_x86_32-claaM[,5 orlp-murmurhash3_x86_32-fraaIx`N orlp-murmurhash3_x86_32-iwaara&< orlp-murmurhash3_x86_32-zwaa]>zd orlp-murmurhash3_x86_32-zbbaW-5G Nim uses 0 as a fixed seed. You might wonder about the ethics of publishing functions for generating arbitrary amounts of collisions for hash functions actually in use today. I did consider holding back. But HashDoS has been a known attack for almost two decades now, and the universal hash collisions I’ve shown were also published more than a decade ago now as well. At some point you’ve had enough time to, uh, fix your shit. Universal collision attack on MurmurHash3 Suppose that Nim didn’t use 0 as a fixed seed, but chose a randomly generated one. Can we do a similar attack as the one done to MurmurHash2 to still generate universal multicollisions? Yes we can. Let’s take another look at that core loop body: uint32_t k = loadword<uint32_t>(p); k *= c1; // Trivially invertable. k = rotl32(k, 15); // Trivially invertable. k *= c2; // Trivially invertable. h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64; Once again we can ignore the first three trivially invertable instructions as we can simply choose our input so that we get exactly the k we want. Remember from last time that we want to introduce a difference in exactly the top bit of h, as the multiplication will leave this difference in place. But here there is a bit rotation between the XOR and the multiplication. The solution? Simply place our bit difference such that rotl32(h, 13) shifts it into the top position. Does the addition of 0xe6546b64 mess things up? No. Since only the top bit between the two states will be different, there is a difference of exactly $2^{31}$ between the two states. This difference is maintained by the addition. Since two 32-bit numbers with the same top bit can be at most $2^{31} - 1$ apart, we can conclude that the two states still differ in the top bit after the addition. So we want to find two pairs of 32-bit ints, such that after applying the first three instructions the first pair differs in bit 1 << (31 - 13) == 0x00040000 and the second pair in bit 1 << 31 == 0x80000000. After some brute-force searching I found some cool pairs (again forced to use UTF-8), which when combined give the following collision: a = "!&orlpՓ" b = "yǏglp$X" As before, any concatenation of as and bs of length n collides with all other combinations of length n. Breaking FarmHash64 Nim switched to farmhash since I started writing this post. To break it we can notice that its structure is very similar to CityHash64, so we can use those same techniques again. In fact, the only changes between the two for lengths 17-32 bytes is that a few operators were changed from subtraction/XOR to addition, a rotation operator had its constant tweaked, and some k constants are slightly tweaked in usage. The process of breaking it is so similar that it’s entirely analogous, so we can skip straight to the result. These 10 strings all hash to 1337 with FarmHash64: orlp-farmhash64-?VrJ@L7ytzwheaaa orlp-farmhash64-p3`!SQb}fmxheaaa orlp-farmhash64-pdt'cuI\gvxheaaa orlp-farmhash64-IbY`xAG&ibkieaaa orlp-farmhash64-[_LU!d1hwmkieaaa orlp-farmhash64-QiY!clz]bttieaaa orlp-farmhash64-&?J3rZ_8gsuieaaa orlp-farmhash64-LOBWtm5Szyuieaaa orlp-farmhash64-Mptaa^g^ytvieaaa orlp-farmhash64-B?&l::hxqmfjeaaa Trivial fixed-seed wyhash multicollisions Zig uses wyhash with a fixed seed of zero. While I was unable to do seed-independent attacks against wyhash, using it with a fixed seed makes generating collisions trivial. Wyhash is built upon the folded multiply, which takes two 64-bit inputs, multiplies them to a 128-bit product before XORing together the two halves: uint64_t folded_multiply(uint64_t a, uint64_t b) { __uint128_t full = __uint128_t(a) * __uint128_t(b); return uint64_t(full) ^ uint64_t(full >> 64); } It’s easy to immediately see a critical flaw with this: if one of the two sides is zero, the output will also always be zero. To protect against this, wyhash always uses a folded multiply in the following form: out = folded_multiply(input_a ^ secret_a, input_b ^ secret_b); where secret_a and secret_b are determined by the seed, or outputs of previous iterations which are influenced by the seed. However, when your seed is constant… With a bit of creativity we can use the start of our string to prepare a ‘secret’ value which we can perfectly cancel with another ASCII string later in the input. So, without further ado, every 32-byte string of the form orlp-wyhash-oGf_________tWJbzMJR hashes to the same value with Zig’s default hasher. Zig uses a different set of parameters than the defaults found in the wyhash repository, so for good measure, this pattern provides arbitrary multicollisions for the default parameters found in wyhash when using seed == 0: orlp-wyhash-EUv_________NLXyytkp Conclusion We’ve seen that a lot of the hash functions in common use in hash tables today are very weak, allowing fairly trivial attacks to produce arbitrary amounts of collisions if not randomly initialized. Using a randomly seeded hash table is paramount if you don’t wish to become a victim of a hash flooding attack. We’ve also seen that some hash functions are vulnerable to attack even if randomly seeded. These are completely broken and should not be used if attacks are a concern at all. Luckily I was unable to find such attacks against most hashes, but the possibility of such an attack existing is quite unnerving. With universal hashing it’s possible to construct hash functions for which such an attack is provably impossible, last year I published a hash function called polymur-hash that has this property. Your HTTPS connection to this website also likely uses a universal hash function for authenticity of the transferred data, both Poly1305 and GCM are based on universal hashing for their security proofs. Well, such attacks are provably impossible against non-interactive attackers, everything goes out of the window again when an attacker is allowed to inspect the output hashes and use that to try and guess your secret key. Of course, if your data is not user-controlled, or there is no reasonable security model where your application would face attacks, you can get away with faster and insecure hashes. More to come on the subject of hashing and hash tables and how it can go right or wrong, but for now this article is long enough as-is…
Suppose you have an array of floating-point numbers, and wish to sum them. You might naively think you can simply add them, e.g. in Rust: fn naive_sum(arr: &[f32]) -> f32 { let mut out = 0.0; for x in arr { out += *x; } out } This however can easily result in an arbitrarily large accumulated error. Let’s try it out: naive_sum(&vec![1.0; 1_000_000]) = 1000000.0 naive_sum(&vec![1.0; 10_000_000]) = 10000000.0 naive_sum(&vec![1.0; 100_000_000]) = 16777216.0 naive_sum(&vec![1.0; 1_000_000_000]) = 16777216.0 Uh-oh… What happened? When you compute $a + b$ the result must be rounded to the nearest representable floating-point number, breaking ties towards the number with an even mantissa. The problem is that the next 32-bit floating-point number after 16777216 is 16777218. In this case that means 16777216 + 1 rounds back to 16777216 again. We’re stuck. Luckily, there are better ways to sum an array. Pairwise summation A method that’s a bit more clever is to use pairwise summation. Instead of a completely linear sum with a single accumulator it recursively sums an array by splitting the array in half, summing the halves, and then adding the sums. fn pairwise_sum(arr: &[f32]) -> f32 { if arr.len() == 0 { return 0.0; } if arr.len() == 1 { return arr[0]; } let (first, second) = arr.split_at(arr.len() / 2); pairwise_sum(first) + pairwise_sum(second) } This is more accurate: pairwise_sum(&vec![1.0; 1_000_000]) = 1000000.0 pairwise_sum(&vec![1.0; 10_000_000]) = 10000000.0 pairwise_sum(&vec![1.0; 100_000_000]) = 100000000.0 pairwise_sum(&vec![1.0; 1_000_000_000]) = 1000000000.0 However, this is rather slow. To get a summation routine that goes as fast as possible while still being reasonably accurate we should not recurse down all the way to length-1 arrays, as this gives too much call overhead. We can still use our naive sum for small sizes, and only recurse on large sizes. This does make our worst-case error worse by a constant factor, but in turn makes the pairwise sum almost as fast as a naive sum. By choosing the splitpoint as a multiple of 256 we ensure that the base case in the recursion always has exactly 256 elements except on the very last block. This makes sure we use the most optimal reduction and always correctly predict the loop condition. This small detail ended up improving the throughput by 40% for large arrays! fn block_pairwise_sum(arr: &[f32]) -> f32 { if arr.len() > 256 { let split = (arr.len() / 2).next_multiple_of(256); let (first, second) = arr.split_at(split); block_pairwise_sum(first) + block_pairwise_sum(second) } else { naive_sum(arr) } } Kahan summation The worst-case round-off error of naive summation scales with $O(n \epsilon)$ when summing $n$ elements, where $\epsilon$ is the machine epsilon of your floating-point type (here $2^{-24}$). Pairwise summation improves this to $O((\log n) \epsilon + n\epsilon^2)$. However, Kahan summation improves this further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely, leaving only the $\epsilon^2$ term which is negligible unless you sum a very large amount of numbers. All of these bounds scale with $\sum_i |x_i|$, so the worst-case absolute error bound is still quadratic in terms of $n$ even for Kahan summation. In practice all summation algorithms do significantly better than their worst-case bounds, as in most scenarios the errors do not exclusively round up or down, but cancel each other out on average. pub fn kahan_sum(arr: &[f32]) -> f32 { let mut sum = 0.0; let mut c = 0.0; for x in arr { let y = *x - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum } The Kahan summation works by maintaining the sum in two registers, the actual bulk sum and a small error correcting term $c$. If you were using infinitely precise arithmetic $c$ would always be zero, but with floating-point it might not be. The downside is that each number now takes four operations to add to the sum instead of just one. To mitigate this we can do something similar to what we did with the pairwise summation. We can first accumulate blocks into sums naively before combining the block sums with Kaham summation to reduce overhead at the cost of accuracy: pub fn block_kahan_sum(arr: &[f32]) -> f32 { let mut sum = 0.0; let mut c = 0.0; for chunk in arr.chunks(256) { let x = naive_sum(chunk); let y = x - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum } Exact summation I know of at least two general methods to produce the correctly-rounded sum of a sequence of floating-point numbers. That is, it logically computes the sum with infinite precision before rounding it back to a floating-point value at the end. The first method is based on the 2Sum primitive which is an error-free transform from two numbers $x, y$ to $s, t$ such that $x + y = s + t$, where $t$ is a small error. By applying this repeatedly until the errors vanish you can get a correctly-rounded sum. Keeping track of what to add in what order can be tricky, and the worst-case requires $O(n^2)$ additions to make all the terms vanish. This is what’s implemented in Python’s math.fsum and in the Rust crate fsum which use extra memory to keep the partial sums around. The accurate crate also implements this using in-place mutation in i_fast_sum_in_place. Another method is to keep a large buffer of integers around, one per exponent. Then when adding a floating-point number you decompose it into a an exponent and mantissa, and add the mantissa to the corresponding integer in the buffer. If the integer buf[i] overflows you increment the integer in buf[i + w], where w is the width of your integer. This can actually compute a completely exact sum, without any rounding at all, and is effectively just an overly permissive representation of a fixed-point number optimized for accumulating floats. This latter method is $O(n)$ time, but uses a large but constant amount of memory ($\approx$ 1 KB for f32, $\approx$ 16 KB for f64). An advantage of this method is that it’s also an online algorithm - both adding a number to the sum and getting the current total are amortized $O(1)$. A variant of this method is implemented in the accurate crate as OnlineExactSum crate which uses floats instead of integers for the buffer. Unleashing the compiler Besides accuracy, there is another problem with naive_sum. The Rust compiler is not allowed to reorder floating-point additions, because floating-point addition is not associative. So it cannot autovectorize the naive_sum to use SIMD instructions to compute the sum, nor use instruction-level parallelism. To solve this there are compiler intrinsics in Rust that do float sums while allowing associativity, such as std::intrinsics::fadd_fast. However, these instructions are incredibly dangerous, as they assume that both the input and output are finite numbers (no infinities, no NaNs), or otherwise they are undefined behavior. This functionally makes them unusable, as only in the most restricted scenarios when computing a sum do you know that all inputs are finite numbers, and that their sum cannot overflow. I recently uttered my annoyance with these operators to Ben Kimock, and together we proposed (and he implemented) a new set of operators: std::intrinsics::fadd_algebraic and friends. I proposed we call the operators algebraic, as they allow (in theory) any transformation that is justified by real algebra. For example, substituting ${x - x \to 0}$, ${cx + cy \to c(x + y)}$, or ${x^6 \to (x^2)^3.}$ In general these operators are treated as-if they are done using real numbers, and can map to any set of floating-point instructions that would be equivalent to the original expression, assuming the floating-point instructions would be exact. Note that the real numbers do not contain NaNs or infinities, so these operators assume those do not exist for the validity of transformations, however it is not undefined behavior when you do encounter those values. They also allow fused multiply-add instructions to be generated, as under real arithmetic $\operatorname{fma}(a, b, c) = ab + c.$ Using those new instructions it is trivial to generate an autovectorized sum: #![allow(internal_features)] #![feature(core_intrinsics)] use std::intrinsics::fadd_algebraic; fn naive_sum_autovec(arr: &[f32]) -> f32 { let mut out = 0.0; for x in arr { out = fadd_algebraic(out, *x); } out } If we compile with -C target-cpu=broadwell we see that the compiler automatically generated the following tight loop for us, using 4 accumulators and AVX2 instructions: .LBB0_5: vaddps ymm0, ymm0, ymmword ptr [rdi + 4*r8] vaddps ymm1, ymm1, ymmword ptr [rdi + 4*r8 + 32] vaddps ymm2, ymm2, ymmword ptr [rdi + 4*r8 + 64] vaddps ymm3, ymm3, ymmword ptr [rdi + 4*r8 + 96] add r8, 32 cmp rdx, r8 jne .LBB0_5 This will process 128 bytes of floating-point data (so 32 elements) in 7 instructions. Additionally, all the vaddps instructions are independent of each other as they accumulate to different registers. If we analyze this with uiCA we see that it estimates the above loop to take 4 cycles to complete, processing 32 bytes / cycle. At 4GHz that’s up to 128GB/s! Note that that’s way above what my machine’s RAM bandwidth is, so you will only achieve that speed when summing data that is already in cache. With this in mind we can also easily define block_pairwise_sum_autovec and block_kahan_sum_autovec by replacing their calls to naive_sum with naive_sum_autovec. Accuracy and speed Let’s take a look at how the different summation methods compare. As a relatively arbitrary benchmark, let’s sum 100,000 random floats ranging from -100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on my AMD Threadripper 2950x. All the code is available on Github. Compiled with RUSTFLAGS=-C target-cpu=native and --release I get the following results: AlgorithmThroughputMean absolute error naive5.5 GB/s71.796 pairwise0.9 GB/s1.5528 kahan1.4 GB/s0.2229 block_pairwise5.8 GB/s3.8597 block_kahan5.9 GB/s4.2184 naive_autovec118.6 GB/s14.538 block_pairwise_autovec71.7 GB/s1.6132 block_kahan_autovec98.0 GB/s1.2306 crate_accurate_buffer1.1 GB/s0.0015 crate_accurate_inplace1.9 GB/s0.0015 crate_fsum1.2 GB/s0.0000 The reason the accurate crate has a non-zero absolute error is because it currently does not implement rounding to nearest correctly, so it can be off by one unit in the last place for the final result. First I’d like to note that there’s more than a 100x performance difference between the fastest and slowest method. For summing an array! Now this might not be entirely fair as the slowest methods are computing something significantly harder, but there’s still a 20x performance difference between a seemingly reasonable naive implementation and the fastest one. We find that in general the _autovec methods that use fadd_algebraic are faster and more accurate than the ones using regular floating-point addition. The reason they’re more accurate as well is the same reason a pairwise sum is more accurate: any reordering of the additions is better as the default long-chain-of-additions is already the worst case for accuracy in a sum. Limiting ourselves to Pareto-optimal choices we get the following four implementations: AlgorithmThroughputMean absolute error naive_autovec118.6 GB/s14.538 block_kahan_autovec98.0 GB/s1.2306 crate_accurate_inplace1.9 GB/s0.0015 crate_fsum1.2 GB/s0.0000 Note that implementation differences can be quite impactful, and there are likely dozens more methods of compensated summing I did not compare here. For most cases I think block_kahan_autovec wins here, having good accuracy (that doesn’t degenerate with larger inputs) at nearly the maximum speed. For most applications the extra accuracy from the correctly-rounded sums is unnecessary, and they are 50-100x slower. By splitting the loop up into an explicit remainder plus a tight loop of 256-element sums we can squeeze out a bit more performance, and avoid a couple floating-point ops for the last chunk: #![allow(internal_features)] #![feature(core_intrinsics)] use std::intrinsics::fadd_algebraic; fn sum_block(arr: &[f32]) -> f32 { arr.iter().fold(0.0, |x, y| fadd_algebraic(x, *y)) } pub fn sum_orlp(arr: &[f32]) -> f32 { let mut chunks = arr.chunks_exact(256); let mut sum = 0.0; let mut c = 0.0; for chunk in &mut chunks { let y = sum_block(chunk) - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum + (sum_block(chunks.remainder()) - c) } AlgorithmThroughputMean absolute error sum_orlp112.2 GB/s1.2306 You can of course tweak the number 256, I found that using 128 was $\approx$ 20% slower, and that 512 didn’t really improve performance but did cost accuracy. Conclusion I think the fadd_algebraic and similar algebraic intrinsics are very useful for achieving high-speed floating-point routines, and that other languages should add them as well. A global -ffast-math is not good enough, as we’ve seen above the best implementation was a hybrid between automatically optimized math for speed, and manually implemented non-associative compensated operations. Finally, if you are using LLVM, beware of -ffast-math. It is undefined behavior to produce a NaN or infinity while that flag is set in LLVM. I have no idea why they chose this hardcore stance which makes virtually every program that uses it unsound. If you are targetting LLVM with your language, avoid the nnan and ninf fast-math flags.
Suppose you have a 64-bit word and wish to extract a couple bits from it. For example you just performed a SWAR algorithm and wish to extract the least significant bit of each byte in the u64. This is simple enough, you simply perform a binary AND with a mask of the bits you wish to keep: let out = word & 0x0101010101010101; However, this still leaves the bits of interest spread throughout the 64-bit word. What if we also want to compress the 8 bits we wish to extract into a single byte? Or what if we want the inverse, spreading the 8 bits of a byte among the least significant bits of each byte in a 64-bit word? PEXT and PDEP If you are using a modern x86-64 CPU, you are in luck. In the much underrated BMI instruction set there are two very powerful instructions: PDEP and PEXT. They are inverses of each other, PEXT extracts bits, PDEP deposits bits. PEXT takes in a word and a mask, takes just those bits from the word where the mask has a 1 bit, and compresses all selected bits to a contiguous output word. Simulated in Rust this would be: fn pext64(word: u64, mask: u64) -> u64 { let mut out = 0; let mut out_idx = 0; for i in 0..64 { let ith_mask_bit = (mask >> i) & 1; let ith_word_bit = (word >> i) & 1; if ith_mask_bit == 1 { out |= ith_word_bit << out_idx; out_idx += 1; } } out } For example if you had the bitstring abcdefgh and mask 10110001 you would get output bitstring 0000acdh. PDEP is exactly its inverse, it takes contiguous data bits as a word, and a mask, and deposits the data bits one-by-one (starting at the least significant bits) into those bits where the mask has a 1 bit, leaving the rest as zeros: fn pdep64(word: u64, mask: u64) -> u64 { let mut out = 0; let mut input_idx = 0; for i in 0..64 { let ith_mask_bit = (mask >> i) & 1; if ith_mask_bit == 1 { let next_word_bit = (word >> input_idx) & 1; out |= next_word_bit << i; input_idx += 1; } } out } So if you had the bitstring abcdefgh and mask 10100110 you would get output e0f00gh0 (recall that we traditionally write bitstrings with the least significant bit on the right). These instructions are incredibly powerful and flexible, and the amazing thing is that these instructions only take a single cycle on modern Intel and AMD CPUs! However, they are not available in other instruction sets, so whenever you use them you will also likely need to write a cross-platform alternative. Unfortunately, both PDEP and PEXT are very slow on AMD Zen and Zen2. They are implemented in microcode, which is really unfortunate. The platform advertises through CPUID that the instructions are supported, but they’re almost unusably slow. Use with caution. Extracting bits with multiplication While the following technique can’t replace all PEXT cases, it can be quite general. It is applicable when: The bit pattern you want to extract is static and known in advance. If you want to extract $k$ bits, there must at least be a $k-1$ gap between two bits of interest. We compute the bit extraction by adding together many left-shifted copies of our input word, such that we construct our desired bit pattern in the uppermost bits. The trick is to then realize that w << i is equivalent to w * (1 << i) and thus the sum of many left-shifted copies is equivalent to a single multiplication by (1 << i) + (1 << j) + ... I think the technique is best understood by visual example. Let’s use our example from earlier, extracting the least significant bit of each byte in a 64-bit word. We start off by masking off just those bits. After that we shift the most significant bit of interest to the topmost bit of the word to get our first shifted copy. We then repeat this, shifting the second most significant bit of interest to the second topmost bit, etc. We sum all these shifted copies. This results in the following (using underscores instead of zeros for clarity): mask = _______1_______1_______1_______1_______1_______1_______1_______1 t = w & mask t = _______a_______b_______c_______d_______e_______f_______g_______h t << 7 = a_______b_______c_______d_______e_______f_______g_______h_______ t << 14 = _b_______c_______d_______e_______f_______g_______h______________ t << 21 = __c_______d_______e_______f_______g_______h_____________________ t << 28 = ___d_______e_______f_______g_______h____________________________ t << 35 = ____e_______f_______g_______h___________________________________ t << 42 = _____f_______g_______h__________________________________________ t << 49 = ______g_______h_________________________________________________ t << 56 = _______h________________________________________________________ sum = abcdefghbcdefgh_cdefh___defgh___efgh____fgh_____gh______h_______ Note how we constructed abcdefgh in the topmost 8 bits, which we can then extract using a single right-shift by $64 - 8 = 56$ bits. Since (1 << 7) + (1 << 14) + ... + (1 << 56) = 0x102040810204080 we get the following implementation: fn extract_lsb_bit_per_byte(w: u64) -> u8 { let mask = 0x0101010101010101; let sum_of_shifts = 0x102040810204080; ((w & mask).wrapping_mul(sum_of_shifts) >> 56) as u8 } Not as good as PEXT, but three arithmetic instructions is not bad at all. Depositing bits with multiplication Unfortunately the following technique is significantly less general than the previous one. While you can take inspiration from it to implement similar algorithms, as-is it is limited to just spreading the bits of one byte to the least significant bit of each byte in a 64-bit word. The trick is similar to the one above. We add 8 shifted copies of our byte which once again translates to a multiplication. By choosing a shift that increases in multiples if 9 instead of 8 we ensure that the bit pattern shifts over by one position in each byte. We then mask out our bits of interest, and finish off with a shift and byteswap (which compiles to a single instruction bswap on Intel or rev on ARM) to put our output bits on the least significant bits and reverse the order. This technique visualized: b = ________________________________________________________abcdefgh b << 9 = _______________________________________________abcdefgh_________ b << 18 = ______________________________________abcdefgh__________________ b << 27 = _____________________________abcdefgh___________________________ b << 36 = ____________________abcdefgh____________________________________ b << 45 = ___________abcdefgh_____________________________________________ b << 54 = __abcdefgh______________________________________________________ b << 63 = h_______________________________________________________________ sum = h_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh mask = 1_______1_______1_______1_______1_______1_______1_______1_______ s & msk = h_______g_______f_______e_______d_______c_______b_______a_______ We once again note that the sum of shifts can be precomputed as 1 + (1 << 9) + ... + (1 << 63) = 0x8040201008040201, allowing the following implementation: fn deposit_lsb_bit_per_byte(b: u8) -> u64 { let sum_of_shifts = 0x8040201008040201; let mask = 0x8080808080808080; let spread = (b as u64).wrapping_mul(sum_of_shifts) & mask; u64::swap_bytes(spread >> 7) } This time it required 4 arithmetic instructions, not quite as good as PDEP, but again not bad compared to a naive implementation, and this is cross-platform.
This post is an anecdote from over a decade ago, of which I lost the actual code. So please forgive me if I do not accurately remember all the details. Some details are also simplified so that anyone that likes computer security can enjoy this article, not just those who have played World of Warcraft (although the Venn diagram of those two groups likely has a solid overlap). When I was around 14 years old I discovered World of Warcraft developed by Blizzard Games and was immediately hooked. Not long after I discovered add-ons which allow you to modify how your game’s user interface looks and works. However, not all add-ons I downloaded did exactly what I wanted to do. I wanted more. So I went to find out how they were made. In a weird twist of fate, I blame World of Warcraft for me seriously picking up programming. It turned out that they were made in the Lua programming language. Add-ons were nothing more than a couple .lua source files in a folder directly loaded into the game. The barrier of entry was incredibly low: just edit a file, press save and reload the interface. The fact that the game loaded your source code and you could see it running was magical! I enjoyed it immensely and in no time I was only writing add-ons and was barely playing the game itself anymore. I published quite a few add-ons in the next two years, which mostly involved copying other people’s code with some refactoring / recombining / tweaking to my wishes. Add-on security A thought you might have is that it’s a really bad idea to let users have fully programmable add-ons in your game, lest you get bots. However, the system Blizzard made to prevent arbitrary programmable actions was quite clever. Naturally, it did nothing to prevent actual botting, but at least regular rule-abiding players were fundamentally restricted to the automation Blizzard allowed. Most UI elements that you could create were strictly decorative or informational. These were completely unrestricted, as were most APIs that strictly gather information. For example you can make a health bar display using two frames, a background and a foreground, sizing the foreground frame using an API call to get the health of your character. Not all API calls were available to you however. Some were protected so they could only be called from official Blizzard code. These typically involved the API calls that would move your character, cast spells, use items, etc. Generally speaking anything that actually makes you perform an in-game action was protected. The API for getting your exact world location and camera orientation also became protected at some point. This was a reaction by Blizzard to new add-ons that were actively drawing 3D elements on top of the game world to make boss fights easier. However, some UI elements needed to actually interact with the game itself, e.g. if I want to make a button that casts a certain spell. For this you could construct a special kind of button that executes code in a secure environment when clicked. You were only allowed to create/destroy/move such buttons when not in combat, so you couldn’t simply conditionally place such buttons underneath your cursor to automate actions during combat. The catch was that this secure environment did allow you to programmatically set which spell to cast, but doesn’t let you gather the information you would need to do arbitrary automation. All access to state from outside the secure environment was blocked. There were some information gathering API calls available to match the more accessible in-game macro system, but nothing as fancy as getting skill cooldowns or unit health which would enable automatic optimal spellcasting. So there were two environments: an insecure one where you can get all information but can’t act on it, and a secure one where you can act but can’t get the information needed for automation. A backdoor channel Fast forward a couple years and I had mostly stopped playing. My interests had mainly moved on to more “serious” programming, and I was only occasionally playing, mostly messing around with add-on ideas. But this secure environment kept on nagging in my brain; I wanted to break it. Of course there was third-party software that completely disables the security restrictions from Blizzard, but what’s the fun in that? I wanted to do it “legitimately”, using the technically allowed tools, as a challenge. Obviously using clever code to bypass security restrictions is no better than using third-party software, and both would likely get you banned. I never actually wanted to use the code, just to see if I could make it work. So I scanned the secure environment allowed function list to see if I could smuggle any information from the outside into the secure environment. It all seemed pretty hopeless until I saw one tiny, innocent little function: random. An evil idea came in my head: random number generators (RNGs) used in computers are almost always pseudorandom number generators with (hidden) internal state. If I can manipulate this state, perhaps I can use that to pass information into the secure environment. Random number generator woes It turned out that random was just a small shim around C’s rand. I was excited! This meant that there was a single global random state that was shared in the process. It also helps that rand implementations tended to be on the weak side. Since World of Warcraft was compiled with MSVC, the actual implementation of rand was as follows: uint32_t state; int rand() { state = state * 214013 + 2531011; return (state >> 16) & 0x7fff; } This RNG is, for the lack of a better word, shite. It is a naked linear congruential generator, and a weak one at that. Which in my case, was a good thing. I can understand MSVC keeps rand the same for backwards compatibility, and at least all documentation I could find for rand recommends you not to use rand for cryptographic purposes. But was there ever a time where such a bad PRNG implementation was fit for any purpose? So let’s get to breaking this thing. Since the state is so laughably small and you can see 15 bits of the state directly you can keep a full list of all possible states consistent with a single output of the RNG and use further calls to the RNG to eliminate possibilities until a single one remains. But we can be significantly more clever. First we note that the top bit of state never affects anything in this RNG. (state >> 16) & 0x7fff masks out 15 bits, after shifting away the bottom 16 bits, and thus effectively works mod $2^{31}$. Since on any update the new state is a linear function of the previous state, we can propagate this modular form all the way down to the initial state as $$f(x) \equiv f(x \bmod m) \mod m$$ for any linear $f$. Let $a = 214013$ and $b = 2531011$. We observe the 15-bit output $r_0, r_1$ of two RNG calls. We’ll call the 16-bit portion of the RNG state that is hidden by the shift $h_0, h_1$ respectively, for the states after the first and second call. This means the state of the RNG after the first call is $2^{16} r_0 + h_0$ and similarly for $2^{16} r_1 + h_1$ after the second call. Then we have the following identity: $$a\cdot (2^{16}r_0 + h_0) + b \equiv 2^{16}r_1 + h_1 \mod 2^{31},$$ $$ah_0 \equiv h_1 + 2^{16}(r_1 - ar_0) - b \mod 2^{31}.$$ Now let $c \geq 0$ be the known constant $(2^{16}(r_1 - ar_0) - b) \bmod 2^{31}$, then for some integer $k$ we have $$ah_0 = h_1 + c + 2^{31} k.$$ Note that the left hand side ranges from $0$ to $a (2^{16} - 1) \approx 2^{33.71}$. Thus we must have $-1 \leq k \leq 2^{2.71} < 7$. Reordering we get the following expression for $h_0$: $$h_0 = \frac{c + 2^{31} k}{a} + h_1/a.$$ Since $a > 2^{16}$ while $0 \leq h_1 < 2^{16}$ we note that the term $0 \leq h_1/a < 1$. Thus, assuming a solution exists, we must have $$h_0 = \left\lceil\frac{c + 2^{31} k}{a}\right\rceil.$$ So for $-1 \leq k < 7$ we compute the above guess for the hidden portion of the RNG state after the first call. This gives us 8 guesses, after which we can reject bad guesses using follow-up calls to the RNG until a single unique answer remains. While I was able to re-derive the above with little difficulty now, 18 year old me wasn’t as experienced in discrete math. So I asked on crypto.SE, with the excuse that I wanted to ‘show my colleagues how weak this RNG is’. It worked, which sparks all kinds of interesting ethics questions. An example implementation of this process in Python: import random A = 214013 B = 2531011 class MsvcRng: def __init__(self, state): self.state = state def __call__(self): self.state = (self.state * A + B) % 2**32 return (self.state >> 16) & 0x7fff # Create a random RNG state we'll reverse engineer. hidden_rng = MsvcRng(random.randint(0, 2**32)) # Compute guesses for hidden state from 2 observations. r0 = hidden_rng() r1 = hidden_rng() c = (2**16 * (r1 - A * r0) - B) % 2**31 ceil_div = lambda a, b: (a + b - 1) // b h_guesses = [ceil_div(c + 2**31 * k, A) for k in range(-1, 7)] # Validate guesses until a single guess remains. guess_rngs = [MsvcRng(2**16 * r0 + h0) for h0 in h_guesses] guess_rngs = [g for g in guess_rngs if g() == r1] while len(guess_rngs) > 1: r = hidden_rng() guess_rngs = [g for g in guess_rngs if g() == r] # The top bit can not be recovered as it never affects the output, # but we should have recovered the effective hidden state. assert guess_rngs[0].state % 2**31 == hidden_rng.state % 2**31 While I did write the above process with a while loop, it appears to only ever need a third output at most to narrow it down to a single guess. Putting it together Once we could reverse-engineer the internal state of the random number generator we could make arbitrary automated decisions in the supposedly secure environment. How it worked was as follows: An insecure hook was registered that would execute right before the secure environment code would run. In this hook we have full access to information, and make a decision as to which action should be taken (e.g. casting a particular spell). This action is looked up in a hardcoded list to get an index. The current state of the RNG is reverse-engineered using the above process. We predict the outcome of the next RNG call. If this (modulo the length of our action list) does not give our desired outcome, we advance the RNG and try again. This repeats until the next random number would correspond to our desired action. The hook returns, and the secure environment starts. It generates a “random” number, indexes our hardcoded list of actions, and performs the “random” action. That’s all! By being able to simulate the RNG and looking one step ahead we could use it as our information channel by choosing exactly the right moment to call random in the secure environment. Now if you wanted to support a list of $n$ actions it would on average take $n$ steps of the RNG before the correct number came up to pass along, but that wasn’t a problem in practice. Conclusion I don’t know when Blizzard fixed the issue where the RNG state is so weak and shared, or whether they were aware of it being an issue at all. A few years after I had written the code I tried it again out of curiosity, and it had stopped working. Maybe they switched to a different algorithm, or had a properly separated RNG state for the secure environment. All-in-all it was a lot of effort for a niche exploit in a video game that I didn’t even want to use. But there certainly was a magic to manipulating something supposedly random into doing exactly what you want, like a magician pulling four aces from a shuffled deck.
More in programming
For some purpose, the DOGE people are burrowing their way into all US Federal Systems. Their complete control over the Treasury Department is entirely insane. Unless you intend to destroy everything, making arbitrary changes to complex computer systems will result in destruction, even if that was not your intention. No
Entering 2025, I decided to spend some time exploring the topic of agents. I started reading Anthropic’s Building effective agents, followed by Chip Huyen’s AI Engineering. I kicked off a major workstream at work on using agents, and I also decided to do a personal experiment of sorts. This is a general commentary on building that project. What I wanted to build was a simple chat interface where I could write prompts, select models, and have the model use tools as appropriate. My side goal was to build this using Cursor and generally avoid writing code directly as much as possible, but I found that generally slower than writing code in emacs while relying on 4o-mini to provide working examples to pull from. Similarly, while I initially envisioned building this in fullstack TypeScript via Cursor, I ultimately bailed into a stack that I’m more comfortable, and ended up using Python3, FastAPI, PostgreSQL, and SQLAlchemy with the async psycopg3 driver. It’s been a… while… since I started a brand new Python project, and used this project as an opportunity to get comfortable with Python3’s async/await mechanisms along with Python3’s typing along with mypy. Finally, I also wanted to experiment with Tailwind, and ended up using TailwindUI’s components to build the site. The working version supports everything I wanted: creating chats with models, and allowing those models to use function calling to use tools that I provide. The models are allowed to call any number of tools in pursuit of the problem they are solving. The tool usage is the most interesting part here for sure. The simplest tool I created was a get_temperature tool that provided a fake temperature for your location. This allowed me to ask questions like “What should I wear tomorrow in San Francisco, CA?” and get a useful respond. The code to add this function to my project was pretty straightforward, just three lines of Python and 25 lines of metadata to pass to the OpenAI API. def tool_get_current_weather(location: str|None=None, format: str|None=None) -> str: "Simple proof of concept tool." temp = random.randint(40, 90) if format == 'fahrenheit' else random.randint(10, 25) return f"It's going to be {temp} degrees {format} tomorrow." FUNCTION_REGISTRY['get_current_weather'] = tool_get_current_weather TOOL_USAGE_REGISTRY['get_current_weather'] = { "type": "function", "function": { "name": "get_current_weather", "description": "Get the current weather", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "The city and state, e.g. San Francisco, CA", }, "format": { "type": "string", "enum": ["celsius", "fahrenheit"], "description": "The temperature unit to use. Infer this from the users location.", }, }, "required": ["location", "format"], }, } } After getting this tool, the next tool I added was a simple URL retriever tool, which allowed the agent to grab a URL and use the content of that URL in its prompt. The implementation for this tool was similarly quite simple. def tool_get_url(url: str|None=None) -> str: if url is None: return '' url = str(url) response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') content = soup.find('main') or soup.find('article') or soup.body if not content: return str(response.content) markdown = markdownify(str(content), heading_style="ATX").strip() return str(markdown) FUNCTION_REGISTRY['get_url'] = tool_get_url TOOL_USAGE_REGISTRY['get_url'] = { "type": "function", "function": { "name": "get_url", "description": "Retrieve the contents of a website via its URL.", "parameters": { "type": "object", "properties": { "url": { "type": "string", "description": "The complete URL, including protocol to retrieve. For example: \"https://lethain.com\"", } }, "required": ["url"], }, } } What’s pretty amazing is how much power you can add to your agent by adding such a trivial tool as retrieving a URL. You can similarly imagine adding tools for retrieving and commenting on Github pull requests and so, which could allow a very simple agent tool like this to become quite useful. Working on this project gave me a moderately compelling view of a near-term future where most engineers have simple application like this running that they can pipe events into from various systems (email, text, Github pull requests, calendars, etc), create triggers that map events to templates that feed into prompts, and execute those prompts with tool-aware agents. Combine that with ability for other agents to register themselves with you and expose the tools that they have access to (e.g. schedule an event with tool’s owner), and a bunch of interesting things become very accessible with a very modest amount of effort: You could schedule events between two busy people’s calendars, as if both of them had an assistant managing their calendar Reply to your own pull requests with new blog posts, providing feedback on typos and grammatical issues Crawl websites you care about and identify posts you might be interested in Ask the model to generate a system model using lethain:systems, run that model, then chart the responses Add a “planning tool” which allows the model to generate a plan to guide subsequent steps in a complex task. (e.g. getting my calendar, getting a friend’s calendar, suggesting a time we could meet) None of these are exactly lifesaving, but each is somewhat useful, and I imagine there are many more fairly obvious ideas that become easy once you have the necessary scaffolding to make this sort of thing easy. Altogether, I think that I am convinced at this points that agents, using current foundational models, are going to create a number of very interesting experiences that improve our day to day lives in small ways that are, in aggregate, pretty transformational. I’m less convinced that this is the way all software should work going forward though, but more thoughts on that over time. (A bunch of fun experiments happening at work, but early days on those.)
One of my recent home organisation projects has been sorting out my LEGO collection. I have a bunch of sets which are mixed together in one messy box, and I’m trying to separate bricks back into distinct sets. My collection is nowhere near large enough to be worth sorting by individual parts, and I hope that breaking down by set will make it all easier to manage and store. I’ve been creating spreadsheets to track the parts in each set, and count them out as I find them. I briefly hinted at this in my post about looking at images in spreadsheets, where I included a screenshot of one of my inventory spreadsheets: These spreadsheets have been invaluable – I can see exactly what pieces I need, and what pieces I’m missing. Without them, I wouldn’t even attempt this. I’m about to pause this cleanup and work on some other things, but first I wanted to write some notes on how I’m creating these spreadsheets – I’ll probably want them again in the future. Getting a list of parts in a set There are various ways to get a list of parts in a LEGO set: Newer LEGO sets include a list of parts at the back of the printed instructions You can get a list from LEGO-owned website like LEGO.com or BrickLink There are community-maintained databases on sites like Rebrickable I decided to use the community maintained lists from Rebrickable – they seem very accurate in my experience, and you can download daily snapshots of their entire catalog database. The latter is very powerful, because now I can load the database into my tools of choice, and slice and dice the data in fun and interesting ways. Downloading their entire database is less than 15MB – which is to say, two-thirds the size of just opening the LEGO.com homepage. Bargain! Putting Rebrickable data in a SQLite database My tool of choice is SQLite. I slept on this for years, but I’ve come to realise just how powerful and useful it can be. A big part of what made me realise the power of SQLite is seeing Simon Willison’s work with datasette, and some of the cool things he’s built on top of SQLite. Simon also publishes a command-line tool sqlite-utils for manipulating SQLite databases, and that’s what I’ve been using to create my spreadsheets. Here’s my process: Create a Python virtual environment, and install sqlite-utils: python3 -m venv .venv source .venv/bin/activate pip install sqlite-utils At time of writing, the latest version of sqlite-utils is 3.38. Download the Rebrickable database tables I care about, uncompress them, and load them into a SQLite database: curl -O 'https://cdn.rebrickable.com/media/downloads/colors.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/parts.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/inventories.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/inventory_parts.csv.gz' gunzip colors.csv.gz gunzip parts.csv.gz gunzip inventories.csv.gz gunzip inventory_parts.csv.gz sqlite-utils insert lego_parts.db colors colors.csv --csv sqlite-utils insert lego_parts.db parts parts.csv --csv sqlite-utils insert lego_parts.db inventories inventories.csv --csv sqlite-utils insert lego_parts.db inventory_parts inventory_parts.csv --csv The inventory_parts table describes how many of each part there are in a set. “Set S contains 10 of part P in colour C.” The parts and colors table contains detailed information about each part and color. The inventories table matches the official LEGO set numbers to the inventory IDs in Rebrickable’s database. “The set sold by LEGO as 6616-1 has ID 4159 in the inventory table.” Run a SQLite query that gets information from the different tables to tell me about all the parts in a particular set: SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num FROM inventory_parts ip JOIN inventories i ON ip.inventory_id = i.id JOIN parts p ON ip.part_num = p.part_num JOIN colors c ON ip.color_id = c.id WHERE i.set_num = '6616-1'; Or use sqlite-utils to export the query results as a spreadsheet: sqlite-utils lego_parts.db " SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num FROM inventory_parts ip JOIN inventories i ON ip.inventory_id = i.id JOIN parts p ON ip.part_num = p.part_num JOIN colors c ON ip.color_id = c.id WHERE i.set_num = '6616-1';" --csv > 6616-1.csv Here are the first few lines of that CSV: img_url,quantity,is_spare,color,name,part_num https://cdn.rebrickable.com/media/parts/photos/9999/23064-9999-e6da02af-9e23-44cd-a475-16f30db9c527.jpg,1,False,[No Color/Any Color],Sticker Sheet for Set 6616-1,23064 https://cdn.rebrickable.com/media/parts/elements/4523412.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Chequered Print,2335pr0019 https://cdn.rebrickable.com/media/parts/photos/15/2335px13-15-33ae3ea3-9921-45fc-b7f0-0cd40203f749.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Octan Logo Print,2335pr0024 https://cdn.rebrickable.com/media/parts/elements/4141999.jpg,4,False,Green,Tile Special 1 x 2 Grille with Bottom Groove,2412b https://cdn.rebrickable.com/media/parts/elements/4125254.jpg,4,False,Orange,Tile Special 1 x 2 Grille with Bottom Groove,2412b Import that spreadsheet into Google Sheets, then add a couple of columns. I add a column image where every cell has the formula =IMAGE(…) that references the image URL. This gives me an inline image, so I know what that brick looks like. I add a new column quantity I have where every cell starts at 0, which is where I’ll count bricks as I find them. I add a new column remaining to find which counts the difference between quantity and quantity I have. Then I can highlight or filter for rows where this is non-zero, so I can see the bricks I still need to find. If you’re interested, here’s an example spreadsheet that has a clean inventory. It took me a while to refine the SQL query, but now I have it, I can create a new spreadsheet in less than a minute. One of the things I’ve realised over the last year or so is how powerful “get the data into SQLite” can be – it opens the door to all sorts of interesting queries and questions, with a relatively small amount of code required. I’m sure I could write a custom script just for this task, but it wouldn’t be as concise or flexible. [If the formatting of this post looks odd in your feed reader, visit the original article]
A lieutenant colonel in the Soviet Air Defense Forces prevented the end of human civilization on September 26th, 1983. His name was Stanislav Petrov. Protocol dictated that the Soviet Union would retaliate against any nuclear strikes sent by the United States. This was a policy of mutually assured destruction, a doctrine that compels a horrifying logical conclusion. The second and third stage effects of this type of exchange would be even more catastrophic. Allies for each side would likely be pulled into the conflict. The resulting nuclear winter was projected to lead to 2 billion deaths due to starvation. This is to say nothing about those who would have been unfortunate enough to have survived. Petrov’s job was to monitor Oko, the computerized warning systems built to centralize Soviet satellite communications. Around midnight, he received a report that one of the satellites had detected the infrared signature of a single launch of a United States ICBM. While Petrov was deciding what to do about this report, the system detected four more incoming missile launches. He had minutes to make a choice about what to do. It is impossible to imagine the amount of pressure placed on him at this moment. Source: Stanislav Petrov, Soviet officer credited with averting nuclear war, dies at 77 by Schwartzreport. Petrov lived in a world of deterministic systems. The technologies that powered these warning systems have outputs that are guaranteed, provided the proper inputs are provided. However, deterministic does not mean infallible. The only reason you are alive and reading this is because Petrov understood that the systems he observed were capable of error. He was suspicious of what he was seeing reported, and chose not to escalate a retaliatory strike. There were two factors guiding his decision: A surprise attack would most likely have used hundreds of missiles, and not just five. The allegedly foolproof Oko system was new and prone to errors. An error in a deterministic system can still lead to expected outputs being generated. For the Oko system, infrared reflections of the sun shining off of the tops of clouds created a false positive that was interpreted as detection of a nuclear launch event. Source: US-K History by Kosmonavtika. The concept of erroneous truth is a deep thing to internalize, as computerized systems are presented as omniscient, indefective, and absolute. Petrov’s rewards for this action were reprimands, reassignment, and denial of promotion. This was likely for embarrassing his superiors by the politically inconvenient shedding of light on issues with the Oko system. A coerced early retirement caused a nervous breakdown, likely him having to grapple with the weight of his decision. It was only in the 1990s—after the fall of the Soviet Union—that his actions were discovered internationally and celebrated. Stanislav Petrov was given the recognition that he deserved, including being honored by the United Nations, awarded the Dresden Peace Prize, featured in a documentary, and being able to visit a Minuteman Missile silo in the United States. On January 31st, 2025, OpenAI struck a deal with the United States government to use its AI product for nuclear weapon security. It is unclear how this technology will be used, where, and to what extent. It is also unclear how OpenAI’s systems function, as they are black box technologies. What is known is that LLM-generated responses—the product OpenAI sells—are non-deterministic. Non-deterministic systems don’t have guaranteed outputs from their inputs. In addition, LLM-based technology hallucinates—it invents content with no self-knowledge that it is a falsehood. Non-deterministic systems that are computerized also have the perception as being authoritative, the same as their deterministic peers. It is not a question of how the output is generated, it is one of the output being perceived to come from a machine. These are terrifying things to know. Consider not only the systems this technology is being applied to, but also the thoughtless speed of their integration. Then consider how we’ve historically been conditioned and rewarded to interpret the output of these systems, and then how we perceive and treat skeptics. We don’t live in a purely deterministic world of technology anymore. Stanislav Petrov died on September 18th, 2017, before this change occurred. I would be incredibly curious to know his thoughts about our current reality, as well as the increasing abdication of human monitoring of automated systems in favor of notably biased, supposed “AI solutions.” In acknowledging Petrov’s skepticism in a time of mania and political instability, we acknowledge a quote from former U.S. Secretary of Defense William J. Perry’s memoir about the incident: [Oko’s false positives] illustrates the immense danger of placing our fate in the hands of automated systems that are susceptible to failure and human beings who are fallible.
In our *Ambsheets* project, we are exploring a small extension to the familiar spreadsheet: **what if a single spreadsheet cell could hold multiple values at once**?