More from orlp.net - Blog Archive
Hash functions are incredibly neat mathematical objects. They can map arbitrary data to a small fixed-size output domain such that the mapping is deterministic, yet appears to be random. This “deterministic randomness” is incredibly useful for a variety of purposes, such as hash tables, checksums, monte carlo algorithms, communication-less distributed algorithms, etc, the list goes on. In this article we will take a look at the dark side of hash functions: when things go wrong. Luckily this essentially never happens due to unlucky inputs in the wild (for good hash functions, at least). However, people exist, and some of them may be malicious. Thus we must look towards computer security for answers. I will quickly explain some of the basics of hash function security and then show how easy it is to break this security for some commonly used non-cryptographic hash functions. As a teaser, this article explains how you can generate strings such as these, thousands per second: cityhash64("orlp-cityhash64-D-:K5yx*zkgaaaaa") == 1337 murmurhash2("orlp-murmurhash64-bkiaaa&JInaNcZ") == 1337 murmurhash3("orlp-murmurhash3_x86_32-haaaPa*+") == 1337 farmhash64("orlp-farmhash64-/v^CqdPvziuheaaa") == 1337 I also show how you can create some really funky pairs of strings that can be concatenated arbitrarily such that when concatenating $k$ strings together any of the $2^k$ combinations all have the same hash output, regardless of the seed used for the hash function: a = "xx0rlpx!xxsXъВ" b = "xxsXъВxx0rlpx!" murmurhash2(a + a, seed) == murmurhash2(a + b, seed) murmurhash2(a + a, seed) == murmurhash2(b + a, seed) murmurhash2(a + a, seed) == murmurhash2(b + b, seed) a = "!&orlpՓ" b = "yǏglp$X" murmurhash3(a + a, seed) == murmurhash3(a + b, seed) murmurhash3(a + a, seed) == murmurhash3(b + a, seed) murmurhash3(a + a, seed) == murmurhash3(b + b, seed) Hash function security basics Hash functions play a critical role in computer security. Hash functions are used not only to verify messages over secure channels, they are also used to identify trusted updates as well as known viruses. Virtually every signature scheme ever used starts with a hash function. If a hash function does not behave randomly, we can break the above security constructs. Cryptographic hash functions thus take the randomness aspect very seriously. The ideal hash function would choose an output completely at random for each input, remembering that choice for future calls. This is called a random oracle. The problem is that a random oracle requires a true random number generator, and more problematically, a globally accessible infinite memory bank. So we approximate it using deterministic hash functions instead. These compute their output by essentially shuffling their input really, really well, in such a way that it is not feasible to reverse. To help quantify whether a specific function does a good job of approximating a random oracle, cryptographers came up with a variety of properties that a random oracle would have. The three most important and well-known properties a secure cryptographic hash function should satisfy are: Pre-image resistance. For some constant $c$ it should be hard to find some input $m$ such that $h(m) = c$. Second pre-image resistance. For some input $m_1$ it should be hard to find another input $m_2$ such that $h(m_1) = h(m_2)$. Collision resistance. It should be hard to find inputs $m_1, m_2$ such that $h(m_1) = h(m_2)$. Note that pre-image resistance implies second pre-image resistance which in turn implies collision resistance. Conversely, a pre-image attack breaks all three properties. We generally consider one of these properties broken if there exists a method that produces a collision or pre-image faster than simply trying random inputs (also known as a brute force attack). However, there are definitely gradations in breakage, as some methods are only several orders of magnitude faster than brute force. That may sound like a lot, but a method taking $2^{110}$ steps instead of $2^{128}$ are still both equally out of reach for today’s computers. MD5 used to be a common hash function, and SHA-1 is still in common use today. While both were considered cryptographically secure at one point, generating MD5 collisions now takes less than a second on a modern PC. In 2017 a collaboration of researchers from CWI and Google and announced the first SHA-1 collision. However, as far as I’m aware, neither MD5 nor SHA-1 have practical (second) pre-image attacks, only theoretical ones. Non-cryptographic hash functions Cryptographically secure hash functions tend to have a small problem: they’re slow. Modern hash functions such as BLAKE3 resolve this somewhat by heavily vectorizing the hash using SIMD instructions, as well as parallelizing over multiple threads, but even then they require large input sizes before reaching those speeds. One particular use-case for hash functions is deriving a secret key from a password: a key derivation function. Unlike regular hash functions, being slow is actually a safety feature here to protect against brute forcing passwords. Modern ones such as Argon2 also intentionally use a lot of memory for protection against specialized hardware such as ASICs or FPGAs. A lot of problems don’t necessarily require secure hash functions, and people would much prefer a faster hash speed. Especially when we are computing many small hashes, such as in a hash table. Let’s take a look what common hash table implementations actually use as their hash for strings: C++: there are multiple standard library implementations, but 64-bit clang 13.0.0 on Apple M1 ships CityHash64. Currently libstdc++ ships MurmurHash64A, a variant of Murmur2 for 64-bit platforms. Java: OpenJDK uses an incredibly simple hash algorithm, which essentially just computes h = 31 * h + c for each character c. PHP: the Zend engine uses essentially the same algorithm as Java, just using unsigned integers and 33 as its multiplier. Nim: it used to use MurmurHash3_x86_32. While writing this article they appeared to have switched to use farmhash by default. Zig: it uses wyhash by default, with 0 as seed. Javascript: in V8 they use a custom weak string hash, with a randomly initialized seed. There were some that used stronger hashes by default as well: Go uses an AES-based hash if hardware acceleration is available on x86-64. Even though its construction is custom and likely not full-strength cryptographically secure, breaking it is too much effort and quite possibly beyond my capabilities. If not available, it uses an algorithm inspired by wyhash. Python and Rust use SipHash by default, which is a cryptographically secure pseudorandom function. This is effectively a hash function where you’re allowed to use a secret key during hashing, unlike a hash like SHA-2 where everyone knows all information involved. This latter concept is actually really important, at least for protecting against HashDoS in hash tables. Even if a hash function is perfectly secure over its complete output, hash tables further reduce the output to only a couple bits to find the data it is looking for. For a static hash function without any randomness it’s possible to produce large lists of hashes that collide post-reduction, just by brute force. But for non-cryptographic hashes as we’ll see here we often don’t need brute force and can generate collisions at high speed for the full output, if not randomized by a random seed. Interlude: inverse operations Before we get to breaking some of the above hash functions, I must explain a basic technique I will use a lot: the inverting of operations. We are first exposed to this in primary school, where we might get faced by a question such as “$2 + x = 10$”. There we learn subtraction is the inverse of addition, such that we may find $x$ by computing $10 - 2 = 8$. Most operations on the integer registers in computers are also invertible, despite the integers being reduced modulo $2^{w}$ in the case of overflow. Let us study some: Addition can be inverted using subtraction. That is, x += y can be inverted using x -= y. Seems obvious enough. Multiplication by a constant $c$ is not inverted by division. This would not work in the case of overflow. Instead, we calculate the modular multiplicative inverse of $c$. This is an integer $c^{-1}$ such that $c \cdot c^{-1} \equiv 1 \pmod {m}$. Then we invert multiplication by $c$ simply by multiplying by $c^{-1}$. This constant exists if and only if $c$ is coprime with our modulus $m$, which for us means that $c$ must be odd as $m = 2^n$. For example, multiplication by $2$ is not invertible, which is easy to see as such, as it is equivalent to a bit shift to the left by one position, losing the most significant bit forever. Without delving into the details, here is a snippet of Python code that computes the modular multiplicative inverse of an integer using the extended Euclidean algorithm by calculating $x, y$ such that $$cx + my = \gcd(c, m).$$ Then, because $c$ is coprime we find $\gcd(c, m) = 1$, which means that $$cx + 0 \equiv 1 \pmod m,$$ and thus $x = c^{-1}$. def egcd(a, b): if a == 0: return (b, 0, 1) g, y, x = egcd(b % a, a) return (g, x - (b // a) * y, y) def modinv(c, m): g, x, y = egcd(c, m) assert g == 1, "c, m must be coprime" return x % m Using this we can invert modular multiplication: >>> modinv(17, 2**32) 4042322161 >>> 42 * 17 * 4042322161 % 2**32 42 Magic! XOR can be inverted using… XOR. It is its own inverse. So x ^= y can be inverted using x ^= y. Bit shifts can not be inverted, but two common operations in hash functions that use bit shifts can be. The first is bit rotation by a constant. This is best explained visually, for example a bit rotation to the left by 3 places on a 8-bit word, where each bit is shown as a letter: abcdefghi defghiabc The formula for a right-rotation of k places is (x >> k) | (x << (w - k)), where w is the width of the integer type. Its inverse is a left-rotation, which simply swaps the direction of both shifts. Alternatively, the inverse of a right-rotation of k places is another right-rotation of w-k places. Another common operation in hash functions is the “xorshift”. It is an operation of one of the following forms, with $k > 0$: x ^= x << k // Left xorshift. x ^= x >> k // Right xorshift. How to invert it is entirely analogous between the two, so I will focus on the left xorshift. An important observation is that the least significant $k$ bits are left entirely untouched by the xorshift. Thus by repeating the operation, we recover the least significant $2k$ bits, as the XOR will invert itself for the next $k$ bits. Let’s take a look at the resulting value to see how we should proceed: v0 = (x << k) ^ x // Apply first step of inverse v1 = v0 ^ (v0 << k). v1 = (x << 2*k) ^ (x << k) ^ (x << k) ^ x // Simplify using self-inverse (x << k) ^ (x << k) = 0. v1 = (x << 2*k) ^ x From this we can conclude the following identity: $$\operatorname{xorshift}(\operatorname{xorshift}(x, k), k) = \operatorname{xorshift}(x, 2k)$$ Now we only need one more observation to complete our algorithm: a xorshift of $k \geq w$ where $w$ is the width of our integer is a no-op. Thus we repeatedly apply our doubling identity until we reach large enough $q$ such that $\operatorname{xorshift}(x, 2^q \cdot k) = x$. For example, to invert a left xorshift by 13 for 64-bit integers we apply the following sequence: x ^= x << 13 // Left xorshift by 13. x ^= x << 13 // Inverse step 1. x ^= x << 26 // Inverse step 2. x ^= x << 52 // Inverse step 3. // x ^= x << 104 // Next step would be a no-op. Armed with this knowledge, we can now attack. Breaking CityHash64 Let us take a look at (part of) the source code of CityHash64 from libcxx that’s used for hashing strings on 64-bit platforms: C++ standard library code goes through a process known as 'uglification', which prepends underscores to all identifiers. This is because those identifiers are reserved by the standard to only be used in the standard library, and thus won't be replaced by macros from standards-compliant code. For your sanity's sake I removed them here. static const uint64_t mul = 0x9ddfea08eb382d69ULL; static const uint64_t k0 = 0xc3a5c85c97cb3127ULL; static const uint64_t k1 = 0xb492b66fbe98f273ULL; static const uint64_t k2 = 0x9ae16a3b2f90404fULL; static const uint64_t k3 = 0xc949d7c7509e6557ULL; template<class T> T loadword(const void* p) { T r; std::memcpy(&r, p, sizeof(r)); return r; } uint64_t rotate(uint64_t val, int shift) { if (shift == 0) return val; return (val >> shift) | (val << (64 - shift)); } uint64_t hash_len_16(uint64_t u, uint64_t v) { uint64_t x = u ^ v; x *= mul; x ^= x >> 47; uint64_t y = v ^ x; y *= mul; y ^= y >> 47; y *= mul; return y; } uint64_t hash_len_17_to_32(const char *s, uint64_t len) { const uint64_t a = loadword<uint64_t>(s) * k1; const uint64_t b = loadword<uint64_t>(s + 8); const uint64_t c = loadword<uint64_t>(s + len - 8) * k2; const uint64_t d = loadword<uint64_t>(s + len - 16) * k0; return hash_len_16( rotate(a - b, 43) + rotate(c, 30) + d, a + rotate(b ^ k3, 20) - c + len ); } To break this, let’s assume we’ll always give length 32 inputs. Then the implementation will always call hash_len_17_to_32, and we have full control over variables a, b, c and d by changing our input. Note that d is only used once, in the final expression. This makes it a prime target for attacking the hash. We will choose a, b and c arbitrarily, and then solve for d to compute a desired hash outcome. Using the above modinv function we first compute the necessary modular multiplicative inverses of mul and k0: >>> 0x9ddfea08eb382d69 * 0xdc56e6f5090b32d9 % 2**64 1 >>> 0xc3a5c85c97cb3127 * 0x81bc9c5aa9c72e97 % 2**64 1 We also note that in this case the xorshift is easy to invert, as x ^= x >> 47 is simply its own inverse. Having all the components ready, we can invert the function step by step. We first load a, b and c like in the hash function, and compute uint64_t v = a + rotate(b ^ k3, 20) - c + len; which is the second parameter to hash_len_16. Then, starting from our desired return value of hash_len_16(u, v) we work backwards step by step, inverting each operation to find the function argument u that would result in our target hash. Then once we have found such the unique u we compute our required input d. Putting it all together: static const uint64_t mul_inv = 0xdc56e6f5090b32d9ULL; static const uint64_t k0_inv = 0x81bc9c5aa9c72e97ULL; void cityhash64_preimage32(uint64_t hash, char *s) { const uint64_t len = 32; const uint64_t a = loadword<uint64_t>(s) * k1; const uint64_t b = loadword<uint64_t>(s + 8); const uint64_t c = loadword<uint64_t>(s + len - 8) * k2; uint64_t v = a + rotate(b ^ k3, 20) - c + len; // Invert hash_len_16(u, v). Original operation inverted // at each step is shown on the right, note that it is in // the inverse order of hash_len_16. uint64_t y = hash; // return y; y *= mul_inv; // y *= mul; y ^= y >> 47; // y ^= y >> 47; y *= mul_inv; // y *= mul; uint64_t x = y ^ v; // uint64_t y = v ^ x; x ^= x >> 47; // x ^= x >> 47; x *= mul_inv; // x *= mul; uint64_t u = x ^ v; // uint64_t x = u ^ v; // Find loadword<uint64_t>(s + len - 16). uint64_t d = u - rotate(a - b, 43) - rotate(c, 30); d *= k0_inv; std::memcpy(s + len - 16, &d, sizeof(d)); } The chance that a random uint64_t forms 8 printable ASCII bytes is $\left(94/256\right)^8 \approx 0.033%$. Not great, but cityhash64_preimage32 is so fast that having to repeat it on average ~3000 times to get a purely ASCII result isn’t so bad. For example, the following 10 strings all hash to 1337 using CityHash64, generated using this code: I’ve noticed there’s variants of CityHash64 with subtle differences in the wild. I chose to attack the variant shipped with libc++, so it should work for std::hash there, for example. I also assume a little-endian machine throughout this article, your mileage may vary on a big-endian machine depending on the hash implementation. orlp-cityhash64-D-:K5yx*zkgaaaaa orlp-cityhash64-TXb7;1j&btkaaaaa orlp-cityhash64-+/LM$0 ;msnaaaaa orlp-cityhash64-u'f&>I'~mtnaaaaa orlp-cityhash64-pEEv.LyGcnpaaaaa orlp-cityhash64-v~~bm@,Vahtaaaaa orlp-cityhash64-RxHr_&~{miuaaaaa orlp-cityhash64-is_$34#>uavaaaaa orlp-cityhash64-$*~l\{S!zoyaaaaa orlp-cityhash64-W@^5|3^:gtcbaaaa Breaking MurmurHash2 We can’t let libstdc++ get away after targetting libc++, can we? The default string hash calls an implementation of MurmurHash2 with seed 0xc70f6907. The hash—simplified to only handle strings whose lengths are multiples of 8—is as follows: uint64_t murmurhash64a(const char* s, size_t len, uint64_t seed) { const uint64_t mul = 0xc6a4a7935bd1e995ULL; uint64_t hash = seed ^ (len * mul); for (const char* p = s; p != s + len; p += 8) { uint64_t data = loadword<uint64_t>(p); data *= mul; data ^= data >> 47; data *= mul; hash ^= data; hash *= mul; } hash ^= hash >> 47; hash *= mul; hash ^= hash >> 47; return hash; } We can take a similar approach here as before. We note that the modular multiplicative inverse of 0xc6a4a7935bd1e995 mod $2^{64}$ is 0x5f7a0ea7e59b19bd. As an example, we can choose the first 24 bytes arbitrarily, and solve for the last 8 bytes: void murmurhash64a_preimage32(uint64_t hash, char* s, uint64_t seed) { const uint64_t mul = 0xc6a4a7935bd1e995ULL; const uint64_t mulinv = 0x5f7a0ea7e59b19bdULL; // Compute the hash state for the first 24 bytes as normal. uint64_t state = seed ^ (32 * mul); for (const char* p = s; p != s + 24; p += 8) { uint64_t data = loadword<uint64_t>(p); data *= mul; data ^= data >> 47; data *= mul; state ^= data; state *= mul; } // Invert target hash transformation. // return hash; hash ^= hash >> 47; // hash ^= hash >> 47; hash *= mulinv; // hash *= mul; hash ^= hash >> 47; // hash ^= hash >> 47; // Invert last iteration for last 8 bytes. hash *= mulinv; // hash *= mul; uint64_t data = state ^ hash; // hash = hash ^ data; data *= mulinv; // data *= mul; data ^= data >> 47; // data ^= data >> 47; data *= mulinv; // data *= mul; std::memcpy(s + 24, &data, 8); // data = loadword<uint64_t>(s); } The following 10 strings all hash to 1337 using MurmurHash64A with the default seed 0xc70f6907, generated using this code: orlp-murmurhash64-bhbaaat;SXtgVa orlp-murmurhash64-bkiaaa&JInaNcZ orlp-murmurhash64-ewmaaa(%J+jw>j orlp-murmurhash64-vxpaaag"93\Yj5 orlp-murmurhash64-ehuaaafa`Wp`/| orlp-murmurhash64-yizaaa1x.zQF6r orlp-murmurhash64-lpzaaaZphp&c F orlp-murmurhash64-wsjbaa771rz{z< orlp-murmurhash64-rnkbaazy4X]p>B orlp-murmurhash64-aqnbaaZ~OzP_Tp Universal collision attack on MurmurHash64A In fact, MurmurHash64A is so weak that Jean-Philippe Aumasson, Daniel J. Bernstein and Martin Boßlet published an attack that creates sets of strings which collide regardless of the random seed used. To be fair to CityHash64… just kidding they found universal collisions against it as well, regardless of seed used. CityHash64 is actually much easier to break in this way, as simply doing the above pre-image attack targetting 0 as hash makes the output purely dependent on the seed, and thus a universal collision. To see how it works, let’s take a look at the core loop of MurmurHash64A: uint64_t data = loadword<uint64_t>(p); data *= mul; // Trivially invertible. data ^= data >> 47; // Trivially invertible. data *= mul; // Trivially invertible. state ^= data; state *= mul; We know we can trivially invert the operations done on data regardless of what the current state is, so we might as well have had the following body: state ^= data; state *= mul; Now the hash starts looking rather weak indeed. The clever trick they employ is by creating two strings simultaneously, such that they differ precisely in the top bit in each 8-byte word. Why the top bit? >>> 1 << 63 9223372036854775808 >>> (1 << 63) * mul % 2**64 9223372036854775808 Since mul is odd, its least significant bit is set. Multiplying 1 << 63 by it is equivalent to shifting that bit 63 places to the left, which is once again 1 << 63. That is, 1 << 63 is a fixed point for the state *= mul operation. We also note that for the top bit XOR is equivalent to addition, as the overflow from addition is removed mod $2^{64}$. So if we have two input strings, one starting with the 8 bytes data, and the other starting with data ^ (1 << 63) == data + (1 << 63) (after doing the trivial inversions). We then find that the two states, regardless of seed, differ exactly in the top bit after state ^= data. After multiplication we find we have two states x * mul and (x + (1 << 63)) * mul == x * mul + (1 << 63)… which again differ exactly in the top bit! We are now back to state ^= data in our iteration, for the next 8 bytes. We can now use this moment to cancel our top bit difference, by again feeding two 8-byte strings that differ in the top bit (after inverting). In fact, we only have to find one pair of such strings that differ in the top bit, which we can then repeat twice (in either order) to cancel our difference again. When represented as a uint64_t if we choose the first string as x we can derive the second string as x *= mul; // Forward transformation... x ^= x >> 47; // ... x *= mul; // ... x ^= 1 << 63; // Difference in top bit. x *= mulinv; // Backwards transformation... x ^= x >> 47; // ... x *= mulinv; // ... I was unable to find a printable ASCII string that has another printable ASCII string as its partner. But I was able to find the following pair of 8-byte UTF-8 strings that differ in exactly the top bit after the Murmurhash64A input transformation: xx0rlpx! xxsXъВ Combining them as such gives two 16-byte strings that when fed through the hash algorithm manipulate the state in the same way: a collision. xx0rlpx!xxsXъВ xxsXъВxx0rlpx! But it doesn’t stop there. By concatenating these two strings we can create $2^n$ different colliding strings each $16n$ bytes long. With the current libstdc++ implementation the following prints the same number eight times: std::hash<std::u8string> h; std::u8string a = u8"xx0rlpx!xxsXъВ"; std::u8string b = u8"xxsXъВxx0rlpx!"; std::cout << h(a + a + a) << "\n"; std::cout << h(a + a + b) << "\n"; std::cout << h(a + b + a) << "\n"; std::cout << h(a + b + b) << "\n"; std::cout << h(b + a + a) << "\n"; std::cout << h(b + a + b) << "\n"; std::cout << h(b + b + a) << "\n"; std::cout << h(b + b + b) << "\n"; Even if the libstdc++ would randomize the seed used by MurmurHash64a, the strings would still collide. Breaking MurmurHash3 Nim uses used to use MurmurHash3_x86_32, so let’s try to break that. If we once again simplify to strings whose lengths are a multiple of 4 we get the following code: uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); } uint32_t murmurhash3_x86_32(const char* s, int len, uint32_t seed) { const uint32_t c1 = 0xcc9e2d51; const uint32_t c2 = 0x1b873593; const uint32_t c3 = 0x85ebca6b; const uint32_t c4 = 0xc2b2ae35; uint32_t h = seed; for (const char* p = s; p != s + len; p += 4) { uint32_t k = loadword<uint32_t>(p); k *= c1; k = rotl32(k, 15); k *= c2; h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64; } h ^= len; h ^= h >> 16; h *= c3; h ^= h >> 13; h *= c4; h ^= h >> 16; return h; } I think by now you should be able to get this function to spit out any value you want if you know the seed. The inverse of rotl32(x, r) is rotl32(x, 32-r) and the inverse of h ^= h >> 16 is once again just h ^= h >> 16. Only h ^= h >> 13 is a bit different, it’s the first time we’ve seen that a xorshift’s inverse has more than one step: h ^= h >> 13 h ^= h >> 26 Compute the modular inverses of c1 through c4 as well as 5 mod $2^{32}$, and go to town. If you want to cheat or check your answer, you can check out the code I’ve used to generate the following ten strings that all hash to 1337 when fed to MurmurHash3_x86_32 with seed 0: orlp-murmurhash3_x86_32-haaaPa*+ orlp-murmurhash3_x86_32-saaaUW&< orlp-murmurhash3_x86_32-ubaa/!/" orlp-murmurhash3_x86_32-weaare]] orlp-murmurhash3_x86_32-chaa5@/} orlp-murmurhash3_x86_32-claaM[,5 orlp-murmurhash3_x86_32-fraaIx`N orlp-murmurhash3_x86_32-iwaara&< orlp-murmurhash3_x86_32-zwaa]>zd orlp-murmurhash3_x86_32-zbbaW-5G Nim uses 0 as a fixed seed. You might wonder about the ethics of publishing functions for generating arbitrary amounts of collisions for hash functions actually in use today. I did consider holding back. But HashDoS has been a known attack for almost two decades now, and the universal hash collisions I’ve shown were also published more than a decade ago now as well. At some point you’ve had enough time to, uh, fix your shit. Universal collision attack on MurmurHash3 Suppose that Nim didn’t use 0 as a fixed seed, but chose a randomly generated one. Can we do a similar attack as the one done to MurmurHash2 to still generate universal multicollisions? Yes we can. Let’s take another look at that core loop body: uint32_t k = loadword<uint32_t>(p); k *= c1; // Trivially invertable. k = rotl32(k, 15); // Trivially invertable. k *= c2; // Trivially invertable. h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64; Once again we can ignore the first three trivially invertable instructions as we can simply choose our input so that we get exactly the k we want. Remember from last time that we want to introduce a difference in exactly the top bit of h, as the multiplication will leave this difference in place. But here there is a bit rotation between the XOR and the multiplication. The solution? Simply place our bit difference such that rotl32(h, 13) shifts it into the top position. Does the addition of 0xe6546b64 mess things up? No. Since only the top bit between the two states will be different, there is a difference of exactly $2^{31}$ between the two states. This difference is maintained by the addition. Since two 32-bit numbers with the same top bit can be at most $2^{31} - 1$ apart, we can conclude that the two states still differ in the top bit after the addition. So we want to find two pairs of 32-bit ints, such that after applying the first three instructions the first pair differs in bit 1 << (31 - 13) == 0x00040000 and the second pair in bit 1 << 31 == 0x80000000. After some brute-force searching I found some cool pairs (again forced to use UTF-8), which when combined give the following collision: a = "!&orlpՓ" b = "yǏglp$X" As before, any concatenation of as and bs of length n collides with all other combinations of length n. Breaking FarmHash64 Nim switched to farmhash since I started writing this post. To break it we can notice that its structure is very similar to CityHash64, so we can use those same techniques again. In fact, the only changes between the two for lengths 17-32 bytes is that a few operators were changed from subtraction/XOR to addition, a rotation operator had its constant tweaked, and some k constants are slightly tweaked in usage. The process of breaking it is so similar that it’s entirely analogous, so we can skip straight to the result. These 10 strings all hash to 1337 with FarmHash64: orlp-farmhash64-?VrJ@L7ytzwheaaa orlp-farmhash64-p3`!SQb}fmxheaaa orlp-farmhash64-pdt'cuI\gvxheaaa orlp-farmhash64-IbY`xAG&ibkieaaa orlp-farmhash64-[_LU!d1hwmkieaaa orlp-farmhash64-QiY!clz]bttieaaa orlp-farmhash64-&?J3rZ_8gsuieaaa orlp-farmhash64-LOBWtm5Szyuieaaa orlp-farmhash64-Mptaa^g^ytvieaaa orlp-farmhash64-B?&l::hxqmfjeaaa Trivial fixed-seed wyhash multicollisions Zig uses wyhash with a fixed seed of zero. While I was unable to do seed-independent attacks against wyhash, using it with a fixed seed makes generating collisions trivial. Wyhash is built upon the folded multiply, which takes two 64-bit inputs, multiplies them to a 128-bit product before XORing together the two halves: uint64_t folded_multiply(uint64_t a, uint64_t b) { __uint128_t full = __uint128_t(a) * __uint128_t(b); return uint64_t(full) ^ uint64_t(full >> 64); } It’s easy to immediately see a critical flaw with this: if one of the two sides is zero, the output will also always be zero. To protect against this, wyhash always uses a folded multiply in the following form: out = folded_multiply(input_a ^ secret_a, input_b ^ secret_b); where secret_a and secret_b are determined by the seed, or outputs of previous iterations which are influenced by the seed. However, when your seed is constant… With a bit of creativity we can use the start of our string to prepare a ‘secret’ value which we can perfectly cancel with another ASCII string later in the input. So, without further ado, every 32-byte string of the form orlp-wyhash-oGf_________tWJbzMJR hashes to the same value with Zig’s default hasher. Zig uses a different set of parameters than the defaults found in the wyhash repository, so for good measure, this pattern provides arbitrary multicollisions for the default parameters found in wyhash when using seed == 0: orlp-wyhash-EUv_________NLXyytkp Conclusion We’ve seen that a lot of the hash functions in common use in hash tables today are very weak, allowing fairly trivial attacks to produce arbitrary amounts of collisions if not randomly initialized. Using a randomly seeded hash table is paramount if you don’t wish to become a victim of a hash flooding attack. We’ve also seen that some hash functions are vulnerable to attack even if randomly seeded. These are completely broken and should not be used if attacks are a concern at all. Luckily I was unable to find such attacks against most hashes, but the possibility of such an attack existing is quite unnerving. With universal hashing it’s possible to construct hash functions for which such an attack is provably impossible, last year I published a hash function called polymur-hash that has this property. Your HTTPS connection to this website also likely uses a universal hash function for authenticity of the transferred data, both Poly1305 and GCM are based on universal hashing for their security proofs. Well, such attacks are provably impossible against non-interactive attackers, everything goes out of the window again when an attacker is allowed to inspect the output hashes and use that to try and guess your secret key. Of course, if your data is not user-controlled, or there is no reasonable security model where your application would face attacks, you can get away with faster and insecure hashes. More to come on the subject of hashing and hash tables and how it can go right or wrong, but for now this article is long enough as-is…
Suppose you have an array of floating-point numbers, and wish to sum them. You might naively think you can simply add them, e.g. in Rust: fn naive_sum(arr: &[f32]) -> f32 { let mut out = 0.0; for x in arr { out += *x; } out } This however can easily result in an arbitrarily large accumulated error. Let’s try it out: naive_sum(&vec![1.0; 1_000_000]) = 1000000.0 naive_sum(&vec![1.0; 10_000_000]) = 10000000.0 naive_sum(&vec![1.0; 100_000_000]) = 16777216.0 naive_sum(&vec![1.0; 1_000_000_000]) = 16777216.0 Uh-oh… What happened? When you compute $a + b$ the result must be rounded to the nearest representable floating-point number, breaking ties towards the number with an even mantissa. The problem is that the next 32-bit floating-point number after 16777216 is 16777218. In this case that means 16777216 + 1 rounds back to 16777216 again. We’re stuck. Luckily, there are better ways to sum an array. Pairwise summation A method that’s a bit more clever is to use pairwise summation. Instead of a completely linear sum with a single accumulator it recursively sums an array by splitting the array in half, summing the halves, and then adding the sums. fn pairwise_sum(arr: &[f32]) -> f32 { if arr.len() == 0 { return 0.0; } if arr.len() == 1 { return arr[0]; } let (first, second) = arr.split_at(arr.len() / 2); pairwise_sum(first) + pairwise_sum(second) } This is more accurate: pairwise_sum(&vec![1.0; 1_000_000]) = 1000000.0 pairwise_sum(&vec![1.0; 10_000_000]) = 10000000.0 pairwise_sum(&vec![1.0; 100_000_000]) = 100000000.0 pairwise_sum(&vec![1.0; 1_000_000_000]) = 1000000000.0 However, this is rather slow. To get a summation routine that goes as fast as possible while still being reasonably accurate we should not recurse down all the way to length-1 arrays, as this gives too much call overhead. We can still use our naive sum for small sizes, and only recurse on large sizes. This does make our worst-case error worse by a constant factor, but in turn makes the pairwise sum almost as fast as a naive sum. By choosing the splitpoint as a multiple of 256 we ensure that the base case in the recursion always has exactly 256 elements except on the very last block. This makes sure we use the most optimal reduction and always correctly predict the loop condition. This small detail ended up improving the throughput by 40% for large arrays! fn block_pairwise_sum(arr: &[f32]) -> f32 { if arr.len() > 256 { let split = (arr.len() / 2).next_multiple_of(256); let (first, second) = arr.split_at(split); block_pairwise_sum(first) + block_pairwise_sum(second) } else { naive_sum(arr) } } Kahan summation The worst-case round-off error of naive summation scales with $O(n \epsilon)$ when summing $n$ elements, where $\epsilon$ is the machine epsilon of your floating-point type (here $2^{-24}$). Pairwise summation improves this to $O((\log n) \epsilon + n\epsilon^2)$. However, Kahan summation improves this further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely, leaving only the $\epsilon^2$ term which is negligible unless you sum a very large amount of numbers. All of these bounds scale with $\sum_i |x_i|$, so the worst-case absolute error bound is still quadratic in terms of $n$ even for Kahan summation. In practice all summation algorithms do significantly better than their worst-case bounds, as in most scenarios the errors do not exclusively round up or down, but cancel each other out on average. pub fn kahan_sum(arr: &[f32]) -> f32 { let mut sum = 0.0; let mut c = 0.0; for x in arr { let y = *x - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum } The Kahan summation works by maintaining the sum in two registers, the actual bulk sum and a small error correcting term $c$. If you were using infinitely precise arithmetic $c$ would always be zero, but with floating-point it might not be. The downside is that each number now takes four operations to add to the sum instead of just one. To mitigate this we can do something similar to what we did with the pairwise summation. We can first accumulate blocks into sums naively before combining the block sums with Kaham summation to reduce overhead at the cost of accuracy: pub fn block_kahan_sum(arr: &[f32]) -> f32 { let mut sum = 0.0; let mut c = 0.0; for chunk in arr.chunks(256) { let x = naive_sum(chunk); let y = x - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum } Exact summation I know of at least two general methods to produce the correctly-rounded sum of a sequence of floating-point numbers. That is, it logically computes the sum with infinite precision before rounding it back to a floating-point value at the end. The first method is based on the 2Sum primitive which is an error-free transform from two numbers $x, y$ to $s, t$ such that $x + y = s + t$, where $t$ is a small error. By applying this repeatedly until the errors vanish you can get a correctly-rounded sum. Keeping track of what to add in what order can be tricky, and the worst-case requires $O(n^2)$ additions to make all the terms vanish. This is what’s implemented in Python’s math.fsum and in the Rust crate fsum which use extra memory to keep the partial sums around. The accurate crate also implements this using in-place mutation in i_fast_sum_in_place. Another method is to keep a large buffer of integers around, one per exponent. Then when adding a floating-point number you decompose it into a an exponent and mantissa, and add the mantissa to the corresponding integer in the buffer. If the integer buf[i] overflows you increment the integer in buf[i + w], where w is the width of your integer. This can actually compute a completely exact sum, without any rounding at all, and is effectively just an overly permissive representation of a fixed-point number optimized for accumulating floats. This latter method is $O(n)$ time, but uses a large but constant amount of memory ($\approx$ 1 KB for f32, $\approx$ 16 KB for f64). An advantage of this method is that it’s also an online algorithm - both adding a number to the sum and getting the current total are amortized $O(1)$. A variant of this method is implemented in the accurate crate as OnlineExactSum crate which uses floats instead of integers for the buffer. Unleashing the compiler Besides accuracy, there is another problem with naive_sum. The Rust compiler is not allowed to reorder floating-point additions, because floating-point addition is not associative. So it cannot autovectorize the naive_sum to use SIMD instructions to compute the sum, nor use instruction-level parallelism. To solve this there are compiler intrinsics in Rust that do float sums while allowing associativity, such as std::intrinsics::fadd_fast. However, these instructions are incredibly dangerous, as they assume that both the input and output are finite numbers (no infinities, no NaNs), or otherwise they are undefined behavior. This functionally makes them unusable, as only in the most restricted scenarios when computing a sum do you know that all inputs are finite numbers, and that their sum cannot overflow. I recently uttered my annoyance with these operators to Ben Kimock, and together we proposed (and he implemented) a new set of operators: std::intrinsics::fadd_algebraic and friends. I proposed we call the operators algebraic, as they allow (in theory) any transformation that is justified by real algebra. For example, substituting ${x - x \to 0}$, ${cx + cy \to c(x + y)}$, or ${x^6 \to (x^2)^3.}$ In general these operators are treated as-if they are done using real numbers, and can map to any set of floating-point instructions that would be equivalent to the original expression, assuming the floating-point instructions would be exact. Note that the real numbers do not contain NaNs or infinities, so these operators assume those do not exist for the validity of transformations, however it is not undefined behavior when you do encounter those values. They also allow fused multiply-add instructions to be generated, as under real arithmetic $\operatorname{fma}(a, b, c) = ab + c.$ Using those new instructions it is trivial to generate an autovectorized sum: #![allow(internal_features)] #![feature(core_intrinsics)] use std::intrinsics::fadd_algebraic; fn naive_sum_autovec(arr: &[f32]) -> f32 { let mut out = 0.0; for x in arr { out = fadd_algebraic(out, *x); } out } If we compile with -C target-cpu=broadwell we see that the compiler automatically generated the following tight loop for us, using 4 accumulators and AVX2 instructions: .LBB0_5: vaddps ymm0, ymm0, ymmword ptr [rdi + 4*r8] vaddps ymm1, ymm1, ymmword ptr [rdi + 4*r8 + 32] vaddps ymm2, ymm2, ymmword ptr [rdi + 4*r8 + 64] vaddps ymm3, ymm3, ymmword ptr [rdi + 4*r8 + 96] add r8, 32 cmp rdx, r8 jne .LBB0_5 This will process 128 bytes of floating-point data (so 32 elements) in 7 instructions. Additionally, all the vaddps instructions are independent of each other as they accumulate to different registers. If we analyze this with uiCA we see that it estimates the above loop to take 4 cycles to complete, processing 32 bytes / cycle. At 4GHz that’s up to 128GB/s! Note that that’s way above what my machine’s RAM bandwidth is, so you will only achieve that speed when summing data that is already in cache. With this in mind we can also easily define block_pairwise_sum_autovec and block_kahan_sum_autovec by replacing their calls to naive_sum with naive_sum_autovec. Accuracy and speed Let’s take a look at how the different summation methods compare. As a relatively arbitrary benchmark, let’s sum 100,000 random floats ranging from -100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on my AMD Threadripper 2950x. All the code is available on Github. Compiled with RUSTFLAGS=-C target-cpu=native and --release I get the following results: AlgorithmThroughputMean absolute error naive5.5 GB/s71.796 pairwise0.9 GB/s1.5528 kahan1.4 GB/s0.2229 block_pairwise5.8 GB/s3.8597 block_kahan5.9 GB/s4.2184 naive_autovec118.6 GB/s14.538 block_pairwise_autovec71.7 GB/s1.6132 block_kahan_autovec98.0 GB/s1.2306 crate_accurate_buffer1.1 GB/s0.0015 crate_accurate_inplace1.9 GB/s0.0015 crate_fsum1.2 GB/s0.0000 The reason the accurate crate has a non-zero absolute error is because it currently does not implement rounding to nearest correctly, so it can be off by one unit in the last place for the final result. First I’d like to note that there’s more than a 100x performance difference between the fastest and slowest method. For summing an array! Now this might not be entirely fair as the slowest methods are computing something significantly harder, but there’s still a 20x performance difference between a seemingly reasonable naive implementation and the fastest one. We find that in general the _autovec methods that use fadd_algebraic are faster and more accurate than the ones using regular floating-point addition. The reason they’re more accurate as well is the same reason a pairwise sum is more accurate: any reordering of the additions is better as the default long-chain-of-additions is already the worst case for accuracy in a sum. Limiting ourselves to Pareto-optimal choices we get the following four implementations: AlgorithmThroughputMean absolute error naive_autovec118.6 GB/s14.538 block_kahan_autovec98.0 GB/s1.2306 crate_accurate_inplace1.9 GB/s0.0015 crate_fsum1.2 GB/s0.0000 Note that implementation differences can be quite impactful, and there are likely dozens more methods of compensated summing I did not compare here. For most cases I think block_kahan_autovec wins here, having good accuracy (that doesn’t degenerate with larger inputs) at nearly the maximum speed. For most applications the extra accuracy from the correctly-rounded sums is unnecessary, and they are 50-100x slower. By splitting the loop up into an explicit remainder plus a tight loop of 256-element sums we can squeeze out a bit more performance, and avoid a couple floating-point ops for the last chunk: #![allow(internal_features)] #![feature(core_intrinsics)] use std::intrinsics::fadd_algebraic; fn sum_block(arr: &[f32]) -> f32 { arr.iter().fold(0.0, |x, y| fadd_algebraic(x, *y)) } pub fn sum_orlp(arr: &[f32]) -> f32 { let mut chunks = arr.chunks_exact(256); let mut sum = 0.0; let mut c = 0.0; for chunk in &mut chunks { let y = sum_block(chunk) - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum + (sum_block(chunks.remainder()) - c) } AlgorithmThroughputMean absolute error sum_orlp112.2 GB/s1.2306 You can of course tweak the number 256, I found that using 128 was $\approx$ 20% slower, and that 512 didn’t really improve performance but did cost accuracy. Conclusion I think the fadd_algebraic and similar algebraic intrinsics are very useful for achieving high-speed floating-point routines, and that other languages should add them as well. A global -ffast-math is not good enough, as we’ve seen above the best implementation was a hybrid between automatically optimized math for speed, and manually implemented non-associative compensated operations. Finally, if you are using LLVM, beware of -ffast-math. It is undefined behavior to produce a NaN or infinity while that flag is set in LLVM. I have no idea why they chose this hardcore stance which makes virtually every program that uses it unsound. If you are targetting LLVM with your language, avoid the nnan and ninf fast-math flags.
Suppose you have a 64-bit word and wish to extract a couple bits from it. For example you just performed a SWAR algorithm and wish to extract the least significant bit of each byte in the u64. This is simple enough, you simply perform a binary AND with a mask of the bits you wish to keep: let out = word & 0x0101010101010101; However, this still leaves the bits of interest spread throughout the 64-bit word. What if we also want to compress the 8 bits we wish to extract into a single byte? Or what if we want the inverse, spreading the 8 bits of a byte among the least significant bits of each byte in a 64-bit word? PEXT and PDEP If you are using a modern x86-64 CPU, you are in luck. In the much underrated BMI instruction set there are two very powerful instructions: PDEP and PEXT. They are inverses of each other, PEXT extracts bits, PDEP deposits bits. PEXT takes in a word and a mask, takes just those bits from the word where the mask has a 1 bit, and compresses all selected bits to a contiguous output word. Simulated in Rust this would be: fn pext64(word: u64, mask: u64) -> u64 { let mut out = 0; let mut out_idx = 0; for i in 0..64 { let ith_mask_bit = (mask >> i) & 1; let ith_word_bit = (word >> i) & 1; if ith_mask_bit == 1 { out |= ith_word_bit << out_idx; out_idx += 1; } } out } For example if you had the bitstring abcdefgh and mask 10110001 you would get output bitstring 0000acdh. PDEP is exactly its inverse, it takes contiguous data bits as a word, and a mask, and deposits the data bits one-by-one (starting at the least significant bits) into those bits where the mask has a 1 bit, leaving the rest as zeros: fn pdep64(word: u64, mask: u64) -> u64 { let mut out = 0; let mut input_idx = 0; for i in 0..64 { let ith_mask_bit = (mask >> i) & 1; if ith_mask_bit == 1 { let next_word_bit = (word >> input_idx) & 1; out |= next_word_bit << i; input_idx += 1; } } out } So if you had the bitstring abcdefgh and mask 10100110 you would get output e0f00gh0 (recall that we traditionally write bitstrings with the least significant bit on the right). These instructions are incredibly powerful and flexible, and the amazing thing is that these instructions only take a single cycle on modern Intel and AMD CPUs! However, they are not available in other instruction sets, so whenever you use them you will also likely need to write a cross-platform alternative. Unfortunately, both PDEP and PEXT are very slow on AMD Zen and Zen2. They are implemented in microcode, which is really unfortunate. The platform advertises through CPUID that the instructions are supported, but they’re almost unusably slow. Use with caution. Extracting bits with multiplication While the following technique can’t replace all PEXT cases, it can be quite general. It is applicable when: The bit pattern you want to extract is static and known in advance. If you want to extract $k$ bits, there must at least be a $k-1$ gap between two bits of interest. We compute the bit extraction by adding together many left-shifted copies of our input word, such that we construct our desired bit pattern in the uppermost bits. The trick is to then realize that w << i is equivalent to w * (1 << i) and thus the sum of many left-shifted copies is equivalent to a single multiplication by (1 << i) + (1 << j) + ... I think the technique is best understood by visual example. Let’s use our example from earlier, extracting the least significant bit of each byte in a 64-bit word. We start off by masking off just those bits. After that we shift the most significant bit of interest to the topmost bit of the word to get our first shifted copy. We then repeat this, shifting the second most significant bit of interest to the second topmost bit, etc. We sum all these shifted copies. This results in the following (using underscores instead of zeros for clarity): mask = _______1_______1_______1_______1_______1_______1_______1_______1 t = w & mask t = _______a_______b_______c_______d_______e_______f_______g_______h t << 7 = a_______b_______c_______d_______e_______f_______g_______h_______ t << 14 = _b_______c_______d_______e_______f_______g_______h______________ t << 21 = __c_______d_______e_______f_______g_______h_____________________ t << 28 = ___d_______e_______f_______g_______h____________________________ t << 35 = ____e_______f_______g_______h___________________________________ t << 42 = _____f_______g_______h__________________________________________ t << 49 = ______g_______h_________________________________________________ t << 56 = _______h________________________________________________________ sum = abcdefghbcdefgh_cdefh___defgh___efgh____fgh_____gh______h_______ Note how we constructed abcdefgh in the topmost 8 bits, which we can then extract using a single right-shift by $64 - 8 = 56$ bits. Since (1 << 7) + (1 << 14) + ... + (1 << 56) = 0x102040810204080 we get the following implementation: fn extract_lsb_bit_per_byte(w: u64) -> u8 { let mask = 0x0101010101010101; let sum_of_shifts = 0x102040810204080; ((w & mask).wrapping_mul(sum_of_shifts) >> 56) as u8 } Not as good as PEXT, but three arithmetic instructions is not bad at all. Depositing bits with multiplication Unfortunately the following technique is significantly less general than the previous one. While you can take inspiration from it to implement similar algorithms, as-is it is limited to just spreading the bits of one byte to the least significant bit of each byte in a 64-bit word. The trick is similar to the one above. We add 8 shifted copies of our byte which once again translates to a multiplication. By choosing a shift that increases in multiples if 9 instead of 8 we ensure that the bit pattern shifts over by one position in each byte. We then mask out our bits of interest, and finish off with a shift and byteswap (which compiles to a single instruction bswap on Intel or rev on ARM) to put our output bits on the least significant bits and reverse the order. This technique visualized: b = ________________________________________________________abcdefgh b << 9 = _______________________________________________abcdefgh_________ b << 18 = ______________________________________abcdefgh__________________ b << 27 = _____________________________abcdefgh___________________________ b << 36 = ____________________abcdefgh____________________________________ b << 45 = ___________abcdefgh_____________________________________________ b << 54 = __abcdefgh______________________________________________________ b << 63 = h_______________________________________________________________ sum = h_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh mask = 1_______1_______1_______1_______1_______1_______1_______1_______ s & msk = h_______g_______f_______e_______d_______c_______b_______a_______ We once again note that the sum of shifts can be precomputed as 1 + (1 << 9) + ... + (1 << 63) = 0x8040201008040201, allowing the following implementation: fn deposit_lsb_bit_per_byte(b: u8) -> u64 { let sum_of_shifts = 0x8040201008040201; let mask = 0x8080808080808080; let spread = (b as u64).wrapping_mul(sum_of_shifts) & mask; u64::swap_bytes(spread >> 7) } This time it required 4 arithmetic instructions, not quite as good as PDEP, but again not bad compared to a naive implementation, and this is cross-platform.
This post is an anecdote from over a decade ago, of which I lost the actual code. So please forgive me if I do not accurately remember all the details. Some details are also simplified so that anyone that likes computer security can enjoy this article, not just those who have played World of Warcraft (although the Venn diagram of those two groups likely has a solid overlap). When I was around 14 years old I discovered World of Warcraft developed by Blizzard Games and was immediately hooked. Not long after I discovered add-ons which allow you to modify how your game’s user interface looks and works. However, not all add-ons I downloaded did exactly what I wanted to do. I wanted more. So I went to find out how they were made. In a weird twist of fate, I blame World of Warcraft for me seriously picking up programming. It turned out that they were made in the Lua programming language. Add-ons were nothing more than a couple .lua source files in a folder directly loaded into the game. The barrier of entry was incredibly low: just edit a file, press save and reload the interface. The fact that the game loaded your source code and you could see it running was magical! I enjoyed it immensely and in no time I was only writing add-ons and was barely playing the game itself anymore. I published quite a few add-ons in the next two years, which mostly involved copying other people’s code with some refactoring / recombining / tweaking to my wishes. Add-on security A thought you might have is that it’s a really bad idea to let users have fully programmable add-ons in your game, lest you get bots. However, the system Blizzard made to prevent arbitrary programmable actions was quite clever. Naturally, it did nothing to prevent actual botting, but at least regular rule-abiding players were fundamentally restricted to the automation Blizzard allowed. Most UI elements that you could create were strictly decorative or informational. These were completely unrestricted, as were most APIs that strictly gather information. For example you can make a health bar display using two frames, a background and a foreground, sizing the foreground frame using an API call to get the health of your character. Not all API calls were available to you however. Some were protected so they could only be called from official Blizzard code. These typically involved the API calls that would move your character, cast spells, use items, etc. Generally speaking anything that actually makes you perform an in-game action was protected. The API for getting your exact world location and camera orientation also became protected at some point. This was a reaction by Blizzard to new add-ons that were actively drawing 3D elements on top of the game world to make boss fights easier. However, some UI elements needed to actually interact with the game itself, e.g. if I want to make a button that casts a certain spell. For this you could construct a special kind of button that executes code in a secure environment when clicked. You were only allowed to create/destroy/move such buttons when not in combat, so you couldn’t simply conditionally place such buttons underneath your cursor to automate actions during combat. The catch was that this secure environment did allow you to programmatically set which spell to cast, but doesn’t let you gather the information you would need to do arbitrary automation. All access to state from outside the secure environment was blocked. There were some information gathering API calls available to match the more accessible in-game macro system, but nothing as fancy as getting skill cooldowns or unit health which would enable automatic optimal spellcasting. So there were two environments: an insecure one where you can get all information but can’t act on it, and a secure one where you can act but can’t get the information needed for automation. A backdoor channel Fast forward a couple years and I had mostly stopped playing. My interests had mainly moved on to more “serious” programming, and I was only occasionally playing, mostly messing around with add-on ideas. But this secure environment kept on nagging in my brain; I wanted to break it. Of course there was third-party software that completely disables the security restrictions from Blizzard, but what’s the fun in that? I wanted to do it “legitimately”, using the technically allowed tools, as a challenge. Obviously using clever code to bypass security restrictions is no better than using third-party software, and both would likely get you banned. I never actually wanted to use the code, just to see if I could make it work. So I scanned the secure environment allowed function list to see if I could smuggle any information from the outside into the secure environment. It all seemed pretty hopeless until I saw one tiny, innocent little function: random. An evil idea came in my head: random number generators (RNGs) used in computers are almost always pseudorandom number generators with (hidden) internal state. If I can manipulate this state, perhaps I can use that to pass information into the secure environment. Random number generator woes It turned out that random was just a small shim around C’s rand. I was excited! This meant that there was a single global random state that was shared in the process. It also helps that rand implementations tended to be on the weak side. Since World of Warcraft was compiled with MSVC, the actual implementation of rand was as follows: uint32_t state; int rand() { state = state * 214013 + 2531011; return (state >> 16) & 0x7fff; } This RNG is, for the lack of a better word, shite. It is a naked linear congruential generator, and a weak one at that. Which in my case, was a good thing. I can understand MSVC keeps rand the same for backwards compatibility, and at least all documentation I could find for rand recommends you not to use rand for cryptographic purposes. But was there ever a time where such a bad PRNG implementation was fit for any purpose? So let’s get to breaking this thing. Since the state is so laughably small and you can see 15 bits of the state directly you can keep a full list of all possible states consistent with a single output of the RNG and use further calls to the RNG to eliminate possibilities until a single one remains. But we can be significantly more clever. First we note that the top bit of state never affects anything in this RNG. (state >> 16) & 0x7fff masks out 15 bits, after shifting away the bottom 16 bits, and thus effectively works mod $2^{31}$. Since on any update the new state is a linear function of the previous state, we can propagate this modular form all the way down to the initial state as $$f(x) \equiv f(x \bmod m) \mod m$$ for any linear $f$. Let $a = 214013$ and $b = 2531011$. We observe the 15-bit output $r_0, r_1$ of two RNG calls. We’ll call the 16-bit portion of the RNG state that is hidden by the shift $h_0, h_1$ respectively, for the states after the first and second call. This means the state of the RNG after the first call is $2^{16} r_0 + h_0$ and similarly for $2^{16} r_1 + h_1$ after the second call. Then we have the following identity: $$a\cdot (2^{16}r_0 + h_0) + b \equiv 2^{16}r_1 + h_1 \mod 2^{31},$$ $$ah_0 \equiv h_1 + 2^{16}(r_1 - ar_0) - b \mod 2^{31}.$$ Now let $c \geq 0$ be the known constant $(2^{16}(r_1 - ar_0) - b) \bmod 2^{31}$, then for some integer $k$ we have $$ah_0 = h_1 + c + 2^{31} k.$$ Note that the left hand side ranges from $0$ to $a (2^{16} - 1) \approx 2^{33.71}$. Thus we must have $-1 \leq k \leq 2^{2.71} < 7$. Reordering we get the following expression for $h_0$: $$h_0 = \frac{c + 2^{31} k}{a} + h_1/a.$$ Since $a > 2^{16}$ while $0 \leq h_1 < 2^{16}$ we note that the term $0 \leq h_1/a < 1$. Thus, assuming a solution exists, we must have $$h_0 = \left\lceil\frac{c + 2^{31} k}{a}\right\rceil.$$ So for $-1 \leq k < 7$ we compute the above guess for the hidden portion of the RNG state after the first call. This gives us 8 guesses, after which we can reject bad guesses using follow-up calls to the RNG until a single unique answer remains. While I was able to re-derive the above with little difficulty now, 18 year old me wasn’t as experienced in discrete math. So I asked on crypto.SE, with the excuse that I wanted to ‘show my colleagues how weak this RNG is’. It worked, which sparks all kinds of interesting ethics questions. An example implementation of this process in Python: import random A = 214013 B = 2531011 class MsvcRng: def __init__(self, state): self.state = state def __call__(self): self.state = (self.state * A + B) % 2**32 return (self.state >> 16) & 0x7fff # Create a random RNG state we'll reverse engineer. hidden_rng = MsvcRng(random.randint(0, 2**32)) # Compute guesses for hidden state from 2 observations. r0 = hidden_rng() r1 = hidden_rng() c = (2**16 * (r1 - A * r0) - B) % 2**31 ceil_div = lambda a, b: (a + b - 1) // b h_guesses = [ceil_div(c + 2**31 * k, A) for k in range(-1, 7)] # Validate guesses until a single guess remains. guess_rngs = [MsvcRng(2**16 * r0 + h0) for h0 in h_guesses] guess_rngs = [g for g in guess_rngs if g() == r1] while len(guess_rngs) > 1: r = hidden_rng() guess_rngs = [g for g in guess_rngs if g() == r] # The top bit can not be recovered as it never affects the output, # but we should have recovered the effective hidden state. assert guess_rngs[0].state % 2**31 == hidden_rng.state % 2**31 While I did write the above process with a while loop, it appears to only ever need a third output at most to narrow it down to a single guess. Putting it together Once we could reverse-engineer the internal state of the random number generator we could make arbitrary automated decisions in the supposedly secure environment. How it worked was as follows: An insecure hook was registered that would execute right before the secure environment code would run. In this hook we have full access to information, and make a decision as to which action should be taken (e.g. casting a particular spell). This action is looked up in a hardcoded list to get an index. The current state of the RNG is reverse-engineered using the above process. We predict the outcome of the next RNG call. If this (modulo the length of our action list) does not give our desired outcome, we advance the RNG and try again. This repeats until the next random number would correspond to our desired action. The hook returns, and the secure environment starts. It generates a “random” number, indexes our hardcoded list of actions, and performs the “random” action. That’s all! By being able to simulate the RNG and looking one step ahead we could use it as our information channel by choosing exactly the right moment to call random in the secure environment. Now if you wanted to support a list of $n$ actions it would on average take $n$ steps of the RNG before the correct number came up to pass along, but that wasn’t a problem in practice. Conclusion I don’t know when Blizzard fixed the issue where the RNG state is so weak and shared, or whether they were aware of it being an issue at all. A few years after I had written the code I tried it again out of curiosity, and it had stopped working. Maybe they switched to a different algorithm, or had a properly separated RNG state for the secure environment. All-in-all it was a lot of effort for a niche exploit in a video game that I didn’t even want to use. But there certainly was a magic to manipulating something supposedly random into doing exactly what you want, like a magician pulling four aces from a shuffled deck.
More in programming
Once you’ve written your strategy’s exploration, the next step is working on its diagnosis. Diagnosis is understanding the constraints and challenges your strategy needs to address. In particular, it’s about doing that understanding while slowing yourself down from deciding how to solve the problem at hand before you know the problem’s nuances and constraints. If you ever find yourself wanting to skip the diagnosis phase–let’s get to the solution already!–then maybe it’s worth acknowledging that every strategy that I’ve seen fail, did so due to a lazy or inaccurate diagnosis. It’s very challenging to fail with a proper diagnosis, and almost impossible to succeed without one. The topics this chapter will cover are: Why diagnosis is the foundation of effective strategy, on which effective policy depends. Conversely, how skipping the diagnosis phase consistently ruins strategies A step-by-step approach to diagnosing your strategy’s circumstances How to incorporate data into your diagnosis effectively, and where to focus on adding data Dealing with controversial elements of your diagnosis, such as pointing out that your own executive is one of the challenges to solve Why it’s more effective to view difficulties as part of the problem to be solved, rather than a blocking issue that prevents making forward progress The near impossibility of an effective diagnosis if you don’t bring humility and self-awareness to the process Into the details we go! This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Diagnosis is strategy’s foundation One of the challenges in evaluating strategy is that, after the fact, many effective strategies are so obvious that they’re pretty boring. Similarly, most ineffective strategies are so clearly flawed that their authors look lazy. That’s because, as a strategy is operated, the reality around it becomes clear. When you’re writing your strategy, you don’t know if you can convince your colleagues to adopt a new approach to specifying APIs, but a year later you know very definitively whether it’s possible. Building your strategy’s diagnosis is your attempt to correctly recognize the context that the strategy needs to solve before deciding on the policies to address that context. Done well, the subsequent steps of writing strategy often feel like an afterthought, which is why I think of diagnosis as strategy’s foundation. Where exploration was an evaluation-free activity, diagnosis is all about evaluation. How do teams feel today? Why did that project fail? Why did the last strategy go poorly? What will be the distractions to overcome to make this new strategy successful? That said, not all evaluation is equal. If you state your judgment directly, it’s easy to dispute. An effective diagnosis is hard to argue against, because it’s a web of interconnected observations, facts, and data. Even for folks who dislike your conclusions, the weight of evidence should be hard to shift. Strategy testing, explored in the Refinement section, takes advantage of the reality that it’s easier to diagnose by doing than by speculating. It proposes a recursive diagnosis process until you have real-world evidence that the strategy is working. How to develop your diagnosis Your strategy is almost certain to fail unless you start from an effective diagnosis, but how to build a diagnosis is often left unspecified. That’s because, for most folks, building the diagnosis is indeed a dark art: unspecified, undiscussion, and uncontrollable. I’ve been guilty of this as well, with The Engineering Executive’s Primer’s chapter on strategy staying silent on the details of how to diagnose for your strategy. So, yes, there is some truth to the idea that forming your diagnosis is an emergent, organic process rather than a structured, mechanical one. However, over time I’ve come to adopt a fairly structured approach: Braindump, starting from a blank sheet of paper, write down your best understanding of the circumstances that inform your current strategy. Then set that piece of paper aside for the moment. Summarize exploration on a new piece of paper, review the contents of your exploration. Pull in every piece of diagnosis from similar situations that resonates with you. This is true for both internal and external works! For each diagnosis, tag whether it fits perfectly, or needs to be adjusted for your current circumstances. Then, once again, set the piece of paper aside. Mine for distinct perspectives on yet another blank page, talking to different stakeholders and colleagues who you know are likely to disagree with your early thinking. Your goal is not to agree with this feedback. Instead, it’s to understand their view. The Crux by Richard Rumelt anchors diagnosis in this approach, emphasizing the importance of “testing, adjusting, and changing the frame, or point of view.” Synthesize views into one internally consistent perspective. Sometimes the different perspectives you’ve gathered don’t mesh well. They might well explicitly differ in what they believe the underlying problem is, as is typical in tension between platform and product engineering teams. The goal is to competently represent each of these perspectives in the diagnosis, even the ones you disagree with, so that later on you can evaluate your proposed approach against each of them. When synthesizing feedback goes poorly, it tends to fail in one of two ways. First, the author’s opinion shines through so strongly that it renders the author suspect. Your goal is never to agree with every team’s perspective, just as your diagnosis should typically avoid crowning any perspective as correct: a reader should generally be appraised of the details and unaware of the author. The second common issue is when a group tries to jointly own the synthesis, but create a fractured perspective rather than a unified one. I generally find that having one author who is accountable for representing all views works best to address both of these issues. Test drafts across perspectives. Once you’ve written your initial diagnosis, you want to sit down with the people who you expect to disagree most fervently. Iterate with them until they agree that you’ve accurately captured their perspective. It might be that they disagree with some other view points, but they should be able to agree that others hold those views. They might argue that the data you’ve included doesn’t capture their full reality, in which case you can caveat the data by saying that their team disagrees that it’s a comprehensive lens. Don’t worry about getting the details perfectly right in your initial diagnosis. You’re trying to get the right crumbs to feed into the next phase, strategy refinement. Allowing yourself to be directionally correct, rather than perfectly correct, makes it possible to cover a broad territory quickly. Getting caught up in perfecting details is an easy way to anchor yourself into one perspective prematurely. At this point, I hope you’re starting to predict how I’ll conclude any recipe for strategy creation: if these steps feel overly mechanical to you, adjust them to something that feels more natural and authentic. There’s no perfect way to understand complex problems. That said, if you feel uncertain, or are skeptical of your own track record, I do encourage you to start with the above approach as a launching point. Incorporating data into your diagnosis The strategy for Navigating Private Equity ownership’s diagnosis includes a number of details to help readers understand the status quo. For example the section on headcount growth explains headcount growth, how it compares to the prior year, and providing a mental model for readers to translate engineering headcount into engineering headcount costs: Our Engineering headcount costs have grown by 15% YoY this year, and 18% YoY the prior year. Headcount grew 7% and 9% respectively, with the difference between headcount and headcount costs explained by salary band adjustments (4%), a focus on hiring senior roles (3%), and increased hiring in higher cost geographic regions (1%). If everyone evaluating a strategy shares the same foundational data, then evaluating the strategy becomes vastly simpler. Data is also your mechanism for supporting or critiquing the various views that you’ve gathered when drafting your diagnosis; to an impartial reader, data will speak louder than passion. If you’re confident that a perspective is true, then include a data narrative that supports it. If you believe another perspective is overstated, then include data that the reader will require to come to the same conclusion. Do your best to include data analysis with a link out to the full data, rather than requiring readers to interpret the data themselves while they are reading. As your strategy document travels further, there will be inevitable requests for different cuts of data to help readers understand your thinking, and this is somewhat preventable by linking to your original sources. If much of the data you want doesn’t exist today, that’s a fairly common scenario for strategy work: if the data to make the decision easy already existed, you probably would have already made a decision rather than needing to run a structured thinking process. The next chapter on refining strategy covers a number of tools that are useful for building confidence in low-data environments. Whisper the controversial parts At one time, the company I worked at rolled out a bar raiser program styled after Amazon’s, where there was an interviewer from outside the team that had to approve every hire. I spent some time arguing against adding this additional step as I didn’t understand what we were solving for, and I was surprised at how disinterested management was about knowing if the new process actually improved outcomes. What I didn’t realize until much later was that most of the senior leadership distrusted one of their peers, and had rolled out the bar raiser program solely to create a mechanism to control that manager’s hiring bar when the CTO was disinterested holding that leader accountable. (I also learned that these leaders didn’t care much about implementing this policy, resulting in bar raiser rejections being frequently ignored, but that’s a discussion for the Operations for strategy chapter.) This is a good example of a strategy that does make sense with the full diagnosis, but makes little sense without it, and where stating part of the diagnosis out loud is nearly impossible. Even senior leaders are not generally allowed to write a document that says, “The Director of Product Engineering is a bad hiring manager.” When you’re writing a strategy, you’ll often find yourself trying to choose between two awkward options: Say something awkward or uncomfortable about your company or someone working within it Omit a critical piece of your diagnosis that’s necessary to understand the wider thinking Whenever you encounter this sort of debate, my advice is to find a way to include the diagnosis, but to reframe it into a palatable statement that avoids casting blame too narrowly. I think it’s helpful to discuss a few concrete examples of this, starting with the strategy for navigating private equity, whose diagnosis includes: Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don’t have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction. There are many things the authors of this strategy likely feel about their state of reality. First, they are probably upset about the fact that their new private equity ownership is likely to eliminate colleagues. Second, they are likely upset that there is no clear plan around what they need to do, so they are stuck preparing for a wide range of potential outcomes. However they feel, they don’t say any of that, they stick to precise, factual statements. For a second example, we can look to the Uber service migration strategy: Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate as product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing. The team didn’t agree that their headcount should not be growing, but it was the reality they were operating in. They acknowledged their reality as a factual statement, without any additional commentary about that statement. In both of these examples, they found a professional, non-judgmental way to acknowledge the circumstances they were solving. The authors would have preferred that the leaders behind those decisions take explicit accountability for them, but it would have undermined the strategy work had they attempted to do it within their strategy writeup. Excluding critical parts of your diagnosis makes your strategies particularly hard to evaluate, copy or recreate. Find a way to say things politely to make the strategy effective. As always, strategies are much more about realities than ideals. Reframe blockers as part of diagnosis When I work on strategy with early-career leaders, an idea that comes up a lot is that an identified problem means that strategy is not possible. For example, they might argue that doing strategy work is impossible at their current company because the executive team changes their mind too often. That core insight is almost certainly true, but it’s much more powerful to reframe that as a diagnosis: if we don’t find a way to show concrete progress quickly, and use that to excite the executive team, our strategy is likely to fail. This transforms the thing preventing your strategy into a condition your strategy needs to address. Whenever you run into a reason why your strategy seems unlikely to work, or why strategy overall seems difficult, you’ve found an important piece of your diagnosis to include. There are never reasons why strategy simply cannot succeed, only diagnoses you’ve failed to recognize. For example, we knew in our work on Uber’s service provisioning strategy that we weren’t getting more headcount for the team, the product engineering team was going to continue growing rapidly, and that engineering leadership was unwilling to constrain how product engineering worked. Rather than preventing us from implementing a strategy, those components clarified what sort of approach could actually succeed. The role of self-awareness Every problem of today is partially rooted in the decisions of yesterday. If you’ve been with your organization for any duration at all, this means that you are directly or indirectly responsible for a portion of the problems that your diagnosis ought to recognize. This means that recognizing the impact of your prior actions in your diagnosis is a powerful demonstration of self-awareness. It also suggests that your next strategy’s success is rooted in your self-awareness about your prior choices. Don’t be afraid to recognize the failures in your past work. While changing your mind without new data is a sign of chaotic leadership, changing your mind with new data is a sign of thoughtful leadership. Summary Because diagnosis is the foundation of effective strategy, I’ve always found it the most intimidating phase of strategy work. While I think that’s a somewhat unavoidable reality, my hope is that this chapter has somewhat prepared you for that challenge. The four most important things to remember are simply: form your diagnosis before deciding how to solve it, try especially hard to capture perspectives you initially disagree with, supplement intuition with data where you can, and accept that sometimes you’re missing the data you need to fully understand. The last piece in particular, is why many good strategies never get shared, and the topic we’ll address in the next chapter on strategy refinement.
A Live, Interactive Course for Systems Engineers
I’m sitting in a small coffee shop in Brooklyn. I have a warm drink, and it’s just started to snow outside. I’m visiting New York to see Operation Mincemeat on Broadway – I was at the dress rehearsal yesterday, and I’ll be at the opening preview tonight. I’ve seen this show more times than I care to count, and I hope US theater-goers love it as much as Brits. The people who make the show will tell you that it’s about a bunch of misfits who thought they could do something ridiculous, who had the audacity to believe in something unlikely. That’s certainly one way to see it. The musical tells the true story of a group of British spies who tried to fool Hitler with a dead body, fake papers, and an outrageous plan that could easily have failed. Decades later, the show’s creators would mirror that same spirit of unlikely ambition. Four friends, armed with their creativity, determination, and a wardrobe full of hats, created a new musical in a small London theatre. And after a series of transfers, they’re about to open the show under the bright lights of Broadway. But when I watch the show, I see a story about friendship. It’s about how we need our friends to help us, to inspire us, to push us to be the best versions of ourselves. I see the swaggering leader who needs a team to help him truly achieve. The nervous scientist who stands up for himself with the support of his friends. The enthusiastic secretary who learns wisdom and resilience from her elder. And so, I suppose, it’s fitting that I’m not in New York on my own. I’m here with friends – dozens of wonderful people who I met through this ridiculous show. At first, I was just an audience member. I sat in my seat, I watched the show, and I laughed and cried with equal measure. After the show, I waited at stage door to thank the cast. Then I came to see the show a second time. And a third. And a fourth. After a few trips, I started to see familiar faces waiting with me at stage door. So before the cast came out, we started chatting. Those conversations became a Twitter community, then a Discord, then a WhatsApp. We swapped fan art, merch, and stories of our favourite moments. We went to other shows together, and we hung out outside the theatre. I spent New Year’s Eve with a few of these friends, sitting on somebody’s floor and laughing about a bowl of limes like it was the funniest thing in the world. And now we’re together in New York. Meeting this kind, funny, and creative group of people might seem as unlikely as the premise of Mincemeat itself. But I believed it was possible, and here we are. I feel so lucky to have met these people, to take this ridiculous trip, to share these precious days with them. I know what a privilege this is – the time, the money, the ability to say let’s do this and make it happen. How many people can gather a dozen friends for even a single evening, let alone a trip halfway round the world? You might think it’s silly to travel this far for a theatre show, especially one we’ve seen plenty of times in London. Some people would never see the same show twice, and most of us are comfortably into double or triple-figures. Whenever somebody asks why, I don’t have a good answer. Because it’s fun? Because it’s moving? Because I enjoy it? I feel the need to justify it, as if there’s some logical reason that will make all of this okay. But maybe I don’t have to. Maybe joy doesn’t need justification. A theatre show doesn’t happen without people who care. Neither does a friendship. So much of our culture tells us that it’s not cool to care. It’s better to be detached, dismissive, disinterested. Enthusiasm is cringe. Sincerity is weakness. I’ve certainly felt that pressure – the urge to play it cool, to pretend I’m above it all. To act as if I only enjoy something a “normal” amount. Well, fuck that. I don’t know where the drive to be detached comes from. Maybe it’s to protect ourselves, a way to guard against disappointment. Maybe it’s to seem sophisticated, as if having passions makes us childish or less mature. Or perhaps it’s about control – if we stay detached, we never have to depend on others, we never have to trust in something bigger than ourselves. Being detached means you can’t get hurt – but you’ll also miss out on so much joy. I’m a big fan of being a big fan of things. So many of the best things in my life have come from caring, from letting myself be involved, from finding people who are a big fan of the same things as me. If I pretended not to care, I wouldn’t have any of that. Caring – deeply, foolishly, vulnerably – is how I connect with people. My friends and I care about this show, we care about each other, and we care about our joy. That care and love for each other is what brought us together, and without it we wouldn’t be here in this city. I know this is a once-in-a-lifetime trip. So many stars had to align – for us to meet, for the show we love to be successful, for us to be able to travel together. But if we didn’t care, none of those stars would have aligned. I know so many other friends who would have loved to be here but can’t be, for all kinds of reasons. Their absence isn’t for lack of caring, and they want the show to do well whether or not they’re here. I know they care, and that’s the important thing. To butcher Tennyson: I think it’s better to care about something you cannot affect, than to care about nothing at all. In a world that’s full of cynicism and spite and hatred, I feel that now more than ever. I’d recommend you go to the show if you haven’t already, but that’s not really the point of this post. Maybe you’ve already seen Operation Mincemeat, and it wasn’t for you. Maybe you’re not a theatre kid. Maybe you aren’t into musicals, or history, or war stories. That’s okay. I don’t mind if you care about different things to me. (Imagine how boring the world would be if we all cared about the same things!) But I want you to care about something. I want you to find it, find people who care about it too, and hold on to them. Because right now, in this city, with these people, at this show? I’m so glad I did. And I hope you find that sort of happiness too. Some of the people who made this trip special. Photo by Chloe, and taken from her Twitter. Timing note: I wrote this on February 15th, but I delayed posting it because I didn’t want to highlight the fact I was away from home. [If the formatting of this post looks odd in your feed reader, visit the original article]
One of the biggest mistakes that new startup founders make is trying to get away from the customer-facing roles too early. Whether it's customer support or it's sales, it's an incredible advantage to have the founders doing that work directly, and for much longer than they find comfortable. The absolute worst thing you can do is hire a sales person or a customer service agent too early. You'll miss all the golden nuggets that customers throw at you for free when they're rejecting your pitch or complaining about the product. Seeing these reasons paraphrased or summarized destroy all the nutrients in their insights. You want that whole-grain feedback straight from the customers' mouth! When we launched Basecamp in 2004, Jason was doing all the customer service himself. And he kept doing it like that for three years!! By the time we hired our first customer service agent, Jason was doing 150 emails/day. The business was doing millions of dollars in ARR. And Basecamp got infinitely, better both as a market proposition and as a product, because Jason could funnel all that feedback into decisions and positioning. For a long time after that, we did "Everyone on Support". Frequently rotating programmers, designers, and founders through a day of answering emails directly to customers. The dividends of doing this were almost as high as having Jason run it all in the early years. We fixed an incredible number of minor niggles and annoying bugs because programmers found it easier to solve the problem than to apologize for why it was there. It's not easy doing this! Customers often offer their valuable insights wrapped in rude language, unreasonable demands, and bad suggestions. That's why many founders quit the business of dealing with them at the first opportunity. That's why few companies ever do "Everyone On Support". That's why there's such eagerness to reduce support to an AI-only interaction. But quitting dealing with customers early, not just in support but also in sales, is an incredible handicap for any startup. You don't have to do everything that every customer demands of you, but you should certainly listen to them. And you can't listen well if the sound is being muffled by early layers of indirection.