It seems that in 2025 a lot of people fall into one of two camps when it comes to AI: skeptic or fanatic. The skeptic thinks AI sucks, that it’s overhyped, it only ever parrots nonsense and it will all blow over soon. The fanatic thinks general human-level intelligence is just around the corner, and that AI will solve almost all our problems. I hope my title is sufficiently ambiguous to attract both camps. The fanatic will be outraged, being ready to jump into the fray to point out why AI isn’t or won’t stay bad. The skeptic will feel validated, and will be eager to read more reasons as to why AI sucks. I’m neither a skeptic nor a fanatic. I see AI more neutrally, as a tool, and from that viewpoint I make the following two observations: AI is bad. It is often incorrect, expensive, racist, trained on data without knowledge or consent, environmentally unfriendly, disruptive to society, etc. AI is useful. Despite the above shortcomings there are tasks for which AI is cheap and...
7 months ago


More from orlp.net - Blog Archive

Breaking CityHash64, MurmurHash2/3, wyhash, and more...

Hash functions are incredibly neat mathematical objects. They can map arbitrary data to a small fixed-size output domain such that the mapping is deterministic, yet appears to be random. This “deterministic randomness” is incredibly useful for a variety of purposes, such as hash tables, checksums, monte carlo algorithms, communication-less distributed algorithms, etc, the list goes on. In this article we will take a look at the dark side of hash functions: when things go wrong. Luckily this essentially never happens due to unlucky inputs in the wild (for good hash functions, at least). However, people exist, and some of them may be malicious. Thus we must look towards computer security for answers. I will quickly explain some of the basics of hash function security and then show how easy it is to break this security for some commonly used non-cryptographic hash functions. As a teaser, this article explains how you can generate strings such as these, thousands per second: cityhash64("orlp-cityhash64-D-:K5yx*zkgaaaaa") == 1337 murmurhash2("orlp-murmurhash64-bkiaaa&JInaNcZ") == 1337 murmurhash3("orlp-murmurhash3_x86_32-haaaPa*+") == 1337 farmhash64("orlp-farmhash64-/v^CqdPvziuheaaa") == 1337 I also show how you can create some really funky pairs of strings that can be concatenated arbitrarily such that when concatenating $k$ strings together any of the $2^k$ combinations all have the same hash output, regardless of the seed used for the hash function: a = "xx0rlpx!xxsXъВ" b = "xxsXъВxx0rlpx!" murmurhash2(a + a, seed) == murmurhash2(a + b, seed) murmurhash2(a + a, seed) == murmurhash2(b + a, seed) murmurhash2(a + a, seed) == murmurhash2(b + b, seed) a = "!&orlpՓ" b = "yǏglp$X" murmurhash3(a + a, seed) == murmurhash3(a + b, seed) murmurhash3(a + a, seed) == murmurhash3(b + a, seed) murmurhash3(a + a, seed) == murmurhash3(b + b, seed) Hash function security basics Hash functions play a critical role in computer security. 
Hash functions are used not only to verify messages over secure channels, they are also used to identify trusted updates as well as known viruses. Virtually every signature scheme ever used starts with a hash function. If a hash function does not behave randomly, we can break the above security constructs. Cryptographic hash functions thus take the randomness aspect very seriously. The ideal hash function would choose an output completely at random for each input, remembering that choice for future calls. This is called a random oracle. The problem is that a random oracle requires a true random number generator, and more problematically, a globally accessible infinite memory bank. So we approximate it using deterministic hash functions instead. These compute their output by essentially shuffling their input really, really well, in such a way that it is not feasible to reverse. To help quantify whether a specific function does a good job of approximating a random oracle, cryptographers came up with a variety of properties that a random oracle would have. The three most important and well-known properties a secure cryptographic hash function should satisfy are: Pre-image resistance. For some constant $c$ it should be hard to find some input $m$ such that $h(m) = c$. Second pre-image resistance. For some input $m_1$ it should be hard to find another input $m_2 \neq m_1$ such that $h(m_1) = h(m_2)$. Collision resistance. It should be hard to find distinct inputs $m_1, m_2$ such that $h(m_1) = h(m_2)$. Note that collision resistance implies second pre-image resistance which in turn implies pre-image resistance. Conversely, a pre-image attack breaks all three properties. We generally consider one of these properties broken if there exists a method that produces a collision or pre-image faster than simply trying random inputs (also known as a brute force attack). 
However, there are definitely gradations in breakage, as some methods are only several orders of magnitude faster than brute force. That may sound like a lot, but a method taking $2^{110}$ steps and one taking $2^{128}$ steps are both equally out of reach for today’s computers. MD5 used to be a common hash function, and SHA-1 is still in common use today. While both were considered cryptographically secure at one point, generating MD5 collisions now takes less than a second on a modern PC. In 2017 a collaboration of researchers from CWI and Google announced the first SHA-1 collision. However, as far as I’m aware, neither MD5 nor SHA-1 has practical (second) pre-image attacks, only theoretical ones. Non-cryptographic hash functions Cryptographically secure hash functions tend to have a small problem: they’re slow. Modern hash functions such as BLAKE3 resolve this somewhat by heavily vectorizing the hash using SIMD instructions, as well as parallelizing over multiple threads, but even then they require large input sizes before reaching those speeds. One particular use-case for hash functions is deriving a secret key from a password: a key derivation function. Unlike regular hash functions, being slow is actually a safety feature here to protect against brute forcing passwords. Modern ones such as Argon2 also intentionally use a lot of memory for protection against specialized hardware such as ASICs or FPGAs. A lot of problems don’t necessarily require secure hash functions, and people would much prefer a faster hash speed, especially when computing many small hashes, such as in a hash table. Let’s take a look at what common hash table implementations actually use as their hash for strings: C++: there are multiple standard library implementations, but 64-bit clang 13.0.0 on Apple M1 ships CityHash64. Currently libstdc++ ships MurmurHash64A, a variant of Murmur2 for 64-bit platforms. 
Java: OpenJDK uses an incredibly simple hash algorithm, which essentially just computes h = 31 * h + c for each character c. PHP: the Zend engine uses essentially the same algorithm as Java, just using unsigned integers and 33 as its multiplier. Nim: it used to use MurmurHash3_x86_32. While writing this article they appeared to have switched to use farmhash by default. Zig: it uses wyhash by default, with 0 as seed. Javascript: in V8 they use a custom weak string hash, with a randomly initialized seed. There were some that used stronger hashes by default as well: Go uses an AES-based hash if hardware acceleration is available on x86-64. Even though its construction is custom and likely not full-strength cryptographically secure, breaking it is too much effort and quite possibly beyond my capabilities. If not available, it uses an algorithm inspired by wyhash. Python and Rust use SipHash by default, which is a cryptographically secure pseudorandom function. This is effectively a hash function where you’re allowed to use a secret key during hashing, unlike a hash like SHA-2 where everyone knows all information involved. This latter concept is actually really important, at least for protecting against HashDoS in hash tables. Even if a hash function is perfectly secure over its complete output, hash tables further reduce the output to only a couple bits to find the data it is looking for. For a static hash function without any randomness it’s possible to produce large lists of hashes that collide post-reduction, just by brute force. But for non-cryptographic hashes as we’ll see here we often don’t need brute force and can generate collisions at high speed for the full output, if not randomized by a random seed. Interlude: inverse operations Before we get to breaking some of the above hash functions, I must explain a basic technique I will use a lot: the inverting of operations. 
We are first exposed to this in primary school, where we might get faced by a question such as “$2 + x = 10$”. There we learn subtraction is the inverse of addition, such that we may find $x$ by computing $10 - 2 = 8$. Most operations on the integer registers in computers are also invertible, despite the integers being reduced modulo $2^{w}$ in the case of overflow. Let us study some: Addition can be inverted using subtraction. That is, x += y can be inverted using x -= y. Seems obvious enough. Multiplication by a constant $c$ is not inverted by division. This would not work in the case of overflow. Instead, we calculate the modular multiplicative inverse of $c$. This is an integer $c^{-1}$ such that $c \cdot c^{-1} \equiv 1 \pmod {m}$. Then we invert multiplication by $c$ simply by multiplying by $c^{-1}$. This constant exists if and only if $c$ is coprime with our modulus $m$, which for us means that $c$ must be odd as $m = 2^n$. For example, multiplication by $2$ is not invertible, which is easy to see as such, as it is equivalent to a bit shift to the left by one position, losing the most significant bit forever. Without delving into the details, here is a snippet of Python code that computes the modular multiplicative inverse of an integer using the extended Euclidean algorithm by calculating $x, y$ such that $$cx + my = \gcd(c, m).$$ Then, because $c$ is coprime we find $\gcd(c, m) = 1$, which means that $$cx + 0 \equiv 1 \pmod m,$$ and thus $x = c^{-1}$. def egcd(a, b): if a == 0: return (b, 0, 1) g, y, x = egcd(b % a, a) return (g, x - (b // a) * y, y) def modinv(c, m): g, x, y = egcd(c, m) assert g == 1, "c, m must be coprime" return x % m Using this we can invert modular multiplication: >>> modinv(17, 2**32) 4042322161 >>> 42 * 17 * 4042322161 % 2**32 42 Magic! XOR can be inverted using… XOR. It is its own inverse. So x ^= y can be inverted using x ^= y. 
Bit shifts cannot be inverted, but two common operations in hash functions that use bit shifts can be. The first is bit rotation by a constant. This is best explained visually, for example a bit rotation to the left by 3 places on an 8-bit word, where each bit is shown as a letter: abcdefgh defghabc The formula for a right-rotation of k places is (x >> k) | (x << (w - k)), where w is the width of the integer type. Its inverse is a left-rotation, which simply swaps the direction of both shifts. Alternatively, the inverse of a right-rotation of k places is another right-rotation of w-k places. Another common operation in hash functions is the “xorshift”. It is an operation of one of the following forms, with $k > 0$: x ^= x << k // Left xorshift. x ^= x >> k // Right xorshift. How to invert it is entirely analogous between the two, so I will focus on the left xorshift. An important observation is that the least significant $k$ bits are left entirely untouched by the xorshift. Thus by repeating the operation, we recover the least significant $2k$ bits, as the XOR will invert itself for the next $k$ bits. Let’s take a look at the resulting value to see how we should proceed: v0 = (x << k) ^ x // Apply first step of inverse v1 = v0 ^ (v0 << k). v1 = (x << 2*k) ^ (x << k) ^ (x << k) ^ x // Simplify using self-inverse (x << k) ^ (x << k) = 0. v1 = (x << 2*k) ^ x From this we can conclude the following identity: $$\operatorname{xorshift}(\operatorname{xorshift}(x, k), k) = \operatorname{xorshift}(x, 2k)$$ Now we only need one more observation to complete our algorithm: a xorshift of $k \geq w$ where $w$ is the width of our integer is a no-op. Thus we repeatedly apply our doubling identity until we reach large enough $q$ such that $\operatorname{xorshift}(x, 2^q \cdot k) = x$. For example, to invert a left xorshift by 13 for 64-bit integers we apply the following sequence: x ^= x << 13 // Left xorshift by 13. x ^= x << 13 // Inverse step 1. x ^= x << 26 // Inverse step 2. 
x ^= x << 52 // Inverse step 3. // x ^= x << 104 // Next step would be a no-op. Armed with this knowledge, we can now attack. Breaking CityHash64 Let us take a look at (part of) the source code of CityHash64 from libcxx that’s used for hashing strings on 64-bit platforms: C++ standard library code goes through a process known as 'uglification', which prepends underscores to all identifiers. This is because those identifiers are reserved by the standard to only be used in the standard library, and thus won't be replaced by macros from standards-compliant code. For your sanity's sake I removed them here. static const uint64_t mul = 0x9ddfea08eb382d69ULL; static const uint64_t k0 = 0xc3a5c85c97cb3127ULL; static const uint64_t k1 = 0xb492b66fbe98f273ULL; static const uint64_t k2 = 0x9ae16a3b2f90404fULL; static const uint64_t k3 = 0xc949d7c7509e6557ULL; template<class T> T loadword(const void* p) { T r; std::memcpy(&r, p, sizeof(r)); return r; } uint64_t rotate(uint64_t val, int shift) { if (shift == 0) return val; return (val >> shift) | (val << (64 - shift)); } uint64_t hash_len_16(uint64_t u, uint64_t v) { uint64_t x = u ^ v; x *= mul; x ^= x >> 47; uint64_t y = v ^ x; y *= mul; y ^= y >> 47; y *= mul; return y; } uint64_t hash_len_17_to_32(const char *s, uint64_t len) { const uint64_t a = loadword<uint64_t>(s) * k1; const uint64_t b = loadword<uint64_t>(s + 8); const uint64_t c = loadword<uint64_t>(s + len - 8) * k2; const uint64_t d = loadword<uint64_t>(s + len - 16) * k0; return hash_len_16( rotate(a - b, 43) + rotate(c, 30) + d, a + rotate(b ^ k3, 20) - c + len ); } To break this, let’s assume we’ll always give length 32 inputs. Then the implementation will always call hash_len_17_to_32, and we have full control over variables a, b, c and d by changing our input. Note that d is only used once, in the final expression. This makes it a prime target for attacking the hash. We will choose a, b and c arbitrarily, and then solve for d to compute a desired hash outcome. 
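To make the plan concrete, here is a minimal Python sketch of the 32-byte path. It is a port written for illustration, not code from libc++; it assumes a little-endian layout, and the helper names `partial_u_and_v` and `solve_d_word` are hypothetical. It only covers the first half of the attack: because d enters the first argument of hash_len_16 through a single addition, any desired value of that argument can be reached by solving for the 8 bytes at offset 16.

```python
import struct

MASK = (1 << 64) - 1
MUL = 0x9ddfea08eb382d69
K0 = 0xc3a5c85c97cb3127
K1 = 0xb492b66fbe98f273
K2 = 0x9ae16a3b2f90404f
K3 = 0xc949d7c7509e6557
K0_INV = pow(K0, -1, 1 << 64)  # modular inverse, needs Python 3.8+


def rotate(val, shift):
    """Rotate a 64-bit value right by `shift` bits (0 < shift < 64)."""
    return ((val >> shift) | (val << (64 - shift))) & MASK


def hash_len_16(u, v):
    x = ((u ^ v) * MUL) & MASK
    x ^= x >> 47
    y = ((v ^ x) * MUL) & MASK
    y ^= y >> 47
    return (y * MUL) & MASK


def partial_u_and_v(s):
    """Return (u without the d term, v) for a 32-byte input."""
    w0, w1, _, w3 = struct.unpack("<4Q", s)  # word at offset 16 is ignored
    a = (w0 * K1) & MASK
    b = w1
    c = (w3 * K2) & MASK
    u_partial = (rotate((a - b) & MASK, 43) + rotate(c, 30)) & MASK
    v = (a + rotate(b ^ K3, 20) - c + 32) & MASK
    return u_partial, v


def cityhash64_len32(s):
    """Forward hash for inputs of exactly 32 bytes."""
    assert len(s) == 32
    u_partial, v = partial_u_and_v(s)
    d = (struct.unpack_from("<Q", s, 16)[0] * K0) & MASK
    return hash_len_16((u_partial + d) & MASK, v)


def solve_d_word(s, u_target):
    """Return a copy of s whose bytes 16..24 make u equal u_target."""
    u_partial, _ = partial_u_and_v(s)
    word = ((u_target - u_partial) * K0_INV) & MASK
    return s[:16] + struct.pack("<Q", word) + s[24:]
```

What remains is to pick the right `u_target`, which means inverting hash_len_16 itself.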
Using the above modinv function we first compute the necessary modular multiplicative inverses of mul and k0: >>> 0x9ddfea08eb382d69 * 0xdc56e6f5090b32d9 % 2**64 1 >>> 0xc3a5c85c97cb3127 * 0x81bc9c5aa9c72e97 % 2**64 1 We also note that in this case the xorshift is easy to invert, as x ^= x >> 47 is simply its own inverse. Having all the components ready, we can invert the function step by step. We first load a, b and c like in the hash function, and compute uint64_t v = a + rotate(b ^ k3, 20) - c + len; which is the second parameter to hash_len_16. Then, starting from our desired return value of hash_len_16(u, v) we work backwards step by step, inverting each operation to find the function argument u that would result in our target hash. Once we have found this unique u, we compute our required input d. Putting it all together: static const uint64_t mul_inv = 0xdc56e6f5090b32d9ULL; static const uint64_t k0_inv = 0x81bc9c5aa9c72e97ULL; void cityhash64_preimage32(uint64_t hash, char *s) { const uint64_t len = 32; const uint64_t a = loadword<uint64_t>(s) * k1; const uint64_t b = loadword<uint64_t>(s + 8); const uint64_t c = loadword<uint64_t>(s + len - 8) * k2; uint64_t v = a + rotate(b ^ k3, 20) - c + len; // Invert hash_len_16(u, v). Original operation inverted // at each step is shown on the right, note that it is in // the inverse order of hash_len_16. uint64_t y = hash; // return y; y *= mul_inv; // y *= mul; y ^= y >> 47; // y ^= y >> 47; y *= mul_inv; // y *= mul; uint64_t x = y ^ v; // uint64_t y = v ^ x; x ^= x >> 47; // x ^= x >> 47; x *= mul_inv; // x *= mul; uint64_t u = x ^ v; // uint64_t x = u ^ v; // Find loadword<uint64_t>(s + len - 16). uint64_t d = u - rotate(a - b, 43) - rotate(c, 30); d *= k0_inv; std::memcpy(s + len - 16, &d, sizeof(d)); } The chance that a random uint64_t forms 8 printable ASCII bytes is $\left(94/256\right)^8 \approx 0.033\%$. 
Not great, but cityhash64_preimage32 is so fast that having to repeat it on average ~3000 times to get a purely ASCII result isn’t so bad. For example, the following 10 strings all hash to 1337 using CityHash64, generated using this code: orlp-cityhash64-D-:K5yx*zkgaaaaa orlp-cityhash64-TXb7;1j&btkaaaaa orlp-cityhash64-+/LM$0 ;msnaaaaa orlp-cityhash64-u'f&>I'~mtnaaaaa orlp-cityhash64-pEEv.LyGcnpaaaaa orlp-cityhash64-v~~bm@,Vahtaaaaa orlp-cityhash64-RxHr_&~{miuaaaaa orlp-cityhash64-is_$34#>uavaaaaa orlp-cityhash64-$*~l\{S!zoyaaaaa orlp-cityhash64-W@^5|3^:gtcbaaaa I’ve noticed there are variants of CityHash64 with subtle differences in the wild. I chose to attack the variant shipped with libc++, so it should work for std::hash there, for example. I also assume a little-endian machine throughout this article; your mileage may vary on a big-endian machine depending on the hash implementation. Breaking MurmurHash2 We can’t let libstdc++ get away after targeting libc++, can we? The default string hash calls an implementation of MurmurHash2 with seed 0xc70f6907. The hash, simplified to only handle strings whose lengths are multiples of 8, is as follows: uint64_t murmurhash64a(const char* s, size_t len, uint64_t seed) { const uint64_t mul = 0xc6a4a7935bd1e995ULL; uint64_t hash = seed ^ (len * mul); for (const char* p = s; p != s + len; p += 8) { uint64_t data = loadword<uint64_t>(p); data *= mul; data ^= data >> 47; data *= mul; hash ^= data; hash *= mul; } hash ^= hash >> 47; hash *= mul; hash ^= hash >> 47; return hash; } We can take a similar approach here as before. We note that the modular multiplicative inverse of 0xc6a4a7935bd1e995 mod $2^{64}$ is 0x5f7a0ea7e59b19bd. 
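As a small sanity check of that claim, the per-word mixing step of MurmurHash64A and its inverse can be sketched in Python. The function names `mix` and `unmix` are mine, and the inverse constant is computed with the built-in three-argument `pow` (Python 3.8+) rather than hard-coded, so it can be compared against the value quoted above.

```python
MASK = (1 << 64) - 1
MUL = 0xc6a4a7935bd1e995
MUL_INV = pow(MUL, -1, 1 << 64)  # modular multiplicative inverse mod 2**64


def mix(data):
    """The three operations applied to each 8-byte word before it is XORed in."""
    data = (data * MUL) & MASK
    data ^= data >> 47
    data = (data * MUL) & MASK
    return data


def unmix(data):
    """Undo mix() by inverting its steps in reverse order."""
    data = (data * MUL_INV) & MASK
    data ^= data >> 47  # self-inverse, since 2 * 47 >= 64
    data = (data * MUL_INV) & MASK
    return data


print(hex(MUL_INV))
```

Since `unmix(mix(x)) == x` for every 64-bit `x`, an attacker can pick whatever post-mix value is convenient and work out the input bytes afterwards.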
As an example, we can choose the first 24 bytes arbitrarily, and solve for the last 8 bytes: void murmurhash64a_preimage32(uint64_t hash, char* s, uint64_t seed) { const uint64_t mul = 0xc6a4a7935bd1e995ULL; const uint64_t mulinv = 0x5f7a0ea7e59b19bdULL; // Compute the hash state for the first 24 bytes as normal. uint64_t state = seed ^ (32 * mul); for (const char* p = s; p != s + 24; p += 8) { uint64_t data = loadword<uint64_t>(p); data *= mul; data ^= data >> 47; data *= mul; state ^= data; state *= mul; } // Invert target hash transformation. // return hash; hash ^= hash >> 47; // hash ^= hash >> 47; hash *= mulinv; // hash *= mul; hash ^= hash >> 47; // hash ^= hash >> 47; // Invert last iteration for last 8 bytes. hash *= mulinv; // hash *= mul; uint64_t data = state ^ hash; // hash = hash ^ data; data *= mulinv; // data *= mul; data ^= data >> 47; // data ^= data >> 47; data *= mulinv; // data *= mul; std::memcpy(s + 24, &data, 8); // data = loadword<uint64_t>(s); } The following 10 strings all hash to 1337 using MurmurHash64A with the default seed 0xc70f6907, generated using this code: orlp-murmurhash64-bhbaaat;SXtgVa orlp-murmurhash64-bkiaaa&JInaNcZ orlp-murmurhash64-ewmaaa(%J+jw>j orlp-murmurhash64-vxpaaag"93\Yj5 orlp-murmurhash64-ehuaaafa`Wp`/| orlp-murmurhash64-yizaaa1x.zQF6r orlp-murmurhash64-lpzaaaZphp&c F orlp-murmurhash64-wsjbaa771rz{z< orlp-murmurhash64-rnkbaazy4X]p>B orlp-murmurhash64-aqnbaaZ~OzP_Tp Universal collision attack on MurmurHash64A In fact, MurmurHash64A is so weak that Jean-Philippe Aumasson, Daniel J. Bernstein and Martin Boßlet published an attack that creates sets of strings which collide regardless of the random seed used. To be fair to CityHash64… just kidding, they found universal collisions against it as well, regardless of seed used. CityHash64 is actually much easier to break in this way, as simply doing the above pre-image attack targeting 0 as hash makes the output purely dependent on the seed, and thus a universal collision. 
To see how it works, let’s take a look at the core loop of MurmurHash64A: uint64_t data = loadword<uint64_t>(p); data *= mul; // Trivially invertible. data ^= data >> 47; // Trivially invertible. data *= mul; // Trivially invertible. state ^= data; state *= mul; We know we can trivially invert the operations done on data regardless of what the current state is, so we might as well have had the following body: state ^= data; state *= mul; Now the hash starts looking rather weak indeed. The clever trick they employ is creating two strings simultaneously, such that they differ precisely in the top bit in each 8-byte word. Why the top bit? >>> 1 << 63 9223372036854775808 >>> (1 << 63) * mul % 2**64 9223372036854775808 Since mul is odd, its least significant bit is set. Multiplying 1 << 63 by it is equivalent to shifting that bit 63 places to the left, which is once again 1 << 63. That is, 1 << 63 is a fixed point for the state *= mul operation. We also note that for the top bit XOR is equivalent to addition, as the overflow from addition is removed mod $2^{64}$. So suppose we have two input strings, one starting with the 8 bytes data, and the other starting with data ^ (1 << 63) == data + (1 << 63) (after doing the trivial inversions). We then find that the two states, regardless of seed, differ exactly in the top bit after state ^= data. After multiplication we find we have two states x * mul and (x + (1 << 63)) * mul == x * mul + (1 << 63)… which again differ exactly in the top bit! We are now back to state ^= data in our iteration, for the next 8 bytes. We can now use this moment to cancel our top bit difference, by again feeding two 8-byte strings that differ in the top bit (after inverting). In fact, we only have to find one pair of such strings that differ in the top bit, which we can then repeat twice (in either order) to cancel our difference again. 
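The argument above can be checked numerically with a short Python sketch. It models only the simplified loop body (state ^= data; state *= mul), treating the words as already passed through the invertible mixing, and ignores the finalizer, which is applied identically to both states. The names `step` and `run` are hypothetical helpers.

```python
import random

MASK = (1 << 64) - 1
MUL = 0xc6a4a7935bd1e995
TOP = 1 << 63


def step(state, data):
    """Simplified loop body: XOR the (already mixed) word in, then multiply."""
    state ^= data
    return (state * MUL) & MASK


def run(seed, words):
    """Feed a sequence of mixed words through the simplified loop."""
    state = seed
    for w in words:
        state = step(state, w)
    return state


x = 0x0123456789abcdef  # arbitrary mixed word
y = x ^ TOP             # its partner, differing only in the top bit

for _ in range(1000):
    seed = random.getrandbits(64)
    # After one word the two states differ in exactly the top bit...
    assert run(seed, [x]) ^ run(seed, [y]) == TOP
    # ...and a second word from the pair cancels the difference again.
    assert run(seed, [x, y]) == run(seed, [y, x])
```

The seed never enters the reasoning, which is why the resulting collisions hold for every seed.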
When represented as a uint64_t if we choose the first string as x we can derive the second string as x *= mul; // Forward transformation... x ^= x >> 47; // ... x *= mul; // ... x ^= uint64_t(1) << 63; // Difference in top bit. x *= mulinv; // Backwards transformation... x ^= x >> 47; // ... x *= mulinv; // ... I was unable to find a printable ASCII string that has another printable ASCII string as its partner. But I was able to find the following pair of 8-byte UTF-8 strings that differ in exactly the top bit after the MurmurHash64A input transformation: xx0rlpx! xxsXъВ Combining them as such gives two 16-byte strings that when fed through the hash algorithm manipulate the state in the same way: a collision. xx0rlpx!xxsXъВ xxsXъВxx0rlpx! But it doesn’t stop there. By concatenating these two strings we can create $2^n$ different colliding strings each $16n$ bytes long. With the current libstdc++ implementation the following prints the same number eight times: std::hash<std::u8string> h; std::u8string a = u8"xx0rlpx!xxsXъВ"; std::u8string b = u8"xxsXъВxx0rlpx!"; std::cout << h(a + a + a) << "\n"; std::cout << h(a + a + b) << "\n"; std::cout << h(a + b + a) << "\n"; std::cout << h(a + b + b) << "\n"; std::cout << h(b + a + a) << "\n"; std::cout << h(b + a + b) << "\n"; std::cout << h(b + b + a) << "\n"; std::cout << h(b + b + b) << "\n"; Even if libstdc++ were to randomize the seed used by MurmurHash64A, the strings would still collide. Breaking MurmurHash3 Nim used to use MurmurHash3_x86_32, so let’s try to break that. 
If we once again simplify to strings whose lengths are a multiple of 4 we get the following code: uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); } uint32_t murmurhash3_x86_32(const char* s, int len, uint32_t seed) { const uint32_t c1 = 0xcc9e2d51; const uint32_t c2 = 0x1b873593; const uint32_t c3 = 0x85ebca6b; const uint32_t c4 = 0xc2b2ae35; uint32_t h = seed; for (const char* p = s; p != s + len; p += 4) { uint32_t k = loadword<uint32_t>(p); k *= c1; k = rotl32(k, 15); k *= c2; h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64; } h ^= len; h ^= h >> 16; h *= c3; h ^= h >> 13; h *= c4; h ^= h >> 16; return h; } I think by now you should be able to get this function to spit out any value you want if you know the seed. The inverse of rotl32(x, r) is rotl32(x, 32-r) and the inverse of h ^= h >> 16 is once again just h ^= h >> 16. Only h ^= h >> 13 is a bit different, it’s the first time we’ve seen that a xorshift’s inverse has more than one step: h ^= h >> 13 h ^= h >> 26 Compute the modular inverses of c1 through c4 as well as 5 mod $2^{32}$, and go to town. If you want to cheat or check your answer, you can check out the code I’ve used to generate the following ten strings that all hash to 1337 when fed to MurmurHash3_x86_32 with seed 0: orlp-murmurhash3_x86_32-haaaPa*+ orlp-murmurhash3_x86_32-saaaUW&< orlp-murmurhash3_x86_32-ubaa/!/" orlp-murmurhash3_x86_32-weaare]] orlp-murmurhash3_x86_32-chaa5@/} orlp-murmurhash3_x86_32-claaM[,5 orlp-murmurhash3_x86_32-fraaIx`N orlp-murmurhash3_x86_32-iwaara&< orlp-murmurhash3_x86_32-zwaa]>zd orlp-murmurhash3_x86_32-zbbaW-5G Nim uses 0 as a fixed seed. You might wonder about the ethics of publishing functions for generating arbitrary amounts of collisions for hash functions actually in use today. I did consider holding back. But HashDoS has been a known attack for almost two decades now, and the universal hash collisions I’ve shown were also published more than a decade ago now as well. 
At some point you’ve had enough time to, uh, fix your shit. Universal collision attack on MurmurHash3 Suppose that Nim didn’t use 0 as a fixed seed, but chose a randomly generated one. Can we do a similar attack as the one done to MurmurHash2 to still generate universal multicollisions? Yes we can. Let’s take another look at that core loop body: uint32_t k = loadword<uint32_t>(p); k *= c1; // Trivially invertible. k = rotl32(k, 15); // Trivially invertible. k *= c2; // Trivially invertible. h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64; Once again we can ignore the first three trivially invertible instructions as we can simply choose our input so that we get exactly the k we want. Remember from last time that we want to introduce a difference in exactly the top bit of h, as the multiplication will leave this difference in place. But here there is a bit rotation between the XOR and the multiplication. The solution? Simply place our bit difference such that rotl32(h, 13) shifts it into the top position. Does the addition of 0xe6546b64 mess things up? No. Since only the top bit between the two states will be different, there is a difference of exactly $2^{31}$ between the two states. This difference is maintained by the addition. Since two 32-bit numbers with the same top bit can be at most $2^{31} - 1$ apart, we can conclude that the two states still differ in the top bit after the addition. So we want to find two pairs of 32-bit ints, such that after applying the first three instructions the first pair differs in bit 1 << (31 - 13) == 0x00040000 and the second pair in bit 1 << 31 == 0x80000000. After some brute-force searching I found some cool pairs (again forced to use UTF-8), which when combined give the following collision: a = "!&orlpՓ" b = "yǏglp$X" As before, any concatenation of as and bs of length n collides with all other combinations of length n. Breaking FarmHash64 Nim switched to farmhash since I started writing this post. 
To break it we can notice that its structure is very similar to CityHash64, so we can use those same techniques again. In fact, the only changes between the two for lengths 17-32 bytes is that a few operators were changed from subtraction/XOR to addition, a rotation operator had its constant tweaked, and some k constants are slightly tweaked in usage. The process of breaking it is so similar that it’s entirely analogous, so we can skip straight to the result. These 10 strings all hash to 1337 with FarmHash64: orlp-farmhash64-?VrJ@L7ytzwheaaa orlp-farmhash64-p3`!SQb}fmxheaaa orlp-farmhash64-pdt'cuI\gvxheaaa orlp-farmhash64-IbY`xAG&ibkieaaa orlp-farmhash64-[_LU!d1hwmkieaaa orlp-farmhash64-QiY!clz]bttieaaa orlp-farmhash64-&?J3rZ_8gsuieaaa orlp-farmhash64-LOBWtm5Szyuieaaa orlp-farmhash64-Mptaa^g^ytvieaaa orlp-farmhash64-B?&l::hxqmfjeaaa Trivial fixed-seed wyhash multicollisions Zig uses wyhash with a fixed seed of zero. While I was unable to do seed-independent attacks against wyhash, using it with a fixed seed makes generating collisions trivial. Wyhash is built upon the folded multiply, which takes two 64-bit inputs, multiplies them to a 128-bit product before XORing together the two halves: uint64_t folded_multiply(uint64_t a, uint64_t b) { __uint128_t full = __uint128_t(a) * __uint128_t(b); return uint64_t(full) ^ uint64_t(full >> 64); } It’s easy to immediately see a critical flaw with this: if one of the two sides is zero, the output will also always be zero. To protect against this, wyhash always uses a folded multiply in the following form: out = folded_multiply(input_a ^ secret_a, input_b ^ secret_b); where secret_a and secret_b are determined by the seed, or outputs of previous iterations which are influenced by the seed. However, when your seed is constant… With a bit of creativity we can use the start of our string to prepare a ‘secret’ value which we can perfectly cancel with another ASCII string later in the input. 
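The zero weakness and the XOR guard are easy to see in a few lines of Python. This is only a sketch of the building block described above, not of wyhash itself; the secret values below are made up for illustration and are not wyhash's actual constants, and `guarded` is a hypothetical helper name.

```python
MASK = (1 << 64) - 1


def folded_multiply(a, b):
    """Multiply two 64-bit values to 128 bits and XOR the two halves."""
    full = a * b
    return (full & MASK) ^ (full >> 64)


def guarded(input_a, input_b, secret_a, secret_b):
    """The guarded form: inputs are XORed with secrets before multiplying."""
    return folded_multiply(input_a ^ secret_a, input_b ^ secret_b)


# Hypothetical secrets, standing in for values derived from a fixed seed.
SECRET_A = 0x1122334455667788
SECRET_B = 0x99aabbccddeeff00

# An attacker who knows SECRET_A sends it as input_a. The left operand
# becomes zero, so input_b no longer influences the output at all.
outputs = {guarded(SECRET_A, b, SECRET_A, SECRET_B) for b in range(1000)}
print(outputs)
```

With a random seed the attacker cannot predict the secrets, which is what the guard relies on; with a constant seed they are effectively public.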
So, without further ado, every 32-byte string of the form orlp-wyhash-oGf_________tWJbzMJR hashes to the same value with Zig’s default hasher. Zig uses a different set of parameters than the defaults found in the wyhash repository, so for good measure, this pattern provides arbitrary multicollisions for the default parameters found in wyhash when using seed == 0: orlp-wyhash-EUv_________NLXyytkp Conclusion We’ve seen that a lot of the hash functions in common use in hash tables today are very weak, allowing fairly trivial attacks to produce arbitrary amounts of collisions if not randomly initialized. Using a randomly seeded hash table is paramount if you don’t wish to become a victim of a hash flooding attack. We’ve also seen that some hash functions are vulnerable to attack even if randomly seeded. These are completely broken and should not be used if attacks are a concern at all. Luckily I was unable to find such attacks against most hashes, but the possibility of such an attack existing is quite unnerving. With universal hashing it’s possible to construct hash functions for which such an attack is provably impossible, last year I published a hash function called polymur-hash that has this property. Your HTTPS connection to this website also likely uses a universal hash function for authenticity of the transferred data, both Poly1305 and GCM are based on universal hashing for their security proofs. Well, such attacks are provably impossible against non-interactive attackers, everything goes out of the window again when an attacker is allowed to inspect the output hashes and use that to try and guess your secret key. Of course, if your data is not user-controlled, or there is no reasonable security model where your application would face attacks, you can get away with faster and insecure hashes. More to come on the subject of hashing and hash tables and how it can go right or wrong, but for now this article is long enough as-is…

10 months ago 77 votes
Taming Floating-Point Sums

Suppose you have an array of floating-point numbers, and wish to sum them. You might naively think you can simply add them, e.g. in Rust: fn naive_sum(arr: &[f32]) -> f32 { let mut out = 0.0; for x in arr { out += *x; } out } This however can easily result in an arbitrarily large accumulated error. Let’s try it out: naive_sum(&vec![1.0; 1_000_000]) = 1000000.0 naive_sum(&vec![1.0; 10_000_000]) = 10000000.0 naive_sum(&vec![1.0; 100_000_000]) = 16777216.0 naive_sum(&vec![1.0; 1_000_000_000]) = 16777216.0 Uh-oh… What happened? When you compute $a + b$ the result must be rounded to the nearest representable floating-point number, breaking ties towards the number with an even mantissa. The problem is that the next 32-bit floating-point number after 16777216 is 16777218. In this case that means 16777216 + 1 rounds back to 16777216 again. We’re stuck. Luckily, there are better ways to sum an array. Pairwise summation A method that’s a bit more clever is to use pairwise summation. Instead of a completely linear sum with a single accumulator it recursively sums an array by splitting the array in half, summing the halves, and then adding the sums. fn pairwise_sum(arr: &[f32]) -> f32 { if arr.len() == 0 { return 0.0; } if arr.len() == 1 { return arr[0]; } let (first, second) = arr.split_at(arr.len() / 2); pairwise_sum(first) + pairwise_sum(second) } This is more accurate: pairwise_sum(&vec![1.0; 1_000_000]) = 1000000.0 pairwise_sum(&vec![1.0; 10_000_000]) = 10000000.0 pairwise_sum(&vec![1.0; 100_000_000]) = 100000000.0 pairwise_sum(&vec![1.0; 1_000_000_000]) = 1000000000.0 However, this is rather slow. To get a summation routine that goes as fast as possible while still being reasonably accurate we should not recurse down all the way to length-1 arrays, as this gives too much call overhead. We can still use our naive sum for small sizes, and only recurse on large sizes. 
This does make our worst-case error worse by a constant factor, but in turn makes the pairwise sum almost as fast as a naive sum. By choosing the splitpoint as a multiple of 256 we ensure that the base case in the recursion always has exactly 256 elements except on the very last block. This makes sure we use the most optimal reduction and always correctly predict the loop condition. This small detail ended up improving the throughput by 40% for large arrays! fn block_pairwise_sum(arr: &[f32]) -> f32 { if arr.len() > 256 { let split = (arr.len() / 2).next_multiple_of(256); let (first, second) = arr.split_at(split); block_pairwise_sum(first) + block_pairwise_sum(second) } else { naive_sum(arr) } } Kahan summation The worst-case round-off error of naive summation scales with $O(n \epsilon)$ when summing $n$ elements, where $\epsilon$ is the machine epsilon of your floating-point type (here $2^{-24}$). Pairwise summation improves this to $O((\log n) \epsilon + n\epsilon^2)$. However, Kahan summation improves this further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely, leaving only the $\epsilon^2$ term which is negligible unless you sum a very large amount of numbers. All of these bounds scale with $\sum_i |x_i|$, so the worst-case absolute error bound is still quadratic in terms of $n$ even for Kahan summation. In practice all summation algorithms do significantly better than their worst-case bounds, as in most scenarios the errors do not exclusively round up or down, but cancel each other out on average. pub fn kahan_sum(arr: &[f32]) -> f32 { let mut sum = 0.0; let mut c = 0.0; for x in arr { let y = *x - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum } The Kahan summation works by maintaining the sum in two registers, the actual bulk sum and a small error correcting term $c$. If you were using infinitely precise arithmetic $c$ would always be zero, but with floating-point it might not be. 
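To make the role of the correction term $c$ concrete, here is a small self-contained sketch that repeats naive_sum and kahan_sum and runs them on an input chosen so that the naive accumulator gets stuck at 16777216. The input (one large value followed by 1000 ones) and the helper name stuck_input are assumptions made for illustration; they are not part of the benchmark discussed later.

```rust
/// Plain left-to-right accumulation, as in the article.
fn naive_sum(arr: &[f32]) -> f32 {
    let mut out = 0.0f32;
    for x in arr {
        out += *x;
    }
    out
}

/// Kahan summation, as in the article: `c` carries the rounding error
/// of the previous addition into the next one.
fn kahan_sum(arr: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut c = 0.0f32;
    for x in arr {
        let y = *x - c;
        let t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    sum
}

/// Hypothetical test input: 2^24 followed by 1000 ones.
fn stuck_input() -> Vec<f32> {
    let mut v = vec![16_777_216.0f32];
    v.extend(std::iter::repeat(1.0f32).take(1000));
    v
}
```

On this input naive_sum returns 16777216.0, because every 16777216 + 1 rounds back down, while kahan_sum returns 16778216.0, which is the exact sum.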
The downside is that each number now takes four operations to add to the sum instead of just one. To mitigate this we can do something similar to what we did with the pairwise summation. We can first accumulate blocks into sums naively before combining the block sums with Kahan summation to reduce overhead at the cost of accuracy: pub fn block_kahan_sum(arr: &[f32]) -> f32 { let mut sum = 0.0; let mut c = 0.0; for chunk in arr.chunks(256) { let x = naive_sum(chunk); let y = x - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum } Exact summation I know of at least two general methods to produce the correctly-rounded sum of a sequence of floating-point numbers. That is, each logically computes the sum with infinite precision before rounding it back to a floating-point value at the end. The first method is based on the 2Sum primitive which is an error-free transform from two numbers $x, y$ to $s, t$ such that $x + y = s + t$, where $t$ is a small error. By applying this repeatedly until the errors vanish you can get a correctly-rounded sum. Keeping track of what to add in what order can be tricky, and the worst-case requires $O(n^2)$ additions to make all the terms vanish. This is what’s implemented in Python’s math.fsum and in the Rust crate fsum which use extra memory to keep the partial sums around. The accurate crate also implements this using in-place mutation in i_fast_sum_in_place. Another method is to keep a large buffer of integers around, one per exponent. Then when adding a floating-point number you decompose it into an exponent and mantissa, and add the mantissa to the corresponding integer in the buffer. If the integer buf[i] overflows you increment the integer in buf[i + w], where w is the width of your integer. This can actually compute a completely exact sum, without any rounding at all, and is effectively just an overly permissive representation of a fixed-point number optimized for accumulating floats.
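The 2Sum primitive behind the first method can be sketched in a few lines. This is the classic branch-free formulation usually attributed to Knuth; the function name two_sum is an assumption for illustration and is not taken from any of the crates mentioned above.

```rust
/// Error-free transform: returns (s, t) where s is the rounded sum x + y
/// and t is the rounding error, so that x + y = s + t exactly
/// (assuming no overflow occurs).
fn two_sum(x: f32, y: f32) -> (f32, f32) {
    let s = x + y;
    // Recover what was "really" added for each operand.
    let x_virtual = s - y;
    let y_virtual = s - x_virtual;
    // The differences are the parts lost to rounding.
    let x_err = x - x_virtual;
    let y_err = y - y_virtual;
    (s, x_err + y_err)
}
```

For example, two_sum(16777216.0, 1.0) yields (16777216.0, 1.0): the rounded sum loses the 1.0, and the error term recovers it.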
This latter method is $O(n)$ time, but uses a large but constant amount of memory ($\approx$ 1 KB for f32, $\approx$ 16 KB for f64). An advantage of this method is that it’s also an online algorithm - both adding a number to the sum and getting the current total are amortized $O(1)$. A variant of this method is implemented in the accurate crate as OnlineExactSum, which uses floats instead of integers for the buffer. Unleashing the compiler Besides accuracy, there is another problem with naive_sum. The Rust compiler is not allowed to reorder floating-point additions, because floating-point addition is not associative. So it cannot autovectorize the naive_sum to use SIMD instructions to compute the sum, nor use instruction-level parallelism. To solve this there are compiler intrinsics in Rust that do float sums while allowing associativity, such as std::intrinsics::fadd_fast. However, these intrinsics are incredibly dangerous, as they assume that both the input and output are finite numbers (no infinities, no NaNs), or otherwise they are undefined behavior. This functionally makes them unusable, as only in the most restricted scenarios when computing a sum do you know that all inputs are finite numbers, and that their sum cannot overflow. I recently uttered my annoyance with these operators to Ben Kimock, and together we proposed (and he implemented) a new set of operators: std::intrinsics::fadd_algebraic and friends. I proposed we call the operators algebraic, as they allow (in theory) any transformation that is justified by real algebra. For example, substituting ${x - x \to 0}$, ${cx + cy \to c(x + y)}$, or ${x^6 \to (x^2)^3.}$ In general these operators are treated as-if they are done using real numbers, and can map to any set of floating-point instructions that would be equivalent to the original expression, assuming the floating-point instructions would be exact.
Note that the real numbers do not contain NaNs or infinities, so these operators assume those do not exist for the validity of transformations, however it is not undefined behavior when you do encounter those values. They also allow fused multiply-add instructions to be generated, as under real arithmetic $\operatorname{fma}(a, b, c) = ab + c.$ Using those new instructions it is trivial to generate an autovectorized sum: #![allow(internal_features)] #![feature(core_intrinsics)] use std::intrinsics::fadd_algebraic; fn naive_sum_autovec(arr: &[f32]) -> f32 { let mut out = 0.0; for x in arr { out = fadd_algebraic(out, *x); } out } If we compile with -C target-cpu=broadwell we see that the compiler automatically generated the following tight loop for us, using 4 accumulators and AVX2 instructions: .LBB0_5: vaddps ymm0, ymm0, ymmword ptr [rdi + 4*r8] vaddps ymm1, ymm1, ymmword ptr [rdi + 4*r8 + 32] vaddps ymm2, ymm2, ymmword ptr [rdi + 4*r8 + 64] vaddps ymm3, ymm3, ymmword ptr [rdi + 4*r8 + 96] add r8, 32 cmp rdx, r8 jne .LBB0_5 This will process 128 bytes of floating-point data (so 32 elements) in 7 instructions. Additionally, all the vaddps instructions are independent of each other as they accumulate to different registers. If we analyze this with uiCA we see that it estimates the above loop to take 4 cycles to complete, processing 32 bytes / cycle. At 4GHz that’s up to 128GB/s! Note that that’s way above what my machine’s RAM bandwidth is, so you will only achieve that speed when summing data that is already in cache. With this in mind we can also easily define block_pairwise_sum_autovec and block_kahan_sum_autovec by replacing their calls to naive_sum with naive_sum_autovec. Accuracy and speed Let’s take a look at how the different summation methods compare. As a relatively arbitrary benchmark, let’s sum 100,000 random floats ranging from -100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on my AMD Threadripper 2950x. 
All the code is available on Github. Compiled with RUSTFLAGS=-C target-cpu=native and --release I get the following results:

| Algorithm | Throughput | Mean absolute error |
|---|---|---|
| naive | 5.5 GB/s | 71.796 |
| pairwise | 0.9 GB/s | 1.5528 |
| kahan | 1.4 GB/s | 0.2229 |
| block_pairwise | 5.8 GB/s | 3.8597 |
| block_kahan | 5.9 GB/s | 4.2184 |
| naive_autovec | 118.6 GB/s | 14.538 |
| block_pairwise_autovec | 71.7 GB/s | 1.6132 |
| block_kahan_autovec | 98.0 GB/s | 1.2306 |
| crate_accurate_buffer | 1.1 GB/s | 0.0015 |
| crate_accurate_inplace | 1.9 GB/s | 0.0015 |
| crate_fsum | 1.2 GB/s | 0.0000 |

The reason the accurate crate has a non-zero absolute error is that it currently does not implement rounding to nearest correctly, so it can be off by one unit in the last place for the final result. First I’d like to note that there’s more than a 100x performance difference between the fastest and slowest method. For summing an array! Now this might not be entirely fair as the slowest methods are computing something significantly harder, but there’s still a 20x performance difference between a seemingly reasonable naive implementation and the fastest one. We find that in general the _autovec methods that use fadd_algebraic are faster and more accurate than the ones using regular floating-point addition. The reason they’re more accurate as well is the same reason a pairwise sum is more accurate: any reordering of the additions is better as the default long-chain-of-additions is already the worst case for accuracy in a sum. Limiting ourselves to Pareto-optimal choices we get the following four implementations:

| Algorithm | Throughput | Mean absolute error |
|---|---|---|
| naive_autovec | 118.6 GB/s | 14.538 |
| block_kahan_autovec | 98.0 GB/s | 1.2306 |
| crate_accurate_inplace | 1.9 GB/s | 0.0015 |
| crate_fsum | 1.2 GB/s | 0.0000 |

Note that implementation differences can be quite impactful, and there are likely dozens more methods of compensated summing I did not compare here. For most cases I think block_kahan_autovec wins here, having good accuracy (that doesn’t degenerate with larger inputs) at nearly the maximum speed.
For most applications the extra accuracy from the correctly-rounded sums is unnecessary, and they are 50-100x slower. By splitting the loop up into an explicit remainder plus a tight loop of 256-element sums we can squeeze out a bit more performance, and avoid a couple floating-point ops for the last chunk: #![allow(internal_features)] #![feature(core_intrinsics)] use std::intrinsics::fadd_algebraic; fn sum_block(arr: &[f32]) -> f32 { arr.iter().fold(0.0, |x, y| fadd_algebraic(x, *y)) } pub fn sum_orlp(arr: &[f32]) -> f32 { let mut chunks = arr.chunks_exact(256); let mut sum = 0.0; let mut c = 0.0; for chunk in &mut chunks { let y = sum_block(chunk) - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum + (sum_block(chunks.remainder()) - c) }

| Algorithm | Throughput | Mean absolute error |
|---|---|---|
| sum_orlp | 112.2 GB/s | 1.2306 |

You can of course tweak the number 256, I found that using 128 was $\approx$ 20% slower, and that 512 didn’t really improve performance but did cost accuracy. Conclusion I think the fadd_algebraic and similar algebraic intrinsics are very useful for achieving high-speed floating-point routines, and that other languages should add them as well. A global -ffast-math is not good enough, as we’ve seen above the best implementation was a hybrid between automatically optimized math for speed, and manually implemented non-associative compensated operations. Finally, if you are using LLVM, beware of -ffast-math. It is undefined behavior to produce a NaN or infinity while that flag is set in LLVM. I have no idea why they chose this hardcore stance which makes virtually every program that uses it unsound. If you are targeting LLVM with your language, avoid the nnan and ninf fast-math flags.
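The fadd_algebraic intrinsic used above needs a nightly compiler. On stable Rust a commonly used workaround, which is not the approach taken in this article, is to do the reassociation by hand with several independent accumulators. The sketch below assumes 8 lanes; the name naive_sum_lanes and the lane count are illustrative choices, and whether the optimizer actually turns the inner loop into SIMD additions depends on the compiler version and target.

```rust
/// Sum with 8 independent accumulators so that the additions within a
/// chunk do not depend on each other. This changes the order of the
/// additions compared to a plain left-to-right sum.
fn naive_sum_lanes(arr: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed lane count, tune for your target
    let mut acc = [0.0f32; LANES];
    let mut chunks = arr.chunks_exact(LANES);
    for chunk in &mut chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }
    // Combine the lanes, then add the leftover elements one by one.
    let mut total: f32 = acc.iter().sum();
    for x in chunks.remainder() {
        total += *x;
    }
    total
}
```

A function like this could stand in for sum_block in a block-wise Kahan loop, at the cost of fixing the lane count in the source instead of leaving it to the compiler.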

a year ago 61 votes
Extracting and Depositing Bits

Suppose you have a 64-bit word and wish to extract a couple bits from it. For example you just performed a SWAR algorithm and wish to extract the least significant bit of each byte in the u64. This is simple enough, you simply perform a binary AND with a mask of the bits you wish to keep: let out = word & 0x0101010101010101; However, this still leaves the bits of interest spread throughout the 64-bit word. What if we also want to compress the 8 bits we wish to extract into a single byte? Or what if we want the inverse, spreading the 8 bits of a byte among the least significant bits of each byte in a 64-bit word? PEXT and PDEP If you are using a modern x86-64 CPU, you are in luck. In the much underrated BMI instruction set there are two very powerful instructions: PDEP and PEXT. They are inverses of each other, PEXT extracts bits, PDEP deposits bits. PEXT takes in a word and a mask, takes just those bits from the word where the mask has a 1 bit, and compresses all selected bits to a contiguous output word. Simulated in Rust this would be: fn pext64(word: u64, mask: u64) -> u64 { let mut out = 0; let mut out_idx = 0; for i in 0..64 { let ith_mask_bit = (mask >> i) & 1; let ith_word_bit = (word >> i) & 1; if ith_mask_bit == 1 { out |= ith_word_bit << out_idx; out_idx += 1; } } out } For example if you had the bitstring abcdefgh and mask 10110001 you would get output bitstring 0000acdh. 
PDEP is exactly its inverse, it takes contiguous data bits as a word, and a mask, and deposits the data bits one-by-one (starting at the least significant bits) into those bits where the mask has a 1 bit, leaving the rest as zeros: fn pdep64(word: u64, mask: u64) -> u64 { let mut out = 0; let mut input_idx = 0; for i in 0..64 { let ith_mask_bit = (mask >> i) & 1; if ith_mask_bit == 1 { let next_word_bit = (word >> input_idx) & 1; out |= next_word_bit << i; input_idx += 1; } } out } So if you had the bitstring abcdefgh and mask 10100110 you would get output e0f00gh0 (recall that we traditionally write bitstrings with the least significant bit on the right). These instructions are incredibly powerful and flexible, and the amazing thing is that these instructions only take a single cycle on modern Intel and AMD CPUs! However, they are not available in other instruction sets, so whenever you use them you will also likely need to write a cross-platform alternative. Unfortunately, both PDEP and PEXT are very slow on AMD Zen and Zen2. They are implemented in microcode, which is really unfortunate. The platform advertises through CPUID that the instructions are supported, but they’re almost unusably slow. Use with caution. Extracting bits with multiplication While the following technique can’t replace all PEXT cases, it can be quite general. It is applicable when: The bit pattern you want to extract is static and known in advance. If you want to extract $k$ bits, there must at least be a $k-1$ gap between two bits of interest. We compute the bit extraction by adding together many left-shifted copies of our input word, such that we construct our desired bit pattern in the uppermost bits. The trick is to then realize that w << i is equivalent to w * (1 << i) and thus the sum of many left-shifted copies is equivalent to a single multiplication by (1 << i) + (1 << j) + ... I think the technique is best understood by visual example. 
Let’s use our example from earlier, extracting the least significant bit of each byte in a 64-bit word. We start off by masking off just those bits. After that we shift the most significant bit of interest to the topmost bit of the word to get our first shifted copy. We then repeat this, shifting the second most significant bit of interest to the second topmost bit, etc. We sum all these shifted copies. This results in the following (using underscores instead of zeros for clarity):

mask    = _______1_______1_______1_______1_______1_______1_______1_______1
t = w & mask
t       = _______a_______b_______c_______d_______e_______f_______g_______h
t << 7  = a_______b_______c_______d_______e_______f_______g_______h_______
t << 14 = _b_______c_______d_______e_______f_______g_______h______________
t << 21 = __c_______d_______e_______f_______g_______h_____________________
t << 28 = ___d_______e_______f_______g_______h____________________________
t << 35 = ____e_______f_______g_______h___________________________________
t << 42 = _____f_______g_______h__________________________________________
t << 49 = ______g_______h_________________________________________________
t << 56 = _______h________________________________________________________
sum     = abcdefghbcdefgh_cdefgh__defgh___efgh____fgh_____gh______h_______

Note how we constructed abcdefgh in the topmost 8 bits, which we can then extract using a single right-shift by $64 - 8 = 56$ bits. Since (1 << 7) + (1 << 14) + ... + (1 << 56) = 0x102040810204080 we get the following implementation: fn extract_lsb_bit_per_byte(w: u64) -> u8 { let mask = 0x0101010101010101; let sum_of_shifts = 0x102040810204080; ((w & mask).wrapping_mul(sum_of_shifts) >> 56) as u8 } Not as good as PEXT, but three arithmetic instructions is not bad at all. Depositing bits with multiplication Unfortunately the following technique is significantly less general than the previous one.
While you can take inspiration from it to implement similar algorithms, as-is it is limited to just spreading the bits of one byte to the least significant bit of each byte in a 64-bit word. The trick is similar to the one above. We add 8 shifted copies of our byte which once again translates to a multiplication. By choosing a shift that increases in multiples of 9 instead of 8 we ensure that the bit pattern shifts over by one position in each byte. We then mask out our bits of interest, and finish off with a shift and byteswap (which compiles to a single instruction bswap on Intel or rev on ARM) to put our output bits on the least significant bits and reverse the order. This technique visualized:

b       = ________________________________________________________abcdefgh
b << 9  = _______________________________________________abcdefgh_________
b << 18 = ______________________________________abcdefgh__________________
b << 27 = _____________________________abcdefgh___________________________
b << 36 = ____________________abcdefgh____________________________________
b << 45 = ___________abcdefgh_____________________________________________
b << 54 = __abcdefgh______________________________________________________
b << 63 = h_______________________________________________________________
sum     = h_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh
mask    = 1_______1_______1_______1_______1_______1_______1_______1_______
s & msk = h_______g_______f_______e_______d_______c_______b_______a_______

We once again note that the sum of shifts can be precomputed as 1 + (1 << 9) + ...
+ (1 << 63) = 0x8040201008040201, allowing the following implementation: fn deposit_lsb_bit_per_byte(b: u8) -> u64 { let sum_of_shifts = 0x8040201008040201; let mask = 0x8080808080808080; let spread = (b as u64).wrapping_mul(sum_of_shifts) & mask; u64::swap_bytes(spread >> 7) } This time it required 4 arithmetic instructions, not quite as good as PDEP, but again not bad compared to a naive implementation, and this is cross-platform.
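As a sanity check, the two multiplication tricks can be compared against the bit-by-bit pext64 and pdep64 simulations from earlier. The sketch below repeats all four functions so that it is self-contained; the constant name LSB_MASK is an assumption added for readability.

```rust
/// Mask selecting the least significant bit of every byte.
const LSB_MASK: u64 = 0x0101_0101_0101_0101;

/// Reference PEXT: gather the bits of `word` selected by `mask`.
fn pext64(word: u64, mask: u64) -> u64 {
    let mut out: u64 = 0;
    let mut out_idx: u32 = 0;
    for i in 0..64 {
        if (mask >> i) & 1 == 1 {
            out |= ((word >> i) & 1) << out_idx;
            out_idx += 1;
        }
    }
    out
}

/// Reference PDEP: scatter the low bits of `word` into the set bits of `mask`.
fn pdep64(word: u64, mask: u64) -> u64 {
    let mut out: u64 = 0;
    let mut input_idx: u32 = 0;
    for i in 0..64 {
        if (mask >> i) & 1 == 1 {
            out |= ((word >> input_idx) & 1) << i;
            input_idx += 1;
        }
    }
    out
}

/// Multiplication-based extraction from the article.
fn extract_lsb_bit_per_byte(w: u64) -> u8 {
    let sum_of_shifts: u64 = 0x0102_0408_1020_4080;
    ((w & LSB_MASK).wrapping_mul(sum_of_shifts) >> 56) as u8
}

/// Multiplication-based deposit from the article.
fn deposit_lsb_bit_per_byte(b: u8) -> u64 {
    let sum_of_shifts: u64 = 0x8040_2010_0804_0201;
    let mask: u64 = 0x8080_8080_8080_8080;
    let spread = (b as u64).wrapping_mul(sum_of_shifts) & mask;
    u64::swap_bytes(spread >> 7)
}
```

Looping over all 256 byte values and checking that deposit_lsb_bit_per_byte(b) equals pdep64(b as u64, LSB_MASK), and that extracting the result gives back b, covers the deposit side exhaustively.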

a year ago 46 votes
When Random Isn't

This post is an anecdote from over a decade ago, of which I lost the actual code. So please forgive me if I do not accurately remember all the details. Some details are also simplified so that anyone that likes computer security can enjoy this article, not just those who have played World of Warcraft (although the Venn diagram of those two groups likely has a solid overlap). When I was around 14 years old I discovered World of Warcraft developed by Blizzard Games and was immediately hooked. Not long after I discovered add-ons which allow you to modify how your game’s user interface looks and works. However, not all add-ons I downloaded did exactly what I wanted to do. I wanted more. So I went to find out how they were made. In a weird twist of fate, I blame World of Warcraft for me seriously picking up programming. It turned out that they were made in the Lua programming language. Add-ons were nothing more than a couple .lua source files in a folder directly loaded into the game. The barrier of entry was incredibly low: just edit a file, press save and reload the interface. The fact that the game loaded your source code and you could see it running was magical! I enjoyed it immensely and in no time I was only writing add-ons and was barely playing the game itself anymore. I published quite a few add-ons in the next two years, which mostly involved copying other people’s code with some refactoring / recombining / tweaking to my wishes. Add-on security A thought you might have is that it’s a really bad idea to let users have fully programmable add-ons in your game, lest you get bots. However, the system Blizzard made to prevent arbitrary programmable actions was quite clever. Naturally, it did nothing to prevent actual botting, but at least regular rule-abiding players were fundamentally restricted to the automation Blizzard allowed. Most UI elements that you could create were strictly decorative or informational. 
These were completely unrestricted, as were most APIs that strictly gather information. For example you can make a health bar display using two frames, a background and a foreground, sizing the foreground frame using an API call to get the health of your character. Not all API calls were available to you however. Some were protected so they could only be called from official Blizzard code. These typically involved the API calls that would move your character, cast spells, use items, etc. Generally speaking anything that actually makes you perform an in-game action was protected. The API for getting your exact world location and camera orientation also became protected at some point. This was a reaction by Blizzard to new add-ons that were actively drawing 3D elements on top of the game world to make boss fights easier. However, some UI elements needed to actually interact with the game itself, e.g. if I want to make a button that casts a certain spell. For this you could construct a special kind of button that executes code in a secure environment when clicked. You were only allowed to create/destroy/move such buttons when not in combat, so you couldn’t simply conditionally place such buttons underneath your cursor to automate actions during combat. The catch was that this secure environment did allow you to programmatically set which spell to cast, but doesn’t let you gather the information you would need to do arbitrary automation. All access to state from outside the secure environment was blocked. There were some information gathering API calls available to match the more accessible in-game macro system, but nothing as fancy as getting skill cooldowns or unit health which would enable automatic optimal spellcasting. So there were two environments: an insecure one where you can get all information but can’t act on it, and a secure one where you can act but can’t get the information needed for automation. 
A backdoor channel Fast forward a couple years and I had mostly stopped playing. My interests had mainly moved on to more “serious” programming, and I was only occasionally playing, mostly messing around with add-on ideas. But this secure environment kept on nagging in my brain; I wanted to break it. Of course there was third-party software that completely disables the security restrictions from Blizzard, but what’s the fun in that? I wanted to do it “legitimately”, using the technically allowed tools, as a challenge. Obviously using clever code to bypass security restrictions is no better than using third-party software, and both would likely get you banned. I never actually wanted to use the code, just to see if I could make it work. So I scanned the secure environment allowed function list to see if I could smuggle any information from the outside into the secure environment. It all seemed pretty hopeless until I saw one tiny, innocent little function: random. An evil idea came in my head: random number generators (RNGs) used in computers are almost always pseudorandom number generators with (hidden) internal state. If I can manipulate this state, perhaps I can use that to pass information into the secure environment. Random number generator woes It turned out that random was just a small shim around C’s rand. I was excited! This meant that there was a single global random state that was shared in the process. It also helps that rand implementations tended to be on the weak side. Since World of Warcraft was compiled with MSVC, the actual implementation of rand was as follows: uint32_t state; int rand() { state = state * 214013 + 2531011; return (state >> 16) & 0x7fff; } This RNG is, for the lack of a better word, shite. It is a naked linear congruential generator, and a weak one at that. Which in my case, was a good thing. 
I can understand MSVC keeps rand the same for backwards compatibility, and at least all documentation I could find for rand recommends you not to use rand for cryptographic purposes. But was there ever a time where such a bad PRNG implementation was fit for any purpose? So let’s get to breaking this thing. Since the state is so laughably small and you can see 15 bits of the state directly you can keep a full list of all possible states consistent with a single output of the RNG and use further calls to the RNG to eliminate possibilities until a single one remains. But we can be significantly more clever. First we note that the top bit of state never affects anything in this RNG. (state >> 16) & 0x7fff masks out 15 bits, after shifting away the bottom 16 bits, and thus effectively works mod $2^{31}$. Since on any update the new state is a linear function of the previous state, we can propagate this modular form all the way down to the initial state as $$f(x) \equiv f(x \bmod m) \mod m$$ for any linear $f$. Let $a = 214013$ and $b = 2531011$. We observe the 15-bit output $r_0, r_1$ of two RNG calls. We’ll call the 16-bit portion of the RNG state that is hidden by the shift $h_0, h_1$ respectively, for the states after the first and second call. This means the state of the RNG after the first call is $2^{16} r_0 + h_0$ and similarly for $2^{16} r_1 + h_1$ after the second call. Then we have the following identity: $$a\cdot (2^{16}r_0 + h_0) + b \equiv 2^{16}r_1 + h_1 \mod 2^{31},$$ $$ah_0 \equiv h_1 + 2^{16}(r_1 - ar_0) - b \mod 2^{31}.$$ Now let $c \geq 0$ be the known constant $(2^{16}(r_1 - ar_0) - b) \bmod 2^{31}$, then for some integer $k$ we have $$ah_0 = h_1 + c + 2^{31} k.$$ Note that the left hand side ranges from $0$ to $a (2^{16} - 1) \approx 2^{33.71}$. Thus we must have $-1 \leq k \leq 2^{2.71} < 7$. 
Reordering we get the following expression for $h_0$: $$h_0 = \frac{c + 2^{31} k}{a} + h_1/a.$$ Since $a > 2^{16}$ while $0 \leq h_1 < 2^{16}$ we note that the term $0 \leq h_1/a < 1$. Thus, assuming a solution exists, we must have $$h_0 = \left\lceil\frac{c + 2^{31} k}{a}\right\rceil.$$ So for $-1 \leq k < 7$ we compute the above guess for the hidden portion of the RNG state after the first call. This gives us 8 guesses, after which we can reject bad guesses using follow-up calls to the RNG until a single unique answer remains. While I was able to re-derive the above with little difficulty now, 18 year old me wasn’t as experienced in discrete math. So I asked on crypto.SE, with the excuse that I wanted to ‘show my colleagues how weak this RNG is’. It worked, which sparks all kinds of interesting ethics questions. An example implementation of this process in Python: import random A = 214013 B = 2531011 class MsvcRng: def __init__(self, state): self.state = state def __call__(self): self.state = (self.state * A + B) % 2**32 return (self.state >> 16) & 0x7fff # Create a random RNG state we'll reverse engineer. hidden_rng = MsvcRng(random.randint(0, 2**32)) # Compute guesses for hidden state from 2 observations. r0 = hidden_rng() r1 = hidden_rng() c = (2**16 * (r1 - A * r0) - B) % 2**31 ceil_div = lambda a, b: (a + b - 1) // b h_guesses = [ceil_div(c + 2**31 * k, A) for k in range(-1, 7)] # Validate guesses until a single guess remains. guess_rngs = [MsvcRng(2**16 * r0 + h0) for h0 in h_guesses] guess_rngs = [g for g in guess_rngs if g() == r1] while len(guess_rngs) > 1: r = hidden_rng() guess_rngs = [g for g in guess_rngs if g() == r] # The top bit can not be recovered as it never affects the output, # but we should have recovered the effective hidden state. assert guess_rngs[0].state % 2**31 == hidden_rng.state % 2**31 While I did write the above process with a while loop, it appears to only ever need a third output at most to narrow it down to a single guess. 
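The same recovery procedure can also be sketched in Rust. This is a port of the Python above under the same assumptions about the MSVC rand constants; the names MsvcRng, next_output and recover_candidates are illustrative. Unlike the Python version it does not loop until one guess remains, but returns every candidate state that is consistent with all the outputs it is given.

```rust
const A: i64 = 214013;
const B: i64 = 2531011;

/// Model of the MSVC `rand` state update described above.
#[derive(Clone, Copy)]
struct MsvcRng {
    state: u32,
}

impl MsvcRng {
    fn next_output(&mut self) -> u32 {
        self.state = self.state.wrapping_mul(A as u32).wrapping_add(B as u32);
        (self.state >> 16) & 0x7fff
    }
}

/// Given at least two consecutive outputs, return all candidate states
/// (modulo 2^31) that the RNG could have had right after the first output.
fn recover_candidates(outputs: &[u32]) -> Vec<u32> {
    assert!(outputs.len() >= 2);
    let r0 = outputs[0] as i64;
    let r1 = outputs[1] as i64;
    let c = ((1i64 << 16) * (r1 - A * r0) - B).rem_euclid(1i64 << 31);
    let mut candidates = Vec::new();
    for k in -1i64..7 {
        // Ceiling division; div_euclid floors for a positive divisor.
        let h0 = (c + (1i64 << 31) * k + A - 1).div_euclid(A);
        if !(0..(1i64 << 16)).contains(&h0) {
            continue; // the hidden part must fit in 16 bits
        }
        let state = ((r0 << 16) | h0) as u32;
        // Replay the candidate and compare against the remaining outputs.
        let mut rng = MsvcRng { state };
        if outputs[1..].iter().all(|&r| rng.next_output() == r) {
            candidates.push(state);
        }
    }
    candidates
}
```

Passing three or four outputs should, per the observation above, usually leave a single candidate; the top bit of the real state is not recoverable, so candidates are only meaningful modulo $2^{31}$.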
Putting it together Once we could reverse-engineer the internal state of the random number generator we could make arbitrary automated decisions in the supposedly secure environment. How it worked was as follows: An insecure hook was registered that would execute right before the secure environment code would run. In this hook we have full access to information, and make a decision as to which action should be taken (e.g. casting a particular spell). This action is looked up in a hardcoded list to get an index. The current state of the RNG is reverse-engineered using the above process. We predict the outcome of the next RNG call. If this (modulo the length of our action list) does not give our desired outcome, we advance the RNG and try again. This repeats until the next random number would correspond to our desired action. The hook returns, and the secure environment starts. It generates a “random” number, indexes our hardcoded list of actions, and performs the “random” action. That’s all! By being able to simulate the RNG and looking one step ahead we could use it as our information channel by choosing exactly the right moment to call random in the secure environment. Now if you wanted to support a list of $n$ actions it would on average take $n$ steps of the RNG before the correct number came up to pass along, but that wasn’t a problem in practice. Conclusion I don’t know when Blizzard fixed the issue where the RNG state is so weak and shared, or whether they were aware of it being an issue at all. A few years after I had written the code I tried it again out of curiosity, and it had stopped working. Maybe they switched to a different algorithm, or had a properly separated RNG state for the secure environment. All-in-all it was a lot of effort for a niche exploit in a video game that I didn’t even want to use. 
But there certainly was a magic to manipulating something supposedly random into doing exactly what you want, like a magician pulling four aces from a shuffled deck.
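The information channel described under “Putting it together” can be simulated in a few lines of Rust. The sketch below is a model and not the original add-on code, which was written in Lua and is lost: MsvcRng stands in for the game’s shared rand state (assumed to be already recovered), and steer and the action count n are hypothetical names.

```rust
/// Model of the shared MSVC `rand` state.
#[derive(Clone, Copy)]
struct MsvcRng {
    state: u32,
}

impl MsvcRng {
    fn next_output(&mut self) -> u32 {
        self.state = self.state.wrapping_mul(214013).wrapping_add(2531011);
        (self.state >> 16) & 0x7fff
    }
}

/// "Insecure hook": burn RNG outputs until the next output, taken
/// modulo `n`, equals the index of the desired action.
/// Returns how many outputs were burned.
fn steer(rng: &mut MsvcRng, n: u32, desired: u32) -> u32 {
    assert!(n > 0 && desired < n);
    let mut burned = 0;
    loop {
        // Look one step ahead on a copy, leaving the real state untouched.
        let mut preview = *rng;
        if preview.next_output() % n == desired {
            return burned;
        }
        rng.next_output();
        burned += 1;
    }
}
```

After steer returns, the next call to next_output() % n plays the role of the “random” choice made inside the secure environment, and it selects the desired action.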

a year ago 44 votes

More in programming

first-class merges and cover letters

Although it looks really good, I have not yet tried the Jujutsu (jj) version control system, mainly because it’s not yet clearly superior to Magit. But I have been following jj discussions with great interest. One of the things that jj has not yet tackled is how to do better than git refs / branches / tags. As I understand it, jj currently has something like Mercurial bookmarks, which are more like raw git ref plumbing than a high-level porcelain feature. In particular, jj lacks signed or annotated tags, and it doesn’t have branch names that always automatically refer to the tip. This is clearly a temporary state of affairs because jj is still incomplete and under development and these gaps are going to be filled. But the discussions have led me to think about how git’s branches are unsatisfactory, and what could be done to improve them. branch merge rebase squash fork cover letters previous branch workflow questions branch One of the huge improvements in git compared to Subversion was git’s support for merges. Subversion proudly advertised its support for lightweight branches, but a branch is not very useful if you can’t merge it: an un-mergeable branch is not a tool you can use to help with work-in-progress development. The point of this anecdote is to illustrate that rather than trying to make branches better, we should try to make merges better and branches will get better as a consequence. Let’s consider a few common workflows and how git makes them all unsatisfactory in various ways. Skip to cover letters and previous branch below where I eventually get to the point. merge A basic merge workflow is, create a feature branch hack, hack, review, hack, approve merge back to the trunk The main problem is when it comes to the merge, there may be conflicts due to concurrent work on the trunk. Git encourages you to resolve conflicts while creating the merge commit, which tends to bypass the normal review process.
Git also gives you an ugly useless canned commit message for merges, that hides what you did to resolve the conflicts.

If the feature branch is a linear record of the work then it can be cluttered with commits to address comments from reviewers and to fix mistakes. Some people like an accurate record of the history, but others prefer the repository to contain clean logical changes that will make sense in years to come, keeping the clutter in the code review system.

rebase

A rebase-oriented workflow deals with the problems of the merge workflow but introduces new problems.

Primarily, rebasing is intended to produce a tidy logical commit history. And when a feature branch is rebased onto the trunk before it is merged, a simple fast-forward check makes it trivial to verify that the merge will be clean (whether it uses a separate merge commit or directly fast-forwards the trunk).

However, it’s hard to compare the state of the feature branch before and after the rebase. The current and previous tips of the branch (amongst other clutter) are recorded in the reflog of the person who did the rebase, but they can’t share their reflog. A force-push erases the previous branch from the server.

Git forges sometimes make it possible to compare a branch before and after a rebase, but it’s usually very inconvenient, which makes it hard to see if review comments have been addressed. And a reviewer can’t fetch past versions of the branch from the server to review them locally.

You can mitigate these problems by adding commits in --autosquash format, and delaying rebasing until just before merge. However that reintroduces the problem of merge conflicts: if the autosquash doesn’t apply cleanly the branch should have another round of review to make sure the conflicts were resolved OK.

squash

When the trunk consists of a sequence of merge commits, the --first-parent log is very uninformative.
A common way to make the history of the trunk more informative, and deal with the problems of cluttered feature branches and poor rebase support, is to squash the feature branch into a single commit on the trunk instead of merging.

This encourages merge requests to be roughly the size of one commit, which is arguably a good thing. However, it can be uncomfortably confining for larger features, or cause extra busy-work co-ordinating changes across multiple merge requests.

And squashed feature branches have the same merge conflict problem as rebase --autosquash.

fork

Feature branches can’t always be short-lived. In the past I have maintained local hacks that were used in production but were not (not yet?) suitable to submit upstream.

I have tried keeping a stack of these local patches on a git branch that gets rebased onto each upstream release. With this setup the problem of reviewing successive versions of a merge request becomes the bigger problem of keeping track of how the stack of patches evolved over longer periods of time.

cover letters

Cover letters are common in the email patch workflow that predates git, and they are supported by git format-patch. Github and other forges have a webby version of the cover letter: the message that starts off a pull request or merge request.

In git, cover letters are second-class citizens: they aren’t stored in the repository. But many of the problems I outlined above have neat solutions if cover letters become first-class citizens, with a Jujutsu twist.

A first-class cover letter starts off as a prototype for a merge request, and becomes the eventual merge commit. Instead of unhelpful auto-generated merge commits, you get helpful and informative messages. No extra work is needed since we’re already writing cover letters. Good merge commit messages make good --first-parent logs.

The cover letter subject line works as a branch name. No more need to invent filename-compatible branch names!
Jujutsu doesn’t make you name branches, giving them random names instead. It shows the subject line of the topmost commit as a reminder of what the branch is for. If there’s an explicit cover letter the subject line will be a better summary of the branch as a whole. I often find the last commit on a branch is some post-feature cleanup, and that kind of commit has a subject line that is never a good summary of its feature branch.

As a prototype for the merge commit, the cover letter can contain the resolution of all the merge conflicts in a way that can be shared and reviewed. In Jujutsu, where conflicts are first class, the cover letter commit can contain unresolved conflicts: you don’t have to clean them up when creating the merge, you can leave that job until later. If you can share a prototype of your merge commit, then it becomes possible for your collaborators to review any merge conflicts and how you resolved them.

To distinguish a cover letter from a merge commit object, a cover letter object has a “target” header which is a special kind of parent header. A cover letter also has a normal parent commit header that refers to earlier commits in the feature branch. The target is what will become the first parent of the eventual merge commit.

previous branch

The other ingredient is to add a “previous branch” header, another special kind of parent commit header. The previous branch header refers to an older version of the cover letter and, transitively, an older version of the whole feature branch.

Typically the previous branch header will match the last shared version of the branch, i.e. the commit hash of the server’s copy of the feature branch.

The previous branch header isn’t changed during normal work on the feature branch. As the branch is revised and rebased, the commit hash of the cover letter will change fairly frequently. These changes are recorded in git’s reflog or jj’s oplog, but not in the “previous branch” chain.
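To make the proposed headers a little more concrete, here is a minimal Python sketch of a cover letter object and its “previous branch” chain. Nothing here is an existing git or jj data structure: the class names, field names, and the `shared_versions` and `to_merge_commit` helpers are all hypothetical, chosen only to mirror the description above.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """An ordinary commit, reduced to the fields this sketch needs."""
    commit_id: str
    subject: str
    parents: Tuple[str, ...] = ()


@dataclass(frozen=True)
class CoverLetter:
    """Hypothetical cover letter object with the two proposed headers."""
    commit_id: str
    subject: str
    parent: str   # normal parent: the feature branch commits below the letter
    target: str   # "target" header: the trunk commit the branch is aimed at
    previous_branch: Optional["CoverLetter"] = None  # last shared version


def shared_versions(letter: CoverLetter) -> Iterator[CoverLetter]:
    """Walk the previous branch chain, newest version first."""
    current: Optional[CoverLetter] = letter
    while current is not None:
        yield current
        current = current.previous_branch


def to_merge_commit(letter: CoverLetter, merge_id: str) -> Commit:
    """Turn a cover letter into a merge commit.

    The target becomes the first parent and the feature branch becomes the
    second parent; the previous branch header is simply not carried over.
    """
    return Commit(merge_id, letter.subject, parents=(letter.target, letter.parent))
```

Under these assumptions, comparing two shared iterations of a branch amounts to taking neighbouring entries from `shared_versions` and diffing what each one points at.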
You can use the previous branch chain to examine diffs between versions of the feature branch as a whole. If commits have Gerrit-style or jj-style change-IDs then it’s fairly easy to find and compare previous versions of an individual commit.

The previous branch header supports interdiff code review, or allows you to retain past iterations of a patch series.

workflow

Here are some sketchy notes on how these features might work in practice.

One way to use cover letters is jj-style, where it’s convenient to edit commits that aren’t at the tip of a branch, and easy to reshuffle commits so that a branch has a deliberate narrative.

When you create a new feature branch, it starts off as an empty cover letter with both target and parent pointing at the same commit.

Alternatively, you might start a branch ad hoc, and later cap it with a cover letter.

If this is a small change and rebase + fast-forward is allowed, you can edit the “cover letter” to contain the whole change.

Otherwise, you can hack on the branch any which way. Shuffle the commits that should be part of the merge request so that they occur before the cover letter, and edit the cover letter to summarize the preceding commits.

When you first push the branch, there’s (still) no need to give it a name: the server can see that this is (probably) going to be a new merge request because the top commit has a target branch and its change-ID doesn’t match an existing merge request.

Also when you push, your client automatically creates a new instance of your cover letter, adding a “previous branch” header to indicate that the old version was shared. The commits on the branch that were pushed are now immutable; rebases and edits affect the new version of the branch.

During review there will typically be multiple iterations of the branch to address feedback. The chain of previous branch headers allows reviewers to see how commits were changed to address feedback, interdiff style.
The branch can be merged when the target header matches the current trunk and there are no conflicts left to resolve.

When the time comes to merge the branch, there are several options:

- For a merge workflow, the cover letter is used to make a new commit on the trunk, changing the target header into the first parent commit, and dropping the previous branch header.
- Or, if you like to preserve more history, the previous branch chain can be retained.
- Or you can drop the cover letter and fast-forward the branch on to the trunk.
- Or you can squash the branch on to the trunk, using the cover letter as the commit message.

questions

This is a fairly rough idea: I’m sure that some of the details won’t work in practice without a lot of careful work on compatibility and deployability.

- Do the new commit headers (“target” and “previous branch”) need to be headers?
- What are the compatibility issues with adding new headers that refer to other commits?
- How would a server handle a push of an unnamed branch? How could someone else pull a copy of it?
- How feasible is it to use cover letter subject lines instead of branch names?
- The previous branch header is doing a similar job to a remote tracking branch. Is there an opportunity to simplify how we keep a local cache of the server state?

Despite all that, I think something along these lines could make branches / reviews / reworks / merges less awkward. How you merge should be a matter of your project’s preferred style, without interference from technical limitations that force you to trade off one annoyance against another.

There remains a non-technical limitation: I have assumed that contributors are comfortable enough with version control to use a history-editing workflow effectively. I’ve lost all perspective on how hard this is for a newbie to learn; I expect (or hope?) jj makes it much easier than git rebase.

10 hours ago 4 votes
ARM is great, ARM is terrible (and so is RISC-V)

I’ve long been interested in new and different platforms. I ran Debian on an Alpha back in the late 1990s and was part of the Alpha port team; then I helped bootstrap Debian on amd64. I’ve got somewhere around 8 Raspberry Pi devices in active use right now, and the free NNCPNET Internet email service … Continue reading ARM is great, ARM is terrible (and so is RISC-V) →

22 hours ago 3 votes
Many Hard Leetcode Problems are Easy Constraint Problems

In my first interview out of college I was asked the change counter problem:

  Given a set of coin denominations, find the minimum number of coins required to make change for a given number. IE for USA coinage and 37 cents, the minimum number is four (quarter, dime, 2 pennies).

I implemented the simple greedy algorithm and immediately fell into the trap of the question: the greedy algorithm only works for "well-behaved" denominations. If the coin values were [10, 9, 1], then making 37 cents would take 10 coins in the greedy algorithm but only 4 coins optimally (10+9+9+9). The "smart" answer is to use a dynamic programming algorithm, which I didn't know how to do. So I failed the interview.

But you only need dynamic programming if you're writing your own algorithm. It's really easy if you throw it into a constraint solver like MiniZinc and call it a day.

    int: total;
    array[int] of int: values = [10, 9, 1];
    array[index_set(values)] of var 0..: coins;

    constraint sum (c in index_set(coins)) (coins[c] * values[c]) == total;
    solve minimize sum(coins);

You can try this online here. It'll give you a prompt to put in total and then give you successively-better solutions:

    coins = [0, 0, 37];
    ----------
    coins = [0, 1, 28];
    ----------
    coins = [0, 2, 19];
    ----------
    coins = [0, 3, 10];
    ----------
    coins = [0, 4, 1];
    ----------
    coins = [1, 3, 0];
    ----------

Lots of similar interview questions are this kind of mathematical optimization problem, where we have to find the maximum or minimum of a function corresponding to constraints. They're hard in programming languages because programming languages are too low-level. They are also exactly the problems that constraint solvers were designed to solve. Hard leetcode problems are easy constraint problems.[1] Here I'm using MiniZinc, but you could just as easily use Z3 or OR-Tools or whatever your favorite generalized solver is.
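For comparison with the solver version, here is a rough Python sketch of the two hand-written approaches mentioned above: the greedy algorithm that falls into the trap, and a standard bottom-up dynamic programming table. The function names `greedy_coins` and `min_coins` are made up for this illustration, and this is only one common way to write the DP, not necessarily what the interviewer had in mind.

```python
def greedy_coins(values, total):
    """Always take as many of the largest remaining coin as possible."""
    count = 0
    for value in sorted(values, reverse=True):
        used, total = divmod(total, value)
        count += used
    # If something is left over, greedy could not make exact change.
    return count if total == 0 else None


def min_coins(values, total):
    """Bottom-up DP: best[a] is the fewest coins that sum to exactly a."""
    unreachable = float("inf")
    best = [0] + [unreachable] * total
    for amount in range(1, total + 1):
        for value in values:
            if value <= amount and best[amount - value] + 1 < best[amount]:
                best[amount] = best[amount - value] + 1
    return None if best[total] == unreachable else best[total]


print(greedy_coins([10, 9, 1], 37))  # greedy: 3 tens + 7 ones
print(min_coins([10, 9, 1], 37))     # optimal: 10 + 9 + 9 + 9
```

The DP runs in time proportional to `total * len(values)`, which is fine for interview-sized inputs but is exactly the kind of bookkeeping the MiniZinc model lets you skip.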
More examples

This was a question in a different interview (which I thankfully passed):

  Given a list of stock prices through the day, find maximum profit you can get by buying one stock and selling one stock later.

It's easy to do in O(n^2) time, or if you are clever, you can do it in O(n). Or you could be not clever at all and just write it as a constraint problem:

    array[int] of int: prices = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8];
    var int: buy;
    var int: sell;
    var int: profit = prices[sell] - prices[buy];

    constraint sell > buy;
    constraint profit > 0;
    solve maximize profit;

Reminder, link to trying it online here. While working at that job, one interview question we tested out was:

  Given a list, determine if three numbers in that list can be added or subtracted to give 0?

This is a satisfaction problem, not an optimization problem: we don't need the "best answer", any answer will do. We eventually decided against it for being too tricky for the engineers we were targeting. But it's not tricky in a solver:

    include "globals.mzn";
    array[int] of int: numbers = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8];
    array[index_set(numbers)] of var {0, -1, 1}: choices;

    constraint sum(n in index_set(numbers)) (numbers[n] * choices[n]) = 0;
    constraint count(choices, -1) + count(choices, 1) = 3;
    solve satisfy;

Okay, one last one, a problem I saw last year at Chipy AlgoSIG. Basically they pick some leetcode problems and we all do them. I failed to solve this one:

  Given an array of integers heights representing the histogram's bar height where the width of each bar is 1, return the area of the largest rectangle in the histogram.
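For reference, the hand-written approach usually given for this histogram problem is a single pass with a monotonic stack. The Python sketch below is one common way to write it; the function name `largest_rectangle` is made up for this illustration and is not taken from the original discussion.

```python
def largest_rectangle(heights):
    """Largest rectangle area in a histogram whose bars all have width 1."""
    stack = []  # indices of bars whose heights are strictly increasing
    best = 0
    # A trailing bar of height 0 forces every remaining bar off the stack.
    for i, h in enumerate(list(heights) + [0]):
        while stack and heights[stack[-1]] >= h:
            height = heights[stack.pop()]
            # The popped bar extends left to just after the new stack top,
            # and right to just before the current position i.
            left = stack[-1] + 1 if stack else 0
            best = max(best, height * (i - left))
        stack.append(i)
    return best


print(largest_rectangle([2, 1, 5, 6, 2, 3]))  # bars 5 and 6 give 2 * 5
```

Each index is pushed and popped at most once, so the scan is linear, but the left and right boundary bookkeeping is easy to get wrong under interview pressure.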
The "proper" solution is a tricky thing involving tracking lots of bookkeeping states, which you can completely bypass by expressing it as constraints:

    array[int] of int: numbers = [2,1,5,6,2,3];

    var 1..length(numbers): x;
    % dx counts the extra bars to the right of x, so 0 means a single bar
    var 0..length(numbers)-1: dx;
    var 1..: y;

    constraint x + dx <= length(numbers);
    constraint forall (i in x..(x+dx)) (y <= numbers[i]);

    var int: area = (dx+1)*y;
    solve maximize area;

    output ["(\(x)->\(x+dx))*\(y) = \(area)"];

There's even a way to automatically visualize the solution (using vis_geost_2d), but I didn't feel like figuring it out in time for the newsletter.

Is this better?

Now if I actually brought these questions to an interview the interviewee could ruin my day by asking "what's the runtime complexity?" Constraint solver runtimes are unpredictable and almost always slower than an ideal bespoke algorithm because they are more expressive, in what I refer to as the capability/tractability tradeoff. But even so, they'll do way better than a bad bespoke algorithm, and I'm not experienced enough in handwriting algorithms to consistently beat a solver.

The real advantage of solvers, though, is how well they handle new constraints. Take the stock picking problem above. I can write an O(n²) algorithm in a few minutes and the O(n) algorithm if you give me some time to think. Now change the problem to

  Maximize the profit by buying and selling up to max_sales stocks, but you can only buy or sell one stock at a given time and you can only hold up to max_hold stocks at a time?

That's a way harder problem to write even an inefficient algorithm for!
While the constraint problem is only a tiny bit more complicated:

    include "globals.mzn";
    int: max_sales = 3;
    int: max_hold = 2;
    array[int] of int: prices = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8];
    array [1..max_sales] of var int: buy;
    array [1..max_sales] of var int: sell;
    array [index_set(prices)] of var 0..max_hold: stocks_held;
    var int: profit = sum(s in 1..max_sales) (prices[sell[s]] - prices[buy[s]]);

    constraint forall (s in 1..max_sales) (sell[s] > buy[s]);
    constraint profit > 0;

    constraint forall(i in index_set(prices)) (stocks_held[i] = (count(s in 1..max_sales) (buy[s] <= i) - count(s in 1..max_sales) (sell[s] <= i)));
    constraint alldifferent(buy ++ sell);
    solve maximize profit;

    output ["buy at \(buy)\n", "sell at \(sell)\n", "for \(profit)"];

Most constraint solving examples online are puzzles, like Sudoku or "SEND + MORE = MONEY". Solving leetcode problems would be a more interesting demonstration. And you get more interesting opportunities to teach optimizations, like symmetry breaking.

[1] Because my dad will email me if I don't explain this: "leetcode" is slang for "tricky algorithmic interview questions that have little-to-no relevance in the actual job you're interviewing for." It's from leetcode.com.

22 hours ago 3 votes
If Apple cared about privacy

Defaults matter

yesterday 6 votes
btrfs on a Raspberry Pi

I’m something of a filesystem geek, I guess. I first wrote about ZFS on Linux 14 years ago, and even before I used ZFS, I had used ext2/3/4, jfs, reiserfs, xfs, and no doubt some others. I’ve also used btrfs. I last posted about it in 2014, when I noted it has some advantages over … Continue reading btrfs on a Raspberry Pi →

2 days ago 4 votes