More from samwho.dev
header h1 { padding: 0; margin-top: 0.2rem; margin-bottom: 1rem; } button { margin: 0.5rem; padding: 0.5rem 1rem; background-color: #444444; color: white; border: none; border-radius: 8px; cursor: pointer; touch-action: manipulation; user-select: none; } button:hover:not(:disabled) { background-color: #555555; touch-action: manipulation; user-select: none; } button:disabled { filter: opacity(0.5); cursor: not-allowed; touch-action: manipulation; user-select: none; } input[type="range"] { width: 100%; margin: 1rem 0; } .control { display: flex; justify-content: center; align-items: center; margin: 1rem 0; } .control label { margin-right: 1rem; white-space: nowrap; } .control input[type="range"] { flex-grow: 1; } .odds { font-size: 1.2rem; font-family: var(--code-font); font-weight: bold; text-align: center; display: flex; justify-content: center; gap: 2rem; margin: 1rem 0; } .odds .hold { display: flex; flex-direction: column; align-items: center; } .odds .replace { display: flex; flex-direction: column; align-items: center; } .odds .hold .value { color: var(--palette-dark-blue); white-space: nowrap; } .odds .replace .value { color: var(--palette-red); white-space: nowrap; } article input { --c: var(--palette-orange); --l: 2px; --h: 30px; --w: 30px; width: 100%; height: var(--h); -webkit-appearance :none; -moz-appearance :none; appearance :none; background: none; cursor: pointer; overflow: hidden; } article input:focus-visible, article input:hover{ --p: 25%; } article input[type="range" i]::-webkit-slider-thumb{ height: var(--h); width: var(--w); aspect-ratio: 1; border-radius: 50%; background: var(--c); border-image: linear-gradient(90deg,var(--c) 50%,#ababab 0) 0 1/calc(50% - var(--l)/2) 100vw/0 100vw; -webkit-appearance: none; appearance: none; box-shadow: none; transition: .3s; } article input[type="range"]::-moz-range-thumb { --h: 25px; --w: 25px; height: var(--h); width: var(--w); aspect-ratio: 1; border-radius: 50%; background: var(--c); border-image: linear-gradient(90deg,var(--c) 50%,#ababab 0) 0 1/calc(50% - var(--l)/2) 100vw/0 100vw; -webkit-appearance: none; appearance: none; box-shadow: none; transition: .3s; } img.hero { --width: 200px; width: var(--width); max-width: var(--width); margin-top: calc(var(--width) * -0.5); margin-bottom: 1rem; } Reservoir Sampling Reservoir sampling is a technique for selecting a fair random sample when you don't know the size of the set you're sampling from. By the end of this essay you will know: When you would need reservoir sampling. The mathematics behind how it works, using only basic operations: subtraction, multiplication, and division. No math notation, I promise. A simple way to implement reservoir sampling if you want to use it. ittybit, and their API for working with videos, images, and audio. If you need to store, encode, or get intelligence from the media files in your app, check them out! # Sampling when you know the size In front of you are 10 playing cards and I ask you to pick 3 at random. How do you do it? The first technique that might come to mind from your childhood is to mix them all up in the middle. Then you can straighten them out and pick the first 3. You can see this happen below by clicking "Shuffle." Every time you click "Shuffle," the chart below tracks what the first 3 cards were. At first you'll notice some cards are selected more than others, but if you keep going it will even out. All cards have an equal chance of being selected. This makes it "fair." Click "Shuffle 100 times" until the chart evens out. You can reset the chart if you'd like to start over. This method works fine with 10 cards, but what if you had 1 million cards? Mixing those up won't be easy. Instead, we could use a random number generator to pick 3 indices. These would be our 3 chosen cards. We no longer have to move all of the cards, and if we click the "Select" button enough times we'll see that this method is just as fair as the mix-up method. I'm stretching the analogy a little here. It would take a long time to count through the deck to get to, say, index 436,234. But when it's an array in memory, computers have no trouble finding an element by its index. Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random? That's the curveball: you don't know. No, you can only hold on to 1 card at a time. You're free to swap your card with the newest one each time I show you a card, but you can only hold one and you can't go back to a card you've already seen. Believe it or not, this is a real problem and it has a real and elegant solution. For example, let's say you're building a log collection service. Text logs, not wooden ones. This service receives log messages from other services and stores them so that it's easy to search them in one place. One of the things you need to think about when building a service like this is what do you do when another service starts sending you way too many logs. Maybe it's a bad release, maybe one of your videos goes viral. Whatever the reason, it threatens to overwhelm your log collection service. Let's simulate this. Below you can see a stream of logs that experiences periodic spikes. A horizontal line indicates the threshold of logs per second that the log collection service can handle, which in this example is 5 logs per second. You can see that every so often, logs per second spikes above the threshold . One way to deal with this is "sampling." Deciding to send only a fraction of the logs to the log collection service. Let's send 10% of the logs. Below we will see the same simulation again, but this time logs that don't get sent to our log collection service will be greyed out. The graph has 2 lines: a black line tracks sent logs , the logs that are sent to our log collection service, and a grey line tracks total logs . The rate of sent logs never exceeds the threshold , so we never overwhelm our log collection service. However, in the quieter periods we're throwing away 90% of the logs when we don't need to! What we really want is to send at most 5 logs per second. This would mean that during quiet periods you get all the logs, but during spikes you discard logs to protect the log collection service. The simple way to achieve this would be to send the first 5 logs you see each second, but this isn't fair. You aren't giving all logs an equal chance of being selected. # Sampling when you don't know the size We instead want to pick a fair sample of all the logs we see each second. The problem is that we don't know how many we will see. Reservoir sampling is an algorithm that solves this exact problem. You could, but why live with that uncertainty? You'd be holding on to an unknown number of logs in memory. A sufficiently big spike could cause you problems. Reservoir sampling solves this problem, and does so without ever using more memory than you ask it to. Let's go back to our curveball of me showing you 1 card at a time. Here's a recap of the rules: I'll draw cards one at a time from a deck. Each time I show you a card, you have to choose to hold it or discard it. If you were already holding a card, you discard your held card before replacing it with the new card. At any point I can stop drawing cards and whatever card you're holding is the one you've chosen. How would you play this game in a way that ensures all cards have been given an equal chance to be selected when I decide to stop? You're on the right track. Let's have a look at how the coin flip idea plays out in practice. Below you see a deck of cards. Clicking "Deal" will draw a card and 50% of the time it will go to the discard pile on the right, and 50% of the time it will become your held card in the center, with any previously held card moving to the discard pile. The problem is that while the hold vs discard counts are roughly equal, which feels fair, later cards are much more likely to be held when I stop than earlier cards. The first card drawn has to win 10 coin flips to still be in your hand after the 10th card is drawn. The last card only has to win 1. Scrub the slider below to see how the chances change as we draw more cards. Each bar represents a card in the deck, and the height of the bar is the chance we're holding that card when I stop. Below the slider are the chances we're holding the first card drawn vs. the last card drawn. Anything older than 15 cards ago is has a less than 0.01% chance of being held when I stop. Because believe it or not, we only have to make one small change to this idea to make it fair. Instead of flipping a coin to decide if we'll hold the card or not, instead we give each new card a 1/n chance of being held, where n is the number of cards we've seen so far. Yep! In order to be fair, every card must have an equal chance of being selected. So for the 2nd card, we want both cards to have a 1/2 chance. For the 3rd card, we want all 3 cards to have a 1/3 chance. For the 4th card, we want all 4 cards to have a 1/4 chance, and so on. So if we use 1/n for the new card, we can at least say that the new card has had a fair shot. Let's have a look at the chances as you draw more cards with this new method. new card has the right chance of being selected, but how does that make the older cards fair? So far we've focused on the chance of the new card being selected, but we also need to consider the chance of the card you're holding staying in your hand. Let's walk through the numbers. # Card 1 The first card is easy: we're not holding anything, so we always choose to hold the first card. The chance we're holding this card is 1/1, or 100%. # Card 2 This time we have a real choice. We can keep hold of the card we have, or replace it with the new one. We've said that we're going to do this with a 1/n chance, where n is the number of cards we've seen so far. So our chance of replacing the first card is 1/2, or 50%, and our chance of keeping hold of the first card is its chance of being chosen last time multiplied by its chance of being replaced, so 100% * 1/2, which is again 50%. # Card 3 The card we're holding has a 50% chance of being there. This is true regardless what happened up to this point. No matter whether we're holding card 1 or card 2, it's 50%. The new card has a 1/3 chance of being selected, so the card we're holding has a 1/3 chance of being replaced. This means that our held card has a 2/3 chance of remaining held. So its chances of "surviving" this round are 50% * 2/3. # Card N This pattern continues for as many cards as you want to draw. We can express both options as formulas. Drag the slider to substitute n with real numbers and see that the two formulas are always equal. If 1/n is the chance of choosing the new card, 1/(n-1) is the chance of choosing the new card from the previous draw. The chance of not choosing the new card is the complement of 1/n, which is 1-(1/n). Below are the cards again except this time set up to use 1/n instead of a coin flip. Click to the end of the deck. Does it feel fair to you? There's a good chance that through the 2nd half of the deck, you never swap your chosen card. This feels wrong, at least to me, but as we saw above the numbers say it is completely fair. # Choosing multiple cards Now that we know how to select a single card, we can extend this to selecting multiple cards. There are 2 changes we need to make: Rather than new cards having a 1/n chance of being selected, they now have a k/n chance, where k is the number of cards we want to choose. When we decide to replace a held card, we choose one of the k cards we're holding at random. So our new previous card selection formula becomes k/(n-1) because we're now holding k cards. Then the chance that any of the cards we're holding get replaced is 1-(1/n). Let's see how this plays out with real numbers. The fairness still holds, and will hold for any k and n pair. This is because all held cards have an equal chance of being replaced, which keeps them at an equal likelihood of still being in your hand every draw. A nice way to implement this is to use an array of size k. For each new card, generate a random number between 0 and n. If the random number is less than k, replace the card at that index with the new card. Otherwise, discard the new card. And that's how reservoir sampling works! # Applying this to log collection Let's take what we now know about reservoir sampling and apply it to our log collection service. We'll set k=5, so we're "holding" at most 5 log messages at a time, and every second we will send the selected logs to the log collection service. After we've done that, we empty our array of size 5 and start again. This creates a "lumpy" pattern in the graph below, and highlights a trade-off when using reservoir sampling. It's no longer a real-time stream of logs, but chunks of logs sent at an interval. However, sent logs never exceeds the threshold , and during quiet periods the two lines track each other almost perfectly. No logs lost during quiet periods, and never more than threshold logs per second sent during spikes. The best of both worlds. It also doesn't store more than k=5 logs, so it will have predictable memory usage. # Further reading Something you may have thought while reading this post is that some logs are more valuable than others. You almost certainly want to keep all error logs, for example. For that use-case there is a weighted variant of reservoir sampling. I wasn't able to find a simpler explanation of it, so that link is to Wikipedia which I personally find a bit hard to follow. But the key point is that it exists and if you need it you can use it. # Conclusion Reservoir sampling is one of my favourite algorithms, and I've been wanting to write about it for years now. It allows you to solve a problem that at first seems impossible, in a way that is both elegant and efficient. Thank you again to ittybit for sponsoring this post. I really couldn't have hoped for a more supportive first sponsor. Thank you for believing in and understanding what I'm doing here. Thank you to everyone who read this post and gave their feedback. You made this post much better than I could have done on my own, and steered me away from several paths that just weren't working. If you want to tell me what you thought of this post by sending me an anonymous message that goes directly to my phone, go to https://samwho.dev/ping.
body { text-wrap: pretty; } @media (prefers-reduced-motion: reduce) { * { transition: none; animation: none; } } turing-machine { width: 100%; display: block; position: relative; padding-bottom: 1em; } turing-machine .program-container { position: relative; display: flex; justify-content: center; } turing-machine table { border: none; font-family: Fira Code; border-collapse: collapse; border-spacing: 0; margin: 1px; margin-top: 0.5em; width: auto; } turing-machine thead td { text-align: center; } turing-machine td { text-align: left; padding-left: 3vw; padding-right: 3vw; padding-top: 0.2em; padding-bottom: 0.2em; border: 1px dashed #bbbbbb; } turing-machine thead td { border: 0; } turing-machine .container { z-index: 1; background-color: white; } turing-machine .svg-container { padding-bottom: 1px; padding-top: 0.5em; line-height: 0; overflow-x: scroll; overflow-y: hidden; position: relative; scrollbar-width: none; cursor: grab; cursor: -webkit-grab; } turing-machine .svg-container::-webkit-scrollbar { display: none; } turing-machine .controls-container { line-height: 1.5; display: flex; align-items: center; justify-content: center; gap: 0.3em; margin-top: 0.2em; padding-bottom: 0.2em; } turing-machine .controls-container button { color: black; background-color: #aaaaaa; border-radius: 30%/50%; padding: 0.3em; height: 2em; min-width: 3em; text-align: center; border: none; cursor: pointer; } turing-machine .controls-container select { color: black; border: 2px solid #aaaaaa; background-color: #ffffff; border-radius: 30%/50%; padding: 0.15em; height: 2em; min-width: 3em; text-align: center; cursor: pointer; font-family: "Fira Code"; } turing-machine .controls-container button:disabled { background-color: #eeeeee; color: #aaaaaa; cursor: auto; } turing-machine .controls-container button:disabled:hover { background-color: #eeeeee; } turing-machine .controls-container button:hover { background-color: #999999; } turing-machine .controls-container button:active { background-color: #888888; } turing-machine .error { width: 80%; margin-top: 0.5em; margin-bottom: 0.5em; border-radius: 10px; background-color: #ffdddd; color: red; font-family: "Fira Code"; font-size: 1.5em; font-weight: bold; text-align: center; margin: auto; } .dragging { cursor: grabbing !important; cursor: -webkit-grabbing !important; } .sticky { position: sticky !important; top: 0; } turing-machine .gradient { content: ""; z-index: 2; position: sticky; top: 0; left: 0; pointer-events: none; background-image: linear-gradient(90deg, rgba(255,255,255,1) 0%, rgba(255,255,255,0) 20%, rgba(255,255,255,0) 80%, rgba(255,255,255,1) 100%); width: 100%; margin-top: -62px; height: 64px; } turing-machine svg { user-select: none; /*touch-action: none;*/ } turing-machine svg text { font-family: "Fira Code"; } turing-machine tr { padding-top: 0.5em; padding-bottom: 0.5em; } turing-machine .instruction { color: black; display: inline-block; border-radius: 0.5em; padding: 0.1em; min-width: 4.5ch; height: 100%; text-align: center; } turing-machine .pointer { display: inline-block; border-radius: 0.5em; position: absolute; z-index: -1; } turing-machine program { display: none; } turing-machine controls { display: none; } turing-machine.heading svg text { font-family: "Lora"; font-weight: bold; } figure.hero { font-family: "Lora"; text-align: center; padding-top: 1em; padding-bottom: 2em; margin-top: 1em; width: 100%; max-width: 100%; } figure.hero turing-machine { padding: 0; margin: 0; } figure.hero turing-machine .svg-container { margin: 0; padding-top: 0; } figure.hero turing-machine .gradient { top: 0; } figure.hero .signature { margin-top: 0.5rem; max-width: 200px; } figure.hero .portrait { margin-top: 1em; width: 250px; max-width: 250px; } figure.hero p { padding: 0; font-size: 0.9em; } figure.hero figcaption { font-family: "Lora"; font-style: small-caps; font-size: 0.8em; } @keyframes pulse { 0% { transform: scale(1); } 50% { transform: scale(1.1); } 100% { transform: scale(1); } } .symbol { display: inline-block; border: 1px solid black; width: 1.5em; height: 1.5em; font-family: "Fira Code"; text-align: center; } .inline-instruction { display: inline-block; font-family: "Fira Code"; background-color: rgba(0, 158, 115, 0.7); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-instruction) { margin-top: 0.35em; margin-bottom: 0.35em; } .inline-state { display: inline-block; font-family: "Fira Code"; background-color: rgb(86, 180, 233); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-state) { margin-top: 0.35em; margin-bottom: 0.35em; } .inline-value { display: inline-block; font-family: "Fira Code"; background-color: rgb(230, 159, 0); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-value) { margin-top: 0.35em; margin-bottom: 0.35em; } figure { margin: auto; margin-top: 2em; margin-bottom: 2em; } figure img { max-width: 400px; } figure figcaption { max-width: 500px; color: #666; font-size: 0.8em; font-style: italic; text-align: center; font-family: "Lora"; text-wrap: pretty; margin: auto; } ALAN M. TURING 23 June 1912 – 7 June 1954 B B | | L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) -> F In 1928, David Hilbert, one of the most influential mathematicians of his time, asked whether it is possible to create an algorithm that could determine the correctness of a mathematical statement. This was called the "decision problem," or "Entscheidungsproblem" in Hilbert's native German. In 1936 both Alan Turing and Alonzo Church independently reached the conclusion, using different methods, that the answer is "no." The way Turing did it was to imagine a "universal machine", a machine that could compute anything that could be computed. This idea, the "Turing machine" as Alonzo Church christened it in 1937, laid the foundations for the device you are using to read this post. If we look hard enough we can see Turing's legacy in today's CPUs. By the end of this post, you will know: What a Turing machine is. What can and cannot be computed. What it means to be Turing complete. How modern computers relate to Turing machines. How to write and run your own programs for a Turing machine. # What is a Turing machine? You might expect a universal machine, capable of computing anything that can be computed, to be a complex device. Nothing could be further from the truth. The machine has just 4 parts, and the language used to program it has just 5 instructions. theoretical machine. It was created as a thought experiment to explore the limits of what can be computed. Some have of course been built, but in 1936 they existed only in the heads of Turing and those who read his paper. The parts are: a !tape, a !head, a !program, and a !state. When you're ready, go ahead and press !play. start What you're seeing here is a program that executes P(0) to print 0 to the tape, moves the head right with the R instruction, then !jumps back to the start. It will go on printing 0s forever. At any point, feel free to !pause, step the machine !forwards or !backwards one instruction at a time, or !restart the program from the beginning. There is also a speed selector on the far right of the controls if you want to speed up the machine. Notice that !state and !value never change. Every time the machine performs a !jump, the current state and value are used to pick the correct next row of instructions to execute. This program only has a single state, start, and every time it jumps, the symbol under the !head is !blank. Let's take a look at a program with multiple states. one one | | P(1) R -> zero This program prints alternating 0s and 1s to the tape. It has 2 states, zero and one, to illustrate what happens when you !jump to a different state. Things moving too fast? The slider below can be used to adjust the speed of all the Turing machines on this page. 100% You can also achieve this same result by using a single !state and alternating the !value. Here's an example of that: start start | 1 | R P(0) -> start start | 0 | R P(1) -> start The !value column always stays up to date with what the current symbol is under the !head, then when we !jump that value is used to know which row of instructions to execute. Combining state and value gives us a surprising amount of control over what our !program does. We've so far seen 3 instructions: P prints a given symbol to the tape. R moves the tape head right. ↪︎ jumps to a given state. There are 2 more: L moves the tape head left. H halts the machine. 1 1 | | P(a) L -> 2 2 | | P(l) L -> 3 3 | | P(A) H This program prints the word "Alan" from right to left then halts. If you can't see the full word, you can drag the !tape left and right. If the machine has halted, you can use !restart to start it again from the beginning. All of it! You're probably not going to get Crysis running at 60fps on a simulated Turing machine, but all of the calculations required to render each frame can be done with just these 5 instructions. Everything you have ever seen a computer do can be done with a Turing machine. We'll see a glimpse of how that can work in practice a little later. The last example I want to show you before we move on is the very first program Alan Turing showed the world. It's the first program featured in his 1936 paper: "On Computable Numbers, with an application to the Entsheidungsproblem." c c | | R -> e e | | P(1) R -> k k | | R -> b Turing liked to leave spaces between symbols, going as far as to even define them as "F-squares" and "E-squares". F for figure, and E for erasable. His algorithms would often make use of E-squares to help the machine remember the location of specific !tape squares. # What does it mean to compute? Something is said to be "computable" if there exists an algorithm that can get from the given input to the expected output. For example, adding together 2 integers is computable. Here I'm giving the machine a !tape that starts out with the values 2 and 6 in binary separated by a !blank. b1 b1 | 1 | R -> b1 b1 | | R -> b2 b2 | 0 | R -> b2 b2 | 1 | R -> b2 b2 | | L -> dec dec | 0 | P(1) L -> dec dec | 1 | P(0) L -> b3 dec | | H b3 | 0 | L -> b3 b3 | 1 | L -> b3 b3 | | L -> inc inc | 0 | P(1) R -> b1 inc | 1 | P(0) L -> inc inc | | P(1) R -> b1 This program adds the two numbers together, arriving at the answer 8. It does this by decrementing from the right number and incrementing the left number until the right number is 0. Here's the purpose of each state: b1 Move right until the first !blank square is found. This is to navigate past the first number. b2 Move right until the second !blank square is found. This is to navigate past the second number. dec Decrement the current number by 1. In practice this will always be the right number. It decrements by flipping bits until it either reaches a 1, where it will navigate back to the left number, or a !blank, which it will interpret as the number having hit 0. b3 Move left until the first !blank square again. This is only ever used after we've decremented the rightmost number. We can't reuse the b1 state here because when we reach the !blank, we want to jump to inc. inc Increment the current number by 1. Similar to decrementing, this will only ever happen on the leftmost number in practice. This is done by flipping bits until we reach a 0, at which point we navigate back to the right number. That we can write this program at all is proof that addition is computable, but it also implies that all integers are computable. If we can add any 2 integers, we can compute any other integer. 1 is 0+1, 2 is 1+1, 3 is 2+1, and so on. # Binary vs Decimal You may have wondered why I'm choosing to work with binary numbers rather than decimal. It's not just because that's how modern computers work. I'm going to show you 2 examples, and from those examples you'll be able to see why modern computers choose to work in binary. The first example is a program that increments a binary number in an endless loop. move move | 0 | R -> move move | | L -> flip flip | 0 | P(1) R -> move flip | 1 | P(0) L -> flip flip | | P(1) R -> move The second example is a program that increments a decimal number in an endless loop. back inc | 1 | P(2) -> back inc | 2 | P(3) -> back inc | 3 | P(4) -> back inc | 4 | P(5) -> back inc | 5 | P(6) -> back inc | 6 | P(7) -> back inc | 7 | P(8) -> back inc | 8 | P(9) -> back inc | 9 | P(0) L -> inc inc | | P(1) -> back back | | L -> inc back | * | R -> back These two programs are doing the same thing, but the program for manipulating decimal numbers is much longer. We've even introduced some new syntax, the * symbol, to handle a !value under the !head that does not match any of the other values for that !state. It's for this reason when programming Turing machines we prefer binary numbers: the programs end up being shorter and easier to reason about. This benefit also translates to the physical world. Components that switch between 2 states are cheaper, smaller, and more reliable than components that switch between 10. It was more practical to build computers that worked in binary than ones that work in decimal, though attempts to build decimal computers were made. # What can't be computed? To approach this question we need to explain the "Halting problem." It goes like this: The answer is no, and this is what Turing essentially proved. The proof is complicated and I'm not ashamed to admit I don't understand it, but there is an example I can give you that can be intuitively understood to be "undecidable." Imagine you write a program that takes as its input the program being used to decide whether it will halt or not. What it then does is run the decider program on itself, and then do the opposite of what the decider program says. function undecidable(willHalt) { if (willHalt(undecidable)) { while (true); } else { return true; } } This program intentionally enters an infinite loop if it is told it will halt, and halts if it is told it will run forever. It seems like a silly example, the kind of answer a cheeky high school student might try to get away with, but it is a legitimate counterexample to the idea that the halting problem can be solved. If you were to imagine encoding the program and input into something that could be represented on the !tape, there would be no !program that could determine whether the program would halt or not. Imagining this encoding becomes quite natural when you realise that modern programs are encoded as binary data to be saved to disk. # What does it mean to be Turing complete? If you've been involved in the world of programming for more than a few years, there's a good chance you've come across the term "Turing complete." Most likely in the context of things that really ought not to be Turing complete, like C++ templates, TypeScript's type system or Microsoft Excel. But what does it mean? Like the Halting problem, the proof is complicated but there's a straightforward test you can apply to something to judge it Turing complete: I've written this post, with the Turing machine simulations, in JavaScript. Therefore JavaScript is Turing complete. The C++ template example given above simulates a Turing machine in C++'s template system. The TypeScript example takes the route of writing an interpreter for a different Turing complete language. You're right, and everyone tends to cheat a bit with the definition. When someone says something is Turing complete, what they mean is it would be Turing complete if it had an infinite amount of memory. The infinite tape limitation means no Turing machine could ever exist in our physical reality, so that requirement tends to get waived. # How does this all relate to modern computers? If you read around the topic of Turing machines outside of this post, you might see it said that modern computers are effectively Turing machines. You would be forgiven for finding it difficult to imagine how you go from adding 2 integers in binary on a !tape to running a web browser, but the line is there. A key difference between our Turing machine and the device you're reading this on is that your device's CPU has "registers." These are small pieces of memory that live directly on the CPU and are used to store values temporarily while they're being operated on. Values are being constantly loaded from memory into registers and saved back again. You can think of registers as variables for your CPU, but they can only store fixed-size numbers. We can create registers in our Turing machine. We can do this by creating a "format" for our tape. Here we define 3 registers: A, B, and C. Each register contains a 3 bits and can store numbers between 0 and 7. Then at the far left we have an H, which stands for "home", which will help us navigate. To increment register C, we can write a program like this: goto_c goto_c | C | R R R -> inc inc | 0 | P(1) -> goto_h goto_h | * | L -> goto_h goto_h | H | H We're making a lot more liberal use of the * symbol here to help us navigate to specific parts of the !tape without having to enumerate all possible values that could be under the !head on the way there. This program is effectively equivalent to the following x86 assembly code, if x86 had a register named c: mov c, 0 ; Load 0 into c inc c ; Increment c by 1 If we wanted to add values in A and B, storing the result in C, we need to do more work. Here's the assembly code we're trying to replicate: mov a, 2 ; Load 2 into a mov b, 3 ; Load 3 into b add c, a ; Add a to c add c, b ; Add b to c Before you scroll down I will warn you that the program is long and complex. It is the last program we will see in this post, and I don't expect you to understand it in full to continue to the end. Its main purpose is to show you that we can implement operations seen in modern assembly code on a Turing machine. initb initb | * | R P(0) R P(1) R P(1) R -> start start | * | L -> start start | H | R -> go_a go_a | * | R -> go_a go_a | A | R R R -> dec_a dec_a | 0 | P(1) L -> cry_a dec_a | 1 | P(0) -> go_c1 cry_a | 0 | P(1) L -> cry_a cry_a | 1 | P(0) -> go_c1 cry_a | * | R P(0) R P(0) R P(0) -> goto_b go_c1 | * | R -> go_c1 go_c1 | C | R R R -> inc_c1 inc_c1 | 0 | P(1) -> h2a inc_c1 | 1 | P(0) L -> cry_ca cry_ca | 0 | P(1) -> h2a cry_ca | 1 | P(0) L -> cry_ca cry_ca | * | R -> h2a h2a | * | L -> h2a h2a | H | R -> go_a goto_b | * | R -> goto_b goto_b | B | R R R -> dec_b dec_b | 0 | P(1) L -> dec_bc dec_b | 1 | P(0) -> go_c2 dec_bc | 0 | P(1) L -> dec_bc dec_bc | 1 | P(0) -> go_c2 dec_bc | * | R P(0) R P(0) R P(0) -> end go_c2 | * | R -> go_c2 go_c2 | C | R R R -> inc_c2 inc_c2 | 0 | P(1) -> go_hb inc_c2 | 1 | P(0) L -> cry_cb cry_cb | 0 | P(1) -> go_hb cry_cb | 1 | P(0) L -> cry_cb cry_cb | * | R -> go_hb go_hb | * | L -> go_hb go_hb | H | R -> goto_b end | * | L -> end end | H | H This is painfully laborious, and it doesn't even precisely match the assembly code. It destroys the values in A and B as it adds them together, and it doesn't handle overflow. But it's a start, and I hope it gives you a glimpse of how this theoretical machine can be built up to operate like a modern CPU. If you watch the program run to completion, something that might strike you is just how much work is required to do something as simple as adding 2 numbers. Turing machines were not designed to be practical, Turing never intended anyone to go out and build one of these machines in the hope it will be useful. Modern machines have circuits within them that can add 2 numbers together by passing 2 electrical signals in and getting the sum as a single signal out. This happens in less than a nanosecond. Modern machines have memory where any byte can be accessed at any time, no tape manipulation required. This memory access takes a few dozen nanoseconds. # Writing and running your own programs I've built a web-based development environment for writing programs that will run on the Turing machine visualisations you've seen throughout the post. You can access the editor here. I encourage you to play around with it. Set a simple goal, like adding together 2 numbers without going back to look at the way I did it in the post. It's a great way to get a feel for how the machine works. # Conclusion To recap, we've covered: What a Turing machine is. What can and cannot be computed. What it means to be Turing complete. How modern computers relate to Turing machines. And you now have access to an environment for writing and running your own Turing machine programs. If you use it to make something neat, please do reach out to me and show me! My email address is hello@samwho.dev. # Further reading Turing's original paper on computable numbers. The Annotated Turing I referenced this throughout the making of this post. It is a fabulous read, strongly recommend. Alan Turing: The Enigma by Andrew Hodges. An excellent biography of Turing, I read this during the writing of this post. Calculating a Mandelbrot Set using a Turing Machine. This was exceptionally useful for me to understand how to get from Turing machines to modern computers. # Acknowledgements These posts are never a solo effort, and this one is no exception. Sincere thanks go to the following people: To my wife, Sophie, who drew the biographical sketches you're seeing at the end here, and for putting up with my incessant talking about this post the last 2 weeks. Everyone who let me watch them read this post in real-time over a video call and gave me feedback: Jaga Santagostino, Robert Aboukhalil, Tarun Verghis, Tyler Sparks. Everyone who came to hang out and help out in the Twitch streams when I was building out the early versions of the Turing machine visualisations. Everyone who supports my work on Patreon. Everyone who works on the tools used to build this post: TypeScript, Bun, Two.js, Tween.js, Monaco, Peggy, Zed, and many other indirect dependencies. We really do stand upon the shoulders of giants. Hut 8, Bletchley Park, where Turing worked during World War II. It was in this hut that Alan worked with his team to break the German Naval Enigma code.
.bf { width: 100%; height: 150px; } @media only screen and (min-width: 320px) and (max-width: 479px) { .bf { height: 200px; } } @media only screen and (min-width: 480px) and (max-width: 676px) { .bf { height: 200px; } } @media only screen and (min-width: 677px) and (max-width: 991px) { .bf { height: 150px; } } form { display: flex; flex-direction: column; align-items: center; justify-content: stretch; } input { border: 1px solid rgb(119, 119, 119); padding: 0.25rem; border-radius: 0.25rem; height: 2em; line-height: 2em; } .aside { padding: 2rem; width: 100vw; position: relative; margin-left: -50vw; left: 50%; background-color: #eeeeee; display: flex; align-items: center; flex-direction: column; } .aside > * { flex-grow: 1; } .aside p { padding-left: 1rem; padding-right: 1rem; max-width: 780px; font-style: italic; font-family: Lora, serif; text-align: center; } Everyone has a set of tools they use to solve problems. Growing this set helps you to solve ever more difficult problems. In this post, I'm going to teach you about a tool you may not have heard of before. It's a niche tool that won't apply to many problems, but when it does you'll find it invaluable. It's called a "bloom filter." Before you continue! This post assumes you know what a hash function is, and if you don't it's going to be tricky to understand. Sam has written a post about hash functions, and recommendeds that you read this first. # What bloom filters can do Bloom filters are similar to the Set data structure. You can add items to them, and check if an item is present. Here's what it might look like to use a bloom filter in JavaScript, using a made-up BloomFilter class: let bf = new BloomFilter(); bf.add("Ant"); bf.add("Rhino"); bf.contains("Ant"); // true bf.contains("Rhino"); // true While this looks almost identical to a Set, there are some key differences. Bloom filters are what's called a probabalistic data structure. Where a Set can give you a concrete "yes" or "no" answer when you call contains, a bloom filter can't. Bloom filters can give definite "no"s, but they can't be certain about "yes." In the example above, when we ask bf if it contains "Ant" and "Rhino", the true that it returns isn't a guarantee that they're present. We know that they're present because we added them just a couple of lines before, but it would be possible for this to happen: let bf = new BloomFilter(); bf.add("Ant"); bf.add("Rhino"); bf.contains("Fox"); // true We'll demonstrate why over the course of this post. For now, we'll say that when bloom filters return true it doesn't mean "yes", it means "maybe". When this happens and the item has never been added before, it's called a false-positive. The opposite, claiming "no" when the answer is "yes," is called a false-negative. A bloom filter will never give a false-negative, and this is what makes them useful. It's not strictly lying, it's just not giving you a definite answer. Let's look at an example where we can use this property to our advantage. # When bloom filters are useful Imagine you're building a web browser, and you want to protect users from malicious links. You could build and maintain a list of all known malicious links and check the list every time a user navigates the browser. If the link they're trying to visit is in the list, you warn the user that they might be about to visit a malicious website. If we assume there are, say, 1,000,000 malicious links on the Internet, and each link is 20 characters long, then the list of malicious links would be 20MB in size. This isn't a huge amount of data, but it's not small either. If you have lots of users and want to keep this list up to date, the bandwidth could add up. However, if you're happy to accept being wrong 0.0001% of the time (1 in a million), you could use a bloom filter which can store the same data in 3.59MB. That's an 82% reduction in size, and all it costs you is showing the user an incorrect warning 1 in every million links visited. If you wanted to take it even further, and you were happy to accept being wrong 0.1% of the time (1 in 1000), the bloom filter would only be 1.8MB. This use-case isn't hypothetical, either. Google Chrome used a bloom filter for this exact purpose until 2012. If you were worried about showing a warning when it wasn't needed, you could always make an API that has the full list of malicious links in a database. When the bloom filter says "maybe," you would then make an API call to check the full list to be sure. No more spurious warnings, and the bloom filter would save you from having to call the API for every link visited. # How bloom filters work At its core, a bloom filter is an array of bits. When it is created, all of the bits are set to 0. We're going to represent this as a grid of circles, with each circle representing 1 bit. Our bloom filters in this post are all going to have 32 bits in total. this one and let me know what you think. Click here to go back to normal. To add an item to the bloom filter, we're going to hash it with 3 different hash functions, then use the 3 resulting values to set 3 bits. If you're not familiar with hashing, I recommend reading my post about it before continuing. For this post I'm choosing to use 3 of the SHA family of hash functions: sha1, sha256, and sha512. Here's what our bloom filter looks like if we add the value "foo" to it: The bits in positions 15, 16 and 27 have been set. Other bits, e.g. 1 have not been set. You can hover or tap the bits in this paragraph to highlight them in the visualisation. We get to this state by taking the hash value of "foo" for each of our 3 hash functions and modulo it by the number of bits in our bloom filter. Modulo gets us the remainder when dividing by 32, so we get 27 with sha1, 15 with sha256 and 16 with sha512. The table below shows what's happening, and you can try inputting your own values to see what bits they would set if added. Go ahead and add a few of your own values to our bloom filter below and see what happens. There's also a check button that will tell you if a value is present within the bloom filter. A value is only considered present if all of the bits checked are set. You can start again by hitting the clear button. You might occasionally notice that only 2, or even 1, bits get set. This happens when 2 or more of our hash functions produce the same value, or we attempt to set a bit that has already been set. Taking that a bit further, have a think about the implications of a bloom filter that has every bit set. bit is set, then won't the bloom filter claim it contains every item you check? That's a false-positive every time! Exactly right. A bloom filter with every bit set is equivalent to a Set that always returns true for contains. It will claim to contain everything you ask it about, even if that thing was never added. # False-positive rates The rate of false-positives in our bloom filter will grow as the percentage of set bits increases. Drag the slider below the graph to see how the false-positive rate changes as the number of set bits increases. It grows slowly at first, but as we get closer to having all bits set the rate increases. This is because we calculate the false-positive rate as x^3, where x is the percentage of set bits and 3 is the number of hash functions used. To give an example of why we calculate it with this formula, imagine we have a bloom filter with half of its bits set, x = 0.5. If we assume that our hash function has an equal chance of setting any of the bits, then the chance that all 3 hash functions set a bit that is already set is 0.5 * 0.5 * 0.5, or x^3. Let's have a look at the false-positive rate of bloom filters that use different numbers of hash functions. The problem that using lots of hash functions introduces is that it makes the bloom filter fill up faster. The more hash functions you use, the more bits get set for each item you add. There's also the cost of hashing itself. Hash functions aren't free, and while the hash functions you'd use in a bloom filter try to be as fast as possible, it's still more expensive to run 100 of them than it is to run 3. It's possible to calculate how full a bloom filter will be after inserting a number of items, based on the number of hash functions used. The graph below assumes a bloom filter with 1000 bits. The more hash functions we use, the faster we set all of the bits. You'll notice that the curve tails off as more items are added. This is because the more bits that are set, the more likely it is that we'all attempt to set a bit that has already been set. In practice, 1000 bits is a very small bloom filter, occupying only 125 bytes of memory. Modern computers have a lot of memory, so let's crank this up to 100,000 bits (12.5kB) and see what happens. The lines barely leave the bottom of the graph, meaning the bloom filter will be very empty and the false-positive rate will be low. All this cost us was 12.5kB of memory, which is still a very small amount by 2024 standards. # Tuning a bloom filter Picking the correct number of hash functions and bits for a bloom filter is a fine balance. Fortunately for us, if we know up-front how many unique items we want to store, and what our desired false-positive rate is, we can calculate the optimal number of hash functions, and the required number of bits. The bloom filter page on Wikipedia covers the mathematics involved, which I'm going to translate into JavaScript functions for us to use. I want to stress that you don't need to understand the maths to use a bloom filter or read this post. I'm including the link to it only for completeness. # Optimal number of bits The following JavaScript function, which might look a bit scary but bear with me, takes the number of items you want to store (items) and the desired false-positive rate (fpr, where 1% == 0.01), and returns how many bits you will need to achieve that false-positive rate. function bits(items, fpr) { const n = -items * Math.log(fpr); const d = Math.log(2) ** 2; return Math.ceil(n / d); } We can see how this grows for a variety of fpr values in the graph below. # Optimal number of hash functions After we've used the JavaScript above to calculate how many bits we need, we can use the following function to calculate the optimal number of hash functions to use: function hashFunctions(bits, items) { return Math.ceil((bits / items) * Math.log(2)); } Pause for a second here and have a think about how the number of hash functions might grow based on the size of the bloom filter and the number of items you expect to add. Do you think you'll use more hash functions, or fewer, as the bloom filter gets larger? What about as the number of items increases? The more items you plan to add, the fewer hash functions you should use. Yet, a larger bloom filter means you can use more hash functions. More hash functions keep the false-positive rate lower for longer, but more items fills up the bloom filter faster. It's a complex balancing act, and I am thankful that mathematicians have done the hard work of figuring it out for us. # Caution While we can stand on the shoulders of giants and pick the optimal number of bits and hash functions for our bloom filter, it's important to remember that these rely on you giving good estimates of the number of items you expect to add, and choosing a false-positive rate that's acceptable for your use-case. These numbers might be difficult to come up with, and I recommend erring on the side of caution. If you're not sure, it's likely better to use a larger bloom filter than you think you need. # Removing items from a bloom filter We've spent the whole post talking about adding things to a bloom filter, and the optimal parameters to use. We haven't spoken at all about removing items. And that's because you can't! In a bloom filter, we're using bits, individual 1s and 0s, to track the presence of items. If we were to remove an item by setting its bits to 0, we might also be removing other items by accident. There's no way of knowing. Click the buttons of the bloom filter below to see this in action. First we will add "foo", then "baz", and then we will remove "baz". Hit "clear" if you want to start again. The end result of this sequence is a bloom filter that doesn't contain "baz", but doesn't contain "foo" either. Because both "foo" and "baz" set bit 27, we accidentally clobber the presence of "foo" while removing "baz". Something else you might have noticed playing with the above example is that if you add "foo" and then attempt to remove "baz" before having added it, nothing happens. Even though 27 is set, bits 18 and 23 are not, so the bloom filter cannot contain "baz". Because of this, it won't unset 27. # Counting bloom filters While you can't remove items from a standard bloom filter, there are variants that allow you to do so. One of these variants is called a "counting bloom filter," which uses an array of counters instead of bits to keep track of items. Now when you go through the sequence, the end result is that the bloom filter still contains "foo." It solves the problem. The trade-off, though, is that counters are bigger than bits. With 4 bits per counter you can increment up to 15. With 8 bits per counter you can increment up to 255. You'll need to pick a counter size sufficient to never reach the maximum value, otherwise you risk corrupting the bloom filter. Using 8x more memory than a standard bloom filter could be a big deal, especially if you're using a bloom filter to save memory in the first place. Think hard about whether you really need to be able to remove items from your bloom filter. Counting bloom filters also introduce the possibility of false-negatives, which are impossible in standard bloom filters. Consider the following example. Because "loved" and "response" both hash to the bits 5, 22, and 26, when we remove "response" we also remove "loved". If we write this as JavaScript the problem becomes more clear: let bf = new CountingBloomFilter(); bf.add("loved"); bf.add("your"); bf.remove("response"); bf.contains("loved"); // false Even though we know for sure we've added "loved" in this snippet, the call to contains will return false. This sort of false-negative can't happen in a standard bloom filter, and it removes one of the key benefits of using a bloom filter in the first place: the guarantee of no false-negatives. # Bloom filters in the real-world Real-world users of bloom filters include Akamai, who use them to avoid caching web pages that are accessed once and never again. They do this by storing all page accesses in a bloom filter, and only writing them into cache if the bloom filter says they've been seen before. This does result in some pages being cached on the first access, but that's fine because it's still an improvement. It would be impractical for them to store all page accesses in a Set, so they accept the small false-positive rate in favour of the significantly smaller bloom filter. Akamai released a paper about this that goes into the full details if you're interested. Google's BigTable is a distributed key-value store, and uses bloom filters internally to know what keys are stored within. When a read request for a key comes in, a bloom filter in memory is first checked to see if the key is in the database. If not, BigTable can respond with "not found" without ever needing to read from disk. Sometimes the bloom filter will claim a key is in the database when it isn't, but this is fine because when that happens a disk access will confirm the key in fact isn't in the database. # Conclusion Bloom filters, while niche, can be a huge optimisation in the right situation. They're a wonderful application of hash functions, and a great example of making a deliberate trade-off to achieve a specific goal. Trade-offs, and combining simpler building blocks to create more complex, purpose-built data structures, are present everywhere in software engineering. Being able to spot where a data structure could net a big win can separate you from the pack, and take your career to the next level. I hope you've enjoyed this post, and that you find a way to apply bloom filters to a problem you're working on. # Acknowledgements Enormous thank you to my reviewers, without whom this post would be a shadow of what you read today. In no particular order: rylon, Indy, Aaron, Sophie, Davis, ed, Michael Drury, Anton Zhiyanov, Christoph Berger.
form { padding-top: 0.5em; padding-left: 0.5em; padding-right: 0.5em; display: flex; justify-content: center; gap: 0.3em; } form input[type=text] { flex: 4 1 auto; min-width: 0; border-radius: 0.3em; border: 1px solid #aaaaaa; padding: 0.3em; } form button { flex: 1 1 auto; max-width: 140px; } form button:disabled { opacity: 0.5 !important; } form button.add { background-color: #009E73; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } form button.check { background-color: #56B4E9; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } form button.clear { background-color: #D55E00; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } .grid-2x2 { display: "grid"; } .grid { user-select: none; cursor: pointer; margin-top: 1rem; margin-bottom: 1rem; border: 1px solid #009E73; width: 100%; display: grid; grid-template-columns: repeat(8, 1fr); grid-template-rows: repeat(2, 1fr); } .grid-item { display: flex; align-items: center; justify-content: center; aspect-ratio: 1/1; } .grid-active { background-color: #009E73; color: white; } .above-grid { display: flex; justify-content: center; } .hash-examples { padding-top: 0.5rem; padding-bottom: 0.5rem; margin: auto; display: flex; flex-direction: column; align-items: center; } .hash-examples div { margin: auto; } .hash-examples code { display: block; white-space: pre; font-weight: bold; } .hash-examples p { font-size: 0.75rem; font-style: italic; text-align: center; font-family: Lora, serif; width: 75%; } .blob { cursor: pointer; background: #CC79A7; display: flex; justify-content: center; align-items: center; font-size: 1.5rem; color: white; border-radius: 50%; margin: 10px; height: 3rem; width: 3rem; min-width: 3rem; max-width: 3rem; box-shadow: 0 0 0 0 #CC79A7FF; transform: scale(1); animation: pulse 2s infinite; } @keyframes pulse { 0% { transform: scale(0.85); box-shadow: 0 0 0 0 #CC79A77F; } 70% { transform: scale(1); box-shadow: 0 0 0 1rem rgba(0, 0, 0, 0); } 100% { transform: scale(0.85); box-shadow: 0 0 0 0 rgba(0, 0, 0, 0); } } .blob-click { cursor: default; animation: tick 1s linear; background: #009E73FF; } @keyframes tick { 0% { transform: scale(1); box-shadow: 0 0 0 0 #009E73FF; } 50% { box-shadow: 0 0 0 1rem #009E737F; } 100% { box-shadow: 0 0 0 2rem #009E7300; } } .aside { padding: 2rem; width: 100vw; position: relative; margin-left: -50vw; left: 50%; background-color: #eeeeee; display: flex; align-items: center; flex-direction: column; } .aside > * { flex-grow: 1; } .aside p { padding-left: 1rem; padding-right: 1rem; max-width: 780px; font-style: italic; font-family: Lora, serif; text-align: center; } .pct25 { width: 100%; height: 200px; } .datasets th { text-align: left; } .datasets { table-layout: fixed; } As a programmer, you use hash functions every day. They're used in databases to optimise queries, they're used in data structures to make things faster, they're used in security to keep data safe. Almost every interaction you have with technology will involve hash functions in one way or another. Hash functions are foundational, and they are everywhere. But what is a hash function, and how do they work? In this post, we're going to demystify hash functions. We're going to start by looking at a simple hash function, then we're going to learn how to test if a hash function is good or not, and then we're going to look at a real-world use of hash functions: the hash map. clicked. # What is a hash function? Hash functions are functions that take an input, usually a string, and produce a number. If you were to call a hash function multiple times with the same input, it will always return the same number, and that number returned will always be within a promised range. What that range is will depend on the hash function, some use 32-bit integers (so 0 to 4 billion), others go much larger. If we were to write a dummy hash function in JavaScript, it might look like this: function hash(input) { return 0; } Even without knowing how hash functions are used, it's probably no surprise that this hash function is useless. Let's see how we can measure how good a hash function is, and after that we'll do a deep dive on how they're used within hash maps. # What makes a hash function good? Because input can be any string, but the number returned is within some promised range, it's possible that two different inputs can return the same number. This is called a "collision," and good hash functions try to minimise how many collisions they produce. It's not possible to completely eliminate collisions, though. If we wrote a hash function that returned a number in the range 0 to 7, and we gave it 9 unique inputs, we're guaranteed at least 1 collision. hash("to") == 3 hash("the") == 2 hash("café") == 0 hash("de") == 6 hash("versailles") == 4 hash("for") == 5 hash("coffee") == 0 hash("we") == 7 hash("had") == 1 To visualise collisions, I'm going to use a grid. Each square of the grid is going to represent a number output by a hash function. Here's an example 8x2 grid. Click on the grid to increment the example hash output value and see how we map it to a grid square. See what happens when you get a number larger than the number of grid squares. { let grid = document.getElementById("first-grid"); let hash = document.getElementById("grid-hash"); let modulo = document.getElementById("grid-modulo"); grid.addEventListener("click", (e) => { e.preventDefault(); let number = parseInt(hash.innerText) + 1; hash.innerText = number.toString(); modulo.innerText = (number % 16).toString(); grid.querySelector(".grid-active").classList.remove("grid-active"); grid.children[number % 16].classList.add("grid-active"); return false; }); }); 13 % 16 == 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Every time we hash a value, we're going to make its corresponding square on the grid a bit darker. The idea is to create an easy way to see how well a hash function avoids collisions. What we're looking for is a nice, even distribution. We'll know that the hash function isn't good if we have clumps or patterns of dark squares. This is a great observation. You're absolutely right, we're going to be creating "pseudo-collisions" on our grid. It's okay, though, because if the hash function is good we will still see an even distribution. Incrementing every square by 100 is just as good a distribution as incrementing every square by 1. If we have a bad hash function that collides a lot, that will still stand out. We'll see this shortly. Let's take a larger grid and hash 1,000 randomly-generated strings. You can click on the grid to hash a new set of random inputs, and the grid will animate to show you each input being hashed and placed on the grid. The values are nice and evenly distributed because we're using a good, well-known hash function called murmur3. This hash is widely used in the real-world because it has great distribution while also being really, really fast. What would our grid look like if we used a bad hash function? function hash(input) { let hash = 0; for (let c of input) { hash += c.charCodeAt(0); } return hash % 1000000; } This hash function loops through the string that we're given and sums the numeric values of each character. It then makes sure that the value is between 0 and 1000000 by using the modulus operator (%). Let's call this hash function stringSum. Here it is on the grid. Reminder, this is 1,000 randomly generated strings that we're hashing. This doesn't look all that different from murmur3. What gives? The problem is that the strings we're giving to be hashed are random. Let's see how each function performs when given input that is not random: the numbers from 1 to 1000 converted to strings. Now the problem is more clear. When the input isn't random, the output of stringSum forms a pattern. Our murmur3 grid, however, looks the same as how it looked with random values. How about if we hash the top 1,000 most common English words: It's more subtle, but we do see a pattern on the stringSum grid. As usual, murmur3 looks the same as it always does. This is the power of a good hash function: no matter the input, the output is evenly distributed. Let's talk about one more way to visualise this and then talk about why it matters. # The avalanche effect Another way hash functions get evaluated is on something called the "avalanche effect." This refers to how many bits in the output value change when just a single bit of the input changes. To say that a hash function has a good avalanche effect, a single bit flip in the input should result in an average of 50% the output bits flipping. It's this property that helps hash functions avoid forming patterns in the grid. If small changes in the input result in small changes in the output, you get patterns. Patterns indicate poor distribution, and a higher rate of collisions. Below, we are visualising the avalanche effect by showing two 8-bit binary numbers. The top number is the input value, and the bottom number is the murmur3 output value. Click on it to flip a single bit in the input. Bits that change in the output will be green, bits that stay the same will be red. murmur3 does well, though you will notice that sometimes fewer than 50% of the bits flip and sometimes more. This is okay, provided that it is 50% on average. Let's see how stringSum performs. Well this is embarassing. The output is equal to the input, and so only a single bit flips each time. This does make sense, because stringSum just sums the numeric value of each character in the string. This example only hashes the equivalent of a single character, which means the output will always be the same as the input. # Why all of this matters We've taken the time to understand some of the ways to determine if a hash function is good, but we've not spent any time talking about why it matters. Let's fix that by talking about hash maps. To understand hash maps, we first must understand what a map is. A map is a data structure that allows you to store key-value pairs. Here's an example in JavaScript: let map = new Map(); map.set("hello", "world"); console.log(map.get("hello")); Here we take a key-value pair ("hello" → "world") and store it in the map. Then we print out the value associated with the key "hello", which will be "world". A more fun real-world use-case would be to find anagrams. An anagram is when two different words contain the same letters, for example "antlers" and "rentals" or "article" and "recital." If you have a list of words and you want to find all of the anagrams, you can sort the letters in each word alphabetically and use that as a key in a map. let words = [ "antlers", "rentals", "sternal", "article", "recital", "flamboyant", ] let map = new Map(); for (let word of words) { let key = word .split('') .sort() .join(''); if (!map.has(key)) { map.set(key, []); } map.get(key).push(word); } This code results in a map with the following structure: { "aelnrst": [ "antlers", "rentals", "sternal" ], "aceilrt": [ "article", "recital" ], "aabflmnoty": [ "flamboyant" ] } # Implementing our own simple hash map Hash maps are one of many map implementations, and there are many ways to implement hash maps. The simplest way, and the way we're going to demonstrate, is to use a list of lists. The inner lists are often referred to as "buckets" in the real-world, so that's what we'll call them here. A hash function is used on the key to determine which bucket to store the key-value pair in, then the key-value pair is added to that bucket. Let's walk through a simple hash map implementation in JavaScript. We're going to go through it bottom-up, so we'll see some utility methods before getting to the set and get implementations. class HashMap { constructor() { this.bs = [[], [], []]; } } We start off by creating a HashMap class with a constructor that sets up 3 buckets. We use 3 buckets and the short variable name bs so that this code displays nicely on devices with smaller screens. In reality, you could have however many buckets you want (and better variable names). class HashMap { // ... bucket(key) { let h = murmur3(key); return this.bs[ h % this.bs.length ]; } } The bucket method uses murmur3 on the key passed in to find a bucket to use. This is the only place in our hash map code that a hash function is used. class HashMap { // ... entry(bucket, key) { for (let e of bucket) { if (e.key === key) { return e; } } return null; } } The entry method takes a bucket and a key and scans the bucket until it finds an entry with the given key. If no entry is found, null is returned. class HashMap { // ... set(key, value) { let b = this.bucket(key); let e = this.entry(b, key); if (e) { e.value = value; return; } b.push({ key, value }); } } The set method is the first one we should recognise from our earlier JavaScript Map examples. It takes a key-value pair and stores it in our hash map. It does this by using the bucket and entry methods we created earlier. If an entry is found, its value is overwritten. If no entry is found, the key-value pair is added to the map. In JavaScript, { key, value } is shorthand for { key: key, value: value }. class HashMap { // ... get(key) { let b = this.bucket(key); let e = this.entry(b, key); if (e) { return e.value; } return null; } } The get method is very similar to set. It uses bucket and entry to find the entry related to the key passed in, just like set does. If an entry is found, its value is returned. If one isn't found, null is returned. That was quite a lot of code. What you should take away from it is that our hash map is a list of lists, and a hash function is used to know which of the lists to store and retrieve a given key from. Here's a visual representation of this hash map in action. Click anywhere on the buckets to add a new key-value pair using our set method. To keep the visualisation simple, if a bucket were to "overflow", the buckets are all reset. Because we're using murmur3 as our hash function, you should see good distribution between the buckets. It's expected you'll see some imbalance, but it should generally be quite even. To get a value out of the hash map, we first hash the key to figure out which bucket the value will be in. Then we have to compare the key we're searching for against all of the keys in the bucket. It's this search step that we minimise through hashing, and why murmur3 is optimised for speed. The faster the hash function, the faster we find the right bucket to search, the faster our hash map is overall. This is also why reducing collisions is so crucial. If we did decide to use that dummy hash function from all the way at the start of this article, the one that returns 0 all the time, we'll put all of our key-value pairs into the first bucket. Finding anything could mean we have to check all of the values in the hash map. With a good hash function, with good distribution, we reduce the amount of searching we have to do to 1/N, where N is the number of buckets. Let's see how stringSum does. Interestingly, stringSum seems to distribute values quite well. You notice a pattern, but the overall distribution looks good. stringSum. I knew it would be good for something. Not so fast, Haskie. We need to talk about a serious problem. The distribution looks okay on these sequential numbers, but we've seen that stringSum doesn't have a good avalanche effect. This doesn't end well. # Real-world collisions Let's look at 2 real-world data sets: IP addresses and English words. What I'm going to do is take 100,000,000 random IP addresses and 466,550 English words, hash all of them with both murmur3 and stringSum, and see how many collisions we get. IP Addresses murmur3 stringSum Collisions 1,156,959 99,999,566 1.157% 99.999% English words murmur3 stringSum Collisions 25 464,220 0.005% 99.5% When we use hash maps for real, we aren't usually storing random values in them. We can imagine counting the number of times we've seen an IP address in rate limiting code for a server. Or code that counts the occurrences of words in books throughout history to track their origin and popularity. stringSum sucks for these applications because of it's extremely high collision rate. # Manufactured collisions Now it's murmur3's turn for some bad news. It's not just collisions caused by similarity in the input we have to worry about. Check this out. What's happening here? Why do all of these jibberish strings hash to the same number? I hashed 141 trillion random strings to find values that hash to the number 1228476406 when using murmur3. Hash functions have to always return the same output for a specific input, so it's possible to find collisions by brute force. trillion? Like... 141 and then 12 zeroes? Yes, and it only took me 25 minutes. Computers are fast. Bad actors having easy access to collisions can be devastating if your software builds hash maps out of user input. Take HTTP headers, for example. An HTTP request looks like this: GET / HTTP/1.1 Accept: */* Accept-Encoding: gzip, deflate Connection: keep-alive Host: google.com You don't have to understand all of the words, just that the first line is the path being requested and all of the other lines are headers. Headers are Key: Value pairs, so HTTP servers tend to use maps to store them. Nothing stops us from passing any headers we want, so we can be really mean and pass headers we know will cause collisions. This can significantly slow down the server. This isn't theoretical, either. If you search "HashDoS" you'll find a lot more examples of this. It was a really big deal in the mid-2000s. There are a few ways to mitigate this specific to HTTP servers: ignoring jibberish header keys and limiting the number of headers you store, for example. But modern hash functions like murmur3 offer a more generalised solution: randomisation. Earlier in this post we showed some examples of hash function implementations. Those implementations took a single argument: input. Lots of modern hash functions take a 2nd parameter: seed (sometimes called salt). In the case of murmur3, this seed is a number. So far, we've been using 0 as the seed. Let's see what happens with the collisions I've collected when we use a seed of 1. Just like that, 0 to 1, the collisions are gone. This is the purpose of the seed: it randomises the output of the hash function in an unpredictable way. How it achieves this is beyond the scope of this article, all hash functions do this in their own way. The hash function still returns the same output for the same input, it's just that the input is a combination of input and seed. Things that collide with one seed shouldn't collide when using another. Programming languages often generate a random number to use as the seed when the process starts, so that every time you run your program the seed is different. As a bad guy, not knowing the seed, it is now impossible for me to reliably cause harm. If you look closely in the above visualisation and the one before it, they're the same values being hashed but they produce different hash values. The implication of this is that if you hash a value with one seed, and want to be able to compare against it in the future, you need to make sure you use the same seed. Having different values for different seeds doesn't affect the hash map use-case, because hash maps only live for the duration the program is running. Provided you use the same seed for the lifetime of the program, your hash maps will continue to work just fine. If you ever store hash values outside of your program, in a file for example, you need to be careful you know what seed has been used. # Playground As is tradition, I've made a playground for you to write your own hash functions and see them visualised with the grids seen in this article. Click here to try it! # Conclusion We've covered what a hash function is, some ways to measure how good it is, what happens when it's not good, and some of the ways they can be broken by bad actors. The universe of hash functions is a large one, and we've really only scratched the surface in this post. We haven't spoken about cryptographic vs non-cryptographic hashing, we've touched on only 1 of the thousands of use-cases for hash functions, and we haven't talked about how exactly modern hash functions actually work. Some further reading I recommend if you're really enthusiastic about this topic and want to learn more: https://github.com/rurban/smhasher this repository is the gold standard for testing how good hash functions are. They run a tonne of tests against a wide number of hash functions and present the results in a big table. It will be difficult to understand what all of the tests are for, but this is where the state of the art of hash testing lives. https://djhworld.github.io/hyperloglog/ this is an interactive piece on a data structure called HyperLogLog. It's used to efficiently count the number of unique elements in very, very large sets. It uses hashing to do it in a really clever way. https://www.gnu.org/software/gperf/ is a piece of software that, when given the expected set of things you want to hash, can generate a "perfect" hash function automatically. Feel free to join the discussion on Hacker News! # Acknowledgements Thanks to everyone who read early drafts and provided invaluable feedback. delroth, Manon, Aaron, Charlie And everyone who helped me find murmur3 hash collisions: Indy, Aaron, Max # Patreon After the success of Load Balancing and Memory Allocation, I have decided to set up a Patreon page: https://patreon.com/samwho. For all of these articles going forward, I am going to post a Patreon-exclusive behind-the-scenes post talking about decisions, difficulties, and lessons learned from each post. It will give you a deep look in to how these articles evolve, and I'm really stoked about the one I've written for this one. If you enjoy my writing, and want to support it going forward, I'd really appreciate you becoming a Patreon. ❤️
More in programming
Having spent four decades as a programmer in various industries and situations, I know that modern software development processes are far more stressful than when I started. It's not simply that developing software today is more complex than it was back in 1981. In that early decade, none
In previous articles, we saw how to use “real” UART, and looked into the trick used by Arduino to automatically reset boards when uploading firmware. Today, we’ll look into how Espressif does something similar, using even more tricks. “Real” UART on the Saola As usual, let’s first simply connect the UART adapter. Again, we connect … Continue reading Espressif’s Automatic Reset → The post Espressif’s Automatic Reset appeared first on Quentin Santos.
Hey peoples! Tonight, some meta-words. As you know I am fascinated by compilers and language implementations, and I just want to know all the things and implement all the fun stuff: intermediate representations, flow-sensitive source-to-source optimization passes, register allocation, instruction selection, garbage collection, all of that. It started long ago with a combination of curiosity and a hubris to satisfy that curiosity. The usual way to slake such a thirst is structured higher education followed by industry apprenticeship, but for whatever reason my path sent me through a nuclear engineering bachelor’s program instead of computer science, and continuing that path was so distasteful that I noped out all the way to rural Namibia for a couple years. Fast-forward, after 20 years in the programming industry, and having picked up some language implementation experience, a few years ago I returned to garbage collection. I have a good level of language implementation chops but never wrote a memory manager, and Guile’s performance was limited by its use of the Boehm collector. I had been on the lookout for something that could help, and when I learned of it seemed to me that the only thing missing was an appropriate implementation for Guile, and hey I could do that!Immix I started with the idea of an -style interface to a memory manager that was abstract enough to be implemented by a variety of different collection algorithms. This kind of abstraction is important, because in this domain it’s easy to convince oneself that a given algorithm is amazing, just based on vibes; to stay grounded, I find I always need to compare what I am doing to some fixed point of reference. This GC implementation effort grew into , but as it did so a funny thing happened: the as a direct replacement for the Boehm collector maintained mark bits in a side table, which I realized was a suitable substrate for Immix-inspired bump-pointer allocation into holes. I ended up building on that to develop an Immix collector, but without lines: instead each granule of allocation (16 bytes for a 64-bit system) is its own line.MMTkWhippetmark-sweep collector that I prototyped The is funny, because it defines itself as a new class of collector, fundamentally different from the three other fundamental algorithms (mark-sweep, mark-compact, and evacuation). Immix’s are blocks (64kB coarse-grained heap divisions) and lines (128B “fine-grained” divisions); the innovation (for me) is the discipline by which one can potentially defragment a block without a second pass over the heap, while also allowing for bump-pointer allocation. See the papers for the deets!Immix papermark-regionregionsoptimistic evacuation However what, really, are the regions referred to by ? If they are blocks, then the concept is trivial: everyone has a block-structured heap these days. If they are spans of lines, well, how does one choose a line size? As I understand it, Immix’s choice of 128 bytes was to be fine-grained enough to not lose too much space to fragmentation, while also being coarse enough to be eagerly swept during the GC pause.mark-region This constraint was odd, to me; all of the mark-sweep systems I have ever dealt with have had lazy or concurrent sweeping, so the lower bound on the line size to me had little meaning. Indeed, as one reads papers in this domain, it is hard to know the real from the rhetorical; the review process prizes novelty over nuance. Anyway. What if we cranked the precision dial to 16 instead, and had a line per granule? That was the process that led me to Nofl. It is a space in a collector that came from mark-sweep with a side table, but instead uses the side table for bump-pointer allocation. Or you could see it as an Immix whose line size is 16 bytes; it’s certainly easier to explain it that way, and that’s the tack I took in a .recent paper submission to ISMM’25 Wait what! I have a fine job in industry and a blog, why write a paper? Gosh I have meditated on this for a long time and the answers are very silly. Firstly, one of my language communities is Scheme, which was a research hotbed some 20-25 years ago, which means many practitioners—people I would be pleased to call peers—came up through the PhD factories and published many interesting results in academic venues. These are the folks I like to hang out with! This is also what academic conferences are, chances to shoot the shit with far-flung fellows. In Scheme this is fine, my work on Guile is enough to pay the intellectual cover charge, but I need more, and in the field of GC I am not a proven player. So I did an atypical thing, which is to cosplay at being an independent researcher without having first been a dependent researcher, and just solo-submit a paper. Kids: if you see yourself here, just go get a doctorate. It is not easy but I can only think it is a much more direct path to goal. And the result? Well, friends, it is this blog post :) I got the usual assortment of review feedback, from the very sympathetic to the less so, but ultimately people were confused by leading with a comparison to Immix but ending without an evaluation against Immix. This is fair and the paper does not mention that, you know, I don’t have an Immix lying around. To my eyes it was a good paper, an , but, you know, just a try. I’ll try again sometime.80% paper In the meantime, I am driving towards getting Whippet into Guile. I am hoping that sometime next week I will have excised all the uses of the BDW (Boehm GC) API in Guile, which will finally allow for testing Nofl in more than a laboratory environment. Onwards and upwards! whippet regions? paper??!?
I started writing this early last week but Real Life Stuff happened and now you're getting the first-draft late this week. Warning, unedited thoughts ahead! New Logic for Programmers release! v0.9 is out! This is a big release, with a new cover design, several rewritten chapters, online code samples and much more. See the full release notes at the changelog page, and get the book here! Write the cleverest code you possibly can There are millions of articles online about how programmers should not write "clever" code, and instead write simple, maintainable code that everybody understands. Sometimes the example of "clever" code looks like this (src): # Python p=n=1 exec("p*=n*n;n+=1;"*~-int(input())) print(p%n) This is code-golfing, the sport of writing the most concise code possible. Obviously you shouldn't run this in production for the same reason you shouldn't eat dinner off a Rembrandt. Other times the example looks like this: def is_prime(x): if x == 1: return True return all([x%n != 0 for n in range(2, x)] This is "clever" because it uses a single list comprehension, as opposed to a "simple" for loop. Yes, "list comprehensions are too clever" is something I've read in one of these articles. I've also talked to people who think that datatypes besides lists and hashmaps are too clever to use, that most optimizations are too clever to bother with, and even that functions and classes are too clever and code should be a linear script.1. Clever code is anything using features or domain concepts we don't understand. Something that seems unbearably clever to me might be utterly mundane for you, and vice versa. How do we make something utterly mundane? By using it and working at the boundaries of our skills. Almost everything I'm "good at" comes from banging my head against it more than is healthy. That suggests a really good reason to write clever code: it's an excellent form of purposeful practice. Writing clever code forces us to code outside of our comfort zone, developing our skills as software engineers. Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you [will get excellent debugging practice at exactly the right level required to push your skills as a software engineer] — Brian Kernighan, probably There are other benefits, too, but first let's kill the elephant in the room:2 Don't commit clever code I am proposing writing clever code as a means of practice. Being at work is a job with coworkers who will not appreciate if your code is too clever. Similarly, don't use too many innovative technologies. Don't put anything in production you are uncomfortable with. We can still responsibly write clever code at work, though: Solve a problem in both a simple and a clever way, and then only commit the simple way. This works well for small scale problems where trying the "clever way" only takes a few minutes. Write our personal tools cleverly. I'm a big believer of the idea that most programmers would benefit from writing more scripts and support code customized to their particular work environment. This is a great place to practice new techniques, languages, etc. If clever code is absolutely the best way to solve a problem, then commit it with extensive documentation explaining how it works and why it's preferable to simpler solutions. Bonus: this potentially helps the whole team upskill. Writing clever code... ...teaches simple solutions Usually, code that's called too clever composes several powerful features together — the "not a single list comprehension or function" people are the exception. Josh Comeau's "don't write clever code" article gives this example of "too clever": const extractDataFromResponse = (response) => { const [Component, props] = response; const resultsEntries = Object.entries({ Component, props }); const assignIfValueTruthy = (o, [k, v]) => (v ? { ...o, [k]: v } : o ); return resultsEntries.reduce(assignIfValueTruthy, {}); } What makes this "clever"? I count eight language features composed together: entries, argument unpacking, implicit objects, splats, ternaries, higher-order functions, and reductions. Would code that used only one or two of these features still be "clever"? I don't think so. These features exist for a reason, and oftentimes they make code simpler than not using them. We can, of course, learn these features one at a time. Writing the clever version (but not committing it) gives us practice with all eight at once and also with how they compose together. That knowledge comes in handy when we want to apply a single one of the ideas. I've recently had to do a bit of pandas for a project. Whenever I have to do a new analysis, I try to write it as a single chain of transformations, and then as a more balanced set of updates. ...helps us master concepts Even if the composite parts of a "clever" solution aren't by themselves useful, it still makes us better at the overall language, and that's inherently valuable. A few years ago I wrote Crimes with Python's Pattern Matching. It involves writing horrible code like this: from abc import ABC class NotIterable(ABC): @classmethod def __subclasshook__(cls, C): return not hasattr(C, "__iter__") def f(x): match x: case NotIterable(): print(f"{x} is not iterable") case _: print(f"{x} is iterable") if __name__ == "__main__": f(10) f("string") f([1, 2, 3]) This composes Python match statements, which are broadly useful, and abstract base classes, which are incredibly niche. But even if I never use ABCs in real production code, it helped me understand Python's match semantics and Method Resolution Order better. ...prepares us for necessity Sometimes the clever way is the only way. Maybe we need something faster than the simplest solution. Maybe we are working with constrained tools or frameworks that demand cleverness. Peter Norvig argued that design patterns compensate for missing language features. I'd argue that cleverness is another means of compensating: if our tools don't have an easy way to do something, we need to find a clever way. You see this a lot in formal methods like TLA+. Need to check a hyperproperty? Cast your state space to a directed graph. Need to compose ten specifications together? Combine refinements with state machines. Most difficult problems have a "clever" solution. The real problem is that clever solutions have a skill floor. If normal use of the tool is at difficult 3 out of 10, then basic clever solutions are at 5 out of 10, and it's hard to jump those two steps in the moment you need the cleverness. But if you've practiced with writing overly clever code, you're used to working at a 7 out of 10 level in short bursts, and then you can "drop down" to 5/10. I don't know if that makes too much sense, but I see it happen a lot in practice. ...builds comradery On a few occasions, after getting a pull request merged, I pulled the reviewer over and said "check out this horrible way of doing the same thing". I find that as long as people know they're not going to be subjected to a clever solution in production, they enjoy seeing it! Next week's newsletter will probably also be late, after that we should be back to a regular schedule for the rest of the summer. Mostly grad students outside of CS who have to write scripts to do research. And in more than one data scientist. I think it's correlated with using Jupyter. ↩ If I don't put this at the beginning, I'll get a bajillion responses like "your team will hate you" ↩