More from samwho.dev
header h1 { padding: 0; margin-top: 0.2rem; margin-bottom: 1rem; } button { margin: 0.5rem; padding: 0.5rem 1rem; background-color: #444444; color: white; border: none; border-radius: 8px; cursor: pointer; touch-action: manipulation; user-select: none; } button:hover:not(:disabled) { background-color: #555555; touch-action: manipulation; user-select: none; } button:disabled { filter: opacity(0.5); cursor: not-allowed; touch-action: manipulation; user-select: none; } input[type="range"] { width: 100%; margin: 1rem 0; } .control { display: flex; justify-content: center; align-items: center; margin: 1rem 0; } .control label { margin-right: 1rem; white-space: nowrap; } .control input[type="range"] { flex-grow: 1; } .odds { font-size: 1.2rem; font-family: var(--code-font); font-weight: bold; text-align: center; display: flex; justify-content: center; gap: 2rem; margin: 1rem 0; } .odds .hold { display: flex; flex-direction: column; align-items: center; } .odds .replace { display: flex; flex-direction: column; align-items: center; } .odds .hold .value { color: var(--palette-dark-blue); white-space: nowrap; } .odds .replace .value { color: var(--palette-red); white-space: nowrap; } article input { --c: var(--palette-orange); --l: 2px; --h: 30px; --w: 30px; width: 100%; height: var(--h); -webkit-appearance :none; -moz-appearance :none; appearance :none; background: none; cursor: pointer; overflow: hidden; } article input:focus-visible, article input:hover{ --p: 25%; } article input[type="range" i]::-webkit-slider-thumb{ height: var(--h); width: var(--w); aspect-ratio: 1; border-radius: 50%; background: var(--c); border-image: linear-gradient(90deg,var(--c) 50%,#ababab 0) 0 1/calc(50% - var(--l)/2) 100vw/0 100vw; -webkit-appearance: none; appearance: none; box-shadow: none; transition: .3s; } article input[type="range"]::-moz-range-thumb { --h: 25px; --w: 25px; height: var(--h); width: var(--w); aspect-ratio: 1; border-radius: 50%; background: var(--c); border-image: linear-gradient(90deg,var(--c) 50%,#ababab 0) 0 1/calc(50% - var(--l)/2) 100vw/0 100vw; -webkit-appearance: none; appearance: none; box-shadow: none; transition: .3s; } img.hero { --width: 200px; width: var(--width); max-width: var(--width); margin-top: calc(var(--width) * -0.5); margin-bottom: 1rem; } Reservoir Sampling Reservoir sampling is a technique for selecting a fair random sample when you don't know the size of the set you're sampling from. By the end of this essay you will know: When you would need reservoir sampling. The mathematics behind how it works, using only basic operations: subtraction, multiplication, and division. No math notation, I promise. A simple way to implement reservoir sampling if you want to use it. ittybit, and their API for working with videos, images, and audio. If you need to store, encode, or get intelligence from the media files in your app, check them out! # Sampling when you know the size In front of you are 10 playing cards and I ask you to pick 3 at random. How do you do it? The first technique that might come to mind from your childhood is to mix them all up in the middle. Then you can straighten them out and pick the first 3. You can see this happen below by clicking "Shuffle." Every time you click "Shuffle," the chart below tracks what the first 3 cards were. At first you'll notice some cards are selected more than others, but if you keep going it will even out. All cards have an equal chance of being selected. This makes it "fair." Click "Shuffle 100 times" until the chart evens out. You can reset the chart if you'd like to start over. This method works fine with 10 cards, but what if you had 1 million cards? Mixing those up won't be easy. Instead, we could use a random number generator to pick 3 indices. These would be our 3 chosen cards. We no longer have to move all of the cards, and if we click the "Select" button enough times we'll see that this method is just as fair as the mix-up method. I'm stretching the analogy a little here. It would take a long time to count through the deck to get to, say, index 436,234. But when it's an array in memory, computers have no trouble finding an element by its index. Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random? That's the curveball: you don't know. No, you can only hold on to 1 card at a time. You're free to swap your card with the newest one each time I show you a card, but you can only hold one and you can't go back to a card you've already seen. Believe it or not, this is a real problem and it has a real and elegant solution. For example, let's say you're building a log collection service. Text logs, not wooden ones. This service receives log messages from other services and stores them so that it's easy to search them in one place. One of the things you need to think about when building a service like this is what do you do when another service starts sending you way too many logs. Maybe it's a bad release, maybe one of your videos goes viral. Whatever the reason, it threatens to overwhelm your log collection service. Let's simulate this. Below you can see a stream of logs that experiences periodic spikes. A horizontal line indicates the threshold of logs per second that the log collection service can handle, which in this example is 5 logs per second. You can see that every so often, logs per second spikes above the threshold . One way to deal with this is "sampling." Deciding to send only a fraction of the logs to the log collection service. Let's send 10% of the logs. Below we will see the same simulation again, but this time logs that don't get sent to our log collection service will be greyed out. The graph has 2 lines: a black line tracks sent logs , the logs that are sent to our log collection service, and a grey line tracks total logs . The rate of sent logs never exceeds the threshold , so we never overwhelm our log collection service. However, in the quieter periods we're throwing away 90% of the logs when we don't need to! What we really want is to send at most 5 logs per second. This would mean that during quiet periods you get all the logs, but during spikes you discard logs to protect the log collection service. The simple way to achieve this would be to send the first 5 logs you see each second, but this isn't fair. You aren't giving all logs an equal chance of being selected. # Sampling when you don't know the size We instead want to pick a fair sample of all the logs we see each second. The problem is that we don't know how many we will see. Reservoir sampling is an algorithm that solves this exact problem. You could, but why live with that uncertainty? You'd be holding on to an unknown number of logs in memory. A sufficiently big spike could cause you problems. Reservoir sampling solves this problem, and does so without ever using more memory than you ask it to. Let's go back to our curveball of me showing you 1 card at a time. Here's a recap of the rules: I'll draw cards one at a time from a deck. Each time I show you a card, you have to choose to hold it or discard it. If you were already holding a card, you discard your held card before replacing it with the new card. At any point I can stop drawing cards and whatever card you're holding is the one you've chosen. How would you play this game in a way that ensures all cards have been given an equal chance to be selected when I decide to stop? You're on the right track. Let's have a look at how the coin flip idea plays out in practice. Below you see a deck of cards. Clicking "Deal" will draw a card and 50% of the time it will go to the discard pile on the right, and 50% of the time it will become your held card in the center, with any previously held card moving to the discard pile. The problem is that while the hold vs discard counts are roughly equal, which feels fair, later cards are much more likely to be held when I stop than earlier cards. The first card drawn has to win 10 coin flips to still be in your hand after the 10th card is drawn. The last card only has to win 1. Scrub the slider below to see how the chances change as we draw more cards. Each bar represents a card in the deck, and the height of the bar is the chance we're holding that card when I stop. Below the slider are the chances we're holding the first card drawn vs. the last card drawn. Anything older than 15 cards ago is has a less than 0.01% chance of being held when I stop. Because believe it or not, we only have to make one small change to this idea to make it fair. Instead of flipping a coin to decide if we'll hold the card or not, instead we give each new card a 1/n chance of being held, where n is the number of cards we've seen so far. Yep! In order to be fair, every card must have an equal chance of being selected. So for the 2nd card, we want both cards to have a 1/2 chance. For the 3rd card, we want all 3 cards to have a 1/3 chance. For the 4th card, we want all 4 cards to have a 1/4 chance, and so on. So if we use 1/n for the new card, we can at least say that the new card has had a fair shot. Let's have a look at the chances as you draw more cards with this new method. new card has the right chance of being selected, but how does that make the older cards fair? So far we've focused on the chance of the new card being selected, but we also need to consider the chance of the card you're holding staying in your hand. Let's walk through the numbers. # Card 1 The first card is easy: we're not holding anything, so we always choose to hold the first card. The chance we're holding this card is 1/1, or 100%. # Card 2 This time we have a real choice. We can keep hold of the card we have, or replace it with the new one. We've said that we're going to do this with a 1/n chance, where n is the number of cards we've seen so far. So our chance of replacing the first card is 1/2, or 50%, and our chance of keeping hold of the first card is its chance of being chosen last time multiplied by its chance of being replaced, so 100% * 1/2, which is again 50%. # Card 3 The card we're holding has a 50% chance of being there. This is true regardless what happened up to this point. No matter whether we're holding card 1 or card 2, it's 50%. The new card has a 1/3 chance of being selected, so the card we're holding has a 1/3 chance of being replaced. This means that our held card has a 2/3 chance of remaining held. So its chances of "surviving" this round are 50% * 2/3. # Card N This pattern continues for as many cards as you want to draw. We can express both options as formulas. Drag the slider to substitute n with real numbers and see that the two formulas are always equal. If 1/n is the chance of choosing the new card, 1/(n-1) is the chance of choosing the new card from the previous draw. The chance of not choosing the new card is the complement of 1/n, which is 1-(1/n). Below are the cards again except this time set up to use 1/n instead of a coin flip. Click to the end of the deck. Does it feel fair to you? There's a good chance that through the 2nd half of the deck, you never swap your chosen card. This feels wrong, at least to me, but as we saw above the numbers say it is completely fair. # Choosing multiple cards Now that we know how to select a single card, we can extend this to selecting multiple cards. There are 2 changes we need to make: Rather than new cards having a 1/n chance of being selected, they now have a k/n chance, where k is the number of cards we want to choose. When we decide to replace a held card, we choose one of the k cards we're holding at random. So our new previous card selection formula becomes k/(n-1) because we're now holding k cards. Then the chance that any of the cards we're holding get replaced is 1-(1/n). Let's see how this plays out with real numbers. The fairness still holds, and will hold for any k and n pair. This is because all held cards have an equal chance of being replaced, which keeps them at an equal likelihood of still being in your hand every draw. A nice way to implement this is to use an array of size k. For each new card, generate a random number between 0 and n. If the random number is less than k, replace the card at that index with the new card. Otherwise, discard the new card. And that's how reservoir sampling works! # Applying this to log collection Let's take what we now know about reservoir sampling and apply it to our log collection service. We'll set k=5, so we're "holding" at most 5 log messages at a time, and every second we will send the selected logs to the log collection service. After we've done that, we empty our array of size 5 and start again. This creates a "lumpy" pattern in the graph below, and highlights a trade-off when using reservoir sampling. It's no longer a real-time stream of logs, but chunks of logs sent at an interval. However, sent logs never exceeds the threshold , and during quiet periods the two lines track each other almost perfectly. No logs lost during quiet periods, and never more than threshold logs per second sent during spikes. The best of both worlds. It also doesn't store more than k=5 logs, so it will have predictable memory usage. # Further reading Something you may have thought while reading this post is that some logs are more valuable than others. You almost certainly want to keep all error logs, for example. For that use-case there is a weighted variant of reservoir sampling. I wasn't able to find a simpler explanation of it, so that link is to Wikipedia which I personally find a bit hard to follow. But the key point is that it exists and if you need it you can use it. # Conclusion Reservoir sampling is one of my favourite algorithms, and I've been wanting to write about it for years now. It allows you to solve a problem that at first seems impossible, in a way that is both elegant and efficient. Thank you again to ittybit for sponsoring this post. I really couldn't have hoped for a more supportive first sponsor. Thank you for believing in and understanding what I'm doing here. Thank you to everyone who read this post and gave their feedback. You made this post much better than I could have done on my own, and steered me away from several paths that just weren't working. If you want to tell me what you thought of this post by sending me an anonymous message that goes directly to my phone, go to https://samwho.dev/ping.
body { text-wrap: pretty; } @media (prefers-reduced-motion: reduce) { * { transition: none; animation: none; } } turing-machine { width: 100%; display: block; position: relative; padding-bottom: 1em; } turing-machine .program-container { position: relative; display: flex; justify-content: center; } turing-machine table { border: none; font-family: Fira Code; border-collapse: collapse; border-spacing: 0; margin: 1px; margin-top: 0.5em; width: auto; } turing-machine thead td { text-align: center; } turing-machine td { text-align: left; padding-left: 3vw; padding-right: 3vw; padding-top: 0.2em; padding-bottom: 0.2em; border: 1px dashed #bbbbbb; } turing-machine thead td { border: 0; } turing-machine .container { z-index: 1; background-color: white; } turing-machine .svg-container { padding-bottom: 1px; padding-top: 0.5em; line-height: 0; overflow-x: scroll; overflow-y: hidden; position: relative; scrollbar-width: none; cursor: grab; cursor: -webkit-grab; } turing-machine .svg-container::-webkit-scrollbar { display: none; } turing-machine .controls-container { line-height: 1.5; display: flex; align-items: center; justify-content: center; gap: 0.3em; margin-top: 0.2em; padding-bottom: 0.2em; } turing-machine .controls-container button { color: black; background-color: #aaaaaa; border-radius: 30%/50%; padding: 0.3em; height: 2em; min-width: 3em; text-align: center; border: none; cursor: pointer; } turing-machine .controls-container select { color: black; border: 2px solid #aaaaaa; background-color: #ffffff; border-radius: 30%/50%; padding: 0.15em; height: 2em; min-width: 3em; text-align: center; cursor: pointer; font-family: "Fira Code"; } turing-machine .controls-container button:disabled { background-color: #eeeeee; color: #aaaaaa; cursor: auto; } turing-machine .controls-container button:disabled:hover { background-color: #eeeeee; } turing-machine .controls-container button:hover { background-color: #999999; } turing-machine .controls-container button:active { background-color: #888888; } turing-machine .error { width: 80%; margin-top: 0.5em; margin-bottom: 0.5em; border-radius: 10px; background-color: #ffdddd; color: red; font-family: "Fira Code"; font-size: 1.5em; font-weight: bold; text-align: center; margin: auto; } .dragging { cursor: grabbing !important; cursor: -webkit-grabbing !important; } .sticky { position: sticky !important; top: 0; } turing-machine .gradient { content: ""; z-index: 2; position: sticky; top: 0; left: 0; pointer-events: none; background-image: linear-gradient(90deg, rgba(255,255,255,1) 0%, rgba(255,255,255,0) 20%, rgba(255,255,255,0) 80%, rgba(255,255,255,1) 100%); width: 100%; margin-top: -62px; height: 64px; } turing-machine svg { user-select: none; /*touch-action: none;*/ } turing-machine svg text { font-family: "Fira Code"; } turing-machine tr { padding-top: 0.5em; padding-bottom: 0.5em; } turing-machine .instruction { color: black; display: inline-block; border-radius: 0.5em; padding: 0.1em; min-width: 4.5ch; height: 100%; text-align: center; } turing-machine .pointer { display: inline-block; border-radius: 0.5em; position: absolute; z-index: -1; } turing-machine program { display: none; } turing-machine controls { display: none; } turing-machine.heading svg text { font-family: "Lora"; font-weight: bold; } figure.hero { font-family: "Lora"; text-align: center; padding-top: 1em; padding-bottom: 2em; margin-top: 1em; width: 100%; max-width: 100%; } figure.hero turing-machine { padding: 0; margin: 0; } figure.hero turing-machine .svg-container { margin: 0; padding-top: 0; } figure.hero turing-machine .gradient { top: 0; } figure.hero .signature { margin-top: 0.5rem; max-width: 200px; } figure.hero .portrait { margin-top: 1em; width: 250px; max-width: 250px; } figure.hero p { padding: 0; font-size: 0.9em; } figure.hero figcaption { font-family: "Lora"; font-style: small-caps; font-size: 0.8em; } @keyframes pulse { 0% { transform: scale(1); } 50% { transform: scale(1.1); } 100% { transform: scale(1); } } .symbol { display: inline-block; border: 1px solid black; width: 1.5em; height: 1.5em; font-family: "Fira Code"; text-align: center; } .inline-instruction { display: inline-block; font-family: "Fira Code"; background-color: rgba(0, 158, 115, 0.7); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-instruction) { margin-top: 0.35em; margin-bottom: 0.35em; } .inline-state { display: inline-block; font-family: "Fira Code"; background-color: rgb(86, 180, 233); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-state) { margin-top: 0.35em; margin-bottom: 0.35em; } .inline-value { display: inline-block; font-family: "Fira Code"; background-color: rgb(230, 159, 0); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-value) { margin-top: 0.35em; margin-bottom: 0.35em; } figure { margin: auto; margin-top: 2em; margin-bottom: 2em; } figure img { max-width: 400px; } figure figcaption { max-width: 500px; color: #666; font-size: 0.8em; font-style: italic; text-align: center; font-family: "Lora"; text-wrap: pretty; margin: auto; } ALAN M. TURING 23 June 1912 – 7 June 1954 B B | | L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) -> F In 1928, David Hilbert, one of the most influential mathematicians of his time, asked whether it is possible to create an algorithm that could determine the correctness of a mathematical statement. This was called the "decision problem," or "Entscheidungsproblem" in Hilbert's native German. In 1936 both Alan Turing and Alonzo Church independently reached the conclusion, using different methods, that the answer is "no." The way Turing did it was to imagine a "universal machine", a machine that could compute anything that could be computed. This idea, the "Turing machine" as Alonzo Church christened it in 1937, laid the foundations for the device you are using to read this post. If we look hard enough we can see Turing's legacy in today's CPUs. By the end of this post, you will know: What a Turing machine is. What can and cannot be computed. What it means to be Turing complete. How modern computers relate to Turing machines. How to write and run your own programs for a Turing machine. # What is a Turing machine? You might expect a universal machine, capable of computing anything that can be computed, to be a complex device. Nothing could be further from the truth. The machine has just 4 parts, and the language used to program it has just 5 instructions. theoretical machine. It was created as a thought experiment to explore the limits of what can be computed. Some have of course been built, but in 1936 they existed only in the heads of Turing and those who read his paper. The parts are: a !tape, a !head, a !program, and a !state. When you're ready, go ahead and press !play. start What you're seeing here is a program that executes P(0) to print 0 to the tape, moves the head right with the R instruction, then !jumps back to the start. It will go on printing 0s forever. At any point, feel free to !pause, step the machine !forwards or !backwards one instruction at a time, or !restart the program from the beginning. There is also a speed selector on the far right of the controls if you want to speed up the machine. Notice that !state and !value never change. Every time the machine performs a !jump, the current state and value are used to pick the correct next row of instructions to execute. This program only has a single state, start, and every time it jumps, the symbol under the !head is !blank. Let's take a look at a program with multiple states. one one | | P(1) R -> zero This program prints alternating 0s and 1s to the tape. It has 2 states, zero and one, to illustrate what happens when you !jump to a different state. Things moving too fast? The slider below can be used to adjust the speed of all the Turing machines on this page. 100% You can also achieve this same result by using a single !state and alternating the !value. Here's an example of that: start start | 1 | R P(0) -> start start | 0 | R P(1) -> start The !value column always stays up to date with what the current symbol is under the !head, then when we !jump that value is used to know which row of instructions to execute. Combining state and value gives us a surprising amount of control over what our !program does. We've so far seen 3 instructions: P prints a given symbol to the tape. R moves the tape head right. ↪︎ jumps to a given state. There are 2 more: L moves the tape head left. H halts the machine. 1 1 | | P(a) L -> 2 2 | | P(l) L -> 3 3 | | P(A) H This program prints the word "Alan" from right to left then halts. If you can't see the full word, you can drag the !tape left and right. If the machine has halted, you can use !restart to start it again from the beginning. All of it! You're probably not going to get Crysis running at 60fps on a simulated Turing machine, but all of the calculations required to render each frame can be done with just these 5 instructions. Everything you have ever seen a computer do can be done with a Turing machine. We'll see a glimpse of how that can work in practice a little later. The last example I want to show you before we move on is the very first program Alan Turing showed the world. It's the first program featured in his 1936 paper: "On Computable Numbers, with an application to the Entsheidungsproblem." c c | | R -> e e | | P(1) R -> k k | | R -> b Turing liked to leave spaces between symbols, going as far as to even define them as "F-squares" and "E-squares". F for figure, and E for erasable. His algorithms would often make use of E-squares to help the machine remember the location of specific !tape squares. # What does it mean to compute? Something is said to be "computable" if there exists an algorithm that can get from the given input to the expected output. For example, adding together 2 integers is computable. Here I'm giving the machine a !tape that starts out with the values 2 and 6 in binary separated by a !blank. b1 b1 | 1 | R -> b1 b1 | | R -> b2 b2 | 0 | R -> b2 b2 | 1 | R -> b2 b2 | | L -> dec dec | 0 | P(1) L -> dec dec | 1 | P(0) L -> b3 dec | | H b3 | 0 | L -> b3 b3 | 1 | L -> b3 b3 | | L -> inc inc | 0 | P(1) R -> b1 inc | 1 | P(0) L -> inc inc | | P(1) R -> b1 This program adds the two numbers together, arriving at the answer 8. It does this by decrementing from the right number and incrementing the left number until the right number is 0. Here's the purpose of each state: b1 Move right until the first !blank square is found. This is to navigate past the first number. b2 Move right until the second !blank square is found. This is to navigate past the second number. dec Decrement the current number by 1. In practice this will always be the right number. It decrements by flipping bits until it either reaches a 1, where it will navigate back to the left number, or a !blank, which it will interpret as the number having hit 0. b3 Move left until the first !blank square again. This is only ever used after we've decremented the rightmost number. We can't reuse the b1 state here because when we reach the !blank, we want to jump to inc. inc Increment the current number by 1. Similar to decrementing, this will only ever happen on the leftmost number in practice. This is done by flipping bits until we reach a 0, at which point we navigate back to the right number. That we can write this program at all is proof that addition is computable, but it also implies that all integers are computable. If we can add any 2 integers, we can compute any other integer. 1 is 0+1, 2 is 1+1, 3 is 2+1, and so on. # Binary vs Decimal You may have wondered why I'm choosing to work with binary numbers rather than decimal. It's not just because that's how modern computers work. I'm going to show you 2 examples, and from those examples you'll be able to see why modern computers choose to work in binary. The first example is a program that increments a binary number in an endless loop. move move | 0 | R -> move move | | L -> flip flip | 0 | P(1) R -> move flip | 1 | P(0) L -> flip flip | | P(1) R -> move The second example is a program that increments a decimal number in an endless loop. back inc | 1 | P(2) -> back inc | 2 | P(3) -> back inc | 3 | P(4) -> back inc | 4 | P(5) -> back inc | 5 | P(6) -> back inc | 6 | P(7) -> back inc | 7 | P(8) -> back inc | 8 | P(9) -> back inc | 9 | P(0) L -> inc inc | | P(1) -> back back | | L -> inc back | * | R -> back These two programs are doing the same thing, but the program for manipulating decimal numbers is much longer. We've even introduced some new syntax, the * symbol, to handle a !value under the !head that does not match any of the other values for that !state. It's for this reason when programming Turing machines we prefer binary numbers: the programs end up being shorter and easier to reason about. This benefit also translates to the physical world. Components that switch between 2 states are cheaper, smaller, and more reliable than components that switch between 10. It was more practical to build computers that worked in binary than ones that work in decimal, though attempts to build decimal computers were made. # What can't be computed? To approach this question we need to explain the "Halting problem." It goes like this: The answer is no, and this is what Turing essentially proved. The proof is complicated and I'm not ashamed to admit I don't understand it, but there is an example I can give you that can be intuitively understood to be "undecidable." Imagine you write a program that takes as its input the program being used to decide whether it will halt or not. What it then does is run the decider program on itself, and then do the opposite of what the decider program says. function undecidable(willHalt) { if (willHalt(undecidable)) { while (true); } else { return true; } } This program intentionally enters an infinite loop if it is told it will halt, and halts if it is told it will run forever. It seems like a silly example, the kind of answer a cheeky high school student might try to get away with, but it is a legitimate counterexample to the idea that the halting problem can be solved. If you were to imagine encoding the program and input into something that could be represented on the !tape, there would be no !program that could determine whether the program would halt or not. Imagining this encoding becomes quite natural when you realise that modern programs are encoded as binary data to be saved to disk. # What does it mean to be Turing complete? If you've been involved in the world of programming for more than a few years, there's a good chance you've come across the term "Turing complete." Most likely in the context of things that really ought not to be Turing complete, like C++ templates, TypeScript's type system or Microsoft Excel. But what does it mean? Like the Halting problem, the proof is complicated but there's a straightforward test you can apply to something to judge it Turing complete: I've written this post, with the Turing machine simulations, in JavaScript. Therefore JavaScript is Turing complete. The C++ template example given above simulates a Turing machine in C++'s template system. The TypeScript example takes the route of writing an interpreter for a different Turing complete language. You're right, and everyone tends to cheat a bit with the definition. When someone says something is Turing complete, what they mean is it would be Turing complete if it had an infinite amount of memory. The infinite tape limitation means no Turing machine could ever exist in our physical reality, so that requirement tends to get waived. # How does this all relate to modern computers? If you read around the topic of Turing machines outside of this post, you might see it said that modern computers are effectively Turing machines. You would be forgiven for finding it difficult to imagine how you go from adding 2 integers in binary on a !tape to running a web browser, but the line is there. A key difference between our Turing machine and the device you're reading this on is that your device's CPU has "registers." These are small pieces of memory that live directly on the CPU and are used to store values temporarily while they're being operated on. Values are being constantly loaded from memory into registers and saved back again. You can think of registers as variables for your CPU, but they can only store fixed-size numbers. We can create registers in our Turing machine. We can do this by creating a "format" for our tape. Here we define 3 registers: A, B, and C. Each register contains a 3 bits and can store numbers between 0 and 7. Then at the far left we have an H, which stands for "home", which will help us navigate. To increment register C, we can write a program like this: goto_c goto_c | C | R R R -> inc inc | 0 | P(1) -> goto_h goto_h | * | L -> goto_h goto_h | H | H We're making a lot more liberal use of the * symbol here to help us navigate to specific parts of the !tape without having to enumerate all possible values that could be under the !head on the way there. This program is effectively equivalent to the following x86 assembly code, if x86 had a register named c: mov c, 0 ; Load 0 into c inc c ; Increment c by 1 If we wanted to add values in A and B, storing the result in C, we need to do more work. Here's the assembly code we're trying to replicate: mov a, 2 ; Load 2 into a mov b, 3 ; Load 3 into b add c, a ; Add a to c add c, b ; Add b to c Before you scroll down I will warn you that the program is long and complex. It is the last program we will see in this post, and I don't expect you to understand it in full to continue to the end. Its main purpose is to show you that we can implement operations seen in modern assembly code on a Turing machine. initb initb | * | R P(0) R P(1) R P(1) R -> start start | * | L -> start start | H | R -> go_a go_a | * | R -> go_a go_a | A | R R R -> dec_a dec_a | 0 | P(1) L -> cry_a dec_a | 1 | P(0) -> go_c1 cry_a | 0 | P(1) L -> cry_a cry_a | 1 | P(0) -> go_c1 cry_a | * | R P(0) R P(0) R P(0) -> goto_b go_c1 | * | R -> go_c1 go_c1 | C | R R R -> inc_c1 inc_c1 | 0 | P(1) -> h2a inc_c1 | 1 | P(0) L -> cry_ca cry_ca | 0 | P(1) -> h2a cry_ca | 1 | P(0) L -> cry_ca cry_ca | * | R -> h2a h2a | * | L -> h2a h2a | H | R -> go_a goto_b | * | R -> goto_b goto_b | B | R R R -> dec_b dec_b | 0 | P(1) L -> dec_bc dec_b | 1 | P(0) -> go_c2 dec_bc | 0 | P(1) L -> dec_bc dec_bc | 1 | P(0) -> go_c2 dec_bc | * | R P(0) R P(0) R P(0) -> end go_c2 | * | R -> go_c2 go_c2 | C | R R R -> inc_c2 inc_c2 | 0 | P(1) -> go_hb inc_c2 | 1 | P(0) L -> cry_cb cry_cb | 0 | P(1) -> go_hb cry_cb | 1 | P(0) L -> cry_cb cry_cb | * | R -> go_hb go_hb | * | L -> go_hb go_hb | H | R -> goto_b end | * | L -> end end | H | H This is painfully laborious, and it doesn't even precisely match the assembly code. It destroys the values in A and B as it adds them together, and it doesn't handle overflow. But it's a start, and I hope it gives you a glimpse of how this theoretical machine can be built up to operate like a modern CPU. If you watch the program run to completion, something that might strike you is just how much work is required to do something as simple as adding 2 numbers. Turing machines were not designed to be practical, Turing never intended anyone to go out and build one of these machines in the hope it will be useful. Modern machines have circuits within them that can add 2 numbers together by passing 2 electrical signals in and getting the sum as a single signal out. This happens in less than a nanosecond. Modern machines have memory where any byte can be accessed at any time, no tape manipulation required. This memory access takes a few dozen nanoseconds. # Writing and running your own programs I've built a web-based development environment for writing programs that will run on the Turing machine visualisations you've seen throughout the post. You can access the editor here. I encourage you to play around with it. Set a simple goal, like adding together 2 numbers without going back to look at the way I did it in the post. It's a great way to get a feel for how the machine works. # Conclusion To recap, we've covered: What a Turing machine is. What can and cannot be computed. What it means to be Turing complete. How modern computers relate to Turing machines. And you now have access to an environment for writing and running your own Turing machine programs. If you use it to make something neat, please do reach out to me and show me! My email address is hello@samwho.dev. # Further reading Turing's original paper on computable numbers. The Annotated Turing I referenced this throughout the making of this post. It is a fabulous read, strongly recommend. Alan Turing: The Enigma by Andrew Hodges. An excellent biography of Turing, I read this during the writing of this post. Calculating a Mandelbrot Set using a Turing Machine. This was exceptionally useful for me to understand how to get from Turing machines to modern computers. # Acknowledgements These posts are never a solo effort, and this one is no exception. Sincere thanks go to the following people: To my wife, Sophie, who drew the biographical sketches you're seeing at the end here, and for putting up with my incessant talking about this post the last 2 weeks. Everyone who let me watch them read this post in real-time over a video call and gave me feedback: Jaga Santagostino, Robert Aboukhalil, Tarun Verghis, Tyler Sparks. Everyone who came to hang out and help out in the Twitch streams when I was building out the early versions of the Turing machine visualisations. Everyone who supports my work on Patreon. Everyone who works on the tools used to build this post: TypeScript, Bun, Two.js, Tween.js, Monaco, Peggy, Zed, and many other indirect dependencies. We really do stand upon the shoulders of giants. Hut 8, Bletchley Park, where Turing worked during World War II. It was in this hut that Alan worked with his team to break the German Naval Enigma code.
.bf { width: 100%; height: 150px; } @media only screen and (min-width: 320px) and (max-width: 479px) { .bf { height: 200px; } } @media only screen and (min-width: 480px) and (max-width: 676px) { .bf { height: 200px; } } @media only screen and (min-width: 677px) and (max-width: 991px) { .bf { height: 150px; } } form { display: flex; flex-direction: column; align-items: center; justify-content: stretch; } input { border: 1px solid rgb(119, 119, 119); padding: 0.25rem; border-radius: 0.25rem; height: 2em; line-height: 2em; } .aside { padding: 2rem; width: 100vw; position: relative; margin-left: -50vw; left: 50%; background-color: #eeeeee; display: flex; align-items: center; flex-direction: column; } .aside > * { flex-grow: 1; } .aside p { padding-left: 1rem; padding-right: 1rem; max-width: 780px; font-style: italic; font-family: Lora, serif; text-align: center; } Everyone has a set of tools they use to solve problems. Growing this set helps you to solve ever more difficult problems. In this post, I'm going to teach you about a tool you may not have heard of before. It's a niche tool that won't apply to many problems, but when it does you'll find it invaluable. It's called a "bloom filter." Before you continue! This post assumes you know what a hash function is, and if you don't it's going to be tricky to understand. Sam has written a post about hash functions, and recommendeds that you read this first. # What bloom filters can do Bloom filters are similar to the Set data structure. You can add items to them, and check if an item is present. Here's what it might look like to use a bloom filter in JavaScript, using a made-up BloomFilter class: let bf = new BloomFilter(); bf.add("Ant"); bf.add("Rhino"); bf.contains("Ant"); // true bf.contains("Rhino"); // true While this looks almost identical to a Set, there are some key differences. Bloom filters are what's called a probabalistic data structure. Where a Set can give you a concrete "yes" or "no" answer when you call contains, a bloom filter can't. Bloom filters can give definite "no"s, but they can't be certain about "yes." In the example above, when we ask bf if it contains "Ant" and "Rhino", the true that it returns isn't a guarantee that they're present. We know that they're present because we added them just a couple of lines before, but it would be possible for this to happen: let bf = new BloomFilter(); bf.add("Ant"); bf.add("Rhino"); bf.contains("Fox"); // true We'll demonstrate why over the course of this post. For now, we'll say that when bloom filters return true it doesn't mean "yes", it means "maybe". When this happens and the item has never been added before, it's called a false-positive. The opposite, claiming "no" when the answer is "yes," is called a false-negative. A bloom filter will never give a false-negative, and this is what makes them useful. It's not strictly lying, it's just not giving you a definite answer. Let's look at an example where we can use this property to our advantage. # When bloom filters are useful Imagine you're building a web browser, and you want to protect users from malicious links. You could build and maintain a list of all known malicious links and check the list every time a user navigates the browser. If the link they're trying to visit is in the list, you warn the user that they might be about to visit a malicious website. If we assume there are, say, 1,000,000 malicious links on the Internet, and each link is 20 characters long, then the list of malicious links would be 20MB in size. This isn't a huge amount of data, but it's not small either. If you have lots of users and want to keep this list up to date, the bandwidth could add up. However, if you're happy to accept being wrong 0.0001% of the time (1 in a million), you could use a bloom filter which can store the same data in 3.59MB. That's an 82% reduction in size, and all it costs you is showing the user an incorrect warning 1 in every million links visited. If you wanted to take it even further, and you were happy to accept being wrong 0.1% of the time (1 in 1000), the bloom filter would only be 1.8MB. This use-case isn't hypothetical, either. Google Chrome used a bloom filter for this exact purpose until 2012. If you were worried about showing a warning when it wasn't needed, you could always make an API that has the full list of malicious links in a database. When the bloom filter says "maybe," you would then make an API call to check the full list to be sure. No more spurious warnings, and the bloom filter would save you from having to call the API for every link visited. # How bloom filters work At its core, a bloom filter is an array of bits. When it is created, all of the bits are set to 0. We're going to represent this as a grid of circles, with each circle representing 1 bit. Our bloom filters in this post are all going to have 32 bits in total. this one and let me know what you think. Click here to go back to normal. To add an item to the bloom filter, we're going to hash it with 3 different hash functions, then use the 3 resulting values to set 3 bits. If you're not familiar with hashing, I recommend reading my post about it before continuing. For this post I'm choosing to use 3 of the SHA family of hash functions: sha1, sha256, and sha512. Here's what our bloom filter looks like if we add the value "foo" to it: The bits in positions 15, 16 and 27 have been set. Other bits, e.g. 1 have not been set. You can hover or tap the bits in this paragraph to highlight them in the visualisation. We get to this state by taking the hash value of "foo" for each of our 3 hash functions and modulo it by the number of bits in our bloom filter. Modulo gets us the remainder when dividing by 32, so we get 27 with sha1, 15 with sha256 and 16 with sha512. The table below shows what's happening, and you can try inputting your own values to see what bits they would set if added. Go ahead and add a few of your own values to our bloom filter below and see what happens. There's also a check button that will tell you if a value is present within the bloom filter. A value is only considered present if all of the bits checked are set. You can start again by hitting the clear button. You might occasionally notice that only 2, or even 1, bits get set. This happens when 2 or more of our hash functions produce the same value, or we attempt to set a bit that has already been set. Taking that a bit further, have a think about the implications of a bloom filter that has every bit set. bit is set, then won't the bloom filter claim it contains every item you check? That's a false-positive every time! Exactly right. A bloom filter with every bit set is equivalent to a Set that always returns true for contains. It will claim to contain everything you ask it about, even if that thing was never added. # False-positive rates The rate of false-positives in our bloom filter will grow as the percentage of set bits increases. Drag the slider below the graph to see how the false-positive rate changes as the number of set bits increases. It grows slowly at first, but as we get closer to having all bits set the rate increases. This is because we calculate the false-positive rate as x^3, where x is the percentage of set bits and 3 is the number of hash functions used. To give an example of why we calculate it with this formula, imagine we have a bloom filter with half of its bits set, x = 0.5. If we assume that our hash function has an equal chance of setting any of the bits, then the chance that all 3 hash functions set a bit that is already set is 0.5 * 0.5 * 0.5, or x^3. Let's have a look at the false-positive rate of bloom filters that use different numbers of hash functions. The problem that using lots of hash functions introduces is that it makes the bloom filter fill up faster. The more hash functions you use, the more bits get set for each item you add. There's also the cost of hashing itself. Hash functions aren't free, and while the hash functions you'd use in a bloom filter try to be as fast as possible, it's still more expensive to run 100 of them than it is to run 3. It's possible to calculate how full a bloom filter will be after inserting a number of items, based on the number of hash functions used. The graph below assumes a bloom filter with 1000 bits. The more hash functions we use, the faster we set all of the bits. You'll notice that the curve tails off as more items are added. This is because the more bits that are set, the more likely it is that we'all attempt to set a bit that has already been set. In practice, 1000 bits is a very small bloom filter, occupying only 125 bytes of memory. Modern computers have a lot of memory, so let's crank this up to 100,000 bits (12.5kB) and see what happens. The lines barely leave the bottom of the graph, meaning the bloom filter will be very empty and the false-positive rate will be low. All this cost us was 12.5kB of memory, which is still a very small amount by 2024 standards. # Tuning a bloom filter Picking the correct number of hash functions and bits for a bloom filter is a fine balance. Fortunately for us, if we know up-front how many unique items we want to store, and what our desired false-positive rate is, we can calculate the optimal number of hash functions, and the required number of bits. The bloom filter page on Wikipedia covers the mathematics involved, which I'm going to translate into JavaScript functions for us to use. I want to stress that you don't need to understand the maths to use a bloom filter or read this post. I'm including the link to it only for completeness. # Optimal number of bits The following JavaScript function, which might look a bit scary but bear with me, takes the number of items you want to store (items) and the desired false-positive rate (fpr, where 1% == 0.01), and returns how many bits you will need to achieve that false-positive rate. function bits(items, fpr) { const n = -items * Math.log(fpr); const d = Math.log(2) ** 2; return Math.ceil(n / d); } We can see how this grows for a variety of fpr values in the graph below. # Optimal number of hash functions After we've used the JavaScript above to calculate how many bits we need, we can use the following function to calculate the optimal number of hash functions to use: function hashFunctions(bits, items) { return Math.ceil((bits / items) * Math.log(2)); } Pause for a second here and have a think about how the number of hash functions might grow based on the size of the bloom filter and the number of items you expect to add. Do you think you'll use more hash functions, or fewer, as the bloom filter gets larger? What about as the number of items increases? The more items you plan to add, the fewer hash functions you should use. Yet, a larger bloom filter means you can use more hash functions. More hash functions keep the false-positive rate lower for longer, but more items fills up the bloom filter faster. It's a complex balancing act, and I am thankful that mathematicians have done the hard work of figuring it out for us. # Caution While we can stand on the shoulders of giants and pick the optimal number of bits and hash functions for our bloom filter, it's important to remember that these rely on you giving good estimates of the number of items you expect to add, and choosing a false-positive rate that's acceptable for your use-case. These numbers might be difficult to come up with, and I recommend erring on the side of caution. If you're not sure, it's likely better to use a larger bloom filter than you think you need. # Removing items from a bloom filter We've spent the whole post talking about adding things to a bloom filter, and the optimal parameters to use. We haven't spoken at all about removing items. And that's because you can't! In a bloom filter, we're using bits, individual 1s and 0s, to track the presence of items. If we were to remove an item by setting its bits to 0, we might also be removing other items by accident. There's no way of knowing. Click the buttons of the bloom filter below to see this in action. First we will add "foo", then "baz", and then we will remove "baz". Hit "clear" if you want to start again. The end result of this sequence is a bloom filter that doesn't contain "baz", but doesn't contain "foo" either. Because both "foo" and "baz" set bit 27, we accidentally clobber the presence of "foo" while removing "baz". Something else you might have noticed playing with the above example is that if you add "foo" and then attempt to remove "baz" before having added it, nothing happens. Even though 27 is set, bits 18 and 23 are not, so the bloom filter cannot contain "baz". Because of this, it won't unset 27. # Counting bloom filters While you can't remove items from a standard bloom filter, there are variants that allow you to do so. One of these variants is called a "counting bloom filter," which uses an array of counters instead of bits to keep track of items. Now when you go through the sequence, the end result is that the bloom filter still contains "foo." It solves the problem. The trade-off, though, is that counters are bigger than bits. With 4 bits per counter you can increment up to 15. With 8 bits per counter you can increment up to 255. You'll need to pick a counter size sufficient to never reach the maximum value, otherwise you risk corrupting the bloom filter. Using 8x more memory than a standard bloom filter could be a big deal, especially if you're using a bloom filter to save memory in the first place. Think hard about whether you really need to be able to remove items from your bloom filter. Counting bloom filters also introduce the possibility of false-negatives, which are impossible in standard bloom filters. Consider the following example. Because "loved" and "response" both hash to the bits 5, 22, and 26, when we remove "response" we also remove "loved". If we write this as JavaScript the problem becomes more clear: let bf = new CountingBloomFilter(); bf.add("loved"); bf.add("your"); bf.remove("response"); bf.contains("loved"); // false Even though we know for sure we've added "loved" in this snippet, the call to contains will return false. This sort of false-negative can't happen in a standard bloom filter, and it removes one of the key benefits of using a bloom filter in the first place: the guarantee of no false-negatives. # Bloom filters in the real-world Real-world users of bloom filters include Akamai, who use them to avoid caching web pages that are accessed once and never again. They do this by storing all page accesses in a bloom filter, and only writing them into cache if the bloom filter says they've been seen before. This does result in some pages being cached on the first access, but that's fine because it's still an improvement. It would be impractical for them to store all page accesses in a Set, so they accept the small false-positive rate in favour of the significantly smaller bloom filter. Akamai released a paper about this that goes into the full details if you're interested. Google's BigTable is a distributed key-value store, and uses bloom filters internally to know what keys are stored within. When a read request for a key comes in, a bloom filter in memory is first checked to see if the key is in the database. If not, BigTable can respond with "not found" without ever needing to read from disk. Sometimes the bloom filter will claim a key is in the database when it isn't, but this is fine because when that happens a disk access will confirm the key in fact isn't in the database. # Conclusion Bloom filters, while niche, can be a huge optimisation in the right situation. They're a wonderful application of hash functions, and a great example of making a deliberate trade-off to achieve a specific goal. Trade-offs, and combining simpler building blocks to create more complex, purpose-built data structures, are present everywhere in software engineering. Being able to spot where a data structure could net a big win can separate you from the pack, and take your career to the next level. I hope you've enjoyed this post, and that you find a way to apply bloom filters to a problem you're working on. # Acknowledgements Enormous thank you to my reviewers, without whom this post would be a shadow of what you read today. In no particular order: rylon, Indy, Aaron, Sophie, Davis, ed, Michael Drury, Anton Zhiyanov, Christoph Berger.
form { padding-top: 0.5em; padding-left: 0.5em; padding-right: 0.5em; display: flex; justify-content: center; gap: 0.3em; } form input[type=text] { flex: 4 1 auto; min-width: 0; border-radius: 0.3em; border: 1px solid #aaaaaa; padding: 0.3em; } form button { flex: 1 1 auto; max-width: 140px; } form button:disabled { opacity: 0.5 !important; } form button.add { background-color: #009E73; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } form button.check { background-color: #56B4E9; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } form button.clear { background-color: #D55E00; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } .grid-2x2 { display: "grid"; } .grid { user-select: none; cursor: pointer; margin-top: 1rem; margin-bottom: 1rem; border: 1px solid #009E73; width: 100%; display: grid; grid-template-columns: repeat(8, 1fr); grid-template-rows: repeat(2, 1fr); } .grid-item { display: flex; align-items: center; justify-content: center; aspect-ratio: 1/1; } .grid-active { background-color: #009E73; color: white; } .above-grid { display: flex; justify-content: center; } .hash-examples { padding-top: 0.5rem; padding-bottom: 0.5rem; margin: auto; display: flex; flex-direction: column; align-items: center; } .hash-examples div { margin: auto; } .hash-examples code { display: block; white-space: pre; font-weight: bold; } .hash-examples p { font-size: 0.75rem; font-style: italic; text-align: center; font-family: Lora, serif; width: 75%; } .blob { cursor: pointer; background: #CC79A7; display: flex; justify-content: center; align-items: center; font-size: 1.5rem; color: white; border-radius: 50%; margin: 10px; height: 3rem; width: 3rem; min-width: 3rem; max-width: 3rem; box-shadow: 0 0 0 0 #CC79A7FF; transform: scale(1); animation: pulse 2s infinite; } @keyframes pulse { 0% { transform: scale(0.85); box-shadow: 0 0 0 0 #CC79A77F; } 70% { transform: scale(1); box-shadow: 0 0 0 1rem rgba(0, 0, 0, 0); } 100% { transform: scale(0.85); box-shadow: 0 0 0 0 rgba(0, 0, 0, 0); } } .blob-click { cursor: default; animation: tick 1s linear; background: #009E73FF; } @keyframes tick { 0% { transform: scale(1); box-shadow: 0 0 0 0 #009E73FF; } 50% { box-shadow: 0 0 0 1rem #009E737F; } 100% { box-shadow: 0 0 0 2rem #009E7300; } } .aside { padding: 2rem; width: 100vw; position: relative; margin-left: -50vw; left: 50%; background-color: #eeeeee; display: flex; align-items: center; flex-direction: column; } .aside > * { flex-grow: 1; } .aside p { padding-left: 1rem; padding-right: 1rem; max-width: 780px; font-style: italic; font-family: Lora, serif; text-align: center; } .pct25 { width: 100%; height: 200px; } .datasets th { text-align: left; } .datasets { table-layout: fixed; } As a programmer, you use hash functions every day. They're used in databases to optimise queries, they're used in data structures to make things faster, they're used in security to keep data safe. Almost every interaction you have with technology will involve hash functions in one way or another. Hash functions are foundational, and they are everywhere. But what is a hash function, and how do they work? In this post, we're going to demystify hash functions. We're going to start by looking at a simple hash function, then we're going to learn how to test if a hash function is good or not, and then we're going to look at a real-world use of hash functions: the hash map. clicked. # What is a hash function? Hash functions are functions that take an input, usually a string, and produce a number. If you were to call a hash function multiple times with the same input, it will always return the same number, and that number returned will always be within a promised range. What that range is will depend on the hash function, some use 32-bit integers (so 0 to 4 billion), others go much larger. If we were to write a dummy hash function in JavaScript, it might look like this: function hash(input) { return 0; } Even without knowing how hash functions are used, it's probably no surprise that this hash function is useless. Let's see how we can measure how good a hash function is, and after that we'll do a deep dive on how they're used within hash maps. # What makes a hash function good? Because input can be any string, but the number returned is within some promised range, it's possible that two different inputs can return the same number. This is called a "collision," and good hash functions try to minimise how many collisions they produce. It's not possible to completely eliminate collisions, though. If we wrote a hash function that returned a number in the range 0 to 7, and we gave it 9 unique inputs, we're guaranteed at least 1 collision. hash("to") == 3 hash("the") == 2 hash("café") == 0 hash("de") == 6 hash("versailles") == 4 hash("for") == 5 hash("coffee") == 0 hash("we") == 7 hash("had") == 1 To visualise collisions, I'm going to use a grid. Each square of the grid is going to represent a number output by a hash function. Here's an example 8x2 grid. Click on the grid to increment the example hash output value and see how we map it to a grid square. See what happens when you get a number larger than the number of grid squares. { let grid = document.getElementById("first-grid"); let hash = document.getElementById("grid-hash"); let modulo = document.getElementById("grid-modulo"); grid.addEventListener("click", (e) => { e.preventDefault(); let number = parseInt(hash.innerText) + 1; hash.innerText = number.toString(); modulo.innerText = (number % 16).toString(); grid.querySelector(".grid-active").classList.remove("grid-active"); grid.children[number % 16].classList.add("grid-active"); return false; }); }); 13 % 16 == 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Every time we hash a value, we're going to make its corresponding square on the grid a bit darker. The idea is to create an easy way to see how well a hash function avoids collisions. What we're looking for is a nice, even distribution. We'll know that the hash function isn't good if we have clumps or patterns of dark squares. This is a great observation. You're absolutely right, we're going to be creating "pseudo-collisions" on our grid. It's okay, though, because if the hash function is good we will still see an even distribution. Incrementing every square by 100 is just as good a distribution as incrementing every square by 1. If we have a bad hash function that collides a lot, that will still stand out. We'll see this shortly. Let's take a larger grid and hash 1,000 randomly-generated strings. You can click on the grid to hash a new set of random inputs, and the grid will animate to show you each input being hashed and placed on the grid. The values are nice and evenly distributed because we're using a good, well-known hash function called murmur3. This hash is widely used in the real-world because it has great distribution while also being really, really fast. What would our grid look like if we used a bad hash function? function hash(input) { let hash = 0; for (let c of input) { hash += c.charCodeAt(0); } return hash % 1000000; } This hash function loops through the string that we're given and sums the numeric values of each character. It then makes sure that the value is between 0 and 1000000 by using the modulus operator (%). Let's call this hash function stringSum. Here it is on the grid. Reminder, this is 1,000 randomly generated strings that we're hashing. This doesn't look all that different from murmur3. What gives? The problem is that the strings we're giving to be hashed are random. Let's see how each function performs when given input that is not random: the numbers from 1 to 1000 converted to strings. Now the problem is more clear. When the input isn't random, the output of stringSum forms a pattern. Our murmur3 grid, however, looks the same as how it looked with random values. How about if we hash the top 1,000 most common English words: It's more subtle, but we do see a pattern on the stringSum grid. As usual, murmur3 looks the same as it always does. This is the power of a good hash function: no matter the input, the output is evenly distributed. Let's talk about one more way to visualise this and then talk about why it matters. # The avalanche effect Another way hash functions get evaluated is on something called the "avalanche effect." This refers to how many bits in the output value change when just a single bit of the input changes. To say that a hash function has a good avalanche effect, a single bit flip in the input should result in an average of 50% the output bits flipping. It's this property that helps hash functions avoid forming patterns in the grid. If small changes in the input result in small changes in the output, you get patterns. Patterns indicate poor distribution, and a higher rate of collisions. Below, we are visualising the avalanche effect by showing two 8-bit binary numbers. The top number is the input value, and the bottom number is the murmur3 output value. Click on it to flip a single bit in the input. Bits that change in the output will be green, bits that stay the same will be red. murmur3 does well, though you will notice that sometimes fewer than 50% of the bits flip and sometimes more. This is okay, provided that it is 50% on average. Let's see how stringSum performs. Well this is embarassing. The output is equal to the input, and so only a single bit flips each time. This does make sense, because stringSum just sums the numeric value of each character in the string. This example only hashes the equivalent of a single character, which means the output will always be the same as the input. # Why all of this matters We've taken the time to understand some of the ways to determine if a hash function is good, but we've not spent any time talking about why it matters. Let's fix that by talking about hash maps. To understand hash maps, we first must understand what a map is. A map is a data structure that allows you to store key-value pairs. Here's an example in JavaScript: let map = new Map(); map.set("hello", "world"); console.log(map.get("hello")); Here we take a key-value pair ("hello" → "world") and store it in the map. Then we print out the value associated with the key "hello", which will be "world". A more fun real-world use-case would be to find anagrams. An anagram is when two different words contain the same letters, for example "antlers" and "rentals" or "article" and "recital." If you have a list of words and you want to find all of the anagrams, you can sort the letters in each word alphabetically and use that as a key in a map. let words = [ "antlers", "rentals", "sternal", "article", "recital", "flamboyant", ] let map = new Map(); for (let word of words) { let key = word .split('') .sort() .join(''); if (!map.has(key)) { map.set(key, []); } map.get(key).push(word); } This code results in a map with the following structure: { "aelnrst": [ "antlers", "rentals", "sternal" ], "aceilrt": [ "article", "recital" ], "aabflmnoty": [ "flamboyant" ] } # Implementing our own simple hash map Hash maps are one of many map implementations, and there are many ways to implement hash maps. The simplest way, and the way we're going to demonstrate, is to use a list of lists. The inner lists are often referred to as "buckets" in the real-world, so that's what we'll call them here. A hash function is used on the key to determine which bucket to store the key-value pair in, then the key-value pair is added to that bucket. Let's walk through a simple hash map implementation in JavaScript. We're going to go through it bottom-up, so we'll see some utility methods before getting to the set and get implementations. class HashMap { constructor() { this.bs = [[], [], []]; } } We start off by creating a HashMap class with a constructor that sets up 3 buckets. We use 3 buckets and the short variable name bs so that this code displays nicely on devices with smaller screens. In reality, you could have however many buckets you want (and better variable names). class HashMap { // ... bucket(key) { let h = murmur3(key); return this.bs[ h % this.bs.length ]; } } The bucket method uses murmur3 on the key passed in to find a bucket to use. This is the only place in our hash map code that a hash function is used. class HashMap { // ... entry(bucket, key) { for (let e of bucket) { if (e.key === key) { return e; } } return null; } } The entry method takes a bucket and a key and scans the bucket until it finds an entry with the given key. If no entry is found, null is returned. class HashMap { // ... set(key, value) { let b = this.bucket(key); let e = this.entry(b, key); if (e) { e.value = value; return; } b.push({ key, value }); } } The set method is the first one we should recognise from our earlier JavaScript Map examples. It takes a key-value pair and stores it in our hash map. It does this by using the bucket and entry methods we created earlier. If an entry is found, its value is overwritten. If no entry is found, the key-value pair is added to the map. In JavaScript, { key, value } is shorthand for { key: key, value: value }. class HashMap { // ... get(key) { let b = this.bucket(key); let e = this.entry(b, key); if (e) { return e.value; } return null; } } The get method is very similar to set. It uses bucket and entry to find the entry related to the key passed in, just like set does. If an entry is found, its value is returned. If one isn't found, null is returned. That was quite a lot of code. What you should take away from it is that our hash map is a list of lists, and a hash function is used to know which of the lists to store and retrieve a given key from. Here's a visual representation of this hash map in action. Click anywhere on the buckets to add a new key-value pair using our set method. To keep the visualisation simple, if a bucket were to "overflow", the buckets are all reset. Because we're using murmur3 as our hash function, you should see good distribution between the buckets. It's expected you'll see some imbalance, but it should generally be quite even. To get a value out of the hash map, we first hash the key to figure out which bucket the value will be in. Then we have to compare the key we're searching for against all of the keys in the bucket. It's this search step that we minimise through hashing, and why murmur3 is optimised for speed. The faster the hash function, the faster we find the right bucket to search, the faster our hash map is overall. This is also why reducing collisions is so crucial. If we did decide to use that dummy hash function from all the way at the start of this article, the one that returns 0 all the time, we'll put all of our key-value pairs into the first bucket. Finding anything could mean we have to check all of the values in the hash map. With a good hash function, with good distribution, we reduce the amount of searching we have to do to 1/N, where N is the number of buckets. Let's see how stringSum does. Interestingly, stringSum seems to distribute values quite well. You notice a pattern, but the overall distribution looks good. stringSum. I knew it would be good for something. Not so fast, Haskie. We need to talk about a serious problem. The distribution looks okay on these sequential numbers, but we've seen that stringSum doesn't have a good avalanche effect. This doesn't end well. # Real-world collisions Let's look at 2 real-world data sets: IP addresses and English words. What I'm going to do is take 100,000,000 random IP addresses and 466,550 English words, hash all of them with both murmur3 and stringSum, and see how many collisions we get. IP Addresses murmur3 stringSum Collisions 1,156,959 99,999,566 1.157% 99.999% English words murmur3 stringSum Collisions 25 464,220 0.005% 99.5% When we use hash maps for real, we aren't usually storing random values in them. We can imagine counting the number of times we've seen an IP address in rate limiting code for a server. Or code that counts the occurrences of words in books throughout history to track their origin and popularity. stringSum sucks for these applications because of it's extremely high collision rate. # Manufactured collisions Now it's murmur3's turn for some bad news. It's not just collisions caused by similarity in the input we have to worry about. Check this out. What's happening here? Why do all of these jibberish strings hash to the same number? I hashed 141 trillion random strings to find values that hash to the number 1228476406 when using murmur3. Hash functions have to always return the same output for a specific input, so it's possible to find collisions by brute force. trillion? Like... 141 and then 12 zeroes? Yes, and it only took me 25 minutes. Computers are fast. Bad actors having easy access to collisions can be devastating if your software builds hash maps out of user input. Take HTTP headers, for example. An HTTP request looks like this: GET / HTTP/1.1 Accept: */* Accept-Encoding: gzip, deflate Connection: keep-alive Host: google.com You don't have to understand all of the words, just that the first line is the path being requested and all of the other lines are headers. Headers are Key: Value pairs, so HTTP servers tend to use maps to store them. Nothing stops us from passing any headers we want, so we can be really mean and pass headers we know will cause collisions. This can significantly slow down the server. This isn't theoretical, either. If you search "HashDoS" you'll find a lot more examples of this. It was a really big deal in the mid-2000s. There are a few ways to mitigate this specific to HTTP servers: ignoring jibberish header keys and limiting the number of headers you store, for example. But modern hash functions like murmur3 offer a more generalised solution: randomisation. Earlier in this post we showed some examples of hash function implementations. Those implementations took a single argument: input. Lots of modern hash functions take a 2nd parameter: seed (sometimes called salt). In the case of murmur3, this seed is a number. So far, we've been using 0 as the seed. Let's see what happens with the collisions I've collected when we use a seed of 1. Just like that, 0 to 1, the collisions are gone. This is the purpose of the seed: it randomises the output of the hash function in an unpredictable way. How it achieves this is beyond the scope of this article, all hash functions do this in their own way. The hash function still returns the same output for the same input, it's just that the input is a combination of input and seed. Things that collide with one seed shouldn't collide when using another. Programming languages often generate a random number to use as the seed when the process starts, so that every time you run your program the seed is different. As a bad guy, not knowing the seed, it is now impossible for me to reliably cause harm. If you look closely in the above visualisation and the one before it, they're the same values being hashed but they produce different hash values. The implication of this is that if you hash a value with one seed, and want to be able to compare against it in the future, you need to make sure you use the same seed. Having different values for different seeds doesn't affect the hash map use-case, because hash maps only live for the duration the program is running. Provided you use the same seed for the lifetime of the program, your hash maps will continue to work just fine. If you ever store hash values outside of your program, in a file for example, you need to be careful you know what seed has been used. # Playground As is tradition, I've made a playground for you to write your own hash functions and see them visualised with the grids seen in this article. Click here to try it! # Conclusion We've covered what a hash function is, some ways to measure how good it is, what happens when it's not good, and some of the ways they can be broken by bad actors. The universe of hash functions is a large one, and we've really only scratched the surface in this post. We haven't spoken about cryptographic vs non-cryptographic hashing, we've touched on only 1 of the thousands of use-cases for hash functions, and we haven't talked about how exactly modern hash functions actually work. Some further reading I recommend if you're really enthusiastic about this topic and want to learn more: https://github.com/rurban/smhasher this repository is the gold standard for testing how good hash functions are. They run a tonne of tests against a wide number of hash functions and present the results in a big table. It will be difficult to understand what all of the tests are for, but this is where the state of the art of hash testing lives. https://djhworld.github.io/hyperloglog/ this is an interactive piece on a data structure called HyperLogLog. It's used to efficiently count the number of unique elements in very, very large sets. It uses hashing to do it in a really clever way. https://www.gnu.org/software/gperf/ is a piece of software that, when given the expected set of things you want to hash, can generate a "perfect" hash function automatically. Feel free to join the discussion on Hacker News! # Acknowledgements Thanks to everyone who read early drafts and provided invaluable feedback. delroth, Manon, Aaron, Charlie And everyone who helped me find murmur3 hash collisions: Indy, Aaron, Max # Patreon After the success of Load Balancing and Memory Allocation, I have decided to set up a Patreon page: https://patreon.com/samwho. For all of these articles going forward, I am going to post a Patreon-exclusive behind-the-scenes post talking about decisions, difficulties, and lessons learned from each post. It will give you a deep look in to how these articles evolve, and I'm really stoked about the one I've written for this one. If you enjoy my writing, and want to support it going forward, I'd really appreciate you becoming a Patreon. ❤️
More in programming
I started writing this early last week but Real Life Stuff happened and now you're getting the first-draft late this week. Warning, unedited thoughts ahead! New Logic for Programmers release! v0.9 is out! This is a big release, with a new cover design, several rewritten chapters, online code samples and much more. See the full release notes at the changelog page, and get the book here! Write the cleverest code you possibly can There are millions of articles online about how programmers should not write "clever" code, and instead write simple, maintainable code that everybody understands. Sometimes the example of "clever" code looks like this (src): # Python p=n=1 exec("p*=n*n;n+=1;"*~-int(input())) print(p%n) This is code-golfing, the sport of writing the most concise code possible. Obviously you shouldn't run this in production for the same reason you shouldn't eat dinner off a Rembrandt. Other times the example looks like this: def is_prime(x): if x == 1: return True return all([x%n != 0 for n in range(2, x)] This is "clever" because it uses a single list comprehension, as opposed to a "simple" for loop. Yes, "list comprehensions are too clever" is something I've read in one of these articles. I've also talked to people who think that datatypes besides lists and hashmaps are too clever to use, that most optimizations are too clever to bother with, and even that functions and classes are too clever and code should be a linear script.1. Clever code is anything using features or domain concepts we don't understand. Something that seems unbearably clever to me might be utterly mundane for you, and vice versa. How do we make something utterly mundane? By using it and working at the boundaries of our skills. Almost everything I'm "good at" comes from banging my head against it more than is healthy. That suggests a really good reason to write clever code: it's an excellent form of purposeful practice. Writing clever code forces us to code outside of our comfort zone, developing our skills as software engineers. Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you [will get excellent debugging practice at exactly the right level required to push your skills as a software engineer] — Brian Kernighan, probably There are other benefits, too, but first let's kill the elephant in the room:2 Don't commit clever code I am proposing writing clever code as a means of practice. Being at work is a job with coworkers who will not appreciate if your code is too clever. Similarly, don't use too many innovative technologies. Don't put anything in production you are uncomfortable with. We can still responsibly write clever code at work, though: Solve a problem in both a simple and a clever way, and then only commit the simple way. This works well for small scale problems where trying the "clever way" only takes a few minutes. Write our personal tools cleverly. I'm a big believer of the idea that most programmers would benefit from writing more scripts and support code customized to their particular work environment. This is a great place to practice new techniques, languages, etc. If clever code is absolutely the best way to solve a problem, then commit it with extensive documentation explaining how it works and why it's preferable to simpler solutions. Bonus: this potentially helps the whole team upskill. Writing clever code... ...teaches simple solutions Usually, code that's called too clever composes several powerful features together — the "not a single list comprehension or function" people are the exception. Josh Comeau's "don't write clever code" article gives this example of "too clever": const extractDataFromResponse = (response) => { const [Component, props] = response; const resultsEntries = Object.entries({ Component, props }); const assignIfValueTruthy = (o, [k, v]) => (v ? { ...o, [k]: v } : o ); return resultsEntries.reduce(assignIfValueTruthy, {}); } What makes this "clever"? I count eight language features composed together: entries, argument unpacking, implicit objects, splats, ternaries, higher-order functions, and reductions. Would code that used only one or two of these features still be "clever"? I don't think so. These features exist for a reason, and oftentimes they make code simpler than not using them. We can, of course, learn these features one at a time. Writing the clever version (but not committing it) gives us practice with all eight at once and also with how they compose together. That knowledge comes in handy when we want to apply a single one of the ideas. I've recently had to do a bit of pandas for a project. Whenever I have to do a new analysis, I try to write it as a single chain of transformations, and then as a more balanced set of updates. ...helps us master concepts Even if the composite parts of a "clever" solution aren't by themselves useful, it still makes us better at the overall language, and that's inherently valuable. A few years ago I wrote Crimes with Python's Pattern Matching. It involves writing horrible code like this: from abc import ABC class NotIterable(ABC): @classmethod def __subclasshook__(cls, C): return not hasattr(C, "__iter__") def f(x): match x: case NotIterable(): print(f"{x} is not iterable") case _: print(f"{x} is iterable") if __name__ == "__main__": f(10) f("string") f([1, 2, 3]) This composes Python match statements, which are broadly useful, and abstract base classes, which are incredibly niche. But even if I never use ABCs in real production code, it helped me understand Python's match semantics and Method Resolution Order better. ...prepares us for necessity Sometimes the clever way is the only way. Maybe we need something faster than the simplest solution. Maybe we are working with constrained tools or frameworks that demand cleverness. Peter Norvig argued that design patterns compensate for missing language features. I'd argue that cleverness is another means of compensating: if our tools don't have an easy way to do something, we need to find a clever way. You see this a lot in formal methods like TLA+. Need to check a hyperproperty? Cast your state space to a directed graph. Need to compose ten specifications together? Combine refinements with state machines. Most difficult problems have a "clever" solution. The real problem is that clever solutions have a skill floor. If normal use of the tool is at difficult 3 out of 10, then basic clever solutions are at 5 out of 10, and it's hard to jump those two steps in the moment you need the cleverness. But if you've practiced with writing overly clever code, you're used to working at a 7 out of 10 level in short bursts, and then you can "drop down" to 5/10. I don't know if that makes too much sense, but I see it happen a lot in practice. ...builds comradery On a few occasions, after getting a pull request merged, I pulled the reviewer over and said "check out this horrible way of doing the same thing". I find that as long as people know they're not going to be subjected to a clever solution in production, they enjoy seeing it! Next week's newsletter will probably also be late, after that we should be back to a regular schedule for the rest of the summer. Mostly grad students outside of CS who have to write scripts to do research. And in more than one data scientist. I think it's correlated with using Jupyter. ↩ If I don't put this at the beginning, I'll get a bajillion responses like "your team will hate you" ↩
Whether we like it or not, email is widely used to identify a person. Code sent to email is used as authentication and sometimes as authorisation for certain actions. I’m not comfortable with Google having such power over me, especially given the fact that they practically don’t have any support you can appeal to. If your Google account is blocked, that’s it. Maybe you know someone from Google and they can help you, but for most of us mortals that’s not an option.
In his book “The Order of Time” Carlo Rovelli notes how we often asks ourselves questions about the fundamental nature of reality such as “What is real?” and “What exists?” But those are bad questions he says. Why? the adjective “real” is ambiguous; it has a thousand meanings. The verb “to exist” has even more. To the question “Does a puppet whose nose grows when he lies exist?” it is possible to reply: “Of course he exists! It’s Pinocchio!”; or: “No, it doesn’t, he’s only part of a fantasy dreamed up by Collodi.” Both answers are correct, because they are using different meanings of the verb “to exist.” He notes how Pinocchio “exists” and is “real” in terms of a literary character, but not so far as any official Italian registry office is concerned. To ask oneself in general “what exists” or “what is real” means only to ask how you would like to use a verb and an adjective. It’s a grammatical question, not a question about nature. The point he goes on to make is that our language has to evolve and adapt with our knowledge. Our grammar developed from our limited experience, before we know what we know now and before we became aware of how imprecise it was in describing the richness of the natural world. Rovelli gives an example of this from a text of antiquity which uses confusing grammar to get at the idea of the Earth having a spherical shape: For those standing below, things above are below, while things below are above, and this is the case around the entire earth. On its face, that is a very confusing sentence full of contradictions. But the idea in there is profound: the Earth is round and direction is relative to the observer. Here’s Rovelli: How is it possible that “things above are below, while things below are above"? It makes no sense…But if we reread it bearing in mind the shape and the physics of the Earth, the phrase becomes clear: its author is saying that for those who live at the Antipodes (in Australia), the direction “upward” is the same as “downward” for those who are in Europe. He is saying, that is, that the direction “above” changes from one place to another on the Earth. He means that what is above with respect to Sydney is below with respect to us. The author of this text, written two thousand years ago, is struggling to adapt his language and his intuition to a new discovery: the fact that the Earth is a sphere, and that “up” and “down” have a meaning that changes between here and there. The terms do not have, as previously thought, a single and universal meaning. So language needs innovation as much as any technological or scientific achievement. Otherwise we find ourselves arguing over questions of deep import in a way that ultimately amounts to merely a question of grammar. Email · Mastodon · Bluesky
In mid-March we released a big bug fix update—elementary OS 8.0.1—and since then we’ve been hard at work on even more bug fixes and some new exciting features that I’m excited to share with you today! Read ahead to find out what we’ve released recently and what you can help us test in Early Access. Quick Settings Quick Settings has a new “Prevent Sleep” toggle Leo added a new “Prevent Sleep” toggle. This is useful when you’re giving a presentation or have a long-running background task where you want to temporarily avoid letting the computer go to sleep on its normal schedule. We also fixed a bug where the “Dark Mode” toggle would cancel the dark mode schedule when used. We now have proper schedule snoozing, so when you manually toggle Dark Mode on or off while using a timed or sunset-to-sunrise schedule, your schedule will resume on the next schedule change instead of being canceled completely. Vishal also fixed an issue that caused some apps to report being improperly closed on system shutdown or restart and on the lock screen we now show the “Suspend” button rather than the “Lock” button. System Settings Locale settings has a fresh layout thanks to Alain with its options aligned more cleanly and improved links to additional settings. Locale Settings has a more responsive design We’ve also added the phrase “about this device” as a search term for the System page and improved interface copy when a restart is required to finish installing updates based on your feedback. Plus, Stanisław improved stylus detection in Wacom settings preventing a crash when no stylus is found. AppCenter We now show a small label next to the download button for apps which contain in-app purchases. This is especially useful for easily identifying free-to-play games or alt stores like Steam or Heroic Games Launcher. AppCenter now shows when apps have in-app purchases Plus, we now reload app icons on-the-fly as their data is processed, thanks to Italo. That means you’ll no longer get occasionally stuck with an AppCenter which shows missing images for app’s who have taken a bit longer than usual to load. Get These Updates As always, pop open System Settings → System on elementary OS 8 and hit “Update All” to get these updates plus your regular security, bug fix, and translation updates. Or set up automatic updates and get a notification when updates are ready to install! Early Access Our development focus recently has been on some of the bigger features that will likely land for either elementary OS 8.1 or 9. We’ve got a new app, big changes to the design of our desktop itself, a whole lot of under-the-hood cleanup, and the return of some key system services thanks to a new open source project. Monitor We’re now shipping a System Monitor app by default By popular demand—and thanks to the hard work of Stanisław—we have a new system monitor app called “Monitor” shipping in Early Access. Monitor provides usage information for your processor, GPU, memory, storage, network, and currently running processes. You can optionally see system information in the panel with Monitor You can also optionally get a ton of glanceable information shown in the panel. There’s currently a lot of work happening to port Monitor to GTK4 and improve its functionality under the Secure Session, so make sure to report any issues you find! Multitasking The Dock is getting a workspace switcher Probably the biggest change to the Pantheon shell since its early inception, the Dock is getting a new workspace switcher! The workspace switcher works in a familiar way to the one you may have seen in the Multitasking View: Your currently open workspaces are represented as tiles with the icons of apps running on them; You can select a workspace to switch to it; You can drag-and-drop workspaces to rearrange them; And you can use the “+” button to create a new blank workspace. One new trick however is that selecting the workspace you’re already on will launch Multitasking View. The new workspace switcher makes it so much more accessible to multitask with just the mouse and get an overview of your workflows without having to first enter the Multitasking View. We’re really excited to hear what people think about it! You can close apps from Multitasking View by swiping up Another very satisfying feature for folks using touch input, you can now swipe up windows in the Multitasking View to close them. This is a really familiar gesture for those of us with Android and iOS devices and feels really natural for managing a big stack of windows without having to aim for a small “x” button. GTK4 Porting We’ve recently landed the port of Tasks to GTK4. So far that comes with a few fixes to tighten up its design, with much more possible in the future. Please make sure to help us test it thoroughly for any regressions! Tasks has a slightly tightened up design We’re also making great progress on porting the panel to GTK4. So far we have branches in review for Nightlight, Bluetooth, Datetime, and Network indicators. Power, Keyboard, and Quick Settings indicators all have in-progress branches. That leaves just Applications, Sound, and Notifications. So far these ports don’t come with major feature changes, but they do involve lots of cleaning up and modernizing of these code bases and in some cases fixing bugs! When the port is finished, we should see immediate performance gains and we’ll have a much better foundation for future releases. You can follow along with our progress porting everything to GTK4 in this GitHub Project. And More When you take a screenshot using keyboard shortcuts or by secondary-clicking an app’s window handle, we now send a notification letting you know that it was succesful and where to find the resulting image. Plus there’s a handy button that opens Files with your screenshot pre-selected. We’re also testing beaconDB as a replacement for Mozilla Location Services (MLS). If you’re not aware, we relied on MLS in previous versions of elementary OS to provide location information for devices that don’t have a GPS radio. Unfortunately Mozilla discontinued the service last June and we’ve been left without a replacement until now. Without these services, not only did maps and weather apps cease to function, but system features like automatic timezone detection and features that rely on sunset and sunrise times no longer work properly. beaconDB offers a drop-in replacement for MLS that uses Wireless networks, bluetooth devices, and cell towers to provide location data when requested. All of its data is crowd-sourced and opt-in and several distributions are now defaulting to using it as their location services data provider. I’ve set up a small sponsorship from elementary on Liberapay to support the project. If you can help support beaconDB either by sponsoring or providing stumbler data, I’d highly encourage you to do so! Sponsors At the moment we’re at 23% of our monthly funding goal and 336 Sponsors on GitHub! Shoutouts to everyone helping us reach our goals here. Your monthly sponsorship funds development and makes sure we have the resources we need to give you the best version of elementary OS we can! Monthly release candidate builds and daily Early Access builds are available to GitHub Sponsors from any tier! Beware that Early Access builds are not considered stable and you will encounter fresh issues when you run them. We’d really appreciate reporting any problems you encounter with the Feedback app or directly on GitHub.
Via Jeremy Keith’s link blog I found this article: Elizabeth Goodspeed on why graphic designers can’t stop joking about hating their jobs. It’s about the disillusionment of designers since the ~2010s. Having ridden that wave myself, there’s a lot of very relatable stuff in there about how design has evolved as a profession. But before we get into the meat of the article, there’s some bangers worth acknowledging, like this: Amazon – the most used website in the world – looks like a bunch of pop-up ads stitched together. lol, burn. Haven’t heard Amazon described this way, but it’s spot on. The hard truth, as pointed out in the article, is this: bad design doesn’t hurt profit margins. Or at least there’s no immediately-obvious, concrete data or correlation that proves this. So most decision makers don’t care. You know what does help profit margins? Spending less money. Cost-savings initiatives. Those always provide a direct, immediate, seemingly-obvious correlation. So those initiatives get prioritized. Fuzzy human-centered initiatives (humanities-adjacent stuff), are difficult to quantitatively (and monetarily) measure. “Let’s stop printing paper and sending people stuff in the mail. It’s expensive. Send them emails instead.” Boom! Money saved for everyone. That’s easier to prioritize than asking, “How do people want us to communicate with them — if at all?” Nobody ever asks that last part. Designers quickly realized that in most settings they serve the business first, customers second — or third, or fourth, or... Shar Biggers [says] designers are “realising that much of their work is being used to push for profit rather than change..” Meet the new boss. Same as the old boss. As students, designers are encouraged to make expressive, nuanced work, and rewarded for experimentation and personal voice. The implication, of course, is that this is what a design career will look like: meaningful, impactful, self-directed. But then graduation hits, and many land their first jobs building out endless Google Slides templates or resizing banner ads...no one prepared them for how constrained and compromised most design jobs actually are. Reality hits hard. And here’s the part Jeremy quotes: We trained people to care deeply and then funnelled them into environments that reward detachment. And the longer you stick around, the more disorienting the gap becomes – especially as you rise in seniority. You start doing less actual design and more yapping: pitching to stakeholders, writing brand strategy decks, performing taste. Less craft, more optics; less idealism, more cynicism. Less work advocating for your customers, more work for advocating for yourself and your team within the organization itself. Then the cynicism sets in. We’re not making software for others. We’re making company numbers go up, so our numbers ($$$) will go up. Which reminds me: Stephanie Stimac wrote about reaching 1 year at Igalia and what stood out to me in her post was that she didn’t feel a pressing requirement to create visibility into her work and measure (i.e. prove) its impact. I’ve never been good at that. I’ve seen its necessity, but am just not good at doing it. Being good at building is great. But being good at the optics of building is often better — for you, your career, and your standing in many orgs. Anyway, back to Elizabeth’s article. She notes you’ll burn out trying to monetize something you love — especially when it’s in pursuit of maintaining a cost of living. Once your identity is tied up in the performance, it’s hard to admit when it stops feeling good. It’s a great article and if you’ve been in the design profession of building software, it’s worth your time. Email · Mastodon · Bluesky