Bloom Filters

from samwho.dev [alt+shift+b] in programming

.bf { width: 100%; height: 150px; } @media only screen and (min-width: 320px) and (max-width: 479px) { .bf { height: 200px; } } @media only screen and (min-width: 480px) and (max-width: 676px) { .bf { height: 200px; } } @media only screen and (min-width: 677px) and (max-width: 991px) { .bf { height: 150px; } } form { display: flex; flex-direction: column; align-items: center; justify-content: stretch; } input { border: 1px solid rgb(119, 119, 119); padding: 0.25rem; border-radius: 0.25rem; height: 2em; line-height: 2em; } .aside { padding: 2rem; width: 100vw; position: relative; margin-left: -50vw; left: 50%; background-color: #eeeeee; display: flex; align-items: center; flex-direction: column; } .aside > * { flex-grow: 1; } .aside p { padding-left: 1rem; padding-right: 1rem; max-width: 780px; font-style: italic; font-family: Lora, serif; text-align: center; } Everyone has a set of...

a year ago

Remove from reading list Add to reading list [alt+a] Read now [→]

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from samwho.dev

Reservoir Sampling

header h1 { padding: 0; margin-top: 0.2rem; margin-bottom: 1rem; } button { margin: 0.5rem; padding: 0.5rem 1rem; background-color: #444444; color: white; border: none; border-radius: 8px; cursor: pointer; touch-action: manipulation; user-select: none; } button:hover:not(:disabled) { background-color: #555555; touch-action: manipulation; user-select: none; } button:disabled { filter: opacity(0.5); cursor: not-allowed; touch-action: manipulation; user-select: none; } input[type="range"] { width: 100%; margin: 1rem 0; } .control { display: flex; justify-content: center; align-items: center; margin: 1rem 0; } .control label { margin-right: 1rem; white-space: nowrap; } .control input[type="range"] { flex-grow: 1; } .odds { font-size: 1.2rem; font-family: var(--code-font); font-weight: bold; text-align: center; display: flex; justify-content: center; gap: 2rem; margin: 1rem 0; } .odds .hold { display: flex; flex-direction: column; align-items: center; } .odds .replace { display: flex; flex-direction: column; align-items: center; } .odds .hold .value { color: var(--palette-dark-blue); white-space: nowrap; } .odds .replace .value { color: var(--palette-red); white-space: nowrap; } article input { --c: var(--palette-orange); --l: 2px; --h: 30px; --w: 30px; width: 100%; height: var(--h); -webkit-appearance :none; -moz-appearance :none; appearance :none; background: none; cursor: pointer; overflow: hidden; } article input:focus-visible, article input:hover{ --p: 25%; } article input[type="range" i]::-webkit-slider-thumb{ height: var(--h); width: var(--w); aspect-ratio: 1; border-radius: 50%; background: var(--c); border-image: linear-gradient(90deg,var(--c) 50%,#ababab 0) 0 1/calc(50% - var(--l)/2) 100vw/0 100vw; -webkit-appearance: none; appearance: none; box-shadow: none; transition: .3s; } article input[type="range"]::-moz-range-thumb { --h: 25px; --w: 25px; height: var(--h); width: var(--w); aspect-ratio: 1; border-radius: 50%; background: var(--c); border-image: linear-gradient(90deg,var(--c) 50%,#ababab 0) 0 1/calc(50% - var(--l)/2) 100vw/0 100vw; -webkit-appearance: none; appearance: none; box-shadow: none; transition: .3s; } img.hero { --width: 200px; width: var(--width); max-width: var(--width); margin-top: calc(var(--width) * -0.5); margin-bottom: 1rem; } Reservoir Sampling Reservoir sampling is a technique for selecting a fair random sample when you don't know the size of the set you're sampling from. By the end of this essay you will know: When you would need reservoir sampling. The mathematics behind how it works, using only basic operations: subtraction, multiplication, and division. No math notation, I promise. A simple way to implement reservoir sampling if you want to use it. ittybit, and their API for working with videos, images, and audio. If you need to store, encode, or get intelligence from the media files in your app, check them out! # Sampling when you know the size In front of you are 10 playing cards and I ask you to pick 3 at random. How do you do it? The first technique that might come to mind from your childhood is to mix them all up in the middle. Then you can straighten them out and pick the first 3. You can see this happen below by clicking "Shuffle." Every time you click "Shuffle," the chart below tracks what the first 3 cards were. At first you'll notice some cards are selected more than others, but if you keep going it will even out. All cards have an equal chance of being selected. This makes it "fair." Click "Shuffle 100 times" until the chart evens out. You can reset the chart if you'd like to start over. This method works fine with 10 cards, but what if you had 1 million cards? Mixing those up won't be easy. Instead, we could use a random number generator to pick 3 indices. These would be our 3 chosen cards. We no longer have to move all of the cards, and if we click the "Select" button enough times we'll see that this method is just as fair as the mix-up method. I'm stretching the analogy a little here. It would take a long time to count through the deck to get to, say, index 436,234. But when it's an array in memory, computers have no trouble finding an element by its index. Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random? That's the curveball: you don't know. No, you can only hold on to 1 card at a time. You're free to swap your card with the newest one each time I show you a card, but you can only hold one and you can't go back to a card you've already seen. Believe it or not, this is a real problem and it has a real and elegant solution. For example, let's say you're building a log collection service. Text logs, not wooden ones. This service receives log messages from other services and stores them so that it's easy to search them in one place. One of the things you need to think about when building a service like this is what do you do when another service starts sending you way too many logs. Maybe it's a bad release, maybe one of your videos goes viral. Whatever the reason, it threatens to overwhelm your log collection service. Let's simulate this. Below you can see a stream of logs that experiences periodic spikes. A horizontal line indicates the threshold of logs per second that the log collection service can handle, which in this example is 5 logs per second. You can see that every so often, logs per second spikes above the threshold . One way to deal with this is "sampling." Deciding to send only a fraction of the logs to the log collection service. Let's send 10% of the logs. Below we will see the same simulation again, but this time logs that don't get sent to our log collection service will be greyed out. The graph has 2 lines: a black line tracks sent logs , the logs that are sent to our log collection service, and a grey line tracks total logs . The rate of sent logs never exceeds the threshold , so we never overwhelm our log collection service. However, in the quieter periods we're throwing away 90% of the logs when we don't need to! What we really want is to send at most 5 logs per second. This would mean that during quiet periods you get all the logs, but during spikes you discard logs to protect the log collection service. The simple way to achieve this would be to send the first 5 logs you see each second, but this isn't fair. You aren't giving all logs an equal chance of being selected. # Sampling when you don't know the size We instead want to pick a fair sample of all the logs we see each second. The problem is that we don't know how many we will see. Reservoir sampling is an algorithm that solves this exact problem. You could, but why live with that uncertainty? You'd be holding on to an unknown number of logs in memory. A sufficiently big spike could cause you problems. Reservoir sampling solves this problem, and does so without ever using more memory than you ask it to. Let's go back to our curveball of me showing you 1 card at a time. Here's a recap of the rules: I'll draw cards one at a time from a deck. Each time I show you a card, you have to choose to hold it or discard it. If you were already holding a card, you discard your held card before replacing it with the new card. At any point I can stop drawing cards and whatever card you're holding is the one you've chosen. How would you play this game in a way that ensures all cards have been given an equal chance to be selected when I decide to stop? You're on the right track. Let's have a look at how the coin flip idea plays out in practice. Below you see a deck of cards. Clicking "Deal" will draw a card and 50% of the time it will go to the discard pile on the right, and 50% of the time it will become your held card in the center, with any previously held card moving to the discard pile. The problem is that while the hold vs discard counts are roughly equal, which feels fair, later cards are much more likely to be held when I stop than earlier cards. The first card drawn has to win 10 coin flips to still be in your hand after the 10th card is drawn. The last card only has to win 1. Scrub the slider below to see how the chances change as we draw more cards. Each bar represents a card in the deck, and the height of the bar is the chance we're holding that card when I stop. Below the slider are the chances we're holding the first card drawn vs. the last card drawn. Anything older than 15 cards ago is has a less than 0.01% chance of being held when I stop. Because believe it or not, we only have to make one small change to this idea to make it fair. Instead of flipping a coin to decide if we'll hold the card or not, instead we give each new card a 1/n chance of being held, where n is the number of cards we've seen so far. Yep! In order to be fair, every card must have an equal chance of being selected. So for the 2nd card, we want both cards to have a 1/2 chance. For the 3rd card, we want all 3 cards to have a 1/3 chance. For the 4th card, we want all 4 cards to have a 1/4 chance, and so on. So if we use 1/n for the new card, we can at least say that the new card has had a fair shot. Let's have a look at the chances as you draw more cards with this new method. new card has the right chance of being selected, but how does that make the older cards fair? So far we've focused on the chance of the new card being selected, but we also need to consider the chance of the card you're holding staying in your hand. Let's walk through the numbers. # Card 1 The first card is easy: we're not holding anything, so we always choose to hold the first card. The chance we're holding this card is 1/1, or 100%. # Card 2 This time we have a real choice. We can keep hold of the card we have, or replace it with the new one. We've said that we're going to do this with a 1/n chance, where n is the number of cards we've seen so far. So our chance of replacing the first card is 1/2, or 50%, and our chance of keeping hold of the first card is its chance of being chosen last time multiplied by its chance of being replaced, so 100% * 1/2, which is again 50%. # Card 3 The card we're holding has a 50% chance of being there. This is true regardless what happened up to this point. No matter whether we're holding card 1 or card 2, it's 50%. The new card has a 1/3 chance of being selected, so the card we're holding has a 1/3 chance of being replaced. This means that our held card has a 2/3 chance of remaining held. So its chances of "surviving" this round are 50% * 2/3. # Card N This pattern continues for as many cards as you want to draw. We can express both options as formulas. Drag the slider to substitute n with real numbers and see that the two formulas are always equal. If 1/n is the chance of choosing the new card, 1/(n-1) is the chance of choosing the new card from the previous draw. The chance of not choosing the new card is the complement of 1/n, which is 1-(1/n). Below are the cards again except this time set up to use 1/n instead of a coin flip. Click to the end of the deck. Does it feel fair to you? There's a good chance that through the 2nd half of the deck, you never swap your chosen card. This feels wrong, at least to me, but as we saw above the numbers say it is completely fair. # Choosing multiple cards Now that we know how to select a single card, we can extend this to selecting multiple cards. There are 2 changes we need to make: Rather than new cards having a 1/n chance of being selected, they now have a k/n chance, where k is the number of cards we want to choose. When we decide to replace a held card, we choose one of the k cards we're holding at random. So our new previous card selection formula becomes k/(n-1) because we're now holding k cards. Then the chance that any of the cards we're holding get replaced is 1-(1/n). Let's see how this plays out with real numbers. The fairness still holds, and will hold for any k and n pair. This is because all held cards have an equal chance of being replaced, which keeps them at an equal likelihood of still being in your hand every draw. A nice way to implement this is to use an array of size k. For each new card, generate a random number between 0 and n. If the random number is less than k, replace the card at that index with the new card. Otherwise, discard the new card. And that's how reservoir sampling works! # Applying this to log collection Let's take what we now know about reservoir sampling and apply it to our log collection service. We'll set k=5, so we're "holding" at most 5 log messages at a time, and every second we will send the selected logs to the log collection service. After we've done that, we empty our array of size 5 and start again. This creates a "lumpy" pattern in the graph below, and highlights a trade-off when using reservoir sampling. It's no longer a real-time stream of logs, but chunks of logs sent at an interval. However, sent logs never exceeds the threshold , and during quiet periods the two lines track each other almost perfectly. No logs lost during quiet periods, and never more than threshold logs per second sent during spikes. The best of both worlds. It also doesn't store more than k=5 logs, so it will have predictable memory usage. # Further reading Something you may have thought while reading this post is that some logs are more valuable than others. You almost certainly want to keep all error logs, for example. For that use-case there is a weighted variant of reservoir sampling. I wasn't able to find a simpler explanation of it, so that link is to Wikipedia which I personally find a bit hard to follow. But the key point is that it exists and if you need it you can use it. # Conclusion Reservoir sampling is one of my favourite algorithms, and I've been wanting to write about it for years now. It allows you to solve a problem that at first seems impossible, in a way that is both elegant and efficient. Thank you again to ittybit for sponsoring this post. I really couldn't have hoped for a more supportive first sponsor. Thank you for believing in and understanding what I'm doing here. Thank you to everyone who read this post and gave their feedback. You made this post much better than I could have done on my own, and steered me away from several paths that just weren't working. If you want to tell me what you thought of this post by sending me an anonymous message that goes directly to my phone, go to https://samwho.dev/ping.

3 months ago • 8 votes

Turing Machines

body { text-wrap: pretty; } @media (prefers-reduced-motion: reduce) { * { transition: none; animation: none; } } turing-machine { width: 100%; display: block; position: relative; padding-bottom: 1em; } turing-machine .program-container { position: relative; display: flex; justify-content: center; } turing-machine table { border: none; font-family: Fira Code; border-collapse: collapse; border-spacing: 0; margin: 1px; margin-top: 0.5em; width: auto; } turing-machine thead td { text-align: center; } turing-machine td { text-align: left; padding-left: 3vw; padding-right: 3vw; padding-top: 0.2em; padding-bottom: 0.2em; border: 1px dashed #bbbbbb; } turing-machine thead td { border: 0; } turing-machine .container { z-index: 1; background-color: white; } turing-machine .svg-container { padding-bottom: 1px; padding-top: 0.5em; line-height: 0; overflow-x: scroll; overflow-y: hidden; position: relative; scrollbar-width: none; cursor: grab; cursor: -webkit-grab; } turing-machine .svg-container::-webkit-scrollbar { display: none; } turing-machine .controls-container { line-height: 1.5; display: flex; align-items: center; justify-content: center; gap: 0.3em; margin-top: 0.2em; padding-bottom: 0.2em; } turing-machine .controls-container button { color: black; background-color: #aaaaaa; border-radius: 30%/50%; padding: 0.3em; height: 2em; min-width: 3em; text-align: center; border: none; cursor: pointer; } turing-machine .controls-container select { color: black; border: 2px solid #aaaaaa; background-color: #ffffff; border-radius: 30%/50%; padding: 0.15em; height: 2em; min-width: 3em; text-align: center; cursor: pointer; font-family: "Fira Code"; } turing-machine .controls-container button:disabled { background-color: #eeeeee; color: #aaaaaa; cursor: auto; } turing-machine .controls-container button:disabled:hover { background-color: #eeeeee; } turing-machine .controls-container button:hover { background-color: #999999; } turing-machine .controls-container button:active { background-color: #888888; } turing-machine .error { width: 80%; margin-top: 0.5em; margin-bottom: 0.5em; border-radius: 10px; background-color: #ffdddd; color: red; font-family: "Fira Code"; font-size: 1.5em; font-weight: bold; text-align: center; margin: auto; } .dragging { cursor: grabbing !important; cursor: -webkit-grabbing !important; } .sticky { position: sticky !important; top: 0; } turing-machine .gradient { content: ""; z-index: 2; position: sticky; top: 0; left: 0; pointer-events: none; background-image: linear-gradient(90deg, rgba(255,255,255,1) 0%, rgba(255,255,255,0) 20%, rgba(255,255,255,0) 80%, rgba(255,255,255,1) 100%); width: 100%; margin-top: -62px; height: 64px; } turing-machine svg { user-select: none; /*touch-action: none;*/ } turing-machine svg text { font-family: "Fira Code"; } turing-machine tr { padding-top: 0.5em; padding-bottom: 0.5em; } turing-machine .instruction { color: black; display: inline-block; border-radius: 0.5em; padding: 0.1em; min-width: 4.5ch; height: 100%; text-align: center; } turing-machine .pointer { display: inline-block; border-radius: 0.5em; position: absolute; z-index: -1; } turing-machine program { display: none; } turing-machine controls { display: none; } turing-machine.heading svg text { font-family: "Lora"; font-weight: bold; } figure.hero { font-family: "Lora"; text-align: center; padding-top: 1em; padding-bottom: 2em; margin-top: 1em; width: 100%; max-width: 100%; } figure.hero turing-machine { padding: 0; margin: 0; } figure.hero turing-machine .svg-container { margin: 0; padding-top: 0; } figure.hero turing-machine .gradient { top: 0; } figure.hero .signature { margin-top: 0.5rem; max-width: 200px; } figure.hero .portrait { margin-top: 1em; width: 250px; max-width: 250px; } figure.hero p { padding: 0; font-size: 0.9em; } figure.hero figcaption { font-family: "Lora"; font-style: small-caps; font-size: 0.8em; } @keyframes pulse { 0% { transform: scale(1); } 50% { transform: scale(1.1); } 100% { transform: scale(1); } } .symbol { display: inline-block; border: 1px solid black; width: 1.5em; height: 1.5em; font-family: "Fira Code"; text-align: center; } .inline-instruction { display: inline-block; font-family: "Fira Code"; background-color: rgba(0, 158, 115, 0.7); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-instruction) { margin-top: 0.35em; margin-bottom: 0.35em; } .inline-state { display: inline-block; font-family: "Fira Code"; background-color: rgb(86, 180, 233); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-state) { margin-top: 0.35em; margin-bottom: 0.35em; } .inline-value { display: inline-block; font-family: "Fira Code"; background-color: rgb(230, 159, 0); color: black; border-radius: 0.5em; padding: 0.1em; min-width: 3ch; padding-left: 0.2em; padding-right: 0.2em; text-align: center; } li:has(.inline-value) { margin-top: 0.35em; margin-bottom: 0.35em; } figure { margin: auto; margin-top: 2em; margin-bottom: 2em; } figure img { max-width: 400px; } figure figcaption { max-width: 500px; color: #666; font-size: 0.8em; font-style: italic; text-align: center; font-family: "Lora"; text-wrap: pretty; margin: auto; } ALAN M. TURING 23 June 1912 – 7 June 1954 B B | | L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) L P( ) -> F In 1928, David Hilbert, one of the most influential mathematicians of his time, asked whether it is possible to create an algorithm that could determine the correctness of a mathematical statement. This was called the "decision problem," or "Entscheidungsproblem" in Hilbert's native German. In 1936 both Alan Turing and Alonzo Church independently reached the conclusion, using different methods, that the answer is "no." The way Turing did it was to imagine a "universal machine", a machine that could compute anything that could be computed. This idea, the "Turing machine" as Alonzo Church christened it in 1937, laid the foundations for the device you are using to read this post. If we look hard enough we can see Turing's legacy in today's CPUs. By the end of this post, you will know: What a Turing machine is. What can and cannot be computed. What it means to be Turing complete. How modern computers relate to Turing machines. How to write and run your own programs for a Turing machine. # What is a Turing machine? You might expect a universal machine, capable of computing anything that can be computed, to be a complex device. Nothing could be further from the truth. The machine has just 4 parts, and the language used to program it has just 5 instructions. theoretical machine. It was created as a thought experiment to explore the limits of what can be computed. Some have of course been built, but in 1936 they existed only in the heads of Turing and those who read his paper. The parts are: a !tape, a !head, a !program, and a !state. When you're ready, go ahead and press !play. start What you're seeing here is a program that executes P(0) to print 0 to the tape, moves the head right with the R instruction, then !jumps back to the start. It will go on printing 0s forever. At any point, feel free to !pause, step the machine !forwards or !backwards one instruction at a time, or !restart the program from the beginning. There is also a speed selector on the far right of the controls if you want to speed up the machine. Notice that !state and !value never change. Every time the machine performs a !jump, the current state and value are used to pick the correct next row of instructions to execute. This program only has a single state, start, and every time it jumps, the symbol under the !head is !blank. Let's take a look at a program with multiple states. one one | | P(1) R -> zero This program prints alternating 0s and 1s to the tape. It has 2 states, zero and one, to illustrate what happens when you !jump to a different state. Things moving too fast? The slider below can be used to adjust the speed of all the Turing machines on this page. 100% You can also achieve this same result by using a single !state and alternating the !value. Here's an example of that: start start | 1 | R P(0) -> start start | 0 | R P(1) -> start The !value column always stays up to date with what the current symbol is under the !head, then when we !jump that value is used to know which row of instructions to execute. Combining state and value gives us a surprising amount of control over what our !program does. We've so far seen 3 instructions: P prints a given symbol to the tape. R moves the tape head right. ↪︎ jumps to a given state. There are 2 more: L moves the tape head left. H halts the machine. 1 1 | | P(a) L -> 2 2 | | P(l) L -> 3 3 | | P(A) H This program prints the word "Alan" from right to left then halts. If you can't see the full word, you can drag the !tape left and right. If the machine has halted, you can use !restart to start it again from the beginning. All of it! You're probably not going to get Crysis running at 60fps on a simulated Turing machine, but all of the calculations required to render each frame can be done with just these 5 instructions. Everything you have ever seen a computer do can be done with a Turing machine. We'll see a glimpse of how that can work in practice a little later. The last example I want to show you before we move on is the very first program Alan Turing showed the world. It's the first program featured in his 1936 paper: "On Computable Numbers, with an application to the Entsheidungsproblem." c c | | R -> e e | | P(1) R -> k k | | R -> b Turing liked to leave spaces between symbols, going as far as to even define them as "F-squares" and "E-squares". F for figure, and E for erasable. His algorithms would often make use of E-squares to help the machine remember the location of specific !tape squares. # What does it mean to compute? Something is said to be "computable" if there exists an algorithm that can get from the given input to the expected output. For example, adding together 2 integers is computable. Here I'm giving the machine a !tape that starts out with the values 2 and 6 in binary separated by a !blank. b1 b1 | 1 | R -> b1 b1 | | R -> b2 b2 | 0 | R -> b2 b2 | 1 | R -> b2 b2 | | L -> dec dec | 0 | P(1) L -> dec dec | 1 | P(0) L -> b3 dec | | H b3 | 0 | L -> b3 b3 | 1 | L -> b3 b3 | | L -> inc inc | 0 | P(1) R -> b1 inc | 1 | P(0) L -> inc inc | | P(1) R -> b1 This program adds the two numbers together, arriving at the answer 8. It does this by decrementing from the right number and incrementing the left number until the right number is 0. Here's the purpose of each state: b1 Move right until the first !blank square is found. This is to navigate past the first number. b2 Move right until the second !blank square is found. This is to navigate past the second number. dec Decrement the current number by 1. In practice this will always be the right number. It decrements by flipping bits until it either reaches a 1, where it will navigate back to the left number, or a !blank, which it will interpret as the number having hit 0. b3 Move left until the first !blank square again. This is only ever used after we've decremented the rightmost number. We can't reuse the b1 state here because when we reach the !blank, we want to jump to inc. inc Increment the current number by 1. Similar to decrementing, this will only ever happen on the leftmost number in practice. This is done by flipping bits until we reach a 0, at which point we navigate back to the right number. That we can write this program at all is proof that addition is computable, but it also implies that all integers are computable. If we can add any 2 integers, we can compute any other integer. 1 is 0+1, 2 is 1+1, 3 is 2+1, and so on. # Binary vs Decimal You may have wondered why I'm choosing to work with binary numbers rather than decimal. It's not just because that's how modern computers work. I'm going to show you 2 examples, and from those examples you'll be able to see why modern computers choose to work in binary. The first example is a program that increments a binary number in an endless loop. move move | 0 | R -> move move | | L -> flip flip | 0 | P(1) R -> move flip | 1 | P(0) L -> flip flip | | P(1) R -> move The second example is a program that increments a decimal number in an endless loop. back inc | 1 | P(2) -> back inc | 2 | P(3) -> back inc | 3 | P(4) -> back inc | 4 | P(5) -> back inc | 5 | P(6) -> back inc | 6 | P(7) -> back inc | 7 | P(8) -> back inc | 8 | P(9) -> back inc | 9 | P(0) L -> inc inc | | P(1) -> back back | | L -> inc back | * | R -> back These two programs are doing the same thing, but the program for manipulating decimal numbers is much longer. We've even introduced some new syntax, the * symbol, to handle a !value under the !head that does not match any of the other values for that !state. It's for this reason when programming Turing machines we prefer binary numbers: the programs end up being shorter and easier to reason about. This benefit also translates to the physical world. Components that switch between 2 states are cheaper, smaller, and more reliable than components that switch between 10. It was more practical to build computers that worked in binary than ones that work in decimal, though attempts to build decimal computers were made. # What can't be computed? To approach this question we need to explain the "Halting problem." It goes like this: The answer is no, and this is what Turing essentially proved. The proof is complicated and I'm not ashamed to admit I don't understand it, but there is an example I can give you that can be intuitively understood to be "undecidable." Imagine you write a program that takes as its input the program being used to decide whether it will halt or not. What it then does is run the decider program on itself, and then do the opposite of what the decider program says. function undecidable(willHalt) { if (willHalt(undecidable)) { while (true); } else { return true; } } This program intentionally enters an infinite loop if it is told it will halt, and halts if it is told it will run forever. It seems like a silly example, the kind of answer a cheeky high school student might try to get away with, but it is a legitimate counterexample to the idea that the halting problem can be solved. If you were to imagine encoding the program and input into something that could be represented on the !tape, there would be no !program that could determine whether the program would halt or not. Imagining this encoding becomes quite natural when you realise that modern programs are encoded as binary data to be saved to disk. # What does it mean to be Turing complete? If you've been involved in the world of programming for more than a few years, there's a good chance you've come across the term "Turing complete." Most likely in the context of things that really ought not to be Turing complete, like C++ templates, TypeScript's type system or Microsoft Excel. But what does it mean? Like the Halting problem, the proof is complicated but there's a straightforward test you can apply to something to judge it Turing complete: I've written this post, with the Turing machine simulations, in JavaScript. Therefore JavaScript is Turing complete. The C++ template example given above simulates a Turing machine in C++'s template system. The TypeScript example takes the route of writing an interpreter for a different Turing complete language. You're right, and everyone tends to cheat a bit with the definition. When someone says something is Turing complete, what they mean is it would be Turing complete if it had an infinite amount of memory. The infinite tape limitation means no Turing machine could ever exist in our physical reality, so that requirement tends to get waived. # How does this all relate to modern computers? If you read around the topic of Turing machines outside of this post, you might see it said that modern computers are effectively Turing machines. You would be forgiven for finding it difficult to imagine how you go from adding 2 integers in binary on a !tape to running a web browser, but the line is there. A key difference between our Turing machine and the device you're reading this on is that your device's CPU has "registers." These are small pieces of memory that live directly on the CPU and are used to store values temporarily while they're being operated on. Values are being constantly loaded from memory into registers and saved back again. You can think of registers as variables for your CPU, but they can only store fixed-size numbers. We can create registers in our Turing machine. We can do this by creating a "format" for our tape. Here we define 3 registers: A, B, and C. Each register contains a 3 bits and can store numbers between 0 and 7. Then at the far left we have an H, which stands for "home", which will help us navigate. To increment register C, we can write a program like this: goto_c goto_c | C | R R R -> inc inc | 0 | P(1) -> goto_h goto_h | * | L -> goto_h goto_h | H | H We're making a lot more liberal use of the * symbol here to help us navigate to specific parts of the !tape without having to enumerate all possible values that could be under the !head on the way there. This program is effectively equivalent to the following x86 assembly code, if x86 had a register named c: mov c, 0 ; Load 0 into c inc c ; Increment c by 1 If we wanted to add values in A and B, storing the result in C, we need to do more work. Here's the assembly code we're trying to replicate: mov a, 2 ; Load 2 into a mov b, 3 ; Load 3 into b add c, a ; Add a to c add c, b ; Add b to c Before you scroll down I will warn you that the program is long and complex. It is the last program we will see in this post, and I don't expect you to understand it in full to continue to the end. Its main purpose is to show you that we can implement operations seen in modern assembly code on a Turing machine. initb initb | * | R P(0) R P(1) R P(1) R -> start start | * | L -> start start | H | R -> go_a go_a | * | R -> go_a go_a | A | R R R -> dec_a dec_a | 0 | P(1) L -> cry_a dec_a | 1 | P(0) -> go_c1 cry_a | 0 | P(1) L -> cry_a cry_a | 1 | P(0) -> go_c1 cry_a | * | R P(0) R P(0) R P(0) -> goto_b go_c1 | * | R -> go_c1 go_c1 | C | R R R -> inc_c1 inc_c1 | 0 | P(1) -> h2a inc_c1 | 1 | P(0) L -> cry_ca cry_ca | 0 | P(1) -> h2a cry_ca | 1 | P(0) L -> cry_ca cry_ca | * | R -> h2a h2a | * | L -> h2a h2a | H | R -> go_a goto_b | * | R -> goto_b goto_b | B | R R R -> dec_b dec_b | 0 | P(1) L -> dec_bc dec_b | 1 | P(0) -> go_c2 dec_bc | 0 | P(1) L -> dec_bc dec_bc | 1 | P(0) -> go_c2 dec_bc | * | R P(0) R P(0) R P(0) -> end go_c2 | * | R -> go_c2 go_c2 | C | R R R -> inc_c2 inc_c2 | 0 | P(1) -> go_hb inc_c2 | 1 | P(0) L -> cry_cb cry_cb | 0 | P(1) -> go_hb cry_cb | 1 | P(0) L -> cry_cb cry_cb | * | R -> go_hb go_hb | * | L -> go_hb go_hb | H | R -> goto_b end | * | L -> end end | H | H This is painfully laborious, and it doesn't even precisely match the assembly code. It destroys the values in A and B as it adds them together, and it doesn't handle overflow. But it's a start, and I hope it gives you a glimpse of how this theoretical machine can be built up to operate like a modern CPU. If you watch the program run to completion, something that might strike you is just how much work is required to do something as simple as adding 2 numbers. Turing machines were not designed to be practical, Turing never intended anyone to go out and build one of these machines in the hope it will be useful. Modern machines have circuits within them that can add 2 numbers together by passing 2 electrical signals in and getting the sum as a single signal out. This happens in less than a nanosecond. Modern machines have memory where any byte can be accessed at any time, no tape manipulation required. This memory access takes a few dozen nanoseconds. # Writing and running your own programs I've built a web-based development environment for writing programs that will run on the Turing machine visualisations you've seen throughout the post. You can access the editor here. I encourage you to play around with it. Set a simple goal, like adding together 2 numbers without going back to look at the way I did it in the post. It's a great way to get a feel for how the machine works. # Conclusion To recap, we've covered: What a Turing machine is. What can and cannot be computed. What it means to be Turing complete. How modern computers relate to Turing machines. And you now have access to an environment for writing and running your own Turing machine programs. If you use it to make something neat, please do reach out to me and show me! My email address is hello@samwho.dev. # Further reading Turing's original paper on computable numbers. The Annotated Turing I referenced this throughout the making of this post. It is a fabulous read, strongly recommend. Alan Turing: The Enigma by Andrew Hodges. An excellent biography of Turing, I read this during the writing of this post. Calculating a Mandelbrot Set using a Turing Machine. This was exceptionally useful for me to understand how to get from Turing machines to modern computers. # Acknowledgements These posts are never a solo effort, and this one is no exception. Sincere thanks go to the following people: To my wife, Sophie, who drew the biographical sketches you're seeing at the end here, and for putting up with my incessant talking about this post the last 2 weeks. Everyone who let me watch them read this post in real-time over a video call and gave me feedback: Jaga Santagostino, Robert Aboukhalil, Tarun Verghis, Tyler Sparks. Everyone who came to hang out and help out in the Twitch streams when I was building out the early versions of the Turing machine visualisations. Everyone who supports my work on Patreon. Everyone who works on the tools used to build this post: TypeScript, Bun, Two.js, Tween.js, Monaco, Peggy, Zed, and many other indirect dependencies. We really do stand upon the shoulders of giants. Hut 8, Bletchley Park, where Turing worked during World War II. It was in this hut that Alan worked with his team to break the German Naval Enigma code.

6 months ago • 103 votes

A Commitment to Art and Dogs

.dog-line { display: flex; flex-wrap: nowrap; flex-direction: row; width: 100%; height: 10rem; margin-top: 2rem; margin-bottom: 2rem; } .dog-line img { flex-grow: 1; height: auto; margin: 0; padding: 0; object-fit: contain; } .dog-grid { display: grid; grid-template-columns: repeat(4, 1fr); grid-gap: 1rem; margin-top: 2rem; margin-bottom: 2rem; } Back in Memory Allocation, I introduced Haskie. The idea behind Haskie was to create a character that could ask questions the reader might have, and to "soften" the posts to make them feel less intimidating. I got some feedback from people that Haskie was a bit too childish, and didn't feel like he belonged in posts about serious topics. This feedback was in the minority, though, and most people liked him. So I kept him and used him again in Hashing. Having a proxy to the reader was useful. I could anticipate areas of confusion and clear them up without creating enormous walls of text. I don't like it when the entire screen is filled with text, I like to break it up with images and interactive elements. And now dogs. Then in Bloom Filters, I found myself needing a character to represent the "adult in the room." If Haskie was my proxy to the reader, this new character would serve as a proxy to all of the material I learned from in the writing of the post. This is Sage. I liked the idea of having a cast of characters, each with their own personality and purpose. But I had a few problems. # Problems Both Haskie and Sage, because I have no artistic ability, were generated by AI. Back when I made them I was making no money from this blog, and I had no idea if I was going to keep them around. I didn't want to invest money in an idea that could flop, so I didn't feel bad about using AI to try it out. Since then, however, I have been paid twice to write posts for companies, and I know that I'm keeping the dogs. It wasn't ethical to continue piggybacking on AI. While ethics were the primary motivation, there were some other smaller problems with the dogs: The visual style of them, while I did like it, never felt like it fit with the rest of my personal brand. It was difficult to get AI to generate consistent dogs. You'll notice differences in coat colouration and features between variants of the same dog. The AI generated images look bad at small sizes. So I worked with the wonderful Andy Carolan to create a new design for my dogs. A design that would be consistent, fit with my brand, and look good at any size. # Haskie, Sage, and Doe The redesigned dogs are consistent, use simple colours and shapes, and use the SVGs file format to look good at any size. Each variant clocks in at around 20kb, which is slightly larger than the small AI-generated images, but I'll be able to use them at any size. Together the dogs represent a family unit: Sage as the dad, Haskie as the youngest child, and Doe as his older sister. They also come in a variety of poses, so I can use them to represent different emotions or actions. We were careful to make the dogs recognisable apart. They differ in colour, ear shape, tail shape, and collar tag. Sage and Doe have further distinguishing features: Sage with his glasses, and Doe with her bandana. Doe's bandana uses the same colours as the transgender flag, to show my support for the trans community and as a nod to her identity. # Going forward I'm so happy with the new dogs, and plan to use them in my posts going forward. I suspect I will, at some point, replace the dogs in my old posts as well. I don't plan to add any more characters, and I want to be careful to avoid overusing them. I don't want them to become a crutch, or to distract from the content of the posts. I also haven't forgotten the many people that pointed out to me that you can't pet the dogs. I'm working on it.

a year ago • 108 votes

Hashing

form { padding-top: 0.5em; padding-left: 0.5em; padding-right: 0.5em; display: flex; justify-content: center; gap: 0.3em; } form input[type=text] { flex: 4 1 auto; min-width: 0; border-radius: 0.3em; border: 1px solid #aaaaaa; padding: 0.3em; } form button { flex: 1 1 auto; max-width: 140px; } form button:disabled { opacity: 0.5 !important; } form button.add { background-color: #009E73; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } form button.check { background-color: #56B4E9; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } form button.clear { background-color: #D55E00; color: white; border: 0; border-radius: 0.3em; cursor: pointer; } .grid-2x2 { display: "grid"; } .grid { user-select: none; cursor: pointer; margin-top: 1rem; margin-bottom: 1rem; border: 1px solid #009E73; width: 100%; display: grid; grid-template-columns: repeat(8, 1fr); grid-template-rows: repeat(2, 1fr); } .grid-item { display: flex; align-items: center; justify-content: center; aspect-ratio: 1/1; } .grid-active { background-color: #009E73; color: white; } .above-grid { display: flex; justify-content: center; } .hash-examples { padding-top: 0.5rem; padding-bottom: 0.5rem; margin: auto; display: flex; flex-direction: column; align-items: center; } .hash-examples div { margin: auto; } .hash-examples code { display: block; white-space: pre; font-weight: bold; } .hash-examples p { font-size: 0.75rem; font-style: italic; text-align: center; font-family: Lora, serif; width: 75%; } .blob { cursor: pointer; background: #CC79A7; display: flex; justify-content: center; align-items: center; font-size: 1.5rem; color: white; border-radius: 50%; margin: 10px; height: 3rem; width: 3rem; min-width: 3rem; max-width: 3rem; box-shadow: 0 0 0 0 #CC79A7FF; transform: scale(1); animation: pulse 2s infinite; } @keyframes pulse { 0% { transform: scale(0.85); box-shadow: 0 0 0 0 #CC79A77F; } 70% { transform: scale(1); box-shadow: 0 0 0 1rem rgba(0, 0, 0, 0); } 100% { transform: scale(0.85); box-shadow: 0 0 0 0 rgba(0, 0, 0, 0); } } .blob-click { cursor: default; animation: tick 1s linear; background: #009E73FF; } @keyframes tick { 0% { transform: scale(1); box-shadow: 0 0 0 0 #009E73FF; } 50% { box-shadow: 0 0 0 1rem #009E737F; } 100% { box-shadow: 0 0 0 2rem #009E7300; } } .aside { padding: 2rem; width: 100vw; position: relative; margin-left: -50vw; left: 50%; background-color: #eeeeee; display: flex; align-items: center; flex-direction: column; } .aside > * { flex-grow: 1; } .aside p { padding-left: 1rem; padding-right: 1rem; max-width: 780px; font-style: italic; font-family: Lora, serif; text-align: center; } .pct25 { width: 100%; height: 200px; } .datasets th { text-align: left; } .datasets { table-layout: fixed; } As a programmer, you use hash functions every day. They're used in databases to optimise queries, they're used in data structures to make things faster, they're used in security to keep data safe. Almost every interaction you have with technology will involve hash functions in one way or another. Hash functions are foundational, and they are everywhere. But what is a hash function, and how do they work? In this post, we're going to demystify hash functions. We're going to start by looking at a simple hash function, then we're going to learn how to test if a hash function is good or not, and then we're going to look at a real-world use of hash functions: the hash map. clicked. # What is a hash function? Hash functions are functions that take an input, usually a string, and produce a number. If you were to call a hash function multiple times with the same input, it will always return the same number, and that number returned will always be within a promised range. What that range is will depend on the hash function, some use 32-bit integers (so 0 to 4 billion), others go much larger. If we were to write a dummy hash function in JavaScript, it might look like this: function hash(input) { return 0; } Even without knowing how hash functions are used, it's probably no surprise that this hash function is useless. Let's see how we can measure how good a hash function is, and after that we'll do a deep dive on how they're used within hash maps. # What makes a hash function good? Because input can be any string, but the number returned is within some promised range, it's possible that two different inputs can return the same number. This is called a "collision," and good hash functions try to minimise how many collisions they produce. It's not possible to completely eliminate collisions, though. If we wrote a hash function that returned a number in the range 0 to 7, and we gave it 9 unique inputs, we're guaranteed at least 1 collision. hash("to") == 3 hash("the") == 2 hash("café") == 0 hash("de") == 6 hash("versailles") == 4 hash("for") == 5 hash("coffee") == 0 hash("we") == 7 hash("had") == 1 To visualise collisions, I'm going to use a grid. Each square of the grid is going to represent a number output by a hash function. Here's an example 8x2 grid. Click on the grid to increment the example hash output value and see how we map it to a grid square. See what happens when you get a number larger than the number of grid squares. { let grid = document.getElementById("first-grid"); let hash = document.getElementById("grid-hash"); let modulo = document.getElementById("grid-modulo"); grid.addEventListener("click", (e) => { e.preventDefault(); let number = parseInt(hash.innerText) + 1; hash.innerText = number.toString(); modulo.innerText = (number % 16).toString(); grid.querySelector(".grid-active").classList.remove("grid-active"); grid.children[number % 16].classList.add("grid-active"); return false; }); }); 13 % 16 == 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Every time we hash a value, we're going to make its corresponding square on the grid a bit darker. The idea is to create an easy way to see how well a hash function avoids collisions. What we're looking for is a nice, even distribution. We'll know that the hash function isn't good if we have clumps or patterns of dark squares. This is a great observation. You're absolutely right, we're going to be creating "pseudo-collisions" on our grid. It's okay, though, because if the hash function is good we will still see an even distribution. Incrementing every square by 100 is just as good a distribution as incrementing every square by 1. If we have a bad hash function that collides a lot, that will still stand out. We'll see this shortly. Let's take a larger grid and hash 1,000 randomly-generated strings. You can click on the grid to hash a new set of random inputs, and the grid will animate to show you each input being hashed and placed on the grid. The values are nice and evenly distributed because we're using a good, well-known hash function called murmur3. This hash is widely used in the real-world because it has great distribution while also being really, really fast. What would our grid look like if we used a bad hash function? function hash(input) { let hash = 0; for (let c of input) { hash += c.charCodeAt(0); } return hash % 1000000; } This hash function loops through the string that we're given and sums the numeric values of each character. It then makes sure that the value is between 0 and 1000000 by using the modulus operator (%). Let's call this hash function stringSum. Here it is on the grid. Reminder, this is 1,000 randomly generated strings that we're hashing. This doesn't look all that different from murmur3. What gives? The problem is that the strings we're giving to be hashed are random. Let's see how each function performs when given input that is not random: the numbers from 1 to 1000 converted to strings. Now the problem is more clear. When the input isn't random, the output of stringSum forms a pattern. Our murmur3 grid, however, looks the same as how it looked with random values. How about if we hash the top 1,000 most common English words: It's more subtle, but we do see a pattern on the stringSum grid. As usual, murmur3 looks the same as it always does. This is the power of a good hash function: no matter the input, the output is evenly distributed. Let's talk about one more way to visualise this and then talk about why it matters. # The avalanche effect Another way hash functions get evaluated is on something called the "avalanche effect." This refers to how many bits in the output value change when just a single bit of the input changes. To say that a hash function has a good avalanche effect, a single bit flip in the input should result in an average of 50% the output bits flipping. It's this property that helps hash functions avoid forming patterns in the grid. If small changes in the input result in small changes in the output, you get patterns. Patterns indicate poor distribution, and a higher rate of collisions. Below, we are visualising the avalanche effect by showing two 8-bit binary numbers. The top number is the input value, and the bottom number is the murmur3 output value. Click on it to flip a single bit in the input. Bits that change in the output will be green, bits that stay the same will be red. murmur3 does well, though you will notice that sometimes fewer than 50% of the bits flip and sometimes more. This is okay, provided that it is 50% on average. Let's see how stringSum performs. Well this is embarassing. The output is equal to the input, and so only a single bit flips each time. This does make sense, because stringSum just sums the numeric value of each character in the string. This example only hashes the equivalent of a single character, which means the output will always be the same as the input. # Why all of this matters We've taken the time to understand some of the ways to determine if a hash function is good, but we've not spent any time talking about why it matters. Let's fix that by talking about hash maps. To understand hash maps, we first must understand what a map is. A map is a data structure that allows you to store key-value pairs. Here's an example in JavaScript: let map = new Map(); map.set("hello", "world"); console.log(map.get("hello")); Here we take a key-value pair ("hello" → "world") and store it in the map. Then we print out the value associated with the key "hello", which will be "world". A more fun real-world use-case would be to find anagrams. An anagram is when two different words contain the same letters, for example "antlers" and "rentals" or "article" and "recital." If you have a list of words and you want to find all of the anagrams, you can sort the letters in each word alphabetically and use that as a key in a map. let words = [ "antlers", "rentals", "sternal", "article", "recital", "flamboyant", ] let map = new Map(); for (let word of words) { let key = word .split('') .sort() .join(''); if (!map.has(key)) { map.set(key, []); } map.get(key).push(word); } This code results in a map with the following structure: { "aelnrst": [ "antlers", "rentals", "sternal" ], "aceilrt": [ "article", "recital" ], "aabflmnoty": [ "flamboyant" ] } # Implementing our own simple hash map Hash maps are one of many map implementations, and there are many ways to implement hash maps. The simplest way, and the way we're going to demonstrate, is to use a list of lists. The inner lists are often referred to as "buckets" in the real-world, so that's what we'll call them here. A hash function is used on the key to determine which bucket to store the key-value pair in, then the key-value pair is added to that bucket. Let's walk through a simple hash map implementation in JavaScript. We're going to go through it bottom-up, so we'll see some utility methods before getting to the set and get implementations. class HashMap { constructor() { this.bs = [[], [], []]; } } We start off by creating a HashMap class with a constructor that sets up 3 buckets. We use 3 buckets and the short variable name bs so that this code displays nicely on devices with smaller screens. In reality, you could have however many buckets you want (and better variable names). class HashMap { // ... bucket(key) { let h = murmur3(key); return this.bs[ h % this.bs.length ]; } } The bucket method uses murmur3 on the key passed in to find a bucket to use. This is the only place in our hash map code that a hash function is used. class HashMap { // ... entry(bucket, key) { for (let e of bucket) { if (e.key === key) { return e; } } return null; } } The entry method takes a bucket and a key and scans the bucket until it finds an entry with the given key. If no entry is found, null is returned. class HashMap { // ... set(key, value) { let b = this.bucket(key); let e = this.entry(b, key); if (e) { e.value = value; return; } b.push({ key, value }); } } The set method is the first one we should recognise from our earlier JavaScript Map examples. It takes a key-value pair and stores it in our hash map. It does this by using the bucket and entry methods we created earlier. If an entry is found, its value is overwritten. If no entry is found, the key-value pair is added to the map. In JavaScript, { key, value } is shorthand for { key: key, value: value }. class HashMap { // ... get(key) { let b = this.bucket(key); let e = this.entry(b, key); if (e) { return e.value; } return null; } } The get method is very similar to set. It uses bucket and entry to find the entry related to the key passed in, just like set does. If an entry is found, its value is returned. If one isn't found, null is returned. That was quite a lot of code. What you should take away from it is that our hash map is a list of lists, and a hash function is used to know which of the lists to store and retrieve a given key from. Here's a visual representation of this hash map in action. Click anywhere on the buckets to add a new key-value pair using our set method. To keep the visualisation simple, if a bucket were to "overflow", the buckets are all reset. Because we're using murmur3 as our hash function, you should see good distribution between the buckets. It's expected you'll see some imbalance, but it should generally be quite even. To get a value out of the hash map, we first hash the key to figure out which bucket the value will be in. Then we have to compare the key we're searching for against all of the keys in the bucket. It's this search step that we minimise through hashing, and why murmur3 is optimised for speed. The faster the hash function, the faster we find the right bucket to search, the faster our hash map is overall. This is also why reducing collisions is so crucial. If we did decide to use that dummy hash function from all the way at the start of this article, the one that returns 0 all the time, we'll put all of our key-value pairs into the first bucket. Finding anything could mean we have to check all of the values in the hash map. With a good hash function, with good distribution, we reduce the amount of searching we have to do to 1/N, where N is the number of buckets. Let's see how stringSum does. Interestingly, stringSum seems to distribute values quite well. You notice a pattern, but the overall distribution looks good. stringSum. I knew it would be good for something. Not so fast, Haskie. We need to talk about a serious problem. The distribution looks okay on these sequential numbers, but we've seen that stringSum doesn't have a good avalanche effect. This doesn't end well. # Real-world collisions Let's look at 2 real-world data sets: IP addresses and English words. What I'm going to do is take 100,000,000 random IP addresses and 466,550 English words, hash all of them with both murmur3 and stringSum, and see how many collisions we get. IP Addresses murmur3 stringSum Collisions 1,156,959 99,999,566 1.157% 99.999% English words murmur3 stringSum Collisions 25 464,220 0.005% 99.5% When we use hash maps for real, we aren't usually storing random values in them. We can imagine counting the number of times we've seen an IP address in rate limiting code for a server. Or code that counts the occurrences of words in books throughout history to track their origin and popularity. stringSum sucks for these applications because of it's extremely high collision rate. # Manufactured collisions Now it's murmur3's turn for some bad news. It's not just collisions caused by similarity in the input we have to worry about. Check this out. What's happening here? Why do all of these jibberish strings hash to the same number? I hashed 141 trillion random strings to find values that hash to the number 1228476406 when using murmur3. Hash functions have to always return the same output for a specific input, so it's possible to find collisions by brute force. trillion? Like... 141 and then 12 zeroes? Yes, and it only took me 25 minutes. Computers are fast. Bad actors having easy access to collisions can be devastating if your software builds hash maps out of user input. Take HTTP headers, for example. An HTTP request looks like this: GET / HTTP/1.1 Accept: */* Accept-Encoding: gzip, deflate Connection: keep-alive Host: google.com You don't have to understand all of the words, just that the first line is the path being requested and all of the other lines are headers. Headers are Key: Value pairs, so HTTP servers tend to use maps to store them. Nothing stops us from passing any headers we want, so we can be really mean and pass headers we know will cause collisions. This can significantly slow down the server. This isn't theoretical, either. If you search "HashDoS" you'll find a lot more examples of this. It was a really big deal in the mid-2000s. There are a few ways to mitigate this specific to HTTP servers: ignoring jibberish header keys and limiting the number of headers you store, for example. But modern hash functions like murmur3 offer a more generalised solution: randomisation. Earlier in this post we showed some examples of hash function implementations. Those implementations took a single argument: input. Lots of modern hash functions take a 2nd parameter: seed (sometimes called salt). In the case of murmur3, this seed is a number. So far, we've been using 0 as the seed. Let's see what happens with the collisions I've collected when we use a seed of 1. Just like that, 0 to 1, the collisions are gone. This is the purpose of the seed: it randomises the output of the hash function in an unpredictable way. How it achieves this is beyond the scope of this article, all hash functions do this in their own way. The hash function still returns the same output for the same input, it's just that the input is a combination of input and seed. Things that collide with one seed shouldn't collide when using another. Programming languages often generate a random number to use as the seed when the process starts, so that every time you run your program the seed is different. As a bad guy, not knowing the seed, it is now impossible for me to reliably cause harm. If you look closely in the above visualisation and the one before it, they're the same values being hashed but they produce different hash values. The implication of this is that if you hash a value with one seed, and want to be able to compare against it in the future, you need to make sure you use the same seed. Having different values for different seeds doesn't affect the hash map use-case, because hash maps only live for the duration the program is running. Provided you use the same seed for the lifetime of the program, your hash maps will continue to work just fine. If you ever store hash values outside of your program, in a file for example, you need to be careful you know what seed has been used. # Playground As is tradition, I've made a playground for you to write your own hash functions and see them visualised with the grids seen in this article. Click here to try it! # Conclusion We've covered what a hash function is, some ways to measure how good it is, what happens when it's not good, and some of the ways they can be broken by bad actors. The universe of hash functions is a large one, and we've really only scratched the surface in this post. We haven't spoken about cryptographic vs non-cryptographic hashing, we've touched on only 1 of the thousands of use-cases for hash functions, and we haven't talked about how exactly modern hash functions actually work. Some further reading I recommend if you're really enthusiastic about this topic and want to learn more: https://github.com/rurban/smhasher this repository is the gold standard for testing how good hash functions are. They run a tonne of tests against a wide number of hash functions and present the results in a big table. It will be difficult to understand what all of the tests are for, but this is where the state of the art of hash testing lives. https://djhworld.github.io/hyperloglog/ this is an interactive piece on a data structure called HyperLogLog. It's used to efficiently count the number of unique elements in very, very large sets. It uses hashing to do it in a really clever way. https://www.gnu.org/software/gperf/ is a piece of software that, when given the expected set of things you want to hash, can generate a "perfect" hash function automatically. Feel free to join the discussion on Hacker News! # Acknowledgements Thanks to everyone who read early drafts and provided invaluable feedback. delroth, Manon, Aaron, Charlie And everyone who helped me find murmur3 hash collisions: Indy, Aaron, Max # Patreon After the success of Load Balancing and Memory Allocation, I have decided to set up a Patreon page: https://patreon.com/samwho. For all of these articles going forward, I am going to post a Patreon-exclusive behind-the-scenes post talking about decisions, difficulties, and lessons learned from each post. It will give you a deep look in to how these articles evolve, and I'm really stoked about the one I've written for this one. If you enjoy my writing, and want to support it going forward, I'd really appreciate you becoming a Patreon. ❤️

over a year ago • 40 votes

More in programming

That boolean should probably be something else

One of the first types we learn about is the boolean. It's pretty natural to use, because boolean logic underpins much of modern computing. And yet, it's one of the types we should probably be using a lot less of. In almost every single instance when you use a boolean, it should be something else. The trick is figuring out what "something else" is. Doing this is worth the effort. It tells you a lot about your system, and it will improve your design (even if you end up using a boolean). There are a few possible types that come up often, hiding as booleans. Let's take a look at each of these, as well as the case where using a boolean does make sense. This isn't exhaustive—[1]there are surely other types that can make sense, too. Datetimes A lot of boolean data is representing a temporal event having happened. For example, websites often have you confirm your email. This may be stored as a boolean column, is_confirmed, in the database. It makes a lot of sense. But, you're throwing away data: when the confirmation happened. You can instead store when the user confirmed their email in a nullable column. You can still get the same information by checking whether the column is null. But you also get richer data for other purposes. Maybe you find out down the road that there was a bug in your confirmation process. You can use these timestamps to check which users would be affected by that, based on when their confirmation was stored. This is the one I've seen discussed the most of all these. We run into it with almost every database we design, after all. You can detect it by asking if an action has to occur for the boolean to change values, and if values can only change one time. If you have both of these, then it really looks like it is a datetime being transformed into a boolean. Store the datetime! Enums Much of the remaining boolean data indicates either what type something is, or its status. Is a user an admin or not? Check the is_admin column! Did that job fail? Check the failed column! Is the user allowed to take this action? Return a boolean for that, yes or no! These usually make more sense as an enum. Consider the admin case: this is really a user role, and you should have an enum for it. If it's a boolean, you're going to eventually need more columns, and you'll keep adding on other statuses. Oh, we had users and admins, but now we also need guest users and we need super-admins. With an enum, you can add those easily. enum UserRole { User, Admin, Guest, SuperAdmin, } And then you can usually use your tooling to make sure that all the new cases are covered in your code. With a boolean, you have to add more booleans, and then you have to make sure you find all the places where the old booleans were used and make sure they handle these new cases, too. Enums help you avoid these bugs. Job status is one that's pretty clearly an enum as well. If you use booleans, you'll have is_failed, is_started, is_queued, and on and on. Or you could just have one single field, status, which is an enum with the various statuses. (Note, though, that you probably do want timestamp fields for each of these events—but you're still best having the status stored explicitly as well.) This begins to resemble a state machine once you store the status, and it means that you can make much cleaner code and analyze things along state transition lines. And it's not just for storing in a database, either. If you're checking a user's permissions, you often return a boolean for that. fn check_permissions(user: User) -> bool { false // no one is allowed to do anything i guess } In this case, true means the user can do it and false means they can't. Usually. I think. But you can really start to have doubts here, and with any boolean, because the application logic meaning of the value cannot be inferred from the type. Instead, this can be represented as an enum, even when there are just two choices. enum PermissionCheck { Allowed, NotPermitted(reason: String), } As a bonus, though, if you use an enum? You can end up with richer information, like returning a reason for a permission check failing. And you are safe for future expansions of the enum, just like with roles. You can detect when something should be an enum a proliferation of booleans which are mutually exclusive or depend on one another. You'll see multiple columns which are all changed at the same time. Or you'll see a boolean which is returned and used for a long time. It's important to use enums here to keep your program maintainable and understandable. Conditionals But when should we use a boolean? I've mainly run into one case where it makes sense: when you're (temporarily) storing the result of a conditional expression for evaluation. This is in some ways an optimization, either for the computer (reuse a variable[2]) or for the programmer (make it more comprehensible by giving a name to a big conditional) by storing an intermediate value. Here's a contrived example where using a boolean as an intermediate value. fn calculate_user_data(user: User, records: RecordStore) { // this would be some nice long conditional, // but I don't have one. So variables it is! let user_can_do_this: bool = (a && b) && (c || !d); if user_can_do_this && records.ready() { // do the thing } else if user_can_do_this && records.in_progress() { // do another thing } else { // and something else! } } But even here in this contrived example, some enums would make more sense. I'd keep the boolean, probably, simply to give a name to what we're calculating. But the rest of it should be a match on an enum! * * * Sure, not every boolean should go away. There's probably no single rule in software design that is always true. But, we should be paying a lot more attention to booleans. They're sneaky. They feel like they make sense for our data, but they make sense for our logic. The data is usually something different underneath. By storing a boolean as our data, we're coupling that data tightly to our application logic. Instead, we should remain critical and ask what data the boolean depends on, and should we maybe store that instead? It comes easier with practice. Really, all good design does. A little thinking up front saves you a lot of time in the long run. I know that using an em-dash is treated as a sign of using LLMs. LLMs are never used for my writing. I just really like em-dashes and have a dedicated key for them on one of my keyboard layers. ↩ This one is probably best left to the compiler. ↩

22 hours ago • 3 votes

AmigaGuide Reference Library

As I slowly but surely work towards the next release of my setcmd project for the Amiga (see the 68k branch for the gory details and my total noob-like C flailing around), I’ve made heavy use of documentation in the AmigaGuide format. Despite it’s age, it’s a great Amiga-native format and there’s a wealth of great information out there for things like the C API, as well as language guides and tutorials for tools like the Installer utility - and the AmigaGuide markup syntax itself. The only snag is, I had to have access to an Amiga (real or emulated), or install one of the various viewer programs on my laptops. Because like many, I spend a lot of time in a web browser and occasionally want to check something on my mobile phone, this is less than convenient. Fortunately, there’s a great AmigaGuideJS online viewer which renders AmigaGuide format documents using Javascript. I’ve started building up a collection of useful developer guides and other files in my own reference library so that I can access this documentation whenever I’m not at my Amiga or am coding in my “modern” dev environment. It’s really just for my own personal use, but I’ll be adding to it whenever I come across a useful piece of documentation so I hope it’s of some use to others as well! And on a related note, I now have a “unified” code-base so that SetCmd now builds and runs on 68k-based OS 3.x systems as well as OS 4.x PPC systems like my X5000. I need to: Tidy up my code and fix all the “TODO” stuff Update the Installer to run on OS 3.x systems Update the documentation Build a new package and upload to Aminet/OS4Depot Hopefully I’ll get that done in the next month or so. With the pressures of work and family life (and my other hobbies), progress has been a lot slower these last few years but I’m still really enjoying working on Amiga code and it’s great to have a fun personal project that’s there for me whenever I want to hack away at something for the sheer hell of it. I’ve learned a lot along the way and the AmigaOS is still an absolute joy to develop for. I even brought my X5000 to the most recent Kickstart Amiga User Group BBQ/meetup and had a fun day working on the code with fellow Amigans and enjoying some classic gaming & demos - there was also a MorphOS machine there, which I think will be my next target as the codebase is slowly becoming more portable. Just got to find some room in the “retro cave” now… This stuff is addictive :)

14 hours ago • 2 votes

An Analysis of Links From The White House’s “Wire” Website

A little while back I heard about the White House launching their version of a Drudge Report style website called White House Wire. According to Axios, a White House official said the site’s purpose was to serve as “a place for supporters of the president’s agenda to get the real news all in one place”. So a link blog, if you will. As a self-professed connoisseur of websites and link blogs, this got me thinking: “I wonder what kind of links they’re considering as ‘real news’ and what they’re linking to?” So I decided to do quick analysis using Quadratic, a programmable spreadsheet where you can write code and return values to a 2d interface of rows and columns. I wrote some JavaScript to: Fetch the HTML page at whitehouse.gov/wire Parse it with cheerio Select all the external links on the page Return a list of links and their headline text In a few minutes I had a quick analysis of what kind of links were on the page: This immediately sparked my curiosity to know more about the meta information around the links, like: If you grouped all the links together, which sites get linked to the most? What kind of interesting data could you pull from the headlines they’re writing, like the most frequently used words? What if you did this analysis, but with snapshots of the website over time (rather than just the current moment)? So I got to building. Quadratic today doesn’t yet have the ability for your spreadsheet to run in the background on a schedule and append data. So I had to look elsewhere for a little extra functionality. My mind went to val.town which lets you write little scripts that can 1) run on a schedule (cron), 2) store information (blobs), and 3) retrieve stored information via their API. After a quick read of their docs, I figured out how to write a little script that’ll run once a day, scrape the site, and save the resulting HTML page in their key/value storage. From there, I was back to Quadratic writing code to talk to val.town’s API and retrieve my HTML, parse it, and turn it into good, structured data. There were some things I had to do, like: Fine-tune how I select all the editorial links on the page from the source HTML (I didn’t want, for example, to include external links to the White House’s social pages which appear on every page). This required a little finessing, but I eventually got a collection of links that corresponded to what I was seeing on the page. Parse the links and pull out the top-level domains so I could group links by domain occurrence. Create charts and graphs to visualize the structured data I had created. Selfish plug: Quadratic made this all super easy, as I could program in JavaScript and use third-party tools like tldts to do the analysis, all while visualizing my output on a 2d grid in real-time which made for a super fast feedback loop! Once I got all that done, I just had to sit back and wait for the HTML snapshots to begin accumulating! It’s been about a month and a half since I started this and I have about fifty days worth of data. The results? Here’s the top 10 domains that the White House Wire links to (by occurrence), from May 8 to June 24, 2025: youtube.com (133) foxnews.com (72) thepostmillennial.com (67) foxbusiness.com (66) breitbart.com (64) x.com (63) reuters.com (51) truthsocial.com (48) nypost.com (47) dailywire.com (36) From the links, here’s a word cloud of the most commonly recurring words in the link headlines: “trump” (343) “president” (145) “us” (134) “big” (131) “bill” (127) “beautiful” (113) “trumps” (92) “one” (72) “million” (57) “house” (56) The data and these graphs are all in my spreadsheet, so I can open it up whenever I want to see the latest data and re-run my script to pull the latest from val.town. In response to the new data that comes in, the spreadsheet automatically parses it, turn it into links, and updates the graphs. Cool! If you want to check out the spreadsheet — sorry! My API key for val.town is in it (“secrets management” is on the roadmap). But I created a duplicate where I inlined the data from the API (rather than the code which dynamically pulls it) which you can check out here at your convenience. Email · Mastodon · Bluesky

3 hours ago • 2 votes

Implementation of optimized vector of strings in C++ in SumatraPDF

SumatraPDF is a fast, small, open-source PDF reader for Windows, written in C++. This article describes how I implemented StrVec class for efficiently storing multiple strings. Much ado about the strings Strings are among the most used types in most programs. Arrays of strings are also used often. I count ~80 uses of StrVec in SumatraPDF code. This article describes how I implemented an optimized array of strings in SumatraPDF C++ code . No STL for you Why not use std::vector<std::string>? In SumatraPDF I don’t use STL. I don’t use std::string, I don’t use std::vector. For me it’s a symbol of my individuality, and my belief in personal freedom. As described here, minimum size of std::string on 64-bit machines is 32 bytes for msvc / gcc and 24 bytes for short strings (15 chars for msvc / gcc, 22 chars for clang). For longer strings we have more overhead: 32⁄24 bytes for the header memory allocator overhead allocator metadata padding due to rounding allocations to at least 16 bytes There’s also std::vector overhead: for fast appends (push()) std::vectorimplementations over-allocated space Longer strings are allocated at random addresses so they can be spread out in memory. That is bad for cache locality and that often cause more slowness than executing lots of instructions. Design and implementation of StrVec StrVec (vector of strings) solves all of the above: per-string overhead of only 8 bytes strings are laid out next to each other in memory StrVec High level design of StrVec: backing memory is allocated in singly-linked pages similar to std::vector, we start with small page and increase the size of the page. This strikes a balance between speed of accessing a string at random index and wasted space unlike std::vector we don’t reallocate memory (most of the time). That saves memory copy when re-allocating backing space Here’s all there is to StrVec: struct StrVec { StrVecPage* first = nullptr; int nextPageSize = 256; int size = 0; } size is a cached number of strings. It could be calculated by summing the size in all StrVecPages. nextPageSize is the size of the next StrVecPage. Most array implementation increase the size of next allocation by 1.4x - 2x. I went with the following progression: 256 bytes, 1k, 4k, 16k, 32k and I cap it at 64k. I don’t have data behind those numbers, they feel right. Bigger page wastes more space. Smaller page makes random access slower because to find N-th string we need to traverse linked list of StrVecPage. nextPageSize is exposed to allow the caller to optimize use. E.g. if it expects lots of strings, it could set nextPageSize to a large number. StrVecPage Most of the implementation is in StrVecPage. The big idea here is: we allocate a block of memory strings are allocated from the end of memory block at the beginning of the memory block we build and index of strings. For each string we have: u32 size u32 offset of the string within memory block, counting from the beginning of the block The layout of memory block is: StrVecPage struct { size u32; offset u32 } [] … not yet used space strings This is StrVecPage: struct StrVecPage { struct StrVecPage* next; int pageSize; int nStrings; char* currEnd; } next is for linked list of pages. Since pages can have various sizes we need to record pageSize. nStrings is number of strings in the page and currEnd points to the end of free space within page. Implementing operations Appending a string Appending a string at the end is most common operation. To append a string: we calculate how much memory inside a page it’ll need: str::Len(string) + 1 + sizeof(u32) + sizeof(u32). +1 is for 0-termination for compatibility with C APIs that take char*, and 2xu32 for size and offset. If we have enough space in last page, we add size and offset at the end of index and append a string from the end i.e. `currEnd - (str::Len(string) + 1). If there is not enough space in last page, we allocate new page We can calculate how much space we have left with: int indexEntrySize = sizeof(u32) + sizeof(u32); // size + offset char* indexEnd = (char*)pageStart + sizeof(StrVecPage) + nStrings*indexEntrySize int nBytesFree = (int)(currEnd - indexEnd) Removing a string Removing a string is easy because it doesn’t require moving memory inside StrVecPage. We do nStrings-- and move index values of strings after the removed string. I don’t bother freeing the string memory within a page. It’s possible but complicated enough I decided to skip it. You can compact StrVec to remove all overhead. If you do not care about preserving order of strings after removal, I haveRemoveAtFast() which uses a trick: instead of copying memory of all index values after removed string, I copy a single index from the end into a slot of the string being removed. Replacing a string or inserting in the middle Replacing a string or inserting a string in the middle is more complicated because there might not be enough space in the page for the string. When there is enough space, it’s as simple as append. When there is not enough space, I re-use the compacting capability: I compact all existing pages into a single page with extra space for the string and some extra space as an optimization for multiple inserts. Iteration A random access requires traversing a linked list. I think it’s still fast because typically there aren’t many pages and we only need to look at a single nStrings value. After compaction to a single page, random access is as fast as it could ever be. C++ iterator is optimized for sequential access: struct iterator { const StrVec* v; int idx; // perf: cache page, idxInPage from prev iteration int idxInPage; StrVecPage* page; } We cache the current state of iteration as page and idxInPage. To advance to next string we advance idxInPage. If it exceeds nStrings, we advance to page->next. Optimized search Finding a string is as optimized as it could be without a hash table. Typically to compare char* strings you need to call str::Eq(s, s2) for every string you compare it to. That is a function call and it has to touch s2 memory. That is bad for performance because it blows the cache. In StrVec I calculate length of the string to find once and then traverse the size / offset index. Only when size is different I have to compare the strings. Most of the time we just look at offset / size in L1 cache, which is very fast. Compacting If you know that you’ll not be adding more strings to StrVec you can compact all pages into a single page with no overhead of empty space. It also speeds up random access because we don’t have multiple pages to traverse to find the item and a given index. Representing a nullptr char* Even though I have a string class, I mostly use char* in SumatraPDF code. In that world empty string and nullptr are 2 different things. To allow storing nullptr strings in StrVec (and not turning them into empty strings on the way out) I use a trick: a special u32 value kNullOffset represents nullptr. StrVec is a string pool allocator In C++ you have to track the lifetime of each object: you allocate with malloc() or new when you no longer need to object, you call free() or delete However, the lifetime of allocations is often tied together. For example in SumatraPDF an opened document is represented by a class. Many allocations done to construct that object last exactly as long as the object. The idea of a pool allocator is that instead of tracking the lifetime of each allocation, you have a single allocator. You allocate objects with the same lifetime from that allocator and you free them with a single call. StrVec is a string pool allocator: all strings stored in StrVec have the same lifetime. Testing In general I don’t advocate writing a lot of tests. However, low-level, tricky functionality like StrVec deserves decent test coverage to ensure basic functionality works and to exercise code for corner cases. I have 360 lines of tests for ~700 lines of of implementation. Potential tweaks and optimization When designing and implementing data structures, tradeoffs are aplenty. Interleaving index and strings I’m not sure if it would be faster but instead of storing size and offset at the beginning of the page and strings at the end, we could store size / string sequentially from the beginning. It would remove the need for u32 of offset but would make random access slower. Varint encoding of size and offset Most strings are short, under 127 chars. Most offsets are under 16k. If we stored size and offset as variable length integers, we would probably bring down average per-string overhead from 8 bytes to ~4 bytes. Implicit size When strings are stored sequentially size is implicit as difference between offset of the string and offset of next string. Not storing size would make insert and set operations more complicated and costly: we would have to compact and arrange strings in order every time. Storing index separately We could store index of size / offset in a separate vector and use pages to only allocate string data. This would simplify insert and set operations. With current design if we run out of space inside a page, we have to re-arrange memory. When offset is stored outside of the page, it can refer to any page so insert and set could be as simple as append. The evolution of StrVec The design described here is a second implementation of StrVec. The one before was simply a combination of str::Str (my std::string) for allocating all strings and Vec<u32> (my std::vector) for storing offset index. It had some flaws: appending a string could re-allocate memory within str::Str. The caller couldn’t store returned char* pointer because it could be invalidated. As a result the API was akward and potentially confusing: I was returning offset of the string so the string was str::Str.Data() + offset. The new StrVec doesn’t re-allocate on Append, only (potentially) on InsertAt and SetAt. The most common case is append-only which allows the caller to store the returned char* pointers. Before implementing StrVec I used Vec<char*>. Vec is my version of std::vector and Vec<char*> would just store pointer to individually allocated strings. Cost vs. benefit I’m a pragmatist: I want to achieve the most with the least amount of code, the least amount of time and effort. While it might seem that I’m re-implementing things willy-nilly, I’m actually very mindful of the cost of writing code. Writing software is a balance between effort and resulting quality. One of the biggest reasons SumatraPDF so popular is that it’s fast and small. That’s an important aspect of software quality. When you double click on a PDF file in an explorer, SumatraPDF starts instantly. You can’t say that about many similar programs and about other software in general. Keeping SumatraPDF small and fast is an ongoing focus and it does take effort. StrVec.cpp is only 705 lines of code. It took me several days to complete. Maybe 2 days to write the code and then some time here and there to fix the bugs. That being said, I didn’t start with this StrVec. For many years I used obvious Vec<char*>. Then I implemented somewhat optimized StrVec. And a few years after that I implemented this ultra-optimized version. References SumatraPDF is a small, fast, multi-format (PDF/eBook/Comic Book and more), open-source reader for Windows. The implementation described here: StrVec.cpp, StrVec.h, StrVec_ut.cpp By the time you read this, the implementation could have been improved.

22 hours ago • 1 votes

The parental dead end of consent morality

Consent morality is the idea that there are no higher values or virtues than allowing consenting adults to do whatever they please. As long as they're not hurting anyone, it's all good, and whoever might have a problem with that is by definition a bigot. This was the overriding morality I picked up as a child of the 90s. From TV, movies, music, and popular culture. Fly your freak! Whatever feels right is right! It doesn't seem like much has changed since then. What a moral dead end. I first heard the term consent morality as part of Louise Perry's critique of the sexual revolution. That in the context of hook-up culture, situationships, and falling birthrates, we have to wrestle with the fact that the sexual revolution — and it's insistence that, say, a sky-high body count mustn't be taboo — has led society to screwy dating market in the internet age that few people are actually happy with. But the application of consent morality that I actually find even more troubling is towards parenthood. As is widely acknowledged now, we're in a bit of a birthrate crisis all over the world. And I think consent morality can help explain part of it. I was reminded of this when I posted a cute video of a young girl so over-the-moon excited for her dad getting off work to argue that you'd be crazy to trade that for some nebulous concept of "personal freedom". Predictably, consent morality immediately appeared in the comments: Some people just don't want children and that's TOTALLY OKAY and you're actually bad for suggesting they should! No. It's the role of a well-functioning culture to guide people towards The Good Life. Not force, but guide. Nobody wants to be convinced by the morality police at the pointy end of a bayonet, but giving up on the whole idea of objective higher values and virtues is a nihilistic and cowardly alternative. Humans are deeply mimetic creatures. It's imperative that we celebrate what's good, true, and beautiful, such that these ideals become collective markers for morality. Such that they guide behavior. I don't think we've done a good job at doing that with parenthood in the last thirty-plus years. In fact, I'd argue we've done just about everything to undermine the cultural appeal of the simple yet divine satisfaction of child rearing (and by extension maligned the square family unit with mom, dad, and a few kids). Partly out of a coordinated campaign against the family unit as some sort of trad (possibly fascist!) identity marker in a long-waged culture war, but perhaps just as much out of the banal denigration of how boring and limiting it must be to carry such simple burdens as being a father or a mother in modern society. It's no wonder that if you incessantly focus on how expensive it is, how little sleep you get, how terrifying the responsibility is, and how much stress is involved with parenthood that it doesn't seem all that appealing! This is where Jordan Peterson does his best work. In advocating for the deeper meaning of embracing burden and responsibility. In diagnosing that much of our modern malaise does not come from carrying too much, but from carrying too little. That a myopic focus on personal freedom — the nights out, the "me time", the money saved — is a spiritual mirage: You think you want the paradise of nothing ever being asked of you, but it turns out to be the hell of nobody ever needing you. Whatever the cause, I think part of the cure is for our culture to reembrace the virtue and the value of parenthood without reservation. To stop centering the margins and their pathologies. To start centering the overwhelming middle where most people make for good parents, and will come to see that role as the most meaningful part they've played in their time on this planet. But this requires giving up on consent morality as the only way to find our path to The Good Life. It involves taking a moral stance that some ways of living are better than other ways of living for the broad many. That parenthood is good, that we need more children both for the literal survival of civilization, but also for the collective motivation to guard against the bad, the false, and the ugly. There's more to life than what you feel like doing in the moment. The worst thing in the world is not to have others ask more of you. Giving up on the total freedom of the unmoored life is a small price to pay for finding the deeper meaning in a tethered relationship with continuing a bloodline that's been drawn for hundreds of thousands of years before it came to you. You're never going to be "ready" before you take the leap. If you keep waiting, you'll wait until the window has closed, and all you see is regret. Summon a bit of bravery, don't overthink it, and do your part for the future of the world. It's 2.1 or bust, baby!

yesterday • 2 votes

New here?

Bloom Filters

Improve your reading experience

More from samwho.dev

More in programming

bored reading