Turing Machines

114

from samwho.dev in programming


7 months ago


More from samwho.dev

Reservoir Sampling

Reservoir sampling is a technique for selecting a fair random sample when you don't know the size of the set you're sampling from. By the end of this essay you will know: When you would need reservoir sampling. The mathematics behind how it works, using only basic operations: subtraction, multiplication, and division. No math notation, I promise. A simple way to implement reservoir sampling if you want to use it.

This post is sponsored by ittybit, and their API for working with videos, images, and audio. If you need to store, encode, or get intelligence from the media files in your app, check them out!

# Sampling when you know the size

In front of you are 10 playing cards and I ask you to pick 3 at random. How do you do it? The first technique that might come to mind from your childhood is to mix them all up in the middle. Then you can straighten them out and pick the first 3. You can see this happen below by clicking "Shuffle."

Every time you click "Shuffle," the chart below tracks what the first 3 cards were. At first you'll notice some cards are selected more than others, but if you keep going it will even out. All cards have an equal chance of being selected. This makes it "fair." Click "Shuffle 100 times" until the chart evens out. You can reset the chart if you'd like to start over.

This method works fine with 10 cards, but what if you had 1 million cards? Mixing those up won't be easy. Instead, we could use a random number generator to pick 3 indices. These would be our 3 chosen cards. We no longer have to move all of the cards, and if we click the "Select" button enough times we'll see that this method is just as fair as the mix-up method.
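As a rough illustration of picking indices with a random number generator, here is a minimal sketch. The function name `pickIndices` and its injectable `random` parameter are assumptions made for this example, not something from the original demo, and it re-draws whenever it hits an index it has already chosen so that the 3 cards are distinct.

```javascript
// Pick `k` distinct random indices from a collection of size `n`.
// `random` defaults to Math.random and is injectable only to make testing easier.
function pickIndices(n, k, random = Math.random) {
  if (k > n) {
    throw new RangeError("cannot pick more indices than there are cards");
  }
  const chosen = new Set();
  while (chosen.size < k) {
    // Math.floor(random() * n) is a whole number from 0 to n - 1.
    chosen.add(Math.floor(random() * n));
  }
  return [...chosen];
}

// Three card positions out of a deck of one million.
console.log(pickIndices(1000000, 3));
```

Re-drawing on duplicates is cheap when k is tiny compared to n, which is the situation described here. It would be a poor choice if k were close to n.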
I'm stretching the analogy a little here. It would take a long time to count through the deck to get to, say, index 436,234. But when it's an array in memory, computers have no trouble finding an element by its index. Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random? That's the curveball: you don't know. No, you can only hold on to 1 card at a time. You're free to swap your card with the newest one each time I show you a card, but you can only hold one and you can't go back to a card you've already seen. Believe it or not, this is a real problem and it has a real and elegant solution. For example, let's say you're building a log collection service. Text logs, not wooden ones. This service receives log messages from other services and stores them so that it's easy to search them in one place. One of the things you need to think about when building a service like this is what do you do when another service starts sending you way too many logs. Maybe it's a bad release, maybe one of your videos goes viral. Whatever the reason, it threatens to overwhelm your log collection service. Let's simulate this. Below you can see a stream of logs that experiences periodic spikes. A horizontal line indicates the threshold of logs per second that the log collection service can handle, which in this example is 5 logs per second. You can see that every so often, logs per second spikes above the threshold . One way to deal with this is "sampling." Deciding to send only a fraction of the logs to the log collection service. Let's send 10% of the logs. Below we will see the same simulation again, but this time logs that don't get sent to our log collection service will be greyed out. The graph has 2 lines: a black line tracks sent logs , the logs that are sent to our log collection service, and a grey line tracks total logs . The rate of sent logs never exceeds the threshold , so we never overwhelm our log collection service. 
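The fixed 10% sampling described above could look something like the sketch below. `makeRateSampler` and the `send` callback are hypothetical names standing in for whatever actually delivers a log to the collection service.

```javascript
// Forward each log with a fixed probability `rate` (0.1 keeps roughly 10%).
// `send` is a placeholder for the code that ships a log to the collection service.
function makeRateSampler(rate, send, random = Math.random) {
  return function handle(log) {
    if (random() < rate) {
      send(log);
      return true; // sent
    }
    return false; // dropped
  };
}

const sent = [];
const handleLog = makeRateSampler(0.1, (log) => sent.push(log));
for (let i = 0; i < 1000; i++) {
  handleLog(`log ${i}`);
}
console.log(`sent ${sent.length} of 1000 logs`); // varies from run to run
```

Note that the decision for each log is independent of how busy the stream is, so the same fraction is dropped whether or not the service is under pressure.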
However, in the quieter periods we're throwing away 90% of the logs when we don't need to! What we really want is to send at most 5 logs per second. This would mean that during quiet periods you get all the logs, but during spikes you discard logs to protect the log collection service. The simple way to achieve this would be to send the first 5 logs you see each second, but this isn't fair. You aren't giving all logs an equal chance of being selected. # Sampling when you don't know the size We instead want to pick a fair sample of all the logs we see each second. The problem is that we don't know how many we will see. Reservoir sampling is an algorithm that solves this exact problem. You could, but why live with that uncertainty? You'd be holding on to an unknown number of logs in memory. A sufficiently big spike could cause you problems. Reservoir sampling solves this problem, and does so without ever using more memory than you ask it to. Let's go back to our curveball of me showing you 1 card at a time. Here's a recap of the rules: I'll draw cards one at a time from a deck. Each time I show you a card, you have to choose to hold it or discard it. If you were already holding a card, you discard your held card before replacing it with the new card. At any point I can stop drawing cards and whatever card you're holding is the one you've chosen. How would you play this game in a way that ensures all cards have been given an equal chance to be selected when I decide to stop? You're on the right track. Let's have a look at how the coin flip idea plays out in practice. Below you see a deck of cards. Clicking "Deal" will draw a card and 50% of the time it will go to the discard pile on the right, and 50% of the time it will become your held card in the center, with any previously held card moving to the discard pile. 
The problem is that while the hold vs discard counts are roughly equal, which feels fair, later cards are much more likely to be held when I stop than earlier cards. The first card drawn has to win 10 coin flips to still be in your hand after the 10th card is drawn. The last card only has to win 1.

Scrub the slider below to see how the chances change as we draw more cards. Each bar represents a card in the deck, and the height of the bar is the chance we're holding that card when I stop. Below the slider are the chances we're holding the first card drawn vs. the last card drawn. Anything older than 15 cards ago has a less than 0.01% chance of being held when I stop.

Believe it or not, we only have to make one small change to this idea to make it fair. Instead of flipping a coin to decide if we'll hold the card or not, we give each new card a 1/n chance of being held, where n is the number of cards we've seen so far.

In order to be fair, every card must have an equal chance of being selected. So for the 2nd card, we want both cards to have a 1/2 chance. For the 3rd card, we want all 3 cards to have a 1/3 chance. For the 4th card, we want all 4 cards to have a 1/4 chance, and so on. So if we use 1/n for the new card, we can at least say that the new card has had a fair shot. Let's have a look at the chances as you draw more cards with this new method.

The new card has the right chance of being selected, but how does that make the older cards fair? So far we've focused on the chance of the new card being selected, but we also need to consider the chance of the card you're holding staying in your hand. Let's walk through the numbers.

# Card 1

The first card is easy: we're not holding anything, so we always choose to hold the first card. The chance we're holding this card is 1/1, or 100%.

# Card 2

This time we have a real choice. We can keep hold of the card we have, or replace it with the new one.
We've said that we're going to do this with a 1/n chance, where n is the number of cards we've seen so far. So our chance of replacing the first card is 1/2, or 50%, and our chance of keeping hold of the first card is its chance of being chosen last time multiplied by its chance of not being replaced, so 100% * 1/2, which is again 50%.

# Card 3

The card we're holding has a 50% chance of being there. This is true regardless of what happened up to this point. No matter whether we're holding card 1 or card 2, it's 50%. The new card has a 1/3 chance of being selected, so the card we're holding has a 1/3 chance of being replaced. This means that our held card has a 2/3 chance of remaining held. So its chances of "surviving" this round are 50% * 2/3, which is 1/3, the same as the new card.

# Card N

This pattern continues for as many cards as you want to draw. We can express both options as formulas: the chance of the held card surviving is 1/(n-1) * (1-(1/n)), and the chance of choosing the new card is 1/n. Drag the slider to substitute n with real numbers and see that the two formulas are always equal. If 1/n is the chance of choosing the new card, 1/(n-1) is the chance of choosing the new card from the previous draw. The chance of not choosing the new card is the complement of 1/n, which is 1-(1/n).

Below are the cards again except this time set up to use 1/n instead of a coin flip. Click to the end of the deck. Does it feel fair to you? There's a good chance that through the 2nd half of the deck, you never swap your chosen card. This feels wrong, at least to me, but as we saw above the numbers say it is completely fair.

# Choosing multiple cards

Now that we know how to select a single card, we can extend this to selecting multiple cards. There are 2 changes we need to make: Rather than new cards having a 1/n chance of being selected, they now have a k/n chance, where k is the number of cards we want to choose. When we decide to replace a held card, we choose one of the k cards we're holding at random. So our new previous card selection formula becomes k/(n-1) because we're now holding k cards.
Then the chance that any one of the cards we're holding survives the draw is 1-(1/n). The new card is selected k/n of the time, and when it is, each held card has a 1/k chance of being the one replaced, so each held card is replaced 1/n of the time. Multiplying the two gives k/(n-1) * (1-(1/n)), which is k/n. Let's see how this plays out with real numbers. The fairness still holds, and will hold for any k and n pair. This is because all held cards have an equal chance of being replaced, which keeps them at an equal likelihood of still being in your hand every draw.

A nice way to implement this is to use an array of size k. The first k cards simply fill the array. For each new card after that, generate a random whole number from 0 up to, but not including, n. If the random number is less than k, replace the card at that index with the new card. Otherwise, discard the new card. And that's how reservoir sampling works!

# Applying this to log collection

Let's take what we now know about reservoir sampling and apply it to our log collection service. We'll set k=5, so we're "holding" at most 5 log messages at a time, and every second we will send the selected logs to the log collection service. After we've done that, we empty our array of size 5 and start again.

This creates a "lumpy" pattern in the graph below, and highlights a trade-off when using reservoir sampling. It's no longer a real-time stream of logs, but chunks of logs sent at an interval. However, the rate of sent logs never exceeds the threshold, and during quiet periods the two lines track each other almost perfectly. No logs lost during quiet periods, and never more than threshold logs per second sent during spikes. The best of both worlds. It also doesn't store more than k=5 logs, so it will have predictable memory usage.

# Further reading

Something you may have thought while reading this post is that some logs are more valuable than others. You almost certainly want to keep all error logs, for example. For that use-case there is a weighted variant of reservoir sampling. I wasn't able to find a simpler explanation of it, so that link is to Wikipedia which I personally find a bit hard to follow. But the key point is that it exists and if you need it you can use it.
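For readers who want to try the basic, unweighted algorithm from this post, here is a minimal sketch of the array-of-size-k approach. The class name `Reservoir`, its `flush` method, and the injectable `random` parameter are assumptions made for this example. It fills the array with the first k items before applying the k/n rule.

```javascript
class Reservoir {
  constructor(k, random = Math.random) {
    this.k = k;           // how many items we want to end up holding
    this.random = random; // injectable so the behaviour can be tested
    this.seen = 0;        // n: how many items have been offered so far
    this.items = [];      // never grows beyond k entries
  }

  add(item) {
    this.seen += 1;
    if (this.items.length < this.k) {
      // Still filling up: the first k items are always held.
      this.items.push(item);
      return;
    }
    // Whole number from 0 to seen - 1; it is below k with probability k/n.
    const j = Math.floor(this.random() * this.seen);
    if (j < this.k) {
      this.items[j] = item; // replace the held item at that index
    }
    // Otherwise the new item is discarded.
  }

  // Hand back the current sample and start again, as the log example does each second.
  flush() {
    const sample = this.items;
    this.items = [];
    this.seen = 0;
    return sample;
  }
}

const reservoir = new Reservoir(5);
for (let i = 0; i < 100; i++) {
  reservoir.add(`log ${i}`);
}
console.log(reservoir.flush()); // 5 of the 100 logs, different on each run
```

In the log collection setting, something would call `flush()` once per second and send whatever it returns. Memory use stays bounded by k however many logs arrive in between.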
# Conclusion Reservoir sampling is one of my favourite algorithms, and I've been wanting to write about it for years now. It allows you to solve a problem that at first seems impossible, in a way that is both elegant and efficient. Thank you again to ittybit for sponsoring this post. I really couldn't have hoped for a more supportive first sponsor. Thank you for believing in and understanding what I'm doing here. Thank you to everyone who read this post and gave their feedback. You made this post much better than I could have done on my own, and steered me away from several paths that just weren't working. If you want to tell me what you thought of this post by sending me an anonymous message that goes directly to my phone, go to https://samwho.dev/ping.

5 months ago • 14 votes

A Commitment to Art and Dogs

Back in Memory Allocation, I introduced Haskie. The idea behind Haskie was to create a character that could ask questions the reader might have, and to "soften" the posts to make them feel less intimidating. I got some feedback from people that Haskie was a bit too childish, and didn't feel like he belonged in posts about serious topics. This feedback was in the minority, though, and most people liked him. So I kept him and used him again in Hashing.

Having a proxy to the reader was useful. I could anticipate areas of confusion and clear them up without creating enormous walls of text. I don't like it when the entire screen is filled with text; I like to break it up with images and interactive elements. And now dogs.

Then in Bloom Filters, I found myself needing a character to represent the "adult in the room." If Haskie was my proxy to the reader, this new character would serve as a proxy to all of the material I learned from in the writing of the post. This is Sage.

I liked the idea of having a cast of characters, each with their own personality and purpose. But I had a few problems.

# Problems

Both Haskie and Sage, because I have no artistic ability, were generated by AI. Back when I made them I was making no money from this blog, and I had no idea if I was going to keep them around. I didn't want to invest money in an idea that could flop, so I didn't feel bad about using AI to try it out. Since then, however, I have been paid twice to write posts for companies, and I know that I'm keeping the dogs. It wasn't ethical to continue piggybacking on AI.

While ethics were the primary motivation, there were some other smaller problems with the dogs: Their visual style, while I did like it, never felt like it fit with the rest of my personal brand. It was difficult to get AI to generate consistent dogs. You'll notice differences in coat colouration and features between variants of the same dog. The AI-generated images look bad at small sizes.

So I worked with the wonderful Andy Carolan to create a new design for my dogs. A design that would be consistent, fit with my brand, and look good at any size.

# Haskie, Sage, and Doe

The redesigned dogs are consistent, use simple colours and shapes, and use the SVG file format to look good at any size. Each variant clocks in at around 20kb, which is slightly larger than the small AI-generated images, but I'll be able to use them at any size.

Together the dogs represent a family unit: Sage as the dad, Haskie as the youngest child, and Doe as his older sister. They also come in a variety of poses, so I can use them to represent different emotions or actions.

We were careful to make the dogs easy to tell apart. They differ in colour, ear shape, tail shape, and collar tag. Sage and Doe have further distinguishing features: Sage with his glasses, and Doe with her bandana. Doe's bandana uses the same colours as the transgender flag, to show my support for the trans community and as a nod to her identity.

# Going forward

I'm so happy with the new dogs, and plan to use them in my posts going forward. I suspect I will, at some point, replace the dogs in my old posts as well. I don't plan to add any more characters, and I want to be careful to avoid overusing them. I don't want them to become a crutch, or to distract from the content of the posts.

I also haven't forgotten the many people that pointed out to me that you can't pet the dogs. I'm working on it.

a year ago • 117 votes

Bloom Filters

Everyone has a set of tools they use to solve problems. Growing this set helps you to solve ever more difficult problems. In this post, I'm going to teach you about a tool you may not have heard of before. It's a niche tool that won't apply to many problems, but when it does you'll find it invaluable. It's called a "bloom filter."

Before you continue! This post assumes you know what a hash function is, and if you don't it's going to be tricky to understand. Sam has written a post about hash functions, and recommends that you read it first.

# What bloom filters can do

Bloom filters are similar to the Set data structure. You can add items to them, and check if an item is present. Here's what it might look like to use a bloom filter in JavaScript, using a made-up BloomFilter class:

    let bf = new BloomFilter();
    bf.add("Ant");
    bf.add("Rhino");
    bf.contains("Ant"); // true
    bf.contains("Rhino"); // true

While this looks almost identical to a Set, there are some key differences. Bloom filters are what's called a probabilistic data structure.
Where a Set can give you a concrete "yes" or "no" answer when you call contains, a bloom filter can't. Bloom filters can give definite "no"s, but they can't be certain about "yes." In the example above, when we ask bf if it contains "Ant" and "Rhino", the true that it returns isn't a guarantee that they're present. We know that they're present because we added them just a couple of lines before, but it would be possible for this to happen: let bf = new BloomFilter(); bf.add("Ant"); bf.add("Rhino"); bf.contains("Fox"); // true We'll demonstrate why over the course of this post. For now, we'll say that when bloom filters return true it doesn't mean "yes", it means "maybe". When this happens and the item has never been added before, it's called a false-positive. The opposite, claiming "no" when the answer is "yes," is called a false-negative. A bloom filter will never give a false-negative, and this is what makes them useful. It's not strictly lying, it's just not giving you a definite answer. Let's look at an example where we can use this property to our advantage. # When bloom filters are useful Imagine you're building a web browser, and you want to protect users from malicious links. You could build and maintain a list of all known malicious links and check the list every time a user navigates the browser. If the link they're trying to visit is in the list, you warn the user that they might be about to visit a malicious website. If we assume there are, say, 1,000,000 malicious links on the Internet, and each link is 20 characters long, then the list of malicious links would be 20MB in size. This isn't a huge amount of data, but it's not small either. If you have lots of users and want to keep this list up to date, the bandwidth could add up. However, if you're happy to accept being wrong 0.0001% of the time (1 in a million), you could use a bloom filter which can store the same data in 3.59MB. 
That's an 82% reduction in size, and all it costs you is showing the user an incorrect warning 1 in every million links visited. If you wanted to take it even further, and you were happy to accept being wrong 0.1% of the time (1 in 1000), the bloom filter would only be 1.8MB. This use-case isn't hypothetical, either. Google Chrome used a bloom filter for this exact purpose until 2012.

If you were worried about showing a warning when it wasn't needed, you could always make an API that has the full list of malicious links in a database. When the bloom filter says "maybe," you would then make an API call to check the full list to be sure. No more spurious warnings, and the bloom filter would save you from having to call the API for every link visited.

# How bloom filters work

At its core, a bloom filter is an array of bits. When it is created, all of the bits are set to 0. We're going to represent this as a grid of circles, with each circle representing 1 bit. Our bloom filters in this post are all going to have 32 bits in total.

To add an item to the bloom filter, we're going to hash it with 3 different hash functions, then use the 3 resulting values to set 3 bits. If you're not familiar with hashing, I recommend reading my post about it before continuing. For this post I'm choosing to use 3 of the SHA family of hash functions: sha1, sha256, and sha512.

Here's what our bloom filter looks like if we add the value "foo" to it: The bits in positions 15, 16 and 27 have been set. Other bits, e.g. 1, have not been set. You can hover or tap the bits in this paragraph to highlight them in the visualisation. We get to this state by taking the hash value of "foo" for each of our 3 hash functions and taking it modulo the number of bits in our bloom filter. Modulo gets us the remainder when dividing by 32, so we get 27 with sha1, 15 with sha256 and 16 with sha512.
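To make the add and check steps concrete, here is a minimal sketch of the made-up BloomFilter class. It is an illustration under stated assumptions rather than the code behind the visualisations: instead of sha1, sha256, and sha512 it uses a seeded FNV-1a hash (the hypothetical helper `fnv1a`) purely to stay dependency-free, so the bit positions it produces will not match the 27, 15, and 16 quoted for "foo".

```javascript
// FNV-1a over the string's UTF-16 code units, with a seed mixed in so one
// routine can stand in for several different hash functions.
function fnv1a(text, seed) {
  let hash = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 16777619);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

class BloomFilter {
  constructor(size = 32, hashCount = 3) {
    this.size = size;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(size); // one byte per bit, for readability
  }

  // The bit positions an item maps to: one per hash function, modulo the size.
  positions(item) {
    const out = [];
    for (let seed = 0; seed < this.hashCount; seed++) {
      out.push(fnv1a(item, seed) % this.size);
    }
    return out;
  }

  add(item) {
    for (const p of this.positions(item)) {
      this.bits[p] = 1;
    }
  }

  // true means "maybe present"; false means "definitely not present".
  contains(item) {
    return this.positions(item).every((p) => this.bits[p] === 1);
  }
}

const bf = new BloomFilter();
bf.add("Ant");
bf.add("Rhino");
console.log(bf.contains("Ant")); // true
console.log(bf.contains("Fox")); // usually false, but a false-positive is possible
```

Storing one byte per bit wastes memory and is done here only to keep the indexing obvious. A real implementation would pack 8 bits into each byte.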
The table below shows what's happening, and you can try inputting your own values to see what bits they would set if added.

Go ahead and add a few of your own values to our bloom filter below and see what happens. There's also a check button that will tell you if a value is present within the bloom filter. A value is only considered present if all of the bits checked are set. You can start again by hitting the clear button.

You might occasionally notice that only 2, or even 1, bits get set. This happens when 2 or more of our hash functions produce the same value, or we attempt to set a bit that has already been set. Taking that a bit further, have a think about the implications of a bloom filter that has every bit set.

If every bit is set, then won't the bloom filter claim it contains every item you check? That's a false-positive every time!

Exactly right. A bloom filter with every bit set is equivalent to a Set that always returns true for contains. It will claim to contain everything you ask it about, even if that thing was never added.

# False-positive rates

The rate of false-positives in our bloom filter will grow as the percentage of set bits increases. Drag the slider below the graph to see how the false-positive rate changes as the number of set bits increases. It grows slowly at first, but as we get closer to having all bits set the rate increases.

This is because we calculate the false-positive rate as x^3, where x is the fraction of bits that are set (so 50% is 0.5) and 3 is the number of hash functions used. To give an example of why we calculate it with this formula, imagine we have a bloom filter with half of its bits set, x = 0.5. If we assume that our hash function has an equal chance of landing on any of the bits, then the chance that all 3 hash functions land on a bit that is already set is 0.5 * 0.5 * 0.5, or x^3.

Let's have a look at the false-positive rate of bloom filters that use different numbers of hash functions.
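The x^3 calculation above extends to any number of hash functions by raising x to that number instead of 3. A small sketch, where `falsePositiveRate` is a name made up for this example:

```javascript
// Chance that every checked bit is already set, given that a fraction `fill`
// (0 to 1) of the filter's bits is set and `hashCount` hash functions are used.
function falsePositiveRate(fill, hashCount) {
  return fill ** hashCount;
}

// The same half-full filter, checked with different numbers of hash functions.
for (const hashCount of [1, 3, 5, 10]) {
  console.log(hashCount, falsePositiveRate(0.5, hashCount));
}
```

For a fixed fill level, more hash functions always give a lower rate from this formula. It says nothing about how quickly the filter reaches that fill level, which also depends on the number of hash functions.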
The problem that using lots of hash functions introduces is that it makes the bloom filter fill up faster. The more hash functions you use, the more bits get set for each item you add. There's also the cost of hashing itself. Hash functions aren't free, and while the hash functions you'd use in a bloom filter try to be as fast as possible, it's still more expensive to run 100 of them than it is to run 3.

It's possible to calculate how full a bloom filter will be after inserting a number of items, based on the number of hash functions used. The graph below assumes a bloom filter with 1000 bits. The more hash functions we use, the faster we set all of the bits. You'll notice that the curve tails off as more items are added. This is because the more bits that are set, the more likely it is that we'll attempt to set a bit that has already been set.

In practice, 1000 bits is a very small bloom filter, occupying only 125 bytes of memory. Modern computers have a lot of memory, so let's crank this up to 100,000 bits (12.5kB) and see what happens. The lines barely leave the bottom of the graph, meaning the bloom filter will be very empty and the false-positive rate will be low. All this cost us was 12.5kB of memory, which is still a very small amount by 2024 standards.

# Tuning a bloom filter

Picking the correct number of hash functions and bits for a bloom filter is a fine balance. Fortunately for us, if we know up-front how many unique items we want to store, and what our desired false-positive rate is, we can calculate the optimal number of hash functions, and the required number of bits. The bloom filter page on Wikipedia covers the mathematics involved, which I'm going to translate into JavaScript functions for us to use. I want to stress that you don't need to understand the maths to use a bloom filter or read this post. I'm including the link to it only for completeness.
# Optimal number of bits The following JavaScript function, which might look a bit scary but bear with me, takes the number of items you want to store (items) and the desired false-positive rate (fpr, where 1% == 0.01), and returns how many bits you will need to achieve that false-positive rate. function bits(items, fpr) { const n = -items * Math.log(fpr); const d = Math.log(2) ** 2; return Math.ceil(n / d); } We can see how this grows for a variety of fpr values in the graph below. # Optimal number of hash functions After we've used the JavaScript above to calculate how many bits we need, we can use the following function to calculate the optimal number of hash functions to use: function hashFunctions(bits, items) { return Math.ceil((bits / items) * Math.log(2)); } Pause for a second here and have a think about how the number of hash functions might grow based on the size of the bloom filter and the number of items you expect to add. Do you think you'll use more hash functions, or fewer, as the bloom filter gets larger? What about as the number of items increases? The more items you plan to add, the fewer hash functions you should use. Yet, a larger bloom filter means you can use more hash functions. More hash functions keep the false-positive rate lower for longer, but more items fills up the bloom filter faster. It's a complex balancing act, and I am thankful that mathematicians have done the hard work of figuring it out for us. # Caution While we can stand on the shoulders of giants and pick the optimal number of bits and hash functions for our bloom filter, it's important to remember that these rely on you giving good estimates of the number of items you expect to add, and choosing a false-positive rate that's acceptable for your use-case. These numbers might be difficult to come up with, and I recommend erring on the side of caution. If you're not sure, it's likely better to use a larger bloom filter than you think you need. 
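As a worked example, the two functions above can be run against the malicious-links scenario from earlier in the post: 1,000,000 items at false-positive rates of 1 in a million and 1 in 1000. The functions are repeated so the block runs on its own. `megabytes` is a hypothetical helper added for this example, and it assumes 1MB means 1,000,000 bytes.

```javascript
// The two functions from the post, repeated so this block is self-contained.
function bits(items, fpr) {
  const n = -items * Math.log(fpr);
  const d = Math.log(2) ** 2;
  return Math.ceil(n / d);
}

function hashFunctions(bits, items) {
  return Math.ceil((bits / items) * Math.log(2));
}

// Hypothetical helper: convert a bit count to megabytes (1MB = 1,000,000 bytes).
function megabytes(bitCount) {
  return bitCount / 8 / 1000000;
}

for (const fpr of [0.000001, 0.001]) {
  const needed = bits(1000000, fpr);
  console.log(
    `fpr ${fpr}: ${needed} bits, ` +
      `${megabytes(needed).toFixed(2)}MB, ` +
      `${hashFunctions(needed, 1000000)} hash functions`
  );
}
```

Under those assumptions the sizes should land close to the 3.59MB and 1.8MB figures quoted earlier, with 20 and 10 hash functions respectively.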
# Removing items from a bloom filter We've spent the whole post talking about adding things to a bloom filter, and the optimal parameters to use. We haven't spoken at all about removing items. And that's because you can't! In a bloom filter, we're using bits, individual 1s and 0s, to track the presence of items. If we were to remove an item by setting its bits to 0, we might also be removing other items by accident. There's no way of knowing. Click the buttons of the bloom filter below to see this in action. First we will add "foo", then "baz", and then we will remove "baz". Hit "clear" if you want to start again. The end result of this sequence is a bloom filter that doesn't contain "baz", but doesn't contain "foo" either. Because both "foo" and "baz" set bit 27, we accidentally clobber the presence of "foo" while removing "baz". Something else you might have noticed playing with the above example is that if you add "foo" and then attempt to remove "baz" before having added it, nothing happens. Even though 27 is set, bits 18 and 23 are not, so the bloom filter cannot contain "baz". Because of this, it won't unset 27. # Counting bloom filters While you can't remove items from a standard bloom filter, there are variants that allow you to do so. One of these variants is called a "counting bloom filter," which uses an array of counters instead of bits to keep track of items. Now when you go through the sequence, the end result is that the bloom filter still contains "foo." It solves the problem. The trade-off, though, is that counters are bigger than bits. With 4 bits per counter you can increment up to 15. With 8 bits per counter you can increment up to 255. You'll need to pick a counter size sufficient to never reach the maximum value, otherwise you risk corrupting the bloom filter. Using 8x more memory than a standard bloom filter could be a big deal, especially if you're using a bloom filter to save memory in the first place. 
Think hard about whether you really need to be able to remove items from your bloom filter. Counting bloom filters also introduce the possibility of false-negatives, which are impossible in standard bloom filters. Consider the following example. Because "loved" and "response" both hash to the bits 5, 22, and 26, when we remove "response" we also remove "loved". If we write this as JavaScript the problem becomes more clear: let bf = new CountingBloomFilter(); bf.add("loved"); bf.add("your"); bf.remove("response"); bf.contains("loved"); // false Even though we know for sure we've added "loved" in this snippet, the call to contains will return false. This sort of false-negative can't happen in a standard bloom filter, and it removes one of the key benefits of using a bloom filter in the first place: the guarantee of no false-negatives. # Bloom filters in the real-world Real-world users of bloom filters include Akamai, who use them to avoid caching web pages that are accessed once and never again. They do this by storing all page accesses in a bloom filter, and only writing them into cache if the bloom filter says they've been seen before. This does result in some pages being cached on the first access, but that's fine because it's still an improvement. It would be impractical for them to store all page accesses in a Set, so they accept the small false-positive rate in favour of the significantly smaller bloom filter. Akamai released a paper about this that goes into the full details if you're interested. Google's BigTable is a distributed key-value store, and uses bloom filters internally to know what keys are stored within. When a read request for a key comes in, a bloom filter in memory is first checked to see if the key is in the database. If not, BigTable can respond with "not found" without ever needing to read from disk. 
Sometimes the bloom filter will claim a key is in the database when it isn't, but this is fine because when that happens a disk access will confirm the key in fact isn't in the database. # Conclusion Bloom filters, while niche, can be a huge optimisation in the right situation. They're a wonderful application of hash functions, and a great example of making a deliberate trade-off to achieve a specific goal. Trade-offs, and combining simpler building blocks to create more complex, purpose-built data structures, are present everywhere in software engineering. Being able to spot where a data structure could net a big win can separate you from the pack, and take your career to the next level. I hope you've enjoyed this post, and that you find a way to apply bloom filters to a problem you're working on. # Acknowledgements Enormous thank you to my reviewers, without whom this post would be a shadow of what you read today. In no particular order: rylon, Indy, Aaron, Sophie, Davis, ed, Michael Drury, Anton Zhiyanov, Christoph Berger.

a year ago • 65 votes

Hashing

As a programmer, you use hash functions every day. They're used in databases to optimise queries, they're used in data structures to make things faster, they're used in security to keep data safe. Almost every interaction you have with technology will involve hash functions in one way or another. Hash functions are foundational, and they are everywhere. But what is a hash function, and how do they work? In this post, we're going to demystify hash functions. We're going to start by looking at a simple hash function, then we're going to learn how to test if a hash function is good or not, and then we're going to look at a real-world use of hash functions: the hash map. # What is a hash function? Hash functions are functions that take an input, usually a string, and produce a number. If you were to call a hash function multiple times with the same input, it will always return the same number, and that number returned will always be within a promised range. What that range is will depend on the hash function, some use 32-bit integers (so 0 to 4 billion), others go much larger. 
If we were to write a dummy hash function in JavaScript, it might look like this: function hash(input) { return 0; } Even without knowing how hash functions are used, it's probably no surprise that this hash function is useless. Let's see how we can measure how good a hash function is, and after that we'll do a deep dive on how they're used within hash maps. # What makes a hash function good? Because input can be any string, but the number returned is within some promised range, it's possible that two different inputs can return the same number. This is called a "collision," and good hash functions try to minimise how many collisions they produce. It's not possible to completely eliminate collisions, though. If we wrote a hash function that returned a number in the range 0 to 7, and we gave it 9 unique inputs, we're guaranteed at least 1 collision. hash("to") == 3 hash("the") == 2 hash("café") == 0 hash("de") == 6 hash("versailles") == 4 hash("for") == 5 hash("coffee") == 0 hash("we") == 7 hash("had") == 1 To visualise collisions, I'm going to use a grid. Each square of the grid is going to represent a number output by a hash function. Here's an example 8x2 grid. Click on the grid to increment the example hash output value and see how we map it to a grid square. See what happens when you get a number larger than the number of grid squares. [Interactive grid: 16 squares numbered 0 to 15; the example shown is 13 % 16 == 13.] Every time we hash a value, we're going to make its corresponding square on the grid a bit darker. 
The idea is to create an easy way to see how well a hash function avoids collisions. What we're looking for is a nice, even distribution. We'll know that the hash function isn't good if we have clumps or patterns of dark squares. One caveat: because the grid has far fewer squares than there are possible hash values, different hash values will often land on the same square, so we're going to be creating "pseudo-collisions" on our grid. It's okay, though, because if the hash function is good we will still see an even distribution. Incrementing every square by 100 is just as good a distribution as incrementing every square by 1. If we have a bad hash function that collides a lot, that will still stand out. We'll see this shortly. Let's take a larger grid and hash 1,000 randomly-generated strings. You can click on the grid to hash a new set of random inputs, and the grid will animate to show you each input being hashed and placed on the grid. The values are nice and evenly distributed because we're using a good, well-known hash function called murmur3. This hash is widely used in the real world because it has great distribution while also being really, really fast. What would our grid look like if we used a bad hash function? function hash(input) { let hash = 0; for (let c of input) { hash += c.charCodeAt(0); } return hash % 1000000; } This hash function loops through the string that we're given and sums the numeric values of each character. It then makes sure that the value is between 0 and 999,999 by using the modulus operator (%). Let's call this hash function stringSum. Here it is on the grid. Reminder, this is 1,000 randomly generated strings that we're hashing. This doesn't look all that different from murmur3. What gives? The problem is that the strings we're giving to be hashed are random. Let's see how each function performs when given input that is not random: the numbers from 1 to 1000 converted to strings. Now the problem is more clear. When the input isn't random, the output of stringSum forms a pattern. 
Our murmur3 grid, however, looks the same as how it looked with random values. How about if we hash the top 1,000 most common English words: It's more subtle, but we do see a pattern on the stringSum grid. As usual, murmur3 looks the same as it always does. This is the power of a good hash function: no matter the input, the output is evenly distributed. Let's talk about one more way to visualise this and then talk about why it matters. # The avalanche effect Another way hash functions get evaluated is on something called the "avalanche effect." This refers to how many bits in the output value change when just a single bit of the input changes. To say that a hash function has a good avalanche effect, a single bit flip in the input should result in an average of 50% of the output bits flipping. It's this property that helps hash functions avoid forming patterns in the grid. If small changes in the input result in small changes in the output, you get patterns. Patterns indicate poor distribution, and a higher rate of collisions. Below, we are visualising the avalanche effect by showing two 8-bit binary numbers. The top number is the input value, and the bottom number is the murmur3 output value. Click on it to flip a single bit in the input. Bits that change in the output will be green, bits that stay the same will be red. murmur3 does well, though you will notice that sometimes fewer than 50% of the bits flip and sometimes more. This is okay, provided that it is 50% on average. Let's see how stringSum performs. Well, this is embarrassing. The output is equal to the input, and so only a single bit flips each time. This does make sense, because stringSum just sums the numeric value of each character in the string. This example only hashes the equivalent of a single character, which means the output will always be the same as the input. 
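Both observations about stringSum, the pattern on sequential input and the weak avalanche effect, can be checked with a few lines of code. The sketch below reuses stringSum as defined earlier; the bitsChanged helper and the choice of 97 (the letter "a") as the 8-bit input are illustrative assumptions.

```javascript
// stringSum, as defined earlier in the post.
function stringSum(input) {
  let hash = 0;
  for (let c of input) {
    hash += c.charCodeAt(0);
  }
  return hash % 1000000;
}

// 1) Sequential input: hash the numbers 1 to 1000 as strings and count
//    how many distinct outputs we get.
const outputs = new Set();
for (let i = 1; i <= 1000; i++) {
  outputs.add(stringSum(String(i)));
}
console.log(outputs.size); // 55 distinct values for 1,000 inputs

// Hypothetical helper: number of bits that differ between two integers.
function bitsChanged(a, b) {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x) {
    count += x & 1;
    x >>>= 1;
  }
  return count;
}

// 2) Avalanche: flip each bit of a single-character input in turn and
//    count how many output bits change.
const input = 0b01100001; // 97, the character "a"
const flips = [];
for (let bit = 0; bit < 8; bit++) {
  const flipped = input ^ (1 << bit);
  const before = stringSum(String.fromCharCode(input));
  const after = stringSum(String.fromCharCode(flipped));
  flips.push(bitsChanged(before, after));
}
console.log(flips); // [1, 1, 1, 1, 1, 1, 1, 1]
```

Only 55 distinct outputs for 1,000 inputs means most sequential numbers collide, and exactly one output bit changing per input bit is as far from the 50% target as a hash can get.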
# Why all of this matters We've taken the time to understand some of the ways to determine if a hash function is good, but we've not spent any time talking about why it matters. Let's fix that by talking about hash maps. To understand hash maps, we first must understand what a map is. A map is a data structure that allows you to store key-value pairs. Here's an example in JavaScript: let map = new Map(); map.set("hello", "world"); console.log(map.get("hello")); Here we take a key-value pair ("hello" → "world") and store it in the map. Then we print out the value associated with the key "hello", which will be "world". A more fun real-world use-case would be to find anagrams. An anagram is when two different words contain the same letters, for example "antlers" and "rentals" or "article" and "recital." If you have a list of words and you want to find all of the anagrams, you can sort the letters in each word alphabetically and use that as a key in a map. let words = [ "antlers", "rentals", "sternal", "article", "recital", "flamboyant", ] let map = new Map(); for (let word of words) { let key = word .split('') .sort() .join(''); if (!map.has(key)) { map.set(key, []); } map.get(key).push(word); } This code results in a map with the following structure: { "aelnrst": [ "antlers", "rentals", "sternal" ], "aceilrt": [ "article", "recital" ], "aabflmnoty": [ "flamboyant" ] } # Implementing our own simple hash map Hash maps are one of many map implementations, and there are many ways to implement hash maps. The simplest way, and the way we're going to demonstrate, is to use a list of lists. The inner lists are often referred to as "buckets" in the real-world, so that's what we'll call them here. A hash function is used on the key to determine which bucket to store the key-value pair in, then the key-value pair is added to that bucket. Let's walk through a simple hash map implementation in JavaScript. 
We're going to go through it bottom-up, so we'll see some utility methods before getting to the set and get implementations. class HashMap { constructor() { this.bs = [[], [], []]; } } We start off by creating a HashMap class with a constructor that sets up 3 buckets. We use 3 buckets and the short variable name bs so that this code displays nicely on devices with smaller screens. In reality, you could have however many buckets you want (and better variable names). class HashMap { // ... bucket(key) { let h = murmur3(key); return this.bs[ h % this.bs.length ]; } } The bucket method uses murmur3 on the key passed in to find a bucket to use. This is the only place in our hash map code that a hash function is used. class HashMap { // ... entry(bucket, key) { for (let e of bucket) { if (e.key === key) { return e; } } return null; } } The entry method takes a bucket and a key and scans the bucket until it finds an entry with the given key. If no entry is found, null is returned. class HashMap { // ... set(key, value) { let b = this.bucket(key); let e = this.entry(b, key); if (e) { e.value = value; return; } b.push({ key, value }); } } The set method is the first one we should recognise from our earlier JavaScript Map examples. It takes a key-value pair and stores it in our hash map. It does this by using the bucket and entry methods we created earlier. If an entry is found, its value is overwritten. If no entry is found, the key-value pair is added to the map. In JavaScript, { key, value } is shorthand for { key: key, value: value }. class HashMap { // ... get(key) { let b = this.bucket(key); let e = this.entry(b, key); if (e) { return e.value; } return null; } } The get method is very similar to set. It uses bucket and entry to find the entry related to the key passed in, just like set does. If an entry is found, its value is returned. If one isn't found, null is returned. That was quite a lot of code. 
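For reference, here are the pieces above assembled into one runnable class. The only change is the hash function: since murmur3 isn't built into JavaScript, this sketch assumes a simple FNV-1a-style function, here called fnv1a, as a stand-in. It is not murmur3 and is only there so the example runs on its own.

```javascript
// Stand-in for murmur3 (an assumption for this sketch): FNV-1a-style hash
// over the string's UTF-16 code units, returning an unsigned 32-bit number.
function fnv1a(input) {
  let h = 2166136261;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

class HashMap {
  constructor() {
    this.bs = [[], [], []]; // 3 buckets, as in the post
  }

  // Pick the bucket for a key by hashing it.
  bucket(key) {
    let h = fnv1a(key);
    return this.bs[h % this.bs.length];
  }

  // Scan a bucket for an entry with the given key.
  entry(bucket, key) {
    for (let e of bucket) {
      if (e.key === key) {
        return e;
      }
    }
    return null;
  }

  set(key, value) {
    let b = this.bucket(key);
    let e = this.entry(b, key);
    if (e) {
      e.value = value; // overwrite an existing entry
      return;
    }
    b.push({ key, value });
  }

  get(key) {
    let b = this.bucket(key);
    let e = this.entry(b, key);
    if (e) {
      return e.value;
    }
    return null;
  }
}

const map = new HashMap();
map.set("hello", "world");
console.log(map.get("hello")); // "world"
map.set("hello", "there");
console.log(map.get("hello")); // "there"
console.log(map.get("missing")); // null
```

Swapping fnv1a for a real murmur3 implementation only requires changing the one call inside bucket, which is the point made above about the hash function being used in a single place.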
What you should take away from it is that our hash map is a list of lists, and a hash function is used to know which of the lists to store and retrieve a given key from. Here's a visual representation of this hash map in action. Click anywhere on the buckets to add a new key-value pair using our set method. To keep the visualisation simple, if a bucket were to "overflow", the buckets are all reset. Because we're using murmur3 as our hash function, you should see good distribution between the buckets. It's expected you'll see some imbalance, but it should generally be quite even. To get a value out of the hash map, we first hash the key to figure out which bucket the value will be in. Then we have to compare the key we're searching for against all of the keys in the bucket. It's this search step that we minimise through hashing, and why murmur3 is optimised for speed. The faster the hash function, the faster we find the right bucket to search, the faster our hash map is overall. This is also why reducing collisions is so crucial. If we did decide to use that dummy hash function from all the way at the start of this article, the one that returns 0 all the time, we'd put all of our key-value pairs into the first bucket. Finding anything could mean we have to check all of the values in the hash map. With a good hash function, with good distribution, we reduce the amount of searching we have to do to 1/N, where N is the number of buckets. Let's see how stringSum does. Interestingly, stringSum seems to distribute values quite well. You'll notice a pattern, but the overall distribution looks good. "stringSum. I knew it would be good for something." Not so fast, Haskie. We need to talk about a serious problem. The distribution looks okay on these sequential numbers, but we've seen that stringSum doesn't have a good avalanche effect. This doesn't end well. # Real-world collisions Let's look at 2 real-world data sets: IP addresses and English words. 
What I'm going to do is take 100,000,000 random IP addresses and 466,550 English words, hash all of them with both murmur3 and stringSum, and see how many collisions we get.

| IP addresses | murmur3 | stringSum |
| --- | --- | --- |
| Collisions | 1,156,959 (1.157%) | 99,999,566 (99.999%) |

| English words | murmur3 | stringSum |
| --- | --- | --- |
| Collisions | 25 (0.005%) | 464,220 (99.5%) |

When we use hash maps for real, we aren't usually storing random values in them. We can imagine counting the number of times we've seen an IP address in rate limiting code for a server. Or code that counts the occurrences of words in books throughout history to track their origin and popularity. stringSum sucks for these applications because of its extremely high collision rate. # Manufactured collisions Now it's murmur3's turn for some bad news. It's not just collisions caused by similarity in the input we have to worry about. Check this out. What's happening here? Why do all of these gibberish strings hash to the same number? I hashed 141 trillion random strings to find values that hash to the number 1228476406 when using murmur3. Hash functions have to always return the same output for a specific input, so it's possible to find collisions by brute force. "141 trillion? Like... 141 and then 12 zeroes?" Yes, and it only took me 25 minutes. Computers are fast. Bad actors having easy access to collisions can be devastating if your software builds hash maps out of user input. Take HTTP headers, for example. An HTTP request looks like this:

GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: google.com

You don't have to understand all of the words, just that the first line is the path being requested and all of the other lines are headers. Headers are Key: Value pairs, so HTTP servers tend to use maps to store them. Nothing stops us from passing any headers we want, so we can be really mean and pass headers we know will cause collisions. This can significantly slow down the server. This isn't theoretical, either. 
If you search "HashDoS" you'll find a lot more examples of this. It was a really big deal in the mid-2000s. There are a few ways to mitigate this specific to HTTP servers: ignoring gibberish header keys and limiting the number of headers you store, for example. But modern hash functions like murmur3 offer a more generalised solution: randomisation. Earlier in this post we showed some examples of hash function implementations. Those implementations took a single argument: input. Lots of modern hash functions take a 2nd parameter: seed (sometimes called salt). In the case of murmur3, this seed is a number. So far, we've been using 0 as the seed. Let's see what happens with the collisions I've collected when we use a seed of 1. Just like that, 0 to 1, the collisions are gone. This is the purpose of the seed: it randomises the output of the hash function in an unpredictable way. How it achieves this is beyond the scope of this article; all hash functions do this in their own way. The hash function still returns the same output for the same input, it's just that the input is a combination of input and seed. Things that collide with one seed shouldn't collide when using another. Programming languages often generate a random number to use as the seed when the process starts, so that every time you run your program the seed is different. As a bad guy who doesn't know the seed, it is now far harder for me to reliably cause harm. If you look closely in the above visualisation and the one before it, they're the same values being hashed but they produce different hash values. The implication of this is that if you hash a value with one seed, and want to be able to compare against it in the future, you need to make sure you use the same seed. Having different values for different seeds doesn't affect the hash map use-case, because hash maps only live for the duration the program is running. 
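As a rough illustration of the seed parameter, here is a sketch that assumes a hypothetical FNV-1a-style function called seededHash rather than murmur3. Mixing the seed into the starting state like this is enough to show the idea, but it is not a HashDoS-resistant construction; processSeed and bucketIndex are likewise names made up for this example.

```javascript
// Hypothetical seeded hash (FNV-1a style), standing in for murmur3.
// The seed is XORed into the starting state before any input is mixed in.
function seededHash(input, seed = 0) {
  let h = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

const a = seededHash("hello", 0);
const b = seededHash("hello", 0);
const c = seededHash("hello", 1);

console.log(a === b); // true: same input and seed, same output
console.log(a === c); // false: same input, different seed

// Pick one random 32-bit seed when the process starts and use it everywhere,
// so bucket choices are stable within a run but differ between runs.
const processSeed = Math.floor(Math.random() * 2 ** 32);

function bucketIndex(key, bucketCount) {
  return seededHash(key, processSeed) % bucketCount;
}

console.log(bucketIndex("hello", 3) === bucketIndex("hello", 3)); // true
```

The last line is the property hash maps rely on: as long as the seed stays fixed for the life of the process, a key always lands in the same bucket.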
Provided you use the same seed for the lifetime of the program, your hash maps will continue to work just fine. If you ever store hash values outside of your program, in a file for example, you need to be careful you know what seed has been used. # Playground As is tradition, I've made a playground for you to write your own hash functions and see them visualised with the grids seen in this article. Click here to try it! # Conclusion We've covered what a hash function is, some ways to measure how good it is, what happens when it's not good, and some of the ways they can be broken by bad actors. The universe of hash functions is a large one, and we've really only scratched the surface in this post. We haven't spoken about cryptographic vs non-cryptographic hashing, we've touched on only 1 of the thousands of use-cases for hash functions, and we haven't talked about how exactly modern hash functions actually work. Some further reading I recommend if you're really enthusiastic about this topic and want to learn more: https://github.com/rurban/smhasher this repository is the gold standard for testing how good hash functions are. They run a tonne of tests against a wide number of hash functions and present the results in a big table. It will be difficult to understand what all of the tests are for, but this is where the state of the art of hash testing lives. https://djhworld.github.io/hyperloglog/ this is an interactive piece on a data structure called HyperLogLog. It's used to efficiently count the number of unique elements in very, very large sets. It uses hashing to do it in a really clever way. https://www.gnu.org/software/gperf/ is a piece of software that, when given the expected set of things you want to hash, can generate a "perfect" hash function automatically. Feel free to join the discussion on Hacker News! # Acknowledgements Thanks to everyone who read early drafts and provided invaluable feedback. 
delroth, Manon, Aaron, Charlie. And everyone who helped me find murmur3 hash collisions: Indy, Aaron, Max. # Patreon After the success of Load Balancing and Memory Allocation, I have decided to set up a Patreon page: https://patreon.com/samwho. For all of these articles going forward, I am going to post a Patreon-exclusive behind-the-scenes post talking about decisions, difficulties, and lessons learned from each post. It will give you a deep look into how these articles evolve, and I'm really stoked about the one I've written for this one. If you enjoy my writing, and want to support it going forward, I'd really appreciate you becoming a patron. ❤️

over a year ago • 46 votes

More in programming

The Framework Desktop is a beast

I've been running the Framework Desktop for a few months here in Copenhagen now. It's an incredible machine. It's completely quiet, even under heavy, stress-all-cores load. It's tiny too, at just 4.5L of volume, especially compared to my old beautiful but bulky North tower running the 7950X — yet it's faster! And finally, it's simply funky, quirky, and fun! In some ways, the Framework Desktop is a curious machine. Desktop PCs are already very user-repairable! So why is Framework even bringing their talents to this domain? In the laptop realm, they're basically alone with that concept, but in the desktop space, it's rather crowded already. Yet it somehow still makes sense. Partly because Framework has gone with the AMD Ryzen AI Max 395+, which is technically a laptop CPU. You can find it in the ASUS ROG Flow Z13 and the HP ZBook Ultra. Which means it'll fit in a tiny footprint, and Framework apparently just wanted to see what they could do in that form factor. They clearly had fun with it. Look at mine: There are 21 little tiles on the front that you can get in a bunch of different colors or with logos from Framework. Or you can 3D print your own! It's a welcome change in aesthetic from the brushed aluminum or gamer-focused RGBs approach that most of the competition is taking. But let's cut to the benchmarks. That's really why you'd buy a machine like the Framework Desktop. There are significantly cheaper mini PCs available from Beelink and others, but so far, Framework has the only AMD 395+ unit on sale that's completely silent (the GMKTec very much is not, nor is the Z3 Flow). And for me, that's just a dealbreaker. I can't listen to roaring fans anymore. Here's the key benchmark for me: That's the only type of multi-core workload I really sit around waiting on these days, and the Framework Desktop absolutely crushes it. It's almost twice as fast as the Beelink SER8 and still a solid third faster than the Beelink SER9 too. 
Of course, it's also a lot more expensive, but you're clearly getting some multi-core bang for your buck here! It's even a more dramatic difference to the Macs. It's a solid 40% faster than the M4 Max and 50% faster than the M4 Pro! Now some will say "that's just because Docker is faster on Linux," and they're not entirely wrong. Docker runs natively on Linux, so for this test, where the MySQL/Redis/ElasticSearch data stores run in Docker while Ruby and the app code runs natively, that's part of the answer. Last I checked, it was about 25% of the difference. But so what? Docker is an integral part of the workflow for tons of developers. We use it to be able to run different versions of MySQL, Redis, and ElasticSearch for different applications on the same machine at the same time. You can't really do that without Docker. So this is what Real World benchmarks reveal. It's not just about having a Docker advantage, though. The AMD 395+ is also incredibly potent in RAW CPU performance. Those 16 Zen5 cores are running at 5.1GHz, and in Geekbench 6 multicore, this is how they stack up: Basically matching the M4 Max! And a good chunk faster than the M4 Pro (as well as other AMDs and Intel's 14900K!). No wonder that it's crazy quick with a full-core stress test like running 30,000 assertions for our HEY test suite. To be fair, the M4s are faster in single-core performance. Apple holds the crown there. It's about 20%. And you'll see that in benchmarks like Speedometer, which mostly measures JavaScript single-core performance. The Framework Desktop puts out 670 vs 744 on the M4 Pro on Speedometer 2.1. On SP 3.1, it's an even bigger difference with 35 vs 50. But I've found that all these computers feel fast enough in single-core performance these days. I can't actually feel the difference browsing on a machine that does 670 vs 744 on SP2.1. Hell, I can barely feel the difference between the SER8, which does 506, and the M4 Pro! 
The only time I actually feel like I'm waiting on anything is in multi-core workloads like the HEY test suite, and here the AMD 395+ is very near the fastest you can get for a consumer desktop machine today at any price. It gets even better when you bring price into the equation, though. The Framework Desktop with 64GB RAM + 2TB NVMe is $1,876. To get a Mac Studio with similar specs — M4 Max, 64GB RAM, 2TB NVMe — you'll literally spend nearly twice as much at $3,299! If you go for 128GB RAM, you'll spend $2,276 on the Framework, but $4,099 on the Mac. And it'll still be way slower for development work using Docker! The Framework Desktop is simply a great deal. Speaking of 64GB vs 128GB, I've been running the 64GB version, and I almost never get anywhere close to the limits. I think the highest I've seen in regular use is about 20GB of RAM in action. Linux is really efficient. Especially when you're using a window manager like Hyprland, as we do in Omarchy. The only reason you really want to go for the full 128GB RAM is to run local LLM models. The AMD 395+ uses unified memory, like Apple, so nearly all of it is addressable to be used by the GPU. That means you can run monster models, like the new 120b gpt-oss from OpenAI. Framework has a video showing them pushing out 40 tokens/second doing just that. That seems about in range of the numbers I've seen from the M4 Max, which also seem in the 40-50 token/second range, but I'll defer to folks who benchmark local LLMs for the exact details on that. I tried running the new gpt-oss-20b on my 64GB machine, though, and I wasn't exactly blown away by the accuracy. In fact, I'd say it was pretty bad. I mean, exceptionally cool that it's doable, but very far off the frontier models we have access to as SaaS. So personally, this isn't yet something I actually use all that much in day-to-day development. I want the best models running at full speed, and right now that means SaaS. 
So if you just want the best, small computer that runs Linux superbly well out of the box, you should buy the Framework Desktop. It's completely quiet, fantastically fast, and super fun to look at. But I think it's also fair to mention that you can get something like a Beelink SER9 for half the price! Yes, it's also only 2/3 the performance in multi-core, but it's just as fast in single-core. Most developers could totally get away with the SER9, and barely notice what they were missing. But there are just as many people for whom the extra $1,000 is worth the price to run the test suite 40 seconds quicker! You know who you are. Oh, before I close, I also need to mention that this thing is a gaming powerhouse. It basically punches about as hard as an RTX 4060! With an iGPU! That's kinda crazy. Totally new territory on the PC side for integrated graphics. ETA Prime has a video showing the same chip in the GMK Tech running premier games at 1440p High Settings at great frame rates. You can run most games under Linux these days too (thanks Valve and Steam Deck!), but if you need to dual boot with Windows, the dual NVMe slots in the Framework Desktop come very handy. Framework did good with this one. AMD really blew it out of the water with the 395+. We're spoiled to have such incredible hardware available for Linux at such appealing discounts over similar stuff from Cupertino. What a great time to love open source software and tinker-friendly hardware!

21 hours ago • 4 votes

Writing: Blog Posts and Songs

I was listening to a podcast interview with Jackson Browne (American singer/songwriter, political activist, and inductee into the Rock and Roll Hall of Fame) and the interviewer asked him how he approaches writing songs with social commentaries and critiques, something along the lines of: “How do you get from the New York Times headline on a social subject to the emotional heart of a song that matters to each individual?”

Browne discusses how if you’re too subtle, people won’t know what you’re talking about. And if you’re too direct, you run the risk of making people feel like they’re being scolded. Here’s what he says about his songwriting:

“I want this to sound like you and I were drinking in a bar and we’re just talking about what’s going on in the world. Not as if you’re at some elevated place and lecturing people about something they should know about but don’t but [you think] they should care. You have to get to people where [they are, where] they do care and where they do know.”

I think that’s a great insight for anyone looking to have a connecting, effective voice. I know for me, it’s really easy to slide into a lecturing voice: you “should” do this and you “shouldn’t” do that. But I like Browne’s framing of trying to have an informal, conversational tone that meets people where they are. Like you’re discussing an issue in the bar, rather than listening to a sermon.

Chris Coyier is the canonical example of this that comes to mind. I still think of this post from CSS Tricks where Chris talks about how to have submit buttons that go to different URLs:

“When you submit that form, it’s going to go to the URL /submit. Say you need another submit button that submits to a different URL. It doesn’t matter why. There is always a reason for things. The web is a big place and all that.”

He doesn’t conjure up some universally-applicable, justified rationale for why he’s sharing this method. Nor is there any pontificating on why this is “good” or “bad”. Instead, like most of Chris’ stuff, I read it as a humble acknowledgement of the practicalities at hand: “Hey, the world is a big place. People have to do crafty things to make their stuff work. And if you’re in that situation, here’s something that might help what ails ya.”

I want to work on developing that kind of a voice because I love reading voices like that.

2 days ago • 4 votes

Doing versus Delegating

A staff+ skill

2 days ago • 7 votes

p-fast trie, but smaller

Previously, I wrote some sketchy ideas for what I call a p-fast trie, which is basically a wide fan-out variant of an x-fast trie. It allows you to find the longest matching prefix or nearest predecessor or successor of a query string in a set of names in O(log k) time, where k is the key length.

My initial sketch was more complicated and greedy for space than necessary, so here’s a simplified revision. (“p” now stands for prefix.)

layout

A p-fast trie stores a lexicographically ordered set of names.

A name is a sequence of characters from some small-ish character set. For example, DNS names can be represented as a set of about 50 letters, digits, punctuation and escape characters, usually one per byte of name. Names that are arbitrary bit strings can be split into chunks of 6 bits to make a set of 64 characters.

Every unique prefix of every name is added to a hash table. An entry in the hash table contains:

- A shared reference to the closest name lexicographically greater than or equal to the prefix. Multiple hash table entries will refer to the same name. A reference to a name might instead be a reference to a leaf object containing the name.
- The length of the prefix. To save space, each prefix is not stored separately, but implied by the combination of the closest name and prefix length.
- A bitmap with one bit per possible character, corresponding to the next character after this prefix. For every other prefix that matches this prefix and is one character longer than this prefix, a bit is set in the bitmap corresponding to the last character of the longer prefix.

search

The basic algorithm is a longest-prefix match. Look up the query string in the hash table. If there’s a match, great, done. Otherwise proceed by binary chop on the length of the query string.

- If the prefix isn’t in the hash table, reduce the prefix length and search again. (If the empty prefix isn’t in the hash table then there are no names to find.)
- If the prefix is in the hash table, check the next character of the query string in the bitmap. If its bit is set, increase the prefix length and search again. Otherwise, this prefix is the answer.

predecessor

Instead of putting leaf objects in a linked list, we can use a more complicated search algorithm to find names lexicographically closest to the query string. It’s tricky because a longest-prefix match can land in the wrong branch of the implicit trie. Here’s an outline of a predecessor search; successor requires more thought.

During the binary chop, when we find a prefix in the hash table, compare the complete query string against the complete name that the hash table entry refers to (the closest name greater than or equal to the common prefix).

- If the name is greater than the query string we’re in the wrong branch of the trie, so reduce the length of the prefix and search again.
- Otherwise search the set bits in the bitmap for one corresponding to the greatest character less than the query string’s next character; if there is one remember it and the prefix length. This will be the top of the sub-trie containing the predecessor, unless we find a longer match.
- If the next character’s bit is set in the bitmap, continue searching with a longer prefix, else stop.

When the binary chop has finished, we need to walk down the predecessor sub-trie to find its greatest leaf. This must be done one character at a time – there’s no shortcut.

thoughts

In my previous note I wondered how the number of search steps in a p-fast trie compares to a qp-trie.

I have some old numbers measuring the average depth of binary, 4-bit, 5-bit, 6-bit and 4-bit, 5-bit, dns qp-trie variants. A DNS-trie varies between 7 and 15 deep on average, depending on the data set. The number of steps for a search matches the depth for exact-match lookups, and is up to twice the depth for predecessor searches.

A p-fast trie is at most 9 hash table probes for DNS names, and unlikely to be more than 7. I didn’t record the average length of names in my benchmark data sets, but I guess they would be 8–32 characters, meaning 3–5 probes. Which is far fewer than a qp-trie, though I suspect a hash table probe takes more time than chasing a qp-trie pointer. (But this kind of guesstimate is notoriously likely to be wrong!)

However, a predecessor search might need 30 probes to walk down the p-fast trie, which I think suggests a linked list of leaf objects is a better option.
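To make the layout and the basic longest-prefix search more concrete, here is a minimal Python sketch. It is an illustration under loud simplifications, not the author's implementation: the function names `build` and `longest_prefix` are made up for this example, the hash table is a plain dict keyed by the prefix string (the note above avoids storing prefixes separately), and the per-character bitmap is modelled as a Python set. Predecessor search is not covered.

```python
def build(names):
    """Map each prefix of each name to an entry holding the closest name,
    the prefix length, and the set of possible next characters.

    Assumption for illustration: the dict key stores the prefix itself,
    and a set stands in for the bitmap.
    """
    table = {}
    for name in sorted(set(names)):
        for n in range(len(name) + 1):
            prefix = name[:n]
            if prefix not in table:
                # Names are visited in sorted order, so the first name seen
                # with this prefix is the closest name >= the prefix.
                table[prefix] = {"name": name, "length": n, "next": set()}
            if n < len(name):
                table[prefix]["next"].add(name[n])
    return table


def longest_prefix(table, query):
    """Return the longest prefix of query present in the table, or None
    if the table holds no names at all."""
    if "" not in table:
        return None  # no empty prefix means no names to find
    if query in table:
        return query  # direct hit on the first probe
    # Binary chop on prefix length. query[:lo] is known to be present;
    # query[:hi + 1] is known to be absent.
    lo, hi = 0, len(query) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        entry = table.get(query[:mid])
        if entry is None:
            hi = mid - 1  # prefix too long: reduce the length
        elif query[mid] in entry["next"]:
            lo = mid + 1  # next character's bit is set: go longer
        else:
            return query[:mid]  # present, but cannot be extended
    return query[:lo]
```

For example, with the names `example.com`, `example.org`, `exam` and `test`, searching for `example.net` probes the full string, then the prefixes of length 5 and 8, and stops at `example.` because `n` is not among the next characters recorded for that prefix.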

2 days ago • 4 votes

Software books I wish I could read

New Logic for Programmers Release!

v0.11 is now available! This is over 20% longer than v0.10, with a new chapter on code proofs, three chapter overhauls, and more! Full release notes here.

Software books I wish I could read

I'm writing Logic for Programmers because it's a book I wanted to have ten years ago. I had to learn everything in it the hard way, which is why I'm ensuring that everybody else can learn it the easy way.

Books occupy a sort of weird niche in software. We're great at sharing information via blogs and git repos and entire websites. These have many benefits over books: they're free, they're easily accessible, they can be updated quickly, they can even be interactive. But no blog post has influenced me as profoundly as Data and Reality or Making Software. There is no blog or talk about debugging as good as the Debugging book.

It might not be anything deeper than "people spend more time per word on writing books than blog posts". I dunno.

So here are some other books I wish I could read. I don't think any of them exist yet but it's a big world out there. Also while they're probably best as books, a website or a series of blog posts would be ok too.

Everything about Configurations

The whole topic of how we configure software, whether by CLI flags, environmental vars, or JSON/YAML/XML/Dhall files. What causes the configuration complexity clock? How do we distinguish between basic, advanced, and developer-only configuration options? When should we disallow configuration? How do we test all possible configurations for correctness? Why do so many widespread outages trace back to misconfiguration, and how do we prevent them?

I also want the same for plugin systems. Manifests, permissions, common APIs and architectures, etc. Configuration management is more universal, though, since everybody either uses software with configuration or has made software with configuration.
The Big Book of Complicated Data Schemas

I guess this would kind of be like Schema.org, except with a lot more on the "why" and not the what. Why is it important for the Volcano model to have a "smokingAllowed" field? [1]

I'd see this less as "here's your guide to putting Volcanos in your database" and more "here's recurring motifs in modeling interesting domains", to help a person see sources of complexity in their own domain. Does something crop up if the references can form a cycle? If a relationship needs to be strictly temporary, or a reference can change type? Bonus: path dependence in data models, where an additional requirement leads to a vastly different ideal data model that a company couldn't do because they made the old model.

(This has got to exist, right? Business modeling is a big enough domain that this must exist. Maybe The Essence of Software touches on this? Man I feel bad I haven't read that yet.)

Computer Science for Software Engineers

Yes, I checked, this book does not exist (though maybe this is the same thing). I don't have any formal software education; everything I know was either self-taught or learned on the job. But it's way easier to learn software engineering that way than computer science. And I bet there's a lot of other engineers in the same boat.

This book wouldn't have to be comprehensive or instructive: just enough about each topic to understand why it's an area of study and appreciate how research in it eventually finds its way into practice.

MISU Patterns

MISU, or "Make Illegal States Unrepresentable", is the idea of designing system invariants in the structure of your data. For example, if a Contact needs at least one of email or phone to be non-null, make it a sum type over EmailContact, PhoneContact, EmailPhoneContact (from this post). MISU is great.

Most MISU in the wild look very different than that, though, because the concept of MISU is so broad there's lots of different ways to achieve it.
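As a rough illustration of the Contact example, here is a small Python sketch. The three variant names come from the text above; everything else is an assumption made for this example, including the field names, the `Contact` alias, and the hypothetical `make_contact` helper. Python only checks the union statically (with a type checker such as mypy), so this approximates a sum type rather than enforcing one at runtime.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class EmailContact:
    email: str


@dataclass(frozen=True)
class PhoneContact:
    phone: str


@dataclass(frozen=True)
class EmailPhoneContact:
    email: str
    phone: str


# The "sum type": a Contact is exactly one of the three variants, so there
# is no variant in which both email and phone are missing.
Contact = Union[EmailContact, PhoneContact, EmailPhoneContact]


def make_contact(email: Optional[str] = None,
                 phone: Optional[str] = None) -> Contact:
    """Hypothetical helper that turns nullable inputs into a Contact,
    rejecting the one illegal combination at the boundary."""
    if email is not None and phone is not None:
        return EmailPhoneContact(email, phone)
    if email is not None:
        return EmailContact(email)
    if phone is not None:
        return PhoneContact(phone)
    raise ValueError("a contact needs at least an email or a phone")
```

Code that receives a `Contact` never has to ask "what if both are null?", because no variant can express that state; the check happens once, where untrusted data enters.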
And that means there are "patterns": smart constructors, product types, properly using sets, newtypes to some degree, etc. Some of them are specific to typed FP, while others can be used in even untyped languages. Someone oughta make a pattern book.

My one request would be to not give them cutesy names. Do something like the Aarne–Thompson–Uther Index, where items are given names like "Recognition by manner of throwing cakes of different weights into faces of old uncles". Names can come later.

The Tools of '25

Not something I'd read, but something to recommend to junior engineers. Starting out it's easy to think the only bit that matters is the language or framework and not realize the enormous amount of surrounding tooling you'll have to learn. This book would cover the basics of tools that enough developers will probably use at some point: git, VSCode, very basic Unix and bash, curl. Maybe the general concepts of tools that appear in every ecosystem, like package managers, build tools, task runners. That might be easier if we specialize this to one particular domain, like webdev or data science.

Ideally the book would only have to be updated every five years or so. No LLM stuff because I don't expect the tooling will be stable through 2026, to say nothing of 2030.

A History of Obsolete Optimizations

Probably better as a really long blog series. Each chapter would be broken up into two parts:

- A deep dive into a brilliant, elegant, insightful historical optimization designed to work within the constraints of that era's computing technology
- What we started doing instead, once we had more compute/network/storage available.

cf. A Spellchecker Used to Be a Major Feat of Software Engineering. Bonus topics would be brilliance obsoleted by standardization (like what people did before git and json were universal), optimizations we do today that may not stand the test of time, and optimizations from the past that did.

Sphinx Internals

I need this. I've spent so much goddamn time digging around in Sphinx and docutils source code I'm gonna throw up.

Systems Distributed Talk Today!

Online premiere's at noon central / 5 PM UTC, here! I'll be hanging out to answer questions and be awkward. You ever watch a recording of your own talk? It's real uncomfortable!

[1] In this case because it's a field on one of Volcano's supertypes. I guess schemas gotta follow LSP too.

2 days ago • 9 votes
