Full Width [alt+shift+f] Shortcuts [alt+shift+k]
Sign Up [alt+shift+s] Log In [alt+shift+l]
26
I had a complicated relationship with Twitter (now X). I wasted a lot of time there, got into some embarrassing arguments, and wrote things I now regret. But it’s also where I met friends and partners, started my career in cultural heritage, and began my queer awakening. Maybe one day I’ll write a longer article about what Twitter meant to me. I haven’t been active on Twitter for several years, and I finally decided to remove all my old tweets. I used TweetDeleter.com to get rid of them. I’ll post one final tweet linking to this article and telling my followers where else to find me on the web, but after that I’m done with Twitter. Why? I felt like I was getting very little benefit to leaving my old tweets up, but I was feeling increasingly uncomfortable about it. I don’t think many people were reading them and getting something useful out of it. I kept getting notifications from people who’d found an old tweet, and wanted to start an argument or hurl abuse in my direction. I can do...
a month ago

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from alexwlchan

Making inventory spreadsheets for my LEGO sets

One of my recent home organisation projects has been sorting out my LEGO collection. I have a bunch of sets which are mixed together in one messy box, and I’m trying to separate bricks back into distinct sets. My collection is nowhere near large enough to be worth sorting by individual parts, and I hope that breaking down by set will make it all easier to manage and store. I’ve been creating spreadsheets to track the parts in each set, and count them out as I find them. I briefly hinted at this in my post about looking at images in spreadsheets, where I included a screenshot of one of my inventory spreadsheets: These spreadsheets have been invaluable – I can see exactly what pieces I need, and what pieces I’m missing. Without them, I wouldn’t even attempt this. I’m about to pause this cleanup and work on some other things, but first I wanted to write some notes on how I’m creating these spreadsheets – I’ll probably want them again in the future. Getting a list of parts in a set There are various ways to get a list of parts in a LEGO set: Newer LEGO sets include a list of parts at the back of the printed instructions You can get a list from LEGO-owned website like LEGO.com or BrickLink There are community-maintained databases on sites like Rebrickable I decided to use the community maintained lists from Rebrickable – they seem very accurate in my experience, and you can download daily snapshots of their entire catalog database. The latter is very powerful, because now I can load the database into my tools of choice, and slice and dice the data in fun and interesting ways. Downloading their entire database is less than 15MB – which is to say, two-thirds the size of just opening the LEGO.com homepage. Bargain! Putting Rebrickable data in a SQLite database My tool of choice is SQLite. I slept on this for years, but I’ve come to realise just how powerful and useful it can be. A big part of what made me realise the power of SQLite is seeing Simon Willison’s work with datasette, and some of the cool things he’s built on top of SQLite. Simon also publishes a command-line tool sqlite-utils for manipulating SQLite databases, and that’s what I’ve been using to create my spreadsheets. Here’s my process: Create a Python virtual environment, and install sqlite-utils: python3 -m venv .venv source .venv/bin/activate pip install sqlite-utils At time of writing, the latest version of sqlite-utils is 3.38. Download the Rebrickable database tables I care about, uncompress them, and load them into a SQLite database: curl -O 'https://cdn.rebrickable.com/media/downloads/colors.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/parts.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/inventories.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/inventory_parts.csv.gz' gunzip colors.csv.gz gunzip parts.csv.gz gunzip inventories.csv.gz gunzip inventory_parts.csv.gz sqlite-utils insert lego_parts.db colors colors.csv --csv sqlite-utils insert lego_parts.db parts parts.csv --csv sqlite-utils insert lego_parts.db inventories inventories.csv --csv sqlite-utils insert lego_parts.db inventory_parts inventory_parts.csv --csv The inventory_parts table describes how many of each part there are in a set. “Set S contains 10 of part P in colour C.” The parts and colors table contains detailed information about each part and color. The inventories table matches the official LEGO set numbers to the inventory IDs in Rebrickable’s database. “The set sold by LEGO as 6616-1 has ID 4159 in the inventory table.” Run a SQLite query that gets information from the different tables to tell me about all the parts in a particular set: SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num FROM inventory_parts ip JOIN inventories i ON ip.inventory_id = i.id JOIN parts p ON ip.part_num = p.part_num JOIN colors c ON ip.color_id = c.id WHERE i.set_num = '6616-1'; Or use sqlite-utils to export the query results as a spreadsheet: sqlite-utils lego_parts.db " SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num FROM inventory_parts ip JOIN inventories i ON ip.inventory_id = i.id JOIN parts p ON ip.part_num = p.part_num JOIN colors c ON ip.color_id = c.id WHERE i.set_num = '6616-1';" --csv > 6616-1.csv Here are the first few lines of that CSV: img_url,quantity,is_spare,color,name,part_num https://cdn.rebrickable.com/media/parts/photos/9999/23064-9999-e6da02af-9e23-44cd-a475-16f30db9c527.jpg,1,False,[No Color/Any Color],Sticker Sheet for Set 6616-1,23064 https://cdn.rebrickable.com/media/parts/elements/4523412.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Chequered Print,2335pr0019 https://cdn.rebrickable.com/media/parts/photos/15/2335px13-15-33ae3ea3-9921-45fc-b7f0-0cd40203f749.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Octan Logo Print,2335pr0024 https://cdn.rebrickable.com/media/parts/elements/4141999.jpg,4,False,Green,Tile Special 1 x 2 Grille with Bottom Groove,2412b https://cdn.rebrickable.com/media/parts/elements/4125254.jpg,4,False,Orange,Tile Special 1 x 2 Grille with Bottom Groove,2412b Import that spreadsheet into Google Sheets, then add a couple of columns. I add a column image where every cell has the formula =IMAGE(…) that references the image URL. This gives me an inline image, so I know what that brick looks like. I add a new column quantity I have where every cell starts at 0, which is where I’ll count bricks as I find them. I add a new column remaining to find which counts the difference between quantity and quantity I have. Then I can highlight or filter for rows where this is non-zero, so I can see the bricks I still need to find. If you’re interested, here’s an example spreadsheet that has a clean inventory. It took me a while to refine the SQL query, but now I have it, I can create a new spreadsheet in less than a minute. One of the things I’ve realised over the last year or so is how powerful “get the data into SQLite” can be – it opens the door to all sorts of interesting queries and questions, with a relatively small amount of code required. I’m sure I could write a custom script just for this task, but it wouldn’t be as concise or flexible. [If the formatting of this post looks odd in your feed reader, visit the original article]

19 hours ago 3 votes
Adding auto-generated cover images to EPUBs downloaded from AO3

I was chatting with a friend recently, and she mentioned an annoyance when reading fanfiction on her iPad. She downloads fic from AO3 as EPUB files, and reads it in the Kindle app – but the files don’t have a cover image, and so the preview thumbnails aren’t very readable: She’s downloaded several hundred stories, and these thumbnails make it difficult to find things in the app’s “collections” view. This felt like a solvable problem. There are tools to add cover images to EPUB files, if you already have the image. The EPUB file embeds some key metadata, like the title and author. What if you had a tool that could extract that metadata, auto-generate an image, and use it as the cover? So I built that. It’s a small site where you upload EPUB files you’ve downloaded from AO3, the site generates a cover image based on the metadata, and it gives you an updated EPUB to download. The new covers show the title and author in large text on a coloured background, so they’re much easier to browse in the Kindle app: If you’d find this helpful, you can use it at alexwlchan.net/my-tools/add-cover-to-ao3-epubs/ Otherwise, I’m going to explain how it works, and what I learnt from building it. There are three steps to this process: Open the existing EPUB to get the title and author Generate an image based on that metadata Modify the EPUB to insert the new cover image Let’s go through them in turn. Open the existing EPUB I’ve not worked with EPUB before, and I don’t know much about it. My first instinct was to look for Python EPUB libraries on PyPI, but there was nothing appealing. The results were either very specific tools (convert EPUB to/from format X) or very unmaintained (the top result was last updated in April 2014). I decied to try writing my own code to manipulate EPUBs, rather than using somebody else’s library. I had a vague memory that EPUB files are zips, so I changed the extension from .epub to .zip and tried unzipping one – and it turns out that yes, it is a zip file, and the internal structure is fairly simple. I found a file called content.opf which contains metadata as XML, including the title and author I’m looking for: <?xml version='1.0' encoding='utf-8'?> <package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uuid_id"> <metadata xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:calibre="http://calibre.kovidgoyal.net/2009/metadata"> <dc:title>Operation Cameo</dc:title> <meta name="calibre:timestamp" content="2025-01-25T18:01:43.253715+00:00"/> <dc:language>en</dc:language> <dc:creator opf:file-as="alexwlchan" opf:role="aut">alexwlchan</dc:creator> <dc:identifier id="uuid_id" opf:scheme="uuid">13385d97-35a1-4e72-830b-9757916d38a7</dc:identifier> <meta name="calibre:title_sort" content="operation cameo"/> <dc:description><p>Some unusual orders arrive at Operation Mincemeat HQ.</p></dc:description> <dc:publisher>Archive of Our Own</dc:publisher> <dc:subject>Fanworks</dc:subject> <dc:subject>General Audiences</dc:subject> <dc:subject>Operation Mincemeat: A New Musical - SpitLip</dc:subject> <dc:subject>No Archive Warnings Apply</dc:subject> <dc:date>2023-12-14T00:00:00+00:00</dc:date> </metadata> … That dc: prefix was instantly familiar from my time working at Wellcome Collection – this is Dublin Core, a standard set of metadata fields used to describe books and other objects. I’m unsurprised to see it in an EPUB; this is exactly how I’d expect it to be used. I found an article that explains the structure of an EPUB file, which told me that I can find the content.opf file by looking at the root-path element inside the mandatory META-INF/container.xml file which is every EPUB. I wrote some code to find the content.opf file, then a few XPath expressions to extract the key fields, and I had the metadata I needed. Generate a cover image I sketched a simple cover design which shows the title and author. I wrote the first version of the drawing code in Pillow, because that’s what I’m familiar with. It was fine, but the code was quite flimsy – it didn’t wrap properly for long titles, and I couldn’t get custom fonts to work. Later I rewrote the app in JavaScript, so I had access to the HTML canvas element. This is another tool that I haven’t worked with before, so a fun chance to learn something new. The API felt fairly familiar, similar to other APIs I’ve used to build HTML elements. This time I did implement some line wrapping – there’s a measureText() API for canvas, so you can see how much space text will take up before you draw it. I break the text into words, and keeping adding words to a line until measureText tells me the line is going to overflow the page. I have lots of ideas for how I could improve the line wrapping, but it’s good enough for now. I was also able to get fonts working, so I picked Georgia to match the font used for titles on AO3. Here are some examples: I had several ideas for choosing the background colour. I’m trying to help my friend browse her collection of fic, and colour would be a useful way to distinguish things – so how do I use it? I realised I could get the fandom from the EPUB file, so I decided to use that. I use the fandom name as a seed to a random number generator, then I pick a random colour. This means that all the fics in the same fandom will get the same colour – for example, all the Star Wars stories are a shade of red, while Star Trek are a bluey-green. This was a bit harder than I expected, because it turns out that JavaScript doesn’t have a built-in seeded random number generator – I ended up using some snippets from a Stack Overflow answer, where bryc has written several pseudorandom number generators in plain JavaScript. I didn’t realise until later, but I designed something similar to the placeholder book covers in the Apple Books app. I don’t use Apple Books that often so it wasn’t a deliberate choice to mimic this style, but clearly it was somewhere in my subconscious. One difference is that Apple’s app seems to be picking from a small selection of background colours, whereas my code can pick a much nicer variety of colours. Apple’s choices will have been pre-approved by a designer to look good, but I think mine is more fun. Add the cover image to the EPUB My first attempt to add a cover image used pandoc: pandoc input.epub --output output.epub --epub-cover-image cover.jpeg This approach was no good: although it added the cover image, it destroyed the formatting in the rest of the EPUB. This made it easier to find the fic, but harder to read once you’d found it. An EPUB file I downloaded from AO3, before/after it was processed by pandoc. So I tried to do it myself, and it turned out to be quite easy! I unzipped another EPUB which already had a cover image. I found the cover image in OPS/images/cover.jpg, and then I looked for references to it in content.opf. I found two elements that referred to cover images: <?xml version="1.0" encoding="UTF-8"?> <package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="PrimaryID"> <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf"> <meta name="cover" content="cover-image"/> … </metadata> <manifest> <item id="cover-image" href="images/cover.jpg" media-type="image/jpeg" properties="cover-image"/> … </manifest> </package> This gave me the steps for adding a cover image to an EPUB file: add the image file to the zipped bundle, then add these two elements to the content.opf. Where am I going to deploy this? I wrote the initial prototype of this in Python, because that’s the language I’m most familiar with. Python has all the libraries I need: The zipfile module can unpack and modify the EPUB/ZIP The xml.etree or lxml modules can manipulate XML The Pillow library can generate images I built a small Flask web app: you upload the EPUB to my server, my server does some processing, and sends the EPUB back to you. But for such a simple app, do I need a server? I tried rebuilding it as a static web page, doing all the processing in client-side JavaScript. That’s simpler for me to host, and it doesn’t involve a round-trip to my server. That has lots of other benefits – it’s faster, less of a privacy risk, and doesn’t require a persistent connection. I love static websites, so can they do this? Yes! I just had to find a different set of libraries: The JSZip library can unpack and modify the EPUB/ZIP, and is the only third-party code I’m using in the tool Browsers include DOMParser for manipulating XML I’ve already mentioned the HTML <canvas> element for rendering the image This took a bit longer because I’m not as familiar with JavaScript, but I got it working. As a bonus, this makes the tool very portable. Everything is bundled into a single HTML file, so if you download that file, you have the whole tool. If my friend finds this tool useful, she can save the file and keep a local copy of it – she doesn’t have to rely on my website to keep using it. What should it look like? My first design was very “engineer brain” – I just put the basic controls on the page. It was fine, but it wasn’t good. That might be okay, because the only person I need to be able to use this app is my friend – but wouldn’t it be nice if other people were able to use it? If they’re going to do that, they need to know what it is – most people aren’t going to read a 2,500 word blog post to understand a tool they’ve never heard of. (Although if you have read this far, I appreciate you!) I started designing a proper page, including some explanations and descriptions of what the tool is doing. I got something that felt pretty good, including FAQs and acknowledgements, and I added a grey area for the part where you actually upload and download your EPUBs, to draw the user’s eye and make it clear this is the important stuff. But even with that design, something was missing. I realised I was telling you I’d create covers, but not showing you what they’d look like. Aha! I sat down and made up a bunch of amusing titles for fanfic and fanfic authors, so now you see a sample of the covers before you upload your first EPUB: This makes it clearer what the app will do, and was a fun way to wrap up the project. What did I learn from this project? Don’t be scared of new file formats My first instinct was to look for a third-party library that could handle the “complexity” of EPUB files. In hindsight, I’m glad I didn’t find one – it forced me to learn more about how EPUBs work, and I realised I could write my own code using built-in libraries. EPUB files are essentially ZIP files, and I only had basic needs. I was able to write my own code. Because I didn’t rely on a library, now I know more about EPUBs, I have code that’s simpler and easier for me to understand, and I don’t have a dependency that may cause problems later. There are definitely some file formats where I need existing libraries (I’m not going to write my own JPEG parser, for example) – but I should be more open to writing my own code, and not jumping to add a dependency. Static websites can handle complex file manipulations I love static websites and I’ve used them for a lot of tasks, but mostly read-only display of information – not anything more complex or interactive. But modern JavaScript is very capable, and you can do a lot of things with it. Static pages aren’t just for static data. One of the first things I made that got popular was find untagged Tumblr posts, which was built as a static website because that’s all I knew how to build at the time. Somewhere in the intervening years, I forgot just how powerful static sites can be. I want to build more tools this way. Async JavaScript calls require careful handling The JSZip library I’m using has a lot of async functions, and this is my first time using async JavaScript. I got caught out several times, because I forgot to wait for async calls to finish properly. For example, I’m using canvas.toBlob to render the image, which is an async function. I wasn’t waiting for it to finish, and so the zip would be repackaged before the cover image was ready to add, and I got an EPUB with a missing image. Oops. I think I’ll always prefer the simplicity of synchronous code, but I’m sure I’ll get better at async JavaScript with practice. Final thoughts I know my friend will find this helpful, and that feels great. Writing software that’s designed for one person is my favourite software to write. It’s not hyper-scale, it won’t launch the next big startup, and it’s usually not breaking new technical ground – but it is useful. I can see how I’m making somebody’s life better, and isn’t that what computers are for? If other people like it, that’s a nice bonus, but I’m really thinking about that one person. Normally the one person I’m writing software for is me, so it’s extra nice when I can do it for somebody else. If you want to try this tool yourself, go to alexwlchan.net/my-tools/add-cover-to-ao3-epubs/ If you want to read the code, it’s all on GitHub. [If the formatting of this post looks odd in your feed reader, visit the original article]

4 days ago 10 votes
Looking at images in a spreadsheet

I’ve had a couple of projects recently where I needed to work with a list that involved images. For example, choosing a series of photos to print, or making an inventory of Lego parts. I could write a simple text list, but it’s really helpful to be able to see the images as part of the list, especially when I’m working with other people. The best tool I’ve found is Google Sheets – not something I usually associate with pictures! I’m using Google Sheets, and I use the IMAGE function, which inserts an image into a cell. For example: =IMAGE("https://www.google.com/images/srpr/logo3w.png") There’s a similar function in Microsoft Excel, but not in Apple Numbers. This function can reference values in other cells, so I’ll often prepare my spreadsheet in another tool – say, a Python script – and include an image URL in one of the columns. When I import the spreadsheet into Google Sheets, I use IMAGE() to reference that column, and then I see inline images. After that, I tend to hide the column with the image URL, and resize the rows/columns containing images to make them bigger and easier to look at. I often pair this with the HYPERLINK function, which can add a clickable link to a cell. This is useful to link to the source of the image, or to more detail I can’t fit in the spredsheet. I don’t know how far this approach can scale – I’ve never tried more than a thousand or so images in a single spreadsheet – but it’s pretty cool that it works at all! Using a spreadsheet gives me a simple, lightweight interface that most people are already familiar with. It doesn’t take much work on my part, and I get useful features like sorting and filtering for “free”. Previously I’d only thought of spreadsheets as a tool for textual data, and being able to include images has made them even more powerful. [If the formatting of this post looks odd in your feed reader, visit the original article]

2 weeks ago 25 votes
randline: get a random selection of lines in a file using reservoir sampling

I’ve posted another command-line tool on GitHub: randline, which gives you a random selection of lines in a file: $ randline < /usr/share/dict/words ultraluxurious $ randline 3 < /usr/share/dict/words unexceptionably baselessness salinity There are lots of tools that solve this problem; I wrote my own as a way to get some more Rust practice and try a new-to-me technique called reservoir sampling. Existing approaches There’s a shuf command in coreutils which is designed to do this exact thing: $ shuf -n 3 /usr/share/dict/words brimstone melody's reimbursed But I don’t have coreutils on my Mac, so I can’t use shuf. You can do this in lots of other ways using tools like awk, sort and perl. If you’re interested, check out these Stack Overflow and Unix & Linux Stack Exchange threads for examples. For my needs, I wrote a tiny Python script called randline which I saved in my PATH years ago, and I haven’t thought much about since: import random import sys if __name__ == "__main__": lines = sys.stdin.read().splitlines() try: k = int(sys.argv[1]) except IndexError: k = 1 random.shuffle(lines) print("\n".join(lines[:k])) (I’m not sure why my past self decided not to use random.sample. I suspect I’d forgotten about it.) This script has worked fine, but I stumbled across it recently and it got me thinking. This approach isn’t very efficient – it has to load the whole file into memory. Can we do better? Reservoir sampling In other Python scripts, I process files as a stream – look at one line at a time, rather than loading the whole file at once. This doesn’t make much difference for small files, but it pays off when you have really big files. I couldn’t think of a good way to take a random sample of a file using streaming, and still get a uniform distribution – but smart people have already thought about this. I did some reading and I found a technique called reservoir sampling. The introduction in the Wikipedia article makes it clear this is exactly what I want: Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large for all n items to fit into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items. The basic idea is that rather than holding the whole file in memory at once, I can keep a fixed-sized buffer – or “reservoir” – of the items I’ve selected. As I go line-by-line through the file, I can add or remove items in this resevoir, and it will always use about the same amount of memory. I’m only holding a line in memory if it’s in the reservoir, not every line in the file. Algorithm L The Wikipedia article describes several algorithms, including a simple Algorithm R and an optimal Algorithm L. The underlying principle of Algorithm L is pretty concise: If we generate $n$ random numbers $u_1, \ldots, u_n \sim U[0,1]$ independently, then the indices of the smallest $k$ of them is a uniform sample of the $k$-subsets of $\{1, \ldots, n\}$. There’s no proof in the Wikipedia article, but I wanted to satisfy myself that this is true. If you’re happy to take it as given, you can skip the maths and go to the next section. Here’s my attempt at a justification: What we really care about is the relative ranking of the $u_1, \ldots, u_n$, not their actual values – we care whether, for example, $u_1 < u_2$, but not the exact difference between them. Because the variables are independent and they have the same distribution, every possible ranking is equally likely. Every variable is the same, so none of them can be “special” or favoured above the others. This means that each permutation of the indices $\{1, \ldots, n\}$ is equally likely. There are $n!$ such permutations, so each occurs with probability $1/n!$. For a given $k$-subset, we’re interested in permuations where this subset is the first $k$ items. This means the probability that a particular $k$-subset will be selected is a simple fraction: $$ \begin{equation*} \text{probability of selecting this }k\text{-subset} = \frac{\text{# of permutations where this subset is the first }k\text{ items}} {\text{# of permutations}} \end{equation*} $$ How many permutations are there where this $k$-subset is the first $k$ items? There are $k!$ ways to arrange this $k$-subset as the first $k$ digits, and $(n-k)!$ ways to arrange the remaining digits. This means there are $k!\left(n-k\right)!$ permutations that match, and so: $$ \begin{equation*} \text{probability of selecting this }k\text{-subset} = \frac{k!\left(n-k\right)!}{n!} \end{equation*} $$ This probability is the same for every $k$-subset, so each one is equally likely – which is the thing we care about. This was enough to give me the confidence to try implementing Algorithm L. Implementing Algorithm L in an efficient way If we don’t know $n$ upfront, we could save all the items and only then generate the random variables $u_1, \ldots, u_n \sim U[0,1]$ – but that’s precisely the sort of inefficiency I’m trying to avoid! Fortunately, we don’t need to: the nice thing about this algorithm is that we only need to track the $k$ smallest values of $u_1, \ldots, u_i$ we’ve seen so far. Once a value is larger than the $k$ smallest, we can safely discard it because we know it’ll never be used. Here’s the approach I took: Create an empty “reservoir” of $k$ items. As you get items, assign each one a “weight” and start filling the reservoir. (These weights are the random variables $u_1, \ldots, u_n$.) If you run out of items before you fill the reservoir, go to step 4. If you fill the reservoir and there are more items, calculate the largest weight of the items in the reservoir, and go to step 3. Once the reservoir is full, go through the remaining items one-by-one. For each item, assign it a weight. If the weight of this new item is larger than the largest weight already in the reservoir, discard the item. This weight isn’t in the $k$ smallest, so we don’t care about it. If the weight of this new item is smaller than the largest weight in the resevoir, then add the item to the reservoir and remove the item with the previously-largest weight. Recalculate the largest weight of the items in the reservoir. When you run out of items, go to step 4. Return the items in the reservoir. This is your random sample. This approach means we only have to hold a fixed number of items/weights in memory at a time – much more efficient, and it should scale to an arbitrarily large number of inputs. It’s a bit too much code to include here, but you can read my Rust implementation on GitHub. I wrote some tests, which include a statistical test – I run the sampling code 10,000 times, and check the results are the uniform distribution I want. What did I learn about Rust? This is only about 250 lines of Rust, but it was still good practice, and I learnt a few new things. Working with generics I’ve used generics in other languages, and I’d read about them in the Rust Book, but I’d never written my own code using generics in Rust. I used a generic to write my sampling function: fn reservoir_sample<T>( mut items: impl Iterator<Item = T>, k: usize) -> Vec<T> { … } It was straightforward, and there were no big surprises. The difference between .iter() and .into_iter() I’ve used both of these methods before, but I only understood part of the difference. When you call .iter(), you’re borrowing the vector, which means it can be used later. When you call .into_iter(), you’re consuming the vector, which means it can’t be used later. I hadn’t thought about how this affects the types. When you call .iter(), you get an iterator of references. When you call .into_iter(), you get an iterator of values. This caused me some confusion when I was writing a test. Consider the following code: fn reservoir_sample<T>( mut items: impl Iterator<Item = T>, k: usize) -> Vec<T> { … } let letters = vec!["A", "A", "A"]; let items = letters.iter(); assert_eq!(reservoir_sample(items, 1), vec!["A"]); I was trying to write a test that reservoir_sample would only return the number of items I asked for, and no more. This was my first attempt, and it doesn’t compile. When I call letters.iter(), I’m getting an iterator of string references, that is Iterator<&&str>. Then I’m comparing it to a Vec<&str>, but Rust doesn’t know how to check equality of &str and &&str, so it refuses to compile this code. There are two ways I could fix this: Use .into_iter(), so I get an iterator of string values, i.e. Iterator<&str>. let letters = vec!["A", "A", "A"]; let items = letters.into_iter(); assert_eq!(reservoir_sample(items, 1), vec!["A"]); Change the expected result so it’s a Vec<&&str>: let letters = vec!["A", "A", "A"]; let items = letters.iter(); assert_eq!(reservoir_sample(items, 1), vec![&"A"]); I used .into_iter() in my tests. This sort of distinction is probably obvious to more experienced Rust programmers, but it was new to me. I’ve read about these methods, but I only understand them by writing code. Arrays are indexed with usize I wasn’t sure what type I should use for k, the size of the random sample. It’s a positive integer, but should I use u32 or usize? I read the descriptions of both, and it wasn’t immediately obvious which was preferable. I looked to Vec::with_capacity for inspiration, because it’s one of the methods I was using and it feels similar. It takes a single argument capacity: usize. That gave me an example to follow, but I still wanted to understand why. I did some more reading, and I learned that Rust arrays are indexed with usize. It makes sense that a pointer type is used for array indexing, but it’s been a while since I used a language with pointers, and so it didn’t occur to me. There’s a lot of cool stuff in std::collections At the core of this tool, I have a reservoir of weighted items, and I want to be able to find the item with the largest weight when it gets replaced. This sounds like priority queue, and there’s an implementation of one in the Rust standard library. I was able to use BinaryHeap from the std::collections module, which saved me from writing a bunch of fiddly code myself. Here’s the broad shape of it: struct WeightedItem<T> { item: T, weight: f64, } let mut reservoir: BinaryHeap<WeightedItem<T>> = BinaryHeap::with_capacity(k); There’s a bit more code to implement Eq and Ord for WeightedItem, but that wasn’t difficult. I didn’t even need to read the documentation – the compiler error messages were so helpful, I could just follow their suggestions to get a working solution. In this sense, Rust feels very like Python – both languages have a built-in collections module with some common data structures. I need to spend more time exploring the Rust variant, and there’s a When should you use which collection? guide to help me find the useful parts. This whole project is less than 250 lines of Rust, including tests. There are plenty of other tools that do the same thing, so I doubt anybody else will want to use it. Most people should use shuf – to which Assaf Gordon added reservoir sampling nearly twelve years ago. But in case anybody is interested, I’ve put all the code on GitHub. I’ve learnt every programming language in tiny steps – a little at a time, growing slowly until I have something approximating skill. This project is the latest tiny step towards learning Rust, and now I know a little bit more than I did before. It’s over eight years since I wrote my first Rust, and I’m still a beginner, but I’m having fun learning, and I’m having fun writing it down as I go. [If the formatting of this post looks odd in your feed reader, visit the original article]

2 weeks ago 33 votes
How I test Rust command-line apps with `assert_cmd`

Rust has become my go-to language for my personal toolbox – small, standalone utilities like create_thumbnail, emptydir, and dominant_colours. There’s no place for Rust in my day job, so having some self-contained hobby projects means I can still have fun playing with it. I’ve been using the assert_cmd crate to test my command line tools, but I wanted to review my testing approach before I write my next utility. My old code was fine and it worked, but that’s about all you could say about it – it wasn’t clean or idiomatic Rust, and it wasn’t especially readable. My big mistake was trying to write Rust like Python. I’d written wrapper functions that would call assert_cmd and return values, then I wrote my own assertions a bit like I’d write a Python test. I missed out on the nice assertion helpers in the crate. I’d skimmed just enough of the assert_cmd documentation to get something working, but I hadn’t read it properly. As I was writing this blog post, I went back and read the documentation in more detail, to understand the right way to use the crate. Here are some examples of how I’m using it in my refreshed test suites: Testing a basic command This test calls dominant_colours with a single argument, then checks it succeeds and that a single line is printed to stdout: use assert_cmd::Command; /// If every pixel in an image is the same colour, then the image /// has a single dominant colour. #[test] fn it_prints_the_colour() { Command::cargo_bin("dominant_colours") .unwrap() .arg("./src/tests/red.png") .assert() .success() .stdout("#fe0000\n") .stderr(""); } If I have more than one argument or flag, I can replace .arg with .args to pass a list: use assert_cmd::Command; /// It picks the best colour from an image to go with a background -- /// the colour with sufficient contrast and the most saturation. #[test] fn it_chooses_the_right_colour_for_a_light_background() { Command::cargo_bin("dominant_colours") .unwrap() .args(&[ "src/tests/stripes.png", "--max-colours=5", "--best-against-bg=#fff", ]) .assert() .success() .stdout("#693900\n") .stderr(""); } Alternatively, I can omit .arg and .args if I don’t need to pass any arguments. Testing error cases Most of my tests are around error handling – call the tool with bad input, and check it returns a useful error message. I can check that the command failed, the exit code, and the error message printed to stderr: use assert_cmd::Command; /// Getting the dominant colour of a file that doesn't exist is an error. #[test] fn it_fails_if_you_pass_an_nonexistent_file() { Command::cargo_bin("dominant_colours") .unwrap() .arg("doesnotexist.jpg") .assert() .failure() .code(1) .stdout("") .stderr("No such file or directory (os error 2)\n"); } Comparing output to a regular expression All the examples so far are doing an exact match for the stdout/stderr, but sometimes I need something more flexible. Maybe I only know what part of the output will look like, or I only care about checking how it starts. If so, I can use the predicate::str::is_match predicate from the predicates crate and define a regular expression I want to match against. Here’s an example where I’m checking the output contains a version number, but not what the version number is: use assert_cmd::Command; use predicates::prelude::*; /// If I run `dominant_colours --version`, it prints the version number. #[test] fn it_prints_the_version() { // Match strings like `dominant_colours 1.2.3` let is_version_string = predicate::str::is_match(r"^dominant_colours [0-9]+\.[0-9]+\.[0-9]+\n$").unwrap(); Command::cargo_bin("dominant_colours") .unwrap() .arg("--version") .assert() .success() .stdout(is_version_string) .stderr(""); } Creating focused helper functions I have a couple of helper functions for specific test scenarios. I try to group these by common purpose – they should be testing similar behaviour. I’m trying to avoid creating helpers for the sake of reducing repetitive code. For example, I have a helper function that passes a single invalid file to dominant_colours and checks the error message is what I expect: use assert_cmd::Command; use predicates::prelude::*; /// Getting the dominant colour of a file that doesn't exist is an error. #[test] fn it_fails_if_you_pass_an_nonexistent_file() { assert_file_fails_with_error( "./doesnotexist.jpg", "No such file or directory (os error 2)\n", ); } /// Try to get the dominant colours for a file, and check it fails /// with the given error message. fn assert_file_fails_with_error( path: &str, expected_stderr: &str, ) -> assert_cmd::assert::Assert { Command::cargo_bin("dominant_colours") .unwrap() .arg(path) .assert() .failure() .code(1) .stdout("") .stderr(predicate::eq(expected_stderr)) } Initially I wrote this helper just calling .stderr(expected_stderr) to do an exact match, like in previous tests, but I got an error “expected_stderr escapes the function body here”. I’m not sure what that means – it’s something to do with borrowing – but wrapping it in a predicate seems to fix the error, so I’m happy. My test suite is a safety net, not a playground Writing this blog post has helped me refactor my tests into something that’s actually good. I’m sure there’s still room for improvement, but this is the first iteration that I feel happy with. It’s no coincidence that it looks very similar to other test suites using assert_cmd. My earlier approaches were far too clever. I was over-abstracting to hide a few lines of boilerplate, which made the tests harder to follow. I even wrote a macro with a variadic interface because of a minor annoyance, which is stretching the limits of my Rust knowledge. It was fun to write, but it would have been a pain to debug or edit later. It’s okay to have a bit of repetition in a test suite, if it makes them easier to read. I keep having to remind myself of this – I’m often tempted to create helper functions whose sole purpose is to remove boilerplate, or create some clever parametrisation which only made sense as I’m writing it. I need to resist the urge to compress my test code. My new tests are more simple and more readable. There’s a time and a place for clever code, but my test suite isn’t it. [If the formatting of this post looks odd in your feed reader, visit the original article]

4 weeks ago 36 votes

More in programming

Making inventory spreadsheets for my LEGO sets

One of my recent home organisation projects has been sorting out my LEGO collection. I have a bunch of sets which are mixed together in one messy box, and I’m trying to separate bricks back into distinct sets. My collection is nowhere near large enough to be worth sorting by individual parts, and I hope that breaking down by set will make it all easier to manage and store. I’ve been creating spreadsheets to track the parts in each set, and count them out as I find them. I briefly hinted at this in my post about looking at images in spreadsheets, where I included a screenshot of one of my inventory spreadsheets: These spreadsheets have been invaluable – I can see exactly what pieces I need, and what pieces I’m missing. Without them, I wouldn’t even attempt this. I’m about to pause this cleanup and work on some other things, but first I wanted to write some notes on how I’m creating these spreadsheets – I’ll probably want them again in the future. Getting a list of parts in a set There are various ways to get a list of parts in a LEGO set: Newer LEGO sets include a list of parts at the back of the printed instructions You can get a list from LEGO-owned website like LEGO.com or BrickLink There are community-maintained databases on sites like Rebrickable I decided to use the community maintained lists from Rebrickable – they seem very accurate in my experience, and you can download daily snapshots of their entire catalog database. The latter is very powerful, because now I can load the database into my tools of choice, and slice and dice the data in fun and interesting ways. Downloading their entire database is less than 15MB – which is to say, two-thirds the size of just opening the LEGO.com homepage. Bargain! Putting Rebrickable data in a SQLite database My tool of choice is SQLite. I slept on this for years, but I’ve come to realise just how powerful and useful it can be. A big part of what made me realise the power of SQLite is seeing Simon Willison’s work with datasette, and some of the cool things he’s built on top of SQLite. Simon also publishes a command-line tool sqlite-utils for manipulating SQLite databases, and that’s what I’ve been using to create my spreadsheets. Here’s my process: Create a Python virtual environment, and install sqlite-utils: python3 -m venv .venv source .venv/bin/activate pip install sqlite-utils At time of writing, the latest version of sqlite-utils is 3.38. Download the Rebrickable database tables I care about, uncompress them, and load them into a SQLite database: curl -O 'https://cdn.rebrickable.com/media/downloads/colors.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/parts.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/inventories.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/inventory_parts.csv.gz' gunzip colors.csv.gz gunzip parts.csv.gz gunzip inventories.csv.gz gunzip inventory_parts.csv.gz sqlite-utils insert lego_parts.db colors colors.csv --csv sqlite-utils insert lego_parts.db parts parts.csv --csv sqlite-utils insert lego_parts.db inventories inventories.csv --csv sqlite-utils insert lego_parts.db inventory_parts inventory_parts.csv --csv The inventory_parts table describes how many of each part there are in a set. “Set S contains 10 of part P in colour C.” The parts and colors table contains detailed information about each part and color. The inventories table matches the official LEGO set numbers to the inventory IDs in Rebrickable’s database. “The set sold by LEGO as 6616-1 has ID 4159 in the inventory table.” Run a SQLite query that gets information from the different tables to tell me about all the parts in a particular set: SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num FROM inventory_parts ip JOIN inventories i ON ip.inventory_id = i.id JOIN parts p ON ip.part_num = p.part_num JOIN colors c ON ip.color_id = c.id WHERE i.set_num = '6616-1'; Or use sqlite-utils to export the query results as a spreadsheet: sqlite-utils lego_parts.db " SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num FROM inventory_parts ip JOIN inventories i ON ip.inventory_id = i.id JOIN parts p ON ip.part_num = p.part_num JOIN colors c ON ip.color_id = c.id WHERE i.set_num = '6616-1';" --csv > 6616-1.csv Here are the first few lines of that CSV: img_url,quantity,is_spare,color,name,part_num https://cdn.rebrickable.com/media/parts/photos/9999/23064-9999-e6da02af-9e23-44cd-a475-16f30db9c527.jpg,1,False,[No Color/Any Color],Sticker Sheet for Set 6616-1,23064 https://cdn.rebrickable.com/media/parts/elements/4523412.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Chequered Print,2335pr0019 https://cdn.rebrickable.com/media/parts/photos/15/2335px13-15-33ae3ea3-9921-45fc-b7f0-0cd40203f749.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Octan Logo Print,2335pr0024 https://cdn.rebrickable.com/media/parts/elements/4141999.jpg,4,False,Green,Tile Special 1 x 2 Grille with Bottom Groove,2412b https://cdn.rebrickable.com/media/parts/elements/4125254.jpg,4,False,Orange,Tile Special 1 x 2 Grille with Bottom Groove,2412b Import that spreadsheet into Google Sheets, then add a couple of columns. I add a column image where every cell has the formula =IMAGE(…) that references the image URL. This gives me an inline image, so I know what that brick looks like. I add a new column quantity I have where every cell starts at 0, which is where I’ll count bricks as I find them. I add a new column remaining to find which counts the difference between quantity and quantity I have. Then I can highlight or filter for rows where this is non-zero, so I can see the bricks I still need to find. If you’re interested, here’s an example spreadsheet that has a clean inventory. It took me a while to refine the SQL query, but now I have it, I can create a new spreadsheet in less than a minute. One of the things I’ve realised over the last year or so is how powerful “get the data into SQLite” can be – it opens the door to all sorts of interesting queries and questions, with a relatively small amount of code required. I’m sure I could write a custom script just for this task, but it wouldn’t be as concise or flexible. [If the formatting of this post looks odd in your feed reader, visit the original article]

19 hours ago 3 votes
Giving Junior Engineers Control Of A Six Trillion Dollar System Is Nuts

For some purpose, the DOGE people are burrowing their way into all US Federal Systems. Their complete control over the Treasury Department is entirely insane. Unless you intend to destroy everything, making arbitrary changes to complex computer systems will result in destruction, even if that was not your intention. No

4 hours ago 3 votes
Stanislav Petrov

A lieutenant colonel in the Soviet Air Defense Forces prevented the end of human civilization on September 26th, 1983. His name was Stanislav Petrov. Protocol dictated that the Soviet Union would retaliate against any nuclear strikes sent by the United States. This was a policy of mutually assured destruction, a doctrine that compels a horrifying logical conclusion. The second and third stage effects of this type of exchange would be even more catastrophic. Allies for each side would likely be pulled into the conflict. The resulting nuclear winter was projected to lead to 2 billion deaths due to starvation. This is to say nothing about those who would have been unfortunate enough to have survived. Petrov’s job was to monitor Oko, the computerized warning systems built to centralize Soviet satellite communications. Around midnight, he received a report that one of the satellites had detected the infrared signature of a single launch of a United States ICBM. While Petrov was deciding what to do about this report, the system detected four more incoming missile launches. He had minutes to make a choice about what to do. It is impossible to imagine the amount of pressure placed on him at this moment. Source: Stanislav Petrov, Soviet officer credited with averting nuclear war, dies at 77 by Schwartzreport. Petrov lived in a world of deterministic systems. The technologies that powered these warning systems have outputs that are guaranteed, provided the proper inputs are provided. However, deterministic does not mean infallible. The only reason you are alive and reading this is because Petrov understood that the systems he observed were capable of error. He was suspicious of what he was seeing reported, and chose not to escalate a retaliatory strike. There were two factors guiding his decision: A surprise attack would most likely have used hundreds of missiles, and not just five. The allegedly foolproof Oko system was new and prone to errors. An error in a deterministic system can still lead to expected outputs being generated. For the Oko system, infrared reflections of the sun shining off of the tops of clouds created a false positive that was interpreted as detection of a nuclear launch event. Source: US-K History by Kosmonavtika. The concept of erroneous truth is a deep thing to internalize, as computerized systems are presented as omniscient, indefective, and absolute. Petrov’s rewards for this action were reprimands, reassignment, and denial of promotion. This was likely for embarrassing his superiors by the politically inconvenient shedding of light on issues with the Oko system. A coerced early retirement caused a nervous breakdown, likely him having to grapple with the weight of his decision. It was only in the 1990s—after the fall of the Soviet Union—that his actions were discovered internationally and celebrated. Stanislav Petrov was given the recognition that he deserved, including being honored by the United Nations, awarded the Dresden Peace Prize, featured in a documentary, and being able to visit a Minuteman Missile silo in the United States. On January 31st, 2025, OpenAI struck a deal with the United States government to use its AI product for nuclear weapon security. It is unclear how this technology will be used, where, and to what extent. It is also unclear how OpenAI’s systems function, as they are black box technologies. What is known is that LLM-generated responses—the product OpenAI sells—are non-deterministic. Non-deterministic systems don’t have guaranteed outputs from their inputs. In addition, LLM-based technology hallucinates—it invents content with no self-knowledge that it is a falsehood. Non-deterministic systems that are computerized also have the perception as being authoritative, the same as their deterministic peers. It is not a question of how the output is generated, it is one of the output being perceived to come from a machine. These are terrifying things to know. Consider not only the systems this technology is being applied to, but also the thoughtless speed of their integration. Then consider how we’ve historically been conditioned and rewarded to interpret the output of these systems, and then how we perceive and treat skeptics. We don’t live in a purely deterministic world of technology anymore. Stanislav Petrov died on September 18th, 2017, before this change occurred. I would be incredibly curious to know his thoughts about our current reality, as well as the increasing abdication of human monitoring of automated systems in favor of notably biased, supposed “AI solutions.” In acknowledging Petrov’s skepticism in a time of mania and political instability, we acknowledge a quote from former U.S. Secretary of Defense William J. Perry’s memoir about the incident: [Oko’s false positives] illustrates the immense danger of placing our fate in the hands of automated systems that are susceptible to failure and human beings who are fallible.

yesterday 7 votes
01 · A spreadsheet for exploring scenarios

In our *Ambsheets* project, we are exploring a small extension to the familiar spreadsheet: **what if a single spreadsheet cell could hold multiple values at once**?

yesterday 2 votes
Recently

I am not going to repeat the news. But man, things are really, really bad and getting worse in America. It’s all so unendingly stupid and evil. The tech industry is being horrible, too. Wishing strength to the people who are much more exposed to the chaos than I am. Reading A Confederacy of Dunces was such a perfect novel. It was pure escapism, over-the-top comedy, and such an unusual artifact, that was sadly only appreciated posthumously. Very earnestly I believe that despite greater access to power and resources, the box labeled “socially acceptable ways to be a man” is much smaller than the box labeled “socially acceptable ways to be a woman.” This article on the distinction between patriarchy and men was an interesting read. With the whole… politics out there, it’s easy to go off the rails with any discussion about men and women and whether either have it easy or hard. The same author wrote this good article about declining male enrollment in college. I think both are worth a read. Whenever I read this kind of article, I’m reminded of how limited and mostly fortunate my own experience is. There’s a big difference, I think, in how vigorously you have to perform your gender in some red state where everyone owns a pickup truck, versus a major city where the roles are a little more fluid. Plus, I’ve been extremely fortunate to have a lot of friends and genuine open conversations about feelings with other men. I wish that was the norm! On Having a Maximum Wealth was right up my alley. I’m reading another one of the new-French-economist books right now, and am still fascinated by the prospect of wealth taxes. My friend David has started a local newsletter for Richmond, Virginia, and written a good piece about public surveillance. Construction Physics is consistently great, and their investigation of why skyscrapers are all glass boxes is no exception. Watching David Lynch was so great. We watched his film Lost Highway a few days after he passed, and it was even better than I had remembered it. Norm Macdonald’s extremely long jokes on late-night talk shows have been getting me through the days. Listening This song by the The Hard Quartet – a supergroup of Emmett Kelly, Stephen Malkmus (Pavement), Matt Sweeney and Jim White. It’s such a loving, tender bit of nonsense, very golden-age Pavement. They also have this nice chill song: I came across this SML album via Hearing Things, which has been highlighting a lot of good music. Small Medium Large by SML It’s a pretty good time for these independent high-quality art websites. Colossal has done the same for the art world and highlights good new art: I really want to make it out to see the Nick Cave (not the musician) art show while it’s in New York.

yesterday 1 votes