I was chatting with a friend recently, and she mentioned an annoyance when reading fanfiction on her iPad. She downloads fic from AO3 as EPUB files, and reads it in the Kindle app – but the files don’t have a cover image, and so the preview thumbnails aren’t very readable. She’s downloaded several hundred stories, and these thumbnails make it difficult to find things in the app’s “collections” view.

This felt like a solvable problem. There are tools to add cover images to EPUB files, if you already have the image. The EPUB file embeds some key metadata, like the title and author. What if you had a tool that could extract that metadata, auto-generate an image, and use it as the cover?

So I built that. It’s a small site where you upload EPUB files you’ve downloaded from AO3, the site generates a cover image based on the metadata, and it gives you an updated EPUB to download. The new covers show the title and author in large text on a coloured background, so they’re much easier...
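As a rough sketch of the extraction step – assuming Python’s standard zipfile and xml.etree modules, with an illustrative function name, rather than the site’s actual code – pulling the title and author out of an EPUB looks something like this:

import zipfile
import xml.etree.ElementTree as ET

NAMESPACES = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def epub_title_and_author(path):
    # An EPUB is a zip archive: META-INF/container.xml points at the OPF
    # package file, whose Dublin Core metadata holds the title and creator.
    with zipfile.ZipFile(path) as z:
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NAMESPACES).attrib["full-path"]

        package = ET.fromstring(z.read(opf_path))
        title = package.findtext(".//dc:title", default="(untitled)", namespaces=NAMESPACES)
        author = package.findtext(".//dc:creator", default="(unknown)", namespaces=NAMESPACES)
        return title, author

# e.g. title, author = epub_title_and_author("my_story.epub")

From there, drawing the title and author onto a coloured background (e.g. with an image library) and embedding the result is the part that existing cover-image tools already handle.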
4 days ago

More from alexwlchan

Making inventory spreadsheets for my LEGO sets

One of my recent home organisation projects has been sorting out my LEGO collection. I have a bunch of sets which are mixed together in one messy box, and I’m trying to separate bricks back into distinct sets. My collection is nowhere near large enough to be worth sorting by individual parts, and I hope that breaking down by set will make it all easier to manage and store.

I’ve been creating spreadsheets to track the parts in each set, and count them out as I find them. I briefly hinted at this in my post about looking at images in spreadsheets, where I included a screenshot of one of my inventory spreadsheets.

These spreadsheets have been invaluable – I can see exactly what pieces I need, and what pieces I’m missing. Without them, I wouldn’t even attempt this. I’m about to pause this cleanup and work on some other things, but first I wanted to write some notes on how I’m creating these spreadsheets – I’ll probably want them again in the future.

Getting a list of parts in a set

There are various ways to get a list of parts in a LEGO set:

- Newer LEGO sets include a list of parts at the back of the printed instructions
- You can get a list from LEGO-owned websites like LEGO.com or BrickLink
- There are community-maintained databases on sites like Rebrickable

I decided to use the community-maintained lists from Rebrickable – they seem very accurate in my experience, and you can download daily snapshots of their entire catalog database. The latter is very powerful, because now I can load the database into my tools of choice, and slice and dice the data in fun and interesting ways. Downloading their entire database is less than 15MB – which is to say, two-thirds the size of just opening the LEGO.com homepage. Bargain!

Putting Rebrickable data in a SQLite database

My tool of choice is SQLite. I slept on this for years, but I’ve come to realise just how powerful and useful it can be. A big part of what made me realise the power of SQLite is seeing Simon Willison’s work with datasette, and some of the cool things he’s built on top of SQLite. Simon also publishes a command-line tool sqlite-utils for manipulating SQLite databases, and that’s what I’ve been using to create my spreadsheets.

Here’s my process:

1. Create a Python virtual environment, and install sqlite-utils:

   python3 -m venv .venv
   source .venv/bin/activate
   pip install sqlite-utils

   At time of writing, the latest version of sqlite-utils is 3.38.

2. Download the Rebrickable database tables I care about, uncompress them, and load them into a SQLite database:

   curl -O 'https://cdn.rebrickable.com/media/downloads/colors.csv.gz'
   curl -O 'https://cdn.rebrickable.com/media/downloads/parts.csv.gz'
   curl -O 'https://cdn.rebrickable.com/media/downloads/inventories.csv.gz'
   curl -O 'https://cdn.rebrickable.com/media/downloads/inventory_parts.csv.gz'

   gunzip colors.csv.gz
   gunzip parts.csv.gz
   gunzip inventories.csv.gz
   gunzip inventory_parts.csv.gz

   sqlite-utils insert lego_parts.db colors colors.csv --csv
   sqlite-utils insert lego_parts.db parts parts.csv --csv
   sqlite-utils insert lego_parts.db inventories inventories.csv --csv
   sqlite-utils insert lego_parts.db inventory_parts inventory_parts.csv --csv

   The inventory_parts table describes how many of each part there are in a set. “Set S contains 10 of part P in colour C.” The parts and colors tables contain detailed information about each part and color. The inventories table matches the official LEGO set numbers to the inventory IDs in Rebrickable’s database. “The set sold by LEGO as 6616-1 has ID 4159 in the inventory table.”

3. Run a SQLite query that gets information from the different tables to tell me about all the parts in a particular set:

   SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num
   FROM inventory_parts ip
   JOIN inventories i ON ip.inventory_id = i.id
   JOIN parts p ON ip.part_num = p.part_num
   JOIN colors c ON ip.color_id = c.id
   WHERE i.set_num = '6616-1';

   Or use sqlite-utils to export the query results as a spreadsheet:

   sqlite-utils lego_parts.db "
     SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num
     FROM inventory_parts ip
     JOIN inventories i ON ip.inventory_id = i.id
     JOIN parts p ON ip.part_num = p.part_num
     JOIN colors c ON ip.color_id = c.id
     WHERE i.set_num = '6616-1';" --csv > 6616-1.csv

   Here are the first few lines of that CSV:

   img_url,quantity,is_spare,color,name,part_num
   https://cdn.rebrickable.com/media/parts/photos/9999/23064-9999-e6da02af-9e23-44cd-a475-16f30db9c527.jpg,1,False,[No Color/Any Color],Sticker Sheet for Set 6616-1,23064
   https://cdn.rebrickable.com/media/parts/elements/4523412.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Chequered Print,2335pr0019
   https://cdn.rebrickable.com/media/parts/photos/15/2335px13-15-33ae3ea3-9921-45fc-b7f0-0cd40203f749.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Octan Logo Print,2335pr0024
   https://cdn.rebrickable.com/media/parts/elements/4141999.jpg,4,False,Green,Tile Special 1 x 2 Grille with Bottom Groove,2412b
   https://cdn.rebrickable.com/media/parts/elements/4125254.jpg,4,False,Orange,Tile Special 1 x 2 Grille with Bottom Groove,2412b

4. Import that spreadsheet into Google Sheets, then add a couple of columns. I add a column image where every cell has the formula =IMAGE(…) that references the image URL. This gives me an inline image, so I know what that brick looks like. I add a new column quantity I have where every cell starts at 0, which is where I’ll count bricks as I find them. I add a new column remaining to find which counts the difference between quantity and quantity I have. Then I can highlight or filter for rows where this is non-zero, so I can see the bricks I still need to find.

If you’re interested, here’s an example spreadsheet that has a clean inventory. It took me a while to refine the SQL query, but now I have it, I can create a new spreadsheet in less than a minute.

One of the things I’ve realised over the last year or so is how powerful “get the data into SQLite” can be – it opens the door to all sorts of interesting queries and questions, with a relatively small amount of code required. I’m sure I could write a custom script just for this task, but it wouldn’t be as concise or flexible. [If the formatting of this post looks odd in your feed reader, visit the original article]
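The steps above also collapse nicely into a script. Here’s a minimal sketch – assuming Python’s built-in sqlite3 and csv modules, with illustrative file and column names, not the workflow from the post – that runs the same query against lego_parts.db and writes a CSV with the image, quantity I have, and remaining to find columns pre-filled as formulas:

import csv
import sqlite3
import sys

QUERY = """
SELECT ip.img_url, ip.quantity, ip.is_spare, c.name AS color, p.name, ip.part_num
FROM inventory_parts ip
JOIN inventories i ON ip.inventory_id = i.id
JOIN parts p ON ip.part_num = p.part_num
JOIN colors c ON ip.color_id = c.id
WHERE i.set_num = ?
"""

def write_inventory(set_num):
    conn = sqlite3.connect("lego_parts.db")
    rows = conn.execute(QUERY, (set_num,)).fetchall()

    with open(f"{set_num}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            "image", "img_url", "quantity", "is_spare", "color", "name",
            "part_num", "quantity I have", "remaining to find",
        ])
        # Spreadsheet rows start at 2 because row 1 is the header.
        for i, (img_url, quantity, is_spare, color, name, part_num) in enumerate(rows, start=2):
            writer.writerow([
                f"=IMAGE(B{i})",   # inline image, same trick as the image column above
                img_url, quantity, is_spare, color, name, part_num,
                0,                 # "quantity I have" starts at zero
                f"=C{i}-H{i}",     # "remaining to find" = quantity - quantity I have
            ])

if __name__ == "__main__":
    write_inventory(sys.argv[1])   # e.g. python make_inventory.py 6616-1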

22 hours ago 3 votes
Looking at images in a spreadsheet

I’ve had a couple of projects recently where I needed to work with a list that involved images. For example, choosing a series of photos to print, or making an inventory of Lego parts. I could write a simple text list, but it’s really helpful to be able to see the images as part of the list, especially when I’m working with other people.

The best tool I’ve found is Google Sheets – not something I usually associate with pictures! In Google Sheets, I use the IMAGE function, which inserts an image into a cell. For example:

=IMAGE("https://www.google.com/images/srpr/logo3w.png")

There’s a similar function in Microsoft Excel, but not in Apple Numbers.

This function can reference values in other cells, so I’ll often prepare my spreadsheet in another tool – say, a Python script – and include an image URL in one of the columns. When I import the spreadsheet into Google Sheets, I use IMAGE() to reference that column, and then I see inline images. After that, I tend to hide the column with the image URL, and resize the rows/columns containing images to make them bigger and easier to look at.

I often pair this with the HYPERLINK function, which can add a clickable link to a cell. This is useful to link to the source of the image, or to more detail I can’t fit in the spreadsheet.

I don’t know how far this approach can scale – I’ve never tried more than a thousand or so images in a single spreadsheet – but it’s pretty cool that it works at all! Using a spreadsheet gives me a simple, lightweight interface that most people are already familiar with. It doesn’t take much work on my part, and I get useful features like sorting and filtering for “free”. Previously I’d only thought of spreadsheets as a tool for textual data, and being able to include images has made them even more powerful. [If the formatting of this post looks odd in your feed reader, visit the original article]
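For illustration, a preparation script of that sort can be tiny. Here’s a sketch of what that step can look like – the URLs and captions below are made-up placeholders, and the script simply writes IMAGE() and HYPERLINK() formulas that reference the URL column, rather than anything from the original post:

import csv

# Hypothetical input: (url, caption) pairs, e.g. photos being considered for printing.
photos = [
    ("https://example.com/photos/001.jpg", "Beach, June"),
    ("https://example.com/photos/002.jpg", "Birthday cake"),
]

with open("photos.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "link", "caption", "url"])
    for i, (url, caption) in enumerate(photos, start=2):   # row 1 is the header
        writer.writerow([
            f"=IMAGE(D{i})",                       # inline preview of the photo
            f'=HYPERLINK(D{i}, "open original")',  # clickable link back to the source
            caption,
            url,                                   # the URL column to hide afterwards
        ])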

2 weeks ago 25 votes
randline: get a random selection of lines in a file using reservoir sampling

I’ve posted another command-line tool on GitHub: randline, which gives you a random selection of lines in a file:

$ randline < /usr/share/dict/words
ultraluxurious

$ randline 3 < /usr/share/dict/words
unexceptionably
baselessness
salinity

There are lots of tools that solve this problem; I wrote my own as a way to get some more Rust practice and try a new-to-me technique called reservoir sampling.

Existing approaches

There’s a shuf command in coreutils which is designed to do this exact thing:

$ shuf -n 3 /usr/share/dict/words
brimstone
melody’s
reimbursed

But I don’t have coreutils on my Mac, so I can’t use shuf. You can do this in lots of other ways using tools like awk, sort and perl. If you’re interested, check out these Stack Overflow and Unix & Linux Stack Exchange threads for examples.

For my needs, I wrote a tiny Python script called randline which I saved in my PATH years ago, and I haven’t thought much about since:

import random
import sys

if __name__ == "__main__":
    lines = sys.stdin.read().splitlines()

    try:
        k = int(sys.argv[1])
    except IndexError:
        k = 1

    random.shuffle(lines)
    print("\n".join(lines[:k]))

(I’m not sure why my past self decided not to use random.sample. I suspect I’d forgotten about it.)

This script has worked fine, but I stumbled across it recently and it got me thinking. This approach isn’t very efficient – it has to load the whole file into memory. Can we do better?

Reservoir sampling

In other Python scripts, I process files as a stream – look at one line at a time, rather than loading the whole file at once. This doesn’t make much difference for small files, but it pays off when you have really big files. I couldn’t think of a good way to take a random sample of a file using streaming, and still get a uniform distribution – but smart people have already thought about this. I did some reading and I found a technique called reservoir sampling. The introduction in the Wikipedia article makes it clear this is exactly what I want:

Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large for all n items to fit into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items.

The basic idea is that rather than holding the whole file in memory at once, I can keep a fixed-sized buffer – or “reservoir” – of the items I’ve selected. As I go line-by-line through the file, I can add or remove items in this reservoir, and it will always use about the same amount of memory. I’m only holding a line in memory if it’s in the reservoir, not every line in the file.

Algorithm L

The Wikipedia article describes several algorithms, including a simple Algorithm R and an optimal Algorithm L. The underlying principle of Algorithm L is pretty concise:

If we generate $n$ random numbers $u_1, \ldots, u_n \sim U[0,1]$ independently, then the indices of the smallest $k$ of them is a uniform sample of the $k$-subsets of $\{1, \ldots, n\}$.

There’s no proof in the Wikipedia article, but I wanted to satisfy myself that this is true. If you’re happy to take it as given, you can skip the maths and go to the next section.
Here’s my attempt at a justification:

What we really care about is the relative ranking of the $u_1, \ldots, u_n$, not their actual values – we care whether, for example, $u_1 < u_2$, but not the exact difference between them. Because the variables are independent and they have the same distribution, every possible ranking is equally likely. Every variable is the same, so none of them can be “special” or favoured above the others. This means that each permutation of the indices $\{1, \ldots, n\}$ is equally likely. There are $n!$ such permutations, so each occurs with probability $1/n!$.

For a given $k$-subset, we’re interested in permutations where this subset is the first $k$ items. This means the probability that a particular $k$-subset will be selected is a simple fraction:

$$
\begin{equation*}
\text{probability of selecting this }k\text{-subset} =
  \frac{\text{\# of permutations where this subset is the first }k\text{ items}}
       {\text{\# of permutations}}
\end{equation*}
$$

How many permutations are there where this $k$-subset is the first $k$ items? There are $k!$ ways to arrange this $k$-subset as the first $k$ digits, and $(n-k)!$ ways to arrange the remaining digits. This means there are $k!\left(n-k\right)!$ permutations that match, and so:

$$
\begin{equation*}
\text{probability of selecting this }k\text{-subset} = \frac{k!\left(n-k\right)!}{n!}
\end{equation*}
$$

This probability is the same for every $k$-subset, so each one is equally likely – which is the thing we care about. This was enough to give me the confidence to try implementing Algorithm L.

Implementing Algorithm L in an efficient way

If we don’t know $n$ upfront, we could save all the items and only then generate the random variables $u_1, \ldots, u_n \sim U[0,1]$ – but that’s precisely the sort of inefficiency I’m trying to avoid! Fortunately, we don’t need to: the nice thing about this algorithm is that we only need to track the $k$ smallest values of $u_1, \ldots, u_i$ we’ve seen so far. Once a value is larger than the $k$ smallest, we can safely discard it because we know it’ll never be used.

Here’s the approach I took:

1. Create an empty “reservoir” of $k$ items.
2. As you get items, assign each one a “weight” and start filling the reservoir. (These weights are the random variables $u_1, \ldots, u_n$.) If you run out of items before you fill the reservoir, go to step 4. If you fill the reservoir and there are more items, calculate the largest weight of the items in the reservoir, and go to step 3.
3. Once the reservoir is full, go through the remaining items one-by-one. For each item, assign it a weight. If the weight of this new item is larger than the largest weight already in the reservoir, discard the item. This weight isn’t in the $k$ smallest, so we don’t care about it. If the weight of this new item is smaller than the largest weight in the reservoir, then add the item to the reservoir and remove the item with the previously-largest weight. Recalculate the largest weight of the items in the reservoir. When you run out of items, go to step 4.
4. Return the items in the reservoir. This is your random sample.

This approach means we only have to hold a fixed number of items/weights in memory at a time – much more efficient, and it should scale to an arbitrarily large number of inputs. It’s a bit too much code to include here, but you can read my Rust implementation on GitHub. I wrote some tests, which include a statistical test – I run the sampling code 10,000 times, and check the results are the uniform distribution I want.
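Those steps translate almost directly into code. Here’s a minimal Python sketch of the same idea – my own illustration, not the Rust implementation from the post – using a max-heap to track the $k$ smallest weights seen so far:

import heapq
import random

def reservoir_sample(items, k):
    """Return k items chosen uniformly at random from an iterable, in one pass.

    Only k (weight, item) pairs are kept in memory. heapq is a min-heap, so
    weights are negated to make it behave like a max-heap: the root is always
    the largest weight currently in the reservoir.
    """
    reservoir = []  # entries are (-weight, item)
    for item in items:
        weight = random.random()
        if len(reservoir) < k:
            heapq.heappush(reservoir, (-weight, item))
        elif weight < -reservoir[0][0]:
            # Smaller than the largest weight in the reservoir: swap it in,
            # evicting the item with the previously-largest weight.
            heapq.heapreplace(reservoir, (-weight, item))
        # Otherwise the weight isn't among the k smallest, so discard the item.
    return [item for _, item in reservoir]

# Usage, roughly equivalent to `randline 3 < /usr/share/dict/words`:
# with open("/usr/share/dict/words") as f:
#     lines = (line.rstrip("\n") for line in f)
#     print("\n".join(reservoir_sample(lines, 3)))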
What did I learn about Rust?

This is only about 250 lines of Rust, but it was still good practice, and I learnt a few new things.

Working with generics

I’ve used generics in other languages, and I’d read about them in the Rust Book, but I’d never written my own code using generics in Rust. I used a generic to write my sampling function:

fn reservoir_sample<T>(
    mut items: impl Iterator<Item = T>,
    k: usize,
) -> Vec<T> {
    …
}

It was straightforward, and there were no big surprises.

The difference between .iter() and .into_iter()

I’ve used both of these methods before, but I only understood part of the difference. When you call .iter(), you’re borrowing the vector, which means it can be used later. When you call .into_iter(), you’re consuming the vector, which means it can’t be used later. I hadn’t thought about how this affects the types. When you call .iter(), you get an iterator of references. When you call .into_iter(), you get an iterator of values.

This caused me some confusion when I was writing a test. Consider the following code:

fn reservoir_sample<T>(
    mut items: impl Iterator<Item = T>,
    k: usize,
) -> Vec<T> {
    …
}

let letters = vec!["A", "A", "A"];
let items = letters.iter();

assert_eq!(reservoir_sample(items, 1), vec!["A"]);

I was trying to write a test that reservoir_sample would only return the number of items I asked for, and no more. This was my first attempt, and it doesn’t compile.

When I call letters.iter(), I’m getting an iterator of string references, that is Iterator<&&str>. Then I’m comparing it to a Vec<&str>, but Rust doesn’t know how to check equality of &str and &&str, so it refuses to compile this code. There are two ways I could fix this:

Use .into_iter(), so I get an iterator of string values, i.e. Iterator<&str>:

let letters = vec!["A", "A", "A"];
let items = letters.into_iter();

assert_eq!(reservoir_sample(items, 1), vec!["A"]);

Change the expected result so it’s a Vec<&&str>:

let letters = vec!["A", "A", "A"];
let items = letters.iter();

assert_eq!(reservoir_sample(items, 1), vec![&"A"]);

I used .into_iter() in my tests. This sort of distinction is probably obvious to more experienced Rust programmers, but it was new to me. I’ve read about these methods, but I only understand them by writing code.

Arrays are indexed with usize

I wasn’t sure what type I should use for k, the size of the random sample. It’s a positive integer, but should I use u32 or usize? I read the descriptions of both, and it wasn’t immediately obvious which was preferable.

I looked to Vec::with_capacity for inspiration, because it’s one of the methods I was using and it feels similar. It takes a single argument capacity: usize. That gave me an example to follow, but I still wanted to understand why. I did some more reading, and I learned that Rust arrays are indexed with usize. It makes sense that a pointer-sized type is used for array indexing, but it’s been a while since I used a language with pointers, and so it didn’t occur to me.

There’s a lot of cool stuff in std::collections

At the core of this tool, I have a reservoir of weighted items, and I want to be able to find the item with the largest weight when it gets replaced. This sounds like a priority queue, and there’s an implementation of one in the Rust standard library. I was able to use BinaryHeap from the std::collections module, which saved me from writing a bunch of fiddly code myself.
Here’s the broad shape of it:

struct WeightedItem<T> {
    item: T,
    weight: f64,
}

let mut reservoir: BinaryHeap<WeightedItem<T>> = BinaryHeap::with_capacity(k);

There’s a bit more code to implement Eq and Ord for WeightedItem, but that wasn’t difficult. I didn’t even need to read the documentation – the compiler error messages were so helpful, I could just follow their suggestions to get a working solution.

In this sense, Rust feels very like Python – both languages have a built-in collections module with some common data structures. I need to spend more time exploring the Rust variant, and there’s a When should you use which collection? guide to help me find the useful parts.

This whole project is less than 250 lines of Rust, including tests. There are plenty of other tools that do the same thing, so I doubt anybody else will want to use it. Most people should use shuf – to which Assaf Gordon added reservoir sampling nearly twelve years ago. But in case anybody is interested, I’ve put all the code on GitHub.

I’ve learnt every programming language in tiny steps – a little at a time, growing slowly until I have something approximating skill. This project is the latest tiny step towards learning Rust, and now I know a little bit more than I did before. It’s over eight years since I wrote my first Rust, and I’m still a beginner, but I’m having fun learning, and I’m having fun writing it down as I go. [If the formatting of this post looks odd in your feed reader, visit the original article]

2 weeks ago 33 votes
How I test Rust command-line apps with `assert_cmd`

Rust has become my go-to language for my personal toolbox – small, standalone utilities like create_thumbnail, emptydir, and dominant_colours. There’s no place for Rust in my day job, so having some self-contained hobby projects means I can still have fun playing with it.

I’ve been using the assert_cmd crate to test my command line tools, but I wanted to review my testing approach before I write my next utility. My old code was fine and it worked, but that’s about all you could say about it – it wasn’t clean or idiomatic Rust, and it wasn’t especially readable.

My big mistake was trying to write Rust like Python. I’d written wrapper functions that would call assert_cmd and return values, then I wrote my own assertions a bit like I’d write a Python test. I missed out on the nice assertion helpers in the crate. I’d skimmed just enough of the assert_cmd documentation to get something working, but I hadn’t read it properly.

As I was writing this blog post, I went back and read the documentation in more detail, to understand the right way to use the crate. Here are some examples of how I’m using it in my refreshed test suites:

Testing a basic command

This test calls dominant_colours with a single argument, then checks it succeeds and that a single line is printed to stdout:

use assert_cmd::Command;

/// If every pixel in an image is the same colour, then the image
/// has a single dominant colour.
#[test]
fn it_prints_the_colour() {
    Command::cargo_bin("dominant_colours")
        .unwrap()
        .arg("./src/tests/red.png")
        .assert()
        .success()
        .stdout("#fe0000\n")
        .stderr("");
}

If I have more than one argument or flag, I can replace .arg with .args to pass a list:

use assert_cmd::Command;

/// It picks the best colour from an image to go with a background --
/// the colour with sufficient contrast and the most saturation.
#[test]
fn it_chooses_the_right_colour_for_a_light_background() {
    Command::cargo_bin("dominant_colours")
        .unwrap()
        .args(&[
            "src/tests/stripes.png",
            "--max-colours=5",
            "--best-against-bg=#fff",
        ])
        .assert()
        .success()
        .stdout("#693900\n")
        .stderr("");
}

Alternatively, I can omit .arg and .args if I don’t need to pass any arguments.

Testing error cases

Most of my tests are around error handling – call the tool with bad input, and check it returns a useful error message. I can check that the command failed, the exit code, and the error message printed to stderr:

use assert_cmd::Command;

/// Getting the dominant colour of a file that doesn't exist is an error.
#[test]
fn it_fails_if_you_pass_an_nonexistent_file() {
    Command::cargo_bin("dominant_colours")
        .unwrap()
        .arg("doesnotexist.jpg")
        .assert()
        .failure()
        .code(1)
        .stdout("")
        .stderr("No such file or directory (os error 2)\n");
}

Comparing output to a regular expression

All the examples so far are doing an exact match for the stdout/stderr, but sometimes I need something more flexible. Maybe I only know what part of the output will look like, or I only care about checking how it starts. If so, I can use the predicate::str::is_match predicate from the predicates crate and define a regular expression I want to match against.

Here’s an example where I’m checking the output contains a version number, but not what the version number is:

use assert_cmd::Command;
use predicates::prelude::*;

/// If I run `dominant_colours --version`, it prints the version number.
#[test]
fn it_prints_the_version() {
    // Match strings like `dominant_colours 1.2.3`
    let is_version_string =
        predicate::str::is_match(r"^dominant_colours [0-9]+\.[0-9]+\.[0-9]+\n$").unwrap();

    Command::cargo_bin("dominant_colours")
        .unwrap()
        .arg("--version")
        .assert()
        .success()
        .stdout(is_version_string)
        .stderr("");
}

Creating focused helper functions

I have a couple of helper functions for specific test scenarios. I try to group these by common purpose – they should be testing similar behaviour. I’m trying to avoid creating helpers for the sake of reducing repetitive code.

For example, I have a helper function that passes a single invalid file to dominant_colours and checks the error message is what I expect:

use assert_cmd::Command;
use predicates::prelude::*;

/// Getting the dominant colour of a file that doesn't exist is an error.
#[test]
fn it_fails_if_you_pass_an_nonexistent_file() {
    assert_file_fails_with_error(
        "./doesnotexist.jpg",
        "No such file or directory (os error 2)\n",
    );
}

/// Try to get the dominant colours for a file, and check it fails
/// with the given error message.
fn assert_file_fails_with_error(
    path: &str,
    expected_stderr: &str,
) -> assert_cmd::assert::Assert {
    Command::cargo_bin("dominant_colours")
        .unwrap()
        .arg(path)
        .assert()
        .failure()
        .code(1)
        .stdout("")
        .stderr(predicate::eq(expected_stderr))
}

Initially I wrote this helper just calling .stderr(expected_stderr) to do an exact match, like in previous tests, but I got an error “expected_stderr escapes the function body here”. I’m not sure what that means – it’s something to do with borrowing – but wrapping it in a predicate seems to fix the error, so I’m happy.

My test suite is a safety net, not a playground

Writing this blog post has helped me refactor my tests into something that’s actually good. I’m sure there’s still room for improvement, but this is the first iteration that I feel happy with. It’s no coincidence that it looks very similar to other test suites using assert_cmd.

My earlier approaches were far too clever. I was over-abstracting to hide a few lines of boilerplate, which made the tests harder to follow. I even wrote a macro with a variadic interface because of a minor annoyance, which is stretching the limits of my Rust knowledge. It was fun to write, but it would have been a pain to debug or edit later.

It’s okay to have a bit of repetition in a test suite, if it makes them easier to read. I keep having to remind myself of this – I’m often tempted to create helper functions whose sole purpose is to remove boilerplate, or create some clever parametrisation which only made sense as I’m writing it. I need to resist the urge to compress my test code.

My new tests are simpler and more readable. There’s a time and a place for clever code, but my test suite isn’t it. [If the formatting of this post looks odd in your feed reader, visit the original article]

4 weeks ago 36 votes

More in programming

Giving Junior Engineers Control Of A Six Trillion Dollar System Is Nuts

For some purpose, the DOGE people are burrowing their way into all US Federal Systems. Their complete control over the Treasury Department is entirely insane. Unless you intend to destroy everything, making arbitrary changes to complex computer systems will result in destruction, even if that was not your intention. No

7 hours ago 3 votes
Our own agents with their own tools.

Entering 2025, I decided to spend some time exploring the topic of agents. I started reading Anthropic’s Building effective agents, followed by Chip Huyen’s AI Engineering. I kicked off a major workstream at work on using agents, and I also decided to do a personal experiment of sorts. This is a general commentary on building that project.

What I wanted to build was a simple chat interface where I could write prompts, select models, and have the model use tools as appropriate. My side goal was to build this using Cursor and generally avoid writing code directly as much as possible, but I found that generally slower than writing code in emacs while relying on 4o-mini to provide working examples to pull from. Similarly, while I initially envisioned building this in fullstack TypeScript via Cursor, I ultimately bailed into a stack that I’m more comfortable with, and ended up using Python3, FastAPI, PostgreSQL, and SQLAlchemy with the async psycopg3 driver. It’s been a… while… since I started a brand new Python project, and I used this project as an opportunity to get comfortable with Python3’s async/await mechanisms along with Python3’s typing along with mypy. Finally, I also wanted to experiment with Tailwind, and ended up using TailwindUI’s components to build the site.

The working version supports everything I wanted: creating chats with models, and allowing those models to use function calling to use tools that I provide. The models are allowed to call any number of tools in pursuit of the problem they are solving.

The tool usage is the most interesting part here for sure. The simplest tool I created was a get_temperature tool that provided a fake temperature for your location. This allowed me to ask questions like “What should I wear tomorrow in San Francisco, CA?” and get a useful response. The code to add this function to my project was pretty straightforward, just three lines of Python and 25 lines of metadata to pass to the OpenAI API.

import random  # (import implied by the snippet)

def tool_get_current_weather(location: str|None=None, format: str|None=None) -> str:
    "Simple proof of concept tool."
    temp = random.randint(40, 90) if format == 'fahrenheit' else random.randint(10, 25)
    return f"It’s going to be {temp} degrees {format} tomorrow."

FUNCTION_REGISTRY['get_current_weather'] = tool_get_current_weather

TOOL_USAGE_REGISTRY['get_current_weather'] = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use. Infer this from the users location.",
                },
            },
            "required": ["location", "format"],
        },
    }
}

After getting this tool, the next tool I added was a simple URL retriever tool, which allowed the agent to grab a URL and use the content of that URL in its prompt. The implementation for this tool was similarly quite simple.
import requests  # (imports implied by the snippet)
from bs4 import BeautifulSoup
from markdownify import markdownify

def tool_get_url(url: str|None=None) -> str:
    if url is None:
        return ''
    url = str(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    content = soup.find('main') or soup.find('article') or soup.body
    if not content:
        return str(response.content)
    markdown = markdownify(str(content), heading_style="ATX").strip()
    return str(markdown)

FUNCTION_REGISTRY['get_url'] = tool_get_url

TOOL_USAGE_REGISTRY['get_url'] = {
    "type": "function",
    "function": {
        "name": "get_url",
        "description": "Retrieve the contents of a website via its URL.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The complete URL, including protocol to retrieve. For example: \"https://lethain.com\"",
                }
            },
            "required": ["url"],
        },
    }
}

What’s pretty amazing is how much power you can add to your agent by adding such a trivial tool as retrieving a URL. You can similarly imagine adding tools for retrieving and commenting on Github pull requests and so on, which could allow a very simple agent tool like this to become quite useful.

Working on this project gave me a moderately compelling view of a near-term future where most engineers have a simple application like this running that they can pipe events into from various systems (email, text, Github pull requests, calendars, etc), create triggers that map events to templates that feed into prompts, and execute those prompts with tool-aware agents. Combine that with the ability for other agents to register themselves with you and expose the tools that they have access to (e.g. schedule an event with the tool’s owner), and a bunch of interesting things become very accessible with a very modest amount of effort:

- You could schedule events between two busy people’s calendars, as if both of them had an assistant managing their calendar
- Reply to your own pull requests with new blog posts, providing feedback on typos and grammatical issues
- Crawl websites you care about and identify posts you might be interested in
- Ask the model to generate a system model using lethain:systems, run that model, then chart the responses
- Add a “planning tool” which allows the model to generate a plan to guide subsequent steps in a complex task. (e.g. getting my calendar, getting a friend’s calendar, suggesting a time we could meet)

None of these are exactly lifesaving, but each is somewhat useful, and I imagine there are many more fairly obvious ideas that become easy once you have the necessary scaffolding to make this sort of thing easy.

Altogether, I think that I am convinced at this point that agents, using current foundational models, are going to create a number of very interesting experiences that improve our day to day lives in small ways that are, in aggregate, pretty transformational. I’m less convinced that this is the way all software should work going forward though, but more thoughts on that over time. (A bunch of fun experiments happening at work, but early days on those.)
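The post doesn’t show the dispatch side, but its shape is straightforward. Here’s a hedged sketch – assuming the openai Python client’s chat-completions interface, reusing the FUNCTION_REGISTRY and TOOL_USAGE_REGISTRY names from the snippets above, with a placeholder model name – of a loop that executes whatever tool calls the model requests and feeds the results back in until it replies in plain text:

import json
from openai import OpenAI

client = OpenAI()

def run_with_tools(messages, model="gpt-4o-mini"):
    """Keep calling the model until it stops asking for tools, executing each
    requested tool via FUNCTION_REGISTRY and appending the result to history."""
    tools = list(TOOL_USAGE_REGISTRY.values())
    while True:
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # plain-text answer, we're done

        messages.append(message)  # keep the assistant's tool-call turn in history
        for call in message.tool_calls:
            fn = FUNCTION_REGISTRY[call.function.name]
            args = json.loads(call.function.arguments or "{}")
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": fn(**args),
            })

# e.g. run_with_tools([{"role": "user", "content": "What should I wear tomorrow in San Francisco, CA?"}])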

19 hours ago 1 votes
Stanislav Petrov

A lieutenant colonel in the Soviet Air Defense Forces prevented the end of human civilization on September 26th, 1983. His name was Stanislav Petrov.

Protocol dictated that the Soviet Union would retaliate against any nuclear strikes sent by the United States. This was a policy of mutually assured destruction, a doctrine that compels a horrifying logical conclusion. The second and third stage effects of this type of exchange would be even more catastrophic. Allies for each side would likely be pulled into the conflict. The resulting nuclear winter was projected to lead to 2 billion deaths due to starvation. This is to say nothing about those who would have been unfortunate enough to have survived.

Petrov’s job was to monitor Oko, the computerized warning system built to centralize Soviet satellite communications. Around midnight, he received a report that one of the satellites had detected the infrared signature of a single launch of a United States ICBM. While Petrov was deciding what to do about this report, the system detected four more incoming missile launches. He had minutes to make a choice about what to do. It is impossible to imagine the amount of pressure placed on him at this moment. Source: Stanislav Petrov, Soviet officer credited with averting nuclear war, dies at 77 by Schwartzreport.

Petrov lived in a world of deterministic systems. The technologies that powered these warning systems produce guaranteed outputs, provided the proper inputs are supplied. However, deterministic does not mean infallible. The only reason you are alive and reading this is because Petrov understood that the systems he observed were capable of error. He was suspicious of what he was seeing reported, and chose not to escalate a retaliatory strike. There were two factors guiding his decision:

- A surprise attack would most likely have used hundreds of missiles, and not just five.
- The allegedly foolproof Oko system was new and prone to errors.

An error in a deterministic system can still lead to expected outputs being generated. For the Oko system, infrared reflections of the sun shining off of the tops of clouds created a false positive that was interpreted as detection of a nuclear launch event. Source: US-K History by Kosmonavtika.

The concept of erroneous truth is a deep thing to internalize, as computerized systems are presented as omniscient, indefective, and absolute. Petrov’s rewards for this action were reprimands, reassignment, and denial of promotion. This was likely for embarrassing his superiors by the politically inconvenient shedding of light on issues with the Oko system. A coerced early retirement caused a nervous breakdown, likely from having to grapple with the weight of his decision. It was only in the 1990s—after the fall of the Soviet Union—that his actions were discovered internationally and celebrated. Stanislav Petrov was given the recognition that he deserved, including being honored by the United Nations, awarded the Dresden Peace Prize, featured in a documentary, and being able to visit a Minuteman Missile silo in the United States.

On January 31st, 2025, OpenAI struck a deal with the United States government to use its AI product for nuclear weapon security. It is unclear how this technology will be used, where, and to what extent. It is also unclear how OpenAI’s systems function, as they are black box technologies. What is known is that LLM-generated responses—the product OpenAI sells—are non-deterministic.
Non-deterministic systems don’t have guaranteed outputs from their inputs. In addition, LLM-based technology hallucinates—it invents content with no self-knowledge that it is a falsehood. Non-deterministic systems that are computerized also have the perception as being authoritative, the same as their deterministic peers. It is not a question of how the output is generated, it is one of the output being perceived to come from a machine. These are terrifying things to know. Consider not only the systems this technology is being applied to, but also the thoughtless speed of their integration. Then consider how we’ve historically been conditioned and rewarded to interpret the output of these systems, and then how we perceive and treat skeptics. We don’t live in a purely deterministic world of technology anymore. Stanislav Petrov died on September 18th, 2017, before this change occurred. I would be incredibly curious to know his thoughts about our current reality, as well as the increasing abdication of human monitoring of automated systems in favor of notably biased, supposed “AI solutions.” In acknowledging Petrov’s skepticism in a time of mania and political instability, we acknowledge a quote from former U.S. Secretary of Defense William J. Perry’s memoir about the incident: [Oko’s false positives] illustrates the immense danger of placing our fate in the hands of automated systems that are susceptible to failure and human beings who are fallible.

yesterday 8 votes
01 · A spreadsheet for exploring scenarios

In our *Ambsheets* project, we are exploring a small extension to the familiar spreadsheet: **what if a single spreadsheet cell could hold multiple values at once**?

yesterday 3 votes