Are Efficiency and Horizontal Scalability at odds?

from Computer Things [alt+shift+b] in programming

Sorry for missing the newsletter last week! I started writing on Monday as normal, and by Wednesday the piece (about the hierarchy of controls ) was 2000 words and not close to done. So now it'll be a blog post sometime later this month. I also just released a new version of Logic for Programmers! 0.7 adds a bunch of new content (type invariants, modeling access policies, rewrites of the first chapters) but more importantly has new fonts that are more legible than the old ones. Go check it out! For this week's newsletter I want to brainstorm an idea I've been noodling over for a while. Say we have a computational task, like running a simulation or searching a very large graph, and it's taking too long to complete on a computer. There's generally three things that we can do to make it faster: Buy a faster computer ("vertical scaling") Modify the software to use the computer's resources better ("efficiency") Modify the software to use multiple computers ("horizontal scaling") (Splitting...

4 months ago

Remove from reading list Add to reading list [alt+a] Read now [→]

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from Computer Things

Logical Quantifiers in Software

I realize that for all I've talked about Logic for Programmers in this newsletter, I never once explained basic logical quantifiers. They're both simple and incredibly useful, so let's do that this week! Sets and quantifiers A set is a collection of unordered, unique elements. {1, 2, 3, …} is a set, as are "every programming language", "every programming language's Wikipedia page", and "every function ever defined in any programming language's standard library". You can put whatever you want in a set, with some very specific limitations to avoid certain paradoxes.2 Once we have a set, we can ask "is something true for all elements of the set" and "is something true for at least one element of the set?" IE, is it true that every programming language has a set collection type in the core language? We would write it like this: # all of them all l in ProgrammingLanguages: HasSetType(l) # at least one some l in ProgrammingLanguages: HasSetType(l) This is the notation I use in the book because it's easy to read, type, and search for. Mathematicians historically had a few different formats; the one I grew up with was ∀x ∈ set: P(x) to mean all x in set, and ∃ to mean some. I use these when writing for just myself, but find them confusing to programmers when communicating. "All" and "some" are respectively referred to as "universal" and "existential" quantifiers. Some cool properties We can simplify expressions with quantifiers, in the same way that we can simplify !(x && y) to !x || !y. First of all, quantifiers are commutative with themselves. some x: some y: P(x,y) is the same as some y: some x: P(x, y). For this reason we can write some x, y: P(x,y) as shorthand. We can even do this when quantifying over different sets, writing some x, x' in X, y in Y instead of some x, x' in X: some y in Y. We can not do this with "alternating quantifiers": all p in Person: some m in Person: Mother(m, p) says that every person has a mother. some m in Person: all p in Person: Mother(m, p) says that someone is every person's mother. Second, existentials distribute over || while universals distribute over &&. "There is some url which returns a 403 or 404" is the same as "there is some url which returns a 403 or some url that returns a 404", and "all PRs pass the linter and the test suites" is the same as "all PRs pass the linter and all PRs pass the test suites". Finally, some and all are duals: some x: P(x) == !(all x: !P(x)), and vice-versa. Intuitively: if some file is malicious, it's not true that all files are benign. All these rules together mean we can manipulate quantifiers almost as easily as we can manipulate regular booleans, putting them in whatever form is easiest to use in programming. Speaking of which, how do we use this in in programming? How we use this in programming First of all, people clearly have a need for directly using quantifiers in code. If we have something of the form: for x in list: if P(x): return true return false That's just some x in list: P(x). And this is a prevalent pattern, as you can see by using GitHub code search. It finds over 500k examples of this pattern in Python alone! That can be simplified via using the language's built-in quantifiers: the Python would be any(P(x) for x in list). (Note this is not quantifying over sets but iterables. But the idea translates cleanly enough.) More generally, quantifiers are a key way we express higher-level properties of software. What does it mean for a list to be sorted in ascending order? That all i, j in 0..<len(l): if i < j then l[i] <= l[j]. When should a ratchet test fail? When some f in functions - exceptions: Uses(f, bad_function). Should the image classifier work upside down? all i in images: classify(i) == classify(rotate(i, 180)). These are the properties we verify with tests and types and MISU and whatnot;1 it helps to be able to make them explicit! One cool use case that'll be in the book's next version: database invariants are universal statements over the set of all records, like all a in accounts: a.balance > 0. That's enforceable with a CHECK constraint. But what about something like all i, i' in intervals: NoOverlap(i, i')? That isn't covered by CHECK, since it spans two rows. Quantifier duality to the rescue! The invariant is equivalent to !(some i, i' in intervals: Overlap(i, i')), so is preserved if the query SELECT COUNT(*) FROM intervals CROSS JOIN intervals … returns 0 rows. This means we can test it via a database trigger.3 There are a lot more use cases for quantifiers, but this is enough to introduce the ideas! Next week's the one year anniversary of the book entering early access, so I'll be writing a bit about that experience and how the book changed. It's crazy how crude v0.1 was compared to the current version. MISU ("make illegal states unrepresentable") means using data representations that rule out invalid values. For example, if you have a location -> Optional(item) lookup and want to make sure that each item is in exactly one location, consider instead changing the map to item -> location. This is a means of implementing the property all i in item, l, l' in location: if ItemIn(i, l) && l != l' then !ItemIn(i, l'). ↩ Specifically, a set can't be an element of itself, which rules out constructing things like "the set of all sets" or "the set of sets that don't contain themselves". ↩ Though note that when you're inserting or updating an interval, you already have that row's fields in the trigger's NEW keyword. So you can just query !(some i in intervals: Overlap(new, i')), which is more efficient. ↩

8 hours ago • 2 votes

You can cheat a test suite with a big enough polynomial

Hi nerds, I'm back from Systems Distributed! I'd heartily recommend it, wildest conference I've been to in years. I have a lot of work to catch up on, so this will be a short newsletter. In an earlier version of my talk, I had a gag about unit tests. First I showed the test f([1,2,3]) == 3, then said that this was satisfied by f(l) = 3, f(l) = l[-1], f(l) = len(l), f(l) = (129*l[0]-34*l[1]-617)*l[2] - 443*l[0] + 1148*l[1] - 182. Then I progressively rule them out one by one with more unit tests, except the last polynomial which stubbornly passes every single test. If you're given some function of f(x: int, y: int, …): int and a set of unit tests asserting specific inputs give specific outputs, then you can find a polynomial that passes every single unit test. To find the gag, and as SMT practice, I wrote a Python program that finds a polynomial that passes a test suite meant for max. It's hardcoded for three parameters and only finds 2nd-order polynomials but I think it could be generalized with enough effort. The code Full code here, breakdown below. from z3 import * # type: ignore s1, s2 = Solver(), Solver() Z3 is just the particular SMT solver we use, as it has good language bindings and a lot of affordances. As part of learning SMT I wanted to do this two ways. First by putting the polynomial "outside" of the SMT solver in a python function, second by doing it "natively" in Z3. I created two solvers so I could test both versions in one run. a0, a, b, c, d, e, f = Consts('a0 a b c d e f', IntSort()) x, y, z = Ints('x y z') t = "a*x+b*y+c*z+d*x*y+e*x*z+f*y*z+a0" Both Const('x', IntSort()) and Int('x') do the exact same thing, the latter being syntactic sugar for the former. I did not know this when I wrote the program. To keep the two versions in sync I represented the equation as a string, which I later eval. This is one of the rare cases where eval is a good idea, to help us experiment more quickly while learning. The polynomial is a "2nd-order polynomial", even though it doesn't have x^2 terms, as it has xy and xz terms. lambdamax = lambda x, y, z: eval(t) z3max = Function('z3max', IntSort(), IntSort(), IntSort(), IntSort()) s1.add(ForAll([x, y, z], z3max(x, y, z) == eval(t))) lambdamax is pretty straightforward: create a lambda with three parameters and eval the string. The string "a*x" then becomes the python expression a*x, a is an SMT symbol, while the x SMT symbol is shadowed by the lambda parameter. To reiterate, a terrible idea in practice, but a good way to learn faster. z3max function is a little more complex. Function takes an identifier string and N "sorts" (roughly the same as programming types). The first N-1 sorts define the parameters of the function, while the last becomes the output. So here I assign the string identifier "z3max" to be a function with signature (int, int, int) -> int. I can load the function into the model by specifying constraints on what z3max could be. This could either be a strict input/output, as will be done later, or a ForAll over all possible inputs. Here I just use that directly to say "for all inputs, the function should match this polynomial." But I could do more complicated constraints, like commutativity (f(x, y) == f(y, x)) or monotonicity (Implies(x < y, f(x) <= f(y))). Note ForAll takes a list of z3 symbols to quantify over. That's the only reason we need to define x, y, z in the first place. The lambda version doesn't need them. inputs = [(1,2,3), (4, 2, 2), (1, 1, 1), (3, 5, 4)] for g in inputs: s1.add(z3max(*g) == max(*g)) s2.add(lambdamax(*g) == max(*g)) This sets up the joke: adding constraints to each solver that the polynomial it finds must, for a fixed list of triplets, return the max of each triplet. for s, func in [(s1, z3max), (s2, lambdamax)]: if s.check() == sat: m = s.model() for x, y, z in inputs: print(f"max([{x}, {y}, {z}]) =", m.evaluate(func(x, y, z))) print(f"max([x, y, z]) = {m[a]}x + {m[b]}y", f"+ {m[c]}z +", # linebreaks added for newsletter rendering f"{m[d]}xy + {m[e]}xz + {m[f]}yz + {m[a0]}\n") Output: max([1, 2, 3]) = 3 # etc max([x, y, z]) = -133x + 130y + -10z + -2xy + 62xz + -46yz + 0 max([1, 2, 3]) = 3 # etc max([x, y, z]) = -17x + 16y + 0z + 0xy + 8xz + -6yz + 0 I find that z3max (top) consistently finds larger coefficients than lambdamax does. I don't know why. Practical Applications Test-Driven Development recommends a strict "red-green refactor" cycle. Write a new failing test, make the new test pass, then go back and refactor. Well, the easiest way to make the new test pass would be to paste in a new polynomial, so that's what you should be doing. You can even do this all automatically: have a script read the set of test cases, pass them to the solver, and write the new polynomial to your code file. All you need to do is write the tests! Pedagogical Notes Writing the script took me a couple of hours. I'm sure an LLM could have whipped it all up in five minutes but I really want to learn SMT and LLMs may decrease learning retention.1 Z3 documentation is not... great for non-academics, though, and most other SMT solvers have even worse docs. One useful trick I use regularly is to use Github code search to find code using the same APIs and study how that works. Turns out reading API-heavy code is a lot easier than writing it! Anyway, I'm very, very slowly feeling like I'm getting the basics on how to use SMT. I don't have any practical use cases yet, but I wanted to learn this skill for a while and glad I finally did. Caveat I have not actually read the study, for all I know it could have a sample size of three people, I'll get around to it eventually ↩

a week ago • 11 votes

Solving LinkedIn Queens with SMT

No newsletter next week I’ll be speaking at Systems Distributed. My talk isn't close to done yet, which is why this newsletter is both late and short. Solving LinkedIn Queens in SMT The article Modern SAT solvers: fast, neat and underused claims that SAT solvers1 are "criminally underused by the industry". A while back on the newsletter I asked "why": how come they're so powerful and yet nobody uses them? Many experts responded saying the reason is that encoding SAT kinda sucked and they rather prefer using tools that compile to SAT. I was reminded of this when I read Ryan Berger's post on solving “LinkedIn Queens” as a SAT problem. A quick overview of Queens. You’re presented with an NxN grid divided into N regions, and have to place N queens so that there is exactly one queen in each row, column, and region. While queens can be on the same diagonal, they cannot be adjacently diagonal. (Important note: Linkedin “Queens” is a variation on the puzzle game Star Battle, which is the same except the number of stars you place in each row/column/region varies per puzzle, and is usually two. This is also why 'queens' don’t capture like chess queens.) Ryan solved this by writing Queens as a SAT problem, expressing properties like "there is exactly one queen in row 3" as a large number of boolean clauses. Go read his post, it's pretty cool. What leapt out to me was that he used CVC5, an SMT solver.2 SMT solvers are "higher-level" than SAT, capable of handling more data types than just boolean variables. It's a lot easier to solve the problem at the SMT level than at the SAT level. To show this, I whipped up a short demo of solving the same problem in Z3 (via the Python API). Full code here, which you can compare to Ryan's SAT solution here. I didn't do a whole lot of cleanup on it (again, time crunch!), but short explanation below. The code from z3 import * # type: ignore from itertools import combinations, chain, product solver = Solver() size = 9 # N Initial setup and modules. size is the number of rows/columns/regions in the board, which I'll call N below. # queens[n] = col of queen on row n # by construction, not on same row queens = IntVector('q', size) SAT represents the queen positions via N² booleans: q_00 means that a Queen is on row 0 and column 0, !q_05 means a queen isn't on row 0 col 5, etc. In SMT we can instead encode it as N integers: q_0 = 5 means that the queen on row 0 is positioned at column 5. This immediately enforces one class of constraints for us: we don't need any constraints saying "exactly one queen per row", because that's embedded in the definition of queens! (Incidentally, using 0-based indexing for the board was a mistake on my part, it makes correctly encoding the regions later really painful.) To actually make the variables [q_0, q_1, …], we use the Z3 affordance IntVector(str, n) for making n variables at once. solver.add([And(0 <= i, i < size) for i in queens]) # not on same column solver.add(Distinct(queens)) First we constrain all the integers to [0, N), then use the incredibly handy Distinct constraint to force all the integers to have different values. This guarantees at most one queen per column, which by the pigeonhole principle means there is exactly one queen per column. # not diagonally adjacent for i in range(size-1): q1, q2 = queens[i], queens[i+1] solver.add(Abs(q1 - q2) != 1) One of the rules is that queens can't be adjacent. We already know that they can't be horizontally or vertically adjacent via other constraints, which leaves the diagonals. We only need to add constraints that, for each queen, there is no queen in the lower-left or lower-right corner, aka q_3 != q_2 ± 1. We don't need to check the top corners because if q_1 is in the upper-left corner of q_2, then q_2 is in the lower-right corner of q_1! That covers everything except the "one queen per region" constraint. But the regions are the tricky part, which we should expect because we vary the difficulty of queens games by varying the regions. regions = { "purple": [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0), (8, 0), (1, 1), (8, 1)], "red": [(1, 2), (2, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (6, 2), (7, 1), (7, 2), (8, 2), (8, 3),], # you get the picture } # Some checking code left out, see below The region has to be manually coded in, which is a huge pain. (In the link, some validation code follows. Since it breaks up explaining the model I put it in the next section.) for r in regions.values(): solver.add(Or( *[queens[row] == col for (row, col) in r] )) Finally we have the region constraint. The easiest way I found to say "there is exactly one queen in each region" is to say "there is a queen in region 1 and a queen in region 2 and a queen in region 3" etc." Then to say "there is a queen in region purple" I wrote "q_0 = 0 OR q_0 = 1 OR … OR q_1 = 0 etc." Why iterate over every position in the region instead of doing something like (0, q[0]) in r? I tried that but it's not an expression that Z3 supports. if solver.check() == sat: m = solver.model() print([(l, m[l]) for l in queens]) Finally, we solve and print the positions. Running this gives me: [(q__0, 0), (q__1, 5), (q__2, 8), (q__3, 2), (q__4, 7), (q__5, 4), (q__6, 1), (q__7, 3), (q__8, 6)] Which is the correct solution to the queens puzzle. I didn't benchmark the solution times, but I imagine it's considerably slower than a raw SAT solver. Glucose is really, really fast. But even so, solving the problem with SMT was a lot easier than solving it with SAT. That satisfies me as an explanation for why people prefer it to SAT. Sanity checks One bit I glossed over earlier was the sanity checking code. I knew for sure that I was going to make a mistake encoding the region, and the solver wasn't going to provide useful information abut what I did wrong. In cases like these, I like adding small tests and checks to catch mistakes early, because the solver certainly isn't going to catch them! all_squares = set(product(range(size), repeat=2)) def test_i_set_up_problem_right(): assert all_squares == set(chain.from_iterable(regions.values())) for r1, r2 in combinations(regions.values(), 2): assert not set(r1) & set(r2), set(r1) & set(r2) The first check was a quick test that I didn't leave any squares out, or accidentally put the same square in both regions. Converting the values into sets makes both checks a lot easier. Honestly I don't know why I didn't just use sets from the start, sets are great. def render_regions(): colormap = ["purple", "red", "brown", "white", "green", "yellow", "orange", "blue", "pink"] board = [[0 for _ in range(size)] for _ in range(size)] for (row, col) in all_squares: for color, region in regions.items(): if (row, col) in region: board[row][col] = colormap.index(color)+1 for row in board: print("".join(map(str, row))) render_regions() The second check is something that prints out the regions. It produces something like this: 111111111 112333999 122439999 124437799 124666779 124467799 122467899 122555889 112258899 I can compare this to the picture of the board to make sure I got it right. I guess a more advanced solution would be to print emoji squares like 🟥 instead. Neither check is quality code but it's throwaway and it gets the job done so eh. "Boolean SATisfiability Solver", aka a solver that can find assignments that make complex boolean expressions true. I write a bit more about them here. ↩ "Satisfiability Modulo Theories" ↩

2 weeks ago • 15 votes

AI is a gamechanger for TLA+ users

New Logic for Programmers Release v0.10 is now available! This is a minor release, mostly focused on logic-based refactoring, with new material on set types and testing refactors are correct. See the full release notes at the changelog page. Due to conference pressure v0.11 will also likely be a minor release. AI is a gamechanger for TLA+ users TLA+ is a specification language to model and debug distributed systems. While very powerful, it's also hard for programmers to learn, and there's always questions of connecting specifications with actual code. That's why The Coming AI Revolution in Distributed Systems caught my interest. In the post, Cheng Huang claims that Azure successfully used LLMs to examine an existing codebase, derive a TLA+ spec, and find a production bug in that spec. "After a decade of manually crafting TLA+ specifications", he wrote, "I must acknowledge that this AI-generated specification rivals human work". This inspired me to experiment with LLMs in TLA+ myself. My goals are a little less ambitious than Cheng's: I wanted to see how LLMs could help junior specifiers write TLA+, rather than handling the entire spec automatically. Details on what did and didn't work below, but my takeaway is that LLMs are an immense specification force multiplier. All tests were done with a standard VSCode Copilot subscription, writing Claude 3.7 in Agent mode. Other LLMs or IDEs may be more or less effective, etc. Things Claude was good at Fixing syntax errors TLA+ uses a very different syntax than mainstream programming languages, meaning beginners make a lot of mistakes where they do a "programming syntax" instead of TLA+ syntax: NotThree(x) = \* should be ==, not = x != 3 \* should be #, not != The problem is that the TLA+ syntax checker, SANY, is 30 years old and doesn't provide good information. Here's what it says for that snippet: Was expecting "==== or more Module body" Encountered "NotThree" at line 6, column 1 That only isolates one error and doesn't tell us what the problem is, only where it is. Experienced TLA+ users get "error eyes" and can quickly see what the problem is, but beginners really struggle with this. The TLA+ foundation has made LLM integration a priority, so the VSCode extension naturally supports several agents actions. One of these is running SANY, meaning an agent can get an error, fix it, get another error, fix it, etc. Provided the above sample and asked to make it work, Claude successfully fixed both errors. It also fixed many errors in a larger spec, as well as figure out why PlusCal specs weren't compiling to TLA+. This by itself is already enough to make LLMs a worthwhile tool, as it fixes one of the biggest barriers to entry. Understanding error traces When TLA+ finds a violated property, it outputs the sequence of steps that leads to the error. This starts in plaintext, and VSCode parses it into an interactive table: Learning to read these error traces is a skill in itself. You have to understand what's happening in each step and how it relates back to the actually broken property. It takes a long time for people to learn how to do this well. Claude was successful here, too, accurately reading 20+ step error traces and giving a high-level explanation of what went wrong. It also could condense error traces: if ten steps of the error trace could be condensed into a one-sentence summary (which can happen if you're modeling a lot of process internals) Claude would do it. I did have issues here with doing this in agent mode: while the extension does provide a "run model checker" command, the agent would regularly ignore this and prefer to run a terminal command instead. This would be fine except that the LLM consistently hallucinated invalid commands. I had to amend every prompt with "run the model checker via vscode, do not use a terminal command". You can skip this if you're willing to copy and paste the error trace into the prompt. As with syntax checking, if this was the only thing LLMs could effectively do, that would already be enough1 to earn a strong recommend. Even as a TLA+ expert I expect I'll be using this trick regularly. Boilerplate tasks TLA+ has a lot of boilerplate. One of the most notorious examples is UNCHANGED rules. Specifications are extremely precise — so precise that you have to specify what variables don't change in every step. This takes the form of an UNCHANGED clause at the end of relevant actions: RemoveObjectFromStore(srv, o, s) == /\ o \in stored[s] /\ stored' = [stored EXCEPT ![s] = @ \ {o}] /\ UNCHANGED <<capacity, log, objectsize, pc>> Writing this is really annoying. Updating these whenever you change an action, or add a new variable to the spec, is doubly so. Syntax checking and error analysis are important for beginners, but this is what I wanted for myself. I took a spec and prompted Claude Add UNCHANGED <> for each variable not changed in an action. And it worked! It successfully updated the UNCHANGED in every action. (Note, though, that it was a "well-behaved" spec in this regard: only one "action" happened at a time. In TLA+ you can have two actions happen simultaneously, that each update half of the variables, meaning neither of them should have an UNCHANGED clause. I haven't tested how Claude handles that!) That's the most obvious win, but Claude was good at handling other tedious work, too. Some examples include updating vars (the conventional collection of all state variables), lifting a hard-coded value into a model parameter, and changing data formats. Most impressive to me, though, was rewriting a spec designed for one process to instead handle multiple processes. This means taking all of the process variables, which originally have types like Int, converting them to types like [Process -> Int], and then updating the uses of all of those variables in the spec. It didn't account for race conditions in the new concurrent behavior, but it was an excellent scaffold to do more work. Writing properties from an informal description You have to be pretty precise with your intended property description but it handles converting that precise description into TLA+'s formalized syntax, which is something beginners often struggle with. Things it is less good at Generating model config files To model check TLA+, you need both a specification (.tla) and a model config file (.cfg), which have separate syntaxes. Asking the agent to generate the second often lead to it using TLA+ syntax. It automatically fixed this after getting parsing errors, though. Fixing specs Whenever the ran model checking and discovered a bug, it would naturally propose a change to either the invalid property or the spec. Sometimes the changes were good, other times the changes were not physically realizable. For example, if it found that a bug was due to a race condition between processes, it would often suggest fixing it by saying race conditions were okay. I mean yes, if you say bugs are okay, then the spec finds that bugs are okay! Or it would alternatively suggest adding a constraint to the spec saying that race conditions don't happen. But that's a huge mistake in specification, because race conditions happen if we don't have coordination. We need to specify the mechanism that is supposed to prevent them. Finding properties of the spec After seeing how capable it was at translating my properties to TLA+, I started prompting Claude to come up with properties on its own. Unfortunately, almost everything I got back was either trivial, uninteresting, or too coupled to implementation details. I haven't tested if it would work better to ask it for "properties that may be violated". Generating code from specs I have to be specific here: Claude could sometimes convert Python into a passable spec, an vice versa. It wasn't good at recognizing abstraction. For example, TLA+ specifications often represent sequential operations with a state variable, commonly called pc. If modeling code that nonatomically retrieves a counter value and increments it, we'd have one action that requires pc = "Get" and sets the new value to "Inc", then another that requires it be "Inc" and sets it to "Done". I found that Claude would try to somehow convert pc into part of the Python program's state, rather than recognize it as a TLA+ abstraction. On the other side, when converting python code to TLA+ it would often try to translate things like sleep into some part of the spec, not recognizing that it is abstractable into a distinct action. I didn't test other possible misconceptions, like converting randomness to nondeterminism. For the record, when converting TLA+ to Python Claude tended to make simulators of the spec, rather than possible production code implementing the spec. I really wasn't expecting otherwise though. Unexplored Applications Things I haven't explored thoroughly but could possibly be effective, based on what I know about TLA+ and AI: Writing Java Overrides Most TLA+ operators are resolved via TLA+ interpreters, but you can also implement them in "native" Java. This lets you escape the standard language semantics and add capabilities like executing programs during model-checking or dynamically constrain the depth of the searched state space. There's a lot of cool things I think would be possible with overrides. The problem is there's only a handful of people in the world who know how to write them. But that handful have written quite a few overrides and I think there's enough there for Claude to work with. Writing specs, given a reference mechanism In all my experiments, the LLM only had my prompts and the occasional Python script as information. That makes me suspect that some of its problems with writing and fixing specs come down to not having a system model. Maybe it wouldn't suggest fixes like "these processes never race" if it had a design doc saying that the processes can't coordinate. (Could a Sufficiently Powerful LLM derive some TLA+ specification from a design document?) Connecting specs and code This is the holy grail of TLA+: taking a codebase and showing it correctly implements a spec. Currently the best ways to do this are by either using TLA+ to generate a test suite, or by taking logged production traces and matching them to TLA+ behaviors. This blog post discusses both. While I've seen a lot of academic research into these approaches there are no industry-ready tools. So if you want trace validation you have to do a lot of manual labour tailored to your specific product. If LLMs could do some of this work for us then that'd really amplify the usefulness of TLA+ to many companies. Thoughts Right now, agents seem good at the tedious and routine parts of TLA+ and worse at the strategic and abstraction parts. But, since the routine parts are often a huge barrier to beginners, this means that LLMs have the potential to make TLA+ far, far more accessible than it previously was. I have mixed thoughts on this. As an advocate, this is incredible. I want more people using formal specifications because I believe it leads to cheaper, safer, more reliable software. Anything that gets people comfortable with specs is great for our industry. As a professional TLA+ consultant, I'm worried that this obsoletes me. Most of my income comes from training and coaching, which companies will have far less demand of now. Then again, maybe this an opportunity to pitch "agentic TLA+ training" to companies! Anyway, if you're interested in TLA+, there has never been a better time to try it. I mean it, these tools handle so much of the hard part now. I've got a free book available online, as does the inventor of TLA+. I like this guide too. Happy modeling! Dayenu. ↩

3 weeks ago • 13 votes

What does "Undecidable" mean, anyway

Systems Distributed I'll be speaking at Systems Distributed next month! The talk is brand new and will aim to showcase some of the formal methods mental models that would be useful in mainstream software development. It has added some extra stress on my schedule, though, so expect the next two monthly releases of Logic for Programmers to be mostly minor changes. What does "Undecidable" mean, anyway Last week I read Against Curry-Howard Mysticism, which is a solid article I recommend reading. But this newsletter is actually about one comment: I like to see posts like this because I often feel like I can’t tell the difference between BS and a point I’m missing. Can we get one for questions like “Isn’t XYZ (Undecidable|NP-Complete|PSPACE-Complete)?” I've already written one of these for NP-complete, so let's do one for "undecidable". Step one is to pull a technical definition from the book Automata and Computability: A property P of strings is said to be decidable if ... there is a total Turing machine that accepts input strings that have property P and rejects those that do not. (pg 220) Step two is to translate the technical computer science definition into more conventional programmer terms. Warning, because this is a newsletter and not a blog post, I might be a little sloppy with terms. Machines and Decision Problems In automata theory, all inputs to a "program" are strings of characters, and all outputs are "true" or "false". A program "accepts" a string if it outputs "true", and "rejects" if it outputs "false". You can think of this as automata studying all pure functions of type f :: string -> boolean. Problems solvable by finding such an f are called "decision problems". This covers more than you'd think, because we can bootstrap more powerful functions from these. First, as anyone who's programmed in bash knows, strings can represent any other data. Second, we can fake non-boolean outputs by instead checking if a certain computation gives a certain result. For example, I can reframe the function add(x, y) = x + y as a decision problem like this: IS_SUM(str) { x, y, z = split(str, "#") return x + y == z } Then because IS_SUM("2#3#5") returns true, we know 2 + 3 == 5, while IS_SUM("2#3#6") is false. Since we can bootstrap parameters out of strings, I'll just say it's IS_SUM(x, y, z) going forward. A big part of automata theory is studying different models of computation with different strengths. One of the weakest is called "DFA". I won't go into any details about what DFA actually can do, but the important thing is that it can't solve IS_SUM. That is, if you give me a DFA that takes inputs of form x#y#z, I can always find an input where the DFA returns true when x + y != z, or an input which returns false when x + y == z. It's really important to keep this model of "solve" in mind: a program solves a problem if it correctly returns true on all true inputs and correctly returns false on all false inputs. (total) Turing Machines A Turing Machine (TM) is a particular type of computation model. It's important for two reasons: By the Church-Turing thesis, a Turing Machine is the "upper bound" of how powerful (physically realizable) computational models can get. This means that if an actual real-world programming language can solve a particular decision problem, so can a TM. Conversely, if the TM can't solve it, neither can the programming language.1 It's possible to write a Turing machine that takes a textual representation of another Turing machine as input, and then simulates that Turing machine as part of its computations. Property (1) means that we can move between different computational models of equal strength, proving things about one to learn things about another. That's why I'm able to write IS_SUM in a pseudocode instead of writing it in terms of the TM computational model (and why I was able to use split for convenience). Property (2) does several interesting things. First of all, it makes it possible to compose Turing machines. Here's how I can roughly ask if a given number is the sum of two primes, with "just" addition and boolean functions: IS_SUM_TWO_PRIMES(z): x := 1 y := 1 loop { if x > z {return false} if IS_PRIME(x) { if IS_PRIME(y) { if IS_SUM(x, y, z) { return true; } } } y := y + 1 if y > x { x := x + 1 y := 0 } } Notice that without the if x > z {return false}, the program would loop forever on z=2. A TM that always halts for all inputs is called total. Property (2) also makes "Turing machines" a possible input to functions, meaning that we can now make decision problems about the behavior of Turing machines. For example, "does the TM M either accept or reject x within ten steps?"2 IS_DONE_IN_TEN_STEPS(M, x) { for (i = 0; i < 10; i++) { `simulate M(x) for one step` if(`M accepted or rejected`) { return true } } return false } Decidability and Undecidability Now we have all of the pieces to understand our original definition: A property P of strings is said to be decidable if ... there is a total Turing machine that accepts input strings that have property P and rejects those that do not. (220) Let IS_P be the decision problem "Does the input satisfy P"? Then IS_P is decidable if it can be solved by a Turing machine, ie, I can provide some IS_P(x) machine that always accepts if x has property P, and always rejects if x doesn't have property P. If I can't do that, then IS_P is undecidable. IS_SUM(x, y, z) and IS_DONE_IN_TEN_STEPS(M, x) are decidable properties. Is IS_SUM_TWO_PRIMES(z) decidable? Some analysis shows that our corresponding program will either find a solution, or have x>z and return false. So yes, it is decidable. Notice there's an asymmetry here. To prove some property is decidable, I need just to need to find one program that correctly solves it. To prove some property is undecidable, I need to show that any possible program, no matter what it is, doesn't solve it. So with that asymmetry in mind, do are there any undecidable problems? Yes, quite a lot. Recall that Turing machines can accept encodings of other TMs as input, meaning we can write a TM that checks properties of Turing machines. And, by Rice's Theorem, almost every nontrivial semantic3 property of Turing machines is undecidable. The conventional way to prove this is to first find a single undecidable property H, and then use that to bootstrap undecidability of other properties. The canonical and most famous example of an undecidable problem is the Halting problem: "does machine M halt on input i?" It's pretty easy to prove undecidable, and easy to use it to bootstrap other undecidability properties. But again, any nontrivial property is undecidable. Checking a TM is total is undecidable. Checking a TM accepts any inputs is undecidable. Checking a TM solves IS_SUM is undecidable. Etc etc etc. What this doesn't mean in practice I often see the halting problem misconstrued as "it's impossible to tell if a program will halt before running it." This is wrong. The halting problem says that we cannot create an algorithm that, when applied to an arbitrary program, tells us whether the program will halt or not. It is absolutely possible to tell if many programs will halt or not. It's possible to find entire subcategories of programs that are guaranteed to halt. It's possible to say "a program constructed following constraints XYZ is guaranteed to halt." The actual consequence of undecidability is more subtle. If we want to know if a program has property P, undecidability tells us We will have to spend time and mental effort to determine if it has P We may not be successful. This is subtle because we're so used to living in a world where everything's undecidable that we don't really consider what the counterfactual would be like. In such a world there might be no need for Rust, because "does this C program guarantee memory-safety" is a decidable property. The entire field of formal verification could be unnecessary, as we could just check properties of arbitrary programs directly. We could automatically check if a change in a program preserves all existing behavior. Lots of famous math problems could be solved overnight. (This to me is a strong "intuitive" argument for why the halting problem is undecidable: a halt detector can be trivially repurposed as a program optimizer / theorem-prover / bcrypt cracker / chess engine. It's too powerful, so we should expect it to be impossible.) But because we don't live in that world, all of those things are hard problems that take effort and ingenuity to solve, and even then we often fail. To be pendantic, a TM can't do things like "scrape a webpage" or "render a bitmap", but we're only talking about computational decision problems here. ↩ One notation I've adopted in Logic for Programmers is marking abstract sections of pseudocode with backticks. It's really handy! ↩ Nontrivial meaning "at least one TM has this property and at least one TM doesn't have this property". Semantic meaning "related to whether the TM accepts, rejects, or runs forever on a class of inputs". IS_DONE_IN_TEN_STEPS is not a semantic property, as it doesn't tell us anything about inputs that take longer than ten steps. ↩

a month ago • 15 votes

More in programming

Logical Quantifiers in Software

8 hours ago • 2 votes

The missing part of Espressif’s reset circuit

In the previous article, we peeked at the reset circuit of ESP-Prog with an oscilloscope, and reproduced it with basic components. We observed that it did not behave quite as expected. In this article, we’ll look into the missing pieces. An incomplete circuit For a hint, we’ll first look a bit more closely at the … Continue reading The missing part of Espressif’s reset circuit → The post The missing part of Espressif’s reset circuit appeared first on Quentin Santos.

8 hours ago • 2 votes

Setting Element Ordering With HTML Rewriter Using CSS

After shipping my work transforming HTML with Netlify’s edge functions I realized I have a little bug: the order of the icons specified in the URL doesn’t match the order in which they are displayed on screen. Why’s this happening? I have a bunch of links in my HTML document, like this: <icon-list> <a href="/1/">…</a> <a href="/2/">…</a> <a href="/3/">…</a>  </icon-list> I use html-rewriter in my edge function to strip out the HTML for icons not specified in the URL. So for a request to: /lookup?id=1&id=2 My HTML will be transformed like so: <icon-list>  <a href="/1/">…</a> <a href="/2/">…</a>  <a href="/3/">…</a> </icon-list> Resulting in less HTML over the wire to the client. But what about the order of the IDs in the URL? What if the request is to: /lookup?id=2&id=1 Instead of: /lookup?id=1&id=2 In the source HTML document containing all the icons, they’re marked up in reverse chronological order. But the request for this page may specify a different order for icons in the URL. So how do I rewrite the HTML to match the URL’s ordering? The problem is that html-rewriter doesn’t give me a fully-parsed DOM to work with. I can’t do things like “move this node to the top” or “move this node to position x”. With html-rewriter, you only “see” each element as it streams past. Once it passes by, your chance at modifying it is gone. (It seems that’s just the way these edge function tools are designed to work, keeps them lean and performant and I can’t shoot myself in the foot). So how do I change the icon’s display order to match what’s in the URL if I can’t modify the order of the elements in the HTML? CSS to the rescue! Because my markup is just a bunch of <a> tags inside a custom element and I’m using CSS grid for layout, I can use the order property in CSS! All the IDs are in the URL, and their position as parameters has meaning, so I assign their ordering to each element as it passes by html-rewriter. Here’s some pseudo code: // Get all the IDs in the URL const ids = url.searchParams.getAll("id"); // Select all the icons in the HTML rewriter.on("icon-list a", { element: (element) => { // Get the ID const id = element.getAttribute('id'); // If it's in our list, set it's order // position from the URL if (ids.includes(id)) { const order = ids.indexOf(id); element.setAttribute( "style", `order: ${order}` ); // Otherwise, remove it } else { element.remove(); } }, }); Boom! I didn’t have to change the order in the source HTML document, but I can still get the displaying ordering to match what’s in the URL. I love shifty little workarounds like this! Email · Mastodon · Bluesky

9 hours ago • 2 votes

clamp / median / range

Here are a few tangentially-related ideas vaguely near the theme of comparison operators. comparison style clamp style clamp is median clamp in range range style style clash? comparison style Some languages such as BCPL, Icon, Python have chained comparison operators, like if min <= x <= max: ... In languages without chained comparison, I like to write comparisons as if they were chained, like, if min <= x && x <= max { // ... } A rule of thumb is to prefer less than (or equal) operators and avoid greater than. In a sequence of comparisons, order values from (expected) least to greatest. clamp style The clamp() function ensures a value is between some min and max, def clamp(min, x, max): if x < min: return min if max < x: return max return x I like to order its arguments matching the expected order of the values, following my rule of thumb for comparisons. (I used that flavour of clamp() in my article about GCRA.) But I seem to be unusual in this preference, based on a few examples I have seen recently. clamp is median Last month, Fabian Giesen pointed out a way to resolve this difference of opinion: A function that returns the median of three values is equivalent to a clamp() function that doesn’t care about the order of its arguments. This version is written so that it returns NaN if any of its arguments is NaN. (When an argument is NaN, both of its comparisons will be false.) fn med3(a: f64, b: f64, c: f64) -> f64 { match (a <= b, b <= c, c <= a) { (false, false, false) => f64::NAN, (false, false, true) => b, // a > b > c (false, true, false) => a, // c > a > b (false, true, true) => c, // b <= c <= a (true, false, false) => c, // b > c > a (true, false, true) => a, // c <= a <= b (true, true, false) => b, // a <= b <= c (true, true, true) => b, // a == b == c } } When two of its arguments are constant, med3() should compile to the same code as a simple clamp(); but med3()’s misuse-resistance comes at a small cost when the arguments are not known at compile time. clamp in range If your language has proper range types, there is a nicer way to make clamp() resistant to misuse: fn clamp(x: f64, r: RangeInclusive<f64>) -> f64 { let (&min,&max) = (r.start(), r.end()); if x < min { return min } if max < x { return max } return x; } let x = clamp(x, MIN..=MAX); range style For a long time I have been fond of the idea of a simple counting for loop that matches the syntax of chained comparisons, like for min <= x <= max: ... By itself this is silly: too cute and too ad-hoc. I’m also dissatisfied with the range or slice syntax in basically every programming language I’ve seen. I thought it might be nice if the cute comparison and iteration syntaxes were aspects of a more generally useful range syntax, but I couldn’t make it work. Until recently when I realised I could make use of prefix or mixfix syntax, instead of confining myself to infix. So now my fantasy pet range syntax looks like >= min < max // half-open >= min <= max // inclusive And you might use it in a pattern match if x is >= min < max { // ... } Or as an iterator for x in >= min < max { // ... } Or to take a slice xs[>= min < max] style clash? It’s kind of ironic that these range examples don’t follow the left-to-right, lesser-to-greater rule of thumb that this post started off with. (x is not lexically between min and max!) But that rule of thumb is really intended for languages such as C that don’t have ranges. Careful stylistic conventions can help to avoid mistakes in nontrivial conditional expressions. It’s much better if language and library features reduce the need for nontrivial conditions and catch mistakes automatically.

yesterday • 2 votes

C++ engineering decision in SumatraPDF code

SumatraPDF is a medium size (120k+ loc, not counting dependencies) Windows GUI (win32) C++ code base started by me and written by mostly 2 people. The goals of SumatraPDF are to be: fast small packed with features and yet with thoughtfully minimal UI It’s not just a matter of pride in craftsmanship of writing code. I believe being fast and small are a big reason for SumatraPDF’s success. People notice when an app starts in an instant because that’s sadly not the norm in modern software. The engineering goals of SumatraPDF are: reliable (no crashes) fast compilation to enable fast iteration SumatraPDF has been successful achieving those objectives so I’m writing up my C++ implementation decisions. I know those decisions are controversial. Maybe not Terry Davis level of controversial but still. You probably won’t adopt them. Even if you wanted to, you probably couldn’t. There’s no way code like this would pass Google review. Not because it’s bad but becaues it’s different. Diverging from mainstream this much is only feasible if you have total control: it’s your company or your own open-source project. If my ideas were just like everyone else’s ideas, there would be little point in writing about them, would it? Use UTF8 strings internally My app only runs on Windows and a string native to Windows is WCHAR* where each character consumes 2 bytes. Despite that I mostly use char* assumed to be utf8-encoded. I only decided on that after lots of code was written so it was a refactoring oddysey that is still ongoing. My initial impetus was to be able to compile non-GUI parts under Linux and Mac. I abandoned that goal but I think that’s a good idea anyway. WCHAR* strings are 2x larger than char*. That’s more memory used which also makes the app slower. Binaries are bigger if string constants are WCHAR*. The implementation rule is simple: I only convert to WCHAR* when calling Windows API. When Windows API returns WCHA* I convert it to utf-8. No exceptions Do you want to hear a joke? “Zero-cost exceptions”. Throwing and catching exceptions generate bloated code. Exceptions are a non-local control flow that makes it hard to reason about program. Every memory allocation becomes a potential leak. But RAII, you protest. RAII is a “solution” to a problem created by exceptions. How about I don’t create the problem in the first place. Hard core #include discipline I wrote about it in depth. My objects are not shy I don’t bother with private and protected. struct is just class with guts exposed by default, so I use that. While intellectually I understand the reasoning behind hiding implementation details in practices it becomes busy work of typing noise and then even more typing when you change your mind about visibility. I’m the only person working on the code so I don’t need to force those of lesser intellect to write the code properly. My objects are shy At the same time I minimize what goes into a class, especially methods. The smaller the class, the faster the build. A common problem is adding too many methods to a class. You have a StrVec class for array of strings. A lesser programmer is tempted to add Join(const char* sep) method to StrVec. A wise programmer makes it a stand-alone function: Join(const StrVec& v, const char* sep). This is enabled by making everything in a class public. If you limit visibility you then have to use friendto allow Join() function access what it needs. Another example of “solution” to self-inflicted problems. Minimize #ifdef #ifdef is problematic because it creates code paths that I don’t always build. I provide arm64, intel 32-bit and 64-bit builds but typically only develop with 64-bit intel build. Every #ifdef that branches on architecture introduces potential for compilation error which I’ll only know about when my daily ci build fails. Consider 2 possible implementations of IsProcess64Bit(): Bad: bool IsProcess64Bit() { #ifdef _WIN64 return true; #else return false; #endif } Good: bool IsProcess64Bit() { return sizeof(uintptr_t) == 8; } The bad version has a bug: it was correct when I was only doing intel builds but became buggy when I added arm64 builds. This conflicts with the goal of smallest possible size but it’s worth it. Stress testing SumatraPDF supports a lot of very complex document and image formats. Complex format require complex code that is likely to have bugs. I also have lots of files in those formats. I’ve added stress testing functionality where I point SumatraPDF to a folder with files and tell it to render all of them. For greater coverage, I also simulate some of the possible UI actions users can take like searching, switching view modes etc. Crash reporting I wrote about it in depth. Heavy use of CrashIf() C/C++ programmers are familiar with assert() macro. CrashIf() is my version of that, tailored to my needs. The purpose of assert / CrashIf is to add checks to detect incorrect use of APIs or invalid states in the program. For example, if the code tries to access an element of an array at an invalid index (negative or larger than size of the array), it indicates a bug in the program. I want to be notified about such bugs both when I test SumatraPDF and when it runs on user’s computers. As the name implies, it’ll crash (by de-referencing null pointer) and therefore generate a crash report. It’s enabled in debug and pre-release builds but not in release builds. Release builds have many, many users so I worry about too many crash reports. premake to generate Visual Studio solution Visual Studio uses XML files as a list of files in the project and build format. The format is impossible to work with in a text editor so you have no choice but to use Visual Studio to edit the project / solution. To add a new file: find the right UI element, click here, click there, pick a file using file picker, click again. To change a compilation setting of a project or a file? Find the right UI element, click here, click there, type this, confirm that. You accidentally changed compilation settings of 1 file out of a hundred? Good luck figuring out which one. Go over all files in UI one by one. In other words: managing project files using Visual Studio UI is a nightmare. Premake is a solution. It’s a meta-build system. You define your build using lua scripts, which look like test configuration files. Premake then can generate Visual Studio projects, XCode project, makefiles etc. That’s the meta part. It was truly a life server on project with lots of files (SumatraPDF’s own are over 300, many times more for third party libraries). Using /analyze and cppcheck cppcheck and /analyze flag in cl.exe are tools to find bugs in C++ code via static analysis. They are like a C++ compiler but instead of generating code, they analyze control flow in a program to find potential programs. It’s a cheap way to find some bugs, so there’s no excuse to not run them from time to time on your code. Using asan builds Address Sanitizer (asan) is a compiler flag /fsanitize=address that instruments the code with checks for common memory-related bugs like using an object after freeing it, over-writing values on the stack, freeing an object twice, writing past allocated memory. The downside of this instrumentation is that the code is much slower due to overhead of instrumentation. I’ve created a project for release build with asan and run it occasionally, especially in stress test. Write for the debugger Programmers love to code golf i.e. put us much code on one line as possible. As if lines of code were expensive. Many would write: Bad: // ... return (char*)(start + offset); I write: Good: // ... char* s = (char*)(start + offset); return s; Why? Imagine you’re in a debugger stepping through a debug build of your code. The second version makes it trivial to set a breakpoint at return s line and look at the value of s. The first doesn’t. I don’t optimize for smallest number of lines of code but for how easy it is to inspect the state of the program in the debugger. In practice it means that I intentionally create intermediary variables like s in the example above. Do it yourself standard library I’m not using STL. Yes, I wrote my own string and vector class. There are several reasons for that. Historical reason When I started SumatraPDF over 15 years ago STL was crappy. Bad APIs Today STL is still crappy. STL implementations improved greatly but the APIs still suck. There’s no API to insert something in the middle of a string or a vector. I understand the intent of separation of data structures and algorithms but I’m a pragmatist and to my pragmatist eyes v.insert (v.begin(), myarray, myarray+3); is just stupid compared to v.inert(3, el). Code bloat STL is bloated. Heavy use of templates leads to lots of generated code i.e. surprisingly large binaries for supposedly low-level language. That bloat is invisible i.e. you won’t know unless you inspect generated binaries, which no one does. The bloat is out of my control. Even if I notice, I can’t fix STL classes. All I can do is to write my non-bloaty alternative, which is what I did. Slow compilation times Compilation of C code is not fast but it feels zippy compared to compilation of C++ code. Heavy use of templates is big part of it. STL implementations are over-templetized and need to provide all the C++ support code (operators, iterators etc.). As a pragmatist, I only implement the absolute minimum functionality I use in my code. I minimize use of templates. For example Str and WStr could be a single template but are 2 implementations. I don’t understand C++ I understand the subset of C++ I use but the whole of C++ is impossibly complicated. For example I’ve read a bunch about std::move() and I’m not confident I know how to use it correctly and that’s just one of many complicated things in C++. C++ is too subtle and I don’t want my code to be a puzzle. Possibility of optimized implementations I wrote a StrVec class that is optimized for storing vector of strings. It’s more efficient than std::vector<std::string> by a large margin and I use it extensively. Temporary allocator and pool allocators I use temporary allocators heavily. They make the code faster and smaller. Technically STL has support for non-standard allocators but the API is so bad that I would rather not. My temporary allocator and pool allocators are very small and simple and I can add support for them only when beneficial. Minimize unsigned int STL and standard C library like to use size_t and other unsigned integers. I think it was a mistake. Go shows that you can just use int. Having two types leads to cast-apalooza. I don’t like visual noise in my code. Unsigned are also more dangerous. When you substract you can end up with a bigger value. Indexing from end is subtle, for (int i = n; i >= 0; i--) is buggy because i >= 0 is always true for unsigned. Sadly I only realized this recently so there’s a lot of code still to refactor to change use of size_t to int. Mostly raw pointers No std::unique_ptr for me. Warnings are errors C++ makes a distinction between compilation errors and compilation warnings. I don’t like sloppy code and polluting build output with warning messages so for my own code I use a compiler flag that turns warnings into errors, which forces me to fix the warnings.

yesterday • 2 votes

New here?