More from On Test Automation
Last weekend, I wrote a more or less casual post on LinkedIn containing the ‘rules’ (it’s more of a list of terms and conditions, really) I set for myself when it comes to using AI. That post received some interesting comments that made me think and refine my thoughts on when (not) to use AI to support me in my work. Thank you to all of you who commented for doing so, and for showing me that there still is value in being active on LinkedIn in between all the AI-generated ‘content’. I really appreciate it. Now, AI and LLMs like ChatGPT or Claude can be very useful, that is, when used prudently. I think it is very important to be conscious and cautious when it comes to using AI, though, which is why I wrote that post. I wrote it mostly for myself, to structure my thoughts around AI, but also because I think it is important that others are at least conscious of what they’re doing and working with. That doesn’t mean you have to adhere to or even agree with my views and the way I use these tools, by the way. Different strokes for different folks. Because of the ephemeral nature of these LinkedIn posts, and the importance of the topic to me, I want to repeat the ‘rules’ (again, more of a T&C list) I wrote down here. This is the original, unchanged list from the post I wrote on February 15: I only use it to support me in completing tasks I understand. I need to be able to scrutinize the output the AI system produces and see if it is both sound and fit for the purpose I want to use it for. I never use it to explain to me something I don’t know yet or don’t understand enough. I have seen and read about too many hallucinations to trust them to teach me what I don’t understand. Instead, I use books, articles, and other content from authors and sources I do trust if I’m looking to learn something new. I never EVER use it for creative work. I don’t use AI-generated images anywhere, and all of my blogs, LinkedIn posts, comments, course material and other written text are 100% my own, warts and all. My views, my ideas, my voice. Interestingly, most of the comments were written in reaction to the first two bullet points at the time I wrote this blog post. I don’t know exactly why this is the case, it might be because the people who read it agree (which I doubt seeing the tsunami of AI-generated content that’s around these days), or maybe because there’s a bit of stigma around admitting to use AI for content generation. I don’t know. What I do know is that it is an important principle to me. I wrote about the reasons for that in an earlier blog post, so I won’t repeat myself here. Like so many terms and conditions, the list I wrote down in this post will probably evolve over time, but what will not change is me remaining very careful around where I use and where I don’t use AI to help me in my work. Especially now that the speed with which new developments in the AI space are presented to us and the claims around what it can and will do only get bigger, I think it is wise to remain cautious and look at these developments with a critical and very much human view.
When I build and release new features or bug fixes for RestAssured.Net, I rely heavily on the acceptance tests that I wrote over time. Next to serving as living documentation for the library, I run these tests both locally and on every push to GitHub to see if I didn’t accidentally break something, for different versions of .NET. But how reliable are these tests really? Can I trust them to pass and fail when they should? Did I cover all the things that are important? I speak, write and teach about the importance of testing your tests on a regular basis, so it makes sense to start walking the talk and get more insight into the quality of the RestAssured.Net test suite. One approach to learning more about the quality of your tests is through a technique called mutation testing. I speak about and demo testing your tests and using mutation testing to do so on a regular basis (you can watch a recent talk here), but until now, I’ve pretty much exclusively used PITest for Java. As RestAssured.Net is a C# library, I can’t use PITest, but I’d heard many good things about Stryker.NET, so this would be a perfect opportunity to finally use it. Adding Stryker.NET to the RestAssured.Net project The first step was to add Stryker.Net to the RestAssured.Net project. Stryker.NET is a dotnet tool, so installing it is straightforward: run dotnet new tool-manifest to create a new, project-specific tool manifest (this was the first local dotnet tool for this project) and then dotnet tool install dotnet-stryker to add Stryker.NET as a dotnet tool to the project. Running mutation tests for the first time Running mutation tests with Stryker.NET is just as straightforward: dotnet stryker --project RestAssured.Net.csproj from the tests project folder is all it takes. Because both my test suite (about 200 tests) and the project itself are relatively small code bases, and because my test suite runs quickly, running mutation tests for my entire project works for me. It still took around five minutes for the process to complete. If you have a larger code base, and longer-running test suites, you’ll see that mutation testing will take much, much longer. In that case, it’s probably best to start on a subset of your code base and a subset of your test suite. After five minutes and change, the results are in: Stryker.NET created 538 mutants from my application code base. Of these: 390 were killed, that is, at least one test failed because of this mutation, 117 survived, that is, the change did not make any of the tests fail, and 31 resulted in a timeout, which I’ll need to investigate further, but I suspect it has something to do with HTTP timeouts (RestAssured.Net is an HTTP API testing library, and all acceptance tests perform actual HTTP requests) This leads to an overall mutation testing score of 59.97%. Is that good? Is that bad? In all honesty, I don’t know, and I don’t care. Just like with code coverage, I am not a fan of setting fixed targets for this type of metric, as these will typically lead to writing tests for the sake of improving a score rather than for actual improvement of the code. What I am much more interested in is the information that Stryker.NET produced during the mutation testing process. Opening the HTML report I was surprised to see that out of the box, Stryker.NET produces a very good-looking and incredibly helpful HTML report. It provides both a high-level overview of the results: as well as in-depth detail for every mutant that was killed or that survived. It offers a breakdown of the results per namespace and per class, and it is the starting point for further drilling down into results for individual mutants. Let’s have a look and see if the report provides some useful, actionable information for us to improve the RestAssured.Net test suite. Missing coverage Like many other mutation testing tools, Stryker.NET provides code coverage information along with mutation coverage information. That is, if there is code in the application code base that was mutated, but that is not covered by any of the tests, Stryker.NET will inform you about it. Here’s an example: Stryker.NET changed the message of an exception thrown when RestAssured.Net is asked to deserialize a response body that is either null or empty. Apparently, there is no test in the test suite that covers this path in the code. As this particular code path deals with exception handling, it’s probably a good idea to add a test for it: [Test] public void EmptyResponseBodyThrowsTheExpectedException() { var de = Assert.Throws<DeserializationException>(() => { Location responseLocation = (Location)Given() .When() .Get($"{MOCK_SERVER_BASE_URL}/empty-response-body") .DeserializeTo(typeof(Location)); }); Assert.That(de?.Message, Is.EqualTo("Response content is null or empty.")); } I added the corresponding test in this commit. Removed code blocks Another type of mutant that Stryker.NET generates is the removal of a code block. Going by the mutation testing report, it seems like there are a few of these mutants that are not detected by any of the tests. Here’s an example: The return statement for the Put() method body, which is used to perform an HTTP PUT operation, is replaced with an empty method body, but this is not picked up by any of the tests. The same applies to the methods for HTTP PATCH, DELETE, HEAD and OPTIONS. Looking at the tests that cover the different HTTP verbs, this makes sense. While I do call each of these HTTP methods in a test, I don’t assert on the result for the aforementioned HTTP verbs. I am basically relying on the fact that no exception is thrown when I call Put() when I say ‘it works’. Let’s change that by at least asserting on a property of the response that is returned when these HTTP verbs are used: [Test] public void HttpPutCanBeUsed() { Given() .When() .Put($"{MOCK_SERVER_BASE_URL}/http-put") .Then() .StatusCode(200); } These assertions were added to the RestAssured.Net test suite in this commit. Improving testability The next signal I received from this initial mutation testing run is an interesting one. It tells me that even though I have acceptance tests that add cookies to the request and that only pass when the request contains the cookies I set, I’m not properly covering some logic that I added: To understand what is going on here, it is useful to know that a Cookie in C# offers a constructor that creates a Cookie specifying only a name and a value, but that a cookie has to have a domain value set. To enforce that, I added the logic you see in the screenshot. However, Stryker.NET tells me I’m not properly testing this logic, because changing its implementation doesn’t cause any tests to fail. Now, I might be able to test this specific logic with a few added acceptance tests, but it really is only a small piece of logic, and I should be able to test that logic in isolation, right? Well, not with the code written in the way it currently is… So, time to extract that piece of logic into a class of its own, which will improve both the modularity of the code and allow me to test it in isolation. First, let’s extract the logic into a CookieUtils class: internal class CookieUtils { internal Cookie SetDomainFor(Cookie cookie, string hostname) { if (string.IsNullOrEmpty(cookie.Domain)) { cookie.Domain = hostname; } return cookie; } } I deliberately made this class internal as I don’t want it to be directly accessible to RestAssured.Net users. However, as I do need to access it in the tests, I have to add this little snippet to the RestAssured.Net.csproj file: <ItemGroup> <InternalsVisibleTo Include="$(MSBuildProjectName).Tests" /> </ItemGroup> Now, I can add unit tests that should cover both paths in the SetDomainFor() logic: [Test] public void CookieDomainIsSetToDefaultValueWhenNotSpecified() { Cookie cookie = new Cookie("cookie_name", "cookie_value"); CookieUtils cookieUtils = new CookieUtils(); cookie = cookieUtils.SetDomainFor(cookie, "localhost"); Assert.That(cookie.Domain, Is.EqualTo("localhost")); } [Test] public void CookieDomainIsUnchangedWhenSpecifiedAlready() { Cookie cookie = new Cookie("cookie_name", "cookie_value", "/my_path", "strawberry.com"); CookieUtils cookieUtils = new CookieUtils(); cookie = cookieUtils.SetDomainFor(cookie, "localhost"); Assert.That(cookie.Domain, Is.EqualTo("strawberry.com")); } These changes were added to the RestAssured.Net source and test code in this commit. An interesting mutation So far, all the signals that appeared in the mutation testing report generated by Stryker.NET have been valuable, as in: they have pointed me at code that isn’t covered by any tests yet, to tests that could be improved, and they have led to code refactoring to improve testability. Using Stryker.NET (and mutation testing in general) does sometimes lead to some, well, interesting mutations, like this one: I’m checking that a certain string is either null or an empty string, and if either condition is true, RestAssured.Net throws an exception. Perfectly valid. However, Stryker.NET changes the logical OR to a logical AND (a common mutation), which makes it impossible for the condition to evaluate to true. Is that even a useful mutation to make? Well, to some extent, it is. Even if the code doesn’t make sense anymore after it has been mutated, it does tell you that your tests for this logical condition probably need some improvement. In this case, I don’t have to add more tests, as we discussed this exact statement earlier (remember that it had no test coverage at all). It did make me look at this statement once again, though, and I only then realized that I could simplify this code snippet to if (string.IsNullOrEmpty(responseBodyAsString)) { throw new DeserializationException("Response content is null or empty."); } Instead of a custom-built logical OR, I am now using a construct built into C#, which is arguably the safer choice. In general, if your mutation testing tool generates several (or even many) mutants for the same code statement or block, it might be a good idea to have another look at that code and see if it can be simplified. This was just a very small example, but I think this observation holds true in general. This change was added to the RestAssured.Net source and test code in this commit. Running mutation tests again and inspecting the results Now that several (supposed) improvements to the tests and the code have been made, let’s run the mutation tests another time to see if the changes improved our score. In short: 397 mutants were killed now, up from 390 (that’s good) 111 mutants survived, down from 117 (that’s also good) there were 32 timeouts, up from 31 (that needs some further investigation) Overall, the mutation testing score went up from 59,97% to 61,11%. This might not seem like much, but it is definitely a step in the right direction. The most important thing for me right now is that my tests for RestAssured.Net have improved, my code has improved and I learned a lot about mutation testing and Stryker.NET in the process. Am I going to run mutation tests every time I make a change? Probably not. There is quite a lot of information to go through, and that takes time, time that I don’t want to spend for every build. For that reason, I’m also not going to make these mutation tests part of the build and test pipeline for RestAssured.Net, at least not any time soon. This was nonetheless both a very valuable and a very enjoyable exercise, and I’ll definitely keep improving the tests and the code for RestAssured.Net using the suggestions that Stryker.NET presents.
This blog post is another one in the ‘writing things down to structure my thinking on where I want my career to go’ series. I will get back to writing technical and automation blog posts soon, but I need to finish my contract testing course first. One of the things I like to do most in life is traveling and seeing new places. Well, seeing new places, mostly, as the novelty of waiting, flying and staying in hotel rooms has definitely worn off by now. I am in the privileged position (really, that is what it is: I’m privileged, and I fully realize that) that I get to scratch this travel itch professionally on a regular basis these days. Over the last few years, I have been invited to contribute to meetups and conferences abroad, and I also get to run in-house training sessions with companies outside the Netherlands a couple of times per year. Most of this traveling takes place within Europe, but for the last three years, I have been able to travel outside of Europe once every year (South Africa in 2022, Canada in 2023 and the United States in 2024), and needless to say I have enjoyed those opportunities very much. To give you an idea of the amount of traveling I do: for 2025, I now have four work-related trips abroad scheduled, and I am pretty sure at least a few more will be added to that before the year ends (it’s only just February…). That might not be much travel by some people’s standards, but for me, it is. And it seems the number of opportunities I get for traveling increase year over year, to the point where I have to say ‘no’ to several of these opportunities. Say no? Why? I thought you just said you loved to travel? Yes, that’s true. I do love to travel. But I also love spending time at home with my family, and that comes first. Always. Now, my sons are getting older, and being away from home for a few days doesn’t put as much pressure on them and on my wife as it did a few years ago. Still, I always need to find a balance between spending time with them and spending time at work. I am away from home for work not just when I’m abroad. I run evening training sessions with clients here in the Netherlands on a regular basis, too, as well as training sessions in my evenings for clients in different time zones, mainly US-based clients. And all that adds up. I try to only be away from home one night per week, but often, it’s two. When I travel abroad, it’s even more than that. Again, I’m not complaining. Not at all. It is an absolute privilege to get to travel for work and get paid to do that, but I cannot do that indefinitely, and that’s why I have made a decision: With a few exceptions (more on those below), I am going to say ‘no’ to conferences abroad from now on. This is a tough decision for me to make, but sometimes that’s exactly what you need to do. Tough, because I have very fond memories of all the conferences and meetups abroad I have contributed to. My first one, Romanian Testing Conference in 2017. My first keynote abroad, UKStar in 2019. My first one outside of Europe, Targeting Quality in 2023. They were all amazing, because of the travel and sightseeing (when time allowed), but also because of all the people I have met at these conferences. Yet, I can meet at least some of these people at conferences here in the Netherlands, too. Test Automation Days, the TestNet events, the Dutch Testing Day and TestMass all provide a great opportunity for me to catch up with my network. Sometimes, international conferences come to the Netherlands, too, like AutomationSTAR this year. And then there are plenty of smaller meetups here in the Netherlands (and Belgium) where I can meet and catch up with people as well. Plus, the money. I am not going to be a hypocrite and say that money doesn’t play into this. For the reasons mentioned above, I have a limited number of opportunities to travel every year, and I prefer to spend those on in-house training sessions with clients abroad, simply because the pay is much better. Even when a conference compensates flights and hotel (as they should) and offer a speaker or workshop facilitator fee (a nice bonus), it will be significantly less of a payday than when I run a training session with a client. That’s not the fault of those conferences, not at all, especially when they’re compensating their speakers fairly, but this is simply a matter of numbers and budgets. At the moment, I have one, maybe two contributions to conferences abroad coming up, and I gave them my word, so I’ll be there. That’s the SAST 30-year anniversary conference in October, plus one other conference that I’m talking to but haven’t received a ‘yes’ or ‘no’ from yet. Other than that, if conferences reach out to me, it’s likely to be a ‘no’ from now on, unless: the event pays a fee comparable to my rate for in-house training I can combine the event with paid in-house training (for example with a sponsor) it is a country or region I really, really want to visit, either for personal reasons or because I want to grow my professional network there I don’t see the first one happening soon, and the list of destinations for the third one is very short (Norway, Canada, New Zealand, that’s pretty much it), so unless we can arrange paid in-house training alongside the conference, the answer will be a ‘no’ from me. Will this reduce the number of travel opportunities for me? Maybe. Maybe not. Again, I see the number of requests I get for in-house training abroad growing, too, and if that dies down, it’ll be a sign for me that I’ll have to work harder to create those opportunities. For 2025, things are looking pretty good, with trips for training to Romania, North Macedonia and Denmark already scheduled, and several leads for more in the pipeline. And if the number of opportunities does go down, that’s fine, too. I’m happy to spend that time with family, working on other things, or riding my bike. And I’m sure there will be a few opportunities to speak at online meetups, events and webinars, too.
As a (sort of) follow-up post to my yearly review for 2024, in this post, I would like to go over the changes, bug fixes and new features that have been introduced in RestAssured .NET in 2024. This year, I released 7 new versions of the library, and while none of the versions included changes that were worthy of a blog post on its own, I thought it would be a good idea to wrap them all up in a single overview. Basically, this blog post is an extended version of the library’s CHANGELOG. I’ll go through the new versions chronologically, starting with the first release of 2024. Version 4.2.2 - released April 23 RestAssured .NET 4.2.2 fixes a bug that prevented JSON responses that are an array to be properly verified. In other words, if the JSON response body looks like this: [ { "id": 1, "text": "Do the dishes" }, { "id": 2, "text": "Clean out the trash" }, { "id": 3, "text": "Read the newspaper" } ] I would expect this test to pass: [Test] public void JsonArrayResponseBodyElementCanBeVerifiedUsingNHamcrestMatcher() { Given() .When() .Get("http://localhost:9876/json-array-response-body") .Then() .StatusCode(200) .Body("$[1].text", NHamcrest.Is.EqualTo("Clean out the trash")); } but prior to this version, it threw a Newtonsoft.Json.JsonReaderException. The solution? Adding a try-catch that first tries to parse the JSON response as a JObject (equal to existing behaviour), catch the JsonReaderException and try again, now parsing the JSON response into a JArray. That made the newly added test pass without failing any other tests. Another demonstration of the added value of having a decent set of tests. RestAssured .NET is slowly growing and becoming more complex, and having a test suite I can run locally, and that always runs when I push code to GitHub is an invaluable safety net for me. These tests run in a few seconds, yet they give me invaluable feedback on the effect of new features, bug fixes and code refactoring efforts. I haven’t heard back from the person submitting the original issue, but I assume that this fixed their issue. Version 4.3.0 - released August 16 I love learning about how people use RestAssured .NET, because invariably they will use it in ways I haven’t foreseen. I was unfamiliar with the concept of server-sent events (SSE) in APIs, for example, yet there are people looking to test these kinds of APIs using RestAssured .NET. It turned out that what this user was looking for was a way to set the HttpCompletionOption value on the System.Net.Http.HttpClient that is wrapped by RestAssured .NET. To enable this, I added a method to the DSL that looks like this: Given() .UseHttpCompletionOption(HttpCompletionOption.ResponseHeadersRead) I also added the option to specify the HttpCompletionOption to be used in a RequestSpecification as well as in the global config. A straightforward fix that solved the problem for this specific user. The only thing I don’t like here is that I don’t know of a way to test this locally. Do you? I would love to hear it. Version 4.3.1 - release August 22 Another user pointed out to me that trying to verify that the value of a JSON response body element is an empty array also threw an exception. So, if the JSON response body looks like this: { "success": true, "errors": [] } this test should pass, but instead it threw a Newtonsoft.Json.JsonSerializationException: [Test] public void JsonResponseBodyElementEmptyArrayValueCanBeVerifiedUsingNHamcrestMatcher() { Given() .When() .Get("http://localhost:9876/json-empty-array-response-body") .Then() .StatusCode(200) .Body("$.errors", NHamcrest.Is.OfLength(0)); } The fix? Adding some code that checks if the element returned when evaluating the JsonPath expression is a JArray or a JObject and using the right matching logic accordingly. I used my preferred procedure here: first, write a failing test that reproduces the issue then, make the test pass without breaking any other tests refactor the code, document and release Does this procedure sound familiar to you? Version 4.4.0 - released October 21 As you can probably tell from the semantic versioning, this version introduced a new feature to RestAssured .NET: the ability to use NTLM authentication when making an HTTP call. To enable this, I added a new method to the DSL: Given() .NtlmAuth() // This one uses default NTLM credentials for the current user .NtlmAuth("username", "password", "domain") // This one uses custom NTLM credentials As I had no idea how to write a proper test for this, even though I had tested it before releasing using Fiddler, I released a beta version first that the person submitting the issue could use to verify the solution. I’m happy to say that it worked for them and that the solution could be released properly. Again, if someone can think of a way to add a proper test for NTLM authentication to the test suite, I would love to hear it. All that the current tests do is run the code and see if no exception is thrown. Not a good test, but until I find a better way, it will have to do. Version 4.5.0 - released November 19 This version introduced not one, but two changes. First, since .NET 9 was officially released earlier that week (or maybe the week before, I forgot), I needed to release a RestAssured .NET version that targets .NET 9, so I did. Just like with .NET 8, I didn’t really have to change anything to the code other than adding net9.0 to the TargetFrameworks and add .NET 9 to the build pipeline for the library to make sure that every change is tested on .NET 9, too. Happy to say it all ‘just worked’. The other change took more effort: a user reported that they could not override the ResponseLogLevel set in a RequestSpecification at the individual test level. The reason? In the existing code, the response was logged directly after the HTTP call completed, so before any calls to Log() for the response. When Log() is called on the response, it was then logged again. I have no idea how I completely overlooked this until now, but I did. Rewriting the code to make this work took longer than I expected, but I managed in the end, through quite a bit of trial and error and lots of humand-centered testing (again, no idea how to write automated tests for this). The logging functionality of RestAssured .NET is something I intend to rewrite in the future, for a couple of reasons: It’s impossible to write automated tests for it (or at least I don’t know how to do this) Ideally, I want the logging to be more configurable and extensible to give users more flexibility than they have at the moment Version 4.5.1 - released November 20 As one does, I found an issue with the updated logging logic almost immediately after releasing 4.5.0 to the public: masking of sensitive headers and cookies didn’t work anymore when specified as part of a RequestSpecification. Lucky for me, this was a quick fix, but a bit embarrassing nonetheless. Had I had proper automated tests for the logging in place, I probably would have caught this before releasing 4.5.0…. Anyway, it’s fixed now, as far as I can tell. Version 4.6.0 - released December 9 The final RestAssured .NET release of 2024 added the capability to strip the ; charset=<some_charset> from the Content-Type header in a request. It turns out, some APIs explicitly expect this header to not contain the charset suffix, but the way I create a request, or rather, the way .NET creates a StringContent object, will add it by default. This issue was a great example of one of the main reasons why I started this project: there is so much I don’t know yet about HTTP, APIs, C#/.NET and other technologies, and working on these issues and improving RestAssured .NET gives me an opportunity to learn them. I make a habit of writing what I learned down in the issue on GitHub, so I can review it later, and so I can point others to these links and thoughts, too. So, if you’re looking for a way to strip the charset identifier from the Content-Type header in the request, you can now do that by passing an optional second boolean argument to Body() (defaults to false): Given() .Body(your_body_goes_here, stripCharset: true) That’s it! As you can see, lots of small changes, bug fixes and new features have been added to RestAssured .NET this year. Oh, and before I forget: with every release, I also made sure to update the dependencies I use to create and test RestAssured .NET to their latest versions. I consider that good housekeeping, and it’s all part of keeping a library up to date. I am looking forward to seeing the library evolve and improve further in 2025.
More in programming
One of the biggest mistakes that new startup founders make is trying to get away from the customer-facing roles too early. Whether it's customer support or it's sales, it's an incredible advantage to have the founders doing that work directly, and for much longer than they find comfortable. The absolute worst thing you can do is hire a sales person or a customer service agent too early. You'll miss all the golden nuggets that customers throw at you for free when they're rejecting your pitch or complaining about the product. Seeing these reasons paraphrased or summarized destroy all the nutrients in their insights. You want that whole-grain feedback straight from the customers' mouth! When we launched Basecamp in 2004, Jason was doing all the customer service himself. And he kept doing it like that for three years!! By the time we hired our first customer service agent, Jason was doing 150 emails/day. The business was doing millions of dollars in ARR. And Basecamp got infinitely, better both as a market proposition and as a product, because Jason could funnel all that feedback into decisions and positioning. For a long time after that, we did "Everyone on Support". Frequently rotating programmers, designers, and founders through a day of answering emails directly to customers. The dividends of doing this were almost as high as having Jason run it all in the early years. We fixed an incredible number of minor niggles and annoying bugs because programmers found it easier to solve the problem than to apologize for why it was there. It's not easy doing this! Customers often offer their valuable insights wrapped in rude language, unreasonable demands, and bad suggestions. That's why many founders quit the business of dealing with them at the first opportunity. That's why few companies ever do "Everyone On Support". That's why there's such eagerness to reduce support to an AI-only interaction. But quitting dealing with customers early, not just in support but also in sales, is an incredible handicap for any startup. You don't have to do everything that every customer demands of you, but you should certainly listen to them. And you can't listen well if the sound is being muffled by early layers of indirection.
I’m sitting in a small coffee shop in Brooklyn. I have a warm drink, and it’s just started to snow outside. I’m visiting New York to see Operation Mincemeat on Broadway – I was at the dress rehearsal yesterday, and I’ll be at the opening preview tonight. I’ve seen this show more times than I care to count, and I hope US theater-goers love it as much as Brits. The people who make the show will tell you that it’s about a bunch of misfits who thought they could do something ridiculous, who had the audacity to believe in something unlikely. That’s certainly one way to see it. The musical tells the true story of a group of British spies who tried to fool Hitler with a dead body, fake papers, and an outrageous plan that could easily have failed. Decades later, the show’s creators would mirror that same spirit of unlikely ambition. Four friends, armed with their creativity, determination, and a wardrobe full of hats, created a new musical in a small London theatre. And after a series of transfers, they’re about to open the show under the bright lights of Broadway. But when I watch the show, I see a story about friendship. It’s about how we need our friends to help us, to inspire us, to push us to be the best versions of ourselves. I see the swaggering leader who needs a team to help him truly achieve. The nervous scientist who stands up for himself with the support of his friends. The enthusiastic secretary who learns wisdom and resilience from her elder. And so, I suppose, it’s fitting that I’m not in New York on my own. I’m here with friends – dozens of wonderful people who I met through this ridiculous show. At first, I was just an audience member. I sat in my seat, I watched the show, and I laughed and cried with equal measure. After the show, I waited at stage door to thank the cast. Then I came to see the show a second time. And a third. And a fourth. After a few trips, I started to see familiar faces waiting with me at stage door. So before the cast came out, we started chatting. Those conversations became a Twitter community, then a Discord, then a WhatsApp. We swapped fan art, merch, and stories of our favourite moments. We went to other shows together, and we hung out outside the theatre. I spent New Year’s Eve with a few of these friends, sitting on somebody’s floor and laughing about a bowl of limes like it was the funniest thing in the world. And now we’re together in New York. Meeting this kind, funny, and creative group of people might seem as unlikely as the premise of Mincemeat itself. But I believed it was possible, and here we are. I feel so lucky to have met these people, to take this ridiculous trip, to share these precious days with them. I know what a privilege this is – the time, the money, the ability to say let’s do this and make it happen. How many people can gather a dozen friends for even a single evening, let alone a trip halfway round the world? You might think it’s silly to travel this far for a theatre show, especially one we’ve seen plenty of times in London. Some people would never see the same show twice, and most of us are comfortably into double or triple-figures. Whenever somebody asks why, I don’t have a good answer. Because it’s fun? Because it’s moving? Because I enjoy it? I feel the need to justify it, as if there’s some logical reason that will make all of this okay. But maybe I don’t have to. Maybe joy doesn’t need justification. A theatre show doesn’t happen without people who care. Neither does a friendship. So much of our culture tells us that it’s not cool to care. It’s better to be detached, dismissive, disinterested. Enthusiasm is cringe. Sincerity is weakness. I’ve certainly felt that pressure – the urge to play it cool, to pretend I’m above it all. To act as if I only enjoy something a “normal” amount. Well, fuck that. I don’t know where the drive to be detached comes from. Maybe it’s to protect ourselves, a way to guard against disappointment. Maybe it’s to seem sophisticated, as if having passions makes us childish or less mature. Or perhaps it’s about control – if we stay detached, we never have to depend on others, we never have to trust in something bigger than ourselves. Being detached means you can’t get hurt – but you’ll also miss out on so much joy. I’m a big fan of being a big fan of things. So many of the best things in my life have come from caring, from letting myself be involved, from finding people who are a big fan of the same things as me. If I pretended not to care, I wouldn’t have any of that. Caring – deeply, foolishly, vulnerably – is how I connect with people. My friends and I care about this show, we care about each other, and we care about our joy. That care and love for each other is what brought us together, and without it we wouldn’t be here in this city. I know this is a once-in-a-lifetime trip. So many stars had to align – for us to meet, for the show we love to be successful, for us to be able to travel together. But if we didn’t care, none of those stars would have aligned. I know so many other friends who would have loved to be here but can’t be, for all kinds of reasons. Their absence isn’t for lack of caring, and they want the show to do well whether or not they’re here. I know they care, and that’s the important thing. To butcher Tennyson: I think it’s better to care about something you cannot affect, than to care about nothing at all. In a world that’s full of cynicism and spite and hatred, I feel that now more than ever. I’d recommend you go to the show if you haven’t already, but that’s not really the point of this post. Maybe you’ve already seen Operation Mincemeat, and it wasn’t for you. Maybe you’re not a theatre kid. Maybe you aren’t into musicals, or history, or war stories. That’s okay. I don’t mind if you care about different things to me. (Imagine how boring the world would be if we all cared about the same things!) But I want you to care about something. I want you to find it, find people who care about it too, and hold on to them. Because right now, in this city, with these people, at this show? I’m so glad I did. And I hope you find that sort of happiness too. Some of the people who made this trip special. Photo by Chloe, and taken from her Twitter. Timing note: I wrote this on February 15th, but I delayed posting it because I didn’t want to highlight the fact I was away from home. [If the formatting of this post looks odd in your feed reader, visit the original article]
Humanity's Last Exam by Center for AI Safety (CAIS) and Scale AI
Most of our cultural virtues, celebrated heroes, and catchy slogans align with the idea of "never give up". That's a good default! Most people are inclined to give up too easily, as soon as the going gets hard. But it's also worth remembering that sometimes you really should fold, admit defeat, and accept that your plan didn't work out. But how to distinguish between a bad plan and insufficient effort? It's not easy. Plenty of plans look foolish at first glance, especially to people without skin in the game. That's the essence of a disruptive startup: The idea ought to look a bit daft at first glance or it probably doesn't carry the counter-intuitive kernel needed to really pop. Yet it's also obviously true that not every daft idea holds the potential to be a disruptive startup. That's why even the best venture capital investors in the world are wrong far more than they're right. Not because they aren't smart, but because nobody is smart enough to predict (the disruption of) the future consistently. The best they can do is make long bets, and then hope enough of them pay off to fund the ones that don't. So far, so logical, so conventional. A million words have been written by a million VCs about how their shrewd eyes let them see those hidden disruptive kernels before anyone else could. Good for them. What I'm more interested in knowing more about is how and when you pivot from a promising bet to folding your hand. When do you accept that no amount of additional effort is going to get that turkey to soar? I'm asking because I don't have any great heuristics here, and I'd really like to know! Because the ability to fold your hand, and live to play your remaining chips another day, isn't just about startups. It's also about individual projects. It's about work methods. Hell, it's even about politics and societies at large. I'll give you just one small example. In 2017, Rails 5.1 shipped with new tooling for doing end-to-end system tests, using a headless browser to validate the functionality, as a user would in their own browser. Since then, we've spent an enormous amount of time and effort trying to make this approach work. Far too much time, if you ask me now. This year, we finished our decision to fold, and to give up on using these types of system tests on the scale we had previously thought made sense. In fact, just last week, we deleted 5,000 lines of code from the Basecamp code base by dropping literally all the system tests that we had carried so diligently for all these years. I really like this example, because it draws parallels to investing and entrepreneurship so well. The problem with our approach to system tests wasn't that it didn't work at all. If that had been the case, bailing on the approach would have been a no brainer long ago. The trouble was that it sorta-kinda did work! Some of the time. With great effort. But ultimately wasn't worth the squeeze. I've seen this trap snap on startups time and again. The idea finds some traction. Enough for the founders to muddle through for years and years. Stuck with an idea that sorta-kinda does work, but not well enough to be worth a decade of their life. That's a tragic trap. The only antidote I've found to this on the development side is time boxing. Programmers are just as liable as anyone to believe a flawed design can work if given just a bit more time. And then a bit more. And then just double of what we've already spent. The time box provides a hard stop. In Shape Up, it's six weeks. Do or die. Ship or don't. That works. But what's the right amount of time to give a startup or a methodology or a societal policy? There's obviously no universal answer, but I'd argue that whatever the answer, it's "less than you think, less than you want". Having the grit to stick with the effort when the going gets hard is a key trait of successful people. But having the humility to give up on good bets turned bad might be just as important.
No newsletter next week, I'm teaching a TLA+ workshop. Speaking of which: I spend a lot of time thinking about formal methods (and TLA+ specifically) because it's where the source of almost all my revenue. But I don't share most of the details because 90% of my readers don't use FM and never will. I think it's more interesting to talk about ideas from FM that would be useful to people outside that field. For example, the idea of "property strength" translates to the idea that some tests are stronger than others. Another possible export is how FM approaches nondeterminism. A nondeterministic algorithm is one that, from the same starting conditions, has multiple possible outputs. This is nondeterministic: # Pseudocode def f() { return rand()+1; } When specifying systems, I may not encounter nondeterminism more often than in real systems, but I am definitely more aware of its presence. Modeling nondeterminism is a core part of formal specification. I mentally categorize nondeterminism into five buckets. Caveat, this is specifically about nondeterminism from the perspective of system modeling, not computer science as a whole. If I tried to include stuff on NFAs and amb operations this would be twice as long.1 1. True Randomness Programs that literally make calls to a random function and then use the results. This the simplest type of nondeterminism and one of the most ubiquitous. Most of the time, random isn't truly nondeterministic. Most of the time computer randomness is actually pseudorandom, meaning we seed a deterministic algorithm that behaves "randomly-enough" for some use. You could "lift" a nondeterministic random function into a deterministic one by adding a fixed seed to the starting state. # Python from random import random, seed def f(x): seed(x) return random() >>> f(3) 0.23796462709189137 >>> f(3) 0.23796462709189137 Often we don't do this because the point of randomness is to provide nondeterminism! We deliberately abstract out the starting state of the seed from our program, because it's easier to think about it as locally nondeterministic. (There's also "true" randomness, like using thermal noise as an entropy source, which I think are mainly used for cryptography and seeding PRNGs.) Most formal specification languages don't deal with randomness (though some deal with probability more broadly). Instead, we treat it as a nondeterministic choice: # software if rand > 0.001 then return a else crash # specification either return a or crash This is because we're looking at worst-case scenarios, so it doesn't matter if crash happens 50% of the time or 0.0001% of the time, it's still possible. 2. Concurrency # Pseudocode global x = 1, y = 0; def thread1() { x++; x++; x++; } def thread2() { y := x; } If thread1() and thread2() run sequentially, then (assuming the sequence is fixed) the final value of y is deterministic. If the two functions are started and run simultaneously, then depending on when thread2 executes y can be 1, 2, 3, or 4. Both functions are locally sequential, but running them concurrently leads to global nondeterminism. Concurrency is arguably the most dramatic source of nondeterminism. Small amounts of concurrency lead to huge explosions in the state space. We have words for the specific kinds of nondeterminism caused by concurrency, like "race condition" and "dirty write". Often we think about it as a separate topic from nondeterminism. To some extent it "overshadows" the other kinds: I have a much easier time teaching students about concurrency in models than nondeterminism in models. Many formal specification languages have special syntax/machinery for the concurrent aspects of a system, and generic syntax for other kinds of nondeterminism. In P that's choose. Others don't special-case concurrency, instead representing as it as nondeterministic choices by a global coordinator. This more flexible but also more inconvenient, as you have to implement process-local sequencing code yourself. 3. User Input One of the most famous and influential programming books is The C Programming Language by Kernighan and Ritchie. The first example of a nondeterministic program appears on page 14: For the newsletter readers who get text only emails,2 here's the program: #include /* copy input to output; 1st version */ main() { int c; c = getchar(); while (c != EOF) { putchar(c); c = getchar(); } } Yup, that's nondeterministic. Because the user can enter any string, any call of main() could have any output, meaning the number of possible outcomes is infinity. Okay that seems a little cheap, and I think it's because we tend to think of determinism in terms of how the user experiences the program. Yes, main() has an infinite number of user inputs, but for each input the user will experience only one possible output. It starts to feel more nondeterministic when modeling a long-standing system that's reacting to user input, for example a server that runs a script whenever the user uploads a file. This can be modeled with nondeterminism and concurrency: We have one execution that's the system, and one nondeterministic execution that represents the effects of our user. (One intrusive thought I sometimes have: any "yes/no" dialogue actually has three outcomes: yes, no, or the user getting up and walking away without picking a choice, permanently stalling the execution.) 4. External forces The more general version of "user input": anything where either 1) some part of the execution outcome depends on retrieving external information, or 2) the external world can change some state outside of your system. I call the distinction between internal and external components of the system the world and the machine. Simple examples: code that at some point reads an external temperature sensor. Unrelated code running on a system which quits programs if it gets too hot. API requests to a third party vendor. Code processing files but users can delete files before the script gets to them. Like with PRNGs, some of these cases don't have to be nondeterministic; we can argue that "the temperature" should be a virtual input into the function. Like with PRNGs, we treat it as nondeterministic because it's useful to think in that way. Also, what if the temperature changes between starting a function and reading it? External forces are also a source of nondeterminism as uncertainty. Measurements in the real world often comes with errors, so repeating a measurement twice can give two different answers. Sometimes operations fail for no discernable reason, or for a non-programmatic reason (like something physically blocks the sensor). All of these situations can be modeled in the same way as user input: a concurrent execution making nondeterministic choices. 5. Abstraction This is where nondeterminism in system models and in "real software" differ the most. I said earlier that pseudorandomness is arguably deterministic, but we abstract it into nondeterminism. More generally, nondeterminism hides implementation details of deterministic processes. In one consulting project, we had a machine that received a message, parsed a lot of data from the message, went into a complicated workflow, and then entered one of three states. The final state was totally deterministic on the content of the message, but the actual process of determining that final state took tons and tons of code. None of that mattered at the scope we were modeling, so we abstracted it all away: "on receiving message, nondeterministically enter state A, B, or C." Doing this makes the system easier to model. It also makes the model more sensitive to possible errors. What if the workflow is bugged and sends us to the wrong state? That's already covered by the nondeterministic choice! Nondeterministic abstraction gives us the potential to pick the worst-case scenario for our system, so we can prove it's robust even under those conditions. I know I beat the "nondeterminism as abstraction" drum a whole lot but that's because it's the insight from formal methods I personally value the most, that nondeterminism is a powerful tool to simplify reasoning about things. You can see the same approach in how I approach modeling users and external forces: complex realities black-boxed and simplified into nondeterministic forces on the system. Anyway, I hope this collection of ideas I got from formal methods are useful to my broader readership. Lemme know if it somehow helps you out! I realized after writing this that I already talked wrote an essay about nondeterminism in formal specification just under a year ago. I hope this one covers enough new ground to be interesting! ↩ There is a surprising number of you. ↩