More from Irrational Exuberance
While Stripe is a widely admired company for things like its creation of the Sorbet typer project, I personally think that Stripe’s most interesting strategy work is also among its most subtle: its willingness to significantly prioritize API stability. This strategy is almost invisible externally. Internally, discussions around it were frequent and detailed, but mostly confined to dedicated API design conversations. API stability isn’t just a technical design quirk; it’s a foundational decision in an API-driven business, and I believe it is one of the unsung heroes of Stripe’s business success.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this document

To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document.

Policy & Operation

Our policies for managing API changes are:

- Design for long API lifetime. APIs are not inherently durable. Instead we have to design thoughtfully to ensure they can support change. When designing a new API, build a test application that doesn’t use the new API, then migrate it to the new API. Consider how integrations might evolve as applications change. Perform these migrations yourself to understand potential friction with your API. Then think about the future changes that we might want to implement on our end. How would those changes impact the API, and how would they impact the application you’ve developed? At this point, take your API to API Review for initial approval as described below. Following that approval, identify a handful of early adopter companies who can place additional pressure on your API design, and test with them before releasing the final, stable API.

- All new and modified APIs must be approved by API Review. API changes may not be enabled for customers prior to API Review approval. Change requests should be sent to the api-review email group. For examples of prior art, review the api-review archive for prior requests and the feedback they received. All requests must include a written proposal. Most requests will be approved asynchronously by a member of API Review; complex or controversial proposals will require live discussion to ensure API Review members have sufficient context before making a decision.

- We never deprecate APIs without an unavoidable requirement to do so. Even if it’s technically expensive to maintain support, we incur that support cost. To be explicit, we define API deprecation as any change that would require customers to modify an existing integration. If such a change were to be approved as an exception to this policy, it must first be approved by API Review, followed by our CEO. One example where we granted an exception was the deprecation of TLS 1.2 support due to PCI compliance obligations.

- When significant new functionality is required, we add a new API. For example, we created /v1/subscriptions to support those workflows rather than extending /v1/charges to add subscriptions support. With the benefit of hindsight, a good example of this policy in action was the introduction of the Payment Intents APIs to maintain compliance with Europe’s Strong Customer Authentication requirements. Even in that case the charge API continued to work as it did previously, albeit only for non-European Union payments.

- We manage this policy’s implied technical debt via an API translation layer. We release changed APIs as versions, tracked in our API version changelog. However, we only maintain one implementation internally, which is the implementation of the latest version of the API. On top of that implementation, a series of version transformations are maintained, which allow us to support prior versions without maintaining them directly. While this approach doesn’t eliminate the overhead of supporting multiple API versions, it significantly reduces complexity by enabling us to maintain just a single, modern implementation internally. All API modifications must also update the version transformation layers to allow the new version to coexist peacefully with prior versions. (A toy sketch of this transformation pipeline follows at the end of this section.)

- In the future, SDKs may allow us to soften this policy. While a significant number of our customers have direct integrations with our APIs, that number has dropped significantly over time. Instead, most new integrations are performed via one of our official API SDKs. We believe that in the future, it may be possible for us to make more backwards-incompatible changes because we can absorb the complexity of migrations into the SDKs we provide. That is certainly not the case today.
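To build intuition for how a single modern implementation can serve requests pinned to older versions, here is a minimal sketch of such a translation pipeline. This is my own illustration in Python, not Stripe’s actual implementation; the transform dates, fields, and function names are hypothetical.

```python
# Sketch of an API version translation layer (hypothetical, illustrative):
# one modern implementation internally, plus per-version transforms that
# upgrade old-style requests on the way in and downgrade responses on the
# way out.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Transform:
    version: str  # the API version this transform upgrades *from*
    upgrade_request: Callable[[dict], dict]
    downgrade_response: Callable[[dict], dict]

# Ordered oldest to newest; a request pinned to an older version passes
# through every transform between its version and the latest.
TRANSFORMS = [
    Transform(
        version="2022-08-01",
        # Hypothetical change: "currency" later became a required field.
        upgrade_request=lambda req: {"currency": "usd", **req},
        # Hypothetical change: hide a "metadata" field old clients never saw.
        downgrade_response=lambda resp: {k: v for k, v in resp.items() if k != "metadata"},
    ),
]

def handle(request: dict, pinned_version: str, latest_impl: Callable[[dict], dict]) -> dict:
    """Serve an old-version request using only the latest implementation."""
    applicable = [t for t in TRANSFORMS if t.version >= pinned_version]
    for t in applicable:  # upgrade the request, oldest transform first
        request = t.upgrade_request(request)
    response = latest_impl(request)  # the single modern implementation
    for t in reversed(applicable):  # downgrade the response, newest first
        response = t.downgrade_response(response)
    return response
```

The key property the policy relies on is that each transform is written once, when a new version ships, and old versions then keep working without anyone maintaining a second implementation.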
Diagnosis

Our diagnosis of the impact of API changes and deprecation on our business is:

- If you are a small startup composed mostly of engineers, integrating a new payments API seems easy. However, for a small business without dedicated engineers—or a larger enterprise involving numerous stakeholders—handling external API changes can be particularly challenging. Even if this is only marginally true, we’ve modeled the impact of minimizing API changes on long-term revenue growth, and it has a significant impact, unlocking our ability to benefit from other churn-reduction work.

- While we believe API instability directly creates churn, we also believe that API stability directly retains customers by increasing the overhead of migrating to another provider. Without an API change forcing them to revisit their integration, we believe that hypergrowth customers are particularly unlikely to change payments API providers absent a concrete motivation like an API change or a payment plan change.

- We are aware of relatively few companies that provide long-term API stability in general, and particularly few for complex, dynamic areas like payments APIs. We can’t assume that companies that make API changes are ill-informed. Rather, it appears that they experience a meaningful technical debt tradeoff between the API provider and API consumers, and aren’t willing to consistently absorb that technical debt internally.

- Future compliance or security requirements—along the lines of our upgrade from TLS 1.2 to TLS 1.3 for PCI—may necessitate API changes. There may also be new tradeoffs exposed as we enter new markets with their own compliance regimes. However, we have limited ability to predict these changes at this point.
At work, we’ve been building agentic workflows to support our internal Delivery team on various accounting, cash reconciliation, and operational tasks. To better guide that project, I wrote my own simple workflow tool as a learning project in January. Since then, the Model Context Protocol (MCP) has become a prominent solution for writing tools for agents, and I decided to spend some time writing an MCP server over the weekend to build a better intuition.

The output of that project is library-mcp, a simple MCP server that you can use locally with tools like Claude Desktop to explore Markdown knowledge bases. I’m increasingly enamored with the idea of “datapacks” that I load into context windows with relevant work, and I am currently working to release my upcoming book in a “datapack” format that’s optimized for usage with LLMs. library-mcp allows any author to dynamically build datapacks relevant to their current question, as long as they have access to their content in Markdown files.

A few screenshots tell the story. First, here’s a list of the tools provided by this server. These tools give a variety of ways to search through content and pull that content into your context window. Each time you access a tool for the first time in a chat, Claude Desktop prompts you to verify you want that tool to operate. This is a nice feature, and I think it’s particularly important that approval is done at the application layer, not at the agent layer. If agents approve their own usage, well, security is going to be quite a mess.

Here’s an example of retrieving all the tags to figure out what I’ve written about. You could do a follow-up like, “Get me posts I’ve written about ‘python’” after seeing the tags. The interesting thing here is that you can combine retrieval and intelligence. For example, you could ask “Get me all the tags I’ve written, and find those that seem related to software architecture” and it does a good job of filtering.

Finally, here’s an example of actually using a datapack to answer a question. In this case, it’s evaluating how my writing has changed between 2018 and 2025.

More practically, I’ve already experimented with friends writing their CTO onboarding plans with Your first 90 days as CTO as a datapack in the context window, and you can imagine the right datapacks allowing you to go much further. Writing a company policy with all the existing policies in a datapack, along with a document about how to write policies effectively, for example, would improve consistency and likely identify conflicting policies.

Altogether, I’m taken with the vision of useful datapacks facilitating creation, and hope that library-mcp is a useful tool for folks as we experiment our way towards this idea.
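For a rough sense of what a tool server like this involves, here’s a minimal sketch using the FastMCP helper from the official Python MCP SDK. The tool names, frontmatter handling, and content path are hypothetical illustrations, not library-mcp’s actual interface.

```python
# Minimal sketch of an MCP server exposing Markdown-search tools, in the
# spirit of library-mcp. Tool names and paths are hypothetical.
# Requires the official Python SDK: pip install mcp
from pathlib import Path
from mcp.server.fastmcp import FastMCP

CONTENT_DIR = Path("~/writing/posts").expanduser()  # hypothetical location
mcp = FastMCP("library-sketch")

@mcp.tool()
def list_tags() -> list[str]:
    """List every tag found on a 'tags:' frontmatter line across all posts."""
    tags: set[str] = set()
    for post in CONTENT_DIR.glob("**/*.md"):
        for line in post.read_text().splitlines():
            if line.startswith("tags:"):
                tags.update(t.strip() for t in line.removeprefix("tags:").split(","))
    return sorted(tags)

@mcp.tool()
def get_posts_by_tag(tag: str) -> list[str]:
    """Return the full text of posts mentioning a tag, for use as a datapack."""
    matches = []
    for post in CONTENT_DIR.glob("**/*.md"):
        text = post.read_text()
        if tag in text:
            matches.append(text)
    return matches

if __name__ == "__main__":
    mcp.run()  # Claude Desktop launches this over stdio
```

Once a client like Claude Desktop is pointed at the server, each `@mcp.tool()` function becomes a callable tool, and the approval prompt described above appears the first time a chat invokes it.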
Ahead of announcing the title and publisher of my thus-far-untitled book on engineering strategy in the next week or two, I put together a website for its content. That site is pretty much the same format as this blog, but with some improvements, like better mobile rendering on the index page than this blog has historically had. After finishing that work, I ported the improvements back to lethain.com, but also decided to bring them to staffeng.com.

That was slightly trickier because, unlike this blog, StaffEng was historically a Gatsby app. (Why a Gatsby app? Because Calm was using Gatsby for our web frontend and I wanted to get some experience with it.) Over the weekend, I took some time to migrate to Hugo and apply the same enhancements, which you can now see in the lethain:staff-eng repository or on staffeng.com. Here’s a screenshot of the old version. Then here’s a screenshot of the updated version. Overall, I think it’s slightly easier to read, and I took it as a chance to update the various links. For example, I removed the newsletter link and pointed that to this blog’s newsletter instead, given that one’s mailing list went quiet a long time ago.

Speaking of going quiet, I also brought these updates to infraeng.dev, which is the very-stuck-in-time site for the book I may-or-may-not one day write about infrastructure engineering. That means that I now have four essentially equivalent Hugo sites running different content websites: this blog, staffeng.com, infraeng.dev, and the site for the upcoming book. All of these build and deploy automatically onto GitHub Pages, which has been an extremely easy, reliable workflow for me.

While I was working on this, someone asked me why I don’t just write my own blog server to host my blogs. The answer here is pretty straightforward. I’ve written three blog servers for my blog over the years. The first two were in Python, and the last one was in Go. They all worked well enough, but maintaining them was eventually a pain point, because they required a real build pipeline and depended on libraries that could have security issues. Even in the best case, the containers they ran in would get end-of-lifed periodically as Ubuntu versions were deprecated.

What I’ve slowly learned from that is that, as a frequent writer, you really want your content to live somewhere that can work properly for decades. Even small maintenance costs can be prohibitive over time, and I’ve seen some good blogs disappear rather than e.g. figure out a WordPress upgrade. Individually, these are all minor, but over decades they can really add up. This is also my argument against using hosted providers: I’m sure Substack will be around in five years, but I have no idea if it will be around in twenty. I know that I’ll still be writing then, and will also want my previous writing to still be accessible.
Many hypergrowth companies of the 2010s battled increasing complexity in their codebases by decomposing their monoliths. Stripe was somewhat of an exception, largely delaying decomposition until it had grown beyond three thousand engineers and had accumulated a decade of development in its core Ruby monolith. Even now, significant portions of their product are maintained in the monolithic repository, and it’s safe to say this was only possible because of Sorbet’s impact.

Sorbet is a custom static type checker for Ruby that was initially designed and implemented by Stripe engineers on their Product Infrastructure team. Stripe’s Product Infrastructure had similar goals to other companies’ Developer Experience or Developer Productivity teams, but it focused on improving productivity through changes in the internal architecture of the codebase itself, rather than relying solely on external tooling or processes.

This strategy explains why Stripe chose to delay decomposition for so long, and how the Product Infrastructure team invested in developer productivity to deal with the challenges of a large Ruby codebase managed by a large software engineering team with low average tenure caused by rapid hiring.

Before wrapping this introduction, I want to explicitly acknowledge that this strategy was spearheaded by Stripe’s Product Infrastructure team, not by me. Although I ultimately became responsible for that team, I can’t take credit for this strategy’s thinking. Rather, I was initially skeptical, preferring an incremental migration to an existing strongly-typed programming language: either Java for its library coverage or Golang for Stripe’s existing familiarity. Despite my initial doubts, the Sorbet project eventually won me over with its indisputable results.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this document

To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document.

Policy & Operation

The Product Infrastructure team is investing in Stripe’s developer experience as follows:

- Every six months, Product Infrastructure will select its three highest-priority areas to focus on, and invest a significant majority of its energy into those. We will provide minimal support for other areas. We commit to refreshing our priorities every half after running the developer productivity survey. We will further share our results, and priorities, in each Quarterly Business Review.

- Our three highest-priority areas for this half are:
  1. Add static typing to the highest-value portions of our Ruby codebase, such that we can run the type checker locally and on the test machines to identify errors more quickly.
  2. Support selective test execution such that engineers can quickly determine and run the most appropriate tests on their machine rather than waiting until tests run on the build server.
  3. Instrument test failures such that we have better data to prioritize future efforts.

Static typing is not a typical solution to developer productivity, so it requires some explanation when we say it is our highest-priority area for investment. Doubly so when we acknowledge that it will take us 12–24 months of much of the team’s time to get our type checker to an effective place.

Our type checker, which we plan to name Sorbet, will allow us to continue developing within our existing Ruby codebase. It will further allow our product engineers to remain focused on developing new functionality rather than migrating existing functionality to new services or programming languages. Instead, our Product Infrastructure team will centrally absorb both the development of the type checker and the initial rollout to our codebase. It’s possible for Product Infrastructure to take on both, despite its fixed size: we’ll rely on a hybrid approach of deep dives to add typing to particularly complex areas, and scripts that rewrite our code’s Abstract Syntax Trees (ASTs) for less complex portions. In the relatively unlikely event that this approach fails, the cost to Stripe is of a small, known size: approximately six months of half the Product Infrastructure team, which is what we anticipate requiring to determine whether this approach is viable.

Based on our knowledge of Facebook’s Hack project, we believe we can build a static type checker that runs locally and significantly faster than our test suite. It’s hard to make a precise guess now, but we think less than 30 seconds to type-check our entire codebase, despite it being quite large. This will allow for a highly productive local development experience, even if we are not able to speed up local testing. Even if we do speed up local testing, typing would help us eliminate a category of error that testing has been unable to eliminate: passing unexpected types across code paths that have been tested for expected scenarios but not for entirely unexpected ones.

Once the type checker has been validated, we can incrementally prioritize adding typing to the highest-value places across the codebase. We do not need to wholly type our codebase before we can start getting meaningful value. In support of these static typing efforts, we will advocate for product engineers at Stripe to begin development using the Command Query Responsibility Segregation (CQRS) design pattern, which we believe will provide high-leverage interfaces for incrementally introducing static typing into our codebase.

Selective test execution will allow developers to quickly run appropriate tests locally. This will allow engineers to stay in a tight local development loop, speeding up development of high-quality code. Given that our codebase is not currently statically typed, inferring which tests to run is rather challenging. With our very high test coverage, and the fact that all tests will still be run before deployment to the production environment, we believe that we can rely on statistically inferring which tests are likely to fail when a given file is modified. (A rough sketch of that inference follows at the end of this section.)

Instrumenting test failures is our third, and lowest-priority, project for this half. Our focus this half is purely on annotating errors for which we have high conviction about their source, whether infrastructure or test issues.

For escalations and issues, reach out in the #product-infra channel.
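To make the statistical inference concrete, here is a minimal sketch of one way it could work: rank tests by how often they historically failed alongside changes to a given file. This is my own illustration in Python with assumed data structures, not the team’s actual tooling.

```python
# Hypothetical sketch of statistical test selection: rank tests by how
# often they historically failed when a given file changed, then run the
# top candidates locally (the full suite still runs before deploy).
from collections import Counter, defaultdict

# failure_history maps changed_file -> Counter of tests that failed in
# past builds touching that file (built from CI records).
failure_history: dict[str, Counter] = defaultdict(Counter)

def record_build(changed_files: list[str], failed_tests: list[str]) -> None:
    """Ingest one historical CI build into the model."""
    for f in changed_files:
        failure_history[f].update(failed_tests)

def select_tests(changed_files: list[str], limit: int = 50) -> list[str]:
    """Pick the tests most likely to fail for this set of changed files."""
    scores: Counter = Counter()
    for f in changed_files:
        scores.update(failure_history[f])
    return [test for test, _ in scores.most_common(limit)]

# Example: after ingesting CI history, a local run might execute only
#   select_tests(["payments/charge.rb"])
# and defer everything else to the pre-deploy build servers.
```

Because the full suite still runs before deployment, the cost of a wrong guess here is bounded: a missed failure surfaces later in CI rather than in production.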
Diagnose

In 2017, Stripe is a company of about 1,000 people, including 400 software engineers. We aim to grow our organization by about 70% year-over-year to meet increasing demand for a broader product portfolio, and to scale our existing products and infrastructure to accommodate user growth.
As our production stability has improved over the past several years, we have now turned our focus towards improving developer productivity. Our current diagnosis of our developer productivity is:

- We primarily fund developer productivity for our Ruby-authoring software engineers via our Product Infrastructure team. The Ruby-focused portion of that team has about ten engineers on it today, and is unlikely to grow significantly in the future. (If we do expand, we are likely to staff non-Ruby ecosystems like Scala or Golang.)

- We have two primary mechanisms for understanding our engineers’ developer experience. The first is standard productivity metrics around deploy time, deploy stability, test coverage, test time, test flakiness, and so on. The second is a twice-annual developer productivity survey.

- Looking at our productivity metrics, our test coverage remains extremely high, with coverage above 99% of lines, but tests are quite slow to run locally. They run quickly in our infrastructure because they are multiplexed across a large fleet of test runners.

- Tests have become slow enough locally that an increasing number of developers run an overly narrow subset of tests, or skip running tests entirely until after pushing their changes. They instead rely on our test servers to run against their pull request’s branch, which works well enough, but significantly slows down developer iteration because the merge, build, and test cycle takes twenty to thirty minutes to complete. By the time their build-test cycle completes, they’ve lost focus and may take several hours to return to addressing the results.

- There is significant disagreement about whether tests are becoming flakier due to test infrastructure issues, or due to quality issues in the tests themselves. At this point, there is no trustworthy dataset that allows us to attribute between those two causes.

- Feedback from the twice-annual developer productivity survey supports the above diagnosis, and adds some additional nuance. Most concerning: although long-tenured Stripe engineers find themselves highly productive in our codebase, we increasingly hear in the survey that newly hired engineers with long tenures at other companies find themselves unproductive in our codebase. Specifically, they find it very difficult to determine how to safely make changes.

- Our product codebase is entirely implemented in a single Ruby monolith. There is one narrow exception, a Golang service handling payment tokenization, which we consider out of scope for two reasons. First, it is kept intentionally narrow in order to absorb our SOC1 compliance obligations. Second, developers in that environment have not raised concerns about their productivity.

- Our data infrastructure is implemented in Scala. While these developers have concerns–primarily slow build times–they manage their build and deployment infrastructure independently, and the group remains relatively small.

- Ruby is not a highly performant programming language, but we’ve found it sufficiently efficient for our needs. Other languages are more cost-efficient from a compute-resources perspective, but a significant majority of our spend is on real-time storage and batch computation. For these reasons alone, we would not consider replacing Ruby as our core programming language.

- Our Product Infrastructure team is about ten engineers, supporting about 250 product engineers. We anticipate this group growing modestly over time, but certainly sublinearly to the overall growth of product engineers.

- Developers working in Golang and Scala routinely ask for more centralized support, but it’s challenging to prioritize those requests when we’re forced to weigh the return on improving the experience for 240 product engineers working in Ruby against 10 in Golang or 40 data engineers in Scala. If we introduced more programming languages, this prioritization problem would only become more difficult, and we already struggle to support the languages we have.
More in programming
I like the job title “Design Engineer”. When required to label myself, I feel partial to that term (I should, I’ve written about it enough). Lately I’ve felt like the term is becoming more mainstream which, don’t get me wrong, is a good thing. I appreciate the diversification of job titles, especially ones that look to stand in the middle between two binaries. But — and I admit this is a me issue — once a title starts becoming mainstream, I want to use it less and less.

I was never totally sure why I felt this way. Shouldn’t I be happy a title I prefer is gaining acceptance and understanding? Do I just want to rebel against being labeled? Why do I feel this way? These were the thoughts simmering in the back of my head when I came across an interview with the comedian Brian Regan where he talks about his own penchant for not wanting to be easily defined:

“I’ve tried over the years to write away from how people are starting to define me. As soon as I start feeling like people are saying ‘this is what you do’ then I would be like ‘Alright, I don’t want to be just that. I want to be more interesting. I want to have more perspectives.’ [For example] I used to crouch around on stage all the time and people would go ‘Oh, he’s the guy who crouches around back and forth.’ And I’m like, ‘I’ll show them, I will stand erect! Now what are you going to say?’ And then they would go ‘You’re the guy who always feels stupid.’ So I started [doing other things].”

He continues, wondering aloud whether this aversion to being easily defined has actually hurt his career in terms of commercial growth:

“I never wanted to be something you could easily define. I think, in some ways, that it’s held me back. I have a nice following, but I’m not huge. There are people who are huge, who are great, and deserve to be huge. I’ve never had that and sometimes I wonder, ‘Well, maybe it’s because I purposely don’t want to be a particular thing you can advertise or push.’”

That struck a chord with me. It puts into words my current feelings towards the job title “Design Engineer” — or any job title, for that matter. Seven or so years ago, I would’ve enthusiastically said, “I’m a Design Engineer!” To which many folks would’ve said, “What’s that?” But today I hesitate. Nowadays that title elicits fewer follow-up questions and more (presumed) certainty. I think I enjoy a title that elicits a “What’s that?” response, one that allows me to explain myself in more than two or three words without being put in a box. But once a title becomes mainstream, once people begin to assume they know what it means, I don’t like it anymore (speaking for myself, personally). As Brian says, I like to be difficult to define. I want to have more perspectives. I like a title that befuddles, that doesn’t provide a presumed sense of certainty about who I am and what I do. And I get it, that runs counter to the very purpose of a job title, which is why I don’t think it’s good for your career to have the attitude I do, lol.

I think my own career evolution has gone something like what Brian describes:

Them: “Oh, you’re a Designer? So you make mock-ups in Photoshop and somebody else implements them.”
Me: “I’ll show them, I’ll implement them myself! Now what are you gonna do?”
Them: “Oh, so you’re a Design Engineer? You design and build user interfaces on the front-end.”
Me: “I’ll show them, I’ll write a Node server and set up a database that powers my designs and interactions on the front-end. Now what are you gonna do?”
Them: “Oh, well, I’m not sure we have a term for that yet. Maybe Full-stack Design Engineer?”
Me: “Oh yeah? I’ll frame up a user problem, interface with stakeholders, explore the solution space with static designs and prototypes, implement a high-fidelity solution, and then be involved in testing, measuring, and refining said solution. What are you gonna call that?”

[As you can see, I have some personal issues I need to work through…]

As Brian says, I want to be more interesting. I want to have more perspectives. I want to be something that’s not so easily definable, something you can’t sum up in two or three words. I’ve felt this tension my whole career making stuff for the web. I think it has led me to work on smaller teams where boundaries are much more permeable and crossing them is encouraged rather than discouraged.

All that said, I get it. I get why titles are useful in certain contexts (corporate hierarchies, recruiting, etc.) where you’re trying to take something as complicated and nuanced as an individual human being and reduce them to labels that can be categorized in a database. I find myself avoiding those contexts where so much emphasis is placed on the usefulness of those labels. “I’ve never wanted to be something you could easily define” stands at odds with the corporate attitude of, “Here’s the job req for the role (i.e. cog) we’re looking for.”
We received over 2,200 applications for our just-closed junior programmer opening, and now we’re going through all of them by hand and by human. No AI screening here. It’s a lot of work, but we have a great team who take the work seriously, so in a few weeks, we’ll be able to invite a group of finalists to the next phase.

This highlights the folly of thinking that what it’ll take to land a job like this is some specific list of criteria, though. Yes, you have to present a baseline of relevant markers to even get into consideration, like a great cover letter that doesn’t smell like AI slop, promising projects or work experience or educational background, etc. But to actually get the job, you have to be the best of the ones who’ve applied! It sounds self-evident, maybe, but I see questions time and again about it, so it must not be. Almost every job opening is grading applicants on the curve of everyone who has applied. And the best candidate of the lot gets the job. You can’t quantify what that looks like in advance.

I’m excited to see who makes it to the final stage. I already hear early whispers that we got some exceptional applicants in this round. It would be great to help counter the narrative that this industry no longer needs juniors. That idea is simply wrongheaded. However good AI gets, we’re always going to need people who know the ins and outs of what the machine comes up with. Maybe not as many, maybe not in the same roles, but it’s truly utopian thinking that mankind won’t need people capable of vetting the work done by AI in five minutes.
Recently I got a question on formal methods[1]: how does it help to mathematically model systems when the system requirements are constantly changing? It doesn’t make sense to spend a lot of time proving a design works, and then deliver the product and find out it’s not at all what the client needs. As the saying goes, the hard part is “building the right thing”, not “building the thing right”.

One possible response: “why write tests?” You shouldn’t write tests, especially lots of unit tests ahead of time, if you might just throw them all away when the requirements change. This is a bad response because we all know the difference between writing tests and formal methods: testing is easy and FM is hard. Testing has a low cost for moderate correctness; FM has a high(ish) cost for high correctness. And when requirements are constantly changing, “high(ish) cost” isn’t affordable and “high correctness” isn’t worthwhile, because a kinda-okay solution that solves a customer’s problem is infinitely better than a solid solution that doesn’t.

But eventually you get something that solves the problem, and what then? Most of us don’t work for Google; we can’t axe features and products on a whim. If the client is happy with your solution, you are expected to support it. It should work when your customers run into new edge cases, or migrate all their computers to the next OS version, or expand into a market with shoddy internet. It should work when 10x as many customers are using 10x as many features. It should work when you add new features that come into conflict. And just as importantly, it should never stop solving their problem.

Canonical example: your feature involves processing requested tasks synchronously. At scale, this doesn’t work, so to improve latency you make it asynchronous. Now it’s eventually consistent, but your customers were depending on it being always consistent. It no longer does what they need, and has stopped solving their problems. Every successful requirement met spawns a new requirement: “keep this working”. That requirement is permanent, or close enough to decide our long-term strategy. It takes active investment to keep a feature behaving the same as the world around it changes. (Is this all a pretentious way of saying “software maintenance is hard”? Maybe!)

Phase changes

In physics there’s a concept of a phase transition. To raise the temperature of a gram of liquid water by 1°C, you have to add 4.184 joules of energy.[2] This continues until you reach 100°C, then it stops. After you’ve added roughly another 2,260 joules to that gram, it suddenly turns into steam. The energy of the system changes continuously but the form, or phase, changes discretely.

Software isn’t physics, but the idea works as a metaphor. A certain architecture handles a certain level of load, and past that you need a new architecture. Or a bunch of similar features are independently hardcoded until the system becomes too messy to understand, and you remodel the internals into something unified and extendable. Etc, etc, etc. It doesn’t have to be a totally discrete phase transition, but there’s definitely a “before” and “after” in the system’s form.

Phase changes tend to lead to more intricacy/complexity in the system, meaning it’s likely that a phase change will introduce new bugs into existing behaviors. Take the synchronous vs asynchronous case. A very simple toy model of synchronous updates would be Set(key, val), which updates data[key] to val.[3] A model of asynchronous updates would be AsyncSet(key, val, priority), which adds a (key, val, priority, server_time()) tuple to a tasks set; another process asynchronously pulls a tuple (ordered by highest priority, then earliest time) and calls Set(key, val).

Here are some properties the client may need preserved as a requirement:

1. If AsyncSet(key, val, _, _) is called, then eventually db[key] = val (possibly violated if higher-priority tasks keep coming in).
2. If someone calls AsyncSet(key1, val1, low) and then AsyncSet(key2, val2, low), they should see the first update and then the second (linearizability, possibly violated if the requests go to different servers with different clock times).
3. If someone calls AsyncSet(key, val, _) and immediately reads db[key], they should get val (obviously violated, though the client may accept a slightly weaker property).

If the new system doesn’t satisfy an existing customer requirement, it’s prudent to fix the bug before releasing the new system. The customer doesn’t notice or care that your system underwent a phase change. They’ll just see that one day your product solves their problems, and the next day it suddenly doesn’t.

This is one of the most common applications of formal methods. Both of those systems, and every one of those properties, is formally specifiable in a specification language. We can then automatically check that the new system satisfies the existing properties, and from there do things like automatically generate test suites. This does take a lot of work, so if your requirements are constantly changing, FM may not be worth the investment. But eventually requirements stop changing, and then you’re stuck with them forever. That’s where models shine.
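As a rough, runnable illustration (my own Python toy built from the post’s definitions, not the TLA+ spec the footnote alludes to), here is the AsyncSet model with a worker that applies tasks by priority, showing how property 3 fails:

```python
# Toy executable version of the AsyncSet model above: writes go into a
# task set and a worker applies them ordered by (priority desc, time asc).
# An illustration only, not a substitute for a formal specification.
import itertools

db: dict[str, str] = {}
tasks: list[tuple[int, int, str, str]] = []  # (neg_priority, time, key, val)
clock = itertools.count()  # stands in for server_time()

def async_set(key: str, val: str, priority: int) -> None:
    """Enqueue an update instead of applying it immediately."""
    tasks.append((-priority, next(clock), key, val))

def run_worker() -> None:
    """Apply pending tasks: highest priority first, then earliest time."""
    for _, _, key, val in sorted(tasks):
        db[key] = val  # this is the synchronous Set(key, val)
    tasks.clear()

# Property 3 ("read your own write") is obviously violated:
async_set("k", "v1", priority=1)
assert db.get("k") != "v1"   # the write hasn't been applied yet
run_worker()
assert db["k"] == "v1"       # eventually consistent once the worker runs
```

A model checker would explore interleavings of async_set and the worker automatically; the toy above only demonstrates one trace.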
[1] As always, I’m using formal methods to mean the subdiscipline of formal specification of designs, leaving out the formal verification of code. Mostly because “formal specification” is really awkward to say.
[2] Also called a “calorie”. The US “dietary Calorie” is actually a kilocalorie.
[3] This is all directly translatable to a TLA+ specification; I’m just describing it in English to avoid paying the syntax tax.
I've been biking in Brooklyn for a few years now! It's hard for me to believe it, but I'm now one of the people other bicyclists ask questions to. I decided to make a zine that answers the most common of those questions: Bike Brooklyn! is a zine that touches on everything I wish I knew when I started biking in Brooklyn. A lot of this information can be found in other resources, but I wanted to collect it in one place.

I hope to update this zine as we get significantly more safe bike infrastructure in Brooklyn and as laws change to make streets safer for bicyclists (and everyone), but it's still important to note that each release will reflect a specific snapshot in time of bicycling in Brooklyn.

All text and illustrations in the zine are my own. Thank you to Matt Denys, Geoffrey Thomas, Alex Morano, Saskia Haegens, Vishnu Reddy, Ben Turndorf, Thomas Nayem-Huzij, and Ryan Christman for suggestions for content and help with proofreading. This zine is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, so you can copy and distribute this zine for noncommercial purposes in unadapted form as long as you give credit to me.

Check out the Bike Brooklyn! zine on the web or download pdfs to read digitally or print here!