More from Irrational Exuberance
Entering 2025, I decided to spend some time exploring the topic of agents. I started reading Anthropic’s Building effective agents, followed by Chip Huyen’s AI Engineering. I kicked off a major workstream at work on using agents, and I also decided to do a personal experiment of sorts. This is a general commentary on building that project. What I wanted to build was a simple chat interface where I could write prompts, select models, and have the model use tools as appropriate. My side goal was to build this using Cursor and generally avoid writing code directly as much as possible, but I found that generally slower than writing code in emacs while relying on 4o-mini to provide working examples to pull from. Similarly, while I initially envisioned building this in fullstack TypeScript via Cursor, I ultimately bailed into a stack that I’m more comfortable, and ended up using Python3, FastAPI, PostgreSQL, and SQLAlchemy with the async psycopg3 driver. It’s been a… while… since I started a brand new Python project, and used this project as an opportunity to get comfortable with Python3’s async/await mechanisms along with Python3’s typing along with mypy. Finally, I also wanted to experiment with Tailwind, and ended up using TailwindUI’s components to build the site. The working version supports everything I wanted: creating chats with models, and allowing those models to use function calling to use tools that I provide. The models are allowed to call any number of tools in pursuit of the problem they are solving. The tool usage is the most interesting part here for sure. The simplest tool I created was a get_temperature tool that provided a fake temperature for your location. This allowed me to ask questions like “What should I wear tomorrow in San Francisco, CA?” and get a useful respond. The code to add this function to my project was pretty straightforward, just three lines of Python and 25 lines of metadata to pass to the OpenAI API. def tool_get_current_weather(location: str|None=None, format: str|None=None) -> str: "Simple proof of concept tool." temp = random.randint(40, 90) if format == 'fahrenheit' else random.randint(10, 25) return f"It's going to be {temp} degrees {format} tomorrow." FUNCTION_REGISTRY['get_current_weather'] = tool_get_current_weather TOOL_USAGE_REGISTRY['get_current_weather'] = { "type": "function", "function": { "name": "get_current_weather", "description": "Get the current weather", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "The city and state, e.g. San Francisco, CA", }, "format": { "type": "string", "enum": ["celsius", "fahrenheit"], "description": "The temperature unit to use. Infer this from the users location.", }, }, "required": ["location", "format"], }, } } After getting this tool, the next tool I added was a simple URL retriever tool, which allowed the agent to grab a URL and use the content of that URL in its prompt. The implementation for this tool was similarly quite simple. def tool_get_url(url: str|None=None) -> str: if url is None: return '' url = str(url) response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') content = soup.find('main') or soup.find('article') or soup.body if not content: return str(response.content) markdown = markdownify(str(content), heading_style="ATX").strip() return str(markdown) FUNCTION_REGISTRY['get_url'] = tool_get_url TOOL_USAGE_REGISTRY['get_url'] = { "type": "function", "function": { "name": "get_url", "description": "Retrieve the contents of a website via its URL.", "parameters": { "type": "object", "properties": { "url": { "type": "string", "description": "The complete URL, including protocol to retrieve. For example: \"https://lethain.com\"", } }, "required": ["url"], }, } } What’s pretty amazing is how much power you can add to your agent by adding such a trivial tool as retrieving a URL. You can similarly imagine adding tools for retrieving and commenting on Github pull requests and so, which could allow a very simple agent tool like this to become quite useful. Working on this project gave me a moderately compelling view of a near-term future where most engineers have simple application like this running that they can pipe events into from various systems (email, text, Github pull requests, calendars, etc), create triggers that map events to templates that feed into prompts, and execute those prompts with tool-aware agents. Combine that with ability for other agents to register themselves with you and expose the tools that they have access to (e.g. schedule an event with tool’s owner), and a bunch of interesting things become very accessible with a very modest amount of effort: You could schedule events between two busy people’s calendars, as if both of them had an assistant managing their calendar Reply to your own pull requests with new blog posts, providing feedback on typos and grammatical issues Crawl websites you care about and identify posts you might be interested in Ask the model to generate a system model using lethain:systems, run that model, then chart the responses Add a “planning tool” which allows the model to generate a plan to guide subsequent steps in a complex task. (e.g. getting my calendar, getting a friend’s calendar, suggesting a time we could meet) None of these are exactly lifesaving, but each is somewhat useful, and I imagine there are many more fairly obvious ideas that become easy once you have the necessary scaffolding to make this sort of thing easy. Altogether, I think that I am convinced at this points that agents, using current foundational models, are going to create a number of very interesting experiences that improve our day to day lives in small ways that are, in aggregate, pretty transformational. I’m less convinced that this is the way all software should work going forward though, but more thoughts on that over time. (A bunch of fun experiments happening at work, but early days on those.)
In my career, the majority of the strategy work I’ve done has been in non-executive roles, things like Uber’s service migration. Joining Calm was my first executive role, where I was able to not just propose, but also mandate, strategy. Like almost all startups, the engineering team was scattered when I joined. Was our most important work creating more scalable infrastructure? Was our greatest risk the failure to adopt leading programming languages? How did we rescue the stuck service decomposition initiative? This strategy is where the engineering team and I aligned after numerous rounds of iteration, debate, and inevitably some disagreement. As a strategy, it’s both basic and also unambiguous about what we valued, and I believe it’s a reasonably good starting point for any low scalability-complexity consumer product. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Reading this document To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore, then Diagnose and so on. Relative to the default structure, this document has one tweak, folding the Operation section in with Policy. More detail on this structure in Making a readable Engineering Strategy document. Policy & Operation Our new policies, and the mechanisms to operate them are: We are a product engineering company. Users write in every day to tell us that our product has changed their lives for the better. Our technical infrastructure doesn’t get many user letters–and this is unlikely to change going forward as our infrastructure is relatively low-scale and low-complexity. Rather than attempting to change that, we want to devote the absolute maximum possible attention to product engineering. We exclusively adopt new technologies to create valuable product capabilities. We believe our technology stack as it exists today can solve the majority of our current and future product roadmaps. In the rare case where we adopt a new technology, we do so because a product capability is inherently impossible without adopting a new technology. We do not adopt new technologies for other reasons. For example, we would not adopt a new technology because someone is interested in learning about it. Nor would we adopt a technology because it is 30% better suited to a task. We write all code in the monolith. It has been ambiguous if new code (especially new application code) should be written in our JavaScript monolith, or if all new code must be written in a new service outside of the monolith. This is no longer ambiguous: all new code must be written in the monolith. In the rare case that there is a functional requirement that makes writing in the monolith implausible, then you should seek an exception as described below. Exceptions are granted by the CTO, and must be in writing. The above policies are deliberately restrictive. Sometimes they may be wrong, and we will make exceptions to them. However, each exception should be deliberate and grounded in concrete problems we are aligned both on solving and how we solve them. If we all scatter towards our preferred solution, then we’ll create negative leverage for Calm rather than serving as the engine that advances our product. All exceptions must be written. If they are not written, then you should operate as if it has not been granted. Our goal is to avoid ambiguity around whether an exception has, or has not, been approved. If there’s no written record that the CTO approved it, then it’s not approved. Proving the point about exceptions, there are two confirmed exceptions to the above strategy: We are incrementally migrating to TypeScript. We have found that static typing can prevent a number of our user-facing bugs. TypeScript provides a clean, incremental migration path for our JavaScript codebase, and we aim to migrate the entirety over the next six months. Our Web engineering team is leading this migration. We are evaluating Postgres Aurora as our primary database. Many of our recent production incidents are caused by index scans for tables with high write velocity such as tracking customer logins. We believe Aurora will perform better under these workloads. Our Infrastructure engineering team is leading this initiative. Diagnose The current state of our engineering organization: Our product is not limited by missing infrastructure capabilities. Reviewing our roadmap, there’s nothing that we are trying to build today or over the next year that is constrained by our technical infrastructure. Our uptime, stability and latency are OK but not great. We have semi-frequent stability and latency issues in our application, all of which are caused by one of two issues. First, deploying new code with a missing index because it performed well enough in a test environment. Second, writes to a small number of extremely large, skinny tables have become expensive in combination with scans over those tables’ indexes. Our infrastructure team is split between supporting monolith and service workflows. One way to measure technical debt is to understand how much time the team is spending propping up the current infrastructure. Today, that is meaningful but not overwhelming work for our team of three infrastructure engineers supporting 30 product engineers. However, we are finding infrastructure engineers increasingly pulled into debugging incidents for components moved out of the central monolith into our service architecture. This is partially due to increased inherent complexity, but it’s more due to exposing lack of monitoring and ambiguous accountability in services’ production incidents. Our product and executive stakeholders experience us as competing factions. Engineering exists to build and operate software in the company. Part of that is being easy to work with. We should not necessarily support every ask from Product if we believe they are misaligned with Engineering’s goals (e.g. maintaining security), but it should generally provide a consistent perspective across our team. Today, our stakeholders believe they will get radically different answers to basic questions of capabilities and approach depending on who they ask. If they try to get a group of engineers to agree on an approach, they often find we derail into debate about approach rather than articulating a clear point of view that allows the conversation to move forward. We’re arguing a particularly large amount about adopting new technologies and rewrites. Most of our disagreements stem around adopting new technologies or rewriting existing components into new technology stacks. For example, can we extend this feature or do we have to migrate it to a service before extending it? Can we add this to our database or should we move it into a new Redis cache instead? Is JavaScript a sufficient programming language, or do we need to rewrite this functionality in Go? This is particularly relevant to next steps around the ongoing services migration, which has been in-flight for over a year, but is yet to move any core production code. We are spending more time on infrastructure and platform work than product work. This is the combination of all the above issues, from the stability issues we are encountering in our database design, to the lack of engineering alignment on execution. This places us at odds with stakeholder expectation that we are predominantly focused on new product development. Explore Calm is a mobile application that guides users to build and maintain either a meditation or sleep habit. Recommendations and guidance across content is individual to the user, but the content is shared across all customers and is amenable to caching on a content delivery network (CDN). As long as the CDN is available, the mobile application can operate despite inability to access servers (e.g. the application remains usable from a user’s perspective, even if the non-CDN production infrastructure is unreachable). In 2010, enabling a product of this complexity would have required significant bespoke infrastructure, along with likely maintaining a physical presence in a series of datacenters to run your software. In 2020, comparable applications are generally moving towards maintaining as little internal infrastructure as possible. This perspective is summarized effectively in Intercom’s Run Less Software and Dan McKinley’s Choose Boring Technology. New companies founded in this space view essentially all infrastructure as a commodity bought off your cloud provider. This even extends to areas of innovation, such as machine learning, where the training infrastructure is typically run on an offering like AWS Bedrock, and the model infrastructure is provided by Anthropic or OpenAI.
Some people I’ve worked with have lost hope that engineering strategy actually exists within any engineering organizations. I imagine that they, reading through the steps to build engineering strategy, or the strategy for navigating private equity ownership, are not impressed. Instead, these ideas probably come across as theoretical at best. In less polite company, they might describe these ideas as fake constructs. Let’s talk about it! Because they’re right. In fact, they’re right in two different ways. First, this book is focused on explain how to create clean, refine and definitive strategy documents, where initially most real strategy artifacts look rather messy. Second, applying these techniques in practice can require a fair amount of creativity. It might sound easy, but it’s quick difficult in practice. This chapter will cover: Why strategy documents need to be clear and definitive, especially when strategy development has been messy How to iterate on strategy when there are demands for unrealistic timelines Using strategy as non-executives, where others might override your strategy Handling dynamic, quickly changing environments where diagnosis can change frequently Working with indecisive stakeholders who don’t provide clarity on approach Surviving other people’s bad strategy work Alright, let’s dive into the many ways that praxis doesn’t quite line up with theory. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Clear and definitive documents As explored in Making engineering strategies more readable, documents that feel intuitive to write are often fairly difficult to read, That’s because thinking tends to be a linear-ish journey from a problem to a solution. Most readers, on the other hand, usually just want to know the solution and then to move on. That’s because good strategies for direction (e.g. when a team wants to understand how they’re supposed to solve a specific issue at hand) far more frequently than they’re read to build agreement (e.g. building stakeholder alignment during the initial development of the strategy). However, many organizations only produce writer-oriented strategy documents, and may not have any reader-oriented documents at all. If you’ve predominantly worked in those sorts of organizations, then the first reader-oriented documents you encounter will see artificial. There are also organizations that have many reader-oriented documents, but omit the rationale behind those documents. Those documents feel proscriptive and heavy-handed, because the infrequent reader who does want to understand the thinking can’t find it. Further, when they want to propose an alternative, they have to do so without the rationale behind the current policies: the absence of that context often transforms what was a collaborative problem-solving opportunity into a political match. With that in mind, I’d encourage you to see the frequent absence of these documents as a major opportunity to drive strategy within your organization, rather than evidence that these documents don’t work. My experience is that they do. Doing strategy despite unrealistic timelines The most frequent failure mode I see for strategy is when it’s rushed, and its authors accept that thinking must stop when the artificial deadline is reached. Taking annual planning at Stripe as an example, Claire Hughes Johnson argued that planning expands to fit any timeline, and consequently set a short planning timeline of several weeks. Some teams accepted that as a fixed timeline and stopped planning when the timeline ended, whereas effective teams never stopped planning before or after the planning window. When strategy work is given an artificially or unrealistic timeline, then you should deliver the best draft you can. Afterwards, rather than being finished, you should view yourself as startingt he refinement process. An open strategy secret is that many strategies never leave the refinement phase, and continue to be tweaked throughout their lifespan. Why should a strategy with an early deadline be any different? Well, there is one important problem to acknowledge: I’ve often found that the executive who initially provided the unrealistic timeline intended it as a forcing function to inspire action and quick thinking. If you have a discussion with them directly, they’re usually quite open to adjusting the approach. However, the intermediate layers of leadership between that executive and you often calcify on a particular approach which they claim that the executive insists on precisely following. Sometimes having the conversation with the responsible executive is quite difficult. In that case, you do have to work with individuals taking the strategy as literate and unalterable until either you can have the conversation or something goes wrong enough that the executive starts paying attention again. Usually, though, you can find someone who has a communication path, as long as you can articulate the issue clearly. Using strategy as non-executives Some engineers will argue that the only valid strategy altitude is the highest one defined by executives, because any other strategy can be invalidated by a new, higher altitude strategy. They would claim that teams simply cannot do strategy, because executives might invalidate it. Some engineering executives would argue the same thing, instead claiming that they can’t work on an engineering strategy because the missing product strategy or business strategy might introduce new constraints. I don’t agree with this line of thinking at all. To do strategy at any altitude, you have to come to terms with the certainty that new information will show up, and you’ll need to revise your strategy to deal with that. Uber’s service provisioning strategy is a good counterexample against the idea that you have to wait for someone else to set the strategy table. We were able to find a durable diagnosis despite being a relatively small team within a much larger organization that was relatively indifferent to helping us succeed. When it comes to using strategy, effective diagnosis trumps authority. In my experience, at least as many executives’ strategies are ravaged by reality’s pervasive details as are overridden by higher altitude strategies. The only way to be certain your strategy will fail is waiting until you’re certain that no new information might show up and require it changing. Doing strategy in chaotic environments How should you adopt LLMs? discusses how a company should plot a path through the rapidly evolving LLM ecosystem. Periods of rapid technology evolution are one reason why your strategy might encounter a pocket of chaos, but there are many others. Pockets of rapid hiring, as well as layoffs, create chaos. The departure of load-bearing senior leaders can change a company quickly. Slowing revenue in a company’s core business can also initiate chaotic actions in pursuit of a new business. Strategies don’t require stable environments. Instead, strategies require awareness of the environment that they’re operating in. In a stable period, a strategy might expect to run for several years and expect relatively little deviation from the initial approach. In a dynamic period, the strategy might know you can only protect capacity in two-week chunks before a new critical initiative pops up. It’s possible to good strategy in either scenario, but it’s impossible to good strategy if you don’t diagnose the context effectively. Unreliable information Often times, the strategy forward is very obvious if a few key decisions were made, you know who is supposed to make those decisions, but you simply cannot get them to decide. My most visceral experience of this was conducting a layoff where the CEO wouldn’t define a target cost reduction or a thesis of how much various functions (e.g. engineering, marketing, sales) should contribute to those reductions. With those two decisions, engineering’s approach would be obvious, and without that clarity things felt impossible. Although I was frustrated at the time, I’ve since come to appreciate that missing decisions are the norm rather than the exception. The strategy on Navigating Private Equity ownership deals with this problem by acknowledging a missing decision, and expressly blocking one part of its execution on that decision being made. Other parts of its plan, like changing how roles are backfilled, went ahead to address the broader cost problem. Rather than blocking on missing information, your strategy should acknowledge what’s missing, and move forward where you can. Sometimes that’s moving forward by taking risk, sometimes that’s delaying for clarity, but it’s never accepting yourself as stuck without options other than pointing a finger. Surviving other people’s bad strategy work Sometimes you will be told to follow something which is described as a strategy, but is really just a policy without any strategic thinking behind it. This is an unavoidable element of working in organizations and happens for all sorts of reasons. Sometimes, your organization’s leader doesn’t believe it’s valuable to explain their thinking to others, because they see themselves as the one important decision maker. Other times, your leader doesn’t agree with a policy they’ve been instructed to rollout. Adoption of “high hype” technologies like blockchain technologies during the crypto book was often top-down direction from company leadership that engineering disagreed with, but was obligated to align with. In this case, your leader is finding that it’s hard to explain a strategy that they themselves don’t understand either. This is a frustrating situation. What I’ve found most effective is writing a strategy of my own, one that acknowledges the broader strategy I disagree with in its diagnosis as a static, unavoidable truth. From there, I’ve been able to make practical decisions that recognize the context, even if it’s not a context I’d have selected for myself. Summary I started this chapter by acknowledging that the steps to building engineering strategy are a theory of strategy, and one that can get quite messy in practice. Now you know why strategy documents often come across as overly pristine–because they’re trying to communicate clear about a complex topic. You also know how to navigate the many ways reality pulls you away from perfect strategy, such as unrealsitic timelines, higher altitude strategies invalidating your own strategy work, working in a chaotic environment, and dealing with stakeholders who refuse to align with your strategy. Finally, we acknowledged that sometimes strategy work done by others is not what we’d consider strategy, it’s often unsupported policy with neither a diagnosis or an approach to operating the policy. That’s all stuff you’re going to run into, and it’s all stuff you’re going to overcome on the path to doing good strategy work.
In early 2014, I joined as an engineering manager for Uber’s Infrastructure team. We were responsible for a wide number of things, including provisioning new services. While the overall team I led grew significantly over time, the subset working on service provisioning never grew beyond four engineers. Those four engineers successfully migrated 1,000+ services onto a new, future-proofed service platform. More importantly, they did it while absorbing the majority, although certainly not the entirety, of the migration workload onto that small team rather than spreading it across the 2,000+ engineers working at Uber at the time. Their strategy serves as an interesting case study of how a team can drive strategy, even without any executive sponsor, by focusing on solving a pressing user problem, and providing effective ergonomics while doing so. Note that after this introductory section, the remainder of this strategy will be written from the perspective of 2014, when it was originally developed. More than a decade later after this strategy was implemented, we have an interesting perspective to evaluate its impact. It’s fair to say that it had some meaningful, negative consequences by allowing the widespread proliferation of new services within Uber. Those services contributed to a messy architecture that had to go through cycles of internal cleanup over the following years. As the principle author of this strategy, I’ve learned a lot from meditating on the fact that this strategy was wildly successful, that I think Uber is better off for having followed it, and that it also meaningfully degraded Uber’s developer experience over time. There’s both good and bad here; with a wide enough lens, all evaluations get complicated. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Reading this document To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reserve order, starting with Explore, then Diagnose and so on. Relative to the default structure, this document one tweak, folding the Operation section in with Policy. More detail on this structure in Making a readable Engineering Strategy document. Policy & Operation We’ve adopted these guiding principles for extending Uber’s service platform: Constrain manual provisioning allocation to maximize investment in self-service provisioning. The service provisioning team will maintain a fixed allocation of one full time engineer on manual service provisioning tasks. We will move the remaining engineers to work on automation to speed up future service provisioning. This will degrade manual provisioning in the short term, but the alternative is permanently degrading provisioning by the influx of new service requests from newly hired product engineers. Self-service must be safely usable by a new hire without Uber context. It is possible today to make a Puppet or Clusto change while provisioning a new service that negatively impacts the production environment. This must not be true in any self-service solution. Move to structured requests, and out of tickets. Missing or incorrect information in provisioning requests create significant delays in provisioning. Further, collecting this information is the first step of moving to a self-service process. As such, we can get paid twice by reducing errors in manual provisioning while also creating the interface for self-service workflows. Prefer initializing new services with good defaults rather than requiring user input. Most new services are provisioned for new projects with strong timeline pressure but little certainty on their long-term requirements. These users cannot accurately predict their future needs, and expecting them to do so creates significant friction. Instead, the provisioning framework should suggest good defaults, and make it easy to change the settings later when users have more clarity. The gate from development environment to production environment is a particularly effective one for ensuring settings are refreshed. We are materializing those principles into this sequenced set of tasks: Create an internal tool that coordinates service provisioning, replacing the process where teams request new services via Phabricator tickets. This new tool will maintain a schema of required fields that must be supplied, with the aim of eliminating the majority of back and forth between teams during service provisioning. In addition to capturing necessary data, this will also serve as our interface for automating various steps in provisioning without requiring future changes in the workflow to request service provisioning. Extend the internal tool will generate Puppet scaffolding for new services, reducing the potential for errors in two ways. First, the data supplied in the service provisioning request can be directly included into the rendered template. Second, this will eliminate most human tweaking of templates where typo’s can create issues. Port allocation is a particularly high-risk element of provisioning, as reusing a port can breaking routing to an existing production service. As such, this will be the first area we fully automate, with the provisioning service supply the allocated port rather than requiring requesting teams to provide an already allocated port. Doing this will require moving the port registry out of a Phabricator wiki page and into a database, which will allow us to guard access with a variety of checks. Manual assignment of new services to servers often leads to new serices being allocated to already heavily utilized servers. We will replace the manual assignment with an automated system, and do so with the intention of migrating to the Mesos/Aurora cluster once it is available for production workloads. Each week, we’ll review the size of the service provisioning queue, along with the service provisioning time to assess whether the strategy is working or needs to be revised. Prolonged strategy testing Although I didn’t have a name for this practice in 2014 when we created and implemented this strategy, the preceeding paragraph captures an important truth of team-led bottom-up strategy: the entire strategy was implemented in a prolonged strategy testing phase. This is an important truth of all low-attitude, bottom-up strategy: because you don’t have the authority to mandate compliance. An executive’s high-altitude strategy can be enforced despite not working due to their organizational authority, but a team’s strategy will only endure while it remains effective. Refine In order to refine our diagnosis, we’ve created a systems model for service onboarding. This will allow us to simulate a variety of different approaches to our problem, and determine which approach, or combination of approaches, will be most effective. As we exercised the model, it became clear that: we are increasingly falling behind, hiring onto the service provisioning team is not a viable solution, and moving to a self-service approach is our only option. While the model writeup justifies each of those statements in more detail, we’ll include two charts here. The first chart shows the status quo, where new service provisioning requests, labeled as Initial RequestedServices, quickly accumulate into a backlog. Second, we have a chart comparing the outcomes between the current status quo and a self-service approach. In that chart, you can see that the service provisioning backlog in the self-service model remains steady, as represented by the SelfService RequestedServices line. Of the various attempts to find a solution, none of the others showed promise, including eliminating all errors in provisioning and increasing the team’s capacity by 500%. Diagnose We’ve diagnosed the current state of service provisioning at Uber’s as: Many product engineering teams are aiming to leave the centralized monolith, which is generating two to three service provisioning requests each week. We expect this rate to increase roughly linearly with the size of the product engineering organization. Even if we disagree with this shift to additional services, there’s no team responsible for maintaining the extensibility of the monolith, and working in the monolith is the number one source of developer frustration, so we don’t have a practical counter proposal to offer engineers other than provisioning a new service. The engineering organization is doubling every six months. Consequently, a year from now, we expect eight to twelve service provisioning requests every week. Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate as product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing. Some additional headcount is being allocated to Service Reliability Engineers (SREs) who can take on the most nuanced, complicated service provisioning work. However, their bandwidth is already heavily constrained across many tasks, so relying on SRES is an insufficient solution. The queue for service provisioning is already increasing in size as things are today. Barring some change, many services will not be provisioned in a timely fashion. Today, provisioning a new service takes about a week, with numerous round trips between the requesting team and the provisioning team. Missing and incorrect information between teams is the largest source of delay in provisioning services. If the provisioning team has all the necessary information, and it’s accurate, then a new service can be provisioned in about three to four hours of work across configuration in Puppet, metadata in Clusto, allocating ports, assigning the service to servers, and so on. There are few safe guards on port allocation, server assignment and so on. It is easy to inadvertently cause a production outage during service provisioning unless done with attention to detail. Given our rate of hiring, training the engineering organization to use this unsafe toolchain is an impractical solution: even if we train the entire organization perfectly today, there will be just as many untrained individuals in six months. Further, there’s product engineering leadership has no interest in their team being diverted to service provisioning training. It’s widely agreed across the infrastructure engineering team that essentially every component of service provisioning should be replaced as soon as possible, but there is no concrete plan to replace any of the core components. Further, there is no team accountable for replacing these components, which means the service provisioning team will either need to work around the current tooling or replace that tooling ourselves. It’s urgent to unblock development of new services, but moving those new services to production is rarely urgent, and occurs after a long internal development period. Evidence of this is that requests to provision a new service generally come with significant urgency and internal escalations to management. After the service is provisioned for development, there are relatively few urgent escalations other than one-off requests for increased production capacity during incidents. Another team within infrastructure is actively exploring adoption of Mesos and Aurora, but there’s no concrete timeline for when this might be available for our usage. Until they commit to supporting our workloads, we’ll need to find an alternative solution. Explore Uber’s server and service infrastructure today is composed of a handful of pieces. First, we run servers on-prem within a handful of colocations. Second, we describe each server in Puppet manifests to support repeatable provisioning of servers. Finally, we manage fleet and server metadata in a tool named Clusto, originally created by Digg, which allows us to populate Puppet manifests with server and cluster appropriate metadata during provisioning. In general, we agree that our current infrastructure is nearing its end of lifespan, but it’s less obvious what the appropriate replacements are for each piece. There’s significant internal opposition to running in the cloud, up to and including our CEO, so we don’t believe that will change in the forseeable future. We do however believe there’s opportunity to change our service definitions from Puppet to something along the lines of Docker, and to change our metadata mechanism towards a more purpose-built solution like Mesos/Aurora or Kubernetes. As a starting point, we find it valuable to read Large-scale cluster management at Google with Borg which informed some elements of the approach to Kubernetes, and Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center which describes the Mesos/Aurora approach. If you’re wondering why there’s no mention of Borg, Omega, and Kubernetes, it’s because it wasn’t published until 2016, a year after this strategy was developed. Within Uber, we have a number of ex-Twitter engineers who can speak with confidence to their experience operating with Mesos/Aurora at Twitter. We have been unable to find anyone to speak with that has production Kubernetes experience operating a comparably large fleet of 10,000+ servers, although presumably someone is operating–or close to operating–Kuberenetes at that scale. Our general belief of the evolution of the ecosystem at the time is described in this Wardley mapping exercise on service orchestration (2014). One of the unknowns today is how the evolution of Mesos/Aurora and Kubernetes will look in the future. Kubernetes seems promising with Google’s backing, but there are few if any meaningful production deployments today. Mesos/Aurora has more community support and more production deployments, but the absolute number of deployments remains quite small, and there is no large-scale industry backer outside of Twitter. Even further out, there’s considerable excitement around “serverless” frameworks, which seem like a likely future evolution, but canvassing the industry and our networks we’ve simply been unable to find enough real-world usage to make an active push towards this destination today. Wardley mapping is introduced as one of the techniques for strategy refinement, but it can also be a useful technique for exploring an dyanmic ecosystem like service orchestration in 2014. Assembling each strategy requires exercising judgment on how to compile the pieces together most usefully, and in this case I found that the map fits most naturally in with the rest of exploration rather than in the more operationally-focused refinement section.
More in programming
AGI is coming—whether we’re ready or not. I’ve been convinced of this trajectory since GPT-3’s release, but recent developments have significantly accelerated my timelines. The first major shift was OpenAI’s breakthrough in test-time compute and its newly demonstrated scaling law.
AI Updates There is a lot of chatter about 2025 being the year of agentic frameworks. To me, this means a system in which a subset can allow AI models to take independent actions based on their environment, typically interacting with external APIs or interfaces. The terminology around this concept is still evolving, and definitions […]
This blog post is another one in the ‘writing things down to structure my thinking on where I want my career to go’ series. I will get back to writing technical and automation blog posts soon, but I need to finish my contract testing course first. One of the things I like to do most in life is traveling and seeing new places. Well, seeing new places, mostly, as the novelty of waiting, flying and staying in hotel rooms has definitely worn off by now. I am in the privileged position (really, that is what it is: I’m privileged, and I fully realize that) that I get to scratch this travel itch professionally on a regular basis these days. Over the last few years, I have been invited to contribute to meetups and conferences abroad, and I also get to run in-house training sessions with companies outside the Netherlands a couple of times per year. Most of this traveling takes place within Europe, but for the last three years, I have been able to travel outside of Europe once every year (South Africa in 2022, Canada in 2023 and the United States in 2024), and needless to say I have enjoyed those opportunities very much. To give you an idea of the amount of traveling I do: for 2025, I now have four work-related trips abroad scheduled, and I am pretty sure at least a few more will be added to that before the year ends (it’s only just February…). That might not be much travel by some people’s standards, but for me, it is. And it seems the number of opportunities I get for traveling increase year over year, to the point where I have to say ‘no’ to several of these opportunities. Say no? Why? I thought you just said you loved to travel? Yes, that’s true. I do love to travel. But I also love spending time at home with my family, and that comes first. Always. Now, my sons are getting older, and being away from home for a few days doesn’t put as much pressure on them and on my wife as it did a few years ago. Still, I always need to find a balance between spending time with them and spending time at work. I am away from home for work not just when I’m abroad. I run evening training sessions with clients here in the Netherlands on a regular basis, too, as well as training sessions in my evenings for clients in different time zones, mainly US-based clients. And all that adds up. I try to only be away from home one night per week, but often, it’s two. When I travel abroad, it’s even more than that. Again, I’m not complaining. Not at all. It is an absolute privilege to get to travel for work and get paid to do that, but I cannot do that indefinitely, and that’s why I have made a decision: With a few exceptions (more on those below), I am going to say ‘no’ to conferences abroad from now on. This is a tough decision for me to make, but sometimes that’s exactly what you need to do. Tough, because I have very fond memories of all the conferences and meetups abroad I have contributed to. My first one, Romanian Testing Conference in 2017. My first keynote abroad, UKStar in 2019. My first one outside of Europe, Targeting Quality in 2023. They were all amazing, because of the travel and sightseeing (when time allowed), but also because of all the people I have met at these conferences. Yet, I can meet at least some of these people at conferences here in the Netherlands, too. Test Automation Days, the TestNet events, the Dutch Testing Day and TestMass all provide a great opportunity for me to catch up with my network. Sometimes, international conferences come to the Netherlands, too, like AutomationSTAR this year. And then there are plenty of smaller meetups here in the Netherlands (and Belgium) where I can meet and catch up with people as well. Plus, the money. I am not going to be a hypocrite and say that money doesn’t play into this. For the reasons mentioned above, I have a limited number of opportunities to travel every year, and I prefer to spend those on in-house training sessions with clients abroad, simply because the pay is much better. Even when a conference compensates flights and hotel (as they should) and offer a speaker or workshop facilitator fee (a nice bonus), it will be significantly less of a payday than when I run a training session with a client. That’s not the fault of those conferences, not at all, especially when they’re compensating their speakers fairly, but this is simply a matter of numbers and budgets. At the moment, I have one, maybe two contributions to conferences abroad coming up, and I gave them my word, so I’ll be there. That’s the SAST 30-year anniversary conference in October, plus one other conference that I’m talking to but haven’t received a ‘yes’ or ‘no’ from yet. Other than that, if conferences reach out to me, it’s likely to be a ‘no’ from now on, unless: the event pays a fee comparable to my rate for in-house training I can combine the event with paid in-house training (for example with a sponsor) it is a country or region I really, really want to visit, either for personal reasons or because I want to grow my professional network there I don’t see the first one happening soon, and the list of destinations for the third one is very short (Norway, Canada, New Zealand, that’s pretty much it), so unless we can arrange paid in-house training alongside the conference, the answer will be a ‘no’ from me. Will this reduce the number of travel opportunities for me? Maybe. Maybe not. Again, I see the number of requests I get for in-house training abroad growing, too, and if that dies down, it’ll be a sign for me that I’ll have to work harder to create those opportunities. For 2025, things are looking pretty good, with trips for training to Romania, North Macedonia and Denmark already scheduled, and several leads for more in the pipeline. And if the number of opportunities does go down, that’s fine, too. I’m happy to spend that time with family, working on other things, or riding my bike. And I’m sure there will be a few opportunities to speak at online meetups, events and webinars, too.
This post covers why companies are considering reincorporating from Delaware to Nevada & Texas
A few weeks ago I ran a terminal survey (you can read the results here) and at the end I asked: What’s the most frustrating thing about using the terminal for you? 1600 people answered, and I decided to spend a few days categorizing all the responses. Along the way I learned that classifying qualitative data is not easy but I gave it my best shot. I ended up building a custom tool to make it faster to categorize everything. As with all of my surveys the methodology isn’t particularly scientific. I just posted the survey to Mastodon and Twitter, ran it for a couple of days, and got answers from whoever happened to see it and felt like responding. Here are the top categories of frustrations! I think it’s worth keeping in mind while reading these comments that 40% of people answering this survey have been using the terminal for 21+ years 95% of people answering the survey have been using the terminal for at least 4 years These comments aren’t coming from total beginners. Here are the categories of frustrations! The number in brackets is the number of people with that frustration. Honestly I don’t how how interesting this is to other people – I’m just writing this up for myself because I’m trying to write a zine about the terminal and I wanted to get a sense for what people are having trouble with. remembering syntax (115) People talked about struggles remembering: the syntax for CLI tools like awk, jq, sed, etc the syntax for redirects keyboard shortcuts for tmux, text editing, etc One example comment: There are just so many little “trivia” details to remember for full functionality. Even after all these years I’ll sometimes forget where it’s 2 or 1 for stderr, or forget which is which for > and >>. switching terminals is hard (91) People talked about struggling with switching systems (for example home/work computer or when SSHing) and running into: OS differences in keyboard shortcuts (like Linux vs Mac) systems which don’t have their preferred text editor (“no vim” or “only vim”) different versions of the same command (like Mac OS grep vs GNU grep) no tab completion a shell they aren’t used to (“the subtle differences between zsh and bash”) as well as differences inside the same system like pagers being not consistent with each other (git diff pagers, other pagers). One example comment: I got used to fish and vi mode which are not available when I ssh into servers, containers. color (85) Lots of problems with color, like: programs setting colors that are unreadable with a light background color finding a colorscheme they like (and getting it to work consistently across different apps) color not working inside several layers of SSH/tmux/etc not liking the defaults not wanting color at all and struggling to turn it off This comment felt relatable to me: Getting my terminal theme configured in a reasonable way between the terminal emulator and fish (I did this years ago and remember it being tedious and fiddly and now feel like I’m locked into my current theme because it works and I dread touching any of that configuration ever again). keyboard shortcuts (84) Half of the comments on keyboard shortcuts were about how on Linux/Windows, the keyboard shortcut to copy/paste in the terminal is different from in the rest of the OS. Some other issues with keyboard shortcuts other than copy/paste: using Ctrl-W in a browser-based terminal and closing the window the terminal only supports a limited set of keyboard shortcuts (no Ctrl-Shift-, no Super, no Hyper, lots of ctrl- shortcuts aren’t possible like Ctrl-,) the OS stopping you from using a terminal keyboard shortcut (like by default Mac OS uses Ctrl+left arrow for something else) issues using emacs in the terminal backspace not working (2) other copy and paste issues (75) Aside from “the keyboard shortcut for copy and paste is different”, there were a lot of OTHER issues with copy and paste, like: copying over SSH how tmux and the terminal emulator both do copy/paste in different ways dealing with many different clipboards (system clipboard, vim clipboard, the “middle click” keyboard on Linux, tmux’s clipboard, etc) and potentially synchronizing them random spaces added when copying from the terminal pasting multiline commands which automatically get run in a terrifying way wanting a way to copy text without using the mouse discoverability (55) There were lots of comments about this, which all came down to the same basic complaint – it’s hard to discover useful tools or features! This comment kind of summed it all up: How difficult it is to learn independently. Most of what I know is an assorted collection of stuff I’ve been told by random people over the years. steep learning curve (44) A lot of comments about it generally having a steep learning curve. A couple of example comments: After 15 years of using it, I’m not much faster than using it than I was 5 or maybe even 10 years ago. and That I know I could make my life easier by learning more about the shortcuts and commands and configuring the terminal but I don’t spend the time because it feels overwhelming. history (42) Some issues with shell history: history not being shared between terminal tabs (16) limits that are too short (4) history not being restored when terminal tabs are restored losing history because the terminal crashed not knowing how to search history One example comment: It wasted a lot of time until I figured it out and still annoys me that “history” on zsh has such a small buffer; I have to type “history 0” to get any useful length of history. bad documentation (37) People talked about: documentation being generally opaque lack of examples in man pages programs which don’t have man pages Here’s a representative comment: Finding good examples and docs. Man pages often not enough, have to wade through stack overflow scrollback (36) A few issues with scrollback: programs printing out too much data making you lose scrollback history resizing the terminal messes up the scrollback lack of timestamps GUI programs that you start in the background printing stuff out that gets in the way of other programs’ outputs One example comment: When resizing the terminal (in particular: making it narrower) leads to broken rewrapping of the scrollback content because the commands formatted their output based on the terminal window width. “it feels outdated” (33) Lots of comments about how the terminal feels hampered by legacy decisions and how users often end up needing to learn implementation details that feel very esoteric. One example comment: Most of the legacy cruft, it would be great to have a green field implementation of the CLI interface. shell scripting (32) Lots of complaints about POSIX shell scripting. There’s a general feeling that shell scripting is difficult but also that switching to a different less standard scripting language (fish, nushell, etc) brings its own problems. Shell scripting. My tolerance to ditch a shell script and go to a scripting language is pretty low. It’s just too messy and powerful. Screwing up can be costly so I don’t even bother. more issues Some more issues that were mentioned at least 10 times: (31) inconsistent command line arguments: is it -h or help or –help? (24) keeping dotfiles in sync across different systems (23) performance (e.g. “my shell takes too long to start”) (20) window management (potentially with some combination of tmux tabs, terminal tabs, and multiple terminal windows. Where did that shell session go?) (17) generally feeling scared/uneasy (“The debilitating fear that I’m going to do some mysterious Bad Thing with a command and I will have absolutely no idea how to fix or undo it or even really figure out what happened”) (16) terminfo issues (“Having to learn about terminfo if/when I try a new terminal emulator and ssh elsewhere.”) (16) lack of image support (sixel etc) (15) SSH issues (like having to start over when you lose the SSH connection) (15) various tmux/screen issues (for example lack of integration between tmux and the terminal emulator) (15) typos & slow typing (13) the terminal getting messed up for various reasons (pressing Ctrl-S, cating a binary, etc) that’s all! I’m not going to make a lot of commentary on these results, but here are a couple of categories that feel related to me: remembering syntax & history (often the thing you need to remember is something you’ve run before!) discoverability & the learning curve (the lack of discoverability is definitely a big part of what makes it hard to learn)