Even the best policies fail if they aren’t adopted by the teams they’re intended to serve. Can we persistently change our company’s behaviors with a one-time announcement? No, probably not. I refer to the art of making policies work as “operations” or “strategy operations.” The good news is that effectively operating a policy is two-thirds avoiding common practices that simply don’t work. The other one-third takes some practice, but can be practiced in any engineering role: there’s no need to wait until you’re an executive to start building mastery. This chapter will dig into those mechanisms, with particular focus on:

- How policies are supported by operations, and how operations are composed of mechanisms that ensure they work well
- Evaluating operational mechanisms to select between different options, and determine which mechanisms are unlikely to be an effective choice
- Composing an operational plan for the specific set of policies that you are looking to support
- Common varieties of...
a month ago


More from Irrational Exuberance

Refreshed StaffEng.com and a few other sites

Ahead of announcing the title and publisher of my thus-far-untitled book on engineering strategy in the next week or two, I put together a website for its content. That site is pretty much the same format as this blog, but with some improvements like better mobile rendering on / than this blog has historically had. After finishing that work, I ported the improvements back to lethain.com, but also decided to bring them to staffeng.com. That was slightly trickier because, unlike this blog, StaffEng was historically a Gatsby app. (Why a Gatsby app? Because Calm was using Gatsby for our web frontend and I wanted to get some experience with it.) Over the weekend, I took some time to migrate to Hugo and apply the same enhancements, which you can now see in the lethain:staff-eng repository or on staffeng.com.

Here’s a screenshot of the old version. Then here’s a screenshot of the updated version. Overall, I think it’s slightly easier to read, and I took it as a chance to update the various links. For example, I removed the newsletter link and pointed that to this blog’s newsletter instead, given that one’s mailing list went quiet a long time ago. Speaking of going quiet, I also brought these updates to infraeng.dev, which is the very-stuck-in-time site for the book I may-or-may-not one day write about infrastructure engineering. That means that I now have four essentially equivalent Hugo sites running different content websites: this blog, staffeng.com, infraeng.dev, and the site for the upcoming book. All of these build and deploy automatically onto GitHub Pages, which has been an extremely easy, reliable workflow for me.

While I was working on this, someone asked me why I don’t just write my own blog server to host my blogs. The answer here is pretty straightforward. I’ve written three blog servers for my blog over the years. The first two were in Python, and the last one was in Go. They all worked well enough, but maintaining them was eventually a pain point because they required a real build pipeline and dealing with libraries that could have security issues. Even in the best case, the containers they ran in would get end-of-lifed periodically as Ubuntu versions got deprecated.

What I’ve slowly learned from that is that, as a frequent writer, you really want your content to live somewhere that can work properly for decades. Even small maintenance costs can be prohibitive over time, and I’ve seen some good blogs disappear rather than e.g. figure out a WordPress upgrade. Individually, these are all minor, but over decades they can really add up. This is also my argument against using hosted providers: I’m sure Substack will be around in five years, but I have no idea if it will be around in twenty years. I do know that I’ll still be writing then, and will also want my previous writing to still be accessible.

21 hours ago 1 votes
Why did Stripe build Sorbet? (~2017).

Many hypergrowth companies of the 2010s battled increasing complexity in their codebase by decomposing their monoliths. Stripe was somewhat of an exception, largely delaying decomposition until it had grown beyond three thousand engineers and had accumulated a decade of development in its core Ruby monolith. Even now, significant portions of their product are maintained in the monolithic repository, and it’s safe to say this was only possible because of Sorbet’s impact.

Sorbet is a custom static type checker for Ruby that was initially designed and implemented by Stripe engineers on their Product Infrastructure team. Stripe’s Product Infrastructure team had similar goals to other companies’ Developer Experience or Developer Productivity teams, but it focused on improving productivity through changes in the internal architecture of the codebase itself, rather than relying solely on external tooling or processes. This strategy explains why Stripe chose to delay decomposition for so long, and how the Product Infrastructure team invested in developer productivity to deal with the challenges of a large Ruby codebase managed by a large software engineering team with low average tenure caused by rapid hiring.

Before wrapping this introduction, I want to explicitly acknowledge that this strategy was spearheaded by Stripe’s Product Infrastructure team, not by me. Although I ultimately became responsible for that team, I can’t take credit for this strategy’s thinking. Rather, I was initially skeptical, preferring an incremental migration to an existing strongly-typed programming language, either Java for library coverage or Golang for Stripe’s existing familiarity. Despite my initial doubts, the Sorbet project eventually won me over with its indisputable results.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this document

To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document.

Policy & Operation

The Product Infrastructure team is investing in Stripe’s developer experience by:

- Every six months, Product Infrastructure will select its three highest priority areas to focus on, and invest a significant majority of its energy into those. We will provide minimal support for other areas.
- We commit to refreshing our priorities every half after running the developer productivity survey. We will further share our results, and priorities, in each Quarterly Business Review.

Our three highest priority areas for this half are:

- Add static typing to the highest value portions of our Ruby codebase, such that we can run the type checker locally and on the test machines to identify errors more quickly.
- Support selective test execution such that engineers can quickly determine and run the most appropriate tests on their machine rather than delaying until tests run on the build server.
- Instrument test failures such that we have better data to prioritize future efforts.

Static typing is not a typical solution to developer productivity, so it requires some explanation when we say this is our highest priority area for investment. Doubly so when we acknowledge that it will take us 12-24 months of much of the team’s time to get our type checker to an effective place.

Our type checker, which we plan to name Sorbet, will allow us to continue developing within our existing Ruby codebase. It will further allow our product engineers to remain focused on developing new functionality rather than migrating existing functionality to new services or programming languages. Instead, our Product Infrastructure team will centrally absorb both the development of the type checker and the initial rollout to our codebase. It’s possible for Product Infrastructure to take on both, despite its fixed size. We’ll rely on a hybrid approach of deep-dives to add typing to particularly complex areas, and scripts to rewrite our code’s Abstract Syntax Trees (AST) for less complex portions. In the relatively unlikely event that this approach fails, the cost to Stripe is of a small, known size: approximately six months of half the Product Infrastructure team, which is what we anticipate requiring to determine if this approach is viable.

Based on our knowledge of Facebook’s Hack project, we believe we can build a static type checker that runs locally and significantly faster than our test suite. It’s hard to make a precise guess now, but we think less than 30 seconds to type our entire codebase, despite it being quite large. This will allow for a highly productive local development experience, even if we are not able to speed up local testing. Even if we do speed up local testing, typing would help us eliminate one of the categories of errors that testing has been unable to eliminate: passing unexpected types across code paths which have been tested for expected scenarios but not for entirely unexpected ones.

Once the type checker has been validated, we can incrementally prioritize adding typing to the highest value places across the codebase. We do not need to wholly type our codebase before we can start getting meaningful value. In support of these static typing efforts, we will advocate for product engineers at Stripe to begin development using the Command Query Responsibility Segregation (CQRS) design pattern, which we believe will provide high-leverage interfaces for incrementally introducing static typing into our codebase.

Selective test execution will allow developers to quickly run appropriate tests locally. This will allow engineers to stay in a tight local development loop, speeding up development of high quality code. Given that our codebase is not currently statically typed, inferring which tests to run is rather challenging. With our very high test coverage, and the fact that all tests will still be run before deployment to the production environment, we believe that we can rely on statistically inferring which tests are likely to fail when a given file is modified (a rough sketch of this inference approach appears at the end of this excerpt).

Instrumenting test failures is our third, and lowest priority, project for this half. Our focus this half is purely on annotating errors for which we have high conviction about their source, whether infrastructure or test issues. For escalations and issues, reach out in the #product-infra channel.

Diagnose

In 2017, Stripe is a company of about 1,000 people, including 400 software engineers. We aim to grow our organization by about 70% year-over-year to meet increasing demand for a broader product portfolio and to scale our existing products and infrastructure to accommodate user growth.
As our production stability has improved over the past several years, we have now turned our focus towards improving developer productivity. Our current diagnosis of our developer productivity is:

- We primarily fund developer productivity for our Ruby-authoring software engineers via our Product Infrastructure team. The Ruby-focused portion of that team has about ten engineers on it today, and is unlikely to significantly grow in the future. (If we do expand, we are likely to staff non-Ruby ecosystems like Scala or Golang.)
- We have two primary mechanisms for understanding our engineers’ developer experience. The first is standard productivity metrics around deploy time, deploy stability, test coverage, test time, test flakiness, and so on. The second is a twice annual developer productivity survey.
- Looking at our productivity metrics, our test coverage remains extremely high, with coverage above 99% of lines, and tests are quite slow to run locally. They run quickly in our infrastructure because they are multiplexed across a large fleet of test runners.
- Tests have become slow enough to run locally that an increasing number of developers run an overly narrow subset of tests, or entirely skip running tests until after pushing their changes. They instead rely on our test servers to run against their pull request’s branch, which works well enough, but significantly slows down developer iteration time because the merge, build, and test cycle takes twenty to thirty minutes to complete. By the time their build-test cycle completes, they’ve lost their focus and may take several hours to return to addressing the results.
- There is significant disagreement about whether tests are becoming flakier due to test infrastructure issues, or due to quality issues of the tests themselves. At this point, there is no trustworthy dataset that allows us to attribute between those two causes.
- Feedback from the twice annual developer productivity survey supports the above diagnosis, and adds some additional nuance. Most concerning, although long-tenured Stripe engineers find themselves highly productive in our codebase, we increasingly hear in the survey that newly hired engineers with long tenures at other companies find themselves unproductive in our codebase. Specifically, they find it very difficult to determine how to safely make changes in our codebase.
- Our product codebase is entirely implemented in a single Ruby monolith. There is one narrow exception, a Golang service handling payment tokenization, which we consider out of scope for two reasons. First, it is kept intentionally narrow in order to absorb our SOC1 compliance obligations. Second, developers in that environment have not raised concerns about their productivity.
- Our data infrastructure is implemented in Scala. While these developers have concerns–primarily slow build times–they manage their build and deployment infrastructure independently, and the group remains relatively small.
- Ruby is not a highly performant programming language, but we’ve found it sufficiently efficient for our needs. Similarly, other languages are more cost-efficient from a compute resources perspective, but a significant majority of our spend is on real-time storage and batch computation. For these reasons alone, we would not consider replacing Ruby as our core programming language.
- Our Product Infrastructure team is about ten engineers, supporting about 250 product engineers. We anticipate this group growing modestly over time, but certainly sublinearly to the overall growth of product engineers. Developers working in Golang and Scala routinely ask for more centralized support, but it’s challenging to prioritize those requests as we’re forced to consider the return on improving the experience for 240 product engineers working in Ruby vs 10 in Golang or 40 data engineers in Scala. If we introduced more programming languages, this prioritization problem would become increasingly difficult, and we are already falling short in supporting the additional languages we have.
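
To make the selective test execution policy concrete (the sketch referenced above), here is one way to statistically infer which tests to run for a change: count how often each test failed alongside changes to a given file in past CI runs, then run the highest-scoring tests locally. This is a minimal illustration in Python, not Stripe's actual implementation; the data shapes, thresholds, and file names below are invented.

```python
"""Toy sketch: pick tests to run locally based on how often they failed
alongside changes to a given file in past CI runs."""
from collections import defaultdict

def build_failure_index(ci_history):
    """ci_history: iterable of (changed_files, failed_tests) pairs from past builds.

    Returns {file: {test: co-failure count}}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for changed_files, failed_tests in ci_history:
        for path in changed_files:
            for test in failed_tests:
                index[path][test] += 1
    return index

def select_tests(index, changed_files, min_count=2, limit=50):
    """Rank tests by how often they failed when these files changed; cap the list.

    All tests still run on the build servers before deploy, so a miss here only
    costs iteration speed, not correctness.
    """
    scores = defaultdict(int)
    for path in changed_files:
        for test, count in index.get(path, {}).items():
            scores[test] += count
    ranked = [t for t, c in sorted(scores.items(), key=lambda kv: -kv[1]) if c >= min_count]
    return ranked[:limit]

if __name__ == "__main__":
    # Invented example data, purely for illustration.
    history = [
        (["app/charges.rb"], ["test_charge_create", "test_charge_refund"]),
        (["app/charges.rb", "app/invoices.rb"], ["test_charge_create", "test_invoice_total"]),
    ]
    index = build_failure_index(history)
    print(select_tests(index, ["app/charges.rb"], min_count=2))  # ['test_charge_create']
```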

4 days ago 9 votes
How to get better at strategy?

One of the most memorable quotes in Arthur Miller’s Death of a Salesman comes from Uncle Ben, who describes his path to becoming wealthy as, “When I was seventeen, I walked into the jungle, and when I was twenty-one I walked out. And by God I was rich.” I wish I could describe the path to learning engineering strategy in similar terms, but by all accounts it’s a much slower path. Two decades in, I am still learning more from each project I work on. This book has aimed to accelerate your learning path, but my experience is that there’s still a great deal left to learn beyond what it has hoped to accomplish. This final chapter is focused on the remaining advice I have to give on how you can continue to improve at strategy long after reading this book’s final page. Inescapably, this chapter has become advice on writing your own strategy for improving at strategy. You are already familiar with my general suggestions on creating strategy, so this chapter provides focused advice on creating your own plan to get better at strategy. It covers:

- Exploring strategy creation to find strategies you can learn from via public and private resources, and through creating learning communities
- How to diagnose the strategies you’ve found, to ensure you learn the right lessons from each one
- Policies that will help you find ways to perform and practice strategy within your organization, whether or not you have organizational authority
- Operational mechanisms to hold yourself accountable to developing a strategy practice
- My final benediction to you as a strategy practitioner who has finished reading this book

With that preamble, let’s write this book’s final strategy: your personal strategy for developing your strategy practice. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Exploring strategy creation Ideally, we’d start our exploration of how to improve at engineering strategy by reading broadly from the many publicly available examples. Unfortunately, there simply aren’t many easily available works to learn from others’ experience. Nonetheless, resources do exist, and we’ll discuss the three categories that I’ve found most useful:

- Public resources on engineering strategy, such as companies’ engineering blogs
- Private and undocumented strategies available through your professional network
- Learning communities that you build together, including ongoing learning circles

Each of these is explored in its own section below. Public resources While there aren’t as many public engineering strategy resources as I’d like, I’ve found that there are still a reasonable number available. This book collects a number of such resources in the appendix of engineering strategy resources. That appendix also includes some individuals’ blog posts that are adjacent to this topic. You can go a long way by searching and prompting your way into these resources. As you read them, it’s important to recognize that public strategies are often misleading, as discussed previously in evaluating strategies. Everyone writing in public has an agenda, and that agenda often means that they’ll omit important details to make themselves, or their company, come off well. Make sure you read between the lines rather than taking things too literally.
Private resources Ironically, where public resources are hard to find, I’ve found it much easier to find privately held strategy resources. While private recollections are still prone to inaccuracies, the incentives to massage the truth are less pronounced. The most useful sources I’ve found are: peers’ stories – strategies are often oral histories, and they are shared freely among peers within and across companies. As you build out your professional network, you can usually get access to any company’s engineering strategy on any topic by just asking. There are brief exceptions. Even a close peer won’t share a sensitive strategy before its existence becomes obvious externally, but they’ll be glad to after it does. People tend to over-estimate how much information companies can keep private anyway: even reading recent job postings can usually expose a surprising amount about a company. internal strategy archaeologists – while surprisingly few companies formally collect their strategies into a repository, the stories are informally collected by the tenured members of the organization. These folks are the company’s strategy archaeologists, and you can learn a great deal by explicitly consulting them becoming a strategy archaeologist yourself – whether or not you’re a tenured member of your company, you can learn a tremendous amount by starting to build your own strategy repository. As you start collecting them, you’ll interest others in contributing their strategies as well. As discussed in Staff Engineer’s section on the Write five then synthesize approach to strategy, over time you can foster a culture of documentation where one didn’t exist before. Even better, building that culture doesn’t require any explicit authority, just an ongoing show of excitement. There are other sources as well, ranging from attending the hallway track in conferences to organizing dinners where stories are shared with a commitment to privacy. Working in community My final suggestion for seeing how others work on strategy is to form a learning circle. I formed a learning circle when I first moved into an executive role, and at this point have been running it for more than five years. What’s surprised me the most is how much I’ve learned from it. There are a few reasons why ongoing learning circles are exceptional for sharing strategy: Bi-directional discussion allows so much more learning and understanding than mono-directional communication like conference talks or documents. Groups allow you to learn from others’ experiences and others’ questions, rather than having to guide the entire learning yourself. Continuity allows you to see the strategy at inception, during the rollout, and after it’s been in practice for some time. Trust is built slowly, and you only get the full details about a problem when you’ve already successfully held trust about smaller things. An ongoing group makes this sort of sharing feasible where a transient group does not. Although putting one of these communities together requires a commitment, they are the best mechanism I’ve found. As a final secret, many people get stuck on how they can get invited to an existing learning circle, but that’s almost always the wrong question to be asking. If you want to join a learning circle, make one. That’s how I got invited to mine. Diagnosing your prior and current strategy work Collecting strategies to learn from is a valuable part of learning. You also have to determine what lessons to learn from each strategy. 
For example, you have to determine whether Calm’s approach to resourcing Engineering-driven projects is something to copy or something to avoid. What I’ve found effective is to apply the strategy rubric we developed in the “Is this strategy any good?” chapter to each of the strategies you’ve collected. Even by splitting a strategy into its various phases, you’ll learn a lot. Applying the rubric to each phase will teach you more. Each time you do this to another strategy, you’ll get a bit faster at applying the rubric, and you’ll start to see interesting, recurring patterns. As you dig into a strategy that you’ve split into phases and applied the evaluation rubric to, here are a handful of questions that I’ve found interesting to ask myself: How long did it take to determine a strategy’s initial phase could be improved? How high was the cost to fund that initial phase’s discovery? Why did the strategy reach its final stage and get repealed or replaced? How long did that take to get there? If you had to pick only one, did this strategy fail in its approach to exploration, diagnosis, policy or operations? To what extent did the strategy outlive the tenure of its primary author? Did it get repealed quickly after their departure, did it endure, or was it perhaps replaced during their tenure? Would you generally repeat this strategy, or would you strive to avoid repeating it? If you did repeat it, what conditions seem necessary to make it a success? How might you apply this strategy to your current opportunities and challenges? It’s not necessary to work through all of these questions for every strategy you’re learning from. I often try to pick the two that I think might be most interesting for a given strategy. Policy for improving at strategy At a high level, there are just a few key policies to consider for improving your strategic abilities. The first is implementing strategy, and the second is practicing implementing strategy. While those are indeed the starting points, there are a few more detailed options worth consideration: If your company has existing strategies that are not working, debug one and work to fix it. If you lack the authority to work at the company scope, then decrease altitude until you find an altitude you can work at. Perhaps setting Engineering organizational strategies is beyond your circumstances, but strategy for your team is entirely accessible. If your company has no documented strategies, document one to make it debuggable. Again, if operating at a high altitude isn’t attainable for some reason, operate at a lower altitude that is within reach. If your company’s or team’s strategies are effective but have low adoption, see if you can iterate on operational mechanisms to increase adoption. Many such mechanisms require no authority at all, such as low-noise nudges or the model-document-share approach. If existing strategies are effective and have high adoption, see if you can build excitement for a new strategy. Start by mining for which problems Staff-plus engineers and senior managers believe are important. Once you find one, you have a valuable strategy vein to start mining. If you don’t feel comfortable sharing your work internally, then try writing proposals while only sharing them to a few trusted peers. You can even go further to only share proposals with trusted external peers, perhaps within a learning circle that you create or join. Trying all of these at once would be overwhelming, so I recommend picking one in any given phase. 
If you aren’t able to make traction, then try another until something works. It’s particularly important to recognize in your diagnosis where things are not working–perhaps you simply don’t have the sponsorship you need to enforce strategy so you need to switch towards suggesting strategies instead–and you’ll find something that works. What if you’re not allowed to do strategy? If you’re looking to find one, you’ll always unearth a reason why it’s not possible to do strategy in your current environment. If you’ve convinced yourself that there’s simply no policy that would allow you to do strategy in your current role, then the two most useful levers I’ve found are: Lower your altitude – there’s always a scale where you can perform strategy, even if it’s just your team or even just yourself. Only you can forbid yourself from developing personal strategies. Practice rather than perform – organizations can only absorb so much strategy development at a given time, so sometimes they won’t be open to you doing more strategy. In that case, you should focus on practicing strategy work rather than directly performing it. Only you can stop yourself from practice. Don’t believe the hype: you can always do strategy work. Operating your strategy improvement policies As the refrain goes, even the best policies don’t accomplish much if they aren’t paired with operational mechanisms to ensure the policies actually happen, and debug why they aren’t happening. Although it’s tempting to ignore operations when it comes to our personal habits, I think that would be a mistake: our personal habits have the most significant long-term impact on ourselves, and are the easiest habits to ignore since others generally won’t ask about them. The mechanisms I’d recommend: Explicitly track the strategies that you’ve implemented, refined, documented, or read. This should be in a document, spreadsheet or folder where you can explicitly see if you have or haven’t done the work. Review your tracked strategies every quarter: are you working on the expected number and in the expected way? If not, why not? Ideally, your review should be done in community with a peer or a learning circle. It’s too easy to deceive yourself, it’s much harder to trick someone else. If your periodic review ever discovers that you’re simply not doing the work you expected, sit down for an hour with someone that you trust–ideally someone equally or more experienced than you–and debug what’s going wrong. Commit to doing this before your next periodic review. Tracking your personal habits can feel a bit odd, but it’s something I highly recommend. I’ve been setting and tracking personal goals for some time now—for example, in my 2024 year in review—and have benefited greatly from it. Too busy for strategy Many companies convince themselves that they’re too much in a rush to make good decisions. I’ve certainly gotten stuck in this view at times myself, although at this point in my career I find it increasingly difficult to not recognize that I have a number of tools to create time for strategy, and an obligation to do strategy rather than inflict poor decisions on the organizations I work in. Here’s my advice for creating time: If you’re not tracking how often you’re creating strategies, then start there. If you’ve not worked on a single strategy in the past six months, then start with one. If implementing a strategy has been prohibitively time consuming, then focus on practicing a strategy instead. 
If you do try all those things and still aren’t making progress, then accept your reality: you don’t view doing strategy as particularly important. Spend some time thinking about why that is, and if you’re comfortable with your answer, then maybe this is a practice you should come back to later. Final words At this point, you’ve read everything I have to offer on drafting engineering strategy. I hope this has refined your view on what strategy can be in your organization, and has given you the tools to draft a more thoughtful future for your corner of the software engineering industry. What I’d never ask is for you to wholly agree with my ideas here. They are my best thinking on this topic, but strategy is a topic where I’m certain Hegel’s world view is the correct one: even the best ideas here are wrong in interesting ways, and will be surpassed by better ones.

a week ago 7 votes
Wardley mapping the service orchestration ecosystem (2014).

In Uber’s 2014 service migration strategy, we explore how to navigate the move from a Python monolith to a services-oriented architecture while also scaling with user traffic that doubled every six months. This Wardley map explores how orchestration frameworks were evolving during that period, to be used as an input into determining the most effective path forward for Uber’s Infrastructure Engineering team. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Reading this map To quickly understand this Wardley Map, read from top to bottom. If you want to review how this map was written, then you should read section by section from the bottom up, starting with Users, then Value Chains, and so on. More detail on this structure in Refining strategy with Wardley Mapping. How things work today There are three primary internal teams involved in service provisioning. The Service Provisioning Team abstracts applications developed by Product Engineering from servers managed by the Server Operations Team. As more servers are added to support application scaling, this is invisible to the applications themselves, freeing Product Engineers to focus on what the company values the most: developing more application functionality. The challenges within the current value chain are cost-efficient scaling, reliable deployment, and fast deployment. All three of those problems anchor on the same underlying problem of resource scheduling. We want to make a significant investment into improving our resource scheduling, and believe that understanding the industry’s trend for resource scheduling underpins making an effective choice. Transition to future state Most interesting cluster orchestration problems are anchored in cluster metadata and resource scheduling. Routing requests, whether through DNS entries or allocated ports, depends on cluster metadata. Mapping services to a fleet of servers depends on resource scheduling managing cluster metadata. Deployment and autoscaling both depend on cluster metadata. This is also an area where we see significant changes occurring in 2014. Uber initially solved this problem using Clusto, an open-source tool released by Digg with goals similar to Hashicorp’s Consul but with limited adoption. We also used Puppet for configuring servers, alongside custom scripting. This has worked, but has required custom, ongoing support for scheduling. The key question we’re confronted with is whether to build our own scheduling algorithms (e.g. bin packing) or adopt a different approach. It seems clear that the industry intends to directly solve this problem via two paths: relying on Cloud providers for orchestration (Amazon Web Services, Google Cloud Platform, etc) and through open-source scheduling frameworks such as Mesos and Kubernetes. Our industry peers with over five years of technical infrastructure are almost unanimously adopting open-source scheduling frameworks to better support their physical infrastructure. This will give them a tool to perform a bridged migration from physical infrastructure to cloud infrastructure. Newer companies with less existing infrastructure are moving directly to the cloud, and avoiding solving the orchestration problem entirely. 
The only companies not adopting one of these two approaches are extraordinarily large and complex (think Google or Microsoft) or allergic to making any technical change at all. From this analysis, it’s clear that continuing our reliance on Clusto and Puppet is going to be an expensive investment that’s not particularly aligned with the industry’s evolution. User & Value Chains This map focuses on the orchestration ecosystem within a single company, with a focus on what did, and did not, stay the same from roughly 2008 to 2014. It focuses in particular on three users: Product Engineers are focused on provisioning new services, and then deploying new versions of that service as they make changes. They are wholly focused on their own service, and entirely unaware of anything beneath the orchestration layer (including any servers). Service Provisioning Team focuses on provisioning new services, orchestrating resources for those services, and routing traffic to those services. This team acts as the bridge between the Product Engineers and the Server Operations Team. Server Operations Team is focused on adding server capacity to be used for orchestration. They work closely with the Service Provisioning Team, and have no contact with the Product Engineers. It’s worth acknowledging that, in practice, these are artificial aggregates of multiple underlying teams. For example, routing traffic between services and servers is typically handled by a Traffic or Service Networking team. However, these omissions are intended to clarify the distinctions relevant to the evolution of orchestration tooling.
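
To ground the "bin packing" framing of resource scheduling mentioned above, here is a toy first-fit-decreasing allocator: given services with known memory demands and fixed-size servers, it places the largest services first and opens a new server only when nothing fits. This is a generic illustration with invented service names and sizes, not the algorithm Clusto, Mesos, or Kubernetes actually use.

```python
"""First-fit decreasing: a toy version of the bin-packing view of resource scheduling."""

def schedule(services, server_capacity_gb):
    """Assign each (name, memory_gb) service to a server, opening new servers as needed.

    Returns a list of servers, where each server is a list of service names.
    Illustrative only: real schedulers also weigh CPU, ports, affinity, failure domains, etc.
    """
    servers = []  # each entry: {"free": remaining_gb, "services": [names]}
    # Place the largest services first; this simple heuristic usually wastes less capacity.
    for name, memory_gb in sorted(services, key=lambda s: s[1], reverse=True):
        for server in servers:
            if server["free"] >= memory_gb:
                server["free"] -= memory_gb
                server["services"].append(name)
                break
        else:
            servers.append({"free": server_capacity_gb - memory_gb, "services": [name]})
    return [s["services"] for s in servers]


if __name__ == "__main__":
    # Hypothetical service demands in GB of memory.
    demand = [("api", 24), ("dispatch", 48), ("payments", 16), ("maps", 40), ("billing", 8)]
    print(schedule(demand, server_capacity_gb=64))
    # [['dispatch', 'payments'], ['maps', 'api'], ['billing']]
```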

a week ago 10 votes
Script for consistent linking within book.

As part of my work on #eng-strategy-book, I’ve been editing a bunch of stuff. This morning I wanted to work on two editing problems. First, I wanted to ensure I was referencing strategies evenly across chapters (and not relying too heavily on any given strategy). Second, I wanted to make sure I was making references to other chapters in a consistent, standardized way. Both of these tasks involve collecting Markdown links from files, grouping those links by either file or url, and then outputting the grouped content in a useful way.

I decided to experiment with writing a one-shot prompt to write the script for me rather than writing it myself. The prompt and output (from ChatGPT 4.5) are available in this gist. That worked correctly! The output was a bit ugly, so I tweaked the output slightly by hand, and also adjusted the regular expression to capture less preceding content, which resulted in this script. Although I did it by hand, I’m sure it would have been faster to just ask ChatGPT to fix the script itself, but either way these are very minor tweaks. Now I can call the script in either standard or --grouped mode.

Example of ./scripts/links.py "content/posts/strategy-book/*.md" output: Example of ./scripts/links.py "content/posts/strategy-book/*.md" --grouped output:

Altogether, this is a super simple script that I could have written in thirty minutes or so, but this allowed me to write it in less than ten minutes, and get back to actually editing with the remaining twenty.
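
The actual prompt and script live in the linked gist; as a rough illustration of the approach (not the exact script ChatGPT produced), a minimal version might look something like this, assuming standard-library `glob`, `re`, and `argparse`, and Markdown links of the form `[text](url)`:

```python
#!/usr/bin/env python3
"""Collect Markdown links from files and group them by source file or by URL."""
import argparse
import glob
import re
from collections import defaultdict

LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")  # matches [text](url)

def collect(pattern):
    """Return (filename, text, url) for every Markdown link found in matching files."""
    links = []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for text, url in LINK_RE.findall(f.read()):
                links.append((path, text, url))
    return links

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("pattern", help='e.g. "content/posts/strategy-book/*.md"')
    parser.add_argument("--grouped", action="store_true",
                        help="group by URL instead of by source file")
    args = parser.parse_args()

    groups = defaultdict(list)
    for path, text, url in collect(args.pattern):
        key = url if args.grouped else path
        groups[key].append((path, text, url))

    for key in sorted(groups):
        print(f"{key} ({len(groups[key])})")
        for path, text, url in groups[key]:
            print(f"  [{text}]({url}) in {path}")

if __name__ == "__main__":
    main()
```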

2 weeks ago 10 votes

More in programming

“Can They Change My Contract?”: Protecting Your Workplace Rights in Japan

Right now, the General Union is handling cases at Japanese tech companies where well-established workplace practices have come under threat. These include businesses pushing for return-to-office mandates after years of remote work, eliminating flexible scheduling, and cutting bonuses and other forms of compensation. Sometimes these companies are altering work conditions that were never officially documented but had become standard practice. Other times, they’re trying to eliminate benefits explicitly written into contracts or work rules. In both cases, many workers believe they have no choice but to accept these changes. They’re wrong! Japanese labour law protects workers in two significant ways here. First, there’s the principle of established workplace practices (労使慣行, roudou kankou), protected through decades of court precedents, which can give unwritten customs the same legal weight as written rules. Then there’s Article 8 of the Labour Contract Law, which prevents employers from unilaterally changing documented working conditions. But having these legal protections is only half the battle. While the courts have established strong precedents, as an individual, pursuing these rights can be prohibitively difficult. Taking an employer to court is expensive, time-consuming, and potentially career-damaging. This is where collective action through unions becomes essential. Unions provide both the legal expertise and collective leverage needed to uphold these rights. For tech workers in Japan, these issues have never been more relevant. The tech industry’s desperate need for skilled workers, worth nearly 22 trillion yen in domestic investments, could create unprecedented leverage. By understanding their legal rights and organizing collectively, developers can effectively protect and even improve their workplace conditions in this critical moment. How the courts protect you A series of landmark cases in Japan created robust protections for workplace customs, also known as established employment practices. But what qualifies as an established employment practice? Yukiko Sadaoka, a regular collaborator with the General Union, explained how courts determine whether a workplace practice qualifies to be protected. “There are three main factors. One, a habit or fact must have been repeated and continued for a long period of time. Two, neither labour nor management has explicitly denied following the practice. And three, the practice must be supported by a normative awareness on both sides. The employers, especially those who control working conditions, must be aware of the practice.” Case 1: Post-Retirement Employment Practice (RECOGNISED) 東豊観光事件 (Toho Kanko), Osaka District Court 28, June 1990 In this order by the court, the company’s work regulations stated that the forced retirement age (定年, teinen) was 55. In practice, however, the company repeatedly kept employees on after they had reached this age. This custom continued for six years (1984–1990), during which 7 out of 8 employees who reached the age of 55 were retained. When the company terminated the 55 year-old plaintiff, citing the official retirement rule, the plaintiff sued for confirmation of employment status. The court stated that the practice of continued employment after 55 had become an established workplace custom. That custom overrode the written regulations, despite its relatively short period of practice, because it had been consistently applied to almost all employees who had reached the retirement age during that time. 
Case 2: Extra Pay for Substitute Holidays (DENIED) 商大八戸の里ドライビングスクール事件 (Syodai Yae-no-sato Driving School), Osaka High Court 25 June, 1993, and upheld by Supreme Court 9 March, 1995 Employees claimed entitlement to extra pay when working on a substitute holiday. The company’s policy was that every other Monday was a day off. If that particular Monday fell on a national holiday, Tuesday would become the substitute day off (振替休日, furikae kyuujitsu). The question was whether working on these Tuesday substitute holidays entitled workers to extra holiday pay. Although this practice had existed for 10 years, both the Osaka High Court, and the Supreme Court denied that it was an established workplace practice because the specific situation—working on a Tuesday substitute holiday—had only occurred 8–10 times during that period. This case demonstrates that frequency matters. Even when established over a lengthy period of time, if the actual instances of the practice are rare, it may not be considered established. Case 3: Bonus Amount Practice (PARTIALLY RECOGNISED) 立命館事件 (Ritsumeikan), Kyoto District Court 29 March 2012 (appealed, outcome uncertain) This case involved a bonus dispute. For 14 years, according to labour agreements (労働協約, roudou kyouyaku), the company had paid a bonus of 6.1 months’ salary plus 100,000 yen. However, there was no provision about bonuses in the company work rules (就業規則, shuugyou kisoku). When the employer announced the forthcoming bonus would be only 5.1 months’ salary, employees claimed the higher amount was an established practice. The court made an interesting distinction: It ruled that the specific amount (6.1 months + 100,000 yen) was NOT an established practice because there had been negotiations each year before agreeing on the amount, even though the final amount had always been the same. However, it recognised that “at least 6 months’ salary” had become an established practice, because the employer’s initial offer had never been lower than 6 months throughout the 14-year period. This case demonstrates the nuanced way courts examine whether a practice has created a legitimate expectation that can’t be unilaterally changed. These examples show that determining what constitutes an established workplace practice isn’t a simple matter of time. Courts look at consistency, frequency, specificity, and whether both sides understood the practice to be binding—even if they didn’t write it down. Other cases of interest The courts’ protection of established practices has occasionally been implemented in surprising ways. Take a case from 1968 that went all the way to the Tokyo High Court. The issue? Whether railway workers could continue their long-standing practice of using the company bath at 4 p.m. and clocking out at 4:30. At first glance, it might seem trivial—why fight over a bathing schedule? But the court’s decision was significant: after 13 years of this practice, with management’s knowledge and acceptance, it had become a legally-protected workplace custom that couldn’t be unilaterally changed. In the 1973 Shishido Shokai case (宍戸商会事件), the Tokyo District Court ruled that a company must pay severance to a voluntarily-resigned employee because consistent payment to other departing employees had established a binding workplace custom. The court classified these payments as deferred wages, confirming that consistent practices can create legal rights without written policies. 
Even when ruling in employers’ favour, courts have reinforced the importance of established practices. The 1982 Daiwa Bank case (大和銀行事件) upheld the bank’s long-standing practice of paying bonuses only to those still employed on payment dates, ruling that workers who left before payment weren’t entitled to bonuses despite working during the calculation period. These court precedents have established principles that apply universally across all industries. Whether a railway worker’s bathing schedule from the 1960s or solidifying employer rights, established employment practices are a real concept that can be parlayed into tech workers’ fight to maintain their working conditions. What does the law say? Article 8 of the Labour Contract Law also protects you against unilateral changes to working conditions. The law states, “A Worker and an Employer may, by agreement, change any working conditions that constitute the contents of a labour contract.” While the law is written in the positive, the inverse is what is important: “a worker and an employer cannot change working conditions without mutual agreement.” So, what constitutes the “contents of a labour contract”? Does this only deal with working conditions that are specifically written into your employment contract? According to General Union Chair Toshiaki Asari, “The idea that long-standing practices can become part of the employment contract and receive legal protection—even if they aren’t explicitly written—is well-established as a legal doctrine. Therefore, long-standing practices should also fall under the protections of Article 8 of the Labour Contract Law, just like written working conditions.” This perspective is supported by a significant 1991 Supreme Court ruling in the Hitachi Musashi Factory case (日立製作所武蔵工場事件). When an employee refused to work overtime during a production issue, claiming the request was unreasonable, the court found that the company’s consistent practice of requiring overtime in specific situations had become an implied term of employment—even without documentation. Because this practice was long-standing and implicitly accepted by employees over time, the court upheld the company’s disciplinary action. Though the ruling was in the employer’s favor, it also established that workplace customs, through consistent application and mutual understanding, can become legally binding components of employment contracts. Does this mean an employer can never change any working condition? The short answer is: they certainly can. This right of the employer to change working conditions is established in both Articles 9 and 10 of the Labour Contract Law. However, Article 9 also sets up a fundamental principle: employers cannot unilaterally change working conditions to workers’ disadvantage by modifying work rules without employee agreement. While Article 10 provides limited exceptions to this principle, it sets strict criteria that employers must meet. Any changes must be: Properly communicated to workers Reasonable given the degree of disadvantage to workers Necessary for the business Appropriate in their revised content Preceded by proper negotiations with unions or worker representatives This framework ensures that while employers can adapt to changing business needs, they cannot do so by simply imposing disadvantageous changes on workers without justification or prior consultation. The reality of defending your rights But what do these legal protections really mean in practice? 
Terms like disadvantages, reasonable, necessary, and appropriate sound reassuring on paper. Yet their practical meaning becomes far less clear when your employer suddenly demands you return to the office after five years of remote work, or changes how raises are calculated, or alters stock option arrangements. This is where the gap between legal rights and practical enforcement becomes stark. While the law provides a strong framework for protecting workplace practices, enforcing these rights as an individual is another matter entirely. Taking your employer to court is a long, expensive process that could take years to resolve, and that’s not only in the case of established employment practices. Even the seemingly clear prescriptions of the Labour Contract Law can only be enforced through court action. In the meantime, you’re stuck working under the contested conditions, while potentially damaging both your career and wellbeing. And remember, if you want to negotiate with your employer about changes to workplace practices, they have no legal obligation to even meet with you. They can simply ignore your requests or refuse outright. The power of collective action Unlike individual workers, who can be ignored or denied the chance to meet with management, employers cannot legally refuse to negotiate with a labour union. This right to collective bargaining is protected by the Trade Union Law. Even if you’re the only union member in your workplace, the company must negotiate with your union in good faith. The law also protects union members from retaliation or disadvantageous treatment. Unions offer multiple pathways for protecting workers’ rights and established employment practices. One key strategy is establishing prior consultation agreements (事前協議協定, jizen kyougi kyoutei), which require employers to inform and consult with the union before implementing changes to working conditions. By securing a seat at the table early, unions can influence decisions and propose alternatives before changes are implemented, rather than fighting them after the fact. Through collective bargaining, unions can also convert established workplace practices into written agreements, giving them stronger protection than relying on court precedents alone. By codifying these customs in collective agreements, they become immune to unilateral changes by employers. This matters because collective agreements sit at the top of the workplace rules hierarchy, superseding both individual contracts and company work rules—they’re the gold standard in protecting workers’ rights. True, an employer might still break a collective agreement, but unlike individual workers who can only turn to courts or government agencies, unions have multiple ways to respond. They can file complaints with the Labour Commission, as violating a collective agreement constitutes an unfair labour practice under the Trade Union Law. Most importantly, unions retain the right to take direct action through strikes and other collective measures—options simply not available to individual workers. As an individual, though, declaring union membership often shifts the power dynamic. When you’re backed by a union, you’re no longer facing the company alone. Many disputes are resolved at this early stage, as employers recognise that their actions will face greater scrutiny. While some employers might still attempt to violate the law, most understand that directly confronting unions by targeting their members creates more problems than it’s worth. 
Remember, your right to labour union representation is enshrined both in the Trade Union Law and Article 28 of the Constitution, which states that “The right of workers to organise and to bargain and act collectively is guaranteed.” Unions can also push beyond merely protecting existing rights, into negotiating improvements that particularly matter to tech workers—like expanded remote work options, clearer overtime compensation for project crunch times, training and skill development budgets, improved parental leave policies, and protections against excessive on-call rotations. Unions can also establish guidelines on emerging issues like AI implementation policies and the right to disconnect after working hours. Conclusion For developers, this situation presents a unique opportunity for collective action. Japan’s tech industry, worth nearly 22 trillion yen in domestic IT investments alone, mirrors the automotive industry of mid-20th century America—a wealthy, rapidly growing sector desperately in need of workers at all skill levels. Like the auto plants of Detroit, modern tech companies rely on a diverse workforce ranging from highly-paid specialists to entry-level developers. This combination—a wealthy industry, a significant labour shortage, and the critical role tech workers play in company operations—creates unprecedented leverage for collective action. Just as auto workers used their position to secure better wages and working conditions in the 1950s and 1960s, tech workers today could wield similar influence. The cost of lost productivity in tech companies can run into millions of yen per day, giving organised workers significant bargaining power. Moreover, unlike in traditional manufacturing where production delays might only affect future sales, many tech companies lose revenue immediately when work stops. By maintaining critical systems, supporting client operations, and keeping online services running, tech workers’ daily tasks directly impact their companies’ bottom lines. Even a small group of organised workers can effectively advocate for better conditions. That’s why understanding your legal rights under Japanese labour law is crucial—it provides the foundation for protecting established employment practices and knowing what you can rightfully demand. But knowledge alone isn’t enough. Whether these rights come from written contracts, established customs, or collective agreements, making them real requires more than just knowing they exist: it requires standing together to defend them. Through union membership, you combine legal protection with collective power—that is, you have both the law on your side and an active organisation working to protect your rights, with real mechanisms to enforce them.

6 hours ago 2 votes
SSEBITDA -- A steady-state profit metric for SaaS companies

How can a business that is "spending to grow" determine whether it's truly profitable underneath all that "revenue acceleration"? Here's a way.

yesterday 3 votes
You’re Only As Strong As Your Weakest Point

In April 1945, as US soldiers overtook Merkers, Germany, stories began to surface to Army officials of stolen Nazi riches stored in the local salt mine. Eventually, the Americans found the mine and began exploring it, ending up at a vaulted door. Here’s the story, as told by Greg Bradsher: the Americans found the main vault. It was blocked by a brick wall three feet thick…In the center of the wall was a large bank-type steel safe door, complete with combination lock and timing mechanism with a heavy steel door set in the middle of it. Attempts to open the steel vault door were unsuccessful. Word went up the chain of command about the find and suspected gold hoard behind the vaulted steel door. The order came back down to open it up. But what to do about this vault door that, up until now, nobody could open? One engineer looked at the problem and said: forget the door, blow the wall! One of the engineers who inspected the brick wall surrounding the vault door thought it could be blasted through with little effort. Therefore the engineers, using a half-stick of dynamite, blasted an entrance through the masonry wall.

To me, this is a fascinating commentary on security specifically [insert meme of gate with no fence], but also a commentary on problem-solving generally. When you have a seemingly intractable problem — there’s an impenetrable door we can’t open — rather than focus on the door itself, you take a step back and realize the door may be impenetrable but the wall enclosing it is not. A little dynamite and problem solved.

Lessons: You’re only as strong as your weakest point. Don’t miss the forest for the trees. A little dynamite goes a long way.

Footnote to this story, in case you’re wondering what they found inside: [a partial] inventory indicated that there were 8,198 bars of gold bullion; 55 boxes of crated gold bullion; hundreds of bags of gold items; over 1,300 bags of gold Reichsmarks, British gold pounds, and French gold francs; 711 bags of American twenty-dollar gold pieces; hundreds of bags of gold and silver coins; hundreds of bags of foreign currency; 9 bags of valuable coins; 2,380 bags and 1,300 boxes of Reichsmarks (2.76 billion Reichsmarks); 20 silver bars; 40 bags containing silver bars; 63 boxes and 55 bags of silver plate; 1 bag containing six platinum bars; and 110 bags from various countries

2 days ago 2 votes
An unplanned upgrade to Linux Mint 22.1 Cinnamon

I spoke too soon when I said I was enjoying the stability of Linux. I have been using Linux Mint Cinnamon on a System76 Merkaat PC with no major issues since July of 2024. But a few days ago a routine system update of Mint 22 dumped me to the text console. A fresh install of Mint 22.1, the latest release, brought the system back online. I had backups and the mishap luckily turned out to be just an annoyance that consumed several hours of unplanned maintenance.

It all started when the Mint Update Manager listed several packages for update, including the System76 driver and tools. Oddly, the Update Manager also marked several packages for removal, including core ones such as Xorg, Celluloid, and more. The smooth running of Mint made my paranoid side fall asleep and I applied the recommended changes. At the next reboot the graphics session didn't start and landed me at the text console with no clue what happened.

I don't use Timeshift for system snapshots as I prefer a fresh install and restore of data backups if the system breaks. Therefore, to fix such an issue, apparently related to Mint 22, the obvious route was to install Mint 22.1. Besides, this was the right occasion to try the new release.

On my Raspberry Pi 400 I ran dd to flash a bootable USB stick with Mint 22.1. I had no alternatives as GNOME Disks didn't work. The Merkaat failed to boot off the stick, possibly because I messed with the arguments of dd. I still had a USB stick with Mint 22 around and I used it to freshly install Mint 22 on the Merkaat. Then I immediately ran the upgrade to Mint 22.1, which completed successfully, unlike a prior upgrade attempt. Next, I tried to install the System76 driver with sudo apt install system76-driver but got a package not found error. At that point I had already added the System76 package repository to the APT sources, and refreshing the Mint Update Manager yielded this error: Could not refresh the list of updates Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages

Aside from the errors, the system was up and running on the Merkaat, so with Nemo I reflashed the Mint 22.1 stick. This time the PC did boot off the stick and let me successfully install Mint 22.1. Restoring the data completed the system recovery. I left out the System76 driver as it's the primary suspect, possibly due to package conflicts. Mint detects and supports all hardware of the Merkaat anyway and it's only prudent to skip the package for the time being.

Besides improvements under the hood, Mint 22.1 features a redesigned default Cinnamon theme. No major changes, I feel at home.

The main takeaway of this adventure is that it's better to have a bootable USB stick ready with the latest Mint release, even if I don't plan to upgrade immediately. Another takeaway is that the Pi 400 makes for a viable backup computer that can support my major tasks, should it take longer to recover the Merkaat. However, using the device for making bootable media is problematic as little flashing software is available and some is unreliable. Finally, over decades of Linux experience I honed my emergency installation skills so much I can now confidently address most broken system situations.

#linux #pi400

2 days ago 3 votes