If you talk to enough aspiring leaders, you'll become familiar with the prevalent idea that they need to be promoted before they can work on strategy. It's a truism, but I've also found this idea perfectly wrong: you can work on strategy from anywhere in an organization; it just requires different tactics to do so. Both Staff Engineer and The Engineering Executive's Primer have chapters on strategy. While the chapters' contents are quite different, both present a practical path to advancing your organization's thinking about complex topics. This chapter explains my belief that anyone within an organization can make meaningful progress on strategy, particularly if you are honest about the tools accessible to you, and thoughtful about how to use them.

The themes we'll dig into are:

- How to do strategy as an engineer, particularly an engineer who hasn't been given explicit authority to do strategy
- Doing strategy as an engineering executive who is responsible for your organization's decision-making
- How you can do engineering strategy even when you depend on an absent strategy, cannot acknowledge parts of the diagnosis because addressing certain problems is politically sensitive, or struggle with pockets of misaligned incentives
- If this book's argument is that everyone should do strategy, is there anyone who, nonetheless, really should not do strategy?

By the end, you'll hopefully agree that engineering strategy is accessible to everyone, even though you're always operating within constraints.

This is an exploratory, draft chapter for a book on engineering strategy that I'm brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Doing strategy as an engineer

It's easy to get so distracted by executives' top-down approach to strategy that you convince yourself there aren't other approachable mechanisms for doing strategy. There are! Staff Engineer introduces an approach I call Write five, then synthesize, which does strategy by:

- Documenting how five current and historical related decisions have been made in your organization. This is an extended exploration phase.
- Synthesizing those five documents into a diagnosis and policy. You are naming the implicit strategy, so it's impossible for someone to reasonably argue you're not empowered to do strategy: you're just describing what's already happening.

At that point, either the organization feels comfortable with what you've written–which is their current strategy–or it doesn't, in which case you've forced a conversation about how to revise the approach. Creating awareness is often enough to drive strategic change, and it doesn't require any explicit authorization from an executive.

When awareness is insufficient, the other pattern I've found highly effective in low-authority scenarios is an approach I wrote about in An Elegant Puzzle, and call model, document, and share:

- Model the approach you want others to adopt. Make it easy for them to observe how you've changed the way you're doing things.
- Document the approach, the thinking behind it, and how to adopt it.
- Share the document around. If people see you succeeding with the approach, then they're likely to copy it from you.

You might be skeptical because this is an influence-based approach. However, as we'll discuss in the next section, even executive-driven strategy is highly dependent on influence.
Strategy archaeology

Vernor Vinge's A Deepness in the Sky, published in 1999, introduced the term software archaeologists, folks who created functionality by cobbling together millennia of scraps of existing software. Although it's a somewhat different usage, I sometimes think of the "write five, then synthesize" approach as performing strategy archaeology. Simply by recording what has happened in the past, we make it easier to understand the present, and to influence the future.

Doing strategy as an executive

The biggest misconception about executive roles, frequently held by non-executives and by new executives who are about to make a series of regrettable mistakes, is that executives operate without constraints. That is false: executives operate under an extremely large number of constraints. Executives have budgets, CEO visions, peers to satisfy, and a team to motivate. They can disappoint any of these temporarily, but over the long term they have to satisfy all of them. Nonetheless, it is true that executives have more latitude to mandate and cajole participation in the strategies that they sponsor.

The Engineering Executive's Primer's chapter on strategy is a brief summary of this entire book, but it doesn't say much about how executive strategy differs from non-executive strategy. How the executive's approach to strategy differs from the engineer's can be boiled down to:

- Executives can mandate that others follow their strategy, which expands their policy options. An engineer can't prevent the promotion of someone who refused to follow their policy, but an executive can.
- Mandates only matter if there are consequences. If an executive is unwilling to enforce consequences for non-compliance with a mandate, the ability to issue a mandate isn't meaningful. This is also true if they can't enforce a mandate because of a lack of support from their peer executives.
- Even if an executive is unwilling to use mandates, they have significant visibility into, and access to, their organization to advocate for their preferred strategy.
- Neither access nor mandates improve an executive's ability to diagnose problems. However, both often create the appearance of progress, which is why executive strategies can fail so spectacularly and endure so long despite failure.

As a result, my experience is that executives have an easier time doing strategy, but a much harder time learning how to do strategy well, and fewer protections to avoid serious mistakes. Further, the consequences of an executive's poor strategy tend to reach much further than an engineer's. Waiting to do strategy until you are an executive is a recipe for disaster, even if it looks easier from a distance.

Doing strategy in other roles

Even if you're neither an engineer nor an engineering executive, you can still do engineering strategy. It'll just require an even more influence-driven approach. The engineering organization is generally right to believe that they know the most about engineering, but that's not always true. Sometimes a product manager used to be an engineer and has significant relevant experience. Other times, such as during the early adoption of large language models, engineers don't know much either, and benefit from outside perspectives.

Doing strategy in challenging environments

Good strategies succeed by accurately diagnosing circumstances and picking policies that address those circumstances.
You are likely to spend time in organizations where both of those are challenging due to internal limitations, so it's worth acknowledging that and discussing how to navigate those challenges.

Low-trust environment

Sometimes the struggle to diagnose problems is a skill issue. Being bad at strategy is in some ways the easy problem to solve: just do more strategy work to build expertise. In other cases, you may see what the problems are fairly clearly, but not know how to acknowledge them because your organization's culture would frown on it. The latter is a diagnosis problem rooted in low trust, and it does make things more difficult. The chapter on Diagnosis recognizes this problem, and admits that sometimes you have to whisper the controversial parts of a strategy:

When you're writing a strategy, you'll often find yourself trying to choose between two awkward options: say something awkward or uncomfortable about your company or someone working within it, or omit a critical piece of your diagnosis that's necessary to understand the wider thinking. Whenever you encounter this sort of debate, my advice is to find a way to include the diagnosis, but to reframe it into a palatable statement that avoids casting blame too narrowly.

In short, the solution to low trust is to translate difficult messages into softer, less direct versions that are acceptable to state. If your goal is to hold people accountable, this can feel dishonest or like an ethical compromise, but the goal of strategy is to make better decisions, which is an entirely different concern than holding folks accountable for the past.

Karpman Drama Triangle

Sometimes when the diagnosis seems particularly obvious, and people don't agree with you, it's because you are wrong. When I've been obviously wrong about things I understand well, it's usually because I've fallen into viewing a situation through the Karpman Drama Triangle, where all parties are mapped as the persecutor, the rescuer, or the victim.

Poor-judgment environment

Even when you do an excellent job diagnosing challenges, it can be difficult to drive agreement within the organization about how to address them. Sometimes this is due to genuinely complex tradeoffs. For example, in Stripe's acquisition of Index, there was debate about how to deal with Index's Java-based technology stack, which culminated in a compromise that didn't make anyone particularly happy:

Defer making a decision regarding the introduction of Java to a later date: the introduction of Java is incompatible with our existing engineering strategy, but at this point we've also been unable to align stakeholders on how to address this decision. Further, we see attempting to address this issue as a distraction from our timely goal of launching a joint product within six months. We will take up this discussion after launching the initial release.

That compromise is a good example of a difficult tradeoff: although parties disagreed with the approach, everyone understood the conflicting priorities that had to be addressed. In other cases, though, there are policy choices that simply don't make much sense, generally driven by poor judgment in your organization. Sometimes it's not poor technical judgment, but poor judgment in choosing to prioritize one's personal interests at the expense of the company's needs.
Calm's strategy to focus on being a product-engineering organization dealt with some aspects of that, acknowledged in its diagnosis:

We're arguing a particularly large amount about adopting new technologies and rewrites. Most of our disagreements stem from adopting new technologies or rewriting existing components into new technology stacks. For example, can we extend this feature or do we have to migrate it to a service before extending it? Can we add this to our database or should we move it into a new Redis cache instead? Is JavaScript a sufficient programming language, or do we need to rewrite this functionality in Go?

In that situation, your strategy is an attempt to educate your colleagues about the tradeoffs they are making, but ultimately sometimes folks will disagree with your strategy. In that case, remember that most interesting problems require iterative solutions. Writing your strategy and sharing it will start to change the organization's mind. Don't get discouraged even if that change is initially slow.

Dealing with missing strategies

The strategy for dealing with new private equity ownership introduces a common problem: lack of clarity about what other parts of your own company want. In that case, it seems likely there will be a layoff, but it's unclear how large that layoff will be:

Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don't have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction.

Many leaders encounter that sort of ambiguity and decide that they cannot move forward with a strategy of their own until that decision is made. While it's true that it's inconvenient not to know the details, getting blocked by ambiguity is always the wrong decision. Instead, you should do what the private equity strategy does: accept that ambiguity as a fact to be worked around. Rather than giving up, it adopts a series of new policies to start reducing cost growth by changing the organization's seniority mix, and recognizes that once there is clarity on reduction targets, there will be additional actions to take.

Whenever you're doing something challenging, there are an infinite number of reasonable rationales for why you shouldn't or can't make progress. Leadership is finding a way to move forward despite those issues. A missing strategy is always part of your diagnosis, but never a reason that you can't do strategy.

Who shouldn't do strategy

In my experience, there's almost never a reason why you cannot do strategy, but there are two particular scenarios where doing strategy probably doesn't make sense.

The first is not a who but a when problem: sometimes there is so much strategy already happening that doing more is a distraction. If another part of your organization is already working on the same problem, do your best to work with them directly rather than generating competing work.

The other time to avoid strategy is when you're trying to satisfy an emotional need to make a direct, immediate impact. Sharing a thoughtful strategy always makes progress, but it's often the slow, incremental progress of changing your organization's beliefs. Even definitive, top-down strategies from executives are often ignored in pockets of an organization, and bottom-up strategies spread slowly as they are modeled, documented, and shared.
Embarking on strategy work requires a tolerance for winning in the long run, even when there's little progress this week or this quarter.

Summary

As you finish reading this chapter, my hope is that you also believe that you can work on strategy in your organization, whether you're an engineer or an executive. I also hope that you appreciate that the tools you use vary greatly depending on who you are within your organization and the culture in which you work. Whether you need to model or can mandate, there's a mechanism that will work for you.
While discussions around acquisitions often focus on technical diligence and deciding whether to make the acquisition, the integration that follows afterwards can be even more complex. There are few irreversible trapdoor decisions in engineering, but decisions made early in an integration tend to be surprisingly durable.

This engineering strategy explores Stripe's approach to integrating their 2018 acquisition of Index. While a business book would focus on the rationale for the acquisition itself, here that rationale is merely part of the diagnosis that defines the integration tradeoffs. The integration itself is the area of focus. As with most acquisitions, the team responsible for the integration only learned about the project after the deal closed, which means early efforts are a scramble to apply strategy testing to distinguish between optimistic dates and technical realities.

This is an exploratory, draft chapter for a book on engineering strategy that I'm brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this document

To apply this strategy, start at the top with Policy & Operation. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document.

Policy & Operation

We're starting with little shared context between the acquired and acquiring engineering teams, and have a six-month timeline to launch a joint product. So our starting policy is a mix of a commitment to joint refinement and several provisional architectural policies:

- Meet at least weekly until the initial release is complete: the involved leadership from Stripe and Index will hold a weekly sync meeting to refine our approach until we fulfill our initial release timeline. This meeting is jointly owned by Stripe's Head of Traffic Engineering and Index's Head of Engineering.
- Minimize changes to the tokenization environment: because point-of-sale devices work directly with customer payment details, the API that directly supports the point-of-sale device must live within our secured environment where payment details are stored. However, any other functionality must not be added to our tokenization environment.
- All other functionality must exist in standard environments: except for the minimum necessary functionality moving into the tokenization environment, everything else must be operated in our standard, non-tokenization environments. In particular, any software that requires frequent changes, or introduces complex external dependencies, should exist in the standard environments.
- Defer making a decision regarding the introduction of Java to a later date: the introduction of Java is incompatible with our existing engineering strategy, but at this point we've also been unable to align stakeholders on how to address this decision. Further, we see attempting to address this issue as a distraction from our timely goal of launching a joint product within six months. We will take up this discussion after launching the initial release.
- Escalations come to paired leads: given our limited shared context across teams, all escalations must come to both Stripe's Head of Traffic Engineering and Index's Head of Engineering.
- Security review of changes impacting the tokenization environment: we need to move quickly to launch the combined point-of-sale and payments product, but we must not cut corners on security to launch faster. Security must be included in, and explicitly sign off on, any integration decisions that involve our tokenization environment.

Diagnose

There are generally four categories of acquisitions: talent acquisitions to bring on a talented team, business acquisitions to buy a company's revenue and product, technology acquisitions to add a differentiated capability that would be challenging to develop internally, and time-to-market acquisitions where you could develop the capability internally but can develop it meaningfully faster by acquiring a company. While most acquisitions have a flavor of several of these dimensions, this acquisition is primarily a time-to-market acquisition aimed at addressing these constraints:

- Several of our largest customers are pushing for us to provide a point-of-sale device integrated with our API-driven payments ecosystem. At least one has implied that we either provide this functionality on a committed timeline or they may churn to a competitor.
- We currently have no homegrown expertise in developing or integrating with hardware such as point-of-sale devices. Based on other zero-to-one efforts internally, we believe it would take about a year to hire the team, then develop and launch a minimum viable product for a point-of-sale device integrated into our platform.
- Where we've taken a horizontal approach to supporting web payments via an API, at least one of our competitors, Square, has taken a vertically integrated approach. While their API ecosystem is less developed than ours, they are a plausible destination for customers threatening to churn. We believe that at least one of our enterprise customers will churn if our best commitment is launching a point-of-sale solution 12 months from now.
- We've decided to acquire a small point-of-sale startup, which we will use to commit to a six-month timeframe for supporting an integrated point-of-sale device with our API ecosystem. We will need to rapidly integrate the acquired startup to meet this timeline. We only know a small number of details about what this will entail.
- We do know that point-of-sale devices directly operate on payment details (e.g. the point-of-sale device knows the credit card details of the card it reads). Our compliance obligations restrict such activity to our "tokenization environment", a highly secured and isolated environment with direct access to payment details. This environment converts payment details into a unique token that other environments can utilize to operate against payment details without the compliance overhead of having direct access to the underlying payment details.
- Going into this technical integration, we have few details about the acquired company's technology stack. We do know that they are primarily a Java shop running on AWS, where we are primarily a Ruby (with some Go) shop running on AWS.

Explore

Prior to this acquisition, we have made several small acquisitions. None of those acquisitions had a meaningful product to integrate with ours, so we don't have much of an internal playbook to anchor our approach in. We do have limited experience in integrating technical acquisitions from prior companies we've worked in, along with talking to peers at other companies to mine their experience.
Synthesizing those experiences, the recurring patterns are:

- Usually deal teams have made certain commitments, or the acquired team has understood certain commitments, that will be challenging to facilitate. This is doubly true when you are unaware of what those commitments might be. If folks seem to be behaving oddly, it might be one such misunderstanding, and it's worth engaging directly to debug the confusion.
- There should be an executive sponsor for the acquisition, and the sponsor is typically the best person to ask about the company's intentions. If you can't find the executive sponsor, or they are not engaged, try to recruit a new executive sponsor rather than trying to make things work without one.
- Close the culture gap quickly where there's little friction, and cautiously where there's little trust. We do need to bring the acquired company into our culture, but we have years to do that. The most successful stories of doing this leaned on a mix of moving folks into and out of the acquired team rather than applying force.
- The long-term cost of supporting a new technology stack is high, and in conflict with our technology strategy of consolidating on as few programming languages as possible. This is not the place to be flexible, as each additional feature in the new stack will take you further from your desired outcome.
- Finally, find a way to derisk key departures. Things can go wrong quickly. One of the easiest starting points is consolidating infrastructure immediately, even if the product or software takes longer.

Altogether, this was not the most reassuring exploration: it was a bit abstract, and much of our research returned strongly held, conflicting perspectives. Perhaps acquisitions, like starting a new company, are one of those places where there's simply no right way to do it well.
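To make the tokenization-environment constraint in the Diagnose section above more concrete, here is a minimal, hypothetical sketch of the boundary it implies. The class and method names are illustrative assumptions rather than Stripe's or Index's actual design; the point is only that raw payment details never leave the secured environment, while every other environment operates on opaque tokens.

```python
import secrets


class StandardEnvironment:
    """Hypothetical: product logic that only ever handles opaque tokens."""

    def create_payment(self, token: str, amount_cents: int) -> str:
        # Orders, receipts, and analytics store the token, never raw card data.
        return f"payment of {amount_cents} cents against {token}"


class TokenizationEnvironment:
    """Hypothetical: the secured, isolated environment that serves the
    point-of-sale device directly, and the only place raw card details exist."""

    def __init__(self, standard_env: StandardEnvironment):
        self._vault: dict[str, str] = {}  # token -> raw card details
        self._standard_env = standard_env

    def handle_device_card_read(self, card_number: str, amount_cents: int) -> str:
        # Exchange the raw card read for an opaque token immediately; only
        # the token ever crosses the boundary into standard environments.
        token = "tok_" + secrets.token_hex(8)
        self._vault[token] = card_number
        return self._standard_env.create_payment(token, amount_cents)
```

Under that split, software that changes frequently or pulls in complex dependencies can live entirely in the standard environment, which is the intent of the second and third policies above.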
Once you've written your strategy's exploration, the next step is working on its diagnosis. Diagnosis is understanding the constraints and challenges your strategy needs to address. In particular, it's about building that understanding while slowing yourself down, so that you don't decide how to solve the problem at hand before you know the problem's nuances and constraints.

If you ever find yourself wanting to skip the diagnosis phase–let's get to the solution already!–then maybe it's worth acknowledging that every strategy I've seen fail did so due to a lazy or inaccurate diagnosis. It's very challenging to fail with a proper diagnosis, and almost impossible to succeed without one.

The topics this chapter will cover are:

- Why diagnosis is the foundation of effective strategy, on which effective policy depends. Conversely, how skipping the diagnosis phase consistently ruins strategies
- A step-by-step approach to diagnosing your strategy's circumstances
- How to incorporate data into your diagnosis effectively, and where to focus on adding data
- Dealing with controversial elements of your diagnosis, such as pointing out that your own executive is one of the challenges to solve
- Why it's more effective to view difficulties as part of the problem to be solved, rather than as a blocking issue that prevents making forward progress
- The near impossibility of an effective diagnosis if you don't bring humility and self-awareness to the process

Into the details we go!

This is an exploratory, draft chapter for a book on engineering strategy that I'm brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Diagnosis is strategy's foundation

One of the challenges in evaluating strategy is that, after the fact, many effective strategies are so obvious that they're pretty boring. Similarly, most ineffective strategies are so clearly flawed that their authors look lazy. That's because, as a strategy is operated, the reality around it becomes clear. When you're writing your strategy, you don't know if you can convince your colleagues to adopt a new approach to specifying APIs, but a year later you know very definitively whether it's possible.

Building your strategy's diagnosis is your attempt to correctly recognize the context that the strategy needs to solve before deciding on the policies to address that context. Done well, the subsequent steps of writing strategy often feel like an afterthought, which is why I think of diagnosis as strategy's foundation.

Where exploration was an evaluation-free activity, diagnosis is all about evaluation. How do teams feel today? Why did that project fail? Why did the last strategy go poorly? What will be the distractions to overcome to make this new strategy successful?

That said, not all evaluation is equal. If you state your judgment directly, it's easy to dispute. An effective diagnosis is hard to argue against, because it's a web of interconnected observations, facts, and data. Even for folks who dislike your conclusions, the weight of evidence should be hard to shift.

Strategy testing, explored in the Refinement section, takes advantage of the reality that it's easier to diagnose by doing than by speculating. It proposes a recursive diagnosis process until you have real-world evidence that the strategy is working.

How to develop your diagnosis

Your strategy is almost certain to fail unless you start from an effective diagnosis, but how to build a diagnosis is often left unspecified.
That's because, for most folks, building the diagnosis is indeed a dark art: unspecified, undiscussed, and uncontrollable. I've been guilty of this as well, with The Engineering Executive's Primer's chapter on strategy staying silent on the details of how to diagnose for your strategy. So, yes, there is some truth to the idea that forming your diagnosis is an emergent, organic process rather than a structured, mechanical one. However, over time I've come to adopt a fairly structured approach:

- Braindump. Starting from a blank sheet of paper, write down your best understanding of the circumstances that inform your current strategy. Then set that piece of paper aside for the moment.
- Summarize exploration. On a new piece of paper, review the contents of your exploration. Pull in every piece of diagnosis from similar situations that resonates with you. This is true for both internal and external works! For each diagnosis, tag whether it fits perfectly, or needs to be adjusted for your current circumstances. Then, once again, set the piece of paper aside.
- Mine for distinct perspectives. On yet another blank page, talk to different stakeholders and colleagues who you know are likely to disagree with your early thinking. Your goal is not to agree with this feedback. Instead, it's to understand their view. The Crux by Richard Rumelt anchors diagnosis in this approach, emphasizing the importance of "testing, adjusting, and changing the frame, or point of view."
- Synthesize views into one internally consistent perspective. Sometimes the different perspectives you've gathered don't mesh well. They might well explicitly differ in what they believe the underlying problem is, as is typical in the tension between platform and product engineering teams. The goal is to competently represent each of these perspectives in the diagnosis, even the ones you disagree with, so that later on you can evaluate your proposed approach against each of them. When synthesizing feedback goes poorly, it tends to fail in one of two ways. First, the author's opinion shines through so strongly that it renders the author suspect. Your goal is never to agree with every team's perspective, just as your diagnosis should typically avoid crowning any perspective as correct: a reader should generally be apprised of the details and unaware of the author. The second common issue is when a group tries to jointly own the synthesis, but creates a fractured perspective rather than a unified one. I generally find that having one author who is accountable for representing all views works best to address both of these issues.
- Test drafts across perspectives. Once you've written your initial diagnosis, you want to sit down with the people who you expect to disagree most fervently. Iterate with them until they agree that you've accurately captured their perspective. It might be that they disagree with some other viewpoints, but they should be able to agree that others hold those views. They might argue that the data you've included doesn't capture their full reality, in which case you can caveat the data by saying that their team disagrees that it's a comprehensive lens.

Don't worry about getting the details perfectly right in your initial diagnosis. You're trying to get the right crumbs to feed into the next phase, strategy refinement. Allowing yourself to be directionally correct, rather than perfectly correct, makes it possible to cover a broad territory quickly.
Getting caught up in perfecting details is an easy way to anchor yourself into one perspective prematurely.

At this point, I hope you're starting to predict how I'll conclude any recipe for strategy creation: if these steps feel overly mechanical to you, adjust them to something that feels more natural and authentic. There's no perfect way to understand complex problems. That said, if you feel uncertain, or are skeptical of your own track record, I do encourage you to start with the above approach as a launching point.

Incorporating data into your diagnosis

The diagnosis in the strategy for Navigating Private Equity ownership includes a number of details to help readers understand the status quo. For example, the section on headcount growth explains how headcount has grown, how that compares to the prior year, and provides a mental model for readers to translate engineering headcount into engineering headcount costs:

Our Engineering headcount costs have grown by 15% YoY this year, and 18% YoY the prior year. Headcount grew 7% and 9% respectively, with the difference between headcount and headcount costs explained by salary band adjustments (4%), a focus on hiring senior roles (3%), and increased hiring in higher-cost geographic regions (1%).

If everyone evaluating a strategy shares the same foundational data, then evaluating the strategy becomes vastly simpler. Data is also your mechanism for supporting or critiquing the various views that you've gathered when drafting your diagnosis; to an impartial reader, data will speak louder than passion. If you're confident that a perspective is true, then include a data narrative that supports it. If you believe another perspective is overstated, then include the data that the reader will require to come to the same conclusion.

Do your best to include data analysis with a link out to the full data, rather than requiring readers to interpret the data themselves while they are reading. As your strategy document travels further, there will be inevitable requests for different cuts of data to help readers understand your thinking, and this is somewhat preventable by linking to your original sources.

If much of the data you want doesn't exist today, that's a fairly common scenario for strategy work: if the data to make the decision easy already existed, you probably would have already made a decision rather than needing to run a structured thinking process. The next chapter, on refining strategy, covers a number of tools that are useful for building confidence in low-data environments.

Whisper the controversial parts

At one time, the company I worked at rolled out a bar raiser program styled after Amazon's, where there was an interviewer from outside the team who had to approve every hire. I spent some time arguing against adding this additional step, as I didn't understand what we were solving for, and I was surprised at how uninterested management was in knowing whether the new process actually improved outcomes.

What I didn't realize until much later was that most of the senior leadership distrusted one of their peers, and had rolled out the bar raiser program solely to create a mechanism to control that manager's hiring bar when the CTO was uninterested in holding that leader accountable. (I also learned that these leaders didn't care much about implementing this policy, resulting in bar raiser rejections being frequently ignored, but that's a discussion for the Operations for strategy chapter.)
This is a good example of a strategy that does make sense with the full diagnosis, but makes little sense without it, and where stating part of the diagnosis out loud is nearly impossible. Even senior leaders are not generally allowed to write a document that says, "The Director of Product Engineering is a bad hiring manager."

When you're writing a strategy, you'll often find yourself trying to choose between two awkward options:

- Say something awkward or uncomfortable about your company or someone working within it
- Omit a critical piece of your diagnosis that's necessary to understand the wider thinking

Whenever you encounter this sort of debate, my advice is to find a way to include the diagnosis, but to reframe it into a palatable statement that avoids casting blame too narrowly.

I think it's helpful to discuss a few concrete examples of this, starting with the strategy for navigating private equity, whose diagnosis includes:

Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don't have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction.

There are many things the authors of this strategy likely feel about their state of reality. First, they are probably upset about the fact that their new private equity ownership is likely to eliminate colleagues. Second, they are likely upset that there is no clear plan around what they need to do, so they are stuck preparing for a wide range of potential outcomes. However they feel, they don't say any of that; they stick to precise, factual statements.

For a second example, we can look to the Uber service migration strategy:

Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate as product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing.

The team didn't agree that their headcount should not be growing, but it was the reality they were operating in. They acknowledged their reality as a factual statement, without any additional commentary about that statement. In both of these examples, the authors found a professional, non-judgmental way to acknowledge the circumstances they were solving for. The authors would have preferred that the leaders behind those decisions take explicit accountability for them, but it would have undermined the strategy work had they attempted to do so within their strategy writeup.

Excluding critical parts of your diagnosis makes your strategies particularly hard to evaluate, copy, or recreate. Find a way to say things politely to make the strategy effective. As always, strategies are much more about realities than ideals.

Reframe blockers as part of diagnosis

When I work on strategy with early-career leaders, an idea that comes up a lot is that an identified problem means that strategy is not possible. For example, they might argue that doing strategy work is impossible at their current company because the executive team changes their mind too often. That core insight is almost certainly true, but it's much more powerful to reframe that as a diagnosis: if we don't find a way to show concrete progress quickly, and use that to excite the executive team, our strategy is likely to fail.
This transforms the thing preventing your strategy into a condition your strategy needs to address. Whenever you run into a reason why your strategy seems unlikely to work, or why strategy overall seems difficult, you've found an important piece of your diagnosis to include. There are never reasons why strategy simply cannot succeed, only diagnoses you've failed to recognize.

For example, we knew in our work on Uber's service provisioning strategy that we weren't getting more headcount for the team, that the product engineering team was going to continue growing rapidly, and that engineering leadership was unwilling to constrain how product engineering worked. Rather than preventing us from implementing a strategy, those components clarified what sort of approach could actually succeed.

The role of self-awareness

Every problem of today is partially rooted in the decisions of yesterday. If you've been with your organization for any duration at all, this means that you are directly or indirectly responsible for a portion of the problems that your diagnosis ought to recognize. As a result, recognizing the impact of your prior actions in your diagnosis is a powerful demonstration of self-awareness. It also suggests that your next strategy's success is rooted in your self-awareness about your prior choices.

Don't be afraid to recognize the failures in your past work. While changing your mind without new data is a sign of chaotic leadership, changing your mind with new data is a sign of thoughtful leadership.

Summary

Because diagnosis is the foundation of effective strategy, I've always found it the most intimidating phase of strategy work. While I think that's a somewhat unavoidable reality, my hope is that this chapter has somewhat prepared you for that challenge. The four most important things to remember are simply: form your diagnosis before deciding how to solve it, try especially hard to capture perspectives you initially disagree with, supplement intuition with data where you can, and accept that sometimes you're missing the data you need to fully understand your circumstances. The last piece in particular is why many good strategies never get shared, and it's the topic we'll address in the next chapter on strategy refinement.
A surprising number of strategies are doomed from inception because their authors get attached to one particular approach without considering alternatives that would work better for their current circumstances. This happens when engineers want to pick tools solely because they are trending, and when executives insist on adopting the tech stack from their prior organization where they felt comfortable.

Exploration is the antidote to early anchoring, forcing you to consider the problem widely before evaluating any of the paths forward. Exploration is about updating your priors rather than assuming the industry hasn't evolved since you last worked on a given problem. Exploration is continuing to believe that things can get better when you're not watching.

This chapter covers:

- The goals of the exploration phase of strategy creation
- When to explore (always first!) and when it makes sense to stop exploring
- How to explore a topic, including discussion of the most common mechanisms: mining for internal precedent, reading industry papers and books, and leveraging your external network
- Why avoiding judgment is an essential part of exploration

By the end of this chapter, you'll be able to conduct an exploration for the current or next strategy that you work on.

What is exploration?

One of the frequent senior leadership anti-patterns I've encountered in my career is The Grand Migration, where a new leader declares that a massive migration to a new technology stack–typically the stack used by their former employer–will solve every pressing problem. What distinguishes the Grand Migration is not the initially bad selection, but the single-minded ferocity with which the senior leader pushes for their approach, even when it becomes abundantly clear to others that it doesn't solve the problem at hand. These senior leaders are very intelligent, but they have allowed themselves to be framed in by their initial thinking from prior experiences. Accepting those early thoughts as the foundation of their strategy, they build the entire strategy on top of those ideas, and eventually there is so much weight standing on those early assumptions that it becomes impossible to acknowledge the errors.

Exploration is the deliberate practice of searching through a strategy's problem and solution spaces before allowing yourself to commit to a given approach. It's understanding how others have approached the same problem recently and in the past. It's doing this both in trendy companies you admire, and in practical companies that actually resemble yours.

Most exploration will be external to your team, but depending on your company, much of your exploration might be internal to the company. If you're in a massive engineering organization of 100,000, there are likely existing internal solutions to your problem that you've never heard of. Conversely, if you're in an organization of 50 engineers, it's likely that much of your exploration will be external.

When to explore

Exploration is the first step of good strategy work. Even when you want to skip it, you will always regret skipping it, because you'll inadvertently frame yourself into whatever approach you focus on first. Especially when it comes to problems that you've solved previously, exploration is the only thing preventing you from over-indexing on your prior experiences. Try to continue exploration until you know how three similar teams within your company and three similar companies have recently solved the same problem.
Further, make sure you are able to explain the thinking behind those decisions. At that point, you should be ready to stop exploring and move on to the diagnosis step of strategy creation. Exploration should always come with a minimum and maximum timeframe: less than a few hours is very suspicious, and more than a week is generally questionable as well.

How to explore

While the details of each exploration will differ a bit, the overarching approach tends to be pretty similar across strategies. After I open up the draft strategy document I'm working on, my general approach to exploration is:

- Start throwing in every resource I can think of related to that problem. For example, in the Uber service provisioning strategy, I started by collecting recent papers on Mesos, Kubernetes, and Aurora to understand the state of the industry on orchestration.
- Do some web searching, foundational model prompting, and checking with a few current and prior colleagues about what topics and resources I might be missing. For example, for the Calm engineering strategy, I focused on talking with industry peers about tools they'd used to focus a team with diffuse goals.
- Summarize the list of resources I've gathered, organizing them by which I want to explore, and which I won't spend time on but are worth mentioning. For example, the Large Language Model adoption strategy's exploration section documents the variety of resources the team explored before completing it.
- Work through the list one by one, continuing to collect notes in the strategy document. When you're done, synthesize those into a concise, readable summary of what you've learned. For example, the monolith decomposition strategy synthesizes the exploration of a broad topic into four paragraphs, with links out to references.
- Stop once I generally understand how a handful of similar internal and external teams have recently approached this problem.

Of all the steps in strategy creation, exploration is the most inherently open-ended, and you may find a different approach works better for you. If you're not sure what to do, try following the above steps closely. If you have a different approach that you're confident in–as long as it's not skipping exploration!–then go ahead and try that instead.

While not discussed in this chapter, you can also use some techniques like Wardley mapping, covered in the Refinement chapter, to support your exploration phase. Wardley mapping is a strategy tool designed within a different strategy tradition, and consequently categorizing it as either solely an exploration tool or solely a refinement tool ignores some of its potential uses. There's no perfect way to do strategy: take what works for you and use it.

Mine internal precedent

One of the most powerful forms of strategy is simply documenting how similar decisions have been made internally: often this is enough to steer how similar future decisions are made within your organization. This approach, documented in Staff Engineer's Write five, then synthesize, is also the most valuable step of exploration for those working in established companies.

If you are a tenured engineer within your organization, then it's somewhat safe to assume that you are aware of the typical internal approaches. Even then, it's worth poking around to see if there are any related skunkworks projects happening internally. This is doubly true if you've joined the organization recently, or are distant from the codebase itself. In that case, it's almost always worth poking around to see what already exists.
Sometimes the internal approach isn't ideal, but it's still superior because it's already been implemented and there's someone else maintaining it. In the long run, your strategy can ride along as someone else addresses the issues that aren't perfect fits.

Using your network

The exploration section of How should we control access to user data begins with:

Our experience is that best practices around managing internal access to user data are widely available through our networks, and otherwise hard to find. The exact rationale for this is hard to determine, but it seems possible that it's a topic that folks are generally uncomfortable discussing in public on account of potential future liability and compliance issues.

While there are many topics with significant public writing out there, my experience is that there are many topics where there's very little you can learn without talking directly to practitioners. This is especially true for security, compliance, operating at truly large scale, and competitive processes like optimizing advertising spend. Further, it's surprisingly common to find that how people publicly describe solving a problem and how they actually approach the problem are largely divorced.

This is why having a broad personal network is exceptionally powerful, and makes it possible to quickly understand the breadth of possible solutions. It also provides access to the practical downsides of various approaches, which are often omitted when talking to public proponents. In a recent strategy session, a proposal came up that seemed off to me, and I was able to text industry peers–and get answers to those texts–before the meeting ended, which invalidated the room's assumptions about what was and was not possible. A disagreement that might have taken weeks to resolve was instead resolved in a few minutes, and we were able to figure out next steps in that meeting rather than waiting a week for the next meeting after we'd realized our mistake.

Of course, it's also important to hold information from your network with skepticism. I've certainly had my network be wrong, and your network never knows how your current circumstances differ from theirs, so blindly accepting guidance from your network is never the right decision either.

If you're looking for more detailed coverage of building your network, this topic has also come up in Staff Engineer's chapter on Build a network of peers, and The Engineering Executive's Primer's chapter on Building your executive network. It feels silly to cover the same topic a third time, but it's a foundational technique for effective decision making.

Read widely; read narrowly

Reading has always been an important part of my strategy work. There are two distinct motions to this approach: read widely on an ongoing basis to broaden your thinking, and read narrowly on the specific topic you're working on.

Starting with reading widely, I make an effort each year to read ten to twenty industry-relevant works. These are not necessarily new releases, but they are new releases for me. Importantly, I try to read things that I don't know much about or that I initially disagree with. Some of my recent reads were Chip War, Building Green Software, Tidy First?, and How Big Things Get Done. From each of these books, I learned something, and over time they've built a series of bookmarks in my head about ideas that might apply to new problems.

On the other end of things is reading narrowly. When I recently started working on an AI agents strategy, the first thing I did was read through Chip Huyen's AI Engineering, which was an exceptionally helpful survey.
Similarly, when we started thinking about Uber's service migration, we read a number of industry papers, including Large-scale cluster management at Google with Borg and Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. None of these readings had all the answers to the problems I was working on, but they did an excellent job of helping me understand the range of options, as well as identifying other references to consult in my exploration.

I'll mention two nuances that will help a lot here. First, I highly encourage getting comfortable with skimming books. Even tightly edited books will have a lot of content that isn't particularly relevant to your current goals, and you should skip that content liberally. Second, what you read doesn't have to be books. It can be blog posts, essays, or interview transcripts, and certainly it can also be books. In this context, "reading" doesn't even have to actually be reading. There are conference talks that contain just as much as a blog post, and conferences that cover as much breadth as a book. There are also conference talks without a written equivalent, such as Dan Na's excellent Pushing Through Friction.

Each job is an education

Experience is frequently disregarded in the technology industry, and there are ways to misuse experience by too liberally copying solutions that worked in different circumstances, but the most effective, and the slowest, mechanism for exploring is continuing to work in the details of meaningful problems. You probably won't choose every job to optimize for learning, but the way each job lets you instantly explore more complex problems over time–recognizing that a bit of your data will have become stale each time–is uniquely valuable.

Save judgment for later

As I've mentioned several times, the point of exploration is to go broad with the goal of understanding approaches you might not have considered, and invalidating things you initially think are true. Both of those things are only possible if you save judgment for later: if you're passing judgment about whether approaches are "good" or "bad", then your exploration is probably going astray. As a soft rule, I'd argue that if no one involved in a strategy has changed their mind about something they believed when you started the exploration step, then you're not done exploring.

This is especially true when it comes to strategy work by senior leaders. Their beliefs are often well-justified by years of experience, but it's unclear which parts of their experience have become stale over time.

Summary

At this point, I hope you feel comfortable exploring as the first step of your strategy work, and understand the likely consequences of skipping this step. It's not an overstatement to say that every one of the worst strategic failures I've encountered would have been prevented by its primary author taking a few days to explore the space before anchoring on a particular approach. A few days of feeling slow are always a worthwhile price to avoid years of misguided effort.
At some point in a startup's lifecycle, it decides that it needs to be ready to go public in 18 months, and a flurry of IPO-readiness activity kicks off. This strategy focuses on a company working on IPO readiness, which has identified a gap in their internal controls for managing access to their users' data. It's a company that wants to meaningfully improve their security posture around user data access, but which has had a number of failed security initiatives over the years. Most of those initiatives have failed because they significantly degraded internal workflows for teams like customer support, such that the initial progress was reverted and subverted over time, to little long-term effect.

This strategy represents the Chief Information Security Officer's (CISO) attempt to acknowledge and overcome those historical challenges while meeting their IPO readiness obligations, and–most importantly–doing right by their users.

This is an exploratory, draft chapter for a book on engineering strategy that I'm brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this document

To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore, then Diagnose, and so on. Relative to the default structure, this document has been refactored in two ways to improve readability: first, Operation has been folded into Policy; second, Refine has been embedded in Diagnose. More detail on this structure in Making a readable Engineering Strategy document.

Policy & Operations

Our new policies, and the mechanisms to operate them, are:

- Controls for accessing user data must be significantly stronger prior to our IPO. Senior leadership, legal, compliance, and security have decided that we are not comfortable accepting the status quo of our user data access controls as a public company, and must meaningfully improve the quality of resource-level access controls as part of our pre-IPO readiness efforts. Our Security team is accountable for the exact mechanisms and approach to addressing this risk.
- We will continue to prioritize a hybrid solution to resource-access controls. This has been our approach thus far, and it is the fastest available option.
- Directly expose the log of our resource-level accesses to our users. We will build towards a user-accessible log of all company accesses of user data, and ensure we are comfortable explaining each and every access. In addition, it means that each rationale for access must be comprehensible and reasonable from a user perspective. This is important because it aligns our approach with our users' perspectives. They will be able to evaluate how we access their data, and make decisions about continuing to use our product based on whether they agree with our use.
- Good security discussions don't frame decisions as a compromise between security and usability. We will pursue multi-dimensional tradeoffs to simultaneously improve security and efficiency. Whenever we frame a discussion as trading off between security and utility, it's a sign that we are having the wrong discussion, and that we should rethink our approach.
- We will prioritize mechanisms that can both automatically authorize and automatically document the rationale for accesses to customer data.
The most obvious example of this is automatically granting access to a customer support agent for users who have an open support ticket assigned to that agent, and removing that access when that ticket is reassigned or resolved.

- Measure progress on the percentage of customer data access requests justified by a user-comprehensible, automated rationale. This will anchor our approach on simultaneously improving the security of user data and the usability of our colleagues' internal tools. If we only expand requirements for accessing customer data, we won't view this as progress, because it's not automated (and consequently is likely to encourage workarounds as teams try to solve problems quickly). Similarly, if we only improve usability, the charts won't show this as progress, because we won't have increased the number of supported requests. As part of this effort, we will create a private channel where the security and compliance team has visibility into all manual rationales for user-data access, and will directly message the manager of any individual who relies on a manual justification for accessing user data.
- Expire unused roles to move towards the principle of least privilege. Today we have a number of roles granted in our role-based access control (RBAC) system to users who do not use the granted permissions. To address that issue, we will automatically remove roles from colleagues after 90 days of not using the role's permissions. Engineers in an active on-call rotation are the exception to this automated permission pruning.
- Weekly reviews until we see progress; monthly access reviews in perpetuity. Starting now, there will be a weekly sync between the security engineering team, teams working on customer data access initiatives, and the CISO. This meeting will focus on rapid iteration and problem solving. This is explicitly a forum for ongoing strategy testing, with the CISO serving as the meeting's sponsor, and their Principal Security Engineer serving as the meeting's guide. It will continue until we have clarity on the path to 100% coverage of user-comprehensible, automated rationales for access to customer data. Separately, we are also starting a monthly review of sampled accesses to customer data to ensure the proper usage and function of the rationale-creation mechanisms we build. This meeting's goal is to review access rationales for quality and appropriateness, both by reviewing sampled rationales in the short term, and by identifying more automated mechanisms for flagging high-risk accesses to review in the future.
- Exceptions must be granted in writing by the CISO. While our overarching Engineering Strategy states that we follow an advisory architecture process as described in Facilitating Software Architecture, the customer data access policy is an exception and must be explicitly approved, with documentation, by the CISO. Start that process in the #ciso channel.

Diagnose

We have a strong baseline of role-based access controls (RBAC) and audit logging. However, we have limited mechanisms for ensuring assigned roles follow the principle of least privilege. This is particularly true in cases where individuals change teams or roles over the course of their tenure at the company: some individuals have collected numerous unused roles over five-plus years at the company. Similarly, our audit logs are durable and pervasive, but we have limited proactive mechanisms for identifying anomalous usage.
Diagnose

We have a strong baseline of role-based access controls (RBAC) and audit logging. However, we have limited mechanisms for ensuring assigned roles follow the principle of least privilege. This is particularly true in cases where individuals change teams or roles over the course of their tenure at the company: some individuals have collected numerous unused roles over five-plus years at the company. Similarly, our audit logs are durable and pervasive, but we have limited proactive mechanisms for identifying anomalous usage. Instead, they are typically used to understand what occurred after an incident is identified by other mechanisms.

For resource-level access controls, we rely on a hybrid approach between a 3rd-party platform for incoming user requests, and approval mechanisms within our own product. Providing a rationale for access across these two systems requires manual work, and those rationales are later manually reviewed for appropriateness in a batch fashion. There are two major ongoing problems with our current approach to resource-level access controls. First, the teams making requests view them as a burdensome obligation with little benefit to themselves or to the user. Second, because the rationale review steps are manual, there is no verifiable evidence of the quality of the review.

We’ve found no evidence of misuse of user data. When colleagues do access user data, we have uniformly and consistently found that there is a clear and reasonable rationale for that access: for example, a ticket in the user support system where the user has raised an issue. However, the quality of our documented rationales is consistently low because it depends on busy people manually copying over significant information many times a day. Because the rationales are of low quality, the verification of these rationales is somewhat arbitrary. From a literal compliance perspective, we do provide rationales and auditing of these rationales, but it’s unclear whether the majority of these audits increase the security of our users’ data.

Historically, we’ve made significant security investments that caused temporary spikes in our security posture. However, looking at those initiatives a year later, in many cases we see a pattern of increased scrutiny, followed by a gradual repeal or avoidance of the new mechanisms. We have found that most of them involved increased friction for essential work performed by other internal teams. In the natural course of performing work, those teams would subtly subvert the improvements because they interfered with their immediate goals (e.g. supporting customer requests). As such, we have high conviction from our track record that our historical approach can create optical wins internally. We have limited conviction that it can create long-term improvements outside of significant, unlikely internal changes (e.g. colleagues being markedly less busy a year from now than they are today). It seems likely we need a new approach to meaningfully shift our stance on these kinds of problems.

Explore

Our experience is that best practices around managing internal access to user data are available through our networks, but otherwise hard to find. The exact reason for this is hard to determine, but it seems possible that it’s a topic that folks are generally uncomfortable discussing in public on account of potential future liability and compliance issues. In our exploration, we found two standardized dimensions (role-based access controls, audit logs), and one highly divergent dimension (resource-specific access controls):

Role-based access controls (RBAC) are a highly standardized approach at this point. The core premise is that users are mapped to one or more roles, and each role is granted a certain set of permissions. For example, a role representing the customer support agent might be granted permission to deactivate an account, whereas a role representing the sales engineer might be able to configure a new account.

Audit logs are similarly standardized.
All access and mutation of resources should be tied in a durable log to the human who performed the action. These logs should be accumulated in a centralized, queryable solution. One of the core challenges is determining how to utilize these logs proactively to detect issues, rather than reactively once an issue has already been flagged.

Resource-level access controls are significantly less standardized than RBAC or audit logs. We found three distinct patterns adopted by companies, with little consistency across companies on which is adopted. Those three patterns for resource-level access control were:

3rd-party enrichment, where access to resources is managed in a 3rd-party system such as Zendesk. This requires enriching objects within those systems with data and metadata from the product(s) where those objects live. It also requires implementing actions on the platform, such as archiving or configuration, allowing them to live entirely in that platform’s permission structure. The downsides of this approach are tight coupling with the platform vendor, any limitations inherent to that platform, and the overhead of maintaining engineering teams familiar with both your internal technology stack and the platform vendor’s technology stack.

1st-party tool implementation, where all activity, including creation and management of user issues, is managed within the core product itself. This pattern is most common in earlier stage companies or companies whose customer support leadership “grew up” within the organization without much exposure to the approach taken by peer companies. The advantage of this approach is that there is a single, tightly integrated and infinitely extensible platform for managing interactions. The downside is that you have to build and maintain all of that work internally rather than pushing it to a vendor that ought to be able to invest more heavily into their tooling.

Hybrid solutions, where a 3rd-party platform is used for most actions, and is further used to permit resource-level access within the 1st-party system. For example, you might be able to access a user’s data only while there is an open ticket created by that user, and assigned to you, in the 3rd-party platform (a sketch of this check follows below). The advantage of this approach is that it supports complex workflows that don’t fit within the platform’s limitations, while avoiding complex coupling between your product and the vendor platform.

Generally, our experience is that all companies implement RBAC, audit logs, and one of the resource-level access control mechanisms. Most companies pursue either 3rd-party enrichment with a sizable, long-standing team owning the platform implementation, or rely on a hybrid solution where they are able to avoid a long-standing dedicated team by lumping that work into existing teams.
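To tie the Explore findings back to the policy above, here is a minimal sketch of what the hybrid, ticket-scoped access check could look like when it also produces an automated, user-comprehensible rationale. The ticket fields and function names are illustrative assumptions, not our vendor’s actual API.

from dataclasses import dataclass

@dataclass
class SupportTicket:
    ticket_id: str
    user_id: str
    assignee: str
    status: str  # e.g. "open", "resolved"

def user_data_access_decision(
    agent: str, user_id: str, tickets: list[SupportTicket]
) -> tuple[bool, str]:
    """
    Grant access only while the agent has an open ticket from this user,
    and produce a user-comprehensible rationale for the audit log.
    """
    for ticket in tickets:
        if (
            ticket.user_id == user_id
            and ticket.assignee == agent
            and ticket.status == "open"
        ):
            rationale = (
                f"Support agent {agent} accessed this account while working on "
                f"open support request {ticket.ticket_id}."
            )
            return True, rationale
    return False, "No open support ticket assigned to this agent; manual justification required."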
More in programming
This is a re-publishing of a blog post I originally wrote for work, but wanted on my own blog as well.

AI is everywhere, and its impressive claims are leading to rapid adoption. At this stage, I’d qualify it as charismatic technology—something that under-delivers on what it promises, but promises so much that the industry still leverages it because we believe it will eventually deliver on these claims. This is a known pattern. In this post, I’ll use the example of automation deployments to go over known patterns and risks in order to provide you with a list of questions to ask about potential AI solutions. I’ll first cover a short list of base assumptions, and then borrow from scholars of cognitive systems engineering and resilience engineering to list said criteria. At the core of it is the idea that when we say we want humans in the loop, it really matters where in the loop they are.

My base assumptions

The first thing I’m going to say is that we currently do not have Artificial General Intelligence (AGI). I don’t care whether we have it in 2 years or 40 years or never; if I’m looking to deploy a tool (or an agent) that is supposed to do stuff to my production environments, it has to be able to do it now. I am not looking to be impressed, I am looking to make my life and the system better.

Another mechanism I want you to keep in mind is something called the context gap. In a nutshell, any model or automation is constructed from a narrow definition of a controlled environment, which can expand as it gains autonomy, but remains limited. By comparison, people in a system start from a broad situation and narrow definitions down and add constraints to make problem-solving tractable. One side starts from a narrow context, and one starts from a wide one—so in practice, with humans and machines, you end up seeing a type of teamwork where one constantly updates the other:

The optimal solution of a model is not an optimal solution of a problem unless the model is a perfect representation of the problem, which it never is. — Ackoff (1979, p. 97)

Because of that mindset, I will disregard all arguments of “it’s coming soon” and “it’s getting better real fast” and instead frame what current LLM solutions are shaped like: tools and automation. As it turns out, there are lots of studies about ergonomics, tool design, collaborative design, where semi-autonomous components fit into sociotechnical systems, and how they tend to fail.

Additionally, I’ll borrow from the framing used by people who study joint cognitive systems: rather than looking only at the abilities of what a single person or tool can do, we’re going to look at the overall performance of the joint system. This is important because if you have a tool that is built to be operated like an autonomous agent, you can get weird results in your integration. You’re essentially building an interface for the wrong kind of component—like using a joystick to ride a bicycle. This lens will assist us in establishing general criteria about where the problems will likely be without having to test for every single one and evaluate them on benchmarks against each other.

Questions you'll want to ask

The following list of questions is meant to act as reminders—abstracting away all the theory from research papers you’d need to read—to let you think through some of the important stuff your teams should track, whether they are engineers using code generation, SREs using AIOps, or managers and execs making the call to adopt new tooling.
Are you better even after the tool is taken away?

An interesting warning comes from studying how LLMs function as learning aides. The researchers found that people who trained using LLMs tended to fail tests more when the LLMs were taken away compared to people who never studied with them, except if the prompts were specifically (and successfully) designed to help people learn. Likewise, it’s been known for decades that when automation handles standard challenges, the operators expected to take over when it reaches its limits end up worse off and generally require more training to keep the overall system performant. While people can feel like they’re getting better and more productive with tool assistance, it doesn’t necessarily follow that they are learning or improving. Over time, there’s a serious risk that your overall system’s performance will be limited to what the automation can do—because without proper design, people keeping the automation in check will gradually lose the skills they had developed prior.

Are you augmenting the person or the computer?

Traditionally successful tools tend to work on the principle that they improve the physical or mental abilities of their operator: search tools let you go through more data than you could on your own and shift demands to external memory, a bicycle more effectively transmits force for locomotion, a blind spot alert on your car can extend your ability to pay attention to your surroundings, and so on. Automation that augments users therefore tends to be easier to direct, and sort of extends the person’s abilities, rather than acting based on preset goals and framing. Automation that augments a machine tends to broaden the device’s scope and control by leveraging some known effects of their environment and successfully hiding them away. For software folks, an autoscaling controller is a good example of the latter.

Neither is fundamentally better nor worse than the other—but you should figure out what kind of automation you’re getting, because they fail differently. Augmenting the user implies that they can tackle a broader variety of challenges effectively. Augmenting the computer tends to mean that when the component reaches its limits, the challenges become worse for the operator.

Is it turning you into a monitor rather than helping build an understanding?

If your job is to look at the tool go and then say whether it was doing a good or bad job (and maybe take over if it does a bad job), you’re going to have problems. It has long been known that people adapt to their tools, and automation can create complacency. Self-driving cars that generally drive themselves well but still require a monitor are not effectively monitored. Instead, having AI that supports people or adds perspectives to the work an operator is already doing tends to yield better long-term results than patterns where the human learns to mostly delegate and focus elsewhere.

(As a side note, this is why I tend to dislike incident summarizers. Don’t make it so people stop trying to piece together what happened! Instead, I prefer seeing tools that look at your summaries to remind you of items you may have forgotten, or that look for linguistic cues that point to biases or reductive points of view.)

Does it pigeonhole what you can look at?

When evaluating a tool, you should ask questions about where the automation lands:

Does it let you look at the world more effectively?
Does it tell you where to look in the world?
Does it force you to look somewhere specific?
Does it tell you to do something specific?
Does it force you to do something?

This is a bit of a hybrid between “Does it extend you?” and “Is it turning you into a monitor?” The five questions above let you figure that out. As the tool becomes a source of assertions or constraints (rather than a source of information and options), the operator becomes someone who interacts with the world from inside the tool rather than someone who interacts with the world with the tool’s help. The tool stops being a tool and becomes a representation of the whole system, which means whatever limitations and internal constraints it has are then transmitted to your users.

Is it a built-in distraction?

People tend to do multiple tasks over many contexts. Some automated systems are built with alarms or alerts that require stealing someone’s focus, and unless they truly are the most critical thing their users could give attention to, they are going to be an annoyance that can lower the effectiveness of the overall system.

What perspectives does it bake in?

Tools tend to embody a given perspective. For example, AIOps tools that are built to find a root cause will likely carry the conceptual framework behind root causes in their design. More subtly, these perspectives are sometimes hidden in the type of data you get: if your AIOps agent can only see alerts, your telemetry data, and maybe your code, it will rarely be a source of suggestions on how to improve your workflows because that isn’t part of its world. In roles that are inherently about pulling context from many disconnected sources, how on earth is automation going to make the right decisions? And moreover, who’s accountable when it makes a poor decision on incomplete data? Surely not the buyer who installed it!

This is also one of the many ways in which automation can reinforce biases—not just based on what is in its training data, but also based on its own structure and what inputs were considered most important at design time. The tool can itself become a keyhole through which your conclusions are guided.

Is it going to become a hero?

A common trope in incident response is heroes—the few people who know everything inside and out, and who end up being necessary bottlenecks to all emergencies. They can’t go away for vacation, they’re too busy to train others, they develop blind spots that nobody can fix, and they can’t be replaced. To avoid this, you have to maintain a continuous awareness of who knows what, and cross-train each other to always have enough redundancy.

If you have a team of multiple engineers and you add AI to it, having it do all of the tasks of a specific kind means it becomes a de facto hero to your team. If that’s okay, be aware that any outages or dysfunction in the AI agent would likely have no practical workaround. You will essentially have offshored part of your ops.

Do you need it to be perfect?

What a thing promises to be is never what it is—otherwise AWS would be enough, and Kubernetes would be enough, and JIRA would be enough, and the software would work fine with no one needing to fix things. That just doesn’t happen. Ever. Even if it’s really, really good, it’s gonna have outages and surprises, and it’ll mess up here and there, no matter what it is. We aren’t building an omnipotent computer god, we’re building imperfect software. You’ll want to seriously consider whether the tradeoffs you’d make in terms of quality and cost are worth it, and this is going to be decided on a case-by-case basis.
Just be careful not to fix the problem by adding a human in the loop that acts as a monitor!

Is it doing the whole job or a fraction of it?

We don’t notice major parts of our own jobs because they feel natural. A classic pattern here is one of AIs getting better at diagnosing patients, except the benchmarks are usually run on a patient chart where most of the relevant observations have already been made by someone else. Similarly, we often see AI pass a test with flying colors while it still can’t be productive at the job the test represents. People in general have adopted a model of cognition based on information processing that’s very similar to how computers work (get data in, think, output stuff, rinse and repeat), but for decades, there have been multiple disciplines that looked harder at situated work and cognition, moving past that model. Key patterns of cognition are not just in the mind, but are also embedded in the environment and in the interactions we have with each other. Be wary of acquiring a solution that solves what you think the problem is rather than what it actually is. We routinely show we don’t accurately know the latter.

What if we have more than one?

You probably know how straightforward it can be to write a toy project on your own, with full control of every refactor. You probably also know how this stops being true as your team grows. As it stands today, a lot of AI agents are built within a snapshot of the current world: one or few AI tools added to teams that are mostly made up of people. By analogy, this would be like everyone selling you a computer assuming it were the first and only electronic device inside your household. Problems arise when you go beyond these assumptions: maybe AI that writes code has to go through a code review process, but what if that code review is done by another unrelated AI agent? What happens when you get to operations and common-mode failures impact components from various teams that all have agents empowered to go fix things to the best of their ability with the available data? Are they going to clash with people, or even with each other?

Humans also have that ability and tend to solve it via processes and procedures, explicit coordination, announcing what they’ll do before they do it, and calling upon each other when they need help. Will multiple agents require something equivalent, and if so, do you have it in place?

How do they cope with limited context?

Some changes that cause issues might be safe to roll back, some not (maybe they include database migrations, maybe it is better to be down than to corrupt data), and some may contain changes that rolling back wouldn’t fix (maybe the workload is controlled by one or more feature flags). Knowing what to do in these situations can sometimes be understood from code or release notes, but some situations can require different workflows involving broader parts of the organization. A risk of automation without context is that if you have situations where waiting or doing little is the best option, then you’ll need either automation that requires input to act, or a set of actions to disable multiple types of automation as quickly as possible.

Many of these may exist at the same time, and it becomes the operators’ job not only to maintain their own context, but also to maintain a mental model of the context each of these pieces of automation has access to. The fancier your agents, the fancier your operators’ understanding and abilities must be to properly orchestrate them.
The more surprising your landscape is, the harder it can become to manage with semi-autonomous elements roaming around.

After an outage or incident, who does the learning and who does the fixing?

One way to track accountability in a system is to figure out who ends up having to learn lessons and change how things are done. It’s not always the same people or teams, and generally, learning will happen whether you want it or not. This is more of a rhetorical question right now, because I expect that in most cases, when things go wrong, whoever is expected to monitor the AI tool is going to have to steer it in a better direction and fix it (if they can); if it can’t be fixed, then the expectation will be that the automation, as a tool, will be used more judiciously in the future. In a nutshell, if the expectation is that your engineers are going to be doing the learning and tweaking, your AI isn’t an independent agent—it’s a tool that cosplays as an independent agent.

Do what you will—just be mindful

All in all, none of the above questions flat out say you should not use AI, nor where exactly in the loop you should put people. The key point is that you should ask that question and be aware that just adding whatever to your system is not going to substitute workers away. It will, instead, transform work and create new patterns and weaknesses. Some of these patterns are known and well-studied. We don’t have to go rushing to rediscover them all through failures as if we were the first to ever automate something.

If AI ever gets so good and so smart that it’s better than all your engineers, it won’t make a difference whether you only adopt it once it’s good. In the meanwhile, these things do matter and have real impacts, so please design your systems responsibly.

If you’re interested to know more about the theoretical elements underpinning this post, the following references—on top of whatever was already linked in the text—might be of interest:

Books:
- Joint Cognitive Systems: Foundations of Cognitive Systems Engineering by Erik Hollnagel
- Joint Cognitive Systems: Patterns in Cognitive Systems Engineering by David D. Woods
- Cognition in the Wild by Edwin Hutchins
- Behind Human Error by David D. Woods, Sidney Dekker, Richard Cook, Leila Johannesen, Nadine Sarter

Papers:
- Ironies of Automation by Lisanne Bainbridge
- The French-Speaking Ergonomists’ Approach to Work Activity by Daniellou
- How in the World Did We Ever Get into That Mode? Mode Error and Awareness in Supervisory Control by Nadine Sarter
- Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis by David D. Woods
- Ten Challenges for Making Automation a “Team Player” in Joint Human-Agent Activity by Gary Klein and David D. Woods
- MABA-MABA or Abracadabra? Progress on Human–Automation Co-ordination by Sidney Dekker
- Managing the Hidden Costs of Coordination by Laura Maguire
- Designing for Expertise by David D. Woods
- The Impact of Generative AI on Critical Thinking by Lee et al.
AMD is sending us the two MI300X boxes we asked for. They are in the mail. It took a bit, but AMD passed my cultural test. I now believe they aren’t going to shoot themselves in the foot on software, and if that’s true, there’s absolutely no reason they should be worth 1/16th of NVIDIA. CUDA isn’t really the moat people think it is, it was just an early ecosystem. tiny corp has a fully sovereign AMD stack, and soon we’ll port it to the MI300X. You won’t even have to use tinygrad proper, tinygrad has a torch frontend now. Either NVIDIA is super overvalued or AMD is undervalued. If the petaflop gets commoditized (tiny corp’s mission), the current situation doesn’t make any sense. The hardware is similar, AMD even got the double throughput Tensor Cores on RDNA4 (NVIDIA artificially halves this on their cards, soon they won’t be able to). I’m betting on AMD being undervalued, and that the demand for AI has barely started. With good software, the MI300X should outperform the H100. In for a quarter million. Long term. It can always dip short term, but check back in 5 years.
Earlier this week, I wrote about Guile and Whippet, and noted that I didn’t yet know how to handle untagged allocations. But now I do! Today’s note is about how we can support untagged allocations of a few different kinds in Whippet’s mostly-marking collector.

Why bother supporting untagged allocations at all? Well, if I had my way, I wouldn’t; I would just slog through Guile and fix all uses to be tagged. There are only a finite number of use sites and I could get to them all in a month or so. The problem comes for uses of scm_gc_malloc from outside libguile itself, in C extensions and embedding programs. These users are loathe to adapt to any kind of change, and garbage-collection-related changes are the worst. So, somehow, we need to support these users if we are not to break the Guile community.

The problem with scm_gc_malloc, though, is that it is missing an expression of intent, notably as regards tagging. You can use it to allocate an object that has a tag and thus can be traced precisely, or you can use it to allocate, well, anything else. I think we will have to add an API for the tagged case and assume that anything that goes through scm_gc_malloc is requesting an untagged, conservatively-scanned block of memory. Similarly for scm_gc_malloc_pointerless: you could be allocating a tagged object that happens to not contain pointers, or you could be allocating an untagged array of whatever. A new API is needed there too for pointerless untagged allocations.

Recall that the mostly-marking collector can be built in a number of different ways: it can support conservative and/or precise roots, it can trace the heap precisely or conservatively, it can be generational or not, and the collector can use multiple threads during pauses or not. Consider a basic configuration with precise roots. You can make tagged pointerless allocations just fine: the trace function for that tag is just trivial. You would like to extend the collector with the ability to make untagged pointerless allocations, for raw data. How to do this?

Consider first that when the collector goes to trace an object, it can’t use bits inside the object to discriminate between the tagged and untagged cases. Fortunately, though, the main space of the mostly-marking collector has one metadata byte for each 16 bytes of payload. Of those 8 bits, 3 are used for the mark (five different states, allowing for future concurrent tracing), two for the precise field-logging write barrier, one to indicate whether the object is pinned or not, and one to indicate the end of the object, so that we can determine object bounds just by scanning the metadata byte array. That leaves 1 bit, and we can use it to indicate untagged pointerless allocations. Hooray!

However there is a wrinkle: when Whippet decides that it should evacuate an object, it tracks the evacuation state in the object itself; the embedder has to provide an implementation of a little state machine, allowing the collector to detect whether an object is forwarded or not, to claim an object for forwarding, to commit a forwarding pointer, and so on. We can’t do that for raw data, because all bit states belong to the object, not the collector or the embedder. So, we have to set the “pinned” bit on the object, indicating that these objects can’t move. We could in theory manage the forwarding state in the metadata byte, but we don’t have the bits to do that currently; maybe some day. For now, untagged pointerless allocations are pinned.
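To make that byte layout easier to picture, here is a toy sketch in Python. The bit positions are assumptions chosen for illustration; they are not Whippet’s actual encoding.

# Toy model of one metadata byte per 16 bytes of payload.
# Bit positions are illustrative assumptions, not Whippet's real encoding.
MARK_MASK        = 0b0000_0111  # 3 bits: mark state (five states in use)
LOG_MASK         = 0b0001_1000  # 2 bits: precise field-logging write barrier
PINNED_BIT       = 0b0010_0000  # 1 bit: object may not be moved
END_BIT          = 0b0100_0000  # 1 bit: marks the end of the object
UNTAGGED_RAW_BIT = 0b1000_0000  # the remaining bit: untagged, pointerless data

def metadata_for_untagged_pointerless() -> int:
    # Raw data can't host the forwarding state machine, so pin it as well.
    return UNTAGGED_RAW_BIT | PINNED_BIT

def is_untagged_pointerless(meta: int) -> bool:
    return bool(meta & UNTAGGED_RAW_BIT)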
You might also want to support untagged allocations that contain pointers to other GC-managed objects. In this case you would want these untagged allocations to be scanned conservatively. We can do this, but if we do, it will pin all objects.

Thing is, conservative stack roots are a kind of sweet spot in language run-time design. You get to avoid constraining your compiler, you avoid a class of bugs related to rooting, but you can still support compaction of the heap. How is this, you ask? Well, consider that you can move any object for which we can precisely enumerate the incoming references. This is trivially the case for precise roots and precise tracing. For conservative roots, we don’t know whether a given edge is really an object reference or not, so we have to conservatively avoid moving those objects. But once you are done tracing conservative edges, any live object that hasn’t yet been traced is fair game for evacuation, because none of its predecessors have yet been visited. But once you add conservatively-traced objects back into the mix, you don’t know when you are done tracing conservative edges; you could always discover another conservatively-traced object later in the trace, so you have to pin everything.

The good news, though, is that we have gained an easier migration path. I can now shove Whippet into Guile and get it running even before I have removed untagged allocations. Once I have done so, I will be able to allow for compaction / evacuation; things only get better from here. Also as a side benefit, the mostly-marking collector’s heap-conservative configurations are now faster, because we have metadata attached to objects which allows tracing to skip known-pointerless objects. This regains an optimization that BDW has long had via its GC_malloc_atomic, used in Guile since time out of mind.

With support for untagged allocations, I think I am finally ready to start getting Whippet into Guile itself. Happy hacking, and see you on the other side!
I’ve been working on a project where I need to plot points on a map. I don’t need an interactive or dynamic visualisation – just a static map with coloured dots for each coordinate. I’ve created maps on the web using Leaflet.js, which load map data from OpenStreetMap (OSM) and support zooming and panning – but for this project, I want a standalone image rather than something I embed in a web page. I want to put in coordinates, and get a PNG image back.

This feels like it should be straightforward. There are lots of Python libraries for data visualisation, but it’s not an area I’ve ever explored in detail. I don’t know how to use these libraries, and despite trying I couldn’t work out how to accomplish this seemingly simple task. I made several attempts with libraries like matplotlib and plotly, but I felt like I was fighting the tools. Rather than persist, I wrote my own solution with “lower level” tools.

The key was a page on the OpenStreetMap wiki explaining how to convert lat/lon coordinates into the pixel system used by OSM tiles. In particular, it allowed me to break the process into two steps:

Get a “base map” image that covers the entire world
Convert lat/lon coordinates into xy coordinates that can be overlaid on this image

Let’s go through those steps.

Get a “base map” image that covers the entire world

Let’s talk about how OpenStreetMap works, and in particular their image tiles. If you start at the most zoomed-out level, OSM represents the entire world with a single 256×256 pixel square. This is the Web Mercator projection, and you don’t get much detail – just a rough outline of the world. We can zoom in, and this tile splits into four new tiles of the same size. There are twice as many pixels along each edge, and each tile has more detail. Notice that country boundaries are visible now, but we can’t see any names yet. We can zoom in even further, and each of these tiles splits again. There still aren’t any text labels, but the map is getting more detailed and we can see small features that weren’t visible before.

You get the idea – we could keep zooming, and we’d get more and more tiles, each with more detail. This tile system means you can get detailed information for a specific area, without loading the entire world. For example, if I’m looking at street information in Britain, I only need the detailed tiles for that part of the world. I don’t need the detailed tiles for Bolivia at the same time.

OpenStreetMap will only give you 256×256 pixels at a time, but we can download every tile and stitch them together, one-by-one. Here’s a Python script that enumerates all the tiles at a particular zoom level, downloads them, and uses the Pillow library to combine them into a single large image:

#!/usr/bin/env python3
"""
Download all the map tiles for a particular zoom level from OpenStreetMap,
and stitch them into a single image.
"""

import io
import itertools

import httpx
from PIL import Image

zoom_level = 2

width = 256 * 2**zoom_level
height = 256 * (2**zoom_level)

im = Image.new("RGB", (width, height))

for x, y in itertools.product(range(2**zoom_level), range(2**zoom_level)):
    resp = httpx.get(
        f"https://tile.openstreetmap.org/{zoom_level}/{x}/{y}.png",
        timeout=50,
    )
    resp.raise_for_status()

    im_buffer = Image.open(io.BytesIO(resp.content))
    im.paste(im_buffer, (x * 256, y * 256))

out_path = f"map_{zoom_level}.png"
im.save(out_path)
print(out_path)

The higher the zoom level, the more tiles you need to download, and the larger the final image will be.
I ran this script up to zoom level 6, and this is the data involved:

Zoom level | Number of tiles | Pixels      | File size
0          | 1               | 256×256     | 17.1 kB
1          | 4               | 512×512     | 56.3 kB
2          | 16              | 1024×1024   | 155.2 kB
3          | 64              | 2048×2048   | 506.4 kB
4          | 256             | 4096×4096   | 2.7 MB
5          | 1,024           | 8192×8192   | 13.9 MB
6          | 4,096           | 16384×16384 | 46.1 MB

I can just about open that zoom level 6 image on my computer, but it’s struggling. I didn’t try opening zoom level 7 – that includes 16,384 tiles, and I’d probably run out of memory. For most static images, zoom level 3 or 4 should be sufficient – I ended up using a base map from zoom level 4 for my project. It takes a minute or so to download all the tiles from OpenStreetMap, but you only need to download them once, and then you have a static image you can use again and again.

This is a particularly good approach if you want to draw a lot of maps. OpenStreetMap is provided for free, and we want to be a respectful user of the service. Downloading all the map tiles once is more efficient than making repeated requests for the same data.

Overlay lat/lon coordinates on this base map

Now that we have an image with a map of the whole world, we need to overlay our lat/lon coordinates as points on this map. I found instructions on the OpenStreetMap wiki which explain how to convert GPS coordinates into a position on the unit square, which we can in turn add to our map. They outline a straightforward algorithm, which I implemented in Python:

import math


def convert_gps_coordinates_to_unit_xy(
    *, latitude: float, longitude: float
) -> tuple[float, float]:
    """
    Convert GPS coordinates to positions on the unit square, which
    can be plotted on a Web Mercator projection of the world.

    This expects the coordinates to be specified in **degrees**.

    The result will be (x, y) coordinates:

    -   x will fall in the range (0, 1).
        x=0 is the left (180° west) edge of the map.
        x=1 is the right (180° east) edge of the map.
        x=0.5 is the middle, the prime meridian.

    -   y will fall in the range (0, 1).
        y=0 is the top (north) edge of the map, at 85.0511 °N.
        y=1 is the bottom (south) edge of the map, at 85.0511 °S.
        y=0.5 is the middle, the equator.
    """
    # This is based on instructions from the OpenStreetMap Wiki:
    # https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames#Example:_Convert_a_GPS_coordinate_to_a_pixel_position_in_a_Web_Mercator_tile
    # (Retrieved 16 January 2025)

    # Convert the coordinate to the Web Mercator projection
    # (https://epsg.io/3857)
    #
    #     x = longitude
    #     y = arsinh(tan(latitude))
    #
    x_webm = longitude
    y_webm = math.asinh(math.tan(math.radians(latitude)))

    # Transform the projected point onto the unit square
    #
    #     x = 0.5 + x / 360
    #     y = 0.5 - y / 2π
    #
    x_unit = 0.5 + x_webm / 360
    y_unit = 0.5 - y_webm / (2 * math.pi)

    return x_unit, y_unit

Their documentation includes a worked example using the coordinates of the Hachiko Statue. We can run our code, and check we get the same results:

>>> convert_gps_coordinates_to_unit_xy(latitude=35.6590699, longitude=139.7006793)
(0.8880574425, 0.39385379958274735)

Most users of OpenStreetMap tiles will use these unit positions to select the tiles they need, and then download those images – but we can also position these points directly on the global map. I wrote some more Pillow code that converts GPS coordinates to these unit positions, scales those unit positions to the size of the entire map, then draws a coloured circle at each point on the map.
Here’s the code:

from PIL import Image, ImageDraw

gps_coordinates = [
    # Hachiko Memorial Statue in Tokyo
    {"latitude": 35.6590699, "longitude": 139.7006793},
    # Greyfriars Bobby in Edinburgh
    {"latitude": 55.9469224, "longitude": -3.1913043},
    # Fido Statue in Tuscany
    {"latitude": 43.955101, "longitude": 11.388186},
]

im = Image.open("base_map.png")
draw = ImageDraw.Draw(im)

for coord in gps_coordinates:
    x, y = convert_gps_coordinates_to_unit_xy(**coord)

    radius = 32
    draw.ellipse(
        [
            x * im.width - radius,
            y * im.height - radius,
            x * im.width + radius,
            y * im.height + radius,
        ],
        fill="red",
    )

im.save("map_with_dots.png")

and here’s the map it produces:

The nice thing about writing this code in Pillow is that it’s a library I already know how to use, and so I can customise it if I need to. I can change the shape and colour of the points, crop to specific regions (see the sketch at the end of this post), or add text to the image. I’m sure more sophisticated data visualisation libraries can do all this, and more – but I wouldn’t know how. The downside is that if I need more advanced features, I’ll have to write them myself. I’m okay with that – trading sophistication for simplicity. I didn’t need to learn a complex visualization library – I was able to write code I can read and understand. In a world full of AI-generated code, writing something I know I understand feels more important than ever.
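As a sketch of the “crop to specific regions” customisation mentioned above, here is one way it could look. It assumes the convert_gps_coordinates_to_unit_xy function and the map_with_dots.png file from earlier, and the bounding-box values are illustrative:

from PIL import Image

# Sketch only: crop the rendered world map to a lat/lon bounding box.
# Assumes convert_gps_coordinates_to_unit_xy (defined earlier) is in scope.
im = Image.open("map_with_dots.png")

# Rough bounding box around the UK (illustrative values).
left, top = convert_gps_coordinates_to_unit_xy(latitude=61.0, longitude=-11.0)
right, bottom = convert_gps_coordinates_to_unit_xy(latitude=49.5, longitude=2.5)

cropped = im.crop((
    int(left * im.width),
    int(top * im.height),
    int(right * im.width),
    int(bottom * im.height),
))
cropped.save("map_uk.png")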
This website has a new section: blogroll.opml! A blogroll is a list of blogs - a lightweight way of people recommending other people’s writing on the indieweb. What it includes The blogs that I included are just sampled from my many RSS subscriptions that I keep in my Feedbin reader. I’m subscribed to about 200 RSS feeds, the majority of which are dead or only publish once a year. I like that about blogs, that there’s no expectation of getting a post out every single day, like there is in more algorithmically-driven media. If someone who I interacted with on the internet years ago decides to restart their writing, that’s great! There’s no reason to prune all the quiet feeds. The picks are oriented toward what I’m into: niches, blogs that have a loose topic but don’t try to be general-interest, people with distinctive writing. If you import all of the feeds into your RSS reader, you’ll probably end up unsubscribing from some of them because some of the experimental electric guitar design or bonsai news is not what you’re into. Seems fine, or you’ll discover a new interest! How it works Ruben Schade figured out a brilliant way to show blogrolls and I copied him. Check out his post on styling OPML and RSS with XSLT to XHTML for how it works. My only additions to that scheme were making the blogroll page blend into the rest of the website by using an include tag with Jekyll to add the basic site skeleton, and adding a link with the download attribute to provide a simple way to download the OPML file. Oddly, if you try to save the OPML page using Save as… in Firefox, Firefox will save the transformed output via the XSLT, rather than the raw source code. XSLT is such an odd and rare part of the web ecosystem, I had to use it.