More from Ferd.ca
I like to think that I write code deliberately. I’m an admittedly slow developer, and I want to believe I do so on purpose. I want to know as much as I can about the context of what it is that I'm automating. I also use a limited set of tools. I used old computers for a long time, both out of an environmental mindset, but also because a slower computer quickly makes it obvious when something scales poorly.1 The idea is to seek friction, and harness it as an early signal that whatever I’m doing may need to be tweaked, readjusted. I find this friction, and even frustration in general to also be useful around learning approaches.2 In opposition to the way I'd like to do things, everything about the tech industry is oriented towards elevated productivity, accelerated growth, and "easy" solutions to whole families of problems. I feel that maybe we should teach people to program the way they teach martial arts, like only in the most desperate situations when all else failed should you resort to automating something. I don’t quite know if I’m just old and grumpy, seeing industry trends fly by me at a pace I don’t follow, or whether there’s really something to it, but I thought I’d take a walk through a set of ideas and concepts that motivate my stance. This blog post has a lot of ground to cover. I'll first start with some fundamental properties of systems and how overload propagates through various bottlenecks. Then I'll go over some high-level pressures that are shared by most organizations and force trade-offs down their structure. These two aspects—load propagation and pervasive trade-offs—create the need for compensatory actions, of which we'll discuss some limits. This, finally, will be tied back to friction and ways to listen to it, because it's one of the things that underpins adaptation and keeps systems running. 
Optimizing While Leaving Pressures in Place

Optimizing a frictional path without revising the system’s conditions and pressures tends to not actually improve the system. Instead, what you’re likely to do is surface brittleness in all the areas that are now exposed to the new system demands. Whether a bottleneck was invisible or well monitored, and regardless of scale, it offered an implicit form of protection that was likely taken for granted.

For a small scale example, imagine you run a small bit of software on a server, talking to a database. If you suddenly get a lot of visits, simply autoscaling the web front-end will likely leave the database unprotected and sensitive to tipping over (well, usually after having grown the connection pool, raised the connection limit, vertically scaled the servers, and so on). None of this will let you serve heavy traffic at a reasonable price until you rework your caching and data distribution strategy. Building for orders of magnitude more traffic than usual requires changing some fundamental aspects of your solution.

Similar patterns can be seen at a larger scale. An interesting case was the Clarkesworld magazine; as LLMs made it possible to produce slop at a faster rate than previously normal, an inherent bottleneck in authorship ("writing a book takes significant time and effort") was removed, leading to so much garbage that the magazine had to stop taking in submissions. They eventually ended up bearing the cost of creating a sort of imperfect queuing "spam filter" for submissions in order to accept them again. They don't necessarily publish more stories than before, they still aim to publish the good human-written stuff, there's just more costly garbage flowing through the system.3

A similar case to look for is how doctors in the US started using generative AI to fight insurance claim denials. Of course, insurers are now expected to adopt the same technology to counteract this effect. 
A general issue at play here is that the private insurance system's objectives and priorities are in conflict with those of the doctors and patients. Without realigning them, most of what we can expect is an increase in costs and technological means to get the same results out of it. People who don’t or can’t use the new tools are going to be left behind. The optimization's benefit is temporary, limited, and ultimately lost in the overall system, which has grown more complex and possibly less accessible.4 I think LLMs are top of mind for people because they feel like a shift in how you automate. The common perspective is that machines are good at repetitive, predictable, mechanical tasks, and that solutions always suffered when it came to the fuzzy, unpredictable, and changing human-adjacent elements. LLMs look exactly the opposite of that: the computers can't do math very well anymore, but they seem to hold conversations and read intent much better. They therefore look like a huge opportunity to automate more of the human element and optimize it away, following well-established pressures and patterns. Alternatively, they seemingly increase the potential for new tools that could be created and support people in areas where none existed before. The issues I'm discussing here clearly apply to AI, Machine Learning, and particularly LLMs. But they also are not specific to them. People who love the solution more than they appreciate the problem risk delivering clumsy integrations that aren’t really fit for purpose. This is why it feels like companies are wedging more AI in our face; that's what the investors wanted in order to signal innovativeness, or because the engineers really wanted to build cool shit, rather than solving the problems the users wanted or needed solved. The challenges around automation were always there from their earliest days and keep being in play now. 
They remain similar without regard to the type of automation or optimization being put in place, particularly if the system around them does not reorganize itself.

The canonical example here is what happens when an organization looms so large that people can't understand what is going on. The standard playbook around this is to start driving purely by metrics, which end up compressing away rich phenomena. Doing so faster, whether it is by gathering more data (even if we already had too much) or by summarizing harder via an LLM, likely won't help run things better. Summaries, like metrics, are lossy compression. They're also not that different from management by PowerPoint slides, which we've seen cause problems in the space program, as highlighted by the Columbia report:

"As information gets passed up an organization hierarchy, from people who do analysis to mid-level managers to high-level leadership, key explanations and supporting information is filtered out. In this context, it is easy to understand how a senior manager might read this PowerPoint slide and not realize that it addresses a life-threatening situation. At many points during its investigation, the Board was surprised to receive similar presentation slides from NASA officials in place of technical reports. The Board views the endemic use of PowerPoint briefing slides instead of technical papers as an illustration of the problematic methods of technical communication at NASA."

There is no reason to think that overly aggressive summarization via PowerPoint, LLM, or metrics would not all end similarly. If your decision-making layer cannot deal with the amount of information required to centrally make informed decisions, there may be a point where the solution is to change the system's structure (and decentralize, which has its own pitfalls) rather than to optimize the existing paths without question.5

Every actor, component, or communication channel in a system has inherent limits. 
Any part that suddenly becomes faster or more productive without feedback shifts greater burdens onto other parts. These other parts must adapt, adjust, pass on the cost, or stop meeting expectations. Eliminating friction from one part of the system sometimes just shifts it around. System problems tend to remain system problems regardless of how much you optimize isolated portions of them.

Pressures and Propagation

How can we know what is worth optimizing, and what is changing at a more structural level?6 It helps to have an idea of where the pressures that create goal conflicts might come from, since they eventually lead to adaptations. Systems tend to continually be stretched to the limit of their capacity, and any improvement is instantly leveraged to accelerate the pace of existing activities.

This is usually where online people say things like "the root cause is capitalism"7—you shouldn't expect local solutions to fix systemic problems in the long term. The moment other players dynamically reduce their margins of maneuver to gain efficiency, you become relatively less competitive. You can think of how we could all formally prove software to be safe before shipping it, but instead we’ll compromise by using less formal methods like type analysis, tests, or feature flags to deliver acceptable products at much lower costs—both financial and cognitive. Be late to the market and you suffer, so there's a constant drive to ship faster and course-correct often.

People more hopeful or trusting of a system try to create and apply counteracting forces to maintain safe operating margins. This tends to be done through changing incentives, creating regulatory bodies, and implementing better control and reporting mechanisms. This is often the approach you'll see taken around the nuclear industry, the FAA and the aviation industry, and so on. 
However, there are also known patterns (such as regulatory capture) that tend to erode these mechanisms, and even within each of these industries, surprises and adaptations are still a regular occurrence.

Ultimately, the effects of any technological change are rather unpredictable. Designing for systems where experts operate demands constantly revisiting and iterating. The concepts we define to govern systems create their own indifference to other important perspectives, and data-driven approaches carry the risk of "bias laundering" mechanisms that repeat and amplify existing flaws in the system.

Other less predictable effects can happen. Adopting objectively more accurate algorithms can create monocultures in decision-making, which can interact such that the overall system efficiency can go down compared to more diverse environments—even in the absence of disruption.

Basically, the need for increased automation isn't likely to "normalize" a system and make it more predictable. It tends to just create new types of surprises in a way that does not remove the need for adaptation nor shift pressures; it only transforms them and makes them dynamic.

Robust Yet Fragile

Embedded deeply in our view of systems is an assumption that things are stable until they are disrupted. It’s possibly where ideas like “root cause” gain their charisma: identify the one triggering disruptor (or its underlying mechanism) and then the system will be stable again. It’s conceptually a bit Newtonian in that if no force is applied, nothing will change.

A more ecological stance would instead assume that any perceived stability (while maintaining function) requires ongoing dynamic adjustments. The system is always decaying, transforming, interacting, changing. Stop interfering with it and it will eventually reach stability (without maintaining function) by breaking down or failing. 
If the pressures are constant and shifting as well as the counteracting mechanisms, we can assume that evolution and adaptation are required to deal with this dynamism. Over time, we should expect that the system instead evolves into a shape that fits its burdens while driven by scarcity and efficiency. A risk in play here is that an ecosystem's pressures make it rational and necessary for all actors to optimize when they’re each other’s focal point—rather than some environmental condition. The more aggressively it is done, the more aggressively it is needed by others to stay in the game. Robust yet fragile is the nature of systems that are well optimized for their main use cases and competitive within their environment, but which become easily upended by pressures applied from unexpected angles (that are therefore unprotected, since resources were used elsewhere instead). Good examples of this are Just-In-Time supply chains being far more efficient than traditional ones, but being far easier to disrupt in times of disasters or pandemics. Most buffers in the supply chain (such as stock held in warehouses) had been replaced by more agile and effective production and delivery mechanisms. Particularly, the economic benefits (in stable times) and the need for competitiveness have made it tricky for many businesses not to rely on them. The issue with optimizations driven from systemic pressures is that as you look at trimming the costs of keeping a subsystem going in times of stability, you may notice decent amounts of slack capacity that you could get rid of or drive harder in order to be more competitive in your ecosystem. That’s often resources that resilience efforts draw on to keep adapting and evolving. Another form of rationalization in systems is one where rather than cutting "excess", the adoption and expansion of (software) platforms are used to drive economies of scale. 
Standardization and uniformization of patterns, methods, and processes is a good way to get more bang for your buck on an investment, to do more with less. Any such platform is going to have some things it gives its users for cheap, and some things that become otherwise challenging to do.8 Friction felt here can be caused either by going against the platform's optimal use cases or by the platform not properly supporting some use cases—it's a signal worth listening to.

In fact, we can more or less assume that friction is coming from everywhere because it's connected to these pressures. They just happen to be pervasive, at every layer of abstraction. If we had infinite time, infinite resources, or infinite capacity, we'd never need to optimize a thing.

Compensatory Adaptive Mechanisms

Successfully navigating these pressures is essentially drawing from concepts such as graceful extensibility and sustained adaptability. In a nutshell, we're looking to know how systems stretch themselves to deal with disruptions and surprises in a context of finite resources, and also how a system manages and regulates its own abilities to do that on an ongoing basis.

Remember that every actor or component of a system has inherent limits. This is also true of our ability to know what is going on, something known as local rationality. This means that even if we're really hoping we could intervene from the system level first and avoid the (sometimes deceptively ineffective) local optimizations, it will regardless be attempted through local efforts. Knowing and detecting the friction behind it is useful for whoever wants the broader systematic view to act earlier, but large portions of the system are going to remain dynamic and co-evolving from locally felt pains and friction. Local rationality impacts everyone, even the most confident of system thinkers.

Friction shifts are unavoidable, so it's useful to also know of the ways in which they show up. 
Unfortunately, these shifts generally remain unseen from afar, because compensatory mechanisms and adaptation patterns hide them.9 So instead, it's more practical to find how to spot the compensatory patterns themselves.

One of the well-known mechanisms is the Efficiency–thoroughness trade-off (ETTO) principle, which states that since time and resources are limited, one has to trade off efficiency against thoroughness to accomplish a task. Basically, if there's more work to do than there's capacity to do it, either you maintain thoroughness and the work accumulates or gets dropped, or you do the work less thoroughly, possibly cutting corners and accuracy, and are less careful so you can keep going as fast as required.

This is also one of the patterns feeding concepts such as "deviance" (often used in normalization of deviance, although the term alone points to any variation relative to norms), where procedures and rules defining safe work start being modified or bent unofficially, until covert work patterns grow a gap between the work as it is specified and how it is practiced.10

Of course, another path is one of innovation, which can mean some reorganization or restructuring. We happen to be in tech, so we tend to prefer to increase capacity by using new technology.

New technology is rarely neutral and never isolated. It disturbs established patterns—often on purpose, but sometimes in unexpected ways—can require a complex support system, and needs everyone to adjust around it to maintain the proper operational context. Adding to this, if automation is clumsy enough, it won’t be used to its full potential, so as to avoid distracting or burdening the practitioners using it to do their work.

The ongoing adaptations and trade-offs create potential risks and needs for reciprocity to anticipate and respond to new contingencies. You basically need people who know the system, how it works, understand what is normal or abnormal, and how to work around its flaws. 
They are usually those who have the capacity to detect any sort of "creaking" in local parts of the system, who harness the friction and can then do some adjusting, mustering and creating slack to provide the margin to absorb surprises. They are compensating for weaknesses as they appear by providing adaptive capacity.

Some organizations may enjoy these benefits without fixing anything else by burning out employees and churning through workers, using them as a kind of human buffer for systemic stressors. This can sustain them for a while, but may eventually reach its limits.

Even without any sort of willful abuse, pressures lead a system to try to fully use or optimize away the spare capacity within. This can eventually exhaust the compensatory mechanisms it needs to function, leading to something called "decompensation".

Decompensation

Compensatory mechanisms are often called on so gradually that your average observer wouldn't even know it's taking place. Systems (or organisms) that appear absolutely healthy one day collapse, and we discover they were overextended for a long while. Let's look at congestive heart failure as an example.11

Effects of heart damage accumulate gradually over the years—partly just by aging—and can be offset by compensatory mechanisms in the human body. As the heart becomes weaker and pumps less blood with each beat, adjustments manage to keep the overall flow constant over time. This can be done by increasing the heart rate using complex neural and hormonal signaling.

Other processes can be added to this: kidneys faced with lower blood pressure and flow can reduce how much urine they create to keep more fluid in the circulatory system, which increases cardiac filling pressure, which stretches the heart further before each beat, which adds to the stroke volume. Multiple pathways of this kind exist through the body, and they can maintain or optimize cardiac performance. 
However, each of these compensatory mechanisms has less desirable consequences. The heart remains damaged and they offset it, but the organism remains unable to generate greater cardiac output such as would be required during exercise. You would therefore see "normal" cardiac performance at rest, with little ability to deal with increased demand. If the damage is gradual enough, the organism will adjust its behavior to maintain compensation: you will walk slower, take breaks while climbing stairs, and will just generally avoid situations that strain your body. This may be done without even awareness of the decreased capacity of the system, and we may even resist acknowledging that we ever slowed down. Decompensation happens when all the compensatory mechanisms no longer prevent a downward spiral. If the heart can't maintain its output anymore, other organs (most often the kidneys) start failing. A failing organ can't overextend itself to help the heart; what was a stable negative feedback loop becomes a positive feedback loop, which quickly leads to collapse and death. Someone with a compensated congestive heart failure appears well and stable. They have gradually adjusted their habits to cope with their limited capacity as their heart weakened through life. However, looking well and healthy can hide how precarious of a position the organism is in. Someone in their late sixties skipping their heart medication for a few days or adopting a saltier diet could be enough to tip the scales into decompensation. Decompensation usually doesn’t happen because compensation mechanisms fail, but because their range is exhausted. A system that is compensating looks fine until it doesn’t. That's when failures may cascade and major breakdowns occur. This applies to all sorts of systems, biological as well as sociotechnical. 
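The compensation pattern described above can be sketched as a toy simulation. This is only an illustration under made-up assumptions, not a physiological or organizational model: the function name `simulate`, its parameters, and every number in it are hypothetical. A core capacity (`base`) erodes a little at each step, a compensatory mechanism with a finite range (`comp_range`) fills the gap, and once demand goes unmet the shortfall starts eating into the compensatory range itself, which stands in for the switch from a negative to a positive feedback loop.

```python
def simulate(steps: int, demand: float = 100.0, base: float = 100.0,
             decay: float = 5.0, comp_range: float = 30.0):
    """Toy model of compensation followed by decompensation.

    Returns a list of (step, output, margin) tuples, where `output` is what
    the system delivers and `margin` is the unused compensatory range.
    All parameters are hypothetical and chosen only for illustration.
    """
    history = []
    for step in range(steps):
        # How far the weakened core falls short of demand on its own.
        shortfall = max(0.0, demand - base)
        # Compensation covers the shortfall, but only within its range.
        used = min(shortfall, comp_range)
        output = min(demand, base + used)
        margin = comp_range - used
        history.append((step, output, margin))

        if output < demand:
            # Unmet demand also damages the compensating parts:
            # the stabilizing loop turns into a downward spiral.
            comp_range = max(0.0, comp_range - (demand - output))

        # The core keeps degrading gradually, whether or not anyone notices.
        base = max(0.0, base - decay)
    return history


if __name__ == "__main__":
    for step, output, margin in simulate(10):
        print(f"step={step} output={output:.0f} margin={margin:.0f}")
```

With these default values, the output holds at 100 for the first seven steps while the margin quietly shrinks from 30 to 0; it then drops to 95, 85, and 65 over the next three steps. Anyone watching only the output sees nothing wrong until the range is exhausted, while the margin has been signaling trouble all along.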
A common example seen in the tech industry is one where overburdened teams continuously pull small miracles and fight fires, keeping things working through major efforts. The teams are stretched thin, nobody's been on vacation for a while, and hiring is difficult because nobody wants to jump into that sort of place. All you need is one extra incident, one person falling ill or quitting, needing to add one extra feature (which nobody has bandwidth to work on), and the whole thing falls apart.

But even within purely technical subsystems, automation reaching its limits often shows up a bit like decompensation when it hands control back to a human operator who doesn't have the capacity to deal with what is going on (one of the many things pointed out by the classic text on the Ironies of Automation). Think of an autopilot that disengages once it has reached the limit of what it can do to stabilize a plane in hazardous conditions. Or of a cluster autoscaler that can no longer schedule more containers or hosts and starts crowding them until performance collapses, queues fill up, and the whole application becomes unresponsive.

Eventually, things spin out into a much bigger emergency than you'd have expected, since everything appeared fine. There might have been subtle clues—too subtle to be picked up without knowing where to look—which shouldn't distract from their importance. Friction usually involves some of these indicators.

Seeking the Friction

Going back to friction being useful feedback, the question I want to ask is: how can we keep listening?

The most effective actions are systemic, but the friction patterns are often local. If we detect the friction, papering over it via optimization or brute force necessarily keeps it local, and potentially ineffective. We need to do the more complex work of turning friction into a system-level feedback signal for it to have better chances of success and sustainability. 
We can't cover all the clues, but surfacing key ones can be critical for the system to anticipate surprises and foster broader adaptive responses. When we see inappropriate outcomes of a system, we should be led to wonder what about its structure makes it a normal output. What are the externalities others suffer as a consequence of the system's strengths and weaknesses? This is a big question that feels out of reach for most, and not necessarily practical for everyday life. But it’s an important one as we repeatedly make daily decisions around trading off “working a bit faster” against the impacts of the tools we adopt, whether they are environmental, philosophical, or sociopolitical. Closer to our daily work as developers, when we see code that’s a bit messy and hard to understand, we either slow down to create and repair that understanding, or patch it up with local information and move on. When we do this with a tool that manages the information for us, are we in a situation where we accelerate ourselves by providing better framing and structure, or one where we just get where we want without acknowledging the friction?12 If it's the latter, what are the effects of ignoring the friction? Are we creating technical debt that can’t be managed without the tools? Are we risking increasingly not reorganizing the system when it creaks, and only waiting to see obvious breaks to know it needs attention? In fact, how would you even become good at knowing what creaking sounds like if you just always slam through the hurdles? Recognizing these patterns is a skill, and it tends to require knowing what “normal” feels like such that you can detect what is not there when you start deviating.13 If you use a bot for code reviews, ask yourself whether it is replacing people reviewing and eroding the process. Is it providing a backstop? Are there things it can't know about that you think are important? Is it palliating already missing support? 
Are the additional code changes dictated by review comments worth more than the acts of reviewing and discussing the code? Do you get a different result if the bot only reviews code that someone else already reviewed to add more coverage, rather than implicitly making it easier to ignore reviews and go fast?

Work that takes time is a form of friction, and it's therefore tempting to seek ways to make it go faster. Before optimizing it away, ask yourself whether it might have outputs other than its main outputs. Maybe you’re fixing a broken process for an overextended team. Maybe you’re eroding annoying but surprisingly important opportunities for teams to learn, synchronize, share, or reflect on their practices without making room for a replacement.

When you're reworking a portion of a system to make it more automatable, ask whether any of the facilitating and structuring steps you're putting in place could also benefit people directly. I recall hearing a customer who said “We are now documenting things in human-readable text so AI can make use of it”—an investment that clearly could have been worth it for people too. Use the change of perspective as an opportunity to surface elements hidden in the broader context and ecosystem, and on which people rely implicitly.

I've been disappointed by proposals of turning LLMs into incident reviewers; I'd rather see them becoming analysis second-guessers: maybe they can point out agentive language leading to bias, flag elements that sound counterfactual, or highlight elements that appear blameful to create blame awareness?

If you make the decision to automate, still ask the questions and seek the friction. Systems adjust themselves and activate their adaptive capacity based on the type of challenges they face. Highlight friction. It’s useful, and it would be a waste to ignore it.

Thanks to Jordan Goodnough, Alan Kraft, and Laura Nolan for reviewing this text. 
1: I’m forced to refresh my work equipment more often now because new software appears to hunger for newer hardware at an accelerating pace.

2: As a side note, I'd like to call out the difference between friction, where you feel resistance and that your progression is not as expected based on experience, and one of pain, where you're just making no progress at all and having a plain old bad time. I'd put "pain" in a category where you might feel more helpless, or do useless work just because that's how people first gained the experience without any good reason for it to still be learned the same today. Under this casual definition, friction is the unfamiliar feeling when getting used to your tools and seeking better ways of wielding them, and pain is injuring yourself because the tools have poor ergonomic properties.

3: The same problem can be felt in online book retail, where spammers started hijacking the names of established authors with fake books. The cost of managing this is left to authors—and even I, having published mostly about Erlang stuff, have had at least two fake books published under my name in the last couple years.

4: In Energy and Equity, Ivan Illich proposes that societies built on high-speed motorized transportation create a "radical monopoly," basically stating that as the society grows around cars and scales its distances proportionally to time spent traveling, living without affording a car and its upkeep becomes harder and harder. This raises the bar of participation in such environments, and it's easy to imagine a parallel within other sociotechnical systems.

5: AI is charismatic technology. It is tempting to think of it as the one optimization that can make decisions such that the overall system remains unchanged while its outputs improve. Its role as fantasized by science fiction is one of an industrial supply chain built to produce constantly good decisions. This does not reduce its potential for surprise or risk. Machine-as-human-replacement is most often misguided. I don't believe we're anywhere near that point, and I don't think it's quite necessary to make an argument about it.

6: Because structural changes often require a lot more time and effort than local optimizations, you sometimes need to carry both types of interventions at the same time: a piecemeal local optimization to "extend the runway", and broader interventions to change the conditions of the system. A common problem for sustainability is to assume that extending the runway forever is both possible and sufficient, and never follow up with broader acts.

7: While capitalism has a keen ability to drive constraints of this kind, scarcity constraints are fairly universal. For example, Sonja D. Schmid, in Producing Power, illustrates that some of the contributing factors that encouraged the widespread use of the RBMK reactor design in the USSR—the same design used in Chernobyl—were that its manufacturing was more easily distributed over broad geographic areas and sourced from local materials which could avoid the planned system's inefficiencies, and therefore meet electrification objectives in ways that couldn't be done with competing (and safer) reactor designs. Additionally, competing designs often needed centralized manufacturing of parts that could then not be shipped through communist USSR without having to increase the dimensions of some existing train tunnels, forcing upgrades to its rail network to open power plants. An entirely unrelated example is that a beehive's honeycomb structure optimizes for using the least material to create a lattice of cells within a given volume.

8: AWS or Kubernetes or your favorite framework all come with some real cool capabilities and also some real trade-offs. What they're built to do makes some things much easier, and some things much harder. Do note that when you’re building something for the first time on a schedule, prioritizing to deliver a minimal first set of features also acts as an inherent optimization phase: what you choose to build and leave for later fits that same trade-off pattern.

9: This is similar to something called the Law of Fluency, which states that well-adapted cognitive work occurs with a facility that belies the difficulty of resolving demands and balancing dilemmas. While the law of fluency works at the individual cognitive level, I tend to assume it also shows up at larger organizational or system levels.

10: Rule- and Role-retreat may also be seen when people get overloaded, but won't deviate or adjust their plans to new circumstances. This "failure to adapt" can also contribute to incidents, and is one of the reasons why some forms of deviations have to be considered positive for the system.

11: Most of the information in this section came from Dr. Richard I. Cook, explaining the concept in a group discussion, a few years before his passing.

12: This isn’t purely a tooling decision; you also make this type of call every time you choose to refactor code to create an abstraction instead of copy/pasting bits of it around.

13: I believe but can't prove that there's also a tenuous but real path between the small-scale frictions, annoyances, and injustices we can let slip, and how they can be allowed to propagate and grow in greater systemic scales. There's always tremendously important work done at the local level, where people bridge the gap between what the system orders and what the world needs. If there are paths leading the feedback up from the local, they are critical to keeping things aligned. I'm unsure what the links between them are, but I like to think that small adjustments made by people with agency are part of a negative feedback loop partially keeping things in check.
This blog post originally appeared on the LFI blog but I decided to post it on my own as well. Every organization has to contend with limits: scarcity of resources, people, attention, or funding, friction from scaling, inertia from previous code bases, or a quickly shifting ecosystem. And of course there are more, like time, quality, effort, or how much can fit in anyone's mind. There are so many ways for things to go wrong; your ongoing success comes in no small part from the people within your system constantly navigating that space, making sacrifice decisions and trading off some things to buy runway elsewhere. From time to time, these come to a head in what we call a goal conflict, where two important attributes clash with each other. These are not avoidable, and in fact are just assumed to be so in many cases, such as "cheap, fast, and good; pick two". But somehow, when it comes to more specific details of our work, that clarity hides itself or gets obscured by the veil of normative judgments. It is easy after an incident to think of what people could have done differently, of signals they should have listened to, or of consequences they would have foreseen had they just been a little bit more careful. From this point of view, the idea of reinforcing desired behaviors through incentives, both positive (bonuses, public praise, promotions) and negative (demerits, re-certification, disciplinary reviews) can feel attractive. (Do note here that I am specifically talking of incentives around specific decision-making or performance, rather than broader ones such as wages, perks, overtime or hazard pay, or employment benefits, even though effects may sometimes overlap.) But this perspective itself is a trap. 
Hindsight bias (where we overestimate how predictable outcomes were after the fact) and its close relative outcome bias (where knowing the results after the fact tints how we judge the decision made) both serve as good reminders that we should ideally look at decisions as they were being made, with the information known and pressures present then. This is generally made easier by assuming people were trying to do a good job and get good results; a judgment that seems to make no sense asks of us that we figure out how it seemed reasonable at the time. Events were likely challenging, resources were limited (including cognitive bandwidth), and context was probably uncertain. If you were looking for goal conflicts and difficult trade-offs, this is certainly a promising area in which they can be found. Taking people's desire for good outcomes for granted forces you to shift your perspective. It demands you move away from thinking that somehow more pressure toward succeeding would help. It makes you ask what aid could be given to navigate the situation better, how the context could be changed for the trade-offs to be negotiated differently next time around. It lets us move away from wondering how we can prevent mistakes and move toward how we could better support our participants. Hell, the idea of rewarding desired behavior feels enticing even in cases where your review process does not fall into the traps mentioned here, where you take a more just approach. But the core idea here is that you can't really expect different outcomes if the pressures and goals that gave rise to them don't change either. During incidents, priorities in play already are things like "I've got to fix this to keep this business alive", stabilizing the system to prevent large cascades, or trying to prevent harm to users or customers. They come with stress, adrenalin, and sometimes a sense of panic or shock. 
These are likely to rank higher in the minds of people than “what’s my bonus gonna be?” or “am I losing a gift card or some plaque if I fail?” Adding incentives, whether positive or negative, does not clarify the situation. It does not address goal conflicts. It adds more variables to the equation, complexifies the situation, and likely makes it more challenging. Chances are that people will make the same decisions they would have made (and have been making continuously) in the past, obtaining the desired outcomes. Instead, they’ll change what they report later in subtle ways, by either tweaking or hiding information to protect themselves, or by gradually losing trust in the process you've put in place. These effects can be amplified when teams are given hard-to-meet abstract targets such as lowering incident counts, which can actively interfere with incident response by creating new decision points in people's mental flows. If responders have to discuss and classify the nature of an incident to fit an accounting system unrelated to solving it right now, their response is likely made slower, more challenging. This is not to say all attempts at structure and classification would hinder proper response, though. Clarifying the critical elements to salvage first, creating cues and language for patterns that will be encountered, and agreeing on strategies that support effective coordination across participants can all be really productive. It needs to be done with a deeper understanding of how your incident response actually works, and that sometimes means unpleasant feedback about how people perceive your priorities. I've been in reviews where people stated things like "we know that we get yelled at more for delivering features late than broken code so we just shipped broken code since we were out of time", or who admitted ignoring execs who made a habit of coming down from above to scold employees into fixing things they were pressured into doing anyway. 
These can be hurtful for an organization to consider, but they are nevertheless a real part of how people deal with exceptional situations. By trying to properly understand the challenges, by clarifying the goal conflicts that arise in systems and result in sometimes frustrating trade-offs, and by making learning from these experiences an objective of its own, we can hopefully make things a bit better. Grounding our interventions within a richer, more naturalistic understanding of incident response and all its challenges is a small—albeit a critical one—part of it all.
From time to time, people ask me what I use to power my blog, maybe because they like the minimalist form it has. I tell them it’s a bad idea and that I use the Erlang compiler infrastructure for it, and they agree to look elsewhere. After launching my notes section, I had to fully clean up my engine. I thought I could write about how it works because it’s fairly unique and interesting, even if you should probably not use it. The Requirements I first started my blog 14 years ago. It had roughly the same structure as it does at the time of writing this: a list of links and text with nothing else. It did poorly with mobile (which was still sort of new but I should really work to improve these days), but okay with screen readers. It’s gotta be minimal enough to load fast on old devices. There’s absolutely nothing dynamic on here. No JavaScript, no comments, no tracking, and I’m pretty sure I’ve disabled most logging and retention. I write into a void, either transcribing talks or putting down rants I’ve repeated 2-3 times to other people so it becomes faster to just link things in the future. I mostly don’t know what gets read or not, but over time I found this kept the experience better for me than chasing readers or views. Basically, a static site is the best technology for me, but from time to time it’s nice to be able to update the layout, add some features (like syntax highlighting or an RSS feed) so it needs to be better than flat HTML files. Internally it runs with erlydtl, an Erlang implementation of Django Templates, which I really liked a decade and a half ago. It supports template inheritance, which is really neat to minimize files I have to edit. All I have is a bunch of files containing my posts, a few of these templates, and a little bit of Rebar3 config tying them together. 
There are some features that erlydtl doesn’t support but that I wanted anyway, notably syntax highlighting (without JavaScript), markdown support, and including subsections of HTML files (a weird corner case to support RSS feeds without powering them with a database). The feature I want to discuss here is “only rebuild what you strictly need to,” which I covered by using the Rebar3 compiler. Rebar3’s Compiler Rebar3 is the Erlang community’s build tool, which Tristan and I launched over 10 years ago, a follower to the classic rebar 2.x script. A funny requirement for Rebar3 is that Erlang has multiple compilers: one for Erlang, but also one for MIB files (for SNMP), the Leex syntax analyzer generator, and the Yecc parser generator. It also is plugin-friendly in order to compile Elixir modules, and other BEAM languages, like LFE, or very early versions of Gleam. We needed to support at least four compilers out of the box, and to properly track everything such that we only rebuild what we must. This is done using a Directed Acyclic Graph (DAG) to track all files, including build artifacts. The Rebar3 compiler infrastructure works by breaking up the flow of compilation in a generic and specific subset. The specific subset will: Define which file types and paths must be considered by the compiler. Define which files are dependencies of other files. Be given a graph of all files and their artifacts with their last modified times (and metadata), and specify which of them need rebuilding. Compile individual files and provide metadata to track the artifacts. The generic subset will: Scan files and update their timestamps in a graph for the last modifications. Use the dependency information to complete the dependency graph. Propagate the timestamps of source files modifications transitively through the graph (assume you update header A, included by header B, applied by macro C, on file D; then B, C, and D are all marked as modified as recently as A in the DAG). 
Pass this updated graph to the specific part to get a list of files to build (usually by comparing which source files are newer than their artifacts, but also if build options changed). Schedule sequential or parallel compilation based on what the specific part specified. Update the DAG with the artifacts and build metadata, and persist the data to disk. In short, you build a compiler plugin that can name directories, file extensions, dependencies, and can compare timestamps and metadata. Then make sure this plugin can compile individual files, and the rest is handled for you. The blog engine Since I’m currently the most active Rebar3 maintainer, I’ve definitely got to maintain the compiler infrastructure described earlier. Since my blog needed to rebuild the fewest static files possible and I already used a template compiler, plugging it into Rebar3 became the solution demanding the least effort. It requires a few hundred lines of code to write the plugin and a bit of config looking like this:

    {blog3r, [
        {vars, [
            {url, [{base, "https://ferd.ca/"},
                   {notes, "https://ferd.ca/notes/"},
                   {img, "https://ferd.ca/static/img/"},
                   ...]},
            %% Main site
            {index, #{template => "index.tpl", out => "index.html", section => main}},
            {index, #{template => "rss.tpl", out => "feed.rss", section => main}},
            %% Notes section
            {index, #{template => "index-notes.tpl", out => "notes/index.html", section => notes}},
            {index, #{template => "rss-notes.tpl", out => "notes/feed.rss", section => notes}},
            %% All sections' pages.
            {sections, #{
                main => {"posts/", "./", [
                    {"Mon, 02 Sep 2024 11:00:00 EDT", "My Blog Engine is the Erlang Build Tool", "blog-engine-erlang-build-tool.md.tpl"},
                    {"Thu, 30 May 2024 15:00:00 EDT", "The Review Is the Action Item", "the-review-is-the-action-item.md.tpl"},
                    {"Tue, 19 Mar 2024 11:00:00 EDT", "A Commentary on Defining Observability", "a-commentary-on-defining-observability.md.tpl"},
                    {"Wed, 07 Feb 2024 19:00:00 EST", "A Distributed Systems Reading List", "distsys-reading-list.md.tpl"},
                    ...
                ]},
                notes => {"notes/", "notes/", [
                    {"Fri, 16 Aug 2024 10:30:00 EDT", "Paper: Psychological Safety: The History, Renaissance, and Future of an Interpersonal Construct", "papers/psychological-safety-interpersonal-construct.md.tpl"},
                    {"Fri, 02 Aug 2024 09:30:00 EDT", "Atomic Accidents and Uneven Blame", "atomic-accidents-and-uneven-blame.md.tpl"},
                    {"Sat, 27 Jul 2024 12:00:00 EDT", "Paper: Moral Crumple Zones", "papers/moral-crumple-zones.md.tpl"},
                    {"Tue, 16 Jul 2024 19:00:00 EDT", "Hutchins' Distributed Cognition in The Wild", "hutchins-distributed-cognition-in-the-wild.md.tpl"},
                    ...
                ]}
            }}
        ]}.

And blog post entry files like this:

    {% extends "base.tpl" %}

    {% block content %}
    <p>I like cats. I like food. <br />
    I don't especially like catfood though.</p>

    {% markdown %}
    ### Have a subtitle

    And then _all sorts_ of content!

    - lists
    - other lists
    - [links]({{ url.base }}section/page))
    - and whatever fits a demo

    > Have a quote to close this out
    {% endmarkdown %}
    {% endblock %}

These call to a parent template (see base.tpl for the structure) to inject their content. The whole site gets generated that way. Even compiler error messages are lifted from the Rebar3 libraries (although I haven't wrapped everything perfectly yet), with the following coming up when I forgot to close an if tag before closing a for loop:

    $ rebar3 compile
    ===> Verifying dependencies...
    ===> Analyzing applications...
    ===> Compiling ferd_ca
    ===> template error:
       ┌─ /home/ferd/code/ferd-ca/templates/rss.tpl:
       │
    24 │ {% endfor %}
       │ ╰── syntax error before: "endfor"
    ===> Compiling templates/rss.tpl failed

As you can see, I build my blog by calling rebar3 compile, the same command as I do for any Erlang project. I find it interesting that on one hand, this is pretty much the best design possible for me given that it represents almost no new code, no new tools, and no new costs. It’s quite optimal. On the other hand, it’s possibly the worst possible tool chain imaginable for a blog engine for almost anybody else.
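To make the "only rebuild what you strictly need to" idea from the Rebar3 compiler section a bit more concrete, here is a minimal sketch of the timestamp propagation and the rebuild decision. It is written in Python for brevity and is not Rebar3's actual code (which is Erlang); the function names, the file names, and the integer timestamps are all made up for illustration, and the rebuild policy shown is only the simple "source newer than artifact" comparison.

```python
def propagate_timestamps(mtimes, deps):
    """Generic part (sketch): every file gets the newest modification time
    found among itself and everything it transitively depends on.

    mtimes: dict mapping file -> its own last-modified time
    deps:   dict mapping file -> files it directly depends on (acyclic)
    """
    effective = {}

    def visit(path):
        if path not in effective:
            newest_dep = max(
                (visit(dep) for dep in deps.get(path, ())),
                default=mtimes[path],
            )
            effective[path] = max(mtimes[path], newest_dep)
        return effective[path]

    for path in mtimes:
        visit(path)
    return effective


def needs_rebuild(effective, artifacts, artifact_mtimes):
    """Specific part (hypothetical policy): rebuild a source when its artifact
    is missing or older than the source's propagated timestamp."""
    return sorted(
        src
        for src, artifact in artifacts.items()
        if artifact_mtimes.get(artifact, float("-inf")) < effective[src]
    )


# Header A is included by header B, which is used by C, which is used by D.
# Only A was touched recently (time 50); E is unrelated.
mtimes = {"A.hrl": 50, "B.hrl": 10, "C.hrl": 10, "D.erl": 10, "E.erl": 10}
deps = {"B.hrl": ["A.hrl"], "C.hrl": ["B.hrl"], "D.erl": ["C.hrl"]}
artifacts = {"D.erl": "D.beam", "E.erl": "E.beam"}
artifact_mtimes = {"D.beam": 20, "E.beam": 20}

effective = propagate_timestamps(mtimes, deps)
print(needs_rebuild(effective, artifacts, artifact_mtimes))  # ['D.erl']
```

D gets rebuilt even though its own file was not edited, because the change to A is propagated through B and C; E is left alone. A real implementation would also persist the graph to disk and account for changed build options, which this sketch leaves out.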
2024/05/30 The Review Is the Action Item I like to consider running an incident review to be its own action item. Other follow-ups emerging from it are a plus, but the point is to learn from incidents, and the review gives room for that to happen. This is not surprising advice if you’ve read material from the LFI community and related disciplines. However, there are specific perspectives required to make this work, and some assumptions necessary for it, without which things can break down. How can it work? In a more traditional view, the system is believed to be stable, then disrupted into an incident. The system gets stabilized, and we must look for weaknesses that can be removed or barriers that could be added in order to prevent such disruption in the future. Other perspectives for systems include views where they are never truly stable. Things change constantly; uncertainty is normal. Under that lens, systems can’t be forced into stability by control or authority. They can be influenced and adapt on an ongoing basis, and possibly kept in balance through constant effort. Once you adopt a socio-technical perspective, the hard-to-model nature of humans becomes a desirable trait to cope with chaos. Rather than a messy variable to stamp out, you’ll want to give them more tools and ways to keep all the moving parts of the subsystems going. There, an incident review becomes an arena where misalignment in objectives can be repaired, where strategies and tactics can be discussed, where mental models can be corrected and enriched, where voices can be heard when they wouldn’t be, and where we are free to reflect on the messy reality that drove us here. This is valuable work, and establishing an environment where it takes place is a key action item on its own. People who want to keep things working will jump on this opportunity if they see any value in it. 
Rather than giving them tickets to work on, we’re giving them a safe context to surface and discuss useful information. They’ll carry that information with them in the future, and it may influence the decisions they make, here and elsewhere. If the stories that come out of reviews are good enough, they will be retold to others, and the organization will have learned something. That belief people will do better over time as they learn, to me, tends to be worth more than focusing on making room for a few tickets in the backlog. How can it break down? One of the unnamed assumptions with this whole approach is that teams should have the ability to influence their own roadmap and choose some of their own work. A staunchly top-down organization may leverage incident reviews as a way to let people change the established course with a high priority. That use of incident reviews can’t be denied in these contexts. We want to give people the information and the perspectives they need to come up with fixes that are effective. Good reviews with action items ought to make sense, particularly in these orgs where most of the work is normally driven by folks outside of the engineering teams. But if the maintainers do not have the opportunity to schedule work they think needs doing outside of the aftermath of an incident—work that is by definition reactive—then they have no real power to schedule preventive work on their own. And so that’s a place where learning being its own purpose breaks down: when the learnings can’t be applied. Maybe it feels like “good” reviews focused on learning apply to a surprisingly narrow set of teams then, because most teams don’t have that much control. The question here really boils down to “who is it that can apply things they learned, and when?” If the answer is “almost no one, and only when things explode,” that’s maybe a good lesson already. That’s maybe where you’d want to start remediating. 
Note that even this perspective is a bit reductionist, which is also another way in which learning reviews may break down. By narrowing knowledge’s utility only to when it gets applied in measurable scheduled work, we stop finding value outside of this context, and eventually stop giving space for it. It’s easy to forget that we don’t control what people learn. We don’t choose what the takeaways are. Everyone does it for themselves based on their own lived experience. More importantly, we can’t stop people from using the information they learned, whether at work or in their personal life. Lessons learned can be applied anywhere and any time, and they can become critically useful at unexpected times. Narrowing the scope of your reviews such that they only aim to prevent bad accidents indirectly hinders creating fertile grounds for good surprises as well. Going for better While the need for action items is almost always there, a key element of improving incident reviews is to not make corrections the focal point. Consider the incident review as a preliminary step, the data-gathering stage before writing down the ideas. You’re using recent events as a study of what’s surprising within the system, but also of how it is that things usually work well. Only once that perspective is established does it make sense to start thinking of ways of modifying things. Try it with only one or two reviews at first. Minor incidents are usually good, because following the methods outlined in docs like the Etsy Debriefing Facilitation Guide and the Howie guide tends to reveal many useful insights in incidents people would have otherwise overlooked as not very interesting. As you and your teams see value, expand to more and more incidents. It also helps to set the tone before and during the meetings. I’ve written a set of “ground rules” we use at Honeycomb and that my colleague Lex Neva has transcribed, commented, and published. 
See if something like that could adequately frame the session. If abandoning the idea of action items seems irresponsible or impractical to you, keep them. But keep them with some distance; the common tip given by the LFI community is to schedule another meeting after the review to discuss them in isolation. At some point, that follow-up meeting may become disjoint from the reviews. There’s not necessarily a reason why every incident needs a dedicated set of fixes (longer-term changes impacting them could already be in progress, for example), nor is there a reason to wait for an incident to fix things and improve them. That’s when you decouple understanding from fixing, and the incident review becomes its own sufficient action item.
2024/03/19 A Commentary on Defining Observability Recently, Hazel Weakly has published a great article titled Redefining Observability. In it, she covers competing classical definitions of observability, weaknesses they have, and offers a practical reframing of the concept in the context of software organizations (well, not only software organizations, but the examples tilt that way). I agree with her post in most ways, and so this blog post of mine is more of an improv-like “yes, and…” response to it, in which I also try to frame the many existing models as complementary or contrasting perspectives of a general concept. The main points I’ll try to bring here are on the topics of the difference between insights and questions, the difference between observability and data availability, reinforcing a socio-technical definition, the mess of complex systems and mapping them, and finally, a hot take on the use of models when reasoning about systems. Insights and Questions The control theory definition of observability, from Rudolf E. Kálmán, goes as follows: Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The one from Woods and Hollnagel in cognitive engineering goes like this: Observability is feedback that provides insight into a process and refers to the work needed to extract meaning from available data. Hazel’s version, by comparison, is: Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn. While all three definitions relate to extracting information from a system, Hazel’s definition sits at a higher level by specifically mentioning questions and actions. It’s a more complete feedback loop including some people, their mental models, and seeking to enrich them or use them. 
I think that higher level ends up erasing a property of observability in the other definitions: it doesn’t have to be inquisitive nor analytical. Let’s take a washing machine, for example. You can know whether it is being filled or spinning by sound. The lack of sound itself can be a signal about whether it is running or not. If it is overloaded, it might shake a lot and sound out of balance during the spin cycle. You don’t necessarily have to be in the same room as the washing machine nor paying attention to it to know things about its state, passively create an understanding of normalcy, and learn about some anomalies in there. Another example here would be something as simple as a book. If you’re reading a good old paper book, you know you’re nearing the end of it just by how the pages you have read make a thicker portion of the book than those you haven’t read yet. You do not have to think about it, the information is inherent to the medium. An ebook read on an electronic device, however, will hide that information unless a design decision is made to show how many lines or words have been read, display a percentage, or a time estimate of the content left. Observability for the ebook isn’t innate to its structure and must be built in deliberately. Similarly, you could know old PCs were doing heavy work if the disk was making noise and when the fan was spinning up; it is not possible to know as much on a phone or laptop that has an SSD and no fan unless someone builds a way to expose this data. Associations and patterns can be formed by the observer in a way that provides information and explanations, leading to effective action and management of the system in play. It isn’t something always designed or done on purpose, but it may need to be. The key element is that an insight can be obtained without asking questions. 
In fact, a lot of anomaly detection is done passively, by the observer having a sort of mental construct of what normal is that lets them figure out what should happen next (what the future trajectory of the system is) and to then start asking questions when these expectations are not met. The insights, therefore, can come before the question is asked. Observability can be described as a mechanism behind this. I don’t think that this means Hazel’s definition is wrong; I think it might just be a consequence of the scale at which her definition operates. However, this distinction is useful for a segue into the difference between data availability and observability. The difference between observability and data availability For a visual example, I’ll use variations on a painting from 1866 named In a Roman Osteria, by Danish painter Carl Bloch: The first one is an outline, and the second is a jigsaw puzzle version (with all the pieces right side up and in the correct orientation, at least). The jigsaw puzzle has 100% data availability. All of the information is there and you can fully reconstruct the initial painting. The outlined version has a lot less data available, but if you’ve never seen the painting before, you will get a better understanding from it in absolutely no time compared to the jigsaw. This “make the jigsaw show you what you need faster” approach is more or less where a lot of observability vendors operate: the data is out there, you need help to store it and put it together and extract the relevancy out of it. What this example highlights though is that while you may get better answers with richer and more accurate data (given enough time, tools, and skill), the outline is simpler and may provide adequate information with less effort required from the observer. Effective selection of data, presented better, may be more appropriate during high-tempo, tense situations where quick decisions can make a difference. 
This, at least, implies that observability is not purely a data problem nor a tool problem (which lines up, again, with what Hazel said in her post). However, it hints at it being a potential design problem. The way data is presented, the way affordances are added, and whether the system is structured in a way that makes storytelling possible all can have a deep impact in how observable it turns out to be. Sometimes, coworkers mention that some of our services are really hard to interpret even when using Honeycomb (which we build and operate!) My theory about that is that too often, the data we output for observability is structured with the assumption that the person looking at it will have the code on hand—as the author did when writing it—and will be able to map telemetry data to specific areas of code. So when you’re in there writing queries and you don’t know much about the service, the traces mean little. As a coping mechanism, social patterns emerge where data that is generally useful is kept on some specific spans that are considered important, but that you can only find if someone more familiar with this area explained where it was to you already. It draws into pre-existing knowledge of the architecture, of communication patterns, of goals of the application that do not live within the instrumentation. Traces that are easier to understand and explore make use of patterns that are meaningful to the investigator, regardless of their understanding of the code. For the more “understandable” telemetry data, the naming, structure, and level of detail are more related to how the information is to be presented than the structure of the underlying implementation. Observability requires interpretation, and interpretation sits in the observer. What is useful or not will be really hard to predict, and people may find patterns and affordances in places that weren’t expected or designed, but still significant. 
Catering to this property requires taking a perspective of the system that is socio-technical. The System is Socio-Technical Once again for this section, I agree with Hazel on the importance of people in the process. She has lots of examples of good questions that exemplify this. I just want to push even harder here. Most of the examples I’ve given so far were technical: machines and objects whose interpretation is done by humans. Real complex systems don’t limit themselves to technical components being looked at; people are involved, talking to each other, making decisions, and steering the overall system around. This is nicely represented by the STELLA Report’s Above/Below the line diagram: The continued success of the overall system does not purely depend on the robustness of technical components and their ability to withstand challenges for which they were designed. When challenges beyond what was planned for do happen, and when the context in which the system operates changes (whether it is due to competition, legal frameworks, pandemics, or evolving user needs and tastes), adjustments need to be made to adapt the system and keep it working. The adaptation is not done purely on a technical level, by fixing and changing the software and hardware, but also by reconfiguring the organization, by people learning new things, by getting new or different people in the room, by reframing the situation, and by steering things in a new direction. There is a constant gap to bridge between a solution and its context, and the ability to anticipate these challenges, prepare for them, and react to them can be informed by observability. Observability at the technical level (“instrument the code and look at it with tools”) is covered by all definitions of observability in play here, but I want to really point out that observability can go further. 
If you reframe your system as properly socio-technical, then yes you will need technical observability interpreted at the social level. But you may also need social observability handled at the social level: are employees burning out? Do we have the psychological safety required to learn from events? Do I have silos of knowledge that render my organization brittle? What are people working on? Where is the market at right now? Are our users leaving us for the competition? Are our employees leaving us for the competition? How do we deal with a fast-moving space with limited resources? There are so many ways for an organization to fail that aren’t technical, and ideally we’d also keep an eye on them. A definition of observability that is technical in nature can set artificial boundaries to your efforts to gain insights from ongoing processes. I believe Hazel’s definition maps to this ideal more clearly than the cognitive engineering one, but I want to re-state the need to avoid strictly framing its application to technical components observed by people. A specific dynamic I haven’t mentioned here (and this is something disciplines like cybernetics, cognitive engineering, and resilience engineering all have an interest in) is one where the technical elements of the system know about the social elements of the system. We essentially do not currently have automation (nor AI) sophisticated enough to be good team members. For example, while I can detect a coworker is busy managing a major outage or meeting with important customers in the midst of a contract renewal, pretty much no alerting system will be able to find that information and decide to ask for assistance from someone else who isn’t as busy working on high-priority stuff within the system. 
The ability of one agent to shape their own work based on broader objectives than their private ones is something that requires being able to observe other agents in the system, map that to higher goals, and shape their own behaviour accordingly. Ultimately, a lot of decisions are made through attempts at steering the system or part of it in a given direction. This needs some sort of [mental] map of the relationships in the system, and an even harder thing to map out is the impact that having this information will have on the system itself. Complex Systems Mapping Themselves Recently I was at work trying to map concepts about reliability, and came up with this mess of a diagram showing just a tiny portion of what I think goes into influencing system reliability (the details are unimportant): In this concept map, very few things are purely technical; lots of work is social, process-driven, and is about providing feedback loops. As the system grows more complex, analysis and control lose some power, and sense-making and influence become more applicable. The overall system becomes unknowable, and nearly impossible to map out: by the time the map is complete, it’s already outdated. On top of that, the moment the above map becomes used to make decisions, its own influence might need to become part of itself, since it has entered the feedback loop of how decisions are made. These things can’t be planned out, and sometimes can only be discovered in a timely manner by acting. Basically, the point here is that not everything is observable via data availability and search. Some questions you have can only be answered by changing the system, either through adding new data, or by extracting the data through probing of the system. Try a bunch of things and look at the consequences. 
A few years ago, I was talking with David Woods (to be exact, he was telling me useful things and I was listening) and he compared complex socio-technical systems to a messy web; everything is somehow connected to everything, and it’s nearly impossible to just keep track of all the relevant connections in your head. Things change and some elements will be more relevant today than they were yesterday. As we walk the web, we rediscover connections that are important, some that stopped being so significant, and so on. Experimental practices like chaos engineering or fault injection aren’t just about testing behaviour for success and failure, they are also about deliberately exploring the connections and parts of the web we don’t venture into as often as we’d need to in order to maintain a solid understanding of it. One thing to keep in mind is that the choice of which experiment to run is also based on the existing map and understanding of situations and failures that might happen. There is a risk in the planners and decision-makers not considering themselves to be part of the system they are studying, and of ignoring their own impact and influence. This leads to elements such as pressures, goal conflicts, and adaptations to them, which may tend to only become visible during incidents. The framing of what to investigate, how to investigate it, how errors are constructed, which questions are worth asking or not worth asking all participate in the weird complex feedback loop within the big messy systems we’re in. The tools required for that level of analysis are however very, very different from what most observability vendors provide, and are generally never marketed as such, which does tie back to Hazel’s conclusion that “observability is organizational learning.” A Hot Take on the Use of Models Our complex socio-technical systems aren’t closed systems. 
Competitors exist; employees bring in their personal life into their work life; the environment and climate in which we live plays a role. It’s all intractable, but simplified models help make bits of it all manageable, at least partially. A key criterion is knowing when a model is worth using and when it is insufficient. Hazel Weakly again hints at this when she states: The control theory version gives you a way to know whether a system is observable or not, but it ignores the people and gives you no way to get there The cognitive engineering version is better, but doesn’t give you a “why” you should care, nor any idea of where you are and where to go Her version provides a motivation and a sense of direction as a process I don’t think these versions are in conflict. They are models, and models have limits and contextual uses. In general I’d like to reframe these models as: What data may be critical to provide from a technical component’s point of view (control theory model) How people may process the data and find significance in it (cognitive engineering model) How organizations should harness the mechanism as a feedback loop to learn and improve (Hazel Weakly’s model) They work on different concerns by picking a different area of focus, and therefore highlight different parts of the overall system. It’s a bit like how looking at phenomena at a human scale with your own eyes, at a micro scale with a microscope, and at an astronomical scale with a telescope, all provide you with useful information despite operating at different levels. While the astronomical scale tools may not bear tons of relevance at the microscopic scale operations, they can all be part of the same overall search of understanding. Much like observability can be improved despite having less data if it is structured properly, a few simpler models can let you make better decisions in the proper context. 
My hope here was not to invalidate anything Hazel posted, but to keep the validity and specificity of the other models through additional contextualization.
More in programming
Once you’ve written your strategy’s exploration, the next step is working on its diagnosis. Diagnosis is understanding the constraints and challenges your strategy needs to address. In particular, it’s about doing that understanding while slowing yourself down from deciding how to solve the problem at hand before you know the problem’s nuances and constraints. If you ever find yourself wanting to skip the diagnosis phase–let’s get to the solution already!–then maybe it’s worth acknowledging that every strategy that I’ve seen fail, did so due to a lazy or inaccurate diagnosis. It’s very challenging to fail with a proper diagnosis, and almost impossible to succeed without one. The topics this chapter will cover are: Why diagnosis is the foundation of effective strategy, on which effective policy depends. Conversely, how skipping the diagnosis phase consistently ruins strategies A step-by-step approach to diagnosing your strategy’s circumstances How to incorporate data into your diagnosis effectively, and where to focus on adding data Dealing with controversial elements of your diagnosis, such as pointing out that your own executive is one of the challenges to solve Why it’s more effective to view difficulties as part of the problem to be solved, rather than a blocking issue that prevents making forward progress The near impossibility of an effective diagnosis if you don’t bring humility and self-awareness to the process Into the details we go! This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Diagnosis is strategy’s foundation One of the challenges in evaluating strategy is that, after the fact, many effective strategies are so obvious that they’re pretty boring. Similarly, most ineffective strategies are so clearly flawed that their authors look lazy. 
That’s because, as a strategy is operated, the reality around it becomes clear. When you’re writing your strategy, you don’t know if you can convince your colleagues to adopt a new approach to specifying APIs, but a year later you know very definitively whether it’s possible. Building your strategy’s diagnosis is your attempt to correctly recognize the context that the strategy needs to solve before deciding on the policies to address that context. Done well, the subsequent steps of writing strategy often feel like an afterthought, which is why I think of diagnosis as strategy’s foundation. Where exploration was an evaluation-free activity, diagnosis is all about evaluation. How do teams feel today? Why did that project fail? Why did the last strategy go poorly? What will be the distractions to overcome to make this new strategy successful? That said, not all evaluation is equal. If you state your judgment directly, it’s easy to dispute. An effective diagnosis is hard to argue against, because it’s a web of interconnected observations, facts, and data. Even for folks who dislike your conclusions, the weight of evidence should be hard to shift. Strategy testing, explored in the Refinement section, takes advantage of the reality that it’s easier to diagnose by doing than by speculating. It proposes a recursive diagnosis process until you have real-world evidence that the strategy is working. How to develop your diagnosis Your strategy is almost certain to fail unless you start from an effective diagnosis, but how to build a diagnosis is often left unspecified. That’s because, for most folks, building the diagnosis is indeed a dark art: unspecified, undiscussed, and uncontrollable. I’ve been guilty of this as well, with The Engineering Executive’s Primer’s chapter on strategy staying silent on the details of how to diagnose for your strategy. 
So, yes, there is some truth to the idea that forming your diagnosis is an emergent, organic process rather than a structured, mechanical one. However, over time I’ve come to adopt a fairly structured approach: Braindump, starting from a blank sheet of paper, write down your best understanding of the circumstances that inform your current strategy. Then set that piece of paper aside for the moment. Summarize exploration on a new piece of paper, review the contents of your exploration. Pull in every piece of diagnosis from similar situations that resonates with you. This is true for both internal and external works! For each diagnosis, tag whether it fits perfectly, or needs to be adjusted for your current circumstances. Then, once again, set the piece of paper aside. Mine for distinct perspectives on yet another blank page, talking to different stakeholders and colleagues who you know are likely to disagree with your early thinking. Your goal is not to agree with this feedback. Instead, it’s to understand their view. The Crux by Richard Rumelt anchors diagnosis in this approach, emphasizing the importance of “testing, adjusting, and changing the frame, or point of view.” Synthesize views into one internally consistent perspective. Sometimes the different perspectives you’ve gathered don’t mesh well. They might well explicitly differ in what they believe the underlying problem is, as is typical in tension between platform and product engineering teams. The goal is to competently represent each of these perspectives in the diagnosis, even the ones you disagree with, so that later on you can evaluate your proposed approach against each of them. When synthesizing feedback goes poorly, it tends to fail in one of two ways. First, the author’s opinion shines through so strongly that it renders the author suspect. 
Your goal is never to agree with every team’s perspective, just as your diagnosis should typically avoid crowning any perspective as correct: a reader should generally be apprised of the details and unaware of the author. The second common issue is when a group tries to jointly own the synthesis, but creates a fractured perspective rather than a unified one. I generally find that having one author who is accountable for representing all views works best to address both of these issues. Test drafts across perspectives. Once you’ve written your initial diagnosis, you want to sit down with the people who you expect to disagree most fervently. Iterate with them until they agree that you’ve accurately captured their perspective. It might be that they disagree with some other view points, but they should be able to agree that others hold those views. They might argue that the data you’ve included doesn’t capture their full reality, in which case you can caveat the data by saying that their team disagrees that it’s a comprehensive lens. Don’t worry about getting the details perfectly right in your initial diagnosis. You’re trying to get the right crumbs to feed into the next phase, strategy refinement. Allowing yourself to be directionally correct, rather than perfectly correct, makes it possible to cover a broad territory quickly. Getting caught up in perfecting details is an easy way to anchor yourself into one perspective prematurely. At this point, I hope you’re starting to predict how I’ll conclude any recipe for strategy creation: if these steps feel overly mechanical to you, adjust them to something that feels more natural and authentic. There’s no perfect way to understand complex problems. That said, if you feel uncertain, or are skeptical of your own track record, I do encourage you to start with the above approach as a launching point. 
Incorporating data into your diagnosis The strategy for Navigating Private Equity ownership’s diagnosis includes a number of details to help readers understand the status quo. For example, the section on headcount growth explains headcount growth, shows how it compares to the prior year, and provides a mental model for readers to translate engineering headcount into engineering headcount costs: Our Engineering headcount costs have grown by 15% YoY this year, and 18% YoY the prior year. Headcount grew 7% and 9% respectively, with the difference between headcount and headcount costs explained by salary band adjustments (4%), a focus on hiring senior roles (3%), and increased hiring in higher cost geographic regions (1%). If everyone evaluating a strategy shares the same foundational data, then evaluating the strategy becomes vastly simpler. Data is also your mechanism for supporting or critiquing the various views that you’ve gathered when drafting your diagnosis; to an impartial reader, data will speak louder than passion. If you’re confident that a perspective is true, then include a data narrative that supports it. If you believe another perspective is overstated, then include data that the reader will require to come to the same conclusion. Do your best to include data analysis with a link out to the full data, rather than requiring readers to interpret the data themselves while they are reading. As your strategy document travels further, there will be inevitable requests for different cuts of data to help readers understand your thinking, and this is somewhat preventable by linking to your original sources. If much of the data you want doesn’t exist today, that’s a fairly common scenario for strategy work: if the data to make the decision easy already existed, you probably would have already made a decision rather than needing to run a structured thinking process. 
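As a small illustration of the "mental model" idea, the headcount figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for the example, and treating the percentages as simply additive is an assumption inferred from the fact that the quoted numbers sum exactly (7 + 4 + 3 + 1 = 15). The compounded figure is shown purely for contrast and is not something the strategy claims.

```python
# Figures quoted in the diagnosis for the current year, in percentage points.
headcount_growth = 7
cost_drivers = {
    "salary band adjustments": 4,
    "focus on hiring senior roles": 3,
    "higher cost geographic regions": 1,
}
reported_cost_growth = 15

# Additive reading (assumed): each driver stacks on top of headcount growth.
additive_total = headcount_growth + sum(cost_drivers.values())

# Compounding reading (hypothetical, for comparison only): each factor
# multiplies the previous one instead of being added to it.
compounded = 1.0
for pct in [headcount_growth, *cost_drivers.values()]:
    compounded *= 1 + pct / 100
compounded_total = (compounded - 1) * 100

print(f"additive:   {additive_total}%")
print(f"compounded: {compounded_total:.2f}%")
print(f"matches reported figure: {additive_total == reported_cost_growth}")
```

The additive total lines up with the reported 15%, which suggests the breakdown is meant as a rough, readable decomposition rather than a precise compounding model. That fits the chapter's advice to be directionally correct and link out to the full data for anyone who wants a more exact cut.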
The next chapter on refining strategy covers a number of tools that are useful for building confidence in low-data environments. Whisper the controversial parts At one time, the company I worked at rolled out a bar raiser program styled after Amazon’s, where there was an interviewer from outside the team that had to approve every hire. I spent some time arguing against adding this additional step as I didn’t understand what we were solving for, and I was surprised at how disinterested management was in knowing if the new process actually improved outcomes. What I didn’t realize until much later was that most of the senior leadership distrusted one of their peers, and had rolled out the bar raiser program solely to create a mechanism to control that manager’s hiring bar when the CTO was disinterested in holding that leader accountable. (I also learned that these leaders didn’t care much about implementing this policy, resulting in bar raiser rejections being frequently ignored, but that’s a discussion for the Operations for strategy chapter.) This is a good example of a strategy that does make sense with the full diagnosis, but makes little sense without it, and where stating part of the diagnosis out loud is nearly impossible. Even senior leaders are not generally allowed to write a document that says, “The Director of Product Engineering is a bad hiring manager.” When you’re writing a strategy, you’ll often find yourself trying to choose between two awkward options: Say something awkward or uncomfortable about your company or someone working within it Omit a critical piece of your diagnosis that’s necessary to understand the wider thinking Whenever you encounter this sort of debate, my advice is to find a way to include the diagnosis, but to reframe it into a palatable statement that avoids casting blame too narrowly. 
I think it’s helpful to discuss a few concrete examples of this, starting with the strategy for navigating private equity, whose diagnosis includes: Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don’t have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction. There are many things the authors of this strategy likely feel about their state of reality. First, they are probably upset about the fact that their new private equity ownership is likely to eliminate colleagues. Second, they are likely upset that there is no clear plan around what they need to do, so they are stuck preparing for a wide range of potential outcomes. However they feel, they don’t say any of that, they stick to precise, factual statements. For a second example, we can look to the Uber service migration strategy: Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate as product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing. The team didn’t agree that their headcount should not be growing, but it was the reality they were operating in. They acknowledged their reality as a factual statement, without any additional commentary about that statement. In both of these examples, they found a professional, non-judgmental way to acknowledge the circumstances they were solving. The authors would have preferred that the leaders behind those decisions take explicit accountability for them, but it would have undermined the strategy work had they attempted to do it within their strategy writeup. 
Excluding critical parts of your diagnosis makes your strategies particularly hard to evaluate, copy or recreate. Find a way to say things politely to make the strategy effective. As always, strategies are much more about realities than ideals. Reframe blockers as part of diagnosis When I work on strategy with early-career leaders, an idea that comes up a lot is that an identified problem means that strategy is not possible. For example, they might argue that doing strategy work is impossible at their current company because the executive team changes their mind too often. That core insight is almost certainly true, but it’s much more powerful to reframe that as a diagnosis: if we don’t find a way to show concrete progress quickly, and use that to excite the executive team, our strategy is likely to fail. This transforms the thing preventing your strategy into a condition your strategy needs to address. Whenever you run into a reason why your strategy seems unlikely to work, or why strategy overall seems difficult, you’ve found an important piece of your diagnosis to include. There are never reasons why strategy simply cannot succeed, only diagnoses you’ve failed to recognize. For example, we knew in our work on Uber’s service provisioning strategy that we weren’t getting more headcount for the team, the product engineering team was going to continue growing rapidly, and that engineering leadership was unwilling to constrain how product engineering worked. Rather than preventing us from implementing a strategy, those components clarified what sort of approach could actually succeed. The role of self-awareness Every problem of today is partially rooted in the decisions of yesterday. If you’ve been with your organization for any duration at all, this means that you are directly or indirectly responsible for a portion of the problems that your diagnosis ought to recognize. 
This means that recognizing the impact of your prior actions in your diagnosis is a powerful demonstration of self-awareness. It also suggests that your next strategy’s success is rooted in your self-awareness about your prior choices. Don’t be afraid to recognize the failures in your past work. While changing your mind without new data is a sign of chaotic leadership, changing your mind with new data is a sign of thoughtful leadership. Summary Because diagnosis is the foundation of effective strategy, I’ve always found it the most intimidating phase of strategy work. While I think that’s a somewhat unavoidable reality, my hope is that this chapter has somewhat prepared you for that challenge. The four most important things to remember are simply: form your diagnosis before deciding how to solve it, try especially hard to capture perspectives you initially disagree with, supplement intuition with data where you can, and accept that sometimes you’re missing the data you need to fully understand. The last piece, in particular, is why many good strategies never get shared, and the topic we’ll address in the next chapter on strategy refinement.
A Live, Interactive Course for Systems Engineers
I’m sitting in a small coffee shop in Brooklyn. I have a warm drink, and it’s just started to snow outside. I’m visiting New York to see Operation Mincemeat on Broadway – I was at the dress rehearsal yesterday, and I’ll be at the opening preview tonight. I’ve seen this show more times than I care to count, and I hope US theater-goers love it as much as Brits. The people who make the show will tell you that it’s about a bunch of misfits who thought they could do something ridiculous, who had the audacity to believe in something unlikely. That’s certainly one way to see it. The musical tells the true story of a group of British spies who tried to fool Hitler with a dead body, fake papers, and an outrageous plan that could easily have failed. Decades later, the show’s creators would mirror that same spirit of unlikely ambition. Four friends, armed with their creativity, determination, and a wardrobe full of hats, created a new musical in a small London theatre. And after a series of transfers, they’re about to open the show under the bright lights of Broadway. But when I watch the show, I see a story about friendship. It’s about how we need our friends to help us, to inspire us, to push us to be the best versions of ourselves. I see the swaggering leader who needs a team to help him truly achieve. The nervous scientist who stands up for himself with the support of his friends. The enthusiastic secretary who learns wisdom and resilience from her elder. And so, I suppose, it’s fitting that I’m not in New York on my own. I’m here with friends – dozens of wonderful people who I met through this ridiculous show. At first, I was just an audience member. I sat in my seat, I watched the show, and I laughed and cried with equal measure. After the show, I waited at stage door to thank the cast. Then I came to see the show a second time. And a third. And a fourth. After a few trips, I started to see familiar faces waiting with me at stage door. 
So before the cast came out, we started chatting. Those conversations became a Twitter community, then a Discord, then a WhatsApp. We swapped fan art, merch, and stories of our favourite moments. We went to other shows together, and we hung out outside the theatre. I spent New Year’s Eve with a few of these friends, sitting on somebody’s floor and laughing about a bowl of limes like it was the funniest thing in the world. And now we’re together in New York. Meeting this kind, funny, and creative group of people might seem as unlikely as the premise of Mincemeat itself. But I believed it was possible, and here we are. I feel so lucky to have met these people, to take this ridiculous trip, to share these precious days with them. I know what a privilege this is – the time, the money, the ability to say let’s do this and make it happen. How many people can gather a dozen friends for even a single evening, let alone a trip halfway round the world? You might think it’s silly to travel this far for a theatre show, especially one we’ve seen plenty of times in London. Some people would never see the same show twice, and most of us are comfortably into double or triple-figures. Whenever somebody asks why, I don’t have a good answer. Because it’s fun? Because it’s moving? Because I enjoy it? I feel the need to justify it, as if there’s some logical reason that will make all of this okay. But maybe I don’t have to. Maybe joy doesn’t need justification. A theatre show doesn’t happen without people who care. Neither does a friendship. So much of our culture tells us that it’s not cool to care. It’s better to be detached, dismissive, disinterested. Enthusiasm is cringe. Sincerity is weakness. I’ve certainly felt that pressure – the urge to play it cool, to pretend I’m above it all. To act as if I only enjoy something a “normal” amount. Well, fuck that. I don’t know where the drive to be detached comes from. Maybe it’s to protect ourselves, a way to guard against disappointment. 
Maybe it’s to seem sophisticated, as if having passions makes us childish or less mature. Or perhaps it’s about control – if we stay detached, we never have to depend on others, we never have to trust in something bigger than ourselves. Being detached means you can’t get hurt – but you’ll also miss out on so much joy. I’m a big fan of being a big fan of things. So many of the best things in my life have come from caring, from letting myself be involved, from finding people who are a big fan of the same things as me. If I pretended not to care, I wouldn’t have any of that. Caring – deeply, foolishly, vulnerably – is how I connect with people. My friends and I care about this show, we care about each other, and we care about our joy. That care and love for each other is what brought us together, and without it we wouldn’t be here in this city. I know this is a once-in-a-lifetime trip. So many stars had to align – for us to meet, for the show we love to be successful, for us to be able to travel together. But if we didn’t care, none of those stars would have aligned. I know so many other friends who would have loved to be here but can’t be, for all kinds of reasons. Their absence isn’t for lack of caring, and they want the show to do well whether or not they’re here. I know they care, and that’s the important thing. To butcher Tennyson: I think it’s better to care about something you cannot affect, than to care about nothing at all. In a world that’s full of cynicism and spite and hatred, I feel that now more than ever. I’d recommend you go to the show if you haven’t already, but that’s not really the point of this post. Maybe you’ve already seen Operation Mincemeat, and it wasn’t for you. Maybe you’re not a theatre kid. Maybe you aren’t into musicals, or history, or war stories. That’s okay. I don’t mind if you care about different things to me. (Imagine how boring the world would be if we all cared about the same things!) But I want you to care about something. 
I want you to find it, find people who care about it too, and hold on to them. Because right now, in this city, with these people, at this show? I’m so glad I did. And I hope you find that sort of happiness too. Some of the people who made this trip special. Photo by Chloe, and taken from her Twitter. Timing note: I wrote this on February 15th, but I delayed posting it because I didn’t want to highlight the fact I was away from home.
One of the biggest mistakes that new startup founders make is trying to get away from the customer-facing roles too early. Whether it's customer support or it's sales, it's an incredible advantage to have the founders doing that work directly, and for much longer than they find comfortable. The absolute worst thing you can do is hire a sales person or a customer service agent too early. You'll miss all the golden nuggets that customers throw at you for free when they're rejecting your pitch or complaining about the product. Seeing these reasons paraphrased or summarized destroys all the nutrients in their insights. You want that whole-grain feedback straight from the customers' mouth! When we launched Basecamp in 2004, Jason was doing all the customer service himself. And he kept doing it like that for three years!! By the time we hired our first customer service agent, Jason was doing 150 emails/day. The business was doing millions of dollars in ARR. And Basecamp got infinitely better, both as a market proposition and as a product, because Jason could funnel all that feedback into decisions and positioning. For a long time after that, we did "Everyone on Support". Frequently rotating programmers, designers, and founders through a day of answering emails directly to customers. The dividends of doing this were almost as high as having Jason run it all in the early years. We fixed an incredible number of minor niggles and annoying bugs because programmers found it easier to solve the problem than to apologize for why it was there. It's not easy doing this! Customers often offer their valuable insights wrapped in rude language, unreasonable demands, and bad suggestions. That's why many founders quit the business of dealing with them at the first opportunity. That's why few companies ever do "Everyone On Support". That's why there's such eagerness to reduce support to an AI-only interaction. 
But quitting dealing with customers early, not just in support but also in sales, is an incredible handicap for any startup. You don't have to do everything that every customer demands of you, but you should certainly listen to them. And you can't listen well if the sound is being muffled by early layers of indirection.