66
I like to think that I write code deliberately. I’m an admittedly slow developer, and I want to believe I am slow on purpose. I want to know as much as I can about the context of what I'm automating. I also use a limited set of tools. I used old computers for a long time, partly out of an environmental mindset, but also because a slower computer quickly makes it obvious when something scales poorly. The idea is to seek friction, and harness it as an early signal that whatever I’m doing may need to be tweaked or readjusted. I find this friction, and even frustration in general, to be useful when learning as well. In opposition to the way I'd like to do things, everything about the tech industry is oriented towards elevated productivity, accelerated growth, and "easy" solutions to whole families of problems. I feel that maybe we should teach people to program the way they teach martial arts, like only in the most desperate situations when all else failed should you...
5 months ago


More from Ferd.ca

AI: Where in the Loop Should Humans Go?

This is a re-publishing of a blog post I originally wrote for work, but wanted on my own blog as well.

AI is everywhere, and its impressive claims are leading to rapid adoption. At this stage, I’d qualify it as charismatic technology: something that under-delivers on what it promises, but promises so much that the industry still leverages it because we believe it will eventually deliver on these claims.

This is a known pattern. In this post, I’ll use the example of automation deployments to go over known patterns and risks in order to provide you with a list of questions to ask about potential AI solutions.

I’ll first cover a short list of base assumptions, and then borrow from scholars of cognitive systems engineering and resilience engineering to list said criteria. At the core of it is the idea that when we say we want humans in the loop, it really matters where in the loop they are.

My base assumptions

The first thing I’m going to say is that we currently do not have Artificial General Intelligence (AGI). I don’t care whether we have it in 2 years or 40 years or never; if I’m looking to deploy a tool (or an agent) that is supposed to do stuff to my production environments, it has to be able to do it now. I am not looking to be impressed, I am looking to make my life and the system better.

Another mechanism I want you to keep in mind is something called the context gap. In a nutshell, any model or automation is constructed from a narrow definition of a controlled environment, which can expand as it gains autonomy, but remains limited. By comparison, people in a system start from a broad situation and narrow definitions down and add constraints to make problem-solving tractable.
One side starts from a narrow context, and one starts from a wide one, so in practice, with humans and machines, you end up seeing a type of teamwork where one constantly updates the other:

    "The optimal solution of a model is not an optimal solution of a problem unless the model is a perfect representation of the problem, which it never is."
    Ackoff (1979, p. 97)

Because of that mindset, I will disregard all arguments of “it’s coming soon” and “it’s getting better real fast” and instead frame what current LLM solutions are shaped like: tools and automation. As it turns out, there are lots of studies about ergonomics, tool design, collaborative design, where semi-autonomous components fit into sociotechnical systems, and how they tend to fail.

Additionally, I’ll borrow from the framing used by people who study joint cognitive systems: rather than looking only at what a single person or tool can do, we’re going to look at the overall performance of the joint system.

This is important because if you have a tool that is built to be operated like an autonomous agent, you can get weird results in your integration. You’re essentially building an interface for the wrong kind of component, like using a joystick to ride a bicycle.

This lens will assist us in establishing general criteria about where the problems will likely be without having to test for every single one and evaluate them on benchmarks against each other.

Questions you'll want to ask

The following list of questions is meant to act as reminders (abstracting away all the theory from research papers you’d need to read) to let you think through some of the important stuff your teams should track, whether they are engineers using code generation, SREs using AIOps, or managers and execs making the call to adopt new tooling.

Are you better even after the tool is taken away?

An interesting warning comes from studying how LLMs function as learning aids.
The researchers found that people who trained using LLMs tended to fail tests more when the LLMs were taken away compared to people who never studied with them, except if the prompts were specifically (and successfully) designed to help people learn.

Likewise, it’s been known for decades that when automation handles standard challenges, the operators expected to take over when they reach their limits end up worse off and generally require more training to keep the overall system performant.

While people can feel like they’re getting better and more productive with tool assistance, it doesn’t necessarily follow that they are learning or improving. Over time, there’s a serious risk that your overall system’s performance will be limited to what the automation can do, because without proper design, people keeping the automation in check will gradually lose the skills they had developed prior.

Are you augmenting the person or the computer?

Traditionally successful tools tend to work on the principle that they improve the physical or mental abilities of their operator: search tools let you go through more data than you could on your own and shift demands to external memory, a bicycle more effectively transmits force for locomotion, a blind spot alert on your car can extend your ability to pay attention to your surroundings, and so on.

Automation that augments users therefore tends to be easier to direct, and sort of extends the person’s abilities, rather than acting based on preset goals and framing. Automation that augments a machine tends to broaden the device’s scope and control by leveraging some known effects of their environment and successfully hiding them away. For software folks, an autoscaling controller is a good example of the latter.

Neither is fundamentally better nor worse than the other, but you should figure out what kind of automation you’re getting, because they fail differently.
Augmenting the user implies that they can tackle a broader variety of challenges effectively. Augmenting the computers tends to mean that when the component reaches its limits, the challenges are worse for the operator.

Is it turning you into a monitor rather than helping build an understanding?

If your job is to look at the tool go and then say whether it was doing a good or bad job (and maybe take over if it does a bad job), you’re going to have problems. It has long been known that people adapt to their tools, and automation can create complacency. Self-driving cars that generally drive themselves well but still require a monitor are not effectively monitored.

Instead, having AI that supports people or adds perspectives to the work an operator is already doing tends to yield better long-term results than patterns where the human learns to mostly delegate and focus elsewhere.

(As a side note, this is why I tend to dislike incident summarizers. Don’t make it so people stop trying to piece together what happened! Instead, I prefer seeing tools that look at your summaries to remind you of items you may have forgotten, or that look for linguistic cues that point to biases or reductive points of view.)

Does it pigeonhole what you can look at?

When evaluating a tool, you should ask questions about where the automation lands:

- Does it let you look at the world more effectively?
- Does it tell you where to look in the world?
- Does it force you to look somewhere specific?
- Does it tell you to do something specific?
- Does it force you to do something?

This is a bit of a hybrid between “Does it extend you?” and “Is it turning you into a monitor?” The five questions above let you figure that out.

As the tool becomes a source of assertions or constraints (rather than a source of information and options), the operator becomes someone who interacts with the world from inside the tool rather than someone who interacts with the world with the tool’s help.
The tool stops being a tool and becomes a representation of the whole system, which means whatever limitations and internal constraints it has are then transmitted to your users.

Is it a built-in distraction?

People tend to do multiple tasks over many contexts. Some automated systems are built with alarms or alerts that require stealing someone’s focus, and unless they truly are the most critical thing their users could give attention to, they are going to be an annoyance that can lower the effectiveness of the overall system.

What perspectives does it bake in?

Tools tend to embody a given perspective. For example, AIOps tools that are built to find a root cause will likely carry the conceptual framework behind root causes in their design. More subtly, these perspectives are sometimes hidden in the type of data you get: if your AIOps agent can only see alerts, your telemetry data, and maybe your code, it will rarely be a source of suggestions on how to improve your workflows because that isn’t part of its world.

In roles that are inherently about pulling context from many disconnected sources, how on earth is automation going to make the right decisions? And moreover, who’s accountable when it makes a poor decision on incomplete data? Surely not the buyer who installed it!

This is also one of the many ways in which automation can reinforce biases, not just based on what is in its training data, but also based on its own structure and what inputs were considered most important at design time. The tool can itself become a keyhole through which your conclusions are guided.

Is it going to become a hero?

A common trope in incident response is heroes: the few people who know everything inside and out, and who end up being necessary bottlenecks to all emergencies. They can’t go away for vacation, they’re too busy to train others, they develop blind spots that nobody can fix, and they can’t be replaced.
To avoid this, you have to maintain a continuous awareness of who knows what, and cross-train each other to always have enough redundancy.

If you have a team of multiple engineers and you add AI to it, having it do all of the tasks of a specific kind means it becomes a de facto hero to your team. If that’s okay, be aware that any outages or dysfunction in the AI agent would likely have no practical workaround. You will essentially have offshored part of your ops.

Do you need it to be perfect?

What a thing promises to be is never what it is; otherwise AWS would be enough, and Kubernetes would be enough, and JIRA would be enough, and the software would work fine with no one needing to fix things.

That just doesn’t happen. Ever. Even if it’s really, really good, it’s gonna have outages and surprises, and it’ll mess up here and there, no matter what it is. We aren’t building an omnipotent computer god, we’re building imperfect software.

You’ll want to seriously consider whether the tradeoffs you’d make in terms of quality and cost are worth it, and this is going to be on a case-by-case basis. Just be careful not to fix the problem by adding a human in the loop that acts as a monitor!

Is it doing the whole job or a fraction of it?

We don’t notice major parts of our own jobs because they feel natural. A classic pattern here is one of AIs getting better at diagnosing patients, except the benchmarks are usually run on a patient chart where most of the relevant observations have already been made by someone else. Similarly, we often see AI pass a test with flying colors while it still can’t be productive at the job the test represents.

People in general have adopted a model of cognition based on information processing that’s very similar to how computers work (get data in, think, output stuff, rinse and repeat), but for decades, there have been multiple disciplines that looked harder at situated work and cognition, moving past that model.
Key patterns of cognition are not just in the mind, but are also embedded in the environment and in the interactions we have with each other.

Be wary of acquiring a solution that solves what you think the problem is rather than what it actually is. We routinely show we don’t accurately know the latter.

What if we have more than one?

You probably know how straightforward it can be to write a toy project on your own, with full control of every refactor. You probably also know how this stops being true as your team grows.

As it stands today, a lot of AI agents are built within a snapshot of the current world: one or a few AI tools added to teams that are mostly made up of people. By analogy, this would be like everyone selling you a computer assuming it were the first and only electronic device inside your household.

Problems arise when you go beyond these assumptions: maybe AI that writes code has to go through a code review process, but what if that code review is done by another unrelated AI agent? What happens when you get to operations and common mode failures impact components from various teams that all have agents empowered to go fix things to the best of their ability with the available data? Are they going to clash with people, or even with each other?

Humans also have that ability and tend to solve it via processes and procedures, explicit coordination, announcing what they’ll do before they do it, and calling upon each other when they need help. Will multiple agents require something equivalent, and if so, do you have it in place?

How do they cope with limited context?

Some changes that cause issues might be safe to roll back, some not (maybe they include database migrations, maybe it is better to be down than corrupting data), and some may contain changes that rolling back wouldn’t fix (maybe the workload is controlled by one or more feature flags).
Knowing what to do in these situations can sometimes be understood from code or release notes, but some situations can require different workflows involving broader parts of the organization. A risk of automation without context is that if you have situations where waiting or doing little is the best option, then you’ll need to either have automation that requires input to act, or a set of actions to quickly disable multiple types of automation as fast as possible.

Many of these may exist at the same time, and it becomes the operators’ job to not only maintain their own context, but also maintain a mental model of the context each of these pieces of automation has access to.

The fancier your agents, the fancier your operators’ understanding and abilities must be to properly orchestrate them. The more surprising your landscape is, the harder it can become to manage with semi-autonomous elements roaming around.

After an outage or incident, who does the learning and who does the fixing?

One way to track accountability in a system is to figure out who ends up having to learn lessons and change how things are done. It’s not always the same people or teams, and generally, learning will happen whether you want it or not.

This is more of a rhetorical question right now, because I expect that in most cases, when things go wrong, whoever is expected to monitor the AI tool is going to have to steer it in a better direction and fix it (if they can); if it can’t be fixed, then the expectation will be that the automation, as a tool, will be used more judiciously in the future.

In a nutshell, if the expectation is that your engineers are going to be doing the learning and tweaking, your AI isn’t an independent agent; it’s a tool that cosplays as an independent agent.

Do what you will, just be mindful

All in all, none of the above questions flat out say you should not use AI, nor where exactly in the loop you should put people.
The key point is that you should ask that question and be aware that just adding whatever to your system is not going to substitute workers away. It will, instead, transform work and create new patterns and weaknesses.

Some of these patterns are known and well-studied. We don’t have to go rushing to rediscover them all through failures as if we were the first to ever automate something. If AI ever gets so good and so smart that it’s better than all your engineers, it won’t make a difference whether you adopt it only once it’s good. In the meanwhile, these things do matter and have real impacts, so please design your systems responsibly.

If you’re interested to know more about the theoretical elements underpinning this post, the following references, on top of whatever was already linked in the text, might be of interest:

Books:

- Joint Cognitive Systems: Foundations of Cognitive Systems Engineering by Erik Hollnagel
- Joint Cognitive Systems: Patterns in Cognitive Systems Engineering by David D. Woods
- Cognition in the Wild by Edwin Hutchins
- Behind Human Error by David D. Woods, Sidney Dekker, Richard Cook, Leila Johannesen, Nadine Sarter

Papers:

- Ironies of Automation by Lisanne Bainbridge
- The French-Speaking Ergonomists’ Approach to Work Activity by Daniellou
- How in the World Did We Ever Get into That Mode? Mode Error and Awareness in Supervisory Control by Nadine Sarter
- Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis by David D. Woods
- Ten Challenges for Making Automation a “Team Player” in Joint Human-Agent Activity by Gary Klein and David D. Woods
- MABA-MABA or Abracadabra? Progress on Human–Automation Co-ordination by Sidney Dekker
- Managing the Hidden Costs of Coordination by Laura Maguire
- Designing for Expertise by David D. Woods
- The Impact of Generative AI on Critical Thinking by Lee et al.

a month ago 26 votes
Carrots, sticks, and making things worse

This blog post originally appeared on the LFI blog but I decided to post it on my own as well.

Every organization has to contend with limits: scarcity of resources, people, attention, or funding, friction from scaling, inertia from previous code bases, or a quickly shifting ecosystem. And of course there are more, like time, quality, effort, or how much can fit in anyone's mind. There are so many ways for things to go wrong; your ongoing success comes in no small part from the people within your system constantly navigating that space, making sacrifice decisions and trading off some things to buy runway elsewhere.

From time to time, these come to a head in what we call a goal conflict, where two important attributes clash with each other. These are not avoidable, and in fact are just assumed to be so in many cases, such as "cheap, fast, and good; pick two". But somehow, when it comes to more specific details of our work, that clarity hides itself or gets obscured by the veil of normative judgments.

It is easy after an incident to think of what people could have done differently, of signals they should have listened to, or of consequences they would have foreseen had they just been a little bit more careful. From this point of view, the idea of reinforcing desired behaviors through incentives, both positive (bonuses, public praise, promotions) and negative (demerits, re-certification, disciplinary reviews) can feel attractive. (Do note here that I am specifically talking of incentives around specific decision-making or performance, rather than broader ones such as wages, perks, overtime or hazard pay, or employment benefits, even though effects may sometimes overlap.)

But this perspective itself is a trap.
Hindsight bias (where we overestimate how predictable outcomes were after the fact) and its close relative outcome bias (where knowing the results after the fact tints how we judge the decision made) both serve as good reminders that we should ideally look at decisions as they were being made, with the information known and pressures present then.

This is generally made easier by assuming people were trying to do a good job and get good results; a judgment that seems to make no sense asks of us that we figure out how it seemed reasonable at the time. Events were likely challenging, resources were limited (including cognitive bandwidth), and context was probably uncertain. If you were looking for goal conflicts and difficult trade-offs, this is certainly a promising area in which they can be found.

Taking people's desire for good outcomes for granted forces you to shift your perspective. It demands you move away from thinking that somehow more pressure toward succeeding would help. It makes you ask what aid could be given to navigate the situation better, how the context could be changed for the trade-offs to be negotiated differently next time around. It lets us move away from wondering how we can prevent mistakes and move toward how we could better support our participants.

Hell, the idea of rewarding desired behavior feels enticing even in cases where your review process does not fall into the traps mentioned here, where you take a more just approach. But the core idea here is that you can't really expect different outcomes if the pressures and goals that gave rise to them don't change either.

During incidents, priorities in play already are things like "I've got to fix this to keep this business alive", stabilizing the system to prevent large cascades, or trying to prevent harm to users or customers. They come with stress, adrenalin, and sometimes a sense of panic or shock.
These are likely to rank higher in the minds of people than “what’s my bonus gonna be?” or “am I losing a gift card or some plaque if I fail?”

Adding incentives, whether positive or negative, does not clarify the situation. It does not address goal conflicts. It adds more variables to the equation, complexifies the situation, and likely makes it more challenging.

Chances are that people will make the same decisions they would have made (and have been making continuously) in the past, obtaining the desired outcomes. Instead, they’ll change what they report later in subtle ways, by either tweaking or hiding information to protect themselves, or by gradually losing trust in the process you've put in place.

These effects can be amplified when teams are given hard-to-meet abstract targets such as lowering incident counts, which can actively interfere with incident response by creating new decision points in people's mental flows. If responders have to discuss and classify the nature of an incident to fit an accounting system unrelated to solving it right now, their response is likely made slower, more challenging.

This is not to say all attempts at structure and classification would hinder proper response, though. Clarifying the critical elements to salvage first, creating cues and language for patterns that will be encountered, and agreeing on strategies that support effective coordination across participants can all be really productive.

It needs to be done with a deeper understanding of how your incident response actually works, and that sometimes means unpleasant feedback about how people perceive your priorities. I've been in reviews where people stated things like "we know that we get yelled at more for delivering features late than broken code so we just shipped broken code since we were out of time", or admitted to ignoring execs who made a habit of coming down from above to scold employees into fixing things they were pressured into doing anyway.
These can be hurtful for an organization to consider, but they are nevertheless a real part of how people deal with exceptional situations.

By trying to properly understand the challenges, by clarifying the goal conflicts that arise in systems and result in sometimes frustrating trade-offs, and by making learning from these experiences an objective of its own, we can hopefully make things a bit better. Grounding our interventions within a richer, more naturalistic understanding of incident response and all its challenges is a small, albeit critical, part of it all.

6 months ago 58 votes
My Blog Engine is the Erlang Build Tool

From time to time, people ask me what I use to power my blog, maybe because they like the minimalist form it has. I tell them it’s a bad idea and that I use the Erlang compiler infrastructure for it, and they agree to look elsewhere.

After launching my notes section, I had to fully clean up my engine. I thought I could write about how it works because it’s fairly unique and interesting, even if you should probably not use it.

The Requirements

I first started my blog 14 years ago. It had roughly the same structure as it does at the time of writing this: a list of links and text with nothing else. It did poorly with mobile (which was still sort of new but I should really work to improve these days), but okay with screen readers. It’s gotta be minimal enough to load fast on old devices.

There’s absolutely nothing dynamic on here. No JavaScript, no comments, no tracking, and I’m pretty sure I’ve disabled most logging and retention.

I write into a void, either transcribing talks or putting down rants I’ve repeated 2-3 times to other people so it becomes faster to just link things in the future. I mostly don’t know what gets read or not, but over time I found this kept the experience better for me than chasing readers or views.

Basically, a static site is the best technology for me, but from time to time it’s nice to be able to update the layout, add some features (like syntax highlighting or an RSS feed) so it needs to be better than flat HTML files.

Internally it runs with erlydtl, an Erlang implementation of Django Templates, which I really liked a decade and a half ago. It supports template inheritance, which is really neat to minimize files I have to edit. All I have is a bunch of files containing my posts, a few of these templates, and a little bit of Rebar3 config tying them together.
There are some features that erlydtl doesn’t support but that I wanted anyway, notably syntax highlighting (without JavaScript), markdown support, and including subsections of HTML files (a weird corner case to support RSS feeds without powering them with a database). The feature I want to discuss here is “only rebuild what you strictly need to,” which I covered by using the Rebar3 compiler.

Rebar3’s Compiler

Rebar3 is the Erlang community’s build tool, which Tristan and I launched over 10 years ago, a successor to the classic rebar 2.x script. A funny requirement for Rebar3 is that Erlang has multiple compilers: one for Erlang, but also one for MIB files (for SNMP), the Leex lexical analyzer generator, and the Yecc parser generator. It also is plugin-friendly in order to compile Elixir modules, and other BEAM languages, like LFE, or very early versions of Gleam.

We needed to support at least four compilers out of the box, and to properly track everything such that we only rebuild what we must. This is done using a Directed Acyclic Graph (DAG) to track all files, including build artifacts.

The Rebar3 compiler infrastructure works by breaking up the flow of compilation into a generic and a specific subset.

The specific subset will:

- Define which file types and paths must be considered by the compiler.
- Define which files are dependencies of other files.
- Be given a graph of all files and their artifacts with their last modified times (and metadata), and specify which of them need rebuilding.
- Compile individual files and provide metadata to track the artifacts.

The generic subset will:

- Scan files and update their timestamps in a graph for the last modifications.
- Use the dependency information to complete the dependency graph.
- Propagate the timestamps of source files modifications transitively through the graph (assume you update header A, included by header B, applied by macro C, on file D; then B, C, and D are all marked as modified as recently as A in the DAG).
- Pass this updated graph to the specific part to get a list of files to build (usually by comparing which source files are newer than their artifacts, but also if build options changed).
- Schedule sequential or parallel compilation based on what the specific part specified.
- Update the DAG with the artifacts and build metadata, and persist the data to disk.

In short, you build a compiler plugin that can name directories, file extensions, dependencies, and can compare timestamps and metadata. Then make sure this plugin can compile individual files, and the rest is handled for you.

The blog engine

Since I’m currently the most active Rebar3 maintainer, I’ve definitely got to maintain the compiler infrastructure described earlier. Since my blog needed to rebuild the fewest static files possible and I already used a template compiler, plugging it into Rebar3 became the solution demanding the least effort.

It requires a few hundred lines of code to write the plugin and a bit of config looking like this:

    {blog3r, [
        {vars, [
            {url, [
                {base, "https://ferd.ca/"},
                {notes, "https://ferd.ca/notes/"},
                {img, "https://ferd.ca/static/img/"},
                ...
            ]},
        %% Main site
        {index, #{template => "index.tpl", out => "index.html", section => main}},
        {index, #{template => "rss.tpl", out => "feed.rss", section => main}},
        %% Notes section
        {index, #{template => "index-notes.tpl", out => "notes/index.html", section => notes}},
        {index, #{template => "rss-notes.tpl", out => "notes/feed.rss", section => notes}},
        %% All sections' pages.
        {sections, #{
            main => {"posts/", "./", [
                {"Mon, 02 Sep 2024 11:00:00 EDT", "My Blog Engine is the Erlang Build Tool", "blog-engine-erlang-build-tool.md.tpl"},
                {"Thu, 30 May 2024 15:00:00 EDT", "The Review Is the Action Item", "the-review-is-the-action-item.md.tpl"},
                {"Tue, 19 Mar 2024 11:00:00 EDT", "A Commentary on Defining Observability", "a-commentary-on-defining-observability.md.tpl"},
                {"Wed, 07 Feb 2024 19:00:00 EST", "A Distributed Systems Reading List", "distsys-reading-list.md.tpl"},
                ...
            ]},
            notes => {"notes/", "notes/", [
                {"Fri, 16 Aug 2024 10:30:00 EDT", "Paper: Psychological Safety: The History, Renaissance, and Future of an Interpersonal Construct", "papers/psychological-safety-interpersonal-construct.md.tpl"},
                {"Fri, 02 Aug 2024 09:30:00 EDT", "Atomic Accidents and Uneven Blame", "atomic-accidents-and-uneven-blame.md.tpl"},
                {"Sat, 27 Jul 2024 12:00:00 EDT", "Paper: Moral Crumple Zones", "papers/moral-crumple-zones.md.tpl"},
                {"Tue, 16 Jul 2024 19:00:00 EDT", "Hutchins' Distributed Cognition in The Wild", "hutchins-distributed-cognition-in-the-wild.md.tpl"},
                ...
            ]}
        }}
    ]}.

And blog post entry files like this:

    {% extends "base.tpl" %}

    {% block content %}
    <p>I like cats. I like food. <br />
    I don't especially like catfood though.</p>

    {% markdown %}
    ### Have a subtitle

    And then _all sorts_ of content!

    - lists
    - other lists
    - [links]({{ url.base }}section/page)
    - and whatever fits a demo

    > Have a quote to close this out
    {% endmarkdown %}

    {% endblock %}

These call to a parent template (see base.tpl for the structure) to inject their content.

The whole site gets generated that way. Even compiler error messages are lifted from the Rebar3 libraries (although I haven't wrapped everything perfectly yet), with the following coming up when I forgot to close an if tag before closing a for loop:

    $ rebar3 compile
    ===> Verifying dependencies...
    ===> Analyzing applications...
    ===> Compiling ferd_ca
    ===> template error:
       ┌─ /home/ferd/code/ferd-ca/templates/rss.tpl:
       │
    24 │ {% endfor %}
       │ ╰── syntax error before: "endfor"

    ===> Compiling templates/rss.tpl failed

As you can see, I build my blog by calling rebar3 compile, the same command I use for any Erlang project.

I find it interesting that on one hand, this is pretty much the best design possible for me given that it represents almost no new code, no new tools, and no new costs. It’s quite optimal. On the other hand, it’s possibly the worst possible tool chain imaginable for a blog engine for almost anybody else.
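
As an aside on the generic half of the Rebar3 compiler described earlier, the timestamp propagation step (header A, included by B, applied by C, on file D) can be illustrated outside of Erlang. The Python below is only a minimal sketch of the idea and not Rebar3's actual implementation: the function names `propagate_mtimes` and `needs_rebuild`, the dictionary-based graph, the file names, and the integer timestamps are all assumptions made for the example, and it ignores the "build options changed" case mentioned above.

```python
from typing import Dict, List


def propagate_mtimes(deps: Dict[str, List[str]], mtimes: Dict[str, int]) -> Dict[str, int]:
    """Return an effective timestamp per file: a file counts as modified
    at least as recently as anything it transitively depends on.

    `deps` maps a file to the files it depends on (assumed to be acyclic).
    """
    effective: Dict[str, int] = {}

    def visit(node: str) -> int:
        if node in effective:  # already computed, reuse it
            return effective[node]
        newest = mtimes[node]
        for dep in deps.get(node, []):
            newest = max(newest, visit(dep))
        effective[node] = newest
        return newest

    for node in mtimes:
        visit(node)
    return effective


def needs_rebuild(artifacts: Dict[str, str], effective: Dict[str, int],
                  artifact_mtimes: Dict[str, int]) -> List[str]:
    """List sources whose artifact is missing or older than the source's
    effective timestamp."""
    return sorted(
        src for src, art in artifacts.items()
        if art not in artifact_mtimes or effective[src] > artifact_mtimes[art]
    )


# Hypothetical files mirroring the A/B/C/D example: only A.hrl changed recently.
deps = {"B.hrl": ["A.hrl"], "C.hrl": ["B.hrl"], "D.erl": ["C.hrl"], "E.erl": []}
mtimes = {"A.hrl": 40, "B.hrl": 10, "C.hrl": 10, "D.erl": 10, "E.erl": 5}

effective = propagate_mtimes(deps, mtimes)
stale = needs_rebuild({"D.erl": "D.beam", "E.erl": "E.beam"},
                      effective, {"D.beam": 20, "E.beam": 20})
print(stale)  # ['D.erl']
```

In this sketch, D.erl is rebuilt even though the file itself is older than its artifact, because the change to A.hrl is carried down the dependency chain; E.erl has no dependency on A.hrl and is left alone.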

7 months ago 66 votes
The Review Is the Action Item

2024/05/30 The Review Is the Action Item I like to consider running an incident review to be its own action item. Other follow-ups emerging from it are a plus, but the point is to learn from incidents, and the review gives room for that to happen. This is not surprising advice if you’ve read material from the LFI community and related disciplines. However, there are specific perspectives required to make this work, and some assumptions necessary for it, without which things can break down. How can it work? In a more traditional view, the system is believed to be stable, then disrupted into an incident. The system gets stabilized, and we must look for weaknesses that can be removed or barriers that could be added in order to prevent such disruption in the future. Other perspectives for systems include views where they are never truly stable. Things change constantly; uncertainty is normal. Under that lens, systems can’t be forced into stability by control or authority. They can be influenced and adapt on an ongoing basis, and possibly kept in balance through constant effort. Once you adopt a socio-technical perspective, the hard-to-model nature of humans becomes a desirable trait to cope with chaos. Rather than a messy variable to stamp out, you’ll want to give them more tools and ways to keep all the moving parts of the subsystems going. There, an incident review becomes an arena where misalignment in objectives can be repaired, where strategies and tactics can be discussed, where mental models can be corrected and enriched, where voices can be heard when they wouldn’t be, and where we are free to reflect on the messy reality that drove us here. This is valuable work, and establishing an environment where it takes place is a key action item on its own. People who want to keep things working will jump on this opportunity if they see any value in it. 
Rather than giving them tickets to work on, we’re giving them a safe context to surface and discuss useful information. They’ll carry that information with them in the future, and it may influence the decisions they make, here and elsewhere. If the stories that come out of reviews are good enough, they will be retold to others, and the organization will have learned something. That belief people will do better over time as they learn, to me, tends to be worth more than focusing on making room for a few tickets in the backlog. How can it break down? One of the unnamed assumptions with this whole approach is that teams should have the ability to influence their own roadmap and choose some of their own work. A staunchly top-down organization may leverage incident reviews as a way to let people change the established course with a high priority. That use of incident reviews can’t be denied in these contexts. We want to give people the information and the perspectives they need to come up with fixes that are effective. Good reviews with action items ought to make sense, particularly in these orgs where most of the work is normally driven by folks outside of the engineering teams. But if the maintainers do not have the opportunity to schedule work they think needs doing outside of the aftermath of an incident—work that is by definition reactive—then they have no real power to schedule preventive work on their own. And so that’s a place where learning being its own purpose breaks down: when the learnings can’t be applied. Maybe it feels like “good” reviews focused on learning apply to a surprisingly narrow set of teams then, because most teams don’t have that much control. The question here really boils down to “who is it that can apply things they learned, and when?” If the answer is “almost no one, and only when things explode,” that’s maybe a good lesson already. That’s maybe where you’d want to start remediating. 
Note that even this perspective is a bit reductionist, which is also another way in which learning reviews may break down. By narrowing knowledge’s utility only to when it gets applied in measurable scheduled work, we stop finding value outside of this context, and eventually stop giving space for it. It’s easy to forget that we don’t control what people learn. We don’t choose what the takeaways are. Everyone does it for themselves based on their own lived experience. More importantly, we can’t stop people from using the information they learned, whether at work or in their personal life. Lessons learned can be applied anywhere and any time, and they can become critically useful at unexpected times. Narrowing the scope of your reviews such that they only aim to prevent bad accidents indirectly hinders creating fertile grounds for good surprises as well. Going for better While the need for action items is almost always there, a key element of improving incident reviews is to not make corrections the focal point. Consider the incident review as a preliminary step, the data-gathering stage before writing down the ideas. You’re using recent events as a study of what’s surprising within the system, but also of how it is that things usually work well. Only once that perspective is established does it make sense to start thinking of ways of modifying things. Try it with only one or two reviews at first. Minor incidents are usually good, because following the methods outlined in docs like the Etsy Debriefing Facilitation Guide and the Howie guide tends to reveal many useful insights in incidents people would have otherwise overlooked as not very interesting. As you and your teams see value, expand to more and more incidents. It also helps to set the tone before and during the meetings. I’ve written a set of “ground rules” we use at Honeycomb and that my colleague Lex Neva has transcribed, commented, and published. 
See if something like that could adequately frame the session. If abandoning the idea of action items seems irresponsible or impractical to you, keep them. But keep them with some distance; the common tip given by the LFI community is to schedule another meeting after the review to discuss them in isolation. At some point, that follow-up meeting may become disjoint from the reviews. There’s not necessarily a reason why every incident needs a dedicated set of fixes (longer-term changes impacting them could already be in progress, for example), nor is there a reason to wait for an incident to fix things and improve them. That’s when you decouple understanding from fixing, and the incident review becomes its own sufficient action item.

10 months ago 84 votes

More in programming

We'll always need junior programmers

We received over 2,200 applications for our just-closed junior programmer opening, and now we're going through all of them by hand and by human. No AI screening here. It's a lot of work, but we have a great team who take the work seriously, so in a few weeks, we'll be able to invite a group of finalists to the next phase. This highlights the folly of thinking that what it'll take to land a job like this is some specific list of criteria, though. Yes, you have to present a baseline of relevant markers to even get into consideration, like a great cover letter that doesn't smell like AI slop, promising projects or work experience or educational background, etc. But to actually get the job, you have to be the best of the ones who've applied! It sounds self-evident, maybe, but I see questions time and again about it, so it must not be. Almost every job opening is grading applicants on the curve of everyone who has applied. And the best candidate of the lot gets the job. You can't quantify what that looks like in advance. I'm excited to see who makes it to the final stage. I already hear early whispers that we got some exceptional applicants in this round. It would be great to help counter the narrative that this industry no longer needs juniors. That's simply retarded. However good AI gets, we're always going to need people who know the ins and outs of what the machine comes up with. Maybe not as many, maybe not in the same roles, but it's truly utopian thinking that mankind won't need people capable of vetting the work done by AI in five minutes.

11 hours ago 4 votes
Requirements change until they don't

Recently I got a question on formal methods[1]: how does it help to mathematically model systems when the system requirements are constantly changing? It doesn't make sense to spend a lot of time proving a design works, and then deliver the product and find out it's not at all what the client needs. As the saying goes, the hard part is "building the right thing", not "building the thing right". One possible response: "why write tests"? You shouldn't write tests, especially lots of unit tests ahead of time, if you might just throw them all away when the requirements change. This is a bad response because we all know the difference between writing tests and formal methods: testing is easy and FM is hard. Testing requires low cost for moderate correctness, FM requires high(ish) cost for high correctness. And when requirements are constantly changing, "high(ish) cost" isn't affordable and "high correctness" isn't worthwhile, because a kinda-okay solution that solves a customer's problem is infinitely better than a solid solution that doesn't. But eventually you get something that solves the problem, and what then? Most of us don't work for Google, we can't axe features and products on a whim. If the client is happy with your solution, you are expected to support it. It should work when your customers run into new edge cases, or migrate all their computers to the next OS version, or expand into a market with shoddy internet. It should work when 10x as many customers are using 10x as many features. It should work when you add new features that come into conflict. And just as importantly, it should never stop solving their problem. Canonical example: your feature involves processing requested tasks synchronously. At scale, this doesn't work, so to improve latency you make it asynchronous. Now it's eventually consistent, but your customers were depending on it being always consistent. Now it no longer does what they need, and has stopped solving their problems.
Every successful requirement met spawns a new requirement: "keep this working". That requirement is permanent, or close enough to decide our long-term strategy. It takes active investment to keep a feature behaving the same as the world around it changes. (Is this all a pretentious way of saying "software maintenance is hard?" Maybe!)

Phase changes

In physics there's a concept of a phase transition. To raise the temperature of a gram of liquid water by 1°C, you have to add 4.184 joules of energy.[2] This continues until you raise it to 100°C, then it stops. After you've added two thousand joules to that gram, it suddenly turns into steam. The energy of the system changes continuously but the form, or phase, changes discretely. Software isn't physics but the idea works as a metaphor. A certain architecture handles a certain level of load, and past that you need a new architecture. Or a bunch of similar features are independently hardcoded until the system becomes too messy to understand, and you remodel the internals into something unified and extendable. etc etc etc. It doesn't have to be a totally discrete phase transition, but there's definitely a "before" and "after" in the system form. Phase changes tend to lead to more intricacy/complexity in the system, meaning it's likely that a phase change will introduce new bugs into existing behaviors. Take the synchronous vs asynchronous case. A very simple toy model of synchronous updates would be Set(key, val), which updates data[key] to val.[3] A model of asynchronous updates would be AsyncSet(key, val, priority), which adds a (key, val, priority, server_time()) tuple to a tasks set, and then another process asynchronously pulls a tuple (ordered by highest priority, then earliest time) and calls Set(key, val).
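To make the two toy models concrete, here is a rough Python sketch. It is not a formal specification and cannot check properties exhaustively; it only makes the moving parts visible. The Store class, the method names, and the counter standing in for server_time() are all assumptions made for illustration.

```python
import heapq
import itertools


class Store:
    """Toy model of the synchronous and asynchronous update paths."""

    def __init__(self):
        self.data = {}
        self.tasks = []                  # heap of (-priority, time, key, val)
        self._clock = itertools.count()  # hypothetical stand-in for server_time()

    def set(self, key, val):
        """Synchronous model: the write is visible immediately."""
        self.data[key] = val

    def async_set(self, key, val, priority):
        """Asynchronous model: only enqueue the write."""
        # Negating priority makes the min-heap pop the highest priority first;
        # the unique timestamp breaks ties in favour of the earliest request.
        heapq.heappush(self.tasks, (-priority, next(self._clock), key, val))

    def process_one(self):
        """The background process: apply one queued write, if any."""
        if not self.tasks:
            return False
        _, _, key, val = heapq.heappop(self.tasks)
        self.set(key, val)
        return True


store = Store()
store.async_set("x", 1, priority=0)
read_right_after = store.data.get("x")  # the write has not been applied yet
store.async_set("y", 2, priority=5)
store.process_one()                     # the higher-priority write to "y" goes first
after_one = dict(store.data)
store.process_one()                     # now the write to "x" lands
```

Even at this size, the sketch shows a write that is not immediately readable and a later request overtaking an earlier one, which is the kind of behavior change a client may not have agreed to.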
Here are some properties the client may need preserved as a requirement:

- If AsyncSet(key, val, _) is called, then eventually data[key] = val (possibly violated if higher-priority tasks keep coming in)
- If someone calls AsyncSet(key1, val1, low) and then AsyncSet(key2, val2, low), they should see the first update and then the second (linearizability, possibly violated if the requests go to different servers with different clock times)
- If someone calls AsyncSet(key, val, _) and immediately reads data[key] they should get val (obviously violated, though the client may accept a slightly weaker property)

If the new system doesn't satisfy an existing customer requirement, it's prudent to fix the bug before releasing the new system. The customer doesn't notice or care that your system underwent a phase change. They'll just see that one day your product solves their problems, and the next day it suddenly doesn't. This is one of the most common applications of formal methods. Both of those systems, and every one of those properties, are formally specifiable in a specification language. We can then automatically check that the new system satisfies the existing properties, and from there do things like automatically generate test suites. This does take a lot of work, so if your requirements are constantly changing, FM may not be worth the investment. But eventually requirements stop changing, and then you're stuck with them forever. That's where models shine.

[1] As always, I'm using formal methods to mean the subdiscipline of formal specification of designs, leaving out the formal verification of code. Mostly because "formal specification" is really awkward to say.
[2] Also called a "calorie". The US "dietary Calorie" is actually a kilocalorie.
[3] This is all directly translatable to a TLA+ specification, I'm just describing it in English to avoid paying the syntax tax.

8 hours ago 2 votes
How should Stripe deprecate APIs? (~2016)

While Stripe is a widely admired company for things like its creation of the Sorbet typer project, I personally think that Stripe’s most interesting strategy work is also among its most subtle: its willingness to significantly prioritize API stability. This strategy is almost invisible externally. Internally, discussions around it were frequent and detailed, but mostly confined to dedicated API design conversations. API stability isn’t just a technical design quirk, it’s a foundational decision in an API-driven business, and I believe it is one of the unsung heroes of Stripe’s business success. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Reading this document To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document. Policy & Operation Our policies for managing API changes are: Design for long API lifetime. APIs are not inherently durable. Instead we have to design thoughtfully to ensure they can support change. When designing a new API, build a test application that doesn’t use this API, then migrate to the new API. Consider how integrations might evolve as applications change. Perform these migrations yourself to understand potential friction with your API. Then think about the future changes that we might want to implement on our end. How would those changes impact the API, and how would they impact the application you’ve developed. At this point, take your API to API Review for initial approval as described below. Following that approval, identify a handful of early adopter companies who can place additional pressure on your API design, and test with them before releasing the final, stable API. 
All new and modified APIs must be approved by API Review. API changes may not be enabled for customers prior to API Review approval. Change requests should be sent to api-review email group. For examples of prior art, review the api-review archive for prior requests and the feedback they received. All requests must include a written proposal. Most requests will be approved asynchronously by a member of API Review. Complex or controversial proposals will require live discussions to ensure API Review members have sufficient context before making a decision. We never deprecate APIs without an unavoidable requirement to do so. Even if it’s technically expensive to maintain support, we incur that support cost. To be explicit, we define API deprecation as any change that would require customers to modify an existing integration. If such a change were to be approved as an exception to this policy, it must first be approved by the API Review, followed by our CEO. One example where we granted an exception was the deprecation of TLS 1.2 support due to PCI compliance obligations. When significant new functionality is required, we add a new API. For example, we created /v1/subscriptions to support those workflows rather than extending /v1/charges to add subscriptions support. With the benefit of hindsight, a good example of this policy in action was the introduction of the Payment Intents APIs to maintain compliance with Europe’s Strong Customer Authentication requirements. Even in that case the charge API continued to work as it did previously, albeit only for non-European Union payments. We manage this policy’s implied technical debt via an API translation layer. We release changed APIs into versions, tracked in our API version changelog. However, we only maintain one implementation internally, which is the implementation of the latest version of the API. 
On top of that implementation, a series of version transformations are maintained, which allow us to support prior versions without maintaining them directly. While this approach doesn’t eliminate the overhead of supporting multiple API versions, it significantly reduces complexity by enabling us to maintain just a single, modern implementation internally. All API modifications must also update the version transformation layers to allow the new version to coexist peacefully with prior versions. In the future, SDKs may allow us to soften this policy. While a significant number of our customers have direct integrations with our APIs, that number has dropped significantly over time. Instead, most new integrations are performed via one of our official API SDKs. We believe that in the future, it may be possible for us to make more backwards incompatible changes because we can absorb the complexity of migrations into the SDKs we provide. That is certainly not the case yet today. Diagnosis Our diagnosis of the impact of API changes and deprecation on our business is: If you are a small startup composed of mostly engineers, integrating a new payments API seems easy. However, for a small business without dedicated engineers—or a larger enterprise involving numerous stakeholders—handling external API changes can be particularly challenging. Even if this is only marginally true, we’ve modeled the impact of minimizing API changes on long-term revenue growth, and it has a significant impact, unlocking our ability to benefit from other churn reduction work. While we believe API instability directly creates churn, we also believe that API stability directly retains customers by increasing the migration overhead even if they wanted to change providers.
Without an API change forcing them to change their integration, we believe that hypergrowth customers are particularly unlikely to change payments API providers absent a concrete motivation like an API change or a payment plan change. We are aware of relatively few companies that provide long-term API stability in general, and particularly few for complex, dynamic areas like payments APIs. We can’t assume that companies that make API changes are ill-informed. Rather it appears that they experience a meaningful technical debt tradeoff between the API provider and API consumers, and aren’t willing to consistently absorb that technical debt internally. Future compliance or security requirements—along the lines of our upgrade from TLS 1.2 to TLS 1.3 for PCI—may necessitate API changes. There may also be new tradeoffs exposed as we enter new markets with their own compliance regimes. However, we have limited ability to predict these changes at this point.
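The version transformation layer described under Policy & Operation can be pictured with a small sketch. This is an illustration in Python under stated assumptions, not Stripe's actual implementation: the version names, the response fields, and every function below are hypothetical. The idea shown is a single current implementation plus one "downgrade" step per released version, applied newest-first until the response matches the version the customer is pinned to.

```python
# Hypothetical released versions, oldest to newest.
VERSIONS = ["2015-01-01", "2015-06-01", "2016-01-01"]


def downgrade_2016_01_01(resp):
    """Undo the (hypothetical) 2016-01-01 rename of "total" to "amount"."""
    out = dict(resp)
    out["total"] = out.pop("amount")
    return out


def downgrade_2015_06_01(resp):
    """Undo the (hypothetical) 2015-06-01 addition of the "currency" field."""
    out = dict(resp)
    out.pop("currency", None)
    return out


# Each version maps to the step converting its shape to the previous one.
DOWNGRADES = {
    "2016-01-01": downgrade_2016_01_01,
    "2015-06-01": downgrade_2015_06_01,
}


def current_implementation():
    """The only real implementation: it always answers in the latest shape."""
    return {"id": "ch_123", "amount": 500, "currency": "usd"}


def render_for(pinned_version):
    """Render the latest response in the shape of the customer's pinned version."""
    resp = current_implementation()
    pinned_index = VERSIONS.index(pinned_version)
    # Walk back from the newest version down to (not including) the pinned one.
    for version in reversed(VERSIONS[pinned_index + 1:]):
        resp = DOWNGRADES[version](resp)
    return resp


if __name__ == "__main__":
    for v in VERSIONS:
        print(v, render_for(v))
```

In this arrangement, shipping a new version means adding one entry to VERSIONS and one downgrade step, while older integrations keep receiving the shape they were built against. The sketch covers responses only; incoming requests would need the mirror-image steps applied in the opposite direction.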

6 hours ago 1 votes
Bike Brooklyn! zine

I've been biking in Brooklyn for a few years now! It's hard for me to believe it, but I'm now one of the people other bicyclists ask questions to. I decided to make a zine that answers the most common of those questions: Bike Brooklyn! is a zine that touches on everything I wish I knew when I started biking in Brooklyn. A lot of this information can be found in other resources, but I wanted to collect it in one place. I hope to update this zine when we get significantly more safe bike infrastructure in Brooklyn and laws change to make streets safer for bicyclists (and everyone) over time, but it's still important to note that each release will reflect a specific snapshot in time of bicycling in Brooklyn. All text and illustrations in the zine are my own. Thank you to Matt Denys, Geoffrey Thomas, Alex Morano, Saskia Haegens, Vishnu Reddy, Ben Turndorf, Thomas Nayem-Huzij, and Ryan Christman for suggestions for content and help with proofreading. This zine is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, so you can copy and distribute this zine for noncommercial purposes in unadapted form as long as you give credit to me. Check out the Bike Brooklyn! zine on the web or download pdfs to read digitally or print here!

yesterday 5 votes
Announcing Hotwire Native 1.2

We’ve just launched Hotwire Native v1.2 and it’s the biggest update since the initial launch last year. The update has several key improvements, bug fixes, and more API consistency between platforms. And we’ve created all new iOS and Android demo apps to show it off! A web-first framework for building native mobile apps Improvements There are a few significant changes in v1.2 that are worth specifically highlighting. Route decision handlers Hotwire Native apps route internal urls to screens in your app, and route external urls to the device’s browser. Historically, though, it wasn’t straightforward to customize the default behavior for unique app needs. In v1.2, we’ve introduced the RouteDecisionHandler concept to iOS (formerly only on Android). Route decision handlers offer a flexible way to decide how to route urls in your app. Out-of-the-box, Hotwire Native registers these route decision handlers to control how urls are routed: AppNavigationRouteDecisionHandler: Routes all internal urls on your app’s domain through your app. SafariViewControllerRouteDecisionHandler: (iOS Only) Routes all external http/https urls to a SFSafariViewController in your app. BrowserTabRouteDecisionHandler: (Android Only) Routes all external http/https urls to a Custom Tab in your app. SystemNavigationRouteDecisionHandler: Routes all remaining external urls (such as sms: or mailto:) through the device’s system navigation. If you’d like to customize this behavior you can register your own RouteDecisionHandler implementations in your app. See the documentation for details. Server-driven historical location urls If you’re using Ruby on Rails, the turbo-rails gem provides the following historical location routes. You can use these to manipulate the navigation stack in Hotwire Native apps. recede_or_redirect_to(url, **options) — Pops the visible screen off of the navigation stack. refresh_or_redirect_to(url, **options) — Refreshes the visible screen on the navigation stack.
resume_or_redirect_to(url, **options) — Resumes the visible screen on the navigation stack with no further action. In v1.2 there is now built-in support to handle these “command” urls with no additional path configuration setup necessary. We’ve also made improvements so they handle dismissing modal screens automatically. See the documentation for details. Bottom tabs When starting with Hotwire Native, one of the most common questions developers ask is how to support native bottom tab navigation in their apps. We finally have an official answer! We’ve introduced a HotwireTabBarController for iOS and a HotwireBottomNavigationController for Android. And we’ve updated the demo apps for both platforms to show you exactly how to set them up. New demo apps To better show off all the features in Hotwire Native, we’ve created new demo apps for iOS and Android. And there’s a brand new Rails web app for the native apps to leverage. Hotwire Native demo app Clone the GitHub repos to build and run the demo apps to try them out: iOS repo Android repo Rails app Huge thanks to Joe Masilotti for all the demo app improvements. If you’re looking for more resources, Joe even wrote a Hotwire Native for Rails Developers book! Release notes v1.2 contains dozens of other improvements and bug fixes across both platforms. See the full release notes to learn about all the additional changes: iOS release notes Android release notes Take a look If you’ve been curious about using Hotwire Native for your mobile apps, now is a great time to take a look. We have documentation and guides available on native.hotwired.dev and we’ve created really great demo apps for iOS and Android to help you get started.

yesterday 3 votes