Earlier this week I looked at how Guile uses the BDW-GC API, with an eye towards Whippet, and didn’t know what to do about untagged allocations. But now I do! Today’s note is about how we can support untagged allocations of a few different kinds in Whippet’s mostly-marking collector.

Why bother supporting untagged allocations at all? Well, if I had my way, I wouldn’t; I would just slog through Guile and fix all uses to be tagged. There are only a finite number of use sites and I could get to them all in a month or so.

The problem comes for uses of scm_gc_malloc from outside libguile itself, in C extensions and embedding programs. These users are loath to adapt to any kind of change, and garbage-collection-related changes are the worst. So, somehow, we need to support these users if we are not to break the Guile community.

The problem with scm_gc_malloc, though, is that it is missing an expression of intent, notably as regards tagging. You can use it to allocate an object that has a tag and thus can be traced precisely, or you can use it to allocate, well, anything else. I think we will have to add an...
2 weeks ago


More from wingolog

whippet lab notebook: on untagged mallocs

Salutations, populations. Today’s note is more of a work-in-progress than usual; I have been finally starting to look at getting Whippet into Guile, and there are some open questions.

I started by taking a look at how Guile uses the Boehm-Demers-Weiser collector’s API, to make sure I had all my bases covered for an eventual switch to something that was not BDW. I think I have a good overview now, and have divided the parts of BDW-GC used by Guile into seven categories.

Firstly there are the ways in which Guile’s run-time and compiler depend on BDW-GC’s behavior, without actually using BDW-GC’s API. By this I mean principally that we assume that any reference to a GC-managed object from any thread’s stack will keep that object alive. The same goes for references originating in global variables, or static data segments more generally. Additionally, we rely on GC objects not to move: references to GC-managed objects in registers or stacks are valid across a GC boundary, even if those references are outside the GC-traced graph: all objects are pinned.

Some of these “uses” are internal to Guile’s implementation itself, and thus amenable to being changed, albeit with some effort. However some escape into the wild via Guile’s API, or, as in this case, as implicit behaviors; these are hard to change or evolve, which is why I am putting my hopes on Whippet’s mostly-marking collector, which allows for conservative roots.

Then there are the uses of BDW-GC’s API, not to accomplish a task, but to protect the mutator from the collector: GC_call_with_alloc_lock, explicitly enabling or disabling GC, calls to sigmask that take BDW-GC’s use of POSIX signals into account, and so on. BDW-GC can stop any thread at any time, between any two instructions; for most users this is anodyne, but if ever you use weak references, things start to get really gnarly.

Of course a new collector would have its own constraints, but switching to cooperative instead of pre-emptive safepoints would be a welcome relief from this mess. On the other hand, we will require client code to explicitly mark their threads as inactive during calls in more cases, to ensure that all threads can promptly reach safepoints at all times. Swings and roundabouts?

Did you know that the Boehm collector allows for precise tracing? It does! It’s slow and truly gnarly, but when you need precision, precise tracing is nice to have. (This is the GC_new_kind interface.) Guile uses it to mark Scheme stacks, allowing it to avoid treating unboxed locals as roots. When it loads compiled files, Guile also adds some slices of the mapped files to the root set. These interfaces will need to change a bit in a switch to Whippet but are ultimately internal, so that’s fine.

What is not fine is that Guile allows C users to hook into precise tracing, notably via scm_smob_set_mark. This is not only the wrong interface, not allowing for copying collection, but these functions are just truly gnarly. I don’t know what to do with them yet; are our external users ready to forgo this interface entirely? We have been working on them over time, but I am not sure.

Weak references, weak maps of various kinds: the implementation of these in terms of BDW’s API is incredibly gnarly and ultimately unsatisfying. We will be able to replace all of these with ephemerons and tables of ephemerons, which are natively supported by Whippet. The same goes with finalizers. The same goes for constructs built on top of finalizers, such as guardians; we’ll get to reimplement these on top of nice Whippet-supplied primitives. Whippet allows for resuscitation of finalized objects, so all is good here.
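As a reminder of what an ephemeron gives you that BDW-style disappearing links don’t, here is a minimal sketch of the retention rule a collector applies to each ephemeron once key reachability is known. The struct and helper names below are hypothetical, for illustration only; they are not Whippet’s actual API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration, not Whippet's API: an ephemeron associates a
   weakly-held key with a value that is kept alive only while the key is
   independently reachable.  This is the primitive that weak-key tables
   want, without the races of BDW-style disappearing links. */
struct ephemeron {
  void *key;    /* weakly held */
  void *value;  /* retained only while key is live */
};

/* Applied by the collector to each ephemeron once key liveness is known. */
static void process_ephemeron(struct ephemeron *e,
                              bool (*is_live)(void *),
                              void (*trace)(void *)) {
  if (e->key && is_live(e->key))
    trace(e->value);            /* key survived: keep the value alive too */
  else
    e->key = e->value = NULL;   /* key died: clear the entry */
}
```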
There is a long list of miscellanea: the interfaces to explicitly trigger GC, to get statistics, to control the number of marker threads, to initialize the GC; these will change, but all uses are internal, making it not a terribly big deal.

I should mention one API concern, which is that BDW’s state is all implicit. For example, when you go to allocate, you don’t pass the API a handle which you have obtained for your thread, and which might hold some thread-local freelists; BDW will instead load thread-local variables in its API. That’s not as efficient as it could be and Whippet goes the explicit route, so there is some additional plumbing to do.

Finally I should mention the truly miscellaneous BDW-GC function: GC_free. Guile exposes it via an API, scm_gc_free. It was already vestigial and we should just remove it, as it has no sensible semantics or implementation.

That brings me to what I wanted to write about today, but am going to have to finish tomorrow: the actual allocation routines. BDW-GC provides two, essentially: GC_malloc and GC_malloc_atomic. The difference is that “atomic” allocations don’t refer to other GC-managed objects, and as such are well-suited to raw data. Otherwise you can think of atomic allocations as a pure optimization, given that BDW-GC mostly traces conservatively anyway.

From the perspective of a user of BDW-GC looking to switch away, there are two broad categories of allocations, tagged and untagged. Tagged objects have attached metadata bits allowing their type to be inspected by the user later on. This is the happy path! We’ll be able to write a gc_trace_object function that takes any object, does a switch on, say, some bits in the first word, dispatching to type-specific tracing code. As long as the object is sufficiently initialized by the time the next safepoint comes around, we’re good, and given cooperative safepoints, the compiler should be able to ensure this invariant.

Then there are untagged allocations. Generally speaking, these are of two kinds: temporary and auxiliary. An example of a temporary allocation would be growable storage used by a C run-time routine, perhaps as an unbounded-sized alternative to alloca. Guile uses these a fair amount, as they compose well with non-local control flow as occurring for example in exception handling.

An auxiliary allocation on the other hand might be a data structure only referred to by the internals of a tagged object, but which itself never escapes to Scheme, so you never need to inquire about its type; it’s convenient to have the lifetimes of these values managed by the GC, and when desired to have the GC automatically trace their contents. Some of these should just be folded into the allocations of the tagged objects themselves, to avoid pointer-chasing. Others are harder to change, notably for mutable objects. And the trouble is that for external users of scm_gc_malloc, I fear that we won’t be able to migrate them over, as we don’t know whether they are making tagged mallocs or not.

One conventional way to handle untagged allocations is to manage to fit your data into other tagged data structures; V8 does this in many places with instances of FixedArray, for example, and Guile should do more of this. Otherwise, you make new tagged data types. In either case, all auxiliary data should be tagged.
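To make the tagged-object happy path above a bit more concrete, here is a sketch of the shape such a trace function could take. The tag layout, struct definitions, and callback signature are invented for illustration; this does not reproduce Whippet’s actual gc_trace_object embedder interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: a made-up tag scheme in the low bits of the first
   word, with one switch dispatching to per-type tracing code. */
enum tag { TAG_PAIR = 1, TAG_VECTOR = 2, TAG_STRING = 3 };

struct object { uintptr_t first_word; };
struct pair   { uintptr_t tag; struct object *car, *cdr; };
struct vector { uintptr_t tag; size_t len; struct object *vals[]; };

typedef void (*trace_edge_fn)(struct object **edge, void *trace_data);

static void trace_object(struct object *obj, trace_edge_fn trace, void *data) {
  switch (obj->first_word & 0x7) {
  case TAG_PAIR: {
    struct pair *p = (struct pair *) obj;
    trace(&p->car, data);
    trace(&p->cdr, data);
    break;
  }
  case TAG_VECTOR: {
    struct vector *v = (struct vector *) obj;
    for (size_t i = 0; i < v->len; i++)
      trace(&v->vals[i], data);
    break;
  }
  case TAG_STRING:
    /* raw bytes: no outgoing edges to trace */
    break;
  }
}
```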
I think there may be an alternative, which would be just to support the equivalent of untagged GC_malloc and GC_malloc_atomic; but for that, I am out of time today, so type at y’all tomorrow. Happy hacking!
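For reference, the two BDW-GC allocation entry points discussed above look like this from an embedder’s point of view; a trivial, self-contained example (Guile itself reaches them through scm_gc_malloc and friends rather than calling BDW directly).

```c
#include <gc.h>      /* BDW-GC; some distributions install it as <gc/gc.h> */
#include <string.h>

struct node { struct node *next; double weight; };

int main(void) {
  GC_INIT();

  /* Contains pointers the collector must see: default (traced) allocation. */
  struct node *n = GC_MALLOC(sizeof *n);
  n->next = NULL;
  n->weight = 1.0;

  /* Pure data with no GC-managed pointers inside: "atomic" allocation tells
     the collector it never needs to scan this memory for references. */
  char *buf = GC_MALLOC_ATOMIC(1024);
  memset(buf, 0, 1024);

  return 0;
}
```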

2 weeks ago 11 votes
tracepoints: gnarly but worth it

Hey all, quick post today to mention that I added tracing support to the Whippet GC library. If the support library for LTTng is available when Whippet is compiled, Whippet embedders can visualize the GC process. Like this!

Click above for a full-scale screenshot of the trace explorer processing the nboyer microbenchmark with the parallel copying collector on a 2.5x heap. Of course no image will have all the information; the nice thing about trace visualizers like Perfetto is that you can zoom in to sub-microsecond spans to see exactly what is happening, have nice mouseovers and clicky-clickies. Fun times!

Adding tracepoints to a library is not too hard in the end. You need to pull in the lttng-ust library, which has a pkg-config file. You need to declare your tracepoints in one of your header files. Then you have a minimal C file that includes the header, to generate the code needed to emit tracepoints.

Annoyingly, this header file you write needs to be in one of the -I directories; it can’t just be in the source directory, because lttng includes it seven times (!!) using computed includes (!!!), and because the LTTng file header that does all the computed including isn’t in your directory, GCC won’t find it. It’s pretty ugly. Ugliest part, I would say. But, grit your teeth, because it’s worth it.

Finally you pepper your source with tracepoints, which probably you wrap in some macro so that you don’t have to require LTTng, and so you can switch to other tracepoint libraries, and so on (a sketch of what these pieces look like appears at the end of this post).

I wrote up a little guide for Whippet users about how to actually get traces. It’s not as easy as perf record, which I think is an error. Another ugly point. Buck up, though, you are so close to graphs!

By which I mean, so close to having to write a Python script to make graphs! Because LTTng writes its logs in so-called Common Trace Format, which as you might guess is not very common. I have a colleague who swears by it, that for him it is the lowest-overhead system, and indeed in my case it has no measurable overhead when trace data is not being collected, but his group uses custom scripts to convert the CTF data that he collects to GTKWave (?!?!?!!).

In my case I wanted to use Perfetto’s UI, so I found a script to convert from CTF to the JSON-based tracing format that Chrome profiling used to use. But, it uses an old version of Babeltrace that wasn’t available on my system, so I had to write a new script (!!?!?!?!!), probably the most Python I have written in the last 20 years.

Yes. God I love blinkenlights. As long as it’s low-maintenance going forward, I am satisfied with the tradeoffs. Even the fact that I had to write a script to process the logs isn’t so bad, because it let me get nice nested events, which most stock tracing tools don’t allow you to do. I fixed a small performance bug because of it – a worker thread was spinning waiting for a pool to terminate instead of helping out. A win, and one that never would have shown up on a sampling profiler too. I suspect that as I add more tracepoints, more bugs will be found and fixed.

I think the only thing that would be better is if tracepoints were a part of Linux system ABIs – that there would be header files to emit tracepoint metadata in all binaries, that you wouldn’t have to link to any library, and the actual tracing tools would be intermediated by that ABI in such a way that you wouldn’t depend on those tools at build-time or distribution-time. But until then, I will take what I can get. Happy tracing!
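To give a sense of the shape of the pieces described above (the tracepoint declaration header, the minimal C file, and the wrapper macro), here is a sketch using the standard lttng-ust macros. The provider name, event, field, and HAVE_LTTNG guard are invented for illustration rather than being what Whippet actually defines.

```c
/* mylib-tracepoints.h: the tracepoint declaration header (illustrative). */
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER mylib

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "mylib-tracepoints.h"

#if !defined(MYLIB_TRACEPOINTS_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define MYLIB_TRACEPOINTS_H

#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(
  mylib, gc_phase_begin,
  TP_ARGS(int, phase),
  TP_FIELDS(ctf_integer(int, phase, phase))
)

#endif /* MYLIB_TRACEPOINTS_H */

#include <lttng/tracepoint-event.h>
```

```c
/* mylib-tracepoints.c: the minimal C file that generates the probe code. */
#define TRACEPOINT_CREATE_PROBES
#define TRACEPOINT_DEFINE
#include "mylib-tracepoints.h"
```

```c
/* In the library itself, wrap emission in a macro so LTTng stays optional. */
#ifdef HAVE_LTTNG
#include "mylib-tracepoints.h"
#define MYLIB_TRACEPOINT(...) tracepoint(mylib, __VA_ARGS__)
#else
#define MYLIB_TRACEPOINT(...) do {} while (0)
#endif
```

A call site then looks like MYLIB_TRACEPOINT(gc_phase_begin, 1), and the whole thing compiles away when LTTng isn’t available.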

a month ago 15 votes
whippet at fosdem

Hey all, the video of my FOSDEM talk on Whippet is up. Slides here, if that’s your thing.

I ended the talk with some puzzling results around generational collection, which prompted yesterday’s post. I don’t have a firm answer yet. Or rather, perhaps for the splay benchmark, it is to be expected that a generational GC is not great; but there are other benchmarks that also show suboptimal throughput in generational configurations. Surely it is some tuning issue; I’ll be looking into it.

Happy hacking!

a month ago 16 votes
baffled by generational garbage collection

Usually in this space I like to share interesting things that I find out; you might call it a research-epistle-publish loop. Today, though, I come not with answers, but with questions, or rather one question, but with fractal surface area: what is the value proposition of generational garbage collection?

The conventional wisdom is encapsulated in a 2004 Blackburn, Cheng, and McKinley paper, “Myths and Realities: The Performance Impact of Garbage Collection”, which compares whole-heap mark-sweep and copying collectors to their generational counterparts, using the Jikes RVM as a test harness. (It also examines a generational reference-counting collector, which is an interesting predecessor to the 2022 LXR work by Zhao, Blackburn, and McKinley.)

The paper finds that generational collectors spend less time than their whole-heap counterparts for a given task. This is mainly due to less time spent collecting, because generational collectors avoid tracing/copying work for older objects that mostly stay in the same place in the live object graph. The paper also notes an improvement for mutator time under generational GC, but only for the generational mark-sweep collector, which it attributes to the locality and allocation speed benefit of bump-pointer allocation in the nursery. However for copying collectors, generational GC tends to slow down the mutator, probably because of the write barrier, but in the end lower collector times still led to lower total times.

So, I expected generational collectors to always exhibit lower wall-clock times than whole-heap collectors. In whippet, I have a garbage collector with an abstract API that specializes at compile-time to the mutator’s object and root-set representation and to the collector’s allocator, write barrier, and other interfaces. I embed it in whiffle, a simple Scheme-to-C compiler that can run some small multi-threaded benchmarks, for example the classic Gabriel benchmarks. We can then test those benchmarks against different collectors, mutator (thread) counts, and heap sizes. I expect that the generational parallel copying collector takes less time than the whole-heap parallel copying collector.

So, I ran some benchmarks. Take the splay-tree benchmark, derived from Octane’s splay.js. I have a port to Scheme, and the results are... not good!

In this graph the “pcc” series is the whole-heap copying collector, and “generational-pcc” is the generational counterpart, with a nursery sized such that after each collection, its size is 2 MB times the number of active mutator threads in the last collection. So, for this test with eight threads, on my 8-core Ryzen 7 7840U laptop, the nursery is 16 MB including the copy reserve, which happens to be the same size as the L3 on this CPU. New objects are kept in the nursery one cycle before being promoted to the old generation.

There are also results for “mmc” and “generational-mmc” collectors, which use an Immix-derived algorithm that allows for bump-pointer allocation but which doesn’t require a copy reserve. There, the generational collectors use a sticky mark-bit algorithm, which has very different performance characteristics as promotion is in-place, and the nursery is as large as the available heap size.

The salient point is that at all heap sizes, and for these two very different configurations (mmc and pcc), generational collection takes more time than whole-heap collection. It’s not just the splay benchmark either; I see the same thing for the very different nboyer benchmark.
What is the deal? I am honestly quite perplexed by this state of affairs. I wish I had a narrative to tie this together, but in lieu of that, voici some propositions and observations.

Sometimes people say that the reason generational collection is good is because you get bump-pointer allocation, which has better locality and allocation speed. This is misattribution: it’s bump-pointer allocators that have these benefits. You can have them in whole-heap copying collectors, or you can have them in whole-heap mark-compact or immix collectors that bump-pointer allocate into the holes. Or, true, you can have them in generational collectors with a copying nursery but a freelist-based mark-sweep allocator. But also you can have generational collectors without bump-pointer allocation, as with free-list sticky-mark-bit collectors. To simplify this panorama to “generational collectors have good allocators” is incorrect.

It’s true, generational GC does lower median pause times. But because a major collection is usually slightly more work under generational GC than in a whole-heap system, because of e.g. the need to reset remembered sets, the maximum pauses are just as big and even a little bigger. I am not even sure that it is meaningful to compare median pause times between generational and non-generational collectors, given that the former perform possibly orders of magnitude more collections than the latter. Doing fewer whole-heap traces is good, though, and in the ideal case, the less frequent major traces under generational collectors allow time for concurrent tracing, which is the true mitigation for long pause times.

Could it be that the test harness I am using is in some way unrepresentative? I don’t have more than one test harness for Whippet yet. I will start work on a second Whippet embedder within the next few weeks, so perhaps we will have an answer there. Still, there is ample time spent in GC pauses in these benchmarks, so surely as a GC workload Whiffle has some utility.

One reason that Whiffle might be unrepresentative is that it is an ahead-of-time compiler, whereas nursery addresses are assigned at run-time. Whippet exposes the necessary information to allow a just-in-time compiler to specialize write barriers, for example the inline check that the field being mutated is not in the nursery, and an AOT compiler can’t encode this as an immediate. But it seems a small detail. Also, Whiffle doesn’t do much compiler-side work to elide write barriers. Could the cost of write barriers be over-represented in Whiffle, relative to a production language run-time?

Relatedly, Whiffle is just a baseline compiler. It does some partial evaluation but no CFG-level optimization, no contification, no nice closure conversion, no specialization, and so on: is it not representative because it is not an optimizing compiler?

How big should the nursery be? I have no idea. As a thought experiment, consider the case of a 1 kilobyte nursery. It is probably too small to allow the time for objects to die young, so the survival rate at each minor collection would be high. Above a certain survival rate, generational GC is probably a lose, because your program violates the weak generational hypothesis: it introduces a needless copy for all survivors, and a synchronization for each minor GC. On the other hand, a 1 GB nursery is probably not great either.
It is plenty large enough to allow objects to die young, but the number of survivor objects in a space that large is such that pause times would not be very low, which is one of the things you would like in generational GC. Also, you lose out on locality: a significant fraction of the objects you traverse are probably out of cache and might even incur TLB misses.

So there is probably a happy medium somewhere. My instinct is that for a copying nursery, you want to make it about as big as L3 cache, which on my 8-core laptop is 16 megabytes. Systems are different sizes though; in Whippet my current heuristic is to reserve 2 MB of nursery per core that was active in the previous cycle, so if only 4 threads are allocating, you would have an 8 MB nursery. Is this good? I don’t know.

I don’t have a very large set of benchmarks that run on Whiffle, and they might not be representative. I mean, they are microbenchmarks.

One question I had was about heap sizes. If a benchmark’s maximum heap size fits in L3, which is the case for some of them, then probably generational GC is a wash, because whole-heap collection stays in cache. When I am looking at benchmarks that evaluate generational GC, I make sure to choose those that exceed L3 size by a good factor, for example the 8-mutator splay benchmark in which minimum heap size peaks at 300 MB, or the 8-mutator nboyer-5 which peaks at 1.6 GB. But then, should nursery size scale with total heap size? I don’t know!

Incidentally, the way that I scale these benchmarks to multiple mutators is a bit odd: they are serial benchmarks, and I just run some number of threads at a time, and scale the heap size accordingly, assuming that the minimum size when there are 4 threads is four times the minimum size when there is just one thread. However, multithreaded programs are unreliable, in the sense that there is no heap size under which they fail and above which they succeed.

A generational collector partitions objects into old and new sets, and a minor collection starts by visiting all old-to-new edges, called the “remembered set”. As the program runs, mutations to old objects might introduce new old-to-new edges. To maintain the remembered set in a generational collector, the mutator invokes write barriers: little bits of code that run when you mutate a field in an object. This is overhead relative to non-generational configurations, where the mutator doesn’t have to invoke collector code when it sets fields.

So, could it be that Whippet’s write barriers or remembered set are somehow so inefficient that my tests are unrepresentative of the state of the art? I used to use card-marking barriers, but I started to suspect they cause too much overhead during minor GC and introduce too much cache contention. I switched to precise field-logging barriers some months back for Whippet’s Immix-derived space, and we use the same kind of barrier in the generational copying (pcc) collector. I think this is state of the art. I need to see if I can find a configuration that allows me to measure the overhead of these barriers, independently of other components of a generational collector.

A few months ago, my only generational collector used the sticky mark-bit algorithm, which is an unconventional configuration: its nursery is not contiguous, non-moving, and can be as large as the heap. This is part of the reason that I implemented generational support for the parallel copying collector, to have a different and more conventional collector to compare against. But generational collection loses on some of these benchmarks in both places!
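To make the kind of barrier discussed above concrete, here is a hedged sketch of a precise field-logging write barrier. The helper functions are assumed to exist and the details are invented for illustration; this shows the general technique rather than Whippet’s actual code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of a precise field-logging barrier (illustrative, not Whippet's
   implementation).  On a store into an object outside the nursery, the
   mutator flips a per-field log flag; the first store to that field since
   the last minor GC also pushes the field's address onto a thread-local
   remembered set, which the next minor collection scans as roots. */

struct remembered_set { void ***edges; size_t count, cap; };

/* Assumed helpers, provided elsewhere by the collector (hypothetical). */
extern bool addr_in_nursery(const void *p);
extern _Atomic unsigned char *field_log_flag(void **field);
extern void remembered_set_push(struct remembered_set *rs, void **field);

extern _Thread_local struct remembered_set tl_remset;

static inline void write_field(void *obj, void **field, void *new_value) {
  *field = new_value;
  if (addr_in_nursery(obj))
    return;  /* stores into young objects cannot create old-to-new edges */
  _Atomic unsigned char *flag = field_log_flag(field);
  if (atomic_exchange_explicit(flag, 1, memory_order_relaxed) == 0)
    remembered_set_push(&tl_remset, field);  /* first store since last minor GC */
}
```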
On one benchmark which repeatedly constructs some trees and then verifies them, I was seeing terrible results for generational GC, which I realized were because of cooperative safepoints: generational GC collects more often, so it requires that all threads reach safepoints more often, and the non-allocating verification phase wasn’t emitting any safepoints. I had to change the compiler to emit safepoints at regular intervals (in my case, on function entry), and it sped up the generational collector by a significant amount.

This is one instance of a general observation, which is that any work that doesn’t depend on survivor size in a GC pause is more expensive with a generational collector, which runs more collections. Synchronization can be a cost. I had one bug in which tracing ephemerons did work proportional to the size of the whole heap, instead of the nursery; I had to specifically add generational support for the way Whippet deals with ephemerons during a collection to reduce this cost.

Looking deeper at the data, I have partial answers for the splay benchmark, and they are annoying :)

Splay doesn’t actually allocate all that much garbage. At a 2.5x heap, the stock parallel MMC collector (in-place, sticky mark-bit) collects... one time. That’s all. Same for the generational MMC collector, because the first collection is always major. So at 2.5x we would expect the generational collector to be slightly slower. The benchmark is simply not very good – or perhaps the most generous interpretation is that it represents tasks that allocate 40 MB or so of long-lived data and not much garbage on top.

Also at 2.5x heap, the whole-heap copying collector runs 9 times, and the generational copying collector does 293 minor collections and... 9 major collections. We are not reducing the number of major GCs. It means either the nursery is too small, so objects aren’t dying young when they could, or the benchmark itself doesn’t conform to the weak generational hypothesis.

At a 1.5x heap, the copying collector doesn’t have enough space to run. For MMC, the non-generational variant collects 7 times, and generational MMC times out. Timing out indicates a bug, I think. Annoying! I tend to think that if I get results and there were fewer than, like, 5 major collections for a whole-heap collector, that indicates that the benchmark is probably inapplicable at that heap size, and I should somehow surface these anomalies in my analysis scripts.

Doing a similar exercise for nboyer at 2.5x heap with 8 threads (4 GB for 1.6 GB live data), I see that pcc did 20 major collections, whereas generational pcc lowered that to 8 major collections and 3471 minor collections. Could it be that there are still too many fixed costs associated with synchronizing for global stop-the-world minor collections? I am going to have to add some fine-grained tracing to find out.

I just don’t know! I want to believe that generational collection was an out-and-out win, but I haven’t yet been able to prove it is true. I do have some homework to do. I need to find a way to test the overhead of my write barrier – probably using the MMC collector and making it only do major collections. I need to fix generational-mmc for splay and a 1.5x heap. And I need to do some fine-grained performance analysis for minor collections in large heaps.

Enough for today. Feedback / reactions very welcome. Thanks for reading and happy hacking!

a month ago 22 votes

More in programming

Hardware-Aware Coding: CPU Architecture Concepts Every Developer Should Know

Write faster code by understanding how it flows through your CPU

16 hours ago 3 votes
The Road Not Taken is Guaranteed Minimum Income

The dream is incomplete until we share it with our fellow Americans.

2 days ago 4 votes
Proving Binaries

Heydon Pickering has an intriguing video dealing with the question: “Why is everything binary?” The gist of the video, to me, distills to this insight:

The idea that [everything] belongs to one of two archetypes is seductive in its simplicity, so we base everything that we do and make on this false premise.

That rings true to me. I tend to believe binary thinking is so prevalent because it’s the intellectual path of least resistance and we humans love to be lazy. The fact is, as I’m sure any professional with any experience in any field will tell you, answers are always full of nuance and best explained with the statement “it depends”. The answers we’re all looking for are not found exclusively in one of two binary values, but in the contrast between them. In other words, when you test the accuracy of binary assertions the truth loves to reveal itself somewhere in between.[1]

For example: peak design or development is found in the intermingling of form and function. Not form instead of function, nor function instead of form.

Working on the web, we’re faced with so many binary choices every day: Do we need a designer or a developer? Do we make a web site or a web app? Should we build this on the client or the server? Are we driven by data or intuition? Does this work online or offline?

And answering these questions is not helped by the byproduct of binary thinking, which as Heydon points out, results in intellectually and organizationally disparate structures like “Design” and ”Development”:

Design thinking, but not about how to do the thing you are thinking about. Development doing, but without thinking about why the hell anyone would do this in the first place.

It’s a good reminder to be consistently on guard for our own binary thinking. And when we catch ourselves, striving to look at the contrast between two options for the answer we seek.

[1] There’s a story that illustrates how you can reject binaries and invert the assumption that only two choices exist. It goes like this: A King told a condemned prisoner: “You may make one final statement. If it is true, you will be shot. If it is false, you will be hanged.” The prisoner answered, “I will be hanged.” This results in the King not being able to carry out any sentence. The prisoner manipulates the King’s logic to make both options impossible and reveal a third possible outcome.

2 days ago 2 votes
Operational mechanisms for strategy.

Even the best policies fail if they aren’t adopted by the teams they’re intended to serve. Can we persistently change our company’s behaviors with a one-time announcement? No, probably not. I refer to the art of making policies work as “operations” or “strategy operations.” The good news is that effectively operating a policy is two-thirds avoiding common practices that simply don’t work. The other one-third takes some practice, but can be practiced in any engineering role: there’s no need to wait until you’re an executive to start building mastery.

This chapter will dig into those mechanisms, with particular focus on:

- How policies are supported by operations, and how operations are composed of mechanisms that ensure they work well
- Evaluating operational mechanisms to select between different options, and determine which mechanisms are unlikely to be an effective choice
- Composing an operational plan for the specific set of policies that you are looking to support
- Common varieties of effective mechanisms such as approval forums, inspection mechanisms, nudges, and so on. We’ll also explore the sorts of mechanisms that tend to work poorly
- How to adjust your approach to operations if you are in an engineering role rather than an executive role
- How cargo-culting remains the largest threat to effective strategy operations

Let’s unpack the details of turning your potentially good policy into an impactful policy. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

What are operational mechanisms?

Operations are how a policy is implemented and reinforced. Effective operations ensure that your policies actually accomplish something. They can range from a recurring weekly meeting, to an alert that notifies the team when a threshold is exceeded, to a promotion rubric requiring a certain behavior to be promoted.

In the strategy for working with new private equity ownership, we introduce a policy to backfill hires at a lower level, and also limit the maximum number of principal engineers:

We will move to an “N-1” backfill policy, where departures are backfilled with a less senior level. We will also institute a strict maximum of one Principal Engineer per business unit, with any exceptions approved in writing by the CTO–this applies for both promotions and external hires.

That introduces an explicit operational mechanism of escalations going to the CTO, but it also introduces an implicit and undefined mechanism: how do we ensure the backfills are actually down-leveled as the policy instructs? It might be a group chat with engineering recruiting where the CTO approves the level of backfilled roles. Instead, it might be the responsibility of recruiting to enforce that downleveling. In a third approach, it might be taken on trust that hiring managers will do the right thing. Each of those three scenarios is a potential operational solution to implementing this policy. Operations is picking the right one for your circumstances, and then tweaking it as you learn from running it.

Operations in government

For another interesting take on how critical operations are, Recoding America by Jennifer Pahlka is well worth the read. It explores how well-intended government legislation often isn’t implementable, which results in policies that require massive IT investments but provide little benefit to constituents.
How to evaluate mechanisms

In order to determine the most effective operational mechanisms for the problems you’re working on, it’s useful to have a standardized rubric for evaluating mechanisms. While this rubric isn’t perfectly universal–customize it for your needs–having any rubric will make it easier to evaluate your options consistently. The rubric I use to evaluate whether an operational mechanism will be effective is:

- Measurability: Can you measure both leading and lagging indicators to inspect the mechanism’s impact? If you have to choose between the two, measuring leading indicators allows much quicker evaluation and iteration on your mechanisms.
- Adoption cost: How much work will migrating to this mechanism require? Can this work be done incrementally or does it require a major, coordinated shift?
- User ease (or burden): After adopting this policy, how much easier (or harder) will it be for users to perform their work? If things will be harder, are those users able to tolerate the additional time?
- Provider ease (or burden): How much additional ongoing maintenance will this mechanism require from the centralized or platform team providing it? For example, if every new architecture proposal requires a thorough review by your Security team, does the Security team have the actual ability to support those reviews?
- Reliance on authority: How much does this mechanism depend on a top-down authority’s active support? If the sponsoring executive departs, will this mechanism remain effective? Is that an effective tradeoff in this case?
- Culturally aligned: Is this something that your organization is going to do, or something that they are going to fight against each step? Is there a way you can adjust the framing to make it more acceptable to your organization’s culture?

Generally, I find folks are good at evaluating mechanisms against these criteria, but somewhat worse at accepting the consequences of their evaluation. For example, falling in love with a particular mechanism and then trying to force the organization to accept a mechanism whose adoption cost is unbearably high, or introducing a mechanism that creates significant user burden for a team that is already struggling with tight efficiency goals, like a customer support team. Self-awareness helps here, but so does consulting others to point out the errors in your reasoning, which is a core part of how I’ve found success in adopting operational mechanisms.

Composing an operational plan

Your operational plan is the sum of the mechanisms used to support your policies. While evaluating each individual mechanism in isolation is part of creating an operations plan, it’s also valuable to consider how the mechanisms will work together:

- Review the policies you’ve developed. What sort of mechanisms seem most likely to support these policies? How might these mechanisms be pooled together to avoid redundancy?
- Review the operational mechanisms that have worked in your organization. What mechanisms have been used to best effect, and which have left a sufficiently bad taste in the organization’s collective memory that they’ll be hard to reuse effectively?
- Which new mechanisms showed up in your exploration? In your exploration phase, you’ll frequently encounter mechanisms that your organization hasn’t previously tried. If any of them seem particularly well-suited to the policies you’re considering, and none of your organization’s frequently used mechanisms are good fits, then consider testing a new one.
- Evaluate mechanisms against the evaluation rubric. For each of the mechanisms you’re considering using, apply the rubric from the How to evaluate mechanisms section above to validate they’re good fits.
- Consolidate into an operational plan. Now that you’ve determined the mechanisms you want to consider, work on fitting the full set of mechanisms into one coherent plan. Be particularly mindful of the ease, or burden, the integrated plan creates for both your users and platform providers.
- Validate plan with users and providers. Many plans make sense from afar, but fail due to imposing an unreasonable burden. Or the burden might be acceptable, but the actual workflow simply won’t work at all.
- Consider validating via strategy testing. If you run the above process, and can’t come to an agreement with stakeholders on your proposed plan, then simply commit to running a strategy testing process including the plan. This will create space for everyone to build confidence in the approach before they feel forced to make a commitment to following it long-term. Even if you don’t use strategy testing for your plan, at least commit to scheduling a review in three months reflecting on how things have worked out.

Your operational plan is the vehicle that delivers your policies to your organization. It’s extremely tempting to skip refining the details here, but it’s a relatively quick step and will completely change your strategy’s outcomes.

Common mechanisms

Most companies have a handful of frequently used operational mechanisms. Some of those mechanisms are company specific, such as Amazon’s weekly business review, and others repeat across companies like requiring executive approval. Across the many mechanisms you’ll encounter, you can generally cluster them into recurring categories. This section covers the mechanisms I’ve found consistently effective.

Approval and advice forums

At a high level, new policies are obvious, simple and apply cleanly to the problem they are intended to solve. However, when you apply those policies to detailed, complex circumstances, it’s often ambiguous how to stay loyal to a policy’s intentions. Approval and advice forums are a common solution to that problem. Calm’s product engineering strategy shows what the simplest, and most common, approval forum looks like in practice:

Exceptions are granted by the CTO, and must be in writing. The above policies are deliberately restrictive. Sometimes they may be wrong, and we will make exceptions to them. However, each exception should be deliberate and grounded in concrete problems we are aligned both on solving and how we solve them. If we all scatter towards our preferred solution, then we’ll create negative leverage for Calm rather than serving as the engine that advances our product. All exceptions must be written. If they are not written, then you should operate as if it has not been granted. Our goal is to avoid ambiguity around whether an exception has, or has not, been approved. If there’s no written record that the CTO approved it, then it’s not approved.

This example also has several weaknesses that happen in many approval forums. Most importantly, it doesn’t make it clear how to get approvals. It would be stronger if it explicitly explained how to get an approval (perhaps go ask in #cto-approvals), and where to find prior approvals to help someone considering requesting an exception to calibrate their request. Approvals don’t necessarily need to come from senior leadership. Instead, the senior leadership can loan their authority on a topic to another group.
The LLM adoption strategy provides a good example of this:

Start with Anthropic. We use Anthropic models, which are available through our existing cloud provider via AWS Bedrock. To avoid maintaining multiple implementations, where we view the underlying foundational model quality to be somewhat undifferentiated, we are not looking to adopt a broad set of LLMs at this point. This is anchored in our Wardley map of the LLM ecosystem. Exceptions will be reviewed by the Machine Learning Review in #ml-review.

In a more community-minded organization, the approval forums might not require senior leadership involvement at all. Instead, the culture might create an environment where the forums’ feedback is taken seriously on its own merits. Every company does approval forums a bit differently, ranging from our experiments at Carta with Navigators, granting executive authority for technical decisions to named engineers in each area, to Andrew Harmel-Law’s discussion of this topic in Facilitating Software Architecture. You can spend a lot of time arguing the details here; my experience is that having the right participants and a good executive sponsor matter a lot, and the other pieces matter a lot less.

Inspection

While even the best policies can fail, the more common scenario is that a policy will sort-of work, and need some modest adjustments to make it more successful. An inspection mechanism allows you to evaluate whether your policy is succeeding and if you need to make adjustments. The user-data access strategy provides an example:

Measure progress on percentage of customer data access requests justified by a user-comprehensible, automated rationale. This will anchor our approach on simultaneously improving the security of user data and the usability of our colleagues’ internal tools. If we only expand requirements for accessing customer data, we won’t view this as progress because it’s not automated (and consequently is likely to encourage workarounds as teams try to solve problems quickly). Similarly, if we only improve usability, charts won’t represent this as progress, because we won’t have increased the number of supported requests. As part of this effort, we will create a private channel where the security and compliance team has visibility into all manual rationales for user-data access, and will directly message the manager of any individual who relies on a manual justification for accessing user data.

This example is a good start, but fully realizing an inspection mechanism requires concretely specifying where and how the data will be tracked. A better version of this would include a link to the dashboard you’ll look at, and a commitment to reviewing the data on a certain frequency. For a recent inspection mechanism, I created a recurring invite with a link to the relevant data dashboard, and a specific chat channel for discussion, and invited the working group who had agreed to review the data on that cadence. This wasn’t a synchronous meeting, but rather a commitment to independently review, and discuss anything that felt surprising. Your particular mechanisms could be threshold-triggered alerts, something you fold into an existing metrics review meeting, a script you commit to running and reviewing periodically, or something else. The most important thing is that it cannot silently fail.
Nudges

While it’s common to hear complaints about how a team isn’t following a new policy, as if it were a deliberate choice they’d made, I find it more common that people want to do things the new way, but rarely take time to learn how to do it. Nudges are providing individuals with context to inform them about a better way they might do something, and they are an exceptionally effective mechanism.

Grounding this in an example, at Stripe we had a policy of allowing teams to self-authorize introducing new cloud hosting costs. This worked well almost all the time. However, sometimes teams would accidentally introduce large cost increases without realizing it, and teams that introduced those spikes almost never had any awareness that they had caused the problem. Even if we’d told them they must not introduce unapproved spending spikes, they simply didn’t perceive they’d done it. We had the choice between preventing all teams from introducing new spend, or we could try using a nudge. The nudge we added informed teams when their cloud spend accelerated month over month, directed them to charts that explained the acceleration, and told them where to go to ask questions. Nudges pair well with inspections, and there was also a monthly review by the Efficiency Engineering team to review any spikes and reach out where necessary. Maybe we could have forced all teams to review new spend, but this nudge approach didn’t require an authoritative mandate to implement. It also meant we only spent time advising teams that actually spent too much, instead of having to discuss with every team that might spend too much.

As another example making that point, a working group at Carta added a nudge to inform managers of untested pull requests merged by their team. Some managers had previously said they simply didn’t know when and why their team had merged untested pull requests, and this nudge made it easy to detect. The nudge also respected their attention by not sending a notification at all if there wasn’t a new, untested pull request. With poor ergonomics, nudges can be an overwhelming assault on your colleagues’ attention, but done well, I continue to believe they are the most effective operational mechanism.

Documentation

Policies can’t be enforced by people who don’t know they exist, or by people who don’t know how to follow those policies. In my experience, nudges are the most effective way of solving both of those problems, because nudges bring information to people at exactly the moment that information would be useful. At most companies, well-done nudges are relatively uncommon, and the far more common solution to lack of information is documentation and training.

There are so many approaches to both of these topics, and I’ve not found my own approaches here particularly effective. Consequently, I am hesitant to give much advice on what will work best for you. The best I can offer is that following standard practices for your company, even if the outcomes seem imperfect, is probably your best bet. Internal knowledge bases tend to rot quickly, and introducing yet another knowledge base is almost always the illusion of progress rather than real progress. Even when you really don’t like the current one. Finally, remember that success for documentation and training is not necessarily that everyone in the company knows how a new policy works.
Instead, as discussed in the chapter on whether strategy is useful, a more useful goal is informational herd immunity: as long as someone on each team understands your policy, the team will generally be capable of following it.

Automation

Relying on humans to respond is slow, and the quality of human response is highly varied. In many cases, automation provides the most effective and most scalable mechanism to support your policies’ rollout. Automation was key in the Uber service migration strategy, moving us out of a manual, slow process that was taking up a great deal of user and provider time:

Move to structured requests, and out of tickets. Missing or incorrect information in provisioning requests creates significant delays in provisioning. Further, collecting this information is the first step of moving to a self-service process. As such, we can get paid twice by reducing errors in manual provisioning while also creating the interface for self-service workflows.

In that case, better automation allowed us to eliminate a series of back-and-forth negotiations to collect data, and to instead get the necessary information in a single step. Occasionally we still ran into users who couldn’t fill in the form, but now we could focus on providing a good manual experience for those rare exceptions. As you use automation as a core strategy mechanism, it’s important to recognize that designing an effective user experience is a prerequisite to automation having a positive impact. If you view the user experience of your automation as a secondary concern, then you are unlikely to make much impact with automation.

Deferment to future work

Sometimes there’s something you really want a policy to do, but you also know that you have no reasonable mechanism to do it. In that case, you may find explicitly deferring action on the topic useful. The strategy for integrating the Index acquisition at Stripe uses this mechanism:

Defer making a decision regarding the introduction of Java to a later date: the introduction of Java is incompatible with our existing engineering strategy, but at this point we’ve also been unable to align stakeholders on how to address this decision. Further, we see attempting to address this issue as a distraction from our timely goal of launching a joint product within six months. We will take up this discussion after launching the initial release.

As did the strategy for working with a private equity acquirer:

We believe there are significant opportunities to reduce R&D maintenance investments, but we don’t have conviction about which particular efforts we should prioritize. We will kick off a working group to identify the features with the highest support load.

There’s no shame in deferral. As much as you want to make progress on a certain area, it’s better to explicitly acknowledge that you can’t make progress on it–and clarify when you will be able to–than to allow the organization to churn on an intractable problem.

Meetings

Meetings are the final mechanism, and you can fit any and all of the above mechanisms into a meeting. They are a universal mechanism, although frequently overused because they can do an adequate job of operating almost any policy. The most common mechanism is a reporting meeting, such as reporting progress in the Executive Weekly Meeting as suggested in the LLM adoption strategy:

Develop an LLM-backed process for reactivating departed and suspended drivers in mature markets.
Through modeling our driver lifecycle, we determined that improving onboarding time will have little impact on the total number of active drivers. Instead, we are focusing on mechanisms to reactivate departed and suspended drivers, which is the only opportunity to meaningfully impact active drivers. Report on progress monthly in Exec Weekly Meeting, coordinated in #exec-weekly.

The other common meeting archetype is the weekly working meeting introduced in the chapter on strategy testing. Meetings are almost always the most expensive mechanism you can find to solve a problem, but they are easy to suggest, run, and iterate on. If you can’t find any other mechanism you believe in, then a meeting is a decent starting point. Just don’t get too fond of them, and try to iterate your way to canceling every meeting that you start.

Anti-patterns

In addition to the effective operational methods discussed above, there are a number of additional mechanisms that are frequently used, but which I consider anti-patterns. They can provide some value, but there’s almost always a better alternative.

- Top-down pronouncements: Sometimes a policy will be operationalized by simply declaring it must be followed. It’s common to see a leader declare that a policy is now in effect, assuming that the announcement is a useful way to implement the new policy. For example, some “return to office” policies dictate that the team must work from their office, but driving a real change requires motivating those individuals to actually return.
- Education-as-announcements rollouts: The default way that many companies roll out policies is through one-time “education,” often as an all-company announcement for existing employees. They might follow up by updating training for onboarding new-hires. Education sounds great, but a couple trainings will never change organizational behavior. Changing behavior requires ongoing reminders, visible role models, inspection to understand why some teams are not adopting the behavior, and so on. Education can be a good component of operationalizing a policy, but it cannot stand on its own.
- Mandatory recurring trainings: These are a staple of compliance-driven policies, generally because of laws which require providing a certain number of hours of relevant training each year. There are two deep challenges with mandatory trainings. First, because attendance is required, people tend to make little effort to make the content good. Second, many folks don’t pay attention because they expect the content to be low quality. It’s not uncommon to hear people say that they’ve never heard of a policy that they’ve performed annual training on for multiple years. It’s possible to overcome these barriers, but in a situation where you’re accountable for changing outcomes, as opposed to shifting legal obligations away from the company, these tend to work poorly.
- Just change the culture: Some leaders frame most problems as cultural problems, which is a reasonable frame: most things can be usefully viewed as a cultural problem. Unfortunately, it’s common for those who rely heavily on the cultural frame to also have a simplistic view about how culture is changed. Changing an organization’s culture is tricky, and requires a combination of many techniques to create visible leaders role modeling the new behavior, and reinforcement mechanisms to ensure pockets of dissent are weeded out. Anyone who frames culture change as a simple or instant change is living in an imaginary world.
If you’re using one of these approaches, it isn’t necessarily a bad choice. Instead, you should just make sure you can explain why you’re using it, and then you need to also make sure you believe that explanation. If you don’t, look for a mechanism from the earlier section on common mechanisms.

What if you’re not an executive?

It’s easy to get discouraged when you think about which operational mechanisms are available to you as a non-executive. So many of the frequently seen mechanisms, like running mandatory recurring meetings or a binding architecture review process, are not accessible to you. That is true: they’re not accessible to you. However, there’s always a related mechanism that can be implemented with less authority. The binding architecture process can be replaced with an architectural advice process. The mandatory review of pull requests can be replaced with a nudge. Although it may be more common to see the authoritative mechanisms in the companies you work in, my experience working as an executive is that these authoritative mechanisms don’t work particularly well. They do a great job of technically shifting accountability to the wider organization, but they often don’t change behavior at all.

So, instead of getting frustrated by what you can’t do, focus instead on the mechanisms that are available to you today. Add nudges, focus on the real dynamics of how colleagues do work in your organization, and build a real dataset. It’s very hard to get an executive to support your initiative before the mechanisms and data exist to support it, and very easy to get their support once they do. Once you’ve done what you can without authority to build confidence, if you really do need more authority, then you’re in a good place to escalate to get an executive to support your policies.

Beware cargo-culting

The longer that I am in the industry, the more I am surprised by how few strategists seem to care if their approach actually works. Instead, they seem focused on doing something that might work, offloading accountability to either the organization or some team, and then moving off to the next problem. Perhaps this is driven by an unfortunate reality that leaders are often evaluated by how they appear, rather than by what they accomplish. Whether or not that’s the underlying reason for why it happens, it does make it surprisingly difficult to know which patterns to borrow from strategy rollouts and implementations. The best advice, unfortunately, is to remain skeptically optimistic. Collect ideas widely, but force the ideas to prove their merit.

Summary

Now that you’ve finished this chapter, you’re significantly more qualified to write a complete, useful strategy than I was a decade into my career. Often skipped, the operations behind your strategy are at least as essential as any other step, and any strategy without them will fade quietly into your organization’s history. In addition to being able to roll out a strategy of your own, this chapter also provides a useful rescue toolkit you can use to put an existing, floundering strategy back on track. If you don’t see an opportunity to write new strategy within your organization, then there’s still probably room to flex your operational skill.

2 days ago 2 votes
Age is a problem at Apple

The average age of Apple's board members is 68! Nearly half are over 70, and the youngest is 63. It’s not much better with the executive team, where the average age hovers around 60. I’m all for the wisdom of our elders, but it’s ridiculous that the world’s premier tech company is now run by a gerontocracy. And I think it’s starting to show. The AI debacle is just the latest example. I can picture the board presentation on Genmoji: “It’s what the kids want these days!!”. It’s a dumb feature because nobody on Apple’s board or in its leadership has probably ever used it outside a quick demo. I’m not saying older people can’t be an asset. Hell, at 45, I’m no spring chicken myself in technology circles! But you need a mix. You need to blend fluid and crystallized intelligence. You need some people with a finger on the pulse, not just some bravely keeping one. Once you see this, it’s hard not to view slogans like “AI for the rest of us” through that lens. It’s as if AI is like programming a VCR, and you need the grandkids to come over and set it up for you. By comparison, the average age on Meta’s board is 55. They have three members in their 40s. Steve Jobs was 42 when he returned to Apple in 1997. He was 51 when he introduced the iPhone. And he was gone — from Apple and the world — at 56. Apple literally needs some fresh blood to turn the ship around.

3 days ago 3 votes