More from wingolog
Earlier this weekGuileWhippet But now I do! Today’s note is about how we can support untagged allocations of a few different kinds in Whippet’s .mostly-marking collector Why bother supporting untagged allocations at all? Well, if I had my way, I wouldn’t; I would just slog through Guile and fix all uses to be tagged. There are only a finite number of use sites and I could get to them all in a month or so. The problem comes for uses of from outside itself, in C extensions and embedding programs. These users are loathe to adapt to any kind of change, and garbage-collection-related changes are the worst. So, somehow, we need to support these users if we are not to break the Guile community.scm_gc_malloclibguile The problem with , though, is that it is missing an expression of intent, notably as regards tagging. You can use it to allocate an object that has a tag and thus can be traced precisely, or you can use it to allocate, well, anything else. I think we will have to add an API for the tagged case and assume that anything that goes through is requesting an untagged, conservatively-scanned block of memory. Similarly for : you could be allocating a tagged object that happens to not contain pointers, or you could be allocating an untagged array of whatever. A new API is needed there too for pointerless untagged allocations.scm_gc_mallocscm_gc_mallocscm_gc_malloc_pointerless Recall that the mostly-marking collector can be built in a number of different ways: it can support conservative and/or precise roots, it can trace the heap precisely or conservatively, it can be generational or not, and the collector can use multiple threads during pauses or not. Consider a basic configuration with precise roots. You can make tagged pointerless allocations just fine: the trace function for that tag is just trivial. You would like to extend the collector with the ability to make pointerless allocations, for raw data. How to do this?untagged Consider first that when the collector goes to trace an object, it can’t use bits inside the object to discriminate between the tagged and untagged cases. Fortunately though . Of those 8 bits, 3 are used for the mark (five different states, allowing for future concurrent tracing), two for the , one to indicate whether the object is pinned or not, and one to indicate the end of the object, so that we can determine object bounds just by scanning the metadata byte array. That leaves 1 bit, and we can use it to indicate untagged pointerless allocations. Hooray!the main space of the mostly-marking collector has one metadata byte for each 16 bytes of payloadprecise field-logging write barrier However there is a wrinkle: when Whippet decides the it should evacuate an object, it tracks the evacuation state in the object itself; the embedder has to provide an implementation of a , allowing the collector to detect whether an object is forwarded or not, to claim an object for forwarding, to commit a forwarding pointer, and so on. We can’t do that for raw data, because all bit states belong to the object, not the collector or the embedder. So, we have to set the “pinned” bit on the object, indicating that these objects can’t move.little state machine We could in theory manage the forwarding state in the metadata byte, but we don’t have the bits to do that currently; maybe some day. For now, untagged pointerless allocations are pinned. You might also want to support untagged allocations that contain pointers to other GC-managed objects. In this case you would want these untagged allocations to be scanned conservatively. We can do this, but if we do, it will pin all objects. Thing is, conservative stack roots is a kind of a sweet spot in language run-time design. You get to avoid constraining your compiler, you avoid a class of bugs related to rooting, but you can still support compaction of the heap. How is this, you ask? Well, consider that you can move any object for which we can precisely enumerate the incoming references. This is trivially the case for precise roots and precise tracing. For conservative roots, we don’t know whether a given edge is really an object reference or not, so we have to conservatively avoid moving those objects. But once you are done tracing conservative edges, any live object that hasn’t yet been traced is fair game for evacuation, because none of its predecessors have yet been visited. But once you add conservatively-traced objects back into the mix, you don’t know when you are done tracing conservative edges; you could always discover another conservatively-traced object later in the trace, so you have to pin everything. The good news, though, is that we have gained an easier migration path. I can now shove Whippet into Guile and get it running even before I have removed untagged allocations. Once I have done so, I will be able to allow for compaction / evacuation; things only get better from here. Also as a side benefit, the mostly-marking collector’s heap-conservative configurations are now faster, because we have metadata attached to objects which allows tracing to skip known-pointerless objects. This regains an optimization that BDW has long had via its , used in Guile since time out of mind.GC_malloc_atomic With support for untagged allocations, I think I am finally ready to start getting Whippet into Guile itself. Happy hacking, and see you on the other side! inside and outside on intent on data on slop fin
Salutations, populations. Today’s note is more of a work-in-progress than usual; I have been finally starting to look at getting into , and there are some open questions.WhippetGuile I started by taking a look at how Guile uses the ‘s API, to make sure I had all my bases covered for an eventual switch to something that was not BDW. I think I have a good overview now, and have divided the parts of BDW-GC used by Guile into seven categories.Boehm-Demers-Weiser collector Firstly there are the ways in which Guile’s run-time and compiler depend on BDW-GC’s behavior, without actually using BDW-GC’s API. By this I mean principally that we assume that any reference to a GC-managed object from any thread’s stack will keep that object alive. The same goes for references originating in global variables, or static data segments more generally. Additionally, we rely on GC objects not to move: references to GC-managed objects in registers or stacks are valid across a GC boundary, even if those references are outside the GC-traced graph: all objects are pinned. Some of these “uses” are internal to Guile’s implementation itself, and thus amenable to being changed, albeit with some effort. However some escape into the wild via Guile’s API, or, as in this case, as implicit behaviors; these are hard to change or evolve, which is why I am putting my hopes on Whippet’s , which allows for conservative roots.mostly-marking collector Then there are the uses of BDW-GC’s API, not to accomplish a task, but to protect the mutator from the collector: , explicitly enabling or disabling GC, calls to that take BDW-GC’s use of POSIX signals into account, and so on. BDW-GC can stop any thread at any time, between any two instructions; for most users is anodyne, but if ever you use weak references, things start to get really gnarly.GC_call_with_alloc_locksigmask Of course a new collector would have its own constraints, but switching to cooperative instead of pre-emptive safepoints would be a welcome relief from this mess. On the other hand, we will require client code to explicitly mark their threads as inactive during calls in more cases, to ensure that all threads can promptly reach safepoints at all times. Swings and roundabouts? Did you know that the Boehm collector allows for precise tracing? It does! It’s slow and truly gnarly, but when you need precision, precise tracing nice to have. (This is the interface.) Guile uses it to mark Scheme stacks, allowing it to avoid treating unboxed locals as roots. When it loads compiled files, Guile also adds some sliced of the mapped files to the root set. These interfaces will need to change a bit in a switch to Whippet but are ultimately internal, so that’s fine.GC_new_kind What is not fine is that Guile allows C users to hook into precise tracing, notably via . This is not only the wrong interface, not allowing for copying collection, but these functions are just truly gnarly. I don’t know know what to do with them yet; are our external users ready to forgo this interface entirely? We have been working on them over time, but I am not sure.scm_smob_set_mark Weak references, weak maps of various kinds: the implementation of these in terms of BDW’s API is incredibly gnarly and ultimately unsatisfying. We will be able to replace all of these with ephemerons and tables of ephemerons, which are natively supported by Whippet. The same goes with finalizers. The same goes for constructs built on top of finalizers, such as ; we’ll get to reimplement these on top of nice Whippet-supplied primitives. Whippet allows for resuscitation of finalized objects, so all is good here.guardians There is a long list of miscellanea: the interfaces to explicitly trigger GC, to get statistics, to control the number of marker threads, to initialize the GC; these will change, but all uses are internal, making it not a terribly big deal. I should mention one API concern, which is that BDW’s state is all implicit. For example, when you go to allocate, you don’t pass the API a handle which you have obtained for your thread, and which might hold some thread-local freelists; BDW will instead load thread-local variables in its API. That’s not as efficient as it could be and Whippet goes the explicit route, so there is some additional plumbing to do. Finally I should mention the true miscellaneous BDW-GC function: . Guile exposes it via an API, . It was already vestigial and we should just remove it, as it has no sensible semantics or implementation.GC_freescm_gc_free That brings me to what I wanted to write about today, but am going to have to finish tomorrow: the actual allocation routines. BDW-GC provides two, essentially: and . The difference is that “atomic” allocations don’t refer to other GC-managed objects, and as such are well-suited to raw data. Otherwise you can think of atomic allocations as a pure optimization, given that BDW-GC mostly traces conservatively anyway.GC_mallocGC_malloc_atomic From the perspective of a user of BDW-GC looking to switch away, there are two broad categories of allocations, tagged and untagged. Tagged objects have attached metadata bits allowing their type to be inspected by the user later on. This is the happy path! We’ll be able to write a function that takes any object, does a switch on, say, some bits in the first word, dispatching to type-specific tracing code. As long as the object is sufficiently initialized by the time the next safepoint comes around, we’re good, and given cooperative safepoints, the compiler should be able to ensure this invariant.gc_trace_object Then there are untagged allocations. Generally speaking, these are of two kinds: temporary and auxiliary. An example of a temporary allocation would be growable storage used by a C run-time routine, perhaps as an unbounded-sized alternative to . Guile uses these a fair amount, as they compose well with non-local control flow as occurring for example in exception handling.alloca An auxiliary allocation on the other hand might be a data structure only referred to by the internals of a tagged object, but which itself never escapes to Scheme, so you never need to inquire about its type; it’s convenient to have the lifetimes of these values managed by the GC, and when desired to have the GC automatically trace their contents. Some of these should just be folded into the allocations of the tagged objects themselves, to avoid pointer-chasing. Others are harder to change, notably for mutable objects. And the trouble is that for external users of , I fear that we won’t be able to migrate them over, as we don’t know whether they are making tagged mallocs or not.scm_gc_malloc One conventional way to handle untagged allocations is to manage to fit your data into other tagged data structures; V8 does this in many places with instances of FixedArray, for example, and Guile should do more of this. Otherwise, you make new tagged data types. In either case, all auxiliary data should be tagged. I think there may be an alternative, which would be just to support the equivalent of untagged and ; but for that, I am out of time today, so type at y’all tomorrow. Happy hacking!GC_mallocGC_malloc_atomic inventory what is to be done? implicit uses defensive uses precise tracing reachability misc allocation
Hey all, quick post today to mention that I added tracing support to the . If the support library for is available when Whippet is compiled, Whippet embedders can visualize the GC process. Like this!Whippet GC libraryLTTng Click above for a full-scale screenshot of the trace explorer processing the with the on a 2.5x heap. Of course no image will have all the information; the nice thing about trace visualizers like is that you can zoom in to sub-microsecond spans to see exactly what is happening, have nice mouseovers and clicky-clickies. Fun times!Perfetto microbenchmarknboyerparallel copying collector Adding tracepoints to a library is not too hard in the end. You need to , which has a file. You need to . Then you have a that includes the header, to generate the code needed to emit tracepoints.pull in the librarylttng-ustdeclare your tracepoints in one of your header filesminimal C filepkg-config Annoyingly, this header file you write needs to be in one of the directories; it can’t be just in the the source directory, because includes it seven times (!!) using (!!!) and because the LTTng file header that does all the computed including isn’t in your directory, GCC won’t find it. It’s pretty ugly. Ugliest part, I would say. But, grit your teeth, because it’s worth it.-Ilttngcomputed includes Finally you pepper your source with tracepoints, which probably you so that you don’t have to require LTTng, and so you can switch to other tracepoint libraries, and so on.wrap in some macro I wrote up a little . It’s not as easy as , which I think is an error. Another ugly point. Buck up, though, you are so close to graphs!guide for Whippet users about how to actually get tracesperf record By which I mean, so close to having to write a Python script to make graphs! Because LTTng writes its logs in so-called Common Trace Format, which as you might guess is not very common. I have a colleague who swears by it, that for him it is the lowest-overhead system, and indeed in my case it has no measurable overhead when trace data is not being collected, but his group uses custom scripts to convert the CTF data that he collects to... (?!?!?!!).GTKWave In my case I wanted to use Perfetto’s UI, so I found a to convert from CTF to the . But, it uses an old version of Babeltrace that wasn’t available on my system, so I had to write a (!!?!?!?!!), probably the most Python I have written in the last 20 years.scriptJSON-based tracing format that Chrome profiling used to usenew script Yes. God I love blinkenlights. As long as it’s low-maintenance going forward, I am satisfied with the tradeoffs. Even the fact that I had to write a script to process the logs isn’t so bad, because it let me get nice nested events, which most stock tracing tools don’t allow you to do. I fixed a small performance bug because of it – a . A win, and one that never would have shown up on a sampling profiler too. I suspect that as I add more tracepoints, more bugs will be found and fixed.worker thread was spinning waiting for a pool to terminate instead of helping out I think the only thing that would be better is if tracepoints were a part of Linux system ABIs – that there would be header files to emit tracepoint metadata in all binaries, that you wouldn’t have to link to any library, and the actual tracing tools would be intermediated by that ABI in such a way that you wouldn’t depend on those tools at build-time or distribution-time. But until then, I will take what I can get. Happy tracing! on adding tracepoints using the thing is it worth it? fin
Hey all, the video of my is up:FOSDEM talk on Whippet Slides , if that’s your thing.here I ended the talk with some puzzling results around generational collection, which prompted . I don’t have a firm answer yet. Or rather, perhaps for the splay benchmark, it is to be expected that a generational GC is not great; but there are other benchmarks that also show suboptimal throughput in generational configurations. Surely it is some tuning issue; I’ll be looking into it.yesterday’s post Happy hacking!
Usually in this space I like to share interesting things that I find out; you might call it a research-epistle-publish loop. Today, though, I come not with answers, but with questions, or rather one question, but with fractal surface area: what is the value proposition of generational garbage collection? The conventional wisdom is encapsulated in a 2004 Blackburn, Cheng, and McKinley paper, , which compares whole-heap mark-sweep and copying collectors to their generational counterparts, using the Jikes RVM as a test harness. (It also examines a generational reference-counting collector, which is an interesting predecessor to the 2022 work by Zhao, Blackburn, and McKinley.)“Myths and Realities: The Performance Impact of Garbage Collection”LXR The paper finds that generational collectors spend less time than their whole-heap counterparts for a given task. This is mainly due to less time spent collecting, because generational collectors avoid tracing/copying work for older objects that mostly stay in the same place in the live object graph. The paper also notes an improvement for mutator time under generational GC, but only for the generational mark-sweep collector, which it attributes to the locality and allocation speed benefit of bump-pointer allocation in the nursery. However for copying collectors, generational GC tends to slow down the mutator, probably because of the write barrier, but in the end lower collector times still led to lower total times. So, I expected generational collectors to always exhibit lower wall-clock times than whole-heap collectors. In , I have a garbage collector with an abstract API that specializes at compile-time to the mutator’s object and root-set representation and to the collector’s allocator, write barrier, and other interfaces. I embed it in , a simple Scheme-to-C compiler that can run some small multi-threaded benchmarks, for example the classic Gabriel benchmarks. We can then test those benchmarks against different collectors, mutator (thread) counts, and heap sizes. I expect that the generational parallel copying collector takes less time than the whole-heap parallel copying collector.whippetwhiffle So, I ran some benchmarks. Take the splay-tree benchmark, derived from Octane’s . I have a port to Scheme, and the results are... not good!splay.js In this graph the “pcc” series is the whole-heap copying collector, and “generational-pcc” is the generational counterpart, with a nursery sized such that after each collection, its size is 2 MB times the number of active mutator threads in the last collector. So, for this test with eight threads, on my 8-core Ryzen 7 7840U laptop, the nursery is 16MB including the copy reserve, which happens to be the same size as the L3 on this CPU. New objects are kept in the nursery one cycle before being promoted to the old generation. There are also results for , which use an Immix-derived algorithm that allows for bump-pointer allocation but which doesn’t require a copy reserve. There, the generational collectors use a , which has very different performance characteristics as promotion is in-place, and the nursery is as large as the available heap size.“mmc” and “generational-mmc” collectorssticky mark-bit algorithm The salient point is that at all heap sizes, and for these two very different configurations (mmc and pcc), generational collection takes more time than whole-heap collection. It’s not just the splay benchmark either; I see the same thing for the very different . What is the deal?nboyer benchmark I am honestly quite perplexed by this state of affairs. I wish I had a narrative to tie this together, but in lieu of that, voici some propositions and observations. Sometimes people say that the reason generational collection is good is because you get bump-pointer allocation, which has better locality and allocation speed. This is misattribution: it’s bump-pointer allocators that have these benefits. You can have them in whole-heap copying collectors, or you can have them in whole-heap mark-compact or immix collectors that bump-pointer allocate into the holes. Or, true, you can have them in generational collectors with a copying nursery but a freelist-based mark-sweep allocator. But also you can have generational collectors without bump-pointer allocation, for free-list sticky-mark-bit collectors. To simplify this panorama to “generational collectors have good allocators” is incorrect. It’s true, generational GC does lower median pause times: But because a major collection is usually slightly more work under generational GC than in a whole-heap system, because of e.g. the need to reset remembered sets, the maximum pauses are just as big and even a little bigger: I am not even sure that it is meaningful to compare median pause times between generational and non-generational collectors, given that the former perform possibly orders of magnitude more collections than the latter. Doing fewer whole-heap traces is good, though, and in the ideal case, the less frequent major traces under generational collectors allows time for concurrent tracing, which is the true mitigation for long pause times. Could it be that the test harness I am using is in some way unrepresentative? I don’t have more than one test harness for Whippet yet. I will start work on a second Whippet embedder within the next few weeks, so perhaps we will have an answer there. Still, there is ample time spent in GC pauses in these benchmarks, so surely as a GC workload Whiffle has some utility. One reasons that Whiffle might be unrepresentative is that it is an ahead-of-time compiler, whereas nursery addresses are assigned at run-time. Whippet exposes the necessary information to allow a just-in-time compiler to specialize write barriers, for example the inline check that the field being mutated is not in the nursery, and an AOT compiler can’t encode this as an immediate. But it seems a small detail. Also, Whiffle doesn’t do much compiler-side work to elide write barriers. Could the cost of write barriers be over-represented in Whiffle, relative to a production language run-time? Relatedly, Whiffle is just a baseline compiler. It does some partial evaluation but no CFG-level optimization, no contification, no nice closure conversion, no specialization, and so on: is it not representative because it is not an optimizing compiler? How big should the nursery be? I have no idea. As a thought experiment, consider the case of a 1 kilobyte nursery. It is probably too small to allow the time for objects to die young, so the survival rate at each minor collection would be high. Above a certain survival rate, generational GC is probably a lose, because your program violates the weak generational hypothesis: it introduces a needless copy for all survivors, and a synchronization for each minor GC. On the other hand, a 1 GB nursery is probably not great either. It is plenty large enough to allow objects to die young, but the number of survivor objects in a space that large is such that pause times would not be very low, which is one of the things you would like in generational GC. Also, you lose out on locality: a significant fraction of the objects you traverse are probably out of cache and might even incur TLB misses. So there is probably a happy medium somewhere. My instinct is that for a copying nursery, you want to make it about as big as L3 cache, which on my 8-core laptop is 16 megabytes. Systems are different sizes though; in Whippet my current heuristic is to reserve 2 MB of nursery per core that was active in the previous cycle, so if only 4 threads are allocating, you would have a 8 MB nursery. Is this good? I don’t know. I don’t have a very large set of benchmarks that run on Whiffle, and they might not be representative. I mean, they are microbenchmarks. One question I had was about heap sizes. If a benchmark’s maximum heap size fits in L3, which is the case for some of them, then probably generational GC is a wash, because whole-heap collection stays in cache. When I am looking at benchmarks that evaluate generational GC, I make sure to choose those that exceed L3 size by a good factor, for example the 8-mutator splay benchmark in which minimum heap size peaks at 300 MB, or the 8-mutator nboyer-5 which peaks at 1.6 GB. But then, should nursery size scale with total heap size? I don’t know! Incidentally, the way that I scale these benchmarks to multiple mutators is a bit odd: they are serial benchmarks, and I just run some number of threads at a time, and scale the heap size accordingly, assuming that the minimum size when there are 4 threads is four times the minimum size when there is just one thread. However, , in the sense that there is no heap size under which they fail and above which they succeed; I quote:multithreaded programs are unreliable A generational collector partitions objects into old and new sets, and a minor collection starts by visiting all old-to-new edges, called the “remembered set”. As the program runs, mutations to old objects might introduce new old-to-new edges. To maintain the remembered set in a generational collector, the mutator invokes : little bits of code that run when you mutate a field in an object. This is overhead relative to non-generational configurations, where the mutator doesn’t have to invoke collector code when it sets fields.write barriers So, could it be that Whippet’s write barriers or remembered set are somehow so inefficient that my tests are unrepresentative of the state of the art? I used to use card-marking barriers, but I started to suspect they cause too much overhead during minor GC and introduced too much cache contention. I switched to some months back for Whippet’s Immix-derived space, and we use the same kind of barrier in the generational copying (pcc) collector. I think this is state of the art. I need to see if I can find a configuration that allows me to measure the overhead of these barriers, independently of other components of a generational collector.precise field-logging barriers A few months ago, my only generational collector used the algorithm, which is an unconventional configuration: its nursery is not contiguous, non-moving, and can be as large as the heap. This is part of the reason that I implemented generational support for the parallel copying collector, to have a different and more conventional collector to compare against. But generational collection loses on some of these benchmarks in both places!sticky mark-bit On one benchmark which repeatedly constructs some trees and then verifies them, I was seeing terrible results for generational GC, which I realized were because of cooperative safepoints: generational GC collects more often, so it requires that all threads reach safepoints more often, and the non-allocating verification phase wasn’t emitting any safepoints. I had to change the compiler to emit safepoints at regular intervals (in my case, on function entry), and it sped up the generational collector by a significant amount. This is one instance of a general observation, which is that any work that doesn’t depend on survivor size in a GC pause is more expensive with a generational collector, which runs more collections. Synchronization can be a cost. I had one bug in which tracing ephemerons did work proportional to the size of the whole heap, instead of the nursery; I had to specifically add generational support for the way Whippet deals with ephemerons during a collection to reduce this cost. Looking deeper at the data, I have partial answers for the splay benchmark, and they are annoying :) Splay doesn’t actually allocate all that much garbage. At a 2.5x heap, the stock parallel MMC collector (in-place, sticky mark bit) collects... one time. That’s all. Same for the generational MMC collector, because the first collection is always major. So at 2.5x we would expect the generational collector to be slightly slower. The benchmark is simply not very good – or perhaps the most generous interpretation is that it represents tasks that allocate 40 MB or so of long-lived data and not much garbage on top. Also at 2.5x heap, the whole-heap copying collector runs 9 times, and the generational copying collector does 293 minor collections and... 9 major collections. We are not reducing the number of major GCs. It means either the nursery is too small, so objects aren’t dying young when they could, or the benchmark itself doesn’t conform to the weak generational hypothesis. At a 1.5x heap, the copying collector doesn’t have enough space to run. For MMC, the non-generational variant collects 7 times, and generational MMC times out. Timing out indicates a bug, I think. Annoying! I tend to think that if I get results and there were fewer than, like, 5 major collections for a whole-heap collector, that indicates that the benchmark is probably inapplicable at that heap size, and I should somehow surface these anomalies in my analysis scripts. Doing a similar exercise for nboyer at 2.5x heap with 8 threads (4GB for 1.6GB live data), I see that pcc did 20 major collections, whereas generational pcc lowered that to 8 major collections and 3471 minor collections. Could it be that there are still too many fixed costs associated with synchronizing for global stop-the-world minor collections? I am going to have to add some fine-grained tracing to find out. I just don’t know! I want to believe that generational collection was an out-and-out win, but I haven’t yet been able to prove it is true. I do have some homework to do. I need to find a way to test the overhead of my write barrier – probably using the MMC collector and making it only do major collections. I need to fix generational-mmc for splay and a 1.5x heap. And I need to do some fine-grained performance analysis for minor collections in large heaps. Enough for today. Feedback / reactions very welcome. Thanks for reading and happy hacking! hypothesis test workbench results? “generational collection is good because bump-pointer allocation” “generational collection lowers pause times” is it whiffle? is it something about the nursery size? is it something about the benchmarks? is it the write barrier? is it something about the generational mechanism? is it something about collecting more often? is it something about collection frequency? collecting more often redux conclusion?
More in programming
Jack Johnson is on Rick Rubin’s podcast Tetragrammaton talking about music, film making, creativity, and surfing. At one point (~24:30) Johnson talks about his love for surfing and the beautiful flow state it puts him in: Sometimes I’ll see a friend riding a wave while I’m paddling out, and the thing I’ll see them do just seems like magic...I’ll think, “How in the world did they just do that?” And then on your next ride you’re doing the exact same thing without thinking but it’s all muscle memory and it’s all in this flow that you get into. That’s a really beautiful state to get into, to do something that feels like a magic trick, like something you shouldn’t be able to do, but all of the sudden you’re doing it. I’m not a surfer, and I can’t do effortlessly cool. But I know what a flow state feels like. Johnson’s description reminds me of that feeling when you get a little time on a personal project — riding the wave of working on your personal website. You open your laptop. You start paddling out. Maybe you see an internet friend who was doing something cool and you want to try it but you have no idea if you’ll be able to do it as well as they did. And before you know it, you’re in that flow state where muscle memory takes over and you’re doing stuff without even consciously thinking about it — stuff that others might look at and perceive as magic (cough anything on the command line cough) but it’s not magic to you. Intuition and experience just take over while you ride the wave. Ok, I’m a nerd. But I don’t care. It’s a great feeling, regardless of whether it’s playing an instrument, or surfing, or programming. That feeling of sinking into a craft you’ve worked at your whole life that you don’t have to think about anymore. Email · Mastodon · Bluesky
One of the recurring challenges in any organization is how to split your attention across long-term and short-term problems. Your software might be struggling to scale with ramping user load while also knowing that you have a series of meaningful security vulnerabilities that need to be closed sooner than later. How do you balance across them? These sorts of balance questions occur at every level of an organization. A particularly frequent format is the debate between Product and Engineering about how much time goes towards developing new functionality versus improving what’s already been implemented. In 2020, Calm was growing rapidly as we navigated the COVID-19 pandemic, and the team was struggling to make improvements, as they felt saturated by incoming new requests. This strategy for resourcing Engineering-driven projects was our attempt to solve that problem. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Reading this document To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document. Policy & Operation Our policies for resourcing Engineering-driven projects are: We will protect one Eng-driven project per product engineering team, per quarter. These projects should represent a maximum of 20% of the team’s bandwidth. Each project must advance a measurable metric, and execution must be designed to show progress on that metric within 4 weeks. These projects must adhere to Calm’s existing Engineering strategies. We resource these projects first in the team’s planning, rather than last. However, only concrete projects are resourced. If there’s no concrete proposal, then the team won’t have time budgeted for Engineering-driven work. Team’s engineering manager is responsible for deciding on the project, ensuring the project is valuable, and pushing back on attempts to defund the project. Project selection does not require CTO approval, but you should escalate to the CTO if there’s friction or disagreement. CTO will review Engineering-driven projects each quarter to summarize their impact and provide feedback to teams’ engineering managers on project selection and execution. They will also review teams that did not perform a project to understand why not. As we’ve communicated this strategy, we’ve frequently gotten conceptual alignment that this sounds reasonable, coupled with uncertainty about what sort of projects should actually be selected. At some level, this ambiguity is an acknowledgment that we believe teams will identify the best opportunities bottoms-up, we also wanted to give two concrete examples of projects we’re greenlighting in the first batch: Code-free media release: historically, we’ve needed to make a number of pull requests to add, organize, and release new pieces of media. This is high urgency work, but Engineering doesn’t exercise much judgment while doing it, and manual steps often create errors. We aim to track and eliminate these pull requests, while also increasing the number of releases that can be facilitated without scaling the content release team. Machine-learning content placement: developing new pieces of media is often a multi-week or month process. After content is ready to release, there’s generally a debate on where to place the content. This matters for the company, as this drives engagement with our users, but it matters even more to the content creator, who is generally evaluated in terms of their content’s performance. This often leads to Product and Engineering getting caught up in debates about how to surface particular pieces of content. This project aims to improve user engagement by surfacing the best content for their interests, while also giving the Content team several explicit positions to highlight content without Product and Engineering involvement. Although these projects are similar, it’s not intended that all Engineering-driven projects are of this variety. Instead it’s happenstance based on what the teams view as their biggest opportunities today. Diagnosis Our assessment of the current situation at Calm is: We are spending a high percentage of our time on urgent but low engineering value tasks. Most significantly, about one-third of our time is going into launching, debugging, and changing content that we release into our product. Engineering is involved due to limitations in our implementation, not because there is any inherent value in Engineering’s involvement. (We mostly just make releases slowly and inadvertently introduce bugs of our own.) We have a bunch of fairly clear ideas around improving the platform to empower the Content team to speed up releases, and to eliminate the Engineering involvement. However, we’ve struggled to find time to implement them, or to validate that these ideas will work. If we don’t find a way to prioritize, and succeed at implementing, a project to reduce Engineering involvement in Content releases, we will struggle to support our goals to release more content and to develop more product functionality this year Our Infrastructure team has been able to plan and make these kinds of investments stick. However, when we attempt these projects within our Product Engineering teams, things don’t go that well. We are good at getting them onto the initial roadmap, but then they get deprioritized due to pressure to complete other projects. Engineering team is not very fungible due to its small size (20 engineers), and because we have many specializations within the team: iOS, Android, Backend, Frontend, Infrastructure, and QA. We would like to staff these kinds of projects onto the Infrastructure team, but in practice that team does not have the product development experience to implement theis kind of project. We’ve discussed spinning up a Platform team, or moving product engineers onto Infrastructure, but that would either (1) break our goal to maintain joint pairs between Product Managers and Engineering Managers, or (2) be indistinguishable from prioritizing within the existing team because it would still have the same Product Manager and Engineering Manager pair. Company planning is organic, occurring in many discussions and limited structured process. If we make a decision to invest in one project, it’s easy for that project to get deprioritized in a side discussion missing context on why the project is important. These reprioritization discussions happen both in executive forums and in team-specific forums. There’s imperfect awareness across these two sorts of forums. Explore Prioritization is a deep topic with a wide variety of popular solutions. For example, many software companies rely on “RICE” scoring, calculating priority as (Reach times Impact times Confidence) divided by Effort. At the other extreme are complex methodologies like [Scaled Agile Framework)(https://en.wikipedia.org/wiki/Scaled_agile_framework). In addition to generalized planning solutions, many companies carve out special mechanisms to solve for particular prioritization gaps. Google historically offered 20% time to allow individuals to work on experimental projects that didn’t align directly with top-down priorities. Stripe’s Foundation Engineering organization developed the concept of Foundational Initiatives to prioritize cross-pillar projects with long-term implications, which otherwise struggled to get prioritized within the team-led planning process. All these methods have clear examples of succeeding, and equally clear examples of struggling. Where these initiatives have succeeded, they had an engaged executive sponsoring the practice’s rollout, including triaging escalations when the rollout inconvenienced supporters of the prior method. Where they lacked a sponsor, or were misaligned with the company’s culture, these methods have consistently failed despite the fact that they’ve previously succeeded elsewhere.
I used to make little applications just for myself. Sixteen years ago (oof) I wrote a habit tracking application, and a keylogger that let me keep track of when I was using a computer, and generate some pretty charts. I’ve taken a long break from those kinds of things. I love my hobbies, but they’ve drifted toward the non-technical, and the idea of keeping a server online for a fun project is unappealing (which is something that I hope Val Town, where I work, fixes). Some folks maintain whole ‘homelab’ setups and run Kubernetes in their basement. Not me, at least for now. But I have been tiptoeing back into some little custom tools that only I use, with a focus on just my own computing experience. Here’s a quick tour. Hammerspoon Hammerspoon is an extremely powerful scripting tool for macOS that lets you write custom keyboard shortcuts, UIs, and more with the very friendly little language Lua. Right now my Hammerspoon configuration is very simple, but I think I’ll use it for a lot more as time progresses. Here it is: hs.hotkey.bind({"cmd", "shift"}, "return", function() local frontmost = hs.application.frontmostApplication() if frontmost:name() == "Ghostty" then frontmost:hide() else hs.application.launchOrFocus("Ghostty") end end) Not much! But I recently switched to Ghostty as my terminal, and I heavily relied on iTerm2’s global show/hide shortcut. Ghostty doesn’t have an equivalent, and Mikael Henriksson suggested a script like this in GitHub discussions, so I ran with it. Hammerspoon can do practically anything, so it’ll probably be useful for other stuff too. SwiftBar I review a lot of PRs these days. I wanted an easy way to see how many were in my review queue and go to them quickly. So, this script runs with SwiftBar, which is a flexible way to put any script’s output into your menu bar. It uses the GitHub CLI to list the issues, and jq to massage that output into a friendly list of issues, which I can click on to go directly to the issue on GitHub. #!/bin/bash # <xbar.title>GitHub PR Reviews</xbar.title> # <xbar.version>v0.0</xbar.version> # <xbar.author>Tom MacWright</xbar.author> # <xbar.author.github>tmcw</xbar.author.github> # <xbar.desc>Displays PRs that you need to review</xbar.desc> # <xbar.image></xbar.image> # <xbar.dependencies>Bash GNU AWK</xbar.dependencies> # <xbar.abouturl></xbar.abouturl> DATA=$(gh search prs --state=open -R val-town/val.town --review-requested=@me --json url,title,number,author) echo "$(echo "$DATA" | jq 'length') PR" echo '---' echo "$DATA" | jq -c '.[]' | while IFS= read -r pr; do TITLE=$(echo "$pr" | jq -r '.title') AUTHOR=$(echo "$pr" | jq -r '.author.login') URL=$(echo "$pr" | jq -r '.url') echo "$TITLE ($AUTHOR) | href=$URL" done Tampermonkey Tampermonkey is essentially a twist on Greasemonkey: both let you run your own JavaScript on anybody’s webpage. Sidenote: Greasemonkey was created by Aaron Boodman, who went on to write Replicache, which I used in Placemark, and is now working on Zero, the successor to Replicache. Anyway, I have a few fancy credit cards which have ‘offers’ which only work if you ‘activate’ them. This is an annoying dark pattern! And there’s a solution to it - CardPointers - but I neither spend enough nor care enough about points hacking to justify the cost. Plus, I’d like to know what code is running on my bank website. So, Tampermonkey to the rescue! I wrote userscripts for Chase, American Express, and Citi. You can check them out on this Gist but I strongly recommend to read through all the code because of the afore-mentioned risks around running untrusted code on your bank account’s website! Obsidian Freeform This is a plugin for Obsidian, the notetaking tool that I use every day. Freeform is pretty cool, if I can say so myself (I wrote it), but could be much better. The development experience is lackluster because you can’t preview output at the same time as writing code: you have to toggle between the two states. I’ll fix that eventually, or perhaps Obsidian will add new API that makes it all work. I use Freeform for a lot of private health & financial data, almost always with an Observable Plot visualization as an eventual output. For example, when I was switching banks and one of the considerations was mortgage discounts in case I ever buy a house (ha 😢), it was fun to chart out the % discounts versus the required AUM. It’s been really nice to have this kind of visualization as ‘just another document’ in my notetaking app. Doesn’t need another server, and Obsidian is pretty secure and private.
At a conference a while back, I noticed a couple of speakers get such a confidence boost after solving a small technical glitch. We should probably make that a part of every talk. Have the mic not connect automatically, or an almost-complete puzzle on the stage that the speaker can finish, or have someone forget their badge and the speaker return it to them. Maybe the next time I, or a consenting teammate, have to give a presentation I’ll try to engineer such a situation. All conference talks should start with a small technical glitch that the speaker can easily solve was originally published by Ognjen Regoje at Ognjen Regoje • ognjen.io on April 03, 2025.
A large part of our civilisation rests on the shoulders of one medieval monk: Thomas Aquinas. Amid the turmoil of life, riddled with wickedness and pain, he would insist that our world is good. And all our success is built on this belief. Note: Before we start, let’s get one thing out of the way: Thomas Aquinas is clearly a Christian thinker, a Saint even. Yet he was also a brilliant philosopher. So even if you consider yourself agnostic or an atheist, stay with me, you will still enjoy his ideas. What is good? Thomas’ argument is rooted in Aristotle’s concept of goodness: Something is good if it fulfills its function. Aristotle had illustrated this idea with a knife. A knife is good to the extent that it cuts well. He made a distinction between an actual knife and its ideal function. That actual thing in your drawer is the existence of a knife. And its ideal function is its essence—what it means to be a knife: to cut well. So everything is separated into its existence and its ideal essence. And this is also true for humans: We have an ideal conception of what the essence of a human […] The post Thomas Aquinas — The world is divine! appeared first on Ralph Ammer.