A surprising number of strategies are doomed from inception because their authors get attached to one particular approach without considering alternatives that would work better for their current circumstances. This happens when engineers want to pick tools solely because they are trending, and when executives insist on adopting the tech stack from their prior organization where they felt comfortable. Exploration is the antidote to early anchoring, forcing you to consider the problem widely before evaluating any of the paths forward. Exploration is about updating your priors before assuming the industry hasn’t evolved since you last worked on a given problem. Exploration is continuing to believe that things can get better when you’re not watching.

This chapter covers:

The goals of the exploration phase of strategy creation
When to explore (always first!) and when it makes sense to stop exploring
How to explore a topic, including discussion of the most common mechanisms: mining for...


More from Irrational Exuberance

How should we control access to user data?

At some point in a startup’s lifecycle, they decide that they need to be ready to go public in 18 months, and a flurry of IPO-readiness activity kicks off. This strategy focuses on a company working on IPO readiness, which has identified a gap in their internal controls for managing access to their users’ data. It’s a company that wants to meaningfully improve their security posture around user data access, but which has had a number of failed security initiatives over the years. Most of those initiatives have failed because they significantly degraded internal workflows for teams like customer support, such that the initial progress was reverted and subverted over time, to little long-term effect. This strategy represents the Chief Information Security Officer’s (CISO) attempt to acknowledge and overcome those historical challenges while meeting their IPO readiness obligations, and–most importantly–doing right by their users.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this document

To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore, then Diagnose and so on. Relative to the default structure, this document has been refactored in two ways to improve readability: first, Operation has been folded into Policy; second, Refine has been embedded in Diagnose. More detail on this structure in Making a readable Engineering Strategy document.

Policy & Operations

Our new policies, and the mechanisms to operate them, are:

Controls for accessing user data must be significantly stronger prior to our IPO. Senior leadership, legal, compliance and security have decided that we are not comfortable accepting the status quo of our user data access controls as a public company, and must meaningfully improve the quality of resource-level access controls as part of our pre-IPO readiness efforts. Our Security team is accountable for the exact mechanisms and approach to addressing this risk.

We will continue to prioritize a hybrid solution to resource-access controls. This has been our approach thus far, and the fastest available option.

Directly expose the log of our resource-level accesses to our users. We will build towards a user-accessible log of all company accesses of user data, and ensure we are comfortable explaining each and every access. In addition, it means that each rationale for access must be comprehensible and reasonable from a user perspective. This is important because it aligns our approach with our users’ perspectives. They will be able to evaluate how we access their data, and make decisions about continuing to use our product based on whether they agree with our use.

Good security discussions don’t frame decisions as a compromise between security and usability. We will pursue multi-dimensional tradeoffs to simultaneously improve security and efficiency. Whenever we frame a discussion on trading off between security and utility, it’s a sign that we are having the wrong discussion, and that we should rethink our approach.

We will prioritize mechanisms that can both automatically authorize and automatically document the rationale for accesses to customer data.
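The next paragraph gives the most obvious example of that policy: granting a support agent access while they hold an open ticket for the user. As a rough sketch of what such a mechanism could look like (the helper names and ticketing client are hypothetical, not this company’s implementation), the authorization check and the user-comprehensible rationale can come from the same code path:

    from dataclasses import dataclass

    @dataclass
    class AccessDecision:
        allowed: bool
        rationale: str  # written so a user could read and understand it

    def authorize_user_data_access(agent_id: str, user_id: str, ticketing) -> AccessDecision:
        """Grant access only while the agent holds an open ticket from this user,
        and produce an automated, user-comprehensible rationale either way.

        `ticketing` is a hypothetical client for the 3rd-party support platform.
        Because the check runs on every access, revocation on ticket reassignment
        or resolution is implicit."""
        ticket = ticketing.open_ticket_for(user_id=user_id, assignee=agent_id)
        if ticket is None:
            return AccessDecision(False, "No open support ticket links this agent to your account.")
        return AccessDecision(
            True,
            f"Support agent is assigned to your open ticket #{ticket.id}, "
            f"opened by you on {ticket.created_at:%Y-%m-%d}.",
        )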
The most obvious example of this is automatically granting access to a customer support agent for users who have an open support ticket assigned to that agent. (And removing that access when that ticket is reassigned or resolved.)

Measure progress on the percentage of customer data access requests justified by a user-comprehensible, automated rationale. This will anchor our approach on simultaneously improving the security of user data and the usability of our colleagues’ internal tools. If we only expand requirements for accessing customer data, we won’t view this as progress because it’s not automated (and consequently is likely to encourage workarounds as teams try to solve problems quickly). Similarly, if we only improve usability, charts won’t represent this as progress, because we won’t have increased the number of supported requests. As part of this effort, we will create a private channel where the security and compliance team has visibility into all manual rationales for user-data access, and will directly message the manager of any individual who relies on a manual justification for accessing user data.

Expire unused roles to move towards the principle of least privilege. Today we have a number of roles granted in our role-based access control (RBAC) system to users who do not use the granted permissions. To address that issue, we will automatically remove roles from colleagues after 90 days of not using the role’s permissions. Engineers in an active on-call rotation are the exception to this automated permission pruning.

Weekly reviews until we see progress; monthly access reviews in perpetuity. Starting now, there will be a weekly sync between the security engineering team, teams working on customer data access initiatives, and the CISO. This meeting will focus on rapid iteration and problem solving. This is explicitly a forum for ongoing strategy testing, with the CISO serving as the meeting’s sponsor, and their Principal Security Engineer serving as the meeting’s guide. It will continue until we have clarity on the path to 100% coverage of user-comprehensible, automated rationales for access to customer data. Separately, we are also starting a monthly review of sampled accesses to customer data to ensure the proper usage and function of the rationale-creation mechanisms we build. This meeting’s goal is to review access rationales for quality and appropriateness, both by reviewing sampled rationales in the short-term, and by identifying more automated mechanisms for identifying high-risk accesses to review in the future.

Exceptions must be granted in writing by the CISO. While our overarching Engineering Strategy states that we follow an advisory architecture process as described in Facilitating Software Architecture, the customer data access policy is an exception and must be explicitly approved, with documentation, by the CISO. Start that process in the #ciso channel.

Diagnose

We have a strong baseline of role-based access controls (RBAC) and audit logging. However, we have limited mechanisms for ensuring assigned roles follow the principle of least privilege. This is particularly true in cases where individuals change teams or roles over the course of their tenure at the company: some individuals have collected numerous unused roles over five-plus years at the company. Similarly, our audit logs are durable and pervasive, but we have limited proactive mechanisms for identifying anomalous usage.
Instead, they are typically used to understand what occurred after an incident is identified by other mechanisms.

For resource-level access controls, we rely on a hybrid approach between a 3rd-party platform for incoming user requests, and approval mechanisms within our own product. Providing a rationale for access across these two systems requires manual work, and those rationales are later manually reviewed for appropriateness in a batch fashion. There are two major ongoing problems with our current approach to resource-level access controls. First, the teams making requests view them as a burdensome obligation without much benefit to them or to the user. Second, because the rationale review steps are manual, there is no verifiable evidence of the quality of the review.

We’ve found no evidence of misuse of user data. When colleagues do access user data, we have uniformly and consistently found that there is a clear and reasonable rationale for that access, for example a ticket in the user support system where the user has raised an issue. However, the quality of our documented rationales is consistently low because it depends on busy people manually copying over significant information many times a day. Because the rationales are of low quality, the verification of these rationales is somewhat arbitrary. From a literal compliance perspective, we do provide rationales and auditing of these rationales, but it’s unclear if the majority of these audits increase the security of our users’ data.

Historically, we’ve made significant security investments that caused temporary spikes in our security posture. However, looking at those initiatives a year later, in many cases we see a pattern of increased scrutiny, followed by a gradual repeal or avoidance of the new mechanisms. We have found that most of them involved increased friction for essential work performed by other internal teams. In the natural order of performing work, those teams would subtly subvert the improvements because they interfered with their immediate goals (e.g. supporting customer requests). As such, we have high conviction from our track record that our historical approach can create optical wins internally. We have limited conviction that it can create long-term improvements outside of significant, unlikely internal changes (e.g. colleagues being markedly less busy a year from now than they are today). It seems likely we need a new approach to meaningfully shift our stance on these kinds of problems.

Explore

Our experience is that best practices around managing internal access to user data are widely available through our networks, but otherwise hard to find. The exact rationale for this is hard to determine, but it seems possible that it’s a topic that folks are generally uncomfortable discussing in public on account of potential future liability and compliance issues.

In our exploration, we found two standardized dimensions (role-based access controls, audit logs), and one highly divergent dimension (resource-specific access controls):

Role-based access controls (RBAC) are a highly standardized approach at this point. The core premise is that users are mapped to one or more roles, and each role is granted a certain set of permissions. For example, a role representing the customer support agent might be granted permission to deactivate an account, whereas a role representing the sales engineer might be able to configure a new account.

Audit logs are similarly standardized.
All access and mutation of resources should be tied in a durable log to the human who performed the action. These logs should be accumulated in a centralized, queryable solution. One of the core challenges is determining how to utilize these logs proactively to detect issues rather than reactively when an issue has already been flagged.

Resource-level access controls are significantly less standardized than RBAC or audit logs. We found three distinct patterns adopted by companies, with little consistency across companies on which is adopted. Those three patterns for resource-level access control were:

3rd-party enrichment, where access to resources is managed in a 3rd-party system such as Zendesk. This requires enriching objects within those systems with data and metadata from the product(s) where those objects live. It also requires implementing actions on the platform, such as archiving or configuration, allowing them to live entirely in that platform’s permission structure. The downside of this approach is tight coupling with the platform vendor, any limitations inherent to that platform, and the overhead of maintaining engineering teams familiar with both your internal technology stack and the platform vendor’s technology stack.

1st-party tool implementation, where all activity, including creation and management of user issues, is managed within the core product itself. This pattern is most common in earlier stage companies or companies whose customer support leadership “grew up” within the organization without much exposure to the approach taken by peer companies. The advantage of this approach is that there is a single, tightly integrated and infinitely extensible platform for managing interactions. The downside is that you have to build and maintain all of that work internally rather than pushing it to a vendor that ought to be able to invest more heavily into their tooling.

Hybrid solutions, where a 3rd-party platform is used for most actions, and is further used to permit resource-level access within the 1st-party system. For example, you might be able to access a user’s data only while there is an open ticket created by that user, and assigned to you, in the 3rd-party platform. The advantage of this approach is that it allows supporting complex workflows that don’t fit within the platform’s limitations, and allows you to avoid complex coupling between your product and the vendor platform.

Generally, our experience is that all companies implement RBAC, audit logs, and one of the resource-level access control mechanisms. Most companies pursue either 3rd-party enrichment with a sizable, long-standing team owning the platform implementation, or rely on a hybrid solution where they are able to avoid a long-standing dedicated team by lumping that work into existing teams.
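To make the hybrid pattern concrete, here is a minimal sketch of the kind of check it implies: the 3rd-party support platform decides whether access is currently permitted, while the record itself lives in the 1st-party system. The `support_platform` and `db` clients are invented for illustration; this is not any company’s production code:

    class ResourceAccessDenied(Exception):
        pass

    def fetch_user_record(requester_id: str, user_id: str, support_platform, db):
        """Hybrid pattern: the 3rd-party platform gates access, the 1st-party
        system holds the data and writes the durable audit log entry."""
        if not support_platform.has_open_assigned_ticket(assignee=requester_id, requester=user_id):
            raise ResourceAccessDenied(
                f"{requester_id} has no open ticket from {user_id}; access not granted."
            )
        # Tie the access to the centralized, queryable audit log described above.
        db.audit_log.record(actor=requester_id, subject=user_id, action="read_user_record")
        return db.users.get(user_id)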

Our own agents with their own tools.

Entering 2025, I decided to spend some time exploring the topic of agents. I started reading Anthropic’s Building effective agents, followed by Chip Huyen’s AI Engineering. I kicked off a major workstream at work on using agents, and I also decided to do a personal experiment of sorts. This is a general commentary on building that project.

What I wanted to build was a simple chat interface where I could write prompts, select models, and have the model use tools as appropriate. My side goal was to build this using Cursor and generally avoid writing code directly as much as possible, but I found that generally slower than writing code in emacs while relying on 4o-mini to provide working examples to pull from. Similarly, while I initially envisioned building this in fullstack TypeScript via Cursor, I ultimately bailed into a stack that I’m more comfortable with, and ended up using Python3, FastAPI, PostgreSQL, and SQLAlchemy with the async psycopg3 driver. It’s been a… while… since I started a brand new Python project, and I used this project as an opportunity to get comfortable with Python3’s async/await mechanisms along with Python3’s typing and mypy. Finally, I also wanted to experiment with Tailwind, and ended up using TailwindUI’s components to build the site.

The working version supports everything I wanted: creating chats with models, and allowing those models to use function calling to use tools that I provide. The models are allowed to call any number of tools in pursuit of the problem they are solving. The tool usage is the most interesting part here for sure. The simplest tool I created was a get_temperature tool that provided a fake temperature for your location. This allowed me to ask questions like “What should I wear tomorrow in San Francisco, CA?” and get a useful response. The code to add this function to my project was pretty straightforward, just three lines of Python and 25 lines of metadata to pass to the OpenAI API.

    def tool_get_current_weather(location: str|None=None, format: str|None=None) -> str:
        "Simple proof of concept tool."
        temp = random.randint(40, 90) if format == 'fahrenheit' else random.randint(10, 25)
        return f"It's going to be {temp} degrees {format} tomorrow."

    FUNCTION_REGISTRY['get_current_weather'] = tool_get_current_weather

    TOOL_USAGE_REGISTRY['get_current_weather'] = {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the users location.",
                    },
                },
                "required": ["location", "format"],
            },
        }
    }

After getting this tool working, the next tool I added was a simple URL retriever tool, which allowed the agent to grab a URL and use the content of that URL in its prompt. The implementation for this tool was similarly quite simple.
    def tool_get_url(url: str|None=None) -> str:
        if url is None:
            return ''
        url = str(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        content = soup.find('main') or soup.find('article') or soup.body
        if not content:
            return str(response.content)
        markdown = markdownify(str(content), heading_style="ATX").strip()
        return str(markdown)

    FUNCTION_REGISTRY['get_url'] = tool_get_url

    TOOL_USAGE_REGISTRY['get_url'] = {
        "type": "function",
        "function": {
            "name": "get_url",
            "description": "Retrieve the contents of a website via its URL.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The complete URL, including protocol to retrieve. For example: \"https://lethain.com\"",
                    }
                },
                "required": ["url"],
            },
        }
    }

What’s pretty amazing is how much power you can add to your agent by adding such a trivial tool as retrieving a URL. You can similarly imagine adding tools for retrieving and commenting on Github pull requests and so on, which could allow a very simple agent tool like this to become quite useful.

Working on this project gave me a moderately compelling view of a near-term future where most engineers have a simple application like this running that they can pipe events into from various systems (email, text, Github pull requests, calendars, etc), create triggers that map events to templates that feed into prompts, and execute those prompts with tool-aware agents. Combine that with the ability for other agents to register themselves with you and expose the tools that they have access to (e.g. schedule an event with the tool’s owner), and a bunch of interesting things become very accessible with a very modest amount of effort:

You could schedule events between two busy people’s calendars, as if both of them had an assistant managing their calendar
Reply to your own pull requests with new blog posts, providing feedback on typos and grammatical issues
Crawl websites you care about and identify posts you might be interested in
Ask the model to generate a system model using lethain:systems, run that model, then chart the responses
Add a “planning tool” which allows the model to generate a plan to guide subsequent steps in a complex task (e.g. getting my calendar, getting a friend’s calendar, suggesting a time we could meet)

None of these are exactly lifesaving, but each is somewhat useful, and I imagine there are many more fairly obvious ideas that become easy once you have the necessary scaffolding to make this sort of thing easy. Altogether, I think I am convinced at this point that agents, using current foundational models, are going to create a number of very interesting experiences that improve our day-to-day lives in small ways that are, in aggregate, pretty transformational. I’m less convinced that this is the way all software should work going forward, though; more thoughts on that over time. (A bunch of fun experiments happening at work, but early days on those.)
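The post shows the tool definitions but not the loop that wires them to the model. As a rough illustration of how the two registries might be consumed, here is a minimal sketch assuming the openai v1 Python client and a gpt-4o-mini default; the looping, message handling, and error handling are simplified, and this is not the author’s actual implementation:

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def run_with_tools(prompt: str, model: str = "gpt-4o-mini") -> str:
        """Let the model call registered tools until it produces a final answer."""
        messages = [{"role": "user", "content": prompt}]
        while True:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                tools=list(TOOL_USAGE_REGISTRY.values()),
            )
            message = response.choices[0].message
            if not message.tool_calls:
                return message.content or ""
            messages.append(message)
            for call in message.tool_calls:
                # Dispatch to the registered Python function and feed the result back.
                fn = FUNCTION_REGISTRY[call.function.name]
                args = json.loads(call.function.arguments or "{}")
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": fn(**args),
                })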

Is engineering strategy useful?

While I frequently hear engineers bemoan a missing strategy, they rarely complete the thought by articulating why the missing strategy matters. Instead, it serves as more of a truism: the economy used to be better, children used to respect their parents, and engineering organizations used to have an engineering strategy.

This chapter starts by exploring something I believe quite strongly: there’s always an engineering strategy, even if there’s nothing written down. From there, we’ll discuss why strategy, especially written strategy, is such a valuable opportunity for organizations that take it seriously. We’ll dig into:

Why there’s always a strategy, even when people say there isn’t
How strategies have been impactful across my career
How inappropriate strategies create significant organizational pain without much compensating impact
How written strategy drives organizational learning
The costs of not writing strategy down
How strategy supports personal learning and development, even in cases where you’re not empowered to “do strategy” yourself

By this chapter’s end, hopefully you will agree with me that strategy is an undertaking worth investing your–and your organization’s–time in.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

There’s always a strategy

I’ve never worked somewhere where people didn’t claim there was no strategy. In many of those companies, they’d say there was no engineering strategy. Once I became an executive and was able to document and distribute an engineering strategy, accusations of missing strategy didn’t go away, they just shifted to focus on a missing product or company strategy. This even happened at companies that definitively had engineering strategies, like Stripe in 2016, which had numerous pillars to a clear engineering strategy such as:

Maintain backwards API compatibility, at almost any cost (e.g. force an upgrade from TLS 1.2 to TLS 1.3 to retain PCI compliance, but don’t force upgrades from the /v1/charges endpoint to the /v1/payment_intents endpoint)
Work in Ruby in a monorepo, unless it’s the PCI environment, data processing, or data science work
Engineers are fully responsible for the usability of their work, even when there are product or engineering managers involved

Working there, it was generally clear what the company’s engineering strategy was on any given topic. That said, it sometimes required asking around, and over time certain decisions became sufficiently contentious that it became hard to definitively answer what the strategy was. For example, the adoption of Ruby versus Java became contentious enough that I distributed a strategy attempting to mediate the disagreement, Magnitudes of exploration, although it wasn’t a particularly successful effort (for reasons that are obvious in hindsight, particularly the lack of any enforcement mechanism).

In the same sense that William Gibson said “The future is already here – it’s just not very evenly distributed,” there is always a strategy embedded into an organization’s decisions, although in many organizations that strategy is only visible to a small group, and may be quickly forgotten. If you ever find yourself thinking that a strategy doesn’t exist, I’d encourage you to instead ask yourself where the strategy lives if you can’t find it.
Once you do find it, you may also find that the strategy is quite ineffective, but I’ve simply never found that it doesn’t exist.

Strategy is impactful

In “We are a product engineering company!”, we discuss Calm’s engineering strategy to address pervasive friction within the engineering team. The core of that strategy is clarifying how Calm makes major technology decisions, along with documenting the motivating goal steering those decisions: maximizing time and energy spent on creating their product. That strategy reduced friction by eliminating the cause of ongoing debate. It was successful in resetting the team’s focus. It also caused several engineers to leave the company, because it was incompatible with their priorities. It’s easy to view that as a downside, but I don’t think it was. A clear, documented strategy made it clear to everyone involved what sort of game we were playing, the rules for that game, and for the first time let them accurately decide if they wanted to be part of that game with the wider team.

Creating alignment is one of the ways that strategy makes an impact, but it’s certainly not the only way. Some of the ways that strategies support the creating organization are:

Concentrating company investment into a smaller space. For example, deciding not to decompose a monolith allows you to invest the majority of your tooling efforts on one language, one test suite, and one deployment mechanism.

Making properties available that only come through universal adoption. For example, moving to an “N-1 policy” on backfilled roles is a significant opportunity for managing costs, but only works if consistently adopted. As another example, many strategies for disaster recovery or multi-region are only viable if all infrastructure has a common configuration mechanism.

Focusing execution on what truly matters. For example, Uber’s service migration strategy allowed a four engineer team to migrate a thousand services operated by two thousand engineers to a new provisioning and orchestration platform in less than a year. This was an extraordinarily difficult project, and was only possible because of clear thinking.

Creating a knowledge repository of how your organization thinks. Onboarding new hires, particularly senior new hires, is much more effective with documented strategy. For example, most industry professionals today have a strongly held opinion on how to adopt large language models. New hires will have a strong opinion as well, but they’re unlikely to share your organization’s opinion unless there’s a clear document they can read to understand it.

There are some things that a strategy, even a cleverly written one, cannot do. However, it’s always been my experience that developing a strategy creates progress, even if the progress is understanding the inherent disagreement preventing agreement.

Inappropriate strategy is especially impactful

While good strategy can accomplish many things, it sometimes feels that inappropriate strategy is far more impactful. Of course, impactful in all the wrong ways. Digg V4 remains the worst considered strategy I’ve personally participated in. It was a complete rewrite of the Digg V3.5 codebase from a PHP monolith to a PHP frontend and a backend of a dozen Python services. It also moved the database from sharded MySQL to an early version of Cassandra. Perhaps worst, it replaced the nuanced algorithms developed over a decade with a hack implemented a few days before launch.
Although it’s likely Digg would have struggled to become profitable due to its reliance on search engine optimization for traffic, and Google’s frequently changing search algorithm of that era, the engineering strategy ensured we died fast rather than having an opportunity to dig our way out.

Importantly, it’s not just Digg. Almost every engineering organization you drill into will have its share of unused platform projects that captured decades of engineering years to the detriment of an important opportunity. A shocking number of senior leaders join new companies and initiate a grand migration that attempts to entirely rewrite the architecture, switch programming languages, or otherwise shift their new organization to resemble a prior organization where they understood things better.

Inappropriate versus bad

When I first wrote this section, I just labeled this sort of strategy as “bad.” The challenge with that term is that the same strategy might well be very effective in a different set of circumstances. For example, if Digg had been a three person company with no revenue, rewriting from scratch could have been the right decision! As a result, I’ve tried to prefer the term “inappropriate” rather than “bad” to avoid getting caught up on whether a given approach might work in other circumstances. Every approach undoubtedly works in some organization.

Written strategy drives organizational learning

When I joined Carta, I noticed we had an inconsistent approach to a number of important problems. Teams had distinct standard kits for how they approached new projects. Adoption of existing internal platforms was inconsistent, as was decision making around funding new internal platforms. There was widespread agreement that we were decomposing our monolith, but no agreement on how we were doing it. Coming into such a permissive strategy environment, with strong, differing perspectives on the ideal path forward, one of my first projects was writing down an explicit engineering strategy along with our newly formed Navigators team, itself a part of our new engineering strategy.

Navigators at Carta

As discussed in Navigators, we developed a program at Carta to have explicitly named individual contributor, technical leaders to represent key parts of the engineering organization. This representative leadership group made it possible to iterate on strategy with a small team of about ten engineers that represented the entire organization, rather than take on the impossible task of negotiating with 400 engineers directly.

This written strategy made it possible to explicitly describe the problems we saw, and how we wanted to navigate those problems. Further, it was an artifact that we were able to iterate on in a small group, but then share widely for feedback from teams we might have missed. After initial publishing, we shared it widely and talked about it frequently in engineering all-hands meetings. Then we came back to it each year, or when things stopped making much sense, and revised it. As an example, our initial strategy didn’t talk about artificial intelligence at all. A few months later, we extended it to mention a very conservative approach to using Large Language Models. Most recently, we’ve revised the artificial intelligence portion again, as we dive deeply into agentic workflows. A lot of people have disagreed with parts of the strategy, which is great: that’s one of the key benefits of a written strategy, it’s possible to precisely disagree.
From that disagreement, we’ve been able to evolve our strategy. Sometimes because there’s new information, like the current rapid evolution of artificial intelligence practices, and other times because our initial approach could be improved, like in how we gated membership of the initial Navigators team.

New hires are able to disagree too, and do it from an informed place rather than coming across as attached to their prior company’s practices. In particular, they’re able to understand the historical thinking that motivated our decisions, even when that context is no longer obvious. At the time we paused decomposition of our monolith, there was significant friction in service provisioning, but that’s far less true today, which makes the decision seem a bit arbitrary. Only the written document can consistently communicate that context across a growing, shifting, and changing organization. With oral history, what you believe is highly dependent on who you talk with, which shapes your view of history and the present. With written history, it’s far more possible to agree at scale, which is the prerequisite to growing at scale rather than isolating growth to small pockets of senior leadership.

The cost of implicit strategy

We just finished talking about written strategy, and this book spends a lot of time on this topic, including a chapter on how to structure strategies to maximize readability. It’s not just because of the positives created by written strategy, but also because of the damage unwritten strategy creates.

Vulnerable to misinterpretation. Information flow in verbal organizations depends on an individual being in a given room for a decision, and then accurately repeating that information to the others who need it. However, it’s common to see those individuals fail to repeat that information elsewhere. Sometimes their interpretation is also faulty to some degree. Both of these create significant problems in operating strategy.

Two-headed organizations

Some years ago, I started moving towards a model where most engineering organizations I worked with have two leaders: one who’s a manager, and another who is a senior engineer. This was partially to ensure engineering context was included in senior decision making, but it was also to reduce communication errors. Errors in point-to-point communication are so prevalent when done one-to-one that the only solution I could find for folks who weren’t reading-oriented communicators was ensuring I had communicated strategy (and other updates) to at least two people.

Inconsistency across teams. At one company I worked in, promotions to Staff-plus roles happened at a much higher rate in the infrastructure engineering organization than the product engineering team. This created a constant drain out of product engineering to work on infrastructure shaped problems, even if those problems weren’t particularly valuable to the business. New leaders had no idea this informal policy existed, and they would routinely run into trouble in calibration discussions. They also weren’t aware they needed to go argue for a better policy. Worse, no one was sure if this was a real policy or not, so it was ultimately random whether this perspective was represented for any given promotion: sometimes good promotions would be blocked, sometimes borderline cases would be approved.

Inconsistency over time. Implementing a new policy tends to be a mix of persistent and one-time actions.
For example, let’s say you wanted to standardize all HTTP operations to use the same library across your codebase. You might add a linter check to reject known alternatives, and you’ll probably do a one-time pass across your codebase standardizing on that library. However, two years later there are another three random HTTP libraries in your codebase, creeping into the cracks surrounding your linting. If the policy is written down, and a few people read it, then there’s a number of ways this could nonetheless be prevented. If it’s not written down, it’s much less likely someone will remember, and much more likely they won’t remember the rationale well enough to argue about it. (A minimal sketch of such a linter check appears at the end of this chapter.)

Hazard to new leadership. When a new Staff-plus engineer or executive joins a company, it’s common to blame them for failing to understand the existing context behind decisions. That’s fair: a big part of senior leadership is uncovering and understanding context. It’s also unfair: explicit documentation of prior thinking would have made this much easier for them. Every particularly bad new-leader onboarding that I’ve seen has involved a new leader coming into an unfilled role that the new leader’s manager didn’t know how to do. In those cases, success is entirely dependent on that new leader’s ability and interest in learning.

In most ways, the practice of documenting strategy has a lot in common with succession planning, where the full benefits accrue to the organization rather than to the individual doing it. It’s possible to maintain things when the original authors are present, but appreciating the value requires stepping outside yourself for a moment to value things that will matter most to the organization when you’re no longer a member.

Information herd immunity

A frequent objection to written strategy is that no one reads anything. There’s some truth to this: it’s extremely hard to get everyone in an organization to know something. However, I’ve never found that goal to be particularly important. My view of information dispersal in an organization is the same as herd immunity: you don’t need everyone to know something, just to have enough people who know something that confusion doesn’t propagate too far. So, it may be impossible for all engineers to know strategy details, but you certainly can have every Staff-plus engineer and engineering manager know those details.

Strategy supports personal learning

While I believe that the largest benefits of strategy accrue to the organization, rather than the individual creating it, I also believe that strategy is an underrated avenue for self-development. The ways that I’ve seen strategy support personal development are:

Creating strategy builds self-awareness. Starting with a concrete example, I’ve worked with several engineers who viewed themselves as extremely senior, but frequently demanded that projects were implemented using new programming languages or technologies because they personally wanted to learn about the technology. Their internal strategy was clear–they wanted to work on something fun–but following the steps to build an engineering strategy would have created a strategy that even they agreed didn’t make sense.

Strategy supports situational awareness in new environments. Wardley mapping talks a lot about situational awareness as a prerequisite to good strategy. This is ensuring you understand the realities of your circumstances, which is the most destructive failure of new senior engineering leaders.
Explicitly stating the diagnosis where the strategy applied makes it easier for you to debug why reusing a prior strategy in a new team or company might not work.

Strategy as your personal archive. Just as documented strategy is institutional memory, it also serves as personal memory to understand the impact of your prior approaches. Each of us is an archivist of our prior work, pulling out the most valuable pieces to address the problem at hand. Over a long career, memory fades–and motivated reasoning creeps in–but explicit documentation doesn’t. Indeed, part of the reason I started working on this book now rather than later is that I realized I was starting to forget the details of the strategy work I did earlier in my career. If I wanted to preserve the wisdom of that era, and ensure I didn’t have to relearn the same lessons in the future, I had to write it now.

Summary

We’ve covered why strategy can be a valuable learning mechanism for both your engineering organization and for you. We’ve shown how strategies have helped organizations deal with service migrations, monolith decomposition, and right-sizing backfilling. We’ve also discussed how inappropriate strategy contributed to Digg’s demise. However, if I had to pick two things to emphasize as this chapter ends, it wouldn’t be any of those things. Rather, it would be two themes that I find are the most frequently ignored:

There’s always a strategy, even if it isn’t written down.
The single biggest act you can take to further strategy in your organization is to write down strategy so it can be debated, agreed upon, and explicitly evolved.

Discussions around topics like strategy often get caught up in high prestige activities like making controversial decisions, but the most effective strategists I’ve seen make more progress by actually performing the basics: writing things down, exploring widely to see how other companies solve the same problem, accepting feedback into their draft from folks who disagree with them. Strategy is useful, and doing strategy can be simple, too.
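As referenced earlier in this chapter, the persistent half of a “standardize on one HTTP library” policy is usually an automated check. A minimal sketch of what that could look like, assuming the approved library is requests; the banned list and error format are illustrative, not any particular organization’s tooling:

    import ast
    import sys

    BANNED_HTTP_LIBRARIES = {"urllib3", "httplib2", "aiohttp"}  # illustrative, not exhaustive

    def find_banned_imports(path: str) -> list[str]:
        """Return lint errors for imports of non-approved HTTP libraries in one file."""
        tree = ast.parse(open(path).read(), filename=path)
        errors = []
        for node in ast.walk(tree):
            names = []
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            for name in names:
                if name.split(".")[0] in BANNED_HTTP_LIBRARIES:
                    errors.append(f"{path}:{node.lineno}: use requests instead of {name}")
        return errors

    if __name__ == "__main__":
        problems = [e for path in sys.argv[1:] for e in find_banned_imports(path)]
        print("\n".join(problems))
        sys.exit(1 if problems else 0)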

"We're a product engineering company!" -- Engineering strategy at Calm.

In my career, the majority of the strategy work I’ve done has been in non-executive roles, things like Uber’s service migration. Joining Calm was my first executive role, where I was able to not just propose, but also mandate, strategy. Like almost all startups, the engineering team was scattered when I joined. Was our most important work creating more scalable infrastructure? Was our greatest risk the failure to adopt leading programming languages? How did we rescue the stuck service decomposition initiative? This strategy is where the engineering team and I aligned after numerous rounds of iteration, debate, and inevitably some disagreement. As a strategy, it’s both basic and unambiguous about what we valued, and I believe it’s a reasonably good starting point for any low scalability-complexity consumer product.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

Reading this document

To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore, then Diagnose and so on. Relative to the default structure, this document has one tweak, folding the Operation section in with Policy. More detail on this structure in Making a readable Engineering Strategy document.

Policy & Operation

Our new policies, and the mechanisms to operate them, are:

We are a product engineering company. Users write in every day to tell us that our product has changed their lives for the better. Our technical infrastructure doesn’t get many user letters–and this is unlikely to change going forward, as our infrastructure is relatively low-scale and low-complexity. Rather than attempting to change that, we want to devote the absolute maximum possible attention to product engineering.

We exclusively adopt new technologies to create valuable product capabilities. We believe our technology stack as it exists today can solve the majority of our current and future product roadmaps. In the rare case where we adopt a new technology, we do so because a product capability is inherently impossible without adopting a new technology. We do not adopt new technologies for other reasons. For example, we would not adopt a new technology because someone is interested in learning about it. Nor would we adopt a technology because it is 30% better suited to a task.

We write all code in the monolith. It has been ambiguous whether new code (especially new application code) should be written in our JavaScript monolith, or if all new code must be written in a new service outside of the monolith. This is no longer ambiguous: all new code must be written in the monolith. In the rare case that there is a functional requirement that makes writing in the monolith implausible, then you should seek an exception as described below.

Exceptions are granted by the CTO, and must be in writing. The above policies are deliberately restrictive. Sometimes they may be wrong, and we will make exceptions to them. However, each exception should be deliberate and grounded in concrete problems we are aligned both on solving and on how we solve them. If we all scatter towards our preferred solution, then we’ll create negative leverage for Calm rather than serving as the engine that advances our product. All exceptions must be written.
If they are not written, then you should operate as if the exception has not been granted. Our goal is to avoid ambiguity around whether an exception has, or has not, been approved. If there’s no written record that the CTO approved it, then it’s not approved.

Proving the point about exceptions, there are two confirmed exceptions to the above strategy:

We are incrementally migrating to TypeScript. We have found that static typing can prevent a number of our user-facing bugs. TypeScript provides a clean, incremental migration path for our JavaScript codebase, and we aim to migrate the entirety over the next six months. Our Web engineering team is leading this migration.

We are evaluating Postgres Aurora as our primary database. Many of our recent production incidents are caused by index scans for tables with high write velocity, such as tracking customer logins. We believe Aurora will perform better under these workloads. Our Infrastructure engineering team is leading this initiative.

Diagnose

The current state of our engineering organization:

Our product is not limited by missing infrastructure capabilities. Reviewing our roadmap, there’s nothing that we are trying to build today or over the next year that is constrained by our technical infrastructure.

Our uptime, stability and latency are OK but not great. We have semi-frequent stability and latency issues in our application, all of which are caused by one of two issues. First, deploying new code with a missing index because it performed well enough in a test environment. Second, writes to a small number of extremely large, skinny tables have become expensive in combination with scans over those tables’ indexes.

Our infrastructure team is split between supporting monolith and service workflows. One way to measure technical debt is to understand how much time the team is spending propping up the current infrastructure. Today, that is meaningful but not overwhelming work for our team of three infrastructure engineers supporting 30 product engineers. However, we are finding infrastructure engineers increasingly pulled into debugging incidents for components moved out of the central monolith into our service architecture. This is partially due to increased inherent complexity, but it’s more due to exposing a lack of monitoring and ambiguous accountability in services’ production incidents.

Our product and executive stakeholders experience us as competing factions. Engineering exists to build and operate software in the company. Part of that is being easy to work with. We should not necessarily support every ask from Product if we believe they are misaligned with Engineering’s goals (e.g. maintaining security), but we should generally provide a consistent perspective across our team. Today, our stakeholders believe they will get radically different answers to basic questions of capabilities and approach depending on who they ask. If they try to get a group of engineers to agree on an approach, they often find we derail into debate about approach rather than articulating a clear point of view that allows the conversation to move forward.

We’re arguing a particularly large amount about adopting new technologies and rewrites. Most of our disagreements stem from adopting new technologies or rewriting existing components into new technology stacks. For example, can we extend this feature or do we have to migrate it to a service before extending it? Can we add this to our database or should we move it into a new Redis cache instead?
Is JavaScript a sufficient programming language, or do we need to rewrite this functionality in Go? This is particularly relevant to next steps around the ongoing services migration, which has been in flight for over a year, but has yet to move any core production code.

We are spending more time on infrastructure and platform work than product work. This is the combination of all the above issues, from the stability issues we are encountering in our database design, to the lack of engineering alignment on execution. This places us at odds with stakeholder expectations that we are predominantly focused on new product development.

Explore

Calm is a mobile application that guides users to build and maintain either a meditation or sleep habit. Recommendations and guidance across content are individual to the user, but the content is shared across all customers and is amenable to caching on a content delivery network (CDN). As long as the CDN is available, the mobile application can operate despite an inability to access servers (e.g. the application remains usable from a user’s perspective, even if the non-CDN production infrastructure is unreachable).

In 2010, enabling a product of this complexity would have required significant bespoke infrastructure, along with likely maintaining a physical presence in a series of datacenters to run your software. In 2020, comparable applications are generally moving towards maintaining as little internal infrastructure as possible. This perspective is summarized effectively in Intercom’s Run Less Software and Dan McKinley’s Choose Boring Technology. New companies founded in this space view essentially all infrastructure as a commodity bought off your cloud provider. This even extends to areas of innovation, such as machine learning, where the training infrastructure is typically run on an offering like AWS Bedrock, and the model infrastructure is provided by Anthropic or OpenAI.
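The diagnosis above attributes many incidents to queries that ship without an index because they perform fine on small test data. One hedged sketch of a guard against that is running EXPLAIN against a production-sized database in CI and flagging sequential scans; the table names, queries, and connection string here are invented for illustration, and this is not Calm’s actual tooling:

    import json
    import sys

    import psycopg2  # assumes a reachable Postgres with production-like row counts

    CANDIDATE_QUERIES = {  # hypothetical queries pulled from code under review
        "recent_logins": "SELECT * FROM customer_logins WHERE customer_id = 42 ORDER BY created_at DESC LIMIT 20",
    }

    def plan_for(cur, query: str) -> dict:
        cur.execute("EXPLAIN (FORMAT JSON) " + query)
        plan = cur.fetchone()[0]
        if isinstance(plan, str):  # some drivers return the JSON plan as text
            plan = json.loads(plan)
        return plan[0]["Plan"]

    def has_seq_scan(plan: dict) -> bool:
        """Recursively look for sequential scans anywhere in the plan tree."""
        if plan.get("Node Type") == "Seq Scan":
            return True
        return any(has_seq_scan(child) for child in plan.get("Plans", []))

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=calm_ci")  # hypothetical CI database
        failures = []
        with conn, conn.cursor() as cur:
            for name, query in CANDIDATE_QUERIES.items():
                if has_seq_scan(plan_for(cur, query)):
                    failures.append(f"{name}: plan contains a sequential scan; is an index missing?")
        print("\n".join(failures))
        sys.exit(1 if failures else 0)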


More in programming

Five Kinds of Nondeterminism

No newsletter next week, I'm teaching a TLA+ workshop. Speaking of which: I spend a lot of time thinking about formal methods (and TLA+ specifically) because it's the source of almost all my revenue. But I don't share most of the details because 90% of my readers don't use FM and never will. I think it's more interesting to talk about ideas from FM that would be useful to people outside that field. For example, the idea of "property strength" translates to the idea that some tests are stronger than others.

Another possible export is how FM approaches nondeterminism. A nondeterministic algorithm is one that, from the same starting conditions, has multiple possible outputs. This is nondeterministic:

    # Pseudocode
    def f() {
      return rand()+1;
    }

When specifying systems, I may not encounter nondeterminism more often than in real systems, but I am definitely more aware of its presence. Modeling nondeterminism is a core part of formal specification. I mentally categorize nondeterminism into five buckets. Caveat, this is specifically about nondeterminism from the perspective of system modeling, not computer science as a whole. If I tried to include stuff on NFAs and amb operations this would be twice as long.1

1. True Randomness

Programs that literally make calls to a random function and then use the results. This is the simplest type of nondeterminism and one of the most ubiquitous. Most of the time, random isn't truly nondeterministic. Most of the time computer randomness is actually pseudorandom, meaning we seed a deterministic algorithm that behaves "randomly-enough" for some use. You could "lift" a nondeterministic random function into a deterministic one by adding a fixed seed to the starting state.

    # Python
    from random import random, seed

    def f(x):
        seed(x)
        return random()

    >>> f(3)
    0.23796462709189137
    >>> f(3)
    0.23796462709189137

Often we don't do this because the point of randomness is to provide nondeterminism! We deliberately abstract out the starting state of the seed from our program, because it's easier to think about it as locally nondeterministic. (There's also "true" randomness, like using thermal noise as an entropy source, which I think is mainly used for cryptography and seeding PRNGs.)

Most formal specification languages don't deal with randomness (though some deal with probability more broadly). Instead, we treat it as a nondeterministic choice:

    # software
    if rand > 0.001 then return a else crash

    # specification
    either return a or crash

This is because we're looking at worst-case scenarios, so it doesn't matter if crash happens 50% of the time or 0.0001% of the time, it's still possible.

2. Concurrency

    # Pseudocode
    global x = 1, y = 0;

    def thread1() { x++; x++; x++; }

    def thread2() { y := x; }

If thread1() and thread2() run sequentially, then (assuming the sequence is fixed) the final value of y is deterministic. If the two functions are started and run simultaneously, then depending on when thread2 executes, y can be 1, 2, 3, or 4. (A runnable Python version of this example appears at the end of this post.) Both functions are locally sequential, but running them concurrently leads to global nondeterminism.

Concurrency is arguably the most dramatic source of nondeterminism. Small amounts of concurrency lead to huge explosions in the state space. We have words for the specific kinds of nondeterminism caused by concurrency, like "race condition" and "dirty write". Often we think about it as a separate topic from nondeterminism.
To some extent it "overshadows" the other kinds: I have a much easier time teaching students about concurrency in models than nondeterminism in models. Many formal specification languages have special syntax/machinery for the concurrent aspects of a system, and generic syntax for other kinds of nondeterminism. In P that's choose. Others don't special-case concurrency, instead representing it as nondeterministic choices by a global coordinator. This is more flexible but also more inconvenient, as you have to implement process-local sequencing code yourself.

3. User Input

One of the most famous and influential programming books is The C Programming Language by Kernighan and Ritchie. The first example of a nondeterministic program appears on page 14. For the newsletter readers who get text-only emails,2 here's the program:

    #include <stdio.h>

    /* copy input to output; 1st version */
    main() {
        int c;

        c = getchar();
        while (c != EOF) {
            putchar(c);
            c = getchar();
        }
    }

Yup, that's nondeterministic. Because the user can enter any string, any call of main() could have any output, meaning the number of possible outcomes is infinity. Okay, that seems a little cheap, and I think it's because we tend to think of determinism in terms of how the user experiences the program. Yes, main() has an infinite number of user inputs, but for each input the user will experience only one possible output.

It starts to feel more nondeterministic when modeling a long-standing system that's reacting to user input, for example a server that runs a script whenever the user uploads a file. This can be modeled with nondeterminism and concurrency: we have one execution that's the system, and one nondeterministic execution that represents the effects of our user. (One intrusive thought I sometimes have: any "yes/no" dialogue actually has three outcomes: yes, no, or the user getting up and walking away without picking a choice, permanently stalling the execution.)

4. External forces

The more general version of "user input": anything where either 1) some part of the execution outcome depends on retrieving external information, or 2) the external world can change some state outside of your system. I call the distinction between internal and external components of the system the world and the machine. Simple examples: code that at some point reads an external temperature sensor. Unrelated code running on a system which quits programs if it gets too hot. API requests to a third party vendor. Code processing files but users can delete files before the script gets to them.

Like with PRNGs, some of these cases don't have to be nondeterministic; we can argue that "the temperature" should be a virtual input into the function. Like with PRNGs, we treat it as nondeterministic because it's useful to think in that way. Also, what if the temperature changes between starting a function and reading it?

External forces are also a source of nondeterminism as uncertainty. Measurements in the real world often come with errors, so repeating a measurement twice can give two different answers. Sometimes operations fail for no discernable reason, or for a non-programmatic reason (like something physically blocking the sensor). All of these situations can be modeled in the same way as user input: a concurrent execution making nondeterministic choices.

5. Abstraction

This is where nondeterminism in system models and in "real software" differ the most. I said earlier that pseudorandomness is arguably deterministic, but we abstract it into nondeterminism.
More generally, nondeterminism hides implementation details of deterministic processes. In one consulting project, we had a machine that received a message, parsed a lot of data from the message, went into a complicated workflow, and then entered one of three states. The final state was totally deterministic on the content of the message, but the actual process of determining that final state took tons and tons of code. None of that mattered at the scope we were modeling, so we abstracted it all away: "on receiving message, nondeterministically enter state A, B, or C."

Doing this makes the system easier to model. It also makes the model more sensitive to possible errors. What if the workflow is bugged and sends us to the wrong state? That's already covered by the nondeterministic choice! Nondeterministic abstraction gives us the potential to pick the worst-case scenario for our system, so we can prove it's robust even under those conditions.

I know I beat the "nondeterminism as abstraction" drum a whole lot, but that's because it's the insight from formal methods I personally value the most: that nondeterminism is a powerful tool to simplify reasoning about things. You can see the same approach in how I approach modeling users and external forces: complex realities black-boxed and simplified into nondeterministic forces on the system. Anyway, I hope this collection of ideas I got from formal methods is useful to my broader readership. Lemme know if it somehow helps you out!

1. I realized after writing this that I already wrote an essay about nondeterminism in formal specification just under a year ago. I hope this one covers enough new ground to be interesting! ↩
2. There is a surprising number of you. ↩
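As promised in the concurrency section, here is a runnable Python version of the thread1/thread2 pseudocode. The sleeps are only there to encourage the scheduler to interleave the threads and are not part of the original example:

    import random
    import threading
    import time

    x, y = 1, 0

    def thread1():
        global x
        for _ in range(3):
            time.sleep(random.random() / 1000)  # encourage interleaving
            x += 1

    def thread2():
        global y
        time.sleep(random.random() / 1000)
        y = x  # reads whatever value x has at this moment

    t1 = threading.Thread(target=thread1)
    t2 = threading.Thread(target=thread2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(y)  # depending on scheduling, prints 1, 2, 3, or 4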

7 hours ago 3 votes
Europe must become dangerous again

Trump is doing Europe a favor by revealing the true cost of its impotency. Because, in many ways, he has the manners and the honesty of a child. A kid will just blurt out in the supermarket "why is that lady so fat, mommy?". That's not a polite thing to ask within earshot of said lady, but it might well be a fair question and a true observation! Trump is just as blunt when he essentially asks: "Why is Europe so weak?".

Because Europe is weak, spiritually and militarily, in the face of Russia. It's that inherent weakness that's breeding the delusion that Russia is at once both on its last legs, about to lose the war against Ukraine any second now, and also the all-potent superpower that could take over all of Europe, if we don't start World War III to counter it. This is not a coherent position.

If you want peace, you must be strong. The big cats in the international jungle don't stick to a rules-based order purely out of higher principles, but out of self-preservation. And they can smell weakness like a tiger smells blood. This goes for Europe too. All too happy to lecture weaker countries they do not fear on high-minded ideals of democracy and free speech, while standing aghast and weeping powerlessly when someone stronger returns the favor.

I'm not saying that this is right, in some abstract moral sense. I like the idea of a rules-based order. I like the idea of territorial sovereignty. I even like the idea that the normal exchanges between countries aren't as blunt and honest as those of a child in the supermarket. But what I like and "what is" need separating.

Europe simply can't have it both ways. Be weak militarily, utterly dependent on an American security guarantee, and also expect a seat at the big-cat table. These positions are incompatible. You either get your peace dividend -- and the freedom to squander it on net-zero nonsense -- or you get to have a say in how the world around you is organized.

Which brings us back to Trump doing Europe a favor. For all his bluster and bullying, America is still a benign force in its relation to Europe. We're being punked by someone from our own alliance. That's a cheap way of learning the lesson that weakness, impotence, and peace-dividend thinking is a short-term strategy. Russia could teach Europe a far more costly lesson. So too China.

All that to say is that Europe must heed the rude awakening from our cowboy friends across the Atlantic. They may be crude, they may be curt, but by golly, they do have a point. Get jacked, Europe, and you'll no longer get punked. Stay feeble, Europe, and the indignities won't stop with being snubbed in Saudi Arabia.

16 hours ago 3 votes
How I create static websites for tiny archives

Last year I wrote about using static websites for tiny archives. The idea is that I create tiny websites to store and describe my digital collections. There are several reasons I like this approach: HTML is flexible and lets me display data in a variety of ways; it’s likely to remain readable for a long time; it lets me add more context than a folder full of files. I’m converting more and more of my local data to be stored in static websites – paperwork I’ve scanned, screenshots I’ve taken, and web pages I’ve bookmarked. I really like this approach.

I got a lot of positive feedback, but the most common reply was “please share some source code”. People wanted to see examples of the HTML and JavaScript I was using.

I deliberately omitted any code from the original post, because I wanted to focus on the concept, not the detail. I was trying to persuade you that static websites are a good idea for storing small archives and data sets, and I didn’t want to get distracted by the implementation. There’s also no single code base I could share – every site I build is different, and the code is often scrappy or poorly documented. I’ve built dozens of small sites this way, and there’s no site that serves as a good example of this approach – they’re all built differently, implement a subset of my ideas, or have hard-coded details. Even if I shared some source code, it would be difficult to read or understand what’s going on.

However, there’s clearly an appetite for that sort of explanation, so this follow-up post will discuss the “how” rather than the “why”. There’s a lot of code, especially JavaScript, which I’ll explain in small digestible snippets. That’s another reason I didn’t describe this in the original post – I didn’t want anyone to feel overwhelmed or put off. A lot of what I’m describing here is nice-to-have, not essential. You can get started with something pretty simple.

I’ll go through a feature at a time, as if we were building a new static site. I’ll use bookmarks as an example, but there’s nothing in this post that’s specific to bookmarking. If you’d like to see everything working together, check out the demo site. It includes the full code for all the sections in this post. Let’s dive in!

Start with a hand-written HTML page (demo)
Reduce repetition with JavaScript templates (demo)
Add filtering to find specific items (demo)
Introduce sorting to bring order to your data (demo)
Use pagination to break up long lists (demo)
Provide feedback with loading states and error handling (demo 1, demo 2)
Test the code with QUnit and Playwright
Manipulate the metadata with Python
Store the website code in Git
Closing thoughts

Start with a hand-written HTML page (demo)

A website can be a single HTML file you edit by hand. Open a text editor like TextEdit or Notepad, copy-paste the following text, and save it in a file named bookmarks.html.

<h1>Bookmarks</h1>

<ul>
  <li><a href="https://estherschindler.medium.com/the-old-family-photos-project-lessons-in-creating-family-photos-that-people-want-to-keep-ea3909129943">Lessons in creating family photos that people want to keep, by Esther Schindler (2018)</a></li>
  <li><a href="https://www.theatlantic.com/technology/archive/2015/01/why-i-am-not-a-maker/384767/">Why I Am Not a Maker, by Debbie Chachra (The Atlantic, 2015)</a></li>
  <li><a href="https://meyerweb.com/eric/thoughts/2014/06/10/so-many-nevers/">So Many Nevers, by Eric Meyer (2014)</a></li>
</ul>

If you open this file in your web browser, you’ll see a list of three links. You can also check out my demo page to see this in action.
This is an excellent way to build a website. If you stop here, you’ve got all the flexibility and portability of HTML, and this file will remain readable for a very long time. I build a lot of sites this way. I like it for small data sets that I know are never going to change, or which change very slowly. It’s simple, future-proof, and easy to edit if I ever need to.

Reduce repetition with JavaScript templates (demo)

As you store more data, it gets a bit tedious to keep copying the HTML markup for each item. Wouldn’t it be useful if we could push it into a reusable template?

When a site gets bigger, I convert the metadata into JSON, then I use JavaScript and template literals to render it on the page. Let’s start with a simple example of metadata in JSON. My real data has more fields, like date saved or a list of keyword tags, but this is enough to get the idea:

const bookmarks = [
  {
    "url": "https://estherschindler.medium.com/the-old-family-photos-project-lessons-in-creating-family-photos-that-people-want-to-keep-ea3909129943",
    "title": "Lessons in creating family photos that people want to keep, by Esther Schindler (2018)"
  },
  {
    "url": "https://www.theatlantic.com/technology/archive/2015/01/why-i-am-not-a-maker/384767/",
    "title": "Why I Am Not a Maker, by Debbie Chachra (The Atlantic, 2015)"
  },
  {
    "url": "https://meyerweb.com/eric/thoughts/2014/06/10/so-many-nevers/",
    "title": "So Many Nevers, by Eric Meyer (2014)"
  }
];

Then I have a function that renders the data for a single bookmark as HTML:

function Bookmark(bookmark) {
  return `
    <li>
      <a href="${bookmark.url}">${bookmark.title}</a>
    </li>
  `;
}

Having a function that returns HTML is inspired by React and Next.js, where code is split into “components” that each render part of the web app. This function is simpler than what you’d get in React. Part of React’s behaviour is that it will re-render the page if the data changes, but my function won’t do that. That’s okay, because my data isn’t going to change. The HTML gets rendered once when the page loads, and that’s enough.

I’m using a template literal because I find it simple and readable. It looks pretty close to the actual HTML, so I have a pretty good idea of what’s going to appear on the page. Template literals are dangerous if you’re getting data from an untrusted source – it could allow somebody to inject arbitrary HTML into your page – but I’m writing all my metadata, so I trust it. I know there are other ways to construct HTML in JavaScript, like document.createElement(), the <template> element, or Web Components – but template literals have always been sufficient for me, and I’ve never had a reason to explore other options.

Now we have to call this function when the page loads, and render the list of bookmarks. Here’s the rest of the code:

<script>
  window.addEventListener("DOMContentLoaded", () => {
    document.querySelector("#listOfBookmarks").innerHTML =
      bookmarks.map(Bookmark).join("");
  });
</script>

<h1>Bookmarks</h1>

<ul id="listOfBookmarks"></ul>

I’m listening for the DOMContentLoaded event, which occurs when the HTML page has been fully parsed. When that event occurs, it looks for <ul id="listOfBookmarks"> in the page, and inserts the HTML for the list of bookmarks. We have to wait for this event so the <ul> actually exists. If we tried to run it immediately, it might run before the <ul> exists, and then it wouldn’t know where to insert the HTML.

I’m using querySelector() to find the <ul> I want to modify – this is a newer alternative to functions like getElementById().
It’s quite flexible, because I can target any CSS selector, and I find CSS rules easier to remember than the family of getElementBy* functions. Although it’s slightly slower in benchmarks, the difference is negligible in practice.

If you want to see this page working, check out the demo page.

I use this pattern as a starting point for a lot of my static sites – metadata in JSON, some functions that render HTML, and an event listener that renders the whole page after it loads. Once I have the basic site, I add data, render more HTML, and write CSS styles to make it look pretty. This is where I can have fun, and really customise each site. I keep tweaking until I have something I like. I’m ignoring CSS because that could be a whole other post, and there’s a vintage charm to unstyled HTML – it’s fine for what we’re discussing today. What else can we do?

Add filtering to find specific items (demo)

As the list gets even longer, it’s useful to have a way to find specific items in the list – I don’t want to scroll the whole thing every time. I like adding keyword tags to my data, and then filtering for items with particular tags. If I add other metadata fields, I could filter on those too. Here’s a brief sketch of the sort of interface I like: I like to be able to define a series of filters, and apply them to focus on a specific subset of items. I like to combine multiple filters to refine my search, and to see a list of applied filters with a way to remove them, if I’ve filtered too far. I like to apply filters from a global menu, or to use controls on each item to find similar items.

I use URL query parameters to store the list of currently-applied filters, for example:

bookmarks.html?tag=animals&tag=wtf&publicationYear=2025

This means that any UI element that adds or removes a filter is a link to a new URL, so clicking it loads a new page, which triggers a complete re-render with the new filters.

When I write filtering code, I try to make it as easy as possible to define new filters. Every site needs a slightly different set of filters, but the overall principle is always the same: here’s a long list of items, find the items that match these rules.

Let’s start by expanding our data model to include a couple of new fields:

const bookmarks = [
  {
    "url": "https://estherschindler.medium.com/the-old-family-photos-project-lessons-in-creating-family-photos-that-people-want-to-keep-ea3909129943",
    "title": "Lessons in creating family photos that people want to keep, by Esther Schindler (2018)",
    "tags": ["photography", "preservation"],
    "publicationYear": "2018"
  },
  …
];

Then we can define some filters we might use to narrow the list:

const bookmarkFilters = [
  {
    id: 'tag',
    label: 'tagged with',
    filterFn: (bookmark, tagName) => bookmark.tags.includes(tagName),
  },
  {
    id: 'publicationYear',
    label: 'published in',
    filterFn: (bookmark, year) => bookmark.publicationYear === year,
  },
];

Each filter has three fields:

- id matches the name of the associated URL query parameter
- label is how the filter will be described in the list of applied filters
- filterFn is a function that takes two arguments (a bookmark, and a filter value) and returns true/false depending on whether the bookmark matches this filter

This list is the only place where I need to customise the filters for a particular site; the rest of the filtering code is completely generic. This means there’s only one place I need to make changes if I want to add or remove filters.
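Earlier I mentioned that any UI element which adds a filter is just a link to a new URL. The post doesn’t include that helper, so here’s a small sketch of my own (the name linkToAddFilter and its parameters are hypothetical) showing how such a link could be built from the current query string:

/*
 * Sketch (not from the original post): build a link that adds a filter
 * to the current page URL, e.g. for a per-item "more like this" control.
 * `filterId` and `value` match the filter definitions above,
 * e.g. "tag" and "photography".
 */
function linkToAddFilter(filterId, value) {
  const params = new URLSearchParams(window.location.search);
  params.append(filterId, value);
  return `?${params.toString()}`;
}

// Usage inside the Bookmark component might look like:
//   `<a href="${linkToAddFilter("tag", tagName)}">${tagName}</a>`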
The next piece of the filtering code is a generic function that filters a list of items, and takes the list of filters as an argument:

/*
 * Filter a list of items.
 *
 * This function takes the list of items and available filters, and the
 * URL query parameters passed to the page.
 *
 * This function returns a list with the items that match these filters,
 * and a list of filters that have been applied.
 */
function filterItems({ items, filters, params }) {
  // By default, all items match, and no filters are applied.
  var matchingItems = items;
  var appliedFilters = [];

  // Go through the URL query params one by one, and look to
  // see if there's a matching filter.
  for (const [key, value] of params) {
    console.debug(`Checking query parameter ${key}`);
    const matchingFilter = filters.find(f => f.id === key);

    if (typeof matchingFilter === 'undefined') {
      continue;
    }

    // There's a matching filter! Go ahead and filter the
    // list of items to only those that match.
    console.debug(`Detected filter ${JSON.stringify(matchingFilter)}`);
    matchingItems = matchingItems.filter(
      item => matchingFilter.filterFn(item, value)
    );

    // Construct a new query string that doesn't include
    // this filter.
    const altQuery = new URLSearchParams(params);
    altQuery.delete(key, value);
    const linkToRemove = "?" + altQuery.toString();

    appliedFilters.push({
      type: matchingFilter.id,
      label: matchingFilter.label,
      value,
      linkToRemove,
    })
  }

  return { matchingItems, appliedFilters };
}

This function doesn’t care what sort of items I’m passing, or what the actual filters are, so I can reuse it between different sites. It returns the list of matching items, and the list of applied filters. The latter allows me to show that list on the page. linkToRemove is a link to the same page with this filter removed, but keeping any other filters. This lets us provide a button that removes the filter.

The final step is to wire this filtering into the page render. We need to make sure we only show items that match the filter, and show the user a list of applied filters. Here’s the new code:

<script>
  window.addEventListener("DOMContentLoaded", () => {
    const params = new URLSearchParams(window.location.search);

    const { matchingItems: matchingBookmarks, appliedFilters } = filterItems({
      items: bookmarks,
      filters: bookmarkFilters,
      params: params,
    });

    document.querySelector("#appliedFilters").innerHTML =
      appliedFilters
        .map(f => `<li>${f.label}: ${f.value} <a href="${f.linkToRemove}">(remove)</a></li>`)
        .join("");

    document.querySelector("#listOfBookmarks").innerHTML =
      matchingBookmarks.map(Bookmark).join("");
  });
</script>

<h1>Bookmarks</h1>

<p>Applied filters:</p>
<ul id="appliedFilters"></ul>

<p>Bookmarks:</p>
<ul id="listOfBookmarks"></ul>

I stick to simple filters that can be phrased as a yes/no question, and I rely on my past self to have written sufficiently useful metadata. At least in static sites, I’ve never implemented anything like a fuzzy text search, where it’s less obvious whether a particular item should match. You can check out the filtering code on the demo page.

Introduce sorting to bring order to your data (demo)

The next feature I usually implement is sorting. I build a dropdown menu with all the options, and picking one reloads the page with the new sort order. Here’s a quick design sketch: For example, I often sort by the date I saved an item, so I can find an item I saved recently. Another sort order I often use is “random”, which shuffles the items and is a fun way to explore the data.
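As a side note, the post mentions a “random” sort order but doesn’t show one. Here’s a sketch of my own for what it could look like in the { id, label, compareFn } format defined just below; a bare Math.random() comparator is biased, so this draws one random key per item for the current page load (randomKeyFor and randomSortOption are hypothetical names):

// Sketch (not from the original post): a "random" sort option in the same
// shape as the other sort options.
const randomKeys = new WeakMap();

function randomKeyFor(item) {
  // Give each item a stable random key for this page load.
  if (!randomKeys.has(item)) {
    randomKeys.set(item, Math.random());
  }
  return randomKeys.get(item);
}

const randomSortOption = {
  id: 'random',
  label: 'random order',
  compareFn: (a, b) => randomKeyFor(a) - randomKeyFor(b),
};

// e.g. bookmarkSortOptions.push(randomSortOption);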
As with filters, I put the current sort order in a query parameter, for example:

bookmarks.html?sortOrder=titleAtoZ

As before, I want to write this in a generic way and share code between different sites. Let’s start by defining a list of sort options:

const bookmarkSortOptions = [
  {
    id: 'titleAtoZ',
    label: 'title (A to Z)',
    compareFn: (a, b) => a.title > b.title ? 1 : -1,
  },
  {
    id: 'publicationYear',
    label: 'publication year (newest first)',
    compareFn: (a, b) => Number(b.publicationYear) - Number(a.publicationYear),
  },
];

Each sort option has three fields:

- id is the value that will appear in the URL query parameter
- label is the human-readable label that will appear in the dropdown
- compareFn(a, b) is a function that compares two items, and will be passed directly to the JavaScript sort function. If it returns a negative value, then a sorts before b. If it returns a positive value, then a sorts after b.

Next, we can define a function that will sort a list of items:

/*
 * Sort a list of items.
 *
 * This function takes the list of items and available options, and the
 * URL query parameters passed to the page.
 *
 * It returns a list with the items in sorted order, and the
 * sort order that was applied.
 */
function sortItems({ items, sortOptions, params }) {
  // Did the user pass a sort order in the query parameters?
  const sortOrderId = getSortOrder(params);

  // What sort order are we using?
  //
  // Look for a matching sort option, or use the default if the sort
  // order is null/unrecognised. For now, use the first defined
  // sort order as the default.
  const defaultSort = sortOptions[0];
  const selectedSort =
    sortOptions.find(s => s.id === sortOrderId) || defaultSort;

  console.debug(`Selected sort: ${JSON.stringify(selectedSort)}`);

  // Now apply the sort to the list of items.
  const sortedItems = items.sort(selectedSort.compareFn);

  return { sortedItems, appliedSortOrder: selectedSort };
}

/* Get the current sort order from the URL query parameters. */
function getSortOrder(params) {
  return params.get("sortOrder");
}

This function works with any list of items and sort orders, making it easy to reuse across different sites. I only have to define the list of sort orders once.

This approach makes it easy to add new sort orders, and to write a component that renders a dropdown menu to pick the sort order:

/*
 * Create a dropdown control to choose the sort order. When you pick
 * a different value, the page reloads with the new sort.
 */
function SortOrderDropdown({ sortOptions, appliedSortOrder }) {
  return `
    <select onchange="setSortOrder(this.value)">
      ${
        sortOptions
          .map(({ id, label }) => `
            <option value="${id}" ${id === appliedSortOrder.id ? 'selected' : ''}>
              ${label}
            </option>
          `)
          .join("")
      }
    </select>
  `;
}

function setSortOrder(sortOrderId) {
  const params = new URLSearchParams(window.location.search);
  params.set("sortOrder", sortOrderId);
  window.location.search = params.toString();
}

Finally, we can wire the sorting code into the rest of the app. After filtering, we sort the items and then render the sorted list.
We also show the sort controls on the page:

<script>
  window.addEventListener("DOMContentLoaded", () => {
    const params = new URLSearchParams(window.location.search);

    const { matchingItems: matchingBookmarks, appliedFilters } = filterItems(…);

    …

    const { sortedItems: sortedBookmarks, appliedSortOrder } = sortItems({
      items: matchingBookmarks,
      sortOptions: bookmarkSortOptions,
      params,
    });

    document.querySelector("#sortOrder").innerHTML +=
      SortOrderDropdown({ sortOptions: bookmarkSortOptions, appliedSortOrder });

    document.querySelector("#listOfBookmarks").innerHTML =
      sortedBookmarks.map(Bookmark).join("");
  });
</script>

<p id="sortOrder">Sort by:</p>

You can check out the sorting code on the demo page.

Use pagination to break up long lists (demo)

If you have a really long list of items, you may want to break them into multiple pages. This isn’t something I do very often. Modern web browsers are very performant, and you can put thousands of elements on the page without breaking a sweat. I’ve only had to add pagination in a couple of very image-heavy sites – if it’s a text-based site, I just show everything. (You may notice that, for example, there are no paginated lists anywhere on this site. By writing lean HTML, I can fit all my lists on a single page.)

If I do want pagination, I stick to a classic design: As with other features, I use a URL query parameter to track the current page number:

bookmarks.html?pageNumber=2

This code can be written in a completely generic way – it doesn’t have to care what sort of items we’re paginating. First, let’s write a function that will select a page of items for us. If we’re on page N, what items should we be showing?

/*
 * Get a page of items.
 *
 * This function will reduce the list of items to the items that should
 * be shown on this particular page.
 */
function paginateItems({ items, pageNumber, pageSize }) {
  // Page numbers are 1-indexed, so page 1 corresponds to
  // the indices 0…(pageSize - 1).
  const startOfPage = (pageNumber - 1) * pageSize;
  const endOfPage = pageNumber * pageSize;
  const thisPage = items.slice(startOfPage, endOfPage);

  return {
    thisPage,
    totalPages: Math.ceil(items.length / pageSize),
  };
}

In some of my sites, the page size is a suggestion rather than a hard rule. If there are 27 items and the page size is 25, I think it’s nicer to show all the items on one page than push a few items onto a second page which barely has anything on it. But that might reflect my general dislike of pagination, and it’s definitely a nice-to-have rather than a required feature.

Once we know what page we’re on and how many pages there are, we can create a component to render some basic pagination controls:

/*
 * Renders a list of pagination controls.
 *
 * This includes links to prev/next pages and the current page number.
 */
function PaginationControls({ pageNumber, totalPages, params }) {
  // If there's only one page, we don't need pagination controls.
  if (totalPages === 1) {
    return "";
  }

  let prevPageLink;
  let nextPageLink;

  // Do we need a link to the previous page? Only if we're past page 1.
  if (pageNumber > 1) {
    const prevPageUrl = setPageNumber({ params, pageNumber: pageNumber - 1 });
    prevPageLink = `<a href="${prevPageUrl}">&larr; prev</a>`;
  } else {
    prevPageLink = null;
  }

  // Do we need a link to the next page? Only if we're before
  // the last page.
  if (pageNumber < totalPages) {
    const nextPageUrl = setPageNumber({ params, pageNumber: pageNumber + 1 });
    nextPageLink = `<a href="${nextPageUrl}">next &rarr;</a>`;
  } else {
    nextPageLink = null;
  }

  const pageText = `Page ${pageNumber} of ${totalPages}`;

  // Construct the final result.
  return [prevPageLink, pageText, nextPageLink]
    .filter(p => p !== null)
    .join(" / ");
}

/* Returns a URL that points to the new page number. */
function setPageNumber({ params, pageNumber }) {
  const updatedParams = new URLSearchParams(params);
  updatedParams.set("pageNumber", pageNumber);
  return `?${updatedParams.toString()}`;
}

Finally, let’s wire this code into the rest of the app. We get the page number from the URL query parameters, paginate the list of filtered and sorted items, and show some pagination controls:

<script>
  /* Get the current page number. */
  function getPageNumber(params) {
    return Number(params.get("pageNumber")) || 1;
  }

  window.addEventListener("DOMContentLoaded", () => {
    const params = new URLSearchParams(window.location.search);

    const { matchingItems: matchingBookmarks, appliedFilters } = filterItems(…);

    const { sortedItems: sortedBookmarks, appliedSortOrder } = sortItems(…);

    const pageNumber = getPageNumber(params);

    const { thisPage: thisPageOfBookmarks, totalPages } = paginateItems({
      items: sortedBookmarks,
      pageNumber,
      pageSize: 25,
    });

    document.querySelector("#paginationControls").innerHTML +=
      PaginationControls({ pageNumber, totalPages, params });

    document.querySelector("#listOfBookmarks").innerHTML =
      thisPageOfBookmarks.map(Bookmark).join("");
  });
</script>

<p id="paginationControls">Pagination controls: </p>

One thing that makes pagination a little tricky is that it affects filtering and sorting as well – when you change either of those, you probably want to reset to the first page. For example, if you’re filtering for animals and you’re on page 3, and then you add a second filter for giraffes, you should reset to page 1. If you stay on page 3, it might be confusing if there are fewer than 3 pages of results with the new filter. The key to this is calling params.delete("pageNumber") when you update the URL query parameters. You can play with the pagination on the demo page.

Provide feedback with loading states and error handling (demo 1, demo 2)

One problem with relying on JavaScript to render the page is that sometimes JavaScript goes wrong. For example, I write a lot of my metadata by hand, and a typo can create invalid JSON and break the page. There are also people who disable JavaScript, or sometimes it just doesn’t work. If I’m using the site, I can open the Developer Tools in my web browser and start debugging there – but that’s not a great experience. If you’re not expecting something to go wrong, it will just look like the page is taking a long time to load. We can do better.

To start, we can add a <noscript> element that explains to users that they need to enable JavaScript. This will only be shown if they’ve disabled JavaScript:

<noscript>
  <strong>You need to enable JavaScript to use this site!</strong>
</noscript>

I have a demo page which disables JavaScript, so you can see how the noscript tag behaves.

This won’t help if JavaScript is broken rather than disabled, so we also need to add error handling. We can listen for the error event on the window, and report an error to the user – for example, if a script fails to load.

<div id="errors"></div>

<script>
  window.addEventListener("error", function(event) {
    document
      .querySelector('#errors')
      .innerHTML = `<strong>Something went wrong when loading the page!</strong>`;
  });
</script>

We can also attach an onerror handler to specific script tags, which allows us to customise the error message – we can tell the user that a particular file failed to load.
<script src="app.js" onerror="alert('Something went wrong while loading app.js')"></script> I have another demo page which has a basic error handler. Finally, I like to include a loading indicator, or some placeholder text that will be replaced when the page will finish loading – this tells the user where they can expect to see something load in. <ul id="listOfBookmarks">Loading…</ul> It’s somewhat rare for me to add a loading indicator or error handling, just because I’m the only user of my static sites, and it’s easier for me to use the developer tools when something breaks. But providing mechanisms for the user to understand what’s going on is crucial if you want to build static sites like this that other people will use. Test the code with QUnit and Playwright If I’m writing a very complicated viewer, it’s helpful to have tests. I’ve found two test frameworks that I particularly like for this purpose. QUnit is a JavaScript library that I use for unit testing – to me, that means testing individual functions and components. For example, QUnit was very helpful when I was writing the early iterations of the sorting and filtering code, and writing tests caught a number of mistakes. You can run QUnit in the browser, and it only requires two files, so I can test a project without creating a whole JavaScript build system or dependency tree. Here’s an example of a QUnit test: QUnit.test("sorts bookmarks by title", function(assert) { // Create three bookmarks with different titles const bookmarkA = { title: "Almanac for apples" }; const bookmarkC = { title: "Compendium of coconuts" }; const bookmarkP = { title: "Page about papayas" }; const params = new URLSearchParams("sortOrder=titleAtoZ"); // Pass the bookmarks in the wrong order, so they can't be sorted // correctly "by accident" const { sortedItems, appliedSortOrder } = sortItems({ items: [bookmarkC, bookmarkA, bookmarkP], sortOptions: bookmarkSortOptions, params, }); // Check the bookmarks have been sorted in the right order assert.deepEqual(sortedItems, [bookmarkA, bookmarkC, bookmarkP]); }); You can see this test running in the browser in my demo page. Playwright is a testing library that can open a web app in a real web browser, interact with the page, and check that the app behaves correctly. It’s often used for dynamic web apps, but it works just as well for static pages. For example, it can test that if you select a new sort order, the page reloads and show results in the correct order. Here’s an example of a simple test written with Playwright in Python: from playwright.sync_api import expect, sync_playwright with sync_playwright() as p: browser = p.webkit.launch() # Open the HTML file in the browser page = browser.new_page() page.goto('file:///Users/alexwlchan/Sites/sorting.html') # Look for an <li> element with one of the bookmarks -- this will # only appear if the page has rendered correctly. expect(page.get_by_text("So Many Nevers")).to_be_visible() browser.close() These tools are a great safety net for catching mistakes, but I don’t always need them. I only write tests for my more complicated sites – when the sorting/filtering code is particularly complex, there’s a lot of rendering code, or I anticipate making major changes in future. I don’t bother with tests when the site is simple and unlikely to change, and I can just do manual checks when I write it the first time. Tests are less useful if I know I’ll never make changes. 
This is getting away from the idea of a self-contained static website, because now I’m relying on third-party code, and for Playwright I need to maintain a working Python environment. I’m okay with this, because the website is still usable even if I can no longer run the tests. These are useful sidecar tools, but I only need them if I’m making changes. If I finish a site and I know I won’t change it again, I don’t need to worry about whether the tests will still work years later.

Manipulate the metadata with Python

For small sites, we could write all this JavaScript directly in <script> tags or in a single file. As we get more data, splitting the metadata and application logic makes everything easier to manage. One pattern I’ve adopted is to put all the item metadata into a single, standalone JavaScript file that assigns a single variable:

const bookmarks = […];

and then load that file in the HTML page with a <script src="metadata.js"> element.

I use JavaScript rather than pure JSON because browsers don’t allow fetching local JSON files via file://. If you open an HTML page without a web server, the browser will block requests to fetch a JSON file because of security restrictions. By storing data in a JavaScript file instead, I can load it with a simple <script> tag.

I wrote a small Python library javascript-data-files that lets me interact with JSON stored this way. This allows me to write scripts that add data to the metadata file (like saving a new bookmark) or to verify the existing metadata (like checking that I have an archived copy of every bookmark). I’ll write more about this in future posts, because this one is long enough already.

For example, let’s add a new bookmark to the metadata.js file:

from javascript_data_files import read_js, write_js

bookmarks = read_js("metadata.js", varname="bookmarks")

bookmarks.append({
    "url": "https://www.theguardian.com/lifeandstyle/2019/jan/13/ella-risbridger-john-underwood-friendship-life-new-family",
    "title": "When my world fell apart, my friends became my family, by Ella Risbridger (2019)"
})

write_js("metadata.js", varname="bookmarks", value=bookmarks)

We’re starting to blur the line between a static site and a static site generator. These scripts only work if I have a working Python environment, which is less future-proof than pure HTML. I’m happy with this compromise, because the website is fully functional without them – I only need to run these scripts if I’m modifying the metadata. If I stop making changes and the Python environment breaks, I can still read everything I’ve already saved.

Store the website code in Git

I create Git repositories for all of my local websites. This allows me to track changes, and it means I can experiment freely – I can always roll back if I break something.

These Git repositories only live on my local machine. I run git init . in the folder, I create commits to record any changes, and that’s it. I don’t push the repository to GitHub or another remote Git server. (Although I do have backups of every site, of course.) Git has a lot of features for writing code in a collaborative environment, but I don’t need any of those here – I’m the only person working on these sites. Most of the time, I just use two commands:

$ git add bookmarks.html
$ git commit -m "Add filtering by author"

This creates a labelled snapshot of my latest changes to bookmarks.html.

I only track the text files in Git – the HTML, CSS, and JavaScript. I don’t track binary files like images and videos.
Git struggles with those larger files, and I don’t edit those as much as the text files, so having them in version control is less useful. I write a gitignore file to ignore all of them.

Closing thoughts

There are lots of ideas here, but you don’t need to use all of them – most of my sites only use a few. Every site is different, and you can pick what makes most sense for your project. If you’re building a static site for a tiny archive, start with a simple HTML file. Add features like templates, sorting, and filtering incrementally as they become useful. You don’t need to add them all upfront – that can make things more complicated than they need to be. This approach can scale from simple collections to sophisticated archives. A static website built with HTML and JavaScript is easy to maintain and modify, has no external dependencies, and is future-proof against a lot of technological changes.

I’ve come to love using static websites to store my local data. They’re flexible, resilient, and surprisingly powerful. I hope you’ll consider it too, and that these ideas help you get started.

[If the formatting of this post looks odd in your feed reader, visit the original article]

yesterday 4 votes
Nobody Profits

Intellectual property is a really dumb idea. “But piracy is theft. Clean and simple. It’s smash and grab. It ain’t no different than smashing a window at Tiffany’s and grabbing merchandise.” - Joe Biden, 46th president of the USA. Except it isn’t, and Joe Biden is a senile moron. Because when you smash the windows and grab the stuff, Tiffany’s no longer has the stuff. With piracy, everyone has the stuff. It’s a lot more like taking a picture, which Tiffany’s probably encourages. Win-win cooperation. Wealth is being increasingly concentrated. What’s shocking to me is how much everyone still cares about money. Even the die-hard complain-about-capitalism type deeply cares, because the opposite of love isn’t hate, it’s indifference. I hate scammers, but I’m pretty indifferent to money. The best outcome of AI is if it delivers huge amounts of value to society but no profit to anyone. The old days of the Internet were this goldmine. The Internet delivered huge value but no profit, and that’s why it was good. Suddenly we had all these new powers. Then people figured out how to monetize it. It was a race to extract every tiny bit of value, and now we have today’s Internet. Can this play out differently with AI? Let’s build technology and open source software that market-breaks everything. Let’s demoralize the scammers so hard that they don’t even try. Every loser and grifter will be gone from technology because there’s nothing to be gained there. They can play golf all day or something. If I ever figure out how to channel power like Elon, I will do this. Spin up open source projects in every sector to eliminate all the capturable value. This is what I’m trying to do with comma.ai and tinygrad. I dream of a day when company valuations halve when I create a GitHub repo. Someday.

2 days ago 4 votes
My Top 15 OS Books: From Theory and Implementation to Systems Programming

A personal guide to the most useful books for understanding operating systems

2 days ago 7 votes