This book’s introduction started by defining strategy as “making decisions.” Then we dug into exploration, diagnosis, and refinement: three chapters where you could argue that we didn’t decide anything at all. Clarifying the problem to be solved is the prerequisite of effective decision making, but eventually decisions do have to be made. Here in this chapter on policy, and the following chapter on operations, we finally start to actually make some decisions. In this chapter, we’ll dig into: How we define policy, and how setting policy differs from operating policy as discussed in the next chapter The structured steps for setting policy How many policies should you set? Is it preferable to have one policy, many policies, or does it not matter much either way? Recurring kinds of policies that appear frequently in strategies Why it’s valuable to be intentional about your strategy’s altitude, and how engineers and executives generally maintain different altitudes in their...
a week ago


More from Irrational Exuberance

Operational mechanisms for strategy.

Even the best policies fail if they aren’t adopted by the teams they’re intended to serve. Can we persistently change our company’s behaviors with a one-time announcement? No, probably not.

I refer to the art of making policies work as “operations” or “strategy operations.” The good news is that effectively operating a policy is two-thirds avoiding common practices that simply don’t work. The other one-third takes some practice, but can be practiced in any engineering role: there’s no need to wait until you’re an executive to start building mastery.

This chapter will dig into those mechanisms, with particular focus on:

- How policies are supported by operations, and how operations are composed of mechanisms that ensure they work well
- Evaluating operational mechanisms to select between different options, and determine which mechanisms are unlikely to be an effective choice
- Composing an operational plan for the specific set of policies that you are looking to support
- Common varieties of effective mechanisms such as approval forums, inspection mechanisms, nudges, and so on. We’ll also explore the sorts of mechanisms that tend to work poorly
- How to adjust your approach to operations if you are in an engineering role rather than an executive role
- How cargo-culting remains the largest threat to effective strategy operations

Let’s unpack the details of turning your potentially good policy into an impactful policy.

This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.

What are operational mechanisms?

Operations are how a policy is implemented and reinforced. Effective operations ensure that your policies actually accomplish something.
They can range from a recurring weekly meeting, to an alert that notifies the team when a threshold is exceeded, to a promotion rubric that requires a certain behavior before someone is promoted.

In the strategy for working with new private equity ownership, we introduce a policy to backfill hires at a lower level, and also limit the maximum number of principal engineers:

We will move to an “N-1” backfill policy, where departures are backfilled with a less senior level. We will also institute a strict maximum of one Principal Engineer per business unit, with any exceptions approved in writing by the CTO. This applies for both promotions and external hires.

That introduces an explicit operational mechanism of escalations going to the CTO, but it also introduces an implicit and undefined mechanism: how do we ensure the backfills are actually down-leveled as the policy instructs?

It might be a group chat with engineering recruiting where the CTO approves the level of backfilled roles. Alternatively, it might be the responsibility of recruiting to enforce that down-leveling. In a third approach, it might be taken on trust that hiring managers will do the right thing.

Each of those three scenarios is a potential operational solution to implementing this policy. Operations is picking the right one for your circumstances, and then tweaking it as you learn from running it.

Operations in government

For another interesting take on how critical operations are, Recoding America by Jennifer Pahlka is well worth the read. It explores how well-intended government legislation often isn’t implementable, which results in policies that require massive IT investments but provide little benefit to constituents.

How to evaluate mechanisms

In order to determine the most effective operational mechanisms for the problems you’re working on, it’s useful to have a standardized rubric for evaluating mechanisms.
While this rubric isn’t perfectly universal (customize it for your needs), having any rubric will make it easier to evaluate your options consistently.

The rubric I use to evaluate whether an operational mechanism will be effective is:

- Measurability: Can you measure both leading and lagging indicators to inspect the mechanism’s impact? If you have to choose between the two, measuring leading indicators allows much quicker evaluation and iteration on your mechanisms.
- Adoption cost: How much work will migrating to this mechanism require? Can this work be done incrementally or does it require a major, coordinated shift?
- User ease (or burden): After adopting this policy, how much easier (or harder) will it be for users to perform their work? If things will be harder, are those users able to tolerate the additional time?
- Provider ease (or burden): How much additional ongoing maintenance will this mechanism require from the centralized or platform team providing it? For example, if every new architecture proposal requires a thorough review by your Security team, does the Security team have the actual ability to support those reviews?
- Reliance on authority: How much does this mechanism depend on a top-down authority’s active support? If the sponsoring executive departs, will this mechanism remain effective? Is that an effective tradeoff in this case?
- Culturally aligned: Is this something that your organization is going to do, or something that they are going to fight against each step? Is there a way you can adjust the framing to make it more acceptable to your organization’s culture?

Generally, I find folks are good at evaluating mechanisms against these criteria, but somewhat worse at accepting the consequences of their evaluation.
For example, they fall in love with a particular mechanism and then try to force the organization to accept it even though its adoption cost is unbearably high, or they introduce a mechanism that places a significant user burden on a team that is already struggling with tight efficiency goals, like a customer support team. Self-awareness helps here, but so does consulting others to point out the errors in your reasoning, which is a core part of how I’ve found success in adopting operational mechanisms.

Composing an operational plan

Your operational plan is the sum of the mechanisms used to support your policies. While evaluating each individual mechanism in isolation is part of creating an operations plan, it’s also valuable to consider how the mechanisms will work together:

- Review the policies you’ve developed. What sort of mechanisms seem most likely to support these policies? How might these mechanisms be pooled together to avoid redundancy?
- Review the operational mechanisms that have worked in your organization. What mechanisms have been used to best effect, and which have left a sufficiently bad taste in the organization’s collective memory that they’ll be hard to reuse effectively?
- Which new mechanisms showed up in your exploration? In your exploration phase, you’ll frequently encounter mechanisms that your organization hasn’t previously tried. If any of them seem particularly well-suited to the policies you’re considering, and none of your organization’s frequently used mechanisms are good fits, then consider testing a new one.
- Evaluate mechanisms against the evaluation rubric. For each of the mechanisms you’re considering using, apply the rubric from the “How to evaluate mechanisms” section above to validate that they’re good fits.
- Consolidate into an operational plan. Now that you’ve determined the mechanisms you want to consider, work on fitting the full set of mechanisms into one coherent plan.
Be particularly mindful of the ease, or burden, the integrated plan creates for both your users and platform providers.
- Validate plan with users and providers. Many plans make sense from afar, but fail due to imposing an unreasonable burden. Or the burden might be acceptable, but the actual workflow simply won’t work at all.
- Consider validating via strategy testing. If you run the above process, and can’t come to an agreement with stakeholders on your proposed plan, then simply commit to running a strategy testing process including the plan. This will create space for everyone to build confidence in the approach before they feel forced to make a commitment to following it long-term. Even if you don’t use strategy testing for your plan, at least commit to scheduling a review in three months reflecting on how things have worked out.

Your operational plan is the vehicle that delivers your policies to your organization. It’s extremely tempting to skip refining the details here, but it’s a relatively quick step and will completely change your strategy’s outcomes.

Common mechanisms

Most companies have a handful of frequently used operational mechanisms. Some of those mechanisms are company-specific, such as Amazon’s weekly business review, and others repeat across companies, like requiring executive approval.

Across the many mechanisms you’ll encounter, you can generally cluster them into recurring categories. This section covers the mechanisms I’ve found consistently effective.

Approval and advice forums

At a high level, new policies are obvious, simple and apply cleanly to the problem they are intended to solve. However, when you apply those policies to detailed, complex circumstances, it’s often ambiguous how to stay loyal to a policy’s intentions. Approval and advice forums are a common solution to that problem.
Calm’s product engineering strategy shows what the simplest, and most common, approval forum looks like in practice:

Exceptions are granted by the CTO, and must be in writing. The above policies are deliberately restrictive. Sometimes they may be wrong, and we will make exceptions to them. However, each exception should be deliberate and grounded in concrete problems we are aligned both on solving and how we solve them. If we all scatter towards our preferred solution, then we’ll create negative leverage for Calm rather than serving as the engine that advances our product.

All exceptions must be written. If they are not written, then you should operate as if it has not been granted. Our goal is to avoid ambiguity around whether an exception has, or has not, been approved. If there’s no written record that the CTO approved it, then it’s not approved.

This example also has several weaknesses that happen in many approval forums. Most importantly, it doesn’t make it clear how to get approvals. It would be stronger if it explicitly explained how to get an approval (perhaps go ask in #cto-approvals), and where to find prior approvals to help someone considering requesting an exception to calibrate their request.

Approvals don’t necessarily need to come from senior leadership. Instead, the senior leadership can loan their authority on a topic to another group. The LLM adoption strategy provides a good example of this:

Start with Anthropic. We use Anthropic models, which are available through our existing cloud provider via AWS Bedrock. To avoid maintaining multiple implementations, where we view the underlying foundational model quality to be somewhat undifferentiated, we are not looking to adopt a broad set of LLMs at this point. This is anchored in our Wardley map of the LLM ecosystem.
Exceptions will be reviewed by the Machine Learning Review in #ml-review.

In a more community-minded organization, the approval forums might not require senior leadership involvement at all. Instead, the culture might create an environment where the forums’ feedback is taken seriously on its own merits.

Every company does approval forums a bit differently, ranging from our experiments at Carta with Navigators, granting executive authority for technical decisions to named engineers in each area, to Andrew Harmel-Law’s discussion of this topic in Facilitating Software Architecture. You can spend a lot of time arguing the details here, but my experience is that having the right participants and a good executive sponsor matters a lot, and the other pieces matter a lot less.

Inspection

While even the best policies can fail, the more common scenario is that a policy will sort-of work, and need some modest adjustments to make it more successful. An inspection mechanism allows you to evaluate whether your policy is succeeding and whether you need to make adjustments.

The user-data access strategy provides an example:

Measure progress on percentage of customer data access requests justified by a user-comprehensible, automated rationale. This will anchor our approach on simultaneously improving the security of user data and the usability of our colleagues’ internal tools. If we only expand requirements for accessing customer data, we won’t view this as progress because it’s not automated (and consequently is likely to encourage workarounds as teams try to solve problems quickly). Similarly, if we only improve usability, charts won’t represent this as progress, because we won’t have increased the number of supported requests.
As part of this effort, we will create a private channel where the security and compliance team has visibility into all manual rationales for user-data access, and will directly message the manager of any individual who relies on a manual justification for accessing user data.

This example is a good start, but fully realizing an inspection mechanism requires concretely specifying where and how the data will be tracked. A better version of this would include a link to the dashboard you’ll look at, and a commitment to reviewing the data on a certain frequency.

For a recent inspection mechanism, I created a recurring invite with a link to the relevant data dashboard, and a specific chat channel for discussion, and invited the working group who had agreed to review the data on that cadence. This wasn’t a synchronous meeting, but rather a commitment to independently review, and discuss anything that felt surprising.

Your particular mechanisms could be threshold-triggered alerts, something you fold into an existing metrics review meeting, a script you commit to running and reviewing periodically, or something else. The most important thing is that it cannot silently fail.

Nudges

While it’s common to hear complaints about how a team isn’t following a new policy, as if it were a deliberate choice they’d made, I find it more common that people want to do things the new way, but rarely take time to learn how to do it. Nudges provide individuals with context about a better way they might do something, and they are an exceptionally effective mechanism.

Grounding this in an example, at Stripe we had a policy of allowing teams to self-authorize introducing new cloud hosting costs. This worked well almost all the time. However, sometimes teams would accidentally introduce large cost increases without realizing it, and teams that introduced those spikes almost never had any awareness that they had caused the problem.
Even if we’d told them they must not introduce unapproved spending spikes, they simply didn’t perceive they’d done it.

We could either prevent all teams from introducing new spend, or try using a nudge. The nudge we added informed teams when their cloud spend accelerated month over month, directed them to charts that explained the acceleration, and told them where to go to ask questions. Nudges pair well with inspections, and there was also a monthly review by the Efficiency Engineering team to review any spikes and reach out where necessary.

Maybe we could have forced all teams to review new spend, but this nudge approach didn’t require an authoritative mandate to implement. It also meant we only spent time advising teams that actually spent too much, instead of having to discuss with every team that might spend too much.

As another example making that point, a working group at Carta added a nudge to inform managers of untested pull requests merged by their team. Some managers had previously said they simply didn’t know when and why their team had merged untested pull requests, and this nudge made it easy to detect. The nudge also respected their attention by not sending a notification at all if there wasn’t a new, untested pull request.

With poor ergonomics, nudges can be an overwhelming assault on your colleagues’ attention, but done well, I continue to believe they are the most effective operational mechanism.

Documentation

Policies can’t be enforced by people who don’t know they exist, or by people who don’t know how to follow those policies.

In my experience, nudges are the most effective way of solving both of those problems, because nudges bring information to people at exactly the moment that information would be useful. At most companies, well-done nudges are relatively uncommon, and the far more common solution to lack of information is documentation and training.
There are so many approaches to both of these topics, and I’ve not found my own approaches here particularly effective. Consequently, I am hesitant to give much advice on what will work best for you.

The best I can offer is that following standard practices for your company, even if the outcomes seem imperfect, is probably your best bet. Internal knowledge bases tend to rot quickly, and introducing yet another knowledge base is almost always the illusion of progress rather than real progress. That holds even when you really don’t like the current one.

Finally, remember that success for documentation and training is not necessarily that everyone in the company knows how a new policy works. Instead, as discussed in the chapter on whether strategy is useful, a more useful goal is informational herd immunity: as long as someone on each team understands your policy, the team will generally be capable of following it.

Automation

Relying on humans to respond is slow, and the quality of human response is highly varied. In many cases, automation provides the most effective and most scalable mechanism to support your policies’ rollout.

Automation was key in the Uber service migration strategy, moving us out of a manual, slow process that was taking up a great deal of user and provider time:

Move to structured requests, and out of tickets. Missing or incorrect information in provisioning requests creates significant delays in provisioning. Further, collecting this information is the first step of moving to a self-service process. As such, we can get paid twice by reducing errors in manual provisioning while also creating the interface for self-service workflows.

In that case, better automation allowed us to eliminate a series of back-and-forth negotiations to collect data, and to instead get the necessary information in a single step. Occasionally we still ran into users who couldn’t fill in the form, but now we could focus on providing a good manual experience for those rare exceptions.
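To make the structured-request idea concrete, here is a minimal sketch in Python. It is not the form we actually used: the field names, the `ProvisioningRequest` type, and the `validate` helper are all hypothetical, chosen only to illustrate collecting every problem in a single pass rather than through a series of back-and-forth messages.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProvisioningRequest:
    """Hypothetical structured request; every field name is illustrative."""
    service_name: Optional[str] = None
    owning_team: Optional[str] = None
    region: Optional[str] = None
    instance_count: Optional[int] = None


def validate(request: ProvisioningRequest) -> list[str]:
    """Return every problem at once, so the requester can fix them in one pass."""
    problems = []
    # Flag any field that was left unset or blank.
    for field in fields(request):
        if getattr(request, field.name) in (None, ""):
            problems.append(f"missing required field: {field.name}")
    # Flag values that are present but not usable.
    if request.instance_count is not None and request.instance_count < 1:
        problems.append("instance_count must be at least 1")
    return problems


complete = ProvisioningRequest("payments-api", "payments", "us-east-1", 3)
incomplete = ProvisioningRequest(service_name="payments-api", instance_count=0)

print(validate(complete))    # an empty list: nothing to fix
print(validate(incomplete))  # two missing fields plus the invalid count
```

The design choice worth noticing is that validation reports all problems together instead of stopping at the first one, which is what removes the negotiation loop. The same structured type could later serve as the input to a self-service workflow.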
As you use automation as a core strategy mechanism, it’s important to recognize that designing an effective user experience is a prerequisite to automation having a positive impact. If you view the user experience of your automation as a secondary concern, then you are unlikely to make much impact with automation.

Deferment to future work

Sometimes there’s something you really want a policy to do, but you also know that you have no reasonable mechanism to do it. In that case, you may find explicitly deferring action on the topic useful.

The strategy for integrating the Index acquisition at Stripe uses this mechanism:

Defer making a decision regarding the introduction of Java to a later date: the introduction of Java is incompatible with our existing engineering strategy, but at this point we’ve also been unable to align stakeholders on how to address this decision. Further, we see attempting to address this issue as a distraction from our timely goal of launching a joint product within six months. We will take up this discussion after launching the initial release.

As did the strategy for working with a private equity acquirer:

We believe there are significant opportunities to reduce R&D maintenance investments, but we don’t have conviction about which particular efforts we should prioritize. We will kick off a working group to identify the features with the highest support load.

There’s no shame in deferral. As much as you want to make progress on a certain area, it’s better to explicitly acknowledge that you can’t make progress on it (and clarify when you will be able to) than to allow the organization to churn on an intractable problem.

Meetings

Meetings are the final mechanism, and you can fit any and all of the above mechanisms into a meeting. They are a universal mechanism, although frequently overused because they can do an adequate job of operating almost any policy.
The most common meeting archetype is a reporting meeting, such as reporting progress in the Executive Weekly Meeting as suggested in the LLM adoption strategy:

Develop an LLM-backed process for reactivating departed and suspended drivers in mature markets. Through modeling our driver lifecycle, we determined that improving onboarding time will have little impact on the total number of active drivers. Instead, we are focusing on mechanisms to reactivate departed and suspended drivers, which is the only opportunity to meaningfully impact active drivers. Report on progress monthly in Exec Weekly Meeting, coordinated in #exec-weekly.

The other common meeting archetype is the weekly working meeting introduced in the chapter on strategy testing.

Meetings are almost always the most expensive mechanism you can find to solve a problem, but they are easy to suggest, run, and iterate on. If you can’t find any other mechanism you believe in, then a meeting is a decent starting point. Just don’t get too fond of them, and try to iterate your way to canceling every meeting that you start.

Anti-patterns

In addition to the effective operational methods discussed above, there are a number of additional mechanisms that are frequently used, but which I consider anti-patterns. They can provide some value, but there’s almost always a better alternative.

- Top-down pronouncements: Sometimes a policy will be operationalized by simply declaring it must be followed. It’s common to see a leader declare that a policy is now in effect, assuming that the announcement is a useful way to implement the new policy. For example, some “return to office” policies dictate that the team must work from their office, but driving a real change requires motivating those individuals to actually return.
- Education-as-announcements rollouts: The default way that many companies roll out policies is through one-time “education,” often as an all-company announcement for existing employees.
They might follow up by updating training for onboarding new hires. Education sounds great, but a couple of trainings will never change organizational behavior. Changing behavior requires ongoing reminders, visible role models, inspection to understand why some teams are not adopting the behavior, and so on. Education can be a good component of operationalizing a policy, but it cannot stand on its own.
- Mandatory recurring trainings: These are a staple of compliance-driven policies, generally because of laws which require providing a certain number of hours of relevant training each year. There are two deep challenges with mandatory trainings. First, because attendance is required, people tend to make little effort to make the content good. Second, many folks don’t pay attention because they expect the content to be low quality. It’s not uncommon to hear people say that they’ve never heard of a policy that they’ve performed annual training on for multiple years. It’s possible to overcome these barriers, but in a situation where you’re accountable for changing outcomes, as opposed to shifting legal obligations away from the company, these tend to work poorly.
- Just change the culture: Some leaders frame most problems as cultural problems, which is a reasonable frame: most things can be usefully viewed as a cultural problem. Unfortunately, it’s common for those who rely heavily on the cultural frame to also have a simplistic view about how culture is changed. Changing an organization’s culture is tricky, and requires a combination of many techniques to create visible leaders role modeling the new behavior, and reinforcement mechanisms to ensure pockets of dissent are weeded out. Anyone who frames culture change as a simple or instant change is living in an imaginary world.

If you’re using one of these approaches, it isn’t necessarily a bad choice.
Instead, you should just make sure you can explain why you’re using it, and then you need to also make sure you believe that explanation. If you don’t, look for a mechanism from the earlier Common mechanisms section.

What if you’re not an executive?

It’s easy to get discouraged when you think about which operational mechanisms are available to you as a non-executive. So many of the frequently seen mechanisms, like running mandatory recurring meetings or a binding architecture review process, are not accessible to you.

That is true: they’re not accessible to you. However, there’s always a related mechanism that can be implemented with less authority. The binding architecture process can be replaced with an architectural advice process. The mandatory review of pull requests can be replaced with a nudge.

Although it may be more common to see the authoritative mechanisms in the companies you work in, my experience working as an executive is that these authoritative mechanisms don’t work particularly well. They do a great job of technically shifting accountability to the wider organization, but they often don’t change behavior at all.

So, instead of getting frustrated by what you can’t do, focus instead on the mechanisms that are available to you today. Add nudges, focus on the real dynamics of how colleagues do work in your organization, and build a real dataset. It’s very hard to get an executive to support your initiative before the mechanisms and data exist to support it, and very easy to get their support once they do.

Once you’ve done what you can without authority to build confidence, if you really do need more authority, then you’re in a good place to escalate to get an executive to support your policies.

Beware cargo-culting

The longer that I am in the industry, the more I am surprised by how few strategists seem to care if their approach actually works.
Instead, they seem focused on doing something that might work, offloading accountability to either the organization or some team, and then moving off to the next problem. Perhaps this is driven by an unfortunate reality that leaders are often evaluated by how they appear, rather than by what they accomplish.

Whether or not that’s the underlying reason it happens, it does make it surprisingly difficult to know which patterns to borrow from strategy rollouts and implementations. The best advice, unfortunately, is to remain skeptically optimistic. Collect ideas widely, but force the ideas to prove their merit.

Summary

Now that you’ve finished this chapter, you’re significantly more qualified to write a complete, useful strategy than I was a decade into my career. Often skipped, the operations behind your strategy are at least as essential as any other step, and any strategy without them will fade quietly into your organization’s history.

In addition to being able to roll out a strategy of your own, this chapter also provides a useful rescue toolkit you can use to put an existing, floundering strategy back on track. If you don’t see an opportunity to write new strategy within your organization, then there’s still probably room to flex your operational skill.

4 days ago
Career advice in 2025.

Yesterday, the tj-actions repository, a popular tool used with GitHub Actions, was compromised (for more background read one of these two articles). Watching the infrastructure and security engineering teams at Carta respond highlighted to me just how much LLMs can’t meaningfully replace many essential roles of software professionals. However, I’m also reading Jennifer Pahlka’s Recoding America, which makes an important point: decision-makers can remain irrational longer than you can remain solvent. (Or, in this context, remain employed.)

I’ve been thinking about this a lot lately, as I’ve ended up having more “2025 is not much fun”-themed career discussions with prior colleagues navigating the current job market. I’ve tried to pull together my points from those conversations here:

Many people who first entered senior roles in 2010-2020 are finding current roles a lot less fun. There are a number of reasons for this. First, managers were generally evaluated in that period based on their ability to hire, retain and motivate teams. The current market doesn’t value those skills particularly highly, but instead prioritizes a different set of skills: working in the details, pushing pace, and navigating the technology transition to foundational models / LLMs.

This means many members of the current crop of senior leaders are either worse at the skills they currently need to succeed, or are less motivated by those activities. Either way, they’re having less fun. Similarly, the would-be senior leaders from the 2010-2020 era who excelled at working in the details, pushing pace and so on, are viewed as stagnant in their careers, so they are still finding it difficult to move into senior roles.

This means that many folks feel like the current market has left them behind. This is, of course, not universal. It is a general experience that many people are having. Many people are not having this experience.
The technology transition to foundational models / LLMs as a core product and development tool is causing many senior leaders’ hard-earned playbooks to be invalidated. Many companies that were stable, durable market leaders are now in tenuous positions because foundational models threaten to erode their advantage. Whether or not their advantage is truly eroded is uncertain, but it is clear that usefully adopting foundational models into a product requires more than simply shoving an OpenAI/Anthropic API call in somewhere.

Instead, you have to figure out how to design with progressive validation, with critical data validated via human-in-the-loop techniques before it is used in a critical workflow. It also requires designing for a rapidly improving toolkit: many workflows that were laughably bad in 2023 work surprisingly well with the latest reasoning models. Effective product design requires architecting for both massive improvement, and no improvement at all, of models in 2026-2027.

This is equally true of writing software itself. There’s so much noise about how to write software, and much of it is clearly propaganda (this post’s opening anecdote regarding the tj-actions repository shows that expertise remains essential), but parts of it aren’t. I spent a few weeks in the evenings working on a new side project via Cursor in January, and I was surprised at how much my workflow changed even though Cursor itself was far from perfect. Even since then, Claude has advanced from 3.5 to 3.7 with extended thinking. Again, initial application development might easily be radically different in 2027, or it might be largely unchanged after the scaffolding step in complex codebases. (I’m also curious to see if context window limitations drive another flight from monolithic architectures.)

Sitting out this transition, when we are relearning how to develop software, feels like a high-risk proposition.
Your well-honed skills in team development are already devalued today relative to three years ago, and now your other skills are at risk of being devalued as well. Valuations and funding are relatively less accessible to non-AI companies than they were three years ago. Certainly elite companies are doing alright, whether or not they have a clear AI angle, but the cutoff for remaining elite has risen. Simultaneously, the public markets are challenged, which means less willingness for both individuals and companies to purchase products, which slows revenue growth, further challenging valuations and funding. The consequence of this if you’re at a private, non-AI company, is that you’re likely to hire less, promote less, see less movement in pay bands, and experience a less predictable path to liquidity. It also means fewer open roles at other companies, so there’s more competition when attempting to trade up into a larger, higher compensated role at another company. The major exception to this is joining an AI company, but generally those companies are in extremely competitive markets and are priced more appropriately for investors managing a basket of investments than for employees trying to deliver a predictable return. If you join one of these companies today, you’re probably joining a bit late to experience a big pop, your equity might go to zero, and you’ll be working extremely hard for the next five to seven years. This is the classic startup contract, but not necessarily the contract that folks have expected over the past decade as maximum compensation has generally come from joining a later-stage company or member of the Magnificent Seven. As companies respond to the reduced valuations and funding, they are pushing their teams harder to find growth with their existing team. In the right environment, this can be motivating, but people may have opted into to a more relaxed experience that has become markedly less relaxed without their consent. 
If you pull all those things together, you’re essentially in a market where profit and pace are fixed, and you have to figure out how you personally want to optimize between people, prestige and learning. Whereas a few years ago, I think these variables were much more decoupled, that is not what I hear from folks today, even if their jobs were quite cozy a few years ago. Going a bit further, I know folks who are good at their jobs, and have been struggling to find something meaningful for six-plus months. I know folks who are exceptionally strong candidates, who can find reasonably good jobs, but even they are finding that the sorts of jobs they want simply don’t exist right now. I know folks who are strong candidates but with some oddities in their profile, maybe too many short stints, who are now being filtered out because hiring managers need some way to filter through the higher volume of candidates. I can’t give advice on what you should do, but if you’re finding this job market difficult, it’s certainly not personal. My sense is that’s basically the experience that everyone is having when searching for new roles right now. If you are in a role today that’s frustrating you, my advice is to try harder than usual to find a way to make it a rewarding experience, even if it’s not perfect. I also wouldn’t personally try to sit this cycle out unless you’re comfortable with a small risk that reentry is quite difficult: I think it’s more likely that the ecosystem is meaningfully different in five years than that it’s largely unchanged. Altogether, this hasn’t really been the advice that anyone wanted when they chatted with me, but it seems to generally have resonated with them as a realistic appraisal of the current markets. Hopefully there’s something useful for you in here as well.

Who gets to do strategy?

If you talk to enough aspiring leaders, you’ll become familiar with the prevalent idea that they need to be promoted before they can work on strategy. It’s a truism, but I’ve also found this idea perfectly wrong: you can work on strategy from anywhere in an organization, it just requires different tactics to do so. Both Staff Engineer and The Engineering Executive’s Primer have chapters on strategy. While the chapters’ contents are quite different, both present a practical path to advancing your organization’s thinking about complex topics. This chapter explains my belief that anyone within an organization can make meaningful progress on strategy, particularly if you are honest about the tools accessible to you, and thoughtful about how to use them. The themes we’ll dig into are: How to do strategy as an engineer, particularly an engineer who hasn’t been given explicit authority to do strategy Doing strategy as an engineering executive who is responsible for your organization’s decision-making How you can do engineering strategy even when you depend on an absent strategy, cannot acknowledge parts of the diagnosis because addressing certain problems is politically sensitive, or struggle with pockets of misaligned incentives If this book’s argument is that everyone should do strategy, is there anyone who, nonetheless, really should not do strategy? By the end, you’ll hopefully agree that engineering strategy is accessible to everyone, even though you’re always operating within constraints. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Doing strategy as an engineer It’s easy to get so distracted by executives’ top-down approach to strategy that you convince yourself that there aren’t other approachable mechanisms to doing strategy. There are!
Staff Engineer introduces an approach I call Take five, then synthesize, which does strategy by: Documenting how five current and historical related decisions have been made in your organization. This is an extended exploration phase Synthesizing those five documents into a diagnosis and policy. You are naming the implicit strategy, so it’s impossible for someone to reasonably argue you’re not empowered to do strategy: you’re just describing what’s already happening At that point, either the organization feels comfortable with what you’ve written–which is their current strategy–or it doesn’t in which case you’ve forced a conversation about how to revise the approach. Creating awareness is often enough to drive strategic change, and doesn’t require any explicit authorization from an executive to do. When awareness is insufficient, the other pattern I’ve found highly effective in low-authority scenarios is an approach I wrote about in An Elegant Puzzle, and call model, document, and share: Model the approach you want others to adopt. Make it easy for them to observe how you’ve changed the way you’re doing things. Document the approach, the thinking behind it, and how to adopt it. Share the document around. If people see you succeeding with the approach, then they’re likely to copy it from you. You might be skeptical because this is an influence-based approach. However, as we’ll discuss in the next section, even executive-driven strategy is highly dependent on influence. Strategy archaeology Vernor Vinge’s A Deepness in the Sky, published in 1999, introduced the term software archaeologists, folks who created functionality by cobbling together millennia of scraps of existing software. Although it’s a somewhat different usage, I sometimes think of the “take five, then synthesize” approach as performing strategy archaeology. Simply by recording what has happened in the past, we make it easier to understand the present, and influence the future. 
Doing strategy as an executive The biggest misconception about executive roles, frequently held by non-executives and new executives who are about to make a series of regrettable mistakes, is that executives operate without constraints. That is false: executives have an extremely high number of constraints that they operate under. Executives have budgets, CEO visions, peers to satisfy, and a team to motivate. They can disappoint any of these temporarily, but long-term have to satisfy all of them. Nonetheless, it is true that executives have more latitude to mandate and cajole participation in the strategies that they sponsor. The Engineering Executive’s Primer’s chapter on strategy is a brief summary of this entire book, but it doesn’t say much about how executive strategy differs from non-executive strategy. How the executive’s approach to strategy differs from the engineer’s can be boiled down to: Executives can mandate following of their strategy, which empowers their policy options. An engineer can’t prevent the promotion of someone who refused to follow their policy, but an executive can. Mandates only matter if there are consequences. If an executive is unwilling to enforce consequences for non-compliance with a mandate, the ability to issue a mandate isn’t meaningful. This is also true if they can’t enforce a mandate because of lack of support from their peer executives. Even if an executive is unwilling to use mandates, they have significant visibility and access to their organization to advocate for their preferred strategy. Neither access nor mandates improve an executive’s ability to diagnose problems. However, both often create the appearance of progress. This is why executive strategies can fail so spectacularly and endure so long despite failure. As a result, my experience is that executives have an easier time doing strategy, but a much harder time learning how to do strategy well, and fewer protections to avoid serious mistakes. 
Further, the consequences of an executive’s poor strategy tend to be much further reaching than an engineer’s. Waiting to do strategy until you are an executive is a recipe for disaster, even if it looks easier from a distance. Doing strategy in other roles Even if you’re neither an engineer nor an engineering executive, you can still do engineering strategy. It’ll just require an even more influence-driven approach. The engineering organization is generally right to believe that they know the most about engineering, but that’s not always true. Sometimes a product manager used to be an engineer and has significant relevant experience. Other times, such as the early adoption of large language models, engineers don’t know much either, and benefit from outside perspectives. Doing strategy in challenging environments Good strategies succeed by accurately diagnosing circumstances and picking policies that address those circumstances. You are likely to spend time in organizations where both of those are challenging due to internal limitations, so it’s worth acknowledging that and discussing how to navigate those challenges. Low-trust environment Sometimes the struggle to diagnose problems is a skill issue. Being bad at strategy is in some ways the easy problem to solve: just do more strategy work to build expertise. In other cases, you may see what the problems are fairly clearly, but not know how to acknowledge the problems because your organization’s culture would frown on it. The latter is a diagnosis problem rooted in low-trust, and does make things more difficult. 
The chapter on Diagnosis recognizes this problem, and admits that sometimes you have to whisper the controversial parts of a strategy: When you’re writing a strategy, you’ll often find yourself trying to choose between two awkward options: say something awkward or uncomfortable about your company or someone working within it, or omit a critical piece of your diagnosis that’s necessary to understand the wider thinking. Whenever you encounter this sort of debate, my advice is to find a way to include the diagnosis, but to reframe it into a palatable statement that avoids casting blame too narrowly. In short, the solution to low-trust is to translate difficult messages into softer, less direct versions that are acceptable to state. If your goal is to hold people accountable, this can feel dishonest or like an ethical compromise, but the goal of strategy is to make better decisions, which is an entirely different concern than holding folks accountable for the past. Karpman Drama Triangle Sometimes when the diagnosis seems particularly obvious, and people don’t agree with you, it’s because you are wrong. When I’ve been obviously wrong about things I understand well, it’s usually because I’ve fallen into viewing a situation through the Karpman Drama Triangle, where all parties are mapped as the persecutor, the rescuer, or the victim. Poor-judgment environment Even when you do an excellent job diagnosing challenges, it can be difficult to drive agreement within the organization about how to address them.
Sometimes this is due to genuinely complex tradeoffs, for example in Stripe’s acquisition of Index, there was debate about how to deal with Index’s Java-based technology stack, which culminated in a compromise that didn’t make anyone particularly happy: Defer making a decision regarding the introduction of Java to a later date: the introduction of Java is incompatible with our existing engineering strategy, but at this point we’ve also been unable to align stakeholders on how to address this decision. Further, we see attempting to address this issue as a distraction from our timely goal of launching a joint product within six months. We will take up this discussion after launching the initial release. That compromise is a good example of a difficult tradeoff: although parties disagreed with the approach, everyone understood the conflicting priorities that had to be addressed. In other cases, though, there are policy choices that simply don’t make much sense, generally driven by poor judgment in your organization. Sometimes it’s not poor technical judgment, but poor judgment in choosing to prioritize one’s personal interests at the expense of the company’s needs. Calm’s strategy to focus on being a product-engineering organization dealt with some aspects of that, acknowledged in its diagnosis: We’re arguing a particularly large amount about adopting new technologies and rewrites. Most of our disagreements stem around adopting new technologies or rewriting existing components into new technology stacks. For example, can we extend this feature or do we have to migrate it to a service before extending it? Can we add this to our database or should we move it into a new Redis cache instead? Is JavaScript a sufficient programming language, or do we need to rewrite this functionality in Go? In that situation, your strategy is an attempt to educate your colleagues about the tradeoffs they are making, but ultimately sometimes folks will disagree with your strategy. 
In that case, remember that most interesting problems require iterative solutions. Writing your strategy and sharing it will start to change the organization’s mind. Don’t get discouraged even if that change is initially slow. Dealing with missing strategies The strategy for dealing with new private equity ownership introduces a common problem: lack of clarity about what other parts of your own company want. In that case, it seems likely there will be a layoff, but it’s unclear how large that layoff will be: Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don’t have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction. Many leaders encounter that sort of ambiguity and decide that they cannot move forward with a strategy of their own until that decision is made. While it’s true that it’s inconvenient not to know the details, getting blocked by ambiguity is always the wrong decision. Instead you should do what the private equity strategy does: accept that ambiguity as a fact to be worked around. Rather than giving up, it adopts a series of new policies to start reducing cost growth by changing their organization’s seniority mix, and recognizes that once there is clarity on reduction targets that there will be additional actions to be taken. Whenever you’re doing something challenging, there are an infinite number of reasonable rationales for why you shouldn’t or can’t make progress. Leadership is finding a way to move forward despite those issues. A missing strategy is always part of your diagnosis, but never a reason that you can’t do strategy. Who shouldn’t do strategy In my experience, there’s almost never a reason why you cannot do strategy, but there are two particular scenarios where doing strategy probably doesn’t make sense. 
The first is not a who, but a when problem: sometimes there is so much strategy already happening, that doing more is a distraction. If another part of your organization is already working on the same problem, do your best to work with them directly rather than generating competing work. The other time to avoid strategy is when you’re trying to satisfy an emotional need to make a direct, immediate impact. Sharing a thoughtful strategy always makes progress, but it’s often the slow, incremental progress of changing your organization’s beliefs. Even definitive, top-down strategies from executives are often ignored in pockets of an organization, and bottom-up strategies spread slowly as they are modeled, documented and shared. Embarking on strategy work requires a tolerance for winning in the long run, even when there’s little progress this week or this quarter. Summary As you finish reading this chapter, my hope is that you also believe that you can work on strategy in your organization, whether you’re an engineer or an executive. I also hope that you appreciate that the tools you use vary greatly depending on who you are within your organization and the culture in which you work. Whether you need to model or can mandate, there’s a mechanism that will work for you.

How to integrate Stripe's acquisition of Index? (2018)

While discussions around acquisitions often focus on technical diligence and deciding whether to make the acquisition, the integration that follows afterwards can be even more complex. There are few irreversible trapdoor decisions in engineering, but decisions made early in an integration tend to be surprisingly durable. This engineering strategy explores Stripe’s approach to integrating their 2018 acquisition of Index. While a business book would focus on the rationale for the acquisition itself, here that rationale is merely part of the diagnosis that defines the integration tradeoffs. The integration itself is the area of focus. Like most acquisitions, the team responsible for the integration has only learned about the project after the deal closed, which means early efforts are a scramble to apply strategy testing to distinguish between optimistic dates and technical realities. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Reading this document To apply this strategy, start at the top with Policy & Operation. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document. Policy & Operation We’re starting with little shared context between the acquired and acquiring engineering teams, and have a six month timeline to launch a joint product. So our starting policy is a mix of a commitment to joint refinement and several provisional architectural policies: Meet at least weekly until the initial release is complete: the involved leadership from Stripe and Index will hold a weekly sync meeting to refine our approach until we fulfill our initial release timeline. This meeting is jointly owned by Stripe’s Head of Traffic Engineering and Index’s Head of Engineering.
Minimize changes to tokenization environment: because point-of-sale devices directly work with customer payment details, the API that directly supports the point-of-sale device must live within our secured environment where payment details are stored. However, any other functionality must not be added to our tokenization environment. All other functionality must exist in standard environments: except for the minimum necessary functionality moving into the tokenization environment, everything else must be operated in our standard, non-tokenization environments. In particular, any software that requires frequent changes, or introduces complex external dependencies, should exist in the standard environments. Defer making a decision regarding the introduction of Java to a later date: the introduction of Java is incompatible with our existing engineering strategy, but at this point we’ve also been unable to align stakeholders on how to address this decision. Further, we see attempting to address this issue as a distraction from our timely goal of launching a joint product within six months. We will take up this discussion after launching the initial release. Escalations come to paired leads: given our limited shared context across teams, all escalations must come to both Stripe’s Head of Traffic Engineering and Index’s Head of Engineering. Security review of changes impacting tokenization environment: we need to move quickly to launch the combined point-of-sale and payments product, but we must not cut corners on security to launch faster.
Security must be included and explicitly sign off on any integration decisions that involve our tokenization environment Diagnose There are generally four categories of acquisitions: talent acquisitions to bring on a talented team, business acquisitions to buy a company’s revenue and product, technology acquisitions to add a differentiated capability that would be challenging to develop internally, and time-to-market acquisitions where you could develop the capability internally but can develop it meaningfully faster by acquiring a company. While most acquisitions have a flavor of several of these dimensions, this acquisition is primarily a time-to-market acquisition aimed to address these constraints: Several of our largest customers are pushing for us to provide a point-of-sale device integrated with our API-driven payments ecosystem. At least one has implied that we either provide this functionality on a committed timeline or they may churn to a competitor. We currently have no homegrown expertise in developing or integrating with hardware such as point-of-sale devices. Based on other zero-to-one efforts internally, we believe it would take about a year to hire the team, develop and launch a minimum-viable product for a point-of-sale device integrated into our platform. Where we’ve taken a horizontal approach to supporting web payments via an API, at least one of our competitors, Square, has taken a vertically integrated approach. While their API ecosystem is less developed than ours, they are a plausible destination for customers threatening to churn. We believe that at least one of our enterprise customers will churn if our best commitment is launching a point-of-sale solution 12 months from now. We’ve decided to acquire a small point-of-sale startup, which we will use to commit to a six month timeframe for supporting an integrated point-of-sale device with our API ecosystem. We will need to rapidly integrate the acquired startup to meet this timeline. 
We only know a small number of details about what this will entail. We do know that point-of-sale devices directly operate on payment details (e.g. the point-of-sale device knows the credit card details of the card it reads). Our compliance obligations restrict such activity to our “tokenization environment”, a highly secured and isolated environment with direct access to payment details. This environment converts payment details into a unique token that other environments can utilize to operate against payment details without the compliance overhead of having direct access to the underlying payment details. Going into this technical integration, we have few details about the acquired company’s technology stack. We do know that they are primarily a Java shop running on AWS, where we are primarily a Ruby (with some Go) shop running on AWS. Explore Prior to this acquisition, we have done several small acquisitions. None of those acquisitions had a meaningful product to integrate with ours, so we don’t have much of an internal playbook to anchor our approach in. We do have limited experience in integrating technical acquisitions from prior companies we’ve worked in, along with talking to peers at other companies to mine their experience. Synthesizing those experiences, the recurring patterns are: Usually deal teams have made certain commitments, or the acquired team has understood certain commitments, that will be challenging to facilitate. This is doubly true when you are unaware of what those commitments might be. If folks seem to be behaving oddly, it might be one such misunderstanding, and it’s worth engaging directly to debug the confusion. There should be an executive sponsor for the acquisition, and the sponsor is typically the best person to ask about the company’s intentions. If you can’t find the executive sponsor, or they are not engaged, try to recruit a new executive sponsor rather than trying to make things work without one.
Close the culture gap quickly where there’s little friction, and cautiously where there’s little trust. We do need to bring the acquired company into our culture, but we have years to do that. The most successful stories of doing this leaned on a mix of moving folks into and out of the acquired team rather than applying force. The long-term cost of supporting a new technology stack is high, and in conflict with our technology strategy of consolidating on as few programming languages as possible. This is not the place to be flexible, as each additional feature in the new stack will take you further from your desired outcome. Finally, find a way to derisk key departures. Things can go wrong quickly. One of the easiest starting points is consolidating infrastructure immediately, even if the product or software takes longer. Altogether, this was not the most reassuring exploration: it was a bit abstract, and much of our research returned strongly-held, conflicting perspectives. Perhaps acquisitions, like starting a new company, are one of those places where there’s simply no right way to do it well.


More in programming

Reduced Hours and Remote Work Options for Employees with Young Children in Japan

Japan already stipulates that employers must offer the option of reduced working hours to employees with children under three. However, the Child Care and Family Care Leave Act was amended in May 2024, with some of the new provisions coming into effect April 1 or October 1, 2025. The updates to the law address: Remote work Flexible start and end times Reduced hours On-site childcare facilities Compensation for lost salary And more Legal changes are one thing, of course, and social changes are another. Though employers are mandated to offer these options, how many employees in Japan actually avail themselves of these benefits? Does doing so create any stigma or resentment? Recent studies reveal an unsurprising gender disparity in accepting a modified work schedule, but generally positive attitudes toward these accommodations overall. The current reduced work options Reduced work schedules for employees with children under three years old are currently regulated by Article 23(1) of the Child Care and Family Care Leave Act. This Article stipulates that employers are required to offer accommodations to employees with children under three years old. Those accommodations must include the opportunity for a reduced work schedule of six hours a day. However, if the company is prepared to provide alternatives, and if the parent would prefer, this benefit can take other forms—for example, working seven hours a day or working fewer days per week. 
Eligible employees for the reduced work schedule are those who: Have children under three years old Normally work more than six hours a day Are not employed as day laborers Are not on childcare leave during the period to which the reduced work schedule applies Are not one of the following, which are exempted from the labor-management agreement Employees who have been employed by the company for less than one year Employees whose prescribed working days per week are two days or less Although the law requires employers to provide reduced work schedules only while the child is under three years old, some companies allow their employees with older children to work shorter hours as well. According to a 2020 survey by the Ministry of Health, Labor and Welfare, 15.8% of companies permit their employees to use the system until their children enter primary school, while 5.7% allow it until their children turn nine years old or enter third grade. Around 4% offer reduced hours until children graduate from elementary school, and 15.4% of companies give the option even after children have entered middle school. If, considering the nature or conditions of the work, it is difficult to give a reduced work schedule to employees, the law stipulates other measures such as flexible working hours. This law has now been altered, though, to include other accommodations. Updates to The Child Care and Family Care Leave Act Previously, remote work was not an option for employees with young children. Now, from April 1, 2025, employers must make an effort to allow employees with children under the age of three to work remotely if they choose. From October 1, 2025, employers are also obligated to provide two or more of the following measures to employees with children between the ages of three and the time they enter elementary school. 
An altered start time without changing the daily working hours, either by using a flex time system or by changing both the start and finish time for the workday The option to work remotely without changing daily working hours, which can be used 10 or more days per month Company-sponsored childcare, by providing childcare facilities or other equivalent benefits (e.g., arranging for babysitters and covering the cost) 10 days of leave per year to support employees’ childcare without changing daily working hours A reduced work schedule, which must include the option of 6-hour days How much it’s used in practice Of course, there’s always a gap between what the law specifies, and what actually happens in practice. How many parents typically make use of these legally-mandated accommodations, and for how long? The numbers A survey conducted by the Ministry of Health, Labor and Welfare in 2020 studied uptake of the reduced work schedule among employees with children under three years old. In this category, 40.8% of female permanent employees (正社員, seishain) and 21.6% of women who were not permanent employees answered that they use, or had used, the reduced work schedule. Only 12.3% of male permanent employees said the same. The same survey was conducted in 2022, and researchers found that the gap between female and male employees had actually widened. According to this second survey, 51.2% of female permanent employees and 24.3% of female non-permanent employees had reduced their hours, compared to only 7.6% of male permanent employees. Not only were fewer male employees using reduced work programs, but 41.2% of them said they did not intend to make use of them. By contrast, a mere 15.6% of female permanent employees answered they didn’t wish to claim the benefit. Of those employees who prefer the shorter schedule, how long do they typically use the benefit? 
The following charts, using data from the 2022 survey, show at what point those employees stop reducing their hours and return to a full-time schedule.

| | Female permanent employees | Female non-permanent employees | Male permanent employees | Male non-permanent employees |
|---|---|---|---|---|
| Until youngest child turns 1 | 13.7% | 17.9% | 50.0% | 25.9% |
| Until youngest child turns 2 | 11.5% | 7.9% | 14.5% | 29.6% |
| Until youngest child turns 3 | 23.0% | 16.3% | 10.5% | 11.1% |
| Until youngest child enters primary school | 18.9% | 10.5% | 6.6% | 11.1% |
| Sometime after the youngest child enters primary school | 22.8% | 16.9% | 6.5% | 11.1% |
| Not sure | 10% | 30.5% | 11.8% | 11.1% |

From the companies’ perspectives, according to a survey conducted by the Cabinet Office in 2023, 65.9% of employers answered that their reduced work schedule system is fully used by their employees.

What’s the public perception?

Some fear that the number of people using the reduced work program—and, especially, the number of women—has created an impression of unfairness for those employees who work full-time. This is a natural concern, but statistics paint a different picture. In a survey of 300 people conducted in 2024, 49% actually expressed a favorable opinion of people who work shorter hours. Also, 38% had “no opinion” toward colleagues with reduced work schedules, indicating that 87% total don’t negatively view those parents who work shorter hours. While attitudes may vary from company to company, the public overall doesn’t seem to attach any stigma to parents who reduce their work schedules.

Is this “the Mommy Track”?

Others are concerned that working shorter hours will detour their career path. According to this report by the Ministry of Health, Labour and Welfare, 47.6% of male permanent employees indicated that, as the result of working fewer hours, they had been changed to a position with less responsibility. The same thing happened to 65.6% of male non-permanent employees, and 22.7% of female permanent employees.
Therefore, it’s possible that using the reduced work schedule can affect one’s immediate chances for advancement. However, while 25% of male permanent employees and 15.5% of female permanent employees said the quality and importance of the work they were assigned had gone down, 21.4% of male and 18.1% of female permanent employees said the quality had gone up. Considering 53.6% of male and 66.4% of female permanent employees said it stayed the same, there seems to be no strong correlation between reducing one’s working hours, and being given less interesting or important tasks. Reduced work means reduced salary These reduced work schedules usually entail dropping below the originally-contracted work hours, which means the employer does not have to pay the employee for the time they did not work. For example, consider a person who normally works 8 hours a day reducing their work time to 6 hours a day (a 25% reduction). If their monthly salary is 300,000 yen, it would also decrease accordingly by 25% to 225,000 yen. Previously, both men and women have avoided reduced work schedules, because they do not want to lose income. As more mothers than fathers choose to work shorter hours, this financial burden tends to fall more heavily on women. To address this issue, childcare short-time employment benefits (育児時短就業給付) will start from April 2025. These benefits cover both male and female employees who work shorter hours to care for a child under two years old, and pay a stipend equivalent to 10% of their adjusted monthly salary during the reduced work schedule. Returning to the previous example, this stipend would grant 10% of the reduced salary, or 22,500 yen per month, bringing the total monthly paycheck to 247,500 yen, or 82.5% of the normal salary. This additional stipend, while helpful, may not be enough to persuade some families to accept shorter hours. 
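The salary example above can be checked with a few lines of arithmetic. This is only a sketch of the article's simplified example: the function name `reduced_schedule_pay` and its parameters are made up for illustration, and it assumes pay scales linearly with hours and that the stipend is a flat 10% of the reduced salary, ignoring any caps or adjustments the actual benefit rules may apply.

```python
def reduced_schedule_pay(monthly_salary, normal_hours, reduced_hours, stipend_percent=10):
    """Return (reduced_salary, stipend, total, share_of_normal), amounts in whole yen.

    Hypothetical helper: assumes salary is cut in proportion to hours and the
    stipend is a flat percentage of the reduced salary.
    """
    reduced_salary = monthly_salary * reduced_hours // normal_hours
    stipend = reduced_salary * stipend_percent // 100
    total = reduced_salary + stipend
    return reduced_salary, stipend, total, total / monthly_salary


# The article's example: 300,000 yen per month, 8-hour days reduced to 6-hour days.
reduced, stipend, total, share = reduced_schedule_pay(300_000, 8, 6)
print(f"reduced salary: {reduced:,} yen")
print(f"stipend:        {stipend:,} yen")
print(f"total:          {total:,} yen ({share:.1%} of normal salary)")
```

Changing `reduced_hours` to 7, for instance, shows how a smaller cut in hours changes the take-home amount under the same assumptions.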
The childcare short-time employment benefits are available to employees who meet the following criteria: The person is insured, and is working shorter hours to care for a child under two years old. The person started a reduced work schedule immediately after using the childcare leave covered by childcare leave benefits, or the person has been insured for 12 months in the two years prior to the reduced work schedule. Conclusion Japan’s newly-mandated options for reduced schedules, remote work, financial benefits, and other childcare accommodations could help many families in Japan. However, these programs will only prove beneficial if enough employees take advantage of them. As of now, there’s some concern that parents who accept shorter schedules could look bad or end up damaging their careers in the long run. Statistically speaking, some of the news is good: most people view parents who reduce their hours either positively or neutrally, not negatively. But other surveys indicate that a reduction in work hours often equates to a reduction in responsibility, which could indeed have long-term effects. That’s why it’s important for more parents to use these accommodations freely. Not only will doing so directly benefit the children, but it will also lessen any negative stigma associated with claiming them. This is particularly true for fathers, who can help even the playing field for their female colleagues by using these perks just as much as the mothers in their offices. And since the state is now offering a stipend to help compensate for lost income, there’s less and less reason not to take full advantage of these programs.

8 hours ago 2 votes
Big endian and little endian

Every time I run into endianness, I have to look it up. Which way do the bytes go, and what does that mean? Something about it breaks my brain, and makes me feel like I can't tell which way is up and down, left and right. This is the blog post I've needed every time I run into this. I hope it'll be the post you need, too. What is endianness? The term comes from Gulliver's Travels, referring to a conflict over cracking boiled eggs on the big end or the little end[1]. In computers, the term refers to the order of bytes within a segment of data, or a word. Specifically, it only refers to the order of bytes, as those are the smallest unit of addressable data: bits are not individually addressable. The two main orderings are big-endian and little-endian. Big-endian means you store the "big" end first: the most-significant byte (highest value) goes into the smallest memory address. Little-endian means you store the "little" end first: the least-significant byte (smallest value) goes into the smallest memory address. Let's look at the number 168496141 as an example. This is 0x0A0B0C0D in hex. If we store 0x0A at address a, 0x0B at a+1, 0x0C at a+2, and 0x0D at a+3, then this is big-endian. And then if we store it in the other order, with 0x0D at a and 0x0A at a+3, it's little-endian. And... there's also mixed-endianness, where you use one kind within a word (say, little-endian) and a different ordering for words themselves (say, big-endian). If our example is on a system that has 2-byte words (for the sake of illustration), then we could order these bytes in a mixed-endian fashion. One possibility would be to put 0x0B in a, 0x0A in a+1, 0x0D in a+2, and 0x0C in a+3. There are certainly reasons to do this, and it comes up on some ARM processors, but... it feels so utterly cursed. Let's ignore it for the rest of this! For me, the intuitive ordering is big-endian, because it feels like it matches how we read and write numbers in English[2]. 
If lower memory addresses are on the left, and higher on the right, then this is the left-to-right ordering, just like digits in a written number. So... which do I have? Given some number, how do I know which endianness it uses? You don't, at least not from the number entirely by itself. Each integer that's valid in one endianness is still a valid integer in another endianness; it's just a different value. You have to see how things are used to figure it out. Or you can figure it out from the system you're using (or which wrote the data). If you're using an x86 or x64 system, it's mostly little-endian. (There are some instructions which enable fetching/writing in a big-endian format.) ARM systems are bi-endian, allowing either. But perhaps the most popular ARM chips today, Apple silicon, are little-endian. And the major microcontrollers I checked (AVR, ESP32, ATmega) are little-endian. It's thoroughly dominant commercially! Big-endian systems used to be more common. They're not really in most of the systems I'm likely to run into as a software engineer now, though. You are likely to run into it for some things, though. Even though we don't use big-endianness for processor math most of the time, we use it constantly to represent data. It comes back in networking! Most of the Internet protocols we know and love, like TCP and IP, use "network order" which means big-endian. This is mentioned in RFC 1700, among others. Other protocols do also use little-endianness again, though, so you can't always assume that it's big-endian just because it's coming over the wire. So... which do you have? For your processor, probably little-endian. For data written to the disk or to the wire: who knows, check the protocol! Why do we do this??? I mean, ultimately, it's somewhat arbitrary. We have an endianness in the way we write, and we could pick either right-to-left or left-to-right. Both exist, but we need to pick one. 
Given that, it makes sense that both would arise over time, since there's no single entity controlling all computer usage[3]. There are advantages of each, though. One of the more interesting advantages is that little-endianness lets us pretend integers are whatever size we like, within bounds. If you write the number 26[4] into memory on a big-endian system, then read bytes from that memory address, it will represent different values depending on how many bytes you read. The length matters for reading in and interpreting the data. If you write it into memory on a little-endian system, though, and read bytes from the address (with the remaining ones zero, very important!), then it is the same value no matter how many bytes you read. As long as you don't truncate the value, at least; 0x0A0B read as an 8-bit int would not be equal to being read as a 16-bit int, since an 8-bit int can't hold the entire thing. This lets you read a value in the size of integer you need for your calculation without conversion. On the other hand, big-endian values are easier to read and reason about as a human. If you dump out the raw bytes that you're working with, a big-endian number can be easier to spot since it matches the numbers we use in English. This makes it pretty convenient to store values as big-endian, even if that's not the native format, so you can spot things in a hex dump more easily. Ultimately, it's all kind of arbitrary. And it's a pile of standards where everything is made up, nothing matters, and the big-end is obviously the right end of the egg to crack. You monster. The correct answer is obviously the big end. That's where the little air pocket goes. But some people are monsters... ↩ Please, please, someone make a conlang that uses mixed-endian inspired numbers. ↩ If ever there were, maybe different endianness would be a contentious issue. 
Maybe some of our systems would be using big-endian but eventually realize their design was better suited to little-endian, and then spend a long time making that change. And then the government would become authoritarian on the promise of eradicating endianness-affirming care and—Oops, this became a metaphor. ↩ 26 in hex is 0x1A, which is purely a coincidence and not a reference to the First Amendment. This is a tech blog, not political, and I definitely stay in my lane. If it were a reference, though, I'd remind you to exercise your 1A rights[5] now and call your elected officials to ensure that we keep these rights. I'm scared, and I'm staring down the barrel of potential life-threatening circumstances if things get worse. I expect you're scared, too. And you know what? Bravery is doing things in spite of your fear. ↩ If you live somewhere other than the US, please interpret this as it applies to your own country's political process! There's a lot of authoritarian movement going on in the world, and we all need to work together for humanity's best, most free[6] future. ↩ I originally wrote "freest" which, while spelled correctly, looks so weird that I decided to replace it with "most free" instead. ↩
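To make the byte orders in this post concrete, here is a small Python sketch. It only uses the standard `int.to_bytes`, `int.from_bytes`, and `struct.pack`; the variable names are made up for illustration, and the 8-byte buffer for the number 26 is an assumption (the post doesn't say how wide the write is).

```python
import struct

value = 0x0A0B0C0D  # 168496141, the example number from the post

# Index 0 of each bytes object plays the role of the lowest address "a".
big = value.to_bytes(4, "big")        # 0A 0B 0C 0D
little = value.to_bytes(4, "little")  # 0D 0C 0B 0A

# The mixed-endian layout described above: 2-byte words kept in big-endian
# word order, with the bytes inside each word stored little-endian.
words = [0x0A0B, 0x0C0D]
mixed = b"".join(w.to_bytes(2, "little") for w in words)  # 0B 0A 0D 0C

# "Network order" is big-endian; struct's "!" prefix selects it.
network = struct.pack("!I", value)

# Write 26 into a zeroed 8-byte buffer, then read back the first n bytes
# as an integer of that width.
buf_little = (26).to_bytes(8, "little")
buf_big = (26).to_bytes(8, "big")
little_reads = [int.from_bytes(buf_little[:n], "little") for n in (1, 2, 4, 8)]
big_reads = [int.from_bytes(buf_big[:n], "big") for n in (1, 2, 4, 8)]

print("big:   ", big.hex())
print("little:", little.hex())
print("mixed: ", mixed.hex())
print("little-endian reads of 26:", little_reads)
print("big-endian reads of 26:   ", big_reads)
```

The little-endian reads all come back as 26, while the big-endian reads only produce 26 once the full 8 bytes are read. That's the "pretend integers are whatever size we like" property, and the hex strings show why the big-endian dump is the one that's easy to spot by eye.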

12 hours ago 1 votes
The Tragic Case of Intel AI

Intel is sitting on a huge amount of card inventory they can’t move, largely because of bad software. Most of this is a summary of the public #intel-hardware channel in the tinygrad discord. Intel currently is sitting on: 15,000 Gaudi 2 cards (with baseboards) 5,100 Intel Data Center GPU Max 1450s (without baseboards) If you were Intel, what would you do with them? First, starting with the Gaudi cards. The open source repo needed to control them was archived on Feb 4, 2025. There’s a closed source version of this that’s maybe still maintained, but eww closed source and do you think it’s really maintained? The architecture is kind of tragic, and that’s likely why they didn’t open source it. Unlike every other accelerator I have seen, the MMEs, which is where all the FLOPS are, are not controllable by the TPCs. While the TPCs have an LLVM port, the MME is not documented. After some poking around, I found the spec: It’s highly fixed function, looks very similar to the Apple ANE. But that’s not even the real problem with it. The problem is that it is controlled by queues, not by the TPCs. Unpacking habanalabs-dkms-1.19.2-32.all.deb you can find the queues. There is some way to push a command stream to the device so you don’t actually have to deal with the host itself for the queues. But that doesn’t prevent you having to decompose the network you are trying to run into something you can put on this fixed function block. Programmability is on a spectrum, ranging from CPUs being the easiest, to GPUs, to things like the Qualcomm DSP / Google TPU (where at least you drive the MME from the program), to this and the Apple ANE being the hardest. While it’s impressive that they actually got on MLPerf Training v4.0 training GPT3, I suspect it’s all hand coded, and if you even can deviate off the trodden path you’ll get almost no perf. 
Accelerators like this are okay for low power inference where you can adjust the model architecture for the target, Apple does a great job of this. But this will never be acceptable for a training chip. Then there’s the Data Center GPU Max 1450. Intel actually sent us a few of these. You quickly run into a problem…how do you plug them in? They need OAM sockets, 48V power, and a cooling solution that can sink 600W. As far as I can tell, they were only ever deployed in two systems, the Aurora Supercomputer and the Dell XE9640. It’s hard to know, but I really doubt many of these Dell systems were sold. Intel then sent us this carrier board. In some ways it’s helpful, but in other ways it’s not at all. It still doesn’t solve cooling or power, and you need to buy 16x MCIO cables (cheap in quantity, but expensive and hard to find off the shelf). Also, I never got a straight answer, but I really doubt Intel has many of these boards. And that board doesn’t look cheap to manufacture more of. The connectors alone, which you need two of per GPU, cost $26 each. That’s $104 for just the OAM connectors. tiny corp was in discussions to buy these GPUs. How much would you pay for one of these on a PCIe card? The specs look great. 839 TFLOPS, 128 GB of ram, 3.3 TB/s of bandwidth. However…read this article. Even in simple synthetic benchmarks, the chip doesn’t get anywhere near its max performance, and it looks to be for fundamental reasons like memory latency. We estimate we could sell PCIe versions of these GPUs for $1,000; I don’t think most people know how hard it is to move non NVIDIA hardware. Before you say you’d pay more, ask yourself, do you really want to deal with the software? An adapter card has four pieces. A PCB for the card, a 12->48V voltage converter, a heatsink, and a fan. My quote from the guy who makes an OAM adapter board was $310 for 10+ PCBs and $75 for the voltage converter. 
A heatsink that can handle 600W (heat pipes + vapor chamber) is going to cost $100, then maybe $20 more for the fan. That’s $505, and you still need to assemble and test them, oh and now there’s tariffs. Maybe you can get this down to $400 in ~1000 quantity. So $200 for the GPU, $400 for the adapter, $100 for shipping/fulfillment/returns (more if you use Amazon), and 30% profit if you sell at $1k. tiny would net $1M on this, which has to cover NRE and you have risk of unsold inventory. We offered Intel $200 per GPU (a $680k wire) and they said no. They wanted $600. I suspect that unless a supercomputer person who already uses these GPUs wants to buy more, they will ride it to zero. tl;dr: there’s 5100 of these GPUs with no simple way to plug them in. It’s unclear if they’re worth the cost of the slot they go in. I bet they end up shredded, or maybe dumped on eBay for $50 each in a year like the Xeon Phi cards. If you buy one, good luck plugging it in! The reason Meta and friends buy some AMD is as a hedge against NVIDIA. Even if it’s not usable, AMD has progressed on a solid steady roadmap, with a clear continuation from the 2018 MI50 (which you can now buy for 99% off), to the MI325X which is a super exciting chip (AMD is king of chiplets). They are even showing signs of finally investing in software, which makes me bullish. If NVIDIA stumbles for a generation, this is AMD’s game. The ROCm “copy each NVIDIA repo” strategy actually works if your competition stumbles. They can win GPUs with slow and steady improvement + competition stumbling, that’s how AMD won server CPUs. With these Intel chips, I’m not sure who they would appeal to. Ponte Vecchio is cancelled. There’s no point in investing in the platform if there’s not going to be a next generation, and therefore nobody can justify the cost of developing software, therefore there won’t be software, therefore they aren’t worth plugging in. Where does this leave Intel’s AI roadmap? 
The successor to Ponte Vecchio was Rialto Bridge, but that was cancelled. The successor to that was Falcon Shores, but that was also cancelled. Intel claims the next GPU will be “Jaguar Shores”, but fool me once… To quote JazzLord1234 from reddit “No point even bothering to listen to their roadmaps anymore. They have squandered all their credibility.” Gaudi 3 is a flop due to “unbaked software”, but as much as I usually do blame software, nothing has changed from Gaudi 2 and it’s just a really hard chip to program for. So there’s no future there either. I can’t say that “Jaguar Shores” square instills confidence. It didn’t inspire confidence for “Joseph B.” on LinkedIn either. From my interactions with Intel people, it seems there’s no individuals with power there, it’s all committee like leadership. The problem with this is there’s nobody who can say yes, just many people who can say no. Hence all the cancellations and the nonsense strategy. AMD’s dysfunction is different. From the beginning they had leadership that can do things (Lisa Su replied to my first e-mail), they just didn’t see the value in investing in software until recently. They sort of had a point if they were only targeting hyperscalers, but it seems like SemiAnalysis got through to them that hyperscalers aren’t going to deal with bad software either. It remains to be seen if they can shift culture to actually deliver good software, but there’s movement in that direction, and if they succeed AMD is so undervalued. Their hardware is good. With Intel, until that committee style leadership is gone, there’s 0 chance for success. Committee leadership is fine if you are trying to maintain, but Intel’s AI situation is even more hopeless than AMD’s, and you’d need something major to turn it around. At least with AMD, you can try installing ROCm and be frustrated when there are bugs. 
Every time I have tried Intel’s software I can’t even recall getting the import to work, and the card wasn’t powerful enough that I cared. Intel needs actual leadership to turn this around, or there’s 0 future in Intel AI.
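The adapter-card economics from earlier in the post can be laid out as a quick sanity check. This is a sketch using only the figures quoted above; the variable names are made up, and the unit count is an inference from the $680k wire at $200 per GPU, not a number the post states directly.

```python
# Per-card adapter parts at the quoted small-batch prices (USD).
adapter_parts = {"pcb": 310, "voltage_converter": 75, "heatsink": 100, "fan": 20}
adapter_cost_small_batch = sum(adapter_parts.values())

# Estimated per-unit costs at ~1000 quantity, as given in the post.
gpu_price = 200       # the offer made to Intel
adapter_cost = 400    # hoped-for adapter cost at volume
fulfillment = 100     # shipping/fulfillment/returns
sale_price = 1000

unit_cost = gpu_price + adapter_cost + fulfillment
unit_profit = sale_price - unit_cost
margin = unit_profit / sale_price

# Assumption: the $680k wire at $200 per GPU implies this many units.
wire_total = 680_000
units = wire_total // gpu_price
total_net = units * unit_profit

print(f"adapter parts (small batch): ${adapter_cost_small_batch}")
print(f"unit cost ${unit_cost}, profit ${unit_profit} ({margin:.0%} of sale price)")
print(f"{units} units -> ${total_net:,} net before NRE and unsold inventory")
```

Swapping `gpu_price` to the $600 Intel asked for, with the other figures unchanged, makes `unit_profit` negative, which is one way to read why the deal didn't happen.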

20 hours ago 1 votes
All pretty models are wrong, but some ugly models are useful

Identifying useful frameworks for companies, strategy, markets, and organizations, instead of those that just look pretty in PowerPoint.

yesterday 3 votes
Self-avoiding Walk

I’m a bit late to this, but back in summer 2024 I participated in the OST Composing Jam. The goal of this jam is to compose an original soundtrack (minimum of 3 minutes) of any style for an imaginary game. While I’ve composed a lot of video game music, I’ve never created an entire soundtrack around a single concept. Self Avoiding Walk by Daniel Marino To be honest, I wasn’t entirely sure where to start. I was torn between trying to come up with a story for a game to inspire the music, and just messing around with some synths and noodling on the keyboard. I did a little bit of both, but nothing really materialized. Synth + Metal ≈ Synthmetal Feeling a bit paralyzed, I fired up the ’ole RMG sequencer for inspiration. I saved a handful of randomized melodies and experimented with them in Reaper. After a day or two I landed on something I liked which was about the first 30 seconds or so of the second track: "Defrag." I love metal bands like Tesseract, Periphery, The Algorithm, Car Bomb, and Meshuggah. I tried experimenting with incorporating syncopated guttural guitar sounds with the synths. After several more days I finished "Defrag"—which also included "Kernel Panic" before splitting that into its own track. I didn’t have a clue what to do next, nor did I have a concept. Composing the rest of the music was a bit of a blur because I bounced around from song to song—iterating on the leitmotif over and over with different synths, envelopes, time signatures, rhythmic displacement, pitch shifting, and tweaking underlying chord structures. Production The guitars were recorded using DI with my Fender Squier and Behringer Interface. I’m primarily using the ML Sound Labs Amped Roots Free amp sim because the metal presets are fantastic and rarely need much fuss to get it sounding good. I also used Blue Cat Audio free amp sim for clean guitars. All the other instruments were MIDI tracks either programmed via piano roll or recorded with my Arturia MiniLab MKII. 
I used a variety of synth effects from my library of VSTs. I recorded this music before acquiring my Fender Squier Bass guitar, so bass was also programmed. Theme and Story At some point I had five songs that all sounded like they could be from the same game. The theme for this particular jam was "Inside my world." I had to figure out how I could write a story that corresponded with the theme and could align with the songs. I somehow landed on the idea of the main actor realizing his addiction to AI, embarking on a journey to "unplug." The music reflects his path to recovery, capturing the emotional and psychological evolution as he seeks to overcome his dependency. After figuring this out, I thought it would be cool to name all the songs using computer terms that could be metaphors for the different stages of recovery. Track listing Worm – In this dark and haunting opening track, the actor grapples with his addiction to AI, realizing he can no longer think independently. Defrag – This energetic track captures the physical and emotional struggles of the early stages of recovery. Kernel Panic – Menacing and eerie, this track portrays the actor’s anxiety and panic attacks as he teeters on the brink during the initial phases of recovery. Dæmons – With initial healing achieved, the real challenge begins. The ominous and chaotic melodies reflect the emotional turbulence the character endures. Time to Live – The actor, having come to terms with himself, experiences emotional growth. The heroic climax symbolizes the realization that recovery is a lifelong journey. Album art At the time I was messing around with Self-avoiding walks in generative artwork explorations. I felt like the whole concept of avoiding the self within the context of addiction and recovery metaphorically worked. So I tweaked some algorithms and generated the self-avoiding walk using JavaScript and the P5.js library. 
I then layered the self-avoiding walk over a photo I found visually interesting on Unsplash using a CSS blend mode. Jam results I placed around the top 50% out of over 600 entries. I would have liked to have placed higher, but despite my ranking, I thoroughly enjoyed composing the music! I’m very happy with the music, its production quality, and I also learned a lot. I would certainly participate in this style of composition jam again!

yesterday 3 votes