More from ntietz.com blog
Software licenses are a reflection of our values. How you choose to license a piece of software says a lot about what you want to achieve with it. Do you want to reach the maximum amount of users? Do you want to ensure future versions remain free and open source? Do you want to preserve your opportunity to make a profit? They can also be used to reflect other values. For example, there is the infamous JSON license written by Doug Crockford. It's essentially the MIT license with this additional clause: The Software shall be used for Good, not Evil. This has caused quite some consternation. It is a legally dubious addition, because "Good" and "Evil" are not defined here. Many people disagree on what these are. This is really not enforceable, and it's going to make many corporate lawyers wary of using software under this license1. I don't think that enforcing this clause was the point. The point is more signaling values and just having fun with it. I don't think anyone seriously believes that this license will be enforceable, or that it will truly curb the amount of evil in the world. But will it start conversations? * * * There are a lot of other small, playful licenses. None of these are going to change the world, but they inject a little joy and play into an area of software that is usually serious and somber. When I had to pick a license for my exceptional language (Hurl), I went down that serious spiral at first. What license will give the project the best adoption, or help it achieve its goals? What are its goals? Well, one its goals was definitely to be funny. Another was to make sure that people can use the software for educational purposes. If I make a language as a joke, I do want people to be able to learn from it and do their own related projects! This is where we enter one of the sheerly joyous parts of licensing: the ability to apply multiple licenses to software so that the user can decide which one to use the software under. You see a lot of Rust projects dual-licensed under Apache and MIT licenses, because the core language is dual-licensed for very good reasons. We can apply similar rationale to Hurl's license, and we end up with triple-licensing. It's currently available under three licenses, each for a separate purpose. Licensing it under the AGPL enables users to create derivative works for their own purposes (probably to learn) as long as it remains licensed the same way. And then we have a commercial license option, which is there so that if you want to commercialize it, I can get a cut of that2. The final option is to license it under the Gay Agenda License, which was created originally for this project. This option basically requires you to not be a bigot, and then you can use the software under the MIT license terms. It seems fair to me. When I got through that license slide at SIGBOVIK 2024, I knew that the mission was accomplished: bigotry was defeated the audience laughed. * * * The Gay Agenda License is a modified MIT license which requires you do a few things: You must provide attribution (typical MIT manner) You have to stand up for LGBTQ rights You have to say "be gay, do crime" during use of the software Oh, and if you support restricting LGBTQ rights, then you lose that license right away. No bigots allowed here. This is all, of course, written in more complete sentences in the license itself. The best thing is that you can use this license today! There is a website for the Gay Agenda License, the very fitting gal.gay3. The website has all the features you'd expect, like showing the license text, using appropriate flags, and copying the text to the clipboard for ease of putting this in your project. Frequently Anticipated Questions Inspired by Hannah's brilliant post's FAQ, here are answers to your questions that you must have by now. Is this enforceable? We don't really know until it's tested in court, but if that happens, everyone has already lost. So, who knows, I hope we don't find out! Isn't it somewhat ambiguous? What defines what is standing up for LGBTQ rights? Ah, yes, good catch. This is a big problem for this totally serious license. Definitely a problem. Can I use it in my project? Yeah! Let me know if you do so I can add it into a showcase on the website. But keep in mind, this is a joke totally serious license, so only use it on silly things highly serious commercial projects! How do I get a commercial license of Hurl? This is supposed to be about the Gay Agenda License, not Hurl. But since you asked, contact me for pricing. When exactly do I have to say "be gay, do crime"? To be safe, it's probably best that you mutter it continuously while using all software. You never know when it's going to be licensed under the Gay Agenda License, so repeat the mantra to ensure compliance. Thank you to Anya for the feedback on a draft of this post. Thank you to Chris for building the first version of gal.gay for me. 1 Not for nothing, because most of those corporations would probably be using the software for evil. So, mission accomplished, I guess? 2 For some reason, no one has contacted me for this option yet. I suspect widespread theft of my software, since surely people want to use Hurl. They're not using the third option, since we still see rampant transphobia. 3 This is my most expensive domain yet at $130 for the first year. I'm hoping that the price doesn't raise dramatically over time, but I'm not optimistic, since it's a three-letter domain. That said, anything short of extortion will likely be worth keeping for the wonderful email addresses I get out of this, being a gay gal myself. It's easier to spell on the phone than this domain is, anyway.
Asheville is in crisis right now. They're without drinking water, faucets run dry, and it's difficult to flush toilets. As of yesterday, the hospital has water (via tanker trucks), but 80% of the public water system is still without running water. Things are really bad. Lots of infrastructure has been washed away. Even when water is back, there has been tremendous damage done that will take a long time to recover from and rebuild. * * * Here's the only national news story my friend from Asheville had seen which covered the water situation specifically. It's hard for me to understand why this is not covered more broadly. And my heart aches for those in and around the Asheville area. As I'm far away, I can't do a lot to help. But I can donate money, which my friend said is the only donation that would help right now if you aren't in the area. She specifically pointed me to these two ways to donate: Beloved Asheville: a respected community organization in Asheville, this is a great place to send money to help. (If you're closer to that area, it does look like they have specific things they're asking for as well, but this feels like an "if you can help this way, you'd already know" situation.) Mutual Aid Disaster Relief: there's a local Asheville chapter which is doing work to help. Also an organization to support for broad disaster recovery in general. I've donated money. I hope you will, too, for this and for the many other crises that affect us. Let's help each other.
teleportation does exist from OR to recovery room I left something behind not quite a part of myself —unwelcome guests poisoning me from the inside no longer welcome
I like to make silly things, and I also like to put in minimal effort for those silly things. I also like to make things in Rust, mostly for the web, and this is where we run into a problem. See, if I want to make something for the web, I could use Django but I don't want that. I mean, Django is for building serious businesses, not for building silly non-commercial things! But using Rust, we have to do a lot more work than if we build it with Django or friends. See, so far, there's no equivalent, and the Rust community leans heavily into the "wire it up yourself" approach. As Are We Web Yet? says, "[...] you generally have to wire everything up yourself. Expect to put in a little bit of extra set up work to get started." This undersells it, though. It's more than a little bit of extra work to get started! I know because I made a list of things to do to get started. Rust needs something that does bundle this up for you, so that we can serve all web developers. Having it would make it a lot easier to make the case to use Rust. The benefits are there: you get wonderful type system, wonderful performance, and build times that give you back those coffee breaks you used to get while your code compiled. What do we need? There is a big pile of stuff that nearly every web app needs, no matter if it's big or small. Here's a rough list of what seems pretty necessary to me: Routing/handlers: this is pretty obvious, but we have to be able to get an incoming request to some handler for it. Additionally, this routing needs to handle path parameters, ideally with type information, and we'll give bonus points for query parameters, forms, etc. Templates: we'll need to generate, you know, HTML (and sometimes other content, like JSON or, if you're in the bad times still, XML). Usually I want these to have basic logic, like conditionals, match/switch, and loops. Static file serving: we'll need to serve some assets, like CSS files. This can be done separately, but having it as part of the same web server is extremely handy for both local development and for small-time deployments that won't handle much traffic. Logins: You almost always need some way to log in, since apps are usually multi-user or deployed on a public network. This is just annoying to wire up every time! It should be customizable and something you can opt out of, but it should be trivial to have logins from the start. Permissions: You also need this for systems that have multiple users, since people will have different data they're allowed to access or different roles in the system. Permissions can be complicated but you can make something relatively simple that follows the check(user, object, action) pattern and get really far with it. Database interface: You're probably going to have to store data for your app, so you want a way to do that. Something that's ORM-like is often nice, but something light is fine. Whatever you do here isn't the only way to interact with the database, but it'll be used for things like logins, permissions, and admin tools, so it's going to be a fundamental piece. Admin tooling: This is arguably a quality-of-life issue, not a necessity, except that every time you setup your application in a local environment or in production you're going to have to bootstrap it with at least one user or some data. And you'll have to do admin actions sometimes! So I think having this built-in for at least some of the common actions is a necessity for a seamless experience. WebSockets: I use WebSockets in a lot of my projects. They just let you do really fun things with pushing data out to connected users in a more real-time fashion! Hot reloading: This is a huge one for developer experience, because you want to have the ability to see changes really quickly. When code or a template change, you need to see that reflected as soon as humanly possible (or as soon as the Rust compiler allows). Then we have a pile of things that are quality-of-life improvements, and I think are necessary for long-term projects but might not be as necessary upfront, so users are less annoyed at implementing it themselves because the cost is spread out. Background tasks: There needs to be a story for these! You're going to have features that have to happen on a schedule, and having a consistent way to do that is a big benefit and makes development easier. Monitoring/observability: Only the smallest, least-critical systems should skip this. It's really important to have and it will make your life so much easier when you have it in that moment that you desperately need it. Caching: There are a lot of ways to do this, and all of them make things more complicated and maybe faster? So this is nice to have a story for, but users can also handle it themselves. Emails and other notifications: It's neat to be able to have password resets and things built-in, and this is probably a necessity if you're going to have logins, so you can have password resets. But other than that feature, it feels like it won't get used that much and isn't a big deal to add in when you need it. Deployment tooling: Some consistent way to deploy somewhere is really nice, even if it's just an autogenerated Dockerfile that you can use with a host of choice. CSS/JS bundling: In the time it is, we use JS and CSS everywhere, so you probably want a web tool to be aware of them so they can be included seamlessly. But does it really have to be integrated in? Probably not... So those are the things I'd target in a framework if I were building one! I might be doing that... The existing ecosystem There's quite a bit out there already for building web things in Rust. None of them quite hit what I want, which is intentional on their part: none of them aspire to be what I'm looking for here. I love what exists, and I think we're sorely missing what I want here (I don't think I'm alone). Web frameworks There are really two main groups of web frameworks/libraries right now: the minimalist ones, and the single-page app ones. The minimalist ones are reminiscent of Flask, Sinatra, and other small web frameworks. These include the excellent actix-web and axum, as well as myriad others. There are so many of these, and they all bring a nice flavor to web development by leveraging Rust's type system! But they don't give you much besides handlers; none of the extra functionality we want in a full for-lazy-developers framework. Then there are the single-page app frameworks. These fill a niche where you can build things with Rust on the backend and frontend, using WebAssembly for the frontend rendering. These tend to be less mature, but good examples include Dioxus, Leptos, and Yew. I used Yew to build a digital vigil last year, and it was enjoyable but I'm not sure I'd want to do it in a "real" production setting. Each of these is excellent for what it is—but what it is requires a lot of wiring up still. Most of my projects would work well with the minimalist frameworks, but those require so much wiring up! So it ends up being a chore to set that up each time that I want to do something. Piles of libraries! The rest of the ecosystem is piles of libraries. There are lots of template libraries! There are some libraries for logins, and for permissions. There are WebSocket libraries! Often you'll find some projects and examples which integrate a couple of the things you're using, but you won't find something that integrates all the pieces you're using. I've run into some of the examples being out of date, which is to be expected in a fast-moving ecosystem. The pile of libraries leaves a lot of friction, though. It makes getting started, the "just wiring it up" part, very difficult and often an exercise in researching how things work, to understand them in depth enough to do the integration. What I've done before The way I've handled this before is basically to pick a base framework (typically actix-web or axum) and then search out all the pieces I want on top of it. Then I'd wire them up, either all at the beginning or as I need them. There are starter templates that could help me avoid some of this pain. They can definitely help you skip some of the initial pain, but you still get all the maintenance burden. You have to make sure your libraries stay up to date, even when there are breaking changes. And you will drift from the template, so it's not really feasible to merge changes to it into your project. For the projects I'm working on, this means that instead of keeping one framework up to date, I have to keep n bespoke frameworks up to date across all my projects! Eep! I'd much rather have a single web framework that handles it all, with clean upgrade instructions between versions. There will be breaking changes sometimes, but this way they can be documented instead of coming about due to changes in the interactions between two components which don't even know they're going to be integrated together. Imagining the future I want In an ideal world, there would be a framework for Rust that gives me all the features I listed above. And it would also come with excellent documentation, changelogs, thoughtful versioning and handling of breaking changes, and maybe even a great community. All the things I love about Django, could we have those for a Rust web framework so that we can reap the benefits of Rust without having to go needlessly slowly? This doesn't exist right now, and I'm not sure if anyone else is working on it. All paths seem to lead me toward "whoops I guess I'm building a web framework." I hope someone else builds one, too, so we can have multiple options. To be honest, "web framework" sounds way too grandiose for what I'm doing, which is simply wiring things together in an opinionated way, using (mostly) existing building blocks1. Instead of calling it a framework, I'm thinking of it as a web toolkit: a bundle of tools tastefully chosen and arranged to make the artisan highly effective. My toolkit is called nicole's web toolkit, or newt. It's available in a public repository, but it's really not usable (the latest changes aren't even pushed yet). It's not even usable for me yet—this isn't a launch post, more shipping my design doc (and hoping someone will do my work for me so I don't have to finish newt :D). The goal for newt is to be able to create a new small web app and start on the actual project in minutes instead of days, bypassing the entire process of wiring things up. I think the list of must-haves and quality-of-life features above will be a start, but by no means everything we need. I'm not ready to accept contributions, but I hope to be there at some point. I think that Rust really needs this, and the whole ecosystem will benefit from it. A healthy ecosystem will have multiple such toolkits, and I hope to see others develop as well. * * * If you want to follow along with mine, though, feel free to subscribe to my RSS feed or newsletter, or follow me on Mastodon. I'll try to let people know in all those places when the toolkit is ready for people to try out. Or I'll do a post-mortem on it, if it ends up that I don't get far with it! Either way, this will be fun. 1 I do plan to build a few pieces from scratch for this, as the need arises. Some things will be easier that way, or fit more cohesively. Can't I have a little greenfield, as a treat?
The first time I went on call as a software engineer, it was exciting—and ultimately traumatic. Since then, I've had on-call experiences at multiple other jobs and have grown to really appreciate it as part of the role. As I've progressed through my career, I've gotten to help establish on-call processes and run some related trainings. Here is some of what I wish I'd known when I started my first on-call shift, and what I try to tell each engineer before theirs. Heroism isn't your job, triage is It's natural to feel a lot of pressure with on-call responsibilities. You have a production application that real people need to use! When that pager goes off, you want to go in and fix the problem yourself. That's the job, right? But it's not. It's not your job to fix every issue by yourself. It is your job to see that issues get addressed. The difference can be subtle, but important. When you get that page, your job is to assess what's going on. A few questions I like to ask are: What systems are affected? How badly are they impacted? Does this affect users? With answers to those questions, you can figure out what a good course of action is. For simple things, you might just fix it yourself! If it's a big outage, you're putting on your incident commander hat and paging other engineers to help out. And if it's a false alarm, then you're putting in a fix for the noisy alert! (You're going to fix it, not just ignore that, right?) Just remember not to be a hero. You don't need to fix it alone, you just need to figure out what's going on and get a plan. Call for backup Related to the previous one, you aren't going this alone. Your main job in holding the pager is to assess and make sure things get addressed. Sometimes you can do that alone, but often you can't! Don't be afraid to call for backup. People want to be helpful to their teammates, and they want that support available to them, too. And it's better to be wake me up a little too much than to let me sleep through times when I was truly needed. If people are getting woken up a lot, the issue isn't calling for backup, it's that you're having too many true emergencies. It's best to figure out that you need backup early, like 10 minutes in, to limit the damage of the incident. The faster you figure out other people are needed, the faster you can get the situation under control. Communicate a lot In any incident, adrenaline runs and people are stressed out. The key to good incident response is communication in spite of the adrenaline. Communicating under pressure is a skill, and it's one you can learn. Here are a few of the times and ways of communicating that I think are critical: When you get on and respond to an alert, say that you're there and that you're assessing the situation Once you've assessed it, post an update; if the assessment is taking a while, post updates every 15 minutes while you do so (and call for backup) After the situation is being handled, update key stakeholders at least every 30 minutes for the first few hours, and then after that slow down to hourly You are also going to have to communicate within the response team! There might be a dedicated incident channel or one for each incident. Either way, try to over communicate about what you're working on and what you've learned. Keep detailed notes, with timestamps When you're debugging weird production stuff at 3am, that's the time you really need to externalize your memory and thought processes into a notes document. This helps you keep track of what you're doing, so you know which experiments you've run and which things you've ruled out as possibilities or determined as contributing factors. It also helps when someone else comes up to speed! That person will be able to use your notes to figure out what has happened, instead of you having to repeat it every time someone gets on. Plus, the notes doc won't forget things, but you will. You will also need these notes later to do a post-mortem. What was tried, what was found, and how it was fixed are all crucial for the discussion. Timestamps are critical also for understanding the timeline of the incident and the response! This document should be in a shared place, since people will use it when they join the response. It doesn't need to be shared outside of the engineering organization, though, and likely should not be. It may contain details that lead to more questions than they answer; sometimes, normal engineering things can seem really scary to external stakeholders! You will learn a lot! When you're on call, you get to see things break in weird and unexpected ways. And you get to see how other people handle those things! Both of these are great ways to learn a lot. You'll also just get exposure to things you're not used to seeing. Some of this will be areas that you don't usually work in, like ops if you're a developer, or application code if you're on the ops side. Some more of it will be business side things for the impact of incidents. And some will be about the psychology of humans, as you see the logs of a user clicking a button fifteen hundred times (get that person an esports sponsorship, geez). My time on call has led to a lot of my professional growth as a software engineer. It has dramatically changed how I worked on systems. I don't want to wake up at 3am to fix my bad code, and I don't want it to wake you up, either. Having to respond to pages and fix things will teach you all the ways they can break, so you'll write more resilient software that doesn't break. And it will teach you a lot about the structure of your engineering team, good or bad, in how it's structured and who's responding to which things. Learn by shadowing No one is born skilled at handling production alerts. You gain these skills by doing, so get out there and do it—but first, watch someone else do it. No matter how much experience you have writing code (or responding to incidents), you'll learn a lot by watching a skilled coworker handle incoming issues. Before you're the primary for an on-call shift, you should shadow someone for theirs. This will let you see how they handle things and what the general vibe is. This isn't easy to do! It means that they'll have to make sure to loop you in even when blood is pumping, so you may have to remind them periodically. You'll probably miss out on some things, but you'll see a lot, too. Some things can (and should) wait for Monday morning When we get paged, it usually feels like a crisis. If not to us, it sure does to the person who's clicking that button in frustration, generating a ton of errors, and somehow causing my pager to go off. But not all alerts are created equal. If you assess something and figure out that it's only affecting one or two customers in something that's not time sensitive, and it's currently 4am on a Saturday? Let people know your assessment (and how to reach you if you're wrong, which you could be) and go back to bed. Real critical incidents have to be fixed right away, but some things really need to wait. You want to let them go until later for two reasons. First is just the quality of the fix. You're going to fix things more completely if you're rested when you're doing so! Second, and more important, is your health. It's wrong to sacrifice your health (by being up at 4am fixing things) for something non-critical. Don't sacrifice your health Many of us have had bad on-call experiences. I sure have. One regret is that I didn't quit that on-call experience sooner. I don't even necessarily mean quitting the job, but pushing back on it. If I'd stood up for myself and said "hey, we have five engineers, it should be more than just me on call," and held firm, maybe I'd have gotten that! Or maybe I'd have gotten a new job. What I wouldn't have gotten is the knowledge that you can develop a rash from being too stressed. If you're in a bad on-call situation, please try to get out of it! And if you can't get out of it, try to be kind to yourself and protect yourself however you can (you deserve better). Be methodical and reproduce before you fix Along with taking great notes, you should make sure that you test hypotheses. What could be causing this issue? And before that, what even is the problem? And how do we make it happen? Write down your answers to these! Then go ahead and try to reproduce the issue. After reproducing it, you can try to go through your hypotheses and test them out to see what's actually contributing to the issue. This way, you can bisect problem spaces instead of just eliminating one thing at a time. And since you know how to reproduce the issue now, you can be confident that you do have a fix at the end of it all! Have fun Above all, the thing I want people new to on-call to do? Just have fun. I know this might sound odd, because being on call is a big job responsibility! But I really do think it can be fun. There's a certain kind of joy in going through the on-call response together. And there's a fun exhilaration to it all. And the joy of fixing things and really being the competent engineer who handled it with grace under pressure. Try to make some jokes (at an appropriate moment!) and remember that whatever happens, it's going to be okay. Probably.
More in programming
Many hypergrowth companies of the 2010s battled increasing complexity in their codebase by decomposing their monoliths. Stripe was somewhat of an exception, largely delaying decomposition until it had grown beyond three thousand engineers and had accumulated a decade of development in its core Ruby monolith. Even now, significant portions of their product are maintained in the monolithic repository, and it’s safe to say this was only possible because of Sorbet’s impact. Sorbet is a custom static type checker for Ruby that was initially designed and implemented by Stripe engineers on their Product Infrastructure team. Stripe’s Product Infrastructure had similar goals to other companies’ Developer Experience or Developer Productivity teams, but it focused on improving productivity through changes in the internal architecture of the codebase itself, rather than relying solely on external tooling or processes. This strategy explains why Stripe chose to delay decomposition for so long, and how the Product Infrastructure team invested in developer productivity to deal with the challenges of a large Ruby codebase managed by a large software engineering team with low average tenure caused by rapid hiring. Before wrapping this introduction, I want to explicitly acknowledge that this strategy was spearheaded by Stripe’s Product Infrastructure team, not by me. Although I ultimately became responsible for that team, I can’t take credit for this strategy’s thinking. Rather, I was initially skeptical, preferring an incremental migration to an existing strongly-typed programming language, either Java for library coverage or Golang for Stripe’s existing familiarity. Despite my initial doubts, the Sorbet project eventually won me over with its indisputable results. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Reading this document To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document. Policy & Operation The Product Infrastructure team is investing in Stripe’s developer experience by: Every six months, Product Infrastructure will select its three highest priority areas to focus, and invest a significant majority of its energy into those. We will provide minimal support for other areas. We commit to refreshing our priorities every half after running the developer productivity survey. We will further share our results, and priorities, in each Quarterly Business Review. Our three highest priority areas for this half are: Add static typing to the highest value portions of our Ruby codebase, such that we can run the type checker locally and on the test machines to identify errors more quickly. Support selective test execution such that engineers can quickly determine and run the most appropriate tests on their machine rather than delaying until tests run on the build server. Instrument test failures such that we have better data to prioritize future efforts. Static typing is not a typical solution to developer productivity, so it requires some explanation when we say this is our highest priority area for investment. Doubly so when we acknowledge that it will take us 12-24 months of much of the team’s time to get our type checker to an effective place. Our type checker, which we plan to name Sorbet, will allow us to continue developing within our existing Ruby codebase. It will further allow our product engineers to remain focused on developing new functionality rather than migrating existing functionality to new services or programming languages. Instead, our Product Infrastructure team will centrally absorb both the development of the type checker and the initial rollout to our codebase. It’s possible for Product Infrastructure to take on both, despite its fixed size. We’ll rely on a hybrid approach of deep-dives to add typing to particularly complex areas, and scripts to rewrite our code’s Abstract Syntax Trees (AST) for less complex portions. In the relatively unlikely event that this approach fails, the cost to Stripe is of a small, known size: approximately six months of half the Product Infrastructure team, which is what we anticipate requiring to determine if this approach is viable. Based on our knowledge of Facebook’s Hack project, we believe we can build a static type checker that runs locally and significantly faster than our test suite. It’s hard to make a precise guess now, but we think less than 30 seconds to type our entire codebase, despite it being quite large. This will allow for a highly productive local development experience, even if we are not able to speed up local testing. Even if we do speed up local testing, typing would help us eliminate one of the categories of errors that testing has been unable to eliminate, which is passing of unexpected types across code paths which have been tested for expected scenarios but not for entirely unexpected scenarios. Once the type checker has been validated, we can incrementally prioritize adding typing to the highest value places across the codebase. We do not need to wholly type our codebase before we can start getting meaningful value. In support of these static typing efforts, we will advocate for product engineers at Stripe to begin development using the Command Query Responsibility Segregation (CQRS) design pattern, which we believe will provide high-leverage interfaces for incrementally introducing static typing into our codebase. Selective test execution will allow developers to quickly run appropriate tests locally. This will allow engineers to stay in a tight local development loop, speeding up development of high quality code. Given that our codebase is not currently statically typed, inferring which tests to run is rather challenging. With our very high test coverage, and the fact that all tests will still be run before deployment to the production environment, we believe that we can rely on statistically inferring which tests are likely to fail when a given file is modified. Instrumenting test failures is our third, and lowest priority, project for this half. Our focus this half is purely on annotating errors for which we have high conviction about their source, whether infrastructure or test issues. For escalations and issues, reach out in the #product-infra channel. Diagnose In 2017, Stripe is a company of about 1,000 people, including 400 software engineers. We aim to grow our organization by about 70% year-over-year to meet increasing demand for a broader product portfolio and to scale our existing products and infrastructure to accommodate user growth. As our production stability has improved over the past several years, we have now turned our focus towards improving developer productivity. Our current diagnosis of our developer productivity is: We primarily fund developer productivity for our Ruby-authoring software engineers via our Product Infrastructure team. The Ruby-focused portion of that team has about ten engineers on it today, and is unlikely to significantly grow in the future. (If we do expand, we are likely to staff non-Ruby ecosystems like Scala or Golang.) We have two primary mechanisms for understanding our engineer’s developer experience. The first is standard productivity metrics around deploy time, deploy stability, test coverage, test time, test flakiness, and so on. The second is a twice annual developer productivity survey. Looking at our productivity metrics, our test coverage remains extremely high, with coverage above 99% of lines, and tests are quite slow to run locally. They run quickly in our infrastructure because they are multiplexed across a large fleet of test runners. Tests have become slow enough to run locally that an increasing number of developers run an overly narrow subset of tests, or entirely skip running tests until after pushing their changes. They instead rely on our test servers to run against their pull request’s branch, which works well enough, but significantly slows down developer iteration time because the merge, build, and test cycle takes twenty to thirty minutes to complete. By the time their build-test cycle completes, they’ve lost their focus and maybe take several hours to return to addressing the results. There is significant disagreement about whether tests are becoming flakier due to test infrastructure issues, or due to quality issues of the tests themselves. At this point, there is no trustworthy dataset that allows us to attribute between those two causes. Feedback from the twice annual developer productivity survey supports the above diagnosis, and adds some additional nuance. Most concerning, although long-tenured Stripe engineers find themselves highly productive in our codebase, we increasingly hear in the survey that newly hired engineers with long tenures at other companies find themselves unproductive in our codebase. Specifically, they find it very difficult to determine how to safely make changes in our codebase. Our product codebase is entirely implemented in a single Ruby monolith. There is one narrow exception, a Golang service handling payment tokenization, which we consider out of scope for two reasons. First, it is kept intentionally narrow in order to absorb our SOC1 compliance obligations. Second, developers in that environment have not raised concerns about their productivity. Our data infrastructure is implemented in Scala. While these developers have concerns–primarily slow build times–they manage their build and deployment infrastructure independently, and the group remains relatively small. Ruby is not a highly performant programming language, but we’ve found it sufficiently efficient for our needs. Similarly, other languages are more cost-efficient from a compute resources perspective, but a significant majority of our spend is on real-time storage and batch computation. For these reasons alone, we would not consider replacing Ruby as our core programming language. Our Product Infrastructure team is about ten engineers, supporting about 250 product engineers. We anticipate this group growing modestly over time, but certainly sublinearly to the overall growth of product engineers. Developers working in Golang and Scala routinely ask for more centralized support, but it’s challenging to prioritize those requests as we’re forced to consider the return on improving the experience for 240 product engineers working in Ruby vs 10 in Golang or 40 data engineers in Scala. If we introduced more programming languages, this prioritization problem would become increasingly difficult, and we are already failing to support additional languages.
The new AMD HX370 option in the Framework 13 is a good step forward in performance for developers. It runs our HEY test suite in 2m7s, compared to 2m43s for the 7840U (and 2m49s for a M4 Pro!). It's also about 20% faster in most single-core tasks than the 7840U. But is that enough to warrant the jump in price? AMD's latest, best chips have suddenly gotten pretty expensive. The F13 w/ HX370 now costs $1,992 with 32GB RAM / 1TB. Almost the same an M4 Pro MBP14 w/ 24GB / 1TB ($2,199). I'd pick the Framework any day for its better keyboard, 3:2 matte screen, repairability, and superb Linux compatibility, but it won't be because the top option is "cheaper" any more. Of course you could also just go with the budget 6-core Ryzen AI 5 340 in same spec for $1,362. I'm sure that's a great machine too. But maybe the sweet spot is actually the Ryzen AI 7 350. It "only" has 8 cores (vs 12 on the 370), but four of those are performance cores -- the same as the 370. And it's $300 cheaper. So ~$1,600 gets you out the door. I haven't actually tried the 350, though, so that's just speculation. I've been running the 370 for the last few months. Whichever chip you choose, the rest of the Framework 13 package is as good as it ever was. This remains my favorite laptop of at least the last decade. I've been running one for over a year now, and combined with Omakub + Neovim, it's the first machine in forever where I've actually enjoyed programming on a 13" screen. The 3:2 aspect ratio combined with Linux's superb multiple desktops that switch with 0ms lag and no animations means I barely miss the trusted 6K Apple XDR screen when working away from the desk. The HX370 gives me about 6 hours of battery life in mixed use. About the same as the old 7840U. Though if all I'm doing is writing, I can squeeze that to 8-10 hours. That's good enough for me, but not as good as a Qualcomm machine or an Apple M-chip machine. For some people, those extra hours really make the difference. What does make a difference, of course, is Linux. I've written repeatedly about how much of a joy it's been to rediscover Linux on the desktop, and it's a joy that keeps on giving. For web work, it's so good. And for any work that requires even a minimum of Docker, it's so fast (as the HEY suite run time attests). Apple still has a strong hardware game, but their software story is falling apart. I haven't heard many people sing the praises of new iOS or macOS releases in a long while. It seems like without an asshole in charge, both have move towards more bloat, more ads, more gimmicks, more control. Linux is an incredible antidote to this nonsense these days. It's also just fun! Seeing AMD catch up in outright performance if not efficiency has been a delight. Watching Framework perfect their 13" laptop while remaining 100% backwards compatible in terms of upgrades with the first versions is heartwarming. And getting to test the new Framework Desktop in advance of its Q3 release has only affirmed my commitment to both. But on the new HX370, it's in my opinion the best Linux laptop you can buy today, which by extension makes it the best web developer laptop too. The top spec might have gotten a bit pricey, but there are options all along the budget spectrum, which retains all the key ingredients any way. Hard to go wrong. Forza Framework!
I’m a big fan of keyring, a Python module made by Jason R. Coombs for storing secrets in the system keyring. It works on multiple operating systems, and it knows what password store to use for each of them. For example, if you’re using macOS it puts secrets in the Keychain, but if you’re on Windows it uses Credential Locker. The keyring module is a safe and portable way to store passwords, more secure than using a plaintext config file or an environment variable. The same code will work on different platforms, because keyring handles the hard work of choosing which password store to use. It has a straightforward API: the keyring.set_password and keyring.get_password functions will handle a lot of use cases. >>> import keyring >>> keyring.set_password("xkcd", "alexwlchan", "correct-horse-battery-staple") >>> keyring.get_password("xkcd", "alexwlchan") "correct-horse-battery-staple" Although this API is simple, it’s not perfect – I have some frustrations with the get_password function. In a lot of my projects, I’m now using a small function that wraps get_password. What do I find frustrating about keyring.get_password? If you look up a password that isn’t in the system keyring, get_password returns None rather than throwing an exception: >>> print(keyring.get_password("xkcd", "the_invisible_man")) None I can see why this makes sense for the library overall – a non-existent password is very normal, and not exceptional behaviour – but in my projects, None is rarely a usable value. I normally use keyring to retrieve secrets that I need to access protected resources – for example, an API key to call an API that requires authentication. If I can’t get the right secrets, I know I can’t continue. Indeed, continuing often leads to more confusing errors when some other function unexpectedly gets None, rather than a string. For a while, I wrapped get_password in a function that would throw an exception if it couldn’t find the password: def get_required_password(service_name: str, username: str) -> str: """ Get password from the specified service. If a matching password is not found in the system keyring, this function will throw an exception. """ password = keyring.get_password(service_name, username) if password is None: raise RuntimeError(f"Could not retrieve password {(service_name, username)}") return password When I use this function, my code will fail as soon as it fails to retrieve a password, rather than when it tries to use None as the password. This worked well enough for my personal projects, but it wasn’t a great fit for shared projects. I could make sense of the error, but not everyone could do the same. What’s that password meant to be? A good error message explains what’s gone wrong, and gives the reader clear steps for fixing the issue. The error message above is only doing half the job. It tells you what’s gone wrong (it couldn’t get the password) but it doesn’t tell you how to fix it. As I started using this snippet in codebases that I work on with other developers, I got questions when other people hit this error. They could guess that they needed to set a password, but the error message doesn’t explain how, or what password they should be setting. For example, is this a secret they should pick themselves? Is it a password in our shared password vault? Or do they need an API key for a third-party service? If so, where do they find it? I still think my initial error was an improvement over letting None be used in the rest of the codebase, but I realised I could go further. This is my extended wrapper: def get_required_password(service_name: str, username: str, explanation: str) -> str: """ Get password from the specified service. If a matching password is not found in the system keyring, this function will throw an exception and explain to the user how to set the required password. """ password = keyring.get_password(service_name, username) if password is None: raise RuntimeError( "Unable to retrieve required password from the system keyring!\n" "\n" "You need to:\n" "\n" f"1/ Get the password. Here's how: {explanation}\n" "\n" "2/ Save the new password in the system keyring:\n" "\n" f" keyring set {service_name} {username}\n" ) return password The explanation argument allows me to explain what the password is for to a future reader, and what value it should have. That information can often be found in a code comment or in documentation, but putting it in an error message makes it more visible. Here’s one example: get_required_password( "flask_app", "secret_key", explanation=( "Pick a random value, e.g. with\n" "\n" " python3 -c 'import secrets; print(secrets.token_hex())'\n" "\n" "This password is used to securely sign the Flask session cookie. " "See https://flask.palletsprojects.com/en/stable/config/#SECRET_KEY" ), ) If you call this function and there’s no keyring entry for flask_app/secret_key, you get the following error: Unable to retrieve required password from the system keyring! You need to: 1/ Get the password. Here's how: Pick a random value, e.g. with python3 -c 'import secrets; print(secrets.token_hex())' This password is used to securely sign the Flask session cookie. See https://flask.palletsprojects.com/en/stable/config/#SECRET_KEY 2/ Save the new password in the system keyring: keyring set flask_app secret_key It’s longer, but this error message is far more informative. It tells you what’s wrong, how to save a password, and what the password should be. This is based on a real example where the previous error message led to a misunderstanding. A co-worker saw a missing password called “secret key” and thought it referred to a secret key for calling an API, and didn’t realise it was actually for signing Flask session cookies. Now I can write a more informative error message, I can prevent that misunderstanding happening again. (We also renamed the secret, for additional clarity.) It takes time to write this explanation, which will only ever be seen by a handful of people, but I think it’s important. If somebody sees it at all, it’ll be when they’re setting up the project for the first time. I want that setup process to be smooth and straightforward. I don’t use this wrapper in all my code, particularly small or throwaway toys that won’t last long enough for this to be an issue. But in larger codebases that will be used by other developers, and which I expect to last a long time, I use it extensively. Writing a good explanation now can avoid frustration later. [If the formatting of this post looks odd in your feed reader, visit the original article]
Carson Gross has a post about vendoring which brought back memories of how I used to build websites in ye olden days, back in the dark times before npm. “Vendoring” is where you copy dependency source files directly into your project (usually in a folder called /vendor) and then link to them — all of this being a manual process. For example: Find jquery.js or reset.css somewhere on the web (usually from the project’s respective website, in my case I always pulled jQuery from the big download button on jQuery.com and my CSS reset from Eric Meyer’s website). Copy that file into /vendor, e.g. /vendor/jquery@1.2.3.js Pull it in where you need it, e.g. <script src="/vendor/jquery@1.2.3.js"> And don’t get me started on copying your transitive dependencies (the dependencies of your dependencies). That gets complicated when you’re vendoring by hand! Now-a-days package managers and bundlers automate all of this away: npm i what you want, import x from 'pkg', and you’re on your way! It’s so easy (easy to get all that complexity). But, as the HTMX article points out, a strength can also be a weakness. It’s not all net gain (emphasis mine): Because dealing with large numbers of dependencies is difficult, vendoring encourages a culture of independence. You get more of what you make easy, and if you make dependencies easy, you get more of them. I like that — you get more of what you make easy. Therefore: be mindful of what you make easy! As Carson points out, dependency management tools foster a culture of dependence — imagine that! I know I keep lamenting Deno’s move away from HTTP imports by default, but I think this puts a finger on why I’m sad: it perpetuates the status quo, whereas a stance on aligning imports with how the browser works would not perpetuate this dependence on dependency resolution tooling. There’s no package manager or dependency resolution algorithm for the browser. I was thinking about all of this the other day when I then came across this thread of thoughts from Dave Rupert on Mastodon. Dave says: I prefer to use and make simpler, less complex solutions that intentionally do less. But everyone just wants the thing they have now but faster and crammed with more features (which are juxtaposed) He continues with this lesson from his startup Luro: One of my biggest takeaways from Luro is that it’s hard-to-impossible to sell a process change. People will bolt stuff onto their existing workflow (ecosystem) all day long, but it takes a religious conversion to change process. Which really helped me put words to my feelings regarding HTTP imports in Deno: i'm less sad about the technical nature of the feature, and more about what it represented as a potential “religious revival” in the area of dependency management in JS. package.json & dep management has become such an ecosystem unto itself that it seems only a Great Reawakening™️ will change it. I don’t have a punchy point to end this article. It’s just me working through my feelings. Email · Mastodon · Bluesky
Short one this time because I have a lot going on this week. In computation complexity, NP is the class of all decision problems (yes/no) where a potential proof (or "witness") for "yes" can be verified in polynomial time. For example, "does this set of numbers have a subset that sums to zero" is in NP. If the answer is "yes", you can prove it by presenting a set of numbers. We would then verify the witness by 1) checking that all the numbers are present in the set (~linear time) and 2) adding up all the numbers (also linear). NP-complete is the class of "hardest possible" NP problems. Subset sum is NP-complete. NP-hard is the set all problems at least as hard as NP-complete. Notably, NP-hard is not a subset of NP, as it contains problems that are harder than NP-complete. A natural question to ask is "like what?" And the canonical example of "NP-harder" is the halting problem (HALT): does program P halt on input C? As the argument goes, it's undecidable, so obviously not in NP. I think this is a bad example for two reasons: All NP requires is that witnesses for "yes" can be verified in polynomial time. It does not require anything for the "no" case! And even though HP is undecidable, there is a decidable way to verify a "yes": let the witness be "it halts in N steps", then run the program for that many steps and see if it halted by then. To prove HALT is not in NP, you have to show that this verification process grows faster than polynomially. It does (as busy beaver is uncomputable), but this all makes the example needlessly confusing.1 "What's bigger than a dog? THE MOON" Really (2) bothers me a lot more than (1) because it's just so inelegant. It suggests that NP-complete is the upper bound of "solvable" problems, and after that you're in full-on undecidability. I'd rather show intuitive problems that are harder than NP but not that much harder. But in looking for a "slightly harder" problem, I ran into an, ah, problem. It seems like the next-hardest class would be EXPTIME, except we don't know for sure that NP != EXPTIME. We know for sure that NP != NEXPTIME, but NEXPTIME doesn't have any intuitive, easily explainable problems. Most "definitely harder than NP" problems require a nontrivial background in theoretical computer science or mathematics to understand. There is one problem, though, that I find easily explainable. Place a token at the bottom left corner of a grid that extends infinitely up and right, call that point (0, 0). You're given list of valid displacement moves for the token, like (+1, +0), (-20, +13), (-5, -6), etc, and a target point like (700, 1). You may make any sequence of moves in any order, as long as no move ever puts the token off the grid. Does any sequence of moves bring you to the target? This is PSPACE-complete, I think, which still isn't proven to be harder than NP-complete (though it's widely believed). But what if you increase the number of dimensions of the grid? Past a certain number of dimensions the problem jumps to being EXPSPACE-complete, and then TOWER-complete (grows tetrationally), and then it keeps going. Some point might recognize this as looking a lot like the Ackermann function, and in fact this problem is ACKERMANN-complete on the number of available dimensions. A friend wrote a Quanta article about the whole mess, you should read it. This problem is ludicrously bigger than NP ("Chicago" instead of "The Moon"), but at least it's clearly decidable, easily explainable, and definitely not in NP. It's less confusing if you're taught the alternate (and original!) definition of NP, "the class of problems solvable in polynomial time by a nondeterministic Turing machine". Then HALT can't be in NP because otherwise runtime would be bounded by an exponential function. ↩