Dan Slimmon
Is ops a bullshit job?
a week ago
I recently had the pleasure of reading anthropologist David Graeber’s 2018 book, Bullshit Jobs: A Theory. Graeber defines a bullshit job as a form of paid employment that is so completely pointless, unnecessary, or pernicious that even the employee cannot justify its existence...
Dan Slimmon
Incident SEV scales are a waste of time
3 weeks ago
Ask an engineering leader about their incident response protocol and they’ll tell you about their severity scale. “The first thing we do is we assign a severity to the incident,” they’ll say, “so the right people will get notified.” And this is sensible. In order to figure out...
Dan Slimmon
The queueing shell game
6 months ago
Queues are not just architectural widgets that you can insert into your architecture wherever they're needed. Queues are spontaneously occurring phenomena, just like a waterfall or a thunderstorm.
Dan Slimmon
Podcast: Small Batches with Adam Hawkins
6 months ago
I was recently delighted to be interviewed by Adam Hawkins on his podcast Small Batches. We discussed a huge variety of topics. Here is the full episode, and on that page you’ll find meticulously timestamped links to specific topics. Check out the rest of Adam’s podcast, it’s...
Dan Slimmon
Putting a meaningful dent in your error backlog
6 months ago
We often don't realize how noisy the errors have gotten until things are already well out of hand. After all, we've got shit to do. Deadlines to hit. By the time we decide to get serious about error management, a huge, impenetrable, meaningless backlog of errors has already...
Dan Slimmon
No Observability Without Theory: The Talk
7 months ago
Last month, I had the unadulterated pleasure of presenting “No Observability Without Theory” at Monitorama 2024. If you’ve never been to Monitorama, I can’t recommend it enough. I think it’s the best tech conference, period. This talk was adapted from an old blog post of mine,...
Dan Slimmon
Leading incidents when you’re junior
8 months ago
If you’re a junior engineer at a software company, you might be required to be on call for the systems your team owns. Which means you’ll eventually be called upon to lead an incident response. And since incidents don’t care what your org chart looks like, fate may place you in...
Dan Slimmon
Fight knowledge decay with a rich Incident Summary
8 months ago
It only takes a few off-the-rails incidents in your software career to realize the importance of writing things down. That’s why so many companies’ incident response protocols define a scribe role. The scribe’s job, generally, is to take notes on everything that happens. In other...
Dan Slimmon
Ask questions first, shoot later
9 months ago
The fact that fixing and diagnosing often converge to the same actions doesn't change the fact that these two concurrent activities have different goals. The goal of fixing is to bring the system into line with your mental model of how it's supposed to function. The goal of...
Dan Slimmon
Podcast appearance: The Debrief from Incident.io
9 months ago
I’m so grateful to Incident.io for the opportunity to shout from their rooftop about Clinical troubleshooting, which I firmly believe is the way we should all be diagnosing system failures. Enjoy the full episode!
Dan Slimmon
The World Record for Loneliness
9 months ago
What's the farthest any person has been from the nearest other person?
Dan Slimmon
Garden-path incidents
10 months ago
Barb’s story
It’s 12 noon on a Minneapolis Wednesday, which means Barb can be found at Quang. As the waiter sets down Barb’s usual order (#307, the Bun Chay, extra spicy), Barb’s nostrils catch the heavenly aroma of peanuts and scallions and red chiles. A wave of calm moves...
Dan Slimmon
Explaining the fire
10 months ago
When the firefighters arrive at the blazing building, they don't need to explain the fire. They need to put it out. It doesn't matter whether a toaster malfunctioned, or a cat knocked over a candle, or a smoker fell asleep watching The Voice. But when PagerDuty blows up and we...
Dan Slimmon
I was on the Slight Reliability podcast!
10 months ago
Thanks very much to host Stephen Townshend of the Slight Reliability podcast. We talked about incident response, diagnosis, and looking for trouble. It was very chill! Full 28-minute episode:
Dan Slimmon
Incident, Inçident, Incidënt
11 months ago
When you deploy broken code, it may cause an incident. Then you'll have to declare an incident. And don't forget to create an incident so customers can stay informed!
Dan Slimmon
Dead air on the incident call
11 months ago
Silence can mean different things to different people in different situations. In this post, I'll present a few incident scenarios and explore the role of the incident commander in breaking (or simply abiding in) dead air.
Dan Slimmon
Clinical troubleshooting: diagnose any production issue, fast.
11 months ago
Over the years, I've developed a reliable method for harnessing the diagnostic power of groups. My approach is derived from a different field in which groups of experts with various levels of context need to reason together about problems in a complex, dynamic system:...
Dan Slimmon
Interviewing engineers for diagnostic skills
a year ago
In SaaS, when we’re hiring engineers, we usually imagine that their time will mostly be spent building things. So we never forget to interview for skills at building stuff. Sometimes we ask candidates to write code on the fly. Other times we ask them to whiteboard out a sensible...
Dan Slimmon
3 questions that will make you a phenomenal rubber duck
a year ago
As a Postgres reliability consultant and SRE, I’ve spent many hours being a rubber duck. Now I outperform even the incisive bath toy. “Rubber duck debugging” is a widespread, tongue-in-cheek term for the practice of explaining, out-loud, a difficult problem that you’re stumped...
Dan Slimmon
Why transaction order matters, even if you’re only reading
a year ago
There are 4 isolation levels defined by the SQL standard, and Postgres supports them through the SET TRANSACTION statement. They are: This last guarantee is one against serialization anomalies. A serialization anomaly is any sequence of events that produces a result that would be...
Dan Slimmon
Concurrent locks and MultiXacts in Postgres
a year ago
Pretty recently, I was troubleshooting a performance issue in a production Rails app backed by Postgres. There was this one class of query that would get slower and slower over the course of about an hour. The exact pathology is a tale for another time, but the investigation led...
Dan Slimmon
Squeeze the hell out of the system you have
a year ago
When complexity leaps are on the table, there's usually also an opportunity to squeeze some extra juice out of the system you have. By tweaking the workload, tuning performance, or supplementing the system in some way, you may be able to add months or even years of runway. When...
Dan Slimmon
Don’t fix it just because it’s technical debt.
a year ago
Why should we only spend part of our time doing work that maximizes value, and the rest of our time doing other, less optimal work?
Dan Slimmon
It’s fine to use names in post-mortems
a year ago
The purpose of the blameless post-mortem is not to make everyone feel comfortable. Discomfort can be healthy and useful. The purpose of the blameless post-mortem is to let us find explanations deeper than human error.
Dan Slimmon
Incident metrics tell you nothing about reliability
a year ago
When an incident response process is created, there arise many voices calling for measurement. “As long as we’re creating standards for incidents, let’s track Mean-Time-To-Recovery (MTTR) and Mean-Time-To-Detection (MTTD) and Mean-Time-Between-Failures (MTBF)!” they say things...
Dan Slimmon
Post-mortems: content over structure
a year ago
The value of post-mortems is apparent: failures present opportunities to learn about unexpected behaviors of the system, and learning lets us make improvements to the system’s reliability. The value of post-mortem documents is much less apparent. Many R&D orgs will insist that...
Dan Slimmon
Outliers carry information. Don’t leave them on the table
a year ago
Over a decade ago, I saw this talk by John Rauser. Only recently, though, did I come to realize how incredibly influential this talk has been on my career. Gosh what a great talk! You should watch it. If you operate a complex system, like a SaaS app, you probably have a dashboard...
Dan Slimmon
5 production surprises worth investigating
a year ago
As an SRE, I’m a vocal believer in following one’s nose: seeking out surprising phenomena and getting to the bottom of them. By adopting this habit, we can find and fix many classes of problems before they turn into incidents. Over time, this makes things run much smoother. But...
Dan Slimmon
Platform teams don’t need to act like companies
a year ago
Lately you see a lot of software company R&D teams organized around internal products. The Search Team provides a Search service and its “customers” are the teams whose code consumes that service. The Developer Productivity Team’s product is a suite of tools for managing local...
Dan Slimmon
Fix tomorrow’s problems by fixing today’s problems
over a year ago
A bug in our deployment system causes O(N²) latency with respect to the number of deploys that have been performed. At first, it’s too minuscule to notice. But the average deploy latency grows over time. Eventually, deploys start randomly timing out. The deploy pipeline grinds to...
Dan Slimmon
Huh! as a signal
over a year ago
We can never predict with certainty what the next system failure will be. But we can predict, because painful experience has taught us, that some or all of the causes of that failure will be surprising. We can use that!