Dan Slimmon


2 months ago

38

Lecture: Queueing theory on a cocktail napkin

from Dan Slimmon in programming

Queues are everywhere, and they follow mathematical rules. Learn a few of those rules! It'll go a long way to making you a stronger SRE.


5 months ago

49

did u ever read so hard u accidentally wrote?

from Dan Slimmon in programming

Owning a production Postgres database is never boring. The other day, I’m looking for trouble (as I am wont to do), and I notice this weird curve in the production database metrics: So we’ve got these spikes in WALWrite: the number of processes waiting to write to the write-ahead...


6 months ago

85

Is ops a bullshit job?

from Dan Slimmon in programming

I recently had the pleasure of reading anthropologist David Graeber’s 2018 book, Bullshit Jobs: A Theory. Graeber defines a bullshit job as, a form of paid employment that is so completely pointless, unnecessary, or pernicious that even the employee cannot justify its existence...


7 months ago

65

Incident SEV scales are a waste of time

from Dan Slimmon in programming

Ask an engineering leader about their incident response protocol and they’ll tell you about their severity scale. “The first thing we do is we assign a severity to the incident,” they’ll say, “so the right people will get notified.” And this is sensible. In order to figure out...


a year ago

105

The queueing shell game

from Dan Slimmon in programming

Queues are not just architectural widgets that you can insert into your architecture wherever they're needed. Queues are spontaneously occurring phenomena, just like a waterfall or a thunderstorm.


a year ago

90

Podcast: Small Batches with Adam Hawkins

from Dan Slimmon in programming

I was recently delighted to be interviewed by Adam Hawkins on his podcast Small Batches. We discussed a huge variety of topics. Here is the full episode, and on that page you’ll find meticulously timestamped links to specific topics. Check out the rest of Adam’s podcast, it’s...


a year ago

119

Putting a meaningful dent in your error backlog

from Dan Slimmon in programming

We often don't realize how noisy the errors have gotten until things are already well out of hand. After all, we've got shit to do. Deadlines to hit. By the time we decide to get serious about error management, a huge, impenetrable, meaningless backlog of errors has already...


a year ago

109

No Observability Without Theory: The Talk

from Dan Slimmon in programming

Last month, I had the unadulterated pleasure of presenting “No Observability Without Theory” at Monitorama 2024. If you’ve never been to Monitorama, I can’t recommend it enough. I think it’s the best tech conference, period. This talk was adapted from an old blog post of mine,...


a year ago

117

Leading incidents when you’re junior

from Dan Slimmon in programming

If you’re a junior engineer at a software company, you might be required to be on call for the systems your team owns. Which means you’ll eventually be called upon to lead an incident response. And since incidents don’t care what your org chart looks like, fate may place you in...


a year ago

102

Fight knowledge decay with a rich Incident Summary

from Dan Slimmon in programming

It only takes a few off-the-rails incidents in your software career to realize the importance of writing things down. That’s why so many companies’ incident response protocols define a scribe role. The scribe’s job, generally, is to take notes on everything that happens. In other...


a year ago

109

Ask questions first, shoot later

from Dan Slimmon in programming

The fact that fixing and diagnosing often converge to the same actions doesn't change the fact that these two concurrent activities have different goals. The goal of fixing is to bring the system into line with your mental model of how it's supposed to function. The goal of...


a year ago

72

Podcast appearance: The Debrief from Incident.io

from Dan Slimmon in programming

I’m so grateful to Incident.io for the opportunity to shout from their rooftop about Clinical troubleshooting, which I firmly believe is the way we should all be diagnosing system failures. Enjoy the full episode!


a year ago

79

The World Record for Loneliness

from Dan Slimmon in programming

What's the farthest any person has been from the nearest other person?


a year ago

107

Garden-path incidents

from Dan Slimmon in programming

Barb’s story It’s 12 noon on a Minneapolis Wednesday, which means Barb can be found at Quang. As the waiter sets down Barb’s usual order (#307, the Bun Chay, extra spicy), Barb’s nostrils catch the heavenly aroma of peanuts and scallions and red chiles. A wave of calm moves...


a year ago

56

Explaining the fire

from Dan Slimmon in programming

When the firefighters arrive at the blazing building, they don't need to explain the fire. They need to put it out. It doesn't matter whether a toaster malfunctioned, or a cat knocked over a candle, or a smoker fell asleep watching The Voice. But when PagerDuty blows up and we...


a year ago

70

I was on the Slight Reliability podcast!

from Dan Slimmon in programming

Thanks very much to host Stephen Townshend of the Slight Reliability podcast. We talked about incident response, diagnosis, and looking for trouble. It was very chill! Full 28-minute episode:


a year ago

74

Incident, Inçident, Incidënt

from Dan Slimmon in programming

When you deploy broken code, it may cause an incident. Then you'll have to declare an incident. And don't forget to create an incident so customers can stay informed!


a year ago

71

Dead air on the incident call

from Dan Slimmon in programming

Silence can mean different things to different people in different situations. In this post, I'll present a few incident scenarios and explore the role of the incident commander in breaking (or simply abiding in) dead air.


a year ago

71

Clinical troubleshooting: diagnose any production issue, fast.

from Dan Slimmon in programming

Over the years, I've developed a reliable method for harnessing the diagnostic power of groups. My approach is derived from a different field in which groups of experts with various levels of context need to reason together about problems in a complex, dynamic system:...


a year ago

63

Interviewing engineers for diagnostic skills

from Dan Slimmon in programming

In SaaS, when we’re hiring engineers, we usually imagine that their time will mostly be spent building things. So we never forget to interview for skills at building stuff. Sometimes we ask candidates to write code on the fly. Other times we ask them to whiteboard out a sensible...


a year ago

96

3 questions that will make you a phenomenal rubber duck

from Dan Slimmon in programming

As a Postgres reliability consultant and SRE, I’ve spent many hours being a rubber duck. Now I outperform even the incisive bath toy. “Rubber duck debugging” is a widespread, tongue-in-cheek term for the practice of explaining, out-loud, a difficult problem that you’re stumped...


a year ago

68

Why transaction order matters, even if you’re only reading

from Dan Slimmon in programming

There are 4 isolation levels defined by the SQL standard, and Postgres supports them through the SET TRANSACTION statement. They are: This last guarantee is one against serialization anomalies. A serialization anomaly is any sequence of events that produces a result that would be...


a year ago

39

Concurrent locks and MultiXacts in Postgres

from Dan Slimmon in programming

Pretty recently, I was troubleshooting a performance issue in a production Rails app backed by Postgres. There was this one class of query that would get slower and slower over the course of about an hour. The exact pathology is a tale for another time, but the investigation led...


over a year ago

48

Squeeze the hell out of the system you have

from Dan Slimmon in programming

When complexity leaps are on the table, there's usually also an opportunity to squeeze some extra juice out of the system you have. By tweaking the workload, tuning performance, or supplementing the system in some way, you may be able to add months or even years of runway. When...


over a year ago

32

Don’t fix it just because it’s technical debt.

from Dan Slimmon in programming

Why should we only spend part of our time doing work that maximizes value, and the rest of our time doing other, less optimal work?


over a year ago

33

It’s fine to use names in post-mortems

from Dan Slimmon in programming

The purpose of the blameless post-mortem is not to make everyone feel comfortable. Discomfort can be healthy and useful. The purpose of the blameless post-mortem is to let us find explanations deeper than human error.


over a year ago

32

Incident metrics tell you nothing about reliability

from Dan Slimmon in programming

When an incident response process is created, there arise many voices calling for measurement. “As long as we’re creating standards for incidents, let’s track Mean-Time-To-Recovery (MTTR) and Mean-Time-To-Detection (MTTD) and Mean-time-Between-Failures (MTBF)!” they say things...


over a year ago

34

Post-mortems: content over structure

from Dan Slimmon in programming

The value of post-mortems is apparent: failures present opportunities to learn about unexpected behaviors of the system, and learning lets us make improvements to the system’s reliability. The value of post-mortem documents is much less apparent. Many R&D orgs will insist that...


over a year ago

38

Outliers carry information. Don’t leave them on the table

from Dan Slimmon in programming

Over a decade ago, I saw this talk by John Rauser. Only recently, though, did I come to realize how incredibly influential this talk has been on my career. Gosh what a great talk! You should watch it. If you operate a complex system, like a SaaS app, you probably have a dashboard...


over a year ago

36

5 production surprises worth investigating

from Dan Slimmon in programming

As an SRE, I’m a vocal believer in following one’s nose: seeking out surprising phenomena and getting to the bottom of them. By adopting this habit, we can find and fix many classes of problems before they turn into incidents. Over time, this makes things run much smoother. But...


over a year ago

32

Platform teams don’t need to act like companies

from Dan Slimmon in programming

Lately you see a lot of software company R&D teams organized around internal products. The Search Team provides a Search service and its “customers” are the teams whose code consumes that service. The Developer Productivity Team’s product is a suite of tools for managing local...


over a year ago

33

Fix tomorrow’s problems by fixing today’s problems

from Dan Slimmon in programming

A bug in our deployment system causes O(N²) latency with respect to the number of deploys that have been performed. At first, it’s too minuscule to notice. But the average deploy latency grows over time. Eventually, deploys start randomly timing out. The deploy pipeline grinds to...


over a year ago

36

Huh! as a signal

from Dan Slimmon in programming

We can never predict with certainty what the next system failure will be. But we can predict, because painful experience has taught us, that some or all of the causes of that failure will be surprising. We can use that!
