Full Width [alt+shift+f] Shortcuts [alt+shift+k]
Sign Up [alt+shift+s] Log In [alt+shift+l]
86
.highlight pre { background-color: #efecec; border-color: var(--theme-secondary-background-color); border-radius: 10px; } The firehose of data is turned on In the beginning, the Internet was a small, cozy place. Most people weren’t online, and most businesses weren’t really online. The old Internet was for nerds willing to suffer through the less-than-straightforward technical setup, before the soul-scraping screech of a 28K baud modem resulted in a successful connection to the interwebs. Finally - we could now slowly download, bar by bar, images of Cindy Margolis. It was an innocent time, with tacky page view counters, guestbooks, “dancing baby” animated gifs, scrolling marquees, and just terrible background color choices. Back then we discovered things on the Web through an array of search engines - AltaVista, Excite, Lycos, Yahoo… None of them particularly stood out on their own. Yahoo was a more thorough, actual directory of websites maintained by fellow life...
11 months ago

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from Renegade Otter

A Lannister Always Pays His Technical Debts

A tale of two rewrites Jamie Zawinski is kind of a tech legend. He came up with the name “Mozilla”, invented that whole thing where you can send HTML in emails, and more. In his harrowing work diary of how Mosaic/Netscape came to be, Jamie described the burnout rodeo that was the Mosaic development (the top disclaimer has its own history — ignore it): I slept at work again last night; two and a half hours curled up in a quilt underneath my desk, from 11am to 1:30pm or so. That was when I woke up with a start, realizing that I was late for a meeting we were scheduled to have to argue about colormaps and dithering, and how we should deal with all the nefarious 8-bit color management issues. But it was no big deal, we just had the meeting later. It’s hard for someone to hold it against you when you miss a meeting because you’ve been at work so long that you’ve passed out from exhaustion. Netscape’s wild ride is well-depicted in the dramatized Discovery mini-series Valley of the Boom, and the company eventually collapsed with the death march rewrite of what seemed to be just seriously unmaintainable code. It was the subject of one of the more famous articles by ex-Microsoft engineer and then entrepreneur Joel Spolsky - Things You Should Never Do. While the infamous Netscape codebase is long gone, the people that it enriched are still shaping the world to this day. There have been big, successful rewrites. Twitter moved away from Ruby-on-Rails to JVM over a decade ago but the first, year-long full rewrite effort completely failed. Following architecture by fiat from the top, the engineering team said nothing, speaking out only days before the launch. The whole thing would crash out of the gate, they claimed, so Twitter had to go back to the drawing board and rewrite again. I'd love to hear from you. What didn’t work for Netscape worked for Twitter. Why? Netscape had major heat coming from ruthless Microsoft competition, very little time for major moves, and a team aleady exhausted from “office heroics”. Twitter, however, is a unique product that is incredibly hard to dislodge, even with the almost purposefully incompetent and reckless management. It’s hard to abandon your social media account after accumulating algorithmic reputation and followers for years, and yet one can switch browsers faster than they can switch socks. Companies often do not survive this kind of adventure without having an almost unfair moat. Those that do survive, they probably caught some battle scars. Friendly Fire: Notify in Slack directly Skip reviewers who are not available File pattern matching Individual code review reminders No access to your codebase needed The road to hell is paved with TODO comments All of this is to say that you should probably never let your system rot so badly until a code rewrite is even discussed. It never just happens. Your code doesn’t just become unmaintainable overnight. It gets there by the constant cutting of corners, hard-coding things, and crop-dusting your work with long-forgotten //FIXME comments. Fix who? We used to call it technical debt - a term that is now being frowned upon. The concept of “technical debt” got popular around the time when we were getting obsessed with “proh-cess” and Agile, as we got tired of death march projects, arbitrary deadlines, and general lack of structure and visibility into our work. Every software project felt like a tour — you came up for air and then went back into the 💩 for months. Agile meant that the stakeholders could be present in our planning meetings. We had to explain to them - somehow - that it took time to upgrade the web framework from v1 to v5 because no one has been using v1 for years, and in general, it slowed everyone down. Since we didn’t know how to explain this to a non-coder, someone came up with the condescending “technical debt” — “those spreadsheet monkeys wouldn’t understand what we do here!” While “technical debt” has most likely run its course as a manipulative verbal device, it is absolutely the right term to use amongst ourselves to reason about risks and to properly triage them. The three type of technical debt The word “debt” has negative connotations for sure, but just like with actual monetary debt, it’s never great but not always horrible. To mutilate the famous saying - you have to spend code to make code. I would categorize technical debt into three types — Aesthetic, Deferrable, and Toxic. A mark of a good engineer is knowing when to create technical debt, what kind of debt, and when to repay it. Aesthetic debt This is the kind of stuff that triggers your OCD but does not really affect your users or your velocity in any way. Maybe the imports are not sorted the way you want, and maybe there is a naming convention that is grinding your gears. It’s something that can be addressed with relatively low effort when you are good and ready, in many cases with proper automated code analysis and tools. Deferrable debt Deferrable debt is what should be refactored at some point, but it’s fairly contained and will not be a problem in the immediate future. The kind of debt that you need to minimize by methodically striking it off your list, and as long as it seeps through into your sprint work, you can probably avoid a scenario where it all gets out of control. Sometimes this sort of thing is really contained - a lone hacky file, written in the Mesozoic Era by a sleep-deprived Jamie Zawinski because someone was breathing down his neck. No one really understands what the code does, but it’s been humming along for the last 7 years, so why take your chances by waking the sleeping dragons? Slap the Safety Pig on it, claim a victory, and go shake down a vending machine. Toxic debt This is the kind of debt that needs to be addressed before it’s too late. How do you identify “toxic” debt? It’s that thing that you did half-way and now it’s become a workaround magnet. “We have to do it like this now until we fix it - someday”. The workarounds then become the foundation of new features, creating new and exciting debugging side quests. The future work required grows bigger with every new feature and a line of code. This is the toxic debt. Lack of tests is toxic debt Not having automated tests, or insufficient testing of critical paths, is tech debt in its own right. The more untested code you are adding, the more miserable your life is going to get over time. Tests are important to fight the debt itself. It’s much easier to take a sledgehammer to your codebase when a solid integration test suite’s got your back. We don’t like it, it’s upfront work that slows us down, but at some point after your Minimal Viable Prototype starts running away from you, you need to switch into Test Mode and tie it all down — before things get really nasty. Lack of documentation is toxic debt I am not talking about a War & Peace sized manual or detailed and severely out of date architecture diagrams in your Google Docs. Just a a set of critical READMEs and runbooks on how to start the system locally and perform basic tasks. What variables and secrets do I need? What else do I need installed? If there is a bug report, how do I configure my local environment to reproduce it, and so on. The time taken to reverse-engineer a system every time has an actual dollar value attached to it, plus the opportunity cost of not doing useful work. Put. It. In. A. Card. I have been guilty of this myself. I love TODOs. They are easy to add without breaking the flow, and they are configured in my IDE to be bright and loud. It’s a TODO — I will do it someday. During the Annual TODO Week, obviously. Let’s be frank — marking items as “TODO” is saying to yourself that you should really do this thing, but probably never will. This is relevant because TODO items can represent any level of technical debt described above, and so you should really make these actual stories on your Kanban/Agile boards. Mark technical debt as such You should be able to easily scan your “debt stories” and figure out which ones have payment due. This can be either a tag in your issue-tracking system or a column in your Kanban-style board like Trello. An approach like this will let you gauge better the ratio of new feature stories vs the growing technical debt. Your debt column will never be empty — that goal is as futile as Zero Inbox, but it should never grow out of control either. // TODO: conclusion

a year ago 47 votes
Code Lab - Job queues in Postgres

Introduction Friendly Fire needs to periodically execute scheduled jobs - to remind Slack users to review GitHub pull requests. Instead of bolting on a new system just for this, I decided to leverage Postgres instead. The must-have requirement was the ability to schedule a job to run in the future, with workers polling for “ripe” jobs, executing them and retrying on failure, with exponential backoff. With SKIP LOCKED, Postgres has the needed functionality, allowing a single worker to atomically pull a job from the job queue without another worker pulling the same one. This project is a demo of this system, slightly simplified. This example, available on GitHub is a playground for the following: How to set up a base Quart web app with Postgres using Poetry How to process a queue of immediate and delayed jobs using only the database How to retry failed jobs with exponential backoff How to use custom decorators to ensure atomic HTTP requests (success - commit, failure - rollback) How to use Pydantic for stricter Python models How to use asyncpg and asynchronously query Postgres with connection pooling How to test asyncio code using pytest and unittest.IsolatedAsyncioTestCase How to manipulate the clock in tests using freezegun How to use mypy, flake8, isort, and black to format and lint the code How to use Make to simplify local commands ALTER MODE SKIP COMPLEXITY Postgres introduced SKIP LOCKED years ago, but recently there was a noticeable uptick in the interest around this feature. In particular regarding its obvious use for simpler queuing systems, allowing us to bypass libraries or maintenance-hungry third-party messaging systems. Why now? It’s hard to say, but my guess is that the tech sector is adjusting to the leaner times, looking for more efficient and cheaper ways of achieving the same goals at common-scale but with fewer resources. Or shall we say - reasonable resources. What’s Quart? Quart is the asynchronous version of Flask. If you know about the g - the global request context - you will be right at home. Multiple quality frameworks have entered Python-scape in recent years - FastAPI, Sanic, Falcon, Litestar. There is also Bottle and Carafe. Apparently naming Python frameworks after liquid containers is now a running joke. Seeing that both Flask and Quart are now part of the Pallets project, Quart has been curiously devoid of hype. These two are in the process of being merged and at some point will become one framework - classic synchronous Flask and asynchronous Quart in one. How it works Writing about SKIP LOCKED is going to be redundant as this has been covered plenty elsewhere. For example, in this article. Even more in-depth are these slides from 2016 PGCON. The central query looks like this: DELETE FROM job WHERE id = ( SELECT id FROM job WHERE ripe_at IS NULL OR [current_time_argument] >= ripe_at FOR UPDATE SKIP LOCKED LIMIT 1 ) RETURNING *, id::text Each worker is added as a background task, periodically querying the database for “ripe” jobs (the ones ready to execute), and then runs the code for that specific job type. A job that does not have the “ripe” time set will be executed whenever a worker is available. A job that fails will be retried with exponential backoff, up to Job.max_retries times: next_retry_minutes = self.base_retry_minutes * pow(self.tries, 2) Creating a job is simple: job: Job = Job( job_type=JobType.MY_JOB_TYPE, arguments={"user_id": user_id}, ).runs_in(hours=1) await jobq.service.job_db.save(job) SKIP LOCKED and DELETE ... SELECT FOR UPDATE tango together to make sure that no worker gets the same job at the same time. To keep things interesting, at the Postgres level we have an MD5-based auto-generated column to make sure that no job of the same type and with the same arguments gets queued up more than once. This project also demonstrates the usage of custom DB transaction decorators in order to have a cleaner transaction notation: @write_transaction @api.put("/user") async def add_user(): # DB write logic @read_transaction @api.get("/user") async def get_user(): # DB read logic A request (or a function) annotated with one of these decorators will be in an atomic transaction until it exits, and rolled back if it fails. At shutdown, the “stop” flag in each worker is set, and the server waits until all the workers complete their sleep cycles, peacing out gracefully. async def stop(self): for worker in self.workers: worker.request_stop() while not all([w.stopped for w in self.workers]): logger.info("Waiting for all workers to stop...") await asyncio.sleep(1) logger.info("All workers have stopped") Testing The test suite leverages unittest.IsolatedAsyncioTestCase (Python 3.8 and up) to grant us access to asyncSetUp() - this way we can call await in our test setup functions: async def asyncSetUp(self) -> None: self.app: Quart = create_app() self.ctx: quart.ctx.AppContext = self.app.app_context() await self.ctx.push() self.conn = await asyncpg.connect(...) db.connection_manager.set_connection(self.conn) self.transaction = self.conn.transaction() await self.transaction.start() async def asyncTearDown(self) -> None: await self.transaction.rollback() await self.conn.close() await self.ctx.pop() Note that we set up the database only once for our test class. At the end of each test, the connection is rolled back, returning the database to its pristine state for the next test. This is a speed trick to make sure we don’t have to run database setup code each single time. In this case it doesn’t really matter, but in a test suite large enough, this is going to add up. For delayed jobs, we simulate the future by freezing the clock at a specific time (relative to now): # jump to the FUTURE with freeze_time(now + datetime.timedelta(hours=2)): ripe_job = await jobq.service.job_db.get_one_ripe_job() assert ripe_job Improvements Batching - pulling more than one job at once would add major dragonforce to this system. This is not part of the example as to not overcomplicate it. You just need to be careful and return the failed jobs back in the queue while deleting the completed ones. With enough workers, a system like this could really be capable of handling serious common-scale workloads. Server exit - there are less than trivial ways of interrupting worker sleep cycles. This could improve the experience of running the service locally. In its current form, you have to wait a few seconds until all worker loops get out of sleep() and read the STOP flag. Renegade Otter is the developer of Friendly Fire - Smarter pull request assignment for GitHub: Connect GitHub users to Slack and notify directly Skip reviewers who are not available File pattern matching Individual code review reminders No access to your codebase needed

a year ago 40 votes
Your database skills are not ‘good to have’

A MySQL war story It’s 2006, and the New York Magazine digital team set out to create a new search experience for its Fashion Week portal. It was one of those projects where technical feasibility was not even discussed with the tech team - a common occurrence back then. Agile was still new, let alone in publishing. It was just a vision, a real friggin’ moonshot, and 10 to 12 weeks to develop the wireframed version of the product. There would be almost no time left for proper QA. Fashion Week does not start slowly but rather goes from zero to sixty in a blink. The vision? Thousands of near-real-time fashion show images, each one with its sub-items categorized: “2006”, “bag”, “red”, “ leather”, and so on. A user will land on the search page and have the ability to “drill down” and narrow the results based on those properties. To make things much harder, all of these properties would come with exact counts. The workflow was going to be intense. Photographers will courier their digital cartridges from downtown NYC to our offices on Madison Avenue, where the images will be processed, tagged by interns, and then indexed every hour by our Perl script, reading the tags from the embedded EXIF information. Failure to build the search product on our side would have collapsed the entire ecosystem already in place, primed and ready to rumble. “Oh! Just use the facets in Solr, dude”. Yeah, not so fast - dude. In 2006 that kind of technology didn’t even exist yet. I sat through multiple enterprise search engine demos with our CTO, and none of the products (which cost a LOT of money) could do a deep faceted search. We already had an Autonomy license and my first try proved that… it just couldn’t do it. It was supposed to be able to, but the counts were all wrong. Endeca (now owned by Oracle), came out of stealth when the design part of the project was already underway. Too new, too raw, too risky. The idea was just a little too ambitious for its time, especially for a tiny team in a non-tech company. So here we were, a team of three, myself and two consultants, writing Perl for the indexing script, query-parsing logic, and modeling the data - in MySQL 4. It was one of those projects where one single insurmountable technical risk would have sunk the whole thing. I will cut the story short and spare you the excitement. We did it, and then we went out to celebrate at a karaoke bar (where I got my very first work-stress-related severe hangover) 🤮 For someone who was in charge of the SQL model and queries, it was days and days of tuning those, timing every query and studying the EXPLAIN output to see what else I could do to squeeze another 50ms out of the database. There were no free nights or weekends. In the end, it was a combination of trial and error, digging deep into MySQL server settings, and crafting GROUP BY queries that would make you nauseous. The MySQL query analyzer was fidgety back then, and sometimes re-arranging the fields in the SELECT clause could change a query’s performance. Imagine if SELECT field1, field2 FROM my_table was faster than SELECT field2, field1 FROM my_table. Why would it do that? I have no idea to this day, and I don’t even want to know. Unfortunately, I lost examples of this work, but the Way Back Machine has proof of our final product. The point here is - if you really know your database, you can do pretty crazy things with it, and with the modern generation of storage technologies and beefier hardware, you don’t even need to push the limits - it should easily handle what I refer to as “common-scale”. Renegade Otter is the developer of Friendly Fire - Smarter pull request assignment for GitHub: Connect GitHub users to Slack and notify directly Skip reviewers who are not available File pattern matching Individual code review reminders No access to your codebase needed The fading art of SQL In the past few years I have been noticing an unsettling trend - software engineers are eager to use exotic “planet-scale” databases for pretty rudimentary problems, while at the same time not having a good grasp of the very powerful relational database engine they are likely already using, let alone understanding the technology’s more advanced and useful capabilities. The SQL layer is buried so deep beneath libraries and too clever by a half ORMs that it all just becomes high-level code. Why is it slow? No idea - let's add Cassandra to it! Modern hardware certainly allows us to go way up from the CPU into the higher abstraction layers, while it wasn’t that uncommon in the past to convert certain functions to assembly code in order to squeeze every bit of performance out of the processor. Now compute and storage is cheaper - it’s true - but abusing this abundance has trained us laziness and complacency. Suddenly, that Cloud bill is a wee too high, and heavens knows how much energy the world is burning by just running billions of these inefficient ORM queries every second against mammoth database instances. The morning of my first job interview in 2004, I was on a subway train memorizing the nine levels of database normalization. Or is it five levels? I don’t remember, and It doesn’t even matter - no one will ever ask you this now in a software engineer interview. Just skimming through the table of contents of your database of choice, say the now freshly in vogue Postgres, you will find an absolute treasure trove of features fit to handle everything but the most gruesome planet-scale computer science problems. Petabyte-sized Postgres boxes, replicated, are effortlessly running now as you are reading this. The trick is to not expect your database or your ORM to read your mind. Speaking of… ORMs are the frenemy I was a new hire at an e-commerce outfit once, and right off the bat I was thrown into fixing serious performance issues with the company’s product catalog pages. Just a straight-forward, paginated grid of product images. How hard could it be? Believe it or not - it be. The pages took over 10 seconds to load, sometimes longer, the database was struggling, and the solution was to “just cache it”. One last datapoint - this was not a high-traffic site. The pages were dead-slow even if there was no traffic at all. That’s a rotten sign that something is seriously off. After looking a bit closer, I realized that I hit the motherlode - all top three major database and coding mistakes in one. ❌ Mistake #1: There is no index The column that was hit in every single mission-critical query had no index. None. After adding the much-needed index in production, you could practically hear MySQL exhaling in relief. Still, the performance was not quite there yet, so I had to dig deeper, now in the code. ❌ Mistake #2: Assuming each ORM call is free Activating the query logs locally and reloading a product listing page, I see… 200, 300, 500 queries fired off just to load one single page. What the shit? Turns out, this was the result of a classic ORM abuse of going through every record in a loop, to the effect of: for product_id in product_ids: product = amazing_orm.products.get(id=product_id) products.append(product) The high number of queries was also due the fact that some of this logic was nested. The obvious solution is to keep the number of queries in each request to a minimum, leveraging SQL to join and combine the data into one single blob. This is what relational databases do - it’s in the name. Each separate query needs to travel to the database, get parsed, transformed, analyzed, planned, executed, and then travel back to the caller. It is one of the most expensive operations you can do, and ORMs will happily do the worst possible thing for you in terms of performance. One wonders what those algorithm and data structure interview questions are good for, considering you are more likely to run into a sluggish database call than a B-tree implementation (common structure used for database indexes). ❌ Mistake #3: Pulling in the world To make matters worse, the amount of data here was relatively small, but there were dozens and dozens of columns. What do ORMs usually do by default in order to make your life “easier”? They send the whole thing, all the columns, clogging your network pipes with the data that you don’t even need. It is a form of toxic technical debt, where the speed of development will eventually start eating into performance. I spent hours within the same project hacking the dark corners of the Dango admin, overriding default ORM queries to be less “eager”. This led to a much better office-facing experience. Performance IS a feature Serious, mission-critical systems have been running on classic and boring relational databases for decades, serving thousands of requests per second. These systems have become more advanced, more capable, and more relevant. They are wonders of computer science, one can claim. You would think that an ancient database like Postgres (in development since 1982) is in some kind of legacy maintenance mode at this point, but the opposite is true. In fact, the work has been only accelerating, with the scale and features becoming pretty impressive. What took multiple queries just a few years ago now takes a single one. Why is this significant? It has been known for a long time, as discovered by Amazon, that every additional 100ms of a user waiting for a page to load loses a business money. We also know now that from a user’s perspective, the maximum target response time for a web page is around 100 milliseconds: A delay of less than 100 milliseconds feels instant to a user, but a delay between 100 and 300 milliseconds is perceptible. A delay between 300 and 1,000 milliseconds makes the user feel like a machine is working, but if the delay is above 1,000 milliseconds, your user will likely start to mentally context-switch. The “just add more CPU and RAM if it’s slow” approach may have worked for a while, but many are finding out the hard way that this kind of laziness is not sustainable in a frugal business environment where costs matter. Database anti-patterns Knowing what not to do is as important as knowing what to do. Some of the below mistakes are all too common: ❌ Anti-pattern #1. Using exotic databases for the wrong reasons Technologies like DynamoDB are designed to handle scale at which Postgres and MySQL begin to fail. This is achieved by denormalizing, duplicating the data aggressively, where the database is not doing much real-time data manipulation or joining. Your data is now modeled after how it is queried, not after how it is related. Regular relational concepts disintegrate at this insane level of scale. Needless to say, if you are resorting to this kind of storage for “common-scale” problems, you are already solving problems you don’t have. ❌ Anti-pattern #2. Caching things unnecessarily Caching is a necessary evil - but it’s not always necessary. There is an entire class of bugs and on-call issues that stem from stale cached data. Read-only database replicas are a classic architecture pattern that is still very much not outdated, and it will buy you insane levels of performance before you have to worry about anything. It should not be a surprise that mature relational databases already have query caching in place - it just has to be tuned for your specific needs. Cache invalidation is hard. It adds more complexity and states of uncertainty to your system. It makes debugging more difficult. I received more emails from content teams than I care for throughout my career that wondered “why is the data not there, I updated it 30 minutes ago?!” Caching should not act as a bandaid for bad architecture and non-performant code. ❌ Anti-pattern #3. Storing everything and a kitchen sink As much punishment as an industry-standard database can take, it’s probably not a good idea to not care at all about what’s going into it, treating it like a data landfill of sorts. Management, querying, backups, migrations - all becomes painful once the DB grows substantially. Even if that is of no concern as you are using a managed cloud DB - the costs should be. An RDBMS is a sophisticated piece of technology, and storing data in it is expensive. Figure out common-scale first It is fairly easy to make a beefy Postgres or a MySQL database grind to a halt if you expect it to do magic without any extra work. “It’s not web-scale, boss. Our 2 million records seem to be too much of a lift. We need DynamoDB, Kafka, and event sourcing!” A relational database is not some antiquated technology that only us tech fossils choose to be experts in, a thing that can be waved off like an annoying insect. “Here we React and GraphQL all the things, old man”. In legal speak, a modern RDBMS is innocent until proven guilty, and the burden of proof should be extremely high - and almost entirely on you. Finally, if I have to figure out “why it’s slow”, my approximate runbook is: Compile a list of unique queries, from logging, slow query log, etc. Look at the most frequent queries first Use EXPLAIN to check slow query plans for index usage Select only the data that needs to travel across the wire If an ORM is doing something silly without a workaround, pop the hood and get dirty with the raw SQL plumbing Most importantly, study your database (and SQL). Learn it, love it, use it, abuse it. Spending a couple of days just leafing through that Postgres manual to see what it can do will probably make you a better engineer than spending more time on the next flavor-of-the-month React hotness. Again. Related posts I am not your Cloud person Renegade Otter is the developer of Friendly Fire - Smarter pull request assignment for GitHub: Connect GitHub users to Slack and notify directly Skip reviewers who are not available File pattern matching Individual code review reminders No access to your codebase needed

a year ago 45 votes
Death by a thousand microservices

The Church of Complexity There is a pretty well-known sketch in which an engineer is explaining to the project manager how an overly complicated maze of microservices works in order to get a user’s birthday - and fails to do so anyway. The scene accurately describes the absurdity of the state of the current tech culture. We laugh, and yet bringing this up in a serious conversation is tantamount to professional heresy, rendering you borderline un-hirable. How did we get here? How did our aim become not addressing the task at hand but instead setting a pile of cash on fire by solving problems we don’t have? Trigger warning: Some people understandably got salty when I name-checked JavaScript and NodeJS as a source of the problem, but my point really was more about the dangers of hermetically sealed software ecosystems that seem hell-bent on re-learning the lessons that we just had finished learning. We ran into the complexity wall before and reset - otherwise we'd still be using CORBA and SOAP. These air-tight developer bubbles are a wrecking ball on the entire industry, and it takes about a full decade to swing. The perfect storm There are a few events in recent history that may have contributed to the current state of things. First, a whole army of developers writing JavaScript for the browser started self-identifying as “full-stack”, diving into server development and asynchronous code. JavaScript is JavaScript, right? What difference does it make what you create using it - user interfaces, servers, games, or embedded systems. Right? Node was still kind of a learning project of one person, and the early JavaScript was a deeply problematic choice for server development. Pointing this out to still green server-side developers usually resulted in a lot of huffing and puffing. This is all they knew, after all. The world outside of Node effectively did not exist, the Node way was the only way, and so this was the genesis of the stubborn, dogmatic thinking that we are dealing with to this day. And then, a steady stream of FAANG veterans started merging into the river of startups, mentoring the newly-minted and highly impressionable young JavaScript server-side engineers. The apostles of the Church of Complexity would assertively claim that “how they did things over at Google” was unquestionable and correct - even if it made no sense with the given context and size. What do you mean you don’t have a separate User Preferences Service? That just will not scale, bro! But, it’s easy to blame the veterans and the newcomers for all of this. What else was happening? Oh yeah - easy money. What do you do when you are flush with venture capital? You don’t go for revenue, surely! On more than one occasion I received an email from management, asking everyone to be in the office, tidy up their desks and look busy, as a clouder of Patagonia vests was about to be paraded through the space. Investors needed to see explosive growth, but not in profitability, no. They just needed to see how quickly the company could hire ultra-expensive software engineers to do … something. And now that you have these developers, what do you do with them? Well, they could build a simpler system that is easier to grow and maintain, or they could conjure up a monstrous constellation of “microservices” that no one really understands. Microservices - the new way of writing scalable software! Are we just going to pretend that the concept of “distributed systems” never existed? (Let’s skip the whole parsing of nuances about microservices not being real distributed systems). Back in the days when the tech industry was not such a bloated farce, distributed systems were respected, feared, and generally avoided - reserved only as the weapon of last resort for particularly gnarly problems. Everything with a distributed system becomes more challenging and time-consuming - development, debugging, deployment, testing, resilience. But I don’t know - maybe it’s all super easy now because toooollling. There is no standard tooling for microservices-based development - there is no common framework. Working on distributed systems has gotten only marginally easier in 2020s. The Dockers and the Kuberneteses of the world did not magically take away the inherent complexity of a distributed setup. I love referring to this summary of 5 years of startup audits, as it is packed with common-sense conclusions: … the startups we audited that are now doing the best usually had an almost brazenly ‘Keep It Simple’ approach to engineering. Cleverness for cleverness sake was abhorred. On the flip side, the companies where we were like ”woah, these folks are smart as hell” for the most part kind of faded. Generally, the major foot-gun that got a lot of places in trouble was the premature move to microservices, architectures that relied on distributed computing, and messaging-heavy designs. Literally - “complexity kills”. The audit revealed an interesting pattern, where many startups experienced a sort of collective imposter syndrome while building straight-forward, simple, performant systems. There is a dogma attached to not starting out with microservices on day one - no matter the problem. “Everyone is doing microservices, yet we have a single Django monolith maintained by just a few engineers, and a MySQL instance - what are we doing wrong?”. The answer is almost always “nothing”. Likewise, it’s very often that seasoned engineers experience hesitation and inadequacy in today’s tech world, and the good news is that, no - it’s probably not you. It’s common for teams to pretend like they are doing “web scale”, hiding behind libraries, ORMs, and cache - confident in their expertise (they crushed that Leetcode!), yet they may not even be aware of database indexing basics. You are operating in a sea of unjustified overconfidence, waste, and Dunning-Kruger, so who is really the imposter here? Renegade Otter is the developer of Friendly Fire - Smarter pull request assignment for GitHub: Slack notifications Out-of-office support File matching Try it now. There is nothing wrong with a monolith The idea that you cannot grow without a system that looks like the infamous slide of Afghanistan war strategy is a myth. Dropbox, Twitter, Netflix, Facebook, GitHub, Instagram, Shopify, StackOverflow - these companies and others started out as monolithic code bases. Many have a monolith at their core to this day. StackOverflow makes it a point of pride how little hardware they need to run the massive site. Shopify is still a Rails monolith, leveraging the tried and true Resque to process billions of tasks. WhatsApp went supernova with their Erlang monolith and a relatively small team. How? WhatsApp consciously keeps the engineering staff small to only about 50 engineers. Individual engineering teams are also small, consisting of 1 - 3 engineers and teams are each given a great deal of autonomy. In terms of servers, WhatsApp prefers to use a smaller number of servers and vertically scale each server to the highest extent possible. Instagram was acquired for billions - with a crew of 12. And do you imagine Threads as an effort involving a whole Meta campus? Nope. They followed the Instagram model, and this is the entire Threads team: Perhaps claiming that your particular problem domain requires a massively complicated distributed system and an open office stuffed to the gills with turbo-geniuses is just crossing over into arrogance rather than brilliance? Don’t solve problems you don’t have It’s a simple question - what problem are you solving? Is it scale? How do you know how to break it all up for scale and performance? Do you have enough data to show what needs to be a separate service and why? Distributed systems are built for size and resilience. Can your system scale and be resilient at the same time? What happens if one of the services goes down or comes to a crawl? Just scale it up? What about the other services that are going to get hit with traffic? Did you war-game the endless permutations of things that can and will go wrong? Is there backpressure? Circuit breakers? Queues? Jitter? Sensible timeouts on every endpoint? Are there fool-proof guards to make sure a simple change does not bring everything down? The knobs you need to be aware of and tune are endless, and they are all specific to your system’s particular signature of usage and load. The truth is that most companies will never reach the massive size that will actually require building a true distributed system. Your cosplaying Amazon and Google - without their scale, expertise, and endless resources - is very likely just an egregious waste of money and time. Religiously following all the steps from an article called “Ten morning habits of very successful people” is not going to make you a billionaire. The only thing harder than a distributed system is a BAD distributed system. “But each team… but separate… but API” Trying to shove a distributed topology into your company’s structure is a noble effort, but it almost always backfires. It’s a common approach to break up a problem into smaller pieces and then solve those one by one. So, the thinking goes, if you break up one service into multiple ones, everything becomes easier. The theory is sweet and elegant - each microservice is being maintained rigorously by a dedicated team, walled off behind a beautiful, backward-compatible, versioned API. In fact, this is so solid that you rarely even have to communicate with that team - as if the microservice was maintained by a 3rd party vendor. It’s simple! If that doesn’t sound familiar, that’s because this rarely happens. In reality, our Slack channels are flooded with messages from teams communicating about releases, bugs, configuration updates, breaking changes, and PSAs. Everyone needs to be on top of everything, all the time. And if that wasn’t great, it’s normal for one already-slammed team to half-ass multiple microservices instead of doing a great job on a single one, often changing ownership as people come and go. In order to win the race, we don’t build one good race car - we build a fleet of shitty golf carts. What you lose There are multiple pitfalls to building with microservices, and often that minefield is either not fully appreciated or simply ignored. Teams spend months writing highly customized tooling and learning lessons not related at all to the core product. Here are just some often overlooked aspects… Say goodbye to DRY After decades of teaching developers to write Don’t Repeat Yourself code, it seems we just stopped talking about it altogether. Microservices by default are not DRY, with every service stuffed with redundant boilerplate. Very often the overhead of such “plumbing” is so heavy, and the size of the microservices is so small, that the average instance of a service has more “service” than “product”. So what about the common code that can be factored out? Have a common library? How does the common library get updated? Keep different versions everywhere? Force updates regularly, creating dozens of pull requests across all repositories? Keep it all in a monorepo? That comes with its own set of problems. Allow for some code duplication? Forget it, each team gets to reinvent the wheel every time. Each company going this route faces these choices, and there are no good “ergonomic” options - you have to choose your version of the pain. Developer ergonomics will crater “Developer ergonomics” is the friction, the amount of effort a developer must go through in order to get something done, be it working on a new feature or resolving a bug. With microservices, an engineer has to have a mental map of the entire system in order to know what services to bring up for any particular task, what teams to talk to, whom to talk to, and what about. The “you have to know everything before doing anything” principle. How do you keep on top of it? Spotify, a multi-billion dollar company, spent probably not negligible internal resources to build Backstage, software for cataloging its endless systems and services. This should at least give you a clue that this game is not for everyone, and the price of the ride is high. So what about the tooooling? The Not Spotifies of the world are left with MacGyvering their own solutions, robustness and portability of which you can probably guess. And how many teams actually streamline the process of starting a YASS - “yet another stupid service”? This includes: Developer privileges in GitHub/GitLab Default environment variables and configuration CI/CD Code quality checkers Code review settings Branch rules and protections Monitoring and observability Test harness Infrastructure-as-code And of course, multiply this list by the number of programming languages used throughout the company. Maybe you have a usable template or a runbook? Maybe a frictionless, one-click system to launch a new service from scratch? It takes months to iron out all the kinks with this kind of automation. So, you can either work on your product, or you can be working on toooooling. Integration tests - LOL As if the everyday microservices grind was not enough, you also forfeit the peace of mind offered by solid integration tests. Your single-service and unit tests are passing, but are your critical paths still intact after each commit? Who is in charge of the overall integration test suite, in Postman or wherever else? Is there one? Integration testing a distributed setup is a nearly-impossible problem, so we pretty much gave up on that and replaced it with another one - Observability. Just like “microservices” are the new “distributed systems”, “observability” is the new “debugging in production”. Surely, you are not writing real software if you are not doing…. observability! Observability has become its own sector, and you will pay in both pretty penny and in developer time for it. It doesn’t come as plug-and-pay either - you need to understand and implement canary releases, feature flags, etc. Who is doing that? One already overwhelmed engineer? As you can see, breaking up your problem does not make solving it easier - all you get is another set of even harder problems. What about just “services”? Why do your services need to be “micro”? What’s wrong with just services? Some startups have gone as far as create a service for each function, and yes, “isn’t that just like Lambda” is a valid question. This gives you an idea of how far gone this unchecked cargo cult is. So what do we do? Starting with a monolith is one obvious choice. A pattern that could also work in many instances is “trunk & branches”, where the main “meat and potatoes” monolith is helped by “branch” services. A branch service can be one that takes care of a clearly-identifiable and separately-scalable load. A CPU-hungry Image-Resizing Service makes way more sense than a User Registration Service. Or do you get so many registrations per second that it requires independent horizontal scaling? Side note: In version control, back in the days of CVS and Subversion, we rarely used "master" branches. We had "trunk and branches" because, you know - *trees*. "Master" branches appeared somewhere along the way, and when GitHub decided to do away with the rather unfortunate naming convention, the average engineer was too young to remember about "trunk" - and so the generic "main" default came to be. The pendulum is swinging back The hype, however, seems to be dying down. The VC cash faucet is tightening, and so the businesses have been market-corrected into exercising common-sense decisions, recognizing that perhaps splurging on web-scale architectures when they don’t have web-scale problems is not sustainable. Ultimately, when faced with the need to travel from New York to Philadelphia, you have two options. You can either attempt to construct a highly intricate spaceship for an orbital descent to your destination, or you can simply purchase an Amtrak train ticket for a 90-minute ride. That is the problem at hand. Additional reading & listening How to recover from microservices You want modules, not microservices XML is the future Gasp! You might not need microservices Podcast: How we keep Stack Overflow’s codebase clean and modern Goodbye Microservices: From 100s of problem children to 1 superstar It’s the future Renegade Otter is the developer of Friendly Fire - Smarter pull request assignment for GitHub: Slack notifications Out-of-office support File matching Try it now.

a year ago 16 votes

More in programming

Believe it's going to work even though it probably won't

To be a successful founder, you have to believe that what you're working on is going to work — despite knowing it probably won't! That sounds like an oxymoron, but it's really not. Believing that what you're building is going to work is an essential component of coming to work with the energy, fortitude, and determination it's going to require to even have a shot. Knowing it probably won't is accepting the odds of that shot. It's simply the reality that most things in business don't work out. At least not in the long run. Most businesses fail. If not right away, then eventually. Yet the world economy is full of entrepreneurs who try anyway. Not because they don't know the odds, but because they've chosen to believe they're special. The best way to balance these opposing points — the conviction that you'll make it work, the knowledge that it probably won't — is to do all your work in a manner that'll make you proud either way. If it doesn't work, you still made something you wouldn't be ashamed to put your name on. And if it does work, you'll beam with pride from making it on the basis of something solid. The deep regret from trying and failing only truly hits when you look in the mirror and see Dostoevsky staring back at you with this punch to the gut: "Your worst sin is that you have destroyed and betrayed yourself for nothing." Oof. Believe it's going to work.  Build it in a way that makes you proud to sign it. Base your worth on a human on something greater than a business outcome.

23 hours ago 2 votes
Binary Arithmetic and Bitwise Operations for Systems Programming

Understand how computers represent numbers and perform operations at the bit level before diving into assembly

35 minutes ago 1 votes
How to use “real” UART

I recently went into a deep dive on “UART” and will publish a much longer article on the topic. This is just a recap of the basics to help put things in context. Many tutorials focus on using UART over USB, which adds many layers of abstraction, hiding what it actually is. Here, I deliberately … Continue reading How to use “real” UART → The post How to use “real” UART appeared first on Quentin Santos.

2 days ago 4 votes
Critical Trade Theory

You know about Critical Race Theory, right? It says that if there’s an imbalance in, say, income between races, it must be due to discrimination. This is what wokism seems to be, and it’s moronic and false. The right wing has invented something equally stupid. Introducing Critical Trade Theory, stolen from this tweet. If there’s an imbalance in trade between countries, it must be due to unfair practices. (not due to the obvious, like one country is 10x richer than the other) There’s really only one way the trade deficits will go away, and that’s if trade goes to zero (or maybe if all these countries become richer than America). Same thing with the race deficits, no amount of “leg up” bullshit will change them. Why are all the politicians in America anti-growth anti-reality idiots who want to drive us into the poor house? The way this tariff shit is being done is another stupid form of anti-merit benefits to chosen groups of people, with a whole lot of grift to go along with it. Makes me just not want to play.

2 days ago 2 votes
How to get better at strategy?

One of the most memorable quotes in Arthur Miller’s The Death of a Salesman comes from Uncle Ben, who describes his path to becoming wealthy as, “When I was seventeen, I walked into the jungle, and when I was twenty-one I walked out. And by God I was rich.” I wish I could describe the path to learning engineering strategy in similar terms, but by all accounts it’s a much slower path. Two decades in, I am still learning more from each project I work on. This book has aimed to accelerate your learning path, but my experience is that there’s still a great deal left to learn, despite what this book has hoped to accomplish. This final chapter is focused on the remaining advice I have to give on how you can continue to improve at strategy long after reading this book’s final page. Inescapably, this chapter has become advice on writing your own strategy for improving at strategy. You are already familiar with my general suggestions on creating strategy, so this chapter provides focused advice on creating your own plan to get better at strategy. It covers: Exploring strategy creation to find strategies you can learn from via public and private resources, and through creating learning communities How to diagnose the strategies you’ve found, to ensure you learn the right lessons from each one Policies that will help you find ways to perform and practice strategy within your organization, whether or not you have organizational authority Operational mechanisms to hold yourself accountable to developing a strategy practice My final benediction to you as a strategy practitioner who has finished reading this book With that preamble, let’s write this book’s final strategy: your personal strategy for developing your strategy practice. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Exploring strategy creation Ideally, we’d start our exploration of how to improve at engineering strategy by reading broadly from the many publicly available examples. Unfortunately, there simply aren’t many easily available works to learn from others’ experience. Nonetheless, resources do exist, and we’ll discuss the three categories that I’ve found most useful: Public resources on engineering strategy, such as companies’ engineering blogs Private and undocumented strategies available through your professional network Learning communities that you build together, including ongoing learning circles Each of these is explored in its own section below. Public resources While there aren’t as many public engineering strategy resources as I’d like, I’ve found that there are still a reasonable number available. This book collects a number of such resources in the appendix of engineering strategy resources. That appendix also includes some individuals’ blog posts that are adjacent to this topic. You can go a long way by searching and prompting your way into these resources. As you read them, it’s important to recognize that public strategies are often misleading, as discussed previously in evaluating strategies. Everyone writing in public has an agenda, and that agenda often means that they’ll omit important details to make themselves, or their company, come off well. Make sure you read through the lines rather than taking things too literally. Private resources Ironically, where public resources are hard to find, I’ve found it much easier to find privately held strategy resources. While private recollections are still prone to inaccuracies, the incentives to massage the truth are less pronounced. The most useful sources I’ve found are: peers’ stories – strategies are often oral histories, and they are shared freely among peers within and across companies. As you build out your professional network, you can usually get access to any company’s engineering strategy on any topic by just asking. There are brief exceptions. Even a close peer won’t share a sensitive strategy before its existence becomes obvious externally, but they’ll be glad to after it does. People tend to over-estimate how much information companies can keep private anyway: even reading recent job postings can usually expose a surprising amount about a company. internal strategy archaeologists – while surprisingly few companies formally collect their strategies into a repository, the stories are informally collected by the tenured members of the organization. These folks are the company’s strategy archaeologists, and you can learn a great deal by explicitly consulting them becoming a strategy archaeologist yourself – whether or not you’re a tenured member of your company, you can learn a tremendous amount by starting to build your own strategy repository. As you start collecting them, you’ll interest others in contributing their strategies as well. As discussed in Staff Engineer’s section on the Write five then synthesize approach to strategy, over time you can foster a culture of documentation where one didn’t exist before. Even better, building that culture doesn’t require any explicit authority, just an ongoing show of excitement. There are other sources as well, ranging from attending the hallway track in conferences to organizing dinners where stories are shared with a commitment to privacy. Working in community My final suggestion for seeing how others work on strategy is to form a learning circle. I formed a learning circle when I first moved into an executive role, and at this point have been running it for more than five years. What’s surprised me the most is how much I’ve learned from it. There are a few reasons why ongoing learning circles are exceptional for sharing strategy: Bi-directional discussion allows so much more learning and understanding than mono-directional communication like conference talks or documents. Groups allow you to learn from others’ experiences and others’ questions, rather than having to guide the entire learning yourself. Continuity allows you to see the strategy at inception, during the rollout, and after it’s been in practice for some time. Trust is built slowly, and you only get the full details about a problem when you’ve already successfully held trust about smaller things. An ongoing group makes this sort of sharing feasible where a transient group does not. Although putting one of these communities together requires a commitment, they are the best mechanism I’ve found. As a final secret, many people get stuck on how they can get invited to an existing learning circle, but that’s almost always the wrong question to be asking. If you want to join a learning circle, make one. That’s how I got invited to mine. Diagnosing your prior and current strategy work Collecting strategies to learn from is a valuable part of learning. You also have to determine what lessons to learn from each strategy. For example, you have to determine whether Calm’s approach to resourcing Engineering-driven projects is something to copy or something to avoid. What I’ve found effective is to apply the strategy rubric we developed in the “Is this strategy any good?” chapter to each of the strategies you’ve collected. Even by splitting a strategy into its various phases, you’ll learn a lot. Applying the rubric to each phase will teach you more. Each time you do this to another strategy, you’ll get a bit faster at applying the rubric, and you’ll start to see interesting, recurring patterns. As you dig into a strategy that you’ve split into phases and applied the evaluation rubric to, here are a handful of questions that I’ve found interesting to ask myself: How long did it take to determine a strategy’s initial phase could be improved? How high was the cost to fund that initial phase’s discovery? Why did the strategy reach its final stage and get repealed or replaced? How long did that take to get there? If you had to pick only one, did this strategy fail in its approach to exploration, diagnosis, policy or operations? To what extent did the strategy outlive the tenure of its primary author? Did it get repealed quickly after their departure, did it endure, or was it perhaps replaced during their tenure? Would you generally repeat this strategy, or would you strive to avoid repeating it? If you did repeat it, what conditions seem necessary to make it a success? How might you apply this strategy to your current opportunities and challenges? It’s not necessary to work through all of these questions for every strategy you’re learning from. I often try to pick the two that I think might be most interesting for a given strategy. Policy for improving at strategy At a high level, there are just a few key policies to consider for improving your strategic abilities. The first is implementing strategy, and the second is practicing implementing strategy. While those are indeed the starting points, there are a few more detailed options worth consideration: If your company has existing strategies that are not working, debug one and work to fix it. If you lack the authority to work at the company scope, then decrease altitude until you find an altitude you can work at. Perhaps setting Engineering organizational strategies is beyond your circumstances, but strategy for your team is entirely accessible. If your company has no documented strategies, document one to make it debuggable. Again, if operating at a high altitude isn’t attainable for some reason, operate at a lower altitude that is within reach. If your company’s or team’s strategies are effective but have low adoption, see if you can iterate on operational mechanisms to increase adoption. Many such mechanisms require no authority at all, such as low-noise nudges or the model-document-share approach. If existing strategies are effective and have high adoption, see if you can build excitement for a new strategy. Start by mining for which problems Staff-plus engineers and senior managers believe are important. Once you find one, you have a valuable strategy vein to start mining. If you don’t feel comfortable sharing your work internally, then try writing proposals while only sharing them to a few trusted peers. You can even go further to only share proposals with trusted external peers, perhaps within a learning circle that you create or join. Trying all of these at once would be overwhelming, so I recommend picking one in any given phase. If you aren’t able to make traction, then try another until something works. It’s particularly important to recognize in your diagnosis where things are not working–perhaps you simply don’t have the sponsorship you need to enforce strategy so you need to switch towards suggesting strategies instead–and you’ll find something that works. What if you’re not allowed to do strategy? If you’re looking to find one, you’ll always unearth a reason why it’s not possible to do strategy in your current environment. If you’ve convinced yourself that there’s simply no policy that would allow you to do strategy in your current role, then the two most useful levers I’ve found are: Lower your altitude – there’s always a scale where you can perform strategy, even if it’s just your team or even just yourself. Only you can forbid yourself from developing personal strategies. Practice rather than perform – organizations can only absorb so much strategy development at a given time, so sometimes they won’t be open to you doing more strategy. In that case, you should focus on practicing strategy work rather than directly performing it. Only you can stop yourself from practice. Don’t believe the hype: you can always do strategy work. Operating your strategy improvement policies As the refrain goes, even the best policies don’t accomplish much if they aren’t paired with operational mechanisms to ensure the policies actually happen, and debug why they aren’t happening. Although it’s tempting to ignore operations when it comes to our personal habits, I think that would be a mistake: our personal habits have the most significant long-term impact on ourselves, and are the easiest habits to ignore since others generally won’t ask about them. The mechanisms I’d recommend: Explicitly track the strategies that you’ve implemented, refined, documented, or read. This should be in a document, spreadsheet or folder where you can explicitly see if you have or haven’t done the work. Review your tracked strategies every quarter: are you working on the expected number and in the expected way? If not, why not? Ideally, your review should be done in community with a peer or a learning circle. It’s too easy to deceive yourself, it’s much harder to trick someone else. If your periodic review ever discovers that you’re simply not doing the work you expected, sit down for an hour with someone that you trust–ideally someone equally or more experienced than you–and debug what’s going wrong. Commit to doing this before your next periodic review. Tracking your personal habits can feel a bit odd, but it’s something I highly recommend. I’ve been setting and tracking personal goals for some time now—for example, in my 2024 year in review—and have benefited greatly from it. Too busy for strategy Many companies convince themselves that they’re too much in a rush to make good decisions. I’ve certainly gotten stuck in this view at times myself, although at this point in my career I find it increasingly difficult to not recognize that I have a number of tools to create time for strategy, and an obligation to do strategy rather than inflict poor decisions on the organizations I work in. Here’s my advice for creating time: If you’re not tracking how often you’re creating strategies, then start there. If you’ve not worked on a single strategy in the past six months, then start with one. If implementing a strategy has been prohibitively time consuming, then focus on practicing a strategy instead. If you do try all those things and still aren’t making progress, then accept your reality: you don’t view doing strategy as particularly important. Spend some time thinking about why that is, and if you’re comfortable with your answer, then maybe this is a practice you should come back to later. Final words At this point, you’ve read everything I have to offer on drafting engineering strategy. I hope this has refined your view on what strategy can be in your organization, and has given you the tools to draft a more thoughtful future for your corner of the software engineering industry. What I’d never ask is for you to wholly agree with my ideas here. They are my best thinking on this topic, but strategy is a topic where I’m certain Hegel’s world view is the correct one: even the best ideas here are wrong in interesting ways, and will be surpassed by better ones.

2 days ago 2 votes