A tale of two rewrites

Jamie Zawinski is kind of a tech legend. He came up with the name “Mozilla”, invented that whole thing where you can send HTML in emails, and more. In his harrowing work diary of how Mosaic/Netscape came to be, Jamie described the burnout rodeo that was Mosaic development (the top disclaimer has its own history - ignore it):

“I slept at work again last night; two and a half hours curled up in a quilt underneath my desk, from 11am to 1:30pm or so. That was when I woke up with a start, realizing that I was late for a meeting we were scheduled to have to argue about colormaps and dithering, and how we should deal with all the nefarious 8-bit color management issues. But it was no big deal, we just had the meeting later. It’s hard for someone to hold it against you when you miss a meeting because you’ve been at work so long that you’ve passed out from exhaustion.”

Netscape’s wild ride is well-depicted in the dramatized Discovery mini-series Valley of the Boom, and the company eventually collapsed with the death-march rewrite of what seemed to be seriously unmaintainable code. It was the subject of one of the more famous articles by ex-Microsoft engineer and then entrepreneur Joel Spolsky - Things You Should Never Do. While the infamous Netscape codebase is long gone, the people that it enriched are still shaping the world to this day.

There have been big, successful rewrites. Twitter moved away from Ruby on Rails to the JVM over a decade ago, but the first, year-long full rewrite effort failed completely. The architecture had been dictated by fiat from the top, and the engineering team said nothing, speaking out only days before the launch: the whole thing would crash out of the gate, they claimed. Twitter had to go back to the drawing board and rewrite again.

What didn’t work for Netscape worked for Twitter. Why?
Netscape had major heat coming from ruthless Microsoft competition, very little time for major moves, and a team already exhausted from “office heroics”. Twitter, however, is a unique product that is incredibly hard to dislodge, even with almost purposefully incompetent and reckless management. It’s hard to abandon your social media account after accumulating algorithmic reputation and followers for years, and yet one can switch browsers faster than they can switch socks. Companies often do not survive this kind of adventure without an almost unfair moat. Those that do survive probably carry some battle scars.

Friendly Fire: Notify in Slack directly. Skip reviewers who are not available. File pattern matching. Individual code review reminders. No access to your codebase needed.

The road to hell is paved with TODO comments

All of this is to say that you should probably never let your system rot so badly that a code rewrite is even discussed. It never just happens. Your code doesn’t become unmaintainable overnight. It gets there through constant corner-cutting, hard-coding things, and crop-dusting your work with long-forgotten //FIXME comments. Fix who?

We used to call it technical debt - a term that is now frowned upon. The concept of “technical debt” got popular around the time we were getting obsessed with “proh-cess” and Agile, as we got tired of death march projects, arbitrary deadlines, and a general lack of structure and visibility into our work. Every software project felt like a tour - you came up for air and then went back into the 💩 for months. Agile meant that the stakeholders could be present in our planning meetings. We had to explain to them - somehow - that it took time to upgrade the web framework from v1 to v5 because no one had been using v1 for years, and in general, it slowed everyone down.
Since we didn’t know how to explain this to a non-coder, someone came up with the condescending “technical debt” - “those spreadsheet monkeys wouldn’t understand what we do here!” While “technical debt” has most likely run its course as a manipulative verbal device, it is absolutely the right term to use amongst ourselves to reason about risks and to properly triage them.

The three types of technical debt

The word “debt” has negative connotations for sure, but just like actual monetary debt, it’s never great but not always horrible. To mutilate the famous saying - you have to spend code to make code. I would categorize technical debt into three types - Aesthetic, Deferrable, and Toxic. A mark of a good engineer is knowing when to create technical debt, what kind of debt, and when to repay it.

Aesthetic debt

This is the kind of stuff that triggers your OCD but does not really affect your users or your velocity in any way. Maybe the imports are not sorted the way you want, or maybe a naming convention is grinding your gears. It’s something that can be addressed with relatively low effort when you are good and ready, in many cases with proper automated code analysis and tools.

Deferrable debt

Deferrable debt is what should be refactored at some point, but it’s fairly contained and will not be a problem in the immediate future. This is the kind of debt you minimize by methodically striking it off your list; as long as it steadily seeps into your sprint work, you can probably avoid a scenario where it all gets out of control. Sometimes this sort of thing is really contained - a lone hacky file, written in the Mesozoic Era by a sleep-deprived Jamie Zawinski because someone was breathing down his neck. No one really understands what the code does, but it’s been humming along for the last 7 years, so why take your chances by waking the sleeping dragons? Slap the Safety Pig on it, claim a victory, and go shake down a vending machine.
Toxic debt

This is the kind of debt that needs to be addressed before it’s too late. How do you identify “toxic” debt? It’s that thing you did half-way that has now become a workaround magnet. “We have to do it like this now until we fix it - someday”. The workarounds then become the foundation of new features, creating new and exciting debugging side quests. The future work required grows bigger with every new feature and line of code. This is toxic debt.

Lack of tests is toxic debt

Not having automated tests, or insufficient testing of critical paths, is tech debt in its own right. The more untested code you add, the more miserable your life is going to get over time. Tests are important for fighting the debt itself: it’s much easier to take a sledgehammer to your codebase when a solid integration test suite’s got your back. We don’t like it, it’s upfront work that slows us down, but at some point after your Minimal Viable Prototype starts running away from you, you need to switch into Test Mode and tie it all down - before things get really nasty.

Lack of documentation is toxic debt

I am not talking about a War & Peace-sized manual or detailed and severely out-of-date architecture diagrams in your Google Docs. Just a set of critical READMEs and runbooks on how to start the system locally and perform basic tasks. What variables and secrets do I need? What else do I need installed? If there is a bug report, how do I configure my local environment to reproduce it? And so on. The time taken to reverse-engineer a system every time has an actual dollar value attached to it, plus the opportunity cost of not doing useful work.

Put. It. In. A. Card.

I have been guilty of this myself. I love TODOs. They are easy to add without breaking the flow, and they are configured in my IDE to be bright and loud. It’s a TODO - I will do it someday. During the Annual TODO Week, obviously.
Let’s be frank - marking items as “TODO” is telling yourself that you should really do this thing, but probably never will. This matters because TODO items can represent any level of technical debt described above, so you should really turn them into actual stories on your Kanban/Agile boards.

Mark technical debt as such

You should be able to easily scan your “debt stories” and figure out which ones have payment due. This can be either a tag in your issue-tracking system or a column in your Kanban-style board like Trello. An approach like this will let you better gauge the ratio of new feature stories vs. the growing technical debt. Your debt column will never be empty - that goal is as futile as Inbox Zero - but it should never grow out of control either.

// TODO: conclusion
Introduction

Friendly Fire needs to periodically execute scheduled jobs - to remind Slack users to review GitHub pull requests. Instead of bolting on a new system just for this, I decided to leverage Postgres instead. The must-have requirement was the ability to schedule a job to run in the future, with workers polling for “ripe” jobs, executing them, and retrying on failure with exponential backoff. With SKIP LOCKED, Postgres has the needed functionality, allowing a single worker to atomically pull a job from the job queue without another worker pulling the same one.

This project is a demo of this system, slightly simplified. This example, available on GitHub, is a playground for the following:

How to set up a base Quart web app with Postgres using Poetry
How to process a queue of immediate and delayed jobs using only the database
How to retry failed jobs with exponential backoff
How to use custom decorators to ensure atomic HTTP requests (success - commit, failure - rollback)
How to use Pydantic for stricter Python models
How to use asyncpg and asynchronously query Postgres with connection pooling
How to test asyncio code using pytest and unittest.IsolatedAsyncioTestCase
How to manipulate the clock in tests using freezegun
How to use mypy, flake8, isort, and black to format and lint the code
How to use Make to simplify local commands

ALTER MODE SKIP COMPLEXITY

Postgres introduced SKIP LOCKED years ago, but recently there has been a noticeable uptick in interest around this feature, in particular regarding its obvious use for simpler queuing systems, allowing us to bypass libraries or maintenance-hungry third-party messaging systems. Why now? It’s hard to say, but my guess is that the tech sector is adjusting to the leaner times, looking for more efficient and cheaper ways of achieving the same goals at common-scale but with fewer resources. Or shall we say - reasonable resources.

What’s Quart?

Quart is the asynchronous version of Flask.
If you know about the g - the global request context - you will be right at home. Multiple quality frameworks have entered the Python-scape in recent years - FastAPI, Sanic, Falcon, Litestar. There are also Bottle and Carafe. Apparently, naming Python frameworks after liquid containers is now a running joke. Even though both Flask and Quart are now part of the Pallets project, Quart has been curiously devoid of hype. The two are in the process of being merged and at some point will become one framework - classic synchronous Flask and asynchronous Quart in one.

How it works

Writing at length about SKIP LOCKED would be redundant, as this has been covered plenty elsewhere - for example, in this article. Even more in-depth are these slides from 2016 PGCON. The central query looks like this:

```sql
DELETE FROM job
WHERE id = (
    SELECT id
    FROM job
    WHERE ripe_at IS NULL OR [current_time_argument] >= ripe_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING *, id::text
```

Each worker is added as a background task that periodically queries the database for “ripe” jobs (the ones ready to execute) and then runs the code for that specific job type. A job that does not have the “ripe” time set will be executed whenever a worker is available. A job that fails will be retried with exponential backoff, up to Job.max_retries times:

```python
next_retry_minutes = self.base_retry_minutes * pow(self.tries, 2)
```

Creating a job is simple:

```python
job: Job = Job(
    job_type=JobType.MY_JOB_TYPE,
    arguments={"user_id": user_id},
).runs_in(hours=1)

await jobq.service.job_db.save(job)
```

SKIP LOCKED and DELETE ... SELECT FOR UPDATE tango together to make sure that no two workers get the same job at the same time. To keep things interesting, at the Postgres level we have an MD5-based auto-generated column to make sure that no job of the same type and with the same arguments gets queued up more than once.
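The retry formula above grows quadratically with the attempt count. As a standalone sketch (the function name and the 5-minute base are mine for illustration, not the project’s):

```python
def next_retry_minutes(base_retry_minutes: int, tries: int) -> int:
    """Backoff delay grows with the square of the attempt count."""
    return base_retry_minutes * pow(tries, 2)

# With a 5-minute base, retries land at 5, 20, 45, 80... minutes out.
schedule = [next_retry_minutes(5, tries) for tries in range(1, 5)]
print(schedule)  # [5, 20, 45, 80]
```

The delay between attempts keeps widening, so a persistently failing job backs off quickly instead of hammering the queue.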
This project also demonstrates the usage of custom DB transaction decorators in order to have cleaner transaction notation:

```python
@write_transaction
@api.put("/user")
async def add_user():
    # DB write logic

@read_transaction
@api.get("/user")
async def get_user():
    # DB read logic
```

A request (or a function) annotated with one of these decorators will run in an atomic transaction until it exits, and will be rolled back if it fails.

At shutdown, the “stop” flag in each worker is set, and the server waits until all the workers complete their sleep cycles, peacing out gracefully:

```python
async def stop(self):
    for worker in self.workers:
        worker.request_stop()

    while not all([w.stopped for w in self.workers]):
        logger.info("Waiting for all workers to stop...")
        await asyncio.sleep(1)

    logger.info("All workers have stopped")
```

Testing

The test suite leverages unittest.IsolatedAsyncioTestCase (Python 3.8 and up) to grant us access to asyncSetUp() - this way we can call await in our test setup functions:

```python
async def asyncSetUp(self) -> None:
    self.app: Quart = create_app()
    self.ctx: quart.ctx.AppContext = self.app.app_context()
    await self.ctx.push()

    self.conn = await asyncpg.connect(...)
    db.connection_manager.set_connection(self.conn)
    self.transaction = self.conn.transaction()
    await self.transaction.start()

async def asyncTearDown(self) -> None:
    await self.transaction.rollback()
    await self.conn.close()
    await self.ctx.pop()
```

Note that we set up the database only once for our test class. At the end of each test, the transaction is rolled back, returning the database to its pristine state for the next test. This is a speed trick to make sure we don’t have to run database setup code every single time. In this case it doesn’t really matter, but in a large enough test suite, this is going to add up.
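The project’s real decorators are async and built on asyncpg; here is a minimal synchronous sketch of the same commit-on-success, roll-back-on-failure idea, using the standard library’s sqlite3 (all names are illustrative):

```python
import sqlite3
from functools import wraps

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT)")
conn.commit()

def write_transaction(func):
    """Commit if the wrapped function succeeds, roll back if it raises."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
            conn.commit()
            return result
        except Exception:
            conn.rollback()
            raise
    return wrapper

@write_transaction
def add_user(name: str) -> None:
    conn.execute("INSERT INTO user (name) VALUES (?)", (name,))
    if not name:
        raise ValueError("name required")  # triggers the rollback path

add_user("jamie")       # committed
try:
    add_user("")        # the INSERT is rolled back
except ValueError:
    pass

count = conn.execute("SELECT count(*) FROM user").fetchone()[0]
print(count)  # 1
```

The failed call leaves no trace in the table - exactly the property you want from a request-level transaction wrapper.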
For delayed jobs, we simulate the future by freezing the clock at a specific time (relative to now):

```python
# jump to the FUTURE
with freeze_time(now + datetime.timedelta(hours=2)):
    ripe_job = await jobq.service.job_db.get_one_ripe_job()

assert ripe_job
```

Improvements

Batching - pulling more than one job at once would add major dragonforce to this system. It is not part of the example so as not to overcomplicate it. You just need to be careful to return the failed jobs to the queue while deleting the completed ones. With enough workers, a system like this could really be capable of handling serious common-scale workloads.

Server exit - there are less-than-trivial ways of interrupting worker sleep cycles, which could improve the experience of running the service locally. In its current form, you have to wait a few seconds until all worker loops get out of sleep() and read the STOP flag.
A MySQL war story

It’s 2006, and the New York Magazine digital team set out to create a new search experience for its Fashion Week portal. It was one of those projects where technical feasibility was not even discussed with the tech team - a common occurrence back then. Agile was still new, let alone in publishing. It was just a vision, a real friggin’ moonshot, and 10 to 12 weeks to develop the wireframed version of the product. There would be almost no time left for proper QA. Fashion Week does not start slowly but rather goes from zero to sixty in a blink.

The vision? Thousands of near-real-time fashion show images, each one with its sub-items categorized: “2006”, “bag”, “red”, “leather”, and so on. A user would land on the search page and have the ability to “drill down” and narrow the results based on those properties. To make things much harder, all of these properties would come with exact counts.

The workflow was going to be intense. Photographers would courier their digital cartridges from downtown NYC to our offices on Madison Avenue, where the images would be processed, tagged by interns, and then indexed every hour by our Perl script, reading the tags from the embedded EXIF information. Failure to build the search product on our side would have collapsed the entire ecosystem already in place, primed and ready to rumble.

“Oh! Just use the facets in Solr, dude”. Yeah, not so fast - dude. In 2006 that kind of technology didn’t even exist yet. I sat through multiple enterprise search engine demos with our CTO, and none of the products (which cost a LOT of money) could do a deep faceted search. We already had an Autonomy license, and my first try proved that… it just couldn’t do it. It was supposed to be able to, but the counts were all wrong. Endeca (now owned by Oracle) came out of stealth when the design part of the project was already underway. Too new, too raw, too risky.
The idea was just a little too ambitious for its time, especially for a tiny team in a non-tech company. So here we were, a team of three - myself and two consultants - writing Perl for the indexing script and the query-parsing logic, and modeling the data in MySQL 4. It was one of those projects where one single insurmountable technical risk would have sunk the whole thing. I will cut the story short and spare you the excitement. We did it, and then we went out to celebrate at a karaoke bar (where I got my very first work-stress-related severe hangover) 🤮

As the one in charge of the SQL model and queries, I spent days and days tuning them, timing every query and studying the EXPLAIN output to see what else I could do to squeeze another 50ms out of the database. There were no free nights or weekends. In the end, it was a combination of trial and error, digging deep into MySQL server settings, and crafting GROUP BY queries that would make you nauseous. The MySQL query analyzer was fidgety back then, and sometimes re-arranging the fields in the SELECT clause could change a query’s performance. Imagine if SELECT field1, field2 FROM my_table was faster than SELECT field2, field1 FROM my_table. Why would it do that? I have no idea to this day, and I don’t even want to know.

Unfortunately, I lost examples of this work, but the Wayback Machine has proof of our final product. The point here is - if you really know your database, you can do pretty crazy things with it, and with the modern generation of storage technologies and beefier hardware, you don’t even need to push the limits - it should easily handle what I refer to as “common-scale”.
The fading art of SQL

In the past few years I have noticed an unsettling trend - software engineers are eager to use exotic “planet-scale” databases for pretty rudimentary problems, while at the same time not having a good grasp of the very powerful relational database engine they are likely already using, let alone its more advanced and useful capabilities. The SQL layer is buried so deep beneath libraries and too-clever-by-half ORMs that it all just becomes high-level code. Why is it slow? No idea - let's add Cassandra to it!

Modern hardware certainly allows us to move way up from the CPU into the higher abstraction layers, while in the past it wasn’t uncommon to convert certain functions to assembly code in order to squeeze every bit of performance out of the processor. Compute and storage are cheaper now - it’s true - but abusing this abundance has trained us into laziness and complacency. Suddenly, that Cloud bill is a wee bit too high, and heaven knows how much energy the world is burning by running billions of these inefficient ORM queries every second against mammoth database instances.

The morning of my first job interview in 2004, I was on a subway train memorizing the nine levels of database normalization. Or is it five levels? I don’t remember, and it doesn’t even matter - no one will ever ask you this now in a software engineer interview. Just skimming through the table of contents of your database of choice - say, the freshly-back-in-vogue Postgres - you will find an absolute treasure trove of features fit to handle everything but the most gruesome planet-scale computer science problems.
Petabyte-sized replicated Postgres boxes are effortlessly running as you are reading this. The trick is to not expect your database or your ORM to read your mind. Speaking of…

ORMs are the frenemy

I was a new hire at an e-commerce outfit once, and right off the bat I was thrown into fixing serious performance issues with the company’s product catalog pages. Just a straightforward, paginated grid of product images. How hard could it be? Believe it or not - it be. The pages took over 10 seconds to load, sometimes longer, the database was struggling, and the proposed solution was to “just cache it”. One last data point - this was not a high-traffic site. The pages were dead-slow even when there was no traffic at all. That’s a rotten sign that something is seriously off. After looking a bit closer, I realized that I had hit the motherlode - all top three major database and coding mistakes in one.

❌ Mistake #1: There is no index

The column that was hit in every single mission-critical query had no index. None. After adding the much-needed index in production, you could practically hear MySQL exhaling in relief. Still, the performance was not quite there yet, so I had to dig deeper, now in the code.

❌ Mistake #2: Assuming each ORM call is free

Activating the query logs locally and reloading a product listing page, I saw… 200, 300, 500 queries fired off just to load one single page. What the shit? Turns out, this was the result of the classic ORM abuse of going through every record in a loop, to the effect of:

```python
for product_id in product_ids:
    product = amazing_orm.products.get(id=product_id)
    products.append(product)
```

The high number of queries was also due to the fact that some of this logic was nested. The obvious solution is to keep the number of queries in each request to a minimum, leveraging SQL to join and combine the data into one single blob. This is what relational databases do - it’s in the name.
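The fix for a loop like that is one query with an IN clause - one round trip instead of hundreds. Sketched here with sqlite3 (the real thing was an ORM over MySQL, and any decent ORM has a filter(id__in=...)-style equivalent; the schema is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price_cents INTEGER);
    INSERT INTO product VALUES (1, 'Mug', 900), (2, 'Shirt', 1900), (3, 'Hat', 1500);
""")

product_ids = [1, 3]

# One round trip instead of len(product_ids) separate queries,
# and only the columns the page actually needs.
placeholders = ",".join("?" for _ in product_ids)
products = conn.execute(
    f"SELECT id, name FROM product WHERE id IN ({placeholders}) ORDER BY id",
    product_ids,
).fetchall()
print(products)  # [(1, 'Mug'), (3, 'Hat')]
```

The placeholder string is built only from "?" characters, never from the values themselves, so this stays safe from SQL injection.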
Each separate query needs to travel to the database, get parsed, transformed, analyzed, planned, and executed, and then the results travel back to the caller. It is one of the most expensive operations you can do, and ORMs will happily do the worst possible thing for you in terms of performance. One wonders what those algorithm and data structure interview questions are good for, considering you are far more likely to run into a sluggish database call than a B-tree implementation (a common structure used for database indexes).

❌ Mistake #3: Pulling in the world

To make matters worse, the amount of data here was relatively small, but there were dozens and dozens of columns. What do ORMs usually do by default in order to make your life “easier”? They send the whole thing, all the columns, clogging your network pipes with data that you don’t even need. It is a form of toxic technical debt, where the speed of development eventually starts eating into performance. I spent hours within the same project hacking the dark corners of the Django admin, overriding default ORM queries to be less “eager”. This led to a much better office-facing experience.

Performance IS a feature

Serious, mission-critical systems have been running on classic and boring relational databases for decades, serving thousands of requests per second. These systems have become more advanced, more capable, and more relevant. They are wonders of computer science, one can claim. You would think that a database as ancient as Postgres (in development since 1982) would be in some kind of legacy maintenance mode at this point, but the opposite is true. In fact, the work has only been accelerating, with the scale and features becoming quite impressive. What took multiple queries just a few years ago now takes a single one. Why is this significant? It has been known for a long time, as discovered by Amazon, that every additional 100ms a user waits for a page to load loses a business money.
We also know now that from a user’s perspective, the maximum target response time for a web page is around 100 milliseconds:

“A delay of less than 100 milliseconds feels instant to a user, but a delay between 100 and 300 milliseconds is perceptible. A delay between 300 and 1,000 milliseconds makes the user feel like a machine is working, but if the delay is above 1,000 milliseconds, your user will likely start to mentally context-switch.”

The “just add more CPU and RAM if it’s slow” approach may have worked for a while, but many are finding out the hard way that this kind of laziness is not sustainable in a frugal business environment where costs matter.

Database anti-patterns

Knowing what not to do is as important as knowing what to do. Some of the below mistakes are all too common:

❌ Anti-pattern #1. Using exotic databases for the wrong reasons

Technologies like DynamoDB are designed to handle scale at which Postgres and MySQL begin to fail. This is achieved by denormalizing and duplicating the data aggressively, so that the database is not doing much real-time data manipulation or joining. Your data is now modeled after how it is queried, not after how it is related. Regular relational concepts disintegrate at this insane level of scale. Needless to say, if you are resorting to this kind of storage for “common-scale” problems, you are solving problems you don’t have.

❌ Anti-pattern #2. Caching things unnecessarily

Caching is a necessary evil - but it’s not always necessary. There is an entire class of bugs and on-call issues that stem from stale cached data. Read-only database replicas are a classic architecture pattern that is still very much not outdated, and it will buy you insane levels of performance before you have to worry about anything. It should not be a surprise that mature relational databases already have query caching in place - it just has to be tuned for your specific needs. Cache invalidation is hard.
It adds more complexity and states of uncertainty to your system, and it makes debugging more difficult. Throughout my career I have received more emails than I care to count from content teams wondering: “why is the data not there? I updated it 30 minutes ago!” Caching should not act as a band-aid for bad architecture and non-performant code.

❌ Anti-pattern #3. Storing everything and a kitchen sink

As much punishment as an industry-standard database can take, it’s probably not a good idea to not care at all about what’s going into it, treating it like a data landfill of sorts. Management, querying, backups, migrations - all of it becomes painful once the DB grows substantially. Even if that is of no concern because you are using a managed cloud DB, the costs should be. An RDBMS is a sophisticated piece of technology, and storing data in it is expensive.

Figure out common-scale first

It is fairly easy to make a beefy Postgres or MySQL database grind to a halt if you expect it to do magic without any extra work. “It’s not web-scale, boss. Our 2 million records seem to be too much of a lift. We need DynamoDB, Kafka, and event sourcing!” A relational database is not some antiquated technology that only us tech fossils choose to be experts in, a thing that can be waved off like an annoying insect. “Here we React and GraphQL all the things, old man”. In legal speak, a modern RDBMS is innocent until proven guilty, and the burden of proof should be extremely high - and almost entirely on you.

Finally, if I have to figure out “why it’s slow”, my approximate runbook is:

Compile a list of unique queries, from logging, the slow query log, etc.
Look at the most frequent queries first
Use EXPLAIN to check slow query plans for index usage
Select only the data that needs to travel across the wire
If an ORM is doing something silly without a workaround, pop the hood and get dirty with the raw SQL plumbing

Most importantly, study your database (and SQL). Learn it, love it, use it, abuse it.
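The EXPLAIN step of that runbook, in miniature. Postgres and MySQL have EXPLAIN; sqlite3’s equivalent is EXPLAIN QUERY PLAN, which is enough to show the before-and-after of adding an index (the table and index names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, sku TEXT)")

def plan(sql: str) -> str:
    # The last column of an EXPLAIN QUERY PLAN row describes the strategy.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[-1]

before = plan("SELECT * FROM product WHERE sku = 'A-1'")
print(before)  # a full table scan, e.g. 'SCAN product'

conn.execute("CREATE INDEX idx_product_sku ON product (sku)")
after = plan("SELECT * FROM product WHERE sku = 'A-1'")
print(after)   # an index lookup, e.g. 'SEARCH product USING INDEX idx_product_sku (sku=?)'
```

The exact plan wording varies by SQLite version, but the shift from SCAN to an index SEARCH is the thing to look for - the same signal EXPLAIN gives you in the bigger databases.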
Spending a couple of days just leafing through that Postgres manual to see what it can do will probably make you a better engineer than spending more time on the next flavor-of-the-month React hotness. Again.
The Church of Complexity

There is a pretty well-known sketch in which an engineer explains to a project manager how an overly complicated maze of microservices works in order to get a user’s birthday - and fails to do so anyway. The scene accurately describes the absurdity of the current tech culture. We laugh, and yet bringing this up in a serious conversation is tantamount to professional heresy, rendering you borderline un-hirable. How did we get here? How did our aim become not addressing the task at hand but instead setting a pile of cash on fire by solving problems we don’t have?

Trigger warning: Some people understandably got salty when I name-checked JavaScript and NodeJS as a source of the problem, but my point really was more about the dangers of hermetically sealed software ecosystems that seem hell-bent on re-learning the lessons we had just finished learning. We ran into the complexity wall before and reset - otherwise we'd still be using CORBA and SOAP. These air-tight developer bubbles are a wrecking ball on the entire industry, and it takes about a full decade for it to swing.

The perfect storm

There are a few events in recent history that may have contributed to the current state of things. First, a whole army of developers writing JavaScript for the browser started self-identifying as “full-stack”, diving into server development and asynchronous code. JavaScript is JavaScript, right? What difference does it make what you create using it - user interfaces, servers, games, or embedded systems. Right? Node was still kind of a one-person learning project, and early JavaScript was a deeply problematic choice for server development. Pointing this out to still-green server-side developers usually resulted in a lot of huffing and puffing. This was all they knew, after all.
The world outside of Node effectively did not exist, the Node way was the only way, and so this was the genesis of the stubborn, dogmatic thinking that we are dealing with to this day.

And then a steady stream of FAANG veterans started merging into the river of startups, mentoring the newly-minted and highly impressionable young JavaScript server-side engineers. The apostles of the Church of Complexity would assertively claim that “how they did things over at Google” was unquestionably correct - even if it made no sense in the given context and at the given size. What do you mean you don’t have a separate User Preferences Service? That just will not scale, bro!

But it’s easy to blame the veterans and the newcomers for all of this. What else was happening? Oh yeah - easy money. What do you do when you are flush with venture capital? You don’t go for revenue, surely! On more than one occasion I received an email from management asking everyone to be in the office, tidy up their desks, and look busy, as a clowder of Patagonia vests was about to be paraded through the space. Investors needed to see explosive growth - not in profitability, no. They just needed to see how quickly the company could hire ultra-expensive software engineers to do… something.

And now that you have these developers, what do you do with them? Well, they could build a simpler system that is easier to grow and maintain, or they could conjure up a monstrous constellation of “microservices” that no one really understands. Microservices - the new way of writing scalable software!

Are we just going to pretend that the concept of “distributed systems” never existed? (Let’s skip the whole parsing of nuances about microservices not being real distributed systems.) Back when the tech industry was not such a bloated farce, distributed systems were respected, feared, and generally avoided - reserved as the weapon of last resort for particularly gnarly problems.
Everything with a distributed system becomes more challenging and time-consuming - development, debugging, deployment, testing, resilience. But I don’t know - maybe it’s all super easy now because toooollling. There is no standard tooling for microservices-based development - there is no common framework. Working on distributed systems has gotten only marginally easier in the 2020s. The Dockers and the Kuberneteses of the world did not magically take away the inherent complexity of a distributed setup. I love referring to this summary of 5 years of startup audits, as it is packed with common-sense conclusions: … the startups we audited that are now doing the best usually had an almost brazenly ‘Keep It Simple’ approach to engineering. Cleverness for cleverness sake was abhorred. On the flip side, the companies where we were like ”woah, these folks are smart as hell” for the most part kind of faded. Generally, the major foot-gun that got a lot of places in trouble was the premature move to microservices, architectures that relied on distributed computing, and messaging-heavy designs. Literally - “complexity kills”. The audit revealed an interesting pattern: many startups experienced a sort of collective imposter syndrome while building straightforward, simple, performant systems. There is a dogma attached to not starting out with microservices on day one - no matter the problem. “Everyone is doing microservices, yet we have a single Django monolith maintained by just a few engineers, and a MySQL instance - what are we doing wrong?” The answer is almost always “nothing”. Likewise, seasoned engineers very often experience hesitation and a sense of inadequacy in today’s tech world, and the good news is that, no - it’s probably not you. It’s common for teams to pretend they are doing “web scale”, hiding behind libraries, ORMs, and cache - confident in their expertise (they crushed that Leetcode!), yet they may not even be aware of database indexing basics. 
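Those basics are cheap to check, too. Here is a minimal sketch using Python’s built-in sqlite3 (the table, data, and index name are invented for illustration) showing the same query going from a full table scan to an index lookup:

```python
import sqlite3

# A toy table standing in for any "web scale" workload.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(10_000)],
)

query = "SELECT id FROM users WHERE email = ?"

# Without an index, the query planner scans the whole table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN " + query, ("user9999@example.com",)
).fetchall()[0][3]

# One line later, the same query becomes an index search.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN " + query, ("user9999@example.com",)
).fetchall()[0][3]

print(plan_before)  # e.g. "SCAN users"
print(plan_after)   # e.g. "SEARCH users USING COVERING INDEX idx_users_email (email=?)"
```

The point is not the syntax - it is the habit of asking what the database is actually doing before reaching for more services.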
You are operating in a sea of unjustified overconfidence, waste, and Dunning-Kruger, so who is really the imposter here? Renegade Otter is the developer of Friendly Fire - Smarter pull request assignment for GitHub: Slack notifications Out-of-office support File matching Try it now. There is nothing wrong with a monolith The idea that you cannot grow without a system that looks like the infamous Afghanistan war strategy slide is a myth. Dropbox, Twitter, Netflix, Facebook, GitHub, Instagram, Shopify, StackOverflow - these companies and others started out as monolithic code bases. Many have a monolith at their core to this day. StackOverflow makes it a point of pride how little hardware they need to run their massive site. Shopify is still a Rails monolith, leveraging the tried-and-true Resque to process billions of tasks. WhatsApp went supernova with their Erlang monolith and a relatively small team. How? WhatsApp consciously keeps the engineering staff small - only about 50 engineers. Individual engineering teams are also small, consisting of 1 - 3 engineers, and teams are each given a great deal of autonomy. In terms of servers, WhatsApp prefers to use a smaller number of servers and vertically scale each server to the highest extent possible. Instagram was acquired for billions - with a crew of 12. And do you imagine Threads as an effort involving a whole Meta campus? Nope. They followed the Instagram model, and the Threads team was just as small. Perhaps claiming that your particular problem domain requires a massively complicated distributed system and an open office stuffed to the gills with turbo-geniuses is crossing over into arrogance rather than brilliance? Don’t solve problems you don’t have It’s a simple question - what problem are you solving? Is it scale? How do you know how to break it all up for scale and performance? Do you have enough data to show what needs to be a separate service and why? 
Distributed systems are built for size and resilience. Can your system scale and be resilient at the same time? What happens if one of the services goes down or slows to a crawl? Just scale it up? What about the other services that are going to get hit with traffic? Did you war-game the endless permutations of things that can and will go wrong? Is there backpressure? Circuit breakers? Queues? Jitter? Sensible timeouts on every endpoint? Are there foolproof guards to make sure a simple change does not bring everything down? The knobs you need to be aware of and tune are endless, and they are all specific to your system’s particular signature of usage and load. The truth is that most companies will never reach the massive size that would actually require building a true distributed system. Cosplaying Amazon and Google - without their scale, expertise, and endless resources - is very likely just an egregious waste of money and time. Religiously following all the steps from an article called “Ten morning habits of very successful people” is not going to make you a billionaire. The only thing harder than a distributed system is a BAD distributed system. “But each team… but separate… but API” Trying to shove a distributed topology into your company’s structure is a noble effort, but it almost always backfires. It’s a common approach to break up a problem into smaller pieces and then solve those one by one. So, the thinking goes, if you break one service into multiple ones, everything becomes easier. The theory is sweet and elegant - each microservice is maintained rigorously by a dedicated team, walled off behind a beautiful, backward-compatible, versioned API. In fact, this foundation is so solid that you rarely even have to communicate with that team - as if the microservice were maintained by a third-party vendor. It’s simple! If that doesn’t sound familiar, that’s because it rarely happens. 
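To make the earlier list of knobs concrete, here is what just one of them looks like - a retry helper with capped exponential backoff and full jitter. This is a sketch, not a prescription: the names and defaults are invented, and a real system would layer timeouts, circuit breaking, and metrics on top of it.

```python
import random
import time

def call_with_backoff(op, attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.uniform):
    """Retry a flaky remote call with capped exponential backoff and full jitter.

    Jitter spreads retries out in time so a fleet of clients does not
    hammer a recovering service in lockstep (the "thundering herd").
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:  # real code would catch specific errors
            last_error = exc
            if attempt == attempts - 1:
                break
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng(0, delay))  # full jitter: anywhere in [0, delay]
    raise last_error

# Simulate a service that fails twice, then recovers.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"

result = call_with_backoff(flaky, sleep=lambda _: None)  # skip real sleeping
print(result)  # "ok", after three attempts
```

And that is one knob, on one call site. Multiply it across every service-to-service edge and “the knobs are endless” stops sounding like hyperbole.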
In reality, our Slack channels are flooded with messages from teams communicating about releases, bugs, configuration updates, breaking changes, and PSAs. Everyone needs to be on top of everything, all the time. And as if that weren’t enough, it’s normal for one already-slammed team to half-ass multiple microservices instead of doing a great job on a single one, often changing ownership as people come and go. In order to win the race, we don’t build one good race car - we build a fleet of shitty golf carts. What you lose There are multiple pitfalls to building with microservices, and often that minefield is either not fully appreciated or simply ignored. Teams spend months writing highly customized tooling and learning lessons not related at all to the core product. Here are just some often-overlooked aspects… Say goodbye to DRY After decades of teaching developers to write Don’t Repeat Yourself code, it seems we just stopped talking about it altogether. Microservices are not DRY by default, with every service stuffed with redundant boilerplate. Very often the overhead of such “plumbing” is so heavy, and the size of the microservices is so small, that the average instance of a service has more “service” than “product”. So what about the common code that can be factored out? Have a common library? How does the common library get updated? Keep different versions everywhere? Force updates regularly, creating dozens of pull requests across all repositories? Keep it all in a monorepo? That comes with its own set of problems. Allow for some code duplication? Forget it - each team gets to reinvent the wheel every time. Each company going this route faces these choices, and there are no good “ergonomic” options - you have to choose your version of the pain. Developer ergonomics will crater “Developer ergonomics” is the friction, the amount of effort a developer must go through in order to get something done, be it working on a new feature or resolving a bug. 
With microservices, an engineer has to have a mental map of the entire system in order to know what services to bring up for any particular task, what teams to talk to, whom to talk to, and about what. The “you have to know everything before doing anything” principle. How do you keep on top of it? Spotify, a multi-billion-dollar company, spent probably non-negligible internal resources to build Backstage, software for cataloging its endless systems and services. This should at least give you a clue that this game is not for everyone, and the price of the ride is high. So what about the toooooling? The Not Spotifies of the world are left with MacGyvering their own solutions, the robustness and portability of which you can probably guess. And how many teams actually streamline the process of starting a YASS - “yet another stupid service”? This includes: Developer privileges in GitHub/GitLab Default environment variables and configuration CI/CD Code quality checkers Code review settings Branch rules and protections Monitoring and observability Test harness Infrastructure-as-code And of course, multiply this list by the number of programming languages used throughout the company. Maybe you have a usable template or a runbook? Maybe a frictionless, one-click system to launch a new service from scratch? It takes months to iron out all the kinks with this kind of automation. So, you can either work on your product, or you can be working on toooooling. Integration tests - LOL As if the everyday microservices grind was not enough, you also forfeit the peace of mind offered by solid integration tests. Your single-service and unit tests are passing, but are your critical paths still intact after each commit? Who is in charge of the overall integration test suite, in Postman or wherever else? Is there one? Integration testing a distributed setup is a nearly impossible problem, so we pretty much gave up on that and replaced it with another one - Observability. 
Just like “microservices” are the new “distributed systems”, “observability” is the new “debugging in production”. Surely, you are not writing real software if you are not doing… observability! Observability has become its own sector, and you will pay both a pretty penny and developer time for it. It doesn’t come as plug-and-play either - you need to understand and implement canary releases, feature flags, and so on. Who is doing that? One already overwhelmed engineer? As you can see, breaking up your problem does not make solving it easier - all you get is another set of even harder problems. What about just “services”? Why do your services need to be “micro”? What’s wrong with just services? Some startups have gone as far as creating a service for each function, and yes, “isn’t that just like Lambda?” is a valid question. This gives you an idea of how far gone this unchecked cargo cult is. So what do we do? Starting with a monolith is one obvious choice. A pattern that could also work in many instances is “trunk & branches”, where the main “meat and potatoes” monolith is helped by “branch” services. A branch service is one that takes care of a clearly identifiable and separately scalable load. A CPU-hungry Image-Resizing Service makes way more sense than a User Registration Service. Or do you get so many registrations per second that they require independent horizontal scaling? Side note: In version control, back in the days of CVS and Subversion, we rarely used "master" branches. We had "trunk and branches" because, you know - *trees*. "Master" branches appeared somewhere along the way, and when GitHub decided to do away with the rather unfortunate naming convention, the average engineer was too young to remember "trunk" - and so the generic "main" default came to be. The pendulum is swinging back The hype, however, seems to be dying down. 
The VC cash faucet is tightening, and so businesses have been market-corrected into making common-sense decisions, recognizing that splurging on web-scale architectures when they don’t have web-scale problems is not sustainable. Ultimately, when faced with the need to travel from New York to Philadelphia, you have two options. You can either attempt to construct a highly intricate spaceship for an orbital descent to your destination, or you can simply purchase an Amtrak train ticket for a 90-minute ride. That is the problem at hand. Additional reading & listening How to recover from microservices You want modules, not microservices XML is the future Gasp! You might not need microservices Podcast: How we keep Stack Overflow’s codebase clean and modern Goodbye Microservices: From 100s of problem children to 1 superstar It’s the future
More in programming
I’ve written about how I don’t love the idea of overriding basic computing controls. Instead, I generally favor respecting user choice and providing the controls their platform offers. Of course, this means platforms need to surface better primitives rather than supplying basic ones with an ability to opt out. What am I even talking about? Let me give an example. The WebKit team just shipped a new API for <input type=color> which gives users the ability to pick colors with wide-gamut P3 and alpha transparency. The entire API is just a little bit of declarative HTML: <label> Select a color: <input type="color" colorspace="display-p3" alpha> </label> From that simple markup (on iOS) you get this beautiful, robust color picker. That’s a great color picker, and if you’re choosing colors on iOS and encountering this particular UI a lot, that’s even better — like, “Oh hey, I know how to use this thing!” With a picker like that, how many folks really want additional APIs to override that interface and style it themselves? This is the kind of better platform defaults I’m talking about. A little bit of HTML markup, and boom, a great interface to a common computing task that’s tailored to my device and uniform in appearance and functionality across the websites and applications I use. What more could I want? You might want more, like shoving your brand down my throat, but I really don’t need to see BigFinanceCorp Green™️ as a themed element in my color or date picker. If I could give HTML an aspirational slogan, it would be something along the lines of Mastercard’s old one: There are a few use cases platform defaults can’t solve; for everything else, there’s HTML.
Today’s my last day at Carta, where I got the chance to serve as their CTO for the past two years. I’ve learned so much working there, and I wanted to end my chapter there by collecting my thoughts on what I learned. (I am heading somewhere, and will share news in a week or two after firming up the communication plan with my new team there.) The most important things I learned at Carta were: Working in the details – if you took a critical lens towards my historical leadership style, I think the biggest issue you’d point at is my being too comfortable operating at a high level of abstraction. Utilizing the expertise of others to fill in your gaps is a valuable skill, but–like any single approach–it’s limiting when utilized too frequently. One of the strengths of Carta’s “house leadership style” is expecting leaders to go deep into the details to get informed and push pace. What I practiced there turned into the pieces on strategy testing and developing domain expertise. Refining my approach to engineering strategy – over the past 18 months, I’ve written a book on engineering strategy (posts are all in #eng-strategy-book), with initial chapters becoming available for early release with O’Reilly next month. Fingers crossed, the book will be released in approximately October. Coming into Carta, I already had much of my core thesis about how to do engineering strategy, but Carta gave me a number of complex projects to practice on, and excellent people to practice with: thank you to Dan, Shawna and Vogl in particular! More on this project in the next few weeks. Extract the kernel – everywhere I’ve ever worked, teams have struggled to understand executives. In every case, the executives could be clearer, but it’s not particularly interesting to frame these problems as something the executives need to fix. Sure, it’s true they could communicate better, but that framing makes you powerless, when you have a great deal of power to understand confusing communication. 
After all, even good communicators communicate poorly sometimes. Meaningfully adopting LLMs – a year ago I wrote up notes on adopting LLMs in your products, based on what we’d learned so far. Since then, we’ve learned a lot more, and LLMs themselves have significantly improved. Carta has been using LLMs in real, business-impacting workflows for over a year. That’s continuing to expand into solving more complex internal workflows, and even more interestingly into creating net-new product capabilities that ought to roll out more widely in the next few months (currently released to small beta groups). This is the first major technology transition that I’ve experienced in a senior leadership role (I was earlier in my career when mobile internet transitioned from novelty to commodity). The immense pressure to adopt faster, combined with the immense uncertainty about whether it’s a meaningful change or a brief blip, was a lot of fun, and was the inspiration for this strategy document around LLM adoption. Multi-dimensional tradeoffs – a phrase that Henry Ward uses frequently is that “everyone’s right, just at a different altitude.” That idea resonates with me, and meshes well with the ideas of multi-dimensional tradeoffs and layers of context that I find improve decision making for folks in roles that require making numerous, complex decisions. Working at Carta, these ideas formalized from something I intuited into something I could explain clearly. Navigators – I think our most successful engineering strategy at Carta was rolling out the Navigator program, which ensured our senior-most engineers had context and direct representation, rather than relying exclusively on indirect representation via engineering management. Carta’s engineering managers are excellent, but there’s always something lost as discussions extend across layers. 
The Navigator program probably isn’t a perfect fit for particularly small companies, but I think any company with more than 100-150 engineers would benefit from something along these lines. How to create software quality – I’ve evolved my thinking about software quality quite a bit over time, but Carta was particularly helpful in distinguishing why some pieces of software are so hard to build despite having little-to-no scale from a data or concurrency perspective. These systems, which I label as “high essential complexity”, deserve more credit for their complexity, even if they have little in the way of complexity from infrastructure scaling. Shaping eng org costs – a few years ago, I wrote about my mental model for managing infrastructure costs. At Carta, I got to refine my thinking about engineering salary costs, with most of those ideas getting incorporated in the Navigating Private Equity ownership strategy, and the eng org seniority mix model. The three biggest levers are (1) “N-1 backfills”, (2) requiring a business rationale for promotions into senior-most levels, and (3) shifting hiring into cost-efficient hiring regions. None of these are the sort of inspiring topics that excite folks, but they are all essential to the long-term stability of your organization. Explaining engineering costs to boards/execs – similarly, I finally have a clear perspective on how to represent R&D investment to boards in the same language that they speak in, which I wrote up here, and know how to do it quickly without relying on any manually curated internal datasets. Lots of smaller stuff, like the no wrong doors policy for routing colleagues to appropriate channels, how to request headcount in a way that is convincing to executives, Act Two rationales for how people’s motivations evolve over the course of long careers (and my own personal career mission to advance the industry), and why friction isn’t velocity even though many folks act like it is. 
I’ve also learned quite a bit about venture capital, fund administration, cap tables, non-social network products, operating a multi-business line company, and various operating models. Figuring out how to sanitize those learnings to share the interesting tidbits without leaking internal details is a bit too painful, so I’m omitting them for now. Maybe some will be shareable in four or five years after my context goes sufficiently stale. As a closing thought, I just want to say how much I’ve appreciated the folks I’ve gotten to work with at Carta. From the executive team (Ali, April, Charly, Davis, Henry, Jeff, Nicole, Vrushali) to my directs (Adi, Ciera, Dan, Dave, Jasmine, Javier, Jayesh, Karen, Madhuri, Sam, Shawna) to the navigators (there’s a bunch of y’all). The people truly are always the best part, and that was certainly true at Carta.
Some major updates to our open-source Automerge library, an introduction to Sketchy Calendars, and a peek at our work on collaborative game development. Also some meta content—a refreshed website, and a talk about how we work.
Test UI outcomes, not API requests. Mock network calls in setup, but assert on what users actually see and experience, not implementation details.
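A minimal sketch of the idea in Python - the “UI” here is just a rendered string and every name is hypothetical, but the principle carries to any framework: stub the network call in setup, then assert only on what the user would see.

```python
def render_profile(fetch_user):
    """A tiny 'view': turns whatever the network layer returns into UI."""
    user = fetch_user()
    return f"<h1>{user['name']}</h1><p>{user['followers']} followers</p>"

def test_profile_shows_name_and_followers():
    # Setup: mock the network call...
    fake_fetch = lambda: {"name": "Ada", "followers": 42}
    html = render_profile(fake_fetch)
    # ...then assert on the outcome the user experiences,
    assert "Ada" in html
    assert "42 followers" in html
    # not on implementation details like URLs, headers, or call counts.

test_profile_shows_name_and_followers()
```

If the view later switches libraries, caches the request, or batches it with others, this test keeps passing as long as the user still sees the right thing - which is exactly the point.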
Do you feel that the number of applications needed to land a role has skyrocketed? If so, your instincts are correct. According to a Workday Global Workforce Report in September 2024, job applications are growing at a rate four times faster than job openings. This growth is fuelled by a tight job market as well as the new availability of remote work and online job boards. It’s also one of the results of improved generative AI. Around half of all job seekers use AI tools to create their resumes or fill out applications. More than that, a 2024 survey found that 29 percent of applicants were using AI tools to complete skills tests, while 26 percent employed AI tools to mass apply to positions, regardless of fit or qualifications. This never-before-seen flood of applications poses new hardships for both job candidates and recruiters. Candidates must ensure that their applications stand out enough from the pile to receive a recruiter’s attention. Recruiters, meanwhile, are struggling to manage the sheer number of resumes they receive, and winnow through heaps of irrelevant or unqualified applicants to find the ones they need. These problems worsen if you’re an overseas candidate hoping to find a role in Japan. Japan is a popular country for migrants, thereby increasing the competition for each open position. In addition, recruiters here have set expectations and criteria, some of which can be triggered unknowingly by candidates unfamiliar with the Japanese market. With all this in mind, how can you ensure your resume stands out from the crowd—and is there anything else you can do to pass the screening stage? I interviewed nine recruiters, both external and in-house, to learn how applicants can increase their chances of success. Below are their detailed suggestions on improving your resume, avoiding Japan-specific red flags, and persisting even in the face of rejection. 
The competition The first questions I asked each recruiter were: How many resumes do you review in a month? How long does it take you to review a resume? Some interviewees work for agencies or independently, while others are employed by the companies they screen applicants for. Surprisingly, where they work doesn’t consistently affect how many resumes they receive. What does affect their numbers is whether they accept candidates from overseas. One anonymous contributor stated the case plainly: “The volume of applications depends on whether the job posting targets candidates in Japan or internationally.” In Japan: we receive around 20–100+ applications within the first three days. Outside of Japan: a single job posting can attract 200–1,000 applications within three days. ”[Because] we are generally only open to current residents of Japan, our total applicant count is around 100 or so in a month,” said Caleb McClain, who is both a Senior Software Engineer and a hiring manager at Lunaris. “In the past, when we accepted applications from abroad it was much higher, though I unfortunately don’t have stats for that period. It was unmanageable for a single person (me) reviewing the applications, though! “Given that I deal with 100 or so per month, I probably spend a bit more time than others screening applications, but it depends. I’ll give every candidate a quick read through within a minute or so and, if I didn’t find a reason to immediately reject them, I’ll spend a few more minutes reading about their experience more deeply. I’ll check out the companies they have listed for their experience if I’m not familiar with them and, if they have a Github or personal projects listed, I’ll also spend a few minutes checking those out.” For companies that accept overseas candidates, the workload is greater. 
Laine Takahashi, a Talent Acquisition employee at HENNGE, estimated that every month they receive around 200 completed applications for engineering mid-career roles and 270 applications for their Global Internship program. Since their application process starts with a coding test as well as a resume and cover letter, it can take up to two weeks to review, score, and respond to each application. Clement Chidiac, Senior Technical Recruiter at Mercari, explained that the number of resumes he reviews monthly varies widely. “As an example, one of the current roles I am working on received 250+ applications in three weeks. Typically a recruiter at Mercari can work from 5–20 positions at a time, so this gives you an idea.” He also said that his initial quick scan of each resume might take between 5–30 seconds. External recruiters process resumes at a similar rate. Edmund Ho, Principal Consultant for Talisman Corporation, works with around 15 clients a month. To find them, he looks at 20–30 resumes a day, or 600–700 a month, and can only spend 30 seconds to 2 minutes on each one before coming to a decision. Axel Algoet, founder and CEO of InnoHyve, only reviews 200 resumes a month—but “if you count LinkedIn profiles, it’s probably around 1,000.” Why LinkedIn? “I usually start by looking at LinkedIn—the companies they’ve worked at and the roles they’ve had,” Algoet explained. “From there, I can quickly tell whether I’m open to talking with them or not. Since I focus on a very specific segment of roles, I can rapidly identify if a candidate might be a fit for my clients.” Applicant Tracking Systems (ATS) Given the sheer volume of resumes to review and respond to, it’s not surprising that companies are using Applicant Tracking Systems. What’s more unexpected is how few recruiters personally use an ATS or AI when evaluating candidates. 
Both Ho and Algoet reported that though a high percentage of their clients use an ATS—as many as 90 percent, according to Ho—they themselves don’t use one. Ho in particular emphasized that he manually reads every resume he receives. Lunaris doesn’t use an ATS, “unless you count Notion,” joked McClain. “Open to recommendations!” Koji Hamane, Vice President of Human Resources at KOMOJU, said, “Up to 2023, we were managing the pipeline on a spreadsheet basis, and you cannot do it anymore with 3,000 applications [a year]. So it’s more effective and efficient in terms of tracking where each applicant sits in the recruiting process, but it also facilitates communication among [the members of] the interview panel.” The ATS KOMOJU uses is Workable. “Workable, I mean, you know, it works,” Hamane joked. “It’s much better than nothing. . . . Workable actually shows the valid points of the candidates, highlights characteristics, and evaluates the fit for the required positions, like from a 0 to 100 point basis. It helps, but actually you need to go through the details anyway, to properly assess the candidates.” Chidiac explained that Mercari also uses Workable, which has a feature that matches keywords from the job description to the resume, giving the resume a score. “I’ve never made a decision based on that,” said Chidiac. “It’s an indicator, but it’s not accurate enough yet to use it as a decision-making tool.” For example, it doesn’t screen out non-Japanese speakers when Japanese is a requirement for the role. I think these [ATS] tools are going to be better, and they’re going to work. I think it’s a good idea to help junior recruiters. But I think it has to be used as a ‘decision helper,’ not a decision-making tool. There’s also an element of ethics—do you want to be screened out by a robot? HENNGE uses a different ATS, Greenhouse, mostly to communicate with candidates and send them the results of their application. 
“Everything they submit,” said Sonam Choden, HENNGE’s Software Engineer Recruiter, “is actually manually checked by somebody in our team. It’s not that everything is automated for the coding test—the bot only checks if they meet the minimum score. Then there is another [human] screener that will actually look over the test itself. If they pass the coding test, then we have another [human] screener looking through each and every document, both the resume and the cover letter.” How to format your resume The good news is that, according to our interviewees, passing the resume screening doesn’t involve trying to master ATS algorithms. However, since many recruiters are manually evaluating a high number of resumes every day, they can spend at most only a few minutes on each one. That’s why it’s critical to make your resume stand out positively from the rest. You can see tips on formatting and good practices in our article on the subject, but below recruiters offer detailed explanations of exactly what they’re looking for—and, importantly, what red flags lead to rejection. Red flags The biggest red flags called out by recruiters are frequent job changes, not having skills required by the position, applications from abroad when no visa support is available, mismatches in salary expectations, and lack of required Japanese language ability. Frequent job changes Jumpiness. Job-hopping. Career-switching. Although they had different names for it, nearly everyone listed frequent job changes as the number one red flag on a candidate’s resume—at least, when applying to jobs in Japan. “There’s a term HR in Japan uses: ‘Oh, this guy is jumpy,’” Clement Chidiac told me. When he asked what they meant by that, they told him it referred to a candidate who had only been in their last job for two years or less. “And my first reaction was like, ‘Is that a bad thing?’ I think in the US, and in most tech companies, people change over every two to three years. 
I remember at my university in France, I was told you need to change your job externally or internally every three years to grow. But in Japan, there’s still the element of loyalty, right?” It’s changing a little bit, but when I have a candidate, a good candidate, that has had four jobs in the past ten years, I know I’m going to get questioned. . . . If I get a candidate that’s changed jobs three times in the past three years, they’re not likely to pass the screening, especially if they’re overseas. “Which is fair, right?” he added. “Because it’s a bit expensive, it’s a bit of a risk, and [it takes] a bit of time.” Why do Japanese companies feel so strongly about this issue? Some of it is simply history—lifetime employment at a single company was the Japanese ideal until quite recently. But as Chidiac pointed out, hiring overseas candidates represents additional investments in both money and time spent navigating the visa system, so it makes sense for Japanese companies to move more cautiously when doing so. Sayaka Sasaki, who was previously employed as a Sourcing Specialist by Tech Japan Inc., told me that recruiters attempt to use past job history to foresee the future. “A lack of consistency in career history can also lead to rejection,” she said. “Recruiters can often predict a candidate’s future career plans and job-switching tendencies based on their past job-change patterns.” Koji Hamane has another reason for considering job tenure. “When you try to leave some achievement or visible impact, [you have to] take some time in the same job, in the same company. So from that perspective, the tenure of each position on a resume really matters. Even though you say, ‘I have this capability and I have this strength,’ your tenure at each company is very short, and [you] don’t leave an impact on those workplaces.” In this sense, Hamane is not evaluating loyalty for its own sake, but considering tenure as a variable to assess the reproducibility of meaningful achievement. 
For him, achievement and impact—rather than tenure length itself—are the true signals of qualities such as leadership and resilience. Long-time or regular freelancers may face similar scrutiny. Though Chidiac is reluctant to call freelancing a red flag, he acknowledged that it can cause problems. “[With] an engineer that’s been doing freelance for the past three or four years, I know I’m going to get pushback from the hiring team, because they might have worked on three-, four-, five-month projects. They might not have the depth of knowledge that companies on a large scale might want to hire.” Also my question is, if that person has been working on their own for three or four years, how are they going to work in the team? How long are they going to stay with us? Are they going to be happy being part of a company and then maybe having to come to the office, that kind of thing? He gave an example: “If you get 100 applicants for backend engineer roles, it’s sad, but you’re going to go with the ones that fit the most traditional background. If I’m hiring and I’m getting five candidates from PayPay . . . I might prioritize these people as opposed to a freelancer that’s based out of Spain and wants to relocate to Japan, because there are a lot of question marks. That’s the reality of the candidate pool. “Now, if the freelancer in Spain has the exact experience that I want, and I don’t have other applicants, then yeah, of course I’ll talk to that person. I’ll take time to understand [their reasons].” How to “fix” job-hopping on your resume If you have changed jobs frequently, is rejection guaranteed? Not necessarily. These recruiters also offered a host of tips to compensate for job-hopping, freelancing stints, or gaps in your work history. The biggest tip: include an explanation on your resume. 
Edmund Ho advises offering a “reason for leaving” for short-term jobs, defining short-term as “less than three years.” For example, if the job was a limited contract role, then labelling it as such will prevent Japanese companies from drawing the conclusion that you left prematurely. Lay-offs and failed start-ups will also be looked upon more benevolently than simply quitting. In addition, Ho suggested that those with difficult resumes avail themselves of an agent or recruiter. Since the recruiter will contact the company directly, they have the chance to advocate and explain your job history better than the resume alone can. Sasaki also feels that explanations can help, but added a caveat: “Being honest about what you did during a gap period is not a bad thing. However, it is important to present it in a positive light. For example, if you traveled abroad or spent time at your family home during the gap period, you could write something like this: ‘Once I start a new job, it will be difficult to take a long vacation. So, I took advantage of this break to visit [destination], which I had always dreamed of seeing. Experiencing [specific highlight] was a lifelong goal, and it helped me refresh myself while boosting my motivation for work.’ “If the gap period lasted for more than a year, it is necessary to provide a convincing explanation for the hiring manager. For instance, you could write, ‘I used this time to enhance my skills by studying [specific subject] and preparing for [certification].’ If you have actually obtained a qualification, that would be a perfect way to present your time productively.” Hamane answered the question quite differently. “Do you gamble?” he asked me. He went on: “When I say ‘gamble,’ ultimately recruiting is decision-making under uncertainty, right? It comes with risks.
But the most important question is, what are the downside risks and upside risks?” “In the game of hiring,” Hamane explained, “employers are looking for indicators of future performance. Tenure, to me, is not inherently valuable, but serves as a variable to assess whether a candidate had the opportunity to leave a meaningful impact. It’s not about loyalty or raw length of time, but about whether qualities like resilience or leadership had the chance to emerge. Those qualities often require time. However, I don’t judge the number of years on its own—what matters is whether there is evidence of real contributions.” A shorter tenure with clear impact can be just as strong a signal as longer service. That’s why I view tenure not categorically, but contextually—as one indicator among others. If possible, then, a candidate should focus on highlighting their work contributions and unique strengths in their resume, which can counterbalance the perceived “downside risk” of job-hopping. Incompatibility with the job description Most other red flags can be categorized as “incompatible with the job description.” This includes: Not possessing the required skills Applying from abroad when the position doesn’t offer visa support Mismatch in salary expectations Not speaking Japanese Many of the resumes recruiters receive are wholly unsuited for the position. Hamane estimated that 70 percent of the resumes his department reviews are essentially “random applications.” Almost all the applications are basically not qualified. One of the major reasons why is the Internet. The Internet enables us to apply for any job from anywhere, right? So there are so many applications with no required skills. . . . From my perspective, they are applying on a batch basis, like mass applications. Even if the candidate has the required job skills, if they’re overseas and the position doesn’t offer visa support, their resume almost certainly won’t pass. 
Caleb McClain, whose company is currently hiring only domestically, said, “The most common reason [for rejection] is the person is applying from abroad. . . . After that, if there’s just a clear skills mismatch, we won’t move forward with them.” Axel Algoet pointed out that nationality can be a problem even if the company is open to hiring from overseas. “I support many companies in the space, aerospace, and defense industries,” he said, “and they are not allowed to hire candidates from certain countries.” It’s important to understand any legal issues surrounding sensitive industries before applying, to save both your own and the company’s time. He also mentioned that, while companies do look for candidates with experience at top enterprises, a prestigious background can actually be a red flag—mostly in terms of compensation. Japanese tech companies on average pay lower wages than American businesses, and a mismatch in expectations can become a major stumbling block in the application process overall. “Especially [for] candidates coming from companies like Indeed or some foreign firms,” Algoet said, “if I know I won’t be able to match or beat their current salary, I tell them upfront.” Not speaking Japanese is another common stumbling block. Companies have different expectations of candidates when it comes to Japanese language ability. Algoet said that, although in his own niche Japanese often isn’t required at all, a Japanese level below JLPT N2 can be a problem for other roles. Sasaki agreed that speaking Japanese to at least the JLPT N3 level would open more doors. Anticipating potential rejection points If you can anticipate why recruiters might reject you, you can structure your resume accordingly, highlighting your strengths while deemphasizing any weak points. For example, if you don’t live in Japan but do speak Japanese, it’s important to bring attention to that fact.
“Something that’s annoying,” said Chidiac, “that I’m seeing a lot from a hiring manager point of view, is that they sort of anticipate or presume things. . . . ‘That person has only been in Japan for a year, they can’t speak Japanese.’ But there are some people that have been [going to] Japanese school back home.” That’s why he urges candidates to clearly state both their language ability and their connections to Japan in their resume whenever possible. Chidiac also mentioned seniority issues. “It’s important that you highlight any elements of seniority.” However, he added, “Seniority means different things depending on the environment.” That’s why context is critical in your resume. If you’ve worked for a company in another country or another industry, the recruiter may not intuitively know much about the scale or complexity of the projects you’ve worked on. Without offering some context—the size of the project, the size of the team, the technologies involved, etc.—it’s difficult for recruiters to judge. If you contextualize your projects properly, though, Chidiac believes that even someone with relatively few years of experience may still be viewed favorably for higher roles. If you’ve led a very strong project, you might have the seniority we want. Finally, Edmund Ho suggested an easy trick for those without a STEM degree: just put down the university you graduated from, and not your major. “It’s cheating!” he said with a chuckle. Green flags Creating a great resume isn’t just about avoiding pitfalls. Your resume may also be missing some of the green flags recruiters get excited to see, which can open doors or lead to unexpected offers. Niche skills Niche skills were cited by several as not only being valuable in and of themselves, but also being a great way to open otherwise closed doors. Even when the job description doesn’t call for your unusual ability or experience, it’s probably worth including them in your resume. 
“I’ll of course take into consideration the requirements as written in our current open listings,” said McClain, “as that represents the core of what we are looking for at any given time. However, I also try to keep an eye out for interesting individuals with skills or experience that may benefit us in ways we haven’t considered yet, or match well with projects that aren’t formally planned but we are excited about starting when we have the time or the right people.” Chidiac agrees that he takes special note of rare skills or very senior candidates on a resume. “We might be able to create an unseeable headcount to secure a rare talent. . . . I think it’s important to have that mindset, especially for niche areas. Machine learning is one that comes to mind, but it could also be very senior [candidates], like staff level or principal level engineers, or people coming from very strong companies, or people that solve problems that we want to solve at the moment, that kind of thing.” I call it the opportunistic approach, like the unusual path, but it’s important to have that in mind when you apply for a company, because you might not be a fit for a role now, but you might not be aware that a role is going to open soon. Sasaki pointed out that niche skills can compensate for an otherwise relatively weak resume, or one that would be bypassed by more traditional Japanese companies. “If the company you are applying to is looking for a niche skill set that only you possess, they will want to speak with you in an interview. So don’t lose hope!” Tailoring to the job description “I don’t think there’s a secret recipe to automatically pass the resume screening, because at the end of the day, you need to match the job, right?” said Chidiac. “But I’ve seen people that use the same resume for different roles, and sometimes it’s missing [relevant] experience or specific keywords. 
So I think it’s important to really read the job description and think about, ‘Okay, these are all the main skills they want. Let me highlight these in some way.’” If you’re a cloud infrastructure engineer, but you’ve done a lot of coding in the past, or you use a specific technology but it doesn’t show on your CV, you may be automatically rejected either by the recruiter or by the [ATS]. But if you make sure that, ‘Oh yeah, I’ve seen the need for coding skill. I’m going to add that I was a software engineer when I started and I’m doing coding on my side project,’ that will help you with the screening. It’s not necessary to entirely remake your resume each time, Chidiac believes, but you should at least ensure that at the top of the resume you highlight the skills that match the job description. Connections to Japan While most of this advice would be relevant anywhere in the world, recruiters did offer one additional tip for applying in Japan—emphasizing your connection to the country. “Whenever a candidate overseas writes a little thing about any ties to Japan, it usually helps,” said Chidiac. For example, he believes that it helps to highlight your Japanese language ability at the top of your resume. [If] someone writes like, ‘I want to come to Japan,’ ‘I’ve been going to Japanese school for the last five years,’ ‘I’ve got family in Japan,’ . . . that kind of stuff usually helps. Laine Takahashi confirmed that HENNGE shows extra interest in those kinds of candidates. “Either in the cover letter or the CV,” she said, “if they’re not living in Japan, we want them to write about their passion for coming to Japan.” Ho went so far as to state that every overseas candidate he’d helped land a job in Japan had either already learned some Japanese, or had an interest in Japanese culture. Tourists who’d just enjoyed traveling in Japan were less successful, he’d found. How important is a cover letter? 
Most recruiters had similar advice for candidates, but one serious point of contention arose: cover letters. Depending on their company and hiring style, interviewees’ opinions ranged widely on whether cover letters were necessary or helpful. Cover letters aren’t important “I was trying to remember the last time I read a cover letter,” said Clement Chidiac, “and I honestly don’t think I’ve ever screened an application based on the cover letter.” Instead, Mercari typically requests a resume and poses some screening questions. Chidiac thought this might be a controversial opinion to take, but it was echoed strongly by around half of the other interviewees. When applying to jobs in Japan, there’s no need to write a cover letter, Edmund Ho told me. “Companies in Japan don’t care!” He then added, “One company, HENNGE, uses cover letters. But you don’t need,” he advised, “to write a fancy cover letter.” “I never ask for cover letters,” said Axel Algoet. “Instead, I usually set up a casual twenty-minute call between the hiring manager and the candidate, as a quick intro to decide if it’s worth moving forward with the interview process.” Getting to skip the cover letter and go straight to an early-stage interview is a major advantage Algoet is able to offer his candidates. “That said,” he added, “if a candidate is rejected at the screening stage and I feel the client is making a mistake, we sometimes work on a cover letter together to give it another shot.” Cover letters are extremely important According to Sayaka Sasaki, though, Japanese companies don’t just expect cover letters—they read them quite closely. “Some people may find this hard to believe,” said Sasaki, “but many Japanese companies carefully analyze aspects of a candidate’s personality that cannot be directly read from the text of a cover letter. They expect to see respect, humility, enthusiasm, and sincerity reflected in the writing.” Such companies also expect, or at least hope for, brevity and clarity. 
“Long cover letters are not a good sign,” said Koji Hamane. “You need to be clear and concise.” He does appreciate cover letters, though, especially for junior candidates, who have less information on their resume. “It supplements [our knowledge of] the candidate’s objectives, and helps us to verify the fit between the candidate’s motivation and the job and the company.” Caleb McClain feels strongly that a good cover letter is the best way for a candidate to stand out from a crowd. “After looking at enough resumes,” he said, “you start to notice similarities and patterns, and as the resume screener I feel a bit of exhaustion over trying to pick out what makes a person unique or better-suited for the position than another.” A well-written and personal cover letter that expresses genuine interest in joining ‘our’ team and company and working on ‘our’ projects will make you stand out and, assuming you meet the requirements otherwise, I will take that interest into serious consideration. “For example,” McClain continued, “we had an applicant in the past who wrote about his experience using our e-commerce site, SolarisJapan, many years ago, and his positive impressions of shopping there. Others wrote about their interests which clearly align with our businesses, or about details from our TokyoDev company profile that appealed to them.” McClain urged candidates to “really tie your experience and interests into what the company does, show us why you’re the best fit! Use the cover letter to stand out in the crowd and show us who you are in ways that a standard resume cannot. If you have interesting projects on Github or blogs on technical topics, share them! 
But of course,” he added, “make sure they are in a state where you’d want others to read them.” What to avoid in your cover letter “However,” McClain also cautioned, “[cover letters are] a double-edged sword, and for as many times as they’ve caused an application to rise to the top, they’ve also sunk that many.” For this reason, it’s best not to attach a cover letter unless one is specifically requested. Since cover letters are extremely important to some recruiters, however, you should have a good one prepared in advance—and not one authored by an AI tool. “I sometimes receive cover letters,” McClain told me, “that are very clearly written by AI, even going so far as to leave the prompt in the cover letter. Others simply rehash points from their resume, which is a shame and feels like a waste. This is your chance to really sell yourself!” He wasn’t the only recruiter who frowned on using AI. “Avoid simply copying and pasting AI-generated content into your cover letter,” Sasaki advised. “At the very least, you should write the base structure yourself. Using AI to refine your writing is acceptable, but hiring managers tend to dislike cover letters that clearly appear to be AI-written.” Laine Takahashi and Sonam Choden at HENNGE have also received their share of AI-generated letters. Sometimes, Choden explained, the use of AI is blatantly obvious, because the places where the company or applicant’s name should be written aren’t filled out. That doesn’t mean they’re opposed to all use of AI, though. “[The screeners] do not have a problem with the usage of AI technology. It’s just that [you should] show a bit more of your personality,” Takahashi said. She thinks it’s acceptable to use AI “just for making the sentences a bit more pretty, for example, but the story itself is still yours.” A bigger mistake would be not writing a cover letter at all. 
“There are cases,” Takahashi explained, “where perhaps the candidate thought that we actually don’t look at or read the cover letter.” They sent the CV, and then the cover letter was like, ‘Whatever, you’re not going to read this anyway.’ That’s an automatic fail from our side. “We do understand,” said Choden, “that most developers now think cover letters are an outdated type of process. But for us, there is a lot of benefit in actually going through with the cover letter, because it’s really hard to judge someone by one piece like a resume, right? So the cover letter is perfect to supplement with things that you might not be able to express in a one-page CV.” Other tips for success The interviewees offered a host of other tips to help candidates advance in the application process. Recruiters vs job boards There are pros and cons to working with a recruiter as opposed to applying directly. Partnering with a recruiter can be a complex process in its own right, and candidates should not expect recruiters to guarantee a specific placement or job. Edmund Ho pointed out some of the advantages of working with a recruiter from the start of your job search. Not only can they help fix your resume, or call a company’s HR directly if you’re rejected, but these services are free. After all, external recruiters are paid only if they successfully place you with a company. Axel Algoet also recommended candidates find a recruiter, but he offered a few caveats to this general advice. “Many candidates are unaware of the candidate ownership rule—which means that when a recruiter submits your application, they ‘own’ it for the next 12–18 months. There’s nothing you can do about it after that point.” By that, he means that the agency you work with will be eligible for a fee if you are hired within that timeframe. Other agencies typically won’t submit your application if it is currently “owned” by another. 
This affects TokyoDev as well: if you apply to a company with a recruiter, and then later apply to another role at that company via TokyoDev within 12 months of the original application, the recruiter receives the hiring fee rather than TokyoDev. That’s why, Algoet said, you should make sure your recruiter is a good fit and can represent you properly. “If you feel they can’t,” he suggested, “walk away.” And if you have less than three years of experience, he suggests skipping a recruiter entirely. “Many companies don’t want to pay recruitment fees for junior candidates,” he added, “but that doesn’t mean they won’t hire you. Reach out to hiring managers directly.” From the internal recruiter’s perspective, Sonam Choden is in favor of candidates who come through job boards. “I think we definitely have more success with job boards where people are actively directly applying, rather than candidates from agents. In terms of the requirements, the candidates introduced by agents have the experience and what we’re looking for, but those candidates introduced by agents might not necessarily be looking for work, or even if they are . . . [HENNGE] might not be their first choice.” Laine Takahashi agreed and cited TokyoDev as one of HENNGE’s best sources for candidates. We’ve been using TokyoDev for the longest time . . . before the [other] job boards that we’re using now. I think TokyoDev was the one that gave us a good head start for hiring inside Japan. “And now we’re expanding to other job boards as well,” she said, “but still, TokyoDev is [at] the top, definitely.” Follow up Ho casually nailed the dilemma around sending a message or email to follow up on your application. “It’s always best to follow up if you don’t hear back,” he said, “but if you follow up too much, it’s irritating.” The question is, how much is too much? When is it too soon to message a recruiter or hiring manager? 
Ho gave a concrete suggestion: “Send a message after three days to one week.” For Chidiac, following up is a strategy he’s used himself to great effect. “Something that I’ve always done when I look for a job is ping people on LinkedIn, trying to anticipate who is the hiring manager for that role, or who’s the recruiter for that role, and say ‘Hey, I want to apply,’ or ‘I’ve applied.’” [I’ve said] ‘I know I might not be able to do this and this and that, but I’ve done this and this and this. Can we have a quick chat? Do you need me to tailor my CV differently? Do you have any other roles that you think would be a good fit?’ And then, follow up frequently. “This is something that’s important,” he added, “showing that you’ve researched about the company, showing that you’ve attended meetups from time to time, checking the [company] blogs as well. I’ve had people that just said, ‘Hey, I’ve seen on the blogs that you’re working on this. This is what I’ve done in my company. If you’re hiring [for] this team, let me know, right?’ So that could be a good tip to stand out from other applicants. [But] I think there’s no rule. It’s just going to be down to individuals.” “You might,” he continued, “end up talking to someone who’s like, ‘Hey, don’t ever contact me again.’ As an agency recruiter that happened to me, someone said, ‘How did you get my phone [number]? Don’t ever call me again.’ . . . [But] then a lot of the time it’s like, ‘Oh, we’re both French, let’s help each other out,’ or, ‘Oh, yeah, we were at the same university,’ or ‘Hey, I know you know that person.’” Chidiac gave a recent example of a highly-effective follow-up message. “He used to work in top US tech companies for the past 25 years. [After he applied to Mercari], the person messaged me out of the blue: ‘I’m in Japan, I’m semi-retired, I don’t care about money. I really like what Mercari is doing. I’ve done X and Y at these companies.’ . . . 
So yeah, I was like, I don’t have a role, but this is an exceptional CV. I’ll show it to the hiring team.” There are a few caveats to this advice, however. First, a well-researched, well-crafted follow-up message is necessary to stand out from the crowd—and these days, there is quite a crowd. “Oh my goodness,” Choden exclaimed when I brought up the subject. “I actually wanted to write a post on LinkedIn, apologizing to people for not being able to get back to them, because of the amount of requests to connect and all related to the positions that we have at HENNGE.” Takahashi and Choden explained that many of these messages are attempts to get around the actual hiring process. “Sometimes,” Choden said, “when I do have the time, I try to redirect them. ‘Oh, please, apply here, or go directly to the site,’ because we can’t really do anything, they have to start with the coding test itself. . . . I do look at them,” Choden went on, “and if they’re actually asking a question that I can help with, then I’m more than happy to reply.” Nonetheless, a few candidates have attempted to go over their heads. Sometimes we have some candidates who are asking for updates on their application directly from our CEO. It’s quite shocking, because they send it to his work email as well. “And then he’s like, ‘Is anybody handling this? Why am I getting this email?’,” Choden related. Other applicants have emailed random HENNGE employees, or even members of the overseas branch in Taiwan. Needless to say, such candidates don’t endear themselves to anyone on the hiring team. Be persistent “I know a bunch of people,” Chidiac told me, “that managed to land a job because they’ve tried harder going to meetups, reaching out to people, networking, that kind of thing.” One of those people was Chidiac himself, who in 2021 was searching for an in-house recruiter position in Japan, while not speaking Japanese. In his job hunt, Chidiac was well aware that he faced some major disadvantages. 
“So I went the extra mile by contacting the company directly and being like, ‘This is what I’ve done, I’ve solved these problems, I’ve done this, I’ve done that, I know the Japanese market . . . [but] I don’t speak Japanese.’” There’s a bit of a reality check that everyone has to have on what they can bring to the table and how much effort they need to [put forth]. You’re going to have to sell yourself and reach out and find your people. “Does it always work? No. Does it often work? No. But it works, right?” said Chidiac with a laugh. “Like five percent of the time it works every time. But you need to understand that there are some markets that are tougher than others.” Ho agreed that job-hunters, particularly candidates who are overseas hoping to work in Japan for the first time, face a tough road. He recommended applying to as many jobs as possible, but in a strictly organized way. “Make an Excel sheet for your applications,” he urged. Such a spreadsheet should track your applications, when you followed up on those applications, and the probation period for reapplying to that company when you receive a rejection. Most importantly, Ho believes candidates should maintain a realistic, but optimistic, view of the process. “Keep a longer mindset,” he suggested. “Maybe you don’t get an offer the first year, but you do the second year.” Conclusion Given the staggering number of applications recruiters must process, and the increasing competition for good roles—especially those open to candidates overseas—it’s easy to become discouraged. Nonetheless, Japan needs international developers. Given Japan’s demographics, as well as the government’s interest in implementing AI and digital transformation (DX) solutions for social problems, that fact won’t change anytime soon. We at TokyoDev suggest that candidates interested in working in Japan adopt two basic approaches. 
First, follow the advice in this article and also in our resume-writing guide to prevent your resume from being rejected for common flaws. You can highlight niche skills, write an original cover letter, and send appropriate follow-up messages to the recruiters and hiring managers you hope to impress. Second, persistence is key. The work culture in Japan is evolving and there are more openings for new candidates. Japan’s startup scene is also burgeoning, and modern tech companies—such as Mercari—continue to grow and hire. If your long-term goal is to work in Japan, then it’s worth investing the time to keep applying. That said, hopefully the suggestions offered above will help turn what might have been a lengthy job-hunt into a quicker and more successful search. To apply to open positions right now, see our job board. If you want to hear more tips from other international developers in Japan, check out the TokyoDev Discord. We also have articles with more advice on job hunting, relocating to Japan, and life in Japan.