More from Tyler Cipriani: blog
.title {text-wrap:balance;} #content > p:first-child {text-wrap:balance;} If Git had a nemesis, it’d be large files. Large files bloat Git’s storage, slow down git clone, and wreak havoc on Git forges. In 2015, GitHub released Git LFS—a Git extension that hacked around problems with large files. But Git LFS added new complications and storage costs. Meanwhile, the Git project has been quietly working on large files. And while LFS ain’t dead yet, the latest Git release shows the path towards a future where LFS is, finally, obsolete. What you can do today: replace Git LFS with Git partial clone Git LFS works by storing large files outside your repo. When you clone a project via LFS, you get the repo’s history and small files, but skip large files. Instead, Git LFS downloads only the large files you need for your working copy. In 2017, the Git project introduced partial clones that provide the same benefits as Git LFS: Partial clone allows us to avoid downloading [large binary assets] in advance during clone and fetch operations and thereby reduce download times and disk usage. – Partial Clone Design Notes, git-scm.com Git’s partial clone and LFS both make for: Small checkouts – On clone, you get the latest copy of big files instead of every copy. Fast clones – Because you avoid downloading large files, each clone is fast. Quick setup – Unlike shallow clones, you get the entire history of the project—you can get to work right away. What is a partial clone? A Git partial clone is a clone with a --filter. For example, to avoid downloading files bigger than 100KB, you’d use: git clone --filter='blobs:size=100k' <repo> Later, Git will lazily download any files over 100KB you need for your checkout. By default, if I git clone a repo with many revisions of a noisome 25 MB PNG file, then cloning is slow and the checkout is obnoxiously large: $ time git clone https://github.com/thcipriani/noise-over-git Cloning into '/tmp/noise-over-git'... ... Receiving objects: 100% (153/153), 1.19 GiB real 3m49.052s Almost four minutes to check out a single 25MB file! $ du --max-depth=0 --human-readable noise-over-git/. 1.3G noise-over-git/. $ ^ 🤬 And 50 revisions of that single 25MB file eat 1.3GB of space. But a partial clone side-steps these problems: $ git config --global alias.pclone 'clone --filter=blob:limit=100k' $ time git pclone https://github.com/thcipriani/noise-over-git Cloning into '/tmp/noise-over-git'... ... Receiving objects: 100% (1/1), 24.03 MiB real 0m6.132s $ du --max-depth=0 --human-readable noise-over-git/. 49M noise-over-git/ $ ^ 😻 (the same size as a git lfs checkout) My filter made cloning 97% faster (3m 49s → 6s), and it reduced my checkout size by 96% (1.3GB → 49M)! But there are still some caveats here. If you run a command that needs data you filtered out, Git will need to make a trip to the server to get it. So, commands like git diff, git blame, and git checkout will require a trip to your Git host to run. But, for large files, this is the same behavior as Git LFS. Plus, I can’t remember the last time I ran git blame on a PNG 🙃. Why go to the trouble? What’s wrong with Git LFS? Git LFS foists Git’s problems with large files onto users. And the problems are significant: 🖕 High vendor lock-in – When GitHub wrote Git LFS, the other large file systems—Git Fat, Git Annex, and Git Media—were agnostic about the server-side. But GitHub locked users to their proprietary server implementation and charged folks to use it.1 💸 Costly – GitHub won because it let users host repositories for free. But Git LFS started as a paid product. Nowadays, there’s a free tier, but you’re dependent on the whims of GitHub to set pricing. Today, a 50GB repo on GitHub will cost $40/year for storage. In contrast, storing 50GB on Amazon’s S3 standard storage is $13/year. 😰 Hard to undo – Once you’ve moved to Git LFS, it’s impossible to undo the move without rewriting history. 🌀 Ongoing set-up costs – All your collaborators need to install Git LFS. Without Git LFS installed, your collaborators will get confusing, metadata-filled text files instead of the large files they expect. The future: Git large object promisors Large files create problems for Git forges, too. GitHub and GitLab put limits on file size2 because big files cost more money to host. Git LFS keeps server-side costs low by offloading large files to CDNs. But the Git project has a new solution. Earlier this year, Git merged a new feature: large object promisers. Large object promisors aim to provide the same server-side benefits as LFS, minus the hassle to users. This effort aims to especially improve things on the server side, and especially for large blobs that are already compressed in a binary format. This effort aims to provide an alternative to Git LFS – Large Object Promisors, git-scm.com What is a large object promisor? Large object promisors are special Git remotes that only house large files. In the bright, shiny future, large object promisors will work like this: You push a large file to your Git host. In the background, your Git host offloads that large file to a large object promisor. When you clone, the Git host tells your Git client about the promisor. Your client will clone from the Git host, and automagically nab large files from the promisor remote. But we’re still a ways off from that bright, shiny future. Git large object promisors are still a work in progress. Pieces of large object promisors merged to Git in March of 2025. But there’s more to do and open questions yet to answer. And so, for today, you’re stuck with Git LFS for giant files. But once large object promisors see broad adoption, maybe GitHub will let you push files bigger than 100MB. The future of large files in Git is Git. The Git project is thinking hard about large files, so you don’t have to. Today, we’re stuck with Git LFS. But soon, the only obstacle for large files in Git will be your half-remembered, ominous hunch that it’s a bad idea to stow your MP3 library in Git. Edited by Refactoring English Later, other Git forges made their own LFS servers. Today, you can push to multiple Git forges or use an LFS transfer agent, but all this makes set up harder for contributors. You’re pretty much locked-in unless you put in extra effort to get unlocked.↩︎ File size limits: 100MB for GitHub, 100MB for GitLab.com↩︎
Any code of your own that you haven’t looked at for six or more months might as well have been written by someone else. – Eagleson’s Law After scouring git history, I found the correct config file, but someone removed it. Their full commit message read: Remove config. Don't bring it back. Very. helpful. But I get it; it’s hard to care about commit messages when you’re making a quick change. Git commit templates can help. Commit templates provide a scaffold for your commit messages, reminding you to answer questions like: What problem are you solving? Why is this the solution? What alternatives did you consider? Where can I read more? What is a git commit template? When you type git commit, git pops open your text editor1. Git can pre-fill your editor with a commit template—a form that reminds you of everything it’s easy to forget when writing a commit. Creating a commit template is simple. Create a plaintext file – mine lives at ~/.config/git/message.txt Tell git to use it: git config --global \ commit.template '~/.config/git/message.txt' My template packs everything I know about writing a commit. Project-specific templates Large projects, such as Linux kernel, git, and MediaWiki, have their own commit guidelines. Git templates can remind you about these per-project requirements if you add a commit template to a project’s .git/config file. Another way to do this is git’s includeIf configuration setting. includeIf lets you override git config settings when you’re working under directories you define. For example, all my Wikimedia work lives in ~/Projects/Wikimedia and at the bottom of my ~/.config/git/config I have: [includeIf "gitdir:~/Projects/Wikimedia/**"] path = ~/.config/git/config.wikimedia In config.wikimedia, I point to my Wikimedia-specific commit template (along with other necessary git settings: my user.email, core.hooksPath, and a pushInsteadOf url to push to ssh even when I clone via https). Forge-specific templates Personal git commit templates lead to better commits, which make for a better history. The forge-specific pull-request templates are a band-aid, the cheap kind that falls off in the shower. There’s no incentive for GitHub to make git history better: the worse your commit history, the more you rely on GitHub. Still, all the major pull-request-style forges let you foist a pull-request template on your contributors. As a contributor, I dislike filling those out—they add unnecessary friction. Commit message contents Your commit template allows you do the hard thinking upfront. Then, when you make a commit, you simply follow the template. My template asks questions I answer with my commit message: 72ch. wide -------------------------------------------------------- BODY # | # - Why should this change be made? | # - What problem are you solving? | # - Why this solution? | # - What's wrong with the current code? | # - Are there other ways to do it? | # - How can the reviewer confirm it works? | # | # ---------------------------------------------------------------- /BODY But other clever folks cooked up conventions you could incorporate: Conventional commits – how do your commits relate to semantic versioning? This makes it easier for SRE and downstream users. Problem/Solution format – first pioneered by ZeroMQ2, this format anticipates the questions of future developers and reviewers. Gitmoji – developed for the GitHub crowd, this format defines an emoji shorthand that makes it easy to spot changes of a particular type. Commit message formatting How you format text affects how people read it. My template also deals with text formatting rules3: Subject – 50 characters or less, capitalized, no end punctuation. Body – Wrap at 72 characters with a blank line separating it from the subject. Trailers – Standard formats with a blank line separating them from the body. People will read your commit in different contexts: git log, git shortlog, and git rebase. But git’s pager has no line wrapping by default. I hard wrap at 72 characters because that makes text easier to read in wide terminals.4 Finally, my template addresses trailers, reminding me about standard trailers supported in the projects I’m working on. Git can interpret trailers, which can be useful later. For example, if I wanted a tab-separated list of commits and their related tasks I could find that with git log: $ TAB=%x09 $ BUG_TRAILER='%(trailers:key=Bug,valueonly=true,separator=%x2C )' $ SHORT_HASH=%h $ SUBJ=%s $ FORMAT="${SHORT_HASH}${TAB}${BUG_TRAILER}${TAB}${GIT_SUBJ}" $ git log --topo-order --no-merges \ --format="$FORMAT" d2b09deb12f T359762 Rewrite Kurdish (ku) Latin to Arabic converter 28123a6a262 T332865 tests: Remove non-static fallback in HookRunnerTestBase 4e919a307a4 T328919 tests: Remove unused argument from data provider in PageUpdaterTest bedd0f685f9 objectcache: Improve `RESTBagOStuff::handleError()` 2182a0c4490 T393219 tests: Remove two data provider in RestStructureTest Git commit templates free your brain from remembering what you should write, allowing you to focus on the story you should tell. Your future self will thank you for the effort. Starting with core.editor in your git config, $VISUAL or $EDITOR in your shell, finally falling back to vi.↩︎ I think…↩︎ All cribbed from Tim Pope↩︎ Another story I’ve heard: a standard terminal allows 80 characters per line. git log indents commit messages with 4 spaces. A 72-character-per-line commit centers text on an 80-character-per-line terminal. To me, readability in modern terminals is a better reason to wrap than kowtowing to antiquated terminals.↩︎
.title { text-wrap: balance } Spending for October, generated by piping hledger → R Over the past six months, I’ve tracked my money with hledger—a plain text double-entry accounting system written in Haskell. It’s been surprisingly painless. My previous attempts to pick up real accounting tools floundered. Hosted tools are privacy nightmares, and my stint with GnuCash didn’t last. But after stumbling on Dmitry Astapov’s “Full-fledged hledger” wiki1, it clicked—eventually consistent accounting. Instead of modeling your money all at once, take it one hacking session at a time. It should be easy to work towards eventual consistency. […] I should be able to [add financial records] bit by little bit, leaving things half-done, and picking them up later with little (mental) effort. – Dmitry Astapov, Full-Fledged Hledger Principles of my system I’ve cobbled together a system based on these principles: Avoid manual entry – Avoid typing in each transaction. Instead, rely on CSVs from the bank. CSVs as truth – CSVs are the only things that matter. Everything else can be blown away and rebuilt anytime. Embrace version control – Keep everything under version control in Git for easy comparison and safe experimentation. Learn hledger in five minutes hledger concepts are heady, but its use is simple. I divide the core concepts into two categories: Stuff hledger cares about: Transactions – how hledger moves money between accounts. Journal files – files full of transactions Stuff I care about: Rules files – how I set up accounts, import CSVs, and move money between accounts. Reports – help me see where my money is going and if I messed up my rules. Transactions move money between accounts: 2024-01-01 Payday income:work $-100.00 assets:checking $100.00 This transaction shows that on Jan 1, 2024, money moved from income:work into assets:checking—Payday. The sum of each transaction should be $0. Money comes from somewhere, and the same amount goes somewhere else—double-entry accounting. This is powerful technology—it makes mistakes impossible to ignore. Journal files are text files containing one or more transactions: 2024-01-01 Payday income:work $-100.00 assets:checking $100.00 2024-01-02 QUANSHENG UVK5 assets:checking $-29.34 expenses:fun:radio $29.34 Rules files transform CSVs into journal files via regex matching. Here’s a CSV from my bank: Transaction Date,Description,Category,Type,Amount,Memo 09/01/2024,DEPOSIT Paycheck,Payment,Payment,1000.00, 09/04/2024,PizzaPals Pizza,Food & Drink,Sale,-42.31, 09/03/2024,Amazon.com*XXXXXXXXY,Shopping,Sale,-35.56, 09/03/2024,OBSIDIAN.MD,Shopping,Sale,-10.00, 09/02/2024,Amazon web services,Personal,Sale,-17.89, And here’s a checking.rules to transform that CSV into a journal file so I can use it with hledger: # checking.rules # -------------- # Map CSV fields → hledger fields[0] fields date,description,category,type,amount,memo,_ # `account1`: the account for the whole CSV.[1] account1 assets:checking account2 expenses:unknown skip 1 date-format %m/%d/%Y currency $ if %type Payment account2 income:unknown if %category Food & Drink account2 expenses:food:dining # [0]: <https://hledger.org/hledger.html#field-names> # [1]: <https://hledger.org/hledger.html#account-field> With these two files (checking.rules and 2024-09_checking.csv), I can make the CSV into a journal: $ > 2024-09_checking.journal \ hledger print \ --rules-file checking.rules \ -f 2024-09_checking.csv $ head 2024-09_checking.journal 2024-09-01 DEPOSIT Paycheck assets:checking $1000.00 income:unknown $-1000.00 2024-09-02 Amazon web services assets:checking $-17.89 expenses:unknown $17.89 Reports are interesting ways to view transactions between accounts. There are registers, balance sheets, and income statements: $ hledger incomestatement \ --depth=2 \ --file=2024-09_bank.journal Revenues: $1000.00 income:unknown ----------------------- $1000.00 Expenses: $42.31 expenses:food $63.45 expenses:unknown ----------------------- $105.76 ----------------------- Net: $894.24 At the beginning of September, I spent $105.76 and made $1000, leaving me with $894.24. But a good chunk is going to the default expense account, expenses:unknown. I can use the hleger aregister to see what those transactions are: $ hledger areg expenses:unknown \ --file=2024-09_checking.journal \ -O csv | \ csvcut -c description,change | \ csvlook | description | change | | ------------------------ | ------ | | OBSIDIAN.MD | 10.00 | | Amazon web services | 17.89 | | Amazon.com*XXXXXXXXY | 35.56 | l Then, I can add some more rules to my checking.rules: if OBSIDIAN.MD account2 expenses:personal:subscriptions if Amazon web services account2 expenses:personal:web:hosting if Amazon.com account2 expenses:personal:shopping:amazon Now, I can reprocess my data to get a better picture of my spending: $ > 2024-09_bank.journal \ hledger print \ --rules-file bank.rules \ -f 2024-09_bank.csv $ hledger bal expenses \ --depth=3 \ --percent \ -f 2024-09_checking2.journal 30.0 % expenses:food:dining 33.6 % expenses:personal:shopping 9.5 % expenses:personal:subscriptions 16.9 % expenses:personal:web -------------------- 100.0 % For the Amazon.com purchase, I lumped it into the expenses:personal:shopping account. But I could dig deeper—download my order history from Amazon and categorize that spending. This is the power of working bit-by-bit—the data guides you to the next, deeper rabbit hole. Goals and non-goals Why am I doing this? For years, I maintained a monthly spreadsheet of account balances. I had a balance sheet. But I still had questions. Spending over six months, generated by piping hledger → gnuplot Before diving into accounting software, these were my goals: Granular understanding of my spending – The big one. This is where my monthly spreadsheet fell short. I knew I had money in the bank—I kept my monthly balance sheet. I budgeted up-front the % of my income I was saving. But I had no idea where my other money was going. Data privacy – I’m unwilling to hand the keys to my accounts to YNAB or Mint. Increased value over time – The more time I put in, the more value I want to get out—this is what you get from professional tools built for nerds. While I wished for low-effort setup, I wanted the tool to be able to grow to more uses over time. Non-goals—these are the parts I never cared about: Investment tracking – For now, I left this out of scope. Between monthly balances in my spreadsheet and online investing tools’ ability to drill down, I was fine.2 Taxes – Folks smarter than me help me understand my yearly taxes.3 Shared system – I may want to share reports from this system, but no one will have to work in it except me. Cash – Cash transactions are unimportant to me. I withdraw money from the ATM sometimes. It evaporates. hledger can track all these things. My setup is flexible enough to support them someday. But that’s unimportant to me right now. Monthly maintenance I spend about an hour a month checking in on my money Which frees me to spend time making fancy charts—an activity I perversely enjoy. Income vs. Expense, generated by piping hledger → gnuplot Here’s my setup: $ tree ~/Documents/ledger . ├── export │ ├── 2024-balance-sheet.txt │ └── 2024-income-statement.txt ├── import │ ├── in │ │ ├── amazon │ │ │ └── order-history.csv │ │ ├── credit │ │ │ ├── 2024-01-01_2024-02-01.csv │ │ │ ├── ... │ │ │ └── 2024-10-01_2024-11-01.csv │ │ └── debit │ │ ├── 2024-01-01_2024-02-01.csv │ │ ├── ... │ │ └── 2024-10-01_2024-11-01.csv │ └── journal │ ├── amazon │ │ └── order-history.journal │ ├── credit │ │ ├── 2024-01-01_2024-02-01.journal │ │ ├── ... │ │ └── 2024-10-01_2024-11-01.journal │ └── debit │ ├── 2024-01-01_2024-02-01.journal │ ├── ... │ └── 2024-10-01_2024-11-01.journal ├── rules │ ├── amazon │ │ └── journal.rules │ ├── credit │ │ └── journal.rules │ ├── debit │ │ └── journal.rules │ └── common.rules ├── 2024.journal ├── Makefile └── README Process: Import – download a CSV for the month from each account and plop it into import/in/<account>/<dates>.csv Make – run make Squint – Look at git diff; if it looks good, git add . && git commit -m "💸" otherwise review hledger areg to see details. The Makefile generates everything under import/journal: journal files from my CSVs using their corresponding rules. reports in the export folder I include all the journal files in the 2024.journal with the line: include ./import/journal/*/*.journal Here’s the Makefile: SHELL := /bin/bash RAW_CSV = $(wildcard import/in/**/*.csv) JOURNALS = $(foreach file,$(RAW_CSV),$(subst /in/,/journal/,$(patsubst %.csv,%.journal,$(file)))) .PHONY: all all: $(JOURNALS) hledger is -f 2024.journal > export/2024-income-statement.txt hledger bs -f 2024.journal > export/2024-balance-sheet.txt .PHONY clean clean: rm -rf import/journal/**/*.journal import/journal/%.journal: import/in/%.csv @echo "Processing csv $< to $@" @echo "---" @mkdir -p $(shell dirname $@) @hledger print --rules-file rules/$(shell basename $$(dirname $<))/journal.rules -f "$<" > "$@" If I find anything amiss (e.g., if my balances are different than what the bank tells me), I look at hleger areg. I may tweak my rules or my CSVs and then I run make clean && make and try again. Simple, plain text accounting made simple. And if I ever want to dig deeper, hledger’s docs have more to teach. But for now, the balance of effort vs. reward is perfect. while reading a blog post from Jonathan Dowland↩︎ Note, this is covered by full-fledged hledger – Investements↩︎ Also covered in full-fledged hledger – Tax returns↩︎
Luckily, I speak Leet. – Amita Ramanujan, Numb3rs, CBS’s IRC Drama There’s an episode of the CBS prime-time drama Numb3rs that plumbs the depths of Dr. Joel Fleischman’s1 knowledge of IRC. In one scene, Fleischman wonders, “What’s ‘leet’”? “Leet” is writing that replaces letters with numbers, e.g., “Numb3rs,” where 3 stands in for e. In short, leet is like the heavy-metal “S” you drew in middle school: Sweeeeet. / \ / | \ | | | \ \ | | | \ | / \ / ASCII art version of your misspent youth. Following years of keen observation, I’ve noticed Git commit hashes are also letters and numbers. Git commit hashes are, as Fleischman might say, prime targets for l33tification. What can I spell with a git commit? DenITDao via orlybooks) With hexidecimal we can spell any word containing the set of letters {A, B, C, D, E, F}—DEADBEEF (a classic) or ABBABABE (for Mama Mia aficionados). This is because hexidecimal is a base-16 numbering system—a single “digit” represents 16 numbers: Base-10: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 16 15 Base-16: 0 1 2 3 4 5 6 7 8 9 A B C D E F Leet expands our palette of words—using 0, 1, and 5 to represent O, I, and S, respectively. I created a script that scours a few word lists for valid words and phrases. With it, I found masterpieces like DADB0D (dad bod), BADA55 (bad ass), and 5ADBAB1E5 (sad babies). Manipulating commit hashes for fun and no profit Git commit hashes are no mystery. A commit hash is the SHA-1 of a commit object. And a commit object is the commit message with some metadata. $ mkdir /tmp/BADA55-git && cd /tmp/BAD55-git $ git init Initialized empty Git repository in /tmp/BADA55-git/.git/ $ echo '# BADA55 git repo' > README.md && git add README.md && git commit -m 'Initial commit' [main (root-commit) 68ec0dd] Initial commit 1 file changed, 1 insertion(+) create mode 100644 README.md $ git log --oneline 68ec0dd (HEAD -> main) Initial commit Let’s confirm we can recreate the commit hash: $ git cat-file -p 68ec0dd > commit-msg $ sha1sum <(cat \ <(printf "commit ") \ <(wc -c < commit-msg | tr -d '\n') \ <(printf '%b' '\0') commit-msg) 68ec0dd6dead532f18082b72beeb73bd828ee8fc /dev/fd/63 Our repo’s first commit has the hash 68ec0dd. My goal is: Make 68ec0dd be BADA55. Keep the commit message the same, visibly at least. But I’ll need to change the commit to change the hash. To keep those changes invisible in the output of git log, I’ll add a \t and see what happens to the hash. $ truncate -s -1 commit-msg # remove final newline $ printf '\t\n' >> commit-msg # Add a tab $ # Check the new SHA to see if it's BADA55 $ sha1sum <(cat \ <(printf "commit ") \ <(wc -c < commit-msg | tr -d '\n') \ <(printf '%b' '\0') commit-msg) 27b22ba5e1c837a34329891c15408208a944aa24 /dev/fd/63 Success! I changed the SHA-1. Now to do this until we get to BADA55. Fortunately, user not-an-aardvark created a tool for that—lucky-commit that manipulates a commit message, adding a combination of \t and [:space:] characters until you hit a desired SHA-1. Written in rust, lucky-commit computes all 256 unique 8-bit strings composed of only tabs and spaces. And then pads out commits up to 48-bits with those strings, using worker threads to quickly compute the SHA-12 of each commit. It’s pretty fast: $ time lucky_commit BADA555 real 0m0.091s user 0m0.653s sys 0m0.007s $ git log --oneline bada555 (HEAD -> main) Initial commit $ xxd -c1 <(git cat-file -p 68ec0dd) | grep -cPo ': (20|09)' 12 $ xxd -c1 <(git cat-file -p HEAD) | grep -cPo ': (20|09)' 111 Now we have an more than an initial commit. We have a BADA555 initial commit. All that’s left to do is to make ALL our commits BADA55 by abusing git hooks. $ cat > .git/hooks/post-commit && chmod +x .git/hooks/post-commit #!/usr/bin/env bash echo 'L337-ifying!' lucky_commit BADA55 $ echo 'A repo that is very l33t.' >> README.md && git commit -a -m 'l33t' L337-ifying! [main 0e00cb2] l33t 1 file changed, 1 insertion(+) $ git log --oneline bada552 (HEAD -> main) l33t bada555 Initial commit And now I have a git repo almost as cool as the sweet “S” I drew in middle school. This is a Northern Exposure spin off, right? I’ve only seen 1:48 of the show…↩︎ or SHA-256 for repos that have made the jump to a more secure hash function↩︎
More in programming
Although it looks really good, I have not yet tried the Jujutsu (jj) version control system, mainly because it’s not yet clearly superior to Magit. But I have been following jj discussions with great interest. One of the things that jj has not yet tackled is how to do better than git refs / branches / tags. As I underestand it, jj currently has something like Mercurial bookmarks, which are more like raw git ref plumbing than a high-level porcelain feature. In particular, jj lacks signed or annotated tags, and it doesn’t have branch names that always automatically refer to the tip. This is clearly a temporary state of affairs because jj is still incomplete and under development and these gaps are going to be filled. But the discussions have led me to think about how git’s branches are unsatisfactory, and what could be done to improve them. branch merge rebase squash fork cover letters previous branch workflow questions branch One of the huge improvements in git compared to Subversion was git’s support for merges. Subversion proudly advertised its support for lightweight branches, but a branch is not very useful if you can’t merge it: an un-mergeable branch is not a tool you can use to help with work-in-progress development. The point of this anecdote is to illustrate that rather than trying to make branches better, we should try to make merges better and branches will get better as a consequence. Let’s consider a few common workflows and how git makes them all unsatisfactory in various ways. Skip to cover letters and previous branch below where I eventually get to the point. merge A basic merge workflow is, create a feature branch hack, hack, review, hack, approve merge back to the trunk The main problem is when it comes to the merge, there may be conflicts due to concurrent work on the trunk. Git encourages you to resolve conflicts while creating the merge commit, which tends to bypass the normal review process. Git also gives you an ugly useless canned commit message for merges, that hides what you did to resolve the conflicts. If the feature branch is a linear record of the work then it can be cluttered with commits to address comments from reviewers and to fix mistakes. Some people like an accurate record of the history, but others prefer the repository to contain clean logical changes that will make sense in years to come, keeping the clutter in the code review system. rebase A rebase-oriented workflow deals with the problems of the merge workflow but introduces new problems. Primarily, rebasing is intended to produce a tidy logical commit history. And when a feature branch is rebased onto the trunk before it is merged, a simple fast-forward check makes it trivial to verify that the merge will be clean (whether it uses separate merge commit or directly fast-forwards the trunk). However, it’s hard to compare the state of the feature branch before and after the rebase. The current and previous tips of the branch (amongst other clutter) are recorded in the reflog of the person who did the rebase, but they can’t share their reflog. A force-push erases the previous branch from the server. Git forges sometimes make it possible to compare a branch before and after a rebase, but it’s usually very inconvenient, which makes it hard to see if review comments have been addressed. And a reviewer can’t fetch past versions of the branch from the server to review them locally. You can mitigate these problems by adding commits in --autosquash format, and delay rebasing until just before merge. However that reintroduces the problem of merge conflicts: if the autosquash doesn’t apply cleanly the branch should have another round of review to make sure the conflicts were resolved OK. squash When the trunk consists of a sequence of merge commits, the --first-parent log is very uninformative. A common way to make the history of the trunk more informative, and deal with the problems of cluttered feature branches and poor rebase support, is to squash the feature branch into a single commit on the trunk instead of mergeing. This encourages merge requests to be roughly the size of one commit, which is arguably a good thing. However, it can be uncomfortably confining for larger features, or cause extra busy-work co-ordinating changes across multiple merge requests. And squashed feature branches have the same merge conflict problem as rebase --autosquash. fork Feature branches can’t always be short-lived. In the past I have maintained local hacks that were used in production but were not (not yet?) suitable to submit upstream. I have tried keeping a stack of these local patches on a git branch that gets rebased onto each upstream release. With this setup the problem of reviewing successive versions of a merge request becomes the bigger problem of keeping track of how the stack of patches evolved over longer periods of time. cover letters Cover letters are common in the email patch workflow that predates git, and they are supported by git format-patch. Github and other forges have a webby version of the cover letter: the message that starts off a pull request or merge request. In git, cover letters are second-class citizens: they aren’t stored in the repository. But many of the problems I outlined above have neat solutions if cover letters become first-class citizens, with a Jujutsu twist. A first-class cover letter starts off as a prototype for a merge request, and becomes the eventual merge commit. Instead of unhelpful auto-generated merge commits, you get helpful and informative messages. No extra work is needed since we’re already writing cover letters. Good merge commit messages make good --first-parent logs. The cover letter subject line works as a branch name. No more need to invent filename-compatible branch names! Jujutsu doesn’t make you name branches, giving them random names instead. It shows the subject line of the topmost commit as a reminder of what the branch is for. If there’s an explicit cover letter the subject line will be a better summary of the branch as a whole. I often find the last commit on a branch is some post-feature cleanup, and that kind of commit has a subject line that is never a good summary of its feature branch. As a prototype for the merge commit, the cover letter can contain the resolution of all the merge conflicts in a way that can be shared and reviewed. In Jujutsu, where conflicts are first class, the cover letter commit can contain unresolved conflicts: you don’t have to clean them up when creating the merge, you can leave that job until later. If you can share a prototype of your merge commit, then it becomes possible for your collaborators to review any merge conflicts and how you resolved them. To distinguish a cover letter from a merge commit object, a cover letter object has a “target” header which is a special kind of parent header. A cover letter also has a normal parent commit header that refers to earlier commits in the feature branch. The target is what will become the first parent of the eventual merge commit. previous branch The other ingredient is to add a “previous branch” header, another special kind of parent commit header. The previous branch header refers to an older version of the cover letter and, transitively, an older version of the whole feature branch. Typically the previous branch header will match the last shared version of the branch, i.e. the commit hash of the server’s copy of the feature branch. The previous branch header isn’t changed during normal work on the feature branch. As the branch is revised and rebased, the commit hash of the cover letter will change fairly frequently. These changes are recorded in git’s reflog or jj’s oplog, but not in the “previous branch” chain. You can use the previous branch chain to examine diffs between versions of the feature branch as a whole. If commits have Gerrit-style or jj-style change-IDs then it’s fairly easy to find and compare previous versions of an individual commit. The previous branch header supports interdiff code review, or allows you to retain past iterations of a patch series. workflow Here are some sketchy notes on how these features might work in practice. One way to use cover letters is jj-style, where it’s convenient to edit commits that aren’t at the tip of a branch, and easy to reshuffle commits so that a branch has a deliberate narrative. When you create a new feature branch, it starts off as an empty cover letter with both target and parent pointing at the same commit. Alternatively, you might start a branch ad hoc, and later cap it with a cover letter. If this is a small change and rebase + fast-forward is allowed, you can edit the “cover letter” to contain the whole change. Otherwise, you can hack on the branch any which way. Shuffle the commits that should be part of the merge request so that they occur before the cover letter, and edit the cover letter to summarize the preceding commits. When you first push the branch, there’s (still) no need to give it a name: the server can see that this is (probably) going to be a new merge request because the top commit has a target branch and its change-ID doesn’t match an existing merge request. Also when you push, your client automatically creates a new instance of your cover letter, adding a “previous branch” header to indicate that the old version was shared. The commits on the branch that were pushed are now immutable; rebases and edits affect the new version of the branch. During review there will typically be multiple iterations of the branch to address feedback. The chain of previous branch headers allows reviewers to see how commits were changed to address feedback, interdiff style. The branch can be merged when the target header matches the current trunk and there are no conflicts left to resolve. When the time comes to merge the branch, there are several options: For a merge workflow, the cover letter is used to make a new commit on the trunk, changing the target header into the first parent commit, and dropping the previous branch header. Or, if you like to preserve more history, the previous branch chain can be retained. Or you can drop the cover letter and fast foward the branch on to the trunk. Or you can squash the branch on to the trunk, using the cover letter as the commit message. questions This is a fairly rough idea: I’m sure that some of the details won’t work in practice without a lot of careful work on compatibility and deployability. Do the new commit headers (“target” and “previous branch”) need to be headers? What are the compatibility issues with adding new headers that refer to other commits? How would a server handle a push of an unnamed branch? How could someone else pull a copy of it? How feasible is it to use cover letter subject lines instead of branch names? The previous branch header is doing a similar job to a remote tracking branch. Is there an opportunity to simplify how we keep a local cache of the server state? Despite all that, I think something along these lines could make branches / reviews / reworks / merges less awkward. How you merge should me a matter of your project’s preferred style, without interference from technical limitations that force you to trade off one annoyance against another. There remains a non-technical limitation: I have assumed that contributors are comfortable enough with version control to use a history-editing workflow effectively. I’ve lost all perspective on how hard this is for a newbie to learn; I expect (or hope?) jj makes it much easier than git rebase.
In my post yesterday (“ARM is great, ARM is terrible (and so is RISC-V)), I described my desire to find ARM hardware with AES instructions to support full-disk encryption, and the poor state of the OS ecosystem around the newer ARM boards. I was anticipating buying either a newer ARM SBC or an x86 mini … Continue reading Performant Full-Disk Encryption on a Raspberry Pi, but Foiled by Twisty UARTs →
Debates, at their finest, are about exploring topics together in search for truth. That probably sounds hopelessly idealistic to anyone who've ever perused a comment section on the internet, but ideals are there to remind us of what's possible, to inspire us to reach higher — even if reality falls short. I've been reaching for those debating ideals for thirty years on the internet. I've argued with tens of thousands of people, first on Usenet, then in blog comments, then Twitter, now X, and also LinkedIn — as well as a million other places that have come and gone. It's mostly been about technology, but occasionally about society and morality too. There have been plenty of heated moments during those three decades. It doesn't take much for a debate between strangers on this internet to escalate into something far lower than a "search for truth", and I've often felt willing to settle for just a cordial tone! But for the majority of that time, I never felt like things might escalate beyond the keyboards and into the real world. That was until we had our big blow-up at 37signals back in 2021. I suddenly got to see a different darkness from the most vile corners of the internet. Heard from those who seem to prowl for a mob-sanctioned opportunity to threaten and intimidate those they disagree with. It fundamentally changed me. But I used the experience as a mirror to reflect on the ways my own engagement with the arguments occasionally felt too sharp, too personal. And I've since tried to refocus way more of my efforts on the positive and the productive. I'm by no means perfect, and the internet often tempts the worst in us, but I resist better now than I did then. What I cannot come to terms with, though, is the modern equation of words with violence. The growing sense of permission that if the disagreement runs deep enough, then violence is a justified answer to settle it. That sounds so obvious that we shouldn't need to state it in a civil society, but clearly it is not. Not even in technology. Not even in programming. There are plenty of factions here who've taken to justify their violent fantasies by referring to their ideological opponents as "nazis", "fascists", or "racists". And then follow that up with a call to "punch a nazi" or worse. When you hear something like that often enough, it's easy to grow glib about it. That it's just a saying. They don't mean it. But I'm afraid many of them really do. Which brings us to Charlie Kirk. And the technologists who name drinks at their bar after his mortal wound just hours after his death, to name but one of the many, morbid celebrations of the famous conservative debater's death. It's sickening. Deeply, profoundly sickening. And my first instinct was exactly what such people would delight in happening. To watch the rest of us recoil, then retract, and perhaps even eject. To leave the internet for a while or forever. But I can't do that. We shouldn't do that. Instead, we should double down on the opposite. Continue to show up with our ideals held high while we debate strangers in that noble search for the truth. Where we share our excitement, our enthusiasm, and our love of technology, country, and humanity. I think that's what Charlie Kirk did so well. Continued to show up for the debate. Even on hostile territory. Not because he thought he was ever going to convince everyone, but because he knew he'd always reach some with a good argument, a good insight, or at least a different perspective. You could agree or not. Counter or be quiet. But the earnest exploration of the topics in a live exchange with another human is as fundamental to our civilization as Socrates himself. Don't give up, don't give in. Keep debating.
I’ve long been interested in new and different platforms. I ran Debian on an Alpha back in the late 1990s and was part of the Alpha port team; then I helped bootstrap Debian on amd64. I’ve got somewhere around 8 Raspberry Pi devices in active use right now, and the free NNCPNET Internet email service … Continue reading ARM is great, ARM is terrible (and so is RISC-V) →
In my first interview out of college I was asked the change counter problem: Given a set of coin denominations, find the minimum number of coins required to make change for a given number. IE for USA coinage and 37 cents, the minimum number is four (quarter, dime, 2 pennies). I implemented the simple greedy algorithm and immediately fell into the trap of the question: the greedy algorithm only works for "well-behaved" denominations. If the coin values were [10, 9, 1], then making 37 cents would take 10 coins in the greedy algorithm but only 4 coins optimally (10+9+9+9). The "smart" answer is to use a dynamic programming algorithm, which I didn't know how to do. So I failed the interview. But you only need dynamic programming if you're writing your own algorithm. It's really easy if you throw it into a constraint solver like MiniZinc and call it a day. int: total; array[int] of int: values = [10, 9, 1]; array[index_set(values)] of var 0..: coins; constraint sum (c in index_set(coins)) (coins[c] * values[c]) == total; solve minimize sum(coins); You can try this online here. It'll give you a prompt to put in total and then give you successively-better solutions: coins = [0, 0, 37]; ---------- coins = [0, 1, 28]; ---------- coins = [0, 2, 19]; ---------- coins = [0, 3, 10]; ---------- coins = [0, 4, 1]; ---------- coins = [1, 3, 0]; ---------- Lots of similar interview questions are this kind of mathematical optimization problem, where we have to find the maximum or minimum of a function corresponding to constraints. They're hard in programming languages because programming languages are too low-level. They are also exactly the problems that constraint solvers were designed to solve. Hard leetcode problems are easy constraint problems.1 Here I'm using MiniZinc, but you could just as easily use Z3 or OR-Tools or whatever your favorite generalized solver is. More examples This was a question in a different interview (which I thankfully passed): Given a list of stock prices through the day, find maximum profit you can get by buying one stock and selling one stock later. It's easy to do in O(n^2) time, or if you are clever, you can do it in O(n). Or you could be not clever at all and just write it as a constraint problem: array[int] of int: prices = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]; var int: buy; var int: sell; var int: profit = prices[sell] - prices[buy]; constraint sell > buy; constraint profit > 0; solve maximize profit; Reminder, link to trying it online here. While working at that job, one interview question we tested out was: Given a list, determine if three numbers in that list can be added or subtracted to give 0? This is a satisfaction problem, not a constraint problem: we don't need the "best answer", any answer will do. We eventually decided against it for being too tricky for the engineers we were targeting. But it's not tricky in a solver; include "globals.mzn"; array[int] of int: numbers = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]; array[index_set(numbers)] of var {0, -1, 1}: choices; constraint sum(n in index_set(numbers)) (numbers[n] * choices[n]) = 0; constraint count(choices, -1) + count(choices, 1) = 3; solve satisfy; Okay, one last one, a problem I saw last year at Chipy AlgoSIG. Basically they pick some leetcode problems and we all do them. I failed to solve this one: Given an array of integers heights representing the histogram's bar height where the width of each bar is 1, return the area of the largest rectangle in the histogram. The "proper" solution is a tricky thing involving tracking lots of bookkeeping states, which you can completely bypass by expressing it as constraints: array[int] of int: numbers = [2,1,5,6,2,3]; var 1..length(numbers): x; var 1..length(numbers): dx; var 1..: y; constraint x + dx <= length(numbers); constraint forall (i in x..(x+dx)) (y <= numbers[i]); var int: area = (dx+1)*y; solve maximize area; output ["(\(x)->\(x+dx))*\(y) = \(area)"] There's even a way to automatically visualize the solution (using vis_geost_2d), but I didn't feel like figuring it out in time for the newsletter. Is this better? Now if I actually brought these questions to an interview the interviewee could ruin my day by asking "what's the runtime complexity?" Constraint solvers runtimes are unpredictable and almost always than an ideal bespoke algorithm because they are more expressive, in what I refer to as the capability/tractability tradeoff. But even so, they'll do way better than a bad bespoke algorithm, and I'm not experienced enough in handwriting algorithms to consistently beat a solver. The real advantage of solvers, though, is how well they handle new constraints. Take the stock picking problem above. I can write an O(n²) algorithm in a few minutes and the O(n) algorithm if you give me some time to think. Now change the problem to Maximize the profit by buying and selling up to max_sales stocks, but you can only buy or sell one stock at a given time and you can only hold up to max_hold stocks at a time? That's a way harder problem to write even an inefficient algorithm for! While the constraint problem is only a tiny bit more complicated: include "globals.mzn"; int: max_sales = 3; int: max_hold = 2; array[int] of int: prices = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]; array [1..max_sales] of var int: buy; array [1..max_sales] of var int: sell; array [index_set(prices)] of var 0..max_hold: stocks_held; var int: profit = sum(s in 1..max_sales) (prices[sell[s]] - prices[buy[s]]); constraint forall (s in 1..max_sales) (sell[s] > buy[s]); constraint profit > 0; constraint forall(i in index_set(prices)) (stocks_held[i] = (count(s in 1..max_sales) (buy[s] <= i) - count(s in 1..max_sales) (sell[s] <= i))); constraint alldifferent(buy ++ sell); solve maximize profit; output ["buy at \(buy)\n", "sell at \(sell)\n", "for \(profit)"]; Most constraint solving examples online are puzzles, like Sudoku or "SEND + MORE = MONEY". Solving leetcode problems would be a more interesting demonstration. And you get more interesting opportunities to teach optimizations, like symmetry breaking. Because my dad will email me if I don't explain this: "leetcode" is slang for "tricky algorithmic interview questions that have little-to-no relevance in the actual job you're interviewing for." It's from leetcode.com. ↩