This is an article that just appeared in Asimov Press, who kindly agreed that I could publish it here and also humored my deep emotional need to use words like “Sparklepuff”.

Do you like information theory? Do you like molecular biology? Do you like the idea of smashing them together and seeing what happens? If so, then here’s a question: How much information is in your DNA? When I first looked into this question, I thought it was simple:

- Human DNA has about 3.1 billion base pairs.
- Each base pair can take one of four values (A, T, C, or G).
- It takes 2 bits to encode one of four possible values (00, 01, 10, or 11).
- Thus, human DNA contains 6.2 billion bits.

Easy, right? Sure, except:

- You have two versions of each base pair, one from each of your parents. Should you count both?
- All humans have almost identical DNA. Does that matter?
- DNA can be compressed. Should you look at the compressed representation?
- It’s not clear how much of our DNA actually does something useful. The insides of your cells are a convulsing pandemonium of interacting “hacks”, designed to keep working even as mutations constantly screw around with the DNA itself. Should we only count the “useful” parts?

Such questions quickly run into the limits of knowledge for both biology and computer science. To answer them, we need to figure out what exactly we mean by “information” and how that’s related to what’s happening inside cells. In attempting that, I will lead you through a frantic tour of information theory and molecular biology. We’ll meet some strange characters, including genomic compression algorithms based on deep learning, retrotransposons, and Kolmogorov complexity. Ultimately, I’ll argue that the intuitive idea of information in a genome is best captured by a new definition of a “bit”—one that’s unknowable with our current level of scientific knowledge.

On counting

What is “information”? This isn’t just a pedantic question, as there are actually several different mathematical definitions of a “bit”. Often, the differences don’t matter, but for DNA, they turn out to matter a lot, so let’s start with the simplest. In the storage space definition, a bit is a “slot” in which you can store one of two possible values. If some object can represent 2ⁿ possible patterns, then it contains n bits, regardless of which pattern actually happens to be stored.

So here’s a question we can answer precisely: How much information could your DNA store? A few reminders:

- DNA is a polymer. It’s a long chain of chunks of ~40 atoms called “nucleotides”. There are four different chunks, commonly labeled A, T, C, and G.
- In humans, DNA comes in 23 pieces of different lengths, called “chromosomes”.
- Humans are “diploid”, meaning we have two versions of each chromosome. We get one from each of our parents, made by randomly weaving together sections from the two chromosomes they got from their parents.
- Technically, there’s also a tiny amount of DNA in the mitochondria. This is neat because you get it from your mother basically unchanged, and so scientists can trace tiny mutations back to see how our great-great-…-great grandmothers were all related. If you go far enough back, our maternal lines all lead to a single woman, Mitochondrial Eve, who probably lived in East Africa 120,000 to 156,000 years ago. But mitochondrial DNA is tiny, so I won’t mention it again.
- Chromosomes 1-22 have a total of 2.875 billion nucleotides; the X chromosome has 156 million, and the Y chromosome has 62 million.
From here, we can calculate the total storage space in your DNA. Remember, each nucleotide has 4 options, corresponding to 2 bits. So if you’re female, your total storage space is:

(2 × 2,875 + 2 × 156) million nucleotides × 2 bits per nucleotide ≈ 12.1 billion bits

If you’re male, the total storage space is:

(2 × 2,875 + 156 + 62) million nucleotides × 2 bits per nucleotide ≈ 11.9 billion bits

For comparison, a standard single-layer DVD can store 37.6 billion bits or 4.7 GB. The code for your body, magnificent as it is, takes up as much space as around 40 minutes of standard definition video. So in principle, your DNA could represent around 2^12,000,000,000 different patterns.

But hold on. Given human common ancestry, the chromosome pair you got from your mother is almost identical to the one you got from your father. And even ignoring that, there are long sequences of nucleotides that are repeated over and over in your DNA, enough to make up a significant fraction of the total. It seems weird to count all this repeated stuff. So perhaps we want a more nuanced definition of “information.”

On compression

A string of 12 billion zeros is much longer than this article. But most people would (I hope) agree that this article contains more information than a string of 12 billion zeros. Why? One of the fundamental ideas from information theory is to define information in terms of compression. Roughly speaking, the “information” in some string is the length of the shortest possible compressed representation of that string.

So how much can you compress DNA? Answers to this question are all over the place. Some people claim it can be compressed by more than 99 percent, while others claim the state of the art is only around 25 percent. This discrepancy is explained by different definitions of “compression”, which turn out to correspond to different notions of “information”.

If you compare the genomes of two random people, they’re around 99.6 percent identical; the differences are a mix of substituted, deleted, and inserted nucleotides. Fun facts: Because of these deletions and insertions, different people have slightly different amounts of DNA. In fact, each of your chromosome pairs has DNA of slightly different lengths. When your body creates sperm/ova, it uses a crazy machine to align the chromosomes in a sensible way so different sections can be woven together without creating nonsense. Also, those same measures of similarity would say that we’re around 96 percent identical with our closest living cousins, the bonobos and chimpanzees.

The fact that we share so much DNA is key to how some algorithms can compress DNA by more than 99 percent. They do this by first storing a reference genome, which includes all the DNA that’s shared by all people and perhaps the most common variants for regions of DNA where people differ. Then, for each individual person, these algorithms only store the differences from the reference genome. Because that reference only has to be stored once, it isn’t counted in the compressed representation.

That’s great if you want to cram as many of your friends’ genomes on a hard drive as possible. But it’s a strange definition to use if you want to measure the “information content of DNA”. It implies that any genomic content that doesn’t change between individuals isn’t important enough to count as “information”. However, we know from evolutionary biology that it’s often the most crucial DNA that changes the least, precisely because it’s so important. Heritability tends to be lower for genes more closely related to reproduction.

The best compression without a reference seems to be around 25 percent. (I expect this number to rise a bit over time, as the newest methods use deep learning and research is ongoing.)
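To get a rough feel for what reference-free compression is up against, here’s a toy sketch. This is my own illustration, not one of the specialized genome compressors discussed above: it measures how much a general-purpose compressor saves against the 2-bits-per-nucleotide storage baseline, on a random DNA-like string versus a highly repetitive one.

```python
# Toy illustration: how much does a general-purpose compressor (zlib) save on
# DNA-like strings, measured against the 2-bits-per-base storage baseline?
import random
import zlib

random.seed(0)

def savings_vs_2bit(seq: str) -> float:
    """Fraction of the 2-bit/base baseline saved by zlib (negative = worse)."""
    baseline_bits = 2 * len(seq)
    compressed_bits = 8 * len(zlib.compress(seq.encode(), level=9))
    return 1 - compressed_bits / baseline_bits

random_dna = "".join(random.choice("ATCG") for _ in range(1_000_000))
repetitive_dna = "ATCGGCTA" * 125_000  # the same 8 bases, over and over

print(f"random DNA:     {savings_vs_2bit(random_dna):+.1%}")      # slightly negative: no real savings
print(f"repetitive DNA: {savings_vs_2bit(repetitive_dna):+.1%}")  # ~99%: repeats are almost free
```

Real genomes sit between these two extremes, which is roughly why specialized reference-free compressors only manage the 25 percent quoted above.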
That’s not a lot of compression. However, these algorithms are benchmarked in terms of how well they compress a genome that includes only one copy of each chromosome. Since your two chromosomes are almost identical (at least, ignoring the Y chromosome), I’d guess that you could represent the other half almost for free, meaning a compression rate of around 50 percent + ½ × 25 percent ≈ 62 percent.

On information

So if you compress DNA using an algorithm with a reference genome, it can be compressed by more than 99 percent, down to less than 120 million bits. But if you compress it without a reference genome, the best you can do is 62 percent, meaning 4.6 billion bits. Which of these is right? The answer is that either could be right. There are two different definitions of a “bit” in information theory that correspond to different types of compression.

In the Kolmogorov complexity definition, named after the remarkable Soviet mathematician Andrey Kolmogorov, a bit is a property of a particular string of 1s and 0s. The number of bits of information in the string is the length of the shortest computer program that would output that string.

In the Shannon information definition, named after the also-remarkable American polymath Claude Shannon, a bit is again a property of a particular sequence of 1s and 0s, but it’s only defined relative to some large pool of possible sequences. In this definition, if a given sequence has a probability p of occurring, then it contains n bits for whatever value of n satisfies 2ⁿ = 1/p. Or, equivalently, n = -log₂ p.

The Kolmogorov complexity definition is clearly related to compression. But what about Shannon’s? Well, say you have three beloved pet rabbits, Fluffles, Marmalade, and Sparklepuff. And say you have one picture of each of them, each 1 MB large when compressed. To keep me updated on how you’re feeling, you like to send me these same pictures over and over again, with different pets for different moods. You send a picture of Fluffles ½ the time, Marmalade ¼ of the time, and Sparklepuff ¼ of the time. (You only communicate in rabbit pictures, never with text or other images.) But then you decide to take off in a spacecraft, and your data costs go way up. Continuing the flow of pictures is crucial, so what’s the cheapest way to do that?

The best thing would be that we agree that if you send me a 0, I should pull up the picture of Fluffles, while if you send 10 I should pull up Marmalade, and if you send 11, I should pull up Sparklepuff. This is unambiguous: If you send 0011100, that means Fluffles, then Fluffles again, then Sparklepuff, then Marmalade, then Fluffles one more time. It all works out. The “code length” for Fluffles is the number n so that 2ⁿ = 1/p:

| pet | probability p | code | code length n | 2ⁿ | 1/p |
|---|---|---|---|---|---|
| Fluffles | ½ | 0 | 1 | 2 | 2 |
| Marmalade | ¼ | 10 | 2 | 4 | 4 |
| Sparklepuff | ¼ | 11 | 2 | 4 | 4 |

Intuitively, the idea is that if you want to send as few bits as possible over time, then you should give short codes to high-probability patterns and long codes to low-probability patterns. If you do this optimally (in the sense that you’ll send the fewest bits over time), it turns out that the best thing is to code a pattern with probability p with about n bits, where 2ⁿ = 1/p. (In general, things don’t work out quite this nicely, but you get the idea.)

In the Fluffles scenario, the Kolmogorov complexity definition would say that each of the images contains 1 MB of information, since that’s the smallest each image can be compressed.
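To make the coding story concrete, here’s a minimal sketch of the rabbit code from the table above. The codewords are prefix-free (no codeword starts another), which is what lets the stream be decoded without separators, and the expected number of bits per picture works out to the Shannon entropy, Σ p · log₂(1/p).

```python
# A minimal sketch of the Fluffles/Marmalade/Sparklepuff code from the table.
import math

code = {"Fluffles": "0", "Marmalade": "10", "Sparklepuff": "11"}
prob = {"Fluffles": 1 / 2, "Marmalade": 1 / 4, "Sparklepuff": 1 / 4}

# Expected bits per picture vs. the Shannon entropy of the distribution.
expected_bits = sum(prob[pet] * len(cw) for pet, cw in code.items())
entropy_bits = sum(p * math.log2(1 / p) for p in prob.values())
print(expected_bits, entropy_bits)  # 1.5 1.5 -- this code is optimal

def decode(bits: str) -> list[str]:
    """Read the stream left to right; prefix-freeness makes this unambiguous."""
    inverse = {cw: pet for pet, cw in code.items()}
    pets, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in inverse:
            pets.append(inverse[buffer])
            buffer = ""
    return pets

print(decode("0011100"))
# ['Fluffles', 'Fluffles', 'Sparklepuff', 'Marmalade', 'Fluffles']
```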
But under the Shannon information definition, the Fluffles image contains 1 bit of information, and the Marmalade and Sparklepuff images contain 2 bits. This is quite a difference!

Now, let’s return to DNA. There, the Kolmogorov complexity definition basically corresponds to the best possible compression algorithm without a reference. As we saw above, the best-known current algorithm can compress by 62 percent. So, under the Kolmogorov complexity definition, DNA contains at most 12 billion × (1 - 0.62) ≈ 4.6 billion bits of information.

Meanwhile, under the Shannon information definition, you can assume that the distribution of all human genomes is known. The information in your DNA only includes the bits needed to reconstruct your genome. That’s essentially the same as compressing with a reference. So, under the Shannon information definition, your DNA contains less than 12 billion × (1 - 0.99) ≈ 120 million bits of information.

While neither of these is “wrong” for DNA, I prefer the Kolmogorov complexity definition for its ability to best capture DNA that codes for features and functions shared by all humans. After all, if you’re trying to measure how much “information” our DNA carries from our evolutionary history, surely you want to include that which has been universally preserved.

On biology

At some point, your high-school biology teacher probably told you (or will tell you) this story about how life works: First, your DNA gets transcribed into matching RNA. Next, that RNA gets translated into protein. Then the protein does Protein Stuff. If things were that simple, we could easily calculate the information density of DNA just by looking at what fraction of your DNA ever becomes a protein (only around 1 percent). But it’s not that simple. The rest of your DNA does other important things, like regulating what proteins get made. Some of it seems to exist only for the purpose of copying itself. Some of it might do nothing, or it might do important things we don’t even know about yet.

In the beginning, your DNA is relaxing in the nucleus. Some parts of your DNA, called promoters, are designed so that if certain proteins are nearby, they’ll stick to the DNA. If that happens, then a hefty little enzyme called “RNA polymerase” will show up, crack open the two strands of DNA, and start transcribing the nucleotides on one side into “pre-messenger RNA” (pre-mRNA). Eventually, for one of several reasons—none of which make any sense to me—the enzyme will decide it’s time to stop transcribing, and the pre-mRNA will detach and float off into the nucleus. At this point, it’s a few thousand or a few tens of thousands of nucleotides long.

Then, my personal favorite macromolecular complex, the “spliceosome”, grabs the pre-mRNA, cuts away most of it, and throws those parts away. The sections of DNA that code for the parts that are kept are called exons, while the sections that code for parts that are thrown away are called introns. Next, another enzyme called “RNA guanylyltransferase” (we can’t all be beautiful) adds a “cap” to one end, and an enzyme called “poly(A) polymerase” adds a “tail” to the other end. The pre-mRNA is now all grown up and has graduated to being regular mRNA. At this point, it is a few hundred or a few thousand nucleotides long.

Then, some proteins notice that the mRNA has a tail, grab it, and throw it out of the nucleus into the cytoplasm, where the noble ribosome lurks. The ribosome grabs the mRNA and turns it into a protein.
It does this by starting at one end and looking at chunks of three nucleotides at a time, called “codons”. When it sees a certain “start” pattern, it starts translating each chunk into one of 20 amino acids and continues until it sees a chunk with a “stop” pattern. Since there are 4 kinds of nucleotides, there are 4³ = 64 possible chunks, while your body only uses 20 amino acids. So the ribosome, logically, gives some amino acids (like leucine) six different codons, and others (like tryptophan) only one codon. Also, there are three different stop codons, but only one start codon, and that start codon is also the codon for methionine. So all proteins have methionine at one end unless something else comes and removes it later. Biology is layer after layer of this kind of exasperating complexity, totally indifferent to your desire to understand it. The resulting protein lives happily ever after.

It’s thought that ~1 percent of your DNA is exons and ~24 percent is introns. What’s the rest of it doing? Well, while the above dance is happening, other sections of DNA are “regulating” it. Enhancers are regions of DNA where a certain protein can bind and cause the DNA to physically bend so that some promoter somewhere else (typically within a million nucleotides) is more likely to get activated. Silencers do the opposite. Insulators block enhancers and silencers from influencing regions they shouldn’t influence.

While that might sound complicated, we’re just warming up. The same region of DNA can be both an intron and an enhancer and/or a silencer. That’s right, in the middle of the DNA that codes for some protein, evolution likes to put DNA that regulates some other, distant protein. When it’s not regulating, it gets transcribed into (probably useless) pre-RNA and then cut away and recycled by the spliceosome.

Centromeres are “attachment points” used when copying DNA during cell division. Telomeres are “extra” DNA at the ends of the chromosomes. Telomeres shrink as we age. The body has mechanisms to re-lengthen them, but it mostly only uses these in stem cells and reproductive cells. Longevity folks are interested in activating these mechanisms in other tissues to fight aging, but this is risky, since the body seems to intentionally limit telomere repair as a strategy to prevent cancer cells from growing out of control.

Further complicating this picture are many regions of DNA that code for RNA that’s never translated into a protein but still has some function. Some regions make tRNA, whose job is to bring amino acids to the ribosome. Other regions make rRNA, which bundle together with some proteins to become the ribosome. There’s siRNA, microRNA, and piRNA that screw around with the mRNA that gets produced. And there’s scaRNA, snoRNA, rRNA, lncRNA, and mrRNA. Many more types are sure to be defined in the future, because it’s hard to know for sure if DNA gets transcribed, because it’s hard to know what functions RNA might have, and because academics have strong incentives to invent ever-finer subcategories.

Then there are pseudogenes. These are regions of DNA that almost make proteins, but not quite. Sometimes, this happens because they lack a promoter, so they never get transcribed into mRNA. Other times, they might lack a start codon, so after their mRNA makes it to the ribosome, it never actually starts making a protein. Then, there are instances when the DNA has an early stop codon or a “frameshift” mutation, meaning the alignment of the RNA into chunks of three gets screwed up.
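A toy sketch makes the frameshift failure mode vivid. This is my own illustration: the codon assignments shown are real, but only a handful of the 64 are filled in. Delete a single nucleotide, and every codon downstream of the deletion changes.

```python
# Toy translation: a tiny subset of the real genetic code (64 codons exist).
CODON_TABLE = {
    "AUG": "Met",  # the start codon; also codes for methionine
    "UGG": "Trp",  # tryptophan, the amino acid with only one codon
    "CUG": "Leu", "UUA": "Leu",  # two of leucine's six codons
    "GGC": "Gly",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",  # the three stop codons
}

def translate(mrna: str) -> list[str]:
    """Scan for the start codon, then read chunks of three until a stop."""
    start = mrna.find("AUG")
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE.get(mrna[i:i + 3], "???")  # ??? = not in our subset
        if amino == "STOP":
            break
        peptide.append(amino)
    return peptide

mrna = "AUGUGGCUGGGCUAA"
print(translate(mrna))                 # ['Met', 'Trp', 'Leu', 'Gly']

# Delete one nucleotide: every downstream codon shifts, and the stop is lost.
print(translate(mrna[:4] + mrna[5:]))  # ['Met', '???', 'Trp', '???']
```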
In these cases, the ribosome will often detect that something is wrong and call for help to destroy the protein. In other cases, a short protein is made that doesn’t do anything. In more serious cases, these mutations might make the organism non-viable, or lead to problems like Tay-Sachs disease or cystic fibrosis. But a gene like that wouldn’t be considered a pseudogene.

On messiness

Why? Why is this all such a mess? Why is it so hard to say if a given section of DNA does anything useful? Biologists hate “why” questions. We can’t re-run evolution, so how can we say “why” evolution did things the way it did? Better to focus on how biological systems actually work. This is probably wise. But since I’m not a biologist (or wise), I’ll give my theory: Cells work like this because DNA is under constant attack from mutations.

Mutations most commonly arise during cell replication. Your DNA is composed of around 250 billion atoms. Making a perfect copy of all those atoms is hard. Your body has amazing nanomachines with many redundant mechanisms to try to correct errors, and it’s estimated that the error rate is less than one per billion nucleotides. But with several billion nucleotides, mutations happen.

There are also environmental sources of mutations. Ultraviolet light has more energy than visible light. If it hits your skin, that energy can sort of knock atoms out of place. The same thing happens if you’re exposed to radiation. Certain chemicals, like formaldehyde, benzene, or asbestos, can also do this, or can interfere with your body’s error correction tricks.

“DNA transposons” get cut out and stuck back in somewhere else, while “retrotransposons” create RNA that’s designed to get reverse-transcribed back into the DNA in another location. There are also “retroviruses” like HIV that contain RNA that they insert into the genome. Some people theorize that retrotransposons can evolve into retroviruses and vice-versa. It’s rare for retrotransposons to actually succeed in making a copy of themselves. They seem to have only a 1 in 100,000 or 1 in 1,000,000 chance of copying themselves during cell division. But this is perhaps 10 times as high in the germ line, so the sperm from older men is more likely to contain such mutations. Mutations in your regular cells will just affect you, but mutations in your sperm/eggs could affect all future generations.

Evolution helps manage this through selection. Say you have 10 bad mutations, and I have 10 bad mutations, but those mutations are in different spots. If we have some babies together, some of them might get 13 bad mutations, but some might only get 7, and the latter babies are more likely to pass on their genes. But as well as selection, cells seem designed to be extremely robust to these kinds of errors. Instead of just relying on selection, there are many redundant mechanisms to tolerate them without much issue. And remember, evolution is a madman. If it decides to tolerate some mutation, everything else will be optimized against it. So even if a mutation is harmful at first, evolution may later find a way to make use of it.

On information again

So, in theory, how should we define the “information content” of DNA? I propose a definition I call the “phenotypic Kolmogorov complexity”. (This has surely been proposed by someone before, but I can’t find a reference, try as I might.) Roughly speaking, this is how short you could make DNA and still get a “human”.
The “phenotype” of an animal is just a fancy way of referring to its “observable physical characteristics and behaviors”. So this definition says, like Kolmogorov complexity, to try and find the shortest compressed representation of the DNA. But instead of needing to lead to the same DNA you have, it just needs to lead to an embryo that would look and behave like you do.

This definition isn’t totally precise, because I’m not saying how precisely the phenotype needs to match. Even if there’s some completely useless section of DNA and we remove it, that would make all your cells a tiny bit lighter. We need to tolerate some level of approximation. The idea is that it should be very close, but it’s hard to make this precise.

So what would this number be? My guess is that you could reduce the amount of DNA by at least 75 percent, but not by more than 98 percent, meaning the information content is:

somewhere between 2 percent and 25 percent of 12 billion bits, i.e., roughly 240 million to 3 billion bits

But in reality, nobody knows. We still have no idea what (if anything) lots of DNA is doing, and we’re a long way from fully understanding how much it can be reduced. Probably, no one will know for a long time.
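To put the candidate answers side by side, here is the article’s arithmetic in one place. All numbers are the rough estimates from the text, not new measurements.

```python
# The four notions of "information in your DNA" discussed above, as arithmetic.
storage = (2 * 2_875 + 2 * 156) * 1e6 * 2      # slots x 2 bits: ~12.1 billion bits
kolmogorov = 12e9 * (1 - 0.62)                 # best reference-free compression
shannon = 12e9 * (1 - 0.99)                    # compression against a shared reference
phenotypic_low, phenotypic_high = 12e9 * 0.02, 12e9 * 0.25  # the author's guess

print(f"storage:     {storage / 1e9:5.1f} billion bits")
print(f"Kolmogorov:  {kolmogorov / 1e9:5.1f} billion bits")
print(f"Shannon:     {shannon / 1e9:5.2f} billion bits")
print(f"phenotypic:  {phenotypic_low / 1e9:.2f} to {phenotypic_high / 1e9:.1f} billion bits")
```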
In a recent post about trading stuff for money, I mentioned: Europe had a [blood plasma] shortage of around 38%, which it met by importing plasma from paid donors in the United States, where blood products account for 2% of all exports by value.

The internet’s reaction was: “TWO PERCENT?” “TWO PERCENT OF U.S. EXPORTS ARE BLOOD!?”

Well, I took that 2% number from a 2024 article in the Economist: Last year American blood-product exports accounted for 1.8% of the country’s total goods exports, up from just 0.5% a decade ago—and were worth $37bn. That makes blood the country’s ninth-largest goods export, ahead of coal and gold. All told, America now supplies 70% or so of the plasma used to make medicine.

I figured the Economist was trustworthy on matters of economics. But note:

- That 1.8% number is for blood products, not just blood.
- It’s a percentage of goods exports, excluding services.
- It’s wrong.

The article doesn’t explain how they arrived at 1.8%. And since the Economist speaks in the voice of God (without bylines), I can’t corner and harass the actual journalist. I’d have liked to reverse-engineer their calculations, but this was impossible, since the world hasn’t yet caught on that they should always show lots of digits.

So what’s the right number? In 2023, total US goods exports were $2,045 billion, almost exactly ⅔ of all exports once you include services. How much of that involves blood? Well, the government keeps statistics on trade based on an insanely detailed classification scheme. All goods get some number. For example, dirigibles fall under HTS 8801.90.0000, and leg warmers fall under HTS 6406.99.1530.

So what about blood? Well, HTS 3002 is the category for: Human blood; animal blood prepared for therapeutic, prophylactic or diagnostic uses; antisera and other blood fractions and modified immunological products, whether or not obtained by means of biotechnological processes; vaccines, toxins, cultures of micro-organisms (excluding yeasts) and similar products.

The total exports in this category in 2023 were $41.977 billion, or 2.05% of all goods exports. But that category includes many products that don’t require human blood, such as most vaccines. To get the actual data, you need to go through a website maintained by the US International Trade Commission. This website has good and bad aspects. On the one hand, it’s slow and clunky and confusing and often randomly fails to deliver any results. On the other hand, when you re-submit, it clears your query and then blocks you for submitting too many requests, which is nice. But after a lot of tearing of hair, I got what seems to be the most detailed breakdown of that category available. There are some finer subcategories in the taxonomy, but they don’t seem to have any data. So let’s go through those categories.
To start, here are some that would seem to almost always contain human blood:

| Category | Description | Exports ($) | Percentage of US goods exports |
|---|---|---|---|
| 3002.12.00.10 | HUMAN BLOOD PLASMA | 5,959,103,120 | 0.2914% |
| 3002.12.00.20 | NORMAL HUMAN BLOOD SERA, WHETHER OR NOT FREEZE-DRIED | 38,992,251 | 0.0019% |
| 3002.12.00.30 | HUMAN IMMUNE BLOOD SERA | 5,608,090 | 0.0003% |
| 3002.12.00.90 | ANTISERA AND OTHER BLOOD FRACTIONS | 4,808,069,119 | 0.2351% |
| 3002.90.52.10 | WHOLE HUMAN BLOOD | 22,710,898 | 0.0011% |
| TOTAL (YES BLOOD) | | 10,834,483,478 | 0.5298% |

Next, there are several categories that would seem to essentially never contain human blood:

| Category | Description | Exports ($) | Percentage of US goods exports |
|---|---|---|---|
| 3002.12.00.40 | FETAL BOVINE SERUM (FBS) | 146,026,727 | 0.0071% |
| 3002.42.00.00 | VACCINES FOR VETERINARY MEDICINE | 638,191,743 | 0.0312% |
| 3002.49.00.00 | VACCINES, TOXINS, CULTURES OF MICRO-ORGANISMS EXCLUDING YEASTS, AND SIMILAR PRODUCTS, NESOI | 1,630,036,341 | 0.0797% |
| 3002.59.00.00 | CELL CULTURES, WHETHER OR NOT MODIFIED, NESOI | 79,384,134 | 0.0039% |
| 3002.90.10.00 | FERMENTS | 361,418,233 | 0.0177% |
| TOTAL (NO BLOOD) | | 2,869,107,296 | 0.1403% |

Finally, there are categories that include some products that might contain human blood:

| Category | Description | Exports ($) | Percentage of US goods exports |
|---|---|---|---|
| 3002.13.00.00 | IMMUNOLOGICAL PRODUCTS, UNMIXED, NOT PUT UP IN MEASURED DOSES OR IN FORMS OR PACKINGS FOR RETAIL SALE | 624,283,112 | 0.0305% |
| 3002.14.00.00 | IMMUNOLOGICAL PRODUCTS, MIXED, NOT PUT UP IN MEASURED DOSES OR IN FORMS OR PACKINGS FOR RETAIL SALE | 5,060,866,208 | 0.2475% |
| 3002.15.01.00 | IMMUNOLOGICAL PRODUCTS, PUT UP IN MEASURED DOSES OR IN FORMS OR PACKINGS FOR RETAIL SALE | 13,317,356,469 | 0.6512% |
| 3002.41.00.00 | VACCINES FOR HUMAN MEDICINE, NESOI | 7,760,695,744 | 0.3795% |
| 3002.51.00.00 | CELL THERAPY PRODUCTS | 595,963,010 | 0.0291% |
| 3002.90.52.50 | HUMAN BLOOD; ANIMAL BLOOD PREPARED FOR THERAPEUTIC, PROPHYLACTIC OR DIAGNOSTIC USES; ANTISERA AND OTHER BLOOD FRACTIONS, ETC. NESOI | 914,348,561 | 0.0447% |
| TOTAL (MAYBE BLOOD) | | 28,273,513,104 | 1.3826% |

The biggest contributor here is IMMUNOLOGICAL PRODUCTS (be they MIXED or UNMIXED, PUT UP or NOT PUT UP). The largest fraction of these is probably antibodies. Antibodies are sometimes made from human blood. You may remember that in 2020, some organizations collected human blood from people who’d recovered from Covid to make antibodies. But it’s important to stress that this is quite rare. Human blood, after all, is expensive. So—because capitalism—whenever possible, animals are used instead, often rabbits, goats, sheep, or humanized mice. I can’t find any hard statistics on this. But I know several people who work in this industry. So I asked them to just guess what fraction might include human blood. Biologists don’t like numbers, so this took a lot of pleading, but my best estimate is 8%.

When looking at similar data a few years ago, Market Design suggested that immunoglobulin products might also fall under this category. But as far as I can tell, this is not true. I looked up the tariff codes for a few immunoglobulin products, and they all seem to fall under 3002.90 (“HUMAN BLOOD; ANIMAL BLOOD PREPARED FOR THERAPEUTIC, PROPHYLACTIC OR DIAGNOSTIC USES; ANTISERA AND OTHER BLOOD FRACTIONS, ETC. NESOI”).

What about vaccines or cell therapy products? These almost never contain human blood. But they are sometimes made by growing human cell lines, and sometimes those cell lines require human blood serum to grow. More pleading with the biologists produced a guess that this is true for 5% of vaccines and 80% of cell therapies.
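The table in the next section just multiplies these guesses through the export figures. Here’s the same arithmetic as a short sketch; the fractions are the informal guesses above, not data.

```python
# Guessed human-blood share of the "maybe blood" HTS categories, in dollars.
TOTAL_GOODS_EXPORTS = 2_045e9  # total 2023 US goods exports

maybe_blood = {  # HTS category: (2023 exports in $, guessed blood fraction)
    "3002.13 immunological, unmixed": (624_283_112, 0.08),
    "3002.14 immunological, mixed":   (5_060_866_208, 0.08),
    "3002.15 immunological, retail":  (13_317_356_469, 0.08),
    "3002.41 human vaccines, NESOI":  (7_760_695_744, 0.05),
    "3002.51 cell therapy products":  (595_963_010, 0.80),
    "3002.90.52 blood etc., NESOI":   (914_348_561, 0.90),
}

guessed = sum(dollars * frac for dollars, frac in maybe_blood.values())
print(f"guessed blood exports: ${guessed:,.0f}")                       # ~$3.2 billion
print(f"share of goods exports: {guessed / TOTAL_GOODS_EXPORTS:.4%}")  # ~0.1569%
```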
Aside: Even if they do require blood serum, it’s somewhat debatable if they should count as “blood products”. How far down the supply chain does that classification apply? If I make cars, and one of my employees gets injured and needs a blood transfusion, are my cars now “blood products”?

Anyway, here’s my best guess for the percentage of products in this middle category that use human blood:

| Category | Description | Needs blood (guess) | Exports ($) | Percentage of US goods exports |
|---|---|---|---|---|
| 3002.13.00.00 | IMMUNOLOGICAL PRODUCTS, UNMIXED, NOT PUT UP IN MEASURED DOSES OR IN FORMS OR PACKINGS FOR RETAIL SALE | 8% | 49,942,648 | 0.0024% |
| 3002.14.00.00 | IMMUNOLOGICAL PRODUCTS, MIXED, NOT PUT UP IN MEASURED DOSES OR IN FORMS OR PACKINGS FOR RETAIL SALE | 8% | 404,869,296 | 0.0198% |
| 3002.15.01.00 | IMMUNOLOGICAL PRODUCTS, PUT UP IN MEASURED DOSES OR IN FORMS OR PACKINGS FOR RETAIL SALE | 8% | 1,065,388,517 | 0.0521% |
| 3002.41.00.00 | VACCINES FOR HUMAN MEDICINE, NESOI | 5% | 388,034,787 | 0.0190% |
| 3002.51.00.00 | CELL THERAPY PRODUCTS | 80% | 476,770,408 | 0.0233% |
| 3002.90.52 | HUMAN BLOOD; ANIMAL BLOOD PREPARED FOR THERAPEUTIC, PROPHYLACTIC OR DIAGNOSTIC USES; ANTISERA AND OTHER BLOOD FRACTIONS, ETC. NESOI | 90% | 822,913,704 | 0.0402% |
| TOTAL (GUESSED BLOOD) | | | 3,207,919,363 | 0.1569% |

So 0.5298% of goods exports almost certainly use blood, and my best guess is that another 0.1569% of exports also include blood, for a total of 0.6867%. Obviously, this is a rough cut. But I couldn’t find any other source that shows their work in any detail, so I hoped that by publishing this I could at least prod Cunningham’s law into action. Sorry for all the numbers.
I don’t know how to internet, but I know you’re supposed to get into beefs. In the nearly five years this blog has existed, the closest I’ve come was once politely asking Slime Mold Time Mold, “Hello, would you like to have a beef?” They said, “That sounds great but we’re really busy right now, sorry.”

Beefing is a funny thing. Before we invented police and laws and courts, gossip was the only method human beings had to enforce the social compact. So we’re naturally drawn to beefs. And, as I’ve written before, I believe that even with laws and courts, social punishment remains necessary in many circumstances. The legal system is designed to supplement social norms, not replace them.

Beefs tend to get a lot of attention. I like attention. I hope I get credit when the world inevitably turns against ultrasonic humidifiers. But I don’t really want attention for beefing. I don’t have a “mission” for this blog, but if I did, it would be to slightly increase the space in which people are calm and respectful and care about getting the facts right. I think we need more of this, and I’m worried that society is devolving into “trench warfare” where facts are just tools to be used when convenient for your political coalition, and everyone assumes everyone is distorting everything, all the time.

Nevertheless, I hereby beef with Crémieux. That’s the start of a recent thread from Crémieux on the left, and sections from a post I wrote in 2022 on the right. Click here if you want to see the tedious details for the rest of the thread.

Is this plagiarism? I think so. And I don’t think it’s a close call. Plagiarism is: Presenting work or ideas from another source as your own, with or without consent of the original author, by incorporating it into your work without full acknowledgement. And in particular: Paraphrasing the work of others by altering a few words and changing their order, or by closely following the structure of their argument, is plagiarism if you do not give due acknowledgement to the author whose work you are using. A passing reference to the original author in your own text may not be enough; you must ensure that you do not create the misleading impression that the paraphrased wording or the sequence of ideas are entirely your own.

Applying this definition requires some judgement. Crémieux took eleven (unattributed) screenshots from my posts, as well as the entire structure of ideas. But there was a link at the end. Would it be clear to readers where all the ideas came from? Is the link at the end “due acknowledgement”? I think few reasonable people would say yes.

There are also several phrases and sentences that are taken verbatim or almost verbatim. E.g., I wrote: Aspartame is a weird synthetic molecule that’s 200 times sweeter than sucrose. Half of the world’s aspartame is made by Ajinomoto of Tokyo—the same company that first brought us MSG back in 1909. And Crémieux wrote: Aspartame is a sugary sweet synthetic molecule that’s 200 times sweeter than sucrose. More than half of the world’s supply comes from Ajinomoto of Tokyo, better known for bringing the world MSG. This does not happen by accident. Crémieux seemed to understand this when former Harvard president Claudine Gay was accused of plagiarism.

But I still consider this something of a technicality. It happens that Crémieux got sloppy and didn’t rephrase some stuff. But you could easily use AI to rephrase more, and it would still be plagiarism.

Why complain? I don’t understand twitter. Maybe this is normal there.
But I understand rationalist-adjacent blogs. If this was some random person, I’d probably let it go. But Crémieux presents as a member of my community. And inside that community, I feel comfortable saying this is Not Done. And if it is done, I expect an apology and a correction, rather than a long series of suspiciously practiced deflections.

I don’t expect this post will do much for my reputation. When I read it, I feel like I’m being a bit petty, and I should be spending my time on all the important things happening in the world. I think that’s what Crémieux is counting on: There’s no way to protest this behavior without hurting yourself in the process. But I’ve read Schelling, and I’m not going to play the game on that level. I’d like to be known as a blogger with a quiet little community that calmly argues about control variables and GLP-1 trials and the hard problem of consciousness, not someone who whines about getting enough credit. But I’ve decided to take the reputational hit, because norms are important, and if you care about something, you have to be willing to defend it.
Theanine is an amino acid that occurs naturally in tea. Many people take it as a supplement for stress or anxiety. It’s mechanistically plausible, but the scientific literature hasn’t been able to find much of a benefit. So I ran a 16-month blinded self-experiment in the hopes of showing it worked. It did not work. At the end of the post, I put out a challenge: If you think theanine works, prove it. Run a blinded self-experiment. After all, if it works, then what are you afraid of?

Well, it turns out that Luis Costigan had already run a self-experiment. Here was his protocol:

- Each morning, take 200 mg theanine or placebo (blinded) along with a small iced coffee.
- Wait 90 minutes.
- Record anxiety on a subjective scale of 0-10.

He repeated this for 20 days. His mean anxiety after theanine was 4.2, and after placebo it was 5.0. A simple Bayesian analysis said there was an 82.6% chance theanine reduced anxiety. The p-value was 0.31, but (this being a Bayesian blog) that’s about what you’d expect: a sample size of 20 just doesn’t have enough statistical power to have a good chance of finding a statistically significant result. If you assume the mean under placebo is 5.0, the mean under theanine is 4.2, and the standard deviation is 2.0, then you’d only have a 22.6% chance of getting a result with p<0.05.

I think this was good work, both the experiment and the analysis. It doesn’t prove theanine works, but it was enough to make me wonder: Maybe theanine does work, but I somehow failed to bring out the effect? What would give theanine the best possible chance of working?

Theanine is widely reported to help with anxiety from caffeine. While I didn’t explicitly take caffeine as part of my previous experiment, I drink tea almost every day, so I figured that if theanine helps, it should have shown up. But most people (and Luis) take theanine with coffee, not tea. I find that coffee makes me much more nervous than tea. For this reason, I sort of hate coffee and rarely drink it. Maybe the tiny amounts of natural theanine in tea masked the effects of the supplements? Or maybe you need to take theanine and caffeine at the same time? Or maybe, for some strange reason, theanine works for coffee (or coffee-tier anxiety) but not tea?

So fine. To hell with my mental health. I decided to take theanine (or placebo) together with coffee on an empty stomach first thing in the day. And I decided to double the dose of theanine from 200 mg to 400 mg.

Details

Coffee. I used one of those pod machines, which are incredibly uncool but presumably deliver a consistent amount of caffeine.

Measurements. Each day, I recorded my stress levels on a subjective 1-5 scale before I took the capsules. An hour later, I recorded my end stress levels, and my percentage prediction that what I took was actually theanine.

Blinding. I have capsules that contain either 200 mg of theanine or 25 mcg of vitamin D. These are exactly the same size. I struggled for a while to see how to take two pills of the same type while being blind to the results. In the end, I put two pills of each type in identical-looking cups and shuffled the cups. Then I shut my eyes, took a sip of coffee (to make sure I couldn’t taste any difference), swallowed the pills in one cup, and put the others into a numbered envelope. Here’s a picture of the envelopes, to prove I actually did this and/or invite sympathy for all the coffee I had to endure.

After 37 days, I ran out of capsules.

Initial thoughts

I’m going to try something new.
As I write these words, I have not yet opened the envelopes, so I don’t know the results. I’m going to register some thoughts. My main thought is: I have no idea what the results will show. It really felt like on some days I got the normal spike of anxiety I expect from coffee, and on other days it was almost completely gone. But in my previous experiment, I often felt the same thing and was proven wrong. It wouldn’t surprise me if the results show a strong effect, or if it’s all completely random.

I’ll also pre-register (sort of) the statistical analyses I intend to do:

- I’ll plot the data.
- I’ll repeat Luis’s Bayesian analysis, which looks at end stress levels only.
- I’ll repeat that again, but looking at the change in stress levels.
- I’ll repeat that again, but looking at my percentage prediction that what I actually took was theanine vs. placebo.
- I’ll compute regular-old confidence intervals and p-values for end stress, change in stress, and my percentage prediction that what I actually took was theanine vs. placebo.

Intermission

Please hold while I open all the envelopes and do the analyses. Here’s a painting.

Plots

Here are the raw stress levels. Each line shows one trial, with the start marked with a small horizontal bar. Remember, this measures the effect of coffee and the supplement. So even though stress tends to go up, this would still show a benefit if it went up less with theanine.

Here is the difference in stress levels. If Δ Stress is negative, that means stress went down.

Here are the start vs. end stress levels, ignoring time. The dotted line shows equal stress levels, so anything below that line means stress went down.

And finally, here are my percentage predictions of whether what I had taken was actually theanine.

So… nothing jumps out so far.

Analysis

So I did the analysis in my pre-registered plan above. In the process, I realized I wanted to show some extra stuff. It’s all simple and, I think, unobjectionable. But if you’re the kind of paranoid person who only trusts pre-registered things, I love and respect you, and I will mark those with “✔️”.

End stress

The first thing we’ll look at is the final stress levels, one hour after taking theanine or vitamin D. First up, regular-old frequentist statistics:

| Variable | Mean | 95% C.I. | p |
|---|---|---|---|
| theanine end stress | 1.93 | (1.80, 2.06) | |
| vitamin D end stress | 2.01 | (1.91, 2.10) | |
| ✔️ difference (T-D) | -0.069 | (-0.23, 0.083) | 0.33 |

If the difference is less than zero, that would suggest theanine was better. It looks like there might be a small difference, but it’s nowhere near statistically significant.

Next up, Bayes! In this analysis, there are latent variables for the mean and standard deviation of end stress (after one hour) with theanine, and also for vitamin D. Following Luis’s analysis, these each have a Gaussian prior with a mean and standard deviation based on the overall mean in the data.

| Variable | Mean | 95% C.I. | P[T better] |
|---|---|---|---|
| end stress (T) | 1.93 | (1.81, 2.06) | |
| end stress (D) | 2.00 | (1.91, 2.10) | |
| difference (T-D) | -0.069 | (-0.23, 0.09) | 80.5% |
| ✔️ % diff (T-D)/D | -3.38% | (-11.1%, 4.71%) | 80.5% |

The results are extremely similar to the frequentist analysis. This says there’s an 80% chance theanine is better.

Δ Stress

Next up, let’s look at the difference in stress levels, defined as Δ = (end - start). Since this measures an increase in stress, we’d like it to be as small as possible. So again, if the difference is negative, that would suggest theanine is better. Here are the good-old frequentist statistics:
| Variable | Mean | 95% C.I. | p |
|---|---|---|---|
| theanine Δ stress | 0.082 | (-0.045, 0.209) | |
| vitamin D Δ stress | 0.085 | (-0.024, 0.194) | |
| ✔️ difference (T-D) | 0.0026 | (-0.158, 0.163) | 0.334 |

And here’s the Bayesian analysis. It’s just like the first one, except we have latent variables for the difference in stress levels (end - start). If the difference of that difference was less than zero, that would again suggest theanine was better.

| Variable | Mean | 95% C.I. | P[T better] |
|---|---|---|---|
| Δ stress (T) | 0.0837 | (-0.039, 0.20) | |
| Δ stress (D) | 0.0845 | (-0.024, 0.19) | |
| difference (T-D) | -0.0008 | (-0.16, 0.16) | 50.5% |
| ✔️ % diff (T-D)/D | 22.0% | (-625%, 755%) | 55.9% |

In retrospect, this percentage difference analysis is crazy, and I suggest you ignore it. The issue is that even though Δ stress is usually positive (coffee bad), it’s near zero and can be negative. Computing (T-D)/D when D can be negative is stupid and, I think, makes the whole calculation meaningless. I regret pre-registering this. The absolute difference is fine. It’s very close (almost suspiciously close) to zero.

Percentage prediction

Finally, let’s look at my percentage prediction that what I took was theanine. It really felt like I could detect a difference. But could I? Here we’d hope that I’d give a higher prediction that I’d taken theanine when I’d actually taken theanine. So a positive difference would suggest theanine is better, or at least different.

| Variable | Mean | 95% C.I. | p |
|---|---|---|---|
| % with theanine | 52.8% | (45.8%, 59.9%) | |
| % with vitamin D | 49.3% | (43.2%, 55.4%) | |
| ✔️ difference (T-D) | 3.5% | (-5.4%, 12.4%) | 0.428 |

And here’s the corresponding Bayesian analysis. This is just like the first two, except with latent variables for my percentage prediction under theanine and vitamin D.

| Variable | Mean | 95% C.I. | P[T better] |
|---|---|---|---|
| % prediction (T) | 52.7% | (45.8%, 59.6%) | |
| % prediction (D) | 49.3% | (43.4%, 55.2%) | |
| difference (T-D) | 3.3% | (-5.7%, 12.4%) | 77.1% |
| ✔️ % diff (T-D)/D | 7.2% | (-10.8%, 27.6%) | 77.1% |

Taking a percentage difference of a quantity that is itself a percentage difference is really weird, but fine.

Discussion

This is the most annoying possible outcome. A clear effect would have made me happy. Clear evidence of no effect would also have made me happy. Instead, some analyses say there might be a small effect, and others suggest nothing. Ugh. But I’ll say this: If there is any effect, it’s small.

I know many people say theanine is life-changing, and I know why: It’s insanely easy to fool yourself. Even after running a previous 18-month trial and finding no effect, I still often felt like I could feel the effects in this experiment. I still thought I might open up all the envelopes and find that I had been under-confident in my guesses. Instead, I barely did better than chance. So I maintain my previous rule: If you claim that theanine has huge effects for you, blind experiment or GTFO.
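For anyone who wants to poke at this kind of analysis themselves, here’s a minimal sketch of the Bayesian comparison used throughout. This is my own simplification, not my actual analysis code: conjugate normal-normal updates with the variance treated as known, and made-up stand-in data rather than my real measurements.

```python
# A minimal sketch of the Bayesian comparison: Gaussian likelihood per arm,
# a broad data-centered Gaussian prior on each mean, and P[T better] read
# off posterior samples. The data below are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

theanine = np.array([2, 1, 2, 3, 2, 2, 1, 2, 2, 2, 3, 1, 2, 2, 2, 2, 1, 2])
vitamin_d = np.array([2, 2, 3, 2, 2, 1, 2, 3, 2, 2, 2, 2, 3, 2, 1, 2, 2, 2, 2])

def posterior_mean_samples(x, n_samples=100_000):
    """Posterior samples of the mean: conjugate normal-normal update,
    with the likelihood standard deviation fixed at the sample value."""
    prior_mu, prior_sd = x.mean(), x.std(ddof=1)  # weak, data-centered prior
    like_sd, n = x.std(ddof=1), len(x)
    post_var = 1.0 / (1.0 / prior_sd**2 + n / like_sd**2)
    post_mu = post_var * (prior_mu / prior_sd**2 + x.sum() / like_sd**2)
    return rng.normal(post_mu, np.sqrt(post_var), n_samples)

t = posterior_mean_samples(theanine)
d = posterior_mean_samples(vitamin_d)
print(f"P[theanine better] = {np.mean(t < d):.1%}")  # lower stress is better
```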
When we make something new, people often ask "why don't you just add that to Basecamp?" There are a number of reasons, depending on what it is. But, broadly, making something brand new gives you latitude (and attitude) to explore new tech and design approaches. It's the opposite of grafting something on to a heavier, larger system that already exists. The gravity of existing decisions in current systems requires so much energy to reach escape velocity that you tend to conform rather than explore. Essentially you're bent back to where you started, rather than arcing out towards a new horizon. New can be wrong, but it's always interesting. And that in itself is worth it. Because in the end, even if the whole new thing doesn't work out, individual elements, explorations, and executions discovered along the way can make their way back into other things you're already doing. Or something else new down the road. These bits would have been undiscovered had you never set out for new territory in the first place. Ultimately, a big part of making something new is simply thinking something new. -Jason
Those of you who remember when I used to blog things that weren’t about travel, rejoice! This was our last full day in Japan before we travel back to Australia, with heavy hearts and significantly heavier baggage. I’m at our local Doutor having a coffee and thinking back to how great this was. The start was awful on account of that migraine and stomach bug that wouldn’t quit, but I was relieved it only lasted a few days. Nagoya, Takayama, and Toyama were so much fun! It’s always a novelty seeing somewhere new, but they were all amazing places. Toyama in particular is a place I could see myself moving to; I joked that it was our $SydneySuburb in Japan.

We wanted to look for gifts for a few friends and family, so we spent our last full day in Tōkyō going to the requisite places. This meant Akihabara again, which honestly, as a nerd, wasn’t exactly an onerous requirement! Detractors claim Akihabara isn’t as good as it used to be, which may be true. But seeing all the pop culture I otherwise only ever see on a computer screen displayed out in the open and in dozens of stores, and more electronics than I’d be exposed to in Australia in a given year, it’s something else. To get there we walked a slightly different route, which took us under that famed JR arch bridge. It’s honestly far larger, higher, and more impressive in person than any of the photos online I’d seen suggest.

But first was brunch! The first time we went to Akihabara, I was only able to keep water and those CalorieMate blocks down, so we decided to re-visit The Flying Scotsman café we went to on one of our first days. We ended up at our same table, with the same wait staff who drew us little happy bunnies on our receipts and napkins. If you’re in Akiba and want somewhere to chill, I can’t recommend this place enough. They’re lovely.

We walked past the massive JR Akihabara Station and Atré building, and saw the walls plastered with images for the 30th Anniversary of AQUAPLUS! The ToHeart2 visual novel and anime were oddly one of those series that Clara and I first bonded over, to the point where our combined domain is named for a character from it. They had merchandise in one of the small halls upstairs, though they were already sold out of the characters we knew! People of discerning tastes had been through before, it seems.

While we were there, we also checked out the IDOLM@STER Official Store, because that’s also what people of discerning tastes do! IM@S was the first of these idol groups I ever knew, and while the lineup has been updated and changed a lot over the years, they still had a small shelf of the original characters I used to listen to way back when, including my beloved Yukiho Hagiwara! Atré should have called this the Nostalgia Floor.

We went hunting around a few more of the anime shops down the road, including the other Lashinbang, Mandarake, and BOOK·OFF stores we didn’t get time to visit before. These second-hand stores are absolutely packed with every series, character, and type of merchandise you can imagine. If teenage me had the money and access, he would have gone absolutely nuts in places like this. Fortunately, Clara and I are far more responsible and rational and financially prudent and… oh no, how will this all fit in our bags?

We also made the mistake of going to both HARD·OFF second-hand stores in Akiba.
Their outlets elsewhere in Japan are definitely better, on account of tourists generally not venturing out from the big cities, but they still had items that I would have loved to stuff in my bags had we the space for them. I also managed to snag a couple of PC-133 DIMMs, which most of my desktops use, but which are becoming as scarce as hen’s teeth. For a part of Tōkyō we were only going to “quickly” check out “in the morning” before going to Kinshichō, we ended up staying there till the late afternoon. Akihabara is a classic time void; I blame Makise Kurisu, among others.

We got the JR Chūō-Sōbu Line from Akihabara over to Kinshichō station in Sumida, in the eastern part of Tōkyō. We left the station and were greeted with a familiar sight in the distance! I know the Skytree is a polarising piece of architecture, but there’s no doubt it’s… omnipresent, especially on a clear day like this! It seemed like it was popping up everywhere we walked.

Clara was able to find the stores she was after for her friends, then we went to get some dinner. We ended up at this Chinese/Japanese restaurant where I was able to tick off having gyoza and omurice in the one meal… though the flavours were decidedly not Japanese! It was amazing, and a silly opportunity to use the Art mode on my OM-3 to take a picture of it outside as we left.

Our flight isn’t until late tomorrow, so our plan is to wander around Ikebukuro after we check out, and to pretend that we somehow have another two weeks! One can dream, desu.

By Ruben Schade in Tokyo, 2025-05-08.