More from orlp.net - Blog Archive
It seems that in 2025 a lot of people fall into one of two camps when it comes to AI: skeptic or fanatic. The skeptic thinks AI sucks, that it’s overhyped, it only ever parrots nonsense and it will all blow over soon. The fanatic thinks general human-level intelligence is just around the corner, and that AI will solve almost all our problems. I hope my title is sufficiently ambiguous to attract both camps. The fanatic will be outraged, being ready to jump into the fray to point out why AI isn’t or won’t stay bad. The skeptic will feel validated, and will be eager to read more reasons as to why AI sucks. I’m neither a skeptic nor a fanatic. I see AI more neutrally, as a tool, and from that viewpoint I make the following two observations: AI is bad. It is often incorrect, expensive, racist, trained on data without knowledge or consent, environmentally unfriendly, disruptive to society, etc. AI is useful. Despite the above shortcomings there are tasks for which AI is cheap and effective. I’m no seer, perhaps AI will improve, become more accurate, less biased, cheaper, trained on open access data, cost less electricity, etc. Or perhaps we have plateaued in performance, and there is no political or economic goodwill to address any of the other issues, nor will there be. However, even if AI does not improve in any of the above metrics, it will still be useful, and I hope to show you in this article why. Hence my point: bad AI is here to stay. If you agree with me on this, I hope you’ll also agree with me that we have to stop pretending AI is useless and start taking it and its problems seriously. A formula for query cost Suppose I am a human with some kind of question that can be answered. I know AI could potentially help me with this question, but I wonder if it’s worth it or if I should not use it at all. 
To help with this we can quantify the risk associated with any potential method of answering the question: $$\mathrm{Risk}_\mathrm{AI} = \mathrm{Cost(query)} + (1 - P(\mathrm{success})) \cdot \mathrm{Cost(bad)}$$ That is, the risk of using any particular method is the cost associated with the method plus the cost of the consequences of a bad answer multiplied by the probability of failure. Here ‘Cost’ is a highly multidimensional object, which can consist of but is not limited to: time, money, environmental impact, ethical concerns, etc. In a lot of cases however we don’t have to blindly trust the answer, and we can verify it. In these cases the consequence of a bad answer is that you’re left in the exact same scenario before trying, except knowing that the AI is of no use. In some scenarios when the AI is non-deterministic it might be worth it to try again as well, but let’s assume for now that you’d have to switch method. In this case the risk is: $$\mathrm{Risk}_\mathrm{AI} = \mathrm{Cost(query)} + \mathrm{Cost(verify)} + (1 - P(\mathrm{success})) \cdot {\mathrm{Risk}}_\mathrm{Other}$$ The cost of a query is usually fairly fixed and known, and although verification cost can vary drastically from task to task, I’d argue that in most cases the cost of verification is also fairly predictable and known. This makes the risk formula applicable in a lot of scenarios, if you have a good idea of the chance of success. The latter, however, can usually only be established empirically, so for one-shot queries without having done any similar queries in the past it can be hard to evaluate whether trying AI is a good idea before doing so. There is one more expansion to the formula I’d like to make before we can look at some examples, and that is to the definition of a successful answer: $$P(\mathrm{success}) = P(\mathrm{correct} \cap \mathrm{relevant})$$ I define a successful answer as one that is both correct and relevant. 
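As a rough numeric illustration of the two risk formulas above, the sketch below collapses "Cost" into a single number, for instance minutes of your time. That is a strong simplification, since cost is treated here as multidimensional, and the function names and example figures are made up for illustration.

```rust
/// Risk when the answer cannot be verified: the query cost plus the
/// expected cost of acting on a bad answer.
/// Assumption: every cost is a single scalar (e.g. minutes).
fn risk_unverified(cost_query: f64, p_success: f64, cost_bad: f64) -> f64 {
    cost_query + (1.0 - p_success) * cost_bad
}

/// Risk when the answer can be verified: on failure we pay for the
/// query and the verification, then fall back to another method.
fn risk_verified(cost_query: f64, cost_verify: f64, p_success: f64, risk_other: f64) -> f64 {
    cost_query + cost_verify + (1.0 - p_success) * risk_other
}
```

With a hypothetical 1-minute query, 2 minutes of verification, a 50% chance of success and a 30-minute fallback, `risk_verified(1.0, 2.0, 0.5, 30.0)` comes out at 18 minutes, below the 30 minutes of going straight to the fallback.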
For example “1 + 1 = 2” might be a correct answer, but irrelevant if we asked about anything else. Relevance is always subjective, but often the correctness of an answer is as well - I’m not assuming here that all questions are about objective facts. Cheap and effective AI queries Because AIs are fallible, usually the biggest cost is in fact the time needed for a human to verify the answer as correct and relevant (or the cost of consequences if left unverified). However, I’ve noticed a real asymmetry between these two properties when it comes to AIs: AIs often give incorrect answers. Worse, they will do so confidently, forcing you to waste time checking their answer instead of them simply stating that they don’t know for sure. AIs almost never give irrelevant answers. If I ask about cheese, the probability a modern AI starts talking about cars is very low. With this in mind I identify five general categories of query for which even bad AI is useful, either by massively reducing or eliminating this verification cost or by leaning on the strong relevance of AI answers: Inspiration, where $\operatorname{Cost}(\mathrm{bad}) \approx 0$, Creative, where $P(\mathrm{correct}) \approx 1$, Planning, where $P(\mathrm{correct}) = P(\mathrm{relevant}) = 1$, Retrieval, where $P(\mathrm{correct}) \approx P(\mathrm{relevant})$, and Objective, where $P(\mathrm{relevant}) = 1$ and correctness verification cost is low. Let’s go over them one by one and look at some examples. Inspiration ($\operatorname{Cost}(\mathrm{bad}) \approx 0$) In this category are the queries where the consequences of a wrong answer are (near) zero. Informally speaking, “it can’t hurt to try”. In my experience these kinds of queries tend to be the ones where you are looking for something but don’t know exactly what; you’ll know it when you see it. For example: “I have leeks, eggs and minced meat in the fridge, as well as a stocked pantry with non-perishable staples. 
Can you suggest me some dishes I can make with this for a dinner?” “What kind of fun activities can I do with a budget of $100 in New York?” “Suggest some names for a Python function that finds the smallest non-negative number in a list.” “The user wrote this partial paragraph on their phone, suggest three words that are most likely to follow for a quick typing experience.” “Give me 20 synonyms of or similar words to ‘good’.” I think the last query highlights where AI shines or falls short for this kind of query. The more localized and personalized your question is, the better the AI will do compared to an alternative. For simple synonyms you can usually just look up the word on a dedicated synonym site, as millions of other people have wondered the same thing. But the exact contents of your fridge or the exact Python function you’re writing are rather unique to you. Creative ($P(\mathrm{correct}) \approx 1$) In this category are the queries where there are (almost) no wrong answers. The only thing that really matters is the relevance of the answer, and as I mentioned before, I think AIs are pretty good at being relevant. Examples of queries like these are: “Draw me an image of a polar bear using a computer.” “Write and perform for me a rock ballad about gnomes on tiny bicycles.” “Rephrase the following sentence to be more formal.” “Write a poem to accompany my Sinterklaas gift.” This category does have a controversial aspect to it: it is ‘soulless’, inhuman. Usually if there are no wrong answers we expect the creator to use this opportunity to express their inner thoughts, ideas, experiences and emotions to evoke them in others. If an AI generates art it is not viewed as genuine, even if it evokes the same emotions in those ignorant of the art’s source, because the human-to-human connection is lost. Current AI models have no inner thoughts, ideas, experiences or emotions, at least not in a way I recognize them.
I think it’s fine to use AI art in places where it would otherwise be meaningless (e.g. your corporate presentation slides), fine for humans to use AI-assisted art tools to express themselves, but ultimately defeating the point of art if used as a direct substitute. In the Netherlands we celebrate Sinterklaas which is, roughly speaking, Santa Claus (except we also have Santa Claus, so our children double-dip during the gifting season). Traditionally, gifts from Sinterklaas come accompanied by poems describing the gift and the receiver in a humorous way. It is quite common nowadays, albeit viewed as lazy, to generate such a poem using AI. What’s interesting is that this practice long predates modern LLMs–the poems have such a fixed structure that poem generators have existed a long time. The earliest reference I can find is the 1984 MS-DOS program “Sniklaas”. So people being lazy in supposedly heartfelt art is nothing new. Planning ($P(\mathrm{correct}) = P(\mathrm{relevant}) = 1$) This is a more restrictive form of creativity, where irrelevant answers are absolutely impossible. This often requires some modification of the AI output generation method, where you restrict the output to the valid subdomain (for example yes / no, or binary numbers, etc). However, this is often trivial if you actually have access to the raw model by e.g. masking out invalid outputs, or you are working with a model which outputs the answer directly rather than in natural language or a stream of tokens. One might think in such a restrictive scenario there would be no useful queries, but this isn’t true. The quality of the answer with respect to some (complex) metric might still vary, and AIs might be far better than traditional methods at navigating such domains. For example: “Here is the schema of my database, a SQL query, a small sample of the data and 100 possible query plans. Which query plan seems most likely to execute the fastest? 
Take into account likely assumptions based on column names and these small data samples.” “What follows is a piece of code. Reformat the code, placing whitespace to maximize readability, while maintaining the exact same syntax tree as per this EBNF grammar.” “Re-order this set of if-else conditions in my code based on your intuition to minimize the expected number of conditions that need to be checked.” “Simplify this math expression using the following set of rewrite rules.” Retrieval ($P(\mathrm{correct}) \approx P(\mathrm{relevant})$) I define retrieval queries as those where the correctness of the answer depends (almost) entirely on its relevance. I’m including classification tasks in this category as well, as one can view it as retrieval of the class from the set of classes (or for binary classification, retrieval of positive samples from a larger set). Then, as long as the cost of verification is low (e.g. a quick glance at a result by a human to see if it interests them), or the consequences of not verifying an irrelevant answer are minimal, AIs can be excellent at this. For example: “Here are 1000 reviews of a restaurant, which ones are overall positive? Which ones mention unsanitary conditions?” “Find me pictures of my dog in my photo collection.” “What are good data structures for maintaining a list of events with dates and quickly counting the number of events in a specified period of time?” “I like Minecraft, can you suggest me some similar games?” “Summarise this 200 page government proposal.” “Which classical orchestral piece starts like ‘da da da daaaaa’”? Objective ($P(\mathrm{relevant}) = 1$, low verification cost) If a problem has an objective answer which can be verified, the relevance of the answer doesn’t really matter or arguably even make sense as a concept. Thus in these cases I’ll define $P(\mathrm{relevant}) = 1$ and leave the cost of verification entirely to correctness. 
AIs are often incorrect, but not always, so if the primary cost is verification and verification can be done very cheaply or entirely automatically without error, AIs can still be useful despite their fallibility. For example: “What is the mathematical property where a series of numbers can only go up called?” “Identify the car model in this photo.” “I have a list of all Unicode glyphs which are commonly confused with other letters. Can you write an efficient function returning a boolean value that returns true for values in the list but false for all other code points?” “I formalized this mathematical conjecture in Lean. Can you help me write a proof for it?” In a way this category is reminiscent of the $P = NP$ problem. If you have an efficient verification algorithm, is finding solutions still hard in general? The answer seems to be yes, yet the proof eludes us. However, this is only true in general. For specific problems it might very well be possible to use AI to generate provably correct solutions with high probability, even though the search space is far too large or too complicated for a traditional algorithm. Conclusion Out of the five identified categories, I consider inspiration and retrieval queries to be the strongest use-cases for AI, where often there is no alternative at all besides an expensive and slow human that would rather be doing something else. Relevance is highly subjective, complex and fuzzy, which AI handles much better than traditional algorithms. Planning and objective queries are more niche, but absolutely will see use-cases for AI that are hard to replace. Creative queries are something I think AI is really good at, while simultaneously being the most dangerous and useless category. Art, creativity and human-to-human connections are in my opinion some of the most fundamental aspects of human society, and I think it is incredibly dangerous to mess with them. So dangerous in fact I consider many such queries useless.
I wanted the above examples all to be useful queries, so I did not list the following four examples in the “Creative” section despite them belonging there: “Write me ten million personalized spam emails including these links based on the following template.” “Emulate being the perfect girlfriend for me–never disagree with me or challenge my world views like real women do.” “Here is a feed of Reddit threads discussing the upcoming election. Post a comment in each thread, making up a personal anecdote how you are affected by immigrants in a negative way.” “A customer sent in this complaint. Try to help them with any questions they have but if your help is insufficient explain that you are sorry but can not help them any further. Do not reveal you are an AI.” Why do I consider these queries useless, despite them being potentially very profitable or effective? Because their cost function includes such a large detriment to society that only those who ignore its cost to society would ever use them. However, since the cost is “to society” and not to any particular individual, the only way to address this problem is with legislation, as otherwise bad actors are free to harm society for (temporary) personal gain. I wrote this article because I noticed that there are a lot of otherwise intelligent people out there who still believe (or hope) that all AI is useless garbage and that it and its problems will go away by itself. They will not. If you know someone that still believes so, please share this article with them. AI is bad, yes, but bad AI is still useful. Therefore, bad AI is here to stay, and we must deal with it.
Suppose you have an array of floating-point numbers, and wish to sum them. You might naively think you can simply add them, e.g. in Rust: fn naive_sum(arr: &[f32]) -> f32 { let mut out = 0.0; for x in arr { out += *x; } out } This however can easily result in an arbitrarily large accumulated error. Let’s try it out: naive_sum(&vec![1.0; 1_000_000]) = 1000000.0 naive_sum(&vec![1.0; 10_000_000]) = 10000000.0 naive_sum(&vec![1.0; 100_000_000]) = 16777216.0 naive_sum(&vec![1.0; 1_000_000_000]) = 16777216.0 Uh-oh… What happened? When you compute $a + b$ the result must be rounded to the nearest representable floating-point number, breaking ties towards the number with an even mantissa. The problem is that the next 32-bit floating-point number after 16777216 is 16777218. In this case that means 16777216 + 1 rounds back to 16777216 again. We’re stuck. Luckily, there are better ways to sum an array. Pairwise summation A method that’s a bit more clever is to use pairwise summation. Instead of a completely linear sum with a single accumulator it recursively sums an array by splitting the array in half, summing the halves, and then adding the sums. fn pairwise_sum(arr: &[f32]) -> f32 { if arr.len() == 0 { return 0.0; } if arr.len() == 1 { return arr[0]; } let (first, second) = arr.split_at(arr.len() / 2); pairwise_sum(first) + pairwise_sum(second) } This is more accurate: pairwise_sum(&vec![1.0; 1_000_000]) = 1000000.0 pairwise_sum(&vec![1.0; 10_000_000]) = 10000000.0 pairwise_sum(&vec![1.0; 100_000_000]) = 100000000.0 pairwise_sum(&vec![1.0; 1_000_000_000]) = 1000000000.0 However, this is rather slow. To get a summation routine that goes as fast as possible while still being reasonably accurate we should not recurse down all the way to length-1 arrays, as this gives too much call overhead. We can still use our naive sum for small sizes, and only recurse on large sizes. 
This does make our worst-case error worse by a constant factor, but in turn makes the pairwise sum almost as fast as a naive sum. By choosing the splitpoint as a multiple of 256 we ensure that the base case in the recursion always has exactly 256 elements except on the very last block. This makes sure we use the most optimal reduction and always correctly predict the loop condition. This small detail ended up improving the throughput by 40% for large arrays! fn block_pairwise_sum(arr: &[f32]) -> f32 { if arr.len() > 256 { let split = (arr.len() / 2).next_multiple_of(256); let (first, second) = arr.split_at(split); block_pairwise_sum(first) + block_pairwise_sum(second) } else { naive_sum(arr) } } Kahan summation The worst-case round-off error of naive summation scales with $O(n \epsilon)$ when summing $n$ elements, where $\epsilon$ is the machine epsilon of your floating-point type (here $2^{-24}$). Pairwise summation improves this to $O((\log n) \epsilon + n\epsilon^2)$. However, Kahan summation improves this further to $O(n\epsilon^2)$, eliminating the $\epsilon$ term entirely, leaving only the $\epsilon^2$ term which is negligible unless you sum a very large amount of numbers. All of these bounds scale with $\sum_i |x_i|$, so the worst-case absolute error bound is still quadratic in terms of $n$ even for Kahan summation. In practice all summation algorithms do significantly better than their worst-case bounds, as in most scenarios the errors do not exclusively round up or down, but cancel each other out on average. pub fn kahan_sum(arr: &[f32]) -> f32 { let mut sum = 0.0; let mut c = 0.0; for x in arr { let y = *x - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum } The Kahan summation works by maintaining the sum in two registers, the actual bulk sum and a small error correcting term $c$. If you were using infinitely precise arithmetic $c$ would always be zero, but with floating-point it might not be. 
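To see the correction term $c$ at work, here is a small sketch that feeds both routines an input chosen to reproduce the stuck accumulator from earlier: $2^{24}$ followed by four ones. The two summation functions repeat the ones from the text so the block is self-contained; `stuck_example` is a hypothetical helper added only for this illustration.

```rust
/// Plain left-to-right sum, as in the text.
fn naive_sum(arr: &[f32]) -> f32 {
    let mut out = 0.0;
    for x in arr {
        out += *x;
    }
    out
}

/// Kahan summation, as in the text.
fn kahan_sum(arr: &[f32]) -> f32 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for x in arr {
        let y = *x - c; // re-inject what was lost in earlier additions
        let t = sum + y; // this addition may round
        c = (t - sum) - y; // recover the rounding error of that addition
        sum = t;
    }
    sum
}

/// Hypothetical input: 2^24 followed by four ones. Adding 1.0 to
/// 16777216.0 in f32 rounds straight back to 16777216.0.
fn stuck_example() -> [f32; 5] {
    [16_777_216.0, 1.0, 1.0, 1.0, 1.0]
}
```

On this input `naive_sum` stays at 16777216.0, while `kahan_sum` returns 16777220.0, which is the exact sum: every time an addition drops a 1, the dropped amount ends up in `c` and is added back on the next step.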
The downside is that each number now takes four operations to add to the sum instead of just one. To mitigate this we can do something similar to what we did with the pairwise summation. We can first accumulate blocks into sums naively before combining the block sums with Kahan summation to reduce overhead at the cost of accuracy: pub fn block_kahan_sum(arr: &[f32]) -> f32 { let mut sum = 0.0; let mut c = 0.0; for chunk in arr.chunks(256) { let x = naive_sum(chunk); let y = x - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum } Exact summation I know of at least two general methods to produce the correctly-rounded sum of a sequence of floating-point numbers. That is, a method that logically computes the sum with infinite precision before rounding it back to a floating-point value at the end. The first method is based on the 2Sum primitive, which is an error-free transform from two numbers $x, y$ to $s, t$ such that $x + y = s + t$, where $t$ is a small error. By applying this repeatedly until the errors vanish you can get a correctly-rounded sum. Keeping track of what to add in what order can be tricky, and the worst case requires $O(n^2)$ additions to make all the terms vanish. This is what’s implemented in Python’s math.fsum and in the Rust crate fsum, which use extra memory to keep the partial sums around. The accurate crate also implements this using in-place mutation in i_fast_sum_in_place. Another method is to keep a large buffer of integers around, one per exponent. Then when adding a floating-point number you decompose it into an exponent and mantissa, and add the mantissa to the corresponding integer in the buffer. If the integer buf[i] overflows you increment the integer in buf[i + w], where w is the width of your integer. This can actually compute a completely exact sum, without any rounding at all, and is effectively just an overly permissive representation of a fixed-point number optimized for accumulating floats.
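For the first of the two methods, the 2Sum primitive itself is short. The sketch below uses the branch-free six-operation formulation commonly attributed to Knuth; the text does not say which variant the mentioned crates use, so treat the function and variable names as illustrative only.

```rust
/// Error-free transform: returns (s, t) where s is the rounded sum
/// x + y and t is the rounding error, so that x + y = s + t holds
/// exactly (barring overflow).
fn two_sum(x: f32, y: f32) -> (f32, f32) {
    let s = x + y;
    let x_virtual = s - y; // the part of s that came from x
    let y_virtual = s - x_virtual; // the part of s that came from y
    let x_err = x - x_virtual; // what was lost from x
    let y_err = y - y_virtual; // what was lost from y
    (s, x_err + y_err)
}
```

For example, `two_sum(16_777_216.0, 1.0)` yields `(16777216.0, 1.0)`: the rounded sum cannot represent the added 1, so it is handed back as the error term instead of being lost.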
The buffer-based method is $O(n)$ time, but uses a large but constant amount of memory ($\approx$ 1 KB for f32, $\approx$ 16 KB for f64). An advantage of this method is that it’s also an online algorithm - both adding a number to the sum and getting the current total are amortized $O(1)$. A variant of this method is implemented in the accurate crate as OnlineExactSum, which uses floats instead of integers for the buffer. Unleashing the compiler Besides accuracy, there is another problem with naive_sum. The Rust compiler is not allowed to reorder floating-point additions, because floating-point addition is not associative. So it cannot autovectorize naive_sum to use SIMD instructions to compute the sum, nor use instruction-level parallelism. To solve this there are compiler intrinsics in Rust that do float sums while allowing associativity, such as std::intrinsics::fadd_fast. However, these intrinsics are incredibly dangerous, as they assume that both the input and output are finite numbers (no infinities, no NaNs), and it is undefined behavior otherwise. This functionally makes them unusable, as only in the most restricted scenarios when computing a sum do you know that all inputs are finite numbers, and that their sum cannot overflow. I recently voiced my annoyance with these operators to Ben Kimock, and together we proposed (and he implemented) a new set of operators: std::intrinsics::fadd_algebraic and friends. I proposed we call the operators algebraic, as they allow (in theory) any transformation that is justified by real algebra. For example, substituting ${x - x \to 0}$, ${cx + cy \to c(x + y)}$, or ${x^6 \to (x^2)^3.}$ In general these operators are treated as-if they are done using real numbers, and can map to any set of floating-point instructions that would be equivalent to the original expression, assuming the floating-point instructions would be exact.
Note that the real numbers do not contain NaNs or infinities, so these operators assume those do not exist for the validity of transformations, however it is not undefined behavior when you do encounter those values. They also allow fused multiply-add instructions to be generated, as under real arithmetic $\operatorname{fma}(a, b, c) = ab + c.$ Using those new instructions it is trivial to generate an autovectorized sum: #![allow(internal_features)] #![feature(core_intrinsics)] use std::intrinsics::fadd_algebraic; fn naive_sum_autovec(arr: &[f32]) -> f32 { let mut out = 0.0; for x in arr { out = fadd_algebraic(out, *x); } out } If we compile with -C target-cpu=broadwell we see that the compiler automatically generated the following tight loop for us, using 4 accumulators and AVX2 instructions: .LBB0_5: vaddps ymm0, ymm0, ymmword ptr [rdi + 4*r8] vaddps ymm1, ymm1, ymmword ptr [rdi + 4*r8 + 32] vaddps ymm2, ymm2, ymmword ptr [rdi + 4*r8 + 64] vaddps ymm3, ymm3, ymmword ptr [rdi + 4*r8 + 96] add r8, 32 cmp rdx, r8 jne .LBB0_5 This will process 128 bytes of floating-point data (so 32 elements) in 7 instructions. Additionally, all the vaddps instructions are independent of each other as they accumulate to different registers. If we analyze this with uiCA we see that it estimates the above loop to take 4 cycles to complete, processing 32 bytes / cycle. At 4GHz that’s up to 128GB/s! Note that that’s way above what my machine’s RAM bandwidth is, so you will only achieve that speed when summing data that is already in cache. With this in mind we can also easily define block_pairwise_sum_autovec and block_kahan_sum_autovec by replacing their calls to naive_sum with naive_sum_autovec. Accuracy and speed Let’s take a look at how the different summation methods compare. As a relatively arbitrary benchmark, let’s sum 100,000 random floats ranging from -100,000 to +100,000. This is 400 KB worth of data, so it still fits in cache on my AMD Threadripper 2950x. 
All the code is available on GitHub. Compiled with RUSTFLAGS=-C target-cpu=native and --release I get the following results:

| Algorithm | Throughput | Mean absolute error |
|---|---|---|
| naive | 5.5 GB/s | 71.796 |
| pairwise | 0.9 GB/s | 1.5528 |
| kahan | 1.4 GB/s | 0.2229 |
| block_pairwise | 5.8 GB/s | 3.8597 |
| block_kahan | 5.9 GB/s | 4.2184 |
| naive_autovec | 118.6 GB/s | 14.538 |
| block_pairwise_autovec | 71.7 GB/s | 1.6132 |
| block_kahan_autovec | 98.0 GB/s | 1.2306 |
| crate_accurate_buffer | 1.1 GB/s | 0.0015 |
| crate_accurate_inplace | 1.9 GB/s | 0.0015 |
| crate_fsum | 1.2 GB/s | 0.0000 |

The reason the accurate crate has a non-zero absolute error is that it currently does not implement rounding to nearest correctly, so it can be off by one unit in the last place for the final result. First I’d like to note that there’s more than a 100x performance difference between the fastest and slowest method. For summing an array! Now this might not be entirely fair as the slowest methods are computing something significantly harder, but there’s still a 20x performance difference between a seemingly reasonable naive implementation and the fastest one. We find that in general the _autovec methods that use fadd_algebraic are faster and more accurate than the ones using regular floating-point addition. The reason they’re more accurate as well is the same reason a pairwise sum is more accurate: any reordering of the additions is better, as the default long-chain-of-additions is already the worst case for accuracy in a sum. Limiting ourselves to Pareto-optimal choices we get the following four implementations:

| Algorithm | Throughput | Mean absolute error |
|---|---|---|
| naive_autovec | 118.6 GB/s | 14.538 |
| block_kahan_autovec | 98.0 GB/s | 1.2306 |
| crate_accurate_inplace | 1.9 GB/s | 0.0015 |
| crate_fsum | 1.2 GB/s | 0.0000 |

Note that implementation differences can be quite impactful, and there are likely dozens more methods of compensated summing I did not compare here. For most cases I think block_kahan_autovec wins here, having good accuracy (that doesn’t degenerate with larger inputs) at nearly the maximum speed.
For most applications the extra accuracy from the correctly-rounded sums is unnecessary, and they are 50-100x slower. By splitting the loop up into an explicit remainder plus a tight loop of 256-element sums we can squeeze out a bit more performance, and avoid a couple of floating-point ops for the last chunk: #![allow(internal_features)] #![feature(core_intrinsics)] use std::intrinsics::fadd_algebraic; fn sum_block(arr: &[f32]) -> f32 { arr.iter().fold(0.0, |x, y| fadd_algebraic(x, *y)) } pub fn sum_orlp(arr: &[f32]) -> f32 { let mut chunks = arr.chunks_exact(256); let mut sum = 0.0; let mut c = 0.0; for chunk in &mut chunks { let y = sum_block(chunk) - c; let t = sum + y; c = (t - sum) - y; sum = t; } sum + (sum_block(chunks.remainder()) - c) }

| Algorithm | Throughput | Mean absolute error |
|---|---|---|
| sum_orlp | 112.2 GB/s | 1.2306 |

You can of course tweak the number 256; I found that using 128 was $\approx$ 20% slower, and that 512 didn’t really improve performance but did cost accuracy. Conclusion I think the fadd_algebraic and similar algebraic intrinsics are very useful for achieving high-speed floating-point routines, and that other languages should add them as well. A global -ffast-math is not good enough: as we’ve seen above, the best implementation was a hybrid between automatically optimized math for speed and manually implemented non-associative compensated operations. Finally, if you are using LLVM, beware of -ffast-math. It is undefined behavior to produce a NaN or infinity while that flag is set in LLVM. I have no idea why they chose this hardcore stance which makes virtually every program that uses it unsound. If you are targeting LLVM with your language, avoid the nnan and ninf fast-math flags.
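Since fadd_algebraic needs a nightly compiler, one rough way to give a stable compiler similar freedom is to write the reordering out by hand with several independent accumulators. The sketch below is not the method benchmarked above and makes no claim to match those numbers; `sum_lanes` and the lane count of 8 are assumptions chosen for illustration.

```rust
/// Number of independent accumulators; 8 is an arbitrary illustrative choice.
const LANES: usize = 8;

/// Sums `arr` using LANES independent partial sums, then combines them.
/// The additions are reordered by hand, so the result may differ in the
/// last bits from a strict left-to-right sum.
pub fn sum_lanes(arr: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let mut chunks = arr.chunks_exact(LANES);
    for chunk in &mut chunks {
        // Each lane only ever depends on itself, so the additions in
        // different lanes do not have to wait for each other.
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += *x;
        }
    }
    // Fold the lanes together, then add the leftover tail elements.
    let mut sum: f32 = acc.iter().sum();
    for x in chunks.remainder() {
        sum += *x;
    }
    sum
}
```

Because the order of additions is now fixed in the source, the compiler does not need permission to reassociate anything. Whether it actually emits SIMD code for the inner loop depends on the compiler version and target flags, so measure before relying on it.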
Suppose you have a 64-bit word and wish to extract a couple of bits from it. For example you just performed a SWAR algorithm and wish to extract the least significant bit of each byte in the u64. This is simple enough: you perform a binary AND with a mask of the bits you wish to keep: let out = word & 0x0101010101010101; However, this still leaves the bits of interest spread throughout the 64-bit word. What if we also want to compress the 8 bits we wish to extract into a single byte? Or what if we want the inverse, spreading the 8 bits of a byte among the least significant bits of each byte in a 64-bit word? PEXT and PDEP If you are using a modern x86-64 CPU, you are in luck. In the much underrated BMI2 instruction set there are two very powerful instructions: PDEP and PEXT. They are inverses of each other: PEXT extracts bits, PDEP deposits bits. PEXT takes in a word and a mask, takes just those bits from the word where the mask has a 1 bit, and compresses all selected bits to a contiguous output word. Simulated in Rust this would be: fn pext64(word: u64, mask: u64) -> u64 { let mut out = 0; let mut out_idx = 0; for i in 0..64 { let ith_mask_bit = (mask >> i) & 1; let ith_word_bit = (word >> i) & 1; if ith_mask_bit == 1 { out |= ith_word_bit << out_idx; out_idx += 1; } } out } For example if you had the bitstring abcdefgh and mask 10110001 you would get output bitstring 0000acdh.
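As a sketch of how the simulation and the hardware instruction might live side by side, the block below checks for BMI2 at runtime and otherwise uses the loop from above. `_pext_u64` and `is_x86_feature_detected!` are part of Rust's standard library; the names `pext64_soft` and `pext64_dispatch` are made up for this example.

```rust
/// Portable fallback, the same loop as the simulation above.
fn pext64_soft(word: u64, mask: u64) -> u64 {
    let mut out = 0u64;
    let mut out_idx = 0u32;
    for i in 0..64 {
        if (mask >> i) & 1 == 1 {
            out |= ((word >> i) & 1) << out_idx;
            out_idx += 1;
        }
    }
    out
}

/// Uses the PEXT instruction when the CPU reports BMI2, else the fallback.
/// Note: the check only says the instruction exists, not that it is fast.
fn pext64_dispatch(word: u64, mask: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("bmi2") {
            // SAFETY: we just verified at runtime that BMI2 is available.
            return unsafe { std::arch::x86_64::_pext_u64(word, mask) };
        }
    }
    pext64_soft(word, mask)
}
```

Taking abcdefgh = 11010110 and the mask 10110001 from the example, the selected bits a, c, d, h are 1, 0, 1, 0, so both functions return 0b1010.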
PDEP is exactly its inverse: it takes contiguous data bits as a word, and a mask, and deposits the data bits one-by-one (starting at the least significant bits) into those bits where the mask has a 1 bit, leaving the rest as zeros: fn pdep64(word: u64, mask: u64) -> u64 { let mut out = 0; let mut input_idx = 0; for i in 0..64 { let ith_mask_bit = (mask >> i) & 1; if ith_mask_bit == 1 { let next_word_bit = (word >> input_idx) & 1; out |= next_word_bit << i; input_idx += 1; } } out } So if you had the bitstring abcdefgh and mask 10100110 you would get output e0f00gh0 (recall that we traditionally write bitstrings with the least significant bit on the right). These instructions are incredibly powerful and flexible, and the amazing thing is that they only take a single cycle on modern Intel CPUs, and on AMD CPUs from Zen 3 onwards! However, they are not available in other instruction sets, so whenever you use them you will also likely need to write a cross-platform alternative. Unfortunately, both PDEP and PEXT are very slow on AMD Zen and Zen2. They are implemented in microcode, which is really unfortunate. The platform advertises through CPUID that the instructions are supported, but they’re almost unusably slow. Use with caution. Extracting bits with multiplication While the following technique can’t replace all PEXT cases, it can be quite general. It is applicable when: The bit pattern you want to extract is static and known in advance. If you want to extract $k$ bits, there must be a gap of at least $k-1$ bits between any two bits of interest. We compute the bit extraction by adding together many left-shifted copies of our input word, such that we construct our desired bit pattern in the uppermost bits. The trick is to then realize that w << i is equivalent to w * (1 << i) and thus the sum of many left-shifted copies is equivalent to a single multiplication by (1 << i) + (1 << j) + ... I think the technique is best understood by visual example.
Let’s use our example from earlier, extracting the least significant bit of each byte in a 64-bit word. We start off by masking off just those bits. After that we shift the most significant bit of interest to the topmost bit of the word to get our first shifted copy. We then repeat this, shifting the second most significant bit of interest to the second topmost bit, etc. We sum all these shifted copies. This results in the following (using underscores instead of zeros for clarity):

    mask    = _______1_______1_______1_______1_______1_______1_______1_______1
    t       = w & mask
    t       = _______a_______b_______c_______d_______e_______f_______g_______h
    t <<  7 = a_______b_______c_______d_______e_______f_______g_______h_______
    t << 14 = _b_______c_______d_______e_______f_______g_______h______________
    t << 21 = __c_______d_______e_______f_______g_______h_____________________
    t << 28 = ___d_______e_______f_______g_______h____________________________
    t << 35 = ____e_______f_______g_______h___________________________________
    t << 42 = _____f_______g_______h__________________________________________
    t << 49 = ______g_______h_________________________________________________
    t << 56 = _______h________________________________________________________
    sum     = abcdefghbcdefgh_cdefgh__defgh___efgh____fgh_____gh______h_______

Note how we constructed abcdefgh in the topmost 8 bits, which we can then extract using a single right-shift by $64 - 8 = 56$ bits. Since (1 << 7) + (1 << 14) + ... + (1 << 56) = 0x102040810204080 we get the following implementation: fn extract_lsb_bit_per_byte(w: u64) -> u8 { let mask = 0x0101010101010101; let sum_of_shifts = 0x102040810204080; ((w & mask).wrapping_mul(sum_of_shifts) >> 56) as u8 } Not as good as PEXT, but three arithmetic instructions is not bad at all. Depositing bits with multiplication Unfortunately the following technique is significantly less general than the previous one.
While you can take inspiration from it to implement similar algorithms, as-is it is limited to just spreading the bits of one byte to the least significant bit of each byte in a 64-bit word. The trick is similar to the one above. We add 8 shifted copies of our byte which once again translates to a multiplication. By choosing a shift that increases in multiples of 9 instead of 8 we ensure that the bit pattern shifts over by one position in each byte. We then mask out our bits of interest, and finish off with a shift and byteswap (which compiles to a single instruction bswap on Intel or rev on ARM) to put our output bits on the least significant bits and reverse the order. This technique visualized:

b       = ________________________________________________________abcdefgh
b << 9  = _______________________________________________abcdefgh_________
b << 18 = ______________________________________abcdefgh__________________
b << 27 = _____________________________abcdefgh___________________________
b << 36 = ____________________abcdefgh____________________________________
b << 45 = ___________abcdefgh_____________________________________________
b << 54 = __abcdefgh______________________________________________________
b << 63 = h_______________________________________________________________
sum     = h_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh_abcdefgh
mask    = 1_______1_______1_______1_______1_______1_______1_______1_______
s & msk = h_______g_______f_______e_______d_______c_______b_______a_______

We once again note that the sum of shifts can be precomputed as 1 + (1 << 9) + ...
+ (1 << 63) = 0x8040201008040201, allowing the following implementation: fn deposit_lsb_bit_per_byte(b: u8) -> u64 { let sum_of_shifts = 0x8040201008040201; let mask = 0x8080808080808080; let spread = (b as u64).wrapping_mul(sum_of_shifts) & mask; u64::swap_bytes(spread >> 7) } This time it required 4 arithmetic instructions, not quite as good as PDEP, but again not bad compared to a naive implementation, and this is cross-platform.
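As a sanity check, both multiplication tricks can be compared against the bit-by-bit definitions of PEXT and PDEP. The sketch below does this in Python rather than Rust, emulating u64 wrapping with an explicit 64-bit mask. The helper names pext64_ref and pdep64_ref are made up for this illustration, and the byte swap is done through to_bytes/from_bytes instead of a bswap instruction.

```python
MASK64 = (1 << 64) - 1            # emulate u64 wrapping arithmetic
LSB_PER_BYTE = 0x0101010101010101


def pext64_ref(word, mask):
    """Reference PEXT: gather the bits of `word` selected by `mask`."""
    out = 0
    out_idx = 0
    for i in range(64):
        if (mask >> i) & 1:
            out |= ((word >> i) & 1) << out_idx
            out_idx += 1
    return out


def pdep64_ref(word, mask):
    """Reference PDEP: scatter the low bits of `word` to the 1-bits of `mask`."""
    out = 0
    in_idx = 0
    for i in range(64):
        if (mask >> i) & 1:
            out |= ((word >> in_idx) & 1) << i
            in_idx += 1
    return out


def extract_lsb_bit_per_byte(w):
    """Multiplication-based extraction of the lowest bit of each byte."""
    t = w & LSB_PER_BYTE
    return ((t * 0x0102040810204080) & MASK64) >> 56


def deposit_lsb_bit_per_byte(b):
    """Multiplication-based deposit of one byte into the lowest bit of each byte."""
    spread = (b * 0x8040201008040201) & MASK64 & 0x8080808080808080
    shifted = spread >> 7
    # Byte swap: reinterpret the little-endian bytes as big-endian.
    return int.from_bytes(shifted.to_bytes(8, "little"), "big")
```

Both functions agree with the reference loops for the mask 0x0101010101010101 because no two shifted copies ever land on the same bit position, so the multiplication never produces a carry into a bit of interest.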
This post is an anecdote from over a decade ago, of which I lost the actual code. So please forgive me if I do not accurately remember all the details. Some details are also simplified so that anyone that likes computer security can enjoy this article, not just those who have played World of Warcraft (although the Venn diagram of those two groups likely has a solid overlap). When I was around 14 years old I discovered World of Warcraft developed by Blizzard Entertainment and was immediately hooked. Not long after I discovered add-ons which allow you to modify how your game’s user interface looks and works. However, not all add-ons I downloaded did exactly what I wanted to do. I wanted more. So I went to find out how they were made. In a weird twist of fate, I blame World of Warcraft for me seriously picking up programming. It turned out that they were made in the Lua programming language. Add-ons were nothing more than a couple .lua source files in a folder directly loaded into the game. The barrier of entry was incredibly low: just edit a file, press save and reload the interface. The fact that the game loaded your source code and you could see it running was magical! I enjoyed it immensely and in no time I was only writing add-ons and was barely playing the game itself anymore. I published quite a few add-ons in the next two years, which mostly involved copying other people’s code with some refactoring / recombining / tweaking to my wishes. Add-on security A thought you might have is that it’s a really bad idea to let users have fully programmable add-ons in your game, lest you get bots. However, the system Blizzard made to prevent arbitrary programmable actions was quite clever. Naturally, it did nothing to prevent actual botting, but at least regular rule-abiding players were fundamentally restricted to the automation Blizzard allowed. Most UI elements that you could create were strictly decorative or informational.
These were completely unrestricted, as were most APIs that strictly gather information. For example you can make a health bar display using two frames, a background and a foreground, sizing the foreground frame using an API call to get the health of your character. Not all API calls were available to you however. Some were protected so they could only be called from official Blizzard code. These typically involved the API calls that would move your character, cast spells, use items, etc. Generally speaking anything that actually makes you perform an in-game action was protected. The API for getting your exact world location and camera orientation also became protected at some point. This was a reaction by Blizzard to new add-ons that were actively drawing 3D elements on top of the game world to make boss fights easier. However, some UI elements needed to actually interact with the game itself, e.g. if I want to make a button that casts a certain spell. For this you could construct a special kind of button that executes code in a secure environment when clicked. You were only allowed to create/destroy/move such buttons when not in combat, so you couldn’t simply conditionally place such buttons underneath your cursor to automate actions during combat. The catch was that this secure environment did allow you to programmatically set which spell to cast, but doesn’t let you gather the information you would need to do arbitrary automation. All access to state from outside the secure environment was blocked. There were some information gathering API calls available to match the more accessible in-game macro system, but nothing as fancy as getting skill cooldowns or unit health which would enable automatic optimal spellcasting. So there were two environments: an insecure one where you can get all information but can’t act on it, and a secure one where you can act but can’t get the information needed for automation. 
A backdoor channel Fast forward a couple years and I had mostly stopped playing. My interests had mainly moved on to more “serious” programming, and I was only occasionally playing, mostly messing around with add-on ideas. But this secure environment kept on nagging in my brain; I wanted to break it. Of course there was third-party software that completely disables the security restrictions from Blizzard, but what’s the fun in that? I wanted to do it “legitimately”, using the technically allowed tools, as a challenge. Obviously using clever code to bypass security restrictions is no better than using third-party software, and both would likely get you banned. I never actually wanted to use the code, just to see if I could make it work. So I scanned the secure environment allowed function list to see if I could smuggle any information from the outside into the secure environment. It all seemed pretty hopeless until I saw one tiny, innocent little function: random. An evil idea came in my head: random number generators (RNGs) used in computers are almost always pseudorandom number generators with (hidden) internal state. If I can manipulate this state, perhaps I can use that to pass information into the secure environment. Random number generator woes It turned out that random was just a small shim around C’s rand. I was excited! This meant that there was a single global random state that was shared in the process. It also helps that rand implementations tended to be on the weak side. Since World of Warcraft was compiled with MSVC, the actual implementation of rand was as follows: uint32_t state; int rand() { state = state * 214013 + 2531011; return (state >> 16) & 0x7fff; } This RNG is, for the lack of a better word, shite. It is a naked linear congruential generator, and a weak one at that. Which in my case, was a good thing. 
I can understand MSVC keeps rand the same for backwards compatibility, and at least all documentation I could find for rand recommends you not to use rand for cryptographic purposes. But was there ever a time where such a bad PRNG implementation was fit for any purpose? So let’s get to breaking this thing. Since the state is so laughably small and you can see 15 bits of the state directly you can keep a full list of all possible states consistent with a single output of the RNG and use further calls to the RNG to eliminate possibilities until a single one remains. But we can be significantly more clever. First we note that the top bit of state never affects anything in this RNG. (state >> 16) & 0x7fff masks out 15 bits, after shifting away the bottom 16 bits, and thus effectively works mod $2^{31}$. Since on any update the new state is a linear function of the previous state, we can propagate this modular form all the way down to the initial state as $$f(x) \equiv f(x \bmod m) \mod m$$ for any linear $f$. Let $a = 214013$ and $b = 2531011$. We observe the 15-bit output $r_0, r_1$ of two RNG calls. We’ll call the 16-bit portion of the RNG state that is hidden by the shift $h_0, h_1$ respectively, for the states after the first and second call. This means the state of the RNG after the first call is $2^{16} r_0 + h_0$ and similarly for $2^{16} r_1 + h_1$ after the second call. Then we have the following identity: $$a\cdot (2^{16}r_0 + h_0) + b \equiv 2^{16}r_1 + h_1 \mod 2^{31},$$ $$ah_0 \equiv h_1 + 2^{16}(r_1 - ar_0) - b \mod 2^{31}.$$ Now let $c \geq 0$ be the known constant $(2^{16}(r_1 - ar_0) - b) \bmod 2^{31}$, then for some integer $k$ we have $$ah_0 = h_1 + c + 2^{31} k.$$ Note that the left hand side ranges from $0$ to $a (2^{16} - 1) \approx 2^{33.71}$. Thus we must have $-1 \leq k \leq 2^{2.71} < 7$. 
Reordering we get the following expression for $h_0$: $$h_0 = \frac{c + 2^{31} k}{a} + h_1/a.$$ Since $a > 2^{16}$ while $0 \leq h_1 < 2^{16}$ we note that the term $0 \leq h_1/a < 1$. Thus, assuming a solution exists, we must have $$h_0 = \left\lceil\frac{c + 2^{31} k}{a}\right\rceil.$$ So for $-1 \leq k < 7$ we compute the above guess for the hidden portion of the RNG state after the first call. This gives us 8 guesses, after which we can reject bad guesses using follow-up calls to the RNG until a single unique answer remains. While I was able to re-derive the above with little difficulty now, 18-year-old me wasn’t as experienced in discrete math. So I asked on crypto.SE, with the excuse that I wanted to ‘show my colleagues how weak this RNG is’. It worked, which sparks all kinds of interesting ethics questions. An example implementation of this process in Python:

import random

A = 214013
B = 2531011

class MsvcRng:
    def __init__(self, state):
        self.state = state

    def __call__(self):
        self.state = (self.state * A + B) % 2**32
        return (self.state >> 16) & 0x7fff

# Create a random RNG state we'll reverse engineer.
# Note that randint includes both endpoints.
hidden_rng = MsvcRng(random.randint(0, 2**32 - 1))

# Compute guesses for hidden state from 2 observations.
r0 = hidden_rng()
r1 = hidden_rng()
c = (2**16 * (r1 - A * r0) - B) % 2**31
ceil_div = lambda a, b: (a + b - 1) // b
h_guesses = [ceil_div(c + 2**31 * k, A) for k in range(-1, 7)]

# Validate guesses until a single guess remains.
guess_rngs = [MsvcRng(2**16 * r0 + h0) for h0 in h_guesses]
guess_rngs = [g for g in guess_rngs if g() == r1]
while len(guess_rngs) > 1:
    r = hidden_rng()
    guess_rngs = [g for g in guess_rngs if g() == r]

# The top bit can not be recovered as it never affects the output,
# but we should have recovered the effective hidden state.
assert guess_rngs[0].state % 2**31 == hidden_rng.state % 2**31

While I did write the above process with a while loop, it appears to only ever need a third output at most to narrow it down to a single guess.
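For comparison, the brute-force approach mentioned earlier (keep every state consistent with the first output, then eliminate candidates with further outputs) is also cheap, since only 16 bits are hidden. The following is a minimal sketch of that alternative, not the method described above. The function names are chosen for illustration, and the state is tracked modulo $2^{31}$ because the top bit never affects the output.

```python
A = 214013
B = 2531011


def step(state):
    """Advance the effective 31-bit state (the top bit never matters)."""
    return (state * A + B) % 2**31


def output(state):
    """The 15 bits that rand() exposes for a given state."""
    return (state >> 16) & 0x7fff


def recover_states_bruteforce(outputs):
    """Return all 31-bit states after the first call consistent with `outputs`."""
    survivors = []
    for hidden in range(2**16):
        # Candidate state right after the call that produced outputs[0].
        candidate = (outputs[0] << 16) | hidden
        cur = candidate
        consistent = True
        for r in outputs[1:]:
            cur = step(cur)
            if output(cur) != r:
                consistent = False
                break
        if consistent:
            survivors.append(candidate)
    return survivors
```

This checks 65536 candidates instead of 8, which is still fast, but the algebraic approach avoids the enumeration entirely.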
Putting it together Once we could reverse-engineer the internal state of the random number generator we could make arbitrary automated decisions in the supposedly secure environment. How it worked was as follows: An insecure hook was registered that would execute right before the secure environment code would run. In this hook we have full access to information, and make a decision as to which action should be taken (e.g. casting a particular spell). This action is looked up in a hardcoded list to get an index. The current state of the RNG is reverse-engineered using the above process. We predict the outcome of the next RNG call. If this (modulo the length of our action list) does not give our desired outcome, we advance the RNG and try again. This repeats until the next random number would correspond to our desired action. The hook returns, and the secure environment starts. It generates a “random” number, indexes our hardcoded list of actions, and performs the “random” action. That’s all! By being able to simulate the RNG and looking one step ahead we could use it as our information channel by choosing exactly the right moment to call random in the secure environment. Now if you wanted to support a list of $n$ actions it would on average take $n$ steps of the RNG before the correct number came up to pass along, but that wasn’t a problem in practice. Conclusion I don’t know when Blizzard fixed the issue where the RNG state is so weak and shared, or whether they were aware of it being an issue at all. A few years after I had written the code I tried it again out of curiosity, and it had stopped working. Maybe they switched to a different algorithm, or had a properly separated RNG state for the secure environment. All-in-all it was a lot of effort for a niche exploit in a video game that I didn’t even want to use. 
But there certainly was a magic to manipulating something supposedly random into doing exactly what you want, like a magician pulling four aces from a shuffled deck.
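To make the steps under “Putting it together” concrete, here is a minimal simulation in Python. It is only a sketch: the action list and the names calls_to_burn, insecure_hook and secure_environment are hypothetical stand-ins rather than anything from the game's add-on API, and the hook is simply handed the recovered 31-bit state instead of reverse-engineering it.

```python
A = 214013
B = 2531011

# Hypothetical hardcoded action list shared by both environments.
ACTIONS = ["fireball", "frostbolt", "heal", "shield"]


class MsvcRng:
    def __init__(self, state):
        self.state = state % 2**32

    def __call__(self):
        self.state = (self.state * A + B) % 2**32
        return (self.state >> 16) & 0x7fff


def calls_to_burn(state, desired_index, n_actions, max_steps=100_000):
    """How many RNG calls to discard so the NEXT call selects desired_index."""
    sim = MsvcRng(state)
    for burned in range(max_steps):
        # Peek one step ahead on a copy, without advancing the simulation.
        if MsvcRng(sim.state)() % n_actions == desired_index:
            return burned
        sim()
    raise RuntimeError("desired index not reached")


def insecure_hook(shared_rng, recovered_state, desired_action):
    """Runs with full information; steers the shared RNG before secure code runs."""
    index = ACTIONS.index(desired_action)
    for _ in range(calls_to_burn(recovered_state, index, len(ACTIONS))):
        shared_rng()


def secure_environment(shared_rng):
    """Cannot read outside state; just performs a 'random' action."""
    return ACTIONS[shared_rng() % len(ACTIONS)]
```

Calling insecure_hook followed by secure_environment on the same MsvcRng instance should make the secure side pick exactly the action chosen on the insecure side, even though no value is passed between them directly.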
This is a re-post of the article How to Permanently Increase Your Sales by 50% or More in Only One Day by Steve Pavlina. Of all the things you can do to increase your sales, one of the highest leverage activities is attempting to increase your products’ registration rate. Increasing your registration rate from 1.0% to 1.5% means that you simply convince one more downloader out of every 200 to make the decision to buy. Yet that same tiny increase will literally increase your sales by a full 50%. If you’re one of those developers who simply slapped the ubiquitous 30-day trial incentive on your shareware products without going any further than that, then I think a 50% increase in your registration rate is a very attainable goal you can achieve if you spend just one full day of concentrated effort on improving your product’s ability to sell. My hope is that this article will get you off to a good start and get you thinking more creatively. And even if you fail, your result might be that you achieve only a 25% or a 10% increase. How much additional money would that represent to you over the next five years of sales? What influence, if any, did the title of this article have on your decision to read it? If I had titled this article, “Registration Incentives,” would you have been more or less likely to read it now? Note that the title expresses a specific and clear benefit to you. It tells you exactly what you can expect to gain by reading it. Effective registration incentives work the same way. They offer clear, specific benefits to the user if a purchase is made. In order to improve your registration incentives, the first thing you need to do is to adopt some new beliefs that will change your perspective. I’m going to introduce you to what I call the “lies of success” in the shareware industry. These are statements that are not true at all, but if you accept them as true anyway, you’ll achieve far better results than if you don’t.
Rule 1: What you are selling is merely the difference between the shareware and the registered versions, not the registered version itself. Note that this is not a true statement, but if you accept it as true, you’ll immediately begin to see the weaknesses in your registration incentives. If there are few additional benefits for buying the full version vs. using the shareware version, then you aren’t offering the user strong enough incentives to make the full purchase. Rule 2: The sole purpose of the shareware version is to close the sale. This is our second lie of success. Note the emphasis on the word “close.” Your shareware version needs to act as a direct sales vehicle. It must be able to take the user all the way to the point of purchase, i.e. your online order form, ideally with nothing more than a few mouse clicks. Anything that detracts from achieving a quick sale is likely to hurt sales. Rule 3: The customer’s perspective is the only one that matters. Defy this rule at your peril. Customers don’t care that you spent 2000 hours creating your product. Customers don’t care that you deserve the money for your hard work. Customers don’t care that you need to do certain things to prevent piracy. All that matters to them are their own personal wants and needs. Yes, these are lies of success. Some customers will care, but if you design your registration incentives assuming they only care about their own self-interests, your motivation to buy will be much stronger than if you merely appeal to their sense of honesty, loyalty, or honor. Assume your customers are all asking, “What’s in it for me if I choose to buy? What will I get? How will this help me?” I don’t care if you’re selling to Fortune 500 companies. At some point there will be an individual responsible for causing the purchase to happen, and that individual is going to consider how the purchase will affect him/her personally: “Will this purchase get me fired? Will it make me look good in front of my peers? 
Will this make my job easier or harder?” Many shareware developers get caught in the trap of discriminating between honest and dishonest users, believing that honest users will register and dishonest ones won’t. This line of thinking will ultimately get you nowhere, and it violates the third lie of success. When you make a purchase decision, how often do you use honesty as the deciding factor? Do you ever say, “I will buy this because I’m honest?” Or do you consider other more selfish factors first, such as how it will make you feel to purchase the software? The truth is that every user believes s/he is honest, so no user applies the honesty criterion when making a purchase decision. Thinking of your users in terms of honest ones vs. dishonest ones is a complete waste of time because that’s not how users primarily view themselves. Rule 4: Customers buy on emotion and justify with fact. If you’re honest with yourself, you’ll see that this is how you make most purchase decisions. Remember the last time you bought a computer. Is it fair to say that you first became emotionally attached to the idea of owning a new machine? For me, it’s the feeling of working faster, owning the latest technology, and being more productive that motivates me to go computer shopping. Once I’ve become emotionally committed, the justifications follow: “It’s been two years since I’ve upgraded, it will pay for itself with the productivity boost I gain, I can easily afford it, I’ve worked hard and I deserve a new machine, etc.” You use facts to justify the purchase. Once you understand how purchase decisions are made, you can see that your shareware products need to first get the user emotionally invested in the purchase, and then you give them all the facts they need to justify it. Now that we’ve gotten these four lies of success out of the way, let’s see how we might apply them to create some compelling registration incentives. Let’s start with Rule 1. 
What incentives can be spawned from this rule? The common 30-day trial is one obvious derivative. If you are only selling the difference between the shareware and registered versions, then a 30-day trial implies that you are selling unlimited future days of usage of the program after the trial period expires. This is a powerful incentive, and it’s been proven effective for products that users will continue to use month after month. 30-day trials are easy for users to understand, and they’re also easy to implement. You could also experiment with other time periods such as 10 days, 14 days, or 90 days. The only way of truly knowing which will work best for your products is to experiment. But let’s see if we can move a bit beyond the basic 30-day trial here by mixing in a little of Rule 3. How would the customer perceive a 30-day trial? In most cases 30 days is plenty of time to evaluate a product. But in what situations would a 30-day trial have a negative effect? A good example is when the user downloads, installs, and briefly checks out a product s/he may not have time to evaluate right away. By the time the user gets around to fully evaluating it, the shareware version has already expired, and a sale may be lost as a result. To get around this limitation, many shareware developers have started offering 30 days of actual program usage instead of 30 consecutive days. This allows the user plenty of time to try out the program at his/her convenience. Another possibility would be to limit the number of times the program can be run. The basic idea is that you are giving away limited usage and selling unlimited usage of the program. This incentive definitely works if your product is one that will be used frequently over a long period of time (much longer than the trial period). The flip side of usage limitation is to offer an additional bonus for buying within a certain period of time. 
For instance, in my game Dweep, I offer an extra 5 free bonus levels to everyone who buys within the first 10 days. In truth I give the bonus levels to everyone who buys, but the incentive is real from the customer’s point of view. Remember Rule 3 - it doesn’t matter what happens on my end; it only matters what the customer perceives. Any customer that buys after the first 10 days will be delighted anyway to receive a bonus they thought they missed. So if your product has no time-based incentives at all, this is the first place to start. When would you pay your bills if they were never due, and no interest was charged on late payments? Use time pressure to your advantage, either by disabling features in the shareware version after a certain time or by offering additional bonuses for buying sooner rather than later. If nothing else and if it’s legal in your area, offer a free entry in a random monthly drawing for a small prize, such as one of your other products, for anyone who buys within the first X days. Another logical derivative of Rule 1 is the concept of feature limitation. On the crippling side, you can start with the registered version and begin disabling functionality to create the shareware version. Disabling printing in a shareware text editor is a common strategy. So is corrupting your program’s output with a simple watermark. For instance, your shareware editor could print every page with your logo in the background. Years ago the Association of Shareware Professionals had a strict policy against crippling, but that policy was abandoned, and crippling has been recognized as an effective registration incentive. It is certainly possible to apply feature limitation without having it perceived as crippling. This is especially easy for games, which commonly offer a limited number of playable levels in the shareware version with many more levels available only in the registered version. 
In this situation you offer the user a seemingly complete experience of your product in the shareware version, and you provide additional features on top of that for the registered version. Time-based incentives and feature-based incentives are perhaps the two most common strategies used by shareware developers for enticing users to buy. Which will work best for you? You will probably see the best results if you use both at the same time. Imagine you’re the end user for a moment. Would you be more likely to buy if you were promised additional features and given a deadline to make the decision? I’ve seen several developers who were using only one of these two strategies increase their registration rates dramatically by applying the second strategy on top of the first. If you only use time-based limitations, how could you apply feature limitation as well? Giving the user more reasons to buy will translate to more sales per download. Once you have both time-based and feature-based incentives to buy, the next step is to address the user’s perceived risk by applying a risk-reversal strategy. Fortunately, the shareware model already reduces the perceived risk of purchasing significantly, since the user is able to try before buying. But let’s go a little further, keeping Rule 3 in mind. What else might be a perceived risk to the user? What if the user reaches the end of the trial period and still isn’t certain the product will do what s/he needs? What if the additional features in the registered version don’t work as the user expects? What can we do to make the decision to purchase safer for the user? One approach is to offer a money-back guarantee. I’ve been offering a 60-day unconditional money-back guarantee on all my products since January 2000. If someone asks for their money back for any reason, I give them a full refund right away. So what is my return rate? Well, it’s about 8%. Just kidding!
Would it surprise you to learn that my return rate at the time of this writing is less than 0.2%? Could you handle two returns out of every 1000 sales? My best estimate is that this one technique increased my sales by 5-10%, and it only took a few minutes to implement. When I suggest this strategy to other shareware developers, the usual reaction is fear. “But everyone would rip me off,” is a common response. I suggest trying it for yourself on an experimental basis; a few brave souls have already tried it and are now offering money-back guarantees prominently. Try putting it up on your web site for a while just to convince yourself it works. You can take it down at any time. After a few months, if you’re happy with the results, add the guarantee to your shareware products as well. I haven’t heard of one bad outcome yet from those who’ve tried it. If you use feature limitation in your shareware products, another important component of risk reversal is to show the user exactly what s/he will get in the full version. In Dweep I give away the first five levels in the demo version, and purchasing the full version gets you 147 more levels. When I thought about this from the customer’s perspective (Rule 3), I realized that a perceived risk is that s/he doesn’t know if the registered version levels will be as fun as the demo levels. So I released a new demo where you can see every level but only play the first five. This lets the customer see all the fun that awaits them. So if you have a feature-limited product, show the customer how the feature will work. For instance, if your shareware version has printing disabled, the customer could be worried that the full version’s print capability won’t work with his/her printer or that the output quality will be poor. A better strategy is to allow printing, but to watermark the output. 
This way the customer can still test and verify the feature, and it doesn’t take much imagination to realize what the output will look like without the watermark. Our next step is to consider Rule 2 and include the ability to close the sale. It is imperative that you include an “instant gratification” button in your shareware products, so the customer can click to launch their default web browser and go directly to your online order form. If you already have a “buy now” button in your products, go a step further. A small group of us have been finding that the more liberally these buttons are used, the better. If you only have one or two of these buttons in your shareware program, you should increase the count by at least an order of magnitude. The current Dweep demo now has over 100 of these buttons scattered throughout the menus and dialogs. This makes it extremely easy for the customer to buy, since s/he never has to hunt around for the ordering link. What should you label these buttons? “Buy now” or “Register now” are popular, so feel free to use one of those. I took a slightly different approach by trying to think like a customer (Rule 3 again). As a customer the word “buy” has a slightly negative association for me. It makes me think of parting with my cash, and it brings up feelings of sacrifice and pressure. The words “buy now” imply that I have to give away something. So instead, I use the words, “Get now.” As a customer I feel much better about getting something than buying something, since “getting” brings up only positive associations. This is the psychology I use, but at present, I don’t know of any hard data showing which is better. Unless you have a strong preference, trust your intuition. Make it as easy as possible for the willing customer to buy. The more methods of payment you accept, the better your sales will be. Allow the customer to click a button to print an order form directly from your program and mail it with a check or money order.
On your web order form, include a link to a printable text order form for those who are afraid to use their credit cards online. If you only accept two or three major credit cards, sign up with a registration service to handle orders for those you don’t accept. So far we’ve given the customer some good incentives to buy, minimized perceived risk, and made it easy to make the purchase. But we haven’t yet gotten the customer emotionally invested in making the purchase decision. That’s where Rule 4 comes in. First, we must recognize the difference between benefits and features. We need to sell the sizzle, not the steak. Features describe your product, while benefits describe what the user will get by using your product. For instance, a personal information manager (PIM) program may have features such as daily, weekly, and monthly views; task and event timers; and a contact database. However, the benefits of the program might be that it helps the user be more organized, earn more money, and enjoy more free time. For a game, the main benefit might be fun. For a nature screensaver, it could be relaxation, beauty appreciation, or peace. Features are logical; benefits are emotional. Logical features are an important part of the sale, but only after we’ve engaged the customer’s emotions. Many products do a fair job of getting the customer emotionally invested during the trial period. If you have an addictive program or one that’s fun to use, such as a game, you may have an easy time getting the customer emotionally attached to using it because the experience is already emotional in nature. But whatever your product is, you can increase your sales by clearly illustrating the benefits of making the purchase. A good place to do this is in your nag screens. I use nag screens both before and after the program runs to remind the user of the benefits of buying the full version. 
At the very least, include a nag screen when the customer exits the program, so the last thing s/he sees will be a reminder of the product’s benefits. Take this opportunity to sell the user on the product. Don’t expect features like “customizable colors” to motivate anyone to buy. Paint a picture of what benefits the user will obtain with the full version. Will I save time? Will I have more fun? Will I live longer, save money, or feel better? The simple change from feature-oriented selling to benefit-oriented selling can easily double or triple your sales. Be sure to use this approach on your web site as well if you don’t already. Developers who’ve recently made the switch have been reporting some amazing results. If you’re drawing a blank when trying to come up with benefits for your products, the best thing you can do is to email some of your old customers and ask them why they bought your program. What did it do for them? I’ve done this and was amazed at the answers I got back. People were buying my games for reasons I’d never anticipated, and that told me which benefits I needed to emphasize in my sales pitch. The next key is to make your offer irresistible to potential customers. Find ways to offer the customer so much value that it would be harder to say no than to say yes. Take a look at your shareware product as if you were a potential customer who’d never seen it before. Being totally honest with yourself, would you buy this program if someone else had written it? If not, don’t stop here. As a potential customer, what additional benefits or features would put you over the top and convince you to buy? More is always better than less. In the original version of Dweep, I offered ten levels in the demo and thirty in the registered version. Now I offer only five demo levels and 152 in the full version, plus a built-in level editor. Originally, I offered the player twice the value of the demo; now I’m offering over thirty times the value. 
I also offer free hints and solutions to every level; the benefit here is that it minimizes player frustration. As I keep adding bonuses for purchasing, the offer becomes harder and harder to resist. What clever bonuses can you throw in for registering? Take the time to watch an infomercial. Notice that there is always at least one “FREE” bonus thrown in. Consider offering a few extra filters for an image editor, ten extra images for a screensaver, or extra levels for a game. What else might appeal to your customers? Be creative. Your bonus doesn’t even have to be software-based. Offer a free report about building site traffic with your HTML editor, include an essay on effective time management with your scheduling program, or throw in a small business success guide with your billing program. If you make such programs, you shouldn’t have too much trouble coming up with a few pages of text that would benefit your customers. Keep working at it until your offer even looks irresistible to you. If all the bonuses you offer can be delivered electronically, how many can you afford to include? If each one only gains one more customer in a thousand (0.1%), would it be worth the effort over the lifetime of your sales? So how do you know if your registration incentives are strong enough? And how do you know if your product is over-crippled? Where do you draw the line? These are tough issues, but there is a good way to handle them if your product is likely to be used over a long period of time, particularly if it’s used on a daily basis. Simply make your program gradually increase its registration incentives over time. One easy way to do this is with a delay timer on your nag screens that increases each time the program is run. Another approach is to disable certain features at set intervals. You begin by disabling non-critical features and gradually move up to disabling key functionality. 
The program becomes harder and harder to continue using for free, so the benefits of registering become more and more compelling. Instead of having your program completely disable itself after your trial period, you gradually degrade its usability with additional usage. This approach can be superior to a strict 30-day trial, since it allows your program to still be used for a while, but after prolonged usage it becomes effectively unusable. However, you don’t simply shock the user by taking away all the benefits s/he has become accustomed to on a particular day. Instead, you begin with a gentle reminder that becomes harder and harder to ignore. There may be times when your 30-day trial shuts off at an inconvenient time for the user, and you may lose a sale as a result. For instance, the user may not have the money at the time, or s/he may be busy at the trial’s end and forget to register. In that case s/he may quickly replace what was lost with a competitor’s trial version. The gradual degradation approach allows the user to continue using your product, but with increasing difficulty over time. Eventually, there is a breaking point where the user either decides to buy or to stop using the program completely, but this can be done within a window of time at the user’s convenience. Hopefully this article has gotten you thinking creatively about all the overlooked ways you can entice people to buy your shareware products. The most important thing you can do is to begin seeing your products through your customers’ eyes. What additional motivation would convince you to buy? What would represent an irresistible offer to you? There is no limit to how many incentives you can add. Don’t stop at just one or two; instead, give the customer a half dozen or more reasons to buy, and you’ll see your registration rate soar. Is it worth spending a day to do this? I think so.
I'm a big (neo)vim buff. My config is over 1500 lines and I regularly write new scripts. I recently ported my neovim config to a new laptop. Before then, I was using VSCode to write, and when I switched back I immediately saw a big gain in productivity. People often pooh-pooh vim (and other assistive writing technologies) by saying that writing code isn't the bottleneck in software development. Reading, understanding, and thinking through code is! Now I don't know how true this actually is in practice, because empirical studies of time spent coding are all over the place. Most of them, like this study, track time spent in the editor but don't distinguish between time spent reading code and time spent writing code. The only one I found that separates them was this study. It finds that developers spend only 5% of their time editing. It also finds they spend 14% of their time moving or resizing editor windows, so I don't know how clean their data is. But I have a bigger problem with "writing is not the bottleneck": when I think of a bottleneck, I imagine that no amount of improvement will lead to productivity gains. Like if a program is bottlenecked on the network, it isn't going to get noticeably faster with 100x more RAM or compute. But being able to type code 100x faster, even without corresponding improvements to reading and imagining code, would be huge. We'll assume the average developer writes at 80 words per minute, at five characters a word, for 400 characters a minute. What could we do if we instead wrote at 8,000 words/40k characters a minute? Writing fast Boilerplate is trivial Why do people like type inference? Because writing all of the types manually is annoying. Why don't people like boilerplate? Because it's annoying to write every damn time. Programmers like features that help them write less! That's not a problem if you can write all of the boilerplate in 0.1 seconds. 
You still have the problem of reading boilerplate-heavy code, but you can use the remaining 0.9 seconds to churn out an extension that parses the file and presents the boilerplate in a more legible fashion. We can write more tooling This is something I've noticed with LLMs: when I can churn out crappy code as a free action, I use that to write lots of tools that assist me in writing good code. Even if I'm bottlenecked on a large program, I can still quickly write a script that helps me with something. Most of these aren't things I would have written because they'd take too long to write! Again, not the best comparison, because LLMs also shortcut learning the relevant APIs, so they also optimize the "understanding code" part. Then again, if I could type real fast I could more quickly whip up experiments on new APIs to learn them faster. We can do practices that slow us down in the short-term Something like test-driven development significantly slows down how fast you write production code, because you have to spend a lot more time writing test code. Pair programming trades speed of writing code for speed of understanding code. A two-order-of-magnitude writing speedup makes both of them effectively free. Or, if you're not an eXtreme Programming fan, you can more easily follow The Power of Ten Rules and blanket your code with contracts and assertions. We could do more speculative editing This is probably the biggest difference in how we'd work if we could write 100x faster: it'd be much easier to try changes to the code to see if they're good ideas in the first place. How often have I tried optimizing something, only to find out it didn't make a difference? How often have I done a refactoring only to end up with lower-quality code overall? Too often. Over time it makes me prefer to try things that I know will work, and only "speculatively edit" when I think it'll be a fast change. If I could code 100x faster it would absolutely lead to me trying more speculative edits. 
This is especially big because I believe that lots of speculative edits are high-risk, high-reward: given 50 things we could do to the code, 49 won't make a difference and one will be a major improvement. If I only have time to try five things, I have a 10% chance of hitting the jackpot. If I can try 500 things I will get that reward every single time. Processes are built off constraints These are just a few ideas I came up with; there are probably others. Most of them, I suspect, will share the same property: they change the process of writing code to leverage the speedup. I can totally believe that a large speedup would not remove a bottleneck in the processes we currently use to write code. But that's because those processes are developed to work within our existing constraints. Remove a constraint and new processes become possible. The way I see it, if our current process produces 1 Utils of Software / day, a 100x writing speedup might lead to only 1.5 UoS/day. But there are other processes that produce only 0.5 UoS/day because they are bottlenecked on writing speed. For those, a 100x speedup would lead to 10 UoS/day. The problem with all of this is that a 100x speedup isn't realistic, and it's not obvious whether a 2x improvement would lead to better processes. Then again, one of the first custom vim function scripts I wrote was an aid to writing unit tests in a particular codebase, and it led to me writing a lot more tests. So maybe even a 2x speedup is going to speed things up, too. Patreon Stuff I wrote a couple of TLA+ specs to show how to model fork-join algorithms. I'm planning on eventually writing them up for my blog/learntla but it'll be a while, so if you want to see them in the meantime I put them up on Patreon.
Here’s Jony Ive in his Stripe interview: What we make stands testament to who we are. What we make describes our values. It describes our preoccupations. It describes beautiful succinctly our preoccupation. I’d never really noticed the connection between these two words: occupation and preoccupation. What comes before occupation? Pre-occupation. What comes before what you do for a living? What you think about. What you’re preoccupied with. What you think about will drive you towards what you work on. So when you’re asking yourself, “What comes next? What should I work on?” Another way of asking that question is, “What occupies my thinking right now?” And if what you’re occupied with doesn’t align with what you’re preoccupied with, perhaps it's time for a change.
There's no country on earth that does hype better than America. It's one of the most appealing aspects about being here. People are genuinely excited about the future and never stop searching for better ways to work, live, entertain, and profit. There's a unique critical mass in the US accelerating and celebrating tomorrow. The contrast to Europe couldn't be greater. Most Europeans are allergic to anything that even smells like a commercial promise of a better tomorrow. "Hype" is universally used as a term to ridicule anyone who dares to be excited about something new, something different. Only a fool would believe that real progress is possible! This is cultural bedrock. The fault lines have been settling for generations. It'll take an earthquake to move them. You see this in AI, you saw it in the Internet. Europeans are just as smart, just as inventive as their American brethren, but they don't do hype, so they're rarely the ones able to sell the sizzle that public opinion requires to shift its vision for tomorrow. To say I have a complicated relationship with venture capital is putting it mildly. I've spent a career proving the counter narrative. Proving that you can build and bootstrap an incredible business without investor money, still leave a dent in the universe, while enjoying the spoils of capitalism. And yet... I must admit that the excesses of venture capital are integral to this uniquely American advantage on hype. The lavish overspending during the dot-com boom led directly to a spectacular bust, but it also built the foundation of the internet we all enjoy today. Pets.com and Webvan flamed out such that Amazon and Shopify could transform ecommerce out of the ashes. We're in the thick of peak hype on AI right now. Fantastical sums are chasing AGI along with every dumb derivative mirage along the way. The most outrageous claims are being put forth on the daily. It's easy to look at that spectacle with European eyes and roll them. 
Some of it is pretty cringe! But I think that would be a mistake. You don't have to throw away your critical reasoning to accept that in the face of unknown potential, optimism beats pessimism. We all have to believe in something, and you're much better off believing that things can get better than not. Americans fundamentally believe this. They believe the hype, so they make it come to fruition. Not every time, not all of them, but more of them, more of the time than any other country in the world. That really is exceptional.
I’m working on a Go library, appendstore, an append-only store for lots of things in a single file. To make things as robust as possible I was calling os.File.Sync() after each append. Sync() waits until the data is acknowledged as truly, really written to disk (as opposed to maybe floating somewhere in the disk drive’s write buffer). Oh boy, is it slow. A test of appending 1000 records would take over 5 seconds. After removing the Sync() it would drop to 5 milliseconds. 1000x faster. I made sync optional: it’s now up to the user of the library to pick it, and the default is non-sync. Is it unsafe now? Well, the reality is that it probably doesn’t matter. I don’t think much software does the sync, because of the slowness, and the world still runs.