More from Evan Jones - Software Engineer | Computer Scientist
You can't safely use the C setenv() or unsetenv() functions in a program that uses threads. Those functions modify global state, and can cause other threads calling getenv() to crash. This also causes crashes in other languages that use those C standard library functions, such as Go's os.Setenv (Go issue) and Rust's std::env::set_var() (Rust issue). I ran into this in a Go program, because Go's built-in DNS resolver can call C's getaddrinfo(), which uses environment variables. This cost me 2 days to track down and file the Go bug.

Sadly, this problem has been known for decades. For example, an article from January 2017 said: "None of this is new, but we do re-discover it roughly every five years. See you in 2022." This was only one year off! (She wrote an update in October 2023 after I emailed her about my Go bug.) This is a flaw in the POSIX standard, which extends the C Standard to allow modifying environment variables.

The most infuriating part is that many people who could influence the standard or maintain the C libraries don't see this as a problem. The argument is that the specification clearly documents that setenv() cannot be used with threads. Therefore, if someone does this, the crashes are their fault. We should apparently read every function's specification carefully, not use software written by others, and not use threads. These are unrealistic assumptions in modern software. I think we should instead strive to create APIs that are hard to screw up, and evolve as the ecosystem changes. The C language and standard library continue to play an important role at the base of most software. We either need to figure out how to improve it, or we need to figure out how to abandon it.

Why is setenv() not thread-safe?

The biggest problem is that getenv() returns a char*, with no need for applications to free it later. One thread could be using this pointer when another thread changes the same environment variable using setenv() or unsetenv().
The getenv() function is perfect if environment variables never change. For example, it works well for accessing a process's initial table of environment variables (see the System V ABI: AMD64 Section 3.4.1). It turns out the C Standard only includes getenv(), so according to C, that is exactly how this should work. However, most implementations also follow the POSIX standard (e.g. POSIX.1-2017), which extends C to include functions that modify the environment. This means the current getenv() API is problematic.

Even worse, putenv() adds a char* to the set of environment variables. It is explicitly required that if the application modifies the memory after putenv() returns, it modifies the environment variables. This means applications can modify the value passed to putenv() at any time, without any synchronization. FreeBSD used to implement putenv() by copying the value, but it changed it with FreeBSD 7 in 2008, which suggests some programs really do depend on modifying the environment in this fashion (see FreeBSD putenv man page).

As a final problem, environ is a NULL-terminated array of pointers (char**) that an application can read and assign to (see definition in POSIX.1-2017). This is how applications can iterate over all environment variables. Accesses to this array are not thread-safe. In my experience, far fewer applications use this than getenv() and setenv(). However, it does cause some libraries to not maintain the set of environment variables in a thread-safe way, since they directly update this table.

Environment variable implementations

Implementations need to choose what to do when an application overwrites an existing variable. I looked at glibc, musl, Solaris/Illumos, and FreeBSD/Apple's C standard libraries, and they make the following choices:

Never free environment variables (glibc, Solaris/Illumos): Calling setenv() repeatedly is effectively a memory leak.
However, once a value is returned from getenv(), it is immutable and can be used by threads safely.

Free the environment variables (musl, FreeBSD/Apple): Using the pointer returned by getenv() after another thread calls setenv() can crash.

A second problem is ensuring the set of environment variables is updated in a thread-safe fashion. This is what causes crashes in glibc. glibc uses an array to hold pointers to the "NAME=value" strings. It holds a lock in setenv() when changing this array, but not in getenv(). If a thread calling setenv() needs to resize the array of pointers, it copies the values to a new array and frees the previous one. This can cause other threads executing getenv() to crash, since they are now iterating over deallocated memory.

This is particularly annoying since glibc already leaks environment variables, and holds a lock in setenv(). All it needs to do is hold the lock inside getenv(), and it would no longer crash. This would make getenv() slightly slower. However, getenv() already uses a linear search of the array, so performance does not appear to be a concern. More sophisticated implementations are possible if this is a problem, such as Solaris/Illumos's lock-free implementation.

Why do programs use environment variables?

Environment variables are useful for configuring shared libraries or language runtimes that are included in other programs. This allows users to change the configuration, without program authors needing to explicitly pass the configuration in. One alternative is command line flags, which requires programs to parse them and pass them in to the libraries. Another alternative is configuration files, which then need some other way to disable or configure them, to be able to test new configurations. Environment variables are a simple solution. As a result, many libraries call getenv() (see a partial list below).
Since many libraries are configured through environment variables, a program may need to change these variables to configure the libraries it uses. This is common at application startup. This causes programs to need to call setenv(). Given this issue, it seems like libraries should also provide a way to explicitly configure any settings, and avoid using environment variables.

We should fix this problem, and we can

In my opinion, it is ridiculous that this has been a known problem for so long. It has wasted thousands of hours of people's time, either debugging the problems, or debating what to do about it. We know how to fix the problem.

First, we can make a thread-safe implementation, like Illumos/Solaris. This has some limitations: it leaks memory in setenv(), and is still unsafe if a program uses putenv() or the environ variable. However, this is an improvement over the current Linux and Apple implementations.

The second solution is to add new APIs to get one and get all environment variables that are thread-safe by design, like Microsoft's getenv_s() (see below for the controversy around C11's "Annex K").

My preferred solution would be to do both. This would reduce the chances of hitting this problem for existing programs and libraries, and also provide a path to avoid the problems entirely for new code or languages like Go and Rust. My rough idea would be the following:

Add a function to copy one single environment variable to a user-specified buffer, similar to getenv_s().

Add a thread-safe API to iterate over all environment variables, or to copy all variables out.

Mark getenv() as deprecated, recommending the new thread-safe function instead.

Mark putenv() as deprecated, recommending setenv() instead.

Mark environ as deprecated, recommending environment variable functions instead.

Update the implementation of environment variables to be thread-safe.
This requires leaking memory if getenv() is used on a variable, but we can detect if the old functions are used, and only leak memory in that case. This means programs written in other languages will avoid these problems as soon as their runtimes are updated.

Update the C and POSIX standards to require the above changes.

This would be progress.

The getenv_s / C Standard Annex K controversy

Microsoft provides getenv_s(), which copies the environment variable into a caller-provided buffer. This is easy to make thread-safe by holding a read lock while copying the variable. After the function returns, future changes to the environment have no effect. This is included in the C11 Standard as Annex K "Bounds Checking Interfaces". The C standard Annexes are optional features. This Annex includes new functions intended to make it harder to make mistakes with buffers that are the wrong size.

The first draft of this extension was published in 2003. This is when Microsoft was focusing on "Trustworthy Computing" after a January 2002 memo from Bill Gates. Basically, Windows wasn't designed to be connected to the Internet, and now that it was, people were finding many security problems. Lots of them were caused by buffer handling mistakes. Microsoft developed new versions of a number of problematic functions, and added checks to the Visual C++ compiler to warn about using the old ones. They then attempted to standardize these functions.

My understanding is the people responsible for the Unix POSIX standards did not like the design of these functions, so they refused to implement them. For more details, see Field Experience With Annex K published in September 2015, Stack Overflow: Why didn't glibc implement _s functions? updated March 2023, and Rich Felker of musl on both technical and social reasons for not implementing Annex K from February 2019.
I haven't looked at the rest of the functions, but having spent way too long looking at getenv(), the general idea of getenv_s() seems like a good idea to me. Standardizing this would help avoid this problem.

Incomplete list of common environment variables

This is a list of some uses of environment variables from fairly widely used libraries and services. This shows that environment variables are pretty widely used.

Cloud Provider Credentials and Services

AWS's SDKs for credentials (e.g. AWS_ACCESS_KEY_ID)

Google Cloud Application Default Credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS)

Microsoft Azure Default Azure Credential (e.g. AZURE_CLIENT_ID)

AWS's Lambda serverless product: sets a large number of variables like AWS_REGION, AWS_LAMBDA_FUNCTION_NAME, and credentials like AWS_SECRET_ACCESS_KEY

Google Cloud Run serverless product: configuration like PORT, K_SERVICE, K_REVISION

Kubernetes service discovery: Defines variables SERVICE_NAME_HOST and SERVICE_NAME_PORT.

Third-party C/C++ Libraries

OpenTelemetry: Metrics and tracing. Many environment variables like OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES.

OpenSSL: many configurable variables like HTTPS_PROXY, OPENSSL_CONF, OPENSSL_ENGINES.

BoringSSL: Google's fork of OpenSSL used in Chrome and others. It reads SSLKEYLOGFILE just like OpenSSL for logging TLS keys for debugging.

Libcurl: proxies, SSL/TLS configuration and debugging like HTTPS_PROXY, CURL_SSL_BACKEND, CURL_DEBUG.

Libpq Postgres client library: connection parameters including credentials like PGHOSTADDR, PGDATABASE, and PGPASSWORD.

Rust Standard Library

std::thread RUST_MIN_STACK: Calls std::env::var() on the first call to spawn() a new thread. It is cached in a static atomic variable and never read again. See implementation in thread::min_stack().

std::backtrace RUST_LIB_BACKTRACE: Calls std::env::var() on the first call to capture a backtrace. It is cached in a static atomic variable and never read again.
See implementation in Backtrace::enabled().
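The read-once-and-cache pattern described for the Rust standard library entries above can be sketched in a few lines. This is only an illustration in Python, not how any of the listed libraries are implemented: the names cached_env and EXAMPLE_SETTING are hypothetical, and Python's os.environ is a mapping managed by the interpreter, so the sketch shows the pattern rather than the C-level behavior. Caching also does not remove the race on the first read; it only limits how often the environment is consulted.

```python
import os
import threading

# Hypothetical helper: read each environment variable at most once,
# then serve every later lookup from a private cache.
_cache_lock = threading.Lock()
_cache = {}


def cached_env(name, default=None):
    """Return the value seen on the first lookup of `name`.

    Later changes to the environment are deliberately ignored, which
    mirrors the "cached and never read again" behavior described above.
    """
    with _cache_lock:
        if name not in _cache:
            _cache[name] = os.environ.get(name, default)
        return _cache[name]


if __name__ == "__main__":
    # EXAMPLE_SETTING is a made-up variable name used only for this demo.
    os.environ["EXAMPLE_SETTING"] = "first"
    print(cached_env("EXAMPLE_SETTING"))
    os.environ["EXAMPLE_SETTING"] = "second"
    print(cached_env("EXAMPLE_SETTING"))
```

A library that reads its configuration this way only touches the environment once per variable, which narrows the window in which a concurrent setenv() could interfere, but an explicit configuration API remains the safer option.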
This is a reminder that random load balancing is unevenly distributed. If we distribute a set of items randomly across a set of servers (e.g. by hashing, or by randomly selecting a server), the average number of items on each server is num_items / num_servers. It is easy to assume this means each server has close to the same number of items. However, since we are selecting servers at random, they will have different numbers of items, and the imbalance can be important.

For load balancing, a reasonable model is that each server has fixed capacity (e.g. it can serve 3000 requests/second, or store 100 items, etc.). We need to divide the total workload over the servers, so that each server stays below its capacity. This means the number of servers is determined by the most loaded server, not the average.

This is a classic balls in bins problem that has been well studied, and there are some interesting theoretical results. However, I wanted some specific numbers, so I wrote a small simulation. The summary is that the imbalance depends on the expected number of items per server (that is, num_items / num_servers). This means workload is more balanced with fewer servers, or with more items. This means that dividing a set of items over more servers makes the distribution more unfair, which is a reason we can get worse than linear scaling of a distributed system.

Let's make this more concrete with an example. Let's assume we have a workload of 1000 items, and each server can hold a maximum of 100 items. If we place the exact same number of items on each server, we only need 10 servers, and each of them is completely busy. However, if we place the items randomly, then the median (p50) number of items is 100 items. This means half the servers will have more than 100 items, and will be overloaded. If we want less than a 1% chance of an overloaded server, we need to look at the 99th percentile (p99) server load.
We need to use at least 13 servers, which has a p99 load of 97 items. With 13 servers, the average is 77 items, so our servers are on average 23% idle. This shows how the imbalance leads to wasted capacity.

This is a bit of an extreme example, because the number of items is small. Let's assume we can make the items 10× smaller, say by dividing them into pieces. Our workload now consists of 10k items, and each server has the capacity to hold 1000 (1k) items. Our perfectly balanced workload still needs 10 servers. With random load balancing, to have a less than 1 in 1000 chance of exceeding our capacity, we only need 11 servers, which has a p99 load of 977 items and a p999 of 999 items. With 11 servers, the average number of items is 909, or 91% of capacity, so our servers are only 9% idle. This shows how splitting work into smaller pieces improves load balancing.

Another way to look at this is to think about a scaling scenario. Let's go back to our workload of 1000 items, where each server can handle 100 items, and we have 13 servers to ensure we have less than a 1% chance of an overloaded server. Now let's assume the amount of work per item doubles, for example because the service has become more popular, so each item has become larger. Now, each server can hold a maximum of 50 items. If we have perfectly linear scaling, we can double the number of servers from 13 to 26 to handle this workload. However, 26 servers has a p99 of 53 items, so we again have a more than 1% chance of overload. We need to use 28 servers, which has a p99 of 50 items. This means we doubled the workload, but had to increase the number of servers from 13 to 28, which is 2.15×. This is sub-linear scaling.

As a way to visualize the imbalance, the chart below shows the p99 to average ratio, which is a measure of how imbalanced the system is. If everything is perfectly balanced, the value is 1.0. A value of 2.0 means 1% of servers will have double the number of items of the average server.
This shows that the imbalance increases with the number of servers, and increases with fewer items.

Power of Two Random Choices

Another way to improve load balancing is to have smarter placement. Perfect placement can be hard, but it is often possible to use the "power of two random choices" technique: select two servers at random, and place the item on the less loaded of the two. This makes the distribution much more balanced. For 1000 items and 100 items/server, 11 servers has a p999 of 93 items, so much less than a 0.1% chance of overload, compared to needing 14 servers with random load balancing. For the scaling scenario where each server can only handle 50 items, we only need 21 servers to have a p999 of 50 items, compared to 28 servers with random load balancing.

The downside of the two choices technique is that each request is now more expensive, since it must query two servers instead of one. However, in many cases where the "item not found" requests are much less expensive than the "item found" requests, this can still be a substantial improvement. For another look at how this improves load balancing, with a nice simulation that includes information delays, see Marc Brooker's blog post.

Raw simulation output

I will share the code for this simulation later.
simulating placing items on servers with random selection iterations=10000 (number of times num_items are placed on num_servers) measures the fraction of items on each server (server_items/num_items) and reports the percentile of all servers in the run P99_AVG_RATIO = p99 / average; approximately the worst server compared to average num_items=1000: num_servers=3 p50=0.33300 p95=0.35800 p99=0.36800 p999=0.37900 AVG=0.33333; P99_AVG_RATIO=1.10400; ITEMS_PER_NODE=333.3 num_servers=5 p50=0.20000 p95=0.22100 p99=0.23000 p999=0.24000 AVG=0.20000; P99_AVG_RATIO=1.15000; ITEMS_PER_NODE=200.0 num_servers=10 p50=0.10000 p95=0.11600 p99=0.12300 p999=0.13100 AVG=0.10000; P99_AVG_RATIO=1.23000; ITEMS_PER_NODE=100.0 num_servers=11 p50=0.09100 p95=0.10600 p99=0.11300 p999=0.12000 AVG=0.09091; P99_AVG_RATIO=1.24300; ITEMS_PER_NODE=90.9 num_servers=12 p50=0.08300 p95=0.09800 p99=0.10400 p999=0.11200 AVG=0.08333; P99_AVG_RATIO=1.24800; ITEMS_PER_NODE=83.3 num_servers=13 p50=0.07700 p95=0.09100 p99=0.09700 p999=0.10400 AVG=0.07692; P99_AVG_RATIO=1.26100; ITEMS_PER_NODE=76.9 num_servers=14 p50=0.07100 p95=0.08500 p99=0.09100 p999=0.09800 AVG=0.07143; P99_AVG_RATIO=1.27400; ITEMS_PER_NODE=71.4 num_servers=25 p50=0.04000 p95=0.05000 p99=0.05500 p999=0.06000 AVG=0.04000; P99_AVG_RATIO=1.37500; ITEMS_PER_NODE=40.0 num_servers=50 p50=0.02000 p95=0.02800 p99=0.03100 p999=0.03500 AVG=0.02000; P99_AVG_RATIO=1.55000; ITEMS_PER_NODE=20.0 num_servers=100 p50=0.01000 p95=0.01500 p99=0.01800 p999=0.02100 AVG=0.01000; P99_AVG_RATIO=1.80000; ITEMS_PER_NODE=10.0 num_servers=1000 p50=0.00100 p95=0.00300 p99=0.00400 p999=0.00500 AVG=0.00100; P99_AVG_RATIO=4.00000; ITEMS_PER_NODE=1.0 num_items=2000: num_servers=3 p50=0.33350 p95=0.35050 p99=0.35850 p999=0.36550 AVG=0.33333; P99_AVG_RATIO=1.07550; ITEMS_PER_NODE=666.7 num_servers=5 p50=0.20000 p95=0.21500 p99=0.22150 p999=0.22850 AVG=0.20000; P99_AVG_RATIO=1.10750; ITEMS_PER_NODE=400.0 num_servers=10 p50=0.10000 p95=0.11100 p99=0.11600 p999=0.12150 
AVG=0.10000; P99_AVG_RATIO=1.16000; ITEMS_PER_NODE=200.0 num_servers=11 p50=0.09100 p95=0.10150 p99=0.10650 p999=0.11150 AVG=0.09091; P99_AVG_RATIO=1.17150; ITEMS_PER_NODE=181.8 num_servers=12 p50=0.08350 p95=0.09350 p99=0.09800 p999=0.10300 AVG=0.08333; P99_AVG_RATIO=1.17600; ITEMS_PER_NODE=166.7 num_servers=13 p50=0.07700 p95=0.08700 p99=0.09100 p999=0.09600 AVG=0.07692; P99_AVG_RATIO=1.18300; ITEMS_PER_NODE=153.8 num_servers=14 p50=0.07150 p95=0.08100 p99=0.08500 p999=0.09000 AVG=0.07143; P99_AVG_RATIO=1.19000; ITEMS_PER_NODE=142.9 num_servers=25 p50=0.04000 p95=0.04750 p99=0.05050 p999=0.05450 AVG=0.04000; P99_AVG_RATIO=1.26250; ITEMS_PER_NODE=80.0 num_servers=50 p50=0.02000 p95=0.02550 p99=0.02750 p999=0.03050 AVG=0.02000; P99_AVG_RATIO=1.37500; ITEMS_PER_NODE=40.0 num_servers=100 p50=0.01000 p95=0.01400 p99=0.01550 p999=0.01750 AVG=0.01000; P99_AVG_RATIO=1.55000; ITEMS_PER_NODE=20.0 num_servers=1000 p50=0.00100 p95=0.00250 p99=0.00300 p999=0.00400 AVG=0.00100; P99_AVG_RATIO=3.00000; ITEMS_PER_NODE=2.0 num_items=5000: num_servers=3 p50=0.33340 p95=0.34440 p99=0.34920 p999=0.35400 AVG=0.33333; P99_AVG_RATIO=1.04760; ITEMS_PER_NODE=1666.7 num_servers=5 p50=0.20000 p95=0.20920 p99=0.21320 p999=0.21740 AVG=0.20000; P99_AVG_RATIO=1.06600; ITEMS_PER_NODE=1000.0 num_servers=10 p50=0.10000 p95=0.10700 p99=0.11000 p999=0.11320 AVG=0.10000; P99_AVG_RATIO=1.10000; ITEMS_PER_NODE=500.0 num_servers=11 p50=0.09080 p95=0.09760 p99=0.10040 p999=0.10380 AVG=0.09091; P99_AVG_RATIO=1.10440; ITEMS_PER_NODE=454.5 num_servers=12 p50=0.08340 p95=0.08980 p99=0.09260 p999=0.09580 AVG=0.08333; P99_AVG_RATIO=1.11120; ITEMS_PER_NODE=416.7 num_servers=13 p50=0.07680 p95=0.08320 p99=0.08580 p999=0.08900 AVG=0.07692; P99_AVG_RATIO=1.11540; ITEMS_PER_NODE=384.6 num_servers=14 p50=0.07140 p95=0.07740 p99=0.08000 p999=0.08300 AVG=0.07143; P99_AVG_RATIO=1.12000; ITEMS_PER_NODE=357.1 num_servers=25 p50=0.04000 p95=0.04460 p99=0.04660 p999=0.04880 AVG=0.04000; P99_AVG_RATIO=1.16500; 
ITEMS_PER_NODE=200.0 num_servers=50 p50=0.02000 p95=0.02340 p99=0.02480 p999=0.02640 AVG=0.02000; P99_AVG_RATIO=1.24000; ITEMS_PER_NODE=100.0 num_servers=100 p50=0.01000 p95=0.01240 p99=0.01340 p999=0.01460 AVG=0.01000; P99_AVG_RATIO=1.34000; ITEMS_PER_NODE=50.0 num_servers=1000 p50=0.00100 p95=0.00180 p99=0.00220 p999=0.00260 AVG=0.00100; P99_AVG_RATIO=2.20000; ITEMS_PER_NODE=5.0 num_items=10000: num_servers=3 p50=0.33330 p95=0.34110 p99=0.34430 p999=0.34820 AVG=0.33333; P99_AVG_RATIO=1.03290; ITEMS_PER_NODE=3333.3 num_servers=5 p50=0.20000 p95=0.20670 p99=0.20950 p999=0.21260 AVG=0.20000; P99_AVG_RATIO=1.04750; ITEMS_PER_NODE=2000.0 num_servers=10 p50=0.10000 p95=0.10500 p99=0.10700 p999=0.10940 AVG=0.10000; P99_AVG_RATIO=1.07000; ITEMS_PER_NODE=1000.0 num_servers=11 p50=0.09090 p95=0.09570 p99=0.09770 p999=0.09990 AVG=0.09091; P99_AVG_RATIO=1.07470; ITEMS_PER_NODE=909.1 num_servers=12 p50=0.08330 p95=0.08790 p99=0.08980 p999=0.09210 AVG=0.08333; P99_AVG_RATIO=1.07760; ITEMS_PER_NODE=833.3 num_servers=13 p50=0.07690 p95=0.08130 p99=0.08320 p999=0.08530 AVG=0.07692; P99_AVG_RATIO=1.08160; ITEMS_PER_NODE=769.2 num_servers=14 p50=0.07140 p95=0.07570 p99=0.07740 p999=0.07950 AVG=0.07143; P99_AVG_RATIO=1.08360; ITEMS_PER_NODE=714.3 num_servers=25 p50=0.04000 p95=0.04330 p99=0.04460 p999=0.04620 AVG=0.04000; P99_AVG_RATIO=1.11500; ITEMS_PER_NODE=400.0 num_servers=50 p50=0.02000 p95=0.02230 p99=0.02330 p999=0.02440 AVG=0.02000; P99_AVG_RATIO=1.16500; ITEMS_PER_NODE=200.0 num_servers=100 p50=0.01000 p95=0.01170 p99=0.01240 p999=0.01320 AVG=0.01000; P99_AVG_RATIO=1.24000; ITEMS_PER_NODE=100.0 num_servers=1000 p50=0.00100 p95=0.00150 p99=0.00180 p999=0.00210 AVG=0.00100; P99_AVG_RATIO=1.80000; ITEMS_PER_NODE=10.0 num_items=100000: num_servers=3 p50=0.33333 p95=0.33579 p99=0.33681 p999=0.33797 AVG=0.33333; P99_AVG_RATIO=1.01043; ITEMS_PER_NODE=33333.3 num_servers=5 p50=0.20000 p95=0.20207 p99=0.20294 p999=0.20393 AVG=0.20000; P99_AVG_RATIO=1.01470; ITEMS_PER_NODE=20000.0 
num_servers=10 p50=0.10000 p95=0.10157 p99=0.10222 p999=0.10298 AVG=0.10000; P99_AVG_RATIO=1.02220; ITEMS_PER_NODE=10000.0 num_servers=11 p50=0.09091 p95=0.09241 p99=0.09304 p999=0.09379 AVG=0.09091; P99_AVG_RATIO=1.02344; ITEMS_PER_NODE=9090.9 num_servers=12 p50=0.08334 p95=0.08477 p99=0.08537 p999=0.08602 AVG=0.08333; P99_AVG_RATIO=1.02444; ITEMS_PER_NODE=8333.3 num_servers=13 p50=0.07692 p95=0.07831 p99=0.07888 p999=0.07954 AVG=0.07692; P99_AVG_RATIO=1.02544; ITEMS_PER_NODE=7692.3 num_servers=14 p50=0.07143 p95=0.07277 p99=0.07332 p999=0.07396 AVG=0.07143; P99_AVG_RATIO=1.02648; ITEMS_PER_NODE=7142.9 num_servers=25 p50=0.04000 p95=0.04102 p99=0.04145 p999=0.04193 AVG=0.04000; P99_AVG_RATIO=1.03625; ITEMS_PER_NODE=4000.0 num_servers=50 p50=0.02000 p95=0.02073 p99=0.02103 p999=0.02138 AVG=0.02000; P99_AVG_RATIO=1.05150; ITEMS_PER_NODE=2000.0 num_servers=100 p50=0.01000 p95=0.01052 p99=0.01074 p999=0.01099 AVG=0.01000; P99_AVG_RATIO=1.07400; ITEMS_PER_NODE=1000.0 num_servers=1000 p50=0.00100 p95=0.00117 p99=0.00124 p999=0.00132 AVG=0.00100; P99_AVG_RATIO=1.24000; ITEMS_PER_NODE=100.0 power of two choices num_items=1000: num_servers=3 p50=0.33300 p95=0.33400 p99=0.33500 p999=0.33600 AVG=0.33333; P99_AVG_RATIO=1.00500; ITEMS_PER_NODE=333.3 num_servers=5 p50=0.20000 p95=0.20100 p99=0.20200 p999=0.20300 AVG=0.20000; P99_AVG_RATIO=1.01000; ITEMS_PER_NODE=200.0 num_servers=10 p50=0.10000 p95=0.10100 p99=0.10200 p999=0.10200 AVG=0.10000; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=100.0 num_servers=11 p50=0.09100 p95=0.09200 p99=0.09300 p999=0.09300 AVG=0.09091; P99_AVG_RATIO=1.02300; ITEMS_PER_NODE=90.9 num_servers=12 p50=0.08300 p95=0.08500 p99=0.08500 p999=0.08600 AVG=0.08333; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=83.3 num_servers=13 p50=0.07700 p95=0.07800 p99=0.07900 p999=0.07900 AVG=0.07692; P99_AVG_RATIO=1.02700; ITEMS_PER_NODE=76.9 num_servers=14 p50=0.07200 p95=0.07300 p99=0.07300 p999=0.07400 AVG=0.07143; P99_AVG_RATIO=1.02200; ITEMS_PER_NODE=71.4 num_servers=25 
p50=0.04000 p95=0.04100 p99=0.04200 p999=0.04200 AVG=0.04000; P99_AVG_RATIO=1.05000; ITEMS_PER_NODE=40.0 num_servers=50 p50=0.02000 p95=0.02100 p99=0.02200 p999=0.02200 AVG=0.02000; P99_AVG_RATIO=1.10000; ITEMS_PER_NODE=20.0 num_servers=100 p50=0.01000 p95=0.01100 p99=0.01200 p999=0.01200 AVG=0.01000; P99_AVG_RATIO=1.20000; ITEMS_PER_NODE=10.0 num_servers=1000 p50=0.00100 p95=0.00200 p99=0.00200 p999=0.00300 AVG=0.00100; P99_AVG_RATIO=2.00000; ITEMS_PER_NODE=1.0 power of two choices num_items=2000: num_servers=3 p50=0.33350 p95=0.33400 p99=0.33400 p999=0.33450 AVG=0.33333; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=666.7 num_servers=5 p50=0.20000 p95=0.20050 p99=0.20100 p999=0.20150 AVG=0.20000; P99_AVG_RATIO=1.00500; ITEMS_PER_NODE=400.0 num_servers=10 p50=0.10000 p95=0.10050 p99=0.10100 p999=0.10100 AVG=0.10000; P99_AVG_RATIO=1.01000; ITEMS_PER_NODE=200.0 num_servers=11 p50=0.09100 p95=0.09150 p99=0.09200 p999=0.09200 AVG=0.09091; P99_AVG_RATIO=1.01200; ITEMS_PER_NODE=181.8 num_servers=12 p50=0.08350 p95=0.08400 p99=0.08400 p999=0.08450 AVG=0.08333; P99_AVG_RATIO=1.00800; ITEMS_PER_NODE=166.7 num_servers=13 p50=0.07700 p95=0.07750 p99=0.07800 p999=0.07800 AVG=0.07692; P99_AVG_RATIO=1.01400; ITEMS_PER_NODE=153.8 num_servers=14 p50=0.07150 p95=0.07200 p99=0.07250 p999=0.07250 AVG=0.07143; P99_AVG_RATIO=1.01500; ITEMS_PER_NODE=142.9 num_servers=25 p50=0.04000 p95=0.04050 p99=0.04100 p999=0.04100 AVG=0.04000; P99_AVG_RATIO=1.02500; ITEMS_PER_NODE=80.0 num_servers=50 p50=0.02000 p95=0.02050 p99=0.02100 p999=0.02100 AVG=0.02000; P99_AVG_RATIO=1.05000; ITEMS_PER_NODE=40.0 num_servers=100 p50=0.01000 p95=0.01050 p99=0.01100 p999=0.01100 AVG=0.01000; P99_AVG_RATIO=1.10000; ITEMS_PER_NODE=20.0 num_servers=1000 p50=0.00100 p95=0.00150 p99=0.00200 p999=0.00200 AVG=0.00100; P99_AVG_RATIO=2.00000; ITEMS_PER_NODE=2.0 power of two choices num_items=5000: num_servers=3 p50=0.33340 p95=0.33360 p99=0.33360 p999=0.33380 AVG=0.33333; P99_AVG_RATIO=1.00080; ITEMS_PER_NODE=1666.7 
num_servers=5 p50=0.20000 p95=0.20020 p99=0.20040 p999=0.20060 AVG=0.20000; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=1000.0 num_servers=10 p50=0.10000 p95=0.10020 p99=0.10040 p999=0.10040 AVG=0.10000; P99_AVG_RATIO=1.00400; ITEMS_PER_NODE=500.0 num_servers=11 p50=0.09100 p95=0.09120 p99=0.09120 p999=0.09140 AVG=0.09091; P99_AVG_RATIO=1.00320; ITEMS_PER_NODE=454.5 num_servers=12 p50=0.08340 p95=0.08360 p99=0.08360 p999=0.08380 AVG=0.08333; P99_AVG_RATIO=1.00320; ITEMS_PER_NODE=416.7 num_servers=13 p50=0.07700 p95=0.07720 p99=0.07720 p999=0.07740 AVG=0.07692; P99_AVG_RATIO=1.00360; ITEMS_PER_NODE=384.6 num_servers=14 p50=0.07140 p95=0.07160 p99=0.07180 p999=0.07180 AVG=0.07143; P99_AVG_RATIO=1.00520; ITEMS_PER_NODE=357.1 num_servers=25 p50=0.04000 p95=0.04020 p99=0.04040 p999=0.04040 AVG=0.04000; P99_AVG_RATIO=1.01000; ITEMS_PER_NODE=200.0 num_servers=50 p50=0.02000 p95=0.02020 p99=0.02040 p999=0.02040 AVG=0.02000; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=100.0 num_servers=100 p50=0.01000 p95=0.01020 p99=0.01040 p999=0.01040 AVG=0.01000; P99_AVG_RATIO=1.04000; ITEMS_PER_NODE=50.0 num_servers=1000 p50=0.00100 p95=0.00120 p99=0.00140 p999=0.00140 AVG=0.00100; P99_AVG_RATIO=1.40000; ITEMS_PER_NODE=5.0 power of two choices num_items=10000: num_servers=3 p50=0.33330 p95=0.33340 p99=0.33350 p999=0.33360 AVG=0.33333; P99_AVG_RATIO=1.00050; ITEMS_PER_NODE=3333.3 num_servers=5 p50=0.20000 p95=0.20010 p99=0.20020 p999=0.20030 AVG=0.20000; P99_AVG_RATIO=1.00100; ITEMS_PER_NODE=2000.0 num_servers=10 p50=0.10000 p95=0.10010 p99=0.10020 p999=0.10020 AVG=0.10000; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=1000.0 num_servers=11 p50=0.09090 p95=0.09100 p99=0.09110 p999=0.09110 AVG=0.09091; P99_AVG_RATIO=1.00210; ITEMS_PER_NODE=909.1 num_servers=12 p50=0.08330 p95=0.08350 p99=0.08350 p999=0.08360 AVG=0.08333; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=833.3 num_servers=13 p50=0.07690 p95=0.07700 p99=0.07710 p999=0.07720 AVG=0.07692; P99_AVG_RATIO=1.00230; ITEMS_PER_NODE=769.2 num_servers=14 
p50=0.07140 p95=0.07160 p99=0.07160 p999=0.07170 AVG=0.07143; P99_AVG_RATIO=1.00240; ITEMS_PER_NODE=714.3 num_servers=25 p50=0.04000 p95=0.04010 p99=0.04020 p999=0.04020 AVG=0.04000; P99_AVG_RATIO=1.00500; ITEMS_PER_NODE=400.0 num_servers=50 p50=0.02000 p95=0.02010 p99=0.02020 p999=0.02020 AVG=0.02000; P99_AVG_RATIO=1.01000; ITEMS_PER_NODE=200.0 num_servers=100 p50=0.01000 p95=0.01010 p99=0.01020 p999=0.01020 AVG=0.01000; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=100.0 num_servers=1000 p50=0.00100 p95=0.00110 p99=0.00120 p999=0.00120 AVG=0.00100; P99_AVG_RATIO=1.20000; ITEMS_PER_NODE=10.0 power of two choices num_items=100000: num_servers=3 p50=0.33333 p95=0.33334 p99=0.33335 p999=0.33336 AVG=0.33333; P99_AVG_RATIO=1.00005; ITEMS_PER_NODE=33333.3 num_servers=5 p50=0.20000 p95=0.20001 p99=0.20002 p999=0.20003 AVG=0.20000; P99_AVG_RATIO=1.00010; ITEMS_PER_NODE=20000.0 num_servers=10 p50=0.10000 p95=0.10001 p99=0.10002 p999=0.10002 AVG=0.10000; P99_AVG_RATIO=1.00020; ITEMS_PER_NODE=10000.0 num_servers=11 p50=0.09091 p95=0.09092 p99=0.09093 p999=0.09093 AVG=0.09091; P99_AVG_RATIO=1.00023; ITEMS_PER_NODE=9090.9 num_servers=12 p50=0.08333 p95=0.08335 p99=0.08335 p999=0.08336 AVG=0.08333; P99_AVG_RATIO=1.00020; ITEMS_PER_NODE=8333.3 num_servers=13 p50=0.07692 p95=0.07694 p99=0.07694 p999=0.07695 AVG=0.07692; P99_AVG_RATIO=1.00022; ITEMS_PER_NODE=7692.3 num_servers=14 p50=0.07143 p95=0.07144 p99=0.07145 p999=0.07145 AVG=0.07143; P99_AVG_RATIO=1.00030; ITEMS_PER_NODE=7142.9 num_servers=25 p50=0.04000 p95=0.04001 p99=0.04002 p999=0.04002 AVG=0.04000; P99_AVG_RATIO=1.00050; ITEMS_PER_NODE=4000.0 num_servers=50 p50=0.02000 p95=0.02001 p99=0.02002 p999=0.02002 AVG=0.02000; P99_AVG_RATIO=1.00100; ITEMS_PER_NODE=2000.0 num_servers=100 p50=0.01000 p95=0.01001 p99=0.01002 p999=0.01002 AVG=0.01000; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=1000.0 num_servers=1000 p50=0.00100 p95=0.00101 p99=0.00102 p999=0.00102 AVG=0.00100; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=100.0
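A minimal sketch of this kind of balls-in-bins simulation, which is not the code that produced the output above, could look like the following. It makes several assumptions: the function names simulate and percentile are made up for illustration, server loads are pooled across all iterations before taking percentiles, percentiles use a simple nearest-rank rule, and the two choices are drawn with replacement, so both picks may land on the same server.

```python
import random


def simulate(num_items, num_servers, iterations, two_choices=False, seed=0):
    """Place num_items on num_servers repeatedly; return sorted server loads.

    The returned list pools the per-server item counts from every
    iteration (an assumption about how percentiles are aggregated).
    """
    rng = random.Random(seed)
    loads = []
    for _ in range(iterations):
        counts = [0] * num_servers
        for _ in range(num_items):
            chosen = rng.randrange(num_servers)
            if two_choices:
                # Power of two random choices: pick a second candidate
                # and keep whichever currently holds fewer items.
                other = rng.randrange(num_servers)
                if counts[other] < counts[chosen]:
                    chosen = other
            counts[chosen] += 1
        loads.extend(counts)
    loads.sort()
    return loads


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list, with 0 <= p <= 1."""
    index = min(len(sorted_values) - 1, int(p * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__":
    for two_choices in (False, True):
        loads = simulate(1000, 13, 1000, two_choices=two_choices)
        label = "two choices" if two_choices else "random"
        print(label, "p50:", percentile(loads, 0.50),
              "p99:", percentile(loads, 0.99),
              "p999:", percentile(loads, 0.999))
```

Dividing each reported load by num_items gives fractions comparable to the columns in the raw output, although exact values will differ with the seed, the number of iterations, and the percentile rule.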
I was wondering: how often do nanosecond timestamps collide on modern systems? The answer is: very often, like 5% of all samples, when reading the clock on all 4 physical cores at the same time. As a result, I think it is unsafe to assume that a raw nanosecond timestamp is a unique identifier.

I wrote a small test program to test this. I used Go, which records both the "absolute" time and the "monotonic clock" relative time on each call to time.Now(), so I compared both the relative difference between consecutive timestamps, as well as just the absolute timestamps. As expected, the behavior depends on the system, so I observe very different results on Mac OS X and Linux.

On Linux, within a single thread, both the absolute and monotonic times always increase. On my system, the minimum increment was 32 ns. Between threads, approximately 5% of the absolute times were exactly the same as other threads. Even with 2 threads on a 4 core system, approximately 2% of timestamps collided.

On Mac OS X, the absolute time has microsecond resolution, so there are an astronomical number of collisions when I repeat this same test. Even within a thread, I often observe that the monotonic clock does not increment.

See the test program on Github if you are curious.
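The counting step of such a test can be sketched as follows. This is a rough Python illustration, not the Go test program mentioned above: the function names are hypothetical, a collision is assumed to mean a timestamp value that was also recorded by a different thread, and because standard CPython threads do not read the clock truly in parallel, the measured rate should not be expected to match the numbers reported here.

```python
import threading
import time
from collections import Counter


def collect_timestamps(num_threads, samples_per_thread):
    """Have each thread record samples_per_thread nanosecond timestamps."""
    results = [None] * num_threads
    barrier = threading.Barrier(num_threads)

    def worker(index):
        barrier.wait()  # start all threads sampling at about the same time
        results[index] = [time.time_ns() for _ in range(samples_per_thread)]

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def cross_thread_collision_fraction(per_thread_samples):
    """Fraction of samples whose value was also seen by another thread."""
    threads_seeing = Counter()
    for samples in per_thread_samples:
        for timestamp in set(samples):  # count each thread once per value
            threads_seeing[timestamp] += 1
    total = sum(len(samples) for samples in per_thread_samples)
    if total == 0:
        return 0.0
    collided = sum(1 for samples in per_thread_samples
                   for timestamp in samples
                   if threads_seeing[timestamp] > 1)
    return collided / total


if __name__ == "__main__":
    samples = collect_timestamps(num_threads=4, samples_per_thread=100_000)
    print("cross-thread collision fraction:",
          cross_thread_collision_fraction(samples))
```

Repeated values within a single thread are intentionally not counted as collisions here; they would be a separate measurement of clock resolution.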
The read() and write() system calls take a variable-length byte array as an argument. As a simplified model, the time for the system call should be some constant "per-call" time, plus time directly proportional to the number of bytes in the array. That is, the time for each call should be time = (per_call_minimum_time) + (array_len) × (per_byte_time). With this model, using a larger buffer should increase throughput, asymptotically approaching 1/per_byte_time. I was curious: do real system calls behave this way? What are the ideal buffer sizes for read() and write() if we want to maximize throughput? I decided to do some experiments with blocking I/O. These are not rigorous, and I suspect the results will vary significantly if the hardware and software are different from the system I tested.

The really short answer is that a buffer of 32 KiB is a good starting point on today's systems, and I would want to measure the performance to go beyond that. However, for large writes, performance can continue to increase with larger buffers. On Linux, the simple model holds for small buffers (≤ 4 KiB), but once the program approaches the maximum throughput, the throughput becomes highly variable and in many cases decreases as the buffers get larger. For blocking I/O, approximately 32 KiB is large enough to hit the maximum throughput for read(), but write() throughput improves with buffers up to around 256 KiB - 1 MiB.

The reason for the asymmetry is that the Linux kernel will only write less than the entire buffer (a "short write") if there is an error (e.g. a signal causing EINTR). Thus, larger write buffers mean the operating system needs to switch to the process less often. On the other hand, "short reads", where a read() returns less than the maximum length, become increasingly common as the buffer size increases, which diminishes the benefit. There is a SO_RCVLOWAT socket option to change this that I did not test.
The experiments were run on two 16 CPU Google Cloud T2D instances, which use AMD EPYC Milan processors (3rd generation, released in 2021). Each core is a real physical core. I used Ubuntu 23.04 running kernel 6.2.0-1005-gcp. My benchmark program is written in Rust and is available on GitHub. On localhost, Unix sockets were able to transfer data at approximately 9000 MiB/s. Localhost TCP sockets were a bit slower, around 7000 MiB/s. When using two separate cloud VMs with a networking throughput limit of 32 Gbps = 3800 MiB/s, I needed to use 6 TCP sockets to reliably reach that maximum throughput. A single TCP socket gets around 1400 MiB/s with 256 KiB buffers, with peaks as high as 2200 MiB/s.
Experiment 1: /dev/zero and /dev/urandom
My first experiment is reading from the /dev/zero and /dev/urandom devices. These are software devices implemented by the kernel, so they should have low overhead and low variability, since other tasks are not involved. Reading from /dev/urandom should be much slower than /dev/zero since the kernel must generate random bytes, rather than just zeros. The chart below shows the throughput for reading from /dev/zero as the buffer size is increased. The results show that the basic linear time per system call model holds until the system reaches maximum throughput (256 KiB buffer = 39000 MiB/s for /dev/zero, or 16 KiB = 410 MiB/s for /dev/urandom). As the buffer size increases further, the throughput decreases. This suggests that some other cost for larger buffers starts to outweigh the reduction in the number of system calls. Perhaps CPU caches become less effective? The AMD EPYC Milan (3rd gen) CPU I tested on has 32 KiB of L1 data cache and 512 KiB of L2 data cache per core. The performance decreases don't exactly line up with these numbers, but it still seems plausible. The numbers for /dev/urandom are substantially lower, but otherwise similar.
I did a linear least-squares fit on the average time per system call, shown in the following chart. If I use all the data, the fit is not good, because the trend changes for larger buffers. However, if I use the data up to the maximum throughput at 256 KiB, the fit is very good, as shown on the chart below. The linear fit models the minimum time per system call as 167 ns, with 0.0235 ns/byte additional time. If we want to use smaller buffers, using a 64 KiB buffer for reading from /dev/zero gets within 5% of the maximum throughput.
Experiment 2: Unix and localhost TCP sockets
Exchanging data with other processes is the thing I am actually interested in, so I tested Unix and TCP sockets on a single machine. In this case, I varied both the write buffer size and the read buffer size. Unfortunately, these results vary a lot. A more robust comparison would require running each experiment many times, and using some sort of statistical comparison. However, this "quick and dirty" experiment satisfied my curiosity, so I didn't do that. As a result, my conclusions here are vague. The simple model that increasing buffer size should decrease overhead is true, but only until the buffers are about 4 KiB. Above that point, the results start to be highly variable, and it is much harder to draw general conclusions. However, it appears that increasing the write buffer size is generally quite helpful up to at least 256 KiB, and as much as 1 MiB was often needed to get the highest localhost throughput. I suspect this is because on Linux with blocking sockets, write() will not return until it has written all the data in the buffer, unless there is an error (e.g. EINTR). As a result, passing a large buffer means the kernel can do a lot of the work without needing to switch back to user space. Unfortunately, the same is not true for read(), which often returns "short reads" with any data that is available in the buffer.
This starts with buffer sizes around 2 KiB, with the percentage of short reads increasing as the buffer size gets larger. This means the simple model does not hold, because we aren't actually increasing the bytes per read call. I suspect this is one reason this microbenchmark is likely not representative of real programs. A real program will do something with the buffer, which will provide time for more data to be buffered in the kernel, and would probably decrease the number of short reads. This likely means larger buffers are in practice more useful than this microbenchmark suggests. As a result of this, the highest throughput was often achievable with small read buffers. I'm somewhat arbitrarily selecting 16 KiB as the best read buffer, and 256 KiB as the best write buffer, although a 1 MiB write buffer seems to be ... To give a sense of how variable the results are, the plot below shows the local Unix socket throughput for each read and write buffer size. I apologize for the ugly plot. I did not want to spend the time to make it more beautiful. This plot is interactive so you can slice the data to the area of interest. I recommend zooming in to the left-hand side with read buffers up to about 300 KiB. The first thing to note is that, at least on Linux with blocking sockets, the writer will almost never have a "short write", where the write system call returns before writing all the data in the buffer. Unless there is a signal (EINTR) or some other "error" condition, write() will not return until all the bytes are written. The same is not true for reads. The read() system call will often return a "short" read, starting around buffer sizes of 2 KiB. The percentage of short reads generally increases as buffer sizes get bigger, which is logical. Another note is that sockets have in-kernel send and receive buffers. I did not tune these at all. It is possible that better performance could be achieved by tuning these settings, but that was not my goal.
I wanted to know what happens "out of the box" for general-purpose programs without any special tuning.
Experiment 3: TCP between two hosts
In this experiment, I used two separate hosts connected with 32 Gbps networking in Google Cloud. I first tested the TCP throughput using iperf, to independently verify the network performance. A single TCP connection with iperf is not enough to fully utilize the network. I tried fiddling with some command line options and with kernel settings like net.ipv4.tcp_rmem and wasn't able to get much better than about 12 Gb/s = 1400 MiB/s. The throughput is also highly variable. Here is some example output with iperf reporting at 2 second intervals, where you can see the throughput ranging from 10 to 19 Gb/s, with an average over the entire interval of 12 Gb/s. To hit the maximum network throughput, I need to use 6 or more parallel TCP connections (iperf -c IP_ADDRESS --time 60 --interval 2 -l 262144 -P 6). Using 3 connections gets around 26 Gb/s, and using 4 or 5 will occasionally hit the maximum, but will also occasionally drop down. Using at least 6 seems to reliably stay at the maximum. Due to this variability, it is hard to draw any conclusions about buffer size. In particular: a single TCP connection is not limited by CPU. The system uses about 40% of a single CPU core, basically all in the kernel. This is more about how the buffer sizes may impact scheduling choices. That said, it is clear that you cannot hit the maximum throughput with a small write buffer. The experiments with 4 KiB write buffers reached approximately 300 MiB/s, while an 8 KiB write buffer was much faster, around 1400 MiB/s. Larger still generally seems better, up to around 256 KiB, which occasionally reached 2200 MiB/s ≈ 18.5 Gb/s. The plot below shows the TCP socket throughput for each read and write buffer size. Again, I apologize for the ugly plot.
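The short-read behaviour described in Experiment 2 can be seen in miniature without any benchmark. The sketch below is an illustration in Go, not part of the Rust benchmark: it uses an os.Pipe instead of a socket (an assumption made to keep the example self-contained), and the function name pipeReadDemo is made up. A single Read returns as soon as some data is available, even if the buffer is larger, while io.ReadFull keeps calling Read until the buffer is full.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// pipeReadDemo returns the byte count of one plain Read on a 16-byte buffer
// when only 5 bytes are available, and the byte count of io.ReadFull on the
// same buffer after 16 more bytes have been written in two chunks.
func pipeReadDemo() (single int, full int, err error) {
	r, w, err := os.Pipe()
	if err != nil {
		return 0, 0, err
	}
	defer r.Close()
	defer w.Close()

	buf := make([]byte, 16)

	// Only 5 bytes are available, so Read returns a "short read".
	if _, err = w.Write([]byte("hello")); err != nil {
		return 0, 0, err
	}
	if single, err = r.Read(buf); err != nil {
		return 0, 0, err
	}

	// Two small writes; both fit in the pipe's kernel buffer without blocking.
	if _, err = w.Write([]byte("01234567")); err != nil {
		return 0, 0, err
	}
	if _, err = w.Write([]byte("89abcdef")); err != nil {
		return 0, 0, err
	}
	// io.ReadFull loops over Read until buf is completely filled.
	full, err = io.ReadFull(r, buf)
	return single, full, err
}

func main() {
	single, full, err := pipeReadDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("single Read returned %d of 16 bytes\n", single)
	fmt.Printf("io.ReadFull returned %d of 16 bytes\n", full)
}
```

A loop like io.ReadFull makes the caller's logic simpler, but it does not reduce the number of system calls; that is why short reads weaken the benefit of larger read buffers in the measurements above.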
This is a post for myself, because I wasted a lot of time understanding this bug, and I want to be able to remember it in the future. I expect close to zero others to be interested. The C standard library function isspace() returns a non-zero value (true) for the six "standard" ASCII white-space characters ('\t', '\n', '\v', '\f', '\r', ' '), and any locale-specific characters. By default, a program starts in the "C" locale, in which isspace() only returns true for the six ASCII white-space characters. However, if the program changes locales, it can return true for other values. As a result, unless you really understand locales, you should use your own version of this function, or ICU4C's u_isspace() function. An implementation of isspace() for ASCII is one line:
/* Returns true for the 6 ASCII white-space characters: \t \n \v \f \r ' '. */
int isspace_ascii(int c) { return c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r' || c == ' '; }
I ran into this because on Mac OS X, Postgres switches to the system's default locale, which is something that uses UTF-8 (e.g. en_US.UTF-8, fr_CA.UTF-8, etc). In this case, isspace() returns true for Unicode white-space values, which include 0x85 = NEL = Next Line, and 0xA0 = NBSP = No-Break Space. This caused a bug in parsing Postgres Hstore values that use Unicode. I have attempted to submit a patch to fix this (mailing list post, commitfest entry). For a program to demonstrate the behaviour on different systems, see isspace_locale on GitHub.
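The same ASCII-versus-Unicode distinction exists outside C. As a loose analogy only (Go has no locale-dependent isspace(), and this is not the code from the Postgres patch), the sketch below compares an ASCII-only check, written here as a hypothetical isSpaceASCII helper, with Go's unicode.IsSpace, which also treats U+0085 (NEL) and U+00A0 (NBSP) as white space.

```go
package main

import (
	"fmt"
	"unicode"
)

// isSpaceASCII reports whether c is one of the six ASCII white-space
// characters: \t \n \v \f \r and space. Hypothetical helper for this sketch.
func isSpaceASCII(c rune) bool {
	switch c {
	case '\t', '\n', '\v', '\f', '\r', ' ':
		return true
	}
	return false
}

func main() {
	// 0x85 is NEL (Next Line) and 0xA0 is NBSP (No-Break Space).
	for _, c := range []rune{' ', '\n', 0x85, 0xA0, 'a'} {
		fmt.Printf("U+%04X ascii=%v unicode=%v\n",
			c, isSpaceASCII(c), unicode.IsSpace(c))
	}
}
```

A parser that must split only on ASCII white space should use the ASCII-only check, for the same reason the post recommends avoiding isspace() once a locale has been set.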
More in programming
Revealed: How the UK tech secretary uses ChatGPT for policy advice by Chris Stokel-Walker for the New Scientist
This book’s introduction started by defining strategy as “making decisions.” Then we dug into exploration, diagnosis, and refinement: three chapters where you could argue that we didn’t decide anything at all. Clarifying the problem to be solved is the prerequisite of effective decision making, but eventually decisions do have to be made. Here in this chapter on policy, and the following chapter on operations, we finally start to actually make some decisions. In this chapter, we’ll dig into:
How we define policy, and how setting policy differs from operating policy as discussed in the next chapter
The structured steps for setting policy
How many policies should you set? Is it preferable to have one policy, many policies, or does it not matter much either way?
Recurring kinds of policies that appear frequently in strategies
Why it’s valuable to be intentional about your strategy’s altitude, and how engineers and executives generally maintain different altitudes in their strategies
Criteria to use for evaluating whether your policies are likely to be impactful
How to develop novel policies, and why it’s rare
Why having multiple bundles of alternative policies is generally a phase in strategy development that indicates a gap in your diagnosis
How policies that ignore constraints sound inspirational, but accomplish little
Dealing with ambiguity and uncertainty created by missing strategies from cross-functional stakeholders
By the end, you’ll be ready to evaluate why an existing strategy’s policies are struggling to make an impact, and to start iterating on policies for a strategy of your own. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts.
What is policy?
Policy is interpreting your diagnosis into a concrete plan. That plan will be a collection of decisions, tradeoffs, and approaches.
They’ll range from coding practices, to hiring mandates, to architectural decisions, to guidance about how choices are made within your organization. An effective policy solves the entirety of the strategy’s diagnosis, although the diagnosis itself is encouraged to specify which aspects can be ignored. For example, the strategy for working with private equity ownership acknowledges in its diagnosis that they don’t have clear guidance on what kind of reduction to expect: Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don’t have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction. Faced with that uncertainty, the policy simply acknowledges the ambiguity and commits to reconsider when more information becomes available: We believe our new ownership will provide a specific target for Research and Development (R&D) operating expenses during the upcoming financial year planning. We will revise these policies again once we have explicit targets, and will delay planning around reductions until we have those numbers to avoid running two overlapping processes. There are two frequent points of confusion when creating policies that are worth addressing directly: Policy is a subset of strategy, rather than the entirety of strategy, because policy is only meaningful in the context of the strategy’s diagnosis. For example, the “N-1 backfill policy” makes sense in the context of new, private equity ownership. The policy wouldn’t work well in a rapidly expanding organization. Any strategy without a policy is useless, but you’ll also find policies without context aren’t worth much either. This is particularly unfortunate, because so often strategies are communicated without those critical sections. 
Policy describes how tradeoffs should be made, but it doesn’t verify how the tradeoffs are actually being made in practice. The next chapter on operations covers how to inspect an organization’s behavior to ensure policies are followed. When reworking a strategy to be more readable, it often makes sense to merge policy and operation sections together. However, when drafting strategy it’s valuable to keep them separate. Yes, you might use a weekly meeting to review whether the policy is being followed, but whether it’s an effective policy is independent of having such a meeting, and what operational mechanisms you use will vary depending on the number of policies you intend to implement. With this definition in mind, now we can move onto the more interesting discussion of how to set policy. How to set policy Every part of writing a strategy feels hard when you’re doing it, but I personally find that writing policy either feels uncomfortably easy or painfully challenging. It’s never a happy medium. Fortunately, the exploration and diagnosis usually come together to make writing your policy simple: although sometimes that simple conclusion may be a difficult one to swallow. The steps I follow to write a strategy’s policy are: Review diagnosis to ensure it captures the most important themes. It doesn’t need to be perfect, but it shouldn’t have omissions so obvious that you can immediately identify them. Select policies that address the diagnosis. Explicitly match each policy to one or more diagnoses that it addresses. Continue adding policies until every diagnosis is covered. This is a broad instruction, but it’s simpler than it sounds because you’ll typically select from policies identified during your exploration phase. However, there certainly is space to tweak those policies, and to reapply familiar policies to new circumstances. 
If you do find yourself developing a novel policy, there’s a later section in this chapter, Developing novel policies, that addresses that topic in more detail. Consolidate policies in cases where they overlap or adjoin. For example, two policies about specific teams might be generalized into a policy about all teams in the engineering organization. Backtest policy against recent decisions you’ve made. This is particularly effective if you maintain a decision log in your organization. Mine for conflict once again, much as you did in developing your diagnosis. Emphasize feedback from teams and individuals with a different perspective than your own, but don’t wholly eliminate those that you agree with. Just as it’s easy to crowd out opposing views in diagnosis if you don’t solicit their input, it’s possible to accidentally crowd out your own perspective if you anchor too much on others’ perspectives. Consider refinement if you finish writing, and you just aren’t sure your approach works – that’s fine! Return to the refinement phase by deploying one of the refinement techniques to increase your conviction. Remember that we talk about strategy like it’s done in one pass, but almost all real strategy takes many refinement passes. The steps of writing policy are relatively pedestrian, largely because you’ve done so much of the work already in the exploration, diagnosis, and refinement steps. If you skip those phases, you’d likely follow the above steps for writing policy, but the expected quality of the policy itself would be far lower. How many policies? Addressing the entirety of the diagnosis is often complex, which is why most strategies feature a set of policies rather than just one. The strategy for decomposing a monolithic application is not one policy deciding not to decompose, but a series of four policies: Business units should always operate in their own code repository and monolith. New integrations across business unit monoliths should be done using gRPC. 
Except for new business unit monoliths, we don’t allow new services. Merge existing services into business-unit monoliths where you can. Four isn’t universally the right number either. It’s simply the number that was required to solve that strategy’s diagnosis. With an excellent diagnosis, your policies will often feel inevitable, and perhaps even boring. That’s great: what makes a policy good is that it’s effective, not that it’s novel or inspiring. Kinds of policies While there are so many policies you can write, I’ve found they generally fall into one of four major categories: approvals, allocations, direction, and guidance. This section introduces those categories. Approvals define the process for making a recurring decision. This might require invoking an architecture advice process, or it might require involving an authority figure like an executive. In the Index post-acquisition integration strategy, there were a number of complex decisions to be made, and the approval mechanism was: Escalations come to paired leads: given our limited shared context across teams, all escalations must come to both Stripe’s Head of Traffic Engineering and Index’s Head of Engineering. This allowed the acquired and acquiring teams to start building trust between each other by ensuring both were consulted before any decision was finalized. On the other hand, the user data access strategy’s approval strategy was more focused on managing corporate risk: Exceptions must be granted in writing by CISO. While our overarching Engineering Strategy states that we follow an advisory architecture process as described in Facilitating Software Architecture, the customer data access policy is an exception and must be explicitly approved, with documentation, by the CISO. Start that process in the #ciso channel. These two different approval processes had different goals, so they made tradeoffs differently. 
There are so many ways to tweak approval, allowing for many different tradeoffs between safety, productivity, and trust. Allocations describe how resources are split across multiple potential investments. Allocations are the most concrete statement of organizational priority, and also articulate the organization’s belief about how productivity happens in teams. Some companies believe you go fast by swarming more people onto critical problems. Other companies believe you go fast by forcing teams to solve problems without additional headcount. Both can work, and teach you something important about the company’s beliefs. The strategy on Uber’s service migration has two concrete examples of allocation policies. The first describes the Infrastructure engineering team’s allocation between manual provision tasks and investing into creating a self-service provisioning platform: Constrain manual provisioning allocation to maximize investment in self-service provisioning. The service provisioning team will maintain a fixed allocation of one full time engineer on manual service provisioning tasks. We will move the remaining engineers to work on automation to speed up future service provisioning. This will degrade manual provisioning in the short term, but the alternative is permanently degrading provisioning by the influx of new service requests from newly hired product engineers. The second allocation policy is implicitly noted in this strategy’s diagnosis, where it describes the allocation policy in the Engineering organization’s higher altitude strategy: Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate as product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing. 
Allocation policies often create a surprising amount of clarity for the team, and I include them in almost every policy I write either explicitly, or implicitly in a higher altitude strategy. Direction provides explicit instruction on how a decision must be made. This is the right tool when you know where you want to go, and exactly the way that you want to get there. Direction is appropriate for problems you understand clearly, and you value consistency more than empowering individual judgment. Direction works well when you need an unambiguous policy that doesn’t leave room for interpretation. For example, Calm’s policy for working in the monolith: We write all code in the monolith. It has been ambiguous if new code (especially new application code) should be written in our JavaScript monolith, or if all new code must be written in a new service outside of the monolith. This is no longer ambiguous: all new code must be written in the monolith. In the rare case that there is a functional requirement that makes writing in the monolith implausible, then you should seek an exception as described below. In that case, the team couldn’t agree on what should go into the monolith. Individuals would often make incompatible decisions, so creating consistency required removing personal judgment from the equation. Sometimes judgment is the issue, and sometimes consistency is difficult due to misaligned incentives. A good example of this comes in strategy on working with new Private Equity ownership: We will move to an “N-1” backfill policy, where departures are backfilled with a less senior level. We will also institute a strict maximum of one Principal Engineer per business unit. It’s likely that hiring managers would simply ignore this backfill policy if it was stated more softly, although sometimes less forceful policies are useful. Guidance provides a recommendation about how a decision should be made. 
Guidance is useful when there’s enough nuance, ambiguity, or complexity that you can explain the desired destination, but you can’t mandate the path to reaching it. One example of guidance comes from the Index acquisition integration strategy: Minimize changes to tokenization environment: because point-of-sale devices directly work with customer payment details, the API that directly supports the point-of-sale device must live within our secured environment where payment details are stored. However, any other functionality must not be added to our tokenization environment. This might read like direction, but it’s clarifying the desired outcome of avoiding unnecessary complexity in the tokenization environment. However, it’s not able to articulate what complexity is necessary, so ultimately it’s guidance because it requires significant judgment to interpret. A second example of guidance comes in the strategy on decomposing a monolithic codebase: Merge existing services into business-unit monoliths where you can. We believe that each choice to move existing services back into a monolith should be made “in the details” rather than from a top-down strategy perspective. Consequently, we generally encourage teams to wind down their existing services outside of their business unit’s monolith, but defer to teams to make the right decision for their local context. This is another case of knowing the desired outcome, but encountering too much uncertainty to direct the team on how to get there. If you ask five engineers about whether it’s possible to merge a given service back into a monolithic codebase, they’ll probably disagree. That’s fine, and highlights the value of guidance: it makes it possible to make incremental progress in areas where more concrete direction would cause confusion. When you’re working on a strategy’s policy section, it’s important to consider all of these categories. 
Which feel most natural to use will vary depending on your team and role, but they’re all usable:
If you’re a developer productivity team, you might have to lean heavily on guidance in your policies and increased support for that guidance within the details of your platform.
If you’re an executive, you might lean heavily on direction. Indeed, you might lean too heavily on direction, where guidance often works better for areas where you understand the direction but not the path.
If you’re a product engineering organization, you might have to narrow the scope of your direction to the engineers within that organization to deal with the realities of complex cross-organization dynamics.
Finally, if you have a clear approach you want to take that doesn’t fit cleanly into any of these categories, then don’t let this framework dissuade you. Give it a try, and adapt if it doesn’t initially work out.
Maintaining strategy altitude
The chapter on when to write engineering strategy introduced the concept of strategy altitude, which is being deliberate about where certain kinds of policies are created within your organization. Without repeating that section in its entirety, it’s particularly relevant when you set policy to consider how your new policies eliminate flexibility within your organization. Consider these two somewhat opposing strategies:
Stripe’s Sorbet strategy only worked in an organization that enforced the use of a single programming language across (essentially) all teams.
Uber’s service migration strategy worked well in an organization that was unwilling to enforce consistent programming language adoption across teams.
Stripe’s organization-altitude policy took away the freedom of individual teams to select their preferred technology stack. In return, they unlocked the ability to centralize investment in a powerful way.
Uber went the opposite way, unlocking the ability of teams to pick their preferred technology stack, while significantly reducing their centralized teams’ leverage. Both altitudes make sense. Both have consequences. Criteria for effective policies In The Engineering Executive’s Primer’s chapter on engineering strategy, I introduced three criteria for evaluating policies. They ought to be applicable, enforced, and create leverage. Defining those a bit: Applicable: it can be used to navigate complex, real scenarios, particularly when making tradeoffs. Enforced: teams will be held accountable for following the guiding policy. Create Leverage: create compounding or multiplicative impact. The last of these three, create leverage, made sense in the context of a book about engineering executives, but probably doesn’t make as much sense here. Some policies certainly should create leverage (e.g. empower developer experience team by restricting new services), but others might not (e.g. moving to an N-1 backfill policy). Outside the executive context, what’s important isn’t necessarily creating leverage, but that a policy solves for part of the diagnosis. That leaves the other two–being applicable and enforced–both of which are necessary for a policy to actually address the diagnosis. Any policy which you can’t determine how to apply, or aren’t willing to enforce, simply won’t be useful. Let’s apply these criteria to a handful of potential policies. First let’s think about policies we might write to improve the talent density of our engineering team: “We only hire world-class engineers.” This isn’t applicable, because it’s unclear what a world-class engineer means. Because there’s no mutually agreeable definition in this policy, it’s also not consistently enforceable. “We only hire engineers that get at least one ‘strong yes’ in scorecards.” This is applicable, because there’s a clear definition. 
This is enforceable, depending on the willingness of the organization to reject seemingly good candidates who don’t happen to get a strong yes. Next, let’s think about a policy regarding code reuse within a codebase:
“We follow a strict Don’t Repeat Yourself policy in our codebase.” There’s room for debate within a team about whether two pieces of code are truly duplicative, but this is generally applicable. Because there’s room for debate, it’s a very context specific determination to decide how to enforce a decision.
“Code authors are responsible for determining if their contributions violate Don’t Repeat Yourself, and rewriting them if they do.” This is much more applicable, because now there’s only a single person’s judgment to assess the potential repetition. In some ways, this policy is also more enforceable, because there’s no longer any ambiguity around who is deciding whether a piece of code is a repetition. The challenge is that enforceability now depends on one individual, and making this policy effective will require holding individuals accountable for the quality of their judgment. An organization that’s unwilling to distinguish between good and bad judgment won’t get any value out of the policy. This is a good example of how a good policy in one organization might become a poor policy in another.
If you ever find yourself wanting to include a policy that for some reason either can’t be applied or can’t be enforced, stop to ask yourself what you’re trying to accomplish and ponder if there’s a different policy that might be better suited to that goal.
Developing novel policies
My experience is that there are vanishingly few truly novel policies to write. Almost always, someone else has already done something similar to your intended approach. Calm’s engineering strategy is such a case: the details are particular to the company, but the general approach is common across the industry.
The most likely place to find truly novel policies is during the adoption phase of a new widespread technology, such as the rise of ubiquitous mobile phones, cloud computing, or large language models. Even then, as explored in the strategy for adopting large-language models, the new technology can be engaged with as a generic technology:
Develop an LLM-backed process for reactivating departed and suspended drivers in mature markets. Through modeling our driver lifecycle, we determined that improving onboarding time will have little impact on the total number of active drivers. Instead, we are focusing on mechanisms to reactivate departed and suspended drivers, which is the only opportunity to meaningfully impact active drivers.
You could simply replace “LLM” with “data-driven” and it would be equally readable. In this way, policy can generally sidestep areas of uncertainty by being a bit abstract. This avoids being overly specific about topics you simply don’t know much about. However, even if your policy isn’t novel to the industry, it might still be novel to you or your organization. The steps that I’ve found useful to debug novel policies are the same steps as running a condensed version of the strategy process, with a focus on exploration and refinement:
Collect a number of similar policies, with a focus on how those policies differ from the policy you are creating.
Create a systems model to articulate how this policy will work, and also how it will differ from the similar policies you’re considering.
Run a strategy testing cycle for your proto-policy to discover any unknown-unknowns about how it works in practice.
Whether you run into this scenario is largely a function of the extent of your, and your organization’s, experience. Early in my career, I found myself doing novel (for me) strategy work very frequently, and these days I rarely find myself doing novel work, instead focusing on adaptation of well-known policies to new circumstances.
Are competing policy proposals an anti-pattern?

When creating policy, you’ll often have to engage with the question of whether you should develop one preferred policy or a series of potential strategies to pick from. Developing competing options is a useful stage of setting policy, but rather than treating it as a way to refine your policy, I’d encourage you to think of it as exposing gaps in your diagnosis.

For example, when Stripe developed the Sorbet ruby-typing tooling, there was debate between two policies:

Should we build a ruby-typing tool to allow a centralized team to gradually migrate the company to a typed codebase?

Should we migrate the codebase to a preexisting strongly typed language like Golang or Java?

These were, initially, equally valid hypotheses. It was only by clarifying our diagnosis around resourcing that it became clear that incurring the bulk of costs in a centralized team was preferable to spreading the costs across many teams. Specifically, we recognized that we wanted to prioritize short-term product engineering velocity, even if it led to a longer migration overall.

If you do develop multiple policy options, I encourage you to move the alternatives into an appendix rather than including them in the core of your strategy document. This will make it easier for readers of your final version to understand how to follow your policies, and they are the most important long-term users of your written strategy.

Recognizing constraints

A similar problem to competing solutions is developing a policy that you cannot possibly fund. It’s easy to get enamored with policies that you can’t meaningfully enforce, but that’s bad policy, even if it would work in an alternate universe where it was possible to enforce or resource it.

To consider a few examples:

The strategy for controlling access to user data might have proposed requiring manual approval by a second party of every access to customer data. However, that would have gone nowhere.
Our approach to Uber’s service migration might have required more staffing for the infrastructure engineering team, but we knew that wasn’t going to happen, so it was a meaningless policy proposal to make.

The strategy for navigating private equity ownership might have argued that new ownership should not hold engineering accountable to a new standard on spending. But they would have just invalidated that strategy in the next financial planning period.

If you find a policy that contemplates an impractical approach, it doesn’t only indicate that the policy is a poor one, it also suggests your policy is missing an important pillar. Rather than debating the policy options, the fastest path to resolution is to align on the diagnosis that would invalidate potential paths forward.

In cases where aligning on the diagnosis isn’t possible, for example because you simply don’t understand the possibilities of a new technology as encountered in the strategy for adopting LLMs, then you’ve typically found a valuable opportunity to use strategy refinement to build alignment.

Dealing with missing strategies

At a recent company offsite, we were debating which policies we might adopt to deal with annual plans that kept getting derailed after less than a month. Someone remarked that this would be much easier if we could get the executive team to commit to a clearer, written strategy about which business units we were prioritizing.

They were, of course, right. It would be much easier. Unfortunately, it goes back to the problem we discussed in the diagnosis chapter about reframing blockers into diagnosis. If a strategy from the company or a peer function is missing, the empowering thing to do is to include the absence in your diagnosis and move forward.

Sometimes, even when you do this, it’s easy to fall back into the belief that you cannot set a policy because a peer function might set a conflicting policy in the future.
Whether you’re an executive or an engineer, you’ll never have the details you want to make the ideal policy. Meaningful leadership requires taking meaningful risks, which is never something that gets comfortable.

Summary

After working through this chapter, you know how to develop policy, how to assemble policies to solve your diagnosis, and how to avoid a number of the frequent challenges that policy writers encounter. At this point, there’s only one phase of strategy left to dig into: operating the policies you’ve created.
I was building a small feature for the Flickr Commons Explorer today: show a random selection of photos from the entire collection. I wanted a fast and varied set of photos. This meant getting a random sample of rows from a SQLite table (because the Explorer stores all its data in SQLite). I’m happy with the code I settled on, but it took several attempts to get right.

Approach #1: ORDER BY RANDOM()

My first attempt was pretty naïve – I used an ORDER BY RANDOM() clause to sort the table, then limit the results:

    SELECT * FROM photos ORDER BY random() LIMIT 10

This query works, but it was slow – about half a second to sample a table with 2 million photos (which is very small by SQLite standards). This query would run on every request for the homepage, so that latency is unacceptable.

It’s slow because it forces SQLite to generate a value for every row, then sort all the rows, and only then does it apply the limit. SQLite is fast, but there’s only so fast you can sort millions of values.

I found a suggestion from Stack Overflow user Ali to do a random sort on the id column first, pick my IDs from that, and only fetch the whole row for the photos I’m selecting:

    SELECT * FROM photos WHERE id IN ( SELECT id FROM photos ORDER BY RANDOM() LIMIT 10 )

This means SQLite only has to load the rows it’s returning, not every row in the database. This query was over three times faster – about 0.15s – but that’s still slower than I wanted.

Approach #2: WHERE rowid > (…)

Scrolling down the Stack Overflow page, I found an answer by Max Shenfield with a different approach:

    SELECT * FROM photos WHERE rowid > ( ABS(RANDOM()) % (SELECT max(rowid) FROM photos) ) LIMIT 10

The rowid is a unique identifier that’s used as a primary key in most SQLite tables, and it can be looked up very quickly. SQLite automatically assigns a unique rowid unless you explicitly tell it not to, or create your own integer primary key.
This query works by picking a point between the biggest and smallest rowid values used in the table, then getting the rows with rowids which are higher than that point. If you want to know more, Max’s answer has a more detailed explanation.

This query is much faster – around 0.0008s – but I didn’t go this route. The result is more like a random slice than a random sample. In my testing, it always returned contiguous rows – 101, 102, 103, … – which isn’t what I want. The photos in the Commons Explorer database were inserted in upload order, so photos with adjacent row IDs were uploaded at around the same time and are probably quite similar. I’d get one photo of an old plane, then nine more photos of other planes. I want more variety!

(This behaviour isn’t guaranteed – if you don’t add an ORDER BY clause to a SELECT query, then the order of results is undefined. SQLite is returning rows in rowid order in my table, and a quick Google suggests that’s pretty common, but that may not be true in all cases. It doesn’t affect whether I want to use this approach, but I mention it here because I was confused about the ordering when I read this code.)

Approach #3: Select random rowid values outside SQLite

Max’s answer was the first time I’d heard of rowid, and it gave me an idea – what if I chose random rowid values outside SQLite? This is a less “pure” approach because I’m not doing everything in the database, but I’m happy with that if it gets the result I want.

Here’s the procedure I came up with:

1. Create an empty list to store our sample.

2. Find the highest rowid that’s currently in use:

       sqlite> SELECT MAX(rowid) FROM photos;
       1913389

3. Use a random number generator to pick a rowid between 1 and the highest rowid:

       >>> import random
       >>> random.randint(1, max_rowid)
       196476

   If we’ve already got this rowid, discard it and generate a new one. (The rowid is a signed 64-bit integer, and SQLite assigns rowids starting from 1 unless you insert them explicitly, so 1 is a safe minimum here.)

4. Look for a row with that rowid:

       SELECT * FROM photos WHERE rowid = 196476

   If such a row exists, add it to our sample. If we have enough items in our sample, we’re done. Otherwise, return to step 3 and generate another rowid.

   If such a row doesn’t exist, return to step 3 and generate another rowid.

This requires a bit more code, but it returns a diverse sample of photos, which is what I really care about. It’s a bit slower, but still plenty fast enough (about 0.001s).

This approach is best for tables where the rowid values are mostly contiguous – it would be slower if there are lots of rowids between 1 and the max that don’t exist. If there are large gaps in rowid values, you might try multiple missing entries before finding a valid row, slowing down the query. You might want to try something different, like tracking valid rowid values separately.

This is a good fit for my use case, because photos don’t get removed from Flickr Commons very often. Once a row is written, it sticks around, and over 97% of the possible rowid values do exist.

Summary

Here are the four approaches I tried:

| Approach | Performance (for 2M rows) | Notes |
|---|---|---|
| ORDER BY RANDOM() | ~0.5s | Slowest, easiest to read |
| WHERE id IN (SELECT id …) | ~0.15s | Faster, still fairly easy to understand |
| WHERE rowid > ... | ~0.0008s | Returns clustered results |
| Random rowid in Python | ~0.001s | Fast and returns varied results, requires code outside SQL |

I’m using the random rowid in Python in the Commons Explorer, trading code complexity for speed. I’m using this random sample to render a web page, so it’s important that it returns quickly – when I was testing ORDER BY RANDOM(), I could feel myself waiting for the page to load.

But I’ve used ORDER BY RANDOM() in the past, especially for asynchronous data pipelines where I don’t care about absolute performance. It’s simpler to read and easier to see what’s going on.

Now it’s your turn – visit the Commons Explorer and see what random gems you can find.
Let me know if you spot anything cool!
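
For anyone who wants to try the random-rowid procedure from Approach #3, here is a minimal sketch using Python's built-in sqlite3 module. It is not the Commons Explorer's actual code: the function name `random_photos`, the `max_attempts` safety cap, and the tiny in-memory `photos` table are all assumptions made for illustration. It also assumes rowids are positive, which holds when SQLite assigns them automatically.

```python
import random
import sqlite3


def random_photos(conn, count, max_attempts=10_000):
    """Return up to `count` distinct rows from `photos`, picked by random rowid.

    Hypothetical helper: `max_attempts` is an illustrative cap so that a
    very sparse table cannot make the loop run forever.
    """
    # Step 2: find the highest rowid currently in use.
    max_rowid = conn.execute("SELECT MAX(rowid) FROM photos").fetchone()[0]
    if max_rowid is None:  # the table is empty
        return []

    tried = set()  # every rowid we have already looked up, hit or miss
    sample = []    # Step 1: the list that holds our sample

    for _ in range(max_attempts):
        # Stop when we have enough rows, or when every rowid has been tried.
        if len(sample) >= count or len(tried) >= max_rowid:
            break

        # Step 3: pick a rowid, skipping any we have already tried.
        rowid = random.randint(1, max_rowid)
        if rowid in tried:
            continue
        tried.add(rowid)

        # Step 4: keep the row if it exists; otherwise loop and try again.
        row = conn.execute(
            "SELECT * FROM photos WHERE rowid = ?", (rowid,)
        ).fetchone()
        if row is not None:
            sample.append(row)

    return sample


# Demo on a small in-memory table with a few deleted rows (gaps in rowid).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (title TEXT)")
conn.executemany(
    "INSERT INTO photos (title) VALUES (?)",
    [(f"photo {i}",) for i in range(1, 101)],
)
conn.execute("DELETE FROM photos WHERE rowid % 10 = 0")

for (title,) in random_photos(conn, count=10):
    print(title)
```

Tracking every tried rowid, including the misses, means a gap is never looked up twice, and the loop can stop early once all possible rowids have been exhausted. On a table with large gaps this still wastes lookups, which is the trade-off described above.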