TCP and gRPC Failed Connection Timeouts

from Evan Jones - Software Engineer | Computer Scientist [alt+shift+b] in programming

After my last article about how gRPC is hard to configure, with a client keepalive example, I realized that I don't understand how TCP and gRPC handle failed connections. This article is my attempt to figure it out, and document it so I can find it again in the future. The best reference I've found for how TCP handles failures on recent Linux kernels is When TCP sockets refuse to die. It gets into the details much more than this article does. Instead, I will try to briefly summarize what TCP and gRPC do in some failure scenarios that I have run into before. My opinion is that most TCP applications probably should turn on TCP keepalives (use setsockopt to set SO_KEEPALIVE, TCP_KEEPIDLE, and TCP_KEEPINTVL), and set the TCP_USER_TIMEOUT option. These settings ensure the kernel will report a failed connection much faster than it would by default (seconds to minutes instead of hours). For example, setting SO_KEEPALIVE=1, TCP_KEEPIDLE=15, TCP_KEEPINTVL=15, TCP_KEEPCNT=5 and...

over a year ago

Remove from reading list Add to reading list [alt+a] Read now [→]

Comments

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from Evan Jones - Software Engineer | Computer Scientist

Setenv is not Thread Safe and C Doesn't Want to Fix It

You can't safely use the C setenv() or unsetenv() functions in a program that uses threads. Those functions modify global state, and can cause other threads calling getenv() to crash. This also causes crashes in other languages that use those C standard library functions, such as Go's os.Setenv (Go issue) and Rust's std::env::set_var() (Rust issue). I ran into this in a Go program, because Go's built-in DNS resolver can call C's getaddrinfo(), which uses environment variables. This cost me 2 days to track down and file the Go bug. Sadly, this problem has been known for decades. For example, an article from January 2017 said: "None of this is new, but we do re-discover it roughly every five years. See you in 2022." This was only one year off! (She wrote an update in October 2023 after I emailed her about my Go bug.) This is a flaw in the POSIX standard, which extends the C Standard to allow modifying environment varibles. The most infuriating part is that many people who could influence the standard or maintain the C libraries don't see this as a problem. The argument is that the specification clearly documents that setenv() cannot be used with threads. Therefore, if someone does this, the crashes are their fault. We should apparently read every function's specification carefully, not use software written by others, and not use threads. These are unrealistic assumptions in modern software. I think we should instead strive to create APIs that are hard to screw up, and evolve as the ecosystem changes. The C language and standard library continue to play an important role at the base of most software. We either need to figure out how to improve it, or we need to figure out how to abandon it. Why is setenv() not thread-safe? The biggest problem is that getenv() returns a char*, with no need for applications to free it later. One thread could be using this pointer when another thread changes the same environment variable using setenv() or unsetenv(). The getenv() function is perfect if environment variables never change. For example, for accessing a process's initial table of environment variables (see the System V ABI: AMD64 Section 3.4.1). It turns out the C Standard only includes getenv(), so according to C, that is exactly how this should work. However, most implementations also follow the POSIX standard (e.g. POSIX.1-2017), which extends C to include functions that modify the environment. This means the current getenv() API is problematic. Even worse, putenv() adds a char* to the set of environment variables. It is explicitly required that if the application modifies the memory after putenv() returns, it modifies the environment variables. This means applications can modify the value passed to putenv() at any time, without any synchronization. FreeBSD used to implement putenv() by copying the value, but it changed it with FreeBSD 7 in 2008, which suggests some programs really do depend on modifying the environment in this fashion (see FreeBSD putenv man page). As a final problem, environ is a NULL-terminated array of pointers (char**) that an application can read and assign to (see definition in POSIX.1-2017). This is how applications can iterate over all environment variables. Accesses to this array are not thread-safe. However, in my experience many fewer applications use this than getenv() and setenv(). However, this does cause some libraries to not maintain the set of environment variables in a thread-safe way, since they directly update this table. Environment variable implementations Implementations need to choose what do do when an application overwrites an existing variable. I looked at glibc, musl, Solaris/Illumos, and FreeBSD/Apple's C standard libraries, and they make the following choices: Never free environment variables (glibc, Solaris/Illumos): Calling setenv() repeatedly is effectively a memory leak. However, once a value is returned from getenv(), it is immutable and can be used by threads safely. Free the environment variables (musl, FreeBSD/Apple): Using the pointer returned by getenv() after another thread calls setenv() can crash. A second problem is ensuring the set of environment variables is updated in a thread-safe fashion. This is what causes crashes in glibc. glibc uses an array to hold pointers to the "NAME=value" strings. It holds a lock in setenv() when changing this array, but not in getenv(). If a thread calling setenv() needs to resize the array of pointers, it copies the values to a new array and frees the previous one. This can cause other threads executing getenv() to crash, since they are now iterating deallocated memory. This is particularly annoying since glibc already leaks environment variables, and holds a lock in setenv(). All it needs to do is hold the lock inside getenv(), and it would no longer crash. This would make getenv() slightly slower. However, getenv() already uses a linear search of the array, so performance does not appear to be a concern. More sophisticated implementations are possible if this is a problem, such as Solaris/Illumos's lock-free implementation. Why do programs use environment variables? Environment variables useful for configuring shared libraries or language runtimes that are included in other programs. This allows users to change the configuration, without program authors needing to explicitly pass the configuration in. One alternative is command line flags, which requires programs to parse them and pass them in to the libraries. Another alternative are configuration files, which then need some other way to disable or configure, to be able to test new configurations. Environment variables are a simple solution. AS a result, many libraries call getenv() (see a partial list below). Since many libraries are configured through environment variables, a program may need to change these variables to configure the libraries it uses. This is common at application startup. This causes programs to need to call setenv(). Given this issue, it seems like libraries should also provide a way to explicitly configure any settings, and avoid using environment variables. We should fix this problem, and we can In my opinion, it is rediculous that this has been a known problem for so long. It has wasted thousands of hours of people's time, either debugging the problems, or debating what to do about it. We know how to fix the problem. First, we can make a thread-safe implementation, like Illumos/Solaris. This has some limitations: it leaks memory in setenv(), and is still unsafe if a program uses putenv() or the environ variable. However, this is an improvement over the current Linux and Apple implementations. The second solution is to add new APIs to get one and get all environment variables that are thread-safe by design, like Microsoft's getenv_s() (see below for the controversy around C11's "Annex K"). My preferred solution would be to do both. This would reduce the chances of hitting this problem for existing programs and libraries, and also provide a path to avoid the problems entirely for new code or languages like Go and Rust. My rough idea would be the following: Add a function to copy one single environment variable to a user-specified buffer, similar to getenv_s(). Add a thread-safe API to iterate over all environment variables, or to copy all variables out. Mark getenv() as deprecated, recommending the new thread-safe getenv() function instead. Mark putenv() as deprecated, recommending setenv() instead. Mark environ as deprecated, recommending environment variable functions instead. Update the implementation of environment varibles to be thread-safe. This requires leaking memory if getenv() is used on a variable, but we can detect if the old functions are used, and only leak memory in that case. This means programs written in other languages will avoid these problems as soon as their runtimes are updated. Update the C and POSIX standards to require the above changes. This would be progress. The getenv_s / C Standard Annex K controversy Microsoft provides getenv_s(), which copies the environment variable into a caller-provided buffer. This is easy to make thread-safe by holding a read lock while copying the variable. After the function returns, future changes to the environment have no effect. This is included in the C11 Standard as Annex K "Bounds Checking Interfaces". The C standard Annexes are optional features. This Annex includes new functions intended to make it harder to make mistakes with buffers that are the wrong size. The first draft of this extension was published in 2003. This is when Microsoft was focusing on "Trustworthy Computing" after a January 2002 memo from Bill Gates. Basically, Windows wasn't designed to be connected to the Internet, and now that it was, people were finding many security problems. Lots of them were caused by buffer handling mistakes. Microsoft developed new versions of a number of problematic functions, and added checks to the Visual C++ compiler to warn about using the old ones. They then attempted to standardize these functions. My understanding is the people responsible for the Unix POSIX standards did not like the design of these functions, so they refused to implement them. For more details, see Field Experience With Annex K published in September 2015, Stack Overflow: Why didn't glibc implement _s functions? updated March 2023, and Rich Felker of musl on both technical and social reasons for not implementing Annex K from February 2019. I haven't looked at the rest of the functions, but having spent way too long looking at getenv(), the general idea of getenv_s() seems like a good idea to me. Standardizing this would help avoid this problem. Incomplete list of common environment variables This is a list of some uses of environment variables from fairly widely used libraries and services. This shows that environment variables are pretty widely used. Cloud Provider Credentials and Services AWS's SDKs for credentials (e.g. AWS_ACCESS_KEY_ID) Google Cloud Application Default Credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS) Microsoft Azure Default Azure Credential (e.g. AZURE_CLIENT_ID) AWS's Lambda serverless product: sets a large number of variables like AWS_REGION, AWS_LAMBDA_FUNCTION_NAME, and credentials like AWS_SECRET_ACCESS_KEY Google Cloud Run serverless product: configuration like PORT, K_SERVICE, K_REVISION Kubernetes service discovery: Defines variables SERVICE_NAME_HOST and SERVICE_NAME_PORT. Third-party C/C++ Libraries OpenTelemetry: Metrics and tracing. Many environment variables like OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES. OpenSSL: many configurable variables like HTTPS_PROXY, OPENSSL_CONF, OPENSSL_ENGINES. BoringSSL: Google's fork of OpenSSL used in Chrome and others. It reads SSLKEYLOGFILE just like OpenSSL for logging TLS keys for debugging. Libcurl: proxies, SSL/TLS configuration and debugging like HTTPS_PROXY, CURL_SSL_BACKEND, CURL_DEBUG. Libpq Postgres client library: connection parameters including credentials like PGHOSTADDR, PGDATABASE, and PGPASSWORD. Rust Standard Library std::thread RUST_MIN_STACK: Calls std::env::var() on the first call to spawn() a new thread. It is cached in a static atomic variable and never read again. See implementation in thread::min_stack(). std::backtrace RUST_LIB_BACKTRACE: Calls std::env::var() on the first call to capture a backtrace. It is cached in a static atomic variable and never read again. See implementation in Backtrace::enabled().

a year ago • 58 votes

Random Load Balancing is Unevenly Distributed

This is a reminder that random load balancing is unevenly distributed. If we distribute a set of items randomly across a set of servers (e.g. by hashing, or by randomly selecting a server), the average number of items on each server is num_items / num_servers. It is easy to assume this means each server has close to the same number of items. However, since we are selecting servers at random, they will have different numbers of items, and the imbalance can be important. For load balancing, a reasonable model is that each server has fixed capacity (e.g. it can serve 3000 requests/second, or store 100 items, etc.). We need to divide the total workload over the servers, so that each server stays below its capacity. This means the number of servers is determined by the most loaded server, not the average. This is a classic balls in bins problem that has been well studied, and there are some interesting theoretical results. However, I wanted some specific numbers, so I wrote a small simulation. The summary is that the imbalance depends on the expected number of items per server (that is, num_items / num_servers). This means workload is more balanced with fewer servers, or with more items. This means that dividing a set of items over more servers makes the distribution more unfair, which is a reason we can get worse than linear scaling of a distributed system. Let's make this more concrete with an example. Let's assume we have a workload of 1000 items, and each server can hold a maximum of 100 items. If we place the exact same number of items on each server, we only need 10 servers, and each of them is completely busy. However, if we place the items randomly, then the median (p50) number of items is 100 items. This means half the servers will have more than 100 items, and will be overloaded. If we want less than a 1% chance of an overloaded server, we need to look at the 99th percentile (p99) server load. We need to use at least 13 servers, which has a p99 load of 97 items. For 14 servers, the average is 77 items, so our servers are on average 23% idle. This shows how the imbalance leads to wasted capacity. This is a bit of an extreme example, because the number of items is small. Let's assume we can make the items 10× smaller, say by dividing them into pieces. Our workload now consists of 10k items, and each server has the capacity to hold 1000 (1k) items. Our perfectly balanced workload still needs 10 servers. With random load balancing, to have a less than 1 in 1000 chance of exceeding our capacity, we only need 11 servers, which has a p99 load of 98 items and a p999 of 100 items. With 11 servers, the average number of items is 910 or 91%, so our servers are only 9% idle. This shows how splitting work into smaller pieces improves load balancing. Another way to look at this is to think about a scaling scenario. Let's go back to our workload of 1000 items, where each server can handle 100 items, and we have 13 servers to ensure we have less than a 1% chance of an overloaded server. Now let's assume the amount of work per item doubles, for example because the service has become more popular, so each item has become larger. Now, each server can hold a maximum of 50 items. If we have perfectly linear scaling, we can double the number of servers from 13 to 26 to handle this workload. However, 26 servers has a p99 of 53 items, so we again have a more than 1% chance of overload. We need to use 28 servers which has a p99 of 50 items. This means we doubled the workload, but had to increase the number of servers from 13 to 28, which is 2.15×. This is sub-linear scaling. As a way to visualize the imbalance, the chart below shows the p99 to average ratio, which is a measure of how imbalanced the system is. If everything is perfectly balanced, the value is 1.0. A value of 2.0 means 1% of servers will have double the number of items of the average server. This shows that the imbalance increases with the number of servers, and increases with fewer items. Power of Two Random Choices Another way to improve load balancing is to have smarter placement. Perfect placement can be hard, but it is often possible to use the "power of two random choices" technique: select two servers at random, and place the item on the least loaded of the two. This makes the distribution much more balanced. For 1000 items and 100 items/server, 11 servers has a p999 of 93 items, so much less than 0.1% chance of overload, compared to needing 14 servers with random load balancing. For the scaling scenario where each server can only handle 50 items, we only need 21 servers to have a p999 of 50 items, compared to 28 servers with random load balancing. The downside of the two choices technique is that each request is now more expensive, since it must query two servers instead of one. However, in many cases where the "item not found" requests are much less expensive than the "item found" requests, this can still be a substantial improvement. For another look at how this improves load balancing, with a nice simulation that includes information delays, see Marc Brooker's blog post. Raw simulation output I will share the code for this simulation later. simulating placing items on servers with random selection iterations=10000 (number of times num_items are placed on num_servers) measures the fraction of items on each server (server_items/num_items) and reports the percentile of all servers in the run P99_AVG_RATIO = p99 / average; approximately the worst server compared to average num_items=1000: num_servers=3 p50=0.33300 p95=0.35800 p99=0.36800 p999=0.37900 AVG=0.33333; P99_AVG_RATIO=1.10400; ITEMS_PER_NODE=333.3 num_servers=5 p50=0.20000 p95=0.22100 p99=0.23000 p999=0.24000 AVG=0.20000; P99_AVG_RATIO=1.15000; ITEMS_PER_NODE=200.0 num_servers=10 p50=0.10000 p95=0.11600 p99=0.12300 p999=0.13100 AVG=0.10000; P99_AVG_RATIO=1.23000; ITEMS_PER_NODE=100.0 num_servers=11 p50=0.09100 p95=0.10600 p99=0.11300 p999=0.12000 AVG=0.09091; P99_AVG_RATIO=1.24300; ITEMS_PER_NODE=90.9 num_servers=12 p50=0.08300 p95=0.09800 p99=0.10400 p999=0.11200 AVG=0.08333; P99_AVG_RATIO=1.24800; ITEMS_PER_NODE=83.3 num_servers=13 p50=0.07700 p95=0.09100 p99=0.09700 p999=0.10400 AVG=0.07692; P99_AVG_RATIO=1.26100; ITEMS_PER_NODE=76.9 num_servers=14 p50=0.07100 p95=0.08500 p99=0.09100 p999=0.09800 AVG=0.07143; P99_AVG_RATIO=1.27400; ITEMS_PER_NODE=71.4 num_servers=25 p50=0.04000 p95=0.05000 p99=0.05500 p999=0.06000 AVG=0.04000; P99_AVG_RATIO=1.37500; ITEMS_PER_NODE=40.0 num_servers=50 p50=0.02000 p95=0.02800 p99=0.03100 p999=0.03500 AVG=0.02000; P99_AVG_RATIO=1.55000; ITEMS_PER_NODE=20.0 num_servers=100 p50=0.01000 p95=0.01500 p99=0.01800 p999=0.02100 AVG=0.01000; P99_AVG_RATIO=1.80000; ITEMS_PER_NODE=10.0 num_servers=1000 p50=0.00100 p95=0.00300 p99=0.00400 p999=0.00500 AVG=0.00100; P99_AVG_RATIO=4.00000; ITEMS_PER_NODE=1.0 num_items=2000: num_servers=3 p50=0.33350 p95=0.35050 p99=0.35850 p999=0.36550 AVG=0.33333; P99_AVG_RATIO=1.07550; ITEMS_PER_NODE=666.7 num_servers=5 p50=0.20000 p95=0.21500 p99=0.22150 p999=0.22850 AVG=0.20000; P99_AVG_RATIO=1.10750; ITEMS_PER_NODE=400.0 num_servers=10 p50=0.10000 p95=0.11100 p99=0.11600 p999=0.12150 AVG=0.10000; P99_AVG_RATIO=1.16000; ITEMS_PER_NODE=200.0 num_servers=11 p50=0.09100 p95=0.10150 p99=0.10650 p999=0.11150 AVG=0.09091; P99_AVG_RATIO=1.17150; ITEMS_PER_NODE=181.8 num_servers=12 p50=0.08350 p95=0.09350 p99=0.09800 p999=0.10300 AVG=0.08333; P99_AVG_RATIO=1.17600; ITEMS_PER_NODE=166.7 num_servers=13 p50=0.07700 p95=0.08700 p99=0.09100 p999=0.09600 AVG=0.07692; P99_AVG_RATIO=1.18300; ITEMS_PER_NODE=153.8 num_servers=14 p50=0.07150 p95=0.08100 p99=0.08500 p999=0.09000 AVG=0.07143; P99_AVG_RATIO=1.19000; ITEMS_PER_NODE=142.9 num_servers=25 p50=0.04000 p95=0.04750 p99=0.05050 p999=0.05450 AVG=0.04000; P99_AVG_RATIO=1.26250; ITEMS_PER_NODE=80.0 num_servers=50 p50=0.02000 p95=0.02550 p99=0.02750 p999=0.03050 AVG=0.02000; P99_AVG_RATIO=1.37500; ITEMS_PER_NODE=40.0 num_servers=100 p50=0.01000 p95=0.01400 p99=0.01550 p999=0.01750 AVG=0.01000; P99_AVG_RATIO=1.55000; ITEMS_PER_NODE=20.0 num_servers=1000 p50=0.00100 p95=0.00250 p99=0.00300 p999=0.00400 AVG=0.00100; P99_AVG_RATIO=3.00000; ITEMS_PER_NODE=2.0 num_items=5000: num_servers=3 p50=0.33340 p95=0.34440 p99=0.34920 p999=0.35400 AVG=0.33333; P99_AVG_RATIO=1.04760; ITEMS_PER_NODE=1666.7 num_servers=5 p50=0.20000 p95=0.20920 p99=0.21320 p999=0.21740 AVG=0.20000; P99_AVG_RATIO=1.06600; ITEMS_PER_NODE=1000.0 num_servers=10 p50=0.10000 p95=0.10700 p99=0.11000 p999=0.11320 AVG=0.10000; P99_AVG_RATIO=1.10000; ITEMS_PER_NODE=500.0 num_servers=11 p50=0.09080 p95=0.09760 p99=0.10040 p999=0.10380 AVG=0.09091; P99_AVG_RATIO=1.10440; ITEMS_PER_NODE=454.5 num_servers=12 p50=0.08340 p95=0.08980 p99=0.09260 p999=0.09580 AVG=0.08333; P99_AVG_RATIO=1.11120; ITEMS_PER_NODE=416.7 num_servers=13 p50=0.07680 p95=0.08320 p99=0.08580 p999=0.08900 AVG=0.07692; P99_AVG_RATIO=1.11540; ITEMS_PER_NODE=384.6 num_servers=14 p50=0.07140 p95=0.07740 p99=0.08000 p999=0.08300 AVG=0.07143; P99_AVG_RATIO=1.12000; ITEMS_PER_NODE=357.1 num_servers=25 p50=0.04000 p95=0.04460 p99=0.04660 p999=0.04880 AVG=0.04000; P99_AVG_RATIO=1.16500; ITEMS_PER_NODE=200.0 num_servers=50 p50=0.02000 p95=0.02340 p99=0.02480 p999=0.02640 AVG=0.02000; P99_AVG_RATIO=1.24000; ITEMS_PER_NODE=100.0 num_servers=100 p50=0.01000 p95=0.01240 p99=0.01340 p999=0.01460 AVG=0.01000; P99_AVG_RATIO=1.34000; ITEMS_PER_NODE=50.0 num_servers=1000 p50=0.00100 p95=0.00180 p99=0.00220 p999=0.00260 AVG=0.00100; P99_AVG_RATIO=2.20000; ITEMS_PER_NODE=5.0 num_items=10000: num_servers=3 p50=0.33330 p95=0.34110 p99=0.34430 p999=0.34820 AVG=0.33333; P99_AVG_RATIO=1.03290; ITEMS_PER_NODE=3333.3 num_servers=5 p50=0.20000 p95=0.20670 p99=0.20950 p999=0.21260 AVG=0.20000; P99_AVG_RATIO=1.04750; ITEMS_PER_NODE=2000.0 num_servers=10 p50=0.10000 p95=0.10500 p99=0.10700 p999=0.10940 AVG=0.10000; P99_AVG_RATIO=1.07000; ITEMS_PER_NODE=1000.0 num_servers=11 p50=0.09090 p95=0.09570 p99=0.09770 p999=0.09990 AVG=0.09091; P99_AVG_RATIO=1.07470; ITEMS_PER_NODE=909.1 num_servers=12 p50=0.08330 p95=0.08790 p99=0.08980 p999=0.09210 AVG=0.08333; P99_AVG_RATIO=1.07760; ITEMS_PER_NODE=833.3 num_servers=13 p50=0.07690 p95=0.08130 p99=0.08320 p999=0.08530 AVG=0.07692; P99_AVG_RATIO=1.08160; ITEMS_PER_NODE=769.2 num_servers=14 p50=0.07140 p95=0.07570 p99=0.07740 p999=0.07950 AVG=0.07143; P99_AVG_RATIO=1.08360; ITEMS_PER_NODE=714.3 num_servers=25 p50=0.04000 p95=0.04330 p99=0.04460 p999=0.04620 AVG=0.04000; P99_AVG_RATIO=1.11500; ITEMS_PER_NODE=400.0 num_servers=50 p50=0.02000 p95=0.02230 p99=0.02330 p999=0.02440 AVG=0.02000; P99_AVG_RATIO=1.16500; ITEMS_PER_NODE=200.0 num_servers=100 p50=0.01000 p95=0.01170 p99=0.01240 p999=0.01320 AVG=0.01000; P99_AVG_RATIO=1.24000; ITEMS_PER_NODE=100.0 num_servers=1000 p50=0.00100 p95=0.00150 p99=0.00180 p999=0.00210 AVG=0.00100; P99_AVG_RATIO=1.80000; ITEMS_PER_NODE=10.0 num_items=100000: num_servers=3 p50=0.33333 p95=0.33579 p99=0.33681 p999=0.33797 AVG=0.33333; P99_AVG_RATIO=1.01043; ITEMS_PER_NODE=33333.3 num_servers=5 p50=0.20000 p95=0.20207 p99=0.20294 p999=0.20393 AVG=0.20000; P99_AVG_RATIO=1.01470; ITEMS_PER_NODE=20000.0 num_servers=10 p50=0.10000 p95=0.10157 p99=0.10222 p999=0.10298 AVG=0.10000; P99_AVG_RATIO=1.02220; ITEMS_PER_NODE=10000.0 num_servers=11 p50=0.09091 p95=0.09241 p99=0.09304 p999=0.09379 AVG=0.09091; P99_AVG_RATIO=1.02344; ITEMS_PER_NODE=9090.9 num_servers=12 p50=0.08334 p95=0.08477 p99=0.08537 p999=0.08602 AVG=0.08333; P99_AVG_RATIO=1.02444; ITEMS_PER_NODE=8333.3 num_servers=13 p50=0.07692 p95=0.07831 p99=0.07888 p999=0.07954 AVG=0.07692; P99_AVG_RATIO=1.02544; ITEMS_PER_NODE=7692.3 num_servers=14 p50=0.07143 p95=0.07277 p99=0.07332 p999=0.07396 AVG=0.07143; P99_AVG_RATIO=1.02648; ITEMS_PER_NODE=7142.9 num_servers=25 p50=0.04000 p95=0.04102 p99=0.04145 p999=0.04193 AVG=0.04000; P99_AVG_RATIO=1.03625; ITEMS_PER_NODE=4000.0 num_servers=50 p50=0.02000 p95=0.02073 p99=0.02103 p999=0.02138 AVG=0.02000; P99_AVG_RATIO=1.05150; ITEMS_PER_NODE=2000.0 num_servers=100 p50=0.01000 p95=0.01052 p99=0.01074 p999=0.01099 AVG=0.01000; P99_AVG_RATIO=1.07400; ITEMS_PER_NODE=1000.0 num_servers=1000 p50=0.00100 p95=0.00117 p99=0.00124 p999=0.00132 AVG=0.00100; P99_AVG_RATIO=1.24000; ITEMS_PER_NODE=100.0 power of two choices num_items=1000: num_servers=3 p50=0.33300 p95=0.33400 p99=0.33500 p999=0.33600 AVG=0.33333; P99_AVG_RATIO=1.00500; ITEMS_PER_NODE=333.3 num_servers=5 p50=0.20000 p95=0.20100 p99=0.20200 p999=0.20300 AVG=0.20000; P99_AVG_RATIO=1.01000; ITEMS_PER_NODE=200.0 num_servers=10 p50=0.10000 p95=0.10100 p99=0.10200 p999=0.10200 AVG=0.10000; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=100.0 num_servers=11 p50=0.09100 p95=0.09200 p99=0.09300 p999=0.09300 AVG=0.09091; P99_AVG_RATIO=1.02300; ITEMS_PER_NODE=90.9 num_servers=12 p50=0.08300 p95=0.08500 p99=0.08500 p999=0.08600 AVG=0.08333; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=83.3 num_servers=13 p50=0.07700 p95=0.07800 p99=0.07900 p999=0.07900 AVG=0.07692; P99_AVG_RATIO=1.02700; ITEMS_PER_NODE=76.9 num_servers=14 p50=0.07200 p95=0.07300 p99=0.07300 p999=0.07400 AVG=0.07143; P99_AVG_RATIO=1.02200; ITEMS_PER_NODE=71.4 num_servers=25 p50=0.04000 p95=0.04100 p99=0.04200 p999=0.04200 AVG=0.04000; P99_AVG_RATIO=1.05000; ITEMS_PER_NODE=40.0 num_servers=50 p50=0.02000 p95=0.02100 p99=0.02200 p999=0.02200 AVG=0.02000; P99_AVG_RATIO=1.10000; ITEMS_PER_NODE=20.0 num_servers=100 p50=0.01000 p95=0.01100 p99=0.01200 p999=0.01200 AVG=0.01000; P99_AVG_RATIO=1.20000; ITEMS_PER_NODE=10.0 num_servers=1000 p50=0.00100 p95=0.00200 p99=0.00200 p999=0.00300 AVG=0.00100; P99_AVG_RATIO=2.00000; ITEMS_PER_NODE=1.0 power of two choices num_items=2000: num_servers=3 p50=0.33350 p95=0.33400 p99=0.33400 p999=0.33450 AVG=0.33333; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=666.7 num_servers=5 p50=0.20000 p95=0.20050 p99=0.20100 p999=0.20150 AVG=0.20000; P99_AVG_RATIO=1.00500; ITEMS_PER_NODE=400.0 num_servers=10 p50=0.10000 p95=0.10050 p99=0.10100 p999=0.10100 AVG=0.10000; P99_AVG_RATIO=1.01000; ITEMS_PER_NODE=200.0 num_servers=11 p50=0.09100 p95=0.09150 p99=0.09200 p999=0.09200 AVG=0.09091; P99_AVG_RATIO=1.01200; ITEMS_PER_NODE=181.8 num_servers=12 p50=0.08350 p95=0.08400 p99=0.08400 p999=0.08450 AVG=0.08333; P99_AVG_RATIO=1.00800; ITEMS_PER_NODE=166.7 num_servers=13 p50=0.07700 p95=0.07750 p99=0.07800 p999=0.07800 AVG=0.07692; P99_AVG_RATIO=1.01400; ITEMS_PER_NODE=153.8 num_servers=14 p50=0.07150 p95=0.07200 p99=0.07250 p999=0.07250 AVG=0.07143; P99_AVG_RATIO=1.01500; ITEMS_PER_NODE=142.9 num_servers=25 p50=0.04000 p95=0.04050 p99=0.04100 p999=0.04100 AVG=0.04000; P99_AVG_RATIO=1.02500; ITEMS_PER_NODE=80.0 num_servers=50 p50=0.02000 p95=0.02050 p99=0.02100 p999=0.02100 AVG=0.02000; P99_AVG_RATIO=1.05000; ITEMS_PER_NODE=40.0 num_servers=100 p50=0.01000 p95=0.01050 p99=0.01100 p999=0.01100 AVG=0.01000; P99_AVG_RATIO=1.10000; ITEMS_PER_NODE=20.0 num_servers=1000 p50=0.00100 p95=0.00150 p99=0.00200 p999=0.00200 AVG=0.00100; P99_AVG_RATIO=2.00000; ITEMS_PER_NODE=2.0 power of two choices num_items=5000: num_servers=3 p50=0.33340 p95=0.33360 p99=0.33360 p999=0.33380 AVG=0.33333; P99_AVG_RATIO=1.00080; ITEMS_PER_NODE=1666.7 num_servers=5 p50=0.20000 p95=0.20020 p99=0.20040 p999=0.20060 AVG=0.20000; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=1000.0 num_servers=10 p50=0.10000 p95=0.10020 p99=0.10040 p999=0.10040 AVG=0.10000; P99_AVG_RATIO=1.00400; ITEMS_PER_NODE=500.0 num_servers=11 p50=0.09100 p95=0.09120 p99=0.09120 p999=0.09140 AVG=0.09091; P99_AVG_RATIO=1.00320; ITEMS_PER_NODE=454.5 num_servers=12 p50=0.08340 p95=0.08360 p99=0.08360 p999=0.08380 AVG=0.08333; P99_AVG_RATIO=1.00320; ITEMS_PER_NODE=416.7 num_servers=13 p50=0.07700 p95=0.07720 p99=0.07720 p999=0.07740 AVG=0.07692; P99_AVG_RATIO=1.00360; ITEMS_PER_NODE=384.6 num_servers=14 p50=0.07140 p95=0.07160 p99=0.07180 p999=0.07180 AVG=0.07143; P99_AVG_RATIO=1.00520; ITEMS_PER_NODE=357.1 num_servers=25 p50=0.04000 p95=0.04020 p99=0.04040 p999=0.04040 AVG=0.04000; P99_AVG_RATIO=1.01000; ITEMS_PER_NODE=200.0 num_servers=50 p50=0.02000 p95=0.02020 p99=0.02040 p999=0.02040 AVG=0.02000; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=100.0 num_servers=100 p50=0.01000 p95=0.01020 p99=0.01040 p999=0.01040 AVG=0.01000; P99_AVG_RATIO=1.04000; ITEMS_PER_NODE=50.0 num_servers=1000 p50=0.00100 p95=0.00120 p99=0.00140 p999=0.00140 AVG=0.00100; P99_AVG_RATIO=1.40000; ITEMS_PER_NODE=5.0 power of two choices num_items=10000: num_servers=3 p50=0.33330 p95=0.33340 p99=0.33350 p999=0.33360 AVG=0.33333; P99_AVG_RATIO=1.00050; ITEMS_PER_NODE=3333.3 num_servers=5 p50=0.20000 p95=0.20010 p99=0.20020 p999=0.20030 AVG=0.20000; P99_AVG_RATIO=1.00100; ITEMS_PER_NODE=2000.0 num_servers=10 p50=0.10000 p95=0.10010 p99=0.10020 p999=0.10020 AVG=0.10000; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=1000.0 num_servers=11 p50=0.09090 p95=0.09100 p99=0.09110 p999=0.09110 AVG=0.09091; P99_AVG_RATIO=1.00210; ITEMS_PER_NODE=909.1 num_servers=12 p50=0.08330 p95=0.08350 p99=0.08350 p999=0.08360 AVG=0.08333; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=833.3 num_servers=13 p50=0.07690 p95=0.07700 p99=0.07710 p999=0.07720 AVG=0.07692; P99_AVG_RATIO=1.00230; ITEMS_PER_NODE=769.2 num_servers=14 p50=0.07140 p95=0.07160 p99=0.07160 p999=0.07170 AVG=0.07143; P99_AVG_RATIO=1.00240; ITEMS_PER_NODE=714.3 num_servers=25 p50=0.04000 p95=0.04010 p99=0.04020 p999=0.04020 AVG=0.04000; P99_AVG_RATIO=1.00500; ITEMS_PER_NODE=400.0 num_servers=50 p50=0.02000 p95=0.02010 p99=0.02020 p999=0.02020 AVG=0.02000; P99_AVG_RATIO=1.01000; ITEMS_PER_NODE=200.0 num_servers=100 p50=0.01000 p95=0.01010 p99=0.01020 p999=0.01020 AVG=0.01000; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=100.0 num_servers=1000 p50=0.00100 p95=0.00110 p99=0.00120 p999=0.00120 AVG=0.00100; P99_AVG_RATIO=1.20000; ITEMS_PER_NODE=10.0 power of two choices num_items=100000: num_servers=3 p50=0.33333 p95=0.33334 p99=0.33335 p999=0.33336 AVG=0.33333; P99_AVG_RATIO=1.00005; ITEMS_PER_NODE=33333.3 num_servers=5 p50=0.20000 p95=0.20001 p99=0.20002 p999=0.20003 AVG=0.20000; P99_AVG_RATIO=1.00010; ITEMS_PER_NODE=20000.0 num_servers=10 p50=0.10000 p95=0.10001 p99=0.10002 p999=0.10002 AVG=0.10000; P99_AVG_RATIO=1.00020; ITEMS_PER_NODE=10000.0 num_servers=11 p50=0.09091 p95=0.09092 p99=0.09093 p999=0.09093 AVG=0.09091; P99_AVG_RATIO=1.00023; ITEMS_PER_NODE=9090.9 num_servers=12 p50=0.08333 p95=0.08335 p99=0.08335 p999=0.08336 AVG=0.08333; P99_AVG_RATIO=1.00020; ITEMS_PER_NODE=8333.3 num_servers=13 p50=0.07692 p95=0.07694 p99=0.07694 p999=0.07695 AVG=0.07692; P99_AVG_RATIO=1.00022; ITEMS_PER_NODE=7692.3 num_servers=14 p50=0.07143 p95=0.07144 p99=0.07145 p999=0.07145 AVG=0.07143; P99_AVG_RATIO=1.00030; ITEMS_PER_NODE=7142.9 num_servers=25 p50=0.04000 p95=0.04001 p99=0.04002 p999=0.04002 AVG=0.04000; P99_AVG_RATIO=1.00050; ITEMS_PER_NODE=4000.0 num_servers=50 p50=0.02000 p95=0.02001 p99=0.02002 p999=0.02002 AVG=0.02000; P99_AVG_RATIO=1.00100; ITEMS_PER_NODE=2000.0 num_servers=100 p50=0.01000 p95=0.01001 p99=0.01002 p999=0.01002 AVG=0.01000; P99_AVG_RATIO=1.00200; ITEMS_PER_NODE=1000.0 num_servers=1000 p50=0.00100 p95=0.00101 p99=0.00102 p999=0.00102 AVG=0.00100; P99_AVG_RATIO=1.02000; ITEMS_PER_NODE=100.0

over a year ago • 49 votes

Nanosecond timestamp collisions are common

I was wondering: how often do nanosecond timestamps collide on modern systems? The answer is: very often, like 5% of all samples, when reading the clock on all 4 physical cores at the same time. As a result, I think it is unsafe to assume that a raw nanosecond timestamp is a unique identifier. I wrote a small test program to test this. I used Go, which records both the "absolute" time and the "monotonic clock" relative time on each call to time.Now(), so I compared both the relative difference between consecutive timestamps, as well as just the absolute timestamps. As expected, the behavior depends on the system, so I observe very different results on Mac OS X and Linux. On Linux, within a single thread, both the absolute and monotonic times always increase. On my system, the minimum increment was 32 ns. Between threads, approximately 5% of the absolute times were exactly the same as other threads. Even with 2 threads on a 4 core system, approximately 2% of timestamps collided. On Mac OS X: the absolute time has microsecond resolution, so there are an astronomical number of collisions when I repeat this same test. Even within a thread I often observe the monotonic clock not increment. See the test program on Github if you are curious.

over a year ago • 42 votes

How much does the read/write buffer size matter for socket throughput?

The read() and write() system calls take a variable-length byte array as an argument. As a simplified model, the time for the system call should be some constant "per-call" time, plus time directly proportional to the number of bytes in the array. That is, the time for each call should be time = (per_call_minimum_time) + (array_len) × (per_byte_time). With this model, using a larger buffer should increase throughput, asymptotically approaching 1/per_byte_time. I was curious: do real system calls behave this way? What are the ideal buffer sizes for read() and write() if we want to maximize throughput? I decided to do some experiments with blocking I/O. These are not rigorous, and I suspect the results will vary significantly if the hardware and software are different than one the system I tested. The really short answer is that a buffer of 32 KiB is a good starting point on today's systems, and I would want to measure the performance to go beyond that. However, for large writes, performance can increase. On Linux, the simple model holds for small buffers (≤ 4 KiB), but once the program approaches the maximum throughput, the throughput becomes highly variable and in many cases decreases as the buffers get larger. For blocking I/O, approximately 32 KiB is large enough to hit the maximum throughput for read(), but write() throughput improves with buffers up to around 256 KiB - 1 MiB. The reason for the asymmetry is that the Linux kernel will only write less than the entire buffer (a "short write") if there is an error (e.g. a signal causing EINTR). Thus, larger write buffers means the operating system needs to switch to the process less often. On the other head, "short reads", where a read() returns less than the maximum length, become increasingly common as the buffer size increases, which diminishes the benefit. There is a SO_RCVLOWAT socket option to change this that I did not test. The experiments were run on two 16 CPU Google Cloud T2D instances, which use AMD EPYC Milan processors (3rd generation, released in 2021). Each core is a real physical core. I used Ubuntu 23.04 running kernel 6.2.0-1005-gcp. My benchmark program is written in Rust and is available on Github. On localhost, Unix sockets were able to transfer data at approximately 9000 MiB/s. Localhost TCP sockets were a bit slower, around 7000 MiB/s. When using two separate cloud VMs with a networking throughput limit of 32 Gbps = 3800 MiB/s, I needed to use 6 TCP sockets to reliably reach that maximum throughput. A single TCP socket gets around 1400 MiB/s with 256 KiB buffers, with peaks as high as 2200 MiB/s. Experiment 1: /dev/zero and /dev/urandom My first experiment is reading from the /dev/zero and /dev/urandom devices. These are software devices implemented by the kernel, so they should have low overhead and low variability, since other tasks are not involved. Reading from /dev/urandom should be much slower than /dev/zero since the kernel must generate random bytes, rather than just zeros. The chart below shows the throughput for reading from /dev/zero as the buffer size is increased. The results show that the basic linear time per system call model holds until the system reaches maximum throughput (256 kiB buffer = 39000 MiB/s for /dev/zero, or 16 kiB = 410 MiB/s for /dev/urandom). As the buffer size increases further, the throughput decreases as the buffers get too big. This suggests that some other cost for larger buffers starts to outweigh the reduction in number of system calls. Perhaps CPU caches become less effective? The AMD EPYC Milan (3rd gen) CPU I tested on has 32 KiB of L1 data cache and 512 KiB of L2 data cache per core. The performance decreases don't exactly line up with these numbers, but it still seems plausible. The numbers for /dev/urandom are substantially lower, but otherwise similar. I did a linear least-squares fit on the average time per system call, shown in the following chart. If I use all the data, the fit is not good, because the trend changes for larger buffers. However, if I use the data up to the maximum throughput at 256 KiB, the fit is very good, as shown on the chart below. The linear fit models the minimum time per system call as 167 ns, with 0.0235 ns/byte additional time. If we want to use smaller buffers, using a 64 KiB buffer for reading from /dev/zero gets within 95% of the maximum throughput. Experiment 2: Unix and localhost TCP sockets Exchanging data with other processes is the thing I am actually interested in, so I tested Unix and TCP sockets on a single machine. In this case, I varied both the write buffer size and the read buffer size. Unfortunately, these results vary a lot. A more robust comparison would require running each experiment many times, and using some sort of statistical comparison. However, this "quick and dirty" experiment satisfied my curiousity, so I didn't do that. As a result, my conclusions here are vague. The simple model that increasing buffer size should decrease overhead is true, but only until the buffers are about 4 KiB. Above that point, the results start to be highly variable, and it is much harder to draw general conclusion. However, appears that increasing the write buffer size generally is quite helpful up to at least 256 KiB, and often needed as much as 1 MiB to get the highest localhost throughput. I suspect this is because on Linux with blocking sockets, write() will not return until it has written all the data in the buffer, unless there is an error (e.g. EINTR). As a result, passing a large buffer means the kernel can do a lot of the work without needing to switch back to user space. Unfortunately, the same is not true for read(), which often returns "short reads" with any data that is available in the buffer. This starts with buffer sizes around 2 KiB, with the percentage of short reads increasing as the buffer size gets larger. This means the simple model does not hold, because we aren't actually increasing the bytes per read call. I suspect this is a factor which means this microbenchmark is likely not representative of real programs. A real program will do something with the buffer, which will provide time for more data to be buffered in the kernel, and would probably decrease the number of short reads. This likely means larger buffers are in practice more useful than this microbenchmark suggests. As a result of this, the highest throughput often was achievable with small read buffers. I'm somewhat arbitrarily selecting 16 KiB at the best read buffer, and 256 KiB as the best write buffer, although a 1 MiB write buffer seems to be To give a sense of how variable the results are, the plot below shows the local Unix socket throughput for each read and write buffer throughput size. I apologize for the ugly plot. I did not want to spend the time to make it more beautiful. This plot is interactive so you can slice the data to the area of interest. I recommend zooming in to the left hand size with read buffers up to about 300 KiB. The first thing to note is at least on Linux with blocking sockets, the writer will almost never have a "short write", where the write system call returns before writing all the data in the buffer. Unless there is a signal (EINTR) or some other "error" condition, write() will not return until all the bytes are written. The same is not true for reads. The read() system call will often return a "short" read, starting around buffer sizes of 2 KiB. The percentage of short reads generally increases as buffer sizes get bigger, which is logical. Another note is that sockets have in-kernel send and receive buffers. I did not tune these at all. It is possible that better performance is possible by tuning these settings, but that was not my goal. I wanted to know what happens "out of the box" for general-purpose programs without any special tuning. Experiment 3: TCP between two hosts In this experiment, I used two separate hosts connected with 32 Gbps networking in Google Cloud. I first tested the TCP throughput using iperf, to independently verify the network performance. A single TCP connection with iperf is not enough to fully utilize the network. I tried fiddling with some command line options and with Kernel settings like net.ipv4.tcp_rmem and wasn't able to get much better than about 12 Gb/s = 1400 MiB/s. The throughput is also highly varible. Here is some example output with iperf reporting at 2 second intervals, where you can see the throughput ranging from 10 to 19 Gb/s, with an average over the entire interval of 12 Gb/s. To hit the maximum network throughput, I need to use 6 or more parallel TCP connections (iperf -c IP_ADDRESS --time 60 --interval 2 -l 262144 -P 6). Using 3 connections gets around 26 Gb/s, and using 4 or 5 will occasionally hit the maximum, but will also occasionally drop down. Using at least 6 seems to reliably stay at the maximum. Due to this variability, it is hard to draw any conclusions about buffer size. In particular: a single TCP connection is not limited by CPU. The system uses about 40% of a single CPU core, basically all in the kernel. This is more about how the buffer sizes may impact scheduling choices. That said, it is clear that you cannot hit the maximum throughput with a small write buffer. The experiments with 4 KiB write buffers reached approximately 300 MiB/s, while an 8 KiB write buffer was much faster, around 1400 MiB/s. Larger still generally seems better, up to around 256 KiB, which occasionally reached 2200 MiB/s = 17.6 Gb/s. The plot below shows the TCP socket throughput for each read and write buffer size. Again, I apologize for the ugly plot.

over a year ago • 97 votes

The C Standard Library Function isspace() Depends on Locale

This is a post for myself, because I wasted a lot of time understanding this bug, and I want to be able to remember it in the future. I expect close to zero others to be interested. The C standard library function isspace() returns a non-zero value (true) for the six "standard" ASCII white-space characters ('\t', '\n', '\v', '\f', '\r', ' '), and any locale-specific characters. By default, a program starts in the "C" locale, which will only return true for the six ASCII white-space characters. However, if the program changes locales, it can return true for other values. As a result, unless you really understand locales, you should use your own version of this function, or ICU4C's u_isspace() function. An implementation of isspace() for ASCII is one line: /* Returns true for the 6 ASCII white-space characters: \t \n \v \f \r ' '. */ int isspace_ascii(int c) { return c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r' || c == ' '; } I ran into this because On Mac OS X, Postgres switches to the system's default locale, which is something that uses UTF-8 (e.g. en_US.UTF-8, fr_CA.UTF-8, etc). In this case, isspace() returns true for Unicode white-space values, which includes 0x85 = NEL = Next Line, and 0xA0 = NBSP = No-Break Space. This caused a bug in parsing Postgres Hstore values that use Unicode. I have attempted to submit a patch to fix this (mailing list post, commitfest entry). For a program to demonstrate the behaviour on different systems, see isspace_locale on Github.

over a year ago • 103 votes

More in programming

first-class merges and cover letters

Although it looks really good, I have not yet tried the Jujutsu (jj) version control system, mainly because it’s not yet clearly superior to Magit. But I have been following jj discussions with great interest. One of the things that jj has not yet tackled is how to do better than git refs / branches / tags. As I underestand it, jj currently has something like Mercurial bookmarks, which are more like raw git ref plumbing than a high-level porcelain feature. In particular, jj lacks signed or annotated tags, and it doesn’t have branch names that always automatically refer to the tip. This is clearly a temporary state of affairs because jj is still incomplete and under development and these gaps are going to be filled. But the discussions have led me to think about how git’s branches are unsatisfactory, and what could be done to improve them. branch merge rebase squash fork cover letters previous branch workflow questions branch One of the huge improvements in git compared to Subversion was git’s support for merges. Subversion proudly advertised its support for lightweight branches, but a branch is not very useful if you can’t merge it: an un-mergeable branch is not a tool you can use to help with work-in-progress development. The point of this anecdote is to illustrate that rather than trying to make branches better, we should try to make merges better and branches will get better as a consequence. Let’s consider a few common workflows and how git makes them all unsatisfactory in various ways. Skip to cover letters and previous branch below where I eventually get to the point. merge A basic merge workflow is, create a feature branch hack, hack, review, hack, approve merge back to the trunk The main problem is when it comes to the merge, there may be conflicts due to concurrent work on the trunk. Git encourages you to resolve conflicts while creating the merge commit, which tends to bypass the normal review process. Git also gives you an ugly useless canned commit message for merges, that hides what you did to resolve the conflicts. If the feature branch is a linear record of the work then it can be cluttered with commits to address comments from reviewers and to fix mistakes. Some people like an accurate record of the history, but others prefer the repository to contain clean logical changes that will make sense in years to come, keeping the clutter in the code review system. rebase A rebase-oriented workflow deals with the problems of the merge workflow but introduces new problems. Primarily, rebasing is intended to produce a tidy logical commit history. And when a feature branch is rebased onto the trunk before it is merged, a simple fast-forward check makes it trivial to verify that the merge will be clean (whether it uses separate merge commit or directly fast-forwards the trunk). However, it’s hard to compare the state of the feature branch before and after the rebase. The current and previous tips of the branch (amongst other clutter) are recorded in the reflog of the person who did the rebase, but they can’t share their reflog. A force-push erases the previous branch from the server. Git forges sometimes make it possible to compare a branch before and after a rebase, but it’s usually very inconvenient, which makes it hard to see if review comments have been addressed. And a reviewer can’t fetch past versions of the branch from the server to review them locally. You can mitigate these problems by adding commits in --autosquash format, and delay rebasing until just before merge. However that reintroduces the problem of merge conflicts: if the autosquash doesn’t apply cleanly the branch should have another round of review to make sure the conflicts were resolved OK. squash When the trunk consists of a sequence of merge commits, the --first-parent log is very uninformative. A common way to make the history of the trunk more informative, and deal with the problems of cluttered feature branches and poor rebase support, is to squash the feature branch into a single commit on the trunk instead of mergeing. This encourages merge requests to be roughly the size of one commit, which is arguably a good thing. However, it can be uncomfortably confining for larger features, or cause extra busy-work co-ordinating changes across multiple merge requests. And squashed feature branches have the same merge conflict problem as rebase --autosquash. fork Feature branches can’t always be short-lived. In the past I have maintained local hacks that were used in production but were not (not yet?) suitable to submit upstream. I have tried keeping a stack of these local patches on a git branch that gets rebased onto each upstream release. With this setup the problem of reviewing successive versions of a merge request becomes the bigger problem of keeping track of how the stack of patches evolved over longer periods of time. cover letters Cover letters are common in the email patch workflow that predates git, and they are supported by git format-patch. Github and other forges have a webby version of the cover letter: the message that starts off a pull request or merge request. In git, cover letters are second-class citizens: they aren’t stored in the repository. But many of the problems I outlined above have neat solutions if cover letters become first-class citizens, with a Jujutsu twist. A first-class cover letter starts off as a prototype for a merge request, and becomes the eventual merge commit. Instead of unhelpful auto-generated merge commits, you get helpful and informative messages. No extra work is needed since we’re already writing cover letters. Good merge commit messages make good --first-parent logs. The cover letter subject line works as a branch name. No more need to invent filename-compatible branch names! Jujutsu doesn’t make you name branches, giving them random names instead. It shows the subject line of the topmost commit as a reminder of what the branch is for. If there’s an explicit cover letter the subject line will be a better summary of the branch as a whole. I often find the last commit on a branch is some post-feature cleanup, and that kind of commit has a subject line that is never a good summary of its feature branch. As a prototype for the merge commit, the cover letter can contain the resolution of all the merge conflicts in a way that can be shared and reviewed. In Jujutsu, where conflicts are first class, the cover letter commit can contain unresolved conflicts: you don’t have to clean them up when creating the merge, you can leave that job until later. If you can share a prototype of your merge commit, then it becomes possible for your collaborators to review any merge conflicts and how you resolved them. To distinguish a cover letter from a merge commit object, a cover letter object has a “target” header which is a special kind of parent header. A cover letter also has a normal parent commit header that refers to earlier commits in the feature branch. The target is what will become the first parent of the eventual merge commit. previous branch The other ingredient is to add a “previous branch” header, another special kind of parent commit header. The previous branch header refers to an older version of the cover letter and, transitively, an older version of the whole feature branch. Typically the previous branch header will match the last shared version of the branch, i.e. the commit hash of the server’s copy of the feature branch. The previous branch header isn’t changed during normal work on the feature branch. As the branch is revised and rebased, the commit hash of the cover letter will change fairly frequently. These changes are recorded in git’s reflog or jj’s oplog, but not in the “previous branch” chain. You can use the previous branch chain to examine diffs between versions of the feature branch as a whole. If commits have Gerrit-style or jj-style change-IDs then it’s fairly easy to find and compare previous versions of an individual commit. The previous branch header supports interdiff code review, or allows you to retain past iterations of a patch series. workflow Here are some sketchy notes on how these features might work in practice. One way to use cover letters is jj-style, where it’s convenient to edit commits that aren’t at the tip of a branch, and easy to reshuffle commits so that a branch has a deliberate narrative. When you create a new feature branch, it starts off as an empty cover letter with both target and parent pointing at the same commit. Alternatively, you might start a branch ad hoc, and later cap it with a cover letter. If this is a small change and rebase + fast-forward is allowed, you can edit the “cover letter” to contain the whole change. Otherwise, you can hack on the branch any which way. Shuffle the commits that should be part of the merge request so that they occur before the cover letter, and edit the cover letter to summarize the preceding commits. When you first push the branch, there’s (still) no need to give it a name: the server can see that this is (probably) going to be a new merge request because the top commit has a target branch and its change-ID doesn’t match an existing merge request. Also when you push, your client automatically creates a new instance of your cover letter, adding a “previous branch” header to indicate that the old version was shared. The commits on the branch that were pushed are now immutable; rebases and edits affect the new version of the branch. During review there will typically be multiple iterations of the branch to address feedback. The chain of previous branch headers allows reviewers to see how commits were changed to address feedback, interdiff style. The branch can be merged when the target header matches the current trunk and there are no conflicts left to resolve. When the time comes to merge the branch, there are several options: For a merge workflow, the cover letter is used to make a new commit on the trunk, changing the target header into the first parent commit, and dropping the previous branch header. Or, if you like to preserve more history, the previous branch chain can be retained. Or you can drop the cover letter and fast foward the branch on to the trunk. Or you can squash the branch on to the trunk, using the cover letter as the commit message. questions This is a fairly rough idea: I’m sure that some of the details won’t work in practice without a lot of careful work on compatibility and deployability. Do the new commit headers (“target” and “previous branch”) need to be headers? What are the compatibility issues with adding new headers that refer to other commits? How would a server handle a push of an unnamed branch? How could someone else pull a copy of it? How feasible is it to use cover letter subject lines instead of branch names? The previous branch header is doing a similar job to a remote tracking branch. Is there an opportunity to simplify how we keep a local cache of the server state? Despite all that, I think something along these lines could make branches / reviews / reworks / merges less awkward. How you merge should me a matter of your project’s preferred style, without interference from technical limitations that force you to trade off one annoyance against another. There remains a non-technical limitation: I have assumed that contributors are comfortable enough with version control to use a history-editing workflow effectively. I’ve lost all perspective on how hard this is for a newbie to learn; I expect (or hope?) jj makes it much easier than git rebase.

20 hours ago • 5 votes

Performant Full-Disk Encryption on a Raspberry Pi, but Foiled by Twisty UARTs

In my post yesterday (“ARM is great, ARM is terrible (and so is RISC-V)), I described my desire to find ARM hardware with AES instructions to support full-disk encryption, and the poor state of the OS ecosystem around the newer ARM boards. I was anticipating buying either a newer ARM SBC or an x86 mini … Continue reading Performant Full-Disk Encryption on a Raspberry Pi, but Foiled by Twisty UARTs →

7 hours ago • 3 votes

Words are not violence

Debates, at their finest, are about exploring topics together in search for truth. That probably sounds hopelessly idealistic to anyone who've ever perused a comment section on the internet, but ideals are there to remind us of what's possible, to inspire us to reach higher — even if reality falls short. I've been reaching for those debating ideals for thirty years on the internet. I've argued with tens of thousands of people, first on Usenet, then in blog comments, then Twitter, now X, and also LinkedIn — as well as a million other places that have come and gone. It's mostly been about technology, but occasionally about society and morality too. There have been plenty of heated moments during those three decades. It doesn't take much for a debate between strangers on this internet to escalate into something far lower than a "search for truth", and I've often felt willing to settle for just a cordial tone! But for the majority of that time, I never felt like things might escalate beyond the keyboards and into the real world. That was until we had our big blow-up at 37signals back in 2021. I suddenly got to see a different darkness from the most vile corners of the internet. Heard from those who seem to prowl for a mob-sanctioned opportunity to threaten and intimidate those they disagree with. It fundamentally changed me. But I used the experience as a mirror to reflect on the ways my own engagement with the arguments occasionally felt too sharp, too personal. And I've since tried to refocus way more of my efforts on the positive and the productive. I'm by no means perfect, and the internet often tempts the worst in us, but I resist better now than I did then. What I cannot come to terms with, though, is the modern equation of words with violence. The growing sense of permission that if the disagreement runs deep enough, then violence is a justified answer to settle it. That sounds so obvious that we shouldn't need to state it in a civil society, but clearly it is not. Not even in technology. Not even in programming. There are plenty of factions here who've taken to justify their violent fantasies by referring to their ideological opponents as "nazis", "fascists", or "racists". And then follow that up with a call to "punch a nazi" or worse. When you hear something like that often enough, it's easy to grow glib about it. That it's just a saying. They don't mean it. But I'm afraid many of them really do. Which brings us to Charlie Kirk. And the technologists who name drinks at their bar after his mortal wound just hours after his death, to name but one of the many, morbid celebrations of the famous conservative debater's death. It's sickening. Deeply, profoundly sickening. And my first instinct was exactly what such people would delight in happening. To watch the rest of us recoil, then retract, and perhaps even eject. To leave the internet for a while or forever. But I can't do that. We shouldn't do that. Instead, we should double down on the opposite. Continue to show up with our ideals held high while we debate strangers in that noble search for the truth. Where we share our excitement, our enthusiasm, and our love of technology, country, and humanity. I think that's what Charlie Kirk did so well. Continued to show up for the debate. Even on hostile territory. Not because he thought he was ever going to convince everyone, but because he knew he'd always reach some with a good argument, a good insight, or at least a different perspective. You could agree or not. Counter or be quiet. But the earnest exploration of the topics in a live exchange with another human is as fundamental to our civilization as Socrates himself. Don't give up, don't give in. Keep debating.

5 hours ago • 2 votes

ARM is great, ARM is terrible (and so is RISC-V)

I’ve long been interested in new and different platforms. I ran Debian on an Alpha back in the late 1990s and was part of the Alpha port team; then I helped bootstrap Debian on amd64. I’ve got somewhere around 8 Raspberry Pi devices in active use right now, and the free NNCPNET Internet email service … Continue reading ARM is great, ARM is terrible (and so is RISC-V) →

yesterday • 4 votes

Many Hard Leetcode Problems are Easy Constraint Problems

In my first interview out of college I was asked the change counter problem: Given a set of coin denominations, find the minimum number of coins required to make change for a given number. IE for USA coinage and 37 cents, the minimum number is four (quarter, dime, 2 pennies). I implemented the simple greedy algorithm and immediately fell into the trap of the question: the greedy algorithm only works for "well-behaved" denominations. If the coin values were [10, 9, 1], then making 37 cents would take 10 coins in the greedy algorithm but only 4 coins optimally (10+9+9+9). The "smart" answer is to use a dynamic programming algorithm, which I didn't know how to do. So I failed the interview. But you only need dynamic programming if you're writing your own algorithm. It's really easy if you throw it into a constraint solver like MiniZinc and call it a day. int: total; array[int] of int: values = [10, 9, 1]; array[index_set(values)] of var 0..: coins; constraint sum (c in index_set(coins)) (coins[c] * values[c]) == total; solve minimize sum(coins); You can try this online here. It'll give you a prompt to put in total and then give you successively-better solutions: coins = [0, 0, 37]; ---------- coins = [0, 1, 28]; ---------- coins = [0, 2, 19]; ---------- coins = [0, 3, 10]; ---------- coins = [0, 4, 1]; ---------- coins = [1, 3, 0]; ---------- Lots of similar interview questions are this kind of mathematical optimization problem, where we have to find the maximum or minimum of a function corresponding to constraints. They're hard in programming languages because programming languages are too low-level. They are also exactly the problems that constraint solvers were designed to solve. Hard leetcode problems are easy constraint problems.1 Here I'm using MiniZinc, but you could just as easily use Z3 or OR-Tools or whatever your favorite generalized solver is. More examples This was a question in a different interview (which I thankfully passed): Given a list of stock prices through the day, find maximum profit you can get by buying one stock and selling one stock later. It's easy to do in O(n^2) time, or if you are clever, you can do it in O(n). Or you could be not clever at all and just write it as a constraint problem: array[int] of int: prices = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]; var int: buy; var int: sell; var int: profit = prices[sell] - prices[buy]; constraint sell > buy; constraint profit > 0; solve maximize profit; Reminder, link to trying it online here. While working at that job, one interview question we tested out was: Given a list, determine if three numbers in that list can be added or subtracted to give 0? This is a satisfaction problem, not a constraint problem: we don't need the "best answer", any answer will do. We eventually decided against it for being too tricky for the engineers we were targeting. But it's not tricky in a solver; include "globals.mzn"; array[int] of int: numbers = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]; array[index_set(numbers)] of var {0, -1, 1}: choices; constraint sum(n in index_set(numbers)) (numbers[n] * choices[n]) = 0; constraint count(choices, -1) + count(choices, 1) = 3; solve satisfy; Okay, one last one, a problem I saw last year at Chipy AlgoSIG. Basically they pick some leetcode problems and we all do them. I failed to solve this one: Given an array of integers heights representing the histogram's bar height where the width of each bar is 1, return the area of the largest rectangle in the histogram. The "proper" solution is a tricky thing involving tracking lots of bookkeeping states, which you can completely bypass by expressing it as constraints: array[int] of int: numbers = [2,1,5,6,2,3]; var 1..length(numbers): x; var 1..length(numbers): dx; var 1..: y; constraint x + dx <= length(numbers); constraint forall (i in x..(x+dx)) (y <= numbers[i]); var int: area = (dx+1)*y; solve maximize area; output ["(\(x)->\(x+dx))*\(y) = \(area)"] There's even a way to automatically visualize the solution (using vis_geost_2d), but I didn't feel like figuring it out in time for the newsletter. Is this better? Now if I actually brought these questions to an interview the interviewee could ruin my day by asking "what's the runtime complexity?" Constraint solvers runtimes are unpredictable and almost always than an ideal bespoke algorithm because they are more expressive, in what I refer to as the capability/tractability tradeoff. But even so, they'll do way better than a bad bespoke algorithm, and I'm not experienced enough in handwriting algorithms to consistently beat a solver. The real advantage of solvers, though, is how well they handle new constraints. Take the stock picking problem above. I can write an O(n²) algorithm in a few minutes and the O(n) algorithm if you give me some time to think. Now change the problem to Maximize the profit by buying and selling up to max_sales stocks, but you can only buy or sell one stock at a given time and you can only hold up to max_hold stocks at a time? That's a way harder problem to write even an inefficient algorithm for! While the constraint problem is only a tiny bit more complicated: include "globals.mzn"; int: max_sales = 3; int: max_hold = 2; array[int] of int: prices = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]; array [1..max_sales] of var int: buy; array [1..max_sales] of var int: sell; array [index_set(prices)] of var 0..max_hold: stocks_held; var int: profit = sum(s in 1..max_sales) (prices[sell[s]] - prices[buy[s]]); constraint forall (s in 1..max_sales) (sell[s] > buy[s]); constraint profit > 0; constraint forall(i in index_set(prices)) (stocks_held[i] = (count(s in 1..max_sales) (buy[s] <= i) - count(s in 1..max_sales) (sell[s] <= i))); constraint alldifferent(buy ++ sell); solve maximize profit; output ["buy at \(buy)\n", "sell at \(sell)\n", "for \(profit)"]; Most constraint solving examples online are puzzles, like Sudoku or "SEND + MORE = MONEY". Solving leetcode problems would be a more interesting demonstration. And you get more interesting opportunities to teach optimizations, like symmetry breaking. Because my dad will email me if I don't explain this: "leetcode" is slang for "tricky algorithmic interview questions that have little-to-no relevance in the actual job you're interviewing for." It's from leetcode.com. ↩

yesterday • 4 votes

New here?