Many people assume that large language models (LLMs) will disrupt existing consumer voice assistants. Compared to Siri, while today’s ChatGPT is largely unable to complete real-world tasks like hailing an Uber, it’s far better than Siri at understanding and generating language, especially in response to novel requests. From Tom’s Hardware, this captures the sentiment I see among tech commentators: GPT-4o will enable ChatGPT to become a legitimate Siri competitor, with real-time conversations via voice that are responded to instantly without lag time. […] ChatGPT’s new real-time responses make tools like Siri and Echo seem lethargic. And although ChatGPT likely won’t be able to schedule your haircuts like Google Assistant can, it did put up admirable real-time translating chops to challenge Google. Last year, there were rumors that OpenAI was working on its own hardware, which would open the possibility of integrating ChatGPT at the system level along the lines of the Humane Ai Pin....
a year ago


More from Kevin Chen

Real estate is one of the hardest open problems in scaled self driving

I’ve had a minor obsession with Waymo’s autonomous vehicle depots recently. Over the past few months, I’ve flown a drone as part of a stakeout to understand how they work. And I’ve taken a deep dive into an apparent Waymo outage to find the company charging its electric vehicles from temporary diesel generators. The reason for my obsession? I believe depot buildouts will be one of the last hard problems in scaled autonomous driving. Long after the hardware, software, and AI have been perfected, real estate acquisition will remain a limiting factor in large-scale AV deployment. Waymo’s main depot at 201 Toland Street, San Francisco. Will self driving follow software scaling laws? In 2021, Elon Musk claimed that Tesla FSD’s release will be “one of the biggest asset value increases in history.” The day FSD goes to wide release will be one of the biggest asset value increases in history — Elon Musk (@elonmusk) October 20, 2021 Musk is arguing that, once autonomous driving has been solved, it can be instantly rolled out at the push of a button. Nearly all of Tesla’s fleet could be put to productive use without humans behind the wheel. While Musk’s viewpoint is on the extreme end, it’s a sentiment shared by many who have worked on or invested in autonomous driving over the years. Once you have hardware capable of supporting safe driverless operation, it’s just a matter of developing the right software. Software can be replicated infinitely at zero marginal cost. Could autonomous driving therefore scale as quickly as software platforms like Uber or DoorDash? The answer is not so simple. Self-driving cars are still cars — cars that exist in the physical world and need to be parked, fueled, cleaned, and repaired. 
Uber and other multi-sided marketplace platforms have been able to grow exponentially because they distribute these responsibilities to the individual drivers — many Uber drivers park at their own homes — allowing the platform provider to focus on developing the software pieces. So far, AV companies like Waymo and Cruise have taken a different approach. They’ve preferred to centralize these operational tasks in large depots staffed with their own personnel. This is because AV technology is still maturing and cannot be easily productized in the short term. Additionally, Timothy B. Lee notes in Understanding AI that “having hardware, software, and support services all under one roof makes it easier for Waymo to experiment with different technologies and business models.” When the kinks are still being worked out, it is more straightforward to vertically integrate everything in a single organization. The many jobs to be done of a robotaxi depot Depots for human-driven fleets, such as rental cars or delivery vans, only require a parking area with minimal additional infrastructure. This enables a fairly straightforward trade-off between location and cost: the fleet operator seeks a location close to customer demand while minimizing rent. For example, a logistics company participating in Amazon’s Delivery Service Partner program can run its depot from any sufficiently cheap parking lot near the local Amazon warehouse. The same constraints affect depot selection for autonomous vehicles. However, the depot also needs to be more than just a convenient parking lot to store off-duty cars. Because AVs are often also EVs, the ideal site also has electric vehicle charging. Because AVs need to upload driving logs to the cloud, it should have a high-speed Internet connection too. Let’s explore these constraints in detail. 
Location Depots should be placed close to customer demand to minimize deadheading (non-revenue driving), which would raise costs while degrading the customer experience with longer pickup times. Ideal depots are therefore located in desirable residential or commercial areas, where there is more competition among potential tenants. Placing depots in high-demand neighborhoods instead of industrial areas can also increase the probability of local opposition. Waymo has already encountered opposition during a proposed expansion of their main depot, even though it is located in an industrial neighborhood with many similar facilities. Again from Timothy B. Lee: Waymo sought a permit to convert the warehouse next door into some office space and a parking lot for Waymo employees. San Francisco’s Board of Supervisors unanimously rejected Waymo’s application. The rejection was partly based on fears that Waymo would eventually use the space to launch a delivery service in the city (Waymo hasn’t announced any plans to do this so far). But it also reflected city leaders’ frustration with their general lack of power over Waymo. Now consider the recent incident in which driverless Waymo vehicles honked at each other while entering a depot near residential buildings in San Francisco, often well into the early hours of the morning. While residents and the company resolved the situation amicably, it will surely be raised in future discussions of new Waymo depots in residential areas should they come before the Board of Supervisors or Planning Commission. Electric vehicle charging AV developers have preferred to run their services with electric vehicles. Although AV and EV technologies are not inherently coupled, running a fully electric fleet adds an environmental angle to the AV sales pitch, allowing the companies to claim that AV rides reduce emissions by displacing gas-powered driving. An EV fleet also lowers vehicle maintenance costs. 
Waymo and Cruise each have locations with DC fast charging capability. This approach avoids relying on public chargers. Taking Waymo’s primary San Francisco depot as an example, the company installed 38 chargers of approximately 60 kW each, implying a total site power of around 2.4 MW.

Waymo vehicles charging in San Francisco. Approximately one-third of parking spots in the main depot have charging.

Bringing in so many high-power chargers likely added significant complexity to Waymo’s depot construction. While we don’t know Waymo’s process, we have a fairly good benchmark from the Tesla community, which tracks Tesla Supercharger installations closely. From Bruce Mah, a seasoned EV charging observer, the construction process is:

“As with any construction project, things usually start with selecting a site and permitting. There will often be some demolition / excavation of part of a parking lot (Superchargers are often built in existing parking lots). Tesla equipment such as charging cabinets, posts, etc. will usually be installed next (see T1 below). Eventually there will be some inspections from the local Authority Having Jurisdiction (AHJ). A utility transformer (from PG&E, SCE, etc.) is usually the last piece of equipment to be installed. Repaving, painting, and installation of parking stops will also usually happen late in the process, as well as landscaping and lighting enhancements.”

Of these steps, permitting and utility work are not within the charging operator’s control. California municipalities, especially San Francisco, have a notoriously slow and political permitting process. With PG&E, the utility serving much of the state, electrical service upgrades involving a new distribution transformer can take months.

Timelines aside, building out a charging site is also expensive. For example, an agreement between Tesla and the City of West Hollywood values an eight-plug location at $482,942 for both equipment and construction.
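As a quick sanity check on the figures above, the arithmetic can be written out in a few lines. This is only a sketch that reuses the numbers quoted in the text; the function and constant names are made up for illustration. Note that the rounded 60 kW figure gives a slightly lower total than the quoted 2.4 MW, which would correspond to roughly 63 kW per charger.

```python
# Figures quoted in the text above; nothing here is measured independently.
WAYMO_CHARGERS = 38
WAYMO_KW_PER_CHARGER = 60.0      # "approximately 60 kW each"
WEHO_TOTAL_COST_USD = 482_942    # Tesla / West Hollywood, equipment + construction
WEHO_PLUGS = 8


def site_power_mw(chargers: int, kw_per_charger: float) -> float:
    """Peak site power if every charger runs at full output at once."""
    return chargers * kw_per_charger / 1000.0


def cost_per_plug_usd(total_cost_usd: float, plugs: int) -> float:
    """Average build-out cost per plug."""
    return total_cost_usd / plugs


power_mw = site_power_mw(WAYMO_CHARGERS, WAYMO_KW_PER_CHARGER)
per_plug = cost_per_plug_usd(WEHO_TOTAL_COST_USD, WEHO_PLUGS)
print(f"Site power: {power_mw:.2f} MW")
print(f"Cost per plug: ${per_plug:,.0f}")
```

The West Hollywood site works out to roughly $60,000 per plug. Superchargers and depot chargers are different equipment, so this is a ballpark for the order of magnitude and not an estimate of Waymo’s actual costs.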
Data offload

The final piece of the puzzle is data offload. Autonomous vehicles log vast amounts of data as they drive, measured in hundreds of GBs to TBs per hour.

Some of the data is subject to mandatory retention and must be uploaded for later review. At a minimum, all AV collisions in California must be reported to the DMV. Regulators at all levels of government expect the AV developer to present analyses of serious incidents, including recordings from the vehicle and explanations of the AV’s decisions.

In addition to regulatory requirements, the AV developer often wants to return much more data for engineering purposes: near misses, stuck events, novel or interesting scenarios, and more.

The upshot is that the AV operator needs to upload a substantial portion of the hundreds of GBs to TBs logged per hour of driving. Uploading over cellular networks would not be cost effective. These transfers must occur at a depot.

Today, it’s likely that Waymo and Cruise use disk swapping for data offload. When a car fills up its internal logging disk, it notifies an operator to plug in a fresh one. The full disks may be uploaded directly from the depot or shipped to a datacenter. This whole process is labor intensive and, over time, may pose a reliability concern due to dust or water ingress.

A Waymo operator performs a possible disk swap.

Many AV developers are moving toward direct data transfer from the vehicle using Ethernet, Wi-Fi, or a private 5G network, which reduces the number of manual touch points and moving parts. Charging is a great time to perform these transfers. However, this imposes an additional requirement on the depot: a fast upload speed, probably a fiber connection of at least 10 Gbps.

Where do we go from here?

When we put all three requirements together (great location, high-power EV charging, and high-speed Internet), there may be few or no sites that fit the bill.
This would require the AV operator to take on site-specific construction projects to add amenities like charging and Internet — a strategy that sits in direct opposition to rapid and cost-effective scaling. Another possibility is to engineer ways to relax the constraints. Decoupling the requirements Waymo and Cruise do not require all of their locations to have charging and data offload. For example, Waymo operates satellite lots in downtown San Francisco only for storing their off-hail vehicles. Every night, fleet management software instructs the cars to travel back to the main depot for charging and data offload. A Waymo satellite location in San Francisco with minimal staffing, no charging, and apparently no data offloading. This solution works as long as the total charging and data transfer capacity across all locations exceeds the average throughput required to keep the fleet in working order. However, the lack of redundancy can lead to cascading failures, such as the apparent power outage at Waymo’s main depot that led the company to shut down many vehicles during a Friday evening rush hour. Reducing charging power Waymo and Cruise currently use DC fast charging (DCFC) for their fleets. Level 2 (L2) or AC charging could reduce the cost of buildouts because the equipment is cheaper and can often be added without bringing a new utility transformer. This could enable overnight charging in satellite locations that do not currently have any charging capacity. Imagine an operator showing up to plug in all the cars at night when there is little demand, then returning in the morning to unplug them. There is an order of magnitude speed difference between L2 and DCFC. This is important for consumer charging, where the consumer cares about the time to get a single car back on the road. However, charging power for any individual car becomes less important when charging a large fleet. 
Fleet operators care about the total throughput of turning around cars, which is proportional to total power delivered across all chargers. In other words, assuming an autonomous ride hailing service will always have overnight lulls in demand and enough parking spots during those times, the most scalable strategy is to procure your desired total charging power at the lowest price.

DCFC equipment costs disproportionately more per kW due to the additional complexity of the charging equipment, and that doesn’t include the additional maintenance complexity. The table below compares ChargePoint’s cheapest L2 and DCFC units:

Charger              Power (kW)   Price ($)   Unit price ($/kW)
ChargePoint CPF50    9.6          1,299       135
ChargePoint CPE250   62.5         52,000      832

In addition to more scalable depot buildouts, reduced charging power can also increase the longevity of the vehicle’s traction battery, which is an important factor in managing vehicle depreciation.

Reducing data logging rate

Most AV developers start out by logging and uploading all data generated on their vehicles. This makes development easy because the data is always there when you need it. That practice has to change once a growing fleet generates proportionally more logs, most of which contain routine driving and are not very interesting.

We can split the data logged by AVs into two categories:

Raw sensor data, such as lidar point clouds, camera images, radar returns, and audio.
Derived data, such as detections from the perception system or motion plans from the behavior system.

One approach is to keep only one category of data. Retaining only the derived data can still enable debugging of serious incidents, as long as the perception system can be trusted to provide a faithful representation of the raw sensor data. On the other hand, retaining only the raw sensor data makes the logs more useful for developing the mapping and perception system.
Similar-looking derived data can be generated by running a replay simulator as needed, but it is challenging to reproduce the exact same outputs as those on the vehicle unless the AV software is fully deterministic. Data retention decisions can also be made temporally. The key challenge here is high-recall classification of which time ranges in the log must be retained. For example, if a DMV-reportable collision occurs, the associated log data must never be discarded. These decisions can happen either on-device or in the cloud, but they must be made without uploading the full log to the cloud, since our bottleneck is the connection from the vehicle to the Internet. Conclusion The current trajectory for scaled autonomous driving would require desirable depot locations to include charging and Internet, making real estate acquisition challenging. There exist opportunities to reduce the additional requirements over time with the goal of making the problem closer to “rent a bunch of conveniently located parking lots.” While these are not traditionally considered autonomous driving problems, solving them will be key to unlocking the next phase of scaling.
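To make the temporal retention idea from the data logging section concrete, here is a minimal sketch of one way to turn flagged events into time ranges that must be kept. The event kinds, padding values, and function names are all hypothetical and chosen for illustration; a real system would pick its own. Unknown event kinds fall back to a default padding so that the rule errs toward keeping data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float    # seconds since the start of the log
    kind: str   # e.g. "collision", "near_miss", "stuck"


# Hypothetical (seconds_before, seconds_after) padding per event kind.
PADDING_S = {
    "collision": (30.0, 30.0),
    "near_miss": (10.0, 10.0),
    "stuck": (5.0, 20.0),
}
DEFAULT_PADDING_S = (10.0, 10.0)  # unknown kinds are still retained


def retention_windows(events, log_duration_s):
    """Return merged, sorted (start, end) ranges of the log to retain."""
    windows = []
    for event in events:
        before, after = PADDING_S.get(event.kind, DEFAULT_PADDING_S)
        start = max(0.0, event.t - before)
        end = min(log_duration_s, event.t + after)
        windows.append((start, end))

    windows.sort()
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def retained_fraction(windows, log_duration_s):
    """Fraction of the log's duration covered by the retained windows."""
    return sum(end - start for start, end in windows) / log_duration_s


events = [Event(100.0, "near_miss"), Event(115.0, "collision"), Event(3000.0, "stuck")]
windows = retention_windows(events, log_duration_s=3600.0)
fraction = retained_fraction(windows, 3600.0)
print(windows, f"{fraction:.1%} of the log retained")
```

In this toy one-hour log, the near miss and the collision merge into a single window, and only a few percent of the log needs to leave the vehicle. The hard part, as noted above, is the high-recall classifier that produces the events in the first place.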

10 months ago 75 votes
How autonomous vehicle simulation works

When autonomous vehicle developers justify the safety of their driverless vehicle deployments, they lean heavily on their testing in simulation. Common talking points take the form of “we made our car drive X billion miles in simulation.” From these vague statements, it’s challenging to determine what a simulator is, or how it works. There’s more to simulation than endless driving in a virtual environment. For example, Waymo’s technology overview page says (emphasis mine): We’ve driven more than 20 billion miles in simulation to help identify the most challenging situations our vehicles will encounter on public roads. We can either replay and tweak real-world miles or build completely new virtual scenarios, for our autonomous driving software to practice again and again. Cruise’s safety page contains similar language:1 Before setting out on public roads, Cruise vehicles complete more than 250,000 simulations and closed course testing during everyday and extreme conditions. The main impression one gets from these overviews is that (1) simulation can test many driving scenarios, and (2) everyone will be super impressed if you use it a lot. Going one layer deeper to the few blog posts and talks full of slick GIFs, you might reach the conclusion that simulation is like a video game for the autonomous vehicle in the vein of Grand Theft Auto (GTA): a fully generated 3D environment complete with textures, lighting, and non-player characters (NPCs). Much like human players of GTA, the autonomous vehicle would be able to drive however it likes, freed from real-world consequences. Source: Cruise. While this type of fully synthetic simulation exists in the world of autonomous driving, it’s actually the least commonly used type of simulation.2 Instead, just as a software developer leans on many kinds of testing before releasing an application, an AV developer runs many types of simulation before deploying an autonomous vehicle. 
Each type of simulation is best suited for a particular use case, with trade-offs between realism, coverage, technical complexity, and cost to operate. In this post, we’ll walk through the system design of a simulator at a hypothetical AV company, starting from first principles. We may never know the details of the actual simulator architecture used by any particular AV developer. However, by exploring the design trade-offs from first principles, I hope to shed some light on how this key system works. Contents Our imaginary self-driving car Replay simulation Interactivity and the pose divergence problem Synthetic simulation The high cost of realistic imagery Round-trip conversions to pixels and back Skipping the sensor data Making smart agents Generating scene descriptions Limitations of pure synthetic simulation Hybrid simulation Conclusion Our imaginary self-driving car Let’s begin by defining our hypothetical autonomous driving software, which will help us illustrate how simulation fits into the development process. Imagine it’s 2015, the peak of self-driving hype, and our team has raised a vast sum of money to develop an autonomous vehicle. Like a human driver, our software drives by continuously performing a few basic tasks: It makes observations about the road and other road users. It reasons about what others might do and plans how it should drive. Finally, it executes those planned motions by steering, accelerating, and braking. Rinse and repeat. This mental model helps us group related code into modules, enabling them to be developed and tested independently. There will be four modules in our system:3 Sensor Interface: Take in raw sensor data such as camera images and lidar point clouds. Sensing: Detect objects such as vehicles, pedestrians, lane lines, and curbs. Behavior: Determine the best trajectory (path) for the vehicle to drive. 
Vehicle Interface: Convert the trajectory into steering, accelerator, and brake commands to control the vehicle’s drive-by-wire (DBW) system.

We connect our modules to each other using an inter-process communication framework (“middleware”) such as ROS, which provides a publish–subscribe system (pubsub) for our modules to talk to each other.

Here’s a concrete example of our module-based encapsulation system in action: The sensing module publishes a message containing the positions of other road users. The behavior module subscribes to this message when it wants to know whether there are pedestrians nearby. The behavior module doesn’t know and doesn’t care how the sensing module detected those pedestrians; it just needs to see a message that conforms to the agreed-upon API schema.

Defining a schema for each message also allows us to store a copy of everything sent through the pubsub system. These driving logs will come in handy for debugging because they allow us to inspect the system with module-level granularity.

Our full system looks like this:

Simplified architecture diagram for an autonomous vehicle.

Now it’s time to take our autonomous vehicle for a spin. We drive around our neighborhood, encountering some scenarios in which our vehicle drives incorrectly, causing our in-car safety driver to take over driving from the autonomous vehicle. Each disengagement gets reviewed by our engineering team. They analyze the vehicle’s logs and propose some software changes.

Now we need a way to prove our changes have actually improved performance. We need the ability to compare the effectiveness of multiple proposed fixes. We need to do this quickly so our engineers can receive timely feedback. We need a simulator!

Replay simulation

Motivated by the desire to make progress quickly, we try the simplest solution first. The key insight: our software modules don’t care where the incoming messages come from.
Could we simulate a past scenario by simply replaying messages from our log as if they were being sent in real time? As the name suggests, this is exactly how replay simulation works. Under normal operation, the input to our software is sensor data captured from real sensors. The simulator replaces this by replaying sensor data from an existing log. Under normal operation, the output of our software is a trajectory (or a set of accelerator and steering commands) that the real car executes. The simulator intercepts the output to control the simulated vehicle’s position instead. Modified architecture diagram for running replay simulation. There are two primary ways we can use this type of simulator, depending on whether we use a different software version as the onroad drive: Different software: By running modified versions of our modules in the simulator, we can get a rough idea of how the changes will affect the vehicle’s behavior. This can provide early feedback on whether a change improves the vehicle’s behavior or successfully fixes a bug. Same software: After a disengagement, we may want to know what would have happened if the autonomous vehicle were allowed to continue driving without human input. Simulation can provide this counterfactual by continuing to play back messages as if the disengagement never happened. We’ve gained these important testing capabilities with relatively little effort. Rather than take on the complexity of a fully generated 3D environment, we got away with a few modifications to our pubsub framework. Interactivity and the pose divergence problem The simplicity of a pure replay simulator also leads to its key weakness: a complete lack of interactivity. Everything in the simulated environment was loaded verbatim from a log. Therefore, the environment does not respond to the simulated vehicle’s behavior, which can lead to unrealistic interactions with other road users. 
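The replay setup described above can be sketched with a toy in-process pubsub. Everything here is hypothetical and heavily simplified: the topic names, the message fields, the one-dimensional vehicle, and the choice to replay sensing output instead of raw sensor data are assumptions made to keep the example short. It is not how ROS or any particular AV stack is implemented.

```python
from collections import defaultdict


class PubSub:
    """Tiny stand-in for a middleware bus: topics map to subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


def attach_behavior_module(bus, cruise_speed_mps):
    """Toy behavior module: answers every 'objects' message with a speed command."""

    def on_objects(message):
        bus.publish("trajectory", {"t": message["t"], "speed": cruise_speed_mps})

    bus.subscribe("objects", on_objects)


def run_replay(log, cruise_speed_mps, dt_s=1.0):
    """Replay logged messages and return the pose divergence at each timestep."""
    bus = PubSub()
    attach_behavior_module(bus, cruise_speed_mps)
    state = {"x": log[0]["ego_x"]}  # simulated vehicle starts at the logged pose

    # The simulator intercepts the output instead of sending it to a real car.
    def on_trajectory(command):
        state["x"] += command["speed"] * dt_s

    bus.subscribe("trajectory", on_trajectory)

    divergence = []
    for entry in log:
        divergence.append(abs(state["x"] - entry["ego_x"]))
        # Logged inputs are replayed verbatim, regardless of what the sim car does.
        bus.publish("objects", {"t": entry["t"], "objects": entry["objects"]})
    return divergence


# Hypothetical log: the real vehicle drove at 10 m/s along a straight line.
LOG = [{"t": t, "ego_x": 10.0 * t, "objects": []} for t in range(4)]

print(run_replay(LOG, cruise_speed_mps=10.0))  # same behavior as the real drive
print(run_replay(LOG, cruise_speed_mps=8.0))   # modified software drives slower
```

When the simulated software drives exactly like the logged drive, the simulated and logged positions stay together. When it drives slower, the gap grows every step while the replayed inputs carry on as if nothing had changed.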
This classic example demonstrates what can happen when the simulated vehicle’s behavior changes too much: Watch on YouTube. Dragomir Anguelov’s guest lecture at MIT. Source: Lex Fridman. Our vehicle, when it drove in the real world, was where the green vehicle is. Now, in simulation, we drove differently and we have the blue vehicle. So we’re driving…bam. What happened? Well, there is a purple agent over there — a pesky purple agent — who, in the real world, saw that we passed them safely. And so it was safe for them to go, but it’s no longer safe, because we changed what we did. So the insight is: in simulation, our actions affect the environment and needed to be accounted for. Anguelov’s video shows the simulated vehicle driving slower than the real vehicle. This kind of problem is called pose divergence, a term that covers any simulation where differences in the simulated vehicle’s driving decisions cause its position to differ from the real-world vehicle’s position. In the video, the pose divergence leads to an unrealistic collision in simulation. A reasonable driver in the purple vehicle’s position would have observed the autonomous vehicle and waited for it to pass before entering the intersection.4 However, in replay simulation, all we can do is play back the other driver’s actions verbatim. In general, problems arising from the lack of interactivity mean the simulated scenario no longer provides useful feedback to the AV developer. This is a pretty serious limitation! The whole point of the simulator is to allow the simulated vehicle to make different driving decisions. If we cannot trust the realism of our simulations anytime there is an interaction with another road user, it rules out a lot of valuable use cases. Synthetic simulation We can solve these interactivity problems by using a simulated environment to generate synthetic inputs that respond to our vehicle’s actions. 
Creating a synthetic simulation usually starts with a high-level scene description containing: Agents: fully interactive NPCs that react to our vehicle’s behavior. Environments: 3D models of roads, signs, buildings, weather, etc. that can be rendered from any viewpoint. From the scene description, we can generate different types of synthetic inputs for our vehicle to be injected at different layers of its software stack, depending on which modules we want to test. In synthetic sensor simulation, the simulator uses a game engine to render the scene description into fake sensor data, such as camera images, lidar point clouds, and radar returns. The simulator sets up our software modules to receive the generated imagery instead of sensor data logged from real-world driving. Modified architecture diagram for running synthetic simulation with generated sensors. The same game engine can render the scene from any arbitrary perspective, including third-person views. This is how they make all those slick highlight reels. The high cost of realistic imagery Simulations that generate fake sensor data can be quite expensive, both to develop and to run. The developer needs to create a high-quality 3D environment with realistic object models and lighting rivaling AAA games. Example of Cruise’s synthetic simulation showing the same scene rendered into synthetic camera, lidar, and radar data. Source: Cruise. For example, a Cruise blog post mentions some elements of their synthetic simulation roadmap (emphasis mine): With limited time and resources, we have to make choices. For example, we ask how accurately we should model tires, and whether or not it is more important than other factors we have in our queue, like modeling LiDAR reflections off of car windshields and rearview mirrors or correctly modeling radar multipath returns. 
Even if rendering reflections and translucent surfaces is already well understood in computer graphics, Cruise may still need to make sure their renderer generates realistic reflections that resemble what their real lidar records. This challenge gives a sense of the attention to detail required. It’s only one of many that need to be solved when building a synthetic sensor simulator.

So far, we have only covered the high development costs. Synthetic sensor simulation also incurs high variable costs every time simulation is run.

Round-trip conversions to pixels and back

By its nature, synthetic sensor simulation performs a round-trip conversion to and from synthetic imagery to test the perception system. The game engine first renders its scene description to synthetic imagery for each sensor on the simulated vehicle, burning many precious GPU-hours in the process, only to have the perception system perform the inverse operation when it detects the objects in the scene to produce the autonomous vehicle’s internal scene representation.

Every time you launch a synthetic sensor simulation, NVIDIA, Intel, and/or AWS are laughing all the way to the bank.

Despite the expense of testing the perception system with synthetic simulation, it is also arguably less effective than testing with real-world imagery paired with ground truth labels. With real imagery, there can be no question about its realism. Synthetic imagery never looks quite right.

These practical limitations mean that synthetic sensor simulation ends up as the least used simulator type in AV companies. Usually, it’s also the last type of simulator to be built at a new company. Developers don’t need synthetic imagery most of the time, especially when they have at their disposal a fleet of vehicles that can record the real thing.

On the other hand, we cannot easily test risky driving behavior in the real world. For example, it is better to synthesize a bunch of red light runners than try to find them in the real world.
This means we are primarily using synthetic simulation to test the behavior system. Skipping the sensor data In synthetic agent simulation, the simulator uses a high-level scene description to generate synthetic outputs from the perception/sensing system. In software development terms, it’s like replacing the perception system with a mock to focus on testing downstream components. This type of simulation requires fewer computational resources to run because the scene description doesn’t need to make a round-trip conversion to sensor data. Modified architecture diagram for running synthetic simulation with generated agents. With image quality out of the picture, the value of synthetic simulation rests solely on the quality of the scenarios it can create. We can split this into two main challenges: designing agents with realistic behaviors generating the scene descriptions containing various agents, street layouts, and environmental conditions Making smart agents You could start developing the control policy for a smart agent similar to NPC design in early video games. A basic smart agent could simply follow a line or a path without reacting to anyone else, which could be used to test the autonomous vehicle’s reaction to a right of way violation. A fancier smart agent could follow a path while also maintaining a safe following distance from the vehicle in front. This type of agent could be placed behind our simulated vehicle, resolving the rear-ending problem mentioned above. Like an audience of demanding gamers, the users of our simulator quickly expect increasingly complex and intelligent behaviors from the smart agents. An ideal smart agent system would capture the full spectrum of every action that other road users could possibly take. This system would also generate realistic behaviors, including realistic-looking trajectories and reaction times, so that we can trust the outcomes of simulations involving smart agents. 
Finally, our smart agents need to be controllable: they can be given destinations or intents, enabling developers to design simulations that test specific scenarios. Watch on YouTube. Two Cruise simulations in which smart agents (orange boxes) interact with the autonomous vehicle. In the second simulation, two parked cars have been inserted into the bottom of the visualization. Notice how the smart agents and the autonomous vehicle drive differently in the two simulations as they interact with each other and the additional parked cars. Source: Cruise. Developing a great smart agent policy ends up falling in the same difficulty ballpark as developing a great autonomous driving policy. The two systems may even share technical foundations. For example, they may have a shared component that is trained to predict the behaviors of other road users, which can be used for both planning our vehicle’s actions and for generating realistic agents in simulation. Generating scene descriptions Even with the ability to generate realistic synthetic imagery and realistic smart agent behaviors, our synthetic simulation is not complete. We still need a broad and diverse dataset of scene descriptions that can thoroughly test our vehicle. These scene descriptions usually come from a mix of sources: Automatic conversion from onroad scenarios: We can write a program that takes a logged real-world drive, guesses the intent of other road users, and stores those intents as a synthetic simulation scenario. Manual design: Analogous to a level editor in a video game. A human either builds the whole scenario from scratch or makes manual edits to an automatic conversion. For example, a human can design a scenario based on a police report of a human-on-human-driver collision to simulate what the vehicle might have done in that scenario. Generative AI: Recent work from Zoox uses diffusion models trained on a large dataset of onroad scenarios. 
Example of a real-world log (top) converted to a synthetic simulation scenario, then rendered into synthetic camera images (bottom). Notice how some elements, such as the protest signs, are not carried over, perhaps because they are not supported by the perception system or the scene converter. Source: Cruise. Scenarios can also be fuzzed, where the simulator adds random noise to the scene parameters, such as the speed limit of the road or the goals of simulated agents. This can upsample a small number of converted or manually designed scenes to a larger set that can be used to check for robustness and prevent overfitting. Fuzzing can also help developers understand the space of possible outcomes, as shown in the example below, which fuzzes the reaction time of a synthetic tailgater: An example of fuzzing tailgater reaction time. Source: Waymo. The distribution on the right shows a dot for each variant of the scenario, colored green or red depending on whether a simulated collision occurred. In this experiment, the collision becomes unavoidable once the simulated tailgater’s reaction time exceeds about 1 second. Limitations of pure synthetic simulation With these sources plus fuzzing, we’ve ensured the quantity of scenarios in our library, but we still don’t have any guarantees on the quality. Perhaps the scenarios we (and maybe our generative AI tools) invent are too hard or too easy compared to the distribution of onroad driving our vehicle encounters. If our vehicle drives poorly in a synthetic scenario, does the autonomous driving system need improvement? Or is the scenario unrealistically hard, perhaps because the behavior of its smart agents is too unreasonable? If our vehicle passes with flying colors, is it doing a good job? Or is the scenario library missing some challenging scenarios simply because we did not imagine that they could happen? This is a fundamental problem of pure synthetic simulation. 
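The reaction-time fuzzing experiment described above can be sketched in a few lines. This is a toy model, not Waymo's or Cruise's simulator: the lead vehicle and the tailgater start at the same speed, both brake at the same constant deceleration, and the tailgater only starts braking after its reaction time. The function names, speed, gap, and deceleration are all made-up values for illustration.

```python
import random


def tailgater_collides(reaction_time_s, speed_mps=15.0, gap_m=15.0, decel_mps2=6.0):
    """Return True if the tailgater hits the lead vehicle in this toy model.

    Assumes both vehicles start at the same speed and brake with the same
    constant deceleration, so comparing the final stopped positions is
    enough to decide whether a collision happened.
    """
    lead_stop_m = speed_mps ** 2 / (2 * decel_mps2)
    # The tailgater keeps its speed during the reaction time, then brakes.
    follower_stop_m = speed_mps * reaction_time_s + speed_mps ** 2 / (2 * decel_mps2)
    return follower_stop_m > gap_m + lead_stop_m


def fuzz_reaction_time(n_variants, low_s=0.2, high_s=2.0, seed=0):
    """Sample reaction times uniformly and record the outcome of each variant."""
    rng = random.Random(seed)  # fixed seed so a fuzzing run is reproducible
    results = []
    for _ in range(n_variants):
        reaction_time_s = rng.uniform(low_s, high_s)
        results.append((reaction_time_s, tailgater_collides(reaction_time_s)))
    return results


if __name__ == "__main__":
    for reaction_time_s, collided in sorted(fuzz_reaction_time(10)):
        print(f"{reaction_time_s:.2f} s -> {'collision' if collided else 'ok'}")
```

In this simplified model the outcome flips at a reaction time of gap divided by speed, which is the kind of boundary a fuzzing sweep is meant to expose. A real simulator would run the full behavior stack for every variant instead of a closed-form check.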
Once we start modifying and fuzzing our simulated scenarios, there isn’t a straightforward way to know whether they remain representative of the real world. And we still need to collect a large quantity of real-world mileage to ensure that we have not missed any rare scenarios. Hybrid simulation We can combine our two types of simulator into a hybrid simulator that takes advantage of the strengths of each, providing an environment that is both realistic and interactive without breaking the bank. From replay simulation, use log replay to ensure every simulated scenario is rooted in a real-world scenario and has perfectly realistic sensor data. From synthetic simulation, make the simulation interactive by selectively replacing other road users with smart agents if they could interact with our vehicle.6 Modified architecture diagram merging parts of replay and synthetic simulation. Hybrid simulation usually serves as the default type of simulation that works well for most use cases. One convenient interpretation is that hybrid simulation is a worry-free replacement for replay simulation: anytime the developer would have used replay, they can absentmindedly switch to hybrid simulation to take care of the most common simulation artifacts while retaining most of the benefits of replay simulation. Conclusion We’ve seen that there are many types of simulation used in autonomous driving. They exist on a spectrum from purely replaying onroad scenarios to fully synthesized environments. The ideal simulation platform allows developers to pick an operating point on that spectrum that fits their use case. Hybrid simulation based on a large volume of real-world miles satisfies most testing needs at a reasonable cost, while fully synthetic modes serve niche use cases that can justify the higher development and operating costs. Cruise has written several deep dives about the usage and scaling of their simulation platform. 
However, neither Cruise nor Waymo provide many details on the construction of their simulator. ↩ I’ve even heard arguments that it’s only good for making videos. ↩ There exist architectures that are more end-to-end. However, to the best of my knowledge, those systems do not have driverless deployments with nontrivial mileage, making simulation testing less relevant. ↩ Another interactivity problem arises from the replay simulator’s inability to simulate different points of view as the simulated vehicle moves. A large pose divergence often causes the simulated vehicle to drive into an area not observed by the vehicle that produced the onroad log. For example, a simulated vehicle could decide to drive around a corner much earlier. But it wouldn’t be able to see anything until the log data also rounds the corner. No matter where the simulated vehicle drives, it will always be limited to what the logged vehicle saw. ↩ “Computer vision is inverse computer graphics.” ↩ As a nice bonus, because the irrelevant road users are replayed exactly as they drove in real life, this may reduce the compute cost of simulation. ↩

a year ago 58 votes
Why autonomous trucking is harder than autonomous rideshare

Recently, The Verge asked, “where are all the robot trucks?” It’s a good question. Trucking was supposed to be the ideal first application of autonomous driving. Freeways contain predictable, highly structured driving scenarios. An autonomous truck would not have to deal with the complexities of intersections and two-way traffic. It could easily drive hundreds of miles without encountering a single pedestrian. DALL-E 3 prompt: “Generate an artistic, landscape aspect ratio watercolor painting of a truck with a bright red cab, pulling a white trailer. The truck drives uphill on an empty, rural highway during wintertime, lined with evergreen trees and a snow bank on a foggy, cloudy day.” The trucks could also be commercially viable with only freeway driving capability, or freeways plus a short segment of surface streets needed to reach a transfer hub. The AV company would only need to deal with a limited set of businesses as customers, bypassing the messiness of supporting a large pool of consumers inherent to the B2C model. Autonomous trucks would not be subject to rest requirements. As The Verge notes, “truck operators are allowed to drive a maximum of 11 hours a day and have to take a 30-minute rest after eight consecutive hours behind the wheel. Autonomous trucks would face no such restrictions,” enabling them to provide a service that would be literally unbeatable by a human driver. If you had asked me in 2018, when I first started working in the AV industry, I would’ve bet that driverless trucks would be the first vehicle type to achieve a million-mile driverless deployment. Aurora even pivoted their entire company to trucking in 2020, believing it to be easier than city driving. Yet sitting here in 2024, we know that both Waymo and Cruise have driven millions of miles on city streets — a large portion in the dense urban environment of San Francisco — and there are no driverless truck deployments. What happened? 
I think the problem is that driverless autonomous trucking is simply harder than driverless rideshare. The trucking problem appears easier at the outset, and indeed many AV developers quickly reach their initial milestones, giving them false confidence. But the difficulty ramps up sharply when the developer starts working on the last bit of polish. They encounter thorny problems related to the high speeds on freeways and trucks’ size, which must be solved before taking the human out of the driver’s seat. What is the driverless bar? Here’s a simplistic framework: No driver in the vehicle. No guarantee of a timely response from remote operators or backend services. Therefore, all safety-critical decisions must be made by the onboard computer alone. Under these constraints, the system still meets or exceeds human safety level. This is a really, really high bar. For example, on surface streets, this means the system on its own is capable of driving at least 100k miles without property damage and 40M miles without fatality.1 The system can still have flaws, but virtually all of those problems must result in a lack of progress, rather than collision or injury. In short, while the system may not know the right thing to do in every scenario, it should never do the wrong thing. (There are several high quality safety frameworks for those interested in a rigorous definition.23 It’s beyond the scope of this post.) Now, let’s look at each aspect of trucking to see how it exacerbates these challenges. Truck-specific challenges Stopping distance vs. sensing range The required sensor capability for an autonomous vehicle is determined by the most challenging scenario that the vehicle needs to handle. A major challenge in trucking is stopping behind a stalled vehicle or large debris in a travel lane. To avoid collision, the autonomous vehicle would need a sensing range greater than or equal to its stopping distance. 
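To make the "sensing range must cover stopping distance" requirement concrete, here is a rough sketch using plain constant-deceleration kinematics: distance traveled during a reaction delay plus braking distance v²/(2a). The decelerations (about 8 m/s² for passenger vehicles, about 4 m/s² for trucks) and the 2.5-second reaction time are the figures the article uses in the discussion that follows. The function names are made up for illustration, and this simple formula is not the method behind the article's cited stopping-distance table.

```python
MPH_TO_MPS = 0.44704  # exact conversion factor from miles per hour to meters per second


def stopping_distance_m(speed_mph, decel_mps2, reaction_time_s=2.5):
    """Distance covered during the reaction delay plus constant-deceleration braking."""
    speed_mps = speed_mph * MPH_TO_MPS
    reaction_distance_m = speed_mps * reaction_time_s
    braking_distance_m = speed_mps ** 2 / (2 * decel_mps2)
    return reaction_distance_m + braking_distance_m


def sensing_range_sufficient(sensor_range_m, speed_mph, decel_mps2, reaction_time_s=2.5):
    """True if the sensor sees at least as far as the vehicle needs to stop."""
    return sensor_range_m >= stopping_distance_m(speed_mph, decel_mps2, reaction_time_s)


if __name__ == "__main__":
    for label, decel_mps2 in (("passenger vehicle", 8.0), ("truck", 4.0)):
        distance_m = stopping_distance_m(70, decel_mps2)
        print(f"{label} at 70 mph: about {distance_m:.0f} m to stop")
```

Halving the available deceleration doubles the braking term, which is why the truck's requirement lands so much closer to the limit of usable sensor range. Grade, load, and wet roads would push it further and are not modeled here.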
We’ll make a simplifying assumption that stopping distance defines the minimum detection range requirements. A driverless-quality perception system needs perfect recall on other vehicles within the vehicle’s worst-case stopping distance. Passenger vehicles can decelerate up to –8 m/s². Trucks can only achieve around –4 m/s², which increases the stopping distance and puts the sensing range requirement right at the edge of what today’s sensors can deliver. Here are the sight stopping distances for an empty truck in dry conditions on roads of varying grade:4

Speed (mph) | 0% Grade (m) | –3% Grade (m) | –6% Grade (m)
50 | 115–141 | 124–150 | 136–162
70 | 122–178 | 136–162 | 236–305

Sight stopping distances defined as the distance needed to stop assuming a 2.5-second reaction time with no braking, followed by maximum braking. The distance is computed for an empty truck in dry conditions on roads of varying grade. Stopping distance increases in wet weather or when driving downhill with a load (not shown).

Now let’s compare these distances with the capabilities of various sensors: Lidar sensors provide trustworthy 3D data because they take direct measurements based on physical principles. They have a usable range of around 200–250 meters, plenty for city driving but not enough for every truck use case. Lidar detection models may also need to accumulate multiple scans/frames over time to detect faraway objects reliably, especially for smaller items like debris, further decreasing the usable detection range. Note that some solid-state lidars claim significantly more range than 250 meters. These numbers are collected under ideal conditions; for computing minimum sensing capability, we are interested in the range that can provide perfect recall and really great precision. For example, the lidar may be unable to reach its maximum range over the entire field of view, or may require undesirable trade-offs like a scan pattern that reduces point density and field of view to achieve more range. 
Radar can see farther than lidar. For example, this high-end ZF radar claims vehicle detections up to 350 meters away. Radar is great for tracking moving vehicles, but has trouble distinguishing between stationary vehicles and other background objects. Tesla Autopilot has infamously shown this problem by braking for overpasses and running into stalled vehicles. “Imaging” radars like the ZF device will do better than the radars on production vehicles. They still do not have the azimuth resolution to separate objects beyond 200 meters, where radar input is most needed. Cameras can detect faraway objects as long as there are enough pixels on the object, which leads to the selection of cameras with high resolution and a narrow field of view (telephoto lens). A vehicle will carry multiple narrow cameras for full coverage during turns. However, cameras cannot measure distance or speed directly. A combined camera + radar system using machine learning probably has the best chance here, especially with recent advances in ML-based early fusion, but would need to perform well enough to serve as the primary detection source beyond 200 meters. Training such a model is closer to an open problem than simply receiving that data from a lidar. In summary, we don’t appear to have any sensing solutions with the performance needed for trucks to meet the driverless bar. Controls Controlling a passenger vehicle — determining the amount of steering and throttle input to make the vehicle follow a trajectory — is a simpler problem than controlling a truck. For example, passenger vehicles are generally modeled as a single rigid body, while a truck and its trailer can move separately. The planner and controller need to account for this when making sharp turns and, in extreme low-friction conditions, to avoid jackknifing. These features come in addition to all the usual controls challenges that also apply to passenger vehicles. 
They can be built but require additional development and validation time. Freeway-specific challenges OK, so trucks are hard, but what about the freeway part? It may now sound appealing to build L4 freeway autonomy for passenger vehicles. However, driving on freeways also brings additional challenges on top of what is needed for city streets. Achieving the minimal risk condition on freeways Autonomous vehicles are supposed to stop when they detect an internal fault or driving situation that they can’t handle. This is called the minimal risk condition (MRC). For example, an autonomous passenger vehicle that detects an error in the HD map or a sensor failure might be programmed to execute a pullover or stop in lane depending on the problem severity. While MRC behaviors are annoying for other road users and embarrassing for the AV developer, they do not add undue risk on surface streets given the low speeds and already chaotic nature of city driving. This gives the AV developer more breathing room (within reason) to deploy a system that does not know how to handle every driving scenario perfectly, but knows enough to stay out of trouble. It’s a different story on the freeway. Stopping in lane becomes much more dangerous with the possibility of a rear-end collision at high speed. All stopping should be planned well in advance, ideally exiting at the next ramp, or at least driving to the closest shoulder with enough room to park. This greatly increases the scope of edge cases that need to be handled autonomously and at freeway speeds. For example: Scene understanding: If the vehicle encounters an unexpected construction zone, crash site, or other non-nominal driving scenario, it’s not enough to detect and stop. Rerouting, while a viable option on surface streets, usually isn’t an option on freeways because it may be difficult or illegal to make a u-turn by the time the vehicle can see the construction. 
A freeway under construction is also more likely to be the only path to the destination, especially if the autonomous vehicle in question is not designed to drive on city streets. Operational solutions are also not enough for a scaled deployment. AV developers often disallow their vehicles from routing through known problem areas gathered from manually driven scouting vehicles or announcements made by authorities. For a scaled deployment, however, it’s not reasonable to know the status of every mile of road at all times. Therefore, the system needs to find the right path through unstructured scenarios, possibly following instructions from police directing traffic, even if it involves traffic violations such as driving on the wrong side of the road. We know that current state-of-the-art autonomous vehicles still occasionally drive into wet concrete and trenches, which shows it is nontrivial to make a correct decision. Mapping: If the lane lines have been repainted, and the system normally uses an HD map, it needs to ignore the map and build a new one on-the-fly from the perception system’s output. It needs to distinguish between mapping and perception errors. Uptime: Sensor, computer, and software failures need to be virtually eliminated through redundancy and/or engineering elbow grease. The system needs almost perfect uptime. For example, it’s fine to enter a max-braking MRC when losing a sensor or restarting a software module on surface streets, provided those failures are rare. The same maneuver would be dangerous on the freeway, so the failure must be eliminated, or a fallback/redundancy developed. These problems are not impossible to overcome. Every autonomous passenger vehicle has solved them to some extent, with the remaining edge cases punted to some combination of MRC and remote operators. The difference is that, on freeways, they need to be solved with a very high level of reliability to meet the driverless bar. 
Freeways are boring The features that make freeways simpler (controlled access, no intersections, one-way traffic) also make “interesting” events more rare. This is a double-edged sword. While the simpler environment reduces the number of software features to be developed, it also increases the iteration time and cost. During development, “interesting” events are needed to train data-hungry ML models. For validation, each new software version to be qualified for driverless operation needs to encounter a minimum number of “interesting” events before comparisons to a human safety level can have statistical significance. Overall, iteration becomes more expensive when it takes more vehicle-hours to collect each event. AV developers can only respond by increasing the size of their operations teams or accepting more time between software releases. (Note that simulation is not a perfect solution either. The rarity of events increases vehicle-hours run in simulation, and so far, nobody has shown a substitute for real-world miles in the context of driverless software validation.) Is it ever going to happen? Trucking requires longer range sensing and more complex controls, increasing system complexity and pushing the problem to the bleeding edge of current sensing capabilities. At the same time, driving on freeways brings additional reliability requirements, raising the quality bar on every software component from mapping to scene understanding. If both the truck form factor and the freeway domain increase the level of difficulty, then driverless trucking might be the hardest application of autonomous driving:

       | City     | Freeway
Cars   | Baseline | Harder
Trucks | Harder   | Hardest

Now that scaled rideshare is mostly working in cities, I expect to see scaled freeway rideshare next. Does this mean driverless trucking will never happen? No, I still believe AV developers will overcome these challenges eventually. 
Aurora, Kodiak, and Gatik have all promised some form of driverless deployment by the end of the year. We probably won’t see anything close to a million-mile deployment in 2024 though. Getting there will require advances in sensing, machine learning, and a lot of hard work. Thanks to Steven W. and others for the discussions and feedback. This should be considered a bare minimum because humans perform much better on freeways, raising the bar for AVs. Rough numbers taken from Table 3, passenger vehicle national average on surface streets: Scanlon, J. M., Kusano, K. D., Fraade-Blanar, L. A., McMurry, T. L., Chen, Y. H., & Victor, T. (2023). Benchmarks for Retrospective Automated Driving System Crash Rate Analysis Using Police-Reported Crash Data. arXiv preprint arXiv:2312.13228. (blog) ↩ Kalra, N., & Paddock, S. M. (2016). Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94, 182-193. ↩ Favaro, F., Fraade-Blanar, L., Schnelle, S., Victor, T., Peña, M., Engstrom, J., … & Smith, D. (2023). Building a Credible Case for Safety: Waymo’s Approach for the Determination of Absence of Unreasonable Risk. arXiv preprint arXiv:2306.01917. (blog) ↩ Computed from tables 1 and 2: Harwood, D. W., Glauz, W. D., & Mason, J. M. (1989). Stopping sight distance design for large trucks. Transportation Research Record, 1208, 36-46. ↩

a year ago 29 votes
How Cruise vehicles return to the garage autonomously in heavy rain

Cruise doesn’t carry passengers in heavy rain. The operational design domain (ODD) in their CPUC permit (PDF) only allows services in light rain. I’ve always wondered how they implement this operationally. For example, Waymo preemptively launches all cars with operators in the driver’s seat anytime there’s rain in the forecast. Cruise has no such policy: I have never seen them assign operators to customer-facing vehicles. Yet Cruise claims to run up to 100 driverless vehicles concurrently. It would be impractical to dispatch a human driver to each vehicle whenever it starts raining. When the latest atmospheric river hit San Francisco, I knew it was my chance to find out how it worked. Monitoring the Cruise app As the rain intensified, as expected, all cars disappeared from Cruise’s app and the weather pause icon appeared. But then something unusual happened. The app returned to its normal state. A few cars showed up near a hole in the geofence — and they were actually hailable. Visiting the garage I drove over to find that this street is the entrance to one of Cruise’s garages. The same location has been featured in Cruise executives’ past tweets promoting the service.1 Despite the heavy rain and gusts strong enough to blow my hat/jacket off, a steady stream of Cruise vehicles were returning themselves to the garage in driverless mode. Driverless Cruise vehicles enter the garage during heavy rain. A member of Cruise’s operations team enters the vehicle to drive it into the garage. In total, I observed: 8 driverless vehicles 1 manually driven vehicle 1 support vehicle (unmodified Chevy Bolt not capable of autonomous driving) Two vehicles skip the garage After the first six driverless vehicles returned, the next two kept driving past the garage. I followed them in my own car. They drove for about 16 minutes, handling large puddles and road spray without noticeable comfort issues. Eventually they looped back to the garage and successfully entered. 
A Cruise vehicle drives through a puddle during its detour. I’m not totally sure what happened here. I can think of two reasonable explanations: Boring: The cars missed the turn for some unknown reason. Exciting: Cruise has implemented logic to avoid overwhelming the operations team’s ability to put cars back in the garage. If there are too many vehicles waiting to return, subsequent cars take a detour to kill time instead of blocking the driveway. Key take-aways Cruise is capable of handling heavy rain in driverless mode. The majority of Cruise vehicles returned to the garage autonomously. This enables them to handle correlated events, such as rain, without deploying a large operations team. Cruise may have implemented “take a lap around the block” logic to avoid congestion at the garage entrance. I can’t find the timelapse of Cruise launching their driverless cars anymore. I’m pretty sure it was posted to Twitter. Please let me know if you have the link! Update: Link to tweet by @kvogt. ↩

over a year ago 38 votes

More in programming

Single-Use Disposable Applications

As search gets worse and “working code” gets cheaper, apps get easier to make from scratch than to find.

11 hours ago 3 votes
Thoughts on Motivation and My 40-Year Career

I’ve never published an essay quite like this. I’ve written about my life before, reams of stuff actually, because that’s how I process what I think, but never for public consumption. I’ve been pushing myself to write more lately because my co-authors and I have a whole fucking book to write between now and October. […]

6 hours ago 3 votes
Desktop UI frameworks written by a single person

Less known desktop UI frameworks Writing desktop software is hard. The UI technologies of Windows or MacOS are awful compared to web technology. What can trivially be done with HTML/CSS/JavaScript in a few minutes can take hours using Windows’s win32 APIs or Mac’s Cocoa. That’s why the default technology for desktop apps, especially cross-platform, is Electron: a Chrome browser combined with the Node runtime. The problem is that it’s bloaty: each app is a unique build of Chrome with a little bit of application code. Chrome is over 100MB so many apps ship less than 1MB of code in a 100MB wrapper. People tried to address the problem of poor OS APIs by writing UI frameworks, often meant to be cross-platform. You’ve heard about Qt, GTK, wxWindows. The problem with those is that they are also old, their APIs are not the greatest either and they are bloaty as well. There just doesn’t seem to be a good option. Writing your own framework seems impossible due to the size of the task. But is it? I’ll show a couple of less-known UI frameworks written mostly by a single person, often done simply to enable writing an application. SWELL in WDL WDL is interesting. Justin Frankel, the guy who created Winamp, has a repository of C++ code he uses in different projects. After selling Winamp to AOL, a side quest of writing a file sharing application, and getting fired from AOL for writing that file sharing application, he started a company building Reaper, digital audio workstation software for Windows. Winamp is a win32 API program and so is Reaper. At some point Justin decided to make a Mac version but by then he had a lot of code heavily using win32 APIs. So he did what anyone in his position would: he implemented win32 APIs for Mac OS and Linux and called it SWELL - Simple Windows Emulation Layer. Ok, actually no-one else would do it. It was an insane idea but it worked. It’s important to not over-state SWELL’s capabilities. It’s not Wine. 
You can’t take any win32 program and recompile for Mac with SWELL. Frankel is insanely pragmatic and so is his code. SWELL only implements the subset of APIs he uses in Reaper. At the same time Reaper is a big app so if SWELL works for Reaper, it could work for your app. WDL is open-source using permissive MIT license. Sublime Text For a few years Sublime Text was THE programmer’s editor. It was written by a single developer in C++ and he wrote a custom UI toolkit for it. Not open source but its existence shows it can be done. RAD Debugger RAD Debugger is an open-source Windows debugger for C/C++ apps written in C by mostly a single person. It implements a custom UI framework based on a 3D renderer. The UI is an integral part of the app but the code is well structured so you probably can take just their UI / render code and use it in your own C / C++ app. Currently the app / UI is only for Windows but it’s designed to be cross-platform and they are working on porting the renderer to Mac OS / Linux. They use permissive MIT license and everything is written in C. Dear ImGui Dear ImGui is a newer cross-platform UI framework in C++. Open source, permissive MIT license. Written by mostly a single person. Ghostty Ghostty is a cross-platform terminal emulator and UI. It’s written in Zig by mostly a single person and uses its own low-level GPU renderer for the UI. You too can write your own UI framework At first the idea of writing your own UI framework seems impossibly daunting. What I’m hoping to show is that if you’re ambitious enough it’s possible to build cross platform desktop apps that are not just bloated 100MB Chrome wrappers around a few kilobytes of custom code. I’m not saying it’s a simple thing, just that enough people did it that it’s possible. It shouldn’t be necessary but both Microsoft and Apple have tragically dropped the ball on providing decent, high-performance UI libraries for their OS. 
Microsoft even writes their own apps, like Teams, in web technologies. Thanks to open source you’re not at the starting line. You can just use Dear ImGui or WDL’s SWELL. Or you can extract the UI code from RAD Debugger or Ghostty (if you write in Zig). Or you can look at their implementations to speed up your own design and implementation.

yesterday 2 votes
Logic for Programmers Turns One

I released Logic for Programmers exactly one year ago today. It feels weird to celebrate the anniversary of something that isn't 1.0 yet, but software projects have a proud tradition of celebrating a dozen anniversaries before 1.0. I wanted to share about what's changed in the past year and the work for the next six+ months. The Road to 0.1 I had been noodling on the idea of a logic book since the pandemic. The first time I wrote about it on the newsletter was in 2021! Then I said that it would be done by June and would be "under 50 pages". The idea was to cover logic as a "soft skill" that helped you think about things like requirements and stuff. That version sucked. If you want to see how much it sucked, I put it up on Patreon. Then I slept on the next draft for three years. Then in 2024 a lot of business fell through and I had a lot of free time, so with the help of Saul Pwanson I rewrote the book. This time I emphasized breadth over depth, trying to cover a lot more techniques. I also decided to self-publish it instead of pitching it to a publisher. Not going the traditional route would mean I would be responsible for paying for editing, advertising, graphic design etc, but I hoped that would be compensated by much higher royalties. It also meant I could release the book in early access and use early sales to fund further improvements. So I wrote up a draft in Sphinx, compiled it to LaTeX, and uploaded the PDF to leanpub. That was in June 2024. Since then I kept to a monthly cadence of updates, missing once in November (short-notice contract) and once last month (Systems Distributed). The book's now on v0.10. What's changed? A LOT v0.1 was very obviously an alpha, and I have made a lot of improvements since then. For one, the book no longer looks like a Sphinx manual. Compare! Also, the content is very, very different. 
v0.1 was 19,000 words, v0.10 is 31,000.1 This comes from new chapters on TLA+, constraint/SMT solving, logic programming, and major expansions to the existing chapters. Originally, "Simplifying Conditionals" was 600 words. Six hundred words! It almost fit in two pages! The chapter is now 2600 words, now covering condition lifting, quantifier manipulation, helper predicates, and set optimizations. All the other chapters have either gotten similar facelifts or are scheduled to get facelifts. The last big change is the addition of book assets. Originally you had to manually copy over all of the code to try it out, which is a problem when there are samples in eight distinct languages! Now there are ready-to-go examples for each chapter, with instructions on how to set up each programming environment. This is also nice because it gives me breaks from writing to code instead. How did the book do? Leanpub's all-time visualizations are terrible, so I'll just give the summary: 1180 copies sold, $18,241 in royalties. That's a lot of money for something that isn't fully out yet! By comparison, Practical TLA+ has made me less than half of that, despite selling over 5x as many books. Self-publishing was the right choice! In that time I've paid about $400 for the book cover (worth it) and maybe $800 in Leanpub's advertising service (probably not worth it). Right now that doesn't come close to making back the time investment, but I think it can get there post-release. I believe there are a lot more potential customers via marketing. I think post-release 10k copies sold is within reach. Where is the book going? The main content work is rewrites: many of the chapters have not meaningfully changed since v0.1, so I am going through and rewriting them from scratch. So far four of the ten chapters have been rewritten. My (admittedly ambitious) goal is to rewrite three of them by the end of this month and another three by the end of next. 
I also want to do final passes on the rewritten chapters, as most of them have a few TODOs left lying around. (Also somehow in starting this newsletter and publishing it I realized that one of the chapters might be better split into two chapters, so there could well be a tenth technique in v0.11 or v0.12!) After that, I will pass it to a copy editor while I work on improving the layout, making images, and indexing. I want to have something worthy of printing on a dead tree by 1.0. In terms of timelines, I am very roughly estimating something like this: Summer: final big changes and rewrites Early Autumn: graphic design and copy editing Late Autumn: proofing, figuring out printing stuff Winter: final ebook and initial print releases of 1.0. (If you know a service that helps get self-published books "past the finish line", I'd love to hear about it! Preferably something that works for a fee, not part of royalties.) This timeline may be disrupted by official client work, like a new TLA+ contract or a conference invitation. Needless to say, I am incredibly excited to complete this book and share the final version with you all. This is a book I wished for years ago, a book I wrote because nobody else would. It fills a critical gap in software educational material, and someday soon I'll be able to put a copy on my bookshelf. It's exhilarating and terrifying and above all, satisfying. It's also 150 pages vs 50 pages, but admittedly this is partially because I made the book smaller with a larger font. ↩

2 days ago 4 votes
Implementing UI translation in SumatraPDF, a C++ Windows application

Translating the user interface of SumatraPDF

SumatraPDF is the best PDF/eBook/Comic Book viewer for Windows. It’s small, fast, full of features, free and open-source. It became popular enough that it made sense to translate the UI for non-English users. Currently we support 72 languages.

This article describes how I designed and implemented a translation system in SumatraPDF, a native win32 C++ Windows application.

Hard things about translating the UI

There are 2 hard things about translating an application: writing the code for the translation system (extracting the strings to translate, then mapping strings from English to the user’s language), and translating the strings into many languages.

Extracting strings to translate from source code

Currently there are 381 strings in SumatraPDF subject to translation. It’s important that the system requires the least amount of effort when adding new strings to translate.

Every string that needs to be translated is marked in a .cpp or .h file with one of two macros: _TRA("Rename") or _TRN("Open").

I have a script that extracts those strings from source files. Mine is written in Go but it could just as well be Python or JavaScript. It’s a simple regex job.

_TR stands for “translation”. _TRA(s) expands into a call to the const char* trans::GetTranslation(const char* str) function, which returns str translated to the current UI language. We auto-detect the language at startup based on Windows settings and allow the user to explicitly set the UI language. For English we just return the original string.

If a string to be translated is e.g. a part of a const char* array[], we can’t use trans::GetTranslation(). For cases like that we have _TRN(), which expands to the English string. We then have to translate it explicitly later, where the string is actually used.

Adding new strings is therefore as simple as wrapping them in _TRA() or _TRN() macros.

Translating strings into many languages

Now that we’ve extracted strings to be translated, we need to translate them into 72 languages. SumatraPDF is a free, open-source program.
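As an aside on the extraction step: the “simple regex job” could look roughly like the sketch below. This is not the author’s Go script. It is a C++ illustration of the same idea, and the function name ExtractTranslatableStrings and the exact regex are assumptions. It only handles a single string literal on one line inside _TRA(...) or _TRN(...).

```cpp
#include <regex>
#include <set>
#include <string>

// Collects the string literals wrapped in _TRA("...") or _TRN("...").
// Assumptions (not from the article): the literal sits on one line and is not
// built by concatenation. Escapes such as \" are kept exactly as written.
std::set<std::string> ExtractTranslatableStrings(const std::string& source) {
    // _TRA or _TRN, an opening paren, then a quoted literal that may contain
    // backslash escapes, then the closing paren.
    static const std::regex re(R"rx(\b_TR[AN]\(\s*"((?:[^"\\]|\\.)*)"\s*\))rx");
    std::set<std::string> found;  // a set removes duplicates and sorts the result
    for (std::sregex_iterator it(source.begin(), source.end(), re), end; it != end; ++it) {
        found.insert((*it)[1].str());
    }
    return found;
}
```

Running this over every .cpp and .h file and merging the sets would give the full list of strings to submit for translation.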
I don’t have a budget to hire translators. I don’t have a budget, period. The only option was to get help from SumatraPDF users.

It was vital to make it very easy for users to send me translations. I didn’t want to ask them, for example, to download some translation software.

Design and implementation of AppTranslator web app

I couldn’t find really simple software for crowd-sourcing translations so I wrote my own: https://github.com/kjk/apptranslator

You can see it in action: https://www.apptranslator.org/app/SumatraPDF

I designed it to be generic but I don’t think anyone else is using it.

AppTranslator is simple. Per https://tools.arslexis.io/wc/ it has 4k lines of Go server code, 451 lines of HTML code, and a single dependency: the Bootstrap CSS framework (the project is old).

It’s simple because I don’t want to spend a lot of time writing translation software. It’s just a side project in service of the goal of translating SumatraPDF.

Login is exclusively via GitHub.

It doesn’t even use a database. Like in Redis, changes are stored as a series of operations in an append-only log. We keep the whole state in memory and re-create it from the log at startup.

The main operation is “translate a string from English to language X”, represented as [kOpTranslation, english string, language, translation, user who provided translation].

When a user provides a translation in the web UI, we send an API call to the server, which appends the translation operation to the log.

Simple and reliable.

Because the code is written in Go, it’s very fast and memory efficient. When running it uses mere megabytes of RAM. It can comfortably run on the smallest 256 MB VPS.

I back up the log to S3 so if the server ever fails, I can re-install the program on a new server and re-download the translations from S3.

I provide an RSS feed for each language so that people who provide translations can monitor for new strings to be translated.
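The append-only log design described above can be sketched in a few lines. This is an illustration in C++, not AppTranslator’s Go code. The record fields follow the article, but the tab-separated line encoding and the names TranslationOp, AppendOp, State and Replay are assumptions made for the example.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// One log record: user `user` translated `english` into language `lang`.
// The field list follows the article; the on-disk encoding is an assumption.
struct TranslationOp {
    std::string english, lang, translation, user;
};

// Appends one operation as a tab-separated line. Existing records are never
// rewritten. Assumes no field contains a tab or a newline.
void AppendOp(std::ostream& log, const TranslationOp& op) {
    log << op.english << '\t' << op.lang << '\t' << op.translation << '\t'
        << op.user << '\n';
}

// In-memory state: (english string, language) -> latest translation.
using State = std::map<std::pair<std::string, std::string>, std::string>;

// Rebuilds the state at startup by replaying the log from the beginning.
// A later record for the same (english, lang) overrides an earlier one.
// Lines that do not have four non-empty trailing fields are skipped.
State Replay(std::istream& log) {
    State state;
    std::string line;
    while (std::getline(log, line)) {
        std::istringstream fields(line);
        TranslationOp op;
        if (std::getline(fields, op.english, '\t') &&
            std::getline(fields, op.lang, '\t') &&
            std::getline(fields, op.translation, '\t') &&
            std::getline(fields, op.user)) {
            state[{op.english, op.lang}] = op.translation;
        }
    }
    return state;
}
```

Because the log is the only durable data, backing it up is enough to restore the whole service: replaying the same log always produces the same state.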
Sending strings for translation and receiving translations

So I have a web app for collecting translations and a script that extracts strings to be translated from source code. How do they connect?

AppTranslator has an API for submitting the current set of strings to be translated in the simplest possible format: a line for each string (I ensure there are no newlines in the string itself by escaping them with \n).

The API is password protected because only I can submit the strings.

The server compares the strings sent with the current set and records the difference in the log. It also sends a response with translations. Again, the simplest possible format:

AppTranslator: SumatraPDF
651b739d7fa110911f25563c933f42b1d37590f8
:%s annotation. Ctrl+click to edit.
am:%s մեկնաբանություն: Ctrl+քլիք՝ խմբագրելու համար:
ar:ملاحظة %s. اضغط Ctrl للتحرير.
az:Qeyd %s. Düzəliş etmək üçün Ctrl+düyməyə basın.

As you can see, a string to translate is on a line starting with : and is followed by translations of that string, one per line, in the format ${lang}:${translation}.

An optimization: 651b739d7fa110911f25563c933f42b1d37590f8 is a hash of this response. If I submit this hash with my request and the translations didn’t change on the server, the response is empty.

Implementing the C++ part of the translation system

So now I have a text file with translations downloaded from the server. How do I get a translation in my C++ code?

As with everything in SumatraPDF, I try to do things in a simple and efficient way. The whole Translation.cpp is only 239 lines of code.

The core of the translation system is the const char* trans::GetTranslation(const char* s); function.

I embed the translations, in exactly the same format as received from AppTranslator, in the executable as a data file in resources.

If the UI language is English, we do nothing: trans::GetTranslation() returns its argument.
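The response format shown above is simple enough to parse in a few lines. The sketch below is not the code from Translation.cpp. The name ParseTranslations is hypothetical, and it assumes that the first two lines are the header and the hash and that every later line is either :english or lang:translation.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Pairs of (English string, translation) for a single language.
using Translations = std::vector<std::pair<std::string, std::string>>;

// Extracts the translations for one language code from the text format.
// Assumptions: line 1 is the "AppTranslator: ..." header, line 2 is the hash,
// and newlines inside strings stay escaped as \n (they are not unescaped here).
Translations ParseTranslations(const std::string& text, const std::string& lang) {
    Translations out;
    std::istringstream in(text);
    std::string line, english;
    std::getline(in, line);  // skip the header line
    std::getline(in, line);  // skip the hash line
    const std::string prefix = lang + ":";
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty()) continue;
        if (line[0] == ':') {
            english = line.substr(1);  // a new English string starts an entry
        } else if (line.compare(0, prefix.size(), prefix) == 0) {
            out.emplace_back(english, line.substr(prefix.size()));
        }
        // Lines for other languages are ignored.
    }
    return out;
}
```

A string with no line for the requested language simply produces no pair, so a caller can fall back to the English text for it.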
When we switch the language, we load the translations from resources and build an index: an array of English strings and an array of corresponding translations. Both arrays use my own StrVec class, optimized for storing an array of strings.

To find a translation we scan the first array to find the index of the string and return the translation from the second array, at the same index.

A linear scan seems like it would be slow but it isn’t.

Resizing dialogs

I have a few dialogs defined in the SumatraPDF.rc file. The problem with dialogs is that the position of UI elements is fixed. A translated string will almost certainly have a different size than the English string, which will mess up a fixed layout. Thankfully someone wrote DialogSizer, which smartly resizes dialogs and solves this problem.

The evolution of a solution

No AppTranslator

My initial implementation was simpler. I didn’t yet have AppTranslator so I stored the strings in a text file in the repository, in the same format as what I described above. People would download it, make changes using a text editor and send me the file via email, which I would then check in. It worked for a while but it became worse over time. More strings and more languages created more work for me to manually manage e-mail submissions. I decided to automate the process.

Code generation

My first implementation of the C++ side used code generation instead of embedding the text file in resources. My Go script would generate C++ source code files with static const char* [] arrays. This worked well but I decided to improve it further by making the code use the text file with translations embedded in the app. The main motivation for the change was to open up the possibility of downloading the latest translations from the server, to fix the problem of translations not all being ready when I build the release executable.
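For illustration, the generated code from that earlier approach might have had roughly the shape below. This is a guess at the output, not the author’s generated file: the names gEnglish, gGerman and LookupGerman are made up and the two sample strings are illustrative, not copied from SumatraPDF. It uses the same lookup idea as the current system, two parallel arrays searched with a linear scan.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical generated data: English strings and, at the same index,
// their translation for one language.
static const char* const gEnglish[] = {"Open", "Rename"};
static const char* const gGerman[] = {"Öffnen", "Umbenennen"};
static_assert(sizeof(gEnglish) == sizeof(gGerman),
              "both arrays must have the same number of entries");
static const std::size_t gStringCount = sizeof(gEnglish) / sizeof(gEnglish[0]);

// Finds s in the English array by linear scan and returns the translation
// stored at the same index. Falls back to s itself when there is no entry.
const char* LookupGerman(const char* s) {
    for (std::size_t i = 0; i < gStringCount; i++) {
        if (std::strcmp(gEnglish[i], s) == 0) {
            return gGerman[i];
        }
    }
    return s;
}
```

With the 381 strings mentioned earlier, the worst case is 381 short string comparisons per lookup, which is cheap for UI text.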
I haven’t done that yet but it’s now easier to implement, given that the format of strings embedded in the exe is the same as the one I can download from AppTranslator.

Only utf-8

SumatraPDF started by using both WCHAR* Unicode strings and char* utf8 strings. For that reason the translation system had to support returning translations in both WCHAR* and char* versions. Over time I refactored the code to use mostly utf8 and at some point I no longer needed to support the WCHAR* version. That made the code even smaller and reduced memory usage.

The experience

I’m happy with how things turned out. AppTranslator proved to be reliable and hassle free. It has been running for many years now and has collected 35440 string translations from users.

I automated everything so that all I need to do is periodically re-run the script that extracts strings from source code, uploads them to AppTranslator and downloads the latest translations.

One problem is that translations are not always ready in time for a release, so I make a release and then people start translating strings added since the last release. I’ve considered downloading the latest translations from the server, in addition to embedding them in the executable at the time of building the app.

Would I do the same today?

While AppTranslator is reliable and doesn’t require on-going work, it would be better to not have to run a server at all.

The world has changed since I started SumatraPDF. Namely: people are comfortable using GitHub and you can edit files directly in the GitHub UI. It’s not a great experience but it works.

One option would be to generate a translation text file for each language, in this format:

:first untranslated string
:second untranslated string
:first translated string
translation of first string
:second translated string
translation of second string

Untranslated strings are listed at the top, to make them easier to find.

A link would send a translator directly to edit this file in the GitHub UI.
When a translator saves translations, it creates a PR for me to review and merge.

The roads not taken

But why did you re-invent everything? You should do X instead. All other X that I know about suck.

Using per-language .rc resource files

The traditional way of localizing / translating Windows GUI apps is to store all strings and dialog definitions in an .rc file. Each language gets its own .rc file (or files) and the program picks the right resource based on the language.

This doesn’t solve the 2 hard problems: having an easy way to add strings for translation, and having an easy way for users to provide translations.

XML horror show

There was a dark time when the world was under the iron grip of XML fanaticism. Everything had to be an XML file even when it was the worst possible solution for the problem. XML doesn’t solve the 2 hard problems and, as a string storage format, it is an absolute nightmare for human editing.

GNU gettext

There’s a C library, gettext, that uses .po files. This is a much saner solution than the XML horror show. .po files are a relatively simple text format. The code is already written.

Warning: tooting my own horn. My format is better. It’s easier for people to edit and it’s easier to write code to parse it.

The gettext code looks like many times more than 239 lines of code. Ok, gettext probably does a bit more than my code, but clearly nothing that I need.

It also doesn’t solve the 2 hard problems. I would still have to write code to extract strings from source code and build a way to allow users to translate them easily.
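Returning to the GitHub-based alternative from “Would I do the same today?”: generating the per-language file described there would take very little code. The sketch below is hypothetical, since the article says this option has not been built. The name BuildLanguageFile and the convention that an empty translation means “untranslated” are assumptions.

```cpp
#include <string>
#include <utility>
#include <vector>

// Builds the per-language text file: untranslated strings first, then each
// translated string followed by its translation on the next line.
// Input: (English string, translation) pairs; an empty translation marks the
// string as untranslated (an assumption made for this sketch).
std::string BuildLanguageFile(
    const std::vector<std::pair<std::string, std::string>>& entries) {
    std::string untranslated, translated;
    for (const auto& e : entries) {
        if (e.second.empty()) {
            untranslated += ":" + e.first + "\n";
        } else {
            translated += ":" + e.first + "\n" + e.second + "\n";
        }
    }
    // Untranslated strings go at the top so translators find them quickly.
    return untranslated + translated;
}
```

One limitation of this layout: a translation that itself starts with : would be read back as a new English string, so a real implementation would need an escaping rule for that case.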

2 days ago 3 votes