the penultimate conditional syntax

from Tony Finch's blog in programming

About half a year ago I encountered a paper bombastically titled “the ultimate conditional syntax”. It has the attractive goal of unifying pattern match with boolean if tests, and its solution is in some ways very nice. But it seems over-complicated to me, especially for something that’s a basic work-horse of programming. I couldn’t immediately see how to cut it down to manageable proportions, but recently I had an idea. I’ll outline it under the “penultimate conditionals” heading below, after reviewing the UCS and explaining my motivation.

- what the UCS?
- whence UCS
- out of scope
- penultimate conditionals
- dangling syntax
- examples
- antepenultimate breath

what the UCS?

The ultimate conditional syntax does several things which are somewhat intertwined and support each other. An “expression is pattern” operator allows you to do pattern matching inside boolean expressions. Like “match” but unlike most other expressions, “is” binds variables whose scope is the rest of the boolean expression...

a month ago


More from Tony Finch's blog

clamp / median / range

Here are a few tangentially-related ideas vaguely near the theme of comparison operators.

- comparison style
- clamp style
- clamp is median
- clamp in range
- range style
- style clash?

comparison style

Some languages such as BCPL, Icon, Python have chained comparison operators, like

    if min <= x <= max:
        ...

In languages without chained comparison, I like to write comparisons as if they were chained, like,

    if min <= x && x <= max {
        // ...
    }

A rule of thumb is to prefer less than (or equal) operators and avoid greater than. In a sequence of comparisons, order values from (expected) least to greatest.

clamp style

The clamp() function ensures a value is between some min and max,

    def clamp(min, x, max):
        if x < min: return min
        if max < x: return max
        return x

I like to order its arguments matching the expected order of the values, following my rule of thumb for comparisons. (I used that flavour of clamp() in my article about GCRA.) But I seem to be unusual in this preference, based on a few examples I have seen recently.

clamp is median

Last month, Fabian Giesen pointed out a way to resolve this difference of opinion: A function that returns the median of three values is equivalent to a clamp() function that doesn’t care about the order of its arguments.

This version is written so that it returns NaN if any of its arguments is NaN. (When an argument is NaN, both of its comparisons will be false.)

    fn med3(a: f64, b: f64, c: f64) -> f64 {
        match (a <= b, b <= c, c <= a) {
            (false, false, false) => f64::NAN,
            (false, false, true) => b, // a > b > c
            (false, true, false) => a, // c > a > b
            (false, true, true) => c, // b <= c <= a
            (true, false, false) => c, // b > c > a
            (true, false, true) => a, // c <= a <= b
            (true, true, false) => b, // a <= b <= c
            (true, true, true) => b, // a == b == c
        }
    }

When two of its arguments are constant, med3() should compile to the same code as a simple clamp(); but med3()’s misuse-resistance comes at a small cost when the arguments are not known at compile time.

clamp in range

If your language has proper range types, there is a nicer way to make clamp() resistant to misuse:

    fn clamp(x: f64, r: RangeInclusive<f64>) -> f64 {
        let (&min,&max) = (r.start(), r.end());
        if x < min { return min }
        if max < x { return max }
        return x;
    }

    let x = clamp(x, MIN..=MAX);

range style

For a long time I have been fond of the idea of a simple counting for loop that matches the syntax of chained comparisons, like

    for min <= x <= max:
        ...

By itself this is silly: too cute and too ad-hoc.

I’m also dissatisfied with the range or slice syntax in basically every programming language I’ve seen. I thought it might be nice if the cute comparison and iteration syntaxes were aspects of a more generally useful range syntax, but I couldn’t make it work.

Until recently when I realised I could make use of prefix or mixfix syntax, instead of confining myself to infix. So now my fantasy pet range syntax looks like

    >= min < max   // half-open
    >= min <= max  // inclusive

And you might use it in a pattern match

    if x is >= min < max {
        // ...
    }

Or as an iterator

    for x in >= min < max {
        // ...
    }

Or to take a slice

    xs[>= min < max]

style clash?

It’s kind of ironic that these range examples don’t follow the left-to-right, lesser-to-greater rule of thumb that this post started off with. (x is not lexically between min and max!)

But that rule of thumb is really intended for languages such as C that don’t have ranges. Careful stylistic conventions can help to avoid mistakes in nontrivial conditional expressions. It’s much better if language and library features reduce the need for nontrivial conditions and catch mistakes automatically.
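
To make the "clamp is median" point concrete, here is a minimal Python sketch that ports the comparison table from med3() and checks it against the ordered clamp(). The Python port and the names lo and hi are illustrative assumptions; the article's own code is the Rust version.

```python
import math

def clamp(lo, x, hi):
    # Ordered clamp: arguments follow the expected order of the values.
    if x < lo:
        return lo
    if hi < x:
        return hi
    return x

def med3(a, b, c):
    # Median of three, ported from the Rust match table.
    # Any comparison involving NaN is False, so a NaN argument
    # always selects a row that returns NaN.
    table = {
        (False, False, False): math.nan,
        (False, False, True): b,   # a > b > c
        (False, True, False): a,   # c > a > b
        (False, True, True): c,    # b <= c <= a
        (True, False, False): c,   # b > c > a
        (True, False, True): a,    # c <= a <= b
        (True, True, False): b,    # a <= b <= c
        (True, True, True): b,     # a == b == c
    }
    return table[(a <= b, b <= c, c <= a)]
```

With this sketch, med3(0, 20, 10), med3(20, 0, 10) and med3(10, 20, 0) all give the same result as clamp(0, 20, 10), which is the misuse-resistance described in the text: the caller does not need to remember which argument position is the value and which are the bounds.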

a week ago • 11 votes

Golang and Let's Encrypt: a free software story

Here’s a story from nearly 10 years ago. the bug I think it was my friend Richard Kettlewell who told me about a bug he encountered with Let’s Encrypt in its early days in autumn 2015: it was failing to validate mail domains correctly. the context At the time I had previously been responsible for Cambridge University’s email anti-spam system for about 10 years, and in 2014 I had been given responsibility for Cambridge University’s DNS. So I knew how Let’s Encrypt should validate mail domains. Let’s Encrypt was about one year old. Unusually, the code that runs their operations, Boulder, is free software and open to external contributors. Boulder is written in Golang, and I had not previously written any code in Golang. But its reputation is to be easy to get to grips with. So, in principle, the bug was straightforward for me to fix. How difficult would it be as a Golang newbie? And what would Let’s Encrypt’s contribution process be like? the hack I cloned the Boulder repository and had a look around the code. As is pretty typical, there are a couple of stages to fixing a bug in an unfamiliar codebase: work out where the problem is try to understand if the obvious fix could be better In this case, I remember discovering a relatively substantial TODO item that intersected with the bug. I can’t remember the details, but I think there were wider issues with DNS lookups in Boulder. I decided it made sense to fix the immediate problem without getting involved in things that would require discussion with Let’s Encrypt staff. I faffed around with the code and pushed something that looked like it might work. A fun thing about this hack is that I never got a working Boulder test setup on my workstation (or even Golang, I think!) – I just relied on the Let’s Encrypt cloud test setup. The feedback time was very slow, but it was tolerable for a simple one-off change. the fix My pull request was small, +48-14. 
After a couple of rounds of review and within a few days, it was merged and put into production! A pleasing result. the upshot I thought Golang (at least as it was used in the Boulder codebase) was as easy to get to grips with as promised. I did not touch it again until several years later, because there was no need to, but it seemed fine. I was very impressed by the Let’s Encrypt continuous integration and automated testing setup, and by their low-friction workflow for external contributors. One of my fastest drive-by patches to get into worldwide production. My fix was always going to be temporary, and all trace of it was overwritten years ago. It’s good when “temporary” turns out to be true! the point I was reminded of this story in the pub this evening, and I thought it was worth writing down. It demonstrated to me that Let’s Encrypt really were doing all the good stuff they said they were doing. So thank you to Let’s Encrypt for providing an exemplary service and for giving me a happy little anecdote.

2 weeks ago • 13 votes

performance of random floats

A couple of years ago I wrote about random floating point numbers. In that article I was mainly concerned about how neat the code is, and I didn’t pay attention to its performance.

Recently, a comment from Oliver Hunt and a blog post from Alisa Sireneva prompted me to wonder if I made an unwarranted assumption. So I wrote a little benchmark, which you can find in pcg-dxsm.git.

As a brief recap, there are two basic ways to convert a random integer to a floating point number between 0.0 and 1.0:

- Use bit fiddling to construct an integer whose format matches a float between 1.0 and 2.0; this is the same span as the result but with a simpler exponent. Bitcast the integer to a float and subtract 1.0 to get the result.
- Shift the integer down to the same range as the mantissa, convert to float, then multiply by a scaling factor that reduces it to the desired range. This produces one more bit of randomness than the bithacking conversion.

(There are other less basic ways.)

My benchmark has 2 x 2 x 2 tests:

- bithacking vs multiplying
- 32 bit vs 64 bit
- sequential integers vs random integers

Each operation is isolated from the benchmark loop by putting it in a separate translation unit (to prevent the compiler from inlining) and there is a fence instruction (ISB SY on ARM, MFENCE on AMD) in the loop to stop the CPU from overlapping successive iterations.

I ran the benchmark on my Apple M1 Pro and my AMD Ryzen 7950X.

In the table below, the leftmost column is the number of random bits. The top half measures sequential numbers, the bottom half is random numbers. The times are nanoseconds per operation, which includes the overheads of the benchmark loop and function call.

    bits    arm      amd
    (sequential)
    23      12.15    11.22
    24      13.37    11.21
    52      12.11    11.02
    53      13.38    11.20
    (random)
    23      14.75    12.62
    24      15.85    12.81
    52      16.78    14.23
    53      18.02    14.41

The times vary a little from run to run but the difference in speed of the various loops is reasonably consistent.

I think my conclusion is that the bithacking conversion is about 1ns faster than the multiply conversion on my ARM box. There’s a subnanosecond difference on my AMD box which might indicate that the conversion takes different amounts of time depending on the value? Dunno.
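
For readers who want to see the two conversions from the recap side by side, here is a minimal Python sketch of the 64-bit variants. It is not the benchmark code from pcg-dxsm.git; the function names are made up for illustration, and the shift amounts assume the usual IEEE 754 double layout (52 explicit mantissa bits).

```python
import struct

ONE_BITS = 0x3FF0000000000000  # bit pattern of the double 1.0

def float_bithack(u64):
    # Keep the top 52 random bits as the mantissa of a double in [1.0, 2.0),
    # bitcast to a float, then subtract 1.0 to land in [0.0, 1.0).
    bits = (u64 >> 12) | ONE_BITS
    (x,) = struct.unpack("<d", struct.pack("<Q", bits))
    return x - 1.0

def float_multiply(u64):
    # Keep the top 53 random bits, convert to float (exact, since the
    # value fits in 53 bits), then scale down by 2**-53.
    return (u64 >> 11) * 2.0**-53
```

The largest results differ by one bit, matching the "one more bit of randomness" remark: float_bithack(2**64 - 1) is 1 - 2**-52, while float_multiply(2**64 - 1) is 1 - 2**-53.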

a month ago • 20 votes

moka pot notes

In hot weather I like to drink my coffee in an iced latte. To make it, I have a very large Bialetti Moka Express. Recently when I got it going again after a winter of disuse, it took me a couple of attempts to get the technique right, so here are some notes as a reminder to my future self next year.

It’s worth noting that I’m not fussy about my coffee: I usually drink pre-ground beans from the supermarket, with cream (in winter hot coffee) or milk and ice.

basic principle

When I was getting the hang of my moka pot, I learned from YouTube coffee geeks such as James Hoffmann that the main aim is for the water to be pushed through the coffee smoothly and gently. Better to err on the side of too little flow than too much.

I have not had much success trying to make fine temperature adjustments while the coffee is brewing, because the big moka pot has a lot of thermal inertia: it takes a long time for any change in gas level to have any effect on the coffee flow.

routine

- fill the kettle and turn it on
- put the moka pot’s basket in a mug to keep it stable
- fill it with coffee (mine needs about 4 Aeropress scoops)
- tamp it down firmly [1]
- when the kettle has boiled, fill the base of the pot to just below the pressure valve (which is also just below the filter screen in the basket)
- insert the coffee basket, making sure there are no stray grounds around the edge where the seal will mate
- screw on the upper chamber firmly
- put it on a small gas ring turned up to the max [2]
- leave the lid open and wait for the coffee to emerge
- immediately turn the gas down to the minimum [3]
- the coffee should now come out in a steady thin stream without spluttering or stalling
- when the upper chamber is filled near the mouths of the central spout, it’ll start fizzing or spluttering [4]
- turn off the gas and pour the coffee into a carafe

notes

If I don’t tamp the grounds, the pot tends to splutter. I guess tamping gives the puck better integrity to resist channelling, and to keep the water under even pressure. Might be an effect of the relatively coarse supermarket grind?

It takes a long time to get the pot back up to boiling point and I’m not sure that heating it up slower helps.

The main risk, I think, is overshooting the ideal steady brewing state too much, but:

- With my moka pot on my hob the lowest gas flow on the smallest rings is just enough to keep the coffee flowing without stalling.
- The flow when the coffee first emerges is relatively fast, and it slows to the steady state several seconds after I turn the heat down, so I think the overshoot isn’t too bad.

This routine turns almost all of the water into coffee, which Hoffmann suggests is a good result, and a sign that the pressure and temperature aren’t getting too high.

a month ago • 15 votes

the algebra of dependent types

TIL (or this week-ish I learned) why big-sigma and big-pi turn up in the notation of dependent type theory.

I’ve long been aware of the zoo of more obscure Greek letters that turn up in papers about type system features of functional programming languages, μ, Λ, Π, Σ. Their meaning is usually clear from context but the reason for the choice of notation is usually not explained.

I recently stumbled on an explanation for Π (dependent functions) and Σ (dependent pairs) which turns out to be nicer than I expected, and closely related to every-day algebraic data types.

sizes of types

The easiest way to understand algebraic data types is by counting the inhabitants of a type. For example:

- the unit type () has one inhabitant, (), and the number 1 is why it’s called the unit type;
- the bool type has two inhabitants, false and true.

I have even seen these types called 1 and 2 (cruelly, without explanation) in occasional papers.

product types

Or pairs or (more generally) tuples or records. Usually written,

    (A, B)

The pair contains an A and a B, so the number of possible values is the number of possible A values multiplied by the number of possible B values. So it is spelled in type theory (and in Standard ML) like,

    A * B

sum types

Or disjoint union, or variant record. Declared in Haskell like,

    data Either a b = Left a | Right b

Or in Rust like,

    enum Either<A, B> {
        Left(A),
        Right(B),
    }

A value of the type is either an A or a B, so the number of possible values is the number of A values plus the number of B values. So it is spelled in type theory like,

    A + B

dependent pairs

In a dependent pair, the type of the second element depends on the value of the first. The classic example is a slice, roughly,

    struct IntSlice {
        len: usize,
        elem: &[i64; len],
    }

(This might look a bit circular, but the idea is that an array [i64; N] must be told how big it is, since its size is an explicit part of its type, but an IntSlice knows its own size. The traditional dependent “vector” type is a sized linked list, more like my array type than my slice type.)

The classic way to write a dependent pair in type theory is like,

    Σ len: usize . Array(Int, len)

The big sigma binds a variable that has a type annotation, with a scope covering the expression after the dot, similar syntax to a typed lambda expression.

We can expand a simple example like this into a many-armed sum type: either an array of length zero, or an array of length 1, or an array of length 2, … but in a sigma type the discriminant is user-defined instead of hidden.

The number of possible values of the type comes from adding up all the alternatives, a summation just like the big sigma summation we were taught in school.

    $\sum_{a \in A} B_a$

When the second element doesn’t depend on the first element, we can count the inhabitants like,

    $\sum_{a \in A} B = A \times B$

And the sigma type simplifies to a product type.

telescopes

An aside from the main topic of these notes, I also recently encountered the name “telescope” for a multi-part dependent tuple or record.

The name “telescope” comes from de Bruijn’s AUTOMATH, one of the first computerized proof assistants. (I first encountered de Bruijn as the inventor of numbered lambda bindings.)

dependent functions

The return type of a dependent function can vary according to the argument it is passed. For example, to construct an array we might write something like,

    fn repeat_zero(len: usize) -> [i64; len] {
        [0; len]
    }

The classic way to write the type of repeat_zero() is very similar to the IntSlice dependent pair, but with a big pi instead of a big sigma:

    Π len: usize . Array(Int, len)

Mmm, pie.

To count the number of possible (pure, total) functions A ➞ B, we can think of each function as a big lookup table with A entries each containing a B. That is, a big tuple (B, B, … B), that is, B * B * … * B, that is, $B^A$. Functions are exponential types.

We can count a dependent function, where the number of possible Bs depends on which A we are passed,

    $\prod_{a \in A} B_a$

danger

I have avoided the terms “dependent sum” and “dependent product”, because they seem perfectly designed to cause confusion over whether I am talking about variants, records, or functions.

It kind of makes me want to avoid algebraic data type jargon, except that there isn’t a good alternative for “sum type”. Hmf.
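
The counting argument can be checked by brute force on small finite types. The following Python sketch is only an illustration under assumptions: types are modelled as finite collections of values, B is a function from a value of A to the collection of allowed second components, and the helper names are invented here.

```python
from itertools import product

def dependent_pairs(A, B):
    # Sigma type: every (a, b) where b is drawn from the type B(a).
    return [(a, b) for a in A for b in B(a)]

def dependent_functions(A, B):
    # Pi type: every lookup table that maps each a to some b in B(a).
    keys = list(A)
    return [dict(zip(keys, choice))
            for choice in product(*(B(a) for a in keys))]

A = [0, 1, 2]

def family(a):
    # A dependent family with 2, 3 and 4 inhabitants for a = 0, 1, 2.
    return range(a + 2)

def constant(a):
    # A non-dependent family: always the two booleans.
    return [False, True]
```

Here the dependent pairs over family number 2 + 3 + 4 and the dependent functions number 2 * 3 * 4. With the constant family the counts collapse to the ordinary product 3 * 2 and the exponential 2 ** 3, as the text describes.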

a month ago • 25 votes

More in programming

Thoughts on Motivation and My 40-Year Career

I’ve never published an essay quite like this. I’ve written about my life before, reams of stuff actually, because that’s how I process what I think, but never for public consumption. I’ve been pushing myself to write more lately because my co-authors and I have a whole fucking book to write between now and October. […]

10 hours ago • 4 votes

Single-Use Disposable Applications

As search gets worse and “working code” gets cheaper, apps get easier to make from scratch than to find.

15 hours ago • 4 votes

Desktop UI frameworks written by a single person

Less known desktop UI frameworks

Writing desktop software is hard. The UI technologies of Windows or MacOS are awful compared to web technology. What can trivially be done with HTML/CSS/JavaScript in a few minutes can take hours using Windows’s win32 APIs or Mac’s Cocoa.

That’s why the default technology for desktop apps, especially cross-platform, is Electron: a Chrome browser combined with Node runtime. The problem is that it’s bloaty: each app is a unique build of Chrome with a little bit of application code. Chrome is over 100MB so many apps ship less than 1MB of code in a 100MB wrapper.

People tried to address the problem of poor OS APIs by writing UI frameworks, often meant to be cross-platform. You’ve heard about Qt, GTK, wxWindows. The problem with those is that they are also old, their APIs are not the greatest either and they are bloaty as well.

There just doesn’t seem to be a good option. Writing your own framework seems impossible due to the size of the task. But is it? I’ll show a couple of less-known UI frameworks written mostly by a single person, often done simply to enable writing an application.

SWELL in WDL

WDL is interesting. Justin Frankel, the guy who created Winamp, has a repository of C++ code he uses in different projects. After selling Winamp to AOL, a side quest of writing a file sharing application, and getting fired from AOL for writing a file sharing application, he started a company building Reaper, a digital audio workstation for Windows.

Winamp is a win32 API program and so is Reaper. At some point Justin decided to make a Mac version but by then he had a lot of code heavily using win32 APIs. So he did what anyone in his position would: he implemented win32 APIs for Mac OS and Linux and called it SWELL - Simple Windows Emulation Layer.

Ok, actually no-one else would do it. It was an insane idea but it worked.

It’s important to not over-state SWELL’s capabilities. It’s not Wine. You can’t take any win32 program and recompile it for Mac with SWELL. Frankel is insanely pragmatic and so is his code. SWELL only implements the subset of APIs he uses in Reaper. At the same time Reaper is a big app so if SWELL works for Reaper, it could work for your app.

WDL is open-source under the permissive MIT license.

Sublime Text

For a few years Sublime Text was THE programmer’s editor. It was written by a single developer in C++ and he wrote a custom UI toolkit for it. Not open source but its existence shows it can be done.

RAD Debugger

RAD Debugger is an open-source Windows debugger for C/C++ apps written in C by mostly a single person. It implements a custom UI framework based on a 3D renderer. The UI is an integral part of the app but the code is well structured so you probably can take just their UI / render code and use it in your own C / C++ app.

Currently the app / UI is only for Windows but it’s designed to be cross-platform and they are working on porting the renderer to Mac OS / Linux. They use the permissive MIT license and everything is written in C.

Dear ImGui

Dear ImGui is a newer cross-platform UI framework in C++. Open source, permissive MIT license. Written by mostly a single person.

Ghostty

Ghostty is a cross-platform terminal emulator and UI. It’s written in Zig by mostly a single person and uses its own low-level GPU renderer for the UI.

You too can write your own UI framework

At first the idea of writing your own UI framework seems impossibly daunting. What I’m hoping to show is that if you’re ambitious enough it’s possible to build cross platform desktop apps that are not just bloated 100MB Chrome wrappers around a few kilobytes of custom code.

I’m not saying it’s a simple thing, just that enough people did it that it’s possible. It shouldn’t be necessary but both Microsoft and Apple have tragically dropped the ball on providing decent, high-performance UI libraries for their OS. Microsoft even writes their own apps, like Teams, in web technologies.

Thanks to open source you’re not at the starting line. You can just use Dear ImGui or WDL’s SWELL. Or you can extract the UI code from RAD Debugger or Ghostty (if you write in Zig). Or you can look at their implementations to speed up your own design and implementation.

yesterday • 2 votes

Logic for Programmers Turns One

I released Logic for Programmers exactly one year ago today. It feels weird to celebrate the anniversary of something that isn't 1.0 yet, but software projects have a proud tradition of celebrating a dozen anniversaries before 1.0. I wanted to share about what's changed in the past year and the work for the next six+ months. The Road to 0.1 I had been noodling on the idea of a logic book since the pandemic. The first time I wrote about it on the newsletter was in 2021! Then I said that it would be done by June and would be "under 50 pages". The idea was to cover logic as a "soft skill" that helped you think about things like requirements and stuff. That version sucked. If you want to see how much it sucked, I put it up on Patreon. Then I slept on the next draft for three years. Then in 2024 a lot of business fell through and I had a lot of free time, so with the help of Saul Pwanson I rewrote the book. This time I emphasized breadth over depth, trying to cover a lot more techniques. I also decided to self-publish it instead of pitching it to a publisher. Not going the traditional route would mean I would be responsible for paying for editing, advertising, graphic design etc, but I hoped that would be compensated by much higher royalties. It also meant I could release the book in early access and use early sales to fund further improvements. So I wrote up a draft in Sphinx, compiled it to LaTeX, and uploaded the PDF to leanpub. That was in June 2024. Since then I kept to a monthly cadence of updates, missing once in November (short-notice contract) and once last month (Systems Distributed). The book's now on v0.10. What's changed? A LOT v0.1 was very obviously an alpha, and I have made a lot of improvements since then. For one, the book no longer looks like a Sphinx manual. Compare! Also, the content is very, very different. 
v0.1 was 19,000 words, v0.10 is 31,000.[1] This comes from new chapters on TLA+, constraint/SMT solving, logic programming, and major expansions to the existing chapters. Originally, "Simplifying Conditionals" was 600 words. Six hundred words! It almost fit in two pages! The chapter is now 2600 words, now covering condition lifting, quantifier manipulation, helper predicates, and set optimizations. All the other chapters have either gotten similar facelifts or are scheduled to get facelifts.

The last big change is the addition of book assets. Originally you had to manually copy over all of the code to try it out, which is a problem when there are samples in eight distinct languages! Now there are ready-to-go examples for each chapter, with instructions on how to set up each programming environment. This is also nice because it gives me breaks from writing to code instead.

How did the book do?

Leanpub's all-time visualizations are terrible, so I'll just give the summary: 1180 copies sold, $18,241 in royalties. That's a lot of money for something that isn't fully out yet! By comparison, Practical TLA+ has made me less than half of that, despite selling over 5x as many books. Self-publishing was the right choice!

In that time I've paid about $400 for the book cover (worth it) and maybe $800 in Leanpub's advertising service (probably not worth it).

Right now that doesn't come close to making back the time investment, but I think it can get there post-release. I believe there are a lot more potential customers via marketing. I think post-release 10k copies sold is within reach.

Where is the book going?

The main content work is rewrites: many of the chapters have not meaningfully changed since v0.1, so I am going through and rewriting them from scratch. So far four of the ten chapters have been rewritten. My (admittedly ambitious) goal is to rewrite three of them by the end of this month and another three by the end of next. I also want to do final passes on the rewritten chapters, as most of them have a few TODOs left lying around.

(Also somehow in starting this newsletter and publishing it I realized that one of the chapters might be better split into two chapters, so there could well be a tenth technique in v0.11 or v0.12!)

After that, I will pass it to a copy editor while I work on improving the layout, making images, and indexing. I want to have something worthy of printing on a dead tree by 1.0.

In terms of timelines, I am very roughly estimating something like this:

- Summer: final big changes and rewrites
- Early Autumn: graphic design and copy editing
- Late Autumn: proofing, figuring out printing stuff
- Winter: final ebook and initial print releases of 1.0.

(If you know a service that helps get self-published books "past the finish line", I'd love to hear about it! Preferably something that works for a fee, not part of royalties.)

This timeline may be disrupted by official client work, like a new TLA+ contract or a conference invitation.

Needless to say, I am incredibly excited to complete this book and share the final version with you all. This is a book I wished for years ago, a book I wrote because nobody else would. It fills a critical gap in software educational material, and someday soon I'll be able to put a copy on my bookshelf. It's exhilarating and terrifying and above all, satisfying.

[1] It's also 150 pages vs 50 pages, but admittedly this is partially because I made the book smaller with a larger font.

2 days ago • 5 votes

Implementing UI translation in SumatraPDF, a C++ Windows application

Translating user interface of SumatraPDF

SumatraPDF is the best PDF/eBook/Comic Book viewer for Windows. It’s small, fast, full of features, free and open-source.

It became popular enough that it made sense to translate the UI for non-English users. Currently we support 72 languages.

This article describes how I designed and implemented a translation system in SumatraPDF, a native win32 C++ Windows application.

Hard things about translating the UI

There are 2 hard things about translating an application:

- code for the translation system (extracting strings to translate, translating strings from English to the user’s language)
- translating them into many languages

Extracting strings to translate from source code

Currently there are 381 strings in SumatraPDF subject to translation. It’s important that the system requires the least amount of effort when adding new strings to translate.

Every string that needs to be translated is marked in a .cpp or .h file with one of two macros:

    _TRA("Rename")
    _TRN("Open")

I have a script that extracts those strings from source files. Mine is written in Go but it could just as well be Python or JavaScript. It’s a simple regex job.

_TR stands for “translation”.

_TRA(s) expands into a call to the const char* trans::GetTranslation(const char* str) function, which returns str translated to the current UI language. We auto-detect the language at startup based on Windows settings and allow the user to explicitly set the UI language. For English we just return the original string.

If a string to be translated is e.g. a part of a const char* array[], we can’t use trans::GetTranslation(). For cases like that we have _TRN() which expands to the English string. We have to write code to translate it at some point.

Adding new strings is therefore as simple as wrapping them in _TRA() or _TRN() macros.

Translating strings into many languages

Now that we’ve extracted strings to be translated, we need to translate them into 72 languages. SumatraPDF is a free, open-source program.
I don’t have a budget to hire translators. I don’t have a budget, period. The only option was to get help from SumatraPDF users. It was vital to make it very easy for users to send me translations. I didn’t want to ask them, for example, to download some translation software. Design and implementation of AppTranslator web app I couldn’t find a really simple software for crowd sourcing translations so I wrote my own: https://github.com/kjk/apptranslator You can see it in action: https://www.apptranslator.org/app/SumatraPDF I designed it to be generic but I don’t think anyone else is using it. AppTranslator is simple. Per https://tools.arslexis.io/wc/: 4k lines of Go server code 451 lines of html code a single dependency: bootstrap CSS framework (the project is old) It’s simple because I don’t want to spend a lot of time writing translation software. It’s just a side project in service of the goal of translating SumatraPDF. Login is exclusively via GitHub. It doesn’t even use a database. Like in Redis, changes are stored as a series of operations in an append-only log. We keep the whole state in memory and re-create it from the log at startup. Main operation is translate a string from English to language X represented as [kOpTranslation, english string, language, translation, user who provided translation]. When user provides a translation in the web UI, we send an API call to the server which appends the translation operation to the log. Simple and reliable. Because the code is written in Go, it’s very fast and memory efficient. When running it uses mere megabytes of RAM. It can comfortably run on the smallest 256 MB VPS server. I backup the log to S3 so if the server ever fails, I can re-install the program on a new server and re-download the translations from S3. I provide RSS feed for each language so that people who provide translations can monitor for new strings to be translated. 
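
The append-only log idea described above can be sketched in a few lines. This Python version is an assumption-laden illustration, not AppTranslator's Go code: the class name, the JSON-lines encoding and the in-memory dictionary are all hypothetical; only the shape of the operation (kOpTranslation, English string, language, translation, user) comes from the description.

```python
import json
import os
import tempfile

OP_TRANSLATION = "kOpTranslation"

class TranslationLog:
    """Keeps all state in memory and rebuilds it from an append-only log."""

    def __init__(self, path):
        self.path = path
        # (english string, language) -> (translation, user)
        self.translations = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        self._apply(json.loads(line))

    def _apply(self, op):
        kind, english, lang, translation, user = op
        if kind == OP_TRANSLATION:
            # A later operation for the same string and language wins.
            self.translations[(english, lang)] = (translation, user)

    def translate(self, english, lang, translation, user):
        op = [OP_TRANSLATION, english, lang, translation, user]
        # Append first, then update memory, so the log is the source of truth.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(op, ensure_ascii=False) + "\n")
        self._apply(op)
```

Because nothing is ever rewritten in place, backing up the service amounts to copying one file, and restoring it amounts to replaying that file at startup.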
Sending strings for translation and receiving translations

So I have a web app for collecting translations and a script that extracts strings to be translated from source code. How do they connect?

AppTranslator has an API for submitting the current set of strings to be translated in the simplest possible format: a line for each string (I ensure there are no newlines in the string itself by escaping them with \n).

The API is password protected because only I can submit the strings.

The server compares the strings sent with the current set and records a difference in the log. It also sends a response with translations. Again the simplest possible format:

    AppTranslator: SumatraPDF
    651b739d7fa110911f25563c933f42b1d37590f8
    :%s annotation. Ctrl+click to edit.
    am:%s մեկնաբանություն: Ctrl+քլիք՝ խմբագրելու համար:
    ar:ملاحظة %s. اضغط Ctrl للتحرير.
    az:Qeyd %s. Düzəliş etmək üçün Ctrl+düyməyə basın.

As you can see:

- a string to translate is on a line starting with :
- it is followed by translations of that string in the format: ${lang}: ${translation}

An optimization: 651b739d7fa110911f25563c933f42b1d37590f8 is a hash of this response. If I submit this hash with my request and translations didn’t change on the server, the response is empty.

Implementing C++ part of translation system

So now I have a text file with translations downloaded from the server. How do I get a translation in my C++ code?

As with everything in SumatraPDF, I try to do things in a simple and efficient way. The whole Translation.cpp is only 239 lines of code.

The core of the translation system is the const char* trans::GetTranslation(const char* s); function.

I embed the translations, in exactly the same format as received from AppTranslator, in the executable as a data file in resources.

If the UI language is English, we do nothing. trans::GetTranslation() returns its argument.
When we switch the language, we load the translations from resources and build an index:

- an array of English strings
- an array of corresponding translations

Both arrays use my own StrVec class optimized for storing an array of strings.

To find a translation we scan the first array to find the index of the string and return the translation from the second array, at the same index. Linear scan seems like it would be slow but it isn’t.

Resizing dialogs

I have a few dialogs defined in the SumatraPDF.rc file. The problem with dialogs is that the position of UI elements is fixed. A translated string will almost certainly have a different size than the English string, which will mess up the fixed layout.

Thankfully someone wrote DialogSizer, which smartly resizes dialogs and solves this problem.

The evolution of a solution

No AppTranslator

My initial implementation was simpler. I didn’t yet have AppTranslator so I stored the strings in a text file in the repository, in the same format as what I described above. People would download it, make changes using a text editor and send me the file via email, which I would then check in.

It worked for a while but it became worse over time. More strings and more languages created more work for me to manually manage e-mail submissions. I decided to automate the process.

Code generation

My first implementation of the C++ side used code generation instead of embedding the text file in resources. My Go script would generate C++ source code files with static const char* [] arrays.

This worked well but I decided to improve it further by making the code use the text file with translations embedded in the app. The main motivation for the change was to open the possibility of downloading the latest translations from the server, to fix the problem of translations not all being ready when I build the release executable.
I haven’t done that yet but it’s now easier to implement given that the format of strings embedded in the exe is the same as the one I can download from AppTranslator.

Only utf-8

SumatraPDF started by using both WCHAR* Unicode strings and char* utf8 strings. For that reason the translation system had to support returning translations in both WCHAR* and char* versions.

Over time I refactored the code to use mostly utf8 and at some point I no longer needed to support the WCHAR* version. That made the code even smaller and reduced memory usage.

The experience

I’m happy with how things turned out. AppTranslator proved to be reliable and hassle free. It has been running for many years now and has collected 35440 string translations from users.

I automated everything so that all I need to do is periodically re-run the script that extracts strings from source code, uploads them to AppTranslator and downloads the latest translations.

One problem is that translations are not always ready in time for a release, so I make a release and then people start translating strings added since the last release.

I’ve considered downloading the latest translations from the server, in addition to embedding them in the executable at the time of building the app.

Would I do the same today?

While AppTranslator is reliable and doesn’t require on-going work, it would be better to not have to run a server at all.

The world has changed since I started SumatraPDF. Namely: people are comfortable using GitHub and you can edit files directly in GitHub UI. It’s not a great experience but it works.

One option would be to generate a translation text file for each language, in this format:

:first untranslated string
:second untranslated string
:first translated string
translation of first string
:second translated string
translation of second string

Untranslated strings are listed at the top, to make them easier to find.

A link would send a translator directly to edit this file in GitHub UI.
When a translator saves translations, it creates a PR for me to review and merge.

The roads not taken

But why did you re-invent everything? You should do X instead.

All other X that I know about suck.

Using per-language .rc resource files

The traditional way of localizing / translating Windows GUI apps is to store all strings and dialog definitions in an .rc file. Each language gets its own .rc file (or files) and the program picks the right resource based on the language.

This doesn’t solve the 2 hard problems:

- having an easy way to add strings for translation
- having an easy way for users to provide translations

XML horror show

There was a dark time when the world was under the iron grip of XML fanaticism. Everything had to be an XML file even when it was the worst possible solution for the problem.

XML doesn’t solve the 2 hard problems, and as a string storage format it is an absolute nightmare for human editing.

GNU gettext

There’s a C library gettext that uses .po files. This is a much saner solution than the XML horror show. .po files are a relatively simple text format. The code is already written.

Warning: tooting my own horn. My format is better. It’s easier for people to edit, it’s easier to write code to parse it.

The gettext code looks like many times more than 239 lines. Ok, gettext probably does a bit more than my code, but clearly nothing that I need.

It also doesn’t solve the 2 hard problems. I would still have to write code to extract strings from source code and build a way to allow users to translate them easily.
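
To make the “easier to write code to parse it” claim concrete, here is a minimal Go sketch that reads the AppTranslator response format into two parallel slices and finds a translation by linear scan. It is not the code in Translation.cpp, which is C++ and uses StrVec. The names index, parseTranslations, unescape and lookup are made up for illustration. The sketch assumes the first two lines are the header and the hash, that everything before the first colon on a translation line is the language code, and that the only escape sequence is \n.

```go
package main

import "strings"

// index holds two parallel slices: english[i] is translated by translated[i].
type index struct {
	english    []string
	translated []string
}

// unescape turns the two-character sequence \n back into a newline.
// Assumption: no other escapes are used in the file.
func unescape(s string) string {
	return strings.ReplaceAll(s, `\n`, "\n")
}

// parseTranslations builds the index for one language from the text format:
// two header lines, then ":english" lines, each followed by
// "lang:translation" lines.
func parseTranslations(data, lang string) index {
	var idx index
	lines := strings.Split(data, "\n")
	if len(lines) < 2 {
		return idx
	}
	current := ""
	for _, line := range lines[2:] { // skip the "AppTranslator: ..." line and the hash
		line = strings.TrimRight(line, "\r")
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, ":") {
			current = line[1:] // a new English string starts here
			continue
		}
		code, text, ok := strings.Cut(line, ":")
		if !ok || code != lang || current == "" {
			continue // another language, or a malformed line
		}
		idx.english = append(idx.english, unescape(current))
		idx.translated = append(idx.translated, unescape(text))
	}
	return idx
}

// lookup scans the English slice and returns the entry at the same position
// in the translated slice. Returning s when nothing matches is an assumption
// of this sketch.
func (idx index) lookup(s string) string {
	for i, e := range idx.english {
		if e == s {
			return idx.translated[i]
		}
	}
	return s
}
```

Calling parseTranslations(data, "az") on the sample response and then lookup with the English string would return the az line’s text. Only the strings of the selected language are kept, so the linear scan runs over one entry per translated UI string.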

