More from Tony Finch's blog
Recently, Alex Kladov wrote on the TigerBeetle blog about swarm testing data structures. It’s a neat post about randomized testing with Zig. I wrote a comment with an idea that was new to Alex @matklad, so I’m reposing a longer version here. differential testing problems grow / shrink random elements element-wise testing test loop data structure size invariants performance conclusion differential testing A common approach to testing data structures is to write a second reference implementation that has the same API but simpler and/or more obviously correct, though it uses more memory or is slower or less concurrent or otherwise not up to production quality. Then, run the production implementation and the reference implementation on the same sequence of operations, and verify that they produce the same results. Any difference is either a bug in the production implementation (probably) or a bug in the reference implementation (unlucky) or a bug in the tests (unfortunate). This is a straightforward differential testing pattern. problems There are a couple of difficulties with this kind of basic differential testing. grow / shrink The TigerBeetle article talks about adjusting the probabilities of different operations on the data structure to try to explore more edge cases. To motivate the idea, the article talks about adjusting the probabilities of adding or deleting items: If adding and deleting have equal probability, then the test finds it hard to grow the data structure to interesting sizes that might expose bugs. Unfortunately, if the probability of add is greater than del, then the data structure tends to grow without bound. If the probability of del is greater than add, then it tries to shrink from nothing: worse than equal probabilities! They could preload the data structure to test how it behaves when it shrinks, but a fixed set of probabilities per run is not good at testing both growth and shrinkage on the same test run on the same data structure. One way to improve this kind of test is to adjust the probability of add and del dynamically: make add more likely when the data structure is small, and del more likely when it is big. And maybe make add more likely in the first half of a test run and del more likely in the second half. random elements The TigerBeetle article glosses over the question of where the tests get fresh elements to add to the data structure. And its example is chosen so it doesn’t have to think about which elements get deleted. In my experience writing data structures for non-garbage-collected languages, I had to be more deliberate about how to create and destroy elements. That led to a style of test that’s more element-centric, as Alex described it. element-wise testing Change the emphasis so that instead of testing that two implementations match, test that one implementation obeys the expected behaviour. No need to make a drop-in replacement reference implementation! What I typically do is pre-allocate an array of elements, with slots that I can set to keep track of how each element relates to the data structure under test. The most important property is whether the element has been added or deleted, but there might be others related to ordering of elements, or values associated with keys, and so on. test loop Each time round the loop, choose at random an element from the array, and an action such as add / del / get / … Then, if it makes sense, perform the operation on the data structure with the element. For example, you might skip an add action if the element is already in the data structure, unless you can try to add it and expect an error. data structure size This strategy tends to grow the data structure until about 50% of the pre-allocated elements are inserted, then it makes a random walk around this 50% point. Random walks can diverge widely from their central point both in theory and in practice, so this kind of testing is reasonably effective at both growing and (to a lesser extent) shrinking the data structure. invariants I usually check some preconditions before an action, to verify that the data structure matches the expected properties of the chosen element. This can help to detect earlier that an action on one element has corrupted another element. After performing the action and updating the element’s properties, I check the updated properties as a postcondition, to make sure the action had the expected effects. performance John Regehr’s great tutorial, how to fuzz an ADT implementation, recommends writing a checkRep() function that thoroughly verifies a data structure’s internal consistency. A checkRep() function is a solid gold testing tool, but it is O(n) at least and typically very slow. If you call checkRep() frequently during testing, your tests slow down dramatically as your data structure gets larger. I like my per-element invariants to be local and ideally O(1) or O(log n) at worst, so they don’t slow down the tests too much. conclusion Recently I’ve used this pattern to exhibit concurrency bugs in an API that’s hard to make thread-safe. Writing the tests has required some cunning to work out what invariants I can usefully maintain and test; what variety of actions I can use to stress those invariants; and what mix of elements + actions I need so that my tests know which properties of each element should be upheld and which can change. I’m testing multiple implementations of the same API, trying to demonstrate which is safest. Differential testing can tell me that implementations diverge, but not which is correct, whereas testing properties and invariants more directly tells me whether an implementation does what I expect. (Or gives me a useless answer when my tests are weak.) Which is to say that this kind of testing is a fun creative challenge. I find it a lot more rewarding than example-based testing.
I have added syntax highlighting to my blog using tree-sitter. Here are some notes about what I learned, with some complaining. static site generator markdown ingestion highlighting incompatible?! highlight names class names styling code results future work frontmatter templates feed style highlight quality static site generator I moved my blog to my own web site a few years ago. It is produced using a scruffy Rust program that converts a bunch of Markdown files to HTML using pulldown-cmark, and produces complete pages from Handlebars templates. Why did I write another static site generator? Well, partly as an exercise when learning Rust. Partly, since I wrote my own page templates, I’m not going to benefit from a library of existing templates. On the contrary, it’s harder to create new templates that work with a general-purpose SSG than write my own simpler site-specific SSG. It’s miserable to write programs in template languages. My SSG can keep the logic in the templates to a minimum, and do all the fiddly stuff in Rust. (Which is not very fiddly, because my site doesn’t have complicated navigation – compared to the multilevel menus on www.dns.cam.ac.uk for instance.) markdown ingestion There are a few things to do to each Markdown file: split off and deserialize the YAML frontmatter find the <cut> or <toc> marker that indicates the end of the teaser / where the table of contents should be inserted augment headings with self-linking anchors (which are also used by the ToC) Before this work I was using regexes to do all these jobs, because that allowed me to treat pulldown-cmark as a black box: Markdown in, HTML out. But for syntax highlighting I had to be able to find fenced code blocks. It was time to put some code into the pipeline between pulldown-cmark’s parser and renderer. And if I’m using a proper parser I can get rid of a few regexes: after some hacking, now only the YAML frontmatter is handled with a regex. Sub-heading linkification and ToC construction are fiddly and more complicated than they were before. But they are also less buggy: markup in headings actually works now! Compared to the ToC, it’s fairly simple to detect code blocks and pass them through a highlighter. You can look at my Markdown munger here. (I am not very happy with the way it uses state, but it works.) highlighting As well as the tree-sitter-highlight documentation I used femark as an example implementation. I encountered a few problems. incompatible?! I could not get the latest tree-sitter-highlight to work as described in its documentation. I thought the current tree-sitter crates were incompatible with each other! For a while I downgraded to an earlier version, but eventually I solved the problem. Where the docs say, let javascript_language = tree_sitter_javascript::language(); They should say: let javascript_language = tree_sitter::Language::new( tree_sitter_javascript::LANGUAGE ); highlight names I was offended that tree-sitter-highlight seems to expect me to hardcode a list of highlight names, without explaining where they come from or what they mean. I was doubly offended that there’s an array of STANDARD_CAPTURE_NAMES but it isn’t exported, and doesn’t match the list in the docs. You mean I have to copy and paste it? Which one?! There’s some discussion of highlight names in the tree-sitter manual’s “syntax highlighting” chapter, but that is aimed at people who are writing a tree-sitter grammar, not people who are using one. Eventually I worked out that tree_sitter_javascript::HIGHLIGHT_QUERY in the tree-sitter-highlight example corresponds to the contents of a highlights.scm file. Each @name in highlights.scm is a highlight name that I might be interested in. In principle I guess different tree-sitter grammars should use similar highlight names in their highlights.scm files? (Only to a limited extent, it turns out.) I decided the obviously correct list of highlight names is the list of every name defined in the HIGHLIGHT_QUERY. The query is just a string so I can throw a regex at it and build an array of the matches. This should make the highlighter produce <span> wrappers for as many tokens as possible in my code, which might be more than necessary but I don’t have to style them all. class names The tree-sitter-highlight crate comes with a lightly-documented HtmlRenderer, which does much of the job fairly straightforwardly. The fun part is the attribute_callback. When the HtmlRenderer is wrapping a token, it emits the start of a <span then expects the callback to append whatever HTML attributes it thinks might be appropriate. Uh, I guess I want a class="..." here? Well, the highlight names work a little bit like class names: they have dot-separated parts which tree-sitter-highlight can match more or less specifically. (However I am telling it to match all of them.) So I decided to turn each dot-separated highlight name into a space-separated class attribute. The nice thing about this is that my Rust code doesn’t need to know anything about a language’s tree-sitter grammar or its highlight query. The grammar’s highlight names become CSS class names automatically. styling code Now I can write some simple CSS to add some colours to my code. I can make type names green, code span.hilite.type { color: #aca; } If I decide builtin types should be cyan like keywords I can write, code span.hilite.type.builtin, code span.hilite.keyword { color: #9cc; } results You can look at my tree-sitter-highlight wrapper here. Getting it to work required a bit more creativity than I would have preferred, but it turned out OK. I can add support for a new language by adding a crate to Cargo.toml and a couple of lines to hilite.rs – and maybe some CSS if I have not yet covered its highlight names. (Like I just did to highlight the CSS above!) future work While writing this blog post I found myself complaining about things that I really ought to fix instead. frontmatter I might simplify the per-page source format knob so that I can use pulldown-cmark’s support for YAML frontmatter instead of a separate regex pass. This change will be easier if I can treat the html pages as Markdown without mangling them too much (is Markdown even supposed to be idempotent?). More tricky are a couple of special case pages whose source is Handlebars instead of Markdown. templates I’m not entirely happy with Handlebars. It’s a more powerful language than I need – I chose Handlebars instead of Mustache because Handlebars works neatly with serde. But it has a dynamic type system that makes the templates more error-prone than I would like. Perhaps I can find a more static Rust template system that takes advantage of the close coupling between my templates and the data structure that describes the web site. However, I like my templates to be primarily HTML with a sprinkling of insertions, not something weird that’s neither HTML nor Rust. feed style There’s no CSS in my Atom feed, so code blocks there will remain unstyled. I don’t know if feed readers accept <style> tags or if it has to be inline styles. (That would make a mess of my neat setup!) highlight quality I’m not entirely satisfied with the level of detail and consistency provided by the tree-sitter language grammars and highlight queries. For instance, in the CSS above the class names and property names have the same colour because the CSS highlights.scm gives them the same highlight name. The C grammar is good at identifying variables, but the Rust grammar is not. Oh well, I guess it’s good enough for now. At least it doesn’t involve Javascript.
Last year I wrote about inlining just the fast path of Lemire’s algorithm for nearly-divisionless unbiased bounded random numbers. The idea was to reduce code bloat by eliminating lots of copies of the random number generator in the rarely-executed slow paths. However a simple split prevented the compiler from being able to optimize cases like pcg32_rand(1 << n), so a lot of the blog post was toying around with ways to mitigate this problem. On Monday while procrastinating a different blog post, I realised that it’s possible to do better: there’s a more general optimization which gives us the 1 << n special case for free. nearly divisionless Lemire’s algorithm has about 4 neat tricks: use multiplication instead of division to reduce the output of a random number generator modulo some limit eliminate the bias in (1) by (counterintuitively) looking at the lower digits fun modular arithmetic to calculate the reject threshold for (2) arrange the reject tests to avoid the slow division in (3) in most cases The nearly-divisionless logic in (4) leads to two copies of the random number generator, in the fast path and the slow path. Generally speaking, compilers don’t try do deduplicate code that was written by the programmer, so they can’t simplify the nearly-divisionless algorithm very much when the limit is constant. constantly divisionless Two points occurred to me: when the limit is constant, the reject threshold (3) can be calculated at compile time when the division is free, there’s no need to avoid it using (4) These observations suggested that when the limit is constant, the function for random numbers less than a limit should be written: static inline uint32_t pcg32_rand_const(pcg32_t *rng, uint32_t limit) { uint32_t reject = -limit % limit; uint64_t sample; do sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit); while ((uint32_t)(sample) < reject); return ((uint32_t)(sample >> 32)); } This has only one call to pcg32_random(), saving space as I wanted, and the compiler is able to eliminate the loop automatically when the limit is a power of two. The loop is smaller than a call to an out-of-line slow path function, so it’s better all round than the code I wrote last year. algorithm selection As before it’s possible to automatically choose the constantly-divisionless or nearly-divisionless algorithms depending on whether the limit is a compile-time constant or run-time variable, using arcane C tricks or GNU C __builtin_constant_p(). I have been idly wondering how to do something similar in other languages. Rust isn’t very keen on automatic specialization, but it has a reasonable alternative. The thing to avoid is passing a runtime variable to the constantly-divisionless algorithm, because then it becomes never-divisionless. Rust has a much richer notion of compile-time constants than C, so it’s possible to write a method like the follwing, which can’t be misused: pub fn upto<const LIMIT: u32>(&mut self) -> u32 { let reject = LIMIT.wrapping_neg().wrapping_rem(LIMIT); loop { let (lo, hi) = self.get_u32().embiggening_mul(LIMIT); if lo < reject { continue; } else { return hi; } } } assert!(rng.upto::<42>() < 42); (embiggening_mul is my stable replacement for the unstable widening_mul API.) This is a nugatory optimization, but there are more interesting cases where it makes sense to choose a different implementation for constant or variable arguments – that it, the constant case isn’t simply a constant-folded or partially-evaluated version of the variable case. Regular expressions might be lex-style or pcre-style, for example. It’s a curious question of language design whether it should be possible to write a library that provides a uniform API that automatically chooses constant or variable implementations, or whether the user of the library must make the choice explicit. Maybe I should learn some Zig to see how its comptime works.
One of the neat things about the PCG random number generator by Melissa O’Neill is its use of instruction-level parallelism: the PCG state update can run in parallel with its output permutation. However, PCG only has a limited amount of ILP, about 3 instructions. Its overall speed is limited by the rate at which a CPU can run a sequence where the output of one multiply-add feeds into the next multiply-add. … Or is it? With some linear algebra and some AVX512, I can generate random numbers from a single instance of pcg32 at 200 Gbit/s on a single core. This is the same sequence of random numbers generated in the same order as normal pcg32, but more than 4x faster. You can look at the benchmark in my pcg-dxsm repository. skip ahead the insight multipliers trying it out results skip ahead One of the slightly weird features that PCG gets from its underlying linear congruential generator is “seekability”: you can skip ahead k steps in the stream of random numbers in log(k) time. The PCG paper (in section 4.3.1) cites Forrest Brown’s paper, random numbers with arbitrary strides, which explains that the skip-ahead feature is useful for reproducibility of monte carlo simulations. But what caught my eye is the skip-ahead formula. Rephrased in programmer style, state[n+k] = state[n] * pow(MUL, k) + inc * (pow(MUL, k) - 1) / (MUL - 1) the insight The skip-ahead formula says that we can calculate a future state using a couple of multiplications. The skip-ahead multipliers depend only on the LCG multiplier, not on the variable state, nor on the configurable increment. That means that for a fixed skip ahead, we can precalculate the multipliers before compile time. The skip-ahead formula allows us to unroll the PCG data dependency chain. Normally, four iterations of the PCG state update look like, state0 = rng->state state1 = state0 * MUL + rng->inc state2 = state1 * MUL + rng->inc state3 = state2 * MUL + rng->inc state4 = state3 * MUL + rng->inc rng->state = state4 With the skip-ahead multipliers it looks like, state0 = rng->state state1 = state0 * MULs1 + rng->inc * MULi1 state2 = state0 * MULs2 + rng->inc * MULi2 state3 = state0 * MULs3 + rng->inc * MULi3 state4 = state0 * MULs4 + rng->inc * MULi4 rng->state = state4 These state calculations can be done in parallel using NEON or AVX vector instructions. The disadvantage is that calculating future states in parallel requires more multiplications than doing so in series, but that’s OK because modern CPUs have lots of ALUs. multipliers The skip-ahead formula is useful for jumping ahead long distances, because (as Forrest Brown explained) you can do the exponentiation in log(k) time using repeated squaring. (The same technique is used in for modexp in RSA.) But I’m only interested in the first few skip-ahead multipliers. I’ll define the linear congruential generator as: lcg(s, inc) = s * MUL + inc Which is used in PCG’s normal state update like: rng->state = lcg(rng->state, rng->inc) To precalculate the first few skip-ahead multipliers, we iterate the LCG starting from zero and one, like this: MULs0 = 1 MULs1 = lcg(MULs0, 0) MULs2 = lcg(MULs1, 0) MULi0 = 0 MULi1 = lcg(MULi0, 1) MULi2 = lcg(MULi1, 1) My benchmark code’s commentary includes a proof by induction, which I wrote to convince myself that these multipliers are correct. trying it out To explore how well this skip-ahead idea works, I have written a couple of variants of my pcg32_bytes() function, which simply iterates pcg32 and writes the results to a byte array. The variants have an adjustable amount of parallelism. One variant is written as scalar code in a loop that has been unrolled by hand a few times. I wanted to see if standard C gets a decent speedup, perhaps from autovectorization. The other variant uses the GNU C portable vector extensions to calculate pcg32 in an explicitly parallel manner. The benchmark also ensures the output from every variant matches the baseline pcg32_bytes(). results The output from the benchmark harness lists: the function variant either the baseline version or uN for a scalar loop unrolled N times or xN for vector code with N lanes its speed in bytes per nanosecond (aka gigabytes per second) its performance relative to the baseline There are small differences in style between the baseline and u1 functions, but their performance ought to be basically the same. Apple clang 16, Macbook Pro M1 Pro. This compiler is eager and fairly effective at autovectorizing. ARM NEON isn’t big enough to get a speedup from 8 lanes of parallelism. __ 3.66 bytes/ns x 1.00 u1 3.90 bytes/ns x 1.07 u2 6.40 bytes/ns x 1.75 u3 7.66 bytes/ns x 2.09 u4 8.52 bytes/ns x 2.33 x2 7.59 bytes/ns x 2.08 x4 10.49 bytes/ns x 2.87 x8 10.40 bytes/ns x 2.84 The following results were from my AMD Ryzen 9 7950X running Debian 12 “bookworm”, comparing gcc vs clang, and AVX2 vs AVX512. gcc is less keen to autovectorize so it doesn’t do very well with the unrolled loops. (Dunno why u1 is so much slower than the baseline.) gcc 12.2 -march=x86-64-v3 __ 5.57 bytes/ns x 1.00 u1 5.13 bytes/ns x 0.92 u2 5.03 bytes/ns x 0.90 u3 7.01 bytes/ns x 1.26 u4 6.83 bytes/ns x 1.23 x2 3.96 bytes/ns x 0.71 x4 8.00 bytes/ns x 1.44 x8 12.35 bytes/ns x 2.22 clang 16.0 -march=x86-64-v3 __ 4.89 bytes/ns x 1.00 u1 4.08 bytes/ns x 0.83 u2 8.76 bytes/ns x 1.79 u3 10.43 bytes/ns x 2.13 u4 10.81 bytes/ns x 2.21 x2 6.67 bytes/ns x 1.36 x4 12.67 bytes/ns x 2.59 x8 15.27 bytes/ns x 3.12 gcc 12.2 -march=x86-64-v4 __ 5.53 bytes/ns x 1.00 u1 5.53 bytes/ns x 1.00 u2 5.55 bytes/ns x 1.00 u3 6.99 bytes/ns x 1.26 u4 6.79 bytes/ns x 1.23 x2 4.75 bytes/ns x 0.86 x4 17.14 bytes/ns x 3.10 x8 20.90 bytes/ns x 3.78 clang 16.0 -march=x86-64-v4 __ 5.53 bytes/ns x 1.00 u1 4.25 bytes/ns x 0.77 u2 7.94 bytes/ns x 1.44 u3 9.31 bytes/ns x 1.68 u4 15.33 bytes/ns x 2.77 x2 9.07 bytes/ns x 1.64 x4 21.74 bytes/ns x 3.93 x8 26.34 bytes/ns x 4.76 That last result is pcg32 generating random numbers at 200 Gbit/s.
The International Obfuscated C Code Contest has a newly revamped web site, and the Judges have announced the 28th contest, to coincide with its 40th anniversary. (Or 41st?) The Judges have also updated the archive of past winners so that as many of them as possible work on modern systems. Accordingly, I took a look at my 1998 winner to see how much damage time hath wrought. When it is built, my program needs to go through the C preprocessor twice. There are a few reasons: It’s part of coercing the C compiler into compiling OFL, an obfuscated functional language. OFL has keywords l and b, short for let and be, so for example the function for constructing a pair is defined as l pair b (BB (B (B K)) C CI) In a less awful language that might be written let pair = λx λy λf λg (f x y) Anyway, the first pass of the C preprocessor turns a l (let) declaration into a macro #define pair b (BB (B (B K)) C CI) And the second pass expands the macros. (There’s a joke in the README that the OFL compiler has one optimization, function inlining (which is actually implemented by cpp macro expansion) but in fact inlining harms the performance of OFL.) The smaller the OFL interpreter, the more space there is for the program written in OFL. In the 1998 IOCCC rules, #define cost 7 characters, whereas l cost only one. I think the modern rules don’t count C or cpp keywords so there’s less reason to use this stupid trick to save space. Running the program through cpp twice is a horrible abuse of C and therefore just the kind of joke that the IOCCC encourages. (In fact the Makefile sends the program through cpp three times, twice explicitly and once as part of compiling to machine code. This is deliberately gratuitous, INABIAF.) There were a couple of ways this silliness caused problems. Modern headers are sensitive to which version of the C standard is in effect, wrt things like restrict keywords in standard library function declarations. The extra preprocessor invocations needed to be fixed to use consistent -std= options so that the final compilation doesn’t encounter language features from the future. Newer gcc emits #line directives around macro expansions. This caused problems for the declaration l ef E(EOF) which defines ef as a primitive value equal to EOF. After preprocessing this became #define ef E( #line 1213 "stdio.h" (-1) #line 69 "fanf.c" ) so the macro definition got truncated. The fix was to process the #include directives in the second preprocessor pass rather than the first. I vaguely remember some indecision when writing the program about whether to #include in the first or second pass, in particular whether preprocessing the headers twice would lead to trouble. First pass #include seemed to work and was shorter so that was what the original submission did. There’s one further change. The IOCCC Judges are trying to avoid compiler warnings about nonstandard arguments to main. To save a few characters, my entry had int main(int c) { ... } but the argument c isn’t used so I just removed it. The build commands still print “This may take some time to complete”, because in the 1990s if you tried to compile with optimization you would have been waiting a long time, if it completed at all. The revamped Makefile uses -O3, which takes gcc over 30 seconds and half a gigabyte of RAM. Quite a lot for less than 2.5 KiB of C!
More in programming
Hey peoples! Tonight, some meta-words. As you know I am fascinated by compilers and language implementations, and I just want to know all the things and implement all the fun stuff: intermediate representations, flow-sensitive source-to-source optimization passes, register allocation, instruction selection, garbage collection, all of that. It started long ago with a combination of curiosity and a hubris to satisfy that curiosity. The usual way to slake such a thirst is structured higher education followed by industry apprenticeship, but for whatever reason my path sent me through a nuclear engineering bachelor’s program instead of computer science, and continuing that path was so distasteful that I noped out all the way to rural Namibia for a couple years. Fast-forward, after 20 years in the programming industry, and having picked up some language implementation experience, a few years ago I returned to garbage collection. I have a good level of language implementation chops but never wrote a memory manager, and Guile’s performance was limited by its use of the Boehm collector. I had been on the lookout for something that could help, and when I learned of it seemed to me that the only thing missing was an appropriate implementation for Guile, and hey I could do that!Immix I started with the idea of an -style interface to a memory manager that was abstract enough to be implemented by a variety of different collection algorithms. This kind of abstraction is important, because in this domain it’s easy to convince oneself that a given algorithm is amazing, just based on vibes; to stay grounded, I find I always need to compare what I am doing to some fixed point of reference. This GC implementation effort grew into , but as it did so a funny thing happened: the as a direct replacement for the Boehm collector maintained mark bits in a side table, which I realized was a suitable substrate for Immix-inspired bump-pointer allocation into holes. I ended up building on that to develop an Immix collector, but without lines: instead each granule of allocation (16 bytes for a 64-bit system) is its own line.MMTkWhippetmark-sweep collector that I prototyped The is funny, because it defines itself as a new class of collector, fundamentally different from the three other fundamental algorithms (mark-sweep, mark-compact, and evacuation). Immix’s are blocks (64kB coarse-grained heap divisions) and lines (128B “fine-grained” divisions); the innovation (for me) is the discipline by which one can potentially defragment a block without a second pass over the heap, while also allowing for bump-pointer allocation. See the papers for the deets!Immix papermark-regionregionsoptimistic evacuation However what, really, are the regions referred to by ? If they are blocks, then the concept is trivial: everyone has a block-structured heap these days. If they are spans of lines, well, how does one choose a line size? As I understand it, Immix’s choice of 128 bytes was to be fine-grained enough to not lose too much space to fragmentation, while also being coarse enough to be eagerly swept during the GC pause.mark-region This constraint was odd, to me; all of the mark-sweep systems I have ever dealt with have had lazy or concurrent sweeping, so the lower bound on the line size to me had little meaning. Indeed, as one reads papers in this domain, it is hard to know the real from the rhetorical; the review process prizes novelty over nuance. Anyway. What if we cranked the precision dial to 16 instead, and had a line per granule? That was the process that led me to Nofl. It is a space in a collector that came from mark-sweep with a side table, but instead uses the side table for bump-pointer allocation. Or you could see it as an Immix whose line size is 16 bytes; it’s certainly easier to explain it that way, and that’s the tack I took in a .recent paper submission to ISMM’25 Wait what! I have a fine job in industry and a blog, why write a paper? Gosh I have meditated on this for a long time and the answers are very silly. Firstly, one of my language communities is Scheme, which was a research hotbed some 20-25 years ago, which means many practitioners—people I would be pleased to call peers—came up through the PhD factories and published many interesting results in academic venues. These are the folks I like to hang out with! This is also what academic conferences are, chances to shoot the shit with far-flung fellows. In Scheme this is fine, my work on Guile is enough to pay the intellectual cover charge, but I need more, and in the field of GC I am not a proven player. So I did an atypical thing, which is to cosplay at being an independent researcher without having first been a dependent researcher, and just solo-submit a paper. Kids: if you see yourself here, just go get a doctorate. It is not easy but I can only think it is a much more direct path to goal. And the result? Well, friends, it is this blog post :) I got the usual assortment of review feedback, from the very sympathetic to the less so, but ultimately people were confused by leading with a comparison to Immix but ending without an evaluation against Immix. This is fair and the paper does not mention that, you know, I don’t have an Immix lying around. To my eyes it was a good paper, an , but, you know, just a try. I’ll try again sometime.80% paper In the meantime, I am driving towards getting Whippet into Guile. I am hoping that sometime next week I will have excised all the uses of the BDW (Boehm GC) API in Guile, which will finally allow for testing Nofl in more than a laboratory environment. Onwards and upwards! whippet regions? paper??!?
Having spent four decades as a programmer in various industries and situations, I know that modern software development processes are far more stressful than when I started. It's not simply that developing software today is more complex than it was back in 1981. In that early decade, none
In previous articles, we saw how to use “real” UART, and looked into the trick used by Arduino to automatically reset boards when uploading firmware. Today, we’ll look into how Espressif does something similar, using even more tricks. “Real” UART on the Saola As usual, let’s first simply connect the UART adapter. Again, we connect … Continue reading Espressif’s Automatic Reset → The post Espressif’s Automatic Reset appeared first on Quentin Santos.