Full Width [alt+shift+f] Shortcuts [alt+shift+k]
Sign Up [alt+shift+s] Log In [alt+shift+l]
11
Salutations, populations. Today’s note is more of a work-in-progress than usual; I have been finally starting to look at getting into , and there are some open questions.WhippetGuile I started by taking a look at how Guile uses the ‘s API, to make sure I had all my bases covered for an eventual switch to something that was not BDW. I think I have a good overview now, and have divided the parts of BDW-GC used by Guile into seven categories.Boehm-Demers-Weiser collector Firstly there are the ways in which Guile’s run-time and compiler depend on BDW-GC’s behavior, without actually using BDW-GC’s API. By this I mean principally that we assume that any reference to a GC-managed object from any thread’s stack will keep that object alive. The same goes for references originating in global variables, or static data segments more generally. Additionally, we rely on GC objects not to move: references to GC-managed objects in registers or stacks are valid across a GC boundary, even if those...
2 weeks ago

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from wingolog

whippet lab notebook: untagged mallocs, bis

Earlier this weekGuileWhippet But now I do! Today’s note is about how we can support untagged allocations of a few different kinds in Whippet’s .mostly-marking collector Why bother supporting untagged allocations at all? Well, if I had my way, I wouldn’t; I would just slog through Guile and fix all uses to be tagged. There are only a finite number of use sites and I could get to them all in a month or so. The problem comes for uses of from outside itself, in C extensions and embedding programs. These users are loathe to adapt to any kind of change, and garbage-collection-related changes are the worst. So, somehow, we need to support these users if we are not to break the Guile community.scm_gc_malloclibguile The problem with , though, is that it is missing an expression of intent, notably as regards tagging. You can use it to allocate an object that has a tag and thus can be traced precisely, or you can use it to allocate, well, anything else. I think we will have to add an API for the tagged case and assume that anything that goes through is requesting an untagged, conservatively-scanned block of memory. Similarly for : you could be allocating a tagged object that happens to not contain pointers, or you could be allocating an untagged array of whatever. A new API is needed there too for pointerless untagged allocations.scm_gc_mallocscm_gc_mallocscm_gc_malloc_pointerless Recall that the mostly-marking collector can be built in a number of different ways: it can support conservative and/or precise roots, it can trace the heap precisely or conservatively, it can be generational or not, and the collector can use multiple threads during pauses or not. Consider a basic configuration with precise roots. You can make tagged pointerless allocations just fine: the trace function for that tag is just trivial. You would like to extend the collector with the ability to make pointerless allocations, for raw data. How to do this?untagged Consider first that when the collector goes to trace an object, it can’t use bits inside the object to discriminate between the tagged and untagged cases. Fortunately though . Of those 8 bits, 3 are used for the mark (five different states, allowing for future concurrent tracing), two for the , one to indicate whether the object is pinned or not, and one to indicate the end of the object, so that we can determine object bounds just by scanning the metadata byte array. That leaves 1 bit, and we can use it to indicate untagged pointerless allocations. Hooray!the main space of the mostly-marking collector has one metadata byte for each 16 bytes of payloadprecise field-logging write barrier However there is a wrinkle: when Whippet decides the it should evacuate an object, it tracks the evacuation state in the object itself; the embedder has to provide an implementation of a , allowing the collector to detect whether an object is forwarded or not, to claim an object for forwarding, to commit a forwarding pointer, and so on. We can’t do that for raw data, because all bit states belong to the object, not the collector or the embedder. So, we have to set the “pinned” bit on the object, indicating that these objects can’t move.little state machine We could in theory manage the forwarding state in the metadata byte, but we don’t have the bits to do that currently; maybe some day. For now, untagged pointerless allocations are pinned. You might also want to support untagged allocations that contain pointers to other GC-managed objects. In this case you would want these untagged allocations to be scanned conservatively. We can do this, but if we do, it will pin all objects. Thing is, conservative stack roots is a kind of a sweet spot in language run-time design. You get to avoid constraining your compiler, you avoid a class of bugs related to rooting, but you can still support compaction of the heap. How is this, you ask? Well, consider that you can move any object for which we can precisely enumerate the incoming references. This is trivially the case for precise roots and precise tracing. For conservative roots, we don’t know whether a given edge is really an object reference or not, so we have to conservatively avoid moving those objects. But once you are done tracing conservative edges, any live object that hasn’t yet been traced is fair game for evacuation, because none of its predecessors have yet been visited. But once you add conservatively-traced objects back into the mix, you don’t know when you are done tracing conservative edges; you could always discover another conservatively-traced object later in the trace, so you have to pin everything. The good news, though, is that we have gained an easier migration path. I can now shove Whippet into Guile and get it running even before I have removed untagged allocations. Once I have done so, I will be able to allow for compaction / evacuation; things only get better from here. Also as a side benefit, the mostly-marking collector’s heap-conservative configurations are now faster, because we have metadata attached to objects which allows tracing to skip known-pointerless objects. This regains an optimization that BDW has long had via its , used in Guile since time out of mind.GC_malloc_atomic With support for untagged allocations, I think I am finally ready to start getting Whippet into Guile itself. Happy hacking, and see you on the other side! inside and outside on intent on data on slop fin

2 weeks ago 11 votes
tracepoints: gnarly but worth it

Hey all, quick post today to mention that I added tracing support to the . If the support library for is available when Whippet is compiled, Whippet embedders can visualize the GC process. Like this!Whippet GC libraryLTTng Click above for a full-scale screenshot of the trace explorer processing the with the on a 2.5x heap. Of course no image will have all the information; the nice thing about trace visualizers like is that you can zoom in to sub-microsecond spans to see exactly what is happening, have nice mouseovers and clicky-clickies. Fun times!Perfetto microbenchmarknboyerparallel copying collector Adding tracepoints to a library is not too hard in the end. You need to , which has a file. You need to . Then you have a that includes the header, to generate the code needed to emit tracepoints.pull in the librarylttng-ustdeclare your tracepoints in one of your header filesminimal C filepkg-config Annoyingly, this header file you write needs to be in one of the directories; it can’t be just in the the source directory, because includes it seven times (!!) using (!!!) and because the LTTng file header that does all the computed including isn’t in your directory, GCC won’t find it. It’s pretty ugly. Ugliest part, I would say. But, grit your teeth, because it’s worth it.-Ilttngcomputed includes Finally you pepper your source with tracepoints, which probably you so that you don’t have to require LTTng, and so you can switch to other tracepoint libraries, and so on.wrap in some macro I wrote up a little . It’s not as easy as , which I think is an error. Another ugly point. Buck up, though, you are so close to graphs!guide for Whippet users about how to actually get tracesperf record By which I mean, so close to having to write a Python script to make graphs! Because LTTng writes its logs in so-called Common Trace Format, which as you might guess is not very common. I have a colleague who swears by it, that for him it is the lowest-overhead system, and indeed in my case it has no measurable overhead when trace data is not being collected, but his group uses custom scripts to convert the CTF data that he collects to... (?!?!?!!).GTKWave In my case I wanted to use Perfetto’s UI, so I found a to convert from CTF to the . But, it uses an old version of Babeltrace that wasn’t available on my system, so I had to write a (!!?!?!?!!), probably the most Python I have written in the last 20 years.scriptJSON-based tracing format that Chrome profiling used to usenew script Yes. God I love blinkenlights. As long as it’s low-maintenance going forward, I am satisfied with the tradeoffs. Even the fact that I had to write a script to process the logs isn’t so bad, because it let me get nice nested events, which most stock tracing tools don’t allow you to do. I fixed a small performance bug because of it – a . A win, and one that never would have shown up on a sampling profiler too. I suspect that as I add more tracepoints, more bugs will be found and fixed.worker thread was spinning waiting for a pool to terminate instead of helping out I think the only thing that would be better is if tracepoints were a part of Linux system ABIs – that there would be header files to emit tracepoint metadata in all binaries, that you wouldn’t have to link to any library, and the actual tracing tools would be intermediated by that ABI in such a way that you wouldn’t depend on those tools at build-time or distribution-time. But until then, I will take what I can get. Happy tracing! on adding tracepoints using the thing is it worth it? fin

a month ago 15 votes
whippet at fosdem

Hey all, the video of my is up:FOSDEM talk on Whippet Slides , if that’s your thing.here I ended the talk with some puzzling results around generational collection, which prompted . I don’t have a firm answer yet. Or rather, perhaps for the splay benchmark, it is to be expected that a generational GC is not great; but there are other benchmarks that also show suboptimal throughput in generational configurations. Surely it is some tuning issue; I’ll be looking into it.yesterday’s post Happy hacking!

a month ago 17 votes
baffled by generational garbage collection

Usually in this space I like to share interesting things that I find out; you might call it a research-epistle-publish loop. Today, though, I come not with answers, but with questions, or rather one question, but with fractal surface area: what is the value proposition of generational garbage collection? The conventional wisdom is encapsulated in a 2004 Blackburn, Cheng, and McKinley paper, , which compares whole-heap mark-sweep and copying collectors to their generational counterparts, using the Jikes RVM as a test harness. (It also examines a generational reference-counting collector, which is an interesting predecessor to the 2022 work by Zhao, Blackburn, and McKinley.)“Myths and Realities: The Performance Impact of Garbage Collection”LXR The paper finds that generational collectors spend less time than their whole-heap counterparts for a given task. This is mainly due to less time spent collecting, because generational collectors avoid tracing/copying work for older objects that mostly stay in the same place in the live object graph. The paper also notes an improvement for mutator time under generational GC, but only for the generational mark-sweep collector, which it attributes to the locality and allocation speed benefit of bump-pointer allocation in the nursery. However for copying collectors, generational GC tends to slow down the mutator, probably because of the write barrier, but in the end lower collector times still led to lower total times. So, I expected generational collectors to always exhibit lower wall-clock times than whole-heap collectors. In , I have a garbage collector with an abstract API that specializes at compile-time to the mutator’s object and root-set representation and to the collector’s allocator, write barrier, and other interfaces. I embed it in , a simple Scheme-to-C compiler that can run some small multi-threaded benchmarks, for example the classic Gabriel benchmarks. We can then test those benchmarks against different collectors, mutator (thread) counts, and heap sizes. I expect that the generational parallel copying collector takes less time than the whole-heap parallel copying collector.whippetwhiffle So, I ran some benchmarks. Take the splay-tree benchmark, derived from Octane’s . I have a port to Scheme, and the results are... not good!splay.js In this graph the “pcc” series is the whole-heap copying collector, and “generational-pcc” is the generational counterpart, with a nursery sized such that after each collection, its size is 2 MB times the number of active mutator threads in the last collector. So, for this test with eight threads, on my 8-core Ryzen 7 7840U laptop, the nursery is 16MB including the copy reserve, which happens to be the same size as the L3 on this CPU. New objects are kept in the nursery one cycle before being promoted to the old generation. There are also results for , which use an Immix-derived algorithm that allows for bump-pointer allocation but which doesn’t require a copy reserve. There, the generational collectors use a , which has very different performance characteristics as promotion is in-place, and the nursery is as large as the available heap size.“mmc” and “generational-mmc” collectorssticky mark-bit algorithm The salient point is that at all heap sizes, and for these two very different configurations (mmc and pcc), generational collection takes more time than whole-heap collection. It’s not just the splay benchmark either; I see the same thing for the very different . What is the deal?nboyer benchmark I am honestly quite perplexed by this state of affairs. I wish I had a narrative to tie this together, but in lieu of that, voici some propositions and observations. Sometimes people say that the reason generational collection is good is because you get bump-pointer allocation, which has better locality and allocation speed. This is misattribution: it’s bump-pointer allocators that have these benefits. You can have them in whole-heap copying collectors, or you can have them in whole-heap mark-compact or immix collectors that bump-pointer allocate into the holes. Or, true, you can have them in generational collectors with a copying nursery but a freelist-based mark-sweep allocator. But also you can have generational collectors without bump-pointer allocation, for free-list sticky-mark-bit collectors. To simplify this panorama to “generational collectors have good allocators” is incorrect. It’s true, generational GC does lower median pause times: But because a major collection is usually slightly more work under generational GC than in a whole-heap system, because of e.g. the need to reset remembered sets, the maximum pauses are just as big and even a little bigger: I am not even sure that it is meaningful to compare median pause times between generational and non-generational collectors, given that the former perform possibly orders of magnitude more collections than the latter. Doing fewer whole-heap traces is good, though, and in the ideal case, the less frequent major traces under generational collectors allows time for concurrent tracing, which is the true mitigation for long pause times. Could it be that the test harness I am using is in some way unrepresentative? I don’t have more than one test harness for Whippet yet. I will start work on a second Whippet embedder within the next few weeks, so perhaps we will have an answer there. Still, there is ample time spent in GC pauses in these benchmarks, so surely as a GC workload Whiffle has some utility. One reasons that Whiffle might be unrepresentative is that it is an ahead-of-time compiler, whereas nursery addresses are assigned at run-time. Whippet exposes the necessary information to allow a just-in-time compiler to specialize write barriers, for example the inline check that the field being mutated is not in the nursery, and an AOT compiler can’t encode this as an immediate. But it seems a small detail. Also, Whiffle doesn’t do much compiler-side work to elide write barriers. Could the cost of write barriers be over-represented in Whiffle, relative to a production language run-time? Relatedly, Whiffle is just a baseline compiler. It does some partial evaluation but no CFG-level optimization, no contification, no nice closure conversion, no specialization, and so on: is it not representative because it is not an optimizing compiler? How big should the nursery be? I have no idea. As a thought experiment, consider the case of a 1 kilobyte nursery. It is probably too small to allow the time for objects to die young, so the survival rate at each minor collection would be high. Above a certain survival rate, generational GC is probably a lose, because your program violates the weak generational hypothesis: it introduces a needless copy for all survivors, and a synchronization for each minor GC. On the other hand, a 1 GB nursery is probably not great either. It is plenty large enough to allow objects to die young, but the number of survivor objects in a space that large is such that pause times would not be very low, which is one of the things you would like in generational GC. Also, you lose out on locality: a significant fraction of the objects you traverse are probably out of cache and might even incur TLB misses. So there is probably a happy medium somewhere. My instinct is that for a copying nursery, you want to make it about as big as L3 cache, which on my 8-core laptop is 16 megabytes. Systems are different sizes though; in Whippet my current heuristic is to reserve 2 MB of nursery per core that was active in the previous cycle, so if only 4 threads are allocating, you would have a 8 MB nursery. Is this good? I don’t know. I don’t have a very large set of benchmarks that run on Whiffle, and they might not be representative. I mean, they are microbenchmarks. One question I had was about heap sizes. If a benchmark’s maximum heap size fits in L3, which is the case for some of them, then probably generational GC is a wash, because whole-heap collection stays in cache. When I am looking at benchmarks that evaluate generational GC, I make sure to choose those that exceed L3 size by a good factor, for example the 8-mutator splay benchmark in which minimum heap size peaks at 300 MB, or the 8-mutator nboyer-5 which peaks at 1.6 GB. But then, should nursery size scale with total heap size? I don’t know! Incidentally, the way that I scale these benchmarks to multiple mutators is a bit odd: they are serial benchmarks, and I just run some number of threads at a time, and scale the heap size accordingly, assuming that the minimum size when there are 4 threads is four times the minimum size when there is just one thread. However, , in the sense that there is no heap size under which they fail and above which they succeed; I quote:multithreaded programs are unreliable A generational collector partitions objects into old and new sets, and a minor collection starts by visiting all old-to-new edges, called the “remembered set”. As the program runs, mutations to old objects might introduce new old-to-new edges. To maintain the remembered set in a generational collector, the mutator invokes : little bits of code that run when you mutate a field in an object. This is overhead relative to non-generational configurations, where the mutator doesn’t have to invoke collector code when it sets fields.write barriers So, could it be that Whippet’s write barriers or remembered set are somehow so inefficient that my tests are unrepresentative of the state of the art? I used to use card-marking barriers, but I started to suspect they cause too much overhead during minor GC and introduced too much cache contention. I switched to some months back for Whippet’s Immix-derived space, and we use the same kind of barrier in the generational copying (pcc) collector. I think this is state of the art. I need to see if I can find a configuration that allows me to measure the overhead of these barriers, independently of other components of a generational collector.precise field-logging barriers A few months ago, my only generational collector used the algorithm, which is an unconventional configuration: its nursery is not contiguous, non-moving, and can be as large as the heap. This is part of the reason that I implemented generational support for the parallel copying collector, to have a different and more conventional collector to compare against. But generational collection loses on some of these benchmarks in both places!sticky mark-bit On one benchmark which repeatedly constructs some trees and then verifies them, I was seeing terrible results for generational GC, which I realized were because of cooperative safepoints: generational GC collects more often, so it requires that all threads reach safepoints more often, and the non-allocating verification phase wasn’t emitting any safepoints. I had to change the compiler to emit safepoints at regular intervals (in my case, on function entry), and it sped up the generational collector by a significant amount. This is one instance of a general observation, which is that any work that doesn’t depend on survivor size in a GC pause is more expensive with a generational collector, which runs more collections. Synchronization can be a cost. I had one bug in which tracing ephemerons did work proportional to the size of the whole heap, instead of the nursery; I had to specifically add generational support for the way Whippet deals with ephemerons during a collection to reduce this cost. Looking deeper at the data, I have partial answers for the splay benchmark, and they are annoying :) Splay doesn’t actually allocate all that much garbage. At a 2.5x heap, the stock parallel MMC collector (in-place, sticky mark bit) collects... one time. That’s all. Same for the generational MMC collector, because the first collection is always major. So at 2.5x we would expect the generational collector to be slightly slower. The benchmark is simply not very good – or perhaps the most generous interpretation is that it represents tasks that allocate 40 MB or so of long-lived data and not much garbage on top. Also at 2.5x heap, the whole-heap copying collector runs 9 times, and the generational copying collector does 293 minor collections and... 9 major collections. We are not reducing the number of major GCs. It means either the nursery is too small, so objects aren’t dying young when they could, or the benchmark itself doesn’t conform to the weak generational hypothesis. At a 1.5x heap, the copying collector doesn’t have enough space to run. For MMC, the non-generational variant collects 7 times, and generational MMC times out. Timing out indicates a bug, I think. Annoying! I tend to think that if I get results and there were fewer than, like, 5 major collections for a whole-heap collector, that indicates that the benchmark is probably inapplicable at that heap size, and I should somehow surface these anomalies in my analysis scripts. Doing a similar exercise for nboyer at 2.5x heap with 8 threads (4GB for 1.6GB live data), I see that pcc did 20 major collections, whereas generational pcc lowered that to 8 major collections and 3471 minor collections. Could it be that there are still too many fixed costs associated with synchronizing for global stop-the-world minor collections? I am going to have to add some fine-grained tracing to find out. I just don’t know! I want to believe that generational collection was an out-and-out win, but I haven’t yet been able to prove it is true. I do have some homework to do. I need to find a way to test the overhead of my write barrier – probably using the MMC collector and making it only do major collections. I need to fix generational-mmc for splay and a 1.5x heap. And I need to do some fine-grained performance analysis for minor collections in large heaps. Enough for today. Feedback / reactions very welcome. Thanks for reading and happy hacking! hypothesis test workbench results? “generational collection is good because bump-pointer allocation” “generational collection lowers pause times” is it whiffle? is it something about the nursery size? is it something about the benchmarks? is it the write barrier? is it something about the generational mechanism? is it something about collecting more often? is it something about collection frequency? collecting more often redux conclusion?

a month ago 23 votes

More in programming

Reduced Hours and Remote Work Options for Employees with Young Children in Japan

Japan already stipulates that employers must offer the option of reduced working hours to employees with children under three. However, the Child Care and Family Care Leave Act was amended in May 2024, with some of the new provisions coming into effect April 1 or October 1, 2025. The updates to the law address: Remote work Flexible start and end times Reduced hours On-site childcare facilities Compensation for lost salary And more Legal changes are one thing, of course, and social changes are another. Though employers are mandated to offer these options, how many employees in Japan actually avail themselves of these benefits? Does doing so create any stigma or resentment? Recent studies reveal an unsurprising gender disparity in accepting a modified work schedule, but generally positive attitudes toward these accommodations overall. The current reduced work options Reduced work schedules for employees with children under three years old are currently regulated by Article 23(1) of the Child Care and Family Care Leave Act. This Article stipulates that employers are required to offer accommodations to employees with children under three years old. Those accommodations must include the opportunity for a reduced work schedule of six hours a day. However, if the company is prepared to provide alternatives, and if the parent would prefer, this benefit can take other forms—for example, working seven hours a day or working fewer days per week. Eligible employees for the reduced work schedule are those who: Have children under three years old Normally work more than six hours a day Are not employed as day laborers Are not on childcare leave during the period to which the reduced work schedule applies Are not one of the following, which are exempted from the labor-management agreement Employees who have been employed by the company for less than one year Employees whose prescribed working days per week are two days or less Although the law requires employers to provide reduced work schedules only while the child is under three years old, some companies allow their employees with older children to work shorter hours as well. According to a 2020 survey by the Ministry of Health, Labor and Welfare, 15.8% of companies permit their employees to use the system until their children enter primary school, while 5.7% allow it until their children turn nine years old or enter third grade. Around 4% offer reduced hours until children graduate from elementary school, and 15.4% of companies give the option even after children have entered middle school. If, considering the nature or conditions of the work, it is difficult to give a reduced work schedule to employees, the law stipulates other measures such as flexible working hours. This law has now been altered, though, to include other accommodations. Updates to The Child Care and Family Care Leave Act Previously, remote work was not an option for employees with young children. Now, from April 1, 2025, employers must make an effort to allow employees with children under the age of three to work remotely if they choose. From October 1, 2025, employers are also obligated to provide two or more of the following measures to employees with children between the ages of three and the time they enter elementary school. An altered start time without changing the daily working hours, either by using a flex time system or by changing both the start and finish time for the workday The option to work remotely without changing daily working hours, which can be used 10 or more days per month Company-sponsored childcare, by providing childcare facilities or other equivalent benefits (e.g., arranging for babysitters and covering the cost) 10 days of leave per year to support employees’ childcare without changing daily working hours A reduced work schedule, which must include the option of 6-hour days How much it’s used in practice Of course, there’s always a gap between what the law specifies, and what actually happens in practice. How many parents typically make use of these legally-mandated accommodations, and for how long? The numbers A survey conducted by the Ministry of Health, Labor and Welfare in 2020 studied uptake of the reduced work schedule among employees with children under three years old. In this category, 40.8% of female permanent employees (正社員, seishain) and 21.6% of women who were not permanent employees answered that they use, or had used, the reduced work schedule. Only 12.3% of male permanent employees said the same. The same survey was conducted in 2022, and researchers found that the gap between female and male employees had actually widened. According to this second survey, 51.2% of female permanent employees and 24.3% of female non-permanent employees had reduced their hours, compared to only 7.6% of male permanent employees. Not only were fewer male employees using reduced work programs, but 41.2% of them said they did not intend to make use of them. By contrast, a mere 15.6% of female permanent employees answered they didn’t wish to claim the benefit. Of those employees who prefer the shorter schedule, how long do they typically use the benefit? The following charts, using data from the 2022 survey, show at what point those employees stop reducing their hours and return to a full-time schedule.   Female permanent employees Female non-permanent employees Male permanent employees Male non-permanent employees Until youngest child turns 1 13.7% 17.9% 50.0% 25.9% Until youngest child turns 2 11.5% 7.9% 14.5% 29.6% Until youngest child turns 3 23.0% 16.3% 10.5% 11.1% Until youngest child enters primary school 18.9% 10.5% 6.6% 11.1% Sometime after the youngest child enters primary school 22.8% 16.9% 6.5% 11.1% Not sure 10% 30.5% 11.8% 11.1% From the companies’ perspectives, according to a survey conducted by the Cabinet Office in 2023, 65.9% of employers answered that their reduced work schedule system is fully used by their employees. What’s the public perception? Some fear that the number of people using the reduced work program—and, especially, the number of women—has created an impression of unfairness for those employees who work full-time. This is a natural concern, but statistics paint a different picture. In a survey of 300 people conducted in 2024, 49% actually expressed a favorable opinion of people who work shorter hours. Also, 38% had “no opinion” toward colleagues with reduced work schedules, indicating that 87% total don’t negatively view those parents who work shorter hours. While attitudes may vary from company to company, the public overall doesn’t seem to attach any stigma to parents who reduce their work schedules. Is this “the Mommy Track”? Others are concerned that working shorter hours will detour their career path. According to this report by the Ministry of Health, Labour and Welfare, 47.6% of male permanent employees indicated that, as the result of working fewer hours, they had been changed to a position with less responsibility. The same thing happened to 65.6% of male non-permanent employees, and 22.7% of female permanent employees. Therefore, it’s possible that using the reduced work schedule can affect one’s immediate chances for advancement. However, while 25% of male permanent employees and 15.5% of female permanent employees said the quality and importance of the work they were assigned had gone down, 21.4% of male and 18.1% of female permanent employees said the quality had gone up. Considering 53.6% of male and 66.4% of female permanent employees said it stayed the same, there seems to be no strong correlation between reducing one’s working hours, and being given less interesting or important tasks. Reduced work means reduced salary These reduced work schedules usually entail dropping below the originally-contracted work hours, which means the employer does not have to pay the employee for the time they did not work. For example, consider a person who normally works 8 hours a day reducing their work time to 6 hours a day (a 25% reduction). If their monthly salary is 300,000 yen, it would also decrease accordingly by 25% to 225,000 yen. Previously, both men and women have avoided reduced work schedules, because they do not want to lose income. As more mothers than fathers choose to work shorter hours, this financial burden tends to fall more heavily on women. To address this issue, childcare short-time employment benefits (育児時短就業給付) will start from April 2025. These benefits cover both male and female employees who work shorter hours to care for a child under two years old, and pay a stipend equivalent to 10% of their adjusted monthly salary during the reduced work schedule. Returning to the previous example, this stipend would grant 10% of the reduced salary, or 22,500 yen per month, bringing the total monthly paycheck to 247,500 yen, or 82.5% of the normal salary. This additional stipend, while helpful, may not be enough to persuade some families to accept shorter hours. The childcare short-time employment benefits are available to employees who meet the following criteria: The person is insured, and is working shorter hours to care for a child under two years old. The person started a reduced work schedule immediately after using the childcare leave covered by childcare leave benefits, or the person has been insured for 12 months in the two years prior to the reduced work schedule. Conclusion Japan’s newly-mandated options for reduced schedules, remote work, financial benefits, and other childcare accommodations could help many families in Japan. However, these programs will only prove beneficial if enough employees take advantage of them. As of now, there’s some concern that parents who accept shorter schedules could look bad or end up damaging their careers in the long run. Statistically speaking, some of the news is good: most people view parents who reduce their hours either positively or neutrally, not negatively. But other surveys indicate that a reduction in work hours often equates to a reduction in responsibility, which could indeed have long-term effects. That’s why it’s important for more parents to use these accommodations freely. Not only will doing so directly benefit the children, but it will also lessen any negative stigma associated with claiming them. This is particularly true for fathers, who can help even the playing field for their female colleagues by using these perks just as much as the mothers in their offices. And since the state is now offering a stipend to help compensate for lost income, there’s less and less reason not to take full advantage of these programs.

8 hours ago 2 votes
Big endian and little endian

Every time I run into endianness, I have to look it up. Which way do the bytes go, and what does that mean? Something about it breaks my brain, and makes me feel like I can't tell which way is up and down, left and right. This is the blog post I've needed every time I run into this. I hope it'll be the post you need, too. What is endianness? The term comes from Gulliver's travels, referring to a conflict over cracking boiled eggs on the big end or the little end[1]. In computers, the term refers to the order of bytes within a segment of data, or a word. Specifically, it only refers to the order of bytes, as those are the smallest unit of addressable data: bits are not individually addressable. The two main orderings are big-endian and little-endian. Big-endian means you store the "big" end first: the most-significant byte (highest value) goes into the smallest memory address. Little-endian means you store the "little" end first: the least-significant byte (smallest value) goes into the smallest memory address. Let's look at the number 168496141 as an example. This is 0x0A0B0C0D in hex. If we store 0x0A at address a, 0x0B at a+1, 0x0C at a+2, and 0x0D at a+3, then this is big-endian. And then if we store it in the other order, with 0x0D at a and 0x0A at a+3, it's little-endian. And... there's also mixed-endianness, where you use one kind within a word (say, little-endian) and a different ordering for words themselves (say, big-endian). If our example is on a system that has 2-byte words (for the sake of illustration), then we could order these bytes in a mixed-endian fashion. One possibility would be to put 0x0B in a, 0x0A in a+1, 0x0D in a+2, and 0x0C in a+3. There are certainly reasons to do this, and it comes up on some ARM processors, but... it feels so utterly cursed. Let's ignore it for the rest of this! For me, the intuitive ordering is big-ending, because it feels like it matches how we read and write numbers in English[2]. If lower memory addresses are on the left, and higher on the right, then this is the left-to-right ordering, just like digits in a written number. So... which do I have? Given some number, how do I know which endianness it uses? You don't, at least not from the number entirely by itself. Each integer that's valid in one endianness is still a valid integer in another endianness, it just is a different value. You have to see how things are used to figure it out. Or you can figure it out from the system you're using (or which wrote the data). If you're using an x86 or x64 system, it's mostly little-endian. (There are some instructions which enable fetching/writing in a big-endian format.) ARM systems are bi-endian, allowing either. But perhaps the most popular ARM chips today, Apple silicon, are little-endian. And the major microcontrollers I checked (AVR, ESP32, ATmega) are little-endian. It's thoroughly dominant commercially! Big-endian systems used to be more common. They're not really in most of the systems I'm likely to run into as a software engineer now, though. You are likely to run into it for some things, though. Even though we don't use big-endianness for processor math most of the time, we use it constantly to represent data. It comes back in networking! Most of the Internet protocols we know and love, like TCP and IP, use "network order" which means big-endian. This is mentioned in RFC 1700, among others. Other protocols do also use little-endianness again, though, so you can't always assume that it's big-endian just because it's coming over the wire. So... which you have? For your processor, probably little-endian. For data written to the disk or to the wire: who knows, check the protocol! Why do we do this??? I mean, ultimately, it's somewhat arbitrary. We have an endianness in the way we write, and we could pick either right-to-left or left-to-right. Both exist, but we need to pick one. Given that, it makes sense that both would arise over time, since there's no single entity controlling all computer usage[3]. There are advantages of each, though. One of the more interesting advantages is that little-endianness lets us pretend integers are whatever size we like, within bounds. If you write the number 26[4] into memory on a big-endian system, then read bytes from that memory address, it will represent different values depending on how many bytes you read. The length matters for reading in and interpreting the data. If you write it into memory on a little-endian system, though, and read bytes from the address (with the remaining ones zero, very important!), then it is the same value no matter how many bytes you read. As long as you don't truncate the value, at least; 0x0A0B read as an 8-bit int would not be equal to being read as a 16-bit ints, since an 8-bit int can't hold the entire thing. This lets you read a value in the size of integer you need for your calculation without conversion. On the other hand, big-endian values are easier to read and reason about as a human. If you dump out the raw bytes that you're working with, a big-endian number can be easier to spot since it matches the numbers we use in English. This makes it pretty convenient to store values as big-endian, even if that's not the native format, so you can spot things in a hex dump more easily. Ultimately, it's all kind of arbitrary. And it's a pile of standards where everything is made up, nothing matters, and the big-end is obviously the right end of the egg to crack. You monster. The correct answer is obviously the big end. That's where the little air pocket goes. But some people are monsters... ↩ Please, please, someone make a conlang that uses mixed-endian inspired numbers. ↩ If ever there were, maybe different endianness would be a contentious issue. Maybe some of our systems would be using big-endian but eventually realize their design was better suited to little-endian, and then spend a long time making that change. And then the government would become authoritarian on the promise of eradicating endianness-affirming care and—Oops, this became a metaphor. ↩ 26 in hex is 0x1A, which is purely a coincidence and not a reference to the First Amendment. This is a tech blog, not political, and I definitely stay in my lane. If it were a reference, though, I'd remind you to exercise their 1A rights[5] now and call your elected officials to ensure that we keep these rights. I'm scared, and I'm staring down the barrel of potential life-threatening circumstances if things get worse. I expect you're scared, too. And you know what? Bravery is doing things in spite of your fear. ↩ If you live somewhere other than the US, please interpret this as it applies to your own country's political process! There's a lot of authoritarian movement going on in the world, and we all need to work together for humanity's best, most free[6] future. ↩ I originally wrote "freest" which, while spelled correctly, looks so weird that I decided to replace it with "most free" instead. ↩

12 hours ago 1 votes
The Tragic Case of Intel AI

Intel is sitting on a huge amount of card inventory they can’t move, largely because of bad software. Most of this is a summary of the public #intel-hardware channel in the tinygrad discord. Intel currently is sitting on: 15,000 Gaudi 2 cards (with baseboards) 5,100 Intel Data Center GPU Max 1450s (without baseboards) If you were Intel, what would you do with them? First, starting with the Gaudi cards. The open source repo needed to control them was archived on Feb 4, 2025. There’s a closed source version of this that’s maybe still maintained, but eww closed source and do you think it’s really maintained? The architecture is kind of tragic, and that’s likely why they didn’t open source it. Unlike every other accelerator I have seen, the MMEs, which is where all the FLOPS are, are not controllable by the TPCs. While the TPCs have an LLVM port, the MME is not documented. After some poking around, I found the spec: It’s highly fixed function, looks very similar to the Apple ANE. But that’s not even the real problem with it. The problem is that it is controlled by queues, not by the TPCs. Unpacking habanalabs-dkms-1.19.2-32.all.deb you can find the queues. There is some way to push a command stream to the device so you don’t actually have to deal with the host itself for the queues. But that doesn’t prevent you having to decompose the network you are trying to run into something you can put on this fixed function block. Programmability is on a spectrum, ranging from CPUs being the easiest, to GPUs, to things like the Qualcomm DSP / Google TPU (where at least you drive the MME from the program), to this and the Apple ANE being the hardest. While it’s impressive that they actually got on MLPerf Training v4.0 training GPT3, I suspect it’s all hand coded, and if you even can deviate off the trodden path you’ll get almost no perf. Accelerators like this are okay for low power inference where you can adjust the model architecture for the target, Apple does a great job of this. But this will never be acceptable for a training chip. Then there’s the Data Center GPU Max 1450. Intel actually sent us a few of these. You quickly run into a problem…how do you plug them in? They need OAM sockets, 48V power, and a cooling solution that can sink 600W. As far as I can tell, they were only ever deployed in two systems, the Aurora Supercomputer and the Dell XE9640. It’s hard to know, but I really doubt many of these Dell systems were sold. Intel then sent us this carrier board. In some ways it’s helpful, but in other ways it’s not at all. It still doesn’t solve cooling or power, and you need to buy 16x MCIO cables (cheap in quantity, but expensive and hard to find off the shelf). Also, I never got a straight answer, but I really doubt Intel has many of these boards. And that board doesn’t look cheap to manufacturer more of. The connectors alone, which you need two of per GPU, cost $26 each. That’s $104 for just the OAM connectors. tiny corp was in discussions to buy these GPUs. How much would you pay for one of these on a PCIe card? The specs look great. 839 TFLOPS, 128 GB of ram, 3.3 TB/s of bandwidth. However…read this article. Even in simple synthetic benchmarks, the chip doesn’t get anywhere near its max performance, and it looks to be for fundamental reasons like memory latency. We estimate we could sell PCIe versions of these GPUs for $1,000; I don’t think most people know how hard it is to move non NVIDIA hardware. Before you say you’d pay more, ask yourself, do you really want to deal with the software? An adapter card has four pieces. A PCB for the card, a 12->48V voltage converter, a heatsink, and a fan. My quote from the guy who makes an OAM adapter board was $310 for 10+ PCBs and $75 for the voltage converter. A heatsink that can handle 600W (heat pipes + vapor chamber) is going to cost $100, then maybe $20 more for the fan. That’s $505, and you still need to assemble and test them, oh and now there’s tariffs. Maybe you can get this down to $400 in ~1000 quantity. So $200 for the GPU, $400 for the adapter, $100 for shipping/fulfillment/returns (more if you use Amazon), and 30% profit if you sell at $1k. tiny would net $1M on this, which has to cover NRE and you have risk of unsold inventory. We offered Intel $200 per GPU (a $680k wire) and they said no. They wanted $600. I suspect that unless a supercomputer person who already uses these GPUs wants to buy more, they will ride it to zero. tl;dr: there’s 5100 of these GPUs with no simple way to plug them in. It’s unclear if they worth the cost of the slot they go in. I bet they end up shredded, or maybe dumped on eBay for $50 each in a year like the Xeon Phi cards. If you buy one, good luck plugging it in! The reason Meta and friends buy some AMD is as a hedge against NVIDIA. Even if it’s not usable, AMD has progressed on a solid steady roadmap, with a clear continuation from the 2018 MI50 (which you can now buy for 99% off), to the MI325X which is a super exciting chip (AMD is king of chiplets). They are even showing signs of finally investing in software, which makes me bullish. If NVIDIA stumbles for a generation, this is AMD’s game. The ROCm “copy each NVIDIA repo” strategy actually works if your competition stumbles. They can win GPUs with slow and steady improvement + competition stumbling, that’s how AMD won server CPUs. With these Intel chips, I’m not sure who they would appeal to. Ponte Vecchio is cancelled. There’s no point in investing in the platform if there’s not going to be a next generation, and therefore nobody can justify the cost of developing software, therefore there won’t be software, therefore they aren’t worth plugging in. Where does this leave Intel’s AI roadmap? The successor to Ponte Vecchio was Rialto Bridge, but that was cancelled. The successor to that was Falcon Shores, but that was also cancelled. Intel claims the next GPU will be “Jaguar Shores”, but fool me once… To quote JazzLord1234 from reddit “No point even bothering to listen to their roadmaps anymore. They have squandered all their credibility.” Gaudi 3 is a flop due to “unbaked software”, but as much as I usually do blame software, nothing has changed from Gaudi 2 and it’s just a really hard chip to program for. So there’s no future there either. I can’t say that “Jaguar Shores” square instills confidence. It didn’t inspire confidence for “Joseph B.” on LinkedIn either. From my interactions with Intel people, it seems there’s no individuals with power there, it’s all committee like leadership. The problem with this is there’s nobody who can say yes, just many people who can say no. Hence all the cancellations and the nonsense strategy. AMD’s dysfunction is different. from the beginning they had leadership that can do things (Lisa Su replied to my first e-mail), they just didn’t see the value in investing in software until recently. They sort of had a point if they were only targeting hyperscalars. but it seems like SemiAnalysis got through to them that hyperscalars aren’t going to deal with bad software either. It remains to be seem if they can shift culture to actually deliver good software, but there’s movement in that direction, and if they succeed AMD is so undervalued. Their hardware is good. With Intel, until that committee style leadership is gone, there’s 0 chance for success. Committee leadership is fine if you are trying to maintain, but Intel’s AI situation is even more hopeless than AMDs, and you’d need something major to turn it around. At least with AMD, you can try installing ROCm and be frustrated when there are bugs. Every time I have tried Intel’s software I can’t even recall getting the import to work, and the card wasn’t powerful enough that I cared. Intel needs actual leadership to turn this around, or there’s 0 future in Intel AI.

20 hours ago 1 votes
All pretty models are wrong, but some ugly models are useful

Identifying useful frameworks for companies, strategy, markets, and organizations, instead of those that just look pretty in PowerPoint.

yesterday 3 votes
Self-avoiding Walk

I’m a bit late to this, but back in summer 2024 I participated in the OST Composing Jam. The goal of this jam is to compose an original soundtrack (minimum of 3 minutes) of any style for an imaginary game. While I’ve composed a lot of video game music, I’ve never created an entire soundtrack around a single concept. Self Avoiding Walk by Daniel Marino To be honest, I wasn’t entirely sure where to start. I was torn between trying to come up with a story for a game to inspire the music, and just messing around with some synths and noodling on the keyboard. I did a little bit of both, but nothing really materialized. Synth + Metal ≈ Synthmetal Feeling a bit paralyzed, I fired up the ’ole RMG sequencer for inspiration. I saved a handful of randomized melodies and experimented with them in Reaper. After a day or two I landed on something I liked which was about the first 30 seconds or so of the second track: "Defrag." I love metal bands like Tesseract, Periphery, The Algorithm, Car Bomb, and Meshuggah. I tried experimenting with incorporating syncopated guttural guitar sounds with the synths. After several more days I finished "Defrag"—which also included "Kernel Panic" before splitting that into its own track. I didn’t have a clue what to do next, nor did I have a concept. Composing the rest of the music was a bit of a blur because I bounced around from song to song—iterating on the leitmotif over and over with different synths, envelopes, time signatures, rhythmic displacement, pitch shifting, and tweaking underlying chord structures. Production The guitars were recorded using DI with my Fender Squire and Behringer Interface. I’m primarily using the ML Sound Labs Amped Roots Free amp sim because the metal presets are fantastic and rarely need much fuss to get it sounding good. I also used Blue Cat Audio free amp sim for clean guitars. All the other instruments were MIDI tracks either programmed via piano roll or recorded with my Arturia MiniLab MKII. I used a variety of synth effects from my library of VSTs. I recorded this music before acquiring my Fender Squire Bass guitar, so bass was also programmed. Theme and Story At some point I had five songs that all sounded like they could be from the same game. The theme for this particular jam was "Inside my world." I had to figure out how I could write a story that corresponded with the theme and could align with the songs. I somehow landed on the idea of the main actor realizing his addiction to AI, embarking on a journey to "unplug." The music reflects his path to recovery, capturing the emotional and psychological evolution as he seeks to overcome his dependency. After figuring this out, I thought it would be cool to name all the songs using computer terms that could be metaphors for the different stages of recovery. Track listing Worm – In this dark and haunting opening track, the actor grapples with his addiction to AI, realizing he can no longer think independently. Defrag – This energetic track captures the physical and emotional struggles of the early stages of recovery. Kernel Panic – Menacing and eerie, this track portrays the actor’s anxiety and panic attacks as he teeters on the brink during the initial phases of recovery. Dæmons – With initial healing achieved, the real challenge begins. The ominous and chaotic melodies reflect the emotional turbulence the character endures. Time to Live – The actor, having come to terms with himself, experiences emotional growth. The heroic climax symbolizes the realization that recovery is a lifelong journey. Album art At the time I was messing around with Self-avoiding walks in generative artwork explorations. I felt like the whole concept of avoiding the self within the context of addiction and recovery metaphorically worked. So I tweaked some algorithms and generated the self-avoiding walk using JavaScript and the P5.js library. I then layered the self-avoiding walk over a photo I found visually interesting on Unsplash using a CSS blend mode. Jam results I placed around the top 50% out of over 600 entries. I would have liked to have placed higher, but despite my ranking, I thoroughly enjoyed composing the music! I’m very happy with the music, its production quality, and I also learned a lot. I would certainly participate in this style of composition jam again!

yesterday 3 votes