Introduction

I made an offhand remark about technical debt to a friend and he interrupted me, saying: "technical debt is just bullshit". In his experience, people talking about technical debt were mostly trying to cover up bad code or cover up unfinished work (source1). Calling these issues 'technical debt' seems to be a tactic of distancing oneself from these problems. A nice way of avoiding responsibility. To sweep things under the rug. Intrigued, I decided to take a better look at the metaphor of technical debt, to better understand what is actually meant. Tip: this article on Medium by David Vandegrift also tackles this topic.

A definition of technical debt

Right off the bat, I realised that my own understanding of technical debt was wrong. Most people seem to understand technical debt as: "cut a corner now, to capture short-term business value (taking on debt), and clean up later (repaying the debt)". I think that's wrong. Ward Cunningham, who coined the metaphor of technical...
over a year ago


More from Louwrentius

Bose SoundLink on-ear headphones battery replacement

Skip to the bottom two paragraphs for instructions on how to replace the battery.

I bought my Bose SoundLink on-ear Bluetooth headphones for 250 Euros around 2017 and I really like them. They are small, light, comfortable and can easily fit in a coat pocket when folded. Up until now (about 7 years later) I have replaced the ear cushions in 2019 (€25) and 2024 (€18). Early 2025, battery capacity had deteriorated to a point where it became noticeable. The battery was clearly dying.

Unfortunately these headphones aren't designed for easy battery replacement:

- Bose hasn't published instructions on how to replace the battery, doesn't offer a replacement battery and hasn't documented which battery type/model is used.
- The left 'head phone' has two Torx security screws and most people won't have the appropriate screwdriver for this size
- There is soldering involved

I wanted to try a battery replacement anyway as I hate to throw away a perfectly good, working product just because the battery has worn out. Maybe at some point the headband needs replacing, but with a fresh battery, these headphones can last another 7 years. Let's prevent a bit of e-waste with a little bit of cost and effort. Most of all, the cost of this battery replacement is much lower than a new pair of headphones as the battery was €18 including taxes and shipping.

Right to repair should include easy battery replacement

Although my repair seemed to have worked out fine, it requires enough effort that most people won't even try. For this reason, I feel that it should be mandatory by law that:

- Batteries in any product must be user-replaceable (no special equipment or soldering required)
- Batteries must be provided by the vendor until 10 years after the last day the product was sold (unless it's a standard format like AA(A) or 18650).
- Batteries must be provided at max 10% of the cost of the original product
- The penalty for non-compliance should be high enough such that it won't be regarded as the cost of doing business

For that matter, all components that may wear down over time should be user-replaceable.

What you need to replace the battery

- Buy the exact battery type: ahb571935pct-01 (350mAh) (notice the three wires!)
- A Phillips #0 screwdriver / bit
- A Torx T6H security screwdriver / bit (iFixit kits have them)
- A soldering iron
- Solder
- Heat shrink for 'very thin wire'
- Multimeter (optional)
- A bit of tape to 'cap off' bare battery leads

Please note that I found another battery, ahb571935pct-03, with similar specifications (capacity and voltage) but I don't know if it will fit. Putting the headphone ear cushion back on can actually be the hardest part of the process. You need to be firm, and this process is documented by Bose.

Battery replacement steps I took

Make sure you don't short the wires on the old or new battery during replacement. The battery is located in the left 'head phone'.

1. Use a multimeter to check if your new battery isn't dead (should be 3+ volts)
2. Remove the ear cushion from the left 'head phone' very gently so as not to tear the rim
3. Remove the two Phillips screws that keep the driver (speaker) in place
4. Remove the two Torx screws (you may have to press a bit harder)
5. Remove the speaker and be careful not to snap the wire
6. Gently remove the battery from the 'head phone'
7. Cut the wires close to the old battery (one by one!) and cover the wires on the battery to prevent a short
8. Strip the three wires from the headphones a tiny bit (just a few mm)
9. Put a short piece of heat shrink on each of the three wires of the battery
10. Solder each wire to the correct wire in the ear cup
11. Adjust the location of the heat shrink over the freshly soldered joint.
12. Use the soldering iron close to the heat shrink to shrink it (don't touch anything). This can take some time, so be patient
13. Check that the heat shrink is fixed in place and can't move
14. Put the battery into its specific location in the back of the 'head phone'
15. Test the headphones briefly before reassembling the headphones
16. Reassemble the 'head phone' (consider leaving out the two Torx screws)
17. Dispose of the old battery in a responsible manner

2 months ago 30 votes
My 71 TiB ZFS NAS after 10 years and zero drive failures

My 4U 71 TiB ZFS NAS built with twenty-four 4 TB drives is over 10 years old and still going strong. Although now on its second motherboard and power supply, the system has yet to experience a single drive failure (knock on wood). Zero drive failures in ten years, how is that possible?

Let's talk about the drives first

The 4 TB HGST drives have roughly 6000 hours on them after ten years. You might think something's off and you'd be right. That's only about 250 days worth of runtime. And therein lies the secret of drive longevity (I think): Turn the server off when you're not using it.

According to people on Hacker News I have my bearings wrong. The chance of having zero drive failures over 10 years for 24 drives is much higher than I thought it was. So this good result may not be related to turning my NAS off and keeping it off most of the time.

My NAS is turned off by default. I only turn it on (remotely) when I need to use it. I use a script to turn the IoT power bar on and once the BMC (Baseboard Management Controller) is done booting, I use IPMI to turn on the NAS itself. But I could have used Wake-on-LAN too as an alternative. Once I'm done using the server, I run a small script that turns the server off, waits a few seconds and then turns the wall socket off. It wasn't enough for me to just turn off the server but leave the motherboard, and thus the BMC, powered, because that's just a constant 7 watts (about two Raspberry Pis at idle) being wasted (24/7).

This process works for me because I run other services on low-power devices such as Raspberry Pi 4s or servers that use much less power when idling than my 'big' NAS. This process reduces my energy bill considerably (primary motivation) and also seems great for hard drive longevity. Although zero drive failures to date is awesome, N=24 is not very representative and I could just be very lucky. Yet, it was the same story with the predecessor of this NAS, a machine with 20 drives (1 TB Samsung Spinpoint F1s (remember those?)) and I also had zero drive failures during its operational lifespan (~5 years).

The motherboard (died once)

Although the drives are still ok, I had to replace the motherboard a few years ago. The failure mode of the motherboard was interesting: it was impossible to get into the BIOS and it would occasionally fail to boot. I tried the obvious like removing the CMOS battery and such but to no avail. Fortunately, the motherboard [1] was still available on eBay for a decent price so that ended up not being a big deal.

ZFS

ZFS worked fine for all these years. I've switched operating systems over the years and I never had an issue importing the pool back into the new OS install. If I were to build a new storage server, I would definitely use ZFS again. I run a zpool scrub on the drives a few times a year [2]. The scrub has never found a single checksum error. I have run so many scrubs that more than a petabyte of data must have been read from the drives (all drives combined) and ZFS didn't have to kick in. I'm not surprised by this result at all.

Drives tend to fail most often in two modes:

- Total failure, drive isn't even detected
- Bad sectors (read or write failures)

There is a third failure mode, but it's extremely rare: silent data corruption. Silent data corruption is 'silent' because the disk isn't aware it delivered corrupted data, or the SATA connection didn't detect any checksum errors. However, due to all the low-level checksumming, this risk is extremely small. It's a real risk, don't get me wrong, but it's a small risk. To me, it's a risk you mostly care about at scale, in datacenters [4], but for residential usage, it's totally reasonable to accept the risk [3]. But ZFS is not that difficult to learn and if you are well-versed in Linux or FreeBSD, it's absolutely worth checking out. Just remember!
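To put the zero-failure result from the drives section in perspective, here is a rough back-of-the-envelope sketch. It assumes every drive fails independently with a constant annualized failure rate (AFR). The AFR values below are illustrative assumptions, not figures from this article, and the function name is made up for the example. The last line additionally assumes that wear only accrues while the drives are powered on (about 250 days in total), which is a simplification.

```python
def prob_zero_failures(drives: int, years: float, afr: float) -> float:
    """Probability that none of `drives` fails during `years`.

    Assumes independent drives with a constant annualized failure
    rate `afr` (e.g. 0.01 for 1% per year). Illustrative model only.
    """
    return (1.0 - afr) ** (drives * years)


if __name__ == "__main__":
    # Hypothetical AFR values, chosen only to show the sensitivity.
    for afr in (0.005, 0.01, 0.02):
        p = prob_zero_failures(drives=24, years=10, afr=afr)
        print(f"AFR {afr:.1%}, 10 calendar years: P(no failures) = {p:.1%}")

    # Same model, but counting only the ~250 days of powered-on time.
    p_on = prob_zero_failures(drives=24, years=250 / 365, afr=0.01)
    print(f"AFR 1.0%, 250 powered-on days: P(no failures) = {p_on:.1%}")
```

Under this simple model the outcome is very sensitive to the assumed failure rate, which is consistent with the remark that N=24 is not very representative.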
Sound levels (It's Oh So Quiet)

This NAS is very quiet for a NAS (video with audio). But to get there, I had to do some work. The chassis contains three sturdy 12V fans that cool the 24 drive cages. These fans are extremely loud if they run at their default speed. But because they are so beefy, they are fairly quiet when they run at idle RPM [5], yet they still provide enough airflow, most of the time. But running at idle speeds was not enough as the drives would heat up eventually, especially when they are being read from / written to.

Fortunately, the particular Supermicro motherboard I bought at the time allows all fan headers to be controlled through Linux. So I decided to create a script that sets the fan speed according to the temperature of the hottest drive in the chassis. I actually visited a math-related subreddit and asked for an algorithm that would best fit my need to create a silent setup and also keep the drives cool. Somebody recommended using a "PID controller", which I knew nothing about. So I wrote some Python, stole some example Python PID controller code, and tweaked the parameters to find a balance between sound and cooling performance. The script has worked very well over the years and kept the drives at 40 °C or below. PID controllers are awesome and I feel they should be used in much more equipment that controls fans, temperature, and so on, instead of 'dumb' on/off behaviour or less 'dumb' lookup tables.

Networking

I started out with quad-port gigabit network controllers and I used network bonding to get around 450 MB/s network transfer speeds between various systems. This setup required a ton of UTP cables so eventually I got bored with that and I bought some cheap Infiniband cards and that worked fine, I could reach around 700 MB/s between systems. As I decided to move away from Ubuntu and back to Debian, I faced a problem: the Infiniband cards didn't work anymore and I could not figure out how to fix it. So I decided to buy some second-hand 10Gbit Ethernet cards and those work totally fine to this day.

The dead power supply

When you turn this system on, all drives spin up at once (no staggered spinup) and that draws around 600W for a few seconds. I remember that the power supply was rated for 750W and the 12 volt rail would have been able to deliver enough power, but it would sometimes cut out at boot nonetheless.

UPS (or lack thereof)

For many years, I used a beefy UPS with the system, to protect against power failure, just to be able to shut down cleanly during an outage. This worked fine, but I noticed that the UPS used another 10+ watts on top of the usage of the server and I decided it had to go. Losing the system due to power shenanigans is a risk I accept.

Backups (or a lack thereof)

My most important data is backed up thrice. But a lot of data stored on this server isn't important enough for me to back up. I rely on replacement hardware and ZFS protecting against data loss due to drive failure. And if that's not enough, I'm out of luck. I've accepted that risk for 10 years. Maybe one day my luck will run out, but until then, I enjoy what I have.

Future storage plans (or lack thereof)

To be frank, I don't have any. I built this server back in the day because I didn't want to shuffle data around due to storage space constraints and I still have ample space left. I have a spare motherboard, CPU, memory and a spare HBA card so I'm quite likely able to revive the system if something breaks. As hard drive sizes have increased tremendously, I may eventually move away from the 24-drive bay chassis into a smaller form-factor. It's possible to create the same amount of redundant storage space with only 6-8 hard drives with RAIDZ2 (RAID 6) redundancy. Yet, storage is always expensive. But another likely scenario is that in the coming years this system eventually dies and I decide not to replace it at all, and my storage hobby will come to an end.
[1] I needed the same board, because the server uses four PCIe slots: 3 x HBA and 1 x 10Gbit NIC.
[2] It takes ~20 hours to complete a scrub and it uses a ton of power while doing so. As I'm on a dynamic power tariff, I run it on 'cheap' days.
[3] Every time I listen to ZFS enthusiasts you get the impression you are taking insane risks with your data if you don't run ZFS. I disagree, it all depends on context and circumstances.
[4] Enterprise hard drives used in servers and SANs had larger sector sizes to accommodate even more checksumming data to protect against silent data corruption.
[5] Because there is little airflow by default, I had to add a fan to cool the four PCIe cards (HBA and networking) or they would have gotten way too hot.
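The fan control idea from the 'Sound levels' section can be sketched as follows. This is a minimal, textbook PID loop and not the author's actual script: the class, the function name, the gain values and the duty-cycle limits are all hypothetical, and the code neither reads real drive temperatures nor talks to any fan header. Only the 40 °C target comes from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative only)."""
    kp: float
    ki: float
    kd: float
    setpoint: float              # target temperature in degrees Celsius
    out_min: float = 20.0        # hypothetical minimum fan duty cycle (%)
    out_max: float = 100.0       # hypothetical maximum fan duty cycle (%)
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, measured: float, dt: float) -> float:
        # Positive error: the hottest drive is above the target.
        error = measured - self.setpoint

        # Accumulate the integral, then clamp it so that its contribution
        # stays within the output span (a crude form of anti-windup).
        self.integral += error * dt
        if self.ki > 0:
            span = self.out_max - self.out_min
            self.integral = max(0.0, min(span / self.ki, self.integral))

        # No derivative term on the very first sample.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error

        raw = (self.out_min + self.kp * error
               + self.ki * self.integral + self.kd * derivative)
        return max(self.out_min, min(self.out_max, raw))


def fan_duty(drive_temps: Sequence[float], controller: PID,
             dt: float = 30.0) -> float:
    """Return a fan duty cycle (%) based on the hottest drive."""
    return controller.update(max(drive_temps), dt)


if __name__ == "__main__":
    pid = PID(kp=5.0, ki=0.1, kd=0.0, setpoint=40.0)  # made-up gains
    for temps in ([35, 38, 36], [45, 41, 39], [42, 40, 39]):
        print(temps, "->", round(fan_duty(temps, pid), 1), "% duty")
```

Compared with an on/off threshold or a lookup table, the integral term lets the fans settle at whatever speed holds the target temperature, instead of oscillating around a fixed step.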

7 months ago 26 votes
The Raspberry Pi 5 is no match for a tini-mini-micro PC

I've always been fond of the idea of the Raspberry Pi. An energy efficient, small, cheap but capable computer. An ideal home server. Until the Pi 4, the Pi was not that capable, and only with the relatively recent Pi 5 (fall 2023) do I feel the Pi is OK performance wise, although still hampered by SD card performance [1]. And the Pi isn't that cheap either. The Pi 5 can be fitted with an NVME SSD, but for me it's too little, too late. Because I feel there is a type of computer on the market that is much more compelling than the Pi. I'm talking about the tinyminimicro home lab 'revolution' started by servethehome.com about four years ago (2020).

[Image: A 1L mini PC (Elitedesk 705 G4) with a Raspberry Pi 5 on top]

During the pandemic, the Raspberry Pi was in short supply and people started looking for alternatives. The people at servethehome realised that these small enterprise desktop PCs could be a good option. Dell (micro), Lenovo (tiny) and HP (mini) all make these small desktop PCs, which are also known as 1L (one liter) PCs. These mini PCs are not cheap [2] when bought new, but older models are sold at a very steep discount as enterprises offload old models by the thousands on the second hand market (through intermediaries). Although these computers are often several years old, they are still much faster than a Raspberry Pi (including the Pi 5) and can hold more RAM. I decided to buy two HP Elitedesk Mini PCs to try them out, one based on AMD and the other based on Intel.

The Hardware

                  | Elitedesk Mini G3 800 | Elitedesk Mini G4 705
CPU               | Intel i5-6500 (65W)   | AMD Ryzen 3 PRO 2200GE (35W)
RAM               | 16 GB (max 32 GB)     | 16 GB (max 32 GB)
HDD               | 250 GB (SSD)          | 250 GB (NVME)
Network           | 1Gb (Intel)           | 1Gb (Realtek)
WiFi              | Not installed         | Not installed
Display           | 2 x DP, 1 x VGA       | 3 x DP
Remote management | Yes                   | No
Idle power        | 4 W                   | 10 W
Price             | €160                  | €115

The AMD-based system is cheaper, but you 'pay' in higher idle power usage. In absolute terms 10 watt is still decent, but the Intel model directly competes with the Pi 5 on idle power consumption.

[Image: Elitedesk 705 left, Elitedesk 800 right]

Regarding display output, these devices have two fixed DisplayPort outputs, but there is one port that is configurable. It can be DisplayPort, VGA or HDMI. Depending on the supplier you may be able to configure this option, or you can buy them separately for €15-€25 online. Both models seem to be equipped with socketed CPUs. Although options for this formfactor are limited, it's possible to upgrade.

Comparing cost with the Pi 5

The Raspberry Pi 5 with (max) 8 GB of RAM costs ~91 Euro, almost exactly the same price as the AMD-based mini PC [3] in its base configuration (8GB RAM). Yet, with the Pi, you still need:

- power supply (€13)
- case (€11)
- SD card or NVME SSD (€10-€45)
- NVME hat (€15) (optional but would be more comparable)

It's true that I'm comparing a new computer to a second hand device, and you can decide if that matters in this case. With a complete Pi 5 at around €160 including taxes and shipping, the AMD-based 1L PC is clearly the cheaper and still more capable option.

Comparing performance with the Pi 5

The first two rows in this table show the Geekbench 6 score of the Intel and AMD mini PCs I've bought for evaluation. I've added the benchmark results of some other computers I have access to, just to provide some context.

CPU                                | Single-core | Multi-core
AMD Ryzen 3 PRO 2200GE (35W)       | 1148        | 3343
Intel i5-6500 (65W)                | 1307        | 3702
Mac Mini M2                        | 2677        | 9984
Mac Mini i3-8100B                  | 1250        | 3824
HP Microserver Gen8 Xeon E3-1200v2 | 744         | 2595
Raspberry Pi 5                     | 806         | 1861
Intel i9-13900k                    | 2938        | 21413
Intel E5-2680 v2                   | 558         | 5859

Sure, these mini PCs won't come close to modern hardware like the Apple M2 or the Intel i9. But if we look at the performance of the mini PCs we can observe that:

- The Intel i5-6500T CPU is about 14% faster in single-core than the AMD Ryzen 3 PRO
- Both the Intel and AMD processors are 42% - 62% faster than the Pi 5 regarding single-core performance.

Storage (performance)

If there's one thing that really holds the Pi back, it's the SD card storage. If you buy a decent SD card (A1/A2) that doesn't have terrible random IOPS performance, you realise that you can get a SATA or NVME SSD for almost the same price that has more capacity and much better (random) IO performance. With the Pi 5, NVME SSD storage isn't standard and requires an extra hat. I feel that the missing integrated NVME storage option for the Pi 5 is a missed opportunity that - in my view - hurts the Pi 5.

Now in contrast, the Intel-based mini PC came with a SATA SSD in a special mounting bracket. That bracket also contained a small fan(1) to keep the underlying NVME storage (not present) cooled.

[Image: There is a fan under the SATA SSD]

The AMD-based mini PC was equipped with an NVME SSD and was not equipped with the SSD mounting bracket. The low price must come from somewhere... However, both systems have support for SATA SSD storage, an 80mm NVME SSD and a small 2230 slot for a WiFi card. There seems to be no room on the 705 G4 to put in a small SSD, but there are adapters available that convert the WiFi slot to a slot usable for an extra NVME SSD, which might be an option for the 800 G3.

Noise levels (subjective)

Both systems are barely audible at idle, but you will notice them (if you are sensitive to that sort of thing). The AMD system seems to become quite loud under full load. The Intel system also became loud under full load, but much more like a Mac Mini: the noise is less loud and more tolerable in my view.

Idle power consumption

Elitedesk 800 (Intel)

I can get the Intel-based Elitedesk 800 G3 to 3.5 watt at idle. Let that sink in for a moment. That's about the same power draw as the Raspberry Pi 5 at idle! Just installing Debian 12 instead of Windows 10 makes the idle power consumption drop from 10-11 watt to around 7 watt. Then on Debian, you:

- run apt install powertop
- run powertop --auto-tune (saves ~2 Watt)
- Unplug the monitor (run headless) (saves ~1 Watt)

You have to put the powertop --auto-tune command in /etc/rc.local:

    #!/usr/bin/env bash
    powertop --auto-tune
    exit 0

Then apply chmod +x /etc/rc.local

So, for about the same idle power draw you get so much more performance, and go beyond the max 8GB RAM of the Pi 5.

Elitedesk 705 (AMD)

I managed to get this system to 10-11 watt at idle, but it was a pain to get there. I measured around 11 Watts idle power consumption running a preinstalled Windows 11 (with monitor connected). After installing Debian 12 the system used 18 Watts at idle and so began a journey of many hours trying to solve this problem. The culprit is the integrated Radeon Vega GPU. To solve the problem you have to:

- Configure the 'bios' to only use UEFI
- Reinstall Debian 12 using UEFI
- Install the appropriate firmware with apt install firmware-amd-graphics

If you boot the computer using legacy 'bios' mode, the AMD Radeon firmware won't load no matter what you try. You can see this by issuing the commands:

    rmmod amdgpu
    modprobe amdgpu

You may notice errors on the physical console or in the logs that the GPU driver isn't loaded because it's missing firmware (a lie). This whole process got me to around 12 Watt at idle. To get to ~10 Watts idle you also need to run powertop --auto-tune and disconnect the monitor, as stated in the 'Intel' section earlier. Given the whole picture, 10-11 Watt at idle is perfectly okay for a home server, and if you just want the cheapest option possible, this is still a fine system.

KVM Virtualisation

I'm running vanilla KVM (Debian 12) on these Mini PCs and it works totally fine. I've created multiple virtual machines without issue and performance seemed perfectly adequate.

Boot performance

From the moment I pressed the power button to SSH connecting, it took 17 seconds for the Elitedesk 800. The Elitedesk 705 took 33 seconds until I got an SSH shell. These boot times include the 5 second boot delay within the GRUB bootloader screen that is default for Debian 12.

Remote management support

Some of you may be familiar with IPMI (ILO, DRAC, and so on) which is standard on most servers. But there is also similar technology for (enterprise) desktops. Intel AMT/ME is a technology used for remote out-of-band management of computers. It can be an interesting feature in a homelab environment but I have no need for it. If you want to try it, you can follow this guide. For most people, it may be best to disable the AMT/ME feature as it has a history of security vulnerabilities. This may not be a huge issue within a trusted home network, but you have been warned. The AMD-based Elitedesk 705 didn't come with equivalent remote management capabilities as far as I can tell.

Alternatives

The models discussed here are older models that are selected for a particular price point. Newer models from Lenovo, HP and Dell are equipped with more modern processors which are faster and have more cores. They are often also priced significantly higher. If you are looking for low-power small formfactor PCs with more potent or customisable hardware, you may want to look at second-hand NUC formfactor PCs.

Stacking multiple mini PCs

The AMD-based Elitedesk 705 G4 is closed at the top and it's possible to stack other mini PCs on top. The Intel-based Elitedesk 800 G3 has a perforated top enclosure, and putting another mini PC on top might suffocate the CPU fan. As you can see, the bottom/foot of the mini PC doubles as a VESA mount and has four screw holes. By putting some screws in those holes, you may effectively create standoffs that give the machine below enough space to breathe (maybe you can use actual standoffs).

Evaluation and conclusion

I think these second-hand 1L tinyminimicro PCs are better suited to play the role of home (lab) server than the Raspberry Pi (5). The increased CPU performance, the built-in SSD/NVME support, the option to go beyond 8 GB of RAM (up to 32GB) and the price point on the second-hand market really make a difference. I love the Raspberry Pi and I still have a ton of Pi 4s. This solar-powered blog is hosted on a Pi 4 because of the low power consumption and the availability of GPIO pins for the solar status display. That said, unless the Raspberry Pi becomes a lot cheaper (and more potent), I'm not so sure it's such a compelling home server.

This blog post featured on the front page of Hacker News.

[1] Even a decent quality SD card is no match (in terms of random IOPS and sequential throughput) for a regular SATA or NVME SSD. The fact that the Pi 5 has no on-board NVME support is a huge shortcoming in my view.
[2] In the sense that you can buy a ton of fully decked out Pi 5s for the price of one such system.
[3] The base price included the external power brick and 256GB NVME storage.
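The cost and single-core comparisons in this article are simple arithmetic on the listed prices and Geekbench 6 scores. As a small sketch for re-checking them (the helper name and dictionary layout are made up for this example; the numbers are the ones quoted in the article):

```python
def percent_faster(score: float, baseline: float) -> float:
    """How much faster `score` is than `baseline`, in percent."""
    return (score / baseline - 1.0) * 100.0


# Geekbench 6 single-core scores as listed in the benchmark table.
single_core = {
    "AMD Ryzen 3 PRO 2200GE": 1148,
    "Intel i5-6500": 1307,
    "Raspberry Pi 5": 806,
}

# Pi 5 (8 GB) plus the accessories listed in the cost section, in euros.
# Storage is a (low, high) range because it depends on SD card vs NVME SSD.
pi5_board = 91
accessories = {"power supply": 13, "case": 11, "NVME hat": 15}
storage_low, storage_high = 10, 45

if __name__ == "__main__":
    pi = single_core["Raspberry Pi 5"]
    for name in ("AMD Ryzen 3 PRO 2200GE", "Intel i5-6500"):
        gain = percent_faster(single_core[name], pi)
        print(f"{name} vs Pi 5: {gain:.0f}% faster (single-core)")

    intel_vs_amd = percent_faster(single_core["Intel i5-6500"],
                                  single_core["AMD Ryzen 3 PRO 2200GE"])
    print(f"Intel vs AMD: {intel_vs_amd:.0f}% faster (single-core)")

    base = pi5_board + sum(accessories.values())
    print(f"Complete Pi 5: EUR {base + storage_low} to {base + storage_high}")
```

The totals exclude shipping, so they only bracket the 'around €160' figure rather than reproduce it exactly.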

10 months ago 38 votes
AI is critically important but not for you

Before ChatGPT caused a sensation, big tech companies like Facebook and Apple were betting their future growth on virtual reality. But I'm convinced that virtual reality will never be a mainstream thing. If you ever used VR you know why:

- A heavy thing on your head that messes up your hair
- Nausea

The focus on virtual reality felt like desperation to me. The desperation of big tech companies trying to find new growth, ideally a monopoly they control [1], to satisfy the demands of shareholders. And then OpenAI dropped ChatGPT and all the big tech companies started to pivot so fast because, in contrast to VR, AI doesn't involve making people nauseated and look silly.

It's probably obvious that I feel it's not about AI itself. It is really about huge tech companies that have found a new way to sustain growth a bit longer, now that all other markets have been saturated. Flush with cash, they went nuts and bought up all the AI accelerator hardware [2], which in turn uses unspeakable amounts of energy to train new large language models.

Despite all the hype, current AI technology is at its core a very sophisticated statistical model. It's all about probabilities, it can't actually reason. As I see it, work done by AI thus can't be trusted. Depending on the specific application, that may be less of an issue, but that is a fundamental limitation of current technology. And this gives me pause as it limits the application where it is most wanted: to control labour. To reduce the cost of headcount and to suppress wages. As AI tools become capable enough, it would be irresponsible towards shareholders not to pursue this direction.

All this just to illustrate that the real value of AI is not for the average person in the street. The true value is for those bigger companies who can keep on growing, and the rest is just collateral damage.

But I wonder: when the AI hype is over, what new hype will take its place? I can't see it. I can't think of it. But I recognise that the internet created efficiencies that are convenient, yet social media weaponised this convenience to exploit our fundamental human weaknesses. As shareholder value rises, social media slowly chips away at the fabric of our society: trust.

[1] I sold my Oculus Rift CV1 long ago. I lost hundreds of dollars of content but I refuse to create a Facebook/Meta account.
[2] climate change accelerators

12 months ago 20 votes
How to run victron veconfigure on a mac

Introduction

Victron Multiplus-II inverter/chargers are configured with the veconfigure [1] tool. Unfortunately this is a Windows-only tool, but there is still a way for Apple users to run this tool without any problems.

Tip: if you've never worked with the Terminal app on MacOS, it might not be an easy process, but I've done my best to make it as simple as I can.

A tool called 'Wine' makes it possible to run Windows applications on MacOS. There are some caveats, but none of those apply to veconfigure, this tool runs great! I won't cover in this tutorial how to make the MK-3 USB cable work. This tutorial is only meant for people who have a Cerbo GX or similar device, or run VenusOS, which can be used to remotely configure the Multiplus device(s).

Step 1: install brew on macos

Brew is a tool that can install additional software.

1. Visit https://brew.sh and copy the install command
2. Open the Terminal app on your mac and paste the command
3. Now press 'Enter' or return

It can take a few minutes for 'brew' to install.

Step 2: install wine

Enter the following two commands in the terminal:

    brew tap homebrew/cask-versions
    brew install --cask --no-quarantine wine-stable

Download Victron veconfigure

1. Visit this page
2. Scroll to the section "VE Configuration tools for VE.Bus Products"
3. Click on the link "Ve Configuration Tools"
4. You'll be asked if it's OK to download this file (VECSetup_B.exe), which is ok

Start the veconfigure installer with wine

1. Open a terminal window
2. Run cd
3. Enter the command wine Downloads/VECSetup_B.exe
4. Observe that the veconfigure Windows setup installer starts
5. Click on next, next, install and Finish
6. veconfigure will run for the first time

[Video: the actual install steps]

How to start veconfigure after you close the app

1. Open a terminal window
2. Run cd
3. Run cd .wine/drive_c/Program\ Files\ \(x86\)/VE\ Configure\ tools/
4. Run wine VEConfig.exe
5. Observe that veconfigure starts

Allow veconfigure access to files in your Mac Downloads folder

1. Open a terminal window
2. Run cd
3. Run cd .wine/drive_c/
4. Run ln -s ~/Downloads

We just made the Downloads directory on your Mac accessible for the veconfigure software. If you put the .RSVC files in the Downloads folder, you can edit them. Please follow the instructions for remote configuration of the Multiplus II.

[1] Click on the "Ve Configuration Tools" link in the "VE Configuration tools for VE.Bus Products" section.

a year ago 31 votes

More in technology

Sierpiński triangle? In my bitwise AND?

Exploring a peculiar bit-twiddling hack at the intersection of 1980s geek sensibilities.

yesterday 4 votes
Reverse engineering the 386 processor's prefetch queue circuitry

In 1985, Intel introduced the groundbreaking 386 processor, the first 32-bit processor in the x86 architecture. To improve performance, the 386 has a 16-byte instruction prefetch queue. The purpose of the prefetch queue is to fetch instructions from memory before they are needed, so the processor usually doesn't need to wait on memory while executing instructions. Instruction prefetching takes advantage of times when the processor is "thinking" and the memory bus would otherwise be unused. In this article, I look at the 386's prefetch queue circuitry in detail. One interesting circuit is the incrementer, which adds 1 to a pointer to step through memory. This sounds easy enough, but the incrementer uses complicated circuitry for high performance. The prefetch queue uses a large network to shift bytes around so they are properly aligned. It also has a compact circuit to extend signed 8-bit and 16-bit numbers to 32 bits. There aren't any major discoveries in this post, but if you're interested in low-level circuits and dynamic logic, keep reading. The photo below shows the 386's shiny fingernail-sized silicon die under a microscope. Although it may look like an aerial view of a strangely-zoned city, the die photo reveals the functional blocks of the chip. The Prefetch Unit in the upper left is the relevant block. In this post, I'll discuss the prefetch queue circuitry (highlighted in red), skipping over the prefetch control circuitry to the right. The Prefetch Unit receives data from the Bus Interface Unit (upper right) that communicates with memory. The Instruction Decode Unit receives prefetched instructions from the Prefetch Unit, byte by byte, and decodes the opcodes for execution. This die photo of the 386 shows the location of the registers. The left quarter of the chip consists of stripes of circuitry that appears much more orderly than the rest of the chip. 
This grid-like appearance arises because each functional block is constructed (for the most part) by repeating the same circuit 32 times, once for each bit, side by side. Vertical data lines run up and down, in groups of 32 bits, connecting the functional blocks. To make this work, each circuit must fit into the same width on the die; this layout constraint forces the circuit designers to develop a circuit that uses this width efficiently without exceeding the allowed width. The circuitry for the prefetch queue uses the same approach: each circuit is 66 µm wide1 and repeated 32 times. As will be seen, fitting the prefetch circuitry into this fixed width requires some layout tricks. What the prefetcher does The purpose of the prefetch unit is to speed up performance by reading instructions from memory before they are needed, so the processor won't need to wait to get instructions from memory. Prefetching takes advantage of times when the memory bus is otherwise idle, minimizing conflict with other instructions that are reading or writing data. In the 386, prefetched instructions are stored in a 16-byte queue, consisting of four 32-bit blocks.2 The diagram below zooms in on the prefetcher and shows its main components. You can see how the same circuit (in most cases) is repeated 32 times, forming vertical bands. At the top are 32 bus lines from the Bus Interface Unit. These lines provide the connection between the datapath and external memory, via the Bus Interface Unit. These lines form a triangular pattern as the 32 horizontal lines on the right branch off and form 32 vertical lines, one for each bit. Next are the fetch pointer and the limit register, with a circuit to check if the fetch pointer has reached the limit. Note that the two low-order bits (on the right) of the incrementer and limit check circuit are missing. 
At the bottom of the incrementer, you can see that some bit positions have a blob of circuitry missing from others, breaking the pattern of repeated blocks. The 16-byte prefetch queue is below the incrementer. Although this memory is the heart of the prefetcher, its circuitry takes up a relatively small area. A close-up of the prefetcher with the main blocks labeled. At the right, the prefetcher receives control signals. The bottom part of the prefetcher shifts data to align it as needed. A 32-bit value can be split across two 32-bit rows of the prefetch buffer. To handle this, the prefetcher includes a data shift network to shift and align its data. This network occupies a lot of space, but there is no active circuitry here: just a grid of horizontal and vertical wires. Finally, the sign extend circuitry converts a signed 8-bit or 16-bit value into a signed 16-bit or 32-bit value as needed. You can see that the sign extend circuitry is highly irregular, especially in the middle. A latch stores the output of the prefetch queue for use by the rest of the datapath. Limit check If you've written x86 programs, you probably know about the processor's Instruction Pointer (EIP) that holds the address of the next instruction to execute. As a program executes, the Instruction Pointer moves from instruction to instruction. However, it turns out that the Instruction Pointer doesn't actually exist! Instead, the 386 has an "Advance Instruction Fetch Pointer", which holds the address of the next instruction to fetch into the prefetch queue. But sometimes the processor needs to know the Instruction Pointer value, for instance, to determine the return address when calling a subroutine or to compute the destination address of a relative jump. So what happens? The processor gets the Advance Instruction Fetch Pointer address from the prefetch queue circuitry and subtracts the current length of the prefetch queue. 
The result is the address of the next instruction to execute, the desired Instruction Pointer value. The Advance Instruction Fetch Pointer, the address of the next instruction to prefetch, is stored in a register at the top of the prefetch queue circuitry. As instructions are prefetched, this pointer is incremented by the prefetch circuitry. (Since instructions are fetched 32 bits at a time, this pointer is incremented in steps of four and the bottom two bits are always 0.) But what keeps the prefetcher from prefetching too far and going outside the valid memory range? The x86 architecture infamously uses segments to define valid regions of memory. A segment has a start and end address (known as the base and limit) and memory is protected by blocking accesses outside the segment. The 386 has six active segments; the relevant one is the Code Segment that holds program instructions. Thus, the limit address of the Code Segment controls when the prefetcher must stop prefetching.3 The prefetch queue contains a circuit to stop prefetching when the fetch pointer reaches the limit of the Code Segment. In this section, I'll describe that circuit. Comparing two values may seem trivial, but the 386 uses a few tricks to make this fast. The basic idea is to use 30 XOR gates to compare the bits of the two registers. (Why 30 bits and not 32? Since 32 bits are fetched at a time, the bottom bits of the address are 00 and can be ignored.) If the two registers match, all the XOR values will be 0, but if they don't match, at least one XOR value will be 1. Conceptually, connecting the XORs to a 30-input OR gate will yield the desired result: 0 if all bits match and 1 if there is a mismatch. Unfortunately, building a 30-input OR gate using standard CMOS logic is impractical for electrical reasons, and such a gate would be inconveniently large to fit into the circuit. Instead, the 386 uses dynamic logic to implement a spread-out NOR gate with one transistor in each column of the prefetcher.
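As a rough software analogy for the comparison just described, the sketch below models the shared bus as a flag that is precharged and then pulled low by any mismatching column. The function name `limit_reached` and the choice to mask the two low-order bits of both inputs are illustrative assumptions, not details taken from the 386 die.

```python
MASK30 = (1 << 30) - 1  # 30 compared bit positions

def limit_reached(fetch_pointer: int, limit: int) -> bool:
    """Model the 30-column equality check as a precharged, wired-NOR bus."""
    # Assumption: the two low-order bits are ignored on both inputs.
    a = (fetch_pointer >> 2) & MASK30
    b = (limit >> 2) & MASK30

    # Phase 1: precharge the shared equality bus high.
    bus_high = True

    # Phase 2: each column XORs its pair of bits. Any mismatch turns on
    # that column's pull-down transistor and discharges the bus.
    for bit in range(30):
        mismatch = ((a >> bit) ^ (b >> bit)) & 1
        if mismatch:
            bus_high = False

    # The bus stays high only if no column pulled it low.
    return bus_high
```

In hardware all 30 columns act at once; the loop here only stands in for that parallel evaluation.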
The schematic below shows the implementation of one bit of the equality comparison. The mechanism is that if the two registers differ, the transistor on the right is turned on, pulling the equality bus low. This circuit is replicated 30 times, comparing all the bits: if there is any mismatch, the equality bus will be pulled low, but if all bits match, the bus remains high. The three gates on the left implement XNOR; this circuit may seem overly complicated, but it is a standard way of implementing XNOR. The NOR gate at the right blocks the comparison except during clock phase 2. (The importance of this will be explained below.) This circuit is repeated 30 times to compare the registers. The equality bus travels horizontally through the prefetcher, pulled low if any bits don't match. But what pulls the bus high? That's the job of the dynamic circuit below. Unlike regular static gates, dynamic logic is controlled by the processor's clock signals and depends on capacitance in the circuit to hold data. The 386 is controlled by a two-phase clock signal.4 In the first clock phase, the precharge transistor below turns on, pulling the equality bus high. In the second clock phase, the XOR circuits above are enabled, pulling the equality bus low if the two registers don't match. Meanwhile, the CMOS switch turns on in clock phase 2, passing the equality bus's value to the latch. The "keeper" circuit keeps the equality bus held high unless it is explicitly pulled low, to avoid the risk of the voltage on the equality bus slowly dissipating. The keeper uses a weak transistor to keep the bus high while inactive. But if the bus is pulled low, the keeper transistor is overpowered and turns off. This is the output circuit for the equality comparison. This circuit is located to the right of the prefetcher. This dynamic logic reduces power consumption and circuit size. 
Since the bus is charged and discharged during opposite clock phases, you avoid steady current through the transistors. (In contrast, an NMOS processor like the 8086 might use a pull-up on the bus. When the bus is pulled low, you would end up with current flowing through the pull-up and the pull-down transistors. This would increase power consumption, make the chip run hotter, and limit your clock speed.) The incrementer After each prefetch, the Advance Instruction Fetch Pointer must be incremented to hold the address of the next instruction to prefetch. Incrementing this pointer is the job of the incrementer. (Because each fetch is 32 bits, the pointer is incremented by 4 each time. But in the die photo, you can see a notch in the incrementer and limit check circuit where the circuitry for the bottom two bits has been omitted. Thus, the incrementer's circuitry increments its value by 1, so the pointer (with two zero bits appended) increases in steps of 4.) Building an incrementer circuit is straightforward; for example, you can use a chain of 30 half-adders. The problem is that incrementing a 30-bit value at high speed is difficult because of the carries from one position to the next. It's similar to calculating 99999999 + 1 in decimal; you need to tediously carry the 1, carry the 1, carry the 1, and so forth, through all the digits, resulting in a slow, sequential process. The incrementer uses a faster approach. First, it computes all the carries at high speed, almost in parallel. Then it computes each output bit in parallel from the carries: if there is a carry into a position, it toggles that bit. Computing the carries is straightforward in concept: if there is a block of 1 bits at the end of the value, all those bits will produce carries, but carrying is stopped by the rightmost 0 bit. For instance, incrementing binary 11011 results in 11100; there are carries from the last two bits, but the zero stops the carries.
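The two-step idea (find the carries, then toggle the bits that receive one) can be sketched in a few lines. This is only a behavioral illustration: the names `carries_into` and `increment` are made up here, and the software loop is sequential where the hardware resolves the carries almost in parallel.

```python
WIDTH = 30  # the incrementer handles the upper 30 bits of the pointer

def carries_into(value: int, width: int = WIDTH) -> int:
    """Return a mask whose bit i is set when a carry enters position i."""
    carries = 0
    carry = 1  # the "+1" itself acts as a carry into bit 0
    for i in range(width):
        if carry:
            carries |= 1 << i
        # A carry continues only through 1 bits; a 0 bit stops it.
        carry &= (value >> i) & 1
    return carries

def increment(value: int, width: int = WIDTH) -> int:
    """Toggle every bit that receives a carry, wrapping at `width` bits."""
    return (value ^ carries_into(value, width)) & ((1 << width) - 1)
```

For the value 11011, `carries_into` yields 00111 (carries into the three lowest positions), and toggling those bits gives 11100.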
A circuit to implement this was developed at the University of Manchester in England way back in 1959, and is known as the Manchester carry chain. In the Manchester carry chain, you build a chain of switches, one for each data bit, as shown below. For a 1 bit, you close the switch, but for a 0 bit you open the switch. (The switches are implemented by transistors.) To compute the carries, you start by feeding in a carry signal at the right. The signal will go through the closed switches until it hits an open switch, and then it will be blocked.5 The outputs along the chain give us the desired carry value at each position. Concept of the Manchester carry chain, 4 bits. Since the switches in the Manchester carry chain can all be set in parallel and the carry signal blasts through the switches at high speed, this circuit rapidly computes the carries we need. The carries then flip the associated bits (in parallel), giving us the result much faster than a straightforward adder. There are complications, of course, in the actual implementation. The carry signal in the carry chain is inverted, so a low signal propagates through the carry chain to indicate a carry. (It is faster to pull a signal low than high.) But something needs to make the line go high when necessary. As with the equality circuitry, the solution is dynamic logic. That is, the carry line is precharged high during one clock phase and then processing happens in the second clock phase, potentially pulling the line low. The next problem is that the carry signal weakens as it passes through multiple transistors and long lengths of wire. The solution is that each segment has a circuit to amplify the signal, using a clocked inverter and an asymmetrical inverter. Importantly, this amplifier is not in the carry chain path, so it doesn't slow down the signal through the chain. The Manchester carry chain circuit for a typical bit in the incrementer.
The schematic above shows the implementation of the Manchester carry chain for a typical bit. The chain itself is at the bottom, with the transistor switch as before. During clock phase 1, the precharge transistor pulls this segment of the carry chain high. During clock phase 2, the signal on the chain goes through the "clocked inverter" at the right to produce the local carry signal. If there is a carry, the next bit is flipped by the XOR gate, producing the incremented output.6 The "keeper/amplifier" is an asymmetrical inverter that produces a strong low output but a weak high output. When there is no carry, its weak output keeps the carry chain pulled high. But as soon as a carry is detected, it strongly pulls the carry chain low to boost the carry signal. But this circuit still isn't enough for the desired performance. The incrementer uses a second carry technique in parallel: carry skip. The concept is to look at blocks of bits and allow the carry to jump over the entire block. The diagram below shows a simplified implementation of the carry skip circuit. Each block consists of 3 to 6 bits. If all the bits in a block are 1's, then the AND gate turns on the associated transistor in the carry skip line. This allows the carry skip signal to propagate (from left to right), a block at a time. When it reaches a block with a 0 bit, the corresponding transistor will be off, stopping the carry as in the Manchester carry chain. The AND gates all operate in parallel, so the transistors are rapidly turned on or off in parallel. Then, the carry skip signal passes through a small number of transistors, without going through any logic. (The carry skip signal is like an express train that skips most stations, while the Manchester carry chain is the local train to all the stations.) Like the Manchester carry chain, the implementation of carry skip needs precharge circuits on the lines, a keeper/amplifier, and clocked logic, but I'll skip the details. 
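A behavioral sketch of carry skip, under stated assumptions: the block sizes in `BLOCK_SIZES` are hypothetical (the text only says 3 to 6 bits per block), the lowest-order block is listed first, and `increment_carry_skip` is an illustrative name. Each block passes the carry along only when all its bits are 1, while a local chain inside the block produces the individual output bits.

```python
BLOCK_SIZES = [4, 4, 4, 4]  # hypothetical sizes, lowest-order block first

def increment_carry_skip(value: int, block_sizes=BLOCK_SIZES) -> int:
    """Increment `value`, passing the carry between blocks via a skip path."""
    result = 0
    block_carry_in = 1  # the "+1" enters the lowest block as a carry
    start = 0
    for size in block_sizes:
        mask = (1 << size) - 1
        block = (value >> start) & mask

        # Local chain: ripple the incoming carry through this block's bits.
        carry = block_carry_in
        for i in range(size):
            bit = (block >> i) & 1
            result |= (bit ^ carry) << (start + i)
            carry &= bit

        # Skip path: the carry crosses the block only if every bit is 1.
        block_carry_in &= int(block == mask)
        start += size
    return result
```

With four 4-bit blocks, a carry from bit 0 reaches bit 12 after passing three skip transistors instead of twelve chain switches, which is the point of the express-train analogy.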
An abstracted and simplified carry-skip circuit. The block sizes don't match the 386's circuit. One interesting feature is the layout of the large AND gates. A 6-input AND gate is a large device, difficult to fit into one cell of the incrementer. The solution is that the gate is spread out across multiple cells. Specifically, the gate uses a standard CMOS NAND gate circuit with NMOS transistors in series and PMOS transistors in parallel. Each cell has an NMOS transistor and a PMOS transistor, and the chains are connected at the end to form the desired NAND gate. (Inverting the output produces the desired AND function.) This spread-out layout technique is unusual, but keeps each bit's circuitry approximately the same size. The incrementer circuitry was tricky to reverse engineer because of these techniques. In particular, most of the prefetcher consists of a single block of circuitry repeated 32 times, once for each bit. The incrementer, on the other hand, consists of four different blocks of circuitry, repeating in an irregular pattern. Specifically, one block starts a carry chain, a second block continues the carry chain, and a third block ends a carry chain. The block before the ending block is different (one large transistor to drive the last block), making four variants in total. This irregular pattern is visible in the earlier photo of the prefetcher. The alignment network The bottom part of the prefetcher rotates data to align it as needed. Unlike some processors, the x86 does not enforce aligned memory accesses. That is, a 32-bit value does not need to start on a 4-byte boundary in memory. As a result, a 32-bit value may be split across two 32-bit rows of the prefetch queue. Moreover, when the instruction decoder fetches one byte of an instruction, that byte may be at any position in the prefetch queue. 
To deal with these problems, the prefetcher includes an alignment network that can rotate bytes to output a byte, word, or four bytes with the alignment required by the rest of the processor. The diagram below shows part of this alignment network. Each bit exiting the prefetch queue (top) has four wires, for rotates of 24, 16, 8, or 0 bits. Each rotate wire is connected to one of the 32 horizontal bit lines. Finally, each horizontal bit line has an output tap, going to the datapath below. (The vertical lines are in the chip's lower M1 metal layer, while the horizontal lines are in the upper M2 metal layer. For this photo, I removed the M2 layer to show the underlying layer. Shadows of the original horizontal lines are still visible.) Part of the alignment network. The idea is that by selecting one set of vertical rotate lines, the 32-bit output from the prefetch queue will be rotated left by that amount. For instance, to rotate by 8, bits are sent down the "rotate 8" lines. Bit 0 from the prefetch queue will energize horizontal line 8, bit 1 will energize horizontal line 9, and so forth, with bit 31 wrapping around to horizontal line 7. Since horizontal bit line 8 is connected to output 8, the result is that bit 0 is output as bit 8, bit 1 is output as bit 9, and so forth. The four possibilities for aligning a 32-bit value. The four bytes above are shifted as specified to produce the desired output below. For the alignment process, one 32-bit output may be split across two 32-bit entries in the prefetch queue in four different ways, as shown above. These combinations are implemented by multiplexers and drivers. Two 32-bit multiplexers select the two relevant rows in the prefetch queue (blue and green above). Four 32-bit drivers are connected to the four sets of vertical lines, with one set of drivers activated to produce the desired shift. Each byte of each driver is wired to achieve the alignment shown above. 
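The combination of per-byte selection and rotation can be modeled as below. This is a sketch under assumptions: `aligned_fetch` and `rotl32` are illustrative names, rows are treated as little-endian 32-bit words, and the mapping from byte offset to rotate amount (offset 3 uses rotate 8, offset 2 uses rotate 16, offset 1 uses rotate 24) is one reading of the description rather than something confirmed from the die.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def aligned_fetch(first_row: int, next_row: int, offset: int) -> int:
    """Return the 32-bit value starting at byte `offset` (0-3) of first_row."""
    # Per-byte selection: bytes at or above `offset` come from the first
    # row, and the bytes below it come from the following row.
    byte_mask = (MASK32 << (8 * offset)) & MASK32
    merged = (first_row & byte_mask) | (next_row & ~byte_mask & MASK32)
    # Rotate left by 0, 24, 16, or 8 bits so the first byte lands in bits 7-0.
    return rotl32(merged, (32 - 8 * offset) % 32)
```

For example, with rows 0x44332211 and 0x88776655, an offset of 3 selects the top byte of the first row and the low three bytes of the second, then rotates left by 8 to give 0x77665544.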
For instance, the rotate-8 driver gets its top byte from the "green" multiplexer and the other three bytes from the "blue" multiplexer. The result is that the four bytes, split across two queue rows, are rotated to form an aligned 32-bit value. Sign extension The final circuit is sign extension. Suppose you want to add an 8-bit value to a 32-bit value. An unsigned 8-bit value can be extended to 32 bits by simply filling the upper bits with zeroes. But for a signed value, it's trickier. For instance, -1 is the eight-bit value 0xFF, but the 32-bit value is 0xFFFFFFFF. To convert an 8-bit signed value to 32 bits, the top 24 bits must be filled in with the top bit of the original value (which indicates the sign). In other words, for a positive value, the extra bits are filled with 0, but for a negative value, the extra bits are filled with 1. This process is called sign extension.9 In the 386, a circuit at the bottom of the prefetcher performs sign extension for values in instructions. This circuit supports extending an 8-bit value to 16 bits or 32 bits, as well as extending a 16-bit value to 32 bits. This circuit will extend a value with zeros or with the sign, depending on the instruction. The schematic below shows one bit of this sign extension circuit. It consists of a latch on the left and right, with a multiplexer in the middle. The latches are constructed with a standard 386 circuit using a CMOS switch (see footnote).7 The multiplexer selects one of three values: the bit value from the swap network, 0 for sign extension, or 1 for sign extension. The multiplexer is constructed from a CMOS switch if the bit value is selected and two transistors for the 0 or 1 values. This circuit is replicated 32 times, although the bottom byte only has the latches, not the multiplexer, as sign extension does not modify the bottom byte. The sign extend circuit associated with bits 31-8 from the prefetcher. 
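The fill rule for the upper bits can be written out directly. The sketch below is a behavioral model only; the function name `extend` and its parameters are illustrative and do not correspond to signals in the circuit.

```python
def extend(value: int, from_bits: int, to_bits: int, signed: bool) -> int:
    """Extend a from_bits-wide value to to_bits, filling with 0s or the sign."""
    low_mask = (1 << from_bits) - 1
    value &= low_mask
    # The fill bit is the top bit of the original value for a signed
    # extension, and always 0 for an unsigned (zero) extension.
    sign = (value >> (from_bits - 1)) & 1
    fill = sign if signed else 0
    # Only the bits above the original width are rewritten; the low bits
    # pass through unchanged.
    upper_mask = ((1 << to_bits) - 1) & ~low_mask
    return value | (upper_mask if fill else 0)
```

Here `extend(0xFF, 8, 32, signed=True)` gives 0xFFFFFFFF (that is, -1), while the same call with `signed=False` gives 0x000000FF.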
The second part of the sign extension circuitry determines if the bits should be filled with 0 or 1 and sends the control signals to the circuit above. The gates on the left determine if the sign extension bit should be a 0 or a 1. For a 16-bit sign extension, this bit comes from bit 15 of the data, while for an 8-bit sign extension, the bit comes from bit 7. The four gates on the right generate the signals to sign extend each bit, producing separate signals for the bit range 31-16 and the range 15-8. This circuit determines which bits should be filled with 0 or 1. The layout of this circuit on the die is somewhat unusual. Most of the prefetcher circuitry consists of 32 identical columns, one for each bit.8 The circuitry above is implemented once, using about 16 gates (buffers and inverters are not shown above). Despite this, the circuitry above is crammed into bit positions 17 through 7, creating irregularities in the layout. Moreover, the implementation of the circuitry in silicon is unusual compared to the rest of the 386. Most of the 386's circuitry uses the two metal layers for interconnection, minimizing the use of polysilicon wiring. However, the circuit above also uses long stretches of polysilicon to connect the gates. Layout of the sign extension circuitry. This circuitry is at the bottom of the prefetch queue. The diagram above shows the irregular layout of the sign extension circuitry amid the regular datapath circuitry that is 32 bits wide. The sign extension circuitry is shown in green; this is the circuitry described at the top of this section, repeated for each bit 31-8. The circuitry for bits 15-8 has been shifted upward, perhaps to make room for the sign extension control circuitry, indicated in red. Note that the layout of the control circuitry is completely irregular, since there is one copy of the circuitry and it has no internal structure. 
One consequence of this layout is the wasted space to the left and right of this circuitry block, the tan regions with no circuitry except vertical metal lines passing through. At the far right, a block of circuitry to control the latches has been wedged under bit 0. Intel's designers go to great effort to minimize the size of the processor die since a smaller die saves substantial money. This layout must have been the most efficient they could manage, but I find it aesthetically displeasing compared to the regularity of the rest of the datapath. How instructions flow through the chip Instructions follow a tortuous path through the 386 chip. First, the Bus Interface Unit in the upper right corner reads instructions from memory and sends them over a 32-bit bus (blue) to the prefetch unit. The prefetch unit stores the instructions in the 16-byte prefetch queue. Instructions follow a twisting path to and from the prefetch queue. How is an instruction executed from the prefetch queue? It turns out that there are two distinct paths. Suppose you're executing an instruction to add 12345678 to the EAX register. The prefetch queue will hold the five bytes 05 (the opcode), 78, 56, 34, and 12. The prefetch queue provides opcodes to the decoder one byte at a time over the 8-bit bus shown in red. The bus takes the lowest 8 bits from the prefetch queue's alignment network and sends this byte to a buffer (the small square at the head of the red arrow). From there, the opcode travels to the instruction decoder.10 The instruction decoder, in turn, uses large tables (PLAs) to convert the x86 instruction into a 111-bit internal format with 19 different fields.11 The data bytes of an instruction, on the other hand, go from the prefetch queue to the ALU (Arithmetic Logic Unit) through a 32-bit data bus (orange). Unlike the previous buses, this data bus is spread out, with one wire through each column of the datapath. 
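To make the byte order in the example concrete, the snippet below packs the opcode and a little-endian 32-bit immediate with Python's standard `struct` module. It assumes the value 12345678 in the text is hexadecimal and that 05 is the one-byte opcode for adding an immediate to EAX, as the example states.

```python
import struct

# Opcode byte followed by a 32-bit little-endian immediate.
encoded = struct.pack("<BI", 0x05, 0x12345678)
print(encoded.hex(" "))  # 05 78 56 34 12

# The decoder path takes the single opcode byte, while the datapath
# receives the remaining four bytes as one 32-bit value.
opcode = encoded[0]
(immediate,) = struct.unpack("<I", encoded[1:5])
```

The split in the last two statements mirrors the two paths described above: one byte to the instruction decoder, four bytes to the ALU.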
This bus extends through the entire datapath so values can also be stored into registers. For instance, the MOV (move) instruction can store a value from an instruction (an "immediate" value) into a register. Conclusions The 386's prefetch queue contains about 7400 transistors, more than an Intel 8080 processor. (And this is just the queue itself; I'm ignoring the prefetch control logic.) This illustrates the rapid advance of processor technology: part of one functional unit in the 386 contains more transistors than an entire 8080 processor from 11 years earlier. And this unit is less than 3% of the entire 386 processor. Every time I look at an x86 circuit, I see the complexity required to support backward compatibility, and I gain more understanding of why RISC became popular. The prefetcher is no exception. Much of the complexity is due to the 386's support for unaligned memory accesses, requiring a byte shift network to move bytes into 32-bit alignment. Moreover, at the other end of the instruction bus is the complicated instruction decoder that decodes intricate x86 instructions. Decoding RISC instructions is much easier. In any case, I hope you've found this look at the prefetch circuitry interesting. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates. I've written multiple articles on the 386 previously; a good place to start might be my survey of the 386 dies. Footnotes and references The width of the circuitry for one bit changes a few times: while the prefetch queue and segment descriptor cache use a circuit that is 66 µm wide, the datapath circuitry is a bit tighter at 60 µm. The barrel shifter is even narrower at 54.5 µm per bit. Connecting circuits with different widths wastes space, since the wiring to connect the bits requires horizontal segments to adjust the spacing. But it also wastes space to use widths that are wider than needed.
Thus, changes in the spacing are rare, where the tradeoffs make it worthwhile. ↩ The Intel 8086 processor had a six-byte prefetch queue, while the Intel 8088 (used in the original IBM PC) had a prefetch queue of just four bytes. In comparison, the 16-byte queue of the 386 seems luxurious. (Some 386 processors, however, are said to only use 12 bytes due to a bug.) The prefetch queue assumes instructions are executed in linear order, so it doesn't help with branches or loops. If the processor encounters a branch, the prefetch queue is discarded. (In contrast, a modern cache will work even if execution jumps around.) Moreover, the prefetch queue doesn't handle self-modifying code. (It used to be common for code to change itself while executing to squeeze out extra performance.) By loading code into the prefetch queue and then modifying instructions, you could determine the size of the prefetch queue: if the old instruction was executed, it must be in the prefetch queue, but if the modified instruction was executed, it must be outside the prefetch queue. Starting with the Pentium Pro, x86 processors flush the prefetch queue if a write modifies a prefetched instruction. ↩ The prefetch unit generates "linear" addresses that must be translated to physical addresses by the paging unit (ref). ↩ I don't know which phase of the clock is phase 1 and which is phase 2, so I've assigned the numbers arbitrarily. The 386 creates four clock signals internally from a clock input CLK2 that runs at twice the processor's clock speed. The 386 generates a two-phase clock with non-overlapping phases. That is, there is a small gap between when the first phase is high and when the second phase is high. The 386's circuitry is controlled by the clock, with alternate blocks controlled by alternate phases. Since the clock phases don't overlap, this ensures that logic blocks are activated in sequence, allowing the orderly flow of data. 
But because the 386 uses CMOS, it also needs active-low clocks for the PMOS transistors. You might think that you could simply use the phase 1 clock as the active-low phase 2 clock and vice versa. The problem is that these clock phases overlap when used as active-low; there are times when both clock signals are low. Thus, the two clock phases must be explicitly inverted to produce the two active-low clock phases. I described the 386's clock generation circuitry in detail in this article. ↩ The Manchester carry chain is typically used in an adder, which makes it more complicated than shown here. In particular, a new carry can be generated when two 1 bits are added. Since we're looking at an incrementer, this case can be ignored. The Manchester carry chain was first described in Parallel addition in digital computers: a new fast ‘carry’ circuit. It was developed at the University of Manchester in 1959 and used in the Atlas supercomputer. ↩ For some reason, the incrementer uses a completely different XOR circuit from the comparator, built from a multiplexer instead of logic. In the circuit below, the two CMOS switches form a multiplexer: if the first input is 1, the top switch turns on, while if the first input is a 0, the bottom switch turns on. Thus, if the first input is a 1, the second input passes through and then is inverted to form the output. But if the first input is a 0, the second input is inverted before the switch and then is inverted again to form the output. Thus, the second input is inverted if the first input is 1, which is a description of XOR. The implementation of an XOR gate in the incrementer. I don't see any clear reason why two different XOR circuits were used in different parts of the prefetcher. Perhaps the available space for the layout made a difference. Or maybe the different circuits have different timing or output current characteristics. Or it could just be the personal preference of the designers. 
↩ The latch circuit is based on a CMOS switch (or transmission gate) and a weak inverter. Normally, the inverter loop holds the bit. However, if the CMOS switch is enabled, its output overpowers the signal from the weak inverter, forcing the inverter loop into the desired state. The CMOS switch consists of an NMOS transistor and a PMOS transistor in parallel. By setting the top control input high and the bottom control input low, both transistors turn on, allowing the signal to pass through the switch. Conversely, by setting the top input low and the bottom input high, both transistors turn off, blocking the signal. CMOS switches are used extensively in the 386, to form multiplexers, create latches, and implement XOR. ↩ Most of the 386's control circuitry is to the right of the datapath, rather than awkwardly wedged into the datapath. So why is this circuit different? My hypothesis is that since the circuit needs the values of bit 15 and bit 7, it made sense to put the circuitry next to bits 15 and 7; if this control circuitry were off to the right, long wires would need to run from bits 15 and 7 to the circuitry. ↩ In case this post is getting tedious, I'll provide a lighter footnote on sign extension. The obvious mnemonic for a sign extension instruction is SEX, but that mnemonic was too risque for Intel. The Motorola 6809 processor (1978) used this mnemonic, as did the related 68HC12 microcontroller (1996). However, Steve Morse, architect of the 8086, stated that the sign extension instructions on the 8086 were initially named SEX but were renamed before release to the more conservative CBW and CWD (Convert Byte to Word and Convert Word to Double word). The DEC PDP-11 was a bit contradictory. It has a sign extend instruction with the mnemonic SXT; the Jargon File claims that DEC engineers almost got SEX as the assembler mnemonic, but marketing forced the change. 
On the other hand, SEX was the official abbreviation for Sign Extend (see PDP-11 Conventions Manual, PDP-11 Paper Tape Software Handbook) and SEX was used in the microcode for sign extend. RCA's CDP1802 processor (1976) may have been the first with a SEX instruction, using the mnemonic SEX for the unrelated Set X instruction. See also this Retrocomputing Stack Exchange page. ↩ It seems inconvenient to send instructions all the way across the chip from the Bus Interface Unit to the prefetch queue and then back across the chip to the instruction decoder, which is next to the Bus Interface Unit. But this was probably the best alternative for the layout, since you can't put everything close to everything. The 32-bit datapath circuitry is on the left, organized into 32 columns. It would be nice to put the Bus Interface Unit over there too, but there isn't room, so you end up with the wide 32-bit data bus going across the chip. Sending instruction bytes across the chip is less of an impact, since the instruction bus is just 8 bits wide. ↩ See "Performance Optimizations of the 80386", Slager, Oct 1986, in Proceedings of ICCD, pages 165-168. ↩

yesterday 4 votes
Code Matters

It looks like the code that the newly announced Figma Sites is producing isn’t the best. There are some cool Figma-to-WordPress workflows; I hope Sites gets more people exploring those options.

2 days ago 5 votes
What got you here…

John Siracusa: Apple Turnover From virtue comes money, and all other good things. This idea rings in my head whenever I think about Apple. It’s the most succinct explanation of what pulled Apple from the brink of bankruptcy in the 1990s to its astronomical success today. Don’

2 days ago 3 votes
A single RGB camera turns your palm into a keyboard for mixed reality interaction

Interactions in mixed reality are a challenge. Nobody wants to hold bulky controllers and type by clicking on big virtual keys one at a time. But people also don’t want to carry around dedicated physical keyboard devices just to type every now and then. That’s why a team of computer scientists from China’s Tsinghua University […] The post A single RGB camera turns your palm into a keyboard for mixed reality interaction appeared first on Arduino Blog.

2 days ago 3 votes