This project was completed in June 2015. Wow, is it 2017 already? In this post, I’d like to say that I wrote a useful little bit of software and built up a crappy hack to demonstrate it but, secretly, the crappy hack came first and I’ve retroactively found something vaguely useful in it. Let’s go: Let there be lights We moved house and needed some temporary* under-cupboard lighting for our transitional* 1970s kitchen. Why buy purpose-designed, expensive and great-looking strip lighting when I can instead hack them together myself using hot glue and scrap wire? Then, I can value-add by using an ESP8266 module to make the lights remote-controllable! I chose Open Sound Control (OSC) for this, which is traditionally used for media signalling, i.e. ‘a better MIDI, over the network’. I don’t know why I did this instead of using something like Blynk or an HTTP-based control page. At least I can control my kitchen lights from Ableton Live if the need arises. That need has not arisen so...
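For a flavour of what goes over the wire, here is a rough sketch of building a minimal OSC message in C. The "/kitchen/lights" address and its single integer argument are made up for illustration, and actually sending the resulting UDP datagram is left out:

```c
/* Rough sketch of a minimal OSC message on the wire. The address and the
 * single int32 argument are invented, not from the project. */
#include <stdint.h>
#include <string.h>

/* Build "<address> ,i <int32>" into buf; returns the number of bytes to send. */
static size_t osc_build_int(uint8_t *buf, const char *address, int32_t arg)
{
    size_t n = strlen(address) + 1;      /* address string incl. terminator   */
    memcpy(buf, address, n);
    while (n % 4) buf[n++] = 0;          /* OSC pads strings to 4-byte bounds */

    memcpy(buf + n, ",i\0\0", 4);        /* type tag string: one int32 arg    */
    n += 4;

    buf[n++] = (uint8_t)(arg >> 24);     /* arguments are big-endian          */
    buf[n++] = (uint8_t)(arg >> 16);
    buf[n++] = (uint8_t)(arg >> 8);
    buf[n++] = (uint8_t)(arg >> 0);
    return n;                            /* send this as one UDP datagram     */
}

/* e.g. osc_build_int(buf, "/kitchen/lights", 1); then fire it at the ESP8266. */
```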



MicroMac, a Macintosh for under £5

A microcontroller Macintosh This all started from a conversation about the RP2040 MCU, and building a simple desktop/GUI for it. I’d made a comment along the lines of “or, just run some old OS”, and it got me thinking about the original Macintosh. The original Macintosh was released 40.5 years before this post, and is a pretty cool machine especially considering that the hardware is very simple. Insanely Great and folklore.org are fun reads, and give a glimpse into the Macintosh’s development. Memory was a squeeze; the original 128KB version was underpowered and only sold for a few months before being replaced by the Macintosh 512K, arguably a more appropriate amount of memory. But, the 128 still runs some real applications and, though it pre-dates MultiFinder/actual multitasking, I found it pretty charming. As a tourist. In 1984 the Mac cost roughly 1/3 as much as a VW Golf and, as someone who’s into old computers and old cars, it’s hard to decide which is more frustrating to use. So back to this £3.80 RPi Pico microcontroller board: The RP2040’s 264KB of RAM gives a lot to play with after carving out the Mac’s 128KB – how cool would it be to do a quick hack, and play with a Mac on it? Time passes. A lot of time. But I totally delivered on the janky hack front: You won’t believe that this quality item didn’t take that long to build. So the software was obviously the involved part, and turned into work on 3 distinct projects. This post is going to be a “development journey” story, as a kind of code/design/venting narrative. If you’re just here for the pictures, scroll along! What is pico-mac? A Raspberry Pi RP2040 microcontroller (on a Pico board), driving monochrome VGA video and taking USB keyboard/mouse input, emulating a Macintosh 128K computer and disc storage. The RP2040 has easily enough RAM to house the Mac’s memory, plus that of the emulator; it’s fast enough (with some tricks) to meet the performance of the real machine, has USB host capability, and the PIO department makes driving VGA video fairly uneventful (with some tricks). The basic Pico board’s 2MB of flash is plenty for a disc image with OS and software. Here’s the Pico MicroMac in action, ready for the paperless office of the future: The Pico MicroMac RISC CISC workstation of the future I hadn’t really used a Mac 128K much before; a few clicks on a museum machine once. But I knew they ran MacDraw, and MacWrite, and MacPaint. All three of these applications are pretty cool for a 128K machine; a largely WYSIWYG word processor with multiple fonts, and a vector drawing package. A great way of playing with early Macintosh system software, and applications of these wonderful machines is via https://infinitemac.org, which has shrinkwrapped running the Mini vMac emulator by emscriptening it to run in the browser. Highly recommended, lots to play with. As a spoiler, MicroMac does run MacDraw, and it was great to play with it on “real fake hardware”: (Do you find “Pico Micro Mac” doesn’t really scan? I didn’t think this taxonomy through, did I?) GitHub links are at the bottom of this page: the pico-mac repo has construction directions if you want to build your own! The journey Back up a bit. I wasn’t committed to building a Pico thing, but was vaguely interested in whether it was feasible, so started tinkering with building a Mac 128K emulator on my normal computer first. The three rules I had a few simple rules for this project: It had to be fun. It’s OK to hack stuff to get it working, it’s not as though I’m being paid for this. 
I like writing emulation stuff, but I really don’t want to learn 68K assembler, or much about the 68K. There’s a lot of love for 68K out there and that’s cool, but meh I don’t adore it as a CPU. So, right from the outset I wanted to use someone else’s 68K interpreter – I knew there were loads around. Similarly, there are a load of OSes whose innards I’d like to learn more about, but the shittiest early Mac System software isn’t high on the list. Get in there, emulate the hardware, boot the OS as a black box, done. I ended up breaking 2 of and sometimes all 3 of these rules during this project. The Mac 128K The machines are generally pretty simple, and of their time. I started with schematics and Inside Macintosh, PDFs of which covered various details of the original Mac hardware, memory map, mouse/keyboard, etc. https://tinkerdifferent.com/resources/macintosh-128k-512k-schematics.79/ https://vintageapple.org/inside_o/ Inside Macintosh Volumes I-III are particularly useful for hardware information; also Guide to Macintosh Family Hardware 2nd Edition. The Macintosh has: A Motorola 68000 CPU running at 7.whatever MHz roughly 8MHz Flat memory, decoded into regions for memory-mapped IO going to the 6522 VIA, the 8530 SCC, and the IWM floppy controller. (Some of the address decoding is a little funky, though.) Keyboard and mouse hang off the VIA/SCC chips. No external interrupt controller: the 68K has 3 IRQ lines, and there are 3 IRQ sources (VIA, SCC, programmer switch/NMI). “No slots” or expansion cards. No DMA controller: a simple autonomous PAL state machine scans video (and audio samples) out of DRAM. Video is fixed at 512x342 1BPP. The only storage is an internal FDD (plus an external drive), driven by the IWM chip. The first three Mac models are extremely similar: The Mac 128K and Mac 512K are the same machine, except for RAM. The Mac Plus added SCSI to a convenient space in the memory map and an 800K floppy drive, which is double-sided whereas the original was a single 400K side. The Mac Plus ROM also supports the 128K/512K, and was an upgrade to create the Macintosh 512Ke. ‘e’ for Extra ROM Goodness. The Mac Plus ROM supports the HD20 external hard disc, and HFS, and Steve Chamberlin has annotated a disassembly of it. This was the ROM to use: I was making a Macintosh 128Ke. Mac emulator: umac After about 8 minutes of research, I chose the Musashi 68K interpreter. It’s C, simple to interface to, and had a simple out-of-box example of a 68K system with RAM, ROM, and some IO. Musashi is structured to be embedded in bigger projects: wire in memory read/write callbacks, a function to raise an IRQ, call execute in a loop, done. I started building an emulator around it, which ultimately became the umac project. The first half (of, say, five halves) went pretty well: A simple commandline app loading the ROM image, allocating RAM, providing debug messages/assertions/logging, and configuring Musashi. Add address decoding: CPU reads/writes are steered to RAM, or ROM. The “overlay” register lets the ROM boot at 0x00000000 and then trampoline up to a high ROM mirror after setting up CPU exception vectors – this affects the address decoding. This is done by poking a VIA register, so decoded just that bit of that register for now. At this point, the ROM starts running and accessing more non-existent VIA and SCC registers. Added more decoding and a skeleton for emulating these devices elsewhere – the MMIO read/writes are just stubbed out. 
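For reference, the integration is roughly this shape: Musashi expects the embedder to supply memory-access callbacks and to drive the execute loop itself. A sketch follows, with function names as I remember them from m68k.h and a hugely simplified address decode (not umac's actual code):

```c
/* A sketch of embedding Musashi: memory callbacks plus an execute loop.
 * The address map below is hugely simplified; the real Mac decode (and the
 * overlay behaviour) is fiddlier. */
#include <stdint.h>
#include "m68k.h"

#define RAM_SIZE (128u * 1024u)
#define ROM_SIZE (128u * 1024u)
static uint8_t ram[RAM_SIZE];
static uint8_t rom[ROM_SIZE];
static int overlay = 1;           /* ROM mirrored at 0 until the VIA bit flips */

/* Musashi calls these for every CPU access: steer to RAM, ROM or MMIO. */
unsigned int m68k_read_memory_8(unsigned int addr)
{
    if (overlay && addr < ROM_SIZE)
        return rom[addr];                         /* boot-time ROM-at-zero    */
    if (addr < 0x400000)
        return ram[addr % RAM_SIZE];              /* RAM region (mirrored)    */
    if (addr < 0x500000)
        return rom[(addr - 0x400000) & (ROM_SIZE - 1)];
    /* Otherwise: VIA/SCC/IWM register space, stubbed out to start with.      */
    return 0xff;
}

void m68k_write_memory_8(unsigned int addr, unsigned int val)
{
    if (addr < 0x400000 && !(overlay && addr < ROM_SIZE))
        ram[addr % RAM_SIZE] = (uint8_t)val;
    /* else: MMIO write, e.g. the VIA bit that clears 'overlay'.              */
}

/* 16/32-bit accessors built from the byte versions (68K is big-endian).      */
unsigned int m68k_read_memory_16(unsigned int a)
{ return (m68k_read_memory_8(a) << 8) | m68k_read_memory_8(a + 1); }
unsigned int m68k_read_memory_32(unsigned int a)
{ return (m68k_read_memory_16(a) << 16) | m68k_read_memory_16(a + 2); }
void m68k_write_memory_16(unsigned int a, unsigned int v)
{ m68k_write_memory_8(a, v >> 8); m68k_write_memory_8(a + 1, v & 0xff); }
void m68k_write_memory_32(unsigned int a, unsigned int v)
{ m68k_write_memory_16(a, v >> 16); m68k_write_memory_16(a + 2, v & 0xffff); }

int main(void)
{
    /* ...load the ROM image into rom[] here... */
    m68k_init();
    m68k_set_cpu_type(M68K_CPU_TYPE_68000);
    m68k_pulse_reset();
    for (;;)
        m68k_execute(10000);      /* run in slices; raise IRQs via m68k_set_irq() */
}
```

The real umac decode also has to handle the funky address aliasing and the VIA/SCC/IWM devices described above.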
There are some magic addresses that the ROM accesses that “miss” documented devices: there’s a manufacturing test option that probes for a plugin (just thunk it), and then we witness the RAM size probing. The Mac Plus ROM is looking for up to 4MB of RAM. In the large region devoted to RAM, the smaller amount of actual RAM is mirrored over and over, so the probe writes a magic value at high addresses and spots where it starts to wrap around. RAM is then initialised and filled with a known pattern. This was an exciting point to get to because I could dump the RAM, convert the region used for the video framebuffer into an image, and see the “diagonal stripe” pattern used for RAM testing! “She’s alive!” Not all of the device code enjoyed reading all zeroes, so there was a certain amount of referring to the disassembly and returning, uh, 0xffffffff sometimes to push it further. The goal was to get it as far as accessing the IWM chip, i.e. trying to load the OS. After seeing some IWM accesses there and returning random rubbish values, the first wonderful moment was getting the “Unknown Disc” icon with the question mark – real graphics! The ROM was REALLY DOING SOMETHING! I think I hadn’t implemented any IRQs at this point, and found the ROM in an infinite loop: it was counting a few Vsyncs to delay the flashing question mark. Diversion into a better VIA, with callbacks for GPIO register read/write, and IRQ handling. This also needed to wire into Musashi’s IRQ functions. This was motivating to get to – remembering rule #1 – and “graphics”, even though via a manual memory dump/ImageMagick conversion, was great. I knew the IWM was an “interesting” chip, but didn’t know details. I planned to figure it out when I got there (rule #1). IWM, 68K, and disc drivers My god, I’m glad I put IWM off until this point. If I’d read the “datasheet” (vague register documentation) first, I’d’ve just gone to the pub instead of writing this shitty emulator. IWM is very clever, but very very low-level. The disc controllers in other contemporary machines, e.g. WD1770, abstract the disc physics. At one level, you can poke regs to step to track 17 and then ask the controller to grab sector 3. Not so with IWM: first, the discs are Constant Linear Velocity, meaning the angular rotation needs to change appropriate to whichever track you’re on, and second the IWM just gives the CPU a firehose of crap from the disc head (with minimal decoding). I spent a while reading through the disassembly of the ROM’s IWM driver (breaking rule #2 and rule #1): there’s some kind of servo control loop where the driver twiddles PWM values sent to a DAC to control the disc motor, measured against a VIA timer reference to do some sort of dynamic rate-matching to get the correct bitrate from the disc sectors. I think once it finds the track start it then streams the track into memory, and the driver decodes the symbols (more clever encoding) and selects the sector of interest. I was sad. Surely Basilisk II and Mini vMac etc. had solved this in some clever way – they emulated floppy discs. I learned they do not, and do the smart engineering thing instead: avoid the problem. The other emulators do quite a lot of ROM patching: the ROM isn’t run unmodified. You can argue that this then isn’t a perfect hardware emulation if you’re patching out inconvenient parts of the ROM, but so what. I suspect they were also abiding by a rule #1 too. I was going to do the same: I figured out a bit of how the Mac driver interface works (gah, rule #3!) 
and understood how the other emulators patched this. They use a custom paravirtualised 68K driver which is copied over the ROM’s IWM driver, servicing .Sony requests from the block layer and routing them to more convenient host-side code to manage the requests. Basilisk II uses some custom 68K opcodes and a simple driver, and Mini vMac a complex driver with trappy accesses to a custom region of memory. I reused the Basilisk II driver but converted it to access a trappy region (easier to route: just emulate another device). The driver callbacks land in the host/C side and some cut-down Basilisk II code interprets the requests and copies data to/from the OS-provided buffers. Right now, all I needed was to read blocks from one disc: I didn’t need different formats (or even write support), or multiple drives, or ejecting/changing images. Getting the first block loaded from disc took waaaayyy longer than the first part. And, I’d had to learn a bit of 68K (gah), but just in the nick of time I got a Happy Mac icon as the System software started to load. This was still a simple Linux commandline application, with zero UI. No keyboard or mouse, no video. Time to wrap it in an SDL2 frontend (the unix_main test build in the umac project), and I could watch the screen redraw live. I hadn’t coded the 1Hz timer interrupt into the VIA, and after adding that it booted to a desktop! The first boot As an aside, I try to create a dual-target build for all my embedded projects, with a native host build for rapid prototyping/debugging; libSDL instead of an LCD. It means I don’t need to code at the MCU, so I can code in the garden. :) Next was mouse support. Inside Macintosh and the schematics show how it’s wired, to the VIA (good) and the SCC (a beast). The SCC is my second least-favourite chip in this machine; it’s complex and the datasheet/manual seems to be intentionally written to hide information, piss off readers, get one back at the world. (I didn’t go near the serial side, its main purpose, just external IRQ management. But, it’ll do all kinds of exciting 1980s line coding schemes, offloading bitty work from the CPU. It was key for supporting things like AppleTalk.) Life was almost complete at this point; with a working mouse I could build a new disc image (using Mini vMac, an exercise in itself) with Missile Command. This game is pretty fun for under 10KB on disc. So: Video works Boots from disc Mouse works, Missile Command I had no keyboard, but it’s largely working now. Time to start on sub-project numero due: Hardware and RP2040 Completely unrelated to umac, I built up a circuit and firmware with two goals: Display 512x342x1 video to VGA with minimal components, Get the TinyUSB HID example working and integrated. This would just display a test image copied to a framebuffer, and printf() keyboard/mouse events, as a PoC. The video portion was fun: I’d done some I2S audio PIO work before, but here I wanted to scan out video and arbitrarily control Vsync/Hsync. Well, to test I needed a circuit. VGA wants 0.7V max on the video R,G,B signals and (mumble, some volts) on the syncs. The R,G,B signals are 75Ω to ground: with some maths (the three 75Ω loads in parallel make 25Ω, so 3.3V × 25/125 ≈ 0.66V), a 3.3V GPIO driving all three through a 100Ω resistor is roughly right. The day I started soldering it together I needed a VGA connector. I had a DB15 but wanted it for another project, and felt bad about cutting up a VGA cable. But when I took a walk at lunchtime, no shitting you, I passed some street cables. I had a VGA cable – the rust helps with the janky aesthetic.
Free VGA cable The VGA PIO side was pretty fun. It ended up as PIO reading config info dynamically to control Hsync width, display position, and so on, and then some tricks with DMA to scan out the config info interleaved with framebuffer data. By shifting the bits in the right direction and by using the byteswap option on the RP2040 DMA, the big-endian Mac framebuffer can be output directly without CPU-side copies or format conversion. Cool. This can be fairly easily re-used in other projects: see video.c. But. I ended up (re)writing the video side three times in total: First version had two DMA channels writing to the PIO TX FIFO. The first would transfer the config info, then trigger the second to transfer video data, then raise an IRQ. The IRQ handler would then have a short time (the FIFO depth!) to choose a new framebuffer address to read from, and reprogram DMA. It worked OK, but was highly sensitive to other activity in the system. First and most obvious fix is that any latency-sensitive IRQ handler must have the __not_in_flash_func() attribute so as to run out of RAM. But even with that, the design didn’t give much time to reconfigure the DMA: random glitches and blanks occurred when moving the mouse rapidly. Second version did double-buffering with the goal of making the IRQ handler’s job trivial: poke in a pre-prepared DMA config quickly, then after the critical rush calculate the buffer to use for next time. Lots better, but still some glitches under some high load. Even weirder, it’d sometimes just blank out completely, requiring a reset. This was puzzling for a while; I ended up printing out the PIO FIFO’s FDEBUG register to try to catch the bug in the act. I saw that the TXOVER overflow flag was set, and this should be impossible: the FIFOs pull data from DMA on demand with DMA requests and a credited flow-contr…OH WAIT. If credits get messed up or duplicated, too many transfers can happen, leading to an overflow at the receiver side. Well, I’d missed a subtle rule in the RP2040 DMA docs: Another caveat is that multiple channels should not be connected to the same DREQ. So the third version…… doesn’t break this rule, and is more complicated as a result: One DMA channel transfers to the PIO TX FIFO Another channel programs the first channel to send from the config data buffer A third channel programs the first to send the video data The programming of the first triggers the corresponding “next reprogram me” channel The nice thing – aside from no lock-ups or video corruption – is that this now triggers a Hsync IRQ during the video line scan-out, greatly relaxing the deadline of reconfiguring the DMA. I’d like to further improve this (with yet another DMA channel) to transfer without an IRQ per line, as the current IRQ overhead of about 1% of CPU time can be avoided. (It would’ve been simpler to just hardwire the VGA display timing in the PIO code, but I like (for future projects) being able to dynamically-reconfigure the video mode.) So now we have a platform and firmware framework to embed umac into, HID in and video out. The hardware’s done, fuggitthat’lldo, let’s throw it over to the software team: How it all works Back to emulating things A glance at the native umac binary showed a few things to fix before it could run on the Pico: Musashi constructed a huge opcode decode jumptable at runtime, in RAM. It’s never built differently, and never changes at runtime. I added a Musashi build-time generator so that this table could be const (and therefore live in flash). 
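To make the channel-chaining idea a bit more concrete, here is a cut-down, pico-sdk-flavoured sketch: one channel feeds the PIO TX FIFO (it alone owns the DREQ, and uses the byteswap), and a second "control" channel re-points and retriggers it. This is two channels rather than pico-mac's full arrangement, and the buffer sizes and names are purely illustrative:

```c
/* Sketch only: one data channel paced by the PIO TX DREQ, plus a control
 * channel that reprograms and retriggers it. Only the data channel is tied
 * to the DREQ, respecting the "one channel per DREQ" caveat. */
#include "hardware/dma.h"
#include "hardware/pio.h"

#define WORDS_PER_LINE (512 / 32)                 /* one 1bpp line, 32b words */
static uint32_t framebuffer[342][WORDS_PER_LINE]; /* illustrative buffer      */
static const uint32_t *line_ptr;                  /* next line's read address */

static void video_dma_init(PIO pio, uint sm)
{
    int data_ch = dma_claim_unused_channel(true);
    int ctrl_ch = dma_claim_unused_channel(true);

    /* Data channel: framebuffer -> PIO TX FIFO. bswap fixes up the Mac's
     * big-endian words on the fly, so no CPU-side copy or conversion. */
    dma_channel_config c = dma_channel_get_default_config(data_ch);
    channel_config_set_transfer_data_size(&c, DMA_SIZE_32);
    channel_config_set_read_increment(&c, true);
    channel_config_set_write_increment(&c, false);
    channel_config_set_bswap(&c, true);
    channel_config_set_dreq(&c, pio_get_dreq(pio, sm, true));
    channel_config_set_chain_to(&c, ctrl_ch);     /* done? trigger reprogram  */
    dma_channel_configure(data_ch, &c, &pio->txf[sm], NULL, WORDS_PER_LINE, false);

    /* Control channel: writes the next line's address into the data channel's
     * READ_ADDR trigger alias, which restarts it. No DREQ, no tight deadline. */
    dma_channel_config cc = dma_channel_get_default_config(ctrl_ch);
    channel_config_set_transfer_data_size(&cc, DMA_SIZE_32);
    channel_config_set_read_increment(&cc, false);
    channel_config_set_write_increment(&cc, false);
    dma_channel_configure(ctrl_ch, &cc,
                          &dma_hw->ch[data_ch].al3_read_addr_trig, /* write    */
                          &line_ptr,                               /* read     */
                          1, false);

    /* An IRQ handler (or yet another channel) advances line_ptr between lines. */
    line_ptr = &framebuffer[0][0];
    dma_channel_start(ctrl_ch);                   /* kick off the first line  */
}
```

In the real thing the PIO config words are interleaved with the pixel data, and the Hsync IRQ updates the next-line pointer during scan-out, well before the deadline.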
The disassembler was large, and not going to be used on the Pico, so another option to build without. Musashi tries to accurately count execution cycles for each instruction, with more large lookup tables. Maybe useful for console games, but the Mac doesn’t have the same degree of timing sensitivity. REMOVED. (This work is in my small-build branch.) pico-mac takes shape, with the ROM and disc image in flash, and enjoyably it now builds and runs on the Pico! With some careful attention to not shoving stuff in RAM, the RAM use is looking pretty good. The emulator plus HID code is using about 35-40KB on top of the Mac’s 128KB RAM area – there’s 95+KB of RAM still free. This was a good time to finish off adding the keyboard support to umac. The Mac keyboard is interfaced serially through the VIA ‘shift register’, a basic synchronous serial interface. This was logically simple, but frustrating because early attempts at replying to the ROM’s “init” command just were persistently ignored. The ROM disassembly was super-useful again: reading the keyboard init code, it looked like a race condition in interrupt acknowledgement if the response byte appears too soon after the request is sent. Shoved in a delay to hold off a reply until a later poll, and then it was just a matter of mapping keycodes (boooooorrrriiiiing). With a keyboard, the end-of-level MacWrite boss is reached: One problem though: it totally sucked. It was suuuuper slow. I added a 1Hz dump of instruction count, and it was doing about 300 KIPS. The 68000 isn’t an amazing CPU in terms of IPC. Okay, there are some instructions that execute in 4 cycles. But you want to use those extravagant addressing modes don’t you, and touching memory is spending those cycles all over the place. Not an expert, but targeting about 1 MIPS for an about 8MHz 68000 seems right. Only 3x improvement needed. Performance I didn’t say I wasn’t gonna cheat: let’s run that Pico at 250MHz instead of 125MHz. Okay better, but not 2x better. From memory, only about 30% better. Damn, no free lunch today. Musashi has a lot of configurable options. My first goal was to get its main loop (as seen from disassembly/post-compile end!) small: the Mac doesn’t report Bus Errors, so the registers don’t need copies for unwinding. The opcodes are always fetched from a 16b boundary, so don’t need alignment checking, and can use halfword loads (instead of two byte loads munged into a halfword!). For the Cortex-M0+/armv6m ISA, reordering some of the CPU context structure fields enabled immediate-offset access and better code. The CPU type, mysteriously, was dynamically-changeable and led to a bunch of runtime indirection. Looking better, maybe 2x improvement, but not enough. Missile Command was still janky and the mouse wasn’t smooth! Next, some naughty/dangerous optimisations: remove address alignment checking, because unaligned accesses don’t happen in this constrained environment. (Then, this work is in my umac-hacks branch.) But the real perf came from a different trick. First, a diversion! RP2040 memory access The RP2040 has fast RAM, which is multi-banked so as to allow generally single-cycle access to multiple users (2 CPUs, DMA, etc.). Out of the box, most code runs via XIP from external QSPI flash. The QSPI usually runs at the core clock (125MHz default), but has a latency of ~20 cycles for a random word read. 
The RP2040 uses a relatively simple 16KB cache in front of the flash to protect you from horrible access latency, but the more code you have the more likely you are to call a function and have to crank up QSPI. When overclocking to 250MHz, the QSPI can’t go that fast so stays at 125MHz (I think). Bear in mind, then, that your 20ish QSPI cycles on a miss become 40ish CPU cycles. The particular rock-and-a-hard-place here is that Musashi build-time generates a ton of code, a function for each of its 1968 opcodes, plus that 256KB opcode jumptable. Even if we make the inner execution loop completely free, the opcode dispatch might miss in the flash cache, and the opcode function itself too. (If we want to get 1 MIPS out of about 200 MIPS, a few of these delays are going to really add up.) The __not_in_flash_func() attribute can be used to copy a given function into RAM, guaranteeing fast execution. At the very minimum, the main loop and memory accessors are decorated: every instruction is going to access an opcode and most likely read or write RAM. This improves performance a few percent. Then, I tried decorating whole classes of opcodes: move is frequent, as are branches, so put ‘em in RAM. This helped a lot, but the remaining free RAM was used up very quickly, and I wasn’t at my goal of much above 1 MIPS. Remember that RISC architecture is gonna change everything? We want to put some of those 1968 68K opcodes into RAM to make them fast. What are the top 10 most often-used instructions? Top 100? By adding a 64K table of counters to umac, booting the Mac and running key applications (okay, playing Missile Command for a bit), we get a profile of dynamic instruction counts. It turns out that the 100 hottest opcodes (5% of the total) account for 89% of the execution. And the top 200 account for a whopping 98% of execution. Armed with this profile, the umac build post-processes the Musashi auto-generated code and decorates the top 200 functions with __not_in_flash_func(). This adds only 17KB of extra RAM usage (leaving 95KB spare), and hits about 1.4 MIPS! Party on! At last, the world can enjoy Missile Command’s dark subject matter in performant comfort: Missile Command on pico-mac What about MacPaint? Everyone loves MacPaint. Maybe you love MacPaint, and have noticed I’ve deftly avoided mentioning it. Okay, FINE: It doesn’t run on a Mac 128Ke, because the Mac Plus ROM uses more RAM than the original. :sad-face: I’d seen this thread on 68kMLA about a “Mac 256K”: https://68kmla.org/bb/index.php?threads/the-mythical-mac-256k.46149/ Chances are that the Mac 128K was really a Mac 256K in the lab (or maybe even intended to have 256K and cost-cut before release), as the OS functions fine with 256KB. I wondered, does the Mac ROM/OS need a power-of-two amount of RAM? If not, I have that 95K going spare. Could I make a “Mac 200K”, and then run precious MacPaint? Well, I tried a local hack that patches the ROM to update its global memTop variable based on a given memory size, and yes, System 3.2 is happy with non-power-of-2 sizes. I booted with 256K, 208K, and 192K. However, there were some additional problems to solve: the ROM memtest craps itself without a power-of-2 size (totally fair), and NOPping that out leads to other issues. These can be fixed, though also some parts of boot access off the end of RAM. A power-of-2 size means a cheap address mask wraps RAM accesses to the valid buffer, and that can’t be done with 192K. 
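As a tiny illustration of why the power-of-two size matters on the emulator's hottest path (sizes and names here are illustrative, not umac's actual code):

```c
/* Illustrative only: with a power-of-two RAM size, wrapping the Mac's
 * mirrored RAM region onto the real buffer is a single AND per access. */
#include <stdint.h>

#define RAM_SIZE (128u * 1024u)          /* must be a power of two */
static uint8_t ram[RAM_SIZE];

static inline uint8_t *ram_ptr(uint32_t addr)
{
    return &ram[addr & (RAM_SIZE - 1)];  /* cheap mask, no branch */
}

/* With 192KB the mask trick no longer works: every access would need a
 * compare/subtract (or worse, a modulo), and code that runs off the end of
 * RAM during boot would need special-casing too. */
```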
Unfortunately, when I then tested MacPaint it still wouldn’t run because it wanted to write a scratch file to the read-only boot volume. This is totally breaking rule #1 by this point, so we are staying with 128KB for now. However, a 256K MicroMac is extremely possible. We just need an MCU with, say, 300KB of RAM… Then we’d be cooking on gas. Goodbye, friend Well, dear reader, this has been a blast. I hope there’s been something fun here for ya. Ring off now, caller! The MicroMac! HDMI monitor, using a VGA-to-HDMI box umac screenshot System 3.2, Finder 5.3 Performance tuning Random disc image working OK Resources https://github.com/evansm7/umac https://github.com/evansm7/pico-mac https://www.macintoshrepository.org/7038-all-macintosh-roms-68k-ppc- https://winworldpc.com/product/mac-os-0-6/system-3x https://68kmla.org/bb/index.php?threads/macintosh-128k-mac-plus-roms.4006/ https://docs.google.com/spreadsheets/d/1wB2HnysPp63fezUzfgpk0JX_b7bXvmAg6-Dk7QDyKPY/edit#gid=840977089

Classical virtualisation rules applied to RISC-style atomics

In 1974, Gerald Popek and Robert Goldberg published a paper, “Formal Requirements for Virtualizable Third Generation Architectures”, giving a set of characteristics for correct full-machine virtualisation. Today, these characteristics remain very useful. Computer architects will informally cite this paper when debating Instruction Set Architecture (ISA) developments, with arguments like “but that’s not Popek & Goldberg-compliant!” In this post I’m looking at one aspect of computer architecture evolution since 1974, and observing how RISC-style atomic operations provide some potential virtualisation gotchas for both programmers and architects. Principles of virtualisation First, some virtualisation context, because it’s fun! A key P&G requirement is that of equivalence: it’s reasonable to expect software running under virtualisation to have the same behaviour as running it bare-metal! This property is otherwise known as correctness. :-) P&G classify instructions as being sensitive if they behave differently when running at a lower privilege level (i.e. the program can detect that it is being run in a different manner). An ISA is said to be classically virtualisable if: Sensitive instructions are privileged, and Privileged instructions executed at a lower privilege level can be trapped to a higher level of privilege. For a classically-virtualisable system, perfect equivalence can then be achieved by running software at a lower than usual level of privilege, trapping all privileged/sensitive instructions, and emulating their behaviour in a VMM. That is, if the design of the ISA ensures that all “sensitive” instructions can be trapped, it’s possible to ensure the logical execution of the software cannot be different to running bare-metal. This virtualisation technique is called “privilege compression”. Note: This applies recursively, running OS-level software with user privilege, or hypervisor-level software at OS/user privilege. Popek & Goldberg formalise this too, giving properties required for correct nested virtualisation. System/360 and PowerPC are both classically virtualisable, almost as though IBM thought about this. ;-) Equivalent virtualisation can be achieved by: Running an OS in user mode (privilege compression, for CPU virtualisation), Catching traps (to supervisor mode/HV) when the guest OS performs a privileged operation, In the hypervisor, operating on a software-maintained “shadow” of what would have been the guest OS’s privileged CPU state were it running bare-metal. Constructing shadow address translations (for memory virtualisation). Linux’s KVM support on PowerPC includes a “PR” feature, which does just this: for CPUs without hardware virtualisation, guests are run in user mode (or “PRoblem state” in IBM lingo). Note: It is key that the hypervisor can observe and control all of the guest’s state. Today, most systems address the performance impact of all of this trap-and-emulate by providing hardware CPU and memory virtualisation (e.g. user, OS and hypervisor execution privilege levels, with nested page tables). But, classically virtualisable ISA design remains important for clear reasoning about isolation between privilege levels and composability of behaviours. Computers in 1974 were ~all CISC All computers in 1974 were available in corduroy with a selection of Liberty-print input devices. All consoles had ashtrays (not even joking tbh). 
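Before moving on, here is a toy sketch of that privilege-compression loop: a trap handler emulating the guest's privileged instructions against shadow state. The op names, decode and state layout are all invented for illustration; real VMMs (and KVM-PR) are rather more involved:

```c
/* Toy illustration of trap-and-emulate under privilege compression.
 * Everything here is hypothetical: the guest runs deprivileged, each
 * privileged instruction traps, and the VMM emulates it against a shadow
 * copy of what the guest's privileged state "would be" on bare metal. */
#include <stdint.h>

typedef enum { OP_READ_PRIV_REG, OP_WRITE_PRIV_REG, OP_RETURN_FROM_INT } priv_op;

struct shadow {                 /* the guest's view of its privileged state */
    uint32_t status;            /* e.g. a machine-state/interrupt-enable reg */
    uint32_t saved_pc;          /* exception save/restore registers          */
    uint32_t saved_status;
};

struct guest_regs {
    uint32_t gpr[32];
    uint32_t pc;
};

/* Called when the deprivileged guest traps on a privileged instruction,
 * already decoded by the caller into an op plus a register number. */
static void emulate_priv_op(struct guest_regs *r, struct shadow *s,
                            priv_op op, int reg)
{
    switch (op) {
    case OP_READ_PRIV_REG:          /* guest reads its "MSR": serve the shadow */
        r->gpr[reg] = s->status;
        break;
    case OP_WRITE_PRIV_REG:         /* guest writes it: update the shadow only */
        s->status = r->gpr[reg];
        break;
    case OP_RETURN_FROM_INT:        /* guest's return-from-interrupt           */
        s->status = s->saved_status;
        r->pc = s->saved_pc;
        return;                     /* pc already set, don't step past         */
    }
    r->pc += 4;                     /* step past the emulated instruction      */
}
```

The point is that every architecturally visible effect of the privileged instruction is reproduced from state the hypervisor owns, which is exactly what the hidden reservation state below breaks.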
Architecture-wise, IBM was working on early RISC concepts leading to the 801, but most of the industry was on a full-steam trajectory to peak CISC (VAX) in the late 1970s. It’s fair to say that “CISC” wasn’t even a thing yet; instruction sets were just complex. P&G’s paper considered three contemporary computers: IBM System/360 Honeywell 6000 DEC PDP-10 CISC atomic operations and synchronisation primitives These machines had composite/“read-modify-write” atomic operations, similar to those in today’s x86 architectures. System/360 had compare-and-swap, locked operations (read-operate-write), test-and-set, and PDP-10 had EXCHange/swap. These kinds of instructions are not sensitive so, unless the addressed memory is privileged, atomic operations can be performed inside virtual machines without the hypervisor needing to know. Atomic operations in RISC machines Many RISC machines support multi-instruction synchronisation sequences built up around two instruction primitives: Load-and-set-reservation Store-conditional MIPS called these load-linked (LL) and store-conditional (SC), and I’ll use these terms. ARMv8 has LDXR/STXR. PowerPC has LWARX/STWCX. RISC-V has LR/SC. Many machines (such as ARMv8-LSE) also add composite operations such as CAS or atomic addition but still retain the base LL/SC mechanism, and sizes/acquire/release variants are often provided. The concept is that the LL simultaneously loads a value and sets a “reservation” covering the address in question, and a subsequent SC succeeds only if the reservation is still present. A conflicting write to the location (e.g. a store on another CPU) clears the reservation and the SC returns a failure value without modifying memory; LL/SC are performed in a loop to retry until the update succeeds. An LL/SC sequence can typically be arbitrarily complex – a lock routine might test a location is cleared and store a non-zero value if so, whereas an update might increment a counter or calculate a “next” value, and so on. Typically an ISA does not restrict what lies between LL and SC. Coming back to virtualisation requirements, the definition of a reservation is interesting because it’s effectively “hidden state” that the hypervisor cannot manage. Typically, a hypervisor cannot easily read whether a reservation exists, and it can’t be saved/restored1. CISC-like RmW atomic operations do not exhibit this property. Problem seen, problem felt Shall I get to the point? I saw an odd but legal guest code sequence that can be difficult to virtualise. I’ve been trying to run MacOS 9.2 in KVM-PR on a PowerPC G4, and observed the NanoKernel acquire-lock routine happens to use a sensitive instruction (mfsprg) between a lwarx and stwcx. This is strange, and guarantees a trap to the host between the LL and SC operations. Though the guest should not be doing weird stuff when acquiring a lock, it’s still an architecturally-correct program. This means that if the reservation isn’t preserved across the trap, the lock is never taken. Forward progress is never achieved and virtualisation equivalence is not maintained (because the guest livelocks). Specifically, if the reservation is always cleared on the trap, we have a problem. If it is sometimes kept, the guest program can progress.
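For illustration, here is the textbook shape of a PowerPC LL/SC lock acquire as GCC inline assembly (not the NanoKernel's actual code); the comment marks roughly where a sensitive instruction in the middle of the sequence would force a trap under privilege compression:

```c
/* The classic PowerPC lock-acquire loop, illustrative only. The reservation
 * set by lwarx must survive until the stwcx., or the store fails and we go
 * round again. */
static inline void spin_lock(volatile unsigned int *lock)
{
    unsigned int tmp;

    __asm__ __volatile__(
        "1:  lwarx   %0,0,%1      \n\t"  /* load word, set reservation       */
        "    cmpwi   %0,0         \n\t"  /* lock already held?               */
        "    bne-    1b           \n\t"  /* yes: spin                        */
        /* MacOS's NanoKernel apparently does something like an mfsprg around
         * here: a sensitive/privileged read *inside* the reservation window,
         * which under KVM-PR means a guaranteed trap to the host.           */
        "    stwcx.  %2,0,%1      \n\t"  /* store 1 iff reservation survives */
        "    bne-    1b           \n\t"  /* reservation lost: retry          */
        "    isync                    "  /* acquire barrier                  */
        : "=&r" (tmp)
        : "r" (lock), "r" (1u)
        : "cr0", "memory");
}
```

If the trap always clears the reservation, the bne- after the stwcx. is taken every time and the loop never exits: exactly the livelock described above.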
Since the state is hidden (the hypervisor can’t save/restore/re-create), correctness depends on two things: The hypervisor’s exception-emulation-return path not itself clearing the reservation every time for any possible trap The ISA and hardware implementation guaranteeing the reservation is not always cleared by hardware This potential issue isn’t limited to PPC or the MacOS guest. Software guarantees The hypervisor must guarantee two things: It must not intentionally clear reservations on all traps. It must not accidentally do so as a side-effect of a chosen activity: For example, using its own synchronisation primitives elsewhere, or by writing memory that would conflict with the guest’s reservation. This can be challenging: context switching must be avoided in the T&E handler (no sleep or pre-emption), and it can’t take locks. In my MacOS guest experiment, KVM-PR does not happen to currently use any synchronisation primitives on its emulation path, ew delicate – but I had tracing on, which does. The guest locked up. Hardware guarantees But does your CPU guarantee that reservations aren’t always cleared?2 That seems to depend. This morning’s light reading gives: PowerPC architecture PowerISA is comparatively clear on the behaviour (which isn’t surprising, as PowerISA is generally very clearly-specified). PowerISA v3.1 section 1.7.2.1 describes reservations, listing specific reasons for reservation loss. Some are the expected “lose the reservation if someone else hits the memory” reasons, but previous PowerISAs (e.g. 2.06) permitted embedded implementations to clear the reservation on all exceptions. This permission was removed by 3.1; in my opinion a good move. (I did just this, for reasons, in my homebrew PowerPC CPU, oops!) PowerISA does permit spontaneous reservation loss due to speculative behaviour, but is careful to require that forward progress is guaranteed (i.e. that an implementation doesn’t happen to clear the reservation every time for a given piece of code). Finally, it includes a virtualisation-related programming note stating a reservation may be lost if software executes a privileged instruction or utilizes a privileged facility (i.e. sensitive instructions). This expresses intent, but isn’t specification: it doesn’t criminalise a guest doing wrong things unless it’s a rule that was there from the dawn of time. At any rate, this post is going to be old news to the PowerISA authors. Nice doc, 8/10, good jokes, would read again. RISC-V architecture The lack of any guest legacy permits the problem to be solved from the other direction. Interestingly, the RISC-V ISA explicitly constrains the instruction sequences between LR/SC: "The dynamic code executed between the LR and SC instructions can only contain instructions from the base “I” instruction set, excluding loads, stores, backward jumps, taken backward branches, JALR, FENCE, FENCE.I, and SYSTEM instructions.“ This is a good move. Tacitly, this bans sensitive instructions in the critical region, and permits an absence of progress if the guest breaks the rules. Ruling out memory accesses is interesting too, because it can be useful for a hypervisor to be able to T&E any given page in the guest address space without repercussions. Reservation granule size An LL operation is usually architecturally permitted to set an address-based reservation with a size larger than the original access, called the “reservation granule”. 
A larger granule reduces tracking requirements but increases the risk of a kind of false sharing between locks where an unrelated CPU taking an unrelated lock could clear your CPU’s reservation. This is important to our hypervisor, because of guarantee #2 above: when emulating a sensitive instruction it must not access anything that always causes the reservation to clear. You would hope the guest doesn’t soil itself by executing an instruction against its interests, so we can assume the guest won’t intentionally direct the hypervisor to hit on shared addresses, but if hypervisor and guest memory could ever coexist within a reservation granule there is scope for conflict. PowerPC defines the largest granule as, effectively, the (small) page size. ARM defines it as 4KB (effectively, the same). It’s a reasonable architectural assumption that guest and host memory is disjoint at page size granularity. RISC-V permits the reservation granule to be unlimited, which isn’t great3 – but later notes that “a platform specification may constrain the size and shape of the reservation set. For example, the Unix platform is expected to require of main memory that the reservation set be of fixed size, contiguous, naturally aligned, and no greater than the virtual memory page size.” Conclusion An ISA cannot be classically virtualised if it permits some aspect of trapping or emulation (such as the exception itself) to always cause a reservation to be cleared, unless sensitive instructions are prohibited from any region dependent on a reservation. In terms of computer science, it’s quite unsatisfying if it were possible to have a sequence of RISC instructions that cannot be classically virtualised due to hidden state. In practical terms, trap-and-emulate is alive and well in systems supporting nested virtualisation. Although some ISAs provide a level of hardware support for NV, it tends to be assists to speed up use of privilege compression rather than more exception levels and more translation stages (which, to be fair, would be awful). Consequently there is always something hypervisor-privileged being trapped to the real hypervisor, i.e. T&E is used in anger. So, there are some hardware behaviours which must (continue to be) guaranteed and, unfortunately, some constraints on already-complex software which must be observed. I thought this small computer architecture safari might be interesting to others, and hope you enjoyed the read! Footnotes In theory an ISA could provide the hypervisor with a previous reservation’s address, but re-creating it with a later LL raises ordering model questions! ↩ Sorry for the double-negative, but this alludes to the possibility of architecture permissions (for example, statements like “X is permitted to spontaneously happen at any time”) leading to implementations taking convenient liberties such as “always do X when any cache line is fetched”. If these decisions were to exist, they would be impossible to avoid stepping on, even with a carefully-written hypervisor. ↩ It would be terrible to permit an implementation to allow all hypervisor memory accesses to clear the reservation! ↩

A small ode to the CRT

Built October 2018 I used to hate Cathode Ray Tubes. As a kid in Europe, everything flickered at 50Hz, or made a loud whistle at 15.625KHz (back when I could still hear it). CRTs just seemed crude, “electro-brutalist” contraptions from the valve era. They were heavy, and delicate, and distorted, and blurry, and whistled, and gave people electric shocks when they weren’t busy imploding and spreading glass shards around the place. When I saw the film Brazil, I remember getting anxious about exposed CRTs all over the place — seems I was the kind of kid who was more worried about someone touching the anode or electron gun than the totalitarian bureaucratic world they lived in. 🤷🏻‍♂️ As ever, I digress. Now in the 2020s, the CRT is pretty much gone. We have astonishing flat-panel LCD and OLED screens. Nothing flickers, everything’s pin-sharp, multi-megapixel resolutions, nothing whines (except me), and display life is pretty incredible for those of us old enough to remember green-screen computing (but young enough to still see the details). But, the march to betterness marches away from accessible: if you take apart a phone, the LCD is a magic glass rectangle, and that’s it. Maybe you can see some LEDs if you tear it apart, but it’s really not obvious how it works. CRTs are also magic, but in a pleasing 19th century top-hat-and-cane science kind of way. Invisible beams trace out images through a foot of empty space. They respond colourfully to magnets (also magic) held to their screens by curious children whose glee rapidly decays into panic, and trying to undo the effect using the other pole before their mother looks around and discovers what they’ve done (allegedly). The magnet-game is a clue: (most) CRTs use electromagnets that scans the invisible electron beam to light an image at the front. There’s something enjoyable about moving the beam yourself, with a magnet in hand, and you can kind of intuitively figure out how it works from doing this. (Remember the Left-hand Rule?) I started to warm to CRTs, maybe a fondness when I realised I hadn’t had to seriously use one for over a decade. I wanted to build something. I also like smol displays, and found an excellent source for a small CRT — a video camera viewfinder. Home cameras had tiny CRTs, roughly 1cm picture size, but I looked to find a higher-end professional viewfinder because they tended to have larger tubes for a higher-quality image. Eventually I found a Sony HVF-2000 viewfinder, from ca. 1980. This viewfinder contained a monochrome 1.5” CRT, and its drive circuitry on stinky 1970s phenolic resin PCBs. All it needs are two turntables and an 8V DC power supply and composite video input. It displays nice, sharp images on a cool white phosphor. I built this from it: Small CRT floating in a box I wanted to show the CRT from all angles, without hiding any of it, in the trusty “desktop curiosity” style. The idea was to show off this beautiful little obsolete glass thingy, in a way that you could sorta guess how it worked. Switching it on with a pleasing clack, it starts silently playing a selection of 1980s TV shows, over and over and over: I had this on my desk at work, and a Young PersonTM came into my office one day to ask about it. He hadn’t really seen a CRT close-up before, and we had a fun chat about how it worked (including waving a magnet at it – everyone has a spare magnet on their desk for these moments, don’t they? Hello…?). Yay! If you’re unfamiliar with CRTs, they work roughly like this: The glass envelope contains a vacuum. 
The neck contains a heating filament (like a lightbulb) which gives off electrons into the void. This “electron gun” is near some metal plates (with variously high positive and negative voltages), which act to focus the fizz of electrons into a narrow beam, directing it forward. The inside of the front face of the tube is covered by a phosphorescent material which lights up when hit with electrons. The front face is connected to the anode terminal, a high positive voltage. This attracts the beam of electrons, which accelerate to the front. The beam hits the front and creates light in a small spot. To create the picture, the beam is steered in rasters/lines using horizontal and vertical electromagnets wrapped around the neck of the tube. (The magnets are called the “yoke”.) For PAL at 50Hz, lines are drawn 15625 times a second. Relying on the principle of persistence of vision, this creates the illusion of a steady image. The tube is sealed and electron gun inside is largely invisible, but here you can see the malicious-looking thick anode wire, and how dainty the tube really is with the yoke removed: Note: the anode voltage for this tube is, from memory, about 2.5 kilovolts, so not particularly spicy. A large computer monitor will give you 25KV! Did I mention the X-rays? Circuit The original viewfinder was a two-board affair, fitting in a strange transverse shape for the viewfinder case. I removed a couple of controls and indicators unrelated to the CRT operation, and extended the wires slightly so they could be stacked. The viewfinder’s eyepiece looks onto a mirror, turning 90º to the CRT face — so the image is horizontally flipped. This was undone by swapping the horizontal deflection coil wires, reversing the field direction. The circuit’s pretty trivial. It just takes a DC input (9-12V) and uses two DC-DC converter modules to create an 8V supply for the CRT board and a 5V supply for a Raspberry Pi Zero layered at the bottom. The whole thing uses under 2W. The Pi’s composite output drops straight into the CRT board. The Pi starts up a simple shell script that picks a file to play. There’s a rotary encoder on the back, to change channel, but I haven’t wired it up yet. Case For me, the case was the best bit. I had just got (and since lost :((( ) access to a decent laser cutter, and wanted to make a dovetailed transparent case for the parts. It’s made from 3mm colourless and sky-blue acrylic. Rubber bands make the world go round The CRT is supported from two “hangers”, and two trays below hold the circuitry. These are fixed to the sides using a slot/tab approach, with captive nuts. In the close-up pictures you can see there are some hairline stress fractures around the corners of some of the tab cut-outs: they could evidently do with being a few hundred µm wider! The front/top/back/bottom faces are glued together, then the left/right sides are screwed into the shelves/hangers with captive M3 nuts. This sandwiches it all together. The back holds a barrel-style DC jack, power switch, and (as-yet unused) rotary encoder. The encoder was intended to eventually be a kind of “channel select”: The acrylic is a total magnet for fingerprints and dust, which is excellent if you’re into that kind of thing. There seems to also be little flecks filling the case, probably some aquadag flaking off the CRT. This technology just keeps on giving. 
OpenSCAD The case is designed in OpenSCAD, and is somewhat parameterised: the XYZ dimensions, dovetailing, spacing of shelves and so forth can be tweaked till it looks good. One nice OpenSCAD laser-cutting trick I saw is that 2D parts can be rendered into a “preview” 3D view, tweaked and fettled, and then re-rendered flat on a 2D plane to create a template for cutting. So, make a 3D prototype, change the parameters until it looks good (maybe printing stuff out to see whether the physical items actually fit!)… …then change the mode variable, and the same parts are laid out in 2D for cutting: Feel free to hack on and re-use this template. Resources OpenSCAD box sources Pics Tiny dmesg! Edmund Esq

Mac SE/30 odyssey

I’ve always wanted an Apple Macintosh SE/30. Released in 1989, they look quite a lot like the other members of the original “compact Mac” series, but pack in a ton of interesting features that the other compact Macs don’t have. This is the story of my journey to getting to the point of owning a working Mac SE/30, which turns out not to be as simple as just buying one. Stay tuned for tales of debugging and its repair. So, the Mac. Check it out, with the all-in-one-style 9” monochrome display: The beautiful Macintosh SE/30 I mean, look at it, isn’t it lovely? :) The key technical difference between the SE/30 and the other compact Macs is that the SE/30 is much much less crap. It’s like a sleeper workstation, compared to the Mac Plus, SE, or Classic. 8MHz 68K? No! ~16MHz 68030. Emulating FP on a slow 68K? No! It ships with a real FPU! Limited to 4MB of RAM? Naw, this thing takes up to 128MB! Look, I wouldn’t normally condone use of CISC machines (and – unpopular opinion – I’m not actually a 68K fan :D ), but not only has this machine a bunch of capability RAM-wise and CPU-wise, but this machine has an MMU. In my book, MMUs make things interesting (as well as ‘interesting’). Unlike all the other compact Macs, this one can run real operating systems like BSD, and Linux. And, I needed to experience A/UX first-hand. Unpopular opinion #2: I don’t really like ye olde Mac OS/System 7 either! :) It was very cool at the time, and made long-lasting innovations, but lack of memory protection or preemptive scheduling made it a little delicate. At the time, as a kid, it was frustrating that there was no CLI, or any way to mess around and program them without expensive developer tools – so I gravitated to the Acorn Archimedes machines, and RISC OS (coincidentally with the same delicate OS drawbacks), which were much more accessible programming-wise. Anyway, one week during one of the 2020 lockdowns I was reminded of the SE/30, and got a bit obsessed with getting hold of one. I was thinking about them at 2am (when I wasn’t stressing about things like work), planning which OSes to try out, which upgrades to make, how to network it, etc. Took myself to that overpriced auction site, and bought one from a nearby seller. We got one! I picked it up. I was so excited. It was a good deal (hollow laugh from future-Matt), as it came in a shoulder bag and included mouse/keyboard, an external SCSI drive and various cables. Getting it into the car, I noticed an OMINOUS GRITTY SLIDING SOUND. Oh, did I mention that these machines are practically guaranteed to self-destruct because either the on-board electrolytic caps ooze out gross stuff, or the on-board Varta lithium battery poos its plentiful and corrosive contents over the logic board? [If you own one of these machines or, let’s face it, any machine from this era, go right now and remove the batteries if you haven’t already! Go on, it’s important. (I’m also looking at you, Acorn RISC PC owners.) I’ll wait.] I opened up the machine, and the first small clue appeared: Matt: Oh. That’s not a great omen. Matt, with strained optimism: “But maybe the logic board will be okay!” Mac SE/30: “Nah mate, proper fucked sry.” Matt: :( At this point I’d like to say that the seller was a volunteer selling donated items from a charity shop, and it was clear they didn’t really know much about the machine. It was disappointing, but the money paid for this one is just a charitable donation and I’m happy at that. 
(If it were a private seller taking money for a machine that sounded like it washed up on a beach, it’d be a different level of fury.) Undeterred (give it up, Matt, come on), I spent a weekend trying to resurrect it. Much of the gross stuff washed off, bathing it in a sequence of detergents/vinegar/IPA/etc: You can see some green discolouring of the silkscreen in the bottom right. Submerged in (distilled) water, you can see a number of tracks that vanish halfway, or have disappeared completely. Or, components whose pads and leads have been destroyed! The battery chemicals are very ingenious; they don’t just wash like lava across the board and destroy the top, but they also wick down into the vias and capillary action seems to draw them into the inner layers. Broken tracks, missing pads, missing components, missing vias Poring over schematics and beeping out connections, I started airwiring the broken tracks (absolutely determined to get this machine running, as though some perverse challenge). But, once I found broken tracks on the inner layers, it moved from perverse to Sisyphean because I couldn’t just see where the damage was: wouldn’t even finding the broken tracks by beeping out all connections be O(intractable)? Making the best decision so far in the odyssey, I gave up and searched for another SE/30. At least I got a spare keyboard and mouse out of it. But also, a spare enclosure/CRT/analog board, etc., which will be super-useful in a few paragraphs. Meet the new Mac, same as the old Mac I found someone selling one who happened to be in the same city (and it turns out, we even worked for the same company – city like village). This one was advertised as having been ‘professionally re-capped’, and came loaded: 128MB of RAM, and a sought-after Ethernet card. Perfecto! Paranoid me immediately took it apart to check the re-capping and battery. :) Whilst there was a teeny bit of evidence of prior capacitor-leakage, it was really clean for a 31-year-old machine and I was really pleased with it. The re-capping job looked sensible, check. The battery looked new, but I’m taking no chances this time and pulled it out. I had a good 2 hours merrily pissing about doing the kinds of things you do with a new old computer, setting up networking and getting some utilities copied over, such as a Telnet client: Telnet client, life is complete Disaster strikes After the two hour happiness timer expired, the machine stopped working. Here’s what it did: Otherwise, the Mac made the startup “bong” sound, so the logic board was alive, just unhappy video. I think we’re thinking the same thing: the CRT’s Y-deflection circuit is obviously broken. This family of Macs have a common fault where solder joints on the Analogue board crack, or the drive transistor fails. The excellent “Dead Mac Scrolls” book covers common faults, and fixes. But, remember the first Mac: the logic board was a gonner, but the Analog board/CRT seemed good. I could just swap the logic board over, and I’ve got a working Mac again and can watch the end of Telnet Star Wars. It did exactly the same thing! Bollocks, the problem was on the logic board. Debugging the problem We were both wrong: it wasn’t the Y-deflection circuit for the CRT. The symptoms of that would be that the CRT scans, but all lines get compressed and overdrawn along the centre – no deflection creating one super-bright line in the centre. Debug clues Clue 1: This line wasn’t super-bright.
Let’s take a closer look: Clue 2: It’s a dotted line, as though it’s one line of the stippled background when the Mac boots. That’s interesting because it’s clearly not being overdrawn; multiple lines merged together would overlay even/odd odd/even pixels and come out solid white. The line also doesn’t provide any of the “happy Mac” icon in the middle, so it isn’t one of the centre lines of the framebuffer. SE/30 logic board on The Bench, provided with +5V/+12V and probed with scope/LA If you’ve an SE/30 (or a Classic/Plus/128/512 etc.) logic board on a workbench, they’re easy enough to power up without the Analog board/CRT but be aware the /RESET circuitry is a little funky. Reset is generated by the sound chip (…obviously) which requires both +5V and +12V to come out of reset, so you’ll need a dual-rail bench supply. I’d also recommend plugging headphones in, so you can hear the boot chime (or lack of) as you tinker. Note the audio amp technically requires -5V too, but with +5V alone you should still be able to hear something. This generation of machines are one of the last to have significant subsystems still being implemented as multi-chip sections. It’s quite instructive to follow along in the schematic: the SE/30 video system is a cluster of discrete X and Y pixel counters which generate addresses into VRAM (which spits out pixels). Some PALs generate VRAM addresses/strobes/refresh, and video syncs. Clue 3: The video output pin on the chonky connector is being driven, and HSYNC is running correctly (we can deduce this already, though, because the CRT lights up meaning its HT supply is running, and that’s driven from HSYNC). But, there was no VSYNC signal at all. VSYNC comes from a PAL taking a Y-count from a counter clocked by 'TWOLINE' Working backwards, I traced VSYNC from the connector to PAL UG6. It wasn’t simply a broken trace, UG6 wasn’t generating it. UG6 appears to be a comparator that generates vertical timing strobes when the Y line count VADR[7:0] reaches certain lines. The Y line count is generated from a dual hex counter, UF8. Clue 4: The Y line count wasn’t incrementing at all. That explains the lack of VSYNC, as UG6 never saw the “VSYNC starts now” line come past. The UF8 counter is clocked/incremented by the TWOLINE signal output from PAL UG7. Clue 5a: PAL UG7’s TWOLINE output was stuck/not transitioning. Its other outputs (such as HSYNC) were transitioning fine. PALs do die, but it seems unusual for only a single output to conk out. Clue 5b: PAL UG7 was unusually hot! Clue 6, and the root problem: Pulling the PALs out, the TWOLINE pin measures 3Ω to ground. AHA! Debug epiphany *Something is shorting the TWOLINE signal to a power rail. Here’s how the clues correspond to the observations: There is no VSYNC; the Y line count is stuck at 0. The X counter is working fine. (HSYNC is produced, and a stippled pattern line is displayed correctly.) The display shows the top line of the video buffer (from address 0, over and over) but never advances onto the next line. The CRT Y deflection is never “charged up” by a VSYNC so the raster stays in the centre on one line, instead of showing 384 identical lines. We can work with this. TWOLINE is shorted somehow. Tracing it across the PCB, every part of the trace looked fine, except I couldn’t see the part that ran underneath C7 (one of the originally-electrolytic caps replaced with a tantalum). I removed C7: See the problem? It’s pleasingly subtle… How about now? 
A tiny amount of soldermask has come off the track just south of the silkscreen ‘+’. This was Very Close to the capacitor’s contact, and was shorting against it! Above, I thought it was shorting to ground: it’s actually shorting to +5V (which, when you measure it with a meter, can read as a low number of ohms to ground anyway). My theory is that it wasn’t initially making good contact, and that the heat from my 2-hour joyride expanded the material enough to make a solid connection. You can see that there’s some tarnish on the IC above C7 – this is damage from the previous C7 leaking. This, or the re-capping job, lifted the insulating soldermask, leading to the short.

Fixed

The fix was simple: add some insulation using Kapton tape and replace the capacitor:

After that, I could see VSYNC being produced! But would it work?

The sweet 1bpp stippled smell of success

Yasssssss! :) Time to put it all back together, trying not to touch or break the CRT.

And now for something completely different, but eerily familiar

I mentioned I wanted this particular model because it could run “interesting OSes”. Did you know that, way before NeXT and OS X, Apple was a UNIX vendor?

Apple A/UX operating system

I’ve always wanted to play with Apple’s A/UX. By version 3.1, it had a very highly-integrated Mac OS ‘Classic’ GUI running on a real UNIX.

It's like Mac OS, but... there's a UNIX dmesg too?

It’s not X11 (though an X server is available), it really is running the Mac Toolbox etc., and it seems to have some similarities with the later OS X Blue Box/Classic environment in that it runs portions of Mac OS as a UNIX process. In the same way as OS X + Blue Box, A/UX will run unmodified Mac OS applications. The Finder is integrated with the UNIX filesystems in both directions (i.e. from a shell you can manipulate Mac files). These screenshots don’t do it justice, but there are good A/UX screenshots elsewhere. As an OS geek, I’m really impressed with the level of integration between the two OSes! It’s very thorough. Since the usual UNIX development tools are available, there’s a bit of cognitive dissonance in being able to “program a Mac” right out of the box:

A/UX example application

I mean, not just building normal UNIX command-line apps with cc/make etc., but the development examples include Mac OS GUI apps as well! It’s truly living in the future™.

Plug for RASCSI

Playing with ancient machines and multiple OSes is pretty painful when using ancient SCSI discs because:

Old discs don’t work.
Old discs are small.
Transferring stuff to and from discs means plugging it into your Linux box and… I don’t have SCSI there.
Old discs don’t work, and will pretend to, and then screw up and ruin your week.

I built a RASCSI adapter (write-up and PCB posting TBD), software and circuit originally by GIMONS. This is a Raspberry Pi adapter that allows a userspace program to bit-bang the SCSI-I protocol, serving emulated disc/CD-ROM images from SD card. It works beautifully on the SE/30, and lets it both have several discs present at once, and switch between images quickly.

Homemade RASCSI clone, SCSI emulator for Raspberry Pi

The end, seeeeeeya!

Resources

https://archive.org/details/mac_The_Dead_Mac_Scrolls_1992
https://winworldpc.com/product/a-ux/3x
https://68kmla.org/bb/index.php
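As mentioned in the debugging section above, here's a toy software model of the video counter chain, purely to illustrate the failure reasoning. The structure and signal names (TWOLINE, the UF8 Y counter, the UG6 comparator) come from the description earlier in the post; every constant and detail below is made up for illustration and bears no relation to the real PAL equations.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of the SE/30 video timing chain described above:
     * X counter -> TWOLINE -> Y counter (UF8) -> comparator (UG6) -> VSYNC.
     * All constants are illustrative; this models the reasoning, not the circuit. */
    int main(void)
    {
        bool twoline_shorted = true;  /* the fault: TWOLINE held at a power rail */
        int y = 0;                    /* Y line count, VADR[7:0] */
        int vsyncs = 0;

        for (int hline = 0; hline < 100000; hline++) {
            /* HSYNC keeps running regardless, so the CRT's HT supply stays up. */
            bool twoline = !twoline_shorted && (hline % 2 == 1); /* pulses every two lines */
            if (twoline)
                y++;                  /* UF8 increments on TWOLINE */
            if (y >= 384) {           /* UG6 fires "VSYNC starts now" at the line limit */
                vsyncs++;
                y = 0;
            }
        }

        /* With TWOLINE stuck, y never leaves 0: no VSYNC, and the same framebuffer
         * line (address 0) is fetched over and over -> one stippled line. */
        printf("VSYNC pulses: %d, Y line count: %d\n", vsyncs, y);
        return 0;
    }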

32-bit hat, with LEDs

Built in November 2015 (now-traditional multi-year writeup delay applied)

A hat, bejewelled with 38 RGB LEDs

Is this thing on..? It’s been a while since I’ve written one of these. So, the hat. It’s been on the writeup pile for almost 6 years, nagging away. Finally it’s its time to shine! NO PUN ESCAPES

Anyway, the hat. It seemed like a good idea, and I even wore it out dancing. I know, so cool. This hat had been through at least two fancy-dress events, and had a natty aftermarket band mod even before the LEDs. Long story short: got a hat, put a battery, an ARM Cortex-M0 microcontroller and an accelerometer in it, and a strip of full-colour RGB LEDs around it. The LEDs then react to movement, with an effect similar to a spirit level: as it tilts, a spark travels to the highest point. The spark rolls around, fading out nicely.

Hardware

Pretty much full bodge-city, and made in a real rush before a party. Parts:

Charity shop Trilby (someone’s going to correct me that this is not an ISO-standard Trilby and is in fact a Westcountry Colonel Chap Trilby, or something). Bugger it – a hat.
A WS2812B strip of 38 LEDs. 38 is what would fit around the hat.
Cheapo ADXL345 board.
Cheapo STM32F030 board (I <3 these boards! So power, such price wow).
Cheapo Li-Ion charging board and 5V step-up module all-in-one (AKA “powerbank board”).
Li-Ion flat/pouch-style battery.
Obviously some hot glue in there somewhere too.

No schematic, sorry, it was quite freeform. The battery is attached to the charging board. That connects to the rest of the system via a 0.1” header/disconnectable “power switch” cable. The 5V power then directly feeds the LED strip and the Cortex-M0 board (which generates 3.3V itself). The ADXL345 accelerometer is joined directly to the STM32 board at what was the UART header, which is configured for I2C:

The STM32 board is also stripped of any unnecessary or especially pointy parts, such as jumpers/pin headers, to make it as flat and pain-free as possible. The LED strip is bent into a ring and soldered back onto itself. 5V and ground are linked at the join, whereas DI enters at the join and DO is left hanging. This is done for mechanical stability, and can’t hurt for power distribution either. Here’s the ring in testing:

The electronics are mounted in an antistatic bag (with a hole for the power “switch” header pins, wires, etc.), and the bag sewn into the top of the hat:

The LED ring is attached via a small hole, and sewn on with periodic thread loops:

Software

The firmware goes through an initial “which way is up?” calibration phase for the first few seconds, where it:

Lights a simple red dotted pattern to warn the user it’s about to sample which way is up, so put it on quick and stand as naturally as you can with such exciting technology on your head,
Lights a simple white dotted pattern, as it measures the “resting vector”, i.e. which way is up.

This “resting vector” is thereafter used as the reference for determining whether the hat is tilted, and in which direction.

Tilt direction vectors

The main loop’s job is to regulate the rate of LED updates, read the accelerometer, calculate a position to draw a bright spark “blob”, and update the LEDs. The accelerometer returns a 3D vector of force; when not being externally accelerated, the vector represents the direction of Earth’s gravity, i.e. ‘down’.
Trigonometry is both fun and useful

Roughly, the calculations performed are:

Relative to “vertical” (approximated by the resting vector), calculate the hat’s tilt in terms of the angle of the measured vector to vertical, and its bearing relative to “12 o’clock” in the horizontal (XY) plane.
Convert the bearing of the vector into a position in the LED hoop.
Use the radius of the vector in the XY plane as a crude magnitude, scaling up the spark intensity for a larger tilt.

(A rough sketch of this calculation, and of the LED data encoding, is included after the resources at the end of this post.)

All this talk of tilt and gravity vectors assumes the hat isn’t being moved (i.e. worn by a human). It doesn’t correct for the fact that the hat is likely actually accelerating, rather than sitting static at a tilt but, hey, this is a hat with LEDs and not a rocket. It is incorrect and looks good.

Floating-point

I never use floating point in any of my embedded projects. I’m a die-hard fixed-point kind of guy. You know where you are with fixed point. Sooo anyway, the firmware uses the excellent Qfplib, from https://www.quinapalus.com/qfplib-m0-tiny.html. This provides tiny single-precision floating-point routines, including the trigonometric routines I needed for the angle calculations. Bizarrely, with an embedded hat on, it was way easier using gosh-darnit real FP than it was to do the trigonometry in fixed point.

Framebuffer

The framebuffer is only one-dimensional :) It’s a line of pixels representing the LEDs. Blobs are drawn into the framebuffer at a given position, and start off “bright”. Every frame, the brightness of all pixels is decremented, giving a fade-out effect. The code drawing blobs uses a pre-calculated colour look-up table, to give a cool white-blue-purple transition to the spark.

Driving the WS2812B RGB LEDs

The WS2812B LEDs take a 1-bit stream of data encoding 24 bits of RGB data, in a fixed-time frame using the relative timing of rising/falling edges to give a 0 or 1 bit. The code uses a timer in PWM mode to output a 1/0 data bit, refilled from a neat little DMA routine. Once a framebuffer has been drawn, the LEDs are refreshed. For each pixel in the line, the brightness bits are converted into an array of timer values, each representing a PWM period (therefore a 0-time or a 1-time). A double-buffered DMA scheme is used to stream these values into the timer PWM register. This costs a few bytes of memory for the intermediate buffers, and is complicated, but has several advantages:

It’s completely flicker-free and largely immune to any other interrupt/DMA activity, compared to bitbanging approaches.
It goes on in the background, freeing up CPU time to calculate the next frame.

Though the CPU is pretty fast, this allows LEDHat to update at over 100Hz, giving incredibly fluid motion.

Resources

Firmware source code: https://github.com/evansm7/LEDHat
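As referenced above, here's a rough sketch of the tilt calculation. It uses standard math.h calls rather than the Qfplib routines the real firmware uses, it approximates "relative to the resting vector" by simply subtracting the resting X/Y components (rather than doing a proper rotation), and every name and constant is illustrative rather than taken from the LEDHat source.

    #include <math.h>
    #include <stdint.h>

    #define NUM_LEDS 38
    static const float TWO_PI_F = 6.28318531f;

    /* Map an accelerometer reading onto an LED position and a spark intensity.
     * rest_* is the "resting vector" captured during the calibration phase.
     * Sketch only: the real firmware uses Qfplib single-precision routines. */
    static void tilt_to_spark(float ax, float ay, float az,
                              float rest_x, float rest_y, float rest_z,
                              int *led_index, uint8_t *intensity)
    {
        /* Tilt relative to "vertical", crudely approximated by subtracting
         * the resting vector's in-plane components. */
        float dx = ax - rest_x;
        float dy = ay - rest_y;

        /* Bearing of the tilt in the horizontal (XY) plane, 0 .. 2*pi,
         * measured from "12 o'clock". */
        float bearing = atan2f(dy, dx);
        if (bearing < 0.0f)
            bearing += TWO_PI_F;

        /* Convert the bearing into a position around the LED hoop. */
        *led_index = (int)(bearing * (float)NUM_LEDS / TWO_PI_F) % NUM_LEDS;

        /* Radius in the XY plane as a crude tilt magnitude -> spark brightness. */
        float mag = sqrtf(dx * dx + dy * dy);   /* roughly 0..1 g for modest tilts */
        if (mag > 1.0f)
            mag = 1.0f;
        *intensity = (uint8_t)(mag * 255.0f);

        (void)az;
        (void)rest_z;                           /* Z isn't needed for the bearing */
    }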
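And here's a similarly hedged sketch of the bit-to-PWM-period conversion that feeds the double-buffered DMA stream. The GRB byte order and MSB-first bit order are standard WS2812B behaviour, but the tick counts, names and buffer handling are illustrative, not the real driver.

    #include <stdint.h>

    /* WS2812B encoding: each data bit becomes one fixed-period PWM cycle whose
     * high time selects a 0 or a 1. The tick values below are illustrative,
     * not the real firmware's numbers. */
    #define T0H_TICKS  18   /* short high pulse  -> '0' bit */
    #define T1H_TICKS  36   /* longer high pulse -> '1' bit */

    /* Convert one pixel (GRB order, as the WS2812B expects) into 24 timer
     * compare values, ready to be streamed into the timer's PWM compare
     * register by DMA. */
    static void pixel_to_pwm(uint8_t r, uint8_t g, uint8_t b, uint16_t out[24])
    {
        uint32_t grb = ((uint32_t)g << 16) | ((uint32_t)r << 8) | b;
        for (int i = 0; i < 24; i++) {
            /* Most-significant bit first. */
            out[i] = (grb & (1u << (23 - i))) ? T1H_TICKS : T0H_TICKS;
        }
    }

    /* In a double-buffered scheme, one half-buffer is refilled with the next
     * pixel's compare values while the DMA engine drains the other half. */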


More in technology

2025-05-11 air traffic control

Air traffic control has been in the news lately, on account of my country's declining ability to do it. Well, that's a long-term trend, resulting from decades of under-investment, severe capture by our increasingly incompetent defense-industrial complex, no small degree of management incompetence in the FAA, and long-lasting effects of Reagan crushing the PATCO strike. But that's just my opinion, you know, maybe airplanes got too woke. In any case, it's an interesting time to consider how weird parts of air traffic control are. The technical, administrative, and social aspects of ATC all seem two notches more complicated than you would expect. ATC is heavily influenced by its peculiar and often accidental development, a product of necessity that perpetually trails behind the need, and a beneficiary of hand-me-down military practices and technology. Aviation Radio In the early days of aviation, there was little need for ATC---there just weren't many planes, and technology didn't allow ground-based controllers to do much of value. There was some use of flags and signal lights to clear aircraft to land, but for the most part ATC had to wait for the development of aviation radio. The impetus for that work came mostly from the First World War. Here we have to note that the history of aviation is very closely intertwined with the history of warfare. Aviation technology has always rapidly advanced during major conflicts, and as we will see, ATC is no exception. By 1913, the US Army Signal Corps was experimenting with the use of radio to communicate with aircraft. This was pretty early in radio technology, and the aircraft radios were huge and awkward to operate, but it was also early in aviation and "huge and awkward to operate" could be similarly applied to the aircraft of the day. Even so, radio had obvious potential in aviation. The first military application for aircraft was reconnaissance. Pilots could fly past the front to find artillery positions and otherwise provide useful information, and then return with maps. Well, even better than returning with a map was providing the information in real-time, and by the end of the war medium-frequency AM radios were well developed for aircraft. Radios in aircraft lead naturally to another wartime innovation: ground control. Military personnel on the ground used radio to coordinate the schedules and routes of reconnaissance planes, and later to inform on the positions of fighters and other enemy assets. Without any real way to know where the planes were, this was all pretty primitive, but it set the basic pattern that people on the ground could keep track of aircraft and provide useful information. Post-war, civil aviation rapidly advanced. The early 1920s saw numerous commercial airlines adopting radio, mostly for business purposes like schedule coordination. Once you were in contact with someone on the ground, though, it was only logical to ask about weather and conditions. Many of our modern practices like weather briefings, flight plans, and route clearances originated as more or less formal practices within individual airlines. Air Mail The government was not left out of the action. The Post Office operated what may have been the largest commercial aviation operation in the world during the early 1920s, in the form of Air Mail. The Post Office itself did not have any aircraft; all of the flying was contracted out---initially to the Army Air Service, and later to a long list of regional airlines. 
Air Mail was considered a high priority by the Post Office and proved very popular with the public. When the transcontinental route began proper operation in 1920, it became possible to get a letter from New York City to San Francisco in just 33 hours by transferring it between airplanes in a nearly non-stop relay race. The Post Office's largesse in contracting the service to private operators provided not only the funding but the very motivation for much of our modern aviation industry. Air travel was not very popular at the time, being loud and uncomfortable, but the mail didn't complain. The many contract mail carriers of the 1920s grew and consolidated into what are now some of the United States' largest companies. For around a decade, the Post Office almost singlehandedly bankrolled civil aviation, and passengers were a side hustle [1]. Air mail ambition was not only of economic benefit. Air mail routes were often longer and more challenging than commercial passenger routes. Transcontinental service required regular flights through sparsely populated parts of the interior, challenging the navigation technology of the time and making rescue of downed pilots a major concern. Notably, air mail operators did far more nighttime flying than any other commercial aviation in the 1920s. The Post Office became the government's de facto technical leader in civil aviation. Besides the network of beacons and markers built to guide air mail between cities, the Post Office built 17 Air Mail Radio Stations along the transcontinental route. The Air Mail Radio Stations were the company radio system for the entire air mail enterprise, and the closest thing to a nationwide, public air traffic control service to then exist. They did not, however, provide what we would now call control. Their role was mainly to provide pilots with information (including, critically, weather reports) and to keep loose tabs on air mail flights so that a disappearance would be noticed in time to send search and rescue. In 1926, the Air Commerce Act created the Aeronautics Branch of the Department of Commerce. The Aeronautics Branch assumed a number of responsibilities, but one of them was the maintenance of the Air Mail routes. Similarly, the Air Mail Radio Stations became Aeronautics Branch facilities, and took on the new name of Flight Service Stations. No longer just for the contract mail carriers, the Flight Service Stations made up a nationwide network of government-provided services to aviators. They were the first edifices in what we now call the National Airspace System (NAS): a complex combination of physical facilities, technologies, and operating practices that enable safe aviation. In 1935, the first en-route air traffic control center opened, a facility in Newark owned by a group of airlines. The Aeronautics Branch, since renamed the Bureau of Air Commerce, supported the airlines in developing this new concept of en-route control that used radio communications and paperwork to track which aircraft were in which airways. The rising number of commercial aircraft made in-air collisions a bigger problem, so the Newark control center was quickly followed by more facilities built on the same pattern. In 1936, the Bureau of Air Commerce took ownership of these centers, and ATC became a government function alongside the advisory and safety services provided by the flight service stations.
En route center controllers worked off of position reports from pilots via radio, but needed a way to visualize and track aircraft's positions and their intended flight paths. Several techniques helped: first, airlines shared their flight planning paperwork with the control centers, establishing "flight plans" that corresponded to each aircraft in the sky. Controllers adopted a work aid called a "flight strip," a small piece of paper with the key information about an aircraft's identity and flight plan that could easily be handed between stations. By arranging the flight strips on display boards full of slots, controllers could visualize the ordering of aircraft in terms of altitude and airway. Second, each center was equipped with a large plotting table map where controllers pushed markers around to correspond to the position reports from aircraft. A small flag on each marker gave the flight number, so it could easily be correlated to a flight strip on one of the boards mounted around the plotting table. This basic concept of air traffic control, of a flight strip and a position marker, is still in use today. Radar The Second World War changed aviation more than any other event of history. Among the many advancements were two British inventions of particular significance: first, the jet engine, which would make modern passenger airliners practical. Second, the radar, and more specifically the magnetron. This was a development of such significance that the British government treated it as a secret akin to nuclear weapons; indeed, the UK effectively traded radar technology to the US in exchange for participation in US nuclear weapons research. Radar created radical new possibilities for air defense, and complemented previous air defense development in Britain. During WWI, the organization tasked with defending London from aerial attack had developed a method called "ground-controlled interception" or GCI. Under GCI, ground-based observers identify possible targets and then direct attack aircraft towards them via radio. The advent of radar made GCI tremendously more powerful, allowing a relatively small number of radar-assisted air defense centers to monitor for inbound attack and then direct defenders with real-time vectors. In the first implementation, radar stations reported contacts via telephone to "filter centers" that correlated tracks from separate radars to create a unified view of the airspace---drawn in grease pencil on a preprinted map. Filter center staff took radar and visual reports and updated the map by moving the marks. This consolidated information was then provided to air defense bases, once again by telephone. Later technical developments in the UK made the process more automated. The invention of the "plan position indicator" or PPI, the type of radar scope we are all familiar with today, made the radar far easier to operate and interpret. Radar sets that automatically swept over 360 degrees allowed each radar station to see all activity in its area, rather than just aircraft passing through a defensive line. These new capabilities eliminated the need for much of the manual work: radar stations could see attacking aircraft and defending aircraft on one PPI, and communicated directly with defenders by radio. It became routine for a radar operator to give a pilot navigation vectors by radio, based on real-time observation of the pilot's position and heading. A controller took strategic command of the airspace, effectively steering the aircraft from a top-down view.
The ease and efficiency of this workflow was a significant factor in the end of the Battle of Britain, and its remarkable efficacy was noticed in the US as well. At the same time, changes were afoot in the US. WWII was tremendously disruptive to civil aviation; while aviation technology rapidly advanced due to wartime needs, those same pressing demands led to a slowdown in nonmilitary activity. A heavy volume of military logistics flights and flight training, as well as growing concerns about defending the US from an invasion, meant that ATC was still a priority. A reorganization of the Bureau of Air Commerce replaced it with the Civil Aeronautics Authority, or CAA. The CAA's role greatly expanded as it assumed responsibility for airport control towers and commissioned new en route centers. As WWII came to a close, CAA en route control centers began to adopt GCI techniques. By 1955, the name Air Route Traffic Control Center (ARTCC) had been adopted for en route centers and the first air surveillance radars were installed. In a radar-equipped ARTCC, the map where controllers pushed markers around was replaced with a large tabletop PPI built to a Navy design. The controllers still pushed markers around to track the identities of aircraft, but they moved them based on their corresponding radar "blips" instead of radio position reports. Air Defense After WWII, post-war prosperity and wartime technology like the jet engine led to huge growth in commercial aviation. During the 1950s, radar was adopted by more and more ATC facilities (both "terminal" at airports and "en route" at ARTCCs), but there were few major changes in ATC procedure. With more and more planes in the air, tracking flight plans and their corresponding positions became labor intensive and error-prone. A particular problem was the increasing range and speed of aircraft, and corresponding longer passenger flights, which meant that many aircraft passed from the territory of one ARTCC into another. This required that controllers "hand off" the aircraft, informing the "next" ARTCC of the flight plan and position at which the aircraft would enter their airspace. In 1956, 128 people died in a mid-air collision of two commercial airliners over the Grand Canyon. In 1958, 49 people died when a military fighter struck a commercial airliner over Nevada. These were not the only such incidents in the mid-1950s, and public trust in aviation started to decline. Something had to be done. First, in 1958 the CAA gave way to the Federal Aviation Administration. This was more than just a name change: the FAA's authority was greatly increased compared to the CAA, most notably by granting it authority over military aviation. This is a difficult topic to explain succinctly, so I will only give broad strokes. Prior to 1958, military aviation was completely distinct from civil aviation, with no coordination and often no communication at all between the two. This was, of course, a factor in the 1958 collision. Further, the 1956 collision, while it did not involve the military, did result in part from communications issues between separate CAA facilities and the airline's own control facilities. After 1958, ATC was completely unified into one organization, the FAA, which assumed the work of the military controllers of the time and some of the role of the airlines.
The military continues to have its own air controllers to this day, and military aircraft continue to include privileges such as (practical but not legal) exemption from transponder requirements, but military flights over the US are still beholden to the same ATC as civil flights. Some exceptions apply, void where prohibited, etc. The FAA's suddenly increased scope only made the practical challenges of ATC more difficult, and commercial aviation numbers continued to rise. As soon as the FAA was formed, it was understood that there needed to be major investments in improving the National Airspace System. While the first couple of years were dominated by the transition, the FAA's second director (Najeeb Halaby) prepared two lengthy reports examining the situation and recommending improvements. One of these, the Beacon report (also called Project Beacon), specifically addressed ATC. The Beacon report's recommendations included massive expansion of radar-based control (called "positive control" because of the controller's access to real-time feedback on aircraft movements) and new control procedures for airways and airports. Even better, for our purposes, it recommended the adoption of general-purpose computers and software to automate ATC functions. Meanwhile, the Cold War was heating up. US air defense, a minor concern in the few short years after WWII, became a higher priority than ever before. The Soviet Union had long-range aircraft capable of reaching the United States, and nuclear weapons meant that only a few such aircraft had to make it to cause massive destruction. The vast size of the United States (and, considering the new unified air defense command between the United States and Canada, all of North America) made this a formidable challenge. During the 1950s, the newly minted Air Force worked closely with MIT's Lincoln Laboratory (an important center of radar research) and IBM to design a computerized, integrated, networked system for GCI. When the Air Force committed to purchasing the system, it was christened the Semi-Automatic Ground Environment, or SAGE. SAGE is a critical juncture in the history of the computer and computer communications, the first system to demonstrate many parts of modern computer technology and, moreover, perhaps the first large-scale computer system of any kind. SAGE is an expansive topic that I will not take on here; I'm sure it will be the focus of a future article but it's a pretty well-known and well-covered topic. I have not so far felt like I had much new to contribute, despite it being the first item on my "list of topics" for the last five years. But one of the things I want to tell you about SAGE, that is perhaps not so well known, is that SAGE was not used for ATC. SAGE was a purely military system. It was commissioned by the Air Force, and its numerous operating facilities (called "direction centers") were located on Air Force bases along with the interceptor forces they would direct. However, there was obvious overlap between the functionality of SAGE and the needs of ATC. SAGE direction centers continuously received tracks from remote data sites using modems over leased telephone lines, and automatically correlated multiple radar tracks to a single aircraft. Once an operator entered information about an aircraft, SAGE stored that information for retrieval by other radar operators.
When an aircraft with associated data passed from the territory of one direction center to another, the aircraft's position and related information were automatically transmitted to the next direction center by modem. One of the key demands of air defense is the identification of aircraft---any unknown track might be routine commercial activity, or it could be an inbound attack. The air defense command received flight plan data on commercial flights (and more broadly all flights entering North America) from the FAA and entered them into SAGE, allowing radar operators to retrieve "flight strip" data on any aircraft on their scope. Recognizing this interconnection with ATC, as soon as SAGE direction centers were being installed the Air Force started work on an upgrade called SAGE Air Traffic Integration, or SATIN. SATIN would extend SAGE to serve the ATC use-case as well, providing SAGE consoles directly in ARTCCs and enhancing SAGE to perform non-military safety functions like conflict warning and forward projection of flight plans for scheduling. Flight strips would be replaced by teletype output, and in general made less necessary by the computer's ability to filter the radar scope. Experimental trial installations were made, and the FAA participated readily in the research efforts. Enhancement of SAGE to meet ATC requirements seemed likely to meet the Beacon report's recommendations and radically improve ARTCC operations, sooner and cheaper than development of an FAA-specific system. As it happened, well, it didn't happen. SATIN became interconnected with another planned SAGE upgrade to the Super Combat Centers (SCC), deep underground combat command centers with greatly enhanced SAGE computer equipment. SATIN and SCC planners were so confident that the last three Air Defense Sectors scheduled for SAGE installation, including my own Albuquerque, were delayed under the assumption that the improved SATIN/SCC equipment should be installed instead of the soon-obsolete original system. SCC cost estimates ballooned, and the program's ambitions were reduced month by month until it was canceled entirely in 1960. Albuquerque never got a SAGE installation, and the Albuquerque air defense sector was eliminated by reorganization later in 1960 anyway. Flight Service Stations Remember those Flight Service Stations, the ones that were originally built by the Post Office? One of the oddities of ATC is that they never went away. FSS were transferred to the CAB, to the CAA, and then to the FAA. During the 1930s and 1940s many more were built, expanding coverage across much of the country. Throughout the development of ATC, the FSS remained responsible for non-control functions like weather briefing and flight plan management. Because aircraft operating under instrument flight rules must closely comply with ATC, the involvement of FSS in IFR flights is very limited, and FSS mostly serve VFR traffic. As ATC became common, the FSS gained a new and somewhat odd role: playing go-between for ATC. FSS were more numerous and often located in sparser areas between cities (while ATC facilities tended to be in cities), so especially in the mid-century, pilots were more likely to be able to reach an FSS than ATC. It was, for a time, routine for FSS to relay instructions between pilots and controllers. This is still done today, although improved communications have made the need much less common. 
As weather dissemination improved (another topic for a future post), FSS gained access to extensive weather conditions and forecasting information from the Weather Service. This connectivity is bidirectional; during the midcentury FSS not only received weather forecasts by teletype but transmitted pilot reports of weather conditions back to the Weather Service. Today these communications have, of course, been computerized, although the legacy teletype format doggedly persists. There has always been an odd schism between the FSS and ATC: they are operated by different departments, out of different facilities, with different functions and operating practices. In 2005, the FAA cut costs by privatizing the FSS function entirely. Flight service is now operated by Leidos, one of the largest government contractors. All FSS operations have been centralized to one facility that communicates via remote radio sites. While flight service is still available, increasing automation has made the stations far less important, and the general perception is that flight service is in its last years. Last I looked, Leidos was not hiring for flight service and the expectation was that they would never hire again, retiring the service along with its staff. Flight service does maintain one of my favorite internet phenomena, the phone number domain name: 1800wxbrief.com. One of the odd manifestations of the FSS/ATC schism and the FAA's very partial privatization is that Leidos maintains an online aviation weather portal that is separate from, and competes with, the Weather Service's aviationweather.gov. Since Flight Service traditionally has the responsibility for weather briefings, it is honestly unclear to what extent Leidos vs. the National Weather Service should be investing in aviation weather information services. For its part, the FAA seems to consider aviationweather.gov the official source, while it pays for 1800wxbrief.com. There's also weathercams.faa.gov, which duplicates a very large portion (maybe all?) of the weather information on Leidos's portal and some of the NWS's. It's just one of those things. Or three of those things, rather. Speaking of duplication due to poor planning... The National Airspace System Left in the lurch by the Air Force, the FAA launched its own program for ATC automation. While the Air Force was deploying SAGE, the FAA had mostly been waiting, and various ARTCCs had adopted a hodgepodge of methods ranging from one-off computer systems to completely paper-based tracking. By 1960 radar was ubiquitous, but different radar systems were used at different facilities, and correlation between radar contacts and flight plans was completely manual. The FAA needed something better, and with growing congressional support for ATC modernization, they had the money to fund what they called National Airspace System En Route Stage A. Further bolstering historical confusion between SAGE and ATC, the FAA decided on a practical, if ironic, solution: buy their own SAGE. In an upcoming article, we'll learn about the FAA's first fully integrated computerized air traffic control system. While the failed detour through SATIN delayed the development of this system, the nearly decade-long delay between the design of SAGE and the FAA's contract allowed significant technical improvements.
This "New SAGE," while directly based on SAGE at a functional level, used later off-the-shelf computer equipment including the IBM System/360, giving it far more resemblance to our modern world of computing than SAGE with its enormous, bespoke AN/FSQ-7. And we're still dealing with the consequences today! [1] It also laid the groundwork for the consolidation of the industry, with a 1930 decision that took air mail contracts away from most of the smaller companies and awarded them instead to the precursors of United, TWA, and American Airlines.

Sierpiński triangle? In my bitwise AND?

Exploring a peculiar bit-twiddling hack at the intersection of 1980s geek sensibilities.
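(Presumably a reference to the classic observation that plotting the points where x & y == 0 traces out a Sierpiński triangle, i.e. Pascal's triangle mod 2; a minimal sketch of that idea, not taken from the linked article:)

    #include <stdio.h>

    /* The classic bit-twiddling curiosity: cells where (x & y) == 0 trace out
     * a Sierpinski triangle (it's Pascal's triangle mod 2 in disguise). */
    int main(void)
    {
        for (int y = 0; y < 32; y++) {
            for (int x = 0; x < 32; x++)
                putchar((x & y) ? ' ' : '*');
            putchar('\n');
        }
        return 0;
    }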

Reverse engineering the 386 processor's prefetch queue circuitry

In 1985, Intel introduced the groundbreaking 386 processor, the first 32-bit processor in the x86 architecture. To improve performance, the 386 has a 16-byte instruction prefetch queue. The purpose of the prefetch queue is to fetch instructions from memory before they are needed, so the processor usually doesn't need to wait on memory while executing instructions. Instruction prefetching takes advantage of times when the processor is "thinking" and the memory bus would otherwise be unused. In this article, I look at the 386's prefetch queue circuitry in detail. One interesting circuit is the incrementer, which adds 1 to a pointer to step through memory. This sounds easy enough, but the incrementer uses complicated circuitry for high performance. The prefetch queue uses a large network to shift bytes around so they are properly aligned. It also has a compact circuit to extend signed 8-bit and 16-bit numbers to 32 bits. There aren't any major discoveries in this post, but if you're interested in low-level circuits and dynamic logic, keep reading. The photo below shows the 386's shiny fingernail-sized silicon die under a microscope. Although it may look like an aerial view of a strangely-zoned city, the die photo reveals the functional blocks of the chip. The Prefetch Unit in the upper left is the relevant block. In this post, I'll discuss the prefetch queue circuitry (highlighted in red), skipping over the prefetch control circuitry to the right. The Prefetch Unit receives data from the Bus Interface Unit (upper right) that communicates with memory. The Instruction Decode Unit receives prefetched instructions from the Prefetch Unit, byte by byte, and decodes the opcodes for execution. This die photo of the 386 shows the location of the registers. Click this image (or any other) for a larger version. The left quarter of the chip consists of stripes of circuitry that appears much more orderly than the rest of the chip. This grid-like appearance arises because each functional block is constructed (for the most part) by repeating the same circuit 32 times, once for each bit, side by side. Vertical data lines run up and down, in groups of 32 bits, connecting the functional blocks. To make this work, each circuit must fit into the same width on the die; this layout constraint forces the circuit designers to develop a circuit that uses this width efficiently without exceeding the allowed width. The circuitry for the prefetch queue uses the same approach: each circuit is 66 µm wide1 and repeated 32 times. As will be seen, fitting the prefetch circuitry into this fixed width requires some layout tricks. What the prefetcher does The purpose of the prefetch unit is to speed up performance by reading instructions from memory before they are needed, so the processor won't need to wait to get instructions from memory. Prefetching takes advantage of times when the memory bus is otherwise idle, minimizing conflict with other instructions that are reading or writing data. In the 386, prefetched instructions are stored in a 16-byte queue, consisting of four 32-bit blocks.2 The diagram below zooms in on the prefetcher and shows its main components. You can see how the same circuit (in most cases) is repeated 32 times, forming vertical bands. At the top are 32 bus lines from the Bus Interface Unit. These lines provide the connection between the datapath and external memory, via the Bus Interface Unit. 
These lines form a triangular pattern as the 32 horizontal lines on the right branch off and form 32 vertical lines, one for each bit. Next are the fetch pointer and the limit register, with a circuit to check if the fetch pointer has reached the limit. Note that the two low-order bits (on the right) of the incrementer and limit check circuit are missing. At the bottom of the incrementer, you can see that some bit positions have a blob of circuitry missing from others, breaking the pattern of repeated blocks. The 16-byte prefetch queue is below the incrementer. Although this memory is the heart of the prefetcher, its circuitry takes up a relatively small area. A close-up of the prefetcher with the main blocks labeled. At the right, the prefetcher receives control signals. The bottom part of the prefetcher shifts data to align it as needed. A 32-bit value can be split across two 32-bit rows of the prefetch buffer. To handle this, the prefetcher includes a data shift network to shift and align its data. This network occupies a lot of space, but there is no active circuitry here: just a grid of horizontal and vertical wires. Finally, the sign extend circuitry converts a signed 8-bit or 16-bit value into a signed 16-bit or 32-bit value as needed. You can see that the sign extend circuitry is highly irregular, especially in the middle. A latch stores the output of the prefetch queue for use by the rest of the datapath. Limit check If you've written x86 programs, you probably know about the processor's Instruction Pointer (EIP) that holds the address of the next instruction to execute. As a program executes, the Instruction Pointer moves from instruction to instruction. However, it turns out that the Instruction Pointer doesn't actually exist! Instead, the 386 has an "Advance Instruction Fetch Pointer", which holds the address of the next instruction to fetch into the prefetch queue. But sometimes the processor needs to know the Instruction Pointer value, for instance, to determine the return address when calling a subroutine or to compute the destination address of a relative jump. So what happens? The processor gets the Advance Instruction Fetch Pointer address from the prefetch queue circuitry and subtracts the current length of the prefetch queue. The result is the address of the next instruction to execute, the desired Instruction Pointer value. The Advance Instruction Fetch Pointer—the address of the next instruction to prefetch—is stored in a register at the top of the prefetch queue circuitry. As instructions are prefetched, this pointer is incremented by the prefetch circuitry. (Since instructions are fetched 32 bits at a time, this pointer is incremented in steps of four and the bottom two bits are always 0.) But what keeps the prefetcher from prefetching too far and going outside the valid memory range? The x86 architecture infamously uses segments to define valid regions of memory. A segment has a start and end address (known as the base and limit) and memory is protected by blocking accesses outside the segment. The 386 has six active segments; the relevant one is the Code Segment that holds program instructions. Thus, the limit address of the Code Segment controls when the prefetcher must stop prefetching.3 The prefetch queue contains a circuit to stop prefetching when the fetch pointer reaches the limit of the Code Segment. In this section, I'll describe that circuit. Comparing two values may seem trivial, but the 386 uses a few tricks to make this fast. 
The basic idea is to use 30 XOR gates to compare the bits of the two registers. (Why 30 bits and not 32? Since 32 bits are fetched at a time, the bottom bits of the address are 00 and can be ignored.) If the two registers match, all the XOR values will be 0, but if they don't match, an XOR value will be 1. Conceptually, connecting the XORs to a 32-input OR gate will yield the desired result: 0 if all bits match and 1 if there is a mismatch. Unfortunately, building a 32-input OR gate using standard CMOS logic is impractical for electrical reasons, as well as inconveniently large to fit into the circuit. Instead, the 386 uses dynamic logic to implement a spread-out NOR gate with one transistor in each column of the prefetcher. The schematic below shows the implementation of one bit of the equality comparison. The mechanism is that if the two registers differ, the transistor on the right is turned on, pulling the equality bus low. This circuit is replicated 30 times, comparing all the bits: if there is any mismatch, the equality bus will be pulled low, but if all bits match, the bus remains high. The three gates on the left implement XNOR; this circuit may seem overly complicated, but it is a standard way of implementing XNOR. The NOR gate at the right blocks the comparison except during clock phase 2. (The importance of this will be explained below.) This circuit is repeated 30 times to compare the registers. The equality bus travels horizontally through the prefetcher, pulled low if any bits don't match. But what pulls the bus high? That's the job of the dynamic circuit below. Unlike regular static gates, dynamic logic is controlled by the processor's clock signals and depends on capacitance in the circuit to hold data. The 386 is controlled by a two-phase clock signal.4 In the first clock phase, the precharge transistor below turns on, pulling the equality bus high. In the second clock phase, the XOR circuits above are enabled, pulling the equality bus low if the two registers don't match. Meanwhile, the CMOS switch turns on in clock phase 2, passing the equality bus's value to the latch. The "keeper" circuit keeps the equality bus held high unless it is explicitly pulled low, to avoid the risk of the voltage on the equality bus slowly dissipating. The keeper uses a weak transistor to keep the bus high while inactive. But if the bus is pulled low, the keeper transistor is overpowered and turns off. This is the output circuit for the equality comparison. This circuit is located to the right of the prefetcher. This dynamic logic reduces power consumption and circuit size. Since the bus is charged and discharged during opposite clock phases, you avoid steady current through the transistors. (In contrast, an NMOS processor like the 8086 might use a pull-up on the bus. When the bus is pulled low, you would end up with current flowing through the pull-up and the pull-down transistors. This would increase power consumption, make the chip run hotter, and limit your clock speed.) The incrementer After each prefetch, the Advance Instruction Fetch Pointer must be incremented to hold the address of the next instruction to prefetch. Incrementing this pointer is the job of the incrementer. (Because each fetch is 32 bits, the pointer is incremented by 4 each time. But in the die photo, you can see a notch in the incrementer and limit check circuit where the circuitry for the bottom two bits has been omitted.
Thus, the incrementer's circuitry increments its value by 1, so the pointer (with two zero bits appended) increases in steps of 4.) Building an incrementer circuit is straightforward; for example, you can use a chain of 30 half-adders. The problem is that incrementing a 30-bit value at high speed is difficult because of the carries from one position to the next. It's similar to calculating 99999999 + 1 in decimal; you need to tediously carry the 1, carry the 1, carry the 1, and so forth, through all the digits, resulting in a slow, sequential process. The incrementer uses a faster approach. First, it computes all the carries at high speed, almost in parallel. Then it computes each output bit in parallel from the carries—if there is a carry into a position, it toggles that bit. Computing the carries is straightforward in concept: if there is a block of 1 bits at the end of the value, all those bits will produce carries, but carrying is stopped by the rightmost 0 bit. For instance, incrementing binary 11011 results in 11100; there are carries from the last two bits, but the zero stops the carries. A circuit to implement this was developed at the University of Manchester in England way back in 1959, and is known as the Manchester carry chain. In the Manchester carry chain, you build a chain of switches, one for each data bit, as shown below. For a 1 bit, you close the switch, but for a 0 bit you open the switch. (The switches are implemented by transistors.) To compute the carries, you start by feeding in a carry signal at the right. The signal will go through the closed switches until it hits an open switch, and then it will be blocked.5 The outputs along the chain give us the desired carry value at each position. Concept of the Manchester carry chain, 4 bits. Since the switches in the Manchester carry chain can all be set in parallel and the carry signal blasts through the switches at high speed, this circuit rapidly computes the carries we need. The carries then flip the associated bits (in parallel), giving us the result much faster than a straightforward adder. There are complications, of course, in the actual implementation. The carry signal in the carry chain is inverted, so a low signal propagates through the carry chain to indicate a carry. (It is faster to pull a signal low than high.) But something needs to make the line go high when necessary. As with the equality circuitry, the solution is dynamic logic. That is, the carry line is precharged high during one clock phase and then processing happens in the second clock phase, potentially pulling the line low. The next problem is that the carry signal weakens as it passes through multiple transistors and long lengths of wire. The solution is that each segment has a circuit to amplify the signal, using a clocked inverter and an asymmetrical inverter. Importantly, this amplifier is not in the carry chain path, so it doesn't slow down the signal through the chain. The Manchester carry chain circuit for a typical bit in the incrementer. The schematic above shows the implementation of the Manchester carry chain for a typical bit. The chain itself is at the bottom, with the transistor switch as before. During clock phase 1, the precharge transistor pulls this segment of the carry chain high. During clock phase 2, the signal on the chain goes through the "clocked inverter" at the right to produce the local carry signal.
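To make the carry-chain idea concrete, here's a small software model of increment-by-carry-propagation. It's sequential where the Manchester chain is a parallel chain of switches, and it models the concept described above rather than the 386's actual circuit.

    #include <stdint.h>

    /* Software model of the incrementer concept: first work out, for every bit
     * position, whether a carry arrives there (a carry propagates through a run
     * of trailing 1 bits and stops at the first 0), then flip each bit that
     * receives a carry. The hardware does the first step with a chain of
     * transistor switches; this loop is just the idea, not the circuit. */
    static uint32_t increment_via_carries(uint32_t x)
    {
        uint32_t carries = 0;
        uint32_t propagate = 1;          /* a carry is always injected into bit 0 */

        for (int i = 0; i < 32 && propagate; i++) {
            carries |= 1u << i;          /* a carry arrives at bit i */
            propagate = (x >> i) & 1;    /* a 0 bit "opens the switch" and stops it */
        }
        /* Each output bit is simply the input bit XORed with its incoming carry. */
        return x ^ carries;
    }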
If there is a carry, the next bit is flipped by the XOR gate, producing the incremented output.6 The "keeper/amplifier" is an asymmetrical inverter that produces a strong low output but a weak high output. When there is no carry, its weak output keeps the carry chain pulled high. But as soon as a carry is detected, it strongly pulls the carry chain low to boost the carry signal. But this circuit still isn't enough for the desired performance. The incrementer uses a second carry technique in parallel: carry skip. The concept is to look at blocks of bits and allow the carry to jump over the entire block. The diagram below shows a simplified implementation of the carry skip circuit. Each block consists of 3 to 6 bits. If all the bits in a block are 1's, then the AND gate turns on the associated transistor in the carry skip line. This allows the carry skip signal to propagate (from left to right), a block at a time. When it reaches a block with a 0 bit, the corresponding transistor will be off, stopping the carry as in the Manchester carry chain. The AND gates all operate in parallel, so the transistors are rapidly turned on or off in parallel. Then, the carry skip signal passes through a small number of transistors, without going through any logic. (The carry skip signal is like an express train that skips most stations, while the Manchester carry chain is the local train to all the stations.) Like the Manchester carry chain, the implementation of carry skip needs precharge circuits on the lines, a keeper/amplifier, and clocked logic, but I'll skip the details. An abstracted and simplified carry-skip circuit. The block sizes don't match the 386's circuit. One interesting feature is the layout of the large AND gates. A 6-input AND gate is a large device, difficult to fit into one cell of the incrementer. The solution is that the gate is spread out across multiple cells. Specifically, the gate uses a standard CMOS NAND gate circuit with NMOS transistors in series and PMOS transistors in parallel. Each cell has an NMOS transistor and a PMOS transistor, and the chains are connected at the end to form the desired NAND gate. (Inverting the output produces the desired AND function.) This spread-out layout technique is unusual, but keeps each bit's circuitry approximately the same size. The incrementer circuitry was tricky to reverse engineer because of these techniques. In particular, most of the prefetcher consists of a single block of circuitry repeated 32 times, once for each bit. The incrementer, on the other hand, consists of four different blocks of circuitry, repeating in an irregular pattern. Specifically, one block starts a carry chain, a second block continues the carry chain, and a third block ends a carry chain. The block before the ending block is different (one large transistor to drive the last block), making four variants in total. This irregular pattern is visible in the earlier photo of the prefetcher. The alignment network The bottom part of the prefetcher rotates data to align it as needed. Unlike some processors, the x86 does not enforce aligned memory accesses. That is, a 32-bit value does not need to start on a 4-byte boundary in memory. As a result, a 32-bit value may be split across two 32-bit rows of the prefetch queue. Moreover, when the instruction decoder fetches one byte of an instruction, that byte may be at any position in the prefetch queue. 
To deal with these problems, the prefetcher includes an alignment network that can rotate bytes to output a byte, word, or four bytes with the alignment required by the rest of the processor. The diagram below shows part of this alignment network. Each bit exiting the prefetch queue (top) has four wires, for rotates of 24, 16, 8, or 0 bits. Each rotate wire is connected to one of the 32 horizontal bit lines. Finally, each horizontal bit line has an output tap, going to the datapath below. (The vertical lines are in the chip's lower M1 metal layer, while the horizontal lines are in the upper M2 metal layer. For this photo, I removed the M2 layer to show the underlying layer. Shadows of the original horizontal lines are still visible.) Part of the alignment network. The idea is that by selecting one set of vertical rotate lines, the 32-bit output from the prefetch queue will be rotated left by that amount. For instance, to rotate by 8, bits are sent down the "rotate 8" lines. Bit 0 from the prefetch queue will energize horizontal line 8, bit 1 will energize horizontal line 9, and so forth, with bit 31 wrapping around to horizontal line 7. Since horizontal bit line 8 is connected to output 8, the result is that bit 0 is output as bit 8, bit 1 is output as bit 9, and so forth. The four possibilities for aligning a 32-bit value. The four bytes above are shifted as specified to produce the desired output below. For the alignment process, one 32-bit output may be split across two 32-bit entries in the prefetch queue in four different ways, as shown above. These combinations are implemented by multiplexers and drivers. Two 32-bit multiplexers select the two relevant rows in the prefetch queue (blue and green above). Four 32-bit drivers are connected to the four sets of vertical lines, with one set of drivers activated to produce the desired shift. Each byte of each driver is wired to achieve the alignment shown above. For instance, the rotate-8 driver gets its top byte from the "green" multiplexer and the other three bytes from the "blue" multiplexer. The result is that the four bytes, split across two queue rows, are rotated to form an aligned 32-bit value. Sign extension The final circuit is sign extension. Suppose you want to add an 8-bit value to a 32-bit value. An unsigned 8-bit value can be extended to 32 bits by simply filling the upper bits with zeroes. But for a signed value, it's trickier. For instance, -1 is the eight-bit value 0xFF, but the 32-bit value is 0xFFFFFFFF. To convert an 8-bit signed value to 32 bits, the top 24 bits must be filled in with the top bit of the original value (which indicates the sign). In other words, for a positive value, the extra bits are filled with 0, but for a negative value, the extra bits are filled with 1. This process is called sign extension.9 In the 386, a circuit at the bottom of the prefetcher performs sign extension for values in instructions. This circuit supports extending an 8-bit value to 16 bits or 32 bits, as well as extending a 16-bit value to 32 bits. This circuit will extend a value with zeros or with the sign, depending on the instruction. The schematic below shows one bit of this sign extension circuit. It consists of a latch on the left and right, with a multiplexer in the middle. The latches are constructed with a standard 386 circuit using a CMOS switch (see footnote).7 The multiplexer selects one of three values: the bit value from the swap network, 0 for sign extension, or 1 for sign extension. 
The multiplexer is constructed from a CMOS switch if the bit value is selected and two transistors for the 0 or 1 values. This circuit is replicated 32 times, although the bottom byte only has the latches, not the multiplexer, as sign extension does not modify the bottom byte. The sign extend circuit associated with bits 31-8 from the prefetcher. The second part of the sign extension circuitry determines if the bits should be filled with 0 or 1 and sends the control signals to the circuit above. The gates on the left determine if the sign extension bit should be a 0 or a 1. For a 16-bit sign extension, this bit comes from bit 15 of the data, while for an 8-bit sign extension, the bit comes from bit 7. The four gates on the right generate the signals to sign extend each bit, producing separate signals for the bit range 31-16 and the range 15-8. This circuit determines which bits should be filled with 0 or 1. The layout of this circuit on the die is somewhat unusual. Most of the prefetcher circuitry consists of 32 identical columns, one for each bit.8 The circuitry above is implemented once, using about 16 gates (buffers and inverters are not shown above). Despite this, the circuitry above is crammed into bit positions 17 through 7, creating irregularities in the layout. Moreover, the implementation of the circuitry in silicon is unusual compared to the rest of the 386. Most of the 386's circuitry uses the two metal layers for interconnection, minimizing the use of polysilicon wiring. However, the circuit above also uses long stretches of polysilicon to connect the gates. Layout of the sign extension circuitry. This circuitry is at the bottom of the prefetch queue. The diagram above shows the irregular layout of the sign extension circuitry amid the regular datapath circuitry that is 32 bits wide. The sign extension circuitry is shown in green; this is the circuitry described at the top of this section, repeated for each bit 31-8. The circuitry for bits 15-8 has been shifted upward, perhaps to make room for the sign extension control circuitry, indicated in red. Note that the layout of the control circuitry is completely irregular, since there is one copy of the circuitry and it has no internal structure. One consequence of this layout is the wasted space to the left and right of this circuitry block, the tan regions with no circuitry except vertical metal lines passing through. At the far right, a block of circuitry to control the latches has been wedged under bit 0. Intel's designers go to great effort to minimize the size of the processor die since a smaller die saves substantial money. This layout must have been the most efficient they could manage, but I find it aesthetically displeasing compared to the regularity of the rest of the datapath. How instructions flow through the chip Instructions follow a tortuous path through the 386 chip. First, the Bus Interface Unit in the upper right corner reads instructions from memory and sends them over a 32-bit bus (blue) to the prefetch unit. The prefetch unit stores the instructions in the 16-byte prefetch queue. Instructions follow a twisting path to and from the prefetch queue. How is an instruction executed from the prefetch queue? It turns out that there are two distinct paths. Suppose you're executing an instruction to add 12345678 to the EAX register. The prefetch queue will hold the five bytes 05 (the opcode), 78, 56, 34, and 12. The prefetch queue provides opcodes to the decoder one byte at a time over the 8-bit bus shown in red. 
The bus takes the lowest 8 bits from the prefetch queue's alignment network and sends this byte to a buffer (the small square at the head of the red arrow). From there, the opcode travels to the instruction decoder.10 The instruction decoder, in turn, uses large tables (PLAs) to convert the x86 instruction into a 111-bit internal format with 19 different fields.11

The data bytes of an instruction, on the other hand, go from the prefetch queue to the ALU (Arithmetic Logic Unit) through a 32-bit data bus (orange). Unlike the previous buses, this data bus is spread out, with one wire through each column of the datapath. This bus extends through the entire datapath so values can also be stored into registers. For instance, the MOV (move) instruction can store a value from an instruction (an "immediate" value) into a register.

Conclusions

The 386's prefetch queue contains about 7400 transistors, more than an Intel 8080 processor. (And this is just the queue itself; I'm ignoring the prefetch control logic.) This illustrates the rapid advance of processor technology: part of one functional unit in the 386 contains more transistors than an entire 8080 processor from 11 years earlier. And this unit is less than 3% of the entire 386 processor.

Every time I look at an x86 circuit, I see the complexity required to support backward compatibility, and I gain more understanding of why RISC became popular. The prefetcher is no exception. Much of the complexity is due to the 386's support for unaligned memory accesses, requiring a byte shift network to move bytes into 32-bit alignment. Moreover, at the other end of the instruction bus is the complicated instruction decoder that decodes intricate x86 instructions. Decoding RISC instructions is much easier.

In any case, I hope you've found this look at the prefetch circuitry interesting. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates. I've written multiple articles on the 386 previously; a good place to start might be my survey of the 386 dies.

Footnotes and references

1. The width of the circuitry for one bit changes a few times: while the prefetch queue and segment descriptor cache use a circuit that is 66 µm wide, the datapath circuitry is a bit tighter at 60 µm. The barrel shifter is even narrower at 54.5 µm per bit. Connecting circuits with different widths wastes space, since the wiring to connect the bits requires horizontal segments to adjust the spacing. But it also wastes space to use widths that are wider than needed. Thus, changes in the spacing are rare, happening only where the tradeoffs make it worthwhile. ↩

2. The Intel 8086 processor had a six-byte prefetch queue, while the Intel 8088 (used in the original IBM PC) had a prefetch queue of just four bytes. In comparison, the 16-byte queue of the 386 seems luxurious. (Some 386 processors, however, are said to only use 12 bytes due to a bug.)

The prefetch queue assumes instructions are executed in linear order, so it doesn't help with branches or loops. If the processor encounters a branch, the prefetch queue is discarded. (In contrast, a modern cache will work even if execution jumps around.) Moreover, the prefetch queue doesn't handle self-modifying code. (It used to be common for code to change itself while executing to squeeze out extra performance.)
By loading code into the prefetch queue and then modifying instructions, you could determine the size of the prefetch queue: if the old instruction was executed, it must be in the prefetch queue, but if the modified instruction was executed, it must be outside the prefetch queue. Starting with the Pentium Pro, x86 processors flush the prefetch queue if a write modifies a prefetched instruction. ↩

3. The prefetch unit generates "linear" addresses that must be translated to physical addresses by the paging unit (ref). ↩

4. I don't know which phase of the clock is phase 1 and which is phase 2, so I've assigned the numbers arbitrarily. The 386 creates four clock signals internally from a clock input CLK2 that runs at twice the processor's clock speed. The 386 generates a two-phase clock with non-overlapping phases. That is, there is a small gap between when the first phase is high and when the second phase is high. The 386's circuitry is controlled by the clock, with alternate blocks controlled by alternate phases. Since the clock phases don't overlap, this ensures that logic blocks are activated in sequence, allowing the orderly flow of data. But because the 386 uses CMOS, it also needs active-low clocks for the PMOS transistors. You might think that you could simply use the phase 1 clock as the active-low phase 2 clock and vice versa. The problem is that these clock phases overlap when used as active-low; there are times when both clock signals are low. Thus, the two clock phases must be explicitly inverted to produce the two active-low clock phases. I described the 386's clock generation circuitry in detail in this article. ↩

5. The Manchester carry chain is typically used in an adder, which makes it more complicated than shown here. In particular, a new carry can be generated when two 1 bits are added. Since we're looking at an incrementer, this case can be ignored. The Manchester carry chain was first described in "Parallel addition in digital computers: a new fast 'carry' circuit". It was developed at the University of Manchester in 1959 and used in the Atlas supercomputer. ↩

6. For some reason, the incrementer uses a completely different XOR circuit from the comparator, built from a multiplexer instead of logic. In the circuit below, the two CMOS switches form a multiplexer: if the first input is 1, the top switch turns on, while if the first input is a 0, the bottom switch turns on. Thus, if the first input is a 1, the second input passes through and then is inverted to form the output. But if the first input is a 0, the second input is inverted before the switch and then is inverted again to form the output. Thus, the second input is inverted if the first input is 1, which is a description of XOR.

The implementation of an XOR gate in the incrementer.

I don't see any clear reason why two different XOR circuits were used in different parts of the prefetcher. Perhaps the available space for the layout made a difference. Or maybe the different circuits have different timing or output current characteristics. Or it could just be the personal preference of the designers. ↩

7. The latch circuit is based on a CMOS switch (or transmission gate) and a weak inverter. Normally, the inverter loop holds the bit. However, if the CMOS switch is enabled, its output overpowers the signal from the weak inverter, forcing the inverter loop into the desired state. The CMOS switch consists of an NMOS transistor and a PMOS transistor in parallel.
By setting the top control input high and the bottom control input low, both transistors turn on, allowing the signal to pass through the switch. Conversely, by setting the top input low and the bottom input high, both transistors turn off, blocking the signal. CMOS switches are used extensively in the 386 to form multiplexers, create latches, and implement XOR. ↩

8. Most of the 386's control circuitry is to the right of the datapath, rather than awkwardly wedged into the datapath. So why is this circuit different? My hypothesis is that since the circuit needs the values of bit 15 and bit 7, it made sense to put the circuitry next to bits 15 and 7; if this control circuitry were off to the right, long wires would need to run from bits 15 and 7 to the circuitry. ↩

9. In case this post is getting tedious, I'll provide a lighter footnote on sign extension. The obvious mnemonic for a sign extension instruction is SEX, but that mnemonic was too risqué for Intel. The Motorola 6809 processor (1978) used this mnemonic, as did the related 68HC12 microcontroller (1996). However, Steve Morse, architect of the 8086, stated that the sign extension instructions on the 8086 were initially named SEX but were renamed before release to the more conservative CBW and CWD (Convert Byte to Word and Convert Word to Double word). The DEC PDP-11 was a bit contradictory: it had a sign extend instruction with the mnemonic SXT, and the Jargon File claims that DEC engineers almost got SEX as the assembler mnemonic, but marketing forced the change. On the other hand, SEX was the official abbreviation for Sign Extend (see the PDP-11 Conventions Manual and the PDP-11 Paper Tape Software Handbook) and SEX was used in the microcode for sign extend. RCA's CDP1802 processor (1976) may have been the first with a SEX instruction, using that mnemonic for the unrelated Set X instruction. See also this Retrocomputing Stack Exchange page. ↩

10. It seems inconvenient to send instructions all the way across the chip from the Bus Interface Unit to the prefetch queue and then back across the chip to the instruction decoder, which is next to the Bus Interface Unit. But this was probably the best alternative for the layout, since you can't put everything close to everything. The 32-bit datapath circuitry is on the left, organized into 32 columns. It would be nice to put the Bus Interface Unit over there too, but there isn't room, so you end up with the wide 32-bit data bus going across the chip. Sending instruction bytes across the chip has less of an impact, since the instruction bus is just 8 bits wide. ↩

11. See "Performance Optimizations of the 80386", Slager, Oct 1986, in Proceedings of ICCD, pages 165-168. ↩
