The absurdly complicated circuitry for the 386 processor's registers

from Ken Shirriff's blog [alt+shift+b] in technology

The groundbreaking Intel 386 processor (1985) was the first 32-bit processor in the x86 architecture. Like most processors, the 386 contains numerous registers; registers are a key part of a processor because they provide storage that is much faster than main memory. The register set of the 386 includes general-purpose registers, index registers, and segment selectors, as well as registers with special functions for memory management and operating system implementation. In this blog post, I look at the silicon die of the 386 and explain how the processor implements its main registers. It turns out that the circuitry that implements the 386's registers is much more complicated than one would expect. For the 30 registers that I examine, instead of using a standard circuit, the 386 uses six different circuits, each one optimized for the particular characteristics of the register. For some registers, Intel squeezes register cells together to double the storage capacity. Other registers...

2 months ago

Remove from reading list Add to reading list [alt+a] Read now [→]

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from Ken Shirriff's blog

Reverse engineering the 386 processor's prefetch queue circuitry

In 1985, Intel introduced the groundbreaking 386 processor, the first 32-bit processor in the x86 architecture. To improve performance, the 386 has a 16-byte instruction prefetch queue. The purpose of the prefetch queue is to fetch instructions from memory before they are needed, so the processor usually doesn't need to wait on memory while executing instructions. Instruction prefetching takes advantage of times when the processor is "thinking" and the memory bus would otherwise be unused. In this article, I look at the 386's prefetch queue circuitry in detail. One interesting circuit is the incrementer, which adds 1 to a pointer to step through memory. This sounds easy enough, but the incrementer uses complicated circuitry for high performance. The prefetch queue uses a large network to shift bytes around so they are properly aligned. It also has a compact circuit to extend signed 8-bit and 16-bit numbers to 32 bits. There aren't any major discoveries in this post, but if you're interested in low-level circuits and dynamic logic, keep reading. The photo below shows the 386's shiny fingernail-sized silicon die under a microscope. Although it may look like an aerial view of a strangely-zoned city, the die photo reveals the functional blocks of the chip. The Prefetch Unit in the upper left is the relevant block. In this post, I'll discuss the prefetch queue circuitry (highlighted in red), skipping over the prefetch control circuitry to the right. The Prefetch Unit receives data from the Bus Interface Unit (upper right) that communicates with memory. The Instruction Decode Unit receives prefetched instructions from the Prefetch Unit, byte by byte, and decodes the opcodes for execution. This die photo of the 386 shows the location of the registers. Click this image (or any other) for a larger version. The left quarter of the chip consists of stripes of circuitry that appears much more orderly than the rest of the chip. This grid-like appearance arises because each functional block is constructed (for the most part) by repeating the same circuit 32 times, once for each bit, side by side. Vertical data lines run up and down, in groups of 32 bits, connecting the functional blocks. To make this work, each circuit must fit into the same width on the die; this layout constraint forces the circuit designers to develop a circuit that uses this width efficiently without exceeding the allowed width. The circuitry for the prefetch queue uses the same approach: each circuit is 66 µm wide1 and repeated 32 times. As will be seen, fitting the prefetch circuitry into this fixed width requires some layout tricks. What the prefetcher does The purpose of the prefetch unit is to speed up performance by reading instructions from memory before they are needed, so the processor won't need to wait to get instructions from memory. Prefetching takes advantage of times when the memory bus is otherwise idle, minimizing conflict with other instructions that are reading or writing data. In the 386, prefetched instructions are stored in a 16-byte queue, consisting of four 32-bit blocks.2 The diagram below zooms in on the prefetcher and shows its main components. You can see how the same circuit (in most cases) is repeated 32 times, forming vertical bands. At the top are 32 bus lines from the Bus Interface Unit. These lines provide the connection between the datapath and external memory, via the Bus Interface Unit. These lines form a triangular pattern as the 32 horizontal lines on the right branch off and form 32 vertical lines, one for each bit. Next are the fetch pointer and the limit register, with a circuit to check if the fetch pointer has reached the limit. Note that the two low-order bits (on the right) of the incrementer and limit check circuit are missing. At the bottom of the incrementer, you can see that some bit positions have a blob of circuitry missing from others, breaking the pattern of repeated blocks. The 16-byte prefetch queue is below the incrementer. Although this memory is the heart of the prefetcher, its circuitry takes up a relatively small area. A close-up of the prefetcher with the main blocks labeled. At the right, the prefetcher receives control signals. The bottom part of the prefetcher shifts data to align it as needed. A 32-bit value can be split across two 32-bit rows of the prefetch buffer. To handle this, the prefetcher includes a data shift network to shift and align its data. This network occupies a lot of space, but there is no active circuitry here: just a grid of horizontal and vertical wires. Finally, the sign extend circuitry converts a signed 8-bit or 16-bit value into a signed 16-bit or 32-bit value as needed. You can see that the sign extend circuitry is highly irregular, especially in the middle. A latch stores the output of the prefetch queue for use by the rest of the datapath. Limit check If you've written x86 programs, you probably know about the processor's Instruction Pointer (EIP) that holds the address of the next instruction to execute. As a program executes, the Instruction Pointer moves from instruction to instruction. However, it turns out that the Instruction Pointer doesn't actually exist! Instead, the 386 has an "Advance Instruction Fetch Pointer", which holds the address of the next instruction to fetch into the prefetch queue. But sometimes the processor needs to know the Instruction Pointer value, for instance, to determine the return address when calling a subroutine or to compute the destination address of a relative jump. So what happens? The processor gets the Advance Instruction Fetch Pointer address from the prefetch queue circuitry and subtracts the current length of the prefetch queue. The result is the address of the next instruction to execute, the desired Instruction Pointer value. The Advance Instruction Fetch Pointer—the address of the next instruction to prefetch—is stored in a register at the top of the prefetch queue circuitry. As instructions are prefetched, this pointer is incremented by the prefetch circuitry. (Since instructions are fetched 32 bits at a time, this pointer is incremented in steps of four and the bottom two bits are always 0.) But what keeps the prefetcher from prefetching too far and going outside the valid memory range? The x86 architecture infamously uses segments to define valid regions of memory. A segment has a start and end address (known as the base and limit) and memory is protected by blocking accesses outside the segment. The 386 has six active segments; the relevant one is the Code Segment that holds program instructions. Thus, the limit address of the Code Segment controls when the prefetcher must stop prefetching.3 The prefetch queue contains a circuit to stop prefetching when the fetch pointer reaches the limit of the Code Segment. In this section, I'll describe that circuit. Comparing two values may seem trivial, but the 386 uses a few tricks to make this fast. The basic idea is to use 30 XOR gates to compare the bits of the two registers. (Why 30 bits and not 32? Since 32 bits are fetched at a time, the bottom bits of the address are 00 and can be ignored.) If the two registers match, all the XOR values will be 0, but if they don't match, an XOR value will be 1. Conceptually, connecting the XORs to a 32-input OR gate will yield the desired result: 0 if all bits match and 1 if there is a mismatch. Unfortunately, building a 32-input OR gate using standard CMOS logic is impractical for electrical reasons, as well as inconveniently large to fit into the circuit. Instead, the 386 uses dynamic logic to implement a spread-out NOR gate with one transistor in each column of the prefetcher. The schematic below shows the implementation of one bit of the equality comparison. The mechanism is that if the two registers differ, the transistor on the right is turned on, pulling the equality bus low. This circuit is replicated 30 times, comparing all the bits: if there is any mismatch, the equality bus will be pulled low, but if all bits match, the bus remains high. The three gates on the left implement XNOR; this circuit may seem overly complicated, but it is a standard way of implementing XNOR. The NOR gate at the right blocks the comparison except during clock phase 2. (The importance of this will be explained below.) This circuit is repeated 30 times to compare the registers. The equality bus travels horizontally through the prefetcher, pulled low if any bits don't match. But what pulls the bus high? That's the job of the dynamic circuit below. Unlike regular static gates, dynamic logic is controlled by the processor's clock signals and depends on capacitance in the circuit to hold data. The 386 is controlled by a two-phase clock signal.4 In the first clock phase, the precharge transistor below turns on, pulling the equality bus high. In the second clock phase, the XOR circuits above are enabled, pulling the equality bus low if the two registers don't match. Meanwhile, the CMOS switch turns on in clock phase 2, passing the equality bus's value to the latch. The "keeper" circuit keeps the equality bus held high unless it is explicitly pulled low, to avoid the risk of the voltage on the equality bus slowly dissipating. The keeper uses a weak transistor to keep the bus high while inactive. But if the bus is pulled low, the keeper transistor is overpowered and turns off. This is the output circuit for the equality comparison. This circuit is located to the right of the prefetcher. This dynamic logic reduces power consumption and circuit size. Since the bus is charged and discharged during opposite clock phases, you avoid steady current through the transistors. (In contrast, an NMOS processor like the 8086 might use a pull-up on the bus. When the bus is pulled low, would you end up with current flowing through the pull-up and the pull-down transistors. This would increase power consumption, make the chip run hotter, and limit your clock speed.) The incrementer After each prefetch, the Advance Instruction Fetch Pointer must be incremented to hold the address of the next instruction to prefetch. Incrementing this pointer is the job of the incrementer. (Because each fetch is 32 bits, the pointer is incremented by 4 each time. But in the die photo, you can see a notch in the incrementer and limit check circuit where the circuitry for the bottom two bits has been omitted. Thus, the incrementer's circuitry increments its value by 1, so the pointer (with two zero bits appended) increases in steps of 4.) Building an incrementer circuit is straightforward, for example, you can use a chain of 30 half-adders. The problem is that incrementing a 30-bit value at high speed is difficult because of the carries from one position to the next. It's similar to calculating 99999999 + 1 in decimal; you need to tediously carry the 1, carry the 1, carry the 1, and so forth, through all the digits, resulting in a slow, sequential process. The incrementer uses a faster approach. First, it computes all the carries at high speed, almost in parallel. Then it computes each output bit in parallel from the carries—if there is a carry into a position, it toggles that bit. Computing the carries is straightforward in concept: if there is a block of 1 bits at the end of the value, all those bits will produce carries, but carrying is stopped by the rightmost 0 bit. For instance, incrementing binary 11011 results in 11100; there are carries from the last two bits, but the zero stops the carries. A circuit to implement this was developed at the University of Manchester in England way back in 1959, and is known as the Manchester carry chain. In the Manchester carry chain, you build a chain of switches, one for each data bit, as shown below. For a 1 bit, you close the switch, but for a 0 bit you open the switch. (The switches are implemented by transistors.) To compute the carries, you start by feeding in a carry signal at the right The signal will go through the closed switches until it hits an open switch, and then it will be blocked.5 The outputs along the chain give us the desired carry value at each position. Concept of the Manchester carry chain, 4 bits. Since the switches in the Manchester carry chain can all be set in parallel and the carry signal blasts through the switches at high speed, this circuit rapidly computes the carries we need. The carries then flip the associated bits (in parallel), giving us the result much faster than a straightforward adder. There are complications, of course, in the actual implementation. The carry signal in the carry chain is inverted, so a low signal propagates through the carry chain to indicate a carry. (It is faster to pull a signal low than high.) But something needs to make the line go high when necessary. As with the equality circuitry, the solution is dynamic logic. That is, the carry line is precharged high during one clock phase and then processing happens in the second clock phase, potentially pulling the line low. The next problem is that the carry signal weakens as it passes through multiple transistors and long lengths of wire. The solution is that each segment has a circuit to amplify the signal, using a clocked inverter and an asymmetrical inverter. Importantly, this amplifier is not in the carry chain path, so it doesn't slow down the signal through the chain. The Manchester carry chain circuit for a typical bit in the incrementer. The schematic above shows the implementation of the Manchester carry chain for a typical bit. The chain itself is at the bottom, with the transistor switch as before. During clock phase 1, the precharge transistor pulls this segment of the carry chain high. During clock phase 2, the signal on the chain goes through the "clocked inverter" at the right to produce the local carry signal. If there is a carry, the next bit is flipped by the XOR gate, producing the incremented output.6 The "keeper/amplifier" is an asymmetrical inverter that produces a strong low output but a weak high output. When there is no carry, its weak output keeps the carry chain pulled high. But as soon as a carry is detected, it strongly pulls the carry chain low to boost the carry signal. But this circuit still isn't enough for the desired performance. The incrementer uses a second carry technique in parallel: carry skip. The concept is to look at blocks of bits and allow the carry to jump over the entire block. The diagram below shows a simplified implementation of the carry skip circuit. Each block consists of 3 to 6 bits. If all the bits in a block are 1's, then the AND gate turns on the associated transistor in the carry skip line. This allows the carry skip signal to propagate (from left to right), a block at a time. When it reaches a block with a 0 bit, the corresponding transistor will be off, stopping the carry as in the Manchester carry chain. The AND gates all operate in parallel, so the transistors are rapidly turned on or off in parallel. Then, the carry skip signal passes through a small number of transistors, without going through any logic. (The carry skip signal is like an express train that skips most stations, while the Manchester carry chain is the local train to all the stations.) Like the Manchester carry chain, the implementation of carry skip needs precharge circuits on the lines, a keeper/amplifier, and clocked logic, but I'll skip the details. An abstracted and simplified carry-skip circuit. The block sizes don't match the 386's circuit. One interesting feature is the layout of the large AND gates. A 6-input AND gate is a large device, difficult to fit into one cell of the incrementer. The solution is that the gate is spread out across multiple cells. Specifically, the gate uses a standard CMOS NAND gate circuit with NMOS transistors in series and PMOS transistors in parallel. Each cell has an NMOS transistor and a PMOS transistor, and the chains are connected at the end to form the desired NAND gate. (Inverting the output produces the desired AND function.) This spread-out layout technique is unusual, but keeps each bit's circuitry approximately the same size. The incrementer circuitry was tricky to reverse engineer because of these techniques. In particular, most of the prefetcher consists of a single block of circuitry repeated 32 times, once for each bit. The incrementer, on the other hand, consists of four different blocks of circuitry, repeating in an irregular pattern. Specifically, one block starts a carry chain, a second block continues the carry chain, and a third block ends a carry chain. The block before the ending block is different (one large transistor to drive the last block), making four variants in total. This irregular pattern is visible in the earlier photo of the prefetcher. The alignment network The bottom part of the prefetcher rotates data to align it as needed. Unlike some processors, the x86 does not enforce aligned memory accesses. That is, a 32-bit value does not need to start on a 4-byte boundary in memory. As a result, a 32-bit value may be split across two 32-bit rows of the prefetch queue. Moreover, when the instruction decoder fetches one byte of an instruction, that byte may be at any position in the prefetch queue. To deal with these problems, the prefetcher includes an alignment network that can rotate bytes to output a byte, word, or four bytes with the alignment required by the rest of the processor. The diagram below shows part of this alignment network. Each bit exiting the prefetch queue (top) has four wires, for rotates of 24, 16, 8, or 0 bits. Each rotate wire is connected to one of the 32 horizontal bit lines. Finally, each horizontal bit line has an output tap, going to the datapath below. (The vertical lines are in the chip's lower M1 metal layer, while the horizontal lines are in the upper M2 metal layer. For this photo, I removed the M2 layer to show the underlying layer. Shadows of the original horizontal lines are still visible.) Part of the alignment network. The idea is that by selecting one set of vertical rotate lines, the 32-bit output from the prefetch queue will be rotated left by that amount. For instance, to rotate by 8, bits are sent down the "rotate 8" lines. Bit 0 from the prefetch queue will energize horizontal line 8, bit 1 will energize horizontal line 9, and so forth, with bit 31 wrapping around to horizontal line 7. Since horizontal bit line 8 is connected to output 8, the result is that bit 0 is output as bit 8, bit 1 is output as bit 9, and so forth. The four possibilities for aligning a 32-bit value. The four bytes above are shifted as specified to produce the desired output below. For the alignment process, one 32-bit output may be split across two 32-bit entries in the prefetch queue in four different ways, as shown above. These combinations are implemented by multiplexers and drivers. Two 32-bit multiplexers select the two relevant rows in the prefetch queue (blue and green above). Four 32-bit drivers are connected to the four sets of vertical lines, with one set of drivers activated to produce the desired shift. Each byte of each driver is wired to achieve the alignment shown above. For instance, the rotate-8 driver gets its top byte from the "green" multiplexer and the other three bytes from the "blue" multiplexer. The result is that the four bytes, split across two queue rows, are rotated to form an aligned 32-bit value. Sign extension The final circuit is sign extension. Suppose you want to add an 8-bit value to a 32-bit value. An unsigned 8-bit value can be extended to 32 bits by simply filling the upper bits with zeroes. But for a signed value, it's trickier. For instance, -1 is the eight-bit value 0xFF, but the 32-bit value is 0xFFFFFFFF. To convert an 8-bit signed value to 32 bits, the top 24 bits must be filled in with the top bit of the original value (which indicates the sign). In other words, for a positive value, the extra bits are filled with 0, but for a negative value, the extra bits are filled with 1. This process is called sign extension.9 In the 386, a circuit at the bottom of the prefetcher performs sign extension for values in instructions. This circuit supports extending an 8-bit value to 16 bits or 32 bits, as well as extending a 16-bit value to 32 bits. This circuit will extend a value with zeros or with the sign, depending on the instruction. The schematic below shows one bit of this sign extension circuit. It consists of a latch on the left and right, with a multiplexer in the middle. The latches are constructed with a standard 386 circuit using a CMOS switch (see footnote).7 The multiplexer selects one of three values: the bit value from the swap network, 0 for sign extension, or 1 for sign extension. The multiplexer is constructed from a CMOS switch if the bit value is selected and two transistors for the 0 or 1 values. This circuit is replicated 32 times, although the bottom byte only has the latches, not the multiplexer, as sign extension does not modify the bottom byte. The sign extend circuit associated with bits 31-8 from the prefetcher. The second part of the sign extension circuitry determines if the bits should be filled with 0 or 1 and sends the control signals to the circuit above. The gates on the left determine if the sign extension bit should be a 0 or a 1. For a 16-bit sign extension, this bit comes from bit 15 of the data, while for an 8-bit sign extension, the bit comes from bit 7. The four gates on the right generate the signals to sign extend each bit, producing separate signals for the bit range 31-16 and the range 15-8. This circuit determines which bits should be filled with 0 or 1. The layout of this circuit on the die is somewhat unusual. Most of the prefetcher circuitry consists of 32 identical columns, one for each bit.8 The circuitry above is implemented once, using about 16 gates (buffers and inverters are not shown above). Despite this, the circuitry above is crammed into bit positions 17 through 7, creating irregularities in the layout. Moreover, the implementation of the circuitry in silicon is unusual compared to the rest of the 386. Most of the 386's circuitry uses the two metal layers for interconnection, minimizing the use of polysilicon wiring. However, the circuit above also uses long stretches of polysilicon to connect the gates. Layout of the sign extension circuitry. This circuitry is at the bottom of the prefetch queue. The diagram above shows the irregular layout of the sign extension circuitry amid the regular datapath circuitry that is 32 bits wide. The sign extension circuitry is shown in green; this is the circuitry described at the top of this section, repeated for each bit 31-8. The circuitry for bits 15-8 has been shifted upward, perhaps to make room for the sign extension control circuitry, indicated in red. Note that the layout of the control circuitry is completely irregular, since there is one copy of the circuitry and it has no internal structure. One consequence of this layout is the wasted space to the left and right of this circuitry block, the tan regions with no circuitry except vertical metal lines passing through. At the far right, a block of circuitry to control the latches has been wedged under bit 0. Intel's designers go to great effort to minimize the size of the processor die since a smaller die saves substantial money. This layout must have been the most efficient they could manage, but I find it aesthetically displeasing compared to the regularity of the rest of the datapath. How instructions flow through the chip Instructions follow a tortuous path through the 386 chip. First, the Bus Interface Unit in the upper right corner reads instructions from memory and sends them over a 32-bit bus (blue) to the prefetch unit. The prefetch unit stores the instructions in the 16-byte prefetch queue. Instructions follow a twisting path to and from the prefetch queue. How is an instruction executed from the prefetch queue? It turns out that there are two distinct paths. Suppose you're executing an instruction to add 12345678 to the EAX register. The prefetch queue will hold the five bytes 05 (the opcode), 78, 56, 34, and 12. The prefetch queue provides opcodes to the decoder one byte at a time over the 8-bit bus shown in red. The bus takes the lowest 8 bits from the prefetch queue's alignment network and sends this byte to a buffer (the small square at the head of the red arrow). From there, the opcode travels to the instruction decoder.10 The instruction decoder, in turn, uses large tables (PLAs) to convert the x86 instruction into a 111-bit internal format with 19 different fields.11 The data bytes of an instruction, on the other hand, go from the prefetch queue to the ALU (Arithmetic Logic Unit) through a 32-bit data bus (orange). Unlike the previous buses, this data bus is spread out, with one wire through each column of the datapath. This bus extends through the entire datapath so values can also be stored into registers. For instance, the MOV (move) instruction can store a value from an instruction (an "immediate" value) into a register. Conclusions The 386's prefetch queue contains about 7400 transistors, more than an Intel 8080 processor. (And this is just the queue itself; I'm ignoring the prefetch control logic.) This illustrates the rapid advance of processor technology: part of one functional unit in the 386 contains more transistors than an entire 8080 processor from 11 years earlier. And this unit is less than 3% of the entire 386 processor. Every time I look at an x86 circuit, I see the complexity required to support backward compatibility, and I gain more understanding of why RISC became popular. The prefetcher is no exception. Much of the complexity is due to the 386's support for unaligned memory accesses, requiring a byte shift network to move bytes into 32-bit alignment. Moreover, at the other end of the instruction bus is the complicated instruction decoder that decodes intricate x86 instructions. Decoding RISC instructions is much easier. In any case, I hope you've found this look at the prefetch circuitry interesting. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates. I've written multiple articles on the 386 previously; a good place to start might be my survey of the 368 dies. Footnotes and references The width of the circuitry for one bit changes a few times: while the prefetch queue and segment descriptor cache use a circuit that is 66 µm wide, the datapath circuitry is a bit tighter at 60 µm. The barrel shifter is even narrower at 54.5 µm per bit. Connecting circuits with different widths wastes space, since the wiring to connect the bits requires horizontal segments to adjust the spacing. But it also wastes space to use widths that are wider than needed. Thus, changes in the spacing are rare, where the tradeoffs make it worthwhile. ↩ The Intel 8086 processor had a six-byte prefetch queue, while the Intel 8088 (used in the original IBM PC) had a prefetch queue of just four bytes. In comparison, the 16-byte queue of the 386 seems luxurious. (Some 386 processors, however, are said to only use 12 bytes due to a bug.) The prefetch queue assumes instructions are executed in linear order, so it doesn't help with branches or loops. If the processor encounters a branch, the prefetch queue is discarded. (In contrast, a modern cache will work even if execution jumps around.) Moreover, the prefetch queue doesn't handle self-modifying code. (It used to be common for code to change itself while executing to squeeze out extra performance.) By loading code into the prefetch queue and then modifying instructions, you could determine the size of the prefetch queue: if the old instruction was executed, it must be in the prefetch queue, but if the modified instruction was executed, it must be outside the prefetch queue. Starting with the Pentium Pro, x86 processors flush the prefetch queue if a write modifies a prefetched instruction. ↩ The prefetch unit generates "linear" addresses that must be translated to physical addresses by the paging unit (ref). ↩ I don't know which phase of the clock is phase 1 and which is phase 2, so I've assigned the numbers arbitrarily. The 386 creates four clock signals internally from a clock input CLK2 that runs at twice the processor's clock speed. The 386 generates a two-phase clock with non-overlapping phases. That is, there is a small gap between when the first phase is high and when the second phase is high. The 386's circuitry is controlled by the clock, with alternate blocks controlled by alternate phases. Since the clock phases don't overlap, this ensures that logic blocks are activated in sequence, allowing the orderly flow of data. But because the 386 uses CMOS, it also needs active-low clocks for the PMOS transistors. You might think that you could simply use the phase 1 clock as the active-low phase 2 clock and vice versa. The problem is that these clock phases overlap when used as active-low; there are times when both clock signals are low. Thus, the two clock phases must be explicitly inverted to produce the two active-low clock phases. I described the 386's clock generation circuitry in detail in this article. ↩ The Manchester carry chain is typically used in an adder, which makes it more complicated than shown here. In particular, a new carry can be generated when two 1 bits are added. Since we're looking at an incrementer, this case can be ignored. The Manchester carry chain was first described in Parallel addition in digital computers: a new fast ‘carry’ circuit. It was developed at the University of Manchester in 1959 and used in the Atlas supercomputer. ↩ For some reason, the incrementer uses a completely different XOR circuit from the comparator, built from a multiplexer instead of logic. In the circuit below, the two CMOS switches form a multiplexer: if the first input is 1, the top switch turns on, while if the first input is a 0, the bottom switch turns on. Thus, if the first input is a 1, the second input passes through and then is inverted to form the output. But if the first input is a 0, the second input is inverted before the switch and then is inverted again to form the output. Thus, the second input is inverted if the first input is 1, which is a description of XOR. The implementation of an XOR gate in the incrementer. I don't see any clear reason why two different XOR circuits were used in different parts of the prefetcher. Perhaps the available space for the layout made a difference. Or maybe the different circuits have different timing or output current characteristics. Or it could just be the personal preference of the designers. ↩ The latch circuit is based on a CMOS switch (or transmission gate) and a weak inverter. Normally, the inverter loop holds the bit. However, if the CMOS switch is enabled, its output overpowers the signal from the weak inverter, forcing the inverter loop into the desired state. The CMOS switch consists of an NMOS transistor and a PMOS transistor in parallel. By setting the top control input high and the bottom control input low, both transistors turn on, allowing the signal to pass through the switch. Conversely, by setting the top input low and the bottom input high, both transistors turn off, blocking the signal. CMOS switches are used extensively in the 386, to form multiplexers, create latches, and implement XOR. ↩ Most of the 386's control circuitry is to the right of the datapath, rather than awkwardly wedged into the datapath. So why is this circuit different? My hypothesis is that since the circuit needs the values of bit 15 and bit 7, it made sense to put the circuitry next to bits 15 and 7; if this control circuitry were off to the right, long wires would need to run from bits 15 and 7 to the circuitry. ↩ In case this post is getting tedious, I'll provide a lighter footnote on sign extension. The obvious mnemonic for a sign extension instruction is SEX, but that mnemonic was too risque for Intel. The Motorola 6809 processor (1978) used this mnemonic, as did the related 68HC12 microcontroller (1996). However, Steve Morse, architect of the 8086, stated that the sign extension instructions on the 8086 were initially named SEX but were renamed before release to the more conservative CBW and CWD (Convert Byte to Word and Convert Word to Double word). The DEC PDP-11 was a bit contradictory. It has a sign extend instruction with the mnemonic SXT; the Jargon File claims that DEC engineers almost got SEX as the assembler mnemonic, but marketing forced the change. On the other hand, SEX was the official abbreviation for Sign Extend (see PDP-11 Conventions Manual, PDP-11 Paper Tape Software Handbook) and SEX was used in the microcode for sign extend. RCA's CDP1802 processor (1976) may have been the first with a SEX instruction, using the mnemonic SEX for the unrelated Set X instruction. See also this Retrocomputing Stack Exchange page. ↩ It seems inconvenient to send instructions all the way across the chip from the Bus Interface Unit to the prefetch queue and then back across to the chip to the instruction decoder, which is next to the Bus Interface Unit. But this was probably the best alternative for the layout, since you can't put everything close to everything. The 32-bit datapath circuitry is on the left, organized into 32 columns. It would be nice to put the Bus Interface Unit other there too, but there isn't room, so you end up with the wide 32-bit data bus going across the chip. Sending instruction bytes across the chip is less of an impact, since the instruction bus is just 8 bits wide. ↩ See "Performance Optimizations of the 80386", Slager, Oct 1986, in Proceedings of ICCD, pages 165-168. ↩

2 months ago • 34 votes

A tricky Commodore PET repair: tracking down 6 1/2 bad chips

.cite { font-size: 70%;} .ref { vertical-align: super; font-size: 60%;} code {font-size: 100%; font-family: courier, fixed;} In 1977, Commodore released the PET computer, a quirky home computer that combined the processor, a tiny keyboard, a cassette drive for storage, and a trapezoidal screen in a metal unit. The Commodore PET, the Apple II, and Radio Shack's TRS-80 started the home computer market with ready-to-run computers, systems that were called in retrospect the 1977 Trinity. I did much of my early programming on the PET, so when someone offered me a non-working PET a few years ago, I took it for nostalgic reasons. You'd think that a home computer would be easy to repair, but it turned out to be a challenge.1 The chips in early PETs are notorious for failures and, sure enough, we found multiple bad chips. Moreover, these RAM and ROM chips were special designs that are mostly unobtainable now. In this post, I'll summarize how we repaired the system, in case it helps anyone else. When I first powered up the computer, I was greeted with a display full of random characters. This was actually reassuring since it showed that most of the computer was working: not just the monitor, but the video RAM, character ROM, system clock, and power supply were all operational. The Commodore PET started up, but the screen was full of garbage. With an oscilloscope, I examined signals on the system bus and found that the clock, address, and data lines were full of activity, so the 6502 CPU seemed to be operating. However, some of the data lines had three voltage levels, as shown below. This was clearly not good, and suggested that a chip on the bus was messing up the data signals. The scope shows three voltage levels on the data bus. Some helpful sites online7 suggested that if a PET gets stuck before clearing the screen, the most likely cause is a failure of a system ROM chip. Fortunately, Marc has a Retro Chip Tester, a cool device designed to test vintage ICs: not just 7400-series logic, but vintage RAMs and ROMs. Moreover, the tester knows the correct ROM contents for a ton of old computers, so it can tell if a PET ROM has the right contents. The Retro Chip Tester showed that two of the PET's seven ROM chips had failed. These chips are MOS Technologies MPS6540, a 2K×8 ROM with a weird design that is incompatible with standard ROMs. Fortunately, several people make adapter boards that let you substitute a standard 2716 EPROM, so I ordered two adapter boards, assembled them, and Marc programmed the 2716 EPROMs from online data files. The 2716 EPROM requires a bit more voltage to program than Marc's programmer supported, but the chips seemed to have the right contents (foreshadowing). The PET opened, showing the motherboard. The PET's case swings open with an arm at the left to hold it open like a car hood. The first two rows of chips at the front of the motherboard are the RAM chips. Behind the RAM are the seven ROM chips; two have been replaced by the ROM adapter boards. The 6502 processor is the large black chip behind the ROMs, toward the right. With the adapter boards in place, I powered on the PET with great expectations of success, but it failed in precisely the same way as before, failing to clear the garbage off the screen. Marc decided it was time to use his Agilent 1670G logic analyzer to find out what was going on; (Dating back to 1999, this logic analyzer is modern by Marc's standards.) He wired up the logic analyzer to the 6502 chip, as shown below, so we could track the address bus, data bus, and the read/write signal. Meanwhile, I disassembled the ROM contents using Ghidra, so I could interpret the logic analyzer against the assembly code. (Ghidra is a program for reverse-engineering software that was developed by the NSA, strangely enough.) Marc wired up the logic analyzer to the 6502 chip. The logic analyzer provided a trace of every memory access from the 6502 processor, showing what it was executing. Everything went well for a while after the system was turned on: the processor jumped to the reset vector location, did a bit of initialization, tested the memory, but then everything went haywire. I noticed that the memory test failed on the first byte. Then the software tried to get more storage by garbage collecting the BASIC program and variables. Since there wasn't any storage at all, this didn't go well and the system hung before reaching the code that clears the screen. We tested the memory chips, using the Retro Chip Tester again, and found three bad chips. Like the ROM chips, the RAM chips are unusual: MOS Technology 6550 static RAM chip, 1K×4. By removing the bad chips and shuffling the good chips around, we reduced the 8K PET to a 6K PET. This time, the system booted, although there was a mysterious 2×2 checkerboard symbol near the middle of the screen (foreshadowing). I typed in a simple program to print "HELLO", but the results were very strange: four floating-point numbers, followed by a hang. This program didn't work the way I expected. This behavior was very puzzling. I could successfully enter a program into the computer, which exercises a lot of the system code. (It's not like a terminal, where echoing text is trivial; the PET does a lot of processing behind the scenes to parse a BASIC program as it is entered.) However, the output of the program was completely wrong, printing floating-point numbers instead of a string. We also encountered an intermittent problem that after turning the computer on, the boot message would be complete gibberish, as shown below. Instead of the "*** COMMODORE BASIC ***" banner, random characters and graphics would appear. The garbled boot message. How could the computer be operating well for the most part, yet also completely wrong? We went back to the logic analyzer to find out. I figured that the gibberish boot message would probably be the easiest thing to track down, since that happens early in the boot process. Looking at the code, I discovered that after the software tests the memory, it converts the memory size to an ASCII string using a moderately complicated algorithm.2 Then it writes the system boot message and the memory size to the screen. The PET uses a subroutine to write text to the screen. A pointer to the text message is held in memory locations 0071 and 0072. The assembly code below stores the pointer (in the X and Y registers) into these memory locations. (This Ghidra output shows the address, the instruction bytes, and the symbolic assembler instructions.) For the code above, you'd expect the processor to read the instruction bytes 86 and 71, and then write to address 0071. Next it should read the bytes 84 and 72 and write to address 0072. However, the logic analyzer output below showed that something slightly different happened. The processor fetched instruction bytes 86 and 71 from addresses D5AE and D5AF, then wrote 00 to address 0071, as expected. Next, it fetched instruction bytes 84 and 72 as expected, but wrote 01 to address 007A, not 0072! 007A 01 0 This was a smoking gun. The processor had messed up and there was a one-bit error in the address. Maybe the 6502 processor issued a bad signal or maybe something else was causing problems on the bus. The consequence of this error was that the string pointer referenced random memory rather than the desired boot message, so random characters were written to the screen. Next, I investigated why the screen had a mysterious checkerboard character. I wrote a program to scan the logic analyzer output to extract all the writes to screen memory. Most of the screen operations made sense—clearing the screen at startup and then writing the boot message—but I found one unexpected write to the screen. In the assembly code below, the Y register should be written to zero-page address 5e, and the X register should be written to the address 66, some locations used by the BASIC interpreter. However, the logic analyzer output below showed a problem. The first line should fetch the opcode 84 from address d3c8, but the processor received the opcode 8c from the ROM, the instruction to write to a 16-bit address. The result was that instead of writing to a zero-page address, the 6502 fetched another byte to write to a 16-bit address. Specifically, it grabbed the STX instruction (86) and used that as part of the address, writing FF (a checkerboard character) to screen memory at 865E3 instead of to the BASIC data structure at 005E. Moreover, the STX instruction wasn't executed, since it was consumed as an address. Thus, not only did a stray character get written to the screen, but data structures in memory didn't get updated. It's not surprising that the BASIC interpreter went out of control when it tried to run the program. 8C 1 186601 D3C9 5E 1 186602 D3CA 86 1 186603 865E FF 0 We concluded that a ROM was providing the wrong byte (8C) at address D3C8. This ROM turned out to be one of our replacements; the under-powered EPROM programmer had resulted in a flaky byte. Marc re-programmed the EPROM with a more powerful programmer. The system booted, but with much less RAM than expected. It turned out that another RAM chip had failed. Finally, we got the PET to run. I typed in a simple program to generate an animated graphical pattern, a program I remembered from when I was about 134, and generated this output: Finally, the PET worked and displayed some graphics. Imagine this pattern constantly changing. In retrospect, I should have tested all the RAM and ROM chips at the start, and we probably could have found the faults without the logic analyzer. However, the logic analyzer gave me an excuse to learn more about Ghidra and the PET's assembly code, so it all worked out in the end. In the end, the PET had 6 bad chips: two ROMs and four RAMs. The 6502 processor itself turned out to be fine.5 The photo below shows the 6 bad chips on top of the PET's tiny keyboard. On the top of each key, you can see the quirky graphical character set known as PETSCII.6 As for the title, I'm counting the badly-programmed ROM as half a bad chip since the chip itself wasn't bad but it was functioning erratically. The bad chips sitting on top of the keyboard. Follow me on Bluesky (@righto.com) or RSS for updates. (I'm no longer on Twitter.) Thanks to Mike Naberezny for providing the PET. Thanks to TubeTime, Mike Stewart, and especially CuriousMarc for help with the repairs. Some useful PET troubleshooting links are in the footnotes.7 Footnotes and references So why did I suddenly decide to restore a PET that had been sitting in my garage since 2017? Well, CNN was filming an interview with Bill Gates and they wanted background footage of the 1970s-era computers that ran the Microsoft BASIC that Bill Gates wrote. Spoiler: I didn't get my computer working in time for CNN, but Marc found some other computers. ↩ Converting a number to an ASCII string is somewhat complicated on the 6502. You can't quickly divide by 10 for the decimal conversion, since the processor doesn't have a divide instruction. Instead, the PET's conversion routine has hard-coded four-byte constants: -100000000, 10000000, -100000, 100000, -10000, 1000, -100, 10, and -1. The routine repeatedly adds the first constant (i.e. subtracting 100000000) until the result is negative. Then it repeatedly adds the second constant until the result is positive, and so forth. The number of steps gives each decimal digit (after adjustment). The same algorithm is used with the base-60 constants: -2160000, 216000, -36000, 3600, -600, and 60. This converts the uptime count into hours, minutes, and seconds for the TIME$ variable. (The PET's basic time count is the "jiffy", 1/60th of a second.) ↩ Technically, the address 865E is not part of screen memory, which is 1000 characters starting at address 0x8000. However, the PET's address uses some shortcuts in address decoding, so 865E ends up the same as 825e, referencing the 7th character of the 16th line. ↩ Here's the source code for my demo program, which I remembered from my teenage programming. It simply displays blocks (black, white, or gray) with 8-fold symmetry, writing directly to screen memory with POKE statements. (It turns out that almost anything looks good with 8-fold symmetry.) The cryptic heart in the first PRINT statement is the clear-screen character. My program to display some graphics. ↩ I suspected a problem with the 6502 processor because the logic analyzer showed that the 6502 read an instruction correctly but then accessed the wrong address. Eric provided a replacement 6502 chip but swapping the processor had no effect. However, reprogramming the ROM fixed both problems. Our theory is that the signal on the bus either had a timing problem or a voltage problem, causing the logic analyzer to show the correct value but the 6502 to read the wrong value. Probably the ROM had a weakly-programmed bit, causing the ROM's output for that bit to either be at an intermediate voltage or causing the output to take too long to settle to the correct voltage. The moral is that you can't always trust the logic analyzer if there are analog faults. ↩ The PETSCII graphics characters are now in Unicode in the Symbols for Legacy Computing block. ↩ The PET troubleshooting site was very helpful. The Commodore PET's Microsoft BASIC source code is here, mostly uncommented. I mapped many of the labels in the source code to the assembly code produced by Ghidra to understand the logic analyzer traces. The ROM images are here. Schematics of the PET are here. ↩↩

3 months ago • 57 votes

Notes on the Pentium's microcode circuitry

Most people think of machine instructions as the fundamental steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the processor's control circuitry from complex logic gates, the control logic is implemented with code known as microcode, stored in the microcode ROM. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In this post, I examine the microcode ROM in the original Pentium, looking at the low-level circuitry. The photo below shows the Pentium's thumbnail-sized silicon die under a microscope. I've labeled the main functional blocks. The microcode ROM is highlighted at the right. If you look closely, you can see that the microcode ROM consists of two rectangular banks, one above the other. This die photo of the Pentium shows the location of the microcode ROM. Click this image (or any other) for a larger version. The image below shows a closeup of the two microcode ROM banks. Each bank provides 45 bits of output; together they implement a micro-instruction that is 90 bits long. Each bank consists of a grid of transistors arranged into 288 rows and 720 columns. The microcode ROM holds 4608 micro-instructions, 414,720 bits in total. At this magnification, the ROM appears featureless, but it is covered with horizontal wires, each just 1.5 µm thick. The 90 output lines from the ROM, with a closeup of six lines exiting the ROM. The ROM's 90 output lines are collected into a bundle of wires between the banks, as shown above. The detail shows how six of the bits exit from the banks and join the bundle. This bundle exits the ROM to the left, travels to various parts of the chip, and controls the chip's circuitry. The output lines are in the chip's top metal layer (M3): the Pentium has three layers of metal wiring with M1 on the bottom, M2 in the middle, and M3 on top. The Pentium has a large number of bits in its micro-instruction, 90 bits compared to 21 bits in the 8086. Presumably, the Pentium has a "horizontal" microcode architecture, where the microcode bits correspond to low-level control signals, as opposed to "vertical" microcode, where the bits are encoded into denser micro-instructions. I don't have any information on the Pentium's encoding of microcode; unlike the 8086, the Pentium's patents don't provide any clues. The 8086's microcode ROM holds 512 micro-instructions, much less than the Pentium's 4608 micro-instructions. This makes sense, given the much greater complexity of the Pentium's instruction set, including the floating-point unit on the chip. The image below shows a closeup of the Pentium's microcode ROM. For this image, I removed the three layers of metal and the polysilicon layer to expose the chip's underlying silicon. The pattern of silicon doping is visible, showing the transistors and thus the data stored in the ROM. If you have enough time, you can extract the bits from the ROM by examining the silicon and seeing where transistors are present. A closeup of the ROM showing how bits are encoded in the layout of transistors. Before explaining the ROM's circuitry, I'll review how an NMOS transistor is constructed. A transistor can be considered a switch between the source and drain, controlled by the gate. The source and drain regions (green) consist of silicon doped with impurities to change its semiconductor properties, forming N+ silicon. (These regions are visible in the photo above.) The gate consists of a layer of polysilicon (red), separated from the silicon by a very thin insulating oxide layer. Whenever polysilicon crosses active silicon, a transistor is formed. Diagram showing the structure of an NMOS transistor. Bits are stored in the ROM through the pattern of transistors in the grid. The presence or absence of a transistor stores a 0 or 1 bit.1 The closeup below shows eight bits of the microcode ROM. There are four transistors present and four gaps where transistors are missing. Thus, this part of the ROM holds four 0 bits and four 1 bits. For the diagram below, I removed the three metal layers and the polysilicon to show the underlying silicon. I colored doped (active) silicon regions green, and drew in the horizontal polysilicon lines in red. As explained above, a transistor is created if polysilicon crosses doped silicon. Thus, the contents of the ROM are defined by the pattern of silicon regions, which creates the transistors. Eight bits of the microcode ROM, with four transistors present. The horizontal silicon lines are used as wiring to provide ground to the transistors, while the horizontal polysilicon lines select one of the rows in the ROM. The transistors in that row will turn on, pulling the associated output lines low. That is, the presence of a transistor in a row causes the output to be pulled low, while the absence of a transistor causes the output line to remain high. A schematic corresponding to the eight bits above. The diagram below shows the silicon, polysilicon, and bottom metal (M1) layers. I removed the metal from the left to reveal the silicon and polysilicon underneath, but the pattern of vertical metal lines continues there. As shown earlier, the silicon pattern forms transistors. Each horizontal metal line has a connection to ground through a metal line (not shown). The horizontal polysilicon lines select a row. When polysilicon lines cross doped silicon, the gate of a transistor is formed. Two transistors may share the drain, as in the transistor pair on the left. Diagram showing the silicon, polysilicon, and M1 layers. The vertical metal wires form the outputs. The circles are contacts between the metal wire and the silicon of a transistor.2 Short metal jumpers connect the polysilicon lines to the metal layer above, which will be described next. The image below shows the upper left corner of the ROM. The yellowish metal lines are the top metal layer (M3), while the reddish metal lines are the middle metal layer (M2). The thick yellowish M3 lines distribute ground to the ROM. Underneath the horizontal M3 line, a horizontal M2 line also distributes ground. The grids of black dots are numerous contacts between the M3 line and the M2 line, providing a low-resistance connection. The M2 line, in turn, connects to vertical M1 ground lines underneath—these wide vertical lines are faintly visible. These M1 lines connect to the silicon, as shown earlier, providing ground to each transistor. This illustrates the complexity of power distribution in the Pentium: the thick top metal (M3) is the primary distribution of +5 volts and ground through the chip, but power must be passed down through M2 and M1 to reach the transistors. The upper left corner of the ROM. The other important feature above is the horizontal metal lines, which help distribute the row-select signals. As shown earlier, horizontal polysilicon lines provide the row-select signals to the transistors. However, polysilicon is not as good a conductor as metal, so long polysilicon lines have too much resistance. The solution is to run metal lines in parallel, periodically connected to the underlying polysilicon lines and reducing the overall resistance. Since the vertical metal output lines are in the M1 layer, the horizontal row-select lines run in the M2 layer so they don't collide. Short "jumpers" in the M1 layer connect the M2 lines to the polysilicon lines. To summarize, each ROM bank contains a grid of transistors and transistor vacancies to define the bits of the ROM. The ROM is carefully designed so the different layers—silicon, polysilicon, M1, and M2—work together to maximize the ROM's performance and density. Microcode Address Register As the Pentium executes an instruction, it provides the address of each micro-instruction to the microcode ROM. The Pentium holds this address—the micro-address—in the Microcode Address Register (MAR). The MAR is a 13-bit register located above the microcode ROM. The diagram below shows the Microcode Address Register above the upper ROM bank. It consists of 13 bits; each bit has multiple latches to hold the value as well as any pushed subroutine micro-addresses. Between bits 7 and 8, some buffer circuitry amplifies the control signals that go to each bit's circuitry. At the right, drivers amplify the outputs from the MAR, sending the signals to the row drivers and column-select circuitry that I will discuss below. To the left of the MAR is a 32-bit register that is apparently unrelated to the microcode ROM, although I haven't determined its function. The Microcode Address Register is located above the upper ROM bank. The outputs from the Microcode Address Register select rows and columns in the microcode ROM, as I'll explain below. Bits 12 through 7 of the MAR select a block of 8 rows, while bits 6 through 4 select a row in this block. Bits 3 through 0 select one column out of each group of 16 columns to select an output bit. Thus, the microcode address controls what word is provided by the ROM. Several different operations can be performed on the Microcode Address Register. When executing a machine instruction, the MAR must be loaded with the address of the corresponding microcode routine. (I haven't determined how this address is generated.) As microcode is executed, the MAR is usually incremented to move to the next micro-instruction. However, the MAR can branch to a new micro-address as required. The MAR also supports microcode subroutine calls; it will push the current micro-address and jump to the new micro-address. At the end of the micro-subroutine, the micro-address is popped so execution returns to the previous location. The MAR supports three levels of subroutine calls, as it contains three registers to hold the stack of pushed micro-addresses. The MAR receives control signals and addresses from standard-cell logic located above the MAR. Strangely, in Intel's published floorplans for the Pentium, this standard-cell logic is labeled as part of the branch prediction logic, which is above it. However, carefully tracing the signals from the standard-cell logic shows that is connected to the Microcode Address Register, not the branch predictor. Row-select drivers As explained above, each ROM bank has 288 rows of transistors, with polysilicon lines to select one of the rows. To the right of the ROM is circuitry that activates one of these row-select lines, based on the micro-address. Each row matches a different 9-bit address. A straightforward implementation would use a 9-input AND gate for each row, matching a particular pattern of 9 address bits or their complements. However, this implementation would require 576 very large AND gates, so it is impractical. Instead, the Pentium uses an optimized implementation with one 6-input AND gate for each group of 8 rows. The remaining three address bits are decoded once at the top of the ROM. As a result, each row only needs one gate, detecting if its group of eight rows is selected and if the particular one of eight is selected. Simplified schematic of the row driver circuitry. The schematic above shows the circuitry for a group of eight rows, slightly simplified.3 At the top, three address bits are decoded, generating eight output lines with one active at a time. The remaining six address bits are inverted, providing the bit and its complement to the decoding circuitry. Thus, the 9 bits are converted into 20 signals that flow through the decoders, a large number of wires, but not unmanageable. Each group of eight rows has a 6-input AND gate that matches a particular 6-bit address, determined by which inputs are complemented and which are not.4 The NAND gate and inverter at the left combine the 3-bit decoding and the 6-bit decoding, activating the appropriate row. Since there are up to 720 transistors in each row, the row-select lines need to be driven with high current. Thus, the row-select drivers use large transistors, roughly 25 times the size of a regular transistor. To fit these transistors into the same vertical spacing as the rest of the decoding circuitry, a tricky packing is used. The drivers for each group of 8 rows are packed into a 3×3 grid, except the first column has two drivers (since there are 8 drivers in the group, not 9). To avoid a gap, the drivers in the first column are larger vertically and squashed horizontally. Output circuitry The schematic below shows the multiplexer circuit that selects one of 16 columns for a microcode output bit. The first stage has four 4-to-1 multiplexers. Next, another 4-to-1 multiplexer selects one of the outputs. Finally, a BiCMOS driver amplifies the output for transmission to the rest of the processor. The 16-to-1 multiplexer/output driver. In more detail, the ROM and the first multiplexer are essentially NMOS circuits, rather than CMOS. Specifically, the ROM's grid of transistors is constructed from NMOS transistors that can pull a column line low, but there are no PMOS transistors in the grid to pull the line high (since that would double the size of the ROM). Instead, the multiplexer includes precharge transistors to pull the lines high, presumably in the clock phase before the ROM is read. The capacitance of the lines will keep the line high unless it is pulled low by a transistor in the grid. One of the four transistors in the multiplexer is activated (by control signal a, b, c, or d) to select the desired line. The output goes to a "keeper" circuit, which keeps the output high unless it is pulled low. The keeper uses an inverter with a weak PMOS transistor that can only provide a small pull-up current. A stronger low input will overpower this transistor, switching the state of the keeper. The output of this multiplexer, along with the outputs of three other multiplexers, goes to the second-stage multiplexer,5 which selects one of its four inputs, based on control signals e, f, g, and h. The output of this multiplexer is held in a latch built from two inverters. The second latch has weak transistors so the latch can be easily forced into the desired state. The output from the first latch goes through a CMOS switch into a second latch, creating a flip-flop. The output from the second latch goes to a BiCMOS driver, which drives one of the 90 microcode output lines. Most processors are built from CMOS circuitry (i.e. NMOS and PMOS transistors), but the Pentium is built from BiCMOS circuitry: bipolar transistors as well as CMOS. At the time, bipolar transistors improved performance for high-current drivers; see my article on the Pentium's BiCMOS circuitry. The diagram below shows three bits of the microcode output. This circuitry is for the upper ROM bank; the circuitry is mirrored for the lower bank. The circuitry matches the schematic above. Each of the three blocks has 16 input lines from the ROM grid. Four 4-to-1 multiplexers reduce this to 4 lines, and the second multiplexer selects a single line. The result is latched and amplified by the output driver. (Note the large square shape of the bipolar transistors.) Next is the shift register that processes the microcode ROM outputs for testing. The shift register uses XOR logic for its feedback; unlike the rest of the circuitry, the XOR logic is irregular since only some bits are fed into XOR gates. Three bits of output from the microcode, I removed the three metal layers to show the polysilicon and silicon. Circuitry for testing Why does the microcode ROM have shift registers and XOR gates? The reason is that a chip such as the Pentium is very difficult to test: if one out of 3.1 million transistors goes bad, how do you detect it? For a simple processor like the 8086, you can run through the instruction set and be fairly confident that any problem would turn up. But with a complex chip, it is almost impossible to design an instruction sequence that would test every bit of the microcode ROM, every bit of the cache, and so forth. Starting with the 386, Intel added circuitry to the processor solely to make testing easier; about 2.7% of the transistors in the 386 were for testing. The Pentium has this testing circuitry for many ROMs and PLAs, including the division PLA that caused the infamous FDIV bug. To test a ROM inside the processor, Intel added circuitry to scan the entire ROM and checksum its contents. Specifically, a pseudo-random number generator runs through each address, while another circuit computes a checksum of the ROM output, forming a "signature" word. At the end, if the signature word has the right value, the ROM is almost certainly correct. But if there is even a single bit error, the checksum will be wrong and the chip will be rejected. The pseudo-random numbers and the checksum are both implemented with linear feedback shift registers (LFSR), a shift register along with a few XOR gates to feed the output back to the input. For more information on testing circuitry in the 386, see Design and Test of the 80386, written by Pat Gelsinger, who became Intel's CEO years later. Conclusions You'd think that implementing a ROM would be straightforward, but the Pentium's microcode ROM is surprisingly complex due to its optimized structure and its circuitry for testing. I haven't been able to determine much about how the microcode works, except that the micro-instruction is 90 bits wide and the ROM holds 4608 micro-instructions in total. But hopefully you've found this look at the circuitry interesting. Disclaimer: this should all be viewed as slightly speculative and there are probably some errors. I didn't want to prefix every statement with "I think that..." but you should pretend it is there. I plan to write more about the implementation of the Pentium, so follow me on Bluesky (@righto.com) or RSS for updates. Peter Bosch has done some reverse engineering of the Pentium II microcode; his information is here. Footnotes and references It is arbitrary if a transistor corresponds to a 0 bit or a 1 bit. A transistor will pull the output line low (i.e. a 0 bit), but the signal could be inverted before it is used. More analysis of the circuitry or ROM contents would clear this up. ↩ When looking at a ROM like this, the contact pattern seems like it should tell you the contents of the ROM. Unfortunately, this doesn't work. Since a contact can be attached to one or two transistors, the contact pattern doesn't give you enough information. You need to see the silicon to determine the transistor pattern and thus the bits. ↩ I simplified the row driver schematic. The most interesting difference is that the NAND gates are optimized to use three transistors each, instead of four transistors. The trick is that one of the NMOS transistors is essentially shared across the group of 8 drivers; an inverter drives the low side of all eight gates. The second simplification is that the 6-input AND gate is implemented with two 3-input NAND gates and a NOR gate for electrical reasons. Also, the decoder that converts 3 bits into 8 select lines is located between the banks, at the right, not at the top of the ROM as I showed in the schematic. Likewise, the inverters for the 6 row-select bits are not at the top. Instead, there are 6 inverters and 6 buffers arranged in a column to the right of the ROM, which works better for the layout. These are BiCMOS drivers so they can provide the high-current outputs necessary for the long wires and numerous transistor gates that they must drive. ↩ The inputs to the 6-input AND gate are arranged in a binary counting pattern, selecting each row in sequence. This binary arrangment is standard for a ROM's decoder circuitry and is a good way to recognize a ROM on a die. The Pentium has 36 row decoders, rather than the 64 that you'd expect from a 6-bit input. The ROM was made to the size necessary, rather than a full power of two. In most ROMs, it's difficult to determine if the ROM is addressed bottom-to-top or top-to-bottom. However, because the microcode ROM's counting pattern is truncated, one can see that the top bank starts with 0 at the top and counts downward, while the bottom bank is reversed, starting with 0 at the bottom and counting upward. ↩ A note to anyone trying to read the ROM contents: it appears that the order of entries in a group of 16 is inconsistent, so a straightforward attempt to visually read the ROM will end up with scrambled data. That is, some of the groups are reversed. I don't see any obvious pattern in which groups are reversed. A closeup of the first stage output mux. This image shows the M1 metal layer. In the diagram above, look at the contacts from the select lines, connecting the select lines to the mux transistors. The contacts on the left are the mirror image of the contacts on the right, so the columns will be accessed in the opposite order. This mirroring pattern isn't consistent, though; sometimes neighboring groups are mirrored and sometimes they aren't. I don't know why the circuitry has this layout. Sometimes mirroring adjacent groups makes the layout more efficient, but the inconsistent mirroring argues against this. Maybe an automated layout system decided this was the best way. Or maybe Intel did this to provide a bit of obfuscation against reverse engineering. ↩

3 months ago • 41 votes

A USB interface to the "Mother of All Demos" keyset

In the early 1960s, Douglas Engelbart started investigating how computers could augment human intelligence: "If, in your office, you as an intellectual worker were supplied with a computer display backed up by a computer that was alive for you all day and was instantly responsive to every action you had, how much value could you derive from that?" Engelbart developed many features of modern computing that we now take for granted: the mouse,1 hypertext, shared documents, windows, and a graphical user interface. At the 1968 Joint Computer Conference, Engelbart demonstrated these innovations in a groundbreaking presentation, now known as "The Mother of All Demos." The Mother of All Demos.](keyset-video2.jpg "w500") --> The keyset with my prototype USB interface. Engelbart's demo also featured an input device known as the keyset, but unlike his other innovations, the keyset failed to catch on. The 5-finger keyset lets you type without moving your hand, entering characters by pressing multiple keys simultaneously as a chord. Christina Englebart, his daughter, loaned one of Engelbart's keysets to me. I constructed an interface to connect the keyset to USB, so that it can be used with a modern computer. The video below shows me typing with the keyset, using the mouse buttons to select upper case and special characters.2 I wrote this blog post to describe my USB keyset interface. Along the way, however, I got sidetracked by the history of The Mother of All Demos and how it obtained that name. It turns out that Engelbart's demo isn't the first demo to be called "The Mother of All Demos". Engelbart and The Mother of All Demos Engelbart's work has its roots in Vannevar Bush's 1945 visionary essay, "As We May Think." Bush envisioned thinking machines, along with the "memex", a compact machine holding a library of collective knowledge with hypertext-style links: "The Encyclopedia Britannica could be reduced to the volume of a matchbox." The memex could search out information based on associative search, building up a hypertext-like trail of connections. In the early 1960s, Engelbart was inspired by Bush's essay and set out to develop means to augment human intellect: "increasing the capability of a man to approach a complex problem situation, to gain comprehension to suit his particular needs, and to derive solutions to problems."3 Engelbart founded the Augmentation Research Center at the Stanford Research Institute (now SRI), where he and his team created a system called NLS (oN-Line System). Engelbart editing a hierarchical shopping list. In 1968, Engelbart demonstrated NLS to a crowd of two thousand people at the Fall Joint Computer Conference. Engelbart gave the demo from the stage, wearing a crisp shirt and tie and a headset microphone. Engelbart created hierarchical documents, such as the shopping list above, and moved around them with hyperlinks. He demonstrated how text could be created, moved, and edited with the keyset and mouse. Other documents included graphics, crude line drawing by today's standards but cutting-edge for the time. The computer's output was projected onto a giant screen, along with video of Engelbart. Engelbart using the keyset to edit text. Note that the display doesn't support lowercase text; instead, uppercase is indicated by a line above the character. Adapted from The Mother of All Demos. Engelbart sat at a specially-designed Herman Miller desk6 that held the keyset, keyboard, and mouse, shown above. While Engelbart was on stage in San Francisco, the SDS 9404 computer that ran the NLS software was 30 miles to the south in Menlo Park.5 To the modern eye, the demo resembles a PowerPoint presentation over Zoom, as Engelbart collaborated with Jeff Rulifson and Bill Paxton, miles away in Menlo Park. (Just like a modern Zoom call, the remote connection started with "We're not hearing you. How about now?") Jeff Rulifson browsed the NLS code, jumping between code files with hyperlinks and expanding subroutines by clicking on them. NLS was written in custom high-level languages, which they developed with a "compiler compiler" called TREE-META. The NLS system held interactive documentation as well as tracking bugs and changes. Bill Paxton interactively drew a diagram and then demonstrated how NLS could be used as a database, retrieving information by searching on keywords. (Although Engelbart was stressed by the live demo, Paxton told me that he was "too young and inexperienced to be concerned.") Bill Paxton, in Menlo Park, communicating with the conference in San Francisco. Bill English, an electrical engineer, not only built the first mouse for Engelbart but was also the hardware mastermind behind the demo. In San Francisco, the screen images were projected on a 20-foot screen by a Volkswagen-sized Eiodophor projector, bouncing light off a modulated oil film. Numerous cameras, video switchers and mixers created the video image. Two leased microwave links and half a dozen antennas connected SRI in Menlo Park to the demo in San Francisco. High-speed modems send the mouse, keyset, and keyboard signals from the demo back to SRI. Bill English spent months assembling the hardware and network for the demo and then managed the demo behind the scenes, assisted by a team of about 17 people. Another participant was the famed counterculturist Stewart Brand, known for the Whole Earth Catalog and the WELL, one of the oldest online virtual communities. Brand advised Engelbart on the presentation, as well as running a camera. He'd often point the camera at a monitor to generate swirling psychedelic feedback patterns, reminiscent of the LSD that he and Engelbart had experimented with. The demo received press attention such as a San Francisco Chronicle article titled "Fantastic World of Tomorrow's Computer". It stated, "The most fantastic glimpse into the computer future was taking place in a windowless room on the third floor of the Civic Auditorium" where Engelbart "made a computer in Menlo Park do secretarial work for him that ten efficient secretaries couldn't do in twice the time." His goal: "We hope to help man do better what he does—perhaps by as much as 50 per cent." However, the demo received little attention in the following decades.7 Engelbart continued his work at SRI for almost a decade, but as Engelbart commented with frustration, “There was a slightly less than universal perception of our value at SRI”.8 In 1977, SRI sold the Augmentation Research Center to Tymshare, a time-sharing computing company. (Timesharing was the cloud computing of the 1970s and 1980s, where companies would use time on a centralized computer.) At Tymshare, Engelbart's system was renamed AUGMENT and marketed as an office automation service, but Engelbart himself was sidelined from development, a situation that he described as sitting in a corner and becoming invisible. Meanwhile, Bill English and some other SRI researchers9 migrated four miles south to Xerox PARC and worked on the Xerox Alto computer. The Xerox Alto incorporated many ideas from the Augmentation Research Center including the graphical user interface, the mouse, and the keyset. The Alto's keyset was almost identical to the Engelbart keyset, as can be seen in the photo below. The Alto's keyset was most popular for the networked 3D shooter game "Maze War", with the clicking of keysets echoing through the hallways of Xerox PARC. A Xerox Alto with a keyset on the left. Xerox famously failed to commercialize the ideas from the Xerox Alto, but Steve Jobs recognized the importance of interactivity, the graphical user interface, and the mouse when he visited Xerox PARC in 1979. Steve Jobs provided the Apple Lisa and Macintosh ended up with a graphical user interface and the mouse (streamlined to one button instead of three), but he left the keyset behind.10 When McDonnell Douglas acquired Tymshare in 1984, Engelbart and his software—now called Augment—had a new home.11 In 1987, McDonnell Douglas released a text editor and outline processor for the IBM PC called MiniBASE, one of the few PC applications that supported a keyset. The functionality of MiniBASE was almost identical to Engelbart's 1968 demo, but in 1987, MiniBASE was competing against GUI-based word processors such as MacWrite and Microsoft Word, so MiniBASE had little impact. Engelbart left McDonnell Douglas in 1988, forming a research foundation called the Bootstrap Institute to continue his research independently. The name: "The Mother of All Demos" The name "The Mother of All Demos" has its roots in the Gulf War. In August 1990, Iraq invaded Kuwait, leading to war between Iraq and a coalition of the United States and 41 other countries. During the months of buildup prior to active conflict, Iraq's leader, Saddam Hussein, exhorted the Iraqi people to prepare for "the mother of all battles",12 a phrase that caught the attention of the media. The battle didn't proceed as Hussein hoped: during exactly 100 hours of ground combat, the US-led coalition liberated Kuwait, pushed into Iraq, crushed the Iraqi forces, and declared a ceasefire.13 Hussein's mother of all battles became the mother of all surrenders. The phrase "mother of all ..." became the 1990s equivalent of a meme, used as a slightly-ironic superlative. It was applied to everything from The Mother of All Traffic Jams to The Mother of All Windows Books, from The Mother of All Butter Cookies to Apple calling mobile devices The Mother of All Markets.14 In 1991, this superlative was applied to a computer demo, but it wasn't Engelbart's demo. Andy Grove, Intel's president, gave a keynote speech at Comdex 1991 entitled The Second Decade: Computer-Supported Collaboration, a live demonstration of his vision for PC-based video conferencing and wireless communication in the PC's second decade. This complex hour-long demo required almost six months to prepare, with 15 companies collaborating. Intel called this demo "The Mother of All Demos", a name repeated in the New York Times, San Francisco Chronicle, Fortune, and PC Week.15 Andy Grove's demo was a hit, with over 20,000 people requesting a video tape, but the demo was soon forgotten. On the eve of Comdex, the New York Times wrote about Intel's "Mother of All Demos". Oct 21, 1991, D1-D2. In 1994, Wired writer Steven Levy wrote Insanely Great: The Life and Times of Macintosh, the Computer that Changed Everything.8 In the second chapter of this comprehensive book, Levy explained how Vannevar Bush and Doug Engelbart "sparked a chain reaction" that led to the Macintosh. The chapter described Engelbart's 1968 demo in detail including a throwaway line saying, "It was the mother of all demos."16 Based on my research, I think this is the source of the name "The Mother of All Demos" for Engelbart's demo. By the end of the century, multiple publications echoed Levy's catchy phrase. In February 1999, the San Jose Mercury News had a special article on Engelbart, saying that the demonstration was "still called 'the mother of all demos'", a description echoed by the industry publication Computerworld.17 The book Nerds: A Brief History of the Internet stated that the demo "has entered legend as 'the mother of all demos'". By this point, Engelbart's fame for the "mother of all demos" was cemented and the phrase became near-obligatory when writing about him. The classic Silicon Valley history Fire in the Valley (1984), for example, didn't even mention Engelbart but in the second edition (2000), "The Mother of All Demos" had its own chapter. Interfacing the keyset to USB Getting back to the keyset interface, the keyset consists of five microswitches, triggered by the five levers. The switches are wired to a standard DB-25 connector. I used a Teensy 3.6 microcontroller board for the interface, since this board can act both as a USB device and as a USB host. As a USB device, the Teensy can emulate a standard USB keyboard. As a USB host, the Teensy can receive input from a standard USB mouse. Connecting the keyset to the Teensy is (almost) straightforward, wiring the switches to five data inputs on the Teensy and the common line connected to ground. The Teensy's input lines can be configured with pullup resistors inside the microcontroller. The result is that a data line shows 1 by default and 0 when the corresponding key is pressed. One complication is that the keyset apparently has a 1.5 KΩ between the leftmost button and ground, maybe to indicate that the device is plugged in. This resistor caused that line to always appear low to the Teensy. To counteract this and allow the Teensy to read the pin, I connected a 1 KΩ pullup resistor to that one line. The interface code Reading the keyset and sending characters over USB is mostly straightforward, but there are a few complications. First, it's unlikely that the user will press multiple keyset buttons at exactly the same time. Moreover, the button contacts may bounce. To deal with this, I wait until the buttons have a stable value for 100 ms (a semi-arbitrary delay) before sending a key over USB. The second complication is that with five keys, the keyset only supports 32 characters. To obtain upper case, numbers, special characters, and control characters, the keyset is designed to be used in conjunction with mouse buttons. Thus, the interface needs to act as a USB host, so I can plug in a USB mouse to the interface. If I want the mouse to be usable as a mouse, not just buttons in conjunction with the keyset, the interface mus forward mouse events over USB. But it's not that easy, since mouse clicks in conjunction with the keyset shouldn't be forwarded. Otherwise, unwanted clicks will happen while using the keyset. To emulate a keyboard, the code uses the Keyboard library. This library provides an API to send characters to the destination computer. Inconveniently, the simplest method, print(), supports only regular characters, not special characters like ENTER or BACKSPACE. For those, I needed to use the lower-level press() and release() methods. To read the mouse buttons, the code uses the USBHost_t36 library, the Teensy version of the USB Host library. Finally, to pass mouse motion through to the destination computer, I use the Mouse library. Conclusions Engelbart claimed that learning a keyset wasn't difficult—a six-year-old kid could learn it in less than a week—but I'm not willing to invest much time into learning it. In my brief use of the keyset, I found it very difficult to use physically. Pressing four keys at once is difficult, with the worst being all fingers except the ring finger. Combining this with a mouse button or two at the same time gave me the feeling that I was sight-reading a difficult piano piece. Maybe it becomes easier with use, but I noticed that Alto programs tended to treat the keyset as function keys, rather than a mechanism for typing with chords.18 David Liddle of Xerox PARC said, "We found that [the keyset] was tending to slow people down, once you got away from really hot [stuff] system programmers. It wasn't quite so good if you were giving it to other engineers, let alone clerical people and so on." If anyone else has a keyset that they want to connect via USB (unlikely as it may be), my code is on github.19 Thanks to Christina Engelbart for loaning me the keyset. Thanks to Bill Paxton for answering my questions. Follow me on Bluesky (@righto.com) or RSS for updates. Footnotes and references Engelbart's use of the mouse wasn't arbitrary, but based on research. In 1966, shortly after inventing the mouse, Engelbart carried out a NASA-sponsored study that evaluated six input devices: two types of joysticks, a Graphacon positioner, the mouse, a light pen, and a control operated by the knees (leaving the hands free). The mouse, knee control, and light pen performed best, with users finding the mouse satisfying to use. Although inexperienced subjects had some trouble with the mouse, experienced subjects considered it the best device. A joystick, Graphacon, mouse, knee control, and light pen were examined as input devices. Photos from the study. ↩ The information sheet below from the Augmentation Research Center shows what keyset chords correspond to each character. I used this encoding for my interface software. Each column corresponds to a different combination of mouse buttons. The information sheet for the keyset specifies how to obtain each character. The special characters above are <CD> (Command Delete, i.e. cancel a partially-entered command), <BC> (Backspace Character), <OK> (confirm command), <BW>(Backspace Word), <RC> (Replace Character), <ESC> (which does filename completion). NLS and the Augment software have the concept of a viewspec, a view specification that controls the view of a file. For instance, viewspecs can expand or collapse an outline to show more or less detail, filter the content, or show authorship of sections. The keyset can select viewspecs, as shown below. Back of the keyset information sheet. Viewsets are explained in more detail in The Mother of All Demos. For my keyset interface, I ignored viewspecs since I don't have software to use these inputs, but it would be easy to modify the code to output the desired viewspec characters. ↩ See Augmenting Human Intellect: A Conceptual Framework, Engelbart's 1962 report. ↩ Engelbart used an SDS 940 computer running the Berkeley Timesharing System. The computer had 64K words of core memory, with 4.5 MB of drum storage for swapping and 96 MB of disk storage for files. For displays, the computer drove twelve 5" high-resolution CRTs, but these weren't viewed directly. Instead, each CRT had a video camera pointed at it and the video was redisplayed on a larger display in a work station in each office. The SDS 940 was a large 24-bit scientific computer, built by Scientific Data Systems. Although SDS built the first integrated-circuit-based commercial computer in 1965 (the SDS 92), the SDS 940 was a transistorized system. It consisted of multiple refrigerator-sized cabinets, as shown below. Since each memory cabinet held 16K words and the computer at SRI had 64K, SRI's computer had two additional cabinets of memory. Front view of an SDS 940 computer. From the Theory of Operation manual. In the late 1960s, Xerox wanted to get into the computer industry, so Xerox bought Scientific Data Systems in 1969 for $900 million (about $8 billion in current dollars). The acquisition was a disaster. After steadily losing money, Xerox decided to exit the mainframe computer business in 1975. Xerox's CEO summed up the purchase: "With hindsight, we would not have done the same thing." ↩ The Mother of All Demos is on YouTube, as well as a five-minute summary for the impatient. ↩ The desk for the keyset and mouse was designed by Herman Miller, the office furniture company. Herman Miller worked with SRI to design the desks and office walls as part of their plans for the office of the future. Herman Miller invented the cubicle office in 1964, creating a modern replacement for the commonly used open office arrangement. ↩ Engelbart's demo is famous now, but for many years it was ignored. For instance, Electronic Design had a long article on Engelbart's work in 1969 (putting the system on the cover), but there was no mention of the demo. Engelbart's system was featured on the cover of Electronic Design. Feb 1, 1969. (slightly retouched) But by the 1980s, the Engelbart demo started getting attention. The 1986 documentary Silicon Valley Boomtown had a long section on Engelbart's work and the demo. By 1988, the New York Times was referring to the demo as legendary. ↩ Levy had written about Engelbart a decade earlier, in the May 1984 issue of the magazine Popular Computing. The article focused on the mouse, recently available to the public through the Apple Lisa and the IBM PC (as an option). The big issue at the time was how many buttons a mouse should have: three like Engelbart's mouse, the one button that Apple used, or two buttons as Bill Gates preferred. But Engelbart's larger vision also came through in Levy's interview along with his frustration that most of his research had been ignored, overshadowed by the mouse. Notably, there was no mention of Engelbart's 1968 demo in the article. ↩↩ The SRI researchers who moved to Xerox include Bill English, Charles Irby, Jeff Rulifson, Bill Duval, and Bill Paxton (details). ↩ In 2023, Xerox donated the entire Xerox PARC research center to SRI. The research center remained in Palo Alto but became part of SRI. In a sense, this closed the circle, since many of the people and ideas from SRI had gone to PARC in the 1970s. However, both PARC and SRI had changed radically since the 1970s, with the cutting edge of computer research moving elsewhere. ↩ For a detailed discussion of the Augment system, see Tymshare's Augment: Heralding a New Era, Oct 1978. Augment provided a "broad range of information handling capability" that was not available elsewhere. Unlike other word processing systems, Augment was targeted at the professional, not clerical workers, people who were "eager to explore the open-ended possibilities" of the interactive process. The main complaints about Augment were its price and that it was not easy to use. Accessing Engelbart's NLS system over ARPANET cost an eye-watering $48,000 a year (over $300,000 a year in current dollars). Tymshare's Augment service was cheaper (about $80 an hour in current dollars), but still much more expensive than a standard word processing service. Overall, the article found that Augment users were delighted with the system: "It is stimulating to belong to the electronic intelligentsia." Users found it to be "a way of life—an absorbing, enriching experience". ↩ William Safire provided background in the New York Times, explaining that "the mother of all battles" originally referred to the battle of Qadisiya in A.D. 636, and Saddam Hussein was referencing that ancient battle. A translator responded, however, that the Arabic expression would be better translated as "the great battle" than "the mother of all battles." ↩ The end of the Gulf War left Saddam Hussein in control of Iraq and left thousands of US troops in Saudi Arabia. These factors would turn out to be catastrophic in the following years. ↩ At the Mobile '92 conference, Apple's CEO, John Sculley, said personal communicators could be "the mother of all markets," while Andy Grove of Intel said that the idea of a wireless personal communicator in every pocket is "a pipe dream driven by greed" (link). In hindsight, Sculley was completely right and Grove was completely wrong. ↩ Some references to Intel's "Mother of all demos" are Computer Industry Gathers Amid Chaos, New York Times, Oct 21, 1991 and "Intel's High-Tech Vision of the Future: Chipmaker proposes using computers to dramatically improve productivity", San Francisco Chronicle, Oct 21, 1991, p24. The title of an article in Microprocessor Report, "Intel Declares Victory in the Mother of All Demos" (Nov. 20, 1991), alluded to the recently-ended war. Fortune wrote about Intel's demo in the Feb 17, 1997 issue. A longer description of Intel's demo is in the book Strategy is Destiny. ↩ Several sources claim that Andy van Dam was the first to call Engelbart's demo "The Mother of All Demos." Although van Dam attended the 1968 demo, I couldn't find any evidence that he coined the phrase. John Markoff, a technology journalist for The New York Times, wrote a book What the Dormouse Said: How the Sixties Counterculture Shaped the Personal Computer Industry. In this book, Markoff wrote about Engelbart's demo, saying "Years later, his talk remained 'the mother of all demos' in the words of Andries van Dam, a Brown University computer scientist." As far as I can tell, van Dam used the phrase but only after it had already been popularized by Levy. ↩ It's curious to write that the demonstration was still called the "mother of all demos" when the phrase was just a few years old. ↩ The photo below shows a keyset from the Xerox Alto. The five keys are labeled with separate functions—Copy, Undelete, Move, Draw, and Fine— for use with ALE, a program for IC design. ALE supported keyset chording in combination with the mouse. ↩ Keyset from a Xerox Alto, courtesy of Digibarn. After I implemented this interface, I came across a project that constructed a 3D-printed chording keyset, also using a Teensy for the USB interface. You can find that project here. ↩

3 months ago • 38 votes

More in technology

PSA: part of your Kagi subscription fee goes to a Russian company (Yandex)

Today I learned that Kagi uses Yandex as part of its search infrastructure, making up about 2% of their costs, and their CEO has confirmed that they do not plan to change that. To quote: Yandex represents about 2% of our total costs and is only one of dozens of sources we use. To put this in perspective: removing any single source would degrade search quality for all users while having minimal economic impact on any particular region. The world doesn’t need another politicized search engine. It needs one that works exceptionally well, regardless of the political climate. That’s what we’re building. That is unfortunate, as I found Kagi to be a good product with an interesting take on utilizing LLM models with search that is kind of useful, but I cannot in good heart continue to support it while they unapologetically finance a major company that has ties to the Russian government, the same country that is actively waging a war against Ukraine, an European country, for over 11 years, during which they’ve committed countless war crimes against civilians and military personnel. Kagi has the freedom to decide how they build the best search engine, and I have the freedom to use something else. Please send all your whataboutisms to /dev/null.

2 days ago • 3 votes

Alvik Fight Club: A creative twist on coding, competition, and collaboration

What happens when you hand an educational robot to a group of developers and ask them to build something fun? At Arduino, you get a multiplayer robot showdown that’s part battle, part programming lesson, and entirely Alvik. The idea for Alvik Fight Club first came to life during one of our internal Make Tanks, in […] The post Alvik Fight Club: A creative twist on coding, competition, and collaboration appeared first on Arduino Blog.

4 days ago • 6 votes

Vote for the July 2025 + Post Topic

Past ads get a second chance.

5 days ago • 9 votes

How a Hibernate deprecation log message made our Java backend service super slow

It was time to upgrade Hibernate on that one Java monolithic1 backend service that my team was responsible for. We took great precautions with these types of changes due to the scale of the system, splitting changes into as many small parts as possible and releasing them as often as possible. With bigger changes we opted for running a few instances of the new version in parallel to the existing one. Then came Hibernate 5.2. Hibernate 5.2 introduced a new warning log to indicate that the existing API for writing queries is deprecated. Hibernate's legacy org.hibernate.Criteria API is deprecated; use the JPA javax.persistence.criteria.CriteriaQuery instead Every time you used the Criteria API it would print the line. Just one little issue there. Can you see it? Every time you used the Criteria API it would print the line. In a poorly written Java backend service, one HTTP request can make multiple queries to the database. With hundreds of millions of HTTP requests, this can easily balloon to billions of additional logs a day. Well, that’s exactly what happened to our service, resulting in the CPU usage jumping up considerably and the latency of the service being negatively impacted. We didn’t have the foresight to compare every metric against every instance of the service, and when the metrics were summarized across all instances, this increase was not that noticeable while both new and existing instances of the service were running. Aside from the service itself, this had negative effects downstream as well. If you have a solution for collecting your service logs for analysis and retention, and it’s priced on the amount of logs that you print out, then this can end up being a very costly issue for you. We resolved the issue by making a configuration change to our logger that disabled these specific logs. This does make me wonder who else may have been impacted by this change over the years and what that impact might’ve looked like regarding the resource usage on a world-wide scale. I’m not blaming the Hibernate developers, they had good intentions, but the impact of an innocent change like that was likely not taken into account for large-scale services. Last I heard, the people behind Hibernate are a very small team, and yet their software powers much of the world, including critical infrastructure like the banking system. I’m well aware that we’re talking about Hibernate releases that were released around the time I was still a junior developer (2016-2018). Some call it technical debt, others call it over half a decade of neglect. unmaintaned monoliths suck, but so do unmaintained microservices. ↩︎

5 days ago • 12 votes

The History of Windows XP

NT Vincit Omnia

6 days ago • 12 votes

New here?