Full Width [alt+shift+f] Shortcuts [alt+shift+k]
Sign Up [alt+shift+s] Log In [alt+shift+l]
5
Introduction The TDS600 Series The Acquisition Board Measuring Along the Signal Path A Closer Look at the Noise Issue Conclusion Introduction I have a Tektronix TDS 684B oscilloscope that I bought cheaply at an auction. It has 4 channels, 1 GHz of BW and a sample rate of 5 Gsps. Those are respectable numbers even by today’s standards. It’s also the main reason why I have it: compared to modern oscilloscopes, the other features aren’t nearly as impressive. It can only record 15k samples per channel at a time, for example. But at least the sample rate doesn’t go down when you increase the number of recording channels: it’s 5 Gsps even all 4 channels are enabled. I’ve always wondered how Tektronix managed to reach such high specifications back in the nineties, so in this blog post I take a quick look at the internals, figure out how it works, and do some measurements along the signal path. The TDS600 Series The first oscilloscopes of the TDS600 series were introduced around 1993....
a week ago

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from Electronics etc…

Brightness and Contrast Adjustment of Tektronix TDS 500/600/700 Oscilloscopes

Introduction Finding the Display Tuning Potentiometers The Result Hardcopy Preview Mode Introduction Less than a week after finishing my TDS 684B analog memory blog post, a TDS 684C landed on my lab bench with a very dim CRT. If you follow the lives the 3-digit TDS oscilloscope series, you probably know that this is normally a bit of death sentence of the CRT: after years of use, the cathode ray loses its strength and there’s nothing you can do about it other than replace the CRT with an LCD screen. I was totally ready to go that route, and if I ever need to do it, here are 3 possible LCD upgrade options that I list for later reference: The most common one is to buy a $350 Newscope-T1 LCD display kit by SimmConn Labs. A cheaper hobbyist alternative is to hack something together with a VGA to LVDS interface board and some generic LCD panel, as described in this build report. He uses a VGA LCD Controller Board KYV-N2 V2 with a 7” A070SN02 LCD panel. As I write this, the cost is $75, but I assume this used to be a lot cheaper before tariffs were in place. If you really want to go hard-core, you could make your own interface board with an FPGA that snoops the RAMDAC digital signals and converts them to LVDS, just like the Newscope-T1. There is a whole thread about this the EEVblog forum. But this blog post is not about installing an LCD panel! Before going that route, you should try to increase the brightness of the CRT by turning a potentiometer on the display board. It sounds like an obvious thing to try, but didn’t a lot of reference to online. And in my case, it just worked. Finding the Display Tuning Potentiometers In the Display Assembly Adjustment section of chapter 5 of the TDS 500D, TDS 600C, TDS 700D and TDS 714L Service Manual, page 5-23, you’ll find the instructions on how to change rotation, brightness and contrast. It says to remove the cabinet and then turn some potentiometer, but I just couldn’t find them! They’re supposed to be next to the fan. Somewhere around there: Well, I couldn’t see any. It’s only the next day, when I was ready to take the whole thing apart that I noticed these dust covered holes: A few minutes and a vaccum cleaning operation later reveals 5 glorious potentiometers: From left to right: horizontal position rotation vertical position brightness contrast Rotate the last 2 at will and if you’re lucky, your dim CRT will look brand new again. It did for me! The Result The weird colors in the picture above is a photography artifact that’s caused by Tektronix NuColor display technology: it uses a monochrome CRT with an R/G/B shutter in front of it. You can read more about it in this Hackaday article. In real life, the image looks perfectly fine! Hardcopy Preview Mode If dialing up the brightness doesn’t work and you don’t want to spend money on an LCD upgrade, there is the option of switching the display to Hardcopy mode, like this: [Display] -> [Settings <Color>] -> [Palette] -> [Hardcopy preview] Instead of a black, you will now get a white background. It made the scope usable before I made the brightness adjustment.

3 days ago 4 votes
Rohde &amp; Schwarz AMIQ Modulation Generator - Teardown and Analog Deep Dive

Introduction The Rohde & Schwarz AMIQ Modulation Generator WinIQSim Software Inside the AMIQ The Signal Generation PCB Analog Signal Generation Architecture: Fixed vs Variable DAC Clock Internal Reference Clock Generation DAC Clock Synthesizer I/Q Output Skew Tuning Variable Gain Amplifier Internal Diagnostics Efficient Distribution of Configuration Signals Conclusion References Introduction Every few months, a local company auctions off all kinds of lab, production and test equipment. I shouldn’t be subscribed to their email list but I am, and that’s one way I end up with more stuff that I don’t really need. During a recent auction, I got my hands on a Rohde & Schwarz AMIQ, an I/Q modulation generator, for a grand total of $45. Add to that another 30% for the auction fee and taxes and you’re still paying much less than what others would pay for a round of golf? But instead of one morning of fun, this thing has the potential to keep me busy for many weekends, so what a deal! (Click to enlarge) A few days after “winning” the auction, I drove to a dark dungeon of a warehouse in San Jose to pick up the loot. The AMIQ has a power on/off button and 3 LEDs and that’s in terms of user interface. There are no dials, there’s no display. So without any other options, I simply powered it up and I was immediately greeted by the ominous clicking of a hard drive. I was right: this thing would keep me entertained for at least a little bit! (Click to enlarge) It took a significant amount of effort to restore the machine back to its working state. I’ll write about that in future blog posts, but let’s start with an overview of the functionality and a teardown of the R&S AMIQ and then deep dive into some of its analog circuits. AMIQ prices on eBay vary wildly, from $129 to $2600 at the time of writing this. Even if you get one of the higher priced ones, you should expect to get a unit that’s close to failing due to leaking capacitors and a flaky harddrive! The Rohde & Schwarz AMIQ Modulation Generator Reduced to an elevator sales pitch, the AMIQ is a 2-channel arbitrary waveform generator (AWG) with a deep sample buffer. That’s it! It has a streaming buffer that feeds samples to 2 14-bit DACs at a sample rate of up to 105MHz. Two output channels I and Q will typically contain quadrature modulation signals that are sent to an RF vector signal generator such as a Rohde & Schwarz SMIQ for the actual high-frequency modulation. In a typical setup, the AMIQ is used to generate the baseband modulated signal and the SMIQ shifts the baseband signal to an RF frequency. Since the AMIQ has no user interface, the waveform data must provided by an external device. This could be a PC that runs R&S WinIQSim software or even the SMIQ itself because it has the ability to control an AMIQ. You can also create your own waveforms and upload them via floppy disk, GPIB or an RS-232 interface using SCPI control commands. Figure 4-1 of the AMIQ Operating Manual has a simplified block diagram. It is pretty straightforward and somewhat similar to the one of my HP 33120A function generator: (Click to enlarge) On the left are 2 blocks that are shared: a clock synthesizer waveform memory And then for each channel: 14-bit D/A converter analog filters output section with amplifier/attenuator differential analog output driver (AMIQ-B2 option) The major blocks are surrounded by a large amount of DACs that are used to control everything from the tuning input of local 10MHz clock oscillator, gain and offset of output signals, clock skew between I and Q signal and much more. You could do some of this with a modern SDR setup, but the specifications of the AMIQ units are dialed up a notch. Both channels are completely symmetrical to avoid modulation errors at the source. If there’s a need to compensate for small delay differences in, for example, the external cables, you can compensate for that by changing the skew between clocks of the output DACs with a precision of 10ps. Similarly, the DAC sample frequency can be programmed with 32-bit precision. (Click to enlarge) In addition to the main I/Q outputs at the front, there are a bunch of secondary input and output signals: 10MHz reference input and output sample clock trigger input marker output external filter loopback output and input bit error rate measurement (BER) connector (AMIQ-B1 option) parallel interface with the digital value of the samples that are sent to the DAC (front panel, AMIQ-B3 option) (Click to enlarge) Above those speciality inputs and outputs are the obligatory GPIB interface and a bunch of generic connectors that look suspiciously like the ones you’d find on an early century PC: PS/2 keyboard and mouse Parallel port RS232 USB There are 3 different AMIQ versions: 1110.2003.02: 4M samples 1110.2003.03: 4M samples 1110.2003.04: 16M samples Mine is an AMIQ-04. WinIQSim Software While it is possible to control the AMIQ over RS-232 or GPIB with your own software, you’d have a hard time matching the features of WinIQSim, a R&S Windows application that supports the AMIQ. (Click to enlarge) With WinIQSim, you can select one of the popular communication protocols from the late nineties and early 2000s, fill in digital framing data, apply all kinds of distortions and interferences, compute the I and Q waveforms and send it to the AMIQ. Some of the supported formats include CDMA2000, 802.11a WLAN, TD-SCDMA and more. But you don’t have to use these official protocols, WinIQSim supports any kind of FSK, QPSK, QAM or other common modulation method. You need a license for some of the communication protocols. The license is linked to the AMIQ device, not the PC, but the license check is pretty naive, and while I haven’t tried it… yet, the EEVblog forum has discussions about how to enable features yourself. My device only came with a license for IS-95 CDMA. Inside the AMIQ It’s trivial to open up an AMIQ: after removing the 4 feet in the back with a regular Philips screwdriver, you can simply slide off the outer case. It has 2 major subsystems: the top contains all the components of a standard PC (Click to enlarge) the bottom has a signal generation PCB (Click to enlarge) The PCB looks incredibly clean and well laid out and I love how they printed the names of different sections on the metal gray shielding plates. We’ll leave the PC system for a future blog post, and focus on the signal generation PCB. The Signal Generation PCB Let’s remove the shielding plates to see what’s underneath. I had to drill out one of the screws that attach the plates because the head was stripped. (Did somebody before me already try to repair it?) (Click to enlarge) The bottom half left and right sections are perfectly symmetrical, as one would expect for a device that has the ability to tune skew mismatches with a 10 ps precision. Annotated, it looks like this: (Click to enlarge) Rohde & Schwarz recently made the terrible decision to lock all their software and manuals behind an approval-only corporate log-in wall, but luckily some of the most important AMIQ assets can be found online elsewhere, including the operating manual and a service manual contains the full schematics! Let’s dig a bit deeper into the various aspects of the design. In what follows I’ll be focusing primarily on analog aspects of the design. This is a very personal choice: not that the digital sections aren’t important, it’s just that, as digital design engineer, they’re not particular interesting to me. By studying the analog sections, I hope to stumble into circuits that I didn’t really know a lot about before. Fantastic Schematics Before digging in for real, a word about the schematics: they are fantastic. Each sub-system has a block diagram that is already pretty detailed, with signal names that match the schematics and test points, often with annotations to indicate the voltage or frequency range. Here’s the block diagram of the reference and DAC clock generation section, for example. Schematic page 5 (Click to enlarge) Signals that come from or go to other pages are fully referenced. Look at the SYN_OUT_CLK signal below: Schematic page 10 (Click to enlarge) The signal is also used on page 6, coordinate 1D and 7B and page 22, coordinate 8A. How cool is that? Signal path test points One of the awesome features of the PCB is the generous amount of test points. We’re not just talking PCB test point against which you can hold your oscilloscope probe or even header pins, though there are plenty of those too, but full on SMB connectors. In addition to these SMD connectors, there are also plenty of jumpers that can be used to interrupt the default signal flow and insert your own test signal instead. Analog Signal Generation Architecture: Fixed vs Variable DAC Clock In the HP 33120A the DAC has a fixed 40 MHz clock. There’s a 16 kB waveform RAM that contains, say, one quarter period of a 100 Hz sine. If you want to send out a sine wave of 200 Hz, instead of sequentially stepping through all the addresses of the waveform RAM, you just skip every other address. One of the benefits of this kind of generation scheme is that you can make do with a fixed frequency analog anti-aliasing filter: the Nyquist frequency is always the same after all. A major disadvantage, however, is that even if the output signal has a bandwidth of only 1MHz, you still need to feed the DAC at the fixed clock rate. You could insert a digital upsampling filter between the waveform memory and the DAC, but that requires significant mathematical DSP fire power, or you’d have to increase the depth of the waveform memory. For an arbitrary waveform generator, it makes more sense to run the DAC at whichever clock speed is sufficient to meet the Nyquist requirement of the desired signal and provide a number of different filtering options. The AMIQ has 4 such options: no filter, a 25 MHz, a 2.5 MHz or a loopback through an external filter. The DAC sample clock range is huge, from 10 Hz all the way to 105 MHz, though specifications are only guaranteed up to 100 MHz. According to the data sheet, the clock frequency can be set with a precision of 10^-7. Internal Reference Clock Generation Like all professional test and measurement equipment, the AMIQ uses a 10MHz reference clock that can come from outside or that can be generated locally with a 10MHz crystal. It’s common for high-end equipment to have an oven controlled crystal oscillator (OCXO), but AMIQ has a lower spec’ed temperature controlled one (TCXO), a Milliren Technologies 453-0210. If we look at the larger reference clock generation block diagram, we can see something slightly unusual: instead of selecting between the internal TCXO output or the external reference clock input, the internal reference clock always comes from the TCXO (green). Schematic page 5 (Click to enlarge) When the internal clock is selected, the TCXO output frequency can be tuned with an analog signal that comes from the VTXCO TUNE DAC (blue), but when the external reference input is active, the TCXO is phase locked to the external clock. You can see the phase comparator and low pass filter in red. The reason for using a PLL with the internal TCXO as the voltage controlled oscillator is probably to ensure that the generated reference clock has the phase noise of the TCXO while tracking the frequency of the external reference clock: a PLL acts as a low-pass filter to the reference clock and as a high-pass filter to the VCO. If the TCXO has a better high frequency phase noise than the external reference clock, that makes sense. This is really out of my wheelhouse, so take all of this with a grain of salt… SYN_REF is the output of the internal reference clock generation unit. DAC Clock Synthesizer The clock synthesizer creates a highly programmable DAC clock from the internal reference clock SYN_REF from previous section. It should come as no surprise that this clock is generated by a PLL as well. Schematic page 5 (Click to enlarge) There are two speciality components in the clock generation path: a Mini-Circuits JTOS-200 VCO an Analog Devices AD9850 DDS Synthesizer (Click to enlarge) The VCO has an operating frequency between 100 and 200 MHz. I don’t know enough about VCOs to give meaningful commentary about the specifications, but based on comparisons with similar components on Digikey, such as this FMVC11009, it’s safe to assume that it’s an expensive and high quality component. The AD9850 is located in the feedback path of the PLL where it acts as a feedback divider with a precision of 32 bits. The signal flow inside the AD9850 is interesting: It works as follows: each clock cycle, the phase of a numerical controlled oscillator (NCO) accumulates with the programmable 32-bit increment. the upper bits of the phase accumulator serve as the address of a sine waveform table. a 10-bit DAC converts the digital sine wave to analog. the analog signal is sent through an external low-pass filter. the output of the low-pass filter goes back into the AD9850 and through a comparator to generate a digital output clock. This thing has its own programmable signal generator! But why the roundtrip from digital to analog back to digital? In theory, the MSB of the NCO could be used as output clock of the clock generator. The problem is that in such a configuration, the length of each clock period toggles between N and N+1 clock cycles, with a ratio so that the average clock period ends up with the desired value. But this creates major spurs in the frequency spectrum of the generated clock. When fed into the phase comparator of a PLL, these frequency spurs can show up as jitter at the output of the PLL and thus into the spectrum of the generated I and Q signals. By converting the signal to analog and using a low pass filter, the spurs can be filtered away. The combination of a low pass filter and a comparator acts as an interpolator: the edges of the generated clock fall somewhere in between N and N+1. Schematic page 21 (Click to enlarge) The steepness of the low pass filter depends on the ratio between the input and the output clock: the lower the ratio, the steeper the filter. There are a bunch of binary clock dividers that make it hard to know the exact ratio, but if the output of the VCO is 100 MHz and the input 10 MHz or less, there is a 10:1 ratio. The AMIQ has a 7th order elliptical low pass filter. I ran a quick simulation in LTspice to check the behavior. (dds_filter.asc source file.) The filter has a cut-off frequency of around 13 MHz: In modern fractional PLLs, instead of using a regular DAC, one could use a high-order sigma-delta unit to create a pulse density modulated output with the noise pushed to higher frequencies and a low pass filter that can be less aggressive. There’s a plenty of literature online about DDS clock generators, some of which I’ve listed in the references at the bottom, but the datasheet of the AD9850 itself is a good start. I/Q Output Skew Tuning Earlier I mentioned the ability to tune the skew between the I and the Q output to compensate for a difference in cable length when connecting the AMIQ to an RF signal generator. While the digital waveform fetching circuit works on one clock, the two clocks that go to the DAC are the ones that can be changed: Here’s what the skewing circuit looks like: Schematic page 10 (Click to enlarge) Starting with input signal DAC_CLK, the goal is to create I_DAC_CLK and Q_DAC_CLK that can be moved up to 1ns ahead or behind the other. Signals can be delayed by sending them through an R/C combo. Since we need a variable delay, we need a way to change the value of either the R or the C. A varactor or varicap diode is exactly that kind of device: its capacitance changes depending on the reverse bias voltage across the diode. Static input signal SKEW_TUNE comes from a DAC. It goes to the cathode of one varicap and the anode of the other. When its voltage increase, the capacitance of those 2 diodes moves in opposite ways, and so does the R/C delay along the I and Q clock path. A BB147 varicap has a capacitance that varies between 2pF and 112pF depending on the bias voltage. Capacitances C168 and C619 prevent the relatively high bias voltages from reaching the digital signal path. Does the circuit work? Absolutely! The scope photos below show the impact of the skewing circuit when dialed to the maximum in both directions: With a 2ns/div horizontal scale, you can see a skew of ~1ns either way. Variable Gain Amplifier The analog signal path starts at the DAC and then goes through the anti-aliasing filters, an output section that does amplification and attenuation, and finally the output connector board. It is common for signal generators to have a fixed gain amplification section and then some selectable fixed attenuation stages: amplifiers with a variable gain and low distortion are hard to design. If you need a signal amplitude that can’t be achieved with one of the fixed attenuation settings, one solution is to multiply the signal before it enters the DAC to reduce the output amplitude, though that’s at the expense of losing some of the dynamic range of the DAC. This is not the case for the AMIQ: while it can use a signal path that only uses fixed amplification and attenuation stages, it also offers the intriguing option to send the analog signal through an analog multiplier stage: If we zoom down from the system diagram to the block diagram, we can see how the multiplier/variable attenuator sits between the filter and the output amplifier, with 2 control input signals: AMPL_CNTRL and OFFSET. This circuit exists twice, of course, for the I and the Q channel. Schematic page 3 (Click to enlarge) Let’s check out the details! The heavy lifting of the variable gain amplifier is performed by an Analog Devices AD835 250 MHz, Voltage Output, 4-Quadrant Multiplier, a part from their Analog Multipliers & Dividers catalog. Schematic page 14 (Click to enlarge) It’s a tiny 8-pin device that calculates W = (X1-X2) * (Y1-Y2) + Z. In low volume, it will set you back $32 on Digikey. In addition to the multiplication, you can set a fixed output gain by feeding back output W through a resistive divider back into Z. For inputs less than 10 MHz, it has a typical harmonic distortion of -70dB. Analog Devices datasheets usually have a Theory of Operation section that explain, well, the underlying theory of operation. The AD835 has such a section as well, but it doesn’t go any further than stating that the multiplier is based on a classic form, having a translinear core, supported by three (X, Y, and Z) linearized voltage-to-current converters, and the load driving output amplifier. I have no clue how the thing works! In the case of the AMIQ, X2 and Y2 are strapped to ground, and there’s no feedback from W back into Z either which reduces the functionality to: W = FILTER_OUT * AMPL_VAR + OFFSET. AMPL_VAR and OFFSET are static analog values that are each created by a 12-bit DAC8143, just like many other analog configuration signals. It’s almost shame that this powerful little device is asked to perform such a basic operation. While researching the AD835, somebody pointed out the AD8310, another interesting speciality chip from Analog Devices. It’s a DC to 440 MHz, 95dB logarithmic amplifier, a converter from linear to logarithmic scale basically. Discovering little gems like this is why I love studying schematics of complex devices. Internal Diagnostics The AMIQ has a set of 15 internal analog signal to monitor the health of the device. Various meaningful signals are gathered all around the signal generation board and measured at power up to give you that dreaded Diagnostic Check Fail message. Schematic page 27 (Click to enlarge) The circuit itself is straightforward: 2 8-to-1 analog multiplexers feed into a CS5507AS 16-bit AD converter. It can only do 100 samples per second, but that’s sufficient to measure a bunch of mostly static signals. Like many other devices, the measured value is sent serially to one of the FPGAs. The ADC needs a 2.5V reference voltage. It’s funny that this reference voltage also goes to the analog multiplexers as one one of the diagnostic signals. One wonders what the ADC returns as a result when it tries to convert the output of a broken reference voltage generator. There are a bunch of circuits in the AMIQ whose only purpose is generating diagnostic signals. Here’s a good example of that: Schematic page 14 (Click to enlarge) I_OUT_DIAG and Q_OUT_DIAG are the outputs after the attenuator that will eventually end up at the connectors. The circuit in the top red square is a signal peak detector, similar to what you’d find in the rectifier of a power supply. It allows the AMIQ to track the output level of signals with a frequency that is way higher than the sample rate of the ADC. The circuit in the red rectangle below performs a digital XOR on the analog I and the Q signals and then sends it through a simple R/C low pass filter. I think that it allows the AMIQ to check that the phase difference between the I and Q channel is sensible when applying a test signal during power-on self-test. Efficient Distribution of Configuration Signals The AMIQ signal board has hundreds of configuration bits: there are the obvious enable/disable or selection bits, such as those to select between different output filters, but the lion’s share are used to set the output value of many 12-bit DACs. Instead of using parallel busses that fan out from an FPGA, the AMIQ has a serial configuration scan chain. Discrete 8-bit 74HCT4094 shift-and-store registers are located all over the design for the digital configuration bits. The DAC8134 devices have their own built-in shift register and are part of the same scan chain. Schematic page 19 (Click to enlarge) The schematic above is an example of that. The red scan chain data input goes through the VTCXO_TUNE DAC, then the 74HCT4094 after which it exist to some other page of the schematics. Conclusion Around the late nineties, test equipment companies started to stop adding schematics to their service manuals, but the R&S AMIQ is a nice exception to that. And while the device already has a bunch of FPGAs, most of the components are off-the-shelf single-function components. Thanks to that, the AMIQ is an excellent candidate for a deep dive: all the information is there, you need to just spend a bit of effort to go through things. I had a ton of fun figuring out how things worked. References Rohde & Schwarz documents R&S - IQ Modulation Generator AMIQ R&S - AMIQ Datasheet R&S - AMIQ Operating Manual R&S - AMIQ Service Manual with schematic Various application notes R&S - Floppy Disk Control of the I/Q Modulation Generator AMIQ R&S - Software WinIQSIM for Calculating I/Q Signals for Modulation Generator R&S AMIQ R&S - Creating Test Signals for Bluetooth with AMIQ / WinIQSIM and SMIQ R&S - WCDMA Signal Generator Solutions R&S - Golden devices: ideal path or detour? R&S - Demonstration of BER Test with AMIQ controlled by WinIQSIM Other AMIQ content on the web zw-ix has a blog - Getting an Rohde Schwarz AMIQ up and running zw-ix has a blog - Connecting a Rohde Schwarz AMIQ to a SMIQ04 Bosco tweets DDS Clock Synthesis MT-085 Tutorial: Fundamentals of Direct Digital Synthesis (DDS) How to Predict the Frequency and Magnitude of the Primary Phase Truncation Spur in the Output Spectrum of a Direct Digital Synthesizer (DDS) Related content The Signal Path - Teardown, Repair & Analysis of a Rohde & Schwarz AFQ100A I/Q (ARB) Modulation Generator

2 weeks ago 2 votes
DSLogic U3Pro16 Review and Teardown

Introduction The DSLogic U3Pro16 In the Box Probe Cables and Clips The Controller Hardware The Input Circuit Impact of Input Circuit on Circuit Under Test Additional IOs: External Clock, Trigger In, Trigger Out Software: From Saleae Logic to PulseView to DSView Installing DSView on a Linux Machine DSView UI Streaming Data to the Host vs Local Storage in DRAM Triggers Conclusion References Footnotes Introduction The year was 2020 and offices all over the world shut down. A house remodel had just started, so my office moved from a comfortably airconditioned corporate building to a very messy garage. Since I’m in the business of developing and debugging hardware, a few pieces of equipment came along for the ride, including a Saleae Logic Pro 16. I had the unit for work stuff, I may once in a while have used it for some hobby-related activities too. There’s no way around it: Saleae makes some of the best USB logic analyzers around. Plenty of competitors have matched or surpassed their digital features, but none have the ability to record the 16 channels in analog format as well. After corporate offices reopened, the Saleae went back to its original habitat and I found myself without a good 16-channel USB logic analyzer. Buying a Saleae for myself was out of the question: even after the $150 hobbyist discount, I can’t justify the $1350 price tag. After looking around for a bit, I decided to give the DSLogic U3Pro16 from DreamSourceLab a chance. I bought it on Amazon for $299. (Click to enlarge) In this blog post, I’ll look at some of the features, my experience with the software, and I’ll also open it up to discover what’s inside. The DSLogic U3Pro16 The DSLogic series currently consists of 3 logic analyzers: the $149 DSLogic Plus (16 channels) the $299 DSLogic U3Pro16 (16 channels) the $399 DSLogic U3Pro32 (32 channels) The DSLogic Plus and U3Pro16 both have 16 channels, but acquisition memory of the Plus is only 256Mbits vs 2Gbits for the Pro, and it has to make do with USB 2.0 instead of a USB 3.0 interface, a crucial difference when streaming acquistion data straight to the PC to avoid the limitations of the acquistion memory. There’s also a difference in sample rate, 400MHz vs 1GHz, but that’s not important in practice. The only functional difference between the U3Pro16 and U3Pro32 is the number of channels. It’s tempting to go for the 32 channel version but I’ve rarely had the need to record more than 16 channels at the same time and if I do, I can always fall back to my HP 1670G logic analyzer, a pristine $200 flea market treasure with a whopping 136 channels1. So the U16Pro it is! In the Box The DSLogic U16Pro comes with a nice, elongated hard case. Inside, you’ll find: the device itself. It has a slick aluminum enclosure. a USB-C to USB-A cable 5 4-way probe cables and 1 3-way clock and trigger cable 18 test clips Probe Cables and Clips You read it right, my unit came with 5 4-way probe cables, not 4. I don’t know if DreamSourceLab added one extra in case you lose one or if they mistakenly included one too much, but it’s good to have a spare. The cables are slightly stiffer than those that comes with a Saleae but not to the point that it adds a meaningful additional strain to the probe point. They’re stiffer because each of the 16 probe wires carries both signal and ground, probably a thin coaxial cable, which lowers the inductance of the probe and reduce ringing when measuring signal with fast rise and fall times. In terms of quality, the probe cables are a step up from the Saleae ones. The case is long enough so that the probe cables can be stored without bending them. The quality of the test clips is not great, but they are no different than those of the 5 times more expensive Saleae Logic 16 Pro. Both are clones of the HP/Agilent logic analyzer grabbers that I got from eBay and will do the job, but I much prefer the ones from Tektronix. The picture below shows 4 different grabbers. From left to right: Tektronix, Agilent, Saleae and DSLogic ones. Compared to the 3 others, the stem of the Tektronix probe is narrow which makes it easier to place multiple ones next to each other one fine-pitch pin arrays. If you’re thinking about upgrading your current probes to Tektronix ones: stay away from fakes. As I write this, you can find packs of 20 probes on eBay for $40 (incl shipping), so around $2 per probe. Search for “Tektronix SMG50” or “Tektronix 020-1386-01”. Meanwhile, you can buy a pack of 12 fake ones on Amazon for $16, or $1.3 a piece. They work, but they aren’t any better than the probes that come standard with the DSLogic. Fake probe on the left, Tek probe on the right The stem of the fake one is much thicker and the hooks are different too. The Tek probe has rounded hooks with a sharp angle at the tip: Tektronix hooks The hooks of a fake probe are flat and don’t attach nearly as well to their target: Fake hooks If you need to probe targets with a pitch that is smaller than 1.25mm, you should check out these micro clips that I reviewed ages ago. The Controller Hardware Each cable supports 4 probes and plugs into the main unit with 8 0.05” pins in 4x2 configuration, one pin for the signal, one pin for ground. The cable itself has a tiny PCB sticking out that slots into a gap of the aluminum enclosure. This way it’s not possible to plug in the cable incorrectly… unlike the Saleae. It’s great. When we open up the device, we can see an Infineon (formerly Cypress) CYUSB3014-BZX EZ-USB FX3 SuperSpeed controller. A Saleae Logic Pro uses the same device. These are your to-go-to USB interface chips when you need a microcontroller in addition to the core USB3 functionatility. They’re relatively cheap too, you can get them for $16 in single digital quantities at LCSC.com. The other size of the PCB is much busier. (Click to enlarge) The big ticket components are: a Spartan-6 XC6SLX16 FPGA Reponsible data acquisition, triggering, run-length encoding/compression, data storage to DRAM, and sending data to the CYUSB3014. A Saleae Logic 16 Pro has a smaller Spartan-6 LX9. That makes sense: its triggering options aren’t as advanced as the DSLogic and since it lacks external DDR memory, it doesn’t need a memory controller on the FPGA either. a DDR3-1600 DRAM It’s a Micron MT41K128M16JT-125, marked D9PTK, with 2Gbits of storage and a 16-bit data bus. an Analog Devices ADF4360-7 clock generator I found this a bit surprising. A Spartan-6 LX16 FPGA has 2 clock management tiles (CMT) that each have 1 real PLL and 2 DCMs (digital clock manager) with delay locked loop, digital frequency synthesizer, etc. The VCO of the PLL can be configured with a frequency up to 1080 MHz which should be sufficient to capture signals at 1GHz, but clearly there was a need for something better. The ADF4360-7 can generate an output clock as fast a 1800MHz. There’s obviously an extensive supporting cast: a Macronix MX25R2035F serial flash This is used to configure the FPGA. an SGM2054 DDR termination voltage controller an LM26480 power management unit It has two linear voltage regulators and two step-down DC-DC convertors. two clock oscillators: 24MHz and 19.2MHz a TI HD3SS3220 USB-C Mux This the glue logic that makes it possible for USB-C connectors to be orientation independent. a SP3010-04UTG for USB ESD protection Marked QH4 Two 5x2 pin connectors J7 and J8 on the right size of the PCB are almost certainly used to connect programming and debugging cables to the FPGA and the CYUSB-3014. (Click to enlarge) The Input Circuit I spent a bit of time Ohm-ing out the input circuit. Here’s what I came up with: The cable itself has a 100k Ohm series resistance. Together with a 100k Ohm shunt resistor to ground at the entrance of the PCB it acts as by-two resistive divider. The series resistor also limits the current going into the device. Before passing through a 33 Ohm series resistor that goes into the FPGA, there’s an ESD protection device. I’m not 100% sure, but my guess is that it’s an SRV05-4D-TP or some variant thereof. I’m not 100% sure why the 33 Ohm resistor is there. It’s common to have these type of resistors on high speed lines to avoid reflection but since there’s already a 100k resistor in the path, I don’t think that makes much sense here. It might be there for additional protection of the ESD structure that resides inside the FPGA IOs? A DSLogic has a fully programmable input threshold voltage. If that’s the case, then where’s the opamp to compare the input voltage against this threshold voltage? (There is such a comparator on a Saleae Logic Pro!) The answer to that question is: “it’s in the FPGA!” FPGA IOs can support many different I/O standards: single-ended ones, think CMOS and TTL, and a whole bunch of differential standards too. Differential protocols compare a positive and a negative version of the same signal but nothing prevents anyone from assigning a static value to the negative input of a differential pair and making the input circuit behave as a regular single-end pair with programmable threshold. Like this: There is plenty of literature out there about using the LVDS comparator in single-ended mode. It’s even possible to create pretty fast analog-digital convertors this way, but that’s outside the scope of this blog post. Impact of Input Circuit on Circuit Under Test 7 years ago, OpenTechLab reviewed the DSLogic Plus, the predecessor of the DSLogic U3Pro16. Joel spent a lot of time looking at its input circuit. He mentions a 7.6k Ohm pull-down resistor at the input, different than the 100k Ohm that I measured. There’s no mention of a series resistor in the cable or about the way adjustable thresholds are handled, but I think that the DSLogic Pro has a simular input circuit. His review continues with an in-depth analysis of how measuring a signal can impact the signal itself, he even builds a simulation model of the whole system, and does a real-world comparison between a DSLogic measurement and a fake-Saleae one. While his measurements are convincing, I wasn’t able to repeat his results on a similar setup with a DSLogic U3Pro and a Saleae Logic Pro: for both cases, a 200MHz signal was still good enough. I need to spend a bit more time to better understand the difference between my and his setup… Either way, I recommend watching this video. Additional IOs: External Clock, Trigger In, Trigger Out In addition to the 16 input pins that are used to record data, the DSLogic has 3 special IOs and a seperate 3-wire cable to wire them up. They are marked with the character “OIC” above the connector, which stands for Output, Input, Clock. Clock Instead of using a free-running internal clock, the 16 input signals can be sampled with an external sampling clock. This corresponds to a mode that’s called “state clocking” in big-iron Tektronix and HP/Agilent/Keysight logic analyzers. Using an external clock that is the same as the one that is used to generate the signals that you want to record is a major benefit: you will always record the signal at the right time as long as setup and hold requirements are met. When using a free-running internal sampling clock, the sample rate must a factor of 2 or more higher to get an accurate representation of what’s going on in the system. The DSLogic U16Pro provides the option to sample the data signals at the positive or negative edge of the external clock. On one hand, I would have prefered more options in moving the edge of the clock back and forth. It’s something that should be doable with the DLLs that are part of the DCMs blocks of a Spartan-6. But on the other, external clocking is not supported at all by Saleae analyzers. The maximum clock speed of the external clock input is 50MHz, significantly lower than the free-running sample speed. This is the usually the case as well for big iron logic analyzers. For example, my old Agilent 1670G has a free running sampling clock of 500MHz and supports a maximum state clock of 150MHz. Trigger In According to the manuals: “TI is the input for an external trigger signal”. That’s a great feature, but I couldn’t figure out a way in DSView on how to enable it. After a bit of googling, I found the following comment in an issue on GitHub. This “TI” signal has no function now. It’s reserved for compatible and further extension. This comment is dated July 29, 2018. A closer look at the U3Pro16 datasheets shows the description of the “TI” input as “Reserved”… Trigger Out When a trigger is activated inside the U3Pro, a pulse is generated on this pin. The manual doesn’t give more details, but after futzing around with the horrible oscilloscope UI of my 1670G, I was able to capture a 500ms trigger-out pulse of 1.8V. Software: From Saleae Logic to PulseView to DSView When Saleae first came to market, they raised the bar for logic analyzer software with Logic, which had a GUI that allowed scrolling and zooming in and out of waveforms at blazing speed. Logic also added a few protocol decoders, and an C++ API to create your own decoders. It was the inspiration of PulseView, an open source equivalent that acts as the front-end application of SigRok, an open source library and tool that acts as the waveform data acquisition backend. PulseView supports protocol decoders as well, but it has an easier to use Python API and it allows stacked protocol decoders: a low-level decoder might convert the recorded signals into, say, I2C tokens (start/stop/one/zero). A second decoder creates byte-level I2C transactions out of the tokens. And I2C EPROM decoder could interpret multiple I2C transactions as read and write operations. PulseView has tons of protocol decoders, from simple UART transactions, all the way to USB 2.0 decoders. When the DSLogic logic analyzer hit the market after a successful Kickstarter campaign, it shipped with DSView, DreamSourceLab’s closed source waveform viewer. However, people soon discovered that it was a reskinned version of PulseView, a big no-no since the latter is developed under a GPL3 license. After a bit of drama, DreamSourceLab made DSView available on GitHub under the required GPL3 as well, with attribution to the sigrok project. DSView is a hard fork of PulseView and there are still some bad feelings because DreamSourceLab doesn’t push changes to the PulseView project, but at least they’ve legally in the clear for the past 6 years. The default choice would be to use DSView to control your DSLogic, but Sigrok/PulseView supports DSView as well. In the figure below, you can see DSView in demo mode, no hardware device connected, and an example of the 3 stacked protocol described earlier: (Click to enlarge) For this review, I’ll be using DSView. Saleae has since upgrade Logic to Logic2, and now also supports stacked protocol decoders. It still uses a C++ API though. You can find an example decoder here. Installing DSView on a Linux Machine DreamSourceLab provides DSView binaries for Windows and MacOS binaries but not for Linux. When you click the Download button for Linux, it returns a tar file with the source code, which you’re expected to compile yourself. I wasn’t looking forward to running into the usual issues with package dependencies and build failures, but after following the instructions in the INSTALL file, I ended up with a working executable on first try. DSView UI The UI of DSView is straightforward and similar to Saleae Logic 2. There are things that annoy me in both tools but I have a slight preference for Logic 2. Both DSView and Logic2 have a demo mode that allows you to play with it without a real device attached. If you want to get a feel of what you like better, just download the software and play with it. Some random observations: DSView can pan and zoom in or out just as fast as Logic 2. On a MacBook, the way to navigate through the waveform really rubs me the wrong way: it uses the pinching gesture on a trackpad to zoom in and out. That seems like the obvious way to do it, but since it’s such a common operation to browse through a waveform it slows you down. On my HP Laptop 17, DSView uses the 2 finger slide up and down to zoom in and out which is much faster. Logic 2 also uses the 2 finger slide up and down. The stacked protocol decoders area amazing. Like Logic 2, DSView can export decoded protocols as CSV files, but only one protocol at a time. It would be nice to be able to export multiple protocols in the same CSV file so that you can easier compare transaction flow between interfaces. Logic 2 behaves predictably when you navigate through waveforms while the devices is still acquiring new data. DSView behaves a bit erratic. In DSView, you need to double click on the waveform to set a time marker. That’s easy enough, but it’s not intuitive and since I only use the device occasionally, I need to google every time I take it out of the closet. You can’t assign a text label to a DSView cursors/time marker. None of the points above disquality DSView: it’s a functional and stable piece of software. But I’d be lying if I wrote that DSView is as frictionless and polished as Logic 2. Streaming Data to the Host vs Local Storage in DRAM The Saleae Logic 16 Pro only supports streaming mode: recorded data is immediately sent to the PC to which the device is connected. The U3Pro supports both streaming and buffered mode, where data is written in the DRAM that’s on the device and only transported to the host when the recording is complete. Streaming mode introduces a dependency on the upstream bandwidth. An Infineon FX3 supports USB3 data rates up 5Gbps, but it’s far from certain that those rates are achieved in practice. And if so, it still limits recording 16 channels to around 300MHz, assuming no overhead. In practice, higher rates are possible because both devices support run length encoding (RLE), a compression technique that reduces sequences of the same value to that value and the length of the sequence. Of course, RLE introduces recording uncertainty: high activity rates may result in the exceeding the available bandwidth. The U3Pro has a 16-bit wide 2Gbit DDR3 DRAM with a maximum data rate of 1.6G samples per second. Theoretically, make it possible to record 16 channels with a 1.6GHz sample rate, but that assumes accessing DRAM with 100% efficiency, which is never the case. The GUI has the option of recording 16 signals at 500MHz or 8 signals at 1GHz. Even when recording to the local DRAM, RLE compression is still possible. When RLE is disabled and the highest sample rate is selected, 268ms of data can be recorded. When connected to my Windows laptop, buffered mode worked fine, but on my MacBook Air M2 DSView always hangs when downloading the data that was recorded at high sample rates and I have to kill the application. In practice, I rarely record at high sample rates and I always use streaming mode which works reliably on the Mac too. But it’s not a good look for DSView. Triggers One of the biggest benefits of the U3Pro over a Saleae is their trigger capability. Saleae Logic 2.4.22 offers the following options: You can set a rising edge, falling edge, a high or a low level on 1 signal in combination with some static values on other signals, and that’s it. There’s not even a rising-or-falling edge option. It’s frankly a bit embarrassing. When you have a FPGA at your disposal, triggering functionality is not hard to implement. Meanwhile, even in Simple Trigger mode, the DSLogic can trigger on multiple edges at the same time, something that can be useful when using an external sampling clock. But the DSLogic really shines when enabling the Advanced Trigger option. In Stage Trigger mode, you can create state sequences that are up to 16 phases long, with 2 16-bit comparisons and a counter per stage. Alternatively, Serial Trigger mode is a powerful enough to capture protocols like I2C, as shown below, where a start flag is triggered by a falling edge of SDA when SCL is high, a stop flag by a rising edge of SDA when SCL is high, and data bits are captured on the rising edge of SCL: You don’t always need powerful trigger options, but they’re great to have when you do. Conclusion The U3Pro is not perfect. It doesn’t have an analog mode, buffered mode doesn’t work reliably on my MacBook, and the DSView GUI is a bit quirky. But it is relatively cheap, it has a huge library of decoding protocols, and the triggering modes are excellent. I’ve used it for a few projects now and it hasn’t let me down so far. If you’re in the market for a cheap logic analyzer, give it a good look. References Logic Analyzer Shopping Comparison between Saleae Logic Pro 16, Innomaker LA2016, Innomaker LA5016, DSLogic Plus, and DSLogic U3Pro16 Footnotes It even has the digital storage scope option with 2 analog channels, 500MHz bandwidth and 2GSa/s sampling rate. ↩

4 weeks ago 19 votes
HP Laptop 17 RAM Upgrade

Introduction Selecting the RAM Opening up Replacing the RAM Reassembly References Introduction I do virtually all of my hobby and home computing on Linux and MacOS. The MacOS stuff on a laptop and almost all Linux work a desktop PC. The desktop PC has Windows on it installed as well, but it’s too much of a hassle to reboot so it never gets used in practice. Recently, I’ve been working on a project that requires a lot of Spice simulations. NGspice works fine under Linux, but it doesn’t come standard with a GUI and, more important, the simulation often refuse to converge once your design becomes a little bit bigger. Tired of fighting against the tool, I switched to LTspice from Analog Devices. It’s free to use and while it support Windows and MacOS in theory, the Mac version is many years behind the Windows one and nearly unusuable. After dual-booting into Windows too many times, a Best Buy deal appeared on my BlueSky timeline for an HP laptop for just $330. The specs were pretty decent too: AMD Ryzen 5 7000 17.3” 1080p screen 512GB SSD 8 GB RAM Full size keyboard Windows 11 Someone at the HP marketing departement spent long hours to come up with a suitable name and settled on “HP Laptop 17”. I generally don’t pay attention to what’s available on the PC laptop market, but it’s hard to really go wrong for this price so I took the plunge. Worst case, I’d return it. We’re now 8 weeks later and the laptop is still firmly in my possession. In fact, I’ve used it way more than I thought I would. I haven’t noticed any performance issues, the screen is pretty good, the SSD larger than what I need for the limited use case, and, surprisingly, the trackpad is the better than any Windows laptop that I’ve ever used, though that’s not a high bar. It doesn’t come close to MacBook quality, but palm rejection is solid and it’s seriously good at moving the mouse around in CAD applications. The two worst parts are the plasticy keyboard and the 8GB of RAM. I can honestly not quantify whether or not it has a practical impact, but I decided to upgrade it anyway. In this blog post, I go through the steps of doing this upgrade. Important: there’s a good chance that you will damage your laptop when trying this upgade and almost certainly void your warranty. Do this at your own risk! Selecting the RAM The laptop wasn’t designed to be upgradable and thus you can’t find any official resources about it. And with such a generic name, there’s guaranteed to be multiple hardware versions of the same product. To have reasonable confidence that you’re buying the correct RAM, check out the full product name first. You can find it on the bottom: Mine is an HP Laptop 17-cp3005dx. There’s some conflicting information about being able to upgrade the thing. The BestBuy Q&A page says: The HP 17.3” Laptop Model 17-cp3005dx RAM and Storage are soldered to the motherboard, and are not upgradeable on this model. This is flat out wrong for my device. After a bit of Googling around, I learned that it has a single 8GB DDR4 SODIMM 260-pin RAM stick but that the motherboard has 2 RAM slots and that it can support up to 2x32GB. I bought a kit with Crucial 2x16GB 3200MHz SODIMMs from Amazon. As I write this, the price is $44. Opening up Removing the screws This is the easy part. There are 10 screws at the bottom, 6 of which are hidden underneath the 2 rubber anti-slip strips. It’s easy to peel these stips loose. It’s als easy to put them back without losing the stickiness. Removing the bottom cover The bottom cover is held back by those annoying plastic tabs. If you have a plastic spudger or prying tool, now is the time to use them. I didn’t so I used a small screwdriver instead. Chances are high that you’ll leave some tiny scuffmarks on the plastic casing. I found it easiest to open the top lid a bit, place the laptop on its side, and start on the left and right side of the keyboard. After that, it’s a matter of working your way down the long sides at the front and back of the laptop. There are power and USB connectors that are right against the side of the bottom panel so be careful not to poke with the spudger or screwdriver inside the case. It’s a bit of a jarring process, going back and forth and making steady improvement. In addition to all the clips around the board of the bottom panel, there are also a few in the center that latch on to the side of the battery. But after enough wiggling and creaking sounds, the panel should come loose. Replacing the RAM As expected, there are 2 SODIMM slots, one of which is populated with a 3200MHz 8GDB RAM stick. At the bottom right of the image below, you can also see the SSD slot. If you don’t enjoy the process of opening up the laptop and want to upgrade to a larger drive as well, now would be the time for that. New RAM in place! It’s always a good idea to test the surgery before reassembly: Success! Reassembly Reassembly of the laptop is much easier than taking it apart. Everything simply clicks together. The only minor surprise was that both anti-slip strips became a little bit longer… References Memory Upgrade for HP 17-cp3005dx Laptop Upgrading Newer HP 17.3” Laptop With New RAM And M.2 NVMe SSD Different model with Intel CPU but the case is the same.

2 months ago 28 votes

More in technology

Sierpiński triangle? In my bitwise AND?

Exploring a peculiar bit-twiddling hack at the intersection of 1980s geek sensibilities.

yesterday 4 votes
Reverse engineering the 386 processor's prefetch queue circuitry

In 1985, Intel introduced the groundbreaking 386 processor, the first 32-bit processor in the x86 architecture. To improve performance, the 386 has a 16-byte instruction prefetch queue. The purpose of the prefetch queue is to fetch instructions from memory before they are needed, so the processor usually doesn't need to wait on memory while executing instructions. Instruction prefetching takes advantage of times when the processor is "thinking" and the memory bus would otherwise be unused. In this article, I look at the 386's prefetch queue circuitry in detail. One interesting circuit is the incrementer, which adds 1 to a pointer to step through memory. This sounds easy enough, but the incrementer uses complicated circuitry for high performance. The prefetch queue uses a large network to shift bytes around so they are properly aligned. It also has a compact circuit to extend signed 8-bit and 16-bit numbers to 32 bits. There aren't any major discoveries in this post, but if you're interested in low-level circuits and dynamic logic, keep reading. The photo below shows the 386's shiny fingernail-sized silicon die under a microscope. Although it may look like an aerial view of a strangely-zoned city, the die photo reveals the functional blocks of the chip. The Prefetch Unit in the upper left is the relevant block. In this post, I'll discuss the prefetch queue circuitry (highlighted in red), skipping over the prefetch control circuitry to the right. The Prefetch Unit receives data from the Bus Interface Unit (upper right) that communicates with memory. The Instruction Decode Unit receives prefetched instructions from the Prefetch Unit, byte by byte, and decodes the opcodes for execution. This die photo of the 386 shows the location of the registers. Click this image (or any other) for a larger version. The left quarter of the chip consists of stripes of circuitry that appears much more orderly than the rest of the chip. This grid-like appearance arises because each functional block is constructed (for the most part) by repeating the same circuit 32 times, once for each bit, side by side. Vertical data lines run up and down, in groups of 32 bits, connecting the functional blocks. To make this work, each circuit must fit into the same width on the die; this layout constraint forces the circuit designers to develop a circuit that uses this width efficiently without exceeding the allowed width. The circuitry for the prefetch queue uses the same approach: each circuit is 66 µm wide1 and repeated 32 times. As will be seen, fitting the prefetch circuitry into this fixed width requires some layout tricks. What the prefetcher does The purpose of the prefetch unit is to speed up performance by reading instructions from memory before they are needed, so the processor won't need to wait to get instructions from memory. Prefetching takes advantage of times when the memory bus is otherwise idle, minimizing conflict with other instructions that are reading or writing data. In the 386, prefetched instructions are stored in a 16-byte queue, consisting of four 32-bit blocks.2 The diagram below zooms in on the prefetcher and shows its main components. You can see how the same circuit (in most cases) is repeated 32 times, forming vertical bands. At the top are 32 bus lines from the Bus Interface Unit. These lines provide the connection between the datapath and external memory, via the Bus Interface Unit. These lines form a triangular pattern as the 32 horizontal lines on the right branch off and form 32 vertical lines, one for each bit. Next are the fetch pointer and the limit register, with a circuit to check if the fetch pointer has reached the limit. Note that the two low-order bits (on the right) of the incrementer and limit check circuit are missing. At the bottom of the incrementer, you can see that some bit positions have a blob of circuitry missing from others, breaking the pattern of repeated blocks. The 16-byte prefetch queue is below the incrementer. Although this memory is the heart of the prefetcher, its circuitry takes up a relatively small area. A close-up of the prefetcher with the main blocks labeled. At the right, the prefetcher receives control signals. The bottom part of the prefetcher shifts data to align it as needed. A 32-bit value can be split across two 32-bit rows of the prefetch buffer. To handle this, the prefetcher includes a data shift network to shift and align its data. This network occupies a lot of space, but there is no active circuitry here: just a grid of horizontal and vertical wires. Finally, the sign extend circuitry converts a signed 8-bit or 16-bit value into a signed 16-bit or 32-bit value as needed. You can see that the sign extend circuitry is highly irregular, especially in the middle. A latch stores the output of the prefetch queue for use by the rest of the datapath. Limit check If you've written x86 programs, you probably know about the processor's Instruction Pointer (EIP) that holds the address of the next instruction to execute. As a program executes, the Instruction Pointer moves from instruction to instruction. However, it turns out that the Instruction Pointer doesn't actually exist! Instead, the 386 has an "Advance Instruction Fetch Pointer", which holds the address of the next instruction to fetch into the prefetch queue. But sometimes the processor needs to know the Instruction Pointer value, for instance, to determine the return address when calling a subroutine or to compute the destination address of a relative jump. So what happens? The processor gets the Advance Instruction Fetch Pointer address from the prefetch queue circuitry and subtracts the current length of the prefetch queue. The result is the address of the next instruction to execute, the desired Instruction Pointer value. The Advance Instruction Fetch Pointer—the address of the next instruction to prefetch—is stored in a register at the top of the prefetch queue circuitry. As instructions are prefetched, this pointer is incremented by the prefetch circuitry. (Since instructions are fetched 32 bits at a time, this pointer is incremented in steps of four and the bottom two bits are always 0.) But what keeps the prefetcher from prefetching too far and going outside the valid memory range? The x86 architecture infamously uses segments to define valid regions of memory. A segment has a start and end address (known as the base and limit) and memory is protected by blocking accesses outside the segment. The 386 has six active segments; the relevant one is the Code Segment that holds program instructions. Thus, the limit address of the Code Segment controls when the prefetcher must stop prefetching.3 The prefetch queue contains a circuit to stop prefetching when the fetch pointer reaches the limit of the Code Segment. In this section, I'll describe that circuit. Comparing two values may seem trivial, but the 386 uses a few tricks to make this fast. The basic idea is to use 30 XOR gates to compare the bits of the two registers. (Why 30 bits and not 32? Since 32 bits are fetched at a time, the bottom bits of the address are 00 and can be ignored.) If the two registers match, all the XOR values will be 0, but if they don't match, an XOR value will be 1. Conceptually, connecting the XORs to a 32-input OR gate will yield the desired result: 0 if all bits match and 1 if there is a mismatch. Unfortunately, building a 32-input OR gate using standard CMOS logic is impractical for electrical reasons, as well as inconveniently large to fit into the circuit. Instead, the 386 uses dynamic logic to implement a spread-out NOR gate with one transistor in each column of the prefetcher. The schematic below shows the implementation of one bit of the equality comparison. The mechanism is that if the two registers differ, the transistor on the right is turned on, pulling the equality bus low. This circuit is replicated 30 times, comparing all the bits: if there is any mismatch, the equality bus will be pulled low, but if all bits match, the bus remains high. The three gates on the left implement XNOR; this circuit may seem overly complicated, but it is a standard way of implementing XNOR. The NOR gate at the right blocks the comparison except during clock phase 2. (The importance of this will be explained below.) This circuit is repeated 30 times to compare the registers. The equality bus travels horizontally through the prefetcher, pulled low if any bits don't match. But what pulls the bus high? That's the job of the dynamic circuit below. Unlike regular static gates, dynamic logic is controlled by the processor's clock signals and depends on capacitance in the circuit to hold data. The 386 is controlled by a two-phase clock signal.4 In the first clock phase, the precharge transistor below turns on, pulling the equality bus high. In the second clock phase, the XOR circuits above are enabled, pulling the equality bus low if the two registers don't match. Meanwhile, the CMOS switch turns on in clock phase 2, passing the equality bus's value to the latch. The "keeper" circuit keeps the equality bus held high unless it is explicitly pulled low, to avoid the risk of the voltage on the equality bus slowly dissipating. The keeper uses a weak transistor to keep the bus high while inactive. But if the bus is pulled low, the keeper transistor is overpowered and turns off. This is the output circuit for the equality comparison. This circuit is located to the right of the prefetcher. This dynamic logic reduces power consumption and circuit size. Since the bus is charged and discharged during opposite clock phases, you avoid steady current through the transistors. (In contrast, an NMOS processor like the 8086 might use a pull-up on the bus. When the bus is pulled low, would you end up with current flowing through the pull-up and the pull-down transistors. This would increase power consumption, make the chip run hotter, and limit your clock speed.) The incrementer After each prefetch, the Advance Instruction Fetch Pointer must be incremented to hold the address of the next instruction to prefetch. Incrementing this pointer is the job of the incrementer. (Because each fetch is 32 bits, the pointer is incremented by 4 each time. But in the die photo, you can see a notch in the incrementer and limit check circuit where the circuitry for the bottom two bits has been omitted. Thus, the incrementer's circuitry increments its value by 1, so the pointer (with two zero bits appended) increases in steps of 4.) Building an incrementer circuit is straightforward, for example, you can use a chain of 30 half-adders. The problem is that incrementing a 30-bit value at high speed is difficult because of the carries from one position to the next. It's similar to calculating 99999999 + 1 in decimal; you need to tediously carry the 1, carry the 1, carry the 1, and so forth, through all the digits, resulting in a slow, sequential process. The incrementer uses a faster approach. First, it computes all the carries at high speed, almost in parallel. Then it computes each output bit in parallel from the carries—if there is a carry into a position, it toggles that bit. Computing the carries is straightforward in concept: if there is a block of 1 bits at the end of the value, all those bits will produce carries, but carrying is stopped by the rightmost 0 bit. For instance, incrementing binary 11011 results in 11100; there are carries from the last two bits, but the zero stops the carries. A circuit to implement this was developed at the University of Manchester in England way back in 1959, and is known as the Manchester carry chain. In the Manchester carry chain, you build a chain of switches, one for each data bit, as shown below. For a 1 bit, you close the switch, but for a 0 bit you open the switch. (The switches are implemented by transistors.) To compute the carries, you start by feeding in a carry signal at the right The signal will go through the closed switches until it hits an open switch, and then it will be blocked.5 The outputs along the chain give us the desired carry value at each position. Concept of the Manchester carry chain, 4 bits. Since the switches in the Manchester carry chain can all be set in parallel and the carry signal blasts through the switches at high speed, this circuit rapidly computes the carries we need. The carries then flip the associated bits (in parallel), giving us the result much faster than a straightforward adder. There are complications, of course, in the actual implementation. The carry signal in the carry chain is inverted, so a low signal propagates through the carry chain to indicate a carry. (It is faster to pull a signal low than high.) But something needs to make the line go high when necessary. As with the equality circuitry, the solution is dynamic logic. That is, the carry line is precharged high during one clock phase and then processing happens in the second clock phase, potentially pulling the line low. The next problem is that the carry signal weakens as it passes through multiple transistors and long lengths of wire. The solution is that each segment has a circuit to amplify the signal, using a clocked inverter and an asymmetrical inverter. Importantly, this amplifier is not in the carry chain path, so it doesn't slow down the signal through the chain. The Manchester carry chain circuit for a typical bit in the incrementer. The schematic above shows the implementation of the Manchester carry chain for a typical bit. The chain itself is at the bottom, with the transistor switch as before. During clock phase 1, the precharge transistor pulls this segment of the carry chain high. During clock phase 2, the signal on the chain goes through the "clocked inverter" at the right to produce the local carry signal. If there is a carry, the next bit is flipped by the XOR gate, producing the incremented output.6 The "keeper/amplifier" is an asymmetrical inverter that produces a strong low output but a weak high output. When there is no carry, its weak output keeps the carry chain pulled high. But as soon as a carry is detected, it strongly pulls the carry chain low to boost the carry signal. But this circuit still isn't enough for the desired performance. The incrementer uses a second carry technique in parallel: carry skip. The concept is to look at blocks of bits and allow the carry to jump over the entire block. The diagram below shows a simplified implementation of the carry skip circuit. Each block consists of 3 to 6 bits. If all the bits in a block are 1's, then the AND gate turns on the associated transistor in the carry skip line. This allows the carry skip signal to propagate (from left to right), a block at a time. When it reaches a block with a 0 bit, the corresponding transistor will be off, stopping the carry as in the Manchester carry chain. The AND gates all operate in parallel, so the transistors are rapidly turned on or off in parallel. Then, the carry skip signal passes through a small number of transistors, without going through any logic. (The carry skip signal is like an express train that skips most stations, while the Manchester carry chain is the local train to all the stations.) Like the Manchester carry chain, the implementation of carry skip needs precharge circuits on the lines, a keeper/amplifier, and clocked logic, but I'll skip the details. An abstracted and simplified carry-skip circuit. The block sizes don't match the 386's circuit. One interesting feature is the layout of the large AND gates. A 6-input AND gate is a large device, difficult to fit into one cell of the incrementer. The solution is that the gate is spread out across multiple cells. Specifically, the gate uses a standard CMOS NAND gate circuit with NMOS transistors in series and PMOS transistors in parallel. Each cell has an NMOS transistor and a PMOS transistor, and the chains are connected at the end to form the desired NAND gate. (Inverting the output produces the desired AND function.) This spread-out layout technique is unusual, but keeps each bit's circuitry approximately the same size. The incrementer circuitry was tricky to reverse engineer because of these techniques. In particular, most of the prefetcher consists of a single block of circuitry repeated 32 times, once for each bit. The incrementer, on the other hand, consists of four different blocks of circuitry, repeating in an irregular pattern. Specifically, one block starts a carry chain, a second block continues the carry chain, and a third block ends a carry chain. The block before the ending block is different (one large transistor to drive the last block), making four variants in total. This irregular pattern is visible in the earlier photo of the prefetcher. The alignment network The bottom part of the prefetcher rotates data to align it as needed. Unlike some processors, the x86 does not enforce aligned memory accesses. That is, a 32-bit value does not need to start on a 4-byte boundary in memory. As a result, a 32-bit value may be split across two 32-bit rows of the prefetch queue. Moreover, when the instruction decoder fetches one byte of an instruction, that byte may be at any position in the prefetch queue. To deal with these problems, the prefetcher includes an alignment network that can rotate bytes to output a byte, word, or four bytes with the alignment required by the rest of the processor. The diagram below shows part of this alignment network. Each bit exiting the prefetch queue (top) has four wires, for rotates of 24, 16, 8, or 0 bits. Each rotate wire is connected to one of the 32 horizontal bit lines. Finally, each horizontal bit line has an output tap, going to the datapath below. (The vertical lines are in the chip's lower M1 metal layer, while the horizontal lines are in the upper M2 metal layer. For this photo, I removed the M2 layer to show the underlying layer. Shadows of the original horizontal lines are still visible.) Part of the alignment network. The idea is that by selecting one set of vertical rotate lines, the 32-bit output from the prefetch queue will be rotated left by that amount. For instance, to rotate by 8, bits are sent down the "rotate 8" lines. Bit 0 from the prefetch queue will energize horizontal line 8, bit 1 will energize horizontal line 9, and so forth, with bit 31 wrapping around to horizontal line 7. Since horizontal bit line 8 is connected to output 8, the result is that bit 0 is output as bit 8, bit 1 is output as bit 9, and so forth. The four possibilities for aligning a 32-bit value. The four bytes above are shifted as specified to produce the desired output below. For the alignment process, one 32-bit output may be split across two 32-bit entries in the prefetch queue in four different ways, as shown above. These combinations are implemented by multiplexers and drivers. Two 32-bit multiplexers select the two relevant rows in the prefetch queue (blue and green above). Four 32-bit drivers are connected to the four sets of vertical lines, with one set of drivers activated to produce the desired shift. Each byte of each driver is wired to achieve the alignment shown above. For instance, the rotate-8 driver gets its top byte from the "green" multiplexer and the other three bytes from the "blue" multiplexer. The result is that the four bytes, split across two queue rows, are rotated to form an aligned 32-bit value. Sign extension The final circuit is sign extension. Suppose you want to add an 8-bit value to a 32-bit value. An unsigned 8-bit value can be extended to 32 bits by simply filling the upper bits with zeroes. But for a signed value, it's trickier. For instance, -1 is the eight-bit value 0xFF, but the 32-bit value is 0xFFFFFFFF. To convert an 8-bit signed value to 32 bits, the top 24 bits must be filled in with the top bit of the original value (which indicates the sign). In other words, for a positive value, the extra bits are filled with 0, but for a negative value, the extra bits are filled with 1. This process is called sign extension.9 In the 386, a circuit at the bottom of the prefetcher performs sign extension for values in instructions. This circuit supports extending an 8-bit value to 16 bits or 32 bits, as well as extending a 16-bit value to 32 bits. This circuit will extend a value with zeros or with the sign, depending on the instruction. The schematic below shows one bit of this sign extension circuit. It consists of a latch on the left and right, with a multiplexer in the middle. The latches are constructed with a standard 386 circuit using a CMOS switch (see footnote).7 The multiplexer selects one of three values: the bit value from the swap network, 0 for sign extension, or 1 for sign extension. The multiplexer is constructed from a CMOS switch if the bit value is selected and two transistors for the 0 or 1 values. This circuit is replicated 32 times, although the bottom byte only has the latches, not the multiplexer, as sign extension does not modify the bottom byte. The sign extend circuit associated with bits 31-8 from the prefetcher. The second part of the sign extension circuitry determines if the bits should be filled with 0 or 1 and sends the control signals to the circuit above. The gates on the left determine if the sign extension bit should be a 0 or a 1. For a 16-bit sign extension, this bit comes from bit 15 of the data, while for an 8-bit sign extension, the bit comes from bit 7. The four gates on the right generate the signals to sign extend each bit, producing separate signals for the bit range 31-16 and the range 15-8. This circuit determines which bits should be filled with 0 or 1. The layout of this circuit on the die is somewhat unusual. Most of the prefetcher circuitry consists of 32 identical columns, one for each bit.8 The circuitry above is implemented once, using about 16 gates (buffers and inverters are not shown above). Despite this, the circuitry above is crammed into bit positions 17 through 7, creating irregularities in the layout. Moreover, the implementation of the circuitry in silicon is unusual compared to the rest of the 386. Most of the 386's circuitry uses the two metal layers for interconnection, minimizing the use of polysilicon wiring. However, the circuit above also uses long stretches of polysilicon to connect the gates. Layout of the sign extension circuitry. This circuitry is at the bottom of the prefetch queue. The diagram above shows the irregular layout of the sign extension circuitry amid the regular datapath circuitry that is 32 bits wide. The sign extension circuitry is shown in green; this is the circuitry described at the top of this section, repeated for each bit 31-8. The circuitry for bits 15-8 has been shifted upward, perhaps to make room for the sign extension control circuitry, indicated in red. Note that the layout of the control circuitry is completely irregular, since there is one copy of the circuitry and it has no internal structure. One consequence of this layout is the wasted space to the left and right of this circuitry block, the tan regions with no circuitry except vertical metal lines passing through. At the far right, a block of circuitry to control the latches has been wedged under bit 0. Intel's designers go to great effort to minimize the size of the processor die since a smaller die saves substantial money. This layout must have been the most efficient they could manage, but I find it aesthetically displeasing compared to the regularity of the rest of the datapath. How instructions flow through the chip Instructions follow a tortuous path through the 386 chip. First, the Bus Interface Unit in the upper right corner reads instructions from memory and sends them over a 32-bit bus (blue) to the prefetch unit. The prefetch unit stores the instructions in the 16-byte prefetch queue. Instructions follow a twisting path to and from the prefetch queue. How is an instruction executed from the prefetch queue? It turns out that there are two distinct paths. Suppose you're executing an instruction to add 12345678 to the EAX register. The prefetch queue will hold the five bytes 05 (the opcode), 78, 56, 34, and 12. The prefetch queue provides opcodes to the decoder one byte at a time over the 8-bit bus shown in red. The bus takes the lowest 8 bits from the prefetch queue's alignment network and sends this byte to a buffer (the small square at the head of the red arrow). From there, the opcode travels to the instruction decoder.10 The instruction decoder, in turn, uses large tables (PLAs) to convert the x86 instruction into a 111-bit internal format with 19 different fields.11 The data bytes of an instruction, on the other hand, go from the prefetch queue to the ALU (Arithmetic Logic Unit) through a 32-bit data bus (orange). Unlike the previous buses, this data bus is spread out, with one wire through each column of the datapath. This bus extends through the entire datapath so values can also be stored into registers. For instance, the MOV (move) instruction can store a value from an instruction (an "immediate" value) into a register. Conclusions The 386's prefetch queue contains about 7400 transistors, more than an Intel 8080 processor. (And this is just the queue itself; I'm ignoring the prefetch control logic.) This illustrates the rapid advance of processor technology: part of one functional unit in the 386 contains more transistors than an entire 8080 processor from 11 years earlier. And this unit is less than 3% of the entire 386 processor. Every time I look at an x86 circuit, I see the complexity required to support backward compatibility, and I gain more understanding of why RISC became popular. The prefetcher is no exception. Much of the complexity is due to the 386's support for unaligned memory accesses, requiring a byte shift network to move bytes into 32-bit alignment. Moreover, at the other end of the instruction bus is the complicated instruction decoder that decodes intricate x86 instructions. Decoding RISC instructions is much easier. In any case, I hope you've found this look at the prefetch circuitry interesting. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates. I've written multiple articles on the 386 previously; a good place to start might be my survey of the 368 dies. Footnotes and references The width of the circuitry for one bit changes a few times: while the prefetch queue and segment descriptor cache use a circuit that is 66 µm wide, the datapath circuitry is a bit tighter at 60 µm. The barrel shifter is even narrower at 54.5 µm per bit. Connecting circuits with different widths wastes space, since the wiring to connect the bits requires horizontal segments to adjust the spacing. But it also wastes space to use widths that are wider than needed. Thus, changes in the spacing are rare, where the tradeoffs make it worthwhile. ↩ The Intel 8086 processor had a six-byte prefetch queue, while the Intel 8088 (used in the original IBM PC) had a prefetch queue of just four bytes. In comparison, the 16-byte queue of the 386 seems luxurious. (Some 386 processors, however, are said to only use 12 bytes due to a bug.) The prefetch queue assumes instructions are executed in linear order, so it doesn't help with branches or loops. If the processor encounters a branch, the prefetch queue is discarded. (In contrast, a modern cache will work even if execution jumps around.) Moreover, the prefetch queue doesn't handle self-modifying code. (It used to be common for code to change itself while executing to squeeze out extra performance.) By loading code into the prefetch queue and then modifying instructions, you could determine the size of the prefetch queue: if the old instruction was executed, it must be in the prefetch queue, but if the modified instruction was executed, it must be outside the prefetch queue. Starting with the Pentium Pro, x86 processors flush the prefetch queue if a write modifies a prefetched instruction. ↩ The prefetch unit generates "linear" addresses that must be translated to physical addresses by the paging unit (ref). ↩ I don't know which phase of the clock is phase 1 and which is phase 2, so I've assigned the numbers arbitrarily. The 386 creates four clock signals internally from a clock input CLK2 that runs at twice the processor's clock speed. The 386 generates a two-phase clock with non-overlapping phases. That is, there is a small gap between when the first phase is high and when the second phase is high. The 386's circuitry is controlled by the clock, with alternate blocks controlled by alternate phases. Since the clock phases don't overlap, this ensures that logic blocks are activated in sequence, allowing the orderly flow of data. But because the 386 uses CMOS, it also needs active-low clocks for the PMOS transistors. You might think that you could simply use the phase 1 clock as the active-low phase 2 clock and vice versa. The problem is that these clock phases overlap when used as active-low; there are times when both clock signals are low. Thus, the two clock phases must be explicitly inverted to produce the two active-low clock phases. I described the 386's clock generation circuitry in detail in this article. ↩ The Manchester carry chain is typically used in an adder, which makes it more complicated than shown here. In particular, a new carry can be generated when two 1 bits are added. Since we're looking at an incrementer, this case can be ignored. The Manchester carry chain was first described in Parallel addition in digital computers: a new fast ‘carry’ circuit. It was developed at the University of Manchester in 1959 and used in the Atlas supercomputer. ↩ For some reason, the incrementer uses a completely different XOR circuit from the comparator, built from a multiplexer instead of logic. In the circuit below, the two CMOS switches form a multiplexer: if the first input is 1, the top switch turns on, while if the first input is a 0, the bottom switch turns on. Thus, if the first input is a 1, the second input passes through and then is inverted to form the output. But if the first input is a 0, the second input is inverted before the switch and then is inverted again to form the output. Thus, the second input is inverted if the first input is 1, which is a description of XOR. The implementation of an XOR gate in the incrementer. I don't see any clear reason why two different XOR circuits were used in different parts of the prefetcher. Perhaps the available space for the layout made a difference. Or maybe the different circuits have different timing or output current characteristics. Or it could just be the personal preference of the designers. ↩ The latch circuit is based on a CMOS switch (or transmission gate) and a weak inverter. Normally, the inverter loop holds the bit. However, if the CMOS switch is enabled, its output overpowers the signal from the weak inverter, forcing the inverter loop into the desired state. The CMOS switch consists of an NMOS transistor and a PMOS transistor in parallel. By setting the top control input high and the bottom control input low, both transistors turn on, allowing the signal to pass through the switch. Conversely, by setting the top input low and the bottom input high, both transistors turn off, blocking the signal. CMOS switches are used extensively in the 386, to form multiplexers, create latches, and implement XOR. ↩ Most of the 386's control circuitry is to the right of the datapath, rather than awkwardly wedged into the datapath. So why is this circuit different? My hypothesis is that since the circuit needs the values of bit 15 and bit 7, it made sense to put the circuitry next to bits 15 and 7; if this control circuitry were off to the right, long wires would need to run from bits 15 and 7 to the circuitry. ↩ In case this post is getting tedious, I'll provide a lighter footnote on sign extension. The obvious mnemonic for a sign extension instruction is SEX, but that mnemonic was too risque for Intel. The Motorola 6809 processor (1978) used this mnemonic, as did the related 68HC12 microcontroller (1996). However, Steve Morse, architect of the 8086, stated that the sign extension instructions on the 8086 were initially named SEX but were renamed before release to the more conservative CBW and CWD (Convert Byte to Word and Convert Word to Double word). The DEC PDP-11 was a bit contradictory. It has a sign extend instruction with the mnemonic SXT; the Jargon File claims that DEC engineers almost got SEX as the assembler mnemonic, but marketing forced the change. On the other hand, SEX was the official abbreviation for Sign Extend (see PDP-11 Conventions Manual, PDP-11 Paper Tape Software Handbook) and SEX was used in the microcode for sign extend. RCA's CDP1802 processor (1976) may have been the first with a SEX instruction, using the mnemonic SEX for the unrelated Set X instruction. See also this Retrocomputing Stack Exchange page. ↩ It seems inconvenient to send instructions all the way across the chip from the Bus Interface Unit to the prefetch queue and then back across to the chip to the instruction decoder, which is next to the Bus Interface Unit. But this was probably the best alternative for the layout, since you can't put everything close to everything. The 32-bit datapath circuitry is on the left, organized into 32 columns. It would be nice to put the Bus Interface Unit other there too, but there isn't room, so you end up with the wide 32-bit data bus going across the chip. Sending instruction bytes across the chip is less of an impact, since the instruction bus is just 8 bits wide. ↩ See "Performance Optimizations of the 80386", Slager, Oct 1986, in Proceedings of ICCD, pages 165-168. ↩

yesterday 4 votes
Code Matters

It looks like the code that the newly announced Figma Sites is producing isn’t the best. There are some cool Figma-to-WordPress workflows; I hope Sites gets more people exploring those options.

2 days ago 5 votes
What got you here…

John Siracusa: Apple Turnover From virtue comes money, and all other good things. This idea rings in my head whenever I think about Apple. It’s the most succinct explanation of what pulled Apple from the brink of bankruptcy in the 1990s to its astronomical success today. Don’

2 days ago 3 votes
A single RGB camera turns your palm into a keyboard for mixed reality interaction

Interactions in mixed reality are a challenge. Nobody wants to hold bulky controllers and type by clicking on big virtual keys one at a time. But people also don’t want to carry around dedicated physical keyboard devices just to type every now and then. That’s why a team of computer scientists from China’s Tsinghua University […] The post A single RGB camera turns your palm into a keyboard for mixed reality interaction appeared first on Arduino Blog.

2 days ago 3 votes