More from Yazin Alirhayim
Having been in fintech for a while, I’ve noticed something in common between the many new startups that come and go: they all require access to personal information. A specific example that comes to mind is the Plaid-like solution we developed while working on amal. We’d ask customers for their bank credentials[1] and then log in on their behalf in order to share the resulting account information with the 3rd party application they are using (e.g. for transferring money, tracking their spending, etc). I’ve noticed similar things with companies that try to optimize the checkout experience – almost always by becoming the interface layer that users interact with, necessarily being exposed to sensitive cardholder information like the card number, CVC and, in some instances, even the PIN.

The problem in all of these situations is that these apps require users to trust them, but provide no means of verifying that this trust is not misplaced.

“Trust, but verify” — Russian proverb

Is there a better way?

Some Assumptions

Many throw their hands in the air and point to the protectionist practices of financial institutions that prevent data sharing — without APIs, the only real option for authenticating users is with the same credentials they use themselves. To explore better ways forward, let’s make a few assumptions (and eliminate thorny edge cases that frontload a lot of complexity):

The application handles sensitive data
It performs a deterministic task (i.e. the same inputs result in the same output)
It does not rely on state
The application doesn’t keep logs (or retains anonymized and desensitized logs to aid in debugging)

What we’re after

Let’s be clear about what we’re after. If an app holds to the assumptions we set earlier, then we should be able to use that app in a way that allows independent third parties to verify that the app is doing exactly what it says it’s doing. In the case of a checkout app, for instance, we should be able to verify that it’s not stealing customers’ card details and siphoning them off to the Russian mafia.

Front-end applications

This is simple enough to do, to some extent, for front-end applications – because we can inspect the code ourselves (albeit code that is obfuscated to an increasing degree). Even in the worst case, we’d be able to track the network requests being made from the application – as well as any usage of local storage – to ensure there’s nothing fishy going on.

Backend

But what about the backend? Backends are completely opaque to anyone outside the server; true black boxes, where only the inputs and outputs can be observed, but not what’s performed inside. This is exacerbated by the fact that front-ends are also the front lines of the war against all sorts of user-side attacks (CSRF, poisoning, etc). This has resulted in things like CORS, which prevent clients from making HTTP requests that would otherwise work completely fine on backends.

Options For Greater Transparency

1. Fatter front-ends for security

[1] It’s so sad that it’s come to this. Sharing bank credentials with 3rd parties should never be OK, and yet it seems to be the only practical way that third party apps can communicate with financial institutions.
Optionality’s one of those things you don’t really think about. People don’t generally wake up one morning thinking, “Why, it appears I’ve spent the past several decades of my life optimizing for optionality. Perhaps I should figure out why?” Most of us don’t even recognize the term – until it’s explained to us. If you’re one of those people, this could well be the most important post you read this year.

What Is Optionality?

Optionality has been drilled into our heads ever since we were young. No one calls it that, but every one of us has heard the advice to follow opportunities that “open up doors” or “unlock opportunities” down the road. Now that’s all fine and dandy when you’re 12, but our generation is increasingly becoming crippled by the abundance of choices – making it damn near impossible to decide on anything of significance anymore. That’s tragic, because optionality is also a privilege. And this disease has a tendency to afflict the most privileged among us – rendering them hollow.

In this post, we’ll run through how optionality has pervaded our entire lives, why it can unconsciously derail entire careers, and what you can do about it. Let’s dive in!

The Obsession With More (Options)

We’ve been taught to seek out options that open up horizons, unlock doors and unleash possibilities. This expansive approach to life works great early on, enabling us to have more choices about what we’d like to ultimately do. The trouble is that last bit: “(what we’d like) to ultimately do” – opening doors is a means to an end, not the end itself. By the time we’re young adults, we’ve had this idea of optionality ingrained into our heads – and we don’t stop to reconsider what we’d actually like to ultimately do. We keep opening doors like it’s nobody’s business – resulting in so many unlocked doors that it’s crippling to pick just one.

This phenomenon seems to have existed for a while, but has become increasingly prevalent in recent years. You see it in everything from students choosing varied majors and minors to people choosing to rent instead of buy. We crave choice, and are horrified by its opposite: commitment.

Why this deep-seated aversion to risk? I believe it originates from two human afflictions:

Fear of missing out (FOMO): What if the choice I make is the “wrong” one, and the other choices somehow diminish in value?
Fear of failure: There’s something scary about committing to an idea; that decision brings with it the possibility of real failure.

It’s easy to see the allure of deferring the decision in favor of collecting more and more safety nets to soften a (potential) fall down the road. But unlike their financial brethren, real-life options are pointless to amass beyond a point. An individual option is useless unless nurtured through the dedication of time and effort – and there’s only so much we can do in one life.

But there’s more to it than that. Let’s take a deeper look at the downsides of optionality, and what we can do about it.

Why Optionality Is Bad

1. The Opportunity Cost

Our net output is the result of focused effort, and optionality is the exact opposite of focus. Rather than exploit any single option, we defer it for some undefined later – a later that often never materializes. This is akin to someone in the early ’90s having the ideas for Facebook, Google and Amazon all at once – and pursuing none of them in favor of unlocking more doors as a business consultant.

Over 1,000 octogenarians (those aged 80 and above) were asked about their single biggest regret in life.
The most common answer, by a landslide, was this: they regret the risks not taken. You can bet that’s what our fictional character from the previous paragraph would say too. “Oh, I had that idea years ago” means nothing if you chose not to do anything about it.

2. The Paradox of Choice

More options make it harder to choose later, not easier. That’s unintuitive; you’d think more is better, but it leads to all sorts of strange side-effects. In his book “The Paradox of Choice”, Barry Schwartz describes the now-famous jam study: shoppers offered a display of 24 jams were only about a tenth as likely to buy as those offered just 6 – the bigger display drew more interest but produced far fewer decisions.

The trouble with accumulating options is that, like money, you can never have enough. You postpone your dreams, stuck in a continuous loop of preparation that’ll never end. And if you do ultimately pick between the options, you’re less satisfied overall – the sheer number of choices makes you second-guess your decision, and ultimately leads to a less fulfilling life.

3. It Changes How You Think

Finally, and worst of all, the pursuit of optionality changes you. It turns you into a dreamer, always amassing options but never executing on any of them. It shares plenty in common with wantrepreneurs, the variety of entrepreneurs that never get started – always waiting for the perfect circumstances to unfold (which they never do).

The Cure To Optionality Addiction

Now that we realize the grave downsides of optionality, we’re ready to talk solutions. The first one should be clear: commit. That means picking a single course of action that you would like to pursue – picking one option out of the portfolio that you’ve amassed. You should take your time to study the options, deliberate, and find the one that makes the most sense for you. But once you’ve decided, there should be no looking back.

To make sure you don’t turn back halfway, consider publishing this commitment publicly. Tell your friends, write a blog post, publish it to LinkedIn. Sharing this commitment with others can provide the social pressure necessary to continue, despite the internal urge to back out. Also, as you exercise your option, ignore all the others. Focus solely on the task at hand.
Been having a hard time focusing lately. It’s like whenever I start doing anything of any significance, I get derailed and fall into this spiral of thought where I reconsider whether what I’m about to do matters, why it would, and whether I could be doing something else that would be more “productive”. The end result is that I end up doing nothing (no, the irony is not lost on me).

This is a strange state for me to be in, because it isn’t one I recall having gone through before. I’m usually able to operate at a high cadence, get stuff done and just generally apply myself to projects with high autonomy and little structure. I think it may have something to do with the overall situation around COVID, and that I’ve been effectively operating indoors for the past 9 months (!). Cafes, restaurants and most public places are still take-out only, so it’s difficult to hang out in person (if you’re not working at an office).

I don’t know if these are symptoms of depression, or just general apathy resulting from severely diminished social contact. Maybe it’s a bit of both? Again, I don’t recall ever having gone through a “proper” episode of depression before, so I’m unsure it is in fact that. I was reading about it on Wikipedia yesterday and came across an article describing negative self-talk, something I definitely encounter a lot more as of late. The article suggested Cognitive Behavioral Therapy (CBT) as a solution (in addition to exercise, which I already do a few times a week). CBT is a program often run by a counselor (in person) and involves examining the Behavior, Thoughts and Feelings of a subject in order to pinpoint the specific behavioral pattern that needs to be changed (either intensified, or diminished).

This all sounds great in theory, but finding a therapist I trust to do this sounds like a real chore - especially in Bahrain. I’ve found online providers that deliver virtual therapy; I’ve also read that CCBT - Computerized CBT (delivered primarily via apps) - can be effective if used consistently. And so that’s what I’m doing now: using an app called MoodKit (iOS only) that sends me occasional prompts to jot down what I’m thinking, and walks me through the process of assessing my thoughts to see if they’ve fallen victim to “distortions” (yes, you bet they have!). I’ll report back once I’ve had at least a few weeks of experience with it and let you know how it all went.

Of course, I don’t plan to just sit around all day while I wait to get better. I want to apply myself, to express myself creatively, and make something useful. But that’s the operative word: “something”. Like, “one thing”. Not 500.
It used to be that you needed to know Kotlin and Swift to develop apps for both Android and iOS, but those days are long gone. Even the fundamental trade-off has changed: going cross-platform used to mean compromising between performance and speed to deployment. React Native apps were a smidge slower than native apps, but with the introduction of Flutter that difference was wiped out almost entirely. With Flutter, you get native app performance because you leave HTML and JS behind: Dart compiles ahead of time to native code.

Getting the Basics

Flutter has an excellent reference for web developers that I found more useful than a lot of the intro guides out there. I don’t need to know what a variable or for loop is — just tell me the concepts I need to hit the ground running. Coupled with this brief high-level overview, it’s pretty much all you need to understand Flutter conceptually.

Getting up and running with Flutter, quickly

A great place to start was to quickly paste snippets from the web developer article I mentioned above into DartPad and see the results immediately. This gave me a feel for how things work in Flutter land, without fretting too much about setting everything up locally.

A+ Resources

Build a sick planet app from scratch. No fluff, all goodness. Loved recreating this myself.
Flutter Layout cheat sheet
Build a functional Startup Name Generator, start to finish, from scratch
Comprehensive reference for more great resources on Flutter

👉 Stay tuned for my first app on Flutter, in a future post!
Yes, that’s right. The best bank in Bahrain is … Meem. This nomination will be especially shocking to those of you that read my last post, which ripped the Meem app to shreds.

Criteria for ‘best bank’

My criteria for “best bank” isn’t the best user experience or customer support. It’s all about the best bank for your buck. There are two main parts to this:

What’s the best cashback rate I can get on my spending?
What’s the best profit rate I can get on my deposits?

While I always limit myself to shariah-compliant options, you’ll soon see that the rates Meem offers are actually superior to even the non-compliant options in Bahrain.

Cashback 💳

I’m always befuddled when I’m standing in line at the cashier and see the person in front of me whip out a debit card. Why on earth are you paying with a debit card?? Paying with a credit card that has cashback is literally free money (assuming, of course, that you pay the bill in full at the end of each month). Here are the cashback rates for the Meem credit card[1]:

1% cashback for spending below BD 500 a month
2% cashback for spending between BD 500 and BD 1,000 a month
3% cashback for spending above BD 1,000 a month (capped at BD 50)

Take a second here to recognize how insanely good these rates are! Not just at the local or regional level … even globally. You’d have a hard time finding a card that offers more than 2% cashback in the US; see for yourself[2]. As for Bahrain, no bank even comes close. Here are the cashback rates for a sample of popular Bahraini banks:

Ithmaar Bank — 0.2%
NBB — 0%
Credimax — 0% on local purchases; 1% on international

Not. Even. Close. I hit the 3% rate consistently with Meem (since I use my card to pay a lot of my online business bills), and this means BD 50 of free money each month, or BD 600/yr. Nuts!

Return on deposits 💰

These are typically known in the industry as “fixed-term deposits” (or Fixed Deposits). This means locking up your money for a fixed period between 1 month and 1 year, with a guaranteed rate of return (always quoted per year, regardless of the term you pick). While fixed deposits are not Sharia compliant, Murabaha is — a variant where you purchase something from the bank (usually a commodity, like cement) and sell it back for a profit at the end of the term.

Once again, the rates Meem offers are higher than anything I’ve seen offered by any other bank in Bahrain. Fixed deposits are tricky to compare since the rates change during the year (with the change in the base rate by the CBB), and they also vary by currency. Here are the rates for Meem in January 2020[3]:

2.33% on USD (90-day term) — no minimum
1.17% on BHD (90-day term) — no minimum

Obviously, there’s a huge difference between BHD and USD — so you’re better off converting to USD when you deposit. Meem makes this easy by offering multicurrency accounts (the BHD/USD rate is pegged by the CBB; Meem provides a significantly better rate of 0.3765 BHD/USD vs the 0.375 BHD/USD that most banks in Bahrain provide).

So, 2.33% for Meem. Let’s compare to a few other banks in Bahrain, shall we:

KFH: 1.85% BHD (90-day terms) — BD 600 min.
Citi: 1.75% BHD / 0.85% USD (90-day terms)

UPDATE: A friend tipped me off to Jazeel, a digital bank launched by KFH, that offers Wakala rates that match Meem’s USD rates but for BHD[4]. Details:

KFH Jazeel: 2.3% BHD (90-day terms) — BD 1,000 min.

If you can meet the BD 1,000 minimum, then they’d be a better option for BHD deposits than Meem’s.
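To make the headline numbers concrete, here is a rough sketch in Python of the math above (my own code, not anything Meem publishes). One assumption to flag: I read the tiers as applying each rate to the entire month's spend, which is how the app's breakdown reads to me.

```python
# Rough model of the rates quoted above (January 2020 figures; they change).
# Assumption: the tier rate applies to the whole month's spend.

def meem_monthly_cashback(spend_bhd: float) -> float:
    if spend_bhd < 500:
        rate = 0.01
    elif spend_bhd <= 1000:
        rate = 0.02
    else:
        rate = 0.03
    return min(spend_bhd * rate, 50.0)       # capped at BD 50 a month

def murabaha_annual_profit(principal_bhd: float, annual_rate: float) -> float:
    # Rates are quoted per year regardless of term, so a 90-day deposit
    # rolled over for a full year earns roughly principal * rate.
    return principal_bhd * annual_rate

print(meem_monthly_cashback(1700) * 12)      # BD 600/yr at the 3% tier (hits the cap)
print(murabaha_annual_profit(1000, 0.0233))  # ~BD 23/yr per BD 1,000 on USD terms
```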
I’m unable to provide more links to banks in Bahrain, because they don’t share their rates online. To my knowledge, those rates are less competitive than the KFH rates cited above (and Meem’s, of course). If you’ve got more numbers for me to include, please let me know.

Conclusion

Using Meem’s financial products can earn you:

BD 600 / year in cashback
BD 23 / year for every BD 1,000 you deposit (on a USD-denominated Murabaha deposit @ a 90-day term)

That’s better than every other bank in Bahrain I’ve come across, and they are Sharia compliant. There you go, folks — you’re welcome! Share this post around if you found it useful.

👣 Footnotes

[1] Meem isn’t very forthcoming on their website about the cashback rate, simply saying that they offer “up to 3% cashback” on credit card purchases. Their app provides the breakdown that I shared.

[2] Any rates over 2% usually come with multiple qualifiers that limit their utility — restricted to specific merchants or categories, and always capped at a very low yearly amount.

[3] Not shared on their website, but available in the app. Screenshots below:

[4] Not shared on their website, but available in the app. Screenshots below:
More in technology
Exploring a peculiar bit-twiddling hack rooted in 1980s geek sensibilities.
In 1985, Intel introduced the groundbreaking 386 processor, the first 32-bit processor in the x86 architecture. To improve performance, the 386 has a 16-byte instruction prefetch queue. The purpose of the prefetch queue is to fetch instructions from memory before they are needed, so the processor usually doesn't need to wait on memory while executing instructions. Instruction prefetching takes advantage of times when the processor is "thinking" and the memory bus would otherwise be unused. In this article, I look at the 386's prefetch queue circuitry in detail. One interesting circuit is the incrementer, which adds 1 to a pointer to step through memory. This sounds easy enough, but the incrementer uses complicated circuitry for high performance. The prefetch queue uses a large network to shift bytes around so they are properly aligned. It also has a compact circuit to extend signed 8-bit and 16-bit numbers to 32 bits. There aren't any major discoveries in this post, but if you're interested in low-level circuits and dynamic logic, keep reading.

The photo below shows the 386's shiny fingernail-sized silicon die under a microscope. Although it may look like an aerial view of a strangely-zoned city, the die photo reveals the functional blocks of the chip. The Prefetch Unit in the upper left is the relevant block. In this post, I'll discuss the prefetch queue circuitry (highlighted in red), skipping over the prefetch control circuitry to the right. The Prefetch Unit receives data from the Bus Interface Unit (upper right) that communicates with memory. The Instruction Decode Unit receives prefetched instructions from the Prefetch Unit, byte by byte, and decodes the opcodes for execution.

This die photo of the 386 shows the location of the registers. Click this image (or any other) for a larger version.

The left quarter of the chip consists of stripes of circuitry that appear much more orderly than the rest of the chip. This grid-like appearance arises because each functional block is constructed (for the most part) by repeating the same circuit 32 times, once for each bit, side by side. Vertical data lines run up and down, in groups of 32 bits, connecting the functional blocks. To make this work, each circuit must fit into the same width on the die; this layout constraint forces the circuit designers to develop a circuit that uses this fixed width efficiently without exceeding it. The circuitry for the prefetch queue uses the same approach: each circuit is 66 µm wide[1] and repeated 32 times. As will be seen, fitting the prefetch circuitry into this fixed width requires some layout tricks.

What the prefetcher does

The purpose of the prefetch unit is to speed up performance by reading instructions from memory before they are needed, so the processor won't need to wait to get instructions from memory. Prefetching takes advantage of times when the memory bus is otherwise idle, minimizing conflict with other instructions that are reading or writing data. In the 386, prefetched instructions are stored in a 16-byte queue, consisting of four 32-bit blocks.[2]

The diagram below zooms in on the prefetcher and shows its main components. You can see how the same circuit (in most cases) is repeated 32 times, forming vertical bands. At the top are 32 bus lines from the Bus Interface Unit. These lines provide the connection between the datapath and external memory, via the Bus Interface Unit.
These lines form a triangular pattern as the 32 horizontal lines on the right branch off and form 32 vertical lines, one for each bit. Next are the fetch pointer and the limit register, with a circuit to check if the fetch pointer has reached the limit. Note that the two low-order bits (on the right) of the incrementer and limit check circuit are missing. At the bottom of the incrementer, you can see that some bit positions have a blob of circuitry that is missing from others, breaking the pattern of repeated blocks. The 16-byte prefetch queue is below the incrementer. Although this memory is the heart of the prefetcher, its circuitry takes up a relatively small area.

A close-up of the prefetcher with the main blocks labeled.

At the right, the prefetcher receives control signals. The bottom part of the prefetcher shifts data to align it as needed. A 32-bit value can be split across two 32-bit rows of the prefetch buffer. To handle this, the prefetcher includes a data shift network to shift and align its data. This network occupies a lot of space, but there is no active circuitry here: just a grid of horizontal and vertical wires. Finally, the sign extend circuitry converts a signed 8-bit or 16-bit value into a signed 16-bit or 32-bit value as needed. You can see that the sign extend circuitry is highly irregular, especially in the middle. A latch stores the output of the prefetch queue for use by the rest of the datapath.

Limit check

If you've written x86 programs, you probably know about the processor's Instruction Pointer (EIP) that holds the address of the next instruction to execute. As a program executes, the Instruction Pointer moves from instruction to instruction. However, it turns out that the Instruction Pointer doesn't actually exist! Instead, the 386 has an "Advance Instruction Fetch Pointer", which holds the address of the next instruction to fetch into the prefetch queue. But sometimes the processor needs to know the Instruction Pointer value, for instance, to determine the return address when calling a subroutine or to compute the destination address of a relative jump. So what happens? The processor gets the Advance Instruction Fetch Pointer address from the prefetch queue circuitry and subtracts the current length of the prefetch queue. The result is the address of the next instruction to execute, the desired Instruction Pointer value.

The Advance Instruction Fetch Pointer—the address of the next instruction to prefetch—is stored in a register at the top of the prefetch queue circuitry. As instructions are prefetched, this pointer is incremented by the prefetch circuitry. (Since instructions are fetched 32 bits at a time, this pointer is incremented in steps of four and the bottom two bits are always 0.) But what keeps the prefetcher from prefetching too far and going outside the valid memory range? The x86 architecture infamously uses segments to define valid regions of memory. A segment has a start and end address (known as the base and limit) and memory is protected by blocking accesses outside the segment. The 386 has six active segments; the relevant one is the Code Segment that holds program instructions. Thus, the limit address of the Code Segment controls when the prefetcher must stop prefetching.[3] The prefetch queue contains a circuit to stop prefetching when the fetch pointer reaches the limit of the Code Segment. In this section, I'll describe that circuit. Comparing two values may seem trivial, but the 386 uses a few tricks to make this fast.
The basic idea is to use 30 XOR gates to compare the bits of the two registers. (Why 30 bits and not 32? Since 32 bits are fetched at a time, the bottom bits of the address are 00 and can be ignored.) If the two registers match, all the XOR values will be 0, but if they don't match, an XOR value will be 1. Conceptually, connecting the XORs to a 30-input OR gate will yield the desired result: 0 if all bits match and 1 if there is a mismatch. Unfortunately, building a 30-input OR gate using standard CMOS logic is impractical for electrical reasons, as well as inconveniently large to fit into the circuit. Instead, the 386 uses dynamic logic to implement a spread-out NOR gate with one transistor in each column of the prefetcher.

The schematic below shows the implementation of one bit of the equality comparison. The mechanism is that if the two registers differ, the transistor on the right is turned on, pulling the equality bus low. This circuit is replicated 30 times, comparing all the bits: if there is any mismatch, the equality bus will be pulled low, but if all bits match, the bus remains high. The three gates on the left implement XNOR; this circuit may seem overly complicated, but it is a standard way of implementing XNOR. The NOR gate at the right blocks the comparison except during clock phase 2. (The importance of this will be explained below.)

This circuit is repeated 30 times to compare the registers.

The equality bus travels horizontally through the prefetcher, pulled low if any bits don't match. But what pulls the bus high? That's the job of the dynamic circuit below. Unlike regular static gates, dynamic logic is controlled by the processor's clock signals and depends on capacitance in the circuit to hold data. The 386 is controlled by a two-phase clock signal.[4] In the first clock phase, the precharge transistor below turns on, pulling the equality bus high. In the second clock phase, the XOR circuits above are enabled, pulling the equality bus low if the two registers don't match. Meanwhile, the CMOS switch turns on in clock phase 2, passing the equality bus's value to the latch. The "keeper" circuit keeps the equality bus held high unless it is explicitly pulled low, to avoid the risk of the voltage on the equality bus slowly dissipating. The keeper uses a weak transistor to keep the bus high while inactive. But if the bus is pulled low, the keeper transistor is overpowered and turns off.

This is the output circuit for the equality comparison. This circuit is located to the right of the prefetcher.

This dynamic logic reduces power consumption and circuit size. Since the bus is charged and discharged during opposite clock phases, you avoid steady current through the transistors. (In contrast, an NMOS processor like the 8086 might use a pull-up on the bus. When the bus is pulled low, you would end up with current flowing through the pull-up and the pull-down transistors. This would increase power consumption, make the chip run hotter, and limit your clock speed.)
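In software terms, the whole limit check reduces to a 30-bit XOR followed by a wide NOR. Here is a tiny Python model of that logic (a sketch of the function the circuit computes, not of the dynamic circuit itself; the function name is mine):

```python
# Software model of the limit check: XOR each pair of bits, then "NOR" the
# results. The hardware computes the NOR with one transistor per column
# pulling a shared, precharged equality bus low on any mismatch.

def fetch_pointer_at_limit(fetch_ptr: int, limit: int) -> bool:
    # Fetches are 32-bit aligned, so address bits 1 and 0 are always 00
    # and only bits 31..2 take part in the comparison.
    mismatches = (fetch_ptr ^ limit) & ~0b11   # per-bit XOR, low bits masked
    return mismatches == 0                     # equal iff no bit mismatched
```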
The incrementer

After each prefetch, the Advance Instruction Fetch Pointer must be incremented to hold the address of the next instruction to prefetch. Incrementing this pointer is the job of the incrementer. (Because each fetch is 32 bits, the pointer is incremented by 4 each time. But in the die photo, you can see a notch in the incrementer and limit check circuit where the circuitry for the bottom two bits has been omitted. Thus, the incrementer's circuitry increments its value by 1, so the pointer (with two zero bits appended) increases in steps of 4.)

Building an incrementer circuit is straightforward: for example, you can use a chain of 30 half-adders. The problem is that incrementing a 30-bit value at high speed is difficult because of the carries from one position to the next. It's similar to calculating 99999999 + 1 in decimal; you need to tediously carry the 1, carry the 1, carry the 1, and so forth, through all the digits, resulting in a slow, sequential process. The incrementer uses a faster approach. First, it computes all the carries at high speed, almost in parallel. Then it computes each output bit in parallel from the carries—if there is a carry into a position, it toggles that bit.

Computing the carries is straightforward in concept: if there is a block of 1 bits at the end of the value, all those bits will produce carries, but carrying is stopped by the rightmost 0 bit. For instance, incrementing binary 11011 results in 11100; there are carries from the last two bits, but the zero stops the carries. A circuit to implement this was developed at the University of Manchester in England way back in 1959, and is known as the Manchester carry chain. In the Manchester carry chain, you build a chain of switches, one for each data bit, as shown below. For a 1 bit, you close the switch, but for a 0 bit you open the switch. (The switches are implemented by transistors.) To compute the carries, you start by feeding in a carry signal at the right. The signal will go through the closed switches until it hits an open switch, and then it will be blocked.[5] The outputs along the chain give us the desired carry value at each position.

Concept of the Manchester carry chain, 4 bits.

Since the switches in the Manchester carry chain can all be set in parallel and the carry signal blasts through the switches at high speed, this circuit rapidly computes the carries we need. The carries then flip the associated bits (in parallel), giving us the result much faster than a straightforward adder.
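Here is the same two-step idea as a Python sketch (my own model of the logic, not the circuit; in hardware, the sequential carry loop below is exactly what the Manchester carry chain replaces with an almost-parallel chain of switches):

```python
# Model of the incrementer: compute every carry first, then flip all the
# bits independently. A carry reaches bit i iff bits 0..i-1 are all 1s,
# i.e. the injected carry passes every closed switch up to the first 0.

def increment(value: int, width: int = 30) -> int:
    carry = [True] * (width + 1)       # carry[0] is the carry fed in at the right
    for i in range(width):             # the chain: a 0 bit opens the switch
        carry[i + 1] = carry[i] and bool((value >> i) & 1)
    result = 0
    for i in range(width):             # step 2: toggle each bit that gets a carry
        result |= (((value >> i) & 1) ^ carry[i]) << i
    return result

assert increment(0b11011) == 0b11100   # the example from the text
```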
There are complications, of course, in the actual implementation. The carry signal in the carry chain is inverted, so a low signal propagates through the carry chain to indicate a carry. (It is faster to pull a signal low than high.) But something needs to make the line go high when necessary. As with the equality circuitry, the solution is dynamic logic. That is, the carry line is precharged high during one clock phase and then processing happens in the second clock phase, potentially pulling the line low. The next problem is that the carry signal weakens as it passes through multiple transistors and long lengths of wire. The solution is that each segment has a circuit to amplify the signal, using a clocked inverter and an asymmetrical inverter. Importantly, this amplifier is not in the carry chain path, so it doesn't slow down the signal through the chain.

The Manchester carry chain circuit for a typical bit in the incrementer.

The schematic above shows the implementation of the Manchester carry chain for a typical bit. The chain itself is at the bottom, with the transistor switch as before. During clock phase 1, the precharge transistor pulls this segment of the carry chain high. During clock phase 2, the signal on the chain goes through the "clocked inverter" at the right to produce the local carry signal. If there is a carry, the next bit is flipped by the XOR gate, producing the incremented output.[6] The "keeper/amplifier" is an asymmetrical inverter that produces a strong low output but a weak high output. When there is no carry, its weak output keeps the carry chain pulled high. But as soon as a carry is detected, it strongly pulls the carry chain low to boost the carry signal.

But this circuit still isn't enough for the desired performance. The incrementer uses a second carry technique in parallel: carry skip. The concept is to look at blocks of bits and allow the carry to jump over the entire block. The diagram below shows a simplified implementation of the carry skip circuit. Each block consists of 3 to 6 bits. If all the bits in a block are 1's, then the AND gate turns on the associated transistor in the carry skip line. This allows the carry skip signal to propagate (from left to right), a block at a time. When it reaches a block with a 0 bit, the corresponding transistor will be off, stopping the carry as in the Manchester carry chain. The AND gates all operate in parallel, so the transistors are rapidly turned on or off in parallel. Then, the carry skip signal passes through a small number of transistors, without going through any logic. (The carry skip signal is like an express train that skips most stations, while the Manchester carry chain is the local train to all the stations.) Like the Manchester carry chain, the implementation of carry skip needs precharge circuits on the lines, a keeper/amplifier, and clocked logic, but I'll skip the details.

An abstracted and simplified carry-skip circuit. The block sizes don't match the 386's circuit.
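In the same software model, carry skip adds a per-block shortcut to the carry computation (a sketch of the idea; the block size below is illustrative, while the 386 uses blocks of 3 to 6 bits):

```python
# Carry computation with the skip shortcut: if the carry arrives at a block
# whose bits are all 1s (the AND gate), it hops the whole block at once;
# otherwise it ripples through the block as in the plain carry chain.

def carries_with_skip(value: int, width: int = 30, block: int = 5) -> list[bool]:
    carries, carry = [], True               # incrementing injects a carry at bit 0
    for start in range(0, width, block):
        stop = min(start + block, width)
        bits = [bool((value >> i) & 1) for i in range(start, stop)]
        if carry and all(bits):
            carries += [True] * len(bits)   # express train: skip the block
            continue
        for b in bits:                      # local train: bit-by-bit ripple
            carries.append(carry)
            carry = carry and b
    return carries
```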
One interesting feature is the layout of the large AND gates. A 6-input AND gate is a large device, difficult to fit into one cell of the incrementer. The solution is that the gate is spread out across multiple cells. Specifically, the gate uses a standard CMOS NAND gate circuit with NMOS transistors in series and PMOS transistors in parallel. Each cell has an NMOS transistor and a PMOS transistor, and the chains are connected at the end to form the desired NAND gate. (Inverting the output produces the desired AND function.) This spread-out layout technique is unusual, but keeps each bit's circuitry approximately the same size.

The incrementer circuitry was tricky to reverse engineer because of these techniques. In particular, most of the prefetcher consists of a single block of circuitry repeated 32 times, once for each bit. The incrementer, on the other hand, consists of four different blocks of circuitry, repeating in an irregular pattern. Specifically, one block starts a carry chain, a second block continues the carry chain, and a third block ends a carry chain. The block before the ending block is different (one large transistor to drive the last block), making four variants in total. This irregular pattern is visible in the earlier photo of the prefetcher.

The alignment network

The bottom part of the prefetcher rotates data to align it as needed. Unlike some processors, the x86 does not enforce aligned memory accesses. That is, a 32-bit value does not need to start on a 4-byte boundary in memory. As a result, a 32-bit value may be split across two 32-bit rows of the prefetch queue. Moreover, when the instruction decoder fetches one byte of an instruction, that byte may be at any position in the prefetch queue. To deal with these problems, the prefetcher includes an alignment network that can rotate bytes to output a byte, word, or four bytes with the alignment required by the rest of the processor.

The diagram below shows part of this alignment network. Each bit exiting the prefetch queue (top) has four wires, for rotates of 24, 16, 8, or 0 bits. Each rotate wire is connected to one of the 32 horizontal bit lines. Finally, each horizontal bit line has an output tap, going to the datapath below. (The vertical lines are in the chip's lower M1 metal layer, while the horizontal lines are in the upper M2 metal layer. For this photo, I removed the M2 layer to show the underlying layer. Shadows of the original horizontal lines are still visible.)

Part of the alignment network.

The idea is that by selecting one set of vertical rotate lines, the 32-bit output from the prefetch queue will be rotated left by that amount. For instance, to rotate by 8, bits are sent down the "rotate 8" lines. Bit 0 from the prefetch queue will energize horizontal line 8, bit 1 will energize horizontal line 9, and so forth, with bit 31 wrapping around to horizontal line 7. Since horizontal bit line 8 is connected to output 8, the result is that bit 0 is output as bit 8, bit 1 is output as bit 9, and so forth.

The four possibilities for aligning a 32-bit value. The four bytes above are shifted as specified to produce the desired output below.

For the alignment process, one 32-bit output may be split across two 32-bit entries in the prefetch queue in four different ways, as shown above. These combinations are implemented by multiplexers and drivers. Two 32-bit multiplexers select the two relevant rows in the prefetch queue (blue and green above). Four 32-bit drivers are connected to the four sets of vertical lines, with one set of drivers activated to produce the desired shift. Each byte of each driver is wired to achieve the alignment shown above. For instance, the rotate-8 driver gets its top byte from the "green" multiplexer and the other three bytes from the "blue" multiplexer. The result is that the four bytes, split across two queue rows, are rotated to form an aligned 32-bit value.
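Functionally, the two multiplexers and the four rotate drivers compute something like the following Python sketch (a model of the data movement, not of the wiring; the function name is mine, and the queue rows are modeled as little-endian 32-bit integers):

```python
# What the alignment network computes: fetch the 32-bit value that starts
# at an arbitrary byte offset, even when it straddles two queue rows.

def aligned_dword(queue_rows: list[int], byte_offset: int) -> int:
    row, shift = divmod(byte_offset, 4)           # which row, and byte within it
    blue = queue_rows[row % 4]                    # the two multiplexer outputs:
    green = queue_rows[(row + 1) % 4]             # the row and the row after it
    spanned = (green << 32) | blue                # 8 bytes covering both rows
    return (spanned >> (shift * 8)) & 0xFFFFFFFF  # "rotate" selects four of them

# Example: byte offset 6 yields the top two bytes of row 1 followed by the
# low two bytes of row 2, merged into one aligned 32-bit value.
```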
Sign extension

The final circuit is sign extension. Suppose you want to add an 8-bit value to a 32-bit value. An unsigned 8-bit value can be extended to 32 bits by simply filling the upper bits with zeroes. But for a signed value, it's trickier. For instance, -1 is the eight-bit value 0xFF, but the 32-bit value is 0xFFFFFFFF. To convert an 8-bit signed value to 32 bits, the top 24 bits must be filled in with the top bit of the original value (which indicates the sign). In other words, for a positive value, the extra bits are filled with 0, but for a negative value, the extra bits are filled with 1. This process is called sign extension.[9]

In the 386, a circuit at the bottom of the prefetcher performs sign extension for values in instructions. This circuit supports extending an 8-bit value to 16 bits or 32 bits, as well as extending a 16-bit value to 32 bits. This circuit will extend a value with zeros or with the sign, depending on the instruction.

The schematic below shows one bit of this sign extension circuit. It consists of a latch on the left and right, with a multiplexer in the middle. The latches are constructed with a standard 386 circuit using a CMOS switch (see footnote [7]). The multiplexer selects one of three values: the bit value from the swap network, 0 for sign extension, or 1 for sign extension. The multiplexer is constructed from a CMOS switch if the bit value is selected and two transistors for the 0 or 1 values. This circuit is replicated 32 times, although the bottom byte only has the latches, not the multiplexer, as sign extension does not modify the bottom byte.

The sign extend circuit associated with bits 31-8 from the prefetcher.

The second part of the sign extension circuitry determines if the bits should be filled with 0 or 1 and sends the control signals to the circuit above. The gates on the left determine if the sign extension bit should be a 0 or a 1. For a 16-bit sign extension, this bit comes from bit 15 of the data, while for an 8-bit sign extension, the bit comes from bit 7. The four gates on the right generate the signals to sign extend each bit, producing separate signals for the bit range 31-16 and the range 15-8.

This circuit determines which bits should be filled with 0 or 1.

The layout of this circuit on the die is somewhat unusual. Most of the prefetcher circuitry consists of 32 identical columns, one for each bit.[8] The circuitry above is implemented once, using about 16 gates (buffers and inverters are not shown above). Despite this, the circuitry above is crammed into bit positions 17 through 7, creating irregularities in the layout. Moreover, the implementation of the circuitry in silicon is unusual compared to the rest of the 386. Most of the 386's circuitry uses the two metal layers for interconnection, minimizing the use of polysilicon wiring. However, the circuit above also uses long stretches of polysilicon to connect the gates.

Layout of the sign extension circuitry. This circuitry is at the bottom of the prefetch queue.

The diagram above shows the irregular layout of the sign extension circuitry amid the regular datapath circuitry that is 32 bits wide. The sign extension circuitry is shown in green; this is the circuitry described at the top of this section, repeated for each bit 31-8. The circuitry for bits 15-8 has been shifted upward, perhaps to make room for the sign extension control circuitry, indicated in red. Note that the layout of the control circuitry is completely irregular, since there is one copy of the circuitry and it has no internal structure. One consequence of this layout is the wasted space to the left and right of this circuitry block, the tan regions with no circuitry except vertical metal lines passing through. At the far right, a block of circuitry to control the latches has been wedged under bit 0. Intel's designers go to great effort to minimize the size of the processor die since a smaller die saves substantial money. This layout must have been the most efficient they could manage, but I find it aesthetically displeasing compared to the regularity of the rest of the datapath.
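The operation itself is compact when written as code. Here is a Python sketch (the function name and signature are mine):

```python
# Sign (or zero) extension: fill the upper bits either with copies of the
# value's top bit or with zeros, depending on the instruction.

def extend(value: int, from_bits: int, to_bits: int, signed: bool) -> int:
    top_bit = (value >> (from_bits - 1)) & 1
    if signed and top_bit:
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value

assert extend(0xFF, 8, 32, signed=True) == 0xFFFFFFFF   # -1 extends to -1
assert extend(0xFF, 8, 32, signed=False) == 0x000000FF  # zero extension
```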
How instructions flow through the chip

Instructions follow a tortuous path through the 386 chip. First, the Bus Interface Unit in the upper right corner reads instructions from memory and sends them over a 32-bit bus (blue) to the prefetch unit. The prefetch unit stores the instructions in the 16-byte prefetch queue.

Instructions follow a twisting path to and from the prefetch queue.

How is an instruction executed from the prefetch queue? It turns out that there are two distinct paths. Suppose you're executing an instruction to add 0x12345678 to the EAX register. The prefetch queue will hold the five bytes 05 (the opcode), 78, 56, 34, and 12. The prefetch queue provides opcodes to the decoder one byte at a time over the 8-bit bus shown in red. The bus takes the lowest 8 bits from the prefetch queue's alignment network and sends this byte to a buffer (the small square at the head of the red arrow). From there, the opcode travels to the instruction decoder.[10] The instruction decoder, in turn, uses large tables (PLAs) to convert the x86 instruction into a 111-bit internal format with 19 different fields.[11]

The data bytes of an instruction, on the other hand, go from the prefetch queue to the ALU (Arithmetic Logic Unit) through a 32-bit data bus (orange). Unlike the previous buses, this data bus is spread out, with one wire through each column of the datapath. This bus extends through the entire datapath so values can also be stored into registers. For instance, the MOV (move) instruction can store a value from an instruction (an "immediate" value) into a register.
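As a concrete check of that example, here is how those five bytes decode in a Python sketch (the byte values follow the text; everything else is standard library code):

```python
import struct

# The instruction "ADD EAX, 0x12345678" as it sits in the prefetch queue.
code = bytes([0x05, 0x78, 0x56, 0x34, 0x12])

opcode = code[0]                              # 0x05 goes byte-by-byte over
                                              # the 8-bit bus to the decoder
(immediate,) = struct.unpack("<I", code[1:])  # x86 immediates are little-endian;
                                              # these go over the 32-bit data bus
assert opcode == 0x05 and immediate == 0x12345678
```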
Conclusions

The 386's prefetch queue contains about 7400 transistors, more than an Intel 8080 processor. (And this is just the queue itself; I'm ignoring the prefetch control logic.) This illustrates the rapid advance of processor technology: part of one functional unit in the 386 contains more transistors than an entire 8080 processor from 11 years earlier. And this unit is less than 3% of the entire 386 processor.

Every time I look at an x86 circuit, I see the complexity required to support backward compatibility, and I gain more understanding of why RISC became popular. The prefetcher is no exception. Much of the complexity is due to the 386's support for unaligned memory accesses, requiring a byte shift network to move bytes into 32-bit alignment. Moreover, at the other end of the instruction bus is the complicated instruction decoder that decodes intricate x86 instructions. Decoding RISC instructions is much easier. In any case, I hope you've found this look at the prefetch circuitry interesting. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates. I've written multiple articles on the 386 previously; a good place to start might be my survey of the 386 dies.

Footnotes and references

[1] The width of the circuitry for one bit changes a few times: while the prefetch queue and segment descriptor cache use a circuit that is 66 µm wide, the datapath circuitry is a bit tighter at 60 µm. The barrel shifter is even narrower at 54.5 µm per bit. Connecting circuits with different widths wastes space, since the wiring to connect the bits requires horizontal segments to adjust the spacing. But it also wastes space to use widths that are wider than needed. Thus, changes in the spacing are rare, happening only where the tradeoffs make it worthwhile. ↩

[2] The Intel 8086 processor had a six-byte prefetch queue, while the Intel 8088 (used in the original IBM PC) had a prefetch queue of just four bytes. In comparison, the 16-byte queue of the 386 seems luxurious. (Some 386 processors, however, are said to only use 12 bytes due to a bug.) The prefetch queue assumes instructions are executed in linear order, so it doesn't help with branches or loops. If the processor encounters a branch, the prefetch queue is discarded. (In contrast, a modern cache will work even if execution jumps around.) Moreover, the prefetch queue doesn't handle self-modifying code. (It used to be common for code to change itself while executing to squeeze out extra performance.) By loading code into the prefetch queue and then modifying instructions, you could determine the size of the prefetch queue: if the old instruction was executed, it must be in the prefetch queue, but if the modified instruction was executed, it must be outside the prefetch queue. Starting with the Pentium Pro, x86 processors flush the prefetch queue if a write modifies a prefetched instruction. ↩

[3] The prefetch unit generates "linear" addresses that must be translated to physical addresses by the paging unit (ref). ↩

[4] I don't know which phase of the clock is phase 1 and which is phase 2, so I've assigned the numbers arbitrarily. The 386 creates four clock signals internally from a clock input CLK2 that runs at twice the processor's clock speed. The 386 generates a two-phase clock with non-overlapping phases. That is, there is a small gap between when the first phase is high and when the second phase is high. The 386's circuitry is controlled by the clock, with alternate blocks controlled by alternate phases. Since the clock phases don't overlap, this ensures that logic blocks are activated in sequence, allowing the orderly flow of data. But because the 386 uses CMOS, it also needs active-low clocks for the PMOS transistors. You might think that you could simply use the phase 1 clock as the active-low phase 2 clock and vice versa. The problem is that these clock phases overlap when used as active-low; there are times when both clock signals are low. Thus, the two clock phases must be explicitly inverted to produce the two active-low clock phases. I described the 386's clock generation circuitry in detail in this article. ↩

[5] The Manchester carry chain is typically used in an adder, which makes it more complicated than shown here. In particular, a new carry can be generated when two 1 bits are added. Since we're looking at an incrementer, this case can be ignored. The Manchester carry chain was first described in Parallel addition in digital computers: a new fast ‘carry’ circuit. It was developed at the University of Manchester in 1959 and used in the Atlas supercomputer. ↩

[6] For some reason, the incrementer uses a completely different XOR circuit from the comparator, built from a multiplexer instead of logic. In the circuit below, the two CMOS switches form a multiplexer: if the first input is 1, the top switch turns on, while if the first input is a 0, the bottom switch turns on. Thus, if the first input is a 1, the second input passes through and then is inverted to form the output. But if the first input is a 0, the second input is inverted before the switch and then is inverted again to form the output. Thus, the second input is inverted if the first input is 1, which is a description of XOR.

The implementation of an XOR gate in the incrementer.

I don't see any clear reason why two different XOR circuits were used in different parts of the prefetcher. Perhaps the available space for the layout made a difference. Or maybe the different circuits have different timing or output current characteristics. Or it could just be the personal preference of the designers. ↩

[7] The latch circuit is based on a CMOS switch (or transmission gate) and a weak inverter. Normally, the inverter loop holds the bit. However, if the CMOS switch is enabled, its output overpowers the signal from the weak inverter, forcing the inverter loop into the desired state. The CMOS switch consists of an NMOS transistor and a PMOS transistor in parallel.
By setting the top control input high and the bottom control input low, both transistors turn on, allowing the signal to pass through the switch. Conversely, by setting the top input low and the bottom input high, both transistors turn off, blocking the signal. CMOS switches are used extensively in the 386 to form multiplexers, create latches, and implement XOR. ↩

[8] Most of the 386's control circuitry is to the right of the datapath, rather than awkwardly wedged into the datapath. So why is this circuit different? My hypothesis is that since the circuit needs the values of bit 15 and bit 7, it made sense to put the circuitry next to bits 15 and 7; if this control circuitry were off to the right, long wires would need to run from bits 15 and 7 to the circuitry. ↩

[9] In case this post is getting tedious, I'll provide a lighter footnote on sign extension. The obvious mnemonic for a sign extension instruction is SEX, but that mnemonic was too risqué for Intel. The Motorola 6809 processor (1978) used this mnemonic, as did the related 68HC12 microcontroller (1996). However, Steve Morse, architect of the 8086, stated that the sign extension instructions on the 8086 were initially named SEX but were renamed before release to the more conservative CBW and CWD (Convert Byte to Word and Convert Word to Double word). The DEC PDP-11 was a bit contradictory. It has a sign extend instruction with the mnemonic SXT; the Jargon File claims that DEC engineers almost got SEX as the assembler mnemonic, but marketing forced the change. On the other hand, SEX was the official abbreviation for Sign Extend (see PDP-11 Conventions Manual, PDP-11 Paper Tape Software Handbook) and SEX was used in the microcode for sign extend. RCA's CDP1802 processor (1976) may have been the first with a SEX instruction, using the mnemonic SEX for the unrelated Set X instruction. See also this Retrocomputing Stack Exchange page. ↩

[10] It seems inconvenient to send instructions all the way across the chip from the Bus Interface Unit to the prefetch queue and then back across the chip to the instruction decoder, which is next to the Bus Interface Unit. But this was probably the best alternative for the layout, since you can't put everything close to everything. The 32-bit datapath circuitry is on the left, organized into 32 columns. It would be nice to put the Bus Interface Unit over there too, but there isn't room, so you end up with the wide 32-bit data bus going across the chip. Sending instruction bytes across the chip is less of an impact, since the instruction bus is just 8 bits wide. ↩

[11] See "Performance Optimizations of the 80386", Slager, Oct 1986, in Proceedings of ICCD, pages 165-168. ↩
It looks like the code that the newly announced Figma Sites is producing isn’t the best. There are some cool Figma-to-WordPress workflows; I hope Sites gets more people exploring those options.
John Siracusa: Apple Turnover

From virtue comes money, and all other good things. This idea rings in my head whenever I think about Apple. It’s the most succinct explanation of what pulled Apple from the brink of bankruptcy in the 1990s to its astronomical success today. […]
Interactions in mixed reality are a challenge. Nobody wants to hold bulky controllers and type by clicking on big virtual keys one at a time. But people also don’t want to carry around dedicated physical keyboard devices just to type every now and then. That’s why a team of computer scientists from China’s Tsinghua University […] The post A single RGB camera turns your palm into a keyboard for mixed reality interaction appeared first on Arduino Blog.