More from Frank’s Ramblings
Vision transformers (ViTs) have seen an incredible rise in the past four years. They have an obvious upside: in a visual recognition setting, the receptive field of a pure ViT is effectively the entire image[1]. In particular, vanilla ViTs keep the dense attention of language models, along with its quadratic time complexity (w.r.t. the number of input patches).

Kernels in convolutional networks, on the other hand, are applied identically regardless of which input pixel/voxel they land on, a feature that is typically referred to as translation equivariance. This is desirable because it allows the model to effectively recognize patterns and objects regardless of where they are located spatially. The weight sharing present in convolutional layers also makes convnets highly parameter-efficient and less prone to overfitting - a property ViTs do not have.

As such, you might expect that ViTs and convnets are used equally in production environments that leverage visual models - ViTs for “global” tasks such as scene recognition and convnets for more “local” tasks such as object recognition. Even so, we’ve been inundated with work that utilizes ViTs, with bold high-level claims (mostly by media outlets) that convnets are a thing of the past.

Curious to see if I could lend a hand in helping debunk this claim, I set out to figure out whether or not a mostly vanilla ResNet could match or even exceed the performance of both ViT and ConvNeXt. The comparison to ConvNeXt is of particular interest, since it is a fully convolutional network that attempts to bridge the gap between transformers and convnets.

With a bit of experimentation on ImageNet-1k, we can reach 82.0% accuracy with a 176x176 training image size and no extra data, matching ConvNeXt-T (v1, without pre-training à la MAE) and surpassing ViT-S (specifically, the ViT flavor from DeiT-III).
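The translation equivariance mentioned above can be checked on a toy 1D example. This is only a minimal sketch: the helper names `conv1d` and `shift_right` and the sample signal are illustrative assumptions, not part of any model in this post. Shifting the input and then convolving gives the same result as convolving and then shifting the output, as long as the pattern stays away from the borders.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation using one shared kernel at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(signal, steps):
    """Shift a signal right by `steps` positions, zero-padding on the left."""
    return [0.0] * steps + signal[: len(signal) - steps]


# A small "pattern" (1, 3, 1) sitting near the start of an otherwise empty signal.
signal = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
kernel = [1.0, 2.0, 1.0]

out = conv1d(signal, kernel)
out_of_shifted = conv1d(shift_right(signal, 3), kernel)  # shift, then convolve
shifted_out = shift_right(out, 3)                        # convolve, then shift

print(out_of_shifted == shifted_out)
```

Because the same kernel weights are reused at every position, the response to the pattern simply moves along with it.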
Training methodology

We start by adopting the training methodology set in PyTorch’s late 2021 blog, where they achieved an impressive 80.8% accuracy on ImageNet-1k with a stock ResNet50 model. Here are a couple of key points to note:

- We stick with SGD as the optimizer, rather than going for RMSProp or Adam (or any of their variants).
- The scheduler uses cosine decay with five warmup epochs and 600 total epochs. This may seem like an unnecessarily large number of epochs, but we’ll get around to reducing this later.
- We utilize a whole slew of augmentation and regularization techniques found in modern literature, including, but not limited to: label smoothing, mixup, cutmix, and model EMA.
- To prevent overfitting on the validation dataset, we’ll skip hyperparameter tuning and grid search and stick with the stock training methodology listed out in the blog post.

Nearly all of these training optimizations have already been used to boost the performance of modern visual recognition models, but adopting these changes doesn’t quite get us to the magical 82% accuracy we’re looking for.

Architectural modifications

The baseline ResNet architecture is strong but not optimal, so we adopt a few architectural modifications to enable better performance:

ResNet-d

First order of business is to embrace some “modernizations” to ResNet. For completeness, here are the changes listed out:

- The initial 7x7 convolution is changed to a sequence of three 3x3 convolutions with 32, 64, and 128 output channels, respectively. The stride remains on the first convolutional layer. With this change, we now use exclusively 3x3 and 1x1 convolutions across the entire network, all while retaining the original size of the receptive field for the network head.
- Strides in downsampling residual blocks are moved from the first 1x1 convolutional layer to the subsequent 3x3 convolutional layer. This has the effect of capturing all input pixels in a downsampling block, since a strided 1x1 convolution effectively skips every other pixel.
- The max pooling in the stem is removed. The first 3x3 convolution of the first residual block now has a stride of two, matching the remaining residual blocks. While max pooling is theoretically useful for retaining edges, corners, and other low-level features, I haven’t found it to be particularly useful in practice.
- The strided 1x1 convolution in the shortcut connections of downsampling blocks is replaced with 2x2 average pooling followed by a standard 1x1 convolutional layer. Again, this has the effect of capturing all input activations rather than just one out of every four.

Together, these micro-optimizations produce an architecture that is extremely close to ResNet-d, with some very minor differences.

ReLU -> SiLU

ReLU has two weaknesses compared to other activation functions: 1) it is not smooth (ReLU is, strictly speaking, non-differentiable at 0), and 2) the “dying ReLU” problem, where pre-activation values are near-universally negative during a forward pass, causing gradients to always be zero and the neuron to carry no information. As a direct result, a number of novel activations have been proposed throughout the years - Leaky ReLU, Parametric ReLU, ELU, and Softplus are four well-known albeit older examples. The idea behind all of these is to fix one or both of the above problems; Parametric ReLU, for example, attempts to fix the dying ReLU problem by introducing a learnable parameter $\alpha$ that defines the slope of the function for negative pre-activation values.

For this model, I went with SiLU (also commonly known as Swish), defined by $SiLU(x) = \frac{x}{1+e^{-x}}$, which has already seen success with a number of visual recognition models. Since this switch enabled faster training, I reduced the number of epochs from 600 to 450. Although I could’ve used GELU, I decided to use SiLU because it has an inplace parameter and could serve as a drop-in replacement for ReLU in the original reference implementation.
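To make the dying ReLU discussion concrete, here is a minimal pure-Python sketch of the three activations and their gradients at a negative pre-activation. The value `alpha=0.25` for Parametric ReLU is only an illustrative assumption (in practice it is learned), and `numerical_grad` is a hypothetical helper using central differences rather than anything from the training code.

```python
import math


def relu(x):
    return max(0.0, x)


def prelu(x, alpha=0.25):
    # alpha is learnable in a real network; 0.25 is just an illustrative value.
    return x if x > 0 else alpha * x


def silu(x):
    # SiLU(x) = x / (1 + e^{-x}), as defined in the text.
    return x / (1.0 + math.exp(-x))


def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = -2.0  # a negative pre-activation
grads = {f.__name__: numerical_grad(f, x) for f in (relu, prelu, silu)}
print(grads)
```

At a negative input the ReLU gradient is exactly zero, while both Parametric ReLU and SiLU still pass some gradient back, which is the behaviour the alternatives were designed to provide.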
GELU or GLU variants (SwiGLU, GeGLU) might have performed slightly better as they are widely used in language models. Although GELU and SiLU are highly correlated[2], networks trained with GELU are not equivalent to networks trained with SiLU in terms of representational capacity due to differences in weight decay and initialization.

Lastly, I hypothesize that a SiLU network would likely perform better with stochastic depth since ReLU may act like a weak implicit regularizer by adding sparsity to the network activations. This can be great for overparameterized models, but not for parameter-efficient models. SiLU, on the other hand, has nonzero gradients for all values $x$ except for $x \approx -1.278$. As such, with the switch from ReLU to SiLU, adding a bit of regularization might be warranted. I’ll have to experiment more with this in the upcoming weeks.

Update (03/23/2024): After some experimentation, I found that stochastic depth with a drop probability of 0.1 negatively impacts the performance of the network (by about 0.2% or so), but reducing it to 0.05 results in what is effectively the same accuracy. I’ll need to play around with it a bit more.

Split normalization

Vanilla ResNet uses a generous amount of batch normalization (BN); one BN layer per convolutional layer to be exact. The original BN paper argues that BN reduces internal covariate shift (ICS) - defined by the authors as the change any intermediate layer sees as upstream network weights shift - but this has since proven to be untrue (I’ll elaborate on this in a bit). I wanted to go back to the original ICS thesis, i.e. normalization in BN was meant to re-center the activations, while the learnable affine transformation immediately following normalization was meant to preserve each layer’s representational capacity. It simply made no sense to me that these two must be applied back-to-back.
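The two halves of BN described above can be written as two separate functions applied back-to-back. This is a minimal single-feature sketch in plain Python; the names `normalize`, `affine`, and `batch_norm`, the sample activations, and the values of `gamma` and `beta` are all illustrative assumptions, and running statistics are left out.

```python
import math


def normalize(batch, eps=1e-5):
    """Re-center and re-scale a mini-batch of scalar activations."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


def affine(batch, gamma, beta):
    """Learnable scale and shift that restores representational capacity."""
    return [gamma * x + beta for x in batch]


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Standard BN: normalization immediately followed by the affine transform."""
    return affine(normalize(batch, eps), gamma, beta)


activations = [2.0, 4.0, 6.0, 8.0]
normalized = normalize(activations)
output = batch_norm(activations, gamma=1.5, beta=0.5)
print(normalized, output)
```

Written this way, nothing forces `normalize` and `affine` to sit next to each other, which is the observation the rest of this section builds on.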
Furthermore, since backpropagation effectively treats each individual layer of neurons as an independent learner, the most sensible thing to do is to normalize layer inputs rather than outputs.

Long story short, I found that splitting BN into two separate layers - pre-convolution normalization and post-convolution affine transformation - improves the network’s performance by over 0.4%. While this does negatively affect speed and memory consumption during training, it has zero impact on inference performance since the normalization and affine transformations can be represented as diagonal matrices and fused with the weights of the convolutional layer once the network is fully trained.

Split normalization, visualized.

I wanted to better understand the theory behind “split” normalization but couldn’t find it anywhere in ML literature[3]. As a result, I looked towards BN theory first; the most compelling research in my eyes comes from Santurkar et al.’s 2018 paper. In it, they show that BN often increases ICS. Instead, they argue that batch normalization works well because it improves the first- and second-order properties of the loss landscape.

Through a quick exercise, we can show that split normalization (SN) has the same effect. Let’s consider two networks - one without SN defined by loss function $L$ and one with SN defined by loss function $\hat{L}$. For the network with SN, the gradients through each of these layers are as follows:

where $m$ is the size of each mini-batch and $y_i$, $\hat{y}_i$, $\hat{x}_i$, $x_i$ represent the activations for the $i$th sample in our batch. In practice, the dimensionality of the activation tensors can be arbitrarily large or small (e.g. 3d for most convnets). With this, we can represent the full loss gradient via dot products:

For a function $f(a)$, the L2 norm of its gradient $\left\Vert\frac{df}{da}\right\Vert$ is a good proxy for Lipschitzness. The same holds for our loss function, i.e. we would like to show that $\left\Vert\frac{\partial\hat{L}}{\partial\mathbf{x}}\right\Vert \leq \left\Vert\frac{\partial L}{\partial\mathbf{x}}\right\Vert$. Given a matrix $\mathbf{A}$ and vector $\mathbf{b}$, the norm of the two multiplied together is bounded above by the largest singular value of $\mathbf{A}$ times the norm of $\mathbf{b}$, i.e. $\Vert\mathbf{A}\cdot\mathbf{b}\Vert \leq s_{max}(\mathbf{A})\Vert\mathbf{b}\Vert = \sqrt{\lambda_{max}(\mathbf{A}^T\mathbf{A})}\Vert\mathbf{b}\Vert$. Given this, we have:

Applying the reduction from C.2 in Santurkar et al., we get:

In my eyes, we should separate the multiplicative term (i.e. $\frac{\gamma^2s_{max}^2}{\sigma^2}$) from the additive term (i.e. $- \frac{1}{m}\left\Vert\mathbf{1} \cdot \frac{\partial L}{\partial\mathbf{y}}\right\Vert^2 - \frac{1}{m}\left\Vert\frac{\partial L}{\partial\mathbf{y}} \cdot \mathbf{x}\right\Vert^2$) since a) the multiplicative effects can be counteracted by increasing or decreasing the learning rate and b) $\mathbf{W}$ tends to change much slower than other terms in the equation. In particular, the additive term is strictly negative, which means that the overall loss landscape is smoother, while the potentially large multiplicative upper bound implies that SN may, in certain situations, be increasing the Lipschitz constant of the loss. At the same time, ICS at the inputs of each layer is strictly decreased, as the learnable affine transformation now comes after the weights rather than before.

The results

The final 26M parameter model successfully reaches 82.0% accuracy on ImageNet-1k without any external sources of data! In the spirit of modern machine learning research, let’s give this network a fancy name: GResNet (Good/Great/Gangster/Godlike ResNet).

| Model | Accuracy | Params | Throughput |
| --- | --- | --- | --- |
| GResNet | 82.0%* | 25.7M | 2057 im/s |
| ConvNeXt | 82.1% | 28.6M | 853 im/s |
| ViT (DeiT) | 81.4% | 22.0M | 1723 im/s |

Comparison of different models. Throughput calculated on a single Nvidia A100 with batch size 256 without network optimizations.
*Accuracy improves to 82.2% and throughput drops to 1250 im/s when we use ConvNeXt's train image size of 224x224 instead of 176x176.

The GResNet model definition is available here, while weights are available here.

Accuracy curve during training.

Ending words

What exactly have we shown here? With some simple modifications to ResNet, we can attain excellent performance - on par with or better than both ViT and a ViT-inspired convnet (ConvNeXt) on smallish datasets.

ConvNets never die, they just Transform - Peyman Milanfar (@docmilanfar) October 27, 2023

ResNet strikes back... again?

You might be asking: why ImageNet-1k? Aren’t there a number of much larger labelled visual datasets, e.g. YFCC, LAION, etc? Secondly, since modern LLMs are exclusively transformer-based, isn’t it beneficial to also use transformers for vision in order to take advantage of cross-attention or by linearly projecting patches into the decoder? The answer is yes: for large multimodal models bound by text, self-attention reigns supreme. But small models (e.g. most embedding models) are arguably more important because of their portability and adaptability, and these models benefit greatly from the exact type of experiment outlined in this post: strong augmentation with limited data trained across many epochs. This is exactly the type of data that ImageNet-1k represents.

And on the topic of ViTs being superior to convnets on large datasets: the 2023 paper titled Convnets match vision transformers at scale from folks at Google DeepMind is worth a read.
The concluding section contains a stark result: “Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly.” This simply reinforces a lesson that ought to be repeated: optimizations to model architecture should always come after 1) a large, high-quality dataset, 2) a solid, highly parallelizable training strategy, and 3) having lots of H100s. I’d argue that the bulk of transformers’ success has come from their ability to be efficiently and effectively scaled to hundreds of billions of parameters - scaling that could theoretically also be done with RNNs if research scientists had decades of time to train them (spoiler: they don’t).

Addendum - comparing embedding quality

I thought it might be interesting to compare embeddings from GResNet, ConvNeXt, and ViT by storing and indexing the embeddings from each model in Milvus:

    >>> from milvus import default_server
    >>> from pymilvus import MilvusClient
    >>> default_server.start()
    >>> client = MilvusClient(uri="http://127.0.0.1:19530")
    >>> # initialize model, transform, and in1k val paths
    ...
    >>> with torch.no_grad():
    ...     for n, path in enumerate(paths):
    ...         img = Image.open(path).convert("RGB")
    ...         feat = gresnet(transform(img).unsqueeze(0))
    ...         client.insert(collection_name="gresnet", data=[feat])
    ...
    >>>

I removed the model initialization and data loading snippets for brevity and used Euclidean/L2 as the distance metric with no indexing (i.e. FLAT). With that step done, we can then query the collections to get results that look like this:

One could argue that GResNet tends to pick out images which are stylistically closer to the query image in addition to being the same class, but aside from that, the results between all three models are very comparable.
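For readers unfamiliar with what a FLAT (no index) search with an L2 metric amounts to, the sketch below shows the idea in plain Python: compare the query against every stored vector and keep the closest ones. It is not the Milvus implementation or API; `l2_distance`, `flat_search`, and the tiny 2-d "embeddings" are hypothetical stand-ins for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, top_k=3):
    """Brute-force search: score every stored vector, return the top_k closest."""
    scored = [(l2_distance(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()
    return [(idx, dist) for dist, idx in scored[:top_k]]


# Toy 2-d vectors standing in for real image embeddings.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
results = flat_search([0.9, 0.1], embeddings, top_k=2)
print(results)
```

Because every vector is scanned, the results are exact, which makes this setting a reasonable baseline when the goal is to compare embedding quality rather than index quality.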
[1] For a visual recognition model, the receptive field is the effective area of the input Nd-xels that a layer or neuron “sees” and can capture. Early layers in a pure convolutional model, for example, have a very small receptive field, while each layer in a vision transformer with dense attention sees the entire input image.

[2] There exists a fairly accurate approximation that relates GELU and SiLU: $GELU(x) \approx \frac{SiLU(1.702x)}{1.702}$.

[3] Please reach out to me if you know of prior work that implements this so I can give it a proper citation.
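The GELU/SiLU relation in the second footnote is easy to sanity-check numerically. The sketch below uses only the standard library and the exact erf-based GELU; the grid range of [-5, 5] and the helper name `gelu_via_silu` are arbitrary choices for illustration.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    return x / (1.0 + math.exp(-x))


def gelu_via_silu(x, scale=1.702):
    """Approximation from the footnote: SiLU(1.702x) / 1.702."""
    return silu(scale * x) / scale


# Evaluate both functions on a grid and record the worst-case gap.
grid = [i / 100.0 for i in range(-500, 501)]
max_abs_error = max(abs(gelu(x) - gelu_via_silu(x)) for x in grid)
print(f"max |GELU - approx| on [-5, 5]: {max_abs_error:.4f}")
```

The gap stays small across the whole range, which is consistent with calling the two activations highly correlated while still not identical.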
… glorified marketing for portfolio companies, that is

I came across one of a16z’s blog posts on Hacker News today, titled Emerging Architectures for LLM Applications. For folks who didn’t catch it, here’s the tl;dr:

- The emerging LLM stack is composed of several elements centered around data orchestration tools such as Langchain and Llamaindex. Data pipelines, embedding models, vector databases, and queries form the primary input for these orchestration tools.
- The stack is based on in-context learning, where off-the-shelf LLMs are used and their behavior is controlled through prompting and conditioning on contextual data.
- Strategies for prompting LLMs are becoming increasingly complex and are a core differentiating factor for both closed-source and open-source LLMs. Of these LLMs, strategies for GPT-3.5 and GPT-4 are most common, seeing as OpenAI is the current leader.
- AI agents - programmatic runtimes that can reason and plan - excite both developers and researchers alike, but don’t work just yet. Most agent frameworks are currently in PoC phase.

Overall, I thought the article was informative, but I was surprised that the section on vector databases mentions neither Milvus nor Zilliz, especially since Milvus was mentioned in an older a16z blog on data and ML infrastructure:

Also of note: another Zilliz project (GPTCache) is listed in the post.

My initial instinct was that Milvus was left off because it is part of the LF AI & Data Foundation rather than being a project wholly owned by Zilliz, so I left a comment on the HN post that links back to the Milvus website. I came back a couple of hours later to find an interesting take:

Full disclosure: we (Zilliz) raised $103M back in 2022, and Pinecone raised $100M this April.

Running it back in my head, I felt that SheepHerdr’s response actually made excellent sense - a16z’s ultimate goal is to generate returns for LPs, and the best way to do that is by supporting founders and propping up their portfolio companies.
To me, this is also unequivocally unfair to Vespa, Weaviate, etc as it delivers a subliminal message that they have no realistic long-term chance in the vector database space relative to Pinecone. This, of course, is absolute nonsense: vector databases are NOT a zero-sum game. I dove a bit deeper and was surprised to find that this is fairly commonplace behavior for a16z as a firm: The aforementioned article also lists Databricks in the “Data Pipelines” section, but not Snowflake. There is a Snowflake loader for Langchain and a guide for using Llamaindex with Snowflake. Databricks is an a16z portfolio company. The Modern Transactional Stack doesn’t come close to listing all of the available data connectors. To be fair, Airbyte and Fivetran (an a16z portfolio company) are the two largest and most well-known, but to distill the entire segment to just two companies seems unfair. a16z’s crypto division has backed LayerZero, going as far as actively voting against Wormhole, a LayerZero competitor. Side note: LayerZero was also featured in a16z’s Crypto Startup School. These are just three random examples I dug out - there are probably many other examples in verticals that I am unfamiliar with. Other LLM/GenAI Infrastructure landscapes Here’s a couple alternative landscapes that are, in my eyes, more wholly representative: ML/AI/Data Landscape (Interactive version). Matt Turck’s MAD Landscape is arguably the most complete out there. Companies that do vector search are listed under “Infrastructure/Vector Database” and “Analytics/Enterprise Search” categories. It was released in February 2023 so it’s about 4 months old, but a good resource nonetheless. Future of AI-Native Infrastructure. This one’s from Wei Lien Dang and David Hershey of Unusual Ventures. I found this pretty unique as it has a vertical for AI agents. It’s unfortunately not as complete as the MAD Landscape (missing Vespa, Vectara, etc), but still a good overview. The New Language Model Stack. 
Sequoia Capital’s blog post on the LLM stack is also excellent. Milvus isn’t in the diagram, but it’s mentioned in the section on vector databases. Vector Database Landscape. Yingjun Wu’s infographic is centered specifically around vector search infrastructure.

Final thoughts

I have tremendous respect for a16z, a firm that helped pioneer the practice of working with and nurturing founders rather than forcing them out pre-IPO or minmaxing term sheets. Their content is also incredibly informative and valuable for understanding the nuances of building a company, from finding PMF to hiring executives. I also wholeheartedly understand a16z’s motivation for sharing knowledge and highlighting their portfolio companies, but to do so under the guise of being helpful and impartial is just plain silly. In particular, a16z’s blog post yesterday has as much to do with emerging strategies for portfolio company marketing as it does with emerging architectures for LLM applications. This practice would be somewhat analogous to Google putting paid URLs at the very top of search results without an “Ad” label. (To be clear, Google doesn’t do this.)

I’d like to end with some glorified marketing of my own:

% pip install milvus
In this four-part article, I’ll go over some of the lessons I learned living and doing business in China’s tech industry. During my time in China, I’ve led a team of 10+ engineers to develop a location-based IoT and sensing platform, co-founded an open-source project called Towhee, and developed countless relationships with folks in a number of different cities (many of whom I now consider good friends). I’ll go over some of the common misconceptions about China ranging from living and working in China to the government’s pandemic response.

I originally intended for part II of this blog post to cover the tech industry in more detail (996, CSDN, open-source, etc…), but given the current spike in COVID cases these past two weeks plus the current lockdown in Shanghai, I felt it was more appropriate to first cover pandemic life in China. As always, if you have any questions, comments, or concerns, feel free to connect with me on Twitter or LinkedIn. Thanks for reading!

Before reading this blog post, I strongly recommend you read part I if you haven’t yet. Part I received much more exposure than I had anticipated; I received a lot of positive feedback and I enjoyed reading many of the responses, especially those which provided a different outlook on China and its citizens. While I had originally intended for part II to cover China’s tech industry, I decided to instead cover China’s handling of the pandemic first, given Shanghai’s current lockdown.

Stocking up on food right before the Shanghai lockdown in March 2022. I imagine the conversation with supermarket staff went something like this: Q - What kind of ramen would you like? A - All of it.

A couple of words before steaming ahead into part II: 1) This article will be focused around three pandemic stories from China which will depict how China’s zero-COVID policy has affected Chinese citizens.
These purely anecdotal stories are not meant to directly prove a point or argue a cause; rather, my hope is that they can provide a “boots-on-the-ground” perspective for readers unfamiliar with life in China (and, to a lesser extent, other east Asian countries) during COVID. 2) Recent articles with visualizations which conveniently ignore certain population segments (or other statistical anomalies) have unfortunately reduced my faith in “data-driven” articles1. As such, a small portion of this blog post will be dedicated towards picking apart pure data-driven arguments against China’s COVID statistics. 3) Lastly, I’d like to remind everyone to keep comments civil. I was subject to a private but fairly negative personal attack from one of the readers over the Personal identity section in part I. As the post’s author, I’m fine with it, but do not subject other readers and community members to the same or similar treatment - it’s irresponsible and does nothing to improve the quality of the discussion. With that said, let’s dive in. Three pandemic stories It would be easy for me to simply “tell you” how the pandemic has changed China; instead, I’d like to start this blog post with three “pandemic stories” - short excerpts which highlight the scope with which China’s zero-COVID policy has affected the population. These purely anecdotal stories are not meant to directly prove a point or argue a cause; rather, my hope is that they can provide a “boots-on-the-ground” perspective. Alex’s story Alex (I’ve used an alias to ensure privacy) is a Taiwanese expat working in Shanghai. Her story is a bit unique given her background - she’s been in Shanghai since early 2020 and, due to quarantine policies on both sides of the Taiwan strait, hasn’t been back home in over two years. More on this in a bit. When Alex flew from Taiwan to Shanghai in February of 2020, she immediately found herself in unfamiliar territory. 
Streets were nearly completely empty, and the few folks who did wander outside were tightly masked. The only businesses open were supermarkets, which were required by central government policy to have workers and/or guards standing by entrances, recording everybody’s name, ID number, phone number, and body temperature. News on Wuhan and the COVID-related restrictions popping up around the country were being constantly broadcast by state-run media. Red propaganda posters filled the streets, warning the general populace to remain masked and to stay away from wild game (野味). As time progressed, it became clear to Alex that, while people living in Western countries had lost jobs and loved ones, people living in China lost significant freedom and social capital in a country already short on both.

In a culture that prides itself on family and connectivity - especially during Lunar New Year[2] - not returning home for over two years is borderline criminal. However, for Alex, this was not by choice. The policy for foreign travelers entering Taiwan is 14 days of quarantine, while the policy for travelers entering Shanghai is 14 days of hotel quarantine plus another 7 days at home. Because no human contact is allowed during the entire quarantine period, these quarantine periods are generally referred to as isolation (隔离) in Mandarin. For Alex, spending over a month in quarantine/isolation would simply be unacceptable, especially as the rest of her co-workers are all in-office.

Two years away from Taiwan also resulted in a loss of something known as household registration (户籍). Although it may not seem like a big deal, household registration is significantly more meaningful in Taiwan than residency in the USA or Canada - everything from local health insurance to voting rights is impacted by the loss or acquisition of household registration. While she’s still in Shanghai today, she remains hopeful for the opportunity to return home to Taiwan later this year.
Although the strict COVID policies have soured her attitudes toward working and living in mainland China, her views on the citizens of Shanghai and Cross-Strait relations remain positive.

Ding Liren’s story

Ding Liren (Ding is his family name; this is the standard naming convention in China) is China’s top-rated chess player. He’s currently world number 2 behind Magnus Carlsen of Norway[3].

A bit about competitive chess before I continue. The World Chess Championship is almost universally recognized as the world’s premier chess tournament. It is a head-to-head match between the reigning world champion and a challenger. In modern times, the challenger has been determined via a biennial 8-player tournament called the Candidates Tournament. Ding first participated in 2018’s tournament, placing 4th of 8. It was a decent showing, but after an unbeaten streak of 100 games ending in late 2018 and a win in the 2019 Sinquefield Cup (where he beat top-rated Magnus Carlsen in playoffs), he was widely considered to be one of the favorites in the 2020 Candidates Tournament (along with Fabiano Caruana, the winner of the 2018 Candidates Tournament).

Early 2020 is where Ding’s story takes a turn for the worse. The 2020 Candidates Tournament was scheduled to take place mid-March in Yekaterinburg, Russia. Upon his entry into the country, the Russian government decided to put Ding into quarantine in a rural cottage near Moscow due to the COVID pandemic. This quarantine took an incredible mental toll on Ding, putting him in a tie for last place after 7 of 14 rounds. After the 7th round, FIDE, chess’s governing body, decided to suspend play to mitigate the spread of COVID. When play resumed mid-April 2021, Ding (who did not have to quarantine this time around) looked to be back in top form, winning his final three games of the tournament, one of which was over the eventual challenger, Ian Nepomniachtchi.
In a game where draws are incredibly common at the highest level of play, three wins in a row can be considered a major accomplishment in and of itself.

The story doesn’t quite end there. With Ian bombing out of the 2021 World Chess Championship match with Magnus, Ding is once again widely considered to be one of the favorites to win the 2022 Candidates Tournament… if he could actually qualify for it. The top two finishers of the 2021 Chess World Cup, 2021 Grand Swiss Tournament, and 2022 Grand Prix are given berths into the FIDE Candidates Tournament[4]. Although Ding was invited to and had planned on participating in all three of the aforementioned tournaments, he ended up being unable to attend any of them due to a combination of China’s zero-COVID stance and the Schengen area visa policy; he’s repeatedly been unable to purchase a return flight from Europe to China due to China’s constant updating of return flight rules and the complete lack of available flight options. For reference, a one-way flight from San Francisco to Shanghai on 05/13 of this year costs $9628 (transiting through a third country is disallowed if direct flights exist). I was able to secure a one-way flight from San Francisco to Shanghai for $267.20 pre-pandemic.

In a major twist of events, it seems that Ding may yet qualify due to Sergey Karjakin’s chess ban[5]. If Ding does end up playing in the 2022 Candidates Tournament, I’ll certainly be rooting for him - I hope you will too.

Ding Liren so obviously belongs in the candidates tournament. That he does not even get a chance to qualify, is saddening. - Peter Heine Nielsen (@PHChess) February 1, 2022

My own story

The word “lockdown” is generally understood to be a break in transportation and other non-essential public services; this is not the case in China. The last story that I’d like to share is a personal one detailing the time I had the great displeasure of participating in a 48-hour COVID-induced building-wide lockdown in Shanghai.
On the evening of December 13th of 2021, the Anlian building in Shanghai’s Yangpu district went under a full-fledged 48-hour lockdown. Although I had left before police and health officials came to lock the building down, Anlian’s building management was still able to contact and inform me of the mandatory 48-hour quarantine (I was obviously not enthralled by this). Right before I re-entered the building, I took the picture below. One of the police officers noticed me snapping photos and was about to confiscate my phone before I told him that I had already deleted them (I lied). I didn’t end up taking any more pictures of the lockdown due to this strict “no photographs” policy. Shanghai's Anlian building on the first night of lockdown - notice the barricade at the entrance to the left of the blue tent. There's police everywhere, and local health workers arrived in full personal protective equipment (PPE) to administer nucleic acid amplification tests (NAATs). The first night was the most eventful. Occupants ordered takeout (外卖) for dinner, resulting in mass confusion as bags of food were left outside the building entrance with nobody to bring them in. There was also mandatory COVID testing for the entire building and strict mask requirements while lining up for the test; those who weren’t wearing them tightly over both the nose and mouth were forcibly pulled aside and given stern warnings. Later at night, internet speeds slowed considerably as everybody began streaming television shows, downloading Steam games (CS:GO, anyone?), watching Netflix (through a VPN), etc. Long lines formed at bathrooms as well. In particular, the women’s bathroom became congested as many vied for mirror space to apply dry shampoo and/or remove makeup. Local health workers brought and distributed blankets, but only enough for about 1/5th of the people in the building - tough luck for everybody else. 
Day 2 was much of the same, with most folks fairly tired and sleep-deprived from the night before. Another round of NAATs took place on the first floor during a very specific time window. I was unfortunately late, which resulted in a heated argument between building management (who was supposed to make sure everyone in the building was present for the second round of COVID tests) and local health workers (who had to once again put on PPE and re-open test kits). This happened even though it was fairly clear at that point that nobody in the building had contracted COVID. I later found out that the 48-hour lockdown was due to secondary contact (次密接) as opposed to primary contact: an international traveller from Japan who was in contact with a confirmed COVID case had passed through the 25th floor of the building earlier in the day. I was skeptical that health officials would go to such lengths over a secondary contact, but I later confirmed it with both a local health official as well as one of the folks most heavily affected who worked on the 25th floor of the office building. In any case, if there’s one thing I learned from this whole ordeal, it’s that sleeping in office chairs is extremely uncomfortable. On China’s COVID statistics These stories should help shed some light on the three distinct phases that China’s zero-COVID policy has gone through. The first phase took place from December 2019 to April 2020. During these critical months, China set a precedent for the rest of the world by engaging in mass lockdowns, city-wide testing, and virtual meetings. Official statistics (deaths, cases, recoveries) during this time are highly inaccurate due mostly to intentional but also some inadvertent miscounts. From May 2020 onward, China entered a delicate equilibrium, maintaining its zero-COVID policy through strict 21-day quarantine for international travelers - 14 in a hotel plus 7 at home. 
Chinese policy became fairly standard throughout the country, and most citizens simply forgot about COVID altogether, save for the occasional article or two bashing America for an unnecessarily high death count. Since January 2022, driven by Omicron’s high transmissibility, China has been grappling with outbreak after outbreak and re-engaging in citywide lockdowns. Through a fundamental misunderstanding of the first two phases, China writers such as George Calhoun criticize Beijing for underreporting the infection rate. He views China’s COVID statistics as a “statistical, medical, biological, political and economic impossibility” because he’s never lived in a dense, authoritarian country. Writers like George deserve substantial criticism for cherry-picking statistics while simultaneously avoiding a holistic approach to analyzing China’s COVID response. China’s COVID eradication program in phases one and two was successful because the central government’s containment policies were unimaginably draconian. The 48-hour lockdown story should serve as a great example of this - a city or state leader in America forcing an entire building into a military lockdown would be political suicide. As mentioned above, I have no doubt that the COVID cases and deaths for phase one are significantly higher than initially reported. Phase two, however, is entirely different. With COVID’s strong transmissibility and incredibly dense urban centers, entire swaths of the population would be simultaneously unable to work if even a few COVID cases slipped through without quarantine. Simply put, hospitals would be overrun, and the Chinese populace would notice. Halloween (2020) in a tier 2 Chinese city. Quite the super-spreader event, no? My personal opinion The purpose of the three above stories was to portray how lockdowns, quarantine, and general COVID policy in China differ from those of other countries. 
This should hopefully also show why China’s zero-COVID strategy was considerably more successful than that of other countries, as well as why zero-COVID is socially and economically unsustainable in the era of Omicron. Unless China cuts its citizens off completely from the rest of the world, I don’t see zero-COVID as a long-term possibility for any country, let alone one with an economy and population as large as China’s. China’s zero-COVID policy was warranted when the disease was much deadlier, but with Omicron accounting for nearly 100% of all recent worldwide COVID cases, it is highly impractical for China to continue these unsustainable zero-COVID rules, as they will have increasingly negative social and economic side effects. In particular, China’s zero-COVID policy has put the population in a COVID-ignorant state of mind - more and more people are showing an unwillingness to comply with local COVID mandates, all while the percentage of fully vaccinated elderly Chinese citizens remains low. Thankfully, there are rumors that China wants to ease its zero-COVID policy. However, given the speed with which the central government was able to lock down cities and restrict the flow of people in early 2020, I see no excuse for the current unease and slowness with which opening up is being discussed. Western media coverage One final note on Western media and its coverage of China’s pandemic response. The majority of media outlets have repeatedly failed to read between the lines when it comes to CPC pandemic policy6. While part of the reason is to prevent the spread of COVID domestically, another major reason is talent retention. China is undergoing a fairly seismic demographic shift, with a rapidly shrinking young population (ages 25-34). 
I personally know several young Chinese professionals who studied at an international university before deciding to return to China instead of staying abroad - nearly all of these instances were due to rising costs associated with traveling in and out of mainland China, both in terms of time and money. Alex’s and Ding’s stories are perfect reflections of this. It’s time for Western media to treat China’s policies as socioeconomic manipulation at the expense of other countries (including America) rather than natural byproducts of an authoritarian government. Western governments should band together and respond in kind with their own talent retention policies, and, if necessary, embargoes/sanctions against China. Wrapping up Thanks for making it this far - I hope this post was informative. As mentioned before, this is part II of a four-part series. In part III, I’ll cover the Chinese tech scene, from 996’ing to the open source community. Stay tuned! Example: where’s the line for white, non-Latina women in this article? ↩ 有钱没钱回家过年, i.e. returning home for LNY is a must, regardless of one’s fiscal condition. ↩ Ding and Levon Aronian are my two favorite players. In particular, I enjoy watching Ding’s solid playstyle in conjunction with his cold, hard calculation capabilities. He’s also an incredibly humble person. ↩ Traditionally, there has also been a slot for the highest-rated player, but this was removed in the 2022 cycle due to rating protection/manipulation by previous Candidates Tournament participants (Ding would’ve otherwise qualified this year). ↩ Sergey had qualified via the Chess World Cup held in 2021, but due to his support of the Russian invasion of Ukraine, he received a 6-month ban from all FIDE tournaments. This reinforces my belief that the only true winners of Russia’s invasion of Ukraine are China and India. ↩ China’s great firewall is another example of Western media missing the complete picture. 
While minimizing external influence and internal dissent is undoubtedly a major reason for building the firewall, an equally important reason was to promote the growth of China’s own tech giants - Alibaba, Tencent, Baidu, etc. I’ve actually read articles and papers which argue that the latter reason is the primary one for the great firewall; given the prevalence of VPNs and proxies (翻墙软件) within mainland China, I must say that I agree. ↩
In this four-part article, I’ll go over some of the lessons I learned living and doing business in China’s tech industry. During my time in China, I’ve led a team of 10+ engineers to develop a location-based IoT and sensing platform, co-founded an open-source project called Towhee, and developed countless relationships with folks in a number of different cities (many of whom I now consider good friends). I’ll go over some of the common misconceptions about China ranging from living and working in China to the government’s pandemic response. Part I of this blog post covers some of the basics without diving too deep into the tech world: some interesting things I learned while living, working, and interacting in China. If you have any questions, comments, or concerns, feel free to connect with me on Twitter or LinkedIn. Thanks for reading! Update (03/29/2022): Part II is up. You can read it here. Before I begin, a bit about me. I was born in Nanjing, China, but moved to the US when I was barely three years old. I spent about five years in New Jersey before moving to Corvallis, Oregon (a place that I am, to this day, proud to call home). I moved to Norcal for college, studying EE (with a minor in CS) at Stanford. I stayed there for my Master’s degree as well, which I completed in 2014. Afterwards, I worked at Yahoo’s San Francisco office as a Machine Learning Engineer for two years. In this hybrid software development and research role, I was able to research and productionize the industry’s first deep learning-based model for scoring images based on aesthetics. I also had the pleasure of attending Yahoo’s internal TechPulse conference (where my co-author and I won a best paper award) all while keeping up with interesting deep learning use cases. All in all, I was quite happy with the work I was doing, but also slowly started to develop the entrepreneurship itch. 
In the lead-up to 2017, I returned to my Electrical Engineering roots and co-founded a company developing solutions for indoor localization and navigation. Efforts I put in towards finding investment continuously had little to no return - feedback we got from a lot of investors was that they believed in the team, but that the product lacked a “viability test” with an initial customer, something difficult for an early-stage hardware startup due to the high development overhead. I had some simulations and early board designs which I believed were enough, but for an investor, diving deep into an unknown company’s technology can often be costly in terms of time and energy. This is where my story takes a bit of a turn. In late 2017, the company received an early-stage seed investment offer from mainland China, and after a bit of consideration, we decided to go for it. It was at this point that a lot of friends and family asked me a question I’ve become very good at answering over the years: Why did you choose to leave Silicon Valley for an unknown country with less talent and an arguably inferior tech industry? The answer is threefold: 1) I felt that Chinese investors were more open to funding hardware startups due to the ultra-fast turnaround times for fabrication, 2) the Bay Area was just getting too damn expensive for my taste, and 3) from a personal perspective, I wanted to understand my birth country from cultural, social, and economic standpoints. I felt good about my decision and thought that the greatest challenge would be language; my Mandarin was workable but far from proficient. San Francisco Chinatown is a poor caricature of Qing dynasty China. Same goes for the architecture you see in Chinese restaurants across America. Photo by Dennis Jarvis, CC BY-SA 2.0 license, original photo. 
Alipay, WeChat, and QR codes The very first thing you’ll learn about China is that everything revolves around either Alipay (支付宝) or WeChat (微信), two apps known primarily for their payment capabilities. What a lot of folks outside China don’t know is that these two apps can be used as gateways to a number of other mini-programs (小程序), i.e. subapps developed by other organizations such as KFC, Walmart, etc. These subapps can be used directly within either Alipay or WeChat, forgoing the need to individually download apps from an app store. Imagine ordering furniture from IKEA, dinner from Chipotle, and movie tickets to Century Theaters all from the same app - that’s Alipay/WeChat for you. The obvious downside to this is that personal information becomes extremely centralized. If something like this were to happen in the US, antitrust lawsuits would come faster than a speeding bullet, and for good reason too - big conglomerates monopolizing data is dangerous and their wide adoption stifles innovation. While Alipay and WeChat were years ahead of the US’s card-based (credit/debit) payments system when first released, Android Pay and Apple Pay (NFC-based) have since then become a lot easier to use. Alipay and WeChat work by opening a camera and scanning a QR code, which redirects you to the store's payments page. You can then pay an arbitrary amount of RMB, which will immediately show up in the payee's balance once complete. Photo by Harald Groven, CC BY-SA 2.0 license, original photo. Here's a screenshot of my Alipay. Its primary use is for payments, as evidenced by the top row, but mini-programs (second row from the top) have now become an important part of the app. Alipay and WeChat’s success within mainland China is in large part due to the smartphone + QR code revolution, which has truly permeated all aspects of Chinese life. Shared bikes can be unlocked by scanning a QR code on your phone. You can add friends on Alipay and WeChat using QR codes. 
Many Communist Party of China (CPC) functions rely on tight Alipay or WeChat integration. You can even log in to third-party websites and check in as a guest in office buildings via QR codes. I am by no means a security expert, but this system somehow feels a bit gameable despite its widespread use by over a billion people. Red tape, CPC style While Alipay and WeChat have made life considerably easier for the majority of people living in China, many civil and commercial processes are still incredibly difficult and filled with unnecessary paperwork. Registering a company and acquiring a work permit in China is quite possibly one of the most insanely frustrating things on Earth. I won’t go into all of the details, but just know that it involved a mountain of paperwork, letters of commitment, countless passport scans and other documentation, etc… We ended up hiring an administrative assistant to handle a lot of this work for us, but the amount of time and energy one has to dedicate towards this can be a bit demoralizing. Some provincial (the equivalent of a state in America) governments have issued new policies aimed towards combating the problem of excessive paperwork. But the CPC is massive, and massive entities have even larger amounts of inertia. Rather than reducing the amount of mandatory paperwork, many of those policies revolved around reducing the number of trips needed to see the process to completion. This is definitely a step in the right direction, but compiling a thick folder of paperwork is still not a fun experience. A common joke in China is that there are four castes. From top to bottom these are: 1) CPC officials, 2) foreigners, 3) white collar workers, and finally 4) blue collar workers. Even with this supposed semi-VIP treatment, getting a business license such as this one is something I do not want to go through again. 
The same goes for pretty much all processes which require some sort of government approval, including but not limited to acquiring a work permit, registering an address change, and replacing a lost ID card. Even flying to China requires a mountain of paperwork and approvals, even if you already have a Chinese visa. My main problem with all this is the CPC’s complete lack of transparency. Why can’t I transit through a third country on my way to China if I’m going to have to undergo 14 days of mandatory hotel quarantine plus another 7 days of home quarantine anyway? From a foreigner’s perspective, this is one of the most frustrating aspects of China in an otherwise amazing experience - CPC overreach in almost every aspect of everyday life. The CPC grossly mismanages via overregulation in some sectors and underregulation (hello, housing market) in others. Social regression, economic growth This ties into another common misconception about China - the idea that the government wants to track everything you do at all hours of the day (for the moment, let’s ignore the feasibility of doing so for a population of 1.4 billion people) through a combination of CCTV, mobile phones, and browsing habits. I’ve read countless articles written by American and European media outlets overstating the dystopia that China has fallen into, but the reality is that the Chinese government cares little for storing said data long-term and uses it primarily in criminal cases. I was involved in a project that uses face recognition to track residents going in and out of communities; not only were the residents eager to have such a system installed, but it eventually also helped track a man guilty of sexual assault. Data from such a system was also entirely managed at the local level and not automatically shared with the provincial or central governments. Xinjiang and Tibet are two exceptions to this which I won’t dive deep into. 
I also haven’t been to either province, so it would be inappropriate for me to comment on what’s going on in Western China. Other surveillance programs such as social credit (社会信用) and city brain (城市大脑) are also widely misunderstood. The social credit system primarily punishes and constrains businesses rather than people, while social credit for individuals is somewhat analogous to a background check in America. A lot of American and European commentators will point out some insane social credit rules, such as deducting points for cheating on the college entrance exam (essentially the SAT on steroids); while I do not disagree, there are undoubtedly similar occurrences for American laws. When I was still a student at Stanford, I once lost an internship opportunity because a “traffic violation” - biking at night without a bike light - showed up on my background check. In all fairness, I consider it to be extremely easy to stay off China’s social credit “blacklist” - just be reasonable and avoid breaking the law. China’s “city brains” are a totally different beast, designed to anticipate and reduce traffic, improve city planning, and provide advanced 3D models and visualization techniques. My understanding is that most city brain projects achieve… none of these, despite the fact that cities pay the equivalent of tens to hundreds of millions of dollars for just one of these solutions. An interesting side note - a recruiter once tried getting me to lead Yiwu’s city brain project, but it fell through after he discovered I wasn’t a Chinese citizen (these projects, for obvious reasons, strictly prohibit participation from non-Chinese citizens). An image I found of Pudong District's (Pudong is a district in Shanghai, home to Shanghai Pudong International Airport i.e. PVG) city brain platform via a Baidu search. Although it looks fancy, there is really little to no new underlying technology behind these systems. 
You might wonder how China’s economy is able to grow at such a blistering pace despite the huge number of arguably inefficient government programs. The answer is rooted in East Asian culture: work ethic. Blue collar Chinese workers are willing to work 60+ hour weeks while sustaining themselves on ramen and $1.50 cigarette packs every day just to ensure their kids can get the best education and an improved quality of life. The whole concept of 996 is rooted in the Confucian ideals of hard work and industriousness. The “laziest” men and women in China are arguably owners of small- to mid-size businesses; they are often the last to arrive and first to leave from work. The CPC loves to take credit for China’s recent growth, but the reality is that the growth was the result of Chinese work ethic plus a switch from central planning to a mixed economy. By industriousness, I really do mean everybody. In 2019, I visited a prison in Jiangxi to discuss a potential prisoner safety solution. In a meeting with the vice-warden, he tacitly mentioned how Adidas shoes were being made in the prison that he was running. We quickly pulled out of that project. I haven’t bought Adidas- or Nike-branded shoes since1. Personal identity With the current political climate and state of affairs in mainland China, many Gen Z-ers and Millennials who hail from mainland China (mostly from Guangdong Province, as I consider Macau, Taiwan, and Hong Kong to be separate territories) don’t refer to themselves as Chinese, instead calling themselves Cantonese. While some simply wish to preserve personal identity, there are also many who dissociate themselves simply because they believe the rest of China to be inferior. I’ve heard some of the most asinine reasons - people spit too often in the streets, everybody plays loud Douyin/TikTok videos while riding high-speed rail, too many cigarette smokers, etc. 
These are the same people who conveniently forget that some sidewalks along the Mission are lined with old discarded chewing gum, that loud music is played frequently on BART or in a BART station, or that open drug usage occurs nightly in the Tenderloin. I strongly dislike the CPC, but have immense love for Chinese people and Chinese culture. China is a super-massive collection of people that, in my eyes, have made incredible economic and social progress since my birth year, and will continue to do so in the decades ahead. And as a result of all of this, I’m proud to call myself Chinese American. Wrapping up Entire dissertations could be dedicated to each of the above sections, but I wanted to highlight misconceptions and some other bits of information that might not be as readily accessible. In particular, the previous section is by no means a comprehensive list of social issues that China is facing, but rather a brief summary of things that might not be too well understood in the West. #MeToo2, a declining natural birth rate, and racial divisions are just a small number of similar/parallel issues that are happening in both America and China. If you made it this far, thanks for reading. This post has been a bit rambly and all over the place, but the next couple should hopefully be a bit more focused. If you liked this article and are an open-source developer like myself, please give the Towhee project a star on GitHub as a show of support. In part II, I’ll cover the Chinese tech scene, from 996’ing to the open source community. Stay tuned! Forced labor in Xinjiang has made headlines in recent months, but in reality, it happens everywhere in China. ↩ Justice for Zhou Xiaoxuan. ↩
More in AI
Video Friday is your weekly selection of awesome robotics videos, collected by your friends at IEEE Spectrum robotics. We also post a weekly calendar of upcoming robotics events for the next few months. Please send us your events for inclusion. RoboCup German Open: 12–16 March 2025, NUREMBERG, GERMANY German Robotics Conference: 13–15 March 2025, NUREMBERG, GERMANY European Robotics Forum: 25–27 March 2025, STUTTGART, GERMANY RoboSoft 2025: 23–26 April 2025, LAUSANNE, SWITZERLAND ICUAS 2025: 14–17 May 2025, CHARLOTTE, NC ICRA 2025: 19–23 May 2025, ATLANTA, GA London Humanoids Summit: 29–30 May 2025, LONDON IEEE RCAR 2025: 1–6 June 2025, TOYAMA, JAPAN 2025 Energy Drone & Robotics Summit: 16–18 June 2025, HOUSTON, TX RSS 2025: 21–25 June 2025, LOS ANGELES ETH Robotics Summer School: 21–27 June 2025, GENEVA IAS 2025: 30 June–4 July 2025, GENOA, ITALY ICRES 2025: 3–4 July 2025, PORTO, PORTUGAL IEEE World Haptics: 8–11 July 2025, SUWON, KOREA IFAC Symposium on Robotics: 15–18 July 2025, PARIS RoboCup 2025: 15–21 July 2025, BAHIA, BRAZIL Enjoy today’s videos! We’re introducing Helix, a generalist Vision-Language-Action (VLA) model that unifies perception, language understanding, and learned control to overcome multiple longstanding challenges in robotics. This is moderately impressive; my favorite part is probably the hand-offs and that extra little bit of HRI with what we’d call eye contact if these robots had faces. But keep in mind that you’re looking at close to best case for robotic manipulation, and that if the robots had been given the bag instead of well-spaced objects on a single color background, or if the fridge had a normal human amount of stuff in it, they might be having a much different time of it. Also, is it just me, or is the sound on this video very weird? Like, some things make noise, some things don’t, and the robots themselves occasionally sound more like someone just added in some ‘soft actuator sound’ or something. 
Also also, I’m of a suspicious nature, and when there is an abrupt cut between ‘robot grasps door’ and ‘robot opens door,’ I assume the worst. [ Figure ] Researchers at EPFL have developed a highly agile flat swimming robot. This robot is smaller than a credit card, and propels on the water surface using a pair of undulating soft fins. The fins are driven at resonance by artificial muscles, allowing the robot to perform complex maneuvers. In the future, this robot can be used for monitoring water quality or help with measuring fertilizer concentrations in rice fields [ Paper ] via [ Science Robotics ] I don’t know about you, but I always dance better when getting beaten with a stick. [ Unitree Robotics ] This is big news, people: Sweet Bite Ham Ham, one of the greatest and most useless robots of all time, has a new treat. All yours for about $100, overseas shipping included. [ Ham Ham ] via [ Robotstart ] MagicLab has announced the launch of its first generation self-developed dexterous hand product, the MagicHand S01. The MagicHand S01 has 11 degrees of freedom in a single hand. The MagicHand S01 has a hand load capacity of up to 5 kilograms, and in work environments, can carry loads of over 20 kilograms. [ MagicLab ] Thanks, Ni Tao! No, I’m not creeped out at all, why? [ Clone Robotics ] Happy 40th Birthday to the MIT Media Lab! Since 1985, the MIT Media Lab has provided a home for interdisciplinary research, transformative technologies, and innovative approaches to solving some of humanity’s greatest challenges. As we celebrate our 40th anniversary year, we’re looking ahead to decades more of imagining, designing, and inventing a future in which everyone has the opportunity to flourish. 
[ MIT Media Lab ] While most soft pneumatic grippers that operate with a single control parameter (such as pressure or airflow) are limited to a single grasping modality, this article introduces a new method for incorporating multiple grasping modalities into vacuum-driven soft grippers. This is achieved by combining stiffness manipulation with a bistable mechanism. Adjusting the airflow tunes the energy barrier of the bistable mechanism, enabling changes in triggering sensitivity and allowing swift transitions between grasping modes. This results in an exceptional versatile gripper, capable of handling a diverse range of objects with varying sizes, shapes, stiffness, and roughness, controlled by a single parameter, airflow, and its interaction with objects. [ Paper ] via [ BruBotics ] Thanks, Bram! In this article, we present a design concept, in which a monolithic soft body is incorporated with a vibration-driven mechanism, called Leafbot. This proposed investigation aims to build a foundation for further terradynamics study of vibration-driven soft robots in a more complicated and confined environment, with potential applications in inspection tasks. [ Paper ] via [ IEEE Transactions on Robots ] We present a hybrid aerial-ground robot that combines the versatility of a quadcopter with enhanced terrestrial mobility. The vehicle features a passive, reconfigurable single wheeled leg, enabling seamless transitions between flight and two ground modes: a stable stance and a dynamic cruising configuration. [ Robotics and Intelligent Systems Laboratory ] I’m not sure I’ve ever seen this trick performed by a robot with soft fingers before. [ Paper ] There are a lot of robots involved in car manufacturing. Like, a lot. [ Kawasaki Robotics ] Steve Willits shows us some recent autonomous drone work being done at the AirLab at CMU’s Robotics Institute. [ Carnegie Mellon University Robotics Institute ] Somebody’s got to test all those luxury handbags and purses. 
And by somebody, I mean somerobot. [ Qb Robotics ] Do not trust people named Evan. [ Tufts University Human-Robot Interaction Lab ] Meet the Mind: MIT Professor Andreea Bobu. [ MIT ]