I've always been a fan of the Lego Technic series, especially those models that have gears and cranks and moving parts. But it seems that Lego is shifting the focus of the Technic series away from functional models, so I had to take matters into my own hands. I think an orrery is the perfect project to build out of Lego Technic parts, as it makes for a cool display set and is functional at the same time.

Introduction

An orrery is a scientific instrument that models the motion of the celestial bodies. What it models are typically the orbital period of a body and the period of its rotation around its own axis. For example, the time it takes the moon to travel around the earth once is ~27 times the time it takes the earth to rotate around its own axis (a day). This relation is accurately modelled in an orrery. However, an orrery does not model distances or the relative sizes of the celestial bodies. I was very excited when I saw this Earth, Moon and Sun Orrery design by JK Brickworks in 2016: ...
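As a rough illustration of the kind of arithmetic behind an orrery (and not the actual gear train of this model), here is a small Python sketch that searches pairs of gears for a reduction close to the moon's ~27.3:1 orbit-to-day ratio. The gear tooth counts and the two-stage layout are my own assumptions for the sketch:

```python
from itertools import product

# A few common Lego Technic gear tooth counts (assumption for illustration).
GEARS = [8, 12, 16, 20, 24, 36, 40]

# Sidereal month in days: the moon orbits the earth roughly every 27.3 days.
TARGET_RATIO = 27.32

def best_two_stage_ratio(gears, target):
    """Search two-stage gear trains (driver/follower pairs) for the
    reduction ratio closest to the target."""
    best = None
    for d1, f1, d2, f2 in product(gears, repeat=4):
        ratio = (f1 / d1) * (f2 / d2)
        error = abs(ratio - target)
        if best is None or error < best[0]:
            best = (error, ratio, (d1, f1), (d2, f2))
    return best

error, ratio, stage1, stage2 = best_two_stage_ratio(GEARS, TARGET_RATIO)
print(f"stage 1 {stage1}, stage 2 {stage2} -> ratio {ratio:.2f} (target {TARGET_RATIO})")
```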
8 months ago

More from Marian's Blog

Generating an infinite world with the Wave Function Collapse algorithm

This article describes how I generate an infinite city using the Wave Function Collapse algorithm in a way that is fast, deterministic, parallelizable and reliable. It's a follow-up to my 2019 article on adapting the WFC algorithm to generate an infinite world. The new approach presented in this article removes the limitations of my original implementation. I first mentioned these ideas in this Twitter thread.

Objective

The goal is to procedurally generate a 3D environment by placing human-designed blocks on a 3D grid. The blocks need to be placed in accordance with given adjacency constraints. For each of the 6 sides of each block, some information about the face and its symmetry is used to generate a list of possible neighbors. This is different from the original formulation of the WFC algorithm, where the possible blocks, their adjacency rules and their spawn probabilities are extracted automatically from an example texture. In this improved version, the generation method is robust enough to be shipped in a commercial game, so it needs to be reliable, fast and allow for artistic control over the result.

Wave Function Collapse

This article is aimed at readers who already know how the WFC algorithm works, but here is a brief recap. Remember, I'm skipping the part where blocks are extracted from an example texture and I'm only using the part where we generate a new "texture". (The original WFC implementation is by Maxim Gumin on Github.) The algorithm starts with an array of slots (the "wave function"), where nothing is decided. Each slot has a list of possible blocks (or "modules") that can be placed there, and in the starting state, each list contains all modules. The algorithm then alternates between collapse steps and constraint propagation steps until the map is fully collapsed or until it has reached a dead end.

Collapse step

We pick the slot with the lowest entropy and collapse it. That means we pick one of the modules from that slot's list of possible modules and set it as the selected module for this slot. Intuitively, the "slot with the lowest entropy" is the one with the least amount of choice. If all modules have the same spawn probability, the slot with the fewest possible modules is the one with the lowest entropy.

Constraint propagation

Collapsing a slot effectively shrinks the list of possible modules of that slot to 1. The constraint propagation step propagates this information through the map by removing modules from the respective lists of other slots that relied on a different choice for the slot we just collapsed. Constraint propagation is the most compute-intensive part of the WFC algorithm.

End

The algorithm terminates when all slots are collapsed, which means success, or when the list of possible modules for any slot is empty. In that case the procedure has failed and one could backtrack or start over.
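To make the recap above concrete, here is a minimal Python sketch of the collapse/propagation loop. It is a simplified illustration, not the implementation used in this project: module compatibility is reduced to a generic compatible(a, b, direction) callback, all modules are assumed equally likely, and a dead end simply raises an exception instead of backtracking.

```python
import random

def run_wfc(slots, neighbors, compatible):
    """slots: dict slot_id -> set of possible modules.
    neighbors: dict slot_id -> list of (direction, other_slot_id).
    compatible(a, b, direction): True if module b may sit next to module a."""
    while any(len(options) > 1 for options in slots.values()):
        # Collapse step: pick the undecided slot with the fewest options
        # (lowest entropy, assuming uniform module probabilities).
        slot = min((s for s in slots if len(slots[s]) > 1), key=lambda s: len(slots[s]))
        slots[slot] = {random.choice(sorted(slots[slot]))}

        # Constraint propagation: remove modules that no longer have any
        # compatible option in a neighboring slot, and keep propagating.
        stack = [slot]
        while stack:
            current = stack.pop()
            for direction, other in neighbors[current]:
                allowed = {m for m in slots[other]
                           if any(compatible(c, m, direction) for c in slots[current])}
                if not allowed:
                    raise RuntimeError("dead end, restart or backtrack")
                if allowed != slots[other]:
                    slots[other] = allowed
                    stack.append(other)
    return {s: next(iter(m)) for s, m in slots.items()}
```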
The original approach and its limitations (skip this section if you just want to know the solution)

WFC is usually applied to finite maps that can be stored in an array. In my original post, I described why I thought it would be impractical to do a chunk-based WFC implementation. (I had not figured out how to avoid the problem that constraints need to be propagated across chunk boundaries.) Instead, I stored the map in a dictionary, where new slots would be allocated when they were needed or when they were touched by constraint propagation. That means that even to generate a small area of the map, a large cloud of slots around that area would be allocated, since constraint propagation could "bounce back" into the area we're interested in. Problems of that approach include:

Non-determinism: The result of the generation depends on the order in which parts of the map are generated (and thus on the path the player takes).
Memory leak: We can't release the memory used to generate a part of the world when the player leaves, since we don't know at what point distant slots no longer have an effect on local world generation.
Reliability: The longer you walk around, the higher the chance becomes that the WFC algorithm runs into a dead end and is unable to continue generating the map.
Single-threaded: Since there are no chunks, all operations on the map data structure need to be sequential and can't run in multiple threads. In practice, the map height had to be limited so that the map generation was fast enough.

My implementation of this flawed approach is still available on Github and a playable demo is on itch.io. If you want to implement your own WFC algorithm, you shouldn't do it like that though!

Chunk-based WFC

The idea is to start with a simple pre-generated, tiling map and generate "fitting" replacements at runtime. However, we do this at an offset so that the seam (which would otherwise look the same for each chunk) is replaced. In this section, I'll explain in detail what that means. This solution is a refinement of ideas proposed by Paul Merrel and BorisTheBrave.

We start by generating a simple, finite, tiling, 8x8x8 map. This is done offline. We use a small subset of the available modules to make it as simple as possible. The map is tiling in the sense that the boundaries on opposite sides match, so copies of this map could be placed next to each other seamlessly. Doing that would look like this:

Generating a tiling map is done by "wrapping around" the constraint propagation at the map boundary. In a finite map, when we propagate a constraint to a slot outside the map, we discard that information. In a tiling map, the slot on the opposing map boundary is treated as if it were a neighbor.

Next, we pre-generate a set of replacements for our starting map. We use the boundary of the starting map as a boundary constraint to generate these replacements. For any slots on the boundary of these new maps, we only allow modules with a matching profile for those sides that face the map boundary. This means that we can "swap out" our starting map with any of the pre-generated patches without creating a visible seam.

Now we can randomly choose from our collection of pre-generated patches at runtime and we have a simple chunk-based infinite world generator: Note that we're not doing any WFC generation at runtime yet, we're just placing pre-generated 8x8x8 block patches. Since all these patches have a matching boundary, we can spot this chunk boundary as an unnatural pattern in the generated world.

Now for the important part: At runtime, we generate an 8x8x8 replacement map for each chunk, but we do it at a 4-block offset in both horizontal directions. The starting point for each chunk's generation is made up of the four pre-generated patches that touch it. The replacement map we generate at runtime has a boundary constraint to "fit in", just like our pre-generated patches. However, due to the offset, the boundary that is shared between all pre-generated patches is replaced at runtime, and the area that is different in every pre-generated patch remains unchanged during the runtime generation. (This is needed so that neighbor chunks can be generated independently from each other.) If this replacement map fails to generate, we just copy the blocks from the starting patches. Here is the result of that: Notice how the chunk boundary artifacts from the previous screenshot are gone!

Consider this drawing, where the gray blocks are one chunk (seen from above). We determine the four starting patches that overlap this chunk (the blue boxes). This needs to be random but deterministic, since neighbor chunks will need to use the same information. We query the pre-generated patches at the boundary of the chunk (shown in green) and use this as the boundary constraint for the generation of the chunk. The green boundary area will stay the same during runtime generation, but this looks ok due to the variance in the pre-generated patches. The blue boundary is the same for each pre-generated patch, but will be replaced at runtime.

Note how this has the properties we want: Each chunk can be generated deterministically and independently from other chunks. If the generation for one chunk fails, we simply copy the blocks from the starting patches.
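Because neighboring chunks must agree on which starting patches they share, the patch choice has to be a pure function of the patch coordinates and the world seed. Here is a small sketch of how such a deterministic-but-random selection could look; the hashing scheme and function names are my own illustration, not the code used in the project:

```python
import hashlib

def patch_index(patch_x, patch_z, world_seed, patch_count):
    """Deterministically map a starting-patch coordinate to one of the
    pre-generated patches. Every chunk that touches this patch position
    computes the same index, so neighbors stay consistent."""
    key = f"{world_seed}:{patch_x}:{patch_z}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "little") % patch_count

def starting_patches_for_chunk(chunk_x, chunk_z, world_seed, patch_count):
    """The chunk at (chunk_x, chunk_z) is generated at a half-patch offset,
    so it overlaps the four starting patches at these coordinates."""
    return [patch_index(chunk_x + dx, chunk_z + dz, world_seed, patch_count)
            for dx in (0, 1) for dz in (0, 1)]

# Example: the four patch indices used to seed chunk (10, -3).
print(starting_patches_for_chunk(10, -3, world_seed=42, patch_count=16))
```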
Using a heightmap

In this section, I'll explain how to generate a world in the shape of an arbitrary heightmap. Consider an integer heightmap where the difference between two adjacent points is always one: the next point is either one above or one below, but never at the same level or anywhere else. Each 2x2 cell in that heightmap has one of six possible shapes. For each of these six shapes, we pre-generate a set of starting patches. These starting patches are no longer tiling in the classic sense. Instead, each side matches the opposite side with a vertical offset.

With our special integer heightmap where adjacent points always have a difference of 1, we will now generate one chunk for each point in the heightmap. Our query point has four adjacent 2x2 cells. For each 2x2 cell, we determine which of the six possible shapes it has and pick a pre-generated starting patch from the respective collection. Then, we generate a replacement map as explained in the previous section.

Here is an example of the heightmap in engine, where each chunk is represented as one flat quad: This mesh is used to render the world far away from the camera. I added a "city like" texture and some billboard-rendered fake buildings. In the foreground, you can see the actual chunks generated by the algorithm:

Okay, now we know how to turn our integer heightmap into a cool looking infinite WFC world, but how do we get that integer heightmap in the first place? How do we get a function that generates integer elevations where the vertical difference between two adjacent points is always 1 or -1, but never 0 or anything else?

We start with a target function that doesn't have this property. In my case, I'm using 8 octaves of Perlin noise, but any heightmap can be used here. Then, we use an elaborate clamping process to force the step constraint onto our target function. It works in a hierarchical way, similar to descending down a quadtree. We start with a relatively large square (the root of the quadtree) and evaluate our target function for the four corners. Then, we generate the heightmap value on the edge centers and the square center by querying the target function and clamping the value to fulfil our slope constraint. The slope constraint requires that the vertical difference is less than or equal to the horizontal difference. If our query point is inside any of the four quadrants, we repeat this process for the respective quadrant (descending the quadtree). If our query point is one of the points we just calculated, we're done. The hierarchical nature of this approach means that it lends itself very well to caching.

Here is a 2D visualization of the process: The blue line shows the target function. At every descent step down the quadtree, new limits are introduced to adhere to the slope constraint (shown as black lines). The orange dots are the values of our resulting heightmap.
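The clamping process is easiest to see in one dimension. Below is a hedged Python sketch of the idea for a single root interval: values are cached per interval, each midpoint is clamped so that its vertical difference to the already-fixed endpoints never exceeds the horizontal distance, and a query descends until it hits an evaluated point. It only illustrates the slope constraint; the final snapping to exact ±1 integer steps and the 2D quadtree version are omitted, and the noise function is a stand-in:

```python
import math
from functools import lru_cache

ROOT_SIZE = 256  # one root interval [0, ROOT_SIZE]; must be a power of two

def target(x):
    """Stand-in for the real target heightmap (e.g. layered Perlin noise)."""
    return 6.0 * math.sin(x * 0.05) + 3.0 * math.sin(x * 0.21)

@lru_cache(maxsize=None)
def interval_values(left, size):
    """Heights at the endpoints and center of [left, left + size], clamped so
    the vertical difference between evaluated points never exceeds their
    horizontal distance (the slope constraint)."""
    if size == ROOT_SIZE:
        a = target(left)
        b = min(max(target(left + size), a - size), a + size)
    else:  # reuse the two already-fixed values from the parent interval
        parent_left = (left // (2 * size)) * (2 * size)
        pa, pmid, pb = interval_values(parent_left, 2 * size)
        a, b = (pa, pmid) if left == parent_left else (pmid, pb)
    half = size // 2
    mid = min(max(target(left + half), max(a, b) - half), min(a, b) + half)
    return a, mid, b

def height(x):
    """Descend the hierarchy until x coincides with an evaluated point."""
    left, size = 0, ROOT_SIZE
    while True:
        a, mid, b = interval_values(left, size)
        if x == left:
            return a
        if x == left + size:
            return b
        if x == left + size // 2:
            return mid
        size //= 2
        if x > left + size:
            left += size

# Example: sample the clamped heightmap at a few points in the root interval.
print([round(height(x), 2) for x in range(0, 257, 32)])
```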
Outlook and notes

Each chunk can be generated independently, which makes it easy to parallelize the computation required to generate the world. In my case, I'm using Unity's Burst compiler to do the runtime generation.

By varying the module probabilities for different areas of the map, I can generate different biomes. Here is an example of a biome boundary: The biome on the left spawns copper roofs and bridges, the one on the right spawns tiled roofs and arches. On the boundary, there is one row of chunks where modules from both biomes can spawn, creating a natural looking transition.

I also want to mention some progress on this project that is unrelated to the WFC algorithm. Since my last blog post in 2019, I've created lots of new blocks, textured them, added a water plane and added procedurally generated trees and climbing plants. The trees and plants are generated at runtime using the Space Colonization algorithm and adapt to the geometry of the world. The next challenge is to come up with interesting gameplay ideas for this project.

a year ago 16 votes
Adversarial Generation of Continuous Implicit Shape Representations

This article provides an overview of the paper "Adversarial Generation of Continuous Implicit Shape Representations", which I co-authored with Matthias Fey. While the paper focuses on the theoretical aspects, I'll provide a higher-level explanation and some visualizations here on the blog. In the paper, we propose a GAN that generates 3D shapes. The GAN uses a DeepSDF network as a generator and either a 3D CNN or a Pointnet as the discriminator.

Update June 2020: Here is a recording of the presentation I gave about the paper at the virtual Eurographics 2020 conference.

DeepSDF

First, I'll introduce the idea behind DeepSDF. The standard way of representing 3D shapes in deep learning models uses voxel volumes. They are a generalization of images to 3D space and use voxels instead of pixels. With this 3D CNN approach, concepts from deep learning for images can be applied to 3D shapes. However, CNNs are well suited for learning texture properties, and learning to represent a 3D shape as a combination of 3D texture patches is not ideal. Furthermore, due to the memory requirements of this approach, high voxel resolutions are not feasible.

A voxel volume contains a rasterized representation of the signed distance field of the shape. The signed distance function maps a point in 3D space to a scalar signed distance value. The idea behind the DeepSDF network is to train a neural network to predict the value of the signed distance directly for an arbitrary point in space. Thus, the network learns a continuous representation instead of a rasterized one. An SDF network for a single shape has three input neurons and one output neuron. The DeepSDF network doesn't use convolutions. To learn a representation of multiple shapes, the DeepSDF network receives a latent code as an additional input. The decision boundary of the network is the surface of the learned shape. For a given latent code, a mesh can be created using Marching Cubes by evaluating the network for a raster of points. The resolution of that raster can be selected arbitrarily after the network has been trained.

The DeepSDF network is trained on a dataset of 3D points with corresponding SDF values. These points are in part uniformly sampled and in part normally distributed around the shape surface, resulting in a high density of training data near the surface. Generating SDF data for the ShapeNet dataset is quite challenging because the dataset contains non-watertight meshes. I made my own implementation of the approach proposed in the DeepSDF paper, as well as a slightly different approach, and published this project as a python module.

I created my own implementation of the DeepSDF network and trained it on the ShapeNet dataset. The DeepSDF autodecoder works like an autoencoder, but without the encoder. The latent codes are assigned randomly and then optimized during training using SGD. In the animation above, you see a t-SNE plot of the learned latent space of the DeepSDF autodecoder for the ShapeNet dataset. The colors show dataset categories which are not known to the network. It has learned on its own to assign similar latent codes to shapes of the same category. The shapes on the left are reconstructions of latent codes along a random spline path through the latent space. Click here for a high resolution version of the animation.
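For illustration, here is a minimal PyTorch sketch of a DeepSDF-style network: a fully connected net that takes a latent code concatenated with a 3D query point and outputs a single signed distance, evaluated on a raster of points for Marching Cubes. The layer sizes and activation choices are placeholders, not the architecture from the paper:

```python
import torch
import torch.nn as nn

class DeepSDFLike(nn.Module):
    """Maps (latent code, xyz point) -> signed distance. No convolutions."""
    def __init__(self, latent_size=128, hidden=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(latent_size + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),  # clipped SDF in [-1, 1]
        )

    def forward(self, latent_codes, points):
        # latent_codes: (N, latent_size), points: (N, 3)
        return self.layers(torch.cat([latent_codes, points], dim=1)).squeeze(1)

# Evaluate one shape (one latent code) on a raster of points for Marching Cubes.
model = DeepSDFLike()
resolution = 32
coords = torch.linspace(-1, 1, resolution)
grid = torch.stack(torch.meshgrid(coords, coords, coords, indexing="ij"),
                   dim=-1).reshape(-1, 3)
latent = torch.randn(1, 128).expand(grid.shape[0], -1)
with torch.no_grad():
    sdf_volume = model(latent, grid).reshape(resolution, resolution, resolution)
```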
GAN

So far, I have introduced the DeepSDF autodecoder that learns to reconstruct a given dataset of shapes. The contribution of our paper is to propose a GAN architecture that trains a DeepSDF network to generate new shapes.

A generative adversarial network is a pair of a generator network and a discriminator network. The generator proposes new samples (in our case shapes) and the discriminator predicts whether a given sample was generated by the generator ("fake") or taken from the dataset ("real"). The trick is that the generator is improved using the gradient of the discriminator output, while the discriminator is trained on the dataset of real samples and the increasingly realistic fake samples provided by the generator. Thus, the generator and discriminator have adversarial goals. If the GAN training is successful, the GAN reaches an equilibrium where the generator has learned a mapping from the latent distribution to the underlying distribution of the dataset and the discriminator assesses generated and real samples with the same score.

3D GANs based on 3D CNNs have been proposed. Our research question was whether a GAN can be trained where the generator uses the DeepSDF architecture. Usually, the discriminator of a GAN is a mirrored version of the generator. In the case of the DeepSDF network, this is not feasible because a single sample of the DeepSDF network provides only the SDF for one point. From one point alone, a discriminator could not assess whether the sample is realistic. Instead, the discriminator needs multiple points to judge the output value in context.

Voxel discriminator

One solution to the problem of missing context is to use a 3D CNN as the discriminator. In this case, the generator is evaluated for a batch of raster points and the resulting SDF values are rearranged into a voxel volume. The idea to combine a continuous implicit shape network with a 3D CNN was proposed by Chen and Zhang for their autoencoder. The training data for the voxel discriminator are voxel volumes. We use datasets with resolutions of 8³, 16³, 32³ and 64³ and apply progressive growing. We use the Wasserstein distance with gradient penalty (WGAN-GP). Here is a latent space interpolation of shapes generated with the voxel discriminator:

Another interesting observation is the generator's ability to generalize from the raster points to intermediate points that it was not trained on. Here is an example of a network that was only trained on 16³ voxels (!), which is the first stage of the progressive growing. On the left, the shapes were reconstructed with 16³ raster points (the same resolution it was trained on). Scaling up the resolution linearly makes the result smoother, but doesn't add any detail. When querying the network at 128³, we see that the network is able to generalize to higher resolutions. For some parts of the geometry, the low resolution volume has no sample points with negative values, resulting in apparently missing geometry. In these cases, the network still generates negative SDFs for intermediate points, thus making the geometry appear when sampled at a higher resolution.
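A hedged sketch of the voxel-discriminator setup described above: the implicit generator (such as the DeepSDFLike network sketched earlier) is evaluated on a fixed raster of points and the outputs are reshaped into a voxel volume for a small 3D CNN critic. The network sizes are placeholders and the WGAN-GP training loop is omitted:

```python
import torch
import torch.nn as nn

class VoxelDiscriminator(nn.Module):
    """A small 3D CNN that scores a voxelized SDF volume (WGAN-style critic)."""
    def __init__(self, resolution=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(16, 32, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.score = nn.Linear(32 * (resolution // 4) ** 3, 1)

    def forward(self, volumes):  # volumes: (batch, 1, R, R, R)
        features = self.conv(volumes).flatten(start_dim=1)
        return self.score(features)

def generate_fake_volumes(generator, latent_codes, resolution=16):
    """Evaluate the implicit generator on a raster and reshape to voxels."""
    coords = torch.linspace(-1, 1, resolution)
    grid = torch.stack(torch.meshgrid(coords, coords, coords, indexing="ij"),
                       dim=-1).reshape(-1, 3)          # (R^3, 3)
    volumes = []
    for code in latent_codes:                          # one shape per latent code
        codes = code.unsqueeze(0).expand(grid.shape[0], -1)
        sdf = generator(codes, grid)                   # (R^3,)
        volumes.append(sdf.reshape(1, resolution, resolution, resolution))
    return torch.stack(volumes)                        # (batch, 1, R, R, R)
```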
Pointnet discriminator

Since the voxel-based discriminator keeps some of the disadvantages of fully voxel-based GANs, we investigated an approach that doesn't use voxels at all. The Pointnet is a neural network architecture that can operate on point clouds. In our case, we sample a point cloud of uniform points and use the generator to predict SDFs for the points. The Pointnet then receives the positions of the points and the signed distance values as a "feature vector" with one element. This way, we avoid the fixed raster points and use constantly changing query points. A Pointnet typically infers information from the spatial structure of the points, which in our case is random. Regardless, we found that the Pointnet can be used as the discriminator in our case. To train the GAN with the Pointnet discriminator, we use a ground truth dataset of uniformly sampled points with their corresponding SDFs. We "grow" the point clouds during training simply by increasing the number of points.

When the surface of the generated SDF is reconstructed using Marching Cubes, only values close to sign changes matter. We would like the network to spend more model capacity on values close to the surface, as they influence the result the most. We achieved that by refining the network with additional sample points close to the surface. The gradient of the SDF gives us the direction towards the shape's surface, and for a neural network the gradient is easily computed. Using that, we can move randomly sampled points closer to the surface of the generated shape. The non-uniform point cloud can then be evaluated by the discriminator. Since the Pointnet takes the positions of the points into account, it could discern a uniformly sampled point cloud from a surface point cloud. Thus, we add surface points to the ground truth data as well. Here are some examples of shapes generated with the Pointnet discriminator and uniformly sampled points (top) and shapes generated with the refinement method (bottom). (I'll replace this with an animated version soon.)

For both approaches, we found a surprising result when trying different levels for Marching Cubes. By default, Marching Cubes reconstructs the isosurface at the level 0, i.e. the surface where the predicted SDF is zero. However, we can choose another level, effectively eroding or dilating the shape. We would expect that the ideal level would be 0, since that is the isosurface of the training data. However, when experimenting with different values, we observed that some features appear to be missing at level 0. We chose an isosurface level of 0.04 to reduce missing geometry at the cost of slightly rounded corners. Since we clip the ground truth SDF at -0.1 and 0.1, the isosurfaces of the generated SDF outside of that range are not usable.

The source code for this project is available on Github.
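As a closing aside, the surface-refinement step described above (moving random sample points toward the generated surface along the SDF gradient) can be sketched in a few lines of PyTorch. This is my own illustrative version, not the paper's code; it assumes a generator with the same (latent codes, points) interface as the earlier sketches:

```python
import torch

def move_points_to_surface(generator, latent_codes, points, steps=3):
    """Pull sample points toward the zero isosurface of the generated SDF by
    repeatedly stepping against the gradient, scaled by the predicted distance."""
    for _ in range(steps):
        points = points.detach().requires_grad_(True)
        sdf = generator(latent_codes, points)           # (N,) signed distances
        grad, = torch.autograd.grad(sdf.sum(), points)  # direction of increasing SDF
        normals = grad / (grad.norm(dim=1, keepdim=True) + 1e-8)
        # A point with SDF d is (roughly) d away from the surface along the normal.
        points = points - normals * sdf.detach().unsqueeze(1)
    return points.detach()
```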

over a year ago 11 votes
What I learned from building autonomous model race cars for a year

I was part of a university project group that develops autonomous model race cars. We were a group of twelve students working on the project part-time for a year. We were provided with a car that meets the requirements for the F1/10th competition. Even though competing in F1/10th was not our goal, we kept the rules of the competition in mind. We focussed mostly on trying different driving algorithms, which I'll explain below. The software we developed is available on Github. This post is about what we did and what insights I gained from this work.

Hardware

Our car is a regular 1/10 scale RC car where the RC receiver is replaced with a computer. The car is based on a Traxxas Ford Fiesta ST Rally. The computer is an Nvidia Jetson TX2. The car is equipped with a LIDAR scanner with a resolution of 1 ✕ 1080, an IMU and a stereo camera. We ended up not using the stereo camera.

ROS

Our software stack is based on ROS and Gazebo. We used ROS and Gazebo because they are the standard for academic robotics projects, and most online materials assume that your project uses ROS. ROS is a robotics framework that facilitates communication between robot components such as sensors and actuators. All the hardware components we use have ROS nodes that publish or subscribe to data in a standardized way. ROS provides a rich ecosystem of software: for all kinds of robot parts and control problems there are ROS nodes available, and you can get pretty far by just tying together existing ROS packages. ROS nodes can be written in C++ and Python 2.7, and since each node is its own process, both languages can be used at the same time (a minimal example node is sketched below). A ROS architecture is built like a distributed system with nodes that subscribe and publish to topics. This is the case even if everything runs on a single computer, like in our project, which creates some overhead. Using ROS means that you'll write lots of configuration files, package definition files, launch files and robot description files that all reference each other. With Gazebo, you'll also need mesh description XML files and world definition files. However, I think that using ROS is worth it for the ecosystem of existing ROS modules.

Gazebo

We use Gazebo as a simulation environment. Our robot car can drive in the real world and in the Gazebo simulation using the same software. Having this simulation turned out to be extremely useful. Since we mostly used the Lidar, the simulated track is designed with walls enclosing the track. The visual appearance and realism were less important for our use case.

What follows next is my rant about Gazebo. Gazebo is bad software in every way. It is poorly documented, buggy and missing important features. For example, about 20% of the time, Gazebo doesn't launch at all and crashes instead. In the video above, you can see that the car receives shadows, but the track doesn't. See how the car isn't shiny? That's because Gazebo's lighting model doesn't support metallicity. Getting it to render the scene even this decently was super difficult; it would be trivial to get way better looks in an engine like Unity. These are just a few of the many problems we had with Gazebo. Seemingly trivial things are unnecessarily difficult. My takeaway is: avoid using Gazebo if at all possible. Use a game engine instead. Getting a working game engine to simulate a robotics scenario is way easier than getting Gazebo to do what it's supposed to do. For example, there is a project to let ROS communicate with the Unity engine. This is what you should use instead; it will save you a lot of headaches. There are some features in Gazebo specific to robot simulation that a game engine doesn't provide, such as a headless mode.
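For readers who haven't used ROS, the sketch below shows roughly what such a node looks like in Python with rospy. It subscribes to a laser scan and publishes a steering value; the topic names, message choice and the trivial "steer away from the nearer side" logic are mine for illustration, not our actual driving node:

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import LaserScan
from std_msgs.msg import Float32

def on_scan(scan, steering_pub):
    """Toy logic: compare the leftmost and rightmost range readings and
    steer away from the closer wall."""
    left, right = scan.ranges[-1], scan.ranges[0]
    command = Float32(data=0.3 if left < right else -0.3)
    steering_pub.publish(command)

def main():
    rospy.init_node("toy_driver")
    steering_pub = rospy.Publisher("/steering", Float32, queue_size=1)
    rospy.Subscriber("/scan", LaserScan, on_scan, callback_args=steering_pub)
    rospy.spin()  # hand control to ROS; callbacks run until shutdown

if __name__ == "__main__":
    main()
```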
Now for the interesting part, the autonomous driving algorithms. All algorithms use only LIDAR and IMU data. We didn't use the camera.

SLAM

SLAM stands for simultaneous localization and mapping. It is the concept that the robot determines its own position in an unknown environment based on a map that it creates while exploring that environment. For autonomous racing, this means that the car generates a map of the racetrack and can then calculate and follow an ideal racing path using that map and its position. Since we use ROS, we can use lots of existing SLAM implementations that are built specifically for ROS. In addition, ROS offers the navigation stack, which has implementations for path planning and execution. But working with SLAM in practice turned out to be difficult. The existing SLAM algorithms are mostly designed for slow robots and stop working at higher speeds. This makes them unsuitable for racing. But seeing the map generate looks pretty cool!

Wallfollowing

Wallfollowing is what we call our greedy driving algorithm. It has no concept of the track or the ideal path. It only considers the part of the track that the LIDAR can see from the current position of the car. Our approach is to separate the laser sample points (shown in red) into a left and a right wall and fit two circles into them. The algorithm calculates a predicted position of the car (in the center of the video) and a desired position (in the center of the track). The difference between them (shown in orange) is used as the error for a PID controller, which controls the steering. To control throttle, we calculate multiple maximum speeds based on the PID error, the curviness of the track and the distance towards the next obstacle (shown in yellow). Out of these maximum speeds, we apply the smallest one. This allows us to slow down in curves, cut corners, etc. Teams competing in the F1/10 competition typically use an approach like this. The wallfollowing approach provided the best lap times in our tests. The videos for Gazebo and for SLAM above show the car being controlled by the Wallfollowing algorithm. Here is a video of the physical car driving with it:
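A rough sketch of the control idea (PID on the lateral error, throttle as the minimum of several speed caps). The gains, speed limits and helper inputs are invented placeholders, not the values or structure of our actual node:

```python
class PID:
    """Textbook PID controller; called once per LIDAR scan with a fixed dt."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.previous_error = 0.0, 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.previous_error) / self.dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

steering_pid = PID(kp=1.2, ki=0.0, kd=0.15, dt=0.025)

def drive_command(lateral_error, curviness, obstacle_distance):
    """lateral_error: predicted minus desired position across the track (meters).
    curviness and obstacle_distance come from the fitted wall circles."""
    steering = steering_pid.step(lateral_error)

    # Several independent speed caps; the smallest one wins.
    speed_caps = [
        6.0,                                      # absolute maximum
        6.0 / (1.0 + 4.0 * abs(lateral_error)),   # slow down when far off the ideal line
        6.0 / (1.0 + 8.0 * curviness),            # slow down in tight curves
        0.8 * obstacle_distance,                  # keep braking distance to obstacles
    ]
    return steering, min(speed_caps)
```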
Reinforcement Learning

One of the methods of autonomous driving we tried during this project group was reinforcement learning. It is an area of machine learning where an agent, in our case a car, learns to select actions given a state in order to optimize a reward function. In particular, we used deep reinforcement learning, where the function that selects an action given a state is a neural network instead of a lookup table. In our case, the state vector was a downscaled version of the Lidar scan. That means each element of the vector contains the measured distance for a fixed direction. We also experimented with other state information, such as velocity and throttle, but this didn't bring any improvement. In our setup, the actions are discrete, meaning that the agent selects an action out of a finite set instead of providing numbers for throttle and steering. In the simplest example, we used a fixed velocity and two actions for steering left and right. Straight driving would be achieved by oscillating between left and right. We also tried more granular action sets, but this increases the difficulty of the learning task.

The neural network has one input neuron for each element of the state vector and one output neuron for each action. It predicts a Q value for each action and the car performs the action with the highest Q value. During training, however, random actions are sometimes taken instead, according to an epsilon-greedy policy (exploration vs. exploitation). For the reward function, it would be possible to use a reward of 1 for each step: there is an implicit reward for not crashing, as this leads to longer episodes and the reward is accumulated over the episode. But the car learns to stay in the center faster if we give it higher rewards for driving close to the center of the track and lower rewards for driving close to the walls. In addition, we reward high speed. We tested Q-Learning and Policy Gradient. Both took hours to learn to drive in the center, but eventually did so decently. Overall, policy gradient worked better than Q-learning. But both were unreliable in the simulation and didn't translate well from simulation to reality. I think that we could have achieved significantly better results if we had had more time. This is a video of the car driving in our simulation with policy gradient:

Evolutionary Algorithm

For the evolutionary algorithm, we took the neural networks from our reinforcement learning approach and learned the network parameters with an evolutionary algorithm. Instead of using a set of actions, we let two output neurons control throttle and steering directly. We started with random weights and randomly changed them while measuring the performance of the car in the simulation. Surprisingly, this worked about as well as reinforcement learning, but was easier to implement and took less time to train. However, like reinforcement learning, it did not perform as well as our wallfollowing algorithm.

Outlook

Since our goal was not primarily to participate in the official competition, we investigated several methods of autonomous driving. The least fancy one, our greedy wallfollowing algorithm, turned out to be the fastest. All of the driving methods could probably be improved with more time. There is a follow-up project group that continues to work with our code base. It looks like they are even working on moving away from Gazebo.
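To make the Q-learning setup described above a bit more concrete, here is a minimal PyTorch sketch of the action-selection part: a small fully connected network mapping the downscaled scan to one Q value per action, with epsilon-greedy exploration. Network size, action set and epsilon are placeholders, not our actual configuration:

```python
import random
import torch
import torch.nn as nn

ACTIONS = ["steer_left", "steer_right"]  # fixed velocity, two steering actions

class QNetwork(nn.Module):
    """One input per downscaled LIDAR ray, one output (Q value) per action."""
    def __init__(self, num_rays=32, num_actions=len(ACTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_rays, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def select_action(q_network, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))  # (1, num_actions)
    return int(q_values.argmax(dim=1).item())

# Example: pick an action for a fake downscaled scan of 32 distances.
state = torch.rand(32) * 10.0
action = select_action(QNetwork(), state, epsilon=0.1)
print(ACTIONS[action])
```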

over a year ago 9 votes
Visualizing 150000 butterflies from the Natural History Museum

Click here for the interactive visualization.

The Natural History Museum in London has a data portal in which it provides digital records for many of its specimens. Some of these records have images. I recently learned how to use machine learning tools such as convolutional neural networks, and I wanted to use the NHM data to see what can be done with these tools. The dataset of butterflies seemed particularly interesting to me because the images are visually interesting, yet they are all similar in that they all contain a butterfly in a canonical pose.

Since I'm only interested in learning the visuals of the butterflies and not the rest of the photographs, such as the labels, I needed to find a way to separate the butterflies from the background. Here, you can see one collection of ~2000 butterflies, which are photographed against a clear background: It is quite easy to remove this background automatically based on contrast. However, besides this collection, there are a lot more images of butterflies available. They can be found by searching for "lepidoptera" in the data portal. This way, I downloaded 30 GB of butterflies. Most of the 150000 images have noisy backgrounds with varying brightness, such as this one:

To separate the butterflies from the backgrounds, I trained a U-Net that predicts for each pixel of the image whether it belongs to the butterfly or the background. It is a convolutional neural network that receives an image as the input and produces a new image as the output, in my case a background-foreground mask. I started by manually creating masks for a few images: I trained the U-Net and tested it on random images from the dataset. From those I selected the images where the classification went poorly and manually added masks for them to the training set. This selection process leads to a ground truth dataset that contains mostly "challenging" items. I ended up with ~300 masks; however, the approach works reasonably well even with fewer images. Since the network needs to make a prediction for every single pixel, it is forced to generalize even when trained on a single image. Using the masks generated by the U-Net, I can remove the background and crop all the images in the dataset:

Next, I created a 128✕128 image with a white background for each butterfly and trained an autoencoder on them. The autoencoder receives an image as the input and produces a reconstructed image as the output. Its loss function is the difference between the input and output, so during training it learns to reduce that difference and create an accurate reconstruction of the input. The autoencoder has a bottleneck in the middle which consists of 128 neurons. This way, it learns to compress each image into a latent vector of 128 numbers. Here are some examples of images reconstructed with the autoencoder:

During 6 hours of training, the network didn't converge, meaning that a better quality could be achieved by training longer. However, for this project I'm mostly interested in the latent space learned by the autoencoder, not the reconstructed images. I used the encoder of the autoencoder to calculate the latent vectors for all images in the dataset. To visualize the latent space, I used t-SNE to transform the latent vectors from 128-dimensional space to a 2D plane. In this t-SNE plot, each dot corresponds to an image in the dataset. The t-SNE transformation places the points so that points with similar latent vectors, meaning they appear similar to the autoencoder, end up close to each other.
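The latent-space plot boils down to a few lines with scikit-learn. This is a hedged sketch, assuming "encoder" is the trained encoder half of the autoencoder and "images" is the array of 128✕128 crops; it is not my actual pipeline code:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_dataset(encoder, images, batch_size=256):
    """Encode all images to 128-dimensional latent vectors, then project the
    latent vectors to 2D with t-SNE for plotting."""
    latents = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        latents.append(encoder.predict(batch))            # (batch, 128)
    latents = np.concatenate(latents, axis=0)

    positions_2d = TSNE(n_components=2, perplexity=30).fit_transform(latents)
    return latents, positions_2d                           # (N, 128), (N, 2)
```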
Next, I wanted to replace the dots in the plot with images of the corresponding butterflies. To prevent overlapping of very close points, I iteratively moved them apart. This was done by defining a "safe" radius and finding all points that have a neighbor within this radius. Each such point is then moved away from its closest neighbor by a small distance. Here is a part of the resulting image, where I also added some drop shadows: We can already see that similar butterflies are clustered together.

However, this t-SNE plot is 131000✕131000 pixels in size. That is 17 gigapixels! To display it, I'm using Leaflet, a javascript library that can render maps. The image is split into tiles with a resolution of 256x256 pixels, which are loaded dynamically by Leaflet. I also created tiles for lower zoom levels, so that people can zoom out like on a map. When the map is zoomed all the way out, the images appear as tiny dots. For the intermediate zoom levels, I rendered a few "representative" images. These are selected by applying a k-means clustering to the t-SNE points. For each cluster, a representative image is rendered at the cluster center. The representative image is selected by calculating the average of the cluster in latent space (not t-SNE space) and finding the butterfly with the latent vector that is closest to that average. I found that when using t-SNE space to determine the representative, an outlier is often selected, which doesn't happen in latent space. The tileset contains 138000 images, which is ~900 MB. This is what it looks like to zoom around in the interactive t-SNE plot: Click here for the interactive version of the t-SNE plot.

I also added the ability to click on the butterflies to show a popup with the original image, the scientific name and some other data. This information is provided in the NHM data portal as a CSV file. I wrote a script that creates a JSON file with the t-SNE positions of the points and the corresponding data. But this metadata file alone is 50 MB in size, so it is not feasible to load it when someone visits the website. Instead, I split the metadata into smaller chunks using a quadtree, much like the images are split into tiles. Now the metadata is loaded only for the regions that the user looks at.

Using the data from these CSV files, I can color the dots in the t-SNE plot according to the genus of the butterflies: We can see that the autoencoder has learned to separate the butterflies by genus, solely based on the visuals, without being provided any information about the genus! This is an example of unsupervised training of a classifier. Similarly, it mostly placed the families in adjacent clusters. There is an area in the plot where lots of colors mix. These are specimens with missing wings. Since a missing wing is the most striking visual feature, it primarily determines the latent vector that the autoencoder assigns.

Using the same method for the sex, we can see that most of the records contain no sex information. But for the birdwings, for which this information is available, the autoencoder has learned to separate male from female, again without being tasked to do so. We can observe a similar result for the delias.

The source code for this project is available on Github. The images of the butterflies are provided by the Trustees of the Natural History Museum under a CC BY 4.0 license.
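As a final aside, the de-overlapping step described above (nudging each point away from its nearest neighbor until no two points are closer than a safe radius) could look roughly like this; the radius, step size and use of a KD-tree are my own choices for the sketch:

```python
import numpy as np
from scipy.spatial import cKDTree

def spread_points(points, safe_radius=1.0, step=0.1, iterations=200):
    """Iteratively move points apart until no point has a neighbor closer
    than safe_radius (or the iteration budget runs out)."""
    points = points.astype(float).copy()
    for _ in range(iterations):
        tree = cKDTree(points)
        # Nearest neighbor of each point (k=2: the closest hit is the point itself).
        distances, indices = tree.query(points, k=2)
        too_close = distances[:, 1] < safe_radius
        if not too_close.any():
            break
        away = points[too_close] - points[indices[too_close, 1]]
        norms = np.linalg.norm(away, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # coincident points: avoid division by zero
        points[too_close] += step * away / norms
    return points

# Example usage on the 2D t-SNE positions from the previous sketch:
# spread = spread_points(positions_2d, safe_radius=0.5, step=0.05)
```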

over a year ago 7 votes

More in creative

What sort of progress?

Nothing stays still. Relative to the rest of the world, even something that’s not moving is changing. It’s tempting to talk about not making fast enough progress. But it’s far more useful to ask which direction we’re progressing. Often, people will point to the velocity of the change they’re making without pausing to consider the […]

11 hours ago 1 votes
The UnPopulist: Abundance Politics

This week I’m in The UnPopulist with an article about the politics of the abundance agenda:

yesterday 1 votes
Are More Celebrities Dying? A Statistical Analysis

Are more famous figures dying, and if so, why?

yesterday 3 votes
Organizing for urgent

There are many ways to prioritize our time and focus, but the easiest and most vivid way is to do the urgent things first. If we wait until a house plant is sick before we take care of it, though, it’s too late. Deadlines, loud requests and last-minute interventions are crude forcing functions. They’re inefficient […]

yesterday 2 votes
Weekly Scroll: All Hail New Media

Censorship, platforms, and exactly who are we empowering?

2 days ago 3 votes