More from alexwlchan
In my previous post, there was a first for this site: I embedded a post from Mastodon. Like many social media services, Mastodon has built-in support for embedding posts. If you’re looking at a public post, you can get a snippet of HTML and JavaScript to show that post in another web page. You add that snippet to your page, and when somebody opens it, the snippet will appear as a Mastodon post. It’s quick, easy, and not how I did it. When I want to embed post from social media sites, I don’t use the native embed. Instead, I write my own HTML and CSS to mimic their appearance, and it looks pretty close to the real thing. Here’s a comparison of a native/custom Mastodon embed – they’re not exactly the same, but close enough that you probably wouldn’t notice unless you were looking: This is something I’ve been doing for over a decade – I got the original idea from Dr Drang, who does something similar for tweets. (He wrote that post in 2012, and it highlights the value of resilient embeds – two of the four tweets he’s quoted are no longer available. The post would be harder to read if you couldn’t see the tweets he was quoting and replying to.) Many years ago, I copied Dr Drang’s code, created my own variant, and I used that for embedding tweets. I’ve now created another variant that works for Mastodon toots, and I have unfinished branches with more variants for Instagram and Bluesky. Why do I prefer my embeds? There are several reasons: My embeds are smaller and faster. Mastodon posts are short, and yet the native embed downloads nearly a megabyte of data to display 88 words of text – including the audio file boop.mp3, for reasons I can’t imagine. Meanwhile my custom embed requires just 35KB. I try to keep this site pretty lean and lightweight – the average size of an HTML page is just 13KB. Adding a megabyte of data for an embed would undo all that hard work. My embeds don’t require any JavaScript, third-party or otherwise. You don’t need JS to show static content, and adding third-party code introduces a privacy risk for my readers. I’m not completely opposed to JavaScript, but it’s massively overused on the modern web. It’s useful for interactive elements, but I really don’t need it on this content-only site. My embeds are more resilient. Because I have no dependency on the Mastodon server, it doesn’t matter if the server goes away or the toot is deleted. My page will be unaffected. This is why many people include social media posts as images, or copy the text into a blockquote. We’re in a time of increased tumult and instability for social media platforms, but their woes aren’t going to leave holes in my posts. My embeds support dark mode. A few years ago I added dark mode to this site. It’s not something I use myself, but I know it’s important to a lot of people and it was a fun little project. The native Mastodon embeds always show toots in light mode, whereas my embeds will adapt to your preference: On the other hand, the argument in favour of native embeds is that they need minimal effort, they should always work, and they support more features. My custom embeds can’t do pictures, or link previews, or quote toots, because I’ve never embedded a toot that uses those. If/when I do, I’ll have to write the code to support that. I’ll find that fun, but most people would find that annoying. I don’t know what accessibility is like for native embeds. My custom embeds only use a handful of semantic HTML elements, so they get a lot of good behaviour “by default from the browser. I hope native embeds are good for accessibility, but I don’t know enough to say whether my approach is better or worse in that regard. How does it work? I have some HTML and CSS that render the embedded toot. Here’s the entirety of the HTML – I’ve tweaked this ever so slightly for readability, but the key parts are there. <blockquote class="mastodon-embed"> <div class="header"> <a class="name_header" href="https://code4lib.social/@linguistory"> <img class="avatar" src="linguistory.jpg" alt=""> <div class="name"> <span class="display_name">James Truitt (he/him)</span> <span class="account_name">@linguistory@code4lib.social</span> </div> </a> <img class="mastodon_logo" src="logo.svg"> </div> <p class="text"> Do any <a href="https://code4lib.social/tags/digipres">#digipres</a> folks happen to have a handy repo of small invalid bags for testing purposes? <br> <br> I'm trying to automate our ingest process, and want to make sure I'm accounting for as many broken expectations as possible. </p> <p class="meta"> <a href="https://code4lib.social/@linguistory/113924700205617006">31 Jan 2025 at 19:49</a> </p> </blockquote> The CSS styles are a bit long to include here, but you can see them by reading the source code of my demo page. I’m using CSS grid layout to lay out the different components, but otherwise nothing too complicated. I designed my custom embed by creating two HTML files: one with a native embed, and one with my custom embed. I used the developer tools to get key values from the native embed, like colours and spacing, then I kept adding styles to my custom embed until it looked about right. When I want to embed a toot now, I write a line like: {% mastodon https://code4lib.social/@linguistory/113924700205617006 %} This calls a Jekyll plugin that replaces this line with an embedded toot. This code is very scrappy and poorly documented, so it may not be especially easy to adapt to your own site – if you want to do this, start from the HTML and CSS instead. Like everything on this site, my Mastodon embeds are a work-in-progress and not something that everybody should copy. The built-in embeds are quick, easy, and convenient, and they’re what most people should use. But what I like about having my own website is that when I do want to spend an unreasonable amount of effort on something, and do it just because I think it’s fun, I can do that, and nobody can stop me. [If the formatting of this post looks odd in your feed reader, visit the original article]
Last week, James Truitt asked a question on Mastodon: James Truitt (he/him) @linguistory@code4lib.social Mastodon #digipres folks happen to have a handy repo of small invalid bags for testing purposes? I'm trying to automate our ingest process, and want to make sure I'm accounting for as many broken expectations as possible. Jan 31, 2025 at 07:49 PM The “bags” he’s referring to are BagIt bags. BagIt is an open format developed by the Library of Congress for packaging digital files. Bags include manifests and checksums that describe their contents, and they’re often used by libraries and archives to organise files before transfering them to permanent storage. Although I don’t use BagIt any more, I spent a lot of time working with it when I was a software developer at Wellcome Collection. We used BagIt as the packaging format for files saved to our cloud storage service, and we built a microservice very similar to what James is describing. The “bag verifier” would look for broken bags, and reject them before they were copied to long-term storage. I wrote a lot of bag verifier test cases to confirm that it would spot invalid or broken bags, and that it would give a useful error message when it did. All of the code for Wellcome’s storage service is shared on GitHub under an MIT license, including the bag verifier tests. They’re wrapped in a Scala test framework that might not be the easiest thing to read, so I’m going to describe the test cases in a more human-friendly way. Before diving into specific examples, it’s worth remembering: context is king. BagIt is described by RFC 8493, and you could create invalid bags by doing a line-by-line reading and deliberately ignoring every “MUST” or “SHOULD” but I wouldn’t recommend this aproach. You’d get a long list of test cases, but you’d be overwhelmed by examples, and you might miss specific requirements for your system. The BagIt RFC is written for the most general case, but if you’re actually building a storage service, you’ll have more concrete requirements and context. It’s helpful to look at that context, and how it affects the data you want to store. Who’s creating the bags? How will they name files? Where are you going to store bags? How do bags fit into your wider systems? And so on. Understanding your context will allow you to skip verification steps that you don’t need, and to add verification steps that are important to you. I doubt any two systems implement the exact same set of checks, because every system has different context. Here are examples of potential validation issues drawn from the BagIt specification and my real-world experience. You won’t need to check for everything on this list, and this list isn’t exhaustive – but it should help you think about bag validation in your own context. The Bag Declaration bagit.txt This file declares that this is a BagIt bag, and the version of BagIt you’re using (RFC 8493 §2.1.1). It looks the same in almost every bag, for example: BagIt-Version: 1.0 Tag-File-Character-Encoding: UTF-8 This tightly prescribed format means it can only be invalid in a few ways: What if the bag doesn’t have a bag declaration? It’s a required element of every BagIt bag; it has to be there. What if the bag declaration is the wrong format? It should contain exactly two lines: a version number and a character encoding, in that order. What if the bag declaration has an unexpected version number? If you see a BagIt version that you’ve not seen before, the bag might have a different structure than what you expect. The Payload Files and Payload Manifest The payload files are the actual content you want to save and preserve. They get saved in the payload directory data/ (RFC 8493 §2.1.2), and there’s a payload manifest manifest-algorithm.txt that lists them, along with their checksums (RFC 8493 §2.1.3). Here’s an example of a payload manifest with MD5 checksums: 37d0b74d5300cf839f706f70590194c3 data/waterfall.jpg This tells us that the bag contains a single file data/waterfall.jpg, and it has the MD5 checksum 37d0…. These checksums can be used to verify that the files have transferred correctly, and haven’t been corrupted in the process. There are lots of ways a payload manifest could be invalid: What if the bag doesn’t have a payload manifest? Every BagIt bag must have at least one Payload Manifest file. What if the payload manifest is the wrong format? These files have a prescribed format – one file per line, with a checksum and file path. What if the payload manifest refers to a file that isn’t in the bag? Either one of the files in the bag has been deleted, or the manifest has an erroneous entry. What if the bag has a file that isn’t listed in the payload manifest? The manifest should be a complete listing of all the payload files in the bag. If the bag has a file which isn’t in the payload manifest, either that file isn’t meant to be there, or the manifest is missing an entry. Checking for unlisted files is how I spotted unwanted .DS_Store and Thumbs.db files. What if the checksum in the payload manifest doesn’t match the checksum of the file? Either the file has been corrupted, or the checksum is incorrect. What if there are payload files outside the data/ directory? All the payload files should be stored in data/. Anything outside that is an error. What if there are duplicate entries in the payload manifest? Every payload file must be listed exactly once in the manifest. This avoids ambiguity – suppose a file is listed twice, with two different checksums. Is the bag valid if one of those checksums is correct? Requiring unique entries avoids this sort of issue. What if the payload directory is empty? This is perfectly acceptable in the BagIt RFC, but it may not be what you want. If you know that you will always be sending bags that contain files, you should flag empty payload directories as an error. What if the payload manifest contains paths outside data/, or relative paths that try to escape the bag? (e.g. ../file.txt) Now we’re into “malicious bag” territory – a bag uploaded by somebody who’s trying to compromise your ingest pipeline. Any such bags should be treated with suspicion and rejected. If you’re concerned about malicious bags, you need a more thorough test suite to catch other shenanigans. We never went this far at Wellcome Collection, because we didn’t ingest bags from arbitrary sources. The bags only came from internal systems, and our verification was mainly about spotting bugs in those systems, not defending against malicious actors. A bag can contain multiple payload manifests – for example, it might contain both MD5 and SHA1 checksums. Every payload manifest must be valid for the overall bag to be valid. Payload filenames There are lots of gotchas around filenames and paths. It’s a complicated problem, and I definitely don’t understand all of it. It’s worth understanding the filename rules of any filesystem where you will be storing bags. For example, Azure Blob Storage has a number of rules around how you can name files, and Amazon S3 has different rules. We stored files in both at Wellcome Collection, and so the storage service had to enforce the superset of these rules. I’ve listed some edge cases of filenames you might want to consider, but it’s not a comlpete list. There are lots of ways that unexpected filenames could cause you issues, but whether you care depends on the source of your bags. If you control the bags and you know you’re not going to include any weird filenames, you can probably skip most of these. We only checked for one of these conditions at Wellcome Collection, because we had a pre-ingest step that normalised filenames. It converted filenames to ASCII, and saved a mapping between original and normalised filename in the bag. However, the normalisation was only designed for one filesystem, and produced filenames with trailing dots that were still disallowed in Azure Blob. What if a filename is too long? Some systems have a maximum path length, and an excessively deep directory structure or long filename could cause issues. What if a filename contains special characters? Spaces, emoji, or special characters (\, :, *, etc.) can cause problems for some tools. You should also think about characters that need to be URL-encoded. What if a filename has trailing spaces or dots? Some filesystems can’t support filenames ending in a dot or a space. What happens if your bag contains such a file, and you try to save it to the filesystem? This caused us issues at Wellcome Collection. We initially stored bags just in Amazon S3, which is happy to take filenames with a trailing dot – then we added backups to Azure Blob, which doesn’t. One of the bags we’d stored in Amazon S3 had a trailing dot in the filename, and caused us headaches when we tried to copy it to Azure. What if a filename contains a mix of path separators? The payload manifest uses a forward slash (/) as a path separator. If you have a filename with an alternative path separator, it might behave differently on different systems. For example, consider the payload file a\b\c. This would be a single file on macOS or Linux, but it would be nested inside two folders on Windows. What if the filenames are a mix of uppercase and lowercase characters? Some fileystems are case-sensitive, others aren’t. This can cause issues when you move bags between systems. For example, suppose a bag contains two different files Macrodata.txt and macrodata.txt. When you save that bag on a case-insensitive filesystem, only one file will be saved. What if the same filename appears twice with different Unicode normalisations? This is similar to filenames which only differ in upper/lowercase. They might be treated as two files on one filesystem, but collapsed into one file on another. The classic example is the word “café”: this can be encoded as caf\xc3\xa9 (UTF-8 encoded é) or cafe\xcc\x81 (e + combining acute accent). What if a filename contains a directory reference? A directory reference is /./ (current directory) or /../ (parent directory). It’s used on both Unix and Windows-like systems, and it’s another case of two filenames that look different but can resolve to the same path. For example: a/b, a/./b and a/subdir/../b all resolve to the same path under these rules. This can cause particular issues if you’re moving between local filesystems and cloud storage. Local filesystems treat filenames as hierarchical paths, where cloud storage like Amazon S3 often treats them as opaque strings. This can cause issues if you try to copy files from cloud storage to a local system – if you’re not careful, you could lose files in the process. The Tag Manifest tagmanifest-algorithm.txt Similar to the payload manifest, the tag manifest lists the tag files and their checksums. A “tag file” is the BagIt term for any metadata file that isn’t part of the payload (RFC 8493 §2.2.1). Unlike the payload manifest, the tag manifest is optional. A bag without a tag manifest can still be a valid bag. If the tag manifest is present, then many of the ways that a payload manifest can invalidate a bag – malformed contents, unreferenced files, or incorrect checksums – can also apply to tag manifests. There are some additional things to consider: What if a tag manifest lists payload files? The tag manifest lists tag files; the payload manifest lists payload files in the data/ directory. A tag manifest that lists files in the data/ directory is incorrect. What if the bag has a file that isn’t listed in either manifest? Every file in a bag (except the tag manifests) should be listed in either a payload or a tag manifest. A file that appears in neither could mean an unexpected file, or a missing manifest entry. Although the tag manifest is optional in the BagIt spec, at Wellcome Collection we made it a required file. Every bag had to have at least one tag manifest file, or our storage service would refuse to ingest it. The Bag Metadata bag-info.txt This is an optional metadata file that describes the bag and its contents (RFC 8493 §2.2.2). It’s a list of metadata elements, as simple label-value pairs, one per line. Here’s an example of a bag metadata file: Source-Organization: Lumon Industries Organization-Address: 100 Main Street, Kier, PE, 07043 Contact-Name: Harmony Cobel Unlike the manifest files, this is primarily intended for human readers. You can put arbitrary metadata in here, so you can add fields specific to your organisation. Although this file is more flexible, there are still ways it can be invalid: What if the bag metadata is the wrong format? It should have one metadata entry per line, with a label-value pair that’s separated by a colon. What if the Payload-Oxum is incorrect? The Payload-Oxum contains some concise statistics about the payload files: their total size in bytes, and how many there are. For example: Payload-Oxum: 517114.42 This tells us that the bag contains 42 payload files, and their total size is 517,114 bytes. If these stats don’t match the rest of the bag, something is wrong. What if non-repeatable metadata element names are repeated? The BagIt RFC defines a small number of reserved metadata element names which have a standard meaning. Although most metadata element names can be repeated, there are some which can’t, because they can only have one value. In particular: Bagging-Date, Bag-Size, Payload-Oxum and Bag-Group-Identifier. Although the bag metadata file is optional in a general BagIt bag, you may want to add your own rules based on how you use it. For example, at Wellcome Collection, we required all bags to have an External-Identifier value, that matched a specific schema. This allowed us to link bags to records in other databases, and our bag verifier would reject bags that didn’t include it. The Fetch File fetch.txt This is an optional element that allows you to reference files stored elsewhere (RFC 8493 §2.2.3). It tells the person reading the bag that a file hasn’t been included in this copy of the bag; they have to go and fetch it from somewhere else. The file is still recorded in the payload manifest (with a checksum you can verify), but you don’t have a complete bag until you’ve downloaded all the files. Here’s an example of a fetch.txt: https://topekastar.com/~daria/article.txt 1841 data/article.txt This tells us that data/article.txt isn’t included in this copy of the bag, but we we can download it from https://topekastar.com/~daria/article.txt. (The number 1841 is the size of the file in bytes. It’s optional.) Using fetch.txt allows you to send a bag with “holes”, which saves disk space and network bandwidth, but at a cost – we’re now relying on the remote location to remain available. From a preservation standpoint, this is scary! If topekastar.com goes away, this bag will be broken. I know some people don’t use fetch.txt for precisely this reason. If you do use fetch.txt, here are some things to consider: What if the fetch file is the wrong format? There’s a prescribed format – one file per line, with a URL, optional file size, and file path. What if the fetch file lists a file which isn’t in the payload manifest? The fetch.txt should only tell us that a file is stored elsewhere, and shouldn’t be introducing otherwise unreferenced files. If a file appears in fetch.txt but not the payload manifest, then we can’t verify the remote file because we don’t have a checksum for it. There’s either an erroneous fetch file entry or a missing manifest entry. What if the fetch file points to a file at an unusable URL? The URL is only useful if the person who receives the bag can use it to download the file. If they can’t, the bag might technically be valid, but it’s functionally broken. For example, you might reject URLs that don’t start with http:// or https://. What if the fetch file points to a file with the wrong length? The fetch.txt can optionally specify the size of a file, so you know how much storage you need to download it. If you download the file, the actual size should match the stated size. What if the fetch files points to a file that’s already included in the bag? Now you have two ways to get this file: you can read it from the bag, or from the remote URL. If a file is listed in both fetch.txt and included in the bag, either that file isn’t meant to be in the bag, or the fetch file has an erroneous entry. We used fetch files at Wellcome Collection to implement versioning, and we added extra rules about what remote URLs were allowed. In particular, we didn’t allow fetching a file from just anywhere – you could fetch from our S3 buckets, but not the general Internet. The bag verifier would reject a fetch file entry that pointed elsewhere. These examples illustrate just how many ways a BagIt bag can be invalid, from simple structural issues to complex edge cases. Remember: the key is to understand your specific needs and requirements. By considering your context – who creates your bags, where they’ll be stored, and how they fit into your wider systems – you can build a validation process to catch the issues that matter to you, while avoiding unnecessary complexity. I can give you my ideas, but only you can build your system. [If the formatting of this post looks odd in your feed reader, visit the original article]
One of my recent home organisation projects has been sorting out my LEGO collection. I have a bunch of sets which are mixed together in one messy box, and I’m trying to separate bricks back into distinct sets. My collection is nowhere near large enough to be worth sorting by individual parts, and I hope that breaking down by set will make it all easier to manage and store. I’ve been creating spreadsheets to track the parts in each set, and count them out as I find them. I briefly hinted at this in my post about looking at images in spreadsheets, where I included a screenshot of one of my inventory spreadsheets: These spreadsheets have been invaluable – I can see exactly what pieces I need, and what pieces I’m missing. Without them, I wouldn’t even attempt this. I’m about to pause this cleanup and work on some other things, but first I wanted to write some notes on how I’m creating these spreadsheets – I’ll probably want them again in the future. Getting a list of parts in a set There are various ways to get a list of parts in a LEGO set: Newer LEGO sets include a list of parts at the back of the printed instructions You can get a list from LEGO-owned website like LEGO.com or BrickLink There are community-maintained databases on sites like Rebrickable I decided to use the community maintained lists from Rebrickable – they seem very accurate in my experience, and you can download daily snapshots of their entire catalog database. The latter is very powerful, because now I can load the database into my tools of choice, and slice and dice the data in fun and interesting ways. Downloading their entire database is less than 15MB – which is to say, two-thirds the size of just opening the LEGO.com homepage. Bargain! Putting Rebrickable data in a SQLite database My tool of choice is SQLite. I slept on this for years, but I’ve come to realise just how powerful and useful it can be. A big part of what made me realise the power of SQLite is seeing Simon Willison’s work with datasette, and some of the cool things he’s built on top of SQLite. Simon also publishes a command-line tool sqlite-utils for manipulating SQLite databases, and that’s what I’ve been using to create my spreadsheets. Here’s my process: Create a Python virtual environment, and install sqlite-utils: python3 -m venv .venv source .venv/bin/activate pip install sqlite-utils At time of writing, the latest version of sqlite-utils is 3.38. Download the Rebrickable database tables I care about, uncompress them, and load them into a SQLite database: curl -O 'https://cdn.rebrickable.com/media/downloads/colors.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/parts.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/inventories.csv.gz' curl -O 'https://cdn.rebrickable.com/media/downloads/inventory_parts.csv.gz' gunzip colors.csv.gz gunzip parts.csv.gz gunzip inventories.csv.gz gunzip inventory_parts.csv.gz sqlite-utils insert lego_parts.db colors colors.csv --csv sqlite-utils insert lego_parts.db parts parts.csv --csv sqlite-utils insert lego_parts.db inventories inventories.csv --csv sqlite-utils insert lego_parts.db inventory_parts inventory_parts.csv --csv The inventory_parts table describes how many of each part there are in a set. “Set S contains 10 of part P in colour C.” The parts and colors table contains detailed information about each part and color. The inventories table matches the official LEGO set numbers to the inventory IDs in Rebrickable’s database. “The set sold by LEGO as 6616-1 has ID 4159 in the inventory table.” Run a SQLite query that gets information from the different tables to tell me about all the parts in a particular set: SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num FROM inventory_parts ip JOIN inventories i ON ip.inventory_id = i.id JOIN parts p ON ip.part_num = p.part_num JOIN colors c ON ip.color_id = c.id WHERE i.set_num = '6616-1'; Or use sqlite-utils to export the query results as a spreadsheet: sqlite-utils lego_parts.db " SELECT ip.img_url, ip.quantity, ip.is_spare, c.name as color, p.name, ip.part_num FROM inventory_parts ip JOIN inventories i ON ip.inventory_id = i.id JOIN parts p ON ip.part_num = p.part_num JOIN colors c ON ip.color_id = c.id WHERE i.set_num = '6616-1';" --csv > 6616-1.csv Here are the first few lines of that CSV: img_url,quantity,is_spare,color,name,part_num https://cdn.rebrickable.com/media/parts/photos/9999/23064-9999-e6da02af-9e23-44cd-a475-16f30db9c527.jpg,1,False,[No Color/Any Color],Sticker Sheet for Set 6616-1,23064 https://cdn.rebrickable.com/media/parts/elements/4523412.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Chequered Print,2335pr0019 https://cdn.rebrickable.com/media/parts/photos/15/2335px13-15-33ae3ea3-9921-45fc-b7f0-0cd40203f749.jpg,2,False,White,Flag 2 x 2 Square [Thin Clips] with Octan Logo Print,2335pr0024 https://cdn.rebrickable.com/media/parts/elements/4141999.jpg,4,False,Green,Tile Special 1 x 2 Grille with Bottom Groove,2412b https://cdn.rebrickable.com/media/parts/elements/4125254.jpg,4,False,Orange,Tile Special 1 x 2 Grille with Bottom Groove,2412b Import that spreadsheet into Google Sheets, then add a couple of columns. I add a column image where every cell has the formula =IMAGE(…) that references the image URL. This gives me an inline image, so I know what that brick looks like. I add a new column quantity I have where every cell starts at 0, which is where I’ll count bricks as I find them. I add a new column remaining to find which counts the difference between quantity and quantity I have. Then I can highlight or filter for rows where this is non-zero, so I can see the bricks I still need to find. If you’re interested, here’s an example spreadsheet that has a clean inventory. It took me a while to refine the SQL query, but now I have it, I can create a new spreadsheet in less than a minute. One of the things I’ve realised over the last year or so is how powerful “get the data into SQLite” can be – it opens the door to all sorts of interesting queries and questions, with a relatively small amount of code required. I’m sure I could write a custom script just for this task, but it wouldn’t be as concise or flexible. [If the formatting of this post looks odd in your feed reader, visit the original article]
I’ve had a couple of projects recently where I needed to work with a list that involved images. For example, choosing a series of photos to print, or making an inventory of Lego parts. I could write a simple text list, but it’s really helpful to be able to see the images as part of the list, especially when I’m working with other people. The best tool I’ve found is Google Sheets – not something I usually associate with pictures! I’m using Google Sheets, and I use the IMAGE function, which inserts an image into a cell. For example: =IMAGE("https://www.google.com/images/srpr/logo3w.png") There’s a similar function in Microsoft Excel, but not in Apple Numbers. This function can reference values in other cells, so I’ll often prepare my spreadsheet in another tool – say, a Python script – and include an image URL in one of the columns. When I import the spreadsheet into Google Sheets, I use IMAGE() to reference that column, and then I see inline images. After that, I tend to hide the column with the image URL, and resize the rows/columns containing images to make them bigger and easier to look at. I often pair this with the HYPERLINK function, which can add a clickable link to a cell. This is useful to link to the source of the image, or to more detail I can’t fit in the spredsheet. I don’t know how far this approach can scale – I’ve never tried more than a thousand or so images in a single spreadsheet – but it’s pretty cool that it works at all! Using a spreadsheet gives me a simple, lightweight interface that most people are already familiar with. It doesn’t take much work on my part, and I get useful features like sorting and filtering for “free”. Previously I’d only thought of spreadsheets as a tool for textual data, and being able to include images has made them even more powerful. [If the formatting of this post looks odd in your feed reader, visit the original article]
More in programming
It’s a fact that Japan needs more international developers. That doesn’t mean integrating those developers into Japanese companies, as well as Japanese society, is a simple process. But what are the most common challenges encountered by these companies with multinational teams? To find out, TokyoDev interviewed a number of Japanese companies with international employees. In addition to discussing the benefits of hiring overseas, we also wanted to learn more about what challenges they had faced, and how they had overcome them. The companies interviewed included: Autify, which provides an AI-based software test automation platform Beatrust, which has created a search platform that automatically structures profile information Cybozu, Japan’s leading groupware provider DeepX, which automates heavy equipment machinery Givery, which scouts, hires, and trains world-class engineers Shippio, which digitalizes international trading and is Japan’s first digital forwarding company Yaraku, which offers web-based translation management According to those companies, the issues they experienced fell into two categories: addressing the language barrier, and helping new hires come to Japan. The language barrier Language issues are by far the most universal problem faced by Japanese companies with multinational teams. As a result, all of the companies we spoke to have evolved their own unique solutions. AI translation To help improve English-Japanese communication, Yaraku has turned to AI and its own translation tool, YarakuZen. With these they’ve reduced comprehension issues down to verbal communication alone. Since their engineering teams primarily communicate via text anyway, this has solved the majority of their language barrier issue, and engineers feel that they can now work smoothly together. Calling on bilinguals While DeepX employs engineers from over 20 countries, English is the common language between them. Documentation is written in English, and even Japanese departments still write minutes in English so colleagues can check them later. Likewise, explanations of company-wide meetings are delivered in both Japanese and English. Still, a communication gap exists. To overcome it, DeepX assigns Japanese project managers who can also speak English well. English skills weren’t previously a requirement, but once English became the official language of the engineering team, bilingualism was an essential part of the role. These project managers are responsible for taking requests from clients and communicating them accurately to the English-only engineers. In addition, DeepX is producing more bilingual employees by offering online training in both Japanese and English. The online lessons have proven particularly popular with international employees who have just arrived in Japan. Beatrust has pursued a similar policy of encouraging employees to learn and speak both languages. Dr. Andreas Dippon, the Vice President of Engineering at Beatrust, feels that bilingual colleagues are absolutely necessary to business. I think the biggest mistake you can make is just hiring foreigners who speak only English and assuming all the communication inside engineering is just English and that’s fine. You need to understand that business communication with [those] engineers will be immensely difficult . . . You need some almost bilingual people in between the business side and the engineering side to make it work. Similar to DeepX, Beatrust offers its employees a stipend for language learning. “So nowadays, it’s almost like 80 percent of both sides can speak English and Japanese to some extent, and then there are like two or three people on each side who cannot speak the other language,” Dippon said. “So we have like two or three engineers who cannot speak Japanese at all, and we have two or three business members who cannot speak English at all.” But in the engineering team itself “is 100 percent English. And the business team is almost 100 percent Japanese.” “ Of course the leaders try to bridge the gap,” Dippon explained. “So I’m now joining the business meetings that are in Japanese and trying to follow up on that and then share the information with the engineering team, and [it’s] also the same for the business lead, who is joining some engineering meetings and trying to update the business team on what’s happening inside engineering.” “Mixed language” Shippio, on the other hand, encountered negative results when they leaned too hard on their bilingual employees. Initially they asked bilinguals to provide simultaneous interpretation at meetings, but quickly decided that the burden on them was too great and not sustainable in the long term. Instead, Shippio has adopted a policy of “mixed language,” or combining Japanese and English together. The goal of mixed language is simple: to “understand each other.” Many employees who speak one language also know a bit of the other, and Shippio has found that by fostering a culture of flexible communication, employees can overcome the language barrier themselves. For example, a Japanese engineer might forget an English word, in which case he’ll do his best to explain the meaning in Japanese. If the international engineer can understand a bit of Japanese, he’ll be able to figure out what his coworker intended to say, at which point they will switch back to English. This method, while idiosyncratic to every conversation, strikes a balance between the stress of speaking another language and consideration for the other person. The most important thing, according to Shippio, is that the message is conveyed in any language. Meeting more often Another method these companies use is creating structured meeting schedules designed to improve cross-team communication. Givery teams hold what they call “win sessions” and “sync-up meetings” once or twice a month, to ensure thorough information-sharing within and between departments. These two types of meetings have different goals: A “win session” reviews business or project successes, with the aim of analyzing and then repeating that success in future. A “sync-up meeting” helps teams coordinate project deadlines. They report on their progress, discuss any obstacles that have arisen, and plan future tasks. In these meetings employees often speak Japanese, but the meetings are translated into English, and sometimes supplemented with additional English messages and explanations. By building these sort of regular, focused meetings into the company’s schedule, Givery aims to overcome language difficulties with extra personal contact. Beatrust takes a similarly structured, if somewhat more casual, approach. They tend to schedule most meetings on Friday, when engineers are likely to come to the office. However, in addition to the regular meetings, they also hold the “no meeting hour” for everyone, including the business team. “One of the reasons is to just let people talk to each other,” Dippon explained. “Let the engineers talk to business people and to each other.” This kind of interaction, we don’t really care if it’s personal stuff or work stuff that they talk about. Just to be there, talking to each other, and getting this feeling of a team [is what’s important]. . . . This is hugely beneficial, I think. Building Bonds Beatrust also believes in building team relationships through regular off-site events. “Last time we went to Takaosan, the mountain area,” said Dippon. “It was nice, we did udon-making. . . . This was kind of a workshop for QRs, and this was really fun, because even the Japanese people had never done it before by themselves. So it was a really great experience. After we did that, we had a half-day workshop about team culture, company culture, our next goals, and so on.” Dippon in particular appreciates any chance to learn more about his fellow employees. Like, ‘Why did you leave your country? Why did you come to Japan? What are the problems in your country? What’s good in your country?’ You hear a lot of very different stories. DeepX also hopes to deepen the bonds between employees with different cultures and backgrounds via family parties, barbecues, and other fun, relaxing events. This policy intensified after the COVID-19 pandemic, during which Japan’s borders were closed and international engineers weren’t able to immigrate. When the borders opened and those engineers finally did arrive, DeepX organized in-house get-togethers every two weeks, to fortify the newcomers’ relationships with other members of the company. Sponsoring visas Not every company that hires international developers actually brings them to Japan—-quite a few prefer to hire foreign employees who are already in-country. However, for those willing to sponsor new work visas, there is universal consensus on how best to do it: hire a professional. Cybozu has gone to the extent of bringing those professionals in-house. The first international member they hired was an engineer living in the United States. Though he wanted to work in Japan, at that time they didn’t have any experience in acquiring a work visa or relocating an employee, so they asked him to work for their US subsidiary instead. But as they continued hiring internationally, Cybozu realized that quite a few engineers were interested in physically relocating to Japan. To facilitate this, the company set up a new support system for their multinational team, for the purpose of providing their employees with work visas. Other companies prefer to outsource the visa process. DeepX, for example, has hired a certified administrative scrivener corporation to handle visa applications on behalf of the company. Autify also goes to a “dedicated, specialized” lawyer for immigration procedures. Thomas Santonja, VP of Engineering at Autify, feels that sponsoring visas is a necessary cost of business and that the advantages far outweigh the price. We used to have fully remote, long-term employees outside of Japan, but we stopped after we noticed that there is a lot of value in being able to meet in person and join in increased collaboration, especially with Japanese-speaking employees that are less inclined to make an effort when they don’t know the people individually. “It’s kind of become a requirement, in the last two years,” he concluded, “to at least be capable of being physically here.” However, Autify does prevent unnecessary expenses by having a new employee work remotely from their home country for a one month trial period before starting the visa process in earnest. So far, the only serious issues they encountered were with an employee based in Egypt; the visa process became so complicated, Autify eventually had to give up. But Autify also employs engineers from France, the Philippines, and Canada, among other countries, and has successfully brought their workers over many times. Helping employees adjust Sponsoring a visa is only the beginning of bringing an employee to Japan. The next step is providing special support for international employees, although this can look quite different from company to company. DeepX points out that just working at a new company is difficult enough; also beginning a new life in a new country, particularly when one doesn’t speak the language, can be incredibly challenging. That’s why DeepX not only covers the cost of international flights, but also implemented other support systems for new arrivals. To help them get started in Japan, DeepX provides a hired car to transport them from the airport, and a furnished monthly apartment for one month. Then they offer four days of special paid leave to complete necessary procedures: opening a bank account, signing a mobile phone contract, finding housing, etc. The company also introduces real estate companies that specialize in helping foreigners find housing, since that can sometimes be a difficult process on its own. Dippon at Beatrust believes that international employees need ongoing support, not just at the point of entry, and that it’s best to have at least one person in-house who is prepared to assist them. I think that one trap many companies run into is that they know all about Japanese laws and taxes and so on, and everybody grew up with that, so they are all familiar. But suddenly you have foreigners who have basically no idea about the systems, and they need a lot of support, because it can be quite different. Santonja at Autify, by contrast, has had a different experience helping employees get settled. “I am extremely tempted to say that I don’t have any challenges. I would be extremely hard pressed to tell you anything that could be remotely considered difficult or, you know, require some organization or even extra work or thinking.” Most people we hire look for us, right? So they are looking for an opportunity to move to Japan and be supported with a visa, which is again a very rare occurrence. They tend to be extremely motivated to live and make it work here. So I don’t think that integration in Japan is such a challenge. Conclusion To companies unfamiliar with the process, the barriers to hiring internationally may seem high. However, there are typically only two major challenges when integrating developers from other countries. The first, language issues, has a variety of solutions ranging from the technical to the cultural. The second, attaining the correct work visa, is best handled by trained professionals, whether in-house or through contractors. Neither of these difficulties is insurmountable, particularly with expert assistance. In addition, Givery in particular has stressed that it’s not necessary to figure out all the details in advance of hiring: in fact, it can benefit a company to introduce international workers early on, before its own internal policies are overly fixed. This information should also benefit international developers hoping to work in Japan. Since this article reflects the top concerns of Japanese companies, developers can work to proactively relieve those worries. Learning even basic Japanese helps reduce the language barrier, while becoming preemptively familiar with Japanese society reassures employers that you’re capable of taking care of yourself here. If you’d like to learn more about the benefits these companies enjoy from hiring international developers, see part one of this article series here. Want to find a job in Japan? Check out the TokyoDev job board. If you want to know more about multinational teams, moving to Japan, or Japanese work life in general, see our extensive library of articles. If you’d like to continue the conversation, please join the TokyoDev Discord.
When you're brand new, how do you select your first marketing channel?
The new American vice president JD Vance just gave a remarkable talk at the Munich Security Conference on free speech and mass immigration. It did not go over well with many European politicians, some of which immediately proved Vance's point, and labeled the speech "not acceptable". All because Vance dared poke at two of the holiest taboos in European politics. Let's start with his points on free speech, because they're the foundation for understanding how Europe got into such a mess on mass immigration. See, Europeans by and large simply do not understand "free speech" as a concept the way Americans do. There is no first amendment-style guarantee in Europe, yet the European mind desperately wants to believe it has the same kind of free speech as the US, despite endless evidence to the contrary. It's quite like how every dictator around the world pretends to believe in democracy. Sure, they may repress the opposition and rig their elections, but they still crave the imprimatur of the concept. So too "free speech" and the Europeans. Vance illustrated his point with several examples from the UK. A country that pursues thousands of yearly wrong-speech cases, threatens foreigners with repercussions should they dare say too much online, and has no qualms about handing down draconian sentences for online utterances. It's completely totalitarian and completely nuts. Germany is not much better. It's illegal to insult elected officials, and if you say the wrong thing, or post the wrong meme, you may well find yourself the subject of a raid at dawn. Just crazy stuff. I'd love to say that Denmark is different, but sadly it is not. You can be put in prison for up to two years for mocking or degrading someone on the basis on their race. It recently become illegal to burn the Quran (which sadly only serves to legitimize crazy Muslims killing or stabbing those who do). And you may face up to three years in prison for posting online in a way that can be construed as morally supporting terrorism. But despite all of these examples and laws, I'm constantly arguing with Europeans who cling to the idea that they do have free speech like Americans. Many of them mistakenly think that "hate speech" is illegal in the US, for example. It is not. America really takes the first amendment quite seriously. Even when it comes to hate speech. Famously, the Jewish lawyers of the (now unrecognizable) ACLU defended the right of literal, actual Nazis to march for their hateful ideology in the streets of Skokie, Illinois in 1979 and won. Another common misconception is that "misinformation" is illegal over there too. It also is not. That's why the Twitter Files proved to be so scandalous. Because it showed the US government under Biden laundering an illegal censorship regime -- in grave violation of the first amendment -- through private parties, like the social media networks. In America, your speech is free to be wrong, free to be hateful, free to insult religions and celebrities alike. All because the founding fathers correctly saw that asserting the power to determine otherwise leads to a totalitarian darkness. We've seen vivid illustrations of both in recent years. At the height of the trans mania, questioning whether men who said they were women should be allowed in women's sports or bathrooms or prisons was frequently labeled "hate speech". During the pandemic, questioning whether the virus might have escaped from a lab instead of a wet market got labeled "misinformation". So too did any questions about the vaccine's inability to stop spread or infection. Or whether surgical masks or lock downs were effective interventions. Now we know that having a public debate about all of these topics was of course completely legitimate. Covid escaping from a lab is currently the most likely explanation, according to American intelligence services, and many European countries, including the UK, have stopped allowing puberty blockers for children. Which brings us to that last bugaboo: Mass immigration. Vance identified it as one of the key threats to Europe at the moment, and I have to agree. So should anyone who've been paying attention to the statistics showing the abject failure of this thirty-year policy utopia of a multi-cultural Europe. The fast changing winds in European politics suggest that's exactly what's happening. These are not separate issues. It's the lack of free speech, and a catastrophically narrow Overton window, which has led Europe into such a mess with mass immigration in the first place. In Denmark, the first popular political party that dared to question the wisdom of importing massive numbers of culturally-incompatible foreigners were frequently charged with claims of racism back in the 90s. The same "that's racist!" playbook is now being run on political parties across Europe who dare challenge the mass immigration taboo. But making plain observations that some groups of immigrants really do commit vastly more crime and contribute vastly less economically to society is not racist. It wasn't racist when the Danish Folkparty did it in Denmark in the 1990s, and it isn't racist now when the mainstream center-left parties have followed suit. I've drawn the contrast to Sweden many times, and I'll do it again here. Unlike Denmark, Sweden kept its Overton window shut on the consequences of mass immigration all the way up through the 90s, 00s, and 10s. As a prize, it now has bombs going off daily, the European record in gun homicides, and a government that admits that the immigrant violence is out of control. The state of Sweden today is a direct consequence of suppressing any talk of the downsides to mass immigration for decades. And while that taboo has recently been broken, it may well be decades more before the problems are tackled at their root. It's tragic beyond belief. The rest of Europe should look to Sweden as a cautionary tale, and the Danish alternative as a precautionary one. It's never too late to fix tomorrow. You can't fix today, but you can always fix tomorrow. So Vance was right to wag his finger at all this nonsense. The lack of free speech and the problems with mass immigration. He was right to assert that America and Europe has a shared civilization to advance and protect. Whether the current politicians of Europe wants to hear it or not, I'm convinced that average Europeans actually are listening.
I've never used any other AI "assistant," although I've talked with those who have, most of whom are not very positive. My experience using Xcode's AI is that it occasionally offers a line of code that works, but you mostly get junk
Hey all, quick post today to mention that I added tracing support to the . If the support library for is available when Whippet is compiled, Whippet embedders can visualize the GC process. Like this!Whippet GC libraryLTTng Click above for a full-scale screenshot of the trace explorer processing the with the on a 2.5x heap. Of course no image will have all the information; the nice thing about trace visualizers like is that you can zoom in to sub-microsecond spans to see exactly what is happening, have nice mouseovers and clicky-clickies. Fun times!Perfetto microbenchmarknboyerparallel copying collector Adding tracepoints to a library is not too hard in the end. You need to , which has a file. You need to . Then you have a that includes the header, to generate the code needed to emit tracepoints.pull in the librarylttng-ustdeclare your tracepoints in one of your header filesminimal C filepkg-config Annoyingly, this header file you write needs to be in one of the directories; it can’t be just in the the source directory, because includes it seven times (!!) using (!!!) and because the LTTng file header that does all the computed including isn’t in your directory, GCC won’t find it. It’s pretty ugly. Ugliest part, I would say. But, grit your teeth, because it’s worth it.-Ilttngcomputed includes Finally you pepper your source with tracepoints, which probably you so that you don’t have to require LTTng, and so you can switch to other tracepoint libraries, and so on.wrap in some macro I wrote up a little . It’s not as easy as , which I think is an error. Another ugly point. Buck up, though, you are so close to graphs!guide for Whippet users about how to actually get tracesperf record By which I mean, so close to having to write a Python script to make graphs! Because LTTng writes its logs in so-called Common Trace Format, which as you might guess is not very common. I have a colleague who swears by it, that for him it is the lowest-overhead system, and indeed in my case it has no measurable overhead when trace data is not being collected, but his group uses custom scripts to convert the CTF data that he collects to... (?!?!?!!).GTKWave In my case I wanted to use Perfetto’s UI, so I found a to convert from CTF to the . But, it uses an old version of Babeltrace that wasn’t available on my system, so I had to write a (!!?!?!?!!), probably the most Python I have written in the last 20 years.scriptJSON-based tracing format that Chrome profiling used to usenew script Yes. God I love blinkenlights. As long as it’s low-maintenance going forward, I am satisfied with the tradeoffs. Even the fact that I had to write a script to process the logs isn’t so bad, because it let me get nice nested events, which most stock tracing tools don’t allow you to do. I fixed a small performance bug because of it – a . A win, and one that never would have shown up on a sampling profiler too. I suspect that as I add more tracepoints, more bugs will be found and fixed.worker thread was spinning waiting for a pool to terminate instead of helping out I think the only thing that would be better is if tracepoints were a part of Linux system ABIs – that there would be header files to emit tracepoint metadata in all binaries, that you wouldn’t have to link to any library, and the actual tracing tools would be intermediated by that ABI in such a way that you wouldn’t depend on those tools at build-time or distribution-time. But until then, I will take what I can get. Happy tracing! on adding tracepoints using the thing is it worth it? fin