More from Uncharted Territories
How LLMs work, how they're improving today, the next ways they can get better, and whether that's a straight shot to AGI
How fast AI is improving, and how that's impacting jobs today
AGI Is Coming Sooner Due to o3, DeepSeek, and Other Cutting-Edge AI Developments
More in science
Shortly after the Trump administration took office in the United States in late January, more than 8,000 pages across several government websites and databases were taken down, the New York Times found. Though many of these have now been restored, thousands of pages were purged of references to gender and diversity initiatives, for example, and others, including the U.S. Agency for International Development (USAID) website, remain down.

On 11 February, a federal judge ruled that the government agencies must restore public access to pages and datasets maintained by the Centers for Disease Control and Prevention (CDC) and the Food and Drug Administration (FDA). Many scientists had fled to online archives in a panic; ironically, the Justice Department argued that the physicians who brought the case were not harmed because the removed information was available on the Internet Archive’s Wayback Machine. In response, the judge wrote, “The Court is not persuaded,” noting that a user must know the original URL of an archived page in order to view it. The administration’s legal argument “was a bit of an interesting accolade,” says Mark Graham, director of the Wayback Machine, who believes the judge’s ruling was “apropos.”

Over the past few weeks, the Internet Archive and other archival sites have received attention for preserving government databases and websites. But these projects have been ongoing for years. The Internet Archive, for example, was founded nearly 30 years ago as a nonprofit dedicated to providing universal access to knowledge, and it now records more than a billion URLs every day, says Graham. Since 2008, the Internet Archive has also hosted an accessible copy of the End of Term Web Archive, a collaboration that documents changes to federal government sites before and after administration changes. In the most recent collection, it has already archived more than 500 terabytes of material.

Complementary Crawls

The Internet Archive’s strength is scale, Graham says. “We can often [preserve] things quickly, at scale. But we don’t have deep experience in analysis.” Meanwhile, groups like the Environmental Data and Governance Initiative and the Association of Health Care Journalists help activists and academics identify and document changes. The Library Innovation Lab at Harvard Law School has also joined the efforts with its archive of data.gov, a 16-terabyte collection that includes more than 311,000 public datasets and is being updated daily with new data. The project began in late 2024, when the library realized that datasets are often missed in other web crawls, says Jack Cushman, a software engineer and director of the Library Innovation Lab.

A typical crawl has no trouble capturing basic HTML, PDF, or CSV files. But archiving interactive web services that are driven by databases poses a challenge. It would be impossible to archive a site like Amazon, for example, says Graham. The datasets the Library Innovation Lab (LIL) is working to archive are similarly tricky to capture. “If you’re doing a web crawl and just clicking from link to link, as the End of Term archive does, you can miss anything where you have to interact with JavaScript or with a button or with a form, where you have to ask for permission and then register or download something,” explains Cushman.

“We wanted to do something that was complementary to existing web crawls, and the way we did that was to go into APIs,” he says. APIs bypass web pages to access data directly, so the LIL’s program could fetch a complete catalog of the datasets—whether CSV, Excel, XML, or other file types—and pull the associated URLs to create an archive. In the case of data.gov, Cushman and his colleagues wrote a script to send 300 queries, each fetching 1,000 items, and then work through the 300,000 total items to gather the data. “What we’re looking for is areas where some automation will unlock a lot of new data that wouldn’t otherwise be unlocked,” says Cushman.
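In practice, that kind of catalog sweep is a short pagination loop. Below is a minimal sketch of the approach Cushman describes, assuming data.gov’s standard CKAN search endpoint and its 1,000-item page limit; it is an illustration, not the LIL’s actual script.

```python
# A sketch of the paginated catalog fetch described above. Assumes
# data.gov's standard CKAN search API; this is not the LIL's actual code.
import json
import urllib.request

API = "https://catalog.data.gov/api/3/action/package_search"
PAGE_SIZE = 1000  # CKAN caps page size, hence ~300 queries for ~300,000 items

def fetch_page(start: int) -> dict:
    """Fetch one page of dataset metadata from the catalog."""
    with urllib.request.urlopen(f"{API}?rows={PAGE_SIZE}&start={start}") as resp:
        return json.load(resp)["result"]

def catalog_resources():
    """Yield (dataset id, resource URL, file format) for every listed resource."""
    start, total = 0, None
    while total is None or start < total:
        page = fetch_page(start)
        total = page["count"]
        for dataset in page["results"]:
            for res in dataset.get("resources", []):  # CSV, Excel, XML, ...
                yield dataset["id"], res.get("url"), res.get("format")
        start += PAGE_SIZE
```

Each yielded URL can then be downloaded and stored alongside the metadata that describes it.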
The other important factor for the LIL archive was making sure the data was in a usable format. “You might get something in a web crawl where [the data] is there across 100,000 web pages, but it’s very hard to get it back out into a spreadsheet or something that you can analyze,” Cushman says. Making the archive usable, both in its data formats and its user interface, helps make it sustainable.

Lots Of Copies Keep Stuff Safe

The key to preserving the internet’s data is a principle that goes by the acronym LOCKSS: Lots Of Copies Keep Stuff Safe. When the Internet Archive suffered a cyberattack last October, it took the site down for three and a half weeks to audit the entire site and implement security upgrades. “Libraries have traditionally always been under attack, so this is no different,” Graham says. As part of its defense, the Archive now keeps several copies of its materials in disparate physical locations, both inside and outside the U.S.

“The U.S. government is the world’s largest publisher,” Graham notes. It publishes material on a wide range of topics, and “much of it is beneficial to people, not only in this country, but throughout the world, whether that is about energy or health or agriculture or security.”

The fact that many individuals and organizations are contributing to the preservation of the digital world is a good thing. “The goal is for those copies to be diverse across every metric that you can think of. They should be on different kinds of media. They should be controlled by different people, with different funding sources, in different formats,” says Cushman. “Every form of similarity between your backups creates a risk of loss.” The data.gov archive’s primary copy is stored with a cloud service, with others as backups. The archive also includes open source software to make it easy to replicate.

In addition to maintaining copies, Cushman says, it’s important to include cryptographic signatures and timestamps. Each time an archive is created, it is signed with cryptographic proof of the creator’s email address and the creation time, which can help verify the archive’s validity later.
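As a rough illustration of that last step, the sketch below hashes an archive’s files into a manifest, stamps it with the creator’s email address and the time, and signs it. The Ed25519 keypair and the field names are assumptions for the example; the LIL’s actual signing scheme may differ.

```python
# Illustrative only: an Ed25519 keypair stands in for the email-bound
# identity described in the article; the manifest field names are hypothetical.
import hashlib
import json
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sign_manifest(file_hashes: dict, email: str, key: Ed25519PrivateKey) -> dict:
    """Bundle file digests with a creator claim and a timestamp, then sign."""
    manifest = {
        "creator": email,                # who made the archive
        "created_at": int(time.time()),  # when it was made
        "files": file_hashes,            # path -> SHA-256 digest
    }
    payload = json.dumps(manifest, sort_keys=True).encode()
    return {"manifest": manifest, "signature": key.sign(payload).hex()}

# Example: hash one (inline) file and sign the resulting manifest
key = Ed25519PrivateKey.generate()
digest = hashlib.sha256(b"state,count\nCA,39538223\n").hexdigest()
signed = sign_manifest({"population.csv": digest}, "archivist@example.org", key)
```

Verification re-serializes the manifest the same way and checks the signature against the archivist’s public key; any tampering with the files, the timestamp, or the claimed creator breaks it.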
An Ongoing Challenge

Since President Trump took office, a lot of material has been removed from U.S. federal websites—quantifiably more than under previous new administrations, says Graham. On a global scale, however, this isn’t unprecedented, he adds. In the U.S., official government websites have been changed with each new administration since Bill Clinton’s, notes Jason Scott, a “free range archivist” at the Internet Archive and co-founder of the digital-preservation site Archive Team. “This one’s more chaotic,” Scott says. But “the web is a very high entropy entity ... Google is an archive like a supermarket is a food museum.”

The job of digital archivists is a difficult one, especially with a backlog of sites that span the evolution of internet standards. But these efforts are not new. “The ramping up will only be in terms of disk space and bandwidth resources, not the process that has been ongoing,” says Scott.

For Cushman, working on this project has underscored the value of public data. “The government data that we have is like a GPS signal,” he says. “It doesn’t tell us where to go, but it tells us what’s around us, so that we can make decisions. Engaging with it for the first time this way has really helped me appreciate what a treasure we have.”
After 20 Years, Math Couple Solves Major Group Theory Problem (Quanta Magazine): Britta Späth has dedicated her career to proving a single, central conjecture. She’s finally succeeded, alongside her partner, Marc Cabanes.
Birds Separately Evolved Complex Brains (NeuroLogica Blog): The evolution of the human brain is a fascinating subject. The brain is arguably the most complex structure in the known (to us) universe, and is the feature that makes humanity unique and has allowed us to dominate (for good or ill) the fate of this planet. But of course we are but a twig […]
Assa Auerbach’s course was the most maddening course I’ve ever taken. I was a master’s student in the Perimeter Scholars International program at the Perimeter Institute for Theoretical Physics. Perimeter trotted in world experts to lecture about modern physics. Many …