Within the walls of a beautiful old church in San Francisco’s Richmond neighborhood, racks of computer servers hum and flash with activity. They contain the Internet. Well, a very large amount of it.
The Internet Archive, a non-profit organization, has been collecting web pages since 1996 for its famous and beloved Wayback Machine. In 1997, the collection amounted to 2 terabytes of data. Colossal back then, it would now fit on a $50 USB stick.
Today, archive founder Brewster Kahle tells me, the project is on the verge of surpassing 100 petabytes — about 50,000 times more than in 1997. It contains more than 700 billion web pages.
The work doesn’t get any easier. Many websites today are dynamic, changing with every refresh. Walled gardens like Facebook are a source of great frustration for Kahle, who fears that much of the political activity that has taken place on the platform will be lost to history if not properly captured. In the name of privacy and security, Facebook (and others) make scraping difficult.
News outlet paywalls (like the FT’s) are also “problematic”, says Kahle. Archiving news was once taken very seriously, but a change in ownership or even a simple site redesign can cause content to disappear. Technology journalist Kara Swisher recently lamented that some of her early work at the Wall Street Journal went “poof” after the paper refused to sell her the material several years ago.
As we begin to explore the possibilities of the metaverse, the work of the Internet Archive will only become more complex. Its mission is to “provide universal access to all knowledge” by archiving audio, video, video games, books, magazines and software. Currently, it works to preserve the output of independent news agencies in Iran and stores Russian TV news broadcasts. Sometimes keeping things online can be an act of justice, protest, or accountability.
Still, some question whether the Internet Archive has the right to provide this material. It is currently being sued by several major book publishers over its Open Library e-book lending platform, which lets users borrow a limited number of e-books for up to 14 days. The publishers say it hurts their revenue.
Kahle says that’s ridiculous. He likes to describe the archive’s role as no different from that of a traditional library. But while a book doesn’t disappear from a shelf when its publisher goes bankrupt, digital content is more vulnerable. You cannot own a Netflix show. News articles remain available only as long as publishers want them to be. Even the songs we pay to download are rarely ours; they are simply licensed.
To avoid depending on anyone else, the Internet Archive has built its own server infrastructure, largely hosted within the church, rather than using a third-party host such as Amazon or Google. All of this costs $25 million a year. A bargain, says Kahle, pointing out that San Francisco’s public library system alone costs $171 million.
Unless we think the earliest version of today’s story isn’t worth preserving, the internet’s disappearing acts should trouble us all. Consider how hollow the coverage of Queen Elizabeth’s death would have been had it not been illustrated by extensive archival material.
Can we say with certainty that the journalism produced around her death will be as accessible even in 20 years? And what about all the social media posts made by ordinary people? We will come to regret not having competently preserved “everyday” life on the Internet.
Dave Lee is a correspondent for the FT in San Francisco