Wednesday, July 29

Ask Hackaday: Why Did GitHub Ship All Our Software Off To The Arctic?

If you’ve logged onto GitHub recently and you’re an active user, you might have noticed a new badge on your profile: “Arctic Code Vault Contributor”. Sounds pretty awesome, right? But whose code got archived in this vault, how is it being stored, and what’s the point?

They Froze My Computer!

On February 2nd, 2020, GitHub took a snapshot of every public repository that met at least one of the following criteria (the selection logic is sketched in code just after the list):

  • Active since Nov 2019
  • 250 or more stars
  • At least one star and new commits since Feb 2019
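To make the “at least one of” explicit, here’s a minimal Python sketch of that selection logic. The repository fields (stars, last_commit) and the exact cut-off dates are illustrative assumptions on my part, not GitHub’s actual schema or query:

```python
from datetime import date

def qualifies(repo: dict) -> bool:
    """Hypothetical filter mirroring the archive's published criteria.

    `repo` is assumed to look like {"stars": int, "last_commit": date}.
    """
    active_since_nov_2019 = repo["last_commit"] >= date(2019, 11, 1)
    well_starred = repo["stars"] >= 250
    starred_and_active = (repo["stars"] >= 1
                          and repo["last_commit"] >= date(2019, 2, 2))

    # A repository only needs to meet one of the three criteria.
    return active_since_nov_2019 or well_starred or starred_and_active

# Example: a small but recently active repo makes the cut.
print(qualifies({"stars": 0, "last_commit": date(2020, 1, 15)}))  # True
```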

Then they traveled to Svalbard, found a decommissioned coal mine, and archived the code in deep storage underground – but not before they made a very cinematic video about it.

How It Works


For its combination of longevity, price, and density, GitHub chose film storage, provided by piql.

There’s nothing too remarkable about the storage medium: the tarball of each repository is encoded onto standard silver halide film as a 2D barcode, distributed across frames of 8.8 million pixels each (roughly 4K resolution). Whilst officially rated for 500 years, the film should last at least 1,000.
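To make that pipeline concrete, here’s a toy sketch of splitting a repository tarball into frame-sized payloads. The 8.8-million-pixel frame size comes from the article; the one-data-bit-per-pixel figure and everything else is an illustrative assumption, not piql’s actual format (which adds headers, error correction, and so on):

```python
import math

PIXELS_PER_FRAME = 8_800_000  # ~8.8 million pixels per frame, per the article
BITS_PER_PIXEL = 1            # assumption: one data bit per pixel, no error correction
FRAME_BYTES = (PIXELS_PER_FRAME * BITS_PER_PIXEL) // 8

def frames_needed(payload_bytes: int) -> int:
    """How many film frames a payload of the given size would occupy."""
    return math.ceil(payload_bytes / FRAME_BYTES)

def chunk_tarball(data: bytes):
    """Yield frame-sized chunks of a tarball, ready to be drawn as 2D barcodes."""
    for offset in range(0, len(data), FRAME_BYTES):
        yield data[offset:offset + FRAME_BYTES]

# Example: a 50 MB tarball would span roughly 48 frames under these assumptions.
print(frames_needed(50 * 1024**2))  # 48
```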

You might imagine that all of GitHub’s public repositories would take up a lot of space when stored on film, but the data turns out to be only 21 TB when compressed, which means the whole archive fits comfortably in a shipping container.
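As a back-of-envelope check under the same illustrative one-bit-per-pixel assumption as the sketch above (the real density, with error correction and metadata, will differ), 21 TB works out to a frame count in the tens of millions:

```python
PIXELS_PER_FRAME = 8_800_000          # ~8.8 million pixels per frame, per the article
BITS_PER_PIXEL = 1                    # illustrative assumption, as above
total_bits = 21 * 10**12 * 8          # 21 TB of compressed archive data
frames = total_bits / (PIXELS_PER_FRAME * BITS_PER_PIXEL)
print(f"≈ {frames / 1e6:.1f} million frames")  # ≈ 19.1 million frames
```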

Each reel starts with slides containing an un-encoded, human-readable text guide in multiple languages, explaining to future humanity how the archive works. If you have five minutes, it’s good fun to read the guide and see how GitHub explains the archive to whoever discovers it. It’s interesting to see the range of future knowledge the guide caters to: it starts by explaining in very basic terms what computers and software are, despite the fact that decompression software would be required to use any of the archive. To bridge this gap, GitHub is also providing a “Tech Tree”, a comprehensive guide to modern software, compilation, encoding, compression, and so on. Interestingly, whilst the introductory guide is open source, the Tech Tree does not appear to be.

But the bigger question isn’t how GitHub did it; it’s why they did it at all.

Why?

The mission of the GitHub Archive Program is to preserve open source software for future generations.

GitHub talks about two reasons for preserving software like this: historical curiosity and disaster. Let’s talk about historical curiosity first.

There is an argument that preserving software is essential to preserving our cultural heritage. This is an easily bought argument: even if you’re in the camp that believes there’s nothing artistic about a bunch of ones and zeros, it can’t be denied that software is a platform and medium for an incredibly diverse swath of modern culture.

GitHub also cites past examples of important technical information being lost to history, such as the search for the blueprints of the Saturn V, or the rediscovery of the recipe for the Roman mortar used to build the Pantheon. But data storage, backups, and networks have evolved significantly since the Saturn V’s blueprints were produced. Today people frequently quip, “once it’s on the internet, it’s there forever”. What do you reckon? Is the argument that software (or rather, the subset of software that lives in public GitHub repos) could easily be lost in 2020 and beyond still valid?

Whatever your opinion, simply preserving open source software on long timescales is already being done by many other organisations, and it doesn’t require an Arctic bunker. For that we have to consider GitHub’s second motive: a large-scale disaster.

If Something Goes Boom

We can’t predict what apocalyptic disasters the future may bring – that’s sort of the point. But if humanity gets into a fix, would a code vault be useful?

Firstly, let’s get something straight: for us to need a code archive buried deep in Svalbard, something needs to have gone really, really wrong. Wrong enough that things like softwareheritage.org, the Wayback Machine, and countless other “conventional” backups aren’t working. So this would be a disaster that has wiped out the majority of our digital infrastructure, including worldwide redundant backups and networks, requiring us to rebuild things from the ground up.

This raises the question: if we were to rebuild our digital world, would we make a carbon copy of what already exists, or would we start from scratch? There are two sides to this coin: could we rebuild our existing systems, and would we want to?

Tackling the former first: modern software is built upon many, many layers of abstraction. In a post-apocalyptic world, with our infrastructure and lower-level services wiped out, would we even be able to use much of that software? To take a random, perhaps tenuous example, say we had to rebuild our networks, DNS, ISPs, and so on from scratch. Inevitably, behavior would differ, nodes and information would be missing, and software built on the layers above might be unstable or insecure. To take more concrete examples, this problem is greatest where open-source software relies on closed-source infrastructure: AWS, third-party APIs, and even low-level chip designs that might not have survived the disaster. Could we stably reimplement existing software on top of such reworked foundations?

The latter point, whether we would want to rebuild our software as it is now, is more subjective. I have no doubt every Hackaday reader has one or two things they might change about, well, almost everything, but can’t due to existing infrastructure and legacy systems. Would the opportunity to rebuild modern systems from scratch win out over the time cost of doing so?

Finally, you may have noticed that software evolves rather quickly. Being a web developer today, familiar with all the major technologies in use, looks pretty different from the same role five years ago. So does archiving a static snapshot of code make sense, given how quickly it would go out of date? Some would argue that throwing around numbers like 500 to 1,000 years is pretty meaningless for reuse if the software landscape has completely changed within 50. If an apocalypse were to occur today, would we want to rebuild our world using code from the ’80s?

Even if we didn’t directly reuse the archived code to rebuild our world, there are still plenty of ways it might come in handy: as a reference for the logic it implements, its architecture, its data structures, and so on. But these are just my thoughts, and I want to hear yours.

Was This a Useful Thing to Do?

The idea that there is a vault in the Arctic containing code you wrote is undeniably fun to think about. What’s more, your code will now almost certainly outlive you! But do you, dear Hackaday reader, think this project is a fun exercise in sci-fi, or does it hold real value for humanity?
