The National Archives is a non-ministerial department of the UK government, housing 1,000 years’ worth of official British archives.
It’s tasked with preserving important political and cultural heritage, working with both physical and digital records ranging from Shakespeare’s will to tweets from government accounts. Other digital content includes records from the Office for National Statistics website, and data.gov.uk.
John Sheridan, Digital Director of the National Archives, has worked at the organisation since 2010. He is currently involved in one of the most exciting projects of his career there: the huge undertaking of digitising web-published government content in partnership with web archiving company MirrorWeb.
Digitising this vast collection naturally fits with the aims of this body. “As government has moved onto the web, we’ve followed it, and we have for many years now tried to capture, preserve, and make available a comprehensive archive of the UK government on the web,” says Sheridan. “It’s part and parcel of how the government goes about capturing its corporate memory.”
“We’re a physical archive, we’re a first generation digital archive, and we’re moving on to become a second generation disruptive digital archive.”
Preserving government records is obviously an important task, but who are the people most likely to be regularly drawing upon this database? “It’s really anyone who’s got an interest in what the government was saying, and that ranges from the engaged citizen who’s wanting to point at what we’ve archived as part of a conversation they’re having on Twitter right the way through to people working inside government itself,” says Sheridan.
“We know that a lot of civil servants or public officials will use our web archive for quickly and easily checking what previous policy used to be, or looking up something that was said in an old document.”
The web archive was previously held in “an on-site bespoke designed hosting infrastructure”. Another aim of the project is to migrate this collection to cloud-based storage and to make it easier to search, access and use, according to Sheridan. “Search is a traditionally very difficult problem with a large web archive, and this collection is currently about 135 terabytes, so it’s quite a big collection,” he says.
But another complicating factor is that the collection is not solely made up of text-based artifacts. “You’ve got things like video, you’ve got text, you’ve got spreadsheets, you’ve got raw data, you’ve got CSV files, you’ve got photographs and images, all sorts of things,” says Sheridan. “It’s a very diverse collection of data.”
To improve the searchability of this extensive content, the team has adopted ‘optical character recognition’ (OCR), the electronic conversion of images of typed, handwritten or printed text into machine-encoded text, and a common way of digitising printed documents.
“By OCRing that content, it meant that for the first time we could provide a full-text search across all of the material held in the archive,” says Sheridan.
“We were then able to add some faceting to the search,” he says. “If something had been on the old Department for Education’s website and you knew it was there, rather than doing a search across the whole of the web archive, you can now search just on that domain and find any content that we’ve archived within that domain over the time, or you can limit your search by a time period.”
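The faceting Sheridan describes can be pictured as a full-text match narrowed by domain and by capture date. What follows is a minimal sketch in Python, not the archive's actual implementation: the `ArchivedPage` record, the `search` function, the `.example` domains and the sample pages are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ArchivedPage:
    """Hypothetical record for one archived capture of a web page."""
    domain: str
    captured: date
    text: str


def search(
    pages: Iterable[ArchivedPage],
    query: str,
    domain: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List[ArchivedPage]:
    """Return pages whose text contains `query`, narrowed by optional facets."""
    needle = query.lower()
    results = []
    for page in pages:
        if needle not in page.text.lower():
            continue  # no full-text match
        if domain is not None and page.domain != domain:
            continue  # domain facet excludes this page
        if start is not None and page.captured < start:
            continue  # captured before the requested period
        if end is not None and page.captured > end:
            continue  # captured after the requested period
        results.append(page)
    return results


# Made-up sample data; the domains are placeholders, not real archive entries.
SAMPLE_PAGES = [
    ArchivedPage("education.example", date(2009, 3, 1), "School funding policy statement"),
    ArchivedPage("education.example", date(2014, 6, 15), "Updated school funding guidance"),
    ArchivedPage("statistics.example", date(2012, 1, 10), "Statistics on school funding"),
]
```

Calling `search(SAMPLE_PAGES, "school funding")` searches the whole sample collection, while adding `domain="education.example"` or a `start` and `end` date limits the results in the way the quote describes. A production system would use a search index rather than a linear scan.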
Sheridan also sees possibilities in different emerging technologies. “We have a strong interest in artificial intelligence, and we are looking at a range of different kinds of applications of artificial intelligence that’s enabled by the cloud, because we can see the opportunities when we process content at scale to be able to improve access and improve intellectual control and what we understand about a collection,” he says.
“Artificial intelligence in the sense of things like handwriting recognition technology, but also exploring artificial intelligence for things like appraisal and selection, so sorting email, for example, so you can distinguish between personal email and business email using artificial intelligence.”
Blockchain technologies are also of potential interest to the organisation. “We’re very interested in technologies around trust and authenticity, so we’re doing quite a lot of research work around technologies for assuring that records haven’t been changed over time,” says Sheridan.
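One common building block for showing that a record has not changed is a cryptographic hash, often called a fixity check. The sketch below uses Python's standard `hashlib` module. It is an illustration of the general idea only, not a description of the National Archives' research, and the function names and sample record are assumptions.

```python
import hashlib


def fingerprint(record: bytes) -> str:
    """Return the SHA-256 digest of a record as a hex string."""
    return hashlib.sha256(record).hexdigest()


def is_unchanged(record: bytes, stored_fingerprint: str) -> bool:
    """Check a record against the fingerprint taken when it was archived."""
    return fingerprint(record) == stored_fingerprint
```

If the fingerprint is stored when a record is archived, any later alteration to the record, however small, produces a different digest and fails the check. Blockchain-style ledgers build on the same idea by chaining such fingerprints together.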
The ongoing project represents a pivotal moment in his extensive career with the National Archives. “It’s one of the best things I’ve been involved with,” he says. “I’ve been here for about 10 years in the organisation, and it’s just been such a brilliant project.”