It’s one of the world’s richest library collections: an archive that holds a vast array of historical gems.
From the papal bull that excommunicated Martin Luther to desperate pleas for help that Mary Queen of Scots sent to Pope Sixtus V before her execution, it holds some of history’s greatest treasures.
It’s secretive by name and nature.
But, for the first time in its existence, it’s about to have its collection digitised so that anyone with an interest can access the knowledge locked inside it.
Located within the Vatican, the Vatican Secret Archives (VSA), or Archivum Secretum, is not currently open to just anyone with an interest.
If you want to browse this collection of rare historical items, you need to apply for special access, head to Rome and go through every page by hand. If you get granted permission, that is.
But that’s all about to change, thanks to some artificial intelligence (AI) and an unlikely group of high school kids.
The scale of the VSA collection, which dates back more than 12 centuries, is massive. It fills about 85 kilometres of shelving. Of those 85 kilometres, just a few millimetres’ worth of pages have been scanned and are available online. Even fewer pages have been transcribed into searchable text with optical character recognition (OCR).
The VSA digitisation project is called In Codice Ratio. It uses a combination of artificial intelligence and OCR software to scan the texts, transcribe them into machine-readable form and then make the transcripts available online. But things kept going awry when the modern software met the ancient handwriting.
The solution: high school kids. The team behind In Codice Ratio recruited students from 24 schools in Italy to build the project’s memory banks. The students logged on to a website, where they found a screen with three sections:
The green bar along the top contains clear examples of a letter from a medieval Latin text, in this case the letter g. The red bar in the middle contains misleading look-alikes of g, which the Codice scientists call “false friends”.
The grid at the bottom is the heart of the program. Each of the images there is composed of a few fragments of pen stroke, cut up like jigsaw pieces, that the OCR software grouped together – pretty much its guess at a plausible letter. The students then judged the OCR’s efforts, telling it which guesses were good and which were bad. They did this by comparing each image to the platonically perfect green letters and clicking a checkbox when they saw a match.
Image by image, click by click, the students ‘taught’ the software what each of the 22 characters in the Medieval Latin alphabet (plus some alternative forms) looks like.
The students didn’t need to be able to read Latin. All they had to do was match visual patterns.
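To make the checkbox idea concrete, here is a minimal sketch of how many students’ clicks could be pooled into training labels. It is not the project’s actual code: the function name `aggregate_votes`, the `min_votes` and `threshold` parameters and the sample image IDs are all assumptions made up for illustration.

```python
from collections import defaultdict


def aggregate_votes(votes, min_votes=3, threshold=0.7):
    """Pool checkbox clicks into trusted positive and negative examples.

    votes: iterable of (image_id, candidate_letter, is_match) tuples,
    one per student judgement. All names here are hypothetical.
    """
    # (image_id, letter) -> [number of "match" clicks, total judgements]
    tallies = defaultdict(lambda: [0, 0])
    for image_id, letter, is_match in votes:
        tally = tallies[(image_id, letter)]
        tally[0] += int(is_match)
        tally[1] += 1

    positives, negatives = [], []
    for (image_id, letter), (matches, total) in tallies.items():
        if total < min_votes:
            continue  # too few opinions to trust either way
        share = matches / total
        if share >= threshold:
            positives.append((image_id, letter))   # good guess at this letter
        elif share <= 1 - threshold:
            negatives.append((image_id, letter))   # a likely "false friend"
    return positives, negatives


# Made-up sample data: three students agree on img1, reject img2,
# and only one student has seen img3 so far.
sample_votes = [
    ("img1", "g", True), ("img1", "g", True), ("img1", "g", True),
    ("img2", "g", False), ("img2", "g", False), ("img2", "g", False),
    ("img3", "g", True),
]
positives, negatives = aggregate_votes(sample_votes)
```

Requiring several judgements per image before trusting a label is one simple way to let “a small and simple contribution by many people” outweigh any single mistaken click.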
“The idea of involving high-school students was considered foolish,” says Paolo Merialdo, who dreamed up Codice Ratio.
“But now the machine is learning thanks to their efforts. I like that a small and simple contribution by many people can indeed contribute to the solution of a complex problem.”
Eventually, the students had identified enough examples that the software started grouping the pieces together on its own and judging which letters were there.
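How might software decide which neighbouring pieces belong to the same letter? One common approach, sketched below, is to try every way of grouping consecutive pieces and keep the reading whose letters the classifier trusts most. This is an illustration under stated assumptions, not the In Codice Ratio implementation: `best_reading`, `toy_scorer` and the piece names are hypothetical, and the toy lookup table stands in for a trained letter classifier.

```python
import math


def best_reading(pieces, score_group, max_group=3):
    """Find the most plausible way to group consecutive pieces into letters.

    score_group(group) must return (letter, probability) for a tuple of
    pieces. Returns (log_score, letters) for the best full reading.
    """
    n = len(pieces)
    # best[i] is the best (log-score, letters) covering the first i pieces
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for size in range(1, min(max_group, end) + 1):
            start = end - size
            prev_score, prev_letters = best[start]
            if prev_score == -math.inf:
                continue  # no valid reading reaches this starting point
            letter, prob = score_group(tuple(pieces[start:end]))
            if prob <= 0:
                continue
            candidate = prev_score + math.log(prob)
            if candidate > best[end][0]:
                best[end] = (candidate, prev_letters + [letter])
    return best[n]


# Hypothetical stand-in for a trained classifier: a hand-written table.
TOY_TABLE = {
    ("bowl",): ("c", 0.4),
    ("stem",): ("l", 0.5),
    ("bowl", "stem"): ("d", 0.9),
    ("dotted_stem",): ("i", 0.9),
}


def toy_scorer(group):
    # Unknown groupings get a low score instead of being ruled out
    return TOY_TABLE.get(group, ("?", 0.01))


score, letters = best_reading(["bowl", "stem", "dotted_stem"], toy_scorer)
```

In this toy case the bowl and stem score better together as a d than apart as c and l, so the sketch reads the three pieces as “di”.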
This technology has the potential to unlock untold numbers of other documents at historical archives around the world, which can only be considered awesome.