The French version of this article is here.


A few years ago I bought Malaparte’s Kaputt, translated into French by Juliette Bertrand and published by Denoël in 2006. As usual, I hardly remember the plot. What I will remember for a long time, though, are the powerful images Malaparte creates: the German soldiers terrorised by the Ukrainian anti-tank dogs, the screams in the night during the Jassy pogrom, the moose dying right in the middle of the courtyard of the presidential palace in Helsinki.

I also associate this wonderful text with a feeling of enormous waste. I remember being shocked by the staggering number of typos: u instead of n (and vice versa), l instead of I, and so on. I had never seen such poor work. These are typical OCR mistakes. The publisher is nevertheless supposed to guarantee the quality of what it publishes, and not to betray the author and the text.

Although I have fortunately never again encountered such a lack of respect for authors and their readers, I have noticed that this kind of carelessness happens quite often in scholarly publishing.
Most errors come from poor OCR and from output encoded in ASCII rather than UTF-8. When academic publishers digitised their backfiles in the early 2000s, they probably ran everything through a bulk scanner. To hell with diacritics, maths symbols, tables and the other usual OCR stumbling blocks: a PDF for the image, a rough OCR layer for the text, and the job is done.
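To see what forcing OCR output into ASCII actually costs, here is a tiny sketch in Python (the strings are made up; the point is simply that ASCII has no room for diacritics or maths symbols):

```python
# Forcing text into ASCII silently destroys anything beyond plain English letters.
for sample in ["Hôpital Necker, présenté par Noël", "∀x ∈ ℝ, x² ≥ 0"]:
    ascii_only = sample.encode("ascii", errors="replace").decode("ascii")
    print(repr(sample), "->", repr(ascii_only))
# 'Hôpital Necker, présenté par Noël' -> 'H?pital Necker, pr?sent? par No?l'
# '∀x ∈ ℝ, x² ≥ 0'                    -> '?x ? ?, x? ? 0'
```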

The problem is that a scholar does not read the way a casual reader does. The electronic version of an article should enhance the overall experience, making it swifter and more efficient. When a human reads an article that is only available as an image-mode PDF (roughly 0.5% of the articles on ScienceDirect, about 44,000 articles) or that comes from poor OCR, the worst outcome is a bad user experience: a Ctrl+F search that finds nothing even though the text is in the document, difficulty reusing the content, and so on.
In many research fields, text and data mining (or content mining) is expanding, letting machines analyse corpora of several hundred or several thousand documents. Before they can detect possible underlying or distant links between two genes, two molecules or two phenomena, machines have to be taught to recognise specific terms, called named entities, according to their context. But even before that, they must simply be taught how to read a text, so that they can tell whether they are dealing with a title, a paragraph or a citation. A PDF with an OCR layer has no tags. Everything sits at the same level: body text, notes, even running headers and page numbers. Line endings are hard-coded. Several projects aim at converting PDF to XML, but the pre-processing tasks remain huge and, worse, redundant: research projects that rely on similar corpora have to redo the same pre-processing work because the publisher has not done it.
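As a rough illustration of what miners actually get from such files, here is a minimal sketch using the pdfminer.six library (scanned-article.pdf is a hypothetical file standing for any OCR’d article): the extracted text is a flat stream in which section titles, footnotes, running headers and hard line breaks are indistinguishable.

```python
# Minimal sketch: pulling the text layer out of a PDF with pdfminer.six.
# "scanned-article.pdf" is a hypothetical file name used for illustration.
from pdfminer.high_level import extract_text

text = extract_text("scanned-article.pdf")

# What comes out is an untagged stream: nothing distinguishes a section title
# from a footnote or a page number, and every line ending is hard-coded.
for line in text.splitlines()[:20]:
    print(repr(line))
```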

Some well-meaning publishers seem to think that XML is the magic answer. Alas, when they stick to the minimal requirements and dump the whole body of the article between a single pair of <body></body> tags (and in some cases inside header metadata tags), the job can hardly be called properly done…
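To make the point concrete, here is an illustrative sketch (the file name is hypothetical and the tag names follow JATS conventions, not any particular publisher’s DTD): it simply counts how much structure a file actually declares. A file whose <body> holds a single text blob, with no <sec> or <p> elements, is XML in name only.

```python
# Illustrative sketch: how much structure does a publisher's XML file declare?
# "article.xml" is a hypothetical file; <sec> and <p> follow JATS conventions.
import xml.etree.ElementTree as ET

tree = ET.parse("article.xml")
body = tree.find(".//body")

sections = body.findall(".//sec") if body is not None else []
paragraphs = body.findall(".//p") if body is not None else []

print(f"{len(sections)} <sec> elements, {len(paragraphs)} <p> elements")
if body is not None and not sections and not paragraphs:
    print("The whole article is one blob inside <body>: XML in name only.")
```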

So let us call it a youthful error, the result of pioneering practices at a time when no standards existed and everything had to be invented. Still, we are talking about roughly half of all the scholarly output ever published, which is hardly insignificant.

For born-digital content, problems of text structure are less common, even though it is unsatisfactory that some publishers, starting with Elsevier, use their own proprietary DTD rather than a standard such as JATS. But that does not settle everything. Even now, some flaws prevent machines – and therefore scholars – from making the best of the texts they analyse: home-made uses of UTF-8, disregard for W3C recommendations such as MathML, vector images flattened into plain heaps of pixels…
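A machine-side check makes the difference visible. The sketch below (again assuming a hypothetical, JATS-flavoured article.xml) counts formulas encoded as MathML, which a program can parse, against formulas shipped as graphics, which are just pixels to it.

```python
# Illustrative sketch: are the formulas machine-readable (MathML) or just pictures?
# "article.xml" is a hypothetical file; tag names follow JATS and MathML conventions.
import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

root = ET.parse("article.xml").getroot()

mathml_formulas = root.findall(f".//{MATHML_NS}math")
graphics = root.findall(".//graphic") + root.findall(".//inline-graphic")

print(f"{len(mathml_formulas)} MathML formulas, {len(graphics)} graphics")
# A file full of <graphic> elements and empty of MathML leaves every equation
# (and every vector figure flattened to pixels) opaque to mining tools.
```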

Like Denoël with Kaputt, many publishers do not respect the rich manuscripts sent in by authors. It is as if they had lost those manuscripts, sometimes written in TeX or LaTeX, and fallen back on cheap OCR instead (as this example, among others, shows – yes, you read « Typesetting by the editors in TeX » but your computer reads « Typesetting by the editors in 1l9 »). But just as it is possible to find a better edition of Malaparte’s text, it is, surprisingly, possible to find better versions of articles outside the publishers’ websites. For older content, aggregators seem to have done a better job than the publishers, probably because the OCR process was better supervised. A nice example is the Quarterly Journal of Economics, which can be found on the OUP, EBSCO and JSTOR platforms. For more recent content, the sources uploaded by the authors to institutional or subject repositories (such as arxiv.org) are probably more reliable.
Even if we lived in an ideal world where text and data mining were possible in a clearer and more transparent way, without any specific licence, scholars would still have to face this quality issue, which harms their work.

So far librarians have focused more on metadata quality, and we all know what a big deal that is. But perhaps we should also take up this data quality issue, which likewise harms the tools we offer our patrons. Shouldn’t we ask, for instance, that discovery tools include TDM features, ranking better-quality sources first or making bulk downloads easier (à la Pubget, for instance)? Is this a path vendors are willing or able to follow?
In any case, it seems pretty obvious that we should not take at face value the publishers’ claim to be champions of quality, and that we should systematically question the value they say they add to the content provided by authors. Academic publishers may even be infringing authors’ moral rights when they produce crappy OCR output.