[N.B. If you only skim this post, or read just a part of it, please jump to the last few paragraphs to read my call for help and collaboration.]
Introduction
Optical Character Recognition (OCR) software has increasingly been a part of scholarship, particularly in digital humanities. For example, it is fundamental to the Google Books project (which so many use for research), corpus creation and curation, and various aspects of ebook media. Yet most of these applications relate to printed books. One way to expand the methods of OCR is to apply them to extracting data from medieval manuscripts, but this area of research has received much less attention.[1]
My premise is that OCR has potential that can be harnessed for working with medieval texts in their manuscript witnesses.[2] At the outset, I admit that I’m not an OCR expert, but I have jumped into experimenting in order to see what might come from it. In what follows, I outline my experimental process, some results, and some reflections on what might come from further work.
What I imagine as possible is to take every digitized manuscript witness for a text (even hundreds, if that’s how many there are), use OCR extraction to create a plain text file for each, and collate these witnesses with a computer-assisted tool like JuxtaCommons. Such an ideal would not eliminate issues like post-processing correction of OCR extractions, or editorial decisions about modernizing forms like abbreviations, punctuation, and capitalization. The goal is not to eliminate human editorial work with computers; rather, accurate OCR for manuscripts has the potential to reduce editing time and to make dealing with large numbers of witnesses more efficient. This type of work is potentially useful considering how many significant medieval texts survive in hundreds of manuscripts, yet remain unedited—often due to the unwieldiness of an editorial project dealing with every witness. Such a process depends on the abilities of OCR software. To evaluate the possibilities, and to point toward future directions for research, the following explains my process with a set of related documents, with the goal of establishing the current state of OCR software as a baseline for future work.
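To make this imagined pipeline concrete, here is a minimal sketch of what a per-witness batch extraction might look like with open-source tools—a sketch only, not what I did for this study: pytesseract stands in for whichever engine one uses, the directory names are hypothetical, and Tesseract’s Latin (“lat”) language data is assumed to be installed.

```python
from pathlib import Path

from PIL import Image
import pytesseract

# One plain-text file per digitized page image, ready to upload to a
# collation tool such as JuxtaCommons. Directory names are hypothetical.
Path("witness_texts").mkdir(exist_ok=True)
for page in sorted(Path("witness_images").glob("*.png")):
    text = pytesseract.image_to_string(Image.open(page), lang="lat")
    Path("witness_texts", page.stem + ".txt").write_text(text, encoding="utf-8")
```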
The sources that I chose for this examination revolve around a Latin text known as the Pseudo-Marcellus Passio Petri et Pauli. Because of the popularity of this text and its significance in the tradition of medieval veneration of Peter and Paul, it has been central to some of my work on apocrypha in Anglo-Saxon England. No modern critical edition of the Passio exists (one is in preparation by Alberto D’Anna), although Richard Adalbert Lipsius and Max Bonnet published an edition in 1891 based on a limited group of manuscripts.[3] This process, then, is both experimental as well as practical, since this text is one among many in need of scholarly editing from hundreds of medieval manuscript sources.
I used two commercial software engines for OCR extraction: Adobe Acrobat Pro X (10.1.10) on Mac OS X; and ABBYY FineReader Pro 12 (12.0.6) on a PC with Windows 7 Enterprise. Both are reputable OCR engines with demonstrated high accuracy. These engines were chosen based on a variety of factors, including OCR accuracy percentages, accessibility,[4] languages recognized, and ease of use. In a recent review, these same criteria and others were used to evaluate the top ten pieces of OCR software; only OmniPage Standard ranked higher in accuracy, and it recognizes 123 languages (Acrobat recognizes 42, FineReader 190), including Greek and Latin.[5]
While all of the engines mentioned so far represent commercial software, open source options also exist, though with varying rates of accuracy. The most prominent and high-quality open source option is the Tesseract OCR Engine, first developed by Hewlett-Packard and now owned by Google. I excluded Tesseract from my experiments because of the technical proficiency in coding needed to use it efficiently (and I want more time to learn the engine and work with it).[6] Unfortunately, I was not able to find definitive evidence about the accuracy of Tesseract in comparison with commercial engines.[7] Although I only used Acrobat and FineReader for this initial study, other OCR engines (especially Tesseract) offer possibilities for future work on manuscript OCR, as I will discuss below.
Processing the Documents
For this experiment, I performed OCR extraction on three documents representing witnesses to the Passio Petri et Pauli. The first is the 1891 edition by Lipsius and Bonnet, digitized and provided by Google but without OCR in the downloadable document. This document serves as a comparison in light of other studies of OCR for historical (particularly nineteenth-century) sources. The other two sources were high-resolution digital photographs of two early medieval manuscripts containing the Passio: Wolfenbüttel, Herzog August Bibliothek, Cod. Guelf. Weissenburg 48, folios 22v-32v; and St. Gall, Stiftsbibliothek, Cod. Sang. 561, pages 3-20.[8] Notably, Lipsius and Bonnet consulted both manuscripts for their edition (see lxxv-vi), although neither is part of the critical apparatus of the text. These documents and the results may be found on this GitHub repository.
For the Lipsius-Bonnet edition, I processed the PDF with OCR using each piece of software, saving the extracted text as a plain text (txt) document. After extracting the data, I cleaned the plain text documents (what David Mimno has called “data carpentry”),[9] eliminating extraneous data such as chapter numbers, hyphens, page numbers, page headers and footers, apparatus, etc., in order to retain only the main text for comparison.[10] When cleaning the data, however, OCR readings of the main text were not modified.
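For illustration, a first pass of this kind of data carpentry might look like the following in Python—a sketch under stated assumptions: the actual cleaning was done by hand, and the patterns for page numbers and chapter numerals are hypothetical stand-ins for the Lipsius-Bonnet layout.

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Strip page furniture from raw OCR output and rejoin hyphenated words."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or re.fullmatch(r"[0-9]+", line):  # blank lines, bare page numbers
            continue
        if re.fullmatch(r"[IVXLC]+\.?", line):         # stray roman chapter numerals
            continue
        kept.append(line)
    text = " ".join(kept)
    return re.sub(r"(\w)- (\w)", r"\1\2", text)        # end-of-line hyphenation
```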
I extracted OCR text from the manuscript images in both FineReader and Acrobat with similar processes. In FineReader, it is possible to modify some pre-processing settings to optimize results, such as setting Latin as the text language, indicating which parts of the page include text, defining text boundaries, and omitting noise (extraneous features detected as text). I took advantage of these options, but otherwise left the settings at the program defaults. In Acrobat, I used only default options. I saved extracted text from each manuscript as a plain text (txt) document. Since the manuscripts did not contain extraneous apparatus, I did not clean the data, particularly in order to establish a baseline for future comparison.
Because color and contrast seemed like one potential difficulty for OCR with the manuscript images (see discussion of results below), I also processed a single page from each manuscript in an attempt to extract data with higher contrast levels. For this process, I chose page 3 of St. Gall 561 and folio 23r of Weissenburg 48 (since folio 22v has only several lines of text rather than a full page). I edited each image to a black and white (grayscale) color scheme with optimal contrast and balance. For use in Acrobat, I converted each image to PDF before running the OCR process. Acrobat showed mixed results: it was unable to extract any OCR text from the St. Gall 561 image, but it did extract text from the Weissenburg 48 image. FineReader, on the other hand, produced results with both manuscripts.
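The image preparation can be approximated in a few lines with Pillow; the filename and contrast factor below are stand-ins, since in practice I adjusted contrast and balance by eye in an image editor.

```python
from PIL import Image, ImageEnhance

img = Image.open("csg-0561_page3.jpg").convert("L")  # grayscale; hypothetical filename
img = ImageEnhance.Contrast(img).enhance(2.0)        # contrast factor chosen by eye
img.save("csg-0561_page3_bw.png")                    # image for FineReader
img.save("csg-0561_page3_bw.pdf")                    # converted to PDF for Acrobat
```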
Results
In general, FineReader was a more efficient OCR engine for dealing with both print and manuscript sources. Presumably this is due in part to the different language capabilities of the two engines. One major benefit of working with FineReader is that the OCR language can be set to Latin (as it was with every document in this process), whereas Acrobat does not include Latin in its language settings. While both English and Latin use Roman letters—English was used as a baseline for evaluating Acrobat—the data incorporated into machine reading of the two languages differs in many respects. Many mistakes in the Acrobat results bear out this notion. In what follows, I first discuss the OCR results for the Lipsius-Bonnet edition, followed by discussion of results for the two manuscript witnesses.
In order to assess the basic accuracy of each OCR engine, I compared a set of sample passages from the extracted texts to the Lipsius-Bonnet edition. For this analysis, I compared passages from chapters 1-4, 31-33, and 64-66 of the Passio, taken from the beginning, middle, and end of the text.[11] Altogether, these passages comprise 676 words. The results of this comparison are summarized in Table 1.
Chapter | FineReader Errors | Acrobat Errors | Distinct FineReader Errors | Distinct Acrobat Errors |
1 | 1 | 10 | 0 | 9 |
2 | 4 | 15 | 4 | 11 |
3 | 4 | 5 | 2 | 3 |
4 | 1 | 6 | 0 | 5 |
31 | 2 | 7 | 1 | 6 |
32 | 3 | 11 | 2 | 10 |
33 | 2 | 12 | 0 | 10 |
64 | 2 | 5 | 1 | 4 |
65 | 2 | 9 | 1 | 8 |
66 | 4 | 9 | 4 | 9 |
Totals | 25 | 89 | 15 | 75 |
Table 1. Errors by FineReader and Acrobat in OCR of Lipsius-Bonnet Edition (Sample).
This table provides the number of errors for each engine, as well as the number of distinct errors for each engine—that is, errors that did not overlap between the two engines. Where errors were not distinct, both engines misrecognized the same word (though not always with the same result). As these numbers indicate, Acrobat produced both the most errors and the most distinct errors. Additionally, most of the errors found in the FineReader extraction were also found in the Acrobat extraction.
Common Acrobat errors were misrecognitions of in for iu, e for c, b for h, o for u, li for ll, l for t, t for f, periods for commas, and difficulties with dashes for end-of-line word divisions (often rendered as a period). Although FineReader produced fewer errors, it similarly produced misrecognitions of n for u, e for c, periods for commas, and rendering some end-of-line dashes rather than recognizing whole words. For both, occasional numbers and other non-letter characters also appeared in errors.
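Recurrent confusions of this kind could eventually feed a rule-based correction pass. A bare-bones sketch follows; the pair list simply transcribes the Acrobat patterns noted above, and a real implementation would need positional and frequency information rather than blanket substitution.

```python
# Confusion pairs observed above: (form the OCR produced, form the source likely had).
CONFUSIONS = [("in", "iu"), ("e", "c"), ("b", "h"), ("o", "u"),
              ("li", "ll"), ("l", "t"), ("t", "f")]

def candidates(token: str):
    """Yield alternative readings by undoing one known confusion.
    (Replaces every occurrence at once; a fuller version would try
    each position separately and rank candidates against a lexicon.)"""
    for wrong, right in CONFUSIONS:
        if wrong in token:
            yield token.replace(wrong, right)
```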
FineReader, however, produced more punctuation errors—a total of ten in the sample passages, while Acrobat produced only five errors of this sort. The most likely explanation has to do with punctuation and capitalization in the edition. Following certain editorial standards, Lipsius and Bonnet did not capitalize the beginnings of all sentences, causing a disjuncture for modern reading practices. Modern readers hold certain assumptions about reading, including the expectation that sentences end with periods and new sentences begin with capital letters. Difficulties in reading premodern texts with computers arise because software is trained on these reading assumptions and expectations. FineReader, then, likely analyzed periods and lowercase letters together, producing results that render periods as commas—as expected before non-capitalized words. Both training and post-processing correction could improve upon these issues (as discussed below).
In several instances, both FineReader and Acrobat produced errors for the same word, as in Table 2.
Chapter (page.line) | Lipsius-Bonnet Reading | FineReader Misrecognition | Acrobat Misrecognition |
1 (119.2) | Iudaei | ludaei | ludaei |
3 (121.11) | uenias | nenias | ueuias |
4 (123.2) | indicasset | indica8S6t | indieasset |
31 (147.7) | Iude | lube | lube |
32 (147.19) | congelauerat | eongelauerat | ~ngelauerat |
33 (149.4) | ciuitatibus | duitatibus | eiuit&tibus |
33 (149.8) | cuius | cqius | cqius |
64 (173.17-18) | ad-uenisse | ad-uenisse | ad.nenisae |
Table 2. Common FineReader and Acrobat Errors in OCR of Lipsius-Bonnet Edition (Sample).
Because both engines produced errors for these words, the data suggests that the problems were due to obscurities in the digitized page images. My comparison of the page images confirmed this, since many of these errors correspond with words that appear smudged, blotted, or otherwise obscured in the digitized edition. Such problems, and errors in OCR because of them, are not uncommon in dealing with digitized versions of older books, but do present an obstacle that in many ways can only be addressed with post-processing correction.
Based on the number of errors for each engine with the selected passages, I calculated sample accuracy rates. I calculated these percentages based on word accuracy rather than character accuracy (the figure most often calculated and reported) for two reasons: first, general accuracy rates for these engines have already been studied and reported; and, second, as Simon Tanner claims, for my purposes, “In terms of effort and usefulness the word accuracy matters more than the character accuracy.”[12] For the sample passages, the calculated accuracy (excepting punctuation errors) of FineReader is 97.78%, and of Acrobat is 87.57%. While these percentages are based on only a sample of data, they demonstrate my general findings.
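For transparency, the arithmetic behind these figures appears to work out as follows—a reconstruction from the numbers above, assuming the punctuation errors are simply subtracted from the totals in Table 1:

```python
total_words = 676                                  # words in the sample passages
fr = (total_words - (25 - 10)) / total_words       # FineReader: 25 errors, 10 punctuation
ac = (total_words - (89 - 5)) / total_words        # Acrobat: 89 errors, 5 punctuation
print(f"FineReader: {fr:.2%}, Acrobat: {ac:.2%}")  # FineReader: 97.78%, Acrobat: 87.57%
```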
In turning toward the results of OCR with the manuscript images, we encounter even greater problems. The short summary—as expected—is that this is not yet a viable process for extracting text from manuscripts. Yet the process does reveal some possibilities.
While neither OCR engine produced usable results, the texts extracted with FineReader show greater potential. This is not unexpected, given the higher rates of accuracy demonstrated with the FineReader OCR of the Lipsius-Bonnet edition. With the text extractions from the manuscripts, Acrobat results present a higher number of non-letter forms, including special characters and punctuation marks. As already mentioned, Acrobat failed to render any data with the black and white image of St. Gall 561, and it extracted much less data than expected with the color image of St. Gall 561—only 302 characters in 60 lines of text, compared to 21,863 characters in 372 lines with FineReader. The major differences in results between the two engines are likely due to the ability to modify pre-processing settings in FineReader, especially the use of Latin language data built into the software. For these reasons, I will focus mainly on FineReader results in the following discussion.
The text extracted using FineReader is clearly flawed, but this baseline extraction offers some indications of potential future success. Tables 3 and 4 provide comparisons for assessing the OCR extracted texts. Table 3 presents the manuscript readings and extracted text for the first seventeen lines of the Passio in Weissenburg 48 (folio 22v), while Table 4 presents the manuscript readings and extracted text for the first four lines (excluding the incipit) in St. Gall 561 (page 3)—both comprising the first two sentences (six lines) in the Lipsius-Bonnet edition.[13]
Weissenburg 48, folio 22v | FineReader OCR Extracted Text |
INCIP PASSIO SCORU APLOR
PETRI ET PAULI [C]UM UENIS SET PAULUS IN URBI ROMA CONNUENERT ad eum omnes iudȩi dicentes nram fide in qua natus es ipsam defende.’ Non est enim iustum ut qui sis ebreus ex ebreis ueniens gen tium te magistru iudices.’ et incircucisoru defensor factus. tu cusicircucisus fide circucisionis euacues; |
{ hfcifr\tciofconv\pIok
riTMlTPMUl– 4) i «■> ‘*l c N‘15 5 -r rr PAulrium ppy^ M® m – r. ftl a’ma – jT /*)(kf<»Nl : V : . «mnefiudfl ■&tcctrrrf• fide m.cjux. n.xxufefipfvm defende ^ J*J o n eftmim lufium utcjuiflf ctreuf oc cKmfiienie7jf}gm -num Tpnujnftru tudtcef– Cr incircu afof-u defenfpf fvmif.-cucuftf circii cifuffulr ci jtu «fionif e£* eu*f% |
Table 3. Comparison of FineReader OCR of Weissenburg 48 with Manuscript Text (Sample).
St. Gall 561, page 3 | FineReader OCR Extracted Text |
[C]um ueniss&t paulus ad roma.’ conuenerunt a deum oms iudei dicentes; nostram fide In qua natus es.’ ipsam defende. non est enim iustum ut cu sis hebreus. & ex hebreis ueniens. gentiu te magistru iudices. & incircucisoru defensor factus. tu cum sis circucisus fidem euacues circucisionis; |
uenify^itkwtuf*
/ccimiufti^rwnr dicr^ref ■ y, floftrimf.de‘ ! rs i| fl onsft Cnjmtufrutmutui nebrruf ■ <*&.” reif uemcfif – p^ica. ci(of*d defbiforfacwf‘ iMCurnfifcirui •>, Ttde-m euAcutfCiridaft otufy |
Table 4. Comparison of FineReader OCR of St. Gall 561 with Manuscript Text (Sample).
As parallels in italicized bold letters indicate, the OCR texts are not wholly incomparable with the manuscript texts. While some letters are correctly recognized, others show misreadings common in OCR of older documents: long s, r, and f are frequently mistaken for one another; j is frequently mistaken for i (and vice versa); and minims are often confused when letters like i, j, m, n, r, and u sit adjacent to one another. As with the OCR of the Lipsius-Bonnet edition, there are also confusions between c and e as well as l and I/i. In an instance like Weissenburg 48, line 4, the last letters of UENIS are rendered as N‘15, much as we might see in Internet-based leetspeak (“l33t” or “1337”). Still, the amount of post-processing work required (even to identify these parallels) is not feasible compared to human transcription.
Using OCR on the edited, black and white images of the manuscripts yielded no substantially better results. This is consistent with the findings of one previous study, which found that editing images of older documents does not improve accuracy rates and that RGB images in fact maximize OCR accuracy.[14] As with the extractions using color images of the manuscripts, parallels can be found in the OCR texts based on the black and white images, but the readings are neither more substantial nor more accurate. Table 5 presents a representative comparison of the first seven lines from the two OCR text extractions from St. Gall 561, page 3, processed with FineReader.
St. Gall 561, page 3, color image | St. Gall 561, page 3, black and white image |
‘ s . . .
uenify^itkwtuf* /ccimiufti^rwnr dicr^ref ■ y, floftrimf.de’ ! rs i| fl onsft Cnjmtufrutmutui nebrruf ■ <*&.” reif uemcfif – p^ica. ci(of*d defbiforfacwf’ iMCurnfifcirui •>, Ttde-m euAcutfCiridaft otufy |
tmnfy&y&iluf xArom4. .conucnrtmm ddtumctnf
Jr^^KlttBbJei dicenfef’7 Upfmim InetuMxuf «Rfr. i de- tf}Jenefr em lufnitreuxeri fkebrruf ■ <*&-‘ uemtfif’ 4ptnd^m4yp{j0f- Utdicef- sUn Ctford dtfbifer ‘t rucumftfarmafufJ* eiuuutf |
Table 5. Comparison of OCR text extractions from St. Gall 561, page 3, from color and black and white images.
For the results of OCR with manuscript images, it is not possible to provide any conclusive accuracy rates, but the sample texts presented demonstrate a baseline. What all of this demonstrates is the need for more robust means of implementing error recognition and correction into the OCR process.
Future Directions
Future possibilities therefore include ways to optimize OCR by incorporating more efficient means of recognizing and correcting errors. These may be implemented with both pre-processing language identification and training data as well as post-processing error correction. Research on OCR with classical languages and texts (Latin and Greek) already exists, as do methods, scripts, and training data. Notably, scholars associated with the Perseus Project and the Duke Collaboratory for Classics Computing like David Bamman, Gregory Crane, David Smith, and Ryan Baumann are already conducting this type of work on editions of classical texts.[15] It may be possible, then, to incorporate their methods and tools for working with manuscripts. For these pursuits, the commercial OCR engines used in this examination become a less useful option, although FineReader does allow for some amount of training.[16] The customizable features of the Tesseract OCR Engine, however, offer more potential.[17]
Post-processing correction of OCR is another way to establish better data from extractions. As Ted Underwood has discussed, “OCR correction becomes much more reliable when the program is given statistical information about the language, and errors, to be expected in a given domain.”[18] For this reason we need to develop customized dictionary-based data and spellchecker scripts to run on extracted texts for correction. Underwood and others working with eighteenth- and nineteenth-century digitized books have developed these types of tools for English, but further work like this on medieval languages is necessary. Still others like Laura Turner O’Hara and Jon Crump have also provided helpful tutorials for wrangling post-processed OCR data, for which there are surely further avenues to follow.[19]
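As a first step in that direction, a minimal dictionary-based corrector might look like the following—a sketch under stated assumptions: latin_wordlist.txt is a hypothetical lexicon file (compiled, say, from edited texts), and difflib’s generic string matching stands in for a properly weighted edit-distance model.

```python
import difflib

with open("latin_wordlist.txt", encoding="utf-8") as f:  # hypothetical lexicon file
    LEXICON = f.read().split()

def correct_token(token: str) -> str:
    """Return the closest in-lexicon form for an out-of-lexicon token."""
    if token.lower() in LEXICON:
        return token
    matches = difflib.get_close_matches(token.lower(), LEXICON, n=1, cutoff=0.8)
    return matches[0] if matches else token

# e.g. correct_token("ludaei") -> "iudaei", assuming "iudaei" is in the lexicon
```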
These models provide some directions, but Latin and Greek are only two of many languages used in the medieval period. In my own field of research, Old English presents a set of issues similar to Latin but idiosyncratic enough to need specialized methods and data. Especially difficult to reconcile with medieval manuscripts and languages is spelling variation in a time before language standardization (spurred on largely by print, as well as many other factors, from the early modern period onward). While dictionary forms can contribute to spellchecker corrections, there are also avenues for accounting for scribal idiosyncrasies like dialects, spelling variations, and personal preferences, as well as for distinguishing them from mistakes that might need editorial emendation rather than correction in the OCR process.
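One small piece of that puzzle is folding predictable scribal variation onto a single lookup form before any dictionary check. The sketch below covers only the variants visible in the excerpts above; real coverage would need scribe- and dialect-specific rules and data.

```python
def normalize_latin(token: str) -> str:
    """Fold common scribal spelling variants onto one lookup form."""
    t = token.lower()
    t = t.replace("v", "u").replace("j", "i")    # i/j and u/v not distinguished
    t = t.replace("ȩ", "ae").replace("ę", "ae")  # e-caudata written for ae
    return t

# e.g. normalize_latin("iudȩi") -> "iudaei"
```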
Finally, all of this points to collaboration, the subject with which I conclude. Since others have already created some solutions, or steps toward solutions, then there is clearly opportunity for bringing together interested scholars to tackle OCR and medieval manuscripts. If you are reading this (and if you’ve made it to the end of this long post), I welcome your feedback, help, and partnership. Please feel free to comment, share, and contact me. I hope that this is just the beginning of work that could push manuscript studies forward.
[1] Some studies of capturing text from medieval manuscripts (not always with OCR) stand out: see, for example, Jaety Edwards, et al., “Making Latin Manuscripts Searchable using gHMM’s,” Advances in Neural Information Processing Systems 17 (2004), 385-392; discussion in Frederico Boschetti et al., “Improving OCR Accuracy for Classical Critical Editions,” Research and Advanced Technology for Digital Libraries, ed. Maristella Agosti, et al., Lecture Notes in Computer Science 5714 (Heidelberg: Springer, 2009), 156-67; and Yann Leydier, et al., “Learning-Free Text-Image Alignment for Medieval Manuscripts,” Proceedings: 14th International Conference on Frontiers in Handwriting Recognition (Los Alamitos, CA: IEEE Computer Society, 2014), 363-68, with references to previous studies there.
[2] Other technical methods for reading historical documents also exist, such as Handwritten Text Recognition; see studies already cited, as well as Joan Andreu Sánchez, “Handwritten Text Recognition for Historical Documents in the Transcriptorium Project,” Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (New York: Association for Computing Machinery, 2014), 111-17, with further references there.
[3] Acta apostolorum apocrypha, ed. Richard Adalbert Lipsius and Max Bonnet, 2 vols. (Leipzig: Hermann Mendelssohn, 1891-1903), 1:119-77.
[4] I already owned Acrobat, so I only needed to purchase FineReader, limiting the necessity of funding for this study.
[5] See “OCR Software Review: Reviews and Comparison,” 2015, TopTenReviews, http://ocr-software-review.toptenreviews.com/.
[6] See Ray Smith, “An Overview of the Tesseract OCR Engine,” Proceedings: Ninth International Conference on Document Analysis and Recognition (Los Alamitos, CA: IEEE Computer Society, 2007), 629-633, available at http://research.google.com/pubs/pub33418.html.
[7] See ibid.; and Marcin Heliński, Miłosz Kmieciak, and Tomasz Parkoła, “Report on the Comparison of Tesseract and ABBYY FineReader OCR Engines,” IMPACT: Improving Access to Text (2012), at http://lib.psnc.pl/dlibra/docmetadata?id=358.
[8] Descriptions and digital facsimiles at Wolfenbütteler Digitale Bibliothek, Herzog August Bibliothek Wolfenbüttel, http://diglib.hab.de/mss/48-weiss/start.htm; and e-codices: Virtual Manuscript Library of Switzerland, http://www.e-codices.unifr.ch/en/description/csg/0561. For both manuscripts, digital images were used from these repositories, by permission under Creative Commons licenses.
[9] “Data carpentry is a skilled, hands-on craft which will form a major part of data science in the future,” September 1, 2014, The Impact Blog, The London School of Economics and Political Science, http://blogs.lse.ac.uk/impactofsocialsciences/2014/09/01/data-carpentry-skilled-craft-data-science/.
[10] For OCR extracted text from the Lipsius-Bonnet edition, each line of extracted text comprises one chapter of the source, based on divisions established by Lipsius and Bonnet (e.g. line 1 corresponds to chapter 1, line 2 to chapter 2, etc.).
[11] Parenthetical citations refer to the edition in Acta apostolorum apocrypha, ed. Lipsius and Bonnet, 1:119-77, by page and line numbers.
[12] Deciding whether Optical Character Recognition Is Feasible (London: King’s Digital Consultancy Services, 2004), at http://www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf.
[13] For readings from both manuscripts, I provide diplomatic editions; abbreviations are not expanded, and punctuation and capitalization are given as in the manuscripts. To show parallels, the extracted texts have been slightly relineated; see the plain text files for comparison.
[14] Jon M. Booth and Jeremy Gelb, “Optimizing OCR Accuracy on Older Documents: A Study of Scan Mode, File Enhancement, and Software Products,” June 2006 (rev.), Government Printing Office, at http://www.gpo.gov/pdfs/fdsys-info/documents/WhitePaper-OptimizingOCRAccuracy.pdf.
[15] See, for example, David Bamman, “11K Latin Texts,” http://www.cs.cmu.edu/~dbamman/latin.html, for a corpus as well as further references; Ryan Baumann, “Command-Line OCR with Tesseract on Mac OS X,” November 13, 2014, ryanfb.github.io, https://ryanfb.github.io/etc/2014/11/13/command_line_ocr_on_mac_os_x.html; and idem, “Automatic evaluation of OCR quality,” March 16, 2015, ryanfb.github.io, https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html, with further links to his work there.
[16] See Heliński, Kmieciak, and Parkoła, “Report on the Comparison of Tesseract and ABBYY FineReader OCR Engines.”
[17] See, for example, Ray Smith, Daria Antonova, and Dar-Shyang Lee, “Adapting the Tesseract Open Source OCR Engine for Multilingual OCR,” Proceedings of the International Workshop on Multilingual OCR (New York: Association for Computing Machinery, 2009), 1:1-8.
[18] “The challenges of digital work on early-19c collections,” October 7, 2011, The Stone and the Shell, http://tedunderwood.com/2011/10/07/the-challenges-of-digital-work-on-early-19c-collections/.
[19] Laura Turner O’Hara, “Cleaning OCR’d text with Regular Expressions,” May 22, 2013, The Programming Historian, http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions; and Jon Crump, “Generating an Ordered Data Set from an OCR Text File,” November 25, 2014, The Programming Historian, http://programminghistorian.org/lessons/generating-an-ordered-data-set-from-an-OCR-text-file.
