Data Diversity in handwritten text recognition: challenge or opportunity?

1. Introduction

In this paper, we present approaches to handling diversity in larger collections of training data for text acquisition pipelines, specifically handwritten text recognition (HTR) for medieval manuscripts in Latin and French. Present throughout medieval Europe, Latin is one of the most used written languages of the time on this continent, if not the most used, while French developed from a relatively early date (around the 12th century, judging from preserved manuscripts) a vernacular production that soon became one of the most prominent in Western Europe, influencing the written culture of its neighbours from its central position. Combined, they provide a case study whose diversity and general scope should, we hope, yield results with broader applicability, even beyond medieval Western manuscripts.

Heterogeneity or diversity in the collections can result from intrinsic features (e.g. linguistic, palaeographic, diachronic variation in the sources), but also from extrinsic features (aim and provenance of transcriptions, idiosyncrasies of transcribers, etc.). We propose to approach both types of diversity by reusing several open datasets from various research projects in diverse fields and involving many collaborators. We add a double focus, linguistic (Latin vs. French manuscripts) and graphic (abbreviated vs. normalised transcriptions). We hope to overcome, to some extent, the issue of linguistic diversity and to propose a common, modular pipeline for languages that are related but differ in their inner structure and declension mechanisms.

While recent studies focus on “hyperdiplomatic” digital editions to study the production of specific items, natural language processing and text mining are commonly based on a normalised text. Instead of aiming to define a single universal translinguistic transcription standard to merge all existing standards (a utopian endeavour, and perhaps not even a desirable one), and instead of designing a unified pipeline supported by dedicated libraries (e.g. image > hyperdiplomatic > normalised > lemmatised+POS-tagged > critical text) to constrain all existing editions, we applied a more modular approach: we reuse and pool datasets to train multiple models and design paths better suited to the variety of goals encountered in medieval studies.

In this attempt, we will strive to answer more specifically the following questions:

  1. To what extent can we (and should we) pool HTR training material across preexisting datasets and even related languages, and is it worth the effort?
  2. Do approaches that separate image-to-text prediction from further linguistic normalisation (abbreviation expansion, for instance) perform better for that goal than straightforward “image to normalised text” approaches?

2. Diversity in our corpus

2.1. Extrinsic diversity: variation in data production

The most obvious source of diversity is artificial, in the sense that it results from the production of the data (and particularly from transcription choices) and not from the sources themselves.

For this research, three macro-datasets, themselves mostly aggregates of smaller micro-datasets, have been used: one French and two Latin.

The French dataset is Cremma-medieval, composed of 17431 lines from eleven Old French manuscripts written between the 13th and 14th centuries (Table 1). It is made from pre-existing transcriptions, and sample size varies greatly from one source manuscript to another. A graphemic transcription method (see note 1) has been chosen to maintain a many-to-one mapping between signs in the source and the transcription (abbreviations and their expansions are both kept, u/v or i/j are not dissimilated), but allographs are normalised (e.g., round and long s are both transcribed s). Finally, spaces are not homogeneously represented in the ground truth text annotation: some transcribers reproduce the manuscript spacing while others use lexical spaces. It must be stressed that spaces are the most important source of errors in medieval HTR models (see for instance the model Bicerin, where spaces represent 33.9% of errors; Pinche 2021). In this cremma-medieval macro-dataset, several transcriptions by different transcribers, coming from different projects, have been collected.

This diversity is also very present in the Oriflamms macro-dataset, which contains 120111 lines from no fewer than 779 manuscripts (Table 1). This dataset has been assembled over several different projects and a substantial interval of time, and is a mix of aligned preexisting normalised editions (without abbreviations) and graphemic transcriptions (including abbreviations and their expansion). It is composed of French, Latin and bilingual texts.

Table 1: Composition of the cremma-medieval, Oriflamms and st-victor macro-datasets [For this abstract, only corpora in bold have been used]

| Macro-dataset | Corpus | Editors | Manuscripts | Pages | Lines |
| --- | --- | --- | --- | --- | --- |
| Cremma-medieval | Otinel | Camps | 2 | 75 | 13568 |
| Cremma-medieval | Wauchier | Pinche | 1 | 49 | 6148 |
| Cremma-medieval | Maritem | Mariotti | 1 | 18 | 1026 |
| Cremma-medieval | CremmaLab | Pinche et al. | 7 | 55 | 13568 |
| Cremma-medieval | Total | | 11 | 149 | 17431 |
| Oriflamms | Reg.chancell. Poitou | Guérin | 200 | 1217 | 30015 |
| Oriflamms | Reg.chancell. Paris | Viard | 2 | 29 | 474 |
| Oriflamms | Morchesne | Guyotjeannin et al. | 1 | 189 | 10394 |
| Oriflamms | Cartulaire de Nesle | Hélary | 1 | 117 | 3899 |
| Oriflamms | Chartes Fontenay | Stutzmann | 104 | 104 | 1384 |
| Oriflamms | Psautiers | Oriflamms | 27 | 48 | 5793 |
| Oriflamms | PsautierIMS | Stutzmann | 48 | 132 | 3145 |
| Oriflamms | MSS dat. lat. | Oriflamms / ecmen | 101 | 101 | 2299 |
| Oriflamms | Queste del saint Graal | Marchello-Nizia, Lavrentiev | 1 | 130 | 10725 |
| Oriflamms | BnF fonds fr. | ecmen | 159 | 189 | 13510 |
| Oriflamms | Mss dat. fr. | ecmen | 45 | 55 | 3355 |
| Oriflamms | Album XIIIe. | Careri, et al. + ecmen | 52 | 52 | 1992 |
| Oriflamms | Légende dorée | irht + ecmen | 18 | 679 | 31742 |
| Oriflamms | Pèlerinage | opvs + ecmen | 20 | 56 | 1384 |
| Oriflamms | Total | | 779 | 3098 | 120111 |
| Saint-Victor | Saint-Victor | Vernet | 2 | 54 | 12596 |

The last macro-dataset, Saint-Victor, is the most homogeneous. It is composed of transcriptions from two Victorine manuscripts, BnF latin 14588 and BnF latin 14525, written by no fewer than twelve scribes at the end of the 12th century and in the first part of the 13th century (Table 1). Both manuscripts have the same type of writing. The dataset was created during a master’s thesis and is divided into two sub-corpora. The first, transcribed without abbreviations and using lexical spaces, is the larger of the two, with 10736 lines. The second consists of a small part of the first (1860 lines), transcribed with abbreviations.

Early tests have shown tremendous variation in the choice of signs used to transcribe medieval graphemes, in particular abbreviations, including characters both inside and outside MUFI (Medieval Unicode Font Initiative). For example, the common abbreviative marker has been transcribed alternatively as U+0303 COMBINING TILDE, U+0304 COMBINING MACRON, U+0305 COMBINING OVERLINE, U+F00A COMBINING HIGH MACRON WITH FIXED HEIGHT (PART-WIDTH), and even, in composition, U+1EBD LATIN SMALL LETTER E WITH TILDE, U+0113 LATIN SMALL LETTER E WITH MACRON, etc. Even when using MUFI, different types of Tironian et or p flourish can be used. To facilitate machine learning, a conversion table was used to apply a first level of normalisation and to reduce the 262 preexisting character classes to around 30 (Clérice and Pinche 2021).
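As a rough illustration of such a conversion table, the sketch below decomposes precomposed letters and then maps several variants of the abbreviation mark to a single code point. It is a minimal sketch and not the Choco-Mufin implementation: the `EQUIVALENCES` table, the `normalise` function and the choice of U+0303 as the target sign are assumptions made for this example only.

```python
import unicodedata

# Hypothetical equivalence table (an assumption, not the project's actual table):
# every variant of the abbreviation mark is mapped to U+0303 COMBINING TILDE.
EQUIVALENCES = {
    "\u0304": "\u0303",  # COMBINING MACRON
    "\u0305": "\u0303",  # COMBINING OVERLINE
    "\uf00a": "\u0303",  # MUFI private-use COMBINING HIGH MACRON WITH FIXED HEIGHT
}
TABLE = str.maketrans(EQUIVALENCES)


def normalise(text: str) -> str:
    """Decompose precomposed letters (NFKD), then unify abbreviation marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.translate(TABLE)


# U+1EBD (e with tilde) and U+0113 (e with macron) end up as the same sequence.
same = normalise("\u1ebd") == normalise("\u0113")
```

Decomposing first matters here: a precomposed letter such as U+0113 only exposes its macron to the table once it has been split into a base letter and a combining mark.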

2.2. Intrinsic diversity: variation in language, script and scribal practice

Diversity is also due to linguistic differences inside the corpus, with a main distinction between Latin and French texts, the latter in a variety of regional scriptae, including Anglo-French, Eastern (Lorrain) and Picard, and also to diachronic variation, from the 12th to the 14th century.

Variety also lies in the writing styles. Copyists used different script types according to their place and date of activity (e.g. praegothica, textualis, cursiva, semitextualis; see note 2). Some script types were used preferentially according to the genre of the text being copied (e.g. liturgy, literature, diplomatic and pragmatic texts). Conversely, textual genres could influence some specific scribal practices (layout, abbreviations, etc.).

3. Pipeline description

Our aim is to evaluate the impact of data heterogeneity on building models for Latin and medieval French. Our corpus contains two levels of heterogeneity: documents are written in one of two languages (each with some internal diatopic variation; see note 3), and transcriptions follow a variety of specifications. Each sentence of our corpus includes both abbreviated and expanded forms of words, thanks to the original encoding of the editions, which followed the Guidelines of the Text Encoding Initiative and used a combination of <choice>, <abbr> and <expan> (TEI Consortium 2022).
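To make the role of this encoding concrete, the sketch below derives an abbreviated and an expanded reading from a single TEI fragment. It is only an illustration under stated assumptions: the sample fragment, the `<ab>` wrapper and the `render` helper are hypothetical and do not reproduce the editions' actual markup or the project's extraction code.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment, built from words of the Saint-Victor extract.
SAMPLE = (
    '<ab xmlns="http://www.tei-c.org/ns/1.0">sicut scriptum '
    "<choice><abbr>e\u0303</abbr><expan>est</expan></choice> "
    "<choice><abbr>gaudiu\u0303</abbr><expan>gaudium</expan></choice></ab>"
)


def render(elem: ET.Element, keep: str) -> str:
    """Flatten an element to text, keeping only `keep` ('abbr' or 'expan')
    among the children of each <choice>."""
    name = elem.tag.replace(TEI_NS, "")
    if name == "choice":
        # Ignore whitespace inside <choice>; render only the selected branch.
        return "".join(
            render(child, keep)
            for child in elem
            if child.tag.replace(TEI_NS, "") == keep
        )
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, keep))
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(SAMPLE)
abbreviated = render(line, "abbr")
expanded = render(line, "expan")
```

Keeping both readings in one source file is what allows the same line image to be paired with either an abbreviated or an expanded ground truth.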

The corpora had previously undergone varying and sometimes contradictory types of normalisation (combined or decomposed characters, etc.). To smooth discrepancies between transcriptions, the normalisation step follows this pipeline:

  1. lowercasing;
  2. normalising Unicode (NFKD);
  3. making substitutions based on an equivalence table and the use of “chocoMUFIn” (Clérice and Pinche 2021). In particular,

We have divided our corpus into four training datasets to perform our evaluations and to assess the potential benefits of fine-tuning in such an approach, on Latin or French texts and on abbreviated or expanded texts. The distribution of each corpus is described in Table 2.

Table 2: Distribution of corpora into the four main datasets

| Language | Corpus | Abbr (lines) | Exp (lines) |
| --- | --- | --- | --- |
| Latin | Fontenay | 1,365 | 1,365 |
| Latin | MsDat | 2,217 | 2,217 |
| Latin | PsautierIMS | 3,086 | 3,086 |
| Latin | StVictorLite (Abbr) / StVictorFull (Exp) | 1,860 | 10,736 |
| Latin | TOTAL | 8,528 | 17,404 |
| French | ecmen | 9,831 | 9,831 |
| French | otinel bodmer | 1,977 | 1,977 |
| French | otinel vaticane | 1,758 | 1,758 |
| French | wauchier | 4,582 | 4,580 |
| French | Pelerinage | 1,384 | 1,384 |
| French | TOTAL | 19,532 | 19,530 |

Based on the experiments of Camps, Vidal-Gorène, and Vernet (2021) on abbreviated manuscripts, two approaches have been considered. Training on abbreviated data was carried out with Kraken (Kiessling 2019; Kiessling et al. 2019), an OCR and HTR system previously used with success on a wide range of manuscripts (Camps, Clérice, and Pinche 2021; Scheithauer et al. 2021; Thompson and others 2021), and training on expanded data with Calfa, an OCR and HTR system originally developed for highly abbreviated Oriental manuscripts (Vidal-Gorène et al. 2021). Both architectures use an encoder-decoder approach, the first trained at the character level, the second at the word level. We keep the hyperparameters defined previously (Camps, Vidal-Gorène, and Vernet 2021), but use a deeper architecture for the first one, which has achieved high recognition rates on CREMMA (Pinche and Clérice 2021).

4. Preliminary results and discussion

Figure 1: Matrix of the cross evaluation of models

Figure 2: Distribution of character error rates per page in the test sets; histograms (top) and boxplots (with outliers above 25% removed)

Current results show, perhaps counterintuitively, a better performance for expanded models, at least for Latin (fig. 1 and table 3), while, for French, the abbreviated models seem to perform slightly better (fig. 1). Perhaps more importantly, they show substantial variation in the distribution of the character error rates (CER) per page inside each test set and between test sets (fig. 2). Apart from a few strong outliers on the Latin corpora, with CER between 40 and 90% (due to issues in the test material), the situation varies mostly according to the origin of the data. For some subcorpora, the CERs display very limited variation, with a very small interquartile range (CREMMA corpora for instance), while the results obtained for corpora such as ECMEN could reflect the larger variety of material they contain.
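For reference, the character error rate is commonly computed as the Levenshtein edit distance between ground truth and prediction, divided by the length of the ground truth. The sketch below is a minimal implementation under that assumption; the evaluation tools behind the figures may differ in details such as normalisation. The function names are hypothetical, and the two strings are the first line of the expanded ground truth and of the expanded prediction in Table 3.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion from the reference
                    curr[j - 1] + 1,         # insertion into the reference
                    prev[j - 1] + (r != h),  # substitution, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


GT = "one peccatorum sicut scriptum est gaudium est"
PRED = "one peccatorum sicut scriptum est saudium est"
line_cer = cer(GT, PRED)
```

On this line the only error is the substitution of s for g in gaudium, that is, one edit over 45 characters. Per-page rates such as those summarised in the figures would aggregate the same measure over all lines of a page.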

Nevertheless, among various observations, the following cases can be noted. On the one hand, on LAT-Exp predictions, the efficiency of the model is especially linked to the script used. Thus, the particularly angular textualis quadrata, widely used in PsautierIMS and in some manuscripts of MSS dat. lat., is poorly recognised. Many errors are related to sequences of stems (ii / u / n / etc.). In the most extreme cases, the model has significant difficulty differentiating c and e. For these scripts, tildes are seldom recognised and abbreviations are therefore badly expanded. Meanwhile, in the diplomatic texts of Fontenay, although the form of the letters is often sophisticated and flourished, especially in the first line of the charters, the model is able to recognise tildes and abbreviations. We also observe that the quality of the ink greatly influences the efficiency of the model. On the other hand, this multi-level heterogeneity seems to affect the benefits we could expect from fine-tuning. So far, we do not notice any gain in recognition from fine-tuning abbreviated models with expanded data. Nevertheless, we can already observe that cross-lingual fine-tuned models achieve similar recognition rates, even though abbreviations differ widely between these languages.

Table 3: Facsimile, abbreviated and expanded ground truth (GT), and abbreviated and expanded predictions for an extract from the Latin corpus of St Victor (BnF, Latin 14525, fol. 41va).

| Line | GT Abbr | GT Expan | Prediction Abbr | Prediction Expan |
| --- | --- | --- | --- | --- |
| 1 | one pecc̃oꝝ sicut scͥptum ẽ gaudiũ ẽ | one peccatorum sicut scriptum est gaudium est | one pec̃c̃oꝝꝝ sicut scͥptum ẽ sudiũ | one peccatorum sicut scriptum est saudium est |
| 2 | angl̃is d̃i suꝑ uno pecc̃ore penitentiã | angelis dei super uno peccatore penitentiam | angl̃is dĩ suꝑ uno pecc̃ore penitentiã | angelis dei super uno peccatore penitentiam |
| 3 | agente uñ ⁊ signis ext̃iorib penitẽtie | agente unde et signis exterioribus penitentie | agente uñ ⁊ signis ext̃iorib penitẽtie | agente unde et signis exterioribus penitentie |
| 4 | plurimũ delectant᷑ ut pote usu cilicii | plurimum delectantur ut pote usu cilicii | plurimũ delectant᷑ ut pote usu cilicii | plurimum delectantur ut pote usu cilicii |
| 5 | ꝓfundis suspiriis deuotis or̃onib᷒ cre | profundis suspiriis deuotis orationibus cre | ꝓfundis suspinis deuotis or̃onib᷒ cre | profundis suspiriis deuotis orationibus cre |

All of this deserves further investigation, particularly to evaluate the impact of training set size versus training set diversity, and to measure the robustness of models trained with and applied to mixed-language corpora. Moreover, further normalisation of the training sets and a direct inspection of outliers could improve both performance and the intelligibility of the results.

Appendix A

Bibliography
  1. Camps, Jean-Baptiste, Thibault Clérice, and Ariane Pinche. 2021. “Noisy Medieval Data, from Digitized Manuscript to Stylometric Analysis: Evaluating Paul Meyer’s Hagiographic Hypothesis.” Digital Scholarship in the Humanities 36 (Supplement_2): ii49–ii71. https://doi.org/10.1093/llc/fqab033.
  2. Camps, Jean-Baptiste, Chahan Vidal-Gorène, and Marguerite Vernet. 2021. “Handling Heavily Abbreviated Manuscripts: HTR Engines vs Text Normalisation Approaches.” In International Conference on Document Analysis and Recognition, 306–16. Springer.
  3. Clérice, Thibault, and Ariane Pinche. 2021. “Choco-Mufin, a Tool for Controlling Characters Used in OCR and HTR Projects.” https://doi.org/10.5281/zenodo.5356154.
  4. Derolez, Albert. 2003. The Palaeography of Gothic Manuscript Books from the Twelfth to the Early Sixteenth Century. Cambridge Studies in Palaeography and Codicology 9. Cambridge: Cambridge University Press.
  5. Kiessling, Benjamin. 2019. “Kraken - an Universal Text Recognizer for the Humanities.” In Proceedings of the DH2019 Conference - Digital Humanities: Complexities, Utrecht, the Netherlands, 9–12 July 2019. Utrecht: CLARIAH. https://dev.clariah.nl/files/dh2019/boa/0673.html.
  6. Kiessling, Benjamin, Robin Tissot, Peter Stokes, and Daniel Stökl Ben Ezra. 2019. “eScriptorium: An Open Source Platform for Historical Document Analysis.” In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 2:19–19. IEEE.
  7. Pinche, Ariane. 2021. “CREMMA Medieval, an Old French Dataset for HTR and Segmentation.” https://doi.org/10.5281/zenodo.5235186.
  8. Pinche, Ariane, and Thibault Clérice. 2021. “HTR-United/Cremma-Medieval: 1.0.1 Bicerin (DOI).” Zenodo. https://doi.org/10.5281/zenodo.5235186.
  9. Scheithauer, Hugo, Alix Chagué, Aurélia Rostaing, Lucas Terriel, Laurent Romary, Marie-Françoise Limon-Bonnet, Benjamin Davy, et al. 2021. “Production d’un modèle affiné de reconnaissance d’écriture manuscrite avec eScriptorium et évaluation de ses performances.” In Les Futurs Fantastiques - 3e Conférence Internationale sur l’Intelligence Artificielle appliquée aux Bibliothèques, Archives et Musées, AI4LAM.
  10. Stutzmann, Dominique, Christopher Tensmeyer, and Vincent Christlein. 2020. “Writer Identification and Script Classification. Two Tasks for a Common Understanding of Cultural Heritage.” Manuscript Cultures 15: 11–24. https://www.csmc.uni-hamburg.de/publications/mc/files/articles/mc15-02-stutzmann.pdf.
  11. Stutzmann, Dominique. 2011. “Paléographie statistique pour décrire, identifier, dater... Normaliser pour coopérer et aller plus loin ?” In Kodikologie und Paläographie im digitalen Zeitalter 2 - Codicology and Palaeography in the Digital Age 2, edited by Franz Fischer, Christiane Fritze, and Georg Vogeler, 3:247–77. Norderstedt: Books on Demand (BoD).
  12. TEI Consortium. 2022. “3.6.5 Abbreviations and Their Expansions.” In TEI P5: Guidelines for Electronic Text Encoding and Interchange, V4.4.0. Text Encoding Initiative Consortium. https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CONAAB.
  13. Thompson, Walker, and others. 2021. “Using Handwritten Text Recognition (HTR) Tools to Transcribe Historical Multilingual Lexica.” Scripta & E-Scripta 21: 217–31.
  14. Vidal-Gorène, Chahan, Boris Dupin, Aliénor Decours-Perez, and Thomas Riccioli. 2021. “A Modular and Automated Annotation Platform for Handwritings: Evaluation on Under-Resourced Languages.” In International Conference on Document Analysis and Recognition, 507–22. Springer.
Notes
  1. We use the terminology graphemic (graphématique) and graphetic (allographétique) following Stutzmann (2011).
  2. For classification criteria and the lack of consensus among palaeographers, see Derolez (2003) and Stutzmann, Tensmeyer, and Christlein (2020).
  3. We excluded documents with mixed contents (i.e., parts in French intertwined with parts in Latin), except for the ECMEN corpus, which only contains small quotations or single words in Latin.
Jean-Baptiste Camps (jean-baptiste.camps_at_chartes.psl.eu), École nationale des chartes | Université PSL, France and Ariane Pinche (ariane.pinche_at_chartes.psl.eu), École nationale des chartes | Université PSL, France and Dominique Stutzmann (dominique.stutzmann_at_irht.cnrs.fr), Institut de recherche et d'histoire des textes/CNRS and Marguerite Vernet (marguerite.vernet_at_chartes.psl.eu), École nationale des chartes | Université PSL, France and Chahan Vidal-Gorène (chahan.vidal-gorene_at_chartes.psl.eu), École nationale des chartes | Université PSL, France