The Latin Text Archive. A Platform for Historical Semantics and Text Mining

A long-term Project as Part of the Text Archive Series at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW)

Authors

  • Tim Geelhaar, Goethe-Universität Frankfurt am Main

DOI:

https://doi.org/10.60923/issn.2532-8816/23548

Keywords:

Text Mining, Corpus Building, Historical Semantics, Ancient Latin, Lemmatization, Medieval Latin

Abstract

The Latin Text Archive (LTA) is an online platform that has been hosted by the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) since 2020 (https://LTA.bbaw.de). Its primary objective is to facilitate computer-assisted semantic analysis of Latin texts and corpora spanning various epochs and genres. The LTA collaborates with prominent text providers and related projects in this field. Its core activities center on post-philological editorial text preparation, which is essential for implementing text mining techniques in corpus-based historical semantics. The archive lemmatizes and stores Latin texts, augments them with relevant metadata, and organizes them within thematic or genre-specific corpora. These texts can also be read online and downloaded in various formats. Currently in a beta version, the LTA already offers 12,960 curated texts authored by 1,280 identified individuals, amounting to 54 million words. Furthermore, the LTA provides access to its morphological lexicon, which supports the lemmatization process. Through the 'Latin Universe', users may also access additional texts not yet fully curated. Both texts and corpora are searchable via third-party tools such as 'Voyant-Tools' or through integrated functionalities like the 'Time series query', which allows for diachronic comparison of keywords and lemmas, and 'Diacollo', which analyses co-occurring lemmas over time.
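The idea behind a diachronic comparison of lemmas can be illustrated with a minimal sketch. It is not the LTA's implementation: the function name `lemma_time_series`, the century-sized bins, and the three toy documents are assumptions made up for illustration only. The sketch assumes each text is already lemmatized and dated, and reports how often a lemma occurs per 1,000 tokens in each time bin.

```python
from collections import Counter

def lemma_time_series(docs, lemma, bin_size=100):
    """Relative frequency (per 1,000 tokens) of `lemma` in each time bin.

    `docs` is an iterable of (year, list_of_lemmas) pairs; this input
    shape is a hypothetical stand-in for dated, lemmatized texts.
    """
    hits = Counter()    # occurrences of the lemma per bin
    totals = Counter()  # all tokens per bin, used for normalization
    for year, lemmas in docs:
        start = (year // bin_size) * bin_size  # e.g. 870 -> 800
        totals[start] += len(lemmas)
        hits[start] += sum(1 for item in lemmas if item == lemma)
    return {
        start: 1000 * hits[start] / totals[start]
        for start in sorted(totals)
        if totals[start] > 0
    }

# Invented toy data, not taken from the archive.
docs = [
    (820, ["rex", "dare", "terra", "fidelis"]),
    (870, ["fidelis", "rex", "fidelis", "iurare"]),
    (1150, ["hospitalitas", "monachus", "rex", "dare"]),
]

print(lemma_time_series(docs, "fidelis"))
```

Normalizing by the number of tokens in each bin matters because the volume of surviving text differs greatly between periods, so raw counts alone would mostly reflect corpus size rather than usage.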

Published

2026-05-21

How to Cite

Geelhaar, T. (2026). The Latin Text Archive. A Platform for Historical Semantics and Text Mining: A long-term Project as Part of the Text Archive Series at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW). Umanistica Digitale, 10(23), 31–43. https://doi.org/10.60923/issn.2532-8816/23548