Corpus Corporum
An Overview of the Current Development
DOI: https://doi.org/10.60923/issn.2532-8816/23668
Keywords: Computational linguistics, Latin literature, Database
Abstract
The Corpus Corporum project, hosted by the University of Zurich, is the largest structured digital collection of Latin texts. The texts span from antiquity to the twentieth century and currently total approximately 226 million words across thirty corpora. Conceived as an open-access research infrastructure, it provides philologists, linguists, historians, and scholars of Latin with a unified environment for reading, searching, and analysing texts encoded in standardised TEI XML. Important Latin dictionaries are integrated into the site. The platform is built on open-source technologies including BaseX, Sphinx, and TreeTagger. It distinguishes between the corpus, author, work, and edition levels, and integrates persistent identifiers (VIAF, Wikidata) and external resources such as geschichtsquellen.de. This article discusses recent developments, in particular two major new analytical tools. The Text Reuse module enables configurable intertextual analysis based on k-skip-n-gram algorithms, while the Metrical Analysis module automatically identifies Latin poetic metres. These tools allow large-scale, reproducible investigations of textual transmission and poetic structure. An example concerning the sources of Isidore of Seville’s Etymologiae is briefly discussed. Planned developments include AI-assisted translation, semantic indexing, and synonym-based search, which would strengthen the platform as a comprehensive, interoperable resource for digital Latin philology and the broader field of computational humanities.
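As an illustration of the k-skip-n-gram idea behind text-reuse detection, the following is a minimal sketch and not the Text Reuse module's actual implementation. The function names `skip_ngrams` and `shared_grams`, the whitespace tokenisation, and the example phrases are assumptions made for this example; a production system would more plausibly compare lemmatised tokens. A k-skip-n-gram is taken here to be an ordered selection of n tokens in which at most k intervening tokens are skipped in total.

```python
from itertools import combinations


def skip_ngrams(tokens, n=2, k=1):
    """Return the set of k-skip-n-grams of a token list.

    Each gram keeps the original token order and skips at most k tokens
    in total, so it spans no more than n + k consecutive positions.
    """
    grams = set()
    for start in range(len(tokens)):
        # The first token is fixed; the other n - 1 tokens are chosen,
        # in order, from the next n - 1 + k positions.
        window = tokens[start + 1 : start + n + k]
        for rest in combinations(window, n - 1):
            grams.add((tokens[start],) + rest)
    return grams


def shared_grams(text_a, text_b, n=2, k=1):
    """Hypothetical helper: grams common to two whitespace-tokenised texts."""
    a = skip_ngrams(text_a.lower().split(), n, k)
    b = skip_ngrams(text_b.lower().split(), n, k)
    return a & b


if __name__ == "__main__":
    # The second phrase omits a word, yet one 1-skip-bigram still matches.
    print(shared_grams("arma virumque cano", "arma cano"))
```

Raising k tolerates more inserted or omitted words between matching tokens, at the cost of more chance matches; raising n makes each match more specific. This trade-off is what makes such a comparison configurable.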
License
Copyright (c) 2026 Philipp Roelli

This work is licensed under a Creative Commons Attribution 4.0 International License.