Corpus Corporum: An Overview of the Current Development

Philipp Roelli

doi:10.60923/issn.2532-8816/23668

Authors

Philipp Roelli, University of Zurich

DOI:

https://doi.org/10.60923/issn.2532-8816/23668

Keywords:

Computational linguistics, Latin literature, Database

Abstract

The Corpus Corporum project, hosted by the University of Zurich, is the largest structured digital collection of Latin texts. The texts range from antiquity to the twentieth century and currently total approximately 226 million words across thirty corpora. Conceived as an open-access research infrastructure, it provides philologists, linguists, historians, and other scholars of Latin with a unified environment for reading, searching, and analysing texts encoded in standardised TEI XML. Important Latin dictionaries are integrated into the site. The platform is built on open-source technologies including BaseX, Sphinx, and TreeTagger. It distinguishes between corpus, author, work, and edition levels, and integrates persistent identifiers (VIAF, Wikidata) and external resources such as geschichtsquellen.de. This article discusses recent developments, in particular two major new analytical tools. The Text Reuse module enables configurable intertextual analysis based on k-skip-n-gram algorithms, while the Metrical Analysis module automatically identifies Latin poetic metres. These tools allow large-scale, reproducible investigations of textual transmission and poetic structure. An example concerning the sources of Isidore of Seville’s Etymologiae is briefly discussed. Planned developments include AI-assisted translation, semantic indexing, and synonym-based search, which would strengthen the platform as a comprehensive, interoperable resource for digital Latin philology and the broader field of computational humanities.
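To illustrate the general idea behind k-skip-n-gram matching mentioned above, the following is a minimal sketch and not the Text Reuse module's actual implementation. It assumes one common definition, in which an n-gram may omit up to k tokens in total between its first and last word, and it assumes plain whitespace tokenisation of lower-cased text (a real system would more plausibly compare lemmata). The function names `skipgrams` and `shared_skipgrams` are hypothetical.

```python
from itertools import combinations


def skipgrams(tokens, n, k):
    """Return the set of k-skip-n-grams of a token list.

    Assumed definition: n tokens in their original order, with at most
    k tokens skipped in total between the first and the last token.
    """
    grams = set()
    for i in range(len(tokens) - n + 1):
        # The remaining n-1 tokens must come from the next n-1+k positions,
        # which guarantees that no more than k tokens are skipped overall.
        window = range(i + 1, min(i + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.add((tokens[i],) + tuple(tokens[j] for j in rest))
    return grams


def shared_skipgrams(a, b, n=2, k=1):
    """Skip-grams common to two passages (hypothetical helper).

    Tokenisation by lower-casing and whitespace splitting is an
    assumption made only for this illustration.
    """
    return skipgrams(a.lower().split(), n, k) & skipgrams(b.lower().split(), n, k)


if __name__ == "__main__":
    # With k=1, "arma cano" matches "arma virumque cano" despite the omitted word.
    print(shared_skipgrams("Arma virumque cano", "arma cano", n=2, k=1))
    # With k=0 (ordinary bigrams) the same pair shares nothing.
    print(shared_skipgrams("Arma virumque cano", "arma cano", n=2, k=0))
```

Raising k makes the comparison more tolerant of inserted or omitted words at the cost of more chance matches, which is presumably the kind of trade-off a configurable text reuse search exposes to the user.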

References

[1] Jacobsen, Peter Christian, and Peter Orth. 2002. Materialien zu einem Lexikon der irregulären lateinischen Prosodie. Erlangen. https://kups.ub.uni-koeln.de/62924.

[2] Roelli, Philipp, and Jan Ctibor. 2022. "A New Version of Corpus Corporum, the Latin Full-Text Database and Tool". Archivum Latinitatis Medii Aevi (ALMA): Bulletin Du Cange 80 (3): 251–266. https://doi.org/10.5167/uzh-265929.

[3] Roelli, Philipp. 2025. "An Introduction and a Status-Report on the Latin Database Corpus Corporum". Indo-European Linguistics and Classical Philology 29 (2): 359–374. https://doi.org/10.5167/uzh-279205.

[4] Verkerk, Philippe. 2022. "Elaboration of a Practical Lemmatiser for Latin using Artificial Intelligence". Archivum Latinitatis Medii Aevi (ALMA): Bulletin Du Cange 80 (3): 267–294. https://hal.science/hal-04721577v1.
