Philo-L1
The Emendatio of Latin Texts as a Denoising Problem
DOI:
https://doi.org/10.60923/issn.2532-8816/23602
Keywords:
digital philology, Large Language Model, Ianus AI, Philo-L1, emendation
Abstract
The emendation of ancient literary texts is one of the most challenging tasks in classical philology. Existing models designed to assist with this task (Latin BERT and Logion) rely on a fill-mask approach that presents significant limitations. This paper introduces Philo-L1, a seq2seq LLM of approximately 297 million parameters based on the T5 architecture, which reframes the emendatio of Latin literary texts as a text generation task with input denoising, alongside Ianus AI, a web platform developed for its use. Philo-L1, obtained by fine-tuning Philo-1-preview (itself the result of fine-tuning PhilTa), was trained on a synthetic dataset of approximately 5 million sentence pairs covering nine classes of textual corruptions: palaeographic and pronunciation errors, errors of divisio, inversion, echo, saut du même au même, errors arising from integration with a signal word, haplographies, and dittographies. The model achieved an exact match accuracy (EMA) of 74.01%, a perplexity of 1.17, and a BLEU score of 94.51. A direct comparison with Latin BERT confirms the validity of the proposed approach (EMA: 77.96% vs 0.50%). Future work will focus on extending the model’s scope and incorporating chain of thought and Explainable AI techniques.
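The denoising setup summarized above can be illustrated with a minimal sketch, assuming a much simpler noise model than the one used to build the actual dataset. It generates (corrupted, clean) sentence pairs for three of the nine corruption classes and computes exact match accuracy as the percentage of outputs identical to their references. All function names are hypothetical, the haplography rule is reduced to dropping one letter of a doubled pair, and the whitespace handling in the metric is an assumption rather than the paper's definition.

```python
import random


def haplography(word: str) -> str:
    """Simplified haplography: drop one letter of the first doubled pair."""
    for i in range(len(word) - 1):
        if word[i] == word[i + 1]:
            return word[:i] + word[i + 1:]
    return word  # no doubled letter: the word is left unchanged


def dittography(word: str, rng: random.Random) -> str:
    """Simplified dittography: write one randomly chosen letter twice."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i] + word[i:]


def inversion(words: list[str], rng: random.Random) -> list[str]:
    """Swap two adjacent words."""
    out = list(words)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def corrupt(sentence: str, rng: random.Random) -> tuple[str, str]:
    """Return a (noisy input, clean target) pair for seq2seq training."""
    words = sentence.split()
    if not words:
        return sentence, sentence
    kind = rng.choice(["haplography", "dittography", "inversion"])
    if kind == "inversion":
        words = inversion(words, rng)
    else:
        j = rng.randrange(len(words))
        if kind == "haplography":
            words[j] = haplography(words[j])
        else:
            words[j] = dittography(words[j], rng)
    return " ".join(words), sentence


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions identical to their reference (after strip)."""
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, clean = corrupt("arma virumque cano Troiae qui primus ab oris", rng)
    print(noisy)
    print(clean)
```

In this framing the model receives the noisy string and is trained to generate the clean one, so a prediction counts as correct only when the whole sentence is restored exactly.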
References
[1] Assael, Yannis, Thea Sommerschield, Alison Cooley, Brendan Shillingford, John Pavlopoulos, Priyanka Suresh, Bailey Herms, et al. 2025. "Contextualizing Ancient Texts with Generative Neural Networks". Nature 645 (8079): 141–147. https://doi.org/10.1038/s41586-025-09292-5.
[2] Assael, Yannis, Thea Sommerschield, Brendan Shillingford, Mahyar Bordbar, John Pavlopoulos, Marita Chatzipanagiotou, Ion Androutsopoulos, Jonathan Prag, and Nando de Freitas. 2022. "Restoring and Attributing Ancient Texts Using Deep Neural Networks". Nature 603 (7900): 280–83. https://doi.org/10.1038/s41586-022-04448-z.
[3] Assael, Yannis, Thea Sommerschield, and Jonathan Prag. 2019. "Restoring ancient text using deep learning: a case study on Greek epigraphy". arXiv preprint arXiv:1910.06262.
[4] Bamman, D., and P. J. Burns. 2020. "Latin BERT: A Contextual Language Model for Classical Philology". arXiv preprint arXiv:2009.10053.
[5] Braccini, Tommaso. 2017. La scienza dei testi antichi. Introduzione alla filologia classica. Le Monnier Università.
[6] Cowen-Breen, Charlie, Creston Brooks, Johannes Haubold, and Barbara Graziosi. 2023. "Logion: Machine Learning for Greek Philology". arXiv preprint arXiv:2305.01099.
[7] Ferrara, Giuseppe. 2025. "Philo-1-preview. Un modello T5-Base per l’emendazione dei testi antichi". In Diversità, Equità e Inclusione: Sfide e Opportunità per l’Informatica Umanistica nell’Era dell’Intelligenza Artificiale, Proceedings del XIV Convegno Annuale AIUCD2025, edited by Simone Rebora, Marco Rospocher, and Stefano Bazzaco, 404–410. AIUCD. https://doi.org/10.6092/unibo/amsacta/8380.
[8] Graziosi, Barbara, Johannes Haubold, Charlie Cowen-Breen, and Creston Brooks. 2023. "Machine Learning and the Future of Philology: A Case Study". TAPA 153 (1): 253–84. https://doi.org/10.1353/apa.2023.a901022.
[9] Havet, Louis. 1911. Manuel de critique verbale appliquée aux textes latins. Hachette.
[10] Johnson, Justin M., and Taghi M. Khoshgoftaar. 2019. "Survey on Deep Learning with Class Imbalance". Journal of Big Data 6 (1): 27. https://doi.org/10.1186/s40537-019-0192-5.
[11] Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. "A spelling correction program based on a noisy channel model". In Proceedings of the 13th conference on Computational linguistics - Volume 2 (USA), 205–10. https://doi.org/10.3115/997939.997975.
[12] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". arXiv preprint arXiv:1910.10683.
[13] Riemenschneider, Frederick, and Anette Frank. 2023. "Exploring Large Language Models for Classical Philology". arXiv preprint arXiv:2305.13698.
[14] Shannon, Claude E. 1948. "A Mathematical Theory of Communication". Bell System Technical Journal 27 (3): 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
[15] Shannon, Claude E., and Warren Weaver. 1998. The Mathematical Theory of Communication. University of Illinois Press. https://books.google.it/books?id=IZ77BwAAQBAJ.
[16] Singh, Pranaydeep, Gorik Rutten, and Els Lefever. 2021. "A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek". In Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, edited by Stefania Degaetano-Ortlieb, Anna Kazantseva, Nils Reiter, and Stan Szpakowicz, 128–37. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.latechclfl-1.15.
[17] Sommerschield, Thea, Yannis Assael, John Pavlopoulos, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, Jonathan Prag, Ion Androutsopoulos, and Nando de Freitas. 2023. "Machine Learning for Ancient Languages: A Survey". Computational Linguistics 49 (3): 703–47. https://doi.org/10.1162/coli_a_00481.
[18] Straka, Milan, Jana Straková, and Federica Gamba. 2024. "ÚFAL LatinPipe at EvaLatin 2024: Morphosyntactic Analysis of Latin". arXiv preprint arXiv:2404.05839.
[19] Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". arXiv preprint arXiv:2201.11903.
[20] Wróbel, Krzysztof, and Krzysztof Nowak. 2022. "Transformer-based Part-of-Speech Tagging and Lemmatization for Latin". In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, edited by Rachele Sprugnoli and Marco Passarotti, 193–97. European Language Resources Association. https://aclanthology.org/2022.lt4hala-1.31/.
License
Copyright (c) 2026 Giuseppe Ferrara

This work is licensed under a Creative Commons Attribution 4.0 International License.