Expliciting Contexts: Semantic Knowledge Extraction from Traditional Archival Descriptions

Lucia Giagnolini; Andrea Schimmenti; Paolo Bonora; Francesca Tomasi

doi:10.6092/issn.2532-8816/21229

Authors

Lucia Giagnolini Università di Bologna https://orcid.org/0000-0002-4876-2691
Andrea Schimmenti Università di Bologna https://orcid.org/0000-0001-7865-7537
Paolo Bonora Università di Bologna https://orcid.org/0000-0001-8337-3379
Francesca Tomasi Università di Bologna https://orcid.org/0000-0002-6631-8607

Keywords:

Linked Open Data, Archives, Information retrieval, Knowledge extraction, Knowledge Representation, supervised annotation, archival contexts, AIUCD2024

Abstract

Archival finding aids often express only part of the informational potential of their data, because the descriptions of documentary collections contain numerous unstructured fields. The prevalence of extensive literal sections, or full-text fields, limits both semantic querying and the ability to uncover the latent contexts embedded in such unstructured text. This study proposes a methodology for automatic Knowledge Extraction (KE) from archival descriptions, aiming to enhance their structuring and semantic interoperability. Through a case study based on the Italian National Archival System (SAN), and using ready-to-use tools such as TINT, FRED, and GPT-4o, we conducted a preliminary evaluation of various morphosyntactic, lexical, and semantic analysis techniques. The most promising results came from Large Language Models (LLMs), which led to the development of a KE pipeline based on the open-source model Llama 3.3. The findings demonstrate a high capacity for extracting biographical events and relationships, with a good balance between precision and recall, confirming the validity of the approach. However, a more robust software architecture is needed: LLM-based pipelines must become truly scalable before they can be integrated effectively into archival systems.
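As a minimal illustration of the balance between precision and recall mentioned in the abstract, the sketch below scores a set of extracted relations against a manually annotated reference set. The abstract does not say how extracted items were matched to the reference annotations, so exact matching of (subject, relation, object) triples is an assumption here, and the `Triple` type, the relation names, and the sample data are hypothetical placeholders rather than material from the study.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical representation of one extracted relation or event."""
    subject: str
    relation: str
    obj: str


def precision_recall_f1(predicted: set[Triple], gold: set[Triple]) -> tuple[float, float, float]:
    """Score predictions against a reference set using exact triple matching (an assumption)."""
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Invented placeholder data: three reference triples, three predictions, two in common.
gold = {
    Triple("person_1", "bornIn", "place_A"),
    Triple("person_1", "memberOf", "organisation_B"),
    Triple("person_1", "diedIn", "place_C"),
}
predicted = {
    Triple("person_1", "bornIn", "place_A"),
    Triple("person_1", "memberOf", "organisation_B"),
    Triple("person_1", "marriedTo", "person_2"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest option. An evaluation of free-text archival descriptions might instead normalise names and dates, or accept partial matches, before counting a prediction as correct.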

References

Palmero Aprosio, Alessio, and Giovanni Moretti. "Tint 2.0: An All-Inclusive Suite for NLP in Italian." In Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-It 2018, edited by Elena Cabrio, Alessandro Mazzei, and Fabio Tamburini, 311–17. Torino: Accademia University Press, 2019.

Babaei Giglou, Hamed, Jennifer D'Souza, and Sören Auer. "LLMs4OL: Large Language Models for Ontology Learning." In The Semantic Web – ISWC 2023, edited by Terry R. Payne et al., 408–27. Cham: Springer Nature Switzerland, 2023.

Bonora, Paolo, and Angelo Pompilio. "Automatic Extraction of Opera Character Characteristics through Lexical-Syntactic Patterns." Umanistica Digitale 5, no. 10 (January 2021): 193–210.

Borgo, Stefano, Roberta Ferrario, Aldo Gangemi, Nicola Guarino, Claudio Masolo, Daniele Porello, Emilio M. Sanfilippo, and Laure Vieu. "DOLCE: A Descriptive Ontology for Linguistic and Cognitive Engineering." Special issue "Foundational Ontologies in Action," edited by Stefano Borgo, Antony Galton, and Oliver Kutz. Applied Ontology 17, no. 1 (March 2022): 45–69. https://doi.org/10.3233/AO-210259.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. "Language Models Are Few-Shot Learners." In Advances in Neural Information Processing Systems 33, edited by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, 1877–1901. Red Hook, NY: Curran Associates, 2020.

Colavizza, Giovanni, Tobias Blanke, Charles Jeurgens, and Julia Noordegraaf. "Archives and AI: An Overview of Current Debates and Future Perspectives." Journal on Computing and Cultural Heritage 15, no. 1 (December 14, 2021): 4:1–4:15.

Damiani, Concetta. "Archival Description and Conceptual Transversality." JLIS.It 13, no. 3 (September 15, 2022): 154–61.

Daquino, Marilena, and Francesca Tomasi. "Historical Context Ontology (HiCO): A Conceptual Model for Describing Context Information of Cultural Heritage Objects." In Metadata and Semantics Research, edited by Emmanouel Garoufallou et al., 424–36. Cham: Springer International Publishing, 2015.

Daquino, Marilena, Valentina Pasqual, and Francesca Tomasi. "Knowledge Representation of Digital Hermeneutics of Archival and Literary Sources." JLIS.It 11, no. 3 (September 15, 2020): 59–76.

Daquino, Marilena. "Linked Open Data Native Cataloguing and Archival Description." JLIS.It 12, no. 3 (September 15, 2021): 91–104.

Gangemi, A., A. Graciotti, A. Meloni, E. Marzi, A. Nuzzolese, V. Presutti, D. R. Recupero, A. Russo, and R. Tripodi. "MusicBO, an Application of Text2AMR2FRED to the Musical Heritage Domain."

Gangemi, Aldo, Valentina Presutti, Diego Reforgiato Recupero, Andrea Giovanni Nuzzolese, Francesco Draicchio, and Misael Mongiovì. "Semantic Web Machine Reading with FRED." Semantic Web 8, no. 6 (August 7, 2017): 873–93.

Giagnolini, Lucia, Paolo Bonora, and Francesca Tomasi. "Affinare il contesto: estrazione di informazioni strutturate per l’arricchimento dei contesti archivistici." In Me.Te. Digitali. Mediterraneo in rete tra testi e contesti, 411–16. Venezia: Associazione per l’Informatica Umanistica e la Cultura Digitale, 2024.

Guerrini, Mauro, and Tiziana Possemato. "Linked Data: Un Nuovo Alfabeto del Web Semantico." Biblioteche Oggi 30, no. 3 (2012): 7–15.

Mihindukulasooriya, Nandana, Sanju Tiwari, Carlos F. Enguix, and Kusum Lata. "Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text." In The Semantic Web – ISWC 2023, edited by Terry R. Payne et al., 247–65. Cham: Springer Nature Switzerland, 2023.

Polley, Katherine Louise, Vivian Teresa Tompkins, Brendan John Honick, and Jian Qin. "Named Entity Disambiguation for Archival Collections: Metadata, Wikidata, and Linked Data." Proceedings of the Association for Information Science and Technology 58, no. 1 (2021): 520–24.

Shahriar, Sakib, Brady D. Lund, Nishith Reddy Mannuru, Muhammad Arbab Arshad, Kadhim Hayawi, Ravi Varma Kumar Bevara, Aashrith Mannuru, and Laiba Batool. "Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency." Applied Sciences 14, no. 17 (2024): 7782. https://doi.org/10.3390/app14177782.

Strötgen, Jannik, and Michael Gertz. "A Baseline Temporal Tagger for All Languages." In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 541–47. Lisbon, Portugal: Association for Computational Linguistics, 2015.

Tomasi, Francesca. "Archival Finding Aids in Linked Open Data between Description and Interpretation." JLIS.It 14, no. 3 (September 15, 2023): 134–46.

Valacchi, Federico. "The Parts and the Whole. Integrate Knowledge." JLIS.It 13, no. 3 (September 15, 2022): 1–11.

Valacchi, Federico. "Not the Institutions but the Subjects Matter. Beyond the Necessary Approximation of Finding Aids?" JLIS.It 14, no. 3 (September 15, 2023): 1–14.

Vitali, Stefano. "La Descrizione Degli Archivi Nell'Epoca Degli Standard e Dei Sistemi Informatici." In Archivistica. Teorie, Metodi, Pratiche, edited by Linda Giuva and Maria Guercio, 179–210. Roma: Carocci, 2014.

Chen, Ruirui, Chengwei Qin, Weifeng Jiang, and Dongkyu Choi. "Is a Large Language Model a Good Annotator for Event Extraction?" Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (2024): 17772–80. https://doi.org/10.1609/aaai.v38i16.29730.

Shiri, Fatemeh, Van Nguyen, Farhad Moghimifar, John Yoo, Gholamreza Haffari, and Yuan-Fang Li. "Decompose, Enrich, and Extract! Schema-aware Event Extraction using LLMs." arXiv preprint arXiv:2406.01045, 2024. https://arxiv.org/abs/2406.01045.

Waltl, B., G. Bonczek, and F. Matthes. "Rule-Based Information Extraction: Advantages, Limitations, and Perspectives." Jusletter IT (02 2018), 4.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." In Advances in Neural Information Processing Systems 35, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, 24824–24837. Red Hook, NY: Curran Associates, 2022.

Wei, Jason, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research, 2022.
