Masked texts: new tools for the security and linguistic analysis of legal corpora


  • Laura Clemenzi Università degli Studi della Tuscia
  • Francesca Fusco Università degli Studi di Padova
  • Daniele Fusi Università degli Studi di Venezia Ca' Foscari - Venice Centre for digital and public humanities (VeDPH)
  • Giulia Lombardi Università di Genova



AIUCD2022, legal linguistics, legal writing, pseudonymization, Pythia, TEI, search engine


The Atti Chiari project, which is collecting the first large Italian corpus of judicial acts, must meet strict legal requirements and deal with many peculiarities of language and content; to this end, a number of processes and tools have been designed and implemented. The first requirement is to remove all personal data from the documents without destroying their linguistic form or compromising their readability. For this purpose, a pseudonymization procedure has been created, based on a preliminary annotation stage: information is added to the text precisely so that it can later be removed in different ways, according to different purposes (linguistic analysis, legal analysis, etc.). This light annotation also provides data useful for converting the documents from their original presentational format into a semantic one based on TEI. Once prepared in this way, the documents are gathered into a single corpus, ready to be indexed for linguistic research. Because multiple search criteria must be combined, whatever their origin and model, a new type of search engine, designed primarily for the philological field, has been used to obtain the required openness and granularity of metadata.



How to Cite

Clemenzi, L., Fusco, F., Fusi, D., & Lombardi, G. (2023). Masked texts: new tools for the security and linguistic analysis of legal corpora. Umanistica Digitale, 7(16), 1–32.