CompL-it: a Computational Lexicon of Italian
Keywords:
Computational Lexicon, Linguistic Resources, Linguistic Linked Open Data, OntoLex-Lemon, Information Retrieval
Abstract
This paper describes CompL-it, a new open computational lexicon for contemporary Italian. The resource was built from three sources: an existing Italian lexicon, a lemmatized list of inflected forms obtained from a morphological analyzer, and a set of treebanks. Integrating these sources required standardising them according to the conventions of the Linguistic Linked Open Data community, a prerequisite for the subsequent conversion into the OntoLex-Lemon model. The resulting computational lexicon comprises approximately 100,000 lexical entries, 790,000 forms, 57,000 senses, and 86,000 semantic relations. Thanks to its rich and articulated linguistic structure, the lexicon can be used to enhance information retrieval in full-text search tasks, as the paper shows.
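As a rough illustration of how a lexicon with entries, forms, and semantic relations might support full-text search, the sketch below expands a query term into the inflected forms of its lemma and of related lemmas. It is not the authors' implementation: the `LexicalEntry` class, the `expand_query` function, and the toy data are all hypothetical, and the real resource is modelled in OntoLex-Lemon rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """Toy stand-in for a lexical entry (hypothetical, not the OntoLex-Lemon model)."""
    lemma: str
    forms: set[str] = field(default_factory=set)     # inflected forms of the lemma
    synonyms: set[str] = field(default_factory=set)  # lemmas linked by a semantic relation


# Hypothetical toy data for illustration only; not taken from CompL-it.
LEXICON = {
    "casa": LexicalEntry("casa", {"casa", "case"}, {"abitazione"}),
    "abitazione": LexicalEntry("abitazione", {"abitazione", "abitazioni"}),
}


def expand_query(term: str, lexicon: dict[str, LexicalEntry]) -> set[str]:
    """Return the term together with the forms of its lemma and of its synonyms."""
    term = term.lower()
    # Map every known form back to its lemma. In this simplification an
    # ambiguous form keeps only the last lemma seen.
    form_to_lemma = {
        form: entry.lemma
        for entry in lexicon.values()
        for form in entry.forms | {entry.lemma}
    }
    lemma = form_to_lemma.get(term)
    if lemma is None:
        # Unknown word: nothing to expand, search for the term as typed.
        return {term}
    expanded: set[str] = set()
    for related in {lemma} | lexicon[lemma].synonyms:
        entry = lexicon.get(related)
        if entry is None:
            expanded.add(related)  # synonym without its own entry
        else:
            expanded |= entry.forms | {entry.lemma}
    return expanded


if __name__ == "__main__":
    print(sorted(expand_query("case", LEXICON)))
```

With this toy data, a search for the plural "case" would also match documents containing "casa", "abitazione", or "abitazioni". A real system would presumably retrieve forms and relations from the lexicon itself, for instance through a SPARQL endpoint, and would need to handle forms shared by several lemmas.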
License
Copyright (c) 2024 Aracne Editrice

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.