Knowledge extraction, research projects and archives management.
Keywords:
Knowledge Extraction, Research Project, Metadata, Records Management, Digital Preservation
Abstract
Archives play an important role in the knowledge society and must respond ever more quickly to information needs. In universities, for example, research projects are a strategic asset for the growth of territories, the rationalization of financial resources, and the development of archival science. The documentation that characterizes research projects also has administrative value. This paper investigates the possibility of extracting knowledge from this class of documents. In particular, it experiments with applying automatic metadata extraction tools to archival documents. Automatic metadata extraction could provide greater continuity between the production and the representation of objects. Metadata can also be useful for accessing or sharing content within digital preservation systems (e.g., through ontologies or Linked Data). The chosen tools combine Machine Learning technologies and supervised learning techniques with newer Deep Learning technologies.
License
Copyright (c) 2023 Aracne Editrice
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.