Measuring the Use of Tools and Software in the Digital Humanities: A Machine-Learning Approach for Extracting Software Mentions from Scholarly Articles

1. Introduction

Tools and software are an important part of Digital Humanities (DH) practice (Dombrowski 2014, Barbot et al. 2019, Barbot et al. 2020, Fischer/Moranville 2020, Dombrowski 2021, Fischer et al. 2021, Luhmann/Burghardt 2021). Previous attempts to gain an overview of these tools were mainly based on manual aggregation, as in the case of the long-running Canadian project TAPoR. Around 1,500 tools are listed there, a number that illustrates how difficult it is to keep such a directory up to date. To learn more about the actual use of tools in scientific work, especially in the Digital Humanities, we present a machine-learning approach for extracting tools and software mentioned by name in scientific publications, adding to other recent endeavours in this field (Du et al. 2020, Henny-Krahmer/Jettka 2022).

2. Related Works

Different approaches to named entity recognition (NER) have emerged over the years. They can be divided broadly into two groups: rule-based and machine-learning (ML) approaches (Isozaki/Kazawa 2002).

2.1. Rule-based approach

One of the main methods used for extracting facts, including named entities, from texts is the use of Hearst patterns (Hearst 1992). Hearst patterns were revisited by the Facebook team for their Hypernymy Suite tool (Roller et al. 2018) and were still found to extract relations more accurately than distributional methods. Similar efforts include GATE (Cunningham et al. 2013) and other legacy solutions (Chiticariu et al. 2013).
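To illustrate what a Hearst pattern looks like in practice, the following minimal sketch implements a single pattern ("X such as A, B and C") with a regular expression. The regular expression, the function name and the example sentence are our own simplifications for illustration; they are not taken from Hearst (1992) or from the Hypernymy Suite, and real systems match over noun phrases rather than raw strings.

```python
import re

# One classic Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Simplified for illustration: every term is assumed to be a single word.
SUCH_AS = re.compile(
    r"(?P<hypernym>\w+) such as (?P<hyponyms>[\w\-]+(?:(?:, | and | or )[\w\-]+)*)"
)


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hypernym")
        for hyponym in re.split(r", | and | or ", match.group("hyponyms")):
            pairs.append((hyponym, hypernym))
    return pairs


# Hypothetical example sentence.
example = "We used tools such as Gephi, Voyant and Stylo for the analysis."
pairs = extract_hyponyms(example)
print(pairs)
```

Such a rule is precise when the sentence follows the template, but it finds nothing when a tool is mentioned in any other construction, which is the usual trade-off of rule-based extraction.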

2.2. Machine-learning approach

Among the relevant ML-based methods, we looked at SoftwareKG (Schindler et al. 2020), which uses DBpedia (Lehmann et al. 2015) to validate entities as software. Its results, however, miss various software mentions from other sources, because the model memorises the names it has seen (memorisation effect; Arpit et al. 2017). Another example is the GROBID tool (Du et al. 2020). After applying it to selected DH publications, we found that its recall is limited and that it also shows some memorisation effect. Given the recent efforts in this area (see also Henny-Krahmer/Jettka 2022), we were motivated to see whether we could solve these issues and arrive at a production-ready model.

3. Our Approach

Following the ML-based approach and considering the limitations of related work, we started to build a model that can recognise a tool based on its context (e.g. neighbourhood and grammar) and appearance (e.g. capitalised words, adjacent numbers, etc.).
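The two kinds of cues named above can be made concrete with a small sketch. It is only meant to show what "context" and "appearance" could mean for a single token; the feature names, the window size and the example sentence are hypothetical, and a neural NER model learns such cues implicitly rather than from a hand-written feature table like this one.

```python
import re


def token_features(tokens, i, window=2):
    """Describe token i by its appearance and by its neighbouring words."""
    token = tokens[i]
    next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        # Appearance: how the candidate itself looks.
        "is_capitalised": token[:1].isupper(),
        "has_inner_capital": any(c.isupper() for c in token[1:]),
        "followed_by_number": bool(re.fullmatch(r"v?\d+(\.\d+)*", next_token)),
        # Context: lower-cased words to the left and right of the candidate.
        "left_context": [t.lower() for t in tokens[max(0, i - window):i]],
        "right_context": [t.lower() for t in tokens[i + 1:i + 1 + window]],
    }


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "The network was visualised with Gephi 0.9.2 afterwards".split()
features = token_features(tokens, tokens.index("Gephi"))
print(features)
```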

3.1. Dataset

For the dataset, we preprocessed a deduplicated collection of sentences from PLOS Sociology, Linguistics and abstracts from ADHO’s annual DH conferences (2015 and 2020), resulting in 1,899,652 sentences.

We created two versions of the dataset. For the baseline dataset, we prepared approximately 55,000 tool mentions as patterns by processing names of tools and software from TAPoR and Wikidata. These patterns were fed into Prodigy, which suggested 2,205 sentences containing tool names. Following the annotation guidelines, 1,000 of these sentences were annotated manually in Prodigy to form the baseline dataset.
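The step from a list of tool names to match patterns can be sketched as follows. The output uses the token-based pattern layout (one JSON object per line with a label and a list of token descriptions) that Prodigy and spaCy accept. The label "TOOL", the whitespace tokenisation and the sample names are assumptions made for this illustration and do not reproduce the pattern file used in the paper.

```python
import json


def build_patterns(tool_names, label="TOOL"):
    """Turn tool names into token-based match patterns, one dict per name."""
    patterns = []
    seen = set()
    for name in tool_names:
        # Assumption: splitting on whitespace approximates the tokeniser.
        words = name.lower().split()
        key = tuple(words)
        if not words or key in seen:
            continue  # skip empty names and case-insensitive duplicates
        seen.add(key)
        patterns.append({"label": label, "pattern": [{"lower": w} for w in words]})
    return patterns


# Hypothetical input; the paper derives its names from TAPoR and Wikidata.
names = ["Voyant Tools", "Gephi", "gephi", "  "]
patterns = build_patterns(names)
jsonl = "\n".join(json.dumps(p) for p in patterns)
print(jsonl)
```

Matching on the lower-cased form keeps the pattern list short, at the cost of also matching ordinary words that happen to share a tool's name, which is one reason why the suggested sentences still need manual annotation.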

To avoid memorisation of the tool names in our patterns, we conducted another round of annotation using Prodigy’s manual annotation feature with suggestions from a model. The suggestions of the baseline model were corrected in Prodigy, focusing on false positives and false negatives, which resulted in an additional 583 high-quality annotations. These corrections form our second dataset.
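What counts as a false-positive or false-negative suggestion can be shown with a short comparison of entity spans. This is a sketch under the assumption that annotations are stored as character offsets per sentence; the function name, the sentence and the spans are hypothetical and only illustrate the bookkeeping, not the Prodigy workflow itself.

```python
def compare_spans(gold, predicted):
    """Split entity spans into hits, false positives and false negatives.

    Spans are (start_char, end_char) tuples within one sentence.
    """
    gold, predicted = set(gold), set(predicted)
    return {
        "true_positives": sorted(gold & predicted),
        "false_positives": sorted(predicted - gold),  # suggested, but not a tool
        "false_negatives": sorted(gold - predicted),  # a tool the model missed
    }


# Hypothetical sentence with hand-made annotations.
sentence = "We combined Stylo with Python scripts in Vienna."
gold = [(12, 17), (23, 29)]       # "Stylo", "Python"
predicted = [(12, 17), (41, 47)]  # "Stylo", "Vienna"
report = compare_spans(gold, predicted)
print(report)
```

In this example the corrected annotation would drop "Vienna" and add "Python", which are exactly the two kinds of fixes the correction round concentrates on.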

3.2. Model training

Four different models were trained and evaluated to find the best training strategy for the task. All models were trained for 20 iterations using a compounding batch size and a drop-out rate of 0.2, with 20% of the data held out as the evaluation set.
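The training settings can be illustrated with a plain-Python sketch of a hold-out split and a compounding batch-size schedule, in which the batch size grows by a constant factor until it reaches an upper limit. The start, stop and growth values below are assumptions chosen for readability, as the text does not report them, and the code is a re-implementation of the idea rather than the training script used for the models.

```python
import random

DROPOUT = 0.2        # drop-out rate stated in the text
N_ITERATIONS = 20    # number of training iterations stated in the text
EVAL_FRACTION = 0.2  # share of the data held out for evaluation


def compounding(start, stop, compound):
    """Yield batch sizes that grow by a constant factor until they reach stop."""
    size = start
    while True:
        yield min(size, stop)
        size *= compound


def split_examples(examples, eval_fraction=EVAL_FRACTION, seed=0):
    """Shuffle the examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def minibatches(items, sizes):
    """Cut items into consecutive batches whose lengths follow sizes."""
    start = 0
    while start < len(items):
        size = int(next(sizes))
        yield items[start:start + size]
        start += size


examples = list(range(1000))          # stand-in for 1,000 annotated sentences
train, evaluation = split_examples(examples)
sizes = compounding(4.0, 32.0, 1.5)   # assumed values, not taken from the paper
batch_lengths = [len(batch) for batch in minibatches(train, sizes)]
print(len(train), len(evaluation), batch_lengths[:7])
```

Starting with small batches gives the model many updates early on, while the larger batches later in each pass make the gradient estimates more stable.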

3.3. Transfer learning

Following related work (Ruder et al. 2019), we wanted to see whether transfer learning would improve our results. Two further models, based on the first two, were trained with transfer learning by pre-training spaCy’s en_vectors_web_lg model on our entire corpus of sentences.

4. Results

The performance of the four trained models is shown in Table 1.

Table 1: Results of different Tool Entity recognition models.

Model                                          Precision  Recall  F-Score
Baseline                                       .89        .83     .86
Baseline with corrections                      .90        .88     .89
Baseline with Transfer Learning                .89        .84     .86
Baseline with corrections & Transfer Learning  .91        .92     .92

Training on the corrected data improved the model, especially regarding recall, which rose from .83 to .88. Recall was a major criterion, since many other models fall short here as a result of the memorisation effect (Arpit et al. 2017) on their selected tools or software. Adding newly found tools that were not present in our 55,000 tool examples and fixing the errors of our first model contributed to this result. The limitations of using a single-task NLP model with a single dataset have already been studied (Ruder et al. 2019), and most recent practice considers transfer learning. While applying transfer learning to the baseline without corrections showed no notable improvement (the F-score stayed at .86), it raised the F-score from .89 to .92 when applied together with corrections.
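The relation between the three columns of Table 1 can be checked with a short calculation. We assume here that the reported F-score is the balanced F1 measure, i.e. the harmonic mean of precision and recall, which the text does not state explicitly. The values are copied from Table 1.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 1.
table = {
    "Baseline": (0.89, 0.83),
    "Baseline with corrections": (0.90, 0.88),
    "Baseline with Transfer Learning": (0.89, 0.84),
    "Baseline with corrections & Transfer Learning": (0.91, 0.92),
}

recomputed = {model: f_score(p, r) for model, (p, r) in table.items()}
for model, value in recomputed.items():
    print(f"{model}: {value:.3f}")
```

For the first three rows the recomputed values round to the reported F-scores. For the last row the rounded inputs give roughly .915, slightly below the reported .92. A difference of this size is what one would expect if precision and recall were rounded to two decimals before publication, although this cannot be verified from the table alone.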

4.1. Application of the model to real data

The trained NER model was used to extract tool names from publications already ingested in the SSHOC Marketplace (Zarei et al. 2022). A total of 470 publications, consisting of 54,841 sentences, were fed into the model. The model suggested 2,257 distinct potential tool names from 5,091 sentences mentioning tools.

4.2. Evaluation of extracted tool names

Since the discovery of previously unseen tool names is the most interesting benefit of the NER model, the evaluation of suggested tool names is important. Suggested tool names were evaluated semi-automatically by looking them up in the Marketplace and in Wikidata. Of the 2,257 distinct tool name suggestions, 125 were found in Marketplace entries and 38 in Wikidata. The rest of the suggestions were evaluated manually.
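The semi-automatic part of this evaluation can be sketched as a simple triage: each suggestion is looked up in the Marketplace first, then in Wikidata, and is passed on to manual review only if neither lookup succeeds. In this sketch, in-memory name sets stand in for the two external services, and the function name, the normalisation and the sample names are hypothetical.

```python
def triage(suggestions, marketplace_names, wikidata_names):
    """Sort suggested tool names by where, if anywhere, they can be confirmed."""

    def normalise(name):
        # Assumption: case and extra whitespace are irrelevant for the lookup.
        return " ".join(name.lower().split())

    marketplace = {normalise(n) for n in marketplace_names}
    wikidata = {normalise(n) for n in wikidata_names}
    result = {"marketplace": [], "wikidata": [], "manual": []}
    for name in suggestions:
        key = normalise(name)
        if key in marketplace:
            result["marketplace"].append(name)
        elif key in wikidata:
            result["wikidata"].append(name)
        else:
            result["manual"].append(name)
    return result


# Hypothetical data standing in for the model output and the two resources.
suggested = ["Gephi", "voyant  tools", "Stylo", "TextGrid"]
buckets = triage(
    suggested,
    marketplace_names=["Gephi", "Voyant Tools"],
    wikidata_names=["Gephi", "Stylo"],
)
print(buckets)
```

If the two lookups in the paper did not overlap, 2,257 − 125 − 38 = 2,094 suggestions would have remained for manual evaluation; the text does not say whether any name was found in both resources.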

5. Future Work

5.1. Exploring the use of Transformer models

Transformer-based models such as BERT (Devlin et al. 2019) have given better results for many downstream tasks, including named entity recognition. It will be interesting to fit such transformer models to the task described in this paper and compare the results.

5.2. Validation of the model on real data

In order to monitor the performance of the NER model and detect its decay, it is important to design a feasible evaluation step that combines an automatic lookup in external resources with manual curation and that can trigger the retraining of the model.

Appendix A

Bibliography
  1. Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M.S. et al. (2017). A Closer Look at Memorization in Deep Networks. In: Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pp. 233–242.
  2. Barbot, L., Fischer, F., Moranville, Y., Pozdniakov, I. (2019). Which DH Tools Are Actually Used in Research? In: weltliteratur.net, 6 December 2019. (URL: https://weltliteratur.net/dh-tools-used-in-research/)
  3. Barbot, L., Dombrowski, Q., Fischer, F., Rockwell, G., Spiro, L. (2020). Who Needs Tool Directories? A Forum on Sustaining Discovery Portals Large and Small. In: DH2020: “carrefours/intersections”. 22–24 July 2020. Book of Abstracts. University of Ottawa.
  4. Chiticariu, L., Li, Y., Reiss, F. R. (2013). Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems! In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 827–832.
  5. Cunningham, H., Tablan, V., Roberts, A., Bontcheva, K. (2013). Getting More Out of Biomedical Documents with GATE’s Full Lifecycle Open Source Text Analytics. PLOS Computational Biology 9(2), doi:10.1371/journal.pcbi.1002854.
  6. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. doi:10.48550/arXiv.1810.04805.
  7. Dombrowski, Q. (2014). What Ever Happened to Project Bamboo? In: Literary and Linguistic Computing, Vol. 29, Issue 3, September 2014, pp. 326–339, doi:10.1093/llc/fqu026.
  8. Dombrowski, Q. (2021). “The Directory Paradox.” In: Anne McGrail et al. (eds.): Debates in Digital Humanities: Institutions, Infrastructures at the Interstices. University of Minnesota Press, pp. 83–98.
  9. Du, C., Howison, J., Lopez, P. (2020). Softcite: Automatic Extraction of Software Mentions in Research Literature. Poster contribution. 1st SciNLP workshop at AKBC.
  10. Fischer, F., Moranville, Y. (2020). “DH Tools Mentioned in ‘The Programming Historian’.” In: weltliteratur.net, 17 Jan 2020. (URL: https://weltliteratur.net/dh-tools-programming-historian/)
  11. Fischer, F., Burghardt, M., Luhmann, J., Barbot, L., Moranville, Y., Zarei, A. (2021). Die Werkbänke der Digital Humanities: Zur Rolle von Tools und Software für die Forschungsarbeit. In: vDHd2021: “Experimente”, Zenodo, doi:10.5281/zenodo.4639228.
  12. Hearst, M.A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. In: COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics.
  13. Henny-Krahmer, U., Jettka, D. (2022). Softwarezitation als Technik der Wissenschaftskultur: Vom Umgang mit Forschungssoftware in den Digital Humanities. In: DHd2022: “Kulturen des digitalen Gedächtnisses”. 7–11 March 2022. Book of Abstracts. University of Potsdam, doi:10.5281/zenodo.6328047.
  14. Isozaki, H., Kazawa, H. (2002). Efficient support vector classifiers for named entity recognition. In: Proceedings of the 19th International Conference on Computational Linguistics. Volume 1, pp. 1–7, doi:10.3115/1072228.1072282.
  15. Lehmann, J. et al. (2015). DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. In: Semantic Web 6(2), pp. 167–195.
  16. Luhmann, J., Burghardt, M. (2021). Digital humanities – A discipline in its own right? An analysis of the role and position of digital humanities in the academic landscape. In: Journal of the Association for Information Science and Technology, pp. 1–24, doi:10.1002/asi.24533.
  17. Roller, S., Kiela, D., Nickel, M. (2018). Hearst patterns revisited: Automatic hypernym detection from large text corpora. doi:10.48550/arXiv.1806.03191.
  18. Ruder, S., Peters, M.E., Swayamdipta, S., Wolf, T. (2019). Transfer learning in natural language processing. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pp. 15–18.
  19. Schindler, D., Zapilko, B., Krüger, F. (2020). Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach. In: Harth, A. et al. (eds.): The Semantic Web. ESWC 2020. Lecture Notes in Computer Science, vol. 12123. Springer, Cham, doi:10.1007/978-3-030-49461-2_16.
  20. Zarei, A., Seung-Bin, Y., Ďurčo, M., Illmayer, K., Barbot, L., Fischer, F., Gray, E. (2022). Der SSH Open Marketplace: Kontextualisiertes Praxiswissen für die Digital Humanities. In: DHd2022: “Kulturen des digitalen Gedächtnisses”. 7–11 March 2022. Book of Abstracts. University of Potsdam, doi:10.5281/zenodo.6327975.
Alireza Zarei (alireza.zarei_at_gwdg.de), GWDG; Yim Seung-Bin (Seung-Bin.Yim_at_oeaw.ac.at), Austrian Centre for Digital Humanities and Cultural Heritage; Frank Fischer (frank.fischer_at_dariah.eu), DARIAH-EU; Matej Ďurčo (matej.durco_at_oeaw.ac.at), Austrian Centre for Digital Humanities and Cultural Heritage; and Philipp Wieder (philipp.wieder_at_gwdg.de), GWDG