Tools and software are an important part of Digital Humanities (DH) practice (Dombrowski 2014, Barbot et al. 2019, Barbot et al. 2020, Fischer/Moranville 2020, Dombrowski 2021, Fischer et al. 2021, Luhmann/Burghardt 2021). Previous attempts to gain an overview of these tools have mainly relied on manual aggregation, as in the case of the long-running Canadian project TAPoR. 1 Around 1,500 tools are listed there, an order of magnitude that illustrates how difficult it is to keep such a collection up to date. To learn more about the actual use of tools in scientific work, especially in the Digital Humanities, we present a machine-learning approach for extracting tools and software mentioned by name in scientific publications, adding to other recent endeavours in this field (Du et al. 2020, Henny-Krahmer/Jettka 2022).
Different approaches have emerged over the years for named entity recognition (NER), which can be categorised broadly into two groups, rule-based and machine-learning (ML) approaches (Isozaki 2002).
One of the main methods used for extracting facts, including named entities, from texts is Hearst patterns (Hearst 1992). Hearst patterns were revisited by the Facebook team for their Hypernymy Suite tool (Roller et al. 2018) and were still found to extract relations more accurately than other distributional methods. Similar efforts include GATE (Cunningham et al. 2013) and other legacy solutions (Chiticariu et al. 2013).
Among the relevant ML-based methods, we looked at SoftwareKG (Schindler et al. 2020), which uses DBpedia (Lehmann et al. 2015) to validate entities as software. Its results, however, miss various software mentions from other sources because the model memorises the words it was trained on (memorisation effect, Arpit et al. 2017). Another example is the GROBID tool 2 (Du et al. 2020). After applying it to selected DH publications, we found that its recall is limited and that it also shows some memorisation effect. Given the recent efforts in this area (see also Henny-Krahmer/Jettka 2022), we were motivated to see whether we could resolve these issues and arrive at a production-ready model.
Following the ML-based approach and considering the limitations of related work, we started to build a model that can recognise a tool based on its context (e.g. neighbourhood and grammar) and appearance (e.g. capitalised words, adjacent numbers, etc.).
For the dataset, we preprocessed a deduplicated collection of sentences from PLOS Sociology and Linguistics publications and from abstracts of ADHO’s annual DH conferences (2015 and 2020), resulting in 1,899,652 sentences.
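The preprocessing itself is not described in detail here; the following is a minimal sketch of such a sentence extraction and deduplication step with spaCy, where the input location, output file and the use of en_core_web_sm for sentence splitting are illustrative assumptions rather than the pipeline actually used.

```python
# Minimal sketch of sentence extraction and deduplication; the paths
# corpus/*.txt and sentences.txt and the en_core_web_sm pipeline are
# illustrative assumptions.
import glob
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])  # parser provides sentence boundaries

seen = set()
with open("sentences.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("corpus/*.txt"):
        with open(path, encoding="utf-8") as f:
            doc = nlp(f.read())
        for sent in doc.sents:
            text = sent.text.strip()
            if text and text not in seen:  # keep each sentence only once
                seen.add(text)
                out.write(text + "\n")
```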
We created two versions of the dataset. For the baseline dataset, we prepared approximately 55,000 tool mentions 3 as patterns by processing the names of tools and software listed in TAPoR and Wikidata. These patterns were fed into Prodigy, which suggested 2,205 sentences containing tool names. Following the annotation guidelines, 4 1,000 of these sentences were annotated manually using Prodigy to form the baseline dataset. 5
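A hedged sketch of how such a pattern file can be generated for Prodigy is shown below; the label TOOL, the input file tool_names.txt and the output file patterns.jsonl are assumptions made for illustration.

```python
# Hedged sketch: converting tool and software names (e.g. exported from TAPoR
# and Wikidata) into a patterns.jsonl file for Prodigy. The label TOOL and the
# file names are assumptions.
import json

with open("tool_names.txt", encoding="utf-8") as f:
    names = {line.strip() for line in f if line.strip()}

with open("patterns.jsonl", "w", encoding="utf-8") as out:
    for name in sorted(names):
        # token-based, case-insensitive pattern in spaCy's Matcher syntax
        pattern = [{"lower": token.lower()} for token in name.split()]
        out.write(json.dumps({"label": "TOOL", "pattern": pattern}) + "\n")
```

A file of this kind can then be passed to a pattern-aware Prodigy recipe (for example ner.manual with its --patterns option) to obtain candidate sentences for annotation.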
To avoid the memorisation of the tool names contained in our patterns, we conducted another round of annotations using Prodigy’s manual-annotation feature with suggestions from a model. The suggestions of the baseline model were corrected in Prodigy, focusing on false-positive and false-negative suggestions, resulting in an additional 583 high-quality annotations that form our second dataset (corrections).
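Outside of Prodigy's own correction workflow, the kind of false positives and false negatives targeted in this round can be surfaced by comparing the baseline model's predictions against existing gold annotations; the sketch below assumes Prodigy-style JSONL annotations, a model saved as baseline_model and the label TOOL.

```python
# Illustrative sketch of surfacing false positives / false negatives of the
# baseline model for a correction round. The file names, model path and the
# label TOOL are assumptions.
import json
import spacy

nlp = spacy.load("baseline_model")  # the trained baseline NER model

with open("annotated_sentences.jsonl", encoding="utf-8") as f:  # gold annotations
    for line in f:
        example = json.loads(line)
        text = example["text"]
        gold = {(s["start"], s["end"]) for s in example.get("spans", [])}
        pred = {(e.start_char, e.end_char) for e in nlp(text).ents if e.label_ == "TOOL"}
        false_positives = pred - gold   # predicted but not annotated
        false_negatives = gold - pred   # annotated but missed by the model
        if false_positives or false_negatives:
            print(text, false_positives, false_negatives)
```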
Four different models were trained and evaluated to find the best training strategy for this task. All models were trained over 20 iterations with a compounding batch size, a dropout rate of 0.2 and an evaluation split of 0.2.
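A minimal sketch of a spaCy v2-style training loop with these settings is given below; TRAIN_DATA is reduced to a single illustrative example, and the 80/20 train/evaluation split is omitted.

```python
# Sketch of a spaCy v2-style NER training loop with the stated settings
# (compounding batch size, dropout 0.2, 20 iterations). TRAIN_DATA holds one
# illustrative example only; the evaluation split is not shown.
import random
import spacy
from spacy.util import minibatch, compounding

TRAIN_DATA = [
    ("We used Gephi to visualise the network.", {"entities": [(8, 13, "TOOL")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("TOOL")

optimizer = nlp.begin_training()
for iteration in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=0.2, sgd=optimizer, losses=losses)
    print(iteration, losses)

nlp.to_disk("baseline_model")
```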
Motivated by related work (Ruder et al. 2019), we wanted to see whether transfer learning would improve our results. Two further models, based on the first two, were trained with transfer learning by pre-training spaCy’s 6 en_vectors_web_lg model on our entire corpus of sentences.
The performance of the four trained models is shown in Table 1.
Table 1: Results of different Tool Entity recognition models.
| Model | Precision | Recall | F-Score |
| --- | --- | --- | --- |
| Baseline | .89 | .83 | .86 |
| Baseline with corrections | .90 | .88 | .89 |
| Baseline with Transfer Learning | .89 | .84 | .86 |
| Baseline with corrections & Transfer Learning | .91 | .92 | .92 |
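As a point of reference, precision, recall and F-score for an entity recogniser of this kind can be computed with spaCy v2's built-in Scorer; the sketch below assumes evaluation examples in the same format as the training data and is illustrative only (spaCy reports these scores on a 0–100 scale).

```python
# Hedged evaluation sketch using spaCy v2's Scorer; "examples" is assumed to
# hold (text, {"entities": [(start, end, "TOOL")]}) tuples from the held-out split.
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(nlp, examples):
    scorer = Scorer()
    for text, annotations in examples:
        gold = GoldParse(nlp.make_doc(text), entities=annotations["entities"])
        scorer.score(nlp(text), gold)  # compare predictions against the gold spans
    # entity precision / recall / F-score, reported by spaCy on a 0-100 scale
    return scorer.scores["ents_p"], scorer.scores["ents_r"], scorer.scores["ents_f"]
```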
Training on the corrected dataset improved the model significantly, especially its recall. Recall was a major criterion on which many other models failed as a result of the memorisation effect (Arpit et al. 2017) of their selected tools or software. Adding newly found tools that were not present in our 55,000 tool examples and fixing the errors of the first model contributed to this result. The limitations of using a single-task NLP model with a single dataset have already been studied (Ruder et al. 2019), and most recent practices rely on transfer learning. While applying transfer learning to the baseline without corrections showed no significant improvement, it significantly improved the F-score when applied together with the corrections.
The trained NER model was used to extract tool names from publications already ingested in the SSHOC Marketplace (Zarei et al. 2022). 470 publications, consisting of 54,841 sentences, were fed into the model, which suggested 2,257 distinct potential tool names from 5,091 sentences mentioning tools.
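The extraction step amounts to running the trained model over each sentence and collecting the predicted entities; a hedged sketch, assuming one sentence per line in marketplace_sentences.txt and a model directory tool_ner_model, both of which are placeholders:

```python
# Hedged sketch of the extraction step: running the trained model over the
# ingested sentences and collecting distinct tool name suggestions.
from collections import Counter
import spacy

nlp = spacy.load("tool_ner_model")  # placeholder path to the trained model

suggestions = Counter()
with open("marketplace_sentences.txt", encoding="utf-8") as f:
    for doc in nlp.pipe(line.strip() for line in f):
        for ent in doc.ents:
            if ent.label_ == "TOOL":
                suggestions[ent.text] += 1  # count how often each name is suggested

print(len(suggestions), "distinct tool name suggestions")
```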
Since the discovery of previously unseen tool names is the most interesting benefit of the NER model, the evaluation of the suggested tool names is important. Suggested tool names were evaluated semi-automatically by looking them up in the Marketplace and in Wikidata. Of the 2,257 distinct tool name suggestions, 125 were available as Marketplace entries and 38 were available in Wikidata. The remaining suggestions were evaluated manually.
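For the Wikidata part of this lookup, a suggestion can be checked against the public wbsearchentities endpoint; the sketch below is illustrative only, and the corresponding Marketplace lookup against its own API is not shown.

```python
# Illustrative lookup of a suggested tool name against Wikidata's public
# wbsearchentities endpoint; the Marketplace lookup would follow the same
# pattern against its own API.
import requests

def wikidata_candidates(name):
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "format": "json",
    }
    response = requests.get("https://www.wikidata.org/w/api.php", params=params)
    return [hit.get("label") for hit in response.json().get("search", [])]

print(wikidata_candidates("Gephi"))
```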
Transformer-based models such as BERT (Devlin et al. 2019) have achieved better results on many downstream tasks, including named entity recognition. It will be interesting to fine-tune such transformer models on the task described in this paper and compare the results.
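As an indication of what such an experiment could look like, the following sketch sets up a BERT model for token classification with a BIO label scheme using the Hugging Face transformers library; the model name, label set and example sentence are assumptions, and the actual fine-tuning loop is not shown.

```python
# Hedged sketch: a BERT model for token classification with a BIO label scheme;
# the model name and label set are assumptions, and fine-tuning is not shown.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-TOOL", "I-TOOL"]  # BIO tags for a single TOOL entity type
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

inputs = tokenizer("We used Gephi to visualise the network.", return_tensors="pt")
outputs = model(**inputs)       # per-token logits over the BIO labels
print(outputs.logits.shape)
```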
To monitor the performance of the NER model and detect its decay, it is important to design a feasible evaluation step that combines automatic lookups in external resources with manual curation and that can trigger retraining of the model.
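In its simplest form, such a monitoring step could re-evaluate the model on a freshly curated sample and flag it for retraining once a quality threshold is undercut; the threshold below is an assumed value, not taken from this paper.

```python
# Hypothetical monitoring check: the F-score comes from re-evaluating the model
# on a freshly curated sample; the threshold is an assumption for illustration.
F_SCORE_THRESHOLD = 0.85

def needs_retraining(current_f_score, threshold=F_SCORE_THRESHOLD):
    """Flag the model for retraining when its F-score drops below the threshold."""
    return current_f_score < threshold

print(needs_retraining(0.80))  # True -> trigger a new correction and training round
```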