By Samuel Swire and Thijs Amersfoort
In his article "Computational Linguistic Analysis of the Biblical Text," Willem van Peursen gave an overview of the computational analysis of the Bible from its early beginnings through 2022. Since then, developments in Artificial Intelligence (AI), and especially Natural Language Processing (NLP), have continued at a rapid pace. In this blog post we highlight some of the major advancements to date in these fields as they apply to Digital Humanities (DH) broadly and to Biblical Studies in particular.
One of the main drivers of AI innovation is the transformer architecture, in particular as used in large language models (LLMs). Transformers were first introduced by researchers at Google in 2017 [1] and subsequently gave rise to other architectures such as Bidirectional Encoder Representations from Transformers (BERT) [2] in 2018. Around the same time, OpenAI published a paper outlining methods for enhancing such models, resulting in the release of the first Generative Pre-trained Transformer (GPT) [3]. GPT was one of the first models that could generate natural language responses to natural language prompts. When trained on large corpora of text and combined with reinforcement learning, these models excel at understanding and generating human language, making them highly capable at tasks including translation, question answering, summarization, sentiment analysis, entity recognition, and topic modeling.
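To make this concrete, the sketch below runs two of these tasks through the Hugging Face `transformers` library, a common entry point for working with pretrained transformer models. The model choices are the library's illustrative defaults, not models discussed in this post.

```python
# A minimal sketch of two of the tasks named above, using the Hugging Face
# `transformers` library (pip install transformers torch).
from transformers import pipeline

# Sentiment analysis: the pipeline downloads a default fine-tuned model.
sentiment = pipeline("sentiment-analysis")
print(sentiment("The new critical edition is a remarkable achievement."))

# Named entity recognition, aggregating word pieces into whole entities.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Willem van Peursen works at the Vrije Universiteit Amsterdam."))
```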
Transformer-based models and LLMs have led to many studies and new applications in DH. To begin with, researchers at Leiden University have published two BERT-based language models trained on historical Dutch and English corpora [4]. These models are designed to facilitate the interpretation of historical texts, on which models trained only on modern language may perform poorly. In a similar application, OpenAI's GPT-4 was employed to translate and analyze a substantial corpus of travelogues as part of the DEHisRe project [5]. The researchers found the initial outcomes of using these LLMs encouraging, with the models exhibiting "superior flexibility over traditional machine learning methods." However, they also identified limitations: the models require substantial resources to operate and are prone to "hallucinate," which introduces risks to the reliability, reproducibility, and factual integrity of the outcomes [6].
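As a sketch of how such a historical-language model can be queried, the snippet below asks MacBERTh (the historical-English model from note [4]) to fill a gap in an Early Modern English sentence. The Hugging Face model ID `emanjavacas/MacBERTh` is assumed here; check the authors' release for the canonical identifier.

```python
# Masked-word prediction with a BERT model trained on historical English.
# The model ID is assumed; see the MacBERTh release for the official checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="emanjavacas/MacBERTh")

# Early Modern English spelling ("vnto") that modern-only models handle poorly.
for pred in fill_mask("He hath sent the letter vnto the [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```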
The application of these techniques to the Hebrew Bible has likewise increased rapidly. In paleography and codicology, for instance, transformer-based models combined with unsupervised learning techniques [7, 8, 9] have aided in classifying Dead Sea Scrolls (DSS) fragments and documents from the Cairo Genizah [10, 11, 12, 13]. Within textual analysis, the application of more advanced statistical and NLP models has become more common, as has awareness of the limitations these models face on the Bible due to its limited corpus size [14, 15]. Researchers have begun to use data enrichment and corpus enlargement techniques, such as constructing datasets and training models with closely related Semitic languages (e.g. Hebrew and Syriac) [16], as well as focusing the tuning stages on few-shot or zero-shot approaches [17]. The application of LLMs has also prompted methodological reflection on the limitations of models pre-trained on modern and/or non-Semitic corpora when performing tasks on classical texts like the Hebrew Bible [18, 19].
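The lacuna-reconstruction work cited above [11, 12] can be pictured with a simple masked-prediction sketch: a gap in a damaged manuscript is treated as a masked token, and a BERT-style model ranks candidate restorations. The snippet below uses AlephBERT, a general-purpose Modern Hebrew model (the ID `onlplab/alephbert-base` is assumed), purely as a stand-in for the purpose-trained models in those papers.

```python
# Treat a lacuna in a Hebrew text as a masked token and rank restorations.
# AlephBERT is a stand-in; MsBERT [11] and Embible [12] train dedicated models.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="onlplab/alephbert-base")

# Genesis 1:1 with its final word treated as a lacuna.
damaged = "בראשית ברא אלהים את השמים ואת [MASK]"
for candidate in fill_mask(damaged):
    print(candidate["token_str"], round(candidate["score"], 3))
```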
Given the concerns surrounding reliability, reproducibility, and factual integrity, the outputs generated by such models will invariably require human verification. It remains to be seen whether this will improve with future advancements, and what novel applications will emerge in the humanities.
[1]: Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia. "Attention Is All You Need." Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017.
[2]: Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT, vol. 1, 2019.
[3]: Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya. "Improving Language Understanding by Generative Pre-Training." OpenAI, 2018. [https://openai.com/index/language-unsupervised/](https://openai.com/index/language-unsupervised/). Accessed 09 Sept 2024.
[4]: They released MacBERTh for historical English (1450–1950): Manjavacas, Enrique; Fonteyn, Lauren. "Adapting vs. Pre-training Language Models for Historical Languages." Journal of Data Mining & Digital Humanities, jdmdh:9152, 2022; and GysBERT for historical Dutch (1500–1950): Manjavacas, Enrique; Fonteyn, Lauren. "Non-Parametric Word Sense Disambiguation for Historical Languages." Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (NLP4DH), 123–134. Association for Computational Linguistics, 2022.
[5]: Ananieva, Anna; Balck, Sandra; Möhrke, Jacob. "The Study of Historical Travelogues from a Digital Humanities Perspective: Experiences and New Approaches." Comparative Southeast European Studies 72, no. 3 (2024): 370–385.
[6]: Ibid., 375.
[7]: Celebi, M. Emre; Aydin, Kemal, eds. Unsupervised Learning Algorithms. Vol. 9. Cham: Springer, 2016.
[8]: James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert; Taylor, Jonathan. "Unsupervised Learning." In An Introduction to Statistical Learning: with Applications in Python, 503–556. Cham: Springer International Publishing, 2023.
[9]: Naeem, Samreen; Ali, Aqib; Anam, Sania; Ahmed, Muhammad Munawar. "An Unsupervised Machine Learning Algorithms: Comprehensive Review." International Journal of Computing and Digital Systems, 2024.
[10]: Brown-deVost, Bronson; Kurar-Barakat, Berat; Dershowitz, Nachum. "Segmenting Dead Sea Scroll Fragments for a Scientific Image Set." arXiv preprint arXiv:2406.15692, 2024.
[11]: Shmidman, Avi; Shmidman, Ometz; Gershuni, Hillel; Koppel, Moshe. "MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts." Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), 13–18. Hybrid in Bangkok, Thailand and online: Association for Computational Linguistics, 2024.
[12]: Fono, Niv; Moshayof, Harel; Karol, Eldar; Assraf, Itai; Last, Mark. "Embible: Reconstruction of Ancient Hebrew and Aramaic Texts Using Transformers." In Findings of the Association for Computational Linguistics: EACL 2024, 846–852. St. Julian's, Malta: Association for Computational Linguistics, 2024.
[13]: Guéville, Estelle; Wrisley, David Joseph. "Transcribing Medieval Manuscripts for Machine Learning." Journal of Data Mining & Digital Humanities, 2023.
[14]: Sommerschield, Thea; Assael, Yannis; Pavlopoulos, John; Stefanak, Vanessa; Senior, Andrew; Dyer, Chris; Bodel, John; Prag, Jonathan; Androutsopoulos, Ion; de Freitas, Nando. "Machine Learning for Ancient Languages: A Survey." Computational Linguistics 49, no. 3 (2023): 703–747.
[15]: Dörpinghaus, Jens. "Automated Annotation of Parallel Bible Corpora with Cross-Lingual Semantic Concordance." Natural Language Engineering (2024): 1–24.
[16]: Naaijer, Martijn; Sikkel, Constantijn; Coeckelbergs, Mathias; Attema, Jisk; van Peursen, W.T. "A Transformer-Based Parser for Syriac Morphology." Proceedings of the Ancient Language Processing Workshop associated with RANLP-2023, 23–29, 2023. [https://aclanthology.org/2023.alp-1.3](https://aclanthology.org/2023.alp-1.3).
[17]: Liebeskind, Chaya; Liebeskind, Shmuel; Bouhnik, Dan. "Machine Translation for Historical Research: A Case Study of Aramaic-Ancient Hebrew Translations." ACM Journal on Computing and Cultural Heritage 17, no. 2 (2024): 1–23.
[18]: Staps, Camil. "Large Language Models and Biblical Hebrew: Limitations, Pitfalls, Opportunities." HIPHIL Novum 9, no. 1 (2024): 46–55. [https://doi.org/10.7146/hn.v9i1.144177](https://doi.org/10.7146/hn.v9i1.144177).
[19]: Elrod, A. G. "Nothing New under the Sun? The Study of Biblical Hebrew in the Era of Generative Pre-Trained AI." HIPHIL Novum 8, no. 2 (2023): 1–32. [https://doi.org/10.7146/hn.v8i2.143114](https://doi.org/10.7146/hn.v8i2.143114).