The Open Scriptures Hebrew Bible (OSHB) is a collaborative project that was launched online in December 2009 and has attracted over 300 contributors. According to the project’s site,
The vision of the Open Scriptures Hebrew Bible project is not only to provide resources, but to develop a unified system integrating four facets of biblical study.
- The text of the Hebrew Bible, marked up in OSIS XML.
- Looking up the Hebrew words in a Lexicon superior to Strong’s dictionary.
- Interpreting Morphology data in a readable format.
- Visualizing connections and divisions within a verse, recorded by the Masoretes using Cantillation marks in the text.
Of these four initiatives, the bulk of the work done to date has been on number 3, developing the morphology. Nearly 10 years later, we’ve crowdsourced a fully open set of morphological data for the Westminster Leningrad Codex (WLC)! The latest OSHB release, version 2.1, is available for anyone to use under the terms of a Creative Commons Attribution 4.0 license. The text is made available in OSIS XML and also represented as JSON in an NPM package.
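For readers who want to poke at the data right away, here is a minimal Python sketch of pulling word-level morphology out of OSIS XML. The embedded sample is a hand-written stand-in, not copied from a release: the `lemma` and `morph` attribute names, the example codes, and the namespace URI are my assumptions about the layout, so check them against the actual files before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed OSIS namespace; verify against the header of a real OSHB file.
OSIS_NS = "http://www.bibletechnologies.net/2003/OSIS/namespace"

# Hand-written sample modeled on the assumed layout (not real release data).
SAMPLE = f"""<osis xmlns="{OSIS_NS}">
  <verse osisID="Gen.1.3">
    <w lemma="c/559" morph="HC/Vqw3ms">וַ/יֹּאמֶר</w>
    <w lemma="430" morph="HNcmpa">אֱלֹהִים</w>
  </verse>
</osis>"""


def words_with_morph(xml_text):
    """Return (verse reference, word text, morph code) for every <w> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for verse in root.iter(f"{{{OSIS_NS}}}verse"):
        ref = verse.get("osisID")
        for w in verse.iter(f"{{{OSIS_NS}}}w"):
            rows.append((ref, w.text, w.get("morph")))
    return rows
```

Calling `words_with_morph(SAMPLE)` yields one tuple per word, which is enough to start counting forms or filtering by part of speech from a script.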
Why Another Morphology?
In a single word, licensing. At the time we were first discussing the project, around April of 2009, there were no Hebrew texts openly available with morphology. There was a commercial morphology (the Westminster Hebrew Morphology), which we were aware of, and the ETCBC dataset, which we were not aware of and which was not available online at that time. An openly licensed Hebrew text with morphology was therefore a real need for those of us interested in doing further work with the Hebrew text.
I was one of those people interested in doing analysis on the Hebrew text as a student of biblical languages at Gordon-Conwell Theological Seminary. Since I also had a background in technology I was interested in what sort of corpus linguistic studies I could carry out on the Hebrew text. Although I had Logos at the time, there was no way to perform my own exploratory analyses of the text using it. Hence, I searched in vain to find a digital copy of the text that I could simply load into a Python or shell program to explore (of course the WLC text itself was available, but that included no metadata). The research I wanted to do was fairly simple but nevertheless required direct access to the text.
This search led me to join the newly formed Open Scriptures mailing list. Here I found dozens of other technologists who had run into the same problem I had. We pooled our resources and began creating openly licensed data and programs on top of the biblical texts. One of the acute needs that we saw was for a morphology of the Hebrew Bible.
So, over the past 10 years, several developers pitched in and built a site designed for morphological parsing of the text according to each word’s form. Using the site, hundreds of people volunteered their time and expertise to develop a morphology that will always be open for anyone to use. In my estimation, this project has been a resounding success! It was not without its ups and downs (see Lessons Learned below), but hundreds of people worked together on a highly technical task and reached the finish line. Today, there is a fully open morphology for the Hebrew Bible that anyone can build upon, whether technically or linguistically, and now I can go exploring the text on my command line!
The general methodology for creating this data was to enlist the help of volunteers to contribute parsing data on a collaborative website. New contributors were given the role of “Contributor”, while those who had demonstrated competency in Hebrew or Aramaic and a high degree of commitment to the project were given an “Editor” status. The primary difference between these roles was that Editors had the ability to mark a parsing as verified, whereas Contributors could only contribute possible parsings. The software was careful to track each contribution without overwriting anyone’s submission.
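The role split and the never-overwrite rule can be pictured with a small data model. This is a sketch of the idea only, not the parsing site's actual schema: the class names, fields, and methods below are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    CONTRIBUTOR = "contributor"
    EDITOR = "editor"


@dataclass(frozen=True)
class Submission:
    """One immutable entry in a word's parsing history (hypothetical model)."""
    user: str
    role: Role
    morph: str
    verified: bool = False


@dataclass
class WordRecord:
    word_id: str
    history: list = field(default_factory=list)

    def propose(self, user, role, morph):
        # Anyone may propose; earlier submissions are kept, never replaced.
        self.history.append(Submission(user, role, morph))

    def verify(self, user, role, morph):
        # Only editors may mark a parsing as verified.
        if role is not Role.EDITOR:
            raise PermissionError("only editors may verify a parsing")
        self.history.append(Submission(user, role, morph, verified=True))

    @property
    def verified_morph(self):
        # The most recent verified parsing, or None if nothing is verified yet.
        for submission in reversed(self.history):
            if submission.verified:
                return submission.morph
        return None
```

Because the history is append-only, a disputed parsing can always be traced back to who proposed what, which is the property the paragraph above describes.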
Consistency of data is a concern when working with a large corpus, even more so when one is coordinating work among many volunteers in a domain that does not have one set of rules to govern it (e.g., morphology). Two techniques in particular were used to alleviate this complication. The first was open discussion, followed by documentation of the design decisions and the agreed-upon solutions (see the parsing README for details). The second was software that standardized and verified several of the decisions made in the parsing schemes.
The largest jump in productivity came from sophisticated “copy and paste” routines that were devised in the software (see the code here). Since many forms in the Old Testament are identical (e.g., וַיֹּאמֶר), it was postulated that copying a verified parsing onto the other occurrences of the same form would be a simple solution. This proved true for the vast majority of forms, as long as certain overlapping forms were excluded (for instance, several inflected forms that could be marked imperfect or jussive). This method sped up first-pass parsing and eased the process of marking forms verified, without compromising the integrity of the data.
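The core of that idea fits in a few lines of Python. This is not the project's actual routine (that lives in the linked code); the function name, the dict keys, and the set of excluded forms are hypothetical, and the rule of copying only when exactly one verified parsing exists for a form is an extra safeguard I am assuming for the sketch.

```python
from collections import defaultdict


def propagate_verified(words, ambiguous_forms):
    """Copy a verified parsing onto unparsed words with the identical form.

    words: list of dicts with keys 'form', 'morph', and 'verified'.
    ambiguous_forms: forms that must always be parsed by hand
        (for example, forms that could be imperfect or jussive).
    Returns the number of words that received a copied parsing.
    """
    # Collect every verified parsing seen for each surface form.
    verified_by_form = defaultdict(set)
    for w in words:
        if w["verified"]:
            verified_by_form[w["form"]].add(w["morph"])

    filled = 0
    for w in words:
        if w["morph"] is not None:
            continue  # already parsed; never overwrite existing work
        if w["form"] in ambiguous_forms:
            continue  # excluded overlapping form
        candidates = verified_by_form.get(w["form"], set())
        if len(candidates) == 1:  # copy only when the form has one known parsing
            w["morph"] = next(iter(candidates))
            w["source"] = "copied"  # stays unverified until an editor reviews it
            filled += 1
    return filled
```

In this sketch a copied parsing counts only as a first pass: it is flagged with its source and left unverified, so a human still decides whether to mark it verified.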
In addition, because the ETCBC data became available in the intervening years, we were able to run comparisons against that morphology to validate or invalidate some of the parsings in our system. This was an advantage but it did come with several technical and morphological caveats. Differences in terminology and approach prevented this from being a perfect solution, but it did help our verification process where there was correspondence.
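A comparison of this kind might look roughly like the following sketch. The label names and both mapping tables are illustrative placeholders, not the real OSHB or ETCBC tag sets; the point is only that each scheme is first translated into a shared vocabulary, and anything without a mapping is set aside rather than counted as a disagreement.

```python
from collections import Counter

# Hypothetical mappings from each scheme's labels to a shared vocabulary.
# The jussive entry mirrors a case where one scheme is more granular.
OSHB_TO_SHARED = {
    "perfect": "perfect",
    "imperfect": "imperfect",
    "jussive": "imperfect",
}
ETCBC_TO_SHARED = {
    "perfect": "perfect",
    "imperfect": "imperfect",
}


def compare(oshb_label, etcbc_label):
    """Classify one word as 'agree', 'disagree', or 'no_correspondence'."""
    a = OSHB_TO_SHARED.get(oshb_label)
    b = ETCBC_TO_SHARED.get(etcbc_label)
    if a is None or b is None:
        return "no_correspondence"  # terminology does not line up
    return "agree" if a == b else "disagree"


def summarize(pairs):
    """Tally outcomes over (oshb_label, etcbc_label) pairs for aligned words."""
    return Counter(compare(o, e) for o, e in pairs)
```

Words that land in the "disagree" bucket are the ones worth a second look by an editor, while the "no_correspondence" bucket shows where the two approaches simply cannot be compared mechanically.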
Is it Any Good?
A few people have carried out comparisons with the ETCBC morphology and have found that the OSHB data compares favorably in terms of accuracy and precision. Because the OSHB project was independent of the ETCBC project, there is value for both systems in learning from one another. Here are a couple of examples of differences that could be used to provide more detailed information in the ETCBC data:
- OSHB marks whether a verb is jussive or cohortative where the ETCBC labels it as imperfect.
- Some nouns are parsed differently: אֶרֶץ has unknown gender and מַיִם is listed as plural instead of dual.
There may also be good reasons for these differences. Every comparison I’ve seen so far has been ad hoc, a result of another task someone is trying to carry out. It would be interesting to see a research project that details the differences, publishes the findings, and makes suggestions to both projects about how the data sets could be improved.
Lessons Learned
The OSHB project had its first release of the morphology in December 2013, though the second and subsequent releases all came within the past year and a half. This increased activity is a result of the involvement of unfoldingWord, which began helping in the summer of 2016. Since then, several dedicated parsers and a few software developers have accomplished an immense amount in just over a year’s time, bringing the morphology to its current complete state.
However, there was a period of about 3-4 years when not much happened in the project. The reason for this, I think, is that there was not an active “maintainer” to help newcomers get started. Every project based on an open workflow needs at least one person who can guide interested people all the way toward becoming active contributors. When this doesn’t exist, it’s easy for interested people to walk away because they don’t know where or how to begin. If there is an active maintainer, then anyone bold enough to send an email, post a question on a mailing list, or join a Slack team gets a welcome and a “how can I help you?”
This was the small role that I assumed in 2016: simply someone to spend a little bit of time orienting newcomers, whether software developers or linguists. As a result, we were able to productively utilize the skills of dozens of people to bring the morphology coverage from a few percent to over 85% in just over a year! I’m still amazed at how big a jump that was in such a short amount of time.
Another aspect to consider is recruiting. It turns out that there are quite a few people with expertise in biblical languages that don’t have the opportunity to utilize those skills in their day job. When presented with the opportunity to dig into the Hebrew text and sharpen their skills while they get to productively contribute to an open project, they are eager to help! Yet, finding these people can sometimes be difficult. This is where simply getting the word out through as many avenues as you can is really helpful. Making the need known to other organizations that may also benefit from the work is very effective if they can bring their own networks to bear on the problem.
We also learned that, before you begin throwing people at a problem, it is essential to have clear guidelines that contributors can follow. Even though the project may appear to be clearly defined to its owner(s), it really takes a newcomer asking questions to identify what is still murky. The best advice I have is to work closely with newcomers and learn as much as you can from their floundering or their mistakes. If they don’t know what to do or they don’t know how to do it, then there is an opportunity to make it easier and clearer for the next person. Enlist their help in shaping the process as much as contributing to the project itself.
I see a couple of areas for future work that could be explored. The first was mentioned above: a comprehensive comparison of the OSHB morphology and the ETCBC morphology, with recommendations to both projects on how they can improve. Ideally, this could result in more granular data for both morphologies.
The second avenue of work is creating a “higher level” tagging that can begin to describe the functions of words and phrases as opposed to merely their form. Naturally, this will lend itself to more debate and a wider variety of interpretations. But perhaps for that very reason, it is well suited for a crowd-sourced project. A web-based application that allows tagging an arbitrary selection of text would be a requirement for this approach to work. The data generated from such a site would be very beneficial for downstream projects, but the data creation process would reveal much about how we collectively understand not just the text but also the terminology relating to the text.
A ten-year project to tag over 400,000 words naturally has a long list of people who deserve acknowledgement for their work. I want to thank everyone who contributed in any way to the project; the majority of those names are listed in every XML file in our project. Great work and well done!
I want to give special thanks to David Troidl for his vision in initiating and keeping with the project. Thank you to Weston Ruter for the vision to start the Open Scriptures group. Thank you to the software developers of the parsing site: Darrell Smith (initial builder and architect), Austen Dutton, Ben Dwyer, and Andy Hubert. Several key people provided morphological expertise throughout the years, including Daniel Owens, Joel Ruark, Kenny Hilliard, and Perry Oakes. Also, thank you to tummy.com for graciously hosting our parsing site for seven years.
May the OSHB morphology be utilized and augmented by generations to come!