In this third and final blogpost for the PaTraCoSy project (for the previous blogposts, click here or here), we delve deeper into the possibilities of the Colibri Core (CC) infrastructure for studying the translation patterns between the Hebrew and Syriac versions of the Bible. In our previous blogposts, we discussed how the pattern modeller module of CC can be used to extract information for detecting interesting diverging patterns and for assessing their relevance to the broader corpus. We also indicated there that the main technique left to investigate was the log-likelihood computation of the main words in these constructions. This blogpost describes the steps necessary to obtain the results of these computations, for the entire Bible as well as for each book individually, in both languages.

The main idea behind the log-likelihood computation between two corpus files is to allow researchers to discover which words are used more often in one corpus than in the other, to a statistically significant degree. This means that for any structural pattern of n-grams, skipgrams or flexgrams, we can investigate its relevance between several biblical books, or infer its score for a particular biblical book in light of the entire Bible. Since we already discovered and discussed interesting patterns in the book of Genesis in the last blogpost, we will now elaborate further on the structure of these patterns in light of log-likelihood computations between Hebrew and Syriac, and place these results within the context of the entire text tradition.
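To make the idea concrete, here is a minimal sketch of a log-likelihood score for a single word, assuming the formulation that is common in corpus linguistics (observed versus expected frequencies in two corpora). Whether CC uses exactly this formula is an assumption on our part, and the function and parameter names are purely illustrative.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood of one word, given its frequency in two corpora.

    freq_a, freq_b: how often the word occurs in corpus A and corpus B.
    size_a, size_b: total number of tokens in corpus A and corpus B.
    """
    total = freq_a + freq_b
    if total == 0:
        return 0.0
    # Expected frequencies if the word were spread evenly over both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero frequency contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# A word used equally often in two equally large corpora does not diverge.
same = log_likelihood(10, 10, 1000, 1000)
# A word that occurs in only one of the two corpora diverges strongly.
skewed = log_likelihood(20, 0, 1000, 1000)
```

The higher the score, the less likely it is that the difference in frequency between the two corpora is due to chance.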

Using the biblical books in both Hebrew and Syriac as our source files, we used the Colibri-loglikelihood function of the CC library to compute the log-likelihood files for every biblical book. On our GitHub page, these can be found in the folders log_heb and log/syr respectively. The file log-likelihood-scores-general.txt on the general page provides the log-likelihood computations for the entire Peshitta, contrasted with the entire Hebrew Bible. In principle, one would not expect this file to be directly useful: since the two corpus files represent different languages, it would only list patterns that have a given frequency in one language and zero in the other. However, because Hebrew and Syriac are such closely related languages, this file does in fact provide a comparison of each lexeme that is used in both. Considering only the rows which do not contain a zero value, we learn for example that the most common lexeme used in both is MN (the preposition ‘from’). A direct comparison of both files is nevertheless difficult without knowledge of the languages behind the encoding. For example, the second greatest divergence is found for MTL, which encodes mital (‘from dew’) in Hebrew, but the important preposition metul (‘because’) in Syriac.
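The selection of rows without a zero value can be sketched as follows. The row layout (lexeme, score, frequency in Hebrew, frequency in Syriac) and all numbers are assumptions made for illustration. They do not reproduce the actual layout or content of log-likelihood-scores-general.txt.

```python
from typing import NamedTuple


class Row(NamedTuple):
    lexeme: str
    score: float
    freq_hebrew: int
    freq_syriac: int


def shared_rows(rows):
    """Keep only lexemes attested in both corpora, highest score first."""
    kept = [row for row in rows if row.freq_hebrew > 0 and row.freq_syriac > 0]
    return sorted(kept, key=lambda row: row.score, reverse=True)


# Invented values for illustration only; not taken from the project files.
rows = [
    Row("MN", 950.0, 7000, 12000),
    Row("MTL", 610.0, 3, 900),
    Row("JHWH", 4000.0, 6800, 0),   # Hebrew only: dropped
    Row("MRJ>", 3900.0, 0, 6500),   # Syriac only: dropped
]
shared = shared_rows(rows)
```

Such a filter only finds identical strings in both files. As the MTL example shows, it cannot tell whether the two languages actually mean the same word.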

More interesting is the opposite end of the spectrum, where the divergences are the smallest. Here we see that the lowest score, 0.00430385, is shared by no fewer than twenty-three lexemes. The word TQBRNJ, for example, is found in Genesis 47:29. It presents exactly the same verb and conjugation in both languages, meaning ‘you will bury me’. QNZ encodes the proper name Kenaz, which for understandable reasons is rendered identically. A semantically deeper example can be found in the lexeme MGN, meaning ‘shield’ in both Hebrew and Syriac. Even though the Hebrew text and the Syriac translation differ in the verses in which they use the word, it is still considered to be among the least diverging patterns in the entire Bible.
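Finding all lexemes that share the lowest score is a simple grouping step. A plausible reason for such ties is that a score of this kind depends only on the two frequencies and the two corpus sizes, so lexemes with the same pair of frequencies receive the same score. That explanation is an assumption about the formula, not something we have verified in CC. In the sketch below, only the score 0.00430385 and the three lexemes sharing it come from our results; the fourth entry is a placeholder.

```python
from collections import defaultdict


def lexemes_at_minimum(scores):
    """Return the lowest score and every lexeme that shares it."""
    by_score = defaultdict(list)
    for lexeme, score in scores.items():
        by_score[score].append(lexeme)
    lowest = min(by_score)
    return lowest, sorted(by_score[lowest])


scores = {
    "TQBRNJ": 0.00430385,
    "QNZ": 0.00430385,
    "MGN": 0.00430385,
    "PLACEHOLDER": 12.5,  # hypothetical lexeme and score, for contrast only
}
lowest, tied = lexemes_at_minimum(scores)
```

Grouping on exact values works here because the scores are read from the same textual representation. Scores computed on the fly may need rounding first.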

Turning our attention now to the log-likelihood scores of individual biblical books, compared to the entire corpus in the respective language, we notice that the lexemes that diverge the most immediately identify special characteristics of the book in question. Taking the book of Genesis as an example, we see that several personal names are very contrastive in this book in comparison to the entire Bible. This is entirely logical, since Abraham and Joseph, for example, are mainly discussed in this book. It is also remarkable that we find two lexemes representing God, JHWH and >LHJM. This reflects the particular redaction history of the book of Genesis. In any other book, the scores for these words are markedly lower, indicating that both of these words receive a particular treatment in the book of Genesis. Comparing this to the Peshitta, we see that MRJ> ‘Lord’, the most common Syriac term to refer to God, has a much lower log-likelihood score than its Hebrew counterpart, whereas the scores for Abraham and Joseph, two important figures in the book of Genesis, are very similar.
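The comparison of one book with the whole corpus can be sketched as below: every lexeme that is relatively more frequent in the book than in the corpus is ranked by its log-likelihood score. The formula is the common corpus-linguistic one, which we assume rather than know to be what CC computes, and the token counts are invented toy data in our transliteration.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood of one word from observed and expected frequencies."""
    total = freq_a + freq_b
    score = 0.0
    for observed, size in ((freq_a, size_a), (freq_b, size_b)):
        expected = size * total / (size_a + size_b)
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score


def key_lexemes(book_tokens, corpus_tokens):
    """Lexemes overrepresented in the book, most contrastive first."""
    book, corpus = Counter(book_tokens), Counter(corpus_tokens)
    n_book, n_corpus = len(book_tokens), len(corpus_tokens)
    scored = [
        (lex, log_likelihood(freq, corpus[lex], n_book, n_corpus))
        for lex, freq in book.items()
        if freq / n_book > corpus[lex] / n_corpus  # overrepresented only
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy counts: names cluster in the book, MN is frequent everywhere.
genesis = [">BRHM"] * 4 + ["JWSP"] * 3 + ["MN"] * 3
whole_bible = genesis + ["MN"] * 27 + [">BRHM"] + ["DWD"] * 2
keys = key_lexemes(genesis, whole_bible)
```

In this toy example the personal names surface as key lexemes, while the ubiquitous preposition MN does not.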

Regarding lexical richness, we can conclude that the Syriac version of Genesis is markedly richer than the Hebrew version, but that it scores about average compared to other biblical books. Looking at the general log-likelihood scores, we detect that the scores are … and … respectively. Following the research of Frederick Greenspahn,[1] we notice how strongly hapax legomena scores influence the characterisation of biblical books. Using our results, we can conclude that the same can be said for general n-gram, skipgram and flexgram scores.
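For readers who want a feel for what lexical richness and hapax legomena look like in practice, here is a small sketch. It uses the type-token ratio and a list of words occurring exactly once. These are common, simple measures that we choose here for illustration; they are not necessarily the measures behind the conclusions above.

```python
from collections import Counter


def richness(tokens):
    """Simple richness figures: type-token ratio and hapax legomena."""
    if not tokens:
        raise ValueError("need at least one token")
    counts = Counter(tokens)
    hapaxes = sorted(lex for lex, n in counts.items() if n == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": len(counts) / len(tokens),
        "hapaxes": hapaxes,
    }


# A toy token list in our transliteration, not an actual verse.
sample = ["BJWM", "HCBJ<J", "BJWM", "MN", "MGN"]
stats = richness(sample)
```

Note that the type-token ratio is sensitive to text length, so books of very different sizes should not be compared on this figure alone.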

The most important aspect of using the log-likelihood scores is probably the comparison of longer patterns in both languages. This extra computation allows us to further investigate the patterns we discussed in the previous blogpost. We will apply this methodology only to the first example, namely Genesis 2:2, where the Hebrew BJWM HCBJ<J ‘on the seventh day’ is translated into the Syriac BJWM> CTJTJ> ‘on the sixth day’. Here the Hebrew HCBJ<J (‘seventh’) has a log-likelihood score of 8.49798, while the Syriac CTJTJ> (‘sixth’) scores 6.23123. The Hebrew word for ‘seventh’ thus has a slightly higher log-likelihood score than the Syriac word for ‘sixth’. This reveals that the verse is not only unique within the book of Genesis, as we already established in our previous blogpost on the basis of n-gram computations, but that the pattern is never repeated within the entire Bible, since these scores diverge the most within the book of Genesis.

Concerning the word BJWM (‘on (the) day’ in both languages), we noted previously that the Syriac translation applies the word more freely than its Hebrew source. Considering the three variants WBJWM, BJWM and BJWMW, we find the respective scores for Hebrew to be 18.1469, 7.42672 and 2.59241. For Syriac, we find 13.4211, 3.38081 and 0.163634. This similar distribution of values indicates that morphological and syntactic differences do not contribute to a divergence in log-likelihood, but that any divergence is to be found on the semantic level only. Combining the log-likelihood and n-gram data, we can conclude that although the Syriac uses the word BJWM significantly more often than Hebrew (787 constructions compared to 524 for the book of Genesis), it does so in a structurally similar manner.
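One simple way to check whether the three variants are distributed similarly is to compare their ranking by score in both languages. The sketch below uses the scores reported above. Comparing rank order is our own choice of check, and with only three variants it is a weak one.

```python
def ranking(scores):
    """Variants ordered from the highest to the lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Log-likelihood scores for the three variants, as reported above.
hebrew = {"WBJWM": 18.1469, "BJWM": 7.42672, "BJWMW": 2.59241}
syriac = {"WBJWM": 13.4211, "BJWM": 3.38081, "BJWMW": 0.163634}

same_order = ranking(hebrew) == ranking(syriac)
```

Both languages rank the variants in the same order, with WBJWM first and BJWMW last, which is in line with the structural similarity described above.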


[1] Frederick Greenspahn, Hapax Legomena in Biblical Hebrew: A Study of the Phenomenon and Its Treatment Since Antiquity with Special Reference to Verbal Forms (SBL Dissertation Series 74)