By Martijn Naaijer
One of the goals of the CACCHT project is to integrate the Dead Sea Scrolls into the ETCBC database by converting them to the ETCBC encoding system. In a previous blogpost we explained how POS tagging of Hebrew texts can be done with an LSTM network. In this blogpost we show how the lexemes of the words in the biblical scrolls can be converted from Abegg's encoding to the ETCBC lexemes.
The text-fabric package containing the Dead Sea Scrolls has a variety of word features, such as number, gender, person, and lexeme. Some of these can be converted to the ETCBC encoding relatively straightforwardly. An example is the feature person: we assume that first person verb forms in the dss package correspond one-to-one with first person verb forms in the ETCBC database.
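For such one-to-one features, the conversion can be a simple lookup table. A minimal sketch (the value names below are illustrative, not the exact feature inventories of the two packages):

```python
# hypothetical value names; consult the feature documentation of both packages
person_map = {'1': 'p1', '2': 'p2', '3': 'p3'}

def convert_person(dss_value):
    # return the ETCBC person value, or '' if the dss value is absent or unknown
    return person_map.get(dss_value, '')
```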
There are features for which this correspondence cannot be applied straightforwardly. An example is the word feature part of speech. In the dss package there is the value "ptcl" (particle), which corresponds to multiple values in the ETCBC database, such as "conj" (conjunction) and "prep" (preposition).
Something similar holds for the feature lexeme. In the ETCBC database, the lexemes are based on the KBL lexicon, but over time a whole range of improvements has been implemented, based on ongoing research. It is therefore to be expected that there is not always a one-to-one correspondence between the lexemes of the dss package and those of the BHSA.
So, how can we assign the ETCBC values of the feature lexeme to the words in the Scrolls? In the case of the biblical scrolls one can simply use the lexemes of the same words in the corresponding verses. For instance, the scroll 4Q2 contains part of the text of Genesis. Its text of verse 1:1 is (partly reconstructed) BR>CJT BR> >LHJM >T HCMJM W>T H>RY, which is identical to the text of Genesis 1:1 in the BHSA. In this case, we can simply give each word the lexeme of the same word in the BHSA. In many cases, however, the text of the scrolls deviates from that of the BHSA to a greater or lesser extent.
This problem is solved by using sequence alignment: two strings are arranged in such a way that similar parts are identified. Sequence alignment is often used in biology, for instance to identify similarities and differences between DNA strings or proteins. Sequence alignment techniques are often based on dynamic programming, such as the Smith-Waterman algorithm. The package biopython implements a whole range of algorithms for the study of biological sequences. We use this package, and in particular its module pairwise2, to align biblical verses in the DSS and the BHSA.
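To illustrate what such an alignment algorithm does, here is a minimal pure-Python global alignment (in the style of Needleman-Wunsch) with the same scoring as pairwise2.align.globalxx, which we use below: a match scores 1, mismatches and gaps score 0. This is only a sketch; biopython's implementation is more complete and efficient.

```python
def align(a, b):
    """Global alignment of two strings with match = 1, mismatch = 0, gap = 0
    (the scoring scheme of pairwise2.align.globalxx)."""
    n, m = len(a), len(b)
    # score[i][j]: best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (1 if a[i - 1] == b[j - 1] else 0)
            score[i][j] = max(diag, score[i - 1][j], score[i][j - 1])
    # traceback: rebuild one optimal pair of gapped strings from the table
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (1 if a[i - 1] == b[j - 1] else 0):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j]:
            out_a.append(a[i - 1])
            out_b.append('-')
            i -= 1
        else:
            out_a.append('-')
            out_b.append(b[j - 1])
            j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

align('QDC', 'QWDC')  # → ('Q-DC', 'QWDC')
```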
In practice, this looks as follows. As an example, we look at Isaiah 48:2 in the BHSA and 1QIsaa.
KJ- M <JR H Q-DC NQR>W W <L >L-HJ JFR>L NSMKW JHWH YB>WT CMW BHSA
KJ> M <JR H QWDC NQR>W W <L >LWHJ JFR>L NSMKW JHWH YB>WT CMW 1QIsaa
It is clear that the text of the BHSA and 1QIsaa is very similar, but there are also some differences. Similar parts in the verses are put together, resulting in two sequences of equal length. In place of the extra matres lectionis in 1QIsaa, one finds a gap ("-") in the BHSA.
Now we look at every character in 1QIsaa and check the lexeme of the corresponding character in the BHSA. If more than half of the characters of a word in the scroll correspond with one lexeme in the BHSA, we assign that lexeme to the word in the scroll. In the case of the first word of Isaiah 48:2, the K and the J each correspond to a character of a BHSA word with the lexeme KJ, while the character > does not correspond with any character in the BHSA. Thus 2 out of 3 characters correspond with the lexeme KJ, which is more than 50%, so the first word KJ> in this verse of the scroll gets the value KJ. This approach works well in practice, but it does not yield a value for every word.
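The majority rule can be sketched as follows (assign_lexeme is a hypothetical helper used for illustration; the actual implementation is in the function produce_value below):

```python
from collections import Counter

def assign_lexeme(aligned_lexemes, word_length, threshold=0.5):
    """aligned_lexemes: the BHSA lexemes of the characters of one scroll word
    that aligned with a BHSA character (characters facing a gap contribute
    nothing). Returns the majority lexeme if it covers more than `threshold`
    of the word's characters, otherwise ''."""
    if word_length == 0 or not aligned_lexemes:
        return ''
    lexeme, count = Counter(aligned_lexemes).most_common(1)[0]
    return lexeme if count / word_length > threshold else ''

# KJ> in 1QIsaa: K and J align with characters of the BHSA lexeme KJ,
# > faces a gap, so 2 of 3 characters vote for KJ
assign_lexeme(['KJ', 'KJ'], word_length=3)  # → 'KJ'
```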
An example is Isaiah 42:23 in 1QIsaa:
MJ- BKM- --J>ZJN Z->T --JQCB W JCM< L >XWR BHSA
MJ> BKMH W J>ZJN ZW>T W JQCB W JCM< L >XWR 1QIsaa
In this case, the word "W" occurs three times in 1QIsaa, but only once in the BHSA. This means that in the first two cases we cannot give these words an ETCBC lexeme directly. We can proceed by checking which lexeme "W" has in these cases in the dss module, namely "W:", and then finding out with which BHSA lexeme "W:" corresponds most often. This is the BHSA lexeme "W", which can then be assigned to the unmatched words "W".
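This fallback can be sketched like this (fallback_lexeme and toy_mapping are illustrative; in the real procedure the mapping is collected from all words that did get a match):

```python
from collections import Counter

def fallback_lexeme(dss_lexeme, mapping):
    """Pick the BHSA lexeme that `dss_lexeme` was mapped to most often
    among the words that did get a match."""
    candidates = mapping.get(dss_lexeme, [])
    if not candidates:
        return ''
    return Counter(candidates).most_common(1)[0][0]

# toy mapping, not real corpus frequencies: the dss lexeme 'W:' was
# matched mostly to the BHSA lexeme 'W'
toy_mapping = {'W:': ['W', 'W', 'W', 'KJ']}
fallback_lexeme('W:', toy_mapping)  # → 'W'
```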
This procedure works well in practice, but it is not infallible. In some cases the alignment is unfortunate, in other cases there is no one-to-one mapping between lexemes, or the dss package has different word boundaries. For instance, place names consisting of two words are treated as distinct words by the dss package, whereas the BHSA generally has a single lexeme for them. To bring the resulting dataset up to the "ETCBC standard", more steps are needed: additional automatic processing and/or manual cleaning.
The same approach is used for the feature part of speech. You can find the resulting dataset "lexemes_pos_all_bib_books.csv" in the GitHub repository: https://github.com/ETCBC/DSS2ETCBC.
import collections
from pprint import pprint
import pandas as pd
import dill
The package biopython is used for aligning sequences.
from Bio import pairwise2
from Bio.Seq import Seq
Now the dss package is loaded. The classes L, T and F that are used here are renamed, so that we can use them alongside those of the BHSA.
from tf.app import use
A = use('dss', hoist=globals())
# give the relevant classes for the DSS new names
Ldss = L
Tdss = T
Fdss = F
from tf.app import use
A = use('bhsa', hoist=globals())
The dictionary book_dict_bhsa_dss provides a mapping between the booknames of the BHSA and dss packages.
book_dict_bhsa_dss = {'Genesis': 'Gen',
'Exodus': 'Ex',
'Leviticus': 'Lev',
'Numbers': 'Num',
'Deuteronomy': 'Deut',
'Joshua': 'Josh',
'Judges': 'Judg',
'1_Samuel': '1Sam',
'2_Samuel': '2Sam',
'1_Kings': '1Kgs',
'2_Kings': '2Kgs',
'Isaiah': 'Is',
'Jeremiah': 'Jer',
'Ezekiel': 'Ezek',
'Hosea': 'Hos',
'Joel': 'Joel',
'Amos': 'Amos',
'Obadiah': 'Obad',
'Jonah': 'Jonah',
'Micah': 'Mic',
'Nahum': 'Nah',
'Habakkuk': 'Hab',
'Zephaniah': 'Zeph',
'Haggai': 'Hag',
'Zechariah': 'Zech',
'Malachi': 'Mal',
'Psalms': 'Ps',
'Job': 'Job',
'Proverbs': 'Prov',
'Ruth': 'Ruth',
'Song_of_songs':'Song',
'Ecclesiastes': 'Eccl',
'Lamentations': 'Lam',
'Daniel': 'Dan',
'Ezra': 'Ezra',
'2_Chronicles': '2Chr'
}
The reverse mapping can be useful as well.
book_dict_dss_bhsa = {v: k for k, v in book_dict_bhsa_dss.items()}
print(book_dict_dss_bhsa)
We define some helper functions. The function align_verses takes two sequences as input, aligns them, and returns the aligned sequences.
def align_verses(bhsa_data, dss_data):
    seq_bhsa = ' '.join(bhsa_data).strip()
    seq_dss = ' '.join(dss_data).strip()
    seq1 = Seq(seq_bhsa)
    seq2 = Seq(seq_dss)
    alignments = pairwise2.align.globalxx(seq1, seq2)
    bhsa_al = alignments[0][0].strip(' ')
    dss_al = alignments[0][1].strip(' ')
    return bhsa_al, dss_al
The function most_frequent takes a list with lexemes as input, and returns the most frequent lexeme, together with its frequency in the list.
from collections import Counter

def most_frequent(lex_list):
    occurrence_count = Counter(lex_list)
    lex, count = occurrence_count.most_common(1)[0]
    return lex, count
In the function produce_value, the value of the POS or lexeme in the ETCBC format is retrieved.
def produce_value(key, chars_to_feat_etcbc, chars_to_feat_dss):
    # check for each consonant what the value of the feature is of the word
    # of the corresponding consonant in the BHSA
    if len(chars_to_feat_etcbc[key]) > 0:
        all_feat_etcbc = chars_to_feat_etcbc[key]
        # check with which lexeme in the BHSA a word corresponds with most of its characters
        feat_etcbc_proposed, count = most_frequent(all_feat_etcbc)
        # an ETCBC lexeme is assigned to a word only if more than half of its
        # consonants correspond with a word in the BHSA
        # note: `word` is the consonantal word assembled in the loop further below (a global)
        if len(word) == 0:
            feat_etcbc = ''
        elif (count / len(word)) > 0.5:
            feat_etcbc = feat_etcbc_proposed
        else:
            feat_etcbc = ''
    else:
        feat_etcbc = ''
    if len(chars_to_feat_dss[key]) > 0:
        all_feat_dss = chars_to_feat_dss[key]
        feat_dss, count = most_frequent(all_feat_dss)
    else:
        feat_dss = ''
    return feat_dss, feat_etcbc
In the next cell the text and lexemes of the Scrolls are extracted from the DSS package, and some minor manipulations of the text are performed.
For every character in the text of the scrolls, we look up the lexeme of the word in which the character occurs. This information is saved in the dictionary lexemes_dss. The keys of this dict are (book, chapter, verse) tuples; each value contains, per scroll, a list with one lexeme for each character in the verse.
Note that the data retrieved in the following cell consists partly of reconstructions of scrolls. Other features in the package, not discussed here, deal with which part of the text can be read on the scrolls and which part is reconstructed.
# In dss_data_dict, the text of each verse in the biblical scrolls is collected
dss_data_dict = collections.defaultdict(lambda: collections.defaultdict(list))
lexemes_dss = collections.defaultdict(lambda: collections.defaultdict(list))
pos_dss = collections.defaultdict(lambda: collections.defaultdict(list))
ids_dss = collections.defaultdict(lambda: collections.defaultdict(list))

for scr in Fdss.otype.s('scroll'):
    scroll_name = Tdss.scrollName(scr)
    words = Ldss.d(scr, 'word')
    for w in words:
        bo = Fdss.book.v(w)
        if bo is None or bo not in book_dict_dss_bhsa:
            continue
        # exclude fragmentary data; these chapters start with 'f'
        if Fdss.chapter.v(w)[0] == 'f':
            continue
        # do a bit of preprocessing
        if Fdss.glyphe.v(w) is not None:
            lexeme = Fdss.glexe.v(w)
            glyphs = ''.join(Fdss.glyphe.v(w).split())  # remove whitespace in word
            # dummy value
            if lexeme is None:
                lexeme = 'XXX'
            # the consonant '#' is used for both 'C' and 'F'. We check in the lexeme
            # to which of the two alternatives it should be converted. This approach
            # is crude, but generally works well. There is only one word with both
            # F and C in the lexeme: >RTX##T> >AR:T.AX:CAF:T.:> in 4Q117
            if '#' in glyphs:
                # hardcode the single word with both 'C' and 'F' in the lexeme
                if glyphs == '>RTX##T>':
                    glyphs = '>RTXCFT>'
                elif 'F' in lexeme:
                    glyphs = glyphs.replace('#', 'F')
                # cases in which 'C' occurs in the lexeme or morphology
                else:
                    glyphs = glyphs.replace('#', 'C')
            # some characters are removed or replaced; in the case of 'k', 'n', 'm',
            # 'y', 'p', it concerns final consonants of words
            glyphs = glyphs.replace(u'\xa0', u' ').replace("'", "").replace("k", "K").replace("n", "N").replace("m", "M").replace("y", "Y").replace("p", "P")
            dss_book = Fdss.book.v(w)
            bhsa_book_name = book_dict_dss_bhsa[dss_book]
            # replace(' ', '') is needed for a strange case in Exodus 13:16 with a space in the word
            dss_data_dict[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(glyphs.replace(' ', ''))
            ids_dss[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(w)
            # retrieve POS and lexeme of every character of every word in the scrolls and save them in dictionaries
            for character in glyphs:
                pos_dss[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(Fdss.sp.v(w))
                lexemes_dss[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(Fdss.glexe.v(w))
The same is done for all the characters in the BHSA. Also, a list with all verses in the BHSA is made. Finally, the consonantal representation of each word is collected per verse in the dictionary bhsa_data_dict.
all_verses = []
lexemes_bhsa = collections.defaultdict(list)
pos_bhsa = collections.defaultdict(list)
bhsa_data_dict = collections.defaultdict(list)

for w in F.otype.s('word'):
    # remove words without consonantal representation in the text (elided he)
    if F.g_cons.v(w) == '':
        continue
    bo, ch, ve = T.sectionFromNode(w)
    # use the feature g_cons for the consonantal representation of words
    bhsa_data_dict[(bo, ch, ve)].append(F.g_cons.v(w))
    # loop over the consonants and get the lexeme and POS for each consonant of a word
    for cons in F.g_cons.v(w):
        lexemes_bhsa[(bo, ch, ve)].append(F.lex.v(w))
        pos_bhsa[(bo, ch, ve)].append(F.sp.v(w))
    if (bo, ch, ve) not in all_verses:
        all_verses.append((bo, ch, ve))
In the following cell, the verses are aligned and characters are compared.
chars_to_lexemes_etcbc = collections.defaultdict(list)
chars_to_lexemes_dss = collections.defaultdict(list)
chars_to_pos_etcbc = collections.defaultdict(list)
chars_to_pos_dss = collections.defaultdict(list)
chars_per_word = collections.defaultdict(list)

all_verses = list(set(all_verses))
count = 0

# loop over verses in the BHSA
for verse in all_verses:
    # check if the verse occurs in the dss package
    if verse in dss_data_dict:
        scrolls = dss_data_dict[verse].keys()
        for scroll in scrolls:
            # get the text of the verse
            bhsa_data = bhsa_data_dict[verse]
            dss_data = dss_data_dict[verse][scroll]
            # all pairs of verses in the BHSA and the scrolls are aligned
            bhsa_al, dss_al = align_verses(bhsa_data, dss_data)
            count += 1
            # print the first alignments
            if count < 20:
                print(verse)
                print('BHSA')
                print(bhsa_al)
                print(scroll)
                print(dss_al)
                print(' ')
            # some indices are initialized; these keep track of how many consonants
            # have been observed in both aligned sequences
            ind_dss = 0
            ind_bhsa = 0
            dss_word_ind = 0
            # loop over all characters in the BHSA verse
            for pos in range(len(bhsa_al)):
                # for each character in the BHSA sequence, it is checked what the
                # character is in the DSS sequence. A number of scenarios are
                # distinguished. Most scenarios are not so exciting, e.g. if the
                # character is a space in both sequences, then move on
                if bhsa_al[pos] == ' ' and dss_al[pos] == ' ':
                    dss_word_ind += 1
                elif bhsa_al[pos] == '-' and dss_al[pos] == ' ':
                    dss_word_ind += 1
                elif bhsa_al[pos] == ' ' and dss_al[pos] == '-':
                    chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                else:
                    if bhsa_al[pos] == '-':
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                        chars_to_lexemes_dss[(scroll, verse, dss_word_ind)].append(lexemes_dss[verse][scroll][ind_dss])
                        chars_to_pos_dss[(scroll, verse, dss_word_ind)].append(pos_dss[verse][scroll][ind_dss])
                        ind_dss += 1
                    elif dss_al[pos] == '-':
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                        ind_bhsa += 1
                    # now the real matching is done: for a matching consonant, the
                    # dicts are checked to see which lexeme it corresponds to
                    else:
                        chars_to_lexemes_etcbc[(scroll, verse, dss_word_ind)].append(lexemes_bhsa[verse][ind_bhsa])
                        chars_to_pos_etcbc[(scroll, verse, dss_word_ind)].append(pos_bhsa[verse][ind_bhsa])
                        chars_to_lexemes_dss[(scroll, verse, dss_word_ind)].append(lexemes_dss[verse][scroll][ind_dss])
                        chars_to_pos_dss[(scroll, verse, dss_word_ind)].append(pos_dss[verse][scroll][ind_dss])
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                        ind_dss += 1
                        ind_bhsa += 1
The lexemes and parts of speech are collected and stored in lists; these are later converted to a pandas dataframe.
The dictionary mapping_dict shows which lexemes in the dss package correspond with their new BHSA alternatives.
tf_word_id = [] # tf index
scrolls = [] # scroll code
books = [] # biblical book name
chapters = [] # chapter number
verses = [] # verse number
words_dss = [] # consonantal representation of word on scroll
lexs_dss = [] # lexeme in dss package
lexs_etcbc = [] # corresponding etcbc lexeme
poss_dss = [] # POS in dss package
poss_etcbc = [] # corresponding etcbc POS
mapping_dict = collections.defaultdict(lambda: collections.defaultdict(list))
# loop over all words in the scrolls that were aligned above
for key in chars_per_word.keys():
    tf_word_id.append(ids_dss[key[1]][key[0]][key[2]])
    word = ''.join(chars_per_word[key]).replace('-', '')
    lexeme_dss, lexeme_etcbc = produce_value(key, chars_to_lexemes_etcbc, chars_to_lexemes_dss)
    # fresh names are used here, so that the dict pos_dss defined earlier is not shadowed
    pos_dss_word, pos_etcbc_word = produce_value(key, chars_to_pos_etcbc, chars_to_pos_dss)
    if lexeme_dss != '':
        mapping_dict[lexeme_dss][(key[0], key[1])].append(lexeme_etcbc)
    # collect info in lists
    scrolls.append(key[0])
    books.append(key[1][0])
    chapters.append(key[1][1])
    verses.append(key[1][2])
    words_dss.append(word)
    lexs_dss.append(lexeme_dss)
    lexs_etcbc.append(lexeme_etcbc)
    poss_dss.append(pos_dss_word)
    poss_etcbc.append(pos_etcbc_word)
In the following two cells, some of the empty values are filled in, using the second part of the procedure described above.
# the list `second` records which cases have been given a lexeme this way
second = []
for index, lex in enumerate(lexs_etcbc):
    if lex == '' and Fdss.lang.v(tf_word_id[index]) != 'a':  # exclude Aramaic, some strange cases
        if lexs_dss[index] != '':
            all_candidates_lists = list(mapping_dict[lexs_dss[index]].values())
            candidates_list = [item for sublist in all_candidates_lists for item in sublist]
            best_cand, count = most_frequent(candidates_list)
            lexs_etcbc[index] = best_cand
            second.append('x')
        else:
            second.append('')
    else:
        second.append('')
mapping_lex_pos = collections.defaultdict(list)
mapping_id_pos = collections.defaultdict(lambda: collections.defaultdict(list))
for index, lex in enumerate(lexs_etcbc):
    if lex == '' or poss_etcbc[index] == '':
        continue
    mapping_lex_pos[lex].append(poss_etcbc[index])
    mapping_id_pos[lex][poss_etcbc[index]].append(tf_word_id[index])
Collect the data in a dataframe.
df_pos_lex = pd.DataFrame(
    list(zip(tf_word_id, scrolls, books, chapters, verses, words_dss, lexs_dss,
             lexs_etcbc, poss_dss, poss_etcbc, second)),
    columns=['tf_word_id', 'scroll', 'book', 'chapter', 'verse', 'g_cons', 'lex_dss',
             'lex_etcbc', 'pos_dss', 'pos_etcbc', 'second_assignment'])
df_new = df_pos_lex.sort_values(['book', 'scroll', 'chapter', 'verse'], ascending=[True, True, True, True])
df_new
Save the data in a csv file.
df_new.to_csv("lexemes_pos_all_bib_books.csv", index=False)
It goes without saying that various things in this procedure can be improved and refined, but it is a good starting point for further investigations.