Conversion of the Lexemes and Parts of Speech of the Biblical DSS to ETCBC Encoding

By Martijn Naaijer

One of the goals of the CACCHT project is to integrate the Dead Sea Scrolls into the ETCBC database by converting them to the ETCBC encoding system. In a previous blog post we explained how POS tagging of Hebrew texts can be done with an LSTM network. In this blog post we show how the lexemes of the words in the biblical scrolls can be converted from Abegg's encoding to the ETCBC lexemes.

General approach

The text-fabric package containing the Dead Sea Scrolls has a variety of word features, such as number, gender, person, and lexeme. Some of these can be converted to the ETCBC encoding relatively straightforwardly. Examples are number and person: we assume, for instance, that verb forms in the first person in the dss package correspond one to one with the first person in the ETCBC database.

There are features for which this correspondence cannot be applied straightforwardly. An example is the word feature part of speech. The dss package has the value "ptcl" (particle), which corresponds to multiple values in the ETCBC database, such as "conj" (conjunction) and "prep" (preposition).
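
A minimal sketch of this one-to-many relation (the ETCBC values are real part-of-speech values; which of them "ptcl" covers exactly is assumed here for illustration):

# hypothetical sketch: one dss POS value corresponds to several ETCBC values
pos_dss_to_etcbc = {
    'ptcl': ['conj', 'prep', 'art', 'advb'],  # assumed subset, for illustration
    'verb': ['verb'],                         # other values map one to one
}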

Something similar is going on with the feature lexeme. In the ETCBC database, the lexemes are based on the KBL lexicon, but over the course of time a whole range of improvements has been implemented, based on ongoing research. It is therefore to be expected that there is not always a one to one correspondence between the lexemes of the dss package and those of the BHSA.

So, how can we assign the ETCBC values of the feature lexeme to the words in the Scrolls? In the case of the biblical scrolls one can simply reuse the lexemes of the same words in corresponding verses. For instance, the scroll 4Q2 contains part of the text of Genesis. Its text of verse 1:1 is (partly reconstructed) BR>CJT BR> >LHJM >T HCMJM W>T H>RY, which is identical to the text of Genesis 1:1 in the BHSA. In this case, we can simply give each word the lexeme of the same word in the BHSA. In many cases, however, the text of the scrolls deviates more or less from that of the BHSA.

This problem is solved with sequence alignment: two strings are arranged in such a way that similar parts are identified. Sequence alignment is often used in biology to identify similarities and differences between DNA strings or proteins. Sequence alignment techniques are often based on dynamic programming, such as the Smith-Waterman algorithm. The package biopython implements a whole range of algorithms for the study of biological sequences. We use this package, and the module pairwise2 contained in it, to align biblical verses in the DSS and the BHSA.

In practice, this looks as follows. As an example, we look at Isaiah 48:2 in the BHSA and in 1QIsaa.

KJ- M <JR H Q-DC NQR>W W <L >L-HJ JFR>L NSMKW JHWH YB>WT CMW   BHSA

KJ> M <JR H QWDC NQR>W W <L >LWHJ JFR>L NSMKW JHWH YB>WT CMW   1QIsaa

It is clear that the texts of the BHSA and 1QIsaa are very similar, but there are also some differences. Similar parts of the verses are put together, resulting in two sequences of equal length. In the place of the extra matres lectionis in 1QIsaa, one finds a gap ("-") in the BHSA.
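
This alignment can be reproduced with a few lines of Biopython (a minimal sketch using the pairwise2 module that is also used below; the exact placement of the gaps may vary between optimal alignments):

from Bio import pairwise2

bhsa = 'KJ M <JR H QDC NQR>W W <L >LHJ JFR>L NSMKW JHWH YB>WT CMW'
qisa = 'KJ> M <JR H QWDC NQR>W W <L >LWHJ JFR>L NSMKW JHWH YB>WT CMW'

# globalxx: global alignment that scores a match as 1 and uses no gap penalties
alignments = pairwise2.align.globalxx(bhsa, qisa)
print(pairwise2.format_alignment(*alignments[0]))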

Now, we look at every character in 1QIsaa and check the lexeme of the corresponding character in the BHSA. If more than half of the characters of a word in the scroll correspond with one lexeme in the BHSA, we give the value of this lexeme to the word in the scroll. In the case of the first word of Isaiah 48:2, the K and the J correspond to characters of the BHSA word with the lexeme KJ, and the character > does not correspond with a word in the BHSA. The result is that 2 out of 3 characters correspond with the lexeme KJ, which is more than 50%, so the first word KJ> in this verse of the scroll gets the value KJ. This approach works well in practice, but it does not result in a value for every word.
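
The majority-vote rule itself can be sketched in a few lines (a hypothetical helper, not the notebook's own code; the actual implementation follows below in the function produce_value):

from collections import Counter

def assign_lexeme(char_lexemes, word_length):
    # char_lexemes: the BHSA lexeme found for each aligned character of a scroll word
    if not char_lexemes:
        return ''
    lex, count = Counter(char_lexemes).most_common(1)[0]
    return lex if count / word_length > 0.5 else ''

# KJ> in 1QIsaa: K and J align with the BHSA lexeme KJ, the final > matches nothing
assign_lexeme(['KJ', 'KJ'], 3)  # 'KJ', since 2/3 > 0.5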

An example of a verse with such unmatched words is Isaiah 42:23 in 1QIsaa:

MJ- BKM- --J>ZJN Z->T --JQCB W JCM< L >XWR    BHSA

MJ> BKMH W J>ZJN ZW>T W JQCB W JCM< L >XWR    1QIsaa

In this case, the word "W" occurs three times in 1QIsaa, and only once in the BHSA. This means that in the first two cases, we cannot give these words an ETCBC lexeme through the alignment. We can proceed by checking which lexeme "W" has in these cases in the dss module, which is "W:", and then finding out with which BHSA lexeme "W:" corresponds most often. This is the BHSA lexeme "W", which can then be assigned to the unmatched words "W".
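
This fallback step can be sketched as follows (hypothetical names and toy data; the real version uses the mapping_dict that is built further below):

from collections import Counter, defaultdict

# collected during alignment: for each dss lexeme, the ETCBC lexemes of the matched words
dss_to_etcbc_candidates = defaultdict(list)
dss_to_etcbc_candidates['W:'] = ['W', 'W', 'W']  # toy data

def fallback_lexeme(dss_lex):
    candidates = dss_to_etcbc_candidates[dss_lex]
    return Counter(candidates).most_common(1)[0][0] if candidates else ''

fallback_lexeme('W:')  # 'W'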

This procedure works well in practice, but it is not infallible. In some cases the alignment is unfortunate; in other cases there is no one to one mapping between lexemes, or the dss package has different word boundaries. For instance, in the case of place names consisting of two words, the dss package treats them as distinct words, whereas the BHSA generally has one lexeme. To bring the resulting dataset up to the "ETCBC standard", more steps are needed. These can consist of additional automatic processing and/or manual cleaning.

The same approach is used for the feature part of speech. You can find the resulting dataset "lexemes_pos_all_bib_books.csv" in the GitHub repository: https://github.com/ETCBC/DSS2ETCBC.

In [1]:
import collections
from pprint import pprint

import pandas as pd
import dill

The package biopython is used for aligning sequences.

In [2]:
from Bio import pairwise2
from Bio.Seq import Seq

Now the dss package is loaded. The classes L, T and F that are used here are renamed, so that we can use them alongside those of the BHSA.

In [3]:
from tf.app import use
A = use('dss', hoist=globals())

# give the relevant classes for the DSS new names
Ldss = L
Tdss = T
Fdss = F
	connecting to online GitHub repo annotation/app-dss ... connected
Using TF-app in C:\Users\geitb/text-fabric-data/annotation/app-dss/code:
	rv0.6=#304d66fd7eab50bbe4de8505c24d8b3eca30b1f1 (latest release)
	connecting to online GitHub repo etcbc/dss ... connected
Using data in C:\Users\geitb/text-fabric-data/etcbc/dss/tf/0.6:
	rv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)
	connecting to online GitHub repo etcbc/dss ... connected
Using data in C:\Users\geitb/text-fabric-data/etcbc/dss/parallels/tf/0.6:
	rv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
In [4]:
from tf.app import use
A = use('bhsa', hoist=globals())
	connecting to online GitHub repo annotation/app-bhsa ... connected
Using TF-app in C:\Users\geitb/text-fabric-data/annotation/app-bhsa/code:
	rv1.2=#5fdf1778d51d938bfe80b37b415e36618e50190c (latest release)
	connecting to online GitHub repo etcbc/bhsa ... connected
Using data in C:\Users\geitb/text-fabric-data/etcbc/bhsa/tf/c:
	rv1.6=#bac4a9f5a2bbdede96ba6caea45e762fe88f88c5 (latest release)
	connecting to online GitHub repo etcbc/phono ... connected
Using data in C:\Users\geitb/text-fabric-data/etcbc/phono/tf/c:
	r1.2 (latest release)
	connecting to online GitHub repo etcbc/parallels ... connected
Using data in C:\Users\geitb/text-fabric-data/etcbc/parallels/tf/c:
	r1.2 (latest release)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used

The dictionary book_dict_bhsa_dss provides a mapping between the book names of the BHSA and dss packages.

In [5]:
book_dict_bhsa_dss = {'Genesis':      'Gen',
             'Exodus':       'Ex',
             'Leviticus':    'Lev',
             'Numbers':      'Num',
             'Deuteronomy':  'Deut',
             'Joshua':       'Josh',
             'Judges':       'Judg',
             '1_Samuel':     '1Sam',
             '2_Samuel':     '2Sam',
             '1_Kings':      '1Kgs',
             '2_Kings':      '2Kgs',
             'Isaiah':       'Is',
             'Jeremiah':     'Jer',
             'Ezekiel':      'Ezek',
             'Hosea':        'Hos',
             'Joel':         'Joel',
             'Amos':         'Amos',
             'Obadiah':      'Obad',
             'Jonah':        'Jonah',
             'Micah':        'Mic',
             'Nahum':        'Nah',
             'Habakkuk':     'Hab',
             'Zephaniah':    'Zeph',
             'Haggai':       'Hag',
             'Zechariah':    'Zech',
             'Malachi':      'Mal',
             'Psalms':       'Ps',
             'Job':          'Job',
             'Proverbs':     'Prov',
             'Ruth':         'Ruth',
             'Song_of_songs':'Song',
             'Ecclesiastes': 'Eccl',
             'Lamentations': 'Lam',
             'Daniel':       'Dan',
             'Ezra':         'Ezra',
             '2_Chronicles': '2Chr'
            }

The reverse mapping can be useful as well.

In [6]:
book_dict_dss_bhsa = {v: k for k, v in book_dict_bhsa_dss.items()}
print(book_dict_dss_bhsa)
{'Gen': 'Genesis', 'Ex': 'Exodus', 'Lev': 'Leviticus', 'Num': 'Numbers', 'Deut': 'Deuteronomy', 'Josh': 'Joshua', 'Judg': 'Judges', '1Sam': '1_Samuel', '2Sam': '2_Samuel', '1Kgs': '1_Kings', '2Kgs': '2_Kings', 'Is': 'Isaiah', 'Jer': 'Jeremiah', 'Ezek': 'Ezekiel', 'Hos': 'Hosea', 'Joel': 'Joel', 'Amos': 'Amos', 'Obad': 'Obadiah', 'Jonah': 'Jonah', 'Mic': 'Micah', 'Nah': 'Nahum', 'Hab': 'Habakkuk', 'Zeph': 'Zephaniah', 'Hag': 'Haggai', 'Zech': 'Zechariah', 'Mal': 'Malachi', 'Ps': 'Psalms', 'Job': 'Job', 'Prov': 'Proverbs', 'Ruth': 'Ruth', 'Song': 'Song_of_songs', 'Eccl': 'Ecclesiastes', 'Lam': 'Lamentations', 'Dan': 'Daniel', 'Ezra': 'Ezra', '2Chr': '2_Chronicles'}

We define some helper functions. align_verses takes two sequences as input and aligns them. It returns the aligned sequences.

In [7]:
def align_verses(bhsa_data, dss_data):
    
    # join the words of a verse into a single string of consonants and spaces
    seq_bhsa = ' '.join(bhsa_data).strip()
    seq_dss = ' '.join(dss_data).strip()
        
    seq1 = Seq(seq_bhsa) 
    seq2 = Seq(seq_dss)
    
    # global alignment; globalxx scores a match as 1 and uses no gap penalties
    alignments = pairwise2.align.globalxx(seq1, seq2)
    
    # keep the first optimal alignment and remove leading/trailing spaces
    bhsa_al = (alignments[0][0]).strip(' ')
    dss_al = (alignments[0][1]).strip(' ')
        
    return bhsa_al, dss_al
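
For example (not a cell from the original notebook; the gap placement shown is the expected result of globalxx on this toy input):

bhsa_al, dss_al = align_verses(['KJ', 'M', '<JR'], ['KJ>', 'M', '<JR'])
print(bhsa_al)  # expected: KJ- M <JR
print(dss_al)   #           KJ> M <JR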

The function most_frequent takes a list with lexemes as input, and returns the most frequent lexeme, together with its frequency in the list.

In [8]:
from collections import Counter 
  
def most_frequent(lex_list): 
    # return the most frequent element of lex_list, together with its frequency
    lex, count = Counter(lex_list).most_common(1)[0]
    return lex, count
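
For example:

most_frequent(['W', 'W', 'KJ'])  # ('W', 2)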

In the function produce_value, the ETCBC value of the POS or lexeme is retrieved for a given scroll word. The consonantal representation of the word is passed in as an argument, because its length is needed for the 50% threshold.

In [9]:
def produce_value(key, chars_to_feat_etcbc, chars_to_feat_dss, word):
    
    # check for each consonant of the scroll word which feature value the 
    # corresponding consonant in the BHSA has
    if len(chars_to_feat_etcbc[key]) > 0:
        all_feat_etcbc = chars_to_feat_etcbc[key]
        
        # find the BHSA value with which most of the characters of the word correspond
        feat_etcbc_proposed, count = most_frequent(all_feat_etcbc)
        
        # an etcbc value is assigned to a word only if more than half of its consonants
        # correspond with one word in the BHSA    
        if len(word) == 0:
            feat_etcbc = ''

        elif (count / len(word)) > 0.5:
            feat_etcbc = feat_etcbc_proposed

        else:
            feat_etcbc = ''
            
    else:
        feat_etcbc = ''
        
    # the dss value of the word is the most frequent dss value of its characters
    if len(chars_to_feat_dss[key]) > 0:
        
        all_feat_dss = chars_to_feat_dss[key]
        feat_dss, count = most_frequent(all_feat_dss)
        
    else:
        feat_dss = ''
        
    return feat_dss, feat_etcbc
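
A toy call with hypothetical data for the first word of Isaiah 48:2 in 1QIsaa (the real dictionaries are built in the cells below):

key = ('1Qisaa', ('Isaiah', 48, 2), 0)
chars_lex_etcbc = {key: ['KJ', 'KJ']}            # K and J match the BHSA lexeme KJ
chars_lex_dss = {key: ['K.IJ', 'K.IJ', 'K.IJ']}  # dss lexeme of all three characters

produce_value(key, chars_lex_etcbc, chars_lex_dss, 'KJ>')  # ('K.IJ', 'KJ')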

Prepare scrolls

In the next cell the text and lexemes of the Scrolls are extracted from the DSS package, and some minor manipulations are done on the text.

For every character in the text of the scrolls, we look up the lexeme of the word in which the character occurs. This information is saved in the dictionary lexemes_dss. The keys of this dict are verse references (book, chapter, verse); each value is a dict, keyed by scroll, with one lexeme for each character of the verse.

Note that the data retrieved in the following cell consists partly of reconstructions of scrolls. Other features in the package, not discussed here, deal with which part of the text can be read on the scrolls and which part is reconstructed.

In [11]:
# In dss_data_dict, the text of each verse in the biblical scrolls is collected
dss_data_dict = collections.defaultdict(lambda: collections.defaultdict(list))
lexemes_dss = collections.defaultdict(lambda: collections.defaultdict(list))
pos_dss = collections.defaultdict(lambda: collections.defaultdict(list))
ids_dss = collections.defaultdict(lambda: collections.defaultdict(list))

for scr in Fdss.otype.s('scroll'):
    scroll_name = Tdss.scrollName(scr)
    
    words = Ldss.d(scr, 'word')
        
    for w in words:
            
        bo = Fdss.book.v(w) 
        
        if bo is None or bo not in book_dict_dss_bhsa:
            continue
        
        # exclude fragmentary data; these chapters start with 'f'
        if (Fdss.chapter.v(w))[0] == 'f':
            continue
        
        # Do a bit of preprocessing
        if Fdss.glyphe.v(w) is not None:
            
            lexeme = Fdss.glexe.v(w)
            glyphs = "".join(Fdss.glyphe.v(w).split()) # remove whitespace in word
            
            # dummy value for words with a missing lexeme
            if lexeme is None:
                lexeme = 'XXX'
                
            # the consonant '#' is used for both 'C' and 'F'. We check in the lexeme
            # to which of the two alternatives it should be converted. This approach is crude, 
            # but works generally well. There is only one word with both F and C in the lexeme: 
            # >RTX##T> >AR:T.AX:CAF:T.:> in 4Q117
            if '#' in glyphs:  
                # hardcode the single word with both 'C' and 'F' in the lexeme.
                if glyphs == '>RTX##T>':
                    glyphs = '>RTXCFT>'
                    
                elif 'F' in lexeme:
                    glyphs = glyphs.replace('#', 'F')
                    
                # cases in which 'C' occurs in the lexeme or morphology
                else:                        
                    glyphs = glyphs.replace('#', 'C')                        
            
            # Some characters are removed or replaced, in the case of 'k', 'n', 'm', 'y', 'p', it concerns final consonants of words.
            glyphs = glyphs.replace(u'\xa0', u' ').replace("'", "").replace("k", "K").replace("n", "N").replace("m", "M").replace("y", "Y").replace("p", "P")   
                
            dss_book = Fdss.book.v(w)
            bhsa_book_name = book_dict_dss_bhsa[dss_book]
            
            # replace(' ', '') is needed for a strange case in Exodus 13:16 with a space in the word
            dss_data_dict[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(glyphs.replace(' ', ''))
            
            ids_dss[bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w))][scroll_name].append(w)
            
            # retrieve POS and lexeme of every character of every word in the scrolls and save in a dictionary
            for character in glyphs:   
                pos_dss[bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w))][scroll_name].append(Fdss.sp.v(w))
                lexemes_dss[bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w))][scroll_name].append(Fdss.glexe.v(w))

Prepare BHSA

The same is done for all the characters in the BHSA. Also, a list with all verses in the BHSA is made. Finally, for each verse, the consonantal representation of each word is collected in the dictionary bhsa_data_dict.

In [12]:
all_verses = []
lexemes_bhsa = collections.defaultdict(list)
pos_bhsa = collections.defaultdict(list)

bhsa_data_dict = collections.defaultdict(list)

for w in F.otype.s('word'):
    
    # Remove words without consonantal representation in the text (elided he). 
    if F.g_cons.v(w) == '':
        continue
    
    bo, ch, ve = T.sectionFromNode(w)
        
    # use feature g_cons for consonantal representation of words
    bhsa_data_dict[(bo, ch, ve)].append(F.g_cons.v(w))
        
    # loop over consonants and get lexeme of each consonant in a word
    for cons in F.g_cons.v(w):
        lexemes_bhsa[(bo, ch, ve)].append(F.lex.v(w)) 
        pos_bhsa[(bo, ch, ve)].append(F.sp.v(w))
        
    if (bo, ch, ve) not in all_verses:
        all_verses.append((bo, ch, ve))
    

Align verses

In the following cell, the verses are aligned and characters are compared.

In [15]:
chars_to_lexemes_etcbc = collections.defaultdict(list)
chars_to_lexemes_dss = collections.defaultdict(list)

chars_to_pos_etcbc = collections.defaultdict(list)
chars_to_pos_dss = collections.defaultdict(list)
chars_per_word =  collections.defaultdict(list)

all_verses = list(set(all_verses))

count = 0

# loop over verses in BHSA
for verse in all_verses:
    
    # check if verse occurs in the dss package
    if verse in dss_data_dict:
        
        scrolls = (dss_data_dict[verse]).keys()
        
        for scroll in scrolls:
            
            # get the text of the verse
            bhsa_data = bhsa_data_dict[verse]
            dss_data = dss_data_dict[verse][scroll]
            
            # all pairs of verses in the BHSA and scrolls are aligned
            bhsa_al, dss_al = align_verses(bhsa_data, dss_data)
            
            count += 1
            
            # the first 19 alignments are printed as an illustration
            if count < 20:
                print(verse)
                print('BHSA')
                print(bhsa_al)
                print(scroll)
                print(dss_al)
                print(' ')
        
            # some indexes are initialized; these keep track of how many consonants have been observed 
            # in both aligned sequences
            ind_dss = 0
            ind_bhsa = 0
        
            dss_word_ind = 0
        
            # loop over all characters in the BHSA verse
            for pos in range(len(bhsa_al)):
            
                # for each character in the BHSA sequence, it is checked what the corresponding character in the DSS sequence is.
                # A number of scenarios is distinguished. Most scenarios are not so exciting, e.g.
                # if the character is a space in both sequences, simply move on
                if bhsa_al[pos] == ' ' and dss_al[pos] == ' ':
                    
                    dss_word_ind += 1
            
                elif bhsa_al[pos] == '-' and dss_al[pos] == ' ':
                    
                    dss_word_ind += 1
                
                elif bhsa_al[pos] == ' ' and dss_al[pos] == '-':
                    
                    chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                               
                else:
                    if bhsa_al[pos] == '-':
                        
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                        chars_to_lexemes_dss[(scroll, verse, dss_word_ind)].append(lexemes_dss[verse][scroll][ind_dss])
                        chars_to_pos_dss[(scroll, verse, dss_word_ind)].append(pos_dss[verse][scroll][ind_dss])
                        
                        ind_dss += 1
                    
                    elif dss_al[pos] == '-':
                        
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                        
                        ind_bhsa += 1
                    
                    # Now the real matching is done
                    # For a matching consonant, it is checked in the dicts to which lexeme it corresponds
                    else:                        
                        
                        chars_to_lexemes_etcbc[(scroll, verse, dss_word_ind)].append(lexemes_bhsa[verse][ind_bhsa])
                        chars_to_pos_etcbc[(scroll, verse, dss_word_ind)].append(pos_bhsa[verse][ind_bhsa])
                        
                        chars_to_lexemes_dss[(scroll, verse, dss_word_ind)].append(lexemes_dss[verse][scroll][ind_dss])
                        chars_to_pos_dss[(scroll, verse, dss_word_ind)].append(pos_dss[verse][scroll][ind_dss])
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                    
                        ind_dss += 1
                        ind_bhsa += 1               
('Numbers', 11, 22)
BHSA
H Y>N W BQR JCXV LHM W MY> LHM >M >T KL DGJ H JM J>SP LHM W MY> LHM
4Q23
H Y>N W BQR JCXV------------------------------------- LHM W MY>----
 
('Numbers', 30, 6)
BHSA
W >M HN--J--> >BJH >-TH B JWM C-M<W K-L NDRJH W >SRJH >CR >SRH <L NPCH L-> JQWM- W JHWH JSLX LH KJ- HNJ> >BJH >-TH
4Q27
W >M HN> JNJ> >BJH >WTH B JWM CWM<W KWL NDRJH W >SRJH >CR >SRH <L NPCH LW> JQWMW W JHWH JSLX LH KJ> HNJ> >BJH >WTH
 
('Leviticus', 23, 41)
BHSA
W XGTM >TW XG L JHWH CB<T JMJM B CNH XQT <WLM L DRTJKM B XDC H CBJ<J TXGW >TW
Arugleviticus
W XGTM >TW XG L JHWH CB<T JMJM B CNH XQT <WLM L DRTJKM B XDC H CBJ<J TXGW >TW
 
('Psalms', 107, 42)
BHSA
JR>W JCRJM W JFMXW W KL <WLH QPYH PJH
4Q88
JR>W JCRJM W JFMXW W KL <WLH QPY- PJH
 
('Exodus', 14, 13)
BHSA
W J>MR MCH >L H <M >L TJR>W HTJYBW W R>W >T JCW<T JHWH >CR J<FH LKM H JWM KJ >CR R>JTM >T MYRJM H JWM L> TSJPW L R>TM <WD <D <WLM
4Q14
W J>MR MCH >L H <M >L TJR>W HTJYBW W R>W >T JCW<T JHWH >CR J<FH LKM H JWM KJ >CR R>JTM >T MYRJM H------- --J------------------W-M
 
('Genesis', 36, 22)
BHSA
W JHJW BNJ LWVN XRJ W HJMM W >XWT LWVN TMN<
4Q1
W JHJW BNJ LWVN XRJ W HJMM W >XWT LWVN TMN<
 
('Psalms', 45, 8)
BHSA
>HBT YDQ W TFN> RC< <L KN MCXK >LHJM >LHJK CMN FFWN M XBRJK
4Q85
>HBT YDQ W TFN> RC< <L KN MCXK >LHJM >LHJK CMN FF-N M XBRJK
 
('Psalms', 45, 8)
BHSA
>HBT YDQ W TFN> RC< <L KN MCXK >LHJM >LHJK CMN FFWN M XBRJK
11Q8
>HBT YDQ W TFN> RC< <L KN MCXK---H-- >LHJ-----------M------
 
('Song_of_songs', 3, 5)
BHSA
HCB<TJ >TKM BNWT JRWCLM B YB>WT >W B >JLWT H FDH >M T<JRW W >M T<WRRW >T H >HBH <D C TXPY
4Q106
HCB<TJ >TKM BNWT JRWCLM B Y--------B------------------------>-------W----------------T---
 
('Song_of_songs', 3, 5)
BHSA
HCB<TJ >TKM- BNWT JRWCLM B YB>WT >W B >JLWT H FDH >M T<JRW W >M T<WRRW >T H >HBH <D C TXPY
4Q107
HCB<TJ >TKMH BNWT JRWCLM B YB>WT >W B >JLWT H FDH >M T<JRW W >M T<WRRW >T H >HBH <D C TXPY
 
('Psalms', 27, 13)
BHSA
LWL> H>MNTJ L R>WT B VWB JHWH B >RY XJJM
4Q85
-----H>MNTJ L R>WT B VWB JHWH B >RY XJJM
 
('1_Samuel', 1, 9)
BHSA
W TQM XNH >XRJ >KLH B C-LH W >XRJ CTH W <LJ H KHN JCB <L H KS> <L MZWZT HJKL JHWH
4Q51
--------------------B CJLH W >XR---------------------------------------------J---
 
('2_Samuel', 13, 26)
BHSA
W J->MR >BCLWM W L-> JLK N> >TNW >MNWN >XJ W J->MR LW H MLK LMH JLK <MK
4Q51
W JW>MR >BCLWM W LW> JLK N-->--- >MNWN >XJ W JW>MR LW H MLK LMH JLK <MK
 
('Isaiah', 63, 10)
BHSA
W HMH MRW W <YBW >T RWX Q-DC-W W JHPK LHM- L >WJB --HW>- NLXM BM
1Qisaa
W HMH MRW W <YBW >T RWX QWDCJW W JHPK LHMH L >WJB W HW>H NLXM BM
 
('Isaiah', 63, 10)
BHSA
W HMH MRW W <YBW >T RWX QDCW W JHPK LHM L >WJB HW> NLXM BM
1Q8
W HMH MRW W <YBW >T RWX QDCW W JHPK LHM L >WJB HW> NLXM BM
 
('Numbers', 22, 41)
BHSA
W JHJ B B-QR W JQX BLQ >T BL<M W J<LHW BMWT_-B<L W JR> M CM QYH H <M
4Q27
W JHJ B BWQR W JQX BLQ >T BL<M W J<LHW BMWT- B<L W JR> M CM QYH H <M
 
('Micah', 4, 11)
BHSA
W <TH N>SPW <LJK GWJM RBJM H >MRJM TXNP W TXZ B YJWN <JNJNW
Mur88
W <TH N>SPW <LJK GWJM RBJM H >MRJM TXNP W TXZ B YJWN <JNJNW
 
('1_Samuel', 14, 25)
BHSA
W KL H >RY-- B>W B J<R W JHJ DBC <L PNJ H FDH
4Q51
W KL H ---<M B>- B J<R W JHJ DBC <L PNJ H FDH
 
('Psalms', 19, 7)
BHSA
M QYH- H CMJM MWY>W W TQWPTW <L QYWTM- W >JN NSTR M XMTW
11Q7
M QY-J H CMJM MWY>W W TQWPTW <L QYWTMH W >JN NSTR M XMTW
 

Postprocess data

Lexemes and parts of speech are collected and stored in lists; these are converted to a pandas dataframe later on.

The dictionary mapping_dict records which BHSA alternatives correspond with each lexeme in the dss package.

In [16]:
tf_word_id = [] # tf index
scrolls = [] # scroll code
books = [] # biblical book name
chapters = [] # chapter number
verses = [] # verse number
words_dss = [] # consonantal representation of word on scroll
lexs_dss = [] # lexeme in dss package
lexs_etcbc = [] # corresponding etcbc lexeme
poss_dss = [] # POS in dss package
poss_etcbc = [] # corresponding etcbc POS

mapping_dict = collections.defaultdict(lambda: collections.defaultdict(list))

# loop over all words in the scrolls
for key in chars_per_word.keys():
    
    tf_word_id.append(ids_dss[key[1]][key[0]][key[2]])

    # consonantal representation of the scroll word, without alignment gaps
    word = (''.join(chars_per_word[key])).replace("-", "")
    
    lexeme_dss, lexeme_etcbc = produce_value(key, chars_to_lexemes_etcbc, chars_to_lexemes_dss, word)
    # distinct names are used here: reusing the name pos_dss would shadow the dict defined above
    pos_dss_word, pos_etcbc_word = produce_value(key, chars_to_pos_etcbc, chars_to_pos_dss, word)
    
    if lexeme_dss != '':
        mapping_dict[lexeme_dss][(key[0], key[1])].append(lexeme_etcbc)
    
    # collect info in lists
    scrolls.append(key[0])
    books.append(key[1][0])
    chapters.append(key[1][1])
    verses.append(key[1][2])
    words_dss.append(word)
    lexs_dss.append(lexeme_dss)
    lexs_etcbc.append(lexeme_etcbc)
    poss_dss.append(pos_dss_word)
    poss_etcbc.append(pos_etcbc_word)
    

In the following two cells, some of the empty values are filled in, using the second step of the procedure described above.

In [17]:
# the list second remembers which cases have been given a lexeme this way

second = []

for index, lex in enumerate(lexs_etcbc):
    if lex == '' and Fdss.lang.v(tf_word_id[index]) != 'a': # exclude Aramaic, some strange cases
        if lexs_dss[index] != '':
        
            all_candidates_lists = list((mapping_dict[lexs_dss[index]]).values())
            candidates_list = [item for sublist in all_candidates_lists for item in sublist]
        
            # guard against dss lexemes that never received an ETCBC counterpart
            if candidates_list:
                best_cand, count = most_frequent(candidates_list)

                lexs_etcbc[index] = best_cand
                second.append('x')
            else:
                second.append('')
        else:
            second.append('')
    else:
        second.append('')
In [18]:
mapping_lex_pos = collections.defaultdict(list)
mapping_id_pos = collections.defaultdict(lambda: collections.defaultdict(list))

for index, lex in enumerate(lexs_etcbc):
    
    if lex == "" or poss_etcbc[index] == "":
        continue
        
    mapping_lex_pos[lex].append(poss_etcbc[index])
    mapping_id_pos[lex][poss_etcbc[index]].append(tf_word_id[index])

Collect the data in a dataframe.

In [19]:
df_pos_lex = pd.DataFrame(list(zip(tf_word_id, scrolls, books, chapters, verses, words_dss, lexs_dss, lexs_etcbc, poss_dss, poss_etcbc, second)), 
               columns =['tf_word_id', 'scroll','book','chapter', 'verse', 'g_cons', 'lex_dss', 'lex_etcbc', 'pos_dss', 'pos_etcbc', 'second_assignment']) 

df_new = df_pos_lex.sort_values(['book', 'scroll', 'chapter', 'verse'], ascending=[True, True, True, True])
df_new
Out[19]:
tf_word_id scroll book chapter verse g_cons lex_dss lex_etcbc pos_dss pos_etcbc second_assignment
70916 2013632 4Q54 1_Kings 7 20 W W: W ptcl conj
70917 2013633 4Q54 1_Kings 7 20 KTRT K.OTERET KTRT/ subs subs
70918 2013634 4Q54 1_Kings 7 20 <L <AL <L ptcl prep
70919 2013635 4Q54 1_Kings 7 20 CNJ C:NAJIm CNJM/ numr subs
70920 2013636 4Q54 1_Kings 7 20 H HA H ptcl art
70921 2013637 4Q54 1_Kings 7 20 <MWDJM <AM.W.D <MWD/ subs subs
70922 2013638 4Q54 1_Kings 7 20 GM G.Am GM ptcl advb
70923 2013639 4Q54 1_Kings 7 20 M MIn MN ptcl prep
70924 2013640 4Q54 1_Kings 7 20 M<L MA<AL M<L/ ptcl subs
70925 2013641 4Q54 1_Kings 7 20 M MIn MN ptcl prep
70926 2013642 4Q54 1_Kings 7 20 L L: L ptcl prep
70927 2013643 4Q54 1_Kings 7 20 <MT <UM.@H <MH/ ptcl subs
70928 2013644 4Q54 1_Kings 7 20 H HA H ptcl art
70929 2013645 4Q54 1_Kings 7 20 BVN B.EVEn BVN/ subs subs
70930 2013646 4Q54 1_Kings 7 20 >CR >:ACER >CR ptcl conj
70931 2013647 4Q54 1_Kings 7 20 L L: L ptcl prep
70932 2013648 4Q54 1_Kings 7 20 <BR <;BER <BR/ subs subs
70933 2013649 4Q54 1_Kings 7 20 FBKH F:B@K@H FBKH/ subs subs
70934 2013650 4Q54 1_Kings 7 20 W W: W ptcl conj
70935 2013651 4Q54 1_Kings 7 20 H HA H ptcl art
70936 2013652 4Q54 1_Kings 7 20 RMWNJM RIM.OWn RMWN/ subs subs
70937 2013653 4Q54 1_Kings 7 20 M>TJM M;>@H M>H/ numr subs
70938 2013654 4Q54 1_Kings 7 20 VRJM VW.R VWR/ subs subs
70939 2013655 4Q54 1_Kings 7 20 SBJB S@BIJB SBJB/ ptcl subs
70940 2013656 4Q54 1_Kings 7 20 <L <AL <L ptcl prep
70941 2013657 4Q54 1_Kings 7 20 H HA H ptcl art
70942 2013658 4Q54 1_Kings 7 20 KTRT K.OTERET KTRT/ subs subs
70943 2013659 4Q54 1_Kings 7 20 H HA H ptcl art
70944 2013660 4Q54 1_Kings 7 20 CNJT C;NIJ CNJ/ numr adjv
47641 2013662 4Q54 1_Kings 7 21 W W: W ptcl conj
... ... ... ... ... ... ... ... ... ... ... ...
85231 2098822 Mur88 Zephaniah 3 20 H HA H ptcl art
85232 2098823 Mur88 Zephaniah 3 20 HJ> HIJ> HJ> pron prps
85233 2098824 Mur88 Zephaniah 3 20 >BJ> BW> BW>[ verb verb
85234 2098825 Mur88 Zephaniah 3 20 >TKM >;T >T suff prep
85235 2098826 Mur88 Zephaniah 3 20 W W: W ptcl conj
85236 2098827 Mur88 Zephaniah 3 20 B B.: B ptcl prep
85237 2098828 Mur88 Zephaniah 3 20 <T <;T <T/ subs subs
85238 2098829 Mur88 Zephaniah 3 20 QBYJ QBy QBY[ suff verb
85239 2098830 Mur88 Zephaniah 3 20 >TKM >;T >T suff prep
85240 2098831 Mur88 Zephaniah 3 20 KJ K.IJ KJ ptcl conj
85241 2098832 Mur88 Zephaniah 3 20 >TN NTn NTN[ verb verb
85242 2098833 Mur88 Zephaniah 3 20 >TKM >;T >T suff prep
85243 2098834 Mur88 Zephaniah 3 20 L L: L ptcl prep
85244 2098835 Mur88 Zephaniah 3 20 CM C;m CM/ subs subs
85245 2098836 Mur88 Zephaniah 3 20 W W: W ptcl conj
85246 2098837 Mur88 Zephaniah 3 20 L L: L ptcl prep
85247 2098838 Mur88 Zephaniah 3 20 THLH T.:HIL.@H THLH/ subs subs
85248 2098839 Mur88 Zephaniah 3 20 B B.: B ptcl prep
85249 2098840 Mur88 Zephaniah 3 20 KL K.OL KL/ subs subs
85250 2098841 Mur88 Zephaniah 3 20 <MJ <Am <M/ subs subs
85251 2098842 Mur88 Zephaniah 3 20 H HA H ptcl art
85252 2098843 Mur88 Zephaniah 3 20 >RY >EREy >RY/ subs subs
85253 2098844 Mur88 Zephaniah 3 20 B B.: B ptcl prep
85254 2098845 Mur88 Zephaniah 3 20 CWBJ CWB CWB=[ suff verb
85255 2098846 Mur88 Zephaniah 3 20 >T >;T >T ptcl prep
85256 2098847 Mur88 Zephaniah 3 20 CBWTJKM C:BW.T CBWT/ suff subs
85257 2098848 Mur88 Zephaniah 3 20 L L: L ptcl prep
85258 2098849 Mur88 Zephaniah 3 20 <JNJKM <AJIn <JN/ suff subs
85259 2098850 Mur88 Zephaniah 3 20 >MR >MR >MR[ verb verb
85260 2098851 Mur88 Zephaniah 3 20 JHWH JHWH JHWH/ subs nmpr

197899 rows × 11 columns

Finally, the data is saved in a csv file.

In [65]:
df_new.to_csv("lexemes_pos_all_bib_books.csv", index=False)

It goes without saying that various parts of this procedure can be improved and refined, but it is a good starting point for further investigations.