By Martijn Naaijer
One of the goals of the CACCHT project is to integrate the Dead Sea Scrolls into the ETCBC database by converting them to the ETCBC encoding system. In a previous blogpost we explained how POS tagging of Hebrew texts can be done with an LSTM network. In this blogpost we show how the lexemes of the words in the biblical scrolls can be converted from Abegg's encoding to the ETCBC lexemes.
The text-fabric package containing the Dead Sea Scrolls has a variety of word features, such as number, gender, person, and lexeme. Some of these can be converted to the ETCBC encoding relatively straightforwardly. An example is the feature person: we assume that first person verb forms in the dss package correspond one-to-one with first person verb forms in the ETCBC database.
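For such one-to-one features, the conversion can be a simple lookup table. A minimal sketch (the value names below are illustrative, not the exact feature inventories of the two packages):

```python
# hypothetical value names; consult the feature documentation of both packages
person_map = {'1': 'p1', '2': 'p2', '3': 'p3'}

def convert_person(dss_value):
    # return the ETCBC person value, or '' if the dss value is absent or unknown
    return person_map.get(dss_value, '')
```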
There are features for which this correspondence cannot be applied straightforwardly. An example is the word feature part of speech. In the dss package there is the value "ptcl" (particle), which corresponds to multiple values in the ETCBC database, such as "conj" (conjunction) and "prep" (preposition).
Something similar holds for the feature lexeme. In the ETCBC database, the lexemes are based on the KBL lexicon, but over time a whole range of improvements has been implemented, based on ongoing research. It is therefore to be expected that there is not always a one-to-one correspondence between the lexemes of the dss package and those of the BHSA.
So, how can we assign the ETCBC values of the feature lexeme to the words in the Scrolls? In the case of the biblical scrolls one can simply use the lexemes of the same words in the corresponding verses. For instance, the scroll 4Q2 contains part of the text of Genesis. Its text of verse 1:1 is (partly reconstructed) BR>CJT BR> >LHJM >T HCMJM W>T H>RY, which is identical to the text of Genesis 1:1 in the BHSA. In this case, we can simply give each word the lexeme of the same word in the BHSA. In many cases, however, the text of the scrolls deviates from that of the BHSA to a greater or lesser extent.
This problem is solved by using sequence alignment: two strings are arranged in such a way that similar parts are identified. Sequence alignment is often used in biology, for instance to identify similarities and differences between DNA strings or proteins. Sequence alignment techniques are often based on dynamic programming, such as the Smith-Waterman algorithm. The package biopython implements a whole range of algorithms for the study of biological sequences. We use this package, and in particular its module pairwise2, to align biblical verses in the DSS and the BHSA.
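To illustrate what such an alignment algorithm does, here is a minimal pure-Python global alignment (in the style of Needleman-Wunsch) with the same scoring as pairwise2.align.globalxx, which we use below: a match scores 1, mismatches and gaps score 0. This is only a sketch; biopython's implementation is more complete and efficient.

```python
def align(a, b):
    """Global alignment of two strings with match = 1, mismatch = 0, gap = 0
    (the scoring scheme of pairwise2.align.globalxx)."""
    n, m = len(a), len(b)
    # score[i][j]: best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (1 if a[i - 1] == b[j - 1] else 0)
            score[i][j] = max(diag, score[i - 1][j], score[i][j - 1])
    # traceback: rebuild one optimal pair of gapped strings from the table
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (1 if a[i - 1] == b[j - 1] else 0):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j]:
            out_a.append(a[i - 1])
            out_b.append('-')
            i -= 1
        else:
            out_a.append('-')
            out_b.append(b[j - 1])
            j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

align('QDC', 'QWDC')  # → ('Q-DC', 'QWDC')
```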
In practice, this looks as follows. As an example, we look at Isaiah 48:2 in the BHSA and 1QIsaa.
KJ- M <JR H Q-DC NQR>W W <L >L-HJ JFR>L NSMKW JHWH YB>WT CMW BHSA
KJ> M <JR H QWDC NQR>W W <L >LWHJ JFR>L NSMKW JHWH YB>WT CMW 1QIsaa
It is clear that the text of the BHSA and 1QIsaa is very similar, but there are also some differences. Similar parts in the verses are put together, resulting in two sequences of equal length. In place of the extra matres lectionis in 1QIsaa, one finds a gap ("-") in the BHSA.
Now we look at every character in 1QIsaa and check the lexeme of the corresponding character in the BHSA. If more than half of the characters of a word in the scroll correspond with one lexeme in the BHSA, we assign that lexeme to the word in the scroll. In the case of the first word of Isaiah 48:2, the K and the J each correspond to a character of a BHSA word with the lexeme KJ, while the character > does not correspond with any character in the BHSA. Thus 2 out of 3 characters correspond with the lexeme KJ, which is more than 50%, so the first word KJ> in this verse of the scroll gets the value KJ. This approach works well in practice, but it does not yield a value for every word.
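The majority rule can be sketched as follows (assign_lexeme is a hypothetical helper used for illustration; the actual implementation is in the function produce_value below):

```python
from collections import Counter

def assign_lexeme(aligned_lexemes, word_length, threshold=0.5):
    """aligned_lexemes: the BHSA lexemes of the characters of one scroll word
    that aligned with a BHSA character (characters facing a gap contribute
    nothing). Returns the majority lexeme if it covers more than `threshold`
    of the word's characters, otherwise ''."""
    if word_length == 0 or not aligned_lexemes:
        return ''
    lexeme, count = Counter(aligned_lexemes).most_common(1)[0]
    return lexeme if count / word_length > threshold else ''

# KJ> in 1QIsaa: K and J align with characters of the BHSA lexeme KJ,
# > faces a gap, so 2 of 3 characters vote for KJ
assign_lexeme(['KJ', 'KJ'], word_length=3)  # → 'KJ'
```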
An example is Isaiah 42:23 in 1QIsaa:
MJ- BKM- --J>ZJN Z->T --JQCB W JCM< L >XWR BHSA
MJ> BKMH W J>ZJN ZW>T W JQCB W JCM< L >XWR 1QIsaa
In this case, the word "W" occurs three times in 1QIsaa, but only once in the BHSA. This means that in the first two cases we cannot give these words an ETCBC lexeme directly. We can proceed by checking which lexeme "W" has in these cases in the dss module, namely "W:", and then finding out with which BHSA lexeme "W:" corresponds most often. This is the BHSA lexeme "W", which can then be assigned to the unmatched words "W".
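This fallback can be sketched like this (fallback_lexeme and toy_mapping are illustrative; in the real procedure the mapping is collected from all words that did get a match):

```python
from collections import Counter

def fallback_lexeme(dss_lexeme, mapping):
    """Pick the BHSA lexeme that `dss_lexeme` was mapped to most often
    among the words that did get a match."""
    candidates = mapping.get(dss_lexeme, [])
    if not candidates:
        return ''
    return Counter(candidates).most_common(1)[0][0]

# toy mapping, not real corpus frequencies: the dss lexeme 'W:' was
# matched mostly to the BHSA lexeme 'W'
toy_mapping = {'W:': ['W', 'W', 'W', 'KJ']}
fallback_lexeme('W:', toy_mapping)  # → 'W'
```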
This procedure works well in practice, but it is not infallible. In some cases the alignment is unfortunate, in other cases there is no one-to-one mapping between lexemes, or the dss package has different word boundaries. For instance, place names consisting of two words are treated as distinct words by the dss package, whereas the BHSA generally has a single lexeme for them. To bring the resulting dataset up to the "ETCBC standard", more steps are needed: additional automatic processing and/or manual cleaning.
The same approach is used for the feature part of speech. You can find the resulting dataset "lexemes_pos_all_bib_books.csv" in the GitHub repository: https://github.com/ETCBC/DSS2ETCBC.
import collections
from pprint import pprint
import pandas as pd
import dill
The package biopython is used for aligning sequences.
from Bio import pairwise2
from Bio.Seq import Seq
Now the dss package is loaded. The classes L, T and F that are used here are renamed, so that we can use them alongside those of the BHSA.
from tf.app import use
A = use('dss', hoist=globals())
# give the relevant classes for the DSS new names
Ldss = L
Tdss = T
Fdss = F
from tf.app import use
A = use('bhsa', hoist=globals())
The dictionary book_dict_bhsa_dss provides a mapping between the booknames of the BHSA and dss packages.
book_dict_bhsa_dss = {'Genesis': 'Gen',
'Exodus': 'Ex',
'Leviticus': 'Lev',
'Numbers': 'Num',
'Deuteronomy': 'Deut',
'Joshua': 'Josh',
'Judges': 'Judg',
'1_Samuel': '1Sam',
'2_Samuel': '2Sam',
'1_Kings': '1Kgs',
'2_Kings': '2Kgs',
'Isaiah': 'Is',
'Jeremiah': 'Jer',
'Ezekiel': 'Ezek',
'Hosea': 'Hos',
'Joel': 'Joel',
'Amos': 'Amos',
'Obadiah': 'Obad',
'Jonah': 'Jonah',
'Micah': 'Mic',
'Nahum': 'Nah',
'Habakkuk': 'Hab',
'Zephaniah': 'Zeph',
'Haggai': 'Hag',
'Zechariah': 'Zech',
'Malachi': 'Mal',
'Psalms': 'Ps',
'Job': 'Job',
'Proverbs': 'Prov',
'Ruth': 'Ruth',
'Song_of_songs':'Song',
'Ecclesiastes': 'Eccl',
'Lamentations': 'Lam',
'Daniel': 'Dan',
'Ezra': 'Ezra',
'2_Chronicles': '2Chr'
}
The reverse mapping can be useful as well.
book_dict_dss_bhsa = {v: k for k, v in book_dict_bhsa_dss.items()}
print(book_dict_dss_bhsa)
We define some helper functions. The function align_verses takes two sequences as input, aligns them, and returns the aligned sequences.
def align_verses(bhsa_data, dss_data):
    seq_bhsa = ' '.join(bhsa_data).strip()
    seq_dss = ' '.join(dss_data).strip()
    seq1 = Seq(seq_bhsa)
    seq2 = Seq(seq_dss)
    alignments = pairwise2.align.globalxx(seq1, seq2)
    bhsa_al = alignments[0][0].strip(' ')
    dss_al = alignments[0][1].strip(' ')
    return bhsa_al, dss_al
The function most_frequent takes a list with lexemes as input, and returns the most frequent lexeme, together with its frequency in the list.
from collections import Counter

def most_frequent(lex_list):
    occurrence_count = Counter(lex_list)
    lex, count = occurrence_count.most_common(1)[0]
    return lex, count
In the function produce_value, the value of the POS or lexeme in the ETCBC format is retrieved.
def produce_value(key, chars_to_feat_etcbc, chars_to_feat_dss):
    # check for each consonant what the value of the feature is of the word
    # of the corresponding consonant in the BHSA
    if len(chars_to_feat_etcbc[key]) > 0:
        all_feat_etcbc = chars_to_feat_etcbc[key]
        # check with which lexeme in the BHSA a word corresponds with most of its characters
        feat_etcbc_proposed, count = most_frequent(all_feat_etcbc)
        # an ETCBC lexeme is assigned to a word only if more than half of its
        # consonants correspond with a word in the BHSA
        # note: `word` is the consonantal word assembled in the loop further below (a global)
        if len(word) == 0:
            feat_etcbc = ''
        elif (count / len(word)) > 0.5:
            feat_etcbc = feat_etcbc_proposed
        else:
            feat_etcbc = ''
    else:
        feat_etcbc = ''
    if len(chars_to_feat_dss[key]) > 0:
        all_feat_dss = chars_to_feat_dss[key]
        feat_dss, count = most_frequent(all_feat_dss)
    else:
        feat_dss = ''
    return feat_dss, feat_etcbc
In the next cell the text and lexemes of the Scrolls are extracted from the DSS package, and some minor manipulations of the text are performed.
For every character in the text of the scrolls, we look up the lexeme of the word in which the character occurs. This information is saved in the dictionary lexemes_dss. The keys of this dict are (book, chapter, verse) tuples; each value contains, per scroll, a list with one lexeme for each character in the verse.
Note that the data retrieved in the following cell consists partly of reconstructions of scrolls. Other features in the package, not discussed here, deal with which part of the text can be read on the scrolls and which part is reconstructed.
# In dss_data_dict, the text of each verse in the biblical scrolls is collected
dss_data_dict = collections.defaultdict(lambda: collections.defaultdict(list))
lexemes_dss = collections.defaultdict(lambda: collections.defaultdict(list))
pos_dss = collections.defaultdict(lambda: collections.defaultdict(list))
ids_dss = collections.defaultdict(lambda: collections.defaultdict(list))

for scr in Fdss.otype.s('scroll'):
    scroll_name = Tdss.scrollName(scr)
    words = Ldss.d(scr, 'word')
    for w in words:
        bo = Fdss.book.v(w)
        if bo is None or bo not in book_dict_dss_bhsa:
            continue
        # exclude fragmentary data; these chapters start with 'f'
        if Fdss.chapter.v(w)[0] == 'f':
            continue
        # do a bit of preprocessing
        if Fdss.glyphe.v(w) is not None:
            lexeme = Fdss.glexe.v(w)
            glyphs = ''.join(Fdss.glyphe.v(w).split())  # remove whitespace in word
            # dummy value
            if lexeme is None:
                lexeme = 'XXX'
            # the consonant '#' is used for both 'C' and 'F'. We check in the lexeme
            # to which of the two alternatives it should be converted. This approach
            # is crude, but generally works well. There is only one word with both
            # F and C in the lexeme: >RTX##T> >AR:T.AX:CAF:T.:> in 4Q117
            if '#' in glyphs:
                # hardcode the single word with both 'C' and 'F' in the lexeme
                if glyphs == '>RTX##T>':
                    glyphs = '>RTXCFT>'
                elif 'F' in lexeme:
                    glyphs = glyphs.replace('#', 'F')
                # cases in which 'C' occurs in the lexeme or morphology
                else:
                    glyphs = glyphs.replace('#', 'C')
            # some characters are removed or replaced; in the case of 'k', 'n', 'm',
            # 'y', 'p', it concerns final consonants of words
            glyphs = glyphs.replace(u'\xa0', u' ').replace("'", "").replace("k", "K").replace("n", "N").replace("m", "M").replace("y", "Y").replace("p", "P")
            dss_book = Fdss.book.v(w)
            bhsa_book_name = book_dict_dss_bhsa[dss_book]
            # replace(' ', '') is needed for a strange case in Exodus 13:16 with a space in the word
            dss_data_dict[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(glyphs.replace(' ', ''))
            ids_dss[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(w)
            # retrieve POS and lexeme of every character of every word in the scrolls and save them in dictionaries
            for character in glyphs:
                pos_dss[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(Fdss.sp.v(w))
                lexemes_dss[(bhsa_book_name, int(Fdss.chapter.v(w)), int(Fdss.verse.v(w)))][scroll_name].append(Fdss.glexe.v(w))
The same is done for all the characters in the BHSA. Also, a list with all verses in the BHSA is made. Finally, the consonantal representation of each word is collected per verse in the dictionary bhsa_data_dict.
all_verses = []
lexemes_bhsa = collections.defaultdict(list)
pos_bhsa = collections.defaultdict(list)
bhsa_data_dict = collections.defaultdict(list)

for w in F.otype.s('word'):
    # remove words without consonantal representation in the text (elided he)
    if F.g_cons.v(w) == '':
        continue
    bo, ch, ve = T.sectionFromNode(w)
    # use the feature g_cons for the consonantal representation of words
    bhsa_data_dict[(bo, ch, ve)].append(F.g_cons.v(w))
    # loop over the consonants and get the lexeme and POS for each consonant of a word
    for cons in F.g_cons.v(w):
        lexemes_bhsa[(bo, ch, ve)].append(F.lex.v(w))
        pos_bhsa[(bo, ch, ve)].append(F.sp.v(w))
    if (bo, ch, ve) not in all_verses:
        all_verses.append((bo, ch, ve))
In the following cell, the verses are aligned and characters are compared.
chars_to_lexemes_etcbc = collections.defaultdict(list)
chars_to_lexemes_dss = collections.defaultdict(list)
chars_to_pos_etcbc = collections.defaultdict(list)
chars_to_pos_dss = collections.defaultdict(list)
chars_per_word = collections.defaultdict(list)

all_verses = list(set(all_verses))
count = 0

# loop over verses in the BHSA
for verse in all_verses:
    # check if the verse occurs in the dss package
    if verse in dss_data_dict:
        scrolls = dss_data_dict[verse].keys()
        for scroll in scrolls:
            # get the text of the verse
            bhsa_data = bhsa_data_dict[verse]
            dss_data = dss_data_dict[verse][scroll]
            # all pairs of verses in the BHSA and the scrolls are aligned
            bhsa_al, dss_al = align_verses(bhsa_data, dss_data)
            count += 1
            # print the first alignments
            if count < 20:
                print(verse)
                print('BHSA')
                print(bhsa_al)
                print(scroll)
                print(dss_al)
                print(' ')
            # some indices are initialized; these keep track of how many consonants
            # have been observed in both aligned sequences
            ind_dss = 0
            ind_bhsa = 0
            dss_word_ind = 0
            # loop over all characters in the BHSA verse
            for pos in range(len(bhsa_al)):
                # for each character in the BHSA sequence, it is checked what the
                # character is in the DSS sequence. A number of scenarios are
                # distinguished. Most scenarios are not so exciting, e.g. if the
                # character is a space in both sequences, then move on
                if bhsa_al[pos] == ' ' and dss_al[pos] == ' ':
                    dss_word_ind += 1
                elif bhsa_al[pos] == '-' and dss_al[pos] == ' ':
                    dss_word_ind += 1
                elif bhsa_al[pos] == ' ' and dss_al[pos] == '-':
                    chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                else:
                    if bhsa_al[pos] == '-':
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                        chars_to_lexemes_dss[(scroll, verse, dss_word_ind)].append(lexemes_dss[verse][scroll][ind_dss])
                        chars_to_pos_dss[(scroll, verse, dss_word_ind)].append(pos_dss[verse][scroll][ind_dss])
                        ind_dss += 1
                    elif dss_al[pos] == '-':
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                        ind_bhsa += 1
                    # now the real matching is done: for a matching consonant, the
                    # dicts are checked to see which lexeme it corresponds to
                    else:
                        chars_to_lexemes_etcbc[(scroll, verse, dss_word_ind)].append(lexemes_bhsa[verse][ind_bhsa])
                        chars_to_pos_etcbc[(scroll, verse, dss_word_ind)].append(pos_bhsa[verse][ind_bhsa])
                        chars_to_lexemes_dss[(scroll, verse, dss_word_ind)].append(lexemes_dss[verse][scroll][ind_dss])
                        chars_to_pos_dss[(scroll, verse, dss_word_ind)].append(pos_dss[verse][scroll][ind_dss])
                        chars_per_word[(scroll, verse, dss_word_ind)].append(dss_al[pos])
                        ind_dss += 1
                        ind_bhsa += 1
The lexemes and parts of speech are collected and stored in lists; these are later converted to a pandas dataframe.
The dictionary mapping_dict shows which lexemes in the dss package correspond with their new BHSA alternatives.
tf_word_id = [] # tf index
scrolls = [] # scroll code
books = [] # biblical book name
chapters = [] # chapter number
verses = [] # verse number
words_dss = [] # consonantal representation of word on scroll
lexs_dss = [] # lexeme in dss package
lexs_etcbc = [] # corresponding etcbc lexeme
poss_dss = [] # POS in dss package
poss_etcbc = [] # corresponding etcbc POS
mapping_dict = collections.defaultdict(lambda: collections.defaultdict(list))
# loop over all words in the scrolls that were aligned above
for key in chars_per_word.keys():
    tf_word_id.append(ids_dss[key[1]][key[0]][key[2]])
    word = ''.join(chars_per_word[key]).replace('-', '')
    lexeme_dss, lexeme_etcbc = produce_value(key, chars_to_lexemes_etcbc, chars_to_lexemes_dss)
    # fresh names are used here, so that the dict pos_dss defined earlier is not shadowed
    pos_dss_word, pos_etcbc_word = produce_value(key, chars_to_pos_etcbc, chars_to_pos_dss)
    if lexeme_dss != '':
        mapping_dict[lexeme_dss][(key[0], key[1])].append(lexeme_etcbc)
    # collect info in lists
    scrolls.append(key[0])
    books.append(key[1][0])
    chapters.append(key[1][1])
    verses.append(key[1][2])
    words_dss.append(word)
    lexs_dss.append(lexeme_dss)
    lexs_etcbc.append(lexeme_etcbc)
    poss_dss.append(pos_dss_word)
    poss_etcbc.append(pos_etcbc_word)
In the following two cells, some of the empty values are filled in, using the second part of the procedure described above.
# the list `second` records which cases have been given a lexeme this way
second = []
for index, lex in enumerate(lexs_etcbc):
    if lex == '' and Fdss.lang.v(tf_word_id[index]) != 'a':  # exclude Aramaic, some strange cases
        if lexs_dss[index] != '':
            all_candidates_lists = list(mapping_dict[lexs_dss[index]].values())
            candidates_list = [item for sublist in all_candidates_lists for item in sublist]
            best_cand, count = most_frequent(candidates_list)
            lexs_etcbc[index] = best_cand
            second.append('x')
        else:
            second.append('')
    else:
        second.append('')
mapping_lex_pos = collections.defaultdict(list)
mapping_id_pos = collections.defaultdict(lambda: collections.defaultdict(list))
for index, lex in enumerate(lexs_etcbc):
    if lex == '' or poss_etcbc[index] == '':
        continue
    mapping_lex_pos[lex].append(poss_etcbc[index])
    mapping_id_pos[lex][poss_etcbc[index]].append(tf_word_id[index])
Collect the data in a dataframe.
df_pos_lex = pd.DataFrame(
    list(zip(tf_word_id, scrolls, books, chapters, verses, words_dss, lexs_dss,
             lexs_etcbc, poss_dss, poss_etcbc, second)),
    columns=['tf_word_id', 'scroll', 'book', 'chapter', 'verse', 'g_cons', 'lex_dss',
             'lex_etcbc', 'pos_dss', 'pos_etcbc', 'second_assignment'])
df_new = df_pos_lex.sort_values(['book', 'scroll', 'chapter', 'verse'], ascending=[True, True, True, True])
df_new
Save the data in a csv file.
df_new.to_csv("lexemes_pos_all_bib_books.csv", index=False)
It goes without saying that various things in this procedure can be improved and refined, but it is a good starting point for further investigations.