by Mark Klooster | October 2, 2020 | ETCBC
This notebook publishes the results of my internship at the ETCBC (Eep Talstra Centre for Bible and Computer). The internship project involved creating a phrase atom parser for Hebrew text by building a machine learning model. The phrase atom parser contributes to the joint project between the ETCBC and the Theological Seminary at Andrews University, called Creating Annotated Corpora of Classical Hebrew Texts (CACCHT). The CACCHT project is currently broadening its scope by adding more text corpora to its database. A few years ago, the research group created a new Text-Fabric module containing the Dead Sea Scrolls (DSS) with morphological encoding. The digitized DSS and their morphological annotations were provided by Martin Abegg. However, Abegg's encoding system is very different from that of the other modules encoded by the ETCBC (such as the BHSA package or the extra-biblical package). Therefore, the CACCHT project has been working on converting all morphological features, using a bottom-up approach (i.e. converting word features first, then phrase features, clause features, and so on).
The encoding of word features is well underway and the project is about to move on to encoding phrase atom features. To start this, it first has to be known which words constitute a phrase atom. However, Abegg's encoding does not contain information about phrase atom boundaries. Therefore, the phrase atom boundaries have to be constructed first. The construction, or rather prediction, of phrase atom boundaries is the subject of this notebook.
The reason this project predicts phrase atom boundaries instead of phrase boundaries is that, whereas phrases may be interrupted by words of another phrase, phrase atoms always consist of contiguous words. Take for example this English sentence:
'A clearer example has never been given'.
The adverb 'never', an adverbial phrase on its own, splits the verbal phrase 'has been given' into two smaller phrase atoms, namely 'has' and 'been given'. As phrases that are interrupted by other phrases are harder to detect, it is more logical to try and find phrase atom boundaries first. This agrees with the bottom-up approach that is used in the CACCHT project. (In another project, the phrase atoms found here could be used to find complete phrases.)
Determining things like phrase atom boundaries used to be done manually. This notebook, however, uses a different approach: phrase atom boundaries will be deduced and predicted from information on word level. The BHSA dataset already has information on all levels, including phrase atom boundaries, while the DSS dataset only has information on word level. The data of the BHSA will therefore be used as training data for a neural network. A neural network is a machine learning algorithm that takes a pattern-based approach. This means that rather than being fed hand-written rules for predicting phrase atom boundaries, the network finds patterns between the input (on word level) and the output (phrase atom boundaries) and thus comes up with these rules itself. These rules, in turn, will be applied to word-level input of the DSS to predict phrase atom boundaries.
The neural network will be trained to find statistical patterns between part of speech - a word-level feature that has been encoded for the DSS already - and phrase atom boundaries. The deep learning model is trained on 90 per cent of the chapters of the BHSA, of which the phrase atom boundaries (the output) are known. As input, the model takes part of speech (e.g. noun, verb, adjective, etc.). The output consists of a 'p' or an 'x', indicating, respectively, whether the word is the end of a phrase atom, or not.
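To make the output format concrete, here is a small illustration (not part of the actual pipeline) that applies the 'x'/'p' labelling to the English example sentence discussed above:
# illustration only: the 'x'/'p' labelling applied to the English example sentence above
# ('p' marks the last word of a phrase atom, 'x' any other word)
words = ['A', 'clearer', 'example', 'has', 'never', 'been', 'given']
labels = ['x', 'x', 'p', 'p', 'p', 'x', 'p']
for word, label in zip(words, labels):
    print(f"{word:<8} {label}")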
The trained model is then tested on the remaining 10 per cent of the BHSA, which is called the test set. The mistakes are evaluated in detail, to get insight into the specific cases in which the model is incorrect. This evaluation has led to several alterations in the input data, which in turn have improved the accuracy of the model. The notebook below only shows the final script of the most accurate model; the alterations that were made on the basis of the evaluation of simpler models are incorporated in the data preparation steps described below.
Moreover, whether a word is the end of a phrase atom or not cannot be deduced from its part of speech alone. When dealing with language, context is crucial. Therefore, as is common in natural language processing, the model works with input and output sequences instead of single inputs and outputs. This is called a sequence-to-sequence (seq2seq) model. After testing sequence lengths ranging from 5 to 20, a length of 9 turned out to be the most suitable and efficient. The model therefore works with sequences of length 9: the input consists of 9 consecutive parts of speech and the output of 9 phrase atom boundary indicators (x's or p's).
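As a small illustration (with made-up parts of speech, not taken from the BHSA), this is how overlapping sequences of length 9 are cut out of a longer block of words; the actual data preparation is shown further below:
# illustration only: forming overlapping input sequences of length 9 from a block of words
parts_of_speech = ['conj', 'verb', 'nmpr', 'prep', 'subs', 'art', 'subs',
                   'conj', 'prep', 'subs', 'verb']
seq_len = 9
sequences = [parts_of_speech[i:i + seq_len]
             for i in range(len(parts_of_speech) - seq_len + 1)]
for seq in sequences:
    print(seq)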
In the script below, the following steps are taken: the data is collected from the BHSA and pre-processed into input and output sequences; the sequences are one-hot encoded; the neural network is defined, trained, and evaluated on the test set; and finally the trained model is applied to the DSS. Each step is explained in more detail below.
First, the necessary libraries and modules are imported. This includes the TensorFlow package to build neural networks and the Text-Fabric package, which gives access to the BHSA database.
It is recommended to run the model on a GPU instead of on a CPU, because that is much faster (depending on the specifications of the GPU, of course). In order to do this, a virtual environment needs to be created. This might be a bit complicated, but there are various good explanations and tutorials available online. See, for example, this tutorial on how to install TensorFlow with GPU support: https://www.youtube.com/watch?v=tPq6NIboLSc
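Before training, it can be useful to check whether TensorFlow actually detects a GPU. This is a minimal optional check (not part of the original notebook) using the standard TensorFlow API:
# optional check: lists the GPUs that TensorFlow can see (an empty list means CPU only)
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))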
# imports the necessary libraries and modules
import collections
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
# imports the ETCBC database of the BHSA
from tf.app import use
A = use('bhsa:hot', hoist=globals())
First, it is important to collect all words of the BHSA that are suitable for this project's purposes. As the ideal sequence length is 9, it is useful to collect these words in ranges longer than the sequence length. Moreover, there have to be enough ranges so that a random 10 per cent (for the test set) is representative of all genres throughout the Hebrew Bible. Therefore, the entire Hebrew Bible is split up into 929 smaller blocks containing consecutive words from exactly one chapter. The next step is to remove all words that are not Hebrew but Aramaic (to get a more homogeneous dataset). This is done in two steps: the Aramaic words are deleted, and the chapters in which they occur are split into separate blocks at those points. This way, the resulting 927 blocks consist of only consecutive Hebrew words.
Moreover, in Hebrew writing, when a word has both an article and a prefixed preposition, the article is elided. It is therefore no longer visible, except in vocalised texts (such as the 10th-century Masoretic Text). As the BHSA is based on an edition of the Masoretic Text (the BHS), it includes information about these 'hidden' articles. As the goal of this research is to predict phrase atom boundaries for the Dead Sea Scrolls - which are unvocalised texts - this added information is ignored and deleted. In the dataset of the BHSA, these words have an empty string ('') as the value of the feature g_cons, the transliterated consonantal representation of a word.
Also, when a word has a pronominal suffix, this suffix is separated from the word in the pre-processing phase and treated as a word of its own. This way, the pronominal suffix becomes similar to the 'normal' personal pronouns. More importantly, regarding the pronominal suffix as a separate, individual word matches the encoding of the DSS, the target dataset.
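As an illustration of how many words are affected by these two decisions, the BHSA features used above can be queried directly with Text-Fabric. This is a small optional check (not part of the pipeline itself), assuming the BHSA has been loaded as shown in the imports above:
# illustration only: counts the words affected by the two pre-processing decisions above,
# i.e. words with an elided article (empty g_cons) and words with a pronominal suffix
elided_articles = sum(1 for w in F.otype.s("word") if F.g_cons.v(w) == '')
suffixed_words = sum(1 for w in F.otype.s("word") if F.prs.v(w) not in ['absent', 'n/a'])
print("words with an elided article:", elided_articles)
print("words with a pronominal suffix:", suffixed_words)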
def create_hebrew_blocks():
hebrew_blocks = collections.defaultdict(list)
chapters = [chap for chap in F.otype.s("chapter")]
block_index = 0
# iterates over all chapters
for chap in chapters:
chap_words = []
# iterates over and collects all words except the elided-he
# adds an extra word if there is a pronominal suffix
for word in L.d(chap, "word"):
if F.g_cons.v(word) != '':
chap_words.append(word)
if F.prs.v(word) not in ['absent', 'n/a']:
chap_words.append(word)
# splits chapter into blocks when it encounters non-Hebrew words
for node in range(len(chap_words)):
if F.language.v(chap_words[node]) == 'Hebrew':
hebrew_blocks[block_index].append(chap_words[node])
elif F.language.v(chap_words[node]) != 'Hebrew':
if F.language.v(chap_words[node - 1]) == 'Hebrew':
block_index += 1
continue
else:
continue
block_index += 1
# shuffles the blocks randomly
indexes = shuffle(list(hebrew_blocks.keys()))
hebrew_blocks = {k: hebrew_blocks[k] for k in indexes}
return hebrew_blocks
To give an example of what the 'hebrew blocks' look like, here are the first ten:
hebrew_blocks = create_hebrew_blocks()
[" ".join([str(i) for i in T.sectionFromNode(words[0])]).replace("_", " ") + "-" + str(T.sectionFromNode(words[-1])[2]) for words in hebrew_blocks.values()][:10]
Now that the dataset is defined, the next step is to collect the input and output data. For this purpose, the following three functions are used. The first function, get_pos, takes a word and returns its part of speech, extended - if needed - by the word's state. The second function returns a 'p' when the word is the end of a phrase atom, and an 'x' when it is not. The third function iterates through all blocks and all words and adds the input and output data to each word. The resulting blocks, which now also contain input and output data, are split into training blocks and test blocks according to a predefined ratio of 9:1.
def get_pos(w):
# customises the part of speech for a word and returns it
# when a word has a suffix and a defined state,
# its part of speech is extended by '_c' indicating a construct state.
if F.prs.v(w) not in ['absent', 'n/a'] and F.st.v(w) != "NA":
pos = str(F.sp.v(w)) + "_c"
# in all other cases, when a word has a state and no suffix, the
# part of speech is extended by the state
elif F.st.v(w) != "NA":
pos = str(F.sp.v(w)) + "_" + str(F.st.v(w))
# when the word has neither state nor suffix, its part of speech remains unchanged
else:
pos = str(F.sp.v(w))
return pos
def position_in_phrase_atom(w):
    # returns a 'p' when a word is the end of a phrase atom and an 'x' if it is not
ph_atom = L.u(w, 'phrase_atom')[0]
words_in_ph_atom = L.d(ph_atom, "word")
# when the word is the end of the phrase atom
if w == words_in_ph_atom[-1]:
ph_atom_end = 'p'
# when it is not
else:
ph_atom_end = "x"
return ph_atom_end
def collect_data(hebrew_blocks, ratio=0.9):
data = {}
# iterates through all blocks
for block_idx, block_words in hebrew_blocks.items():
block_data = []
done = False
# iterates through all words
for w in block_words:
# looks up the phrase to find the phrase function later
phrase = L.u(w, "phrase")[0]
            # skips the second occurrence of a word whose suffix has already been handled
if done == True:
done = False
continue
# when a word appears twice in a block, the second one represents the suffix
# the following lines make sure that the suffix gets a fitting part of speech
# (prps) and phrase atom position
elif block_words.count(w) == 2:
# if the word has a suffix the data collection will happen for both the
# word and the suffix. The second time the word passes the loop, it is ignored
                # by setting the boolean 'done' to True.
done = True
# if the phrase function of the word indicates a SUBJECT or OBJECT suffix
if F.function.v(phrase)[-1] in "SO":
                    # if it is the end of a phrase atom, the suffix becomes a separate phrase atom
if position_in_phrase_atom(w) == 'p':
block_data.append(['p', get_pos(w), w])
block_data.append(['p', 'prps', w])
# if it is not, both original word and suffix remain 'x' for the same phrase atom
else:
block_data.append(['x', get_pos(w), w])
block_data.append(['x', 'prps', w])
# if the phrase function does not indicate a subject or object suffix
                # the suffix takes over the phrase atom position from its base word.
                # If it thereby becomes the end of the phrase atom, the base word gets an 'x'
else:
if position_in_phrase_atom(w) == 'p':
block_data.append(['x', get_pos(w), w])
block_data.append(['p', 'prps', w])
else:
block_data.append(['x', get_pos(w), w])
block_data.append(['x', 'prps', w])
# in all other cases, without suffixes involved, the phrase atom position and part of speech
# are determined in the regular way
else:
block_data.append([position_in_phrase_atom(w), get_pos(w), w])
data[block_idx] = block_data
# shuffles the data randomly by block index
data = {k: data[k] for k in shuffle(list(data.keys()))}
# splits the shuffled data into train blocks and test blocks according to the preset ratio
keys = list(data.keys())
train_blocks = {k: data[k] for k in keys[:int(len(keys) * ratio)]}
test_blocks = {k: data[k] for k in keys[int(len(keys) * ratio):]}
return train_blocks, test_blocks
This is what the data for the first ten words of the first training block looks like:
train_blocks, test_blocks = collect_data(hebrew_blocks)
[words for words in train_blocks.values()][0][:10]
The resulting train and test blocks contain the following three features for each word: its position in the phrase atom ('p' or 'x'), its (customised) part of speech, and its word node.
The following two functions create the input and output sequences for the train and test set. In addition to this, each unique input and output value for every single word is collected in the input and output vocabularies. Lastly, the maximum length of the input and output sequences is calculated. These parameters are useful for choosing the dimensions of the neural network.
def prep_train_data(train_blocks):
ip_pos_seq = []
op_ph_seq = []
ip_pos_voc = set()
op_ph_voc = set()
# iterates over all training blocks
for train_word_nodes in train_blocks.values():
# iterates over all words except the last 8,
# this way the last sequence won't run out of words
# and have exactly 9 words
for w in range(len(train_word_nodes[:-8])):
# the following lines collect the training data
# for 9 consecutive words in a list
# input data: part of speech
pos = [train_word_nodes[w][1] for w in range(w, w + 9)]
# output data: position in phrase atom
ph_atom = [
train_word_nodes[w][0] for w in range(w, w + 9)
]
# adds the start and stop symbol
ph_atom = ['\t'] + ph_atom + ['\n']
# collects the input and output for this word (w)
# in a list
ip_pos_seq.append(pos)
op_ph_seq.append(ph_atom)
# collects all unique input and output values in vocabularies
for p in pos:
ip_pos_voc.add(p)
for ph in ph_atom:
op_ph_voc.add(ph)
    # sorts the vocabularies and converts them into lists
ip_pos_voc = sorted(list(ip_pos_voc))
op_ph_voc = sorted(list(op_ph_voc))
    # calculates the maximum length of input and output sequences
max_len_ip = max([len(pos) for pos in ip_pos_seq])
max_len_op = max([len(ph) for ph in op_ph_seq])
# shuffles all sequences randomly
ip_pos_seq, op_ph_seq = shuffle(ip_pos_seq, op_ph_seq)
return ip_pos_seq, op_ph_seq, ip_pos_voc, op_ph_voc, max_len_ip, max_len_op
This is what the first four input and output sequences look like.
ip_pos_seq, op_ph_seq = prep_train_data(train_blocks)[:2]
for i in range(0, 4):
print(ip_pos_seq[i])
print(op_ph_seq[i])
def prep_test_data(test_blocks):
ip_pos_test = {}
op_ph_test = {}
for block_idx, test_word_nodes in test_blocks.items():
ip_pos_test_block = []
op_ph_test_block = []
for w in range(len(test_word_nodes[:-8])):
# collects test data
pos = [test_word_nodes[w][1] for w in range(w, w + 9)]
ph_atom = [test_word_nodes[w][0] for w in range(w, w + 9)]
ip_pos_test_block.append(pos)
op_ph_test_block.append(ph_atom)
ip_pos_test[block_idx] = ip_pos_test_block
op_ph_test[block_idx] = op_ph_test_block
return ip_pos_test, op_ph_test
The data then has to be transformed, because the neural network can only handle numerical data. The following dictionaries map the input and output vocabularies to integers and back, so that the numerical data can later be converted back to the original values.
def create_dicts(ip_pos_voc, op_ph_voc):
    # maps the input vocabulary of parts of speech to indices
ip_idx2pos = {}
ip_pos2idx = {}
for k, v in enumerate(ip_pos_voc):
ip_idx2pos[k] = v
ip_pos2idx[v] = k
    # maps the output vocabulary of phrase atom positions to indices
op_idx2ph = {}
op_ph2idx = {}
for k, v in enumerate(op_ph_voc):
op_idx2ph[k] = v
op_ph2idx[v] = k
return ip_idx2pos, ip_pos2idx, op_idx2ph, op_ph2idx
Because the input and output data are categorical, the data is one-hot encoded. This means that each input value is represented by an array with as many entries as there are values in the input vocabulary. The array consists of zeros, with a 1 at the position of the integer value of the input. The encoding of a single word might look like
[1, 0, 0, ... , 0, 0]
which corresponds to the first entry (index 0) of the sorted input vocabulary, 'adjv_a', an adjective in the absolute state.
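For illustration (with a made-up mini vocabulary rather than the real one, and using numpy as imported above), the one-hot encoding of a single value could be produced like this:
# illustration only: one-hot encoding a single part of speech with a made-up mini vocabulary
toy_voc = ['adjv_a', 'conj', 'subs_a', 'verb_a']
toy_pos2idx = {pos: idx for idx, pos in enumerate(toy_voc)}
one_hot = np.zeros(len(toy_voc), dtype='float32')
one_hot[toy_pos2idx['conj']] = 1
print(one_hot)  # [0. 1. 0. 0.]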
def one_hot_encode(max_len_ip, max_len_op, ip_pos_voc, op_ph_voc, ip_pos2idx,
op_ph2idx, ip_pos_test, op_ph_seq):
# creates three-dimensional numpy arrays
one_hot_ip = np.zeros(shape=(len(ip_pos_test), max_len_ip, len(ip_pos_voc)),
dtype='float32')
one_hot_op = np.zeros(shape=(len(ip_pos_test), max_len_op, len(op_ph_voc)),
dtype='float32')
target_data = np.zeros((len(ip_pos_test), max_len_op, len(op_ph_voc)),
dtype='float32')
for i in range(len(ip_pos_test)):
for k, ps in enumerate(ip_pos_test[i]):
one_hot_ip[i, k, ip_pos2idx[ps]] = 1
for k, ph in enumerate(op_ph_seq[i]):
one_hot_op[i, k, op_ph2idx[ph]] = 1
# the decoder target data is ahead one timestep and does
# not include the start symbol
if k > 0:
target_data[i, k - 1, op_ph2idx[ph]] = 1
return one_hot_ip, one_hot_op, target_data
The following function creates the structure of the neural network, which has an encoder-decoder architecture. The encoder consists of an input layer with as many cells as the size of the input vocabulary of parts of speech, and two LSTM layers of 250 cells each. The input layer of the decoder has as many cells as the size of the output vocabulary of x's and p's. The decoder also has an LSTM layer of 250 cells, and a dense layer with exactly as many cells as the output vocabulary. The dense layer uses the softmax activation to normalise the outputs into a probability distribution.
def define_LSTM_model(ip_pos_voc, op_ph_voc):
# encoder model
encoder_input = Input(shape=(None, len(ip_pos_voc)))
encoder_LSTM = LSTM(250,
activation='relu',
return_state=True,
return_sequences=True)(encoder_input)
encoder_LSTM = LSTM(250, return_state=True)(encoder_LSTM)
encoder_outputs, encoder_h, encoder_c = encoder_LSTM
encoder_states = [encoder_h, encoder_c]
# decoder model
decoder_input = Input(shape=(None, len(op_ph_voc)))
decoder_LSTM = LSTM(250, return_sequences=True, return_state=True)
decoder_out, _, _ = decoder_LSTM(decoder_input,
initial_state=encoder_states)
decoder_dense = Dense(len(op_ph_voc), activation='softmax')
decoder_out = decoder_dense(decoder_out)
model = Model(inputs=[encoder_input, decoder_input], outputs=[decoder_out])
model.summary()
return encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model
Now that the model's architecture is defined, the next step is to feed the data into the model. First, stopping conditions are defined; when these are met, training stops. Then, the optimiser and loss function are set. Finally, the model is fed the training data and starts fitting itself to it.
def compile_and_train(model, one_hot_ip, one_hot_op, target_data, batch_size,
epochs, val_split):
# defines stop conditions
callback = EarlyStopping(monitor='val_loss',
patience=patience,
verbose=0,
mode='auto')
# defines optimizer
adam = Adam(lr=0.0008, beta_1=0.99, beta_2=0.999, epsilon=0.00000001)
# compiles the model
model.compile(optimizer=adam,
loss='binary_crossentropy',
metrics=['accuracy'])
# fits the model to the training data
model.fit(x=[one_hot_ip, one_hot_op],
y=target_data,
batch_size=batch_size,
epochs=epochs,
validation_split=val_split,
callbacks=[callback])
return model
The following script sets all parameters and then runs all functions mentioned above. The data is collected and pre-processed, and the network is defined and compiled. In the end, the model is fitted to the training data.
batch_size = 1024
epochs = 150
val_split = 0.05
patience = 3
ratio = 0.9
# collects the relevant parts of the Hebrew Bible
hebrew_blocks = create_hebrew_blocks()
# collects input and output data and creates training and test sets
train_blocks, test_blocks = collect_data(hebrew_blocks, ratio)
# creates training sequences
ip_pos_seq, op_ph_seq, ip_pos_voc, op_ph_voc, max_len_ip, max_len_op = prep_train_data(
train_blocks)
# creates test sequences
ip_pos_test, op_ph_test = prep_test_data(test_blocks)
# converts data to numerical data
ip_idx2pos, ip_pos2idx, op_idx2ph, op_ph2idx = create_dicts(
ip_pos_voc, op_ph_voc)
# one-hot encodes the data
one_hot_ip, one_hot_op, target_data = one_hot_encode(max_len_ip, max_len_op,
ip_pos_voc, op_ph_voc,
ip_pos2idx, op_ph2idx,
ip_pos_seq, op_ph_seq)
one_hot_test_data = {
block:
one_hot_encode(max_len_ip, max_len_op, ip_pos_voc, op_ph_voc, ip_pos2idx,
op_ph2idx, ip_pos_test[block], op_ph_seq)[0]
for block in test_blocks
}
# defines the model
encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model = define_LSTM_model(
ip_pos_voc, op_ph_voc)
# fits the model to the training data
model = compile_and_train(model, one_hot_ip, one_hot_op, target_data,
batch_size, epochs, val_split)
After 22 epochs, the stopping conditions were met and the model stopped training. It reached an accuracy of 98.67% on the validation set (5% of the training data that was set aside for self-evaluation). Although this is a decent result, it is more important to find out how accurate the model is on completely new data. This is where the test set, the 10% that was set apart at the beginning, comes in.
First, a few more functions are needed to convert input data into predicted outcomes. The function encoder_decoder_model() builds separate inference models from the trained encoder and decoder, so that output sequences can be generated one symbol at a time.
def encoder_decoder_model(encoder_input, encoder_states, decoder_LSTM, decoder_dense):
# encoder inference model
encoder_model_inf = Model(encoder_input, encoder_states)
# decoder inference model
decoder_state_input_h = Input(shape=(250, ))
decoder_state_input_c = Input(shape=(250, ))
decoder_input_states = [decoder_state_input_h, decoder_state_input_c]
decoder_out, decoder_h, decoder_c = decoder_LSTM(
decoder_input, initial_state=decoder_input_states)
decoder_states = [decoder_h, decoder_c]
decoder_out = decoder_dense(decoder_out)
decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
outputs=[decoder_out] + decoder_states)
return encoder_model_inf, decoder_model_inf
The function decode_seq() uses the trained model to predict output sequences. It takes one-hot encoded sequences of words as input.
def decode_seq(ip_seq, encoder_model_inf, decoder_model_inf, op_ph_voc,
op_ph2idx, op_idx2ph):
states_val = encoder_model_inf.predict(ip_seq)
target_seq = np.zeros((1, 1, len(op_ph_voc)))
target_seq[0, 0, op_ph2idx['\t']] = 1
pred_ph = []
stop_condition = False
while not stop_condition:
decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(
x=[target_seq] + states_val)
max_val_index = np.argmax(decoder_out[0, -1, :])
sampled_out_char = op_idx2ph[max_val_index]
pred_ph.append(sampled_out_char)
if (sampled_out_char == '\n'):
stop_condition = True
target_seq = np.zeros((1, 1, len(op_ph_voc)))
target_seq[0, 0, max_val_index] = 1
states_val = [decoder_h, decoder_c]
return pred_ph
The function prediction_dict() converts the predicted outputs for sequences into predicted outputs for single words.
def prediction_dict(test_blocks, one_hot_test_data, op_ph_test):
decision_dict = {}
for block, block_seqs in test_blocks.items():
decision_dict_block = collections.defaultdict(list)
for seq_index in range(len(one_hot_test_data[block])):
ip_seq = one_hot_test_data[block][seq_index:seq_index+1]
pred_ph = decode_seq(ip_seq, encoder_model_inf, decoder_model_inf, op_ph_voc,
op_ph2idx, op_idx2ph)
if len(pred_ph[:-1]) == len(op_ph_test[block][seq_index]):
for pred_index in range(len(pred_ph[:-1])):
decision_dict_block[seq_index + pred_index].append(pred_ph[:-1][pred_index])
decision_dict[block] = decision_dict_block
return decision_dict
The function safe_div() divides two numbers and returns the result. If the denominator is zero, it returns zero. This function comes in handy when calculating percentages in the evaluation later.
def safe_div(numerator, denominator):
if denominator == 0:
return 0
else:
return numerator / denominator
The following function runs all words in the test set through the model and counts the correct and incorrect predictions. For the incorrect ones, it also registers the corresponding part of speech, to get insight into the performance of the model per input value.
def test_evaluation(test_blocks, decision_dict):
correct_test = 0
wrong_test = 0
bible_section = []
pos_dict = collections.defaultdict(lambda: collections.defaultdict(int))
cross_dict = collections.defaultdict(lambda: collections.defaultdict(int))
# iterates through all blocks
for block in test_blocks:
# iterates through all words
for key in range(len(test_blocks[block])):
w = test_blocks[block][key][2]
# collects all predictions for the word (up to 9)
data = collections.Counter(decision_dict[block][key])
# determines the most common prediction
pred = data.most_common(1)[0][0]
# counts each possible combination of true and predicted output
cross_dict[test_blocks[block][key][0]][pred] += 1
# if the prediction is correct
if test_blocks[block][key][0] == pred:
correct_test += 1
            # if the prediction is false
else:
wrong_test += 1
# registers the exact location in the BHSA of the misprediction
# along with information about the output
bible_section.append(
str(w) + " " + T.sectionFromNode(w)[0].replace("_", " ") +
" " + str(T.sectionFromNode(w)[1]) + ":" +
str(T.sectionFromNode(w)[2]) + " " +
test_blocks[block][key][1] + " " +
test_blocks[block][key][0] + " " +
data.most_common(1)[0][0])
                # registers the input corresponding with the misprediction
pos = test_blocks[block][key][1]
pos_dict[pos][pred] = pos_dict[pos].get(pred, 0) + 1
# creates an extensive evaluation of errors by part of speech
eval_by_pos = {}
for k in pos_dict.keys():
total_pos = len([
test_blocks[block][key][2] for block in test_blocks
for key in range(len(test_blocks[block]))
if test_blocks[block][key][1] == k
])
total_pos_ph_x = len([
test_blocks[block][key][2] for block in test_blocks
for key in range(len(test_blocks[block]))
if test_blocks[block][key][1] == k
and test_blocks[block][key][0] == 'x'
])
total_pos_ph_p = len([
test_blocks[block][key][2] for block in test_blocks
for key in range(len(test_blocks[block]))
if test_blocks[block][key][1] == k
and test_blocks[block][key][0] == 'p'
])
total_wrong = pos_dict[k]['x'] + pos_dict[k]['p']
pct_x = 100 * safe_div(pos_dict[k]['p'], total_pos_ph_x)
pct_p = 100 * safe_div(pos_dict[k]['x'], total_pos_ph_p)
pct_tot = 100 * \
safe_div(total_wrong, total_pos)
eval_by_pos[k] = {
"Total in Test Set": total_pos,
"Total Mistakes": total_wrong,
"Mistakes Percentage": pct_tot,
"Total 'x' in Test Set": total_pos_ph_x,
"Mistaken for '" + 'p' + "'": pos_dict[k]['p'],
"Percentage 'x'": pct_x,
"Total '" + 'p' + "' in Test Set": total_pos_ph_p,
"Mistaken for 'x'": pos_dict[k]['x'],
"Percentage '" + 'p' + "'": pct_p
}
eval_by_pos = {
item[0]: item[1]
for item in sorted(eval_by_pos.items(),
key=lambda x: (x[1]["Total Mistakes"]),
reverse=True)
}
df_eval_by_pos = pd.DataFrame.from_dict(eval_by_pos).T
int_cols = [
"Total in Test Set", "Total Mistakes", "Total 'x' in Test Set", "Mistaken for 'x'",
"Total '" + 'p' + "' in Test Set", "Mistaken for '" + 'p' + "'"
]
float_cols = [
"Mistakes Percentage", "Percentage 'x'", "Percentage '" + 'p' + "'"
]
    # creates a data frame containing the evaluation per part of speech
df_eval_by_pos[int_cols] = df_eval_by_pos[int_cols].applymap(np.int64)
df_eval_by_pos[float_cols] = df_eval_by_pos[float_cols].round(2)
    # creates a cross evaluation (confusion matrix) with the true labels as rows
    # and the predicted labels as columns, in a fixed order
    labels = ['p', 'x']
    cross_eval = [[cross_dict[true_label][pred_label] for pred_label in labels]
                  for true_label in labels]
    df_cross_eval = pd.DataFrame(
        cross_eval,
        index=["End of Phrase Atom", "Not End of Phrase Atom"],
        columns=["Predicted as End", "Predicted as Not End"])
eval_summary = {
"Correct Classifications":
correct_test,
"Misclassifications":
wrong_test,
"Accuracy":
round(100 * safe_div(correct_test, (correct_test + wrong_test)), 2)
}
print("Accuracy:",
round(100 * safe_div(correct_test, (correct_test + wrong_test)), 2))
# creates a dataframe of the cross evaluation
df_eval_summary = pd.DataFrame(eval_summary, index=["Value"])
return df_eval_by_pos, df_cross_eval, df_eval_summary, bible_section
The following script runs the previous functions to predict the outcomes for the test set, evaluates the results, and displays them in tables.
# creates the encoder and decoder inference model
encoder_model_inf, decoder_model_inf = encoder_decoder_model(
encoder_input, encoder_states, decoder_LSTM, decoder_dense)
# creates the decision dictionary containing up to 9 predicted outcomes for each word
decision_dict = prediction_dict(test_blocks, one_hot_test_data, op_ph_test)
# evaluates the results and publishes the results in tables
df_eval_by_pos, df_cross_eval, df_eval_summary, bible_section = test_evaluation(
test_blocks, decision_dict)
df_eval_summary
df_cross_eval
The model was able to predict the phrase atom boundaries of the test set correctly for 96.46% of the words. It is important to analyse the model's performance further, so the results are evaluated in more detail. The following table shows the errors per part of speech:
df_eval_by_pos
Most of the mistakes were made for words that are conjunctions or substantives in the absolute state (416 and 493 errors, respectively). In relative terms, most errors occurred for verbs in the construct state and for adverbs (24.67% and 15.74%).
For a complete list of incorrect predictions, see the end of this notebook.
The final goal of this notebook was to predict phrase atom boundaries for the DSS package. For that reason, the model is tested on one scroll, namely the Community Rule (1QS). First, the extra-biblical package that contains this scroll is imported:
from tf.fabric import Fabric
TF = Fabric(locations='C:/Users/Mark/text-fabric-data/etcbc/extrabiblical/tf/0.2')
api = TF.load('''
otype mother lex st typ code function rela det txt prs kind vs vt sp book chapter verse label language
''')
api.makeAvailableIn(globals())
The following steps of collecting and pre-processing the data are similar to the steps taken earlier when the model was trained on the BHSA. The main difference is that - this time - only one data set is created, which is the test set. As the model has already been trained, a training set is no longer needed.
Moreover, there are some important differences between the structure of the data of the BHSA and that of the extra-biblical package. For instance, omissions in the manuscript are encoded as separate word nodes (with the lexeme '='), which have to be removed, and blocks that end up shorter than the sequence length have to be filtered out.
Most functions that were used before on the BHSA can be used again without alterations. Because of the differences mentioned above, only the bundling of usable segments of consecutive words, the collection of input and output data, and the preparation of the test set need to be programmed differently.
def create_dss_blocks(test_book=['B_1QS']):
dss_blocks = collections.defaultdict(list)
chapters = [
chap for chap in F.otype.s("chapter") if F.book.v(chap) in test_book
]
block_index = 0
# iterates over all chapters and collects all words except the elided-he
for chap in chapters:
chap_words = [w for w in L.d(chap, "word") if F.g_cons.v(w) != '']
block = []
# detects and removes omissions and splits blocks when they occur
for word in range(len(chap_words)):
if F.lex.v(chap_words[word]) == '=':
if block != []:
dss_blocks[block_index] = block
elif F.lex.v(chap_words[word]) != '=':
block.append(chap_words[word])
block_index += 1
if block != []:
dss_blocks[block_index] = block
block_index += 1
# filters out blocks that are shorter than the sequence length (9)
dss_blocks = {block: words for block, words in dss_blocks.items() if len(words) >= 9}
# shuffles the blocks randomly
indexes = shuffle(list(dss_blocks.keys()))
dss_blocks = {k: dss_blocks[k] for k in indexes}
return dss_blocks
def collect_dss_data(dss_blocks):
dss_data = {}
# iterates through all blocks
for block_idx, block_words in dss_blocks.items():
block_data = []
# iterates through all words
for w in block_words:
block_data.append([position_in_phrase_atom(w), get_pos(w), w])
dss_data[block_idx] = block_data
return dss_data
def prep_test_data(dss_data):
ip_pos_dss = {}
op_ph_dss = {}
# iterates through dss blocks
for block in dss_data:
ip_pos_dss_block = []
op_ph_dss_block = []
dss_words = dss_data[block]
for w in range(len(dss_words[:-8])):
# collects dss data
pos = [dss_words[w][1] for w in range(w, w + 9)]
ph_atom = [dss_words[w][0] for w in range(w, w + 9)]
ip_pos_dss_block.append(pos)
op_ph_dss_block.append(ph_atom)
ip_pos_dss[block] = ip_pos_dss_block
op_ph_dss[block] = op_ph_dss_block
return ip_pos_dss, op_ph_dss
The following script runs all necessary functions to create the input data for the DSS and to run it through the model. The resulting outcomes are shown in tables similar to those of the test set of the BHSA.
test_book = ['B_1QS']
# creates test data
dss_blocks = create_dss_blocks(test_book)
dss_data = collect_dss_data(dss_blocks)
# prepares test data
ip_pos_dss, op_ph_dss = prep_test_data(dss_data)
# one-hot encodes test data
one_hot_dss_data = {
block: one_hot_encode(max_len_ip, max_len_op, ip_pos_voc, op_ph_voc, ip_pos2idx,
op_ph2idx, ip_pos_dss[block], op_ph_seq)[0]
for block in dss_data
}
# creates prediction dictionary
decision_dict_dss = prediction_dict(dss_data, one_hot_dss_data, op_ph_dss)
df_eval_by_pos_dss, df_cross_eval_dss, df_eval_summary_dss, bible_section_dss = test_evaluation(
dss_data, decision_dict_dss)
df_eval_summary_dss
df_cross_eval_dss
In the end, the model trained on the BHSA is able to predict the phrase atom boundaries of the Qumran Community Rule (1QS) with 94.47% accuracy.
df_eval_by_pos_dss
Most of the mistakes were made for words that are conjunctions or substantives in the absolute state (144 and 224 errors, respectively). In relative terms, most errors occurred for verbs in the construct state and for adverbs (45.19% and 25.53%). The high error rate for interjections is not meaningful, as interjections occur only 6 times in 1QS, of which 3 are predicted incorrectly. The results are strikingly similar to those of the test set of the BHSA. This could mean that the Hebrew of 1QS is not that different from the Hebrew of the BHSA.
In conclusion, a sequence-to-sequence neural network with an LSTM encoder-decoder is quite capable of finding relations between parts of speech and phrase atom boundaries. For further research, one could build upon this model to predict phrase functions, for instance. In fact, these kinds of models could be used in the field of ancient languages for many more applications, such as manuscript clustering, feature parsing, or addressing questions of authorship, dating, and much more.
For the sake of completeness, here follows the complete list of wrong predictions, both on the test set of the BHSA and on 1QS. Each line shows the word node, the verse, the part of speech, and the correct and predicted position in the phrase atom:
for error in bible_section:
print(error)
for error in bible_section_dss:
print(error)