Willem van Peursen, Martijn Naaijer, Constantijn Sikkel, Mathias Coeckelbergs
“How can we make a Machine Learning (ML) based parser for the morphology of inflectional languages?” This question was the starting point for the project “Morphological Parser for Inflectional Languages Using Deep Learning” of the ETCBC and the Netherlands eScience Center.
In part, this project served a practical purpose: the creation of the ETCBC database of the Hebrew Bible, with all its levels of linguistic analysis, began in the 1970s and took about four decades. Since we intend to extend our scope to other corpora, such as the Peshitta and even larger parts of the vast body of Syriac literature, we need to accelerate the process of linguistically encoding texts.
The most exciting part of this project, however, was not its practical applicability in accelerating the encoding process. Rather, we wanted to contribute to answering underlying questions such as: how can the method of linguistic encoding developed at the ETCBC be made fruitful for other languages that also have a rich morphology? How can our work contribute to the large field of corpus linguistics, a field dominated to a large extent by the study of English, a language with relatively poor morphology?
Due to the dominance of English, linguistic corpora are usually parsed at the word level. For example, in the English sentence “he said”, the first word is annotated as a third person masculine singular pronoun and the second word as the simple past of “say”. That works fine for languages such as English. However, for languages with a rich morphology, called ‘inflectional’ (e.g. Semitic languages, Sanskrit) and ‘agglutinative’ (e.g. Turkish), it is rewarding to take morphemes, rather than words, as the basic units. Compare the Hebrew word וַיַּמְלִכֵהוּ (2 Samuel 2:9). In Hebrew this is only one graphic word (that is to say, a string of letters between spaces), but it corresponds to five words in English translation: “and they made him king”.
For this reason, the ETCBC developed an encoding system in which not the words, but the separate morphemes are encoded. Moreover, there is an important difference between agglutinative and inflectional languages. Both kinds of languages have a rich morphology, but in an agglutinative language such as Turkish, all morphemes are concatenated. Hence the word
anlamıyorum ‘I don’t understand’
is built from the verbal stem anla-, the negative suffix -m(ı), the present continuous tense indicator -(ı)yor, and the first person marker -(u)m.
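The purely concatenative character of agglutination can be made concrete with a toy Python snippet. The segmentation below follows the morpheme analysis given above; the concrete surface shapes of the suffixes (mı, yor, um) are the vowel-harmony variants that happen to occur in this word.

```python
# Toy illustration: in an agglutinative language such as Turkish, the surface
# word is simply the concatenation of its morphemes, each of which stays
# recognizable. Segmentation follows the example in the text:
#   stem, negation, present continuous, first person
morphemes = ["anla", "mı", "yor", "um"]
word = "".join(morphemes)
print(word)  # anlamıyorum
```

In an inflectional language this simple join would fail, because morphemes fuse and change shape; that is exactly the problem the ETCBC encoding addresses next.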
Inflectional languages, however, show fusion of morphemes, which requires marking not only the morphemes themselves, but also the relation between the paradigmatic form of each morpheme and its realization. This is precisely what is done in the morphological encoding of the ETCBC. To give one example: the Hebrew word וַיֹּ֥ולֶד ‘and he begat’ (Gen 5:3) shows fusion of the prefix of the imperfect, the prefix of the causative stem, and the first radical of the verbal lexeme. Moreover, with regular verbs the paradigmatically expected letter following the inflectional prefix is the first radical of the root (e.g. the Mem in וַיַּמְלִ֥יכוּ in Judges 9:6), but in this verb we find not a Yod but a Waw before the second radical. In the ETCBC morphological encoding this becomes:
In this encoding, the combination ](H] indicates that this is a form of the causative stem, even though this is not visible in the consonantal framework of this form (contrast the prefix in the perfect of the Hiphil). The combination (J&W indicates that there is a Waw here instead of the paradigmatically expected letter Yod, which is not realized in the surface form (a phenomenon typical of the class of Pe-Yodh verbs). All kinds of mechanisms may lie behind the variation between the paradigmatic forms and the realized forms, such as phonetic rules or historical changes. In the example under discussion, one might even argue that from a historical perspective the Waw is paradigmatically expected, because in Hebrew this Pe-Yodh verb derives from a Pe-Waw verb (cf. Arabic walada).
Accordingly, the morphological encoding of the ETCBC provides a great deal of linguistic information. It comprises all the word functions, but also the complete lexicon, because it links the realization of a word or morpheme with its paradigmatic form. Over the past decades, much research at the ETCBC has focused on the insights that this encoding has yielded for lexicography and grammar, for example regarding the lexeme status of suffixes, the treatment of masculine and feminine words in dictionaries, the treatment of numbers and verbal stems, and so on. This relates to questions such as: Why do most Hebrew and Syriac dictionaries include entries for words (one-letter prepositions or conjunctions) that are prefixed to a following word, but not for words (pronominal suffixes) that are attached to a preceding word? Why are feminine words (e.g., for “queen” or “lioness”) sometimes included in the entry of their masculine counterpart, whereas in the same dictionaries they receive their own entry in other cases? Should the so-called Shaphel forms be considered quadriradical verbs to be listed under the Shin, or rather be listed under the triradical root from which they derive? Why are tens (e.g., “forty”) sometimes listed in the dictionary entry of their base numeral (e.g., “four”), whereas in other cases they receive their own entry?
In this blogpost, we will not explain in detail the role of the exclamation marks, hyphens, brackets and slashes used in the encoding example given above. (For details see Constantijn Sikkel’s Brief description of the morphological encoding.) What is important to note, however, is that the various elements of the words (e.g., the prefix of the fem. 3rd sing. imperfect) are marked according to certain conventions, and that from this encoding the grammatical word functions can be calculated in a rule-based manner.
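To give an impression of what “calculated in a rule-based manner” can look like, here is a deliberately simplified Python sketch. The marker conventions it assumes (a `!` closing a preformative, a `[` closing the verbal stem) and the encoded strings are simplified stand-ins for illustration only, not the actual ETCBC syntax documented by Sikkel.

```python
import re

# Hypothetical, simplified marker scheme (NOT the real ETCBC conventions):
#   <preformative>!<stem>[   e.g. "J!MLK[" (imperfect) vs "MLK[" (perfect)
# Because the markers delimit each morpheme explicitly, word functions can
# be read off deterministically, with no statistical model involved.
def parse(encoded: str) -> dict:
    m = re.match(r"(?:(?P<pref>[A-Z]*)!)?(?P<stem>[A-Z]+)\[", encoded)
    if not m:
        raise ValueError("not a verb encoding")
    functions = {"stem": m.group("stem")}
    # Rule: a marked preformative signals the imperfect conjugation.
    functions["tense"] = "imperfect" if m.group("pref") else "perfect"
    return functions

print(parse("J!MLK["))  # {'stem': 'MLK', 'tense': 'imperfect'}
print(parse("MLK["))    # {'stem': 'MLK', 'tense': 'perfect'}
```

The point of the sketch is only the architecture: once the morpheme boundaries and paradigmatic forms are marked, deriving grammatical functions is a matter of fixed rules rather than guesswork.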
For the application of Machine Learning, it is also important to note that this string contains the letters of the Hebrew phrase (in transliterated format) and that the encodings are added in-line, rather than as flags or footnotes. The result is a concisely structured string. This means that the process of linguistic encoding can be conceptualized as the transformation of one string into another. Since both the input and the result of the morphological encoding are strings, the obvious approach to parsing texts with Machine Learning is to use a sequence-to-sequence (seq2seq) model. How we applied seq2seq models to the morphological encoding of Hebrew and Syriac texts will be the topic of Part II of this blogpost (to be published soon).
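As a minimal sketch of this seq2seq framing, the Python snippet below shows how a surface string and its encoded counterpart become the integer sequences an encoder-decoder model consumes. The training pair shown is invented for illustration and is not actual ETCBC data.

```python
# Hypothetical training pair: transliterated surface text (input) and the
# same text with encoding markers interleaved (target). Invented example,
# not a real ETCBC encoding.
pairs = [
    ("WJMLKHW", "W-J!MLK[HW"),
]

# A character-level vocabulary shared by encoder and decoder: every symbol
# occurring in either side of any pair gets an integer id.
chars = sorted({c for src, tgt in pairs for c in src + tgt})
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

def encode(s: str) -> list[int]:
    """Map a string to the id sequence a seq2seq model would consume."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map an id sequence back to a string (the model's output side)."""
    return "".join(itos[i] for i in ids)

print(decode(encode("W-J!MLK[HW")))  # W-J!MLK[HW
```

Because the encoding markers are in-line characters like any other, input and output live in one shared symbol space, which is exactly what makes the task a natural fit for character-level seq2seq models.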
Example taken from https://www.surfacelanguages.com/articles/turkish/learningturkish.html.