GSCL-Kaleidoskop

### Using (parallel) texts for language comparison

• In recent years, linguists have become aware of the necessity to collect significant amounts of primary data for as many languages as possible (cf. Abney and Bird, 2010)
• While parallel text corpora (bitexts) have been popular among computational linguists since the advent of statistical machine translation (Brown et al., 1988), there have also been some efforts to compile parallel texts in more than two languages.
• The most widely used multilingual text is the Europarl corpus (http://www.statmt.org/europarl/), a collection of proceedings of the European Parliament, which includes versions in 21 European languages

### Using (parallel) texts for language comparison (cont'd)

• There also exist parallel texts for literary works (e.g. Harry Potter, Le Petit Prince, Master i Margarita), mostly available for a set of closely related languages
• However, only very few of them are freely available or can be regarded as massively parallel texts in the strict sense (Cysouw and Wälchli, 2007).

### Using (parallel) Bible texts

• No other book has been translated into so many languages over such a long period of time as the Bible
• Starting with its first translation, the so-called Septuagint, in 300 BC, the Bible is to the present day the object of the most intense translation activity worldwide (Noss, 2007)
• A growing number of Bible translations are now available in electronic form on the internet. Yet so far there has been no large-scale parallel Bible corpus that gives researchers easy access to Bible texts (but see Resnik et al., 1999, for an earlier effort to collect such a corpus)

### Bible statistics

• The Protestant biblical canon comprises 66 books of varying textual styles, ranging from poetry to prose literature and legal documents
• The 66 books are divided into 1,189 chapters and 31,102 verses (The statistics are based on the 1769 edition of the 1611 King James Bible as presented on http://www.biblebelievers.com/believers-org/kjv-stats.html, accessed on April 22nd, 2013.)

| Continent or Region | Portions | Testaments | Bibles | Total |
|---|---|---|---|---|
| Africa | 227 | 334 | 182 | 743 |
| Asia | 207 | 265 | 146 | 618 |
| Australia/New Zealand/Pacific Islands | 138 | 271 | 40 | 449 |
| Europe | 107 | 41 | 62 | 210 |
| North America | 41 | 30 | 8 | 79 |
| Caribbean islands/Central America/Mexico/South America | 101 | 299 | 36 | 436 |
| Constructed Languages | 2 | 0 | 1 | 3 |
| TOTAL | 823 | 1,240 | 475 | 2,538 |

Statistical summary of languages in which at least one book of the Bible had been registered as of December 31, 2001 (Source: http://www.unitedbiblesocieties.org)

### Paralleltext.info

• Current status of the Bible corpus
• Formats used
• File format
• File name conventions (BCP 47)
• Web interface
• Collaboration

### Current status of the Bible corpus

• Version 1.0
• Statistics on ISO codes and translations
(different translations, different diachronic stages, different dialects)
• Map of languages in the corpus
• We have checked for duplicate translations
(wrong language names/ISO codes, different formatting)

### Statistics on the Bible corpus

• 994 texts, 839 different language codes
• Number of translations per resource:

| Resource | Number of translations |
|---|---|
| bible.is | 372 |
| scriptureearth.org | 197 |
| pngscriptures.org | 188 |
| Dahl | 140 |
| Unboundbible | 97 |

• Average number of verses per translation:
10,707 (SD: 7,727)
• Total number of different verses: 41,964 (compared to 31,102 in the KJV)

### Languages in the Bible corpus

(839 different codes)

### Statistics on the Bible corpus II

• translation with highest number of verses:
eng-x-bible-engkj [36,986]
• translation with lowest number of verses:
wed-x-bible-wedau (Wedau) [677]
• verse with highest number of translations:
41001007 [976] (18 missing!)
There is not a single verse that is available for all translations!
• average number of words per translation: 408,973 (SD: 367,572)
• average number of types per translation: 21,176 (SD: 15,134)
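
Counts of this kind can be recomputed from the verse-per-line files. The following is a minimal sketch, assuming each translation has already been loaded into a dictionary from verse ID to tokenized text. The toy corpus, the placeholder translation names and the helper `translation_stats` are hypothetical, and the slides do not say whether the reported SD is a population or a sample standard deviation (the sketch uses the population form).

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical toy corpus: translation name -> {verse ID -> verse text}.
# Names and texts are placeholders, not files from the actual corpus.
corpus = {
    "abc-x-bible-demo": {
        "41001001": "the beginning of the gospel",
        "41001007": "he preached",
    },
    "xyz-x-bible-demo": {
        "41001007": "er predigte und sprach",
    },
}

def translation_stats(verses):
    """Return (number of verses, number of tokens, number of types)."""
    # Tokens are whitespace-separated because punctuation is already split off.
    tokens = [tok for text in verses.values() for tok in text.split()]
    return len(verses), len(tokens), len(set(tokens))

stats = {name: translation_stats(verses) for name, verses in corpus.items()}

# Mean and (population) standard deviation of verses per translation.
verse_counts = [s[0] for s in stats.values()]
avg_verses, sd_verses = mean(verse_counts), pstdev(verse_counts)

# Coverage: in how many translations does each verse ID occur?
coverage = Counter(vid for verses in corpus.values() for vid in verses)
best_verse, best_count = coverage.most_common(1)[0]
missing = len(corpus) - best_count  # translations lacking the best-covered verse
```

The length of `coverage` gives the number of different verse IDs in the corpus, and `missing` corresponds to the "18 missing" figure reported above for verse 41001007.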

### Text Preparation

Texts are prepared with (automatic) linguistic analysis in mind (not theological use)
• Bare base texts
• No analysis included: all analysis will happen as stand-off
• No headings, no footnotes, no cross-references
• Unicode NFC and checks on character harmonization
• Punctuation separated from words (problematic step!)
• No harmonization of capitalization
• Missing lines: checking for inconsistent encoding of originals
• Combined translations: marked as empty verses
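
Two of the steps above, Unicode NFC normalization and the separation of punctuation from words, can be sketched in a few lines. This is an illustration under a deliberately naive assumption (every character that is neither a word character nor whitespace is treated as punctuation), not the procedure actually used for the corpus; the function name `prepare_verse` is hypothetical.

```python
import re
import unicodedata

# Naive assumption: any character that is neither a word character nor
# whitespace counts as punctuation and becomes a token of its own.
_PUNCT = re.compile(r"([^\w\s])")

def prepare_verse(text):
    """NFC-normalize a verse and separate punctuation from words."""
    text = unicodedata.normalize("NFC", text)
    text = _PUNCT.sub(r" \1 ", text)
    # Collapse the extra spaces introduced by the substitution.
    return " ".join(text.split())

example = prepare_verse("And he said, Come; see.")
```

The sketch also shows why this step is problematic: an apostrophe inside a word is split off just like a comma, so `prepare_verse("don't")` yields three tokens. Capitalization is left untouched, as stated above.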

### File format

(adapted from Östen Dahl and Bernhard Wälchli)
• The information about the book, chapter and verse number is structured as follows (e.g. line 3 below: 40-001-003)
• the first two digits represent the number of the book (e.g., 40 refers to the first book in the New Testament, the Gospel according to Matthew).
• the next three digits indicate the chapter (e.g., 001 refers to the first chapter in the book)
• the last three digits show the verse number (e.g., 003 refers to the third verse in the chapter)
40001001\tThe book of the generations of Jesus Chris...\n
40001002\tThe son of Abraham was Isaac ; and the so...\n
40001003\tAnd the sons of Judah were Perez and Zerah...\n
40001004\tAnd the son of Ram was Amminadab ; and th...\n
40001005\tAnd the son of Salmon by Rahab was Boaz ; ...\n
40001006\tAnd the son of Jesse was David the king ; ...\n
...
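
A line in this format can be split into its parts as follows. This is a minimal sketch, assuming every line consists of an eight-digit ID, a tab and the verse text; the names `Verse` and `parse_line` are hypothetical and not part of the corpus tools.

```python
from typing import NamedTuple

class Verse(NamedTuple):
    book: int
    chapter: int
    verse: int
    text: str

def parse_line(line):
    """Split one 'ID<TAB>text' line into book, chapter, verse and text."""
    verse_id, text = line.rstrip("\n").split("\t", 1)
    # The ID is expected to be BBCCCVVV: 2 + 3 + 3 digits.
    if len(verse_id) != 8 or not verse_id.isdigit():
        raise ValueError(f"unexpected verse ID: {verse_id!r}")
    return Verse(int(verse_id[:2]), int(verse_id[2:5]), int(verse_id[5:]), text)

parsed = parse_line("40001003\tAnd the sons of Judah were Perez and Zerah\n")
```

Here `parsed.book` is 40 (Matthew), `parsed.chapter` is 1 and `parsed.verse` is 3.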

### File name conventions

(according to language-naming convention of BCP 47)

• ISO-x-bible-TRANSLATION-VERSION
• ISO 639-3 code
• x: separator for private codes in BCP 47
• bible: tag for texts in the parallel Bible corpus
• TRANSLATION: tag for the specific translation
e.g. "wosera" vs. "maprik" (dialects of Ambulas), "elberfelder" (specific German translation)
• VERSION: version number of our corpus
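
File names following this convention can be taken apart with a regular expression. The sketch below assumes a three-letter lowercase ISO 639-3 code, a translation tag without hyphens and an optional trailing version; these restrictions, and the name `parse_name`, are assumptions made for illustration only.

```python
import re

# Assumed shape: ISO 639-3 code, "x", "bible", a translation tag without
# hyphens, and an optional version after a final hyphen.
_NAME = re.compile(
    r"^(?P<iso>[a-z]{3})-x-bible-(?P<translation>[A-Za-z0-9]+)(?:-(?P<version>.+))?$"
)

def parse_name(name):
    """Return the ISO code, translation tag and version of a corpus file name."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a parallel Bible corpus name: {name!r}")
    return match.groupdict()

with_version = parse_name("abc-x-bible-text-1.0")
without_version = parse_name("eng-x-bible-engkj")
```

When no version is present, as in `eng-x-bible-engkj`, the `version` entry is `None`.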

### Web application on

http://paralleltext.info/data/

(not yet online)

• Basic functionalities
• Browse translations (restricted to Book of Mark)
• Search text in translations and get parallel verses
(i.e. scrambled words per verse, no copyright)
• Alignment demo

### Collaboration on base texts

• We offer to be the central repository for the base texts

### Collaboration on analysis

• Addition of linguistic annotation should go via stand-off annotation
• automatic: stemming, morpheme segmentation, named-entity recognition etc ...
• manual linguistic: glossing, construction identification, etc ...
• Basic form: CSV file with five columns using character counts:
File name, verse number, start character, end character, annotation
• Example:

| File name | Verse no. | Start | End | Annotation |
|---|---|---|---|---|
| abc-x-bible-text-1.0 | 4003015 | 26 | 33 | Reflexive |

• Such files can be distributed independently of our central repository
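
Resolving such a stand-off annotation against the base text could look as follows. This is a sketch under stated assumptions: the file is comma-separated without a header row, the start offset is 0-based and the end offset is exclusive. The slides do not fix these details, and the base text, verse ID and function name `resolve` are hypothetical.

```python
import csv
import io

# Hypothetical base text and annotation file; IDs and content are made up.
base_texts = {
    ("abc-x-bible-text-1.0", "40001001"): "He washed himself in the river .",
}
annotation_csv = "abc-x-bible-text-1.0,40001001,10,17,Reflexive\n"

def resolve(rows, texts):
    """Yield (annotated string, label) for each stand-off annotation row."""
    for file_name, verse_no, start, end, label in rows:
        text = texts[(file_name, verse_no)]
        # Assumption: 0-based start offset, exclusive end offset.
        yield text[int(start):int(end)], label

resolved = list(resolve(csv.reader(io.StringIO(annotation_csv)), base_texts))
```

Because the annotation only stores offsets, the base text stays untouched and several independent annotation files can refer to the same verse.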

### Matrix representation

• The parallel text can then be encoded as three sparse matrices:
• $\mathbf{UL}$ (utterances $\times$ languages): which utterance belongs to which language?
• $\mathbf{UW}$ (utterances $\times$ words): which words occur in which utterance?
• $\mathbf{UL}$ is defined as
• $\mathbf{UL}_{ij} = 1$ if the utterance $i$ belongs to language $j$ and
$\mathbf{UL}_{ij} = 0$ if not.
Likewise for $\mathbf{UW}$.
• Note the similarity with the wordlist approach where sentences correspond to concepts, utterances to words and words to phonemes/graphemes.
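
The construction of $\mathbf{UL}$ and $\mathbf{UW}$ can be sketched on toy data. The example below stores each sparse binary matrix as the set of (row, column) pairs whose entry is 1. The toy utterances, the helper `index` and the decision to key words by (language, form) are assumptions made for illustration.

```python
# Hypothetical toy data: one entry per utterance (language, tokens).
utterances = [
    ("eng", ["the", "son", "of", "David"]),
    ("deu", ["der", "Sohn", "Davids"]),
    ("eng", ["the", "king"]),
]

def index(items):
    """Map each distinct item to a column number in order of first occurrence."""
    mapping = {}
    for item in items:
        mapping.setdefault(item, len(mapping))
    return mapping

languages = index(lang for lang, _ in utterances)
# Assumption: words are keyed by (language, form), so identical forms in
# different languages get different columns.
words = index((lang, tok) for lang, toks in utterances for tok in toks)

# Sparse binary matrices as sets of (row, column) pairs with value 1.
UL = {(i, languages[lang]) for i, (lang, _) in enumerate(utterances)}
UW = {
    (i, words[(lang, tok)])
    for i, (lang, toks) in enumerate(utterances)
    for tok in toks
}
```

With real data, the same index pairs could be handed to a sparse-matrix library; the set representation is only meant to make the definition of the entries explicit.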

### Matrix representation (cont'd)

• The matrix $\mathbf{WU}$ (words $\times$ utterances, i.e. the transpose of $\mathbf{UW}$) will be used to compute co-occurrence statistics for all pairs of words, both within and across languages. Basically, we define $\mathbf{O}$ (observed co-occurrences) and $\mathbf{E}$ (expected co-occurrences) as:
• $$\mathbf{O} = \mathbf{WU} \cdot \mathbf{WU}^T$$ $$\mathbf{E} = \mathbf{WU} \cdot \frac{\mathbf{1}_{SS}}{n} \cdot \mathbf{WU}^T$$
The symbol $\mathbf{1}_{ab}$ refers to a matrix of size $a \times b$ consisting only of ones.
• Assuming that the co-occurrence of words follows a Poisson process (Quasthoff and Wolff, 2002), the co-occurrence matrix $\mathbf{WW}$ (words $\times$ words) can be calculated as follows:
• $$\begin{aligned} \mathbf{WW} &= -\log\left[\frac{\mathbf{E}^{\mathbf{O}} \exp(-\mathbf{E})}{\mathbf{O}!}\right] \\ &= \mathbf{E} + \log \mathbf{O}! - \mathbf{O}\log \mathbf{E} \end{aligned}$$
• Based on the co-occurrence matrix $\mathbf{WW}$ we compute concrete alignments (many-to-many mappings between words) for each utterance separately, but for all languages at the same time (Mayer and Cysouw, 2012)
• For each utterance $U_i$ we take the subset of the similarity matrix $\mathbf{WW}$ only including those $n$ words that occur in the row $\mathbf{UW_i}$, i.e., only those words that occur in utterance $U_i$.
• $$\mathbf{WW}_{i} = \left( \begin{array}{ccc} ww_{11} & \dots & ww_{1n} \\ \vdots & \ddots & \vdots \\ ww_{n1} & \dots & ww_{nn} \end{array} \right)$$
• We then perform a partitioning on this subset of the similarity matrix $\mathbf{WW}$ (e.g., affinity propagation clustering; Frey and Dueck, 2007).
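
The computation of $\mathbf{O}$, $\mathbf{E}$ and $\mathbf{WW}$ can be sketched element-wise on toy data. The sketch rests on two assumptions that the slides do not spell out: the columns of the word-occurrence matrix are indexed by verse, so that words from different translations of the same verse co-occur, and $n$ is the number of such columns. Under these assumptions the entry of $\mathbf{E}$ for two words is the product of their frequencies divided by $n$. The toy verses and all function names are hypothetical, and the final partitioning step (e.g. affinity propagation) is not shown.

```python
import math
from collections import defaultdict
from itertools import product

# Hypothetical toy data: verse ID -> {language: tokens}.
verses = {
    "v1": {"eng": ["the", "son", "of", "Abraham"], "deu": ["der", "Sohn", "Abrahams"]},
    "v2": {"eng": ["the", "son", "of", "David"], "deu": ["der", "Sohn", "Davids"]},
    "v3": {"eng": ["the", "king"], "deu": ["der", "König"]},
}

# Assumption: one column per verse, holding all (language, word) pairs in it.
columns = {
    vid: {(lang, tok) for lang, toks in by_lang.items() for tok in toks}
    for vid, by_lang in verses.items()
}
n = len(columns)

freq = defaultdict(int)  # row sums: number of columns containing a word
O = defaultdict(int)     # observed co-occurrences, including the diagonal
for col in columns.values():
    for w in col:
        freq[w] += 1
    for a, b in product(col, repeat=2):
        O[a, b] += 1

def ww(a, b):
    """Entry of WW: negative log Poisson probability of the observed count."""
    o = O.get((a, b), 0)
    e = freq[a] * freq[b] / n  # entry of E under the assumptions above
    return e + math.lgamma(o + 1) - o * math.log(e)  # lgamma(o + 1) = log(o!)

def submatrix(vid):
    """WW restricted to the words of one verse, as input for partitioning."""
    ws = sorted(columns[vid])
    return ws, [[ww(a, b) for b in ws] for a in ws]

score = ww(("eng", "son"), ("deu", "Sohn"))
```

The function `submatrix` returns the word list together with the square matrix that a clustering method would take as input.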

### Thank you for your attention!

http://th-mayer.de | thomas.mayer@uni-marburg.de
Michael Cysouw | cysouw@uni-marburg.de