Creating a Massively Parallel Bible Corpus

LREC 2014, Reykjavik

Thomas Mayer / thomas.mayer@uni-marburg.de
Michael Cysouw / cysouw@uni-marburg.de

Using (parallel) texts for language comparison

  • While parallel text corpora (bitexts) have been popular among computational linguists since the advent of statistical machine translation (Brown et al., 1988), there have also been some efforts to compile parallel texts in more than two languages.
  • The most widely used multilingual text is the Europarl corpus (http://www.statmt.org/europarl/), a collection of proceedings of the European Parliament, which includes versions in 21 European languages.

Using (parallel) texts for language comparison (cont'd)

  • There also exist parallel texts for literary works (e.g. Harry Potter, Le Petit Prince, Master i Margarita), mostly available for a set of closely related languages
  • However, only very few of them are freely available or can be regarded as massively parallel texts in the strict sense (Cysouw and Wälchli, 2007).

Using (parallel) Bible texts

  • No other book has been translated into so many languages over such a long period of time as the Bible
  • Starting with its first translation, the so-called Septuagint, in 300 BC, the Bible is to the present day the object of the most intense translation activity worldwide (Noss, 2007)
  • A growing number of Bible translations are now available in electronic form on the internet. Yet until now there has been no large-scale parallel Bible corpus that gives researchers easy access to Bible texts (but see Resnik et al., 1999, for an earlier effort to collect such a corpus)

Bible statistics

  • The Protestant biblical canon comprises 66 books of varying textual styles
  • The 66 books are divided into 1,189 chapters and 31,102 verses (The statistics are based on the 1769 edition of the 1611 King James Bible as presented on http://www.biblebelievers.com/believers-org/kjv-stats.html, accessed on April 22nd, 2013.)
Continent or Region                                      Portions  Testaments  Bibles  Total
Africa                                                        227         334     182    743
Asia                                                          207         265     146    618
Australia/New Zealand/Pacific Islands                         138         271      40    449
Europe                                                        107          41      62    210
North America                                                  41          30       8     79
Caribbean islands/Central America/Mexico/South America        101         299      36    436
Constructed Languages                                           2           0       1      3
TOTAL                                                         823       1,240     475  2,538

Statistical summary of languages in which at least one book of the Bible had been registered as of December 31, 2001 (Source: http://www.unitedbiblesocieties.org)

Paralleltext.info

  • Current status of the Bible corpus
  • Formats used
    • File format
    • File name conventions (BCP 47)
  • Web interface
  • Collaboration

Current status of the Bible corpus

  • Version 1.0
  • Statistics on ISO codes and translations
    (different translations, different diachronic stages, different dialects)
  • Map of languages in the corpus
  • We have made checks on duplicate translations
    (wrong language names/ISO codes, different formatting)
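One way such a duplicate check could work is sketched below. This is a minimal illustration, not the procedure actually used for the corpus: it assumes each translation is held as a mapping from verse ID to verse text, and it only catches translations that become identical once case, spacing and punctuation are ignored. The names `fingerprint` and `find_duplicates` and the sample texts are hypothetical.

```python
import hashlib
import unicodedata
from collections import defaultdict
from typing import Dict, List

Translation = Dict[str, str]  # verse ID -> verse text


def fingerprint(verses: Translation) -> str:
    """Hash a translation after removing formatting differences."""
    digest = hashlib.sha256()
    for verse_id in sorted(verses):
        text = unicodedata.normalize("NFC", verses[verse_id]).casefold()
        # Keep letters and digits only, so spacing and punctuation do not matter.
        letters = "".join(ch for ch in text if ch.isalnum())
        digest.update(f"{verse_id}\t{letters}\n".encode("utf-8"))
    return digest.hexdigest()


def find_duplicates(corpus: Dict[str, Translation]) -> List[List[str]]:
    """Group translation names that share the same fingerprint."""
    groups = defaultdict(list)
    for name, verses in corpus.items():
        groups[fingerprint(verses)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical miniature corpus: "x" and "y" differ only in formatting.
corpus = {
    "x": {"41001001": "In the beginning , God"},
    "y": {"41001001": "in the beginning, god"},
    "z": {"41001001": "Something else"},
}
duplicates = find_duplicates(corpus)
```

Near-duplicates (for example, a revised edition of the same translation) would need a similarity measure rather than an exact hash.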

Statistics on the Bible corpus

Languages in the Bible corpus

(811 different codes)

Language families in the corpus

Statistics on the Bible corpus II

  • translation with highest number of verses:
    deu-x-bible-pattloch [36,204]
  • translation with lowest number of verses:
    wed-x-bible-wedau (Wedau) [677]
  • verse with highest number of translations:
    41001007 [903] (14 missing!)
    There is not a single verse that is available for all translations!
  • average number of words per translation: 304,345 (SD: 192,767)
  • average number of types per translation: 14,542 (SD: 13,130)
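Figures of this kind could be computed roughly as follows. The sketch assumes that a "word" is any whitespace-delimited token (so separated punctuation marks would count as tokens) and that the reported SD is the sample standard deviation; neither assumption is confirmed by the slides. The function names and the miniature corpus are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Dict, Tuple

Translation = Dict[str, str]  # verse ID -> verse text


def token_and_type_counts(verses: Translation) -> Tuple[int, int]:
    """Return (number of tokens, number of distinct types) for one translation."""
    counts = Counter()
    for text in verses.values():
        counts.update(text.split())
    return sum(counts.values()), len(counts)


def verse_coverage(corpus: Dict[str, Translation]) -> Counter:
    """Count, for every verse ID, how many translations contain a non-empty text."""
    coverage = Counter()
    for verses in corpus.values():
        coverage.update(vid for vid, text in verses.items() if text.strip())
    return coverage


# Hypothetical miniature corpus with two translations.
corpus = {
    "hypothetical-a": {"41001007": "a b a", "41001008": "c a"},
    "hypothetical-b": {"41001007": "d e f"},
}
token_counts = [token_and_type_counts(v)[0] for v in corpus.values()]
print("mean tokens:", mean(token_counts), "SD:", stdev(token_counts))
print("best-covered verse:", verse_coverage(corpus).most_common(1))
```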

Text Preparation

Texts are prepared with (automatic) linguistic analysis in mind (not theological use)
  • Bare base texts
  • No headings, no footnotes, no cross-references
  • No analysis included: all analysis will happen as stand-off
  • No harmonization of capitalization
  • Unicode NFC and checks on character harmonization
  • Punctuation separated from words (problematic step!)
  • Missing lines: checking for inconsistent encoding of originals
  • Combined translations: marked as empty verses
  • Collect metadata on translations and copyright

Example: Arifama-Miniafia [aai]

  • The right single quotation mark (U+2019) stands for the glottal stop (Wakefield, 1992).
  • 40008009: Anayabin ayu i roubabaruwen ana fair biyauumaim emaam , naatu baiyowayah etei ayu babumaim temaam , imih baiyowayan orot ta isan anao , ‘ Niimaim kwen , i boro nan , naatu orot ta isan anao , ‘ Iti imaim kuna , i boro nan , naatu au bowayan orot ta isan anao , iti kusinaf , i boro nasinaf , imih turawat kuo au orot boro nayawas . ”
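The example illustrates why separating punctuation is problematic: a character that is punctuation in one orthography can be a letter in another. A minimal sketch of NFC normalization plus punctuation separation with a per-language exception list might look like this. The punctuation set, the function name `prepare_verse` and the test strings are assumptions for illustration, not the corpus's actual rules or data.

```python
import re
import unicodedata

# Illustrative default set of punctuation marks (an assumption, not the corpus's list).
DEFAULT_PUNCTUATION = ".,;:!?()\"“”‘’"


def prepare_verse(text: str, word_characters: str = "") -> str:
    """NFC-normalize a verse and put spaces around punctuation marks.

    Characters listed in `word_characters` are treated as letters of the
    language and are therefore never split off.
    """
    text = unicodedata.normalize("NFC", text)
    marks = [ch for ch in DEFAULT_PUNCTUATION if ch not in word_characters]
    if not marks:
        return " ".join(text.split())
    pattern = re.compile("([" + re.escape("".join(marks)) + "])")
    spaced = pattern.sub(r" \1 ", text)
    # Collapse the runs of whitespace introduced by the substitution.
    return " ".join(spaced.split())


# Hypothetical strings: "a’b" stands in for a word containing a glottal stop.
naive = prepare_verse("a’b, c")
protected = prepare_verse("a’b, c", word_characters="’")
```

With the default set, the hypothetical word is wrongly split into three tokens; passing U+2019 as a word character keeps it intact while the comma is still separated.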

File format

(adapted from Östen Dahl and Bernhard Wälchli)
  • The information about the book, chapter and verse number is structured as follows (e.g. line 3 below: 40-001-003)
    • the first two digits represent the number of the book (e.g., 40 refers to the first book in the New Testament, the Gospel according to Matthew).
    • the next three digits indicate the chapter (e.g., 001 refers to the first chapter in the book)
    • the last three digits show the verse number (e.g., 003 refers to the third verse in the chapter)
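The eight-digit scheme described above can be decoded with simple string slicing. The following is a minimal sketch; the names `VerseRef`, `parse_verse_id` and `format_verse_id` are hypothetical and not part of the corpus tools.

```python
from typing import NamedTuple


class VerseRef(NamedTuple):
    book: int
    chapter: int
    verse: int


def parse_verse_id(verse_id: str) -> VerseRef:
    """Split an 8-digit verse ID (BBCCCVVV) into book, chapter and verse."""
    if len(verse_id) != 8 or not (verse_id.isascii() and verse_id.isdigit()):
        raise ValueError(f"expected 8 ASCII digits, got {verse_id!r}")
    return VerseRef(int(verse_id[:2]), int(verse_id[2:5]), int(verse_id[5:]))


def format_verse_id(ref: VerseRef) -> str:
    """Build the zero-padded 8-digit ID from book, chapter and verse numbers."""
    return f"{ref.book:02d}{ref.chapter:03d}{ref.verse:03d}"


ref = parse_verse_id("40001003")  # book 40 (Matthew), chapter 1, verse 3
```

Keeping the IDs as zero-padded strings, rather than integers, preserves the leading zero of books 1 to 9 and makes the IDs sort correctly as text.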

File format example

40001001\tThe book of the generations of Jesus Chris...\n
40001002\tThe son of Abraham was Isaac ; and the so...\n
40001003\tAnd the sons of Judah were Perez and Zerah...\n
40001004\tAnd the son of Ram was Amminadab ; and th...\n 
40001005\tAnd the son of Salmon by Rahab was Boaz ; ...\n 
40001006\tAnd the son of Jesse was David the king ; ...\n
...
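A file in this layout can be read with a few lines of code. The sketch below assumes UTF-8 text with one verse per line, a single tab between the verse ID and the text, and that any line not starting with a numeric ID can be skipped; the last point is an assumption, since the slides do not describe header or comment lines. The function name `read_verses` is hypothetical.

```python
import io
from typing import Dict, TextIO


def read_verses(handle: TextIO) -> Dict[str, str]:
    """Read 'ID<TAB>text' lines into a dict mapping verse ID to verse text."""
    verses = {}
    for line in handle:
        verse_id, _, text = line.rstrip("\n").partition("\t")
        if not verse_id.isdigit():
            continue  # assumed to be a header, comment or blank line
        verses[verse_id] = text.strip()
    return verses


# In-memory stand-in for a corpus file; the verse texts are abbreviated placeholders.
sample = io.StringIO(
    "40001001\tThe book of the generations ...\n"
    "40001002\tThe son of Abraham was Isaac ; ...\n"
    "40001003\t\n"
)
verses = read_verses(sample)
```

A verse ID followed by an empty text is kept with an empty string, which matches the convention of marking combined translations as empty verses.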

Books of the Bible

File name conventions

(according to language-naming convention of BCP 47)

  • ISO-x-bible-TRANSLATION-VERSION
    • ISO 639-3 code
    • x: singleton that introduces private-use subtags in BCP 47
    • bible: tag for texts in the parallel Bible corpus
    • TRANSLATION: tag for the specific translation
      e.g. "wosera" vs. "maprik" (dialects of Ambulas), "elberfelder" (specific German translation)
    • VERSION: version number of our corpus
  • Each verse can be referenced by its URL: e.g., http://paralleltext.info/data/eng-x-bible-darby-v1/41/001/001/
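The naming convention and the URL scheme can be handled with plain string operations. This sketch assumes that the TRANSLATION tag contains no hyphen and that the VERSION part may be absent; both are assumptions based on the names shown in these slides. The names `CorpusName`, `parse_corpus_name` and `verse_url` are hypothetical.

```python
from typing import NamedTuple, Optional


class CorpusName(NamedTuple):
    iso: str
    translation: str
    version: Optional[str]


def parse_corpus_name(name: str) -> CorpusName:
    """Split 'ISO-x-bible-TRANSLATION[-VERSION]' into its parts."""
    parts = name.split("-")
    if len(parts) < 4 or parts[1] != "x" or parts[2] != "bible":
        raise ValueError(f"not a corpus file name: {name!r}")
    version = "-".join(parts[4:]) or None
    return CorpusName(parts[0], parts[3], version)


def verse_url(name: str, verse_id: str,
              base: str = "http://paralleltext.info/data") -> str:
    """Build the URL of a verse from the file name and the 8-digit verse ID."""
    book, chapter, verse = verse_id[:2], verse_id[2:5], verse_id[5:]
    return f"{base}/{name}/{book}/{chapter}/{verse}/"


parsed = parse_corpus_name("eng-x-bible-darby-v1")
url = verse_url("eng-x-bible-darby-v1", "41001001")
```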

Web application on

http://paralleltext.info/data/

  • Basic functionalities
    • Browse translations (restricted to Book of Mark)
    • Search text in translations and get parallel verses
    • Download word lists (with frequencies)
    • Download sparse matrix of Words x Verses
      (i.e. scrambled words per verse, no copyright)
    • Download complete texts (password protected due to copyright)
  • Alignment demo
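The idea behind the Words x Verses matrix is that per-verse word counts discard word order, so the running text cannot be read off directly. The sketch below builds such a matrix in a simple dictionary-of-keys form; the on-disk format of the actual download is not described in the slides, so this representation, the function name `build_sparse_matrix` and the sample verses are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

SparseMatrix = Dict[Tuple[int, int], int]  # (word row, verse column) -> count


def build_sparse_matrix(
    verses: Dict[str, str],
) -> Tuple[List[str], List[str], SparseMatrix]:
    """Return (vocabulary, verse IDs, non-zero entries) for one translation."""
    verse_ids = sorted(verses)
    vocabulary = sorted({word for text in verses.values() for word in text.split()})
    row_of = {word: row for row, word in enumerate(vocabulary)}
    entries: SparseMatrix = {}
    for column, verse_id in enumerate(verse_ids):
        for word, count in Counter(verses[verse_id].split()).items():
            entries[(row_of[word], column)] = count
    return vocabulary, verse_ids, entries


# Hypothetical two-verse translation.
vocabulary, verse_ids, entries = build_sparse_matrix(
    {"41001001": "the the word", "41001002": "word"}
)
```

Only non-zero cells are stored, which matters because most words occur in very few of the roughly 31,000 verses.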

Word and sparse matrix file


Collaboration on base texts

  • We offer to be the central repository for the base texts
    (adding new versions, correcting mistakes, updating metadata)
  • Do you have any corrections? Please leave a comment on the website
  • We welcome help with cleaning and preparation: please contact us!

Collaboration on analysis

  • Addition of linguistic annotation should go via stand-off annotation
    • automatic: stemming, morpheme segmentation, named-entity recognition, etc.
    • manual linguistic: glossing, construction identification, etc.
  • Basic form: CSV file with five columns using character counts:
    File name, verse number, start character, end character, annotation
  • File Name               Verse No    Start   End     Annotation          
    abc-x-bible-text-1.0    4003015     26      33      Reflexive              
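A stand-off annotation file of this shape can be applied to a base text without modifying it. The sketch below makes several assumptions that the slides do not settle: fields are comma-separated, character offsets are 0-based with an exclusive end, and a header row may be present. The names `Annotation`, `read_annotations` and `annotated_span`, the file name and the verse text are all hypothetical.

```python
import csv
import io
from typing import Dict, List, NamedTuple, TextIO


class Annotation(NamedTuple):
    file_name: str
    verse_id: str
    start: int
    end: int
    label: str


def read_annotations(handle: TextIO) -> List[Annotation]:
    """Read five-column rows; rows without numeric offsets (e.g. a header) are skipped."""
    annotations = []
    for row in csv.reader(handle):
        if len(row) != 5 or not (row[2].strip().isdigit() and row[3].strip().isdigit()):
            continue
        annotations.append(
            Annotation(row[0].strip(), row[1].strip(), int(row[2]), int(row[3]), row[4].strip())
        )
    return annotations


def annotated_span(verses: Dict[str, str], annotation: Annotation) -> str:
    """Return the stretch of base text that an annotation points at."""
    return verses[annotation.verse_id][annotation.start:annotation.end]


# Hypothetical base text and annotation file.
verses = {"40001001": "He washed himself in the river"}
standoff = io.StringIO(
    "File Name,Verse No,Start,End,Annotation\n"
    "hypothetical-x-bible-text-1.0,40001001,10,17,Reflexive\n"
)
annotations = read_annotations(standoff)
span = annotated_span(verses, annotations[0])
```

Because the offsets refer to characters in the base text, annotations stay valid only as long as the referenced version of the base text is unchanged, which is one reason for including the version in the file name.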

Thank you for your attention!
