Interactive Visualizations of Linguistic Data

Bochum, Germany, October 22nd, 2013

Thomas Mayer

More and more databases...

Anscombe's quartet

Source: Anscombe (1973: 19)

The best of both sides

Source: Keim, Andrienko, Fekete, Görg, Kohlhammer, and Melançon (2008:157)

Syntax trees

Genealogical trees

Source: Schleicher (1888:82), reproduced from Stewart (1976:6)


Source: Ladefoged (2001:37)

Linguistic Information Visualization

"The application of information visualization principles to display any kind of information concerning language and its use"

Source: Lyding and Culy (2012)


  • Part I: Motivation
  • Part II: Visualization techniques
  • Part III: Case studies
    1. PhonMatrix
    2. The World's Languages Explorer
    3. Colexification

Part II: Visualization techniques

with a focus on interactive techniques

Three major goals of visualization

  1. presentation
    (communicating ideas)
  2. confirmatory analysis
    (seeing old things in new ways, providing a way to interact with data)
  3. exploratory analysis
    (detecting patterns, creating and discovering ideas/knowledge)

Visual variables

"Data visualization is a process of mapping data to visuals." (Murray 2013:71)

Source: Bertin (1974:51)

Visual variables

Source: Bertin (1974:70f)

Preattentive processing (Healey)


Periodic table of visualization methods

Interactive visualization techniques

"Overview first, zoom and filter, then details-on-demand." (Shneiderman 1996)

Interactive map

Mayer and List (in prep.)

Interactive network

Mayer and List (in prep.)

Brushing and linking

Geometric Zooming

Semantic Zooming

Distortion techniques

  • Fisheye distortion
  • Cartograms:
    Cartograms obtain more space for regions with a high point density by distorting regions such that their size corresponds to a statistical feature (Bak et al. 2009).

Cartogram of the world's languages

Hessen Dialekterkenner

Why build (geo-)visualizations in the browser?

  • JavaScript is the language of the modern browser. As such, it is the most installed language in the world: the one language you can be confident is installed on the user's computer.
  • Together, the combination of JavaScript and SVG allows us to create sophisticated charts that are accessible by a majority of Internet users.

Source: Dewar, Mike. 2012. Getting Started with D3. O’Reilly Media, p. 1f.

A web-based visualization...

  • does not require any additional software (modern browsers are installed on almost all computers)
  • can be easily published on the internet (with no additional plugins needed)
  • can be easily enhanced with interactive features (e.g. mouse-over, mouse-click, etc.)
  • uses open-source technology
  • is scalable / can be zoomed in
  • can be exported as PDF
  • uses SVG (scalable vector graphics)
  • is easily customizable

D3 (Data-Driven Documents)

  • D3 is a JavaScript library for creating data visualizations.
  • D3 connects the data to web-based documents, meaning anything that can be rendered by a web browser, such as HTML and SVG.
  • The project is entirely open source and freely available. You may use, modify and adapt the code for noncommercial or commercial use at no cost.
  • Few lines of code: World Tour example

Source: Murray, Scott. 2013. Interactive Data Visualization for the Web. O'Reilly Media. p. 7.

Part III: Case studies

Case study 1: PhonMatrix

Mayer and Rohrdantz (2013)


Phonotactic constraints in languages abound. One of the most well-known and wide-spread constraints is commonly referred to as vowel harmony (van der Hulst and van de Weijer, 1995).

Likewise, in some languages there are patterns of consonant harmony (Hansson, 2010) that show a similar behavior with respect to consonants. Less common are cases of “synharmonism” (Trubetzkoy, 1967, p. 251) where both vowels and consonants form such groups and words usually only contain sounds from the same group (e.g., only front vowels and palatalized consonants).

In addition, there are disharmony constraints, the most famous of which is the principle of Similar Place Avoidance (SPA) in Semitic consonantal roots (Greenberg, 1950).


V/C distinction with Sukhotin's algorithm (Sukhotin 1962); Mayer and Rohrdantz (2013:75)
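
The vowel/consonant split can be sketched in a few lines of Python. This is a minimal sketch of Sukhotin's algorithm as it is commonly described, not the PhonMatrix code: it assumes words are strings of single-character symbols, and the function name `sukhotin` and the toy word list are illustrative only. Every symbol starts as a consonant; the symbol with the highest remaining adjacency sum is repeatedly declared a vowel until no positive sum is left.

```python
from collections import defaultdict


def sukhotin(words):
    """Split the symbols of a word list into (vowels, consonants)."""
    # Symmetric counts of neighbouring symbols; pairs of identical
    # symbols are skipped, so the diagonal stays zero.
    adjacency = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        for left, right in zip(word, word[1:]):
            if left != right:
                adjacency[left][right] += 1
                adjacency[right][left] += 1

    # Every symbol starts as a consonant; its score is its row sum.
    sums = {s: sum(adjacency[s].values()) for s in symbols}
    vowels = set()
    while True:
        candidates = {s: v for s, v in sums.items() if s not in vowels}
        if not candidates:
            break
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=candidates.get)
        if candidates[best] <= 0:
            break
        vowels.add(best)
        # Being adjacent to a known vowel counts against being a vowel.
        for s in candidates:
            if s != best:
                sums[s] -= 2 * adjacency[s][best]
    return vowels, symbols - vowels


# Toy word list (invented for illustration, not real language data).
vowels, consonants = sukhotin(["pata", "kina", "tiki", "napa", "piti", "kaka"])
```

The algorithm relies on vowels and consonants tending to alternate, so it can misclassify symbols in very small or perfectly symmetric samples.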

Co-occurrence statistics

With the distinction of symbols into vowels and consonants at hand, the user can then select a relevant context for the co-occurrence counts. The relevant context can be chosen from a list of predefined options (VCV and CVC).

The counts are then summarized in a square contingency table and can be used for further statistical analyses.
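
The counting step for the VCV context might look as follows. This is a sketch under stated assumptions: words are strings of single-character symbols, exactly one consonant separates the two vowels (the tool's own definition of the context may differ), and the names `vcv_counts` and `contingency_table` are hypothetical.

```python
from collections import Counter


def vcv_counts(words, vowels):
    """Count ordered vowel pairs separated by exactly one consonant."""
    counts = Counter()
    for word in words:
        for first, middle, last in zip(word, word[1:], word[2:]):
            if first in vowels and last in vowels and middle not in vowels:
                counts[(first, last)] += 1
    return counts


def contingency_table(counts, vowels):
    """Arrange the pair counts as a square table (rows: first vowel)."""
    order = sorted(vowels)
    table = [[counts[(row, col)] for col in order] for row in order]
    return order, table


# English words used purely as toy input.
vowel_set = {"a", "e", "i"}
order, table = contingency_table(vcv_counts(["banana", "pirate"], vowel_set), vowel_set)
```

The CVC context can be counted the same way by swapping the membership tests.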

In our experiments, two measures turned out to be especially useful for the detection of potential patterns: the probability and φ values. The φ value is a normalized χ² measure which allows for an easier mapping of values to the color scale because it is always between −1 and 1 (Manning and Schütze, 1999).

Mayer and Rohrdantz (2013:75)
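
For one cell of the contingency table, φ can be computed by collapsing the table into a 2×2 table (this pair, same row but other column, other row but same column, everything else). For such a 2×2 table, φ² = χ²/N. The sketch below uses the standard formula φ = (ad − bc) / √((a+b)(c+d)(a+c)(b+d)); the function name `phi` and the convention of returning 0.0 for an empty margin are assumptions made for illustration.

```python
import math


def phi(table, i, j):
    """Phi coefficient for cell (i, j) of a square contingency table."""
    total = sum(map(sum, table))
    row = sum(table[i])
    col = sum(r[j] for r in table)
    a = table[i][j]               # symbol i followed by symbol j
    b = row - a                   # symbol i followed by something else
    c = col - a                   # something else followed by symbol j
    d = total - row - col + a     # neither i first nor j second
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # An empty margin makes phi undefined; 0.0 is returned by convention here.
    return (a * d - b * c) / denom if denom else 0.0
```

Positive values indicate that a pair occurs more often than expected under independence, negative values that it occurs less often.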


Two additional steps have to be performed in order to arrive at the final matrix visualization:

1) the rows and columns of the matrix have to be sorted in a meaningful way;

2) the association measures have to be mapped to visual variables.

The order of symbols is determined by a clustering of the symbols based on the similarity of their row values. The clustering is performed with the Python scipy.cluster.hierarchy package from the SciPy library. As a default setting Ward’s algorithm (Ward, 1963) is used but other clustering algorithms can also be easily integrated.
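
A minimal sketch of this sorting step, assuming NumPy and SciPy are installed. `linkage` and `leaves_list` are functions of `scipy.cluster.hierarchy`; the wrapper `cluster_order`, its arguments, and the example values are hypothetical and not taken from PhonMatrix.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage


def cluster_order(symbols, matrix):
    """Reorder a square association matrix by Ward clustering of its rows."""
    rows = np.asarray(matrix, dtype=float)
    # Each row is treated as one observation (Euclidean distance).
    tree = linkage(rows, method="ward")
    # Leaf order of the dendrogram: similar rows end up next to each other.
    order = leaves_list(tree)
    # Apply the same permutation to rows and columns.
    reordered = rows[np.ix_(order, order)]
    return [symbols[i] for i in order], reordered


# Invented association values: "a"/"o" and "e"/"i" behave alike.
names, sorted_matrix = cluster_order(
    ["a", "e", "o", "i"],
    [[0.9, -0.8, 0.7, -0.9],
     [-0.8, 0.9, -0.9, 0.8],
     [0.7, -0.9, 0.9, -0.8],
     [-0.9, 0.8, -0.8, 0.9]],
)
```

Another clustering algorithm can be tried by changing the `method` argument, for example to `"average"` or `"complete"`.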

Whereas the preprocessing steps and the data-driven sorting of rows and columns have been written in Python (using the web framework), the actual visualization of the results in the browser is implemented in JavaScript using the D3 library (Bostock et al., 2011).

Mayer and Rohrdantz (2013:75f)
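
The second step, mapping an association measure to a visual variable, can be illustrated with a simple diverging color scale for φ. The concrete colors (red for negative, white for zero, blue for positive) and the function name `phi_to_hex` are assumptions for illustration; in the tool itself this mapping is done in the browser with D3.

```python
def phi_to_hex(value):
    """Map a phi value in [-1, 1] to a hex color on a red-white-blue scale."""
    # Clamp, so that rounding noise outside [-1, 1] cannot break the scale.
    value = max(-1.0, min(1.0, value))
    # The stronger the association, the less white is mixed in.
    fade = round(255 * (1 - abs(value)))
    if value >= 0:
        red, green, blue = fade, fade, 255   # positive: towards blue
    else:
        red, green, blue = 255, fade, fade   # negative: towards red
    return f"#{red:02x}{green:02x}{blue:02x}"
```

Here the sign of φ is carried by hue and its magnitude by saturation, so cells near zero fade into the background.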

Turkish vowels

Mayer and Rohrdantz (2013:77)

Finnish vowels

Mayer and Rohrdantz (2013:77)

Maltese consonants

Mayer and Rohrdantz (2013:78)

Case study 2: The World's Languages Explorer

Mayer et al. (in print)

Language similarity

In principle, there are four reasons why languages can share a certain feature (cf. Comrie 1989: 201):

  1. All languages share this feature (universal)
  2. Features are shared by chance
  3. Features are shared due to genealogical inheritance
  4. Features are shared due to areal contact (borrowing)

Analysis goals and tasks

  3. Features are shared due to genealogical inheritance
  4. Features are shared due to areal contact (borrowing)

To distinguish between 3. and 4., both genealogical (hierarchical) and areal (geo-spatial) information have to be combined in one visualization.

Two main tasks

1. Determining whether a given geographical distribution arises because the languages involved are closely related genealogically and inherited the shared feature from the proto-language, or because of language contact


2. Detecting outliers in a language family: languages whose feature value is unusual compared with the other members of the family, possibly because their geographical position led them to acquire the divergent value from neighboring languages

Linking geolocation and genealogical hierarchy

Selecting regions


SunBurst (Stasko and Zhang 2000); implemented in Java using Prefuse (Heer et al. 2005)

Sunburst with Germanic languages

Sunburst demo

Example: Languages of Papua New Guinea

Example: Sepik languages

Case study 3: Colexification

List et al. (2013), Mayer et al. (in prep.)

Modeling Cross-Linguistic Polysemies as Weighted Networks

The idea to model polysemies as networks itself is not new. It was already underlying Haspelmath’s (2003) semantic map approach which is used as a heuristic tool to analyze grammatical categories in linguistic typology. François (2008) applied this approach to the lexical domain, followed by further work by Croft et al. (2009), Perrin (2010), Cysouw (2010a, 2010b), and Steiner et al. (2011), who also introduced a simplified procedure to retrieve putative polysemies from semantically aligned word lists.

In less formal terms, we reconstruct a weighted network by representing all concepts in a given multilingual word list as nodes (vertices), and draw edges between all nodes that show up as polysemies in the data. The edge weights reflect the number of language families in which these polysemies are attested.

List et al. (2013: 2)
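
The network construction described above can be sketched as follows. This is a minimal sketch, assuming the word list is available as (family, language, concept, form) tuples and that two concepts count as colexified when one language uses the identical form for both; the function name `colexification_network` and the placeholder families, languages, and forms are invented for illustration. Only the German example comes from the slides.

```python
from collections import defaultdict
from itertools import combinations


def colexification_network(entries):
    """Return {(concept_a, concept_b): number of families} as weighted edges."""
    # Group concepts that share a form within one language.
    concepts_by_form = defaultdict(set)
    for family, language, concept, form in entries:
        concepts_by_form[(family, language, form)].add(concept)

    # Each pair of concepts sharing a form yields an edge; collecting the
    # families in a set counts every family at most once per edge.
    families_by_edge = defaultdict(set)
    for (family, _language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            families_by_edge[pair].add(family)

    return {edge: len(families) for edge, families in families_by_edge.items()}


entries = [
    ("Indo-European", "German", "arm", "arm"),
    ("Indo-European", "German", "poor", "arm"),
    # Placeholder data, not real languages or forms:
    ("FamilyA", "LanguageA1", "tree", "xa"),
    ("FamilyA", "LanguageA1", "wood", "xa"),
    ("FamilyA", "LanguageA2", "tree", "yo"),
    ("FamilyA", "LanguageA2", "wood", "yo"),
    ("FamilyB", "LanguageB1", "tree", "zu"),
    ("FamilyB", "LanguageB1", "wood", "zu"),
]
network = colexification_network(entries)
```

Counting families rather than languages keeps a single large family from inflating an edge weight.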

Data and analysis

Our analysis is based on a large multilingual word list consisting of 1252 glosses (“concepts”) translated into 195 different languages, covering 44 different language families (see Supplemental Material). The data was taken from three different sources, namely the Intercontinental Dictionary Series (IDS, Key and Comrie 2007, 133 languages), the World Loanword Database (WOLD, Haspelmath and Tadmor 2009, 30 languages), and a multilingual dictionary provided by the Logos Group (Logos, Logos Group 2008, 32 languages).

cf. German [arm] meaning both "poor" and "arm".

List et al. (2013: 3f)

Resulting network

Interactive network

Communities: groups of vertices within which the connections are dense but between which they are sparser (Newman 2004:4).
List et al. (2013) employed the Girvan-Newman algorithm for community detection (Girvan and Newman 2002).
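
The core idea of the Girvan-Newman algorithm is to repeatedly remove the edge with the highest betweenness until the network falls apart into communities. The following is a simplified sketch, not the implementation used by List et al. (2013): it ignores edge weights, performs only the first split, and all function names are illustrative.

```python
from collections import deque


def components(graph):
    """Connected components of an adjacency-set graph."""
    seen, parts = set(), []
    for start in graph:
        if start in seen:
            continue
        part, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in part:
                    part.add(neighbour)
                    queue.append(neighbour)
        seen |= part
        parts.append(part)
    return parts


def edge_betweenness(graph):
    """Shortest-path edge betweenness (Brandes-style) for an unweighted graph."""
    score = {}
    for source in graph:
        # Breadth-first search recording path counts and predecessors.
        dist, paths, preds, order = {source: 0}, {source: 1.0}, {source: []}, []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    paths[neighbour] = 0.0
                    preds[neighbour] = []
                    queue.append(neighbour)
                if dist[neighbour] == dist[node] + 1:
                    paths[neighbour] += paths[node]
                    preds[neighbour].append(node)
        # Walk back from the farthest nodes, distributing path credit.
        credit = {node: 0.0 for node in order}
        for node in reversed(order):
            for pred in preds[node]:
                share = paths[pred] / paths[node] * (1.0 + credit[node])
                edge = tuple(sorted((pred, node)))
                score[edge] = score.get(edge, 0.0) + share
                credit[pred] += share
    return score


def girvan_newman_split(edges):
    """Remove highest-betweenness edges until the graph gains a component."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    start = len(components(graph))
    while len(components(graph)) == start and any(graph.values()):
        score = edge_betweenness(graph)
        a, b = max(sorted(score), key=score.get)
        graph[a].discard(b)
        graph[b].discard(a)
    return components(graph)
```

Applied to two triangles joined by a single bridge, the bridge carries every shortest path between the two groups, so it is removed first and the triangles remain as the two communities.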

URL: (soon available online)


  • Visualizing linguistic data is an important research area
  • Python library for visualizing linguistic data
  • Evaluating visualizations

Selected references

Anscombe, F. J. 1973. Graphs in Statistical Analysis. The American Statistician, 27(1):17–21.

Bak, Peter, Matthias Schaefer, Andreas Stoffel, Daniel Keim, Itzhak Omer. 2009. Density equalizing distortion of large geographic point sets. Journal of Cartographic and Geographic Information Science (CaGIS), 36(3):237–250.

Bostock, Michael, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3: Data-driven documents. IEEE Transactions on Visualization & Computer Graphics (Proc. InfoVis), 17(12):2301–2309.

Dewar, Mike. 2012. Getting Started with D3. O’Reilly Media.

Keim, Daniel A., Gennady Andrienko, Jean-Daniel Fekete, Carsten Görg, Jörn Kohlhammer, and Guy Melançon. 2008. “Visual Analytics: definition, process, and challenges”. In: Information Visualization. Ed. by Andreas Kerren, John T. Stasko, Jean-Daniel Fekete, and Chris North. Berlin: Springer Verlag, pp. 154–175.

Ladefoged, Peter. 2001. Vowels and Consonants: An Introduction to the Sounds of Languages. Oxford: Wiley-Blackwell.

List, Johann-Mattis, Anselm Terhalle, and Matthias Urban. 2013. Using Network Approaches to Enhance the Analysis of Cross-Linguistic Polysemies. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013): Short Papers, March 20-22, Potsdam, Germany, pp. 347–353.

Mayer, Thomas, Bernhard Wälchli, Christian Rohrdantz, and Michael Hund. In print. From the extraction of continuous features in parallel texts to visual analytics of heterogeneous areal-typological datasets: An extended functional and algorithmic processing pipeline. In Language processing and grammars: The role of functionally oriented computational models (Studies in Language Companion Series). Amsterdam: John Benjamins.

Mayer, Thomas and Christian Rohrdantz. 2013. PhonMatrix: Visualizing co-occurrence constraints in sounds. In Proceedings of the ACL 2013 System Demonstration, pp. 73–78.

Murray, Scott. 2013. Interactive Data Visualization for the Web. O'Reilly Media.

Schleicher, August. 1888. Die deutsche Sprache. 5th ed. Stuttgart: J.G. Cotta'schen.

Shneiderman, Ben. 1996. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of the IEEE Symposium on Visual Languages, pp. 336–343, Washington. IEEE Computer Society Press.

Stewart, Ann Harleman. 1976. Graphic Representation of Models in Linguistic Theory. Indiana University Press.

Sukhotin, Boris V. 1962. Eksperimental’noe vydelenie klassov bukv s pomošč’ju EVM. Problemy strukturnoj lingvistiki, 234:189–206.

Ward, Joe H. Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244.

Thank you for your attention!