Bochum, Germany, October 22nd, 2013
Phonotactic constraints abound in the world's languages. One of the best-known and most widespread is commonly referred to as vowel harmony (van der Hulst and van de Weijer, 1995).
Likewise, some languages have patterns of consonant harmony (Hansson, 2010), which show similar behavior with respect to consonants. Less common are cases of “synharmonism” (Trubetzkoy, 1967, p. 251), where both vowels and consonants form such groups and words usually contain only sounds from the same group (e.g., only front vowels and palatalized consonants).
In addition, there are disharmony constraints, the most famous of which is the principle of Similar Place Avoidance (SPA) in Semitic consonantal roots (Greenberg, 1950).
Once the symbols have been divided into vowels and consonants, the user can select a relevant context for the co-occurrence counts from a list of predefined options (VCV and CVC).
The counts are then summarized in a square contingency table and can be used for further statistical analyses.
In our experiments, two measures turned out to be especially useful for the detection of potential patterns: the probability and φ values. The φ value is a normalized χ² measure which allows for an easier mapping of values to the color scale because it is always between −1 and 1 (Manning and Schütze, 1999).
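The counting and the φ computation can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the word list, the hand-specified vowel set, and the function names `vcv_pairs`, `cooccurrence_counts`, and `phi` are all assumptions made for the example, and the VCV context is taken here to mean two vowels separated by at least one consonant. For each cell of the contingency table, φ is computed from the 2×2 table "first symbol or not" by "second symbol or not".

```python
from collections import Counter
from math import sqrt

VOWELS = set("aeiou")  # assumption: vowel inventory specified by hand

# Toy word list, invented for illustration only.
WORDS = ["evleri", "kediler", "kuzular", "arabalar", "kitap"]


def vcv_pairs(word):
    """Return (V1, V2) pairs of vowels separated by at least one consonant."""
    pairs = []
    prev = None   # most recent vowel seen
    gap = False   # True once a consonant has followed that vowel
    for ch in word:
        if ch in VOWELS:
            if prev is not None and gap:
                pairs.append((prev, ch))
            prev, gap = ch, False
        else:
            gap = True
    return pairs


def cooccurrence_counts(words):
    """Contingency table as a Counter keyed by (first vowel, second vowel)."""
    counts = Counter()
    for word in words:
        counts.update(vcv_pairs(word))
    return counts


def phi(counts, first, second):
    """Phi coefficient of the 2x2 table for one cell; lies in [-1, 1]."""
    n = sum(counts.values())
    o11 = counts[(first, second)]
    row = sum(c for (a, _), c in counts.items() if a == first)
    col = sum(c for (_, b), c in counts.items() if b == second)
    o12, o21 = row - o11, col - o11
    o22 = n - row - col + o11
    denom = sqrt(row * (n - row) * col * (n - col))
    return (o11 * o22 - o12 * o21) / denom if denom else 0.0


counts = cooccurrence_counts(WORDS)
phi_aa = phi(counts, "a", "a")  # positive: a tends to be followed by a
phi_ea = phi(counts, "e", "a")  # negative: e is never followed by a here
```

In this toy data, pairs of vowels from the same harmony group receive positive φ values and mixed pairs receive values at or below zero, which is the kind of contrast the color scale is meant to expose.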
Two additional steps have to be performed in order to arrive at the
final matrix visualization:
1) the rows and columns of the matrix have to be sorted in a meaningful way;
2) the association measures have to be mapped to visual variables.
The order of symbols is determined by clustering the symbols according to the similarity of their row values. The clustering is performed with the scipy.cluster.hierarchy module of the Python SciPy library. By default, Ward’s algorithm (Ward, 1963) is used, but other clustering algorithms can easily be integrated.
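A minimal sketch of this sorting step, assuming a small hand-made φ matrix (the values and symbol labels below are invented for illustration): the rows are clustered with Ward's method via `scipy.cluster.hierarchy.linkage`, and the left-to-right leaf order of the resulting tree is used to reorder both rows and columns.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

symbols = ["a", "e", "i", "u"]

# Hypothetical phi values: rows are first symbols, columns are second symbols.
phi_matrix = np.array([
    [ 0.6, -0.5, -0.4,  0.5],   # a
    [-0.5,  0.6,  0.5, -0.4],   # e
    [-0.4,  0.5,  0.6, -0.5],   # i
    [ 0.5, -0.4, -0.5,  0.6],   # u
])

# Ward clustering on the row vectors (Euclidean distances between rows).
tree = linkage(phi_matrix, method="ward")

# Leaf order of the dendrogram gives the display order of the symbols.
order = leaves_list(tree)
sorted_symbols = [symbols[i] for i in order]

# Apply the same permutation to rows and columns of the matrix.
sorted_matrix = phi_matrix[np.ix_(order, order)]
```

Swapping in another algorithm only requires a different `method` argument to `linkage`, for example `"average"` or `"complete"`.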
In principle, there are four reasons why languages can share a certain feature (cf. Comrie 1989: 201):
To distinguish between 3. and 4., both genealogical (hierarchical) and areal (geo-spatial) information have to be combined in one visualization.
1. Determining whether a given geographical distribution arises because the languages involved are closely related genealogically and thereby inherited the shared feature from the proto-language, or because of language contact.
2. Detecting outliers in a language family, i.e., languages whose feature value is unusual with respect to the other members of the family, possibly because their geographical position led them to acquire the divergent value from their neighbors.
The idea of modeling polysemies as networks is not new in itself. It already underlies Haspelmath’s (2003) semantic map approach, which is used as a heuristic tool to analyze grammatical categories in linguistic typology. François (2008) applied this approach to the lexical domain, followed by further work by Croft et al. (2009), Perrin (2010), Cysouw (2010a, 2010b), and Steiner et al. (2011), who also introduced a simplified procedure to retrieve putative polysemies from semantically aligned word lists.
In less formal terms, we reconstruct a weighted network by representing all concepts in a given multilingual word list as nodes (vertices) and drawing edges between all nodes that show up as polysemies in the data. The edge weights reflect the number of language families in which these polysemies are attested.
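The construction can be sketched in a few lines. The data layout (tuples of language, family, word form, concept), the function name `polysemy_network`, and all entries other than German "arm" are assumptions invented for illustration; a putative polysemy is taken here to be one word form in one language that expresses more than one concept.

```python
from collections import defaultdict
from itertools import combinations

# (language, family, word form, concept); LangA-C and their forms are made up.
ENTRIES = [
    ("German", "Indo-European", "arm", "ARM"),
    ("German", "Indo-European", "arm", "POOR"),
    ("LangA", "FamilyX", "tala", "ARM"),
    ("LangA", "FamilyX", "tala", "HAND"),
    ("LangB", "FamilyX", "mopi", "ARM"),
    ("LangB", "FamilyX", "mopi", "HAND"),
    ("LangC", "FamilyY", "suk", "ARM"),
    ("LangC", "FamilyY", "suk", "HAND"),
    ("LangC", "FamilyY", "nemo", "POOR"),
]


def polysemy_network(entries):
    """Return (nodes, edges); edges map concept pairs to family counts."""
    # Collect the concepts expressed by each word form of each language.
    concepts_by_form = defaultdict(set)
    for language, family, form, concept in entries:
        concepts_by_form[(language, family, form)].add(concept)

    # Every pair of concepts sharing a form is a putative polysemy;
    # record the families in which the pair is attested.
    families_by_edge = defaultdict(set)
    for (_, family, _), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            families_by_edge[pair].add(family)

    nodes = {entry[3] for entry in entries}
    edges = {pair: len(fams) for pair, fams in families_by_edge.items()}
    return nodes, edges


nodes, edges = polysemy_network(ENTRIES)
```

Counting families rather than languages means that the two FamilyX languages contribute only once to the ARM-HAND edge, so a pattern inherited within a single family does not inflate the weight.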
Our analysis is based on a large multilingual word list consisting of 1252 glosses (“concepts”) translated into 195 different languages, covering 44 different language families (see Supplemental Material). The data was taken from three different sources, namely the Intercontinental Dictionary Series (IDS, Key and Comrie 2007, 133 languages), the World Loanword Database (WOLD, Haspelmath and Tadmor 2009, 30 languages), and a multilingual dictionary provided by the Logos Group (Logos, Logos Group 2008, 32 languages).
cf. German [arm] meaning both "poor" and "arm".
Communities: groups of vertices within which the connections are
dense but between which they are sparser (Newman 2004:4).
List et al. (2013) employed the Girvan-Newman algorithm for community detection (Girvan and Newman 2002).
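To illustrate the idea behind this algorithm, the following sketch repeatedly removes the edge with the highest edge betweenness and stops at the first split of the network. It is a simplified, unweighted pure-Python version written for this example: the function names and the toy concept graph are invented, edge weights are ignored, and it should not be read as the implementation used by List et al. (2013).

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    bet = defaultdict(float)
    for source in adj:
        dist = {source: 0}
        sigma = defaultdict(float)  # number of shortest paths from source
        sigma[source] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:                # breadth-first search from source
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):   # accumulate path shares back to source
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return bet


def components(adj):
    """Connected components as a list of node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        result.append(comp)
    return result


def first_split(edge_list):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = defaultdict(set)
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    adj = dict(adj)
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)  # recomputed after every removal
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy concept network (invented): two dense groups joined by one edge.
EDGES = [
    ("ARM", "HAND"), ("HAND", "FINGER"), ("ARM", "FINGER"),
    ("POOR", "SAD"), ("SAD", "MISERABLE"), ("POOR", "MISERABLE"),
    ("ARM", "POOR"),
]
communities = sorted(sorted(c) for c in first_split(EDGES))
```

The single ARM-POOR edge lies on every shortest path between the two groups, so it is removed first and the two dense groups emerge as communities. The full procedure continues splitting and needs an additional criterion to choose among the resulting partitions.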
Anscombe, F. J. 1973. Graphs in statistical analysis. The American Statistician, 27(1):17–21.
Bak, Peter, Matthias Schaefer, Andreas Stoffel, Daniel Keim, and Itzhak Omer. 2009. Density equalizing distortion of large geographic point sets. Journal of Cartographic and Geographic Information Science (CaGIS), 36(3):237–250.
Bostock, Michael, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3: Data-driven documents. IEEE Transactions on Visualization & Computer Graphics (Proc. InfoVis), 17(12):2301–2309.
Dewar, Mike. 2012. Getting Started with D3. O’Reilly Media.
Keim, Daniel A., Gennady Andrienko, Jean-Daniel Fekete, Carsten Görg, Jörn Kohlhammer, and Guy Melançon. 2008. Visual analytics: Definition, process, and challenges. In Andreas Kerren, John T. Stasko, Jean-Daniel Fekete, and Chris North, editors, Information Visualization, pages 154–175. Berlin: Springer Verlag.
Ladefoged, Peter. 2001. Vowels and Consonants: An Introduction to the Sounds of Languages. Oxford: Wiley-Blackwell.
List, Johann-Mattis, Anselm Terhalle, and Matthias Urban. 2013. Using network approaches to enhance the analysis of cross-linguistic polysemies. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013): Short Papers, March 20–22, Potsdam, Germany, pages 347–353.
Mayer, Thomas, Bernhard Wälchli, Christian Rohrdantz and Michael Hund. In print. From the extraction of continuous features in parallel texts to visual analytics of heterogeneous areal-typological datasets: An extended functional and algorithmic processing pipeline. In Language processing and grammars: The role of functionally oriented computational models (SLCS) (Serie: Studies in Language). Amsterdam: John Benjamins.
Mayer, Thomas and Christian Rohrdantz. 2013. PhonMatrix: Visualizing co-occurrence constraints in sounds. In Proceedings of the ACL 2013 System Demonstration, pages 73–78.
Murray, Scott. 2013. Interactive Data Visualization for the Web. O'Reilly Media.
Schleicher, August. 1888. Die deutsche Sprache. 5th ed. Stuttgart: J.G. Cotta'schen.
Shneiderman, Ben. 1996. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of the IEEE Symposium on Visual Languages, pages 336-343, Washington. IEEE Computer Society Press.
Stewart, Ann Harleman. 1976. Graphic Representation of Models in Linguistic Theory. Indiana University Press.
Sukhotin, Boris V. 1962. Eksperimental’noe vydelenie klassov bukv s pomoščju EVM. Problemy strukturnoj lingvistiki, 234:189–206.
Ward, Joe H. Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244.