CEPII - Language

Language provides new series for Common Official Language (COL), Common Spoken Language (CSL) and Common Native Language (CNL). For Linguistic Proximity, it provides LP1 and LP2, the unadjusted values used to construct two different measures, which we label PROX1 and PROX2. These series are available for 195 countries.

A do-file lets you construct our series for LP1 and LP2, and also lets you construct both variables from your own dataset. This matters because LP1 and LP2 are both data dependent. The same do-file also lets you construct a single Common Language index (CL), a variable which depends in turn on LP and is therefore data dependent as well.

Please read the related publication. If you don't find the answer to your question, please contact us.

Reference document to cite: Melitz, J. and Toubal, F. (2014) Native Language, Spoken Language, Translation and Trade. Journal of International Economics, Vol. 92, N°2: 351-363.

Person in charge & contact: Jacques Melitz & Farid Toubal, language

cepii.fr

Licence: Etalab 2.0

Download

Language Data (STATA)

Methodology

In these data, as elsewhere, Common Official Language (COL) is a binary measure, equal to either 0 or 1. The usual source for COL is the CIA World Factbook. We used it as well, but adopted a slightly broader definition of COL.

For Common Native Language (CNL) and Common Spoken Language (CSL), we required each language to be spoken by at least 4% of the population in two countries. The data on native and spoken languages are collected from various sources described in detail in the paper.
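The 4% retention rule can be illustrated with a minimal sketch. It assumes one reading of the rule: a language is kept when at least two countries each have 4% or more of their population speaking it. The function name `retained_languages`, the placeholder countries and languages, and the share values are all hypothetical and made up for illustration; this is not the authors' do-file.

```python
from collections import defaultdict


def retained_languages(shares, threshold=0.04, min_countries=2):
    """Return the languages that meet the retention rule.

    shares maps (country, language) to the share of the country's
    population speaking that language (a number between 0 and 1).
    A language is kept when at least `min_countries` countries have
    a share at or above `threshold` (assumed reading of the rule).
    """
    counts = defaultdict(int)
    for (country, language), share in shares.items():
        if share >= threshold:
            counts[language] += 1
    return {language for language, n in counts.items() if n >= min_countries}


# Made-up shares for placeholder countries A, B, C and languages x, y, z.
example_shares = {
    ("A", "x"): 0.90,
    ("B", "x"): 0.05,
    ("C", "x"): 0.01,
    ("A", "y"): 0.30,
    ("B", "y"): 0.03,
    ("C", "z"): 0.80,
}

print(sorted(retained_languages(example_shares)))  # ['x']
```

In this toy input only language x passes: y reaches 4% in one country only, and z is spoken in a single country.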

We constructed two separate measures of Linguistic Proximity, LP1 and LP2. LP1 is inspired by the idea in Fearon (2003) and Laitin (2000) of calculating linguistic proximities on the basis of the Ethnologue classification of language trees between trees, branches and sub-branches. For LP2, the source is an analysis of lexical similarity based on 40 words, compiled by the Automated Similarity Judgment Program (ASJP) project.
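The two ideas behind LP1 and LP2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the step values 0, 0.25, 0.5 and 0.75 for the tree-based score are assumed here, normalized edit distance is used as a stand-in for the ASJP similarity judgment, and the function names and placeholder classifications are hypothetical. The adjustment that turns LP1 and LP2 into PROX1 and PROX2 is not reproduced.

```python
def lp1_score(lang_a, lang_b):
    """Tree-based proximity for two languages.

    Each language is a (tree, branch, sub_branch) tuple.
    Assumed steps: 0 for different trees, 0.25 for same tree but
    different branches, 0.5 for same branch but different
    sub-branches, 0.75 for the same sub-branch.
    """
    tree_a, branch_a, sub_a = lang_a
    tree_b, branch_b, sub_b = lang_b
    if tree_a != tree_b:
        return 0.0
    if branch_a != branch_b:
        return 0.25
    if sub_a != sub_b:
        return 0.5
    return 0.75


def levenshtein(s, t):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lexical_similarity(words_a, words_b):
    """Mean of (1 - normalized edit distance) over aligned word lists.

    Assumption: words_a[i] and words_b[i] express the same meaning.
    """
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(words_a, words_b):
        longest = max(len(a), len(b))
        distance = levenshtein(a, b) / longest if longest else 0.0
        total += 1.0 - distance
    return total / len(words_a)


# Placeholder classifications, not real Ethnologue entries.
print(lp1_score(("T1", "B1", "S1"), ("T1", "B2", "S9")))  # 0.25
```

Keeping the two functions separate mirrors the text: the first depends only on the classification tree, the second only on the word lists.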