Getting Wiktionary into PanLex

If we want to achieve the miracle of translation from any language into any other language, it would be enormously helpful to have a machine that can translate any word, or word-like phrase, from any language into any other language. The PanLex project aims to build exactly that machine. It is documenting all known lexical translations among all the world’s languages and dialects. The project draws mainly on published sources rather than eliciting translations directly from native speakers. An obvious place to turn in working toward this ambitious goal is Wiktionary, an online multilingual dictionary with content curated by thousands of users. Wiktionary contains millions of translations in thousands of languages, and in fact was one of the first sources mined for PanLex in 02006. However, this was done as a rough one-off procedure that could not take advantage of the regular growth of Wiktionary over time. Over the past several months, the PanLex team has been developing a better procedure for incorporating most of Wiktionary’s translations into the PanLex database. This has turned out to be an intricate process.

Wiktionary is in fact many resources, not just one. There are more than 150 editions of Wiktionary, each based on a particular language. Each edition contains entries mainly in that language; many entries include translations into other languages. For example, the English Wiktionary contains an entry for the verb go, whose primary sense “to move through space” is translated into German as either gehen (“to walk”) or fahren (“to go by vehicle”). The German Wiktionary contains separate entries for gehen and fahren, each of which is translated into English as go. Entries among different Wiktionaries must be manually linked, as there is no reliable automatic way to do this.

Several factors make it very difficult to treat different Wiktionaries as a single, uniform, computer-readable resource. Each Wiktionary has its own editorial standards for the structure of an entry, and editors do not always follow these standards consistently. Furthermore, the wiki markup in which entries are written is designed to be easy for editors to learn, not easy for computers to parse.

The DBnary project, created by Gilles Sérasset at the Université Joseph Fourier in Grenoble, is an effort to convert some of the largest Wiktionaries (currently 13 editions) into linked online data. This means that the data are computer-readable and made to conform to existing standards for lexical data, language codes, parts of speech, and so on. DBnary is a valuable contribution to making Wiktionaries tractable for PanLex, without which our task would have been much more difficult. However, much additional work has been necessary to make use of DBnary.

[Figure: DBnary translation map of “cat”]

One major challenge in interpreting DBnary for PanLex is language variety identification. DBnary uses three-letter codes, drawn from the ISO 639-3 standard that identifies more than 7,000 languages. PanLex uses codes from this and other ISO 639 standards, but additionally recognizes varieties of each language, which generally correspond to dialects or to different script standards for writing the language. Given a language code and a text string in that language, it is no simple matter to identify the PanLex variety code. Many cases can be resolved with a heuristic that detects the string’s Unicode script (e.g., Roman, Cyrillic, Arabic, Han) and then looks for a variety of the appropriate language which is written in that script. For about a hundred more difficult cases, we have had to create custom mappings and (in a small number of cases) custom code.
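To make the script heuristic concrete, here is a minimal Python sketch. It is not the PanLex team's actual code. The variety table, the override table, and the function names are all invented for illustration, and the script is only approximated from the first word of each character's Unicode name, since Python's standard library does not expose the Unicode script property directly.

```python
import unicodedata
from collections import Counter

# Hypothetical table: (language code, script) -> variety code.
# These assignments are made up for the example; they are not real PanLex data.
VARIETIES = {
    ("srp", "Cyrl"): "srp-000",
    ("srp", "Latn"): "srp-001",
    ("eng", "Latn"): "eng-000",
}

# Hypothetical custom mappings for languages where the heuristic fails.
OVERRIDES = {}

# Approximation: map the first word of a character's Unicode name to a script.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "ARABIC": "Arab",
    "CJK": "Hani",
}


def detect_script(text):
    """Return the most common known script among the letters of text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and spaces
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script is not None:
            counts[script] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def guess_variety(lang, text):
    """Return a variety code for (lang, text), or None if no rule applies."""
    if lang in OVERRIDES:
        return OVERRIDES[lang]
    return VARIETIES.get((lang, detect_script(text)))
```

With this table, `guess_variety("srp", "мачка")` and `guess_variety("srp", "mačka")` resolve to different varieties of the same language, while an unknown language or script yields `None` and would need one of the custom mappings described above.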

Another major challenge in making use of DBnary is lemmatization. PanLex records only the lemma of any given word or phrase, which generally corresponds to a dictionary headword, also known as a citation form. For example, most English nouns are recorded in the singular (table, not tables), and verbs are recorded in the infinitive (go, not goes or went). Wiktionaries generally record lemmas as their translations, but there is significant messiness in the data. We use a variety of heuristics to detect whether a string is likely to be lemmatic. For example, we remove most parenthesized material from strings, so that “divan (old-fashioned)” is converted to “divan”; the complete original string is preserved as a definition. Strings that contain certain characters, such as commas or semicolons, are likely to be lists of translations rather than single translations and are also converted to definitions.
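The two heuristics just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `classify` and the exact rules (strip every parenthesized group, treat any comma or semicolon as a sign of a list) are ours, and the real procedure removes only "most" parenthesized material and applies further checks not shown here.

```python
import re

# Parenthesized material together with any whitespace before it.
PARENS = re.compile(r"\s*\([^()]*\)")
# Characters that suggest a list of translations rather than a single one.
LIST_CHARS = re.compile(r"[,;]")


def classify(raw):
    """Split a raw translation string into (expression, definition).

    expression is the likely lemma, or None if the string does not look
    lemmatic; definition is the original string when it had to be altered
    or rejected, and None otherwise.
    """
    text = " ".join(raw.split())  # normalize whitespace
    stripped = " ".join(PARENS.sub("", text).split())
    if not stripped or LIST_CHARS.search(stripped):
        # Nothing left, or probably a list: keep the string only as a definition.
        return None, text
    if stripped != text:
        # Parenthesized material was removed: keep the original as a definition.
        return stripped, text
    return text, None
```

For example, `classify("divan (old-fashioned)")` gives the expression "divan" with the full original string as its definition, while `classify("sofa, couch")` gives no expression and keeps the whole string as a definition.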

We have written extensive custom code to convert all 13 available DBnary editions into a format that can be ingested into the PanLex database. The resulting files contain over 4 million translations. We are still in the process of perfecting the code and expect to have the ingestion completed in 02015. This will represent a substantial contribution to PanLex, which currently contains about 57 million translations. Once the new DBnary-provided Wiktionary data are ingested, we will retire the out-of-date PanLex Wiktionary sources. We will also be able to periodically update PanLex with the latest data from DBnary, thereby incorporating new crowd-sourced Wiktionary translations.

The PanLex project is always looking for skilled help in analyzing sources such as Wiktionary. Other sources, though typically much smaller, present similar challenges. We currently hope to hire a small number of source analysts to process our ever-growing backlog of sources. If this sort of work would interest you, please contact info@panlex.org.
