PanLex: Overcoming Language Barriers with the World’s Largest Lexical Translation Database

In an unassuming office on the fourth floor of Downtown Berkeley’s historic Chamber of Commerce high rise, three linguists are at work building the world’s largest lexical translation database. The mission of PanLex, a project of The Long Now Foundation, is to overcome language barriers to human rights, information, and opportunities. After ten years of pooling together different sources from across the world, PanLex’s database covers over 2,500 dictionaries, 5,700 languages, 25 million words, and 1.3 billion translations. Now, the PanLex team is ready to see what it can do. They’re targeting under-served language communities, international humanitarian organizations, and global businesses to explore what practical problems PanLex can address.

“Choose any language you can imagine,” Julie Anderson, the PanLex Director of Programs, instructs me as we power up the PanLex translator for a demo. “The most interesting language you can think of,” Ben Yang, the Director of Technology, adds.

Unlike the machine translation service Google Translate, which translates whole sentences and texts in up to a hundred major world languages (sometimes to comedic effect), PanLex is a database that is both panlingual (built to contain every language) and lexical (focused on words, not sentences).

Stumped by the possibilities, I opt for Classical Nahuatl, the language of the Aztec empire, modern forms of which are spoken today by an estimated 1.5 million people. I once read that the word “avocado” originated with Nahuatl, and that the same Nahuatl word for avocado also meant “testicle”—due, presumably, to the similarity in shape.

“How about avocado?” I ask.

The PanLex translator app in action, translating “avocado” into Classical Nahuatl.

Yang types “avocado” into the English field, selects Nahuatl, and we’re immediately presented with candidate words, each with a translation quality score; ahuacatl scores highest. Tapping on a word displays the paths from the English word through equivalent words in other languages that lead to the Nahuatl word. Translating ahuacatl back to English yields avocado, bollock, egg, and testicle, among others, with avocado having the highest quality score.

It’s a simple, intuitive interface, one that belies the implications for human rights embedded within. At the heart of the PanLex project is the conviction that with access to information and the ability to communicate comes the ability to exercise one’s rights.

“You might want to communicate in a different language just because you want to connect with someone,” David Kamholz, PanLex’s Project Director, tells me. “Say you see a person on the street and you want to talk to them and you don’t share a language. By not speaking the same language, you’ve lost the richness in life that comes from communicating with someone you might want to. But that’s not necessarily a human rights issue.”

“Human rights comes into play where you’re talking about a scenario where, say you’re sick and want to see a doctor, but there’s no doctor who speaks your language. Or you need a lawyer, or you want to vote, but you can’t understand what’s going on in the election. Or you just want to look up some information on Wikipedia, or understand something about a field you’re studying, and you’re trying to read it and you can’t. The rights outlined in documents like the Universal Declaration of Human Rights require the ability to communicate. If you can’t communicate with certain entities, be they your government, your doctor, your lawyer, or your teacher in school, you can’t exercise your rights.”

The PanDictionary was the conceptual forerunner to PanLex.

This focus on breaking down barriers to human rights did not mark the PanLex project from its inception. At first, it wasn’t even known whether building such a database was possible. In 02004, a group of researchers at the University of Washington’s Turing Center set out to answer a question:

Can we automatically compose a large set of Wiktionaries and translation dictionaries to yield a massive, multilingual dictionary whose coverage is substantially greater than that of any of its constituent dictionaries?

The result of their research, PanDictionary, demonstrated that it was possible to significantly improve the quality of inferred translations using a novel algorithm that pooled together several multilingual dictionaries and placed them in an interoperable format.

Say you want to translate a word from Basque, a language unrelated to any other living language, into Zulu, the most widely spoken home language in South Africa. You can go from Basque to English, and from English to Zulu, but what is the probability that the Zulu word is an accurate translation of the Basque word? The English word might not preserve the meaning completely, giving rise to what’s called a transitive inference problem. But if you have independent confirmation from enough intermediate languages, such as French, Russian, and Hindi, multiple paths converge on the same Zulu word, which corrects for the ambiguities and yields a reliable translation.
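The idea of converging paths can be made concrete with a small sketch. This is not PanLex’s or PanDictionary’s actual algorithm, only a toy illustration under stated assumptions: translations are treated as links between (language, word) pairs, and a candidate is scored by how many distinct intermediate languages lead to it. The Basque and Zulu entries are placeholder labels rather than real words, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy dictionary entries: each pair says some source lists the two words as translations.
# "eu-season", "zu-season" and "zu-coil" are placeholder labels, not real Basque or Zulu words.
EDGES = [
    (("eus", "eu-season"), ("eng", "spring")),
    (("eus", "eu-season"), ("fra", "printemps")),
    (("eus", "eu-season"), ("rus", "весна")),
    (("eng", "spring"), ("zul", "zu-season")),
    (("eng", "spring"), ("zul", "zu-coil")),  # ambiguous English word leaks a wrong sense
    (("fra", "printemps"), ("zul", "zu-season")),
    (("rus", "весна"), ("zul", "zu-season")),
]


def build_graph(edges):
    """Build an undirected graph of (language, word) nodes."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def pivot_support(graph, source, target_lang):
    """Map each candidate target word to the set of pivot languages that reach it in two hops."""
    support = defaultdict(set)
    for pivot in graph[source]:
        pivot_lang = pivot[0]
        if pivot_lang in (source[0], target_lang):
            continue  # only count genuine intermediate languages
        for candidate in graph[pivot]:
            if candidate[0] == target_lang:
                support[candidate[1]].add(pivot_lang)
    return dict(support)


def rank(support):
    """Order candidates by number of independent pivot languages, most first."""
    return sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))


graph = build_graph(EDGES)
support = pivot_support(graph, ("eus", "eu-season"), "zul")
ranked = rank(support)
for word, pivots in ranked:
    print(word, len(pivots), sorted(pivots))
```

In this toy data the candidate reached through English, French, and Russian outranks the one reached through English alone, which is the intuition behind using many intermediate languages. A real system would also need to weigh source quality and handle paths longer than two hops.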

Jonathan Pool, a political scientist who helped advise the research project, wanted to go a step further than a proof of concept demonstrating that such a database was possible. He wanted to build it.

Pool was struck at an early age by the degree to which linguistic knowledge influenced the universe of opportunities for people. As a member of the Peace Corps teaching English in Turkey in the 1960s, he observed that it was knowledge of languages, rather than professional skills, that more often than not determined who got hired for jobs. Thus began a career at the intersection of academia, language politics and policy, where Pool’s research focused on individual and collective choices about language, linguistic diversity and the consequences of linguistic discrimination.

In the PanDictionary project, Pool glimpsed the practical implications that a massive lexical database could provide. The vision of PanLex—a database enabling anyone in the world, regardless of their language, to communicate and exercise their rights—was born.

For the next six years, Pool dedicated himself to building out the database, singlehandedly doing the programming, improving its structure, and scouring the Internet for every possible linguistic source he could find. He was also independently funding the venture.

The PanLex team. From left: Project Director David Kamholz, Director of Technology Ben Yang, and Director of Programs Julie Anderson. Photo by Carolyn Wachnicki.

As the database grew larger, Pool expanded his one-man operation, bringing in linguist and self-taught programmer David Kamholz in 02013. Linguists Julie Anderson and Ben Yang joined in 02015. Anderson acquired new data to be ingested into the database, which Yang analyzed and integrated, along with building new tools for it.

“It’s really fun getting my hands on all these dictionaries of languages from all over the world, especially the under-served languages,” Anderson says. “To me, this is brain candy.”

PanLex soon caught the attention of Laura Welcher, project director for The Long Now Foundation’s Rosetta Project. The Rosetta Project began in 02000 as Long Now’s first exploration into long-term archiving, with the goal of building a publicly accessible digital library of human languages. Rosetta had been collecting parallel vocabulary lists early on as a targeted collection effort. As part of Rosetta’s sharing efforts with other linguistics projects, many of these lists made their way to the PanLex project, where the PanLex team incorporated them into their database, linked that data to other language data, and cleaned up and normalized the data. Rosetta and PanLex agreed that they were complementary projects and should work closely together. PanLex became a sponsored nonprofit project of the Long Now Foundation in 02012.

“I think of Rosetta and PanLex as sister projects,” Welcher says. “They are functionally separate projects with separate staff, but with similar and complementary goals. Rosetta also focuses on explorations in very long-term archiving media which PanLex doesn’t specifically do, although they are participating in the larger data collection effort and PanLex lexical data currently makes up about half the language data on the Rosetta Wearable Disk.”

Pool stepped back from day-to-day operations in 02017, and Kamholz took over as Project Director. Now that the database is sufficiently robust (“We have the largest collection of lexical data in the world,” notes Yang), Kamholz is leading PanLex through its next phase. A part of that next phase entails more clearly elucidating PanLex’s value proposition. Another part means finding sustainable ways to generate revenue.

PanLex’s data is freely available and no permission is needed for noncommercial use. Photo by Carolyn Wachnicki

“Earlier this year, we started the process of asking: Who are we, what are we really trying to do?” Kamholz says. “What is the world we want and what is our vision of where PanLex fits into that? We’ve always said that we want to help these under-served communities and partner with global humanitarian organizations, but what exactly do we want to do for them? I wouldn’t say we necessarily changed our mission so far as make it more explicit and concrete.”

“Before, our mission was to translate every word into every language, with a vision of universal communication,” Anderson says. “Now…”

“I wouldn’t say that’s not what we’re trying to do at this point,” Kamholz interjects. “But we’re also trying to do things that in the relatively short term can immediately help people.”

At the moment, PanLex is looking to partner with international organizations both large and small, from the Red Cross, the World Bank, and Oxfam to Translators Without Borders.

“If, for example, there are NGOs that deal with disaster preparedness,” Anderson says, “we can provide them with dictionaries of languages with disaster and medical terminology tailored to their specific needs and specific regions.”

PanLex is also looking to partner with global businesses. “There are many businesses that are trying to expand into markets around the world,” Kamholz says. “And they’re getting to the point where the major world languages are not enough for them to reach everyone, and we would have the ability to help them reach more people.”

Katrina Esau, one of the last remaining speakers of a Khoisan language that was thought extinct nearly 40 years ago, teaches her native tongue to a group of school children in Upington, South Africa on 21 September 2015. Photo by Mujahid Safodien/AFP/Getty

PanLex’s vision of overcoming language barriers to human rights is inspiring, to be sure. But there are some who contend that the preservation of a diversity of languages could actually make it more challenging for communities to communicate. In an increasingly globalized and interconnected world, wouldn’t an easier solution to the problem be to have everyone learn the same language, like Mandarin or English? As philosopher Rebecca Roache recently put it:

The advantages to adopting a single language are clear. It would enable us to travel anywhere in the world, confident that we could communicate with the people we met. We would save money on translation and interpretation. Scientific advances and other news could be shared faster and more thoroughly. By preserving a diversity of languages, we preserve the obstacles to communication. Wouldn’t it be better to allow as many languages as possible to die out, leaving us with just one universal lingua franca?

“There are two ways to answer that,” Kamholz says. “One is, well, what about the people who don’t speak those languages yet, what are they supposed to do now? Do we say to them: You won’t have human rights until fifty to a hundred years from now and then you’ll speak English or Mandarin? Those people exist now and still need their rights.”

Endangered Languages in Australia, Indonesia and Papua New Guinea. Via Endangered Languages

“But I would go even further and say, we don’t want a world where the only possible future, and the only way to exercise your rights, is to speak English or Mandarin,” Kamholz continued. “We want a diverse world with many points of view, with different cultural traditions. We don’t want everyone to be the same in that sense, and we don’t want that to be the only solution. We’re enabling people to access information and exercise their rights, but it’s also driven by this desire for diversity and pluralism. We want to make it easier and more possible for people who are in these under-served language communities to access the information they need, and empower them to make their own decisions regarding the preservation of their cultures, their traditions, their languages. There are a lot of people in the world who want to do that, but it feels like such a lopsided struggle of us against the world. It seems impossible. But we believe PanLex helps make it easier for people to maintain things they want to maintain. This is just one small piece of the many things that need to happen to make that a reality. I’m not under the illusion that we can do it singlehandedly. I just want us to contribute to the process and hopefully inspire others along the way.”

To learn more about PanLex, go to panlex.org or email info@panlex.org.