Smarter dictionaries for reading

Neahttadigisánit gives "normal people" access to computational linguistic tools to make reading easier. Have you ever had no clue what word form you're looking at, or how to read a text with "special characters" left out?

Let’s say you’re in a country you’ve never been in before, and all you have to get you around is your guidebook and a dictionary. While trying to get around, you find you have to look a word up here and there. Let’s say the word you see is gažaldahkii, but looking through the dictionary you don’t find anything. You see words in the area, but since you don’t know anything about the language, nothing can really tell you that gažaldat is what you’re looking for.

Another situation: you’re on your mobile phone, and you see a word geaŧŧu, but the keyboard on your phone won’t support typing any of these characters. How do you find the word without typing it?

These are common problems in many of the world’s languages, and they have motivated all sorts of tools for machines to handle text: tools that take a word and return a “dictionary form” or lemma, and heck, even spellcheckers, which can be repurposed to handle the second scenario. But these tools frequently aren’t combined with dictionaries to allow for additional use cases that are a little more human.

It should be obvious that this matters, but why does it? Northern Saami, the language gažaldahkii and geaŧŧu come from, is a good example: lemmas are exceedingly rare in running text in Northern Saami. In one North Saami evaluation (Antonsen et al., 2009), a text of 252,461 words was run against a dictionary of 99,071 lemmas. With that many lemmas, you’d expect good coverage, right? No, only 7.9% of the words returned results. This is exceptionally bad, and it means more or less that if you want to look up a word in a dictionary for this language, you’d better know how the words work.

The statistics get better for other languages: the same study evaluated a Finnish text and a Norwegian text, and found (unsurprisingly) that Norwegian fared best, while Finnish was only a step up from North Saami (but with a perhaps humorous twist, where there are more lemmas in the dictionary and fewer wordforms in the text). The results are listed below:

                        North Saami    Finnish    Norwegian
Words in text               252,461     45,155       64,944
Lemmas in dictionary         99,071     94,111       38,983
Success rate                   7.9%      10.0%        30.5%
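The success rate in the table is the share of running words that match a dictionary headword exactly. Here is a minimal sketch of that kind of measurement. The whitespace tokeniser, the toy word list and the helper name `naive_coverage` are my assumptions for illustration, not the procedure used in the study:

```python
def naive_coverage(text, lemmas):
    """Return the fraction of tokens in `text` that are exact dictionary headwords."""
    # Assumption: split on whitespace and strip common punctuation.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lemmas)
    return hits / len(tokens)


# Toy data for illustration only: two headwords, three running words.
lemmas = {"beana", "gažaldat"}
text = "beatnaga gažaldahkii beana"
print(f"{naive_coverage(text, lemmas):.1%}")  # 33.3%
```

Only the word that happens to be in its dictionary form is found; the two inflected forms miss, which is exactly what drags the real numbers down.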

The authors of this particular paper saw this as a reason to go one step further, so they created dictionaries that included additional wordforms, which is a large step up from not including wordforms at all. However, even this isn’t enough: wordform lists can’t handle dynamic compounding (a process whereby all sorts of words are combined into one, and the result may not be frequent enough to earn a dictionary entry), because the number of wordforms you’d have to pregenerate would be immense.

But what to do? Northern Saami already has great tools for analysing words and even sentences (with ridiculously amazing coverage rates), it has a spellchecker, and all these systems are extensible, flexible and already in active use in learning tools. Crucially, these tools are open-source and free (open your lexica, people!). All that’s left, really, is to bring them together.

Neahttadigisánit

Over the last year, I wrote a dictionary application in Python, using Flask, to bring these resources together. The general set of steps is:

1.) User enters a word
2.) Morphological analyser looks it up, and returns the possible base forms (lemmas)
3.) Lemmas are looked up in the dictionary
4.) User sees beana when they were searching for beatnaga
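The steps above can be sketched in a few lines. This is not the application's actual code: the real analyser is a finite-state transducer, while here a plain lookup table stands in for it, and the names `TOY_ANALYSES`, `TOY_DICTIONARY` and `lookup` are hypothetical. The English glosses are added for illustration.

```python
# Toy stand-in for the morphological analyser: wordform -> possible lemmas.
# Assumption: a real analyser computes this instead of listing it.
TOY_ANALYSES = {
    "beatnaga": ["beana"],
    "gažaldahkii": ["gažaldat"],
}

# Toy dictionary: lemma -> translation (glosses are illustrative).
TOY_DICTIONARY = {
    "beana": "dog",
    "gažaldat": "question",
}


def lookup(wordform, analyses=TOY_ANALYSES, dictionary=TOY_DICTIONARY):
    """Return (lemma, translation) pairs for whatever the user typed."""
    # The input may already be a lemma, so try it first, then the analyses.
    candidates = [wordform] + analyses.get(wordform, [])
    seen = set()
    entries = []
    for lemma in candidates:
        if lemma in dictionary and lemma not in seen:
            seen.add(lemma)
            entries.append((lemma, dictionary[lemma]))
    return entries


print(lookup("beatnaga"))  # [('beana', 'dog')]
```

The point of the design is that the dictionary itself stays a plain list of lemmas; all the knowledge about inflection lives in the analyser.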

Some additional features were necessary, however, and some of these were built into the morphological analyser by other members of the team:

  • Compound words are split up, so they may be searched for in the lexicon as a whole or part by part. If there is nothing in the dictionary for the whole word, then the individual parts are looked up.

  • Word derivations are split up in the same way: if there is no lexicalized form in the dictionary, then the user sees the stem. Thus: gaskabeaivvit ‘dinner’ still returns dinner, even though this word is a plural form of another (gaskabeaivi ‘noon’).

  • A separate analyser is available so users can select for what we call “social media Sámi”: accepting a d t s n z c as possible input for á đ ŧ š ŋ ž č, but also offering a wider range of additional dialectal and spelling variants.

  • For the dictionaries we have with Norwegian as the source language, we also merged Nynorsk and Bokmål in the input, so if you enter a Nynorsk word, you still end up with a definition. One of the reasons here is that users may not be aware that a word is a Nynorsk word, but are only aware that the text is Norwegian. Maybe they’re a speaker of a language that is spoken in Norway, but live in Sweden or Finland.
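The compound fallback from the first bullet can be sketched like this. It assumes the analyser has already split the word and handed back the parts as lemmas; the split of gaskabeaivi into gaska + beaivi, the glosses for the parts, and the helper name `lookup_compound` are all illustrative assumptions rather than the application's own code.

```python
def lookup_compound(parts, dictionary):
    """Look up the whole compound first; fall back to its parts if it is missing."""
    # Assumption: joining the lemma parts gives the compound's headword.
    whole = "".join(parts)
    if whole in dictionary:
        return [(whole, dictionary[whole])]
    # No entry for the whole word: show whichever parts the dictionary knows.
    return [(part, dictionary[part]) for part in parts if part in dictionary]


# Two toy dictionaries: one with an entry for the compound, one without.
with_entry = {"gaskabeaivi": "noon", "gaska": "middle", "beaivi": "day"}
without_entry = {"gaska": "middle", "beaivi": "day"}

print(lookup_compound(["gaska", "beaivi"], with_entry))
# [('gaskabeaivi', 'noon')]
print(lookup_compound(["gaska", "beaivi"], without_entry))
# [('gaska', 'middle'), ('beaivi', 'day')]
```

Preferring the whole-word entry matters because a lexicalized compound often means something its parts don't spell out.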

The end product, Neahttadigisánit, is available online, and has since seen quite a lot of use. We also store lookups and use them occasionally to add new lexical entries and morphological data. In addition to dictionaries for Northern Saami, we’ve also released another for South Saami and Norwegian, and are currently developing dictionaries for other minority Uralic languages (Kven, Olonets Karelian, Nenets, Komi, Livonian, and I’m probably forgetting a few), with translations to Russian, Finnish, Latvian and Estonian.

We also wanted these to work on mobile, so we used Twitter Bootstrap to produce a responsive design that fits a variety of devices with little work. That meant more time for feature implementation and less time spent on all sorts of browser interoperability issues.

One last thing that we decided to do was to create an API so we could interact with these dictionaries via a JavaScript plugin that’s built into a learning system, and available via bookmarklet. This enables people to read texts without having to go back and forth between what they’re reading and a dictionary: entries just pop up when you click a word.

The takeaways, however, are many. This is extensible to other languages (although we haven’t gotten any right-to-left languages in yet, as the initial languages are Uralic), and it wasn’t all that hard to do. It enables learners and speakers of all levels to interact. The social media feature also makes it easier for non-native speakers to read a language as though they were native speakers: the system can make all the substitutions to turn cakca into čakča, like native speakers do on the fly.
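To give a feel for the cakca to čakča substitution, here is a brute-force sketch. In Neahttadigisánit this work is done by a separate analyser; the code below swaps in a much simpler technique (generate every possible spelling, keep the ones found in a word list) purely for illustration. The letter pairs are the ones listed earlier in the post, while `spelling_candidates`, `restore` and the one-word list are hypothetical.

```python
from itertools import product

# Plain letter -> Sámi letter it may stand in for (lowercase only in this sketch).
SUBSTITUTIONS = {"a": "á", "d": "đ", "t": "ŧ", "s": "š", "n": "ŋ", "z": "ž", "c": "č"}


def spelling_candidates(word):
    """Yield every spelling obtained by optionally restoring each letter."""
    options = [
        (ch, SUBSTITUTIONS[ch]) if ch in SUBSTITUTIONS else (ch,)
        for ch in word
    ]
    for combo in product(*options):
        yield "".join(combo)


def restore(word, known_words):
    """Return the candidate spellings that appear in `known_words`."""
    return [c for c in spelling_candidates(word) if c in known_words]


known = {"čakča"}  # toy word list standing in for the analyser's lexicon
print(restore("cakca", known))  # ['čakča']
```

The number of candidates doubles with every substitutable letter (16 for cakca), so this only works for short words; doing it inside the analyser avoids generating the candidates at all.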

The other major point is that you don’t need an amazing analyzer or an amazing lexicon to reap the benefits of this kind of system for a language. For instance, if you have a language where nouns have 9 forms, adding even one noun to these systems makes all nine of its forms findable, not just the one printed in the dictionary.

Reception

Neahttadigisánit made it into NoDaLida 2013, where I gave a talk along with Trond Trosterud, and since launch it has been seeing some amazing Google Analytics statistics. It seems like a tool like this was something people just needed. Maybe you’d like to try it for your languages?

References

  • Johnson, R., Antonsen, L., and Trosterud, T. 2013. Using Finite State Transducers for Making Efficient Reading Comprehension Dictionaries. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013).

  • Antonsen, L., Trosterud, T., Gerstenberger, C.-V., and Moshagen, S. N. 2009. Ei intelligent ordbok for samisk. LexicoNordica, 16:271–283.